### Modeling: Predicting Surrounding Demographics from Yelp 

In this section, we develop regression models that predict median rent, home value, and income from the surrounding businesses listed on Yelp.

Given a random GPS coordinate within Seattle, we establish a 0.5-mile and a 1-mile radius around the point and calculate the proportion of businesses in each price tier (also referred to as dollar tier) within each radius. Each GPS coordinate is also placed within a neighborhood and assigned that neighborhood's median rent, income, and home value. The price tier proportions are used as features in the regression models, and each model's target is one of the demographic metrics.

##### Imports

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

from geopy import distance as d

from CRA import * 
c = CRA()

In [10]:
yelp = pd.read_csv("../datasets/seattle_restaurants.csv")

seattle = pd.read_csv("../datasets/seattle_demographics.csv")

##### Formatting

Narrow each dataset down to the desired features. For the Yelp dataset, restaurant categories are not needed, and neither is the original index ("Unnamed: 0"). For the Seattle demographics dataset, the desired features are GEN_ALIAS (neighborhood name), MEDIAN_GROSS_RENT (median rent for the neighborhood), HU_VALUE_MEDIAN_DOLLARS (median home value), and MEDIAN_HH_INC_PAST_12MO_DOLLAR (median annual income).

In [11]:
# drop unwanted columns from Yelp dataset

yelp.drop(columns = ["Unnamed: 0", "categories"], inplace = True)

In [13]:
# specify columns to keep from Seattle demographics dataset

keeps = ["GEN_ALIAS", "MEDIAN_GROSS_RENT", "HU_VALUE_MEDIAN_DOLLARS", "MEDIAN_HH_INC_PAST_12MO_DOLLAR"]

seattle_rent = seattle.loc[:, keeps].copy()
seattle_rent.rename(columns = {"GEN_ALIAS": "neighborhood", "MEDIAN_GROSS_RENT": "median_rent", 
                               "HU_VALUE_MEDIAN_DOLLARS": "median_home_value", 
                               "MEDIAN_HH_INC_PAST_12MO_DOLLAR": "median_income"}, inplace = True)

In [14]:
# check for neighborhoods that do not match between the two datasets

y = yelp["cra"].unique()
s = seattle_rent["neighborhood"].unique()

for x in y:
    if x not in s:
        print(x)

Not Found

In [15]:
# remove Yelp businesses that do not have an assigned neighborhood

yelp = yelp[yelp["cra"] != "Not Found"].copy()

# drop=True discards the old index instead of keeping it as an extra column
yelp.reset_index(drop = True, inplace = True)

In [131]:
# add columns in seattle df for yelp metrics 

# proportions are floats, so initialise the new columns with 0.0
for i in range(1, 5): 
    seattle_rent[f"{i} dollar"] = 0.0

# define function for populating df with proportion of each tier of dollar sign rating 
def dollar_rating(seattle_df, yelp_df, neighborhood, dollar_tier): 
    # businesses located in this neighborhood
    yelp_hood = yelp_df[yelp_df["cra"] == neighborhood]
    num_businesses = yelp_hood.shape[0]
    
    if num_businesses == 0: 
        # no businesses in this neighborhood: avoid dividing by zero
        proportion = 0
    else: 
        proportion = (yelp_hood[yelp_hood["price"] == dollar_tier].shape[0])/num_businesses
    
    # write the proportion into the matching row(s) of the demographics df
    hood_mask = seattle_df["neighborhood"] == neighborhood
    
    seattle_df.loc[hood_mask, f"{dollar_tier} dollar"] = proportion
    

In [132]:
# iterate over df

hoods = list(seattle_rent["neighborhood"].unique())

for hood in hoods: 
    for i in range(1, 5): 
        dollar_rating(seattle_rent, yelp, hood, i)
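
The per-neighborhood proportions above can also be computed without an explicit loop. The following is a minimal sketch, assuming `price` holds integer tiers 1 to 4 and `cra` holds the neighborhood name; `toy_yelp` and `tier_props` are hypothetical names, and the toy values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the Yelp data (hypothetical values, not the real dataset)
toy_yelp = pd.DataFrame({
    "cra": ["A", "A", "A", "B", "B"],
    "price": [1, 2, 2, 3, 3],
})

# Row-normalised cross-tabulation: each row sums to 1 across the price tiers
tier_props = pd.crosstab(toy_yelp["cra"], toy_yelp["price"], normalize="index")

# Ensure all four tiers exist as columns, then rename to the "{i} dollar" pattern
tier_props = tier_props.reindex(columns=[1, 2, 3, 4], fill_value=0.0)
tier_props.columns = [f"{i} dollar" for i in tier_props.columns]

print(tier_props)
```

Neighborhoods with no Yelp businesses do not appear in the cross-tabulation, so a left merge onto the demographics table followed by filling missing values with 0 would be needed to reproduce the default of 0 used above.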

##### Generating Dataset for Regression Model
Generate random GPS coordinates within Seattle, drop points that do not fall within a neighborhood, then calculate Yelp metrics for each remaining coordinate. The coordinates in the Yelp dataset also required some formatting.

In [None]:
# create separate columns for latitude and longitude in Yelp dataframe 

filler = [0 for x in range(yelp.shape[0])]

yelp["latitude"] = filler 
yelp["longitude"] = filler

In [None]:
# define function for separating coordinates and populating appropriate columns with values

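def separate_coordinates(yelp_df): 
    for coordinate in yelp_df["coordinates"]: 
        # all rows sharing this coordinate string (use the argument, not the global yelp)
        index = yelp_df.index[yelp_df["coordinates"] == coordinate]
        
        # strip the brackets and whitespace around the two comma-separated values
        split = coordinate.split(",")
        lat = split[0].replace("[", "")
        long = split[1].replace(" ", "").replace("]", "")
        yelp_df.loc[index, "latitude"] = lat
        yelp_df.loc[index, "longitude"] = long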

In [None]:
# call function 

separate_coordinates(yelp)

# cast populated columns as floats

yelp["latitude"] = yelp["latitude"].astype("float")

yelp["longitude"] = yelp["longitude"].astype("float")
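
As an alternative to the row-by-row function above, the split can be done with pandas string methods. This is a minimal sketch, assuming each `coordinates` entry is a string of the form `"[lat, long]"` (which is what the parsing above implies); `toy` and `parts` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the "coordinates" column (assumed string form "[lat, long]")
toy = pd.DataFrame({
    "coordinates": ["[47.679981, -122.290608]", "[47.618432, -122.322973]"]
})

# Strip the surrounding brackets, then split on the comma into two columns
parts = toy["coordinates"].str.strip("[]").str.split(",", expand=True)

# Remove leftover whitespace and cast each part to float
toy["latitude"] = parts[0].str.strip().astype(float)
toy["longitude"] = parts[1].str.strip().astype(float)

print(toy[["latitude", "longitude"]])
```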

In [None]:
# generate gps locations 

gps_dict = {"latitude": [], "longitude": [], "neighborhood": []}

for i in range(1300): 
    lat = round(random.uniform(yelp["latitude"].min(), yelp["latitude"].max()), 6)
    gps_dict["latitude"].append(lat)
    
    long = round(random.uniform(yelp["longitude"].min(), yelp["longitude"].max()), 6)
    gps_dict["longitude"].append(long)
    
    n = c.to_cra([long, lat])
    gps_dict["neighborhood"].append(n)

In [None]:
# create dataframe of generated gps coord's

gps_df = pd.DataFrame(gps_dict)

# scrub values that do not fall within a neighborhood
gps_df = gps_df[gps_df["neighborhood"] != "Not Found"].copy()

gps_df.shape

In [None]:
# create columns for dollar tier proportions within a 0.5- and 1-mile radius of point 
# (initialised with 0.0 because the proportions are floats)
for i in range(1, 5): 
    gps_df[f"0.5mi {i} dollar"] = 0.0
    gps_df[f"1.0mi {i} dollar"] = 0.0

In [None]:
# define function for determining dollar tier proportion within a certain radius 

# a single call ends up taking about 2 seconds to run

def radius_dollar_proportion(location, df, radius, dollar_tier):
    # positional indices of the businesses within `radius` miles of `location`
    indices = []
    
    for i, (lat, long) in enumerate(zip(df["latitude"], df["longitude"])): 
        if d.distance(location, (lat, long)).miles <= radius: 
            indices.append(i)
    
    if len(indices) == 0: 
        # no businesses within the radius: avoid dividing by zero
        proportion = 0
    else: 
        surrounding_businesses = df.iloc[indices]
        total_businesses = surrounding_businesses.shape[0]
        proportion = (surrounding_businesses[surrounding_businesses["price"] == dollar_tier].shape[0]/total_businesses)
    
    return proportion
    

In [20]:
# test gps values
house = (47.679981, -122.290608)
bread = (47.679656, -122.290546)
house2 = (47.618432, -122.322973)


In [148]:
# test field: single data point
radius_dollar_proportion(house2, yelp, 0.5, 2)

In [164]:
# iterate over entire dataframe, populate with yelp metrics for each coordinate

# this will take almost 5 hours to run over the whole dataframe

# gps_df was filtered without resetting its index, so iterate over the actual index labels
for i in gps_df.index: 
    location = (float(gps_df.loc[i, "latitude"]), float(gps_df.loc[i, "longitude"]))
    for dollar in range(1, 5): 
        for radius in [0.5, 1.0]: 
            gps_df.loc[i, f"{radius}mi {dollar} dollar"] = radius_dollar_proportion(location, yelp, radius, dollar)
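
Most of the runtime above comes from computing one distance at a time. One possible way to cut it down is to compute all distances from a point in a single NumPy operation. The sketch below swaps in the haversine (spherical) formula rather than `geopy`'s distance calculation, so results for businesses very close to the radius boundary may differ slightly. The names `haversine_miles` and `tier_proportions`, the Earth radius constant, and the toy arrays are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (spherical approximation)

def haversine_miles(lat, lon, lats, lons):
    """Great-circle distance in miles from one point to arrays of points."""
    lat, lon = np.radians(lat), np.radians(lon)
    lats, lons = np.radians(lats), np.radians(lons)
    a = (np.sin((lats - lat) / 2) ** 2
         + np.cos(lat) * np.cos(lats) * np.sin((lons - lon) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

def tier_proportions(location, lats, lons, prices, radius):
    """Proportion of businesses in each tier (1 to 4) within `radius` miles."""
    dist = haversine_miles(location[0], location[1], lats, lons)
    nearby = prices[dist <= radius]
    if nearby.size == 0:
        # no businesses within the radius: all proportions default to 0
        return {tier: 0.0 for tier in range(1, 5)}
    return {tier: float(np.mean(nearby == tier)) for tier in range(1, 5)}

# Toy business locations and price tiers (hypothetical values)
lats = np.array([47.6185, 47.6190, 47.7000])
lons = np.array([-122.3230, -122.3225, -122.3000])
prices = np.array([1, 2, 2])

props = tier_proportions((47.618432, -122.322973), lats, lons, prices, 0.5)
print(props)
```

With this approach, all four tiers for one radius come from a single distance computation per point, instead of recomputing every distance for each tier and radius combination.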

##### Modeling
Regression modeling with the Yelp metrics as features. The goal is to predict the median income, home value, and rent of the area surrounding a GPS coordinate.
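
A minimal sketch of how the three imported regressors might be fit and compared is shown below. It runs on synthetic data rather than the real `gps_df`: the eight feature columns stand in for the 0.5-mile and 1-mile tier proportions, and the target stands in for one demographic metric such as median rent. The coefficients, noise level, train/test split, and `n_neighbors` value are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features: two groups of four proportions, each group summing to 1
X = np.hstack([rng.dirichlet(np.ones(4), size=200),
               rng.dirichlet(np.ones(4), size=200)])

# Synthetic target (hypothetical linear relationship plus noise)
coefs = np.array([-200, 0, 300, 800, -100, 0, 150, 400])
y = 1000 + X @ coefs + rng.normal(0, 50, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

# Fit each model and record its R^2 score on the held-out data
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```

Because the four tier proportions for a given radius sum to 1 whenever any businesses are nearby, the features are collinear; dropping one tier per radius is one option to consider when interpreting linear model coefficients.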