### Dataset Generation: GPS Coordinates, Yelp Metrics, Seattle Neighborhood Demographics

In this section, we generate the data for regression models that predict area affluence from the Yelp metrics of the area surrounding a GPS coordinate. The GPS coordinates are randomly generated and kept only if they fall within a Seattle neighborhood, while the Yelp metrics and demographic information are pulled from sourced datasets (Yelp and Seattle Demographics).

For each random GPS coordinate within Seattle, we draw a 0.5-mile and a 1-mile radius around the point and calculate the proportion of businesses in each price tier (also referred to as dollar tier) within each radius. Each coordinate is also placed within a neighborhood and assigned that neighborhood's median rent, income, and home value.

##### Imports

In [1]:
import pandas as pd
import numpy as np
import random

from geopy import distance as d

# project-local helper used below to map coordinates to neighborhoods
from CRA import CRA
c = CRA()

In [2]:
yelp = pd.read_csv("../datasets/seattle_restaurants.csv")

seattle = pd.read_csv("../datasets/seattle_demographics.csv")

##### Formatting

Narrow each sourced dataset down to the desired features. For the Yelp dataset, the restaurant categories are not needed, and neither is the original index ("Unnamed: 0"). For the Seattle demographics dataset, the desired features are GEN_ALIAS (neighborhood name), MEDIAN_GROSS_RENT (median rent for the neighborhood), HU_VALUE_MEDIAN_DOLLARS (median home value), and MEDIAN_HH_INC_PAST_12MO_DOLLAR (median annual household income).

In [3]:
# drop unwanted columns from Yelp dataset

yelp.drop(columns = ["Unnamed: 0", "categories"], inplace = True)

In [4]:
# specify columns to keep from Seattle demographics dataset

keeps = ["GEN_ALIAS", "MEDIAN_GROSS_RENT", "HU_VALUE_MEDIAN_DOLLARS", "MEDIAN_HH_INC_PAST_12MO_DOLLAR"]

seattle_rent = seattle.loc[:, keeps].copy()
seattle_rent.rename(columns = {"GEN_ALIAS": "neighborhood", "MEDIAN_GROSS_RENT": "median_rent", 
                               "HU_VALUE_MEDIAN_DOLLARS": "median_home_value", 
                               "MEDIAN_HH_INC_PAST_12MO_DOLLAR": "median_income"}, inplace = True)

In [5]:
# check for neighborhoods that do not match between the two datasets

y = yelp["cra"].unique()
s = seattle_rent["neighborhood"].unique()

# print each Yelp neighborhood label that is missing from the demographics data
for x in y: 
    if x not in s: 
        print(x)

Not Found
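
The same check can also be written as a set difference, which returns the unmatched labels as a value that can be inspected or asserted on. A minimal sketch on made-up neighborhood labels (the two lists are illustrative stand-ins for `y` and `s`):

```python
# illustrative stand-ins for yelp["cra"].unique() and
# seattle_rent["neighborhood"].unique()
yelp_hoods = ["Ballard", "Interbay", "Not Found"]
seattle_hoods = ["Ballard", "Interbay", "Montlake/Portage Bay"]

# labels that appear in the Yelp data but not in the demographics data
unmatched = sorted(set(yelp_hoods) - set(seattle_hoods))
print(unmatched)
```

An empty result would mean every Yelp neighborhood label has a counterpart in the demographics data.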

In [6]:
# remove Yelp businesses that do not have an assigned neighborhood

yelp = yelp[yelp["cra"] != "Not Found"].copy()

yelp.reset_index(inplace = True)

In [7]:
# add columns in seattle df for yelp metrics 
# (initialized as floats, since they will hold proportions)

for i in range(1, 5): 
    seattle_rent[f"{i} dollar"] = 0.0

# define function for populating df with proportion of each tier of dollar sign rating 
def dollar_rating(seattle_df, yelp_df, neighborhood, dollar_tier): 
    # businesses located in this neighborhood
    yelp_hood = yelp_df[yelp_df["cra"] == neighborhood]
    num_businesses = yelp_hood.shape[0]
    
    # neighborhoods without any businesses get a proportion of 0
    if num_businesses == 0: 
        proportion = 0
    else: 
        proportion = yelp_hood[yelp_hood["price"] == dollar_tier].shape[0] / num_businesses
    
    hood_index = seattle_df.index[seattle_df["neighborhood"] == neighborhood].tolist()
    
    seattle_df.loc[hood_index, f"{dollar_tier} dollar"] = proportion
    

In [8]:
# iterate over df

hoods = list(seattle_rent["neighborhood"].unique())

for hood in hoods: 
    for i in range(1, 5): 
        dollar_rating(seattle_rent, yelp, hood, i)
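
The per-neighborhood proportions can also be computed in one pass with `pd.crosstab`. The sketch below uses a made-up stand-in for the Yelp dataframe and assumes `price` holds integer tiers 1 to 4, as the comparisons above imply. It is an alternative formulation, not the code that produced the results in this notebook.

```python
import pandas as pd

# made-up stand-in for the Yelp dataframe
toy_yelp = pd.DataFrame({
    "cra": ["Ballard", "Ballard", "Ballard", "Ballard", "Interbay", "Interbay"],
    "price": [1, 2, 2, 2, 1, 2],
})

# rows: neighborhoods, columns: price tiers, values: share of businesses per row
proportions = pd.crosstab(toy_yelp["cra"], toy_yelp["price"], normalize="index")

# make sure all four tiers are present, then match the "<tier> dollar" naming
proportions = proportions.reindex(columns=[1, 2, 3, 4], fill_value=0.0)
proportions.columns = [f"{tier} dollar" for tier in proportions.columns]
```

One difference to keep in mind: `crosstab` leaves out rows with a missing `price`, whereas `dollar_rating` counts every business in the neighborhood in the denominator, so the two can disagree when prices are missing.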

##### Generating Dataset for Regression Model
Generate GPS coordinates within Seattle, drop values that do not fall within a neighborhood, then calculate Yelp metrics for each GPS coordinate. The coordinates in the Yelp dataset also need some formatting first. Neighborhood demographic data is added at the end.

In [9]:
# create separate columns for latitude and longitude in Yelp dataframe 

filler = [0 for x in range(yelp.shape[0])]

yelp["latitude"] = filler 
yelp["longitude"] = filler

In [10]:
# define function for separating coordinates and populating appropriate columns with values

def separate_coordinates(yelp_df): 
    # each entry is parsed as a string of the form "[lat, long]"
    for index, coordinate in yelp_df["coordinates"].items(): 
        split = coordinate.split(",")
        lat = split[0].replace("[", "")
        long = split[1].replace(" ", "").replace("]", "")
        # write to the dataframe passed in, not the global yelp
        yelp_df.loc[index, "latitude"] = lat
        yelp_df.loc[index, "longitude"] = long

In [11]:
# call function 

separate_coordinates(yelp)

# cast populated columns as floats

yelp["latitude"] = yelp["latitude"].astype("float")

yelp["longitude"] = yelp["longitude"].astype("float")
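
Assuming every `coordinates` entry is a string of the form `[lat, long]` (which is what the parsing above implies), the split can also be done with pandas string methods instead of a row-by-row loop. A sketch on made-up values:

```python
import pandas as pd

# made-up coordinate strings standing in for yelp["coordinates"]
toy = pd.DataFrame({"coordinates": ["[47.6185, -122.323]", "[47.68, -122.2906]"]})

# strip the brackets, remove spaces, split on the comma, convert to floats
parts = (
    toy["coordinates"]
    .str.strip("[]")
    .str.replace(" ", "", regex=False)
    .str.split(",", expand=True)
    .astype(float)
)
toy["latitude"] = parts[0]
toy["longitude"] = parts[1]
```

This produces float columns directly, so no separate cast is needed afterwards.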

In [12]:
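# generate gps locations inside the bounding box of the Yelp businesses

gps_dict = {"latitude": [], "longitude": [], "neighborhood": []}

# compute the bounds once instead of on every iteration
lat_min, lat_max = yelp["latitude"].min(), yelp["latitude"].max()
long_min, long_max = yelp["longitude"].min(), yelp["longitude"].max()

for i in range(1300): 
    lat = round(random.uniform(lat_min, lat_max), 6)
    gps_dict["latitude"].append(lat)
    
    long = round(random.uniform(long_min, long_max), 6)
    gps_dict["longitude"].append(long)
    
    # to_cra is called with longitude first, then latitude
    n = c.to_cra([long, lat])
    gps_dict["neighborhood"].append(n)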
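
Because the points come from `random.uniform`, each run of the notebook produces a different dataset. One way to make the sampling reproducible is a dedicated, seeded generator. In the sketch below, the bounding box values and the seed are made up, and `sample_points` is a hypothetical helper rather than part of the notebook:

```python
import random

def sample_points(n, lat_bounds, long_bounds, seed):
    """Draw n (latitude, longitude) pairs uniformly from a bounding box."""
    rng = random.Random(seed)  # private generator, independent of global state
    points = []
    for _ in range(n):
        lat = round(rng.uniform(*lat_bounds), 6)
        long = round(rng.uniform(*long_bounds), 6)
        points.append((lat, long))
    return points

# illustrative bounding box roughly covering Seattle; not taken from the Yelp data
first_run = sample_points(5, (47.50, 47.73), (-122.43, -122.24), seed=42)
second_run = sample_points(5, (47.50, 47.73), (-122.43, -122.24), seed=42)
```

With the same seed, both calls return the same points. Points drawn from a bounding box can still land outside any neighborhood, so the neighborhood lookup and filtering remain necessary.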

In [20]:
# create dataframe of generated gps coordinates

gps_df = pd.DataFrame(gps_dict)

# scrub values that do not fall within a Seattle neighborhood
gps_df = gps_df[gps_df["neighborhood"] != "Not Found"].copy()

gps_df.reset_index(inplace = True)

In [21]:
# create columns for dollar tier proportions within a 0.5- and 1-mile radius of point 
for i in range(1, 5): 
    gps_df[f"0.5mi {i} dollar"] = [0 for x in range(gps_df.shape[0])]
    gps_df[f"1.0mi {i} dollar"] = [0 for x in range(gps_df.shape[0])]

In [22]:
# define function for determining dollar tier proportion within a certain radius 

# a single call takes about 2 seconds to run

def radius_dollar_proportion(location, df, radius, dollar_tier):
    # assumes df has a default 0..n-1 index (yelp was reset above)
    indices = []
    
    # collect the businesses that lie within the radius (in miles)
    for i in range(df.shape[0]): 
        coordinates = (df.loc[i, "latitude"], df.loc[i, "longitude"])
        if d.distance(location, coordinates).miles <= radius: 
            indices.append(i)
    
    # no businesses nearby: the proportion is defined as 0
    if len(indices) == 0: 
        proportion = 0
    else: 
        surrounding_businesses = df.loc[indices]
        total_businesses = surrounding_businesses.shape[0]
        proportion = surrounding_businesses[surrounding_businesses["price"] == dollar_tier].shape[0] / total_businesses
    
    return proportion
    

In [23]:
# test gps values
house = (47.679981, -122.290608)
bread = (47.679656, -122.290546)
house2 = (47.618432, -122.322973)

In [24]:
# test field: single data point
radius_dollar_proportion(house2, yelp, 0.5, 2)

0.7421875

In [25]:
# iterate over entire dataframe, populate with yelp metrics for each coordinate

# this will take almost 5 hours to run over the whole dataframe

for i in range(gps_df.shape[0]): 
    # the location only depends on the row, so build it once per point
    location = (gps_df.loc[i, "latitude"], gps_df.loc[i, "longitude"])
    for dollar in range(1, 5): 
        for radius in [0.5, 1.0]: 
            gps_df.loc[i, f"{radius}mi {dollar} dollar"] = radius_dollar_proportion(location, yelp, radius, dollar)
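
Most of the runtime comes from recomputing every point-to-business distance eight times (four tiers times two radii), one `geopy` call at a time. A possible speed-up, sketched below under stated assumptions, computes the distances once per point with a vectorized haversine formula and derives all eight proportions from them. Note that haversine treats the Earth as a sphere, while `geopy`'s `distance` is geodesic by default, so results for businesses very close to the radius boundary can differ slightly. The function names, the Earth radius constant, and the toy data are illustrative.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; spherical approximation

def haversine_miles(lat, long, lats, longs):
    """Distances in miles from one point to arrays of points (all in degrees)."""
    lat, long, lats, longs = map(np.radians, (lat, long, lats, longs))
    a = (np.sin((lats - lat) / 2) ** 2
         + np.cos(lat) * np.cos(lats) * np.sin((longs - long) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

def radius_proportions(lat, long, businesses, radii=(0.5, 1.0), tiers=(1, 2, 3, 4)):
    """Return {"<radius>mi <tier> dollar": proportion} for one point."""
    dist = haversine_miles(lat, long,
                           businesses["latitude"].to_numpy(),
                           businesses["longitude"].to_numpy())
    prices = businesses["price"].to_numpy()
    out = {}
    for radius in radii:
        nearby = prices[dist <= radius]  # prices of businesses inside the radius
        for tier in tiers:
            share = float((nearby == tier).mean()) if nearby.size else 0.0
            out[f"{radius}mi {tier} dollar"] = share
    return out

# made-up businesses standing in for the Yelp dataframe
toy_yelp = pd.DataFrame({
    "latitude": [47.6185, 47.6190, 47.6200, 47.6300],
    "longitude": [-122.3230, -122.3225, -122.3240, -122.3230],
    "price": [2, 2, 1, 3],
})

metrics = radius_proportions(47.618432, -122.322973, toy_yelp)
```

The returned keys follow the same `"<radius>mi <tier> dollar"` naming as the columns created earlier, so one dictionary per point could be assigned to a row directly.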

In [30]:
# add data from seattle demographics dataset 

filler = [0 for i in range(gps_df.shape[0])]

gps_df["median income"] = filler 
gps_df["median rent"] = filler
gps_df["median home value"] = filler

In [31]:
gps_df.head()


| | index | latitude | longitude | neighborhood | 0.5mi 1 dollar | 1.0mi 1 dollar | 0.5mi 2 dollar | 1.0mi 2 dollar | 0.5mi 3 dollar | 1.0mi 3 dollar | 0.5mi 4 dollar | 1.0mi 4 dollar | median income | median rent | median home value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 47.698699 | -122.359579 | Greenwood/Phinney Ridge | 0.5 | 0.25 | 0.5 | 0.6875 | 0.0 | 0.0625 | 0.0 | 0.0 | 0 | 0 | 0 |
| 1 | 1 | 47.629 | -122.29701 | Montlake/Portage Bay | 0.0 | 0.269231 | 0.75 | 0.653846 | 0.25 | 0.076923 | 0.0 | 0.0 | 0 | 0 | 0 |
| 2 | 2 | 47.603136 | -122.301123 | Central Area/Squire Park | 0.3 | 0.352941 | 0.7 | 0.588235 | 0.0 | 0.058824 | 0.0 | 0.0 | 0 | 0 | 0 |
| 3 | 4 | 47.627629 | -122.31755 | North Capitol Hill | 0.258065 | 0.227941 | 0.709677 | 0.720588 | 0.0 | 0.029412 | 0.032258 | 0.022059 | 0 | 0 | 0 |
| 4 | 5 | 47.701419 | -122.290185 | Wedgwood/View Ridge | 0.0 | 0.666667 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 |


In [32]:
seattle_rent.head()

| | neighborhood | median_rent | median_home_value | median_income | 1 dollar | 2 dollar | 3 dollar | 4 dollar |
|---|---|---|---|---|---|---|---|---|
| 0 | Ballard | 1542 | 543200 | 79162 | 0.229167 | 0.75 | 0.020833 | 0.0 |
| 1 | North Beach/Blue Ridge | 1476 | 658600 | 94804 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | Montlake/Portage Bay | 1723 | 821250 | 132573 | 0.142857 | 0.714286 | 0.142857 | 0.0 |
| 3 | Interbay | 1490 | 571300 | 74679 | 0.4 | 0.6 | 0.0 | 0.0 |
| 4 | North Capitol Hill | 1576 | 896200 | 96220 | 0.25 | 0.75 | 0.0 | 0.0 |


In [41]:
# populate demographics columns

for i in range(gps_df.shape[0]): 
    neighborhood = gps_df.loc[i, ["neighborhood"]].item()
    income = seattle_rent[seattle_rent["neighborhood"] == neighborhood]["median_income"].item()
    rent = seattle_rent[seattle_rent["neighborhood"] == neighborhood]["median_rent"].item()
    home = seattle_rent[seattle_rent["neighborhood"] == neighborhood]["median_home_value"].item()
    
    gps_df.loc[i, ["median income"]] = income
    gps_df.loc[i, ["median rent"]] = rent
    gps_df.loc[i, ["median home value"]] = home
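
The same lookup can be expressed as a left join on the neighborhood name, which avoids the row-by-row loop. A sketch on made-up frames, assuming each neighborhood appears exactly once in the demographics table (which the `.item()` calls above also require). The demographic values are copied from the `seattle_rent.head()` output; the frame names are illustrative.

```python
import pandas as pd

# small stand-ins for gps_df and seattle_rent
toy_gps = pd.DataFrame({
    "latitude": [47.629, 47.627629],
    "longitude": [-122.29701, -122.31755],
    "neighborhood": ["Montlake/Portage Bay", "North Capitol Hill"],
})
toy_demographics = pd.DataFrame({
    "neighborhood": ["North Capitol Hill", "Montlake/Portage Bay"],
    "median_income": [96220, 132573],
    "median_rent": [1576, 1723],
    "median_home_value": [896200, 821250],
})

# validate="many_to_one" raises if a neighborhood is duplicated in the lookup table
merged = toy_gps.merge(toy_demographics, on="neighborhood", how="left",
                       validate="many_to_one")
```

A left join keeps every generated point, and any point whose neighborhood is missing from the lookup table would show up with missing demographic values rather than raising an error.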

In [43]:
gps_df.to_csv("../datasets/generated_gps_price_radius.csv", index = False)