In this final project, I explore the distribution of venue categories in the 10 most affluent US neighborhoods.

To do this, I will follow these steps:

1) Identify High Income US Neighborhood data

2) Gather location data for the neighborhoods involved

3) Use the Foursquare API to download recommended venue data for the selected neighborhoods (the top 50 venues per neighborhood, the maximum number allowed by the API)

4) Analyze the data, compute the venue category % by neighborhood, and look for potential patterns



Importing the necessary packages

In [69]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json
import numpy as np
print("done")

done


Getting High Income Neighborhood data from the web

In [70]:
url = 'https://en.wikipedia.org/wiki/List_of_highest-income_urban_neighborhoods_in_the_United_States'
r = requests.get(url)
r_html = r.text

Rendering the data in a pandas dataframe

In [71]:
soup = BeautifulSoup(r_html,'html.parser')
rows=soup.find_all("tr")
rows=rows[19:85]

headers=["Neighborhood","City","State","Mean_Income","Homes","%_black","%_Asian","%_Hispanic","%_White"]

# Split each row's text once; the cell values sit at every other position
cells=[row.text.split("\n")[1::2] for row in rows]

neighborhoods=[c[0] for c in cells]
cities=[c[1] for c in cells]
states=[c[2] for c in cells]
# Strip currency, thousands and percent symbols before converting to numbers
mean_income=[int(c[3].replace("$","").replace(",","")) for c in cells]
homes=[int(c[4].replace(",","")) for c in cells]
perc_black=[float(c[5].replace("%","")) for c in cells]
perc_asian=[float(c[6].replace("%","")) for c in cells]
perc_hispanic=[float(c[7].replace("%","")) for c in cells]
perc_white=[float(c[8].replace("%","")) for c in cells]



dict_={headers[0]:neighborhoods,
       headers[1]:cities,
       headers[2]:states,
       headers[3]:mean_income,
       headers[4]:homes,
       headers[5]:perc_black,
       headers[6]:perc_asian,
       headers[7]:perc_hispanic,
       headers[8]:perc_white}

data=pd.DataFrame(dict_)

data.head()

Unnamed: 0,Neighborhood,City,State,Mean_Income,Homes,%_black,%_Asian,%_Hispanic,%_White
0,Grove Isle-Bayshore,Miami,FL,206683,758,0.0,0.0,20.2,79.8
1,Beekman Place,New York,NY,201623,803,0.0,1.2,5.2,93.5
2,Sutton Place,New York,NY,176980,6822,0.2,3.5,2.8,92.1
3,Old Town (Alexandria),Washington,VA,169658,2057,0.7,1.4,1.7,95.1
4,Tribeca,New York,NY,163425,4013,5.7,6.2,4.8,80.9
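
The slice `rows[19:85]` depends on the page layout at the time of scraping. As a hedged alternative, the sketch below parses one row's text at a time and returns `None` for anything that does not look like a data row, so no hard-coded offsets are needed. The helper name `parse_row` and the sample string are illustrative assumptions; the sample mimics the newline-separated layout that the `[1::2]` slicing above relies on.

```python
def parse_row(row_text):
    """Convert one table row's text into typed values, or None if it is not a data row.

    Assumes the layout relied on above: after splitting on newlines,
    the cell values sit at the odd positions.
    """
    cells = row_text.split("\n")[1::2]
    # Data rows have nine cells and a dollar amount in the fourth one.
    if len(cells) != 9 or not cells[3].startswith("$"):
        return None
    return {
        "Neighborhood": cells[0],
        "City": cells[1],
        "State": cells[2],
        "Mean_Income": int(cells[3].replace("$", "").replace(",", "")),
        "Homes": int(cells[4].replace(",", "")),
        "%_black": float(cells[5].replace("%", "")),
        "%_Asian": float(cells[6].replace("%", "")),
        "%_Hispanic": float(cells[7].replace("%", "")),
        "%_White": float(cells[8].replace("%", "")),
    }


# Hypothetical row text, built to match the first row of the table above.
sample = "\n" + "\n\n".join(
    ["Grove Isle-Bayshore", "Miami", "FL", "$206,683", "758",
     "0.0%", "0.0%", "20.2%", "79.8%"]
) + "\n"

record = parse_row(sample)                          # a dict of typed values
header = parse_row("\nNeighborhood\n\nCity\n")      # not a data row, so None
```

With such a helper, the dataframe could be built from the non-`None` results of `parse_row(row.text)` over all rows, since `pd.DataFrame` accepts a list of dicts.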


Installing geopy to get the latitude and longitude of all neighborhoods

In [72]:
!pip install geopy
import geopy
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="Coursera_Capstone_Romani",timeout=10)






Downloading location data with geopy and rendering it in a pandas dataframe

In [73]:
# Query each neighborhood as "Neighborhood, City, State"
locations=[geolocator.geocode("{}, {}, {}".format(n,c,s)) for n,c,s in zip(neighborhoods,cities,states)]


In [74]:
# Keep only successful lookups; the first address component is the returned place name
found_neighborhoods=[loc.address.split(",")[0] for loc in locations if loc is not None]
latitudes=[loc.latitude for loc in locations if loc is not None]
longitudes=[loc.longitude for loc in locations if loc is not None]



In [75]:
found_df_dict={"Neighborhood":found_neighborhoods,
              "Neighborhood_Lat":latitudes,
              "Neighborhood_lon":longitudes}

found_df=pd.DataFrame(found_df_dict)

found_df.head()

Unnamed: 0,Neighborhood,Neighborhood_Lat,Neighborhood_lon
0,Northeast 79th Street,25.848113,-80.173364
1,Beekman Place,40.753314,-73.964811
2,Sutton Place,40.664373,-73.469407
3,TriBeCa,40.71538,-74.009306
4,Pacific Heights,37.792717,-122.435644
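
Taking the first address component of each result loses the link between a query and what came back, and the returned name can differ from the queried one, as with `TriBeCa` above, which was queried as `Tribeca`. The sketch below is one possible way to keep each query together with its result and to skip mismatches in the same pass. The function `geocode_all`, the stand-in `FakeLocation` and the canned addresses are hypothetical; any callable that returns `None` or an object with `address`, `latitude` and `longitude` attributes could be passed in, which is how `geolocator.geocode` is used above.

```python
from collections import namedtuple

# Stand-in for a geocoding result, used only to keep this sketch self-contained.
FakeLocation = namedtuple("FakeLocation", ["address", "latitude", "longitude"])


def geocode_all(places, geocode):
    """Geocode (neighborhood, city, state) tuples, keeping each query with its result.

    Entries whose lookup failed, or whose returned name differs from the
    queried neighborhood, are skipped.
    """
    matched = []
    for neighborhood, city, state in places:
        location = geocode("{}, {}, {}".format(neighborhood, city, state))
        if location is None:
            continue
        returned_name = location.address.split(",")[0].strip()
        if returned_name != neighborhood:
            continue
        matched.append({
            "Neighborhood": neighborhood,
            "City": city,
            "Neighborhood_Lat": location.latitude,
            "Neighborhood_lon": location.longitude,
        })
    return matched


def fake_geocode(query):
    """Hypothetical geocoder returning canned answers instead of calling a service."""
    canned = {
        "Beekman Place, New York, NY":
            FakeLocation("Beekman Place, Manhattan, New York", 40.753314, -73.964811),
        "Tribeca, New York, NY":
            FakeLocation("TriBeCa, Manhattan, New York", 40.71538, -74.009306),
    }
    return canned.get(query)


places = [
    ("Beekman Place", "New York", "NY"),
    ("Tribeca", "New York", "NY"),      # returned name differs, so it is skipped
    ("Nowhere", "New York", "NY"),      # no result, so it is skipped
]
found = geocode_all(places, fake_geocode)
```

Because each record keeps its `City`, a later join does not have to rely on the neighborhood name alone.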


Because geopy did not return the correct location for every query, I keep only the neighborhoods
whose returned name exactly matches the name used in the query, and discard the others.

Out of 66 total neighborhoods, 59 returned non-null results.

In [76]:
common={"Neighborhood":list(set(data["Neighborhood"]).intersection(found_neighborhoods))}


common_df=pd.DataFrame(common)

common_df=common_df.merge(data,how="left",on="Neighborhood").merge(found_df,how="left",on="Neighborhood")

common_df.head()



Unnamed: 0,Neighborhood,City,State,Mean_Income,Homes,%_black,%_Asian,%_Hispanic,%_White,Neighborhood_Lat,Neighborhood_lon
0,Upper West Side,New York,NY,124960,96146,6.5,5.5,9.2,76.8,40.787045,-73.975416
1,Beekman Place,New York,NY,201623,803,0.0,1.2,5.2,93.5,40.753314,-73.964811
2,Marina,Los Angeles,CA,151934,2449,2.4,2.0,4.2,89.5,37.799793,-122.435205
3,Marina,San Francisco,CA,124750,5413,0.3,8.1,2.8,86.8,37.799793,-122.435205
4,Chestnut Hill,Philadelphia,PA,101471,4267,15.3,2.7,1.8,78.2,40.077055,-75.2074
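
Because the merge matches on `Neighborhood` alone, neighborhoods that share a name also share coordinates: both `Marina` rows above carry the same latitude and longitude. One way to avoid this is to join on the pair (neighborhood, city). In pandas that would mean passing `on=["Neighborhood","City"]` to `merge`, provided `found_df` also carried a `City` column, which it currently does not. The plain-Python sketch below illustrates the idea; the function name and the small input structures are hypothetical, and only one of the two cities is given coordinates on purpose.

```python
# Income records for two neighborhoods that share a name (values from the table above).
income_rows = [
    {"Neighborhood": "Marina", "City": "Los Angeles", "Mean_Income": 151934},
    {"Neighborhood": "Marina", "City": "San Francisco", "Mean_Income": 124750},
]

# Geocoding results keyed by (neighborhood, city) rather than by name only.
# Assumption for illustration: only the San Francisco lookup produced a match.
geocoded = {
    ("Marina", "San Francisco"): (37.799793, -122.435205),
}


def join_on_name_and_city(rows, coords):
    """Attach coordinates only where both the neighborhood and the city match."""
    joined = []
    for row in rows:
        key = (row["Neighborhood"], row["City"])
        if key in coords:
            lat, lon = coords[key]
            joined.append(dict(row, Neighborhood_Lat=lat, Neighborhood_lon=lon))
    return joined


joined = join_on_name_and_city(income_rows, geocoded)
```

With the composite key, the Los Angeles row is left out instead of inheriting coordinates from its namesake.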


Sorting by mean income and keeping only the 10 highest-income neighborhoods

In [77]:
common_df=common_df.sort_values("Mean_Income",ascending=False)
common_df.reset_index(inplace=True,drop=True)


top10=common_df.head(10)

top10

Unnamed: 0,Neighborhood,City,State,Mean_Income,Homes,%_black,%_Asian,%_Hispanic,%_White,Neighborhood_Lat,Neighborhood_lon
0,Beekman Place,New York,NY,201623,803,0.0,1.2,5.2,93.5,40.753314,-73.964811
1,Sutton Place,New York,NY,176980,6822,0.2,3.5,2.8,92.1,40.664373,-73.469407
2,Pacific Heights,San Francisco,CA,158937,8794,0.4,8.4,2.9,86.7,37.792717,-122.435644
3,Marina,Los Angeles,CA,151934,2449,2.4,2.0,4.2,89.5,37.799793,-122.435205
4,Upper East Side,New York,NY,143323,115792,1.8,5.6,4.9,86.3,40.773702,-73.96412
5,Central Park South,New York,NY,141085,2736,2.2,5.1,4.6,85.9,40.764636,-73.973766
6,Kalorama Heights,Washington,DC,140157,1530,3.7,5.8,6.4,87.5,38.916778,-77.052477
7,Dinner Key,Miami,FL,137210,597,0.0,0.0,18.9,79.6,25.728513,-80.23469
8,Telegraph Hill,San Francisco,CA,133977,924,0.9,16.8,1.1,80.4,37.80273,-122.405851
9,Midtown East,New York,NY,129650,31962,1.5,10.6,4.4,81.3,40.759822,-73.972471


Foursquare API Credentials

In [78]:
Client_ID="XX"
Client_Secret="XX"
Version="20190319"

radius=500
limit=50


Using the Foursquare API to gather up to 50 recommended venues, and the corresponding data, for each of the top 10 neighborhoods;
processing the results and rendering them as a pandas dataframe

In [79]:
neighborhoods=[]
cities=[]
neigh_lat=[]
neigh_lon=[]
venues=[]
venues_categories=[]
venues_lats=[]
venues_longs=[]


for i in range(len(top10)):


    url='https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            Client_ID, 
            Client_Secret, 
            Version, 
            top10.loc[i].Neighborhood_Lat, 
            top10.loc[i].Neighborhood_lon, 
            radius,
            limit
            )
    
    results=requests.get(url).json()
    
    # Each item in the first group wraps one recommended venue
    for item in results["response"]["groups"][0]["items"]:
        
        venue=item["venue"]
        
        # Repeat the neighborhood details on every venue row
        neighborhoods.append(top10.loc[i].Neighborhood)
        cities.append(top10.loc[i].City)
        neigh_lat.append(top10.loc[i].Neighborhood_Lat)
        neigh_lon.append(top10.loc[i].Neighborhood_lon)
    
        # The first listed category is used as the venue's category
        venues.append(venue["name"])
        venues_categories.append(venue["categories"][0]["name"])
        venues_lats.append(venue["location"]["lat"])
        venues_longs.append(venue["location"]["lng"])
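
The loop indexes `categories[0]` directly, which raises an `IndexError` if a venue comes back with an empty category list. The sketch below pulls the extraction into a helper that tolerates that case. The helper name `extract_venues`, the `"Uncategorized"` placeholder and the sample response are assumptions made for illustration; the nesting of `response`, `groups`, `items` and `venue` follows the code above.

```python
def extract_venues(results):
    """Return (name, category, lat, lng) tuples from an explore-style response dict."""
    venues = []
    groups = results.get("response", {}).get("groups", [])
    if not groups:
        return venues
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to a placeholder instead of failing on an empty list.
        category = categories[0]["name"] if categories else "Uncategorized"
        location = venue["location"]
        venues.append((venue["name"], category, location["lat"], location["lng"]))
    return venues


# Hypothetical response with one categorized and one uncategorized venue.
sample_response = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Ideal Cheese Shop",
                           "categories": [{"name": "Cheese Shop"}],
                           "location": {"lat": 40.75504, "lng": -73.965347}}},
                {"venue": {"name": "Unnamed Spot",
                           "categories": [],
                           "location": {"lat": 40.0, "lng": -73.0}}},
            ]
        }]
    }
}

venue_rows = extract_venues(sample_response)
empty_rows = extract_venues({})   # a response without groups yields no venues
```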

In [80]:
venue_dict_={"Neighborhood":neighborhoods,
             "City":cities,
            "Latitude":neigh_lat,
            "Longitude":neigh_lon,
            "Venue":venues,
            "Category":venues_categories,
            "Venue_Latitude":venues_lats,
            "Venue_Longitude":venues_longs}

Checking the data

In [81]:
venue_data=pd.DataFrame(venue_dict_)
venue_data.head(20)


Unnamed: 0,Neighborhood,City,Latitude,Longitude,Venue,Category,Venue_Latitude,Venue_Longitude
0,Beekman Place,New York,40.753314,-73.964811,Ideal Cheese Shop,Cheese Shop,40.75504,-73.965347
1,Beekman Place,New York,40.753314,-73.964811,Ethos,Greek Restaurant,40.754526,-73.966048
2,Beekman Place,New York,40.753314,-73.964811,Ophelia,Cocktail Bar,40.753318,-73.966438
3,Beekman Place,New York,40.753314,-73.964811,Deux Amis,French Restaurant,40.754648,-73.966044
4,Beekman Place,New York,40.753314,-73.964811,Peter Detmold Park,Park,40.753599,-73.963648
5,Beekman Place,New York,40.753314,-73.964811,Mario Badescu,Spa,40.75554,-73.966961
6,Beekman Place,New York,40.753314,-73.964811,Jubilee,French Restaurant,40.755194,-73.964817
7,Beekman Place,New York,40.753314,-73.964811,Japan Society,Museum,40.752287,-73.968431
8,Beekman Place,New York,40.753314,-73.964811,Peter Detmold Park Dog Run,Dog Run,40.753607,-73.96363
9,Beekman Place,New York,40.753314,-73.964811,Pathos Cafe,Greek Restaurant,40.75459,-73.965355


Checking the number of venues fetched per neighborhood.
I keep only the 7 neighborhoods for which the full limit of 50 venues was returned, and filter the venue dataset down to those 7.

In [82]:
venue_data.groupby("Neighborhood")["Venue"].count()

Neighborhood
Beekman Place         50
Central Park South    50
Dinner Key             8
Kalorama Heights       6
Marina                50
Midtown East          50
Pacific Heights       50
Sutton Place           5
Telegraph Hill        50
Upper East Side       50
Name: Venue, dtype: int64

In [83]:
# Neighborhoods that returned fewer venues than the limit
venue_counts=venue_data.groupby("Neighborhood")["Venue"].count()
sample_neighborhoods=list(venue_counts[venue_counts!=limit].index)
sample_neighborhoods

['Dinner Key', 'Kalorama Heights', 'Sutton Place']

In [84]:
# Drop the neighborhoods that did not reach the 50-venue limit
venue_data=venue_data[~venue_data["Neighborhood"].isin(sample_neighborhoods)]

venue_data.reset_index(drop=True,inplace=True)



Exploring the distribution of venue categories and consolidating them into broader buckets (Restaurants, Shops, Sports, etc.)

In [85]:
venue_data["Category"].value_counts()

French Restaurant                           17
Italian Restaurant                          14
Hotel                                       12
Park                                        11
Wine Bar                                    10
Pizza Place                                 10
American Restaurant                          9
Coffee Shop                                  9
Spa                                          9
Boutique                                     9
Bakery                                       8
Cosmetics Shop                               7
Jewelry Store                                7
Gym / Fitness Center                         6
Sushi Restaurant                             6
Sandwich Place                               5
Mediterranean Restaurant                     5
Steakhouse                                   5
Seafood Restaurant                           5
Gym                                          5
Café                                         5
Salon / Barbe

In [86]:
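# Note: "Cafes" contains none of the keywords tested below (not even "Café", which has an accent),
# so rows relabeled here end up in the final else branch.
venue_data=venue_data.replace("Coffee Shop","Cafes")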

for i in range(len(venue_data)):

    if "Restaurant" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Restaurants"
        
    elif "Place" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Restaurants"
        
    elif "Steakhouse" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Restaurants"
    
    elif "Food" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Restaurants"
        
    elif "Diner" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Restaurants"
    
    elif "Pub" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Restaurants"
    
    elif "Store" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Shops"
    
    elif "Shop" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Shops"
        
    elif "Bakery" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Shops"
    
    elif "Boutique" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Shops"
        
    elif "Market" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Shops"
        
    elif "Bar" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Cafes"
    
    elif "Café" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Cafes"
    
    
    elif "Gym" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Sports"
        
    elif "Spa" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Sports"
    
    elif "Playground" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Sports"
    
    elif "Studio" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Sports"
    
    elif "Museum" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Art"
        
    elif "Gallery" in venue_data.loc[i,"Category"]:
        venue_data.loc[i,"Category"]="Art"

    else:
        venue_data.loc[i,"Category"]="Enterntainment"


venue_data["Category"].value_counts()

Restaurants       124
Shops              92
Enterntainment     68
Cafes              33
Sports             25
Art                 8
Name: Category, dtype: int64
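
The chain of `elif` branches above amounts to an ordered list of keyword rules in which the first match wins. A more compact way to express the same idea is sketched below. The names `RULES` and `consolidate` are hypothetical, the fallback label is spelled `Entertainment` here, and a `Coffee` rule is placed ahead of `Shop` on the assumption that coffee shops are meant to be grouped with cafes.

```python
# Ordered (keyword, bucket) rules; the first keyword found in the category wins.
RULES = [
    ("Restaurant", "Restaurants"),
    ("Place", "Restaurants"),
    ("Steakhouse", "Restaurants"),
    ("Food", "Restaurants"),
    ("Diner", "Restaurants"),
    ("Pub", "Restaurants"),
    ("Coffee", "Cafes"),   # assumption: checked before "Shop" so coffee shops join cafes
    ("Store", "Shops"),
    ("Shop", "Shops"),
    ("Bakery", "Shops"),
    ("Boutique", "Shops"),
    ("Market", "Shops"),
    ("Bar", "Cafes"),
    ("Café", "Cafes"),
    ("Gym", "Sports"),
    ("Spa", "Sports"),
    ("Playground", "Sports"),
    ("Studio", "Sports"),
    ("Museum", "Art"),
    ("Gallery", "Art"),
]


def consolidate(category, default="Entertainment"):
    """Return the bucket of the first rule whose keyword occurs in the category."""
    for keyword, bucket in RULES:
        if keyword in category:
            return bucket
    return default


examples = ["French Restaurant", "Pizza Place", "Coffee Shop", "Cheese Shop",
            "Wine Bar", "Spa", "Museum", "Park"]
buckets = [consolidate(c) for c in examples]
```

On a dataframe, a function like this could be applied with `venue_data["Category"].apply(consolidate)`, which would replace the row-by-row loop.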

Checking newly consolidated data

In [87]:
venue_data.head(30)

Unnamed: 0,Neighborhood,City,Latitude,Longitude,Venue,Category,Venue_Latitude,Venue_Longitude
0,Beekman Place,New York,40.753314,-73.964811,Ideal Cheese Shop,Shops,40.75504,-73.965347
1,Beekman Place,New York,40.753314,-73.964811,Ethos,Restaurants,40.754526,-73.966048
2,Beekman Place,New York,40.753314,-73.964811,Ophelia,Cafes,40.753318,-73.966438
3,Beekman Place,New York,40.753314,-73.964811,Deux Amis,Restaurants,40.754648,-73.966044
4,Beekman Place,New York,40.753314,-73.964811,Peter Detmold Park,Enterntainment,40.753599,-73.963648
5,Beekman Place,New York,40.753314,-73.964811,Mario Badescu,Sports,40.75554,-73.966961
6,Beekman Place,New York,40.753314,-73.964811,Jubilee,Restaurants,40.755194,-73.964817
7,Beekman Place,New York,40.753314,-73.964811,Japan Society,Art,40.752287,-73.968431
8,Beekman Place,New York,40.753314,-73.964811,Peter Detmold Park Dog Run,Enterntainment,40.753607,-73.96363
9,Beekman Place,New York,40.753314,-73.964811,Pathos Cafe,Restaurants,40.75459,-73.965355


Calculating the share of each venue category by neighborhood and saving the result to a file

In [88]:
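# Label each neighborhood with its city
venue_data["Neighborhood"]=venue_data.Neighborhood +", "+venue_data.City
# Every remaining neighborhood has exactly `limit` (50) venues, so dividing by it gives each category's share
final=(venue_data.groupby(["Neighborhood","Category"])["Venue"].count()/limit).reset_index(name="Count")
final.to_csv("final")
final.head()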


Unnamed: 0,Neighborhood,Category,Count
0,"Beekman Place, New York",Art,0.02
1,"Beekman Place, New York",Cafes,0.16
2,"Beekman Place, New York",Enterntainment,0.2
3,"Beekman Place, New York",Restaurants,0.44
4,"Beekman Place, New York",Shops,0.16
