# Applied Data Science Capstone Final Project
# Cluster Comparison of Multiple Cities
### Richard C. Anderson

## 1 Introduction & Description

As I worked through the New York and Toronto clustering exercises, I found myself wondering how the neighborhoods of New York's Manhattan would compare with those of central Boston, in whose suburbs I now live, and with those of Houston, where I have lived in the past.

I would expect Boston and Manhattan neighborhoods to be similar, as both cities are older, denser, pre-date the automobile, and have extensive mass-transit options. Houston, on the other hand, has developed entirely with the automobile as its primary transportation method and, in part due to cheaper land prices, has a much lower population density. Houston's central downtown is all business and mostly deserted after 8pm as very few people live there. Its mass transit system is mostly buses with an emphasis on workers commuting to/from the downtown area.

The question in my mind is whether venues in the Houston neighborhoods will cluster with those of Boston and Manhattan. I theorize that Houston will contain mostly independent venue clusters, while Boston and Manhattan will show similar cluster types. My project will perform multiple K-means investigations. First, the project will replicate the Manhattan neighborhood venue clustering evaluation and develop venue clustering evaluations for the Boston and Houston neighborhoods. Second, the Manhattan, Boston, and Houston neighborhood venue data will be joined for a multi-city K-means venue clustering evaluation. The multi-city evaluation will compare and contrast the older cities of Manhattan and Boston with the much newer Houston.
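As a rough illustration of the multi-city approach described above, the sketch below one-hot encodes venue categories, averages them per neighborhood, and runs K-means on the combined table. The venue rows are made up for illustration (only the neighborhood and category names are borrowed from this notebook), and `n_clusters=2` is an arbitrary choice for the toy data, not the setting used in the project.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue records (hypothetical); the real project pulls these from Foursquare.
venues = pd.DataFrame({
    'City': ['Manhattan', 'Manhattan', 'Boston', 'Boston', 'Houston', 'Houston'],
    'Neighborhood': ['Chinatown', 'Chinatown', 'Allston', 'Allston', 'Midtown', 'Midtown'],
    'Venue Category': ['Chinese Restaurant', 'Bakery',
                       'Chinese Restaurant', 'Bakery',
                       'Taco Place', 'Burger Joint'],
})

# One-hot encode the categories, then average per (city, neighborhood)
# so each neighborhood becomes a vector of category frequencies.
onehot = pd.get_dummies(venues['Venue Category']).astype(float)
profiles = (pd.concat([venues[['City', 'Neighborhood']], onehot], axis=1)
              .groupby(['City', 'Neighborhood'])
              .mean())

# Cluster all cities together; neighborhoods with similar venue mixes
# should share a label regardless of which city they belong to.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
profiles_labeled = profiles.assign(Cluster=kmeans.labels_)
print(profiles_labeled['Cluster'])
```

Keeping the city name in the index makes it easy to count afterwards how many neighborhoods from each city fall into each cluster.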

The multi-city evaluation performed for this project is an approach that has potential marketing and operational benefits for businesses. For instance, the venue evaluation can help locate saturated or underserved market areas. For a business considering expansion into completely new markets, the multi-city evaluation can provide clues as to how the business might need to adapt its product offerings to compensate for differences in predominant neighborhood characteristics. The multi-city evaluation could also be useful for individuals. As an example, someone relocating from one city to another could start by determining which neighborhoods in the new city most resemble (or differ from!) their current neighborhood.

## 2 Data Requirements & Sources

The data requirements for this project are an extension of the data required for the New York Manhattan borough exercise project. A list of neighborhoods for Boston and Houston and their associated geolocations will be needed so that venue data can be extracted from Foursquare.

Unfortunately, a Google search for tabular data of Boston and Houston neighborhoods and their geolocations did not yield any directly usable results. However, it was possible to manually construct CSV files that combined lists of neighborhoods found on Wikipedia with the necessary geolocation information. The manual construction of neighborhood information did require several subjective judgements, as there did not appear to be a single definitive neighborhood list for either Boston or Houston. In this case, I used my personal familiarity with both cities to determine a suitable list.
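Because the CSV files were assembled by hand, a quick sanity check before use can catch transcription mistakes. The sketch below assumes the three-column layout (`Neighborhood`, `Latitude`, `Longitude`) that the Boston and Houston tables in this notebook use; the helper name `check_neighborhood_table` is hypothetical and not part of the original project.

```python
import io
import pandas as pd

# Sample rows in the assumed layout (values copied from the Boston table in this notebook).
sample_csv = io.StringIO(
    "Neighborhood,Latitude,Longitude\n"
    "Allston,42.3529,-71.1321\n"
    "Back Bay,42.351294,-71.080356\n"
    "Beacon Hill,42.3583,-71.0661\n"
)

def check_neighborhood_table(df):
    """Return a list of problems found in a hand-built neighborhood table."""
    required = ['Neighborhood', 'Latitude', 'Longitude']
    missing = [c for c in required if c not in df.columns]
    if missing:
        return ['missing columns: {}'.format(missing)]

    problems = []
    if df['Neighborhood'].duplicated().any():
        problems.append('duplicate neighborhood names')
    if df[['Latitude', 'Longitude']].isna().any().any():
        problems.append('missing coordinates')
    # NaN values fail between(), so blank cells are also reported here.
    if not df['Latitude'].between(-90, 90).all():
        problems.append('latitude out of range')
    if not df['Longitude'].between(-180, 180).all():
        problems.append('longitude out of range')
    return problems

df_sample = pd.read_csv(sample_csv)
problems = check_neighborhood_table(df_sample)
print(problems or 'no problems found')
```

A swapped latitude/longitude pair for Houston (for example `-95.4448, 29.8327`) would be reported as a latitude out of range, which is one of the easier mistakes to make when copying coordinates by hand.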

Another issue for both Boston and central Houston is how to define their boundaries for comparison with the borough of Manhattan. Central Houston is typically defined as the neighborhoods inside the 610 loop freeway. However, there are two independent cities, Bellaire and West University Place, that are fully contained in this area and will be included as part of central Houston. Also, I included the first ring of neighborhoods just outside the 610 loop as part of central Houston for the purpose of this study. I made this decision in part to give each metro area approximately the same number of neighborhoods. Defining the Boston metro area also required some subjective adjustments, as there are independent major suburbs, particularly Brookline, Cambridge, and Somerville, that are close enough to downtown Boston to be considered part of its metro area.

Foursquare will be used for gathering the venue data for the lists of neighborhoods. However, the developmental differences between Houston and the older cities of Boston and Manhattan will have an impact on the queries used for gathering the venue data. In the original exercise for Manhattan, the venues were pulled from Foursquare using a 500 meter radius around the center geolocation of the neighborhood. I was concerned that this setting might not be valid for Boston or Houston. Houston, given its large area and low population density, was likely to require a much larger radius setting to obtain a representative sample of venues for a given neighborhood.

Iterative clustering evaluations were made for Boston and Houston to determine reasonable radius settings. At the initial setting of 500 meters, many Boston neighborhoods and most Houston neighborhoods returned fewer than 20 venues. With that same 500 meter radius, several Houston neighborhoods returned fewer than 5 venues. I adjusted the search radius for both Boston and Houston until each city had at least 10 neighborhoods that returned 100 venues (the query limit) and the sparsest neighborhood returned at least 10 venues. As suspected, Houston required a much larger search radius to obtain reasonable venue lists.
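The stopping rule described above can be written as a small check on the per-neighborhood venue counts. This is a sketch of my reading of that rule, not code from the original tuning runs; the function name `radius_is_adequate` and the example counts are hypothetical.

```python
import pandas as pd

def radius_is_adequate(venue_counts, limit=100, min_full=10, min_venues=10):
    """Check the radius stopping rule for one city.

    venue_counts: Series of venues returned per neighborhood, for example
                  the result of df['Neighborhood'].value_counts().
    limit:        maximum venues a single query can return.
    min_full:     how many neighborhoods must reach that limit.
    min_venues:   minimum venues required in the sparsest neighborhood.
    """
    n_full = int((venue_counts >= limit).sum())
    return n_full >= min_full and int(venue_counts.min()) >= min_venues

# Hypothetical counts: 10 neighborhoods at the limit, sparsest one at 12.
ok_counts = pd.Series([100] * 10 + [45, 30, 12])
# Hypothetical counts: only 3 neighborhoods at the limit, sparsest one at 4.
sparse_counts = pd.Series([100] * 3 + [18, 9, 4])

print(radius_is_adequate(ok_counts))      # True
print(radius_is_adequate(sparse_counts))  # False
```

One caveat with this approach: a neighborhood that returns no venues at all never appears in `value_counts()`, so the list of neighborhoods should be compared against the venue table before trusting the minimum.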

The search radius settings used to develop the venue lists were set as follows:

    Manhattan:  500 meters
    Boston:    1000 meters
    Houston:   2500 meters
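To put these radius settings in perspective, the short calculation below compares the circular search area each one covers. It uses only the three radii listed above and the area formula for a circle; the variable names are illustrative.

```python
import math

# Search radii in meters, as listed above.
radii_m = {'Manhattan': 500, 'Boston': 1000, 'Houston': 2500}

# Area of each search circle in square kilometers.
area_km2 = {city: math.pi * (r / 1000) ** 2 for city, r in radii_m.items()}

# Area relative to the Manhattan setting (area grows with the square of the radius).
relative_area = {city: (r / radii_m['Manhattan']) ** 2 for city, r in radii_m.items()}

for city in radii_m:
    print('{:<10} {:>6.2f} km^2  ({:.0f}x Manhattan)'.format(
        city, area_km2[city], relative_area[city]))
```

The Boston query covers 4 times the area of the Manhattan query and the Houston query covers 25 times the area, which is worth keeping in mind when comparing venue counts across cities.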

#### Install and import Python libraries

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import json
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
import types
from botocore.client import Config
import ibm_boto3

print('Libraries imported.')

Libraries imported.


In [2]:
!pip install geopy
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
print('geopy imported')

geopy imported


In [3]:
!pip install folium==0.5.0
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium not previously installed
import folium # map rendering library
print('folium imported')

Collecting folium==0.5.0
  Downloading https://files.pythonhosted.org/packages/07/37/456fb3699ed23caa0011f8b90d9cad94445eddc656b601e6268090de35f5/folium-0.5.0.tar.gz (79kB)
Collecting branca (from folium==0.5.0)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Building wheels for collected packages: folium
  Building wheel for folium (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/f8/98/ff/954791afc47740d554f0d9e5885fa09dd60c2265d42578e665
Successfully built folium
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.5.0
folium imported


### 2.1 Create Neighborhood Geolocation Dataframes

The following sections will create the Manhattan, Boston Metro, and Central Houston geolocation dataframes from their source data files. The neighborhood geolocation data is used to create folium map plots for visually evaluating the neighborhood data.

#### 2.1.1 NYC and Manhattan neighborhood geolocation datasets

In [4]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('NY neighborhoods dataset downloaded')

NY neighborhoods dataset downloaded


In [5]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

neighborhoods_data = newyork_data['features']

In [6]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
df_ny_hoods = pd.DataFrame(columns=column_names)

In [7]:
# collect one row per neighborhood, then build the dataframe in a single step
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

df_ny_hoods = pd.DataFrame(rows, columns=column_names)

In [8]:
print('The New York City dataframe contains {} boroughs and {} neighborhoods.'.format(
        len(df_ny_hoods['Borough'].unique()), df_ny_hoods.shape[0])
)
df_ny_hoods.head()

The New York City dataframe contains 5 boroughs and 306 neighborhoods.


| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [9]:
df_manhattan_hoods = df_ny_hoods[df_ny_hoods['Borough'] == 'Manhattan'].reset_index(drop=True)
print('The Manhattan dataframe contains {} neighborhoods.'.format(df_manhattan_hoods.shape[0]))
df_manhattan_hoods.head()

The Manhattan dataframe contains 40 neighborhoods.


| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


In [10]:
def __iter__(self): return 0

#### 2.1.2 Boston metro neighborhood geolocation dataset

In [11]:
# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_c579e19718a54ab9a82d7c9ca0a9f1c6 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<REDACTED>',  # credential removed before sharing; supply your own IBM Cloud API key
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_c579e19718a54ab9a82d7c9ca0a9f1c6.get_object(Bucket='courseradatasciencecapstoneprojec-donotdelete-pr-o9sutjrlk6n7po',Key='BostonMetroNeighborhoodsGeolocations.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_boston_hoods = pd.read_csv(body)
print('The Boston metro dataframe contains {} neighborhoods.'.format(df_boston_hoods.shape[0]))
df_boston_hoods.head()

The Boston metro dataframe contains 50 neighborhoods.


| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Allston | 42.3529 | -71.1321 |
| 1 | Back Bay | 42.351294 | -71.080356 |
| 2 | Bay Village (South Cove) | 42.3491 | -71.068 |
| 3 | Beacon Hill | 42.3583 | -71.0661 |
| 4 | Brighton | 42.35 | -71.16 |


#### 2.1.3 Central Houston neighborhood geolocation dataset 

In [12]:
body = client_c579e19718a54ab9a82d7c9ca0a9f1c6.get_object(Bucket='courseradatasciencecapstoneprojec-donotdelete-pr-o9sutjrlk6n7po',Key='HoustonNeighborhoodsGeolocations2.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_houston_hoods = pd.read_csv(body)
print('The central Houston dataframe contains {} neighborhoods.'.format(df_houston_hoods.shape[0]))
df_houston_hoods.head()

The central Houston dataframe contains 49 neighborhoods.


| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Central Northwest | 29.8327 | -95.4448 |
| 1 | Indepence Heights | 29.8284 | -95.3977 |
| 2 | Lazybrook / Timbergrove | 29.8016 | -95.4381 |
| 3 | Greater Heights | 29.798056 | -95.398056 |
| 4 | Greater Uptown | 29.746111 | -95.463889 |


### 2.2 Plot City Neighborhoods
#### Folium Map setup

In [13]:
def getAddressGeolocation(address):
    geolocator = Nominatim(user_agent="geo_explorer")
    location = geolocator.geocode(address)
    # geocode() returns None when the address cannot be resolved
    if location is None:
        raise ValueError('Could not geocode address: {}'.format(address))
    latitude = location.latitude
    longitude = location.longitude
    #print('The geographical coordinates of {} are {}, {}.'.format(address,latitude, longitude))
    
    return (latitude, longitude)

In [14]:
city_addresses = ['Manhattan, NY', 'Boston, MA', 'Houston, TX', 'New York City, NY']
#city_addresses = ['New York City, NY', 'Boston, MA', 'Houston, TX']
city_geoloc = []
for index, address in enumerate(city_addresses):
    # print(index, address)
    geoloc = getAddressGeolocation(address)
    city_geoloc.append([address,geoloc])
    # print('The geographical coordinates of {} are lat: {}, long: {}.'.format(city_geoloc[index][0], city_geoloc[index][1][0], city_geoloc[index][1][1]))

city_geoloc

[['Manhattan, NY', (40.7896239, -73.9598939)],
 ['Boston, MA', (42.3602534, -71.0582912)],
 ['Houston, TX', (29.7589382, -95.3676974)],
 ['New York City, NY', (40.7127281, -74.0060152)]]

In [15]:
map_width = 250
map_height = 250
fig_width = map_width
fig_height = map_height
initial_zoom = 10

In [16]:
def createHoodMap(city_hoods_df,city_geoloc):
    # create city neighborhood map
    fig_hoods = folium.Figure(width=fig_width, height=fig_height)
#    title_html = '''
#             <h3 align="left" style="font-size:12px"><b>%s Neighborhoods</b></h3>
#             ''' % city_geoloc[0]
#    fig_hoods.get_root().html.add_child(folium.Element(title_html))
    map_hoods = folium.Map(location=[city_geoloc[1][0], city_geoloc[1][1]], zoom_start=initial_zoom, width=map_width, height=map_height)

    # add neighborhood markers to map
    for lat, lng, label in zip(city_hoods_df['Latitude'], city_hoods_df['Longitude'], city_hoods_df['Neighborhood']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_hoods)
    
    fig_hoods.add_child(map_hoods)
    return fig_hoods

#### 2.2.1 Manhattan Neighborhoods

In [17]:
fig_manhattan = createHoodMap(df_manhattan_hoods,city_geoloc[0])
fig_manhattan

#### 2.2.2 Boston Metro Neighborhoods

In [18]:
fig_boston = createHoodMap(df_boston_hoods,city_geoloc[1])
fig_boston

#### 2.2.3 Central Houston Neighborhoods
Note that the neighborhood density is lower than in Manhattan and Boston.

In [19]:
fig_houston = createHoodMap(df_houston_hoods,city_geoloc[2])
fig_houston
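The note above about neighborhood density can also be checked numerically rather than only by eye. The sketch below computes the mean distance from each neighborhood center to its nearest neighboring center using the haversine formula. The helper names are hypothetical, and the three sample rows are copied from the Houston table in this notebook only to make the block runnable; in the notebook, the function would be called on `df_manhattan_hoods`, `df_boston_hoods`, and `df_houston_hoods`.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (inputs in degrees, Earth radius 6371 km)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def mean_nearest_neighbor_km(df):
    """Mean distance from each neighborhood center to its closest other center."""
    lat = df['Latitude'].to_numpy(dtype=float)
    lon = df['Longitude'].to_numpy(dtype=float)
    # Pairwise distance matrix via broadcasting.
    d = haversine_km(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
    np.fill_diagonal(d, np.inf)  # ignore each point's distance to itself
    return float(d.min(axis=1).mean())

# Three rows copied from the Houston table, for illustration only.
sample = pd.DataFrame({
    'Neighborhood': ['Central Northwest', 'Lazybrook / Timbergrove', 'Greater Heights'],
    'Latitude': [29.8327, 29.8016, 29.798056],
    'Longitude': [-95.4448, -95.4381, -95.398056],
})
print('Mean nearest-neighbor distance: {:.2f} km'.format(mean_nearest_neighbor_km(sample)))
```

A larger mean nearest-neighbor distance for a city indicates more widely spaced neighborhood centers, which is consistent with needing a larger venue search radius.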

#### 2.2.4 NYC Neighborhoods

In [20]:
initial_zoom = 9  # zoom out to fit all five boroughs
fig_nyc = createHoodMap(df_ny_hoods,city_geoloc[3])
fig_nyc

#### Foursquare venue exploration setup

In [21]:
# @hidden_cell
CLIENT_ID = '<REDACTED>'  # credential removed before sharing; supply your own Foursquare client ID
CLIENT_SECRET = '<REDACTED>'  # credential removed before sharing; supply your own Foursquare client secret
VERSION = '20180605'

In [22]:
search_radius = 500
search_limit = 100

In [23]:
def getNeighborhoodVenues(names, latitudes, longitudes):
    """Query Foursquare for venues around each neighborhood center.

    Reads the module-level search_radius and search_limit at call time,
    so set them before calling this function for a given city.
    """
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # build the explore request for this neighborhood's center point
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, search_radius, search_limit)
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the fields needed for clustering
        venues_list.append([(name, lat, lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one dataframe
    hood_venues_df = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    hood_venues_df.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
                  'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

    return hood_venues_df
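The function above indexes straight into the response and will raise an error if a neighborhood returns no groups or a venue has no category. The sketch below shows one way the flattening step could be made more tolerant. It works on a hand-written payload that mirrors only the keys the function above reads (`response`, `groups`, `items`, `venue`, `location`, `categories`); the helper name `extract_venue_rows` and the second mock venue are hypothetical, and no network call is made.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore-style response into venue rows, skipping incomplete items."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        # Skip venues that lack a category or coordinates instead of raising.
        if not categories or 'lat' not in location or 'lng' not in location:
            continue
        rows.append((name, lat, lng,
                     venue.get('name'),
                     location['lat'],
                     location['lng'],
                     categories[0].get('name')))
    return rows

# Mock payload: the first venue is copied from the Manhattan output in this notebook,
# the second is a made-up venue with no category.
mock_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': "Arturo's",
               'location': {'lat': 40.874412, 'lng': -73.910271},
               'categories': [{'name': 'Pizza Place'}]}},
    {'venue': {'name': 'Uncategorized Spot',
               'location': {'lat': 40.8765, 'lng': -73.9100},
               'categories': []}},
]}]}}

rows = extract_venue_rows('Marble Hill', 40.876551, -73.91066, mock_payload)
print(rows)
```

Skipping incomplete venues slightly reduces the venue count for a neighborhood, so the counts reported by `value_counts()` could come out lower than with the original function.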

### 2.3 Create Venue Dataframes
#### 2.3.1 Manhattan Neighborhood Venues

In [24]:
df_manhattan_venues = getNeighborhoodVenues(names=df_manhattan_hoods['Neighborhood'],
                                   latitudes=df_manhattan_hoods['Latitude'], longitudes=df_manhattan_hoods['Longitude'] )
print('Manhattan neighborhoods returned {} venues.'.format(df_manhattan_venues.shape[0]))
print('There are {} unique categories.'.format(len(df_manhattan_venues['Venue Category'].unique())))

Manhattan neighborhoods returned 3071 venues.
There are 329 unique categories.


In [25]:
df_manhattan_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Arturo's | 40.874412 | -73.910271 | Pizza Place |
| 1 | Marble Hill | 40.876551 | -73.91066 | Bikram Yoga | 40.876844 | -73.906204 | Yoga Studio |
| 2 | Marble Hill | 40.876551 | -73.91066 | Tibbett Diner | 40.880404 | -73.908937 | Diner |
| 3 | Marble Hill | 40.876551 | -73.91066 | Starbucks | 40.877531 | -73.905582 | Coffee Shop |
| 4 | Marble Hill | 40.876551 | -73.91066 | Dunkin' | 40.877136 | -73.906666 | Donut Shop |


In [26]:
print('Manhattan venue counts with search radius: ',search_radius,' meters.')
df_manhattan_venues['Neighborhood'].value_counts()

Manhattan venue counts with search radius:  500  meters.


Clinton                100
Midtown                100
Noho                   100
Chinatown              100
Little Italy           100
Turtle Bay             100
Greenwich Village      100
Yorkville              100
East Village           100
West Village           100
Chelsea                100
Lenox Hill             100
Financial District     100
Flatiron                99
Civic Center            97
Midtown South           94
Sutton Place            93
Lincoln Square          93
Washington Heights      91
Upper East Side         89
Gramercy                84
Carnegie Hill           82
Upper West Side         82
Murray Hill             81
Soho                    79
Tudor City              74
Tribeca                 74
Battery Park City       66
Hudson Yards            58
Hamilton Heights        58
Inwood                  58
Lower East Side         47
Central Harlem          44
Manhattanville          43
Morningside Heights     43
East Harlem             41
Manhattan Valley        37
M

#### 2.3.2 Boston Metro Neighborhood Venues

In [27]:
search_radius = 1000
df_boston_venues = getNeighborhoodVenues(names=df_boston_hoods['Neighborhood'],
                                   latitudes=df_boston_hoods['Latitude'], longitudes=df_boston_hoods['Longitude'] )
print('Boston metro neighborhoods returned {} venues.'.format(df_boston_venues.shape[0]))
print('There are {} unique categories.'.format(len(df_boston_venues['Venue Category'].unique())))

Boston metro neighborhoods returned 3940 venues.
There are 303 unique categories.


In [28]:
df_boston_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Allston | 42.3529 | -71.1321 | Fish Market Sushi Bar | 42.353039 | -71.132975 | Sushi Restaurant |
| 1 | Allston | 42.3529 | -71.1321 | Tous les Jours | 42.351753 | -71.131665 | Bakery |
| 2 | Allston | 42.3529 | -71.1321 | BonChon Chicken | 42.353105 | -71.130921 | Fried Chicken Joint |
| 3 | Allston | 42.3529 | -71.1321 | Mala Restaurant | 42.35296 | -71.131033 | Chinese Restaurant |
| 4 | Allston | 42.3529 | -71.1321 | Azama Grill | 42.354422 | -71.132358 | Falafel Restaurant |


In [29]:
print('Metro Boston venue counts with search radius: ',search_radius,' meters.')
df_boston_venues['Neighborhood'].value_counts()

Metro Boston venue counts with search radius:  1000  meters.


Chinatown/Leather District           100
Bay Village (South Cove)             100
The Port (Cambridge)                 100
Spring Hill (Somerville)             100
East Cambridge                       100
Ward Two/Cobble Hill (Somerville)    100
Riverside (Cambridge)                100
South Boston                         100
South End                            100
West End                             100
Wellington-Harrington (Cambridge)    100
MIT (Cambridge)                      100
Davis Square (Somerville)            100
North End                            100
Propspect Hill (Somerville)          100
Cambridgeport                        100
Back Bay                             100
Downtown/Financial District          100
Powder House (Somerville)            100
Ten Hills (Somerville)               100
Mid-Cambridge                        100
Allston                              100
Beacon Hill                          100
Fenway/Kenmore                       100
West Cambridge  

#### 2.3.3 Central Houston Neighborhood Venues

In [30]:
search_radius = 2500
df_houston_venues = getNeighborhoodVenues(names=df_houston_hoods['Neighborhood'],
                                   latitudes=df_houston_hoods['Latitude'], longitudes=df_houston_hoods['Longitude'] )
print('Central Houston neighborhoods returned {} venues.'.format(df_houston_venues.shape[0]))
print('There are {} unique categories.'.format(len(df_houston_venues['Venue Category'].unique())))

Central Houston neighborhoods returned 3716 venues.
There are 282 unique categories.


In [31]:
df_houston_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Central Northwest | 29.8327 | -95.4448 | Mytiburger | 29.832268 | -95.450974 | Burger Joint |
| 1 | Central Northwest | 29.8327 | -95.4448 | T C Jester Park | 29.828376 | -95.45726 | Park |
| 2 | Central Northwest | 29.8327 | -95.4448 | Plonk! Beer & Wine Bistro | 29.829114 | -95.431201 | Wine Bar |
| 3 | Central Northwest | 29.8327 | -95.4448 | Tacos A Go-Go | 29.817542 | -95.44672 | Taco Place |
| 4 | Central Northwest | 29.8327 | -95.4448 | European Wax Center | 29.828911 | -95.430725 | Health & Beauty Service |


In [32]:
print('Central Houston venue counts with search radius: ',search_radius,' meters.')
df_houston_venues['Neighborhood'].value_counts()

Central Houston venue counts with search radius:  2500  meters.


Greenway / Upper Kirby                100
Astrodome Area                        100
Afton Oaks / River Oaks               100
Spring Branch East                    100
Washington Avenue / Memorial Park     100
Gulfgate Riverview / Pine Valley      100
Midtown                               100
Willow Meadows/Willowbend             100
MacGregor                             100
Greater Uptown                        100
West University Place                 100
South Main                            100
Medical Center                        100
Near Northside                        100
Meyerland                             100
Downtown                              100
Lazybrook / Timbergrove               100
Museum Park                           100
Greater Heights                       100
Fourth Ward                           100
Neartown / Montrose                   100
Indepence Heights                     100
Gulfton                               100
Greater Third Ward                