# Capstone project -- week 3 -- clustering neighborhood data

The assignment for week 3 of the capstone project consisted of three parts: downloading, geocoding and clustering of Toronto neighborhood data. The source code of this notebook follows on from the <a href="https://github.com/AnikaC-git/Coursera_Capstone/blob/master/notebooks/Capstone_week3_geocoding.ipynb">notebook</a> that geocoded the data. Instead of repeating this process, it will load the saved, geocoded data from file.

In [1]:
import pandas as pd
import numpy as np
import folium, requests

import matplotlib.cm as cm
import matplotlib.colors as colors

from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans

INPUT_FILE = r"../data/postal_codes_toronto_geocoded.csv"

In [2]:
###############################
# Data needed to access FourSquare API, but partially deleted to keep keys secret
###############################
CLIENT_ID = '' 
CLIENT_SECRET = '' 
VERSION = '20200630' 
LIMIT = 100 # set to 100 results

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: <redacted>
CLIENT_SECRET: <redacted>
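
One way to keep the keys out of the notebook altogether is to read them from environment variables and print only a masked form. The sketch below assumes the hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; neither name comes from the original notebook or from FourSquare, and the helper names are made up for illustration.

```python
import os

# Hypothetical environment variable names, chosen for this sketch only.
ID_VAR = "FOURSQUARE_CLIENT_ID"
SECRET_VAR = "FOURSQUARE_CLIENT_SECRET"


def load_credentials(env=None):
    """Return (client_id, client_secret) read from the environment mapping."""
    env = os.environ if env is None else env
    return env.get(ID_VAR, ""), env.get(SECRET_VAR, "")


def mask(secret, visible=4):
    """Show only the first `visible` characters and star out the rest."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)


CLIENT_ID, CLIENT_SECRET = load_credentials()
print("CLIENT_ID: " + mask(CLIENT_ID))
print("CLIENT_SECRET: " + mask(CLIENT_SECRET))
```

With this approach the printed output can stay in a shared notebook without exposing the full keys.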


In [3]:
# reading in the geocoded data of Toronto neighborhoods downloaded from Wikipedia
df_neighborhoods = pd.read_csv(INPUT_FILE)
print(df_neighborhoods.head())

  Postal Code           Borough                                 Neighborhood  \
0         M3A        North York                                    Parkwoods   
1         M4A        North York                             Victoria Village   
2         M5A  Downtown Toronto                    Regent Park, Harbourfront   
3         M6A        North York             Lawrence Manor, Lawrence Heights   
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government   

   Longitude   Latitude  
0 -79.329656  43.753259  
1 -79.315572  43.725882  
2 -79.360636  43.654260  
3 -79.464763  43.718518  
4 -79.389494  43.662301  


In [4]:
# retrieving Longitude and Latitude for Toronto for creating map plots further on
address_toronto = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address_toronto)
lat_toronto = location.latitude
long_toronto = location.longitude

print('The geographical coordinates of Toronto are {}, {}.'.format(lat_toronto, long_toronto))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
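
`geocode` returns `None` when the geocoder finds no match, in which case reading `location.latitude` fails with an `AttributeError`. A minimal guard is sketched below. The helper name `coordinates_or_raise` is hypothetical, and it accepts any geocoding callable (for example `geolocator.geocode`); a stand-in geocoder is used here so the example runs without network access.

```python
from collections import namedtuple


def coordinates_or_raise(geocode, address):
    """Call `geocode(address)` and return (latitude, longitude).

    `geocode` is any callable that returns an object with `latitude` and
    `longitude` attributes, or None when the address is not found.
    """
    location = geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude


# Stand-in geocoder for illustration only; it knows a single address.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def fake_geocode(address):
    known = {"Toronto, Ontario": FakeLocation(43.6534817, -79.3839347)}
    return known.get(address)


lat_toronto, long_toronto = coordinates_or_raise(fake_geocode, "Toronto, Ontario")
print(lat_toronto, long_toronto)
```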


Before clustering the data, let's check that all has gone well so far by plotting the neighborhoods on a map.

In [5]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[lat_toronto, long_toronto], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_neighborhoods['Latitude'], df_neighborhoods['Longitude'], df_neighborhoods['Borough'], df_neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

In [6]:
# function to retrieve nearby venues from Toronto neighborhoods (as was done in lab)
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
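
The chained indexing in `getNearbyVenues` raises a `KeyError` or `IndexError` as soon as one response lacks a `groups` entry or one venue has an empty `categories` list. A more tolerant parser might look like the sketch below. `extract_venues` and the sample payload are hypothetical and only mirror the fields the function above already reads.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from a parsed explore response.

    Missing groups yield an empty list; venues without a category get None.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    venues = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        category = categories[0].get("name") if categories else None
        venues.append((venue.get("name"), location.get("lat"),
                       location.get("lng"), category))
    return venues


# Made-up payload with the same nesting as the fields used above.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 43.66, "lng": -79.37},
               "categories": []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({"response": {}}))
```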


In [7]:
# to reduce the amount of data queried, I will focus on the neighborhoods of two boroughs only
# (Downtown Toronto and Central Toronto)

df_neighborhoods_filtered = df_neighborhoods[(df_neighborhoods['Borough'] == 'Downtown Toronto') |
                                             (df_neighborhoods['Borough'] == 'Central Toronto')]
print(df_neighborhoods_filtered.head())
print("\n\n" + str(len(df_neighborhoods_filtered)))

   Postal Code           Borough                                 Neighborhood  \
2          M5A  Downtown Toronto                    Regent Park, Harbourfront   
4          M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government   
9          M5B  Downtown Toronto                     Garden District, Ryerson   
15         M5C  Downtown Toronto                               St. James Town   
20         M5E  Downtown Toronto                                  Berczy Park   

    Longitude   Latitude  
2  -79.360636  43.654260  
4  -79.389494  43.662301  
9  -79.378937  43.657162  
15 -79.375418  43.651494  
20 -79.373306  43.644771  


28


These 28 neighborhoods will now be clustered into groups depending on their constitution, that is, the kinds of venues found in them. The venue data will be queried from FourSquare as was done in the lab.

In [8]:
filtered_neighborhood_venues = getNearbyVenues(names=df_neighborhoods_filtered['Neighborhood'],
                                               latitudes=df_neighborhoods_filtered['Latitude'],
                                               longitudes=df_neighborhoods_filtered['Longitude'])
print(filtered_neighborhood_venues)

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Davisville
University of Toronto, Harbord
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town, Cabbagetown
First Canadian Place, Underground city
Church and Wellesley
                   Neighborhood  Neighborhood Latitude  \
0     Regent Park, Harbourfront               43.65426   
1     Regent Park, Harbourfront 

1342 venues have been retrieved for the 28 neighborhoods. They describe the constitution of each neighborhood and form the basis for the clustering. Before the clustering can be conducted, the neighborhoods need to be represented as vectors based on the venue categories found in each neighborhood.

In [9]:
# representing neighborhoods with vectors based on venue categories found
filtered_neighborhoods_onehot = pd.get_dummies(filtered_neighborhood_venues[['Venue Category']], prefix="", prefix_sep="")
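# a venue category that is itself called "Neighborhood" would clash with the name column, so drop it if present
filtered_neighborhoods_onehot = filtered_neighborhoods_onehot.drop(columns='Neighborhood', errors='ignore')
filtered_neighborhoods_onehot.insert(0, 'Neighborhood', filtered_neighborhood_venues['Neighborhood']) # then insert the neighborhood names as first column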
filtered_neighborhoods_grouped = filtered_neighborhoods_onehot.groupby('Neighborhood').mean().reset_index()
filtered_neighborhoods_grouped

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,...,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0
1,"CN Tower, King and Spadina, Railway Lands, Har...",0.071429,0.071429,0.071429,0.214286,0.071429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.014286,0.0,0.0,0.014286,0.0,0.0,0.014286
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.014085,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028169
5,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.01,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0
6,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"First Canadian Place, Underground city",0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.01,...,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0
9,"Forest Hill North & West, Forest Hill Road Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
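
Averaging the one-hot rows per neighborhood yields, for each category, the fraction of that neighborhood's venues that fall into it. The sketch below reproduces that calculation in plain Python on made-up venue data (the neighborhood names are real, the venues are not), so the meaning of a value such as 0.25 is easy to check by hand. The function name `category_frequencies` is an assumption for illustration.

```python
from collections import Counter, defaultdict

# Made-up (neighborhood, venue category) pairs for illustration only.
venues = [
    ("Christie", "Cafe"),
    ("Christie", "Grocery Store"),
    ("Christie", "Cafe"),
    ("Christie", "Park"),
    ("Rosedale", "Park"),
    ("Rosedale", "Trail"),
]


def category_frequencies(pairs):
    """Map each neighborhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1

    # Every neighborhood gets the same, sorted set of categories as its axes.
    categories = sorted({category for _, category in pairs})
    return {
        neighborhood: {c: counter[c] / sum(counter.values()) for c in categories}
        for neighborhood, counter in counts.items()
    }


vectors = category_frequencies(venues)
for neighborhood, vector in vectors.items():
    print(neighborhood, vector)
```

Each vector sums to 1, so neighborhoods with many venues and neighborhoods with few venues become comparable.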


Now that the neighborhoods are represented as vectors, they can be grouped into clusters. No test for the ideal number of clusters (if there is one) has been conducted, so the number of clusters is set arbitrarily to four.

In [10]:
kclusters = 4

filtered_neighborhoods_grouped_clustering = filtered_neighborhoods_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(filtered_neighborhoods_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print(kmeans.labels_)

[0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 1 3 0 0 0 0 0 0 0]
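
Because the number of clusters was picked arbitrarily, it may be worth comparing a few candidate values. One common heuristic is the mean silhouette score, available in scikit-learn as `silhouette_score`. The sketch below runs on synthetic blobs rather than the venue data; the helper name `silhouette_by_k` and the candidate range are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, candidates):
    """Fit k-means for each candidate k and return {k: mean silhouette score}."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores


# Synthetic data: three tight, well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
features = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

scores = silhouette_by_k(features, range(2, 6))
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On the notebook's data, `features` would be `filtered_neighborhoods_grouped_clustering`. With only 28 rows, the scores should be read with caution.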


In [11]:
# merge clusters with neighborhood geospatial data for plotting
filtered_neighborhoods_grouped['cluster label'] = kmeans.labels_
filtered_neighborhoods_grouped = filtered_neighborhoods_grouped[['Neighborhood', 'cluster label']]
filtered_neighborhoods_venues = filtered_neighborhood_venues[['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude']].groupby('Neighborhood').first()
neigh_clustered = filtered_neighborhoods_grouped.join(filtered_neighborhoods_venues, on='Neighborhood')
neigh_clustered

Unnamed: 0,Neighborhood,cluster label,Neighborhood Latitude,Neighborhood Longitude
0,Berczy Park,0,43.644771,-79.373306
1,"CN Tower, King and Spadina, Railway Lands, Har...",0,43.628947,-79.39442
2,Central Bay Street,0,43.657952,-79.387383
3,Christie,0,43.669542,-79.422564
4,Church and Wellesley,0,43.66586,-79.38316
5,"Commerce Court, Victoria Hotel",0,43.648198,-79.379817
6,Davisville,0,43.704324,-79.38879
7,Davisville North,0,43.712751,-79.390197
8,"First Canadian Place, Underground city",0,43.648429,-79.38228
9,"Forest Hill North & West, Forest Hill Road Park",0,43.696948,-79.411307


In [12]:
# plotting map with cluster information
# create map
map_clusters = folium.Map(location=[lat_toronto, long_toronto], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(neigh_clustered['Neighborhood Latitude'], neigh_clustered['Neighborhood Longitude'], neigh_clustered['Neighborhood'], neigh_clustered['cluster label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1, so they index the color list directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

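From the plot, it seems that most of the selected neighborhoods are homogeneous: 24 of the 28 fall into cluster 0, while the remaining four are spread over the other three clusters.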
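
Counting the labels printed by k-means makes the imbalance concrete. The list below is copied from the `kmeans.labels_` output shown earlier in this notebook; in a live session `kmeans.labels_.tolist()` could be used instead.

```python
from collections import Counter

# Labels copied from the printed kmeans.labels_ output.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
          2, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0]

cluster_sizes = Counter(labels)
for cluster, size in sorted(cluster_sizes.items()):
    print("cluster {}: {} neighborhoods".format(cluster, size))
```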