# Neighborhood Clustering Toronto
This is a de facto recreation of the [Segmenting and Clustering Neighborhoods in New York City](https://labs.cognitiveclass.ai/tools/jupyterlab/lab/tree/labs/DP0701EN/DP0701EN-3-3-2-Neighborhoods-New-York-py-v1.0.ipynb) lab by **Cognitive Class.ai** and the **Coursera** *Applied Data Science Capstone*, applied to Toronto.

I am using this to familiarize myself with the **Foursquare API** and **Folium**.

First, I load the dataset with the Toronto postcodes and coordinates.

In [22]:
import pandas as pd
import requests  # HTTP library used for the Foursquare API calls

# the first CSV column holds the row index
df_T = pd.read_csv('Toronto_Merged.csv', index_col=0)
df_T.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M5N,Central Toronto,Roselawn,M5N,43.711695,-79.416936
1,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",M5L,43.648198,-79.379817
2,M6G,Downtown Toronto,Christie,M6G,43.669542,-79.422564
3,M6J,West Toronto,"Little Portugal, Trinity",M6J,43.647927,-79.41975
4,M9M,North York,"Emery, Humberlea",M9M,43.724766,-79.532242


I only keep the postcodes/neighborhoods whose borough name contains "Toronto" (Central, Downtown, East and West Toronto).

In [98]:
df_TCenter = df_T[df_T['Borough'].str.contains('Toronto')]
df_TCenter.shape

(38, 6)
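
As a small illustration of the filter above, the sketch below applies `str.contains` to a hand-made table. Three rows are taken from the preview above, and the fourth (postcode `M0X` with a missing borough) is made up. The `na=False` argument is an addition that is not in the original cell; it treats missing borough names as non-matches.

```python
import pandas as pd

# toy data: three rows from the preview above plus one hypothetical row without a borough
toy = pd.DataFrame({
    'Postcode': ['M5N', 'M5L', 'M9M', 'M0X'],
    'Borough': ['Central Toronto', 'Downtown Toronto', 'North York', None],
})

# keep rows whose borough name contains 'Toronto'; na=False drops missing values
toy_center = toy[toy['Borough'].str.contains('Toronto', na=False)]
print(toy_center['Postcode'].tolist())
```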

Next, I load my Foursquare API credentials from a local CSV file.

In [99]:
KEY = pd.read_csv('Foursquare_Key.csv')
VERSION = '20180605'  # Foursquare API version date
CLIENT_ID = KEY.iloc[0, 0]
CLIENT_SECRET = KEY.iloc[0, 1]

## Getting data from Foursquare
Using the function provided in the lab, I connect to the Foursquare API and loop over all central postcodes, downloading the venue data (at most 100 venues per postcode) within a 500 m radius of each postcode's coordinates.

In [24]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
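
`getNearbyVenues` indexes straight into the JSON response, so a venue without a category or a response without groups would raise an exception. The following is a minimal sketch of a more defensive parser. It assumes the payload has the shape the function above relies on (`response`, then `groups[0]`, then `items`, each item holding a `venue`). The helper name `parse_explore_items` and the `'Unknown'` placeholder are hypothetical, and the sample payload is hand-made rather than real Foursquare output.

```python
def parse_explore_items(payload, name, lat, lng):
    """Turn one explore-style payload into a list of venue rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # fall back to a placeholder when a venue has no category (assumption)
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# hand-made payload mimicking the structure used above;
# 'Uncategorized Place' is a made-up venue without categories
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Canoe',
               'location': {'lat': 43.647452, 'lng': -79.38132},
               'categories': [{'name': 'Restaurant'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.6481, 'lng': -79.379989},
               'categories': []}},
]}]}}

rows = parse_explore_items(sample, 'M5L', 43.648198, -79.379817)
empty = parse_explore_items({'response': {}}, 'M5L', 43.648198, -79.379817)
print(rows)
```

The rows have the same seven fields, in the same order, as the tuples built inside `getNearbyVenues`, so they could be fed into the same `pd.DataFrame` construction.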

In [26]:
Toronto_venues = getNearbyVenues(names=df_TCenter['Postcode'],
                                   latitudes=df_TCenter['Latitude'],
                                   longitudes=df_TCenter['Longitude']
                                  )

M5N
M5L
M6G
M6J
M5V
M5C
M4R
M5P
M6P
M4M
M6S
M5A
M6R
M5S
M4N
M4X
M4T
M4V
M5E
M4L
M5B
M4K
M4E
M4W
M5K
M5T
M5J
M6K
M5H
M5R
M4S
M5G
M7Y
M6H
M5W
M5X
M4Y
M4P


In [52]:
Toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M5N,43.711695,-79.416936,Ceiling Champions,43.713891,-79.420702,Home Service
1,M5N,43.711695,-79.416936,Rosalind's Garden Oasis,43.712189,-79.411978,Garden
2,M5L,43.648198,-79.379817,Canoe,43.647452,-79.38132,Restaurant
3,M5L,43.648198,-79.379817,Equinox Bay Street,43.6481,-79.379989,Gym
4,M5L,43.648198,-79.379817,Mos Mos Coffee,43.648159,-79.378745,Café


After grouping by neighborhood/postcode, we can see that many postcodes in the dataset have only a handful of venues (e.g. M4T with 2 and M4N with 3). These areas are not very commercial and are probably residential neighborhoods.

In [100]:
Toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M4E,5,5,5,5,5,5
M4K,44,44,44,44,44,44
M4L,19,19,19,19,19,19
M4M,37,37,37,37,37,37
M4N,3,3,3,3,3,3
M4P,10,10,10,10,10,10
M4R,19,19,19,19,19,19
M4S,34,34,34,34,34,34
M4T,2,2,2,2,2,2
M4V,15,15,15,15,15,15


## Segmentation / Clustering
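Creating features for segmentation with one-hot encoding of the venue category.

Grouping by postcode and taking the mean turns the indicators into category frequencies, so the features are normalized (each row adds up to 1).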

In [102]:
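# one indicator column per venue category
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# add the postcode back; the column is called 'Postcode' because
# 'Neighborhood' also occurs as a venue category
Toronto_onehot['Postcode'] = Toronto_venues['Neighborhood']

# move the 'Postcode' column to the front
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

# the mean of the indicators is the share of each category per postcode
Toronto_grouped = Toronto_onehot.groupby('Postcode').mean().reset_index()
Toronto_grouped.head()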


Unnamed: 0,Postcode,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.022727
2,M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027027
4,M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
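
The normalization is easiest to see on a tiny example. The sketch below one-hot encodes a toy venue table (made-up rows, not the Foursquare results) and averages per postcode. Each resulting row sums to 1.

```python
import pandas as pd

# made-up venues for two postcodes
toy_venues = pd.DataFrame({
    'Neighborhood': ['M5L', 'M5L', 'M5L', 'M5L', 'M5N', 'M5N'],
    'Venue Category': ['Café', 'Café', 'Gym', 'Restaurant', 'Garden', 'Home Service'],
})

# one indicator column per category, converted to float so the mean is a share
onehot = pd.get_dummies(toy_venues['Venue Category']).astype(float)
onehot.insert(0, 'Postcode', toy_venues['Neighborhood'])

# mean of the indicators = frequency of each category within a postcode
grouped = onehot.groupby('Postcode').mean()
print(grouped)
```

Two of the four M5L venues are cafés, so the `Café` entry for M5L is 0.5.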


## Running the actual clustering
For simplicity, *KMeans* is used. We chose 6 clusters for demonstration purposes.

Only the feature columns are used, i.e. every column except the first (`Postcode`): `iloc[:, 1:]`.

In [103]:
# import k-means from scikit-learn
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 6

# run k-means clustering on the feature columns (everything except 'Postcode')
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped.iloc[:, 1:])

# check cluster labels generated for the first 10 rows of the dataframe
kmeans.labels_[0:10]

array([0, 1, 1, 1, 5, 1, 1, 1, 2, 1])
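
A minimal sketch of the same call on toy data shows what goes in and what comes out. The frequency rows are made up. `n_init=10` is set explicitly (an addition to the original call) so the behaviour does not depend on the default of the installed scikit-learn version, and two clusters are used because the toy data has two obvious groups.

```python
import pandas as pd
from sklearn.cluster import KMeans

# made-up category frequencies for four postcodes (each row sums to 1)
toy_grouped = pd.DataFrame({
    'Postcode': ['A', 'B', 'C', 'D'],
    'Café': [0.8, 0.7, 0.0, 0.1],
    'Park': [0.1, 0.1, 0.9, 0.8],
    'Gym':  [0.1, 0.2, 0.1, 0.1],
})

# fit on the feature columns only (everything except 'Postcode')
features = toy_grouped.iloc[:, 1:]
toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(features)

# one label per row; the label numbers themselves are arbitrary
labels = toy_kmeans.labels_
print(dict(zip(toy_grouped['Postcode'], labels)))
```

The café-heavy postcodes A and B end up together, as do the park-heavy C and D. Which group gets label 0 carries no meaning.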

## Evaluating and visualizing the clustering

Using the function and script from the lecture, we create a table that lists the most common venue types for each postcode and merge it with the overall dataframe.

In [90]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
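
To see what `return_most_common_venues` returns, here is a sketch of the same logic applied to a single made-up row. The function is repeated so that the example is self-contained.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the leading 'Postcode' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# made-up frequency row in the layout of Toronto_grouped
row = pd.Series({'Postcode': 'M5L', 'Bakery': 0.1, 'Café': 0.3,
                 'Coffee Shop': 0.4, 'Hotel': 0.2})

top3 = list(return_most_common_venues(row, 3))
print(top3)
```

For postcodes with only a few venues, most categories tie at a frequency of 0, so the lower ranks only reflect the order in which the sort leaves the zero entries, not venues that actually exist there.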

In [104]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # ranks beyond the 3rd get the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postcode'] = Toronto_grouped['Postcode']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# start from the postcode table so latitude/longitude are kept for each neighborhood
Toronto_merged = df_TCenter

# add the cluster label and the top venues for each postcode
Toronto_merged = Toronto_merged.join(neighborhoods_venues_sorted.set_index('Postcode'), on='Postcode')

Toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5N,Central Toronto,Roselawn,M5N,43.711695,-79.416936,4,Home Service,Garden,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
1,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",M5L,43.648198,-79.379817,1,Coffee Shop,Café,Hotel,American Restaurant,Restaurant,Bakery,Italian Restaurant,Gastropub,Seafood Restaurant,Deli / Bodega
2,M6G,Downtown Toronto,Christie,M6G,43.669542,-79.422564,1,Grocery Store,Café,Park,Athletics & Sports,Nightclub,Restaurant,Diner,Italian Restaurant,Baby Store,Coffee Shop
3,M6J,West Toronto,"Little Portugal, Trinity",M6J,43.647927,-79.41975,1,Bar,Coffee Shop,Asian Restaurant,Bakery,Vietnamese Restaurant,Men's Store,Restaurant,Cocktail Bar,Pizza Place,Café
7,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo...",M5V,43.628947,-79.39442,3,Airport Service,Airport Lounge,Airport Terminal,Harbor / Marina,Sculpture Garden,Boutique,Boat or Ferry,Airport Gate,Airport,Airport Food Court
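
The `indicators` list covers the ten ranks used here. If more ranks were needed, a general ordinal helper would avoid labels such as "21th". This is a minimal sketch; the function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' despite ending in 1, 2 and 3
    if 10 <= n % 100 <= 13:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# same column names as above, built without the try/except
columns = ['Postcode'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1:4])
```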


The clusters are then displayed on a map using Folium.

In [112]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium 
import matplotlib.cm as cm
import matplotlib.colors as colors

latitude = 43.6532
longitude = -79.3832

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Postcode'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
map_clusters.save('plot_data.html')
from IPython.display import HTML
HTML('<iframe src=plot_data.html width=700 height=450></iframe>')
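
Note that `rainbow[cluster-1]` gives cluster 0 the last colour in the list, because index -1 wraps around. The colours are still distinct, just shifted by one. A sketch of a direct label-to-colour mapping follows, using only the standard library (`colorsys` instead of matplotlib). The helper `cluster_colors` is hypothetical.

```python
import colorsys

def cluster_colors(n_clusters):
    """Return one hex colour per cluster label, evenly spaced around the hue wheel."""
    palette = []
    for k in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(k / n_clusters, 0.9, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            int(round(r * 255)), int(round(g * 255)), int(round(b * 255))))
    return palette

palette = cluster_colors(6)
# the colour for a marker is then simply palette[cluster]
print(palette)
```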



In [122]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 1,:].head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",M5L,43.648198,-79.379817,1,Coffee Shop,Café,Hotel,American Restaurant,Restaurant,Bakery,Italian Restaurant,Gastropub,Seafood Restaurant,Deli / Bodega
2,M6G,Downtown Toronto,Christie,M6G,43.669542,-79.422564,1,Grocery Store,Café,Park,Athletics & Sports,Nightclub,Restaurant,Diner,Italian Restaurant,Baby Store,Coffee Shop
3,M6J,West Toronto,"Little Portugal, Trinity",M6J,43.647927,-79.41975,1,Bar,Coffee Shop,Asian Restaurant,Bakery,Vietnamese Restaurant,Men's Store,Restaurant,Cocktail Bar,Pizza Place,Café
9,M5C,Downtown Toronto,St. James Town,M5C,43.651494,-79.375418,1,Coffee Shop,Café,Hotel,Restaurant,Cosmetics Shop,Bakery,Italian Restaurant,Cocktail Bar,Breakfast Spot,Gastropub
10,M4R,Central Toronto,North Toronto West,M4R,43.715383,-79.405678,1,Clothing Store,Coffee Shop,Sporting Goods Shop,Yoga Studio,Chinese Restaurant,Dessert Shop,Diner,Salon / Barbershop,Sandwich Place,Fast Food Restaurant


In [121]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 2, :].head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,M5P,Central Toronto,"Forest Hill North, Forest Hill West",M5P,43.696948,-79.411307,2,Park,Trail,Jewelry Store,Sushi Restaurant,Yoga Studio,Dim Sum Restaurant,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
41,M4T,Central Toronto,"Moore Park, Summerhill East",M4T,43.689574,-79.38316,2,Playground,Trail,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
57,M4W,Downtown Toronto,Rosedale,M4W,43.679563,-79.377529,2,Park,Playground,Trail,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


In [123]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 3, :].head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo...",M5V,43.628947,-79.39442,3,Airport Service,Airport Lounge,Airport Terminal,Harbor / Marina,Sculpture Garden,Boutique,Boat or Ferry,Airport Gate,Airport,Airport Food Court


In [126]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 4, :].head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5N,Central Toronto,Roselawn,M5N,43.711695,-79.416936,4,Home Service,Garden,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


In [125]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 5, :].head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
36,M4N,Central Toronto,Lawrence Park,M4N,43.72802,-79.38879,5,Park,Bus Line,Swim School,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


In [127]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 0, :].head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
52,M4E,East Toronto,The Beaches,M4E,43.676357,-79.293031,0,Coffee Shop,Health Food Store,Neighborhood,Pub,Yoga Studio,Dog Run,Diner,Discount Store,Dive Bar,Donut Shop


## Discussion

Overall, it appears that most postcodes are grouped into the same cluster (cluster 1, judging by the labels above), while a few neighborhoods are clearly different. Based on the tables, cluster 1 is dominated by cafés, coffee shops and restaurants. Clusters 3 and 4 are characterized by services other than restaurants: airport facilities in cluster 3 (M5V) and a home service and a garden in cluster 4 (M5N). Cluster 2 consists of parks, playgrounds and trails.

I would assume that this clumping of similar businesses is a limitation of Foursquare data in general, which only captures customer-facing businesses (i.e. places where people check in, buy things or hang out).