# Segmenting and Clustering

This notebook will be used for the overall assignment "Segmenting and Clustering Neighborhoods in Toronto".

## Importing libraries

We start by importing the two most important libraries for data science.

In [1]:
import pandas as pd
import numpy as np

## Importing data

We import data about the neighborhoods of Toronto from [Wikipedia][1] with the *pandas* `read_html` method, which returns the tables of the page as *pandas* dataframes. We keep the first one, so that we can use the full potential of this library.

[1]: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
#Importing data
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

df = pd.read_html(url)[0]
list_reject = list()

#Precleaning of the data
for i in range(len(df)):
    if df.loc[i, 'Borough'] == 'Not assigned':
        list_reject.append(i)
    elif df.loc[i, 'Neighbourhood'] == 'Not assigned':
        df.loc[i, 'Neighbourhood'] = df.loc[i, 'Borough']

df.drop(list_reject, inplace = True)
# drop=True discards the old index instead of keeping it as an 'index' column
df.reset_index(drop = True, inplace = True)
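
The row-by-row cleaning above can also be expressed with boolean masks. The following is a minimal sketch on a small hand-made frame: the three rows are illustrative and not taken from the Wikipedia table, and `clean_postal_table` is a hypothetical helper name. It assumes the same column names as `df`.

```python
import pandas as pd

def clean_postal_table(raw):
    """Drop rows without a borough and fill missing neighbourhoods with the borough name."""
    # Keep only the rows whose borough is assigned
    cleaned = raw[raw["Borough"] != "Not assigned"].copy()
    # Where the neighbourhood is missing, reuse the borough name
    missing = cleaned["Neighbourhood"] == "Not assigned"
    cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]
    # Renumber the rows from 0 without keeping the old index
    return cleaned.reset_index(drop=True)

# Illustrative rows only (hypothetical values)
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})
toy_clean = clean_postal_table(toy)
print(toy_clean)
```

Working on a copy keeps the raw table untouched, which makes it easier to rerun the cell after a change.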


In the following cell, we verify that the cleaned dataframe is the one we want.

In [3]:
for i in range(len(df)):
    if df.loc[i, 'Borough'] == 'Not assigned' or df.loc[i, 'Neighbourhood'] == 'Not assigned':
        print("Still a 'Not assigned' value")
        break
else:
    # Runs only when the loop ends without hitting `break`
    print("No 'Not assigned' value remaining")

# value_counts() sorts by frequency, so the first count is the largest one
if df['Postal Code'].value_counts().iloc[0] == 1:
    print("Any two different rows have different postal codes")
else:
    print("Issue on postal codes")

print("Shape of the dataframe : ", df.shape)

No 'Not assigned' value remaining
Any two different rows have different postal codes
Shape of the dataframe :  (103, 3)
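
The same two checks can be written without an explicit loop. This is a minimal sketch, assuming the column names used above; `check_clean` is a hypothetical helper and the two small frames are made up for illustration.

```python
import pandas as pd

def check_clean(frame):
    """Return True when no 'Not assigned' value remains and postal codes are unique."""
    # True if any cell of the two columns still equals 'Not assigned'
    has_unassigned = bool(frame[["Borough", "Neighbourhood"]].eq("Not assigned").any().any())
    # True if every postal code appears exactly once
    unique_codes = bool(frame["Postal Code"].is_unique)
    return (not has_unassigned) and unique_codes

# Illustrative frames only (hypothetical values)
good = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})
bad = pd.DataFrame({
    "Postal Code": ["M3A", "M3A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})
print(check_clean(good), check_clean(bad))
```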


## Adding information to the dataframe

Since the *geocoder* package did not work for us, we use the csv file provided [here][2] to complete our dataframe with coordinates.

[2]: https://cocl.us/Geospatial_data

In [4]:
# A forward slash works on every platform and avoids the invalid '\G' escape sequence
geo_path = 'Data/Geospatial_Coordinates.csv'
df_geo = pd.read_csv(geo_path)
df_geo.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Then, we merge the two previous dataframes on "Postal Code" to obtain the final dataframe.

In [5]:
df_tor = pd.merge(df, df_geo, on="Postal Code")
df_tor.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
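
By default `pd.merge` performs an inner join, so a postal code missing from the coordinates file would silently disappear from the result. A minimal sketch of a more defensive merge follows. The first two rows reuse coordinates shown above, while the missing `M5A` coordinates are a deliberate, hypothetical gap used to show the check.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
# Hypothetical coordinates file in which M5A is missing
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every postal code, validate raises if a key is duplicated,
# indicator adds a '_merge' column telling where each row came from
merged = pd.merge(left, right, on="Postal Code", how="left",
                  validate="one_to_one", indicator=True)

# Postal codes that found no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

Comparing the number of rows before and after the merge is a cheaper alternative when only a quick sanity check is needed.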


In [6]:
df_tor.dtypes

Postal Code       object
Borough           object
Neighbourhood     object
Latitude         float64
Longitude        float64
dtype: object

We can use the *folium* package to draw a first map of the city of Toronto.

In [7]:
import folium

lat_tor = 43.7
lon_tor = -79.4

tor_map = folium.Map(location = [lat_tor, lon_tor], zoom_start=11)
tor_map

Then, we add markers to indicate the location of every postal code across the city. We use colors to distinguish the different boroughs.

In [8]:
import matplotlib.cm as cm
import matplotlib.colors as colors

num_bor = len(pd.unique(df_tor["Borough"]))
color_list = cm.rainbow(np.linspace(0,1,num_bor))
rainbow = [colors.rgb2hex(i) for i in color_list]

tor_map = folium.Map(location = [lat_tor, lon_tor], zoom_start=11)

# Give each borough a color index, in order of first appearance
marker_color = {bor: idx for idx, bor in enumerate(pd.unique(df_tor["Borough"]))}

for lat, lon, code, bor in zip(df_tor["Latitude"], df_tor["Longitude"], df_tor["Postal Code"], df_tor["Borough"]):
    label = folium.Popup(code + ', ' + bor, parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, popup=label, color=rainbow[marker_color[bor]], fill=True).add_to(tor_map)

tor_map

## Exploring and clustering the neighborhoods in Toronto

First, we gather information on the neighborhoods using the Foursquare API. We must define our Foursquare API credentials to be able to use it.

In [9]:
#Define my credentials and version

CLIENT_ID = 'YOUR_CLIENT_ID' # replace with your own Foursquare client ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # replace with your own secret and never publish it
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

Then, we make one request per neighborhood to get information about it. To avoid exceeding the maximum number of requests a free developer account can make in a day, we work only on neighborhoods whose borough name contains *'Toronto'*:
* Downtown Toronto
* East Toronto
* West Toronto
* Central Toronto

In [10]:
#Get the best venues around each neighborhood
import requests

def getVenues(codes, boroughs, latitudes, longitudes, radius=500):
    """Return one row per venue found within `radius` meters of each Toronto postal code."""
    columns = ["Postal Code", "Borough", "PC_lat", "PC_lon", "Venue_name", "Venue_categorie"]
    rows = []

    for code, borough, lat, lon in zip(codes, boroughs, latitudes, longitudes):
        # Only query boroughs whose name contains 'Toronto' to limit the number of requests
        if 'Toronto' not in str(borough):
            continue

        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lon,
            radius,
            LIMIT)

        results = requests.get(url).json()
        items = results["response"]['groups'][0]['items']

        for res in items:
            venue = res['venue']['name']
            cat = res['venue']['categories'][0]['name']
            rows.append([code, borough, lat, lon, venue, cat])

    # Build the dataframe once, which is faster than appending row by row
    return pd.DataFrame(rows, columns=columns)
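
The request URL and the response parsing can be separated into two small functions, which makes them testable without calling the API. This is a minimal sketch: `build_explore_url` and `parse_items` are hypothetical helper names, the sample payload is hand-written to mirror the nesting read by the code above (it is not a real API response), and skipping venues without a category is an assumption about the desired behavior.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lon, client_id, client_secret, version, radius=500, limit=100):
    """Build the venues/explore URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lon),
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes special characters, e.g. the comma in 'll' becomes %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

def parse_items(payload):
    """Extract (venue name, first category name) pairs, skipping venues without a category."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    pairs = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        if categories:
            pairs.append((venue["name"], categories[0]["name"]))
    return pairs

# Hand-written payload for illustration only
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Tandem Coffee", "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Place without category", "categories": []}},
]}]}}

example_url = build_explore_url(43.65426, -79.360636, "ID", "SECRET", "20180605")
print(example_url)
print(parse_items(sample))
```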

In [11]:
venues = getVenues(df_tor['Postal Code'],df_tor['Borough'], df_tor['Latitude'], df_tor['Longitude'])
venues.head()

| | Postal Code | Borough | PC_lat | PC_lon | Venue_name | Venue_categorie |
|---|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | 43.65426 | -79.360636 | Roselle Desserts | Bakery |
| 1 | M5A | Downtown Toronto | 43.65426 | -79.360636 | Tandem Coffee | Coffee Shop |
| 2 | M5A | Downtown Toronto | 43.65426 | -79.360636 | Cooper Koo Family YMCA | Distribution Center |
| 3 | M5A | Downtown Toronto | 43.65426 | -79.360636 | Body Blitz Spa East | Spa |
| 4 | M5A | Downtown Toronto | 43.65426 | -79.360636 | Impact Kitchen | Restaurant |


In [12]:
venues.shape

(1624, 6)

In [13]:
venues.groupby("Postal Code").count()

| Postal Code | Borough | PC_lat | PC_lon | Venue_name | Venue_categorie |
|---|---|---|---|---|---|
| M4E | 4 | 4 | 4 | 4 | 4 |
| M4K | 43 | 43 | 43 | 43 | 43 |
| M4L | 19 | 19 | 19 | 19 | 19 |
| M4M | 37 | 37 | 37 | 37 | 37 |
| M4N | 3 | 3 | 3 | 3 | 3 |
| M4P | 9 | 9 | 9 | 9 | 9 |
| M4R | 18 | 18 | 18 | 18 | 18 |
| M4S | 33 | 33 | 33 | 33 | 33 |
| M4T | 2 | 2 | 2 | 2 | 2 |
| M4V | 14 | 14 | 14 | 14 | 14 |

In [14]:
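# dtype=int keeps the indicator columns as 0/1 (recent pandas versions default to booleans)
venues_cat_sep = pd.get_dummies(venues[["Venue_categorie"]], dtype=int)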
venues_cat_sep["Postal Code"] = venues["Postal Code"]

fixed_columns = [venues_cat_sep.columns[-1]] + list(venues_cat_sep.columns[:-1])
venues_cat_sep = venues_cat_sep[fixed_columns]

venues_cat_sep.head()

Unnamed: 0,Postal Code,Venue_categorie_Afghan Restaurant,Venue_categorie_Airport,Venue_categorie_Airport Food Court,Venue_categorie_Airport Gate,Venue_categorie_Airport Lounge,Venue_categorie_Airport Service,Venue_categorie_Airport Terminal,Venue_categorie_American Restaurant,Venue_categorie_Antique Shop,...,Venue_categorie_Theater,Venue_categorie_Theme Restaurant,Venue_categorie_Toy / Game Store,Venue_categorie_Trail,Venue_categorie_Train Station,Venue_categorie_Vegetarian / Vegan Restaurant,Venue_categorie_Video Game Store,Venue_categorie_Vietnamese Restaurant,Venue_categorie_Wine Bar,Venue_categorie_Yoga Studio
0,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
venues_grouped = venues_cat_sep.groupby("Postal Code").sum()
venues_grouped["Total"] = venues_grouped.sum(axis=1)
columns = venues_grouped.columns
venues_grouped.head()

Postal Code,Venue_categorie_Afghan Restaurant,Venue_categorie_Airport,Venue_categorie_Airport Food Court,Venue_categorie_Airport Gate,Venue_categorie_Airport Lounge,Venue_categorie_Airport Service,Venue_categorie_Airport Terminal,Venue_categorie_American Restaurant,Venue_categorie_Antique Shop,Venue_categorie_Aquarium,...,Venue_categorie_Theme Restaurant,Venue_categorie_Toy / Game Store,Venue_categorie_Trail,Venue_categorie_Train Station,Venue_categorie_Vegetarian / Vegan Restaurant,Venue_categorie_Video Game Store,Venue_categorie_Vietnamese Restaurant,Venue_categorie_Wine Bar,Venue_categorie_Yoga Studio,Total
M4E,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,4
M4K,0,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,1,43
M4L,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,19
M4M,0,0,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,1,1,37
M4N,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3


In [16]:
for col in columns[:-1]:
    venues_grouped[col] = venues_grouped[col]/venues_grouped["Total"]

venues_grouped.head()

Postal Code,Venue_categorie_Afghan Restaurant,Venue_categorie_Airport,Venue_categorie_Airport Food Court,Venue_categorie_Airport Gate,Venue_categorie_Airport Lounge,Venue_categorie_Airport Service,Venue_categorie_Airport Terminal,Venue_categorie_American Restaurant,Venue_categorie_Antique Shop,Venue_categorie_Aquarium,...,Venue_categorie_Theme Restaurant,Venue_categorie_Toy / Game Store,Venue_categorie_Trail,Venue_categorie_Train Station,Venue_categorie_Vegetarian / Vegan Restaurant,Venue_categorie_Video Game Store,Venue_categorie_Vietnamese Restaurant,Venue_categorie_Wine Bar,Venue_categorie_Yoga Studio,Total
M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,4
M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023256,0.0,0.0,...,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,0.0,0.023256,43
M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19
M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054054,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027027,0.027027,37
M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3


In [17]:
venues_grouped["Total"] = (venues_grouped["Total"]-venues_grouped["Total"].min())/(venues_grouped["Total"].max()-venues_grouped["Total"].min())
venues_grouped.head()

Postal Code,Venue_categorie_Afghan Restaurant,Venue_categorie_Airport,Venue_categorie_Airport Food Court,Venue_categorie_Airport Gate,Venue_categorie_Airport Lounge,Venue_categorie_Airport Service,Venue_categorie_Airport Terminal,Venue_categorie_American Restaurant,Venue_categorie_Antique Shop,Venue_categorie_Aquarium,...,Venue_categorie_Theme Restaurant,Venue_categorie_Toy / Game Store,Venue_categorie_Trail,Venue_categorie_Train Station,Venue_categorie_Vegetarian / Vegan Restaurant,Venue_categorie_Video Game Store,Venue_categorie_Vietnamese Restaurant,Venue_categorie_Wine Bar,Venue_categorie_Yoga Studio,Total
M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.020408
M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023256,0.0,0.0,...,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,0.0,0.023256,0.418367
M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.173469
M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054054,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027027,0.027027,0.357143
M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010204
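
The cells above turn the one-hot rows into per-postal-code category frequencies and then min-max scale the `Total` column to the range [0, 1]. The same arithmetic can be followed on a toy example. This is a minimal sketch with made-up venues (the categories and counts are hypothetical and chosen for easy arithmetic); it uses `pd.crosstab` as a shortcut for the one-hot encoding followed by the grouped sum.

```python
import pandas as pd

# Toy venue list (hypothetical values)
toy_venues = pd.DataFrame({
    "Postal Code": ["M4E", "M4E", "M4E", "M4E", "M4N", "M4N"],
    "Venue_categorie": ["Trail", "Pub", "Pub", "Cafe", "Park", "Cafe"],
})

# Number of venues of each category per postal code
counts = pd.crosstab(toy_venues["Postal Code"], toy_venues["Venue_categorie"])

# Total number of venues per postal code
total = counts.sum(axis=1)

# Share of each category: every row now sums to 1
freq = counts.div(total, axis=0)

# Min-max scaling of the totals; this divides by zero if all totals are equal
total_scaled = (total - total.min()) / (total.max() - total.min())

features = freq.assign(Total=total_scaled)
print(features)
```

Scaling `Total` keeps it on the same order of magnitude as the frequency columns, so it does not dominate the distances computed during clustering.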


## Clustering

Now, it's time for clustering. Because we restricted our research to only four boroughs, it is interesting to see whether a 4-cluster model distinguishes these four boroughs.

In [18]:
from sklearn.cluster import KMeans

k_means = KMeans(init="k-means++", n_clusters = 4, n_init = 15)
k_means.fit(venues_grouped)
k_means_labels = k_means.labels_
k_means_labels

array([0, 0, 0, 0, 3, 0, 0, 0, 3, 0, 3, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       2, 3, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [19]:
venues_grouped["Cluster"] = k_means_labels

In [20]:
column_drop = venues_grouped.columns[:-1]

In [21]:
df_fin = pd.merge(venues_grouped, df_tor, on="Postal Code")

df_fin.drop(columns = column_drop, inplace=True)
df_fin

| | Postal Code | Cluster | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | M4E | 0 | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | M4K | 0 | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 2 | M4L | 0 | East Toronto | India Bazaar, The Beaches West | 43.668999 | -79.315572 |
| 3 | M4M | 0 | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | M4N | 3 | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
| 5 | M4P | 0 | Central Toronto | Davisville North | 43.712751 | -79.390197 |
| 6 | M4R | 0 | Central Toronto | North Toronto West, Lawrence Park | 43.715383 | -79.405678 |
| 7 | M4S | 0 | Central Toronto | Davisville | 43.704324 | -79.38879 |
| 8 | M4T | 3 | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.38316 |
| 9 | M4V | 0 | Central Toronto | Summerhill West, Rathnelly, South Hill, Forest... | 43.686412 | -79.400049 |

We now have our final dataframe and can draw a map to visualize these clusters.

In [22]:
color_list = cm.rainbow(np.linspace(0,1,4))
rainbow = [colors.rgb2hex(i) for i in color_list]

tor_map = folium.Map(location = [43.66, -79.39], zoom_start=12)

for lat, lon, code, bor, neigh, cluster in zip(df_fin["Latitude"], df_fin["Longitude"], df_fin["Postal Code"], df_fin["Borough"], df_fin["Neighbourhood"], df_fin["Cluster"]):
    label = folium.Popup(code +',\n '+ bor+ ',\n'+ neigh, parse_html=True)
    folium.CircleMarker([lat, lon], radius =5, popup = label, color = rainbow[cluster], fill =True).add_to(tor_map)

tor_map

These results seem coherent. For instance, it makes sense that all Downtown neighborhoods share many common services.