# Segmenting and Clustering Neighborhoods in Toronto

## 1. Introduction
In this notebook, we explore, segment, and cluster the neighborhoods of the city of Toronto. First, we need to scrape the necessary neighborhood data.

## 2. Part I: Web-scraping for neighborhood information of Toronto

In [1]:
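# html5lib is the parser passed to BeautifulSoup below, so it must be installed too
!pip install beautifulsoup4 html5lib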

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup



In [2]:
# Obtain the contents of the Wikipedia page in text format
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
wiki_data = requests.get(url).text


# Create a BeautifulSoup object and find the tables
wiki_soup = BeautifulSoup(wiki_data, "html5lib")
tables = wiki_soup.find_all("table")

We are only interested in the first table on this Wikipedia page, which contains the postal codes of the neighborhoods in Toronto's boroughs.

In [3]:
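# Read the first table cell by cell and convert it to a dataframe
table_contents = []
for td in tables[0].find_all('td'):
    # skip postal codes that are not assigned to any borough
    if td.span.text == 'Not assigned':
        continue
    cell = {}
    # the postal code is the first three characters of the cell
    cell['PostalCode'] = td.p.text[:3]
    # the borough precedes the opening parenthesis
    cell['Borough'] = td.span.text.split('(')[0]
    # the neighborhoods sit inside the parentheses, separated by ' /'
    cell['Neighborhood'] = (td.span.text.split('(')[1]
                            .strip(')').replace(' /', ',').replace(')', ' ').strip(' '))
    table_contents.append(cell)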

# print(table_contents)
df = pd.DataFrame(table_contents)
df['Borough'] = df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
                                       'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
                                       'EtobicokeNorthwest': 'Etobicoke Northwest',
                                       'East YorkEast Toronto': 'East York/East Toronto',
                                       'MississaugaCanada Post Gateway Processing Centre': 'Mississauga'})
df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
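
The string handling in the parsing loop above can be tried in isolation. The following is a minimal sketch, assuming each cell's text has the form `Borough(Neighborhood / Neighborhood)`; the helper name `parse_cell` and the sample strings are hypothetical, and `str.partition` is used in place of `split('(')[1]` so that a cell without parentheses does not raise an error.

```python
def parse_cell(postal_text, span_text):
    """Split one table cell into postal code, borough, and neighborhood.

    Hypothetical helper: the input format is an assumption based on the
    parsing loop in the notebook, not a guarantee about the Wikipedia page.
    """
    # unassigned postal codes carry no borough information
    if span_text == 'Not assigned':
        return None
    # everything before the first '(' is the borough, the rest the neighborhoods
    borough, _, rest = span_text.partition('(')
    neighborhood = rest.strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    return {'PostalCode': postal_text[:3],
            'Borough': borough,
            'Neighborhood': neighborhood}


# illustrative inputs only
print(parse_cell('M5A', 'Downtown Toronto(Regent Park / Harbourfront)'))
print(parse_cell('M1A', 'Not assigned'))
```

Keeping the parsing in a small function makes it easy to check individual cells before building the dataframe.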


In [4]:
# The number of rows & columns in the dataframe
df.shape

(103, 3)

## 3. Part II: Ascertain geographical information

Since Google's geocoding API is no longer free to use, we use the geocoder library with its ArcGIS provider as an alternative.

In [5]:
!pip install geocoder
import geocoder



In [6]:
# initialize the latitude and longitude for each neighborhood
latitude = np.zeros(df.shape[0])
longitude = np.zeros_like(latitude)

for i, postal_code in enumerate(df['PostalCode'].tolist()):
    # initialize coordinates
    latlng_neigh = None

    # loop until you get the coordinates
    while(latlng_neigh is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        # g = geocoder.bing('{}, Toronto, Ontario'.format(postal_code))
        # g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        latlng_neigh = g.latlng

    latitude[i] = latlng_neigh[0]
    longitude[i] = latlng_neigh[1]
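
The loop above keeps querying until coordinates come back, so it never finishes if a postal code cannot be resolved. One possible variant caps the number of attempts. This is only a sketch: `geocode_with_retry`, `max_attempts`, and `delay_seconds` are hypothetical names, and the lookup function is passed in so the helper can be tried without network access.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable returning [lat, lng] or None (hypothetical
    interface chosen for this sketch).
    """
    for attempt in range(1, max_attempts + 1):
        latlng = lookup(query)
        if latlng is not None:
            return latlng
        # wait before the next attempt, but not after the last one
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(
        'no coordinates for {!r} after {} attempts'.format(query, max_attempts))
```

With the calls already used in this notebook, the lookup could be `lambda q: geocoder.arcgis(q).latlng`, and the query `'{}, Toronto, Ontario'.format(postal_code)`.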

In [7]:
# attach latitude and longitude information to the existing dataframe
df['Latitude'] = latitude
df['Longitude'] = longitude

df.head(10)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.75245 | -79.32991 |
| 1 | M4A | North York | Victoria Village | 43.73057 | -79.31306 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65512 | -79.36264 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.72327 | -79.45042 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.66253 | -79.39188 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.66263 | -79.52831 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.81139 | -79.19662 |
| 7 | M3B | North York | Don Mills North | 43.74923 | -79.36186 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.70718 | -79.31192 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.65739 | -79.37804 |


## 4. Part III: Explore and cluster the neighborhoods in Toronto

In this section, we apply the k-means algorithm to cluster the neighborhoods by their latitude and longitude.

In [8]:
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
from sklearn.cluster import KMeans

First, let us look at the geographic distribution of all neighborhoods in Toronto.

In [9]:
# obtain the latitude and longitude of Toronto
address = 'Toronto, Ontario'
latlng_toronto = None
while latlng_toronto is None:
    g = geocoder.arcgis(address)
    latlng_toronto = g.latlng

print('The geographic coordinates of Toronto are latitude {} and longitude {}.'.format(latlng_toronto[0], latlng_toronto[1]))

The geographic coordinates of Toronto are latitude 43.648690000000045 and longitude -79.38543999999996.


In [10]:
map_Toronto = folium.Map(location=latlng_toronto, zoom_start=10)

# add markers to map
for lat, lng, borough, neigh in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label_text = '{}, {}'.format(neigh, borough)
    label = folium.Popup(label_text, parse_html=True)
    # parse_html belongs to folium.Popup above, not to CircleMarker
    folium.CircleMarker([lat, lng],
                        radius=5,
                        popup=label,
                        color='blue',
                        fill=True,
                        fill_color='#3186cc',
                        fill_opacity=0.7).add_to(map_Toronto)

map_Toronto

In [11]:
# number of clusters
k = 8

# drop unnecessary columns
df_clustering = df.drop(['PostalCode', 'Borough', 'Neighborhood'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 4, 1, 4, 6, 7, 5, 2, 4], dtype=int32)
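
To make clear what the clustering step does with the two coordinate columns, here is a minimal pure-Python sketch of the basic k-means iteration (Lloyd's algorithm). It is not scikit-learn's implementation: `kmeans_2d` is a hypothetical helper, it picks its starting centroids at random from the data rather than with k-means++, and it treats latitude and longitude as plain Euclidean coordinates, as the notebook does.

```python
import math
import random


def kmeans_2d(points, k, n_iter=100, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # start from k distinct data points chosen at random
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # assignment step: each point joins its nearest centroid
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # stop once the assignment no longer changes
        if new_labels == labels:
            break
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels, centroids


# a few (latitude, longitude) pairs taken from the table above
sample = [(43.75245, -79.32991), (43.73057, -79.31306),
          (43.65512, -79.36264), (43.66253, -79.39188)]
labels, centroids = kmeans_2d(sample, k=2)
print(labels)
```

Note that the label numbers themselves are arbitrary: they identify groups but carry no ordering, and a different seed can permute them.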

Let's look at the results.

In [12]:
# attach clustering labels
df['Cluster_Label'] = kmeans.labels_
df.head(10)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude | Cluster_Label |
|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.75245 | -79.32991 | 0 |
| 1 | M4A | North York | Victoria Village | 43.73057 | -79.31306 | 2 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65512 | -79.36264 | 4 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.72327 | -79.45042 | 1 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.66253 | -79.39188 | 4 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.66263 | -79.52831 | 6 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.81139 | -79.19662 | 7 |
| 7 | M3B | North York | Don Mills North | 43.74923 | -79.36186 | 5 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.70718 | -79.31192 | 2 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.65739 | -79.37804 | 4 |


In [13]:
# visualize the clusters
map_clusters = folium.Map(location=latlng_toronto, zoom_start=10)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lng, poi, cluster in zip(df['Latitude'], df['Longitude'], df['Neighborhood'], df['Cluster_Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # index the palette directly by the cluster label (0 to k-1)
    folium.CircleMarker([lat, lng],
                        radius=5,
                        popup=label,
                        color=rainbow[cluster],
                        fill=True,
                        fill_color=rainbow[cluster],
                        fill_opacity=0.7).add_to(map_clusters)

map_clusters
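
The map above only needs one distinct color per cluster label. As a rough illustration of that lookup, the sketch below builds a palette with the standard-library `colorsys` module instead of the matplotlib colormap used in the notebook; `cluster_palette` is a hypothetical helper and its colors will differ from those of `cm.rainbow`.

```python
import colorsys


def cluster_palette(k):
    """Return k hex colors with evenly spaced hues, one per label 0..k-1."""
    palette = []
    for i in range(k):
        # full saturation and brightness, hue stepped around the color wheel
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_palette(8)
# the color for a point is looked up by its cluster label
print(palette[0], palette[4])
```

Indexing the palette with the label itself keeps the mapping easy to read back from the map: label 0 always gets the first color.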