# Segmenting and Clustering Neighborhoods in Toronto

**<p style='text-align: justify;'> This notebook presents the first Capstone project of the *Applied Data Science Capstone* course: segmenting and clustering neighborhoods in Toronto. The Toronto neighborhoods and their postal codes are obtained by scraping the Wikipedia page that lists them. - May 2021 </p>**

First, scrape the Wikipedia page *List of postal codes of Canada: M* (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) using the Beautiful Soup library, and store the Postal Codes, Boroughs and Neighborhoods in a DataFrame.

In [1]:
#Import required libraries
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium # map rendering library

In [2]:
#Download the Wikipedia page and parse it with the Beautiful Soup package
import requests
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
page.raise_for_status()  # stop early if the download failed

# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')

In [3]:
# Search the table for Postal Code, Borough and Neighbourhood and store the data in a dataframe
table = soup.find('table')
postal_code = []
borough = []
neighborhood = []
for tr in table.find_all('tr'):
    for td_row in tr.find_all('td'):
        text = td_row.text.strip()
        postal_code.append(text[0:3])
        mid_car = text.find('(')
        end_car = text.find(')')
        if mid_car == -1:
            # No parentheses: the cell reads 'Not assigned' and lists no neighborhoods
            borough.append(text[3:])
            neighborhood.append('')
        else:
            borough.append(text[3:mid_car])
            neighborhood.append(text[mid_car+1:end_car])

data = {'Postal Code': postal_code, 'Borough': borough, 'Neighborhood': neighborhood}
df = pd.DataFrame(data)
# Drop postal codes that have no borough assigned
df = df[df['Borough'] != 'Not assigned']
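The slicing logic in this cell can be checked without downloading the page. The following is a minimal sketch, assuming each cell's text has the layout the loop implies: a three-character postal code, then the borough, then an optional neighborhood list in parentheses. The helper name `parse_cell` and the sample strings are illustrative and not part of the notebook.

```python
def parse_cell(text):
    """Split one table cell into (postal_code, borough, neighborhood).

    Assumed layout: 3-character postal code, borough name, then an
    optional neighborhood list in parentheses.
    """
    text = text.strip()
    postal_code = text[:3]
    open_paren = text.find('(')
    if open_paren == -1:
        # No parentheses: there is no neighborhood list in this cell
        return postal_code, text[3:].strip(), ''
    close_paren = text.find(')', open_paren)
    if close_paren == -1:
        # Unbalanced parentheses: keep everything after the opening one
        close_paren = len(text)
    borough = text[3:open_paren].strip()
    neighborhood = text[open_paren + 1:close_paren].strip()
    return postal_code, borough, neighborhood


# Sample strings written by hand to mimic the assumed cell layout
print(parse_cell('M5ADowntown Toronto(Regent Park / Harbourfront)'))
print(parse_cell('M1ANot assigned'))
```

Handling the missing-parenthesis case explicitly avoids relying on `find` returning -1, which would otherwise cut the last character off the borough name.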

In [4]:
#Preview data obtained
df.head(103)

,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Queen's Park,Ontario Provincial Government
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge
11,M3B,North York,Don Mills
12,M4B,East York,Parkview Hill / Woodbine Gardens
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [5]:
#Fix errors in Borough names; the row labels below are specific to this scrape of the table
df.at[57,'Borough']='East Toronto'
df.at[114,'Borough']='Mississauga'
df.at[148,'Borough']='Downtown Toronto'
df.at[168,'Borough']='East Toronto'
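Row labels such as 57 or 114 depend on the order of the scraped table, so they can point at different rows if the page changes. A minimal sketch of an alternative keys the corrections by postal code instead. The frame `df_demo`, the misspelled value, and the `corrections` dictionary are placeholders made up for illustration; they are not the notebook's actual corrections.

```python
import pandas as pd

# Toy frame standing in for the scraped data (the typo is invented for the demo)
df_demo = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North Yrk', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Regent Park / Harbourfront'],
})

# Hypothetical corrections, keyed by postal code rather than by row label
corrections = {'M4A': 'North York'}

# Overwrite the borough only for the postal codes listed in `corrections`
mask = df_demo['Postal Code'].isin(list(corrections))
df_demo.loc[mask, 'Borough'] = df_demo.loc[mask, 'Postal Code'].map(corrections)

print(df_demo)
```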

In [6]:
#Set Postal Code as the index and preview results
df.set_index('Postal Code', inplace=True)
df.head()

Postal Code,Borough,Neighborhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Regent Park / Harbourfront
M6A,North York,Lawrence Manor / Lawrence Heights
M7A,Queen's Park,Ontario Provincial Government


In [7]:
print('The dataframe has {0} rows.'.format(df.shape[0]))

The dataframe has 103 rows.


Then, obtain the latitude and longitude coordinates of each Postal Code from the provided Geospatial dataset (the Geocoder Python package did not work).

In [8]:
#Load latitude and longitude data from the CSV file and merge it with the postal code information
coord = pd.read_csv("Geospatial_Coordinates.csv") 
toronto = pd.merge(df, coord, on='Postal Code', how='outer')
toronto.head(103)

,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,Parkview Hill / Woodbine Gardens,43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
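An outer merge keeps postal codes that appear in only one of the two tables and fills the missing side with NaN. A minimal sketch of how such rows could be detected uses the `indicator=True` option of `pd.merge`. The two small frames are toy data: the values are copied from the preview above, but one code is deliberately left out of each side to show the effect.

```python
import pandas as pd

# Toy inputs: 'M5A' has no coordinates and 'M1B' has no borough entry
neighborhoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M1B'],
    'Latitude': [43.753259, 43.725882, 43.806686],
    'Longitude': [-79.329656, -79.315572, -79.194353],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(neighborhoods, coordinates, on='Postal Code',
                  how='outer', indicator=True)

# Rows that found no partner in the other table
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['Postal Code', '_merge']])
```

If `unmatched` is empty for the real data, the outer merge behaves like an inner merge and no NaN values are introduced.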


In [9]:
#Learn about the dataset of Toronto's neighborhoods
print('The dataframe has {} boroughs and {} neighborhoods.'.format(len(toronto['Borough'].unique()),toronto.shape[0]))

The dataframe has 12 boroughs and 103 neighborhoods.


In [10]:
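#Keep only the boroughs whose name contains the word 'Toronto'
#na=False skips rows without a borough; .copy() avoids SettingWithCopyWarning when a column is added later
toronto = toronto[toronto['Borough'].str.contains('Toronto', na=False)].copy()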
print('The dataframe now has {} boroughs and {} neighborhoods.'.format(len(toronto['Borough'].unique()),toronto.shape[0]))
print(toronto['Borough'].value_counts())
toronto.head()

The dataframe now has 4 boroughs and 39 neighborhoods.
Downtown Toronto    18
Central Toronto      9
East Toronto         6
West Toronto         6
Name: Borough, dtype: int64


,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


The Folium package is then used to visualize the neighborhoods of Toronto on the city map. 

In [11]:
from geopy.geocoders import Nominatim

In [12]:
#Get Toronto coordinates first
address = 'Toronto'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {0}, {1}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [13]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto['Latitude'], toronto['Longitude'], toronto['Borough'], toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Group the neighborhoods of Toronto by their 4 main boroughs (Downtown, Central, West and East Toronto), assigning each borough a cluster label from 1 to 4.

In [14]:
# Assign each borough a numeric cluster label
cluster_labels = {'Downtown Toronto': 1, 'Central Toronto': 2, 'West Toronto': 3, 'East Toronto': 4}
toronto['Cluster'] = toronto['Borough'].map(cluster_labels)
toronto.head(20)

,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636,1
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,4
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,1
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564,1
30,M5H,Downtown Toronto,Richmond / Adelaide / King,43.650571,-79.384568,1
31,M6H,West Toronto,Dufferin / Dovercourt Village,43.669005,-79.442259,3
35,M4J,East Toronto,The Danforth East,43.685347,-79.338106,4
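The cell above assigns labels directly from the borough names. For comparison, a minimal sketch of coordinate-based clustering with the `KMeans` class imported at the top of the notebook might look like the following. This is not the notebook's method: the number of clusters `k`, the `random_state`, and the use of only the ten previewed rows are assumptions made for illustration, and latitude and longitude are treated as plain Euclidean coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude pairs copied from the ten rows previewed above
coords = np.array([
    [43.654260, -79.360636],  # Regent Park / Harbourfront
    [43.657162, -79.378937],  # Garden District, Ryerson
    [43.651494, -79.375418],  # St. James Town
    [43.676357, -79.293031],  # The Beaches
    [43.644771, -79.373306],  # Berczy Park
    [43.657952, -79.387383],  # Central Bay Street
    [43.669542, -79.422564],  # Christie
    [43.650571, -79.384568],  # Richmond / Adelaide / King
    [43.669005, -79.442259],  # Dufferin / Dovercourt Village
    [43.685347, -79.338106],  # The Danforth East
])

k = 3  # assumed number of clusters, chosen only for this illustration
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)

# Shift labels to start at 1, matching the 1-based labels used in the notebook
geo_labels = kmeans.labels_ + 1
print(geo_labels)
```

Labels produced this way reflect geographic proximity only, so they need not coincide with the borough boundaries.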


Finally, let's recreate the Toronto map with the neighborhoods colored by cluster.

In [15]:
kclusters=len(toronto.Cluster.unique())

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto['Latitude'], toronto['Longitude'], toronto['Neighborhood'], toronto['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

The map of Toronto with the clustered boroughs is displayed! 