# Segmenting and Clustering Neighborhoods in Toronto

### Importing Libraries
1. BeautifulSoup is imported for web scraping.
2. requests is imported to retrieve the HTML of the Wikipedia page.
3. pandas is imported to convert the data into a DataFrame.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

### Web Scraping
We use the requests module to fetch the page and assign the response text to the source variable, then pass it to BeautifulSoup to find the table with class = "wikitable sortable".

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')
table= soup.find("table", class_="wikitable sortable").text
print(table)



Postcode
Borough
Neighbourhood


M1A
Not assigned
Not assigned


M2A
Not assigned
Not assigned


M3A
North York
Parkwoods


M4A
North York
Victoria Village


M5A
Downtown Toronto
Harbourfront


M6A
North York
Lawrence Heights


M6A
North York
Lawrence Manor


M7A
Downtown Toronto
Queen's Park


M8A
Not assigned
Not assigned


M9A
Queen's Park
Not assigned


M1B
Scarborough
Rouge


M1B
Scarborough
Malvern


M2B
Not assigned
Not assigned


M3B
North York
Don Mills North


M4B
East York
Woodbine Gardens


M4B
East York
Parkview Hill


M5B
Downtown Toronto
Ryerson


M5B
Downtown Toronto
Garden District


M6B
North York
Glencairn


M7B
Not assigned
Not assigned


M8B
Not assigned
Not assigned


M9B
Etobicoke
Cloverdale


M9B
Etobicoke
Islington


M9B
Etobicoke
Martin Grove


M9B
Etobicoke
Princess Gardens


M9B
Etobicoke
West Deane Park


M1C
Scarborough
Highland Creek


M1C
Scarborough
Rouge Hill


M1C
Scarborough
Port Union


M2C
Not assigned
Not assigned


M3C
North York
Flemingdon Par
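
Taking the text of the whole table, as the scraping step above does, only lines up into rows while every cell is non-empty. As a hedged alternative, the sketch below walks a table row by row with find_all. The HTML snippet is a hand-written stand-in for the Wikipedia table (an assumption for illustration), and html.parser is used so that no extra parser package is assumed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (an assumption for illustration).
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table_tag = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table_tag.find_all("tr"):
    # One list per table row; header cells are <th>, data cells are <td>.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, records = rows[0], rows[1:]
print(header)
print(records)
```

Because each row is read as a unit, an empty cell would show up as an empty string in its own row rather than shifting every later value by one position.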

### Data Wrangling
The table is still a single string, so we first have to strip out the blank lines. We split the string on __\n__ into a list and remove the empty elements from the raw table_list. Each row holds 3 elements, so we chunk the list in steps of 3 and convert the chunks into a pandas DataFrame. We then drop the first row, which only repeats the column names we have already set. Lastly we remove the rows whose Borough is Not assigned and reset the index.

In [3]:
table_list = table.split("\n")
table_list[:] = [x for x in table_list if x]  # remove empty elements
chunked = [table_list[i:i + 3] for i in range(0, len(table_list), 3)]  # 3 cells per row
column = table_list[0:3]  # column names
df = pd.DataFrame(chunked, columns=column)
df.drop([0], inplace=True)  # the first row repeats the header
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
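
The split, filter and chunk steps described above can be tried without pandas or a network call. This is a minimal sketch on a hand-written string that imitates the scraped text; the variable names raw, cells, rows and kept are made up for the example.

```python
# Hand-written stand-in for the scraped table text (an assumption for illustration).
raw = (
    "\n\nPostcode\nBorough\nNeighbourhood\n\n\n"
    "M1A\nNot assigned\nNot assigned\n\n\n"
    "M3A\nNorth York\nParkwoods\n"
)

cells = [x for x in raw.split("\n") if x]                  # drop empty strings
rows = [cells[i:i + 3] for i in range(0, len(cells), 3)]   # groups of three cells

header, records = rows[0], rows[1:]                        # first group is the header
kept = [r for r in records if r[1] != "Not assigned"]      # r[1] is the Borough
print(header)
print(kept)
```

Note that this approach relies on every row contributing exactly three non-empty strings; a single empty cell would misalign all following rows.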


#### Combining neighborhoods that have the same postcode and separating them with a comma

In [4]:
df=df.groupby(['Postcode','Borough'], sort = False)['Neighbourhood'].aggregate(lambda x: ', '.join(x)).reset_index()
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
101,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So..."
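
To see what the groupby step does in isolation, here is a small sketch on a hand-made frame with the same three columns. The frame name sample and the result name combined are assumptions for the example.

```python
import pandas as pd

# Small hand-made frame mirroring the columns used above.
sample = pd.DataFrame({
    "Postcode": ["M3A", "M6A", "M6A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor"],
})

# sort=False keeps the postcodes in order of first appearance.
combined = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

The two M6A rows collapse into one row whose Neighbourhood is the comma-separated list, while M3A is left unchanged.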


#### Assigning the borough as the neighbourhood when the neighbourhood is not assigned

In [5]:
df.loc[df['Neighbourhood']=="Not assigned",'Neighbourhood']=df['Borough']

In [6]:
df.shape

(103, 3)
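
The borough fallback can be checked on a two-row frame built from rows that appear in the scraped output. This is a sketch only; the names sample and mask are made up, and the mask is applied on both sides of the assignment to make the row selection explicit.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M7A", "M9A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Queen's Park", "Not assigned"],
})

# Rows whose neighbourhood is missing take the borough name instead.
mask = sample["Neighbourhood"] == "Not assigned"
sample.loc[mask, "Neighbourhood"] = sample.loc[mask, "Borough"]
print(sample)
```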

### Coordinates of Neighbourhood

First we read the CSV file of geospatial coordinates for Toronto, Ontario, which gives the latitude and longitude of each postal code.

In [7]:
coord= pd.read_csv('Geospatial_Coordinates.csv')
coord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We join the two tables with the merge method. The DataFrames are merged with an inner join on the postal code, and the duplicate Postal Code column is then dropped.

In [8]:
df_coord=pd.merge(df, coord, left_on ='Postcode', right_on ='Postal Code', how = 'inner')
df_coord.drop("Postal Code", axis=1, inplace=True)
df_coord.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
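
An inner join silently drops any postcode that has no match in the coordinates file. One way to check for that, sketched here on hand-made frames, is a left join with indicator=True, which adds a _merge column. The frame names left and right and the code "M0Z" are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({"Postcode": ["M3A", "M4A", "M0Z"]})  # "M0Z" is a made-up code
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every postcode; _merge says where each row was found.
checked = pd.merge(left, right, left_on="Postcode", right_on="Postal Code",
                   how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

If the list of missing postcodes is empty for the real data, the inner join used above loses nothing.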


### Clustering of Postal Codes on the Map

First we import folium, a map visualization library, together with numpy, the matplotlib color modules and KMeans, which are used in the later cells.

In [9]:
import numpy as np

# Matplotlib color modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

To initialize the map we center it on Toronto by taking the mean latitude and longitude over all postal codes, and set zoom_start=12.

In [10]:
toronto_map = folium.Map(location=[df_coord.Latitude.mean(), df_coord.Longitude.mean()], zoom_start=12)
toronto_map

We filter the DataFrame down to the rows whose Borough contains __Toronto__ using the str.contains method and reset the index.

In [11]:
df_toronto = df_coord[df_coord["Borough"].str.contains("Toronto")].reset_index(drop=True)
df_toronto

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
2,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259


#### Adding Postal Code on Map

In [12]:
for lat, lng, pcode, borough in zip(df_toronto.Latitude, df_toronto.Longitude, df_toronto.Postcode, df_toronto.Borough):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=f'{pcode},{borough}',
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(toronto_map)

#### Display Map

In [13]:
toronto_map

### Clustering of Neighbourhoods

From the map, the data appears to fall into 4 groups, so we set n_clusters=4 and fit k-means on the toronto_clustering data.

In [14]:
kclusters = 4

toronto_clustering = df_toronto[['Latitude','Longitude']]

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 3, 1, 1, 2, 1, 2])
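
The choice of 4 clusters comes from looking at the map. As a hedged cross-check, the sketch below fits k-means for several values of k and prints the inertia (within-cluster sum of squared distances) for each. It uses only the ten coordinate pairs shown in the df_toronto output above, so the full frame may behave differently; the names coords and inertias are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# The ten coordinate pairs shown in the df_toronto output above.
coords = np.array([
    [43.654260, -79.360636], [43.662301, -79.389494],
    [43.657162, -79.378937], [43.651494, -79.375418],
    [43.676357, -79.293031], [43.644771, -79.373306],
    [43.657952, -79.387383], [43.669542, -79.422564],
    [43.650571, -79.384568], [43.669005, -79.442259],
])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, f"{value:.6f}")
```

A value of k after which the inertia stops dropping sharply is a common heuristic for the number of clusters. Note that this treats latitude and longitude as plain Euclidean coordinates, as the cell above does.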

Insert the cluster labels into the original DataFrame.

In [15]:
df_toronto.insert(0, 'Cluster Labels', kmeans.labels_)
df_toronto

Unnamed: 0,Cluster Labels,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,1,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
2,1,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
3,1,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,3,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,1,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,1,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,2,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,1,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,2,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259


In [16]:
# create map
map_clusters = folium.Map(location=[df_coord.Latitude.mean(), df_coord.Longitude.mean()], zoom_start=12)

# set color scheme for the clusters: one color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighbourhood'], df_toronto['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters - 1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
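
The color scheme can be inspected on its own, without drawing a map. This is a minimal sketch of building one hex color per cluster label, assuming the labels are 0-based as KMeans produces them; the name palette is made up for the example.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# Sample the rainbow colormap at kclusters evenly spaced points and convert to hex.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Label k maps to palette[k]; labels from KMeans run from 0 to kclusters - 1.
for label, hex_color in enumerate(palette):
    print(label, hex_color)
```

Printing the mapping makes it easy to match marker colors on the map to the values in the Cluster Labels column.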