<h1 align=center><font size = 5>Clustering Neighbourhoods in Toronto</font></h1>

In this project, I scraped the table of Canadian postal codes for the city of Toronto from a Wikipedia page, then processed and cleaned the data for clustering. The clustering is carried out with K-Means, and the results are plotted using the Folium library. All neighbourhoods are plotted first; after clustering, the neighbourhoods whose Borough contains the word 'Toronto' are plotted again, coloured by cluster.

### Installing and Importing the required Libraries

In [1]:
!pip install lxml
!pip install beautifulsoup4
import random
import requests
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup
from pandas.io.json import json_normalize
from IPython.display import Image
from geopy.geocoders import Nominatim
from IPython.core.display import HTML 
from IPython.display import display_html

!conda install -c conda-forge folium=0.5.0 --yes
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors


### Obtaining the data from the Wikipedia page for the table of postal codes
The BeautifulSoup library is used for this task.

In [25]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup=BeautifulSoup(source,'lxml')
from IPython.display import display_html
tab = str(soup.table)
display_html(tab,raw=True)

Postal Code,Borough,Neighborhood
M1A,Not assigned,
M2A,Not assigned,
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M8A,Not assigned,
M9A,Etobicoke,Islington Avenue
M1B,Scarborough,"Malvern, Rouge"


### Converting the table from HTML to a pandas DataFrame

In [26]:
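from io import StringIO

# read_html returns a list of DataFrames, one per <table> in the markup.
# Wrapping the string in StringIO avoids the deprecation warning that recent
# pandas releases raise when literal HTML is passed directly.
df = pd.read_html(StringIO(tab))
df_data=df[0]
df_data.columns=['Postalcode','Borough','Neighbourhood']
df_data.head(20)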

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"


### Data cleaning
All postal codes whose Borough is 'Not assigned' are excluded from the table.

In [27]:
df1 = df_data[df_data.Borough != 'Not assigned']
df1

Unnamed: 0,Postalcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [28]:
df1.shape

(103, 3)
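
The filter above keeps the original row numbers, which is why the index in the output jumps from 6 to 8. As a minimal sketch on a made-up three-row frame (the `toy` and `assigned` names are illustrative and not part of the notebook), the same mask can be followed by `reset_index(drop=True)` if a gap-free index is preferred:

```python
import pandas as pd

# Toy stand-in for df_data; three illustrative rows, not the scraped table.
toy = pd.DataFrame({
    "Postalcode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": [None, "Parkwoods", "Regent Park, Harbourfront"],
})

# Keep rows whose Borough is assigned, then renumber the index from 0.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned.shape)
```

Whether to reset the index is a matter of taste here; the later merge on `Postalcode` does not depend on it.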

### Importing the CSV file containing the latitude and longitude for each postal code

In [29]:
path='http://cocl.us/Geospatial_data'
df2 = pd.read_csv(path)
df2.columns=['Postalcode','Latitude','Longitude']
df2.head()

Unnamed: 0,Postalcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Merging the Latitudes and Longitudes with Neighbourhoods, Borough, and Postal codes 

In [30]:
df3 = pd.merge(df1,df2,on='Postalcode')
df3.head(21)

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
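
`pd.merge` defaults to an inner join, so a postal code with no coordinates would be dropped without any warning. The sketch below shows one way to check for that on toy data. The `left` and `right` frames and the postal code `M9Z` are made up for illustration; `how`, `indicator` and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy frames standing in for df1 and df2. "M9Z" is a hypothetical code
# that has no coordinates, included only to show an unmatched row.
left = pd.DataFrame({
    "Postalcode": ["M3A", "M1B", "M9Z"],
    "Borough": ["North York", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "Postalcode": ["M1B", "M3A"],
    "Latitude": [43.806686, 43.753259],
    "Longitude": [-79.194353, -79.329656],
})

# how="left" keeps every row of `left`; indicator=True adds a `_merge`
# column recording whether a match was found; validate raises an error
# if a postal code is repeated on either side.
checked = pd.merge(left, right, on="Postalcode", how="left",
                   indicator=True, validate="one_to_one")
unmatched = checked.loc[checked["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that the inner join used in the notebook loses no rows.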


### Visualizing all the Neighbourhoods using Folium

In [31]:
map_toronto = folium.Map(location=[43.654260,-79.360636],zoom_start=10)

for lat,lng,borough,neighbourhood in zip(df3['Latitude'],df3['Longitude'],df3['Borough'],df3['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto
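
The map above is centred on hard-coded coordinates, which match the M5A row of the merged table. As an alternative sketch, the centre could be derived from the data itself. The `coords` frame below copies three rows from the table above for illustration, and `centre` is an illustrative name.

```python
import pandas as pd

# Coordinates copied from the first rows of the merged table (M3A, M4A, M5A).
coords = pd.DataFrame({
    "Latitude": [43.753259, 43.725882, 43.654260],
    "Longitude": [-79.329656, -79.315572, -79.360636],
})

# Centre the map on the mean position of the plotted points.
centre = [coords["Latitude"].mean(), coords["Longitude"].mean()]
print(centre)
# `centre` could then be passed as folium.Map(location=centre, zoom_start=10).
```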

### K-Means clustering of the neighbourhoods

In [43]:
k=5
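# Cluster on the coordinates only, so drop the non-numeric columns.
# (Passing the axis positionally, as in drop([...], 1), fails in pandas 2.x.)
toronto_clustering = df3.drop(columns=['Postalcode','Borough','Neighbourhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
# Attach each row's cluster label as the first column.
df3.insert(0, 'Cluster Labels', kmeans.labels_)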

In [44]:
df3

Unnamed: 0,Cluster Labels,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,4,M3A,North York,Parkwoods,43.753259,-79.329656
1,4,M4A,North York,Victoria Village,43.725882,-79.315572
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,2,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,1,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,4,M3B,North York,Don Mills,43.745906,-79.352188
8,4,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
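
To make clearer what the `KMeans` call does with the latitude and longitude columns, here is a simplified NumPy sketch of the underlying idea (Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The function name `lloyd_kmeans` and the toy points are assumptions for illustration. This is not scikit-learn's implementation, which by default uses k-means++ initialisation among other refinements.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep the old
        # centroid if a cluster happens to be empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups of (latitude, longitude) pairs; values are made up.
pts = np.array([[43.65, -79.38], [43.66, -79.39], [43.65, -79.37],
                [43.80, -79.20], [43.79, -79.19], [43.81, -79.21]])
labels, centroids = lloyd_kmeans(pts, k=2)
print(labels)
```

Both this sketch and the notebook treat degrees of latitude and longitude as plain Euclidean coordinates. At Toronto's latitude a degree of longitude covers less ground than a degree of latitude, so the resulting clusters are an approximation of geographic proximity.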


### Keeping only the Neighbourhoods whose Borough contains the word 'Toronto'

In [45]:
df4=df3[df3.Borough.str.contains("Toronto",case=False)]
df4

Unnamed: 0,Cluster Labels,Postalcode,Borough,Neighbourhood,Latitude,Longitude
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,2,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,4,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,2,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,2,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,2,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,2,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,2,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
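
A small sketch of how the `str.contains` filter behaves, using a made-up Series (`boroughs` and `mask` are illustrative names). The `na=False` argument is an addition not used in the notebook: it makes missing values count as non-matches, so the mask stays purely boolean.

```python
import pandas as pd

# Illustrative values, including a missing entry and a lower-case variant.
boroughs = pd.Series(["Downtown Toronto", "North York", "East Toronto",
                      None, "west toronto"])

# case=False ignores letter case; na=False treats missing values as non-matches.
mask = boroughs.str.contains("Toronto", case=False, na=False)
print(mask.tolist())
```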


### Visualizing clusters within Toronto using Folium

In [52]:
map_clusters = folium.Map(location=[43.654260,-79.360636],zoom_start=12)

# One colour per cluster, evenly spaced along the rainbow colormap.
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, neighbourhood, cluster in zip(df4['Latitude'], df4['Longitude'], df4['Neighbourhood'], df4['Cluster Labels']):
    label = folium.Popup('Cluster ' + str(cluster), parse_html=True)
    # Cluster labels run from 0 to k-1, so they index the palette directly.
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters