# Segmenting and Clustering Neighbourhoods in Toronto

## Assignment Description

In this project, the neighbourhoods in the city of Toronto are grouped into clusters by completing the following tasks:
1. Scrape a Wikipedia page that lists the neighbourhoods of Toronto, clean the data, and place it in a dataframe.

2. Retrieve the geographical coordinates for each postal code from a CSV file and merge them into the original dataframe.

3. Cluster the data with the k-means algorithm and visualize the clusters on a map.


**Installing and importing Libraries**

In [1]:
!pip install beautifulsoup4
!pip install lxml
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

import requests # library to handle HTTP requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images and HTML
from IPython.display import Image, HTML, display_html

# transforming a JSON file into a pandas dataframe
# (top-level import; pandas.io.json.json_normalize is deprecated)
from pandas import json_normalize

import folium # plotting library
from bs4 import BeautifulSoup # HTML parsing
from sklearn.cluster import KMeans # k-means clustering
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Folium installed')
print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Folium installed
Libraries imported.


**Web scraping and cleaning the data**

The Wikipedia page was scraped using BeautifulSoup. The data was wrangled as follows. Cells whose borough is "Not assigned" were ignored. Where one postal code covers several neighbourhoods, they were combined into one row with the neighbourhoods separated by commas. Where a cell has a borough but its neighbourhood is "Not assigned", the neighbourhood was set to the borough. The result was then transformed into the following pandas dataframe.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
data  = requests.get(url).text

In [4]:
soup = BeautifulSoup(data,"lxml")  # create a soup object using the variable 'data'

In [32]:
# Transform the table data into a dataframe
table_contents = []  # one dict per postal code cell
table = soup.find('table')  # first table on the page holds the postal codes
for row in table.find_all('td'):
    if row.span.text == 'Not assigned':
        continue  # skip postal codes with no borough assigned
    cell = {}
    cell['PostalCode'] = row.p.text[:3]  # first three characters are the postal code
    cell['Borough'] = row.span.text.split('(')[0]  # text before the parenthesis
    # text inside the parentheses, with ' /' separators turned into commas
    cell['Neighbourhood'] = (row.span.text.split('(')[1]
                             .strip(')')
                             .replace(' /', ',')
                             .replace(')', ' ')
                             .strip(' '))
    table_contents.append(cell)

df = pd.DataFrame(table_contents)
# tidy borough names where extra cell text was fused onto the borough
df['Borough'] = df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga'})
df.head()

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
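
The string handling in the loop above can be tried in isolation. The following is a minimal sketch, assuming the text of each cell's `span` has the form `Borough(Neighbourhood / Neighbourhood)`. The helper `split_cell` is hypothetical and is not part of the notebook.

```python
def split_cell(span_text):
    """Split 'Borough(N1 / N2)' into (borough, 'N1, N2').

    Hypothetical helper for illustration; assumes at most one
    parenthesised group in the text.
    """
    # Everything before the first '(' is the borough, the rest the neighbourhoods
    borough, _, rest = span_text.partition('(')
    # Drop the closing parenthesis and turn ' /' separators into commas
    neighbourhood = rest.rstrip(')').replace(' /', ',').strip()
    return borough.strip(), neighbourhood


example = 'Downtown Toronto(Regent Park / Harbourfront)'
borough, neighbourhood = split_cell(example)
print(borough)        # Downtown Toronto
print(neighbourhood)  # Regent Park, Harbourfront
```

Using `partition` rather than indexing the result of `split` means a cell without parentheses yields an empty neighbourhood instead of raising an `IndexError`.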


In [23]:
df.shape

(103, 3)
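
The three cleaning rules described at the start of this section can also be expressed directly in pandas. This is a sketch on made-up rows, assuming a layout with one row per neighbourhood. The frame `raw` and its values are illustrative only and are not the scraped data.

```python
import pandas as pd

# Toy rows (made up for illustration), one row per neighbourhood
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M9Z'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', 'Example Borough'],
    'Neighbourhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})

# Rule 1: ignore rows whose borough is not assigned
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# Rule 3: an unassigned neighbourhood takes the name of its borough
unassigned = clean['Neighbourhood'] == 'Not assigned'
clean.loc[unassigned, 'Neighbourhood'] = clean.loc[unassigned, 'Borough']

# Rule 2: combine neighbourhoods that share a postal code into one row
clean = (clean.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
              .agg(', '.join)
              .reset_index())
print(clean)
```

Rule 3 is applied before rule 2 here so that the filled-in names take part in the grouping.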

**Latitude and longitude coordinates**

Read the CSV file containing the latitude and longitude coordinates for each postal code and place it into a dataframe.

In [33]:
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon.head()


| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


**Merge both dataframes on PostalCode**

In [34]:
lat_lon.rename(columns={'Postal Code':'PostalCode'},inplace=True)

In [35]:
df = pd.merge(df,lat_lon,on='PostalCode')
df.head()

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
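
`pd.merge` performs an inner join by default, so a postal code missing from either dataframe is dropped without warning. The sketch below shows one way to check for that, using a left join with the `indicator` and `validate` options. The toy frames are illustrative, and the postal code `M9Z` and its borough are made up.

```python
import pandas as pd

# Toy frames mirroring the two dataframes above; 'M9Z' is a made-up code
neighbourhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coords = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Latitude': [43.753259, 43.725882, 43.654260],
    'Longitude': [-79.329656, -79.315572, -79.360636],
})

# A left merge keeps every neighbourhood row; `indicator` adds a `_merge`
# column saying where each row came from, and `validate` raises if a
# postal code is duplicated on either side
merged = pd.merge(neighbourhoods, coords, on='PostalCode', how='left',
                  indicator=True, validate='one_to_one')

# Postal codes that found no coordinates
missing = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)  # ['M9Z']
```

An empty `missing` list would indicate that the inner join used above loses no neighbourhood rows.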


**Keep the rows whose borough contains Toronto**

In [36]:
df = df[df['Borough'].str.contains('Toronto',regex=False)]
df.head()

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 20 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |


**Visualizing all the boroughs from the dataframe using Folium**

In [11]:
map_toronto = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# add one circle marker per row, labelled "neighbourhood, borough"
for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
map_toronto
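
The map above is centred on fixed coordinates. As an alternative, the centre could be derived from the data itself. This is a minimal sketch, assuming the mean of the marker coordinates is an acceptable centre. The frame `points` holds three rows copied from the dataframe above.

```python
import pandas as pd

# Three rows copied from the Toronto dataframe above
points = pd.DataFrame({
    'Latitude': [43.654260, 43.657162, 43.651494],
    'Longitude': [-79.360636, -79.378937, -79.375418],
})

# Mean of the coordinates as an approximate map centre
centre = [points['Latitude'].mean(), points['Longitude'].mean()]
print([round(c, 4) for c in centre])
```

The resulting list could be passed as `location` to `folium.Map` in place of the fixed values.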

**Cluster the neighbourhoods using k-means clustering**

In [37]:
k = 4  # number of clusters
# cluster on latitude and longitude only
toronto_clustering = df.drop(columns=['PostalCode', 'Borough', 'Neighbourhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
kmeans.labels_

array([3, 3, 3, 0, 3, 3, 1, 3, 1, 0, 3, 1, 0, 3, 1, 0, 3, 0, 2, 2, 2, 2,
       1, 2, 3, 1, 2, 3, 1, 2, 3, 2, 3, 3, 3, 3, 3, 3, 0])

In [38]:
df.insert(2, 'Cluster Labels', kmeans.labels_)
df

| | PostalCode | Borough | Cluster Labels | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | 3 | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 9 | M5B | Downtown Toronto | 3 | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | 3 | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | 0 | The Beaches | 43.676357 | -79.293031 |
| 20 | M5E | Downtown Toronto | 3 | Berczy Park | 43.644771 | -79.373306 |
| 24 | M5G | Downtown Toronto | 3 | Central Bay Street | 43.657952 | -79.387383 |
| 25 | M6G | Downtown Toronto | 1 | Christie | 43.669542 | -79.422564 |
| 30 | M5H | Downtown Toronto | 3 | Richmond, Adelaide, King | 43.650571 | -79.384568 |
| 31 | M6H | West Toronto | 1 | Dufferin, Dovercourt Village | 43.669005 | -79.442259 |
| 35 | M4J | East York/East Toronto | 0 | The Danforth East | 43.685347 | -79.338106 |


**Visualizing all the Clusters**

In [40]:
# create map
map_clusters = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighbourhood, cluster in zip(df['Latitude'], df['Longitude'], df['Neighbourhood'], df['Cluster Labels']):
    label = folium.Popup('{} Cluster {}'.format(neighbourhood, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to k-1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
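
The colour scheme in the cell above amounts to sampling one colour per cluster from matplotlib's rainbow colormap. The sketch below builds that palette on its own, assuming the same `k = 4`. The name `palette` is illustrative.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 4  # same number of clusters as above

# Sample k evenly spaced colours from the rainbow colormap
# and convert each RGBA value to a hex string
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# k-means labels run from 0 to k-1, so a label can index the palette directly
for label, hex_colour in enumerate(palette):
    print(label, hex_colour)
```

Each cluster label maps to a distinct hex colour that Folium accepts for `color` and `fill_color`.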