# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

This project explores, segments, and clusters the neighborhoods of Toronto based on postal code and borough information. The process involves scraping a Wikipedia page with the BeautifulSoup package, wrangling and cleaning the data, and reading it into a pandas dataframe. The neighborhoods of Downtown Toronto are then clustered with K-Means, and the results are visualized before and after clustering with the Folium library.

#### NOTE: All 3 parts of this project (Web Scraping, Data Pre-Processing, and Data Exploration, Analysis & Clustering) are implemented in this notebook.

**Installing and importing the required libraries**

In [1]:
!pip install -U numpy
!pip install -U pandas
!pip install -U scipy==1.4.1
!pip install -U scikit-learn
!pip install -U imbalanced-learn

Requirement already up-to-date: numpy in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (1.20.3)
Requirement already up-to-date: pandas in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (1.2.4)
Requirement already up-to-date: scipy==1.4.1 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (1.4.1)
Requirement already up-to-date: scikit-learn in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (0.24.2)
Requirement already up-to-date: imbalanced-learn in /opt/conda/envs

In [2]:
!pip install beautifulsoup4
!pip install lxml

import bs4 as bs    # library for beautiful soup object 

import pandas as pd # library for data analysis

!pip install numpy
import numpy as np # library to handle data in vectorized manner

import json # library to handle JSON files
from pandas import json_normalize # transform JSON data into a pandas dataframe (top-level import since pandas 1.0)

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests        # library to handle requests
import urllib.request  # url library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Libraries to display images
from IPython.display import Image
from IPython.core.display import HTML
from IPython.display import display_html

# import k-means for the clustering stage
# (the PyPI package is named scikit-learn; "sklearn" is only the import name)
!pip install scikit-learn
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print("Done! Libraries imported.")

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python-3.7-main

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |      c

### Web Scraping
The Wikipedia page is scraped with Python's BeautifulSoup library: the page is retrieved from its URL and parsed into a BeautifulSoup object.

In [3]:
# Download the page HTML and parse it into a BeautifulSoup object

html = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').read()
soup = bs.BeautifulSoup(html, 'lxml')
soup.title

<title>List of postal codes of Canada: M - Wikipedia</title>

Displaying the first table on the page, both rendered and as raw HTML.

In [4]:
rawTable = str(soup.table)
display_html(rawTable, raw=True)
rawTable

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| M1A Not assigned | M2A Not assigned | M3A North York (Parkwoods) | M4A North York (Victoria Village) | M5A Downtown Toronto (Regent Park / Harbourfront) | M6A North York (Lawrence Manor / Lawrence Heights) | M7A Queen's Park (Ontario Provincial Government) | M8A Not assigned | M9A Etobicoke (Islington Avenue) |
| M1B Scarborough (Malvern / Rouge) | M2B Not assigned | M3B North York (Don Mills) North | M4B East York (Parkview Hill / Woodbine Gardens) | M5B Downtown Toronto (Garden District, Ryerson) | M6B North York (Glencairn) | M7B Not assigned | M8B Not assigned | M9B Etobicoke (West Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale) |
| M1C Scarborough (Rouge Hill / Port Union / Highland Creek) | M2C Not assigned | M3C North York (Don Mills) South (Flemingdon Park) | M4C East York (Woodbine Heights) | M5C Downtown Toronto (St. James Town) | M6C York (Humewood-Cedarvale) | M7C Not assigned | M8C Not assigned | M9C Etobicoke (Eringate / Bloordale Gardens / Old Burnhamthorpe / Markland Wood) |
| M1E Scarborough (Guildwood / Morningside / West Hill) | M2E Not assigned | M3E Not assigned | M4E East Toronto (The Beaches) | M5E Downtown Toronto (Berczy Park) | M6E York (Caledonia-Fairbanks) | M7E Not assigned | M8E Not assigned | M9E Not assigned |
| M1G Scarborough (Woburn) | M2G Not assigned | M3G Not assigned | M4G East York (Leaside) | M5G Downtown Toronto (Central Bay Street) | M6G Downtown Toronto (Christie) | M7G Not assigned | M8G Not assigned | M9G Not assigned |
| M1H Scarborough (Cedarbrae) | M2H North York (Hillcrest Village) | M3H North York (Bathurst Manor / Wilson Heights / Downsview North) | M4H East York (Thorncliffe Park) | M5H Downtown Toronto (Richmond / Adelaide / King) | M6H West Toronto (Dufferin / Dovercourt Village) | M7H Not assigned | M8H Not assigned | M9H Not assigned |
| M1J Scarborough (Scarborough Village) | M2J North York (Fairview / Henry Farm / Oriole) | M3J North York (Northwood Park / York University) | M4J East York East Toronto (The Danforth East) | M5J Downtown Toronto (Harbourfront East / Union Station / Toronto Islands) | M6J West Toronto (Little Portugal / Trinity) | M7J Not assigned | M8J Not assigned | M9J Not assigned |
| M1K Scarborough (Kennedy Park / Ionview / East Birchmount Park) | M2K North York (Bayview Village) | M3K North York (Downsview) East (CFB Toronto) | M4K East Toronto (The Danforth West / Riverdale) | M5K Downtown Toronto (Toronto Dominion Centre / Design Exchange) | M6K West Toronto (Brockton / Parkdale Village / Exhibition Place) | M7K Not assigned | M8K Not assigned | M9K Not assigned |
| M1L Scarborough (Golden Mile / Clairlea / Oakridge) | M2L North York (York Mills / Silver Hills) | M3L North York (Downsview) West | M4L East Toronto (India Bazaar / The Beaches West) | M5L Downtown Toronto (Commerce Court / Victoria Hotel) | M6L North York (North Park / Maple Leaf Park / Upwood Park) | M7L Not assigned | M8L Not assigned | M9L North York (Humber Summit) |
| M1M Scarborough (Cliffside / Cliffcrest / Scarborough Village West) | M2M North York (Willowdale / Newtonbrook) | M3M North York (Downsview) Central | M4M East Toronto (Studio District) | M5M North York (Bedford Park / Lawrence Manor East) | M6M York (Del Ray / Mount Dennis / Keelsdale and Silverthorn) | M7M Not assigned | M8M Not assigned | M9M North York (Humberlea / Emery) |


'<table cellpadding="2" cellspacing="0" rules="all" style="width:100%; border-collapse:collapse; border:1px solid #ccc;">\n<tbody><tr>\n<td style="width:11%; vertical-align:top; color:#ccc;">\n<p><b>M1A</b><br/><span style="font-size:85%;"><i>Not assigned</i></span>\n</p>\n</td>\n<td style="width:11%; vertical-align:top; color:#ccc;">\n<p><b>M2A</b><br/><span style="font-size:85%;"><i>Not assigned</i></span>\n</p>\n</td>\n<td style="width:11%; vertical-align:top;">\n<p><b>M3A</b><br/><span style="font-size:85%;"><a href="/wiki/North_York" title="North York">North York</a><br/>(<a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>)</span>\n</p>\n</td>\n<td style="width:11%; vertical-align:top;">\n<p><b>M4A</b><br/><span style="font-size:85%;"><a href="/wiki/North_York" title="North York">North York</a><br/>(<a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>)</span>\n</p>\n</td>\n<td style="width:11%; vertical-align:top;">\n<p><b>M5A</b><br/><span style="f

### Data Pre-Processing
Parsing the table cells into a list of records, one per assigned postal code.

In [5]:
# Collect one record (dictionary) per assigned postal code
table_contents = []

# Find the table and loop over its cells; each <td> holds one postal code
table = soup.find('table')
for td in table.find_all('td'):
    # Skip postal codes that have no borough assigned
    if td.span.text == 'Not assigned':
        continue

    # Each record has 3 keys: PostalCode, Borough and Neighborhood
    cell = {}

    # The postal code is the first 3 characters of the cell text
    cell['PostalCode'] = td.p.text[:3]

    # Borough is the text before the first '('; Neighborhood is the text inside
    # the parentheses, with ' /' separators turned into commas
    cell['Borough'] = (td.span.text).split('(')[0]
    cell['Neighborhood'] = (((((td.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')

    # Append to the list
    table_contents.append(cell)

table_contents


[{'PostalCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'},
 {'PostalCode': 'M4A',
  'Borough': 'North York',
  'Neighborhood': 'Victoria Village'},
 {'PostalCode': 'M5A',
  'Borough': 'Downtown Toronto',
  'Neighborhood': 'Regent Park, Harbourfront'},
 {'PostalCode': 'M6A',
  'Borough': 'North York',
  'Neighborhood': 'Lawrence Manor, Lawrence Heights'},
 {'PostalCode': 'M7A',
  'Borough': "Queen's Park",
  'Neighborhood': 'Ontario Provincial Government'},
 {'PostalCode': 'M9A',
  'Borough': 'Etobicoke',
  'Neighborhood': 'Islington Avenue'},
 {'PostalCode': 'M1B',
  'Borough': 'Scarborough',
  'Neighborhood': 'Malvern, Rouge'},
 {'PostalCode': 'M3B',
  'Borough': 'North York',
  'Neighborhood': 'Don Mills North'},
 {'PostalCode': 'M4B',
  'Borough': 'East York',
  'Neighborhood': 'Parkview Hill, Woodbine Gardens'},
 {'PostalCode': 'M5B',
  'Borough': 'Downtown Toronto',
  'Neighborhood': 'Garden District, Ryerson'},
 {'PostalCode': 'M6B', 'Borough': 'North York', 'N
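
The string handling in the loop above can be checked in isolation, without fetching the page. The following is a minimal sketch: `split_borough_neighborhood` is a hypothetical helper name that does not appear in the notebook, and the sample strings are assumed to match what `span.text` returns for cells such as M3A, M5A and M3B.

```python
def split_borough_neighborhood(span_text):
    """Split text such as 'North York(Parkwoods)' into (borough, neighborhood).

    Hypothetical helper that mirrors the split/strip/replace chain in the loop.
    """
    # Borough: everything before the first '('
    borough = span_text.split('(')[0]
    # Neighborhood: text after the first '(', with the closing ')' removed
    # and ' /' separators turned into commas
    neighborhood = (span_text.split('(')[1]
                    .strip(')')
                    .replace(' /', ',')
                    .replace(')', ' ')
                    .strip(' '))
    return borough, neighborhood


# Assumed sample inputs, modelled on the cells shown in the table above
samples = [
    'North York(Parkwoods)',
    'Downtown Toronto(Regent Park / Harbourfront)',
    'North York(Don Mills)North',
]
parsed = [split_borough_neighborhood(text) for text in samples]
for borough, neighborhood in parsed:
    print(borough, '|', neighborhood)
```

The third sample shows why the chain includes `replace(')', ' ')`: when text follows the closing parenthesis, `strip(')')` alone leaves the parenthesis in the middle of the string.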

Creating a dataframe with the list of table contents.

In [6]:
# Create a dataframe from the list, then tidy the borough names that were scraped with extra text fused onto them

df = pd.DataFrame(table_contents)
df['Borough'] = df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                          'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                          'EtobicokeNorthwest':'Etobicoke Northwest', 'East YorkEast Toronto':'East York/East Toronto',
                                          'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

df


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto Business | Enclave of M4L |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


__Displaying the shape of the dataframe (number of rows and columns).__

In [7]:
# Rows and Columns of dataframe
df.shape

(103, 3)

### Neighborhood Exploration, Analysis and Clustering.

**Creating a new dataframe that includes the geographical coordinates of each postal code.**

Getting the geographical coordinates of the neighborhoods by importing a CSV file that lists the latitude and longitude of each postal code.

In [8]:
latlong = pd.read_csv('https://cocl.us/Geospatial_data')
latlong.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Create the new dataframe by merging the geospatial table with the existing dataframe.

In [9]:
latlong.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df1 = pd.merge(df, latlong, on='PostalCode')

df1

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 |
| 100 | M7Y | East Toronto Business | Enclave of M4L | 43.662744 | -79.321558 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 |
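
`pd.merge` performs an inner join by default, so a postal code that is missing from either table is dropped without any message. The sketch below shows one way to check for that on toy data. The two small dataframes are stand-ins for `df` and `latlong`, and `M0X` with borough `Nowhere` is a made-up row used only to trigger the unmatched case.

```python
import pandas as pd

# Toy stand-ins for df and latlong; 'M0X' is a made-up code with no coordinates
neighborhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0X'],
    'Borough': ['North York', 'North York', 'Nowhere'],
})
coords = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every neighborhood row, indicator=True adds a '_merge' column,
# and validate='one_to_one' raises an error if a postal code is duplicated on either side
checked = pd.merge(neighborhoods, coords, on='PostalCode', how='left',
                   indicator=True, validate='one_to_one')

# Postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

If `unmatched` is empty for the real data, the inner join used in the notebook loses no rows.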


### Exploring and Visualizing the Neighborhoods in Downtown Toronto.

__Slicing the merged dataframe to create a new dataframe containing only the borough of Downtown Toronto.__

In [10]:
DtT_data = df1[df1['Borough']=='Downtown Toronto'].reset_index(drop=True)
DtT_data

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 1 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 2 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 3 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
| 4 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
| 5 | M6G | Downtown Toronto | Christie | 43.669542 | -79.422564 |
| 6 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650571 | -79.384568 |
| 7 | M5J | Downtown Toronto | Harbourfront East, Union Station, Toronto Islands | 43.640816 | -79.381752 |
| 8 | M5K | Downtown Toronto | Toronto Dominion Centre, Design Exchange | 43.647177 | -79.381576 |
| 9 | M5L | Downtown Toronto | Commerce Court, Victoria Hotel | 43.648198 | -79.379817 |


**Getting the geographical coordinates of Downtown Toronto.**

In [11]:
# Use geopy library to get the latitude and longitude values of Downtown Toronto

address = 'Downtown Toronto'

geolocator = Nominatim(user_agent="dt_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are: {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are: 43.6541737, -79.38081162653639.
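
`geolocator.geocode` returns `None` when the service finds no match, in which case `location.latitude` raises an `AttributeError`. A minimal sketch of a guard follows. `coordinates_or_default` and `fake_geocode` are hypothetical names introduced for illustration, and the fake geocoder stands in for the real one so that the example runs without a network connection.

```python
from types import SimpleNamespace


def coordinates_or_default(geocode, address, default):
    """Return (latitude, longitude) for address, or default when nothing is found.

    geocode is any callable that behaves like geolocator.geocode: it returns
    an object with .latitude and .longitude attributes, or None.
    """
    location = geocode(address)
    if location is None:
        return default
    return (location.latitude, location.longitude)


# Stand-in geocoder for illustration only; it never touches the network
def fake_geocode(address):
    if address == 'Downtown Toronto':
        return SimpleNamespace(latitude=43.6541737, longitude=-79.38081162653639)
    return None


found = coordinates_or_default(fake_geocode, 'Downtown Toronto', default=(0.0, 0.0))
missing = coordinates_or_default(fake_geocode, 'No such place', default=(0.0, 0.0))
print(found, missing)
```

In the notebook, the same helper could be called with `geolocator.geocode` in place of `fake_geocode`.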


**Creating a map of Downtown Toronto to visualize the neighborhoods.**

In [12]:
# create map of Downtown Toronto using latitude and longitude values
map_DtT = folium.Map(location=[latitude, longitude], zoom_start=10)

# add a marker for every neighborhood in df1 (all boroughs, not only Downtown Toronto)
for lat, lng, borough, neighborhood in zip(df1['Latitude'], df1['Longitude'], df1['Borough'], df1['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_DtT)
    
map_DtT

### Clustering the Neighborhoods of Downtown Toronto.

__Using K-Means to cluster the neighborhoods into 5 clusters.__

In [13]:
# set number of clusters
kc = 5

# keep only Latitude and Longitude as the clustering features
DtT_clustering = DtT_data.drop(columns=['PostalCode', 'Borough', 'Neighborhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kc, random_state=0).fit(DtT_clustering)

# check cluster labels generated for the dataframe
kmeans.labels_ 

array([0, 0, 0, 0, 3, 1, 0, 0, 0, 0, 3, 3, 4, 2, 2, 0, 2], dtype=int32)
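
To make the clustering step less of a black box, here is a minimal NumPy sketch of Lloyd's algorithm, the assign-then-update iteration that k-means is built on. It is an illustration under simplifying assumptions, not scikit-learn's implementation: `lloyd_kmeans` is a hypothetical function name, the starting centers are plain random picks, and the four toy points are invented for the example.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Cluster points (an n x d array) into k groups with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its points;
        # a center that attracted no points stays where it is
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Toy data: two clearly separated groups of two points each
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_labels, toy_centers = lloyd_kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centers)
```

As in the notebook, distances here are computed directly on the coordinate values. For latitude and longitude this treats the area as flat, which is a reasonable approximation over a region as small as Downtown Toronto.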

__Adding the cluster labels to the Downtown Toronto dataframe.__

In [14]:
# adding clustering labels
DtT_data.insert(0, 'Cluster Labels', kmeans.labels_)
DtT_data

|   | Cluster Labels | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 1 | 0 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 2 | 0 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 3 | 0 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
| 4 | 3 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
| 5 | 1 | M6G | Downtown Toronto | Christie | 43.669542 | -79.422564 |
| 6 | 0 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650571 | -79.384568 |
| 7 | 0 | M5J | Downtown Toronto | Harbourfront East, Union Station, Toronto Islands | 43.640816 | -79.381752 |
| 8 | 0 | M5K | Downtown Toronto | Toronto Dominion Centre, Design Exchange | 43.647177 | -79.381576 |
| 9 | 0 | M5L | Downtown Toronto | Commerce Court, Victoria Hotel | 43.648198 | -79.379817 |


**Finally, creating a map of Downtown Toronto to visualize the K-Means clusters of the neighborhoods.**

In [15]:
# create map
map_DtT_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kc))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighborhood, cluster in zip(DtT_data['Latitude'], DtT_data['Longitude'], DtT_data['Neighborhood'], DtT_data['Cluster Labels']):
    label = folium.Popup(str(neighborhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # cluster labels run from 0 to kc-1, so they index the palette directly
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_DtT_clusters)
       
map_DtT_clusters
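
The color step boils down to building one color per cluster and looking it up by cluster label. The sketch below shows that mapping on its own. It swaps in the standard library's `colorsys` module in place of Matplotlib's `cm.rainbow` so that it runs without extra packages, and `cluster_palette` is a hypothetical helper name.

```python
import colorsys


def cluster_palette(k):
    """Return k hex colors with evenly spaced hues (a stand-in for cm.rainbow)."""
    palette = []
    for i in range(k):
        # Hue i/k at full saturation and brightness, converted to RGB in [0, 1]
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_palette(5)

# Cluster labels run from 0 to k-1, so each label indexes the palette directly
color_of_cluster = {label: palette[label] for label in range(5)}
print(color_of_cluster)
```

Indexing with the label itself keeps the mapping easy to read: cluster 0 gets the first color, cluster 1 the second, and so on.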