## Clustering neighborhoods in the city of Toronto 

In this notebook, Toronto's neighborhoods are analysed based on their postal code and borough information. The analysis covers three tasks: exploring, segmenting, and clustering Toronto's neighborhoods.


The following steps are implemented:

1. Installations and Imports
2. Data Collection (Web Scraping)
3. Pre-processing
   - 3.1. Cleaning
   - 3.2. Adding Location Data
4. Exploring Data  
5. Segmenting and Clustering 


NB: The maps generated in this notebook might not be visible on GitHub. Please refer to the GitHub README for the maps.

### 1. Installations and Imports

In [14]:
#1 installations and imports
print('----- Start Installing and Importing -----\n')
!pip install bs4 #BeautifulSoup version 4
from bs4 import BeautifulSoup # module for web scraping
import requests  #module for downloading a web page
import pandas as pd
import numpy as np
import re #module to work with regular expressions

!pip install geopy
from geopy.geocoders import Nominatim #module to convert an address into latitude and longitude values

!pip install folium==0.5.0
import folium #map rendering library

from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

print('\n----- Libraries Imported -----')

----- Start Installing and Importing -----


----- Libraries Imported -----


### 2. Data Collection (Web Scraping)
Use BeautifulSoup (a Python library) to scrape Toronto neighborhood data from a Wikipedia page (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

The following steps are implemented:
- 2.1. Download the webpage in text format
- 2.2. Create a soup object
- 2.3. Read in the webpage tables 

In [2]:
#2.1. download the webpage in text format
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
toronto_data  = requests.get(url).text #download the contents of the webpage in text format

#2.2. create a soup object
soup = BeautifulSoup(toronto_data, "lxml") 

#replace <br/> (html tag that separates the parts of each table cell's content) by | (pipe)
preprocessed_webpage = re.sub(r'<br\s*/>', '|', str(soup)) #raw string avoids an invalid-escape warning for \s

#2.3. read the pre-processed webpage
webpage_tables = pd.read_html(preprocessed_webpage)
print("Number of tables in the webpage:", len(webpage_tables))

#As seen on the Wikipedia page, the table of interest is the first table
print("Shape (rows, cols) of table of interest:", webpage_tables[0].shape)
webpage_tables[0]

Number of tables in the webpage: 3
Shape (rows, cols) of table of interest: (20, 9)


Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M1A|Not assigned,M2A|Not assigned,M3A|North York|(Parkwoods),M4A|North York|(Victoria Village),M5A|Downtown Toronto|(Regent Park / Harbourfront),M6A|North York|(Lawrence Manor / Lawrence Heig...,M7A|Queen's Park|(Ontario Provincial Government),M8A|Not assigned,M9A|Etobicoke|(Islington Avenue)
1,M1B|Scarborough|(Malvern / Rouge),M2B|Not assigned,M3B|North York|(Don Mills)|North,M4B|East York|(Parkview Hill / Woodbine Gardens),"M5B|Downtown Toronto|(Garden District, Ryerson)",M6B|North York|(Glencairn),M7B|Not assigned,M8B|Not assigned,M9B|Etobicoke|(West Deane Park / Princess Gard...
2,M1C|Scarborough|(Rouge Hill / Port Union / Hig...,M2C|Not assigned,M3C|North York|(Don Mills)|South|(Flemingdon P...,M4C|East York|(Woodbine Heights),M5C|Downtown Toronto|(St. James Town),M6C|York|(Humewood-Cedarvale),M7C|Not assigned,M8C|Not assigned,M9C|Etobicoke|(Eringate / Bloordale Gardens / ...
3,M1E|Scarborough|(Guildwood / Morningside / Wes...,M2E|Not assigned,M3E|Not assigned,M4E|East Toronto|(The Beaches),M5E|Downtown Toronto|(Berczy Park),M6E|York|(Caledonia-Fairbanks),M7E|Not assigned,M8E|Not assigned,M9E|Not assigned
4,M1G|Scarborough|(Woburn),M2G|Not assigned,M3G|Not assigned,M4G|East York|(Leaside),M5G|Downtown Toronto|(Central Bay Street),M6G|Downtown Toronto|(Christie),M7G|Not assigned,M8G|Not assigned,M9G|Not assigned
5,M1H|Scarborough|(Cedarbrae),M2H|North York|(Hillcrest Village),M3H|North York|(Bathurst Manor / Wilson Height...,M4H|East York|(Thorncliffe Park),M5H|Downtown Toronto|(Richmond / Adelaide / King),M6H|West Toronto|(Dufferin / Dovercourt Village),M7H|Not assigned,M8H|Not assigned,M9H|Not assigned
6,M1J|Scarborough|(Scarborough Village),M2J|North York|(Fairview / Henry Farm / Oriole),M3J|North York|(Northwood Park / York University),M4J|East York|East Toronto|(The Danforth | East),M5J|Downtown Toronto|(Harbourfront East / Unio...,M6J|West Toronto|(Little Portugal / Trinity),M7J|Not assigned,M8J|Not assigned,M9J|Not assigned
7,M1K|Scarborough|(Kennedy Park / Ionview / East...,M2K|North York|(Bayview Village),M3K|North York|(Downsview)|East | (CFB Toronto),M4K|East Toronto|(The Danforth West / Riverdale),M5K|Downtown Toronto|(Toronto Dominion Centre ...,M6K|West Toronto|(Brockton / Parkdale Village ...,M7K|Not assigned,M8K|Not assigned,M9K|Not assigned
8,M1L|Scarborough|(Golden Mile / Clairlea / Oakr...,M2L|North York|(York Mills / Silver Hills),M3L|North York|(Downsview)|West,M4L|East Toronto|(India Bazaar / The Beaches W...,M5L|Downtown Toronto|(Commerce Court / Victori...,M6L|North York|(North Park / Maple Leaf Park /...,M7L|Not assigned,M8L|Not assigned,M9L|North York|(Humber Summit)
9,M1M|Scarborough|(Cliffside / Cliffcrest / Scar...,M2M|North York|(Willowdale / Newtonbrook),M3M|North York|(Downsview)|Central,M4M|East Toronto|(Studio District),M5M|North York|(Bedford Park / Lawrence Manor ...,M6M|York|(Del Ray / Mount Dennis / Keelsdale a...,M7M|Not assigned,M8M|Not assigned,M9M|North York|(Humberlea / Emery)
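
The pipe characters in the cells above come from the `<br>` substitution in step 2.2. A minimal sketch of that idea on a single cell follows, using only the standard library. The `cell_html` string is a hypothetical, simplified piece of markup (the real page's HTML may differ), and the pattern here also accepts a bare `<br>`, which is slightly more permissive than the one used in the notebook.

```python
import re

# Hypothetical, simplified cell markup; the real page's HTML may differ.
cell_html = "<td>M3A<br/>North York<br />(Parkwoods)</td>"

# Replace every <br>, <br/>, or <br /> with a pipe so the parts of the cell
# stay separable once the remaining HTML tags are removed.
piped = re.sub(r"<br\s*/?>", "|", cell_html)

# Strip the remaining tags, then split the cell into its three components.
text = re.sub(r"<[^>]+>", "", piped)
postal_code, borough, neighborhood = text.split("|")

print(postal_code, borough, neighborhood)
```

Without the substitution, the three parts of a cell would be concatenated with no separator when the table is read, which is why it is applied before `pd.read_html`.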


### 3. Data Pre-processing
Consists of 2 parts:
 - 3.1. Cleaning
 - 3.2. Add Location Data


### 3.1. Cleaning
The following steps are implemented:
- 3.1.1. Create a dataframe that consists of three columns: Postal Code, Borough, and Neighborhood.
- 3.1.2. Ignore postal code areas that don't have an assigned borough.
- 3.1.3. If a postal code area has more than one neighborhood, separate the neighborhoods with commas.
- 3.1.4. If a postal code area has a borough but a _"Not assigned"_ neighborhood, assign the borough value to the neighborhood.

In [3]:
#3.1.1. create a dataframe that consists of three columns: Postal Code, Borough, and Neighborhood
records = [] #rows are collected in a list because DataFrame.append was removed in pandas 2.0

for col in range(0, webpage_tables[0].shape[1]):
    for row in webpage_tables[0][col]:
        parts = row.split("|") #split the cell content once and reuse the parts
        p = parts[0] #Postal Code
        b = parts[1] #Borough
        if (b != "Not assigned"): #3.1.2. ignoring postal code areas that don't have an assigned borough

            #Neighborhood
            if (parts[2].find("/") > 0): #3.1.3. checking if the postal code area has more than one neighborhood
                n = parts[2].replace(" / ", ", ").strip("()") #3.1.3. separating each neighborhood with a comma
            elif (parts[2] == "Not assigned"): #3.1.4. checking if a neighborhood has not been assigned
                n = b #3.1.4. assign the borough value to the neighborhood
            else:
                n = parts[2].strip("()")

            records.append({"Postal Code":p, "Borough":b, "Neighborhood":n}) #saving the content of the row

toronto_data = pd.DataFrame(records, columns=["Postal Code", "Borough", "Neighborhood"])

toronto_data

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |


In [4]:
#shape (rows, cols) of the dataframe
toronto_data.shape

(103, 3)
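
The cleaning rules 3.1.2 to 3.1.4 can also be checked one cell at a time. The sketch below wraps them in a small helper; `parse_cell` is a hypothetical name that does not appear in the notebook, the third input is a made-up cell used only to exercise rule 3.1.4, and treating a missing neighborhood part as "Not assigned" is an assumption.

```python
from typing import Optional, Tuple


def parse_cell(cell: str) -> Optional[Tuple[str, str, str]]:
    """Return (postal_code, borough, neighborhood), or None if no borough is assigned."""
    parts = cell.split("|")
    postal_code, borough = parts[0], parts[1]
    if borough == "Not assigned":  # rule 3.1.2: skip areas without a borough
        return None
    # Assumption: a cell with no third part is treated like "Not assigned".
    raw = parts[2] if len(parts) > 2 else "Not assigned"
    if raw == "Not assigned":  # rule 3.1.4: fall back to the borough name
        return postal_code, borough, borough
    # rule 3.1.3: drop the parentheses and separate neighborhoods with commas
    return postal_code, borough, raw.strip("()").replace(" / ", ", ")


print(parse_cell("M1B|Scarborough|(Malvern / Rouge)"))
print(parse_cell("M1A|Not assigned"))
print(parse_cell("X0X|Example Borough|Not assigned"))  # hypothetical cell
```

Keeping the rules in one function makes it easy to try them on edge cases before running them over the whole table.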

### 3.2. Add Location Data
The following steps are implemented:
- 3.2.1. Read the given .csv file (https://cocl.us/Geospatial_data) that contains the latitude and longitude coordinates of each postal code.
- 3.2.2. Merge the coordinates into the Toronto data.


In [5]:
#3.2.1. read the given .csv file that contains the latitude and longitude coordinates of each postal code
geo_coordinates = pd.read_csv('https://cocl.us/Geospatial_data')
print("Input file shape (rows, cols):", geo_coordinates.shape)
geo_coordinates.head()

Input file shape (rows, cols): (103, 3)


| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [6]:
#3.2.2. merge the coordinates into the toronto data (inner join on 'Postal Code')
toronto_info = pd.merge(toronto_data, geo_coordinates, on='Postal Code')
print("Toronto Info shape (rows, cols):", toronto_info.shape)
toronto_info.head()

Toronto Info shape (rows, cols): (103, 5)


| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
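
`pd.merge` performs an inner join by default, so a postal code missing from the coordinates file would be dropped without any message. Here the row count stays at 103, so nothing was lost. A possible way to make that check explicit is sketched below on toy data; the frames `areas` and `coords` and the postal code `M0X` are hypothetical and chosen only to show an unmatched row.

```python
import pandas as pd

# Toy frames with hypothetical rows; "M0X" is deliberately missing from coords.
areas = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M0X"],
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every area; indicator=True adds a "_merge" column that
# records whether each row was found in both frames or only in the left one.
# validate="one_to_one" raises if a postal code is duplicated on either side.
merged = pd.merge(areas, coords, on="Postal Code", how="left",
                  indicator=True, validate="one_to_one")

unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code area received coordinates.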


In [7]:
#check how many postal code areas fall in each borough
toronto_info.groupby('Borough').count()

| Borough | Postal Code | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| Central Toronto | 9 | 9 | 9 | 9 |
| Downtown Toronto | 18 | 18 | 18 | 18 |
| East Toronto | 5 | 5 | 5 | 5 |
| East York | 5 | 5 | 5 | 5 |
| Etobicoke | 12 | 12 | 12 | 12 |
| Mississauga | 1 | 1 | 1 | 1 |
| North York | 24 | 24 | 24 | 24 |
| Queen's Park | 1 | 1 | 1 | 1 |
| Scarborough | 17 | 17 | 17 | 17 |
| West Toronto | 6 | 6 | 6 | 6 |


### 4. Explore 
The following steps are implemented:
- 4.1. Select the boroughs that contain the word Toronto.
- 4.2. Create a map of Toronto with neighborhoods superimposed on top. To do so, the following steps are implemented:
    - 4.2.1. Get Toronto's latitude and longitude geographical coordinates.
    - 4.2.2. Create a map of Toronto with the neighborhoods superimposed on top.


In [8]:
#4.1. select the boroughs that contain the word Toronto
filtered_toronto_info = toronto_info[toronto_info['Borough'].str.contains('Toronto',regex=False)]
print("Filtered Toronto Info shape (rows, cols):", filtered_toronto_info.shape)
filtered_toronto_info

Filtered Toronto Info shape (rows, cols): (38, 5)


| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 37 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 41 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 42 | M4L | East Toronto | India Bazaar, The Beaches West | 43.668999 | -79.315572 |
| 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 44 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
| 45 | M4P | Central Toronto | Davisville North | 43.712751 | -79.390197 |
| 46 | M4R | Central Toronto | North Toronto West | 43.715383 | -79.405678 |
| 47 | M4S | Central Toronto | Davisville | 43.704324 | -79.38879 |
| 48 | M4T | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.38316 |
| 49 | M4V | Central Toronto | Summerhill West, Rathnelly, South Hill, Forest... | 43.686412 | -79.400049 |


In [9]:
#4.2.1. get the geographical coordinates of Toronto
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [10]:
#4.2.2. create map of Toronto with the neighborhoods superimposed on top.
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(filtered_toronto_info['Latitude'], filtered_toronto_info['Longitude'], filtered_toronto_info['Borough'], filtered_toronto_info['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto


NB: The map might not be visible on GitHub. Please refer to the GitHub README for the map.

### 5. Segment and Cluster Data
The following steps are implemented:
- 5.1. Run k-means to cluster the neighborhoods into 5 clusters.
- 5.2. Create a map of Toronto with the clustered neighborhoods superimposed on top.

In [11]:
#5.1. Run k-means to cluster the neighborhoods into 5 clusters.
kclusters = 5 #set number of clusters
toronto_grouped_clustering = filtered_toronto_info.drop(columns=['Postal Code','Borough','Neighborhood']) #keep only Latitude and Longitude; the positional axis argument was removed in pandas 2.0
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering) #run k-means clustering
print("Total number of cluster labels:", len(kmeans.labels_))
kmeans.labels_ 

Total number of cluster labels: 38


array([3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       1, 1, 0, 0, 0, 2, 2, 2, 0, 4, 0, 0, 4, 4, 4, 3], dtype=int32)
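
To make step 5.1 more concrete, the following is a minimal sketch of the plain k-means (Lloyd's) iteration on four of the coordinates listed earlier. It is not scikit-learn's implementation: `KMeans` uses k-means++ initialisation by default, whereas this sketch simply takes the first `k` points as starting centroids (an assumption made for determinism). The function name `kmeans_labels` is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def kmeans_labels(points: List[Point], k: int, iterations: int = 100) -> List[int]:
    """Plain Lloyd's iteration; the first k points serve as initial centroids."""
    centroids = list(points[:k])
    labels: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels


# (latitude, longitude) pairs taken from the filtered table above
points = [
    (43.676357, -79.293031),  # The Beaches
    (43.668999, -79.315572),  # India Bazaar, The Beaches West
    (43.72802, -79.38879),    # Lawrence Park
    (43.712751, -79.390197),  # Davisville North
]
labels = kmeans_labels(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only the grouping matters. Note also that clustering raw latitude and longitude values treats degrees as plain Euclidean coordinates, which is a reasonable approximation over an area as small as one city.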

In [12]:
#merge cluster labels with the filtered_toronto_info dataframe
filtered_toronto_info.insert(filtered_toronto_info.shape[1], 'Cluster Labels', kmeans.labels_)
print("Filtered Toronto Info with Cluster labels shape (rows, cols):", filtered_toronto_info.shape)
filtered_toronto_info

Filtered Toronto Info with Cluster labels shape (rows, cols): (38, 6)


| | Postal Code | Borough | Neighborhood | Latitude | Longitude | Cluster Labels |
|---|---|---|---|---|---|---|
| 37 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 | 3 |
| 41 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 | 3 |
| 42 | M4L | East Toronto | India Bazaar, The Beaches West | 43.668999 | -79.315572 | 3 |
| 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 | 3 |
| 44 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 | 1 |
| 45 | M4P | Central Toronto | Davisville North | 43.712751 | -79.390197 | 1 |
| 46 | M4R | Central Toronto | North Toronto West | 43.715383 | -79.405678 | 1 |
| 47 | M4S | Central Toronto | Davisville | 43.704324 | -79.38879 | 1 |
| 48 | M4T | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.38316 | 1 |
| 49 | M4V | Central Toronto | Summerhill West, Rathnelly, South Hill, Forest... | 43.686412 | -79.400049 | 1 |


In [15]:
#5.2. create map of Toronto with the clustered neighborhoods superimposed on top
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

#set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

#add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(filtered_toronto_info['Latitude'], filtered_toronto_info['Longitude'], filtered_toronto_info['Neighborhood'], filtered_toronto_info['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

NB: The map might not be visible on GitHub. Please refer to the GitHub README for the map.
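
The marker colours in step 5.2 are looked up with `rainbow[cluster-1]`. Because Python lists accept negative indices, cluster 0 maps to the last colour and every cluster still gets its own colour. The sketch below illustrates this with a standard-library palette; `cluster_palette` is a hypothetical helper built on `colorsys`, used here as a stand-in for matplotlib's `cm.rainbow`, so the actual colours differ from those on the map.

```python
import colorsys
from typing import List


def cluster_palette(k: int) -> List[str]:
    """Return k evenly spaced hues as hex strings (a stand-in for cm.rainbow)."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


kclusters = 5
palette = cluster_palette(kclusters)

# Same lookup as in the notebook: label 0 wraps around to the last colour.
colour_of = {label: palette[label - 1] for label in range(kclusters)}
print(colour_of)
```

Indexing with `palette[label]` would work equally well; it would only change which colour each cluster receives.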