# Segmenting and Clustering Neighborhoods in Toronto

### Author: Bryan Choi

Note: This notebook contains answers for all 3 parts of the assignment.

## TOC:
1. [Part 1: Data Scraping and Pre-processing](#part-1)
2. [Part 2: Importing Geospatial Data and Merging](#part-2)
3. [Part 3: Plotting Folium Map](#part-3)

## Initialisation

In [1]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans


## Part 1: Data Scraping and Pre-processing <a class="anchor" id="part-1"></a>

In [2]:
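# Pulling data from a pinned revision of the Wikipedia table
from io import StringIO

wiki_data = requests.get(
    "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1012118802"
).text
soup = BeautifulSoup(wiki_data, "lxml")
# read_html returns a list of DataFrames, so take the first one; wrapping the
# HTML in StringIO avoids the literal-string warning in recent pandas versions
df = pd.read_html(StringIO(str(soup.table)))[0]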


In [3]:
# Removing rows where Borough is "Not assigned"
df = df[df["Borough"] != "Not assigned"]

# Checking for remaining "Not assigned" cells and duplicated postal codes.
# `in` on a DataFrame or Series tests labels, not values, so compare element-wise.
print(
    "Any remaining 'Not assigned' cells >>> {} | Any duplicated postal code >>> {}".format(
        (df == "Not assigned").any().any(), df["Postal Code"].duplicated().any()
    )
)


Any remaining 'Not assigned' cells >>> False | Any duplicated postal code >>> False
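
A pitfall worth illustrating at this step: Python's `in` operator applied to a DataFrame tests the column labels, not the cell values. The minimal sketch below uses a small made-up frame (`toy` is hypothetical, not the scraped data) to contrast a label check with element-wise checks.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {
        "Postal Code": ["M1A", "M1A", "M3A"],
        "Borough": ["Not assigned", "North York", "North York"],
    }
)

# Membership on a DataFrame looks at column labels, so this misses the cell value
label_check = "Not assigned" in toy

# Element-wise comparison, then reduce over columns and rows
value_check = bool((toy == "Not assigned").any().any())

# duplicated() flags repeated values; any() reports whether at least one exists
dup_check = bool(toy["Postal Code"].duplicated().any())

print(label_check, value_check, dup_check)
```

Here the label check reports nothing even though the value is present, which is why the element-wise form is the safer validation.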


In [4]:
# Previewing some data to ensure validity
print(df.head())
df.shape


  Postal Code           Borough                                Neighbourhood
2         M3A        North York                                    Parkwoods
3         M4A        North York                             Victoria Village
4         M5A  Downtown Toronto                    Regent Park, Harbourfront
5         M6A        North York             Lawrence Manor, Lawrence Heights
6         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government


(103, 3)

## Part 2: Importing Geospatial Data and Merging <a class="anchor" id="part-2"></a>

In [5]:
# Importing GeoSpatial .csv
geodata = pd.read_csv(
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv"
)

# Checking geodata df to have the same number of rows as main df
print(geodata.head())
geodata.shape


  Postal Code   Latitude  Longitude
0         M1B  43.806686 -79.194353
1         M1C  43.784535 -79.160497
2         M1E  43.763573 -79.188711
3         M1G  43.770992 -79.216917
4         M1H  43.773136 -79.239476


(103, 3)

In [6]:
# Merging main df and geodata
df1 = pd.merge(df, geodata, on="Postal Code")
df1.head()


  Postal Code           Borough                                Neighbourhood   Latitude  Longitude
0         M3A        North York                                    Parkwoods  43.753259 -79.329656
1         M4A        North York                             Victoria Village  43.725882 -79.315572
2         M5A  Downtown Toronto                    Regent Park, Harbourfront  43.654260 -79.360636
3         M6A        North York             Lawrence Manor, Lawrence Heights  43.718518 -79.464763
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government  43.662301 -79.389494
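
The merge above relies on every postal code appearing exactly once in both tables. As a hedged sketch of how that assumption could be checked, the example below uses the `validate` and `indicator` parameters of `pd.merge` on two small made-up frames (`left` and `right` are hypothetical stand-ins for `df` and `geodata`).

```python
import pandas as pd

# Hypothetical stand-ins for the scraped table and the coordinates table
left = pd.DataFrame(
    {"Postal Code": ["M3A", "M4A", "M9Z"], "Borough": ["North York"] * 3}
)
right = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M5A"],
        "Latitude": [43.753259, 43.725882, 43.654260],
        "Longitude": [-79.329656, -79.315572, -79.360636],
    }
)

# validate raises MergeError if a postal code repeats on either side;
# indicator adds a "_merge" column saying where each row came from
checked = pd.merge(
    left, right, on="Postal Code", how="outer", validate="one_to_one", indicator=True
)

# Rows present in only one of the two tables
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Postal Code", "_merge"]])
```

If `unmatched` is empty on the real data, the default inner merge loses no rows.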


## Part 3: Plotting Folium Map <a class="anchor" id="part-3"></a>

In [7]:
# Generating Folium map object centred on downtown Toronto
# (named toronto_map to avoid shadowing the built-in map())
toronto_map = folium.Map(location=[43.6532, -79.3832], zoom_start=10)

# Label and plot each df1 row
for lat, lng, bor, neigh in zip(
    df1["Latitude"], df1["Longitude"], df1["Borough"], df1["Neighbourhood"]
):
    popup = folium.Popup(
        "{}, {}".format(neigh, bor), parse_html=True
    )  # Defining popup object
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=popup,
        color="blue",
        fill=True,  # enable the fill so fill_color and fill_opacity apply
        fill_color="blue",
        fill_opacity=1,
    ).add_to(toronto_map)

toronto_map


### Clustering of Neighbourhoods
We'll use k-means clustering with 5 clusters, using each neighbourhood's latitude and longitude as the only features.

In [8]:
# Setting number of clusters
k = 5
# Keeping only Latitude and Longitude as clustering features
df_clust = df1.drop(columns=["Postal Code", "Borough", "Neighbourhood"])

# K-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(df_clust)

# Inserting cluster labels into main df
df1.insert(0, "Cluster Labels", kmeans.labels_)

df1.head()


   Cluster Labels Postal Code           Borough                                Neighbourhood   Latitude  Longitude
0               4         M3A        North York                                    Parkwoods  43.753259 -79.329656
1               4         M4A        North York                             Victoria Village  43.725882 -79.315572
2               2         M5A  Downtown Toronto                    Regent Park, Harbourfront  43.654260 -79.360636
3               3         M6A        North York             Lawrence Manor, Lawrence Heights  43.718518 -79.464763
4               2         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government  43.662301 -79.389494
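
The value k = 5 is set by hand in the cell above. One common way to sanity-check such a choice is the elbow heuristic: fit k-means for several values of k and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. The sketch below is an illustration only; it runs on synthetic points generated around three made-up centres (`centres` and `points` are assumptions, not the Toronto data).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic coordinates scattered around three hypothetical centres
rng = np.random.default_rng(0)
centres = np.array([[43.65, -79.38], [43.75, -79.30], [43.70, -79.50]])
points = np.vstack([c + rng.normal(scale=0.01, size=(30, 2)) for c in centres])

# Fit k-means for a range of cluster counts and record the inertia of each fit
candidate_ks = range(1, 7)
inertias = []
for n_clusters in candidate_ks:
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

for n_clusters, inertia in zip(candidate_ks, inertias):
    print(n_clusters, round(inertia, 4))
```

On the real data, replacing `points` with `df_clust` would give the corresponding curve for the Toronto coordinates.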


In [9]:
# Set colours for the clusters: k evenly spaced colours from the rainbow colormap
colour_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colour_array]

In [10]:
# Creating new Folium map with neighbourhoods coloured according to their cluster
map2 = folium.Map(location=[43.6532, -79.3832], zoom_start=10)

# Label and plot each df1 row
for lat, lng, bor, neigh, clust in zip(
    df1["Latitude"],
    df1["Longitude"],
    df1["Borough"],
    df1["Neighbourhood"],
    df1["Cluster Labels"],
):
    popup = folium.Popup(
        "{}, {} (Cluster {})".format(neigh, bor, clust), parse_html=True
    )  # Defining popup object
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=popup,
        color=rainbow[clust],  # labels run from 0 to k - 1, so index directly
        fill=True,  # enable the fill so fill_color and fill_opacity apply
        fill_color=rainbow[clust],
        fill_opacity=1,
    ).add_to(map2)

map2

### Observation
The clustering looks reasonable. Because k-means was run on latitude and longitude alone, the clusters partition the city geographically: they follow the Toronto city grid and extend to the NW, N, NE and E of the city centre.
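
To make the directional reading above less dependent on eyeballing the map, each cluster centre can be converted to a compass direction relative to the city centre. The sketch below is a rough illustration under a flat-earth approximation; `compass_direction` and `example_centres` are hypothetical names and values introduced here, not part of the original analysis.

```python
import math

CITY_CENTRE = (43.6532, -79.3832)  # same coordinates used to centre the maps
DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def compass_direction(lat, lng, origin=CITY_CENTRE):
    """Return the nearest 8-point compass direction from origin to (lat, lng)."""
    d_north = lat - origin[0]
    # Scale the longitude difference by cos(latitude) so east-west and
    # north-south offsets are comparable (flat-earth approximation)
    d_east = (lng - origin[1]) * math.cos(math.radians(origin[0]))
    bearing = math.degrees(math.atan2(d_east, d_north)) % 360
    return DIRECTIONS[round(bearing / 45) % 8]


# Made-up centres for illustration only (not the fitted cluster centres)
example_centres = [(43.75, -79.3832), (43.6532, -79.20), (43.72, -79.48)]
for lat, lng in example_centres:
    print((lat, lng), compass_direction(lat, lng))
```

With the fitted model, the same function could be applied to each row of `kmeans.cluster_centers_`, whose columns follow the order of `df_clust` (latitude, then longitude).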