# Segmenting and Clustering Neighborhoods in Toronto
The goal of this notebook is to segment and cluster neighborhoods in Toronto, gathering the postal code and borough of each one along the way.

## Building the dataframe

In [1]:
!pip install bs4



In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

---

First we will scrape the data from Wikipedia. I am using an older revision of the Wikipedia page because it uses a basic HTML table, which is easier to parse into a dataframe than the grid used in newer versions of the page. I also rename the columns.

In [3]:
req = requests.get("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1012118802") 

soup = BeautifulSoup(req.content,'html.parser') 

table = soup.find_all('table')[0]

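# Wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string
from io import StringIO
df = pd.read_html(StringIO(str(table)))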

df = df[0]

df.columns = ['PostalCode', 'Borough', 'Neighborhood']

df.shape

(180, 3)

Now we will clean the data. First, replace all occurrences of "Not assigned" with NaN. If a cell has a borough but no neighborhood, the neighborhood is set to the borough. Then every row that still contains NaN is dropped, and the index is reset to close any gaps in the numbering.

In [4]:
df.replace("Not assigned", np.nan, inplace=True)
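# Assign the result back; an in-place fillna on a selected column may not update df in recent pandas
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])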
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

In [5]:
df.shape

(103, 3)
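To make the cleaning rules concrete, here is a minimal sketch of the same three steps on a tiny made-up dataframe. The postal codes and borough names below are invented for illustration and do not come from the Wikipedia table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one fully unassigned row, one row with a borough but no neighborhood
toy = pd.DataFrame({
    "PostalCode": ["X1A", "X2A", "X3A", "X4A"],
    "Borough": ["Not assigned", "Borough A", "Borough B", "Borough A"],
    "Neighborhood": ["Not assigned", "Hood 1", "Not assigned", "Hood 2"],
})

# Step 1: mark unassigned cells as missing
cleaned = toy.replace("Not assigned", np.nan)
# Step 2: a missing neighborhood falls back to its borough
cleaned["Neighborhood"] = cleaned["Neighborhood"].fillna(cleaned["Borough"])
# Step 3: drop rows that are still incomplete and renumber the index
cleaned = cleaned.dropna().reset_index(drop=True)

print(cleaned)
```

The row with no borough at all is dropped, while the row with only a borough survives with the borough name copied into the neighborhood column.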

## Adding latitude and longitude

Because the geocoder is not functioning reliably, I import the provided .csv file of coordinates instead. I rename its "Postal Code" column to "PostalCode" so that it matches the other dataframe, and then merge the two dataframes on that column.

In [6]:
csv_df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv")
csv_df.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
df = df.merge(csv_df, how='inner', on='PostalCode')

We can see the data has been merged successfully.

In [7]:
df.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


We can also confirm that no rows have been lost: the row count is still 103.

In [8]:
df.shape

(103, 5)
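An inner merge silently drops postal codes that have no match, so comparing shapes is the only signal that nothing was lost. As a hedged alternative, the sketch below uses a left merge with pandas' `validate` and `indicator` options to list unmatched keys explicitly. The two small dataframes are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped table and the coordinates file
left = pd.DataFrame({
    "PostalCode": ["X1A", "X2A", "X3A"],
    "Borough": ["Borough A", "Borough B", "Borough A"],
})
right = pd.DataFrame({
    "PostalCode": ["X1A", "X2A", "X9Z"],
    "Latitude": [43.70, 43.71, 43.72],
    "Longitude": [-79.30, -79.31, -79.32],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a "_merge" column recording where each row came from
checked = left.merge(
    right, how="left", on="PostalCode", validate="one_to_one", indicator=True
)

# Postal codes present in the left table but missing from the coordinates
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every postal code received coordinates, which is the same conclusion the shape check reaches for the real data.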

## Clustering

In [9]:
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium
import folium # map rendering library



---

Let's first have a look at the map before clustering the neighborhoods.

In [10]:
neighborhoods = df

# Get geographical coords for Toronto
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# Add markers to map
for lat, lng, label in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

# Display the map
map_toronto

Now we will cluster the neighborhoods into 5 clusters using k-means clustering.

In [11]:
# Copy only the latitude and longitude columns for clustering
# (passing the axis positionally to drop(), as in drop('Neighborhood', 1), fails in pandas 2.x)
toronto_grouped_clustering = df[['Latitude', 'Longitude']].copy()

# Run k-means clustering with 5 clusters
kclusters=5

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

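# Add clustering labels to a copy of our data, so df itself is unchanged
# and this cell can be re-run without a "column already exists" error
toronto_labeled = df.copy()
toronto_labeled.insert(0, 'Cluster Labels', kmeans.labels_)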
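For intuition about what the clustering step does with the coordinates, here is a minimal NumPy sketch of the basic k-means iteration (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. This is not scikit-learn's implementation, which by default uses k-means++ initialization and is considerably more optimized. The function name `lloyd_kmeans` and the six coordinates are made up for illustration.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means on an (n, d) array; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize with k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # assignments are stable, so the algorithm has converged
        centers = new_centers
    return labels, centers

# Hypothetical latitude/longitude pairs forming two well-separated groups
coords = np.array([
    [43.65, -79.38], [43.66, -79.39], [43.64, -79.37],
    [43.77, -79.25], [43.78, -79.24], [43.76, -79.26],
])
labels, centers = lloyd_kmeans(coords, k=2)
print(labels)
print(centers)
```

Note that clustering on raw latitude and longitude treats degrees as plain Euclidean units, which is a reasonable approximation over an area the size of a single city.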

Now we will display the clustered neighborhoods on a map.

In [12]:
# Create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map, colored by cluster label
for lat, lon, poi, cluster in zip(toronto_labeled['Latitude'], toronto_labeled['Longitude'], toronto_labeled['Neighborhood'], toronto_labeled['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # Labels run from 0 to kclusters-1, so they index the color list directly
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

We can see the data has been clustered into 5 clusters.