<h1 align=center> Segmenting and Clustering Neighborhoods in Toronto</h1>
<br />
<br />

__This notebook covers data scraping, data wrangling and clustering.__

<br />
<br />
<h2> I. Preparing data </h2>

__1.__ We are going to import the necessary libraries :

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.html import read_html # read_html

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

__2.__ We are going to scrape the postal code table from the Wikipedia page :

In [2]:
wikipage="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" # url

wikitable=read_html(wikipage, attrs={"class":"wikitable"}) # reading the url

print("%d extracted tables!" %len(wikitable))

1 extracted tables!
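
As a minimal sketch of how the `attrs` filter behaves, the snippet below runs `read_html` on a small inline HTML string instead of the live page. The HTML content and the variable names `html` and `tables` are illustrative assumptions, and `read_html` needs an HTML parser such as `lxml` to be installed.

```python
import io

import pandas as pd

# Toy HTML with two tables; only the second one has class="wikitable".
html = """
<table class="other">
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# attrs keeps only the tables whose attributes match the given dictionary.
tables = pd.read_html(io.StringIO(html), attrs={"class": "wikitable"})

print("%d extracted tables!" % len(tables))
print(tables[0])
```

`read_html` always returns a list of dataframes, even when a single table matches, which is why the notebook indexes the result with `[0]` in the next step.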


__3.__ We are going to put the dataset into another variable :

In [3]:
data=wikitable[0] # reading the wikitable into the dataframe

print('This dataset contains %d rows' %data.shape[0])

data.head() # visualizing data

This dataset contains 180 rows


| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


__4.__ We are going to clean the dataset :

In [4]:
data.drop(data[data['Borough']=="Not assigned"].index, inplace=True) # dropping rows with 'Not assigned' in 'Borough' column

data.reset_index(drop=True, inplace=True) # resetting index

print('After dropping rows with \"Not assigned\" in \"Borough\" column, the dataset contains %d rows' %data.shape[0])

data.head() # visualizing data

After dropping rows with "Not assigned" in "Borough" column, the dataset contains 103 rows


| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
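
The cleaning step can also be written as a boolean mask rather than an in-place `drop`. The sketch below shows this on a three-row toy dataframe; the names `toy` and `cleaned` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data: one unassigned postal code and two assigned ones.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep the rows whose borough is assigned, then renumber the index from 0.
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

print(cleaned)
```

Filtering with a mask returns a new dataframe and leaves `toy` unchanged, which makes the cell safe to re-run.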


__5.__ Lastly, we are going to display the number of rows in our dataset :

In [5]:
print('The final dataset contains %d rows' %data.shape[0])

The final dataset contains 103 rows


<h2> II. Adding coordinates to data </h2>

__1.__ We are going to read the csv file :

In [6]:
!wget -q -O "Geospatial_Coordinates.csv" http://cocl.us/Geospatial_data #downloading the csv file

coord = pd.read_csv("Geospatial_Coordinates.csv") #loading the csv to coord

__2.__ We are going to extract the coordinates from the csv file :

In [7]:
# looking up the coordinates of each postal code; a left merge preserves the row order of data
# (DataFrame.append, used in an earlier version of this cell, was removed in pandas 2.0)
coordinates = data[["Postal Code"]].merge(coord, on="Postal Code", how="left")[["Latitude","Longitude"]]

coordinates.head() # visualizing the data

| | Latitude | Longitude |
|---|---|---|
| 0 | 43.753259 | -79.329656 |
| 1 | 43.725882 | -79.315572 |
| 2 | 43.65426 | -79.360636 |
| 3 | 43.718518 | -79.464763 |
| 4 | 43.662301 | -79.389494 |
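
The lookup in this step pairs each postal code with its row in the coordinates file. A minimal sketch of that join on toy data, using `DataFrame.merge`, is shown below; the names `areas`, `coords` and `merged` are illustrative, and the coordinate values are copied from the output above.

```python
import pandas as pd

# Left table: postal codes in the order we want to keep.
areas = pd.DataFrame({
    "Postal Code": ["M5A", "M3A"],
    "Borough": ["Downtown Toronto", "North York"],
})

# Right table: coordinates, listed in a different order.
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Latitude": [43.753259, 43.725882, 43.65426],
    "Longitude": [-79.329656, -79.315572, -79.360636],
})

# how="left" keeps every row of areas in its original order;
# validate raises an error if a postal code appears twice in coords.
merged = areas.merge(coords, on="Postal Code", how="left", validate="many_to_one")

print(merged)
```

A postal code missing from `coords` would yield `NaN` coordinates here instead of raising an error, so checking `merged.isna().any()` afterwards is a reasonable safeguard.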


__3.__ Finally, we are going to add coordinates to our dataset :

In [8]:
Toronto=pd.concat([data,coordinates],axis=1) # concatenating the datasets column-wise
Toronto.head() #visualizing the data

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


<h2> III. Clustering </h2>

__1.__ We are going to slice the dataframe, keeping only the rows whose borough name contains "Toronto" :

In [9]:
# keeping only the rows whose borough name contains "Toronto"
Boroughs_Toronto = Toronto[Toronto["Borough"].str.contains("Toronto")].copy()

Boroughs_Toronto.reset_index(drop=True, inplace=True) # resetting index
Boroughs_Toronto.head() # visualizing the data

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 3 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 4 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |


__2.__ We are going to one-hot encode the boroughs into dummy columns, so that we can group our data into clusters :

In [10]:
Boroughs_Toronto_dummies = pd.get_dummies(Boroughs_Toronto[['Borough']], prefix="", prefix_sep="") # one 0/1 column per borough
Boroughs_Toronto_dummies.head() # visualizing the data

| | Central Toronto | Downtown Toronto | East Toronto | West Toronto |
|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 |
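
To make the encoding concrete, here is a small sketch of `pd.get_dummies` on three toy rows. The names `boroughs` and `dummies` are illustrative; `dtype=int` is passed so the result uses 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Three toy rows covering two distinct boroughs.
boroughs = pd.DataFrame(
    {"Borough": ["Downtown Toronto", "East Toronto", "Downtown Toronto"]}
)

# One column per distinct borough; the empty prefix keeps the bare borough names.
dummies = pd.get_dummies(boroughs[["Borough"]], prefix="", prefix_sep="", dtype=int)

print(dummies)
```

Each row contains exactly one 1, so two rows are identical in this encoding whenever they belong to the same borough.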


__3.__ We are going to cluster our samples :

In [11]:
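kmeans = KMeans(n_clusters=4, random_state=0).fit(Boroughs_Toronto_dummies) # fitting k-means with as many clusters as boroughs
# note: k-means numbers its clusters arbitrarily, so the name picked below for a cluster need not match the borough of its rows
Boroughs=np.asanyarray(["Central Toronto", "Downtown Toronto", "East Toronto", "West Toronto"])
Boroughs_Toronto.insert(0, 'Cluster Labels', Boroughs[kmeans.labels_]) # adding the cluster names as the first column
Boroughs_Toronto.head() # visualizing the data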

| | Cluster Labels | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | Central Toronto | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 1 | Central Toronto | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 2 | Central Toronto | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 3 | Central Toronto | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 4 | West Toronto | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
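
In the output above, the Downtown Toronto rows carry the label "Central Toronto" because the integer labels returned by k-means are arbitrary and do not follow the alphabetical order of the borough names. One possible way to name each cluster after the borough it actually contains is to read the name off the cluster center, as sketched below on toy data. The names `boroughs`, `dummies`, `cluster_names` and `named` are illustrative, and `n_init=10` is set explicitly as an assumption to keep the behavior stable across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data: seven rows spread over three boroughs.
boroughs = pd.Series(
    ["Downtown Toronto"] * 3 + ["East Toronto"] * 2 + ["West Toronto"] * 2,
    name="Borough",
)
dummies = pd.get_dummies(boroughs, dtype=int)

kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(dummies)

# Each cluster center is a vector over the dummy columns; its largest
# component identifies the dominant borough of that cluster.
cluster_names = dummies.columns[kmeans.cluster_centers_.argmax(axis=1)]

# Translate the arbitrary integer labels into those borough names.
named = cluster_names[kmeans.labels_]

print(pd.DataFrame({"Borough": boroughs, "Cluster Labels": named}))
```

With as many clusters as distinct one-hot vectors, each cluster holds a single borough, so the derived names agree with the `Borough` column.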


__4.__ We are going to visualize our clustering on a folium map :

In [23]:
# create map
latitude_toronto=43.70011
longitude_toronto=-79.4163
map_clusters = folium.Map(location=[latitude_toronto, longitude_toronto], zoom_start=12)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, 4)) # one evenly spaced color per cluster
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, label, cluster in zip(Boroughs_Toronto['Latitude'], Boroughs_Toronto['Longitude'], Boroughs_Toronto['Neighborhood'], 
                                  Boroughs_Toronto['Cluster Labels'], kmeans.labels_):
    popup = folium.Popup(str(poi) +", "+ str(label), parse_html=True) # popup text: neighborhood, cluster name
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=popup,
        color=rainbow[cluster], # k-means labels start at 0, so they index the color list directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters # displaying the map
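
The color scheme used for the markers can be checked on its own, without rendering a map. The sketch below builds one hex color per cluster from matplotlib's `rainbow` colormap; `n_clusters` and `palette` are illustrative names, not part of the notebook.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

n_clusters = 4

# Sample the colormap at evenly spaced positions between 0 and 1,
# then convert each RGBA tuple to a hex string usable by folium.
palette = [colors.to_hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, n_clusters))]

for cluster, hex_color in enumerate(palette):
    print(cluster, hex_color)
```

Because the sampled positions are evenly spaced, the four colors are distinct, and `palette[cluster]` can be passed as the `color` and `fill_color` of a `folium.CircleMarker`.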