# Segmenting and Clustering Neighborhoods in Toronto

## Applied Data Science Capstone

This notebook is part of the Applied Data Science Capstone on Coursera.

#### Juan Diego Moreno Gracia
##### 28-07-2020
##### Bogotá, Colombia

We will scrape the table of Toronto postal codes from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

First, we import the libraries that we will be using.

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

# import k-means from clustering stage
from sklearn.cluster import KMeans

## 1. Download and explore the data

In [2]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df_toronto=pd.read_html(url)[0]
df_toronto

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


If we check the Wikipedia link, we can confirm that the table was loaded successfully.

Now, we need to get rid of all the rows that don't have a borough assigned.
To this end, we can group our data by borough to see the number of neighborhoods in each one.

In [3]:
df_toronto.groupby("Borough").count()

Borough,Postal Code,Neighbourhood
Central Toronto,9,9
Downtown Toronto,19,19
East Toronto,5,5
East York,5,5
Etobicoke,12,12
Mississauga,1,1
North York,24,24
Not assigned,77,77
Scarborough,17,17
West Toronto,6,6
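As a side note, the same per-borough count can be read from a single column with `value_counts`. The sketch below uses a small made-up frame (not the Wikipedia table), and the names `toy`, `counts`, and `n_unassigned` are chosen only for illustration.

```python
import pandas as pd

# Toy data for illustration only; the real table has many more rows.
toy = pd.DataFrame({
    "Borough": [
        "North York", "Not assigned", "North York",
        "Scarborough", "Not assigned", "Not assigned",
    ]
})

# Count how many rows fall under each borough label.
counts = toy["Borough"].value_counts()

# Number of rows without a borough (0 if the label never appears).
n_unassigned = int(counts.get("Not assigned", 0))
print(n_unassigned)
```

Compared with `groupby(...).count()`, this returns one number per borough instead of one per remaining column.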


We can see that 77 rows don't have a borough assigned. We remove these rows from our data frame.

In [4]:
df = df_toronto[df_toronto["Borough"]!="Not assigned"].reset_index(drop=True)
print("The number of rows is now: ",df.shape[0])
df.head()

The number of rows is now:  103


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


More than one neighborhood can exist in one postal code area. For example, M5A covers two neighborhoods: Regent Park and Harbourfront. We will check whether any postal code appears in more than one row, so that such rows can be merged.

In [5]:
duplicated = df["Postal Code"].duplicated().value_counts()
duplicated

False    103
Name: Postal Code, dtype: int64

As we can see, there are no duplicated postal codes.

If a row has a borough but its neighborhood is "Not assigned", then the neighborhood should be the same as the borough.
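Had the check found duplicates, the rows could be merged by grouping on the postal code and joining the neighborhood names. The following is a minimal sketch on made-up data in which M5A is deliberately split across two rows; `toy` and `merged` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy data for illustration: M5A is split across two rows on purpose.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group by postal code and borough, then join the neighbourhood names
# of each group into one comma-separated string.
merged = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(merged)
```

Passing `sort=False` keeps the postal codes in their order of first appearance rather than sorting them alphabetically.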

In [6]:
# Check whether any row with a borough has no neighborhood assigned
df[df["Neighbourhood"]=="Not assigned"]

Unnamed: 0,Postal Code,Borough,Neighbourhood


We can see that no borough has a "Not assigned" neighborhood.
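If such rows had existed, the rule stated earlier (use the borough as the neighborhood) could be applied with a boolean mask. This is only a sketch on made-up rows; the frame `toy` and the mask name `missing` are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration: the first one has no neighbourhood assigned.
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Islington Avenue"],
})

# Select the rows whose neighbourhood is missing...
missing = toy["Neighbourhood"] == "Not assigned"

# ...and copy the borough name into the neighbourhood column for them.
toy.loc[missing, "Neighbourhood"] = toy.loc[missing, "Borough"]
print(toy)
```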

In [7]:
print("Final size of our data frame: ", df.shape)

Final size of our data frame:  (103, 3)


## 2. Add the coordinates of each postal code to the data frame

In [9]:
!pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [12]:
df["Postal Code"].head()

0    M3A
1    M4A
2    M5A
3    M6A
4    M7A
Name: Postal Code, dtype: object

In [15]:
# Geocoder lookup kept as a string literal so that it does not run;
# `postal_code` would have to be defined before enabling it.
"""import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]"""

"import geocoder # import geocoder\n\n# initialize your variable to None\nlat_lng_coords = None\n\n# loop until you get the coordinates\nwhile(lat_lng_coords is None):\n  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))\n  lat_lng_coords = g.latlng\n\nlatitude = lat_lng_coords[0]\nlongitude = lat_lng_coords[1]"
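The disabled loop above retries until a result arrives, so it would never finish if the service kept returning nothing. A bounded variant could look like the sketch below. The helper `fetch_latlng` and its parameters `lookup`, `max_tries`, and `wait_seconds` are hypothetical names introduced here; the lookup is passed in as a function so the example runs without network access.

```python
import time

def fetch_latlng(postal_code, lookup, max_tries=5, wait_seconds=0.0):
    """Return (latitude, longitude), or None after max_tries failed lookups.

    `lookup` is any function that takes a query string and returns
    [lat, lng] or None. With the geocoder package it could be, for example,
    lambda q: geocoder.google(q).latlng (an assumption, not tested here).
    """
    query = "{}, Toronto, Ontario".format(postal_code)
    for _ in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(wait_seconds)  # pause before the next attempt
    return None

# Stand-in lookup for illustration: fails twice, then succeeds.
calls = {"n": 0}

def fake_lookup(query):
    calls["n"] += 1
    return [43.65426, -79.360636] if calls["n"] >= 3 else None

result = fetch_latlng("M5A", fake_lookup)
never = fetch_latlng("M5A", lambda q: None, max_tries=2)
print(result, never)
```

Returning `None` after the last attempt lets the caller decide what to do next, such as falling back to a file of precomputed coordinates.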

In [14]:
df_coords = pd.read_csv("http://cocl.us/Geospatial_data")
df_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [16]:
df_mrg = pd.merge(df, df_coords, on="Postal Code")
df_mrg.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
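The default inner merge silently drops any postal code that has no match in the coordinates table. One way to make that visible is a left merge with `indicator=True`, sketched below on made-up frames; `M0X` is an invented code with no coordinates, and the names `left`, `coords`, `checked`, and `unmatched` are for illustration only.

```python
import pandas as pd

# Toy frames for illustration; "M0X" is a made-up code with no coordinates.
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Keep every row of `left`, fail if a postal code repeats on either side,
# and record in the `_merge` column where each row came from.
checked = pd.merge(
    left, coords, on="Postal Code",
    how="left", validate="one_to_one", indicator=True,
)

# Postal codes that found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

Comparing the row count before and after the merge (103 in this notebook) is a simpler check that serves the same purpose.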
