# Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Scraping and preprocessing postal code data

We will obtain Toronto postal code data from the table at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and convert it to a pandas DataFrame.

In [1]:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'


In [2]:
# Use pandas to read the first table on the web page into a DataFrame
df_postalcodes = pd.read_html(url)[0]


In [3]:
# Inspect the top rows of the DataFrame
df_postalcodes.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


Let's look at the initial shape of the table before we refine it.

In [4]:
df_postalcodes.shape

(180, 3)

It initially has 180 records and 3 columns.

## Preprocessing data

The requirements state that only rows with an assigned borough should be processed, so we will drop every row whose borough is "Not assigned".

In [5]:
df_postalcodes = df_postalcodes[df_postalcodes['Borough'] != "Not assigned"]
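The same filter can be tried on a small hand-made frame. This is a minimal sketch using three rows copied from the table above; the `.copy()` and `reset_index(drop=True)` calls are optional additions that are not part of the notebook. They make the result an independent frame with a clean 0-based index, which tends to avoid chained-assignment warnings when columns are added later.

```python
import pandas as pd

# Three rows taken from the table above
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, as an independent copy
assigned = sample[sample["Borough"] != "Not assigned"].copy()

# Optional: renumber the rows from 0
assigned = assigned.reset_index(drop=True)
print(assigned)
```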

In [6]:
df_postalcodes

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


The requirements also state that a postal code assigned to multiple neighbourhoods should appear as a single row, with the neighbourhoods separated by commas. So we will check whether any postal code appears in more than one row.

In [7]:
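# Count how many rows each postal code appears in
occurrences = df_postalcodes['Postal Code'].value_counts()

# Report any postal code that appears in more than one row
for code, count in occurrences.items():
    if count > 1:
        print(code, "appears more than once")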

Running this code prints nothing, so no postal code appears more than once. Inspecting the table further reveals that the Wikipedia page has been updated to group all neighbourhoods with the same postal code together.
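Had any postal code appeared in more than one row, the rows would need to be merged. A minimal sketch of one way to do that with `groupby`, assuming a layout with one row per neighbourhood; the rows below are hypothetical and only illustrate the shape of such data.

```python
import pandas as pd

# Hypothetical layout: one row per neighbourhood, so M5A appears twice
rows = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Collapse to one row per postal code, joining neighbourhoods with commas.
# sort=False keeps the postal codes in their order of first appearance.
grouped = (
    rows.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
        .agg(", ".join)
        .reset_index()
)
print(grouped)
```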

Next, we are to assign the borough name to the neighbourhood column if the neighbourhood is not assigned. Let's create a boolean mask of all rows with unassigned neighbourhoods and filter the DataFrame to see if there are any occurrences.

In [8]:
# Boolean mask of rows whose neighbourhood is unassigned
unassigned = df_postalcodes['Neighbourhood'] == 'Not assigned'
df_postalcodes[unassigned]

Unnamed: 0,Postal Code,Borough,Neighbourhood


The output table is empty, indicating there are no postal codes without assigned neighbourhoods left in the dataframe and no further processing is required. 
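If such rows did exist, the same kind of boolean mask could drive the replacement. A minimal sketch on hypothetical data (the second row is invented to have a borough but no neighbourhood):

```python
import pandas as pd

# Hypothetical rows; the second has a borough but no neighbourhood
rows = pd.DataFrame({
    "Postal Code": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Copy the borough name into the neighbourhood column where it is unassigned
unassigned = rows["Neighbourhood"] == "Not assigned"
rows.loc[unassigned, "Neighbourhood"] = rows.loc[unassigned, "Borough"]
print(rows)
```

Using `.loc` with the mask on both sides updates only the masked rows and leaves the others untouched.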

Finally, let's see the shape of the refined table.

In [9]:
df_postalcodes.shape

(103, 3)

There are 103 records and 3 columns in the table.

## Part 2: Obtaining location data

We will now use the geocoder Python package to get the latitude and longitude coordinates of each postal code.

In [10]:
!pip install geocoder # install geocoder library
import geocoder # import geocoder



We will loop through each postal code in the DataFrame and call the geocoder service. The service is unreliable and sometimes needs several calls before it returns a response, so for each code we repeat the call until a response is received.


In [33]:
lat = []
long = []

for code in df_postalcodes['Postal Code']:
    lat_lng_coords = None

    # Retry until the geocoder returns coordinates for this postal code
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
        lat_lng_coords = g.latlng
        print(code, g.latlng)

    lat.append(lat_lng_coords[0])
    long.append(lat_lng_coords[1])


M3A [43.75245000000007, -79.32990999999998]
M4A [43.73057000000006, -79.31305999999995]
M5A [43.65512000000007, -79.36263999999994]
M6A [43.72327000000007, -79.45041999999995]
M7A [43.66253000000006, -79.39187999999996]
M9A [43.662630000000036, -79.52830999999998]
M1B [43.811390000000074, -79.19661999999994]
M3B [43.74923000000007, -79.36185999999998]
M4B [43.70718000000005, -79.31191999999999]
M5B [43.65739000000008, -79.37803999999994]
M6B [43.70687000000004, -79.44811999999996]
M9B [43.65034000000003, -79.55361999999997]
M1C [43.78574000000003, -79.15874999999994]
M3C [43.72168000000005, -79.34351999999996]
M4C [43.68970000000007, -79.30681999999996]
M5C [43.65215000000006, -79.37586999999996]
M6C [43.69211000000007, -79.43035999999995]
M9C [43.64857000000006, -79.57824999999997]
M1E [43.765750000000025, -79.17469999999997]
M4E [43.67709000000008, -79.29546999999997]
M5E [43.64536000000004, -79.37305999999995]
M6E [43.68784000000005, -79.45045999999996]
M1G [43.76812000000007, -79.2
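The `while` loop above never gives up, so a postal code that the service cannot resolve would hang the notebook. A minimal sketch of a bounded retry, assuming a hypothetical helper `lookup_with_retry` and a stand-in `flaky_lookup` function in place of the real service; with the geocoder package the lookup could be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time

def lookup_with_retry(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay)  # pause before the next attempt
    return None  # caller decides how to handle a failed lookup

# Stand-in for an unreliable service: returns None twice, then coordinates
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.75245, -79.32991]

coords = lookup_with_retry(flaky_lookup, "M3A, Toronto, Ontario")
print(coords)
```

Returning `None` after the last attempt lets the caller record a missing value or skip the row instead of looping forever.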

Now let's append those latitude and longitude lists to the DataFrame.

In [38]:
df_postalcodes['Lattitude'] = lat
df_postalcodes['Longitude'] = long

df_postalcodes.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Lattitude,Longitude
2,M3A,North York,Parkwoods,43.75245,-79.32991
3,M4A,North York,Victoria Village,43.73057,-79.31306
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
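Assigning plain lists as columns matches values to rows by position, so the lists must be the same length as the frame and in the same order. A minimal sketch of the same step using `DataFrame.assign`, which returns a new frame rather than modifying a filtered one in place; the two-row frame is a cut-down stand-in for `df_postalcodes`, with coordinates taken from the output above.

```python
import pandas as pd

# Cut-down stand-in for the filtered frame (note the non-zero-based index)
codes = pd.DataFrame({"Postal Code": ["M3A", "M4A"]}, index=[2, 3])
lat = [43.75245, 43.73057]
long = [-79.32991, -79.31306]

# Lists are matched to rows by position, so the lengths must agree
assert len(lat) == len(long) == len(codes)

# assign returns a new DataFrame with the extra columns
with_coords = codes.assign(Lattitude=lat, Longitude=long)
print(with_coords)
```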

