# Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Scraping and preprocessing postal code data

We will obtain Toronto postal code data from the table at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and convert it to a pandas DataFrame.

In [1]:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'


In [2]:
# Use pandas to read the first table on the webpage into a DataFrame
df_postalcodes = pd.read_html(url)[0]


In [3]:
# Inspect the top rows of the DataFrame
df_postalcodes.head(10)

|  | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 7 | M8A | Not assigned | Not assigned |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |


Let's check the initial shape of the table before we refine it.

In [4]:
df_postalcodes.shape

(180, 3)

It initially has 180 records and 3 columns.

## Preprocessing data

The requirements say to process only rows that have an assigned borough, so we drop every row whose borough is "Not assigned".

In [5]:
df_postalcodes = df_postalcodes[df_postalcodes['Borough'] != "Not assigned"]

In [6]:
df_postalcodes

|  | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 165 | M4Y | Downtown Toronto | Church and Wellesley |
| 168 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 169 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |

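The filtered output keeps the index labels of the original table, which is why the rows start at 2 and skip values. The requirements do not ask for renumbering, so the following is only an optional sketch: it uses a three-row sample copied from the table above, and the names `sample`, `filtered`, and `renumbered` are illustrative rather than part of the notebook.

```python
import pandas as pd

# Small sample taken from the first rows of the scraped table.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Same filter as in the notebook: keep rows with an assigned borough.
filtered = sample[sample["Borough"] != "Not assigned"]
print(filtered.index.tolist())    # index labels carried over from the original frame

# Optional: renumber the rows from 0 and discard the old labels.
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())
```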

The requirements state that when a postal code is assigned to multiple neighbourhoods, those rows should be combined into one, with the neighbourhoods separated by commas. So we first check whether any postal code appears in more than one row.

In [7]:
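# Count how many rows each postal code appears in
occurrences = df_postalcodes['Postal Code'].value_counts()

# Report any postal code that appears in more than one row
for postal_code, count in occurrences.items():
    if count > 1:
        print(postal_code, "appears more than once")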

Running this code prints nothing, so no postal code appears more than once. Inspecting the table further reveals that the Wikipedia page has been updated to group all neighbourhoods with the same postal code together.

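No grouping is needed for the current table, but had duplicates been present, the requirement could be met roughly as follows. This is a minimal sketch on hypothetical sample data in an ungrouped layout (the `sample` frame and its split M5A rows are made up for illustration); it groups on postal code and borough and joins the neighbourhood names with commas.

```python
import pandas as pd

# Hypothetical ungrouped data: M5A is listed once per neighbourhood.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Regent Park", "Harbourfront"],
})

# One row per postal code, with its neighbourhoods joined by ", ".
# sort=False keeps the groups in order of first appearance.
grouped = (
    sample.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```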
Next, we are to assign the borough name to the Neighbourhood column wherever the neighbourhood is not assigned. Let's create a boolean mask of all rows with unassigned neighbourhoods and filter the DataFrame to see whether there are any.

In [8]:
# Boolean mask: True for rows whose neighbourhood is unassigned
unassigned_mask = df_postalcodes['Neighbourhood'] == 'Not assigned'
df_postalcodes[unassigned_mask]

|  | Postal Code | Borough | Neighbourhood |
|---|---|---|---|


The output table is empty, indicating that no remaining postal code lacks an assigned neighbourhood, so no further processing is required.

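If the mask had matched any rows, the borough name could be copied into the Neighbourhood column with `.loc`. The sketch below assumes a hypothetical two-row `sample` frame in which one neighbourhood is unassigned; it is not data from the current table.

```python
import pandas as pd

# Hypothetical data: the first row has a borough but no neighbourhood.
sample = pd.DataFrame({
    "Postal Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Boolean mask of rows whose neighbourhood is unassigned.
unassigned = sample["Neighbourhood"] == "Not assigned"

# Copy the borough name into the Neighbourhood column for those rows only.
sample.loc[unassigned, "Neighbourhood"] = sample.loc[unassigned, "Borough"]
print(sample)
```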
Finally, let's see the shape of the refined table.

In [9]:
df_postalcodes.shape

(103, 3)

There are 103 records and 3 columns in the table.

## Part 2: 