<h1> Segmenting and Clustering Neighbourhoods in Toronto </h1>

<h3> Author: Matias Garib </h3>

In [248]:
import pandas as pd
import numpy as np
import json 

<h2> Part 1: Scraping Wikipedia with BeautifulSoup </h2>

We first import the libraries needed to scrape the page: urllib and BeautifulSoup.

In [249]:
import urllib.request
from bs4 import BeautifulSoup as bs 

We then specify the URL, fetch the page, and parse it into a BeautifulSoup object.

In [258]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup=bs(page,"lxml")

We then locate the table by its CSS classes, iterate over its rows to fill one list per column, and build the DataFrame from those lists.

In [259]:

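# Locate the postal-code table by its CSS classes
right_table=soup.find('table', class_='wikitable sortable')

A=[]  # postal codes
B=[]  # boroughs
C=[]  # neighbourhoods
for row in right_table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==3:  # skip header rows, which contain no <td> cells
        # find(string=True) returns the first text node, trailing '\n' included
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))

# Build the DataFrame one column at a time
df=pd.DataFrame(A, columns=['Postal Code'])
df['Borough']=B
df['Neighbourhood']=C
df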

,Postal Code,Borough,Neighbourhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
...,...,...,...
175,M5Z\n,Not assigned\n,Not assigned\n
176,M6Z\n,Not assigned\n,Not assigned\n
177,M7Z\n,Not assigned\n,Not assigned\n
178,M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor,..."
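
The trailing `\n` on every cell appears because the loop takes the first text node of each `<td>` as it is. As a minimal sketch of an alternative, assuming a small inline HTML sample instead of the live Wikipedia page, `get_text(strip=True)` returns the cell text with the surrounding whitespace already removed. The `sample_html` string and the `parse_rows` helper are hypothetical names used only for illustration, and the built-in `html.parser` backend is used so the sketch does not depend on lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the layout of the Wikipedia table
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

def parse_rows(html):
    """Return one dict per three-cell table row, with whitespace stripped."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) == 3:  # the header row has <th> cells only, so it is skipped
            rows.append({
                "Postal Code": cells[0].get_text(strip=True),
                "Borough": cells[1].get_text(strip=True),
                "Neighbourhood": cells[2].get_text(strip=True),
            })
    return rows

sample_df = pd.DataFrame(parse_rows(sample_html))
print(sample_df)
```

With this approach the later newline clean-up would not be needed, although the 'Not assigned' rows would still have to be filtered out.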


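We then clean the DataFrame: strip the newline characters, drop the rows whose borough is 'Not assigned', confirm that no unassigned boroughs or neighbourhoods remain, and reset the index.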

In [260]:
# Strip the trailing newline characters left over from scraping
df['Postal Code'] = df['Postal Code'].str.replace('\n', '')
df['Borough'] = df['Borough'].str.replace('\n', '')
df['Neighbourhood'] = df['Neighbourhood'].str.replace('\n', '')

# Delete rows where Borough is 'Not assigned'
index_drops = df[df['Borough'] == 'Not assigned'].index
df.drop(index_drops, inplace=True)

# Confirm that no remaining row has Neighbourhood equal to 'Not assigned'
numOfRows = (df['Neighbourhood'] == 'Not assigned').sum()
print("Number of Rows with neighbourhood Not Assigned:", numOfRows)

# Confirm that no remaining row has Borough equal to 'Not assigned'
numOfRows = (df['Borough'] == 'Not assigned').sum()
print("Number of Rows with Borough Not Assigned:", numOfRows)

# Reset the index so it runs from 0 again after the rows were dropped
df.reset_index(drop=True, inplace=True)

Number of Rows with neighbourhood Not Assigned: 0
Number of Rows with Borough Not Assigned: 0
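
The same clean-up can also be expressed with boolean masks. The following is a minimal sketch on a hand-made three-row sample; the `sample` DataFrame and the variable names are assumptions for illustration, not the scraped data. It uses `str.strip()` to remove surrounding whitespace, which covers the trailing `\n` seen above.

```python
import pandas as pd

# Hypothetical sample with the same columns and trailing newlines as the scraped table
sample = pd.DataFrame({
    "Postal Code": ["M1A\n", "M3A\n", "M4A\n"],
    "Borough": ["Not assigned\n", "North York\n", "North York\n"],
    "Neighbourhood": ["Not assigned\n", "Parkwoods\n", "Victoria Village\n"],
})

# Strip surrounding whitespace (including '\n') from every column
cleaned = sample.apply(lambda col: col.str.strip())

# Keep only rows with an assigned borough, then renumber the index from 0
cleaned = cleaned[cleaned["Borough"] != "Not assigned"].reset_index(drop=True)

# Count the remaining unassigned neighbourhoods with a boolean mask
unassigned = int((cleaned["Neighbourhood"] == "Not assigned").sum())
print(cleaned)
print("Unassigned neighbourhoods:", unassigned)
```

Filtering with a mask returns a new DataFrame, so the sketch avoids `inplace=True` and leaves `sample` unchanged.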


Finally, we preview the first rows of the cleaned DataFrame and check its shape.

In [261]:
df.head(10)


,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [262]:
df.shape

(103, 3)

<h2> Part 2: Importing Latitudes and Longitudes of each Postal Code </h2>

**NOTE:** I tried to get the coordinates with the Geocoder package, but the lookups did not return them, so I am using the CSV file as plan B.
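
The note does not say how the Geocoder lookups were attempted. One common pattern for an unreliable lookup service is a bounded retry that reports failure so the caller can fall back to another source, such as the CSV file. The sketch below illustrates only that pattern: `coordinates_with_retry`, `lookup`, and `flaky_lookup` are hypothetical names, and no real Geocoder call is made.

```python
import time

def coordinates_with_retry(postal_code, lookup, max_attempts=3, delay_seconds=0.0):
    """Call lookup(postal_code) up to max_attempts times.

    lookup is a hypothetical callable that returns a (latitude, longitude)
    tuple, or None when the service has no answer. Returns None if every
    attempt fails, so the caller can fall back to another data source.
    """
    for _ in range(max_attempts):
        result = lookup(postal_code)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # pause after a failed attempt
    return None

# Stand-in lookup that fails twice and then succeeds, for demonstration only
calls = {"count": 0}

def flaky_lookup(postal_code):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return (43.806686, -79.194353)

print(coordinates_with_retry("M1B", flaky_lookup))
```

Capping the number of attempts keeps the notebook from hanging when the service never responds.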

We start by importing the CSV file.

In [263]:
url="http://cocl.us/Geospatial_data"
post_codes=pd.read_csv(url)
post_codes

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


We then attach the latitude and longitude to each postal code by joining the two DataFrames on their 'Postal Code' index.

In [264]:
df=df.set_index('Postal Code').join(post_codes.set_index('Postal Code'))

In [268]:
df.reset_index()

Postal Code,Borough,Neighbourhood,Latitude,Longitude
M3A,North York,Parkwoods,43.753259,-79.329656
M4A,North York,Victoria Village,43.725882,-79.315572
M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...
M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
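
An index-based `join` is one way to combine the two tables. As a minimal sketch of an equivalent approach, a left `merge` on the 'Postal Code' column keeps every neighbourhood row in its original order and makes unmatched postal codes easy to spot. The sample rows below are assumptions for illustration: the coordinates are copied from the output above, while `M9Z` and its borough are made-up placeholders that have no match in the coordinate table.

```python
import io
import pandas as pd

# Hypothetical rows in the same layout as the neighbourhood table
neighbourhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example Borough"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Example Area"],
})

# In-memory stand-in for the coordinates CSV, deliberately in a different row order
coords_csv = io.StringIO(
    "Postal Code,Latitude,Longitude\n"
    "M4A,43.725882,-79.315572\n"
    "M3A,43.753259,-79.329656\n"
)
coords = pd.read_csv(coords_csv)

# A left merge keeps every neighbourhood row in its original order;
# postal codes without a match get NaN coordinates
merged = neighbourhoods.merge(coords, on="Postal Code", how="left")

# List the postal codes that did not receive coordinates
missing = merged.loc[merged["Latitude"].isna(), "Postal Code"].tolist()
print(merged)
print("Postal codes without coordinates:", missing)
```

Checking for missing coordinates right after the merge catches mismatched postal codes before they reach any later mapping or clustering step.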
