<h1>Segmenting and Clustering Neighbourhoods in Toronto</h1>

<h3>Author: Matias Garib</h3>

In [115]:
import pandas as pd
import numpy as np
import json 

<h2>Part 1: Scraping Wikipedia with BeautifulSoup</h2>

I first import the libraries needed to scrape the page: urllib and BeautifulSoup.

In [116]:
import urllib.request
from bs4 import BeautifulSoup as bs 

I then specify the URL, fetch the page, and parse it into a BeautifulSoup object.

In [117]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup=bs(page,"lxml")

Next, I locate the table with BeautifulSoup's find function, iterate over its rows to fill one list per column, and build a DataFrame from those lists.

In [118]:

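# Locate the postal code table by its CSS classes
right_table = soup.find('table', class_='wikitable sortable')

# One list per column: postal code, borough, neighbourhood
A = []
B = []
C = []
for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    # Data rows have exactly three <td> cells; rows without them are skipped
    if len(cells) == 3:
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))

# Assemble the three lists into a DataFrame
df = pd.DataFrame(A, columns=['Postal Code'])
df['Borough'] = B
df['Neighbourhood'] = C
df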

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
...,...,...,...
175,M5Z\n,Not assigned\n,Not assigned\n
176,M6Z\n,Not assigned\n,Not assigned\n
177,M7Z\n,Not assigned\n,Not assigned\n
178,M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor,..."
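
The row-by-row extraction can be tried offline on a small fragment. The sketch below is only an illustration: `sample_html` is a hand-written stand-in for the Wikipedia table (it is an assumption, not the real page), and it uses Python's built-in `html.parser` instead of `lxml` so that nothing extra needs to be installed. It also uses `get_text(strip=True)`, which removes the trailing newlines at extraction time rather than in a later cleaning step.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the layout of the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    # The header row contains <th> cells only, so it yields no <td> and is skipped.
    if len(cells) == 3:
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Collecting each row as one list, instead of keeping three parallel lists, means the result can be passed directly to `pd.DataFrame(rows, columns=[...])`.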


I then clean up the DataFrame: remove the trailing newline characters, drop the rows whose borough is not assigned, and reset the index.

In [135]:
# Remove the trailing newline from every value
df['Postal Code'] = df['Postal Code'].str.replace('\n', '', regex=False)
df['Borough'] = df['Borough'].str.replace('\n', '', regex=False)
df['Neighbourhood'] = df['Neighbourhood'].str.replace('\n', '', regex=False)

# Delete rows where Borough is not assigned
index_drops = df[df['Borough'] == 'Not assigned'].index
df.drop(index_drops, inplace=True)

# Confirm that no remaining row has 'Neighbourhood' equal to "Not assigned"
numOfRows = (df['Neighbourhood'] == 'Not assigned').sum()
print("Number of Rows with neighbourhood Not Assigned:", numOfRows)

# Confirm that no remaining row has 'Borough' equal to "Not assigned"
numOfRows = (df['Borough'] == 'Not assigned').sum()
print("Number of Rows with Borough Not Assigned:", numOfRows)

# Reset the index so that it runs from 0 without gaps
df.reset_index(drop=True, inplace=True)

Number of Rows with neighbourhood Not Assigned: 0
Number of Rows with Borough Not Assigned: 0
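
To make the cleaning steps easier to follow, here is a minimal sketch of the same idea on a three-row toy DataFrame. The name `toy` is hypothetical and the rows are copied from the scraped output above; the sketch strips whitespace with `str.strip()` and filters with a boolean mask, which is one common alternative to dropping rows by index, not necessarily the only way to do it.

```python
import pandas as pd

# Toy rows taken from the scraped output, trailing newlines included.
toy = pd.DataFrame({
    "Postal Code": ["M1A\n", "M3A\n", "M4A\n"],
    "Borough": ["Not assigned\n", "North York\n", "North York\n"],
    "Neighbourhood": ["Not assigned\n", "Parkwoods\n", "Victoria Village\n"],
})

# Strip leading and trailing whitespace (including "\n") from every column.
toy = toy.apply(lambda col: col.str.strip())

# Keep only rows with an assigned borough, then renumber the index from 0.
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

print(toy)
```

Here the filter returns a new DataFrame instead of modifying the original in place, so each step can be inspected on its own.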


Finally, I preview the cleaned DataFrame and check its shape.

In [145]:
df.head(10)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [144]:
df.shape

(103, 3)