### IBM - Scraping & Clustering Toronto Neighbourhoods

In [1]:
# import urllib library
import urllib.request

In [2]:
# url for Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
# open the url using the imported urllib.request and assign HTML to a variable
page = urllib.request.urlopen(url)

In [4]:
# import BeautifulSoup library. Needed to parse HTML and XML documents
from bs4 import BeautifulSoup

In [5]:
# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")
# lxml is a Python library which allows for easy handling of XML and HTML files

The table we need can be found by identifying the following piece of code in the HTML:

`<table class="wikitable sortable">`

Each row of the table starts with `<tr>` and ends with `</tr>`.
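
As a minimal sketch of the lookup described above, the snippet below runs the same kind of search on a small, hand-written HTML string instead of the live page. The sample HTML and the `sample_*` names are assumptions made for illustration, and `"html.parser"` is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML that mimics the structure of the Wikipedia table
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# "html.parser" ships with Python, so this sketch does not need lxml
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python
sample_table = sample_soup.find("table", class_="wikitable sortable")

# Each <tr>...</tr> pair is one row; the first row holds the <th> header cells
sample_rows = sample_table.find_all("tr")
header_cells = [th.get_text(strip=True) for th in sample_rows[0].find_all("th")]

print(len(sample_rows))
print(header_cells)
```

Counting the rows this way is a quick sanity check that the right table was selected before any data is extracted.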

In [9]:
# Use the find method to locate the first table with class "wikitable sortable"
right_table = soup.find('table', class_='wikitable sortable')
right_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td>

There are three columns in our table that we want to scrape the data from, so we will set up three empty lists (A, B, C) to store our data in:

In [13]:
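# One list per column: postal code, borough, neighbourhood
A = []
B = []
C = []

# find_all is the current name for the older findAll alias
for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    # The header row contains only <th> cells, so it is skipped here
    if len(cells) == 3:
        # find(string=True) returns the first text node, trailing newline included
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))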
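
The loop above stores the raw text node of each cell, which keeps any trailing newline from the HTML. As a hedged alternative sketch, `get_text(strip=True)` returns a plain string with surrounding whitespace removed, and the rows can be collected as tuples instead of three separate lists. The sample HTML and the names `sample_html`, `sample_table` and `records` are hypothetical and not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample rows; each cell's text ends in a newline, like the real page
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
<tr><td>M5A
</td><td>Downtown Toronto
</td><td>Regent Park, Harbourfront
</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

records = []
for row in sample_table.find_all("tr"):
    cells = row.find_all("td")
    # The header row has <th> cells only, so it yields no <td> and is skipped
    if len(cells) == 3:
        # get_text(strip=True) returns a plain str with surrounding whitespace removed
        records.append(tuple(cell.get_text(strip=True) for cell in cells))

print(records)
```

A list of tuples like `records` can be passed straight to `pd.DataFrame(records, columns=[...])`, which would avoid a separate clean-up pass over the cells.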

In [146]:
# Construct DataFrame and assign headers
import pandas as pd
headers = ["PostalCode", "Borough", "Neighbourhood"]
df = pd.DataFrame(columns=headers)
df

Unnamed: 0,PostalCode,Borough,Neighbourhood


In [147]:
# Add data into DataFrame
df["PostalCode"] = A
df["Borough"] = B
df["Neighbourhood"] = C
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [148]:
# check the size of the table matches the size of the table in Wikipedia
df.shape

(180, 3)

In [149]:
df.dtypes

PostalCode       object
Borough          object
Neighbourhood    object
dtype: object

In [150]:
df.iloc[0,1]

'Not assigned\n'

We need to remove the trailing `\n` from all entries in the DataFrame:

In [151]:
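# Strip the trailing newline from every cell; rstrip leaves cells without one unchanged
for j in range(df.shape[1]):
    for i in range(df.shape[0]):
        df.iloc[i, j] = df.iloc[i, j].rstrip("\n")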

In [152]:
df.iloc[0,1]

'Not assigned'
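
The cell-by-cell clean-up above can also be written with pandas' vectorized string methods. The following is a minimal sketch on a hypothetical three-row `sample_df` (not the scraped data), assuming every column holds strings.

```python
import pandas as pd

# Hypothetical three-row sample with the same trailing newlines as the scraped data
sample_df = pd.DataFrame({
    "PostalCode": ["M1A\n", "M3A\n", "M4A\n"],
    "Borough": ["Not assigned\n", "North York\n", "North York\n"],
    "Neighbourhood": ["Not assigned\n", "Parkwoods\n", "Victoria Village\n"],
})

# Series.str.rstrip removes the newline only where it is present,
# so it is safe even if some cells are already clean
for column in sample_df.columns:
    sample_df[column] = sample_df[column].str.rstrip("\n")

print(sample_df.iloc[0, 1])
```

Working one column at a time avoids hard-coding the number of rows, so the same code still applies if the size of the source table changes.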

In [153]:
# Build a boolean mask that is False for rows whose borough is "Not assigned"
condition = df["Borough"]!='Not assigned'
condition.head()

0    False
1    False
2     True
3     True
4     True
Name: Borough, dtype: bool

In [154]:
# Filter DataFrame
df = df[condition]
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [156]:
# Reset Index
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
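
The mask, filter and index reset steps above can be chained into a single expression. This is a sketch on a hypothetical four-row `sample_df`, with `assigned` as an illustrative name. It returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical sample: two unassigned postal codes followed by two assigned ones
sample_df = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows with an assigned borough, then renumber the index from 0
assigned = sample_df[sample_df["Borough"] != "Not assigned"].reset_index(drop=True)

print(assigned)
```

Passing `drop=True` discards the old index rather than keeping it as an extra column, which matches the behaviour of the reset step above.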


In [157]:
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [158]:
df.shape

(103, 3)