# Web Scraping: Toronto Postal Codes
by: Kanishk Kumar (India)
<hr>

We will begin by scraping [this](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) Wikipedia page.

Objective:
- Extract the table of Toronto postal codes from the HTML page and transform it into a pandas dataframe.

Let's import some basic libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

## Import the Document
Let's load the saved .html page and explore its contents.

In [2]:
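with open('datasets/List of postal codes of Canada_ M - Wikipedia.html', encoding='utf8') as file:
    # Naming the parser explicitly avoids a warning and keeps the result consistent across machines.
    soup = BeautifulSoup(file, 'html.parser')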
soup

<!DOCTYPE html>
<!-- saved from url=(0063)https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M --><html class="client-js ve-available" dir="ltr" lang="en"><head><script>
        (function() {
            // If GPC on, set DOM property to true if not already true
            if (true) {
                if (navigator.globalPrivacyControl) return
                Object.defineProperty(navigator, 'globalPrivacyControl', {
                    value: true,
                    enumerable: true
                })
            } else {
                // If GPC off, set DOM property prototype to false so it may be overwritten
                // with a true value by user agent or other extensions
                if (typeof navigator.globalPrivacyControl !== "undefined") return
                Object.defineProperty(Object.getPrototypeOf(navigator), 'globalPrivacyControl', {
                    value: false,
                    enumerable: true
                })
            }
           

Now, we will make a dataframe consisting of three columns: __PostalCode__, __Borough__, and __Neighbourhood__.

Let's scrape the column names from the HTML page first.

In [3]:
# Finding all the table headers.
soup.find_all('th')

[<th class="headerSort" role="columnheader button" tabindex="0" title="Sort ascending">Postal Code
 </th>,
 <th class="headerSort" role="columnheader button" tabindex="0" title="Sort ascending">Borough
 </th>,
 <th class="headerSort" role="columnheader button" tabindex="0" title="Sort ascending">Neighbourhood
 </th>,
 <th class="navbox-title" style="font-size:110%"><a href="https://en.wikipedia.org/wiki/Postal_codes_in_Canada" title="Postal codes in Canada">Canadian postal codes</a>
 </th>]

Notice that the list has 4 elements. __We only need the text of the first three elements (without the tags and surrounding whitespace)__ as the column names of our dataframe.

In [4]:
# Initializing an empty list.
columns = []

# Looping through the first 3 header cells of the page.
for th in soup.find_all('th')[:3]:
    column = list(th.stripped_strings)[0]
    column = column.replace(" ", "")
    columns.append(column)

print(columns)

# Creating an empty dataframe
toronto_df = pd.DataFrame(columns=columns)
toronto_df.head()

['PostalCode', 'Borough', 'Neighbourhood']


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
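
The same header extraction can be tried on a small, self-contained snippet. The sketch below is an illustration only: the inline string `sample_html` is a made-up stand-in for the saved Wikipedia page, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: three table headers plus one navbox header.
sample_html = """
<table>
  <tr><th>Postal Code
  </th><th>Borough
  </th><th>Neighbourhood
  </th></tr>
</table>
<table><tr><th class="navbox-title">Canadian postal codes</th></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep the first three headers, strip surrounding whitespace, drop inner spaces.
sample_columns = [
    th.get_text(strip=True).replace(" ", "")
    for th in sample_soup.find_all("th")[:3]
]

print(sample_columns)
```

Slicing with `[:3]` keeps the navbox header out of the column names without having to inspect its class.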


Now, we will scrape the table data inside the page by finding all the table row tags (`<tr>`).

In [5]:
# Getting the first <table> tag and collecting all of its <tr> tags in a list.
tableData = soup.table.find_all('tr')

# Dropping the first row of the table since it contains only the table header tags.
tableData = tableData[1:]

print('The original postal code table contains {} rows'.format(len(tableData)))

The original postal code table contains 180 rows


Let's take a look at the first item in this list.

In [6]:
# Each element of the list is a tag object.
# stripped_strings yields the text of each cell with tags and surrounding whitespace removed.
list(tableData[0].stripped_strings)

['M1A', 'Not assigned', 'Not assigned']

We will only process rows that have an assigned borough, __ignoring rows whose borough is `Not assigned`.__ If a row has a borough but a `Not assigned` neighbourhood, then the neighbourhood will be the same as the borough.

In [7]:
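# Collecting the cleaned rows in a list; DataFrame.append was removed in pandas 2.0.
rows = []

for row in tableData:
    data = list(row.stripped_strings)
    
    # Skipping rows whose borough is not assigned.
    if data[1] == 'Not assigned':
        continue
    
    postalCode = data[0]
    borough = data[1]
    
    # Falling back to the borough when the neighbourhood is not assigned.
    if data[2] != 'Not assigned':
        neighborhood = data[2]
    else:
        neighborhood = borough
    
    rows.append({'PostalCode': postalCode,
                 'Borough': borough,
                 'Neighbourhood': neighborhood})

# Building the dataframe in one step from the collected rows.
toronto_df = pd.DataFrame(rows, columns=columns)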
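
The filtering rule can also be checked in isolation, without the HTML page. The following is a minimal sketch, not the notebook's own code: the helper `clean_row` and the `sample_rows` list are hypothetical names introduced for illustration, and the third sample row is made up to exercise the fallback case.

```python
def clean_row(cells):
    """Return (postal_code, borough, neighbourhood), or None if the row is skipped."""
    postal_code, borough, neighbourhood = cells

    # Rows without an assigned borough are ignored entirely.
    if borough == "Not assigned":
        return None

    # An unassigned neighbourhood falls back to the borough name.
    if neighbourhood == "Not assigned":
        neighbourhood = borough

    return postal_code, borough, neighbourhood


sample_rows = [
    ["M1A", "Not assigned", "Not assigned"],    # skipped
    ["M3A", "North York", "Parkwoods"],         # kept unchanged
    ["M9Z", "Example Borough", "Not assigned"], # made-up row: fallback applies
]

cleaned = [row for row in map(clean_row, sample_rows) if row is not None]
print(cleaned)
```

Keeping the rule in a small function like this makes it easy to test each case separately before running it over the full table.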

In [8]:
# Saving the dataframe to a CSV file without the index column.
toronto_df.to_csv('datasets/toronto_postal_codes.csv', index=False)

# Printing the first 5 rows of the dataframe.
toronto_df.head()

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


Finally, we can use the `.shape` attribute to print the number of rows and columns of our dataframe, along with the number of unique postal codes and boroughs.

In [9]:
print('The shape of our dataframe is {} with the following details:\n- {} rows\n- {} columns\n- {} unique postal codes\n- {} unique boroughs'.format(toronto_df.shape, toronto_df.shape[0], toronto_df.shape[1],
                                   len(toronto_df.PostalCode.unique()), len(toronto_df.Borough.unique())))

The shape of our dataframe is (103, 3) with the following details:
- 103 rows
- 3 columns
- 103 unique postal codes
- 10 unique boroughs
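
As a side note, the same summary can be computed with `Series.nunique()`, which counts distinct values directly. The sketch below assumes a small stand-in dataframe, `sample_df`, built from the first three rows shown earlier, so its numbers differ from the full table's.

```python
import pandas as pd

# Stand-in dataframe using the first three rows of the scraped table.
sample_df = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

# .shape is a (rows, columns) tuple; nunique() counts distinct values in a column.
n_rows, n_cols = sample_df.shape
summary = {
    "rows": n_rows,
    "columns": n_cols,
    "unique_postal_codes": sample_df["PostalCode"].nunique(),
    "unique_boroughs": sample_df["Borough"].nunique(),
}

for name, value in summary.items():
    print(f"- {value} {name.replace('_', ' ')}")
```

Comparing the row count with the number of unique postal codes is a quick way to confirm that no postal code appears twice.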
