## Importing relevant libraries

In [6]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

## Getting the HTML code of the Wikipedia page

Using the requests library, I stored the HTML of the Toronto postal code Wikipedia page as text in 'source', and then created a BeautifulSoup object, 'soup', to extract parts of the page such as tables and titles.

In [7]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
#print(source)
soup  = BeautifulSoup(source,'lxml')
#print(soup.prettify()) # prettify() shows the HTML in indented format.
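
The request above assumes the page always loads. As a minimal sketch, the fetch could be wrapped so that a failed or hanging request is noticed early. The helper name `fetch_html` and the 10-second timeout are my own assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors."""
    # timeout (in seconds) is an arbitrary illustrative value
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Usage (performs a network request, so it is left commented out here):
# source = fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
```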

In [8]:
# Playing around a little bit, extracting title
title = soup.title.text
print(title)

List of postal codes of Canada: M - Wikipedia


## Extracting the table and its values


In [9]:
tables = soup.find('table', class_ = "wikitable sortable")
#print(tables.prettify())

The table body is organized in table rows <tr>, and each row is split into either header cells <th> or data cells <td>.

I will first extract all rows and view them in text format. The '.text' attribute returns only the text inside an element, with the HTML tags stripped out.
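
To make the <tr> / <th> / <td> structure concrete, here is a small sketch on a hand-written table rather than the live page. The HTML string and the `*_demo` names are made up for illustration, and it uses Python's built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia table (illustrative data only)
html = (
    '<table class="wikitable sortable">'
    '<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>'
    '<tr><td>M3A</td><td>North York</td><td>Parkwoods\n</td></tr>'
    '</table>'
)

soup_demo = BeautifulSoup(html, 'html.parser')
table_demo = soup_demo.find('table', class_='wikitable sortable')
header_row, data_row = table_demo.find_all('tr')

# The header row holds <th> cells, the data row holds <td> cells
header_cells = [th.text for th in header_row.find_all('th')]
data_cells = [td.text for td in data_row.find_all('td')]

# get_text(strip=True) also trims surrounding whitespace such as '\n'
stripped_cells = [td.get_text(strip=True) for td in data_row.find_all('td')]

print(header_cells)    # ['Postcode', 'Borough', 'Neighbourhood']
print(data_cells)      # ['M3A', 'North York', 'Parkwoods\n']
print(stripped_cells)  # ['M3A', 'North York', 'Parkwoods']
```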

In [10]:
for row in tables.find_all('tr'):
    print(row.text)


Postcode
Borough
Neighbourhood


M1A
Not assigned
Not assigned


M2A
Not assigned
Not assigned


M3A
North York
Parkwoods


M4A
North York
Victoria Village


M5A
Downtown Toronto
Harbourfront


M5A
Downtown Toronto
Regent Park


M6A
North York
Lawrence Heights


M6A
North York
Lawrence Manor


M7A
Queen's Park
Not assigned


M8A
Not assigned
Not assigned


M9A
Etobicoke
Islington Avenue


M1B
Scarborough
Rouge


M1B
Scarborough
Malvern


M2B
Not assigned
Not assigned


M3B
North York
Don Mills North


M4B
East York
Woodbine Gardens


M4B
East York
Parkview Hill


M5B
Downtown Toronto
Ryerson


M5B
Downtown Toronto
Garden District


M6B
North York
Glencairn


M7B
Not assigned
Not assigned


M8B
Not assigned
Not assigned


M9B
Etobicoke
Cloverdale


M9B
Etobicoke
Islington


M9B
Etobicoke
Martin Grove


M9B
Etobicoke
Princess Gardens


M9B
Etobicoke
West Deane Park


M1C
Scarborough
Highland Creek


M1C
Scarborough
Rouge Hill


M1C
Scarborough
Port Union


M2C
Not assigned
Not assigned


## Listing the rows

The following code builds a list of the <td> cell texts for each row. The header row contains only <th> cells, so it yields an empty list.

In [11]:
for tr in tables.find_all('tr'):
    data = (tr.find_all('td'))
    row1 = [i.text for i in data]
    print(row1)

[]
['M1A', 'Not assigned', 'Not assigned\n']
['M2A', 'Not assigned', 'Not assigned\n']
['M3A', 'North York', 'Parkwoods\n']
['M4A', 'North York', 'Victoria Village\n']
['M5A', 'Downtown Toronto', 'Harbourfront\n']
['M5A', 'Downtown Toronto', 'Regent Park\n']
['M6A', 'North York', 'Lawrence Heights\n']
['M6A', 'North York', 'Lawrence Manor\n']
['M7A', "Queen's Park", 'Not assigned\n']
['M8A', 'Not assigned', 'Not assigned\n']
['M9A', 'Etobicoke', 'Islington Avenue\n']
['M1B', 'Scarborough', 'Rouge\n']
['M1B', 'Scarborough', 'Malvern\n']
['M2B', 'Not assigned', 'Not assigned\n']
['M3B', 'North York', 'Don Mills North\n']
['M4B', 'East York', 'Woodbine Gardens\n']
['M4B', 'East York', 'Parkview Hill\n']
['M5B', 'Downtown Toronto', 'Ryerson\n']
['M5B', 'Downtown Toronto', 'Garden District\n']
['M6B', 'North York', 'Glencairn\n']
['M7B', 'Not assigned', 'Not assigned\n']
['M8B', 'Not assigned', 'Not assigned\n']
['M9B', 'Etobicoke', 'Cloverdale\n']
['M9B', 'Etobicoke', 'Islington\n']
['M9B'

## Creating the DataFrame

To build a DataFrame, we need to collect these per-row lists into a single list of lists. We do that below.

In [12]:
row = []
for tr in tables.find_all('tr'):
    data = tr.find_all('td')
    row.append([i.text for i in data])

df = pd.DataFrame(data=row)
df.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 |  |  |  |
| 1 | M1A | Not assigned | Not assigned\n |
| 2 | M2A | Not assigned | Not assigned\n |
| 3 | M3A | North York | Parkwoods\n |
| 4 | M4A | North York | Victoria Village\n |


## Adding Header row

The previous code collected only the <td> cells of each row, so the DataFrame has no column names. The names are in the header cells <th> of the HTML, so we extract them and pass them as the columns below.

In [13]:
row = []
for tr in tables.find_all('tr'):
    data = tr.find_all('td')
    row.append([i.text.strip() for i in data])  # strip() removes the trailing '\n' from the last column

labels = tables.find_all('th')
labels = [c.text.strip() for c in labels]  # strip() removes '\n' from the column labels
#print(labels)
df = pd.DataFrame(data=row, columns=labels)
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 |  |  |  |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
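
The empty first row comes from the header <tr>, which has no <td> cells. One possible alternative, sketched here on a made-up two-row table, is to skip such rows while collecting the data so that no empty row reaches the DataFrame. The HTML string and the `*_demo` names are illustrative assumptions, not the notebook's own code.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for the scraped table (illustrative data only)
html = (
    '<table>'
    '<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood\n</th></tr>'
    '<tr><td>M1A</td><td>Not assigned</td><td>Not assigned\n</td></tr>'
    '<tr><td>M3A</td><td>North York</td><td>Parkwoods\n</td></tr>'
    '</table>'
)
table_demo = BeautifulSoup(html, 'html.parser').find('table')

# Column names come from the <th> cells
labels_demo = [th.get_text(strip=True) for th in table_demo.find_all('th')]

rows_demo = []
for tr in table_demo.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row has no <td> cells, so it is skipped
        rows_demo.append(cells)

df_demo = pd.DataFrame(rows_demo, columns=labels_demo)
print(df_demo.shape)  # (2, 3)
```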


## Cleaning the dataframe

Now we will start cleaning the DataFrame. Let's begin by removing the empty top row and then resetting the index.

In [14]:
df1 = df.drop([0])  # drop the empty row produced by the header <tr>
df1.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |


In [10]:
df1 = df1.reset_index(drop = True)

In [15]:
df1.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |


Next, we remove every row whose Borough is 'Not assigned'.

In [16]:
df1 = df1[df1.Borough != 'Not assigned']

In [17]:
df1.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
| 6 | M5A | Downtown Toronto | Regent Park |
| 7 | M6A | North York | Lawrence Heights |
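
Filtering with a boolean mask keeps the original index labels, which is why the rows above start at 3. As a small sketch on made-up data, the filter can be chained with `reset_index(drop=True)` if a fresh 0-based index is wanted. The `df_demo` and `assigned` names are illustrative only.

```python
import pandas as pd

# Toy data standing in for the scraped table
df_demo = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only the rows with an assigned borough, then renumber from 0
mask = df_demo['Borough'] != 'Not assigned'
assigned = df_demo[mask].reset_index(drop=True)

print(assigned['Postcode'].tolist())  # ['M3A', 'M7A']
print(list(assigned.index))           # [0, 1]
```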


Let's combine the neighbourhoods that share the same postal code.

In [18]:
df1.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

We group the rows first by Postcode, then by Borough, and join the Neighbourhood strings of each group with commas.

In [19]:
df1 = df1.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
df1.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
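
To see what the grouping does on a small scale, here is a sketch with three made-up rows. It uses `agg` with `', '.join` (comma plus space) instead of the `apply(','.join)` used above, which is purely a formatting choice of mine. Note that `groupby` sorts the group keys by default, so the output is ordered by Postcode.

```python
import pandas as pd

# Toy data: one postal code with two neighbourhoods, one with a single neighbourhood
df_demo = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One output row per (Postcode, Borough) pair, neighbourhood names joined into one string
grouped = (
    df_demo.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)

print(grouped['Postcode'].tolist())       # ['M3A', 'M5A']
print(grouped['Neighbourhood'].tolist())  # ['Parkwoods', 'Harbourfront, Regent Park']
```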


Now we replace the 'Not assigned' values in the Neighbourhood column with the corresponding Borough in a single statement.

In [20]:
df1.loc[df1.Neighbourhood == 'Not assigned', 'Neighbourhood'] = df1.Borough
df1.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
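
The first five rows shown here have no 'Not assigned' neighbourhoods, so the effect of the replacement is not visible in this output. A minimal sketch on made-up data, using the same `.loc` assignment with an explicit mask on both sides, shows the change. The `df_demo` and `missing` names are illustrative only.

```python
import pandas as pd

# Toy data: the second row has a borough but no neighbourhood
df_demo = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Select the rows to fix, then copy the borough name into the neighbourhood column
missing = df_demo['Neighbourhood'] == 'Not assigned'
df_demo.loc[missing, 'Neighbourhood'] = df_demo.loc[missing, 'Borough']

print(df_demo['Neighbourhood'].tolist())  # ['Parkwoods', "Queen's Park"]
```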


In [21]:
df1.shape

(103, 3)