Import the libraries, read the wikitables from the Wikipedia page, and load the first one into a pandas dataframe

In [1]:
import pandas as pd
import numpy as np

# Read every table with class "wikitable" from the Wikipedia page;
# the first column (Postcode) becomes the index
page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikitable = pd.read_html(page, index_col=0, attrs={"class": "wikitable"})

# read_html returns a list of dataframes; keep the first one
ca_post_df = wikitable[0]

ca_post_df.head()

Postcode,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront


Remove the rows whose Borough is "Not assigned"

In [2]:
ca_post_df = ca_post_df[ca_post_df.Borough != "Not assigned"]

ca_post_df.head()

Postcode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M6A,North York,Lawrence Heights
M6A,North York,Lawrence Manor
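
The filter above is a boolean mask: the comparison yields one True/False value per row, and indexing with it keeps only the True rows. A minimal sketch on a hypothetical three-row table (the name `toy` and its rows are assumptions for illustration, not the scraped data):

```python
import pandas as pd

# Hypothetical toy data, indexed by Postcode like the table above
toy = pd.DataFrame(
    {
        "Borough": ["Not assigned", "North York", "Downtown Toronto"],
        "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront"],
    },
    index=pd.Index(["M1A", "M3A", "M5A"], name="Postcode"),
)

# One boolean per row: True where the Borough is assigned
mask = toy["Borough"] != "Not assigned"

# Indexing with the mask keeps only the rows marked True
kept = toy[mask]
print(kept)
```

Bracket access (`toy["Borough"]`) is used here instead of attribute access; both select the same column, but brackets also work for column names that contain spaces.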


Copy Postcode and Neighbourhood into a separate dataframe (ca_post_df2) so that neighbourhoods sharing a postcode can be grouped into one cell, separated by ','

In [3]:
ca_post_df2 = ca_post_df.drop('Borough', axis=1)
ca_post_df2.head()

Postcode,Neighbourhood
M3A,Parkwoods
M4A,Victoria Village
M5A,Harbourfront
M6A,Lawrence Heights
M6A,Lawrence Manor


In [4]:
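# Group on the Postcode index (keeping first-seen order) and join each group's neighbourhoods with ','
ca_post_neigh = ca_post_df2.groupby(level=['Postcode'], sort=False).agg(','.join)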
ca_post_neigh.head()

Postcode,Neighbourhood
M3A,Parkwoods
M4A,Victoria Village
M5A,Harbourfront
M6A,"Lawrence Heights,Lawrence Manor"
M7A,Queen's Park
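
To show what the grouping step does in isolation, here is a small sketch on hypothetical toy rows (the name `toy` is an assumption; the neighbourhood values are taken from the output above). `','.join` receives each group's Neighbourhood column and concatenates its strings:

```python
import pandas as pd

# Hypothetical toy data: M6A appears twice, as in the real table
toy = pd.DataFrame(
    {"Neighbourhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor"]},
    index=pd.Index(["M3A", "M6A", "M6A"], name="Postcode"),
)

# Group on the index level; sort=False keeps the first-seen postcode order
grouped = toy.groupby(level="Postcode", sort=False).agg(",".join)
print(grouped)
```

The result has one row per postcode, and the two M6A rows collapse into a single comma-separated cell.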


In [5]:
ca_post_df = ca_post_df.drop('Neighbourhood', axis=1)
ca_post_df.head()

Postcode,Borough
M3A,North York
M4A,North York
M5A,Downtown Toronto
M6A,North York
M6A,North York


Merge the grouped neighbourhood dataframe (ca_post_neigh) into the borough dataframe (ca_post_df) on Postcode, then drop the duplicate rows left over from postcodes that appeared more than once

In [6]:
ca_post_cleaned = pd.merge(ca_post_df, ca_post_neigh, how='left', on='Postcode')
ca_post_cleaned = ca_post_cleaned.drop_duplicates()
ca_post_cleaned.head()

Postcode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M6A,North York,"Lawrence Heights,Lawrence Manor"
M7A,Downtown Toronto,Queen's Park


In [7]:
ca_post_cleaned.shape[0]

103
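
As a possible alternative to the split, group, merge and de-duplicate sequence above, the same shape of result can be produced in one `groupby` by giving each column its own aggregation. This is only a sketch on hypothetical toy rows (the name `toy` is an assumption), and it assumes every postcode belongs to exactly one borough, so taking the first Borough per group loses nothing:

```python
import pandas as pd

# Hypothetical toy data with a repeated postcode
toy = pd.DataFrame(
    {
        "Borough": ["North York", "North York", "North York"],
        "Neighbourhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor"],
    },
    index=pd.Index(["M3A", "M6A", "M6A"], name="Postcode"),
)

# Per-column aggregation: keep the first Borough, join the Neighbourhoods
combined = toy.groupby(level="Postcode", sort=False).agg(
    {"Borough": "first", "Neighbourhood": ",".join}
)
print(combined)
```

If a postcode could span several boroughs, `"first"` would silently hide that, so the merge-based approach or an explicit check would be the safer choice.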

Read the geospatial CSV to get the latitude and longitude, and rename 'Postal Code' to 'Postcode' for the later merge

In [8]:
GEO = pd.read_csv("http://cocl.us/Geospatial_data")
GEO.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
GEO.head()

,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the cleaned Canadian postcode data with the latitude and longitude

In [9]:
ca_post_final = pd.merge(ca_post_cleaned, GEO, how='left', on='Postcode')
ca_post_final.head()

,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494


In [10]:
ca_post_final.shape[0]

103
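
The row count alone does not confirm that every postcode found coordinates, because a left merge keeps unmatched rows and fills them with NaN. One way to check is the `indicator=True` option of `pd.merge`, sketched here on hypothetical toy frames (the names `left` and `geo` and the postcode "M9Z" are assumptions for illustration; the two coordinate pairs are copied from the output above):

```python
import pandas as pd

# Hypothetical cleaned table; "M9Z" is a made-up postcode with no coordinates
left = pd.DataFrame(
    {
        "Postcode": ["M3A", "M4A", "M9Z"],
        "Borough": ["North York", "North York", "Unknown"],
    }
)

# Hypothetical coordinate table covering only two of the postcodes
geo = pd.DataFrame(
    {
        "Postcode": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge
merged = pd.merge(left, geo, how="left", on="Postcode", indicator=True)

# Postcodes that found no match in the coordinate table
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

In this sketch Postcode is an ordinary column in both frames, whereas in the notebook it is the index of ca_post_cleaned; `on='Postcode'` accepts either. An empty `unmatched` list on the real data would indicate that all 103 rows received coordinates.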