# Postal codes and coordinates of Toronto boroughs

## Table of Contents

* [Download and Explore Dataset](#chapter1)
* [Creating a Pandas DataFrame](#chapter2)
* [Cleaning data values](#chapter3)
* [Getting the coordinates of each borough](#chapter4)

## Download and Explore Dataset <a class="anchor" id="chapter1"></a>

In [1]:
# import the dependencies for web scraping Wikipedia
import requests
import lxml.html as lh
import pandas as pd

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

In [3]:
#Check the length of the first 12 rows
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

Each of the first 12 rows has exactly 3 cells, which suggests that these `tr_elements` all come from the postal codes table.
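
Because `//tr` matches rows from every table on a page, a row-width check alone cannot always tell tables apart. The following is a minimal sketch on a made-up HTML fragment (the fragment and the names `all_rows` and `table_rows` are illustrative assumptions, not part of the scraped page) showing how the XPath could be scoped to the first table instead:

```python
import lxml.html as lh

# Hypothetical page fragment with two tables; only the first holds the data.
html = """
<html><body>
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table>
  <tr><td>Canadian postal codes</td></tr>
</table>
</body></html>
"""

doc = lh.fromstring(html)

# '//tr' collects rows from both tables.
all_rows = doc.xpath('//tr')
row_widths = [len(row) for row in all_rows]

# Scoping the path to the first table keeps only its rows.
table_rows = doc.xpath('(//table)[1]//tr')
table_widths = [len(row) for row in table_rows]

print(row_widths)
print(table_widths)
```

Whether the first table is the right one on the real page is an assumption that would need checking against the live HTML.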

### Parse the first row as our header

In [4]:
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d: %s'%(i,name))
    col.append((name,[]))

1: Postal Code

2: Borough

3: Neighbourhood



## Creating a Pandas DataFrame <a class="anchor" id="chapter2"></a>

#### Each header is stored in a tuple together with an empty list, which the loop below fills column by column.

In [5]:
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Append the cell text to the list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

In [6]:
[len(C) for (title,C) in col]

[181, 181, 181]

In [7]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
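
To make the conversion step concrete, here is a small sketch of how a list of `(header, values)` tuples becomes a DataFrame. The list below is a toy stand-in for `col`, using two rows from the table, and the names `columns` and `toy_df` are illustrative:

```python
import pandas as pd

# Toy stand-in for `col`: one (header, values) tuple per column.
col = [
    ("Postal Code", ["M3A", "M4A"]),
    ("Borough", ["North York", "North York"]),
    ("Neighbourhood", ["Parkwoods", "Victoria Village"]),
]

# Each tuple becomes one column; all value lists must have the same length.
columns = {title: values for title, values in col}
toy_df = pd.DataFrame(columns)

print(toy_df.shape)
print(list(toy_df.columns))
```

This is also why the length check in cell 6 matters: `pd.DataFrame` raises an error if the lists passed in the dictionary differ in length.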

In [8]:
df.head()

| | Postal Code\n | Borough\n | Neighbourhood\n |
|---|---|---|---|
| 0 | M1A\n | Not assigned\n | Not assigned\n |
| 1 | M2A\n | Not assigned\n | Not assigned\n |
| 2 | M3A\n | North York\n | Parkwoods\n |
| 3 | M4A\n | North York\n | Victoria Village\n |
| 4 | M5A\n | Downtown Toronto\n | Regent Park, Harbourfront\n |


In [9]:
df.tail()

| | Postal Code\n | Borough\n | Neighbourhood\n |
|---|---|---|---|
| 176 | M6Z\n | Not assigned\n | Not assigned\n |
| 177 | M7Z\n | Not assigned\n | Not assigned\n |
| 178 | M8Z\n | Etobicoke\n | Mimico NW, The Queensway West, South of Bloor,... |
| 179 | M9Z\n | Not assigned\n | Not assigned\n |
| 180 | \n | Canadian postal codes\n | \n |


## Cleaning data values <a class="anchor" id="chapter3"></a>

In [10]:
# strip the "\n" from the values
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip('\n'))
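
As a compact variation on this cleaning step, the sketch below strips surrounding whitespace from both the cell values and the column names of a toy frame. It assumes every column holds strings, as is the case for the scraped table; `raw` and `clean` are illustrative names:

```python
import pandas as pd

# Toy frame mimicking the scraped data: trailing newlines everywhere.
raw = pd.DataFrame({
    "Postal Code\n": ["M3A\n", "M4A\n"],
    "Borough\n": ["North York\n", "North York\n"],
})

# Strip whitespace from every cell (all columns are assumed to be strings).
clean = raw.apply(lambda s: s.str.strip())

# Strip whitespace from the column names as well.
clean.columns = clean.columns.str.strip()

print(clean)
```

Note that `str.strip()` with no argument removes all leading and trailing whitespace, not only `\n`, which is slightly broader than the cell above.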

In [11]:
#drop the last row as it is not part of the postal codes table
df.drop([180], inplace=True)

In [12]:
df.tail()

| | Postal Code\n | Borough\n | Neighbourhood\n |
|---|---|---|---|
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |
| 179 | M9Z | Not assigned | Not assigned |


In [13]:
#strip the "\n" from the column names
df.rename(columns=lambda x: x.strip('\n'), inplace=True)

In [14]:
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [15]:
#Check how many rows have a "Not assigned" Borough
NABor_df = df.loc[df['Borough'] == "Not assigned"] 
  
print(NABor_df)

    Postal Code       Borough Neighbourhood
0           M1A  Not assigned  Not assigned
1           M2A  Not assigned  Not assigned
7           M8A  Not assigned  Not assigned
10          M2B  Not assigned  Not assigned
15          M7B  Not assigned  Not assigned
..          ...           ...           ...
174         M4Z  Not assigned  Not assigned
175         M5Z  Not assigned  Not assigned
176         M6Z  Not assigned  Not assigned
177         M7Z  Not assigned  Not assigned
179         M9Z  Not assigned  Not assigned

[77 rows x 3 columns]


77 postal codes have no borough assigned (`Not assigned`).

In [16]:
#delete all rows from the dataset where Borough has a "Not assigned" value
df.drop(df[df.Borough == "Not assigned"].index, inplace=True)
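
The same filtering can be expressed with a boolean mask, which also gives the count of unassigned rows directly. This is a minimal sketch on a toy frame; `toy`, `mask`, and `kept` are illustrative names:

```python
import pandas as pd

# Toy frame with one unassigned borough.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})

# True for every row whose Borough is "Not assigned".
mask = toy["Borough"] == "Not assigned"

# Summing a boolean Series counts the True values.
n_unassigned = int(mask.sum())

# Keep the complement of the mask and renumber the index from 0.
kept = toy[~mask].reset_index(drop=True)

print(n_unassigned)
print(kept)
```

Unlike `drop(..., inplace=True)`, this returns a new DataFrame and leaves `toy` unchanged, which can make it easier to re-run cells in a notebook.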

In [17]:
#Check whether any remaining rows have a "Not assigned" Neighbourhood
NANeigh_df = df.loc[df['Neighbourhood'] == "Not assigned"] 
  
print(NANeigh_df)

Empty DataFrame
Columns: [Postal Code, Borough, Neighbourhood]
Index: []


In [18]:
#resetting the index
df.reset_index(drop=True, inplace=True)

In [19]:
df.shape

(103, 3)

## Getting the coordinates of each borough <a class="anchor" id="chapter4"></a>

In [20]:
geodata = pd.read_csv('https://cocl.us/Geospatial_data')

In [21]:
geodata.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [22]:
geodata.shape

(103, 3)

#### Merge the DataFrame of Toronto boroughs and neighbourhoods with the coordinates DataFrame, using Postal Code as the key.

In [23]:
Tor_geo=df.merge(geodata, how='left', on='Postal Code')
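
A left merge keeps every row of the left frame and fills missing coordinates with `NaN`, so it can be worth checking for unmatched keys. The sketch below uses the `validate` and `indicator` parameters of `DataFrame.merge` on toy data; the postal code `X0X` is a made-up value included only to show an unmatched key:

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "X0X"],  # "X0X" is hypothetical
    "Borough": ["North York", "North York", "Nowhere"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

merged = left.merge(
    right,
    how="left",
    on="Postal Code",
    validate="one_to_one",  # raise if a key is duplicated on either side
    indicator=True,         # add a "_merge" column describing each row's origin
)

# Rows present only in the left frame have no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()

print(unmatched)
```

For the data in this notebook, both frames have 103 rows, so an empty `unmatched` list would confirm that every postal code received coordinates.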

In [24]:
Tor_geo.head(10)

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
