The first stage of this project is to retrieve and clean the data we will work with. By the end of this notebook we will have created a csv file listing the neighborhoods of Toronto together with the boroughs that contain them and the postal codes assigned to them.

So, let us first import all the libraries we will use:

In [30]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import ssl
import re

To retrieve this data, we will use the following Wikipedia page. Before doing anything with it, we look at its source code in the browser. We notice the 'tr' tags that define rows in the HTML table, and the 'td' tags that define each cell.

In [31]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

Now, let us open the page using BeautifulSoup and find all the 'tr' tags. 

In [32]:
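# ctx was never defined in the notebook; a default SSL context is assumed here.
ctx = ssl.create_default_context()
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('tr')  # shorthand for soup.find_all('tr')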

Let us now find all the rows that will become part of our dataframe. We make a list called 'good' of the tags that satisfy the following 2 properties:
    1. The row contains a postcode, namely the letter M followed by a digit and a capital letter.
    2. The postcode is assigned to some borough. To check this, we count the occurrences of 'Not assigned' in the row and skip the row when there are two of them, since that means both the borough and the neighborhood are missing.
We also make a list of all good postcodes and check that both lists have the same length.

In [33]:
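good = list()
postcodes = list()
for tag in tags:
    line = tag.decode()
    try:
        # findall returns an empty list when the row has no postcode,
        # so indexing raises IndexError and the row is skipped
        postcode = re.findall(r'>(M[0-9][A-Z])<', line)[0]
    except IndexError:
        continue
    # 'Not assigned' twice means neither borough nor neighborhood is set
    if len(re.findall('Not assigned', line)) == 2:
        continue
    good.append(tag)
    postcodes.append(postcode)
print(len(good), len(postcodes))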

211 211
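
The two filtering rules can be tried on a few hand-written rows before running them on the whole page. The following is only a minimal sketch: the sample rows and the helper name `is_good_row` are made up for illustration and merely imitate the shape of the table markup.

```python
import re

# Hypothetical rows imitating the table markup (not copied from the page).
sample_rows = [
    '<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>',
    '<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>',
    "<tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>",
    '<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>',
]

def is_good_row(line):
    """Return True if the row has a postcode and an assigned borough."""
    # Rule 1: the row must contain a postcode such as M3A.
    if not re.search(r'>(M[0-9][A-Z])<', line):
        return False
    # Rule 2: two 'Not assigned' entries mean the borough is missing too.
    return len(re.findall('Not assigned', line)) != 2

kept = [row for row in sample_rows if is_good_row(row)]
print(len(kept))
```

A row with a single 'Not assigned' (borough present, neighborhood missing) is kept on purpose, because it is handled in a later step.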


Now, we make a dictionary from our data. Its keys are all occurring pairs (postcode, borough), and at first we assign an empty list to each key. To extract the borough and neighborhood, we first delete all newline characters and split the line into 3 parts, using the boundary between adjacent 'td' cells as the separator. In part 0 we look for the postcode, and in parts 1 and 2 we look for strings that have '>' on their left, start with a capital letter and do not contain the characters < or /.

In [34]:
#making a dictionary with our data
neigh=dict()
for tag in good:
    line=tag.decode().replace('\n','')
    row=line.split('/td><td')
    # raw strings keep \d and \w from being read as string escapes
    postcode=re.findall(r'M\d\w',row[0])[0]
    borough=re.findall(r'>([A-Z][^</]+)',row[1])[0]
    neigh[postcode,borough]=list()
print(len(neigh))

103
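
To see what the split and the regular expressions do, it may help to run them on a single row. This is a sketch on a hypothetical row: the markup in `line` is invented to resemble a table row with linked cells and is not taken from the page.

```python
import re

# Hypothetical row imitating the table markup, with newlines already removed.
line = ('<tr><td>M5A</td>'
        '<td><a href="/wiki/Downtown_Toronto">Downtown Toronto</a></td>'
        '<td><a href="/wiki/Harbourfront">Harbourfront</a></td></tr>')

# Splitting on '/td><td' leaves one fragment per cell.
row = line.split('/td><td')

postcode = re.findall(r'M\d\w', row[0])[0]
# '>' followed by a capital letter, then everything up to the next '<' or '/'.
borough = re.findall(r'>([A-Z][^</]+)', row[1])[0]
# Same idea, anchored at the end of the row and allowing a closing </a>.
neighborhood = re.findall(r'>([A-Z][^</]+)<*/*a*>*</td></tr>', row[2])[0]

print(len(row), postcode, borough, neighborhood)
```

Because the character class excludes '/', a name that itself contains a slash would be cut short by these patterns.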


Here, for each key we append the neighborhoods that belong to the corresponding borough and postcode to that key's list in the dictionary neigh.

In [35]:
for postcode,borough in neigh.keys():
    for tag in good:
        line=tag.decode().replace('\n','')
        row=line.split('/td><td')
        # raw strings keep \d and \w from being read as string escapes
        p=re.findall(r'M\d\w',row[0])[0]
        b=re.findall(r'>([A-Z][^</]+)',row[1])[0]
        n=re.findall(r'>([A-Z][^</]+)<*/*a*>*</td></tr>',row[2])[0]
        if p==postcode and b==borough:
            neigh[postcode,borough].append(n)            

Now we deal with neighborhoods marked 'Not assigned'. In every such case we simply give the neighborhood the same name as the borough.

In [36]:
for postcode,borough in neigh.keys():
    if neigh[postcode,borough]==['Not assigned']:
        neigh[postcode,borough]=[borough]      
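
The grouping and the 'Not assigned' replacement can also be done in a single pass once the rows are parsed. The following is a sketch of that alternative, not the approach used in this notebook: it swaps the nested loops for `collections.defaultdict`, and the `records` triples are hypothetical stand-ins for parsed rows.

```python
from collections import defaultdict

# Hypothetical (postcode, borough, neighborhood) triples standing in for parsed rows.
records = [
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M7A', "Queen's Park", 'Not assigned'),
]

# Group neighborhoods under their (postcode, borough) key in one pass.
grouped = defaultdict(list)
for postcode, borough, neighborhood in records:
    grouped[postcode, borough].append(neighborhood)

# Replace a lone 'Not assigned' neighborhood with the borough name.
for (postcode, borough), names in grouped.items():
    if names == ['Not assigned']:
        grouped[postcode, borough] = [borough]

print(dict(grouped))
```

Each row is visited once here, whereas the nested loops above rescan every row for every key.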

Finally, we make a dictionary indexed by integers with values of type (postcode, borough, list of neighborhoods), turn it into a dataframe, transpose it and name the columns.

In [37]:
data=dict()
index=0
for postcode,borough in neigh.keys():
    neighborhoods=', '.join(neigh[postcode,borough])
    data[index]=[postcode,borough,neighborhoods]
    index=index+1
df=pd.DataFrame.from_dict(data).transpose() 
df.columns=['Postcode','Borough','Neighborhoods']
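
As a side note, the same table can be built without the intermediate integer-keyed dictionary and the transpose. This is a sketch of one possible alternative, assuming a small hypothetical dictionary `neigh_example` with the same shape as `neigh`; the names `rows` and `df_example` are also illustrative.

```python
import pandas as pd

# Hypothetical data with the same shape as the neigh dictionary.
neigh_example = {
    ('M5A', 'Downtown Toronto'): ['Harbourfront', 'Regent Park'],
    ('M3A', 'North York'): ['Parkwoods'],
}

# One tuple per row: postcode, borough, comma-separated neighborhoods.
rows = [
    (postcode, borough, ', '.join(names))
    for (postcode, borough), names in neigh_example.items()
]
df_example = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighborhoods'])
print(df_example.shape)
```

Passing a list of row tuples to `pd.DataFrame` names the columns in the same call.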

Let's display the first rows of the dataframe to check that it looks as expected.

In [38]:

df.head(15)

   Postcode  Borough           Neighborhoods
0  M3A       North York        Parkwoods
1  M4A       North York        Victoria Village
2  M5A       Downtown Toronto  Harbourfront, Regent Park
3  M6A       North York        Lawrence Heights, Lawrence Manor
4  M7A       Queen's Park      Queen's Park
5  M9A       Etobicoke         Islington Avenue
6  M1B       Scarborough       Rouge, Malvern
7  M3B       North York        Don Mills North
8  M4B       East York         Woodbine Gardens, Parkview Hill
9  M5B       Downtown Toronto  Ryerson, Garden District


Finally, we save the dataframe as a csv file and check its shape, as required by the assignment.

In [39]:
df.to_csv(r'C:\Users\arhip\Desktop\IBM\projects\Coursera_Capstone\Toronto_data.csv')
df.shape

(103, 3)