IBM Capstone

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import sys

##Change the DataFrame display options so that I can
##inspect all of the data stored in the final DataFrame
pd.options.display.max_rows = 999
pd.set_option('max_colwidth', 140)

Getting the data from Wikipedia and Parsing it with BeautifulSoup

In [42]:
# getting data from internet
wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
raw_wikipedia_page= requests.get(wikipedia_link).text

# use BeautifulSoup to parse the page markup
soup = BeautifulSoup(raw_wikipedia_page,'xml')

In [85]:
##Create the table variable, which contains the Wikipedia page table listing
##the Postcodes, Boroughs and Neighborhoods of the Toronto area
table = soup.find('table')

Here I am creating the column title list for the data frame

In [246]:
#an empty list that will hold the titles of the table columns
dataframe_titles=[]
th_cell=table.find_all('th') # in HTML the <th> tag declares a header cell (column title) of a table
for title in th_cell:
    dataframe_titles.append(title.text.strip()) # strip surrounding whitespace and newlines
print(dataframe_titles)

['Postcode', 'Borough', 'Neighbourhood']


Here I fill three string lists, one for each table column.
Every postcode entry has a valid string.
The borough string and the neighborhood string may have 'Not assigned' as a value.
We need to respect the following criteria:
a) If Borough == 'Not assigned', the table entry is skipped.
b) If PostCode and Borough are valid but Neighborhood == 'Not assigned',
then the Neighborhood is given the name of the respective Borough.
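
The two criteria can be sketched in isolation as a small function. This is only an illustration, not the notebook's own code: the name `clean_rows` and the sample rows are made up, and the rows are assumed to be already stripped (postcode, borough, neighborhood) tuples.

```python
def clean_rows(rows):
    """Apply criteria a) and b) to (postcode, borough, neighborhood) tuples."""
    cleaned = []
    for postcode, borough, neighborhood in rows:
        # a) skip entries whose borough is not assigned
        if borough == 'Not assigned':
            continue
        # b) fall back to the borough name when the neighborhood is not assigned
        if neighborhood == 'Not assigned':
            neighborhood = borough
        cleaned.append((postcode, borough, neighborhood))
    return cleaned

# Made-up sample rows, for illustration only
sample_rows = [
    ('X1A', 'Not assigned', 'Not assigned'),
    ('X2A', 'Borough A', 'Hood 1'),
    ('X3A', 'Borough B', 'Not assigned'),
]
print(clean_rows(sample_rows))
```

The first sample row is dropped by criterion a), and the third one gets 'Borough B' as its neighborhood by criterion b).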

In [278]:
##declaration of the list variables that will contain the postcodes, boroughs and neighborhoods of the table
##These are not the final lists. These lists hold every retained row of the Wikipedia table
##Thus the suffix _orig at the end of their names
PostCode_strlist_orig=[]
Borough_strlist_orig=[]
Neighborhood_strlist_orig=[]

for index_tr,tr_cell in enumerate(table.find_all('tr')):
    ##this 1st "if" is needed because we must skip the first row of the table,
    ##which holds the column titles, since we already extracted them previously
    if index_tr==0:
        continue 
        
    #variable declaration 
    postcode_var=''
    borough_var=''
    neigh_var=''
        
    for index_td, td_cell in enumerate(tr_cell.find_all('td')):
        ##we use the str() cast to retrieve strings
        ##and .strip() to get rid of any '\n' residuals (or other whitespace) at the
        ##end (or beginning) of the strings
        if index_td==0:
            postcode_var=str(td_cell.text).strip()
            continue
        elif index_td==1:
            borough_var=str(td_cell.text).strip()
            continue
        elif index_td==2:
            neigh_var=str(td_cell.text).strip()
            continue
        else:
            print("Unexpected column index in td loop")
            sys.exit(1)  # non-zero exit status signals an error
        
    ##Here we check whether the criteria for Borough and Neighbourhood are fulfilled
    ##and act accordingly (see the markdown cell above)
    if borough_var=='Not assigned':
        continue
    if neigh_var=='Not assigned':
        neigh_var=borough_var 
   
    ## Fill the lists using the append method
    PostCode_strlist_orig.append(postcode_var)
    Borough_strlist_orig.append(borough_var)
    Neighborhood_strlist_orig.append(neigh_var.strip())

Here we identify the unique PostCode string values.
We then go through these unique postcodes and search for each one in the PostCode_strlist_orig list created above. Every time the postcode is found in PostCode_strlist_orig we fill
postcode_var,
borough_var (from the Borough_strlist_orig value),
and we append to neighborhood_var (from the Neighborhood_strlist_orig value).
For the neighborhoods we want to build a string of the form 'neighborhood_1, neighborhood_2, ..., neighborhood_n'.
All of these string values are then used to fill their respective lists.
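
As a possible alternative to scanning the whole list once per unique postcode, the same grouping can be sketched with a dictionary in a single pass. This is a minimal sketch, not the notebook's method: `group_by_postcode` and the sample values are hypothetical, and it keeps the first borough seen for a postcode, which is equivalent only under the assumption that equal postcodes always share a borough.

```python
def group_by_postcode(postcodes, boroughs, neighborhoods):
    """Group three parallel lists by postcode, joining neighborhoods with ', '."""
    grouped = {}  # postcode -> (borough, [neighborhoods]); dicts keep insertion order
    for pc, bo, ne in zip(postcodes, boroughs, neighborhoods):
        if pc not in grouped:
            grouped[pc] = (bo, [])
        grouped[pc][1].append(ne)
    return [(pc, bo, ', '.join(nes)) for pc, (bo, nes) in grouped.items()]

# Made-up sample values, for illustration only
sample_postcodes = ['X1A', 'X2A', 'X1A']
sample_boroughs = ['Borough A', 'Borough B', 'Borough A']
sample_neighborhoods = ['Hood 1', 'Hood 3', 'Hood 2']
print(group_by_postcode(sample_postcodes, sample_boroughs, sample_neighborhoods))
```

Because the dictionary preserves insertion order, the result follows the order in which postcodes first appear in the input.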


Remark on the Wikipedia table contents:

a) A postcode may correspond to multiple neighborhoods
b) A borough is not identified by a single postcode; it may span several postcodes

So we may get the following pattern:
postcode_1 -> Borough_1 -> Neighborhood_1
postcode_2 -> Borough_1 -> Neighborhood_2

BUT if 2 or more Neighborhoods have the same postcode, they seem to belong to the same Borough:
postcode_3 -> Borough_3 -> Neighborhood_3
postcode_3 -> Borough_3 -> Neighborhood_4

Since we are grouping the data by PostCode value, we expect each PostCode to have exactly 1 Borough value
and n (multiple) Neighborhood values. Two different postcodes may share the same Borough value, but rows with the same postcode must have the same Borough value.
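
The expectation that rows with the same postcode share one borough can be checked before grouping. The following is a hedged sketch: the function name `postcodes_with_conflicting_boroughs` and the sample values are invented for illustration, and the inputs are assumed to be parallel lists like PostCode_strlist_orig and Borough_strlist_orig.

```python
def postcodes_with_conflicting_boroughs(postcodes, boroughs):
    """Return {postcode: [boroughs]} for postcodes mapped to more than one borough."""
    seen = {}
    for pc, bo in zip(postcodes, boroughs):
        seen.setdefault(pc, set()).add(bo)
    return {pc: sorted(bs) for pc, bs in seen.items() if len(bs) > 1}

# Made-up sample values: 'X1A' is deliberately given two different boroughs
conflicts = postcodes_with_conflicting_boroughs(
    ['X1A', 'X1A', 'X2A'],
    ['Borough A', 'Borough B', 'Borough B'],
)
print(conflicts)
```

An empty dictionary would mean the assumption holds for the data that was checked.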


In [279]:
#sorted() gives a deterministic row order; a bare set has no guaranteed order
unique_postcode_list=sorted(set(PostCode_strlist_orig))

#final lists before the dataframe creation
postcodes_list_final=[]
boroughs_list_final=[]
neighborhoods_list_final=[]

for postcode in unique_postcode_list:
    #print(postcode)
    postcode_var=''
    borough_var=''
    neighborhood_var=''

    for pc_index,pc_element in enumerate(PostCode_strlist_orig):
        
        if pc_element==postcode:
            postcode_var=PostCode_strlist_orig[pc_index]
            borough_var=Borough_strlist_orig[pc_index]
            
            
            if neighborhood_var=='':
                neighborhood_var=Neighborhood_strlist_orig[pc_index]
            else:
                neighborhood_var=neighborhood_var+', '+Neighborhood_strlist_orig[pc_index]
    
    postcodes_list_final.append(postcode_var)
    boroughs_list_final.append(borough_var)
    neighborhoods_list_final.append(neighborhood_var)

Here I create the data frame

In [280]:
#map each column title to its list of values (no line continuations needed inside braces)
toronto={dataframe_titles[0]:postcodes_list_final,
         dataframe_titles[1]:boroughs_list_final,
         dataframe_titles[2]:neighborhoods_list_final}
df_toronto=pd.DataFrame(toronto,columns=dataframe_titles[:3])
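
For comparison, the grouping and the frame construction could also be done in pandas itself with `groupby` and `agg`. This is a sketch under assumptions, not the approach used above: the frame `rows` and its values are made up, and it assumes one row per (postcode, neighbourhood) pair with 'Not assigned' entries already handled.

```python
import pandas as pd

# Made-up long-format rows, one per (postcode, neighbourhood) pair
rows = pd.DataFrame({
    'Postcode': ['X1A', 'X1A', 'X2A'],
    'Borough': ['Borough A', 'Borough A', 'Borough B'],
    'Neighbourhood': ['Hood 1', 'Hood 2', 'Hood 3'],
})

# Group by postcode and borough, then join the neighbourhood names of each group
grouped = (rows.groupby(['Postcode', 'Borough'])['Neighbourhood']
               .agg(', '.join)
               .reset_index())
print(grouped)
```

Grouping on both Postcode and Borough keeps the borough column in the result; if a postcode ever appeared with two boroughs, it would show up as two rows rather than being merged silently.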

In [284]:
#df_toronto

In [285]:
#in case I want to check the Wikipedia table for mistakes
#df_toronto.sort_values('Postcode')

In [286]:
df_toronto.shape

(103, 3)