# PART A: Scraping Toronto neighbourhood data from a Wikipedia webpage

We first load the Toronto neighbourhood data, which is available as a table on a Wikipedia page, into a pandas dataframe.

In [108]:
import requests                                                         #importing required libraries
import pandas as pd
from io import StringIO                                                 #wraps the HTML text as a file-like object

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' #saving webpage link
html = requests.get(url).text                                           #call to retrieve the page HTML as text
df_list = pd.read_html(StringIO(html))                                  #reading all tables on the webpage as a list of dataframes
df = df_list[0]                                                         #the first table holds the neighbourhood data
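
Taking `df_list[0]` assumes the postal code table is the first table on the page. If the page layout changes, that index could point at a different table. A minimal sketch of choosing the table by its column names instead; `pick_table` is a hypothetical helper, not part of pandas, and the `tables` list below is made-up stand-in data for what `pd.read_html` returns.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(map(str, table.columns)):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")

# Stand-in for the list returned by pd.read_html (made-up data).
tables = [
    pd.DataFrame({"Legend": ["a"]}),
    pd.DataFrame({
        "Postal Code": ["M3A"],
        "Borough": ["North York"],
        "Neighbourhood": ["Parkwoods"],
    }),
]

picked = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
```

In the notebook this would be called as `pick_table(df_list, [...])` in place of `df_list[0]`.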

In [109]:
df.head()                                                              #overview of dataframe

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [110]:
df.shape                                                               #size of dataframe

(180, 3)

After successfully loading the Toronto neighbourhood data into our dataframe df, we see that it has 180 rows and 3 columns: Postal Code, Borough and Neighbourhood. A single postal code can contain more than one neighbourhood.<br>
We now proceed with data cleaning.<br>
<br>
Some values in the Borough and Neighbourhood columns are recorded as 'Not assigned'. Let's remove the rows whose Borough value is 'Not assigned'.

In [111]:
df=df[df.Borough!='Not assigned']               #retains those rows where Borough values are not recorded as 'Not assigned'
df.reset_index(drop=True,inplace=True)          #index values are reset post removal of rows
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
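
Before dropping rows it can be useful to count how many will be removed. A minimal sketch on a small sample frame built from the five rows shown by the first `df.head()` output; the names `sample`, `unassigned`, `n_unassigned` and `cleaned` are illustrative only.

```python
import pandas as pd

# Sample rows copied from the first df.head() output in this notebook.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Regent Park, Harbourfront"],
})

# Boolean mask marking the rows that have no borough.
unassigned = sample["Borough"] == "Not assigned"
n_unassigned = int(unassigned.sum())

# Keep the remaining rows and renumber the index from 0.
cleaned = sample[~unassigned].reset_index(drop=True)
```

On the full dataframe the same count should match the difference between the two shapes reported in this notebook, 180 - 103 = 77 rows.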


Let's now replace 'Not assigned' values in the Neighbourhood column with the corresponding Borough values.

In [112]:
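for i in range(len(df)):                                    #walk over every row by its index label
    if df.at[i,'Neighbourhood']=='Not assigned':            #neighbourhood name is missing
        df.at[i,'Neighbourhood'] = df.at[i,'Borough']       #fall back to the borough name
        print(i,"done")                                     #report which row was changed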
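
No output is shown for the loop above, which suggests that no row needed the replacement in this run. The same replacement can also be written without an explicit loop. A minimal sketch using a boolean mask with `.loc`, on a hypothetical three-row sample whose second row (postal code `M9Z`) is made up to trigger the replacement:

```python
import pandas as pd

# Hypothetical sample; the "M9Z" row is invented for illustration.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M9Z", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Not assigned",
                      "Regent Park, Harbourfront"],
})

# Rows whose neighbourhood is missing.
mask = sample["Neighbourhood"] == "Not assigned"

# Copy the borough name into the neighbourhood column for those rows only.
sample.loc[mask, "Neighbourhood"] = sample.loc[mask, "Borough"]
```

The mask form updates all matching rows in one assignment and does not depend on the index running from 0 to `len(df) - 1`.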

In [113]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [114]:
df.shape

(103, 3)

Our data is now cleaned and ready for further processing. After cleaning, our dataframe has 103 rows and 3 columns:
<br> Postal Code, Borough and Neighbourhood.
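
A few quick checks can confirm that the cleaning did what was intended. The sketch below runs on a sample built from the five rows of the cleaned `df.head()` output; the variable names are illustrative. It assumes neighbourhoods within a postal code are separated by commas, as they appear in that output.

```python
import pandas as pd

# Sample rows copied from the cleaned df.head() output in this notebook.
cleaned = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A", "M6A", "M7A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village",
                      "Regent Park, Harbourfront",
                      "Lawrence Manor, Lawrence Heights",
                      "Queen's Park, Ontario Provincial Government"],
})

# True if any 'Not assigned' value survived the cleaning.
has_unassigned = bool(
    (cleaned[["Borough", "Neighbourhood"]] == "Not assigned").any().any()
)

# True if some postal code appears on more than one row.
has_duplicate_codes = bool(cleaned["Postal Code"].duplicated().any())

# Number of comma-separated neighbourhood names per postal code.
n_neighbourhoods = cleaned["Neighbourhood"].str.split(",").str.len()
```

Running the same three expressions on `df` gives a quick view of whether any placeholder values or repeated postal codes remain.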