# Part 1
**Data Cleaning**
For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto. You will need to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

In [109]:
#import modules
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [110]:
#scrape page from url with BeautifulSoup
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
url=requests.get(url).text  #from here on, url holds the page HTML rather than the address
soup=BeautifulSoup(url,'lxml')
#print(soup.prettify())

In [111]:
#Inspection of hierarchy tree (soup was already parsed in the previous cell)
print(soup.prettify())

In [112]:
#Extract table with BeautifulSoup
table=soup.find('tbody')
#print(table.prettify())

In [113]:
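#Extract column data of PostalCode, Borough and Neighbourhood
table1=table.find_all('td')
postalcode=[]
borough=[]
neighbourhood=[]
for tab in table1:
    #the postal code is read from the <b> tag of each cell
    postalcode.append(tab.b.text)
    if tab.i:
        #cells containing an <i> tag are treated as having no borough
        borough.append('none')
        neighbourhood.append('none')
    elif tab.a:
        #borough comes from the first link; the full span text is kept for later extraction
        borough.append(tab.a.text)
        neighbourhood.append(tab.span.text)
    else:
        #fallback so the three lists always stay the same length
        borough.append('none')
        neighbourhood.append('none')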
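
The cell-by-cell logic above can be tried offline on a small inline snippet. This is a minimal sketch: the markup in `sample_html` is hypothetical, inferred only from the tags the loop accesses (`b`, `i`, `a`, `span`), and the live page may be structured differently. It uses the built-in `html.parser` so that `lxml` is not required, and it includes a fallback branch so every cell yields exactly one row.

```python
from bs4 import BeautifulSoup

# Hypothetical markup inferred from the tags the loop reads; not copied from the live page.
sample_html = """
<table><tbody><tr>
<td><b>M1A</b><br><i>Not assigned</i></td>
<td><b>M3A</b><br><span><a href="#">North York</a> (Parkwoods)</span></td>
</tr></tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for cell in soup.find_all("td"):
    code = cell.b.text                    # postal code sits in the <b> tag
    if cell.i:                            # an <i> tag is treated as "no borough"
        rows.append((code, "none", "none"))
    elif cell.a:                          # first link gives the borough
        rows.append((code, cell.a.text, cell.span.text))
    else:                                 # fallback keeps one row per cell
        rows.append((code, "none", "none"))

print(rows)
```

Collecting one tuple per cell, instead of appending to three separate lists, makes it impossible for the columns to drift out of alignment.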

In [114]:
#Convert data into dataframe
df=pd.DataFrame([postalcode,borough,neighbourhood]).T
df.columns=['PostalCode','Borough','Neighbourhood']
#keep only the text inside the parentheses of 'Neighbourhood'; values without parentheses become NaN
df['Neighbourhood']=df['Neighbourhood'].str.extract(r'.*\((.*)\).*',expand=False)
df=df.astype(object)
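
To see what the extraction pattern keeps and what it drops, it can be run on a few sample strings. This is a small sketch; the strings below are made up for illustration and are not taken from the scraped page.

```python
import pandas as pd

# Made-up sample values; the last one has no parentheses at all.
s = pd.Series([
    "North York (Parkwoods)",
    "Downtown Toronto (Regent Park / Harbourfront)",
    "none",
])

# Raw string avoids invalid-escape warnings; expand=False returns a Series.
extracted = s.str.extract(r'.*\((.*)\).*', expand=False)
print(extracted.tolist())
```

Values without parentheses do not match the pattern and come back as missing, which is consistent with the empty 'Neighbourhood' entries in the rows whose borough is `none`.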

In [115]:
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,none,
1,M2A,none,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Queen's Park,
7,M8A,none,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge


In [116]:
#remove rows with 'Borough' as none
df1=df[df['Borough']!='none']
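#.copy() gives an independent frame, so later column assignments do not raise SettingWithCopyWarning
df2=df1[df1['Borough']!='None'].copy()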
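
Boolean-mask filtering keeps the original index labels of the surviving rows, which is why the filtered output has gaps in its index. A minimal sketch on a toy frame (the values in `toy` are illustrative only):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["none", "North York", "North York"],
})

# Keep rows whose borough is assigned; .copy() detaches the result from toy.
kept = toy[toy["Borough"] != "none"].copy()
print(kept)
print(kept.index.tolist())
```

The row labelled 0 is dropped and the remaining rows keep their labels, so the index is renumbered only when it is reset explicitly.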

In [117]:
df2

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"
5,M6A,North York,"Lawrence Manor , Lawrence Heights"
6,M7A,Queen's Park,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern , Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill , Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [121]:
#Tidy data by replacing '/' separator in 'Neighbourhood' with ',' separator
df2['Neighbourhood']=df2['Neighbourhood'].str.replace('/',',',regex=False)

In [122]:
#Tidy data by sorting by 'PostalCode' and resetting the index without keeping the old one
Toronto=df2.sort_values(by='PostalCode',ascending=True).reset_index(drop=True)
Toronto

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern , Rouge"
1,M1C,Scarborough,"Rouge Hill , Port Union , Highland Creek"
2,M1E,Scarborough,"Guildwood , Morningside , West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park , Ionview , East Birchmount Park"
7,M1L,Scarborough,"Golden Mile , Clairlea , Oakridge"
8,M1M,Scarborough,"Cliffside , Cliffcrest , Scarborough Village West"
9,M1N,North York,Willowdale
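
The sort-and-renumber step can be checked on a toy frame. This sketch uses made-up rows with a gappy index, and `reset_index(drop=True)` to discard the old labels in one call instead of dropping an `index` column afterwards.

```python
import pandas as pd

# Toy rows with a non-contiguous index, as left behind by filtering.
toy = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M1B", "M4A"],
        "Borough": ["North York", "Scarborough", "North York"],
    },
    index=[2, 9, 3],
)

# Sort by postal code, then renumber rows from 0 without keeping the old index.
tidy = toy.sort_values(by="PostalCode").reset_index(drop=True)
print(tidy)
```

After this step the frame has a clean 0-based index and no extra `index` column.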


In [123]:
Toronto.shape

(101, 3)