# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, we will explore, segment, and cluster the neighborhoods in the city of Toronto. 

### ~Scrape the Wikipedia page and transform the data

We need to scrape the Wikipedia page for the Toronto neighborhood data and read it into a pandas DataFrame. I used BeautifulSoup to scrape the page.

In [137]:
import pandas
import requests
from bs4 import BeautifulSoup

# Download the page and parse it as HTML (the page is HTML, not XML)
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text, 'html.parser')

table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')

# Collect the stripped text of each <td> cell, one list per table row
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])

df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df.head()

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | | | |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
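
Row 0 in the output above is empty because the table's header row contains only `th` cells, so `row.find_all('td')` returns an empty list for it. The following is a minimal sketch of one way to skip such rows at scrape time. It runs on a small inline HTML snippet rather than the live page; the `sample_html` string and the name `sample_df` are made up for illustration.

```python
import pandas
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table', {'class': 'wikitable sortable'})

data = []
for row in table.find_all('tr'):
    cells = [t.text.strip() for t in row.find_all('td')]
    if cells:  # the header row has only <th> cells, so it is skipped here
        data.append(cells)

sample_df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
print(sample_df)
```

With the header row skipped, the later null check on `PostalCode` is no longer needed to remove that row, though keeping it does no harm.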


### ~Data Processing

In [138]:
# Drop rows whose borough is 'Not assigned' and rows with a null postal code
df = df[df['Borough'] != 'Not assigned']
df = df[df['PostalCode'].notnull()]

# One postal code area can contain several neighborhoods, so group the rows and join the names with commas
df = df.groupby(['PostalCode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()

# If a row has a borough but its neighborhood is 'Not assigned', use the borough as the neighborhood
df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df['Borough']
df.head()

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
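
The processing steps above can be traced on a small hand-made frame. This is a minimal sketch: the `toy` rows are invented for illustration and are not taken from the Wikipedia table.

```python
import pandas

# Made-up rows in the shape produced by the scraping step (one row per neighborhood).
toy = pandas.DataFrame(
    [
        [None, None, None],                       # empty row left by the header
        ['M1A', 'Not assigned', 'Not assigned'],  # dropped: no borough
        ['M1B', 'Scarborough', 'Rouge'],
        ['M1B', 'Scarborough', 'Malvern'],
        ['M7A', "Queen's Park", 'Not assigned'],  # neighborhood falls back to borough
    ],
    columns=['PostalCode', 'Borough', 'Neighbourhood'],
)

# Step 1: drop unassigned boroughs and rows without a postal code
toy = toy[toy['Borough'] != 'Not assigned']
toy = toy[toy['PostalCode'].notnull()]

# Step 2: one row per postal code, neighborhoods joined with commas
toy = toy.groupby(['PostalCode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()

# Step 3: replace a 'Not assigned' neighborhood with the borough name
mask = toy['Neighbourhood'] == 'Not assigned'
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']
print(toy)
```

The two `M1B` rows collapse into a single row, and the `M7A` row takes its borough name as its neighborhood.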


In [139]:
print('The number of rows in the data frame : ',df.shape[0])

The number of rows in the data frame :  103
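
Beyond the row count, a few structural checks can confirm that the frame has the expected shape: one row per postal code and no remaining `Not assigned` values. A minimal sketch, assuming a hypothetical helper `check_processed` (not part of the original notebook) and a made-up `example` frame:

```python
import pandas

def check_processed(frame):
    """Return a list of problems found in a processed postal-code frame."""
    problems = []
    if frame['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    if (frame['Borough'] == 'Not assigned').any():
        problems.append('unassigned boroughs')
    if (frame['Neighbourhood'] == 'Not assigned').any():
        problems.append('unassigned neighbourhoods')
    return problems

# Hypothetical frame in the same shape as the processed data.
example = pandas.DataFrame({
    'PostalCode': ['M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Woburn'],
})
print(check_processed(example))
```

An empty list means none of the three problems was found; calling `check_processed(df)` on the real frame would apply the same checks.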
