## Segmenting and Clustering Neighborhoods in Toronto

This notebook explores, segments, and clusters the neighborhoods of Toronto. A Wikipedia page lists the postal codes, boroughs, and neighborhoods needed for this analysis. Here we scrape that page, clean and wrangle the data, and read it into a pandas DataFrame.

### Week 3

In [79]:
import pandas as pd

In [80]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# read_html returns a list with one DataFrame per table found on the page
Toronto_list = pd.read_html(url)

This particular Wikipedia page contains three different tables, as the following output shows:

In [81]:
print('Tables extracted from the page:',len(Toronto_list))
print('Length of the first Table:', len(Toronto_list[0]))
print('Length of the second Table:', len(Toronto_list[1]))
print('Length of the third Table:', len(Toronto_list[2]))

Tables extracted from the page: 3
Length of the first Table: 181
Length of the second Table: 6
Length of the third Table: 2


We are only interested in the first table, so we keep it as our working DataFrame:

In [83]:
Toronto_df = pd.DataFrame(Toronto_list[0])
Toronto_df.head()

Unnamed: 0,0,1,2
0,Postal Code,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


The column labels are generic integers, while the real headers sit in the first row, so we promote that row to the header:

In [84]:
headers = Toronto_df.iloc[0]
Toronto_df = pd.DataFrame(Toronto_df.values[1:], columns = headers)
Toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
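
The header promotion above can be sketched on a small, made-up frame. The names `raw` and `tidy` and the three sample rows are illustrative assumptions, not part of the scraped data. Passing `header=0` to `pd.read_html` may achieve the same result at parse time, but that is not what this notebook does.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: integer column labels,
# with the real header stored in row 0.
raw = pd.DataFrame([
    ["Postal Code", "Borough", "Neighborhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

# Drop row 0 from the data, renumber the rows, then use row 0 as the header.
tidy = raw.iloc[1:].reset_index(drop=True)
tidy.columns = raw.iloc[0].tolist()

print(tidy)
```

Unlike rebuilding the frame from `.values`, this variant keeps the data in a DataFrame throughout and renumbers the rows in the same step.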


The rows with 'Not assigned' in the Borough column need to be removed:

In [87]:
Toronto_df = Toronto_df[Toronto_df.Borough != 'Not assigned']
Toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
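
As a minimal sketch of how this filter works, the example below applies the same comparison to a four-row toy frame. The frame `demo` and the names `has_borough` and `kept` are assumptions made for illustration.

```python
import pandas as pd

# Toy data mirroring the first rows of the table (illustrative only).
demo = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for rows whose Borough holds a real value.
has_borough = demo["Borough"] != "Not assigned"
kept = demo[has_borough]

# The surviving rows keep their original labels, so the index now has gaps.
print(kept.index.tolist())
```

The filtered frame starts at label 2 rather than 0, which matches the output above.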


Removing those rows left gaps in the index, so we now reset it to a consecutive range:

In [89]:
Toronto_df.reset_index(drop=True, inplace=True)

In [90]:
Toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
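
The filter and the index reset can also be chained into one expression that returns a new frame instead of modifying one in place. This is a sketch on made-up data; `demo` and `cleaned` are hypothetical names.

```python
import pandas as pd

# Toy frame with placeholder rows interleaved (illustrative only).
demo = pd.DataFrame(
    {"Borough": ["Not assigned", "North York", "Not assigned", "Etobicoke"]}
)

cleaned = (
    demo[demo["Borough"] != "Not assigned"]
    # drop=True discards the old labels instead of keeping them as a column
    .reset_index(drop=True)
)

print(cleaned.index.tolist())
```

Without `drop=True`, `reset_index` would keep the old labels as an extra column named `index`.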


We can now check whether any rows still have 'Not assigned' in the Neighborhood column. There are none, as shown here:

In [93]:
print(len(Toronto_df[Toronto_df.Neighborhood == 'Not assigned']))

0
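
A variant of this check counts the placeholder in every column at once. The sketch below uses an invented frame in which one neighborhood is deliberately left as 'Not assigned'; that row is hypothetical and does not occur in the cleaned Toronto data.

```python
import pandas as pd

# Invented data: the second Neighborhood value is a deliberate placeholder.
demo = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto", "Etobicoke"],
    "Neighborhood": ["Parkwoods", "Not assigned", "Islington Avenue"],
})

# Comparing the whole frame yields a boolean frame; sum() counts True per column.
placeholder_counts = (demo == "Not assigned").sum()

print(placeholder_counts)
```

A result of zero for every column would confirm that no placeholders remain.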


The required column name is 'PostalCode' rather than 'Postal Code', so we rename the corresponding column:

In [104]:
Toronto_df.rename(columns = {'Postal Code':'PostalCode'},inplace=True)
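
For comparison, here is a sketch of the non-in-place form of `rename` on an empty frame with the same column labels. The names `demo` and `renamed` are illustrative.

```python
import pandas as pd

# Empty frame carrying only the column labels (illustrative only).
demo = pd.DataFrame(columns=["Postal Code", "Borough", "Neighborhood"])

# rename returns a new frame unless inplace=True is passed;
# labels missing from the mapping are left unchanged.
renamed = demo.rename(columns={"Postal Code": "PostalCode"})

print(list(renamed.columns))
```

The original `demo` keeps its old label, so the result must be assigned to a name to take effect.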

Finally, although it is somewhat redundant, we show the shape of the resulting DataFrame:

In [97]:
Toronto_df.shape

(103, 3)