# Segmenting and Clustering Neighborhoods in the City of Toronto, Canada

First, we import pandas and read the postal-code tables from the Wikipedia page. We then inspect the raw data to decide what cleaning is needed.

In [2]:
import pandas as pd
d=pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)
d

[    Postcode           Borough  \
 0        M1A      Not assigned   
 1        M2A      Not assigned   
 2        M3A        North York   
 3        M4A        North York   
 4        M5A  Downtown Toronto   
 5        M5A  Downtown Toronto   
 6        M6A        North York   
 7        M6A        North York   
 8        M7A      Queen's Park   
 9        M8A      Not assigned   
 10       M9A         Etobicoke   
 11       M1B       Scarborough   
 12       M1B       Scarborough   
 13       M2B      Not assigned   
 14       M3B        North York   
 15       M4B         East York   
 16       M4B         East York   
 17       M5B  Downtown Toronto   
 18       M5B  Downtown Toronto   
 19       M6B        North York   
 20       M7B      Not assigned   
 21       M8B      Not assigned   
 22       M9B         Etobicoke   
 23       M9B         Etobicoke   
 24       M9B         Etobicoke   
 25       M9B         Etobicoke   
 26       M9B         Etobicoke   
 27       M1C       

The page contains 3 tables, and the one we need is the first. We select it and replace the 'Not assigned' values with NaN.

In [3]:
d0=d[0]

import numpy as np
d0.replace("Not assigned", np.nan, inplace=True)
d0

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,
9,M8A,,
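
As a minimal sketch of the same step, the table can be picked by its column names instead of its position in the list, and `replace` can return a new frame rather than modifying in place. The `tables` list and the `pick_table` helper below are hypothetical stand-ins for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the list of tables that pd.read_html returns.
tables = [
    pd.DataFrame({
        "Postcode": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
    pd.DataFrame({"Other": [1, 2]}),
]


def pick_table(tables, required_columns):
    """Return the first table that contains every required column."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError("no table has the required columns")


codes = pick_table(tables, ["Postcode", "Borough", "Neighbourhood"])
# replace() returns a new frame here, so the original table is left untouched.
codes = codes.replace("Not assigned", np.nan)
```

Selecting by column names is a little more robust if the order of the tables on the page changes.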


As the assignment instructs, for rows that have a valid borough but no neighborhood, we assign the borough value as the neighborhood value too.

In [6]:
d0.loc[d0['Neighbourhood'].isnull(),'Neighbourhood'] = d0['Borough']
d0.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
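
The first five rows do not show the effect of this step, because the change only touches rows such as M7A. A small sketch on made-up sample rows (the `sample` frame is an assumption for illustration) shows the same fill written with `fillna`, which aligns on the index just like the `.loc` assignment:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one unassigned row and one borough-only row.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": [np.nan, "North York", "Queen's Park"],
    "Neighbourhood": [np.nan, "Parkwoods", np.nan],
})

# Missing neighbourhoods take the borough from the same row;
# rows where both are missing stay NaN.
sample["Neighbourhood"] = sample["Neighbourhood"].fillna(sample["Borough"])
```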


Next, we drop the rows that still contain NaN values and reset the index.

In [7]:
d0.dropna(inplace=True)
d0.reset_index(drop=True, inplace=True)
d0.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
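
A possible variant, sketched on hypothetical sample rows, restricts the drop to rows without a borough and returns a new frame instead of modifying in place. After the previous step both approaches should remove the same rows, since a row only keeps a missing neighborhood when its borough is missing too.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the first row has no borough assigned.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": [np.nan, "North York", "Queen's Park"],
    "Neighbourhood": [np.nan, "Parkwoods", "Queen's Park"],
})

# Drop only rows whose Borough is missing, then renumber the index from 0.
cleaned = sample.dropna(subset=["Borough"]).reset_index(drop=True)
```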


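Now, we create a new dataframe with one row per postcode: rows that share a postcode are merged, and their values are joined with commas. Note that the borough is joined as well, so it is repeated for postcodes with several neighborhoods.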

In [8]:
result = d0.groupby(['Postcode'], sort=False).agg( ', '.join)
result.head()

Unnamed: 0_level_0,Borough,Neighbourhood
Postcode,Unnamed: 1_level_1,Unnamed: 2_level_1
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,"Downtown Toronto, Downtown Toronto","Harbourfront, Regent Park"
M6A,"North York, North York","Lawrence Heights, Lawrence Manor"
M7A,Queen's Park,Queen's Park
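
If the repeated borough names are not wanted, one option is to aggregate each column differently. The sketch below uses made-up sample rows and assumes that every postcode belongs to exactly one borough, so keeping the first borough loses nothing; it is an alternative, not the aggregation used in this notebook.

```python
import pandas as pd

# Hypothetical sample rows; M5A appears twice with the same borough.
rows = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

merged = (
    rows.groupby("Postcode", sort=False)
    # Keep one borough per postcode, but join all neighbourhood names.
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
    .reset_index()
)
```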


# The resulting DataFrame

In [9]:
result.reset_index(inplace=True)
result.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,"Downtown Toronto, Downtown Toronto","Harbourfront, Regent Park"
3,M6A,"North York, North York","Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,"Scarborough, Scarborough","Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,"East York, East York","Woodbine Gardens, Parkview Hill"
9,M5B,"Downtown Toronto, Downtown Toronto","Ryerson, Garden District"


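Note: The postcodes keep the order in which they first appear in the source table (`sort=False`): the digit cycles first and the final letter advances afterwards (M3A, M4A, ..., M9A, then M1B, ...). This may differ slightly from the example in the peer-graded assignment, but it does not mean that my dataframe is incorrect.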

In [10]:
result.shape

(103, 3)
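
To back up the note on row order, two dataframes can be compared independently of that order by sorting both on the postcode first. This is a minimal sketch: the `mine` and `reference` frames and the `same_rows` helper are hypothetical and stand in for this result and the assignment's example.

```python
import pandas as pd

mine = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront, Regent Park"],
})
# Hypothetical reference frame holding the same rows in a different order.
reference = mine.iloc[[2, 0, 1]].reset_index(drop=True)


def same_rows(left, right, key="Postcode"):
    """Compare two frames after sorting both by the key column."""
    left_sorted = left.sort_values(key).reset_index(drop=True)
    right_sorted = right.sort_values(key).reset_index(drop=True)
    return left_sorted.equals(right_sorted)


order_independent_match = same_rows(mine, reference)
```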