<h1>Toronto Neighbourhood Clustering</h1>
<p>Exploring and clustering the neighbourhoods of Toronto.</p>

In [2]:
# import libraries
import numpy as np
import pandas as pd

<h3>Get the data</h3>
<p>Download the list of postcodes, boroughs and neighbourhoods from the <a href = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" target="blank">Wikipedia page</a> on postal codes of Canada.</p>

In [250]:
# Download the data and read the Postcode table into a dataframe
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_page = pd.read_html(url) #  list
df_neighbourhoods = wiki_page[0]        # dataframe [287 rows x 3 columns]

df_neighbourhoods.head()

# rename the Postcode column to PostalCode; Borough and Neighbourhood keep their names
df_neighbourhoods.rename(columns={"Postcode": "PostalCode"}, inplace=True)
df_neighbourhoods.head()


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


<h3>Data Transformations: "Not assigned"</h3>
<ul>
    <li>Ignore cells with a borough that is "Not assigned".</li>
    <li>If a cell has a borough but a "Not assigned" neighbourhood, then the neighbourhood will be the same as the borough.</li>
</ul>
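<p>The two rules above can be sketched on a small, hypothetical dataframe. The toy rows and the names <code>toy</code> and <code>missing</code> are assumptions made for illustration only; the real data comes from the Wikipedia table.</p>

```python
import pandas as pd

# Hypothetical toy data mirroring the three columns used in this notebook
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M9A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: drop rows whose borough is "Not assigned"
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 2: where the neighbourhood is "Not assigned", copy the borough
missing = toy["Neighbourhood"] == "Not assigned"
toy.loc[missing, "Neighbourhood"] = toy.loc[missing, "Borough"]

print(toy)
```

<p>Using a boolean mask with <code>.loc</code> updates all matching rows in one assignment, so no explicit loop is needed.</p>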

In [251]:
# Ignore cells with a borough that is "Not assigned"
df_neighbourhoods = df_neighbourhoods[df_neighbourhoods['Borough']!='Not assigned'].reset_index(drop=True) 


In [254]:
# find the rows where neighbourhood is "Not assigned"
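idx_list = df_neighbourhoods[df_neighbourhoods["Neighbourhood"] == "Not assigned"].index  # M9A Queen's Park Not assigned

# copy the borough into the neighbourhood column for those rows (one vectorised assignment, no loop needed)
df_neighbourhoods.loc[idx_list, 'Neighbourhood'] = df_neighbourhoods.loc[idx_list, 'Borough']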

# check results:
df_neighbourhoods.iloc[idx_list]
df_neighbourhoods[df_neighbourhoods["Neighbourhood"] == "Not assigned"] # should return no rows


Unnamed: 0,PostalCode,Borough,Neighbourhood


<h3>Group data on PostalCode and Borough</h3>

More than one neighbourhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighbourhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighbourhoods separated by a comma.
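<p>As a minimal sketch of this step, the M5A example can be reproduced on a hypothetical three-row dataframe. The names <code>toy</code> and <code>combined</code> are assumptions for illustration; the approach shown (a pandas <code>groupby</code> followed by a string join) is one possible way to combine the rows.</p>

```python
import pandas as pd

# Hypothetical toy data: M5A appears twice with different neighbourhoods
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group on postal code and borough, then join each group's neighbourhoods
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```

<p>By default, <code>groupby</code> sorts the group keys, so the combined rows come back ordered by postal code.</p>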

In [255]:
# Combine rows with the same postcode/borough: concatenate neighbourhoods
# group on PostalCode and Borough, joining each group's neighbourhoods with ", "
df_toronto = (
    df_neighbourhoods
    .groupby(['PostalCode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)

print("Dataframe shape: ", df_toronto.shape)  # for the assignment
df_toronto

Dataframe shape:  (103, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
