# Toronto Clustering

The Wikipedia table is already close to the format we need. Nonetheless, I treated it as if it were not and cleaned it anyway, to demonstrate how that could be done.

First we have to import pandas and download the table from the Wikipedia entry.

In [81]:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# read_html returns a list with every table on the page; we take the first one.
df = pd.read_html(url)[0]

We do not want any entries where "Borough" is "Not assigned".

In [82]:
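# .copy() gives us an independent frame, so later column assignments do not trigger SettingWithCopyWarning.
df = df[df["Borough"] != "Not assigned"].copy()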

If "Neighbourhood" is "Not assigned", we want to replace it with the value in "Borough".

In [83]:
df["Neighbourhood"] = df.apply(lambda x: x["Borough"] 
                               if x["Neighbourhood"] == "Not assigned" 
                               else x["Neighbourhood"], axis=1)
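The same replacement can also be written without a row-wise `apply`. The following is a minimal sketch using `Series.where`, run on a small made-up frame rather than the Wikipedia table; the rows in `toy` are hypothetical and only there to make the example self-contained.

```python
import pandas as pd

# Hypothetical rows standing in for the Wikipedia table.
toy = pd.DataFrame({
    "Postal Code": ["P1", "P2", "P3"],
    "Borough": ["Borough A", "Borough B", "Borough C"],
    "Neighbourhood": ["North", "Not assigned", "East"],
})

# True where a neighbourhood is present, False where it is "Not assigned".
assigned = toy["Neighbourhood"] != "Not assigned"

# where() keeps the value where the condition is True and
# takes the matching "Borough" value where it is False.
toy["Neighbourhood"] = toy["Neighbourhood"].where(assigned, toy["Borough"])

print(toy)
```

Both versions should give the same result; the vectorized form just avoids calling a Python function once per row.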

We want to merge the rows that share a "Postal Code" into a single row whose "Neighbourhood" cell lists all of their neighbourhoods, separated by commas. Finally, we show the first eleven rows of our dataframe.

In [80]:
# Map every postal code that appears more than once to an (initially empty) string.
duplicated_codes = df[df.duplicated(subset="Postal Code", keep=False)]["Postal Code"].unique()
replace = {code: "" for code in duplicated_codes}

# Build the comma-separated neighbourhood list for each duplicated postal code.
for index, row in df.iterrows():
    code = row["Postal Code"]
    if code in replace:
        if replace[code] == "":
            replace[code] = row["Neighbourhood"]
        else:
            replace[code] += ", " + row["Neighbourhood"]

# Write the joined strings back; rows with a unique postal code keep their value.
df["Neighbourhood"] = df.apply(lambda x: replace[x["Postal Code"]]
                               if x["Postal Code"] in replace
                               else x["Neighbourhood"], axis=1)

# Keep one row per postal code and renumber the index.
df = df.drop_duplicates(subset="Postal Code", keep="first").reset_index(drop=True)
df.head(11)

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
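As an alternative to the dictionary-and-loop approach, the merge can probably be expressed more compactly with `groupby`. This is only a sketch on made-up data: the rows in `toy` are hypothetical, and grouping on both "Postal Code" and "Borough" assumes that each postal code belongs to exactly one borough.

```python
import pandas as pd

# Hypothetical rows: postal code "P1" appears twice, "P2" once.
toy = pd.DataFrame({
    "Postal Code": ["P1", "P1", "P2"],
    "Borough": ["Borough A", "Borough A", "Borough B"],
    "Neighbourhood": ["North", "South", "East"],
})

# Group rows that share a postal code (and borough), then join the
# neighbourhood names of each group with ", ".
# sort=False keeps the groups in order of first appearance.
merged = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)

print(merged)
```

Here the two "P1" rows collapse into one row with the neighbourhood "North, South", while the "P2" row is left as it is.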


The last step is to print out the shape of our table.

In [84]:
df.shape

(103, 3)
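The shape alone does not tell us whether the cleaning rules actually hold. A few assertions could check them directly; the sketch below uses a hypothetical helper, `check_cleaned`, and a small made-up frame instead of the real table.

```python
import pandas as pd

def check_cleaned(frame: pd.DataFrame) -> None:
    """Hypothetical helper: raise AssertionError if a cleaning rule is broken."""
    assert frame["Postal Code"].is_unique, "duplicate postal codes remain"
    assert (frame["Borough"] != "Not assigned").all(), "unassigned boroughs remain"
    assert (frame["Neighbourhood"] != "Not assigned").all(), "unassigned neighbourhoods remain"

# Hypothetical cleaned rows standing in for the real dataframe.
cleaned = pd.DataFrame({
    "Postal Code": ["P1", "P2"],
    "Borough": ["Borough A", "Borough B"],
    "Neighbourhood": ["North, South", "East"],
})

check_cleaned(cleaned)  # passes silently when every rule holds
print(cleaned.shape)    # (2, 3)
```

Calling `check_cleaned(df)` on the real dataframe would apply the same three checks to it.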