# Segmenting and Clustering Neighborhoods in Toronto

First, we scrape the table of Toronto postal codes, boroughs and neighbourhoods from the Wikipedia page and load it into a pandas DataFrame.

In [1]:
import pandas as pd  # library for data analysis
import requests  # library to handle requests
from bs4 import BeautifulSoup  # library to parse HTML documents

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(url)
print(response.status_code)  # 200 means the request succeeded

200


In [3]:
soup = BeautifulSoup(response.text, 'html.parser')
toronto = soup.find('table', {'class': "wikitable"})

In [4]:
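from io import StringIO  # recent pandas versions expect a file-like object rather than a raw HTML string

# read_html returns a list of DataFrames; the table we selected is its only entry
df = pd.read_html(StringIO(str(toronto)))[0]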
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
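
The scraping steps above can be tried offline on a small inline snippet. The following is a minimal sketch, not the notebook's actual pipeline: `SAMPLE_HTML` and `table_to_frame` are hypothetical names introduced for illustration, and the markup is a made-up stand-in for the Wikipedia table. It also shows why searching for the single class `wikitable` is enough, since BeautifulSoup matches a class filter against each of the element's classes individually.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia markup (not the real page content).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def table_to_frame(html: str, css_class: str = "wikitable") -> pd.DataFrame:
    """Parse the first table carrying `css_class` into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any one of the element's classes, so "wikitable"
    # also finds a table declared as class="wikitable sortable".
    table = soup.find("table", class_=css_class)
    if table is None:
        raise ValueError(f"no table with class {css_class!r} found")
    rows = table.find_all("tr")
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)


sample_df = table_to_frame(SAMPLE_HTML)
print(sample_df)
```

Building the frame by hand keeps the sketch free of the extra parser dependencies that `pd.read_html` needs; for the real page, `pd.read_html` as used in the notebook is the shorter route.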


## Preprocessing and cleaning the data



Only process the rows that have an assigned borough. Rows whose Borough is Not assigned are dropped.


In [5]:
# check shape of df before
print(df.shape)

# drop rows where Borough is not assigned
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace=True)
# check shape after
print(df.shape)

# check head
df.head()

(180, 3)
(103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
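
As an aside, the same filter can be written without mutating the original frame. This is a minimal sketch on made-up rows (the `toy` frame is hypothetical, not the scraped data): a boolean mask keeps the assigned rows, and `reset_index(drop=True)` renumbers them from zero, which the in-place `drop` above does not do.

```python
import pandas as pd

# Hypothetical toy rows, not the scraped Wikipedia data.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough; `toy` itself is left unchanged.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```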


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

_As it happens, the data already arrives in this format; the Wikipedia page has clearly changed since the course was written._

_Below, I first show that the data is already presented that way, then give the code I would have used otherwise._

In [6]:
# the postal code mentioned above is already presented as desired:
df.loc[df['Postal Code'] == 'M5A']


Unnamed: 0,Postal Code,Borough,Neighbourhood
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [7]:
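# had the rows not been combined already, this is how I would have done it:
# join the neighbourhoods that share a postal code into one comma-separated string
df1 = df.groupby('Postal Code')['Neighbourhood'].apply(', '.join).reset_index()
# re-attach the Borough column, then keep one row per postal code
df2 = df.drop(['Neighbourhood'], axis=1)
df2 = pd.merge(df1, df2, on='Postal Code')
df2 = df2.drop_duplicates(subset=['Postal Code'])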
df2.head()

Unnamed: 0,Postal Code,Neighbourhood,Borough
0,M1B,"Malvern, Rouge",Scarborough
1,M1C,"Rouge Hill, Port Union, Highland Creek",Scarborough
2,M1E,"Guildwood, Morningside, West Hill",Scarborough
3,M1G,Woburn,Scarborough
4,M1H,Cedarbrae,Scarborough
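
Because the scraped table has no duplicated postal codes, the combining step is easier to see on data that does. The following is a minimal sketch on made-up rows in the older one-neighbourhood-per-row layout (the `toy` frame is hypothetical). It assumes each postal code belongs to exactly one borough, which lets the grouping use both columns and skip the merge and de-duplication.

```python
import pandas as pd

# Hypothetical rows in the old layout: M5A appears twice.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group on postal code and borough together, then join the neighbourhood
# names of each group into one comma-separated string.
combined = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

With `sort=False` the groups keep their first-seen order, so M5A stays ahead of M3A instead of being sorted alphabetically.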


In [8]:
# as you can see, the df I would have created and the original are the same size

df.shape, df2.shape

((103, 3), (103, 3))

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [9]:
# check for rows with Not assigned neighbourhoods

df[df['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


_No rows have an assigned Borough but a Neighbourhood that is Not assigned._

_If there were, I would have done this:_


In [10]:
# .loc avoids chained assignment, which may silently fail to update df
df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df['Borough']
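
Since the scraped data contains no such rows, the rule can only be checked on made-up data. This is a minimal sketch using a hypothetical `toy` frame in which one row has a borough but no neighbourhood. It uses `Series.mask`, one of several ways to express the replacement, which swaps in the borough wherever the condition is true.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no assigned neighbourhood.
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Where the neighbourhood is missing, take the borough name instead.
missing = toy["Neighbourhood"] == "Not assigned"
toy["Neighbourhood"] = toy["Neighbourhood"].mask(missing, toy["Borough"])
print(toy)
```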

In [11]:
# here is the shape of the df
shape = df.shape
print(f"The shape of my dataframe is {shape}")

The shape of my dataframe is (103, 3)
