# Segmenting and Clustering Neighborhoods in Toronto
The goal of this notebook is to segment and cluster neighborhoods in Toronto, gathering each neighborhood's postal code and borough along the way.

In [1]:
!pip install bs4



In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

## Building the dataframe

I am using an older version of the Wikipedia page because it uses a basic HTML table, which is more easily parsed into a dataframe than the grid used in newer versions of the page. I also rename the columns.

In [3]:
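from io import StringIO

# Fetch a pinned (older) revision of the page so the table layout stays stable
req = requests.get("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1012118802")
req.raise_for_status()

# The first <table> on the page holds the postal codes
soup = BeautifulSoup(req.content, 'html.parser')
table = soup.find_all('table')[0]

# Recent pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(str(table)))[0]

df.columns = ['PostalCode', 'Borough', 'Neighborhood']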
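
To show what the parsing step does without a network request, here is a minimal sketch that reads a small inline table with BeautifulSoup and builds the dataframe by hand. The three rows and the `sample_df` name are illustrative assumptions, not the full Wikipedia data, and the row-by-row loop is a stand-in for `pd.read_html`, not the method used above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical three-row sample laid out like the Wikipedia table
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Skip the header row, then collect the text of every cell in each data row
rows = []
for tr in table.find_all("tr")[1:]:
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

sample_df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```

A plain `<table>` maps directly onto rows and cells like this, which is why the older page revision is easier to work with.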


Cleaning the data. First I replace all occurrences of "Not assigned" with NaN. If a cell has a borough but no neighborhood, the neighborhood is set to the borough. Afterwards, every row that still contains NaN is dropped. I also reset the index to close the gaps left by the dropped rows.

In [4]:
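# Mark unassigned cells as missing
df = df.replace("Not assigned", np.nan)
# Fall back to the borough name where the neighborhood is missing
# (assigning the result avoids the unreliable chained inplace fillna)
df["Neighborhood"] = df["Neighborhood"].fillna(df["Borough"])
# Drop rows that still have missing values and renumber the index
df = df.dropna().reset_index(drop=True)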
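
The effect of these cleaning steps is easier to see on a small example. The sketch below applies the same steps to a hypothetical four-row dataframe (`toy` and its values are made up for illustration) that covers each case: a fully unassigned row, a complete row, and a row with a borough but no neighborhood.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering each case handled by the cleaning steps
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A", "M9A"],
    "Borough": ["Not assigned", "North York", "Queen's Park", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned", "Islington Avenue"],
})

# 1. Treat "Not assigned" as missing
toy = toy.replace("Not assigned", np.nan)
# 2. A missing neighborhood takes the borough name (if the borough is known)
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])
# 3. Rows with no borough are still missing data, so drop them and reindex
toy = toy.dropna().reset_index(drop=True)

print(toy)
```

In this toy data, the M1A row is dropped because it has no borough to fall back on, while the M7A row survives with its neighborhood set to the borough name.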

In [5]:
df.shape

(103, 3)