# Segmenting and Clustering Neighborhoods in Toronto

### 1. Initialize the Notebook

Import the libraries the notebook depends on.

In [135]:
# Import pandas 
import pandas as pd

### 2. Read the data and transform it into a dataframe

Read the tables on the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and keep the one that holds the postal codes.

In [136]:
# read_html returns a list with one DataFrame per HTML table found on the page
read_postal = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

# The first table in the list holds the postal code data
df_postal = read_postal[0]
df_postal.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
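Taking `read_postal[0]` assumes the postal code table is the first table on the page, which may stop being true if the page layout changes. The sketch below shows one way to pick the table by its column names instead. The helper `pick_table` and the stand-in `tables` list are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table contains columns {sorted(required)}")


# Stand-in for the list returned by pd.read_html (made-up first table,
# second table mirrors the column layout shown above)
tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({
        "Postal code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighborhood": [None, "Parkwoods"],
    }),
]

df_example = pick_table(tables, ["Postal code", "Borough", "Neighborhood"])
print(df_example.shape)
```

With the real page, the call would be `pick_table(read_postal, [...])` using whatever column names the page currently exposes.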


### 3. Clean the data

Rename the columns to PostalCode, Borough and Neighborhood. Drop the rows whose Borough is 'Not assigned', and where Neighborhood is null, fill it with the Borough value. Finally, each postal code should appear on a single row, with its neighborhoods joined by commas.

In [137]:
# Set the column names
df_postal.columns = ["PostalCode","Borough","Neighborhood"]

# Drop the 'Not assigned' from the Borough column
df_postal = df_postal[df_postal["Borough"] != "Not assigned"].reset_index(drop=True)

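# Replace null Neighborhood values with Borough column values
# (assign the result back; calling fillna with inplace=True on a selected column may not update the dataframe)
df_postal["Neighborhood"] = df_postal["Neighborhood"].fillna(df_postal["Borough"])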

# The page already lists one row per postal code, with neighborhoods separated by ' / ',
# so no grouping is needed: replace the separator with ', ' (literal match, not a regex)
df_postal["Neighborhood"] = df_postal["Neighborhood"].str.replace(' / ', ', ', regex=False)

# Display the first 12 results of the cleaned up data
df_postal.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
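The cell above skips the grouping step because the page already has one row per postal code. If the source instead listed one row per neighborhood, the same cleaning could be done with a `groupby`. The following is a minimal sketch under that assumption; the function name `clean_postal` and the `raw` rows (including the invented `M0Z` entry) are illustrative only.

```python
import pandas as pd

# Illustrative input with one row per neighborhood; "M0Z" is an invented code
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M6A", "M6A", "M0Z"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto",
                "North York", "North York", "Example Borough"],
    "Neighborhood": [None, "Regent Park", "Harbourfront",
                     "Lawrence Manor", "Lawrence Heights", None],
})


def clean_postal(df):
    """Drop unassigned boroughs, fill missing neighborhoods, group by postal code."""
    out = df[df["Borough"] != "Not assigned"].copy()
    # A missing Neighborhood falls back to the Borough name
    out["Neighborhood"] = out["Neighborhood"].fillna(out["Borough"])
    # One row per postal code, neighborhoods joined with ', ';
    # sort=False keeps the order in which the codes first appear
    return (
        out.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )


cleaned = clean_postal(raw)
print(cleaned)
```

Working on a `.copy()` of the filtered frame keeps the input untouched and avoids chained-assignment warnings.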


### 4. Show the size of the dataframe

In [138]:
# Use the .shape attribute to display the number of rows and columns in the dataframe
df_postal.shape

(103, 3)
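The shape alone does not confirm that the cleaning rules from step 3 hold. As a rough sketch, a few checks like the ones below could be run on the cleaned dataframe. The helper `find_problems` is hypothetical, and the sample frames reuse two rows from the output above plus one deliberately bad row.

```python
import pandas as pd

EXPECTED_COLUMNS = ["PostalCode", "Borough", "Neighborhood"]


def find_problems(df):
    """Return a list of human-readable problems found in a cleaned dataframe."""
    if list(df.columns) != EXPECTED_COLUMNS:
        # Later checks rely on these names, so stop here
        return ["unexpected column names"]
    problems = []
    if (df["Borough"] == "Not assigned").any():
        problems.append("rows with an unassigned Borough remain")
    if df["Neighborhood"].isna().any():
        problems.append("null Neighborhood values remain")
    if not df["PostalCode"].is_unique:
        problems.append("duplicate postal codes")
    return problems


good = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})
bad = pd.DataFrame({
    "PostalCode": ["M1A", "M1A"],
    "Borough": ["Not assigned", "Scarborough"],
    "Neighborhood": [None, "Malvern, Rouge"],
})

print(find_problems(good))
print(find_problems(bad))
```

In the notebook, `find_problems(df_postal)` returning an empty list would indicate that all three rules are satisfied.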