# Segmenting and Clustering Neighborhoods in Toronto

## 1. Assignment description

In this assignment, you will explore, segment, and cluster the neighborhoods of Toronto based on postal code and borough information.

A Wikipedia page contains all the information needed to explore and cluster Toronto's neighborhoods. You will scrape that page, wrangle and clean the data, and read it into a pandas DataFrame so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis performed on the New York City dataset to explore and cluster the neighborhoods of Toronto.


## Part 1 - Scrape the Wikipedia Page and Create a DataFrame

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [4]:
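# URL of the Wikipedia page listing the Canadian postal codes that start with "M"
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
# Parse the downloaded HTML with the lxml parser
soup = BeautifulSoup(source, "lxml")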

In [6]:
table = soup.find("table")

# Create an empty DataFrame with the target columns
column_names = ["Postalcode","Borough", "Neighborhood"]
df = pd.DataFrame(columns = column_names)

In [11]:
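# Collect the postal code, borough, and neighborhood from every table row.
# Note: this appends to df, so re-running the cell adds the same rows again.
for row in table.find_all("tr"):
    row_data = []
    for cell in row.find_all("td"):
        row_data.append(cell.text.strip())

    # Keep only rows with exactly three data cells
    if len(row_data) == 3:
        df.loc[len(df)] = row_data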
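
Because `df.loc[len(df)] = row_data` appends on every execution, running the cell more than once adds the same rows again. The following is a minimal sketch of one alternative, assuming the same three-column table layout: collect the rows in a list and build the DataFrame in a single step, so the result does not depend on how often the code runs. The inline `sample_html` string and the `parse_table` helper are hypothetical stand-ins for the live page, and Python's built-in `html.parser` is used instead of lxml so the sketch has one dependency fewer.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

def parse_table(html):
    """Return a DataFrame with one row per three-cell table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    records = []
    for row in table.find_all("tr"):
        cells = [cell.text.strip() for cell in row.find_all("td")]
        if len(cells) == 3:  # the sample header row has only <th> cells
            records.append(cells)
    # Build the DataFrame once, from the complete list of rows
    return pd.DataFrame(records, columns=["Postalcode", "Borough", "Neighborhood"])

sample_df = parse_table(sample_html)
print(sample_df)
```

Calling `parse_table` again returns a fresh DataFrame with the same rows instead of growing an existing one.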

In [12]:
df.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [15]:
df = df[df["Borough"]!= "Not assigned"]
df

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 520 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 525 | M4Y | Downtown Toronto | Church and Wellesley |
| 528 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 529 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
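
After the filter, the index keeps its original labels (2, 3, 4, ...), so it no longer counts rows from zero. A minimal sketch, using a toy frame whose values are copied from the first rows shown earlier, of filtering and renumbering in one step; the names `toy` and `assigned` are illustrative only.

```python
import pandas as pd

# Toy rows mirroring the first entries of the scraped table
toy = pd.DataFrame({
    "Postalcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the index from 0
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```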


In [26]:
# Join the neighborhood names of each postal code into one comma-separated string
temp_df = df.groupby('Postalcode')['Neighborhood'].apply(', '.join)
temp_df = temp_df.reset_index(drop=False)
temp_df.rename(columns={"Neighborhood":"Neighborhood_joined"}, inplace=True)

temp_df

| | Postalcode | Neighborhood_joined |
|---|---|---|
| 0 | M1B | Malvern, Rouge, Malvern, Rouge, Malvern, Rouge |
| 1 | M1C | Rouge Hill, Port Union, Highland Creek, Rouge ... |
| 2 | M1E | Guildwood, Morningside, West Hill, Guildwood, ... |
| 3 | M1G | Woburn, Woburn, Woburn |
| 4 | M1H | Cedarbrae, Cedarbrae, Cedarbrae |
| ... | ... | ... |
| 98 | M9N | Weston, Weston, Weston |
| 99 | M9P | Westmount, Westmount, Westmount |
| 100 | M9R | Kingsview Village, St. Phillips, Martin Grove ... |
| 101 | M9V | South Steeles, Silverstone, Humbergate, Jamest... |
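
The repeated names in `Neighborhood_joined` (for example "Woburn, Woburn, Woburn") suggest that `df` holds each scraped row more than once. A minimal sketch, assuming exact duplicate rows are unwanted, that drops them before joining; the toy frame reuses two rows from the earlier output and repeats them three times to imitate the situation.

```python
import pandas as pd

# Toy frame in which every row appears three times
toy = pd.DataFrame({
    "Postalcode": ["M3A", "M5A"] * 3,
    "Borough": ["North York", "Downtown Toronto"] * 3,
    "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront"] * 3,
})

# Remove exact duplicate rows first, then join the names per postal code
joined = (
    toy.drop_duplicates()
       .groupby("Postalcode")["Neighborhood"]
       .apply(", ".join)
       .reset_index(name="Neighborhood_joined")
)
print(joined)
```

With the duplicates removed first, each postal code keeps a single copy of its neighborhood string.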


In [29]:
df_merge = pd.merge(df, temp_df, on='Postalcode')
df_merge

| | Postalcode | Borough | Neighborhood | Neighborhood_joined |
|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | Parkwoods, Parkwoods, Parkwoods |
| 1 | M3A | North York | Parkwoods | Parkwoods, Parkwoods, Parkwoods |
| 2 | M3A | North York | Parkwoods | Parkwoods, Parkwoods, Parkwoods |
| 3 | M4A | North York | Victoria Village | Victoria Village, Victoria Village, Victoria V... |
| 4 | M4A | North York | Victoria Village | Victoria Village, Victoria Village, Victoria V... |
| ... | ... | ... | ... | ... |
| 304 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | Old Mill South, King's Mill Park, Sunnylea, Hu... |
| 305 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | Old Mill South, King's Mill Park, Sunnylea, Hu... |
| 306 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... | Mimico NW, The Queensway West, South of Bloor,... |
| 307 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... | Mimico NW, The Queensway West, South of Bloor,... |
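
The merge is many-to-one: each row of `df` picks up the single matching row of `temp_df`, so the merged frame has as many rows as `df`, repeats included. A minimal sketch of that behavior on toy data; the `validate` argument of `pd.merge` is an optional check added here for illustration and is not part of the original notebook.

```python
import pandas as pd

# Left side: one postal code appears twice, as repeated rows do in df
left = pd.DataFrame({
    "Postalcode": ["M3A", "M3A", "M4A"],
    "Borough": ["North York"] * 3,
})
# Right side: exactly one row per postal code, like temp_df
right = pd.DataFrame({
    "Postalcode": ["M3A", "M4A"],
    "Neighborhood_joined": ["Parkwoods", "Victoria Village"],
})

# validate raises a MergeError if a postal code is not unique in `right`
merged = pd.merge(left, right, on="Postalcode", validate="many_to_one")
print(merged)
```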


In [30]:
df_merge.drop(['Neighborhood'],axis=1,inplace=True)
df_merge

| | Postalcode | Borough | Neighborhood_joined |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods, Parkwoods, Parkwoods |
| 1 | M3A | North York | Parkwoods, Parkwoods, Parkwoods |
| 2 | M3A | North York | Parkwoods, Parkwoods, Parkwoods |
| 3 | M4A | North York | Victoria Village, Victoria Village, Victoria V... |
| 4 | M4A | North York | Victoria Village, Victoria Village, Victoria V... |
| ... | ... | ... | ... |
| 304 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
| 305 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
| 306 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |
| 307 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


In [31]:
df_merge.drop_duplicates(inplace=True)
df_merge

| | Postalcode | Borough | Neighborhood_joined |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods, Parkwoods, Parkwoods |
| 3 | M4A | North York | Victoria Village, Victoria Village, Victoria V... |
| 6 | M5A | Downtown Toronto | Regent Park, Harbourfront, Regent Park, Harbou... |
| 9 | M6A | North York | Lawrence Manor, Lawrence Heights, Lawrence Man... |
| 12 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government, Q... |
| ... | ... | ... | ... |
| 294 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North,... |
| 297 | M4Y | Downtown Toronto | Church and Wellesley, Church and Wellesley, Ch... |
| 300 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 303 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


In [33]:
df_merge.rename(columns={'Neighborhood_joined':'Neighborhood'},inplace=True)
df_merge.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods, Parkwoods, Parkwoods |
| 3 | M4A | North York | Victoria Village, Victoria Village, Victoria V... |
| 6 | M5A | Downtown Toronto | Regent Park, Harbourfront, Regent Park, Harbou... |
| 9 | M6A | North York | Lawrence Manor, Lawrence Heights, Lawrence Man... |
| 12 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government, Q... |


In [35]:
df_merge.shape

(103, 3)
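
The group, merge, drop, de-duplicate, and rename steps can also be written as a single grouped aggregation. The sketch below is one possible condensed form, not the notebook's own method. It assumes each postal code belongs to exactly one borough and that exact duplicate rows should be discarded. The toy frame stands in for the filtered `df`.

```python
import pandas as pd

# Toy stand-in for the filtered frame, with every row repeated three times
toy = pd.DataFrame({
    "Postalcode": ["M3A", "M4A", "M5A"] * 3,
    "Borough": ["North York", "North York", "Downtown Toronto"] * 3,
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"] * 3,
})

# Drop repeated rows, then join neighborhood names per (postal code, borough)
result = (
    toy.drop_duplicates()
       .groupby(["Postalcode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(result)
print(result.shape)
```

Grouping on both `Postalcode` and `Borough` keeps the borough column in the result, so no merge back onto the original frame is needed.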