# Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Scraping data from Wikipedia

First, we fetch the table of Canadian postal codes beginning with M from Wikipedia and load it into a pandas dataframe.

In [2]:
import pandas as pd

In [20]:
df=pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
print(df.shape)
df.head()

(180, 3)


|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


Next, we clean the dataframe by removing the rows that have no assigned borough. Where a row has a borough but no assigned neighborhood, we use the borough name as the neighborhood.

In [21]:
# Keep only the rows that have an assigned borough
df = df[df.Borough != 'Not assigned']
df = df.sort_values(by=['Postal code', 'Borough'])

# Renumber the rows from 0 and discard the old index
df = df.reset_index(drop=True)

df.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [24]:
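# Where the neighborhood is missing or 'Not assigned', fall back to the borough name
no_hood = df['Neighborhood'].isna() | df['Neighborhood'].eq('Not assigned')
df_neighbor = df.assign(Neighborhood=df['Neighborhood'].mask(no_hood, df['Borough']))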
df_neighbor.head(10)

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park / Ionview / East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile / Clairlea / Oakridge |
| 8 | M1M | Scarborough | Cliffside / Cliffcrest / Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff / Cliffside West |
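
The two cleaning steps can be tried in isolation on a small hand-made frame. This is only a sketch, not the notebook's own code: the `toy` frame is illustrative, and its last row (`M9Z`, `Example Borough`) is invented to show the fallback, since the real table may not contain such a case.

```python
import pandas as pd

# Toy data for illustration only; the "M9Z" row is invented, not taken from Wikipedia.
toy = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Example Borough"],
    "Neighborhood": [None, "Parkwoods", "Not assigned"],
})

# Step 1: drop rows whose borough is not assigned, then renumber the rows.
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Step 2: where the neighborhood is missing or 'Not assigned', use the borough name.
needs_fill = toy["Neighborhood"].isna() | toy["Neighborhood"].eq("Not assigned")
toy["Neighborhood"] = toy["Neighborhood"].mask(needs_fill, toy["Borough"])

print(toy)
```

Building the boolean mask first keeps the replacement row-wise: each affected row receives its own borough rather than a single shared value.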


In [25]:
df_neighbor.shape

(103, 3)

## Part 2: Getting the latitude and the longitude coordinates of each neighborhood

For this, we load the geospatial data, which is provided as a CSV file, into a dataframe.

In [37]:
# The code was removed by Watson Studio for sharing.

|   | Postal code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
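
Because the loading cell is hidden, the exact code is not available. As a rough sketch of how such a CSV could be read into `geo_df`, the example below parses an in-memory string instead of a real file. The five rows are copied from the output above; the header names are an assumption chosen to match that output, and the actual file's location and column names are unknown.

```python
import io
import pandas as pd

# Stand-in for the CSV file. Rows are copied from the output shown above;
# the header line is an assumption, since the original file is not shown.
csv_text = """Postal code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
"""

# pd.read_csv accepts a file path, a URL, or a file-like object such as StringIO.
geo_df = pd.read_csv(io.StringIO(csv_text))
print(geo_df.shape)
print(geo_df.head())
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file's path or URL.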


Then we merge the geospatial data with the main dataframe from Part 1, using the postal code as the join key.

In [38]:
# Merge coordinates into the neighborhood dataframe on the shared 'Postal code' column
df_neighbor_new = df_neighbor.merge(geo_df, on='Postal code')
df_neighbor_new.head()

|   | Postal code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
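
By default `merge` performs an inner join, so a postal code with no matching coordinates would be dropped without warning. One possible way to check for that is sketched below on small stand-in frames. The names `neighborhoods`, `coords`, and `unmatched` are illustrative, the rows are taken from the tables in this notebook, and `M1E` is deliberately left out of `coords` to trigger the check.

```python
import pandas as pd

# Small stand-ins for df_neighbor and geo_df, using rows shown in this notebook.
neighborhoods = pd.DataFrame({
    "Postal code": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": [
        "Malvern / Rouge",
        "Rouge Hill / Port Union / Highland Creek",
        "Guildwood / Morningside / West Hill",
    ],
})
coords = pd.DataFrame({
    "Postal code": ["M1B", "M1C"],  # M1E is left out on purpose
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every neighborhood row, matched or not.
# indicator=True adds a "_merge" column recording where each row came from.
# validate="one_to_one" raises an error if a postal code repeats on either side.
merged = neighborhoods.merge(
    coords, on="Postal code", how="left", indicator=True, validate="one_to_one"
)

# Postal codes that found no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal code"].tolist()
print(unmatched)  # ['M1E']
```

If `unmatched` is empty, the inner join used in the notebook loses no rows.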
