# Segmenting and Clustering Neighborhoods in Toronto

This notebook is part of the Capstone Project for the IBM Data Science Professional Certificate on Coursera. Its goal is to segment and cluster the neighborhoods of Toronto.

In [4]:
import pandas as pd
import numpy as np

## 1. Creating a Data Frame of Toronto Neighborhood Data 

First, the table of Toronto neighborhoods is downloaded from the corresponding [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and converted to a pandas DataFrame.

In [72]:
# read_html returns a list with one DataFrame per table on the page; the first one is the postal code table
toronto_df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
toronto_df.head() #still mirrors the raw Wikipedia table, so it needs cleaning before it is useful

Unnamed: 0,Post Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


All rows with an unassigned borough are dropped:

In [73]:
toronto_df = toronto_df[toronto_df['Borough'] != 'Not assigned'] #keep only rows whose borough is assigned
toronto_df.head()

Unnamed: 0,Post Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
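
The filtering step can be tried on a small table first. The following is a minimal sketch using a boolean mask; the three rows in `sample` are illustrative stand-ins, not the scraped data, and the names `sample` and `assigned` are assumptions made for this example.

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative values only)
sample = pd.DataFrame({
    "Post Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose borough is assigned; the original index labels are preserved
assigned = sample[sample["Borough"] != "Not assigned"]
print(assigned)
```

Because the index labels are kept, the first remaining row still carries its original label, which matches the output above starting at index 2.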


Next, each post code should appear in exactly one row, so rows that share a post code are combined:

In [74]:
#combine rows that share a post code, joining their neighborhood names with commas
toronto_df = toronto_df.groupby(['Post Code','Borough'])['Neighborhood'].apply(', '.join).reset_index()
toronto_df.head()

Unnamed: 0,Post Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
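
To see what combining rows does when a post code really is spread over several rows, here is a minimal sketch on toy data. The rows in `sample` are illustrative, and joining with `", "` is an assumption about the desired separator, chosen to match the comma-separated names shown in the output above.

```python
import pandas as pd

# Toy data in which post code M5A is split across two rows (illustrative values only)
sample = pd.DataFrame({
    "Post Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group by post code and borough, then join each group's neighborhood names
combined = (
    sample.groupby(["Post Code", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Note that `groupby` sorts by the group keys by default, which is why the result is ordered by post code.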


Any neighborhood that is still marked 'Not assigned' takes the name of its borough:

In [75]:
#replace unassigned neighborhoods with the name of their borough, row by row
toronto_df['Neighborhood'] = toronto_df['Neighborhood'].mask(toronto_df['Neighborhood'] == 'Not assigned', toronto_df['Borough'])
toronto_df.head()

Unnamed: 0,Post Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
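
None of the first five rows has an unassigned neighborhood, so the effect of this step is not visible in the output above. A minimal sketch on toy data shows it; the rows in `sample` are illustrative only, and the row-wise assignment through `.loc` is one possible way to express the replacement.

```python
import pandas as pd

# Toy data with one unassigned neighborhood (illustrative values only)
sample = pd.DataFrame({
    "Post Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Boolean mask marking the rows to fix
unassigned = sample["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]
print(sample)
```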


The data frame now has 103 rows:

In [76]:
toronto_df.shape

(103, 3)
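
Beyond the row count, it can be worth confirming that every post code occurs only once. The helper below is a hypothetical sketch, not part of the original notebook; the name `check_one_row_per_code` and its `code_col` parameter are assumptions made for illustration.

```python
import pandas as pd

def check_one_row_per_code(df: pd.DataFrame, code_col: str = "Post Code") -> int:
    """Return the number of rows after verifying that each code appears once."""
    if df[code_col].duplicated().any():
        raise ValueError(f"duplicate values found in {code_col!r}")
    return len(df)

# Toy data (illustrative values only)
sample = pd.DataFrame({
    "Post Code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
n_rows = check_one_row_per_code(sample)
print(n_rows)
```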

## 2. Finding the Location Data of the Neighborhoods

The latitude and longitude of each postal code are read from the provided CSV file:

In [77]:
pc_location = pd.read_csv('http://cocl.us/Geospatial_data')
pc_location.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


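Next, the two tables are merged on their postal code columns:

In [82]:
#join on the postal code itself rather than on row position, so the result does not depend on both tables being sorted identically
toronto_loc = toronto_df.merge(pc_location, left_on = 'Post Code', right_on = 'Postal Code', how = 'inner')
toronto_loc.drop(columns = ['Postal Code'], inplace = True) #the post code does not need to be in there twice
toronto_loc.head()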

Unnamed: 0,Post Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
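
Combining two tables by row position gives correct results only if both are sorted identically, whereas a join on the postal code column does not rely on row order. The sketch below illustrates this with toy data in which the two tables are deliberately ordered differently; the frames `neighborhoods` and `coords` are assumptions made for this example, with coordinates copied from the output above.

```python
import pandas as pd

# Two toy tables whose rows are in different order (illustrative subset)
neighborhoods = pd.DataFrame({
    "Post Code": ["M1C", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Join on the key columns; validate raises if a code is duplicated on either side
merged = neighborhoods.merge(
    coords,
    left_on="Post Code",
    right_on="Postal Code",
    how="inner",
    validate="one_to_one",
).drop(columns="Postal Code")
print(merged)
```

Each post code ends up with its own coordinates even though the inputs were ordered differently.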


## 3. Clustering Toronto Neighborhoods