<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>

<b>Applied Data Science Capstone - Coursera</b>

This notebook contains Part 1 of my submission for the Week 3 Assignment: Segmenting and Clustering Neighborhoods in Toronto from the Applied Data Science Capstone course.

In [14]:
#First, let's import all the libraries used in this notebook
import pandas as pd
import numpy as np
import requests
print('Libraries Imported!')

Libraries Imported!


<h2>Part 1 - Build Toronto Neighborhood dataset</h2>

Note that I did not use the Beautiful Soup library for this part of the assignment, because pandas can read HTML tables directly with read_html. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html

In [15]:
#Request the page and check the HTTP status code to confirm it is reachable
url  = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = requests.get(url)
if page.status_code == 200:
    print('Page download successful')
else:
    print('Page download error. Error code: {}'.format(page.status_code))

Page download successful


In [16]:
#The pandas function "read_html" parses the HTML tables found at the URL and returns a list of DataFrames; [0] selects the first one.
#This particular table doesn't have <thead> tags in the HTML markup, so we set header = 0 to use the first row as column names
#Since we will discard the rows marked "Not assigned", we read that value as NaN so we can later use the dropna method.
df_html = pd.read_html(url, header=0, na_values = ['Not assigned'])[0]
df_html.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [17]:
#Drop the rows in which the Borough is empty
df_html.dropna(subset=['Borough'], inplace=True)
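
As a minimal sketch of what the subset argument does, the toy frame below mirrors the table's layout. The M7A row is a hypothetical example of a Borough without a Neighborhood and is not taken from the downloaded data.

```python
import numpy as np
import pandas as pd

# Toy frame with the same columns as df_html (the M7A row is hypothetical)
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M7A'],
    'Borough': [np.nan, 'North York', "Queen's Park"],
    'Neighborhood': [np.nan, 'Parkwoods', np.nan],
})

# subset=['Borough'] drops only the rows whose Borough is missing;
# a missing Neighborhood on its own does not remove a row
kept = toy.dropna(subset=['Borough'])
print(kept['Postal Code'].tolist())
```

Only the M1A row is removed here, which is why a separate check on the Neighborhood column is still needed afterwards.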

In [18]:
#Count the rows in which Neighborhood is empty but Borough exists
n_empty_neighborhood = df_html[df_html['Neighborhood'].isna()].shape[0]
print('Number of rows on which Neighborhood column is empty: {}'.format(n_empty_neighborhood))

Number of rows on which Neighborhood column is empty: 0


In [20]:
#Show the rows in which Neighborhood is empty but Borough exists
df_html[df_html['Neighborhood'].isna()]

Unnamed: 0,Postal Code,Borough,Neighborhood


In [21]:
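#Replace empty Neighborhood with Borough name and check again
#Assign the result back to the column; fillna(..., inplace=True) on a column selection may not update df_html in recent pandas versions
df_html['Neighborhood'] = df_html['Neighborhood'].fillna(df_html['Borough'])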
n_empty_neighborhood = df_html[df_html['Neighborhood'].isna()].shape[0]
print('Number of rows on which Neighborhood column is empty: {}'.format(n_empty_neighborhood))

Number of rows on which Neighborhood column is empty: 0


In [22]:
#Confirm that the Queen's Park Neighborhood is not empty now (an empty result means the downloaded table has no row with this Borough):
df_html[df_html['Borough']=="Queen's Park"]

Unnamed: 0,Postal Code,Borough,Neighborhood
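
Because the downloaded table returns no rows here, the fallback is easier to see on a toy frame. The sketch below assumes a hypothetical M7A row whose Borough is "Queen's Park" and whose Neighborhood is missing.

```python
import numpy as np
import pandas as pd

# Toy frame: the M7A row is a hypothetical case of a Borough without a Neighborhood
toy = pd.DataFrame({
    'Postal Code': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', np.nan],
})

# fillna with a Series fills each missing value from the same row of that Series,
# so a missing Neighborhood takes its own row's Borough
toy['Neighborhood'] = toy['Neighborhood'].fillna(toy['Borough'])
print(toy.loc[toy['Borough'] == "Queen's Park", 'Neighborhood'].tolist())
```

Rows that already have a Neighborhood are left unchanged.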


In [30]:
#Group by Postal Code / Borough and join the neighborhoods of each group into one comma-separated string
df_postcodes = df_html.groupby(['Postal Code', 'Borough'])['Neighborhood'].agg(', '.join).reset_index()
df_postcodes.head(5)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
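
The grouping step matters when a postal code is spread over several rows. As a small sketch, assume a hypothetical long-format input in which M1B appears once per neighborhood:

```python
import pandas as pd

# Hypothetical long-format rows: one row per neighborhood
long_rows = pd.DataFrame({
    'Postal Code': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern', 'Rouge', 'Woburn'],
})

# Join the neighborhoods of each (Postal Code, Borough) group into one string
combined = (
    long_rows.groupby(['Postal Code', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(combined)
```

Each postal code ends up on a single row, matching the layout of the output above. If the source table already lists one row per postal code, the grouping leaves the rows unchanged apart from sorting them by the group keys.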


In [31]:
#Let's check the Scarborough rows, so we can compare them with the example dataframe shown in the assignment
df_postcodes[df_postcodes['Borough']=='Scarborough']

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [32]:
#Shape of the dataset
print('The shape of the dataset is:',df_postcodes.shape)

The shape of the dataset is: (103, 3)


In [33]:
#Export the dataset to a .csv file, since it will be used in Part 2
df_postcodes.to_csv('Toronto_Postcodes.csv')
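
One detail to keep in mind when reloading this file: to_csv writes the DataFrame index as a first, unnamed column by default. The sketch below uses an in-memory buffer instead of a file and two rows copied from the output above to show the effect. Whether Part 2 should keep or drop that column is a choice left open here.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'Postal Code': ['M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern, Rouge', 'Woburn'],
})

# Default behaviour: the index is written as a column with an empty header,
# which read_csv then names "Unnamed: 0"
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(with_index.columns.tolist())

# index=False leaves the index out, so only the three data columns come back
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(without_index.columns.tolist())
```

Reading the default export with index_col=0 is another way to avoid the extra column.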