# Segmenting and Clustering Neighborhoods in Toronto
Applied Data Science Capstone - Coursera

This notebook contains Part 1 of my submission for the Week 3 assignment, Segmenting and Clustering Neighborhoods in Toronto, from the Applied Data Science Capstone course.

# Import Libraries

In [184]:
# First, let's import all the libraries used in this notebook
import pandas as pd
import numpy as np
import requests
print('Libraries Imported!')

Libraries Imported!


# Download page and store locally
Note that I did not use the Beautiful Soup library for this part of the assignment, since pandas can read HTML tables directly. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html for details.

In [185]:
url  = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = requests.get(url)
if page.status_code == 200:
    print('Page download successful')
else:
    print('Page download error. Error code: {}'.format(page.status_code))

Page download successful


In [186]:
# The pandas function "read_html" parses every table on the page and returns a list of DataFrames; we keep the first one.
# This particular table doesn't have <thead> tags in the HTML markup, so we set header=0 to use the first row as column names.
# Since we will discard the "Not assigned" rows, we read that value as NaN so we can later use the dropna method.
df_html = pd.read_html(url, header=0, na_values=['Not assigned'])[0]
df_html.head()


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
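
A minimal sketch of what `na_values` does, using `pd.read_csv` on a small inline sample instead of `read_html` (the `na_values` parameter exists on both readers). The three sample rows and the variable names are assumptions made for illustration. The point is that once the placeholder has been read as NaN, comparing against the string `'Not assigned'` no longer finds anything, and `isna()` or `dropna()` is needed instead.

```python
import io
import pandas as pd

# Hypothetical three-row sample in CSV form, standing in for the scraped table
raw = io.StringIO(
    "Postal Code,Borough,Neighborhood\n"
    "M1A,Not assigned,Not assigned\n"
    "M3A,North York,Parkwoods\n"
    "M4A,North York,Victoria Village\n"
)

# 'Not assigned' is added to the list of strings that are parsed as NaN
sample = pd.read_csv(raw, na_values=["Not assigned"])

# Comparing against the original string no longer matches any row ...
matches_string = int((sample["Borough"] == "Not assigned").sum())
# ... while isna() finds the placeholder row
missing_rows = int(sample["Borough"].isna().sum())

print(matches_string, missing_rows)
```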


In [187]:
# Clean the dataframe.
# 'Not assigned' was read as NaN above, so drop the rows that have no borough
df_html = df_html.dropna(subset=['Borough']).reset_index(drop=True)
# If a neighborhood is missing, use the borough name instead
df_html['Neighborhood'] = df_html['Neighborhood'].fillna(df_html['Borough'])

In [188]:
df_html

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
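
The two cleaning rules for this step (discard rows with no borough, and fall back to the borough name when a neighborhood is missing) can be tried on a toy frame. This is only a sketch: the rows below are hypothetical, chosen so that both rules have something to act on.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one with no borough, one with a borough but no neighborhood
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M7A"],
    "Borough": [np.nan, "North York", "Queen's Park"],
    "Neighborhood": [np.nan, "Parkwoods", np.nan],
})

# Rule 1: discard rows whose borough is missing
cleaned = toy.dropna(subset=["Borough"]).reset_index(drop=True)

# Rule 2: a missing neighborhood takes the borough's name
cleaned["Neighborhood"] = cleaned["Neighborhood"].fillna(cleaned["Borough"])

print(cleaned)
```

Assigning to the whole column avoids the chained-indexing pattern `df.iloc[i][2] = ...`, which can end up writing to a temporary copy rather than to the dataframe.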


# Group neighborhoods that share a postal code

In [189]:
df = df_html.groupby(['Postal Code','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
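
To see the grouping step in isolation, here is a small sketch on hypothetical long-format rows, with one row per neighborhood. It also includes a row with a missing borough to show that `groupby` leaves out groups whose key is NaN by default, so such rows never reach the `join`.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format rows: one row per neighborhood, plus one unassigned code
long_rows = pd.DataFrame({
    "Postal Code": ["M1A", "M1B", "M1B", "M1G"],
    "Borough": [np.nan, "Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": [np.nan, "Malvern", "Rouge", "Woburn"],
})

# Join the neighborhood names within each (Postal Code, Borough) group.
# The M1A row has a NaN key, so groupby excludes it by default.
combined = (
    long_rows.groupby(["Postal Code", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```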


In [190]:
df.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,103,103,103
unique,103,10,99
top,M8Y,North York,Downsview
freq,1,24,4


In [191]:
#Group by Postal code / Borough
df = df.groupby(['Postal Code','Borough']).Neighborhood.agg([('Neighborhood', ', '.join)])
df.reset_index(inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [192]:

# Join the Neighborhood entries of each group, sorted alphabetically
def neighborhood_list(grouped):
    return ', '.join(sorted(grouped['Neighborhood'].tolist()))

grp = df.groupby(['Postal Code', 'Borough'])
df_2 = grp.apply(neighborhood_list).reset_index(name='Neighborhood')

In [193]:
df_2.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,103,103,103
unique,103,10,99
top,M8Y,North York,Downsview
freq,1,24,4


# Check whether the result matches what the assignment requires

In [194]:
# Build a test dataframe with the postal codes listed in the assignment, in the same order
test_list = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

# DataFrame.append was removed in pandas 2.0, so collect the matching rows and concatenate them
test_df = pd.concat([df_2[df_2["Postal Code"] == postcode] for postcode in test_list],
                    ignore_index=True)

test_df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Parkview Hill, Woodbine Gardens"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Wexford, Maryvale"
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."
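
The order of the rows matters for this comparison. As a sketch of one alternative to collecting the rows one code at a time, label-based indexing with `.loc` returns rows in the order of the requested list, whereas `isin` keeps the table's own order. The three rows are an excerpt of the output above; the names `table` and `wanted` are made up for the example.

```python
import pandas as pd

# Three-row excerpt of the grouped table, and the codes in the order we want them back
table = pd.DataFrame({
    "Postal Code": ["M1B", "M4G", "M5G"],
    "Borough": ["Scarborough", "East York", "Downtown Toronto"],
    "Neighborhood": ["Malvern, Rouge", "Leaside", "Central Bay Street"],
})
wanted = ["M5G", "M1B"]

# isin() filters the rows but keeps the table's own order
by_isin = table[table["Postal Code"].isin(wanted)]

# Indexing by label with .loc returns the rows in the order of `wanted`
by_loc = table.set_index("Postal Code").loc[wanted].reset_index()

print(by_isin["Postal Code"].tolist())  # table order
print(by_loc["Postal Code"].tolist())   # requested order
```

Note that `.loc` raises a `KeyError` if a requested code is absent, which can double as a check that every expected postal code survived the cleaning.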


# Finally, print the number of rows of the cleaned dataframe

In [195]:
print(df_2.shape)

(103, 3)


In [196]:
# Export the dataset to a .csv file, since it will be used in Part 2
# (df_2 is the cleaned dataframe built above; df_postcodes is not defined in this notebook)
df_2.to_csv('Toronto.csv')
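
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference on a two-row frame, writing to an in-memory buffer rather than a file; whether Part 2 expects the index column is not stated here, so treat `index=False` as an option rather than a required change.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```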