# Segmenting and Clustering Neighborhoods in Toronto - Coursera Project

**This notebook is used for all three parts of the assignment**, so you only need to open it once.

### Part 1. Extracting and processing the data

Import pandas and numpy libraries.

In [6]:
import pandas as pd
import numpy as np

Define the URL, extract the data, and select the first table.

In [56]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
tables = pd.read_html(url)
df = tables[0]
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


Rename the 'Postal code' column to 'PostalCode'.

In [57]:
df.rename(columns = {'Postal code': 'PostalCode'}, inplace = True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


Remove the rows whose borough is "Not assigned" and reset the index.

In [58]:
df = df[df['Borough'] != 'Not assigned'].reset_index(drop = True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


Compare the number of rows with the number of unique postal codes to confirm that each postal code appears only once, so that all of its neighborhoods sit in a single row.

In [59]:
print('The number of Postal codes is', df.shape[0])
print('The number of unique Postal codes is', df['PostalCode'].nunique())

The number of Postal codes is 103
The number of unique Postal codes is 103
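Here the two counts match, so no further work is needed. If they had differed, the duplicate postal codes would have to be collapsed into one row each. A minimal sketch of that step, using hypothetical toy data (the `toy` dataframe and its split of M5A into two rows are assumptions for illustration, not part of the scraped table):

```python
import pandas as pd

# Hypothetical toy data: M5A appears twice, once per neighborhood.
toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Regent Park', 'Harbourfront'],
})

# Group by postal code and borough, then join the neighborhoods of each
# group into a single ' / '-separated string.
combined = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .agg(' / '.join)
       .reset_index()
)

print(combined)
```

After this step, `combined.shape[0]` and `combined['PostalCode'].nunique()` should be equal.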


Check the number of empty neighborhoods

In [60]:
print('The number of empty neighborhoods is', df[df['Neighborhood'] == ''].shape[0])

The number of empty neighborhoods is 0
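Note that comparing against `''` only catches empty strings. Cells that `pd.read_html` left empty usually arrive as NaN, which that comparison does not count. A slightly more defensive check might look like the sketch below; the `toy` dataframe is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one NaN and one whitespace-only neighborhood.
toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Neighborhood': ['Parkwoods', np.nan, '  '],
})

# Replace NaN with an empty string, strip whitespace, and then count
# the cells that end up empty.
is_missing = toy['Neighborhood'].fillna('').str.strip() == ''
n_missing = int(is_missing.sum())

print('The number of empty neighborhoods is', n_missing)
```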


Print the dimensions of the dataframe

In [61]:
print('There are {} rows and {} columns in the dataframe'.format(df.shape[0], df.shape[1]))

There are 103 rows and 3 columns in the dataframe


### Part 2. Finding the coordinates and adding them to the dataframe

The Geocoder library didn't work in my case (it always returned None), so I used the CSV file of coordinates instead.

In [99]:
ll_df = pd.read_csv('Toronto_Coordinates.csv')
ll_df.rename(columns = {'Postal Code': 'PostalCode'}, inplace = True)
ll_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Add two new columns, 'Latitude' and 'Longitude', to the dataframe and set the index to 'PostalCode' for convenience.

In [107]:
header_list = ['PostalCode', 'Borough', 'Neighborhood', 'Latitude','Longitude']
df = df.reindex(columns = header_list)
df.set_index('PostalCode', inplace = True)
df.head()

PostalCode,Borough,Neighborhood,Latitude,Longitude
M3A,North York,Parkwoods,,
M4A,North York,Victoria Village,,
M5A,Downtown Toronto,Regent Park / Harbourfront,,
M6A,North York,Lawrence Manor / Lawrence Heights,,
M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,,


Iterate through the coordinates dataframe and copy each latitude and longitude into the matching row of the main dataframe.

In [108]:
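# For each postal code in the coordinates file, fill in the matching row
# of the main dataframe (which is indexed by PostalCode at this point).
for _, row in ll_df.iterrows():
    df.loc[row['PostalCode'], 'Latitude' : 'Longitude'] = row['Latitude':'Longitude']
# Turn PostalCode back into a regular column.
df.reset_index(inplace = True)
df.head()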

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
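The row-by-row loop works for 103 rows, but the same result can probably be reached with a single join. The sketch below shows one possible alternative using `DataFrame.merge`. It assumes the main dataframe does not yet contain the empty 'Latitude' and 'Longitude' columns, so the merge would replace the reindex step rather than follow it. The names `toy_df` and `toy_ll` are hypothetical stand-ins for `df` and `ll_df`.

```python
import pandas as pd

# Hypothetical stand-in for the main dataframe before coordinates are added.
toy_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

# Hypothetical stand-in for the coordinates file, in a different order
# and with one extra postal code (M1B) that is not in the main dataframe.
toy_ll = pd.DataFrame({
    'PostalCode': ['M4A', 'M3A', 'M1B'],
    'Latitude': [43.725882, 43.753259, 43.806686],
    'Longitude': [-79.315572, -79.329656, -79.194353],
})

# A left join keeps every row of toy_df in its original order and drops
# coordinates for postal codes that toy_df does not contain.
# validate='one_to_one' raises an error if either side has duplicate keys.
merged = toy_df.merge(toy_ll, on='PostalCode', how='left', validate='one_to_one')

print(merged)
```

A left join also leaves NaN in 'Latitude' and 'Longitude' for any postal code missing from the coordinates file, which makes gaps easy to spot with `merged.isna().sum()`.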
