## Segmenting and Clustering Neighborhoods in Toronto - 1
The objective of this assignment is to explore and cluster the neighborhoods in Toronto.
This first part retrieves the data, then cleans and transforms it so that it is ready for exploration.

In [1]:
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None

Scrape the web for the table of postal codes in Toronto and transform the data into a pandas dataframe.
Use pandas read_html to retrieve the table from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
tables = pd.read_html(url, header=0)
print('Number of tables scraped =', len(tables))

Number of tables scraped = 3
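
Because read_html returns every table on the page, picking one by position can break if the page layout changes. A minimal sketch of selecting the table by its column names instead is shown here; the helper `pick_table`, the toy `tables` list, and the column names in it are assumptions for illustration and are not taken from the Wikipedia page.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(set(table.columns)):
            return table
    raise ValueError(f"No table has columns {sorted(required)}")

# Toy stand-in for the list returned by pd.read_html (not the real page content).
tables = [
    pd.DataFrame({"Navigation": ["A", "B"]}),
    pd.DataFrame({
        "Postcode": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

# The postal code table is found even though it is not first in the list.
picked = pick_table(tables, ["Borough"])
print(picked)
```

If no table matches, the helper raises a ValueError rather than silently returning the wrong table.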


Select the required table and rename its columns.

In [3]:
df = tables[0]  # 1st table
df.columns = ['PostalCode','Borough','Neighborhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Remove rows whose Borough is "Not assigned".

In [4]:
# Remove rows with "Not assigned" values in the Borough field
df = df[df.Borough != "Not assigned"]
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
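
The filter above is a boolean mask, and the surviving rows keep their original index labels (which is why the output starts at 2). A small sketch on toy data follows; the frame `raw` is made up for illustration, and the `.copy()` call is a suggested alternative to silencing the chained-assignment warning, not something the notebook itself does.

```python
import pandas as pd

# Toy rows shaped like the scraped table (illustrative values only).
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Harbourfront"],
})

# The comparison yields a boolean Series; indexing with it keeps only True rows.
# .copy() makes the result an independent frame, so later column assignments
# do not trigger SettingWithCopyWarning.
assigned = raw[raw["Borough"] != "Not assigned"].copy()
assigned["Neighborhood"] = assigned["Neighborhood"].str.strip()

print(assigned)
```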


Group neighborhoods that share the same postal code and borough, and aggregate their names into a single comma-separated string.

In [5]:
# Combine rows with the same PostalCode and Borough by concatenating the Neighborhood names
df = df.groupby(['PostalCode','Borough'])['Neighborhood'].apply(','.join).reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
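
To make the grouping step concrete, here is a minimal sketch on three toy rows, two of which share a postal code. The frame `rows` is illustrative, and `agg` is used in place of `apply`; for a plain string join the two should give the same result.

```python
import pandas as pd

# Two rows share M5A, so they should collapse into one row after grouping.
rows = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# ",".join receives each group's Neighborhood values and joins them into one string.
grouped = (
    rows.groupby(["PostalCode", "Borough"])["Neighborhood"]
        .agg(",".join)
        .reset_index()
)

print(grouped)
```

Note that groupby sorts by the grouping keys by default, which is why the grouped output is ordered by postal code.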


Where a Neighborhood is "Not assigned", give it the same name as its Borough.

In [6]:
# Use the Borough value where the Neighborhood column is 'Not assigned'
df['Neighborhood'] = np.where(df['Neighborhood']=='Not assigned', df['Borough'], df['Neighborhood'])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
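
The first five rows happen to contain no unassigned neighborhoods, so the effect of this step is not visible in the output. A small sketch on toy data shows the replacement, alongside an equivalent pandas-only form using `Series.mask`; the frame `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# One row has no neighborhood name and should fall back to its borough.
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Woburn"],
})

missing = toy["Neighborhood"] == "Not assigned"

# np.where picks from the second argument where the condition is True,
# otherwise from the third.
via_np = np.where(missing, toy["Borough"], toy["Neighborhood"])

# Series.mask replaces values where the condition is True and keeps the index.
via_mask = toy["Neighborhood"].mask(missing, toy["Borough"])

print(list(via_np))
print(via_mask.tolist())
```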


Display the number of rows and columns of the final dataframe

In [7]:
df.shape

(103, 3)
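
Beyond checking the shape, it can help to verify the properties that the cleaning steps were meant to guarantee. The sketch below does this on a toy frame; `check_clean` is a hypothetical helper and the invariants it tests are inferred from the steps above.

```python
import pandas as pd

def check_clean(frame):
    """Raise AssertionError if the cleaned frame breaks an expected invariant."""
    assert frame["PostalCode"].is_unique, "duplicate postal codes"
    assert not (frame["Borough"] == "Not assigned").any(), "unassigned borough"
    assert not (frame["Neighborhood"] == "Not assigned").any(), "unassigned neighborhood"
    return frame.shape

# Toy frame shaped like the cleaned data (illustrative values only).
clean = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
})

shape = check_clean(clean)
print(shape)
```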

### Get geospatial information for Toronto neighborhoods
Now that we have a dataframe with the postal code, borough name and neighborhood name of each neighborhood, we need the latitude and longitude coordinates of each postal code in order to use Foursquare location data.

In [8]:
geospatial_df = pd.read_csv('https://cocl.us/Geospatial_data')
geospatial_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the new coordinates dataframe and the Toronto neighborhoods dataframe (drop the extra postal codes column).

In [9]:
# merge the two data frames
df_toronto = pd.merge(df, geospatial_df,
                      left_on='PostalCode',
                      right_on='Postal Code',
                      how='left').drop('Postal Code', axis=1)
df_toronto.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848
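
A left merge keeps every neighborhood row even when no coordinates match, leaving NaN in Latitude and Longitude. The sketch below shows one way to surface such rows using the `validate` and `indicator` parameters of `DataFrame.merge`; the frames and the postal code "M9Z" are toy values chosen to force a mismatch.

```python
import pandas as pd

# Toy neighborhoods; "M9Z" is a made-up code with no coordinates.
hoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises MergeError if either key column contains duplicates;
# indicator adds a _merge column recording where each row came from.
merged = hoods.merge(
    coords,
    left_on="PostalCode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
    indicator=True,
)

unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postal code received coordinates.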


In [10]:
df_toronto.shape

(103, 5)

In [11]:
# save data frame to csv file 
df_toronto.to_csv('toronto.csv')
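
By default to_csv also writes the row index, which comes back as an extra "Unnamed: 0" column when the file is read again. The following sketch compares both behaviours using in-memory buffers instead of files; the frame and buffer names are illustrative.

```python
import io
import pandas as pd

frame = pd.DataFrame({"PostalCode": ["M1B"], "Latitude": [43.806686]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```

Whether to pass `index=False` depends on how the file is read later; reading with `index_col=0` is another way to recover the original frame.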