# Segmenting and Clustering Neighborhoods in Toronto

The goal of this analysis is to explore and cluster neighborhoods in Toronto based on information provided by Foursquare.com.

This analysis forms part of the module ["Applied Data Science Capstone"](https://www.coursera.org/learn/applied-data-science-capstone/home/welcome).

## Load necessary packages

In [37]:
import pandas as pd # library for data analysis

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values


## Load data

The dataset comes from a Wikipedia page that lists Toronto's postal codes together with their associated boroughs and neighborhoods.

In [2]:
!wget -q -O 'toronto_data.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Data downloaded!')

Data downloaded!


In [28]:
# pd.read_html returns a list with one DataFrame per HTML table found in the file
tables = pd.read_html('toronto_data.html')
# The first table on the page holds the postal codes
df = tables[0]
df.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


In [36]:
## Clean up

# Remove rows where the borough is "Not assigned"; copy so later assignments do not act on a view
df = df[df['Borough'] != 'Not assigned'].copy()

# If the neighborhood is "Not assigned" but a borough exists, use the borough's name
df['Neighborhood'] = df['Neighborhood'].mask(df['Neighborhood'] == 'Not assigned', df['Borough'])

# Combine neighborhoods that share a postal code, separated by commas.
# Note: the result is only computed here, not assigned back to df.
df.groupby('Postal code')['Neighborhood'].apply(','.join)

df.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |


In [35]:
# Show dimensions of the data frame
df.shape

(103, 3)

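The clean-up steps above can be tried on a small toy frame first. The following is a minimal sketch, not the notebook's own code: the rows in `toy` are made up for illustration, and `toy_clean` is a hypothetical name. Unlike the cell above, it assigns the grouped result to a new frame, so that each postal code ends up on exactly one row.

```python
import pandas as pd

# Toy data with hypothetical rows; not taken from the Wikipedia table
toy = pd.DataFrame({
    'Postal code': ['M1A', 'M5A', 'M5A', 'M9A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': [None, 'Regent Park', 'Harbourfront', 'Not assigned'],
})

# 1. Drop rows without an assigned borough (copy to avoid working on a view)
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. Fall back to the borough name where the neighborhood is missing or "Not assigned"
missing = toy['Neighborhood'].isna() | (toy['Neighborhood'] == 'Not assigned')
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

# 3. Collapse to one row per postal code, joining neighborhood names with ", "
toy_clean = (
    toy.groupby(['Postal code', 'Borough'], as_index=False, sort=False)['Neighborhood']
       .agg(', '.join)
)

print(toy_clean)
```

Grouping on both `Postal code` and `Borough` keeps the borough column in the result; this assumes each postal code belongs to a single borough.
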
## Get geospatial data of postal codes

In [91]:
# Load geodata from csv file
geoDat = pd.read_csv('https://cocl.us/Geospatial_data')
geoDat.rename(columns = {'Postal Code': 'Postal code'}, inplace = True)
geoDat.head()

|   | Postal code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [98]:
# Combine with Toronto data frame
df_withGeo = df.merge(geoDat, on = 'Postal code')
df_withGeo.head()

|   | Postal code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government | 43.662301 | -79.389494 |

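`DataFrame.merge` performs an inner join by default, so a postal code without coordinates would be dropped silently. One way to check for that is sketched below, assuming miniature stand-ins for `df` and `geoDat`: the frames `neigh` and `geo` and the postal code `M0Z` are hypothetical and only serve to show the mechanics.

```python
import pandas as pd

# Hypothetical miniature versions of the neighborhood and geodata frames
neigh = pd.DataFrame({
    'Postal code': ['M3A', 'M4A', 'M0Z'],  # 'M0Z' is a made-up code without coordinates
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Nowhere'],
})
geo = pd.DataFrame({
    'Postal code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every neighborhood row, indicator=True adds a '_merge' column
# showing where each row matched, and validate='one_to_one' raises an error
# if a postal code is duplicated on either side
merged = neigh.merge(geo, on='Postal code', how='left',
                     indicator=True, validate='one_to_one')

# Postal codes that found no coordinates in the geodata
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal code'].tolist()
print(unmatched)
```

If `unmatched` comes back empty on the real data, the inner join used above loses no rows and the `_merge` column can be dropped.
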