# Segmenting and Clustering Neighborhoods in Toronto

This analysis forms part of the module ["Applied Data Science Capstone"](https://www.coursera.org/learn/applied-data-science-capstone/home/welcome).

The goal is to explore and cluster neighborhoods in Toronto based on information provided by Foursquare.com.



## Load necessary packages

In [182]:
import pandas as pd # library for data analysis

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import folium # plotting library

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

from sklearn.preprocessing import StandardScaler # to make z-scores

from sklearn.cluster import KMeans # for K-Means Clustering

%matplotlib inline

## Load data

The dataset comes from a Wikipedia page that lists Toronto's postal codes together with their associated boroughs and neighborhoods.

In [10]:
!wget -q -O 'toronto_data.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Data downloaded!')

Data downloaded!


In [11]:
# Create new pandas dataframe for table included on website
table = pd.read_html('toronto_data.html')
df = pd.DataFrame(table[0])
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


In [12]:
## Clean up

# Remove rows where the borough is "Not assigned"; copy to avoid chained-assignment warnings
df = df[df['Borough'] != 'Not assigned'].copy()

# If the neighborhood is "Not assigned" but a borough exists, use the borough's name
not_assigned = df['Neighborhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighborhood'] = df.loc[not_assigned, 'Borough']

# Neighborhoods sharing a postal code already appear combined in the source table
# (e.g. "Regent Park / Harbourfront"), so this grouped result is only computed, not assigned back
df.groupby('Postal code')['Neighborhood'].apply(','.join)

df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


In [13]:
# Show dimensions of the data frame
df.shape

(103, 3)
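
The three cleaning steps can be tried in isolation on a few toy rows. This is a minimal sketch, not the notebook's own cell: the rows in `raw` are illustrative, and unlike the cell above the grouped result is assigned back so the effect of the join is visible.

```python
import pandas as pd

# Toy rows mimicking the structure of the Wikipedia table (values are illustrative)
raw = pd.DataFrame({
    'Postal code': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Regent Park',
                     'Harbourfront', 'Not assigned'],
})

# 1. Drop rows without a borough; .copy() avoids chained-assignment warnings
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Where the neighborhood is missing, fall back to the borough name
missing = clean['Neighborhood'].isna() | (clean['Neighborhood'] == 'Not assigned')
clean.loc[missing, 'Neighborhood'] = clean.loc[missing, 'Borough']

# 3. One row per postal code, neighborhoods joined by a comma
clean = (clean.groupby(['Postal code', 'Borough'], as_index=False)['Neighborhood']
              .agg(', '.join))
print(clean)
```

Grouping on both the postal code and the borough keeps the borough column in the result, which a plain `groupby('Postal code')` on the neighborhood column alone would drop.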

## Get geospatial data of postal codes

In [14]:
# Load geodata from csv file
geoDat = pd.read_csv('https://cocl.us/Geospatial_data')
geoDat.rename(columns = {'Postal Code': 'Postal code'}, inplace = True)
geoDat.head()

Unnamed: 0,Postal code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [15]:
# Combine with Toronto data frame
df_withGeo = df.merge(geoDat, on = 'Postal code')
df_withGeo.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494


In [18]:
df_withGeo.shape

(103, 5)
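
Comparing `df.shape` with `df_withGeo.shape` shows that no rows were lost in the merge. A more explicit check, sketched here on toy data, is to use the `validate` and `indicator` options of `merge`. The postal code `M9Z` and its borough are hypothetical and exist only to produce an unmatched row.

```python
import pandas as pd

left = pd.DataFrame({'Postal code': ['M3A', 'M4A', 'M9Z'],  # 'M9Z' is made up
                     'Borough': ['North York', 'North York', 'Hypothetical']})
right = pd.DataFrame({'Postal code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# validate raises if a postal code appears more than once on either side;
# indicator adds a '_merge' column telling where each row came from
checked = left.merge(right, on='Postal code', how='left',
                     validate='one_to_one', indicator=True)

# Postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal code'].tolist()
print(unmatched)
```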

## Find clusters of neighborhoods in Toronto

### Remove boroughs without "Toronto" in the name

In [48]:
# Filter and reset the index in one step so df_reduced is an independent dataframe
df_reduced = df_withGeo[df_withGeo['Borough'].str.contains('Toronto')].reset_index(drop = True)
df_reduced.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
1,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


### Plot of all neighborhoods in Toronto

In [2]:
# get center coordinates of Toronto
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent="toro_agent") # could be anything
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

43.6534817 -79.3839347


In [22]:
# create map
neigh_map = folium.Map(location = [latitude, longitude], zoom_start = 12)

for lat, lng, label in zip(df_reduced.Latitude, df_reduced.Longitude, df_reduced.Neighborhood):
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        color = 'blue',
        popup = label,
        fill = True,
        fill_color = 'blue',
        fill_opacity = 0.5
    
    ).add_to(neigh_map)


neigh_map

### Define Foursquare credentials and version

In [67]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20200120' # Foursquare API version
LIMIT = 100 # maximum number of venues per request

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


### Function to retrieve information on each neighborhood from Foursquare

In [68]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
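
The function above indexes directly into the `response`, `groups`, `items` and `categories` fields, so a response without those fields, or a venue with an empty category list, would presumably raise a `KeyError` or `IndexError`. The sketch below shows one defensive way to extract the same seven fields. The helper name `extract_venue_rows` and the mock payload are hypothetical; the payload only mirrors the fields the function accesses and is not real API output.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Return one tuple per venue, skipping venues without a category."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories', [])
        if not categories:
            continue  # nothing to classify, so leave this venue out
        location = venue.get('location', {})
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'),
                     categories[0]['name']))
    return rows

# Hypothetical payload shaped like the fields used in getNearbyVenues
mock_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorised place',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}

rows = extract_venue_rows(mock_payload, 'Regent Park / Harbourfront', 43.65426, -79.360636)
empty = extract_venue_rows({'meta': {'code': 400}}, 'Nowhere', 0.0, 0.0)
print(rows, empty)
```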

### Retrieve information using the getNearbyVenues function

In [69]:
toronto_venues = getNearbyVenues(names=df_reduced['Neighborhood'],
                                   latitudes=df_reduced['Latitude'],
                                   longitudes=df_reduced['Longitude']
                                  )

Regent Park / Harbourfront
Queen's Park / Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond / Adelaide / King
Dufferin / Dovercourt Village
Harbourfront East / Union Station / Toronto Islands
Little Portugal / Trinity
The Danforth West / Riverdale
Toronto Dominion Centre / Design Exchange
Brockton / Parkdale Village / Exhibition Place
India Bazaar / The Beaches West
Commerce Court / Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park / The Junction South
North Toronto West
The Annex / North Midtown / Yorkville
Parkdale / Roncesvalles
Davisville
University of Toronto / Harbord
Runnymede / Swansea
Moore Park / Summerhill East
Kensington Market / Chinatown / Grange Park
Summerhill West / Rathnelly / South Hill / Forest Hill SE / Deer Park
CN Tower / King and Spadina / Railway Lands / Harbourfront West / Bathurst Quay / South Niagara / Island airport
Roseda

### Inspect new dataframe 

In [70]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Regent Park / Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Regent Park / Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Regent Park / Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,Regent Park / Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Regent Park / Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot


### Reduce venues to restaurants

In [156]:
#toronto_venues['Venue Category'].unique()

food = ['Bakery', 'Coffee Shop', 'Breakfast Spot', 'Restaurant', 'French Restaurant', 'Café', 'Mexican Restaurant','Ice Cream Shop',
       'Asian Restaurant', 'Italian Restaurant', 'Sushi Restaurant', 'Creperie', 'Burrito Place', 'Diner', 'Fried Chicken Joint', 
       'Burger Joint', 'Sandwich Place', 'Ramen Restaurant', 'Thai Restaurant', 'Steakhouse', 'American Restaurant', 'Japanese Restaurant',
       'Gastropub', 'Fast Food Restaurant', 'Middle Eastern Restaurant', 'Modern European Restaurant', 'Seafood Restaurant', 'Chinese Restaurant',
       'Pizza Place', 'Ethiopian Restaurant', 'Vietnamese Restaurant', 'Greek Restaurant', 'BBQ Joint', 'Food Truck', 'New American Restaurant', 'Vegetarian / Vegan Restaurant',
       'German Restaurant', 'Comfort Food Restaurant', 'Moroccan Restaurant', 'Belgian Restaurant', 'Eastern European Restaurant', 'Indian Restaurant', 'Falafel Restaurant',
       'Salad Place', 'Donut Shop', 'Korean Restaurant', 'Colombian Restaurant', 'Brazilian Restaurant', 'Gluten-free Restaurant', 'Mediterranean Restaurant',
       'Latin American Restaurant', 'Soup Place', 'Cuban Restaurant', 'Caribbean Restaurant', 'Frozen Yogurt Shop', 'Taco Place', 'Fish & Chips Shop', 
       'Food & Drink Shop', 'Cajun / Creole Restaurant', 'Noodle House', 'Food', 'Doner Restaurant', 'Filipino Restaurant', 'Dumpling Restaurant',
       'Molecular Gastronomy Restaurant', 'Taiwanese Restaurant', 'Theme Restaurant']

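# .copy() makes toronto_rests independent of toronto_venues, so later assignments are unambiguous
toronto_rests = toronto_venues.loc[(toronto_venues['Venue Category'].isin(food)), :].copy()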
toronto_rests.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Regent Park / Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Regent Park / Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
4,Regent Park / Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
5,Regent Park / Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
7,Regent Park / Harbourfront,43.65426,-79.360636,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot


### Merge similar types of venues to make classification more intuitive


In [157]:
asian = ['Asian Restaurant', 'Sushi Restaurant', 'Ramen Restaurant', 'Thai Restaurant', 'Japanese Restaurant', 'Chinese Restaurant', 'Vietnamese Restaurant',
        'Indian Restaurant', 'Korean Restaurant', 'Filipino Restaurant', 'Dumpling Restaurant', 'Taiwanese Restaurant']

latinam = ['Mexican Restaurant', 'Burrito Place', 'Colombian Restaurant', 'Brazilian Restaurant', 'Latin American Restaurant', 'Cuban Restaurant',
          'Caribbean Restaurant', 'Taco Place', 'Cajun / Creole Restaurant']

european = ['French Restaurant', 'Italian Restaurant', 'Pizza Place', 'Greek Restaurant', 'German Restaurant', 'Belgian Restaurant', 'Eastern European Restaurant',
           'Falafel Restaurant', 'Mediterranean Restaurant', 'Fish & Chips Shop', 'Doner Restaurant', 'Modern European Restaurant']

fancy = ['Vegetarian / Vegan Restaurant', 'Molecular Gastronomy Restaurant', 'Theme Restaurant', 'Gluten-free Restaurant']

sweet = ['Bakery', 'Coffee Shop', 'Café', 'Ice Cream Shop', 'Creperie', 'Donut Shop', 'Frozen Yogurt Shop']

fast = ['Fried Chicken Joint', 'Burger Joint', 'Sandwich Place', 'Fast Food Restaurant', 'Food Truck', 'Comfort Food Restaurant']

american = ['Diner', 'American Restaurant', 'BBQ Joint', 'New American Restaurant']
 
general = ['Restaurant', 'Salad Place', 'Soup Place', 'Food Court', 'Food & Drink Shop', 'Noodle House',
          'Food', 'Airport Food Court', 'Gastropub', 'Steakhouse', 'Breakfast Spot', 'Seafood Restaurant']   

african = ['Ethiopian Restaurant', 'Moroccan Restaurant']

# replace
toronto_rests.loc[(toronto_rests['Venue Category'].isin(asian)), 'Venue Category'] = 'Asian Restaurant'
toronto_rests.loc[(toronto_rests['Venue Category'].isin(latinam)), 'Venue Category'] = 'Latin American Restaurant'
toronto_rests.loc[(toronto_rests['Venue Category'].isin(fancy)), 'Venue Category'] = 'Fancy Restaurant'
toronto_rests.loc[(toronto_rests['Venue Category'].isin(sweet)), 'Venue Category'] = 'Sweet Food'
toronto_rests.loc[(toronto_rests['Venue Category'].isin(fast)), 'Venue Category'] = 'Fast Food'
toronto_rests.loc[(toronto_rests['Venue Category'].isin(american)), 'Venue Category'] = 'American Restaurant'
toronto_rests.loc[(toronto_rests['Venue Category'].isin(european)), 'Venue Category'] = 'European Restaurant'
toronto_rests.loc[(toronto_rests['Venue Category'].isin(general)), 'Venue Category'] = 'Others'
toronto_rests.loc[(toronto_rests['Venue Category'].isin(african)), 'Venue Category'] = 'African Restaurant'

toronto_rests.head()



Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Regent Park / Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Sweet Food
1,Regent Park / Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Sweet Food
4,Regent Park / Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Others
5,Regent Park / Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Others
7,Regent Park / Harbourfront,43.65426,-79.360636,Figs Breakfast & Lunch,43.655675,-79.364503,Others
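
The same regrouping can also be written as a single lookup instead of one `.loc` assignment per group. The sketch below uses a shortened, illustrative subset of the lists above; `groups` and `lookup` are names introduced here, not part of the notebook.

```python
import pandas as pd

# Group label -> original Foursquare categories (shortened subset for illustration)
groups = {
    'Sweet Food': ['Bakery', 'Coffee Shop', 'Café'],
    'Asian Restaurant': ['Sushi Restaurant', 'Thai Restaurant'],
    'Others': ['Restaurant', 'Breakfast Spot'],
}

# Invert to: original category -> group label
lookup = {category: label for label, categories in groups.items() for category in categories}

venues = pd.DataFrame({'Venue Category': ['Bakery', 'Coffee Shop', 'Breakfast Spot',
                                          'Thai Restaurant', 'Middle Eastern Restaurant']})

# Map in one pass; categories without a group keep their original name
venues['Venue Category'] = venues['Venue Category'].map(lookup).fillna(venues['Venue Category'])
print(venues)
```

Because every value is looked up exactly once, a label that also appears as an original category name (such as 'Asian Restaurant') cannot be remapped a second time by a later assignment.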


### Summarise venue categories per neighborhood

In [164]:
# Turn venue categories into dummy variables
rests_oneHot = pd.get_dummies(toronto_rests[['Venue Category']], prefix="", prefix_sep="")

# Reinsert neighborhood column and place at beginning of dataframe
rests_oneHot['Neighborhood'] = toronto_rests['Neighborhood']
fixed_columns = [rests_oneHot.columns[-1]] + list(rests_oneHot.columns[:-1])
rests_oneHot = rests_oneHot[fixed_columns]


In [165]:
rests_oneHot.head()

Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Asian Restaurant,European Restaurant,Fancy Restaurant,Fast Food,Latin American Restaurant,Middle Eastern Restaurant,Others,Sweet Food
0,Regent Park / Harbourfront,0,0,0,0,0,0,0,0,0,1
1,Regent Park / Harbourfront,0,0,0,0,0,0,0,0,0,1
4,Regent Park / Harbourfront,0,0,0,0,0,0,0,0,1,0
5,Regent Park / Harbourfront,0,0,0,0,0,0,0,0,1,0
7,Regent Park / Harbourfront,0,0,0,0,0,0,0,0,1,0


In [213]:
# Share of each restaurant type per neighborhood (mean of the dummy variables)
rests_sum = rests_oneHot.groupby('Neighborhood').mean().reset_index()
rests_sum.head()

Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Asian Restaurant,European Restaurant,Fancy Restaurant,Fast Food,Latin American Restaurant,Middle Eastern Restaurant,Others,Sweet Food
0,Berczy Park,0.0,0.074074,0.074074,0.185185,0.037037,0.037037,0.037037,0.0,0.222222,0.333333
1,Brockton / Parkdale Village / Exhibition Place,0.0,0.0,0.0,0.090909,0.0,0.0,0.090909,0.0,0.272727,0.545455
2,Business reply mail Processing CentrE,0.0,0.0,0.0,0.25,0.0,0.25,0.25,0.0,0.25,0.0
3,CN Tower / King and Spadina / Railway Lands / ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,Central Bay Street,0.0,0.020408,0.142857,0.163265,0.020408,0.142857,0.0,0.020408,0.102041,0.387755
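
Taking the mean of 0/1 dummy columns per neighborhood gives the share of each category, so every row sums to 1. The toy example below, with made-up neighborhoods `A` and `B`, illustrates this and shows that `pd.crosstab` with `normalize='index'` is one possible shortcut to the same table.

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Sweet Food', 'Sweet Food', 'Others', 'Asian Restaurant', 'Others'],
})

# Dummy variables, then the mean per neighborhood
one_hot = pd.get_dummies(toy['Venue Category'], dtype=float)
one_hot.insert(0, 'Neighborhood', toy['Neighborhood'])
shares = one_hot.groupby('Neighborhood').mean()

# Alternative: row-normalised contingency table
alt = pd.crosstab(toy['Neighborhood'], toy['Venue Category'], normalize='index')

print(shares)
print(shares.sum(axis=1))
```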


### Cluster analysis

In [214]:
# Remove Neighborhood column
X = rests_sum.drop('Neighborhood', axis = 1)

# Only keep values
X = X.values

# Standard scale data
X = StandardScaler().fit_transform(X)

# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(X)
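
The number of clusters is fixed at 4 above. One common way to compare candidate values of k is the silhouette score; the sketch below demonstrates it on synthetic data with three well-separated blobs, not on the Toronto data. `X_demo`, `scores` and `best_k` are illustrative names, and `n_init=10` is set explicitly as an assumption about a reasonable number of restarts.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])
X_demo = StandardScaler().fit_transform(X_demo)

# Fit k-means for several k and record the mean silhouette score of each fit
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Higher silhouette means tighter, better separated clusters
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data the scores are usually much closer together than on blobs like these, so the result is better treated as one input to the choice of k than as a definitive answer.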

### Join cluster id and previous data frame

In [215]:
rests_sum.insert(1, 'Cluster', kmeans.labels_)
rests_sum.head()


Unnamed: 0,Neighborhood,Cluster,African Restaurant,American Restaurant,Asian Restaurant,European Restaurant,Fancy Restaurant,Fast Food,Latin American Restaurant,Middle Eastern Restaurant,Others,Sweet Food
0,Berczy Park,0,0.0,0.074074,0.074074,0.185185,0.037037,0.037037,0.037037,0.0,0.222222,0.333333
1,Brockton / Parkdale Village / Exhibition Place,0,0.0,0.0,0.0,0.090909,0.0,0.0,0.090909,0.0,0.272727,0.545455
2,Business reply mail Processing CentrE,1,0.0,0.0,0.0,0.25,0.0,0.25,0.25,0.0,0.25,0.0
3,CN Tower / King and Spadina / Railway Lands / ...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,Central Bay Street,0,0.0,0.020408,0.142857,0.163265,0.020408,0.142857,0.0,0.020408,0.102041,0.387755


### Reinsert coordinates


In [221]:
rests_sum = rests_sum.merge(df_reduced[['Neighborhood', 'Longitude', 'Latitude']], on = 'Neighborhood')
rests_sum.head()

Unnamed: 0,Neighborhood,Cluster,African Restaurant,American Restaurant,Asian Restaurant,European Restaurant,Fancy Restaurant,Fast Food,Latin American Restaurant,Middle Eastern Restaurant,Others,Sweet Food,Longitude,Latitude
0,Berczy Park,0,0.0,0.074074,0.074074,0.185185,0.037037,0.037037,0.037037,0.0,0.222222,0.333333,-79.373306,43.644771
1,Brockton / Parkdale Village / Exhibition Place,0,0.0,0.0,0.0,0.090909,0.0,0.0,0.090909,0.0,0.272727,0.545455,-79.428191,43.636847
2,Business reply mail Processing CentrE,1,0.0,0.0,0.0,0.25,0.0,0.25,0.25,0.0,0.25,0.0,-79.321558,43.662744
3,CN Tower / King and Spadina / Railway Lands / ...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-79.39442,43.628947
4,Central Bay Street,0,0.0,0.020408,0.142857,0.163265,0.020408,0.142857,0.0,0.020408,0.102041,0.387755,-79.387383,43.657952


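### Get average share of each restaurant type per cluster

In [216]:
# Average only the restaurant-type columns; names and coordinates are excluded if present
rests_sum.drop(columns=['Neighborhood', 'Longitude', 'Latitude'], errors='ignore').groupby('Cluster').mean()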

Cluster,African Restaurant,American Restaurant,Asian Restaurant,European Restaurant,Fancy Restaurant,Fast Food,Latin American Restaurant,Middle Eastern Restaurant,Others,Sweet Food
0,0.002451,0.06363,0.1264,0.13593,0.021319,0.068813,0.037631,0.007993,0.156521,0.379312
1,0.0,0.015385,0.047436,0.198718,0.0,0.197436,0.146154,0.0,0.298718,0.096154
2,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.6
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


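The clusters appear to represent the following patterns:

* **Cluster 0**: Most diverse origins, with sweet food as the largest single share (about 38%).
* **Cluster 1**: Most country-independent food ("Others" about 30%, fast food about 20%).
* **Cluster 2**: Most sweet food (60%).
* **Cluster 3**: Asian restaurants only (100%).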
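
As a rough cross-check of these labels, the dominant category of each cluster can be read off with `idxmax`. The sketch below rebuilds a subset of the columns of the cluster table by hand (values copied from the output above); `cluster_means` and `summary` are names introduced here.

```python
import pandas as pd

# Subset of the per-cluster means shown above, entered by hand
cluster_means = pd.DataFrame(
    {
        'Asian Restaurant': [0.1264, 0.047436, 0.0, 1.0],
        'European Restaurant': [0.13593, 0.198718, 0.0, 0.0],
        'Fast Food': [0.068813, 0.197436, 0.0, 0.0],
        'Latin American Restaurant': [0.037631, 0.146154, 0.2, 0.0],
        'Others': [0.156521, 0.298718, 0.0, 0.0],
        'Sweet Food': [0.379312, 0.096154, 0.6, 0.0],
    },
    index=pd.Index([0, 1, 2, 3], name='Cluster'),
)

# Largest category per cluster and its share
summary = pd.DataFrame({
    'top category': cluster_means.idxmax(axis=1),
    'share': cluster_means.max(axis=1),
})
print(summary)
```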

### Plot clusters

In [252]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
rainbow = ['#363200', '#2D4262', '#CB0000',  '#D09683']

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(rests_sum['Latitude'], rests_sum['Longitude'], rests_sum['Neighborhood'], rests_sum['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.4).add_to(map_clusters)
       
map_clusters