# Clustering of Neighborhoods in Toronto

In this project I cluster the neighborhoods of Toronto according to their most common venues. To obtain the necessary data I use web scraping to get the postal codes of Toronto and then use the Foursquare API to get the venues in each neighborhood. Finally, I apply k-means clustering.

This project is my assignment for the week 3 evaluation of the [Capstone Project Course](https://www.coursera.org/learn/applied-data-science-capstone/).

Constantino Carreto

## Table of Contents

1.  <a href="#item1">Web Scraping of Postal Codes</a>
2.  <a href="#item2">Getting the latitude and the longitude coordinates of each neighborhood</a>  
3.  <a href="#item3">Clustering Neighborhoods in Toronto</a> 

    3A. <a href="#item3A">Foursquare API</a> 
    
    3B. <a href="#item3B">Finding the most common venues in each neighborhood</a> 
    
    3C. <a href="#item3C">Clustering Neighborhoods</a> 
    
    3D. <a href="#item3D">Examining clusters</a> 



In [56]:
# import libraries
import pandas as pd
import numpy as np

import requests # for making the request to the website
import json # to transform to json file
import lxml.html as lh # for parsing html content

#!pip install pgeocode
import pgeocode # to obtain latitude and longitude coordinates

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#!pip install folium==0.5.0
import folium # map rendering library

import os # to access working directory
print('libraries imported')

libraries imported


<section id="item1"> </section>

## 1. Web Scraping of Postal Codes
We scrape the postal codes for the city of Toronto and process them into a dataframe. 
To scrape the postal codes from the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M),
I follow the steps in [this page](https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059). I need the postal codes because in the following sections I look up their coordinates and pass those to the Foursquare API to explore the venues in each neighborhood.
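
The parsing idea used in this section (read the header row, then append each cell of every later row to its column) can be sketched on a small inline table. This is only an illustration: the table below is hand-written, `table_to_columns` is a hypothetical helper, and the standard-library `xml.etree.ElementTree` stands in for `lxml` because the snippet is well-formed; real pages generally need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for the Wikipedia table (assumption for illustration).
HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def table_to_columns(html):
    """Return {header: [cell, ...]} for the rows of a single table."""
    root = ET.fromstring(html.strip())
    rows = root.findall('.//tr')
    # the first row holds the column headers
    headers = [''.join(cell.itertext()).strip() for cell in rows[0]]
    columns = {header: [] for header in headers}
    for row in rows[1:]:
        cells = [''.join(cell.itertext()).strip() for cell in row]
        if len(cells) != len(headers):
            continue  # skip rows that do not match the header width
        for header, value in zip(headers, cells):
            columns[header].append(value)
    return columns

columns = table_to_columns(HTML)
print(columns['Borough'])
```

The resulting dictionary has the same shape as the one passed to `pd.DataFrame` later in this section.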

In [2]:
# the website with the postal codes
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
tr_elements[0:5]

[<Element tr at 0x2076f8417c8>,
 <Element tr at 0x20773808728>,
 <Element tr at 0x20773808778>,
 <Element tr at 0x207738224a8>,
 <Element tr at 0x20773822458>]

In [11]:
## Let's process column headers
#Create empty list
col=[]
i=0
print('number of columns: ',len(tr_elements[0]))
#For the first row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content().replace('\n','')
    print(i,name)
    col.append((name,[]))
#print(col)

number of columns:  3
1 Postal Code
2 Borough
3 Neighbourhood


In [12]:
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content().replace('\n','') 
        #For every column after the first, convert purely numeric values to integers
        if i>0:
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

In [13]:
# convert to dataframe
Dict={title:column for (title,column) in col}
df_zc=pd.DataFrame(Dict)

In [14]:
df_zc.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [16]:
# let's clean the dataframe
# first we check whether any postal code appears more than once
if len(df_zc['Postal Code'].unique())==df_zc.shape[0]:
    print('we have no zip code repetitions')
df_zc.shape

we have no zip code repetitions


(181, 3)

In [17]:
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
df_zc = df_zc.loc[(df_zc['Borough']!='Not assigned') & (df_zc['Postal Code']!='')]
print(df_zc.shape)
df_zc.head()

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [18]:
# we handle cases where the Neighbourhood is unknown
# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
df_zc.loc[df_zc['Neighbourhood']=='Not assigned'].head()
# it turns out there are no such cases

Unnamed: 0,Postal Code,Borough,Neighbourhood


In [19]:
print(df_zc.shape)

(103, 3)
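
The two cleaning rules applied in this section (drop rows without a borough, and fall back to the borough name when the neighbourhood is missing) can be sketched on a toy frame. The `M9Z` row is invented purely to exercise the second rule, which never triggers on the real data.

```python
import pandas as pd

# Toy data; the 'M9Z' row is a made-up example of a borough without a neighbourhood.
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M9Z'],
    'Borough': ['Not assigned', 'North York', 'Etobicoke'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: keep only rows with an assigned borough.
# .copy() makes the result an independent frame, so later assignments do not raise warnings.
clean = toy.loc[toy['Borough'] != 'Not assigned'].copy()

# Rule 2: where the neighbourhood is missing, reuse the borough name.
missing = clean['Neighbourhood'] == 'Not assigned'
clean.loc[missing, 'Neighbourhood'] = clean.loc[missing, 'Borough']

print(clean)
```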


<section id="item2"> </section>

## 2. Getting the latitude and the longitude coordinates of each neighborhood

In [20]:
# take a look
df_zc.tail()

Unnamed: 0,Postal Code,Borough,Neighbourhood
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [21]:
# we use pgeocode to get the latitude and longitude corresponding to each postal code
lat = []
lon = []

# create the geocoder once, outside the loop
nomi = pgeocode.Nominatim('ca')

# iterate through each postal code
for pc in df_zc['Postal Code']:
    result = nomi.query_postal_code(pc)
    # select the fields by name rather than by position
    lat.append(result['latitude'])
    lon.append(result['longitude'])
    
print(lat[0:5], lon[0:5])

[43.7545, 43.7276, 43.6555, 43.7223, 43.6641] [-79.33, -79.3148, -79.3626, -79.4504, -79.3889]


In [22]:
# we add the new data to our dataframe
df_zc['Latitude'] = lat
df_zc['Longitude'] = lon
df_zc.rename(columns={'Neighbourhood':'Neighborhood'}, inplace = True)
df_zc.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.7545,-79.33
3,M4A,North York,Victoria Village,43.7276,-79.3148
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889


In [23]:
# we temporarily exclude problematic postal codes
df_zc2 = df_zc.loc[df_zc['Postal Code'] != 'M7R']

<section id="item3"> </section>

## 3. Clustering Neighborhoods in Toronto

In this section I implement k-means clustering to group neighborhoods according to the types of venues that exist around them.

<section id="item3A"> </section>

### 3A. Foursquare API

I use the Foursquare API to see which venues exist in each neighborhood.

In [25]:
# Define Foursquare credentials
# The values are read from environment variables so that they never appear in the notebook;
# the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are an assumed convention
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # my Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # my Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Credentials loaded')

Credentials loaded


In [26]:
# function that extracts the category of the venue. We use it together with the next function to process the Foursquare response
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [27]:
# let's create a function that extracts the venue information from the request for every neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
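
The function above indexes `categories[0]` directly, which would fail for a venue that comes back without a category. A minimal sketch of a more tolerant flattening step follows. `flatten_items` is a hypothetical helper, and `mock_items` is a hand-written stand-in shaped like the `items` list the function reads from the response; no request is made.

```python
def flatten_items(neighborhood, items):
    """Turn Foursquare-style 'items' into flat rows, tolerating venues with no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None instead of raising IndexError on an empty list
        category = categories[0]['name'] if categories else None
        rows.append((
            neighborhood,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows

# Hand-written stand-in for requests.get(url).json()['response']['groups'][0]['items'].
mock_items = [
    {'venue': {'name': 'KFC',
               'location': {'lat': 43.754387, 'lng': -79.333021},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Uncategorized Spot',  # invented venue with no category
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]

rows = flatten_items('Parkwoods', mock_items)
print(rows)
```

Rows with a `None` category could then be dropped or labeled before the one-hot encoding step.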

Now let's write the code to run the above function on each neighborhood and create a new dataframe called _toronto_venues_raw_.

In [28]:
# get the venues
toronto_venues_raw = getNearbyVenues(names=df_zc2['Neighborhood'],
                                   latitudes=df_zc2['Latitude'],
                                   longitudes=df_zc2['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [29]:
# take a look at the new dataframe
print(toronto_venues_raw.shape)
toronto_venues_raw.head()

(2151, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.7545,-79.33,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.7545,-79.33,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.7545,-79.33,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.7276,-79.3148,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.7276,-79.3148,Tim Hortons,43.725517,-79.313103,Coffee Shop
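
As a sanity check on the `radius=500` argument (which I take to be in meters), the great-circle distance between a neighborhood's coordinates and one of its venues can be computed with the haversine formula. The sketch below uses the Parkwoods and Brookbanks Park coordinates from the table above; `haversine_m` is a helper written for this illustration, and Foursquare's own distance calculation may differ slightly.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6_371_000):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

# Parkwoods center vs. Brookbanks Park, both taken from the dataframe above
distance = haversine_m(43.7545, -79.33, 43.751976, -79.33214)
print(round(distance), distance <= 500)
```

The venue lies roughly 330 m from the neighborhood coordinates, inside the requested radius.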


In [30]:
# save the dataframe as a backup
toronto_venues_raw.to_csv('toronto_venues.csv', index = False)
os.listdir()

['.ipynb_checkpoints',
 'Clustering Toronto.ipynb',
 'toronto_venues.csv',
 'Untitled.ipynb']

Let's check how many venues were returned for each neighborhood

In [31]:
# we now focus on neighborhoods not on postal codes
toronto_venues = pd.read_csv('toronto_venues.csv')
print(toronto_venues.head())
toronto_venues.groupby('Neighborhood').count()

       Neighborhood  Neighborhood Latitude  Neighborhood Longitude  \
0         Parkwoods                43.7545                -79.3300   
1         Parkwoods                43.7545                -79.3300   
2         Parkwoods                43.7545                -79.3300   
3  Victoria Village                43.7276                -79.3148   
4  Victoria Village                43.7276                -79.3148   

                    Venue  Venue Latitude  Venue Longitude  \
0         Brookbanks Park       43.751976       -79.332140   
1                     KFC       43.754387       -79.333021   
2           Variety Store       43.751974       -79.333114   
3  Victoria Village Arena       43.723481       -79.315635   
4             Tim Hortons       43.725517       -79.313103   

         Venue Category  
0                  Park  
1  Fast Food Restaurant  
2     Food & Drink Shop  
3          Hockey Arena  
4           Coffee Shop  


Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Wilson Heights, Downsview North",7,7,7,7,7,7
Bayview Village,5,5,5,5,5,5
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
...,...,...,...,...,...,...
"Willowdale, Willowdale West",2,2,2,2,2,2
Woburn,1,1,1,1,1,1
Woodbine Heights,4,4,4,4,4,4
York Mills West,2,2,2,2,2,2


<section id="item3B"> </section>

### 3B. Finding the most common venues in each neighborhood

I analyze which venue types belong to each neighborhood and find the most common ones.

In [32]:
# one hot encoding
# let's create dummies for each venue type
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
print(toronto_onehot.shape, toronto_venues.shape)

# move neighborhood column to the first column
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']
columnas = list(toronto_onehot.columns)
columnas.remove('Neighborhood')
fixed_columns = ['Neighborhood'] + columnas
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

(2151, 254) (2151, 7)


Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category.

In [33]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
print(toronto_grouped.shape)
toronto_grouped.head()

(96, 254)


Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
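
To see what the one-hot encoding and the per-neighborhood mean produce, here is a small sketch on toy data. The four rows are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy venue list (invented rows, same column names as toronto_venues)
venues = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Parkwoods', 'Parkwoods', 'Victoria Village'],
    'Venue Category': ['Park', 'Fast Food Restaurant', 'Park', 'Coffee Shop'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# the mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Each row of `freq` sums to 1, so a value such as 0.043478 in the real table is the share of a neighborhood's venues that fall into that category (1 venue out of 23).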


First, let's write a function to sort the venues in descending order.

In [34]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [35]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted.head()

(96, 11)


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Breakfast Spot,Badminton Court,Latin American Restaurant,Newsagent,Yoga Studio,Fast Food Restaurant,Event Space,Falafel Restaurant,Farmers Market,Field
1,"Alderwood, Long Branch",Pharmacy,Athletics & Sports,Sandwich Place,Pub,Gym,Coffee Shop,Pizza Place,Convenience Store,Cupcake Shop,Donut Shop
2,"Bathurst Manor, Wilson Heights, Downsview North",Pizza Place,Mediterranean Restaurant,Middle Eastern Restaurant,Fried Chicken Joint,Deli / Bodega,Coffee Shop,Grocery Store,Yoga Studio,Falafel Restaurant,Ethiopian Restaurant
3,Bayview Village,Flower Shop,Dog Run,Park,Trail,Gas Station,Yoga Studio,Falafel Restaurant,Escape Room,Ethiopian Restaurant,Event Space
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Sandwich Place,Italian Restaurant,Pharmacy,Indian Restaurant,Sushi Restaurant,Juice Bar,Restaurant,Thai Restaurant,Toy / Game Store
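
Note that a neighborhood with few venues still gets ten entries: Agincourt has only 4 venues, so most of its "top 10" categories have a frequency of 0 and are simply whatever the sort happens to place among the ties. One possible refinement, sketched below, keeps only categories that actually occur. `top_nonzero_venues` is a hypothetical variant of `return_most_common_venues`, and the row is toy data.

```python
import pandas as pd

def top_nonzero_venues(row, num_top_venues):
    """Like return_most_common_venues, but skips categories whose frequency is 0."""
    freqs = row.iloc[1:].astype(float)  # drop the leading 'Neighborhood' entry
    freqs = freqs[freqs > 0].sort_values(ascending=False)
    return list(freqs.index[:num_top_venues])

# Toy row shaped like one row of toronto_grouped
row = pd.Series({'Neighborhood': 'Toyville', 'Park': 0.75, 'Pool': 0.0,
                 'Coffee Shop': 0.25, 'Yoga Studio': 0.0})
print(top_nonzero_venues(row, 3))
```

With this variant the list can be shorter than `num_top_venues`, so filling the fixed-width dataframe would need padding (for example with `None`).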


<section id="item3C"> </section>


### 3C. Clustering Neighborhoods

Run _k_-means to cluster the neighborhoods into 5 clusters according to the frequency of each venue category.

In [39]:
# set number of clusters
# we initially choose 5
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print('cluster occurrences \n', pd.DataFrame(kmeans.labels_)[0].value_counts(dropna=False))
kmeans.labels_[0:10] 

cluster occurrences 
 0    82
3    11
4     1
2     1
1     1
Name: 0, dtype: int64


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
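
For intuition about what the `KMeans` call does, here is a bare-bones version of Lloyd's algorithm in NumPy: assign each point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. This is a teaching sketch on toy frequency vectors, not the scikit-learn implementation, which adds refinements such as k-means++ initialization; `lloyd_kmeans` and the data are written for this illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: nearest center by squared Euclidean distance
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its members
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers

# Toy "venue frequency" rows: three park-heavy and three cafe-heavy neighborhoods
X = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [1.0, 0.0, 0.0],
              [0.0, 0.1, 0.9], [0.1, 0.0, 0.9], [0.0, 0.0, 1.0]])
labels, centers = lloyd_kmeans(X, k=2)
print(labels)
```

On the real data the result is very unbalanced (one cluster holds 82 of the 96 neighborhoods), which suggests trying other values of `kclusters` before reading much into the small clusters.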

Let's create a new dataframe that includes the cluster labels as well as the top 10 venues for each neighborhood.

In [40]:
# add clustering labels
try:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
except:
    pass
neighborhoods_venues_sorted['Cluster Labels'] = neighborhoods_venues_sorted['Cluster Labels'].astype('int')

toronto_merged = df_zc2
print(toronto_merged.head())
# join df_zc2 with neighborhoods_venues_sorted to add the cluster label and top venues to each postal code
# note we have to do an inner join since for some postal codes we were not able to find venues
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood', how='inner')
toronto_merged.sort_values('Cluster Labels', inplace=True)

toronto_merged.tail(8) # check the last columns!


  Postal Code           Borough                                 Neighborhood  \
2         M3A        North York                                    Parkwoods   
3         M4A        North York                             Victoria Village   
4         M5A  Downtown Toronto                    Regent Park, Harbourfront   
5         M6A        North York             Lawrence Manor, Lawrence Heights   
6         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government   

   Latitude  Longitude  
2   43.7545   -79.3300  
3   43.7276   -79.3148  
4   43.6555   -79.3626  
5   43.7223   -79.4504  
6   43.6641   -79.3889  


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
54,M1J,Scarborough,Scarborough Village,43.7464,-79.2323,3,Grocery Store,Park,Yoga Studio,Fast Food Restaurant,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Field
129,M4T,Central Toronto,"Moore Park, Summerhill East",43.6899,-79.3853,3,Park,Playground,Grocery Store,Thai Restaurant,Gym,Yoga Studio,Event Space,Eastern European Restaurant,Electronics Store,Escape Room
98,M9N,York,Weston,43.7068,-79.517,3,Convenience Store,Park,Jewelry Store,Yoga Studio,Farmers Market,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Field
100,M2P,North York,York Mills West,43.75,-79.3978,3,Convenience Store,Park,Yoga Studio,Fast Food Restaurant,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fish & Chips Shop
104,M6P,West Toronto,"High Park, The Junction South",43.6605,-79.4633,3,Residential Building (Apartment / Condo),Park,Yoga Studio,Farmers Market,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Fast Food Restaurant
46,M2H,North York,Hillcrest Village,43.8015,-79.3577,3,Residential Building (Apartment / Condo),Park,Yoga Studio,Farmers Market,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Fast Food Restaurant
57,M4J,East York,"East Toronto, Broadview North (Old East York)",43.6872,-79.3368,3,Park,Convenience Store,Yoga Studio,Fast Food Restaurant,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fish & Chips Shop
108,M1R,Scarborough,"Wexford, Maryvale",43.7507,-79.3003,4,Auto Garage,Yoga Studio,Electronics Store,Food Court,Food & Drink Shop,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Field


In [44]:
# number of rows (one per postal code) in each cluster
print(toronto_merged['Cluster Labels'].value_counts(dropna=False))

0    86
3    11
4     1
2     1
1     1
Name: Cluster Labels, dtype: int64
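
These counts sum to 100, although k-means labeled only 96 neighborhoods. A likely explanation is that the join is keyed on the neighborhood name, and some names (Don Mills appears twice in the list printed earlier) belong to more than one postal code, so their cluster label is repeated. The sketch below reproduces the effect on toy data; the postal codes and labels are illustrative values only.

```python
import pandas as pd

# Two postal codes share the name 'Don Mills' (codes and labels chosen for illustration)
postal = pd.DataFrame({
    'Postal Code': ['M3B', 'M3C', 'M3A'],
    'Neighborhood': ['Don Mills', 'Don Mills', 'Parkwoods'],
})
labels = pd.DataFrame({
    'Neighborhood': ['Don Mills', 'Parkwoods'],
    'Cluster Labels': [0, 3],
})

merged = postal.join(labels.set_index('Neighborhood'), on='Neighborhood', how='inner')
print(len(labels), len(merged))
print(merged['Cluster Labels'].value_counts())
```

Two labeled neighborhoods become three merged rows, which is the same kind of inflation seen between the two `value_counts` outputs.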


Finally, let's visualize the resulting clusters

In [42]:
# Toronto coordinates
latitude = 43.703908
longitude = -79.347015

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
#print(rainbow)

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<section id="item3D"> </section>

### 3D. Examining clusters

We compare the venues that characterize each cluster.

#### Cluster 0
Most commonly we find gyms, parks, and restaurants, i.e. places to eat and to spend leisure time. This is the most common type of neighborhood.

In [48]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]].head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
76,Downtown Toronto,0,Coffee Shop,Hotel,Café,Restaurant,Gym,Salad Place,Asian Restaurant,Japanese Restaurant,Steakhouse,American Restaurant
112,Central Toronto,0,Sandwich Place,Café,Pharmacy,Burger Joint,Indian Restaurant,Italian Restaurant,French Restaurant,Liquor Store,Mexican Restaurant,Middle Eastern Restaurant
111,Central Toronto,0,Gym Pool,Playground,Park,Garden,Yoga Studio,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant
107,Etobicoke,0,Flea Market,Ice Cream Shop,Coffee Shop,Sandwich Place,Discount Store,Middle Eastern Restaurant,Supermarket,Pizza Place,Chinese Restaurant,Field
103,Central Toronto,0,Home Service,Park,Bus Line,Trail,Yoga Studio,Farmers Market,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant


#### Cluster 1
This cluster differs from cluster 0 mainly by the presence of venues related to fish.

In [46]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]].head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
36,Scarborough,1,Korean BBQ Restaurant,Yoga Studio,Fountain,Food Court,Food & Drink Shop,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Field


#### Cluster 2
This cluster has a pool as its most common venue.

In [50]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]].head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
73,North York,2,Pool,Yoga Studio,Eastern European Restaurant,Food & Drink Shop,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Field,Fast Food Restaurant


#### Cluster 3
It is distinguished mainly by parks.

In [53]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]].head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
169,Etobicoke,3,Convenience Store,Baseball Field,Park,Yoga Studio,Fast Food Restaurant,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fish & Chips Shop
2,North York,3,Food & Drink Shop,Park,Fast Food Restaurant,Yoga Studio,Eastern European Restaurant,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Field
109,North York,3,Coffee Shop,Park,Yoga Studio,Fast Food Restaurant,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Field
93,Central Toronto,3,Photography Studio,Park,Lawyer,Eastern European Restaurant,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Field,Fast Food Restaurant
54,Scarborough,3,Grocery Store,Park,Yoga Studio,Fast Food Restaurant,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Field


#### Cluster 4
The most common venue is an auto garage.

In [55]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]].head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
108,Scarborough,4,Auto Garage,Yoga Studio,Electronics Store,Food Court,Food & Drink Shop,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Field
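
Instead of reading each table by eye, the clusters can also be compared with a small aggregation, for example the most frequent "1st Most Common Venue" per cluster. The sketch below runs on toy rows with the same column names as `toronto_merged`; it is one possible summary, not the method used above.

```python
import pandas as pd

# Toy rows with the same column names as toronto_merged (values invented)
toy = pd.DataFrame({
    'Cluster Labels': [0, 0, 3, 3, 3, 4],
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Park', 'Park',
                              'Convenience Store', 'Auto Garage'],
})

# for each cluster, pick the venue type that most often ranks first
summary = (toy.groupby('Cluster Labels')['1st Most Common Venue']
              .agg(lambda s: s.value_counts().idxmax()))
print(summary)
```

Applied to `toronto_merged`, the same expression would give one representative venue type per cluster; for single-neighborhood clusters it simply returns that neighborhood's top venue.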
