#### Author: DIEP, LY BAO LONG
#### Assignment: Segmenting and Clustering Neighborhoods in Toronto.
#### Day: 12-03-2021

In [2]:
import pandas as pd
import numpy as np
import requests
from pandas import json_normalize
from geopy.geocoders import Nominatim
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

In [4]:
# # https://pythonbasics.org/pandas-web-scraping/#:~:text=Pandas%20makes%20it%20easy%20to,Excel%20file%20or%20csv%20file.
# url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# html = pd.read_html(url, na_values = 'Not assigned')
# toronto_df = html[0].dropna(subset = ['Borough'])

toronto_df = pd.read_csv('Toronto.csv',index_col = 0)
print(f'Shape: {toronto_df.shape[0]}x{toronto_df.shape[1]}')
toronto_df.head()

Shape: 103x3


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"['Regent Park', 'Harbourfront']"
3,M6A,North York,"['Lawrence Manor', 'Lawrence Heights']"
4,M7A,Queen's Park,Ontario Provincial Government


#### PERSONAL EVALUATION: First Look at Dataset
- pandas can scrape an HTML table from a web page straight into a DataFrame with its `read_html` function.
- Many rows of the Borough and Neighbourhood columns contain "Not assigned". Passing `na_values = 'Not assigned'` to `read_html` marks these cells as missing (NaN) at the very beginning, while the data is scraped from Wikipedia, so the rows without a borough can then be dropped.
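
As a minimal sketch of the `na_values` idea, the cell below uses `read_csv` on an in-memory string, which accepts the same `na_values` argument and runs without network access or an HTML parser. The three sample rows are made up for illustration and are not taken from the scraped table.

```python
import io
import pandas as pd

# Hypothetical sample rows imitating the layout of the Wikipedia table.
raw = """Postal Code,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
"""

# na_values turns every "Not assigned" cell into NaN while parsing.
sample_df = pd.read_csv(io.StringIO(raw), na_values='Not assigned')

# Rows without a borough can then be dropped in one step.
clean_df = sample_df.dropna(subset=['Borough']).reset_index(drop=True)
print(clean_df)
```

Marking the placeholder as NaN during parsing keeps the cleaning step to a single `dropna` call.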

In [5]:
latlng_df = pd.read_csv('Geospatial_Coordinates.csv')
toronto_df = pd.merge(toronto_df, latlng_df, on = 'Postal Code')
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"['Regent Park', 'Harbourfront']",43.65426,-79.360636
3,M6A,North York,"['Lawrence Manor', 'Lawrence Heights']",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
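
`pd.merge` performs an inner join by default, so a postal code without coordinates would silently disappear. The sketch below shows one way to check for that, using two miniature tables; the postal code `M9Z` is a hypothetical unmatched entry.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
codes = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                      'Borough': ['North York', 'North York', 'Etobicoke']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# A left join keeps every postal code; indicator=True adds a "_merge" column
# that says whether a match was found in the coordinates table.
checked = pd.merge(codes, coords, on='Postal Code', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)  # postal codes that have no coordinates
```

If `missing` is empty, the default inner join loses nothing.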


#### Let's look at all of the Toronto boroughs and their neighbourhoods on a map

In [6]:
address = 'Toronto, ON'
geolocator = Nominatim(user_agent = 'to_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'Toronto\nLat: {latitude}\nLng: {longitude}')

Toronto
Lat: 43.6534817
Lng: -79.3839347


In [7]:
map_toronto = folium.Map([latitude,longitude], zoom_start = 13)
for lat,lng, borough, neighbourhood in zip(toronto_df.Latitude, toronto_df.Longitude, toronto_df.Borough, toronto_df.Neighbourhood):
    label = f'Neighborhood: {neighbourhood}\nBorough: {borough}'
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'red',
        fill = True,
        fill_color = 'yellow',
        fill_opacity = 0.7
    ).add_to(map_toronto)
map_toronto

In [8]:
borough_df = toronto_df.groupby('Borough').count().sort_values('Neighbourhood',ascending = False)
borough_df

Unnamed: 0_level_0,Postal Code,Neighbourhood,Latitude,Longitude
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
North York,24,24,24,24
Downtown Toronto,18,18,18,18
Scarborough,17,17,17,17
Etobicoke,11,11,11,11
Central Toronto,9,9,9,9
West Toronto,6,6,6,6
East Toronto,5,5,5,5
York,5,5,5,5
East York,4,4,4,4
East YorkEast Toronto,1,1,1,1


#### Choice of borough
- The `borough_df` table above, grouped by the Borough column, shows that North York has more postal codes than any other borough (24), which makes it a good candidate for a closer look.
- By examining the neighbourhoods of North York, we can explore the variety of venues that surround them.
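
A possible shortcut for this comparison, sketched on a made-up five-row sample: `groupby(...).size()` returns a single count per borough instead of repeating the same number in every column.

```python
import pandas as pd

# Hypothetical subset of the postal code table.
sample = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A', 'M6A', 'M7A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'North York', "Queen's Park"],
})

# One count per borough, largest first.
counts = sample.groupby('Borough').size().sort_values(ascending=False)
top_borough = counts.index[0]
print(counts)
print(top_borough)
```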

#### Get the North York DataFrame from the whole dataset

In [9]:
ny_df = toronto_df[toronto_df['Borough'] == 'North York'].reset_index(drop = True)
address = 'North York, ON'
geolocator = Nominatim(user_agent='to_explorer')
location = geolocator.geocode(address)
ny_lat = location.latitude
ny_lng = location.longitude
print(f'North York\nLatitude: {ny_lat}\nLongitude: {ny_lng}')
print(f'Shape: {ny_df.shape[0]}x{ny_df.shape[1]}')
ny_df.head()

North York
Latitude: 43.7543263
Longitude: -79.44911696639593
Shape: 24x5


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"['Lawrence Manor', 'Lawrence Heights']",43.718518,-79.464763
3,M3B,North York,Don Mills North,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073


In [10]:
map_ny = folium.Map([ny_lat,ny_lng],zoom_start = 11)
for lat, lng, neigh, pc in zip(ny_df.Latitude, ny_df.Longitude, ny_df.Neighbourhood, ny_df['Postal Code']):
    label = f'Neighborhood: {neigh}\nPostal Code: {pc}'
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat,lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = 'red',
        fill_opacity = 0.7
    ).add_to(map_ny)
map_ny

#### Using Foursquare to explore the neighborhoods

In [11]:
CLIENT_ID = 'yAHPPBLZ4QX43IMOBWWZPQPW2GO2PC403TLIIJPTXEUDV1PGJ' # your Foursquare ID
CLIENT_SECRET = 'OWL05PQU02RVOHOS45IIGJ3SCHJOYOGUFNOOL21SPROSTZ1I' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
ACCESS_TOKEN = 'OZ0KJZ4ZYJPGUBRGQPDA2QXNO5MTBC4IG2SVMXP2K0LBGBQM'
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
print('ACCESS_TOKEN:' + ACCESS_TOKEN)

Your credentials:
CLIENT_ID: yAHPPBLZ4QX43IMOBWWZPQPW2GO2PC403TLIIJPTXEUDV1PGJ
CLIENT_SECRET:OWL05PQU02RVOHOS45IIGJ3SCHJOYOGUFNOOL21SPROSTZ1I
ACCESS_TOKEN:OZ0KJZ4ZYJPGUBRGQPDA2QXNO5MTBC4IG2SVMXP2K0LBGBQM
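
Credentials typed into a notebook travel with every copy of it. One possible alternative, sketched below, is to read them from environment variables and to mask them when printing. The variable names (`FOURSQUARE_CLIENT_ID` and so on) and both helper functions are assumptions made for this example; they are not part of Foursquare's API.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
CREDENTIAL_KEYS = ['FOURSQUARE_CLIENT_ID',
                   'FOURSQUARE_CLIENT_SECRET',
                   'FOURSQUARE_ACCESS_TOKEN']

def load_foursquare_credentials(env=os.environ):
    """Return the credentials found in `env`, or raise if any is missing."""
    missing = [key for key in CREDENTIAL_KEYS if key not in env]
    if missing:
        raise KeyError(f'Missing environment variables: {missing}')
    return {key: env[key] for key in CREDENTIAL_KEYS}

def mask(secret, visible=4):
    """Keep the first `visible` characters and replace the rest with '*'."""
    return secret[:visible] + '*' * (len(secret) - visible)

# Demonstration with a plain dict standing in for os.environ.
demo_env = {'FOURSQUARE_CLIENT_ID': 'abcdefgh',
            'FOURSQUARE_CLIENT_SECRET': 'ijklmnop',
            'FOURSQUARE_ACCESS_TOKEN': 'qrstuvwx'}
creds = load_foursquare_credentials(demo_env)
print({key: mask(value) for key, value in creds.items()})
```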


#### Instruction
- This function gets the venues around each neighborhood, based on its latitude and longitude, by calling the Foursquare `venues/explore` URL.

In [12]:
def getNearbyVenues(postal_codes, names, latitudes, longitudes, radius = 500):
    venues_list = []
    # Distinct loop names avoid shadowing the function arguments
    for pc, name, lat, lng in zip(postal_codes, names, latitudes, longitudes):
        print(name)

        # Foursquare "explore" endpoint: venues within `radius` metres of (lat, lng)
        url = f'https://api.foursquare.com/v2/venues/explore?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={lat},{lng}&oauth_token={ACCESS_TOKEN}&radius={radius}&limit={LIMIT}'
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # One tuple per venue, tagged with the postal code and neighborhood it belongs to
        venues_list.append([(
            pc,
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    # Flatten the per-neighborhood lists into a single table
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code',
                             'Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
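
`getNearbyVenues` indexes straight into the JSON response, so a reply without groups or a venue without categories would raise an error. The sketch below shows a more defensive way to flatten such a response. `extract_venues` is a hypothetical helper, and `fake_payload` is a hand-written stand-in that contains only the keys the function above reads; it is not a real API reply.

```python
def extract_venues(payload):
    """Flatten an explore-style response into (name, lat, lng, category) tuples.

    Missing groups give an empty list; a venue without categories gets None.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hypothetical payload shaped like the fields accessed in getNearbyVenues.
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]}]}}

rows = extract_venues(fake_payload)
empty = extract_venues({'response': {}})
print(rows)
print(empty)
```

Keeping the parsing in its own function also lets it be checked without calling the network.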

In [13]:
ny_venues = getNearbyVenues(ny_df['Postal Code'],ny_df.Neighbourhood, ny_df.Latitude, ny_df.Longitude)
ny_venues.head()

Parkwoods
Victoria Village
['Lawrence Manor', 'Lawrence Heights']
Don Mills North
Glencairn
Don Mills South (Flemingdon Park)
Hillcrest Village
['Bathurst Manor', 'Wilson Heights', 'Downsview North']
['Fairview', 'Henry Farm', 'Oriole']
['Northwood Park', 'York University']
Bayview Village
Downsview East (CFB Toronto)
['York Mills', 'Silver Hills']
Downsview West
['North Park', 'Maple Leaf Park', 'Upwood Park']
Humber Summit
['Willowdale', 'Newtonbrook']
Downsview Central
['Bedford Park', 'Lawrence Manor East']
['Humberlea', 'Emery']
Willowdale South
Downsview Northwest
York Mills West
Willowdale West


Unnamed: 0,Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,Parkwoods,43.753259,-79.329656,Careful & Reliable Painting,43.752622,-79.331957,Construction & Landscaping
2,M3A,Parkwoods,43.753259,-79.329656,Towns On The Ravine,43.754754,-79.332552,Hotel
3,M3A,Parkwoods,43.753259,-79.329656,Sun Life,43.75476,-79.332783,Construction & Landscaping
4,M3A,Parkwoods,43.753259,-79.329656,GTA Restoration,43.753396,-79.333477,Fireworks Store


In [14]:
# Let's have an overall observation on the total number of Venues of each Postal Code 
ny_venues.groupby('Postal Code').count().sort_values('Venue',ascending=False).loc[:,'Venue':'Venue Category']

Unnamed: 0_level_0,Venue,Venue Latitude,Venue Longitude,Venue Category
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M2J,100,100,100,100
M2N,57,57,57,57
M5M,53,53,53,53
M6A,39,39,39,39
M3H,34,34,34,34
M3C,31,31,31,31
M3J,17,17,17,17
M6B,11,11,11,11
M3N,9,9,9,9
M2R,9,9,9,9


In [15]:
print(f'Number of unique categories: {len(ny_venues["Venue Category"].unique())}')

Number of unique categories: 141


#### One-hot encoding technique
- Now we use one-hot encoding to turn the venue categories found for each Postal Code into indicator columns: each venue's row holds 1 in the column of its own category and 0 in all the others.
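
A small illustration of one-hot encoding on three made-up venue rows. Passing `dtype=int` is a choice made here so that the columns show 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical venue rows.
venues = pd.DataFrame({
    'Postal Code': ['M3A', 'M3A', 'M4A'],
    'Venue Category': ['Park', 'Hotel', 'Park'],
})

# One column per category, one row per venue.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Postal Code', venues['Postal Code'])
print(onehot)
```

Each row has exactly one 1, placed in the column that matches the venue's category.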

In [16]:
ny_onehot = pd.get_dummies(ny_venues[['Venue Category']],prefix="",prefix_sep="")
ny_onehot[['Postal Code', 'Neighborhood']] = ny_venues[['Postal Code','Neighborhood']]
fixed_cols = list(ny_onehot.columns[-2:]) + list(ny_onehot.columns[:-2])
ny_onehot = ny_onehot[fixed_cols]
ny_onehot.head()

Unnamed: 0,Postal Code,Neighborhood,ATM,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Sushi Restaurant,Tailor Shop,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store,Yoga Studio
0,M3A,Parkwoods,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A,Parkwoods,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M3A,Parkwoods,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M3A,Parkwoods,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M3A,Parkwoods,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
ny_onehot.shape

(435, 143)

#### PERSONAL EVALUATION: Differences in the dataset
- A brief glance at the dataset shows that the Neighborhood column contains a number of similar values that have different Postal Codes. Hence, the Postal Code column can serve as the key feature when exploring and clustering the neighborhoods in Toronto.
- After adding the Latitude and Longitude columns based on the Postal Code, we can see that these similar neighborhoods each have a distinct Postal Code, Latitude, and Longitude.
- In conclusion, if we grouped the dataset by Neighborhood, many potential venues could be missed during exploration and clustering. Grouping by Postal Code, whose values are distinct, avoids this.
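
The effect of the grouping key can be seen on a toy example. The postal codes `X1A` and `X1B` and the neighborhood name `Sampleville` are invented for this sketch: two postal codes share one name but have different venues.

```python
import pandas as pd

# Hypothetical one-hot rows: two postal codes share one neighborhood name.
toy_onehot = pd.DataFrame({
    'Postal Code': ['X1A', 'X1A', 'X1B', 'X1B'],
    'Neighborhood': ['Sampleville'] * 4,
    'Airport': [1, 0, 0, 0],
    'Park': [0, 1, 0, 0],
    'Bakery': [0, 0, 1, 1],
})

# Grouping by name blends both areas into one averaged row.
by_name = toy_onehot.groupby('Neighborhood').mean(numeric_only=True)
# Grouping by postal code (and name) keeps one row per area.
by_code = toy_onehot.groupby(['Postal Code', 'Neighborhood']).mean(numeric_only=True)

print(by_name)
print(by_code)
```

In `by_name` the bakery frequency is 0.5 for the blended row, while `by_code` shows that one area has only bakeries and the other has none.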

In [18]:
ny_grouped = ny_onehot.groupby(['Postal Code','Neighborhood']).mean().reset_index()
ny_grouped
# ny_onehot.groupby('Postal Code')

Unnamed: 0,Postal Code,Neighborhood,ATM,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Sushi Restaurant,Tailor Shop,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store,Yoga Studio
0,M2H,Hillcrest Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M2J,"['Fairview', 'Henry Farm', 'Oriole']",0.0,0.01,0.0,0.01,0.0,0.0,0.02,0.0,...,0.0,0.0,0.01,0.0,0.01,0.01,0.01,0.0,0.05,0.01
2,M2K,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M2L,"['York Mills', 'Silver Hills']",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M2M,"['Willowdale', 'Newtonbrook']",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M2N,Willowdale South,0.0,0.017544,0.0,0.0,0.0,0.017544,0.0,0.0,...,0.017544,0.0,0.0,0.017544,0.0,0.0,0.0,0.017544,0.0,0.0
6,M2P,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,M2R,Willowdale West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M3A,Parkwoods,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,M3B,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
# This code shows how frequently each venue category appears in each postal code
num_top_venues = 8
for pc,hood in zip(ny_grouped['Postal Code'],ny_grouped['Neighborhood']):
    print("----"+f'{hood} (PC:{pc})'+"----")
    temp = ny_grouped[ny_grouped['Postal Code'] == pc].T.reset_index() # transpose the index and the columns, then get the corresponding neighborhood and its columns 
    temp.columns = ['venue','freq']
    temp = temp.iloc[2:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Hillcrest Village (PC:M2H)----
                       venue  freq
0   Mediterranean Restaurant  0.17
1                Golf Course  0.17
2                    Dog Run  0.17
3                       Pool  0.17
4         Athletics & Sports  0.17
5       Fast Food Restaurant  0.17
6  Middle Eastern Restaurant  0.00
7         Miscellaneous Shop  0.00


----['Fairview', 'Henry Farm', 'Oriole'] (PC:M2J)----
                  venue  freq
0        Clothing Store  0.10
1            Shoe Store  0.07
2        Cosmetics Shop  0.06
3         Women's Store  0.05
4           Coffee Shop  0.05
5  Fast Food Restaurant  0.04
6        Lingerie Store  0.03
7            Restaurant  0.03


----Bayview Village (PC:M2K)----
                 venue  freq
0                 Café  0.17
1  Japanese Restaurant  0.17
2                  Gym  0.17
3   Chinese Restaurant  0.17
4                  Spa  0.17
5                 Bank  0.17
6  Martial Arts School  0.00
7       Massage Studio  0.00


----['York Mills', 'Silver

In [20]:
# This cell only illustrates the loop above (it can be ignored; it is kept for the coder's own reference)
# ny_grouped[ny_grouped['Postal Code'] == 'M2H'].T

In [20]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    return row_categories_sorted.index.values[0:num_top_venues]

# # Explanation of the above function (this block can be ignored; it is kept for the coder's own reference)
# row = ny_grouped.iloc[0,:] # get the first row and all columns
# row_categories = row.iloc[2:] # only take the row with numeric values
# row_categories_sorted = row_categories.sort_values(ascending = False)
# row_categories_sorted
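
The same top-N selection can be sketched on a single made-up row. `Series.nlargest` is used here as an alternative to sorting the whole row; the row values are hypothetical and follow the layout of `ny_grouped` (two label fields, then one frequency per category).

```python
import pandas as pd

# Hypothetical frequency row in the same layout as ny_grouped.
row = pd.Series({'Postal Code': 'X1A', 'Neighborhood': 'Sampleville',
                 'Park': 0.5, 'Café': 0.3, 'Gym': 0.2, 'Bank': 0.0})

# Skip the two label fields, cast to float, and keep the two largest values.
freqs = row.iloc[2:].astype(float)
top2 = freqs.nlargest(2).index.tolist()
print(top2)  # ['Park', 'Café']
```

When several categories have exactly the same frequency, the order among them carries no meaning, so tied venues may appear in a different order from one table to the next.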

#### Clustering the neighborhoods.
- Looking at the frequency of each venue category, a number of neighborhoods have only a few venues.
- Hence, to make sure every neighborhood has venues to report, we keep only the 4 most common venues of each neighborhood to describe the clusters (KMeans itself is fitted on the full frequency table).

In [21]:
num_top_venues = 4
indicators = ['st','nd','rd']
columns = ['Postal Code','Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append(f'{ind+1}{indicators[ind]} Most Common Venue')
    except IndexError:
        columns.append(f'{ind+1}th Most Common Venue')

neigh_venues_sorted = pd.DataFrame(columns = columns)
neigh_venues_sorted[['Postal Code','Neighborhood']] = ny_grouped[['Postal Code','Neighborhood']]
for ind in np.arange(ny_grouped.shape[0]):
#     neigh_venues_sorted.iloc[0,0:] 
    neigh_venues_sorted.iloc[ind,2:] = return_most_common_venues(ny_grouped.iloc[ind,:],num_top_venues)
neigh_venues_sorted

Unnamed: 0,Postal Code,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,M2H,Hillcrest Village,Dog Run,Pool,Golf Course,Fast Food Restaurant
1,M2J,"['Fairview', 'Henry Farm', 'Oriole']",Clothing Store,Shoe Store,Cosmetics Shop,Coffee Shop
2,M2K,Bayview Village,Gym,Chinese Restaurant,Japanese Restaurant,Café
3,M2L,"['York Mills', 'Silver Hills']",Park,Cafeteria,Martial Arts School,Yoga Studio
4,M2M,"['Willowdale', 'Newtonbrook']",Gym,Park,Home Service,Deli / Bodega
5,M2N,Willowdale South,Rental Car Location,Restaurant,Japanese Restaurant,Ramen Restaurant
6,M2P,York Mills West,Electronics Store,Park,Construction & Landscaping,Convenience Store
7,M2R,Willowdale West,Spa,Grocery Store,Butcher,Pizza Place
8,M3A,Parkwoods,Construction & Landscaping,BBQ Joint,Food & Drink Shop,Park
9,M3B,Don Mills North,Gym,Café,Japanese Restaurant,Baseball Field


In [22]:
# Based on the frequency of venue categories, we cluster the neighborhoods into 3 distinct groups
kclusters = 3
ny_grouped_clustering = ny_grouped.drop(columns = ['Postal Code','Neighborhood'])
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(ny_grouped_clustering)
print(f'Number of Cluster Labels: {len(kmeans.labels_)}')
print(f'Labels: {kmeans.labels_}')

Number of Cluster Labels: 24
Labels: [0 0 0 2 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0]
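
The number of clusters is fixed at 3 above. One common sanity check, sketched here on made-up data rather than on `ny_grouped_clustering`, is to compare the KMeans inertia (within-cluster sum of squared distances) for several values of k and look for the point where it stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency vectors: three rows of one kind, three of another.
X = np.array([[0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
              [0.10, 0.90], [0.20, 0.80], [0.15, 0.85]])

# n_init is set explicitly because its default differs across scikit-learn versions.
inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 4))
```

On this toy data the inertia falls steeply from k = 1 to k = 2 and only slightly afterwards, which matches the two groups built into it.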


In [23]:
neigh_venues_sorted.insert(0,'Cluster Labels',kmeans.labels_)
ny_merged = ny_df
ny_merged = ny_merged.join(neigh_venues_sorted.set_index('Postal Code'), on = 'Postal Code')
columns = [ny_merged.columns[1]] + [ny_merged.columns[0]] + list(ny_merged.columns[2:])  
ny_merged = ny_merged[columns]
ny_merged

Unnamed: 0,Borough,Postal Code,Neighbourhood,Latitude,Longitude,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,North York,M3A,Parkwoods,43.753259,-79.329656,1,Parkwoods,Construction & Landscaping,BBQ Joint,Food & Drink Shop,Park
1,North York,M4A,Victoria Village,43.725882,-79.315572,0,Victoria Village,Coffee Shop,Bridal Shop,Portuguese Restaurant,Hockey Arena
2,North York,M6A,"['Lawrence Manor', 'Lawrence Heights']",43.718518,-79.464763,0,"['Lawrence Manor', 'Lawrence Heights']",Clothing Store,Furniture / Home Store,Accessories Store,Home Service
3,North York,M3B,Don Mills North,43.745906,-79.352188,0,Don Mills North,Gym,Café,Japanese Restaurant,Baseball Field
4,North York,M6B,Glencairn,43.709577,-79.445073,0,Glencairn,Spa,Pizza Place,Japanese Restaurant,Metro Station
5,North York,M3C,Don Mills South (Flemingdon Park),43.7259,-79.340923,0,Don Mills South (Flemingdon Park),Clothing Store,Coffee Shop,Beer Store,Sporting Goods Shop
6,North York,M2H,Hillcrest Village,43.803762,-79.363452,0,Hillcrest Village,Dog Run,Pool,Golf Course,Fast Food Restaurant
7,North York,M3H,"['Bathurst Manor', 'Wilson Heights', 'Downsvie...",43.754328,-79.442259,0,"['Bathurst Manor', 'Wilson Heights', 'Downsvie...",Pharmacy,Mobile Phone Shop,Bank,Ice Cream Shop
8,North York,M2J,"['Fairview', 'Henry Farm', 'Oriole']",43.778517,-79.346556,0,"['Fairview', 'Henry Farm', 'Oriole']",Clothing Store,Shoe Store,Cosmetics Shop,Coffee Shop
9,North York,M3J,"['Northwood Park', 'York University']",43.76798,-79.487262,0,"['Northwood Park', 'York University']",Furniture / Home Store,Spa,Falafel Restaurant,Miscellaneous Shop


In [24]:
map_clusters = folium.Map(location = [ny_lat,ny_lng], zoom_start = 11)
# create one color per cluster, evenly spaced along the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lng, poi, cluster in zip(ny_merged.Latitude, ny_merged.Longitude, ny_merged.Neighbourhood, ny_merged['Cluster Labels']):
    label = folium.Popup(f'{str(poi)}\nCluster: {str(cluster)}', parse_html = True)
    folium.CircleMarker(
        [lat,lng],
        radius = 5,
        popup = label,
        color = rainbow[cluster],
        fill = True,
        fill_color = rainbow[cluster],
        fill_opacity = 0.7
    ).add_to(map_clusters)
map_clusters
# rainbow[3]

#### Examine the cluster
- Cluster 0 clearly dominates the map: 19 of the 24 postal codes fall into it, while cluster 1 has 4 and cluster 2 (in red) has only 1.
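
The cluster sizes can be tallied directly from the label array printed by the KMeans cell, which is copied verbatim below.

```python
import numpy as np

# Labels copied from the KMeans output printed earlier in this notebook.
labels = np.array([0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0,
                   0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

# bincount returns how many times each label 0, 1, 2 occurs.
sizes = np.bincount(labels).tolist()
print(sizes)  # [19, 4, 1]
```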

In [25]:
ny_merged.loc[ny_merged['Cluster Labels'] == 0, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Postal Code,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
1,M4A,0,Victoria Village,Coffee Shop,Bridal Shop,Portuguese Restaurant,Hockey Arena
2,M6A,0,"['Lawrence Manor', 'Lawrence Heights']",Clothing Store,Furniture / Home Store,Accessories Store,Home Service
3,M3B,0,Don Mills North,Gym,Café,Japanese Restaurant,Baseball Field
4,M6B,0,Glencairn,Spa,Pizza Place,Japanese Restaurant,Metro Station
5,M3C,0,Don Mills South (Flemingdon Park),Clothing Store,Coffee Shop,Beer Store,Sporting Goods Shop
6,M2H,0,Hillcrest Village,Dog Run,Pool,Golf Course,Fast Food Restaurant
7,M3H,0,"['Bathurst Manor', 'Wilson Heights', 'Downsvie...",Pharmacy,Mobile Phone Shop,Bank,Ice Cream Shop
8,M2J,0,"['Fairview', 'Henry Farm', 'Oriole']",Clothing Store,Shoe Store,Cosmetics Shop,Coffee Shop
9,M3J,0,"['Northwood Park', 'York University']",Furniture / Home Store,Spa,Falafel Restaurant,Miscellaneous Shop
10,M2K,0,Bayview Village,Gym,Chinese Restaurant,Japanese Restaurant,Café


In [26]:
ny_merged.loc[ny_merged['Cluster Labels'] == 1, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Postal Code,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,M3A,1,Parkwoods,Construction & Landscaping,BBQ Joint,Food & Drink Shop,Park
11,M3K,1,Downsview East (CFB Toronto),Electronics Store,Airport,Other Repair Shop,Construction & Landscaping
14,M6L,1,"['North Park', 'Maple Leaf Park', 'Upwood Park']",Massage Studio,Construction & Landscaping,Park,Bakery
22,M2P,1,York Mills West,Electronics Store,Park,Construction & Landscaping,Convenience Store


In [27]:
ny_merged.loc[ny_merged['Cluster Labels'] == 2, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Postal Code,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
12,M2L,2,"['York Mills', 'Silver Hills']",Park,Cafeteria,Martial Arts School,Yoga Studio


- Cluster 0 stands out for its variety of venue categories.
- The neighborhoods in cluster 0 offer visitors more kinds of venues. If we choose these neighborhoods as our destination, we will likely have more options for our trip.
- In contrast, clusters 1 and 2 are less diverse in venue categories, which may mean fewer amenities.