# Cluster Neighborhoods in Toronto

## IBM Data Science Capstone Project

In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto based on the postalcode and borough information.. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas  dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

### Part 1

1. Import the relevant libaries and check the versions

In [1]:
!python -V

Python 3.7.3


In [2]:
import numpy as np
np.__version__

'1.16.2'

In [3]:
import pandas as pd
pd.__version__

'1.1.0'

In [20]:
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values



In [23]:
!pip install geocoder
import geocoder
geocoder.__version__

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


'1.38.1'

In [46]:
import json
json.__version__

'2.0.9'

In [47]:
import requests
requests.__version__

'2.18.3'

In [48]:
from sklearn.cluster import KMeans

In [49]:
import folium
folium.__version__

'0.5.0'

In [31]:
from time import sleep

In [50]:
from dotenv import load_dotenv

In [53]:
import os

In [158]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [123]:
import matplotlib.pyplot as plt 
%matplotlib inline

2. Scrape Wikipedia for [Toronto neighborhoods](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)

In [10]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)
df = df[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


3. Drop cells with a borough that is "Not assigned".

In [11]:
idx_to_drop = df[ df['Borough'] == "Not assigned" ].index
df.drop(idx_to_drop , inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


4. If a cell has a borough but a Not assigned  neighborhood, then the neighborhood will be the same as the borough.

In [16]:
idx_na = df[ df['Neighbourhood'] == "Not assigned" ].index
idx_na

Int64Index([], dtype='int64')

No remaining neighborhoods are not assigned. 

5. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11  in the above table.

In [17]:
df.shape

(103, 3)

In [18]:
df['Postal Code'].unique

<bound method Series.unique of 2      M3A
3      M4A
4      M5A
5      M6A
6      M7A
      ... 
160    M8X
165    M4Y
168    M7Y
169    M8Y
178    M8Z
Name: Postal Code, Length: 103, dtype: object>

The number of rows is equal to the number of unique postal codes, so there are no postal code duplicates. Multiple neighborhoods within a single postal code are in the Neighbourhood column separated by commas

### Part 2

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood. 

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code.

In [37]:
lat_lng_coords = None
counter = 0
# loop until you get the coordinates
while(lat_lng_coords is None):
    g = geocoder.google('Mountain View, CA')    #('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng
    
    counter += 1
    print(counter)
    sleep(0.5)

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

print(latitude, longitude)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160


KeyboardInterrupt: 

I've made over 100 calls to the geocoder without success, so I will use the csv provided instead. 

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [38]:
url = 'http://cocl.us/Geospatial_data'
df_latlng = pd.read_csv(url)
df_latlng.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now I need to add latitude and longitude into the first dataframe, based on postal code.

In [40]:
df_latlng.shape

(103, 3)

In [42]:
df3 = pd.merge(df, df_latlng, on=['Postal Code'])
df3.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


### Part 3

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you. 

Just make sure:

- to add enough Markdown cells to explain what you decided to do and to report any observations you make. 
- to generate maps to visualize your neighborhoods and how they cluster together. 


In [57]:
# I have my credientials in a .env file to keep them private. This code loads the credientials from .env (in same directory as Jupyter Notebook).

load_dotenv()

CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
VERSION = os.getenv('VERSION')

LIMIT = 100
test_LIMIT = 10

For each latitude and longitude, I want to explore the venues within a given radius. I made a function to build the url automatically for each query. I use a zip function to efficiently use information from the dataframe. I make a dictionary with the values corresponding to each entry in the dataframe and pass that to the function. The function returns the url, from which a json file is requested. The json file is cleaned up and converted to a dataframe. This outputs a single dataframe containing all returned venues for all postal codes. 

In [77]:
def url_builder(lat, lng, rad, action, item, lim) :
    # Making a function to build the url from uri and user input
    
    text_uri = 'https://api.foursquare.com/v2/'
    text_etc = 'client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'
    
    latitude = lat
    longitude = lng
    radius = rad
    
    item = item
    action = action
    limit = lim
    
    url = text_uri+item+'/'+action+'?'+text_etc
    url = url.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, limit)
    # print(url)
    return url



In [134]:
rad = 1000
lim = LIMIT

venues_list=[]

for lat, lng, postalcode, neighborhood in zip(df3['Latitude'], df3['Longitude'], df3['Postal Code'], df3['Neighbourhood']):
    dict1 = { 'lat' : lat , 
              'lng' : lng , 
              'rad' : rad , 
              'action' : 'explore' ,
              'item' : 'venues' ,
              'lim' : lim }
    
    url = url_builder(**dict1)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
        
        
    venues_list.append([(
            postalcode, neighborhood, 
            lat, lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 'Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    

In [135]:
nearby_venues.head()

Unnamed: 0,Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,Parkwoods,43.753259,-79.329656,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,M3A,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
2,M3A,Parkwoods,43.753259,-79.329656,Tim Hortons,43.760668,-79.326368,Café
3,M3A,Parkwoods,43.753259,-79.329656,Bruno's valu-mart,43.746143,-79.32463,Grocery Store
4,M3A,Parkwoods,43.753259,-79.329656,A&W,43.760643,-79.326865,Fast Food Restaurant


In [136]:
nearby_venues.shape

(4893, 8)

In [137]:
nearby_venues.groupby('Postal Code').count()

Unnamed: 0_level_0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
M1B,19,19,19,19,19,19,19
M1C,5,5,5,5,5,5,5
M1E,23,23,23,23,23,23,23
M1G,9,9,9,9,9,9,9
M1H,28,28,28,28,28,28,28
...,...,...,...,...,...,...,...
M9N,15,15,15,15,15,15,15
M9P,16,16,16,16,16,16,16
M9R,15,15,15,15,15,15,15
M9V,15,15,15,15,15,15,15


In [138]:
print('There are {} uniques categories.'.format(len(nearby_venues['Venue Category'].unique())))

There are 334 uniques categories.


Using one-hot encoding for the venue categories.

In [139]:
# one hot encoding
nearby_onehot = pd.get_dummies(nearby_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
nearby_onehot['Postal Code'] = nearby_venues['Postal Code'] 

# move neighborhood column to the first column
fixed_columns = [nearby_onehot.columns[-1]] + list(nearby_onehot.columns[:-1])
nearby_onehot = nearby_onehot[fixed_columns]

nearby_onehot.head()

Unnamed: 0,Postal Code,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [140]:
nearby_onehot.shape

(4893, 335)

Get the mean of the frequency of each venue type within each postal code. Sort to find the most common venue types in each postal code and make a new dataframe that shows each postal code with only the top n venue types. 

In [141]:
nearby_grouped = nearby_onehot.groupby('Postal Code').mean().reset_index()
nearby_grouped.head()

Unnamed: 0,Postal Code,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.035714,0.0


In [142]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [143]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = nearby_grouped['Postal Code']

for ind in np.arange(nearby_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(nearby_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Coffee Shop,Trail,Fast Food Restaurant,Chinese Restaurant,Bus Station,Bakery,Caribbean Restaurant,Supermarket,Restaurant,Sandwich Place
1,M1C,Italian Restaurant,Breakfast Spot,Playground,Burger Joint,Park,Zoo,Electronics Store,Elementary School,Ethiopian Restaurant,Event Space
2,M1E,Pizza Place,Bank,Fast Food Restaurant,Restaurant,Coffee Shop,Sandwich Place,Electronics Store,Supermarket,Discount Store,Beer Store
3,M1G,Park,Coffee Shop,Chinese Restaurant,Indian Restaurant,Pharmacy,Fast Food Restaurant,Mobile Phone Shop,Fireworks Store,Falafel Restaurant,Eastern European Restaurant
4,M1H,Gas Station,Bank,Bakery,Coffee Shop,Indian Restaurant,Grocery Store,Chinese Restaurant,Lawyer,Music Store,Restaurant


Cluster the neighborhoods based on their most common venue types. Using k-means unsupervised machine learning algorithm. The data are unlabeled, meaning we do not know the 'true' neigborhood classification. Therefore, our previous method of determining optimum k (calculating error for multiple k and chosing the 'elbow point') will not work. In this case, the [Silhouette analysis](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py) can help to determine optimum k (high silhoutte score).

In [128]:
from sklearn.metrics import silhouette_samples, silhouette_score

In [144]:
# using 
for k in range(3, 15) :
    nearby_grouped_clustering = nearby_grouped.drop('Postal Code', 1)
    kmeans = KMeans( n_clusters = k, random_state = 0 , init='k-means++' ).fit(nearby_grouped_clustering)
    silhouette_avg = silhouette_score(nearby_grouped_clustering, kmeans.labels_)
    print("For n_clusters =", k, "The average silhouette_score is :", silhouette_avg, " inertia = ", kmeans.inertia_)

For n_clusters = 3 The average silhouette_score is : 0.11102709076663066  inertia =  4.443695673737789
For n_clusters = 4 The average silhouette_score is : 0.09868915298467931  inertia =  4.22820748713241
For n_clusters = 5 The average silhouette_score is : 0.11443155163153058  inertia =  3.8588725184967103
For n_clusters = 6 The average silhouette_score is : 0.10831952779601572  inertia =  3.640557173451673
For n_clusters = 7 The average silhouette_score is : 0.10286161603938897  inertia =  3.593103869334795
For n_clusters = 8 The average silhouette_score is : 0.1095628812915048  inertia =  3.4067937273234987
For n_clusters = 9 The average silhouette_score is : 0.012782207363586345  inertia =  3.2569608433639345
For n_clusters = 10 The average silhouette_score is : 0.11704160372403316  inertia =  3.168829046628992
For n_clusters = 11 The average silhouette_score is : 0.1350729779439762  inertia =  3.05494760401153
For n_clusters = 12 The average silhouette_score is : 0.042460923604634

Silhouette score is higher for lower k. Best values appear to be k = {3, 4, 5, 6, 8}

In [145]:
# redo k-mean for optimum k
best_k = 5

nearby_grouped_clustering = nearby_grouped.drop('Postal Code', 1)
kmeans = KMeans( n_clusters = best_k, random_state = 0 ).fit(nearby_grouped_clustering)

# add the cluster label into the grouped dataframe
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)


In [146]:
# insert latitude and longitude

nearby_merged = df3 

nearby_merged = nearby_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')

nearby_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0.0,Park,Shopping Mall,Pharmacy,Bus Stop,ATM,Supermarket,Food & Drink Shop,Convenience Store,Fish & Chips Shop,Fast Food Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,3.0,Coffee Shop,Portuguese Restaurant,Hockey Arena,Pizza Place,Men's Store,Golf Course,Lounge,Grocery Store,Gym / Fitness Center,Zoo
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,3.0,Coffee Shop,Park,Pub,Café,Theater,Restaurant,Bakery,Italian Restaurant,Breakfast Spot,Sushi Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,3.0,Clothing Store,Furniture / Home Store,Coffee Shop,Fast Food Restaurant,Restaurant,Dessert Shop,Vietnamese Restaurant,Fried Chicken Joint,Sushi Restaurant,Cosmetics Shop
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,3.0,Coffee Shop,Park,Pizza Place,Café,Sushi Restaurant,Italian Restaurant,Gastropub,Middle Eastern Restaurant,Thai Restaurant,Clothing Store


In [159]:
# visualize clusters

# create map
lat_init = nearby_merged['Latitude'][0]
lng_init = nearby_merged['Longitude'][0]
map_clusters = folium.Map(location=[lat_init, lng_init], zoom_start=11)

# set color scheme for the clusters
x = np.arange(best_k)
ys = [i + x + (i*x)**2 for i in range(best_k)]
colors_array = cm.jet(np.linspace(0, 1, len(ys)))
print(colors_array[0])
jet = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(nearby_merged['Latitude'], 
                                  nearby_merged['Longitude'], 
                                  nearby_merged['Postal Code'], 
                                  nearby_merged['Cluster Labels']):
    try : 
        cluster = int(cluster)
    except :
        break

    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=jet[cluster-1],
        fill=True,
        fill_color=jet[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    
    print(cluster, jet[cluster-1])
       
map_clusters

[0.  0.  0.5 1. ]
0 #800000
3 #7dff7a
3 #7dff7a
3 #7dff7a
3 #7dff7a
0 #800000
0 #800000
3 #7dff7a
0 #800000
3 #7dff7a
0 #800000
1 #000080
1 #000080
3 #7dff7a
3 #7dff7a
3 #7dff7a
0 #800000
3 #7dff7a
0 #800000
3 #7dff7a
3 #7dff7a
0 #800000
0 #800000
3 #7dff7a
3 #7dff7a
3 #7dff7a
0 #800000
0 #800000
0 #800000
3 #7dff7a
3 #7dff7a
3 #7dff7a
0 #800000
3 #7dff7a
0 #800000
3 #7dff7a
3 #7dff7a
3 #7dff7a
0 #800000
0 #800000
3 #7dff7a
3 #7dff7a
3 #7dff7a
3 #7dff7a
0 #800000
4 #ff9400
1 #000080
3 #7dff7a
3 #7dff7a
0 #800000
0 #800000
0 #800000
3 #7dff7a
2 #0080ff
3 #7dff7a
3 #7dff7a
0 #800000
1 #000080
3 #7dff7a
3 #7dff7a
0 #800000
3 #7dff7a
3 #7dff7a
0 #800000
0 #800000
0 #800000
1 #000080
3 #7dff7a
3 #7dff7a
3 #7dff7a
0 #800000
0 #800000
0 #800000
3 #7dff7a
3 #7dff7a
3 #7dff7a
3 #7dff7a
0 #800000
0 #800000
3 #7dff7a
3 #7dff7a
3 #7dff7a
0 #800000
3 #7dff7a
3 #7dff7a
0 #800000
3 #7dff7a
3 #7dff7a
0 #800000
0 #800000
0 #800000
3 #7dff7a
3 #7dff7a
0 #800000
3 #7dff7a


Two clusters have only a single member, indicating that a lower k value could be used. 

What are the defining features of each cluster? 

In [180]:
labels_list = ['1st Most Common Venue', 
               '2nd Most Common Venue', 
               '3rd Most Common Venue',
               '4th Most Common Venue',
               '5th Most Common Venue', 
               '6th Most Common Venue',
               '7th Most Common Venue',
               '8th Most Common Venue',
               '9th Most Common Venue',
               '10th Most Common Venue']

In [219]:
top5_list = []

for n in range(0, best_k) :
    clusterN = nearby_merged.loc[nearby_merged['Cluster Labels'] == n, 
                                 nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

    dict_list = []
    for label in labels_list :
        f0 = clusterN[label].value_counts()
        f0 = f0.to_dict()
        dict_list.append(f0)

    df0 = pd.DataFrame.from_dict(dict_list)
    df0 = df0.fillna(0)
    df0a = pd.DataFrame(df0.sum(axis=0))
    top5_list.append(df0a.index[0:5])


In [222]:
df_top5 = pd.DataFrame(top5_list)
df_top5.columns = ['1st Most Common Venue', 
               '2nd Most Common Venue', 
               '3rd Most Common Venue',
               '4th Most Common Venue',
               '5th Most Common Venue']
df_top5

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Pizza Place,Coffee Shop,Park,Pharmacy,Chinese Restaurant
1,Park,Italian Restaurant,Convenience Store,Pizza Place,Discount Store
2,Vietnamese Restaurant,Food Truck,Baseball Field,Farmers Market,Eastern European Restaurant
3,Coffee Shop,Café,Park,Restaurant,Italian Restaurant
4,Park,Pool,Farm,Dumpling Restaurant,Eastern European Restaurant


The dataframe above helps us understand how the clusters are broken up. For example, Cluster 0 has pizza, coffee, park, pharmacy, and Chinese restuarant as the most common venues. Four of the five clusters have parks present. Only Cluster 2 has a farmers' market. 