<h1>Toronto Neighbourhood Clustering</h1>
<p>Exploration of neighbourhoods in Toronto, and clustering</p>

In [1]:
# import libraries
import numpy as np
import pandas as pd

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans


In [3]:
# install and import folium
# run in conda console, 
# then check results in Anaconda Navigator | Environments ("update index"...)
###!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

<h2>Get the data</h2>
<p>Download the list of postcodes, boroughs and neighbourhoods from the <a href = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" target="blank">Wikipedia page</a> on postal codes of Canada.</p>

In [4]:
# Download the data and read the Postcode table into a dataframe
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_page = pd.read_html(url) #  list
df_neighborhoods = wiki_page[0]        # dataframe [287 rows x 3 columns]

df_neighborhoods.head()

# rename Postcode to PostalCode; the Borough and Neighbourhood columns keep their names
df_neighborhoods.rename(columns={"Postcode": "PostalCode"}, inplace=True)
df_neighborhoods.head()


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


<h3>Data Transformations: "Not assigned"</h3>
<ul>
    <li>Ignore cells with a borough that is "Not assigned".</li>
    <li>If a cell has a borough but a "Not assigned" neighbourhood, then the neighbourhood will be the same as the borough.</li>
</ul>

In [5]:
# Ignore cells with a borough that is "Not assigned"
df_neighborhoods = df_neighborhoods[df_neighborhoods['Borough']!='Not assigned'].reset_index(drop=True) 


In [6]:
# find the rows where neighbourhood is "Not assigned"
idx_list = df_neighborhoods[df_neighborhoods["Neighbourhood"] == "Not assigned"].index  # M9A Queen's park Not assigned

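# copy the borough name into the neighbourhood column for those rows
# (one vectorised assignment; the loop repeated the same assignment for every index)
df_neighborhoods.loc[idx_list, "Neighbourhood"] = df_neighborhoods.loc[idx_list, "Borough"]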

# check results:
df_neighborhoods.iloc[idx_list]
df_neighborhoods[df_neighborhoods["Neighbourhood"] == "Not assigned"] # should return no rows


Unnamed: 0,PostalCode,Borough,Neighbourhood


<h3>Group data on PostalCode and Borough</h3>

More than one neighbourhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighbourhoods: Harbourfront and Regent Park. These two rows are combined into one row, with the neighbourhoods separated by a comma (for example, M1B becomes "Rouge, Malvern" in the table below).

In [7]:
# Combine rows with the same postcode/borough: concatenate neighbourhoods
# create a new dataframe grouped on Postcode, Borough
df_toronto = df_neighborhoods.groupby(['PostalCode', 'Borough']).count().reset_index()

# create a list with the concatenated neighbourhoods
nb_list = []
for i in df_toronto.index:
    l1 = df_toronto.iloc[i]
    # get a subset for this Postcode, Borough
    temp = df_neighborhoods[(df_neighborhoods["PostalCode"]==l1["PostalCode"]) & (df_neighborhoods["Borough"]==l1["Borough"])]
    # concatenate neighbourhoods
    ln= ', '.join(temp['Neighbourhood'].tolist()) 
    # add to the list
    nb_list.append(ln)
# now put the list in the grouped dataframe    
df_toronto["Neighbourhood"] = nb_list

print("Dataframe shape: ", df_toronto.shape)  # for the assignment
df_toronto

Dataframe shape:  (103, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
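
The row-by-row loop above can probably be written as a single grouped aggregation. The following is a minimal sketch on a small hand-made frame; the three sample rows are illustrative (taken from the examples mentioned earlier), not the full scraped table, and the names `sample` and `combined` are only used here.

```python
import pandas as pd

# Illustrative rows only; the real frame comes from the Wikipedia table.
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Join all neighbourhoods that share a postal code and borough into one string.
combined = (
    sample.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Applied to `df_neighborhoods`, the same chain should yield one row per postal code without the explicit loop and the intermediate `nb_list`.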


<h3>Add the geospatial coordinates</h3>

First attempt with geocoder

In [8]:
### This loop takes forever, even for trying to get geodata for one postal code
# so - switched to loading the CSV file (next cell)

#import geocoder # import geocoder

# initialize your variable to None
#lat_lng_coords = None

# loop until you get the coordinates
#while lat_lng_coords is None:
#    g = geocoder.google('{}, Toronto, Ontario'.format(df_toronto[["PostalCode"]]))
#    lat_lng_coords = g.latlng
#     
#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
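
The commented-out loop above has no exit condition, so it spins forever when the service keeps returning nothing. It also formats the whole `PostalCode` column into the query instead of a single code. A bounded retry is one way around the first problem. The sketch below is an assumption-laden illustration: `lookup_with_retries` and `fake_lookup` are hypothetical helpers, and the fake stands in for a real geocoding call so the example runs offline.

```python
import time

def lookup_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None result or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # give up instead of looping forever

# Stand-in for a geocoding call: fails twice, then returns coordinates.
calls = {"count": 0}

def fake_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else (43.806686, -79.194353)

# One postal code per query, not the whole column.
coords = lookup_with_retries(fake_lookup, "M1B, Toronto, Ontario")
print(coords)
```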



Download csv file with geospatial data

In [9]:
# download the geospatial data and add to the dataframe
url_geo = 'http://cocl.us/Geospatial_data'
geospatial = pd.read_csv(url_geo)
geospatial.columns
geospatial.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
geospatial.head()

df_toronto = pd.merge(df_toronto, geospatial, on=['PostalCode'])
df_toronto.head()


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
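
`pd.merge` defaults to an inner join, so a postal code without coordinates would be dropped silently. A hedged way to check for that is a left join with `indicator=True` and `validate="one_to_one"`. The frames below are toy data (the third postal code deliberately has no match), and `checked` and `missing` are names used only in this sketch.

```python
import pandas as pd

left = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M1E"]})
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],          # M1E intentionally left out
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Keep every row of the left frame and record where each row came from.
checked = pd.merge(left, right, on="PostalCode", how="left",
                   validate="one_to_one", indicator=True)

# Postal codes that found no coordinates.
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

If `missing` is empty for the real data, the inner join used above loses nothing.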


<h4>Limit the dataset to boroughs with "Toronto" in the name</h4>

In [10]:
# df_toronto.shape  (103,5)
df_toronto['Borough'].unique() # list the distinct borough names

# create new dataframe with only "Toronto" - boroughs
trt_boroughs = df_toronto[df_toronto['Borough'].str.contains("Toronto", na=False)].reset_index(drop=True) 
trt_boroughs.shape  # (39,5)

(39, 5)

## Create a map

In [11]:
# Found Toronto location on-line:
# https://www.latlong.net/place/toronto-on-canada-27230.html
# Latitude and longitude coordinates are: 43.651070, -79.347015

trt_lat = 43.651070
trt_lng = -79.347015

# create map of Toronto using latitude and longitude values
map_trt = folium.Map(location=[trt_lat, trt_lng], zoom_start=11)

# add markers to map to indicate neighbourhoods
for lat, lng, borough, neighbourhood in zip(trt_boroughs['Latitude']
                                            , trt_boroughs['Longitude']
                                            , trt_boroughs['Borough']
                                            , trt_boroughs['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_trt)  
    
map_trt

<h2>Explore neighbourhoods in the Toronto Boroughs</h2>

In [12]:
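# @hidden_cell 
# Credentials are read from environment variables so they are not stored in the notebook.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are this notebook's own choice.
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20200202' # Foursquare API version

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)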

<h4>Define a function to get venues from FourSquare</h4>

In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
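
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue that comes back without a category. The sketch below shows one defensive way to flatten the response items. It assumes the item layout used in the function above (`venue`, `location`, `categories`); `extract_venue_rows` is a hypothetical helper, and the second sample item is made up to exercise the empty-category case.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn Foursquare 'explore' items into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows

sample_items = [
    {"venue": {"name": "Glen Manor Ravine",
               "location": {"lat": 43.676821, "lng": -79.293942},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "Uncategorised venue",   # made-up entry
               "location": {"lat": 43.68, "lng": -79.29},
               "categories": []}},
]

rows = extract_venue_rows("The Beaches", 43.676357, -79.293031, sample_items)
print(rows)
```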

<b>Define function to extract the category of the venue</b>

In [14]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

<h4>Get venues for all neighbourhoods</h4>

In [15]:
# Set parameters radius and limit
radius = 500  # define radius 
LIMIT = 100   # limit; for the top 100 venues

# call function with three lists: names, latitudes and longitudes
# extracted from dataframe trt_boroughs
trt_venues = getNearbyVenues(names=trt_boroughs['Neighbourhood'],
                                   latitudes=trt_boroughs['Latitude'],
                                   longitudes=trt_boroughs['Longitude']
                                  )
print(trt_venues.shape)  # (1706, 7); matches the 1706 rows of the one-hot frame below
trt_venues.head()

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The Junction Sout

Unnamed: 0,Neighbourhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


<strong>Number of venues returned</strong>

In [16]:
# print("Number of neighbourhoods: ", len(trt_venues.groupby('Neighbourhood'))) # 39
trt_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,55,55,55,55,55,55
"Brockton, Exhibition Place, Parkdale Village",24,24,24,24,24,24
Business Reply Mail Processing Centre 969 Eastern,17,17,17,17,17,17
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",16,16,16,16,16,16
"Cabbagetown, St. James Town",42,42,42,42,42,42
Central Bay Street,84,84,84,84,84,84
"Chinatown, Grange Park, Kensington Market",80,80,80,80,80,80
Christie,18,18,18,18,18,18
Church and Wellesley,81,81,81,81,81,81


<strong>Number of unique categories from all the returned venues</strong>

In [17]:
print('There are {} unique categories.'.format(len(trt_venues['Venue Category'].unique())))

There are 229 unique categories.


<h2>Analyse Each Neighbourhood</h2>

In [18]:
# one-hot encoding: create a frame with one column per venue category;
# each row (venue) gets a 1 in its own category column and 0 in all others
trt_onehot = pd.get_dummies(trt_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
trt_onehot['Neighbourhood'] = trt_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [trt_onehot.columns[-1]] + list(trt_onehot.columns[:-1])
trt_onehot = trt_onehot[fixed_columns]

trt_onehot.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Check dataframe size

In [19]:
trt_onehot.shape # (1706, 230)

(1706, 230)

<h4>Group rows by neighbourhood, taking the mean frequency of occurrence of each category</h4>

In [20]:
trt_grouped = trt_onehot.groupby('Neighbourhood').mean().reset_index()
trt_grouped

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0625,0.0625,0.0625,0.125,0.1875,0.125,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,...,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,0.011905,0.0,0.011905
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.05,0.0,0.05,0.0125,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.012346,0.0,0.0,0.0,0.0,0.0,0.0,0.012346,0.0,...,0.012346,0.0,0.0,0.0,0.0,0.0,0.012346,0.0,0.0,0.012346
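
To make the one-hot and group-mean steps concrete, here is a toy sketch with two invented neighbourhoods, `A` and `B`. The data is illustrative only. It shows that the mean of the 0/1 columns is the share of a neighbourhood's venues in each category, so each row sums to 1.

```python
import pandas as pd

# Invented venues: A has four venues, B has two.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Bar", "Park", "Pub", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean of the indicator columns = relative frequency per neighbourhood.
freq = onehot.groupby("Neighbourhood").mean().reset_index()
print(freq)
```

For `A` this gives Bar 0.5, Park 0.25 and Pub 0.25. For `B` all weight is on Park.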


Check the new size 

In [21]:
trt_grouped.shape  # (39, 230)

(39, 230)


<h3>Print each neighborhood along with the top 5 most common venues</h3>

In [22]:
num_top_venues = 5

for hood in trt_grouped['Neighbourhood']:
    print("---- "+hood+" ----")
    # take a slice: the neighborhood's data, and transpose
    temp = trt_grouped[trt_grouped['Neighbourhood'] == hood].T.reset_index()
    # reset column names
    temp.columns = ['venue','freq']
    # remove first line: that's the org. headers 
    temp = temp.iloc[1:]
    # convert freq to float and round 
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    # print the first 5 venues for each neighborhood 
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- Adelaide, King, Richmond ----
            venue  freq
0     Coffee Shop  0.07
1      Steakhouse  0.04
2             Bar  0.04
3            Café  0.04
4  Cosmetics Shop  0.03


---- Berczy Park ----
                venue  freq
0         Coffee Shop  0.07
1        Cocktail Bar  0.05
2          Steakhouse  0.04
3  Seafood Restaurant  0.04
4         Cheese Shop  0.04


---- Brockton, Exhibition Place, Parkdale Village ----
            venue  freq
0  Breakfast Spot  0.08
1            Café  0.08
2     Coffee Shop  0.08
3       Nightclub  0.08
4             Gym  0.04


---- Business Reply Mail Processing Centre 969 Eastern ----
              venue  freq
0        Smoke Shop  0.06
1     Garden Center  0.06
2       Pizza Place  0.06
3        Comic Shop  0.06
4  Recording Studio  0.06


---- CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara ----
                venue  freq
0     Airport Service  0.19
1      Airport Lounge  0.12
2    Ai

<h3>Store results in a dataframe</h3>

Define the function that sorts the venues on frequency

In [23]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create the new dataframe and display the top rows

In [24]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']  # for 1st, 2nd, 3rd

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))  # IndexError once ind >= 3
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))                   # default: suffix "th"

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = trt_grouped['Neighbourhood']

# loop through 39 rows, add categories of top 10
for ind in np.arange(trt_grouped.shape[0]): 
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(trt_grouped.iloc[ind, :], num_top_venues)

# Check: display top rows
neighbourhoods_venues_sorted.head()


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Bar,Steakhouse,Café,Restaurant,Cosmetics Shop,Asian Restaurant,Burger Joint,Breakfast Spot,Thai Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Beer Bar,Seafood Restaurant,Bakery,Farmers Market,Steakhouse,Cheese Shop,Café,Bistro
2,"Brockton, Exhibition Place, Parkdale Village",Breakfast Spot,Café,Nightclub,Coffee Shop,Yoga Studio,Gym,Pet Store,Performing Arts Venue,Office,Italian Restaurant
3,Business Reply Mail Processing Centre 969 Eastern,Skate Park,Recording Studio,Burrito Place,Fast Food Restaurant,Light Rail Station,Farmers Market,Auto Workshop,Restaurant,Spa,Pizza Place
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Airport Lounge,Boat or Ferry,Sculpture Garden,Rental Car Location,Coffee Shop,Harbor / Marina,Airport Gate,Airport Food Court
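
The column labels above rely on an exception to pick the "th" suffix. As an alternative sketch, a small helper can compute the ordinal directly, which also handles values such as 11, 12 and 13 if `num_top_venues` is ever raised. `ordinal` is a hypothetical helper, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 10
columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```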


<h2>Cluster the Neighbourhoods</h2>

Cluster the neighbourhoods into 5 clusters using k-means

In [25]:
# set number of clusters
kclusters = 5

trt_grouped_clustering = trt_grouped.drop(columns='Neighbourhood')
# frame of only categories and avg. frequencies

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(trt_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print("Cluster memberships:\n{}".format(kmeans.labels_))


Cluster memberships:
[2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 3 2 2 2 2 1 2 0 2 2 2 0 4 2 2 2 2 2 2 1
 1 2]
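
A quick way to see how the neighbourhoods are spread over the clusters is to count the labels. The sketch below copies the label array from the output above rather than reading `kmeans.labels_`, so it runs on its own.

```python
import numpy as np

# Labels copied from the "Cluster memberships" output above.
labels = np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 3, 2,
                   2, 2, 2, 1, 2, 0, 2, 2, 2, 0, 4, 2, 2, 2, 2, 2, 2, 1, 1, 2])

# Number of neighbourhoods per cluster label 0..4.
sizes = np.bincount(labels, minlength=5)
print(sizes.tolist())  # → [2, 4, 31, 1, 1]
```

Label 2 holds 31 of the 39 neighbourhoods, while labels 3 and 4 hold one each.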


Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [26]:
# add clustering labels

### Can do this only once: uncomment for restart! 
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
###

trt_merged = trt_boroughs

# join the sorted venues onto trt_boroughs to add latitude/longitude for each neighbourhood
trt_merged = trt_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
print(trt_merged.shape) # (39, 16)
trt_merged.head() # check the first rows
trt_merged.tail() # check the last rows

(39, 16)


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
34,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763,2,Mexican Restaurant,Café,Thai Restaurant,Bar,Grocery Store,Fried Chicken Joint,Park,Music Venue,Cajun / Creole Restaurant,Diner
35,M6R,West Toronto,"Parkdale, Roncesvalles",43.64896,-79.456325,2,Gift Shop,Breakfast Spot,Restaurant,Movie Theater,Coffee Shop,Eastern European Restaurant,Dessert Shop,Bookstore,Bar,Cuban Restaurant
36,M6S,West Toronto,"Runnymede, Swansea",43.651571,-79.48445,2,Café,Coffee Shop,Pizza Place,Sushi Restaurant,Italian Restaurant,Restaurant,Juice Bar,Pub,Sandwich Place,Bookstore
37,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,2,Coffee Shop,Park,Gym,College Auditorium,Sandwich Place,Salad Place,Restaurant,Burger Joint,Burrito Place,Café
38,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,2,Skate Park,Recording Studio,Burrito Place,Fast Food Restaurant,Light Rail Station,Farmers Market,Auto Workshop,Restaurant,Spa,Pizza Place


<strong>Visualize the resulting clusters</strong>

In [27]:
# create map
map_trt_clusters = folium.Map(location=[trt_lat, trt_lng], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(trt_merged['Latitude'], trt_merged['Longitude'], trt_merged['Neighbourhood'], trt_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_trt_clusters)
       
map_trt_clusters

## First conclusions
Of the 39 neighbourhoods, the large majority (31) landed in a single cluster (see below: Cluster 3).

Two clusters contain only one neighbourhood each, one contains two, and one contains four.

Perhaps five clusters is not the best way to group the Toronto Neighbourhoods
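
One common way to probe the choice of k is to compare silhouette scores across several values. The sketch below is only an illustration of the technique: it uses synthetic data (three well-separated blobs of 13 points each) as a stand-in for `trt_grouped_clustering`, so the numbers say nothing about Toronto. The names `features`, `scores` and `best_k` are used only here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the real feature matrix: 39 rows, 4 columns, 3 blobs.
features = np.vstack([
    rng.normal(loc=center, scale=0.05, size=(13, 4))
    for center in (0.0, 1.0, 2.0)
])

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On the real data, swapping `features` for `trt_grouped_clustering` would give one score per k. A higher score suggests tighter, better-separated clusters, though it is a heuristic and not a definitive answer.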


<h2>Examine Clusters</h2>

Let's look at the contents of each of the clusters, neighbourhoods and their top-10 most common venues:

#### Cluster 1

In [28]:
trt_merged.loc[trt_merged['Cluster Labels'] == 0, trt_merged.columns[[2] + list(range(6, trt_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,"Moore Park, Summerhill East",Park,Playground,Tennis Court,Restaurant,Dance Studio,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
10,Rosedale,Park,Playground,Trail,Dance Studio,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


#### Cluster 2

In [29]:
trt_merged.loc[trt_merged['Cluster Labels'] == 1, trt_merged.columns[[2] + list(range(6, trt_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,The Beaches,Health Food Store,Pub,Neighborhood,Trail,Discount Store,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Yoga Studio
2,"The Beaches West, India Bazaar",Park,Sushi Restaurant,Pet Store,Movie Theater,Pub,Burrito Place,Burger Joint,Liquor Store,Brewery,Sandwich Place
4,Lawrence Park,Lake,Swim School,Bus Line,Park,General Travel,Deli / Bodega,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
5,Davisville North,Park,Department Store,Gym,Breakfast Spot,Sandwich Place,Food & Drink Shop,Hotel,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


#### Cluster 3

In [30]:
trt_merged.loc[trt_merged['Cluster Labels'] == 2, trt_merged.columns[[2] + list(range(6, trt_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"The Danforth West, Riverdale",Greek Restaurant,Coffee Shop,Italian Restaurant,Restaurant,Ice Cream Shop,Furniture / Home Store,Frozen Yogurt Shop,Pub,Pizza Place,Liquor Store
3,Studio District,Café,Coffee Shop,Gastropub,Bakery,Brewery,Italian Restaurant,American Restaurant,Yoga Studio,Comfort Food Restaurant,Sandwich Place
6,North Toronto West,Clothing Store,Coffee Shop,Yoga Studio,Gym / Fitness Center,Salon / Barbershop,Restaurant,Rental Car Location,Park,Mexican Restaurant,Metro Station
7,Davisville,Sandwich Place,Dessert Shop,Pizza Place,Sushi Restaurant,Coffee Shop,Italian Restaurant,Gym,Café,Diner,Seafood Restaurant
9,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Coffee Shop,Pub,Pizza Place,Fried Chicken Joint,Vietnamese Restaurant,Supermarket,Light Rail Station,Sushi Restaurant,Liquor Store,American Restaurant
11,"Cabbagetown, St. James Town",Restaurant,Coffee Shop,Italian Restaurant,Pub,Café,Bakery,Pizza Place,Grocery Store,Pet Store,Sandwich Place
12,Church and Wellesley,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant,Pub,Men's Store,Mediterranean Restaurant,Hotel,Gym
13,Harbourfront,Coffee Shop,Park,Bakery,Café,Pub,Breakfast Spot,Restaurant,Mexican Restaurant,Farmers Market,Event Space
14,"Ryerson, Garden District",Coffee Shop,Clothing Store,Cosmetics Shop,Japanese Restaurant,Café,Fast Food Restaurant,Ramen Restaurant,Bookstore,Pizza Place,Bakery
15,St. James Town,Coffee Shop,Café,Restaurant,Cocktail Bar,Hotel,Cosmetics Shop,Breakfast Spot,Bakery,Beer Bar,American Restaurant


#### Cluster 4

In [31]:
trt_merged.loc[trt_merged['Cluster Labels'] == 3, trt_merged.columns[[2] + list(range(6, trt_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,"Forest Hill North, Forest Hill West",Jewelry Store,Trail,Sushi Restaurant,Bus Line,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


#### Cluster 5

In [32]:
trt_merged.loc[trt_merged['Cluster Labels'] == 4, trt_merged.columns[[2] + list(range(6, trt_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Roselawn,Health & Beauty Service,Pool,Garden,Yoga Studio,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
