# IBM Data Science Course - Final Capstone Project

## Introduction

The goal of this assignment is to explore and cluster the neighbourhoods in Toronto.

The notebook is organized as follows:
***

### Assignment Part 1

<b>0. Introduction</b>: Import required libraries

<b>1. Scrape Toronto neighbourhood data</b>: Scrape Toronto neighbourhood data from Wikipedia

### Assignment Part 2

<b>2. Getting longitudes and latitudes</b>: Use a CSV file of longitudes and latitudes to enrich the Toronto neighbourhood data

### Assignment Part 3

<b>3. Explore & segment neighborhoods</b>: Utilize the Foursquare API to explore the neighborhoods and segment them

<b>4. Analyze Each Neighborhood</b>: Group venues by neighbourhood and analyze most common venues

<b>5. Examine clusters</b>: Examine each cluster and determine the discriminating venue categories that distinguish each cluster
***


In [2]:
import pandas as pd
import numpy as np

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import folium

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas 1.0 or later)

from sklearn.cluster import KMeans # import k-means from clustering stage

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Libraries successfully imported. Ready to explore and cluster!')


Libraries successfully imported. Ready to explore and cluster!


## 1. Scraping Toronto neighbourhoods data

First, let's scrape the table of Toronto neighbourhoods from the Wikipedia page at <a href="https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050" target="_blank">this link</a>.


In [3]:
df = pd.read_html('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050')[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Only cells that have an assigned borough are to be processed. Cells with a borough that is 'Not assigned' are to be ignored. 

The following code drops the rows with no assigned borough.

In [4]:
df = df[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


More than one neighbourhood can exist in one postal code area. 

The following code combines these into one row per postal code, with the neighbourhoods separated by commas.

In [5]:
# create empty dataframe df1 to write combined neighbourhood into
df1 = pd.DataFrame(columns=['Postcode','Borough','Neighbourhood'])

# loop through each unique postcode and write its postcode, borough, list of neighbourhoods into df1 dataframe
for postcode in df['Postcode'].unique():

    borough = df[df['Postcode'] == postcode].iloc[0,1]

    # create a list of all the neighbourhoods that share the given unique postcode
    neighbourhood_list = list(n for n in df[df['Postcode'] == postcode]['Neighbourhood'])
    
    # join the neighbourhoods for the given postcode into one comma-separated string
    neighbourhood = ', '.join(neighbourhood_list)
    
    # add unique postcode, borough and string of neighbourhoods to a dictionary
    data = {'Postcode': postcode, 'Borough': borough, 'Neighbourhood': neighbourhood}
    
    # append the dictionary as a new row of df1 (DataFrame.append was removed in pandas 2.0)
    df1 = pd.concat([df1, pd.DataFrame([data])], ignore_index=True)
    
df1.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
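The same combination can also be expressed without an explicit loop. The sketch below is one possible alternative, not the notebook's method: it groups a small sample frame (three rows copied from the table above) by postcode and borough and joins the neighbourhood names. The variable names `sample` and `combined` are illustrative.

```python
import pandas as pd

# Three rows taken from the scraped table shown earlier
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'North York', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor'],
})

# sort=False keeps the postcodes in their original order of appearance
combined = (
    sample.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
          .agg(', '.join)
          .reset_index()
)
print(combined)
```

Grouping on both `Postcode` and `Borough` assumes each postcode belongs to a single borough, which holds for the rows shown here.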


For rows that have a borough but whose neighbourhood is 'Not assigned', the neighbourhood is set to the borough name.

The following code copies the borough into the neighbourhood field wherever the neighbourhood has the value 'Not assigned' (the spelling used in the scraped table), using a `.loc` assignment.

The dataframe <b>df2</b> holds all the records with a neighbourhood value of 'Not assigned'; its index selects the rows to update in the main dataframe <b>df1</b>.

In [6]:
df2 = df1[df1['Neighbourhood'] == 'Not assigned']
df1.loc[df2.index, 'Neighbourhood'] = df2['Borough']
print("Number of records with neighbourhood 'Not assigned' that were changed to match the borough: ", df2.shape[0])

Number of records with neighbourhood 'Not assigned' that were changed to match the borough:  0


In [7]:
df1.shape

(103, 3)
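Because no record in this revision of the table needed the borough fallback, the rule from this section is easier to see on a tiny example. The rows below are made up for illustration (the postcode `M9Z` is hypothetical), and the boolean-mask approach is one possible way to apply the rule.

```python
import pandas as pd

# Made-up rows for illustration; the second row has no neighbourhood assigned
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M9Z'],
    'Borough': ['Downtown Toronto', 'Etobicoke'],
    'Neighbourhood': ['Harbourfront', 'Not assigned'],
})

# Select the rows to fix, then copy the borough into the neighbourhood column
mask = sample['Neighbourhood'] == 'Not assigned'
sample.loc[mask, 'Neighbourhood'] = sample.loc[mask, 'Borough']

print(sample)
print('Rows changed:', int(mask.sum()))
```

The comparison string must match the table's spelling exactly, including letter case, otherwise the mask selects nothing.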

## 2. Getting longitudes and latitudes for Toronto Postcodes from csv file

Download geospatial data from <a href="https://cocl.us/Geospatial_data" target="_blank">this link</a>.


In [8]:
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data

In [9]:
# Load the geospatial file in df_lonlat dataframe
df_lonlat = pd.read_csv('Geospatial_Coordinates.csv')
df_lonlat.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [10]:
# rename 'Postal Code' column in the geospatial data file to facilitate merging into main dataframe

df_lonlat.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
df_lonlat.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [11]:
# merge the longitude and latitude data in the geospatial dataframe with the main dataframe containing Toronto neighbourhood data

df1 = pd.merge(df1, df_lonlat, on = 'Postcode', how = 'left')


In [12]:
df1.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
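A left merge keeps every neighbourhood row even when no coordinates are found, so it can be worth checking for postcodes that did not match. The following is a minimal sketch on sample data: the first two rows reuse values from the tables above, while the postcode `M0X` is hypothetical and added only to show an unmatched row.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M0X'],          # 'M0X' is a made-up postcode
    'Borough': ['North York', 'North York', 'Hypothetical'],
})
coords = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate raises if a postcode is duplicated on either side;
# indicator adds a '_merge' column recording where each row came from
merged = pd.merge(neighbourhoods, coords, on='Postcode', how='left',
                  validate='one_to_one', indicator=True)

unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['Postcode', 'Borough']])
```

Rows flagged `left_only` have missing latitude and longitude and would need to be handled before mapping.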


In [13]:
# Get Toronto longitude and latitude in preparation for mapping the neighbourhoods in Toronto
# Use geopy library to get the latitude and longitude values of Toronto.

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


#### Create a map of Toronto with neighborhoods superimposed on top.

In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df1['Latitude'], df1['Longitude'], df1['Borough'], df1['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

To simplify the map above, and to segment and cluster only the boroughs whose name contains `Toronto`, slice the original dataframe into a new dataframe that holds only those rows.

In [15]:
# keep only the rows whose borough name contains 'Toronto'
toronto_data = df1[df1['Borough'].str.contains('Toronto')]
toronto_data.head(40)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
31,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259


In [16]:
# create map of Toronto using latitude and longitude values
map_toronto_filtered = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_filtered)  
    
map_toronto_filtered

## 3. Explore & segment neighborhoods

Utilize the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [17]:
# The code was removed by Watson Studio for sharing.

The following function explores each neighbourhood, using its latitude and longitude to retrieve the top 100 venues within a radius of 500 metres.

For each neighbourhood, the function sends a GET request, reads the results in JSON format, and collects the relevant information on the nearby venues. It returns all of it in a single pandas dataframe.

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
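The list comprehension in the function above assumes that every venue has at least one category. The sketch below shows one way the extraction step could be made tolerant of missing fields. The helper name `parse_explore_items` is hypothetical, and `sample_items` is a hand-written stand-in that mirrors only the keys the function reads (`venue`, `name`, `location`, `categories`); it is not a real API response.

```python
def parse_explore_items(name, lat, lng, items):
    """Turn a list of venue items into rows like those built in getNearbyVenues."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to None instead of raising IndexError on an empty list
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows


# Hand-written sample shaped like the items the function iterates over
sample_items = [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]

rows = parse_explore_items('Harbourfront', 43.65426, -79.360636, sample_items)
for row in rows:
    print(row)
```

Keeping the parsing separate from the HTTP call also makes it possible to test the extraction without network access or credentials.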

Run the above function to get relevant information on the nearby venues for each neighborhood and create a new dataframe called `toronto_venues`.

In [19]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Harbourfront
Queen's Park
Ryerson, Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
The Danforth West, Riverdale
Design Exchange, Toronto Dominion Centre
Brockton, Exhibition Place, Parkdale Village
The Beaches West, India Bazaar
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North, Forest Hill West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
Harbord, University of Toronto
Runnymede, Swansea
Moore Park, Summerhill East
Chinatown, Grange Park, Kensington Market
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown, St. James Town
Fir

In [20]:
# check the size of the toronto_venues dataframe from running the function above

print(toronto_venues.shape)
toronto_venues.head(10)

(1622, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
4,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
5,Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
6,Harbourfront,43.65426,-79.360636,The Extension Room,43.653313,-79.359725,Gym / Fitness Center
7,Harbourfront,43.65426,-79.360636,The Distillery Historic District,43.650244,-79.359323,Historic Site
8,Harbourfront,43.65426,-79.360636,Corktown Common,43.655618,-79.356211,Park
9,Harbourfront,43.65426,-79.360636,SOMA chocolatemaker,43.650622,-79.358127,Chocolate Shop


Check how many venues were returned for each neighborhood.

In [21]:
toronto_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",97,97,97,97,97,97
Berczy Park,57,57,57,57,57,57
"Brockton, Exhibition Place, Parkdale Village",25,25,25,25,25,25
Business Reply Mail Processing Centre 969 Eastern,16,16,16,16,16,16
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",17,17,17,17,17,17
"Cabbagetown, St. James Town",45,45,45,45,45,45
Central Bay Street,62,62,62,62,62,62
"Chinatown, Grange Park, Kensington Market",62,62,62,62,62,62
Christie,16,16,16,16,16,16
Church and Wellesley,79,79,79,79,79,79


#### Find out how many unique categories can be curated from all the returned venues


In [22]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 236 unique categories.


## 4. Analyze Each Neighborhood

In [23]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
# examine the new dataframe size.

toronto_onehot.shape

(1622, 237)

#### Group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [25]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020619,0.0,...,0.0,0.0,0.0,0.0,0.010309,0.0,0.0,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.016129,0.0,0.016129
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.064516,0.0,0.048387,0.016129,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.012658,0.012658,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025316


In [26]:
#  confirm the new size

toronto_grouped.shape

(39, 237)
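To see why the mean of the one-hot columns is a frequency, consider a small made-up example: a neighbourhood with two cafés and one park gets 2/3 for "Café" and 1/3 for "Park". The sketch below uses illustrative data and variable names (`venues`, `onehot`, `freq`) and groups by the neighbourhood column directly rather than re-attaching it.

```python
import pandas as pd

# Made-up venues: neighbourhood A has three venues, B has one
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Bakery'],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# The mean of the 0/1 columns is the share of venues in each category
freq = onehot.groupby(venues['Neighbourhood']).mean()
print(freq)
```

Each row of `freq` sums to 1, so neighbourhoods with very different venue counts become comparable.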

#### Print each neighborhood along with the top 5 most common venues

In [27]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
             venue  freq
0      Coffee Shop  0.09
1             Café  0.05
2       Restaurant  0.04
3            Hotel  0.04
4  Thai Restaurant  0.03


----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.05
2      Farmers Market  0.04
3  Seafood Restaurant  0.04
4         Cheese Shop  0.04


----Brockton, Exhibition Place, Parkdale Village----
                   venue  freq
0                   Café  0.12
1         Breakfast Spot  0.08
2  Performing Arts Venue  0.08
3                 Bakery  0.08
4            Coffee Shop  0.08


----Business Reply Mail Processing Centre 969 Eastern----
                venue  freq
0             Brewery  0.06
1                 Spa  0.06
2          Restaurant  0.06
3  Light Rail Station  0.06
4       Auto Workshop  0.06


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0   Airport Ser

#### Load the most common venues into a pandas dataframe

First, create a function to sort the venues in descending order of frequency.

In [28]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
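As a quick check of what this function returns, it can be called on a single hand-built row. The row below is made up for illustration; its first entry plays the role of the `Neighbourhood` column, which the function skips.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighbourhood name), then sort by frequency
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Made-up row: a neighbourhood name followed by category frequencies
row = pd.Series({'Neighbourhood': 'Example', 'Café': 0.5, 'Park': 0.3, 'Bar': 0.2})

top = return_most_common_venues(row, 2)
print(list(top))
```

The function returns category names, not frequencies, and the order among categories with equal frequency is not guaranteed.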

Create the new dataframe and display the top 10 venues for each neighborhood.

In [29]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head(10)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Restaurant,Hotel,Deli / Bodega,Thai Restaurant,Gym,Bakery,Concert Hall,Lounge
1,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Pharmacy,Cheese Shop,Restaurant,Farmers Market,Beer Bar,Seafood Restaurant,Japanese Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Café,Performing Arts Venue,Coffee Shop,Breakfast Spot,Bakery,Stadium,Burrito Place,Restaurant,Climbing Gym,Pet Store
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Garden Center,Brewery,Farmers Market,Spa,Fast Food Restaurant,Burrito Place,Butcher,Restaurant,Auto Workshop
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Lounge,Airport Terminal,Harbor / Marina,Bar,Rental Car Location,Plane,Coffee Shop,Boat or Ferry,Boutique
5,"Cabbagetown, St. James Town",Coffee Shop,Bakery,Pizza Place,Restaurant,Chinese Restaurant,Café,Pub,Italian Restaurant,Beer Store,Playground
6,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Salad Place,Bubble Tea Shop,Burger Joint,Thai Restaurant,Modern European Restaurant,Comic Shop
7,"Chinatown, Grange Park, Kensington Market",Café,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Mexican Restaurant,Coffee Shop,Arts & Crafts Store,Comfort Food Restaurant,Caribbean Restaurant,Grocery Store,Park
8,Christie,Grocery Store,Café,Park,Baby Store,Nightclub,Italian Restaurant,Restaurant,Athletics & Sports,Coffee Shop,Candy Store
9,Church and Wellesley,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Yoga Studio,Pub,Fast Food Restaurant,Men's Store,Mediterranean Restaurant
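The column names above rely on a three-item suffix list with a fallback to "th", which is enough for ten columns. If more columns were ever needed, a general ordinal helper might look like the sketch below; the function name `ordinal` is hypothetical and not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th are exceptions to the last-digit rule
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


columns = ['Neighbourhood'] + [f'{ordinal(i)} Most Common Venue' for i in range(1, 11)]
print(columns)
```

For the ten columns used here, this produces the same headers as the loop in the cell above.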


### Cluster Neighbourhoods

Run k-means to cluster the neighbourhoods into 5 clusters.

In [30]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:40] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 4, 1, 1, 1, 4, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1], dtype=int32)
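Before interpreting the clusters, it helps to count how many neighbourhoods received each label. The sketch below copies the label array printed above into a NumPy array and tallies it with `np.bincount`; the variable names are illustrative.

```python
import numpy as np

# Labels copied from the k-means output above (39 neighbourhoods)
labels = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
                   1, 1, 4, 1, 1, 1, 4, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1])

# counts[k] is the number of neighbourhoods assigned to label k
counts = np.bincount(labels, minlength=5)
for label, count in enumerate(counts):
    print(f'label {label}: {count} neighbourhoods')
```

With these labels, 34 of the 39 neighbourhoods fall under label 1, so the remaining clusters each describe only one or two neighbourhoods.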

Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [31]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# join the cluster labels and top venues onto toronto_data, which holds the latitude/longitude for each neighbourhood
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,1,Coffee Shop,Park,Bakery,Café,Pub,Breakfast Spot,Theater,Cosmetics Shop,Shoe Store,Brewery
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,1,Coffee Shop,Sushi Restaurant,Yoga Studio,College Cafeteria,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café,Portuguese Restaurant
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,1,Coffee Shop,Clothing Store,Café,Japanese Restaurant,Middle Eastern Restaurant,Bubble Tea Shop,Cosmetics Shop,Diner,Lingerie Store,Bookstore
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Cosmetics Shop,Cocktail Bar,Clothing Store,American Restaurant,Gym,Hotel,Farmers Market,Department Store
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Health Food Store,Pub,Neighborhood,Trail,Yoga Studio,Dog Run,Diner,Discount Store,Distribution Center,Donut Shop


Finally, visualize the resulting clusters

In [32]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Examine Clusters

Examine each cluster and determine the discriminating venue categories that distinguish each cluster.
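Besides reading the tables by eye, one possible way to find discriminating categories is to compare each cluster's mean category frequency with the overall mean. This is a sketch on made-up frequencies, not the analysis performed in this notebook; `freq`, `labels`, and `lift` are illustrative names.

```python
import pandas as pd

# Made-up category frequencies for three neighbourhoods
freq = pd.DataFrame(
    {'Coffee Shop': [0.5, 0.4, 0.0],
     'Park': [0.1, 0.0, 0.6],
     'Trail': [0.0, 0.1, 0.4]},
    index=['A', 'B', 'C'],
)
# Made-up cluster labels: A and B in cluster 0, C in cluster 1
labels = pd.Series([0, 0, 1], index=freq.index, name='Cluster Labels')

# Mean frequency per cluster, minus the mean over all neighbourhoods
cluster_means = freq.groupby(labels).mean()
lift = cluster_means - freq.mean()

# The category with the largest positive difference in each cluster
top_category = lift.idxmax(axis=1)
print(top_category)
```

A large positive difference marks a category that is over-represented in a cluster relative to the city as a whole.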

#### Cluster 1

In [33]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
68,Central Toronto,0,Trail,Jewelry Store,Bus Line,Sushi Restaurant,Yoga Studio,Diner,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store


#### Cluster 2

In [34]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,1,Coffee Shop,Park,Bakery,Café,Pub,Breakfast Spot,Theater,Cosmetics Shop,Shoe Store,Brewery
4,Downtown Toronto,1,Coffee Shop,Sushi Restaurant,Yoga Studio,College Cafeteria,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café,Portuguese Restaurant
9,Downtown Toronto,1,Coffee Shop,Clothing Store,Café,Japanese Restaurant,Middle Eastern Restaurant,Bubble Tea Shop,Cosmetics Shop,Diner,Lingerie Store,Bookstore
15,Downtown Toronto,1,Coffee Shop,Café,Cosmetics Shop,Cocktail Bar,Clothing Store,American Restaurant,Gym,Hotel,Farmers Market,Department Store
20,Downtown Toronto,1,Coffee Shop,Cocktail Bar,Bakery,Pharmacy,Cheese Shop,Restaurant,Farmers Market,Beer Bar,Seafood Restaurant,Japanese Restaurant
24,Downtown Toronto,1,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Salad Place,Bubble Tea Shop,Burger Joint,Thai Restaurant,Modern European Restaurant,Comic Shop
25,Downtown Toronto,1,Grocery Store,Café,Park,Baby Store,Nightclub,Italian Restaurant,Restaurant,Athletics & Sports,Coffee Shop,Candy Store
30,Downtown Toronto,1,Coffee Shop,Café,Restaurant,Hotel,Deli / Bodega,Thai Restaurant,Gym,Bakery,Concert Hall,Lounge
31,West Toronto,1,Bakery,Pharmacy,Music Venue,Middle Eastern Restaurant,Bank,Bar,Café,Pool,Supermarket,Grocery Store
36,Downtown Toronto,1,Coffee Shop,Aquarium,Hotel,Café,Fried Chicken Joint,Restaurant,Brewery,Scenic Lookout,Italian Restaurant,Pizza Place


#### Cluster 3

In [35]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,East Toronto,2,Health Food Store,Pub,Neighborhood,Trail,Yoga Studio,Dog Run,Diner,Discount Store,Distribution Center,Donut Shop


#### Cluster 4

In [36]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
62,Central Toronto,3,Garden,Home Service,Music Venue,Department Store,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Donut Shop


#### Cluster 5

In [37]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
83,Central Toronto,4,Park,Restaurant,Lawyer,Trail,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant
91,Downtown Toronto,4,Park,Playground,Trail,Deli / Bodega,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant


### Observations

<b>Cluster 1</b> (Label 0): Consists of a diverse mix of venues

<b>Cluster 2</b> (Label 1): Consists of afternoon hang-out venues: coffee shops, confectioneries and other daytime recreational venues

<b>Cluster 3</b> (Label 2): Consists of residential areas and amenities for various family members

<b>Cluster 4</b> (Label 3): Consists of entertainment and socializing venues

<b>Cluster 5</b> (Label 4): Consists of outdoor venues (parks and trails) and ethnic restaurants