# Peer-graded Assignment: Segmenting and Clustering Neighbourhoods in Toronto 

Tony Hall

This project scrapes information on the neighbourhoods and boroughs of Toronto from the web, then uses Foursquare location data and the K-Means clustering algorithm to build an unsupervised machine learning model that clusters similar neighbourhoods together.


## Question 1

Scrape and format the Canadian Postcodes


In [1]:
import pandas as pd
import numpy as np
#Beautifulsoup is used for web scraping, requests for getting the web page and lxml for parsing the html
from bs4 import BeautifulSoup
import requests
from lxml import html
import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Beautiful Soup code below is redundant after I found that pd.read_html in the cell below is simpler
#page = requests.get(url).text
#soup = BeautifulSoup(page, 'lxml')
#print(soup.prettify())

In [3]:
#Read the table into a list of dataframes. The first element in the list [0] is the dataframe of postcodes and boroughs
dfs = pd.read_html(url, header=0)
df = pd.DataFrame(dfs[0])
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [4]:
# For every row where the Borough is assigned and the Neighbourhood is not, update Neighbourhood with Borough
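for row in df[(df['Borough'] != 'Not assigned') & (df['Neighbourhood'] == 'Not assigned') ].index:
    print("Updating row ",row,"Neighbourhood with Borough ",df.loc[row, 'Borough'])
    # assign through a single .loc call; chained indexing (df.loc[row]['Neighbourhood']=...) may write to a temporary copy
    df.loc[row, 'Neighbourhood'] = df.loc[row, 'Borough']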
    

Updating row  8 Neighbourhood with Borough  Queen's Park


In [5]:
# Remove rows where both Borough and Neighbourhood are 'Not assigned' (after the update above, every row with an assigned Borough also has a Neighbourhood)
pre_len = len(df)
df = df[(df['Borough'] != 'Not assigned') | (df['Neighbourhood'] != 'Not assigned') ]
print("Number of Not Assigned rows removed: ",pre_len-len(df))

Number of Not Assigned rows removed:  77
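
The two cleaning rules above (copy the borough into a missing neighbourhood, then drop rows with nothing assigned) can be checked by hand on a few rows. The sketch below restates them in plain Python on three sample rows shaped like the scraped table. `clean_rows` is a hypothetical helper, not part of the notebook, and it assumes that a row without a borough never has a neighbourhood.

```python
# Three sample rows in the shape of the scraped table.
rows = [
    {"Postcode": "M1A", "Borough": "Not assigned", "Neighbourhood": "Not assigned"},
    {"Postcode": "M3A", "Borough": "North York", "Neighbourhood": "Parkwoods"},
    {"Postcode": "M7A", "Borough": "Queen's Park", "Neighbourhood": "Not assigned"},
]


def clean_rows(rows, missing="Not assigned"):
    """Hypothetical helper: apply both cleaning rules without touching the input."""
    cleaned = []
    for row in rows:
        if row["Borough"] == missing:
            continue  # assumption: no borough means nothing usable in the row
        row = dict(row)  # copy so the caller's rows are left unchanged
        if row["Neighbourhood"] == missing:
            row["Neighbourhood"] = row["Borough"]
        cleaned.append(row)
    return cleaned


cleaned = clean_rows(rows)
```

Here `cleaned` keeps the M3A and M7A rows, and the M7A neighbourhood becomes "Queen's Park", which matches the update printed by the notebook.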


In [6]:
df=df.reset_index(drop=True)

In [7]:
df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


In [8]:
df.shape

(211, 3)


## Question 2

Get the Latitude and Longitude of each of the Canadian Postcodes

In [9]:
#install and import Geocoder for fetching the latitudes and longitudes of the neighbourhoods
import sys
!{sys.executable} -m pip install geocoder
import geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
    100% |████████████████████████████████| 102kB 18.9MB/s
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
tensorflow 1.3.0 requires tensorflow-tensorboard<0.2.0,>=0.1.0, which is not installed.
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [10]:
#create a list of the latitudes and longitudes of the neighbourhoods using the geocoder library (ArcGIS provider)
latlng=[]

print("Calling Geocoder API, please wait....")
for neighbourhood, borough in zip(df['Neighbourhood'], df['Borough']):
    #build the address string for this neighbourhood and look it up
    location = neighbourhood+", "+borough+", Ontario, Canada"
    latlng.append(geocoder.arcgis(location).latlng)
print(".....Complete")

Calling Geocoder API, please wait....
.....Complete
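
Because the geocoding calls are slow and the next cell indexes every result, a lookup that comes back empty would stop the notebook. A minimal sketch of a retry wrapper follows. `geocode_with_retry` and `fake_lookup` are hypothetical names, and the sketch assumes a failed lookup returns `None` or an empty list.

```python
import time


def geocode_with_retry(location, lookup, attempts=3, pause_seconds=1.0):
    """Return lookup(location), retrying while the result is empty.

    `lookup` is any callable mapping an address string to a [lat, lng] pair,
    or to None / an empty list when nothing was found (an assumption).
    """
    for attempt in range(attempts):
        result = lookup(location)
        if result:
            return list(result)
        if attempt < attempts - 1:
            time.sleep(pause_seconds)  # wait before the next try
    return None


# Stand-in lookup for illustration only: fails once, then succeeds.
calls = []


def fake_lookup(location):
    calls.append(location)
    return None if len(calls) == 1 else [43.65011, -79.3829]


coords = geocode_with_retry(
    "Harbourfront, Downtown Toronto, Ontario, Canada",
    fake_lookup,
    pause_seconds=0.0,
)
```

In the notebook the `lookup` argument could be `lambda loc: geocoder.arcgis(loc).latlng`, leaving the rest of the loop unchanged.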


In [11]:
#split each latlng pair into lat (latitude) and lng (longitude). Fetching latlng once and splitting it, rather than fetching lat and lng separately from Geocoder, saves time because the API calls seem to be slow
lat=[]
lng=[]
for item in latlng:
    lat.append(item[0])
    lng.append(item[1])
#add new Latitude and Longitude columns using the lat/lng lists
df['Latitude']=lat
df['Longitude']=lng
df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,44.20973,-79.471904
1,M4A,North York,Victoria Village,43.73154,-79.31428
2,M5A,Downtown Toronto,Harbourfront,43.65011,-79.3829
3,M5A,Downtown Toronto,Regent Park,43.66069,-79.36031
4,M6A,North York,Lawrence Heights,43.72357,-79.43711
5,M6A,North York,Lawrence Manor,43.72292,-79.43131
6,M7A,Queen's Park,Queen's Park,44.38882,-79.69972
7,M9A,Etobicoke,Islington Avenue,43.722656,-79.558673
8,M1B,Scarborough,Rouge,43.80766,-79.17405
9,M1B,Scarborough,Malvern,43.80977,-79.22084
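
Two rows in the table above (Parkwoods and Queen's Park) have latitudes above 44, noticeably further north than the rest, so it may be worth flagging geocoder results that land far from the city. The sketch below uses the haversine formula to measure the distance from each point to the map centre used in the next section. The 40 km threshold and the names `haversine_km`, `MAX_KM` and `suspect` are assumptions made for illustration.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


TORONTO_CENTRE = (43.6532, -79.3932)  # centre passed to folium.Map in the notebook
MAX_KM = 40.0  # arbitrary threshold, chosen only for this illustration

# A few rows copied from the table above.
sample = [
    ("Parkwoods", 44.20973, -79.471904),
    ("Victoria Village", 43.73154, -79.31428),
    ("Harbourfront", 43.65011, -79.3829),
    ("Queen's Park", 44.38882, -79.69972),
]

# Names of the rows that fall outside the threshold distance.
suspect = [
    name for name, lat, lng in sample
    if haversine_km(lat, lng, *TORONTO_CENTRE) > MAX_KM
]
```

With these four rows, `suspect` contains Parkwoods and Queen's Park, which could then be geocoded again or corrected by hand.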



## Question 3

Cluster the neighbourhoods of Toronto using Foursquare and k-means

In [12]:
# install and import the folium library to visualise the neighbourhood clusters on map
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/DSX-Python35

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.0.2r             |       h14c3975_0         3.1 MB  conda-forge
    ca-certificates-2019.6.16  |       hecc5488_0         145 KB  conda-forge
    certifi-2018.8.24          |        py35_1001         139 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    altair-2.2.2               |           py35_1         462 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.0 MB

The following NEW packages will

Visualise the Neighbourhoods on a map of Toronto

In [13]:
# create map of Toronto using latitude and longitude values
toronto_map = folium.Map(location=[43.6532, -79.3932], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

Prepare Foursquare credentials

In [14]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

Now to try out the Foursquare explore request on one of our Toronto neighbourhoods

In [15]:
#prepare the request url
latitude = df.loc[2]['Latitude']
longitude = df.loc[2]['Longitude']
radius = 500
limit = 100
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, limit)
results = requests.get(url).json()["response"]['groups'][0]['items']
if results == []:
    print("request returned no results")
#results
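
The request above builds the URL with string formatting and indexes straight into the JSON, which raises an exception if any key is missing. A hedged sketch of a more defensive version using only the standard library follows. `build_explore_url` and `extract_items` are hypothetical helpers, `'MY_ID'` and `'MY_SECRET'` are placeholders, and the response shape is assumed from the indexing used in the cell.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Build the explore URL with the same query parameters as the notebook."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll parameter readable
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')


def extract_items(payload):
    """Return the venue items, or an empty list if the expected keys are missing."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []
    return groups[0].get('items', [])


example_url = build_explore_url('MY_ID', 'MY_SECRET', 43.65011, -79.3829, '20180605', 500, 100)
```

`extract_items(requests.get(url).json())` would then return an empty list instead of raising when a response has no groups.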


In [16]:
#Extract the category name from a venue row (the column name could be either 'categories' or 'venue.categories')
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [17]:
#define a function to return all the venues for a given neighborhood

#pass the neighborhood (nb), the latitude (la), the longitude (lo), the radius and the limit 
def get_venues(nb, la, lo, radius, limit):
    
    print(nb)
    
    #form the request url and request only the items (which are the venues)
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, la, lo, VERSION, radius, limit)
    items = requests.get(url).json()["response"]['groups'][0]['items']
    
    #include exception handling where Foursquare request fails for a particular Neighbourhood
    if items == []:
        print(" -- Foursquare request for ",nb,"returned no results --")
        return None
    else:
        # flatten JSON, filter for only wanted columns then use the get_category_type function to replace the category list with just the category.
        venues1 = json_normalize(items)
        filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
        venues1 =venues1.loc[:, filtered_columns]
        venues1['venue.categories'] = venues1.apply(get_category_type, axis=1)

        #add the neighbourhood name, latitude and longitude as the first three columns, repeated on every venue row
        venues2 = venues1.copy()
        venues2.insert(0, 'Neighborhood', nb)
        venues2.insert(1, 'Neighborhood Latitude', la)
        venues2.insert(2, 'Neighborhood Longitude', lo)

        #rename the columns
        venues2.rename(index=str,columns={"venue.name":"Venue","venue.categories":"Venue Category","venue.location.lat":"Venue Latitude","venue.location.lng":"Venue Longitude"}, inplace=True)
    
    return venues2

In [18]:
#test out the get_venues function
row = 149
n = df.loc[row]['Neighbourhood']
n_lat = df.loc[row]['Latitude']
n_long = df.loc[row]['Longitude']

df2=get_venues(n, n_lat, n_long, 500, 100)
df2.head()

Tam O'Shanter


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Category,Venue Latitude,Venue Longitude
0,Tam O'Shanter,43.78534,-79.29833,Burger King,Fast Food Restaurant,43.783653,-79.292935
1,Tam O'Shanter,43.78534,-79.29833,Subway,Sandwich Place,43.783665,-79.292709
2,Tam O'Shanter,43.78534,-79.29833,Gusto Pizza,Pizza Place,43.783607,-79.298983
3,Tam O'Shanter,43.78534,-79.29833,Tim Hortons,Coffee Shop,43.783808,-79.293351
4,Tam O'Shanter,43.78534,-79.29833,A K Sports Cards & Comics,Hobby Shop,43.784034,-79.293109


In [19]:
#Iterate through all the Toronto-based Neighbourhoods, collecting each set of venues and combining them into the df3 dataframe
radius = 500
limit = 100
venue_frames=[]
for i, borough in enumerate(df['Borough']):
    if 'Toronto' in borough:
        venue_frames.append(get_venues(df.iloc[i]['Neighbourhood'], df.iloc[i]['Latitude'], df.iloc[i]['Longitude'], radius, limit))
#pd.concat silently drops the None entries returned for neighbourhoods with no venues
df3=pd.concat(venue_frames, ignore_index=True)
        
    

Harbourfront
Regent Park
Ryerson
Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide
King
Richmond
Dovercourt Village
Dufferin
Harbourfront East
Toronto Islands
Union Station
Little Portugal
Trinity
The Danforth West
Riverdale
Design Exchange
Toronto Dominion Centre
Brockton
Exhibition Place
Parkdale Village
The Beaches West
India Bazaar
Commerce Court
Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North
Forest Hill West
High Park
The Junction South
North Toronto West
The Annex
North Midtown
Yorkville
Parkdale
Roncesvalles
Davisville
Harbord
University of Toronto
Runnymede
Swansea
Moore Park
Summerhill East
 -- Foursquare request for  Summerhill East returned no results --
Chinatown
Grange Park
Kensington Market
Deer Park
Forest Hill SE
Rathnelly
South Hill
Summerhill West
 -- Foursquare request for  Summerhill West returned no results --
CN Tower
Bathurst Quay
Island airport
Harbourfront West
King and Spadina


Now to take a look at the results of the complete dataframe

In [20]:
print("Shape of the result is:",df3.shape)
df3.head()

Shape of the result is: (4218, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Category,Venue Latitude,Venue Longitude
0,Harbourfront,43.65011,-79.3829,The Keg Steakhouse & Bar,Steakhouse,43.649937,-79.384196
1,Harbourfront,43.65011,-79.3829,Adelaide Club Toronto,Gym / Fitness Center,43.649279,-79.381921
2,Harbourfront,43.65011,-79.3829,Pilot Coffee Roasters,Coffee Shop,43.648835,-79.380936
3,Harbourfront,43.65011,-79.3829,John & Sons Oyster House,Seafood Restaurant,43.650656,-79.381613
4,Harbourfront,43.65011,-79.3829,Rosalinda,Vegetarian / Vegan Restaurant,43.650252,-79.385156


Now to check how many venues we got back for each Neighbourhood

In [21]:
df3.groupby('Neighborhood').count().head()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Category,Venue Latitude,Venue Longitude
Adelaide,100,100,100,100,100,100
Bathurst Quay,25,25,25,25,25,25
Berczy Park,100,100,100,100,100,100
Brockton,43,43,43,43,43,43
Business Reply Mail Processing Centre 969 Eastern,100,100,100,100,100,100


How many unique venue categories did we get?

In [22]:
len(df3['Venue Category'].unique())

250

Taking a look at the counts of each venue type returns unsurprising results, with Coffee Shop, Café and Hotel the most common and Castle near the bottom. (Maybe "Castle" would be closer to the top in Old Europe.)

In [23]:
counts = df3['Venue Category'].value_counts()
counts

Coffee Shop                      309
Café                             225
Hotel                            140
Restaurant                       132
Bar                              118
Bakery                           112
Steakhouse                       105
Japanese Restaurant              101
American Restaurant               95
Pizza Place                       92
Gastropub                         90
Breakfast Spot                    89
Burger Joint                      88
Gym                               75
Thai Restaurant                   74
Asian Restaurant                  74
Sushi Restaurant                  73
Italian Restaurant                73
Seafood Restaurant                67
Bookstore                         59
Cosmetics Shop                    57
Concert Hall                      53
Salad Place                       49
Pub                               49
Vegetarian / Vegan Restaurant     48
Sandwich Place                    47
Clothing Store                    43
D

Now to prepare the dataframe for k-means through one-hot encoding and normalization

In [24]:
df_onehot = pd.get_dummies(df3['Venue Category'])
#drop the venue category that is itself named 'Neighborhood' so it cannot clash with the neighbourhood name column
df_onehot.drop('Neighborhood',axis=1,inplace=True)
#insert a new column for the neighbourhood values (inserted as 'New' and renamed afterwards, because inserting a column whose name already exists raises an error)
df_onehot.insert(0,"New",df3['Neighborhood'])
df_onehot.rename(columns={'New':'Neighborhood'}, inplace=True)

#get the mean frequency of occurrence of each venue category per neighbourhood
df_onehot = df_onehot.groupby("Neighborhood").mean().reset_index()
df_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Adelaide,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
1,Bathurst Quay,0.0,0.0,0.0,0.0,0.04,0.0,0.04,0.04,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
3,Brockton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.093023,0.0,0.023256,0.0,0.0,0.0,0.0
4,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
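
Each value in this table is the share of a neighbourhood's venues that fall in a category. Brockton returned 43 venues, so its Vietnamese Restaurant value of 0.093023 corresponds to 4 of 43. The sketch below reproduces that arithmetic in plain Python. `category_frequencies` is a hypothetical helper, and the venue list is illustrative data in which `'Other'` stands in for all remaining categories.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """venues: iterable of (neighbourhood, category) pairs.

    Returns {neighbourhood: {category: share of that neighbourhood's venues}}.
    """
    counts = defaultdict(Counter)
    for neighbourhood, category in venues:
        counts[neighbourhood][category] += 1
    return {
        neighbourhood: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for neighbourhood, cats in counts.items()
    }


# Illustrative data: 43 Brockton venues, 4 Vietnamese restaurants and 1 wine bar.
# 'Other' is a placeholder for every remaining category.
venues = (
    [('Brockton', 'Vietnamese Restaurant')] * 4
    + [('Brockton', 'Wine Bar')] * 1
    + [('Brockton', 'Other')] * 38
)

freqs = category_frequencies(venues)
```

Rounded to six places, `freqs['Brockton']['Vietnamese Restaurant']` is 0.093023 and `freqs['Brockton']['Wine Bar']` is 0.023256, the same values as in the Brockton row.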


Take a look at the top 5 venues for each Neighbourhood

In [25]:
#iterate over the rows of df_onehot by position (one row per neighbourhood); iterating over the dataframe directly would walk its columns
for i in range(len(df_onehot)):
    print(df_onehot.iloc[i][0])
    print(df_onehot.iloc[i][1:].sort_values(ascending=False).head(5))
    print('\n')

Adelaide
Café                   0.06
Coffee Shop            0.06
Hotel                  0.04
American Restaurant    0.03
Bakery                 0.03
Name: 0, dtype: object


Bathurst Quay
Coffee Shop             0.2
Park                   0.08
Café                   0.08
Japanese Restaurant    0.04
Dance Studio           0.04
Name: 1, dtype: object


Berczy Park
Coffee Shop    0.06
Café           0.06
Hotel          0.04
Steakhouse     0.04
Gastropub      0.03
Name: 2, dtype: object


Brockton
Coffee Shop               0.116279
Bar                      0.0930233
Vietnamese Restaurant    0.0930233
Grocery Store            0.0697674
Café                     0.0697674
Name: 3, dtype: object


Business Reply Mail Processing Centre 969 Eastern
Coffee Shop    0.07
Steakhouse     0.04
Hotel          0.04
Café           0.04
Bar            0.04
Name: 4, dtype: object


CN Tower
Coffee Shop    0.06
Café           0.06
Hotel          0.04
Steakhouse     0.04
Gastropub      0.03
Name: 5, dtype: o

In [26]:
#put the top 10 venue categories for each neighbourhood into a dataframe
topvenue_columns=['Neighborhood']+['#{}'.format(n) for n in range(1,11)]
topvenue_rows=[]

for row in range(0,len(df_onehot)):
    neighborhood = df_onehot.iloc[row][0]
    topten = df_onehot.iloc[row][1:].sort_values(ascending=False).head(10)
    #one output row: the neighbourhood name followed by its ten most frequent categories
    topvenue_rows.append([neighborhood]+list(topten.index))
df_topvenues=pd.DataFrame(topvenue_rows,columns=topvenue_columns)
df_topvenues.head()

Unnamed: 0,Neighborhood,#1,#2,#3,#4,#5,#6,#7,#8,#9,#10
0,Adelaide,Café,Coffee Shop,Hotel,American Restaurant,Bakery,Steakhouse,Asian Restaurant,Burger Joint,Gastropub,Breakfast Spot
1,Bathurst Quay,Coffee Shop,Park,Café,Japanese Restaurant,Dance Studio,Diner,Ramen Restaurant,Caribbean Restaurant,Sculpture Garden,Sushi Restaurant
2,Berczy Park,Coffee Shop,Café,Hotel,Steakhouse,Gastropub,Japanese Restaurant,American Restaurant,Asian Restaurant,Burger Joint,Breakfast Spot
3,Brockton,Coffee Shop,Bar,Vietnamese Restaurant,Grocery Store,Café,Pizza Place,Bakery,Restaurant,Portuguese Restaurant,French Restaurant
4,Business Reply Mail Processing Centre 969 Eastern,Coffee Shop,Steakhouse,Hotel,Café,Bar,Sushi Restaurant,Pub,Pizza Place,Japanese Restaurant,American Restaurant


### Clustering

In [27]:
# import k-means
from sklearn.cluster import KMeans

df_clustering = df_onehot.drop('Neighborhood',axis=1)

# set number of clusters
k = 5

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:100]

array([0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 2, 2,
       0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 3, 0, 1, 0, 3, 0, 0, 0, 0, 3, 0,
       4, 0, 0, 3, 3, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0,
       0, 0], dtype=int32)
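
For readers who want to see what the clustering call does, here is a minimal pure-Python sketch of the basic k-means loop (Lloyd's algorithm): assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. It is not scikit-learn's implementation, which by default also chooses starting centroids with k-means++. The function names and the four sample points are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_labels(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def k_means(points, initial_centroids, max_iterations=100):
    """Basic Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(max_iterations):
        labels = assign_labels(points, centroids)
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # Update step: the centroid moves to the mean of its members.
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return assign_labels(points, centroids), centroids


# Two well-separated groups of points, purely illustrative.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = k_means(points, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
```

In the notebook each point would be one row of `df_clustering`, a vector of venue-category frequencies for a neighbourhood.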

Insert the cluster labels:

In [28]:
df_topvenues.insert(0,'Cluster Labels',kmeans.labels_)

Now to take a look at one of the clusters

In [29]:
df_topvenues[df_topvenues['Cluster Labels']==4]

Unnamed: 0,Cluster Labels,Neighborhood,#1,#2,#3,#4,#5,#6,#7,#8,#9,#10
46,4,Riverdale,Asian Restaurant,Supermarket,Pharmacy,Thai Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Yoga Studio


Taking a look at the most common #1 venue categories in each cluster gives a feel for what the clusters represent. Cluster 0 covers the nightlife areas, with cafés, bars and restaurants. Cluster 3 covers the standard suburbs, which have lots of coffee shops but few bars. Clusters 1, 2 and 4 are basically outliers; they don't have many coffee shops.

In [30]:
#for each cluster, count the most common #1 venue categories among its neighbourhoods
for cluster in range(k):
    print('Cluster {} top venues\n'.format(cluster),df_topvenues[df_topvenues['Cluster Labels']==cluster]['#1'].value_counts().head(5))

Cluster 0 top venues
 Coffee Shop    27
Café            9
Bar             5
Bakery          3
Restaurant      2
Name: #1, dtype: int64
Cluster 1 top venues
 Park    1
Name: #1, dtype: int64
Cluster 2 top venues
 Convenience Store    2
Name: #1, dtype: int64
Cluster 3 top venues
 Coffee Shop          7
Grocery Store        2
Indian Restaurant    1
Bakery               1
Airport Lounge       1
Name: #1, dtype: int64
Cluster 4 top venues
 Asian Restaurant    1
Name: #1, dtype: int64
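
Counting how many neighbourhoods fall in each cluster makes the outlier observation concrete. The sketch below tallies the label array printed after the k-means cell using `collections.Counter`. The variable names are illustrative, and treating a cluster of two or fewer neighbourhoods as an outlier is an assumption.

```python
from collections import Counter

# Cluster labels copied from the kmeans.labels_ output above.
labels = [
    0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 2, 2,
    0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 3, 0, 1, 0, 3, 0, 0, 0, 0, 3, 0,
    4, 0, 0, 3, 3, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0,
    0, 0,
]

# Number of neighbourhoods assigned to each cluster.
sizes = Counter(labels)

# Clusters with two or fewer neighbourhoods (threshold is an assumption).
small_clusters = sorted(cluster for cluster, n in sizes.items() if n <= 2)
```

With this array, cluster 0 holds 54 of the 71 neighbourhoods and cluster 3 holds 13, while clusters 1, 2 and 4 hold one or two each.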


In [31]:
df_merged = df

# join the cluster labels and top venues in df_topvenues onto df, which holds the latitude/longitude of each neighbourhood
df_merged = df_merged.join(df_topvenues.set_index('Neighborhood'), on='Neighbourhood')

#drop rows with no clusters (which are the non Toronto rows)
df_toronto_merged = df_merged.dropna(subset=['Cluster Labels'])
df_toronto_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,#1,#2,#3,#4,#5,#6,#7,#8,#9,#10
2,M5A,Downtown Toronto,Harbourfront,43.65011,-79.3829,0.0,Coffee Shop,Café,Hotel,Steakhouse,Gastropub,Japanese Restaurant,American Restaurant,Asian Restaurant,Burger Joint,Breakfast Spot
3,M5A,Downtown Toronto,Regent Park,43.66069,-79.36031,3.0,Coffee Shop,Thai Restaurant,Pool,Fast Food Restaurant,Electronics Store,Beer Store,Food Truck,Restaurant,Sushi Restaurant,Auto Dealership
13,M5B,Downtown Toronto,Ryerson,43.65011,-79.3829,0.0,Café,Coffee Shop,Hotel,American Restaurant,Bakery,Steakhouse,Asian Restaurant,Burger Joint,Gastropub,Breakfast Spot
14,M5B,Downtown Toronto,Garden District,43.65011,-79.3829,0.0,Coffee Shop,Café,Hotel,Steakhouse,Gastropub,Japanese Restaurant,American Restaurant,Asian Restaurant,Burger Joint,Breakfast Spot
27,M5C,Downtown Toronto,St. James Town,43.67081,-79.37348,3.0,Coffee Shop,Grocery Store,Pizza Place,Pie Shop,Filipino Restaurant,Library,Food & Drink Shop,Bike Rental / Bike Share,Market,Breakfast Spot


Now plotting the clusters on a map

In [32]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

colors_list = ['Red','Blue','Green','Yellow','Purple','Pink','Orange']

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_toronto_merged['Latitude'], df_toronto_merged['Longitude'], df_toronto_merged['Neighbourhood'], df_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=colors_list[int(cluster)],
        fill=True,
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

...And we're done! Toronto neighbourhoods clustered and mapped.