# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto - Notebook I

## Introduction

In this notebook we first scrape a Wikipedia page to get a table of neighborhoods in Toronto and create a data frame from it. In a second notebook, we will use the Foursquare location data to complete this table. In a third notebook, we will create maps to explore the data.

We scrape the data from "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" using the beautifulsoup4 package.

## Scraping the HTML Wiki page

We start by importing the basic Python libraries:

In [1]:
import pandas as pd
import numpy as np

We now download the whole HTML page and parse it with BeautifulSoup4:

In [2]:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml') # We use the lxml parser

We can test a few functionalities of BeautifulSoup if we want (some commands are kept in the comment below).
<!--
# print(soup.prettify())
# print(soup.title)
# print(soup.title.text)
# print(soup.find('div', class_='printfooter').text)

// Nice video about how to use BeautifulSoup:
// https://www.youtube.com/watch?v=ng2o98k983k
-->

Let us capture the first table of the page that has the class 'wikitable sortable':

In [3]:
table_on_page = soup.find('table', class_='wikitable sortable')

Some test commands are kept in the comment below.
<!--
#print(table_on_page)
# Test:
#table_header1 = table_on_page.tbody.th
#print(table_header1.text)
-->

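We read through the HTML table headers and save their text into a list, removing the trailing '\n' when it appears. Because the loop scans every 'th' element of the page and not only those of table_on_page, the list also picks up a fourth header from elsewhere on the page: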

In [4]:
#print(soup.find_all('th'))
names_th = []
for word in soup.find_all('th'):
    if word.text[-1] == '\n':
        names_th.append(word.text[:-1])
    else:
        names_th.append(word.text)
names_th

['Postcode', 'Borough', 'Neighbourhood', 'Canadian postal codes']

And we **create a pandas data frame** with the first three headers:

In [5]:
import pandas as pd
df = pd.DataFrame(columns=[names_th[0],names_th[1],names_th[2]])
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood


We now read through the HTML table values and save the words into a list:

In [6]:
values_th = []
for val in soup.find_all('td'):
    values_th.append(val.text)
#values_th

We take that list and assign the values three at a time to the data frame until we reach the last row, identified by the postcode 'M9Z', which we add separately after the loop. Stopping there avoids picking up the cells of the other tables at the bottom of the page, which we do not need. We also strip the trailing '\n' from the third element of each row.

In [7]:
i = 0

# Looping through the list until we reach the last line
while(values_th[3*i+0] != 'M9Z' and i < 500):
    #print(values_th[3*i+0],values_th[3*i+1],values_th[3*i+2])
    df = df.append({names_th[0]:values_th[3*i+0], names_th[1]:values_th[3*i+1], names_th[2]:values_th[3*i+2][:-1]}, ignore_index=True)
    i+=1

# Dealing with the last line
#print(values_th[3*i+0],values_th[3*i+1],values_th[3*i+2])
df = df.append({names_th[0]:values_th[3*i+0], names_th[1]:values_th[3*i+1], names_th[2]:values_th[3*i+2][:-1]}, ignore_index=True)
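
The cell-by-cell loop above relies on the sentinel value 'M9Z' and on `DataFrame.append`, which was removed in pandas 2.0. As a hedged alternative, the sketch below walks a table row by row and builds the data frame in a single step. The inline HTML string and the names `sample_html`, `rows` and `df_rows` are stand-ins invented for illustration; on the real page one would iterate over `table_on_page` instead.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia table (illustrative data only)
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

table = BeautifulSoup(sample_html, 'html.parser').find('table', class_='wikitable')
column_names = ['Postcode', 'Borough', 'Neighbourhood']

rows = []
for tr in table.find_all('tr'):
    # get_text(strip=True) also removes the trailing '\n' of the last cell
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        rows.append(dict(zip(column_names, cells)))

# Build the data frame once from the collected rows
df_rows = pd.DataFrame(rows, columns=column_names)
print(df_rows)
```

Restricting the search to one table means that cells from the other tables of the page are never visited, so no stop condition is needed.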

Checking the head of the data frame:

In [8]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


And checking the tail to make sure we incorporated the values up to the last row:

In [9]:
df.tail()

Unnamed: 0,Postcode,Borough,Neighbourhood
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor
287,M9Z,Not assigned,Not assigned


The shape of the data frame is the following:

In [10]:
df.shape

(288, 3)

**We now have our scraped HTML table converted into a data frame!**

## Shaping the data frame

Here we reshape the data frame to match the requirements of the assignment.

We rename the column 'Postcode' to 'PostCode':

In [11]:
df.rename(columns={'Postcode':'PostCode'}, inplace=True)
df.head()

Unnamed: 0,PostCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


We remove rows for which "Not assigned" is present in the column 'Borough' (there will remain a "Not assigned" value in the column 'Neighbourhood'):

In [12]:
df = df[~df['Borough'].isin(["Not assigned"])]

In [13]:
df.head(10)

Unnamed: 0,PostCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [14]:
df.shape

(211, 3)

We now need to reset the index so that the row positions are contiguous for the loop that follows.

In [15]:
df.reset_index(drop=True, inplace=True)
df.head(10)

Unnamed: 0,PostCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


We now merge the rows that share a postcode. For each row, we compare its postcode and borough with those of the next row. If they match, we append the current neighbourhood name to the next row and replace the postcode of the current row with the letter X, so that we can drop it afterwards. If they do not match, we move on to the next row.

In [16]:
for i in range(len(df.index)-1):
    #print(i, df.iloc[i,0], df.iloc[i,1], df.iloc[i,2])
    
    if (df.iloc[i,0] == df.iloc[i+1,0] and df.iloc[i,1] == df.iloc[i+1,1]):
        # print("match")
        df.iloc[i+1,2] = df.iloc[i+1,2] + ', ' + df.iloc[i,2]
        df.iloc[i,0] = 'X'

df.head(10)     

Unnamed: 0,PostCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,X,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,"Regent Park, Harbourfront"
4,X,North York,Lawrence Heights
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,X,Scarborough,Rouge
9,M1B,Scarborough,"Malvern, Rouge"


We drop the rows marked with X and reset the index:

In [17]:
df = df[~df['PostCode'].isin(["X"])]
df.reset_index(drop=True, inplace=True)

In [18]:
df.head(10)

Unnamed: 0,PostCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Not assigned
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [19]:
df.shape

(103, 3)
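
The mark-and-drop loop above can also be written as a single grouping step. The sketch below is one possible alternative, not the method used in this notebook; the data frame `sample` and the result `merged` are illustrative names with a few rows copied from the table above.

```python
import pandas as pd

# A few rows in the same layout as the notebook's data frame
sample = pd.DataFrame({
    'PostCode': ['M3A', 'M5A', 'M5A', 'M6A', 'M6A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park',
                      'Lawrence Heights', 'Lawrence Manor'],
})

# sort=False keeps the postcodes in order of first appearance;
# ', '.join concatenates the neighbourhood names of each group
merged = (sample
          .groupby(['PostCode', 'Borough'], sort=False)['Neighbourhood']
          .agg(', '.join)
          .reset_index())
print(merged)
```

Unlike the loop, this version lists the neighbourhoods in their original order ("Harbourfront, Regent Park" instead of "Regent Park, Harbourfront") and does not require rows with the same postcode to be adjacent.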

We need to replace the Neighbourhood name "Not assigned" by the Borough name.

In [20]:
df = df.replace({'Neighbourhood':'Not assigned'},{'Neighbourhood':df['Borough']})
df.head(10)     

Unnamed: 0,PostCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


We can see that the value for Queen's Park has been replaced.

In [21]:
df.shape

(103, 3)
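
The replacement above can also be expressed with a boolean mask, which some readers may find easier to follow than the dictionary form of `replace`. This is a minimal sketch on an illustrative two-row data frame named `sample`; the mask name `missing` is likewise made up.

```python
import pandas as pd

sample = pd.DataFrame({
    'Borough': ["Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Not assigned', 'Islington Avenue'],
})

# Rows whose neighbourhood is missing
missing = sample['Neighbourhood'] == 'Not assigned'

# For those rows only, copy the borough name from the same row
sample.loc[missing, 'Neighbourhood'] = sample.loc[missing, 'Borough']
print(sample)
```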

## Importing the geographical coordinates

We borrow the code from the assignment page and adapt it:

In [23]:
'''
import geocoder # import geocoder

# initialize your variable to None
#lat_lng_coords = None

# loop until you get the coordinates
#while(lat_lng_coords is None):
g = geocoder.google('{}, Toronto, Ontario'.format('M5G'))
print(g)
lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
'''

Since the geocoder call does not work, we load the coordinates from the CSV file instead:

In [24]:
csv_path="Geospatial_Coordinates.csv"
df_coords = pd.read_csv(csv_path)
df_coords.rename(columns={'Postal Code':'PostCode'}, inplace=True)
df_coords.head()

Unnamed: 0,PostCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [25]:
df.set_index('PostCode', inplace=True)
df.head()

PostCode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Queen's Park,Queen's Park


In [26]:
df_coords.set_index('PostCode', inplace=True)

In [27]:
df_coords.head()

PostCode,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [28]:
new_df = pd.merge(df, df_coords, on='PostCode')

In [29]:
new_df

PostCode,Borough,Neighbourhood,Latitude,Longitude
M3A,North York,Parkwoods,43.753259,-79.329656
M4A,North York,Victoria Village,43.725882,-79.315572
M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
M7A,Queen's Park,Queen's Park,43.662301,-79.389494
M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
M3B,North York,Don Mills North,43.745906,-79.352188
M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
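
`pd.merge` performs an inner join by default, so a postcode missing from either table would be dropped without warning. Here the result keeps all 103 rows, so nothing was lost, but the check can be made explicit. The sketch below is an assumption-laden illustration: the frames `neighbourhoods` and `coords` are toy data, and the postcode 'M0Z' is made up to force a mismatch.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    'PostCode': ['M3A', 'M4A', 'M0Z'],  # 'M0Z' is a made-up postcode
    'Borough': ['North York', 'North York', 'Illustrative Borough'],
})
coords = pd.DataFrame({
    'PostCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every neighbourhood row, validate checks that the keys
# are unique on both sides, indicator=True adds a '_merge' column
joined = neighbourhoods.merge(coords, on='PostCode', how='left',
                              validate='one_to_one', indicator=True)

# Postcodes that found no coordinates
unmatched = joined.loc[joined['_merge'] == 'left_only', 'PostCode'].tolist()
print(unmatched)
```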


Checking that the last value of the Coursera assignment page matches:

In [30]:
new_df.loc['M5A']

Borough                   Downtown Toronto
Neighbourhood    Regent Park, Harbourfront
Latitude                           43.6543
Longitude                         -79.3606
Name: M5A, dtype: object

Indeed, it matches. We now reset the index.

In [31]:
new_df.reset_index(drop=False, inplace=True)

In [32]:
new_df.head()

Unnamed: 0,PostCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


In [33]:
new_df.shape

(103, 5)

## Segmenting and Clustering Neighborhoods

We reproduce the analysis seen in the course, but this time for the neighborhoods of Toronto.

In [34]:
import json
#import requests # library to handle requests
#from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
print('Libraries imported.')

Libraries imported.


Creating the map of Toronto with the neighborhoods superimposed on it:

In [40]:
# Getting the center coordinates of Toronto
latitude_Toronto = 43.651070
longitude_Toronto = -79.347015

# Creating the map
map_toronto = folium.Map(location=[latitude_Toronto, longitude_Toronto], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(new_df['Latitude'], new_df['Longitude'], new_df['Borough'], new_df['Neighbourhood']):
    label = '{} @ {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

In [41]:
map_toronto

We want to explore the data a little, so let's use **Foursquare**.

Let's enter the credentials:

In [108]:
CLIENT_ID = '...'
CLIENT_SECRET = '...'
VERSION = '20180605'

We retrieve up to 50 venues within a radius of 500 meters of each neighborhood. We borrow the function that gets the nearby venues for all the neighborhoods, which was defined in the course for the case of Manhattan.

In [178]:
LIMIT = 50
RADIUS = 500

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
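
The function above indexes directly into the JSON response, so a neighborhood for which the response has no group raises an exception. The sketch below shows one defensive way to split the work into two small helpers. The names `build_explore_params` and `extract_venue_rows` and the payload `fake_payload` are hypothetical; the payload only mimics the keys that the notebook's code reads, with values taken from the first row of the venue table.

```python
def build_explore_params(lat, lng, client_id, client_secret,
                         version='20180605', radius=500, limit=50):
    """Return the query parameters that appear in the notebook's explore URL."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }


def extract_venue_rows(name, lat, lng, payload):
    """Flatten one response payload into rows; return [] if nothing usable came back."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # Fall back to an empty category when the list is missing or empty
        categories = venue.get('categories') or [{}]
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     categories[0].get('name')))
    return rows


# Hand-written payload with the same nesting as the notebook's code expects
fake_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Brookbanks Park',
                   'location': {'lat': 43.751976, 'lng': -79.33214},
                   'categories': [{'name': 'Park'}]}},
    ]}]}
}

params = build_explore_params(43.753259, -79.329656, 'ID', 'SECRET')
rows = extract_venue_rows('Parkwoods', 43.753259, -79.329656, fake_payload)
empty = extract_venue_rows('Nowhere', 0.0, 0.0, {})
print(params['ll'], rows, empty)
```

In a real run, the dictionary could be passed as the `params` argument of `requests.get`, which takes care of URL encoding, and the parsed JSON of the response would replace `fake_payload`.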

We create a data frame with that function, containing the venues in Toronto.

In [180]:
toronto_venues = getNearbyVenues(names=new_df['Neighbourhood'],
                                   latitudes=new_df['Latitude'],
                                   longitudes=new_df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park
Islington Avenue
Malvern, Rouge
Don Mills North
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Port Union, Rouge Hill, Highland Creek
Don Mills South, Flemingdon Park
Woodbine Heights
St. James Town
Humewood-Cedarvale
Old Burnhamthorpe, Markland Wood, Eringate, Bloordale Gardens
West Hill, Morningside, Guildwood
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Wilson Heights, Downsview North, Bathurst Manor
Thorncliffe Park
Richmond, King, Adelaide
Dufferin, Dovercourt Village
Scarborough Village
Oriole, Henry Farm, Fairview
York University, Northwood Park
East Toronto
Union Station, Toronto Islands, Harbourfront East
Trinity, Little Portugal
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview East, CFB Toronto
River

In [181]:
print(toronto_venues.shape)
toronto_venues.head()

(1708, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


Let's check the number of venues obtained for each neighbourhood.

In [182]:
toronto_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,4,4,4,4,4,4
Bayview Village,4,4,4,4,4,4
Berczy Park,50,50,50,50,50,50
Business Reply Mail Processing Centre 969 Eastern,18,18,18,18,18,18
Caledonia-Fairbanks,4,4,4,4,4,4
Canada Post Gateway Processing Centre,11,11,11,11,11,11
Cedarbrae,7,7,7,7,7,7
Central Bay Street,50,50,50,50,50,50
Christie,17,17,17,17,17,17
Church and Wellesley,50,50,50,50,50,50


In [183]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 251 unique categories.


We proceed to the one-hot encoding of the "Venue Category" so that we can use the k-means clustering method after.

In [184]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood']

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [185]:
toronto_onehot.shape

(1708, 252)

In [186]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.00,...,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00,0.00,0.0,0.000000
1,Bayview Village,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.00,...,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00,0.00,0.0,0.000000
2,Berczy Park,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.00,...,0.00,0.020000,0.000000,0.000000,0.000000,0.000000,0.00,0.00,0.0,0.000000
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.00,...,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00,0.00,0.0,0.055556
4,Caledonia-Fairbanks,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.00,...,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00,0.00,0.0,0.000000
5,Canada Post Gateway Processing Centre,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.090909,0.00,...,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00,0.00,0.0,0.000000
6,Cedarbrae,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.00,...,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00,0.00,0.0,0.000000
7,Central Bay Street,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.020000,0.00,...,0.00,0.020000,0.000000,0.000000,0.000000,0.000000,0.00,0.00,0.0,0.020000
8,Christie,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.00,...,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00,0.00,0.0,0.000000
9,Church and Wellesley,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.00,...,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00,0.02,0.0,0.000000


In [187]:
toronto_grouped.shape

(98, 252)
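
To make the two steps above concrete, here is a small sketch on toy data showing that one-hot encoding followed by a grouped mean yields, for each neighbourhood, the share of its venues that fall in each category. The frame `venues` is illustrative, and `dtype=float` is passed only so that the result does not depend on the default dummy type of the installed pandas version.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighbourhood': ['Parkwoods', 'Parkwoods',
                      'Victoria Village', 'Victoria Village',
                      'Victoria Village', 'Victoria Village'],
    'Venue Category': ['Park', 'Food & Drink Shop',
                       'Hockey Arena', 'Coffee Shop', 'Coffee Shop', 'Park'],
})

# One column per category, with 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of a 0/1 column is the fraction of venues in that category
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1, which is why these frequencies can be compared across neighbourhoods that returned very different numbers of venues.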

Now we print the name of each neighbourhood with its top 5 most common venues.

In [188]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
               venue  freq
0     Sandwich Place  0.25
1             Lounge  0.25
2     Breakfast Spot  0.25
3       Skating Rink  0.25
4  Martial Arts Dojo  0.00


----Bayview Village----
                        venue  freq
0         Japanese Restaurant  0.25
1                        Café  0.25
2                        Bank  0.25
3          Chinese Restaurant  0.25
4  Modern European Restaurant  0.00


----Berczy Park----
          venue  freq
0   Coffee Shop  0.08
1  Cocktail Bar  0.06
2    Steakhouse  0.04
3      Beer Bar  0.04
4          Café  0.04


----Business Reply Mail Processing Centre 969 Eastern----
                  venue  freq
0    Light Rail Station  0.11
1           Yoga Studio  0.06
2                   Spa  0.06
3  Gym / Fitness Center  0.06
4         Garden Center  0.06


----Caledonia-Fairbanks----
                  venue  freq
0                  Park  0.50
1  Fast Food Restaurant  0.25
2                Market  0.25
3                 Motel  0.00
4   

We borrow the function that sorts the venues in descending order.

In [189]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
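
As a quick illustration of what this function computes, the sketch below applies the same idea to a single toy row using `Series.nlargest`. The row values are made up, and the cast to float is needed because a row that mixes a name with numbers has object dtype.

```python
import pandas as pd

# One row of a grouped frame: the neighbourhood name followed by frequencies
row = pd.Series({'Neighbourhood': 'Agincourt', 'Lounge': 0.3, 'Park': 0.0,
                 'Skating Rink': 0.5, 'Café': 0.2})

# Drop the name, cast to float, then keep the two largest frequencies
top = row.iloc[1:].astype(float).nlargest(2)
print(list(top.index))
```

When several categories share the same frequency, the default sort used by `sort_values` is not stable, so the order among tied categories should not be read as meaningful.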

We create the data frame and display the top 10 venues of each neighbourhood.

In [190]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Skating Rink,Breakfast Spot,Sandwich Place,Falafel Restaurant,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
1,Bayview Village,Café,Bank,Japanese Restaurant,Chinese Restaurant,Drugstore,Discount Store,Dog Run,Donut Shop,Dumpling Restaurant,Dim Sum Restaurant
2,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Seafood Restaurant,Beer Bar,Cheese Shop,Café,Steakhouse,Farmers Market,Indian Restaurant
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Auto Workshop,Park,Comic Shop,Pizza Place,Restaurant,Burrito Place,Brewery,Skate Park,Smoke Shop
4,Caledonia-Fairbanks,Park,Fast Food Restaurant,Market,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


Running *k*-means to cluster the neighborhoods into 3 clusters.

In [191]:
# set number of clusters
kclusters = 3
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int32)

Creating a new data frame that includes the cluster category as well as the top 10 venues.

In [192]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = new_df

# join the sorted venues onto new_df, which holds the latitude/longitude of each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head(10)

Unnamed: 0,PostCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1.0,Park,Food & Drink Shop,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,French Restaurant,Coffee Shop,Portuguese Restaurant,Hockey Arena,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0.0,Coffee Shop,Park,Bakery,Café,Pub,Theater,Breakfast Spot,Mexican Restaurant,Ice Cream Shop,Chocolate Shop
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0.0,Clothing Store,Accessories Store,Boutique,Shoe Store,Miscellaneous Shop,Event Space,Arts & Crafts Store,Furniture / Home Store,Coffee Shop,Vietnamese Restaurant
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494,0.0,Coffee Shop,Park,Gym,Burger Joint,Diner,Chinese Restaurant,Seafood Restaurant,Sandwich Place,Burrito Place,Restaurant
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242,,,,,,,,,,,
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,0.0,Fast Food Restaurant,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore
7,M3B,North York,Don Mills North,43.745906,-79.352188,0.0,Gym / Fitness Center,Caribbean Restaurant,Japanese Restaurant,Café,Construction & Landscaping,Convenience Store,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,0.0,Fast Food Restaurant,Pizza Place,Pet Store,Athletics & Sports,Gastropub,Intersection,Pharmacy,Bus Line,Bank,Gym / Fitness Center
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0.0,Coffee Shop,Café,Theater,Ramen Restaurant,Clothing Store,Cosmetics Shop,Spa,Hotel,Beer Bar,Shopping Mall


In [193]:
toronto_merged.shape

(103, 16)

Let's count how many NaN values there are in the 'Cluster Labels' column.

In [194]:
toronto_merged['Cluster Labels'].isnull().sum(axis=0)

5

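Let us remove these rows. They are not failures of the k-means clustering: these neighbourhoods are absent from toronto_grouped (98 rows against 103 in new_df), presumably because the Foursquare query returned no venue for them, so the join leaves their cluster label and venue columns empty.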

In [195]:
toronto_merged.dropna(subset=['Cluster Labels'], axis=0, inplace=True)

In [196]:
toronto_merged.shape

(98, 16)
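
To see where the empty rows come from and which neighbourhoods they are, the join can be reproduced on toy data. In this sketch the frames `all_neighbourhoods` and `with_venues` are illustrative stand-ins for `new_df` and `neighborhoods_venues_sorted`, with made-up cluster labels.

```python
import pandas as pd

all_neighbourhoods = pd.DataFrame(
    {'Neighbourhood': ['Parkwoods', 'Islington Avenue', 'Glencairn']})
with_venues = pd.DataFrame(
    {'Neighbourhood': ['Parkwoods', 'Glencairn'], 'Cluster Labels': [1, 0]})

# Same kind of join as in the notebook: rows without a match get NaN
joined = all_neighbourhoods.join(with_venues.set_index('Neighbourhood'),
                                 on='Neighbourhood')

# Neighbourhoods that received no cluster label
no_venues = joined.loc[joined['Cluster Labels'].isnull(), 'Neighbourhood'].tolist()
print(no_venues)

# NaN forces the label column to float; after dropping the empty rows
# it can be cast back to integers
cleaned = joined.dropna(subset=['Cluster Labels']).copy()
cleaned['Cluster Labels'] = cleaned['Cluster Labels'].astype(int)
print(cleaned)
```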

In [197]:
toronto_merged.head(10)

Unnamed: 0,PostCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1.0,Park,Food & Drink Shop,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,French Restaurant,Coffee Shop,Portuguese Restaurant,Hockey Arena,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0.0,Coffee Shop,Park,Bakery,Café,Pub,Theater,Breakfast Spot,Mexican Restaurant,Ice Cream Shop,Chocolate Shop
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0.0,Clothing Store,Accessories Store,Boutique,Shoe Store,Miscellaneous Shop,Event Space,Arts & Crafts Store,Furniture / Home Store,Coffee Shop,Vietnamese Restaurant
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494,0.0,Coffee Shop,Park,Gym,Burger Joint,Diner,Chinese Restaurant,Seafood Restaurant,Sandwich Place,Burrito Place,Restaurant
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,0.0,Fast Food Restaurant,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore
7,M3B,North York,Don Mills North,43.745906,-79.352188,0.0,Gym / Fitness Center,Caribbean Restaurant,Japanese Restaurant,Café,Construction & Landscaping,Convenience Store,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,0.0,Fast Food Restaurant,Pizza Place,Pet Store,Athletics & Sports,Gastropub,Intersection,Pharmacy,Bus Line,Bank,Gym / Fitness Center
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0.0,Coffee Shop,Café,Theater,Ramen Restaurant,Clothing Store,Cosmetics Shop,Spa,Hotel,Beer Bar,Shopping Mall
10,M6B,North York,Glencairn,43.709577,-79.445073,0.0,Pub,Bakery,Pizza Place,Italian Restaurant,Japanese Restaurant,Yoga Studio,Diner,Discount Store,Dog Run,Donut Shop


The missing values turned the cluster labels into floats (1.0, 0.0, ...). Using them later to index the list of colors raises "TypeError: list indices must be integers or slices, not float", so we enforce the integer type of the cluster labels.

In [198]:
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(np.int64)

In [199]:
toronto_merged.head(10)

Unnamed: 0,PostCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1,Park,Food & Drink Shop,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,0,French Restaurant,Coffee Shop,Portuguese Restaurant,Hockey Arena,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Park,Bakery,Café,Pub,Theater,Breakfast Spot,Mexican Restaurant,Ice Cream Shop,Chocolate Shop
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0,Clothing Store,Accessories Store,Boutique,Shoe Store,Miscellaneous Shop,Event Space,Arts & Crafts Store,Furniture / Home Store,Coffee Shop,Vietnamese Restaurant
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494,0,Coffee Shop,Park,Gym,Burger Joint,Diner,Chinese Restaurant,Seafood Restaurant,Sandwich Place,Burrito Place,Restaurant
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,0,Fast Food Restaurant,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore
7,M3B,North York,Don Mills North,43.745906,-79.352188,0,Gym / Fitness Center,Caribbean Restaurant,Japanese Restaurant,Café,Construction & Landscaping,Convenience Store,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,0,Fast Food Restaurant,Pizza Place,Pet Store,Athletics & Sports,Gastropub,Intersection,Pharmacy,Bus Line,Bank,Gym / Fitness Center
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Coffee Shop,Café,Theater,Ramen Restaurant,Clothing Store,Cosmetics Shop,Spa,Hotel,Beer Bar,Shopping Mall
10,M6B,North York,Glencairn,43.709577,-79.445073,0,Pub,Bakery,Pizza Place,Italian Restaurant,Japanese Restaurant,Yoga Studio,Diner,Discount Store,Dog Run,Donut Shop


Visualizing the results:

In [200]:
# create map
map_clusters = folium.Map(location=[latitude_Toronto, longitude_Toronto], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per neighbourhood, colored by its cluster
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # cluster labels are 0-based, so they index the color list directly
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
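
The color scheme used for the markers can be tried out on its own, without the map. The following is a minimal sketch, assuming `kclusters = 3` (the actual value is set earlier in the notebook) and a hypothetical helper `cluster_color` that indexes the palette directly by the 0-based cluster label:

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 3  # assumption for this sketch; the notebook defines the real value

# Sample kclusters evenly spaced points of the rainbow colormap (RGBA rows)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))

# Convert each RGBA row to a hex string that folium accepts as a color
rainbow = [colors.rgb2hex(c) for c in colors_array]

def cluster_color(label):
    """Hypothetical helper: return the hex color for a 0-based cluster label."""
    return rainbow[int(label)]

for label in range(kclusters):
    print(label, cluster_color(label))
```

Each cluster label then maps to one distinct hex color, which can be passed as `color` and `fill_color` to `folium.CircleMarker`.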

Now we can examine the clusters one by one.

Cluster 1 (label 0):

In [201]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0,French Restaurant,Coffee Shop,Portuguese Restaurant,Hockey Arena,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
2,Downtown Toronto,0,Coffee Shop,Park,Bakery,Café,Pub,Theater,Breakfast Spot,Mexican Restaurant,Ice Cream Shop,Chocolate Shop
3,North York,0,Clothing Store,Accessories Store,Boutique,Shoe Store,Miscellaneous Shop,Event Space,Arts & Crafts Store,Furniture / Home Store,Coffee Shop,Vietnamese Restaurant
4,Queen's Park,0,Coffee Shop,Park,Gym,Burger Joint,Diner,Chinese Restaurant,Seafood Restaurant,Sandwich Place,Burrito Place,Restaurant
6,Scarborough,0,Fast Food Restaurant,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore
7,North York,0,Gym / Fitness Center,Caribbean Restaurant,Japanese Restaurant,Café,Construction & Landscaping,Convenience Store,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store
8,East York,0,Fast Food Restaurant,Pizza Place,Pet Store,Athletics & Sports,Gastropub,Intersection,Pharmacy,Bus Line,Bank,Gym / Fitness Center
9,Downtown Toronto,0,Coffee Shop,Café,Theater,Ramen Restaurant,Clothing Store,Cosmetics Shop,Spa,Hotel,Beer Bar,Shopping Mall
10,North York,0,Pub,Bakery,Pizza Place,Italian Restaurant,Japanese Restaurant,Yoga Studio,Diner,Discount Store,Dog Run,Donut Shop
11,Etobicoke,0,Golf Course,Yoga Studio,Department Store,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore


Cluster 2 (label 1):

In [202]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,1,Park,Food & Drink Shop,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
21,York,1,Park,Fast Food Restaurant,Market,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
32,Scarborough,1,Playground,Convenience Store,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
35,East York,1,Park,Convenience Store,Metro Station,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
40,North York,1,Electronics Store,Airport,Park,Dumpling Restaurant,Diner,Discount Store,Dog Run,Donut Shop,Drugstore,Yoga Studio
46,North York,1,Grocery Store,Park,Convenience Store,Bank,Hotel,Shopping Mall,Eastern European Restaurant,Electronics Store,Dumpling Restaurant,Dessert Shop
49,North York,1,Basketball Court,Park,Bakery,Construction & Landscaping,Yoga Studio,Dog Run,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
61,Central Toronto,1,Park,Swim School,Bus Line,Dim Sum Restaurant,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant
66,North York,1,Park,Bank,Convenience Store,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Donut Shop,Drugstore,Electronics Store
77,Etobicoke,1,Park,Pizza Place,Bus Line,Mobile Phone Shop,Dessert Shop,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


Cluster 3 (label 2):

In [203]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,North York,2,Baseball Field,Yoga Studio,Fast Food Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
101,Etobicoke,2,Breakfast Spot,Baseball Field,Yoga Studio,Fast Food Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant
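
The three selections above repeat the same expression with a different label. As a sketch only, they could be wrapped in a small function. The name `show_cluster` and the parameter `first_venue_col` are hypothetical, and the tiny `toy` frame below merely mimics the column order of `toronto_merged` with three rows taken from the table above:

```python
import pandas as pd

def show_cluster(df, label, first_venue_col=5):
    """Return the Borough column plus every column from position
    first_venue_col onward, for the rows whose cluster label matches."""
    cols = df.columns[[1] + list(range(first_venue_col, df.shape[1]))]
    return df.loc[df['Cluster Labels'] == label, cols]

# Toy stand-in for toronto_merged (same column order, only one venue column)
toy = pd.DataFrame({
    'PostCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
    'Latitude': [43.753259, 43.725882, 43.65426],
    'Longitude': [-79.329656, -79.315572, -79.360636],
    'Cluster Labels': [1, 0, 0],
    '1st Most Common Venue': ['Park', 'French Restaurant', 'Coffee Shop'],
})

cluster_0 = show_cluster(toy, 0)
print(cluster_0)
```

With the real data frame, `show_cluster(toronto_merged, 0)` should return the same rows as the first selection above.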


Because the clusters differ greatly in size, I am not sure it really makes sense to compare them directly. Still, there is a clear trend: Cluster 1 (label 0) contains a lot of coffee shops and restaurants, while Cluster 2 (label 1) contains mostly parks and shops.
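
Both observations, the unequal cluster sizes and the dominant venue type, can be checked numerically instead of by eye. The following is a minimal sketch on a small hand-copied sample of rows from the tables above (the `sample` frame is an illustration, not the full `toronto_merged` data); it counts the rows per label and picks the most frequent first venue in each cluster:

```python
import pandas as pd

# A few (label, 1st Most Common Venue) pairs copied from the outputs above
sample = pd.DataFrame({
    'Cluster Labels': [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2],
    '1st Most Common Venue': [
        'French Restaurant', 'Coffee Shop', 'Clothing Store',
        'Coffee Shop', 'Coffee Shop',
        'Park', 'Park', 'Playground', 'Park',
        'Baseball Field', 'Breakfast Spot',
    ],
})

# Number of neighbourhoods per cluster label
sizes = sample['Cluster Labels'].value_counts().sort_index()

# Most frequent first venue within each cluster
top_venue = (sample.groupby('Cluster Labels')['1st Most Common Venue']
                   .agg(lambda s: s.value_counts().idxmax()))

print(sizes)
print(top_venue)
```

Applied to `toronto_merged`, the same two expressions would give the true cluster sizes and the most frequent first venue per cluster.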