# Exploring Toronto Neighbourhoods

In this notebook we will show how to
1. scrape data from the internet,
2. query an online search-and-discovery service,
3. geolocate the data obtained,
4. plot it on a 2D map,
5. and analyze it to study the neighbourhoods of Toronto.

Do the venues within a neighbourhood tell us something about it? We will try to answer this question by
clustering neighbourhoods according to the kinds of venues found in them (restaurants, coffee shops, types of restaurant, etc.).

In the first part we will gather data on the postal codes, boroughs and neighbourhoods of Toronto. In the second part we will find the approximate geographic coordinates of each neighbourhood. Finally, we will explore the types of venues found in each neighbourhood and use them to try to characterize it.

We will end by plotting the resulting clusters on a map and discussing the results.

In [338]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import re
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


# Toronto Postal Codes and Neighbourhoods

We will collate the data on the different neighbourhoods and postal codes of Toronto from the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). We will scrape this page using `requests` and BeautifulSoup.

In [2]:
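url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html = requests.get(url)
# name the parser explicitly so the result does not depend on which parsers are installed
wp = bs(html.text, 'html.parser')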


This Wikipedia entry may contain several tables. Let's check which one is the one we are interested in.

In [3]:
#wp.find_all(name=(lambda x: not x=='html' ),attrs={'class':True})
tables = wp.find_all('table')
len(tables)

5

In [4]:
print(tables[0])#.text[:100])

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Harbourfront</a>
</td></tr>
<tr>
<td>M6A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Lawrence_Heights" title="Lawrence Heights">Lawrence Heights</a>
</td></tr>
<tr>
<td>M6A</td>
<td><a href="/wiki/North

The column headings:

In [5]:
borough_table = tables[0]
headings = [ h.text.strip() for h in borough_table.find_all('th') ]
headings

['Postcode', 'Borough', 'Neighbourhood']

The actual rows.

Here we need to:  
1. filter out unwanted entries, i.e. rows whose borough is 'Not assigned';
2. combine rows that share a postal code into a single entry, with the neighbourhoods joined as a comma-separated list;
3. set the neighbourhood equal to the borough whenever the neighbourhood is 'Not assigned';
4. show the `.shape` of the resulting dataframe in the last cell;
5. clean up and comment the notebook (markdown).

First, inspect the table as seen by BeautifulSoup for unwanted entries at its boundaries.

In [6]:
borough_table.find_all('tr')[:3]

[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>, <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>]

In [7]:
borough_table.find_all('tr')[-3:]

[<tr>
 <td>M8Z</td>
 <td><a href="/wiki/Etobicoke" title="Etobicoke">Etobicoke</a></td>
 <td>Royal York South West
 </td></tr>, <tr>
 <td>M8Z</td>
 <td><a href="/wiki/Etobicoke" title="Etobicoke">Etobicoke</a></td>
 <td>South of Bloor
 </td></tr>, <tr>
 <td>M9Z</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>]

Since the headings sit inside the first `tr` tag, we can simply skip the first `tr` match when listing the
table rows.

To filter out the non-assigned boroughs we use two helper functions, `nta` and `mtchBor`. The former checks whether a string is `'Not assigned'`, comparing in lowercase. The latter returns the borough name whenever the neighbourhood is not assigned, and the neighbourhood itself otherwise.

Finally, we will merge any entries that share a postal code. In that case, the different neighbourhoods are combined into a single comma-separated string.

In [8]:
# nta: is the string 'Not assigned'? (case-insensitive)
def nta(s):
    return re.match(r'not\s+assigned', s.strip().lower()) is not None

# if the neighbourhood is 'Not assigned', use its borough instead
def mtchBor(nei, bor):
    return bor if nta(nei) else nei

# one [postcode, borough, neighbourhood] list per table row,
# skipping the header row and rows whose borough is not assigned
postalcodes = [[td.text.strip() for td in tr.find_all('td')]
               for tr in borough_table.find_all('tr')[1:]
               if not nta(tr.find_all('td')[1].text)]

# the inner lists are modified in place
for zpc in postalcodes:
    zpc[2] = mtchBor(zpc[2], zpc[1])
postalcodes[:10]

[['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', 'Downtown Toronto', "Queen's Park"],
 ['M9A', "Queen's Park", "Queen's Park"],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern'],
 ['M3B', 'North York', 'Don Mills North']]

Let's now combine the neighbourhoods that share a postal code.

We'll use a dictionary, `rows`, as a hash table keyed by postal code and then turn it back into a list before
creating the dataframe.

In [9]:
rows = {}
for pc in postalcodes:
    if pc[0] in rows: 
        rows[pc[0]][2] += ', ' + pc[2]
    else:
        rows[pc[0]] = pc 
rows

{'M3A': ['M3A', 'North York', 'Parkwoods'],
 'M4A': ['M4A', 'North York', 'Victoria Village'],
 'M5A': ['M5A', 'Downtown Toronto', 'Harbourfront'],
 'M6A': ['M6A', 'North York', 'Lawrence Heights, Lawrence Manor'],
 'M7A': ['M7A', 'Downtown Toronto', "Queen's Park"],
 'M9A': ['M9A', "Queen's Park", "Queen's Park"],
 'M1B': ['M1B', 'Scarborough', 'Rouge, Malvern'],
 'M3B': ['M3B', 'North York', 'Don Mills North'],
 'M4B': ['M4B', 'East York', 'Woodbine Gardens, Parkview Hill'],
 'M5B': ['M5B', 'Downtown Toronto', 'Ryerson, Garden District'],
 'M6B': ['M6B', 'North York', 'Glencairn'],
 'M9B': ['M9B',
  'Etobicoke',
  'Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park'],
 'M1C': ['M1C', 'Scarborough', 'Highland Creek, Rouge Hill, Port Union'],
 'M3C': ['M3C', 'North York', 'Flemingdon Park, Don Mills South'],
 'M4C': ['M4C', 'East York', 'Woodbine Heights'],
 'M5C': ['M5C', 'Downtown Toronto', 'St. James Town'],
 'M6C': ['M6C', 'York', 'Humewood-Cedarvale'],
 'M9C': 

In [10]:
postalcodes = list(rows.values())
postalcodes[:5]

[['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M6A', 'North York', 'Lawrence Heights, Lawrence Manor'],
 ['M7A', 'Downtown Toronto', "Queen's Park"]]

Finally, we build the dataframe containing all the desired postal codes with their neighbourhoods.

In [11]:
zipcs_df = pd.DataFrame(postalcodes,columns=headings)
zipcs_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


In [12]:
zipcs_df.shape

(103, 3)
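
The same three cleaning steps can also be expressed with pandas alone. The following is only a sketch on a handful of rows modelled on the table above; the names `raw`, `assigned` and `merged` are illustrative and not part of the notebook.

```python
import pandas as pd

# a few rows in the same shape as the scraped table
raw = pd.DataFrame(
    [['M1A', 'Not assigned', 'Not assigned'],
     ['M3A', 'North York', 'Parkwoods'],
     ['M6A', 'North York', 'Lawrence Heights'],
     ['M6A', 'North York', 'Lawrence Manor'],
     ['M9A', "Queen's Park", 'Not assigned']],
    columns=['Postcode', 'Borough', 'Neighbourhood'])

# 1. drop rows whose borough is not assigned
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. a neighbourhood that is not assigned takes the name of its borough
unnamed = assigned['Neighbourhood'] == 'Not assigned'
assigned.loc[unnamed, 'Neighbourhood'] = assigned.loc[unnamed, 'Borough']

# 3. merge rows sharing a postal code, joining neighbourhoods with a comma
merged = (assigned
          .groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
          .agg(', '.join)
          .reset_index())
print(merged)
```

Grouping with `sort=False` keeps the postal codes in the order in which they first appear, which matches the behaviour of the dictionary-based approach.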

# Geolocation of Toronto Neighbourhoods

Google recently started [charging for its Maps API](http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/), so we will try the [Geocoder Python package](https://geocoder.readthedocs.io/index.html) instead.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code. A call for the latitude and longitude of a postal code may return `None`, while repeating the very same call returns the coordinates. So, to make sure we get coordinates for all of our neighbourhoods, we can run a `while` loop for each postal code.
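
An open-ended `while` loop never terminates if the service keeps failing. A bounded variant could look like the sketch below; `retry_until_result` and its parameters are hypothetical names, and the flaky call is simulated so that the example runs offline.

```python
import time

def retry_until_result(fetch, max_attempts=10, delay=0.0):
    """Call fetch() until it returns something other than None.

    Returns the first non-None result, or None once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        time.sleep(delay)
    return None

# stand-in for a flaky geocoding call: fails twice, then succeeds
# (the coordinates are illustrative values for a downtown postal code)
answers = iter([None, None, (43.65426, -79.360636)])
coords = retry_until_result(lambda: next(answers))
print(coords)
```

In real use, `fetch` would wrap the actual geocoding call, and a non-zero `delay` would avoid hammering the service.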

In [14]:
#!pip install geocoder

In [17]:
import geocoder

# initialize the result to None
lat_lng_coords = None

# loop until we get the coordinates
c = 0
while lat_lng_coords is None:
    if c % 10 == 0:
        print('count: c=', c)
        # give up after 10 attempts: store a placeholder plus whatever the
        # last call returned, so that the loop terminates
        if c > 0: lat_lng_coords = [1, gc.latlng]
    c += 1
    #gc = geocoder.google('{}, Toronto, Ontario'.format('M5R'))
    gc = geocoder.google("453 Booth Street, Ottawa ON")
    #lat_lng_coords = gc.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

count: c= 0
count: c= 10


In [20]:
latitude # longitude[0] -> NoneType isn't indexable

1

##### Alas, the geocoder never returned an answer, even after several minutes of retrying!

We will resort to the table of (postal code, coordinates) pairs available at https://cocl.us/Geospatial_data

In [21]:
zipCoord_df = pd.read_csv('https://cocl.us/Geospatial_data')
zipCoord_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [44]:
zipCoord_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
Postal Code    103 non-null object
Latitude       103 non-null float64
Longitude      103 non-null float64
dtypes: float64(2), object(1)
memory usage: 2.5+ KB


In [40]:
# rename 'Postcode' to match the key column of zipCoord_df ('Postal Code')
zipcs_df = zipcs_df.rename(columns={zipcs_df.columns[0]: zipCoord_df.columns[0]})
zipcs_df.columns

Index(['Postal Code', 'Borough', 'Neighbourhood'], dtype='object')

In [41]:
zipCoord_df[zipCoord_df['Postal Code']=='M5A']

Unnamed: 0,Postal Code,Latitude,Longitude
53,M5A,43.65426,-79.360636


In [42]:
zipcs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
Postal Code      103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB


In [49]:
toNeigGeo_df = pd.merge(zipcs_df,zipCoord_df,on=['Postal Code'],sort=False)
toNeigGeo_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494


In [50]:
toNeigGeo_df.shape

(103, 5)

In [51]:
toNeigGeo_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 5 columns):
Postal Code      103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
Latitude         103 non-null float64
Longitude        103 non-null float64
dtypes: float64(2), object(3)
memory usage: 4.8+ KB
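
Both tables hold 103 postal codes, so the merge is expected to be one-to-one with no unmatched rows. One way to check this, sketched here on toy frames (`left`, `right` and `check` are illustrative names), is to use the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M5A'],
                     'Borough': ['North York', 'North York', 'Downtown Toronto']})
right = pd.DataFrame({'Postal Code': ['M5A', 'M3A', 'M4A'],
                      'Latitude': [43.65426, 43.753259, 43.725882],
                      'Longitude': [-79.360636, -79.329656, -79.315572]})

# validate raises MergeError if a postal code is duplicated on either side;
# indicator adds a '_merge' column telling where each row came from
check = pd.merge(left, right, on='Postal Code', how='outer',
                 validate='one_to_one', indicator=True)

# rows present in only one of the two tables
unmatched = check[check['_merge'] != 'both']
print(len(unmatched))
```

An empty `unmatched` frame means every postal code was found in both tables, so the default inner merge used above would not drop any row.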


Let's list all the neighbourhoods in Toronto.

In [79]:
sorted(toNeigGeo_df['Neighbourhood'].unique())

['Adelaide, King, Richmond',
 'Agincourt',
 "Agincourt North, L'Amoreaux East, Milliken, Steeles East",
 'Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown',
 'Alderwood, Long Branch',
 'Bathurst Manor, Downsview North, Wilson Heights',
 'Bayview Village',
 'Bedford Park, Lawrence Manor East',
 'Berczy Park',
 'Birch Cliff, Cliffside West',
 'Bloordale Gardens, Eringate, Markland Wood, Old Burnhamthorpe',
 'Brockton, Exhibition Place, Parkdale Village',
 'Business Reply Mail Processing Centre 969 Eastern',
 'CFB Toronto, Downsview East',
 'CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara',
 'Cabbagetown, St. James Town',
 'Caledonia-Fairbanks',
 'Canada Post Gateway Processing Centre',
 'Cedarbrae',
 'Central Bay Street',
 'Chinatown, Grange Park, Kensington Market',
 'Christie',
 'Church and Wellesley',
 'Clairlea, Golden Mile, Oakridge',
 "Clarks Corners, Sullivan, Ta

# Analysis and Clustering of Toronto Neighbourhoods data

In [52]:
from geopy.geocoders import Nominatim

In [68]:
# note: this rebinds the name `geocoder`, shadowing the package imported earlier
geocoder = Nominatim(user_agent='to_explorer')
to_location = geocoder.geocode('Toronto, ON')
print(to_location)
for i in to_location:
    print(type(i),i)

Toronto, Golden Horseshoe, Ontario, M6K 1X9, Canada
<class 'str'> Toronto, Golden Horseshoe, Ontario, M6K 1X9, Canada
<class 'tuple'> (43.653963, -79.387207)


In [69]:
import folium
#toronto = folium.Map([43.753259,-79.329656])
toronto = folium.Map(to_location[1],zoom_start=13)
toronto

Let's inspect a given neighbourhood first, say, 'The Annex'.

In [134]:
toNeigGeo_df[toNeigGeo_df['Neighbourhood'].str.contains('Annex')]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
74,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678


In [100]:
annexloc = geocoder.geocode('The Annex, Toronto, Ontario, Canada')
annexloc= annexloc[1] #coordinates

In [101]:
annex_map = folium.Map(annexloc,zoom_start=16)
annex_map

Explore venues in The Annex using Foursquare.

<!--
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
-->

In [369]:
# Set up the Foursquare GET URI
# (CLIENT_ID, CLIENT_SECRET and VERSION are defined in a cell not shown here)
BURI = 'https://api.foursquare.com/v2/'
OBJECT = 'venues/'  # venues, users, tips,...
DV = 'client_id={}&client_secret={}&v={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION)
RADIUS = '&radius={}'.format(500)
LOC = '&ll={},{}'.format(annexloc[0], annexloc[1])
LIMIT = '&limit={}'.format(100)
REQTYPE = 'explore?'  # search, explore, tips, trending, venue_id, tip_id, user_id,

url = BURI + OBJECT + REQTYPE + DV + LOC + RADIUS + LIMIT
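
Concatenating query fragments by hand works, but it is easy to misplace an `&`, and values are not escaped. As an alternative sketch, the query string can be built from a dictionary with the standard library; the credentials below are placeholders and `build_explore_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, client_id, client_secret,
                      version='20180604', radius=500, limit=100):
    """Build a venues/explore URL from a dictionary of query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # the comma gets percent-encoded as %2C
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# placeholder credentials; the coordinates are those listed for postal code M5R
example_url = build_explore_url(43.67271, -79.405678, 'MY_ID', 'MY_SECRET')
print(example_url)
```

With `requests`, the same dictionary could instead be passed through the `params` argument of `requests.get`, which performs the encoding itself.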

In [91]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e3201889da7ee001b0035de'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'The Annex',
  'headerFullLocation': 'The Annex, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 42,
  'suggestedBounds': {'ne': {'lat': 43.6748377045, 'lng': -79.40090733739737},
   'sw': {'lat': 43.665837695499995, 'lng': -79.41332666260263}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4c0c36c0a1b32d7f97439cf0',
       'name': 'Jean Sibelius Square',
       'location': {'address': 'Wells St and Kendal Ave.',
        'lat': 43.67142628683921,
        'lng': -79.40883090368688,
        'labeledLatLngs': [{'label': 'display',
          'lat'

In [93]:
annex_venues = results['response']['groups'][0]['items']
annex_venues

[{'reasons': {'count': 0,
   'items': [{'summary': 'This spot is popular',
     'type': 'general',
     'reasonName': 'globalInteractionReason'}]},
  'venue': {'id': '4c0c36c0a1b32d7f97439cf0',
   'name': 'Jean Sibelius Square',
   'location': {'address': 'Wells St and Kendal Ave.',
    'lat': 43.67142628683921,
    'lng': -79.40883090368688,
    'labeledLatLngs': [{'label': 'display',
      'lat': 43.67142628683921,
      'lng': -79.40883090368688}],
    'distance': 183,
    'cc': 'CA',
    'city': 'Toronto',
    'state': 'ON',
    'country': 'Canada',
    'formattedAddress': ['Wells St and Kendal Ave.', 'Toronto ON', 'Canada']},
   'categories': [{'id': '4bf58dd8d48988d163941735',
     'name': 'Park',
     'pluralName': 'Parks',
     'shortName': 'Park',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/park_',
      'suffix': '.png'},
     'primary': True}],
   'photos': {'count': 0, 'groups': []}},
  'referralId': 'e-0-4c0c36c0a1b32d7f97439cf0-0'},
 {'r

In [99]:
annex_venues_summ = [ [item['venue']['name'],
                       item['venue']['categories'][0]['name'],
                       item['venue']['location']['lat'],
                       item['venue']['location']['lng']
                      ]
                      for item in annex_venues
                    ]
annex_venues_summ[:5],len(annex_venues_summ)

([['Jean Sibelius Square', 'Park', 43.67142628683921, -79.40883090368688],
  ['Fresh on Bloor',
   'Vegetarian / Vegan Restaurant',
   43.66675488472059,
   -79.40349130034014],
  ['Roti Cuisine of India',
   'Indian Restaurant',
   43.67461834990478,
   -79.40824866273744],
  ['Fuwa Fuwa Japanese Pancakes',
   'Pastry Shop',
   43.66587984725407,
   -79.40783968911445],
  ['The Madison Avenue Pub', 'Pub', 43.667946944171426, -79.40348638003233]],
 42)

In [103]:
annex_venues_df = pd.DataFrame(annex_venues_summ,columns=['name','category','lat','lng'])
annex_venues_df.head()

Unnamed: 0,name,category,lat,lng
0,Jean Sibelius Square,Park,43.671426,-79.408831
1,Fresh on Bloor,Vegetarian / Vegan Restaurant,43.666755,-79.403491
2,Roti Cuisine of India,Indian Restaurant,43.674618,-79.408249
3,Fuwa Fuwa Japanese Pancakes,Pastry Shop,43.66588,-79.40784
4,The Madison Avenue Pub,Pub,43.667947,-79.403486


In [126]:
annex_venues_map = folium.Map(annexloc,zoom_start=16)

for name, category, lat, lng in zip(annex_venues_df['name'],
                                    annex_venues_df['category'],
                                    annex_venues_df['lat'],
                                    annex_venues_df['lng']
                                   ):
    label=name+' ('+category+')'
    label = folium.Popup(label)
    folium.CircleMarker(
        popup=label,
        location=[lat,lng],
        radius=5,
        fill=True
    ).add_to(annex_venues_map)
annex_venues_map

Let's now study the venues located in downtown Toronto.

First, let's retrieve all the neighbourhoods in the downtown borough.

In [141]:
# select the downtown rows and the columns we need in one step,
# then renumber the rows from 0 without keeping the old index
is_downtown = toNeigGeo_df['Borough'].str.contains('owntown')
toDowntown_df = (toNeigGeo_df.loc[is_downtown, ['Neighbourhood', 'Latitude', 'Longitude']]
                 .reset_index(drop=True))
toDowntown_df

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Harbourfront,43.65426,-79.360636
1,Queen's Park,43.662301,-79.389494
2,"Ryerson, Garden District",43.657162,-79.378937
3,St. James Town,43.651494,-79.375418
4,Berczy Park,43.644771,-79.373306
5,Central Bay Street,43.657952,-79.387383
6,Christie,43.669542,-79.422564
7,"Adelaide, King, Richmond",43.650571,-79.384568
8,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752
9,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576


In [142]:
toDowntown_df.shape

(19, 3)

Let's write a function that queries Foursquare, one neighbourhood after another, for the venues within each neighbourhood of Downtown Toronto.

In [235]:
def explore4Sq(lat,lng,radius=500,limit=100,name=''):
    if not name == '': print(name)
        
    LOC='&ll={},{}'.format(lat,lng)
    RADIUS='&radius='+str(radius)
    LIMIT='&limit='+str(limit)
    url=BURI+OBJECT+REQTYPE+DV+LOC+RADIUS+LIMIT

    return requests.get(url).json()['response']['groups'][0]['items']
   
def exploreVenues(neinames, neilats,neilngs):
    venues_list = []
    
    for neiname, neilat, neilng in zip(neinames,neilats,neilngs):
        venueds = explore4Sq(neilat,neilng,name=neiname) #list of venue_dicts
        for venued in venueds:
            venues_list.append([
                neiname, neilat, neilng,
                venued['venue']['name'], 
                venued['venue']['location']['lat'], 
                venued['venue']['location']['lng'],
                venued['venue']['categories'][0]['name'] 
            ]
            )
    
    return venues_list
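
The chained indexing in `explore4Sq` and `exploreVenues` raises an exception if a response lacks the `groups` list or a venue has no category. A more defensive extraction might look like the following sketch, which runs on a hand-made dictionary mimicking the structure of the response shown earlier; `extract_venues` is a hypothetical helper and the second venue is invented for the example.

```python
def extract_venues(payload):
    """Return [name, lat, lng, category] rows from an explore-style response.

    Missing keys yield an empty list, and a venue without categories gets None.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append([venue.get('name'), location.get('lat'),
                     location.get('lng'), category])
    return rows

# hand-made payload: the first venue is taken from the output above,
# the second one is made up and has no category
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Jean Sibelius Square',
               'location': {'lat': 43.67142628683921, 'lng': -79.40883090368688},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.67, 'lng': -79.40},
               'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({'meta': {'code': 400}}))
```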
        

In [237]:
# Test case 
#loc=toDowntown_df.loc[:1]
#vs = exploreVenues(loc['Neighbourhood'],loc['Latitude'],loc['Longitude'])
#vs 

In [238]:
downtown_venues = exploreVenues(toDowntown_df['Neighbourhood'],toDowntown_df['Latitude'],toDowntown_df['Longitude'])


Harbourfront
Queen's Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Christie
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown, St. James Town
First Canadian Place, Underground city
Church and Wellesley


We will set up a pandas dataframe with these results to ease the analysis.

In [242]:
downtown_venues_df = pd.DataFrame(downtown_venues,columns=['Neighbourhood',
                                                           'Nlat','Nlng',
                                                           'Venue','Vlat','Vlng','Vcat']
                                 )
print('We obtained {} venues in downtown Toronto'.format(downtown_venues_df.shape[0]))
downtown_venues_df.head()

We obtained 1307 venues in downtown Toronto


Unnamed: 0,Neighbourhood,Nlat,Nlng,Venue,Vlat,Vlng,Vcat
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653191,-79.357947,Gym / Fitness Center
3,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


Let's see what the categories are and how many distinct ones there are.

In [262]:
downtown_venues_categories = downtown_venues_df['Vcat'].unique()
print('There are {} distinct categories of venues in downtown Toronto. Here is a sample:'.format(len(downtown_venues_categories)))
sorted(downtown_venues_categories)[::10]

There are 204 distinct categories of venues in downtown Toronto. Here is a sample:


['Afghan Restaurant',
 'Art Gallery',
 'Baseball Stadium',
 'Boutique',
 'Camera Store',
 'College Arts Building',
 'Dance Studio',
 'Dumpling Restaurant',
 'Fish Market',
 'Gas Station',
 'Grocery Store',
 'Hospital',
 'Jewelry Store',
 "Men's Store",
 'Neighborhood',
 'Performing Arts Venue',
 'Pub',
 'Scenic Lookout',
 'Sporting Goods Shop',
 'Tea Room',
 'Wine Bar']

These categories are evidently very granular: they draw distinctions as fine as that between a Thai restaurant and a sushi one. This raises the question of whether such distinctions are too fine.

To put it another way: if the features describing the venues of a neighbourhood are too detailed, we risk overfitting the analysis. Is it really meaningful to distinguish a sports bar from a wine bar or from a pub? Would such differences really tell us something unique _and still relevant_ about the neighbourhoods?

Let's give it a try nevertheless and see what results we get.

First, let's count how many venues we have for each postal-code area. Instead of the postal code, each area is labelled by the neighbourhoods that belong to it.

In [260]:
downtown_venues_df.groupby('Neighbourhood').count().sort_values(by='Venue',ascending=False)['Venue']

Neighbourhood
Adelaide, King, Richmond                                                                                      100
Commerce Court, Victoria Hotel                                                                                100
St. James Town                                                                                                100
Ryerson, Garden District                                                                                      100
Harbourfront East, Toronto Islands, Union Station                                                             100
First Canadian Place, Underground city                                                                        100
Design Exchange, Toronto Dominion Centre                                                                      100
Stn A PO Boxes 25 The Esplanade                                                                                94
Chinatown, Grange Park, Kensington Market                                 

In [268]:
downtown_venues_onehot_df = pd.get_dummies(downtown_venues_df['Vcat'],prefix='',prefix_sep='')
downtown_venues_onehot_df['Neighbourhood'] = downtown_venues_df['Neighbourhood']
downtown_venues_onehot_df

Unnamed: 0,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Neighbourhood
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Harbourfront
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Harbourfront
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Harbourfront
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Harbourfront
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Harbourfront
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1302,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Church and Wellesley
1303,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Church and Wellesley
1304,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Church and Wellesley
1305,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Church and Wellesley


Let's now group again by neighbourhood and get the average value of each category.

In [270]:
dtohve_grouped = downtown_venues_onehot_df.groupby('Neighbourhood').mean().reset_index()
dtohve_grouped

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.055556,0.055556,0.055556,0.111111,0.166667,0.111111,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,...,0.0,0.0,0.0,0.012658,0.0,0.0,0.012658,0.0,0.0,0.012658
5,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.046512,0.0,0.069767,0.011628,0.0,0.0,0.0
6,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Church and Wellesley,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,0.0,...,0.0,0.0,0.0,0.0,0.0,0.012195,0.0,0.012195,0.0,0.012195
8,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0
9,"Design Exchange, Toronto Dominion Centre",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0
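
Averaging one-hot columns within a group gives the fraction of that group's venues falling in each category. The toy example below (invented data, illustrative names) sketches this and compares it with `pd.crosstab(..., normalize='index')`, which computes the same frequencies in a single call.

```python
import pandas as pd

# six made-up venues in two made-up neighbourhoods
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Vcat': ['Coffee Shop', 'Coffee Shop', 'Park', 'Pub', 'Park', 'Park'],
})

# route 1: one-hot encode the category, then average per neighbourhood
onehot = pd.get_dummies(venues['Vcat'], prefix='', prefix_sep='', dtype=float)
onehot['Neighbourhood'] = venues['Neighbourhood']
freq_onehot = onehot.groupby('Neighbourhood').mean()

# route 2: a row-normalised contingency table
freq_crosstab = pd.crosstab(venues['Neighbourhood'], venues['Vcat'],
                            normalize='index')

print(freq_onehot)
print(freq_crosstab)
```

In neighbourhood `A`, two of the four venues are coffee shops, so both tables report a frequency of 0.5 for that category.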


Notice that the ratio of neighbourhoods to features (categories) is here $19/204\approx 0.09$, which is about $20\%$ smaller than in the New York case study, where the ratio was $40/338\approx 0.12$.
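
As a quick check of this comparison, the ratios can be computed directly. The sketch assumes that the number of features is the 204 distinct categories counted earlier, and takes the New York figures as quoted.

```python
# neighbourhood rows and venue categories in each study
toronto_ratio = 19 / 204    # downtown Toronto: 19 rows, 204 categories
new_york_ratio = 40 / 338   # figures quoted for the New York case study

# how much smaller the Toronto ratio is, relative to the New York one
relative_drop = 1 - toronto_ratio / new_york_ratio
print(round(toronto_ratio, 3), round(new_york_ratio, 3), round(relative_drop, 2))
```

With unrounded ratios the Toronto value comes out roughly a fifth smaller than the New York one.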

Let's print each neighbourhood with its 5 most frequent venue types.

In [275]:
num_top_venues = 5

for hood in dtohve_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = dtohve_grouped[dtohve_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
              venue  freq
0       Coffee Shop  0.07
1              Café  0.04
2        Steakhouse  0.04
3  Asian Restaurant  0.03
4        Restaurant  0.03


----Berczy Park----
          venue  freq
0   Coffee Shop  0.07
1  Cocktail Bar  0.05
2        Bakery  0.04
3    Steakhouse  0.04
4          Café  0.04


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
                 venue  freq
0      Airport Service  0.17
1       Airport Lounge  0.11
2     Airport Terminal  0.11
3          Coffee Shop  0.06
4  Rental Car Location  0.06


----Cabbagetown, St. James Town----
         venue  freq
0  Coffee Shop  0.07
1         Café  0.04
2       Market  0.04
3  Pizza Place  0.04
4         Park  0.04


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.15
1  Italian Restaurant  0.05
2                Café  0.04
3        Burger Joint  0.04
4      Ice Cream Shop  0.04


---

Let's display the top 10 venue types for each neighbourhood.

In [370]:
def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighbourhood name) and sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # beyond '3rd', fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = dtohve_grouped['Neighbourhood']

for ind in np.arange(dtohve_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dtohve_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.sort_values(by='1st Most Common Venue')

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Lounge,Airport Terminal,Boat or Ferry,Plane,Airport,Airport Food Court,Airport Gate,Sculpture Garden,Boutique
5,"Chinatown, Grange Park, Kensington Market",Café,Vietnamese Restaurant,Bar,Dumpling Restaurant,Coffee Shop,Chinese Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Burger Joint,Donut Shop
11,"Harbord, University of Toronto",Café,Restaurant,Bakery,Bar,Bookstore,Japanese Restaurant,Sandwich Place,Flower Shop,Pub,Poutine Place
0,"Adelaide, King, Richmond",Coffee Shop,Café,Steakhouse,Restaurant,Bar,Bakery,Thai Restaurant,Cosmetics Shop,Asian Restaurant,Gym
16,"Ryerson, Garden District",Coffee Shop,Clothing Store,Cosmetics Shop,Bakery,Café,Japanese Restaurant,Italian Restaurant,Ramen Restaurant,Pizza Place,Bookstore
14,Queen's Park,Coffee Shop,Gym,Park,Fried Chicken Joint,Salad Place,Restaurant,Portuguese Restaurant,Nightclub,Mexican Restaurant,Juice Bar
13,"Harbourfront East, Toronto Islands, Union Station",Coffee Shop,Aquarium,Hotel,Café,Italian Restaurant,Restaurant,Scenic Lookout,Brewery,Fried Chicken Joint,Pizza Place
12,Harbourfront,Coffee Shop,Park,Bakery,Pub,Café,Restaurant,Mexican Restaurant,Shoe Store,Breakfast Spot,Brewery
10,"First Canadian Place, Underground city",Coffee Shop,Café,Restaurant,Steakhouse,Asian Restaurant,Japanese Restaurant,Deli / Bodega,Hotel,Seafood Restaurant,Bar
9,"Design Exchange, Toronto Dominion Centre",Coffee Shop,Café,Hotel,Restaurant,Italian Restaurant,Bar,Steakhouse,Gastropub,Seafood Restaurant,Deli / Bodega


Let's now use the information on distribution of venue types to cluster the data. 

In [371]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

dtohve_clustering = dtohve_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dtohve_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 3, 2, 2, 2, 4, 2, 2, 2], dtype=int32)
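To make the clustering step more concrete, here is a minimal sketch of the iterations behind k-means, written in plain NumPy on a hypothetical venue-frequency matrix. It is not scikit-learn's implementation: `lloyd_kmeans`, `toy_freq` and the first-k-rows initialisation are assumptions made for illustration (scikit-learn seeds its centroids differently).

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations; centroids start at the first k rows
    (an assumption made for reproducibility)."""
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # distance from every row to every centroid, shape (n_rows, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical rows: share of (coffee shops, parks, airport venues) per neighbourhood
toy_freq = np.array([
    [0.8, 0.1, 0.1],   # coffee-heavy
    [0.1, 0.1, 0.8],   # airport-heavy
    [0.7, 0.2, 0.1],   # coffee-heavy
    [0.0, 0.1, 0.9],   # airport-heavy
])
toy_labels, toy_centroids = lloyd_kmeans(toy_freq, k=2)
print(toy_labels)
```

Neighbourhoods with similar venue profiles end up with the same label, which is the property the clustering above relies on.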

Let's add the cluster labels to the top-10-venues table and combine them with each neighbourhood's borough and location in a single table.

In [372]:
# keep the downtown boroughs only and drop the postal code column
toBN = toNeigGeo_df.drop(columns='Postal Code')[toNeigGeo_df['Borough'].str.contains('owntown')]
toBN.head(2)

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
2,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,Downtown Toronto,Queen's Park,43.662301,-79.389494


In [373]:
to_downtown_venues_info = pd.DataFrame(neighbourhoods_venues_sorted)
to_downtown_venues_info.insert(0,'Cluster label',kmeans.labels_)
to_downtown_venues_info

Unnamed: 0,Cluster label,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,2,"Adelaide, King, Richmond",Coffee Shop,Café,Steakhouse,Restaurant,Bar,Bakery,Thai Restaurant,Cosmetics Shop,Asian Restaurant,Gym
1,2,Berczy Park,Coffee Shop,Cocktail Bar,Café,Cheese Shop,Bakery,Steakhouse,Seafood Restaurant,Beer Bar,Farmers Market,Eastern European Restaurant
2,3,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Lounge,Airport Terminal,Boat or Ferry,Plane,Airport,Airport Food Court,Airport Gate,Sculpture Garden,Boutique
3,2,"Cabbagetown, St. James Town",Coffee Shop,Market,Restaurant,Pub,Café,Bakery,Pizza Place,Park,Chinese Restaurant,Italian Restaurant
4,2,Central Bay Street,Coffee Shop,Italian Restaurant,Japanese Restaurant,Sandwich Place,Café,Burger Joint,Juice Bar,Ice Cream Shop,Salad Place,Spa
5,2,"Chinatown, Grange Park, Kensington Market",Café,Vietnamese Restaurant,Bar,Dumpling Restaurant,Coffee Shop,Chinese Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Burger Joint,Donut Shop
6,4,Christie,Grocery Store,Café,Park,Athletics & Sports,Diner,Restaurant,Italian Restaurant,Bank,Baby Store,Candy Store
7,2,Church and Wellesley,Coffee Shop,Japanese Restaurant,Restaurant,Gay Bar,Sushi Restaurant,Gym,Café,Men's Store,Hotel,Gastropub
8,2,"Commerce Court, Victoria Hotel",Coffee Shop,Café,Hotel,Restaurant,Gym,Seafood Restaurant,Bakery,Deli / Bodega,Steakhouse,Gastropub
9,2,"Design Exchange, Toronto Dominion Centre",Coffee Shop,Café,Hotel,Restaurant,Italian Restaurant,Bar,Steakhouse,Gastropub,Seafood Restaurant,Deli / Bodega


In [374]:
to_downtown_venues_summ = toBN.join(to_downtown_venues_info.set_index('Neighbourhood'),on='Neighbourhood')
to_downtown_venues_summ.head()

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,Harbourfront,43.65426,-79.360636,0,Coffee Shop,Park,Bakery,Pub,Café,Restaurant,Mexican Restaurant,Shoe Store,Breakfast Spot,Brewery
4,Downtown Toronto,Queen's Park,43.662301,-79.389494,0,Coffee Shop,Gym,Park,Fried Chicken Joint,Salad Place,Restaurant,Portuguese Restaurant,Nightclub,Mexican Restaurant,Juice Bar
9,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,2,Coffee Shop,Clothing Store,Cosmetics Shop,Bakery,Café,Japanese Restaurant,Italian Restaurant,Ramen Restaurant,Pizza Place,Bookstore
15,Downtown Toronto,St. James Town,43.651494,-79.375418,2,Coffee Shop,Café,Restaurant,Breakfast Spot,Cocktail Bar,Beer Bar,Cosmetics Shop,Bakery,Italian Restaurant,Hotel
20,Downtown Toronto,Berczy Park,43.644771,-79.373306,2,Coffee Shop,Cocktail Bar,Café,Cheese Shop,Bakery,Steakhouse,Seafood Restaurant,Beer Bar,Farmers Market,Eastern European Restaurant
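The join used above keeps every row (and the index) of the left table and looks up the matching row of the right table by neighbourhood name. A minimal sketch on hypothetical data (`toy_geo`, `toy_info` and their values are made up for illustration) shows one consequence worth keeping in mind: a neighbourhood without a match gets `NaN`, which turns the integer label column into floats.

```python
import pandas as pd

# Hypothetical location table (left side of the join)
toy_geo = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Latitude': [43.65, 43.66, 43.67],
})
# Hypothetical cluster table (right side); 'C' is deliberately missing
toy_info = pd.DataFrame({
    'Cluster label': [1, 0],
    'Neighbourhood': ['B', 'A'],
})

# Left join on the 'Neighbourhood' column of toy_geo
toy_summ = toy_geo.join(toy_info.set_index('Neighbourhood'), on='Neighbourhood')
unmatched = toy_summ['Cluster label'].isna()
print(toy_summ)
```

Checking `unmatched.any()` before using the labels as list indices is a cheap safeguard.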


Finally, let's map our clusters.

In [375]:
dt_loc = geocoder.geocode('Downtown, Toronto, Ontario, Canada')
lat, lng = dt_loc[1]
lat,lng

(43.6563221, -79.3809161)

In [376]:
# create map
map_clusters = folium.Map(location=dt_loc[1], zoom_start=13)

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(to_downtown_venues_summ['Latitude'], to_downtown_venues_summ['Longitude'], 
                                  to_downtown_venues_summ['Neighbourhood'], 
                                  to_downtown_venues_summ['Cluster label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],  # labels are zero-based, so index the palette directly
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
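Because the labels returned by k-means are zero-based, they can index the colour palette directly. The sketch below uses a hypothetical five-colour palette (the hex values are chosen for illustration only) to show what an off-by-one index such as `cluster - 1` would do instead: label 0 wraps around to the last colour.

```python
# Hypothetical five-colour palette (hex values chosen for illustration only)
palette = ['#8000ff', '#00b4ec', '#80ffb4', '#ffb360', '#ff0000']
labels = [0, 1, 2, 3, 4]  # zero-based labels, as k-means returns them

direct = {c: palette[c] for c in labels}       # label 0 -> first colour
shifted = {c: palette[c - 1] for c in labels}  # label 0 -> index -1 -> last colour

print(direct[0], shifted[0])
```

Both mappings still give each cluster its own colour; the shifted one only rotates the palette, which matters if a legend is built from the same list.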

## Detailed cluster composition

### Cluster 1 (label 0)

In [377]:
to_downtown_venues_summ.loc[to_downtown_venues_summ['Cluster label'] == 0, to_downtown_venues_summ.columns[[1] + list(range(5, to_downtown_venues_summ.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Harbourfront,Coffee Shop,Park,Bakery,Pub,Café,Restaurant,Mexican Restaurant,Shoe Store,Breakfast Spot,Brewery
4,Queen's Park,Coffee Shop,Gym,Park,Fried Chicken Joint,Salad Place,Restaurant,Portuguese Restaurant,Nightclub,Mexican Restaurant,Juice Bar


### Cluster 2 (label 1)

In [378]:
to_downtown_venues_summ.loc[to_downtown_venues_summ['Cluster label'] == 1, to_downtown_venues_summ.columns[[1] + list(range(5, to_downtown_venues_summ.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
91,Rosedale,Park,Playground,Trail,Deli / Bodega,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Discount Store


### Cluster 3 (label 2)

In [379]:
to_downtown_venues_summ.loc[to_downtown_venues_summ['Cluster label'] == 2, to_downtown_venues_summ.columns[[1] + list(range(5, to_downtown_venues_summ.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,"Ryerson, Garden District",Coffee Shop,Clothing Store,Cosmetics Shop,Bakery,Café,Japanese Restaurant,Italian Restaurant,Ramen Restaurant,Pizza Place,Bookstore
15,St. James Town,Coffee Shop,Café,Restaurant,Breakfast Spot,Cocktail Bar,Beer Bar,Cosmetics Shop,Bakery,Italian Restaurant,Hotel
20,Berczy Park,Coffee Shop,Cocktail Bar,Café,Cheese Shop,Bakery,Steakhouse,Seafood Restaurant,Beer Bar,Farmers Market,Eastern European Restaurant
24,Central Bay Street,Coffee Shop,Italian Restaurant,Japanese Restaurant,Sandwich Place,Café,Burger Joint,Juice Bar,Ice Cream Shop,Salad Place,Spa
30,"Adelaide, King, Richmond",Coffee Shop,Café,Steakhouse,Restaurant,Bar,Bakery,Thai Restaurant,Cosmetics Shop,Asian Restaurant,Gym
36,"Harbourfront East, Toronto Islands, Union Station",Coffee Shop,Aquarium,Hotel,Café,Italian Restaurant,Restaurant,Scenic Lookout,Brewery,Fried Chicken Joint,Pizza Place
42,"Design Exchange, Toronto Dominion Centre",Coffee Shop,Café,Hotel,Restaurant,Italian Restaurant,Bar,Steakhouse,Gastropub,Seafood Restaurant,Deli / Bodega
48,"Commerce Court, Victoria Hotel",Coffee Shop,Café,Hotel,Restaurant,Gym,Seafood Restaurant,Bakery,Deli / Bodega,Steakhouse,Gastropub
80,"Harbord, University of Toronto",Café,Restaurant,Bakery,Bar,Bookstore,Japanese Restaurant,Sandwich Place,Flower Shop,Pub,Poutine Place
84,"Chinatown, Grange Park, Kensington Market",Café,Vietnamese Restaurant,Bar,Dumpling Restaurant,Coffee Shop,Chinese Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Burger Joint,Donut Shop


### Cluster 4 (label 3)

In [380]:
to_downtown_venues_summ.loc[to_downtown_venues_summ['Cluster label'] == 3, to_downtown_venues_summ.columns[[1] + list(range(5, to_downtown_venues_summ.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
87,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Lounge,Airport Terminal,Boat or Ferry,Plane,Airport,Airport Food Court,Airport Gate,Sculpture Garden,Boutique


### Cluster 5 (label 4)

In [381]:
to_downtown_venues_summ.loc[to_downtown_venues_summ['Cluster label'] == 4, to_downtown_venues_summ.columns[[1] + list(range(5, to_downtown_venues_summ.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25,Christie,Grocery Store,Café,Park,Athletics & Sports,Diner,Restaurant,Italian Restaurant,Bank,Baby Store,Candy Store


The cluster-size distribution is quite skewed, as we can see below: one cluster contains 14 of the 19 neighbourhoods and three clusters contain a single neighbourhood each. This is partly due to the small number of neighbourhoods compared to the number of venue categories we found. We should expect a more meaningful clustering if we consider a wider area than just downtown Toronto.

In [402]:
# number of neighbourhoods per cluster label, as a one-column table
cluster_sizes = (to_downtown_venues_summ.groupby('Cluster label')
                 .size()
                 .rename('size')
                 .rename_axis('cluster')
                 .to_frame())
cluster_sizes.sort_values(by='size', ascending=False)

cluster,size
2,14
0,2
1,1
3,1
4,1
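The skew can be summarised with two numbers: the share of neighbourhoods in the largest cluster and the count of single-member clusters. The sketch below rebuilds a label series from the sizes in the table above; `labels`, `largest_share` and `singletons` are names introduced here for illustration.

```python
import pandas as pd

# Labels reconstructed from the cluster sizes reported above
labels = pd.Series([2] * 14 + [0] * 2 + [1, 3, 4], name='cluster')

sizes = labels.value_counts()              # neighbourhoods per cluster
largest_share = sizes.max() / sizes.sum()  # fraction held by the biggest cluster
singletons = int((sizes == 1).sum())       # clusters with a single member

print(f"largest cluster share: {largest_share:.2f}, singleton clusters: {singletons}")
```

A share close to 1 together with several singleton clusters suggests that k-means is mostly isolating outliers rather than finding groups of comparable size.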
