The following section is used to extract data from the following Wikipedia page: <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"><strong>List of postal codes of Canada: M</strong></a>

import required libraries

In [1]:
# import pandas library and numpy library
import pandas as pd
import numpy as np

# import map rendering library
import folium
# import a tool for concerting an address into latitude and longitude 
from geopy.geocoders import Nominatim
import geopy.geocoders

# import matplotlib and associated plotting modukes
import matplotlib.cm as cm
import matplotlib.colors as colors

# import library for validating SSL encryted URLs
import ssl
import certifi
from urllib import request

#import library to handle requests
import requests

#import tool for transforming JSON file into a dataframe
from pandas.io.json import json_normalize

# import k-means from sklearn
from sklearn.cluster import KMeans

print('Libraries imported.')


Libraries imported.


Load the data and read into a pandas DataFrame structure

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

context = ssl._create_unverified_context(cafile = certifi.where())
response = request.urlopen(url, context = context)
html = response.read()

toronto_neighborhoods = pd.read_html(html, header = 0)[0]
toronto_neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Drop the rows that do not have assigned boroughs

In [3]:
toronto_neighborhoods.drop(toronto_neighborhoods[toronto_neighborhoods['Borough'] == 'Not assigned'].index, inplace = True)
toronto_neighborhoods.reset_index(inplace = True, drop = True)
toronto_neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


Check which cell has a Borough but a Not assigned Neighbourhood

In [4]:
toronto_neighborhoods[toronto_neighborhoods.Neighbourhood == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood
6,M7A,Queen's Park,Not assigned


Use name of the Borough for the Neighbourhood that has 'Not assigned' value

In [5]:
empty_index = toronto_neighborhoods[toronto_neighborhoods.Neighbourhood == 'Not assigned'].index
toronto_neighborhoods.loc[empty_index,'Neighbourhood'] = toronto_neighborhoods.loc[empty_index,'Borough']
toronto_neighborhoods[toronto_neighborhoods.Neighbourhood == 'Not assigned']
toronto_neighborhoods.shape

(211, 3)

Next, we want to combine the Neighourboods that share the same Postcode. To do this, we first need to find out which Postcodes are shared by more than one Neighbourhood

In [6]:
postcode_count = toronto_neighborhoods.groupby('Postcode').count()
postcode_count = postcode_count[postcode_count['Borough'] != 1]
duplicated_postcodes = postcode_count.index
duplicated_postcodes
print('The number of duplicated postocdes are: {}'.format(len(duplicated_postcodes)))

The number of duplicated postocdes are: 57


Create a new dataframe that excludes those duplicated postcodes found in the previous step

In [7]:
single_neighborhoods = toronto_neighborhoods.set_index('Postcode')
single_neighborhoods = single_neighborhoods.drop(duplicated_postcodes, axis = 0)
single_neighborhoods.reset_index(inplace = True)
single_neighborhoods.head()
single_neighborhoods.shape

(46, 3)

Next, we add Boroughs that contain more than one Neighbourhood. This is done by looping through and creating dataframe for each duplicated postcode. The respective Postcode, Borough, Neighbourhood information can then be appended to the single_neighborhoods dataframe created in the previous step.

In [8]:
for postcode in duplicated_postcodes:
    temp_df = toronto_neighborhoods[toronto_neighborhoods.Postcode == postcode].reset_index(drop = True)
    temp_postcode = temp_df['Postcode'][0]
    temp_borough = temp_df['Borough'][0]
    temp_neighbourhood = temp_df['Neighbourhood'].str.cat(sep = ', ')
    
    single_neighborhoods = single_neighborhoods.append({'Postcode': temp_postcode, 'Borough': temp_borough, 'Neighbourhood': temp_neighbourhood}, ignore_index= True)
    
single_neighborhoods.tail()

Unnamed: 0,Postcode,Borough,Neighbourhood
98,M9B,Etobicoke,"Cloverdale, Islington, Martin Grove, Princess ..."
99,M9C,Etobicoke,"Bloordale Gardens, Eringate, Markland Wood, Ol..."
100,M9M,North York,"Emery, Humberlea"
101,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
102,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."


Rename the dataframe and get its dimension

In [9]:
simple_neighourhoods = single_neighborhoods
simple_neighourhoods.shape

(103, 3)

Obtain data from a csv file that has the geographical coordinates of each postal code

In [10]:
geo_data = pd.read_csv('Geospatial_Coordinates.csv')
geo_data.head()
# print('This csv file contains {} geographical coordiantes.'.format(geo_data.shape[0]))

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Make sure the column names are consistent in both tables

In [11]:
geo_data.rename(columns={'Postal Code':'Postcode'}, inplace = True)
geo_data.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the two tables together

In [12]:
merged_df = pd.merge(single_neighborhoods, geo_data, how = 'outer')
merged_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
3,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
4,M3B,North York,Don Mills North,43.745906,-79.352188


For the rest of this project, I will be using only the Borough that contains the word 'Toronto' in it. After filtering out the the undesired Boroughs, I will use Foursquare API to explore the venues around the neighbourhoods. Finally, these neighbourhoods will be clustered together using k-means and will be visualized on a map.

Let's first find the Boroughs that contain the word 'Toronto'

In [13]:
toronto_df = merged_df.loc[['Toronto' in borough for borough in merged_df['Borough']],:]
toronto_df = toronto_df.reset_index(drop = True)
toronto_df.head()
# print('This datafrane contains {} geographical coordiantes.'.format(toronto_df.shape[0]))

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
1,M4E,East Toronto,The Beaches,43.676357,-79.293031
2,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
3,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
4,M6G,Downtown Toronto,Christie,43.669542,-79.422564


Use geopy library get the latitude and longitude values of Toronto

In [14]:
geopy.geocoders.options.default_ssl_context = context

address  = 'Toronto, ON'

geolocator = Nominatim(user_agent = 'toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinate of Toronto are 43.653963, -79.387207.


Create a map of Toronto with neighbourhoods superimposed on top

In [15]:
map_toronto = folium.Map(location = [latitude, longitude], zoom_start = 12)

for lat, lng, borough, neighbourhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Define Foursquare Credentials and Version

In [16]:
CLIENT_ID = '2OS4YFSMYGEMPT2KFDEI0OEB5HUCP5R40MITFVN0APGD0GNE' # your Foursquare ID
CLIENT_SECRET = 'VWWZJTT2FA2B00Z3FCZLCEQIG5FCXQBPYNHU4YVNEN0TZZ5K' # your Foursquare Secret
VERSION = '20190723' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 2OS4YFSMYGEMPT2KFDEI0OEB5HUCP5R40MITFVN0APGD0GNE
CLIENT_SECRET:VWWZJTT2FA2B00Z3FCZLCEQIG5FCXQBPYNHU4YVNEN0TZZ5K


Find the first Neighbourhood in the above dataframe

In [17]:
toronto_df['Neighbourhood'][0]

'St. James Town'

Let's find the latitude and longitude information of this Neighbourhood

In [18]:
neigh_latitude = toronto_df.loc[0, 'Latitude']
neigh_longitude = toronto_df.loc[0, 'Longitude']
neigh_name = toronto_df.loc[0, 'Neighbourhood']

Now we can use Foursquare API to explore around this Neighbourhood and retrieve the top 5 venues within a radius of 500 meters.

Let's first create the URL for GET request

In [19]:
radius = 500
LIMIT = 15

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}\
&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, 
                                         neigh_latitude, neigh_longitude, 
                                         radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=2OS4YFSMYGEMPT2KFDEI0OEB5HUCP5R40MITFVN0APGD0GNE&client_secret=VWWZJTT2FA2B00Z3FCZLCEQIG5FCXQBPYNHU4YVNEN0TZZ5K&v=20190723&ll=43.6514939,-79.3754179&radius=500&limit=15'

In [20]:
search_result = requests.get(url).json()
search_result

{'meta': {'code': 200, 'requestId': '5d37c6304651320025a57fa2'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'St. Lawrence',
  'headerFullLocation': 'St. Lawrence, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 128,
  'suggestedBounds': {'ne': {'lat': 43.6559939045, 'lng': -79.36921018606671},
   'sw': {'lat': 43.646993895499996, 'lng': -79.3816256139333}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '574ad72238fa943556d93b8e',
       'name': 'Gyu-Kaku Japanese BBQ',
       'location': {'address': '81 Church St',
        'crossStreet': 'at Adelaide St E',
        'lat': 43.651422275497914,
        'lng': -79.37504693687086,
        'labeledLatLngs'

Create a function to extract the category of the venue

In [21]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Convert the returned JSON dictionary into a pandas dataframe

In [22]:
venues = search_result['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

Unnamed: 0,name,categories,lat,lng
0,Gyu-Kaku Japanese BBQ,Japanese Restaurant,43.651422,-79.375047
1,Crepe TO,Creperie,43.650063,-79.374587
2,Terroni,Italian Restaurant,43.650927,-79.375602
3,GEORGE Restaurant,Restaurant,43.653346,-79.374445
4,Pearl Diver,Gastropub,43.651481,-79.3736
5,Fahrenheit Coffee,Coffee Shop,43.652384,-79.372719
6,Hogtown Smoke,Food Truck,43.649287,-79.374689
7,Versus Coffee,Coffee Shop,43.651213,-79.375236
8,Mystic Muffin,Middle Eastern Restaurant,43.652484,-79.372655
9,Triple A Bar (AAA),BBQ Joint,43.651658,-79.37272


Now I will repeat this same process to all neighbourhoods in the **toronto_df** dataframe

In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        search_results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in search_results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

toronto_venues = getNearbyVenues(names = toronto_df['Neighbourhood'], latitudes = toronto_df['Latitude'], longitudes = toronto_df['Longitude'])

St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Studio District
Lawrence Park
Roselawn
Davisville North
North Toronto West
Davisville
Rosedale
Stn A PO Boxes 25 The Esplanade
Church and Wellesley
Business Reply Mail Processing Centre 969 Eastern
The Danforth West, Riverdale
The Beaches West, India Bazaar
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Cabbagetown, St. James Town
Harbourfront, Regent Park
Ryerson, Garden District
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
First Canadian Place, Underground city
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, 

Now that a new dataframe containing the venues information become available, I will check the size of this new dataframe as well as the number of unique Venue Categories

In [24]:
print('The Toronto Venue dataframe contains {} rows and {} columns.'.format(toronto_venues.shape[0], toronto_venues.shape[1]))
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

The Toronto Venue dataframe contains 494 rows and 7 columns.
There are 150 uniques categories.


Since the values in the Venue Category column is categorical, the next step is to one-hot-encode this column so I can perform analysis with sklearn

In [25]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Art Gallery,Arts & Crafts Store,...,Tennis Court,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,St. James Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,St. James Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,St. James Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,St. James Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,St. James Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Check the size of the encoded dataframe

In [26]:
toronto_onehot.shape

(494, 151)

To get an idea of how frequency of occurence of each Venue Category in each Neighbourhood, I will use the groupby method of a dataframe and then calculate the mean value.

In [27]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Art Gallery,Arts & Crafts Store,...,Tennis Court,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.066667,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.066667,0.066667,0.066667,0.133333,0.133333,0.133333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.133333,0.066667,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0


Now I will find out the top 5 most common venue categories in each Neighbourhood

In [28]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
              venue  freq
0        Steakhouse  0.13
1         Speakeasy  0.07
2             Hotel  0.07
3       Coffee Shop  0.07
4  Greek Restaurant  0.07


----Berczy Park----
               venue  freq
0     Farmers Market  0.13
1        Coffee Shop  0.07
2  French Restaurant  0.07
3       Cocktail Bar  0.07
4             Museum  0.07


----Brockton, Exhibition Place, Parkdale Village----
                venue  freq
0         Coffee Shop  0.13
1      Breakfast Spot  0.13
2  Italian Restaurant  0.07
3                 Gym  0.07
4        Climbing Gym  0.07


----Business Reply Mail Processing Centre 969 Eastern----
           venue  freq
0    Pizza Place  0.07
1  Auto Workshop  0.07
2     Restaurant  0.07
3        Butcher  0.07
4  Burrito Place  0.07


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0    Airport Lounge  0.13
1   Airport Service  0.13
2  Airport 

Create a funtion to sort the venue categories in descending order and create a dataframe to display the top 10 venue categories for each Neighbourhood

In [29]:
# first create the function
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# then apply the function to find the top 10 venue categories
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(10)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Steakhouse,Opera House,Concert Hall,Seafood Restaurant,Hotel,Speakeasy,Plaza,Pizza Place,Asian Restaurant,Gym / Fitness Center
1,Berczy Park,Farmers Market,Cocktail Bar,Thai Restaurant,Beer Bar,Breakfast Spot,Steakhouse,Seafood Restaurant,Liquor Store,Concert Hall,Park
2,"Brockton, Exhibition Place, Parkdale Village",Breakfast Spot,Coffee Shop,Furniture / Home Store,Stadium,Climbing Gym,Performing Arts Venue,Pet Store,Caribbean Restaurant,Café,Restaurant
3,Business Reply Mail Processing Centre 969 Eastern,Garden,Pizza Place,Skate Park,Spa,Restaurant,Burrito Place,Farmers Market,Fast Food Restaurant,Butcher,Auto Workshop
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Lounge,Airport Service,Airport Terminal,Airport,Bar,Boutique,Boat or Ferry,Harbor / Marina,Sculpture Garden,Airport Food Court
5,"Cabbagetown, St. James Town",Café,General Entertainment,Bakery,Pet Store,Butcher,Pub,Restaurant,Deli / Bodega,Jewelry Store,Diner
6,Central Bay Street,Coffee Shop,Italian Restaurant,Bubble Tea Shop,Sandwich Place,Modern European Restaurant,Gastropub,Park,Spa,Tea Room,Seafood Restaurant
7,"Chinatown, Grange Park, Kensington Market",Café,Vietnamese Restaurant,Cocktail Bar,Arts & Crafts Store,Coffee Shop,Farmers Market,Bar,Bakery,Mexican Restaurant,Organic Grocery
8,Christie,Café,Park,Grocery Store,Athletics & Sports,Coffee Shop,Convenience Store,Restaurant,Diner,Nightclub,Baby Store
9,Church and Wellesley,Juice Bar,Tea Room,Dance Studio,Bookstore,Salon / Barbershop,Bubble Tea Shop,Diner,Restaurant,Ramen Restaurant,Pizza Place


Finally, k-means cluster can be performed on the **toronto_grouped** dataframe

In [30]:
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 2, 1, 1, 2, 1, 2, 2, 1], dtype=int32)

I will also create a new dataframe that includes the cluster label as well as the top 10 venue categories for each Neighbourhood

In [31]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_final = toronto_df

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_final = toronto_final.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_final.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Japanese Restaurant,Coffee Shop,Italian Restaurant,Cosmetics Shop,Hotel,Restaurant,Creperie,BBQ Joint,Middle Eastern Restaurant,Gym
1,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Health Food Store,Neighborhood,Trail,Pub,Yoga Studio,Dessert Shop,Deli / Bodega,Dance Studio,Cuban Restaurant,Creperie
2,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Farmers Market,Cocktail Bar,Thai Restaurant,Beer Bar,Breakfast Spot,Steakhouse,Seafood Restaurant,Liquor Store,Concert Hall,Park
3,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,1,Coffee Shop,Italian Restaurant,Bubble Tea Shop,Sandwich Place,Modern European Restaurant,Gastropub,Park,Spa,Tea Room,Seafood Restaurant
4,M6G,Downtown Toronto,Christie,43.669542,-79.422564,2,Café,Park,Grocery Store,Athletics & Sports,Coffee Shop,Convenience Store,Restaurant,Diner,Nightclub,Baby Store


Finally, the clusters can be visualzed on a map

In [33]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_final['Latitude'], toronto_final['Longitude'], toronto_final['Neighbourhood'], toronto_final['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters