# Scouting restaurant locations in Toronto

<h2>Brief Description:</h2>
    
    Here we tackle the problem of finding the most promising location to open a restaurant in a given city. There is no established way to calculate the "promise" or "potential" of a location mathematically, so I intend to develop a model that estimates this attribute using common-sense reasoning. The model takes into account each location's population, land area, average income, number of pre-existing restaurants, etc. Of course, one could include other parameters, such as each pre-existing restaurant's average rating, price range, or number of likes. However, since the purpose of this script is to provide a proof of concept, we will stick to parameters that require only regular calls to Foursquare, since premium calls on a personal account are very limited.


<h3> Import required libraries -

In [1]:
import numpy as np
import pandas as pd

!pip -q install folium    # installing folium (used for creating maps)
import folium

<h3> Read-in Toronto's demographic data - 

In [3]:
df = pd.read_html('https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods')[1]
print(df.shape)
df.head()

(175, 13)


Unnamed: 0,Name,FM,Census Tracts,Population,Land area (km2),Density (people/km2),% Change in Population since 2001,Average Income,Transit Commuting %,% Renters,Second most common language (after English) by name,Second most common language (after English) by percentage,Map
0,Toronto CMA Average,,All,5113149,5903.63,866,9.0,40704,10.6,11.4,,,
1,Agincourt,S,"0377.01, 0377.02, 0377.03, 0377.04, 0378.02, 0...",44577,12.45,3580,4.6,25750,11.1,5.9,Cantonese (19.3%),19.3% Cantonese,
2,Alderwood,E,"0211.00, 0212.00",11656,4.94,2360,-4.0,35239,8.8,8.5,Polish (6.2%),06.2% Polish,
3,Alexandra Park,OCoT,0039.00,4355,0.32,13609,0.0,19687,13.8,28.0,Cantonese (17.9%),17.9% Cantonese,
4,Allenby,OCoT,0140.00,2513,0.58,4333,-1.0,245592,5.2,3.4,Russian (1.4%),01.4% Russian,


In [4]:
df_new = df[['Name', 'Population', 'Land area (km2)', 'Density (people/km2)', 'Average Income']].copy()
df_new.head()

Unnamed: 0,Name,Population,Land area (km2),Density (people/km2),Average Income
0,Toronto CMA Average,5113149,5903.63,866,40704
1,Agincourt,44577,12.45,3580,25750
2,Alderwood,11656,4.94,2360,35239
3,Alexandra Park,4355,0.32,13609,19687
4,Allenby,2513,0.58,4333,245592


In [5]:
df_new.shape

(175, 5)

<h3> Create a function to fetch coordinates of each neighborhood -

In [6]:
from geopy.geocoders import Nominatim    # used to find geographical coordinates

def get_coordinates(names, code='TO'):
    
    coordinates = []    # list of all the [latitude, longitude] pairs
    empty_names = []    # list of all the names for which geographical coordinates are not available
    geolocator = Nominatim(user_agent="ny_explorer")    # create the geocoder once and reuse it
    
    for index, name in enumerate(names):
        name_ = name.split('/')    # for names like 'A/B', look up only the first part
        address = '{}, {}'.format(name_[0], code)
        
        try:
            location = geolocator.geocode(address)
            coordinates.append([location.latitude, location.longitude])
        except Exception:    # geocode() returned None (AttributeError) or the request failed
            print('Couldn\'t find coordinates for {}, index {}'.format(name, index))
            empty_names.append(name)
            
    return coordinates, empty_names

In [7]:
coordinates, empty_names = get_coordinates(names=df_new['Name'])
coordinates

Couldn't find coordinates for Toronto CMA Average, index 0
Couldn't find coordinates for Humber Bay Shores, index 72
Couldn't find coordinates for Humbermede, index 77
Couldn't find coordinates for Pelmo Park, index 118
Couldn't find coordinates for Playter Estates, index 119
Couldn't find coordinates for Rockcliffe–Smythe, index 128
Couldn't find coordinates for Tam O'Shanter – Sullivan, index 146
Couldn't find coordinates for Thorncliffe Park, index 155


[[43.7853531, -79.2785494],
 [43.6017173, -79.5452325],
 [43.650786999999994, -79.40431814731767],
 [14.5953432, 121.0352197],
 [42.8579536, -70.9300921],
 [43.7439436, -79.4308512],
 [52.0601807, -1.3402795],
 [47.6271498, -65.648293],
 [43.6673421, -79.3884571],
 [43.7691966, -79.3766617],
 [43.7981268, -79.3829726],
 [43.7373876, -79.4109253],
 [43.7535196, -79.2553355],
 [43.6918051, -79.2644935],
 [43.6493184, -79.4844358],
 [43.6761954, -79.4280155],
 [36.6411357, -93.2175285],
 [43.7381512, -79.3725113],
 [43.6509173, -79.4400216],
 [43.6644734, -79.3669861],
 [50.0163858, -114.89273791438612],
 [43.6707006, -79.4532993],
 [43.6781015, -79.409415775],
 [43.7874914, -79.1507681],
 [43.7025981, -79.4032704],
 [43.6671385, -79.4227656],
 [43.6655242, -79.3838011],
 [48.2806809, -1.9758742],
 [43.7088231, -79.2959856],
 [43.7218363, -79.2362138],
 [43.7111699, -79.2481769],
 [43.6573699, -79.3565129],
 [43.695403, -79.293099],
 [42.3124161, -85.18514409429133],
 [43.6715454, -79.448
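
Several of the pairs above fall far outside Toronto (for example, `[14.5953432, 121.0352197]` lies in the Philippines), most likely because a lookup such as 'Allenby, TO' is ambiguous. A minimal sketch of how such results could be flagged right after geocoding, assuming a rough bounding box of 43 to 44 degrees latitude and -80 to -79 degrees longitude around Toronto; the helper name `in_toronto_box` is hypothetical:

```python
# Rough bounding box around Toronto (an assumption made for this sketch)
LAT_MIN, LAT_MAX = 43.0, 44.0
LON_MIN, LON_MAX = -80.0, -79.0

def in_toronto_box(lat, lon):
    """Return True if (lat, lon) falls inside the assumed Toronto bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

# First four pairs from the output above
sample = [
    [43.7853531, -79.2785494],
    [43.6017173, -79.5452325],
    [43.650786999999994, -79.40431814731767],
    [14.5953432, 121.0352197],
]
flags = [in_toronto_box(lat, lon) for lat, lon in sample]
print(flags)
```

Flagging at this stage would make the out-of-town results visible before any API calls are spent on them.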

<h3> Remove the neighborhoods whose coordinates weren't available (i.e. empty_names) from df_new - 

In [8]:
df_new.index = df['Name']
df_new.drop(index=empty_names, inplace=True)
df_new.drop(columns='Name', inplace=True)
df_new.head()

Unnamed: 0_level_0,Population,Land area (km2),Density (people/km2),Average Income
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Agincourt,44577,12.45,3580,25750
Alderwood,11656,4.94,2360,35239
Alexandra Park,4355,0.32,13609,19687
Allenby,2513,0.58,4333,245592
Amesbury,17318,3.51,4934,27546


<h3> Sanity check: see whether the length of (coordinates) matches that of (df_new) -  

In [9]:
print(len(coordinates))
print(df_new.shape)

167
(167, 4)


PHEW! ;-)
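
Matching lengths is a necessary check, but it does not prove that each coordinate pair sits next to the right neighborhood. One way to make the pairing explicit is to key the results by name. The sketch below is hypothetical: `lookup` stands in for any geocoding call that returns a `(latitude, longitude)` tuple or `None`, and `fake_lookup` is a stub filled with two values from the output above.

```python
def pair_coordinates(names, lookup):
    """Return ({name: (lat, lon)}, [names without a result]) using the given lookup callable."""
    found, missing = {}, []
    for name in names:
        result = lookup(name)
        if result is None:
            missing.append(name)
        else:
            found[name] = result
    return found, missing

# Stub standing in for a real geocoder (hypothetical, for illustration only)
known = {'Agincourt': (43.7853531, -79.2785494), 'Alderwood': (43.6017173, -79.5452325)}
fake_lookup = known.get

found, missing = pair_coordinates(['Toronto CMA Average', 'Agincourt', 'Alderwood'], fake_lookup)
print(found)
print(missing)
```

A dictionary like `found` can be turned into a dataframe with `pd.DataFrame.from_dict(found, orient='index', columns=['latitude', 'longitude'])` and joined on the neighborhood name, so the alignment no longer depends on list order.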

<h3> Add the coordinates data to (df_new) -

In [10]:
coordinates = np.array(coordinates)
latitudes = coordinates[:, 0]
longitudes = coordinates[:, 1]

# Create a dataframe with 2 columns, namely 'latitude' and 'longitude'
coords_df = pd.DataFrame({'latitude': latitudes, 'longitude': longitudes})
coords_df.index = df_new.index.values

# Combine (df_new) with (coords_df)
compiled_df = df_new.join(coords_df)
compiled_df.head()

Unnamed: 0_level_0,Population,Land area (km2),Density (people/km2),Average Income,latitude,longitude
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,44577,12.45,3580,25750,43.785353,-79.278549
Alderwood,11656,4.94,2360,35239,43.601717,-79.545232
Alexandra Park,4355,0.32,13609,19687,43.650787,-79.404318
Allenby,2513,0.58,4333,245592,14.595343,121.03522
Amesbury,17318,3.51,4934,27546,42.857954,-70.930092


In [11]:
compiled_df.shape

(167, 6)

<h3> Find the coordinates for Toronto itself -

In [12]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are ({}, {}).'.format(latitude, longitude))

The geographical coordinates of Toronto are (43.6534817, -79.3839347).


<h3> Create a preliminary map to show the location of each neighborhood in Toronto -

In [70]:
map_toronto = folium.Map(location=[latitude+0.05, longitude], zoom_start=10.75)    # center point of the map

for lat, lng, label in zip(compiled_df['latitude'], compiled_df['longitude'], compiled_df.index.values):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(    # attributes of each bubble in the map
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Sweet! Our job now is to gather information about restaurants in each neighborhood to find out which neighborhood presents the best "potential" to open up a new restaurant! For this purpose, we will need to use the Foursquare API. We will make only regular calls for the purpose of this project since premium calls are fairly limited on a personal account.    

<h3> Enter Foursquare credentials -

In [14]:
CLIENT_ID = 'B0MWNNRJM5A4AEKPTTQKHDBWWAZF4MGKIESUMLTJOUPGQSED'    # Foursquare ID
CLIENT_SECRET = 'HH5LLRRQEEPEXL4FDETHRBNCFDCTQW5GIPVY5OY0NKONEIVW'    # Foursquare Secret
VERSION = '20200606' # Foursquare API version

# CLIENT_ID = 'QNXERFM0BE1Z3XS3KHHJMNH3VTF5MBGUJS4300ZR4ADDAVBP'
# CLIENT_SECRET = 'LXRLT31GCOPEJMSQZ40OGHKSKF2CPIRBYJZ3DBU1OW15K4NM'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: B0MWNNRJM5A4AEKPTTQKHDBWWAZF4MGKIESUMLTJOUPGQSED
CLIENT_SECRET:HH5LLRRQEEPEXL4FDETHRBNCFDCTQW5GIPVY5OY0NKONEIVW
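
Since the keys above are printed in plain text, anyone who can read the notebook can reuse them. A common alternative is to read them from environment variables. The sketch below assumes variables named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (names chosen for this example, not defined by Foursquare) and takes the environment as a parameter so that it can be tried without real keys.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare ID and secret from a mapping of environment variables."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * (len(secret) - visible)

# Demo with placeholder values instead of real keys
demo_id, demo_secret = load_credentials({'FOURSQUARE_CLIENT_ID': 'demo-id',
                                         'FOURSQUARE_CLIENT_SECRET': 'demo-secret'})
print('CLIENT_ID: ' + mask(demo_id))
```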


<h3> Create a function to make API requests for each location and return a dataframe with all the required data - 

In [15]:
def get_category_type(row):    # unwrapping the results obtained from all the API requests 
    try:
        categories_list = row['categories']
    except KeyError:    # fall back to the flattened 'venue.categories' column
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


def getNearbyVenues(names, latitudes, longitudes, query_key, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        print(name)
            
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&query={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            query_key,
            radius, 
            LIMIT)    # the API request URL
        
        results = requests.get(url).json()    
        venues = results['response']['venues']    # assign relevant part of JSON to venues

        dataframe = json_normalize(venues)    # transform venues into a dataframe
        
        # keep only columns that include venue name, and anything that is associated with location
        try:
            filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
            dataframe_filtered = dataframe.loc[:, filtered_columns]
            #print(dataframe_filtered.head())
        except:
            continue
        
        # filter the category for each row
        dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

        # clean column names by keeping only last term
        dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
        
        frequency = dataframe_filtered.shape[0]    # number of restaurants in a given location
        dataframe_filtered['frequency'] = frequency    # a scalar is broadcast to every row
        dataframe_filtered['city'] = name
        
        venues_list.append(dataframe_filtered)
    
    return pd.concat(venues_list)
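
The request URL above is assembled with `str.format`, which leaves the query term unescaped. As a sketch of an alternative, the standard library's `urllib.parse.urlencode` can build the same query string and escape special characters. The function name `build_search_url` and the placeholder credentials are assumptions for illustration.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, version, lat, lng, query, radius=500, limit=100):
    """Build a Foursquare v2 venue-search URL with escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

# Placeholder credentials; the coordinates are those found for Agincourt
url = build_search_url('ID', 'SECRET', '20200606', 43.7853531, -79.2785494, 'sushi bar')
print(url)
```

`requests.get` also accepts a `params` dictionary and performs the same encoding, so the dictionary could be passed to it directly.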
        

<h3> Import libraries required by the function -

In [None]:
import requests # library to handle requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (pandas >= 1.0; older versions import it from pandas.io.json)

<h3> Call the 'getNearbyVenues' function to obtain all the required restaurant data -

In [19]:
query = 'restaurant'    # this will be our key word
toronto_venues = getNearbyVenues(names=compiled_df.index.values,
                                 latitudes=compiled_df['latitude'],
                                 longitudes=compiled_df['longitude'],
                                 query_key = query
                                )
toronto_venues.reset_index(inplace=True)
toronto_venues

Agincourt
Alderwood
Alexandra Park
Allenby
Amesbury
Armour Heights
Banbury
Bathurst Manor
Bay Street Corridor
Bayview Village
Bayview Woods – Steeles
Bedford Park
Bendale
Birch Cliff
Bloor West Village
Bracondale Hill
Branson
Bridle Path
Brockton
Cabbagetown
Caribou Park
Carleton Village
Casa Loma
Centennial
Chaplin Estates
Christie Pits
Church and Wellesley
Clairville
Clairlea
Cliffcrest
Cliffside
Corktown
Crescent Town
Cricket Club
Davenport
Davisville
Deer Park
Discovery District
Distillery District/West Don Lands
Don Mills
Don Valley Village
Dorset Park
Dovercourt Park
Downsview
Dufferin Grove
Earlscourt
East Danforth
Eatonville
Eglinton East
Elia (Jane and Finch)
Eringate
Fairbank
Fashion District
Financial District
Flemingdon Park
Forest Hill
Fort York/Liberty Village
Garden District
Glen Park
Governor's Bridge/Bennington Heights
Grange Park
Graydon Hall
Guildwood
Harbord Village
Harbourfront / CityPlace
Harwood
Henry Farm
High Park North
Highland Creek
Hillcrest
Hoggs Hollow
Hum

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.




Unnamed: 0,index,address,categories,cc,city,country,crossStreet,distance,formattedAddress,frequency,id,labeledLatLngs,lat,lng,name,neighborhood,postalCode,state
0,0,4271 Sheppard Ave. E,Chinese Restaurant,CA,Agincourt,Canada,btwn Brimley & Midland Ave.,212,[4271 Sheppard Ave. E (btwn Brimley & Midland ...,18,4be303c52fc7d13ae879083a,"[{'label': 'display', 'lat': 43.78593747713555...",43.785937,-79.276031,Beef Noodle Restaurant 老李牛肉麵,,M1S 4G4,ON
1,1,1 Glen Watford Dr.,Chinese Restaurant,CA,Agincourt,Canada,,247,"[1 Glen Watford Dr., Scarborough ON, Canada]",18,4baa46def964a520485a3ae3,"[{'label': 'display', 'lat': 43.78620976205095...",43.786210,-79.275701,South Sea Fish Village Chinese Restaurant,,,ON
2,2,"25 Glen Watford Dr, Unit 9",Chinese Restaurant,CA,Agincourt,Canada,at Sheppard Ave E,314,"[25 Glen Watford Dr, Unit 9 (at Sheppard Ave E...",18,53dbb77e498e6dfadda043fa,"[{'label': 'display', 'lat': 43.78707762430626...",43.787078,-79.275454,Old Neighbour Restaurant 老街坊天津韩记包子铺,,M1S 2B7,ON
3,3,9 Glen Watford Dr.,Korean Restaurant,CA,Agincourt,Canada,at Sheppard Ave. E,260,"[9 Glen Watford Dr. (at Sheppard Ave. E), Scar...",18,4d2f8a98789a8cfa6b0826c6,"[{'label': 'display', 'lat': 43.78646767038441...",43.786468,-79.275693,In Cheon House Korean & Japanese Restaurant 인천관,,M1S 2B9,ON
4,4,,Chinese Restaurant,CA,Agincourt,Canada,,55,[Canada],18,4f3852b7e4b0ea2d7edc00ca,"[{'label': 'display', 'lat': 43.78489781410844...",43.784898,-79.278272,Lucky House Restaurant,,,
5,5,"4227 Sheppard Ave, E. Unit B1",Cantonese Restaurant,CA,Agincourt,Canada,Midland Ave,64,"[4227 Sheppard Ave, E. Unit B1 (Midland Ave), ...",18,5647afde498e42030b40c86d,"[{'label': 'display', 'lat': 43.78508719298561...",43.785087,-79.277843,King Huang Chinese Restaurant,,M1S 5H5,ON
6,6,4192 Sheppard Ave E,Chinese Restaurant,CA,Agincourt,Canada,at Midland Ave,74,"[4192 Sheppard Ave E (at Midland Ave), Scarbor...",18,54838447498e2f5b90530813,"[{'label': 'display', 'lat': 43.78516372850498...",43.785164,-79.279433,Tianjin Dumpling Restaurant 天津包子铺,,M1S 1T3,ON
7,7,4227 Sheppard Ave E Unit B1,Cantonese Restaurant,CA,Agincourt,Canada,,106,"[4227 Sheppard Ave E Unit B1, Scarborough ON, ...",18,53489594498e2802cb19ddca,"[{'label': 'display', 'lat': 43.78472450642784...",43.784725,-79.277556,Ox Land Restaurant,,,ON
8,8,4227 Sheppard Ave. E. Unit B2,Chinese Restaurant,CA,Agincourt,Canada,,114,"[4227 Sheppard Ave. E. Unit B2, Scarborough ON...",18,5d80fe5ba86ac4000795406a,"[{'label': 'display', 'lat': 43.78451273285425...",43.784513,-79.277735,May Yan Seafood Restaurant 陸福海鮮酒家,,M1S 5H5,ON
9,9,,Asian Restaurant,CA,Agincourt,Canada,,156,[Canada],18,4d2bbbf3888af04db4abe2af,"[{'label': 'display', 'lat': 43.78582201864889...",43.785822,-79.276714,pengfuxuan Restaurant,,,


<h3> Wrangle the above obtained dataframe and find out the number of restaurants in each neighborhood -

In [20]:
columns = ['name', 'city', 'categories', 'lat', 'lng', 'id', 'frequency']
toronto_venues = toronto_venues[columns]
print(toronto_venues['city'].value_counts())

Grange Park                             48
Kensington Market                       41
Financial District                      37
Garden District                         36
Discovery District                      30
Bay Street Corridor                     29
Fashion District                        29
Yorkville                               27
Alexandra Park                          25
Church and Wellesley                    21
Harbord Village                         20
North York City Centre                  19
Agincourt                               18
Harbourfront / CityPlace                14
Milliken                                13
Christie Pits                           11
Niagara                                 11
Newtonbrook                             10
Don Valley Village                       9
Chaplin Estates                          8
Allenby                                  8
Lawrence Park                            8
Cabbagetown                              8
Parkdale   

In [44]:
# Wrangling continued...
all_venues = toronto_venues.reset_index(drop=True)
all_venues

Unnamed: 0,name,city,categories,lat,lng,id,frequency
0,Beef Noodle Restaurant 老李牛肉麵,Agincourt,Chinese Restaurant,43.785937,-79.276031,4be303c52fc7d13ae879083a,18
1,South Sea Fish Village Chinese Restaurant,Agincourt,Chinese Restaurant,43.786210,-79.275701,4baa46def964a520485a3ae3,18
2,Old Neighbour Restaurant 老街坊天津韩记包子铺,Agincourt,Chinese Restaurant,43.787078,-79.275454,53dbb77e498e6dfadda043fa,18
3,In Cheon House Korean & Japanese Restaurant 인천관,Agincourt,Korean Restaurant,43.786468,-79.275693,4d2f8a98789a8cfa6b0826c6,18
4,Lucky House Restaurant,Agincourt,Chinese Restaurant,43.784898,-79.278272,4f3852b7e4b0ea2d7edc00ca,18
5,King Huang Chinese Restaurant,Agincourt,Cantonese Restaurant,43.785087,-79.277843,5647afde498e42030b40c86d,18
6,Tianjin Dumpling Restaurant 天津包子铺,Agincourt,Chinese Restaurant,43.785164,-79.279433,54838447498e2f5b90530813,18
7,Ox Land Restaurant,Agincourt,Cantonese Restaurant,43.784725,-79.277556,53489594498e2802cb19ddca,18
8,May Yan Seafood Restaurant 陸福海鮮酒家,Agincourt,Chinese Restaurant,43.784513,-79.277735,5d80fe5ba86ac4000795406a,18
9,pengfuxuan Restaurant,Agincourt,Asian Restaurant,43.785822,-79.276714,4d2bbbf3888af04db4abe2af,18


<h3> Combine 'all_venues' with 'compiled_df' to obtain the final dataframe with all the required parameters -

In [50]:
final_df = compiled_df.copy()    # work on a copy so that adding columns does not raise SettingWithCopyWarning

# create a mask for names for which we actually have any restaurant data
mask = final_df.index.isin(all_venues['city'])
final_df = final_df[mask].copy()

final_df['frequency'] = [all_venues[all_venues['city'] == name].shape[0] for name in final_df.index.values]
final_df['potential'] = final_df['Density (people/km2)'] * final_df['Average Income'] / final_df['frequency']

# only keep the required columns
columns = ['Population', 'Land area (km2)', 'Density (people/km2)', 'Average Income', 'frequency', 'potential', 'latitude', 'longitude']
final_df = final_df[columns]

# remove rogue locations ;P (keep only coordinates inside a rough bounding box around Toronto)
final_df = final_df[final_df['latitude'].between(43, 44) & final_df['longitude'].between(-80, -79)]

# final_df.reset_index(inplace=True)
# final_df.drop(columns='index', inplace=True)


In [51]:
final_df

Unnamed: 0_level_0,Population,Land area (km2),Density (people/km2),Average Income,frequency,potential,latitude,longitude
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Agincourt,44577,12.45,3580,25750,18,5.121389e+06,43.785353,-79.278549
Alderwood,11656,4.94,2360,35239,1,8.316404e+07,43.601717,-79.545232
Alexandra Park,4355,0.32,13609,19687,25,1.071682e+07,43.650787,-79.404318
Bay Street Corridor,4787,0.11,43518,40598,29,6.092220e+07,43.667342,-79.388457
Bloor West Village,5175,0.74,6993,55578,5,7.773139e+07,43.649318,-79.484436
Bracondale Hill,5343,0.62,8618,41605,1,3.585519e+08,43.676195,-79.428016
Brockton,9039,1.10,8217,27260,1,2.239954e+08,43.650917,-79.440022
Cabbagetown,11120,1.40,7943,50398,8,5.003891e+07,43.664473,-79.366986
Carleton Village,6544,0.74,8843,23301,3,6.868358e+07,43.670701,-79.453299
Chaplin Estates,4906,0.93,5275,81288,8,5.359928e+07,43.702598,-79.403270


In [52]:
final_df.sort_values(by='potential', ascending=False)

Unnamed: 0_level_0,Population,Land area (km2),Density (people/km2),Average Income,frequency,potential,latitude,longitude
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
High Park North,22746,2.18,10434,46437,1,4.845237e+08,43.657383,-79.470961
Old Mill/Baby Point,4010,1.07,3748,110372,1,4.136743e+08,43.649826,-79.494334
The Kingsway,8780,2.58,3403,110944,1,3.775424e+08,43.647381,-79.511333
Bracondale Hill,5343,0.62,8618,41605,1,3.585519e+08,43.676195,-79.428016
Humewood–Cedarvale,27515,3.19,8624,40404,1,3.484441e+08,43.690248,-79.422097
Wallace Emerson,10338,0.88,11748,25029,1,2.940407e+08,43.666733,-79.446478
Forest Hill,24056,4.35,5530,101631,2,2.810097e+08,43.693559,-79.413902
Crescent Town,8157,0.40,20393,23021,2,2.347336e+08,43.695403,-79.293099
Lawrence Manor,13750,2.14,6425,36361,1,2.336194e+08,43.722079,-79.437507
Brockton,9039,1.10,8217,27260,1,2.239954e+08,43.650917,-79.440022


High Park North seems to be the place with the most potential!
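
As a quick check of the formula used above (density times average income, divided by the number of documented restaurants), the values for two rows of the table can be recomputed by hand. The function name `potential` is used only for this sketch.

```python
def potential(density, average_income, frequency):
    """Potential as defined above: density * average income / number of restaurants."""
    return density * average_income / frequency

# Inputs copied from the High Park North and Agincourt rows
high_park_north = potential(10434, 46437, 1)
agincourt = potential(3580, 25750, 18)
print(high_park_north)
print(round(agincourt, 1))
```

Both results agree with the 'potential' column (about 4.845237e+08 and 5.121389e+06).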

<h3> Map this dataframe with the radius of each bubble being proportional to its potential -

In [53]:
map_toronto = folium.Map(location=[latitude+0.05, longitude], zoom_start=11.5)    # center point of the map

for lat, lng, label, pot in zip(final_df['latitude'], final_df['longitude'], final_df.index.values, final_df['potential']):
    label = folium.Popup(label, parse_html=True)
    folium.Circle(    # attributes of each bubble in the map
        [lat, lng],
        radius=pot*2/(10**6),
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

One apparent shortcoming of this model is that 'potential' is very sensitive for neighborhoods with only 1 or 2 documented restaurants: 'potential' is inversely proportional to the restaurant count, so a single additional restaurant in a neighborhood with only one documented restaurant halves its potential. The ranking is therefore unreliable if even one restaurant in such a place is missing from the Foursquare data.

So, to counter this issue and add some robustness to the model, let's consider only those neighborhoods which have at least 10 documented restaurants. This will make 'potential' significantly less sensitive if more restaurants are documented at a later stage.
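
To make the sensitivity concrete: because 'potential' is inversely proportional to the restaurant count, one extra restaurant changes the score by a factor of n/(n+1). The sketch below (the function name is illustrative) prints the relative drop for a few counts.

```python
def relative_drop(n):
    """Fractional decrease in potential when the count grows from n to n + 1."""
    return 1 - n / (n + 1)

for n in (1, 2, 10, 20):
    print('{} -> {} restaurants: potential drops by {:.1%}'.format(n, n + 1, relative_drop(n)))
```

With a single documented restaurant the score falls by half, while at 10 restaurants one more changes it by roughly a tenth.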

<h3> Remove places with fewer than 10 restaurants -

In [54]:
sliced_df = final_df[final_df['frequency'] >= 10]
sliced_df

Unnamed: 0_level_0,Population,Land area (km2),Density (people/km2),Average Income,frequency,potential,latitude,longitude
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Agincourt,44577,12.45,3580,25750,18,5121389.0,43.785353,-79.278549
Alexandra Park,4355,0.32,13609,19687,25,10716820.0,43.650787,-79.404318
Bay Street Corridor,4787,0.11,43518,40598,29,60922200.0,43.667342,-79.388457
Christie Pits,5124,0.64,8006,30556,11,22239210.0,43.667139,-79.422766
Church and Wellesley,13397,0.55,24358,37653,21,43673890.0,43.665524,-79.383801
Discovery District,7262,0.66,6998,41998,30,9796733.0,43.657556,-79.38948
Fashion District,4642,0.98,4737,63282,29,10336790.0,43.645456,-79.394994
Financial District,548,0.47,1166,63952,37,2015352.0,43.648664,-79.38154
Garden District,8240,0.52,15846,37614,36,16556430.0,43.6565,-79.377114
Grange Park,9007,0.84,10793,35277,48,7932180.0,43.652197,-79.392319


In [55]:
print(sliced_df['frequency'].sum())
sliced_df.sort_values(by='potential', ascending=False)

420


Unnamed: 0_level_0,Population,Land area (km2),Density (people/km2),Average Income,frequency,potential,latitude,longitude
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Bay Street Corridor,4787,0.11,43518,40598,29,60922200.0,43.667342,-79.388457
Niagara,6524,0.55,11862,44611,11,48106880.0,43.644075,-79.408698
Harbourfront / CityPlace,14368,1.87,9228,69232,14,45633780.0,43.64008,-79.38015
Church and Wellesley,13397,0.55,24358,37653,21,43673890.0,43.665524,-79.383801
Yorkville,6045,0.56,10795,105239,27,42076110.0,43.671386,-79.390168
Christie Pits,5124,0.64,8006,30556,11,22239210.0,43.667139,-79.422766
Harbord Village,5906,0.64,9228,45792,20,21128430.0,43.661522,-79.409745
Garden District,8240,0.52,15846,37614,36,16556430.0,43.6565,-79.377114
Newtonbrook,36046,8.77,4110,33428,10,13738910.0,43.793886,-79.425679
Alexandra Park,4355,0.32,13609,19687,25,10716820.0,43.650787,-79.404318


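Here again, land area seems to dominate 'potential': density is population divided by land area, so very small neighborhoods such as Bay Street Corridor (0.11 km2) rank highest. A way around this could be to scale all parameters to the range between 1 and 2 and then recalculate 'potential'. That will give each parameter an equal opportunity to affect the final potential.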
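
The re-scaling described here is ordinary min-max scaling shifted into the range from 1 to 2; for parameters where smaller is better (land area, restaurant count), the reciprocal is scaled instead. A sketch in plain Python with hypothetical helper names, tried on the restaurant counts 10, 14 and 48, which are the smallest, an intermediate, and the largest count among the neighborhoods kept above:

```python
def scale_1_to_2(values):
    """Min-max scale values so the smallest maps to 1 and the largest to 2."""
    lo, hi = min(values), max(values)
    return [1 + (v - lo) / (hi - lo) for v in values]

def scale_inverse_1_to_2(values):
    """Scale reciprocals so the largest value maps to 1 and the smallest to 2."""
    return scale_1_to_2([1 / v for v in values])

frequencies = [10, 14, 48]
scaled = scale_inverse_1_to_2(frequencies)
print([round(s, 6) for s in scaled])
```

Because every scaled parameter lies between 1 and 2, no single factor can change the product by more than a factor of two.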

<h3> Re-scale all 'potential' affecting parameters -

In [None]:
scaled_df = sliced_df.iloc[:, :]

# We want (1 / landArea(km2)) to fall between 1 and 2, where 1 means highest area and 2 means lowest area
scaled_df['Land area (km2) inv'] = 1 + ( 1/scaled_df['Land area (km2)'] - 1/scaled_df['Land area (km2)'].max() ) / ( 1/scaled_df['Land area (km2)'].min() - 1/scaled_df['Land area (km2)'].max() )

# We want (population) to fall between 1 and 2, where 1 means lowest population and 2 means highest population
scaled_df['Population pro'] = 1 + ( scaled_df['Population'] - scaled_df['Population'].min() ) / ( scaled_df['Population'].max() - scaled_df['Population'].min() )

# We want (averageIncome) to fall between 1 and 2, where 1 means lowest averageIncome and 2 means highest averageIncome
scaled_df['Average Income pro'] = 1 + ( scaled_df['Average Income'] - scaled_df['Average Income'].min() ) / ( scaled_df['Average Income'].max() - scaled_df['Average Income'].min() )

# We want (1 / frequency) to fall between 1 and 2, where 1 means highest frequency and 2 means lowest frequency
scaled_df['frequency inv'] = 1 + ( 1/scaled_df['frequency'] - 1/scaled_df['frequency'].max() ) / ( 1/scaled_df['frequency'].min() - 1/scaled_df['frequency'].max() )

# potential will simply be the product of all the above parameters!
scaled_df['potential'] = scaled_df['Land area (km2) inv'] * scaled_df['Population pro'] * scaled_df['Average Income pro'] * scaled_df['frequency inv']

scaled_df

In [59]:
scaled_df.sort_values(by='potential', ascending=False)

Unnamed: 0_level_0,Population,Land area (km2),Density (people/km2),Average Income,frequency,potential,latitude,longitude,Land area (km2) inv,Population pro,Average Income pro,frequency inv
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Newtonbrook,36046,8.77,4110,33428,10,4.208387,43.793886,-79.425679,1.00374,1.806241,1.160616,2.0
Harbourfront / CityPlace,14368,1.87,9228,69232,14,3.572285,43.64008,-79.38015,1.050434,1.313884,1.579121,1.639098
Niagara,6524,0.55,11862,44611,11,3.298033,43.644075,-79.408698,1.192869,1.135729,1.291332,1.885167
Yorkville,6045,0.56,10795,105239,27,3.223104,43.671386,-79.390168,1.189265,1.12485,2.0,1.204678
Bay Street Corridor,4787,0.11,43518,40598,29,3.198895,43.667342,-79.388457,2.0,1.096277,1.244424,1.172414
Agincourt,44577,12.45,3580,25750,18,3.081097,43.785353,-79.278549,1.0,2.0,1.070869,1.438596
Milliken,26272,7.19,3654,25243,13,2.901275,43.823174,-79.301763,1.006521,1.584251,1.064943,1.708502
Christie Pits,5124,0.64,8006,30556,11,2.731306,43.667139,-79.422766,1.164493,1.103931,1.127046,1.885167
Church and Wellesley,13397,0.55,24358,37653,21,2.495469,43.665524,-79.383801,1.192869,1.29183,1.210001,1.338346
Harbord Village,5906,0.64,9228,45792,20,2.332847,43.661522,-79.409745,1.164493,1.121693,1.305136,1.368421


<h3> Create the final map with this transformed potential -

In [163]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=13.5)    # center point of the map

for lat, lng, label, pot in zip(scaled_df['latitude'], scaled_df['longitude'], scaled_df.index.values, scaled_df['potential']):
    label = folium.Popup(label, parse_html=True)
    folium.Circle(    # attributes of each bubble in the map
        [lat, lng],
        radius=pot*200,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<h2> Some Observations: </h2>
    
    If one wishes to open a restaurant anywhere in Toronto, Newtonbrook is the place for them. It has a significant population with a decent average income and a relatively low number of pre-existing restaurants! 
    
    However, if one insists on opening a restaurant in downtown Toronto because of tourist attractions and networking, Harbourfront / CityPlace is the optimum location. If one wants to open a restaurant with the lowest risk of failure, i.e. in a place where many restaurants are known to flourish, Bay Street Corridor is the optimum location. Basically, any end-user of this model can filter the final results according to their own needs, depending on which parameters they value and which they don't, and read off the optimum location. Cool!
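
As a sketch of such user-side filtering, the snippet below keeps only neighborhoods south of a latitude cutoff and picks the one with the highest potential. The rows are copied from the ranking above; the cutoff of 43.70 as a stand-in for "downtown" and the function name `best_location` are assumptions made for this example.

```python
# (name, potential, latitude) copied from the ranked table above
rows = [
    ('Newtonbrook', 4.208387, 43.793886),
    ('Harbourfront / CityPlace', 3.572285, 43.64008),
    ('Niagara', 3.298033, 43.644075),
    ('Yorkville', 3.223104, 43.671386),
    ('Bay Street Corridor', 3.198895, 43.667342),
    ('Agincourt', 3.081097, 43.785353),
]

def best_location(rows, keep=lambda row: True):
    """Return the name with the highest potential among rows passing the filter."""
    candidates = [row for row in rows if keep(row)]
    return max(candidates, key=lambda row: row[1])[0]

overall = best_location(rows)
downtown = best_location(rows, keep=lambda row: row[2] < 43.70)   # assumed downtown cutoff
print(overall, '|', downtown)
```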

<h2> Clustering:</h2>
    
    Now that we have individual data points readily available, we can find optimal locations as per our needs from the analysis above. Extending this idea, we can create clusters of all these locations to find out which of them are similar and in what way. This will further help us in combining our analytical findings with geospatial data! 

<h3> Create a new dataframe, cutting off all the unwanted/repeated parameters -

In [72]:
refined_data = scaled_df.iloc[:, 5:]
refined_data

Unnamed: 0_level_0,potential,latitude,longitude,Land area (km2) inv,Population pro,Average Income pro,frequency inv
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Agincourt,3.081097,43.785353,-79.278549,1.0,2.0,1.070869,1.438596
Alexandra Park,1.805503,43.650787,-79.404318,1.3379,1.086466,1.0,1.242105
Bay Street Corridor,3.198895,43.667342,-79.388457,2.0,1.096277,1.244424,1.172414
Christie Pits,2.731306,43.667139,-79.422766,1.164493,1.103931,1.127046,1.885167
Church and Wellesley,2.495469,43.665524,-79.383801,1.192869,1.29183,1.210001,1.338346
Discovery District,1.95039,43.657556,-79.38948,1.159238,1.15249,1.260789,1.157895
Fashion District,2.136232,43.645456,-79.394994,1.104331,1.092984,1.509573,1.172414
Financial District,2.007871,43.648664,-79.38154,1.227215,1.0,1.517405,1.078236
Garden District,1.861562,43.6565,-79.377114,1.20451,1.174703,1.209545,1.087719
Grange Park,1.583003,43.652197,-79.392319,1.123206,1.192123,1.182228,1.0


<h3> Import the KMeans library and fit the scaled version of our parameters -

In [78]:
from sklearn.cluster import KMeans

x = refined_data.iloc[:, 3:]
k_means = KMeans(init='k-means++', n_clusters=4, n_init=12).fit(x)
k_means.labels_

array([1, 0, 2, 3, 0, 0, 2, 2, 0, 0, 0, 3, 0, 1, 1, 3, 2], dtype=int32)
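
A quick way to see how the 17 neighborhoods are spread over the four clusters is to tally the labels. The sketch below uses the array printed above; note that KMeans label numbers are arbitrary and can differ between runs unless `random_state` is fixed.

```python
from collections import Counter

labels = [1, 0, 2, 3, 0, 0, 2, 2, 0, 0, 0, 3, 0, 1, 1, 3, 2]   # copied from the output above
cluster_sizes = Counter(labels)
for cluster in sorted(cluster_sizes):
    print('cluster {}: {} neighborhoods'.format(cluster, cluster_sizes[cluster]))
```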

<h3> Add these labels to (refined_data) - </h3>

In [79]:
refined_data['label'] = k_means.labels_
refined_data.sort_values(by='potential', ascending=False)

Name,potential,latitude,longitude,Land area (km2) inv,Population pro,Average Income pro,frequency inv,label
Newtonbrook,4.208387,43.793886,-79.425679,1.00374,1.806241,1.160616,2.0,1
Harbourfront / CityPlace,3.572285,43.64008,-79.38015,1.050434,1.313884,1.579121,1.639098,3
Niagara,3.298033,43.644075,-79.408698,1.192869,1.135729,1.291332,1.885167,3
Yorkville,3.223104,43.671386,-79.390168,1.189265,1.12485,2.0,1.204678,2
Bay Street Corridor,3.198895,43.667342,-79.388457,2.0,1.096277,1.244424,1.172414,2
Agincourt,3.081097,43.785353,-79.278549,1.0,2.0,1.070869,1.438596,1
Milliken,2.901275,43.823174,-79.301763,1.006521,1.584251,1.064943,1.708502,1
Christie Pits,2.731306,43.667139,-79.422766,1.164493,1.103931,1.127046,1.885167,3
Church and Wellesley,2.495469,43.665524,-79.383801,1.192869,1.29183,1.210001,1.338346,0
Harbord Village,2.332847,43.661522,-79.409745,1.164493,1.121693,1.305136,1.368421,0


<h3> Map this dataframe with each location's color scheme depending on its label - </h3>

In [164]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13.5)

kclusters = 4

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map; each circle's radius grows with the location's potential
for lat, lon, poi, cluster, pot in zip(refined_data['latitude'], refined_data['longitude'], refined_data.index.values, refined_data['label'], refined_data['potential']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.Circle(
        [lat, lon],
        radius=pot*200,
        popup=label,
        color=rainbow[cluster-1],    # labels start at 0, so cluster 0 wraps around to the last colour
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
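The colour lookup in the map cell indexes the palette with `cluster-1` even though the KMeans labels start at 0. The short sketch below rebuilds the same palette and prints which colour each label ends up with; the names `palette` and `colour_of` are made up for this example.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# One evenly spaced rainbow colour per cluster, as hex strings.
palette = [colors.to_hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Same indexing as the map cell: label - 1, so label 0 picks palette[-1] (the last colour).
colour_of = {label: palette[label - 1] for label in range(kclusters)}

for label, colour in colour_of.items():
    print(label, colour)
```

Every label still gets its own colour; the offset only shifts which one. What matters is that the map and the styled tables use the same offset, so that their colours agree.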

<h3> Let's now look at each label (cluster) individually - </h3>

In [166]:
def highlight(s):    # add a background colour to each dataframe row according to its label (row wise)
    index = int(s.iloc[-1] - 1)    # the label is the last column; same colour offset as on the map
    return ['background-color: {}'.format(rainbow[index]) for _ in s]

def highlight_col(s):    # add a background colour to each dataframe column according to its label (column wise)
    if s.iloc[0] == '-':    # the overall 'mean' column has no cluster label
        return ['' for _ in s]
    else:
        index = int(s.iloc[0] - 1)
        return ['background-color: {}'.format(rainbow[index]) for _ in s]

In [127]:
# cluster 0 points:
zero_df = sliced_df[refined_data['label'] == 0].iloc[:, :6].copy()
zero_df['label'] = refined_data['label']
zero_df.style.apply(highlight, axis=1)

Name,Population,Land area (km2),Density (people/km2),Average Income,frequency,potential,label
Alexandra Park,4355,0.32,13609,19687,25,1.8055,0
Church and Wellesley,13397,0.55,24358,37653,21,2.49547,0
Discovery District,7262,0.66,6998,41998,30,1.95039,0
Garden District,8240,0.52,15846,37614,36,1.86156,0
Grange Park,9007,0.84,10793,35277,48,1.583,0
Harbord Village,5906,0.64,9228,45792,20,2.33285,0
Kensington Market,3740,0.36,10389,23335,41,1.51827,0


In [124]:
# cluster 1 points:
one_df = sliced_df[refined_data['label'] == 1].iloc[:, :6].copy()
one_df['label'] = refined_data['label']
one_df.style.apply(highlight, axis=1)

Name,Population,Land area (km2),Density (people/km2),Average Income,frequency,potential,label
Agincourt,44577,12.45,3580,25750,18,3.0811,1
Milliken,26272,7.19,3654,25243,13,2.90127,1
Newtonbrook,36046,8.77,4110,33428,10,4.20839,1


In [125]:
# cluster 2 points:
two_df = sliced_df[refined_data['label'] == 2].iloc[:, :6].copy()
two_df['label'] = refined_data['label']
two_df.style.apply(highlight, axis=1)

Name,Population,Land area (km2),Density (people/km2),Average Income,frequency,potential,label
Bay Street Corridor,4787,0.11,43518,40598,29,3.19889,2
Fashion District,4642,0.98,4737,63282,29,2.13623,2
Financial District,548,0.47,1166,63952,37,2.00787,2
Yorkville,6045,0.56,10795,105239,27,3.2231,2


In [126]:
# cluster 3 points:
three_df = sliced_df[refined_data['label'] == 3].iloc[:, :6].copy()
three_df['label'] = refined_data['label']
three_df.style.apply(highlight, axis=1)

Name,Population,Land area (km2),Density (people/km2),Average Income,frequency,potential,label
Christie Pits,5124,0.64,8006,30556,11,2.73131,3
Harbourfront / CityPlace,14368,1.87,9228,69232,14,3.57229,3
Niagara,6524,0.55,11862,44611,11,3.29803,3


<h3> Let's look at the stats of all these clusters together in one dataframe - </h3>

In [167]:
# mean of every parameter, overall and per cluster
summary_df = sliced_df.iloc[:, :6].describe().loc['mean'].to_frame()
summary_df['Cluster 0'] = zero_df.describe().loc['mean']
summary_df['Cluster 1'] = one_df.describe().loc['mean']
summary_df['Cluster 2'] = two_df.describe().loc['mean']
summary_df['Cluster 3'] = three_df.describe().loc['mean']

# set the row name in the constructor; set_axis(..., inplace=True) no longer works in pandas 2.x
labels = pd.DataFrame({'mean': ['-'], 'Cluster 0': [0], 'Cluster 1': [1], 'Cluster 2': [2], 'Cluster 3': [3]}, index=['label'])

counts = pd.DataFrame({'mean': [sliced_df.shape[0]], 'Cluster 0': [zero_df.shape[0]], 'Cluster 1': [one_df.shape[0]], 'Cluster 2': [two_df.shape[0]], 'Cluster 3': [three_df.shape[0]]}, index=['count'])

summary_df = pd.concat([labels, counts, summary_df])
summary_df.style.apply(highlight_col, axis=0)

Unnamed: 0,mean,Cluster 0,Cluster 1,Cluster 2,Cluster 3
label,-,0.0,1.0,2.0,3.0
count,17,7.0,3.0,4.0,3.0
Population,11814.1,7415.29,35631.7,4005.5,8672.0
Land area (km2),2.20471,0.555714,9.47,0.53,1.02
Density (people/km2),11286.9,13031.6,3781.33,15054.0,9698.67
Average Income,43720.4,34479.4,28140.3,68267.8,48133.0
frequency,24.7059,31.5714,13.6667,30.5,12.0
potential,2.58268,1.93529,3.39692,2.64153,3.20054
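The same summary can also be produced in one step with `groupby`, which avoids keeping four separate per-cluster dataframes. The sketch below is only an alternative under stated assumptions: it re-types the 17 rows from the cluster tables above rather than reusing `sliced_df`, and the names `clusters` and `summary` are made up for this example.

```python
import pandas as pd

# The 17 neighbourhoods, copied from the four cluster tables above.
rows = [
    ("Alexandra Park", 4355, 0.32, 13609, 19687, 25, 1.8055, 0),
    ("Church and Wellesley", 13397, 0.55, 24358, 37653, 21, 2.49547, 0),
    ("Discovery District", 7262, 0.66, 6998, 41998, 30, 1.95039, 0),
    ("Garden District", 8240, 0.52, 15846, 37614, 36, 1.86156, 0),
    ("Grange Park", 9007, 0.84, 10793, 35277, 48, 1.583, 0),
    ("Harbord Village", 5906, 0.64, 9228, 45792, 20, 2.33285, 0),
    ("Kensington Market", 3740, 0.36, 10389, 23335, 41, 1.51827, 0),
    ("Agincourt", 44577, 12.45, 3580, 25750, 18, 3.0811, 1),
    ("Milliken", 26272, 7.19, 3654, 25243, 13, 2.90127, 1),
    ("Newtonbrook", 36046, 8.77, 4110, 33428, 10, 4.20839, 1),
    ("Bay Street Corridor", 4787, 0.11, 43518, 40598, 29, 3.19889, 2),
    ("Fashion District", 4642, 0.98, 4737, 63282, 29, 2.13623, 2),
    ("Financial District", 548, 0.47, 1166, 63952, 37, 2.00787, 2),
    ("Yorkville", 6045, 0.56, 10795, 105239, 27, 3.2231, 2),
    ("Christie Pits", 5124, 0.64, 8006, 30556, 11, 2.73131, 3),
    ("Harbourfront / CityPlace", 14368, 1.87, 9228, 69232, 14, 3.57229, 3),
    ("Niagara", 6524, 0.55, 11862, 44611, 11, 3.29803, 3),
]
columns = ["Name", "Population", "Land area (km2)", "Density (people/km2)",
           "Average Income", "frequency", "potential", "label"]
clusters = pd.DataFrame(rows, columns=columns).set_index("Name")

# Per-cluster means, with the cluster size added as the first column.
per_cluster = clusters.groupby("label").mean()
per_cluster.insert(0, "count", clusters.groupby("label").size())

# Transpose so clusters become columns, then add the overall mean alongside them.
summary = per_cluster.T
summary.insert(0, "mean", clusters.drop(columns="label").mean())
summary.loc["count", "mean"] = len(clusters)

print(summary.round(2))
```

The per-cluster means from this sketch agree with the table above, for example a mean population of about 35631.7 for cluster 1 and a mean frequency of 12 for cluster 3.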


<h2> Clustered Groups: </h2>
    
   <h3>Group 0 :</h3> 
   
    If one wishes to open a restaurant anywhere in Toronto, Newtonbrook is the place for them. It has a significant population with decent per capita income and a relatively low number of pre-existing restaurants!
    