# The Battle of Neighborhoods

## Introduction

The goal is to identify a suitable location for opening a new restaurant in the City of Toronto.


## Data

We use three data sources: the list of Toronto neighborhoods (from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), the location of each neighborhood (from `geocoder`), and the venues in each neighborhood (from *Foursquare*).

We fit a multiple linear regression in which the features are the counts of non-restaurant venues in each neighborhood, by category, and the dependent variable is the number of restaurants. We then use the model to predict the expected number of restaurants for each neighborhood. If the actual count is lower than the expected count, we consider the neighborhood a good location for a new restaurant, since there is still room in that market. Conversely, if the actual count is higher than the expected count, we consider it a poor location because competition there is already high. Sorting the neighborhoods by this difference shows which locations are most promising.
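
The expected-versus-actual idea can be illustrated on a tiny made-up dataset before touching the real one. The sketch below is only an illustration: the venue counts and restaurant counts are invented, and it uses NumPy's least-squares solver rather than the scikit-learn model fitted later.

```python
import numpy as np

# Hypothetical toy data: rows are neighborhoods, columns are counts of two
# non-restaurant venue categories (for example, cafes and banks).
X = np.array([[1, 0], [2, 1], [0, 3], [4, 2], [3, 3]], dtype=float)
# Made-up actual restaurant counts for the same five neighborhoods.
actual = np.array([2, 5, 3, 9, 7], dtype=float)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, actual, rcond=None)

expected = A @ coef
capacity = expected - actual  # positive: fewer restaurants than predicted

# Neighborhood indices ordered from most to least promising.
ranking = np.argsort(-capacity)
print(capacity.round(3), ranking)
```

Because the model includes an intercept, the capacities sum to zero on the data used for fitting, so the score is only meaningful as a relative ranking of neighborhoods.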

Let's first fetch the neighborhood data from Wikipedia and clean it.

In [2]:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
neighbor = pd.read_html(url)[0]
neighbor = neighbor[neighbor['Borough'] != 'Not assigned']
neighbor.index = range(neighbor.shape[0])

print(neighbor.shape)
neighbor.head()

(103, 3)


|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


Next we fetch the location data. Since `geocoder` is unreliable, we use a pre-fetched CSV file instead.

In [3]:
geo = pd.read_csv('Geospatial_Coordinates.csv')
print(geo.shape)
geo.head()

(103, 3)


|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Next, we fetch venue data using the Foursquare API.

In [4]:
CLIENT_ID = 'ZMMQQSIUFY2YSNOQMF31PYZU1KLB5NKEKK3ZUJZSTTSAB2XO'
CLIENT_SECRET = 'IRXJD5LHR0JK0GSSVUNCXSN01HVP4ZBLCCLLM3QIKOVRVIZH'
VERSION = '20180605'

In [5]:
import requests # library to handle requests

LIMIT = 100
RADIUS = 500

def getVenues(names, latitudes, longitudes):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            RADIUS, 
            LIMIT)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    venues_df = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    venues_df.columns = ['Postal Code', 
                         'Neighborhood Latitude', 
                         'Neighborhood Longitude', 
                         'Venue', 
                         'Venue Latitude', 
                         'Venue Longitude', 
                         'Venue Category']
    return venues_df
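
The request URL above is assembled by string formatting with credentials written directly in the notebook. A minimal sketch of an alternative, assuming the credentials are stored in environment variables (the names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are hypothetical), builds the same query string with the standard library:

```python
import os
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(lat, lng, radius=500, limit=100, version='20180605'):
    """Return the explore URL for one location (hypothetical helper)."""
    params = {
        # Assumed environment variable names; unset values fall back to ''.
        'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes values, so the comma in `ll` becomes %2C.
    return EXPLORE_URL + '?' + urlencode(params)

url = build_explore_url(43.806686, -79.194353)
```

The resulting `url` could be passed to `requests.get` in place of the formatted string inside `getVenues`; keeping the keys out of the notebook avoids publishing them along with the analysis.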

In [6]:
venues = getVenues(geo['Postal Code'], geo['Latitude'], geo['Longitude'])
print(venues.shape)
venues.head()

(2153, 7)


|   | Postal Code | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 | Wendy’s | 43.807448 | -79.199056 | Fast Food Restaurant |
| 1 | M1B | 43.806686 | -79.194353 | Interprovincial Group | 43.80563 | -79.200378 | Print Shop |
| 2 | M1C | 43.784535 | -79.160497 | Royal Canadian Legion | 43.782533 | -79.163085 | Bar |
| 3 | M1E | 43.763573 | -79.188711 | RBC Royal Bank | 43.76679 | -79.191151 | Bank |
| 4 | M1E | 43.763573 | -79.188711 | G & G Electronics | 43.765309 | -79.191537 | Electronics Store |


Let's one-hot encode the venue categories and sum the indicators per postal code, which gives the number of venues of each category in each postal code.

In [7]:
venues_onehot = pd.get_dummies(venues['Venue Category'], prefix="", prefix_sep="")
venues_onehot['Postal Code'] = venues['Postal Code']
venues_onehot = venues_onehot.groupby('Postal Code').sum().reset_index()
print(venues_onehot.shape)
venues_onehot.head()

(100, 270)


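|   | Postal Code | Accessories Store | Afghan Restaurant | Airport | Airport Food Court | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | Antique Shop | ... | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M1B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | M1C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | M1E | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | M1G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | M1H | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |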
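
To make the encode-then-sum step concrete, here is a small sketch on an invented five-row venue table. It repeats the same `get_dummies` and `groupby` calls and compares the result with `pd.crosstab`, which is one alternative way to get the same counts; the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical miniature version of the `venues` table.
toy = pd.DataFrame({
    'Postal Code': ['M1B', 'M1B', 'M1C', 'M1E', 'M1E'],
    'Venue Category': ['Fast Food Restaurant', 'Print Shop', 'Bar', 'Bank', 'Bank'],
})

# Same approach as the notebook: one indicator column per category,
# then sum the indicators within each postal code.
onehot = pd.get_dummies(toy['Venue Category'], prefix="", prefix_sep="")
onehot['Postal Code'] = toy['Postal Code']
counts = onehot.groupby('Postal Code').sum().reset_index()

# Alternative: a cross-tabulation of postal code against category.
alt = pd.crosstab(toy['Postal Code'], toy['Venue Category'])

print(counts)
print(alt)
```

Both tables have one row per postal code and one column per category. Postal codes for which no venue was returned do not appear in either, which is why the real table has 100 rows rather than 103.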


Let's sum all columns whose names contain `Food` or `Restaurant` into a new column, `Restaurant`.

In [8]:
# Columns whose category name marks them as a food venue.
food_cols = [c for c in venues_onehot.columns if 'Food' in c or 'Restaurant' in c]
restaurants = venues_onehot[food_cols].sum(axis=1)
venues_onehot = venues_onehot.drop(columns=food_cols)
venues_onehot['Restaurant'] = restaurants
print(venues_onehot.shape)
venues_onehot.head()

(100, 219)


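|   | Postal Code | Accessories Store | Airport | Airport Lounge | Airport Service | Airport Terminal | Antique Shop | Aquarium | Art Gallery | Art Museum | ... | Train Station | Video Game Store | Video Store | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio | Restaurant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M1B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | M1C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | M1E | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | M1G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | M1H | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |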
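
The rule used to decide what counts as a restaurant is a plain substring test, so it is worth checking what it does and does not match. A short sketch on an illustrative list of category names (the list is an assumption and not taken from the fetched data):

```python
# Illustrative category names; not the actual list returned by the API.
categories = ['Fast Food Restaurant', 'Food Court', 'Health Food Store',
              'Seafood Restaurant', 'Coffee Shop', 'Bakery']

def is_restaurant(name):
    """Same substring rule as the notebook (case-sensitive)."""
    return 'Food' in name or 'Restaurant' in name

matched = [c for c in categories if is_restaurant(c)]
unmatched = [c for c in categories if not is_restaurant(c)]
print(matched)
print(unmatched)
```

Under this rule a name such as `Health Food Store` would be counted as a restaurant, while `Coffee Shop` and `Bakery` would stay among the features. Whether that grouping is appropriate depends on the kind of restaurant being planned.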


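Let's merge the data above to finish the data preparation. Three postal codes have no venue rows (100 of the 103 appear in the venue counts), so their missing counts are filled with 0.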

In [9]:
df = neighbor.merge(geo, how='left', on='Postal Code').merge(venues_onehot, how='left', on='Postal Code')
df.fillna(0, inplace=True)

print(df.shape)
df.head()

(103, 223)


|   | Postal Code | Borough | Neighbourhood | Latitude | Longitude | Accessories Store | Airport | Airport Lounge | Airport Service | Airport Terminal | ... | Train Station | Video Game Store | Video Store | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio | Restaurant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
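
The left merge keeps every postal code from the neighborhood table, including those with no venue counts, and `fillna(0)` then turns their missing values into zeros. The sketch below reproduces this on invented data and uses the `indicator=True` option of `merge` to list which postal codes had no match; the postal codes and counts are hypothetical.

```python
import pandas as pd

# Hypothetical neighborhood list and venue counts; 'M1B' has no venues.
left = pd.DataFrame({'Postal Code': ['M1A', 'M1B', 'M1C']})
counts = pd.DataFrame({'Postal Code': ['M1A', 'M1C'],
                       'Bank': [2, 1],
                       'Restaurant': [4, 0]})

# indicator=True adds a `_merge` column marking rows found only on the left.
merged = left.merge(counts, how='left', on='Postal Code', indicator=True)
no_venues = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()

# Missing counts become 0, as in the notebook.
merged = merged.drop(columns='_merge').fillna(0)
print(no_venues)
print(merged)
```

The unmatched rows introduce `NaN`, which forces the count columns to floating point; this is why the merged table above shows `0.0` and `1.0` rather than integers. Note that a filled zero means "no venues returned", which is not necessarily the same as "no venues exist".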


## Methodology

We fit a multiple linear regression to the data.

In [10]:
X = df.drop(['Postal Code', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude', 'Restaurant'], axis=1)
X.head()

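|   | Accessories Store | Airport | Airport Lounge | Airport Service | Airport Terminal | Antique Shop | Aquarium | Art Gallery | Art Museum | Arts & Crafts Store | ... | Trail | Train Station | Video Game Store | Video Store | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |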


In [11]:
y = df['Restaurant']
y.head()

0    1.0
1    1.0
2    5.0
3    1.0
4    2.0
Name: Restaurant, dtype: float64

In [12]:
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)
y_hat = pd.Series(model.predict(X))

y_hat.head()

0    0.214769
1    0.985277
2    4.999925
3    0.840990
4    1.983324
dtype: float64
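
The predictions above are made on the same 103 rows the model was fitted on, using 217 feature columns, so in-sample errors tend to be small regardless of how well the model generalizes. One possible check, sketched here on synthetic data only (the data, the coefficients, and the choice of 5 folds are all assumptions), is to compare in-sample predictions with out-of-fold predictions from `cross_val_predict`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the venue counts: 60 rows, 5 count-like features.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(60, 5)).astype(float)
y = X @ np.array([1.0, 0.5, 0.0, 2.0, 0.3]) + rng.normal(0, 0.5, size=60)

# In-sample: fit and predict on the same rows, as the notebook does.
in_sample = LinearRegression().fit(X, y).predict(X)

# Out-of-fold: each row is predicted by a model that never saw it.
out_of_fold = cross_val_predict(LinearRegression(), X, y, cv=5)

in_err = np.mean((in_sample - y) ** 2)
out_err = np.mean((out_of_fold - y) ** 2)
print(in_err, out_err)
```

If the out-of-fold error on the real data turned out to be much larger than the in-sample error, the expected restaurant counts, and hence the capacity ranking, should be read with caution.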

## Result

Let's compute each neighborhood's capacity for a new restaurant, defined as the expected number of restaurants minus the actual number.

In [13]:
df['Restaurant Expected'] = y_hat
df['Restaurant Capacity'] = df['Restaurant Expected'] - df['Restaurant']
df.sort_values(by='Restaurant Capacity', ascending=False, inplace=True)
df

|   | Postal Code | Borough | Neighbourhood | Latitude | Longitude | Accessories Store | Airport | Airport Lounge | Airport Service | Airport Terminal | ... | Video Store | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio | Restaurant | Restaurant Expected | Restaurant Capacity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 91 | M4W | Downtown Toronto | Rosedale | 43.679563 | -79.377529 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.338766 | 0.338766 |
| 50 | M9L | North York | Humber Summit | 43.756303 | -79.565963 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.235887 | 0.235887 |
| 97 | M5X | Downtown Toronto | First Canadian Place, Underground city | 43.648429 | -79.382280 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 31.0 | 31.220384 | 0.220384 |
| 44 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge | 43.711112 | -79.284577 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.219806 | 0.219806 |
| 20 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14.0 | 14.215303 | 0.215303 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.840990 | -0.159010 |
| 26 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2.799852 | -0.200148 |
| 13 | M3C | North York | Don Mills | 43.725900 | -79.340923 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.736376 | -0.263624 |
| 32 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.428976 | -0.428976 |


The map below shows the result: the redder a marker, the higher the capacity (a better location), and the bluer a marker, the lower the capacity (a worse location).

In [14]:
import folium
import math

toronto = folium.Map(location=[43.741667, -79.373333], zoom_start=11)

capmax = df['Restaurant Capacity'].max()
capmin = df['Restaurant Capacity'].min()

for lat, lon, poi, cap in zip(df['Latitude'], df['Longitude'], df['Postal Code'], df['Restaurant Capacity']):
    label = folium.Popup(str(poi) + ' Capacity ' + str(cap), parse_html=True)
    r = int(math.floor((cap - capmin) / (capmax - capmin) * 255))
    b = int(255 - r)
    color = '#%02x%02x%02x' % (r, 0, b)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.7).add_to(toronto)

toronto
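
The marker color is a linear blend from blue (lowest capacity) to red (highest capacity). The sketch below isolates that mapping in a small function so it can be checked without drawing a map. The function name `capacity_to_hex` is hypothetical, and the guard for the case where all capacities are equal is an addition that the notebook code does not have.

```python
import math

def capacity_to_hex(cap, capmin, capmax):
    """Map a capacity to a hex color between blue (min) and red (max)."""
    if capmax == capmin:
        # Added guard: avoid dividing by zero when all values are equal.
        t = 0.5
    else:
        t = (cap - capmin) / (capmax - capmin)
    r = int(math.floor(t * 255))
    b = 255 - r
    return '#%02x%02x%02x' % (r, 0, b)

print(capacity_to_hex(0.0, 0.0, 1.0))  # lowest capacity
print(capacity_to_hex(1.0, 0.0, 1.0))  # highest capacity
print(capacity_to_hex(0.5, 0.0, 1.0))  # midpoint
```

Because the scale is anchored to the minimum and maximum, a single extreme value can compress the colors of all other markers into a narrow range of purples.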

## Discussion

As shown above, `M4W` has the highest capacity and is the best candidate for a new restaurant; `M9L`, `M5X`, `M1L`, `M5E`, and `M2M` are good alternatives. In contrast, `M1J` and `M3A` are not suitable for a new restaurant because competition there is already high.

## Conclusion

Thus, the recommendation is to open a new restaurant in neighborhood `M4W`.