 <h1><center> Food Delivery in Manhattan </center></h1>

## 1. Problem Definition 

A food delivery company has decided to open a new office in Manhattan. It has hired a data scientist to find the most suitable location for this office: one where shipment costs are minimized and demand for food delivery is high.
This project explores where the highest concentration of restaurants is located. Placing the office there minimizes shipment costs (more venues are close to the office) and improves productivity, since each shipment takes less time and more shipments can be completed in the same time interval.
We will also look at the average rating of each cluster of restaurants, since customers are more likely to order from higher-rated restaurants (in terms of value for money).

## 2. Data 


The data for this task come from FourSquare. Using API calls, I will retrieve JSON responses containing the venues and convert them into a pandas DataFrame. These data already contain the position of each restaurant and its rating, as well as the number of reviews. I will then create several clusters, depending on the area that the food delivery company is willing to cover, and evaluate the best location for the new office based on the FourSquare data (as described in the problem definition).
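
As a minimal sketch of the JSON-to-DataFrame step, the snippet below flattens a hand-written response fragment with `pandas.json_normalize`. The `sample_items` payload is an assumption written for illustration (its values are borrowed from the venue tables later in this notebook); a real response may carry more fields or a different shape.

```python
import pandas as pd

# Hypothetical fragment shaped like the 'items' list of a venues response;
# the real payload is assumed, not guaranteed, to look like this.
sample_items = [
    {"venue": {"id": "4b4429abf964a52037f225e3", "name": "Arturo's",
               "location": {"lat": 40.874412, "lng": -73.910271},
               "categories": [{"name": "Pizza Place"}]}},
    {"venue": {"id": "4b79cc46f964a520c5122fe3", "name": "Tibbett Diner",
               "location": {"lat": 40.880404, "lng": -73.908937},
               "categories": [{"name": "Diner"}]}},
]

# Nested dictionaries become dotted column names such as 'venue.location.lat'
flat = pd.json_normalize(sample_items)

# Keep the columns of interest; the category name sits inside a list
venues = flat[["venue.name", "venue.location.lat", "venue.location.lng"]].copy()
venues["category"] = flat["venue.categories"].apply(lambda cats: cats[0]["name"])
print(venues)
```

Flattening first and selecting columns afterwards keeps the extraction logic in one place, at the cost of carrying a few unused columns in the intermediate frame.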

## 3. Code

In [1]:
# import libraries and modules

import numpy as np
import pandas as pd
import json
from geopy.geocoders import Nominatim
import requests
from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!pip install folium
import folium
from sklearn.cluster import DBSCAN
import sklearn.utils

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)



First, get the full New York neighborhood dataset:

In [2]:
# get New York json data

!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [3]:
# convert json info into df

neigh_data = newyork_data['features']
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood, then build the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neigh_data:
    neigh_latlon = data['geometry']['coordinates']  # GeoJSON order: [longitude, latitude]
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': neigh_latlon[1],
                 'Longitude': neigh_latlon[0]})

neigh = pd.DataFrame(rows, columns=column_names)

Select data of Manhattan:

In [33]:
# get Manhattan data

manhattan_data = neigh[neigh['Borough']=='Manhattan'].reset_index(drop=True)

manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


Use FourSquare to get the data on all the venues in Manhattan:

In [34]:
# FourSquare credentials

CLIENT_ID = '1LVVYKSFCSS01J4EYWOPGAGAWJW1YXUVRCFGGZ4DQTOYOERT'
CLIENT_SECRET = 'BCV0MQVF2ZLHKJQQVTAFNT1K24BJOTQAHCOZJVQ23P42J15S'
VERSION = '20180605'
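
Hard-coding credentials in a shared notebook exposes them to every reader. A minimal sketch of an alternative is to read them from environment variables; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions chosen for this example, not names required by FourSquare.

```python
import os

# Hypothetical variable names; export them in the shell before starting the notebook
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")
VERSION = "20180605"

# Warn early instead of failing later inside an API call
if not CLIENT_ID or not CLIENT_SECRET:
    print("FourSquare credentials are not set; API calls will fail until they are.")
```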

In [35]:
# Limit the search radius (in meters) and the number of venues returned by each API call

LIMIT = 100
radius = 500

Here we define a function that issues the same API request for every neighborhood in Manhattan:

In [9]:
def getManhVenues(names, lat, lon, radius = 500):
    """Query the FourSquare explore endpoint once per neighborhood and collect the venues."""

    manh_venues = []

    # distinct loop names, so the input sequences are not shadowed
    for name, n_lat, n_lon in zip(names, lat, lon):

        # API request with user credentials
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            n_lat,
            n_lon,
            radius,
            LIMIT)
        retrieved_data = requests.get(url).json()['response']['groups'][0]['items']

        # one row per venue, tagged with the neighborhood it was retrieved for
        manh_venues.extend([(
            name,
            n_lat,
            n_lon,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name'],
            v['venue']['id']) for v in retrieved_data])

    venues_df = pd.DataFrame(manh_venues)
    venues_df.columns = ['Neighborhood',
                'Neighborhood Latitude',
                'Neighborhood Longitude',
                'Venue',
                'Venue Latitude',
                'Venue Longitude',
                'Venue Category',
                'Venue ID']
    return venues_df
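
The request URL above is assembled with a long format string. As a hedged sketch, the same query string can be built from a dictionary with the standard library's `urllib.parse.urlencode`, which also escapes special characters. `build_explore_url` and the placeholder credentials are hypothetical names introduced for this example.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lon, radius=500, limit=100):
    """Return the explore URL for one neighborhood (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lon}",   # latitude,longitude of the neighborhood
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials, Marble Hill coordinates from the table above
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 40.876551, -73.91066)
print(url)
```

Keeping the parameters in a dictionary makes it easier to add or drop one without touching the positional order of a format call.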

In [10]:
venues_manh = getManhVenues(names=manhattan_data['Neighborhood'],
                           lat=manhattan_data['Latitude'],
                           lon=manhattan_data['Longitude']
                           )

In [36]:
venues_manh.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue ID
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place,4b4429abf964a52037f225e3
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio,4baf59e8f964a520a6f93be3
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner,4b79cc46f964a520c5122fe3
3,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop,55f81cd2498ee903149fcc64
4,Marble Hill,40.876551,-73.91066,Astral Fitness & Wellness Center,40.876705,-73.906372,Gym,4cf6ae55d3a8a1cd71a9d243


### Select only the venues which are restaurants:

In [86]:
rest_data = venues_manh[venues_manh['Venue Category'].str.contains('Restaurant')].reset_index(drop=True)
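
Note that this filter keeps only categories whose name contains "Restaurant", so venues such as "Pizza Place" or "Diner" (both present in the venue table above) are dropped. The sketch below shows one way to widen the filter on a toy frame; the set `EXTRA_FOOD_CATEGORIES` is an assumption and would need to be chosen to match the business need.

```python
import pandas as pd

# Toy frame with categories taken from the venue tables in this notebook
toy = pd.DataFrame({
    "Venue": ["Arturo's", "Bikram Yoga", "Tibbett Diner", "Land & Sea Restaurant"],
    "Venue Category": ["Pizza Place", "Yoga Studio", "Diner", "Seafood Restaurant"],
})

# Assumed extra categories that also sell food; adjust as needed
EXTRA_FOOD_CATEGORIES = {"Pizza Place", "Diner"}

is_restaurant = toy["Venue Category"].str.contains("Restaurant")
is_extra = toy["Venue Category"].isin(EXTRA_FOOD_CATEGORIES)

strict = toy[is_restaurant].reset_index(drop=True)            # filter used in the notebook
broad = toy[is_restaurant | is_extra].reset_index(drop=True)  # widened filter
print(len(strict), len(broad))
```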

In [87]:
rest_data.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue ID
0,Marble Hill,40.876551,-73.91066,Land & Sea Restaurant,40.877885,-73.905873,Seafood Restaurant,4b9c9c6af964a520b27236e3
1,Marble Hill,40.876551,-73.91066,Boston Market,40.87743,-73.905412,American Restaurant,585c205665e7c70a2f1055ea
2,Chinatown,40.715618,-73.994279,Kiki's,40.714476,-73.992036,Greek Restaurant,5521c2ff498ebe2368634187
3,Chinatown,40.715618,-73.994279,Spicy Village,40.71701,-73.99353,Chinese Restaurant,4db3374590a0843f295fb69b
4,Chinatown,40.715618,-73.994279,Wah Fung Number 1 Fast Food 華豐快飯店,40.717278,-73.994177,Chinese Restaurant,4a96bf8ff964a520ce2620e3


Create a map of the restaurants in Manhattan:

In [88]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude


In [89]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(rest_data['Venue Latitude'], rest_data['Venue Longitude'], rest_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

Cluster the restaurants in Manhattan with DBSCAN:

In [90]:
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler

# DBSCAN is deterministic for a fixed input order, so no random seed is needed
Clus_dataSet = rest_data[['Venue Latitude','Venue Longitude']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
# standardize both coordinates; eps below is therefore in standardized units, not meters
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)

# Compute DBSCAN
db = DBSCAN(eps=0.15, min_samples=20).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_  # -1 marks noise points
rest_data["Clus_Db"] = labels

# number of clusters without and with the noise label
realClusterNum = len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
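
Because the coordinates are standardized, `eps` here is expressed in standard deviations rather than in a physical distance. As a hedged alternative (not the method used in this notebook), scikit-learn's `DBSCAN` can work on great-circle distances with `metric='haversine'`, which expects `[latitude, longitude]` in radians. The synthetic points, the 200 m radius and `min_samples=5` below are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used to convert km to radians

def grid(lat0, lon0, n=5, step=0.0005):
    """Return an n x n grid of synthetic (lat, lon) points starting at (lat0, lon0)."""
    return [(lat0 + i * step, lon0 + j * step) for i in range(n) for j in range(n)]

# Two synthetic dense groups plus one isolated point (the last row)
coords = np.array(grid(40.72, -74.00) + grid(40.80, -73.95) + [(40.90, -73.90)])

eps_km = 0.2  # neighborhood radius in kilometers (assumed value)
demo_db = DBSCAN(eps=eps_km / EARTH_RADIUS_KM, min_samples=5,
                 metric="haversine", algorithm="ball_tree").fit(np.radians(coords))

demo_labels = demo_db.labels_
n_clusters = len(set(demo_labels)) - (1 if -1 in demo_labels else 0)
print(n_clusters, demo_labels[-1])
```

With this metric the radius can be reasoned about directly in kilometers, which may be easier to relate to a delivery range than a standardized `eps`.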

Number of restaurants in each cluster:

In [91]:
for value in range(-1, 5):
    print(len(rest_data[rest_data['Clus_Db'] == value]))

201
316
108
34
32
200
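
The loop above hard-codes the range of labels. As a sketch on a made-up label array, `np.unique` with `return_counts=True` returns every label together with its size, without knowing the range in advance.

```python
import numpy as np

# Made-up DBSCAN labels for illustration; -1 denotes noise
demo_labels = np.array([0, 0, 1, -1, 0, 2, 1, -1, 0])

values, counts = np.unique(demo_labels, return_counts=True)
sizes = {int(v): int(c) for v, c in zip(values, counts)}
print(sizes)

# Largest real cluster, ignoring the noise label
largest = max((v for v in sizes if v != -1), key=sizes.get)
print(largest)
```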


Select the cluster with the largest number of restaurants:

In [92]:
big_clust = rest_data[rest_data['Clus_Db'] == 0]
big_clust.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue ID,Clus_Db
2,Chinatown,40.715618,-73.994279,Kiki's,40.714476,-73.992036,Greek Restaurant,5521c2ff498ebe2368634187,0
3,Chinatown,40.715618,-73.994279,Spicy Village,40.71701,-73.99353,Chinese Restaurant,4db3374590a0843f295fb69b,0
4,Chinatown,40.715618,-73.994279,Wah Fung Number 1 Fast Food 華豐快飯店,40.717278,-73.994177,Chinese Restaurant,4a96bf8ff964a520ce2620e3,0
5,Chinatown,40.715618,-73.994279,Da Yu Hot Pot 大渝火锅,40.716735,-73.995752,Hotpot Restaurant,5d992946dbf3ca0008d05211,0
6,Chinatown,40.715618,-73.994279,Forgtmenot,40.714459,-73.991546,New American Restaurant,4fd38a04e4b065401a9aaf88,0


Map the cluster:

In [93]:
map_manhattan_clust = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, lng, label in zip(big_clust['Venue Latitude'], big_clust['Venue Longitude'], big_clust['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan_clust)  
    
map_manhattan_clust

### Apply a second DBSCAN cluster, to select clusters inside the largest cluster:

In [94]:
new_clust = big_clust.copy()
new_clust.drop(['Clus_Db'], axis=1, inplace=True)
new_clust.shape

(316, 8)

In [135]:
sklearn.utils.check_random_state(1000)
Clus_dataSet_2 = new_clust[['Venue Latitude','Venue Longitude']]
Clus_dataSet_2 = np.nan_to_num(Clus_dataSet_2)
Clus_dataSet_2 = StandardScaler().fit_transform(Clus_dataSet_2)

# Compute DBSCAN
db_2 = DBSCAN(eps=0.28, min_samples=10).fit(Clus_dataSet_2)
core_samples_mask_2 = np.zeros_like(db_2.labels_, dtype=bool)
core_samples_mask_2[db_2.core_sample_indices_] = True
labels_2 = db_2.labels_
new_clust["Clus_Db"]=labels_2

realClusterNum=len(set(labels_2)) - (1 if -1 in labels_2 else 0)
clusterNum = len(set(labels_2)) 

Number of restaurants in each sub-cluster:

In [136]:
for value in range(-1, 8):
    print(len(new_clust[new_clust['Clus_Db'] == value]))

58
13
70
53
45
18
24
25
10


Select the three sub-clusters with the most restaurants (labels 1, 2 and 3, with 70, 53 and 45 restaurants). Label -1, with 58 restaurants, is not selected: DBSCAN assigns -1 to noise points, which do not belong to any dense cluster.

In [141]:
final_clust_1 = new_clust[new_clust['Clus_Db'] == 1]
final_clust_2 = new_clust[new_clust['Clus_Db'] == 3]
final_clust_3 = new_clust[new_clust['Clus_Db'] == 2]

final_clust_2.head()  # preview of the sub-cluster with label 3


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue ID,Clus_Db
424,East Village,40.727847,-73.982226,Kura,40.726803,-73.983444,Japanese Restaurant,510c85e7e4b0056826b88297,3
425,East Village,40.727847,-73.982226,Smør,40.729295,-73.981521,Scandinavian Restaurant,5c660fa3286fda00399ae820,3
426,East Village,40.727847,-73.982226,Cafe Mogador,40.727277,-73.984505,Moroccan Restaurant,41044980f964a520750b1fe3,3
427,East Village,40.727847,-73.982226,Thursday Kitchen,40.727661,-73.983761,Korean Restaurant,578bec6c498e3150fc369f3b,3
428,East Village,40.727847,-73.982226,Westville East,40.728428,-73.981894,American Restaurant,4758483af964a520cc4c1fe3,3


Map of each sub-cluster:

In [142]:
map_manhattan_clust_2 = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, lng, label in zip(final_clust_1['Venue Latitude'], final_clust_1['Venue Longitude'], final_clust_1['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan_clust_2)  
    
map_manhattan_clust_2

In [144]:
map_manhattan_clust_3 = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, lng, label in zip(final_clust_2['Venue Latitude'], final_clust_2['Venue Longitude'], final_clust_2['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan_clust_3)  
    
map_manhattan_clust_3

In [145]:
map_manhattan_clust_4 = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, lng, label in zip(final_clust_3['Venue Latitude'], final_clust_3['Venue Longitude'], final_clust_3['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan_clust_4)  
    
map_manhattan_clust_4

Print the neighborhoods in each sub-cluster and the corresponding number of restaurants:

In [149]:
final_clust_1['Neighborhood'].unique()

array(['Chinatown', 'Little Italy', 'Soho', 'Noho'], dtype=object)

In [156]:
print('Chinatown Restaurants: ', final_clust_1[final_clust_1.Neighborhood == 'Chinatown'].shape[0])
print('Little Italy Restaurants: ', final_clust_1[final_clust_1.Neighborhood == 'Little Italy'].shape[0])
print('Soho Restaurants: ', final_clust_1[final_clust_1.Neighborhood == 'Soho'].shape[0])
print('Noho Restaurants: ', final_clust_1[final_clust_1.Neighborhood == 'Noho'].shape[0])

Chinatown Restaurants:  24
Little Italy Restaurants:  30
Soho Restaurants:  15
Noho Restaurants:  1


In [157]:
final_clust_2['Neighborhood'].unique()

array(['East Village', 'Noho'], dtype=object)

In [158]:
print('East Village Restaurants: ', final_clust_2[final_clust_2.Neighborhood == 'East Village'].shape[0])
print('Noho Restaurants: ', final_clust_2[final_clust_2.Neighborhood == 'Noho'].shape[0])

East Village Restaurants:  37
Noho Restaurants:  8


In [151]:
final_clust_3['Neighborhood'].unique()

array(['Greenwich Village', 'Soho'], dtype=object)

In [160]:
print('Greenwich Restaurants: ', final_clust_3[final_clust_3.Neighborhood == 'Greenwich Village'].shape[0])
print('Soho Restaurants: ', final_clust_3[final_clust_3.Neighborhood == 'Soho'].shape[0])

Greenwich Restaurants:  42
Soho Restaurants:  11


### As a second criterion for selecting the office location, we consider each restaurant's rating and the mean rating of each sub-cluster. The ratings are retrieved through a premium FourSquare API call.

In [161]:
rest_venues_1 = []
    
for venue_ID in final_clust_1['Venue ID']:

    url2 = 'https://api.foursquare.com/v2/venues/{}?&client_id={}&client_secret={}&v={}'.format(
                venue_ID,
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION)
    new_rest_data_1 = requests.get(url2).json()['response']
    rest_venues_1.append((
    new_rest_data_1['venue']['id'],
    new_rest_data_1['venue']['rating']))
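
Not every venue is guaranteed to carry a rating, and indexing `['rating']` directly raises a `KeyError` when the field is absent. A minimal sketch of a more tolerant extraction step follows; `extract_rating` is a hypothetical helper, and the two response dictionaries are made up to mirror the structure used above.

```python
def extract_rating(response):
    """Return (venue id, rating); rating is None when the field is missing."""
    venue = response.get("venue", {})
    return venue.get("id"), venue.get("rating")

# Made-up responses shaped like requests.get(url2).json()['response']
with_rating = {"venue": {"id": "4db3374590a0843f295fb69b", "rating": 8.7}}
without_rating = {"venue": {"id": "hypothetical-id"}}

pairs = [extract_rating(r) for r in (with_rating, without_rating)]

# Average only over the venues that actually have a rating
rated = [rating for _, rating in pairs if rating is not None]
mean_rating = sum(rated) / len(rated) if rated else None
print(pairs, mean_rating)
```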
    


In [179]:
df_1 = pd.DataFrame(rest_venues_1)
df_1.columns = ['Venue ID', 'Rating']
df_1.head()

Unnamed: 0,Venue ID,Rating
0,4db3374590a0843f295fb69b,8.7
1,4a96bf8ff964a520ce2620e3,8.5
2,5d992946dbf3ca0008d05211,8.3
3,5894c9a15e56b417cf79e553,8.8
4,57a29225498e96334ebe06d9,8.4


In [182]:
mean_1 = df_1['Rating'].mean()
mean_1

8.60857142857143

In [183]:
rest_venues_2 = []
    
for venue_ID in final_clust_2['Venue ID']:

    url2 = 'https://api.foursquare.com/v2/venues/{}?&client_id={}&client_secret={}&v={}'.format(
                venue_ID,
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION)
    new_rest_data_2 = requests.get(url2).json()['response']
    rest_venues_2.append((
    new_rest_data_2['venue']['id'],
    new_rest_data_2['venue']['rating']))
    


In [184]:
df_2 = pd.DataFrame(rest_venues_2)
df_2.columns = ['Venue ID', 'Rating']
df_2.head()

Unnamed: 0,Venue ID,Rating
0,510c85e7e4b0056826b88297,9.2
1,5c660fa3286fda00399ae820,9.1
2,41044980f964a520750b1fe3,9.1
3,578bec6c498e3150fc369f3b,9.0
4,4758483af964a520cc4c1fe3,8.9


In [185]:
mean_2 = df_2['Rating'].mean()
mean_2

8.637777777777776

In [186]:
rest_venues_3 = []
    
for venue_ID in final_clust_3['Venue ID']:

    url2 = 'https://api.foursquare.com/v2/venues/{}?&client_id={}&client_secret={}&v={}'.format(
                venue_ID,
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION)
    new_rest_data_3 = requests.get(url2).json()['response']
    rest_venues_3.append((
    new_rest_data_3['venue']['id'],
    new_rest_data_3['venue']['rating']))

In [187]:
df_3 = pd.DataFrame(rest_venues_3)
df_3.columns = ['Venue ID', 'Rating']
df_3.head()

Unnamed: 0,Venue ID,Rating
0,504b2a9ee4b006c435a465d3,9.1
1,4d9e8aa89b91a1cdc7c958c0,9.0
2,3fd66200f964a52006e61ee3,9.0
3,555e7399498eccd4b34fe416,9.0
4,5ab53749446ea6289e41b0e6,8.9


In [188]:
mean_3 = df_3['Rating'].mean()
mean_3

8.752830188679246

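The third sub-cluster (Greenwich Village and Soho) has the highest mean rating (8.75, against 8.61 and 8.64 for the other two), so this is the area in which we will place the office.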

To define a precise location, we compute the center of the cluster as the mean latitude and longitude of its restaurants. This keeps the office close to the restaurants in the cluster.

In [190]:
delivery_lat = final_clust_3['Venue Latitude'].mean()
delivery_lon = final_clust_3['Venue Longitude'].mean()
print(delivery_lat)
print(delivery_lon)

40.72741040915054
-74.0015493998475
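
To get a feel for how close such a center is to the venues, the sketch below computes the mean-coordinate center of five points and the haversine (great-circle) distance from it to each of them. The five coordinates are copied from the East Village preview earlier in this notebook and serve only as an illustration; `haversine_km` is a helper written for this example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Five venue coordinates (latitude, longitude), for illustration only
points = [
    (40.726803, -73.983444),
    (40.729295, -73.981521),
    (40.727277, -73.984505),
    (40.727661, -73.983761),
    (40.728428, -73.981894),
]

# Center as the plain mean of the coordinates, as in the cell above
center_lat = sum(p[0] for p in points) / len(points)
center_lon = sum(p[1] for p in points) / len(points)

distances = [haversine_km(center_lat, center_lon, lat, lon) for lat, lon in points]
print(center_lat, center_lon)
print(max(distances), sum(distances) / len(distances))
```

Averaging latitude and longitude directly is a reasonable approximation over an area as small as a few city blocks; over larger areas a projected coordinate system would be a safer choice.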


In [193]:
label = folium.Popup('Soho', parse_html=True)
folium.CircleMarker(
    [delivery_lat, delivery_lon],
    radius=6,
    popup=label,
    color='red',
    fill=True,
    fill_color='#FF0000',
    fill_opacity=0.7,
    parse_html=False).add_to(map_manhattan_clust_4)

map_manhattan_clust_4