# Prediction of Restaurant "Likes" using Foursquare API and Machine Learning for Munich

## 1. Introduction

Munich is one of Germany's best-known cities and offers a wide variety of restaurants. One factor that influences the choice of a restaurant, especially for a first visit, is the image a person has of it. That image is shaped largely by the restaurant's website and by reviews, which reflect other people's impressions.

For a business owner (new or established) planning to open a restaurant in Munich, knowing in advance what social media image the restaurant is likely to achieve would reduce a persistent business problem: uncertainty. Here, the uncertainty concerns how the restaurant will perform on social media.

We can mitigate this uncertainty with data gathered from the Foursquare API. Specifically, we can retrieve the number of "likes" for different restaurants, along with their cuisine category, directly from the API. The question we address is: how accurately can we predict the number of "likes" a new restaurant in this city can expect, based on its type of cuisine?

This data addresses the problem because it lets the business owner make decisions before opening: whether a restaurant in this city can expect a good social media presence, and which type of cuisine is likely to do best. This project analyzes and models the data with machine learning, comparing linear and logistic regression to see which yields better predictive performance after training and testing.

#### Imports

In [126]:
import numpy as np 
import pandas as pd 
import json

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
import requests
# pd.json_normalize is used below, so the deprecated pandas.io.json import is not needed
import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge folium --yes
import folium 
from urllib.request import urlopen
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import pylab as pl

from sklearn import linear_model
from sklearn.metrics import jaccard_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import log_loss
from sklearn.metrics import mean_squared_error, r2_score
import itertools

#### Config

In [127]:
# credentials
CLIENT_ID = '5B1LWXEGFJV52UMEQ0ZGKITJLRFL4DMJHBRRD3BX4WL4E1GB' # your Foursquare ID
CLIENT_SECRET = 'YNUF5JX3RPQ5XLUMOIWQGAOM0OTEUE4XJ0FNUWLDU3T4YWBM' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

# parameters for the venue search
LIMIT = 100 # maximum number of venues returned by the Foursquare API
radius = 1000 # search radius in meters around the coordinates
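
Hard-coding API credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the credentials are stored in environment variables named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (hypothetical names chosen for this example, not defined by Foursquare):

```python
import os


def load_credentials(env=None):
    """Read the Foursquare credentials from a mapping (defaults to os.environ).

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        # fail early with a clear message instead of sending empty credentials
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]
```

With the variables set, `CLIENT_ID, CLIENT_SECRET = load_credentials()` could replace the two literal assignments above.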


## 2. Data

### 2.1 Data Scraping and Cleaning

In this section we first retrieve the geographical coordinates of Munich. We then use the Foursquare API to build request URLs that return the raw data as JSON, and extract the following columns from the responses: "name", "categories", "latitude", "longitude" and "id".

It is important to note that the extract does not cover every restaurant in the city, only the venues within a 1 km radius of the coordinates returned by the geolocator. Moreover, the Foursquare API returns venue data, so the result includes venues other than restaurants, such as concert halls, stores and libraries. The data therefore needs further, partly manual, cleaning to remove the non-restaurant rows. Once this is done, we have a shorter but cleaner list for which to pull "likes" data. Cleaning comes first because pulling the "likes" is the slowest step in this project (one API request per venue), so we want to avoid requesting data for rows that will be dropped anyway.

The "id" column is important because it lets us pull the "likes" from the API. We retrieve the "likes" for each restaurant "id" and append them to the data frame. Once this is complete, we name the data frame 'raw_dataset', as it is the most complete compiled form before any processing for machine learning analysis.

In [128]:
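# address of interest
# NOTE: this string geocodes to Berlin, as the coordinates printed below show;
# use 'Munich, Germany' to query the city this report discusses
address = 'Germany, Berlin'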

geolocator = Nominatim(user_agent="foursquare_agent")

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Munich are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Munich are 52.5170365, 13.3888599.


In [129]:
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
results = requests.get(url).json()
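
The request URL above is assembled with string formatting. As a sketch of an alternative, the same query string can be built with `urllib.parse.urlencode`, which escapes each parameter value. The helper name `build_explore_url` and the dummy credentials `"ID"` and `"SECRET"` are placeholders introduced for this example only:

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude pair
        "radius": radius,                # in meters
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# dummy credentials, coordinates as printed by the geocoding cell
example_url = build_explore_url("ID", "SECRET", "20180605",
                                52.5170365, 13.3888599, 1000, 100)
print(example_url)
```

The comma in the `ll` value is percent-encoded as `%2C`, which is the standard encoding for a query string.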

In [130]:
# function that extracts the category of the venue
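def get_category_type(row):
    # the column is named 'categories' or 'venue.categories' depending on the endpoint
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue without categories gets no label
    if len(categories_list) == 0:
        return None
    else:
        # use the name of the first (primary) category
        return categories_list[0]['name']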

In [131]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng', 'venue.id']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

| | name | categories | lat | lng | id |
|---|---|---|---|---|---|
| 0 | Dussmann das KulturKaufhaus | Bookstore | 52.518312 | 13.388708 | 4adcda8ef964a520a74a21e3 |
| 1 | Dussmann English Bookshop | Bookstore | 52.518223 | 13.389239 | 562a9474498e20b9ac65c6fe |
| 2 | Freundschaft | Wine Bar | 52.518294 | 13.390344 | 5b85b8241fa763002ccd8cc3 |
| 3 | Cookies Cream | Vegetarian / Vegan Restaurant | 52.516569 | 13.388008 | 4adf61aef964a520177a21e3 |
| 4 | COS | Clothing Store | 52.515947 | 13.389172 | 4b4b65d9f964a5201a9a26e3 |


In [132]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


In [133]:
# list of type of venues to remove 
removal_list = ['Exhibit', 'Souvenir Shop', 'Outdoor Sculpture', 'Roof Deck', 'Monument / Landmark', 'Church', 
               'Department Store', 'History Museum', 'Vacation Rental', 'Drugstore', 'Garden', 
               'Indie Theater', 'Scenic Lookout', 'Capitol Building', 
               'Bookstore', 'Clothing Store', 'Opera House', 'Hotel',
               'Cosmetics Shop', 'Plaza', 'Concert Hall', 'Historic Site',
               'Theater', 'Art Gallery', 'Spa', 'Music Venue', 'Furniture / Home Store',
               'Supermarket', 'Piano Bar', 'Art Museum', 'Pharmacy', 'Performing Arts Venue', 'Park', 
               'Museum', 'Comedy Club', 'Gym / Fitness Center']

# .copy() makes the filtered frame independent, so adding columns later is safe
nearby_venues = nearby_venues[~nearby_venues['categories'].isin(removal_list)].copy()

# check that only restaurant data remains
nearby_venues['categories'].unique()

array(['Wine Bar', 'Vegetarian / Vegan Restaurant', 'Gourmet Shop',
       'Cocktail Bar', 'Chocolate Shop', 'Sandwich Place', 'Restaurant',
       'Coffee Shop', 'Italian Restaurant', 'Sushi Restaurant',
       'Hotel Bar', 'German Restaurant', 'Noodle House', 'Steakhouse',
       'Indian Restaurant', 'Bar', 'Asian Restaurant', 'Burrito Place',
       'Café', 'Pizza Place', 'Eastern European Restaurant'], dtype=object)
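
The removal list above has to be extended by hand whenever a new non-food category appears. A rough sketch of the opposite approach keeps only categories whose name contains a food-related keyword. The keyword tuple is an assumption made for this example and would need tuning against the categories actually returned:

```python
# Keywords are an assumption for this sketch, not part of the Foursquare API.
FOOD_KEYWORDS = ("restaurant", "bar", "café", "coffee", "pizza", "sandwich",
                 "steakhouse", "noodle", "burrito", "gourmet", "chocolate")


def is_food_venue(category):
    """Return True if the category name contains any food-related keyword."""
    if category is None:
        return False
    name = category.lower()
    return any(keyword in name for keyword in FOOD_KEYWORDS)


sample = ["Bookstore", "Wine Bar", "Italian Restaurant", "Clothing Store", "Café", None]
kept = [c for c in sample if is_food_venue(c)]
print(kept)  # ['Wine Bar', 'Italian Restaurant', 'Café']
```

Substring matching is crude: 'Piano Bar', which the removal list above drops, would be kept by this filter, so the result still needs a manual check. On the data frame it could be applied as `nearby_venues[nearby_venues['categories'].apply(is_food_venue)]`.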

In [134]:
# pull the likes from the API based on venue ID

url_list = []
like_list = []
json_list = []

# build one likes URL per venue ID
for venue in nearby_venues['id']:
    venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(venue, CLIENT_ID, CLIENT_SECRET, VERSION)
    url_list.append(venue_url)
print("venue url fetching complete")

# request each URL and read the like count from the JSON response
for link in url_list:
    result = requests.get(link).json()
    likes = result['response']['likes']['count']
    like_list.append(likes)
print("venue likes fetching complete")

nearby_venues['likes'] = like_list

venue url fetching complete
venue likes fetching complete
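
The loop above assumes every response contains `response.likes.count`. If a request fails (for example because a rate limit is hit), that lookup raises a `KeyError` and the whole cell stops. A small sketch of a more tolerant lookup follows; the helper name `extract_like_count` and the shape of the failed payload are illustrative assumptions:

```python
def extract_like_count(payload):
    """Return the like count from a venue-likes response, or None if it is absent."""
    try:
        return payload["response"]["likes"]["count"]
    except (KeyError, TypeError):
        # missing keys or an unexpected payload type: report "no data"
        return None


ok_payload = {"response": {"likes": {"count": 35}}}
failed_payload = {"meta": {"code": 429}, "response": {}}  # illustrative shape only

print(extract_like_count(ok_payload))      # 35
print(extract_like_count(failed_payload))  # None
```

Venues that come back as `None` can then be dropped or retried before modelling, rather than aborting the run.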


In [135]:
# check that data has been appropriately scraped and parsed
nearby_venues.head()

| | name | categories | lat | lng | id | likes |
|---|---|---|---|---|---|---|
| 2 | Freundschaft | Wine Bar | 52.518294 | 13.390344 | 5b85b8241fa763002ccd8cc3 | 35 |
| 3 | Cookies Cream | Vegetarian / Vegan Restaurant | 52.516569 | 13.388008 | 4adf61aef964a520177a21e3 | 387 |
| 6 | Lafayette Gourmet | Gourmet Shop | 52.514385 | 13.389569 | 5622323a498efb9560ccd9b2 | 80 |
| 7 | Windhorst | Cocktail Bar | 52.518553 | 13.38627 | 4adcda79f964a520874621e3 | 133 |
| 9 | Ritter Sport Bunte Schokowelt | Chocolate Shop | 52.514906 | 13.390378 | 4b5099f3f964a520f82827e3 | 614 |


### 2.2 Data Preparation

The data needs more processing before it is suitable for model training and testing. Mainly, the "categories" column contains too many distinct cuisine types, relative to the number of venues, for a model to yield meaningful results. However, cuisines fall into natural groupings based on conventionally accepted cultural categories. Broadly speaking, they could be reclassified as European, Latin American, Asian, North American, drinking establishments (bars), or casual establishments such as coffee shops or ice cream parlours. Because there are only a few dozen venues, the code below collapses these further into four groups (European, non-European, bars and casual). Manual classification is feasible because there are not many distinct cuisine types.

Because this project compares linear and logistic regression, it makes sense to represent "likes" both as a continuous variable and as a categorical (but ordinal) variable. For the categorical version, we bin the data by percentiles and assign each venue to one of these ordinal percentile categories.
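
To make the percentile binning concrete, here is a small worked sketch in plain Python. The like counts are made-up values, and the `percentile` helper re-implements linear interpolation between closest ranks (the default behaviour of `numpy.percentile`) so that the example has no dependencies:

```python
import math


def percentile(values, p):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    position = (p / 100) * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def rank_likes(likes, low, high):
    """1 = below the 33rd percentile, 2 = middle band, 3 = above the 66th."""
    if likes < low:
        return 1
    if likes <= high:
        return 2
    return 3


likes = [10, 20, 30, 40, 50, 60, 70]  # made-up like counts
low, high = percentile(likes, 33), percentile(likes, 66)
ranks = [rank_likes(value, low, high) for value in likes]
print(ranks)  # [1, 1, 2, 2, 3, 3, 3]
```

Each venue lands in exactly one of three ordered bins, which is the property the categorical version of "likes" relies on.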

In [136]:
# inspecting the raw dataset shows that there may be too many different types of cuisines
# copy so that later column additions do not modify nearby_venues
raw_dataset = nearby_venues.copy()
raw_dataset['categories'].unique()

array(['Wine Bar', 'Vegetarian / Vegan Restaurant', 'Gourmet Shop',
       'Cocktail Bar', 'Chocolate Shop', 'Sandwich Place', 'Restaurant',
       'Coffee Shop', 'Italian Restaurant', 'Sushi Restaurant',
       'Hotel Bar', 'German Restaurant', 'Noodle House', 'Steakhouse',
       'Indian Restaurant', 'Bar', 'Asian Restaurant', 'Burrito Place',
       'Café', 'Pizza Place', 'Eastern European Restaurant'], dtype=object)

In [137]:
# we can group some cuisines together to make a better categorical variable

euro = ['Italian Restaurant', 'German Restaurant', 'Pizza Place', 'Eastern European Restaurant', 'Vegetarian / Vegan Restaurant']

non_euro = ['Sushi Restaurant', 'Noodle House','Indian Restaurant', 'Asian Restaurant', 'Burrito Place', 'Steakhouse']

bar = ['Wine Bar', 'Cocktail Bar', 'Hotel Bar', 'Bar' ]

casual = ['Gourmet Shop', 'Chocolate Shop', 'Sandwich Place', 'Coffee Shop', 'Café']

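def categorize_restaurants(row):
    """Map a venue's Foursquare category to one of the four groups above."""
    if row['categories'] in euro:
        return 'euro'
    if row['categories'] in non_euro:
        return 'non_euro'
    if row['categories'] in casual:
        return 'casual'
    if row['categories'] in bar:
        return 'bar'
    # categories not listed above (e.g. the generic 'Restaurant') stay unclassified
    return None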



raw_dataset['categories_classified'] = raw_dataset.apply(categorize_restaurants, axis=1)
raw_dataset.head()

| | name | categories | lat | lng | id | likes | categories_classified |
|---|---|---|---|---|---|---|---|
| 2 | Freundschaft | Wine Bar | 52.518294 | 13.390344 | 5b85b8241fa763002ccd8cc3 | 35 | bar |
| 3 | Cookies Cream | Vegetarian / Vegan Restaurant | 52.516569 | 13.388008 | 4adf61aef964a520177a21e3 | 387 | euro |
| 6 | Lafayette Gourmet | Gourmet Shop | 52.514385 | 13.389569 | 5622323a498efb9560ccd9b2 | 80 | casual |
| 7 | Windhorst | Cocktail Bar | 52.518553 | 13.38627 | 4adcda79f964a520874621e3 | 133 | bar |
| 9 | Ritter Sport Bunte Schokowelt | Chocolate Shop | 52.514906 | 13.390378 | 4b5099f3f964a520f82827e3 | 614 | casual |
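
The chain of `if` statements in the cell above can also be expressed as a lookup table. The sketch below reuses the four group lists from that cell; the names `GROUPS`, `CATEGORY_TO_GROUP` and `classify` are introduced here for illustration only:

```python
# Group lists copied from the classification cell above.
GROUPS = {
    "euro": ["Italian Restaurant", "German Restaurant", "Pizza Place",
             "Eastern European Restaurant", "Vegetarian / Vegan Restaurant"],
    "non_euro": ["Sushi Restaurant", "Noodle House", "Indian Restaurant",
                 "Asian Restaurant", "Burrito Place", "Steakhouse"],
    "bar": ["Wine Bar", "Cocktail Bar", "Hotel Bar", "Bar"],
    "casual": ["Gourmet Shop", "Chocolate Shop", "Sandwich Place",
               "Coffee Shop", "Café"],
}

# Invert the mapping: one dictionary lookup per venue instead of four list scans.
CATEGORY_TO_GROUP = {
    category: group
    for group, categories in GROUPS.items()
    for category in categories
}


def classify(category):
    """Return the group for a category, or None if it is not listed."""
    return CATEGORY_TO_GROUP.get(category)


print(classify("Wine Bar"))    # bar
print(classify("Restaurant"))  # None
```

With pandas, the same dictionary could be applied in one step as `raw_dataset['categories'].map(CATEGORY_TO_GROUP)`, which leaves unlisted categories as missing values.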


In [138]:
# check how many are of each category
pd.crosstab(index=raw_dataset["categories_classified"], columns="count")

| categories_classified | count |
|---|---|
| bar | 7 |
| casual | 10 |
| euro | 6 |
| non_euro | 6 |


In [139]:
# classify the likes into different ranking levels
# Determine 3 rankings by binning based on percentiles

thres1 = np.percentile(raw_dataset['likes'], 33)
thres2 = np.percentile(raw_dataset['likes'], 66)


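def apply_rankings(row, thres1, thres2):
    """
    Return the ranking (1, 2 or 3) of a row based on the threshold bins.
    """
    if row['likes'] < thres1:
        return 1
    # middle bin: thres1 <= likes <= thres2 (joining the two bounds with `or`
    # would match every value >= thres1 and never assign ranking 3, which is
    # why the outputs recorded below contain only rankings 1 and 2)
    if row['likes'] <= thres2:
        return 2
    return 3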

raw_dataset['ranking'] = raw_dataset.apply(apply_rankings, axis=1, args = [thres1, thres2])
raw_dataset.head()

| | name | categories | lat | lng | id | likes | categories_classified | ranking |
|---|---|---|---|---|---|---|---|---|
| 2 | Freundschaft | Wine Bar | 52.518294 | 13.390344 | 5b85b8241fa763002ccd8cc3 | 35 | bar | 1 |
| 3 | Cookies Cream | Vegetarian / Vegan Restaurant | 52.516569 | 13.388008 | 4adf61aef964a520177a21e3 | 387 | euro | 2 |
| 6 | Lafayette Gourmet | Gourmet Shop | 52.514385 | 13.389569 | 5622323a498efb9560ccd9b2 | 80 | casual | 2 |
| 7 | Windhorst | Cocktail Bar | 52.518553 | 13.38627 | 4adcda79f964a520874621e3 | 133 | bar | 2 |
| 9 | Ritter Sport Bunte Schokowelt | Chocolate Shop | 52.514906 | 13.390378 | 4b5099f3f964a520f82827e3 | 614 | casual | 2 |


In [141]:
# create dummies for linear regression modelling

# one hot encoding
reg_dataset = pd.get_dummies(raw_dataset['categories_classified'], 
                               prefix="", 
                               prefix_sep="")

# add name, ranking, and likes columns back to dataframe
reg_dataset['ranking'] = raw_dataset['ranking']
reg_dataset['likes'] = raw_dataset['likes']
reg_dataset['name'] = raw_dataset['name']

# move name column to the first column
reg_columns = [reg_dataset.columns[-1]] + list(reg_dataset.columns[:-1])
reg_dataset = reg_dataset[reg_columns]


reg_dataset.head()

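| | name | bar | casual | euro | non_euro | ranking | likes |
|---|---|---|---|---|---|---|---|
| 2 | Freundschaft | 1 | 0 | 0 | 0 | 1 | 35 |
| 3 | Cookies Cream | 0 | 0 | 1 | 0 | 2 | 387 |
| 6 | Lafayette Gourmet | 0 | 1 | 0 | 0 | 2 | 80 |
| 7 | Windhorst | 1 | 0 | 0 | 0 | 2 | 133 |
| 9 | Ritter Sport Bunte Schokowelt | 0 | 1 | 0 | 0 | 2 | 614 |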


## 3. Methodology

This project uses both linear and logistic regression to train and test on the data. Linear regression is used to try to predict the number of "likes" a new restaurant in this region will have. We use the scikit-learn package to run the model.

We can also use logistic regression as a classification method, rather than predicting the number of likes directly. Since the number of "likes" can be binned into categories based on percentiles, it is potentially possible to predict which range of "likes" a new restaurant in this region will fall into.

Since the "likes" are binned into more than two categories, the logistic regression is multinomial. Although the categories are discrete, they are also ordinal in nature. Note, however, that scikit-learn's `LogisticRegression` does not offer an ordinal option: with `multi_class='multinomial'`, as used below, the rankings are treated as unordered class labels, so the fitted model is a multinomial one that ignores their order.
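
For reference, one common way to respect the ordering of the rankings is the proportional-odds (cumulative logit) model, which this notebook does not fit. The sketch below only shows how such a model turns a linear score and ordered cut-points into class probabilities. The score and cut-point values are made up for illustration, not estimated from the data:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ordinal_probabilities(score, cutpoints):
    """Class probabilities under a proportional-odds model.

    P(y <= k) = sigmoid(cutpoint_k - score), with cutpoints in increasing order.
    The probability of each class is the difference of consecutive cumulatives.
    """
    cumulative = [sigmoid(c - score) for c in cutpoints] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# two cut-points separate three ordered rankings (values are illustrative)
probs = ordinal_probabilities(score=0.0, cutpoints=[-1.0, 1.0])
print(probs)
```

A single set of feature weights produces the score, and only the cut-points differ between rankings. Raising the score shifts probability mass towards the higher rankings, which is how the model encodes the order.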

## 4. Result

### 4.1 Linear Regression Results

A linear regression model was trained on a random subsample of roughly 80% of the data and tested on the remaining 20%. To see whether this is a reasonable model, the mean squared error (labelled "Residual sum of squares" in the output) and the R² score (labelled "Variance score") were both calculated. The R² score is negative (-0.20), meaning the model predicts the test set worse than simply predicting the mean number of likes, so this is probably not a good way of modelling the data. Therefore, we move on to logistic regression.

In [158]:
# Multiple Linear Regression

msk = np.random.rand(len(reg_dataset)) < 0.8
train = reg_dataset[msk]
test = reg_dataset[~msk]

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['euro', 'non_euro', 'bar', 'casual']])
y = np.asanyarray(train[['likes']])
regr.fit (x, y)
# The coefficients
print ('Coefficients: ', regr.coef_)

Coefficients:  [[115.74464286  38.44464286 -90.00535714 -64.18392857]]
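
The mask in the cell above is drawn without a seed, so the split changes on every run, and the training share is 80% only on average. A sketch of a reproducible split using the standard library follows; the helper name `split_indices`, the seed value and the row count of 30 are arbitrary choices for this example:

```python
import random


def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Shuffle row positions with a fixed seed and cut them into train and test."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # same seed, same order every run
    cut = int(round(train_fraction * n_rows))
    return positions[:cut], positions[cut:]


train_idx, test_idx = split_indices(30)
print(len(train_idx), len(test_idx))  # 24 6
```

The positions could then be used as `reg_dataset.iloc[train_idx]` and `reg_dataset.iloc[test_idx]`, which gives an exact 80/20 split that is identical across runs.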


In [159]:
# Multiple Linear Regression Prediction Capabilities

y_hat= regr.predict(test[['euro', 'non_euro', 'bar', 'casual']])
x = np.asanyarray(test[['euro', 'non_euro', 'bar', 'casual']])
y = np.asanyarray(test[['likes']])
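# note: np.mean of the squared residuals is the mean squared error,
# not the residual sum of squares that the label suggests
print("Residual sum of squares: %.2f"
      % np.mean((y_hat - y) ** 2))

# regr.score returns R^2: 1 is perfect prediction, negative is worse than predicting the mean
print('Variance score: %.2f' % regr.score(x, y))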

Residual sum of squares: 54331.22
Variance score: -0.20
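
To make the two numbers above easier to interpret, here is a small worked sketch of both metrics in plain Python, on made-up values. It shows in particular how R² becomes negative when predictions are worse than always predicting the mean:

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]  # made-up targets
print(r_squared(y_true, [2.0, 2.0, 2.0]))  # always predicting the mean: 0.0
print(r_squared(y_true, [3.0, 2.0, 1.0]))  # worse than the mean: -3.0
```

A score of -0.20 on the test set therefore means the fitted model has a larger squared error than a constant prediction equal to the test-set mean.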


### 4.2 Logistic Regression Results

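A multinomial logistic regression model (labelled "multinomial ordinal" in the code comments) was trained on a random subsample of roughly 80% of the data and tested on the remaining 20%. To see whether this is a reasonable model, its Jaccard similarity score and log-loss were calculated (66.66% and 1.009 respectively). Because the train/test split is random and unseeded, these figures change from run to run; the cell output recorded below, for instance, shows a log-loss of 0.861. Although this is not a perfect prediction, a Jaccard similarity of 66% between the predicted and actual rankings on the test set is a reasonable result. The classification report is also printed below.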

Given the modest accuracy of this model, we can also fit it on the full dataset. The coefficients show that opening a European restaurant in Munich is positively associated with "likes". The most negative association is between bars and "likes".

In [149]:
# Multinomial Ordinal Logistic Regression

x_train = np.asanyarray(train[['euro', 'non_euro', 'bar', 'casual']])
y_train = np.asanyarray(train['ranking'])

x_test = np.asanyarray(test[['euro', 'non_euro', 'bar', 'casual']])
y_test = np.asanyarray(test['ranking'])

mul_ordinal = linear_model.LogisticRegression(multi_class='multinomial',
                                              solver='newton-cg',
                                              fit_intercept=True).fit(x_train,
                                                                      y_train)

mul_ordinal

coef = mul_ordinal.coef_[0]
print (coef)

[-0.26133803 -0.01589167  0.29712164 -0.01989051]


In [162]:
# Multinomial Ordinal Logistic Regression Prediction Capabilities

yhat = mul_ordinal.predict(x_test)
print(yhat)

yhat_prob = mul_ordinal.predict_proba(x_test)
print(yhat_prob)


#jaccard_score(y_test, yhat)

[2 2 2 2 2 2 2 2 2 2 2]
[[0.14855988 0.85144012]
 [0.24017911 0.75982089]
 [0.14855988 0.85144012]
 [0.34773239 0.65226761]
 [0.24751371 0.75248629]
 [0.14855988 0.85144012]
 [0.24602715 0.75397285]
 [0.14855988 0.85144012]
 [0.24602715 0.75397285]
 [0.24751371 0.75248629]
 [0.14855988 0.85144012]]


In [152]:
log_loss(y_test, yhat_prob)

0.8613887687615468
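
As a sketch of what these two metrics measure, the plain-Python versions below compute the log-loss and the per-class Jaccard score by hand. The helper names are introduced for this example. The label counts (5 venues of ranking 1, 6 of ranking 2, all predicted as 2) are taken from the prediction output and classification report in this notebook, while the probabilities passed to the log-loss helper are made up:

```python
import math


def log_loss_manual(y_true, probabilities, classes):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probabilities):
        total += -math.log(row[classes.index(label)])
    return total / len(y_true)


def jaccard_for_class(y_true, y_pred, label):
    """|true AND predicted| / |true OR predicted| for one class."""
    both = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    either = sum(1 for t, p in zip(y_true, y_pred) if t == label or p == label)
    return both / either if either else 0.0


y_true = [1] * 5 + [2] * 6   # supports from the classification report
y_pred = [2] * 11            # the model predicted ranking 2 for every test venue

print(jaccard_for_class(y_true, y_pred, 2))  # 6 / 11
print(jaccard_for_class(y_true, y_pred, 1))  # 0.0
print(log_loss_manual([1, 2], [[0.9, 0.1], [0.2, 0.8]], classes=[1, 2]))
```

The Jaccard score compares predicted and actual labels on the test set, so a model that always predicts the majority ranking can still score above 50% on that class while never identifying the other one.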

In [155]:
# Exploration of Coefficient Magnitudes of Full Dataset

x_all = np.asanyarray(reg_dataset[['euro', 'non_euro', 'bar', 'casual']])
y_all = np.asanyarray(reg_dataset['ranking'])



LR = linear_model.LogisticRegression(
    multi_class='multinomial',
    solver='newton-cg',
    fit_intercept=True
  ).fit(x_all, y_all)

coef = LR.coef_[0]
print (coef)

[-0.06250197 -0.06250197  0.01884541 -0.19004812]


In [156]:
print (classification_report(y_test, yhat))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         5
           2       0.55      1.00      0.71         6

    accuracy                           0.55        11
   macro avg       0.27      0.50      0.35        11
weighted avg       0.30      0.55      0.39        11





## 5. Discussion

The first thing to note is that, given the data, logistic regression presents a better fit than linear regression. Using logistic regression we obtained a Jaccard similarity score of 51%, which, although not perfect, is more reasonable than the negative variance score obtained from the linear regression. Note that for the purposes of this project we assume that likes are a good proxy for how well a new restaurant will do in terms of brand and image and, by extension, how well it will perform as a business. Whether these assumptions hold in a real-life scenario is open to discussion, and the project is limited in scope by the amount of data that can be fetched from the Foursquare API.



To gain insight from this data, we can break down the results of the logistic regression model. The results showed that the model is better at predicting whether a restaurant will fall into the best or worst percentile of likes. This allows us to roughly predict the potential performance of the business opportunity. Different binning methods for the classes were attempted, but the use of 3 bins yielded the best Jaccard similarity score.


## 6. Conclusion

In conclusion, after analyzing the "likes" of restaurant-type venues in Munich, drawn from the 100 venues returned by the Foursquare API, we have developed a general classification model for the "ranking" of likes a new restaurant is likely to fall into, based on its cuisine category.