## Capstone Project

### Introduction

The goal that I set for this project is to create the classifiers for highly rated venues on Foursquare. The problem is that there are a number of venues with the missing values for the rating although the rating scores could be used as a valuable reference for tourists outside the location to determine their destination. Therefore, the purpose of this project is to create the algorithm to identify the venues with the possibility of being highly rated. In addition, in order to improve the accuracy of the classifier, 4 different classification techniques which are Decision Tree, SVM, Logistic Regression, and KNN have been built and compared since the difference in the models lead to the variation in the predictive accuracy of the model.

### Data
The data used for creating the algorithm is imported from Foursquare via API, resulting in extracting the data from 53 venues located in New York, Tokyo, Paris, London, Hong Kong, Australia, Brazil, and Shang Hai. For the target variable of the classification model, a categorical variable, rating category, with 3 levels which are high, mid, and low, is created based on the existing rating data. Regarding this, the label, high, is assigned to a rating greater or equal to 8.3, the label, mid, is assigned to a rating below 8.3 and above 6.5, and the label, low, is assigned to a rating less or equal to 6.5. This label assignment is based on the values on the 3rd quartile and 1st quartile of the rating variable. For the types of data being used for building the model, the total number of variables being used is 57 which are not related to rating such as price message and listed count.

### Code
Code
The code is based on 3 parts which are 1) data preprocessing, 2) modeling building, and 3) model evaluation

### 1) Data preprocessing

In [19]:
import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize
from sklearn.tree import DecisionTreeClassifier as dtc
from sklearn import preprocessing as pr
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.metrics import jaccard_similarity_score
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from geopy.geocoders import Nominatim
import requests
print('Libraries imported.')

Libraries imported.


For the data preprocessing, the data is first imported via Foursquare API. Regarding the data imported from Foursquare, there are 2 categories of information requested through API which are general venue info by the given cities and specific venue information including rating scores. Due to the limitation on the number of daily calls available for each account, several accounts have been created for importing data and integrated and saved in a format of csv. Therefore, even though the code for importing the Foursquare has been provided here for the transparency of the coding procedure, the actual data used for the analysis has been imported from the local environment.

#### Sample code for data requests on Foursquare API
##### Import venue info
CLIENT_ID = 'your Foursquare ID'
CLIENT_SECRET = 'your Client Secret' 
VERSION = '20190730'

url_NY = "https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll=40.7127281,-74.0060152&v={}".format(CLIENT_ID, CLIENT_SECRET, VERSION) 
results_NY = requests.get(url_NY).json()
results_NY

url_TK = "https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll=35.652832,139.839478&v={}".format(CLIENT_ID, CLIENT_SECRET, VERSION) 
results_TK = requests.get(url_TK).json()
results_TK

url_Paris = "https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll=48.8566101,2.3514992&v={}".format(CLIENT_ID, CLIENT_SECRET, VERSION) 
results_Paris = requests.get(url_Paris).json()
results_Paris

url_Ld = "https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll=51.509865,-0.118092&v={}".format(CLIENT_ID, CLIENT_SECRET, VERSION) 
results_Ld = requests.get(url_Ld).json()
results_Ld

url_HK = "https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll=22.28552,114.15769&v={}".format(CLIENT_ID, CLIENT_SECRET, VERSION) 
results_HK = requests.get(url_HK).json()
results_HK

url_AS = "https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll=-33.865143,151.209900&v={}".format(CLIENT_ID, CLIENT_SECRET, VERSION) 
results_AS = requests.get(url_AS).json()
results_AS

url_BR = "https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll=-23.533773,-46.625290&v={}".format(CLIENT_ID, CLIENT_SECRET, VERSION) 
results_BR = requests.get(url_BR).json()
results_BR

url_SH = "https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll=31.22222,121.45806&v={}".format(CLIENT_ID, CLIENT_SECRET, VERSION) 
results_SH = requests.get(url_SH).json()
results_SH

##### Merge venue info
df_frames = [df_TK, df_NY, df_Paris, df_Ld, df_HK, df_AS, df_BR, df_SH]
df_con = pd.concat(df_frames, axis=0, join = 'inner')
df_con.to_csv('df_con.csv')

filtered_columns = ['name', 'categories'] + [col for col in df_con.columns if col.startswith('location.')] + ['id']
df_con_filtered = df_con.loc[:, filtered_columns]

##### keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in df_con.columns if col.startswith('location.')] + ['id']
df_con_filtered = df_con.loc[:, filtered_columns]

##### function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

##### filter the category for each row
df_con_filtered['categories'] = df_con_filtered.apply(get_category_type, axis=1)

##### clean column names by keeping only last term
df_con_filtered.columns = [column.split('.')[-1] for column in df_con_filtered.columns]
df_con_filtered['categories'].nunique()
df_con_filtered = pd.DataFrame(df_con_filtered)
df_con_filtered.to_csv('df_con_filtered.csv')
list_id = df_con_filtered['id']
list_id = list_id.tolist()

##### Getting specific venue info including
id_list = df_con_filtered['id'].values.tolist()
df_test = []
for i in id_list:
    url_test = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(i, CLIENT_ID, CLIENT_SECRET, VERSION)
    results_test = requests.get(url_test).json()
    venue_test = results_test['response']['venue']
    df_test.append(venue_test)

##### keep the rows with  rating input
df_test_con.drop_duplicates(subset = 'id', keep = False, inplace = True)
df_con_rating = df_test_con.dropna(subset=['rating'])
df_data = df_con_rating[['id','rating', 'ratingSignals', 'likes.count', 'listed.count', 'photos.count', 'price.message', 'price.tier', 'tips.count', 'page.user.tips.count']]
df_data = df_data.merge(df_con_filtered[['id', 'categories', 'cc', 'lat', 'lng']], on = 'id', how = 'inner')
df_data.set_index('id', inplace = True)

##### Replacing missing values with average scores ※replace missing values with means for price.tier &  page.user.tips.count and replace missing values with most frequent value for price.message
missing_data = df_data.isnull()

for i in missing_data.columns.values.tolist():
    print(i)
    print (missing_data[i].value_counts())
    print("")
av_price_tier = df_data['price.tier'].mean(axis = 0)
df_data['price.tier'].replace(np.nan, av_price_tier, inplace = True)
av_page_user_tips_count = df_data['page.user.tips.count'].mean(axis=0)
df_data['page.user.tips.count'].replace(np.nan, av_page_user_tips_count, inplace = True)
df_data['price.message'].replace(np.nan, 'Moderate', inplace = True)

for i in missing_data.columns.values.tolist():
    print(i)
    print (missing_data[i].value_counts())
    print("")

##### One hot encoding
df_data_dummies = pd.get_dummies(df_data)
df_data_dummies.to_csv('df_data_dummies.csv')

###### Import the merged data from the local file

In [8]:
df_data = pd.read_csv('local path') ##local path for the merged data
df_data.set_index('id', inplace = True)
df_data.columns

Index(['rating', 'ratingSignals', 'likes.count', 'listed.count',
       'photos.count', 'price.tier', 'tips.count', 'page.user.tips.count',
       'lat', 'lng', 'price.message_Cheap', 'price.message_Expensive',
       'price.message_Moderate', 'price.message_Very Expensive',
       'categories_Bar', 'categories_Beer Bar', 'categories_Bistro',
       'categories_Café', 'categories_Cantonese Restaurant',
       'categories_Chinese Restaurant', 'categories_Church',
       'categories_City Hall', 'categories_Cocktail Bar',
       'categories_Coffee Shop', 'categories_Cosmetics Shop',
       'categories_Department Store', 'categories_Electronics Store',
       'categories_French Restaurant', 'categories_Fried Chicken Joint',
       'categories_German Restaurant', 'categories_Hotel',
       'categories_Japanese Restaurant', 'categories_Movie Theater',
       'categories_Multiplex', 'categories_Park',
       'categories_Pedestrian Plaza', 'categories_Pizza Place',
       'categories_Plaza', '

##### Defining a target variable & explanatory variables

In [9]:
df_data['rating'].describe() #high = >=8.3, mid = 8.3 > x > 6.5, low = <=6.5 

rating_category = []

for i, r in df_data.iterrows():
    if int(df_data.loc[i, ['rating']]) >= 8.3:
        rating_category.append('high')
    elif int(df_data.loc[i, ['rating']]) < 8.3 and int(df_data.loc[i, ['rating']])> 6.5:
        rating_category.append('mid')
    elif int(df_data.loc[i, ['rating']]) < 6.5:
        rating_category.append('low')
        
df_data['rating_category'] =  rating_category
x = df_data.iloc[:, 2:58]
y = df_data['rating_category']

### 2) Model Building

##### Standardization & Data split

In [12]:
x = preprocessing.StandardScaler().fit(x).transform(x)
x_trainset, x_testset, y_trainset, y_testset = train_test_split(x, y, test_size=0.2, random_state=3)

##### Decision Tree

In [14]:
rating_category_tree = dtc(criterion = 'entropy', max_depth = 3)
rating_category_tree.fit(x_trainset, y_trainset)
tree_pred = rating_category_tree.predict(x_testset)
print(rating_category_tree)
print (tree_pred [0:5])
print (y_testset [0:5])

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
['low' 'mid' 'mid' 'low' 'mid']
id
4bbf498ab083a593a620a3e9    low
4b05caf7f964a5206de322e3    mid
58d800df9435a979b8a645fa    mid
5614b48d498ea3525f672509    low
4b0588def964a520b9dd22e3    mid
Name: rating_category, dtype: object


##### KNN

In [15]:
k = 4
neigh = KNeighborsClassifier(n_neighbors = k).fit(x_trainset,y_trainset)
KNN_pred = neigh.predict(x_testset)
KNN_pred[:5]
print (neigh)
print (KNN_pred[0:5])
print (y_testset[0:5])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='uniform')
['low' 'mid' 'low' 'low' 'low']
id
4bbf498ab083a593a620a3e9    low
4b05caf7f964a5206de322e3    mid
58d800df9435a979b8a645fa    mid
5614b48d498ea3525f672509    low
4b0588def964a520b9dd22e3    mid
Name: rating_category, dtype: object


##### Logistic Regression

In [16]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_trainset,y_trainset)
LR_pred = LR.predict(x_testset)
LR_prob = LR.predict_proba(x_testset)
print(LR)
print (LR_pred [0:5])
print (y_testset[0:5])

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
['high' 'mid' 'high' 'mid' 'high']
id
4bbf498ab083a593a620a3e9    low
4b05caf7f964a5206de322e3    mid
58d800df9435a979b8a645fa    mid
5614b48d498ea3525f672509    low
4b0588def964a520b9dd22e3    mid
Name: rating_category, dtype: object


##### SVM

In [17]:
clf_svc = svm.SVC(kernel='rbf')
clf_svc.fit(x_trainset, y_trainset) 
svc_pred = clf_svc.predict(x_testset)
print(clf_svc)
print (svc_pred [0:5])
print (y_testset[0:5])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
['mid' 'mid' 'mid' 'mid' 'mid']
id
4bbf498ab083a593a620a3e9    low
4b05caf7f964a5206de322e3    mid
58d800df9435a979b8a645fa    mid
5614b48d498ea3525f672509    low
4b0588def964a520b9dd22e3    mid
Name: rating_category, dtype: object


### 3) Model Evaluation

In [18]:
print("DecisionTrees Train set Accuracy: ", metrics.accuracy_score(y_trainset, rating_category_tree.predict(x_trainset)))
print("DecisionTrees Test set Accuracy: ", metrics.accuracy_score(y_testset, tree_pred))
print("DecisionTrees jaccard_similarity_score: ", jaccard_similarity_score(y_testset, tree_pred))
print("DecisionTrees f1: ", f1_score(y_testset, tree_pred, average='weighted'))

print("KNN Train set Accuracy: ", metrics.accuracy_score(y_trainset, neigh.predict(x_trainset)))
print("KNN Test set Accuracy: ", metrics.accuracy_score(y_testset, KNN_pred))
print('KNN jaccard_similarity_score:', jaccard_similarity_score(y_testset, KNN_pred))
print('KNN f1:', f1_score(y_testset, KNN_pred, average='weighted'))

print('Logistic Regression Train set Accuracy:', metrics.accuracy_score(y_trainset, LR.predict(x_trainset)))
print('Logistic Regression Test set Accuracy:', metrics.accuracy_score(y_testset, LR_pred))
print('Logistic Regression jaccard_similarity_score: ', jaccard_similarity_score(y_testset, LR_pred))

print("svc Train set Accuracy: ", metrics.accuracy_score(y_trainset, clf_svc.predict(x_trainset)))
print("svc Test set Accuracy: ", metrics.accuracy_score(y_testset, svc_pred))
print('svc jaccard_similarity_score:', jaccard_similarity_score(y_testset, svc_pred))
print('svc f1:', f1_score(y_testset, svc_pred, average='weighted'))

DecisionTrees Train set Accuracy:  0.9523809523809523
DecisionTrees Test set Accuracy:  0.7272727272727273
DecisionTrees jaccard_similarity_score:  0.7272727272727273
DecisionTrees f1:  0.7132867132867133
KNN Train set Accuracy:  0.6428571428571429
KNN Test set Accuracy:  0.6363636363636364
KNN jaccard_similarity_score: 0.6363636363636364
KNN f1: 0.5606060606060606
Logistic Regression Train set Accuracy: 0.8809523809523809
Logistic Regression Test set Accuracy: 0.36363636363636365
Logistic Regression jaccard_similarity_score:  0.36363636363636365
svc Train set Accuracy:  0.8571428571428571
svc Test set Accuracy:  0.45454545454545453
svc jaccard_similarity_score: 0.45454545454545453
svc f1: 0.2840909090909091


  'precision', 'predicted', average, warn_for)


### Results

The metrics being used for comparing 4 different classification models are as follows; accuracy score,  Jaccard similarity score, and F1 score. The complete list of metric scores on each model is described on Fig 1: Metric Scores. The result clearly indicates that Decision Tree has the highest score on all the metrics.

### Discussion

The result indicates the model based on Decision Tree is the most powerful as a classification model for highly rated venues. On the other hand, the result has 2 major limitations. Firstly, the size of data is relatively small for building a machine learning model. Secondly,  there might be other types of data should be considered regarding the model building process such as the data on users since the rating scores could be affected by the preference of highly active users on Foursquare.

### Conclusion

Although there could be potential limitations on the result of this project, the result shows that among 4 models, Decision Tree is the most powerful classification model for identifying possibly highly rated venues.