# Recommendation system for restaurants
### Based on the [Yelp Dataset](https://www.kaggle.com/yelp-dataset/yelp-dataset).  

## 0. Libraries
First of all, we import all the libraries we need.

In [None]:
from matplotlib.ticker import PercentFormatter as _PercentFormatter
import matplotlib.pyplot as _plt
import numpy as _np
import pandas as _pd
import joblib as _jl
import glob as _glob
import os as _os
import re as _re
import time as _time
from multiprocessing import Pool as _Pool
from sklearn.preprocessing import OrdinalEncoder as _OrdinalEncoder
from sklearn.metrics import confusion_matrix as _confusion_matrix, roc_curve as _roc_curve, classification_report as _classification_report, accuracy_score as _accuracy_score
from sklearn.model_selection import GridSearchCV as _GridSearchCV
from sklearn.svm import LinearSVC as _LinearSVC
from sklearn.metrics.pairwise import cosine_similarity as _cosine_similarity
from scipy.sparse import csr_matrix as _csr_matrix

_pd.set_option('display.max_columns', None)

Since we are going to work with large datasets, and we'll need to load them several
times, we define a convenience function that deletes all user-defined variables
(those whose names do not start with an underscore), in order to free some memory.

In [None]:
def _del_all():
    %reset_selective -f [^_]

## 1. Data cleaning
### Based on [Ashish Gandhe's kernel](https://www.kaggle.com/wenqihou828/recommendation-for-yelp-users-itself).

We run the code in ```recommendation_system_preprocessing.ipynb``` to
clean the data and reduce the size of the dataset, storing pickles instead of JSON and dropping unnecessary columns.

We explore the resulting datasets: 

In [None]:
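# "[!checked]" is a character class (it skips every file whose name starts with c, h, e, k or d),
# so we filter out the "checked*" pickles explicitly instead
dataset_list = [p for p in _glob.glob("../dataset/*.pickle") if not _os.path.basename(p).startswith("checked")]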
for d in dataset_list:
    dataset = _pd.read_pickle(d)
    
    f = _os.path.splitext(_os.path.basename(d))[0]
    c = ", ".join(list(dataset.columns))
    s = dataset.shape
    
    print("Dataset '" + f + "':")
    print("\tfeatures:", c)
    print("\tshape:", s)
    print()

In [None]:
_del_all()

## 2. Fake Review Detection
### Based on Zhiwei Zhang's [work](https://medium.com/@zhiwei_zhang/final-blog-642fb9c7e781) and [code](https://github.com/zzhang83/Yelp_Sentiment_Analysis).

Then, to filter out deceptive reviews that could skew the results
of our analysis, we load the Support Vector Machine model
defined in ```Yelp_sentiment_analysis/Scripts/fake_reviews.ipynb```
by [Zhiwei Zhang](https://medium.com/@zhiwei_zhang/final-blog-642fb9c7e781),
which has the best scores for accuracy, precision, recall and F1.
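
As a rough illustration of what such a model looks like, the sketch below trains a TF-IDF vectorizer and a `LinearSVC` on a handful of made-up reviews. The toy texts, the `toy_*` names and the label convention (`1` for genuine, `-1` for fake) are assumptions made for this example; the model we actually load comes from the referenced notebook and may be configured differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training data: 1 = genuine review, -1 = fake review (assumed convention)
toy_texts = [
    "great pasta and friendly staff",
    "the soup was cold and the service slow",
    "best place ever best place ever buy now",
    "amazing amazing amazing click here",
]
toy_labels = [1, 1, -1, -1]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)   # sparse matrix, one row per review

toy_svc = LinearSVC()
toy_svc.fit(toy_X, toy_labels)

# New reviews must go through the same fitted vectorizer before prediction
toy_new = toy_vectorizer.transform(["friendly staff and great soup"])
toy_pred = toy_svc.predict(toy_new)   # array holding a single label, either 1 or -1
```

The important point is that the vectorizer and the classifier are a pair: texts have to be transformed by the vectorizer that was fitted together with the classifier.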

In [None]:
vectorizer = _jl.load('../models/tfidf_vectorizer.joblib')
svc = _jl.load('../models/fake_review_svc_model.joblib')

Now, we can apply this model to our data.

In [None]:
review = _pd.read_pickle("../dataset/all_review.pickle")

review.head()

In [None]:
texts = list(review["text"])
X = vectorizer.transform(texts)
predictions = svc.predict(X)

In [None]:
print(type(predictions))
print("SVC predictions:", predictions)

Now we repeat the whole process with a different model, a calibrated SVC, which gives us
real-valued probabilities instead of a binary label.
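
One common way to get such scores from an SVM is to wrap it in scikit-learn's `CalibratedClassifierCV`, which adds a `predict_proba` method. The sketch below shows the idea on made-up data; whether the model loaded here was built exactly this way is an assumption, as are the toy features and the `-1`/`1` label convention.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Made-up, linearly separable data: 4 samples per class
toy_X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
                  [1.0, 0.9], [0.8, 1.0], [0.9, 0.7], [0.7, 0.8]])
toy_y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# cv=2 keeps two samples of each class in every fold
toy_cal_svc = CalibratedClassifierCV(LinearSVC(), cv=2)
toy_cal_svc.fit(toy_X, toy_y)

toy_proba = toy_cal_svc.predict_proba(toy_X)   # shape (8, 2), columns follow classes_
toy_score = toy_proba[:, 1]                    # probability of class 1
```

The columns of `predict_proba` follow the order of `classes_`, so with labels `-1` and `1` the second column is the probability of the positive class.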

In [None]:
cal_svc = _jl.load('../models/fake_review_cal_svc_model.joblib')
cal_predictions = cal_svc.predict_proba(X)


In [None]:
print("Calibrated SVC predictions:\n", cal_predictions)
cal_predictions = cal_predictions[:, 1]   # keep only the probability of class '1'
print("Calibrated SVC predictions for class '1':\n", cal_predictions)

In [None]:
print("columns before:\n", review.columns)
checked_review = review.assign(bin_truth_score=predictions, real_truth_score=cal_predictions)
print("columns after:\n", checked_review.columns)

Let's see what we just obtained.

In [None]:
checked_review[['review_id', 'text', 'bin_truth_score', 'real_truth_score']].head()

In [None]:
data = checked_review['bin_truth_score']
_plt.hist(data, weights=_np.ones(len(data)) / len(data))
_plt.title("SVC labels distribution")
_plt.gca().yaxis.set_major_formatter(_PercentFormatter(1))
_plt.show()

In [None]:
data = checked_review['real_truth_score']
_plt.hist(data, weights=_np.ones(len(data)) / len(data))
_plt.title("Calibrated SVC labels distribution")
_plt.gca().yaxis.set_major_formatter(_PercentFormatter(1))
_plt.show()

Finally, we can save the new dataset without the ```text``` column,
in order to save space and computation time.  

In [None]:
checked_review.drop(columns=['text'], inplace=True)
checked_review.to_pickle('../dataset/checked_review.pickle')

Check that everything has worked properly. 

In [None]:
final_review = _pd.read_pickle('../dataset/checked_review.pickle')
print(final_review.columns)
final_review.head()

In [None]:
_del_all()

## 3. Historical features

Following [this paper](https://www.semanticscholar.org/paper/Restaurant-Recommendation-System-Gandhe/093cecc3e53f2ba4c0c466ad3d8294ba64962050),
we add some historical features to our dataset:
1. user-level features:
    <br>1.1. average of the ratings given by a certain user,
    <br>1.2. number of reviews written by a certain user,
2. business-level features:
    <br>2.1. average of the ratings given to a certain restaurant,
    <br>2.2. number of reviews written about a certain restaurant,
3. user-business features:
    <br>3.1. average rating given by a certain user to each category,
    <br>3.2. average of the ratings given by a certain user to the categories of a certain restaurant.
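
To make the user-level features concrete, here is a small sketch on made-up reviews. It computes each feature in three variants: plain, `_bin` (only reviews labelled as genuine) and `_real` (reviews weighted by their truth score). The toy data and the `toy_*` names are assumptions for illustration, and the vectorized form is only one possible way to write it.

```python
import pandas as pd

toy_hist = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "stars": [5, 1, 4, 2],
    "bin_truth_score": [1, -1, 1, 1],          # assumed: -1 marks a fake review
    "real_truth_score": [0.9, 0.1, 0.5, 0.5],  # assumed: probability of being genuine
})

genuine = toy_hist["bin_truth_score"] > 0
weighted_stars = toy_hist["stars"] * toy_hist["real_truth_score"]
weight_sum = toy_hist.groupby("user_id")["real_truth_score"].sum()

toy_features = pd.DataFrame({
    "num_reviews": toy_hist.groupby("user_id")["stars"].size(),
    "average_stars": toy_hist.groupby("user_id")["stars"].mean(),
    "num_reviews_bin": toy_hist[genuine].groupby("user_id")["stars"].size(),
    "average_stars_bin": toy_hist[genuine].groupby("user_id")["stars"].mean(),
    "num_reviews_real": weight_sum,
    # weighted mean = sum(stars * weight) / sum(weight)
    "average_stars_real": weighted_stars.groupby(toy_hist["user_id"]).sum() / weight_sum,
})
```

For user `u1`, the review flagged as fake drops out of the `_bin` features and counts for little in the `_real` ones. The business-level features are the same computation grouped by `business_id`.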

Before computing the new features, we have to split the dataset into three parts:
1. <i>Test set</i>, covering the last `M` months of the dataset, up to and including the last day considered;
2. <i>Training set</i>, covering the `N` months immediately before the test set;
3. <i>History</i>, the remaining part of the dataset, used to compute historical features.

For the moment, we pick `M=2` and `N=9`, so the test set goes from 10/1/2018 to 11/30/2018,
the training set goes from 1/1/2018 to 9/30/2018, and the history contains the remaining data,
from 10/12/2004 to 12/31/2017.
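
The cut-off dates can also be derived from `M` and `N` instead of being written by hand. The helper below is a hypothetical sketch (the name `split_by_date` and its parameters are not part of this notebook) that assumes the date column holds pandas timestamps.

```python
import pandas as pd

def split_by_date(df, date_col, m_test, n_train):
    """Return (history, train, test) using month offsets from the last date."""
    last_day = df[date_col].max().normalize()
    # the test set covers the m_test months ending on the last day
    test_start = last_day + pd.Timedelta(days=1) - pd.DateOffset(months=m_test)
    # the training set covers the n_train months right before the test set
    train_start = test_start - pd.DateOffset(months=n_train)

    test = df[df[date_col] >= test_start]
    train = df[(df[date_col] >= train_start) & (df[date_col] < test_start)]
    hist = df[df[date_col] < train_start]
    return hist, train, test

toy = pd.DataFrame({"date": pd.to_datetime(
    ["2017-06-15", "2018-01-01", "2018-09-30", "2018-10-01", "2018-11-30"])})
toy_hist, toy_train, toy_test = split_by_date(toy, "date", m_test=2, n_train=9)
```

With a last day of 11/30/2018, `m_test=2` and `n_train=9`, the helper yields a test set starting on 10/1/2018 and a training set starting on 1/1/2018.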

In [None]:
review_all = _pd.read_pickle("../dataset/checked_review.pickle")
# cut-off dates follow the split described above (test set from 10/1/2018), as for the tips
review_test = review_all[review_all['date']>=_np.datetime64('2018-10-01')]
review_train = review_all[(review_all['date']>=_np.datetime64('2018-01-01')) & (review_all['date']<_np.datetime64('2018-10-01'))]
# the history is needed later: review_hist.pickle is reloaded in sections 3.1, 3.2 and 3.3
review_hist = review_all[review_all['date']<_np.datetime64('2018-01-01')]

review_test.to_pickle('../dataset/m2_n9/review_test.pickle')
review_train.to_pickle('../dataset/m2_n9/review_train.pickle')
review_hist.to_pickle('../dataset/m2_n9/review_hist.pickle')

In [None]:
tips_all = _pd.read_pickle("../dataset/all_tips.pickle")
tips_test = tips_all[tips_all['tips_date']>=_np.datetime64('2018-10-01')]
tips_train = tips_all[(tips_all['tips_date']>=_np.datetime64('2018-01-01')) & (tips_all['tips_date']<_np.datetime64('2018-10-01'))]
tips_hist = tips_all[tips_all['tips_date']<_np.datetime64('2018-01-01')]

tips_test.to_pickle('../dataset/m2_n9/tips_test.pickle')
tips_train.to_pickle('../dataset/m2_n9/tips_train.pickle')
tips_hist.to_pickle('../dataset/m2_n9/tips_hist.pickle')

In [None]:
_del_all()

### 3.1. User-level features

In [None]:
review_hist = _pd.read_pickle('../dataset/m2_n9/review_hist.pickle')
users = _pd.read_pickle("../dataset/all_users.pickle")

In [None]:
avg_stars = review_hist['stars'].mean()

users = users.assign(average_stars=avg_stars)
users = users.assign(num_reviews=0)
users = users.assign(average_stars_bin=avg_stars)
users = users.assign(num_reviews_bin=0)
users = users.assign(average_stars_real=avg_stars)
users = users.assign(num_reviews_real=0)
users = users.set_index('user_id')
users.head()

In [None]:
def _f(grouped):
    d = {}
    
    d['num'] = grouped['stars'].size
    d['stars'] = grouped['stars'].mean()
    
    non_fake = _np.ma.masked_where(grouped['bin_truth_score']<0, grouped['stars']).compressed()
    d['num_bin'] = non_fake.size
    d['stars_bin'] = non_fake.mean()
    
    d['num_real'] = grouped['real_truth_score'].sum()
    d['stars_real'] = _np.average(grouped['stars'], weights=grouped['real_truth_score'])
    
    return _pd.Series(d, index=['num', 'stars', 'num_bin', 'stars_bin', 'num_real', 'stars_real'])

In [None]:
grouped_reviews = review_hist.groupby('user_id').apply(_f)
grouped_reviews.head()

In [None]:
import random
import statistics

current_milli_time = lambda: int(round(_time.time() * 1000))

def get_time(df):
    us_id = random.choice(grouped_reviews.index)
    x = random.randrange(1000)
    t = current_milli_time()
    df.loc[us_id, ["test"]] = x
    t0 = current_milli_time()
    return t0-t

def get_time_mul(df):
    us_id = random.choice(grouped_reviews.index)
    x = random.randrange(1000)
    y = random.randrange(1000)
    z = random.randrange(1000)
    t = current_milli_time()
    df.loc[us_id, ["test", "ciao", "prova"]] = [x, y, z]
    t0 = current_milli_time()
    return t0-t

def test():
    df = users.copy()
    df['test'] = -1
    times = []
    for i in range(1000):
         times += [get_time(df)]
    avg_time = statistics.mean(times)
    del df
    return avg_time

def test_mul():
    df = users.copy()
    df['test'] = -1
    df['ciao'] = -1
    df['prova'] = -1
    times = []
    for i in range(1000):
         times += [get_time_mul(df)]   # time the multi-column assignment
    avg_time = statistics.mean(times)
    del df
    return avg_time

def tot_time(ops, x, k):
    time_millis = ops * k * x
    hours = time_millis/1000/60/60
    return hours

tot = len(grouped_reviews)
x = test()
print("hours:", tot_time(tot, x, 6))
x = test_mul()
print("hours mul:", tot_time(tot, x, 1))

In [None]:
count = 1
tot = len(grouped_reviews)
print("tot:", tot)

for index, row in grouped_reviews.iterrows():
    uid = index
    num = row['num']
    stars = row['stars']
    num_bin = row['num_bin']
    stars_bin = row['stars_bin']
    num_real = row['num_real']
    stars_real = row['stars_real']
    
    cols = ["num_reviews", "average_stars", "num_reviews_bin",
            "average_stars_bin", "num_reviews_real", "average_stars_real"]
    vals = [num, stars, num_bin, stars_bin, num_real, stars_real]
    users.loc[uid, cols] = vals
    
    count += 1
    if count % 1000 == 0:
        percent = (count/tot)*100
        print("row {}/{} - {}%".format(count, tot, percent))

In [None]:
users = users.reset_index()
users.to_pickle('../dataset/m2_n9/users.pickle')
_del_all()

### 3.2. Business-level features

In [None]:
restaurants = _pd.read_pickle("../dataset/restaurants.pickle")
review_hist = _pd.read_pickle('../dataset/m2_n9/review_hist.pickle')
avg_stars = review_hist['stars'].mean()

In [None]:
restaurants = restaurants.assign(average_stars=avg_stars)
restaurants = restaurants.assign(num_reviews=0)
restaurants = restaurants.assign(average_stars_bin=avg_stars)
restaurants = restaurants.assign(num_reviews_bin=0)
restaurants = restaurants.assign(average_stars_real=avg_stars)
restaurants = restaurants.assign(num_reviews_real=0)
restaurants = restaurants.set_index('business_id')
restaurants.head()

In [None]:
grouped_reviews = review_hist.groupby('business_id').apply(_f)
grouped_reviews.head()

In [None]:
count = 1
tot = len(grouped_reviews)
print("tot:", tot)

for index, row in grouped_reviews.iterrows():
    uid = index
    num = row['num']
    stars = row['stars']
    num_bin = row['num_bin']
    stars_bin = row['stars_bin']
    num_real = row['num_real']
    stars_real = row['stars_real']
    
    cols = ["num_reviews", "average_stars", "num_reviews_bin",
            "average_stars_bin", "num_reviews_real", "average_stars_real"]
    vals = [num, stars, num_bin, stars_bin, num_real, stars_real]
    restaurants.loc[uid, cols] = vals
    
    count += 1
    if count % 1000 == 0:
        percent = (count/tot)*100
        print("row {}/{} - {}%".format(count, tot, percent))

In [None]:
restaurants = restaurants.reset_index()
restaurants.to_pickle('../dataset/m2_n9/restaurants.pickle')
_del_all()


### 3.3. User-Business level features

#### 3.3.1. Average rating given by a certain user to each category
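
As a small illustration of this feature, the sketch below computes per-user, per-cuisine average ratings on made-up data by splitting the comma-separated `cuisine` string and pivoting. This is an alternative formulation, not the code used in this section, which matches cuisines with `str.contains`; the toy data and the `toy_*` names are assumptions.

```python
import pandas as pd

toy_reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "cuisine": ["Italian, American", "Italian", "Thai"],
    "stars_review": [4, 2, 5],
})

# one row per (review, cuisine) pair
toy_exploded = toy_reviews.assign(
    cuisine=toy_reviews["cuisine"].str.split(", ")
).explode("cuisine")

# rows = users, columns = cuisines, values = mean rating (NaN when there is no history)
toy_per_cuisine = toy_exploded.pivot_table(
    index="user_id", columns="cuisine", values="stars_review", aggfunc="mean"
)
```

A review of a restaurant with several cuisines contributes to each of them, and a user with no reviews for a cuisine gets `NaN` there.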

In [None]:
restaurants = _pd.read_pickle('../dataset/m2_n9/restaurants.pickle')
restaurants.head()

In [None]:
restaurants.columns

In [None]:
review_hist = _pd.read_pickle('../dataset/m2_n9/review_hist.pickle')
review_hist.head()

In [None]:
joined_reviews = review_hist.join(restaurants.set_index('business_id'), on = 'business_id', lsuffix='_review', rsuffix='_rest')
joined_reviews.head()

In [None]:
categories = ', '.join(list(restaurants['categories'].unique()))
categories = categories.split(', ')
print(len(categories))

cat = []
for h in categories:
    if h not in cat:
        cat.append(h)
        
print(len(cat))

cuisines = ', '.join(list(restaurants['cuisine'].unique()))
cuisines = cuisines.split(', ')
print(len(cuisines))

_cuisines_unique = []
for cuisine in cuisines:
    if cuisine not in _cuisines_unique:
        _cuisines_unique.append(cuisine)
        
print("Number of cuisines: {0}".format(len(_cuisines_unique)))
print(_cuisines_unique)

In [None]:
joined_reviews.to_pickle('../dataset/m2_n9/join_restaurants_reviewhist.pickle')

In [None]:
_del_all()

Checkpoint

In [None]:
joined_reviews =  _pd.read_pickle('../dataset/m2_n9/join_restaurants_reviewhist.pickle')
joined_reviews.head()

In [None]:
# joined_reviews = joined_reviews.reset_index()
joined_reviews = joined_reviews[['review_id', 'user_id', 'business_id', 'bin_truth_score', 'real_truth_score', 'cuisine', 'stars_review']]
joined_reviews.head()

In [None]:
#cuisines_unique = ['Chinese', 'Japanese', 'Mexican', 'Italian', 'Others', 'American', 'Korean', 'Mediterranean', 'Thai', 'Asian Fusion']

In [None]:
def each_cuisine_ratings(grouped):
    d = {}
    index = []
    for cuisine in _cuisines_unique:
        cuisine_av = cuisine + "_av"
        cuisine_records = _np.ma.masked_where(~grouped['cuisine'].str.contains(cuisine), grouped['stars_review']).compressed()
        d[cuisine_av] = cuisine_records.mean()
        index.append(cuisine_av)
    # print("cuisine_av done")
        
    for cuisine in _cuisines_unique:
        cuisine_av_bin = cuisine + "_av_bin"
        #non_fake = _np.ma.masked_where(grouped['bin_truth_score'] < 0, grouped).compressed()
        non_fake = grouped[grouped['bin_truth_score'] > 0]
        cuisine_records = _np.ma.masked_where(~non_fake['cuisine'].str.contains(cuisine), non_fake['stars_review']).compressed()
        d[cuisine_av_bin] = cuisine_records.mean()
        index.append(cuisine_av_bin)
    # print("cuisine_av_bin done")
    
    for cuisine in _cuisines_unique:
        cuisine_av_real = cuisine + "_av_real"
        cuisine_records = _np.ma.masked_where(~grouped['cuisine'].str.contains(cuisine), grouped['stars_review']).compressed()
        cuisine_truth_score = _np.ma.masked_where(~grouped['cuisine'].str.contains(cuisine), grouped['real_truth_score']).compressed()
        d[cuisine_av_real] = _np.ma.average(cuisine_records, weights = cuisine_truth_score)
        index.append(cuisine_av_real)
    # print("cuisine_av_real done")
    
    return _pd.Series(d, index = index)
    

In [None]:
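grouped_reviews = joined_reviews.groupby('user_id').apply(each_cuisine_ratings)
# saved here because it is reloaded from this path after the second checkpoint
grouped_reviews.to_pickle('../dataset/m2_n9/grouped_reviews.pickle')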

In [None]:
grouped_reviews.head()

Checkpoint 2

In [None]:
users = _pd.read_pickle('../dataset/m2_n9/users.pickle')
users.head()

In [None]:
users = users.assign(av_rat_chinese_cuisine = _np.nan, av_rat_japanese_cuisine = _np.nan, av_rat_mexican_cuisine = _np.nan, 
                     av_rat_italian_cuisine = _np.nan, av_rat_others_cuisine = _np.nan, av_rat_american_cuisine = _np.nan, 
                     av_rat_korean_cuisine = _np.nan, av_rat_mediterranean_cuisine = _np.nan, av_rat_thai_cuisine = _np.nan, 
                     av_rat_asianfusion_cuisine = _np.nan)

users = users.assign(av_rat_chinese_cuisine_bin = _np.nan, av_rat_japanese_cuisine_bin = _np.nan, av_rat_mexican_cuisine_bin = _np.nan, 
                     av_rat_italian_cuisine_bin = _np.nan, av_rat_others_cuisine_bin = _np.nan, av_rat_american_cuisine_bin = _np.nan, 
                     av_rat_korean_cuisine_bin = _np.nan, av_rat_mediterranean_cuisine_bin = _np.nan, av_rat_thai_cuisine_bin = _np.nan, 
                     av_rat_asianfusion_cuisine_bin = _np.nan)

users = users.assign(av_rat_chinese_cuisine_real = _np.nan, av_rat_japanese_cuisine_real = _np.nan, av_rat_mexican_cuisine_real = _np.nan, 
                     av_rat_italian_cuisine_real = _np.nan, av_rat_others_cuisine_real = _np.nan, av_rat_american_cuisine_real = _np.nan, 
                     av_rat_korean_cuisine_real = _np.nan, av_rat_mediterranean_cuisine_real = _np.nan, av_rat_thai_cuisine_real = _np.nan, 
                     av_rat_asianfusion_cuisine_real = _np.nan)

users = users.set_index('user_id')

In [None]:
users.head()

In [None]:
grouped_reviews = _pd.read_pickle('../dataset/m2_n9/grouped_reviews.pickle')
grouped_reviews.head()

In [None]:
# split the users and grouped_reviews datasets into n_cores parts, where n_cores is the number of available processors
n_cores = _os.cpu_count()

df_out = _np.array_split(users, n_cores)   # list of output dataframes (chunks of users)

df_out_names = []   # list of paths where the output dataframes will be saved
df_in = []          # list of input dataframes (matching chunks of grouped_reviews)
for i, df in enumerate(df_out):
    name = "../dataset/m2_n9/tmp/df_out_" + str(i) + ".pickle"
    df_out_names += [name]
    
    df_tmp = grouped_reviews.loc[df.index]
    df_in += [df_tmp]

In [None]:
from multiproc_utils import user_business_features

if __name__ ==  '__main__':
    with _Pool(processes=n_cores) as p:
        p.map(user_business_features, zip(df_in, df_out, df_out_names))

In [None]:
users_chunks = []

# add chunks produced by subprocesses
for name in df_out_names:
    df_out_i = _pd.read_pickle(name)
    users_chunks += [df_out_i]
    _os.remove(name)

users = _pd.concat(users_chunks)
users.head()

In [None]:
users = users.reset_index()
users.to_pickle('../dataset/m2_n9/users_2.pickle')

In [None]:
users.shape

In [None]:
users_pre = _pd.read_pickle("../dataset/m2_n9/users.pickle")
users_pre.shape

In [None]:
len(grouped_reviews)

In [None]:
print("expected diff:", users.shape[0]-len(grouped_reviews))

In [None]:
users_tmp = users[['av_rat_chinese_cuisine', 'av_rat_japanese_cuisine', 'av_rat_mexican_cuisine', 'av_rat_italian_cuisine', 
            'av_rat_others_cuisine', 'av_rat_american_cuisine', 'av_rat_korean_cuisine', 'av_rat_mediterranean_cuisine',
            'av_rat_thai_cuisine', 'av_rat_asianfusion_cuisine',
           
           'av_rat_chinese_cuisine_bin', 'av_rat_japanese_cuisine_bin', 'av_rat_mexican_cuisine_bin', 
           'av_rat_italian_cuisine_bin', 'av_rat_others_cuisine_bin', 'av_rat_american_cuisine_bin', 
           'av_rat_korean_cuisine_bin', 'av_rat_mediterranean_cuisine_bin', 'av_rat_thai_cuisine_bin', 
           'av_rat_asianfusion_cuisine_bin',
           
           'av_rat_chinese_cuisine_real', 'av_rat_japanese_cuisine_real', 'av_rat_mexican_cuisine_real', 
           'av_rat_italian_cuisine_real', 'av_rat_others_cuisine_real', 'av_rat_american_cuisine_real', 
           'av_rat_korean_cuisine_real', 'av_rat_mediterranean_cuisine_real', 'av_rat_thai_cuisine_real', 
           'av_rat_asianfusion_cuisine_real']]

# number of users for which every cuisine feature is missing
count_na = int(users_tmp.isna().all(axis=1).sum())

print("actual diff:", count_na)

In [None]:
_del_all()

#### 3.3.2. Average of the ratings given by a certain user to the categories of a certain restaurant.
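
The idea of this feature, for a single review, is to look up the user's historical average for each cuisine of the reviewed restaurant and average those values. The sketch below shows this on made-up rows. The helper name `cuisine_history_average`, its `suffix` parameter and the choice to skip cuisines with no history are assumptions for illustration; the function used in this section averages all values, so a missing one yields `NaN`.

```python
import numpy as np
import pandas as pd

def cuisine_history_average(row, suffix=""):
    """Mean of the user's per-cuisine averages over the restaurant's cuisines."""
    values = []
    for cuisine in str(row["cuisine"]).split(", "):
        key = cuisine.lower().replace(" ", "")
        values.append(row["av_rat_{0}_cuisine{1}".format(key, suffix)])
    values = np.array(values, dtype=float)
    known = values[~np.isnan(values)]   # assumption: skip cuisines with no history
    return known.mean() if known.size else np.nan

toy_rows = pd.DataFrame({
    "cuisine": ["Italian, Asian Fusion", "Italian, Asian Fusion"],
    "av_rat_italian_cuisine": [4.0, 4.0],
    "av_rat_asianfusion_cuisine": [2.0, np.nan],
})
toy_hist_av = toy_rows.apply(cuisine_history_average, axis=1)
```

Passing `suffix="_bin"` or `suffix="_real"` would select the corresponding column variants, provided they are present in the row.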

##### Test set

In [None]:
review_test = _pd.read_pickle('../dataset/m2_n9/review_test.pickle')
review_test = review_test.sort_values(by=['review_id'])
review_test = review_test.reset_index(drop = True)
review_test.shape

In [None]:
review_test.head()

In [None]:
restaurants = _pd.read_pickle('../dataset/m2_n9/restaurants.pickle')
restaurants = restaurants.reset_index(drop = True)
restaurants = restaurants[['cuisine', 'business_id']]
restaurants.head()

In [None]:
review_test_rest = review_test.join(restaurants.set_index('business_id'), on = 'business_id')
review_test_rest.to_pickle('../dataset/m2_n9/review_test_cuisine.pickle')
review_test_rest.shape

In [None]:
review_test_rest.head()

In [None]:
del restaurants

users = _pd.read_pickle('../dataset/m2_n9/users_2.pickle')
users.head()

In [None]:
users = users[['user_id', 'av_rat_chinese_cuisine', 'av_rat_japanese_cuisine', 'av_rat_mexican_cuisine', 'av_rat_italian_cuisine', 
            'av_rat_others_cuisine', 'av_rat_american_cuisine', 'av_rat_korean_cuisine', 'av_rat_mediterranean_cuisine',
            'av_rat_thai_cuisine', 'av_rat_asianfusion_cuisine',
           
           'av_rat_chinese_cuisine_bin', 'av_rat_japanese_cuisine_bin', 'av_rat_mexican_cuisine_bin', 
           'av_rat_italian_cuisine_bin', 'av_rat_others_cuisine_bin', 'av_rat_american_cuisine_bin', 
           'av_rat_korean_cuisine_bin', 'av_rat_mediterranean_cuisine_bin', 'av_rat_thai_cuisine_bin', 
           'av_rat_asianfusion_cuisine_bin',
           
           'av_rat_chinese_cuisine_real', 'av_rat_japanese_cuisine_real', 'av_rat_mexican_cuisine_real', 
           'av_rat_italian_cuisine_real', 'av_rat_others_cuisine_real', 'av_rat_american_cuisine_real', 
           'av_rat_korean_cuisine_real', 'av_rat_mediterranean_cuisine_real', 'av_rat_thai_cuisine_real', 
           'av_rat_asianfusion_cuisine_real']]

users.head()

In [None]:
test_join = review_test_rest.join(users.set_index('user_id'), on = 'user_id', lsuffix = '_test_revirew', rsuffix = '_users')
test_join.shape

In [None]:
test_join.head()

In [None]:
test_join.to_pickle('../dataset/m2_n9/join_test_users_review.pickle')
del users, review_test_rest

In [None]:
def _restaturants_users_cuisine_ratings(grouped):
    cuisines = str(grouped['cuisine']).split(", ")
    
    d = {'review_id' : grouped['review_id'],'cuisine_av_hist' : 0, 'cuisine_av_hist_bin' : 0, 'cuisine_av_hist_real': 0}
    index = ['review_id', 'cuisine_av_hist', 'cuisine_av_hist_bin', 'cuisine_av_hist_real']
   
    values = []
    for cuisine in cuisines:
        cui = cuisine.lower().replace(" ", "")
        name = "av_rat_{0}_cuisine".format(cui)
        values.append(grouped[name])
    d['cuisine_av_hist'] = _np.average(values)
    
    values = []
    for cuisine in cuisines:
        cui = cuisine.lower().replace(" ", "")
        name = "av_rat_{0}_cuisine_bin".format(cui)
        values.append(grouped[name])
    d['cuisine_av_hist_bin'] = _np.average(values)
    
    values = []
    for cuisine in cuisines:
        cui = cuisine.lower().replace(" ", "")
        name = "av_rat_{0}_cuisine_real".format(cui)
        values.append(grouped[name])
    d['cuisine_av_hist_real'] = _np.average(values)
    
    return _pd.Series(d, index = index)

In [None]:
applied_test = test_join.apply(_restaturants_users_cuisine_ratings, axis = 1)
applied_test.shape

In [None]:
applied_test = applied_test.sort_values(by=['review_id'])
applied_test = applied_test.reset_index(drop = True)
applied_test.head()

In [None]:
applied_test.to_pickle('../dataset/m2_n9/applied_test_users_review.pickle')

In [None]:
review_test.shape

In [None]:
review_test.head()

In [None]:
review_test = review_test.assign(cuisine_av_hist = applied_test['cuisine_av_hist'],
                                 cuisine_av_hist_bin = applied_test['cuisine_av_hist_bin'],
                                 cuisine_av_hist_real = applied_test['cuisine_av_hist_real'])
review_test.shape

In [None]:
review_test.head()

In [None]:
test_set = review_test
test_set.to_pickle('../dataset/m2_n9/review_test_cuisine_final.pickle')
_del_all()

##### Training set

In [None]:
review_train = _pd.read_pickle('../dataset/m2_n9/review_train.pickle')
review_train = review_train.sort_values(by=['review_id'])
review_train = review_train.reset_index(drop = True)
review_train.shape

In [None]:
review_train.head()

In [None]:
restaurants = _pd.read_pickle('../dataset/m2_n9/restaurants.pickle')
restaurants = restaurants.reset_index(drop = True)
restaurants = restaurants[['cuisine', 'business_id']]
restaurants.head()

In [None]:
review_train_rest = review_train.join(restaurants.set_index('business_id'), on = 'business_id')
review_train_rest.to_pickle('../dataset/m2_n9/review_train_cuisine.pickle')
review_train_rest.shape

In [None]:
review_train_rest.head()

In [None]:
del restaurants

users = _pd.read_pickle('../dataset/m2_n9/users_2.pickle')
users.head()

In [None]:
users = users[['user_id', 'av_rat_chinese_cuisine', 'av_rat_japanese_cuisine', 'av_rat_mexican_cuisine', 'av_rat_italian_cuisine', 
            'av_rat_others_cuisine', 'av_rat_american_cuisine', 'av_rat_korean_cuisine', 'av_rat_mediterranean_cuisine',
            'av_rat_thai_cuisine', 'av_rat_asianfusion_cuisine',
           
           'av_rat_chinese_cuisine_bin', 'av_rat_japanese_cuisine_bin', 'av_rat_mexican_cuisine_bin', 
           'av_rat_italian_cuisine_bin', 'av_rat_others_cuisine_bin', 'av_rat_american_cuisine_bin', 
           'av_rat_korean_cuisine_bin', 'av_rat_mediterranean_cuisine_bin', 'av_rat_thai_cuisine_bin', 
           'av_rat_asianfusion_cuisine_bin',
           
           'av_rat_chinese_cuisine_real', 'av_rat_japanese_cuisine_real', 'av_rat_mexican_cuisine_real', 
           'av_rat_italian_cuisine_real', 'av_rat_others_cuisine_real', 'av_rat_american_cuisine_real', 
           'av_rat_korean_cuisine_real', 'av_rat_mediterranean_cuisine_real', 'av_rat_thai_cuisine_real', 
           'av_rat_asianfusion_cuisine_real']]

users.head()

In [None]:
train_join = review_train_rest.join(users.set_index('user_id'), on = 'user_id', lsuffix = '_train_revirew', rsuffix = '_users')
train_join.shape

In [None]:
train_join.head()

In [None]:
train_join.to_pickle('../dataset/m2_n9/join_train_users_review.pickle')
del users, review_train_rest

In [None]:
applied_train = train_join.apply(_restaturants_users_cuisine_ratings, axis = 1)
applied_train.shape

In [None]:
applied_train = applied_train.sort_values(by=['review_id'])
applied_train = applied_train.reset_index(drop = True)
applied_train.head()

In [None]:
applied_train.to_pickle('../dataset/m2_n9/applied_train_users_review.pickle')

In [None]:
review_train.shape

In [None]:
review_train.head()

In [None]:
review_train = review_train.assign(cuisine_av_hist = applied_train['cuisine_av_hist'],
                                   cuisine_av_hist_bin = applied_train['cuisine_av_hist_bin'],
                                   cuisine_av_hist_real = applied_train['cuisine_av_hist_real'])
review_train.shape

In [None]:
review_train.head()

In [None]:
train_set = review_train
train_set.to_pickle('../dataset/m2_n9/review_train_cuisine_final.pickle')
_del_all()

## 4. User-based collaborative approach

$pred(u, r) = a_u + \frac{\sum_{u_i \in U} sim(u, u_i) \cdot (a_{u_i, r} - a_r)} {\sum_{u_i \in U} sim(u, u_i)}$
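
A minimal numeric sketch of this prediction, reading the numerator as a similarity-weighted sum of the deviations $a_{u_i, r} - a_r$, which is how the code in this section computes it. The function name `predict_rating`, its parameters and the fallback to $a_u$ when all similarities are zero are assumptions made for this example.

```python
import numpy as np

def predict_rating(a_u, sims, neighbour_ratings, a_r):
    """a_u plus the similarity-weighted mean deviation of the neighbours from a_r."""
    sims = np.asarray(sims, dtype=float)
    neighbour_ratings = np.asarray(neighbour_ratings, dtype=float)
    denominator = sims.sum()
    if denominator == 0:
        return a_u   # assumption: no similar users, fall back to the user's average
    numerator = (sims * (neighbour_ratings - a_r)).sum()
    return a_u + numerator / denominator

# two neighbours with similarities 1.0 and 0.5 to the target user
toy_pred = predict_rating(a_u=3.5, sims=[1.0, 0.5], neighbour_ratings=[5.0, 3.0], a_r=4.0)
toy_fallback = predict_rating(a_u=3.5, sims=[0.0, 0.0], neighbour_ratings=[5.0, 3.0], a_r=4.0)
```

Here the deviations are $+1$ and $-1$, so the weighted mean deviation is $(1.0 - 0.5) / 1.5 = 1/3$ and the prediction sits slightly above the user's own average.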

In [None]:
restaurants = _pd.read_pickle('../dataset/m2_n9/restaurants.pickle')
restaurants.set_index('business_id', inplace=True)
restaurants.head()

#### Training set

In [None]:
review_train = _pd.read_pickle('../dataset/m2_n9/review_train_cuisine_final.pickle')
review_train.shape

In [None]:
review_train = review_train.assign(coll_score=_np.nan, coll_score_bin=_np.nan, coll_score_real=_np.nan)
review_train.shape

Standard

In [None]:
users = _pd.read_pickle('../dataset/m2_n9/users_2.pickle')
users.set_index('user_id', inplace=True)
users.shape

In [None]:
user_ids = list(set(users.index) & set(review_train.user_id.unique()))
len(user_ids)

In [None]:
sub_user = users.loc[user_ids, ['av_rat_chinese_cuisine', 'av_rat_japanese_cuisine', 'av_rat_mexican_cuisine', 'av_rat_italian_cuisine', 
            'av_rat_others_cuisine', 'av_rat_american_cuisine', 'av_rat_korean_cuisine', 'av_rat_mediterranean_cuisine',
            'av_rat_thai_cuisine', 'av_rat_asianfusion_cuisine']]
sub_user = sub_user.fillna(sub_user.mean())
sub_index = sub_user.index   # keep the user ids: the sparse matrix built below has no index
sub_user = _csr_matrix(sub_user.values)
cos = _cosine_similarity(sub_user, dense_output=False)   # sparse output, so that tolil() works

In [None]:
del user_ids, sub_user

In [None]:
cos = cos.tolil()
cos[cos<0.5] = 0

In [None]:
data_cos = _pd.DataFrame.sparse.from_spmatrix(data=cos, columns=sub_index, index=sub_index)
data_cos.head()

In [None]:
del users

In [None]:
count = 0
tot = review_train.shape[0]

for rid, row in review_train.iterrows():
    rest_id = row['business_id']
    user_id = row['user_id']
    
    a_u_r = row['cuisine_av_hist']
    a_r = restaurants.loc[rest_id, 'average_stars']   # restaurants is indexed by business_id
    numerator = (data_cos[user_id] * (a_u_r - a_r)).sum()
    denominator = data_cos[user_id].sum()
    
    review_train.loc[rid, 'coll_score'] = numerator/denominator
    
    count += 1
    if count % 1000 == 0:
        percent = (count/tot)*100
        print("row {}/{} - {}%".format(count, tot, percent))

In [None]:
del data_cos
review_train.to_pickle('../dataset/m2_n9/review_train.pickle')

Binary

In [None]:
users = _pd.read_pickle('../dataset/m2_n9/users_2.pickle')
users.set_index('user_id', inplace=True)
users.shape

In [None]:
user_ids = list(set(users.index) & set(review_train.user_id.unique()))
len(user_ids)

In [None]:
sub_user = users.loc[user_ids, ['av_rat_chinese_cuisine_bin', 'av_rat_japanese_cuisine_bin', 'av_rat_mexican_cuisine_bin', 
           'av_rat_italian_cuisine_bin', 'av_rat_others_cuisine_bin', 'av_rat_american_cuisine_bin', 
           'av_rat_korean_cuisine_bin', 'av_rat_mediterranean_cuisine_bin', 'av_rat_thai_cuisine_bin', 
           'av_rat_asianfusion_cuisine_bin']]
sub_user = sub_user.fillna(sub_user.mean())
sub_index = sub_user.index   # keep the user ids: the sparse matrix built below has no index
sub_user = _csr_matrix(sub_user.values)
cos_bin = _cosine_similarity(sub_user, dense_output=False)   # sparse output, so that tolil() works

In [None]:
del user_ids, sub_user

In [None]:
cos_bin = cos_bin.tolil()
cos_bin[cos_bin<0.5] = 0

In [None]:
data_bin = _pd.DataFrame.sparse.from_spmatrix(data=cos_bin, columns=sub_index, index=sub_index)
data_bin.head()

In [None]:
del users

In [None]:
count = 0
tot = review_train.shape[0]

for rid, row in review_train.iterrows():
    rest_id = row['business_id']
    user_id = row['user_id']
    
    a_u_r_bin = row['cuisine_av_hist_bin']
    a_r_bin = restaurants.loc[rest_id, 'average_stars_bin']   # restaurants is indexed by business_id
    numerator_bin = (data_bin[user_id] * (a_u_r_bin - a_r_bin)).sum()
    denominator_bin = data_bin[user_id].sum()
    
    review_train.loc[rid, 'coll_score_bin'] = numerator_bin/denominator_bin
    
    count += 1
    if count % 1000 == 0:
        percent = (count/tot)*100
        print("row {}/{} - {}%".format(count, tot, percent))

In [None]:
del data_bin
review_train.to_pickle('../dataset/m2_n9/review_train.pickle')

Real

In [None]:
users = _pd.read_pickle('../dataset/m2_n9/users_2.pickle')
users.set_index('user_id', inplace=True)
users.shape

In [None]:
user_ids = list(set(users.index) & set(review_train.user_id.unique()))
len(user_ids)

In [None]:
sub_user = users.loc[user_ids, ['av_rat_chinese_cuisine_real', 'av_rat_japanese_cuisine_real', 'av_rat_mexican_cuisine_real', 
           'av_rat_italian_cuisine_real', 'av_rat_others_cuisine_real', 'av_rat_american_cuisine_real', 
           'av_rat_korean_cuisine_real', 'av_rat_mediterranean_cuisine_real', 'av_rat_thai_cuisine_real', 
           'av_rat_asianfusion_cuisine_real']]
sub_user = sub_user.fillna(sub_user.mean())
sub_index = sub_user.index   # keep the user ids: the sparse matrix built below has no index
sub_user = _csr_matrix(sub_user.values)
cos_real = _cosine_similarity(sub_user, dense_output=False)   # sparse output, so that tolil() works

In [None]:
del user_ids, sub_user

In [None]:
cos_real = cos_real.tolil()
cos_real[cos_real<0.5] = 0

In [None]:
data_real = _pd.DataFrame.sparse.from_spmatrix(data=cos_real, columns=sub_index, index=sub_index)
data_real.head()

In [None]:
del users

In [None]:
count = 0
tot = review_train.shape[0]

for rid, row in review_train.iterrows():
    rest_id = row['business_id']
    user_id = row['user_id']
    
    a_u_r_real = row['cuisine_av_hist_real']
    a_r_real = restaurants.loc[rest_id, 'average_stars_real']   # restaurants is indexed by business_id
    numerator_real = (data_real[user_id] * (a_u_r_real - a_r_real)).sum()
    denominator_real = data_real[user_id].sum()
    
    review_train.loc[rid, 'coll_score_real'] = numerator_real/denominator_real
    
    count += 1
    if count % 1000 == 0:
        percent = (count/tot)*100
        print("row {}/{} - {}%".format(count, tot, percent))

In [None]:
del data_real
review_train.to_pickle('../dataset/m2_n9/review_train.pickle')
del review_train

#### Test set

In [None]:
review_test = _pd.read_pickle('../dataset/m2_n9/review_test_cuisine_final.pickle')
review_test.shape

In [None]:
# assign returns a new DataFrame, so the result has to be stored
review_test = review_test.assign(coll_score=_np.nan, coll_score_bin=_np.nan, coll_score_real=_np.nan)
review_test.shape

Standard

In [None]:
users = _pd.read_pickle('../dataset/m2_n9/users_2.pickle')
users.set_index('user_id', inplace=True)
users.shape

In [None]:
user_ids = list(set(users.index) & set(review_test.user_id.unique()))
len(user_ids)

In [None]:
sub_user = users.loc[user_ids, ['av_rat_chinese_cuisine', 'av_rat_japanese_cuisine', 'av_rat_mexican_cuisine', 'av_rat_italian_cuisine', 
            'av_rat_others_cuisine', 'av_rat_american_cuisine', 'av_rat_korean_cuisine', 'av_rat_mediterranean_cuisine',
            'av_rat_thai_cuisine', 'av_rat_asianfusion_cuisine']]
sub_user = sub_user.fillna(sub_user.mean())
# keep the user labels: the sparse matrix built below does not carry an index
user_index = sub_user.index
# dense_output=False keeps the result sparse, so that .tolil() is available below
cos = _cosine_similarity(_csr_matrix(sub_user.values), dense_output=False)

In [None]:
del user_ids, sub_user

In [None]:
cos = cos.tolil()
cos[cos<0.5] = 0

In [None]:
data_cos = _pd.DataFrame.sparse.from_spmatrix(data=cos, columns=user_index, index=user_index)
data_cos.head()

In [None]:
del users

In [None]:
count = 0
tot = review_test.shape[0]

for rid, row in review_test.iterrows():
    rest_id = row['business_id']
    user_id = row['user_id']
    
    a_u_r = row['cuisine_av_hist']
    a_r = restaurants.loc[rid, 'average_stars']
    numerator = (data_cos[user_id] * (a_u_r - a_r)).sum()
    denominator = data_cos[user_id].sum()
    
    review_test.loc[rid, 'coll_score'] = numerator/denominator
    
    count += 1
    if count % 1000 == 0:
        percent = (count/tot)*100
        print("row {}/{} - {}%".format(count, tot, percent))

In [None]:
del data_cos
review_test.to_pickle('../dataset/m2_n9/review_test.pickle')

Binary

In [None]:
users = _pd.read_pickle('../dataset/m2_n9/users_2.pickle')
users.set_index('user_id', inplace=True)
users.shape

In [None]:
user_ids = list(set(users.index) & set(review_test.user_id.unique()))
len(user_ids)

In [None]:
sub_user = users.loc[user_ids, ['av_rat_chinese_cuisine_bin', 'av_rat_japanese_cuisine_bin', 'av_rat_mexican_cuisine_bin', 
           'av_rat_italian_cuisine_bin', 'av_rat_others_cuisine_bin', 'av_rat_american_cuisine_bin', 
           'av_rat_korean_cuisine_bin', 'av_rat_mediterranean_cuisine_bin', 'av_rat_thai_cuisine_bin', 
           'av_rat_asianfusion_cuisine_bin']]
sub_user = sub_user.fillna(sub_user.mean())
# keep the user labels: the sparse matrix built below does not carry an index
user_index = sub_user.index
# dense_output=False keeps the result sparse, so that .tolil() is available below
cos_bin = _cosine_similarity(_csr_matrix(sub_user.values), dense_output=False)

In [None]:
del user_ids, sub_user

In [None]:
cos_bin = cos_bin.tolil()
cos_bin[cos_bin<0.5] = 0

In [None]:
data_bin = _pd.DataFrame.sparse.from_spmatrix(data=cos_bin, columns=user_index, index=user_index)
data_bin.head()

In [None]:
del users

In [None]:
count = 0
tot = review_test.shape[0]

for rid, row in review_test.iterrows():
    rest_id = row['business_id']
    user_id = row['user_id']
    
    a_u_r_bin = row['cuisine_av_hist_bin']
    a_r_bin = restaurants.loc[rid, 'average_stars_bin']
    numerator_bin = (data_bin[user_id] * (a_u_r_bin - a_r_bin)).sum()
    denominator_bin = data_bin[user_id].sum()
    
    review_test.loc[rid, 'coll_score_bin'] = numerator_bin/denominator_bin
    
    count += 1
    if count % 1000 == 0:
        percent = (count/tot)*100
        print("row {}/{} - {}%".format(count, tot, percent))

In [None]:
del data_bin
review_test.to_pickle('../dataset/m2_n9/review_test.pickle')

Real

In [None]:
users = _pd.read_pickle('../dataset/m2_n9/users_2.pickle')
users.set_index('user_id', inplace=True)
users.shape

In [None]:
user_ids = list(set(users.index) & set(review_test.user_id.unique()))
len(user_ids)

In [None]:
sub_user = users.loc[user_ids, ['av_rat_chinese_cuisine_real', 'av_rat_japanese_cuisine_real', 'av_rat_mexican_cuisine_real', 
           'av_rat_italian_cuisine_real', 'av_rat_others_cuisine_real', 'av_rat_american_cuisine_real', 
           'av_rat_korean_cuisine_real', 'av_rat_mediterranean_cuisine_real', 'av_rat_thai_cuisine_real', 
           'av_rat_asianfusion_cuisine_real']]
sub_user = sub_user.fillna(sub_user.mean())
# keep the user labels: the sparse matrix built below does not carry an index
user_index = sub_user.index
# dense_output=False keeps the result sparse, so that .tolil() is available below
cos_real = _cosine_similarity(_csr_matrix(sub_user.values), dense_output=False)

In [None]:
del user_ids, sub_user

In [None]:
cos_real = cos_real.tolil()
cos_real[cos_real<0.5] = 0

In [None]:
data_real = _pd.DataFrame.sparse.from_spmatrix(data=cos_real, columns=user_index, index=user_index)
data_real.head()

In [None]:
del users

In [None]:
count = 0
tot = review_test.shape[0]

for rid, row in review_test.iterrows():
    rest_id = row['business_id']
    user_id = row['user_id']
    
    a_u_r_real = row['cuisine_av_hist_real']
    a_r_real = restaurants.loc[rid, 'average_stars_real']
    numerator_real = (data_real[user_id] * (a_u_r_real - a_r_real)).sum()
    denominator_real = data_real[user_id].sum()
    
    review_test.loc[rid, 'coll_score_real'] = numerator_real/denominator_real
    
    count += 1
    if count % 1000 == 0:
        percent = (count/tot)*100
        print("row {}/{} - {}%".format(count, tot, percent))

In [None]:
del data_real
review_test.to_pickle('../dataset/m2_n9/review_test.pickle')
_del_all()
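
As a recap of the score computed in the loops above, here is a minimal sketch on toy data. The three users, their two rating columns and the values passed to `coll_score` are made up for illustration; only the 0.5 similarity threshold and the numerator/denominator structure come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy per-cuisine average ratings for three hypothetical users.
profiles = pd.DataFrame(
    [[5.0, 1.0], [4.0, 2.0], [1.0, 5.0]],
    index=["u1", "u2", "u3"],
    columns=["av_rat_a", "av_rat_b"],
)

# Pairwise cosine similarity; entries below 0.5 are zeroed, as in the notebook.
sim = cosine_similarity(profiles.values)
sim[sim < 0.5] = 0
sim = pd.DataFrame(sim, index=profiles.index, columns=profiles.index)


def coll_score(user_id, a_u_r, a_r):
    """Similarity-weighted deviation: numerator / denominator as in the loops."""
    weights = sim[user_id]
    numerator = (weights * (a_u_r - a_r)).sum()
    denominator = weights.sum()
    return numerator / denominator


score = coll_score("u1", a_u_r=4.5, a_r=3.5)
print(sim.round(3))
print("score:", score)
```

Since the difference `a_u_r - a_r` is a single number for each review, it factors out of the weighted sum, so the ratio equals that difference whenever the denominator is non-zero.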

## 5. Some more preprocessing

 - We don't need the dataset <i>checkin</i>, and from the dataset <i>tips</i>
   we take only the feature "compliments";
 - The train set is a join of all the data needed for training;
 - The test set is a join of all the data needed for testing and performance evaluation (labels included);
 - The label is a feature 'likes' that is 1 if the user will like the
   restaurant (4 or 5 stars) or 0 if they won't (1, 2 or 3 stars).
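
The label and the join described in this list can be sketched on toy tables as follows. The rows are invented, and `merge` with `validate` is used here only as a compact way to make the row-count check explicit; the notebook itself uses `join` and compares lengths by hand.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the review and restaurant tables (hypothetical values).
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "business_id": ["b1", "b2", "b1"],
    "stars": [5, 3, 4],
})
restaurants = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "city": ["Las Vegas", "Phoenix"],
})

# Label: 1 for 4 or 5 stars, 0 otherwise.
reviews["likes"] = np.where(reviews["stars"].isin([4, 5]), 1, 0)

# Left join: every review is kept. validate raises if a business_id is
# duplicated on the right side, which would multiply review rows.
joined = reviews.merge(restaurants, on="business_id", how="left",
                       validate="many_to_one")
print(joined)
```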

### 5.1. Join the datasets

##### Training set

In [None]:
review_train = _pd.read_pickle('../dataset/m2_n9/review_train_cuisine_final.pickle')
review_train = review_train.assign(likes = _np.nan)
review_train['likes'] = _np.where(review_train['stars'].isin([4, 5]), 1, 0)
review_train.head(10)

In [None]:
restaurants = _pd.read_pickle('../dataset/m2_n9/restaurants.pickle')
restaurants = restaurants.reset_index(drop = True)
restaurants.head()

In [None]:
review_rest_train = review_train.join(restaurants.set_index('business_id'), on = 'business_id', lsuffix = '_review', rsuffix = '_restaurant')
review_rest_train.head()

In [None]:
print(len(review_train))
print(len(review_rest_train))

In [None]:
tips = _pd.read_pickle('../dataset/m2_n9/tips_train.pickle')
tips = tips.reset_index(drop = True)
tips.head()

In [None]:
tips_agg = tips.groupby(['business_id', 'user_id'])['compliment_count'].sum()
tips_agg.head()

In [None]:
review_tip_train = review_rest_train.join(tips_agg, on=['business_id', 'user_id'], lsuffix = '_review', rsuffix = '_tip')
review_tip_train.head()

In [None]:
print(len(review_train))
print(len(review_tip_train))

In [None]:
users = _pd.read_pickle('../dataset/m2_n9/users_2.pickle')
users = users.reset_index(drop = True)
users.head()

In [None]:
train_set = review_tip_train.join(users.set_index('user_id'), on = 'user_id', lsuffix = '_review', rsuffix = '_user')
del review_rest_train, users
train_set.head()

In [None]:
print(len(review_train))
print(len(train_set))

In [None]:
train_set.to_pickle('../dataset/m2_n9/model_train_set.pickle')
_del_all()

##### Test set

In [None]:
review_test = _pd.read_pickle('../dataset/m2_n9/review_test_cuisine_final.pickle')
review_test = review_test.assign(likes = _np.nan)
review_test['likes'] = _np.where(review_test['stars'].isin([4, 5]), 1, 0)
review_test.head()

In [None]:
print(len(review_test))

In [None]:
restaurants = _pd.read_pickle('../dataset/m2_n9/restaurants.pickle')
restaurants = restaurants.reset_index(drop = True)
restaurants.head()

In [None]:
review_rest_test = review_test.join(restaurants.set_index('business_id'), on = 'business_id', lsuffix = '_review', rsuffix = '_restaurant')
del restaurants
review_rest_test.head()

In [None]:
print(len(review_test))
print(len(review_rest_test))

In [None]:
tips = _pd.read_pickle('../dataset/m2_n9/tips_test.pickle')
tips = tips.reset_index(drop = True)
tips.head()

In [None]:
tips_agg = tips.groupby(['business_id', 'user_id'])['compliment_count'].sum()
tips_agg.head()

In [None]:
review_tip_test = review_rest_test.join(tips_agg, on=['business_id', 'user_id'], lsuffix = '_review', rsuffix = '_tip')
review_tip_test.head()

In [None]:
print(len(review_test))
print(len(review_tip_test))

In [None]:
users = _pd.read_pickle('../dataset/m2_n9/users_2.pickle')
users = users.reset_index(drop = True)
users.head()

In [None]:
test_set = review_tip_test.join(users.set_index('user_id'), on = 'user_id', lsuffix = '_review', rsuffix = '_user')
del review_rest_test, users
test_set.head()

In [None]:
print(len(review_test))
print(len(test_set))

In [None]:
test_set.to_pickle('../dataset/m2_n9/model_test_set.pickle')
_del_all()

### 5.2. Prepare data for the models

We have to fill the missing values in the dataset, and then either convert the
non-numerical features into numerical ones or drop them if our models do not
need them, so that every remaining feature is readable by the models.

We first summarize what kind of data we have at the moment, in order to decide
what to do with each feature.

In [None]:
train_set = _pd.read_pickle('../dataset/m2_n9/model_train_set.pickle')
train_set.head()

In [None]:
test_set = _pd.read_pickle('../dataset/m2_n9/model_test_set.pickle')
test_set.head()

In [None]:
train_test_set = _pd.concat([train_set, test_set], sort=False)

In [None]:
print("train size:", train_set.shape)
print("test size:", test_set.shape)
print("train_test size:", train_test_set.shape)
print(train_set.shape[0] + test_set.shape[0] == train_test_set.shape[0])
_train_len = train_set.shape[0]

In [None]:
train_test_set.info()

In [None]:
train_test_types = train_test_set.dtypes

In [None]:
for ind, dtype in train_test_types.items():
    if not _np.issubdtype(dtype, _np.number):
        if "id" not in ind:
            uniq_vals = train_test_set[ind].unique()
            null_vals = train_test_set[ind].isnull().sum()
            print(ind + " - " + str(dtype) + "  - unique: " + str(len(uniq_vals)) + " - nulls: " + str(null_vals))
            print(uniq_vals[:10])
            print()

Drop useless features

In [None]:
train_test_set.drop(columns=['date', 'name', 'address', 'yelping_since', 'user_name', 'cuisine'], inplace=True)

Fill missing values

In [None]:
train_test_set['OutdoorSeating'] = train_test_set['OutdoorSeating'].fillna('None')
train_test_set['BusinessAcceptsCreditCards'] = train_test_set['BusinessAcceptsCreditCards'].fillna('None')
train_test_set['RestaurantsDelivery'] = train_test_set['RestaurantsDelivery'].fillna('None')
train_test_set['RestaurantsReservations'] = train_test_set['RestaurantsReservations'].fillna('None')
train_test_set['WiFi'] = train_test_set['WiFi'].fillna('None')
train_test_set['Alcohol'] = train_test_set['Alcohol'].fillna('None')

In [None]:
hour_cols = [day + '_' + kind for kind in ('Open', 'Close')
             for day in ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')]
for col in hour_cols:
    # fill before casting: astype(str) would turn NaN into the string 'nan',
    # and mode() returns a Series, so we take its first (most frequent) value
    train_test_set[col] = train_test_set[col].fillna(train_test_set[col].mode().iloc[0]).astype(str)

In [None]:
for ind, dtype in train_test_types.items():
    if _np.issubdtype(dtype, _np.floating):
        train_test_set[ind] = train_test_set[ind].fillna(train_test_set[ind].mean())
    elif _np.issubdtype(dtype, _np.integer):
        train_test_set[ind] = train_test_set[ind].fillna(round(train_test_set[ind].mean()))

In [None]:
# check whether any feature still has null values
train_test_set.info()
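
The two imputation rules used in this step (most frequent value for the opening hours, mean for the numeric features) can be sketched on a toy frame. The column `review_count` and all the values are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Monday_Open": ["11:00", None, "11:00", "17:00"],
    "review_count": [10.0, np.nan, 30.0, 20.0],
})

# mode() returns a Series (there can be ties), so take its first entry.
# Fill before casting to str, otherwise missing values become plain strings
# and fillna no longer sees them.
most_common = df["Monday_Open"].mode().iloc[0]
df["Monday_Open"] = df["Monday_Open"].fillna(most_common).astype(str)

# Numeric column: fill with the column mean.
df["review_count"] = df["review_count"].fillna(df["review_count"].mean())
print(df)
```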

Convert non-numerical features

We print and plot the distribution of the cities, to see the long tail and decide how many of them to keep.

In [None]:
city_count = train_test_set['city'].value_counts()
print(city_count.to_string())
print(city_count.shape)

In [None]:
data = train_test_set.loc[train_test_set['city']!="Las Vegas", 'city']
weights = _np.ones(len(data)) / len(data)
_plt.figure(figsize=(20,10))
_plt.hist(data, weights=weights, bins=100)
_plt.title("City distribution")
_plt.gca().yaxis.set_major_formatter(_PercentFormatter(1))
_plt.show()

In [None]:
main_cities = city_count.where(city_count >= 100).dropna()
print(main_cities.to_string())
print(main_cities.shape)
# escape the names: they are used as a regular expression in the next cell
main_cities = '|'.join([_re.escape(x) for x in main_cities.index])

In [None]:
train_test_set['city'] = train_test_set['city'].str.findall(main_cities)
train_test_set['city'] = train_test_set['city'].map(lambda x: 'Other' if x==[] else x[0])
train_test_set.head()
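
A minimal sketch of the same long-tail bucketing on toy data. It uses exact matching with `isin` rather than the regular-expression search of the cell above, so it is an alternative formulation, not the notebook's code; the threshold of 2 is a toy value standing in for 100.

```python
import pandas as pd

city = pd.Series(["Las Vegas", "Las Vegas", "Las Vegas",
                  "Phoenix", "Phoenix", "Tempe"])
min_count = 2  # toy threshold; the notebook keeps cities with at least 100 rows

counts = city.value_counts()
main = counts[counts >= min_count].index

# Keep the frequent cities and collapse the long tail into 'Other'.
bucketed = city.where(city.isin(main), "Other")
print(bucketed.value_counts())
```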

We print and plot the distribution of the categories, to see the long tail and decide how many of them to keep.

In [None]:
category_count = _pd.Series(', '.join(list(train_test_set['categories'])).split(', ')).value_counts()
print(category_count.to_string())
print(category_count.shape)

In [None]:
data = category_count.drop(labels=['Restaurants', 'Food']).index
vals = category_count.drop(labels=['Restaurants', 'Food']).values
weights = vals / vals.sum()
_plt.figure(figsize=(20,10))
_plt.hist(data, weights=weights, bins=100)
_plt.title("Category distribution")
_plt.gca().yaxis.set_major_formatter(_PercentFormatter(1))
_plt.show()

In [None]:
main_categories = category_count.drop(labels=['Restaurants', 'Food']).where(category_count >= 200).dropna()
print(main_categories.to_string())
print(main_categories.shape)
main_categories = '|'.join([_re.escape(x) for x in main_categories.index])

In [None]:
train_test_set['categories'] = train_test_set['categories'].str.findall(main_categories)
train_test_set['categories'] = train_test_set['categories'].map(lambda x: set(x))
train_test_set['categories'] = train_test_set['categories'].map(lambda x: ['Other'] if not bool(x) else list(x))
train_test_set['categories'] = train_test_set['categories'].map(', '.join) 
train_test_set.head()

Now we apply the actual conversion

In [None]:
train_test_set.shape

In [None]:
cat_cols = ['OutdoorSeating', 'BusinessAcceptsCreditCards', 'RestaurantsDelivery', 'RestaurantsReservations', 'WiFi',
        'Alcohol', 'city']
train_test_set = _pd.get_dummies(train_test_set, columns=cat_cols, prefix=cat_cols)
train_test_set.shape

In [None]:
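# the categories were joined with ', ', so we split on the same separator
# (splitting on ',' alone would create duplicate columns with a leading space)
categories = train_test_set['categories'].str.get_dummies(', ')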
f1 = lambda x: "categories_" + x
categories.rename(columns=f1, inplace=True)
train_test_set[categories.columns] = categories
train_test_set.drop(columns=['categories'], inplace=True)
train_test_set.shape
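
To illustrate how a comma-separated category string becomes one indicator column per category, here is a small sketch with made-up values. It also shows why the separator has to match the one used when the strings were joined.

```python
import pandas as pd

cats = pd.Series(["Pizza, Italian", "Other", "Italian"])

# One 0/1 column per category, split on the same ', ' used for joining.
dummies = cats.str.get_dummies(sep=", ").add_prefix("categories_")

# Splitting on a bare comma keeps the leading space, so ' Italian' and
# 'Italian' end up as two different columns.
naive = cats.str.get_dummies(sep=",")

print(dummies)
print(list(naive.columns))
```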

In [None]:
oe = _OrdinalEncoder()

In [None]:
ord_cols = ['Monday_Open', 'Tuesday_Open', 'Wednesday_Open', 'Thursday_Open', 'Friday_Open',
            'Saturday_Open', 'Sunday_Open', 'Monday_Close', 'Tuesday_Close', 'Wednesday_Close',
            'Thursday_Close','Friday_Close', 'Saturday_Close', 'Sunday_Close', 'postal_code']

train_test_set[ord_cols] = oe.fit_transform(train_test_set[ord_cols].to_numpy())
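
As a small illustration of what the encoder does, the sketch below applies `OrdinalEncoder` to a single toy column of opening hours. The values are invented.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

hours = pd.DataFrame({"Monday_Open": ["11:00", "09:00", "17:00", "09:00"]})

oe = OrdinalEncoder()
codes = oe.fit_transform(hours[["Monday_Open"]])

# Categories are sorted as strings and then numbered 0, 1, 2, ...
print(oe.categories_[0])
print(codes.ravel())
```

The codes follow the lexicographic order of the strings, so they match the chronological order only when the hours are zero-padded (`"09:00"` sorts before `"11:00"`, while `"9:00"` would not).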

The resulting dataset

In [None]:
train_test_set.info()

In [None]:
train_set = train_test_set[:_train_len]
test_set = train_test_set[_train_len:]

In [None]:
train_set.head(10)

In [None]:
train_set.shape

In [None]:
train_set.to_pickle('../dataset/m2_n9/model_train_set_3.pickle')

In [None]:
test_set.head(10)

In [None]:
test_set.shape

In [None]:
test_set.to_pickle('../dataset/m2_n9/model_test_set_3.pickle')

In [None]:
_del_all()

## 6. Models
### 6.1. Linear SVM

(see the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html))

In [None]:
train_set = _pd.read_pickle('../dataset/m2_n9/model_train_set_3.pickle')
train_set.head()

In [None]:
#sub_train_set = train_set[:round(train_set.shape[0]/3)]
sub_train_set = train_set
del train_set
sub_train_set.shape

In [None]:
# define classifier
svc_classifier = _LinearSVC(random_state=0, max_iter=50000)
svc_classifier.get_params()

In [None]:
# fine tune classifier
# param_grid = {'C':[0.001,0.01,0.1,0.25,0.5,0.75,1,10,100,1000], 'gamma':[3,2,1,0.1,0.001,0.0001]}
param_grid = {'C':[0.001,0.01,0.1,0.25,0.5,0.75,1,10,100,1000]}
# grid = _GridSearchCV(estimator=svc_classifier, param_grid=param_grid, refit=True, verbose=2, cv=3, error_score=_np.nan, n_jobs=1, pre_dispatch=1)
grid = _GridSearchCV(estimator=svc_classifier, param_grid=param_grid, refit=True, verbose=2, cv=3, error_score=_np.nan, n_jobs=-1, pre_dispatch=6)
grid.fit(sub_train_set.drop(columns=['likes', 'stars_review', 'review_id', 'user_id', 'business_id']), sub_train_set['likes'])
print("best params:", grid.best_params_, "- best score:", grid.best_score_)

In [None]:
print("results:", grid.cv_results_)
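
For reference, the same grid search over `C` can be sketched on a small synthetic problem. The data come from `make_classification` and the grid is shortened; neither is part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic binary classification data (placeholder for the Yelp features).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=LinearSVC(random_state=0, max_iter=50000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=3,
    refit=True,
)
search.fit(X, y)

best_c = search.best_params_["C"]
print("best C:", best_c, "- mean CV accuracy:", search.best_score_)
```

With `refit=True`, `best_estimator_` has already been refitted on all the data passed to `fit`, so it can be used for prediction directly.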

In [None]:
del sub_train_set
train_set = _pd.read_pickle('../dataset/m2_n9/model_train_set_3.pickle')

In [None]:
best_model = grid.best_estimator_
best_model.fit(train_set.drop(columns=['likes', 'stars_review', 'review_id', 'user_id', 'business_id']), train_set['likes'])

In [None]:
print("coef:", best_model.coef_)
print("intercept:", best_model.intercept_)

In [None]:
del train_set
test_set = _pd.read_pickle('../dataset/m2_n9/model_test_set_3.pickle')
test_set.head()

In [None]:
# test classifier
predic = best_model.predict(test_set.drop(columns=['likes', 'stars_review', 'review_id', 'user_id', 'business_id']))
print("predictions:\n", predic)

In [None]:
# evaluate classifier

print("Report for Support Vector Machine:")
print(_classification_report(test_set['likes'], predic))

print("Accuracy for Support Vector Machine:", _accuracy_score(test_set['likes'], predic)*100)

In [None]:
# Confusion matrix for SVC

print("Confusion Matrix for SVC before balancing the data: ")
_confusion_matrix(test_set['likes'], predic)

In [None]:
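# draw ROC curve from the continuous decision scores
# (hard 0/1 predictions would give a single operating point)
scores = best_model.decision_function(test_set.drop(columns=['likes', 'stars_review', 'review_id', 'user_id', 'business_id']))
fpr, tpr, thresholds = _roc_curve(test_set['likes'], scores)

_plt.plot(fpr,tpr)
_plt.xlim([0.0,1.0])
_plt.ylim([0.0,1.0])

_plt.title("ROC curve - Linear SVM")
_plt.xlabel("False Positive Rate")
_plt.ylabel("True Positive Rate")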

_plt.grid(True)
_plt.show()
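
`roc_curve` accepts continuous scores as well as labels. The sketch below, on synthetic data that are not part of the notebook, compares the curve obtained from `decision_function` with the one obtained from hard predictions, and also computes the area under the curve.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data standing in for the real train/test sets.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LinearSVC(random_state=0, max_iter=50000).fit(X_tr, y_tr)

# Signed distances to the separating hyperplane: one threshold per distinct score.
scores = clf.decision_function(X_te)
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)

# Hard labels: the curve has at most three points, (0, 0), one operating point, (1, 1).
fpr_hard, tpr_hard, _ = roc_curve(y_te, clf.predict(X_te))

print("points with scores:", len(fpr), "- with hard labels:", len(fpr_hard))
print("AUC:", auc)
```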

In [None]:
%reset