## Modelling

In this section we will create a recommendation system using the datasets to solve our main problem.
There are different types of recomentation models, in this project we will focus on three types of recommentation systems

* 1. Content-Based Recommender systems
* 2. Collaborative Filtering Systems
* 3. Deep Neural Networks

Now, in each of these categories we will compare the different models and see which ones perform best. For validation and comparison we will use the RMSE (root mean squared error) metric, that is how far is the prediction from the true value.

#### Feature Engineering 
 
This feature engineering step helps prepare your data for analysis and modeling by selecting and transforming the most relevant attributes, which can lead to more effective modeling and improved insights for our project.
> We'll start by creating a new **review column** that aggregates all the text reviews pertaining a single restaurant from all the users into one text.


In [1]:
# importing necesarry packages

import collections
import folium
import json 
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import string
import pickle
from surprise import Reader , Dataset
from tabulate import tabulate
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import models ,layers, optimizers , losses, regularizers, metrics
from wordcloud import WordCloud

from understanding import DataLoader, DataInfo


# plotting styles
plt.style.use("fivethirtyeight")
%matplotlib inline

2024-08-12 23:10:07.636394: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-08-12 23:10:07.915231: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-08-12 23:10:08.196886: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-12 23:10:08.530384: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-12 23:10:08.535018: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-12 23:10:08.840199: I tensorflow/core/platform/cpu_feature_guard.cc:

#### i) Cleaned Restaurant Informational Data

In [2]:
# Instantiate the DataLoader class
loader= DataLoader()

# Instantiate the DataInfo class
summary= DataInfo()

# Reading the restaurants csv file
restaurant_data= loader.read_data("data/filtered_restaurants_data.csv")

# Summary information on the restaurant df
print(f'\nRESTAURANT DATASET INFORMATION\n' + '=='*20 + '\n')
summary.info(restaurant_data)


RESTAURANT DATASET INFORMATION

Shape of the dataset : (38552, 15) 

Column Names
Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open',
       'attributes', 'categories', 'hours', 'location'],
      dtype='object') 
 

Data Summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38552 entries, 0 to 38551
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   business_id   38552 non-null  object 
 1   name          38552 non-null  object 
 2   address       38552 non-null  object 
 3   city          38552 non-null  object 
 4   state         38552 non-null  object 
 5   postal_code   38552 non-null  object 
 6   latitude      38552 non-null  float64
 7   longitude     38552 non-null  float64
 8   stars         38552 non-null  float64
 9   review_count  38552 non-null  int64  
 10  is_open       38552 non-null  int64  
 11  attribu

Unnamed: 0,latitude,longitude,stars,review_count,is_open
count,38552.0,38552.0,38552.0,38552.0,38552.0
mean,36.899127,-87.678808,3.610383,110.331215,0.65457
std,6.15935,13.596218,0.748755,230.42083,0.475514
min,27.564457,-120.026076,1.0,5.0,0.0
25%,30.026555,-90.206195,3.0,18.0,0.0
50%,38.810572,-86.011028,3.5,47.0,1.0
75%,39.956489,-75.348735,4.0,117.0,1.0
max,53.649743,-74.685404,5.0,7568.0,1.0


Dataset Overview


Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,location
0,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,Missouri,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': ""u'full_bar'"", '...",Italian,Unknown,"State:Missouri, City:Affton, Address:8025 Mack..."
1,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,Missouri,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': ""u'full_bar'"", '...",American (Traditional),Unknown,"State:Missouri, City:Affton, Address:8025 Mack..."
2,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,Missouri,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': ""u'full_bar'"", '...",Greek,Unknown,"State:Missouri, City:Affton, Address:8025 Mack..."


#### ii) Cleaned User Review Data

In [3]:
# Loading the users csv file
users_data= loader.read_data("data/cleaned_users_data.csv")

# Summary information on the user review data
print(f'\nUSER DATASET INFORMATION\n' + '=='*20 + '\n')
summary.info(users_data)


USER DATASET INFORMATION

Shape of the dataset : (429771, 6) 

Column Names
Index(['review_id', 'user_id', 'business_id', 'stars', 'text', 'date'], dtype='object') 
 

Data Summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429771 entries, 0 to 429770
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   review_id    429771 non-null  object
 1   user_id      429771 non-null  object
 2   business_id  429771 non-null  object
 3   stars        429771 non-null  int64 
 4   text         429771 non-null  object
 5   date         429771 non-null  object
dtypes: int64(1), object(5)
memory usage: 19.7+ MB

Descriptive Statistics


Unnamed: 0,stars
count,429771.0
mean,3.820449
std,1.513978
min,1.0
25%,3.0
50%,5.0
75%,5.0
max,5.0


Dataset Overview


Unnamed: 0,review_id,user_id,business_id,stars,text,date
0,iBUJvIOkToh2ZECVNq5PDg,iAD32p6h32eKDVxsPHSRHA,YB26JvvGS2LgkxEKOObSAw,5,I've been eating at this restaurant for over 5...,2021-01-08 01:49:36
1,HgEofz6qEQqKYPT7YLA34w,rYvWv-Ny16b1lMcw1IP7JQ,jfIwOEXcVRyhZjM4ISOh4g,1,How does a delivery person from here get lost ...,2021-01-02 00:19:00
2,Kxo5d6EOnOE-vERwQf2a1w,2ntnbUia9Bna62W0fqNcxg,S-VD26LE_LeJNx5nASk_pw,5,"The service is always good, the employees are ...",2021-01-26 18:01:45


In [4]:
# merging the two datasets into one using the business_id primary key

data=pd.merge(left=users_data, right=restaurant_data, how='left', on='business_id')

# previewing the new merge dataset
data.head()


Unnamed: 0,review_id,user_id,business_id,stars_x,text,date,name,address,city,state,postal_code,latitude,longitude,stars_y,review_count,is_open,attributes,categories,hours,location
0,iBUJvIOkToh2ZECVNq5PDg,iAD32p6h32eKDVxsPHSRHA,YB26JvvGS2LgkxEKOObSAw,5,I've been eating at this restaurant for over 5...,2021-01-08 01:49:36,Unagi & Sushi,"2701 Airline Dr, Ste A",Metairie,Louisiana,70001.0,29.974478,-90.15037,4.0,62.0,1.0,"{'Alcohol': ""u'beer_and_wine'"", 'Caters': 'Fal...",Japanese,"{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...","State:Louisiana, City:Metairie, Address:2701 A..."
1,HgEofz6qEQqKYPT7YLA34w,rYvWv-Ny16b1lMcw1IP7JQ,jfIwOEXcVRyhZjM4ISOh4g,1,How does a delivery person from here get lost ...,2021-01-02 00:19:00,,,,,,,,,,,,,,
2,Kxo5d6EOnOE-vERwQf2a1w,2ntnbUia9Bna62W0fqNcxg,S-VD26LE_LeJNx5nASk_pw,5,"The service is always good, the employees are ...",2021-01-26 18:01:45,Kings and Queens Liberian Cuisine,107 Fairfield Ave,Upper Darby,Pennsylvania,19082.0,39.960828,-75.262968,4.0,84.0,1.0,"{'NoiseLevel': ""u'average'"", 'DogsAllowed': 'F...",African,"{'Tuesday': '11:0-20:0', 'Wednesday': '11:0-17...","State:Pennsylvania, City:Upper Darby, Address:..."
3,Kxo5d6EOnOE-vERwQf2a1w,2ntnbUia9Bna62W0fqNcxg,S-VD26LE_LeJNx5nASk_pw,5,"The service is always good, the employees are ...",2021-01-26 18:01:45,Kings and Queens Liberian Cuisine,107 Fairfield Ave,Upper Darby,Pennsylvania,19082.0,39.960828,-75.262968,4.0,84.0,1.0,"{'NoiseLevel': ""u'average'"", 'DogsAllowed': 'F...",Halal,"{'Tuesday': '11:0-20:0', 'Wednesday': '11:0-17...","State:Pennsylvania, City:Upper Darby, Address:..."
4,STqHwh6xd05bgS6FoAgRqw,j4qNLF-VNRF2DwBkUENW-w,yE1raqkLX7OZsjmX3qKIKg,5,two words: whipped. feta. \nexplosion of amazi...,2021-01-27 23:28:03,Butcher & Bee,902 Main St,Nashville,Tennessee,37206.0,36.175896,-86.75682,4.0,863.0,1.0,"{'NoiseLevel': ""u'average'"", 'Alcohol': ""u'ful...",Middle Eastern,"{'Monday': '0:0-0:0', 'Tuesday': '17:0-21:30',...","State:Tennessee, City:Nashville, Address:902 M..."


### Renaming columns

Renaming the **stars_x** and **stars_y** columns into **rating** and **b/s_rating** columns for better understanding

In [5]:
data.rename(columns={'stars_x':'rating', 'stars_y':'b/s_rating'}, inplace=True)

In [6]:
# combining the address columns
data['location']=data[['city','state','address']]\
            .apply( lambda x: f"State:{x['state']}, City:{x['city']}, Address:{x['address']} ", axis=1)

# then we drop the combined columns
data.drop(columns=['state', 'city','address'], axis=1, inplace=True)

data.location

0         State:Louisiana, City:Metairie, Address:2701 A...
1                         State:nan, City:nan, Address:nan 
2         State:Pennsylvania, City:Upper Darby, Address:...
3         State:Pennsylvania, City:Upper Darby, Address:...
4         State:Tennessee, City:Nashville, Address:902 M...
                                ...                        
535192    State:Tennessee, City:Nashville, Address:550 B...
535193    State:Nevada, City:Reno, Address:55 Mount Rose...
535194    State:Nevada, City:Reno, Address:55 Mount Rose...
535195                    State:nan, City:nan, Address:nan 
535196    State:Indiana, City:Indianapolis, Address:130 ...
Name: location, Length: 535197, dtype: object

> Then we will convert the **user_id** column form string into integer, by assigning the unique string ids integer values. This will aid in our modeling process in the later sections.

In [7]:
# converting the user_id into intergers

# selecting only the unique user ids as a dataframe
ids=data[['user_id']].drop_duplicates('user_id').reset_index(drop=True).copy()

# resetting the indexes, to include a continous numbering 
ids=ids.reset_index()

# merging the ids dataframe with our original dataframe using the user id column as primary key
# renaming the index column to represent the user ids
data=pd.merge(data,ids, how='left', on='user_id').drop('user_id', axis=1).rename(columns={'index':'user_id'})

# writting a function to order the user ids to start from 1 instead of '0'
def add(x):
    """ adds 1 to the existing user id"""
    y=x+1
    return y
data.user_id=data.user_id.apply(add )  # applyng the function to our user ids
data.head()

Unnamed: 0,review_id,business_id,rating,text,date,name,postal_code,latitude,longitude,b/s_rating,review_count,is_open,attributes,categories,hours,location,user_id
0,iBUJvIOkToh2ZECVNq5PDg,YB26JvvGS2LgkxEKOObSAw,5,I've been eating at this restaurant for over 5...,2021-01-08 01:49:36,Unagi & Sushi,70001.0,29.974478,-90.15037,4.0,62.0,1.0,"{'Alcohol': ""u'beer_and_wine'"", 'Caters': 'Fal...",Japanese,"{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...","State:Louisiana, City:Metairie, Address:2701 A...",1
1,HgEofz6qEQqKYPT7YLA34w,jfIwOEXcVRyhZjM4ISOh4g,1,How does a delivery person from here get lost ...,2021-01-02 00:19:00,,,,,,,,,,,"State:nan, City:nan, Address:nan",2
2,Kxo5d6EOnOE-vERwQf2a1w,S-VD26LE_LeJNx5nASk_pw,5,"The service is always good, the employees are ...",2021-01-26 18:01:45,Kings and Queens Liberian Cuisine,19082.0,39.960828,-75.262968,4.0,84.0,1.0,"{'NoiseLevel': ""u'average'"", 'DogsAllowed': 'F...",African,"{'Tuesday': '11:0-20:0', 'Wednesday': '11:0-17...","State:Pennsylvania, City:Upper Darby, Address:...",3
3,Kxo5d6EOnOE-vERwQf2a1w,S-VD26LE_LeJNx5nASk_pw,5,"The service is always good, the employees are ...",2021-01-26 18:01:45,Kings and Queens Liberian Cuisine,19082.0,39.960828,-75.262968,4.0,84.0,1.0,"{'NoiseLevel': ""u'average'"", 'DogsAllowed': 'F...",Halal,"{'Tuesday': '11:0-20:0', 'Wednesday': '11:0-17...","State:Pennsylvania, City:Upper Darby, Address:...",3
4,STqHwh6xd05bgS6FoAgRqw,yE1raqkLX7OZsjmX3qKIKg,5,two words: whipped. feta. \nexplosion of amazi...,2021-01-27 23:28:03,Butcher & Bee,37206.0,36.175896,-86.75682,4.0,863.0,1.0,"{'NoiseLevel': ""u'average'"", 'Alcohol': ""u'ful...",Middle Eastern,"{'Monday': '0:0-0:0', 'Tuesday': '17:0-21:30',...","State:Tennessee, City:Nashville, Address:902 M...",4


### Collaborative filtering Models
#### Neighborhood-Based Models

Here the tasks related to building a collaborative filtering recommendation system using the Surprise library are undertaken for collaborative filtering by selecting the relevant columns, importing the Surprise library, initializing a Reader object to specify the data format, and then loading the data into a Surprise Dataset object for further analysis and model building.

> Now, we will compare the different neighborhood-based models and see which ones perform best based on the RMSE metric, then compare the neighborhood-based model with the model-based models and pick the best model.

In [11]:
 #selecting specific columns that are relevant for collaborative filtering models
new_df = data[['user_id', 'business_id', 'rating']]

# using Reader() from surprise module to convert dataframe into surprise dataformat
# instantiating a readerobject
reader = Reader()

# using the reader to read the trainset
data_2 = Dataset.load_from_df(new_df,reader)

dataset = data_2.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of Restaurants: ', dataset.n_items)

Number of users:  220872 

Number of Restaurants:  31834


#### Model-Based Models

> First , we will model a baseline SVD() model using the default parameters.

In [13]:
# instantating the SVD model
svd = SVD()

# using cross-validate to get the test rmse scores for 5 splits
results=cross_validate(svd, data_2, cv=5, n_jobs=-1)


for values in results.items():
    print(values)
print("-------------------------")
print("Mean RMSE: ",results['test_rmse'].mean())

('test_rmse', array([1.21630593, 1.21693405, 1.21327197, 1.21998972, 1.21871319]))
('test_mae', array([0.9478778 , 0.94777215, 0.94557668, 0.95039208, 0.9487414 ]))
('fit_time', (13.62192964553833, 16.698712587356567, 9.084782361984253, 11.87769341468811, 8.370072364807129))
('test_time', (2.1538827419281006, 1.7114405632019043, 1.3569321632385254, 2.257051944732666, 0.8384842872619629))
-------------------------
Mean RMSE:  1.2170429697019367


Using the GridSearchCv we will tune the SVD model in order to improve the training RMSE scores.

In [14]:
# define a dictionary params with hyperparameter values to be tested
params = {'n_factors': [20, 50, 100], # number of factors for matrix factorization
         'reg_all': [0.02, 0.05, 0.1]} # regularization term
# create a GridSearchCV object 'g_s_svd' for hyperparameter tuning
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1) # specify the algorithm (SVD) to be tuned
# fit the GridSearchCV object to the data to find the best hyperparameters
g_s_svd.fit(data_2)

Here we perform hyperparameter tuning for the SVD collaborative filtering model using grid search and cross-validation. It tests different values of the number of latent factors (n_factors) and the regularization term (reg_all) to find the combination that results in the best model performance. The final best hyperparameters can be accessed from the g_s_svd object for use in the model.

In [15]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 1.2171066185237724, 'mae': 0.9478990451899783}
{'rmse': {'n_factors': 100, 'reg_all': 0.02}, 'mae': {'n_factors': 100, 'reg_all': 0.02}}


The RMSE value for the optimized SVD model is approximately 1.254, indicating the model's average prediction error in terms of user ratings. Lower RMSE values are desirable as they signify better predictive accuracy.                              
The MAE value for the optimized SVD model is approximately 1.01, representing the average absolute difference between predicted and actual user ratings. A lower MAE indicates improved prediction accuracy.                                            
The best-performing hyperparameter values are as follows:                       
1) For RMSE, the optimal hyperparameters are 'n_factors' = 20 and 'reg_all' = 0.05.
2) For MAE, the optimal hyperparameters are 'n_factors' = 20 and 'reg_all' = 0.02.   
These results indicate that the SVD collaborative filtering model, when configured with these hyperparameters, provides a relatively low prediction error and is well-suited for making personalized recommendations based on user ratings.

In [16]:
# created an instance of the SVD model with specified hyperparameters
svd = SVD(n_factors= 20, reg_all=0.02)
# fit the SVD model to the dataset
svd.fit(dataset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f21e74f10d0>

The code we just did initializes an SVD model with specific hyperparameters and then trains the model on the provided dataset. The trained SVD model can be used for various tasks, such as making personalized recommendations based on user-item interactions.

In [17]:
# using the model, we'll try and make a rating prediction of user 15, on restaurant with id "Pns2l4eNsfO8kk83dixA6A"
svd.predict("15", "Pns2l4eNsfO8kk83dixA6A")

Prediction(uid='15', iid='Pns2l4eNsfO8kk83dixA6A', r_ui=None, est=3.8457184924429697, details={'was_impossible': False})

> First before creating a collaborative filtering function, we will first create a function **restaurant_rater()** that suggests restaurants to users for them to input their rating in order based on the entered rating to offer recommendations using the SVD model since it cannot offer recommendations when the user has no data in the database (cold start problem) 

In [19]:
# define a function named 'restaurant_rater' that takes user inputs to rate restaurants

def restaurant_rater(data=data,num:int=3, location:str =None, category:str =None):
    
    """
    The functions takes the following inputs:
    data: DataFrame - a dataframe containing only rows of the unique business 
    num: int - number of ratings
    location: string - preferred location
    category: string - preferred category of restaurant
    
    Then randomly draws restaurant names from the dataframe for the user to rate
    """
    
    df_restaurant=data
    #assigning the rating user a user_id
    user_id=df_restaurant.user_id.max()+1                                               

    rating_df = pd.DataFrame()   # create an empty dataframe to store user rated restaurants
    # continue the loop until the desired num is collected
    while num > 0:
        # select a random restaurant that matches the specified location
        if location: 
            restaurant = df_restaurant[df_restaurant['location'].str.contains(location)].sample(1)
         # select a random restaurant that matches the specified category    
        elif category:
            restaurant = df_restaurant[df_restaurant['categories'].str.contains(category)].sample(1)
        else:  # or else selects a random restaurant
            restaurant = df_restaurant.sample(1)
        # prints the selected restaurant    
        print(tabulate(restaurant[['name','b/s_rating','categories']], headers='keys', tablefmt='fancy_grid', showindex=False))
        # asks for rating from user
        rating = input("How do you rate this restaurant on a scale of 1-5, Enter: ")
        
        # creating a function that checks the validity of the entered rating ie should be bewteen 1-5
        def checker(rating):
            if (len(rating)!= 0):
                while (float(rating)>5) :
                    print("Enter valid rating, scale of 1-5 or Enter")  
                    rating= input()
                return rating
            else: return rating
            
        # calling the function to confirm the selected rating 
        rating = checker(rating)
        if len(rating) == 0:                                        # if no rating is entered 
            num-=1                                                  # the jumps to select another restaurant
            continue
        else:
            restaurant.loc[:,('user_id')]= user_id                   # then the selected restaurant is assigned the user id
            restaurant.loc[:,('rating')]= rating
            rating_df=pd.concat([rating_df,restaurant], axis=0)   # the movie is added to our new user rated dataframe
            num-=1                                                  # then another restaurant is suggested till num==0
            # return the list of user ratings and restaurant information
    return rating_df

This function above allows a user to interactively rate restaurants by providing their ratings for a specified number of restaurants, and it collects this information in a list for further analysis or use in a recommendation system. The code also considers the restaurant category for selecting restaurants to rate if a category is provided.

In [21]:
data['categories'] = data['categories'].fillna('')

In [25]:
data['categories'].head()

0          Japanese
1                  
2           African
3             Halal
4    Middle Eastern
Name: categories, dtype: object

In [53]:
# rating 4  restaurants that have sandwiches on their menu
restaurant_rater( num=4, category='African')

╒════════════╤══════════════╤══════════════╕
│ name       │   b/s_rating │ categories   │
╞════════════╪══════════════╪══════════════╡
│ Addis Nola │          4.5 │ African      │
╘════════════╧══════════════╧══════════════╛
╒══════════════════════╤══════════════╤══════════════╕
│ name                 │   b/s_rating │ categories   │
╞══════════════════════╪══════════════╪══════════════╡
│ Bennachin Restaurant │            4 │ African      │
╘══════════════════════╧══════════════╧══════════════╛
╒════════════════╤══════════════╤══════════════╕
│ name           │   b/s_rating │ categories   │
╞════════════════╪══════════════╪══════════════╡
│ The Funky Monk │            3 │ African      │
╘════════════════╧══════════════╧══════════════╛
╒════════════════╤══════════════╤══════════════╕
│ name           │   b/s_rating │ categories   │
╞════════════════╪══════════════╪══════════════╡
│ The Floribbean │          4.5 │ African      │
╘════════════════╧══════════════╧══════════════╛


Unnamed: 0,review_id,business_id,rating,text,date,name,postal_code,latitude,longitude,b/s_rating,review_count,is_open,attributes,categories,hours,location,user_id,userId,restId
200142,4Cj5hC_BnoWtwjZgf6P5rQ,cIzp7QZOaUyRLl3ZcwsrHw,2,Amazing. Even if you think you don't like raw ...,2021-03-01 00:35:18,Addis Nola,70119,29.962119,-90.089907,4.5,152.0,1.0,"{'BusinessParking': ""{'garage': False, 'street...",African,"{'Monday': '0:0-0:0', 'Tuesday': '11:0-21:0', ...","State:Louisiana, City:New Orleans, Address:424...",220873,81311,20088
531793,aSZwmqa4qnuQEBezZSd-gw,k1yDRDZ4QCvNW9Wm9-IOaA,3,What a supreme spot in the sense that it is sm...,2021-02-24 22:47:31,Bennachin Restaurant,70116,29.962183,-90.060791,4.0,430.0,1.0,"{'HasTV': 'True', 'NoiseLevel': ""u'average'"", ...",African,"{'Wednesday': '11:0-20:0', 'Thursday': '11:0-2...","State:Louisiana, City:New Orleans, Address:121...",220873,9445,24008
304604,t9c5kTwEB8158smCATvV0w,52wBQghQ0jwzSuvLvwe5rA,4,So I'll start out with the good. The sandwich ...,2021-03-31 07:36:40,The Funky Monk,85701,32.221958,-110.966058,3.0,58.0,1.0,"{'RestaurantsAttire': ""'casual'"", 'WiFi': ""u'f...",African,"{'Monday': '0:0-0:0', 'Tuesday': '17:0-23:0', ...","State:Arizona, City:Tucson, Address:350 Congre...",220873,73647,3134
374598,vRm39n6a82smP6obBTAsrQ,QUyLaPjsoZiRJ-RUBVa5rA,5,We came to have lunch with a friend who is loc...,2021-07-21 19:13:51,The Floribbean,33712,27.770781,-82.666049,4.5,144.0,1.0,"{'RestaurantsTableService': 'False', 'NoiseLev...",African,"{'Monday': '0:0-0:0', 'Tuesday': '11:30-20:0',...","State:Florida, City:St. Petersburg, Address:24...",220873,142,13770


In [54]:
# rating ratsurants from PA
restaurant_rater(location='PA')

╒════════╤══════════════╤════════════════════════╕
│ name   │   b/s_rating │ categories             │
╞════════╪══════════════╪════════════════════════╡
│ IHOP   │          2.5 │ American (Traditional) │
╘════════╧══════════════╧════════════════════════╛
╒════════╤══════════════╤════════════════════════╕
│ name   │   b/s_rating │ categories             │
╞════════╪══════════════╪════════════════════════╡
│ IHOP   │          1.5 │ American (Traditional) │
╘════════╧══════════════╧════════════════════════╛
╒═════════════╤══════════════╤════════════════════════╕
│ name        │   b/s_rating │ categories             │
╞═════════════╪══════════════╪════════════════════════╡
│ Red Lobster │            3 │ American (Traditional) │
╘═════════════╧══════════════╧════════════════════════╛


Unnamed: 0,review_id,business_id,rating,text,date,name,postal_code,latitude,longitude,b/s_rating,review_count,is_open,attributes,categories,hours,location,user_id,userId,restId
235641,eZHlkcCFrUb8ToxuNv4aug,YLYLVY1HuQG1IvYjXyHzww,2,We are ready to eat and then we see the big mo...,2021-09-28 13:18:11,IHOP,33607,27.95926,-82.526261,2.5,106.0,1.0,"{'RestaurantsGoodForGroups': 'True', 'Restaura...",American (Traditional),"{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...","State:Florida, City:TAMPA, Address:4910 Spruce...",220873,120415,17650
249159,Qo8PjtNKOLje41T6mMkRWQ,DhLIjn4oZHB0qzdlM5baFA,3,We waited 10min to get seated but a whole hour...,2021-06-06 16:39:53,IHOP,33612,28.03218,-82.420407,1.5,68.0,1.0,"{'OutdoorSeating': 'False', 'Alcohol': ""u'none...",American (Traditional),"{'Monday': '7:0-22:0', 'Tuesday': '7:0-22:0', ...","State:Florida, City:TAMPA, Address:3501 E Busc...",220873,125765,7443
78509,dWOwWMg7npbnzLQEHReBXA,-xtnwq4VBA2XFobjDGz0Ww,4,Ate at Red Lobster by the Park Mall. Had the w...,2021-05-25 17:01:49,Red Lobster,85711,32.221079,-110.867993,3.0,124.0,1.0,"{'RestaurantsGoodForGroups': 'True', 'HasTV': ...",American (Traditional),"{'Monday': '0:0-0:0', 'Tuesday': '11:0-20:0', ...","State:Arizona, City:Tucson, Address:5870 E Bro...",220873,49019,511


> With the defined rated() function, we proceed to create a collaborative filtering function while using our SVD model. The function when the user has not entered any rating to the suggested restaurants (cold start problem). Then the function will recommend using the content-based system, solving the cold start problem.

In [29]:
# creating a folium_map function that displays restaurant lovations

def folium_map(data):
    """
    The function takes in a dataframe and using the latitude and longitude columns displays a map showing the locations of 
    all the restaurants available in the input data
    """
    # reseting the index in the input dataframe
    dff=data.reset_index(drop=True)


# Set up center latitude and longitude
    center_lat = dff['latitude'][0]
    center_long = dff['longitude'][0]

# Initialize map with center lat and long
    map_ =folium.Map([center_lat,center_long], zoom_start=7)

# Adjust this limit to see more or fewer businesses
    limit=dff.shape[0]
    print(f"{limit-1} Restaurant Locations")
    for index in range(limit-1):
        # Extract information about business
        lat = dff.loc[index,'latitude']
        long = dff.loc[index,'longitude']
        name = dff.loc[index,'name']
        rating = dff.loc[index,'b/s_rating']
        location = dff.loc[index,'location']
        details = "{}\nStars: {} {}".format(name,rating,location)

# Create popup with relevant details
        popup = folium.Popup(details,parse_html=True)

# Create marker with relevant lat/long and popup
        marker = folium.Marker(location=[lat,long], popup=popup)

        marker.add_to(map_)

    return display(map_)  # returning a map display

In [31]:
def content_based(df=data, name:str= None , rating:int =1, num:int=5, text: str=None, location:str = None):
    """
    The function takes the following input;
    
    df: DataFrame - a dataframe containing unique resturants
    name: str - name of restaurant to recommend similar restaurants
    num:int - number of restaurants to recommend
    location: string - preferred location
    rating: string - preferred rating of restaurant
    text: - User preferences inform of text
    
    Then based on the input parameters offers similar restaurants according to the input parameters to users
    """
    
    if name:
        index_=df.loc[df.name== name].index[0]                          # find the index of the input name
        sim=list(enumerate(cosine_similarity[index_]))                  # extract similarity vector of that name index
        sim=sorted(sim, key=lambda x: x[1], reverse=True)[1:num+1]      # arrange the vector values in ascending order
        indices= [i[0] for i in sim]                                    # Extract the indices of the top high scores
        print(f"Top {num} Restaurants Like [{name}]")
        
        # if the location parameter is passed then the dataframe is filtered based on the input location
        if location:                                                
            df=df.loc[ (df['b/s_rating']>=rating) & ( df.location.str.contains(location))]
            folium_map(df)
        else: 
            df= df.loc[ (df['b/s_rating']>=rating) ] 
        # filtering the data based on the selected indices    
        df=df.loc[indices,('name','b/s_rating','review_count','location')].sort_values('b/s_rating', ascending=False)
        return  df.reset_index(drop=True)
    
    # if the name is None then switch to other parameters
    else:
        # if the text has a passed input values then this if statement runs            
        if text: 
                text=text.lower()                                           # converting the text into lowercase
                tokens=stem_and_tokenize(text)                              # tokenizing and stemming the words
                tokens=[ word for word in tokens if word not in stopwords]  # removing stopwords
                text_set=set(tokens)                                        # taking only unique words
                
                if location: # using entered location to filter the data
                    df=df.loc[ (df.location.str.contains(location)) & (df['b/s_rating']>=rating)].reset_index(drop=True)

                vectors=[] # creating an emplty list to append the intersection values
                for words in df.details:                                     # looping over the text in the details column
                    words=words.lower()                                      # lowering the text
                    words=stem_and_tokenize(text)                            # tokenizing and stemming the words
                    words=[ word for word in tokens if word not in stopwords] # removing stopwords
                    words=set(words)                                         # taking only unique words
                    vector=text_set.intersection(words)                      # checking for intersection with entered text 
                    vectors.append(len(vector))                              # appending value to vectors list
                    
                vectors=sorted(list(enumerate(vectors)), key= lambda x: x[1], reverse=True)[:num] # sorting the list in desc
                indices= [i[0] for i in vectors]                                         # selecting indices of top values
                print(f"Top {num} Best Restaurants Based on entered text:")
                # using the indices fileter the dataframe 
                df=df.loc[indices].sort_values(by=['b/s_rating','review_count'],ascending=False)
                if location: folium_map(df)                                   # calling the folim_map of the selected values
                return df[['name','b/s_rating','review_count','location']].reset_index(drop=True) # offering recommendations
        
        # the if only location is entered as a parameter then the top businesses in that location are recommended
        if location:
            df=df.loc[ df.location.str.contains(location)& (df['b/s_rating']>=rating)] #filtering dataframe
            df=df.sort_values(['review_count','b/s_rating'])[:num]     # sorting in descending order
            folium_map(data=df)
            return df[['name','b/s_rating','review_count','location']].reset_index(drop=True) # offering recommendations
         
        # if both the name, text and location are None the most popular restaurants are recommended
        else:                
            df=df.loc[data['b/s_rating']>=rating].sort_values(by=['review_count','b/s_rating'],ascending=False)[:num]
            if location: folium_map(data=df)
            print("Most Popular Restaurants")
            return df[['name','b/s_rating','review_count','location']].reset_index(drop=True)
    
    

In [33]:
def cf_model(df=data,num:int=3, location:str=None, name=None , text=None):
    """
    The function takes the following inputs;
    
    df: DataFrame - a dataframe containing unique restaurants
    name: str - name of restaurant to recomend similar restaurants
    num:int - number of restaurants to recommend
    location: string - preferred location
    category: string - preferred category of restaurant
    text: - User preferences in form of text
    
    The function then takes user ratings and appends them to the dataframe then fits this new dataframe to the SVD model
    Then predicts this user rating on all the restaurants in the dataframe then selects the top rating predictions and 
    recommends those restaurants
    """
    
    # calling the rater function for user to enter restaurant ratings
    user_ratings=restaurant_rater(num=num, location=location)
    
    # when the user ratings come back blank, ie no ratings given ie cold start problem
    # then the content-based method is called, which then makes the recommendations
    if len(user_ratings)==0:
        return content_based(df=df,num=num,name=name,location=location,text=text,)
    
    # then add the user ratings to our df
    df=pd.concat([df,user_ratings],axis=0)

    # convert the new dataset into surprise format
    dataset = Dataset.load_from_df(df[['user_id','business_id','rating']],reader)
    
    # then fit the surprise data to the SVD
    svd = SVD(n_factors= 20, reg_all=0.02)
    svd.fit(dataset.build_full_trainset())
    
    # extract the user rating in the ratings dataframe
    user_id=user_ratings['user_id'].values[0]
    
    # select a random restaurant that matches the specified category
    if location: 
            df = df.loc[df['location'].str.contains(location)]
        
    #create an empty list to append the model predictions
    user_predictions=[]
    
    # loopin over all the unique restaurant ids in the dataframe and appending the predictions to user prediction list
    for  iid in df.business_id.unique():
        user_predictions.append( (iid , svd.predict(user_id, iid)[3]))
    
    # sorting the predictions in descending order of the predictions values
    top_pred = sorted(user_predictions , key =lambda x: x[1], reverse=True)
    
    # selecting the top 'num'(number of predictions) prediction indicies
    indices=[i[0] for i in top_pred[:num]]  
    
    #using the extracted indices extract the restaurant titles
    rec = df.loc[ df['business_id'].isin(indices)].sort_values('b/s_rating', ascending=False)
    display(folium_map(rec))
    
    #then retun the resturants details to user
    return rec[['name','b/s_rating','location']].reset_index(drop=True)  

In [55]:
# based on default parameters
cf_model()

╒═══════════╤══════════════╤══════════════╕
│ name      │   b/s_rating │ categories   │
╞═══════════╪══════════════╪══════════════╡
│ Umai Umai │          4.5 │ Asian Fusion │
╘═══════════╧══════════════╧══════════════╛


╒════════╤══════════════╤══════════════╕
│   name │   b/s_rating │ categories   │
╞════════╪══════════════╪══════════════╡
│    nan │          nan │              │
╘════════╧══════════════╧══════════════╛
╒═════════╤══════════════╤══════════════╕
│ name    │   b/s_rating │ categories   │
╞═════════╪══════════════╪══════════════╡
│ Pho Tay │          3.5 │ Vietnamese   │
╘═════════╧══════════════╧══════════════╛
129 Restaurant Locations


ValueError: Location values cannot contain NaNs.

In [56]:
# offering recommendations based on a specified location

cf_model(num=5,location='LA')

╒════════╤══════════════╤════════════════════════╕
│ name   │   b/s_rating │ categories             │
╞════════╪══════════════╪════════════════════════╡
│ IHOP   │          2.5 │ American (Traditional) │
╘════════╧══════════════╧════════════════════════╛
╒════════════════════════════╤══════════════╤════════════════════════╕
│ name                       │   b/s_rating │ categories             │
╞════════════════════════════╪══════════════╪════════════════════════╡
│ Jay's Steak & Hoagie Joint │          4.5 │ American (Traditional) │
╘════════════════════════════╧══════════════╧════════════════════════╛
╒═════════════╤══════════════╤════════════════════════╕
│ name        │   b/s_rating │ categories             │
╞═════════════╪══════════════╪════════════════════════╡
│ Red Lobster │            3 │ American (Traditional) │
╘═════════════╧══════════════╧════════════════════════╛
╒════════════════════════════╤══════════════╤════════════════════════╕
│ name                       │   b/s_ra

None

Unnamed: 0,name,b/s_rating,location
0,Jay's Steak & Hoagie Joint,4.5,"State:Pennsylvania, City:LANGHORNE, Address:12..."
1,Jay's Steak & Hoagie Joint,4.5,"State:Pennsylvania, City:LANGHORNE, Address:12..."
2,Jay's Steak & Hoagie Joint,4.5,"State:Pennsylvania, City:LANGHORNE, Address:12..."
3,Jay's Steak & Hoagie Joint,4.5,"State:Pennsylvania, City:LANGHORNE, Address:12..."
4,Jay's Steak & Hoagie Joint,4.5,"State:Pennsylvania, City:LANGHORNE, Address:12..."
5,Jay's Steak & Hoagie Joint,4.5,"State:Pennsylvania, City:LANGHORNE, Address:12..."
6,Jay's Steak & Hoagie Joint,4.5,"State:Pennsylvania, City:LANGHORNE, Address:12..."
7,Jay's Steak & Hoagie Joint,4.5,"State:Pennsylvania, City:LANGHORNE, Address:12..."
8,Jay's Steak & Hoagie Joint,4.5,"State:Pennsylvania, City:LANGHORNE, Address:12..."
9,Jay's Steak & Hoagie Joint,4.5,"State:Pennsylvania, City:LANGHORNE, Address:12..."


In [36]:
# using the cf model to offer content-based recommendations
# and not inputting any ratings on the suggected restaurants
cf_model(location='New Orleans',text="restaurant with delicious crabs and nice outdoor setting")

╒═══════════════╤══════════════╤══════════════╕
│ name          │   b/s_rating │ categories   │
╞═══════════════╪══════════════╪══════════════╡
│ Cajun Seafood │            4 │ French       │
╘═══════════════╧══════════════╧══════════════╛
╒══════════════════════════╤══════════════╤══════════════╕
│ name                     │   b/s_rating │ categories   │
╞══════════════════════════╪══════════════╪══════════════╡
│ The Original Italian Pie │            3 │ Italian      │
╘══════════════════════════╧══════════════╧══════════════╛
╒═════════════════════════════════╤══════════════╤══════════════╕
│ name                            │   b/s_rating │ categories   │
╞═════════════════════════════════╪══════════════╪══════════════╡
│ Coterie Restaurant & Oyster Bar │            4 │ Southern     │
╘═════════════════════════════════╧══════════════╧══════════════╛
405 Restaurant Locations


None

Unnamed: 0,name,b/s_rating,location
0,Olive,5.0,"State:Louisiana, City:New Orleans, Address:339..."
1,Olive,5.0,"State:Louisiana, City:New Orleans, Address:339..."
2,Olive,5.0,"State:Louisiana, City:New Orleans, Address:339..."
3,Olive,5.0,"State:Louisiana, City:New Orleans, Address:339..."
4,Olive,5.0,"State:Louisiana, City:New Orleans, Address:339..."
...,...,...,...
401,Brigtsen's Restaurant,4.5,"State:Louisiana, City:New Orleans, Address:723..."
402,Brigtsen's Restaurant,4.5,"State:Louisiana, City:New Orleans, Address:723..."
403,Brigtsen's Restaurant,4.5,"State:Louisiana, City:New Orleans, Address:723..."
404,Brigtsen's Restaurant,4.5,"State:Louisiana, City:New Orleans, Address:723..."


### Neural Networks - Model

We will run a Keras deep neural network to implement a recommendation system and try to improve our RMSE scores by using neural networks.

> We are going to encode the user_id and business_id features into numeric integers in preparation for the deep learning model.

In [37]:
# Encoding the user_id column
user_encoder = LabelEncoder()                                    # instantiating the encoder
data['userId'] = user_encoder.fit_transform(data.user_id.values) # fitting and transforming the encoder to our column
n_users=data['userId'].nunique()                                 # assigning the number of users to n_user vaiable
print("Number of Users: ",n_users)

# Encoding the business_id column
item_encoder = LabelEncoder()                                          # instantiating the encoder
data['restId'] = user_encoder.fit_transform(data.business_id.values)   # fitting and transforming the encoder to our column
n_rests = data['restId'].nunique()                                  # assigning the number of restaurants to n_rests vaiable
print("Number of Restaurants: ",n_rests)

Number of Users:  220872
Number of Restaurants:  31834


> Splitting the data into training and testing sets for model evaluation.

In [38]:
# subsetting the x variable
X = data[['userId', 'restId']].values
# subsetting the y variable
y = data['rating'].values

# creating the train test splits and stratifying on basis of the y values 
# because of the uneven nature of the rating counts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(428157, 2) (428157,)
(107040, 2) (107040,)


> Calculate the minimum and maximum ratings, which will be used to scale the output of the neural network later.

In [39]:
# Find the minimum and maximum rating
min_rating = min(data['rating'])
max_rating = max(data['rating'])

> The predicted ratings is calculated by multiplying the user and restaurant embeddings, then adding the user and restaurant bias. Therefore were are going to create user and restaurant embeddings together with bias.

In [40]:
# Number of latent factors
embedding_size = 50

> Defining user embedding

In [41]:
# User embeddings

# user input layer
user = layers.Input(shape=(1,))

# Embedding layer for calculating user latent factors of size 50
user_emb = layers.Embedding(n_users, embedding_size, embeddings_regularizer=regularizers.l2(1e-6))(user)

# Reshaping the layer to flatten the embedding vector.
user_emb = layers.Reshape((embedding_size,))(user_emb)

> Defining user bias, and reshape it.

In [42]:
# User bias

# Embedding layer
user_bias = layers.Embedding(n_users, 1, embeddings_regularizer=regularizers.l2(1e-6))(user)

# Reshapin the user bias layer
user_bias = layers.Reshape((1,))(user_bias)

> Defining restaurants embeddings

In [43]:
# restaurant embeddings

# Input layer
restaurant= layers.Input(shape=(1,))

# Embedding layer
rest_emb = layers.Embedding(n_rests, embedding_size, embeddings_regularizer=regularizers.l2(1e-6))(restaurant)

# Reshape layer
rest_emb = layers.Reshape((embedding_size,))(rest_emb)

> Defining restaurant bias, and reshape it.

In [44]:
# Restaurant bias

# Embedding layer
rest_bias = layers.Embedding(n_rests, 1, embeddings_regularizer=regularizers.l2(1e-6))(restaurant)

# Reshape layer
rest_bias = layers.Reshape((1,))(rest_bias)

> After defining the embedding and bias layers, the predicted rating is calculated by dot product of the user and restaurant embeddings and then adding the bias values in order to get more accurate ratings.

In [45]:
# Dot product of the user and restaurant embeddings
rating = layers.Concatenate()([user_emb, rest_emb])

# Add biases to the ratings
# Adding the user and restaurant bias to the predicted rating
rating = layers.Add()([rating, user_bias, rest_bias])

> We move on to pass the calculated rating to layers of dense networks and finally converting the rating score from binary values into a range of 1-5. 

We create our baseline model.

In [48]:

# first dense layer of 30 nodes with relu activation
rating = layers.Dense(30, activation='relu')(rating)

# second dense layer of 15 nodes
rating = layers.Dense(15, activation='relu')(rating)

# output layer with one node that produces values between 0 and 1 due to the sigmoid activation
rating = layers.Dense(1, activation='sigmoid')(rating)
# rating= layers.Dense(5, activation='softmax')(rating)

# Scales the predicted ratings to a range of 1 - 5
rating = layers.Lambda(lambda x:x*(max_rating - min_rating) + min_rating)(rating)


# Baseline Model 
baseline_model = models.Model([user, restaurant], rating)

# Compile the model
baseline_model.compile( optimizer='sgd', loss='mse',  metrics=[metrics.RootMeanSquaredError()])

# training the model
baseline_model .fit(x=[X_train[:,0], X_train[:,1]], y=y_train,
                    batch_size=256, 
                    epochs=10, 
                    verbose=1,
                    validation_data=([X_test[:,0], X_test[:,1]], y_test))

Epoch 1/10
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m246s[0m 146ms/step - loss: 2.2653 - root_mean_squared_error: 1.5015 - val_loss: 2.2630 - val_root_mean_squared_error: 1.5007
Epoch 2/10
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m248s[0m 148ms/step - loss: 2.2557 - root_mean_squared_error: 1.4983 - val_loss: 2.2616 - val_root_mean_squared_error: 1.5003
Epoch 3/10
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m245s[0m 138ms/step - loss: 2.2602 - root_mean_squared_error: 1.4998 - val_loss: 2.2623 - val_root_mean_squared_error: 1.5005
Epoch 4/10
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m341s[0m 185ms/step - loss: 2.2639 - root_mean_squared_error: 1.5011 - val_loss: 2.2614 - val_root_mean_squared_error: 1.5002
Epoch 5/10
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m275s[0m 157ms/step - loss: 2.2591 - root_mean_squared_error: 1.4994 - val_loss: 2.2607 - val_root_mean_squared_error: 1.5000
Epoch 6/10

<keras.src.callbacks.history.History at 0x7f21fe2e3950>

> Our baseline model, does not overfit since the training RMSE score and the validation scores are not far off. We then proceed to tune the model in order to get better rmse scores, by reducing the model complexity.

In [50]:

rating = layers.Concatenate()([user_emb, rest_emb])
rating = layers.Add()([rating, user_bias, rest_bias])

# redusing the first dense layer into 15 neurons and adding a l2 regularization
rating = layers.Dense(15, activation='relu',kernel_regularizer=regularizers.l2(1e-3))(rating)
# creating a dropout layer
rating = layers.Dropout(0.3)(rating)
# output layer
rating = layers.Dense(1, activation='sigmoid')(rating)
#convertion of output rating
rating = layers.Lambda(lambda x:x*(max_rating - min_rating) + min_rating)(rating)

model_1 = models.Model([user, restaurant], rating)

# Compile the model
model_1.compile( optimizer='sgd', loss='mse',  metrics=[metrics.RootMeanSquaredError()])

# Train the model
model_1.fit(x=[X_train[:,0], X_train[:,1]], y=y_train,
            batch_size=256,
            epochs=20, 
            verbose=1,
            validation_data=([X_test[:,0], X_test[:,1]], y_test))

Epoch 1/20


[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m313s[0m 184ms/step - loss: 2.1169 - root_mean_squared_error: 1.4413 - val_loss: 2.0627 - val_root_mean_squared_error: 1.4232
Epoch 2/20
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m256s[0m 153ms/step - loss: 1.8676 - root_mean_squared_error: 1.3530 - val_loss: 2.0022 - val_root_mean_squared_error: 1.4022
Epoch 3/20
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m321s[0m 188ms/step - loss: 1.7143 - root_mean_squared_error: 1.2955 - val_loss: 1.9576 - val_root_mean_squared_error: 1.3865
Epoch 4/20
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m308s[0m 180ms/step - loss: 1.5578 - root_mean_squared_error: 1.2340 - val_loss: 1.9376 - val_root_mean_squared_error: 1.3795
Epoch 5/20
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m224s[0m 134ms/step - loss: 1.4274 - root_mean_squared_error: 1.1802 - val_loss: 1.9344 - val_root_mean_squared_error: 1.3785
Epoch 6/20
[1m1673/1

<keras.src.callbacks.history.History at 0x7f21f98c1b50>

> The second model has performed worse than the first with a higher rmse score and the model is overfitting the training data i.e it has a good train score but poor validation score.

we will try and simplify the model further. 

In [51]:

rating = layers.Concatenate()([user_emb, rest_emb])
# Adds the user and restaurant embedding to the dot product of the embeddings
rating = layers.Add()([rating, user_bias, rest_bias])

# reducing the first layer further to 10 node
rating = layers.Dense(10, activation='relu')(rating)
# increasing the dropout rate to 0.2
rating = layers.Dropout(0.6)(rating)
# output layer
rating = layers.Dense(1, activation='sigmoid')(rating)
# conertion of output rating
rating = layers.Lambda(lambda x:x*(max_rating - min_rating) + min_rating)(rating)

model_2 = models.Model([user, restaurant], rating)

# Compile the model
model_2.compile( optimizer= 'sgd',
                loss='mse', 
                metrics= [metrics.RootMeanSquaredError()])

# Train the model
model_2.fit(x=[X_train[:,0], X_train[:,1]], y=y_train,
            batch_size=256, 
            epochs=20, 
            verbose=1,
            validation_data=([X_test[:,0], X_test[:,1]], y_test))

Epoch 1/20
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m237s[0m 141ms/step - loss: 1.6052 - root_mean_squared_error: 1.2586 - val_loss: 1.9195 - val_root_mean_squared_error: 1.3816
Epoch 2/20
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m250s[0m 133ms/step - loss: 1.1682 - root_mean_squared_error: 1.0758 - val_loss: 1.9942 - val_root_mean_squared_error: 1.4083
Epoch 3/20
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m263s[0m 134ms/step - loss: 1.1277 - root_mean_squared_error: 1.0568 - val_loss: 2.0187 - val_root_mean_squared_error: 1.4170
Epoch 4/20
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m261s[0m 133ms/step - loss: 1.0749 - root_mean_squared_error: 1.0315 - val_loss: 2.0266 - val_root_mean_squared_error: 1.4198
Epoch 5/20
[1m1673/1673[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m262s[0m 133ms/step - loss: 1.0261 - root_mean_squared_error: 1.0076 - val_loss: 2.0196 - val_root_mean_squared_error: 1.4173
Epoch 6/20

<keras.src.callbacks.history.History at 0x7f22018e22d0>

> The third model has further overfitted the training data as it has high validation score and low training score.
Therefore our best neural model is baseline model which has a validation score of 1.3179.

In [52]:
# evaluating the best model on the training data
print("Training data: ")
print(baseline_model.evaluate([X_train[:,0], X_train[:,1]], y_train))

# evaluating the best model on the test data
print("Testing data: ")
print(baseline_model.evaluate([X_test[:,0], X_test[:,1]], y_test))

Training data: 
[1m13380/13380[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m359s[0m 27ms/step - loss: 1.1958 - root_mean_squared_error: 1.0885
[1.1989389657974243, 1.089964509010315]
Testing data: 
[1m3345/3345[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m90s[0m 27ms/step - loss: 1.7751 - root_mean_squared_error: 1.3282
[1.779281497001648, 1.3298012018203735]


> The baseline model has a training RMSE of 1.1635 and a test RMSE of 1.302 hence being our better neural networks model with the lowest test scores.

In all the models SVD has emerged to be the best RMSE score of 1.25