***
## 4. MODELLING
***

In this section we build a recommendation system from the datasets to address our main problem.
There are several types of recommendation models; in this project we focus on three:

1. Content-Based Recommender Systems
2. Collaborative Filtering Systems
3. Deep Neural Networks

Within each of these categories we compare different models to see which perform best. For validation and comparison we use the root mean squared error (RMSE), which summarizes how far the predictions fall from the true values (lower is better).
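
As a quick illustration of the metric, RMSE is the square root of the mean squared difference between predicted and true ratings. The sketch below computes it with NumPy; the `rmse` helper and the rating arrays are made up for illustration and are not taken from the project data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length rating arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical ratings, for illustration only
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 2.5]

print(rmse(true_ratings, predicted_ratings))
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones.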

In [1]:
# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

# Core libraries
import pickle
import random
import numpy as np
import pandas as pd

# Text processing 
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

# Machine learning and model selection
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from surprise import Dataset, Reader, SVD, accuracy, NormalPredictor, NMF
# Note: this import rebinds train_test_split to Surprise's version, shadowing the scikit-learn import above
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

# Deep learning with TensorFlow
from tensorflow.keras import models, layers, optimizers, losses, regularizers, metrics

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import folium

# Utility functions
from tabulate import tabulate

# Custom imports
from classes.understanding import DataLoader


***
### 1. CONTENT BASED FILTERING
***
To perform content-based filtering, we used the restaurant dataset.

Restaurant features such as the types of cuisine offered and attributes such as WiFi, Alcohol, Happy Hour, Noise Level, Restaurants Attire, Wheelchair Accessible and Restaurants TableService provide the information needed to compute cosine similarity and recommend the most similar restaurants.

In [2]:
# Loading the restaurant data from the pickled file
df = pd.read_pickle('pickled_files/restaurants_data.pkl')

# Overview of dataset information to understand the features we require
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38528 entries, 2 to 52285
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   business_id      38528 non-null  object 
 1   name             38528 non-null  object 
 2   address          38528 non-null  object 
 3   city             38528 non-null  object 
 4   state            38528 non-null  object 
 5   postal_code      38528 non-null  object 
 6   latitude         38528 non-null  float64
 7   longitude        38528 non-null  float64
 8   stars            38528 non-null  float64
 9   review_count     38528 non-null  int64  
 10  is_open          38528 non-null  int64  
 11  attributes       38528 non-null  object 
 12  categories       38528 non-null  object 
 13  hours            38528 non-null  object 
 14  location         38528 non-null  object 
 15  attributes_true  38528 non-null  object 
dtypes: float64(3), int64(2), object(11)
memory usage: 5.0+ MB


In [3]:
# Preprocessing function
def preprocess(df):
    """
    Function to preprocess the data to combine the needed features into one column
    Returns a dataframe with the combined_features columns
    """
    filtered_df=df.copy()
    # Combining the features into one column
    filtered_df['combined_features'] = (
                                        filtered_df['attributes'] + " " +
                                        filtered_df['attributes_true'] 
                                        )
    # resetting the index
    filtered_df = filtered_df.reset_index(drop=True)

    # Return the filtered DataFrame
    return filtered_df

In [4]:
# Vectorization function
def create_feature_vectors(df):
    """
    Performing vectorization of the preprocessed categorical features 
    and combining with the numerical features
    """
    # Vectorize the combined text features
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(df['combined_features'])
    
    # Combine the TF-IDF matrix with numerical columns
    numerical_features = df[['stars']].values
    combined_features = np.hstack((tfidf_matrix.toarray(), numerical_features))
    
    return combined_features


Using the cosine similarity matrix, we now build a content-based recommendation system that offers recommendations based on either a restaurant name or a desired cuisine/category.

The cosine similarity matrix compares the chosen restaurant with the other restaurants in the target state, and the top n most similar restaurants are recommended to the user.
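
The idea can be sketched on a toy example: vectorize each restaurant's attribute string with TF-IDF, compute pairwise cosine similarity, and rank the other restaurants by their similarity to the chosen one. The restaurant names, attribute strings and the `top_n_similar` helper below are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical attribute strings, one per restaurant (illustration only)
names = ["A", "B", "C", "D"]
features = [
    "WiFi OutdoorSeating Alcohol",
    "WiFi OutdoorSeating HappyHour",
    "DriveThru RestaurantsTakeOut",
    "Alcohol HappyHour DriveThru",
]

# Rows are restaurants, columns are TF-IDF weighted attribute tokens
tfidf_matrix = TfidfVectorizer().fit_transform(features)

# sim[i, j] is the cosine similarity between restaurants i and j
sim = cosine_similarity(tfidf_matrix)

def top_n_similar(idx, n=2):
    """Return the names of the n restaurants most similar to restaurant idx."""
    order = np.argsort(-sim[idx])               # most similar first
    order = [i for i in order if i != idx]      # drop the restaurant itself
    return [names[i] for i in order[:n]]

print(top_n_similar(0))
```

Restaurant A shares two attributes with B and one with D, so B ranks ahead of D, while C shares none and is left out.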

In [5]:
# Recommendation function
def recommendation(df, state, name=None, category=None):
    """
    Creates recommendation based on name or category/cuisine using cosine similarity and filtering
    Returns a dataframe containing name, state, city, address, stars and categories
    """
    preprocessed = preprocess(df)
    
    def cuisines(cuisine=None, state=state):
        """
        Function to filter to get the recommendations based on cuisine input
        """
        preprocessed=df[df["state"]==state]
        cuisine_df = preprocessed[preprocessed['categories'] == cuisine]
        cuisine_df_sorted = cuisine_df.sort_values(by=["stars", "city"], ascending=False)
        return cuisine_df_sorted[['name', 'state', 'city', 'stars', 'address', 'categories']]
    
    if name:
        if name not in preprocessed['name'].values:
            raise ValueError(f"Restaurant with name '{name}' not found in the filtered data.")

        # Finding the index of the restaurant name
        idx = preprocessed[preprocessed['name'] == name].index[0]
        exclude_names = [name]

        # Locating the restaurant row in the preprocessed df 
        row_to_add = preprocessed.iloc[idx]
        
        # converting it to a df
        row_to_add_df = pd.DataFrame([row_to_add])

        # generating a df for only the state we want to recommend in
        specific_state = preprocessed[preprocessed["state"] == state]

        # concatenating it to the specific state df and resetting the index
        specific_state = pd.concat([specific_state, row_to_add_df]).reset_index(drop=True)
        
        # Finding the new index for the restaurant name
        idx = specific_state[specific_state['name'] == name].index[0]
        
        # Creating feature vectors
        combined_features = create_feature_vectors(specific_state)

        # Finding the cosine similarity
        cosine_sim = cosine_similarity(combined_features, combined_features)

        # Finding the top indices of the restaurants to recommend
        sim_scores = list(enumerate(cosine_sim[idx]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        top_indices = [i[0] for i in sim_scores]  

        # Finding the rows of the top recommended restaurants
        recommended_restaurants = specific_state.iloc[top_indices]
        recommended_restaurants = recommended_restaurants[~recommended_restaurants['name'].isin(exclude_names)]        

        # Return a df with the required features
        return recommended_restaurants[['name', 'state', 'city', 'stars', 'address','categories']].drop_duplicates(subset='name')[:20]
    
    elif category:
        # Filter based on cuisine/category
        return cuisines(category)

The recommendation function uses content-based techniques to recommend restaurants from either a restaurant name or a cuisine choice, restricted to a given state.

In [6]:
# Getting a random restaurant name
random_name = df['name'].sample(n=1).values[0]
print("Random Restaurant Name:", random_name)

# Information on sampled restaurant
df[df.name==random_name]

Random Restaurant Name: Bawarchi Biryani Point


Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,location,attributes_true
7798,8QeaCReOO-ryojndLvkBNg,Bawarchi Biryani Point,1516 Demonbreun St,Nashville,Tennessee,37203,36.153262,-86.790054,3.0,33,0,"{'RestaurantsGoodForGroups': 'True', 'GoodForK...",Indian,"{'Monday': '11:0-21:30', 'Tuesday': '11:0-21:3...","State:Tennessee, City:Nashville, Address:1516 ...",RestaurantsGoodForGroups GoodForKids Restauran...
18805,gtOVX8hGKKIryQs0lGPt3w,Bawarchi Biryani Point,2628 Kirkwood Hwy,Newark,Delaware,19711,39.706175,-75.684116,3.5,53,1,"{'OutdoorSeating': 'False', 'Ambience': ""{'tou...",Halal,"{'Monday': '17:0-22:0', 'Wednesday': '17:0-22:...","State:Delaware, City:Newark, Address:2628 Kirk...",AmbienceAmbience RestaurantsPriceRange2 Restau...
18805,gtOVX8hGKKIryQs0lGPt3w,Bawarchi Biryani Point,2628 Kirkwood Hwy,Newark,Delaware,19711,39.706175,-75.684116,3.5,53,1,"{'OutdoorSeating': 'False', 'Ambience': ""{'tou...",Indian,"{'Monday': '17:0-22:0', 'Wednesday': '17:0-22:...","State:Delaware, City:Newark, Address:2628 Kirk...",AmbienceAmbience RestaurantsPriceRange2 Restau...
26171,6UyNeDQWiqt4GvVsRU1aEQ,Bawarchi Biryani Point,"625 Bakers Bridge Ave, Ste 100",Franklin,Tennessee,37067,35.958327,-86.805382,4.0,155,1,"{'Caters': 'True', 'RestaurantsTableService': ...",Indian,"{'Monday': '11:0-21:30', 'Tuesday': '11:0-21:3...","State:Tennessee, City:Franklin, Address:625 Ba...",Caters RestaurantsPriceRange2 WiFi Restaurants...


In [7]:
# Randomly chosen state
random_state = df['state'].sample(n=1).values[0]
print("Random State:", random_state)

# recommendations based on random state and random restaurant name
restaurants = recommendation(df, state=random_state,  name=random_name)
restaurants.head(10)

Random State: Pennsylvania


Unnamed: 0,name,state,city,stars,address,categories
7953,Tabla Indian Cuisine,Pennsylvania,Exton,3.0,290 E Lincoln Hwy,Indian
9179,Soprano's Pizzeria & Restaurant,Pennsylvania,Warrington,3.0,1380 Easton Rd,Italian
6574,Maple Glen Pizza,Pennsylvania,Maple Glen,3.0,641 E Welsh Rd,Italian
483,Sicilian Trattoria,Pennsylvania,Elkins Park,3.5,7901 High School Rd,Italian
4803,Ochatto Hot Pot,Pennsylvania,Philadelphia,3.5,3717 Chestnut St,Chinese
659,Capital Beer,Pennsylvania,Philadelphia,3.5,2661 E Cumberland St,Chinese
4086,PrimoHoagies,Pennsylvania,Newtown,3.0,2100 S Eagle Rd,Italian
2672,Pho & Cafe Viet Huong,Pennsylvania,Philadelphia,3.5,"1110 Washington Ave, Ste 2A",Vietnamese
3126,Jade Harbor,Pennsylvania,Philadelphia,3.0,942 Race St,Chinese
8516,Don Quixote Tapas & Things,Pennsylvania,Philadelphia,3.0,526 S 4th St,Spanish


In [8]:
# Randomly chosen state
random_state = df['state'].sample(n=1).values[0]
print("Random State:", random_state)

# recommendations based on random state and random restaurant name
restaurants = recommendation(df, state=random_state,  name=random_name)
restaurants.head(10)

Random State: Indiana


Unnamed: 0,name,state,city,stars,address,categories
1174,Puccini's Pizza Pasta-Oaklandon,Indiana,Indianapolis,3.5,7829 Sunnyside Rd,Italian
2264,Puccini's Pizza Pasta - Fishers,Indiana,Fishers,3.5,8993 E 116th St,Italian
762,Mama's House,Indiana,Indianapolis,3.5,8867 Pendleton Pike,Vietnamese
1313,Chipotle Mexican Grill,Indiana,Avon,3.0,10403 E US Highway 36,Mexican
85,El Arado Mexican Grill,Indiana,Indianapolis,3.0,1063 Virginia Ave,Latin American
410,QDOBA Mexican Eats,Indiana,Fishers,3.0,8971 E 116th St,Mexican
21,Los Rancheros,Indiana,Indianapolis,3.5,7125 Georgetown Rd,Mexican
30,Blueberry Hill Pancake House,Indiana,Indianapolis,3.5,7803 E Washington St,American (Traditional)
170,The Egg & I,Indiana,Carmel,3.0,"2271 Pointe Pkwy, Ste 150",American (Traditional)
614,Puerto Vallarta Mexican Restaurant & Cantina,Indiana,Indianapolis,3.5,5510 Lafayette Rd,Mexican


**Observations**
***

- A randomly sampled restaurant name and a randomly sampled state were chosen for demonstration purposes.
- The recommendations are restricted to the requested state, even when the reference restaurant is located in a different state.
- Only some of the recommended cuisines/categories match the restaurant in question (for example, one Indian restaurant appears among the ten Pennsylvania results shown), because cuisine is not part of the combined features.
- The attributes and attributes_true features, together with the star rating, drive the recommendations.


In [9]:
random_state = df['state'].sample(n=1).values[0]
print("Random State:", random_state)

random_cuisine = df['categories'].sample(n=1).values[0]
print("Random Cuisine:", random_cuisine)

# Example recommendations based on state, category/cuisine
cuisines = recommendation(df, state=random_state,  category=random_cuisine)
cuisines.head()

Random State: Indiana
Random Cuisine: Mediterranean


Unnamed: 0,name,state,city,stars,address,categories
33196,Petos,Indiana,Indianapolis,5.0,"6020 E 82nd St, Ste 1411",Mediterranean
40909,Yannis Golden Gyros,Indiana,Indianapolis,5.0,6658 W Washington St,Mediterranean
2420,The Palm Deli,Indiana,Zionsville,4.5,10400 N Michigan Rd,Mediterranean
734,Bitter Sweet,Indiana,Indianapolis,4.5,5543 E Washington St,Mediterranean
2430,Naf Naf Grill,Indiana,Indianapolis,4.5,921 Indiana Ave,Mediterranean


**Observations**
***
- A randomly sampled state and a randomly sampled cuisine were used for demonstration purposes.
- The system recommends restaurants of the desired cuisine.
- The recommendations are restricted to the desired state and sorted by star rating.

***
### COLLABORATIVE FILTERING MODELS
***

Here we build a collaborative filtering recommendation system using the Surprise library. The steps are: select the relevant columns, initialize a Reader object to specify the rating scale, and load the data into a Surprise Dataset object for further analysis and model building.

In [10]:
# Loading the users csv file
users_data= pd.read_csv("data/users.csv")

# summary information on data
users_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2559586 entries, 0 to 2559585
Data columns (total 4 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   user_id      object
 1   business_id  object
 2   stars        int64 
 3   date         object
dtypes: int64(1), object(3)
memory usage: 78.1+ MB


In [11]:
# merging the two datasets into one using the business_id primary key
data=pd.merge(left=users_data, right=df, how='inner', on='business_id')

# previewing the new merge dataset
data.head()

Unnamed: 0,user_id,business_id,stars_x,date,name,address,city,state,postal_code,latitude,longitude,stars_y,review_count,is_open,attributes,categories,hours,location,attributes_true
0,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,2017-01-14 20:54:15,Melt,2549 Banks St,New Orleans,Louisiana,70119,29.962102,-90.087958,4.0,32,0,"{'BusinessParking': ""{'garage': False, 'street...",American (Traditional),"{'Monday': '0:0-0:0', 'Friday': '11:0-17:0', '...","State:Louisiana, City:New Orleans, Address:254...",BusinessParkingBusinessParking GoodForMealGood...
1,RreNy--tOmXMl1en0wiBOg,cPepkJeRMtHapc_b2Oe_dw,4,2018-07-17 03:30:07,Naked Tchopstix Express,"2902 W 86th St, Ste 70",Indianapolis,Indiana,46268,39.912505,-86.211285,3.5,33,0,"{'GoodForMeal': ""{'dessert': False, 'latenight...",Hawaiian,"{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'...","State:Indiana, City:Indianapolis, Address:2902...",OutdoorSeating RestaurantsTakeOut RestaurantsG...
2,Jha0USGDMefGFRLik_xFQg,bMratNjTG5ZFEA6hVyr-xQ,5,2017-02-19 13:32:05,Portobello Cafe,1423 Chester Pike,Eddystone,Pennsylvania,19022,39.865032,-75.344051,4.0,137,1,"{'BikeParking': 'True', 'RestaurantsReservatio...",Italian,"{'Monday': '16:30-21:0', 'Tuesday': '16:30-21:...","State:Pennsylvania, City:Eddystone, Address:14...",BikeParking RestaurantsReservations HasTV Rest...
3,iYY5Ii1LGpZCpXFkHlMefw,Zx7n8mdt8OzLRXVzolXNhQ,5,2018-04-27 23:03:21,Milk and Honey Nashville,214 11th Ave S,Nashville,Tennessee,37203,36.154702,-86.784541,4.0,1725,1,"{'WheelchairAccessible': 'True', 'RestaurantsP...",American (Traditional),"{'Monday': '0:0-0:0', 'Thursday': '6:30-15:0',...","State:Tennessee, City:Nashville, Address:214 1...",WheelchairAccessible RestaurantsPriceRange2 Bu...
4,S7bjj-L07JuRr-tpX1UZLw,I6L0Zxi5Ww0zEWSAVgngeQ,4,2018-07-07 20:50:12,Cafe Beignet on Bourbon Street,311 Bourbon St,New Orleans,Louisiana,70130,29.955845,-90.068436,3.5,1066,1,"{'GoodForKids': 'True', 'OutdoorSeating': 'Tru...",Cajun/Creole,"{'Monday': '0:0-0:0', 'Tuesday': '8:0-15:0', '...","State:Louisiana, City:New Orleans, Address:311...",GoodForKids OutdoorSeating BusinessAcceptsCred...


In [12]:
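# Renaming stars_x (the user's review rating) to b/s_rating and stars_y (the restaurant's average rating) to rating
# Note: with this mapping, rating is the restaurant-level average and b/s_rating is the individual user's rating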
data.rename(columns={'stars_x':'b/s_rating', 'stars_y':'rating'}, inplace=True)

# previewing the data information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2386719 entries, 0 to 2386718
Data columns (total 19 columns):
 #   Column           Dtype  
---  ------           -----  
 0   user_id          object 
 1   business_id      object 
 2   b/s_rating       int64  
 3   date             object 
 4   name             object 
 5   address          object 
 6   city             object 
 7   state            object 
 8   postal_code      object 
 9   latitude         float64
 10  longitude        float64
 11  rating           float64
 12  review_count     int64  
 13  is_open          int64  
 14  attributes       object 
 15  categories       object 
 16  hours            object 
 17  location         object 
 18  attributes_true  object 
dtypes: float64(3), int64(3), object(13)
memory usage: 346.0+ MB


#### **Preprocessing Data For Modeling**
***

In [182]:

#selecting specific columns that are relevant for collaborative filtering models
new_df = data[['user_id', 'business_id', 'rating']]

# using Reader() from the surprise module to convert the dataframe into the Surprise data format
# instantiating a reader object
reader = Reader(rating_scale=(1, 5))

# using the reader to load the dataframe into a Surprise Dataset
data_2 = Dataset.load_from_df(new_df,reader)

# Creating a train set with all available data
dataset = data_2.build_full_trainset()

# Split the data into training and test sets
trainset, testset = train_test_split(data_2, test_size=0.25)

print('Number of users: ', dataset.n_users, '\n')
print('Number of Restaurants: ', dataset.n_items)

Number of users:  738495 

Number of Restaurants:  24835


In [141]:
new_df.to_csv('data/new_df.csv')

### **Baseline Model Using Normal Predictor**
***

In [14]:
# Initialize the Normal Predictor algorithm
model_1 = NormalPredictor()

# Train the model on the training set
model_1.fit(trainset)

# Predict ratings for the test set
predictions = model_1.test(testset)

# Compute rmse
accuracy.rmse(predictions)

RMSE: 0.8192


0.8192241778434399

**Observations:**
***
- The NormalPredictor model from the Surprise library was used as the initial baseline (dummy) model.
- The model achieved an RMSE of 0.819.
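
Surprise documents NormalPredictor as predicting a random rating drawn from a normal distribution whose mean and standard deviation are estimated from the training set. The NumPy sketch below illustrates that idea only; it is not Surprise's implementation, and the training ratings and the `random_baseline_predict` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical training ratings on a 1-5 scale (illustration only)
train_ratings = np.array([5, 4, 4, 3, 5, 2, 4, 1, 3, 4], dtype=float)

# Estimate the distribution parameters from the training ratings
mu = train_ratings.mean()
sigma = train_ratings.std()

def random_baseline_predict(n_predictions):
    """Draw random ratings from N(mu, sigma^2) and clip them to the rating scale."""
    draws = rng.normal(loc=mu, scale=sigma, size=n_predictions)
    return np.clip(draws, 1.0, 5.0)

preds = random_baseline_predict(5)
print(preds)
```

Since such predictions ignore both the user and the restaurant, this kind of model serves as a floor that any real recommender should beat.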

### **NMF Model With Default Parameters**
***

In [15]:
# Initialize the NMF algorithm
model_2 = NMF(random_state=42)

# Train the model on the training set
model_2.fit(trainset)

# Predict ratings for the test set
predictions = model_2.test(testset)

# Compute RMSE
accuracy.rmse(predictions)

RMSE: 0.3489


0.3489352110051806

**Observations:**
***
- A Non-Negative Matrix Factorization (NMF) model was used as it is well suited to non-negative ratings (i.e., ratings from 1 to 5).
- The model achieved an RMSE of 0.3489, a large improvement on the Normal Predictor.

### **SVD Model With Default Parameters**
***

In [16]:
# Initialize the SVD algorithm
model_3 = SVD(random_state=42)

# Train the model on the training set
model_3.fit(trainset)

# Predict ratings for the test set
predictions = model_3.test(testset)

# Compute RMSE
accuracy.rmse(predictions)

RMSE: 0.1171


0.11709911789888452

In [17]:
# using cross-validate to get the test rmse scores for 5 splits
results=cross_validate(model_3, data_2, cv=5, n_jobs=-1)


for values in results.items():
    print(values)
print("-------------------------")
print("Mean RMSE: ",results['test_rmse'].mean())

('test_rmse', array([0.11434558, 0.11308608, 0.11462078, 0.1136382 , 0.11273292]))
('test_mae', array([0.05442948, 0.05430722, 0.05431262, 0.05436357, 0.05414083]))
('fit_time', (50.03808856010437, 52.460954666137695, 45.77838850021362, 39.04082918167114, 34.15108132362366))
('test_time', (10.056745052337646, 8.097829580307007, 7.090398073196411, 5.896436929702759, 5.190059661865234))
-------------------------
Mean RMSE:  0.11368471505593795


**Observations:**
***
- A Singular Value Decomposition (SVD) model was used as it works well with explicit feedback (i.e. ratings).
- The model achieved an RMSE of 0.117, a further improvement.
- The model was then cross-validated and achieved a mean RMSE of 0.114.


### **Hyperparameter Tuning SVD model**
***
Hyperparameter tuning was carried out using grid search with cross-validation. It tests different values for the number of latent factors (n_factors), the regularization term (reg_all) and the number of epochs (n_epochs) to find the combination that gives the best model performance. The best hyperparameters can then be read from the g_s_svd object for use in the final model.
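
To give a sense of the search size, the sketch below enumerates every combination of a grid with three values per hyperparameter, using only the standard library. It mirrors what a grid search iterates over but is not Surprise's own code; the `combinations` name is illustrative.

```python
from itertools import product

# Three candidate values for each of the three hyperparameters
params = {
    "n_factors": [20, 50, 100],
    "reg_all": [0.01, 0.02, 0.05],
    "n_epochs": [20, 30, 40],
}

# One dict per combination, built from the Cartesian product of the value lists
combinations = [dict(zip(params, values)) for values in product(*params.values())]

print(len(combinations))
print(combinations[0])
```

With 27 combinations, and assuming 5-fold cross-validation for each one, the search fits 135 models, which is why it is run in parallel with n_jobs=-1.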

In [19]:
# define a dictionary params with hyperparameter values to be tested
params = {'n_factors': [20, 50, 100],
          'reg_all': [0.01, 0.02, 0.05],
          'n_epochs': [20, 30, 40]}

# create a GridSearchCV object 'g_s_svd' for hyperparameter tuning
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1) 

# fit the GridSearchCV object to the data to find the best hyperparameters
g_s_svd.fit(data_2)

In [20]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 0.06584980717002013, 'mae': 0.026864421536656607}
{'rmse': {'n_factors': 20, 'reg_all': 0.01, 'n_epochs': 40}, 'mae': {'n_factors': 20, 'reg_all': 0.01, 'n_epochs': 40}}


In [124]:
# Fitting optimal parameters

# Initialize the SVD algorithm
model_4 = SVD(n_factors= 20, reg_all=0.01, n_epochs = 40)

# Train the model on the training set
model_4.fit(trainset)

# Predict ratings for the test set
predictions = model_4.test(testset)

# Compute RMSE
accuracy.rmse(predictions)

RMSE: 0.0683


0.0683430536959298

In [None]:
# Saving the model into a pickle
with open('pickled_files/svd.pkl', 'wb') as file:
    pickle.dump(model_4, file)

**Observations:**
***

The optimized SVD model achieves an RMSE of approximately 0.0683 on the test set, which reflects the model's typical prediction error in rating units. Lower RMSE values are desirable as they signify better predictive accuracy.

The best-performing hyperparameter values for RMSE are 'n_factors' = 20, 'reg_all' = 0.01 and 'n_epochs' = 40.

These results indicate that the SVD collaborative filtering model, when configured with these hyperparameters, provides a relatively low prediction error and is well-suited for making personalized recommendations based on user ratings.

This model was then saved in a pickle file for deployment.
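
At deployment time the pickled model has to be loaded back before it can make predictions. The sketch below shows the save and load round trip with a stand-in object and a temporary file; the dictionary and the temporary path are hypothetical, and loading the real pickle would also require Surprise to be installed in the loading environment.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (hypothetical; the notebook pickles a Surprise SVD object)
stand_in_model = {"n_factors": 20, "reg_all": 0.01, "n_epochs": 40}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svd.pkl")

    # Save the object to disk
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # Load it back, as a deployment script would
    with open(path, "rb") as file:
        loaded_model = pickle.load(file)

print(loaded_model == stand_in_model)
```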

**Collaborative Filtering Integration**
***


The function below lets a user interactively rate a specified number of restaurants and collects the ratings in a list for use in the recommendation system. The restaurants to rate are sampled from the given state, and the user can skip any restaurant they have not visited.

In [281]:
def collect_ratings(df, state, num_samples=5, max_ratings=3):
    """
    Function to collect ratings for a number of randomly sampled restaurants.
    Allows users to skip restaurants they have never been to and limits the total number of final ratings to a specified maximum.
    """
    ratings = []

    while len(ratings) < max_ratings:
        # Sampling the specified number of restaurants
        filtered_df=df[df.state==state]

        sampled_restaurants = filtered_df.sample(n=num_samples)
        
        for _, row in sampled_restaurants.iterrows():
            restaurant_id = row['business_id']
            restaurant_name = row['name']
            restaurant_state=row["state"]
            restaurant_city =row["city"]

            while True:
                # Ask if the user has been to the restaurant
                been_to_restaurant = input(f"Have you been to {restaurant_name}, {restaurant_state},{restaurant_city}? (y/n): ").strip().lower()
                
                if been_to_restaurant == 'n':
                    print(f"Skipping {restaurant_name}.")
                    break  
                
                elif been_to_restaurant == 'y':
                    while True:
                        try:
                            # Prompt user for rating
                            print(f"Please rate {restaurant_name} on a scale of 1 to 5:")
                            rating = int(input())
                            
                            # Check if the rating is within the valid range
                            if 1 <= rating <= 5:
                                print(f"Rating for {restaurant_name}: {rating}")
                                ratings.append((restaurant_id, rating))
                                
                                # Check if we have reached the maximum number of ratings
                                if len(ratings) >= max_ratings:
                                    print("Maximum number of ratings collected.")
                                    return ratings
                                
                                break  
                            else:
                                print("Rating must be between 1 and 5. Please try again.")
                        except ValueError:
                            print("Invalid input. Please enter a number between 1 and 5.")
                    break  
                
                else:
                    print("Please answer 'yes' or 'no'.")
                
        # If not enough ratings are collected, continue sampling
        print(f"Collected {len(ratings)} ratings so far. Collecting more samples.")

    return ratings


In [286]:
def recommend_restaurants(user_id, rated_restaurants, all_restaurants_df=df, state=None):

    """
    Function to recommend restaurants based on state. The rated restaurants are concatenated to the new_df dataframe, a model is trained,
    the rated restaurants are filtered out of the candidate restaurants and predictions are made. The resulting dataframe is sorted by highest
    predicted rating.
    """
    # Filter by state if provided
    if state is not None:
        all_restaurants_df = all_restaurants_df[all_restaurants_df["state"] == state]

    # Get all restaurant IDs
    all_restaurant_ids = all_restaurants_df['business_id'].unique()

    # Prepare the data for training the model
    # Create DataFrame for the user's ratings
    user_ratings_df = pd.DataFrame(rated_restaurants, columns=['business_id', 'rating'])
    user_ratings_df['user_id'] = user_id  

    # Combine user-specific ratings with the full dataset
    combined_df = pd.concat([new_df, user_ratings_df])

    # Filter combined_df to include only relevant restaurants
    filtered_df = combined_df[combined_df['business_id'].isin(all_restaurant_ids)]

    # Define the Reader and Dataset
    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(filtered_df[['user_id', 'business_id', 'rating']], reader)

    trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
    # trainset = data.build_full_trainset()

    # Retrain the model
    model_4 = SVD(n_factors= 20, reg_all=0.01, n_epochs = 40, random_state=42)
    model_4.fit(trainset)

    # Filter out the restaurants that the user has already rated
    rated_restaurant_ids = [rid for rid, _ in rated_restaurants]
    unrated_restaurants = [rid for rid in all_restaurant_ids if rid not in rated_restaurant_ids]

    # Predict ratings for all unrated restaurants
    predictions = [model_4.predict(user_id, rid) for rid in unrated_restaurants]

    # Create a DataFrame for the predictions
    pred_df = pd.DataFrame({
        'business_id': [pred.iid for pred in predictions],
        'predicted_rating': [pred.est for pred in predictions]
    })

    # Merge with the original restaurants DataFrame to get more information
    recommendations = pred_df.merge(all_restaurants_df, on='business_id', how='left')

    # Sort by predicted rating and get top recommendations
    recommendations = recommendations.sort_values(by='predicted_rating', ascending=False)

    return recommendations

In [287]:
# Collect ratings from the user
print("You will be asked to rate 5 random restaurants.")
user_ratings = collect_ratings(df, state="Pennsylvania")
print(user_ratings)



You will be asked to rate 5 random restaurants.
Please answer 'yes' or 'no'.
Please answer 'yes' or 'no'.
Please rate Ambrosia Ristorante BYOB on a scale of 1 to 5:
Rating for Ambrosia Ristorante BYOB: 4
Please rate Dave & Buster's on a scale of 1 to 5:
Rating for Dave & Buster's: 4
Please rate Pizzeria Nonna on a scale of 1 to 5:
Rating for Pizzeria Nonna: 1
Maximum number of ratings collected.
[('-Ti5pwj6mA99khsxxur8aQ', 4), ('Gr6nYrQ_-3p4LcE4M84lTw', 4), ('aca4m9TSqTxQsEQ2H0KSwA', 1)]


In [288]:
# Entering user_id
user_id = 'uu1651'

# Recommendation based on state and ratings
recommended_restaurants = recommend_restaurants(user_id=user_id, rated_restaurants=user_ratings, state= "Pennsylvania").drop_duplicates(subset='name')

# Viewing the top 3 entries
recommended_restaurants.head(3)


Unnamed: 0,business_id,predicted_rating,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,location,attributes_true
4543,lBHZc-fGzL8fRWREy7VlZA,1.036582,Taco Bell,2422 W Passyunk Avenue,Philadelphia,Pennsylvania,19145,39.9221,-75.18726,1.0,17,1,"{'DriveThru': 'True', 'RestaurantsPriceRange2'...",Mexican,"{'Monday': '10:30-0:0', 'Tuesday': '10:30-0:0'...","State:Pennsylvania, City:Philadelphia, Address...",DriveThru RestaurantsPriceRange2 RestaurantsDe...
6590,zm4TZxLGRbGkhfw_aaMwnQ,1.197435,Pizza Hut,1084 E Lancaster Ave,Rosemont,Pennsylvania,19010,40.026275,-75.328725,1.0,18,0,"{'RestaurantsTableService': 'False', 'Business...",Italian,"{'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'...","State:Pennsylvania, City:Rosemont, Address:108...",BusinessAcceptsCreditCards RestaurantsAttire A...
1406,sCPx4Sy4I1wMeZwsTzCFRg,1.307281,Chipotle Mexican Grill,1000 S Broad St,Philadelphia,Pennsylvania,19146,39.938642,-75.166875,1.5,24,1,"{'RestaurantsDelivery': 'True', 'BusinessParki...",Mexican,"{'Monday': '0:0-0:0', 'Tuesday': '10:45-22:0',...","State:Pennsylvania, City:Philadelphia, Address...",RestaurantsDelivery RestaurantsTakeOut


**Observations:**
***


The collaborative filtering implementation depends on the state in which the user chooses to rate restaurants, as the example above shows.

A user from Philadelphia, Pennsylvania would receive rating requests for restaurants across the state of Pennsylvania, and can skip restaurants so that only those in Philadelphia are rated.

The user must rate 3 restaurants to get recommendations.

These ratings are then passed to an SVD model, which predicts a rating for every unrated restaurant.

The predicted ratings are then sorted in descending order to produce the list of recommendations.

***
### NEURAL NETWORKS MODEL
***

We will run a Keras deep neural network to implement a recommendation system and try to improve our RMSE scores by using neural networks.

> We are going to encode the user_id and business_id features into numeric integers in preparation for the deep learning model.
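
As a small illustration of this encoding step, the sketch below maps string IDs to integer codes with one LabelEncoder per column. The IDs and the `user_enc` / `item_enc` names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical IDs (illustration only)
user_ids = ["u_b", "u_a", "u_b", "u_c"]
business_ids = ["r_2", "r_1", "r_1", "r_2"]

# One encoder per column, so each keeps its own ID-to-integer mapping
user_enc = LabelEncoder()
item_enc = LabelEncoder()

user_codes = user_enc.fit_transform(user_ids)
item_codes = item_enc.fit_transform(business_ids)

print(user_codes)
print(item_codes)

# Integer codes can be converted back to the original IDs with the matching encoder
print(user_enc.inverse_transform([0, 2]))
```

Keeping a separate fitted encoder for users and for restaurants matters later, when predicted integer indices have to be translated back into real user_id and business_id values.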

In [37]:
# Encoding the user_id column
user_encoder = LabelEncoder()                                    # instantiating the encoder
data['userId'] = user_encoder.fit_transform(data.user_id.values) # fitting the encoder and transforming the column
n_users = data['userId'].nunique()                               # assigning the number of users to the n_users variable
print("Number of Users: ", n_users)

# Encoding the business_id column
item_encoder = LabelEncoder()                                          # instantiating a separate encoder for restaurants
data['restId'] = item_encoder.fit_transform(data.business_id.values)   # using item_encoder so user_encoder keeps its user mapping
n_rests = data['restId'].nunique()                                     # assigning the number of restaurants to the n_rests variable
print("Number of Restaurants: ", n_rests)

Number of Users:  220872
Number of Restaurants:  31834
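
To make the encoding step concrete, here is a small NumPy sketch of the idea behind label encoding: each distinct id gets an integer code in sorted order, and the number of distinct ids sets the size of the embedding table. The ids below are made up, and this illustrates the concept rather than `LabelEncoder` itself.

```python
import numpy as np

# Made-up identifiers standing in for user_id values.
raw_ids = np.array(["u_b", "u_a", "u_c", "u_a", "u_b"])

# np.unique sorts the distinct ids; return_inverse gives each row's index
# into that sorted array, which serves as the integer code.
classes, codes = np.unique(raw_ids, return_inverse=True)

n_users = len(classes)    # number of rows an embedding table would need
decoded = classes[codes]  # mapping the codes back to the original ids

print(codes.tolist())  # [1, 0, 2, 0, 1]
print(n_users)         # 3
```

Keeping the array of classes around is what allows predictions to be mapped back to the original ids later.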


> Splitting the data into training and testing sets for model evaluation.

In [38]:
# subsetting the x variable
X = data[['userId', 'restId']].values
# subsetting the y variable
y = data['rating'].values

# creating the train test splits and stratifying on basis of the y values 
# because of the uneven nature of the rating counts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(428157, 2) (428157,)
(107040, 2) (107040,)
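
As a rough illustration of what stratifying on the ratings achieves, the sketch below splits each rating value separately so that the test set keeps the same proportions as the full data. It is a simplified, pure-Python stand-in written for this explanation, not scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each label split at the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Made-up, deliberately imbalanced ratings: many 5s, fewer 1s.
labels = [5] * 50 + [4] * 30 + [1] * 20
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With a plain random split, a rare rating value could end up under-represented in the test set, which would make the reported RMSE less reliable.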


> Calculate the minimum and maximum ratings, which will be used to scale the output of the neural network later.

In [39]:
# Find the minimum and maximum rating
min_rating = min(data['rating'])
max_rating = max(data['rating'])
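
The way these two values are used later can be sketched in plain Python: a sigmoid squashes the network's raw score into (0, 1), and the result is stretched to the rating range. The helper name `scale_to_rating` is hypothetical, and the bounds of 1 and 5 are assumed from the 1-5 range mentioned later in this section.

```python
import math


def scale_to_rating(logit: float, min_rating: float, max_rating: float) -> float:
    """Map an unbounded score to (min_rating, max_rating) via a sigmoid."""
    squashed = 1.0 / (1.0 + math.exp(-logit))  # value strictly between 0 and 1
    return squashed * (max_rating - min_rating) + min_rating


# Assumed bounds of 1 and 5; a raw score of 0 lands on the midpoint, 3.0.
for logit in (-4.0, 0.0, 4.0):
    print(logit, round(scale_to_rating(logit, 1.0, 5.0), 3))
```

Because the sigmoid never reaches exactly 0 or 1, predictions approach but never quite hit the extreme ratings.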

> The predicted rating is calculated from the user and restaurant embeddings, with the user and restaurant biases added. We are therefore going to create user and restaurant embeddings together with their biases.

In [40]:
# Number of latent factors
embedding_size = 50

> Defining user embedding

In [41]:
# User embeddings

# user input layer
user = layers.Input(shape=(1,))

# Embedding layer for calculating user latent factors of size 50
user_emb = layers.Embedding(n_users, embedding_size, embeddings_regularizer=regularizers.l2(1e-6))(user)

# Reshaping the layer to flatten the embedding vector.
user_emb = layers.Reshape((embedding_size,))(user_emb)

> Defining user bias, and reshape it.

In [42]:
# User bias

# Embedding layer
user_bias = layers.Embedding(n_users, 1, embeddings_regularizer=regularizers.l2(1e-6))(user)

# Reshaping the user bias layer
user_bias = layers.Reshape((1,))(user_bias)

> Defining restaurants embeddings

In [43]:
# restaurant embeddings

# Input layer
restaurant= layers.Input(shape=(1,))

# Embedding layer
rest_emb = layers.Embedding(n_rests, embedding_size, embeddings_regularizer=regularizers.l2(1e-6))(restaurant)

# Reshape layer
rest_emb = layers.Reshape((embedding_size,))(rest_emb)

> Defining restaurant bias, and reshape it.

In [44]:
# Restaurant bias

# Embedding layer
rest_bias = layers.Embedding(n_rests, 1, embeddings_regularizer=regularizers.l2(1e-6))(restaurant)

# Reshape layer
rest_bias = layers.Reshape((1,))(rest_bias)

> After defining the embedding and bias layers, we combine the user and restaurant embeddings and add the bias values to them. Note that the cell below concatenates the two embeddings rather than taking their dot product; the dense layers that follow are left to learn the interaction between them.

In [45]:
# Concatenate the user and restaurant embeddings into one vector of size 2 * embedding_size
rating = layers.Concatenate()([user_emb, rest_emb])

# Add the user and restaurant biases (each of shape (1,)), broadcast across the concatenated vector
rating = layers.Add()([rating, user_bias, rest_bias])
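
For comparison, the classical matrix-factorisation score is the dot product of the two embedding vectors plus the two biases. The NumPy sketch below shows that arithmetic next to plain concatenation, using made-up vectors and bias values; it is an illustration only and is not taken from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_size = 50

# Made-up embedding vectors and scalar biases for one user/restaurant pair.
user_vec = rng.normal(size=embedding_size)
rest_vec = rng.normal(size=embedding_size)
user_b, rest_b = 0.3, -0.1

# Matrix-factorisation score: interaction term plus the two biases.
interaction = float(np.dot(user_vec, rest_vec))
score = interaction + user_b + rest_b

# Concatenation instead keeps all 100 values for later dense layers to combine.
concatenated = np.concatenate([user_vec, rest_vec])
print(score, concatenated.shape)
```

The dot product yields a single number per pair, whereas concatenation defers the interaction to the dense layers and gives them more parameters to fit.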

> We then pass the combined tensor through dense layers. The final sigmoid layer outputs a value between 0 and 1, which is rescaled to the 1-5 rating range.

We create our baseline model.

In [48]:

# first dense layer of 30 nodes with relu activation
rating = layers.Dense(30, activation='relu')(rating)

# second dense layer of 15 nodes
rating = layers.Dense(15, activation='relu')(rating)

# output layer with one node that produces values between 0 and 1 due to the sigmoid activation
rating = layers.Dense(1, activation='sigmoid')(rating)
# rating= layers.Dense(5, activation='softmax')(rating)

# Scales the predicted ratings to a range of 1 - 5
rating = layers.Lambda(lambda x:x*(max_rating - min_rating) + min_rating)(rating)


# Baseline Model 
baseline_model = models.Model([user, restaurant], rating)

# Compile the model
baseline_model.compile(optimizer='sgd', loss='mse', metrics=[metrics.RootMeanSquaredError()])

# training the model
baseline_model.fit(x=[X_train[:,0], X_train[:,1]], y=y_train,
                    batch_size=256, 
                    epochs=10, 
                    verbose=1,
                    validation_data=([X_test[:,0], X_test[:,1]], y_test))

Epoch 1/10
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 246s 146ms/step - loss: 2.2653 - root_mean_squared_error: 1.5015 - val_loss: 2.2630 - val_root_mean_squared_error: 1.5007
Epoch 2/10
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 248s 148ms/step - loss: 2.2557 - root_mean_squared_error: 1.4983 - val_loss: 2.2616 - val_root_mean_squared_error: 1.5003
Epoch 3/10
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 245s 138ms/step - loss: 2.2602 - root_mean_squared_error: 1.4998 - val_loss: 2.2623 - val_root_mean_squared_error: 1.5005
Epoch 4/10
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 341s 185ms/step - loss: 2.2639 - root_mean_squared_error: 1.5011 - val_loss: 2.2614 - val_root_mean_squared_error: 1.5002
Epoch 5/10
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 275s 157ms/step - loss: 2.2591 - root_mean_squared_error: 1.4994 - val_loss: 2.2607 - val_root_mean_squared_error: 1.5000
Epoch 6/10

<keras.src.callbacks.history.History at 0x7f21fe2e3950>

> Our baseline model does not overfit, since the training and validation RMSE scores are close to each other. We then tune the model to try to improve the RMSE by reducing its complexity.

In [50]:

rating = layers.Concatenate()([user_emb, rest_emb])
rating = layers.Add()([rating, user_bias, rest_bias])

# reducing the first dense layer to 15 neurons and adding L2 regularization
rating = layers.Dense(15, activation='relu',kernel_regularizer=regularizers.l2(1e-3))(rating)
# creating a dropout layer
rating = layers.Dropout(0.3)(rating)
# output layer
rating = layers.Dense(1, activation='sigmoid')(rating)
# conversion of the sigmoid output to the rating range
rating = layers.Lambda(lambda x:x*(max_rating - min_rating) + min_rating)(rating)

model_1 = models.Model([user, restaurant], rating)

# Compile the model
model_1.compile(optimizer='sgd', loss='mse', metrics=[metrics.RootMeanSquaredError()])

# Train the model
model_1.fit(x=[X_train[:,0], X_train[:,1]], y=y_train,
            batch_size=256,
            epochs=20, 
            verbose=1,
            validation_data=([X_test[:,0], X_test[:,1]], y_test))

Epoch 1/20
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 313s 184ms/step - loss: 2.1169 - root_mean_squared_error: 1.4413 - val_loss: 2.0627 - val_root_mean_squared_error: 1.4232
Epoch 2/20
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 256s 153ms/step - loss: 1.8676 - root_mean_squared_error: 1.3530 - val_loss: 2.0022 - val_root_mean_squared_error: 1.4022
Epoch 3/20
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 321s 188ms/step - loss: 1.7143 - root_mean_squared_error: 1.2955 - val_loss: 1.9576 - val_root_mean_squared_error: 1.3865
Epoch 4/20
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 308s 180ms/step - loss: 1.5578 - root_mean_squared_error: 1.2340 - val_loss: 1.9376 - val_root_mean_squared_error: 1.3795
Epoch 5/20
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 224s 134ms/step - loss: 1.4274 - root_mean_squared_error: 1.1802 - val_loss: 1.9344 - val_root_mean_squared_error: 1.3785
Epoch 6/20
1673/1

<keras.src.callbacks.history.History at 0x7f21f98c1b50>

> The second model has performed worse than the first, with a higher RMSE score, and it is overfitting the training data, i.e. it has a good training score but a poor validation score.

We will try to simplify the model further.

In [51]:

rating = layers.Concatenate()([user_emb, rest_emb])
# Add the user and restaurant biases to the concatenated embeddings
rating = layers.Add()([rating, user_bias, rest_bias])

# reducing the first layer further to 10 nodes
rating = layers.Dense(10, activation='relu')(rating)
# increasing the dropout rate to 0.6
rating = layers.Dropout(0.6)(rating)
# output layer
rating = layers.Dense(1, activation='sigmoid')(rating)
# conversion of the sigmoid output to the rating range
rating = layers.Lambda(lambda x:x*(max_rating - min_rating) + min_rating)(rating)

model_2 = models.Model([user, restaurant], rating)

# Compile the model
model_2.compile(optimizer='sgd',
                loss='mse',
                metrics=[metrics.RootMeanSquaredError()])

# Train the model
model_2.fit(x=[X_train[:,0], X_train[:,1]], y=y_train,
            batch_size=256, 
            epochs=20, 
            verbose=1,
            validation_data=([X_test[:,0], X_test[:,1]], y_test))

Epoch 1/20
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 237s 141ms/step - loss: 1.6052 - root_mean_squared_error: 1.2586 - val_loss: 1.9195 - val_root_mean_squared_error: 1.3816
Epoch 2/20
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 250s 133ms/step - loss: 1.1682 - root_mean_squared_error: 1.0758 - val_loss: 1.9942 - val_root_mean_squared_error: 1.4083
Epoch 3/20
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 263s 134ms/step - loss: 1.1277 - root_mean_squared_error: 1.0568 - val_loss: 2.0187 - val_root_mean_squared_error: 1.4170
Epoch 4/20
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 261s 133ms/step - loss: 1.0749 - root_mean_squared_error: 1.0315 - val_loss: 2.0266 - val_root_mean_squared_error: 1.4198
Epoch 5/20
1673/1673 ━━━━━━━━━━━━━━━━━━━━ 262s 133ms/step - loss: 1.0261 - root_mean_squared_error: 1.0076 - val_loss: 2.0196 - val_root_mean_squared_error: 1.4173
Epoch 6/20

<keras.src.callbacks.history.History at 0x7f22018e22d0>

> The third model has overfitted the training data even further: its training RMSE keeps falling while its validation RMSE stays high.
Therefore we take the baseline model as our best neural model and evaluate it on the training and test data below.
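
One way to quantify the overfitting is to look for the epoch with the lowest validation RMSE. The sketch below does this for the two tuned models, using only the first five epochs visible in the logs above; the variable names `model_1` and `model_2` refer to the second and third models, and `best_epoch` is a hypothetical helper.

```python
# Validation RMSE per epoch, copied from the first five epochs logged above.
val_rmse = {
    "model_1": [1.4232, 1.4022, 1.3865, 1.3795, 1.3785],
    "model_2": [1.3816, 1.4083, 1.4170, 1.4198, 1.4173],
}


def best_epoch(history):
    """Return (1-based epoch, value) of the lowest validation RMSE."""
    best_index = min(range(len(history)), key=history.__getitem__)
    return best_index + 1, history[best_index]


for name, history in val_rmse.items():
    print(name, best_epoch(history))
# model_1 (5, 1.3785)
# model_2 (1, 1.3816)
```

For `model_2` the best visible validation score is at the very first epoch, which matches the overfitting pattern described in the text. Stopping training when the validation metric stops improving, for example with Keras's `EarlyStopping` callback, is one option that could be explored here.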

In [52]:
# evaluating the best model on the training data
print("Training data: ")
print(baseline_model.evaluate([X_train[:,0], X_train[:,1]], y_train))

# evaluating the best model on the test data
print("Testing data: ")
print(baseline_model.evaluate([X_test[:,0], X_test[:,1]], y_test))

Training data: 
13380/13380 ━━━━━━━━━━━━━━━━━━━━ 359s 27ms/step - loss: 1.1958 - root_mean_squared_error: 1.0885
[1.1989389657974243, 1.089964509010315]
Testing data: 
3345/3345 ━━━━━━━━━━━━━━━━━━━━ 90s 27ms/step - loss: 1.7751 - root_mean_squared_error: 1.3282
[1.779281497001648, 1.3298012018203735]
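
Each list printed above is `[loss, RMSE]`. As a reminder of what the metric measures, here is a small sketch of the RMSE computation on made-up ratings, followed by a check on the test figures above. The `rmse` helper and the toy data are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings and predictions, for illustration only.
y_true = [5.0, 3.0, 4.0, 1.0]
y_pred = [4.0, 3.0, 2.0, 2.0]
print(rmse(y_true, y_pred))  # sqrt(1.5), about 1.2247

# Values printed by evaluate() on the test data above: [loss, RMSE].
test_loss, test_rmse = 1.779281497001648, 1.3298012018203735
# The loss is slightly larger than RMSE squared; this gap is consistent with
# the L2 penalties that Keras adds to the loss but not to the metric.
gap = test_loss - test_rmse ** 2
print(gap)
```

This is why the reported loss and the squared RMSE do not match exactly even though the loss function is MSE.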


**Observations:**
***

> In the evaluation above, the baseline model has a training RMSE of about 1.090 and a test RMSE of about 1.330, which makes it our preferred neural network model.

> Across all the models, SVD has emerged as the best, with an RMSE score of 0.068.