# Advanced content-based & evaluation


We first install `pandas`, `scikit-learn` and `scikit-surprise` (the package that provides the `surprise` module).

In [None]:
!pip install pandas scikit-learn scikit-surprise



Next, we import the required libraries and load the data.

In [1]:
import pandas as pd
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise import accuracy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import string

# Load data
food_data = pd.read_csv('data/food.csv')
ratings_data = pd.read_csv('data/ratings.csv')

In [2]:
food_data.head()

Unnamed: 0,Food_ID,Name,C_Type,Veg_Non,Describe
0,1,summer squash salad,Healthy Food,veg,"white balsamic vinegar, lemon juice, lemon rin..."
1,2,chicken minced salad,Healthy Food,non-veg,"olive oil, chicken mince, garlic (minced), oni..."
2,3,sweet chilli almonds,Snack,veg,"almonds whole, egg white, curry leaves, salt, ..."
3,4,tricolour salad,Healthy Food,veg,"vinegar, honey/sugar, soy sauce, salt, garlic ..."
4,5,christmas cake,Dessert,veg,"christmas dry fruits (pre-soaked), orange zest..."


We define a simple function that will clean the description of our meals.

In [3]:
def text_cleaning(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text

In [4]:
food_data['Describe'] = food_data['Describe'].apply(text_cleaning)
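To make the effect of `text_cleaning` concrete, here is a small self-contained check on made-up ingredient strings (the samples are illustrative, not rows copied from `food.csv`). It also shows a side effect worth knowing: a character such as `/` is deleted without inserting a space, so `honey/sugar` collapses into a single token.

```python
import string

def text_cleaning(text):
    # Same logic as the cell above: drop every ASCII punctuation character
    return "".join(char for char in text if char not in string.punctuation)

sample = "olive oil, chicken mince, garlic (minced)"
cleaned = text_cleaning(sample)
print(cleaned)

# A slash is deleted, not replaced by a space, so the two words merge
merged = text_cleaning("honey/sugar")
print(merged)

# Equivalent formulation with str.translate, typically faster on long strings
table = str.maketrans("", "", string.punctuation)
translated = sample.translate(table)
print(translated == cleaned)
```

If merged tokens turn out to hurt the TF-IDF features, one option is to replace punctuation with a space instead of deleting it.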

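We merge both datasets on `Food_ID` so that each rating row also carries the attributes of the rated food. We then build the Surprise dataset from the `(User_ID, Food_ID, Rating)` columns and split it into a training set and a test set.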

In [5]:
# Merge data
merged_data = pd.merge(ratings_data, food_data, on='Food_ID')

# Define a Reader
reader = Reader(rating_scale=(1, 10))

# Create Surprise Dataset
data = Dataset.load_from_df(merged_data[['User_ID', 'Food_ID', 'Rating']], reader)

# Train-test split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

In [6]:
merged_data.head()

Unnamed: 0,User_ID,Food_ID,Rating,Name,C_Type,Veg_Non,Describe
0,1.0,88.0,4.0,peri peri chicken satay,Snack,non-veg,boneless skinless chicken thigh trimmed salt a...
1,1.0,46.0,3.0,steam bunny chicken bao,Japanese,non-veg,buns all purpose white flour dry yeast sugar s...
2,3.0,46.0,2.0,steam bunny chicken bao,Japanese,non-veg,buns all purpose white flour dry yeast sugar s...
3,20.0,46.0,6.0,steam bunny chicken bao,Japanese,non-veg,buns all purpose white flour dry yeast sugar s...
4,69.0,46.0,9.0,steam bunny chicken bao,Japanese,non-veg,buns all purpose white flour dry yeast sugar s...
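The IDs and ratings in `merged_data` are displayed as floats (`1.0`, `88.0`, `4.0`). In pandas, an integer column becomes `float64` as soon as it contains a missing value, so this may point to empty rows in `ratings.csv`. That is only an assumption; the sketch below uses a made-up `toy_ratings` frame to show how such rows could be detected and dropped before building the Surprise dataset.

```python
import numpy as np
import pandas as pd

# Made-up ratings with one empty row (hypothetical, not the real ratings.csv)
toy_ratings = pd.DataFrame({
    "User_ID": [1, 1, 3, np.nan],
    "Food_ID": [88, 46, 46, np.nan],
    "Rating": [4, 3, 2, np.nan],
})

# The NaN forces every column to float64
print(toy_ratings.dtypes)

# Count missing values per column
missing_per_column = toy_ratings.isna().sum()
print(missing_per_column)

# Drop incomplete rows and restore integer IDs
clean = (
    toy_ratings
    .dropna(subset=["User_ID", "Food_ID", "Rating"])
    .astype({"User_ID": int, "Food_ID": int})
)
print(clean)
```

Running `ratings_data.isna().sum()` on the real data would confirm whether this cleanup is needed at all.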


In the cell below, we fit a TF-IDF vectorizer on the `Describe` column to compute TF-IDF features for every food in the dataset.

We also define a function that returns the 10 foods most similar to a given food, identified by its name.

In [7]:
# Create TF-IDF vectorizer for content-based filtering
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(food_data['Describe'])

# Compute cosine similarity (TF-IDF rows are L2-normalized by default, so the dot product equals the cosine)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to get recommendations based on content
def content_based_recommendation(food_name):
    idx = food_data.index[food_data['Name'] == food_name].tolist()[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    food_indices = [i[0] for i in sim_scores]
    print(sim_scores)
    return food_data['Name'].iloc[food_indices]

# Example usage
ex_food = "christmas cake"
recommended_food = content_based_recommendation(ex_food)
print(f"Content-Based Recommendations for {ex_food}:")
print(recommended_food)


[(378, 0.26442155224665964), (234, 0.1883071039839689), (393, 0.17890054153954782), (227, 0.17838287654553503), (250, 0.17782542270572887), (64, 0.17119161695288831), (198, 0.16162927688061762), (272, 0.1611563765947765), (233, 0.16069567766867066), (253, 0.1573820626216767)]
Content-Based Recommendations for christmas cake:
378      Grilled Chicken with Almond and Garlic Sauce
234                                  whole wheat cake
393    Fig and Sesame Tart with Cardamom Orange Cream
227                         chocolate chip cheesecake
250                            lemon poppy seed cake 
64                     almond  white chocolate gujiya
198                             lemon poppy seed cake
272                           corn & jalapeno poppers
233                             cinnamon star cookies
253                            orange quinoa sevaiyan
Name: Name, dtype: object


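Most of the recommendations are cakes or other desserts, which is plausible for `christmas cake`. The top hit, however, is `Grilled Chicken with Almond and Garlic Sauce`, and `corn & jalapeno poppers` also appears, so shared ingredients alone do not guarantee a similar dish. The similarity scores are also low (0.26 at best), and `lemon poppy seed cake` is listed twice because the catalogue contains it under two indices (250 and 198).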
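The cell above calls `linear_kernel` (a plain dot product) while its comment speaks of cosine similarity. Both are equivalent here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). A minimal check on made-up documents (`toy_docs` is illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up ingredient lists for illustration
toy_docs = [
    "flour sugar butter orange zest",
    "flour sugar butter lemon",
    "chicken garlic almonds olive oil",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Each TF-IDF row has unit Euclidean length
row_norms = np.asarray(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1)))
print(row_norms.ravel())

# Dot product and cosine similarity therefore coincide
dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the two matrices would differ and `cosine_similarity` would be the safer call.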

In [8]:
# Build Collaborative Filtering Recommender System using Surprise
algo = SVD()
algo.fit(trainset)

# Make predictions
predictions = algo.test(testset)

# Evaluate the model
rmse = accuracy.rmse(predictions)

RMSE: 2.9050


In [9]:
from surprise.model_selection import cross_validate

cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.9655  2.9619  2.5473  2.9390  2.9917  2.8811  0.1677  
MAE (testset)     2.5330  2.5630  2.1157  2.5624  2.6088  2.4766  0.1821  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    


For this model without feature engineering, we get an RMSE of 2.88 and an MAE of 2.48 on average across the five folds, on a rating scale from 1 to 10.
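As a reminder of what these two metrics measure, here is a small worked example in plain Python. The ratings are made up, and the helper names `rmse_score` and `mae_score` are hypothetical (in the notebook the values come from `surprise.accuracy`). MAE is the mean absolute error, while RMSE squares the errors before averaging and therefore penalizes large misses more heavily.

```python
import math

def rmse_score(true_ratings, predicted_ratings):
    # Root of the mean squared error
    squared = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared) / len(squared))

def mae_score(true_ratings, predicted_ratings):
    # Mean of the absolute errors
    absolute = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(absolute) / len(absolute)

# Made-up ratings: the errors are -1, 0, -2 and 2
true_ratings = [4, 3, 2, 6]
predicted_ratings = [5, 3, 4, 4]

print(rmse_score(true_ratings, predicted_ratings))  # 1.5
print(mae_score(true_ratings, predicted_ratings))   # 1.25
```

Read this way, an MAE close to 2.5 on a 1 to 10 scale means the predictions are off by about two and a half points on average.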

## With feature engineering

We'd like to perform some feature engineering to benefit from the other features of the dataset. We create a feature named `soup` that concatenates the features `Describe`, `C_Type` and `Veg_Non`.

In [10]:
def create_soup(x):
  return " ".join([x['Describe'], x['C_Type'], x['Veg_Non']])

food_data['soup'] = food_data.apply(create_soup, axis=1)

In [11]:
# Create TF-IDF vectorizer for content-based filtering
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(food_data['soup'])

# Compute cosine similarity (TF-IDF rows are L2-normalized by default, so the dot product equals the cosine)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to get recommendations based on content
def content_based_recommendation(food_name):
    idx = food_data.index[food_data['Name'] == food_name].tolist()[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    food_indices = [i[0] for i in sim_scores]
    print(sim_scores)
    return food_data['Name'].iloc[food_indices]

# Example usage
ex_food = "christmas cake"
recommended_food = content_based_recommendation(ex_food)
print(f"Content-Based Recommendations for {ex_food}:")
print(recommended_food)

[(378, 0.25057225014594453), (393, 0.20085024821815067), (227, 0.1980791009255726), (250, 0.197949350187168), (233, 0.18437889059768756), (234, 0.18192330919980182), (198, 0.18029768136168772), (231, 0.17549342202115817), (64, 0.16592963636681982), (207, 0.1644196446457626)]
Content-Based Recommendations for christmas cake:
378      Grilled Chicken with Almond and Garlic Sauce
393    Fig and Sesame Tart with Cardamom Orange Cream
227                         chocolate chip cheesecake
250                            lemon poppy seed cake 
233                             cinnamon star cookies
234                                  whole wheat cake
198                             lemon poppy seed cake
231                             apple and walnut cake
64                     almond  white chocolate gujiya
207              fennel scented sweet banana fritters
Name: Name, dtype: object
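The `content_based_recommendation` function above always returns ten items and fails with an `IndexError` when the name is not in the catalogue. Below is a hedged variant that takes the number of results as a parameter and reports unknown names explicitly. The function name `recommend_similar`, its signature and the toy catalogue are assumptions made for this sketch, not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalogue (made up for illustration, not taken from food.csv)
toy_food = pd.DataFrame({
    "Name": ["plum cake", "lemon cake", "chicken salad", "fruit tart"],
    "soup": [
        "flour sugar butter plum Dessert veg",
        "flour sugar butter lemon Dessert veg",
        "chicken lettuce olive oil Healthy Food non-veg",
        "flour butter fruit cream Dessert veg",
    ],
})

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_food["soup"])
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)

def recommend_similar(food_name, names, sim_matrix, n=10):
    """Return the n names most similar to food_name, excluding the item itself."""
    matches = names.index[names == food_name]
    if len(matches) == 0:
        raise KeyError(f"Unknown food name: {food_name!r}")
    label = matches[0]
    position = names.index.get_loc(label)
    # Attach index labels to the similarity row, then drop the query item
    scores = pd.Series(sim_matrix[position], index=names.index).drop(label)
    return names.loc[scores.nlargest(n).index]

print(recommend_similar("plum cake", toy_food["Name"], toy_sim, n=2))
```

Dropping the query item by label, rather than slicing off the first sorted element, avoids returning the item itself when another entry has an identical description.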


In [12]:
# Build Collaborative Filtering Recommender System using Surprise
algo = SVD()
algo.fit(trainset)

# Make predictions
predictions = algo.test(testset)

# Evaluate the model
rmse = accuracy.rmse(predictions)

RMSE: 2.9128


In [13]:
from surprise.model_selection import cross_validate

cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.8882  2.9285  2.8852  2.8200  2.9807  2.9005  0.0531  
MAE (testset)     2.4733  2.5235  2.4633  2.3778  2.6033  2.4882  0.0742  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    


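Again, RMSE and MAE are on average about the same as without feature engineering (2.90 vs 2.88 and 2.49 vs 2.48). This is expected: `SVD` is trained only on the `(User_ID, Food_ID, Rating)` triplets, so the `soup` feature changes the content-based recommendations but never reaches the collaborative filtering model. The small differences come from the random initialization of the model and the random fold assignment.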

Let's use `GridSearchCV` with cross-validation to search for better hyperparameters and obtain a fine-tuned model.

In [14]:
from surprise.model_selection import GridSearchCV

param_grid = {
  'n_factors': [20, 50, 100, 200, 400, 800, 1000],
  'n_epochs': [5, 10, 20, 50, 100, 200]
}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=10)
gs.fit(data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

2.8661072260812035
{'n_factors': 20, 'n_epochs': 5}


The best parameters are `n_factors=20` and `n_epochs=5`, which are the smallest values tried for each hyperparameter.
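Two things are worth checking about this search: how many models it trains, and whether the winner sits on the border of the grid. The sketch below does both in plain Python, using the grid and the result reported above (`best_params` is copied from the output, not recomputed).

```python
from itertools import product

param_grid = {
    'n_factors': [20, 50, 100, 200, 400, 800, 1000],
    'n_epochs': [5, 10, 20, 50, 100, 200],
}
n_folds = 10  # value of cv passed to GridSearchCV above

# Result reported by gs.best_params['rmse']
best_params = {'n_factors': 20, 'n_epochs': 5}

# Every combination is trained once per fold
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * n_folds
print(f"{len(combinations)} combinations x {n_folds} folds = {n_fits} fits")

# Flag hyperparameters whose best value is the smallest one tried
on_lower_edge = {
    name: best_params[name] == min(values)
    for name, values in param_grid.items()
}
print(on_lower_edge)
```

Since both best values lie on the lower edge, even smaller values (for example fewer than 20 factors) might do as well or better; this is a suggestion to test, not a result from the notebook.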

Since the best parameters have been found, we can retrain the model using these parameters.

In [15]:
# best hyperparameters
best_factor = gs.best_params['rmse']['n_factors']
best_epoch = gs.best_params['rmse']['n_epochs']

# We'll use the famous SVD algorithm.
svd = SVD(n_factors=best_factor, n_epochs=best_epoch)

# Train the algorithm on the trainset
svd.fit(trainset)

# Make predictions with the tuned model (left disabled, as in the original run)
# predictions = svd.test(testset)

# Evaluate the model
# rmse = accuracy.rmse(predictions)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f581d16b100>

In [16]:
cv = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.8361  2.8637  2.8389  3.1065  2.7089  2.8708  0.1296  
MAE (testset)     2.4833  2.4113  2.4560  2.7519  2.2823  2.4770  0.1538  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    


Unfortunately, fine-tuning does not really improve the results: the mean RMSE is 2.8708 and the mean MAE is 2.4770, against 2.8811 and 2.4766 for the default model.
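To judge whether such a gap means anything, it helps to compare it with the spread across folds. The sketch below recomputes the mean and standard deviation from the per-fold RMSE values printed in cells `In [9]` and `In [16]`, using only the standard library. `pstdev` (population standard deviation) is used because it reproduces the `Std` column of the tables.

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the cross-validation tables above
baseline_rmse = [2.9655, 2.9619, 2.5473, 2.9390, 2.9917]  # default SVD, In [9]
tuned_rmse = [2.8361, 2.8637, 2.8389, 3.1065, 2.7089]     # tuned SVD, In [16]

baseline_mean, baseline_std = mean(baseline_rmse), pstdev(baseline_rmse)
tuned_mean, tuned_std = mean(tuned_rmse), pstdev(tuned_rmse)

improvement = baseline_mean - tuned_mean
print(f"baseline: {baseline_mean:.4f} +/- {baseline_std:.4f}")
print(f"tuned:    {tuned_mean:.4f} +/- {tuned_std:.4f}")
print(f"improvement: {improvement:.4f}")

# The gain is far smaller than the fold-to-fold variation
is_within_noise = abs(improvement) < min(baseline_std, tuned_std)
print(is_within_noise)
```

The improvement of about 0.01 is an order of magnitude smaller than the standard deviation across folds, so the two models cannot be told apart on this evidence.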