<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# Estimating Baseline Performance
<br>
Estimating baseline performance is as important as choosing the right metrics for model evaluation. In this notebook, we briefly discuss why we care about baseline performance and how to measure it.

The notebook covers two example scenarios in the context of movie recommendation: 1) rating prediction and 2) top-k recommendation.

### Why does baseline performance matter?
<br>
Before we dive deep into baseline performance estimation, it is worth thinking about why we need it.

As the word 'baseline' suggests, <b>baseline performance</b> is the minimum performance we expect a model to achieve, or a starting point for model comparisons.

Once we train a model and get results from the evaluation metrics we chose, we will wonder how to interpret those metrics, or even whether the trained model is better than a simple rule-based model. Baseline results help us answer these questions.

Let's say we are building a food recommender. We evaluated the model on the test set and got nDCG (at 10) = 0.3. At that moment, we would not know if the model is good or bad. But once we find out that a simple rule of <i>'recommending top-10 most popular foods to all users'</i> can achieve nDCG = 0.4, we see that our model is not good enough. Maybe the model is not trained well, or maybe we should reconsider whether nDCG is the right metric for predicting user behavior in the given problem.

## **Import libraries**

In [None]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import re
import time
import datetime
import random
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from tqdm import tqdm
from bs4 import BeautifulSoup
from sklearn.metrics.pairwise import cosine_similarity
from numpy.linalg import norm
from numpy import dot
from numpy import sqrt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

sns.set()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## **Read dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/DS300_DoAn/data

/content/drive/MyDrive/DS300_DoAn/data


In [None]:
df = pd.read_csv('/content/drive/MyDrive/DS300_DoAn/data/data_history.csv')
convert_timestamp = lambda x: time.mktime(datetime.datetime.strptime(x, "%Y-%m-%d").timetuple())

df['timestamp'] = df['Date'].apply(convert_timestamp)

In [None]:
df = df.rename(columns={'IDuser': "userID", 'IDhotel': "movieID", 'Rating': "rating"})
ratings = df[['userID','movieID','rating','timestamp']]

In [None]:
ratings = ratings.rename(columns={'userID': "UserID", 'movieID': "HotelID", 'rating': "Rating"})
ratings = ratings[['UserID', 'HotelID', 'Rating']]

In [None]:
ratings

Unnamed: 0,UserID,HotelID,Rating
0,5140,277,6.3
1,2059,256,6.7
2,7845,171,7.3
3,9689,182,7.0
4,3040,150,6.0
...,...,...,...
18264,5961,183,8.0
18265,1331,5,7.0
18266,5405,106,10.0
18267,5015,70,7.0


### How can we estimate the baseline performance?
<br>
To estimate the baseline performance, we first pick a baseline model and evaluate it with the same evaluation metrics we will use for our main model. In general, a very simple rule or even the <b>zero rule</b>, which <i>predicts the mean for regression or the mode for classification</i>, is enough as a baseline model. (Random prediction might be acceptable for certain problems, but it usually performs worse than the zero rule.) If we already have a running model and are trying to improve it, we can use its previous results as the baseline performance.
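
As a rough illustration of the zero rule, the sketch below uses made-up toy values (not this notebook's data). It compares a mean predictor with uniformly random predictions on a regression target, and shows the mode for a classification target. The variable names are chosen for the example only.

```python
from collections import Counter

import numpy as np

# Toy targets, invented for illustration only
y_train = np.array([6.0, 7.0, 8.0, 7.0, 9.0])
y_test = np.array([7.0, 8.0, 6.0])

# Zero rule for regression: always predict the training mean
mean_pred = np.full(len(y_test), y_train.mean())
zero_rule_mae = np.abs(y_test - mean_pred).mean()

# Random baseline: draw predictions uniformly over the observed training range
rng = np.random.default_rng(0)
random_pred = rng.uniform(y_train.min(), y_train.max(), size=len(y_test))
random_mae = np.abs(y_test - random_pred).mean()

# Zero rule for classification: always predict the most frequent training label
labels_train = ["like", "dislike", "like", "like"]
mode_label = Counter(labels_train).most_common(1)[0][0]

print(f"zero-rule MAE: {zero_rule_mae:.3f}")
print(f"random MAE:    {random_mae:.3f}")
print(f"mode label:    {mode_label}")
```

Whichever of the two regression numbers is lower on a given problem, both are cheap to compute and give a floor that a trained model should clear.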

Most importantly, <b>different baseline approaches should be taken for different problems and business goals</b>. For example, recommending previously purchased items could be used as a baseline model for food or restaurant recommendation, since people tend to eat the same foods repeatedly. For TV show or movie recommendation, on the other hand, recommending previously watched items does not make sense. Recommending the most popular (most watched or highly rated) items is more likely to be useful as a baseline.

In this notebook, we demonstrate how to estimate the baseline performance for a rating dataset. The text was written for movie recommendation with the MovieLens dataset, but the data loaded above is a hotel rating history (`data_history.csv`), so 'movie' should be read as 'item' throughout. We use the mean for rating prediction, i.e. our baseline model predicts a user's rating of an item by averaging the ratings the user previously submitted for other items. For the top-k recommendation problem, we use the top-k most-rated items as the baseline model. We choose the number of ratings here because we regard the binary signal of 'rated vs. not rated' as the user's implicit preference when evaluating ranking metrics.

Now, let's jump into the implementation!

In [None]:
%cd /content/drive/MyDrive/Colab_me/DS300/recommenders/

/content/drive/MyDrive/Colab_me/DS300/recommenders


In [None]:
# !pip install scrapbook
# !pip install papermill
# !pip install cornac
# !pip install retrying
# !pip install pandera

In [None]:
import sys

import itertools
import pandas as pd
import scrapbook as sb

from recommenders.utils.notebook_utils import is_jupyter
# from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_random_split
from recommenders.datasets.pandas_df_utils import filter_by
from recommenders.evaluation.python_evaluation import (
    rmse, mae, rsquared, exp_var,
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k
)

print(f"System version: {sys.version}")
print(f"Pandas version: {pd.__version__}")

System version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Pandas version: 1.5.3


First, let's prepare training and test data sets.

In [None]:
TOP_K = 10

In [None]:
data = ratings[['UserID', 'HotelID', 'Rating']]
data = data.rename(columns={'UserID': "userID", 'HotelID': "itemID", 'Rating': "rating"})

data.head()

Unnamed: 0,userID,itemID,rating
0,5140,277,6.3
1,2059,256,6.7
2,7845,171,7.3
3,9689,182,7.0
4,3040,150,6.0


In [None]:
data = data[data.userID.map(data.userID.value_counts()) > 4]
data

Unnamed: 0,userID,itemID,rating
8,3827,378,6.0
10,5272,182,6.0
11,5822,308,6.0
14,5754,141,6.7
15,5961,334,6.7
...,...,...,...
18247,5668,103,8.0
18255,8139,47,10.0
18256,4080,140,8.0
18263,5857,571,4.0


In [None]:
len(data.userID.unique())

485

In [None]:
len(data.itemID.unique())

533

In [None]:
train, test = python_random_split(data, ratio=0.8, seed=42)

In [None]:
train

Unnamed: 0,userID,itemID,rating
7858,3865,179,8.0
4225,5451,13,5.2
10884,5521,16,10.0
4916,5794,394,9.7
15816,6539,515,6.0
...,...,...,...
16387,3197,478,5.0
1353,6605,334,7.5
9833,5857,226,10.0
13117,9608,10,10.0


In [None]:
test

Unnamed: 0,userID,itemID,rating
1621,5932,307,10.0
2605,2542,412,10.0
4890,944,301,6.0
2040,5857,228,8.0
5399,5885,139,6.6
...,...,...,...
5365,5972,210,9.0
15687,5914,394,8.8
1896,5961,128,6.0
6694,6786,65,6.3


### 1. Rating prediction baseline

As we discussed earlier, we use each user's **mean rating** as the baseline prediction.
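
The idea can be sketched on a toy frame first. The data below is invented, and the global-mean fallback for users with no training ratings is an assumption added for illustration; the notebook's own cells use an inner merge, which drops such users instead.

```python
import pandas as pd

# Toy data, invented for illustration only
train_toy = pd.DataFrame({
    "userID": [1, 1, 2, 2, 2],
    "itemID": [10, 11, 10, 12, 13],
    "rating": [8.0, 6.0, 9.0, 10.0, 8.0],
})
test_toy = pd.DataFrame({
    "userID": [1, 2, 3],  # user 3 never appears in train_toy
    "itemID": [12, 11, 10],
    "rating": [7.0, 9.0, 5.0],
})

# Per-user mean rating computed from the training set only
user_means = (
    train_toy.groupby("userID")["rating"].mean().rename("AvgRating").reset_index()
)

# A left merge keeps every test row; users unseen in training get NaN
pred_toy = test_toy.merge(user_means, on="userID", how="left")

# Assumed fallback (not in the original notebook): global training mean
global_mean = train_toy["rating"].mean()
pred_toy["AvgRating"] = pred_toy["AvgRating"].fillna(global_mean)

print(pred_toy)
```

Computing the means from the training set only matters: averaging over test ratings as well would leak the answers into the baseline.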

In [None]:
# Calculate avg ratings from the training set
users_ratings = train.groupby(["userID"])["rating"].mean()
users_ratings = users_ratings.to_frame().reset_index()
users_ratings.rename(columns={"rating": "AvgRating"}, inplace=True)

users_ratings.head()

Unnamed: 0,userID,AvgRating
0,70,6.571429
1,86,7.625
2,100,4.5
3,126,8.381818
4,145,7.428571


In [None]:
# Generate prediction for the test set
baseline_predictions = pd.merge(test, users_ratings, on=["userID"], how="inner")

# baseline_predictions.loc[baseline_predictions["userID"] == 1].head()
baseline_predictions

Unnamed: 0,userID,itemID,rating,AvgRating
0,70,402,7.0,6.571429
1,100,395,1.0,4.500000
2,100,425,6.0,4.500000
3,100,465,8.0,4.500000
4,126,220,6.9,8.381818
...,...,...,...,...
954,10594,404,5.7,6.740000
955,10636,65,10.0,6.975000
956,10636,253,10.0,6.975000
957,10657,399,10.0,7.957143


Now, let's evaluate how our baseline model performs on regression metrics.

In [None]:
baseline_predictions = baseline_predictions[["userID", "itemID", "AvgRating"]]

cols = {
    "col_user": "userID",
    "col_item": "itemID",
    "col_rating": "rating",
    "col_prediction": "AvgRating",
}

eval_rmse = rmse(test, baseline_predictions, **cols)
eval_mae = mae(test, baseline_predictions, **cols)
eval_rsquared = rsquared(test, baseline_predictions, **cols)
eval_exp_var = exp_var(test, baseline_predictions, **cols)

print("RMSE:\t\t%f" % eval_rmse,
      "MAE:\t\t%f" % eval_mae,
      "rsquared:\t%f" % eval_rsquared,
      "exp var:\t%f" % eval_exp_var, sep='\n')

RMSE:		2.075237
MAE:		1.684464
rsquared:	-0.143746
exp var:	-0.141710


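As the output shows, the baseline is fairly weak on this dataset. MAE (Mean Absolute Error) is about 1.68, meaning that users' actual ratings are on average about 1.68 points away from their mean training rating, and RMSE is about 2.08. R-squared is negative (about -0.14), which means the per-user mean predicts the test ratings worse than a single constant equal to the test-set mean would. For comparison, the same baseline reaches an MAE of around 0.84 on MovieLens 100k data, but note the different scale: MovieLens ratings run from 1 to 5, while the ratings shown here go up to 10. A per-user mean is meant to capture rating bias, where some users tend to give high ratings to everything while others give low ratings. Here users were kept with as few as five ratings in total, so that mean is a noisy estimate.

The next time we build a machine-learning model, we will want it to perform better than this baseline.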
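
For readers who want to see what these numbers measure, here is a minimal NumPy sketch of RMSE, MAE, and R-squared on invented values. It follows the textbook definitions rather than reproducing the `recommenders` implementation, and the toy predictions are chosen so that R-squared comes out negative.

```python
import numpy as np

# Invented values for illustration only
y_true = np.array([7.0, 9.0, 5.0, 7.0])
y_pred = np.array([9.0, 5.0, 7.0, 7.0])

err = y_true - y_pred

# Root mean squared error and mean absolute error
rmse_val = np.sqrt(np.mean(err ** 2))
mae_val = np.mean(np.abs(err))

# R-squared: 1 - (residual sum of squares / total sum of squares),
# where the total is measured around the mean of the true values
ss_res = np.sum(err ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_val = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_val:.4f}")
print(f"MAE:  {mae_val:.4f}")
print(f"R^2:  {r2_val:.4f}")
```

R-squared drops below zero whenever the residual sum of squares exceeds the total sum of squares, i.e. whenever the predictions are worse than always guessing the mean of the true values.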

### 2. Top-k recommendation baseline

Recommending the **most popular items** is an intuitive and simple approach that works for many recommendation scenarios. Here, we use the top-k most-rated items as the baseline model, as discussed earlier.
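
On a toy frame, the popularity baseline looks roughly like this. The data is invented and the helper name `recommend_popular` is hypothetical; the cells that follow do the same thing at scale with a cross join and `filter_by`.

```python
import pandas as pd

# Toy interactions, invented for illustration only
train_toy = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
    "itemID": ["a", "b", "a", "c", "a", "b", "d", "a", "b", "c"],
})


def recommend_popular(train, user, k):
    """Return the k most-rated items that `user` has not rated in `train`."""
    # value_counts sorts items by number of ratings, most-rated first
    counts = train["itemID"].value_counts()
    seen = set(train.loc[train["userID"] == user, "itemID"])
    unseen = [item for item in counts.index if item not in seen]
    return unseen[:k]


print(recommend_popular(train_toy, user=1, k=2))
print(recommend_popular(train_toy, user=2, k=2))
```

Every user gets the same global ranking; only the removal of already-rated items makes the lists differ between users.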

In [None]:
item_counts = train["itemID"].value_counts().to_frame().reset_index()
item_counts.columns = ["itemID", "Count"]
item_counts.head()

Unnamed: 0,itemID,Count
0,246,69
1,269,63
2,417,56
3,212,52
4,347,50


In [None]:
user_item_col = ["userID", "itemID"]

# Cross join users and items
test_users = test['userID'].unique()
user_item_list = list(itertools.product(test_users, item_counts['itemID']))
users_items = pd.DataFrame(user_item_list, columns=user_item_col)

print("Number of user-item pairs:", len(users_items))

# Remove seen items (items in the train set) as we will not recommend those again to the users
users_items_remove_seen = filter_by(users_items, train, user_item_col)

print("After remove seen items:", len(users_items_remove_seen))

Number of user-item pairs: 192153
After remove seen items: 189133


In [None]:
# Generate recommendations
baseline_recommendations = pd.merge(item_counts, users_items_remove_seen, on=['itemID'], how='inner')
baseline_recommendations.head()

Unnamed: 0,itemID,Count,userID
0,246,69,70
1,246,69,100
2,246,69,126
3,246,69,145
4,246,69,415


In [None]:
cols["col_prediction"] = "Count"

eval_map = map_at_k(test, baseline_recommendations, k=TOP_K, **cols)
eval_ndcg_10 = ndcg_at_k(test, baseline_recommendations, k=10, **cols)
eval_ndcg_5 = ndcg_at_k(test, baseline_recommendations, k=5, **cols)
eval_precision_10 = precision_at_k(test, baseline_recommendations, k=10, **cols)
eval_precision_5 = precision_at_k(test, baseline_recommendations, k=5, **cols)
eval_recall_10 = recall_at_k(test, baseline_recommendations, k=10, **cols)
eval_recall_5 = recall_at_k(test, baseline_recommendations, k=5, **cols)

print("MAP:\t%f" % eval_map,
      "NDCG@10:\t%f" % eval_ndcg_10,
      "NDCG@5:\t%f" % eval_ndcg_5,
      "Precision@10:\t%f" % eval_precision_10,
      "Precision@5:\t%f" % eval_precision_5,
      "Recall@10:\t%f" % eval_recall_10,
      "Recall@5:\t%f" % eval_recall_5,sep='\n')

MAP:	0.036325
NDCG@10:	0.068779
NDCG@5:	0.049045
Precision@10:	0.026121
Precision@5:	0.030079
Recall@10:	0.115351
Recall@5:	0.060106


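Here the baseline is low in absolute terms: nDCG@10 is about 0.07, Precision@10 about 0.03, and Recall@10 about 0.12. These are the reference numbers a trained recommender should beat on this dataset.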
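
To make the ranking numbers concrete, the sketch below computes Precision@k, Recall@k, and nDCG@k for a single user with binary relevance, using invented item IDs. These are the standard textbook definitions; the `recommenders` functions should report such per-user values averaged over users, and their exact handling of edge cases may differ.

```python
import math

# Invented example for one user
recommended = ["h1", "h2", "h3", "h4", "h5"]  # ranked list, best first
relevant = {"h2", "h5", "h9"}                 # items the user rated in the test set
k = 5

# 1 where a recommended item is relevant, 0 otherwise
hits = [1 if item in relevant else 0 for item in recommended[:k]]

precision = sum(hits) / k
recall = sum(hits) / len(relevant)

# DCG with binary gains: a hit at rank r (1-based) contributes 1 / log2(r + 1)
dcg = sum(h / math.log2(rank + 1) for rank, h in enumerate(hits, start=1))

# Ideal DCG: all relevant items (at most k of them) placed at the top
ideal_hits = min(len(relevant), k)
idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
ndcg = dcg / idcg

print(f"Precision@{k}: {precision:.4f}")
print(f"Recall@{k}:    {recall:.4f}")
print(f"nDCG@{k}:      {ndcg:.4f}")
```

Precision and recall only count hits, while nDCG also rewards placing them near the top of the list.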

<br>

### Concluding remarks

In this notebook, we discussed how to measure baseline performance for the movie recommendation example.
We covered very naive approaches as baselines, but they are still useful in the sense that they provide reference numbers for estimating the complexity of the given problem as well as the relative performance of the recommender models we are building.

In [None]:
if is_jupyter():
    # Record results with papermill and scrapbook for tests
    sb.glue("map", eval_map)
    sb.glue("ndcg", eval_ndcg_10)
    sb.glue("precision", eval_precision_10)
    sb.glue("recall", eval_recall_10)
    sb.glue("rmse", eval_rmse)
    sb.glue("mae", eval_mae)
    sb.glue("exp_var", eval_exp_var)
    sb.glue("rsquared", eval_rsquared)
