# BLU12 - Learning Notebook - Part 2 of 3 - Validation

In [None]:
from surprise import Dataset

from surprise.accuracy import mae, rmse

from surprise.model_selection import train_test_split, cross_validate

from surprise.prediction_algorithms import BaselineOnly, KNNBasic

# 1 Standard Notation

## 1.1 Users and Items

We write $U$ and $I$ for the sets of all users and items, with individual users $u, v \in U$ and items $i, j \in I$.

The set $U_i$ contains all users that rated $i$, and $U_{ij}$ contains users that rated both $i$ and $j$.

Similarly, $I_u$ stands for all items rated by user $u$, $I_{uv}$ for items rated by both users $u$ and $v$.

## 1.2 Ratings

$R$ denotes the set of all ratings, while $\hat{R}$ is the set of predicted ratings.

We introduce $R_{train}$ and $R_{test}$, which stand for ratings in the train and test sets, respectively.

Concerning ratings, $r_{ui}$ is the true rating given by user $u$ to the item $i$, while $\hat{r}_{ui}$ is the predicted one.

## 1.3 Statistics

The mean of all ratings is $\mu$, $\mu_u$ is the mean of all ratings by $u$, and $\mu_i$ is the mean rating given to $i$.

Finally, $\sigma$ is the global standard deviation, $\sigma_u$ and $\sigma_i$ are the standard deviations for user $u$ and item $i$, respectively.
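
To make the notation concrete, here is a minimal sketch on a made-up ratings dictionary. The user and item names, the rating values, and the helper functions `users_of` and `items_of` are all hypothetical; using the population standard deviation for $\sigma$ is also an assumption.

```python
from statistics import mean, pstdev

# Toy ratings r_ui keyed by (user, item); values are made up for illustration.
ratings = {
    ("u1", "i1"): 4.0, ("u1", "i2"): 3.0,
    ("u2", "i1"): 5.0, ("u2", "i3"): 2.0,
    ("u3", "i2"): 1.0,
}

def users_of(item):
    """U_i: the set of users that rated `item`."""
    return {u for (u, i) in ratings if i == item}

def items_of(user):
    """I_u: the set of items rated by `user`."""
    return {i for (u, i) in ratings if u == user}

U_i1 = users_of("i1")                        # U_i for item i1
U_i1_i2 = users_of("i1") & users_of("i2")    # U_ij: users that rated both i1 and i2
I_u1_u2 = items_of("u1") & items_of("u2")    # I_uv: items rated by both u1 and u2

mu = mean(ratings.values())                                     # global mean
mu_u1 = mean(r for (u, _), r in ratings.items() if u == "u1")   # mean rating of u1
sigma = pstdev(ratings.values())                                # global std. deviation
```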

# 2 Scikit-Surprise

In this notebook, we introduce `scikit-surprise`, Surprise from now on, a Python package that excels at Collaborative Filtering (CF).

From the documentation:

> [Surprise](http://surpriselib.com/) is a Python [scikit](https://www.scipy.org/scikits.html) for building and analyzing recommender systems.

The package provides convenient implementations for most steps in the CF pipeline, including:
* Dataset handling
* Built-in similarity measures
* Ready-to-use prediction algorithms, including baseline and neighborhood methods, plus advanced approaches
* Model selection methods.

Once we develop a solid intuition about RS and CF, we can use it to prototype ideas, without excessive focus on implementation.

(Surprise takes care of sparsity, vectorization and linear algebra for us, that is.)

Recommendation algorithms in Surprise work like standard `sklearn` estimators.

# 3 Load Dataset

Surprise allows us to use built-in datasets, or to build our own, which we will do in due time.

Fortunately, the [MovieLens](https://grouplens.org/datasets/movielens/100k/) dataset we've been using is readily available.

The `Dataset` class is used to manage datasets, although we should never instantiate it directly, instead using:
* `Dataset.load_builtin()`: load a built-in dataset
* `Dataset.load_from_file()`: load a dataset from a custom file
* `Dataset.load_from_folds()`: load a dataset from custom files with predefined cross-validation folds.

We use the `Dataset.load_builtin()` method to load (and download, if needed) the dataset.

In [None]:
dataset = Dataset.load_builtin('ml-100k')

Since this dataset comes without predefined cross-validation folds, we have to define any split explicitly.

For now, we use `build_full_trainset()`, which skips splitting altogether and builds a training set from all the ratings in the dataset.

In [None]:
R = dataset.build_full_trainset()

# 4 Baseline Model

Now that we have the ratings $R$ in a convenient format, we want to get to a baseline fast. (No surprises there.)

Ratings exhibit systematic user and item tendencies, i.e., some users give, and some items receive higher ratings than others.

There is a simple, yet effective, way to generate baseline predictions from such tendencies or *biases*.

We write $b_{ui}$ for the baseline estimate, which accounts for both the user bias $b_u$ and the item bias $b_i$.

$$\hat{r}_{ui} = b_{ui} = \mu + b_u + b_i$$

Typically, the calculations of $b_u$ and $b_i$ are coupled, for accuracy.

## 4.1 Example

Consider an average movie rating $\mu$ of 3. 

Knowing that item $i$ is rated 0.5 stars above average, and that $u$ rates 0.3 below average, we have:

$$\hat{r}_{ui} = b_{ui} = 3 + 0.5 + (-0.3) = 3.2$$

In this case, $\hat{r}_{ui}$ corresponds to the rating of a critical user to a good movie.

## 4.2 Estimating the Biases

Consider the set of $(u, i) \in U \times I$ pairs for which the rating is known, $K = \{(u, i) \mid r_{ui} \text{ is known}\}$.

The biases $b_u$ and $b_i$ are estimated by minimizing the regularized squared error over $K$:

$$\min_{b_*} \sum\limits_{(u, i) \in K} (r_{ui} - (\mu + b_u + b_i))^2 + \lambda(b_u^2 + b_i^2)$$

The first term, $(r_{ui} - (\mu + b_u + b_i))^2$, is the squared prediction error; minimizing it finds the $b_u$ and $b_i$ that best fit the known ratings, for all $u \in U$ and $i \in I$.

The regularization term $\lambda(b_u^2 + b_i^2)$ manages overfitting, penalizing the magnitudes of the parameters.

Baselines can be estimated in two different ways: stochastic gradient descent (SGD) and alternating least squares (ALS).
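
As a rough illustration of the SGD route, the sketch below fits $\mu$, $b_u$ and $b_i$ on a handful of made-up ratings in plain Python. It is not Surprise's implementation: the function name `fit_baselines_sgd`, the toy data, and the default values for `reg`, `learning_rate` and `n_epochs` are assumptions chosen for the example.

```python
# Toy (user, item, rating) triples; values are made up for illustration.
ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 3.0),
    ("u2", "i1", 4.0), ("u2", "i2", 2.0),
    ("u3", "i1", 3.0),
]

def fit_baselines_sgd(ratings, reg=0.02, learning_rate=0.005, n_epochs=200):
    """Estimate mu, b_u and b_i by SGD on the regularized squared error."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = {u: 0.0 for u, _, _ in ratings}
    b_i = {i: 0.0 for _, i, _ in ratings}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i])
            # Gradient step on err^2 + reg * (b_u^2 + b_i^2);
            # the constant factor 2 is absorbed into the learning rate.
            b_u[u] += learning_rate * (err - reg * b_u[u])
            b_i[i] += learning_rate * (err - reg * b_i[i])
    return mu, b_u, b_i

mu, b_u, b_i = fit_baselines_sgd(ratings)

# Baseline estimate for a pair that is not in the training data.
prediction = mu + b_u["u3"] + b_i["i2"]
```

On this toy data, item `i1` ends up with a higher bias than `i2`, and user `u1` with a higher bias than `u2`, mirroring their average ratings.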

## 4.3 Implementation

We start by initializing the estimator, the built-in `BaselineOnly()`, an implementation of the algorithm above.

By passing a parameter `bsl_options`, we define how baselines are computed.

In [None]:
bsl_options = {'method': 'sgd', 'learning_rate': 0.005}


baseline = BaselineOnly(bsl_options=bsl_options)

We use the `.fit()` method to train the algorithm and initialize some internal structures, including the similarity matrix (when needed).

Then, the `.predict()` method predicts $\hat{r}_{ui}$, calling the defined estimate method, i.e., the baseline algorithm. 

We use the attribute `.est` to retrieve the estimated rating $\hat{r}_{ui}$ from the prediction. 

In [None]:
uid = str(196)
iid = str(302)

pred = baseline.fit(R).predict(uid, iid)

pred.est

# 5 Collaborative Filtering (CF)

The most common approach to CF uses neighborhood models.

In the CF pipeline, before prediction, we have to compute user-user or item-item similarities.

![collaborative_filtering](../media/collaborative_filtering.png)

Thus, central to both is the similarity measure.

## 5.1 Distance Measures

### 5.1.1 Mean Squared Difference (MSD)

By default, Surprise uses the MSD.

The MSD shares the pitfalls of the Euclidean Distance because it considers the magnitude of the vectors and is sensitive to scaling.

#### User-user MSD

$$msd(u, v) = \frac{\sum\limits_{i \in I_{uv}} (r_{ui} - r_{vi})^2}{|I_{uv}|}$$

#### Item-item MSD

$$msd(i, j) = \frac{\sum\limits_{u \in U_{ij}} (r_{ui} - r_{uj})^2}{|U_{ij}|}$$
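
The formulas above can be sketched in a few lines of plain Python. The helper `msd` and the toy ratings are hypothetical. Since the MSD is a distance (lower means closer), it has to be turned into a similarity before it can be used as a weight; Surprise's documentation defines the MSD similarity as $\frac{1}{msd + 1}$, which the last line mirrors.

```python
# Toy ratings per user: {item: rating}; values are made up for illustration.
u = {"i1": 5.0, "i2": 3.0, "i3": 4.0}
v = {"i1": 4.0, "i2": 1.0, "i4": 2.0}

def msd(a, b):
    """Mean squared difference over the entries rated in both `a` and `b`."""
    common = a.keys() & b.keys()
    if not common:
        raise ValueError("no co-rated entries")
    return sum((a[k] - b[k]) ** 2 for k in common) / len(common)

distance = msd(u, v)               # ((5 - 4)^2 + (3 - 1)^2) / 2
similarity = 1 / (distance + 1)    # larger when the ratings are closer
```

The same function works for item-item MSD if the dictionaries are keyed by user instead of by item.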

### 5.1.2 Cosine Similarity

The cosine similarity is the most widely used measure. It is a normalized dot product, i.e., it considers only differences in direction, not in magnitude.

#### User-user cosine

$$cos(u, v) = \frac{u \cdot v}{||u|| \cdot ||v||}$$

#### Item-item cosine

$$cos(i, j) = \frac{i \cdot j}{||i|| \cdot ||j||}$$
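
A minimal sketch of the cosine similarity on sparse rating vectors follows. Restricting the computation to co-rated entries is an assumption made here to handle missing ratings; the helper name `cosine` and the toy data are hypothetical.

```python
from math import sqrt

# Toy ratings per user: {item: rating}; values are made up for illustration.
u = {"i1": 5.0, "i2": 3.0, "i3": 4.0}
v = {"i1": 4.0, "i2": 1.0, "i4": 2.0}

def cosine(a, b):
    """Cosine similarity computed over the entries rated in both `a` and `b`."""
    common = a.keys() & b.keys()
    dot = sum(a[k] * b[k] for k in common)
    norm_a = sqrt(sum(a[k] ** 2 for k in common))
    norm_b = sqrt(sum(b[k] ** 2 for k in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no overlap (or all-zero ratings): treat as not similar
    return dot / (norm_a * norm_b)

sim = cosine(u, v)

# Doubling every rating of v changes its magnitude but not its direction.
sim_scaled = cosine(u, {k: 2 * r for k, r in v.items()})
```

Because the norms divide out the magnitude, `sim` and `sim_scaled` agree up to floating-point error.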

### 5.1.3 Pearson Correlation

Another popular similarity measure is the Pearson correlation.

#### User-user correlation

$$pearson(u, v) = \frac{cov(u, v)}{\sigma_u \cdot \sigma_v}$$

#### Item-item correlation

$$pearson(i, j) = \frac{cov(i, j)}{\sigma_i \cdot \sigma_j}$$
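
The Pearson correlation can be read as a mean-centered cosine similarity. The sketch below, with a hypothetical helper `pearson` and made-up ratings, computes it over co-rated entries (an assumption, as above) and shows that a constant shift in ratings does not affect the result.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation computed over the entries rated in both `a` and `b`."""
    common = a.keys() & b.keys()
    if len(common) < 2:
        return 0.0  # not enough overlap to measure correlation
    mean_a = sum(a[k] for k in common) / len(common)
    mean_b = sum(b[k] for k in common) / len(common)
    cov = sum((a[k] - mean_a) * (b[k] - mean_b) for k in common)
    var_a = sum((a[k] - mean_a) ** 2 for k in common)
    var_b = sum((b[k] - mean_b) ** 2 for k in common)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant ratings: correlation is undefined, fall back to 0
    return cov / sqrt(var_a * var_b)

# Toy ratings; values are made up for illustration.
u = {"i1": 5.0, "i2": 3.0, "i3": 4.0}
v = {"i1": 3.0, "i2": 1.0, "i3": 2.0}   # same preferences as u, two stars lower
w = {"i1": 1.0, "i2": 3.0, "i3": 2.0}   # opposite preferences

same_taste = pearson(u, v)
opposite_taste = pearson(u, w)
```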

## 5.2 Implementation

We configure similarity measures the same way we did for baseline options, i.e., by passing a `sim_options` parameter to our estimator.

These are the available options, in line with our needs:
* `name`: the name of the similarity measure to use (including `cosine`, `msd`, and `pearson`)
* `user_based`: `True` if we want to compute user-user similarities, `False` for item-item
* `min_support`: the minimum number of common items (if `user_based: True`) or users (if `user_based: False`).

In [None]:
sim_options = {'name': 'cosine', 'user_based': False}

The similarity-based brand of recommenders we have studied fits under the $k$-NN family.

### 5.2.1 User-user prediction

Consider $N_i^k(u)$, the $k$ nearest neighbors of user $u$ that have rated $i$, computed using a similarity metric of our choice.

$$\hat{r}_{ui} = \frac{\sum\limits_{v \in N_i^k(u)} sim(u, v) \cdot r_{vi}}{\sum\limits_{v \in N_i^k(u)} sim(u, v)}$$

### 5.2.2 Item-item prediction

Take also $N_u^k(i)$, the $k$ nearest neighbors of item $i$ rated by user $u$, computed using our similarity measure.

$$\hat{r}_{ui} = \frac{\sum\limits_{j \in N_u^k(i)} sim(i, j) \cdot r_{uj}}{\sum\limits_{j \in N_u^k(i)} sim(i, j)}$$
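
The item-item prediction is a similarity-weighted average, which the following sketch spells out. The function `predict_item_item`, the toy ratings and the precomputed similarities are hypothetical; keeping only neighbors with positive similarity is an assumption of this example.

```python
def predict_item_item(user_ratings, sims_to_target, k=2):
    """Weighted average of the user's ratings on the k items most similar to the target."""
    # Candidate neighbors: items the user rated that have a positive similarity.
    candidates = [
        (sims_to_target[j], r)
        for j, r in user_ratings.items()
        if sims_to_target.get(j, 0.0) > 0
    ]
    # Keep the k most similar items.
    neighbors = sorted(candidates, reverse=True)[:k]
    if not neighbors:
        raise ValueError("no neighbors with positive similarity")
    numerator = sum(s * r for s, r in neighbors)
    denominator = sum(s for s, _ in neighbors)
    return numerator / denominator

# Ratings r_uj given by user u, and similarities sim(i, j) to the target item i.
user_ratings = {"j1": 5.0, "j2": 3.0, "j3": 1.0}
sims_to_target = {"j1": 0.9, "j2": 0.6, "j3": 0.1}

r_hat_k2 = predict_item_item(user_ratings, sims_to_target, k=2)  # (0.9*5 + 0.6*3) / 1.5
r_hat_k3 = predict_item_item(user_ratings, sims_to_target, k=3)  # adds j3 with weight 0.1
```

The user-user case is symmetric: swap the roles so that the ratings come from neighboring users $v$ on item $i$ and the weights are $sim(u, v)$.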

In the `sim_options` above, we already chose item-item recommendations based on cosine similarities.

Now, as we usually do, we initialize the estimator and call `fit()` and `predict()` on it.

In [None]:
knn = KNNBasic(sim_options=sim_options)
knn.fit(R)
pred = knn.predict(uid, iid)
    
pred.est

Note that Surprise provides extensions to the base $k$-NN, including using [mean-centering](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans), [z-score](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithZScore) and [bias](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBaseline) to scale predictions.

All prediction algorithms are [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html), with excellent documentation.

Now, we zoom in on how we perform model validation and selection in RS.

# 6 Model Selection

When ratings are available, the RS core computation is to predict $\hat{r}_{ui}$ for new items $i \in I \setminus I_u$ for user $u$.

We learn a function $f$ that maps user-item pairs into ratings, $f : U \times I \to S$, given by $\hat{r}_{ui} = f(u, i)$.

Assuming a continuous rating scale $S$, e.g., $S = [1, 5]$, rating prediction becomes a regression problem.

Given a loss function $\mathcal{L}$ that compares predictions $\hat{r}_{ui} = f(u, i)$ with known ratings $r_{ui}$, we want the $f$ that minimizes the total cost $J$.

$$\min_f J = \sum\limits_{r_{ui} \in R} \mathcal{L}(f(u, i), r_{ui})$$

Consequently, we need model validation after training to evaluate different recommendation techniques.

In this sense, an RS is just like any other ML system we have studied so far.

## 6.1 Measuring Accuracy

### 6.1.1 Mean Absolute Error (MAE)

The two most popular accuracy measures for continuous variables are the MAE and the RMSE.

Both metrics express the average prediction error in units of the variable of interest, and lower values are better.

The MAE measures the average of the absolute errors $|f(u, i) - r_{ui}|$, i.e., the average magnitude of the prediction errors.

$$MAE(f) = \frac{\sum\limits_{r_{ui} \in R_{test}} |f(u, i) - r_{ui} |}{|R_{test}|}$$

From an interpretation standpoint, the MAE is the more intuitive of the two: it is simply the average size of the error, in rating units.

### 6.1.2 Root Mean Squared Error (RMSE)

The RMSE is the square root of the average of squared errors $(f(u, i) - r_{ui})^2$.

$$MSE(f) = \frac{\sum\limits_{r_{ui} \in R_{test}} (f(u, i) - r_{ui})^2}{|R_{test}|}$$

Therefore:

$$RMSE(f) = \sqrt{MSE(f)}$$

Since the errors are squared and then averaged, the RMSE gives a higher weight to large errors.

We always have $RMSE(f) \geqslant MAE(f)$, since the RMSE amplifies large errors. The two coincide when all of the errors have the same magnitude, in which case $RMSE(f) = MAE(f)$.
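
A quick numerical check illustrates the relationship between the two metrics. The helpers `mae_score` and `rmse_score` are hypothetical stand-ins written in plain Python (not Surprise's `mae` and `rmse`), and the error values are made up.

```python
from math import sqrt

def mae_score(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse_score(errors):
    """Square root of the mean of the squared errors."""
    return sqrt(sum(e ** 2 for e in errors) / len(errors))

# Prediction errors f(u, i) - r_ui; values are made up for illustration.
uneven = [0.0, 0.0, 0.0, 4.0]    # one large error
even = [1.0, -1.0, 1.0, -1.0]    # all errors have the same magnitude

uneven_mae, uneven_rmse = mae_score(uneven), rmse_score(uneven)
even_mae, even_rmse = mae_score(even), rmse_score(even)
```

Both error lists have the same MAE, but the single large error in `uneven` doubles its RMSE, while for `even` the two metrics coincide.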

## 6.2 Train-test Split

In order to perform model validation, we need to split our ratings dataset $R$ into $R_{train}$ and $R_{test}$.

Surprise provides a `model_selection` package, inspired by `sklearn`.

The package contains a `train_test_split()` function, which splits a dataset (i.e., a Surprise `Dataset` object) into a train set and a test set.

In [None]:
R_train, R_test = train_test_split(dataset)
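
For intuition, a random split boils down to shuffling the ratings and holding out a fraction of them. The sketch below is a plain-Python illustration, not Surprise's implementation; the function `split_ratings`, its `seed` argument and the toy ratings are hypothetical. Surprise's own `train_test_split()` accepts `test_size` and `random_state` arguments for the same purpose.

```python
import random

def split_ratings(ratings, test_size=0.2, seed=0):
    """Shuffle the ratings and hold out a `test_size` fraction for testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = int(round(test_size * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

# Toy (user, item, rating) triples; values are made up for illustration.
ratings = [(f"u{n % 5}", f"i{n}", float(1 + n % 5)) for n in range(20)]

train, test = split_ratings(ratings, test_size=0.25, seed=42)
```

Every rating lands in exactly one of the two sets, so the model is never evaluated on a rating it was trained on.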

We start by computing the MAE and RMSE for the baseline model.

In [None]:
baseline.fit(R_train)
R_pred = baseline.test(R_test)

mae(R_pred)

In [None]:
rmse(R_pred)

Not bad. 

What about the $k$-NN, item-item CF model? Is it any better?

In [None]:
knn.fit(R_train)
R_pred_ = knn.test(R_test)

mae(R_pred_)

Wow. Not promising: the baseline seems to outperform the $k$-NN.

In [None]:
rmse(R_pred_)

It's official: the $k$-NN approach to CF, built on top of item-item cosine similarities, isn't an improvement.

## 6.3 Cross-validation

Unsurprisingly, given the similarities with `sklearn`, we can also run a cross-validation procedure.

We pass the algorithm, the dataset and the number of splits to the `cross_validate()` function.

In [None]:
res = cross_validate(baseline, dataset, measures=['RMSE', 'MAE'], cv=5)
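
For intuition on what `cv=5` means, the sketch below partitions sample indices into five folds, each used once as the test set while the rest serve as the train set. The function `kfold_indices` is a hypothetical plain-Python illustration, not Surprise's implementation; in particular, it does not shuffle, whereas library implementations typically shuffle the data first.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if f < n_samples % n_splits else 0)
        for f in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=12, n_splits=5))
test_sizes = [len(test_idx) for _, test_idx in folds]
```

Each fold yields one value per metric, which is why the cross-validation results hold one score per split.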

Let's take a look at `res`.

In [None]:
res

The return value is a dictionary with the accuracy metrics, fit time and test time for each fold.

Time to compare the results.

In [None]:
res['test_mae']

In [None]:
res['test_rmse']

We repeat the cross-validation process for the $k$-NN, item-item CF.

In [None]:
res_ = cross_validate(knn, dataset, measures=['RMSE', 'MAE'], cv=5)

res_['test_mae']

In [None]:
res_['test_rmse']

It's official: the $k$-NN doesn't improve on the baseline predictions.