In [4]:
import sys
import os

import numpy as np

package_path = os.path.abspath('../../src')
if package_path not in sys.path:
    sys.path.append(package_path)

from surprise import SVD, SVDpp
from surprise import NormalPredictor, KNNBaseline

from surprise.model_selection import train_test_split
from surprise.dump import dump

from scripts.DataLoader import DataLoader
from scripts.Metrics import Metrics
from scripts.utils import test_algorithms, grid_search, predict_top_n_recs, compute_topN_metrics

## Input data

Let's load the data in the format required by the surprise package:

In [2]:
ratings_train_path = '../../data/ratings_training.csv' 
movies_path = '../../data/movies.csv'

dl = DataLoader(ratings_train_path, movies_path)
ratings = dl.load_dataset()

## Quick evaluation

First, let's compare the baseline performance of a few algorithms provided by surprise, to see where to focus our attention during the parameter tuning phase. A regular 75/25 train/test split is enough at this stage, since we only need a rough estimate of how each algorithm performs.

In [3]:
# Define candidate algorithms
# algorithms = {'normal_predictor': NormalPredictor(), 'knn_baseline': KNNBaseline(), 'svd': SVD(), 'svd++': SVDpp()}
algorithms = {'normal_predictor': NormalPredictor(), 'svd': SVD(), 'svd++': SVDpp()}

# Split the data once, so the split is not repeated for every model
train, test = train_test_split(ratings, test_size=0.25, random_state=0)

In [5]:
eval_df = test_algorithms(algorithms, train, test)
print(eval_df)

Evaluating performance for: normal_predictor
Training the model and making predictions...
Computing RMSE and MAE...
RMSE: 1.4424
MAE:  1.1551

Done.

Evaluating performance for: svd
Training the model and making predictions...
Computing RMSE and MAE...
RMSE: 0.9217
MAE:  0.7215

Done.

Evaluating performance for: svd++
Training the model and making predictions...
Computing RMSE and MAE...
RMSE: 0.9025
MAE:  0.7000

Done.



It seems that SVD++ achieves the best performance on the held-out 25% split out of the three candidates that were run (KNNBaseline is commented out above), using default hyperparameters, in terms of both RMSE (0.9025 vs. 0.9217 for SVD) and MAE (0.7000 vs. 0.7215). NormalPredictor generates predictions at random from a normal distribution, so it serves only as a reference point.
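For reference, the two error measures reported above can be sketched in a few lines of plain Python. This is only an illustration of the definitions, not the code inside `test_algorithms`; the function names and the toy rating pairs are assumptions made for this example.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Hypothetical (true, estimated) ratings, not taken from the dataset
toy_pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]

toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
print(f"RMSE: {toy_rmse:.4f}")
print(f"MAE:  {toy_mae:.4f}")
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE and is never smaller than MAE on the same predictions.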

## Parameter tuning

Having selected the algorithm, let's tune its hyperparameters to, hopefully, further reduce the prediction error of the model. We use 3-fold cross-validation to keep the search time manageable.

In [13]:
svdpp_param_grid = {
    'n_factors': [20,50,100,200],
    'n_epochs': [20,40],
    'lr_all': [0.005, 0.007, 0.009],
    'reg_all': [0.02, 0.05, 0.1, 0.5]
}

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   6 | elapsed:  4.1min remaining:  2.0min
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  4.3min finished
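To get a feeling for the cost of a search over this grid, the sketch below enumerates the parameter combinations and counts the model fits needed with 3-fold cross-validation. It assumes an exhaustive search that trains one model per combination and fold; whether the `grid_search` helper works exactly this way is not shown in this notebook.

```python
from itertools import product

svdpp_param_grid = {
    'n_factors': [20, 50, 100, 200],
    'n_epochs': [20, 40],
    'lr_all': [0.005, 0.007, 0.009],
    'reg_all': [0.02, 0.05, 0.1, 0.5],
}
n_folds = 3  # 3-fold cross-validation, as described above

# Cartesian product of all parameter values, one dict per combination
names = list(svdpp_param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*svdpp_param_grid.values())
]

# Assumption: one model is trained per (combination, fold) pair
n_fits = len(combinations) * n_folds
print(len(combinations), "combinations,", n_fits, "fits")  # 96 combinations, 288 fits
```

Every value added to one of the lists multiplies the number of fits, which is why the number of folds is kept low here.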


In [5]:
params, scores = grid_search(ratings, SVDpp, {'n_factors': [10]}, joblib_verbose=1)

Initializing the search...


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Extracting best parameters and scores...

Done


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   20.6s finished


## User-centered validation metrics

Unfortunately, RMSE, MAE, or any other accuracy metric alone is not a good indicator of how well a recommendation model performs (this also turned out to be the case after the original Netflix Prize). Recommendation systems operate in a constantly changing environment, and the ultimate validation of their performance comes from customers and their shopping decisions. What we can do offline to learn more about the model's behavior is to use user-centric metrics, such as:
- hit rate (HR) (are predicted items relevant to a user?)
- cumulative hit rate (cHR) (does the model predict items that the user really likes?)
- average reciprocal hit rank (ARHR) (do the hits appear near the top of the recommendation list?)
- user coverage (what percentage of users have at least one good recommendation, in terms of predicted rating?)
- diversity (how diverse/dissimilar are the items recommended to users, on average?)
- novelty (how many non-mainstream items are recommended to users, on average?)

To compute these metrics, we need a list of predicted items for each user, which can be obtained with the help of leave-one-out cross-validation. User coverage, diversity, and novelty additionally require the number of users, similarity scores between movies, and movie popularity ranks, respectively.
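The hit-based metrics can be illustrated with a small sketch in plain Python. It assumes one left-out (item, true rating) pair per user and a ranked top-N list per user; the function name, the data layout, and the rating cutoff of 4.0 are assumptions for this example and may differ from what the `Metrics` class does.

```python
def hit_metrics(left_out, top_n, rating_cutoff=4.0):
    """Compute HR, cHR and ARHR from leave-one-out data.

    left_out: user -> (left-out item, true rating)
    top_n:    user -> ranked list of recommended items (best first)
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    liked_hits = 0
    liked_total = 0
    for user, (item, true_rating) in left_out.items():
        recs = top_n.get(user, [])
        rank = recs.index(item) + 1 if item in recs else None
        if rank is not None:
            hits += 1
            reciprocal_rank_sum += 1.0 / rank
        # cHR only considers left-out items the user actually liked
        if true_rating >= rating_cutoff:
            liked_total += 1
            if rank is not None:
                liked_hits += 1
    n = len(left_out)
    return {
        'HR': hits / n,
        'cHR': liked_hits / liked_total if liked_total else 0.0,
        'ARHR': reciprocal_rank_sum / n,
    }

# Hypothetical toy data: four users, top-3 recommendations each
toy_left_out = {'u1': ('m3', 5.0), 'u2': ('m7', 2.0),
                'u3': ('m1', 4.0), 'u4': ('m9', 4.5)}
toy_top_n = {'u1': ['m2', 'm3', 'm5'], 'u2': ['m7', 'm4', 'm6'],
             'u3': ['m8', 'm2', 'm5'], 'u4': ['m4', 'm6', 'm9']}

toy_metrics = hit_metrics(toy_left_out, toy_top_n)
print(toy_metrics)
```

In this toy example three of the four left-out items are recovered (HR = 0.75), but only two of the three items the users liked are among them (cHR = 2/3), and ARHR is lower than HR because two of the hits are not ranked first.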

In [3]:
svdpp = SVDpp()
left_out_predictions, topN = predict_top_n_recs(svdpp, ratings)

In [23]:
# For novelty
popularity_ranks = dl.getPopularityRanks()

# For diversity
full_training_set = ratings.build_full_trainset()
sim_options = {'name': 'cosine', 'user_based': False}
simsAlgo = KNNBaseline(sim_options=sim_options, verbose=False)
simsAlgo.fit(full_training_set)

# For user coverage
n_users = len(np.unique([int(row[0]) for row in ratings.raw_ratings]))

In [36]:
metrics = compute_topN_metrics(left_out_predictions, topN, popularity_ranks, simsAlgo, n_users)

HR = 0.030736
cHR = 0.045129
ARHR = 0.009693
Diversity = 0.029402
User Coverage = 0.956295
Novelty = 0.000000

Legend:

HR:        Hit Rate; how often we are able to recommend a left-out rating. Higher is better.
cHR:       Cumulative Hit Rate; hit rate, confined to ratings above a certain threshold. Higher is better.
ARHR:      Average Reciprocal Hit Rank - Hit rate that takes the ranking into account. Higher is better.
Diversity: 1-S, where S is the average similarity score between every possible pair of recommendations
           for a given user. Higher means more diverse.
User Coverage:  Ratio of users for whom recommendations above a certain threshold exist. Higher is better.
Novelty:   Average popularity rank of recommended items. Higher means more novel.
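The remaining three definitions from the legend can be sketched in the same way. This is a minimal illustration, assuming that top-N lists hold (item, estimated rating) pairs, that pairwise similarities are available as a lookup table, and that the coverage threshold is 4.0; all names and toy values are hypothetical and not taken from `compute_topN_metrics`.

```python
from itertools import combinations

def diversity(top_n, sims):
    """1 - S, with S the average similarity over all recommendation pairs."""
    total, n_pairs = 0.0, 0
    for recs in top_n.values():
        items = [item for item, _ in recs]
        for a, b in combinations(items, 2):
            total += sims[frozenset((a, b))]
            n_pairs += 1
    return 1.0 - total / n_pairs

def novelty(top_n, popularity_ranks):
    """Average popularity rank of all recommended items."""
    ranks = [popularity_ranks[item]
             for recs in top_n.values() for item, _ in recs]
    return sum(ranks) / len(ranks)

def user_coverage(top_n, n_users, rating_threshold=4.0):
    """Share of users with at least one recommendation above the threshold."""
    covered = sum(
        1 for recs in top_n.values()
        if any(est >= rating_threshold for _, est in recs)
    )
    return covered / n_users

# Hypothetical toy data: user -> [(item, estimated rating), ...]
toy_top_n = {'u1': [('a', 4.5), ('b', 3.9), ('c', 3.0)],
             'u2': [('a', 3.8), ('d', 3.5)]}
toy_sims = {frozenset(('a', 'b')): 0.8, frozenset(('a', 'c')): 0.2,
            frozenset(('b', 'c')): 0.5, frozenset(('a', 'd')): 0.1}
toy_ranks = {'a': 1, 'b': 2, 'c': 10, 'd': 50}  # 1 = most popular

toy_diversity = diversity(toy_top_n, toy_sims)
toy_novelty = novelty(toy_top_n, toy_ranks)
toy_coverage = user_coverage(toy_top_n, n_users=2)
print(toy_diversity, toy_novelty, toy_coverage)
```

Note that novelty defined this way is not bounded by 1: it grows with the size of the catalog, so it is only meaningful when compared between models evaluated on the same data.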


### Remarks

Until we compare these metrics across a few different models, we can't really say whether the values we obtained are high or low. What is clearly noticeable is that novelty equals 0. This is to be expected, because the metric is computed from movie popularity ranks (the more ratings a movie has, the lower its rank) and I removed infrequently rated movies beforehand. User coverage also seems quite high (~96%), but it should be noted that it was computed on a small sample.

## Save the model

Surprise provides a wrapper around pickle to serialize fitted models. Let's save the final model so that we can validate it on a test set in a separate notebook.

In [5]:
dump('../models/svd_final', algo=svdpp, verbose=1)

The dump has been saved as file ../models/svd_final


## Recommend movies

The ultimate goal of a recommendation model/system is to, well, recommend items, which in our case are movies.