In [5]:
import sys
import os

import numpy as np
import pandas as pd

package_path = os.path.abspath('../../src')
if package_path not in sys.path:
    sys.path.append(package_path)

from surprise import SVD, SVDpp
from surprise import NormalPredictor, KNNBaseline

from scripts.DataLoader import DataLoader
from scripts.Data import Data
from scripts.Algorithm import Algorithm
from scripts.utils import test_algorithms, grid_search

## Input data

Let's load the data in the format required by the surprise package:

In [2]:
ratings_train_path = '../../data/ratings_training.csv' 
movies_path = '../../data/movies.csv'

dl = DataLoader(ratings_train_path, movies_path)
data = dl.load_dataset()

ratings = Data(data)

# Required to compute Novelty
popularity_ranks = dl.get_popularity_ranks()

## Quick evaluation

First, let's compare the baseline performance of a few algorithms provided by surprise, to see where to focus during the parameter tuning phase. A regular 75/25 train/test split is enough at this stage, since we only need a rough estimate of each algorithm's performance.
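For illustration, a 75/25 split boils down to shuffling the rating rows and holding out a quarter of them. The sketch below uses only the standard library. The `split_ratings` helper and the toy rows are hypothetical and are not part of the project's `Data` class, whose internals are not shown here.

```python
import random

def split_ratings(rows, test_fraction=0.25, seed=0):
    """Shuffle rating rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (user_id, movie_id, rating) rows, for illustration only.
rows = [(f"u{i % 5}", f"m{i}", 1 + i % 5) for i in range(20)]
train_rows, test_rows = split_ratings(rows)
print(len(train_rows), len(test_rows))  # 15 5
```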

In [6]:
# Define candidate algorithms
sim_options = {"name": "cosine", "user_based": False}
bsl_options = {"method": "sgd", "learning_rate": 0.005}
algorithms = {
    'normal_predictor': NormalPredictor(),
    'knn_baseline': KNNBaseline(sim_options=sim_options, bsl_options=bsl_options),
    'svd': SVD(),
    'svd++': SVDpp(),
}

In [4]:
eval_df = test_algorithms(algorithms, ratings)
print(eval_df)

Evaluating performance for: normal_predictor
Training the model and making predictions...
Computing RMSE and MAE...
RMSE: 1.4403
MAE:  1.1533

Done.

Evaluating performance for: svd
Training the model and making predictions...
Computing RMSE and MAE...
RMSE: 0.9197
MAE:  0.7202

Done.

Evaluating performance for: svd++
Training the model and making predictions...
Computing RMSE and MAE...
RMSE: 0.9037
MAE:  0.7014

Done.

                    RMSE     MAE
svd++             0.9037  0.7014
svd               0.9197  0.7202
normal_predictor  1.4403  1.1533


SVD++ achieves the best performance on the validation set among the candidates evaluated above (using default hyperparameters), in terms of both RMSE and MAE. NormalPredictor draws its predictions at random from a normal distribution, so it serves only as a reference point.
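As a reminder of what the two error metrics measure, here is a minimal sketch that computes them from (true rating, predicted rating) pairs. The pairs are made up for illustration and the helper names are hypothetical; this is not how `test_algorithms` computes its numbers.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, predicted) rating pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, predicted) rating pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

# Made-up (true, predicted) ratings, for illustration only.
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]
print(round(rmse(pairs), 4), mae(pairs))  # 0.6124 0.5
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE and is never smaller than MAE on the same predictions.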

## Parameter tuning

Having selected the algorithm, let's tune its hyperparameters to, hopefully, reduce the model's prediction error further. We use 3-fold cross-validation to keep the run time manageable.

In [13]:
svdpp_param_grid = {
    'n_factors': [50, 100, 200],
    'n_epochs': [20],
    'lr_all': [0.005, 0.05, 0.5],
    'reg_all': [0.02, 0.05, 0.1, 0.5]
}

best_params, best_scores = grid_search(ratings, SVDpp, svdpp_param_grid, joblib_verbose=1)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   6 | elapsed:  4.1min remaining:  2.0min
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  4.3min finished
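To get a feel for the cost of an exhaustive search, the grid can be expanded by hand. The sketch below assumes that every combination is evaluated on every fold, which is how an exhaustive grid search normally works. Whether the project's `grid_search` helper does exactly this is not shown here.

```python
from itertools import product

# Same grid as in the cell above.
svdpp_param_grid = {
    'n_factors': [50, 100, 200],
    'n_epochs': [20],
    'lr_all': [0.005, 0.05, 0.5],
    'reg_all': [0.02, 0.05, 0.1, 0.5],
}

names = list(svdpp_param_grid)
combos = [
    dict(zip(names, values))
    for values in product(*(svdpp_param_grid[name] for name in names))
]

n_folds = 3  # 3-fold cross-validation
print(len(combos), len(combos) * n_folds)  # 36 108
```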


## User-centered validation metrics

Unfortunately, RMSE, MAE, or any other accuracy metric alone is not a good indicator of a recommendation model's performance (this also turned out to be the case after the original Netflix Prize). Recommendation systems are living things, and the ultimate validation of their performance comes from customers and their shopping decisions. What we can do to learn more about the model's behavior, however, is use user-centric metrics such as:
- hit rate (HR) (are predicted items relevant to a user?)
- cumulative hit rate (cHR) (does the model predict items that the user really likes?)
- average reciprocal hit rank (ARHR) (do the hits appear near the top of the recommendation list?)
- user coverage (what percentage of users have at least one good recommendation, in terms of predicted rating?)
- diversity (how diverse/dissimilar are the items recommended to users, on average?)
- novelty (how many non-mainstream items are recommended to users, on average?)
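The three hit-based metrics from the list above can be sketched as follows, using one common set of definitions for leave-one-out evaluation: each user has one held-out rating, and a hit means the held-out item appears in that user's top-N list. The data structures, function names, and the rating cutoff of 4.0 are assumptions made for illustration; the project's `Algorithm` class may compute these differently.

```python
def hit_rate(top_n, left_out):
    """Fraction of users whose held-out item appears in their top-N list."""
    hits = sum(1 for user, (item, _) in left_out.items()
               if item in top_n.get(user, []))
    return hits / len(left_out)

def cumulative_hit_rate(top_n, left_out, rating_cutoff=4.0):
    """Hit rate restricted to held-out items the user actually rated highly."""
    liked = {u: (i, r) for u, (i, r) in left_out.items() if r >= rating_cutoff}
    return hit_rate(top_n, liked) if liked else 0.0

def average_reciprocal_hit_rank(top_n, left_out):
    """Like hit rate, but each hit counts as 1/rank, rewarding top positions."""
    total = 0.0
    for user, (item, _) in left_out.items():
        ranked = top_n.get(user, [])
        if item in ranked:
            total += 1.0 / (ranked.index(item) + 1)
    return total / len(left_out)

# Toy data: ranked top-3 lists and one held-out (item, true rating) per user.
top_n = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"],
         "u3": ["g", "h", "i"], "u4": ["a", "d", "g"]}
left_out = {"u1": ("b", 5.0), "u2": ("x", 4.0),
            "u3": ("g", 2.0), "u4": ("g", 4.5)}

print(hit_rate(top_n, left_out))                               # 0.75
print(round(cumulative_hit_rate(top_n, left_out), 4))          # 0.6667
print(round(average_reciprocal_hit_rank(top_n, left_out), 4))  # 0.4583
```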

To compute these metrics, we need a list of predicted items for each user, which can be obtained with leave-one-out cross-validation. User coverage, diversity, and novelty additionally require the number of users, similarity scores between movies, and movie popularity ranks, respectively.
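A minimal sketch of how the extra inputs enter the last three metrics, under commonly used definitions: coverage as the share of users with at least one recommendation above a rating threshold, diversity as one minus the mean pairwise similarity within each list, and novelty as the mean popularity rank of recommended items. All names, the threshold of 4.0, and the toy data are assumptions for illustration, not the project's actual implementation.

```python
from itertools import combinations

def user_coverage(top_n, n_users, rating_threshold=4.0):
    """Share of users with at least one recommendation above the threshold."""
    covered = sum(1 for recs in top_n.values()
                  if any(score >= rating_threshold for _, score in recs))
    return covered / n_users

def diversity(top_n, similarity):
    """1 - mean similarity over all item pairs within each user's list."""
    total, n_pairs = 0.0, 0
    for recs in top_n.values():
        items = [item for item, _ in recs]
        for a, b in combinations(items, 2):
            total += similarity[frozenset((a, b))]
            n_pairs += 1
    return 1.0 - total / n_pairs if n_pairs else 0.0

def novelty(top_n, popularity_ranks):
    """Mean popularity rank of recommended items (rank 1 = most rated)."""
    ranks = [popularity_ranks[item]
             for recs in top_n.values() for item, _ in recs]
    return sum(ranks) / len(ranks) if ranks else 0.0

# Toy data: (item, predicted rating) lists, pairwise similarities, ranks.
top_n = {"u1": [("a", 4.5), ("b", 3.9)], "u2": [("a", 3.2), ("c", 3.0)]}
similarity = {frozenset(("a", "b")): 0.8, frozenset(("a", "c")): 0.2}
popularity_ranks = {"a": 1, "b": 2, "c": 10}

print(user_coverage(top_n, n_users=2))   # 0.5
print(diversity(top_n, similarity))      # 0.5
print(novelty(top_n, popularity_ranks))  # 3.5
```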

In [3]:
algo = Algorithm(SVDpp(**best_params), ratings, popularity_ranks)
# Compute accuracy metrics using single trainset and validation set
# Compute user-centered metrics using leave-one-out cross-validation
algo.evaluate(print_accuracy_only=False) 

### Remarks

Until we compare these metrics across a few different models, we cannot really say whether the values we obtained are high or low. What is clearly noticeable is that novelty equals 0. This is to be expected, because the metric is computed from movie popularity ranks (the more ratings a movie has, the lower its rank) and I removed infrequently rated movies beforehand. User coverage also seems quite high (~96%), but note that it was computed on a small sample.

## Recommend movies

The ultimate goal of a recommendation model/system is to, well, recommend items, which in our case are movies.

In [5]:
recs = algo.generate_recommendations('410917', topN=10)

In [9]:
print('Recommendations:')
for movie_id, rating in recs:
    print(f'{dl.get_movie_name(int(movie_id))}: {np.round(rating, 3)}')

Recommendations:
The Simpsons: Season 4: 4.766
Monty Python's Life of Brian: 4.708
Simpsons Gone Wild: 4.664
The Office Special: 4.659
Band of Brothers: 4.579
The Third Man: 4.55
Sunset Boulevard: 4.545
The Simpsons: Season 2: 4.511
Michael Moore's The Awful Truth: Season 1: 4.51
Eternal Sunshine of the Spotless Mind: 4.507
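How `algo.generate_recommendations` builds this list is not shown here. A common approach, sketched below as an assumption, is to predict a rating for every movie the user has not rated yet and keep the N highest. The `top_n_unseen` helper and the toy scores are hypothetical.

```python
import heapq

def top_n_unseen(predicted, seen, n=10):
    """Return the n highest-scored (item, score) pairs the user has not rated."""
    candidates = ((item, score) for item, score in predicted.items()
                  if item not in seen)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Toy predicted ratings for one user; "m3" was already rated by the user.
predicted = {"m1": 4.2, "m2": 3.1, "m3": 4.8, "m4": 2.5, "m5": 4.5}
seen = {"m3"}
print(top_n_unseen(predicted, seen, n=2))  # [('m5', 4.5), ('m1', 4.2)]
```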


## Model's performance on the test set

In [4]:
ratings_test_path = '../../data/ratings_test.csv'
testset = pd.read_csv(ratings_test_path)
algo.validate_test(testset)

RMSE: 1.0900652100508552
MAE: 0.9073789256477611
