In [7]:
import json
import sys
import os

import numpy as np
import pandas as pd

package_path = os.path.abspath('../../')
if package_path not in sys.path:
    sys.path.append(package_path)

from surprise import SVD, SVDpp
from surprise import NormalPredictor, KNNBaseline
from surprise.dump import dump

from src.scripts.DataLoader import DataLoader
from src.scripts.Data import Data
from src.scripts.Algorithm import Algorithm
from src.scripts.utils import test_algorithms, grid_search

## Input data

Let's load the data in the format required by the surprise package:

In [2]:
ratings_train_path = '../../data/ratings_training.csv' 
movies_path = '../../data/movies.csv'

dl = DataLoader(ratings_train_path, movies_path)
data = dl.load_dataset()

ratings = Data(data)

# Required to compute Novelty
popularity_ranks = dl.get_popularity_ranks()

## Quick evaluation

First, let's compare the baseline performance of a few algorithms provided by surprise to see where to focus during the parameter tuning phase. A regular 75/25 train/test split is enough at this stage, since we only need a rough estimate of each algorithm's performance.
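
As a rough illustration of what a 75/25 split does, here is a standalone sketch. It is not the code inside `test_algorithms`; the helper name `split_ratings` and the toy ratings are made up for this example.

```python
import random

def split_ratings(ratings, test_size=0.25, seed=0):
    """Shuffle (user, item, rating) tuples and split them into train/test lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Toy data: 8 ratings from 4 hypothetical users
toy_ratings = [(u, i, 3.0 + (u + i) % 3) for u in range(4) for i in range(2)]
train, test = split_ratings(toy_ratings)
print(len(train), len(test))  # 6 2
```

Each model is then fitted on `train` only and scored on `test`, so the reported errors come from ratings the model has not seen.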

In [4]:
# Define candidate algorithms
sim_options = {"name": "cosine", "user_based": False}
bsl_options = {"method": "sgd", "learning_rate": 0.005}
algorithms = {
    'normal_predictor': NormalPredictor(),
    'knn_baseline': KNNBaseline(sim_options=sim_options, bsl_options=bsl_options),
    'svd': SVD(),
    'svd++': SVDpp(),
}

In [5]:
eval_df = test_algorithms(algorithms, ratings, verbose=True)
print(eval_df)

Evaluating performance for: normal_predictor
Training the model and making predictions...
Computing RMSE and MAE...
RMSE: 1.4570
MAE:  1.1683

Done.

Evaluating performance for: knn_baseline
Training the model and making predictions...
Estimating biases using sgd...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing RMSE and MAE...
RMSE: 0.8964
MAE:  0.6979

Done.

Evaluating performance for: svd
Training the model and making predictions...
Computing RMSE and MAE...
RMSE: 0.8853
MAE:  0.6904

Done.

Evaluating performance for: svd++
Training the model and making predictions...
Computing RMSE and MAE...
RMSE: 0.8740
MAE:  0.6756

Done.

                    RMSE     MAE
svd++             0.8740  0.6756
svd               0.8853  0.6904
knn_baseline      0.8964  0.6979
normal_predictor  1.4570  1.1683


SVD++ achieves the best performance on the validation set among the four candidates (with default hyperparameters), in terms of both RMSE and MAE. NormalPredictor, which draws its predictions at random from a normal distribution, serves only as a reference point.
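
For reference, the two accuracy metrics can be written in a few lines. This is a minimal sketch on made-up numbers, not the surprise implementation; the function names `rmse` and `mae` are chosen for this example only.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def mae(predicted, actual):
    """Mean absolute error: average size of the error, in rating units."""
    absolute = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(absolute) / len(absolute)

# Toy ratings (illustrative values only)
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(predicted, actual))  # 0.75
print(mae(predicted, actual))   # 0.625
```

RMSE is never smaller than MAE on the same predictions, which matches the pattern in the table above.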

## Parameter tuning

After the initial algorithm selection, let's tune its hyperparameters to, hopefully, further reduce the model's prediction error. We use 3-fold cross-validation to keep the search time reasonable.

In [10]:
svdpp_param_grid = {
    'n_factors': [100,125],
    'n_epochs': [20],
    'lr_all': [0.005, 0.05, 0.5],
    'reg_all': [0.02, 0.05, 0.1]
}
best_params, best_scores = grid_search(data, SVDpp, svdpp_param_grid, joblib_verbose=1)

Initializing the search...
Extracting best parameters and scores...

Done.


In [11]:
print(f'Best params: {best_params}')
print(f'Best scores: {best_scores}')

Best params: {'rmse': {'n_factors': 125, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.05}, 'mae': {'n_factors': 125, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.05}}
Best scores: {'rmse': 0.876660358692631, 'mae': 0.6847703061233724}
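
To get a feel for the cost of this search, the sketch below enumerates the grid. It assumes that `grid_search` tries every combination exhaustively and fits once per fold, which is not shown in the notebook.

```python
from itertools import product

svdpp_param_grid = {
    'n_factors': [100, 125],
    'n_epochs': [20],
    'lr_all': [0.005, 0.05, 0.5],
    'reg_all': [0.02, 0.05, 0.1],
}
n_folds = 3  # 3-fold cross-validation, as stated in the text

# Cartesian product of all parameter values
combinations = [
    dict(zip(svdpp_param_grid, values))
    for values in product(*svdpp_param_grid.values())
]

print(len(combinations))            # 18
print(len(combinations) * n_folds)  # 54
```

Under that assumption the grid amounts to 18 candidate configurations and 54 model fits, which is why the grid is kept small for a comparatively slow algorithm such as SVD++.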


## User-centered validation metrics

Unfortunately, RMSE, MAE, or any other accuracy metric alone is not a good indicator of a recommendation model's performance (this also turned out to be the case after the original Netflix Prize). Recommendation systems are living things, and the ultimate validation of their performance comes from customers and their purchasing decisions. What we can do to learn more about the model's behavior is to use user-centric metrics, such as:
- hit rate (HR) (are predicted items relevant to a user?)
- cumulative hit rate (cHR) (does the model predict items that the user really likes?)
- average reciprocal hit rank (ARHR) (how high in the recommendation list do the hits appear?)
- user coverage (what percentage of users have at least one good recommendation, in terms of predicted rating?)
- diversity (how diverse/dissimilar are the items recommended to users, on average?)
- novelty (how many non-mainstream items are recommended to users, on average?)

To compute these metrics, we need a list of predicted items for each user, which can be obtained with the help of leave-one-out cross-validation. User coverage, diversity, and novelty additionally require the number of users, the similarity scores between movies, and the movie popularity ranks, respectively. Let's also serialize the best parameters.
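
The three hit-based metrics can be sketched as follows. This is a minimal illustration under assumed data structures: `top_n` maps a user to a ranked list of `(item, estimate)` pairs, and `left_out` maps a user to the single held-out `(item, rating)` pair. The function names and the 4.0 rating cutoff are illustrative and need not match what the `Algorithm` class does internally.

```python
def hit_rate(top_n, left_out):
    """Fraction of users whose left-out item appears in their top-N list."""
    hits = 0
    for user, (item, _) in left_out.items():
        if item in [rec for rec, _ in top_n.get(user, [])]:
            hits += 1
    return hits / len(left_out)

def cumulative_hit_rate(top_n, left_out, rating_cutoff=4.0):
    """Hit rate restricted to left-out items the user rated >= rating_cutoff."""
    liked = {u: (i, r) for u, (i, r) in left_out.items() if r >= rating_cutoff}
    return hit_rate(top_n, liked) if liked else 0.0

def average_reciprocal_hit_rank(top_n, left_out):
    """Like hit rate, but a hit at rank k only counts as 1/k."""
    total = 0.0
    for user, (item, _) in left_out.items():
        ranked = [rec for rec, _ in top_n.get(user, [])]
        if item in ranked:
            total += 1.0 / (ranked.index(item) + 1)
    return total / len(left_out)

# Toy data (hypothetical users and items)
top_n = {
    'u1': [('A', 4.8), ('B', 4.5), ('C', 4.1)],
    'u2': [('D', 4.9), ('A', 4.2), ('E', 4.0)],
    'u3': [('B', 4.7), ('C', 4.6), ('D', 4.4)],
    'u4': [('E', 4.3), ('A', 4.2), ('B', 4.0)],
}
left_out = {'u1': ('B', 5.0), 'u2': ('C', 4.5), 'u3': ('D', 3.0), 'u4': ('F', 4.0)}

print(hit_rate(top_n, left_out))                               # 0.5
print(round(cumulative_hit_rate(top_n, left_out), 3))          # 0.333
print(round(average_reciprocal_hit_rank(top_n, left_out), 3))  # 0.208
```

In the toy data, users `u1` and `u3` get a hit (at ranks 2 and 3), so HR is 2/4, while only one of the three highly rated left-out items is recovered, giving a cHR of 1/3.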

In [21]:
# Save the best parameters
with open('../models/svd_final_params.json', 'w') as file:
    json.dump(best_params, file)

In [19]:
with open('../models/svd_final_params.json', 'r') as file:
    best_params = json.load(file)
selected_algo = SVDpp(**best_params['rmse'])

In [10]:
algo = Algorithm(selected_algo, ratings, popularity_ranks)
# Compute accuracy metrics using single trainset and validation set
# Compute user-centered metrics using leave-one-out cross-validation
algo.evaluate(compute_accuracy_only=False, verbose=True) 

Computing accuracy metrics...

Computing user-centered metrics...
Fitting the algorithm to the leave-one-out CV trainset...
Making predictions for the leave-one-out CV testset...
Making precictions for the anti testset...
Generating top N recommendations for all users...
Computing hit rate...
Computing cumulative hit rate...
Computing average reciprocal hit rank...
Computing diversity...
Computing user coverage...
Computing novelty...

RMSE = 0.870192
MSE = 0.679322
HR = 0.028933
cHR = 0.044606
ARHR = 0.008582
Diversity = 0.036372
User Coverage = 0.943200
Novelty = 0.000000

Legend:

HR:        Hit Rate; how often we are able to recommend a left-out rating. Higher is better.
cHR:       Cumulative Hit Rate; hit rate, confined to ratings above a certain threshold. Higher is better.
ARHR:      Average Reciprocal Hit Rank - Hit rate that takes the ranking into account. Higher is better.
Diversity: 1-S, where S is the average similarity score between every possible pair of recommendations
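
The diversity definition above (1-S) can be sketched like this. It is an illustration only: the similarity lookup keyed by `frozenset` pairs and the toy values are assumptions, and the notebook's own similarity scores between movies are not shown.

```python
from itertools import combinations

def diversity(top_n, similarity):
    """1 - S, where S is the mean similarity over all pairs of items
    recommended to the same user."""
    total, n_pairs = 0.0, 0
    for recs in top_n.values():
        for a, b in combinations(recs, 2):
            total += similarity[frozenset((a, b))]
            n_pairs += 1
    return 1.0 - total / n_pairs

# Toy data (hypothetical items and similarity scores)
top_n = {'u1': ['A', 'B', 'C'], 'u2': ['A', 'C']}
similarity = {
    frozenset(('A', 'B')): 0.9,
    frozenset(('A', 'C')): 0.2,
    frozenset(('B', 'C')): 0.4,
}

print(round(diversity(top_n, similarity), 3))  # 0.575
```

A value close to 0 therefore means that the items recommended to a user are, on average, very similar to each other.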
 

### Remarks

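Until we compare these metrics across a few different models, we cannot really say whether the values we obtained are high or low. What is clearly noticeable is that novelty is equal to 0. Novelty is computed as the sum of the popularity ranks of the movies recommended to all users, divided by the total number of recommended items. Taken at face value, this means that the model recommends only items with low ranks, i.e. the most popular items, although a value of exactly 0 is unusual for an average of ranks and is worth double-checking. User coverage seems to be quite high (~94%), but it should be noted that it was computed on a small sample. Note also that the value labelled MSE in the output (0.679) is most likely the MAE, since the square of the reported RMSE would be about 0.757.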
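
The novelty computation described above can be sketched as a mean popularity rank. This is an illustration under assumptions: ranks start at 1 for the most popular movie, and the item names and rank values are made up; `get_popularity_ranks` may number movies differently.

```python
def novelty(top_n, popularity_ranks):
    """Mean popularity rank of all recommended items (1 = most popular)."""
    ranks = [popularity_ranks[item] for recs in top_n.values() for item in recs]
    return sum(ranks) / len(ranks)

# Toy popularity ranks (hypothetical): A and B are blockbusters, D is obscure
popularity_ranks = {'A': 1, 'B': 2, 'C': 50, 'D': 400}

mainstream = {'u1': ['A', 'B'], 'u2': ['B', 'A']}
niche = {'u1': ['C', 'D'], 'u2': ['D', 'A']}

print(novelty(mainstream, popularity_ranks))  # 1.5
print(novelty(niche, popularity_ranks))       # 212.75
```

With ranks starting at 1, as assumed here, the mean can never drop below 1, so how a value of exactly 0 arises depends on how the ranks are numbered and looked up in the actual implementation.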

## Recommend movies

The ultimate goal of a recommendation model/system is to, well, recommend different items, which in our case are movies. Let's also serialize the final model.

In [5]:
# Fit model on full dataset separately
algo.fit_with_full_trainset()

In [8]:
# Save the model
dump('../models/svd_final.pkl', algo=algo, verbose=1)

The dump has been saved as file ../models/svd_final.pkl


In [17]:
# Sample recommendations
recs = algo.generate_recommendations('1732491', topN=10)
print('Recommendations:')
for movie_id, rating in recs:
    print(f'{dl.get_movie_name(int(movie_id))}: {np.round(rating, 3)}')

Recommendations:
The West Wing: Season 2: 4.577
House M.D.: Season 1: 4.555
Coupling: Season 2: 4.554
Lord of the Rings: The Two Towers: Extended Edition: 4.545
Finding Nemo (Widescreen): 4.523
24: Season 1: 4.51
The Lord of the Rings: The Fellowship of the Ring: Extended Edition: 4.503
Sex and the City: Season 3: 4.496
CSI: Season 2: 4.473
Six Feet Under: Season 2: 4.457
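
Conceptually, a top-N list like the one above is obtained by predicting ratings for movies the user has not rated yet and keeping the N highest estimates. The sketch below shows only that selection step on made-up `(user, item, estimate)` tuples; the helper name `top_n_per_user` is hypothetical and the internals of `generate_recommendations` are not shown in this notebook.

```python
from collections import defaultdict
import heapq

def top_n_per_user(predictions, n=10):
    """Group (user, item, estimate) tuples by user and keep the n best items."""
    by_user = defaultdict(list)
    for user, item, estimate in predictions:
        by_user[user].append((item, estimate))
    return {
        user: heapq.nlargest(n, items, key=lambda pair: pair[1])
        for user, items in by_user.items()
    }

# Toy predictions for unseen items (illustrative values only)
predictions = [
    ('u1', 'A', 3.9), ('u1', 'B', 4.6), ('u1', 'C', 4.2),
    ('u2', 'A', 2.5), ('u2', 'D', 4.8),
]

for user, recs in top_n_per_user(predictions, n=2).items():
    print(user, recs)
# u1 [('B', 4.6), ('C', 4.2)]
# u2 [('D', 4.8), ('A', 2.5)]
```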


## Model's performance on the test set

In [18]:
ratings_test_path = '../../data/ratings_test.csv'
testset = pd.read_csv(ratings_test_path)
algo.validate_test(testset)

RMSE: 1.080824918502178
MAE: 0.9066404037474705
