# Strong Baseline: SVD

## Why SVD?

1. SVD performs competitively on MovieLens-100K in some [benchmarks](http://surpriselib.com/), and that dataset is similar in some ways to the dataset for the ML Challenge.
2. It became widely known after the Netflix Prize winners used it.
3. It's a pure collaborative filtering approach: it needs only user-movie ratings, not any other information about the movie or the user.
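
To make the third point concrete, here is a minimal sketch of the biased matrix-factorisation prediction rule that Surprise documents for its `SVD` algorithm: a global mean, a user bias, an item bias, and the dot product of two latent factor vectors. The function name and all numbers are made up for illustration; nothing here is learned from the challenge data.

```python
# Illustrative only: hand-picked parameters, not values learned from real data.

def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Biased matrix-factorisation estimate: mu + b_u + b_i + q_i . p_u"""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction


estimate = predict_rating(
    global_mean=3.5,           # average rating over the whole training set
    user_bias=0.25,            # this user rates slightly above average
    item_bias=-0.5,            # this movie is rated slightly below average
    user_factors=[0.5, 1.0],   # hypothetical 2-dimensional user vector
    item_factors=[1.0, 0.5],   # hypothetical 2-dimensional movie vector
)
print(estimate)
```

Only user IDs, movie IDs and ratings are needed to fit these parameters, which is why no movie or user metadata is required.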

## Implementation
I've chosen to use the SVD implementation from [Surprise](http://surpriselib.com/), a lightweight, scikit-inspired interface for matrix factorisation and clustering-based [collaborative filtering approaches](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

## TODO
- [ ] Standardise train and test splits
- [ ] Switch to a random rating split instead of a user split in the Popular Movies baseline
- [ ] Implement hit rate
- [ ] Measure hit rate on test set

In [1]:
# !pip install scikit-surprise
# !pip install matplotlib

In [2]:
import pickle
from pathlib import Path

import numpy as np
import pandas as pd
from surprise import SVD, Dataset, Reader
from surprise.model_selection import (GridSearchCV, cross_validate,
                                      train_test_split)

# import matplotlib.pyplot as plt
# %matplotlib inline

In [3]:
def read(ds: str, data_dir=Path("../data/ext/od-challenge")):
    with (data_dir / f"{ds}.pickle").open("rb") as f:
        df = pickle.load(f)
    return df

aggs = read(ds="aggs")
teams = read(ds="teams")
movies = read(ds="movies")
labels = read(ds="labels")

In [4]:
labels.head()

|   | movie_id | user_id | rating |
|---|----------|---------|--------|
| 1 | 116367   | 1       | 3.0    |
| 3 | 114287   | 1       | 5.0    |
| 4 | 109370   | 1       | 5.0    |
| 5 | 112851   | 1       | 5.0    |
| 6 | 112508   | 1       | 5.0    |


In [5]:
# Surprise expects exactly three columns, in this order: user, item, rating.
df = labels[["user_id", "movie_id", "rating"]]
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df, reader)
# Hold out 25% of the ratings (not users) for testing.
trainset, testset = train_test_split(data, test_size=0.25)

In [None]:
%%time
param_grid = {
    "n_epochs": [15],
    "n_factors": [100, 200, 300],
    "lr_all": [0.01, 0.02, 0.03],
    "reg_all": [0.4, 0.6],
}
gs = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
gs.fit(data)
# best mean cross-validated RMSE
print(gs.best_score["rmse"])
# combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])
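
As a rough guide to the cost of this cell, the grid search tries every combination of the listed values and evaluates each one with 3-fold cross-validation. The sketch below only counts the work using the standard library; it does not call Surprise, and `cv_folds`, `combinations` and `n_fits` are names introduced here for illustration.

```python
from itertools import product

param_grid = {
    "n_epochs": [15],
    "n_factors": [100, 200, 300],
    "lr_all": [0.01, 0.02, 0.03],
    "reg_all": [0.4, 0.6],
}
cv_folds = 3  # matches cv=3 in the grid search above

# One dict per combination: exactly one value chosen for each hyperparameter.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# Each combination is fitted once per cross-validation fold.
n_fits = len(combinations) * cv_folds
print(len(combinations), n_fits)
```

Adding a value to any list multiplies the number of fits, so widening the grid gets expensive quickly.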

In [None]:
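# best_estimator holds an unfitted SVD configured with the best parameters,
# so it still has to be trained on the training split.
model = gs.best_estimator["rmse"]
model.fit(trainset)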

In [None]:
predictions = model.test(testset)
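
Each element returned by `model.test` is a named tuple carrying the true rating (`r_ui`) and the estimate (`est`), so RMSE on the held-out split follows directly from them. Surprise ships `surprise.accuracy.rmse` for this; the sketch below spells out the same arithmetic on a stand-in tuple with made-up values, so it runs without the library. The `Prediction` stand-in and `toy_predictions` are illustrative assumptions, not notebook output.

```python
import math
from collections import namedtuple

# Stand-in mirroring the fields of Surprise's prediction tuples; values are made up.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def rmse(predictions):
    """Root mean squared error between true ratings and estimates."""
    squared_errors = [(p.r_ui - p.est) ** 2 for p in predictions]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


toy_predictions = [
    Prediction(1, 116367, 3.0, 4.0, {}),
    Prediction(1, 114287, 5.0, 4.0, {}),
    Prediction(2, 109370, 4.0, 4.0, {}),
    Prediction(2, 112851, 2.0, 4.0, {}),
]
toy_rmse = rmse(toy_predictions)
print(toy_rmse)
```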

In [None]:
predictions[0]
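
For the hit-rate items in the TODO list, one possible definition is sketched below: a user counts as a hit if at least one of their top-n test items, ranked by estimated rating, has a true rating at or above a threshold. This is an assumption, since the notebook does not yet fix a definition (leave-one-out hit rate is a common alternative). The `Prediction` stand-in, the function name and the toy values are all hypothetical.

```python
from collections import defaultdict, namedtuple

# Stand-in mirroring the fields of Surprise's prediction tuples; values are made up.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def hit_rate(predictions, n=10, threshold=4.0):
    """Share of users with at least one liked item (r_ui >= threshold)
    among their n highest-estimated test items."""
    by_user = defaultdict(list)
    for p in predictions:
        by_user[p.uid].append(p)
    if not by_user:
        return 0.0

    hits = 0
    for user_predictions in by_user.values():
        # Rank this user's test items by estimated rating, best first.
        top_n = sorted(user_predictions, key=lambda p: p.est, reverse=True)[:n]
        if any(p.r_ui >= threshold for p in top_n):
            hits += 1
    return hits / len(by_user)


toy_predictions = [
    Prediction(1, "a", 5.0, 4.5, {}),
    Prediction(1, "b", 2.0, 3.0, {}),
    Prediction(2, "a", 5.0, 2.0, {}),
    Prediction(2, "c", 1.0, 4.0, {}),
]
toy_hit_rate = hit_rate(toy_predictions, n=1)
print(toy_hit_rate)
```

The same function should accept the `predictions` list produced above, because it only reads the `uid`, `r_ui` and `est` fields.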