# Strong Baseline: SVD

## Why SVD?

1. SVD performs competitively on MovieLens-100K in some [benchmarks](http://surpriselib.com/), and that dataset is similar in some ways to the ML Challenge dataset.
2. It became much better known after the Netflix Prize winners used it.
3. It is a pure collaborative filtering approach: it needs no information about the movie or the user beyond the ratings themselves.
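
To make point 3 concrete, here is a minimal sketch of the biased matrix-factorisation prediction rule that Surprise's `SVD` is built around: global mean, plus user and item biases, plus the dot product of the latent factors, clipped to the rating scale. The function name `predict_rating` and all numbers are illustrative assumptions, not learned parameters from this notebook.

```python
def predict_rating(global_mean, user_bias, item_bias,
                   user_factors, item_factors, low=0.0, high=5.0):
    """Biased MF estimate: mu + b_u + b_i + q_i . p_u, clipped to [low, high]."""
    # Interaction term: dot product of the user's and the item's latent factors.
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    estimate = global_mean + user_bias + item_bias + interaction
    # Keep the estimate inside the rating scale.
    return min(high, max(low, estimate))


# Hypothetical values for one (user, movie) pair.
example = predict_rating(
    global_mean=3.5,
    user_bias=0.2,
    item_bias=-0.1,
    user_factors=[0.5, 1.0],
    item_factors=[0.5, 0.25],
)
print(round(example, 2))
```

Only ratings are needed to learn the biases and factors, which is why no movie or user metadata enters the model.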

## Implementation
I've chosen the SVD implementation from [Surprise](http://surpriselib.com/), a lightweight, scikit-inspired library for matrix-factorisation and clustering-based [collaborative filtering approaches](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

In [1]:
# !pip install scikit-surprise
# !pip install matplotlib
!pip install tqdm



In [2]:
import pickle
import random
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from surprise import SVD, Dataset, Reader
from surprise.model_selection import (GridSearchCV, cross_validate,
                                      train_test_split)
from tqdm.notebook import tqdm

# import matplotlib.pyplot as plt
# %matplotlib inline

In [3]:
def read(ds: str, data_dir=Path("../data/ext/od-challenge")):
    with (data_dir / f"{ds}.pickle").open("rb") as f:
        df = pickle.load(f)
    return df


aggs = read(ds="aggs")
teams = read(ds="teams")
movies = read(ds="movies")
labels = read(ds="labels")

data_dir = Path("../data/intermediate/")
train, test = pd.read_csv(data_dir / "train.csv"), pd.read_csv(data_dir / "test.csv")

In [4]:
labels.head()

| | movie_id | user_id | rating |
|---|---|---|---|
| 1 | 116367 | 1 | 3.0 |
| 3 | 114287 | 1 | 5.0 |
| 4 | 109370 | 1 | 5.0 |
| 5 | 112851 | 1 | 5.0 |
| 6 | 112508 | 1 | 5.0 |


In [5]:
headers = ["user_id", "movie_id", "rating"]
df = train[headers]
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df[headers], reader)

In [6]:
run_grid_search = False

In [7]:
%%time
if run_grid_search:
    param_grid = {
        "n_epochs": [10, 15],
        "n_factors": [100, 200, 300],
        "lr_all": [0.01, 0.02, 0.03],
    }
    gs = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
    gs.fit(data)
    print(gs.best_score["rmse"])
    # combination of parameters that gave the best RMSE score
    print(gs.best_params["rmse"])
    model = gs.best_estimator['rmse']
else:
    model = SVD(n_epochs=15, n_factors=100, lr_all=0.01)

CPU times: user 27 µs, sys: 22 µs, total: 49 µs
Wall time: 52.9 µs


In [8]:
model.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7feaf7fc27f0>

## Measure RMSE on Test Split

In [9]:
test[headers]

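| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 383 | 52357 | 4.0 |
| 1 | 413 | 117998 | 5.0 |
| 2 | 441 | 101452 | 3.5 |
| 3 | 346 | 119488 | 3.5 |
| 4 | 24 | 80455 | 3.5 |
| ... | ... | ... | ... |
| 8443 | 437 | 115956 | 5.0 |
| 8444 | 462 | 78902 | 3.0 |
| 8445 | 352 | 1232829 | 2.0 |
| 8446 | 409 | 42451 | 2.0 |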


In [10]:
# Predict a rating for every (user, movie) pair in the test split.
# itertuples keeps the ids in their original dtype (iterrows upcasts them to float).
predicted_rating = [
    model.predict(uid=row.user_id, iid=row.movie_id).est
    for row in test[headers].itertuples(index=False)
]

In [11]:
real_rating = test.rating.tolist()

In [12]:
mse = mean_squared_error(y_true=real_rating, y_pred=predicted_rating)
rmse = np.sqrt(mse)
rmse

0.9176016156056267

## Measure Hit Rate on Test Split

When does a user count as a hit?
1. Candidates: movies the user has not interacted with in the training split but has in the test split.
2. We recommend at least one of these candidates that the user actually rated at or above the threshold (4.0 by default). A candidate is recommended when its predicted rating reaches the same threshold.

Assumption: every movie id is present in both the train and test splits. If it is not, we don't have an elegant solution for that here.
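
As a toy illustration of this definition, the sketch below checks a single user with plain Python sets. The helper `is_hit` and every id and rating in it are made up for the example; it mirrors the logic of the functions in the next cell rather than replacing them.

```python
def is_hit(train_movies, test_ratings, predicted, threshold=4.0):
    """Return True if at least one recommended candidate was actually liked.

    test_ratings and predicted map movie_id -> rating for one user.
    """
    # Candidates: rated in the test split but never seen in the training split.
    candidates = set(test_ratings) - set(train_movies)
    # Recommend candidates whose predicted rating reaches the threshold.
    recommended = {m for m in candidates if predicted.get(m, 0.0) >= threshold}
    # Movies the user actually rated at or above the threshold in the test split.
    liked = {m for m, r in test_ratings.items() if r >= threshold}
    return len(recommended & liked) > 0


# Hypothetical user: saw movies 1 and 2 in train, rated 2, 3 and 4 in test.
hit = is_hit(
    train_movies={1, 2},
    test_ratings={2: 5.0, 3: 4.5, 4: 2.0},
    predicted={3: 4.2, 4: 4.1},
)
miss = is_hit(
    train_movies={1, 2},
    test_ratings={2: 5.0, 3: 4.5, 4: 2.0},
    predicted={3: 3.0, 4: 4.5},
)
print(hit, miss)
```

In the first call movie 3 is both recommended and liked, so the user is a hit. In the second call only movie 4 is recommended, and the user rated it 2.0, so there is no hit.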

In [13]:
def user_hits(predicted_movies: List[int], seen_movies: List[int]):
    return len(set(predicted_movies) & set(seen_movies)) > 0


def get_candidate_movies(train, test, user_id):
    train_movies = set(train[train.user_id == user_id].movie_id.unique())
    test_movies = set(test[test.user_id == user_id].movie_id.unique())
    candidate_movies = test_movies - train_movies
    return list(candidate_movies)


def get_seen_movies(test, user_id, threshold=4.0):
    """Movies the user rated at or above `threshold` in the given split."""
    df = test[test.user_id == user_id]
    return df[df.rating >= threshold].movie_id.tolist()


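def recommend_movies(candidate_movies: List, user_id, threshold=4.0, k=10):
    """Return up to k candidates whose predicted rating is >= threshold."""
    recs = []
    # Shuffles in place; at most 100 randomly chosen candidates are scored.
    random.shuffle(candidate_movies)
    for c in candidate_movies[:100]:
        r_est = model.predict(uid=user_id, iid=c).est
        if r_est >= threshold:
            recs.append(c)
        if len(recs) >= k:
            return recs
    return recs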


def calc_hit_rate(split):
    hits = []
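    # Note: this iterates over every row of `split`, not over unique users,
    # so a user with n ratings in the split contributes n times to the average.
    for user_id in tqdm(split.user_id):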
        candidate_movies = get_candidate_movies(train, split, user_id)
        recommended_movies = recommend_movies(
            candidate_movies=candidate_movies, user_id=user_id
        )
        seen_movies = get_seen_movies(split, user_id)
        hits.append(
            user_hits(predicted_movies=recommended_movies, seen_movies=seen_movies)
        )

    return sum(hits) / len(hits)

In [14]:
calc_hit_rate(test)

  0%|          | 0/8448 [00:00<?, ?it/s]

0.6236979166666666
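
Note that `calc_hit_rate` loops over `split.user_id`, which has one entry per test row, so users with many test ratings weigh more in the average than users with few. The sketch below, using made-up hit flags rather than results from this notebook, shows how a row-weighted hit rate can differ from one averaged over unique users.

```python
# Hypothetical per-user outcome: user 1 is a hit, user 2 is not.
user_hit = {1: True, 2: False}
# Hypothetical test split: user 1 has three ratings, user 2 has one.
test_row_users = [1, 1, 1, 2]

# Row-weighted: one entry per test row, as when iterating over split.user_id.
row_weighted = sum(user_hit[u] for u in test_row_users) / len(test_row_users)

# User-weighted: one entry per unique user.
user_weighted = sum(user_hit.values()) / len(user_hit)

print(row_weighted, user_weighted)
```

Which weighting is appropriate depends on whether the metric should describe a typical rating event or a typical user. Iterating over `split.user_id.unique()` would give the per-user variant.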