# SVD

This notebook takes the master data prepared in the "data_preprocessing" notebook and creates a test holdout set from 20% of the data. The same split will be reused in future experiments for consistency.
The training set is fitted with the SVD model from the Surprise package, which is a Python scikit. The aim of this notebook is not to use the SVD model for recommendations, but only to obtain its RMSE for comparison.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
  
from surprise import Reader, Dataset
from surprise import SVD, KNNBasic
from surprise import accuracy

import joblib

In [2]:
#Do not load the "timestamp" column since it is not needed for building the recommender engine
df = pd.read_csv('data/master_data.zip', compression="zip")[["userId", "movieId", "rating"]]
df

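| | userId | movieId | rating |
|---|---|---|---|
| 0 | 0 | 2 | 3.0 |
| 1 | 0 | 6 | 3.0 |
| 2 | 0 | 10 | 4.0 |
| 3 | 0 | 14 | 2.0 |
| 4 | 0 | 15 | 3.0 |
| ... | ... | ... | ... |
| 11155334 | 25547 | 4181 | 4.5 |
| 11155335 | 25547 | 4188 | 4.5 |
| 11155336 | 25547 | 4195 | 4.5 |
| 11155337 | 25547 | 4198 | 3.0 |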


## Train-test Split

The split holds out 20% of the data as the test set. To keep the train and test data consistent, the same holdout split is used in every experiment.

The split is stratified by userId so that each user's ratings are divided between the train and test sets as evenly as possible.

In [7]:
#This split will be standard for all experiments

X = df.copy()
y = df["userId"]

#There is no need for the target values since we are splitting the whole dataset
#y is only given for stratifying

X_train, X_test, _, _ = train_test_split(X, y, test_size = 0.20, stratify=y, random_state=42)
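
To see what stratifying on userId does, here is a minimal sketch on a made-up toy frame (toy_df and its values are hypothetical, not taken from the master data). It assumes pandas and scikit-learn are installed and only checks how many ratings per user end up in the holdout set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 users with 10 ratings each
toy_df = pd.DataFrame({
    "userId": [u for u in range(5) for _ in range(10)],
    "movieId": [m for _ in range(5) for m in range(10)],
    "rating": [3.0] * 50,
})

# Same call pattern as the notebook: stratify on the user column
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.20, stratify=toy_df["userId"], random_state=42
)

# Number of held-out ratings per user
test_counts = toy_test["userId"].value_counts().sort_index()
print(test_counts.to_dict())
```

Note that stratifying this way needs every user to have at least two ratings; otherwise train_test_split is expected to raise an error.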

SurpriseLib requires the data to be loaded in its own format. A Reader object is created for this by passing the minimum and maximum ratings in the data to its constructor as the rating scale.

In [8]:
minimum_rating = df['rating'].min()
maximum_rating = df['rating'].max()

print(f"Minimum rating: {minimum_rating}")
print(f"Maximum rating: {maximum_rating}")

Minimum rating: 0.5
Maximum rating: 5.0


In [9]:
#Convert the split data into SurpriseLib format

reader = Reader(rating_scale=(minimum_rating,maximum_rating))

train_data = Dataset.load_from_df(X_train, reader).build_full_trainset()
test_data  = [tuple(x) for x in X_test.to_records(index=False)]
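
The rating scale matters beyond loading: by default, Surprise clips each estimate to this range when predicting. The snippet below is only a NumPy illustration of that clipping with made-up raw estimates; it does not call Surprise itself.

```python
import numpy as np

# Rating scale found in the data
minimum_rating, maximum_rating = 0.5, 5.0

# Hypothetical raw model outputs before clipping
raw_estimates = np.array([-0.2, 0.5, 3.7, 5.4])

# Values outside the scale are pulled back to its nearest bound
clipped = np.clip(raw_estimates, minimum_rating, maximum_rating)
print(clipped)
```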

## Model Fit

The algorithm used to build the recommender engine in this notebook is SVD (Singular Value Decomposition). However, the SVD algorithm in the Surprise package is not exactly the standard SVD technique, because standard SVD cannot work directly on a matrix with missing entries. It is better described as an algorithm inspired by SVD: a matrix factorization fitted only on the observed ratings.
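
As a rough illustration of what "inspired by SVD" means, the sketch below fits a biased matrix factorization with stochastic gradient descent, predicting each rating as the global mean plus a user bias, an item bias, and a dot product of latent factors. This follows the general form documented for Surprise's SVD, but it is a simplified NumPy toy, not the library's implementation. The names fit_biased_mf, predict_rating, toy_ratings, and toy_rmse are hypothetical, and the tiny dataset is made up.

```python
import numpy as np

def fit_biased_mf(ratings, n_users, n_items, n_factors=2,
                  lr=0.01, reg=0.1, n_epochs=200, seed=0):
    """Fit r_hat(u, i) = mu + b_u[u] + b_i[i] + p[u] . q[i] by SGD,
    looping over the observed (user, item, rating) triples only."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))  # global mean rating
    b_u = np.zeros(n_users)                          # user biases
    b_i = np.zeros(n_items)                          # item biases
    p = rng.normal(0.0, 0.1, size=(n_users, n_factors))  # user factors
    q = rng.normal(0.0, 0.1, size=(n_items, n_factors))  # item factors

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + p[u] @ q[i])
            # Regularized gradient steps on the squared error
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_u_old = p[u].copy()
            p[u] += lr * (err * q[i] - reg * p[u])
            q[i] += lr * (err * p_u_old - reg * q[i])
    return mu, b_u, b_i, p, q

def predict_rating(model, u, i):
    mu, b_u, b_i, p, q = model
    return mu + b_u[u] + b_i[i] + p[u] @ q[i]

# Made-up sparse ratings: 4 users, 3 items, 8 observed entries out of 12
toy_ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
               (2, 1, 2.0), (2, 2, 1.5), (3, 0, 3.0), (3, 2, 2.0)]
toy_model = fit_biased_mf(toy_ratings, n_users=4, n_items=3)

def toy_rmse(predict):
    errors = [(r - predict(u, i)) ** 2 for u, i, r in toy_ratings]
    return float(np.sqrt(np.mean(errors)))

rmse_mean = toy_rmse(lambda u, i: toy_model[0])
rmse_fit = toy_rmse(lambda u, i: predict_rating(toy_model, u, i))
print(f"global-mean RMSE: {rmse_mean:.3f}, fitted RMSE: {rmse_fit:.3f}")
```

The lr and reg arguments play the role that lr_all and reg_all play in the call below, and missing entries are simply never visited, which is why no imputation of the empty cells is needed.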

In [13]:
svd_model = SVD(n_factors=50, lr_all=0.01, reg_all = 0.1)

svd_model.fit(train_data)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x25f926a4a20>

In [10]:
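#Uncomment to save the fitted model to disk; the output below comes from a run with this line enabled
#joblib.dump(svd_model, "svd_model.pkl")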

['svd_model.pkl']

## Model Evaluate

Only the RMSE metric is calculated here, as an overview of the model's accuracy for comparison with other recommendation systems.

In [17]:
predictions = svd_model.test(test_data)
accuracy.rmse(predictions, verbose=True)

RMSE: 0.8003


0.8003138226574599
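
For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. A minimal sketch of that computation, using made-up (true, estimate) pairs rather than Surprise prediction objects:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, not results from this notebook
toy_pairs = [(3.0, 2.5), (4.0, 4.0), (5.0, 4.0)]
toy_score = rmse(toy_pairs)
print(round(toy_score, 4))  # 0.6455
```

In Surprise, each prediction exposes the true rating as r_ui and the estimate as est, so the same formula applies to the list returned by test().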