# SVD recommendation model

SVD takes a user-item matrix of size $m \times n$ and factors it into three parts:
$$U_{m \times k} \times \Sigma_{k \times k} \times V_{k \times n}^T$$
- $U_{m \times k} =$ users (rows) by latent factors $k$ (columns)
- $\Sigma_{k \times k} =$ diagonal matrix of the $k$ singular values, i.e., the weight of each latent factor
    - $k$ is the number of latent factors (features) the model considers
- $V_{k \times n}^T =$ transpose of the item matrix $V$ (items as rows, latent factors as columns), so its shape is latent factors $k$ (rows) by items (columns)
- Too low a $k$ leads to a simple model (underfitting); too high a $k$ leads to a complex model (overfitting)
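
As a minimal sketch of the factorization above, the following NumPy snippet decomposes a small, fully observed toy matrix (the ratings are made up for illustration) and reconstructs it from the top $k$ factors. Note that classical SVD needs a matrix with no missing entries; the `SVD` class from `surprise` used later in this notebook instead learns the factor vectors by stochastic gradient descent on the observed ratings only.

```python
import numpy as np

# Toy 4-user x 5-item rating matrix (hypothetical values, fully observed).
R = np.array([
    [5.0, 4.0, 1.0, 1.0, 3.0],
    [4.0, 5.0, 1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0, 4.0, 2.0],
    [2.0, 1.0, 4.0, 5.0, 2.0],
])

# Thin SVD: U is m x r, s holds the r singular values, Vt is r x n, with r = min(m, n).
U, s, Vt = np.linalg.svd(R, full_matrices=False)

def truncate(k):
    """Rank-k reconstruction that keeps only the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reconstruction error (Frobenius norm) for each choice of k.
errors = {}
for k in range(1, len(s) + 1):
    errors[k] = np.linalg.norm(R - truncate(k))
    print(f"k={k}: U {U[:, :k].shape}, V^T {Vt[:k, :].shape}, error {errors[k]:.4f}")
```

On fully observed data the reconstruction error can only shrink as $k$ grows. The overfitting risk of a large $k$ shows up only when the model is evaluated on ratings it has not seen.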


Error metrics:
- MAE: the average absolute difference between the predicted and actual ratings. Easier to interpret than RMSE, but less sensitive to outliers
- RMSE: the square root of the average squared difference between the predicted and actual ratings, where $N$ is the number of ratings in $R$
$$ RMSE = \sqrt{\frac{1}{N}\sum_{(u,i) \in R} (r_{ui} - \hat{r}_{ui})^2}$$
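
To make the difference between the two metrics concrete, here is a small sketch on made-up ratings (the arrays are illustrative, not taken from the dataset). One prediction is far off, and it raises RMSE more than MAE because the error is squared before averaging.

```python
import numpy as np

# Hypothetical actual and predicted ratings; the last prediction is a large miss.
actual = np.array([4.0, 3.0, 5.0, 2.0, 1.0])
predicted = np.array([3.5, 3.0, 4.0, 2.5, 4.0])

errors = actual - predicted

# MAE: mean of the absolute errors.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean of the squared errors.
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```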


The three main hyperparameters to tune for SVD are:
1. Latent factor $k$
2. Regularization parameter $\lambda$
    - $J = \text{regularized cost} = \sum_{(u, i) \in R}[(r_{ui} - \hat{r}_{ui})^2 + \lambda (|| P_u ||^2 + || Q_i ||^2)]$
    - $P_u =$ vector of latent factors for user $u$
    - $Q_i =$ vector of latent factors for item $i$
    - A larger $\lambda$ penalizes large factor values more heavily
    - If $\lambda$ is too large, the model underfits; if it is too small, the model overfits
3. Learning rate $\alpha$
    - If too high, the model may overshoot the minimum and fail to converge
    - If too low, the model may take too long to converge or get stuck in a local minimum
    - The update rules follow from the gradient of the cost function (the constant factor of 2 is absorbed into $\alpha$)
        - $P_u \leftarrow P_u + \alpha [(r_{ui} - \hat{r}_{ui}) Q_i - \lambda P_u]$
        - $Q_i \leftarrow Q_i + \alpha [(r_{ui} - \hat{r}_{ui}) P_u - \lambda Q_i]$
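
The update rules can be sketched in a few lines of NumPy. This is an illustrative toy, not the `surprise` implementation: it trains on a single made-up rating, predicts with $\hat{r}_{ui} = P_u \cdot Q_i$ (no bias terms), and the function name `sgd_step` and all values are assumptions chosen for the example. The gradient of the squared error with respect to $P_u$ is $-2 (r_{ui} - \hat{r}_{ui}) Q_i$, so each step moves along the prediction error times the other vector, minus the regularization pull.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 3         # number of latent factors
alpha = 0.05  # learning rate
lam = 0.02    # regularization strength

# Small random initial factor vectors for one user and one item.
P_u = rng.normal(scale=0.1, size=k)
Q_i = rng.normal(scale=0.1, size=k)
r_ui = 4.0    # the single observed rating (hypothetical)

def sgd_step(P_u, Q_i, r_ui, alpha, lam):
    """One SGD update of both factor vectors for a single rating."""
    err = r_ui - P_u @ Q_i  # prediction error, not squared
    new_P = P_u + alpha * (err * Q_i - lam * P_u)
    new_Q = Q_i + alpha * (err * P_u - lam * Q_i)
    return new_P, new_Q

initial_error = abs(r_ui - P_u @ Q_i)
for _ in range(200):
    P_u, Q_i = sgd_step(P_u, Q_i, r_ui, alpha, lam)
final_error = abs(r_ui - P_u @ Q_i)

print(f"initial |error|: {initial_error:.4f}")
print(f"final |error|:   {final_error:.4f}")
```

The final error does not reach exactly zero because the $\lambda$ term keeps pulling the vectors toward the origin. That pull is the trade-off regularization introduces.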

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



## Import datasets

In [2]:
data = pd.read_csv('./datasets/merged.csv')
data.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip_code,title,genres
0,4958,1407,5,2003-02-28 12:47:23,M,18,7,55403,Scream (1996),Horror|Thriller
1,5950,1407,2,2000-05-01 07:50:46,M,25,4,19713,Scream (1996),Horror|Thriller
2,4607,1407,3,2000-07-21 12:25:04,M,25,0,27403,Scream (1996),Horror|Thriller
3,5312,1407,5,2000-06-29 14:31:32,M,25,1,10463,Scream (1996),Horror|Thriller
4,3391,1407,4,2000-08-30 11:45:55,M,18,4,48135,Scream (1996),Horror|Thriller


In [3]:
trainset = pd.read_csv('./datasets/train.csv')
trainset.head()

Unnamed: 0,user_id,movie_id,rating
0,4168,3082,3
1,4284,2763,4
2,798,2559,5
3,4345,2529,3
4,984,3099,4


In [4]:
testset = pd.read_csv('./datasets/test.csv')
testset.head()

Unnamed: 0,user_id,movie_id,rating
0,5412,2431,5
1,5440,111,5
2,368,2976,3
3,425,2139,4
4,4942,2532,3


## Build SVD model

In [5]:
from surprise import SVD, Dataset, Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from sklearn.model_selection import train_test_split 

# trainset, testset = train_test_split(data[['user_id', 'movie_id', 'rating']], test_size = 0.01, random_state = 42)

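reader = Reader(rating_scale = (1, 5))
# load_from_df expects exactly three columns, in the order user, item, rating
train_data = Dataset.load_from_df(trainset[['user_id', 'movie_id', 'rating']], reader)
train_data = train_data.build_full_trainset()

# test() takes a list of (user, item, rating) tuples
test_data = list(testset[['user_id', 'movie_id', 'rating']].itertuples(index = False, name = None))

# As explained above, we focus on three hyperparameters:
# number of factors (n_factors), learning rate (lr_all), and regularization parameter (reg_all).
# The number of epochs is kept at the default of 20.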
hyperparameters = [
    {'n_factors': 50, 'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0.01}, 
    {'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}, # default
    {'n_factors': 150, 'n_epochs': 20, 'lr_all': 0.008, 'reg_all': 0.03}, 
    {'n_factors': 200, 'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.05}, 
    {'n_factors': 250, 'n_epochs': 20, 'lr_all': 0.012, 'reg_all': 0.08}, 
    {'n_factors': 300, 'n_epochs': 20, 'lr_all': 0.015, 'reg_all': 0.1}, 
    {'n_factors': 350, 'n_epochs': 20, 'lr_all': 0.02, 'reg_all': 0.12}, 
]

res = []
for params in hyperparameters:
    # Create an SVD model
    svd_model = SVD(
        n_factors=params['n_factors'],
        n_epochs=params['n_epochs'],
        lr_all=params['lr_all'],
        reg_all=params['reg_all']
    )
    
    # Train the model on the training data
    svd_model.fit(train_data)
    
    # Predict on the training set (to evaluate training RMSE)
    train_predictions = svd_model.test(train_data.build_testset())
    train_rmse = accuracy.rmse(train_predictions, verbose=False)
    
    # Predict on the test set (to evaluate validation RMSE)
    test_predictions = svd_model.test(test_data)
    test_rmse = accuracy.rmse(test_predictions, verbose=False)
    
    # Display results
    print(f"For parameters {params}")
    print(f"\tTraining RMSE: {train_rmse:.4f}")
    print(f"\tValidation RMSE: {test_rmse:.4f}")
    res.append({'parameters': params, 'training_rmse': train_rmse, 'test_rmse': test_rmse})

For parameters {'n_factors': 50, 'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0.01}
	Training RMSE: 0.8432
	Validation RMSE: 0.9578
For parameters {'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}
	Training RMSE: 0.6715
	Validation RMSE: 0.9639
For parameters {'n_factors': 150, 'n_epochs': 20, 'lr_all': 0.008, 'reg_all': 0.03}
	Training RMSE: 0.5905
	Validation RMSE: 0.9359
For parameters {'n_factors': 200, 'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.05}
	Training RMSE: 0.6751
	Validation RMSE: 0.9032
For parameters {'n_factors': 250, 'n_epochs': 20, 'lr_all': 0.012, 'reg_all': 0.08}
	Training RMSE: 0.7954
	Validation RMSE: 0.9166
For parameters {'n_factors': 300, 'n_epochs': 20, 'lr_all': 0.015, 'reg_all': 0.1}
	Training RMSE: 0.8350
	Validation RMSE: 0.9354
For parameters {'n_factors': 350, 'n_epochs': 20, 'lr_all': 0.02, 'reg_all': 0.12}
	Training RMSE: 0.8610
	Validation RMSE: 0.9471
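
The sweep results can also be compared programmatically. The sketch below hardcodes the RMSE values printed above (in the notebook, the `res` list already holds them) and picks the configuration with the lowest validation RMSE. The `gap` value is an illustrative addition, not part of the original code; a large gap between training and validation RMSE suggests overfitting.

```python
# RMSE values copied from the printed output of the sweep.
res = [
    {'parameters': {'n_factors': 50, 'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0.01}, 'training_rmse': 0.8432, 'test_rmse': 0.9578},
    {'parameters': {'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}, 'training_rmse': 0.6715, 'test_rmse': 0.9639},
    {'parameters': {'n_factors': 150, 'n_epochs': 20, 'lr_all': 0.008, 'reg_all': 0.03}, 'training_rmse': 0.5905, 'test_rmse': 0.9359},
    {'parameters': {'n_factors': 200, 'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.05}, 'training_rmse': 0.6751, 'test_rmse': 0.9032},
    {'parameters': {'n_factors': 250, 'n_epochs': 20, 'lr_all': 0.012, 'reg_all': 0.08}, 'training_rmse': 0.7954, 'test_rmse': 0.9166},
    {'parameters': {'n_factors': 300, 'n_epochs': 20, 'lr_all': 0.015, 'reg_all': 0.1}, 'training_rmse': 0.8350, 'test_rmse': 0.9354},
    {'parameters': {'n_factors': 350, 'n_epochs': 20, 'lr_all': 0.02, 'reg_all': 0.12}, 'training_rmse': 0.8610, 'test_rmse': 0.9471},
]

# Configuration with the lowest validation RMSE.
best = min(res, key=lambda r: r['test_rmse'])

# Train/validation gap per configuration, sorted by validation RMSE.
for r in sorted(res, key=lambda r: r['test_rmse']):
    gap = r['test_rmse'] - r['training_rmse']
    print(f"n_factors={r['parameters']['n_factors']:>3}  "
          f"validation={r['test_rmse']:.4f}  gap={gap:.4f}")

print("Best parameters:", best['parameters'])
```

Because all three hyperparameters change together from one configuration to the next, this comparison ranks whole configurations. It cannot attribute an improvement to any single hyperparameter.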


In [8]:
best_svd_model = SVD(
    n_factors=200,
    n_epochs=50,
    lr_all=0.005,
    reg_all=0.1
)
best_svd_model.fit(train_data)

best_svd_model.predict(196, 302).est

3.830450002844806

In [None]:
best_svd_model = SVD(
    n_factors=200,
    n_epochs=50,
    lr_all=0.005,
    reg_all=0.1
)

# Train the model on the training data
best_svd_model.fit(train_data)

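# Predict on the training set (to evaluate this model's training RMSE)
train_predictions = best_svd_model.test(train_data.build_testset())
train_rmse = accuracy.rmse(train_predictions, verbose=False)

# Predict on the test set (to evaluate validation RMSE)
test_predictions = best_svd_model.test(test_data)
test_rmse = accuracy.rmse(test_predictions, verbose=False)

# Display results
print(f"Training RMSE: {train_rmse:.4f}")
print(f"Validation RMSE: {test_rmse:.4f}")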