# Singular Value Thresholding

We implement the singular value thresholding (SVT) algorithm presented in Cai, Candès and Shen (2008) and run a grid search over its parameters. The search is far from exhaustive because the algorithm is slow.
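
The core of the method is the singular value shrinkage operator, which soft-thresholds the singular values of a matrix at a level tau. A minimal NumPy sketch is given here for orientation; the function name `shrink` is ours and is not taken from the `thresholding` module, whose internals are not shown in this notebook.

```python
import numpy as np

def shrink(Y, tau):
    """Soft-threshold the singular values of Y at level tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)   # singular values below tau become zero
    return (U * s_thr) @ Vt            # rebuild the matrix from the shrunken spectrum

# Toy example: the singular values of a diagonal matrix are its diagonal entries
Y = np.diag([5.0, 2.0, 0.5])
X = shrink(Y, tau=1.0)
```

A larger `tau` zeroes out more singular values, so the result has lower rank.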

### Importing the libraries

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#%pip install scikit-surprise
from surprise import AlgoBase, PredictionImpossible, Reader, Dataset, accuracy
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
import thresholding

### Prepare the dataset

First we read the dataset as a pandas DataFrame.

In [2]:
data_train_raw = pd.read_csv('../data/data_train.csv')

# parse zero-based row and column indices from the 'Id' strings
row_str = data_train_raw['Id'].apply(lambda x: x.split('_')[0])
row_id = row_str.apply(lambda x: int(x.split('r')[1]) - 1)
col_str = data_train_raw['Id'].apply(lambda x: x.split('_')[1])
col_id = col_str.apply(lambda x: int(x.split('c')[1]) - 1)

# apply changes
data_train_raw['row'] = row_id
data_train_raw['col'] = col_id

# dataset as data frame
data_train_df = data_train_raw.loc[:,['row', 'col', 'Prediction']]
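
The same parsing can be done in one vectorized pass with a regular expression. The sketch below assumes the Ids have the form `r<row>_c<col>` (for example `r44_c1`), which is what the `split` calls above imply; the toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame in the assumed 'r<row>_c<col>' Id format
toy = pd.DataFrame({'Id': ['r44_c1', 'r61_c1', 'r3_c12'],
                    'Prediction': [4, 3, 5]})

# One regex pass extracts both 1-based indices; subtract 1 to make them 0-based
idx = toy['Id'].str.extract(r'r(\d+)_c(\d+)').astype(int) - 1
toy['row'] = idx[0]
toy['col'] = idx[1]
```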

In [3]:
data_train_df.head()

| | row | col | Prediction |
|---|---|---|---|
| 0 | 43 | 0 | 4 |
| 1 | 60 | 0 | 3 |
| 2 | 66 | 0 | 4 |
| 3 | 71 | 0 | 3 |
| 4 | 85 | 0 | 5 |


### Load the data into surprise

In [4]:
# set up surprise dataset
reader = Reader()
dataset = Dataset.load_from_df(data_train_df[['row', 'col', 'Prediction']], reader)
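
Singular value thresholding itself works on a partially observed matrix rather than on a list of ratings. As an illustration, the triples can be turned into a dense matrix and a boolean mask of observed entries as follows. The toy data is hypothetical, and how `thresholding.SVDthr` builds its matrix from the surprise trainset is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings in the same (row, col, Prediction) layout
toy_df = pd.DataFrame({'row': [0, 0, 2],
                       'col': [1, 2, 0],
                       'Prediction': [4, 3, 5]})

n_rows = toy_df['row'].max() + 1
n_cols = toy_df['col'].max() + 1

M = np.zeros((n_rows, n_cols))                  # unobserved entries stay at zero
mask = np.zeros((n_rows, n_cols), dtype=bool)   # True where a rating is observed

rows = toy_df['row'].to_numpy()
cols = toy_df['col'].to_numpy()
M[rows, cols] = toy_df['Prediction'].to_numpy()
mask[rows, cols] = True
```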

### Perform a grid search over parameters

Note that the parameter `n_epochs` is the maximum number of epochs; the algorithm stops earlier if its stopping criterion is met.
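
To make the role of the three parameters concrete, here is a sketch of the SVT iteration from the paper in plain NumPy. The function `svt_complete`, its `tol` argument and the toy matrix are our assumptions for illustration; the relative-residual stopping rule follows the paper, and the criterion actually used in the `thresholding` module may differ. The paper proves convergence for step sizes strictly between 0 and 2, which is presumably why the grid stops at 1.99.

```python
import numpy as np

def svt_complete(M, mask, tau, step_size, n_epochs, tol=1e-4):
    """Sketch of the SVT iteration; mask marks the observed entries of M."""
    Y = np.zeros_like(M, dtype=float)
    X = np.zeros_like(M, dtype=float)
    norm_obs = np.linalg.norm(M[mask])
    epoch = 0
    for epoch in range(1, n_epochs + 1):
        # X_k = D_tau(Y_{k-1}): soft-threshold the singular values of Y
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt
        # residual restricted to the observed entries
        residual = np.where(mask, M - X, 0.0)
        if np.linalg.norm(residual) <= tol * norm_obs:
            break  # stopping criterion met before the epoch cap
        # Y_k = Y_{k-1} + step_size * P_Omega(M - X_k)
        Y = Y + step_size * residual
    return X, epoch

# Hypothetical toy problem: a rank-1 matrix with three hidden entries
M = np.outer(np.arange(1.0, 7.0), np.arange(1.0, 6.0))
mask = np.ones(M.shape, dtype=bool)
mask[0, 1] = mask[2, 3] = mask[4, 0] = False

X, epochs = svt_complete(M, mask, tau=5.0, step_size=1.0, n_epochs=1000)
```

Here `n_epochs` only caps the loop: the returned `epochs` is smaller whenever the tolerance is reached first.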

In [14]:
# define the model
model = GridSearchCV(thresholding.SVDthr, param_grid = {'tao': [100,500,1000], 
                                                        'n_epochs': [1000], 
                                                        'step_size': [1,1.99]})

# fit the model to the dataset
model.fit(dataset)

# output best scores and best parameters
model.best_score, model.best_params

### Train model with optimal parameters 

In [9]:
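# hold out part of the data for testing (test_size=0.25 is an assumed value)
trainset, testset = train_test_split(dataset, test_size=0.25)

# define the model
# NOTE: these values lie outside the grid searched above; the tuned values
# are available as best_params['rmse'] on the grid search object
model = thresholding.SVDthr(n_epochs=1, tao=10000, step_size=1)

# fit the model to the training set
model.fit(trainset)

# compute the predictions
predictions = model.test(testset)

# compute the RMSE on the test set
rmse = accuracy.rmse(predictions)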

RMSE: 3.0653
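
To put this score in context, the RMSE can be computed by hand and compared with constant predictors. The sketch below uses hypothetical ratings and assumes a 1-to-5 scale, which is the default of `Reader()`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical ratings on a 1-to-5 scale
ratings = np.array([4, 3, 4, 3, 5, 1, 2, 5])

# Predicting the scale midpoint is never off by more than 2 on any rating
midpoint_rmse = rmse(ratings, np.full(ratings.shape, 3.0))

# Predicting the mean rating gives the standard deviation of the ratings
mean_rmse = rmse(ratings, np.full(ratings.shape, ratings.mean()))
```

If the ratings do lie between 1 and 5, a constant prediction of 3 cannot exceed an RMSE of 2, so the 3.0653 above suggests that a single epoch with `tao=10000` leaves the model far from fitted.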
