## Introduction to the SurPRISE Package (Simple Python RecommendatIon System Engine)

For Question 2 (the Netflix dataset), we used the recommendation system package "SurPRISE", which provides a set of built-in recommendation algorithms based on matrix factorization and collaborative filtering. In particular, the matrix factorization approach based on truncated SVD tends to give strong prediction performance. A truncated SVD reduces the large user-item matrix (m x n) to the product of a user-factor matrix (U: m x k), a diagonal matrix of latent-factor weights (sigma: k x k), and an item-factor matrix (V: k x n), where k is much smaller than m and n. The small k also makes the computation considerably cheaper. In other words, the low-rank factorization is an approximation of the original matrix. To minimize the error between the approximation and the original matrix while ignoring the missing (null) entries of the original matrix, the package uses stochastic gradient descent (SGD). The SGD procedure in the package works roughly as follows.

Step 1. Choose the number of latent factors k and initialize U and V <br>
Step 2. Randomly select one observed element from the user-item matrix <br>
Step 3. Compute the error term (with a regularization term added to avoid overfitting) <br>
Step 4. Update U and V to reduce the error term <br>
Step 5. Repeat Steps 2 to 4 until the error is sufficiently small <br>
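
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's actual implementation: the function names `sgd_factorize` and `rmse`, the toy ratings, and the default values are all made up for this example, the global mean and bias terms are left out, and the loop runs for a fixed number of passes instead of checking a convergence criterion.

```python
import math
import random


def sgd_factorize(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit user and item factors to observed (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    # Step 1: fix the number of latent factors and initialise U and V randomly.
    U = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u, _, _ in ratings}
    V = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for _, i, _ in ratings}
    samples = list(ratings)
    for _ in range(n_epochs):  # Step 5: repeat for a fixed number of passes
        rng.shuffle(samples)   # Step 2: visit the observed ratings in random order
        for u, i, r in samples:
            # Step 3: prediction error on this single observed rating
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            # Step 4: gradient step on the regularised squared error
            for f in range(n_factors):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V


def rmse(ratings, U, V):
    """Root-mean-squared error over the observed ratings only."""
    sq = [(r - sum(a * b for a, b in zip(U[u], V[i]))) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(sq) / len(sq))


# Toy data (hypothetical): missing user-item pairs are simply absent from the list.
toy = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4), ("u2", "i3", 1),
       ("u3", "i2", 2), ("u3", "i3", 5), ("u4", "i1", 5), ("u4", "i3", 2)]

U0, V0 = sgd_factorize(toy, n_epochs=0)    # untrained factors (same seed)
U1, V1 = sgd_factorize(toy, n_epochs=200)  # trained factors
rmse_before = rmse(toy, U0, V0)
rmse_after = rmse(toy, U1, V1)
print(f"RMSE before training: {rmse_before:.3f}")
print(f"RMSE after training:  {rmse_after:.3f}")
```

Only the observed triples enter the loop, which is how the null entries of the user-item matrix are excluded from the error.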

Therefore, we need to tune parameters such as learning rate, regularization coefficient, and number of latent factors to minimize the cost function.
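
As a sketch of what is being minimized, the cost function documented for the package's SVD algorithm can be written as below. The symbols are notation introduced here for illustration: $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are the user and item factor vectors (of length `n_factors`), $\lambda$ corresponds to the regularization coefficient (`reg_all`), and $\gamma$ to the learning rate (`lr_all`).

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
$$

$$
\min_{b,\,p,\,q} \sum_{r_{ui} \in R_{\text{train}}} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda \left( b_u^2 + b_i^2 + \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)
$$

With $e_{ui} = r_{ui} - \hat{r}_{ui}$, a single SGD step on the factor vectors then takes the form

$$
p_u \leftarrow p_u + \gamma \left( e_{ui}\, q_i - \lambda\, p_u \right), \qquad
q_i \leftarrow q_i + \gamma \left( e_{ui}\, p_u - \lambda\, q_i \right)
$$

The sum runs over the observed ratings only, so the three tuned quantities (learning rate, regularization coefficient, number of latent factors) each appear directly in the objective or its update rule.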

In [1]:
# Note: `evaluate` is omitted; it is unused below and no longer available in recent Surprise releases
from surprise import Reader, Dataset, SVD
from surprise.model_selection import GridSearchCV
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Load the Training Data and Split It into a Trainset and Testset
Before predicting ratings for the unlabeled "Netflix_HW3_test" data, we first split the training data into a trainset and a held-out testset. Because the testset still contains the true ratings, we can use it to evaluate the model that was fitted on the trainset.

In [2]:
# Load Dataset
df_train_raw = pd.read_csv('./netflix_HW3/Netflix_HW3_training.txt', header = None, names = ['itemID', 'userID', 'rating'])
df_test_raw = pd.read_csv('./netflix_HW3/Netflix_HW3_test.txt', header = None, names = ['itemID', 'userID'])

In [3]:
#Split the df_train_raw into train(80%) and test(20%)
trainset, testset = train_test_split(df_train_raw, test_size=0.2, random_state= 135)

In [4]:
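# Convert the DataFrames to the Surprise dataset format.
# Note: load_from_df reads the columns as (user, item, rating) in that order, so passing
# itemID first makes Surprise treat items as "users". The predict() calls below use the
# same (itemID, userID) order, so the pipeline stays consistent.
reader = Reader(rating_scale=(1, 5)) # Rating range is 1 to 5
train_sdf = Dataset.load_from_df(trainset[['itemID', 'userID', 'rating']], reader=reader)
test_sdf = Dataset.load_from_df(testset[['itemID', 'userID', 'rating']], reader=reader)
ttl_train_sdf = Dataset.load_from_df(df_train_raw[['itemID', 'userID', 'rating']], reader=reader)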

## Grid Search
As mentioned above, we tuned the following hyper-parameters. <br>
n_factors: The number of factors. Default is 100. <br>
n_epochs: The number of iterations of the SGD procedure. Default is 20. <br>
lr_all: The learning rate for all parameters. Default is 0.005. <br>
reg_all: The regularization term for all parameters. Default is 0.02. <br>

In [10]:
#parameter setting
param_grid = {'n_factors':[60, 90, 120], 'n_epochs': [20, 30, 40], 'lr_all': [0.003, 0.005, 0.007],
              'reg_all': [0.01, 0.02, 0.03]}

SVD_Grid = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=10, n_jobs=-1)
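
To get a feel for the cost of this search, the short sketch below counts the hyper-parameter combinations in the grid above and the number of model fits implied by 10-fold cross-validation. The variable names `n_combinations` and `n_fits` are introduced here for illustration, and the count assumes one fit per combination per fold.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {'n_factors': [60, 90, 120], 'n_epochs': [20, 30, 40],
              'lr_all': [0.003, 0.005, 0.007], 'reg_all': [0.01, 0.02, 0.03]}
n_folds = 10

# Every combination of one value per hyper-parameter
combinations = list(product(*param_grid.values()))
n_combinations = len(combinations)
n_fits = n_combinations * n_folds

print(n_combinations)  # 81
print(n_fits)          # 810
```

With several hundred fits on a large rating matrix, running the folds in parallel (`n_jobs=-1`) is a sensible choice.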

In [11]:
#Fit the data
SVD_Grid.fit(train_sdf)

In [12]:
# best RMSE score
print("==== Best RMSE Score====\n", SVD_Grid.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print("\n==== Hyper-parameter Values for the Best RMSE====\n",SVD_Grid.best_params['rmse'])

==== Best RMSE Score====
 0.8473964427371665

==== Hyper-parameter Values for the Best RMSE====
 {'n_factors': 60, 'n_epochs': 30, 'lr_all': 0.005, 'reg_all': 0.03}


### Trainset RMSE (10-fold cross-validation): 0.8470215204103282

In [5]:
# Re-run the cross-validation with only the best hyper-parameter combination
param_grid = {'n_factors': [60], 'n_epochs': [30], 'lr_all': [0.005], 'reg_all': [0.03]}
SVD_Grid = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=10, n_jobs=-1)
# Fit the data
SVD_Grid.fit(train_sdf)
print("==== Best RMSE Score====\n", SVD_Grid.best_score['rmse'])

==== Best RMSE Score====
 0.8470215204103282


## Model Evaluation (RMSE) with the Best Hyper-parameters

In [6]:
# Build the model with the selected parameters
SVD_best = SVD_Grid.best_estimator['rmse']
SVD_best.fit(train_sdf.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1c27f23f828>

In [7]:
def rmse(predictions, targets):
    # Print the root-mean-squared error between two equally indexed numeric Series
    print("RMSE:", np.sqrt(((predictions - targets) ** 2).mean()))
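
As a quick sanity check of the formula used by the RMSE helper above, the following sketch works through a three-value example by hand. The function name `rmse_value` and the numbers are made up for illustration; the errors are 1, 0 and -1, so the mean squared error is 2/3.

```python
import math


def rmse_value(predictions, targets):
    """Return the root-mean-squared error between two equal-length sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


example_predictions = [3.0, 4.0, 5.0]
example_targets = [2.0, 4.0, 6.0]
score = rmse_value(example_predictions, example_targets)
print(round(score, 4))  # 0.8165, i.e. sqrt(2/3)
```

Because the errors are squared, swapping the two arguments gives the same value.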

### Testset RMSE : 0.8412717700600827

In [8]:
# Predict each held-out rating and check the RMSE
pred = [SVD_best.predict(testset.iloc[idx, 0], testset.iloc[idx, 1]).est for idx in range(len(testset))]
testset = testset.assign(pred=pred)  # assign() returns a new DataFrame, avoiding SettingWithCopyWarning
rmse(testset['pred'], testset['rating'])

RMSE: 0.8412717700600827


## Rating Prediction

In [9]:
# Refit the selected model on the full training data
SVD_final = SVD_Grid.best_estimator['rmse']
SVD_final.fit(ttl_train_sdf.build_full_trainset())

# Predict ratings for the unlabeled test set
pred = [SVD_final.predict(df_test_raw.iloc[idx, 0], df_test_raw.iloc[idx, 1]).est for idx in range(len(df_test_raw))]
df_test_raw['pred'] = pred

In [10]:
df_test_raw.head()

|   | itemID | userID | pred |
|---|--------|--------|------|
| 0 | 443 | 1549632 | 4.192552 |
| 1 | 2104 | 428488 | 4.276588 |
| 2 | 9410 | 2069695 | 3.026097 |
| 3 | 4005 | 128311 | 2.202649 |
| 4 | 16770 | 2037731 | 4.166902 |


In [11]:
df_test_raw.to_csv('Netflix_final_result.csv', index=False)

## Conclusion

The best model found through the grid search has an RMSE of 0.841 on the validation set, a random 20% sample of the original training dataset (Netflix_HW3_training.txt). This is slightly lower than the RMSE of 0.842 that we took as the reference for the Netflix Prize winner, although the two figures come from different data splits and are not directly comparable. This exercise showed us how powerful the SVD method can be in a recommendation system. In addition, matrix decomposition can be combined with other machine learning algorithms in an ensemble.