## Training the [SVD++ algorithm](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVDpp) on the [MovieLens 100k](https://grouplens.org/datasets/movielens/100k/) Dataset

Installing the required libraries

In [1]:
!pip install wheel surprise pandas scikit-learn fastparquet





In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
import fastparquet
from surprise.reader import Reader
from surprise.dataset import DatasetAutoFolds
from surprise.accuracy import mae, mse, rmse
from surprise.model_selection import GridSearchCV
from surprise.prediction_algorithms import SVDpp

In [3]:
movie_cols = [
    "movieid",
    "movie_title",
    "release_date",
    "video_release_date",
    "IMDb_URL",
    # the remaining columns are one-hot encoded genre flags
    "unknown",
    "action",
    "adventure",
    "animation",
    "children",
    "comedy",
    "crime",
    "documentary",
    "drama",
    "fantasy",
    "film-noir",
    "horror",
    "musical",
    "mystery",
    "romance",
    "sci-fi",
    "thriller",
    "War",
    "western",
]

Loading the data

In [4]:
ratings = pd.read_csv("../data/u.data", delimiter="\t", names=["userid","movieid","rating","timestamp"])
items = pd.read_csv("../data/u.item", delimiter="|", names=movie_cols, encoding_errors="replace")

In [5]:
ratings = pd.merge(ratings, items, on="movieid")

In [6]:
ratings

Unnamed: 0,userid,movieid,rating,timestamp,movie_title,release_date,video_release_date,IMDb_URL,unknown,action,...,fantasy,film-noir,horror,musical,mystery,romance,sci-fi,thriller,War,western
0,196,242,3,881250949,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,...,0,0,0,0,0,0,0,0,0,0
1,186,302,3,891717742,L.A. Confidential (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?L%2EA%2E+Conf...,0,0,...,0,1,0,0,1,0,0,1,0,0
2,22,377,1,878887116,Heavyweights (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Heavyweights%...,0,0,...,0,0,0,0,0,0,0,0,0,0
3,244,51,2,880606923,Legends of the Fall (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Legends%20of%...,0,0,...,0,0,0,0,0,1,0,0,1,1
4,166,346,1,886397596,Jackie Brown (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?imdb-title-11...,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,880,476,3,880175444,"First Wives Club, The (1996)",14-Sep-1996,,http://us.imdb.com/M/title-exact?First%20Wives...,0,0,...,0,0,0,0,0,0,0,0,0,0
99996,716,204,5,879795543,Back to the Future (1985),01-Jan-1985,,http://us.imdb.com/M/title-exact?Back%20to%20t...,0,0,...,0,0,0,0,0,0,1,0,0,0
99997,276,1090,1,874795795,Sliver (1993),01-Jan-1993,,http://us.imdb.com/M/title-exact?Sliver%20(1993),0,0,...,0,0,0,0,0,0,0,1,0,0
99998,13,225,2,882399156,101 Dalmatians (1996),27-Nov-1996,,http://us.imdb.com/M/title-exact?101%20Dalmati...,0,0,...,0,0,0,0,0,0,0,0,0,0


Descriptive statistics of the number of ratings per movie

In [7]:
ratings["movieid"].value_counts().describe()

count    1682.000000
mean       59.453032
std        80.383846
min         1.000000
25%         6.000000
50%        27.000000
75%        80.000000
max       583.000000
Name: count, dtype: float64

Filtering so that each movie has at least 10 ratings

In [8]:
ratings = ratings.groupby("movieid").filter(lambda x: len(x) >= 10)
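
The filter above can be tried on a toy frame first. The sketch below is only an illustration: the `toy` data and the `min_ratings` threshold of 2 are made up, and the `transform("size")` variant is an alternative way to get the same rows, not what this notebook uses.

```python
import pandas as pd

# Hypothetical toy data: movie 10 has 3 ratings, movie 20 has 2, movie 30 has 1
toy = pd.DataFrame(
    {
        "userid": [1, 2, 3, 1, 2, 1],
        "movieid": [10, 10, 10, 20, 20, 30],
        "rating": [4, 3, 5, 2, 4, 1],
    }
)
min_ratings = 2  # illustrative threshold; the notebook uses 10

# Same pattern as the notebook: keep only groups with enough rows
by_filter = toy.groupby("movieid").filter(lambda g: len(g) >= min_ratings)

# Alternative: compute each row's group size and apply a boolean mask
counts = toy.groupby("movieid")["movieid"].transform("size")
by_mask = toy[counts >= min_ratings]

print(by_mask)
```

Both approaches keep the rows of movies 10 and 20 and drop the single rating of movie 30. On large frames the mask variant is usually faster because it avoids calling a Python lambda once per group.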

Descriptive statistics of the number of ratings each user gave (after filtering)

In [9]:
ratings["userid"].value_counts().describe()

count    943.000000
mean     103.873807
std       96.854269
min       18.000000
25%       33.000000
50%       64.000000
75%      147.000000
max      589.000000
Name: count, dtype: float64

Descriptive statistics of the number of ratings per movie after filtering

In [10]:
ratings["movieid"].value_counts().describe()

count    1152.000000
mean       85.028646
std        85.768231
min        10.000000
25%        25.000000
50%        53.500000
75%       117.000000
max       583.000000
Name: count, dtype: float64

Descriptive statistics of the rating values

In [11]:
ratings["rating"].describe()

count    97953.000000
mean         3.545997
std          1.116467
min          1.000000
25%          3.000000
50%          4.000000
75%          4.000000
max          5.000000
Name: rating, dtype: float64

Splitting the data 80-20 into train and test sets, stratified by user so that each user is present in both sets

In [12]:
train, test = train_test_split(ratings, test_size=0.2, stratify=ratings["userid"], random_state=42)
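
A quick way to confirm that every user really ends up in both sets is to compare the sets of user ids. The helper below is a hypothetical sketch (the name `coverage_gaps` is not part of the notebook) and works on any two iterables of ids.

```python
def coverage_gaps(train_users, test_users):
    """Return (ids only in train, ids only in test)."""
    train_set, test_set = set(train_users), set(test_users)
    return train_set - test_set, test_set - train_set


# Toy ids: user 3 is missing from test, user 4 is missing from train
only_train, only_test = coverage_gaps([1, 1, 2, 2, 3], [1, 2, 4])
print(only_train, only_test)
```

Applied to this notebook, `coverage_gaps(train["userid"], test["userid"])` should return two empty sets if the stratified split behaved as intended.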

Saving the interaction matrix and the train-test split

In [13]:
pd.pivot_table(ratings, values="rating", index="userid", columns="movieid").to_parquet("../parquet/interaction_matrix.parquet")
train.to_parquet("../parquet/train.parquet")
test.to_parquet("../parquet/test.parquet")

Verifying that the train and test sets are in the desired proportion

In [14]:
len(train)/len(ratings), len(test)/len(ratings) 

(0.79999591640889, 0.20000408359111002)
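
The proportions are not exactly 0.8 and 0.2 because the 97953 filtered ratings cannot be divided evenly. The arithmetic below is a sketch that assumes the test size is rounded up to a whole number of rows, which is my understanding of how scikit-learn handles a fractional `test_size`.

```python
import math

n_ratings = 97953      # row count after filtering, from the rating statistics
test_fraction = 0.2

# Assumption: the test set size is rounded up, the train set gets the rest
n_test = math.ceil(test_fraction * n_ratings)
n_train = n_ratings - n_test

print(n_train, n_test)
print(n_train / n_ratings, n_test / n_ratings)
```

Under that assumption the split is 78362 train rows and 19591 test rows, which reproduces the ratios shown in the output.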

Preparing the train and test data for model training and testing

In [15]:

#! Range of our ratings is 1 to 5
reader = Reader(rating_scale=(1, 5))
data_test = (
    DatasetAutoFolds(
        reader=reader,
        df=test[["userid", "movieid", "rating"]],
    )
    .build_full_trainset()
    .build_testset()
)
data_train = DatasetAutoFolds(
    reader=reader, df=train[["userid", "movieid", "rating"]]
)

We run a grid search with 5-fold cross-validation on the training data, so a separate validation set is not required
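
To see why no extra validation set is needed, here is a minimal sketch of how k-fold cross-validation partitions the data. It is a plain-Python illustration without shuffling, not Surprise's own implementation, and the function name `k_fold_indices` is made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print(val_idx, "held out;", len(train_idx), "used for fitting")
```

Every sample serves as validation data in exactly one fold, so each hyperparameter combination is scored on data it was not fitted on.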

In [16]:

#! setting up parameters for grid search
param_grid = {
    "n_factors": [25, 20, 15],
    "lr_all": [0.002, 0.001],
    "reg_all": [0.002, 0.001],
    "n_epochs": [100],
}


gs = GridSearchCV(
    SVDpp,
    param_grid,
    measures=["mse", "mae"],  #! each combination is evaluated on mean squared error and mean absolute error
    cv=5,  #! 5-fold cross-validation
    refit=True,  #! the parameters that score best on the first metric (mse) are refitted on the whole training set (all 5 folds)
    n_jobs=-1,
    joblib_verbose=3,
)

Fitting the grid search on the training data

In [17]:
gs.fit(data_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done  58 out of  60 | elapsed:  9.1min remaining:   18.7s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  9.1min finished
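
The 60 tasks in the log follow directly from the grid: every parameter combination is fitted once per fold. The sketch below enumerates the combinations with `itertools.product`; the enumeration order is an assumption, although it happens to line up with the row indices in the results table further down.

```python
from itertools import product

param_grid = {
    "n_factors": [25, 20, 15],
    "lr_all": [0.002, 0.001],
    "reg_all": [0.002, 0.001],
    "n_epochs": [100],
}
cv_folds = 5

# Cartesian product of all parameter values, one dict per combination
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(combinations) * cv_folds

print(len(combinations), "combinations x", cv_folds, "folds =", total_fits, "fits")
```

That gives 3 x 2 x 2 x 1 = 12 combinations and 60 fits in total.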


* Training time: 5m 34s 
* Specs: 16 core Intel i9-9900K with 64 GB RAM

In [18]:
df = pd.DataFrame(gs.cv_results)
df_final = df[
    [
        "mean_test_mse",
        "std_test_mse",
        "mean_test_mae",
        "std_test_mae",
        "params",
    ]
]
df_final.sort_values("mean_test_mse", ascending=True).head(10)

Unnamed: 0,mean_test_mse,std_test_mse,mean_test_mae,std_test_mae,params
10,0.852797,0.004257,0.724805,0.00137,"{'n_factors': 15, 'lr_all': 0.001, 'reg_all': ..."
6,0.856057,0.008569,0.725754,0.002052,"{'n_factors': 20, 'lr_all': 0.001, 'reg_all': ..."
11,0.858529,0.008924,0.726866,0.002893,"{'n_factors': 15, 'lr_all': 0.001, 'reg_all': ..."
2,0.859703,0.008624,0.727487,0.003044,"{'n_factors': 25, 'lr_all': 0.001, 'reg_all': ..."
7,0.864232,0.007465,0.729011,0.002873,"{'n_factors': 20, 'lr_all': 0.001, 'reg_all': ..."
3,0.866677,0.00582,0.729999,0.001671,"{'n_factors': 25, 'lr_all': 0.001, 'reg_all': ..."
8,0.92933,0.01005,0.749213,0.003714,"{'n_factors': 15, 'lr_all': 0.002, 'reg_all': ..."
9,0.94487,0.007026,0.754519,0.002287,"{'n_factors': 15, 'lr_all': 0.002, 'reg_all': ..."
4,0.952931,0.011957,0.757742,0.003303,"{'n_factors': 20, 'lr_all': 0.002, 'reg_all': ..."
5,0.966377,0.00561,0.762524,0.002548,"{'n_factors': 20, 'lr_all': 0.002, 'reg_all': ..."


Getting the best score achieved for each metric and the corresponding hyperparameter combination

In [19]:
gs.best_params, gs.best_score

({'mse': {'n_factors': 15, 'lr_all': 0.001, 'reg_all': 0.002, 'n_epochs': 100},
  'mae': {'n_factors': 15,
   'lr_all': 0.001,
   'reg_all': 0.002,
   'n_epochs': 100}},
 {'mse': 0.8527970618826071, 'mae': 0.7248052533817531})

Finally, testing the best combination of hyperparameters on our test data

In [20]:
gs_r = gs.test(data_test)
print(f" MAE: {mae(gs_r,verbose=False)}")
print(f" MSE: {mse(gs_r,verbose=False)}")
print(f"RMSE: {rmse(gs_r,verbose=False)}")

 MAE: 0.7106358349280463
 MSE: 0.8272398493595747
RMSE: 0.909527266968712
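
For reference, the three metrics can be computed by hand from (true rating, estimated rating) pairs. The sketch below uses made-up toy pairs and a hypothetical helper `error_metrics`; it also checks that the reported RMSE is the square root of the reported MSE.

```python
import math


def error_metrics(pairs):
    """Return (MAE, MSE, RMSE) for an iterable of (true, estimated) pairs."""
    errors = [true - est for true, est in pairs]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)


# Toy predictions, not taken from the model
toy_pairs = [(4, 3.5), (2, 3.0), (5, 4.0), (3, 3.0)]
toy_mae, toy_mse, toy_rmse = error_metrics(toy_pairs)
print(toy_mae, toy_mse, toy_rmse)

# Values reported on the test set above
reported_mse = 0.8272398493595747
reported_rmse = 0.909527266968712
print(math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9))
```

Since RMSE is in the same units as the ratings, the model's estimates are off by roughly 0.9 stars on the 1 to 5 scale in the root-mean-square sense.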


Saving the learnt model

In [21]:
from surprise.dump import dump

#! this pickles the whole GridSearchCV object, which wraps the refitted best estimator
dump("../model/model.pkl", algo=gs)