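The grid-search cell creates a GridSearchCV object with the following arguments:

SVD: the algorithm to tune.

param_grid: the hyperparameter values to combine; every combination is evaluated.

measures: the evaluation metrics, such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

cv=3: uses 3-fold cross-validation, so each combination is scored on three held-out folds rather than on a single split.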
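
To make the cost of such a search concrete, here is a minimal sketch that enumerates the combinations a grid search would try. It assumes the same four-hyperparameter grid (two values each) that this notebook passes to GridSearchCV. The counting is plain Python and does not call Surprise.

```python
from itertools import product

# Assumed grid: mirrors the param_grid used for the SVD search in this notebook.
param_grid = {
    "n_factors": [50, 100],
    "n_epochs": [10, 20],
    "lr_all": [0.005, 0.01],
    "reg_all": [0.02, 0.05],
}
cv = 3  # number of cross-validation folds

# Cartesian product of all value lists: one dict per candidate configuration.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# Each candidate is fitted once per fold.
total_fits = len(combinations) * cv

print(len(combinations), "combinations,", total_fits, "model fits")
```

With two values for each of four hyperparameters the grid has 16 combinations, so 3-fold cross-validation requires 48 model fits. Adding a third value to any one hyperparameter would raise that to 72.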

movie 1 million: 18.01

In [16]:
from surprise import Dataset, Reader

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(train_df[["user_idx", "item_idx", "rating"]], reader)


In [10]:
from surprise import SVD
from surprise.model_selection import cross_validate

algo = SVD(n_factors=50, n_epochs=10, lr_all=0.005, reg_all=0.02)

cross_validate(algo, data, measures=["RMSE", "MAE"], cv=3, verbose=True)


Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9037  0.9043  0.9014  0.9031  0.0013  
MAE (testset)     0.7150  0.7157  0.7135  0.7147  0.0009  
Fit time          21.90   21.81   19.64   21.12   1.04    
Test time         7.12    5.62    6.69    6.48    0.63    


{'test_rmse': array([0.90367496, 0.90426384, 0.90135821]),
 'test_mae': array([0.71499923, 0.71565863, 0.71350197]),
 'fit_time': (21.903677225112915, 21.80621862411499, 19.64152455329895),
 'test_time': (7.121840000152588, 5.617508411407471, 6.692774534225464)}
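
The Mean and Std columns of the summary table can be reproduced from the per-fold values in the returned dictionary. The sketch below does this for RMSE with the standard library only. It assumes the Std column is a population standard deviation (dividing by the number of folds), which is the variant that matches the printed figures here.

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the 'test_rmse' array above.
test_rmse = [0.90367496, 0.90426384, 0.90135821]

fold_mean = mean(test_rmse)
# Assumption: population standard deviation (divide by n, not n - 1).
fold_std = pstdev(test_rmse)

print(f"Mean {fold_mean:.4f}  Std {fold_std:.4f}")
```

The small spread across folds relative to the mean suggests that the RMSE estimate is stable for this configuration.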

In [11]:
print("data exists?", "data" in globals())
print("train_df exists?", "train_df" in globals())


data exists? True
train_df exists? True


In [13]:
from pathlib import Path
import pandas as pd

project_root = Path.cwd().parents[1]  # if notebook is in notebooks/learning
data_dir = project_root / "data" / "processed" / "movielens" / "ml-1m"

train_df = pd.read_parquet(data_dir / "train.parquet")
test_df = pd.read_parquet(data_dir / "test.parquet")

print("Project root:", project_root)
print("Train rows:", len(train_df), "Test rows:", len(test_df))
train_df.head()


Project root: /home/helin/projects/BachelorThesis/code/srcCode/recsys-negative-feedback
Train rows: 993571 Test rows: 6040


   user  item  rating  timestamp  user_idx  item_idx
0     1  3186       4  978300019         0        31
1     1  1270       5  978300055         0        22
2     1  1721       4  978300055         0        27
3     1  1022       5  978300055         0        37
4     1  2340       3  978300103         0        24



In [14]:
from surprise import Dataset, Reader

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(train_df[["user_idx", "item_idx", "rating"]], reader)

print("Surprise dataset created.")


Surprise dataset created.


In [15]:
from surprise import SVD
from surprise.model_selection import cross_validate

algo = SVD(n_factors=50, n_epochs=10, lr_all=0.005, reg_all=0.02)
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=3, verbose=True)


Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9019  0.9023  0.9038  0.9027  0.0008  
MAE (testset)     0.7146  0.7139  0.7148  0.7145  0.0004  
Fit time          21.29   23.73   22.38   22.47   0.99    
Test time         6.63    5.84    6.61    6.36    0.36    


{'test_rmse': array([0.901949  , 0.90228511, 0.90381731]),
 'test_mae': array([0.71463392, 0.71391773, 0.71480001]),
 'fit_time': (21.294512033462524, 23.7257022857666, 22.3794686794281),
 'test_time': (6.625228643417358, 5.842278480529785, 6.606756925582886)}

In [7]:
from surprise.model_selection import GridSearchCV

param_grid = {
    "n_factors": [50, 100],
    "n_epochs": [10, 20],
    "lr_all": [0.005, 0.01],
    "reg_all": [0.02, 0.05],
}

gs = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3, n_jobs=-1)
gs.fit(data)

best_params = gs.best_params["rmse"]
best_rmse = gs.best_score["rmse"]

best_params, best_rmse


({'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.05},
 0.8645678378379147)

In [17]:
import json

out_dir = project_root / "outputs" / "movielens" / "ml-1m"
out_dir.mkdir(parents=True, exist_ok=True)

with open(out_dir / "best_svd_params.json", "w") as f:
    json.dump(
        {"best_params": best_params, "best_rmse": best_rmse},
        f,
        indent=2
    )

print("Saved:", out_dir / "best_svd_params.json")


Saved: /home/helin/projects/BachelorThesis/code/srcCode/recsys-negative-feedback/outputs/movielens/ml-1m/best_svd_params.json
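
A later notebook or script can read the saved file back and reuse the tuned hyperparameters. The following is a minimal sketch of that round trip. It writes to a temporary directory as a stand-in for out_dir (an assumption so the example runs anywhere) and uses the parameter values reported by the grid search above.

```python
import json
import tempfile
from pathlib import Path

# Values copied from the grid-search result shown earlier in this notebook.
best_params = {"n_factors": 100, "n_epochs": 20, "lr_all": 0.01, "reg_all": 0.05}
best_rmse = 0.8645678378379147

# A temporary directory stands in for the notebook's out_dir (assumption).
with tempfile.TemporaryDirectory() as tmp:
    params_path = Path(tmp) / "best_svd_params.json"
    params_path.write_text(
        json.dumps({"best_params": best_params, "best_rmse": best_rmse}, indent=2)
    )

    # Reload the file the way a downstream step might.
    loaded = json.loads(params_path.read_text())

# Keyword arguments ready to be unpacked into a model constructor.
svd_kwargs = loaded["best_params"]
print(svd_kwargs)
```

Because the keys match the SVD constructor's argument names, the loaded dictionary could be unpacked directly, for example as SVD(**svd_kwargs), to refit a model with the tuned settings.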
