In [1]:
%pip install scikit-surprise

Note: you may need to restart the kernel to use updated packages.


In [2]:
from surprise import SVD, SVDpp, NMF
from surprise import Dataset, accuracy, Reader
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV

In [3]:
# Loading dataset
data = Dataset.load_builtin('ml-100k')


In [4]:
# Use the famous SVD algorithm
model = SVD()

# Run 5-fold cross-validation and print results
results_SVD = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9382  0.9282  0.9302  0.9326  0.9418  0.9342  0.0051  
MAE (testset)     0.7377  0.7321  0.7361  0.7363  0.7427  0.7370  0.0034  
Fit time          1.25    1.22    1.31    1.27    1.29    1.27    0.03    
Test time         0.13    0.12    0.25    0.19    0.11    0.16    0.05    
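
The `Mean` and `Std` columns can be reproduced from the per-fold values. A minimal sketch, using the rounded RMSE values printed above and assuming `Std` is the population standard deviation (the numbers in the table are consistent with that):

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the table above (rounded to 4 decimals)
fold_rmse = [0.9382, 0.9282, 0.9302, 0.9326, 0.9418]

rmse_mean = mean(fold_rmse)
rmse_std = pstdev(fold_rmse)  # population std: divides by n, not n - 1

print(f"Mean: {rmse_mean:.4f}")
print(f"Std:  {rmse_std:.4f}")
```

A small standard deviation relative to the mean suggests the score does not depend much on which ratings end up in the test fold.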


In [5]:
# Split the dataset into a training set and a test set; the test set holds 25% of the ratings.
trainset, testset = train_test_split(data, test_size=0.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.9398


0.9398456991782965

In [6]:
# Find the best n_epochs, lr_all and reg_all for SVD with GridSearchCV
params = {'n_epochs': [5, 10, 15], 'lr_all': [0.002, 0.005, 0.01], 'reg_all': [0.02, 0.1, 0.2]}

# GridSearchCV takes the algorithm class itself, not an instance
grid_search = GridSearchCV(SVD, params, measures=['rmse', 'mae'], cv=3)

grid_search.fit(data)

print("Best parameters found:", grid_search.best_params)
print("Best RMSE:", grid_search.best_score['rmse'])
print("Best MAE:", grid_search.best_score['mae'])

Best parameters found: {'rmse': {'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.1}, 'mae': {'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.1}}
Best RMSE: 0.935255545858
Best MAE: 0.7412732460046921
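
GridSearchCV tries every combination in the parameter grid and cross-validates each one, so the cost grows multiplicatively with the number of values. The sketch below only counts that work, without training anything; the names `combinations` and `n_fits` are illustrative and not part of Surprise:

```python
from itertools import product

# Same grid as in the cell above
params = {
    'n_epochs': [5, 10, 15],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.02, 0.1, 0.2],
}
cv = 3  # number of folds passed to GridSearchCV

# Every combination of one value per parameter
combinations = [dict(zip(params, values)) for values in product(*params.values())]

# Each combination is fitted once per fold
n_fits = len(combinations) * cv

print("Combinations:", len(combinations))
print("Model fits:  ", n_fits)
```

With three values for each of three parameters there are 27 combinations and, at `cv=3`, 81 model fits, so the total running time scales with how long a single fit of the chosen algorithm takes.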


In [7]:
# Repeat the same grid search for SVD++ (this took more than 20 minutes to run)
grid_search_svdpp = GridSearchCV(SVDpp, params, measures=['rmse', 'mae'], cv=3)
grid_search_svdpp.fit(data)

print("Best parameters found for SVD++:", grid_search_svdpp.best_params)
print("Best RMSE for SVD++:", grid_search_svdpp.best_score['rmse'])
print("Best MAE for SVD++:", grid_search_svdpp.best_score['mae'])

Best parameters found for SVD++: {'rmse': {'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.02}, 'mae': {'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.02}}
Best RMSE for SVD++: 0.9294371515556997
Best MAE for SVD++: 0.7288864429677684


In [8]:
# Run a grid search for Non-negative Matrix Factorization (NMF).
# NMF has its own hyperparameters (n_factors, reg_pu, reg_qi), so it needs a separate grid.
param_nmf = {'n_factors': [5, 10, 15], 'n_epochs': [10, 15], 'reg_pu': [0.02, 0.1, 0.2], 'reg_qi': [0.02, 0.1, 0.2]}
grid_search_nmf = GridSearchCV(NMF, param_nmf, measures=['rmse', 'mae'], cv=3)
grid_search_nmf.fit(data)

print("Best parameters found for NMF:", grid_search_nmf.best_params)
print("Best RMSE for NMF:", grid_search_nmf.best_score['rmse'])
print("Best MAE for NMF:", grid_search_nmf.best_score['mae'])

Best parameters found for NMF: {'rmse': {'n_factors': 10, 'n_epochs': 15, 'reg_pu': 0.1, 'reg_qi': 0.2}, 'mae': {'n_factors': 5, 'n_epochs': 15, 'reg_pu': 0.2, 'reg_qi': 0.2}}
Best RMSE for NMF: 0.965012061745123
Best MAE for NMF: 0.7529903892877684


Let's go over the metrics reported in the outputs above:

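**RMSE (Root Mean Squared Error)** (testset): the square root of the mean squared difference between actual and predicted ratings on the test set. Because the errors are squared, large errors are penalized more heavily than in MAE. The smaller the value, the better.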

**MAE (Mean Absolute Error)** (testset): the mean absolute difference between actual and predicted ratings on the test set. The smaller the value, the better.

**Fit time**: the time required to train the model on the training part of each fold, reported in seconds.

**Test time**: the time required to predict the ratings in the test part of each fold, also reported in seconds.
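
To make the two error metrics concrete, here is a minimal hand computation on four made-up ratings. The values are invented for illustration and do not come from ml-100k:

```python
import math

# Hypothetical true ratings and model predictions for four (user, item) pairs
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

errors = [p - a for a, p in zip(actual, predicted)]

# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE:  {mae:.4f}")   # 0.6250
print(f"RMSE: {rmse:.4f}")  # 0.7500
```

RMSE comes out higher than MAE here because the two errors of size 1 weigh more once squared; the two metrics are equal only when all absolute errors are the same.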

---

*In the grid searches, SVD (best RMSE 0.9353) and SVD++ (best RMSE 0.9294) both outperformed NMF (best RMSE 0.9650), and MAE gives the same ranking.
SVD++ achieved the lowest error, but it takes much longer to train: its grid search ran for more than 20 minutes.*
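
The ranking can be checked directly from the scores printed by the three grid searches. A short sketch using those reported values, rounded to four decimals; the relative-gap calculation is only illustrative arithmetic:

```python
# Best RMSE per algorithm, copied from the grid search outputs above
best_rmse = {"SVD": 0.9353, "SVD++": 0.9294, "NMF": 0.9650}

# Lower RMSE is better, so sort ascending
ranking = sorted(best_rmse, key=best_rmse.get)

baseline = best_rmse["NMF"]
for name in ranking:
    gap = (baseline - best_rmse[name]) / baseline * 100
    print(f"{name:6s} RMSE={best_rmse[name]:.4f}  improvement over NMF: {gap:.2f}%")
```

These scores come from 3-fold cross-validation on random splits, so differences of a few thousandths between SVD and SVD++ may partly reflect fold-to-fold noise.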