Task (translated with Google Translate):

SURPRISE PACKAGE

Use the MovieLens 1M data.

You can use any model from the package.

Get an RMSE of 0.87 or lower on the test set.

Teacher's comment:

You may not have enough RAM to do the homework on the 1M dataset, so it can be done on 100K. I suggest computing the RMSE with cross-validation (5 folds) rather than on a held-out test set.

In [20]:
from surprise import KNNWithMeans, KNNBasic, SVD
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV

import pandas as pd

In [2]:
movies = pd.read_csv('../movies.csv')
ratings = pd.read_csv('../ratings.csv')

In [3]:
movies_with_ratings = movies.join(ratings.set_index('movieId'), on='movieId').reset_index(drop=True)
movies_with_ratings.dropna(inplace=True)

In [4]:
# Surprise's load_from_df expects exactly three columns in this order:
# user id, item id, rating.
dataset = pd.DataFrame({
    'uid': movies_with_ratings.userId,
    'iid': movies_with_ratings.title,
    'rating': movies_with_ratings.rating
})

In [5]:
min_r = ratings.rating.min()
max_r = ratings.rating.max()

In [6]:
reader = Reader(rating_scale=(min_r, max_r))
data = Dataset.load_from_df(dataset, reader)

In [7]:
algo = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8985  0.8901  0.8924  0.8981  0.8984  0.8955  0.0036  
Fit time          0.41    0.44    0.45    0.44    0.45    0.44    0.01    
Test time         1.15    1.19    1.17    1.12    1.13    1.15    0.03    


{'test_rmse': array([0.89848716, 0.89012551, 0.89236153, 0.89812934, 0.89843228]),
 'fit_time': (0.41491174697875977,
  0.43680548667907715,
  0.4457817077636719,
  0.4437870979309082,
  0.4468064308166504),
 'test_time': (1.1508936882019043,
  1.1878247261047363,
  1.1739506721496582,
  1.1190080642700195,
  1.1290020942687988)}

**The goal is an RMSE of 0.87 or lower under 5-fold cross-validation. The mean here is 0.8955, so the first attempt failed.**
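As a reminder of what is being measured, here is a minimal sketch of RMSE and of the mean over folds. The helper `rmse` and the constant `TARGET` are illustrative names, not part of Surprise; the per-fold values are copied from the table above.

```python
import math
from statistics import mean


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(mean((r - est) ** 2 for r, est in pairs))


# Per-fold RMSE values copied from the cross-validation table above.
fold_rmse = [0.8985, 0.8901, 0.8924, 0.8981, 0.8984]
mean_rmse = mean(fold_rmse)

TARGET = 0.87  # threshold from the task statement
print(f"mean RMSE = {mean_rmse:.4f}, target met: {mean_rmse <= TARGET}")
```

The quantity that has to drop to 0.87 is the mean over the five folds, not the best single fold.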

I don't want to tweak parameters by hand until I reach the required quality, so I will use GridSearchCV.
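A grid search is still exhaustive: it fits the model once per parameter combination and per fold. A rough sketch of that cost for the grid used in the next cell, where `count_fits` is a made-up helper rather than a Surprise function:

```python
from itertools import product


def count_fits(param_grid, cv):
    """Number of model fits an exhaustive grid search performs."""
    n_combinations = len(list(product(*param_grid.values())))
    return n_combinations * cv


param_grid = {'k': list(range(1, 100))}  # 99 candidate values for k
print(count_fits(param_grid, cv=5))
```

Each KNN fit prints its own similarity-matrix messages, which is why the log of the search is so long.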

In [8]:
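# Note: no sim_options are passed here, so KNNWithMeans falls back to its
# default similarity (MSD) rather than the pearson_baseline used above.
param_grid = {'k': list(range(1, 100))}
gs = GridSearchCV(KNNWithMeans, param_grid, measures=['rmse', 'mae'], cv=5)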

gs.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
[output truncated: the two lines above repeat for every fold and parameter combination]

In [9]:
# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.8960363785366996
{'k': 41}


Still not good enough. OK, let's experiment with min_k.
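To see what min_k changes, here is a simplified sketch of how a KNNWithMeans-style estimate is aggregated, following the description in the Surprise documentation: the user's mean rating plus a similarity-weighted average of the neighbors' deviations from their own means. If fewer than min_k neighbors with positive similarity are available, the estimate falls back to the user's mean. The function name and the toy numbers are made up for illustration; this is not the library's code.

```python
def knn_with_means_estimate(user_mean, neighbors, k=40, min_k=1):
    """Estimate a rating from (similarity, neighbor_rating, neighbor_mean) tuples."""
    # Keep the k most similar neighbors, then only those with positive similarity.
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    used = [(s, r, m) for s, r, m in top if s > 0]
    # Too few usable neighbors: fall back to the user's mean rating.
    if len(used) < min_k or not used:
        return user_mean
    sim_sum = sum(s for s, _, _ in used)
    dev_sum = sum(s * (r - m) for s, r, m in used)
    return user_mean + dev_sum / sim_sum


neighbors = [(0.9, 5.0, 4.0), (0.5, 3.0, 3.5)]  # toy data
print(knn_with_means_estimate(3.5, neighbors, min_k=1))
print(knn_with_means_estimate(3.5, neighbors, min_k=3))
```

A larger min_k makes the model trust the neighborhood only when it is well populated, which tends to help for users or items with few ratings.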

In [None]:
param_grid = {'k': list(range(40,42)), 'min_k': list(range(1,40))}
gs = GridSearchCV(KNNWithMeans, param_grid, measures=['rmse'], cv=5)

gs.fit(data)

In [15]:
# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.8901349498323121
{'k': 41, 'min_k': 3}


Let's try another algorithm. I still have to stick to collaborative filtering algorithms, though.

In [None]:
param_grid = {'k': list(range(1,100)), 'min_k': list(range(1,40))}
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=5)

gs.fit(data)

<img src='https://i.ytimg.com/vi/U7CZcd-UYmU/maxresdefault.jpg'>

In [18]:
# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.9336358624349461
{'k': 15, 'min_k': 3}


Is this what we were waiting for?! Netflix will save us! Time to try SVD, the matrix factorization approach popularized during the Netflix Prize.

In [21]:
algo = SVD()
trainset, testset = train_test_split(data, test_size=.15)
algo.fit(trainset)
test_pred = algo.test(testset)
accuracy.rmse(test_pred, verbose=True)

RMSE: 0.8690


0.8690262650522234

Told ya! OK, but that was a single random split, so I still have to cross-validate.
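A single split can be lucky or unlucky, while k-fold cross-validation tests on every rating exactly once and averages the results. A minimal sketch of how such folds can be formed, using a hypothetical helper `k_fold_indices` (Surprise has its own splitting logic; this only illustrates the idea):

```python
import random


def k_fold_indices(n_ratings, n_splits=5, seed=0):
    """Yield (train_indices, test_indices) for each of n_splits folds."""
    indices = list(range(n_ratings))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds of near-equal size.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for test_no, test_idx in enumerate(folds):
        train_idx = [j for fold_no, fold in enumerate(folds)
                     if fold_no != test_no for j in fold]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(10, n_splits=5):
    print(sorted(test_idx), len(train_idx))
```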

In [23]:
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8690  0.8875  0.8727  0.8697  0.8676  0.8733  0.0073  
Fit time          3.76    3.82    3.80    3.80    3.79    3.80    0.02    
Test time         0.09    0.09    0.17    0.09    0.17    0.12    0.04    


{'test_rmse': array([0.86904669, 0.88749203, 0.87266096, 0.86971309, 0.8675558 ]),
 'fit_time': (3.76493239402771,
  3.8177952766418457,
  3.7998435497283936,
  3.801806688308716,
  3.7928922176361084),
 'test_time': (0.08677077293395996,
  0.09075736999511719,
  0.16755175590515137,
  0.08876228332519531,
  0.16555476188659668)}

Still not enough: the mean RMSE over 5 folds is 0.8733. So I will tune the algorithm's hyperparameters.
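For context, here is a minimal sketch of the model that the parameters tuned below control. It assumes the biased matrix-factorization form described in the Surprise documentation for SVD: the estimate is the global mean plus a user bias, an item bias and the dot product of two factor vectors, trained by stochastic gradient descent. The names `predict` and `sgd_step` and the toy numbers are illustrative, not Surprise's API. In these terms, n_factors is the length of the vectors, n_epochs the number of passes over the ratings, lr_all the step size and reg_all the regularization strength.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Estimate = global mean + user bias + item bias + p_u . q_i."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr_all=0.005, reg_all=0.02):
    """One gradient step on a single known rating (defaults as documented for SVD)."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr_all * (err - reg_all * b_u)
    new_b_i = b_i + lr_all * (err - reg_all * b_i)
    # Both factor updates use the values from before the step.
    new_p_u = [p + lr_all * (err * q - reg_all * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr_all * (err * p - reg_all * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


mu = 3.5                                # toy global mean
b_u, b_i = 0.0, 0.0
p_u, q_i = [0.1, 0.1], [0.1, 0.1]       # n_factors = 2 in this toy example
before = abs(5.0 - predict(mu, b_u, b_i, p_u, q_i))
b_u, b_i, p_u, q_i = sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
after = abs(5.0 - predict(mu, b_u, b_i, p_u, q_i))
print(before, after)  # the error on this rating shrinks after one step
```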

In [25]:
param_grid = {'n_factors': list(range(10,200))}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)

gs.fit(data)

In [26]:
# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.8686319838681957
{'n_factors': 17}


In [29]:
algo = SVD(n_factors = 17)
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8727  0.8648  0.8740  0.8708  0.8630  0.8691  0.0044  
Fit time          1.68    1.52    1.55    1.53    1.59    1.57    0.06    
Test time         0.09    0.08    0.19    0.08    0.17    0.12    0.05    


{'test_rmse': array([0.87268002, 0.86484867, 0.87404056, 0.87082158, 0.86296716]),
 'fit_time': (1.675549030303955,
  1.5169432163238525,
  1.5518405437469482,
  1.5268909931182861,
  1.5887558460235596),
 'test_time': (0.08975982666015625,
  0.08477234840393066,
  0.19348859786987305,
  0.08377265930175781,
  0.16652584075927734)}

In [38]:
param_grid = {'n_factors': [17], 'n_epochs': list(range(19,200))}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)

gs.fit(data)

In [39]:
# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.8667456310889294
{'n_factors': 17, 'n_epochs': 28}
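The tuning in this notebook goes one step at a time: n_factors first, then n_epochs with n_factors fixed, then lr_all together with reg_all. A rough count, assuming the grids used in these cells, shows why a joint search over all four parameters was not an option:

```python
from math import prod

# Grid sizes taken from the param_grid definitions in this notebook.
grid_sizes = {
    'n_factors': len(range(10, 200)),
    'n_epochs': len(range(19, 200)),
    'lr_all': len(range(1, 10)),
    'reg_all': len(range(1, 10)),
}
cv = 5

# Three separate searches: n_factors, then n_epochs, then lr_all x reg_all.
sequential = (grid_sizes['n_factors']
              + grid_sizes['n_epochs']
              + grid_sizes['lr_all'] * grid_sizes['reg_all'])
# One joint search over every combination of the four parameters.
joint = prod(grid_sizes.values())

print(sequential * cv, joint * cv)
```

The trade-off is that a step-by-step search can miss parameter values that only work well in combination.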


In [40]:
algo = SVD(n_factors = 17, n_epochs = 28)
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8753  0.8643  0.8646  0.8641  0.8688  0.8674  0.0043  
Fit time          2.13    2.14    2.12    2.14    2.14    2.13    0.00    
Test time         0.09    0.09    0.08    0.17    0.09    0.10    0.03    


{'test_rmse': array([0.87525042, 0.86425813, 0.86458723, 0.86408517, 0.86879585]),
 'fit_time': (2.1303319931030273,
  2.1362898349761963,
  2.1243226528167725,
  2.1352920532226562,
  2.1372835636138916),
 'test_time': (0.0877382755279541,
  0.08776569366455078,
  0.08477306365966797,
  0.17154145240783691,
  0.08576321601867676)}

In [52]:
lr = list(range(1,10))
lr = [x * 0.001 for x in lr]

In [53]:
reg = list(range(1,10))
reg = [x * 0.01 for x in reg]
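Multiplying an integer by 0.001 can leave binary floating-point residue in the grid, which is why a value such as 0.009000000000000001 shows up in the best parameters reported further down. One possible alternative is to divide integers instead, which gives the float closest to each decimal value. The names `lr_grid` and `reg_grid` are illustrative:

```python
# Dividing integers yields the nearest float to each decimal grid value.
lr_grid = [x / 1000 for x in range(1, 10)]   # 0.001 ... 0.009
reg_grid = [x / 100 for x in range(1, 10)]   # 0.01 ... 0.09

print(lr_grid[-1], 9 * 0.001)
```

The difference is far too small to affect the search; it only makes the reported parameters easier to read and reuse.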

In [54]:
param_grid = {'n_factors': [17], 'n_epochs': [28], 'lr_all': lr, 'reg_all': reg}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)

gs.fit(data)

In [55]:
# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.8583441908328538
{'n_factors': 17, 'n_epochs': 28, 'lr_all': 0.009000000000000001, 'reg_all': 0.07}


In [56]:
algo = SVD(n_factors = 17, n_epochs = 28, lr_all = 0.009, reg_all = 0.07)
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8588  0.8581  0.8615  0.8592  0.8559  0.8587  0.0018  
Fit time          2.12    2.10    2.17    2.11    2.13    2.13    0.02    
Test time         0.17    0.09    0.17    0.09    0.17    0.14    0.04    


{'test_rmse': array([0.85880572, 0.85814844, 0.8615228 , 0.8591883 , 0.85589416]),
 'fit_time': (2.123286008834839,
  2.10141921043396,
  2.1701717376708984,
  2.1083905696868896,
  2.126312255859375),
 'test_time': (0.16755199432373047,
  0.08675885200500488,
  0.16954684257507324,
  0.0857701301574707,
  0.16656279563903809)}

**OK, I did it: the mean RMSE over 5 folds is 0.8587, below the 0.87 target.**

Let's compare a prediction with the one from the lecture.

In [58]:
algo.predict(uid=2, iid='Fight Club (1999)')

Prediction(uid=2, iid='Fight Club (1999)', r_ui=None, est=4.426336076969732, details={'was_impossible': False})