# __Recommender systems__
### _!_ (_in `GoogleColab`_)
The `surprise` library (distributed as `scikit-surprise`) is a Python scikit for training and evaluating recommender-system models; its API resembles that of `scikit-learn`.
<br><br>
Example: the `movielens` dataset and a matrix factorization model, which this library calls `SVD`.
<br>
The best parameters are selected with cross-validation, and other algorithms (`SVD++`, `NMF`) are tried as well.
<br><br>
How exactly to build the model is described in the library's documentation.
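
As a rough illustration of what a model like `SVD` does, the sketch below fits the biased matrix-factorization rule $\hat r_{ui} = \mu + b_u + b_i + q_i^\top p_u$ with plain SGD, in pure Python on a tiny hand-made rating list. The toy data, the function name `fit_biased_mf` and the hyperparameter values are assumptions made for illustration; this is not the library's implementation.

```python
import random


def fit_biased_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=1):
    """Fit mu + b_u + b_i + q_i . p_u by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    # Latent factors start as small random values around 0.
    pu = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    qi = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(p * q for p, q in zip(pu[u], qi[i]))
        return mu + bu[u] + bi[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient step with L2 regularization on every parameter.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                p, q = pu[u][f], qi[i][f]
                pu[u][f] += lr * (err * q - reg * p)
                qi[i][f] += lr * (err * p - reg * q)
    return predict


# Hypothetical toy ratings: (user, item, rating).
toy_ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 3.0),
    ("u2", "m1", 4.0), ("u2", "m3", 1.0),
    ("u3", "m2", 2.0), ("u3", "m3", 1.5),
]
predict = fit_biased_mf(toy_ratings)
train_rmse = (
    sum((r - predict(u, i)) ** 2 for u, i, r in toy_ratings) / len(toy_ratings)
) ** 0.5
print(f"train RMSE: {train_rmse:.3f}")
print(f"u3 on m1 (unseen pair): {predict('u3', 'm1'):.2f}")
```

The parameter names `n_factors`, `n_epochs`, `lr` and `reg` play the same role as `n_factors`, `n_epochs`, `lr_all` and `reg_all` in the cells below.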

!pip install matplotlib

In [115]:
!pip install surprise



In [116]:
import os
from pathlib import Path
import random
import shutil

from google.colab import files
import numpy as np
import pandas as pd
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate

In [117]:
def set_seed(seed_value: int) -> None:
    """Set a random state for repeatability of results."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    # tf.random.set_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    os.environ['TF_DETERMINISTIC_OPS'] = 'true'


set_seed(1)

### __`OBTAIN` & `SCRUB`__ + __`EXPLORE`__ (DATASET)

- links.csv - movieId, imdbId, tmdbId
- movies.csv - movieId, title, genres
- ratings.csv - userId, movieId, rating, timestamp
- tags.csv - userId, movieId, tag, timestamp

In [118]:
# data = Dataset.load_builtin('ml-100k')

Column names are irrelevant, but specify only one rating per line, and each line needs to respect the following structure:<br> `user ; item ; rating ; [timestamp]`

In [119]:
# # data.__dict__
# data.raw_ratings[:3]

In [120]:
# r_mean = np.mean([r[2] for r in data.raw_ratings])
# r_std = np.std([r[2] for r in data.raw_ratings])

In [121]:
# max([r[2] for r in data.raw_ratings]), min([r[2] for r in data.raw_ratings]), r_mean, r_std

The lowest rating present is 1.0, so 0.0 is a reasonable lower bound:

In [122]:
def read_from_csvfile(file: Path) -> pd.DataFrame:
    """Read content from csv-file and return dataframe from content."""
    df = pd.read_csv(file)

    return df

In [123]:
# Create new folder
new_folder = 'ml'

if os.path.isdir(new_folder):
  shutil.rmtree(new_folder)

os.mkdir(new_folder)

# Upload Files to GoogleColab
uploaded = files.upload()
for filename in uploaded.keys():
  dst_path = os.path.join(new_folder, filename)
  print(f'move {filename} to {dst_path}')
  shutil.move(filename, dst_path)

Saving links.csv to links.csv
Saving movie_ids.txt to movie_ids.txt
Saving movies.csv to movies.csv
Saving movies.mat to movies.mat
Saving ratings.csv to ratings.csv
Saving README.txt to README.txt
Saving tags.csv to tags.csv
move links.csv to ml/links.csv
move movie_ids.txt to ml/movie_ids.txt
move movies.csv to ml/movies.csv
move movies.mat to ml/movies.mat
move ratings.csv to ml/ratings.csv
move README.txt to ml/README.txt
move tags.csv to ml/tags.csv


In [124]:
links = read_from_csvfile('/content/ml/links.csv')
movies = read_from_csvfile('/content/ml/movies.csv')
ratings = read_from_csvfile('/content/ml/ratings.csv')
tags = read_from_csvfile('/content/ml/tags.csv')

In [125]:
links.tail(3)

Unnamed: 0,movieId,imdbId,tmdbId
9739,193585,6397426,479308.0
9740,193587,8391976,483455.0
9741,193609,101726,37891.0


In [126]:
movies.tail(3)

Unnamed: 0,movieId,title,genres
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy


In [127]:
ratings.tail(3)

Unnamed: 0,userId,movieId,rating,timestamp
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352
100835,610,170875,3.0,1493846415


In [128]:
tags.tail(3)

Unnamed: 0,userId,movieId,tag,timestamp
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978
3682,610,168248,Heroic Bloodshed,1493844270


Column names are irrelevant, but specify only one rating per line, and each line needs to respect the following structure:<br> `user ; item ; rating ; [timestamp]`

In [129]:
df = ratings[['userId', 'movieId', 'rating']]
df.tail(3)

Unnamed: 0,userId,movieId,rating
100833,610,168250,5.0
100834,610,168252,5.0
100835,610,170875,3.0


In [130]:
r_mean = df['rating'].mean()
r_std = df['rating'].std()

In [131]:
df['rating'].max(), df['rating'].min(), r_mean, r_std

(5.0, 0.5, 3.501556983616962, 1.042529239060635)

The lowest rating is 0.5, so 0 is a reasonable lower bound for the scale:

- https://surprise.readthedocs.io/en/stable/reader.html?highlight=Reader#surprise.reader.Reader

In [132]:
# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(0, 5))
reader

<surprise.reader.Reader at 0x7dcd6b53ece0>
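
The `rating_scale` given to `Reader` matters at prediction time: by default, estimates returned by `predict` are clipped to that interval. A minimal sketch of the clipping step, using a hypothetical helper `clip_to_scale` that is not part of `surprise`:

```python
def clip_to_scale(estimate, rating_scale=(0, 5)):
    """Clamp a raw estimate into the [low, high] interval of the rating scale."""
    low, high = rating_scale
    return max(low, min(high, estimate))


for raw in (-0.3, 3.7, 5.8):
    print(raw, "->", clip_to_scale(raw))
```

With `rating_scale=(0.5, 5)`, the observed minimum and maximum, the lower clamp would be 0.5 instead of 0.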

In [133]:
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
data

<surprise.dataset.DatasetAutoFolds at 0x7dcd6b53d7b0>

### __`MODEL`__ &  __`Training`__

#### `SVD`

- the matrix factorization model `SVD`

- https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD
- https://www.piwheels.org/project/scikit-surprise/
- https://pypi.org/project/scikit-surprise/
- https://github.com/NicolasHug/Surprise

In [134]:
# from surprise.prediction_algorithms.matrix_factorization import SVD  # poetry add surprise  # poetry add scikit-surprise==1.1.3 --use-pep517

- https://surprise.readthedocs.io/en/stable/getting_started.html#cross-validate-example
- https://surprise.readthedocs.io/en/stable/model_selection.html
- https://surprise.readthedocs.io/en/stable/getting_started.html?highlight=Dataset#use-a-custom-dataset


In [135]:
# from surprise import SVD
# from surprise import Dataset, Reader
# from surprise.model_selection import cross_validate

In [136]:
# from surprise.model_selection import train_test_split
# trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
# model = SVD()
# model.fit(trainset)
# # Predicting ratings
# test_predictions = model.test(testset)
# predictions = model.predict(uid=1, iid=1)

In [137]:
algo = SVD()

In [138]:
algo

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53fc40>

In [139]:
algo2 = SVD(
            n_factors=1,
            n_epochs=10,
            biased=True,
            init_mean=2.5,
            init_std_dev=0.5,
            lr_all=0.005,
            reg_all=0.02,
            lr_bu=None,
            lr_bi=None,
            lr_pu=None,
            lr_qi=None,
            reg_bu=None,
            reg_bi=None,
            reg_pu=None,
            reg_qi=None,
            random_state=None,
            verbose=False
            )

In [140]:
# cross_validate(
#                algo,
#                data,
#                measures=['rmse', 'mae'],
#                cv=None,
#                return_train_measures=False,
#                n_jobs=1,
#                pre_dispatch='2*n_jobs',
#                verbose=False
#                )

- Selecting the best parameters with cross-validation

In [141]:
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=6, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 6 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Mean    Std     
RMSE (testset)    0.8750  0.8633  0.8713  0.8784  0.8712  0.8753  0.8724  0.0048  
MAE (testset)     0.6706  0.6639  0.6692  0.6777  0.6682  0.6716  0.6702  0.0042  
Fit time          1.82    1.49    2.07    2.94    3.69    2.01    2.34    0.75    
Test time         0.16    0.10    0.84    0.22    0.25    0.12    0.28    0.25    


{'test_rmse': array([0.87495113, 0.86327261, 0.87125727, 0.87841512, 0.87122477,
        0.87529899]),
 'test_mae': array([0.67061167, 0.66393946, 0.66918979, 0.67774807, 0.66815439,
        0.67161136]),
 'fit_time': (1.8200418949127197,
  1.4873108863830566,
  2.074875831604004,
  2.9426469802856445,
  3.685574769973755,
  2.0123400688171387),
 'test_time': (0.15635013580322266,
  0.09995627403259277,
  0.8353290557861328,
  0.22399449348449707,
  0.25064945220947266,
  0.11501479148864746)}

In [142]:
# Run 6-fold cross-validation and print results.
cross_validate(algo2, data, measures=['RMSE', 'MAE'], cv=6, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 6 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Mean    Std     
RMSE (testset)    0.9020  0.8982  0.9062  0.8888  0.8889  0.9032  0.8979  0.0068  
MAE (testset)     0.6943  0.6900  0.6980  0.6887  0.6845  0.6934  0.6915  0.0043  
Fit time          0.34    0.37    0.37    0.37    0.35    0.37    0.36    0.01    
Test time         0.10    0.09    0.10    0.09    0.12    0.10    0.10    0.01    


{'test_rmse': array([0.90204838, 0.89815799, 0.90622792, 0.88876659, 0.88886203,
        0.90323361]),
 'test_mae': array([0.69427831, 0.68998837, 0.69803931, 0.68865522, 0.68452513,
        0.6934053 ]),
 'fit_time': (0.33513712882995605,
  0.3654625415802002,
  0.3718702793121338,
  0.3654508590698242,
  0.3549935817718506,
  0.36893796920776367),
 'test_time': (0.1030879020690918,
  0.09238934516906738,
  0.09917163848876953,
  0.09246134757995605,
  0.1163184642791748,
  0.1032407283782959)}
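
For reference, the RMSE and MAE rows in the tables above are the root mean squared error and the mean absolute error over each test fold. The sketch below computes both by hand and shows a simple shuffled k-fold split. The helper names `rmse`, `mae` and `k_fold_indices` and the sample numbers are made up for illustration and do not come from `surprise`.

```python
import math
import random


def rmse(true, pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mae(true, pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def k_fold_indices(n_samples, n_splits, seed=1):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_splits] for k in range(n_splits)]
    for k in range(n_splits):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(true_ratings, predicted))  # sqrt((0.25 + 0 + 1 + 1) / 4) = 0.75
print(mae(true_ratings, predicted))   # (0.5 + 0 + 1 + 1) / 4 = 0.625

for train, test in k_fold_indices(n_samples=12, n_splits=6):
    print(len(train), len(test))      # 10 2 on every fold
```

RMSE penalizes large errors more heavily than MAE, which is why it is never smaller than MAE on the same fold.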

model set:

In [143]:
algo_SVD = {
            f'SVD(nfc{nfs}, neph{eph})':
            SVD(
                n_factors=nfs,
                n_epochs=eph,
                init_mean=r_mean, # df['rating'].mean(), # (0+5)/2 ...
                init_std_dev=r_std  # df['rating'].std()
                )
            for nfs in (1, 5, 10, 20, 40)
            for eph in (5, 10, 20)
           }

In [144]:
algo_SVD

{'SVD(nfc1, neph5)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53f160>,
 'SVD(nfc1, neph10)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53c8b0>,
 'SVD(nfc1, neph20)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53f250>,
 'SVD(nfc5, neph5)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53dcf0>,
 'SVD(nfc5, neph10)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53f0d0>,
 'SVD(nfc5, neph20)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53e590>,
 'SVD(nfc10, neph5)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53d8a0>,
 'SVD(nfc10, neph10)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53da50>,
 'SVD(nfc10, neph20)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53f520>,
 'SVD(nfc20, neph5)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7dcd6b53e0e0>,
 'SVD(nfc2

model results by cross validation:

In [145]:
assessment_SVD = {
                  f'{name}_cv{cv}':
                  cross_validate(alg, data, measures=['RMSE', 'MAE'], cv=cv, verbose=True)
                  for name, alg in algo_SVD.items() for cv in (3, 5, 6)
                  }

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9521  0.9475  0.9515  0.9504  0.0021  
MAE (testset)     0.7389  0.7395  0.7379  0.7388  0.0007  
Fit time          0.16    0.24    0.24    0.21    0.04    
Test time         0.32    0.88    0.31    0.50    0.26    
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9396  0.9313  0.9329  0.9386  0.9479  0.9380  0.0059  
MAE (testset)     0.7262  0.7220  0.7225  0.7278  0.7325  0.7262  0.0038  
Fit time          0.20    0.21    0.21    0.21    0.22    0.21    0.00    
Test time         0.13    0.54    0.12    0.18    0.59    0.31    0.21    
Evaluating RMSE, MAE of algorithm SVD on 6 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Mean    Std     
RMSE (testset)    0.9287  0.9301  0.9285  0.9328  0.9451  0.9357  0.9335  0.0058  
MA

In [146]:
# assessment_SVD

In [147]:
# best_SVD_by_rmse = min(assessment_SVD, key=lambda x: x['test_rmse'].mean())
all_SVD_by_rmse = {
                   name: value['test_rmse'].mean()
                   for name, value in assessment_SVD.items()
                   }
best_SVD_by_rmse = min(all_SVD_by_rmse, key=all_SVD_by_rmse.get)

In [148]:
best_SVD_by_rmse

'SVD(nfc1, neph20)_cv6'

In [149]:
# best_SVD_by_mae = min(assessment_SVD, key=lambda x:x['test_mae'].mean())
all_SVD_by_mae = {
                  name: value['test_mae'].mean()
                  for name, value in assessment_SVD.items()
                  }
best_SVD_by_mae = min(all_SVD_by_mae, key=all_SVD_by_mae.get)

In [150]:
best_SVD_by_mae

'SVD(nfc1, neph20)_cv6'

For the matrix factorization model (SVD), the best variant among the parameter sets tried was 1 factor and 20 epochs, evaluated with 6 cross-validation folds.

In [151]:
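# Strip the 4-character '_cvN' suffix to recover the key in algo_SVD.
# The model keeps the fit from its last cross-validation fold.
predictions = algo_SVD.get(best_SVD_by_rmse[:-4]).predict(uid=1, iid=1)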
predictions

Prediction(uid=1, iid=1, r_ui=None, est=4.586234359102049, details={'was_impossible': False})

Let's also take some other SVD hyperparameters into account.

In [152]:
from surprise.model_selection import GridSearchCV

In [153]:
another_r_mean = (max([r[2] for r in data.raw_ratings]) + min([r[2] for r in data.raw_ratings])) / 2

In [154]:
param_grid = {
              'n_factors': (1, 5, 10, 20, 40),
              'n_epochs': (5, 10, 20),
              'init_mean': (r_mean, another_r_mean), # (df['rating'].mean(), ((df['rating'].max() + df['rating'].min()) / 2)),
              'init_std_dev': (r_std, 1), # (df['rating'].std(), 1),
              'lr_all': (0.002, 0.005),
              'reg_all': (0.4, 0.6)
              }
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=6)

gs.fit(data)


In [155]:
# best MAE score
print(gs.best_score['mae'].round(4))

# combination of parameters that gave the best MAE score
print(gs.best_params['mae'])

0.6807
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [156]:
# best RMSE score
print(gs.best_score['rmse'].round(4))

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.8806
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [157]:
# the_best_svd = gs.best_estimator['rmse']
# the_best_svd.__dict__

In [158]:
# the_best_svd = gs.best_estimator['mae']
# the_best_svd.__dict__

0.8806<br>
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 1, 'lr_all': 0.005, 'reg_all': 0.4}<br>
(22+ min of processing, ~240 models)
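
The figure of ~240 models follows directly from the grid: the number of combinations is the product of the lengths of the value tuples, and each combination is fitted once per fold. A quick check in plain Python, where string placeholders stand in for `r_mean`, `another_r_mean` and `r_std`:

```python
from itertools import product

# Same shape as the first SVD param_grid; strings are placeholders for values.
grid = {
    "n_factors": (1, 5, 10, 20, 40),
    "n_epochs": (5, 10, 20),
    "init_mean": ("r_mean", "another_r_mean"),
    "init_std_dev": ("r_std", 1),
    "lr_all": (0.002, 0.005),
    "reg_all": (0.4, 0.6),
}
n_combinations = len(list(product(*grid.values())))
n_folds = 6
print(n_combinations)            # 240
print(n_combinations * n_folds)  # 1440 model fits in total
```

Since every added value multiplies the total, narrowing one parameter at a time (as the following cells do) keeps the search affordable.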

In [159]:
param_grid = {
              'n_factors': (1, ),
              'n_epochs': (20, ),
              'init_mean': (2.75,),
              'init_std_dev': (0.2,),
              'lr_all': (0.005,),
              'reg_all': (0.4, )
              }
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=6)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'].round(4))

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

# best MAE score
print(gs.best_score['mae'].round(4))

# combination of parameters that gave the best MAE score
print(gs.best_params['mae'])

0.8801
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 0.2, 'lr_all': 0.005, 'reg_all': 0.4}
0.6799
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 0.2, 'lr_all': 0.005, 'reg_all': 0.4}


In [160]:
parameters = gs.best_params['rmse']
compare_results_RMSE = {f'''SVD_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gs.best_score['rmse']}
parameters = gs.best_params['mae']
compare_results_MAE = {f'''SVD_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gs.best_score['mae']}
compare_results_fit_time = {f'''SVD_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': 7}

In [161]:
compare_results_RMSE, compare_results_MAE

({'SVD_nf1_nep20': 0.880074625316146}, {'SVD_nf1_nep20': 0.6798870906595543})

For the matrix factorization model (SVD), the best variant among the parameter sets tried was 1 factor and 20 epochs with 6 cross-validation folds.

SVDpp, NMF

- https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.NMF

- https://surprise.readthedocs.io/en/stable/building_custom_algo.html

In [162]:
from surprise import SVDpp, NMF

#### `SVD++`

In [163]:
param_grid = {
              'n_factors': (1,),
              'n_epochs': (10, 20),
              'init_mean': (r_mean,),  # another_r_mean # (df['rating'].mean(), ((df['rating'].max() + df['rating'].min()) / 2)),
              'init_std_dev': (r_std, 1), # (df['rating'].std(), 1),
              'lr_all': (0.005,),
              'reg_all': (0.4,)
              }
gspp = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=6)

gspp.fit(data)

# best RMSE score
print(gspp.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gspp.best_params['rmse'])

1.8254159284465967
{'n_factors': 1, 'n_epochs': 10, 'init_mean': 3.501556983616962, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [164]:
# best MAE score
print(gspp.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gspp.best_params['mae'])

1.498443016383038
{'n_factors': 1, 'n_epochs': 10, 'init_mean': 3.501556983616962, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [165]:
param_grid = {
              'n_factors': (1,),
              'n_epochs': (10,),
              'init_mean': (r_mean,), # (df['rating'].mean(), 2.5),
              'init_std_dev': (r_std,), # (df['rating'].std(),),
              'lr_all': (0.005,),
              'reg_all': (0.4,)
              }
gspp = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=6)

gspp.fit(data)

# best RMSE score
print(gspp.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gspp.best_params['rmse'])

# best MAE score
print(gspp.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gspp.best_params['mae'])

1.825412753039325
{'n_factors': 1, 'n_epochs': 10, 'init_mean': 3.501556983616962, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}
1.4984430163830378
{'n_factors': 1, 'n_epochs': 10, 'init_mean': 3.501556983616962, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [166]:
param_grid = {
              'n_factors': (10,),
              'n_epochs': (30,),
              'init_mean': (r_mean,), # (df['rating'].mean(), 2.5),
              'init_std_dev': (r_std,), # (df['rating'].std(),),
              'lr_all': (0.005,),
              'reg_all': (0.4,)
              }
gspp = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=5)

gspp.fit(data)

# best RMSE score
print(gspp.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gspp.best_params['rmse'])

# best MAE score
print(gspp.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gspp.best_params['mae'])

1.8254183704982334
{'n_factors': 10, 'n_epochs': 30, 'init_mean': 3.501556983616962, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}
1.4984429289038275
{'n_factors': 10, 'n_epochs': 30, 'init_mean': 3.501556983616962, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [167]:
param_grid = {
              'n_factors': (10,),
              'n_epochs': (30,),
              # 'init_mean': (r_mean,), # (df['rating'].mean(), 2.5),
              # 'init_std_dev': (r_std,), # (df['rating'].std(),),
              'lr_all': (0.005,),
              'reg_all': (0.4,)
              }
gspp = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=5)

gspp.fit(data)

# best RMSE score
print(gspp.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gspp.best_params['rmse'])

# best MAE score
print(gspp.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gspp.best_params['mae'])

0.8790485337594897
{'n_factors': 10, 'n_epochs': 30, 'lr_all': 0.005, 'reg_all': 0.4}
0.6789601853742753
{'n_factors': 10, 'n_epochs': 30, 'lr_all': 0.005, 'reg_all': 0.4}


In [168]:
gspp.best_estimator['rmse'].init_mean, gspp.best_estimator['rmse'].init_std_dev

(0, 0.1)

In [169]:
gspp.best_estimator['rmse'].n_factors, gspp.best_estimator['rmse'].n_epochs

(10, 30)

In [170]:
parameters = gspp.best_params['rmse']
compare_results_RMSE.update({f'''SVDpp_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gspp.best_score['rmse']})
parameters = gspp.best_params['mae']
compare_results_MAE.update({f'''SVDpp_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gspp.best_score['mae']})
compare_results_fit_time.update({f'''SVDpp_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': 278 / 2})

In [171]:
compare_results_RMSE, compare_results_MAE

({'SVD_nf1_nep20': 0.880074625316146, 'SVDpp_nf10_nep30': 0.8790485337594897},
 {'SVD_nf1_nep20': 0.6798870906595543, 'SVDpp_nf10_nep30': 0.6789601853742753})

#### `NMF`

- https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.NMF

In [172]:
init_low = min([r[2] for r in data.raw_ratings])
init_high = max([r[2] for r in data.raw_ratings])

In [173]:
param_grid = {
              'n_factors': (1, 5, 15, 20),
              'n_epochs': (10, 20, 30, 50, 75),
              # 'reg_pu': (0.01, 0.06, 0.1),
              # 'reg_qi': (0.01, 0.06, 0.1),
              # 'reg_bu': (0.01, 0.02, 0.03),
              # 'reg_bi': (0.01, 0.02, 0.03),
              # 'lr_bu': (0.005,),
              # 'lr_bi': (0.005,),
              'init_low': (0, init_low) if init_low > 0 else (0,),
              'init_high': (init_high,),
              }
gsNMF= GridSearchCV(NMF, param_grid, measures=['rmse', 'mae'], cv=6)

gsNMF.fit(data)

# best RMSE score
print(gsNMF.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gsNMF.best_params['rmse'])

0.9370779006110879
{'n_factors': 1, 'n_epochs': 75, 'init_low': 0, 'init_high': 5.0}


In [174]:
# model_nmf.pu  # Матриця P (фактори користувача)
# model_nmf.qi  # Матриця Q (фактори товару)
# avg_ratings_item = model_nmf.qi.mean(axis=1)

In [175]:
# best MAE score
print(gsNMF.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gsNMF.best_params['mae'])

0.7447791040327644
{'n_factors': 1, 'n_epochs': 75, 'init_low': 0, 'init_high': 5.0}


In [176]:
param_grid = {
              'n_factors': (15,),
              'n_epochs': (50,),
              # 'reg_pu': (0.1,),
              # 'reg_qi': (0.1,),
              # 'reg_bu': (0.03,),
              # 'reg_bi': (0.03,),
              # 'lr_bu': (0.002,),
              # 'lr_bi': (0.002,)
              }
gsNMF= GridSearchCV(NMF, param_grid, measures=['rmse', 'mae'], cv=6)

gsNMF.fit(data)

# best RMSE score
print(gsNMF.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gsNMF.best_params['rmse'])

# best MAE score
print(gsNMF.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gsNMF.best_params['mae'])

0.919641758678024
{'n_factors': 15, 'n_epochs': 50}
0.7038049160122294
{'n_factors': 15, 'n_epochs': 50}


In [177]:
param_grid = {
              'n_factors': (15,),
              'n_epochs': (50,),
              'reg_pu': (0.1,),
              'reg_qi': (0.1,),
              'reg_bu': (0.03,),
              'reg_bi': (0.03,),
              'lr_bu': (0.002,),
              'lr_bi': (0.002,)
              }
gsNMF= GridSearchCV(NMF, param_grid, measures=['rmse', 'mae'], cv=6)

gsNMF.fit(data)

# best RMSE score
print(gsNMF.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gsNMF.best_params['rmse'])

# best MAE score
print(gsNMF.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gsNMF.best_params['mae'])

0.9046710820375528
{'n_factors': 15, 'n_epochs': 50, 'reg_pu': 0.1, 'reg_qi': 0.1, 'reg_bu': 0.03, 'reg_bi': 0.03, 'lr_bu': 0.002, 'lr_bi': 0.002}
0.6978892341115174
{'n_factors': 15, 'n_epochs': 50, 'reg_pu': 0.1, 'reg_qi': 0.1, 'reg_bu': 0.03, 'reg_bi': 0.03, 'lr_bu': 0.002, 'lr_bi': 0.002}


In [178]:
param_grid = {
              'n_factors': (20,),
              'n_epochs': (75,),
              'init_low': (0,),
              'init_high': (5,),
              'reg_pu': (0.1,),
              'reg_qi': (0.1,),
              'reg_bu': (0.03,),
              'reg_bi': (0.03,),
              'lr_bu': (0.002,),
              'lr_bi': (0.002,)
              }
gsNMF= GridSearchCV(NMF, param_grid, measures=['rmse', 'mae'], cv=6)

gsNMF.fit(data)

# best RMSE score
print(gsNMF.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gsNMF.best_params['rmse'])

# best MAE score
print(gsNMF.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gsNMF.best_params['mae'])

0.8996856094961729
{'n_factors': 20, 'n_epochs': 75, 'init_low': 0, 'init_high': 5, 'reg_pu': 0.1, 'reg_qi': 0.1, 'reg_bu': 0.03, 'reg_bi': 0.03, 'lr_bu': 0.002, 'lr_bi': 0.002}
0.6985667000484241
{'n_factors': 20, 'n_epochs': 75, 'init_low': 0, 'init_high': 5, 'reg_pu': 0.1, 'reg_qi': 0.1, 'reg_bu': 0.03, 'reg_bi': 0.03, 'lr_bu': 0.002, 'lr_bi': 0.002}


In [179]:
parameters = gsNMF.best_params['rmse']
compare_results_RMSE.update({f'''NMF_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gsNMF.best_score['rmse']})
parameters = gsNMF.best_params['mae']
compare_results_MAE.update({f'''NMF_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gsNMF.best_score['mae']})
compare_results_fit_time.update({f'''NMF_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': 9})

In [180]:
compare_results_RMSE, compare_results_MAE, compare_results_fit_time

({'SVD_nf1_nep20': 0.880074625316146,
  'SVDpp_nf10_nep30': 0.8790485337594897,
  'NMF_nf20_nep75': 0.8996856094961729},
 {'SVD_nf1_nep20': 0.6798870906595543,
  'SVDpp_nf10_nep30': 0.6789601853742753,
  'NMF_nf20_nep75': 0.6985667000484241},
 {'SVD_nf1_nep20': 7, 'SVDpp_nf10_nep30': 139.0, 'NMF_nf20_nep75': 9})
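
The three dictionaries above are easier to compare side by side. The sketch below copies the values from the output above and prints them ranked by RMSE; the fit times are the hand-entered figures from the notebook, not measured here.

```python
compare_results_RMSE = {
    "SVD_nf1_nep20": 0.880074625316146,
    "SVDpp_nf10_nep30": 0.8790485337594897,
    "NMF_nf20_nep75": 0.8996856094961729,
}
compare_results_MAE = {
    "SVD_nf1_nep20": 0.6798870906595543,
    "SVDpp_nf10_nep30": 0.6789601853742753,
    "NMF_nf20_nep75": 0.6985667000484241,
}
compare_results_fit_time = {
    "SVD_nf1_nep20": 7,
    "SVDpp_nf10_nep30": 139.0,
    "NMF_nf20_nep75": 9,
}

# Rank model names from lowest (best) to highest RMSE.
ranking = sorted(compare_results_RMSE, key=compare_results_RMSE.get)

print(f"{'model':<18}{'RMSE':>8}{'MAE':>8}{'fit time':>10}")
for name in ranking:
    print(
        f"{name:<18}"
        f"{compare_results_RMSE[name]:>8.4f}"
        f"{compare_results_MAE[name]:>8.4f}"
        f"{compare_results_fit_time[name]:>10}"
    )

best_by_rmse = ranking[0]
best_by_mae = min(compare_results_MAE, key=compare_results_MAE.get)
print(best_by_rmse, best_by_mae)
```

On these numbers SVD++ is marginally ahead of SVD on both metrics, while its recorded fit time is roughly twenty times larger.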

### __PREDICT__

- https://stackoverflow.com/questions/35388647/how-to-use-gridsearchcv-output-for-a-scikit-prediction

In [191]:
algo_SVD.get(best_SVD_by_rmse[:-4]).__dict__

{'n_factors': 1,
 'n_epochs': 20,
 'biased': True,
 'init_mean': 3.501556983616962,
 'init_std_dev': 1.042529239060635,
 'lr_bu': 0.005,
 'lr_bi': 0.005,
 'lr_pu': 0.005,
 'lr_qi': 0.005,
 'reg_bu': 0.02,
 'reg_bi': 0.02,
 'reg_pu': 0.02,
 'reg_qi': 0.02,
 'random_state': None,
 'verbose': False,
 'bsl_options': {},
 'sim_options': {'user_based': True},
 'trainset': <surprise.trainset.Trainset at 0x7dcd6b53fcd0>,
 'bu': array([-1.14033291e-01, -7.09933022e-01,  2.84385920e-02, -1.52528897e-01,
        -5.41109222e-01,  4.54405429e-01, -7.55730131e-01, -4.13610544e-01,
        -6.31708243e-01, -2.81646924e-01, -9.03084693e-02, -4.31061417e-01,
        -1.76238505e-01, -3.80442224e-01, -4.53431424e-01,  3.66356824e-01,
        -4.27117645e-01, -1.02450975e-01, -5.16858870e-01, -2.38364767e-01,
        -2.37816229e-01, -2.99428462e-01,  6.92637758e-02, -1.81386310e-01,
        -7.19374161e-01, -2.24012330e-01, -4.27725877e-01, -4.70041650e-01,
        -1.48772361e-01, -2.96496016e-01,  1.

In [181]:
predictions = algo_SVD.get(best_SVD_by_rmse[:-4]).predict(uid=1, iid=1)
predictions

Prediction(uid=1, iid=1, r_ui=None, est=4.586234359102049, details={'was_impossible': False})

In [182]:
predictions.est

4.586234359102049

In [197]:
algo_SVD_1 = SVD(
                 n_factors=1,
                 n_epochs=20,
                 biased=True,
                 init_mean=3.501556983616962,
                 init_std_dev=1.042529239060635,
                 lr_bu=0.005,
                 lr_bi=0.005,
                 lr_pu=0.005,
                 lr_qi=0.005,
                 reg_bu=0.02,
                 reg_bi=0.02,
                 reg_pu=0.02,
                 reg_qi=0.02,
                 random_state=None,
                 verbose=False
                 )

In [198]:
cross_validate(algo_SVD_1, data, measures=['RMSE', 'MAE'], cv=6, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 6 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Mean    Std     
RMSE (testset)    0.8859  0.8872  0.8887  0.8853  0.8975  0.8790  0.8873  0.0055  
MAE (testset)     0.6817  0.6800  0.6810  0.6811  0.6919  0.6747  0.6817  0.0051  
Fit time          0.62    0.90    0.83    0.68    0.67    0.67    0.73    0.10    
Test time         0.20    0.20    0.10    0.11    0.11    0.09    0.13    0.04    


{'test_rmse': array([0.88589328, 0.88715723, 0.88867932, 0.88528644, 0.89749134,
        0.87899553]),
 'test_mae': array([0.68173055, 0.67995597, 0.68096466, 0.68113494, 0.69185808,
        0.67474374]),
 'fit_time': (0.6248815059661865,
  0.8994846343994141,
  0.8292737007141113,
  0.6796760559082031,
  0.6687231063842773,
  0.6720678806304932),
 'test_time': (0.19831132888793945,
  0.19502973556518555,
  0.10313272476196289,
  0.10672783851623535,
  0.11003613471984863,
  0.09494662284851074)}

In [199]:
predictions1 = algo_SVD_1.predict(uid=1, iid=1)
predictions1

Prediction(uid=1, iid=1, r_ui=None, est=4.478592493373982, details={'was_impossible': False})

In [192]:
gs.best_estimator['rmse'].__dict__

{'n_factors': 1,
 'n_epochs': 20,
 'biased': True,
 'init_mean': 2.75,
 'init_std_dev': 0.2,
 'lr_bu': 0.005,
 'lr_bi': 0.005,
 'lr_pu': 0.005,
 'lr_qi': 0.005,
 'reg_bu': 0.4,
 'reg_bi': 0.4,
 'reg_pu': 0.4,
 'reg_qi': 0.4,
 'random_state': None,
 'verbose': False,
 'bsl_options': {},
 'sim_options': {'user_based': True}}

In [205]:
algo_SVD_2 = SVD(
                 n_factors=1,
                 n_epochs=20,
                 biased=True,
                 init_mean=2.75,
                 init_std_dev=0.2,
                 lr_bu=0.005,
                 lr_bi=0.005,
                 lr_pu=0.005,
                 lr_qi=0.005,
                 reg_bu=0.4,
                 reg_bi=0.4,
                 reg_pu=0.4,
                 reg_qi=0.4,
                 random_state=None,
                 verbose=False
                 )
cross_validate(algo_SVD_2, data, measures=['RMSE', 'MAE'], cv=6, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 6 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Mean    Std     
RMSE (testset)    0.8802  0.8701  0.8850  0.8805  0.8817  0.8803  0.8796  0.0046  
MAE (testset)     0.6799  0.6751  0.6832  0.6782  0.6782  0.6821  0.6794  0.0027  
Fit time          0.67    0.69    0.69    0.99    0.96    0.76    0.79    0.13    
Test time         0.14    0.10    0.14    0.21    0.29    0.90    0.30    0.28    


{'test_rmse': array([0.88019364, 0.87011156, 0.8850459 , 0.88045319, 0.88170053,
        0.88028071]),
 'test_mae': array([0.6798545 , 0.67510349, 0.68315716, 0.67821597, 0.67815387,
        0.68209943]),
 'fit_time': (0.6721510887145996,
  0.6876275539398193,
  0.6892197132110596,
  0.9852011203765869,
  0.9573357105255127,
  0.7594242095947266),
 'test_time': (0.14191532135009766,
  0.10269308090209961,
  0.13662028312683105,
  0.21170258522033691,
  0.2929048538208008,
  0.9005699157714844)}

In [203]:
predictions2 = algo_SVD_2.predict(uid=1, iid=1)
predictions2

Prediction(uid=1, iid=1, r_ui=None, est=4.254468434837564, details={'was_impossible': False})

In [204]:
gspp.best_estimator['rmse'].__dict__

{'n_factors': 10,
 'n_epochs': 30,
 'init_mean': 0,
 'init_std_dev': 0.1,
 'lr_bu': 0.005,
 'lr_bi': 0.005,
 'lr_pu': 0.005,
 'lr_qi': 0.005,
 'lr_yj': 0.005,
 'reg_bu': 0.4,
 'reg_bi': 0.4,
 'reg_pu': 0.4,
 'reg_qi': 0.4,
 'reg_yj': 0.4,
 'random_state': None,
 'verbose': False,
 'cache_ratings': False,
 'bsl_options': {},
 'sim_options': {'user_based': True}}

In [206]:
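# NOTE: despite the variable name, this instantiates plain `SVD` (not `SVDpp`)
# with the hyperparameters found for SVD++, so the output below is for SVD.
algo_SVDpp = SVD(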
                 n_factors=10,
                 n_epochs=30,
                 init_mean=0,
                 init_std_dev=0.1,
                 lr_bu=0.005,
                 lr_bi=0.005,
                 lr_pu=0.005,
                 lr_qi=0.005,
                 reg_bu=0.4,
                 reg_bi=0.4,
                 reg_pu=0.4,
                 reg_qi=0.4,
                 random_state=None,
                 verbose=False
                 )
cross_validate(algo_SVDpp, data, measures=['RMSE', 'MAE'], cv=6, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 6 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Mean    Std     
RMSE (testset)    0.8782  0.8766  0.8885  0.8713  0.8731  0.8815  0.8782  0.0057  
MAE (testset)     0.6787  0.6765  0.6887  0.6724  0.6718  0.6801  0.6780  0.0056  
Fit time          1.29    1.08    1.12    1.10    1.13    1.08    1.13    0.07    
Test time         0.10    0.10    0.12    0.10    0.10    0.10    0.10    0.01    


{'test_rmse': array([0.87819979, 0.8765685 , 0.88848631, 0.87128222, 0.87311453,
        0.88153739]),
 'test_mae': array([0.67871874, 0.67648718, 0.68866662, 0.67239871, 0.67176171,
        0.68006918]),
 'fit_time': (1.2944934368133545,
  1.0768423080444336,
  1.1183114051818848,
  1.0992112159729004,
  1.1301133632659912,
  1.0774996280670166),
 'test_time': (0.09768319129943848,
  0.10087227821350098,
  0.12119078636169434,
  0.0973820686340332,
  0.0962216854095459,
  0.09785985946655273)}

In [207]:
predictions3 = algo_SVDpp.predict(uid=1, iid=1)
predictions3

Prediction(uid=1, iid=1, r_ui=None, est=4.308400650270771, details={'was_impossible': False})

In [208]:
gsNMF.best_estimator['rmse'].__dict__

{'n_factors': 20,
 'n_epochs': 75,
 'biased': False,
 'reg_pu': 0.1,
 'reg_qi': 0.1,
 'lr_bu': 0.002,
 'lr_bi': 0.002,
 'reg_bu': 0.03,
 'reg_bi': 0.03,
 'init_low': 0,
 'init_high': 5,
 'random_state': None,
 'verbose': False,
 'bsl_options': {},
 'sim_options': {'user_based': True}}

In [210]:
algo_NMF = NMF(
               n_factors=20,
               n_epochs=75,
               biased=False,
               reg_pu=0.1,
               reg_qi=0.1,
               lr_bu=0.002,
               lr_bi=0.002,
               reg_bu=0.03,
               reg_bi=0.03,
               init_low=0,
               init_high=5,
               random_state=None,
               verbose=False
               )
cross_validate(algo_NMF, data, measures=['RMSE', 'MAE'], cv=6, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 6 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Mean    Std     
RMSE (testset)    0.8935  0.9047  0.9042  0.9007  0.9051  0.9084  0.9028  0.0047  
MAE (testset)     0.6912  0.7017  0.7019  0.6998  0.7059  0.7079  0.7014  0.0053  
Fit time          5.10    5.36    4.67    5.07    4.89    4.58    4.94    0.27    
Test time         0.09    0.10    0.73    0.18    0.09    0.09    0.21    0.23    


{'test_rmse': array([0.89352855, 0.90474124, 0.90417976, 0.90065999, 0.90513334,
        0.90837804]),
 'test_mae': array([0.69118407, 0.70166846, 0.70187377, 0.69981521, 0.70591347,
        0.70790093]),
 'fit_time': (5.09546160697937,
  5.364324569702148,
  4.671360969543457,
  5.074461460113525,
  4.8859803676605225,
  4.576553583145142),
 'test_time': (0.08793473243713379,
  0.09647607803344727,
  0.7258057594299316,
  0.17620611190795898,
  0.09053301811218262,
  0.08816933631896973)}

In [211]:
predictions4 = algo_NMF.predict(uid=1, iid=1)
predictions4

Prediction(uid=1, iid=1, r_ui=None, est=4.544657600970317, details={'was_impossible': False})