# __Recommender systems__
### _!_ (_in `GoogleColab`_)
This notebook uses the `surprise` library (`scikit-surprise`), a scikit with a `scikit-learn`-like API for training recommender system models.
<br><br>
Example: the `movielens` dataset and a matrix factorization model. In this library the algorithm is called `SVD`.
<br>
The best parameters are selected via cross-validation, and other algorithms (`SVD++`, `NMF`) are tried as well.
<br><br>
How exactly to build this model is described in the library's documentation.

!pip install matplotlib

In [28]:
!pip install surprise



In [29]:
import os
from pathlib import Path
import random
import shutil

from google.colab import files
import numpy as np
import pandas as pd

In [30]:
def set_seed(seed_value: int) -> None:
    """Set a random state for repeatability of results."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    # tf.random.set_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    os.environ['TF_DETERMINISTIC_OPS'] = 'true'


set_seed(1)

### __`OBTAIN` & `SCRUB`__ + __`EXPLORE`__ (DATASET)

- links.csv - movieId, imdbId, tmdbId
- movies.csv - movieId, title, genres
- ratings.csv - userId, movieId, rating, timestamp
- tags.csv - userId, movieId, tag, timestamp

In [59]:
# data = Dataset.load_builtin("ml-100k")

In [31]:
def read_from_csvfile(file: Path) -> pd.DataFrame:
    """Read content from csv-file and return dataframe from content."""
    df = pd.read_csv(file)

    return df

In [32]:
# Create new folder
new_folder = 'ml'

if os.path.isdir(new_folder):
  shutil.rmtree(new_folder)

os.mkdir(new_folder)

# Upload Files to GoogleColab
uploaded = files.upload()
for filename in uploaded.keys():
  dst_path = os.path.join(new_folder, filename)
  print(f'move {filename} to {dst_path}')
  shutil.move(filename, dst_path)

Saving links.csv to links.csv
Saving movie_ids.txt to movie_ids.txt
Saving movies.csv to movies.csv
Saving movies.mat to movies.mat
Saving ratings.csv to ratings.csv
Saving tags.csv to tags.csv
move links.csv to ml/links.csv
move movie_ids.txt to ml/movie_ids.txt
move movies.csv to ml/movies.csv
move movies.mat to ml/movies.mat
move ratings.csv to ml/ratings.csv
move tags.csv to ml/tags.csv


In [33]:
links = read_from_csvfile('/content/ml/links.csv')
movies = read_from_csvfile('/content/ml/movies.csv')
ratings = read_from_csvfile('/content/ml/ratings.csv')
tags = read_from_csvfile('/content/ml/tags.csv')

In [34]:
links.tail(3)

Unnamed: 0,movieId,imdbId,tmdbId
9739,193585,6397426,479308.0
9740,193587,8391976,483455.0
9741,193609,101726,37891.0


In [35]:
movies.tail(3)

Unnamed: 0,movieId,title,genres
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy


In [36]:
ratings.tail(3)

Unnamed: 0,userId,movieId,rating,timestamp
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352
100835,610,170875,3.0,1493846415


In [37]:
tags.tail(3)

Unnamed: 0,userId,movieId,tag,timestamp
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978
3682,610,168248,Heroic Bloodshed,1493844270


- `To start, we use only the ratings dataset, for the simplest possible system.`

### __`MODEL`__ &  __`Training`__

- matrix factorization model `SVD`

- https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD
- https://www.piwheels.org/project/scikit-surprise/
- https://pypi.org/project/scikit-surprise/
- https://github.com/NicolasHug/Surprise
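
As a rough illustration of what `SVD` estimates, the sketch below evaluates the biased matrix factorization prediction rule described in the Surprise documentation, `r_hat(u, i) = mu + b_u + b_i + dot(q_i, p_u)`, for a single user-item pair. The function name `predict_rating` and all numeric values are made up for illustration; they do not come from a trained model.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i):
    """Biased matrix factorization estimate: mu + b_u + b_i + dot(q_i, p_u)."""
    dot = sum(q * p for q, p in zip(q_i, p_u))
    return mu + b_u + b_i + dot


# Hypothetical values: global mean, user bias, item bias, two latent factors.
est = predict_rating(mu=3.5, b_u=0.2, b_i=-0.1, p_u=[0.5, -0.3], q_i=[0.4, 0.1])
print(round(est, 2))  # 3.77
```

With `n_factors=1`, as in some of the runs below, `p_u` and `q_i` each hold a single number, so the model is close to a pure bias model with one interaction term.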

In [38]:
# from surprise.prediction_algorithms.matrix_factorization import SVD  # poetry add surprise  # poetry add scikit-surprise==1.1.3 --use-pep517

- https://surprise.readthedocs.io/en/stable/getting_started.html#cross-validate-example
- https://surprise.readthedocs.io/en/stable/model_selection.html
- https://surprise.readthedocs.io/en/stable/getting_started.html?highlight=Dataset#use-a-custom-dataset


In [39]:
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate

Column names are irrelevant, but specify only one rating per line, and each line needs to respect the following structure:<br> `user ; item ; rating ; [timestamp]`

In [40]:
df = ratings[['userId', 'movieId', 'rating']]

In [41]:
df.tail(3)

Unnamed: 0,userId,movieId,rating
100833,610,168250,5.0
100834,610,168252,5.0
100835,610,170875,3.0


In [42]:
df['rating'].max(), df['rating'].min(), df['rating'].mean(), df['rating'].std()

(5.0, 0.5, 3.501556983616962, 1.042529239060635)

The minimum observed rating is 0.5, so the lower bound of the rating scale is set to 0 (the tighter `(0.5, 5)` would also match the data):

- https://surprise.readthedocs.io/en/stable/reader.html?highlight=Reader#surprise.reader.Reader
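
The scale passed to `Reader` matters mostly because Surprise clips estimates to it by default (`predict(..., clip=True)`). A minimal sketch of that clipping, using a hypothetical helper `clip_to_scale` and made-up raw estimates, shows the practical difference between a lower bound of 0 and one of 0.5:

```python
def clip_to_scale(est, rating_scale):
    """Clamp a raw estimate into the inclusive (low, high) rating scale."""
    low, high = rating_scale
    return min(high, max(low, est))


raw_estimate = 0.2  # hypothetical raw model output below the lowest real rating
print(clip_to_scale(raw_estimate, (0, 5)))    # 0.2 (left unchanged)
print(clip_to_scale(raw_estimate, (0.5, 5)))  # 0.5 (raised to the lower bound)
print(clip_to_scale(5.7, (0, 5)))             # 5 (capped at the upper bound)
```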

In [43]:
# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(0, 5))
reader

<surprise.reader.Reader at 0x790c506f8f70>

In [44]:
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
data

<surprise.dataset.DatasetAutoFolds at 0x790c506f9db0>

In [None]:
# from surprise.model_selection import train_test_split
# trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
# model = SVD()
# model.fit(trainset)
# # Predict ratings
# test_predictions = model.test(testset)
# predictions = model.predict(uid=1, iid=1)

In [45]:
algo = SVD()

In [50]:
algo

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c506f8c10>

In [46]:
algo2 = SVD(
            n_factors=1,
            n_epochs=10,
            biased=True,
            init_mean=2.5,
            init_std_dev=0.5,
            lr_all=0.005,
            reg_all=0.02,
            lr_bu=None,
            lr_bi=None,
            lr_pu=None,
            lr_qi=None,
            reg_bu=None,
            reg_bi=None,
            reg_pu=None,
            reg_qi=None,
            random_state=None,
            verbose=False
            )

In [47]:
# cross_validate(
#                algo,
#                data,
#                measures=['rmse', 'mae'],
#                cv=None,
#                return_train_measures=False,
#                n_jobs=1,
#                pre_dispatch='2*n_jobs',
#                verbose=False
#                )

In [48]:
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=6, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 6 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Mean    Std     
RMSE (testset)    0.8750  0.8633  0.8713  0.8784  0.8712  0.8753  0.8724  0.0048  
MAE (testset)     0.6706  0.6639  0.6692  0.6777  0.6682  0.6716  0.6702  0.0042  
Fit time          1.71    1.70    1.97    4.61    3.49    2.49    2.66    1.06    
Test time         0.24    0.12    0.20    0.23    0.21    0.10    0.18    0.05    


{'test_rmse': array([0.87495113, 0.86327261, 0.87125727, 0.87841512, 0.87122477,
        0.87529899]),
 'test_mae': array([0.67061167, 0.66393946, 0.66918979, 0.67774807, 0.66815439,
        0.67161136]),
 'fit_time': (1.7121727466583252,
  1.702911615371704,
  1.974195957183838,
  4.609205007553101,
  3.485034942626953,
  2.4914538860321045),
 'test_time': (0.23586225509643555,
  0.11668515205383301,
  0.19858407974243164,
  0.2310924530029297,
  0.20995044708251953,
  0.10158228874206543)}
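
For reference, the two reported measures can be computed by hand. This is a minimal sketch of RMSE and MAE over (true rating, estimated rating) pairs; the function names and the sample ratings are illustrative assumptions, not output of the model above.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


sample = [(5.0, 4.5), (3.0, 3.5), (4.0, 2.0)]  # made-up ratings
print(round(rmse(sample), 4))  # 1.2247
print(round(mae(sample), 4))   # 1.0
```

RMSE squares each error before averaging, so the single large miss in the sample pulls it above MAE. The same effect explains why RMSE is higher than MAE in every fold above.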

In [49]:
# Run 6-fold cross-validation and print results.
cross_validate(algo2, data, measures=['RMSE', 'MAE'], cv=6, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 6 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Mean    Std     
RMSE (testset)    0.9020  0.8982  0.9062  0.8888  0.8889  0.9032  0.8979  0.0068  
MAE (testset)     0.6943  0.6900  0.6980  0.6887  0.6845  0.6934  0.6915  0.0043  
Fit time          0.37    0.54    0.54    0.40    0.40    0.38    0.44    0.07    
Test time         0.16    0.33    0.18    0.22    0.11    0.23    0.21    0.07    


{'test_rmse': array([0.90204838, 0.89815799, 0.90622792, 0.88876659, 0.88886203,
        0.90323361]),
 'test_mae': array([0.69427831, 0.68998837, 0.69803931, 0.68865522, 0.68452513,
        0.6934053 ]),
 'fit_time': (0.3678905963897705,
  0.5401310920715332,
  0.5371837615966797,
  0.4002864360809326,
  0.3972444534301758,
  0.37680506706237793),
 'test_time': (0.15778684616088867,
  0.3342294692993164,
  0.17870593070983887,
  0.21947240829467773,
  0.11394381523132324,
  0.23300886154174805)}

- Selecting the best parameters via cross-validation
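
To make the `cv` argument concrete, here is a small sketch of how k-fold cross-validation partitions the ratings: each rating lands in exactly one test fold, and the model is trained on the remaining folds. This is a simplified stand-in written for illustration (the helper `kfold_indices` is hypothetical), not Surprise's own `KFold` implementation.

```python
import random


def kfold_indices(n_samples, n_splits, seed=1):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop


folds = list(kfold_indices(n_samples=10, n_splits=3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))  # 6 4, then 7 3, then 7 3
```

With the 100,836 ratings loaded here and `cv=6`, each test fold would hold 16,806 ratings. A larger `cv` leaves more data for training in every split, which is worth keeping in mind when scores from different `cv` values are compared.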


model set:

In [25]:
algo_SVD = {
            f'SVD(nfc{nfs}, neph{eph})':
            SVD(
                n_factors=nfs,
                n_epochs=eph,
                init_mean=df['rating'].mean(), # (0+5)/2 ...
                init_std_dev=df['rating'].std()
                )
            for nfs in (1, 5, 10, 20, 40)
            for eph in (5, 10, 20)
           }

In [72]:
algo_SVD

{'SVD(nfc1, neph5)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c508cfeb0>,
 'SVD(nfc1, neph10)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c508cf310>,
 'SVD(nfc1, neph20)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c508cf520>,
 'SVD(nfc5, neph5)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c508ce470>,
 'SVD(nfc5, neph10)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c508ce710>,
 'SVD(nfc5, neph20)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c508cf850>,
 'SVD(nfc10, neph5)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c508cfdc0>,
 'SVD(nfc10, neph10)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c508cf910>,
 'SVD(nfc10, neph20)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c508cf820>,
 'SVD(nfc20, neph5)': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x790c508cf700>,
 'SVD(nfc2

model results by cross validation:

In [52]:
assessment_SVD = {
                  f'{name}_cv{cv}':
                  cross_validate(alg, data, measures=['RMSE', 'MAE'], cv=cv, verbose=True)
                  for name, alg in algo_SVD.items() for cv in (3, 5, 6)
                  }

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9521  0.9475  0.9515  0.9504  0.0021  
MAE (testset)     0.7389  0.7395  0.7379  0.7388  0.0007  
Fit time          0.16    0.27    0.28    0.24    0.05    
Test time         0.28    0.29    0.46    0.35    0.08    
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9396  0.9313  0.9329  0.9386  0.9479  0.9380  0.0059  
MAE (testset)     0.7262  0.7220  0.7225  0.7278  0.7325  0.7262  0.0038  
Fit time          0.32    0.33    0.33    0.35    0.32    0.33    0.01    
Test time         0.35    0.15    0.34    0.14    0.37    0.27    0.10    
Evaluating RMSE, MAE of algorithm SVD on 6 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Mean    Std     
RMSE (testset)    0.9287  0.9301  0.9285  0.9328  0.9451  0.9357  0.9335  0.0058  
MA

In [53]:
# assessment_SVD

In [54]:
# best_SVD_by_rmse = min(assessment_SVD, key=lambda x: x['test_rmse'].mean())
all_SVD_by_rmse = {
                   name: value['test_rmse'].mean()
                   for name, value in assessment_SVD.items()
                   }
best_SVD_by_rmse = min(all_SVD_by_rmse, key=all_SVD_by_rmse.get)

In [55]:
best_SVD_by_rmse

'SVD(nfc1, neph20)_cv6'

In [56]:
# best_SVD_by_mae = min(assessment_SVD, key=lambda x:x['test_mae'].mean())
all_SVD_by_mae = {
                  name: value['test_mae'].mean()
                  for name, value in assessment_SVD.items()
                  }
best_SVD_by_mae = min(all_SVD_by_mae, key=all_SVD_by_mae.get)

In [57]:
best_SVD_by_mae

'SVD(nfc1, neph20)_cv6'

For the matrix factorization model (SVD), the best variant among the tested parameter sets has 1 factor and 20 epochs, evaluated with 6-fold cross-validation.

In [122]:
predictions = algo_SVD.get(best_SVD_by_rmse[:-4]).predict(uid=1, iid=1)
predictions

Prediction(uid=1, iid=1, r_ui=None, est=4.586234359102049, details={'was_impossible': False})
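
The slice `[:-4]` only works while the `_cv…` suffix is exactly four characters long, that is, for a single-digit fold count. A slightly more robust sketch splits on the suffix instead; the helper name `split_result_key` is an assumption made for this example.

```python
def split_result_key(key):
    """Split a key like 'SVD(nfc1, neph20)_cv6' into (model name, fold count)."""
    model_name, cv = key.rsplit('_cv', 1)
    return model_name, int(cv)


print(split_result_key('SVD(nfc1, neph20)_cv6'))   # ('SVD(nfc1, neph20)', 6)
print(split_result_key('SVD(nfc1, neph20)_cv10'))  # ('SVD(nfc1, neph20)', 10)
```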

Now let's also take some other SVD hyperparameters into account.

In [58]:
from surprise.model_selection import GridSearchCV

In [60]:
param_grid = {
              'n_factors': (1, 5, 10, 20, 40),
              'n_epochs': (5, 10, 20),
              'init_mean': (df['rating'].mean(), ((df['rating'].max() + df['rating'].min()) / 2)),
              'init_std_dev': (df['rating'].std(), 1),
              'lr_all': (0.002, 0.005),
              'reg_all': (0.4, 0.6)
              }
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=6)

gs.fit(data)


0.8806090211823191
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [99]:
# best MAE score
print(gs.best_score['mae'].round(4))

# combination of parameters that gave the best MAE score
print(gs.best_params['mae'])

0.6807
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [68]:
# best RMSE score
print(gs.best_score['rmse'].round(4))

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.8806
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [113]:
# the_best_svd = gs.best_estimator['rmse']
# the_best_svd.__dict__

In [114]:
# the_best_svd = gs.best_estimator['mae']
# the_best_svd.__dict__

0.8806<br>
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 1, 'lr_all': 0.005, 'reg_all': 0.4}<br>
(22+ min of processing, ~240 models)
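
The model count follows directly from the grid: `GridSearchCV` tries every combination of the listed values, and each combination is fitted once per fold. A quick sketch of that arithmetic for the first SVD grid, with placeholder strings standing in for the data-dependent values:

```python
from itertools import product

grid = {
    'n_factors': (1, 5, 10, 20, 40),
    'n_epochs': (5, 10, 20),
    'init_mean': ('mean', 'midpoint'),  # placeholders for the two values used
    'init_std_dev': ('std', 1),         # 'std' is a placeholder as well
    'lr_all': (0.002, 0.005),
    'reg_all': (0.4, 0.6),
}
combinations = list(product(*grid.values()))
n_folds = 6

print(len(combinations))            # 240 parameter combinations
print(len(combinations) * n_folds)  # 1440 individual fits
```

Every extra value in any single list multiplies the total, so trimming the grid is usually the cheapest way to cut the run time.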

In [130]:
param_grid = {
              'n_factors': (1, ),
              'n_epochs': (20, ),
              'init_mean': (2.75,),
              'init_std_dev': (0.2,),
              'lr_all': (0.005,),
              'reg_all': (0.4, )
              }
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=6)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'].round(4))

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

# best MAE score
print(gs.best_score['mae'].round(4))

# combination of parameters that gave the best MAE score
print(gs.best_params['mae'])

0.8786
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 0.2, 'lr_all': 0.005, 'reg_all': 0.4}
0.6788
{'n_factors': 1, 'n_epochs': 20, 'init_mean': 2.75, 'init_std_dev': 0.2, 'lr_all': 0.005, 'reg_all': 0.4}


In [131]:
parameters = gs.best_params['rmse']
compare_results_RMSE = {f'''SVD_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gs.best_score['rmse']}
parameters = gs.best_params['mae']
compare_results_MAE = {f'''SVD_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gs.best_score['mae']}
compare_results_fit_time = {f'''SVD_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': 7}

In [132]:
compare_results_RMSE, compare_results_MAE

({'SVD_nf1_nep20': 0.8786474219586929}, {'SVD_nf1_nep20': 0.6788400349414734})

For the matrix factorization model (SVD), the best variant among the tested parameter sets has 1 factor and 20 epochs, with 6-fold cross-validation.

SVDpp, NMF

In [133]:
from surprise import SVDpp, NMF

In [135]:
param_grid = {
              'n_factors': (1, ),
              'n_epochs': (10, 20),
              'init_mean': (df['rating'].mean(), ((df['rating'].max() + df['rating'].min()) / 2)),
              'init_std_dev': (df['rating'].std(),),
              'lr_all': (0.005,),
              'reg_all': (0.4,)
              }
gspp = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=6)

gspp.fit(data)

# best RMSE score
print(gspp.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gspp.best_params['rmse'])

1.8254204586130063
{'n_factors': 1, 'n_epochs': 10, 'init_mean': 3.501556983616962, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [136]:
# best MAE score
print(gspp.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gspp.best_params['mae'])

1.4984430163830378
{'n_factors': 1, 'n_epochs': 10, 'init_mean': 3.501556983616962, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


For the SVD++ model, the best variant among the tested parameter sets has 1 factor and 10 epochs, with 6-fold cross-validation.

46+ min of processing for 32 models on a single fold: gave up waiting. 28+ min for 8 models.
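
One plausible reason for the long run time is the shape of the SVD++ prediction rule, which adds an implicit-feedback term summed over every item the user has rated: `r_hat(u, i) = mu + b_u + b_i + dot(q_i, p_u + |I_u|^(-1/2) * sum(y_j for j in I_u))`. The sketch below evaluates that rule in plain Python; all names and numbers are illustrative assumptions, not Surprise internals.

```python
import math


def predict_svdpp(mu, b_u, b_i, p_u, q_i, y_rated):
    """SVD++ estimate; y_rated holds the implicit factor vector y_j of each rated item."""
    n_factors = len(p_u)
    norm = 1 / math.sqrt(len(y_rated)) if y_rated else 0.0
    # Sum over all items the user has rated: this loop is the extra cost versus SVD.
    implicit = [norm * sum(y[f] for y in y_rated) for f in range(n_factors)]
    user_vec = [p + imp for p, imp in zip(p_u, implicit)]
    return mu + b_u + b_i + sum(q * u for q, u in zip(q_i, user_vec))


est = predict_svdpp(
    mu=3.5, b_u=0.2, b_i=-0.1,
    p_u=[0.5], q_i=[0.4],
    y_rated=[[0.1], [0.3], [0.2], [0.2]],  # four rated items, one factor each
)
print(round(est, 2))  # 3.96
```

Because the implicit term touches every item a user has rated, the work per rating grows with the size of that user's history, whereas plain `SVD` does a fixed amount of work per rating.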

In [137]:
param_grid = {
              'n_factors': (1,),
              'n_epochs': (10, ),
              'init_mean': (df['rating'].mean(), 2.5,),
              'init_std_dev': (df['rating'].std(),),
              'lr_all': (0.005,),
              'reg_all': (0.4,)
              }
gspp = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=6)

gspp.fit(data)

# best RMSE score
print(gspp.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gspp.best_params['rmse'])

# best MAE score
print(gspp.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gspp.best_params['mae'])

1.8254027708295515
{'n_factors': 1, 'n_epochs': 10, 'init_mean': 3.501556983616962, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}
1.4984430163830378
{'n_factors': 1, 'n_epochs': 10, 'init_mean': 3.501556983616962, 'init_std_dev': 1.042529239060635, 'lr_all': 0.005, 'reg_all': 0.4}


In [138]:
parameters = gspp.best_params['rmse']
compare_results_RMSE.update({f'''SVDpp_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gspp.best_score['rmse']})
parameters = gspp.best_params['mae']
compare_results_MAE.update({f'''SVDpp_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gspp.best_score['mae']})
compare_results_fit_time.update({f'''SVDpp_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': 278 / 2})

In [139]:
compare_results_RMSE, compare_results_MAE

({'SVD_nf1_nep20': 0.8786474219586929, 'SVDpp_nf1_nep10': 1.8254027708295515},
 {'SVD_nf1_nep20': 0.6788400349414734, 'SVDpp_nf1_nep10': 1.4984430163830378})

In [140]:
param_grid = {
              'n_factors': (1, 5),
              'n_epochs': (10, 20),
              'reg_pu': (0.01, 0.06, 0.1),
              'reg_qi': (0.01, 0.06, 0.1),
              'reg_bu': (0.01, 0.02, 0.03),
              'reg_bi': (0.01, 0.02, 0.03),
              'lr_bu': (0.005,),
              'lr_bi': (0.005,)
              }
gsNMF= GridSearchCV(NMF, param_grid, measures=['rmse', 'mae'], cv=6)

gsNMF.fit(data)

# best RMSE score
print(gsNMF.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gsNMF.best_params['rmse'])

1.1699977606580931
{'n_factors': 5, 'n_epochs': 20, 'reg_pu': 0.1, 'reg_qi': 0.1, 'reg_bu': 0.03, 'reg_bi': 0.02, 'lr_bu': 0.005, 'lr_bi': 0.005}


In [141]:
# best MAE score
print(gsNMF.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gsNMF.best_params['mae'])

0.9856609347621211
{'n_factors': 5, 'n_epochs': 20, 'reg_pu': 0.1, 'reg_qi': 0.1, 'reg_bu': 0.03, 'reg_bi': 0.02, 'lr_bu': 0.005, 'lr_bi': 0.005}


For the NMF model, the best variant among the tested parameter sets has 5 factors and 20 epochs, with 6-fold cross-validation.

(27+ min of processing, 324 models)

In [146]:
param_grid = {
              'n_factors': (15,),
              'n_epochs': (50,),
              # 'reg_pu': (0.1,),
              # 'reg_qi': (0.1,),
              # 'reg_bu': (0.03,),
              # 'reg_bi': (0.03,),
              # 'lr_bu': (0.002,),
              # 'lr_bi': (0.002,)
              }
gsNMF= GridSearchCV(NMF, param_grid, measures=['rmse', 'mae'], cv=6)

gsNMF.fit(data)

# best RMSE score
print(gsNMF.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gsNMF.best_params['rmse'])

# best MAE score
print(gsNMF.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gsNMF.best_params['mae'])

0.9210037531861538
{'n_factors': 15, 'n_epochs': 50}
0.704752999211338
{'n_factors': 15, 'n_epochs': 50}


In [147]:
param_grid = {
              'n_factors': (15,),
              'n_epochs': (50,),
              'reg_pu': (0.1,),
              'reg_qi': (0.1,),
              'reg_bu': (0.03,),
              'reg_bi': (0.03,),
              'lr_bu': (0.002,),
              'lr_bi': (0.002,)
              }
gsNMF= GridSearchCV(NMF, param_grid, measures=['rmse', 'mae'], cv=6)

gsNMF.fit(data)

# best RMSE score
print(gsNMF.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gsNMF.best_params['rmse'])

# best MAE score
print(gsNMF.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gsNMF.best_params['mae'])

0.9047003794765264
{'n_factors': 15, 'n_epochs': 50, 'reg_pu': 0.1, 'reg_qi': 0.1, 'reg_bu': 0.03, 'reg_bi': 0.03, 'lr_bu': 0.002, 'lr_bi': 0.002}
0.6982012171453905
{'n_factors': 15, 'n_epochs': 50, 'reg_pu': 0.1, 'reg_qi': 0.1, 'reg_bu': 0.03, 'reg_bi': 0.03, 'lr_bu': 0.002, 'lr_bi': 0.002}


In [153]:
param_grid = {
              'n_factors': (20,),
              'n_epochs': (75,),
              'init_low': (0,),
              'init_high': (5,),
              'reg_pu': (0.1,),
              'reg_qi': (0.1,),
              'reg_bu': (0.03,),
              'reg_bi': (0.03,),
              'lr_bu': (0.002,),
              'lr_bi': (0.002,)
              }
gsNMF= GridSearchCV(NMF, param_grid, measures=['rmse', 'mae'], cv=6)

gsNMF.fit(data)

# best RMSE score
print(gsNMF.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gsNMF.best_params['rmse'])

# best MAE score
print(gsNMF.best_score['mae'])

# combination of parameters that gave the best MAE score
print(gsNMF.best_params['mae'])

0.9019063605831592
{'n_factors': 20, 'n_epochs': 75, 'init_low': 0, 'init_high': 5, 'reg_pu': 0.1, 'reg_qi': 0.1, 'reg_bu': 0.03, 'reg_bi': 0.03, 'lr_bu': 0.002, 'lr_bi': 0.002}
0.7001125031843363
{'n_factors': 20, 'n_epochs': 75, 'init_low': 0, 'init_high': 5, 'reg_pu': 0.1, 'reg_qi': 0.1, 'reg_bu': 0.03, 'reg_bi': 0.03, 'lr_bu': 0.002, 'lr_bi': 0.002}


In [143]:
parameters = gsNMF.best_params['rmse']
compare_results_RMSE.update({f'''NMF_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gsNMF.best_score['rmse']})
parameters = gsNMF.best_params['mae']
compare_results_MAE.update({f'''NMF_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': gsNMF.best_score['mae']})
compare_results_fit_time.update({f'''NMF_nf{parameters.get('n_factors')}_nep{parameters.get('n_epochs')}''': 9})

In [144]:
compare_results_RMSE, compare_results_MAE, compare_results_fit_time

({'SVD_nf1_nep20': 0.8786474219586929,
  'SVDpp_nf1_nep10': 1.8254027708295515,
  'NMF_nf5_nep20': 1.1718055094848485},
 {'SVD_nf1_nep20': 0.6788400349414734,
  'SVDpp_nf1_nep10': 1.4984430163830378,
  'NMF_nf5_nep20': 0.9867587361073876},
 {'SVD_nf1_nep20': 7, 'SVDpp_nf1_nep10': 139.0, 'NMF_nf5_nep20': 9})

Manually adjusting the bounds of the parameter grid shows that, for the parameters tried, SVD has the advantage in both accuracy and training speed. SVD++ turned out to be far too slow, and it scored worse with 20 epochs than with 10.
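
The comparison can be made explicit by ranking the collected scores. This sketch reuses the values printed above for `compare_results_RMSE`, `compare_results_MAE`, and `compare_results_fit_time`; the table layout is just one possible presentation.

```python
compare_results_RMSE = {
    'SVD_nf1_nep20': 0.8786474219586929,
    'SVDpp_nf1_nep10': 1.8254027708295515,
    'NMF_nf5_nep20': 1.1718055094848485,
}
compare_results_MAE = {
    'SVD_nf1_nep20': 0.6788400349414734,
    'SVDpp_nf1_nep10': 1.4984430163830378,
    'NMF_nf5_nep20': 0.9867587361073876,
}
compare_results_fit_time = {
    'SVD_nf1_nep20': 7,
    'SVDpp_nf1_nep10': 139.0,
    'NMF_nf5_nep20': 9,
}

# Rank models from lowest (best) to highest RMSE.
ranking = sorted(compare_results_RMSE, key=compare_results_RMSE.get)

print(f"{'model':<18}{'RMSE':>8}{'MAE':>8}{'fit time':>10}")
for name in ranking:
    print(
        f"{name:<18}"
        f"{compare_results_RMSE[name]:>8.4f}"
        f"{compare_results_MAE[name]:>8.4f}"
        f"{compare_results_fit_time[name]:>10.1f}"
    )
```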

In [145]:
predictions = algo_SVD.get(best_SVD_by_rmse[:-4]).predict(uid=1, iid=1)
predictions

Prediction(uid=1, iid=1, r_ui=None, est=4.586234359102049, details={'was_impossible': False})