| n | total interactions (user, item, label) | unique users | unique items |
|---|---|---|---|
| 5 | 16021685 | 93971 | 55011 |
| 15 | 15797582 | 86338 | 26937 |


[MODELS](https://recbole.io/docs/user_guide/model_intro.html) |
[CONFIG](https://recbole.io/docs/user_guide/config_settings.html#config-settings) | [EVAL METRIC](https://recbole.io/docs/recbole/recbole.evaluator.metrics.html)

In [7]:
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.context_aware_recommender import FM, DeepFM
from recbole.model.general_recommender import SimpleX, BPR, NeuMF
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger
import torch
from torch import nn
import logging
from recbole.quick_start import load_data_and_model
import pandas as pd

#sanity check for mps
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

Using device: mps


In [8]:
inter['label:float'].value_counts()

label:float
0    8148506
1    7649076
Name: count, dtype: int64
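
As a quick sanity check on the counts above, the sketch below computes the positive rate and the LogLoss of a constant predictor that always outputs that rate. The constant predictor is only a reference point added here for illustration, not something the notebook trains; the two counts are copied from the `value_counts()` output, and their sum matches the n=15 total.

```python
import math

# Counts copied from the value_counts() output above.
neg, pos = 8_148_506, 7_649_076
total = neg + pos
pos_rate = pos / total

# LogLoss of a constant predictor that always outputs pos_rate
# (the binary entropy of the label distribution, in nats).
baseline_logloss = -(pos_rate * math.log(pos_rate)
                     + (1 - pos_rate) * math.log(1 - pos_rate))

print(f"total={total}, positive rate={pos_rate:.4f}")
print(f"constant-predictor LogLoss={baseline_logloss:.4f}")
```

Because the classes are nearly balanced, this baseline sits just below ln(2); a model whose LogLoss is above it is doing worse than predicting the base rate.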

https://github.com/RUCAIBox/RecBole/issues/1721 <br>
https://github.com/RUCAIBox/RecBole/issues/1150

[custom model doc](https://recbole.io/docs/developer_guide/customize_models.html)

In [2]:
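class DeepFMCustom(DeepFM):
    """DeepFM that also loads the pre-trained user embeddings declared under 'preload_weight'."""
    def __init__(self, config, dataset):
        super().__init__(config, dataset)
        # NumPy matrix preloaded by RecBole for the 'uid' key of 'preload_weight'
        pretrained_user_emb = dataset.get_preload_weight('uid')
        # from_pretrained() freezes the weights by default (freeze=True)
        # NOTE: this only matters if forward() actually reads self.user_embedding
        self.user_embedding = nn.Embedding.from_pretrained(torch.from_numpy(pretrained_user_emb))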

In [2]:
#GENERAL RECOMMENDER
config_dict = {
    'epochs': 10,
    'data_path': '/Users/giulia/Desktop/tesi/',
    'dataset': 'test_run', # 'mind_small15' or  'test_run'
    'load_col': {
        'inter': ['user_id', 'item_id', 'label']},     
    'USER_ID_FIELD': 'user_id',
    'ITEM_ID_FIELD': 'item_id',    
    'eval_args': {
        'split': {'RS': [0.8, 0.1, 0.1]},
        'group_by': 'user',
        'order': 'RO', #there is no timestamp column
        'mode': 'labeled'},
    'model': 'NeuMF',
    'learning_rate': 0.001, 
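    'device': device, # doesn't work: RecBole's Config appears to set 'device' itself (from use_gpu / gpu_id)
    'embedding_size': 32, # 64 -> kernel dies :(
    'train_batch_size': 32, # 64 -> kernel dies :(
    'eval_batch_size': 32,
    'l2_reg': 0.001, # not listed in RecBole's training settings as far as I can tell; 'weight_decay' is
    'early_stopping_patience': 5, # RecBole's documented key for this is 'stopping_step'
    'early_stopping_metric': 'MRR@10', # RecBole's documented key is 'valid_metric'; MRR is a ranking metric, unlike the value metrics below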
    'checkpoint_dir': './saved',
    'log_level': 'DEBUG',
    'seed': 42,
    'reproducibility': True,
    'metrics': ["AUC", "MAE", "RMSE", "LogLoss"],
}


In [3]:
#CONTEXT-AWARE RECOMMENDER
config_dict = {
    'epochs': 10,
    'data_path': '/Users/giulia/Desktop/tesi/',
    'dataset': 'mind_small15', #or 'test_run'
    'additional_feat_suffix': ['useremb'],
    'load_col': {
        'inter': ['user_id', 'item_id', 'label'],
        'useremb': ['uid', 'user_emb']},
    'USER_ID_FIELD': 'user_id',
    'ITEM_ID_FIELD': 'item_id',
    'alias_of_user_id': ['uid'], # field names that are remapped into the same index space as USER_ID_FIELD
    'preload_weight': {'uid': 'user_emb'},
    'eval_args': {
        'split': {'RS': [0.8, 0.1, 0.1]},
        'group_by': 'user',
        'order': 'RO', #there is no timestamp column
        'mode': 'labeled'},
    'model': DeepFMCustom,
    'mlp_hidden_size': [32] ,
    'dropout_prob': 0.1,
    'learning_rate': 0.001, 
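    'device': device, # doesn't work: RecBole's Config appears to set 'device' itself (from use_gpu / gpu_id)
    'embedding_size': 32, # 64 -> kernel dies :(
    'train_batch_size': 32, # 64 -> kernel dies :(
    'eval_batch_size': 32,
    'l2_reg': 0.001, # not listed in RecBole's training settings as far as I can tell; 'weight_decay' is
    'early_stopping_patience': 5, # RecBole's documented key for this is 'stopping_step'
    'early_stopping_metric': 'MRR@10', # RecBole's documented key is 'valid_metric'; MRR is a ranking metric, unlike the value metrics below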
    'checkpoint_dir': './saved',
    'log_level': 'DEBUG',
    'seed': 42,
    'reproducibility': True,
    'metrics': ["AUC", "MAE", "RMSE", "LogLoss"], # ["MRR", "NDCG", "Precision", "Recall", "F1"] did not work with DeepFM here; RecBole does not allow mixing ranking metrics with value metrics
}
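
For reference, a minimal sketch of what the extra `useremb` file referenced by `additional_feat_suffix` and `load_col` might look like. It assumes RecBole's atomic-file convention (tab-separated columns, `name:type` headers, `float_seq` values separated by spaces) and a file named `<dataset>.useremb` inside the dataset folder. The rows, the temporary directory and the helper `write_useremb` are made up for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical toy rows: (user id, embedding vector).
rows = [("U1", [0.1, 0.2, 0.3]), ("U2", [0.0, 0.5, 0.9])]

def write_useremb(path, rows):
    """Write a tab-separated file with typed column headers (assumed atomic-file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("uid:token\tuser_emb:float_seq\n")
        for uid, vec in rows:
            # float_seq values are space-separated inside a single column
            f.write(f"{uid}\t{' '.join(str(x) for x in vec)}\n")

# Written to a temp folder here; the notebook would use data_path/<dataset>/ instead.
out_dir = Path(tempfile.mkdtemp()) / "mind_small15"
out_dir.mkdir()
out_file = out_dir / "mind_small15.useremb"
write_useremb(out_file, rows)
print(out_file.read_text(encoding="utf-8"))
```

The `uid` values would need to use the same raw ids as `user_id` in the `.inter` file for `alias_of_user_id` to line them up.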


[handler doc](https://docs.python.org/3/library/logging.handlers.html)

In [3]:
config = Config(model=NeuMF, dataset=config_dict['dataset'], config_dict=config_dict)
init_seed(config['seed'], config['reproducibility'])
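#------------logger
init_logger(config)  # already attaches console and file handlers to the root logger
logger = logging.getLogger()
#------------handler
# NOTE: this second console handler echoes every record again, without the timestamp prefix
c_handler = logging.StreamHandler()
c_handler.setLevel(logging.DEBUG)
logger.addHandler(c_handler)
#logger.info(config)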
#------------data
dataset = create_dataset(config)
#print(dataset.useremb_feat)
logger.info(dataset)

24 Apr 12:17    INFO  test_run
The number of users: 978
Average actions of users: 1.0235414534288638
The number of items: 852
Average actions of items: 1.1750881316098707
The number of inters: 1000
The sparsity of the dataset: 99.87998886296648%
Remain Fields: ['user_id', 'item_id', 'label']
test_run
The number of users: 978
Average actions of users: 1.0235414534288638
The number of items: 852
Average actions of items: 1.1750881316098707
The number of inters: 1000
The sparsity of the dataset: 99.87998886296648%
Remain Fields: ['user_id', 'item_id', 'label']
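
The figures in this summary can be reproduced with a few lines of arithmetic. The sketch below assumes that the user and item counts each include one reserved padding id, so the averages are taken over `count - 1` real entities, while sparsity uses the raw counts. That assumption is inferred from the numbers matching, not taken from the RecBole source.

```python
# Figures copied from the dataset summary above.
n_users, n_items, n_inters = 978, 852, 1000

# Assumption: each count includes one padding id, so the averages
# are computed over (count - 1) real users / items.
avg_user_actions = n_inters / (n_users - 1)
avg_item_actions = n_inters / (n_items - 1)

# Share of empty cells in the user x item interaction matrix.
sparsity = 1 - n_inters / (n_users * n_items)

print(avg_user_actions, avg_item_actions, f"{sparsity:.8%}")
```

An average of about 1.02 interactions per user means almost every user in `test_run` has a single interaction.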


In [4]:
train_data, valid_data, test_data = data_preparation(config, dataset)

#model = DeepFMCustom(config, train_data.dataset).to(config['device'])
#logger.info(model)


user_id [  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
 163 164 165 166 167 166 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
 234 235 236 237 238 239 240 241 242 243 24

24 Apr 12:17    INFO  [Training]: train_batch_size = [32] train_neg_sample_args: [{'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}]
[Training]: train_batch_size = [32] train_neg_sample_args: [{'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}]
24 Apr 12:17    INFO  [Evaluation]: eval_batch_size = [32] eval_args: [{'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'labeled', 'test': 'labeled'}}]
[Evaluation]: eval_batch_size = [32] eval_args: [{'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'labeled', 'test': 'labeled'}}]
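
A rough illustration of why `'group_by': 'user'` with an 0.8/0.1/0.1 ratio split can leave the validation and test sets almost empty on `test_run`, where users average about 1.02 interactions. The helper `split_counts` is hypothetical and uses a simplified floor-based rule; RecBole's exact rounding may differ, especially for users with only a few interactions.

```python
def split_counts(n_interactions, ratios=(0.8, 0.1, 0.1)):
    """Per-user sizes of (train, valid, test) under a floor-based ratio split.

    Simplified assumption: valid and test each get floor(ratio * n)
    interactions and train keeps the remainder.
    """
    valid = int(ratios[1] * n_interactions)
    test = int(ratios[2] * n_interactions)
    train = n_interactions - valid - test
    return train, valid, test

for n in (1, 10, 50):
    print(n, split_counts(n))
```

Under this rule a user with a single interaction contributes nothing to validation or test, so a dataset dominated by such users yields very little evaluation data.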


In [5]:
model = NeuMF(config, train_data.dataset).to(config['device'])
trainer = Trainer(config, model)
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data, saved=True) 

24 Apr 12:17    INFO  epoch 0 training [time: 0.18s, train loss: 42.9800]
epoch 0 training [time: 0.18s, train loss: 42.9800]
24 Apr 12:17    INFO  Saving current: ./saved/NeuMF-Apr-24-2024_12-17-26.pth
Saving current: ./saved/NeuMF-Apr-24-2024_12-17-26.pth
24 Apr 12:17    INFO  epoch 1 training [time: 0.13s, train loss: 42.9676]
epoch 1 training [time: 0.13s, train loss: 42.9676]
24 Apr 12:17    INFO  Saving current: ./saved/NeuMF-Apr-24-2024_12-17-26.pth
Saving current: ./saved/NeuMF-Apr-24-2024_12-17-26.pth
24 Apr 12:17    INFO  epoch 2 training [time: 0.13s, train loss: 42.9557]
epoch 2 training [time: 0.13s, train loss: 42.9557]
24 Apr 12:17    INFO  Saving current: ./saved/NeuMF-Apr-24-2024_12-17-26.pth
Saving current: ./saved/NeuMF-Apr-24-2024_12-17-26.pth
24 Apr 12:17    INFO  epoch 3 training [time: 0.15s, train loss: 42.8360]
epoch 3 training [time: 0.15s, train loss: 42.8360]
24 Apr 12:17    INFO  Saving current: ./saved/NeuMF-Apr-24-2024_12-17-26.pth
Saving current: ./saved

In [7]:
#LOAD MODEL
#checkpoint_path = './saved/DeepFM-Apr-21-2024_16-02-34.pth' #spec at the end of respective log file

#config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
#    model_file=checkpoint_path,
#)

In [6]:
test_result = trainer.evaluate(test_data)
test_result

24 Apr 12:17    INFO  Loading model structure and parameters from ./saved/NeuMF-Apr-24-2024_12-17-26.pth
Loading model structure and parameters from ./saved/NeuMF-Apr-24-2024_12-17-26.pth


OrderedDict([('auc', 0.725),
             ('mae', 0.5257),
             ('rmse', 0.6677),
             ('logloss', 1.9878)])

In [None]:
test_result = trainer.evaluate(test_data)
test_result

24 Apr 06:59    INFO  Loading model structure and parameters from ./saved/DeepFMCustom-Apr-23-2024_21-10-08.pth
Loading model structure and parameters from ./saved/DeepFMCustom-Apr-23-2024_21-10-08.pth


OrderedDict([('auc', 0.9858),
             ('mae', 0.0429),
             ('rmse', 0.1688),
             ('logloss', 0.1335)])
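
To make the four reported numbers easier to read, here is a small stdlib sketch of the textbook definitions of AUC, MAE, RMSE and LogLoss on toy data. These are not RecBole's implementations, and the labels and scores are invented for illustration.

```python
import math

def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mae(labels, scores):
    """Mean absolute difference between label and predicted score."""
    return sum(abs(y - s) for y, s in zip(labels, scores)) / len(labels)

def rmse(labels, scores):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((y - s) ** 2 for y, s in zip(labels, scores)) / len(labels))

def logloss(labels, scores, eps=1e-15):
    """Mean negative log-likelihood of the labels under the predicted scores."""
    total = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1 - eps)  # keep log() finite
        total += y * math.log(s) + (1 - y) * math.log(1 - s)
    return -total / len(labels)

labels = [1, 0, 1, 0]            # toy data, not from the notebook
scores = [0.9, 0.2, 0.6, 0.4]
print(auc(labels, scores), mae(labels, scores),
      rmse(labels, scores), logloss(labels, scores))
```

AUC depends only on how the scores are ordered, whereas LogLoss also penalizes confident wrong scores, so the two can disagree about how good a model is.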

-----------------------------------------------------------------------------------------


just personal notes & useful links:<br>
[ADD-> TF-IDF EMBEDDING](https://recbole.io/docs/user_guide/usage/load_pretrained_embedding.html)

[kaggle example to get prediction](https://www.kaggle.com/code/astrung/recbole-lstm-sequential-for-recomendation-tutorial#4.-Create-recommendation-result-from-trained-model)



[MODELS link](https://recbole.io/docs/user_guide/model_intro.html)

FLOPs (floating-point operations per forward pass): higher FLOPs -> more computation, which usually goes with a more complex model



save_dataset (bool): Determines whether the processed dataset is saved to disk. This can be useful for large datasets that take a long time to preprocess, as it allows for quicker loading in subsequent runs.

[training hyperparam.](https://recbole.io/docs/user_guide/config/training_settings.html)

[clip_grad_norm](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html): guards against exploding gradients by rescaling the gradients after backpropagation so that their overall norm does not exceed a maximum value. If set to None, no clipping is applied. In RecBole the value is, as far as I can tell from the training settings doc, a dict of arguments forwarded to `clip_grad_norm_` (e.g., `{'max_norm': 5.0, 'norm_type': 2}`) rather than a bare number. Gradient clipping can be crucial for stabilizing the training of **deep learning models**.
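
A stdlib-only sketch of what norm clipping does, following the formula described in the PyTorch docs linked above (scale all gradients by `max_norm / (total_norm + eps)` when that factor is below 1). `clip_by_norm` is a hypothetical helper operating on a flat list of numbers, not a PyTorch or RecBole function.

```python
import math

def clip_by_norm(grads, max_norm, eps=1e-6):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale is capped at 1.0 so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

The direction of the gradient vector is preserved; only its length is shortened when it exceeds the threshold.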

[eval hyperparam](https://recbole.io/docs/user_guide/config/evaluation_settings.html)

reproducibility vs repeatable (args):

**Reproducibility**: as far as I can tell from the source, `init_seed(seed, reproducibility)` seeds Python, NumPy and PyTorch from `seed` in either case; setting `reproducibility` to True additionally requests deterministic cuDNN behaviour (benchmark mode off), trading some speed for run-to-run consistency. With a fixed seed, random steps such as the data split and the initialization of model parameters come out the same across runs.

**Repeatable**: per the RecBole evaluation settings, `repeatable` controls whether evaluation assumes a repeatable-recommendation scenario, i.e. whether items a user has already interacted with may be recommended again. It defaults to False for non-sequential models and is not about random seeds or run-to-run variability.
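
On the seeding side, a minimal stdlib sketch of why a fixed `seed` (42 in the configs above) makes a random ordering such as `'order': 'RO'` come out the same on every run. `shuffled_ids` is a hypothetical helper; RecBole's `init_seed` also seeds NumPy and PyTorch, which this sketch leaves out.

```python
import random

def shuffled_ids(seed, n=10):
    """Return ids 0..n-1 shuffled with a private, seeded generator."""
    rng = random.Random(seed)
    ids = list(range(n))
    rng.shuffle(ids)
    return ids

run_a = shuffled_ids(seed=42)
run_b = shuffled_ids(seed=42)
run_c = shuffled_ids(seed=7)

print(run_a == run_b)  # same seed, same order
print(run_a == run_c)  # a different seed will usually give a different order
```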


[data hyp](https://recbole.io/docs/user_guide/config/data_settings.html)

In [9]:
#LRS
"""config_dict = {
    'data_path': './mind',
    'dataset': 'mind',
    'eval_args': {
    'split': {'LRS': None},
    'order': 'RO'  ,# not relevant
    'group_by': '-', # not relevant
    'mode': 'full'  # Train, validation, test split
}}

config = Config(model='DMF', dataset=config_dict['dataset'], config_dict=config_dict)
dataset = create_dataset(config)
"""
#