## Imports
These experiments were run on Python 3.8. The package versions used are listed in requirements.txt. The main dependencies are:
- tqdm: For showing progress in loops.
- numpy and pandas: For data manipulation.
- cornac: For obtaining the recommendations.
- tensorflow: Required by cornac.
- torch: Required for the VAECF implementation of Cornac.

In [49]:
from datetime import datetime
from pathlib import Path
from logging import Formatter, StreamHandler, getLogger, INFO

from tqdm import tqdm
from cornac import Experiment
from cornac.eval_methods import RatioSplit
from cornac.metrics import NDCG, Recall, Precision
from cornac.hyperopt import Discrete, Continuous
from cornac.hyperopt import GridSearch, RandomSearch
#from cornac.models import MF, WMF, SVD, VAECF
from cornac.models import VAECF, NeuMF, MCF, SVD, WMF, SKMeans, GMF
from cornac.exception import ScoreException
from numpy import array, nan
from pandas import read_csv, DataFrame, Series

## Logger setup
Here we set up the logger that prints progress information while this script runs.

In [50]:
logger = getLogger(__name__)
logger.setLevel(INFO)

ch = StreamHandler()
ch.setLevel(INFO)
ch.setFormatter(
    Formatter('%(asctime)s - %(levelname)s - %(message)s')
)

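# Attach the handler only once; re-running this cell would otherwise
# add another handler and duplicate every log line
if not logger.handlers:
    logger.addHandler(ch)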

## Configuration variables
We define some variables used in the rest of the experiment.

### General config
We record the current date and time and set the name of the experiment.

In [51]:
now = f'{datetime.now():%Y%m%d%H%M}'
experiment_name = 'AMBAR'

### File and dir config
We resolve the working directory with pathlib, locate the csv file to be used in cornac, and define a timestamped results directory.

In [52]:
work_dir = Path('.').resolve()
data_file = work_dir / 'data' / experiment_name / 'ratings_info.csv'
results_dir = work_dir / 'results' / experiment_name / now

Here we make sure the results directory exists by creating it if it doesn't.

In [53]:
# exist_ok=True already makes this a no-op when the directory exists
results_dir.mkdir(parents=True, exist_ok=True)

Also, we make sure the data file exists and is a file. Here we could also make sure that the file is an actual csv file.

In [54]:
# is_file() is False both when the path is missing and when it is not a regular file
if not data_file.is_file():
    print("Bad data file")
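
The csv check mentioned above is not implemented in this notebook. A minimal sketch of one way to do it, assuming that a `.csv` suffix plus at least three fields in the first row is enough evidence. `looks_like_csv` and `min_columns` are hypothetical names, not part of the original script:

```python
import csv
import tempfile
from pathlib import Path


def looks_like_csv(path: Path, min_columns: int = 3) -> bool:
    """Hypothetical helper: True if path is a .csv file whose first row
    has at least min_columns comma-separated fields."""
    if path.suffix.lower() != '.csv' or not path.is_file():
        return False
    with path.open(newline='', encoding='utf-8') as fh:
        first_row = next(csv.reader(fh), None)
    return first_row is not None and len(first_row) >= min_columns


# Usage on throwaway files (the contents are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / 'ratings_info.csv'
    good.write_text('user_id,track_id,rating\n1,10,5\n', encoding='utf-8')
    short = Path(tmp) / 'short.csv'
    short.write_text('user_id,track_id\n1,10\n', encoding='utf-8')

    good_ok = looks_like_csv(good)
    short_ok = looks_like_csv(short)
    missing_ok = looks_like_csv(Path(tmp) / 'missing.csv')
```

This only inspects the first row, so it is a cheap sanity check rather than a full validation of the file.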

### Dataframe config
We define the header names that pandas uses to identify the user, item and rating columns, and the data type of their values. If the columns have different data types, val_dtype can instead be a list of pandas-compatible type strings, one per column in user, item, rating order.

In [55]:
col_names = {
    'user': 'user_id',
    'item': 'track_id',
    'rating': 'rating'
}
val_dtype = 'int'
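
As an illustration of the list form of val_dtype, the sketch below pairs one type string with each positional column, the same way the loading cell later does with `zip`. The toy values and the name `toy_df` are assumptions made up for this example:

```python
from pandas import DataFrame

# Positional column names for user, item, rating
keys = ['0', '1', '2']
# Hypothetical per-column types: integer ids, float ratings
val_dtype = ['int', 'int', 'float']

# Same mapping the loading cell builds when val_dtype is a list
d_type = dict(zip(keys, val_dtype))

# Toy data stored as strings, cast column by column
toy_df = DataFrame(
    {'0': ['1', '2'], '1': ['10', '20'], '2': ['4.5', '3']}
).astype(d_type)
```

With a single string such as `'int'`, every column is cast to that one type instead.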

### Cornac config
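Here we set the value of k (used as the cutoff of the ranking metrics and as the number of recommendations exported per user), the test set size and the validation set size. We also decide whether to exclude unknown users and items, that is, those in the test or validation sets that do not appear in the training set.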

In [56]:
k = 1000
test_size = 0.2
val_size = 0.1
exclude_unknown = True

## Function setup
We set up various utility functions to be used later, mostly for exporting data and converting it into a format compatible with cornac.

set_data_to_tuple_list takes a dict of the form {user: [item_list, rating_list]}, processes it and returns a list of tuples of the form [(user, item, rating), ...].

In [57]:
def set_data_to_tuple_list(d: dict) -> list:
    result = []
    for user in d:
        transpose = array(d[user]).T
        for t in transpose:
            result.append((user,) + tuple(t))
    return result
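
A small usage sketch of set_data_to_tuple_list on made-up data (the `sample` dict is an assumption for illustration). The function is repeated so the block runs on its own:

```python
from numpy import array


def set_data_to_tuple_list(d: dict) -> list:
    result = []
    for user in d:
        # Rows of the transposed array are (item, rating) pairs
        transpose = array(d[user]).T
        for t in transpose:
            result.append((user,) + tuple(t))
    return result


# Hypothetical input: user 0 rated items 0 and 1, user 1 rated items 1 and 2
sample = {
    0: [[0, 1], [5.0, 3.0]],
    1: [[1, 2], [4.0, 2.0]],
}
triplets = set_data_to_tuple_list(sample)
# triplets == [(0, 0.0, 5.0), (0, 1.0, 3.0), (1, 1.0, 4.0), (1, 2.0, 2.0)]
```

Because the item list and the rating list go into one numpy array, integer item indices are upcast to floats when the ratings are floats. In this notebook the later DataFrame construction with `dtype=val_dtype` casts them back.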

list_to_dict converts a list into a dict using dict comprehension and enumerate.

In [58]:
def list_to_dict(l: list) -> dict:
    return {i: v for i, v in enumerate(l)}

get_set_dataframe processes the raw data ({user: [item_list, rating_list]}) together with the item ids and user ids, and converts it into a pandas DataFrame to be exported later.

In [59]:
# Converts from the format {user: [item_list, rating_list]}
# Final DataFrame:
#    user_id  item_id  rating  item_idx  user_idx
# 0        0        0       5         0         0
# 1        0        1       3         1         0
# 2        1        1       4         1         1
# 3        1        2       2         2         1
def get_set_dataframe(set_data: dict, i_ids: list, u_ids: list) -> DataFrame:
    data_list = set_data_to_tuple_list(set_data)
    i_map = list_to_dict(i_ids)
    u_map = list_to_dict(u_ids)

    set_df = DataFrame(data_list,
                       columns=list(col_names.values()),
                       dtype=val_dtype)
    set_df['item_idx'] = set_df[col_names['item']]
    set_df['item'] = set_df[col_names['item']].replace(to_replace=i_map)
    set_df['user_idx'] = set_df[col_names['user']]
    set_df['user'] = set_df[col_names['user']].replace(to_replace=u_map)
    return set_df
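
The core of get_set_dataframe is the translation from cornac's internal indices back to the original ids. A minimal sketch of that step on made-up data, using `Series.map` in place of `replace` for brevity; the ids and the name `idx_df` are assumptions for illustration:

```python
from pandas import DataFrame

# Hypothetical original ids; the list position is the internal index
item_ids = [501, 502, 503]
user_ids = [9001, 9002]

# Triplets expressed in internal indices, as cornac stores them
idx_df = DataFrame(
    [(0, 0, 5), (0, 1, 3), (1, 1, 4), (1, 2, 2)],
    columns=['user_id', 'track_id', 'rating'],
)

# index -> original id lookups, equivalent to list_to_dict
i_map = dict(enumerate(item_ids))
u_map = dict(enumerate(user_ids))

# Keep the internal index and add the original id alongside it
idx_df['item_idx'] = idx_df['track_id']
idx_df['item'] = idx_df['track_id'].map(i_map)
idx_df['user_idx'] = idx_df['user_id']
idx_df['user'] = idx_df['user_id'].map(u_map)
```

The resulting frame carries both representations, so exported rows can be matched either against cornac's internals or against the original dataset.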

In [60]:
logger.info('Experiment start...')
logger.info(f'{experiment_name}')
logger.info(f'{k=}')
logger.info(f'{work_dir=}')
logger.info(f'{data_file=}')
logger.info(f'{results_dir=}')

2025-01-08 01:37:38,640 - INFO - Experiment start...
2025-01-08 01:37:38,642 - INFO - AMBAR
2025-01-08 01:37:38,644 - INFO - k=1000
2025-01-08 01:37:38,646 - INFO - work_dir=WindowsPath('M:/Framework/AMBAR')
2025-01-08 01:37:38,647 - INFO - data_file=WindowsPath('M:/Framework/AMBAR/data/AMBAR/ratings_info.csv')
2025-01-08 01:37:38,649 - INFO - results_dir=WindowsPath('M:/

Here we build the dataset from the data file. The file is expected to contain only the user, item and rating columns, in that order. The data types are those defined in the configuration section, and the file's own header row is replaced by positional column names.

For testing before running the full experiment, we left in a filter that samples 50 users and keeps only the data of those users. Please use it only to make sure that the script runs correctly from start to finish.

In [61]:
# user, item, rating
keys = ['0', '1', '2']

# Build a dict that maps each key to its data type
if isinstance(val_dtype, str):
    d_type = {key: val_dtype for key in keys}
elif isinstance(val_dtype, list):
    d_type = dict(zip(keys, val_dtype))
else:
    logger.error('Wrong type setup. Must be a type string or a list of type string.')
    exit()

logger.info('Loading data into triplets...')
df = read_csv(
    data_file,
    header=0,
    names=['0', '1', '2']
)[['0', '1', '2']].astype(d_type)

# FOR TESTING ONLY
# Randomly select 50 unique users from the dataframe
user_filter = Series(df['0'].unique()).sample(50).to_list()
# Keep only the rows that belong to the sampled users
df = df[df['0'].isin(user_filter)]

data = list(df.to_records(index=False, column_dtypes=d_type))

2025-01-08 01:37:38,673 - INFO - Loading data into triplets...
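
To make the loading step concrete, here is a small sketch that runs the same `read_csv` and `to_records` calls on an in-memory csv. The file contents are made up, and `StringIO` stands in for the real data file:

```python
from io import StringIO

from pandas import read_csv

# Hypothetical file contents with a header row and three columns
csv_text = "user_id,track_id,rating\n1,10,5\n1,11,3\n2,10,4\n"

keys = ['0', '1', '2']
d_type = {key: 'int' for key in keys}

# header=0 treats the first row as a header, names= then replaces it
toy_df = read_csv(StringIO(csv_text), header=0, names=keys)[keys].astype(d_type)

# One numpy record per row, in (user, item, rating) order
records = list(toy_df.to_records(index=False, column_dtypes=d_type))
# Plain Python tuples, convenient for inspection
triplets = [record.tolist() for record in records]
```

Passing `header=0` together with `names` means the original column names in the file do not matter, only the column order does.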


Here we create the ratio split that will be used by cornac. It randomly splits the data into three sets: one for training, one for testing and one for validation.

In [62]:
logger.info('Creating ratio split...')
ratio_split = RatioSplit(
    data=data,
    test_size=test_size,
    val_size=val_size,
    exclude_unknowns=exclude_unknown,
    verbose=True,
    seed=123  # Added so that results can be reproduced
)

2025-01-08 01:37:38,749 - INFO - Creating ratio split...


rating_threshold = 1.0
exclude_unknowns = True
---
Training data:
Number of users = 50
Number of items = 2446
Number of ratings = 2959
Max rating = 5.0
Min rating = 1.0
Global mean = 1.6
---
Test data:
Number of users = 50
Number of items = 2446
Number of ratings = 296
Number of unknown users = 0
Number of unknown items = 0
---
Validation data:
Number of users = 50
Number of items = 2446
Number of ratings = 164
---
Total users = 50
Total items = 2446
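
The test and validation sets above are smaller than 20% and 10% of the ratings, which is likely an effect of excluding unknown users and items. The sketch below illustrates the idea of a ratio split followed by that filter. It is a simplified illustration under stated assumptions, not cornac's implementation, and `ratio_split_sketch` is a hypothetical name:

```python
import random


def ratio_split_sketch(triplets, test_size=0.2, val_size=0.1,
                       exclude_unknowns=True, seed=123):
    """Shuffle the triplets, cut them into train/val/test, and optionally
    drop val/test rows whose user or item never appears in train."""
    rng = random.Random(seed)
    shuffled = list(triplets)
    rng.shuffle(shuffled)

    n_test = int(round(test_size * len(shuffled)))
    n_val = int(round(val_size * len(shuffled)))
    n_train = len(shuffled) - n_test - n_val

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]

    if exclude_unknowns:
        known_users = {u for u, _, _ in train}
        known_items = {i for _, i, _ in train}
        val = [t for t in val if t[0] in known_users and t[1] in known_items]
        test = [t for t in test if t[0] in known_users and t[1] in known_items]
    return train, val, test


# Toy data: 5 users, 2 items, 10 ratings in total
toy = [(u, i, 1.0) for u in range(5) for i in range(2)]
train, val, test = ratio_split_sketch(toy)
```

With sparse data, many items in the held-out sets never occur in the training set, so the filter can remove a large share of the test and validation ratings.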




We define the metrics here. In this experiment, we set up NDCG, Recall and Precision, using the k defined in the set-up.

In [63]:
metrics = [
    NDCG(k),
    Recall(k),
    Precision(k)
]
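
For intuition about what these three metrics measure at a cutoff k, here is a hand-rolled sketch with binary relevance on a toy ranking. It is an illustration only; cornac's own implementations may differ in details, and the function name is hypothetical:

```python
from math import log2


def precision_recall_ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance Precision@k, Recall@k and NDCG@k for one user."""
    top_k = ranked_items[:k]
    hits = [1 if item in relevant_items else 0 for item in top_k]

    precision = sum(hits) / k
    recall = sum(hits) / len(relevant_items)

    # Discounted gain: a hit at position p contributes 1 / log2(p + 1)
    dcg = sum(h / log2(pos + 1) for pos, h in enumerate(hits, start=1))
    # Ideal ranking places all relevant items first
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1 / log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return precision, recall, dcg / idcg


# Toy ranking: two of the three relevant items appear in the top 4
ranked = ['a', 'b', 'c', 'd', 'e']
relevant = {'a', 'c', 'z'}
precision, recall, ndcg = precision_recall_ndcg_at_k(ranked, relevant, k=4)
# precision == 0.5, recall == 2/3
```

Since precision divides by k, a large cutoff such as 1000 yields very small precision values when each user has only a few held-out ratings, while recall can remain high.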

We also define the models, using previously obtained parameters. A hyperparameter search could be defined here instead; in that case, it is important to still leave a models variable holding that configuration, so that cornac can pick up the list, run the computation and export the recommendations.

Because this script assumes a list of models with predefined parameters, if the best parameters found by cornac are needed, they must be exported after running the experiment.

## Base models to compute

In [64]:
base_vaecf = VAECF(
    name='vaecf_default',
    k=k,  # VAECF's latent dimension; this reuses the ranking cutoff k defined above
    autoencoder_structure=[20],
    act_fn="tanh",
    likelihood="mult",
    n_epochs=100,
    batch_size=100,
    learning_rate=0.001,
    beta=1.0,
    use_gpu=True,
    verbose=True,
)



In [65]:
# Models passed to cornac to run the experiments
models = [
    base_vaecf
]

In [66]:
# Get the total number of users in the training split
total_users = ratio_split.train_set.num_users
# Get the total number of items in the training split
total_items = ratio_split.train_set.num_items
logger.info(f'{total_users=}')
logger.info(f'{total_items=}')

2025-01-08 01:37:38,840 - INFO - total_users=50
2025-01-08 01:37:38,842 - INFO - total_items=2446


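After setting up the metrics and models, we export the training and validation data into the results directory. A copy of the training data, extended with continent and genre information, is also written as train.csv for MOReGIn.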

In [67]:
import pandas as pd
from pathlib import Path

logger.info('Exporting training data for MOReGIn...')
train_df = get_set_dataframe(
    dict(ratio_split.train_set.user_data),
    list(ratio_split.train_set.item_ids),
    list(ratio_split.train_set.user_ids)
)

# Read the original input data to get continent and genre information
input_df = pd.read_csv(work_dir / 'for_testing.csv')
input_lookup = input_df.set_index(['user', 'item'])[['continent', 'genre']]

# Add continent and genre to training data
train_df = train_df.join(input_lookup, on=['user', 'item'])
train_df.to_csv('train.csv', index=False)

2025-01-08 01:37:38,858 - INFO - Exporting training data for MOReGIn...


In [68]:
logger.info('Exporting train data...')
get_set_dataframe(
    dict(ratio_split.train_set.user_data),
    list(ratio_split.train_set.item_ids),
    list(ratio_split.train_set.user_ids),
).to_csv(results_dir / 'train_set.csv')

2025-01-08 01:37:39,121 - INFO - Exporting train data...


In [69]:
logger.info('Exporting validation data...')
get_set_dataframe(
    dict(ratio_split.val_set.user_data),
    list(ratio_split.val_set.item_ids),
    list(ratio_split.val_set.user_ids),
).to_csv(results_dir / 'val_set.csv')

2025-01-08 01:37:39,292 - INFO - Exporting validation data...


And we run the experiments with the defined variables.

In [70]:
logger.info('Running experiment...')
exp = Experiment(
    eval_method=ratio_split,
    models=models,
    metrics=metrics,
    user_based=True,
)
exp.run()

#print(rs_mcf.best_params)

2025-01-08 01:37:39,400 - INFO - Running experiment...



[vaecf_default] Training started!


100%|██████████| 100/100 [00:01<00:00, 93.99it/s, loss=9.23]



[vaecf_default] Evaluation started!


Ranking: 100%|██████████| 41/41 [00:00<00:00, 911.09it/s]
Ranking: 100%|██████████| 34/34 [00:00<00:00, 944.45it/s]


VALIDATION:
...
              | NDCG@1000 | Precision@1000 | Recall@1000 | Time (s)
------------- + --------- + -------------- + ----------- + --------
vaecf_default |    0.1478 |         0.0033 |      0.6623 |   0.0390

TEST:
...
              | NDCG@1000 | Precision@1000 | Recall@1000 | Train (s) | Test (s)
------------- + --------- + -------------- + ----------- + --------- + --------
vaecf_default |    0.1638 |         0.0046 |      0.6297 |    1.0690 |   0.0480






After running the experiment, we export the metrics obtained from the calculation into a csv using pandas.

In [71]:
logger.info('Exporting metrics...')
metric_results = {
    exp.models[i].name: dict(exp.result[i].metric_avg_results)
    for i in range(len(models))
}
(DataFrame(metric_results)
 .reset_index()
 .rename(columns={'index': 'metric'})
 .to_csv(results_dir / 'metric_results.csv'))

2025-01-08 01:37:40,577 - INFO - Exporting metrics...
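
To show the shape of the exported table, the sketch below applies the same DataFrame chain to a hand-written dict. The values are copied from the TEST table above, and it is an assumption that the keys of `metric_avg_results` are the metric names as printed there:

```python
from pandas import DataFrame

# Hypothetical stand-in for the dict built from exp.result
metric_results = {
    'vaecf_default': {
        'NDCG@1000': 0.1638,
        'Precision@1000': 0.0046,
        'Recall@1000': 0.6297,
    },
}

# One row per metric, one column per model
table = (DataFrame(metric_results)
         .reset_index()
         .rename(columns={'index': 'metric'}))
```

Because `to_csv` is called without `index=False` in the cell above, the exported file also contains an unnamed leading column with the row numbers.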


And finally we export the recommendations, using a custom nested loop to get the results.
- We first loop over the models of the experiment.
- For each model, we load the continent and genre of every item from for_testing.csv.
- We loop over cornac's user map to get both the original id and cornac's internal index of each user.
- We get the scores for the user, skipping the user if the model cannot score them.
- We get the top k items by reversing the argsort of the scores.
- We loop over cornac's item map to get both the original id and the internal index of each item, keeping only the top k items.
- We get the score obtained from cornac, or nan in case of an IndexError.
- We append the user id, the item id, the score, the rank position, the continent and the genre to the result list.
- Once the loops for a model have finished, we export its data into a csv file.

In [72]:
logger.info('Processing models...')
for model in exp.models:
    model_result = []
    logger.info(f'Getting scores for {model.name}...')

    # Read the input data with continent and genre information
    input_df = pd.read_csv(work_dir / 'for_testing.csv')
    # Convert IDs to integers to ensure matching
    input_df['user'] = input_df['user'].astype(int)
    input_df['item'] = input_df['item'].astype(int)
    
    # Create a dictionary for faster lookups - store all genre/continent info for each item
    item_info = {}
    for _, row in input_df.iterrows():
        if row['item'] not in item_info:
            item_info[row['item']] = {
                'continent': row['continent'],
                'genre': row['genre']
            }

    for user_id, user_index in tqdm(exp.eval_method.train_set.uid_map.items()):
        try:
            scores = model.score(user_index)
        except ScoreException:
            logger.error(f"{model.name}: Couldn't predict for user {user_index} ({user_id=})")
            continue

        # Rank all items once, best first, and keep the top k
        top_items = scores.argsort()[::-1][:k]
        # Map each top item index to its 1-based rank for constant-time lookups
        positions = {int(idx): pos for pos, idx in enumerate(top_items, start=1)}

        for item_id, item_index in exp.eval_method.train_set.iid_map.items():
            if item_index not in positions:
                continue

            try:
                score = scores[item_index]
            except IndexError:
                logger.error(
                    f"{model.name}: No score for item {item_index} ({item_id=}) in user {user_index} ({user_id=})"
                )
                score = nan

            # Get continent and genre using item-based lookup
            item_data = item_info.get(item_id, {})
            continent = item_data.get('continent', 'unknown')
            genre = item_data.get('genre', 'unknown')

            model_result.append({
                'id': len(model_result) + 1,
                'user': user_id,
                'item': item_id,
                'rating': score,
                'position': positions[item_index],
                'continent': continent,
                'genre': genre
            })

    logger.info(f'Exporting {model.name}...')
    result_df = pd.DataFrame(model_result)
    # Save both formats - one for general use and one specifically for MOReGIn
    result_df.to_csv(results_dir / f'{model.name}.csv', index=False)
    result_df.to_csv(f'{model.name}.csv', index=False)  # This one for MOReGIn

2025-01-08 01:37:40,597 - INFO - Processing models...
2025-01-08 01:37:40,611 - INFO - Getting scores for vaecf_default...
100%|██████████| 50/50 [00:15<00:00,  3.32it/s]
2025-01-08 01:38:00,848 - INFO - Exporting vaecf_default...
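
The per-user part of the export can be checked on a toy example. The sketch below uses made-up scores and a made-up item map (both are assumptions for illustration) to show how the top k items, their scores and their rank positions come out:

```python
import numpy as np

# Hypothetical scores for one user; the array position is the internal item index
scores = np.array([0.1, 0.9, 0.4, 0.7])
# Hypothetical map from original item id to internal index, like cornac's iid_map
iid_map = {'t_a': 0, 't_b': 1, 't_c': 2, 't_d': 3}
k = 2

# Indices of the k highest scores, best first
top_items = scores.argsort()[::-1][:k]
# Internal index -> 1-based rank position
positions = {int(idx): pos for pos, idx in enumerate(top_items, start=1)}

# One row per recommended item, in item-map order
rows = [
    {'item': item_id, 'rating': float(scores[idx]), 'position': positions[idx]}
    for item_id, idx in iid_map.items()
    if idx in positions
]
```

Rows follow the order of the item map rather than the rank, so the `position` field is what identifies the best recommendation.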
