## Imports
These experiments were run on Python 3.8. The package versions used are listed in requirements.txt.
- tqdm: For showing progress in loops.
- numpy and pandas: For data manipulation.
- cornac: For obtaining the recommendations.
- tensorflow: Required by cornac.
- torch: Required for the VAECF implementation of Cornac.

In [158]:
from datetime import datetime
from pathlib import Path
from logging import Formatter, StreamHandler, getLogger, INFO

from tqdm import tqdm
from cornac import Experiment
from cornac.eval_methods import RatioSplit, BaseMethod
from cornac.metrics import NDCG, Recall, Precision
from cornac.models import MF, WMF, SVD, VAECF, UserKNN
from cornac.exception import ScoreException
from numpy import array, nan
from pandas import read_csv, DataFrame, Series

## Logger setup
Here we set up the logger for showing some info when executing this script.

In [159]:
logger = getLogger(__name__)
logger.setLevel(INFO)

ch = StreamHandler()
ch.setLevel(INFO)
ch.setFormatter(
    Formatter('%(asctime)s - %(levelname)s - %(message)s')
)

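# Attach the handler only once, so re-running this cell does not duplicate every log line
if not logger.handlers:
    logger.addHandler(ch)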

## Configuration variables
We define some variables used in the rest of the experiment.

### General config
We capture the current timestamp, which later names the results directory, and set the name of the experiment.

In [160]:
now = f'{datetime.now():%Y%m%d%H%M%S}'
experiment_name = 'AMBAR'

### File and dir config
We get the working directory with pathlib, define a results directory, and locate the train and test csv files that will be passed to cornac.

In [161]:
import pandas as pd

# Work directory
work_dir = Path('.').resolve()

# Results directory, scores will be saved here
results_dir = work_dir / 'results' / now

# preprocessing data directory
prep = work_dir / 'data' / 'PFair' / 'Preprocessing'

# preprocessing train data
train_set = prep / 'train_filtered2.csv'

# preprocessing test data
test_set = prep / 'test_filtered.csv'

Here we make sure the results directory exists by creating it if it doesn't.

In [162]:
# If the directory doesn't exist, create it
if not results_dir.exists():
    results_dir.mkdir(parents=True, exist_ok=True)

Also, we make sure the train and test data files exist and are regular files. Here we could also make sure that each file is an actual csv file.

In [163]:
# Both the train and the test files must exist and be regular files
if not train_set.is_file() or not test_set.is_file():
    print("Bad data file")
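The csv check mentioned above could be sketched with the standard library alone. The helper name `looks_like_csv` and its `expected_columns` parameter are assumptions made for illustration and are not part of the original script. The check only looks at the file extension and the first row.

```python
import csv
from pathlib import Path


def looks_like_csv(path: Path, expected_columns: int = 3) -> bool:
    """Return True if `path` is a .csv file whose first row has the expected column count.

    Hypothetical helper: it inspects only the extension and the first row.
    """
    if not path.is_file() or path.suffix.lower() != '.csv':
        return False
    with path.open(newline='') as handle:
        first_row = next(csv.reader(handle), None)
    return first_row is not None and len(first_row) == expected_columns
```

It could be called as `looks_like_csv(train_set)` and `looks_like_csv(test_set)` right after the existence check.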

### Dataframe config
We define the header name of each column so pandas can identify it. We also define the data type of the user, item and rating values. If the columns have different data types, val_dtype can be a list of pandas-compatible type strings, one per column.

In [164]:
# Column names to be used later
col_names = {
    'user': 'user_id',
    'item': 'track_id',
    'rating': 'rating'
}
val_dtype = 'int'
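As a minimal sketch of how a single type string or a list of type strings can be turned into a per-column mapping: the function name `build_dtype_map` is hypothetical, and the length check and the raised exceptions are additions that the original script does not have.

```python
from typing import Dict, List, Union


def build_dtype_map(val_dtype: Union[str, List[str]], keys: List[str]) -> Dict[str, str]:
    """Map each column key to a dtype string."""
    if isinstance(val_dtype, str):
        # One dtype shared by every column
        return {key: val_dtype for key in keys}
    if isinstance(val_dtype, list):
        if len(val_dtype) != len(keys):
            raise ValueError('val_dtype must provide one dtype per column.')
        # One dtype per column, matched by position
        return dict(zip(keys, val_dtype))
    raise TypeError('val_dtype must be a type string or a list of type strings.')


print(build_dtype_map('int', ['0', '1', '2']))
print(build_dtype_map(['int', 'int', 'float'], ['0', '1', '2']))
```

Raising an exception instead of calling `exit()` keeps the notebook kernel alive when the configuration is wrong.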

### Cornac config
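Here we set the k value, which is used as the cutoff for the metrics and the exported recommendations and is also passed to the first BiVAECF model. The test and validation set sizes are left commented out because the train and test sets are loaded from separate files. We also decide whether to exclude unknown users and items.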

In [165]:
k = 1000
#test_size = 0.2
#val_size = 0.1
exclude_unknown = True

## Function setup
We set up several utility functions to be used later, mostly for exporting data and converting it into a format compatible with cornac.

set_data_to_tuple_list takes a dict of {user: [item_list, rating_list]}, processes it and returns a list of tuples in the format [(user, item, rating)...].

In [166]:
def set_data_to_tuple_list(d: dict) -> list:
    result = []
    for user in d:
        transpose = array(d[user]).T
        for t in transpose:
            result.append((user,) + tuple(t))
    return result
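To make the input and output shapes concrete, here is a dependency-free sketch of the same flattening, written with `zip` instead of a NumPy transpose. The name `triplets_from_user_data` and the toy data are illustrative assumptions.

```python
def triplets_from_user_data(d: dict) -> list:
    """Flatten {user: [item_list, rating_list]} into [(user, item, rating), ...]."""
    return [
        (user, item, rating)
        for user, (items, ratings) in d.items()
        # zip pairs the n-th item with the n-th rating, like the transpose does
        for item, rating in zip(items, ratings)
    ]


toy = {0: [[10, 11], [5, 3]], 1: [[12], [4]]}
print(triplets_from_user_data(toy))
# [(0, 10, 5), (0, 11, 3), (1, 12, 4)]
```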

list_to_dict converts a list into a dict using dict comprehension and enumerate.

In [167]:
def list_to_dict(l: list) -> dict:
    return {i: v for i, v in enumerate(l)}

get_set_dataframe processes the raw data ({user: [item_list, rating_list]}) together with the item ids and user ids, and converts it into a pandas DataFrame to be exported later.

In [168]:
def get_set_dataframe(set_data: dict, i_ids: list, u_ids: list) -> DataFrame:
    data_list = set_data_to_tuple_list(set_data)
    i_map = list_to_dict(i_ids)
    u_map = list_to_dict(u_ids)

    set_df = DataFrame(data_list,
                       columns=list(col_names.values()),
                       dtype=val_dtype)
    set_df['item_idx'] = set_df[col_names['item']]
    set_df['item'] = set_df[col_names['item']].replace(to_replace=i_map)
    set_df['user_idx'] = set_df[col_names['user']]
    set_df['user'] = set_df[col_names['user']].replace(to_replace=u_map)
    return set_df
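The index-to-id mapping that this function applies can be illustrated without pandas. In the sketch below, `attach_original_ids` and the sample ids are hypothetical. Unlike `Series.replace`, the plain dict lookup raises a `KeyError` for an index that has no mapping.

```python
def attach_original_ids(triplets, i_ids, u_ids):
    """Add the original user and item ids next to the internal indices."""
    i_map = dict(enumerate(i_ids))  # internal item index -> original item id
    u_map = dict(enumerate(u_ids))  # internal user index -> original user id
    return [
        {
            'user_idx': u, 'user': u_map[u],
            'item_idx': i, 'item': i_map[i],
            'rating': r,
        }
        for u, i, r in triplets
    ]


rows = attach_original_ids([(0, 1, 5)], i_ids=['t_a', 't_b'], u_ids=['u_x'])
print(rows)
```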

In [169]:
logger.info('Experiment start...')
logger.info(f'{experiment_name}')
logger.info(f'{k=}')
logger.info(f'{work_dir=}')
logger.info(f'{results_dir=}')

2025-01-12 23:32:10,342 - INFO - Experiment start...
2025-01-12 23:32:10,346 - INFO - AMBAR
2025-01-12 23:32:10,350 - INFO - k=1000

Here we create the dataset from the data file. The file is expected to contain only the user, item and rating columns, in that order. The column names and data types are the ones defined in the set-up section.

For testing purposes, before executing the full experiment, we left a filter that samples 50 users and keeps only their data. Please use it only to make sure that the script executes correctly from start to finish.

In [170]:
"""keys = ['0', '1', '2']

if isinstance(val_dtype, str):
    d_type = {key: val_dtype for key in keys}
elif isinstance(val_dtype, list):
    d_type = dict(zip(keys, val_dtype))
else:
    logger.error('Wrong type setup. Must be a type string or a list of type string.')
    exit()

logger.info('Loading data into triplets...')
df = read_csv(
    data_file,
    header=0,
    names=['0', '1', '2']
)[['0', '1', '2']].astype(d_type)

# FOR TESTING ONLY
user_filter = Series(df['0'].unique()).sample(50, random_state=123).to_list()
df = df[df['0'].isin(user_filter)]
print(df.head())
data = list(df.to_records(index=False, column_dtypes=d_type))"""

In [171]:
# FOR TESTING
# Here we change the input data format to be compatible with the cornac evaluation method
from sklearn.model_selection import train_test_split

keys = ['0', '1', '2']

if isinstance(val_dtype, str):
    d_type = {key: val_dtype for key in keys}
elif isinstance(val_dtype, list):
    d_type = dict(zip(keys, val_dtype))
else:
    logger.error('Wrong type setup. Must be a type string or a list of type string.')
    exit()
# We change the columns names for the train set
logger.info('Loading data into triplets...')
df = read_csv(
    train_set,
    header=0,
    names=['0', '1', '2']
)[['0', '1', '2']].astype(d_type)
# We change the columns names for the test set
df1 = read_csv(
    test_set,
    header=0,
    names=['0', '1', '2']
)[['0', '1', '2']].astype(d_type)

# If validation data was needed, repeat the code here


# FOR TESTING ONLY: sample 500 random users and split their data
#user_filter = Series(df['0'].unique()).sample(500, random_state=333).to_list()
#df = df[df['0'].isin(user_filter)]

# Split the data into training and test sets, stratified by rating
#train_df, test_df = train_test_split(df, test_size=0.2, random_state=333, stratify=df['2'])

#train_data = list(train_df.to_records(index=False, column_dtypes=d_type))
#test_data = list(test_df.to_records(index=False, column_dtypes=d_type))
# END OF TESTING BLOCK

"""print("Training data:")
print(df.head)
print("Test data:")
print(df1.head)"""

train_data = list(df.to_records(index=False, column_dtypes=d_type))
test_data = list(df1.to_records(index=False, column_dtypes=d_type))



2025-01-12 23:32:10,392 - INFO - Loading data into triplets...


Here we create the evaluation method that will be used by cornac. Instead of a random Ratio Split into train, test and validation sets, we use BaseMethod.from_splits with the train and test sets loaded above and no validation set.

In [172]:
# Here we establish the evaluation method. It can be a ratio split,
# cross-validation, stratified, etc. The base method of this class
# lets us provide our own sets, but their format must be changed as we did in the cell above
# ------------------- //Check Cornac documentation// -------------------

eval_method = BaseMethod.from_splits(
    train_data=train_data,
    test_data=test_data,
    val_data=None,  # Optional, can be omitted if not using a validation set
    #fmt='UIR',  # Format of the data, typically 'UIR' for user-item-rating
    rating_threshold=1.0,  # Threshold for considering a rating as positive
    exclude_unknowns=True,  # Whether to exclude unknown users/items
    verbose=True  # Whether to print detailed logs
)

# Testing (ignore)
#user_data = eval_method.test_set.num_users

#print("probando")
#print(user_data)

rating_threshold = 1.0
exclude_unknowns = True
---
Training data:
Number of users = 500
Number of items = 3723
Number of ratings = 8411
Max rating = 5.0
Min rating = 1.0
Global mean = 1.7
---
Test data:
Number of users = 500
Number of items = 3723
Number of ratings = 1678
Number of unknown users = 0
Number of unknown items = 0
---
Total users = 500
Total items = 3723
probando
500




We define the metrics here. In this experiment, we set up NDCG, Recall and Precision, using the k defined in the set-up.

In [173]:
# Here we create a list to pass to the experiment later in the code.
# More metrics can be imported in the first cell.
# ------------------- //Check Cornac documentation for more metrics// -------------------

metrics = [
    NDCG(k),
    Recall(k),
    Precision(k)
]
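For readers unfamiliar with these metrics, the following sketch computes them for a single user from a ranked list, using the textbook definitions with binary relevance. It is only an illustration: the function name and the toy data are assumptions, and cornac's own implementations may differ in details such as tie handling or how short lists are treated.

```python
import math


def precision_recall_ndcg_at_k(ranked_items, relevant_items, k):
    """Compute Precision@k, Recall@k and NDCG@k for one user with binary relevance."""
    top_k = ranked_items[:k]
    relevant = set(relevant_items)
    hits = [1 if item in relevant else 0 for item in top_k]

    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0

    # DCG discounts each hit by log2(rank + 1), with ranks starting at 1
    dcg = sum(hit / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1))
    # The ideal ranking places all relevant items (up to k) at the top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg


p, r, n = precision_recall_ndcg_at_k(['a', 'b', 'c', 'd'], ['a', 'c', 'z'], k=2)
print(p, r, n)
```

With a cutoff as large as k = 1000, Recall tends to be high and Precision low, because most of the 1000 recommended items cannot be relevant for any single user.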

We also define the models with some previously obtained parameters. A hyperparameter search could be defined in this part as well; in that case, it is important to leave a models variable holding that configuration, so cornac can pick up the list, run the calculation and export the recommendations.

Because this script assumes a list of models with predefined parameters, the best parameters found by cornac, if needed, must be exported after running the experiment.

In [174]:
# Here we set the model parameters for training.
# Models, like metrics, must be imported from cornac.
# We store the models in variables so they can be passed as a list
# to the Experiment class later in the code.
# You may experiment with grid search and random search for tuning.
# ------------------- //Check Cornac documentation// -------------------

from cornac.models import BiVAECF
from cornac.hyperopt import Discrete, Continuous
from cornac.hyperopt import GridSearch, RandomSearch

biVAECF = BiVAECF(
    name='vibae_exp1',
    k=k,
    encoder_structure=[20],
    act_fn="tanh",
    likelihood="pois",
    n_epochs=100,
    batch_size=100,
    learning_rate=0.001,
    beta_kl=1.0,
    use_gpu=True,
    verbose=True
)

# First alternative parameter set
biVAECF_alt1 = BiVAECF(
    name='bivae_exp2',
    k=500,  # Reduced number of latent factors
    encoder_structure=[50, 30],  # Different autoencoder structure
    act_fn="relu",  # Changed activation function
    likelihood="gaus",  # Changed likelihood
    n_epochs=150,  # Increased number of epochs
    batch_size=50,  # Smaller batch size
    learning_rate=0.005,  # Increased learning rate
    beta_kl=0.5,  # Reduced KL divergence weight
    use_gpu=False,  # Disabled GPU usage
    verbose=False  # Disabled verbosity
)

# Second alternative parameter set
biVAECF_alt2 = BiVAECF(
    name='vibae_exp3',
    k=2000,  # Increased number of latent factors
    encoder_structure=[100, 50, 25],  # More complex autoencoder structure
    act_fn="sigmoid",  # Different activation function
    likelihood="bern",  # Different likelihood
    n_epochs=200,  # Further increased number of epochs
    batch_size=200,  # Larger batch size
    learning_rate=0.0001,  # Decreased learning rate
    beta_kl=2.0,  # Increased KL divergence weight
    use_gpu=True,  # Enabled GPU usage
    verbose=True  # Enabled verbosity
)

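# GridSearch selects the best model by a single metric evaluated on validation data,
# so eval_method should be built with val_data before gs_vae is used
gs_vae = GridSearch(
    model=biVAECF,
    space=[
        Discrete(name="k", values=[10, 15, 20, 25]),
        Discrete(name="encoder_structure", values=[[20]]),
        Discrete(name="n_epochs", values=[100, 150, 200]),
        Discrete(name="batch_size", values=[100, 150, 200])
    ],
    metric=metrics[0],  # NDCG@k
    eval_method=eval_method
)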
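To get a feeling for the cost of this grid, the combinations can be enumerated with `itertools.product`. This is only a sketch of what an exhaustive grid covers, not cornac's internal implementation; the `space` dict simply restates the values listed above.

```python
from itertools import product

# Same search space as in the GridSearch above, written as plain lists
space = {
    'k': [10, 15, 20, 25],
    'encoder_structure': [[20]],
    'n_epochs': [100, 150, 200],
    'batch_size': [100, 150, 200],
}

# Every combination of one value per parameter
combinations = [dict(zip(space, values)) for values in product(*space.values())]

print(len(combinations))  # 4 * 1 * 3 * 3 = 36
print(combinations[0])
```

Each combination means one full training run, so the grid above implies 36 trainings of the model.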


In [175]:
# Here we must pass the model variables where
# we defined the parameters

models = [ 
    #gs_vae
    biVAECF,
    #biVAECF_alt1,
    #biVAECF_alt2
]

In [176]:
"""total_users = ratio_split.train_set.num_users
total_items = ratio_split.train_set.num_items
logger.info(f'{total_users=}')
logger.info(f'{total_items=}')"""

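After setting up the metrics and models, we can export the test, train and validation data into the results directory. These export cells are currently disabled (wrapped in string literals); they refer to a ratio_split variable that is not defined in this version of the script.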

In [177]:
"""logger.info('Exporting test data...')
get_set_dataframe(
    dict(ratio_split.test_set.user_data),
    list(ratio_split.test_set.item_ids),
    list(ratio_split.test_set.user_ids),
).to_csv(results_dir / 'test_set.csv')"""

In [178]:
"""logger.info('Exporting train data...')
get_set_dataframe(
    dict(ratio_split.train_set.user_data),
    list(ratio_split.train_set.item_ids),
    list(ratio_split.train_set.user_ids),
).to_csv(results_dir / 'train_set.csv')"""

In [179]:
"""logger.info('Exporting validation data...')
get_set_dataframe(
    dict(ratio_split.val_set.user_data),
    list(ratio_split.val_set.item_ids),
    list(ratio_split.val_set.user_ids),
).to_csv(results_dir / 'val_set.csv')"""

And we run the experiments with the defined variables.

In [180]:
# Here we run the experiment
exp = Experiment(
    eval_method=eval_method,
    models=models,
    metrics=metrics,
    user_based=True,
    verbose=True
)
exp.run()

# `models` is a plain list, so it has no `evaluate` method; the per-model
# scores computed by the experiment are available in `exp.result`
print("Results:", exp.result)


[vibae_exp1] Training started!


  9%|▉         | 9/100 [00:05<00:57,  1.58it/s, loss_i=0.895, loss_u=7.2] 


KeyboardInterrupt: 

After running the experiment, we export the metrics obtained from the calculation into a csv using pandas.

In [None]:
# Here we export the metrics obtained by the experiment on the test
# and validation data (if it exists)

logger.info('Exporting metrics...')
metric_results = {
    exp.models[i].name: dict(exp.result[i].metric_avg_results)
    for i in range(len(models))
}
(DataFrame(metric_results)
 .reset_index()
 .rename(columns={'index': 'metric'})
 .to_csv(results_dir / 'metric_results.csv'))

Finally, we export the recommendations with a set of nested loops:
- We loop over the models of the experiment.
- For each model, we loop over cornac's user map to get both the original id and the internal index of each user.
- We get the scores of all items for that user.
- We get the top k items by combining argsort with a reversal of the result.
- We loop over cornac's item map to get both the original id and the internal index of each item, skipping items outside the top k.
- We read the score computed by cornac, or use nan in case of an IndexError.
- We append the user and item ids and indexes, together with the score, to the result list.
- Once all users of a model are processed, we export that model's results into a csv file.
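The top k selection in the steps above can be sketched in plain Python as follows. This is a stand-in for the NumPy argsort-and-reverse idiom used in the script, with a hypothetical function name and toy scores; the order of tied scores may differ from NumPy's.

```python
def top_k_items(scores, k):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    return ranked[:k]


scores = [0.1, 0.9, 0.4, 0.7]
top = top_k_items(scores, k=2)
# Pair each selected index with its score, as the export loop does per user
rows = [{'item_index': idx, 'score': scores[idx]} for idx in top]
print(rows)
```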

In [None]:
# Export recommendations

logger.info('Processing models...')
for model in exp.models:
    model_result = []
    logger.info(f'Getting scores for {model.name}...')

    for user_id, user_index in tqdm(exp.eval_method.train_set.uid_map.items()):
        try:
            scores = model.score(user_index)
        except ScoreException:
            logger.error(f"{model.name}: Couldn't predict for user {user_index} ({user_id=})")
            continue

        # Indices of the k highest scores; a set keeps the membership test below fast
        top_items = set(scores.argsort()[::-1][:k].tolist())

        for item_id, item_index in exp.eval_method.train_set.iid_map.items():
            if item_index not in top_items:
                continue

            try:
                score = scores[item_index]
            except IndexError:
                logger.error(
                    f"{model.name}: No score for item {item_index} ({item_id=}) in user {user_index} ({user_id=})"
                )
                score = nan

            model_result.append({
                'user_id': user_id,
                'user_index': user_index,
                'item_id': item_id,
                'item_index': item_index,
                'score': score
            })

    logger.info(f'Exporting {model.name}...')
    (DataFrame(model_result)
     .sort_values(by=['user_id', 'score'], ascending=[True, False])
     .to_csv(results_dir / f'{model.name}.csv'))