Basic recpack implementation with pipeline

In [1]:
import tqdm as notebook_tqdm
import sys
import functions as f
import pandas as pd
import numpy as np

from recpack.pipelines import PipelineBuilder
from recpack.scenarios import WeakGeneralization
from recpack.preprocessing.preprocessors import DataFramePreprocessor
from recpack.preprocessing.filters import MinItemsPerUser, MinUsersPerItem



ImportError: cannot import name 'Self' from 'typing_extensions' (/usr/lib/python3/dist-packages/typing_extensions.py)

In [None]:
transactions_path2 = '../../00 - Data/transactions/transactions_train.csv'
transactions3 = pd.read_csv(transactions_path2)
len(transactions3)

In [None]:
sample = 0.005
transactions_sample = transactions3.sample(frac=sample, random_state=42)
len(transactions_sample)

In [None]:
#turns pandas dataframe into interaction-matrix object
#       item1   item2   item3
#usr1      x                x
#usr2       x       x
proc = DataFramePreprocessor(item_ix='article_id', user_ix='customer_id', timestamp_ix='t_dat')

# Keep only items that were bought by at least 2 users
proc.add_filter(MinUsersPerItem(2, item_ix='article_id', user_ix='customer_id'))
# Keep only users who bought at least 2 items
proc.add_filter(MinItemsPerUser(2, item_ix='article_id', user_ix='customer_id'))

interaction_matrix = proc.process(transactions_sample)
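
To make the effect of the two filters concrete, here is a minimal pure-Python sketch that applies them in the same order to a toy list of (user, item) pairs. The helper names `min_users_per_item` and `min_items_per_user` are hypothetical and are not recpack functions; the sketch assumes each filter runs once, in the order added, and counts distinct users or items, which may differ in detail from what recpack does internally.

```python
from collections import defaultdict

def min_users_per_item(interactions, min_users):
    """Keep interactions whose item was bought by at least min_users distinct users."""
    users_by_item = defaultdict(set)
    for user, item in interactions:
        users_by_item[item].add(user)
    return [(u, i) for u, i in interactions if len(users_by_item[i]) >= min_users]

def min_items_per_user(interactions, min_items):
    """Keep interactions whose user bought at least min_items distinct items."""
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)
    return [(u, i) for u, i in interactions if len(items_by_user[u]) >= min_items]

# Toy data (made up for illustration): item "c" has a single buyer
toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "c"), ("u3", "b"), ("u3", "a")]

step1 = min_users_per_item(toy, 2)     # drops ("u2", "c")
filtered = min_items_per_user(step1, 2)  # u2 now has one item left, so u2 is dropped
print(filtered)
```

Because the filters run one after the other, removing a rare item can push a user below the threshold of the second filter, so the order in which filters are added matters.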

In [None]:
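# Split each user's interactions into train (75%) and test (25%);
# validation=True additionally holds out part of the training data for parameter tuning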
scenario = WeakGeneralization(0.75, validation=True)
scenario.split(interaction_matrix)

# PipelineBuilder is already the class imported above, so instantiate it directly
builder = PipelineBuilder()
builder.set_data_from_scenario(scenario)
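
As I understand the weak generalization setup, the split happens inside each user's history, so every user shows up on both the train and the test side. The sketch below illustrates that idea on toy data in plain Python. The function `split_per_user` is a hypothetical helper, not part of recpack, and the rounding and shuffling choices are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

def split_per_user(interactions, frac_in, seed=0):
    """Randomly assign frac_in of each user's interactions to data_in, the rest to data_out."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    data_in, data_out = [], []
    for user, items in by_user.items():
        shuffled = items[:]
        rng.shuffle(shuffled)
        cut = math.ceil(len(shuffled) * frac_in)  # assumption: round the train share up
        data_in += [(user, i) for i in shuffled[:cut]]
        data_out += [(user, i) for i in shuffled[cut:]]
    return data_in, data_out

# Toy data: u1 has 4 interactions, u2 has 8
toy = [("u1", f"i{n}") for n in range(4)] + [("u2", f"i{n}") for n in range(8)]
train, test = split_per_user(toy, frac_in=0.75)
print(len(train), len(test))
```

With `validation=True`, the same kind of split would be applied a second time to the training part to obtain a validation set.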

NDCG@K:

First, you calculate the Discounted Cumulative Gain (DCG) at K: the sum of the relevance scores of the top-K recommended items, each discounted by its position in the list (typically by dividing by the logarithm of the rank), so that hits near the top count more. Relevance scores are often binary (relevant or not relevant) or graded (e.g., on a scale from 1 to 5).
Then, you calculate the Ideal DCG (IDCG) at K, which is the DCG of a perfect ranking, i.e. the best score achievable if the most relevant items filled the top positions.
Finally, you compute NDCG@K as the ratio of DCG@K to IDCG@K, which normalizes the score to lie between 0 and 1. A higher NDCG@K indicates better recommendations.
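
The three steps (DCG, IDCG, ratio) can be sketched for a single user as follows. This is a minimal illustration assuming binary relevance and the common log2(rank + 1) discount; the function names are hypothetical and recpack's own NDCGK implementation may differ in detail.

```python
import math

def dcg_at_k(relevances, k):
    """Sum the relevance of the top-k positions, each divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(recommended, relevant, k):
    """NDCG@K for one user, with relevance 1 if the item is in `relevant`, else 0."""
    gains = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    # Ideal ranking: all relevant items (at most k of them) sit at the top
    ideal_hits = min(len(relevant), k)
    idcg = dcg_at_k([1.0] * ideal_hits, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Toy example: two of the three relevant items are recommended, at ranks 2 and 4
recommended = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}
score = ndcg_at_k(recommended, relevant, k=4)
print(round(score, 3))
```

A system-level NDCG@K is then usually the average of this per-user score over all test users.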
Coverage@K:

Coverage is a metric that measures how diverse or comprehensive a recommendation system is in terms of the items it suggests.
The "@K" in this metric signifies that it is calculated for the top K recommendations.
The idea is to assess the ability of the system to cover a wide range of items in its recommendations, not just focusing on a few popular items.
The Coverage@K metric can be calculated in various ways, but a common approach is to count the unique items that appear in the top-K recommendations across all users, usually expressed as a fraction of the total number of items in the catalogue. A higher Coverage@K indicates that the recommendations cover a larger variety of items.
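
A minimal sketch of that counting approach is shown here. It assumes the count of unique recommended items is divided by the catalogue size so the result lies between 0 and 1; `coverage_at_k` is a hypothetical helper and the toy data is made up.

```python
def coverage_at_k(recommendations, n_items, k):
    """Fraction of the catalogue that appears in at least one user's top-k list."""
    unique_items = set()
    for items in recommendations.values():
        unique_items.update(items[:k])
    return len(unique_items) / n_items

# Toy example: ranked recommendations for three users, catalogue of 10 items
recs = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b", "d"],
    "u3": ["a", "c", "e"],
}
cov = coverage_at_k(recs, n_items=10, k=2)
print(cov)
```

A popularity baseline that gives every user the same top-K list would score only K divided by the catalogue size here, which is why coverage is a useful complement to an accuracy metric.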

In [None]:
# Add the algorithms to evaluate. Popularity is the baseline: it simply recommends the most popular items
builder.add_algorithm('Popularity')

# #ItemKNN: recommend items similar to those a user already bought, keeping the K nearest neighbours per item (cosine or conditional-probability similarity)
# builder.add_algorithm('ItemKNN', grid={
#     'K': [100, 200, 500],
#     'similarity': ['cosine', 'conditional_probability'],
# })

# builder.add_algorithm('KUNN')

# Set the metric used to optimise algorithm hyperparameters: NDCGK (NDCG@K, explained above), here with K=10
builder.set_optimisation_metric('NDCGK', K=10)

#adds metric for evaluation
#NDCGK = Normalized Discounted Cumulative Gain at K
builder.add_metric('NDCGK', K=[10, 20, 50])
builder.add_metric('CoverageK', K=[10, 20])

In [None]:
pipeline = builder.build()
pipeline.run()

For the ItemKNN algorithm at K=10, coverage is fairly high (0.77), so the recommendations span a wide variety of items.
The NDCGK score, however, is low, meaning that few of the recommended items are ones the users actually interacted with. This leads me to believe that I am recommending too many novel items.


METRICS FOR transactions_train_short.parquet

In [None]:
pipeline.get_metrics()

In [None]:
pipeline.optimisation_results

METRICS FOR transactions_train_short.parquet run2

In [None]:
pipeline.get_metrics()

In [None]:
pipeline.optimisation_results