# Item Similarity: Comparison

This notebook compares item similarity algorithms from the recpack library on recall, so that we can choose the best-performing one for candidate generation.

It extends last week's notebook by incorporating the feedback we received. Bjorn suggested trying the `TimedLastItemPrediction` scenario instead of `Timed` to see whether the scores improve.

In [2]:
import numpy as np
import pandas as pd
from recpack.preprocessing.preprocessors import DataFramePreprocessor
from recpack.preprocessing.filters import MinItemsPerUser, MinUsersPerItem
from recpack.scenarios import Timed, TimedLastItemPrediction
from recpack.pipelines import PipelineBuilder

# import utils file from previous lecture
import sys
sys.path.append('../lecture4')
from utils import DATA_PATH, customer_hex_id_to_int

In [3]:
transactions = pd.read_parquet(f'{DATA_PATH}/transactions_train.parquet')
# customers = pd.read_parquet(f'{DATA_PATH}/customers.parquet')
# articles = pd.read_parquet(f'{DATA_PATH}/articles.parquet')

In [4]:
test_week = transactions.week.max()
transactions = transactions[transactions.week > test_week - 10]

In [13]:
# print the number of unique customers and articles
print(f'Unique customers: {transactions["customer_id"].nunique()}')
print(f'Unique articles: {transactions["article_id"].nunique()}')

Unique customers: 437365
Unique articles: 38331


# Results from last week: `ItemKNN` and `TARSItemKNN`

To make the comparison with last week's results easier, I rerun last week's best `ItemKNN` and `TARSItemKNN` configurations in the `Timed` scenario, so the results are visible here.

## Preprocessing

In [5]:
proc = DataFramePreprocessor(item_ix='article_id', user_ix='customer_id', timestamp_ix='week')
proc.add_filter(MinUsersPerItem(10, item_ix='article_id', user_ix='customer_id'))
proc.add_filter(MinItemsPerUser(10, item_ix='article_id', user_ix='customer_id'))

interaction_matrix = proc.process(transactions)

  0%|          | 0/1228106 [00:00<?, ?it/s]

  0%|          | 0/1228106 [00:00<?, ?it/s]
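To make the effect of the two filters concrete, here is a minimal pure-Python sketch of the same idea on a toy interaction list. The helper names and the toy data are made up for illustration, and recpack's `MinUsersPerItem` and `MinItemsPerUser` may differ in details such as how duplicate interactions are counted.

```python
from collections import defaultdict


def filter_min_users_per_item(interactions, min_users):
    """Keep interactions whose item was bought by at least `min_users` distinct users."""
    users_per_item = defaultdict(set)
    for user, item in interactions:
        users_per_item[item].add(user)
    return [(u, i) for u, i in interactions if len(users_per_item[i]) >= min_users]


def filter_min_items_per_user(interactions, min_items):
    """Keep interactions whose user bought at least `min_items` distinct items."""
    items_per_user = defaultdict(set)
    for user, item in interactions:
        items_per_user[user].add(item)
    return [(u, i) for u, i in interactions if len(items_per_user[u]) >= min_items]


# Hypothetical toy data: (customer_id, article_id) pairs.
toy = [
    ("u1", "i1"), ("u1", "i2"),
    ("u2", "i1"), ("u2", "i2"),
    ("u3", "i1"), ("u3", "i3"),
]

# Apply the filters in the same order as in the preprocessing cell.
after_item_filter = filter_min_users_per_item(toy, min_users=2)
after_user_filter = filter_min_items_per_user(after_item_filter, min_items=2)
print(after_user_filter)
```

In this sketch each filter runs once, in order. Removing users in the second step can push an item back below the item threshold, so the order of the filters can matter.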

In [15]:
# train on everything < test_week, test on test_week
scenario = Timed(t=test_week, delta_out=None, delta_in=None, validation=False)
scenario.split(interaction_matrix)
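As a rough illustration of what a timed split does, the sketch below separates a toy interaction list at week `t`. It is a simplification under my own assumptions, not recpack's implementation: `timed_split` is a hypothetical helper, and restricting the test set to users with interactions on both sides of `t` is an assumption made here so that every test user has both a history and a target.

```python
def timed_split(interactions, t):
    """Split (user, item, week) tuples into train, test-in and test-out at week t."""
    train = [x for x in interactions if x[2] < t]
    future = [x for x in interactions if x[2] >= t]

    # Assumption: only users with both a history and future interactions are evaluated.
    test_users = {u for u, _, _ in train} & {u for u, _, _ in future}
    test_in = [x for x in train if x[0] in test_users]
    test_out = [x for x in future if x[0] in test_users]
    return train, test_in, test_out


# Hypothetical toy data: (customer_id, article_id, week).
toy = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3), ("u1", "d", 3),
    ("u2", "a", 1), ("u2", "e", 2),
    ("u3", "f", 3),
]

train, test_in, test_out = timed_split(toy, t=3)
print(test_out)
```

Note that `u1` has two target items in week 3. In this kind of split, every interaction in the test week is a target.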

In [16]:
builder = PipelineBuilder()
builder.set_data_from_scenario(scenario)

builder.add_algorithm('ItemKNN', params={
    'K': 90,  
    'similarity': 'cosine',
    'normalize_X': False,
    'normalize_sim': True
})

builder.add_algorithm('TARSItemKNN', params={
    'K': 580, 
    'similarity': 'cosine',
    'fit_decay': 1/10,
    'predict_decay': 1/3,
})

builder.add_metric('PrecisionK', K=[12, 20, 30, 40])
builder.add_metric('RecallK', K=[12, 20, 30, 40])

builder.set_optimisation_metric('RecallK', K=12)

In [17]:
import warnings
from scipy.sparse import SparseEfficiencyWarning
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", SparseEfficiencyWarning)

pipeline = builder.build()
pipeline.run()

  0%|          | 0/2 [00:00<?, ?it/s]

2022-11-26 15:50:16,723 - base - recpack - INFO - Fitting ItemKNN complete - Took 1.46s
2022-11-26 15:50:30,642 - base - recpack - INFO - Fitting TARSItemKNN complete - Took 8.96s


In [18]:
pipeline.get_metrics()

| algorithm | precisionk_12 | precisionk_20 | precisionk_30 | precisionk_40 | recallk_12 | recallk_20 | recallk_30 | recallk_40 |
|---|---|---|---|---|---|---|---|---|
| `ItemKNN(K=90,normalize_X=False,normalize_sim=True,pop_discount=None,similarity=cosine)` | 0.007474 | 0.006124 | 0.005108 | 0.004458 | 0.024356 | 0.033519 | 0.040821 | 0.047116 |
| `TARSItemKNN(K=580,fit_decay=0.1,predict_decay=0.3333333333333333,similarity=cosine)` | 0.00838 | 0.00671 | 0.005677 | 0.005007 | 0.028117 | 0.036982 | 0.045927 | 0.053186 |


## Try `TimedLastItemPrediction` scenario

In [23]:
# train on everything < test_week - 1, validate on test_week - 1, test on test_week
scenario = TimedLastItemPrediction(t=test_week, t_validation=test_week - 1, validation=True)
scenario.split(interaction_matrix)
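To see how the targets in this scenario differ from those in `Timed`, here is a small sketch of a last-item split on toy data. It is an illustration under stated assumptions, not recpack's code: `last_item_split` is a hypothetical helper, the input is assumed to be in chronological order, and ties within the same week are broken by input order. The notebook does not show how recpack itself breaks such ties when timestamps are weeks.

```python
from collections import defaultdict


def last_item_split(interactions, t):
    """For users whose last interaction is at or after week t, hold out that last item."""
    events = defaultdict(list)
    for user, item, week in interactions:  # assumed to be in chronological order
        events[user].append((item, week))

    test_in, test_out = {}, {}
    for user, history in events.items():
        last_item, last_week = history[-1]
        # Assumption: a user needs at least one earlier interaction to be evaluated.
        if last_week >= t and len(history) > 1:
            test_in[user] = [item for item, _ in history[:-1]]
            test_out[user] = last_item
    return test_in, test_out


# Hypothetical toy data: (customer_id, article_id, week).
toy = [
    ("u1", "a", 1), ("u1", "c", 3), ("u1", "d", 3),
    ("u2", "e", 3),
    ("u3", "f", 1), ("u3", "g", 2),
]

test_in, test_out = last_item_split(toy, t=3)
print(test_in, test_out)
```

Here `u1` buys two items in week 3, but only `d` becomes a target, while `c` moves into the history.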


In [24]:
builder = PipelineBuilder()
builder.set_data_from_scenario(scenario)

# [50, 600] => best: ItemKNN(K=90,normalize_X=False,normalize_sim=True,pop_discount=None,similarity=cosine), Recall12=0.024356
builder.add_algorithm('ItemKNN', grid={
    'K': list(range(50, 150, 10)),
    'similarity': ['cosine'],
    'normalize_X': [True, False],
    'normalize_sim': [True]
})

# [50, 600] => best: TARSItemKNN(K=580,fit_decay=0.1,predict_decay=0.3333333333333333,similarity=cosine), Recall12=0.028117
builder.add_algorithm('TARSItemKNN', grid={
    'K': list(range(570, 680, 10)),
    'similarity': ['cosine'],
    'fit_decay': [1/2, 1/5, 1/10],
    'predict_decay': [1/3, 1/5, 1/10],
})


builder.add_metric('PrecisionK', K=[12, 20, 30, 40])
builder.add_metric('RecallK', K=[12, 20, 30, 40])

builder.set_optimisation_metric('RecallK', K=12)

In [25]:
import warnings
from scipy.sparse import SparseEfficiencyWarning
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", SparseEfficiencyWarning)

pipeline = builder.build()
pipeline.run()

  0%|          | 0/2 [00:00<?, ?it/s]

2022-11-26 16:01:05,776 - base - recpack - INFO - Fitting ItemKNN complete - Took 0.88s
2022-11-26 16:01:07,874 - base - recpack - INFO - Fitting ItemKNN complete - Took 0.9s
2022-11-26 16:01:10,167 - base - recpack - INFO - Fitting ItemKNN complete - Took 0.95s
2022-11-26 16:01:12,521 - base - recpack - INFO - Fitting ItemKNN complete - Took 0.981s
2022-11-26 16:01:14,937 - base - recpack - INFO - Fitting ItemKNN complete - Took 1.07s
2022-11-26 16:01:17,664 - base - recpack - INFO - Fitting ItemKNN complete - Took 1.09s
2022-11-26 16:01:20,344 - base - recpack - INFO - Fitting ItemKNN complete - Took 1.15s
2022-11-26 16:01:23,267 - base - recpack - INFO - Fitting ItemKNN complete - Took 1.17s
2022-11-26 16:01:26,271 - base - recpack - INFO - Fitting ItemKNN complete - Took 1.25s
2022-11-26 16:01:29,442 - base - recpack - INFO - Fitting ItemKNN complete - Took 1.26s
2022-11-26 16:01:32,692 - base - recpack - INFO - Fitting ItemKNN complete - Took 1.39s
2022-11-26 16:01:36,262 - base -

In [27]:
pipeline.optimisation_results

| | identifier | params | recallk_12 |
|---|---|---|---|
| 0 | `ItemKNN(K=50,normalize_X=True,normalize_sim=Tr...` | `{'K': 50, 'normalize_X': True, 'normalize_sim'...` | 0.056492 |
| 1 | `ItemKNN(K=50,normalize_X=False,normalize_sim=T...` | `{'K': 50, 'normalize_X': False, 'normalize_sim...` | 0.076311 |
| 2 | `ItemKNN(K=60,normalize_X=True,normalize_sim=Tr...` | `{'K': 60, 'normalize_X': True, 'normalize_sim'...` | 0.056344 |
| 3 | `ItemKNN(K=60,normalize_X=False,normalize_sim=T...` | `{'K': 60, 'normalize_X': False, 'normalize_sim...` | 0.076014 |
| 4 | `ItemKNN(K=70,normalize_X=True,normalize_sim=Tr...` | `{'K': 70, 'normalize_X': True, 'normalize_sim'...` | 0.056294 |
| ... | ... | ... | ... |
| 114 | `TARSItemKNN(K=670,fit_decay=0.2,predict_decay=...` | `{'K': 670, 'fit_decay': 0.2, 'predict_decay': ...` | 0.091731 |
| 115 | `TARSItemKNN(K=670,fit_decay=0.2,predict_decay=...` | `{'K': 670, 'fit_decay': 0.2, 'predict_decay': ...` | 0.083576 |
| 116 | `TARSItemKNN(K=670,fit_decay=0.1,predict_decay=...` | `{'K': 670, 'fit_decay': 0.1, 'predict_decay': ...` | 0.099244 |
| 117 | `TARSItemKNN(K=670,fit_decay=0.1,predict_decay=...` | `{'K': 670, 'fit_decay': 0.1, 'predict_decay': ...` | 0.094894 |


In [28]:
pipeline.get_metrics()

| algorithm | precisionk_12 | precisionk_20 | precisionk_30 | precisionk_40 | recallk_12 | recallk_20 | recallk_30 | recallk_40 |
|---|---|---|---|---|---|---|---|---|
| `ItemKNN(K=50,normalize_X=False,normalize_sim=True,pop_discount=None,similarity=cosine)` | 0.006263 | 0.004779 | 0.003736 | 0.003153 | 0.075152 | 0.095571 | 0.112066 | 0.12612 |
| `TARSItemKNN(K=610,fit_decay=0.1,predict_decay=0.3333333333333333,similarity=cosine)` | 0.008543 | 0.006171 | 0.004791 | 0.003918 | 0.102519 | 0.123416 | 0.143728 | 0.156722 |


### Conclusion

Recall@12 is much higher in the `TimedLastItemPrediction` scenario than in the `Timed` scenario: for the best `TARSItemKNN` configuration it rises from 0.028 to 0.103, and for the best `ItemKNN` configuration from 0.024 to 0.075. However, I was not sure whether this scenario applies to our use case, so I checked the recpack documentation on `TimedLastItemPrediction`, which says the following:

*"Predict users’ last interaction, given information about historical interactions. ... The scenario splits the data such that the last interaction of a user is the target for prediction, while the earlier ones are used for training and as history."*

From this description, the scenario does not seem well suited to what we are trying to achieve. The H&M recommendation problem requires us to predict all interactions a user makes in week $t + 1$, given their interaction history up to week $t$. The `TimedLastItemPrediction` scenario, however, targets only a single interaction per user, namely the last one. With one target per user, recall@12 becomes a hit rate, so the higher scores are not directly comparable with those from the `Timed` scenario and do not by themselves show that the candidates are better.
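The effect of a single target on the metrics can be checked with a few lines of arithmetic. The sketch below uses the usual per-user definitions, precision@K = hits / K and recall@K = hits / number of relevant items. I assume these match recpack's `PrecisionK` and `RecallK` up to averaging over users. `precision_recall_at_k` is a hypothetical helper, and the reported values are copied from the `TARSItemKNN` row of the `TimedLastItemPrediction` results.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Per-user precision@k and recall@k for a ranked list and a set of relevant items."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k, hits / len(relevant)


recs = ["a", "b", "c", "d"]  # hypothetical ranked recommendations

# Several targets, as in a Timed-style split.
p_multi, r_multi = precision_recall_at_k(recs, {"b", "d", "z"}, k=2)

# One target, as in a last-item split: recall is either 0 or 1 (a hit rate).
p_single, r_single = precision_recall_at_k(recs, {"b"}, k=2)

# With exactly one target per user, recall@k equals k * precision@k.
print(r_single == 2 * p_single)

# Check the same identity on the reported TimedLastItemPrediction numbers.
reported_precision_12 = 0.008543
reported_recall_12 = 0.102519
gap = abs(12 * reported_precision_12 - reported_recall_12)
print(gap)
```

The reported numbers satisfy the identity up to rounding, which is consistent with one target per user. In the `Timed` results, 12 × 0.00838 ≈ 0.101 is far above the reported recall@12 of 0.028, as expected when users have several target items.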