# Next basket recommendation - Top frequency

The task of next basket recommendation is to predict the contents of a customer's basket at their next purchase.

The task was assigned as a competition among students of the Algorithms of Data Mining course at the Faculty of Information Technology, Czech Technical University in Prague.

### Scoring function

The competition was divided into two rounds, which differ in the scoring function used. The scoring function of the 1st round is the Jaccard similarity coefficient

$J(A,B) = \frac{ |A \cap B| }{ |A \cup B| }$,

where $A$ is the real basket and $B$ is the predicted one. The final score is the mean of $J(A,B)$ over all predictions. The scoring function of the 2nd round is the generalized Jaccard similarity coefficient over multisets

$J_g(A,B) = \frac{ \sum_i \min(a_i, b_i) }{ \sum_i \max(a_i, b_i) }$,

where $A$ is the real basket and $B$ is the predicted one, and $a_i$ (resp. $b_i$) is the number of occurrences of the $i$-th product in $A$ (resp. $B$). The final score is the mean of $J_g(A,B)$ over all predictions. The generalized Jaccard index takes into account baskets that contain multiple occurrences of the same product, so this scoring function is stricter.
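
Both scoring functions can be sketched in a few lines of Python. This is only an illustration, not the competition's evaluation code: the function names `jaccard` and `generalized_jaccard`, the predicted basket, and the convention of scoring two empty baskets as 1.0 are assumptions.

```python
from collections import Counter


def jaccard(real, predicted):
    """Jaccard similarity of two baskets treated as sets."""
    a, b = set(real), set(predicted)
    if not a and not b:
        return 1.0  # assumption: two empty baskets count as a perfect match
    return len(a & b) / len(a | b)


def generalized_jaccard(real, predicted):
    """Generalized Jaccard similarity of two baskets treated as multisets."""
    a, b = Counter(real), Counter(predicted)
    if not a and not b:
        return 1.0  # same assumption as above
    # Counter & Counter keeps the minimum count, Counter | Counter the maximum.
    return sum((a & b).values()) / sum((a | b).values())


real = [21028, 21028, 14939, 15823]  # a basket with a repeated product
pred = [21028, 14939, 39620]         # made-up prediction

print(jaccard(real, pred))              # sets: 2 shared / 4 distinct products
print(generalized_jaccard(real, pred))  # multisets: 2 shared / 5 in the union
```

The repeated product lowers the score from 0.5 to 0.4 here, which shows why the second round is stricter.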

Score in 1st round: 0.188, position: 7/46

Score in 2nd round: TBA


### Disclaimer

Unfortunately, I cannot provide the data for this problem, in order to avoid any legal issues. The data was provided by an external company in collaboration with the university.

Even though I cannot provide the dataset, I will still try to capture my thought process, and all the ideas and examples will be shown.

## Dataset

We were provided with two CSV files.

The first file holds the basket history. It contains 3 columns: `userid`, `date` and `itemids`. `userid` is the ID number of a user, `date` is the date of a purchase and `itemids` is a space-separated list of product IDs. The basket history file comes in 2 sizes: the smaller one contains 1 700 000+ rows and the larger one contains 8 800 000+ rows.

| userid | date       | itemids           |
|--------|------------|-------------------|
| 12345  | 1995-07-30 | 11111 22222       |
| 777777 | 2022-01-01 | 12314             |
| 425645 | 2020-12-31 | 45646 46511 11111 |

The second file contains information about products. Each row represents one product, described by 25+ features. The total number of products is around 1500.

| productid 	 | features.... 	 |
|-------------|----------------|
| 11111     	 | features     	 |
| 22222     	 | features     	 |


# TOP frequency

This notebook focuses purely on TOP recommendation methods, which are widely used as baseline methods for next basket recommendation.

First, we preprocess and prepare the data for prediction. After that, a series of methods and improvements is shown and evaluated on sample data. Lastly, we scale to the full dataset and predict for the test data.

### Imports

`common.py` stores functions that would otherwise clutter the notebook.

In [17]:
import pandas as pd
import numpy as np

from common import expl_ratio
from common import dataframe_prediction, dataframe_score

First, we load the smaller dataset.

In [18]:
df = pd.read_csv('data/train100k.csv')
df.head()

|   | userid  | date       | itemids                       |
|---|---------|------------|-------------------------------|
| 0 | 7226385 | 2019-01-22 | 42203 41183 15823 39620       |
| 1 | 7226385 | 2019-02-12 | 54231 14939 39462             |
| 2 | 7226385 | 2019-03-11 | 15823 21028 39620 52846       |
| 3 | 7226385 | 2019-04-03 | 14939 39620 27542 21028 19353 |
| 4 | 7226385 | 2019-05-23 | 21028 21028 14939 15823       |


In [19]:
df.shape

(1711877, 3)

## Data preprocessing

In the first step, we convert the date to datetime and split the IDs of bought items into lists for the entire dataset.

In [20]:
df = (
    df
    .assign(
        date=lambda x: pd.to_datetime(x['date'], infer_datetime_format=True),
        product_id=lambda x: x['itemids'].str.split(),
    )
    .drop(columns=['itemids'])
    .astype({'userid':'uint32'})
)

The smaller dataset contains 1.7 million baskets. We will work with a sample consisting of its first 50,000 rows.

In [21]:
sample = df[:50000]

The baskets of each user are already sorted by date. As the validation dataset, we use the last purchased basket of every user. The rest of the baskets are used for training.

In [22]:
valid_df = (
    sample
        .groupby('userid')
        .last()
        .reset_index()
)
train_df = (
    sample
        .groupby('userid')
        .apply(lambda x: x.iloc[:-1])
        .reset_index(drop=True)
)
print(valid_df.shape)
print(train_df.shape)

(2942, 3)
(47058, 3)
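
The same "last basket per user" split can also be expressed with a boolean mask, which avoids `groupby().apply()`. The following is a minimal sketch on made-up data, assuming (as above) that the rows of each user are already sorted by date. The names `toy`, `toy_valid` and `toy_train` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'userid': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2020-01-05', '2020-02-10', '2020-03-01',
                            '2020-01-20', '2020-04-15']),
    'product_id': [['11'], ['22', '33'], ['11', '22'], ['44'], ['44', '55']],
})

# duplicated(keep='last') marks every row of a user except the last one,
# so its negation is True exactly for the last basket of each user.
is_last = ~toy.duplicated(subset='userid', keep='last')

toy_valid = toy[is_last].reset_index(drop=True)   # one row per user
toy_train = toy[~is_last].reset_index(drop=True)  # all earlier baskets

print(toy_valid.shape, toy_train.shape)
```

As with the split above, a user with a single basket ends up only in the validation set.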


In the next step, we remove duplicate items from each basket. Then, for each basket of each user, we calculate the difference in months from that user's validation basket; this is used later for weighting. Lastly, we explode the baskets so that each bought product is in its own row.

In [23]:
exploded_train_df = (
        train_df
        .merge(valid_df[['userid','date']], how='inner', on='userid', suffixes=('','_last'))
        .assign(
            product_id=lambda x: x['product_id'].apply(lambda l:list(set(l))),
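            # 30.436875 days is one average month, the value np.timedelta64(1, 'M') stood for;
            # newer pandas versions reject month units in timedelta arithmetic.
            monthdiff=lambda x: (x['date_last'] - x['date']).dt.days / 30.436875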
        )
        .explode('product_id')
        .astype({'product_id':'uint16'})
        .reset_index(drop=True)
)
exploded_train_df.head()

|   | userid  | date       | product_id | date_last  | monthdiff |
|---|---------|------------|------------|------------|-----------|
| 0 | 1002915 | 2020-04-02 | 12043      | 2020-12-11 | 8.312286  |
| 1 | 1002915 | 2020-04-02 | 44895      | 2020-12-11 | 8.312286  |
| 2 | 1002915 | 2020-04-02 | 14291      | 2020-12-11 | 8.312286  |
| 3 | 1002915 | 2020-04-02 | 39917      | 2020-12-11 | 8.312286  |
| 4 | 1002915 | 2020-04-02 | 18339      | 2020-12-11 | 8.312286  |


### Basket size

For the basket size, the mean and median number of items per basket are calculated for each user.

In [24]:
basket_size = (
    exploded_train_df[['userid', 'product_id', 'date']]
            .groupby(['userid','date'])
            .count()
            .reset_index()
            .groupby('userid')
            ['product_id']
            .agg(mean='mean', median='median')
)
basket_size.head()

| userid  | mean     | median |
|---------|----------|--------|
| 1002915 | 3.272727 | 3.0    |
| 1007942 | 10.0     | 11.0   |
| 1008516 | 2.125    | 2.0    |
| 1009684 | 5.615385 | 5.0    |
| 1013330 | 1.5      | 1.0    |


## Predicting next basket

Now we move on to predicting next baskets.

A number of methods and improvements are shown and discussed.

### TOP Generalized frequencies

For each product, the number of purchases is counted across all users. Prediction is then simple: take the top-$K$ most bought products, where $K$ is the user's basket size, and return them as the prediction.

In [25]:
gfreqs = (
    exploded_train_df[['product_id', 'date']]
        .groupby('product_id')
        .agg(freq=('date','count'))
        .astype({'freq':'uint32'})
        .sort_values(by='freq', ascending=False)
)
gfreqs.head()

| product_id | freq |
|------------|------|
| 10985      | 1559 |
| 54684      | 1475 |
| 31758      | 1444 |
| 43437      | 1306 |
| 47139      | 1289 |


In [26]:
gfreq_test = dataframe_prediction(valid_df, 'gfreq', gfreqs, basket_size.loc[:,'median'])
print('G-topfreq: ', dataframe_score(gfreq_test))

G-topfreq:  0.018057837267940368


As expected, this model is very weak because it ignores the differences between customers.
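
The `dataframe_prediction` helper lives in `common.py` and is not shown here. As a rough sketch of the top-$K$ idea only, the following hypothetical `predict_global_top_k` function returns the $K$ most bought products for each user. The rounding of the basket size and the minimum of one predicted product are assumptions, not necessarily what `common.py` does.

```python
import pandas as pd

# Toy global frequency table, sorted from most to least bought (as `gfreqs` is).
toy_gfreqs = pd.DataFrame(
    {'freq': [1559, 1475, 1444, 1306]},
    index=pd.Index([10985, 54684, 31758, 43437], name='product_id'),
)
# Toy median basket size per user.
toy_sizes = pd.Series({1002915: 3.0, 1013330: 1.0}, name='median')


def predict_global_top_k(freqs, sizes):
    """Hypothetical helper: give each user the k most bought products overall."""
    ranked = freqs.index.tolist()
    # Assumption: round the basket size and always predict at least one product.
    return {user: ranked[:max(1, int(round(k)))] for user, k in sizes.items()}


toy_pred = predict_global_top_k(toy_gfreqs, toy_sizes)
print(toy_pred)
```

Every user receives a prefix of the same ranking, which is why the score is so low: only the basket length is personalized.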

### TOP Personalized frequencies

A huge improvement is gained when purchase counts are calculated from each user's own basket history, so that predictions are based solely on the user's past preferences. The disadvantage of this approach is that it can never recommend a product the user has not bought before.

In [27]:
pfreqs = (
    exploded_train_df[['userid','product_id','date']]
        .groupby(['userid','product_id'])
        .agg(freq=('date','count'))
        .astype({'freq':'uint32'})
        .sort_values(by=['userid','freq'], ascending=[True,False])
)
pfreqs.head()

| userid  | product_id | freq |
|---------|------------|------|
| 1002915 | 18339      | 4    |
| 1002915 | 20262      | 4    |
| 1002915 | 55167      | 3    |
| 1002915 | 14291      | 2    |
| 1002915 | 34720      | 2    |


In [28]:
pfreq_test = dataframe_prediction(valid_df, 'pfreq', pfreqs, basket_size.loc[:,'median'])
print('P-topfreq: ', dataframe_score(pfreq_test))

P-topfreq:  0.17351118426603324


Compared with the generalized TOP method, the personalized TOP frequencies score about 10x higher on the sample used (0.174 vs. 0.018).

### Adding weighted frequencies

Back in preprocessing, we calculated the month difference between each of a user's past baskets and the validation basket. The expectation is that a user's taste changes over time, so products that have not been bought recently should be predicted less often. To achieve this, frequencies are not simply incremented by one for each occurrence. Instead, each occurrence is weighted by $e^{-\frac{x}{a}}$, where $x$ is the time delta (i.e. the month difference) and $a \in (0, +\infty)$ is a hyperparameter. This is known as exponential decay.

In [29]:
def wpfrequencies(data_df, a):
    return (
        data_df[['userid','product_id','monthdiff']]
            .assign(weight=lambda x:np.exp(-x['monthdiff']/a))
            .groupby(['userid','product_id'])
            .agg(freq=('weight','sum'))
            .astype({'freq':'float32'})
            .sort_values(by=['userid','freq'], ascending=[True,False])
    )
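
To see what the weighting changes, here is a small self-contained sketch on made-up data. It compares plain counts with exponentially decayed counts for a single user; the value `a = 9.0` is purely illustrative, not the tuned one.

```python
import numpy as np
import pandas as pd

# One user who bought product 1 twice long ago and product 2 once recently.
toy = pd.DataFrame({
    'userid': [7, 7, 7],
    'product_id': [1, 1, 2],
    'monthdiff': [12.0, 10.0, 1.0],
})

a = 9.0  # illustrative decay constant, not the tuned value
toy_freqs = (
    toy
    .assign(weight=lambda x: np.exp(-x['monthdiff'] / a))
    .groupby(['userid', 'product_id'])
    .agg(plain=('weight', 'count'), weighted=('weight', 'sum'))
)
print(toy_freqs)
```

By plain count, product 1 ranks first (2 purchases vs. 1). After decay, the single recent purchase of product 2 outweighs the two old purchases of product 1, so the ranking flips.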


Tuning the hyperparameter $a$.

In [30]:
a = np.linspace(5, 20, num=250)
param_a_scores = {}

for param in a:
    wpfreqs = wpfrequencies(exploded_train_df, param)
    wpfreq_test = dataframe_prediction(valid_df, 'wpfreq', wpfreqs, basket_size.loc[:,'median'])
    param_a_scores[param] = dataframe_score(wpfreq_test)

best_a = max(param_a_scores, key=param_a_scores.get)
print(f'with a={best_a} WP-topfreq score = {param_a_scores[best_a]}')

with a=9.156626506024097 WP-topfreq score = 0.18658268917932255


For this sample, the score improved by about 7.5% (from 0.174 to 0.187).

### Generalized + weighted personalized frequencies

Another idea to improve predictions was to measure the exploration of users in their last $n$ baskets, where exploration means the percentage of items in those baskets that the user bought for the first time. From this percentage, the expected number of never-before-bought items in the next basket is determined, and that many items are then selected from the generalized TOP frequencies.

In [31]:
def build_rpt_expl_df(ddf, n):
    return (
        ddf
            .groupby('userid')
            .agg(exploration=('product_id',lambda x:expl_ratio(x,n)))
            .assign(
            repetition=lambda x:1-x['exploration']
        )
    )
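
The `expl_ratio` helper is defined in `common.py` and is not shown in this notebook. The sketch below is one possible reading of the description above, written as a hypothetical stand-in called `exploration_ratio`. The exact definition (unique items per basket, oldest basket first, ratio 0 for `n = 0`) is an assumption.

```python
def exploration_ratio(baskets, n):
    """Hypothetical stand-in for `expl_ratio`: share of first-time items
    among all items bought in the last `n` baskets (oldest basket first)."""
    baskets = list(baskets)
    if n <= 0 or not baskets:
        return 0.0
    history, recent = baskets[:-n], baskets[-n:]
    seen = set(item for basket in history for item in basket)
    new_items = 0
    total_items = 0
    for basket in recent:
        unique_items = set(basket)
        new_items += len(unique_items - seen)
        total_items += len(unique_items)
        seen |= unique_items  # later baskets no longer count these as new
    return new_items / total_items if total_items else 0.0


baskets = [['a', 'b'], ['a', 'c'], ['a', 'b', 'd'], ['e', 'a']]
ratio = exploration_ratio(baskets, 2)  # 'd' and 'e' are new among 5 items
print(ratio)
```

Under this reading, a user with ratio 0.4 and a basket size of 5 would get about 2 items from the generalized TOP list and the remaining 3 from the personalized one.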

Tuning the hyperparameter $n$.

In [34]:
n = range(0,10)
wpfreqs = wpfrequencies(exploded_train_df, best_a)
param_n_scores = {}

for param in n:
    rpt_expl_df = build_rpt_expl_df(train_df, param)
    gwpfreq_test = dataframe_prediction(valid_df, 'gpfreq', (wpfreqs,gfreqs), basket_size.loc[:,'median'], rpt_expl_df)
    param_n_scores[param] = dataframe_score(gwpfreq_test)

best_n = max(param_n_scores, key=param_n_scores.get)
print(f'with n={best_n} GWP-topfreq score = {param_n_scores[best_n]}')

with n=0 GWP-topfreq score = 0.18658268917932255


The best result is obtained with n=0, and its score is identical to the WP-topfreq score, which means that this idea did not improve the predictive power of the model.

### Removing often unsuccessfully predicted items

The idea is to measure the prediction accuracy of each item and simply stop predicting items that are often predicted incorrectly.

In [49]:
def count_successful_predictions(prediction, product_prediction_accuracy):
    for prod in prediction['pred']:
        product_prediction_accuracy.loc[prod,'pred_cnt'] += 1
        if prod in map(int,prediction['product_id']):
            product_prediction_accuracy.loc[prod,'pred_success'] += 1

def remove_banned(predicted_basket, banned):
    return [prod for prod in predicted_basket if prod not in banned]

In [50]:
wpfreqs = wpfrequencies(exploded_train_df, best_a)
wpfreq_test = dataframe_prediction(valid_df, 'wpfreq', wpfreqs, basket_size.loc[:,'median'])
print('Score before removing banned items',dataframe_score(wpfreq_test))

Score before removing banned items 0.18658268917932255


Now we calculate the prediction accuracy of each item.

In [59]:
pred_acc = pd.DataFrame(index=gfreqs.index)
pred_acc[['pred_cnt', 'pred_success']] = 0
wpfreq_test.apply(lambda row: count_successful_predictions(row[['product_id','pred']], pred_acc), axis=1)
pred_acc['acc'] = pred_acc['pred_success']/pred_acc['pred_cnt']
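
The row-by-row `.loc` updates above can be slow on larger frames. As an alternative sketch (not the code used for the reported scores), the same per-item counts can be obtained with `explode` and `groupby`. The toy frame below is made up, and both columns hold integers, whereas the real baskets hold strings that need an `int` conversion first.

```python
import pandas as pd

# Toy validation frame: the real basket and the predicted basket per user.
toy = pd.DataFrame({
    'userid': [1, 2, 3],
    'product_id': [[10, 20], [10, 30], [40]],
    'pred': [[10, 20], [20, 30], [20]],
})

# One row per (user, predicted product), flagged when the real basket has it.
flat = toy[['product_id', 'pred']].explode('pred')
flat['hit'] = [pred in real for pred, real in zip(flat['pred'], flat['product_id'])]

toy_acc = (
    flat
    .groupby('pred')
    .agg(pred_cnt=('hit', 'size'), pred_success=('hit', 'sum'))
    .assign(acc=lambda x: x['pred_success'] / x['pred_cnt'])
)
print(toy_acc)
```

Product 20 is predicted three times but bought only once, so its accuracy is 1/3 and it would be the first candidate for banning. Products that are never predicted do not appear in this table at all, whereas `pred_acc` above keeps them with a `NaN` accuracy.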

Tuning the hyperparameter $p$, the accuracy threshold below which items are considered banned.

In [58]:
p = np.linspace(0,0.2,100)
param_p_scores = {}

for param_p in p:
    banned_prods = set(pred_acc.query('acc < @param_p').index.tolist())
    tmpdf = wpfreq_test.copy()
    tmpdf['pred'] = tmpdf['pred'].map(lambda row: remove_banned(row, banned_prods))
    param_p_scores[param_p] = dataframe_score(tmpdf)

best_p = max(param_p_scores, key=param_p_scores.get)
print(f'with p={best_p} score = {param_p_scores[best_p]}')

with p=0.07878787878787878 score = 0.19376484943082484


This approach improved the score again, by about 4% (from 0.187 to 0.194).

## Summary

The median turned out to be a better estimate of basket size than the mean.

Generalized TOP is not a useful model on its own, but it is a valuable stepping stone.

validation score ~`0.018`

Personalized TOP gave a huge improvement and is the basis of my best solution.

validation score ~`0.174`

Weighted personalized TOP: with the tuned parameter `a` (about 9.16), baskets 12 months old count around 3.3x less and baskets 24 months old around 12x less than 1-month-old baskets.

validation score ~`0.187`
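
How much less an old basket counts follows directly from the decay formula: the ratio of the weights of two baskets aged $x_1$ and $x_2$ months is $e^{(x_2 - x_1)/a}$. The short sketch below evaluates this for the value of `a` found on the sample. The helper name `relative_importance` is illustrative.

```python
import math

a = 9.156626506024097  # best value of `a` found on the sample


def relative_importance(months_old, reference_months=1.0):
    """How many times less a basket counts than one `reference_months` old."""
    return math.exp((months_old - reference_months) / a)


print(round(relative_importance(12), 2))  # 12-month-old vs. 1-month-old basket
print(round(relative_importance(24), 2))  # 24-month-old vs. 1-month-old basket
```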

The exploration attempt was not successful.

Removing often unsuccessfully predicted items slightly improved the prediction score again. To my understanding, these banned items were once heavily bought but are not anymore.

validation score ~`0.194`

Score on test data was `0.188`.

Position `7/45`.

## Test data predictions and scaling to full dataset

Lastly, we scale to the full dataset, predict on the test data and format the predictions so that the evaluation server can process them.

In [132]:
def test_pred_format(test):
    test = test.rename(columns={'pred':'itemids'})
    test['itemids'] = test['itemids'].astype(str).str.replace(r'\[|\]|\,','', regex=True)
    return test
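
As a quick illustration of what this formatting step does, the sketch below applies the same regular expression to a made-up frame. It also shows an equivalent `' '.join` variant, which is an alternative suggestion rather than the code used for the submission.

```python
import pandas as pd

toy = pd.DataFrame({'userid': [1, 2], 'pred': [[10985, 54684], [31758]]})

formatted = toy.rename(columns={'pred': 'itemids'})
# str([10985, 54684]) is '[10985, 54684]'; the regex drops '[', ']' and ','.
formatted['itemids'] = (
    formatted['itemids'].astype(str).str.replace(r'\[|\]|\,', '', regex=True)
)

# Alternative: build the space-separated string directly from the list.
joined = toy['pred'].map(lambda items: ' '.join(map(str, items)))

print(formatted['itemids'].tolist())
print(joined.tolist())
```

Both variants produce the space-separated format of the `itemids` column in the training file.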

In [133]:
test_df = (
        pd.read_csv('data/test24k.csv')
        .drop(columns=['itemids'])
        .assign(date=lambda x: pd.to_datetime(x['date'], infer_datetime_format=True))
)

exploded_train_df = (
    df
        .merge(test_df[['userid','date']], how='inner', on='userid', suffixes=('','_last'))
        .assign(
        product_id=lambda x: x['product_id'].apply(lambda l:list(set(l))),
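        # 30.436875 days is one average month, the value np.timedelta64(1, 'M') stood for;
        # newer pandas versions reject month units in timedelta arithmetic.
        monthdiff=lambda x: (x['date_last'] - x['date']).dt.days / 30.436875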
    )
        .explode('product_id')
        .astype({'product_id':'uint16'})
        .reset_index(drop=True)
)
basket_size = (
    exploded_train_df[['userid', 'product_id', 'date']]
        .groupby(['userid','date'])
        .count()
        .reset_index()
        .groupby('userid')
    ['product_id']
        .agg(mean='mean', std='std', median='median')
)


In [134]:
banned_prods = set(pred_acc.query('acc < @best_p').index.tolist())
tmpdf = dataframe_prediction(test_df, 'wpfreq', wpfrequencies(exploded_train_df, best_a), basket_size.loc[:,'median'])
tmpdf['pred'] = tmpdf['pred'].map(lambda row: remove_banned(row, banned_prods))
test_pred_format(tmpdf).to_csv('data/pred24k.csv', index=False)