Based on the Time Decaying Popularity Benchmark [0.0216]: https://www.kaggle.com/code/mayukh18/time-decaying-popularity-benchmark-0-0216

# This notebook combines time decay + repurchase information + popular items
1. Recommend items that the customer bought in the last 4 weeks.
2. Recommend popular items from the last 2 weeks, with older purchases weighted less.
3. Recommend items that are bought by the most customers from the last week.
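
The loops later in this notebook merge these sources by priority: the candidate lists are concatenated in order, duplicates are dropped keeping the first occurrence, and the first 12 items are kept. A minimal sketch of that merge, where `merge_candidates` and the toy article IDs are hypothetical and used only for illustration:

```python
def merge_candidates(*candidate_lists, k=12):
    """Concatenate candidate lists in priority order, drop repeats, keep the first k."""
    merged = []
    seen = set()
    for candidates in candidate_lists:
        for item in candidates:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged[:k]


# Toy article IDs, made up for this example.
repurchased = [111, 222]
recent_purchases = [222, 333]
popular = [333, 444, 555]

merged = merge_candidates(repurchased, recent_purchases, popular, k=4)
print(merged)  # [111, 222, 333, 444]
```

Because earlier lists win ties, the order in which the sources are passed decides which recommendations survive the cut to 12.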


In [133]:
import numpy as np
import pandas as pd
import os
import glob
from tqdm import tqdm
import datetime

# Forming Train Set

Repurchase info

In [134]:
latest_bought_articles = pd.read_csv('/kaggle/input/4weekrepurchase/repurchase4Weeks(1).csv')
latest_bought_articles = latest_bought_articles.values.tolist()

Popularity

In [135]:
%%time
pad = "/kaggle/input/makeparquet"
transactions = pd.read_parquet(pad+'/transactions_train.parquet')
customers = pd.read_parquet(pad+'/customers.parquet')
articles = pd.read_parquet(pad+'/articles.parquet')

CPU times: user 1.92 s, sys: 1.95 s, total: 3.87 s
Wall time: 1.72 s


In [136]:
# Step 1: Keep the second-to-last week (`max() - 1`); the final week is held out for validation below
last_week_transactions = transactions[transactions['week'] == transactions['week'].max()-1]

# Step 2: Group transactions by 'article_id' and count unique 'customer_id'
article_customer_count = last_week_transactions.groupby('article_id')['customer_id'].nunique().reset_index(name='customer_count')

# Step 3: Sort articles based on customer count in descending order
sorted_articles = article_customer_count.sort_values(by='customer_count', ascending=False)

# Step 4: Take the top 12 articles
top_12_articles = sorted_articles.head(12)
pop_items = (top_12_articles.article_id.to_list())

Time decay

In [137]:
data = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv", dtype={'article_id':str})
data.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


We'll drop everything except the last few weeks (the exact window is open to experimentation), since transactions from earlier months add little.
We'll keep about 4 weeks as train (`train1` to `train4`) and the last week, from 2020-09-16 onward, as validation.
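
As a rough illustration of the split, the sketch below separates a toy frame at the first validation day. The toy rows and the names `toy`, `split_date`, `toy_train`, and `toy_val` are made up for this example; only the cutoff date comes from the notebook.

```python
import datetime
import pandas as pd

# Toy transactions; the dates and IDs are made up for illustration.
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(
        ["2020-09-01", "2020-09-10", "2020-09-15", "2020-09-16", "2020-09-22"]
    ),
    "article_id": ["a", "b", "c", "d", "e"],
})

split_date = datetime.datetime(2020, 9, 16)  # first day of the validation week
toy_train = toy.loc[toy["t_dat"] < split_date]
toy_val = toy.loc[toy["t_dat"] >= split_date]

print(len(toy_train), len(toy_val))  # 3 2
```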

In [138]:
print("All Transactions Date Range: {} to {}".format(data['t_dat'].min(), data['t_dat'].max()))

data["t_dat"] = pd.to_datetime(data["t_dat"])
train1 = data.loc[(data["t_dat"] >= datetime.datetime(2020,9,8)) & (data['t_dat'] < datetime.datetime(2020,9,16))]
train2 = data.loc[(data["t_dat"] >= datetime.datetime(2020,9,1)) & (data['t_dat'] < datetime.datetime(2020,9,8))]
train3 = data.loc[(data["t_dat"] >= datetime.datetime(2020,8,23)) & (data['t_dat'] < datetime.datetime(2020,9,1))]
train4 = data.loc[(data["t_dat"] >= datetime.datetime(2020,8,15)) & (data['t_dat'] < datetime.datetime(2020,8,23))]

val = data.loc[data["t_dat"] >= datetime.datetime(2020,9,16)]

All Transactions Date Range: 2018-09-20 to 2020-09-22


Items that each user bought during the train period, one list per weekly window.

In [139]:
# List of all purchases per user (has repetitions)
positive_items_per_user1 = train1.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user2 = train2.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user3 = train3.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user4 = train4.groupby(['customer_id'])['article_id'].apply(list)
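
To make the shape of these per-user lists concrete, here is a small sketch on toy data. It also shows how `Counter.most_common` ranks a user's items by purchase count, which is how the lists are consumed later in the notebook. The toy frame and the names `purchases_per_user` and `ranked_u1` are hypothetical.

```python
from collections import Counter

import pandas as pd

# Toy purchases, made up for illustration.
toy = pd.DataFrame({
    "customer_id": ["u1", "u1", "u1", "u2"],
    "article_id": ["x", "y", "x", "z"],
})

# One list of purchased items per customer; repeats are kept.
purchases_per_user = toy.groupby("customer_id")["article_id"].apply(list)

# Rank a user's items by how often they were bought.
ranked_u1 = [item for item, _ in Counter(purchases_per_user["u1"]).most_common()]
print(ranked_u1)  # ['x', 'y']
```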

Next we compute a time-decayed popularity score for each item. Every purchase is weighted by 1 / (number of days from the purchase date to 2020-09-16, the first validation day), so recent purchases count more. For example, item A bought 5 times on the first day of the train period ranks below item B bought 4 times on the last day.
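
The A-versus-B comparison can be checked with a small sketch. The toy frame and the names `reference_date` and `scores` are assumptions for illustration. The weighting rule, 1 divided by the number of days before the reference date, is the one used in the next cell.

```python
import pandas as pd

reference_date = pd.Timestamp("2020-09-16")  # day after the train period ends

# Toy data: item A bought 5 times on 2020-09-01, item B bought 4 times on 2020-09-15.
toy = pd.DataFrame({
    "article_id": ["A"] * 5 + ["B"] * 4,
    "t_dat": pd.to_datetime(["2020-09-01"] * 5 + ["2020-09-15"] * 4),
})

# Each purchase is weighted by 1 / (days before the reference date).
toy["pop_factor"] = 1 / (reference_date - toy["t_dat"]).dt.days
scores = toy.groupby("article_id")["pop_factor"].sum()

print(scores.sort_values(ascending=False))
# A scores 5 * (1/15), about 0.33; B scores 4 * (1/1) = 4.0, so B ranks first.
```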

In [140]:
train = pd.concat([train1, train2], axis=0)
train['pop_factor'] = train['t_dat'].apply(lambda x: 1/(datetime.datetime(2020,9,16) - x).days)
popular_items_group = train.groupby(['article_id'])['pop_factor'].sum()

_, popular_items = zip(*sorted(zip(popular_items_group, popular_items_group.keys()))[::-1])

train['pop_factor'].describe()

count    557958.000000
mean          0.200478
std           0.207752
min           0.066667
25%           0.083333
50%           0.125000
75%           0.200000
max           1.000000
Name: pop_factor, dtype: float64

# Validation: Evaluating the Idea

In [141]:
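def apk(actual, predicted, k=12):
    """Average precision at k for one user: sum of precision at each hit rank, divided by min(len(actual), k)."""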
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=12):
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])
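
A small worked example may help when reading the metric. The function below is a hypothetical standalone copy that follows the same steps as `apk` above, renamed `average_precision_at_k` so the block runs on its own. The toy inputs are made up.

```python
def average_precision_at_k(actual, predicted, k=12):
    """Average precision at k for one user, following the same steps as `apk`."""
    if not actual:
        return 0.0
    predicted = predicted[:k]
    score = 0.0
    num_hits = 0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a relevant item appears.
        if p in actual and p not in predicted[:i]:
            num_hits += 1
            score += num_hits / (i + 1)
    return score / min(len(actual), k)


# Hits at rank 1 (precision 1/1) and rank 3 (precision 2/3), two relevant items.
ap = average_precision_at_k(["a", "b"], ["a", "x", "b"])
print(round(ap, 4))  # 0.8333
```

Moving the second hit from rank 3 to rank 2 would raise the score to 1.0, which is why the order of the candidate lists matters as much as their content.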

Items bought by users in the validation period, built the same way as for the train set.

In [142]:
positive_items_val = val.groupby(['customer_id'])['article_id'].apply(list)

In [143]:
# Collect each validation user's purchased items, in the same order as val_users, for the metric
val_users = positive_items_val.keys()
val_items = []

for i,user in tqdm(enumerate(val_users)):
    val_items.append(positive_items_val[user])
    
print("Total users in validation:", len(val_users))

68984it [00:00, 195738.37it/s]

Total users in validation: 68984





We'll now evaluate the approach on the validation set.

In [144]:
from collections import Counter
outputs = []
cnt = 0
user_cnt=0

popular_items = list(popular_items)

for user in tqdm(val_users):
    user_output = []
    if user in positive_items_per_user1.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user1[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12]
    if user in positive_items_per_user2.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user2[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12]
    if user in positive_items_per_user3.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user3[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12]
    if user in positive_items_per_user4.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user4[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12]
    
    
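    # Rows are looked up by position, so this assumes the repurchase file
    # lists customers in the same order as this loop; entries equal to 0 are skipped.
    repurchase = []
    for articleRepurchase in latest_bought_articles[user_cnt]:
        if articleRepurchase != 0:
            repurchase.append(articleRepurchase)

    # Priority order: repurchases, then the user's weekly purchases, then popular items
    user_output = repurchase + user_output + pop_items
    # Cast everything to int so duplicates match, then drop repeats (first occurrence wins)
    user_output = [int(j) for j in user_output]
    user_output = pd.Series(user_output).drop_duplicates().tolist()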
    
    user_output = user_output[:12]
    user_output = ['0'+str(j) for j in user_output] 

    outputs.append(user_output)
    user_cnt+=1
    
print("mAP Score on Validation set:", mapk(val_items, outputs))

100%|██████████| 68984/68984 [00:19<00:00, 3520.32it/s]


mAP Score on Validation set: 0.021555666982134056


# Prediction on Test Set: Submission

In [145]:
# Step 1: Filter transactions for the last week
last_week_transactions = transactions[transactions['week'] >= transactions['week'].max()]

# Step 2: Group transactions by 'article_id' and count unique 'customer_id'
article_customer_count = last_week_transactions.groupby('article_id')['customer_id'].nunique().reset_index(name='customer_count')

# Step 3: Sort articles based on customer count in descending order
sorted_articles = article_customer_count.sort_values(by='customer_count', ascending=False)

# Step 4: Take the top 12 articles
top_12_articles = sorted_articles.head(12)
pop_items = (top_12_articles.article_id.to_list())

In [146]:
train1 = data.loc[(data["t_dat"] >= datetime.datetime(2020,9,16)) & (data['t_dat'] < datetime.datetime(2020,9,23))]
train2 = data.loc[(data["t_dat"] >= datetime.datetime(2020,9,8)) & (data['t_dat'] < datetime.datetime(2020,9,16))]
train3 = data.loc[(data["t_dat"] >= datetime.datetime(2020,8,31)) & (data['t_dat'] < datetime.datetime(2020,9,8))]
train4 = data.loc[(data["t_dat"] >= datetime.datetime(2020,8,23)) & (data['t_dat'] < datetime.datetime(2020,8,31))]

positive_items_per_user1 = train1.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user2 = train2.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user3 = train3.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user4 = train4.groupby(['customer_id'])['article_id'].apply(list)

train = pd.concat([train1, train2], axis=0)
train['pop_factor'] = train['t_dat'].apply(lambda x: 1/(datetime.datetime(2020,9,23) - x).days)
popular_items_group = train.groupby(['article_id'])['pop_factor'].sum()

_, popular_items = zip(*sorted(zip(popular_items_group, popular_items_group.keys()))[::-1])

user_group = pd.concat([train1, train2, train3, train4], axis=0).groupby(['customer_id'])['article_id'].apply(list)

In [147]:
submission = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/sample_submission.csv")
submission.head()

Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0706016001 0706016002 0372860001 0610776002 07...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0706016001 0706016002 0372860001 0610776002 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0706016001 0706016002 0372860001 0610776002 07...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0706016001 0706016002 0372860001 0610776002 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0706016001 0706016002 0372860001 0610776002 07...


In [148]:
from collections import Counter
outputs = []
cnt = 0
user_cnt=0
for user in tqdm(submission['customer_id']):
    user_output = []
    if user in positive_items_per_user1.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user1[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12]
    if user in positive_items_per_user2.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user2[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12 - len(user_output)]
    if user in positive_items_per_user3.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user3[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12 - len(user_output)]
    if user in positive_items_per_user4.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user4[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12 - len(user_output)]
        
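    # Add repurchase candidates. Rows are looked up by position, so this assumes
    # the repurchase file lists customers in the same order as the submission file;
    # entries equal to 0 are skipped.
    repurchase = []
    for articleRepurchase in latest_bought_articles[user_cnt]:
        if articleRepurchase != 0:
            repurchase.append(articleRepurchase)

    # Priority order: repurchases, then the user's weekly purchases, then popular items
    user_output = repurchase + user_output + pop_items
    # Cast everything to int so duplicates match, then drop repeats (first occurrence wins)
    user_output = [int(j) for j in user_output]
    user_output = pd.Series(user_output).drop_duplicates().tolist()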
    
    user_output = user_output[:12]
    outputs.append(user_output)
    user_cnt+=1
    
str_outputs = []
for output in outputs:
    str_outputs.append(" ".join(['0' + str(x) for x in output]))

100%|██████████| 1371980/1371980 [05:59<00:00, 3820.78it/s]


In [149]:
submission['prediction'] = str_outputs


In [150]:
submission.to_csv('Repurchase4weekDecayPopular.csv.gz', index=False)

In [151]:
submission.head(1).prediction.values

array(['0568601043 0924243001 0918522001 0924243002 0923758001 0866731001 0915529003 0909370001 0915529005 0751471001 0918292001 0762846027'],
      dtype=object)