[NB 03](https://github.com/radekosmulski/personalized_fashion_recs/blob/messing_around/03_Basic_lgbm_with_idxs_restart.ipynb) didn't quite work.

The two likely reasons for that are:
 - issues with how I am generating train data
 - false assumption that you can toss whatever at a ranking model and it will do the rest

In this notebook, I want to lay the groundwork needed for growing a good solution. First of all, that will require a robust and fast local validation scheme. We know that using the last week of train data for validation works and tracks the leaderboard nicely.
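
To make the validation idea concrete, here is a minimal sketch of a last-week holdout. The toy frame and the names `toy`, `train_part` and `valid_part` are made up for illustration; the `week` column is assumed to follow the convention derived later in this notebook, where the largest value is the most recent week.

```python
import pandas as pd

# Toy transactions; `week` is assumed to grow with time (largest = most recent).
toy = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "article_id": [10, 11, 10, 12, 13],
    "week": [102, 104, 103, 103, 104],
})

val_week = toy["week"].max()
train_part = toy[toy["week"] < val_week]    # everything before the held-out week
valid_part = toy[toy["week"] == val_week]   # the held-out last week

print(len(train_part), len(valid_part))
```

Anything used to build candidates or features for the held-out week should then come from `train_part` only.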

Secondly, we want to start from a kernel of a solution that we can extend. This [notebook](https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2) on kaggle seems to me like a great starting point.

The plan is to develop the functionality needed for a nice setup of a solution that we will reuse in NB 05. Along the way I hope to learn a bit more about the data, about some of the trends that I might want to model through the features I will engineer.

The plan is:
* implement a quick training pipeline leading to good validation
* train a ranking model on candidates we know to be good
* only generate new training data / candidates while validating whether we are moving in the right direction using local CV
* start with building sensible features and see whether they move the needle on the score

The truth is I do not know what will work. These RecSys models are a completely new breed of models to me. But I can set the problem up in a way that helps me learn. And that is what I am going to do :).

Once I get this working I will breathe a sigh of relief and will jump into reading papers and drawing inspiration from there.

Let's get started.

NOTE: You are welcome to check out the earlier code that I wrote which can be found on [this branch](https://github.com/radekosmulski/personalized_fashion_recs/tree/messing_around). I learned a lot about RecSys models and this particular problem through it. But it has quite a few bugs and a couple of issues I now know of with regard to the approach. **You should not need any of the earlier code to run the notebooks in the main branch of this repo.**

In [1]:
# helper functions
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
from average_precision import apk
import pandas as pd

# https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635
def customer_hex_id_to_int(series):
    return series.str[-16:].apply(hex_id_to_int)

def hex_id_to_int(hex_str):
    # keep the last 16 hex digits (64 bits) and parse them as an integer
    return int(hex_str[-16:], 16)

def article_id_str_to_int(series):
    return series.astype('int32')

def article_id_int_to_str(series):
    return '0' + series.astype('str')

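class Categorize(BaseEstimator, TransformerMixin):
    def __init__(self, min_examples=0):
        self.min_examples = min_examples
        self.categories = []

    def fit(self, X, y=None):
        # reset so that calling fit twice does not append to stale categories
        self.categories = []
        for i in range(X.shape[1]):
            vc = X.iloc[:, i].value_counts()
            # keep only values seen more than `min_examples` times
            self.categories.append(vc[vc > self.min_examples].index.tolist())
        return self

    def transform(self, X):
        # values outside the kept categories (including NaN) get the code -1
        data = {X.columns[i]: pd.Categorical(X.iloc[:, i], categories=self.categories[i]).codes for i in range(X.shape[1])}
        return pd.DataFrame(data=data, index=X.index)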
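
A quick illustration of what the id helpers do, as a sketch on made-up values. The 64-character id below is fabricated for the example, and `hex_id_to_int` is restated here only so that the snippet runs on its own.

```python
import pandas as pd

def hex_id_to_int(hex_str):
    # Same idea as the helper above: keep the last 16 hex digits (64 bits).
    return int(hex_str[-16:], 16)

# Made-up 64-character hex id, used only for illustration.
fake_id = "00" * 24 + "00000000000000ff"
print(len(fake_id), hex_id_to_int(fake_id))

# Article ids lose their leading zero as ints and get it back on the way out.
article_ids = pd.Series(["0108775015", "0706016001"])
as_int = article_ids.astype("int32")
back = "0" + as_int.astype("str")
print(as_int.tolist(), back.tolist())
```

Keeping only 64 bits could in principle map two customers to the same integer, which is why comparing `nunique()` before and after the conversion is a useful sanity check.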

In [2]:
# !wget https://raw.githubusercontent.com/benhamner/Metrics/master/Python/ml_metrics/average_precision.py

We want this to be fast. I can get as much RAM as I will ever need through VMs on GCP, but that is not the point. I want to see how far I can push my local hardware, but this goes even beyond that.

I need the speed to make good use of my time as I continue to build my understanding of what RecSys models are about. And the path to this leads through making the data I will work on smaller.

In [3]:
#%%time

transactions = pd.read_csv('../00 - Data/transactions_train/transactions_train.csv', dtype={"article_id": "str"})
customers = pd.read_csv('../00 - Data/customers/customers_features.csv')
articles = pd.read_csv('../00 - Data/articles/articles.csv', dtype={"article_id": "str"})

In [4]:
transactions.memory_usage(deep=True)

Index                      132
t_dat               2129817708
customer_id         3846387204
article_id          2129817708
price                254306592
sales_channel_id     254306592
dtype: int64

In [5]:
transactions.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 5 columns):
 #   Column            Dtype  
---  ------            -----  
 0   t_dat             object 
 1   customer_id       object 
 2   article_id        object 
 3   price             float64
 4   sales_channel_id  int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 8.0 GB


In [6]:
%%time
transactions['customer_id'].nunique()

CPU times: total: 7.28 s
Wall time: 7.27 s


1362281

In [7]:
%%time
transactions['customer_id'] = customer_hex_id_to_int(transactions['customer_id'])
transactions['customer_id'].nunique()

CPU times: total: 27.7 s
Wall time: 27.9 s


1362281

In [8]:
transactions.memory_usage(deep=True)

Index                      132
t_dat               2129817708
customer_id          254306592
article_id          2129817708
price                254306592
sales_channel_id     254306592
dtype: int64

In [9]:
transactions.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 5 columns):
 #   Column            Dtype  
---  ------            -----  
 0   t_dat             object 
 1   customer_id       uint64 
 2   article_id        object 
 3   price             float64
 4   sales_channel_id  int64  
dtypes: float64(1), int64(1), object(2), uint64(1)
memory usage: 4.7 GB


Nice!
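
The saving comes from replacing Python string objects with fixed-width integers. In the outputs above, `customer_id` goes from 3846387204 bytes (121 per row) to 254306592 bytes (8 per row). The sketch below reproduces the effect on synthetic ids; the ids are generated for illustration and the exact per-string size depends on the Python version.

```python
import pandas as pd

n = 1_000
# Synthetic 64-character hex ids standing in for the real customer ids.
hex_ids = pd.Series([format(i, "064x") for i in range(n)], dtype="object")
int_ids = hex_ids.str[-16:].apply(lambda s: int(s, 16)).astype("uint64")

bytes_str = hex_ids.memory_usage(index=False, deep=True)
bytes_int = int_ids.memory_usage(index=False, deep=True)
print(bytes_str / n, bytes_int / n)  # bytes per row: strings vs uint64
```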

Initially, I wanted to get rid of the `t_dat` column but on second thought I am not a fan.

I am all for speed and reducing weight, but the main purpose of this activity is to increase developer productivity.

If I fall back to ints representing year, week, and day, I will certainly be trading developer productivity for fewer CPU cycles, and I want to go in the exact opposite direction: developer productivity > (nearly) anything else.

In [10]:
%%time

transactions.t_dat = pd.to_datetime(transactions.t_dat, format='%Y-%m-%d')

CPU times: total: 3.36 s
Wall time: 3.47 s


In [11]:
transactions['week'] = 104 - (transactions.t_dat.max() - transactions.t_dat).dt.days // 7
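
To see what this numbering does, here is the same expression applied to four illustrative dates. The most recent 7 days land in week 104, the 7 days before that in week 103, and so on.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2020-09-22", "2020-09-16", "2020-09-15", "2020-09-08"]))
# Days back from the latest date, bucketed into 7-day blocks counted down from 104.
week = 104 - (dates.max() - dates).dt.days // 7
print(week.tolist())  # [104, 104, 103, 102]
```

Note that these are 7-day blocks counted back from the last date in the data, not calendar weeks.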

In [12]:
transactions.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 6 columns):
 #   Column            Dtype         
---  ------            -----         
 0   t_dat             datetime64[ns]
 1   customer_id       uint64        
 2   article_id        object        
 3   price             float64       
 4   sales_channel_id  int64         
 5   week              int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(1), uint64(1)
memory usage: 3.2 GB


Let's do something about the `article_id` (both here and on `articles`) and let's take a closer look at `price`, `sales_channel_id` and `week`.

In [13]:
transactions.article_id = article_id_str_to_int(transactions.article_id)
articles.article_id = article_id_str_to_int(articles.article_id)

transactions.week = transactions.week.astype('int8')
transactions.sales_channel_id = transactions.sales_channel_id.astype('int8')
transactions.price = transactions.price.astype('float32')
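
Downcasting with `astype` does not warn when values fall outside the target range, so a small range check before casting is a cheap safeguard. This is a sketch; `fits_in` is a hypothetical helper and the values are illustrative.

```python
import numpy as np
import pandas as pd

def fits_in(series, dtype):
    # True if every value of `series` is representable in the integer `dtype`.
    info = np.iinfo(dtype)
    return bool(info.min <= series.min() and series.max() <= info.max)

week = pd.Series([0, 52, 104])
article_id = pd.Series([108775015, 956217002])

print(fits_in(week, np.int8))         # int8 covers -128..127
print(fits_in(article_id, np.int32))  # int32 covers roughly +/- 2.1 billion
print(fits_in(article_id, np.int16))
```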

In [14]:
transactions.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 6 columns):
 #   Column            Dtype         
---  ------            -----         
 0   t_dat             datetime64[ns]
 1   customer_id       uint64        
 2   article_id        int32         
 3   price             float32       
 4   sales_channel_id  int8          
 5   week              int8          
dtypes: datetime64[ns](1), float32(1), int32(1), int8(2), uint64(1)
memory usage: 788.2 MB


In [15]:
transactions.drop(columns='t_dat').info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 5 columns):
 #   Column            Dtype  
---  ------            -----  
 0   customer_id       uint64 
 1   article_id        int32  
 2   price             float32
 3   sales_channel_id  int8   
 4   week              int8   
dtypes: float32(1), int32(1), int8(2), uint64(1)
memory usage: 545.7 MB


Well, this is interesting. Despite being a scary `datetime64`, `t_dat` costs a fixed 8 bytes per row, roughly 243 MB here (the difference between the two `info` outputs above). That is modest next to the 8.0 GB we started with.

Keeping it for convenience is definitely the way to go.
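
For reference, a `datetime64` column is stored as one 8-byte integer per row, so its footprint depends on the number of rows rather than on how many distinct dates there are. A small sketch on synthetic dates:

```python
import pandas as pd

n = 10_000
few_unique = pd.Series(pd.to_datetime(["2020-09-22"] * n))                # one distinct date
many_unique = pd.Series(pd.date_range("1990-01-01", periods=n, freq="D"))  # all dates distinct

per_row_few = few_unique.memory_usage(index=False, deep=True) / n
per_row_many = many_unique.memory_usage(index=False, deep=True) / n
print(per_row_few, per_row_many)
```

With about 31.8 million rows this works out to the roughly 243 MB difference seen above.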

Let's take a brief look at the `customers` and `articles` dfs.

In [16]:
customers.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1371980 entries, 0 to 1371979
Data columns (total 8 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   customer_id             1371980 non-null  object 
 1   FN                      1371980 non-null  int64  
 2   Active                  1371980 non-null  int64  
 3   club_member_status      1371980 non-null  int64  
 4   fashion_news_frequency  1371980 non-null  int64  
 5   age                     1371980 non-null  float64
 6   postal_code             1371980 non-null  object 
 7   INFO-CODE               1371980 non-null  object 
dtypes: float64(1), int64(4), object(3)
memory usage: 460.6 MB


In [17]:
articles.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105542 entries, 0 to 105541
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   article_id                    105542 non-null  int32 
 1   product_code                  105542 non-null  int64 
 2   prod_name                     105542 non-null  object
 3   product_type_no               105542 non-null  int64 
 4   product_type_name             105542 non-null  object
 5   product_group_name            105542 non-null  object
 6   graphical_appearance_no       105542 non-null  int64 
 7   graphical_appearance_name     105542 non-null  object
 8   colour_group_code             105542 non-null  int64 
 9   colour_group_name             105542 non-null  object
 10  perceived_colour_value_id     105542 non-null  int64 
 11  perceived_colour_value_name   105542 non-null  object
 12  perceived_colour_master_id    105542 non-null  int64 
 13 

Well, this stuff will be getting merged with our transactions df at some point, so I guess we can also make this smaller and easier to work with down the road.

In [18]:
customers['club_member_status'].unique()

array([1, 0, 2, 3], dtype=int64)

In [19]:
customers.customer_id = customer_hex_id_to_int(customers.customer_id)
for col in ['FN', 'Active', 'age']:
    # assign the result back instead of calling fillna in place on a column selection
    customers[col] = customers[col].fillna(-1).astype('int8')

In [20]:
customers.club_member_status = Categorize().fit_transform(customers[['club_member_status']]).club_member_status
customers.postal_code = Categorize().fit_transform(customers[['postal_code']]).postal_code
customers.fashion_news_frequency = Categorize().fit_transform(customers[['fashion_news_frequency']]).fashion_news_frequency
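
What `Categorize` produces can be seen on a toy column. The sketch below repeats its two steps with plain pandas, with `min_examples=0` assumed: categories are ordered by frequency, and anything outside the fitted categories gets the code `-1`.

```python
import pandas as pd

col = pd.Series(["a", "b", "a", "c", "a", "b"])

vc = col.value_counts()
categories = vc[vc > 0].index.tolist()  # min_examples=0 keeps every value seen
codes = pd.Categorical(col, categories=categories).codes
print(categories, codes.tolist())       # ['a', 'b', 'c'] [0, 1, 0, 2, 0, 1]

# A value that was not seen during fit is encoded as -1.
unseen = pd.Categorical(pd.Series(["a", "z"]), categories=categories).codes
print(unseen.tolist())                  # [0, -1]
```

The code dtype is chosen by pandas from the number of categories, which is where the `int8`, `int16` and `int32` columns in the outputs come from.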

In [21]:
customers.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1371980 entries, 0 to 1371979
Data columns (total 8 columns):
 #   Column                  Non-Null Count    Dtype 
---  ------                  --------------    ----- 
 0   customer_id             1371980 non-null  uint64
 1   FN                      1371980 non-null  int8  
 2   Active                  1371980 non-null  int8  
 3   club_member_status      1371980 non-null  int8  
 4   fashion_news_frequency  1371980 non-null  int8  
 5   age                     1371980 non-null  int8  
 6   postal_code             1371980 non-null  int32 
 7   INFO-CODE               1371980 non-null  object
dtypes: int32(1), int8(5), object(1), uint64(1)
memory usage: 113.8 MB


In [22]:
for col in articles.columns:
    if articles[col].dtype == 'object':
        articles[col] = Categorize().fit_transform(articles[[col]])[col]

In [23]:
articles.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105542 entries, 0 to 105541
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   article_id                    105542 non-null  int32
 1   product_code                  105542 non-null  int64
 2   prod_name                     105542 non-null  int32
 3   product_type_no               105542 non-null  int64
 4   product_type_name             105542 non-null  int16
 5   product_group_name            105542 non-null  int8 
 6   graphical_appearance_no       105542 non-null  int64
 7   graphical_appearance_name     105542 non-null  int8 
 8   colour_group_code             105542 non-null  int64
 9   colour_group_name             105542 non-null  int8 
 10  perceived_colour_value_id     105542 non-null  int64
 11  perceived_colour_value_name   105542 non-null  int8 
 12  perceived_colour_master_id    105542 non-null  int64
 13  perceived_colo

In [24]:
for col in articles.columns:
    if articles[col].dtype == 'int64':
        articles[col] = articles[col].astype('int32')

And this concludes our raw data preparation step! Let's now write everything back to disk.

In [25]:
transactions.sort_values(['t_dat', 'customer_id'], inplace=True)

In [26]:
%%time

transactions.to_parquet('../00 - Data/transactions_train/transactions_train.parquet')
customers.to_parquet('../00 - Data/customers/customers.parquet')
articles.to_parquet('../00 - Data/articles/articles.parquet')

CPU times: total: 7.67 s
Wall time: 8.19 s


Let's also generate a sample we will be able to use to speed up development.

In [27]:
%%time
# let's create a 5% sample of the entirety of the data to speed up dev

sample = 0.05
customers_sample = customers.sample(frac=sample, replace=False)
customers_sample_ids = set(customers_sample['customer_id'])
transactions_sample = transactions[transactions["customer_id"].isin(customers_sample_ids)]
articles_sample_ids = set(transactions_sample["article_id"])
articles_sample = articles[articles["article_id"].isin(articles_sample_ids)]

customers_sample.to_parquet(f'../00 - Data/customers/customers_sample_{sample}.parquet', index=False)
transactions_sample.to_parquet(f'../00 - Data/transactions_train/transactions_train_sample_{sample}.parquet', index=False)
articles_sample.to_parquet(f'../00 - Data/articles/articles_train_sample_{sample}.parquet', index=False)

CPU times: total: 3.73 s
Wall time: 3.75 s
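
The sample is taken at the customer level, so every sampled customer keeps their complete purchase history. A toy sketch of the same steps follows; the frames are synthetic, and `random_state` is an addition here (not used in the cell above) to make the sample reproducible.

```python
import pandas as pd

customers_toy = pd.DataFrame({"customer_id": range(100)})
transactions_toy = pd.DataFrame({
    "customer_id": [i % 100 for i in range(500)],  # 5 transactions per customer
    "article_id": [i % 37 for i in range(500)],
})

# random_state is an assumption added for reproducibility.
sampled = customers_toy.sample(frac=0.05, replace=False, random_state=0)
ids = set(sampled["customer_id"])
tx_sample = transactions_toy[transactions_toy["customer_id"].isin(ids)]

print(len(sampled), len(tx_sample))
```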


## Evaluation

In [28]:
from collections import defaultdict

val_week_purchases_by_cust = defaultdict(list)

val_week_purchases_by_cust.update(
    transactions[transactions.week == transactions.week.max()] \
        .groupby('customer_id')['article_id'] \
        .apply(list) \
        .to_dict()
)

pd.to_pickle(dict(val_week_purchases_by_cust), '../00 - Data/val_week_purchases_by_cust.pkl')

sample_sub = pd.read_csv('../00 - Data/sample_submission/sample_submission.csv')
valid_gt = customer_hex_id_to_int(sample_sub.customer_id) \
    .map(val_week_purchases_by_cust) \
    .apply(lambda xx: ' '.join('0' + str(x) for x in xx))

sample_sub.prediction = valid_gt
sample_sub.to_parquet('../00 - Data/validation_ground_truth.parquet', index=False)
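
The `defaultdict` matters here: when `Series.map` receives a dict subclass that defines `__missing__`, missing keys use that default instead of becoming `NaN`. Customers with no purchases in the validation week therefore end up with an empty string. A small sketch with made-up customer ids:

```python
from collections import defaultdict
import pandas as pd

purchases = defaultdict(list)
purchases.update({1: [108775015, 706016001]})  # customer 1 bought two articles

customer_ids = pd.Series([1, 2])               # customer 2 bought nothing
gt = customer_ids.map(purchases).apply(lambda ids: " ".join("0" + str(i) for i in ids))
print(gt.tolist())  # ['0108775015 0706016001', '']
```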

In [29]:
from average_precision import apk

def calculate_apk(list_of_preds, list_of_gts):
    # for fast validation this can be changed to operate on dicts of {'cust_id_int': [art_id_int, ...]}
    # using 'data/val_week_purchases_by_cust.pkl'
    apks = []
    for preds, gt in zip(list_of_preds, list_of_gts):
        apks.append(apk(gt, preds, k=12))
    return np.mean(apks)

def eval_sub(sub_csv, skip_cust_with_no_purchases=True):
    sub = pd.read_csv(sub_csv)
    # read the ground truth from the location it was written to above
    validation_set = pd.read_parquet('../00 - Data/validation_ground_truth.parquet')

    apks = []

    no_purchases_pattern = []
    for pred, gt in zip(sub.prediction.str.split(), validation_set.prediction.str.split()):
        if skip_cust_with_no_purchases and (gt == no_purchases_pattern): continue
        apks.append(apk(gt, pred, k=12))
    return np.mean(apks)
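
For intuition about the metric, here is a small re-implementation of average precision at k, written along the lines of the `apk` function from `ml_metrics` as I understand it. The name `apk_sketch` is made up to avoid shadowing the imported `apk`.

```python
def apk_sketch(actual, predicted, k=12):
    """Average precision at k for a single customer."""
    predicted = predicted[:k]
    score, num_hits = 0.0, 0
    for i, p in enumerate(predicted):
        # count each correct article once, at the rank where it first appears
        if p in actual and p not in predicted[:i]:
            num_hits += 1
            score += num_hits / (i + 1)
    if not actual:
        return 0.0
    return score / min(len(actual), k)

# Hits at ranks 1 and 3: (1/1 + 2/3) / 2
print(apk_sketch(["a", "b"], ["a", "x", "b"]))
# A single hit at rank 2: (1/2) / 1
print(apk_sketch(["a"], ["x", "a"]))
```

The order of predictions matters: moving a correct article up the list raises the score, which is exactly what a ranker is supposed to achieve.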

## Strong starting point

This, without a doubt, has to be this kaggle [kernel](https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2).

These candidates are so good they get a decent score even with the very simple ranking applied!

There is only one problem with this solution -- it doesn't structure the 'candidates' in a way where they could be fed into a ranking model. Plus the candidates have information they can't possibly have. If candidates are candidates for the future, they can't have access to the information on how well something will sell in the future week.
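
One way to keep future information out of the candidates is to key each week's bestsellers by the week they will be used for, so that candidates for week `w` are computed from week `w - 1` only. A sketch on toy data; the names and the top-2 cutoff are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "week":       [102, 102, 102, 103, 103, 103, 104],
    "article_id": [1,   1,   2,   3,   3,   1,   9],
})

# Top-2 articles sold within each week.
top_per_week = (
    toy.groupby("week")["article_id"]
       .apply(lambda s: s.value_counts().index[:2].tolist())
)

# Shift the key by one week: candidates for week w come from week w - 1 only.
candidates_for_week = {week + 1: items for week, items in top_per_week.items()}
print(candidates_for_week[104])  # built from week 103 sales: [3, 1]
```

Article `9`, which only sells in week 104, is correctly absent from the candidates for week 104.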

We need to:
- find a way to structure candidates so that they could be fed into a ranking model
- create features that would capture the information the kernel is relying on
- first output the predictions manually (hardcode a ranking model) and subsequently feed the data to a ranker

In my previous attempt I generated candidates for all users based on bestsellers for all weeks.

But that created a lot of junk.

This bestseller logic should be applied as postprocessing to a solution! The flow should be as follows:
* predict on candidates
* if for some customer the predictions are in some sense not reliable enough, or the likelihood of a sale is too low, use the bestseller logic from the kaggle kernel.
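
The flow above can be sketched as a small postprocessing function. Everything here is hypothetical: the function name, the `(article_id, score)` input format and the `min_score` threshold are assumptions for illustration, not something the ranker in this repo produces as-is.

```python
def with_fallback(ranked, bestsellers, min_score=0.5, k=12):
    """ranked: list of (article_id, score) pairs sorted by score, descending."""
    # Keep only the predictions the model is reasonably confident about.
    confident = [art for art, score in ranked if score >= min_score]
    # Pad with bestsellers, skipping articles that are already present.
    filler = [art for art in bestsellers if art not in confident]
    return (confident + filler)[:k]

print(with_fallback([(1, 0.9), (2, 0.1)], bestsellers=[2, 3, 4], k=3))  # [1, 2, 3]
print(with_fallback([], bestsellers=[5, 6]))                            # [5, 6]
```

A sensible threshold would have to be chosen using the local validation score.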

Let us understand what the customers are doing a little bit better.

In [None]:
final_week = transactions[transactions.week == transactions.week.max()]
final_week['customer_id'].value_counts().describe()

In [None]:
bestsellers_last_week = set(transactions[transactions.week == transactions.week.max()].article_id.value_counts().index[:12])
bestsellers_week_ago = set(transactions[transactions.week == transactions.week.max()-1].article_id.value_counts().index[:12])

In [None]:
final_week.article_id.isin(bestsellers_last_week).mean(), final_week.article_id.isin(bestsellers_week_ago).mean()

People **are** buying the bestsellers but not as much as one might think.

In [None]:
without_final_week = transactions[transactions.week != transactions.week.max()]

In [None]:
unique_bought_items = 0
last_purchase_repeated = 0
purchases_cust_with_no_history = 0
items_purchased_by_custs_with_no_history = []
week_of_earlier_purchase = []
i = 0

for c_id, df in final_week.groupby('customer_id'):
    purchases_final_week = set(df.article_id)
    unique_bought_items += len(purchases_final_week)
    
    #all transactions made by a single customer
    purchase_history = without_final_week[without_final_week.customer_id == c_id]
    #last item bought by a customer
    purchases_before = set(purchase_history[purchase_history.week == purchase_history.week.max()].article_id)
    week_of_earlier_purchase.append(purchase_history.week.max())
    
    if len(purchases_before) == 0:
        purchases_cust_with_no_history += len(purchases_final_week)
        items_purchased_by_custs_with_no_history += list(purchases_final_week)
    else:
        last_purchase_repeated += len(purchases_final_week.intersection(purchases_before))
    i += 1
    if i == 1000: break
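
The loop filters the full history once per customer, which is why it stops after 1000 customers. The "weeks since the previous purchase" part can also be computed for everyone at once with a `groupby`. A sketch on toy data (it covers only that statistic, not the repeat-purchase counts):

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": [1,   1,   1,   2,  2,   3],
    "week":        [100, 103, 104, 98, 104, 104],
})
final_week_toy = toy[toy.week == toy.week.max()]
history_toy = toy[toy.week != toy.week.max()]

# Most recent week with a purchase, per customer, before the final week.
last_week_seen = history_toy.groupby("customer_id")["week"].max()

active = final_week_toy["customer_id"].drop_duplicates()
# Customers without history map to NaN and are dropped.
weeks_since = (toy.week.max() - active.map(last_week_seen)).dropna().astype(int)
print(weeks_since.tolist())  # [1, 6]
```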

In [None]:
wks_since_purchase = []
for week in week_of_earlier_purchase:
    # customers with no history yield NaN from `.max()` on an empty selection
    if pd.notna(week):
        wks_since_purchase.append(104 - week)

In [None]:
# weeks elapsed between the purchase in the final week and earlier purchase
pd.Series(wks_since_purchase).value_counts(normalize=True).head(20).cumsum()

In [None]:
unique_bought_items, last_purchase_repeated, purchases_cust_with_no_history

In [None]:
np.mean([itm in bestsellers_last_week for itm in items_purchased_by_custs_with_no_history])

There are not that many repeat purchases either. Though a vast majority of customers are repeat customers.

And new customers are not buying bestsellers all that much either.

I bet this could be improved if we did something useful with postal codes -- H&M operates across so many markets. The bestseller in one market doesn't have to be the bestseller in another.

A good model should outperform this simple last purchase heuristic by a large margin. Still, let's implement it to be able to use it down the road to refine our solution for situations where we don't have enough data / results are inconclusive.

In [None]:
last_three_weeks = without_final_week[without_final_week.week > without_final_week.week.max()-3]

In [None]:
# this is a slightly different logic to what's in the reference Kaggle kernel
best_sellers = last_three_weeks.groupby('week').apply(lambda df: df.value_counts('article_id').index[:12].tolist())

In [None]:
def purchase_history_to_preds(df):
    week_of_last_purchase = df.week.max()
    last_purchased_basket = df[df.week == week_of_last_purchase]
    purchased_items = last_purchased_basket.value_counts('article_id').index.tolist()
    purchased_items += best_sellers[last_purchased_basket.week.head(1).item()]
    return purchased_items[:12]
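
One thing to be aware of: if an article from the customer's last basket is also a bestseller, it appears twice in the 12 predictions and wastes a slot. A possible variation, sketched on toy data with a hypothetical `dedup_keep_order` helper, removes duplicates while keeping the order:

```python
import pandas as pd

def dedup_keep_order(items):
    # dict preserves insertion order, so the first occurrence of each item wins
    return list(dict.fromkeys(items))

best_sellers_toy = {103: [5, 6, 7]}
basket = pd.DataFrame({"week": [103, 103, 103], "article_id": [6, 6, 8]})

# Customer's own items first (most frequent first), then that week's bestsellers.
own_items = basket["article_id"].value_counts().index.tolist()
preds_toy = dedup_keep_order(own_items + best_sellers_toy[103])[:12]
print(preds_toy)  # [6, 8, 5, 7]
```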

In [None]:
cust2preds2 = last_three_weeks.groupby(['customer_id']).apply(purchase_history_to_preds)

Mhmm that is a bit slow. Let's see if we can run this in parallel.

In [None]:
from dask.distributed import Client

client = Client(n_workers=24)
import dask.dataframe as dd

In [None]:
ltw_dd = dd.from_pandas(last_three_weeks, npartitions=24)

In [None]:
cust2preds = ltw_dd.groupby('customer_id').apply(purchase_history_to_preds, meta=('x', 'object')).compute()

In [None]:
client.close()

Let's generate a submission.

In [None]:
last_week = last_three_weeks.week.max()
def get_preds_for_customer_id(c_id):
    if c_id in c_ids_with_predictions:
        pred_art_ids = cust2preds[c_id]
    else:
        pred_art_ids = best_sellers[last_week]
    return  ['0' + str(art_id) for art_id in pred_art_ids]

In [None]:
c_ids_with_predictions = set(cust2preds.keys())
preds = customer_hex_id_to_int(sample_sub.customer_id).map(get_preds_for_customer_id)

In [None]:
sample_sub.prediction = preds
sample_sub.prediction = sample_sub.prediction.str.join(' ')
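
The submission format is one space-separated string of zero-padded article ids per customer. A toy version of the two steps above, with illustrative ids:

```python
import pandas as pd

pred_ids = pd.Series([[108775015, 706016001], [706016001]])

# Restore the leading zero, then join each customer's list into one string.
as_text = pred_ids.apply(lambda ids: ["0" + str(i) for i in ids]).str.join(" ")
print(as_text.tolist())  # ['0108775015 0706016001', '0706016001']
```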

In [None]:
sub_name = 'bestsellers_single_week_logic_0'

In [None]:
%%time
sample_sub.to_csv(f'../data/submissions/{sub_name}.csv.gz', index=False)

In [None]:
# !kaggle competitions submit -c h-and-m-personalized-fashion-recommendations -f 'data/subs/{sub_name}.csv.gz' -m {sub_name}

For completeness' sake, let me implement the logic exactly as it is on Kaggle, though I doubt this will make much of a difference.

In [None]:
client = Client(n_workers=24)

In [None]:
ltw_dd = dd.from_pandas(last_three_weeks, npartitions=24)

In [None]:
weeks = last_three_weeks.week.unique()
best_sellers = {}
for i in range(3):
    # cumulative bestsellers: top 12 articles over all weeks up to and including weeks[i]
    best_sellers[weeks[i]] = last_three_weeks[last_three_weeks.week.isin(set(weeks[:i+1]))].article_id.value_counts().index.tolist()[:12]

In [None]:
def purchase_history_to_preds(df):
    last_purchase_date = df.t_dat.max()
    last_purchased_basket = df[df.t_dat == last_purchase_date]
    purchased_items = last_purchased_basket.value_counts('article_id').index.tolist()
    purchased_items += best_sellers[last_purchased_basket.week.head(1).item()]
    return purchased_items[:12]

In [None]:
%%time
cust2preds = ltw_dd.groupby('customer_id').apply(purchase_history_to_preds, meta=('x', 'object')).compute()

In [None]:
preds = customer_hex_id_to_int(sample_sub.customer_id).map(get_preds_for_customer_id)
sample_sub.prediction = preds
sample_sub.prediction = sample_sub.prediction.str.join(' ')

sub_name = 'bestsellers_kernel_logic'

sample_sub.to_csv(f'00 - Data/subs/{sub_name}.csv.gz', index=False)

In [None]:
eval_sub(f'00 - Data/subs/{sub_name}.csv.gz', skip_cust_with_no_purchases=True)

In [None]:
# !kaggle competitions submit -c h-and-m-personalized-fashion-recommendations -f '00 - Data/subs/{sub_name}.csv.gz' -m {sub_name}

In [None]:
client.close()