NB 03 didn't quite work.

The two likely reasons for that are:
 - issues with how I am generating train data
 - false assumption that you can toss whatever at a ranking model and it will do the rest

In this notebook, I want to lay the groundwork needed for growing a good solution. First of all, that will require a robust and fast local validation scheme. We know that using the last week of train data for validation works and tracks the leaderboard nicely.
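
To make the validation idea concrete, here is a minimal sketch of a last-week split on toy data. The helper name `split_last_week` and the toy frame are my own assumptions for illustration; only the column names mirror the competition data.

```python
import pandas as pd

# Toy transactions; column names mirror the competition data.
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(
        ["2020-09-01", "2020-09-10", "2020-09-16", "2020-09-20", "2020-09-22"]
    ),
    "customer_id": [1, 2, 1, 3, 2],
    "article_id": [10, 11, 12, 10, 13],
})

def split_last_week(df, date_col="t_dat", days=7):
    """Hold out the final `days` days as validation; train on everything before."""
    cutoff = df[date_col].max() - pd.Timedelta(days=days)
    train = df[df[date_col] <= cutoff]
    valid = df[df[date_col] > cutoff]
    return train, valid

train, valid = split_last_week(toy)

# Ground truth per customer: the articles bought during the held-out week.
valid_truth = valid.groupby("customer_id")["article_id"].apply(list)
```

Everything that feeds the model (candidates, features) should then be computed from `train` only, so the held-out week plays the same role as the hidden test week.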

Secondly, we want to start from a kernel of a solution that we can extend. This [notebook](https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2) on Kaggle seems to me like a great starting point.

The plan is to develop the functionality needed for a nice setup of a solution that we will reuse in NB 05. Along the way I hope to learn a bit more about the data, about some of the trends that I might want to model through the features I will engineer.

The plan is:
* implement a quick training pipeline leading to good validation
* train a ranking model on candidates we know to be good
* only generate new training data / candidates while validating whether we are moving in the right direction using local CV
* start with building sensible features and see whether they move the needle on the score

The truth is I do not know what will work. These RecSys models are a completely new breed of models to me. But I can set the problem up in a way that helps me learn. And that is what I am going to do :).

Once I get this working I will breathe a sigh of relief and will jump into reading papers and drawing inspiration from there.

Let's get started.

In [49]:
# helper functions
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import pandas as pd  # used by Categorize and eval_sub below
from average_precision import apk

# https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635
def customer_hex_id_to_int(series):
    return series.str[-16:].apply(lambda x: int(x, 16))

def article_id_str_to_int(series):
    return series.astype('int32')

def article_id_int_to_str(series):
    return '0' + series.astype('str')

class Categorize(BaseEstimator, TransformerMixin):
    def __init__(self, min_examples=0):
        self.min_examples = min_examples
        self.categories = []
        
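    def fit(self, X, y=None):
        # start fresh so that refitting does not append a second set of categories
        self.categories = []
        for i in range(X.shape[1]):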
            vc = X.iloc[:, i].value_counts()
            self.categories.append(vc[vc > self.min_examples].index.tolist())
        return self

    def transform(self, X):
        data = {X.columns[i]: pd.Categorical(X.iloc[:, i], categories=self.categories[i]).codes for i in range(X.shape[1])}
        return pd.DataFrame(data=data)
    
def eval_sub(sub_csv, skip_cust_with_no_purchases=True):
    sub=pd.read_csv(sub_csv)
    validation_set=pd.read_csv('data/subs/validation_set.csv')

    apks = []

    no_purchases_pattern = ['0']*12
    for pred, gt in zip(sub.prediction.str.split(), validation_set.prediction.str.split()):
        if skip_cust_with_no_purchases and gt == no_purchases_pattern: continue
        apks.append(apk(pred, gt))
    return np.mean(apks)

We want this to be fast. I can get as much RAM as I will ever need through VMs on GCP, but that is not the point. I want to see how far I can push my local hardware, but this goes even beyond that.

I need the speed to make best use of my time as I build a feel for what RecSys models are about. And the path to this leads through making the data I will work on smaller.

In [2]:
%%time
import pandas as pd

transactions = pd.read_csv('data/transactions_train.csv', dtype={"article_id": "str"})
customers = pd.read_csv('data/customers.csv')
articles = pd.read_csv('data/articles.csv', dtype={"article_id": "str"})

CPU times: user 19.5 s, sys: 4.12 s, total: 23.6 s
Wall time: 23.6 s


In [3]:
transactions.memory_usage(deep=True)

Index                      128
t_dat               2129817708
customer_id         3846387204
article_id          2129817708
price                254306592
sales_channel_id     254306592
dtype: int64

In [4]:
transactions.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 5 columns):
 #   Column            Dtype  
---  ------            -----  
 0   t_dat             object 
 1   customer_id       object 
 2   article_id        object 
 3   price             float64
 4   sales_channel_id  int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 8.0 GB


In [5]:
%%time
transactions['customer_id'].nunique()

CPU times: user 4.78 s, sys: 39.8 ms, total: 4.82 s
Wall time: 4.81 s


1362281

In [6]:
%%time
transactions['customer_id'] = customer_hex_id_to_int(transactions['customer_id'])
transactions['customer_id'].nunique()

CPU times: user 17.2 s, sys: 1.73 s, total: 18.9 s
Wall time: 18.9 s


1362281

In [7]:
transactions.memory_usage(deep=True)

Index                      128
t_dat               2129817708
customer_id          254306592
article_id          2129817708
price                254306592
sales_channel_id     254306592
dtype: int64

In [8]:
transactions.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 5 columns):
 #   Column            Dtype  
---  ------            -----  
 0   t_dat             object 
 1   customer_id       uint64 
 2   article_id        object 
 3   price             float64
 4   sales_channel_id  int64  
dtypes: float64(1), int64(1), object(2), uint64(1)
memory usage: 4.7 GB


Nice!
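
A small sketch of why this conversion is safe, on made-up ids (the helper name `hex_tail_to_uint64` and the sample strings are assumptions, not competition data). Sixteen hex characters encode exactly 64 bits, so the tail always fits in `uint64`. Truncating the id could in principle merge two customers, which is why comparing `nunique()` before and after is worth doing.

```python
import pandas as pd

def hex_tail_to_uint64(series):
    # 16 hex characters encode 64 bits, so the tail always fits in uint64.
    return series.str[-16:].apply(lambda x: int(x, 16)).astype("uint64")

# Made-up 64-character hex ids (48 characters of padding + a 16-character tail).
ids = pd.Series([
    "00" * 24 + "00000000000000ff",
    "ab" * 24 + "ffffffffffffffff",
    "cd" * 24 + "0000000000000001",
])

ints = hex_tail_to_uint64(ids)

# Collision check: the number of distinct ids must survive the truncation.
no_collisions = ints.nunique() == ids.nunique()
```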

Initially, I wanted to get rid of the `t_dat` column but on second thought I am not a fan.

I am all for speed and reducing weight, but the main purpose of this activity is to increase developer productivity.

If I fall back to ints representing year, week and day, I will certainly be trading developer productivity for a few saved CPU cycles, and I want to go in the exact opposite direction: developer productivity > (nearly) anything else.

In [9]:
%%time

transactions.t_dat = pd.to_datetime(transactions.t_dat, format='%Y-%m-%d')
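# .dt.week was removed in pandas 2.0; the ISO year keeps weeks around New Year in order
iso = transactions.t_dat.dt.isocalendar()
transactions['week'] = iso.year.astype('int64') * 100 + iso.week.astype('int64')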
transactions['week'] = transactions.week.rank(method='dense').astype('int')



CPU times: user 10.1 s, sys: 814 ms, total: 10.9 s
Wall time: 10.5 s
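
One subtlety when building a week key out of a year and an ISO week number: the last days of December can belong to ISO week 1 of the following year. The sketch below uses three Mondays around New Year 2019 to show how a calendar-year key gets out of order. It also shows an alternative I am only assuming would be convenient here: counting whole weeks since the first date.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-12-24", "2018-12-31", "2019-01-07"]))
iso = dates.dt.isocalendar()

# Calendar year + ISO week: 2018-12-31 is in ISO week 1 (of ISO year 2019),
# so pairing it with calendar year 2018 gives 201801, which sorts too early.
naive_key = dates.dt.year * 100 + iso.week.astype("int64")

# ISO year + ISO week keeps chronological order.
iso_key = iso.year.astype("int64") * 100 + iso.week.astype("int64")

# Alternative without calendar logic: whole weeks elapsed since the first date.
weeks_since_start = (dates - dates.min()).dt.days // 7
```

A dense rank over `iso_key` or over `weeks_since_start` gives a week index that increases with time, which is what the later validation split relies on.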


In [10]:
transactions.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 6 columns):
 #   Column            Dtype         
---  ------            -----         
 0   t_dat             datetime64[ns]
 1   customer_id       uint64        
 2   article_id        object        
 3   price             float64       
 4   sales_channel_id  int64         
 5   week              int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(1), uint64(1)
memory usage: 3.2 GB


Let's do something about the `article_id` (both here and on `articles`) and let's take a closer look at `price`, `sales_channel_id` and `week`.

In [11]:
transactions.article_id = article_id_str_to_int(transactions.article_id)
articles.article_id = article_id_str_to_int(articles.article_id)

transactions.week = transactions.week.astype('int8')
transactions.sales_channel_id = transactions.sales_channel_id.astype('int8')
transactions.price = transactions.price.astype('float32')
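
These casts silently wrap around if a value does not fit the target type, so a range check is cheap insurance. Below is a sketch of such a check; `safe_downcast` is a hypothetical helper of mine, and the values are toy examples.

```python
import numpy as np
import pandas as pd

def safe_downcast(series, dtype):
    """Cast to an integer `dtype` only if every value fits (hypothetical helper)."""
    info = np.iinfo(dtype)
    if series.min() < info.min or series.max() > info.max:
        raise ValueError(f"{series.name}: values outside the {dtype} range")
    return series.astype(dtype)

# A dense week index of roughly two years fits into int8 (-128..127).
week = pd.Series([1, 52, 105], name="week")
week8 = safe_downcast(week, "int8")

# Nine-digit article ids fit into int32 (max 2147483647).
article = pd.Series([108775015, 956217002], name="article_id")
article32 = safe_downcast(article, "int32")
```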

In [12]:
transactions.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 6 columns):
 #   Column            Dtype         
---  ------            -----         
 0   t_dat             datetime64[ns]
 1   customer_id       uint64        
 2   article_id        int32         
 3   price             float32       
 4   sales_channel_id  int8          
 5   week              int8          
dtypes: datetime64[ns](1), float32(1), int32(1), int8(2), uint64(1)
memory usage: 788.2 MB


In [13]:
transactions.drop(columns='t_dat').info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 5 columns):
 #   Column            Dtype  
---  ------            -----  
 0   customer_id       uint64 
 1   article_id        int32  
 2   price             float32
 3   sales_channel_id  int8   
 4   week              int8   
dtypes: float32(1), int32(1), int8(2), uint64(1)
memory usage: 545.7 MB


Well, this is interesting. Despite being a scary `datetime64`, `t_dat` takes up very little memory: about 240 MB, since every value is stored as a fixed 8 bytes, compared to the roughly 2 GB the same dates occupied as Python strings!

Keeping it for convenience is definitely the way to go.
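
A quick way to see where the saving comes from, on a toy column (the size `n` and the date are arbitrary): a `datetime64` column stores one 8-byte integer per row, while an object column stores a pointer plus a full Python string per row.

```python
import pandas as pd

n = 1000
as_text = pd.Series(["2020-09-22"] * n, dtype="object")  # Python strings
as_dt = pd.to_datetime(as_text, format="%Y-%m-%d")       # datetime64

text_bytes = as_text.memory_usage(index=False, deep=True)
dt_bytes = as_dt.memory_usage(index=False, deep=True)

# datetime64 is fixed-width: 8 bytes per row, however many distinct dates there are.
bytes_per_row = dt_bytes / n
```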

Let's take a brief look at the `customers` and `articles` dfs.

In [14]:
customers.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1371980 entries, 0 to 1371979
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   customer_id             1371980 non-null  object 
 1   FN                      476930 non-null   float64
 2   Active                  464404 non-null   float64
 3   club_member_status      1365918 non-null  object 
 4   fashion_news_frequency  1355971 non-null  object 
 5   age                     1356119 non-null  float64
 6   postal_code             1371980 non-null  object 
dtypes: float64(3), object(4)
memory usage: 512.3 MB


In [15]:
articles.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105542 entries, 0 to 105541
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   article_id                    105542 non-null  int32 
 1   product_code                  105542 non-null  int64 
 2   prod_name                     105542 non-null  object
 3   product_type_no               105542 non-null  int64 
 4   product_type_name             105542 non-null  object
 5   product_group_name            105542 non-null  object
 6   graphical_appearance_no       105542 non-null  int64 
 7   graphical_appearance_name     105542 non-null  object
 8   colour_group_code             105542 non-null  int64 
 9   colour_group_name             105542 non-null  object
 10  perceived_colour_value_id     105542 non-null  int64 
 11  perceived_colour_value_name   105542 non-null  object
 12  perceived_colour_master_id    105542 non-null  int64 
 13 

Well, this stuff will be getting merged with our transactions df at some point, so I guess we can also make this smaller and easier to work with down the road.

In [18]:
customers['club_member_status'].unique()

array(['ACTIVE', nan, 'PRE-CREATE', 'LEFT CLUB'], dtype=object)

In [19]:
customers.customer_id = customer_hex_id_to_int(customers.customer_id)
for col in ['FN', 'Active', 'age']:
    # assign the result back instead of using inplace=True on a column selection
    customers[col] = customers[col].fillna(-1).astype('int8')

In [20]:
customers.club_member_status = Categorize().fit_transform(customers[['club_member_status']]).club_member_status
customers.postal_code = Categorize().fit_transform(customers[['postal_code']]).postal_code
customers.fashion_news_frequency = Categorize().fit_transform(customers[['fashion_news_frequency']]).fashion_news_frequency
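
To show what `Categorize` does to a single column, here is the same logic spelled out on a toy series (the values are borrowed from `club_member_status`, the counts are made up). Categories are ordered by frequency, and anything missing or below the `min_examples` threshold ends up as `-1`.

```python
import numpy as np
import pandas as pd

status = pd.Series(
    ["ACTIVE", "ACTIVE", "PRE-CREATE", np.nan, "LEFT CLUB", "ACTIVE", "PRE-CREATE"]
)
counts = status.value_counts()  # ACTIVE: 3, PRE-CREATE: 2, LEFT CLUB: 1

# min_examples=0 keeps every observed value.
categories_all = counts[counts > 0].index.tolist()
codes_all = pd.Categorical(status, categories=categories_all).codes

# min_examples=1 drops values seen only once; they are encoded as -1, like NaN.
categories_frequent = counts[counts > 1].index.tolist()
codes_frequent = pd.Categorical(status, categories=categories_frequent).codes
```

The codes come back as the smallest integer type that can hold them, which is where the `int8` / `int16` / `int32` columns in the outputs come from.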

In [21]:
customers.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1371980 entries, 0 to 1371979
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype 
---  ------                  --------------    ----- 
 0   customer_id             1371980 non-null  uint64
 1   FN                      1371980 non-null  int8  
 2   Active                  1371980 non-null  int8  
 3   club_member_status      1371980 non-null  int8  
 4   fashion_news_frequency  1371980 non-null  int8  
 5   age                     1371980 non-null  int8  
 6   postal_code             1371980 non-null  int32 
dtypes: int32(1), int8(5), uint64(1)
memory usage: 22.2 MB


In [22]:
for col in articles.columns:
    if articles[col].dtype == 'object':
        articles[col] = Categorize().fit_transform(articles[[col]])[col]

In [23]:
articles.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105542 entries, 0 to 105541
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   article_id                    105542 non-null  int32
 1   product_code                  105542 non-null  int64
 2   prod_name                     105542 non-null  int32
 3   product_type_no               105542 non-null  int64
 4   product_type_name             105542 non-null  int16
 5   product_group_name            105542 non-null  int8 
 6   graphical_appearance_no       105542 non-null  int64
 7   graphical_appearance_name     105542 non-null  int8 
 8   colour_group_code             105542 non-null  int64
 9   colour_group_name             105542 non-null  int8 
 10  perceived_colour_value_id     105542 non-null  int64
 11  perceived_colour_value_name   105542 non-null  int8 
 12  perceived_colour_master_id    105542 non-null  int64
 13  perceived_colo

In [24]:
for col in articles.columns:
    if articles[col].dtype == 'int64':
        articles[col] = articles[col].astype('int32')

And this concludes our raw data preparation step! Let's now write everything back to disk.

In [26]:
%%time

transactions.to_parquet('data/transactions_train.parquet')
customers.to_parquet('data/customers.parquet')
articles.to_parquet('data/articles.parquet')

CPU times: user 2.56 s, sys: 922 ms, total: 3.48 s
Wall time: 3.38 s


Let's also generate a sample we will be able to use to speed up development.

In [51]:
%%time
# let's create a 5% sample of the entirety of the data to speed up dev

sample = 0.05
customers_sample = customers.sample(frac=sample, replace=False)
customers_sample_ids = set(customers_sample['customer_id'])
transactions_sample = transactions[transactions["customer_id"].isin(customers_sample_ids)]
articles_sample_ids = set(transactions_sample["article_id"])
articles_sample = articles[articles["article_id"].isin(articles_sample_ids)]

# the files carry a .parquet extension, so write them with to_parquet rather than to_csv
customers_sample.to_parquet(f'data/customers_sample_{sample}.parquet', index=False)
transactions_sample.to_parquet(f'data/transactions_train_sample_{sample}.parquet', index=False)
articles_sample.to_parquet(f'data/articles_train_sample_{sample}.parquet', index=False)

CPU times: user 12.6 s, sys: 1.84 s, total: 14.5 s
Wall time: 12.4 s
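
Since the sample drives every later experiment, it may be worth making it reproducible. The sketch below is an assumption on my part rather than what the cell above does: it passes a fixed `random_state` so that rerunning the notebook yields the same customers. The helper name and the toy frames are hypothetical.

```python
import pandas as pd

customers_toy = pd.DataFrame({"customer_id": range(100)})
transactions_toy = pd.DataFrame({
    "customer_id": [i % 100 for i in range(500)],  # 5 transactions per customer
    "article_id": [i % 37 for i in range(500)],
})

def sample_by_customer(customers, transactions, frac, seed=42):
    """Keep a fraction of customers together with all of their transactions."""
    cust = customers.sample(frac=frac, replace=False, random_state=seed)
    tx = transactions[transactions["customer_id"].isin(cust["customer_id"])]
    return cust, tx

cust_a, tx_a = sample_by_customer(customers_toy, transactions_toy, frac=0.05)
cust_b, tx_b = sample_by_customer(customers_toy, transactions_toy, frac=0.05)
```

Sampling by customer rather than by transaction keeps each sampled customer's full purchase history intact, which matters for history-based features.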


## Evaluation

In [50]:
%%time

eval_sub('data/subs/most_purchased_last_two_weeks.csv.gz', True)

CPU times: user 9.54 s, sys: 698 ms, total: 10.2 s
Wall time: 9.62 s


0.0036434351466822766

There is no point messing around with something that works.

I could try to come up with something that works on predictions held as lists in memory, but let's keep things simple.
Running a 10-second function on the results in a csv file (which also validates our submission generation) is good enough for me.
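
For reference, should the in-memory route ever become worthwhile, here is a sketch of MAP@12 computed directly on lists. It follows the standard definition of average precision at k and is not necessarily identical to the local `average_precision.apk` module used above; the function names and toy data are my own.

```python
import numpy as np

def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer: `actual` are purchased items, `predicted` a ranked list."""
    if not actual:
        return 0.0
    actual = set(actual)
    hits, score, seen = 0, 0.0, set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, at the rank where it first appears.
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual), k)

def mean_average_precision(actuals, predictions, k=12):
    """MAP@k over customers, skipping customers without purchases."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions) if a]
    return float(np.mean(scores)) if scores else 0.0

actuals = [["a", "b"], [], ["c"]]
predictions = [["a", "x", "b"], ["a"], ["x", "c"]]
score = mean_average_precision(actuals, predictions)
```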

## Strong starting point

This, without a doubt, has to be this Kaggle [kernel](https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2).

These candidates are so good they get a decent score even with the very simple ranking applied!

There is only one problem with this solution -- it doesn't structure the 'candidates' in a way that could be fed into a ranking model. We need to change this.

We need to:
- find a way to structure candidates so that they could be fed into a ranking model
- create features that would capture the information the kernel is relying on
- first output the predictions manually (hardcode a ranking model) and subsequently feed the data to a ranker
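
A sketch of what such a structure could look like: one row per (customer, candidate article) pair, feature columns alongside, a label from the validation week, and a hand-written scoring rule standing in for the ranker. All column names, feature names and values here are hypothetical.

```python
import pandas as pd

# One row per (customer, candidate article) pair, with example features.
candidates = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "article_id": [10, 11, 12, 10, 13],
    "bought_last_week": [1, 0, 0, 0, 1],
    "article_popularity": [50, 30, 5, 50, 8],
})

# Purchases from the validation week provide the labels.
purchases = pd.DataFrame({"customer_id": [1, 2], "article_id": [11, 13], "label": [1, 1]})
train = candidates.merge(purchases, on=["customer_id", "article_id"], how="left")
train["label"] = train["label"].fillna(0).astype("int8")

# Rankers typically need rows grouped per query plus the size of each group.
train = train.sort_values("customer_id", kind="stable").reset_index(drop=True)
group_sizes = train.groupby("customer_id", sort=False).size().tolist()

# Hardcoded "ranking model": a fixed rule instead of a learned score.
train["score"] = 100 * train["bought_last_week"] + train["article_popularity"]
ranked = (
    train.sort_values(["customer_id", "score"], ascending=[True, False])
    .groupby("customer_id")["article_id"]
    .apply(list)
)
```

Once the manual rule reproduces the kernel's score, the `score` column can be swapped for the output of a trained ranker without touching the rest of the pipeline.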

In [24]:
import dask
import dask.dataframe as dd  # the dataframe subpackage has to be imported explicitly

In [29]:
dask.dataframe.to_numeric('1').compute()

1

In [22]:
customers.shape

(Delayed('int-cb53c7cc-beb7-454f-b76f-cc32d58054c8'), 7)

In [31]:
dask.dataframe.to_numeric(customers.shape[0]*sample)

TypeError: arg must be a list, tuple, dask.array.Array, or dask.dataframe.Series

In [38]:
%%time
customers.compute().shape

CPU times: user 550 ms, sys: 283 ms, total: 833 ms
Wall time: 1.71 s


(1371980, 7)

In [47]:
set(cust_sample_ids)

{'7a3d363fc0cf476ed3be372f56902b29244f72c4c653ec3b02565972af2dc598',
 'd19af35f4e591b16d1a99cb15a3d5135114aefb7a3af60974af406fd694c5935',
 '3a4db6edad612b58a51f7e5b050634e1d28081b3607dbd6ec66f1f6b7de878e6',
 '2e0e149db35b176e645dacae44b7b0aa57a28b728d471f5399b42ea84f77af46',
 '47fe772dc0c347400c15945ce7333ff57eb82dab2f5c41f722ef5091118d9ba2',
 '9ad04ce037539987fbed43bf89ef6d13ffd096690826bcd99aa47d251a28463e',
 '61f45eb75c1a22477b17a96dbf5d95856b15b8b795cd6a7a45ca4994db628c82',
 'a278d5c4187212a549bfbf847ac5e8c4926e4f66e95799d7516ac48c8933b518',
 'c3657834186cb82f18532a7a6df3d168a27be6919b8aab929ac69f2cfb903da5',
 'a722efb3bbd45f738969595883c8ba60192991784fb7b6b1ab130af4cf207251',
 'a889dfb9ec4f8b87ec11a928a824feb3e5aa19f3c02b12be9c3db22265f6b22a',
 'ebbe00215f98d26e0b3878376fb8c3c7dff10d9b476da37f4ceb9419b02b9eb2',
 'd4f1f0fbd5740f7dd8cba497469c471ef068aab9bee3ce92c4da18d68c84b287',
 '0c94eebeb0a9ac953e931c6a7bbb76f73656bc18dcf0a4bb51cf34578c3dc8c6',
 '0377dbd5bf0cdf8453ec8d06183233ae

In [42]:
cust_sample_ids.compute()

249985    2ea9d88b21f0107db2470d1e06faff68942b6bdd13ec48...
136816    197c451127a00a65babb317311a08e3a16b947422e1c91...
1184      0037c4ae2ea5c547b388276aa21861e061e13c67846694...
41367     07bccee77fc9ae85865a67850aa152a1ee44ad8a218eb8...
58826     0afe86aafb4f606f6ac179640c1d06edd1c004ae41e3aa...
                                ...                        
82328     fca52b5204f0562fedc9bcf15e63e7cb60f579b1acdedf...
64930     f966769327d0ad2d684e866886d2a70d6673535e48e742...
28203     f293d6b7a1a21de226e7ec796748b5607160a2822e67b7...
75881     fb6f6497e09336e7ec974781eb25a017195735135b8c8b...
17471     f0990a8b2bda4bcbf57918a87c9fa3f46b1c85b9f065e9...
Name: customer_id, Length: 68598, dtype: object

In [None]:
# %%time

# for sample_repr, sample in [("01", 0.001), ("1", 0.01), ("5", 0.05)]:
#     print(sample)
#     customers_sample = customers.sample(int(customers.shape[0]*sample), replace=False)
#     customers_sample_ids = set(customers_sample["customer_id"])
#     transactions_sample = transactions[transactions["customer_id"].isin(customers_sample_ids)]
#     articles_sample_ids = set(transactions_sample["article_id"])
#     articles_sample = articles[articles["article_id"].isin(articles_sample_ids)]
#     customers_sample.to_csv(f"data/customers_sample_{sample_repr}.csv.gz", index=False)
#     transactions_sample.to_csv(f"data/transactions_train_sample_{sample_repr}.csv.gz", index=False)
#     articles_sample.to_csv(f"data/articles_train_sample_{sample_repr}.csv.gz", index=False)

In [9]:
transactions = dd.read_parquet('data/transactions_train.parquet', dtype={"article_id": "str"})

In [10]:
%%time
transactions.compute()

CPU times: user 4.07 s, sys: 2.06 s, total: 6.13 s
Wall time: 7.6 s


Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0685687004,0.016932,2
...,...,...,...,...,...
291609,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,0929511001,0.059305,2
291610,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,0891322004,0.042356,2
291611,2020-09-22,fff380805474b287b05cb2a7507b9a013482f7dd0bce0e...,0918325001,0.043203,1
291612,2020-09-22,fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e5...,0833459002,0.006763,1


In [None]:
transactions

In [None]:
import pandas as pd
import swifter
import numpy as np

In [4]:
client.close()