Fetches the customers who bought at least one product in the target week.

In [1]:
import pandas as pd, os, numpy as np
import plotly.express as px
pd.options.display.max_columns = 50
import swifter, datetime, pickle as pkl
import tensorflow_hub as hub
from tqdm.notebook import tqdm

### Objective

To train the model, we need the data in the following format:

Columns: [features], relevance

Column descriptions:
1. features: features of the user, the product, etc.
2. relevance: relevance of the product (1 = bought, 0 = not bought)

Breakdown for one user:

We need to create examples that pair features with a relevance label, so each row describes one user-product pair.
The relevance column must contain both positive and negative samples.

Positive samples: Items the user bought

Negative samples: Items the user didn't buy

How do we create the data, then?

1. Only create data for users who bought something, i.e. get the customers with at least one purchase.
2. For each of those customers, create a list of the 12 products they bought. These are the positive samples.
3. For each of those customers, create a list of products they didn't buy. These are the negative samples.

Repeat the above for all users.
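The steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: `build_samples`, the article IDs, and the uniform random choice of negatives are all assumptions, since the text does not say how the negative samples are selected.

```python
import random

def build_samples(bought, catalog, n_neg, seed=0):
    """Return (cust_id, article_id, relevance) rows for every customer in `bought`.

    Hypothetical helper: negatives are drawn uniformly at random from the
    catalog, which is an assumption and not necessarily the notebook's method.
    """
    rng = random.Random(seed)
    rows = []
    for cust_id, positives in bought.items():
        # Positive samples: articles the customer bought (relevance = 1).
        rows.extend((cust_id, article, 1) for article in positives)
        # Negative samples: catalog articles the customer did not buy (relevance = 0).
        bought_set = set(positives)
        candidates = [a for a in catalog if a not in bought_set]
        negatives = rng.sample(candidates, min(n_neg, len(candidates)))
        rows.extend((cust_id, article, 0) for article in negatives)
    return rows

# Toy inputs with made-up IDs: only customers with at least one purchase appear.
bought = {21: [101, 102], 29: [103]}
catalog = [101, 102, 103, 104, 105]

rows = build_samples(bought, catalog, n_neg=2)
for row in rows:
    print(row)
```

Each row would later be joined with user and product features to give the `[features], relevance` layout described above.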

### Creating the samples

In [2]:
df = pd.read_parquet('../data/train.parquet')

In [3]:
# Target week (both bounds inclusive): we predict what customers bought in this week,
# so the model should train only on data from before d_start
d_start = datetime.date(2019, 9, 23)
d_end = datetime.date(2019, 9, 29)
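The two dates bound an inclusive seven-day window that runs from a Monday to a Sunday. A minimal sketch of deriving the start date from the end date, where `week_window` is a hypothetical helper that is not part of the notebook:

```python
import datetime

def week_window(d_end, days=7):
    """Return (d_start, d_end) for an inclusive window of `days` days ending on d_end."""
    # Hypothetical helper: subtracting days - 1 keeps both bounds inside the window.
    return d_end - datetime.timedelta(days=days - 1), d_end

d_start, d_end = week_window(datetime.date(2019, 9, 29))
print(d_start, d_end, d_start.strftime("%A"), d_end.strftime("%A"))
```

Deriving one bound from the other avoids off-by-one mistakes when the target week is moved.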

In [4]:
# Keep only the transactions that fall inside the target week (both bounds inclusive)
tdf = df[(df['date'] >= d_start) & (df['date'] <= d_end)].reset_index(drop=True).copy()

In [5]:
# Number of times each customer bought each article during the target week
tdf['a_count'] = tdf.groupby(['cust_id', 'article_id'])['article_id'].transform('count')

In [6]:
# Per customer: most recent purchases first, ties broken by purchase count (highest first)
tdf.sort_values(['cust_id', 'date', 'a_count'], ascending=[True, False, False], inplace=True)

In [7]:
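# Per customer, list the article IDs in the sorted order, keeping only the first
# occurrence of each article: np.unique returns first-occurrence positions, and
# sorting those positions restores the original row order
bdf = tdf.groupby(["cust_id"])["article_id"].agg(
    lambda x: list(x.values[np.sort(np.unique(x.values, return_index=True)[1])])).reset_index()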
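The aggregation in cell 7 is an order-preserving deduplication. The sketch below reproduces it on toy data with made-up IDs and compares it with `Series.drop_duplicates`, which is offered here only as a possible simpler alternative; `unique_in_order` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy purchases (made-up IDs), already in the desired order per customer.
# Customer 1 buys article 10 twice.
toy = pd.DataFrame({
    "cust_id":    [1, 1, 1, 1, 2],
    "article_id": [30, 10, 20, 10, 40],
})

def unique_in_order(values):
    # np.unique returns sorted values, but return_index also gives the position
    # of the first occurrence of each value. Sorting those positions and
    # indexing back into `values` restores the original order without repeats.
    first_pos = np.unique(values, return_index=True)[1]
    return list(values[np.sort(first_pos)])

ordered = toy.groupby("cust_id")["article_id"].agg(lambda x: unique_in_order(x.values))

# Possible alternative: drop_duplicates keeps the first occurrence by default.
simpler = toy.groupby("cust_id")["article_id"].agg(lambda x: x.drop_duplicates().tolist())

print(ordered.tolist())
print(simpler.tolist())
```

Both approaches keep the first occurrence of each article, so the order set by the earlier sort is what decides which purchase ranks first.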

In [8]:
bdf

| | cust_id | article_id |
|---|---|---|
| 0 | 0 | [797065001] |
| 1 | 13 | [693242018, 661794006, 763037004, 640176008, 6... |
| 2 | 21 | [513512003, 535035001, 677930066] |
| 3 | 22 | [805947002, 705966002, 803290002, 797710001, 7... |
| 4 | 29 | [730683003, 787558001] |
| ... | ... | ... |
| 97987 | 1371932 | [771557002] |
| 97988 | 1371935 | [693243019, 674606045, 516903005, 718939001, 7... |
| 97989 | 1371949 | [542846003, 615959005, 685814034, 803683002, 5... |
| 97990 | 1371956 | [796240001, 833981001] |


In [9]:
bdf.to_parquet('../data/bought_articles_in_order.parquet')