## import my utils

In [None]:
import os

try: import fastkaggle
except ModuleNotFoundError:
    os.system("pip install -Uq fastkaggle")

from fastkaggle import *

# use fastdebug.utils 
if iskaggle: os.system("pip install nbdev snoop")

if iskaggle:
    # On Kaggle, load the utils module from the attached dataset
    path = "../input/fastdebugutils0"
    import sys
    sys.path.insert(1, path)
    import utils as fu
    from utils import *
else: 
    from fastdebug.utils import *
    import fastdebug.utils as fu

# Candidate ReRank Model using Handcrafted Rules
https://www.kaggle.com/code/cdeotte/candidate-rerank-model-lb-0-575?scriptVersionId=111214204

### rd: recsys - otto - candidate rerank - what is candidate rerank model
In this notebook, we present a "candidate rerank" model using handcrafted rules. 

### rd: recsys - otto - candidate rerank - how to improve this candidate rerank model
We can improve this model by engineering features, merging them onto items and users, and training a reranker model (such as XGB) to choose our final 20. 

### rd: recsys - otto - candidate rerank - how to further tune and improve this notebook

We should build a local CV scheme to experiment with new logic and/or models.

UPDATE: I published a notebook to compute validation score [here](https://www.kaggle.com/code/cdeotte/compute-validation-score-cv-565?scriptVersionId=111214251) using Radek's scheme described [here](https://www.kaggle.com/competitions/otto-recommender-system/discussion/364991).

### rd: recsys - otto - candidate rerank - what does a session mean in this competition

Note that in this competition, a "session" actually means a unique "user". So our task is to predict what each of the `1,671,803` test "users" (i.e. "sessions") will do in the future. For each test "user" (i.e. "session") we must predict what they will `click`, `cart`, and `order` during the remainder of the week-long test period.

## rd: recsys - otto - candidate rerank - The candidate rerank model of this notebook

### rd: recsys - otto - candidate rerank - Step 1 - Generate Candidates from 5 sources
For each test user, we generate possible choices, i.e. candidates. In this notebook, we generate candidates from 5 sources:
* User history of clicks, carts, orders
* Most popular 20 clicks, carts, orders during test week
* Co-visitation matrix of click/cart/order to cart/order with type weighting
* Co-visitation matrix of cart/order to cart/order called buy2buy
* Co-visitation matrix of click/cart/order to clicks with time weighting

### rd: recsys - otto - candidate rerank - Step 2 - ReRank and Choose 20 - what are those rules and what are their relationship between the candidates and XGBoost model
Given the list of candidates, we must select 20 to be our predictions. In this notebook, we do this with a set of handcrafted rules. We can improve our predictions by training an XGBoost model to select for us. Our handcrafted rules give priority to:
* Most recent previously visited items
* Items previously visited multiple times
* Items previously in cart or order
* Co-visitation matrix of cart/order to cart/order
* Current popular items
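
As a rough illustration of the first three rules, the sketch below scores a toy session history by recency, repeat visits, and event type. The history and the helper name `rerank_history` are made up for illustration. The weighting mirrors the `np.logspace` logic in `suggest_clicks` later in this notebook, but this is only a sketch, not the full rule set.

```python
from collections import Counter
import numpy as np

# 0 = click, 1 = cart, 2 = order (same encoding and weights as this notebook)
TYPE_WEIGHT = {0: 1, 1: 6, 2: 3}

def rerank_history(aids, types, k=3):
    """Score the items of one session history (oldest first) and return the top k."""
    # Recency weights grow from 2**0.1 - 1 (oldest) to 2**1 - 1 = 1 (newest)
    weights = np.logspace(0.1, 1, len(aids), base=2, endpoint=True) - 1
    scores = Counter()
    for aid, w, t in zip(aids, weights, types):
        # Repeat visits accumulate; carts and orders count more than clicks
        scores[aid] += w * TYPE_WEIGHT[t]
    return [aid for aid, _ in scores.most_common(k)]

# Toy history (assumed data): item 11 is clicked twice and carted once,
# item 33 is only the most recent click
aids = [11, 22, 11, 11, 33]
types = [0, 0, 0, 1, 0]
top = rerank_history(aids, types)
print(top)
```

Item 11 ranks first because its repeated visits and its cart event accumulate weight, while item 33 beats item 22 on recency alone.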

![candidate rerank model visualization](https://raw.githubusercontent.com/cdeotte/Kaggle_Images/main/Nov-2022/c_r_model.png)
  
## rd: recsys - otto - candidate rerank - what are learnt from other kagglers
We thank many Kagglers who have shared ideas. We use co-visitation matrix idea from Vladimir [here][1]. 

We use groupby sort logic from Sinan in comment section [here][4]. 

We use duplicate prediction removal logic from Radek [here][5]. 

We use multiple visit logic from Pietro [here][2]. 

We use type weighting logic from Ingvaras [here][3]. 

We use leaky test data from my previous notebook [here][4]. 

And some ideas may have originated from Tawara [here][6] and KJ [here][7]. 

We use Colum2131's parquets [here][8]. 

The image above is from Ravi's discussion about candidate rerank models [here][9].

[1]: https://www.kaggle.com/code/vslaykovsky/co-visitation-matrix
[2]: https://www.kaggle.com/code/pietromaldini1/multiple-clicks-vs-latest-items
[3]: https://www.kaggle.com/code/ingvarasgalinskas/item-type-vs-multiple-clicks-vs-latest-items
[4]: https://www.kaggle.com/code/cdeotte/test-data-leak-lb-boost
[5]: https://www.kaggle.com/code/radek1/co-visitation-matrix-simplified-imprvd-logic
[6]: https://www.kaggle.com/code/ttahara/otto-mors-aid-frequency-baseline
[7]: https://www.kaggle.com/code/whitelily/co-occurrence-baseline
[8]: https://www.kaggle.com/datasets/columbia2131/otto-chunk-data-inparquet-format
[9]: https://www.kaggle.com/competitions/otto-recommender-system/discussion/364721
[10]: https://www.kaggle.com/cdeotte/compute-validation-score-cv-564
[11]: https://www.kaggle.com/competitions/otto-recommender-system/discussion/364991

## rd: recsys - otto - candidate rerank - changes and improvements across different versions
Below are notes about versions:
* **Version 1 LB 0.573** Uses popular ideas from public notebooks and adds additional co-visitation matrices and additional logic. Has CV `0.563`. See validation notebook version 2 [here](https://www.kaggle.com/code/cdeotte/compute-validation-score-cv-564).
* **Version 2 LB 0.573** Refactor logic for `suggest_buys(df)` to make it clear how new co-visitation matrices are reranking the candidates by adding to candidate weights. The new logic also boosts CV by `+0.0003`, and LB is slightly better too. See validation notebook version 3 [here](https://www.kaggle.com/code/cdeotte/compute-validation-score-cv-564).
* **Version 3** is the same as version 2 but 1.5x faster co-visitation matrix computation!
* **Version 4 LB 0.575** Use top20 for clicks and top15 for carts and buys (instead of top40 and top40). This boosts CV `+0.0015` hooray! New CV is `0.5647`. See validation version 5 [here](https://www.kaggle.com/code/cdeotte/compute-validation-score-cv-564).
* **Version 5** is the same as version 4 but 2x faster co-visitation matrix computation! (and 3x faster than version 1)
* **Version 6** Stay tuned for more versions...

[1]: https://www.kaggle.com/code/cdeotte/compute-validation-score-cv-564

## rd: recsys - otto - candidate rerank - Step 1 - Candidate Generation with 3 co-visitation matrices and RAPIDS cuDF GPU

### rd: recsys - otto - candidate rerank - what are the 3 co-visitation matrices
For candidate generation, we build three co-visitation matrices. 

One computes the popularity of cart/order given a user's previous click/cart/order. We apply type weighting to this matrix. 

One computes the popularity of cart/order given a user's previous cart/order. We call this "buy2buy" matrix. 

One computes the popularity of clicks given a user's previous click/cart/order. We apply time weighting to this matrix. 
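
To make the idea concrete, here is a minimal sketch of an unweighted co-visitation count on a toy event log. It uses Pandas on CPU rather than cuDF, and the data is made up for illustration. The steps (self-merge on `session`, time window, drop duplicates, group and count) follow the same outline as the cells further down.

```python
import pandas as pd

# Toy event log (assumed data), ts in seconds
events = pd.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "aid":     [10, 20, 30, 10, 20],
    "ts":      [0, 60, 120, 0, 30],
})

# Pair every event with every other event of the same session
pairs = events.merge(events, on="session")
# Keep pairs of different items that occurred within one day of each other
one_day = 24 * 60 * 60
pairs = pairs[((pairs.ts_x - pairs.ts_y).abs() < one_day) & (pairs.aid_x != pairs.aid_y)]
# Count each (aid_x, aid_y) pair at most once per session
pairs = pairs.drop_duplicates(["session", "aid_x", "aid_y"])
# Sum over sessions, then rank the partners of each aid_x by weight
covisit = (
    pairs.groupby(["aid_x", "aid_y"]).size()
    .rename("wgt").reset_index()
    .sort_values(["aid_x", "wgt"], ascending=[True, False])
)
print(covisit)
```

Items 10 and 20 appear together in both sessions, so their pair gets weight 2, while pairs involving item 30 get weight 1.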

### rd: recsys - otto - candidate rerank - what are RAPIDS cuDF GPU for
We will use RAPIDS cuDF GPU to compute these matrices quickly!

## rd: recsys - otto - candidate rerank - imports needed

In [None]:
VER = 5

import pandas as pd, numpy as np
from tqdm.notebook import tqdm
import os, sys, pickle, glob, gc
from collections import Counter
import cudf, itertools
print('We will use RAPIDS version',cudf.__version__)

### rd: recsys - otto - candidate rerank - Tricks to Compute Three Co-visitation Matrices super fast
We will compute 3 co-visitation matrices using RAPIDS cuDF on GPU. This is 30x faster than using Pandas CPU like other public notebooks! 

For maximum speed, set the variable `DISK_PIECES` to the smallest number possible based on the GPU you are using without incurring memory errors. If you run this code offline with 32GB of GPU RAM, then you can use `DISK_PIECES = 1` and compute each co-visitation matrix in almost 1 minute! Kaggle's GPU only has 16GB of RAM, so we use `DISK_PIECES = 4` and it takes an amazing 3 minutes each! 

Below are some of the tricks to speed up computation (question: does this notebook do all these below?)
* Use RAPIDS cuDF GPU instead of Pandas CPU
* Read disk once and save in CPU RAM for later GPU multiple use
* Process largest amount of data possible on GPU at one time
* Merge data in two stages. Multiple small to single medium. Multiple medium to single large.
* Write result as parquet instead of dictionary

## rd: recsys - otto - candidate rerank - How to Cache data on CPU before processing on GPU

Load all files into dataframes, reduce their size by downcasting dtypes, and save them in a dict in CPU RAM so that the GPU can read them later without going back to disk.
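
The effect of the dtype changes can be checked on a toy frame. The sketch below assumes the raw schema has `ts` in milliseconds as `int64` and `type` as a string, which is what the conversion in the next cell implies. The data and the cache key are made up.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw parquet columns (assumed schema)
raw = pd.DataFrame({
    "session": np.arange(1000, dtype="int32"),
    "aid": np.arange(1000, dtype="int32"),
    "ts": np.full(1000, 1659304800000, dtype="int64"),
    "type": ["clicks"] * 1000,
})
type_labels = {"clicks": 0, "carts": 1, "orders": 2}

small = raw.copy()
small["ts"] = (small["ts"] / 1000).astype("int32")             # ms -> s, 8 bytes -> 4 bytes
small["type"] = small["type"].map(type_labels).astype("int8")  # string -> 1 byte

before = raw.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")

# Keep the reduced frame in a plain dict so the file is read from disk only once
data_cache = {"toy.parquet": small}
```

Timestamps in seconds still fit in `int32`, which is why the division by 1000 comes before the cast.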

In [None]:
%%time
# CACHE FUNCTIONS
def read_file(f):
    return cudf.DataFrame( data_cache[f] )
def read_file_to_cache(f):
    df = pd.read_parquet(f)
    df.ts = (df.ts/1000).astype('int32')
    df['type'] = df['type'].map(type_labels).astype('int8')
    return df

# CACHE THE DATA ON CPU BEFORE PROCESSING ON GPU
data_cache = {}
type_labels = {'clicks':0, 'carts':1, 'orders':2}
files = glob.glob('../input/otto-chunk-data-inparquet-format/*_parquet/*') # all parquet filenames into a list
for f in files: data_cache[f] = read_file_to_cache(f)

# CHUNK PARAMETERS
READ_CT = 5 # number of files read and processed together in one step (one group inside a CHUNK of files)
# question: if so, why is it named READ_CT rather than READ_NF?
# question: where does 6 come from? If it relates to READ_CT, why not use READ_CT + 1?
CHUNK = int( np.ceil( len(files)/6 )) # files per outer chunk; CHUNK == 25 when there are 146 files
print(f'We will process {len(files)} files, in groups of {READ_CT} and chunks of {CHUNK}.')

## 1) "Carts Orders" Co-visitation Matrix - Type Weighted
### rd: recsys - otto - candidate rerank - what does DISK_PIECES = 4 mean? 

Yes, we iterate 4 times because the GPU memory in Kaggle's P100 GPU can only handle making 1/4th of the co-visitation matrix at a time. The variable `DISK_PIECES` refers to how many output files we make when producing one co-visitation matrix.

As you make new, different co-visitation matrices and/or use a different GPU offline with more GPU RAM, you can try to use a smaller `DISK_PIECES`. For me, I use `DISK_PIECES = 1` offline and everything is much faster.

### rd: recsys - otto - candidate rerank - why use only the last 30 aids not all of them? 

This is mainly done to save memory and make things faster. You can try using all data to see if it improves the CV score.

Using more data might not necessarily help, because a user's tail (the last 30 rows) is their most recent and most meaningful data: it helps predict the future and captures recent co-visitation trends.

You can experiment with changing the tail length and seeing how it affects CV and LB. Note that if you increase the tail, you may need to increase `DISK_PIECES` to avoid a memory error.
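
For reference, the tail selection itself can be sketched in Pandas as follows. The toy data and the small `TAIL` value are assumptions chosen to keep the output readable; the notebook keeps the last 30 events and runs the same steps in cuDF.

```python
import pandas as pd

TAIL = 2  # the notebook uses 30

# Toy event log (assumed data)
events = pd.DataFrame({
    "session": [1, 1, 1, 2, 2, 2, 2],
    "aid":     [10, 20, 30, 40, 50, 60, 70],
    "ts":      [1, 2, 3, 1, 2, 3, 4],
})

# Newest events first within each session
events = events.sort_values(["session", "ts"], ascending=[True, False])
# Clean index before cumcount, as the notebook does for Kaggle's older cuDF
events = events.reset_index(drop=True)
# n = 0 for the newest event of a session, 1 for the next newest, and so on
events["n"] = events.groupby("session").cumcount()
tail = events.loc[events.n < TAIL].drop("n", axis=1)
print(tail)
```

Changing `TAIL` is the experiment suggested above: a larger value keeps more history per session at the cost of more pairs in the self-merge.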


### rd: recsys - otto - candidate rerank - why type_weight is set to {0:1, 1:6, 2:3} vs {0:1, 1:3, 2:6} 

I'm not trying to align with weights in competition metric. These weights were chosen by Ingvaras in his notebook [here](https://www.kaggle.com/code/ingvarasgalinskas/item-type-vs-multiple-clicks-vs-latest-items). We are trying to determine how important previous behavior is.

The basic idea is this. We want to predict future behavior, so the question is what is more important "someone previously clicked an item OR someone previously put an item in their cart". I would say the second is more important. That means the user will most likely click this item again or order this item. So we give lots of weight to previous behavior of "cart".

Next, we wonder what is more important "someone previously put an item in their cart OR someone previously ordered an item". In both cases, the user might buy the item. But it is more likely that a user will buy an item if they put it in their cart versus buy an item if they have already bought the item. People do buy items multiple times so previously buying an item is more important than previously clicking an item (when predicting a future purchase).
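
A small sketch of how these weights enter the matrix: each pair is weighted by the event type of its second item, and the weights are summed per pair. The toy pairs are made up, and each row is assumed to come from a different session.

```python
import pandas as pd

type_weight = {0: 1, 1: 6, 2: 3}  # click, cart, order (values used in this notebook)

# Toy pairs (assumed data): type_y is the event type of aid_y
pairs = pd.DataFrame({
    "aid_x":  [10, 10, 10, 10],
    "aid_y":  [20, 20, 30, 40],
    "type_y": [0, 1, 2, 0],
})
pairs["wgt"] = pairs.type_y.map(type_weight).astype("float32")
# Sum the weights of identical pairs, strongest partner first
scores = pairs.groupby(["aid_x", "aid_y"]).wgt.sum().sort_values(ascending=False)
print(scores)
```

Here item 20 was clicked once and carted once after item 10, so it scores 1 + 6 = 7 and outranks item 30, which was ordered once (3).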


### rd: recsys - otto - candidate rerank - how to use cudf in kaggle

When you use a Kaggle notebook with GPU, then RAPIDS comes already installed by Kaggle.

Note that Kaggle uses version 21.10.01, which is one year old but still works well. Due to drivers and the Python version, I don't think we can install the latest RAPIDS into Kaggle.

Here is one important bug to be aware of. When using `df['n'] = df.groupby(COL1)[COL2].cumcount()`, make sure that you have reset the index with `df = df.reset_index(drop=True)` before the cumcount call; otherwise, it does cumcount incorrectly. This is fixed in newer versions of RAPIDS cuDF. Also, if you use `df.groupby(COL1)[COL2].transform(FUNC)`, then you need to reset the index first too.


### rd: recsys - otto - candidate rerank - why and how to compute only a quarter of dataframe for memory - df = df.loc[(df.aid_x >= PART*SIZE)&(df.aid_x < (PART+1)*SIZE)]


This line splits the dataframe into 4 parts. If we don't break the construction of the dataframe into 4 parts, then we will get a GPU memory error. How would you split the dataframe into 4 parts differently?

(by doing it the way above, all the dictionary entries for top_20[aid_x][ANYTHING] will be together and saved in the dictionary piece together.)
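
The point about dictionary entries staying together can be checked with a toy example. In this sketch the pairs, `N_AIDS`, and the helper `to_dict` are assumptions for illustration (`to_dict` follows the same idea as the `pqt_to_dict` function used later). Because each part covers its own `aid_x` range, the per-part dictionaries never share a key and can simply be merged with `update`.

```python
import pandas as pd

DISK_PIECES = 2   # the notebook uses 4 on Kaggle
N_AIDS = 100      # toy upper bound on aid values (assumed)
SIZE = N_AIDS / DISK_PIECES

# Toy co-visitation rows (assumed data)
pairs = pd.DataFrame({
    "aid_x": [3, 3, 49, 50, 99],
    "aid_y": [7, 8, 9, 3, 50],
})

def to_dict(df):
    # aid_x -> list of its aid_y partners
    return df.groupby("aid_x").aid_y.apply(list).to_dict()

merged = {}
for part in range(DISK_PIECES):
    # Same filter as in the notebook, applied to the toy pairs
    piece = pairs.loc[(pairs.aid_x >= part * SIZE) & (pairs.aid_x < (part + 1) * SIZE)]
    piece_dict = to_dict(piece)
    # Parts cover disjoint aid_x ranges, so update() never overwrites a key
    assert not (merged.keys() & piece_dict.keys())
    merged.update(piece_dict)
print(merged)
```

Splitting on `aid_y` or on row position instead would scatter the partners of one `aid_x` across several parts, and a plain `update` would then lose entries.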

### rd: recsys - otto - candidate rerank - why READ_CT = 5, why splitting all files into 6 parts (results in CHUNK) 
(Based on my current understanding, READ_CT = 5 can be found first by experimenting on the Kaggle GPU; then we can experiment to see how large a merged dataframe the GPU can handle, which determines how many parts to merge at the end. To read and contemplate every day.)

Here's a diagram of what's happening. We can adjust READ_CT to the biggest number our GPU will allow before memory error because we always want GPUs doing as much work as possible (without memory error). That is step 1.

After processing, we have a small dataframe of results (this is the blue square in the left column of the diagram below). If READ_CT is 5 and there are 146 files on disk, then we will have 30 small dataframes. We might think, "why not just merge all 30 dataframes together?".

When we merge "small dataframe A" into "small dataframe B" and both have 1e6 rows, then it might take 10 seconds. If "dataframe A" has 10e6 rows and "dataframe B" has 1e6 rows, it might take 40 seconds. And if both have 10e6 rows, it might take 50 seconds. Merging into an already large dataframe is therefore expensive, so we first merge the small dataframes within each CHUNK (1/6 of all files) into one medium dataframe, and then merge the 6 medium dataframes together. Given the number of rows of train data, I have found that the choice of 6 maximizes speed for our competition data, so the CHUNK variable does not need to be changed.

![](https://raw.githubusercontent.com/cdeotte/Kaggle_Images/main/Nov-2022/merge2.png)
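
The grouping arithmetic and the two merge stages can be traced with a toy stand-in. In the sketch below, `partial_result` is a hypothetical placeholder for the real per-group processing, and `N_OUTER` is just a name for the hard-coded 6. Only the loop structure and the `add(..., fill_value=0)` merging follow the notebook.

```python
import numpy as np
import pandas as pd

N_FILES, READ_CT, N_OUTER = 146, 5, 6    # values described in this notebook
CHUNK = int(np.ceil(N_FILES / N_OUTER))  # files per outer chunk

def partial_result(first_file):
    # Hypothetical stand-in for one processed group of files:
    # a small Series of pair weights, one shared pair and one unique pair
    return pd.Series({"common_pair": 1.0, f"pair_from_group_{first_file}": 1.0})

n_groups = 0
total = None
for j in range(N_OUTER):
    a, b = j * CHUNK, min((j + 1) * CHUNK, N_FILES)
    medium = None
    for k in range(a, b, READ_CT):
        small = partial_result(k)
        n_groups += 1
        # Stage 1: merge the small results of one outer chunk into a medium result
        medium = small if medium is None else medium.add(small, fill_value=0)
    # Stage 2: merge the medium results into the final large result
    total = medium if total is None else total.add(medium, fill_value=0)

print(CHUNK, n_groups, total["common_pair"])
```

With 146 files this gives chunks of 25 files and 30 small results in total, matching the count mentioned above, and the shared pair ends up with the sum of all 30 partial weights.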

### rd: recsys - otto - candidate rerank - How do we know whether we should care about whether aid_y comes before or after aid_x within a day

This is for you to explore and find out :-) 

Co-visitation matrices are pairs of items: ITEM_A to ITEM_B. Depending on what type of items ITEM_A and ITEM_B are, I have found different things are better. When ITEM_A and ITEM_B are both the same type, using both forward and backward is better (because when exploring what sneakers to buy, the order in which you click sneakers doesn't matter). However, if ITEM_A is "click" and ITEM_B is "cart" or "order", then forward in time is better (because why click for sneakers after you have already bought sneakers?).
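
The difference between the two options comes down to one filter on the timestamps. The sketch below uses toy data in Pandas. The `both` filter matches the `abs()` condition used in this notebook's cells, while the `forward` filter is a variant to experiment with and is not what the cells below do.

```python
import pandas as pd

# Toy session (assumed data), ts in seconds
events = pd.DataFrame({
    "session": [1, 1, 1],
    "aid":     [10, 20, 30],
    "ts":      [0, 60, 120],
})
pairs = events.merge(events, on="session")
pairs = pairs[pairs.aid_x != pairs.aid_y]

# Both directions: aid_y may come before or after aid_x
both = pairs[(pairs.ts_x - pairs.ts_y).abs() < 24 * 60 * 60]
# Forward only: aid_y must come strictly after aid_x
forward = both[both.ts_y > both.ts_x]

print(len(both), len(forward))
```

Forward-only pairing halves the number of pairs here, which also reduces memory, so it is cheap to compare both variants on CV.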


In [None]:
%%time
type_weight = {0:1, 1:6, 2:3}

# USE SMALLEST DISK_PIECES POSSIBLE WITHOUT MEMORY ERROR
DISK_PIECES = 4 # number of parts (output files) used to build one co-visitation matrix
SIZE = 1.86e6/DISK_PIECES # there are 1,855,603 aids; each part covers an aid_x range of width SIZE (a quarter of the aids here)

# COMPUTE IN PARTS FOR MEMORY MANAGEMENT
for PART in range(DISK_PIECES): 
    print()
    print('### DISK PART',PART+1) # which part of the disk we are using at the moment
    
    # MERGE IS FASTEST PROCESSING CHUNKS WITHIN CHUNKS
    # => OUTER CHUNKS
    for j in range(6): # outer chunks have 6 parts to merge
        a = j*CHUNK # the starting file index of this outer chunk
        b = min( (j+1)*CHUNK, len(files) ) # one past the last file index of this outer chunk
        print(f'Processing files {a} thru {b-1} in groups of {READ_CT}...') # the files of this chunk are split into groups of READ_CT files each
        
        # => INNER CHUNKS
        for k in range(a,b,READ_CT): # step through the files of this outer chunk, READ_CT (5) files at a time
            # READ FILE: read up to READ_CT files into a list of dataframes
            df = [read_file(files[k])] # first file 
            for i in range(1,READ_CT): 
                if k+i<b: df.append( read_file(files[k+i]) ) # start to add second file
            df = cudf.concat(df,ignore_index=True,axis=0) # stack the dataframes vertically
            df = df.sort_values(['session','ts'],ascending=[True,False]) # sort df from small session id to large, then inside a session sort from large ts to small
            # USE TAIL OF SESSION: use the latest 30 aids 
            df = df.reset_index(drop=True) # make sure the index is clean (because of using an old cudf version by kaggle)
            df['n'] = df.groupby('session').cumcount() # give index to the aids of each session
            df = df.loc[df.n<30].drop('n',axis=1) # keep the rows where n is < 30 and remove the column 'n'
            # CREATE PAIRS: aid pair occurred in a day and must not be the same (question: it does not matter whether aid_y can come before or after aid_x?)
            df = df.merge(df,on='session')
            df = df.loc[ ((df.ts_x - df.ts_y).abs()< 24 * 60 * 60) & (df.aid_x != df.aid_y) ] # no more use for ts, so removed later
            # MEMORY MANAGEMENT COMPUTE IN PARTS: use only a quarter of the dataframe to compute
            df = df.loc[(df.aid_x >= PART*SIZE)&(df.aid_x < (PART+1)*SIZE)]
            # ASSIGN WEIGHTS: 
            df = df[['session', 'aid_x', 'aid_y','type_y']].drop_duplicates(['session', 'aid_x', 'aid_y']) # drop rows which are the same on those columns
            df['wgt'] = df.type_y.map(type_weight) # add a column which give specific weights to specific types
            df = df[['aid_x','aid_y','wgt']] # keep only 3 columns
            df.wgt = df.wgt.astype('float32') # make weight a float 32 not int
            df = df.groupby(['aid_x','aid_y']).wgt.sum() # accumulate weights for each pair (use df.sort_values())
            # COMBINE INNER CHUNKS
            if k==a: tmp2 = df # if we just start in the inner chunk
            else: tmp2 = tmp2.add(df, fill_value=0) # add the next 5 files dataframe
            print(k,', ',end='') # track the index of the files being processed
        print()
        # COMBINE OUTER CHUNKS
        if a==0: tmp = tmp2 # if the outer chunk just started
        else: tmp = tmp.add(tmp2, fill_value=0) # add the result of the next outer chunk (merged from its inner chunks)
        del tmp2, df # delete to save RAM
        gc.collect() # Use this method to force the system to try to reclaim the maximum amount of available memory.
    # CONVERT MATRIX TO DICTIONARY
    tmp = tmp.reset_index() # get index clean and restore normal index
    tmp = tmp.sort_values(['aid_x','wgt'],ascending=[True,False]) # under each aid_x, order the rows by wgt from high to low
    # SAVE TOP 15
    tmp = tmp.reset_index(drop=True) # index is reordered due to sorting, drop it and restore the normal index
    tmp['n'] = tmp.groupby('aid_x').aid_y.cumcount() # use column n to assign the index to aid_y under each aid_x group
    tmp = tmp.loc[tmp.n<15].drop('n',axis=1) # keep the first 15 rows of each aid_x group
    # SAVE PART TO DISK (convert to pandas first uses less memory)
    tmp.to_pandas().to_parquet(f'top_15_carts_orders_v{VER}_{PART}.pqt') # this is one part of the dataframe and save to disk

## 2) "Buy2Buy" Co-visitation Matrix

In [None]:
%%time
# USE SMALLEST DISK_PIECES POSSIBLE WITHOUT MEMORY ERROR
DISK_PIECES = 1
SIZE = 1.86e6/DISK_PIECES

# COMPUTE IN PARTS FOR MEMORY MANAGEMENT
for PART in range(DISK_PIECES):
    print()
    print('### DISK PART',PART+1)
    
    # MERGE IS FASTEST PROCESSING CHUNKS WITHIN CHUNKS
    # => OUTER CHUNKS
    for j in range(6):
        a = j*CHUNK
        b = min( (j+1)*CHUNK, len(files) )
        print(f'Processing files {a} thru {b-1} in groups of {READ_CT}...')
        
        # => INNER CHUNKS
        for k in range(a,b,READ_CT):
            # READ FILE
            df = [read_file(files[k])]
            for i in range(1,READ_CT): 
                if k+i<b: df.append( read_file(files[k+i]) )
            df = cudf.concat(df,ignore_index=True,axis=0)
            df = df.loc[df['type'].isin([1,2])] # ONLY WANT CARTS AND ORDERS
            df = df.sort_values(['session','ts'],ascending=[True,False])
            # USE TAIL OF SESSION
            df = df.reset_index(drop=True)
            df['n'] = df.groupby('session').cumcount()
            df = df.loc[df.n<30].drop('n',axis=1)
            # CREATE PAIRS
            df = df.merge(df,on='session')
            df = df.loc[ ((df.ts_x - df.ts_y).abs()< 14 * 24 * 60 * 60) & (df.aid_x != df.aid_y) ] # 14 DAYS
            # MEMORY MANAGEMENT COMPUTE IN PARTS
            df = df.loc[(df.aid_x >= PART*SIZE)&(df.aid_x < (PART+1)*SIZE)]
            # ASSIGN WEIGHTS
            df = df[['session', 'aid_x', 'aid_y','type_y']].drop_duplicates(['session', 'aid_x', 'aid_y'])
            df['wgt'] = 1
            df = df[['aid_x','aid_y','wgt']]
            df.wgt = df.wgt.astype('float32')
            df = df.groupby(['aid_x','aid_y']).wgt.sum()
            # COMBINE INNER CHUNKS
            if k==a: tmp2 = df
            else: tmp2 = tmp2.add(df, fill_value=0)
            print(k,', ',end='')
        print()
        # COMBINE OUTER CHUNKS
        if a==0: tmp = tmp2
        else: tmp = tmp.add(tmp2, fill_value=0)
        del tmp2, df
        gc.collect()
    # CONVERT MATRIX TO DICTIONARY
    tmp = tmp.reset_index()
    tmp = tmp.sort_values(['aid_x','wgt'],ascending=[True,False])
    # SAVE TOP 15
    tmp = tmp.reset_index(drop=True)
    tmp['n'] = tmp.groupby('aid_x').aid_y.cumcount()
    tmp = tmp.loc[tmp.n<15].drop('n',axis=1)
    # SAVE PART TO DISK (convert to pandas first uses less memory)
    tmp.to_pandas().to_parquet(f'top_15_buy2buy_v{VER}_{PART}.pqt')

## 3) "Clicks" Co-visitation Matrix - Time Weighted
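
The time weighting in this section's cell is a linear ramp, `wgt = 1 + 3*(ts - ts_start)/(ts_end - ts_start)`. The two constants are taken from the cell below, and I assume they are the earliest and latest timestamps (in seconds) in the data. Under that assumption the oldest events get weight 1 and the newest get weight 4, which this short sketch confirms.

```python
import numpy as np

TS_START, TS_END = 1659304800, 1662328791  # constants used in this section's cell

def time_weight(ts):
    # Linear ramp from 1 (at TS_START) to 4 (at TS_END)
    ts = np.asarray(ts, dtype="float64")
    return 1 + 3 * (ts - TS_START) / (TS_END - TS_START)

midpoint = (TS_START + TS_END) / 2
w = time_weight([TS_START, midpoint, TS_END])
print(w)
```

So a pair observed at the very end of the period counts four times as much as one observed at the very start.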

In [None]:
%%time
# USE SMALLEST DISK_PIECES POSSIBLE WITHOUT MEMORY ERROR
DISK_PIECES = 4
SIZE = 1.86e6/DISK_PIECES

# COMPUTE IN PARTS FOR MEMORY MANAGEMENT
for PART in range(DISK_PIECES):
    print()
    print('### DISK PART',PART+1)
    
    # MERGE IS FASTEST PROCESSING CHUNKS WITHIN CHUNKS
    # => OUTER CHUNKS
    for j in range(6):
        a = j*CHUNK
        b = min( (j+1)*CHUNK, len(files) )
        print(f'Processing files {a} thru {b-1} in groups of {READ_CT}...')
        
        # => INNER CHUNKS
        for k in range(a,b,READ_CT):
            # READ FILE
            df = [read_file(files[k])]
            for i in range(1,READ_CT): 
                if k+i<b: df.append( read_file(files[k+i]) )
            df = cudf.concat(df,ignore_index=True,axis=0)
            df = df.sort_values(['session','ts'],ascending=[True,False])
            # USE TAIL OF SESSION
            df = df.reset_index(drop=True)
            df['n'] = df.groupby('session').cumcount()
            df = df.loc[df.n<30].drop('n',axis=1)
            # CREATE PAIRS
            df = df.merge(df,on='session')
            df = df.loc[ ((df.ts_x - df.ts_y).abs()< 24 * 60 * 60) & (df.aid_x != df.aid_y) ]
            # MEMORY MANAGEMENT COMPUTE IN PARTS
            df = df.loc[(df.aid_x >= PART*SIZE)&(df.aid_x < (PART+1)*SIZE)]
            # ASSIGN WEIGHTS
            df = df[['session', 'aid_x', 'aid_y','ts_x']].drop_duplicates(['session', 'aid_x', 'aid_y'])
            df['wgt'] = 1 + 3*(df.ts_x - 1659304800)/(1662328791-1659304800)
            df = df[['aid_x','aid_y','wgt']]
            df.wgt = df.wgt.astype('float32')
            df = df.groupby(['aid_x','aid_y']).wgt.sum()
            # COMBINE INNER CHUNKS
            if k==a: tmp2 = df
            else: tmp2 = tmp2.add(df, fill_value=0)
            print(k,', ',end='')
        print()
        # COMBINE OUTER CHUNKS
        if a==0: tmp = tmp2
        else: tmp = tmp.add(tmp2, fill_value=0)
        del tmp2, df
        gc.collect()
    # CONVERT MATRIX TO DICTIONARY
    tmp = tmp.reset_index()
    tmp = tmp.sort_values(['aid_x','wgt'],ascending=[True,False])
    # SAVE TOP 20
    tmp = tmp.reset_index(drop=True)
    tmp['n'] = tmp.groupby('aid_x').aid_y.cumcount()
    tmp = tmp.loc[tmp.n<20].drop('n',axis=1)
    # SAVE PART TO DISK (convert to pandas first uses less memory)
    tmp.to_pandas().to_parquet(f'top_20_clicks_v{VER}_{PART}.pqt')

In [None]:
# FREE MEMORY
del data_cache, tmp
_ = gc.collect()

# Step 2 - ReRank (choose 20) using handcrafted rules
For a description of the handcrafted rules, read this notebook's intro.

In [None]:
def load_test():    
    dfs = []
    for e, chunk_file in enumerate(glob.glob('../input/otto-chunk-data-inparquet-format/test_parquet/*')):
        chunk = pd.read_parquet(chunk_file)
        chunk.ts = (chunk.ts/1000).astype('int32')
        chunk['type'] = chunk['type'].map(type_labels).astype('int8')
        dfs.append(chunk)
    return pd.concat(dfs).reset_index(drop=True) #.astype({"ts": "datetime64[ms]"})

test_df = load_test()
print('Test data has shape',test_df.shape)
test_df.head()

In [None]:
%%time
def pqt_to_dict(df):
    return df.groupby('aid_x').aid_y.apply(list).to_dict()
# LOAD THREE CO-VISITATION MATRICES
top_20_clicks = pqt_to_dict( pd.read_parquet(f'top_20_clicks_v{VER}_0.pqt') )
for k in range(1,DISK_PIECES): 
    top_20_clicks.update( pqt_to_dict( pd.read_parquet(f'top_20_clicks_v{VER}_{k}.pqt') ) )
top_20_buys = pqt_to_dict( pd.read_parquet(f'top_15_carts_orders_v{VER}_0.pqt') )
for k in range(1,DISK_PIECES): 
    top_20_buys.update( pqt_to_dict( pd.read_parquet(f'top_15_carts_orders_v{VER}_{k}.pqt') ) )
top_20_buy2buy = pqt_to_dict( pd.read_parquet(f'top_15_buy2buy_v{VER}_0.pqt') )

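# TOP CLICKS AND ORDERS IN TEST
# load_test() maps 'type' to integer codes, so compare against type_labels rather than the strings
top_clicks = test_df.loc[test_df['type']==type_labels['clicks'],'aid'].value_counts().index.values[:20]
top_orders = test_df.loc[test_df['type']==type_labels['orders'],'aid'].value_counts().index.values[:20]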

print('Here are size of our 3 co-visitation matrices:')
print( len( top_20_clicks ), len( top_20_buy2buy ), len( top_20_buys ) )

In [None]:
#type_weight_multipliers = {'clicks': 1, 'carts': 6, 'orders': 3}
type_weight_multipliers = {0: 1, 1: 6, 2: 3}

def suggest_clicks(df):
    # USER HISTORY AIDS AND TYPES
    aids=df.aid.tolist()
    types = df.type.tolist()
    unique_aids = list(dict.fromkeys(aids[::-1] ))
    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids)>=20:
        weights=np.logspace(0.1,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        sorted_aids = [k for k,v in aids_temp.most_common(20)]
        return sorted_aids
    # USE "CLICKS" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_clicks[aid] for aid in unique_aids if aid in top_20_clicks]))
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2).most_common(20) if aid2 not in unique_aids]    
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    # USE TOP20 TEST CLICKS
    return result + list(top_clicks)[:20-len(result)]

def suggest_buys(df):
    # USER HISTORY AIDS AND TYPES
    aids=df.aid.tolist()
    types = df.type.tolist()
    # UNIQUE AIDS AND UNIQUE BUYS
    unique_aids = list(dict.fromkeys(aids[::-1] ))
    df = df.loc[(df['type']==1)|(df['type']==2)]
    unique_buys = list(dict.fromkeys( df.aid.tolist()[::-1] ))
    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids)>=20:
        weights=np.logspace(0.5,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        # RERANK CANDIDATES USING "BUY2BUY" CO-VISITATION MATRIX
        aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
        for aid in aids3: aids_temp[aid] += 0.1
        sorted_aids = [k for k,v in aids_temp.most_common(20)]
        return sorted_aids
    # USE "CART ORDER" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_buys[aid] for aid in unique_aids if aid in top_20_buys]))
    # USE "BUY2BUY" CO-VISITATION MATRIX
    aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2+aids3).most_common(20) if aid2 not in unique_aids] 
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    # USE TOP20 TEST ORDERS
    return result + list(top_orders)[:20-len(result)]

# Create Submission CSV
Inferring test data with Pandas groupby is slow. We need to accelerate the following code.
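
One possible direction, offered as an untested idea rather than a measured speedup: sort once, collect each session's history as plain Python lists, and call list-based suggest functions on them. In the sketch below the toy data and `last_item` (a hypothetical stand-in for a suggest function that takes lists instead of a dataframe) are assumptions.

```python
import pandas as pd

# Toy test frame (assumed data)
test_df = pd.DataFrame({
    "session": [2, 1, 1, 2],
    "aid":     [40, 10, 20, 30],
    "ts":      [5, 1, 2, 4],
    "type":    [0, 0, 1, 0],
})

# Sort once and build each session's time-ordered history as lists
sorted_df = test_df.sort_values(["session", "ts"])
history = sorted_df.groupby("session").agg(aids=("aid", list), types=("type", list))

def last_item(aids, types):
    # Hypothetical stand-in for a list-based suggest function
    return aids[-1]

preds = {s: last_item(row.aids, row.types) for s, row in history.iterrows()}
print(preds)
```

The same `history` frame could feed both the clicks and the buys predictions, so the sort and groupby would run only once.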

In [None]:
%%time
pred_df_clicks = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_clicks(x)
)

pred_df_buys = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_buys(x)
)

In [None]:
clicks_pred_df = pd.DataFrame(pred_df_clicks.add_suffix("_clicks"), columns=["labels"]).reset_index()
orders_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_orders"), columns=["labels"]).reset_index()
carts_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_carts"), columns=["labels"]).reset_index()

In [None]:
pred_df = pd.concat([clicks_pred_df, orders_pred_df, carts_pred_df])
pred_df.columns = ["session_type", "labels"]
pred_df["labels"] = pred_df.labels.apply(lambda x: " ".join(map(str,x)))
pred_df.to_csv("submission.csv", index=False)
pred_df.head()
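
For clarity, the submission layout produced above has one row per session and event type, with the predicted aids joined by spaces. The sketch below rebuilds that layout from toy predictions. The data and the helper `to_rows` are made up for illustration.

```python
import pandas as pd

# Toy predictions (assumed data): session id -> list of predicted aids
pred_clicks = pd.Series({11: [1, 2, 3], 12: [4, 5]})
pred_buys = pd.Series({11: [1, 9], 12: [4]})

def to_rows(preds, suffix):
    # One row per session and event type, e.g. "11_clicks"
    return pd.DataFrame({
        "session_type": [f"{s}_{suffix}" for s in preds.index],
        "labels": [" ".join(map(str, aids)) for aids in preds.values],
    })

# Carts and orders share the same "buys" predictions, as in this notebook
sub = pd.concat(
    [to_rows(pred_clicks, "clicks"), to_rows(pred_buys, "orders"), to_rows(pred_buys, "carts")],
    ignore_index=True,
)
print(sub)
```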