# "Puzzle" solution (DataFusion2022 track №2)

First, let's check the available resources

In [1]:
import multiprocessing
from psutil import virtual_memory

print(f"CPU: {multiprocessing.cpu_count()}")
print(f"RAM {round(virtual_memory().total / 1024**3, 1)} Gb")

CPU: 2
RAM 15.5 Gb


There are two main data files:

- transactions.csv, which fits into RAM entirely;

- clickstream.csv, which requires more RAM than I have.

That's why I did some initial preprocessing: each data file was split into parts of 1,000,000 rows, and every part was saved in the ".parquet" format.  
This gives 20 files with transactions data and 127 files with clickstream data.  
In this notebook I'll use these files
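
The preprocessing itself is not shown in this notebook. A minimal sketch of how such a split could be done is given below. The function name "split_csv", its parameters and the "write_chunk" hook are hypothetical; only the 1,000,000-row part size and the "bank{i}.parquet" / "click{i}.parquet" naming come from the cells that follow.

```python
from pathlib import Path

import pandas as pd


def _write_parquet(df, path):
    # Default writer: needs a Parquet engine (pyarrow or fastparquet) installed.
    df.to_parquet(path, index=False)


def split_csv(csv_path, out_dir, prefix, chunk_rows=1_000_000, write_chunk=_write_parquet):
    """Split a large CSV into numbered parts of at most `chunk_rows` rows each.

    Hypothetical helper: `write_chunk(df, path)` decides how one part is stored.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # With `chunksize`, read_csv yields DataFrames one by one,
    # so only a single part is held in memory at any time.
    for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunk_rows)):
        path = out_dir / f"{prefix}{i}.parquet"
        write_chunk(chunk, path)
        paths.append(path)
    return paths
```

Under these assumptions, a call like split_csv('clickstream.csv', 'parquets', 'click') would produce the "click0.parquet", "click1.parquet", ... files that are read later.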

## 0 Importing necessary modules

In [2]:
import os
import warnings
import numpy as np
import pandas as pd
from tqdm import trange
from joblib import Parallel, delayed
from catboost import CatBoostClassifier

warnings.filterwarnings('ignore')

# Initialising path for auxiliary files
MAIN_PATH = os.getcwd()

Let's create the necessary functions for:

- feature engineering;
- generating negative samples for matching

In [3]:
def make_base_bank_features(data):
    
    """
    This function is intended for feature engineering of transactions data. Sum, mean and count features of
    'transaction_amt' are created for every value of 'currency_rk'. The same sum, mean and count features
    are also created for every observed pair of 'mcc_code' and 'currency_rk'.
    The function also downcasts column dtypes to reduce memory usage.
    Input:
    - data [pandas.DataFrame] - an initial data.
    Output:
    - data with new features [pandas.DataFrame].
    """
    
    pivot1 = data.pivot_table(index = 'user_id', values=['transaction_amt'],
                              columns=['currency_rk'], aggfunc=['sum', 'mean', 'count']).fillna(0)
    pivot1.columns = [f'{str(i[0])}-{str(i[2])}' for i in pivot1.columns]

    pivot2 = data.pivot_table(index = 'user_id', values=['transaction_amt'],
                              columns=['mcc_code', 'currency_rk'], aggfunc=['sum', 'mean', 'count']).fillna(0)
    pivot2.columns = [f'{str(i[0])}-{str(i[2])}-{str(i[3])}' for i in pivot2.columns]

    new_data = pivot1.join(pivot2)

    dtypes = list()
    for x in new_data.dtypes.tolist():
        if x == 'int64':
            dtypes.append('int16')
        elif x == 'float64':
            dtypes.append('float32')
        else:
            dtypes.append('object')

    dtypes = dict(zip(new_data.columns.tolist(), dtypes))
    new_data = new_data.astype(dtypes)
    
    return new_data
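
To make the column naming easier to follow, here is a small sketch on toy data (the rows are made up for illustration). It repeats the first pivot from the function above and shows how the (aggfunc, value column, currency) labels turn into names like "sum-48".

```python
import pandas as pd

# Toy transactions: user "a" pays only in currency 48, user "b" uses 48 and 50.
toy = pd.DataFrame({
    "user_id": ["a", "a", "b", "b"],
    "currency_rk": [48, 48, 48, 50],
    "transaction_amt": [-10.0, -30.0, -5.0, 20.0],
})

pivot = toy.pivot_table(index="user_id", values=["transaction_amt"],
                        columns=["currency_rk"],
                        aggfunc=["sum", "mean", "count"]).fillna(0)

# Each column label is a tuple (aggfunc, value column, currency);
# the value column is always 'transaction_amt', so it is dropped from the name.
pivot.columns = [f"{agg}-{cur}" for agg, _, cur in pivot.columns]
print(pivot)
```

Combinations that never occur for a user (here user "a" with currency 50) become 0 after fillna(0), which is why the resulting feature table has no missing values.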

In [4]:
def make_base_rtk_features(data):
    
    """
    This function is intended for feature engineering of clickstream data. Count features are created
    for every value of 'cat_id'.
    The function casts the result to an int type to reduce memory usage.
    Input:
    - data [pandas.DataFrame] - an initial data.
    Output:
    - data with new features [pandas.DataFrame].
    """
    
    new_data = data.pivot_table(index ='user_id', values=['timestamp'], columns=['cat_id'], aggfunc=['count']).fillna(0)
    new_data.columns = [f'{str(i[0])}-{str(i[2])}' for i in new_data.columns]
    
    new_data = new_data.astype('int')
    
    return new_data

In [5]:
def make_hours_features(data, time_col: str, value_col: str, prefix: str):
    
    """
    This function is intended for creating features that describe how a user's operations are
    distributed over the hours of the day.
    Hourly shares are stored as float32 and the total count as int to reduce memory usage.
    Note: 'data' is modified in place (time_col is converted to datetime, an 'hour' column is added).
    Input:
    - data [pandas.DataFrame] - an initial data,
    - time_col [str] - string representation of time column,
    - value_col [str] - string representation of value column for pivot table,
    - prefix [str] - string for naming features.
    Output:
    - data with new features [pandas.DataFrame].
    """
    
    data[time_col] = pd.to_datetime(data[time_col])
    data['hour'] = data[time_col].dt.hour
    
    pivot_hours = pd.pivot_table(data, index='user_id', columns='hour', values=value_col, aggfunc='count').fillna(0)
    pivot_hours['sum'] = pivot_hours.sum(axis=1)
    
    for i in pivot_hours.columns[:-1]:
        
        pivot_hours[i] /= pivot_hours['sum']
        pivot_hours[i] = pivot_hours[i].astype('float32')
        
    pivot_hours.columns = [f'{prefix}_{str(i)}h' for i in pivot_hours.columns]
    pivot_hours[f'{prefix}_sumh'] = pivot_hours[f'{prefix}_sumh'].astype('int')
    
    return pivot_hours
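
As an illustration of what these hour features look like, here is a sketch on toy clickstream rows (made-up data, using the "click" prefix). It follows the same steps as the function above, only with a vectorised division instead of the column loop.

```python
import pandas as pd

# Toy clickstream: user "a" has 3 events at 09h and 1 at 21h, user "b" has 2 at 14h.
toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2021-02-25 09:10:00", "2021-02-25 09:45:00",
        "2021-02-26 09:05:00", "2021-02-26 21:30:00",
        "2021-02-25 14:00:00", "2021-02-27 14:20:00",
    ]),
})
toy["hour"] = toy["timestamp"].dt.hour

# Events per user and hour; hours without events become 0.
counts = pd.pivot_table(toy, index="user_id", columns="hour",
                        values="timestamp", aggfunc="count").fillna(0)
total = counts.sum(axis=1)

# Share of each hour in the user's total activity, plus the total itself.
shares = counts.div(total, axis=0)
shares.columns = [f"click_{h}h" for h in shares.columns]
shares["click_sumh"] = total.astype("int")
print(shares)
```

Each row of shares sums to 1 over the hour columns, so two users with very different activity volumes can still be compared by their daily rhythm, while "click_sumh" keeps the volume information.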

In [6]:
def gen_random_for_negative(x:str, k:int, list_of_uniq):
    
    """
    This function is intended for creating negative samples for matching. The returned collection doesn't
    contain the sample for which the negatives are created.
    Input:
    - x [str] - string index of client,
    - k [int] - integer number of negative samples per one positive sample,
    - list_of_uniq [list / numpy.ndarray] - unique items of indexes.
    Output:
     - collection that contains k negative samples per one positive sample [list / numpy.ndarray].
    """
    
    while True:
        
        final_list = np.random.choice(list_of_uniq, size=k, replace=False)
        
        if x not in final_list:
            
            return final_list
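
The function above redraws the whole sample until the positive item is absent. An alternative sketch, not used in this notebook, removes the positive item from the pool first, so a single draw is always enough. The name "sample_negatives" and the explicit "rng" argument are assumptions made for this example.

```python
import numpy as np


def sample_negatives(x, k, candidates, rng):
    """Draw k distinct negatives for x without a rejection loop (hypothetical helper)."""
    pool = np.asarray(candidates)
    pool = pool[pool != x]  # drop the positive item before sampling
    return rng.choice(pool, size=k, replace=False)


rng = np.random.default_rng(42)
ids = np.array(["r0", "r1", "r2", "r3", "r4", "r5"])
negs = sample_negatives("r2", 3, ids, rng)
print(negs)
```

With thousands of candidates and k = 10 the rejection loop almost never repeats, so the difference is mostly about guaranteeing termination on small pools.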

## 1 Data loading

Let's load the train data with the matches between bank clients and clickstream service clients

In [7]:
%%time

train_matching = pd.read_csv('train_matching.csv')
print(f'Data shape: {train_matching.shape}')
train_matching.sample(5)

Data shape: (17581, 2)
Wall time: 160 ms


| | bank | rtk |
|---|---|---|
| 8253 | 6783adb24efc43549038a0c037c958e7 | 036395725c6a4454bbabdcbca9e299f2 |
| 171 | ed1ab56022944a8b8c5799929ef85e32 | 543f9e92c965432e9fe04220bfa46e3b |
| 8256 | 43cfd421f06343a5ae5ab1c07b442031 | da8ecc0632664c9f8e7935a7d4fbdebd |
| 4207 | 0f9b40ccb74a43aba8017ae10f0c54e1 | 113d266f48df4eb5a6d455cf153661ef |
| 10151 | 1a5a3b9697bb45d78f9671c064dad78d | 20b868b8909142c2a2b6755e48a6e90f |


To solve this task, let's keep only the pairs that form a one-to-one (bijective) mapping, i.e. drop bank clients whose 'rtk' value is '0'.  
I sample only 5000 pairs because of the lack of memory

In [8]:
%%time

np.random.seed(42)

train_matching = train_matching[~train_matching['rtk'].isin(['0'])].sample(5000).reset_index(drop=True)

users_bank = train_matching['bank'].unique().tolist()
users_rtk = train_matching['rtk'].unique().tolist()

print(f'Amount of unique clients: \n[bank]: {len(users_bank)} \n[clickstream]: {len(users_rtk)}')

Amount of unique clients: 
[bank]: 5000 
[clickstream]: 5000
Wall time: 31 ms
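
The equal counts above (5000 and 5000) suggest that the mapping is one-to-one after the filter. A short sketch of how this could be checked explicitly on toy data is shown below; it assumes, as the filter does, that '0' marks a bank client without a clickstream match.

```python
import pandas as pd

# Toy matching table: bank client "b2" has no clickstream match.
toy = pd.DataFrame({
    "bank": ["b1", "b2", "b3", "b4"],
    "rtk": ["r1", "0", "r3", "r4"],
})

matched = toy[toy["rtk"] != "0"]

# The mapping is one-to-one if no id repeats on either side.
is_one_to_one = matched["bank"].is_unique and matched["rtk"].is_unique
print(len(matched), is_one_to_one)
```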


Let's load transactions data and make train dataset at once

In [9]:
os.chdir(os.path.join(MAIN_PATH, 'parquets'))


def select_bank_users_train(data): return data[data['user_id'].isin(users_bank)]


train_transactions = Parallel(n_jobs=-1)(delayed(select_bank_users_train)(
    pd.read_parquet(f'bank{i}.parquet', engine='fastparquet')) for i in trange(20))

df_bank_train = pd.concat(train_transactions)
print(df_bank_train.shape)
print(f"Memory usage: {df_bank_train.memory_usage().sum() // 1024 ** 2} Mb")

del train_transactions

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:32<00:00,  1.62s/it]


(4432260, 5)
Memory usage: 202 Mb


Let's take a look at columns of this data

In [10]:
%%time

list(df_bank_train)

Wall time: 1 ms


['user_id', 'mcc_code', 'currency_rk', 'transaction_amt', 'transaction_dttm']

Let's look at 5 random samples from the data

In [11]:
%%time

df_bank_train.sample(5)

Wall time: 1.18 s


| | user_id | mcc_code | currency_rk | transaction_amt | transaction_dttm |
|---|---|---|---|---|---|
| 18477270 | eec6fe9b4fef41b998d4cc06be99b69f | 4131 | 48 | -27.049622 | 2020-09-22 01:48:38 |
| 14609473 | bbe082c9701e45a9a391e940bceee703 | 5499 | 48 | -1373.4779 | 2021-05-04 03:43:46 |
| 19277167 | f995027a786a40cbb5bdae56f6677bcd | 4829 | 48 | 558.44464 | 2020-12-13 15:52:23 |
| 12408073 | 9faf5b2c0b5c4cbd925a8d6311f8a63b | 5541 | 48 | -110.562584 | 2021-06-03 00:27:39 |
| 7575802 | 60f2860b80094c7fa11eabb5b1798b9d | 5411 | 48 | -3114.1226 | 2021-03-05 05:43:07 |


Let's check unique values for 'mcc_code' and 'currency_rk'

In [12]:
%%time

print(f"Unique values: \n[mcc_code]: {df_bank_train['mcc_code'].nunique()}")
if df_bank_train['mcc_code'].nunique() < 10:
    print(df_bank_train['mcc_code'].unique())
else:
    print("Great amount for pretty printing")
    
print(f"[currency_rk]: {df_bank_train['currency_rk'].nunique()}")
if df_bank_train['currency_rk'].nunique() < 10:
    print(df_bank_train['currency_rk'].unique())
else:
    print("Great amount for pretty printing")

Unique values: 
[mcc_code]: 338
Great amount for pretty printing
[currency_rk]: 4
[48 -1 50 60]
Wall time: 432 ms


This data includes the 5000 unique clients that were chosen earlier.  
The "transaction_dttm" feature can be converted to datetime format

Now let's load the clickstream data and build the train dataset right away

In [13]:
def select_rtk_users_train(data): return data[data['user_id'].isin(users_rtk)]


clicksreaming_train = Parallel(n_jobs=-1)(delayed(select_rtk_users_train)(
    pd.read_parquet(f'click{i}.parquet', engine='fastparquet')) for i in trange(127))

df_rtk_train = pd.concat(clicksreaming_train)
print(df_rtk_train.shape)
print(f"Memory usage: {df_rtk_train.memory_usage().sum() // 1024 ** 2} Mb")

del clicksreaming_train

100%|████████████████████████████████████████████████████████████████████████████████| 127/127 [02:48<00:00,  1.33s/it]


(30762259, 4)
Memory usage: 1173 Mb


Let's take a look at columns of this data and 5 random samples

In [14]:
%%time

print(list(df_rtk_train))
df_rtk_train.sample(5)

['user_id', 'cat_id', 'timestamp', 'new_uid']
Wall time: 8.55 s


| | user_id | cat_id | timestamp | new_uid |
|---|---|---|---|---|
| 37738683 | 519ec98d6b284cf880fac1d862b0b6bc | 289 | 2021-02-25 14:08:00 | 1384344 |
| 57839838 | 758df89106bc4f5780fbe13543986a37 | 165 | 2021-05-06 05:25:00 | 754072 |
| 15043655 | 220f5acda0274829a5d9a96e6fd08ae5 | 251 | 2021-05-18 14:14:00 | 591797 |
| 46393041 | 628709ff22e142d1a1bd2993dcb00ea6 | 535 | 2021-03-21 16:12:00 | 1061068 |
| 96486108 | c485a7de877d44daaafe5ff804f6b665 | 931 | 2021-02-16 13:56:00 | 554594 |


This data also includes 5000 unique clients and contains one feature ("timestamp") that can be converted to datetime format

Let's check unique values for 'cat_id' and 'new_uid'

In [15]:
%%time

print(f"Unique values: \n[cat_id]: {df_rtk_train['cat_id'].nunique()}")
if df_rtk_train['cat_id'].nunique() < 10:
    print(df_rtk_train['cat_id'].unique())
else:
    print("Great amount for pretty printing")
    
print(f"[new_uid]: {df_rtk_train['new_uid'].nunique()}")
if df_rtk_train['new_uid'].nunique() < 10:
    print(df_rtk_train['new_uid'].unique())
else:
    print("Great amount for pretty printing")

Unique values: 
[cat_id]: 321
Great amount for pretty printing
[new_uid]: 20014
Great amount for pretty printing
Wall time: 1.91 s


## 2 Feature engineering

Using the functions defined above, let's build the working feature space

Arguments of the function that builds the hour-of-day activity features:

- for transactions:

    - time_col: transaction_dttm;
    - value_col: transaction_amt;
    - prefix: bank;
    
- for clickstreaming service:

    - time_col: timestamp;
    - value_col: timestamp;
    - prefix: click

In [16]:
%%time

bank_hours_train = make_hours_features(df_bank_train, time_col='transaction_dttm',
                                       value_col='transaction_amt', prefix='bank')
bank_embed_train = make_base_bank_features(df_bank_train)
bank_embed_train = bank_embed_train.join(bank_hours_train)

print(f"Shape: {bank_embed_train.shape}")
print(f"Memory usage: {bank_embed_train.memory_usage().sum() // 1024 ** 2} Mb")

print(20*'-')

rtk_hours_train = make_hours_features(df_rtk_train, time_col='timestamp', value_col='timestamp', prefix='click')
rtk_embed_train = make_base_rtk_features(df_rtk_train)
rtk_embed_train = rtk_embed_train.join(rtk_hours_train)

print(f"Shape: {rtk_embed_train.shape}")
print(f"Memory usage: {rtk_embed_train.memory_usage().sum() // 1024 ** 2} Mb")

del df_bank_train, df_rtk_train
del bank_hours_train, rtk_hours_train

Shape: (5000, 1840)
Memory usage: 35 Mb
--------------------
Shape: (5000, 346)
Memory usage: 6 Mb
Wall time: 1min 19s


After feature engineering there are 1840 features for the transactions data and 346 features for the clickstream data

Let's check for "nan" values in the data

In [17]:
%%time

print(f"[transactions]: {(bank_embed_train.isna().sum()).sum()} \n[clickstream] {(rtk_embed_train.isna().sum()).sum()}")

[transactions]: 0 
[clickstream] 0
Wall time: 588 ms


Let's check for duplicates

In [18]:
%%time

print(f"[transactions]: {bank_embed_train.duplicated().sum()} \n[clickstream] {rtk_embed_train.duplicated().sum()}")

[transactions]: 0 
[clickstream] 0
Wall time: 1.09 s


So, there are no "nan" values or duplicates in the data.  
Now let's split the data into train and validation parts

In [19]:
%%time

np.random.seed(42)

val_indexes = np.random.choice(users_bank, 1000, replace=False)
valid_matching = train_matching[train_matching['bank'].isin(val_indexes)]

bank_embed_valid = bank_embed_train.loc[valid_matching['bank'].unique()]
rtk_embed_valid = rtk_embed_train.loc[valid_matching['rtk'].unique()]
print(f'Shape: \n[transactions]: {bank_embed_valid.shape} \n[clickstream]: {rtk_embed_valid.shape}')

print(40*'-')

train_matching = train_matching[~train_matching['bank'].isin(val_indexes)]

bank_embed_train = bank_embed_train.loc[train_matching['bank'].unique()]
rtk_embed_train = rtk_embed_train.loc[train_matching['rtk'].unique()]
print(f'Shape: \n[transactions]: {bank_embed_train.shape} \n[clickstream]: {rtk_embed_train.shape}')

del val_indexes

Shape: 
[transactions]: (1000, 1840) 
[clickstream]: (1000, 346)
----------------------------------------
Shape: 
[transactions]: (4000, 1840) 
[clickstream]: (4000, 346)
Wall time: 305 ms


## 3 Train data preparation

Let's set the true labels (target = 1) for the matched pairs and generate 10 negative samples (target = 0) for each of them.  
For this I'll use the function "gen_random_for_negative"

In [20]:
%%time

np.random.seed(42)

train_mathing_rtk_unique = train_matching['rtk'].unique()
train_matching['negatives'] = train_matching['rtk'].apply(
    lambda x: gen_random_for_negative(x, 10, train_mathing_rtk_unique)
)

train_matching['target'] = 1

positive_train = train_matching[['bank', 'rtk', 'target']]
negative_train = train_matching[['bank', 'negatives']].explode('negatives')

negative_train['target'] = 0
negative_train.columns = positive_train.columns

full_train = pd.concat([positive_train, negative_train]).sort_values(by='bank')

del train_matching, train_mathing_rtk_unique
del positive_train, negative_train

Wall time: 964 ms
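
The way the cell above turns a list of negatives into labelled pairs can be sketched on toy data (made-up ids, 2 negatives per pair instead of 10):

```python
import pandas as pd

# Each bank client has one true match and a list of negative candidates.
pairs = pd.DataFrame({
    "bank": ["b1", "b2", "b3"],
    "rtk": ["r1", "r2", "r3"],
    "negatives": [["r2", "r3"], ["r1", "r3"], ["r1", "r2"]],
})

positive = pairs[["bank", "rtk"]].assign(target=1)

# explode() creates one row per list element, repeating the bank id.
negative = (pairs[["bank", "negatives"]]
            .explode("negatives")
            .rename(columns={"negatives": "rtk"})
            .assign(target=0))

full = pd.concat([positive, negative], ignore_index=True).sort_values("bank")
print(full)
```

With 4000 train pairs and 10 negatives each, the same construction gives 4000 * (1 + 10) = 44000 rows.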


Let's take a look at 5 random samples

In [21]:
%%time

full_train.sample(5)

Wall time: 9 ms


| | bank | rtk | target |
|---|---|---|---|
| 3677 | fc042614b80046ff8b1724f8ab48747c | 5ae64af1013f48fab63028f75c5db1ae | 0 |
| 4414 | 366bd7c4be8d4a6494451c76212ad6ef | 66df362224d946a68f59302dcd244718 | 0 |
| 593 | 4631e0fdc8714601a63ffdea84627ce5 | 2a0dadb3b0b1409f889eb3a4af57cf3e | 0 |
| 3992 | f36152eefb1549ac83482b432cd606c0 | feba37f3c01045a7a7ffcec2af9483d5 | 0 |
| 1863 | a86e8e61e76f4390a072e191395963ba | a25d6d83564044fea7e3d1b022fe4b62 | 0 |


Let's convert the target variable to the "int8" type to reduce memory usage.  
The current type of the target is int64, which is overkill for a binary variable

In [22]:
full_train['target'] = full_train['target'].astype('int8')
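
A quick sketch of the saving for a column of the same length as this train set (44000 rows); the all-zeros values are just a placeholder:

```python
import numpy as np
import pandas as pd

target = pd.Series(np.zeros(44_000, dtype="int64"))

# Bytes used by the values only (index excluded).
bytes_int64 = target.memory_usage(index=False)
bytes_int8 = target.astype("int8").memory_usage(index=False)
print(bytes_int64, bytes_int8)
```

One column is a small saving on its own; the same idea applied to thousands of feature columns (float64 to float32) is what keeps the merged table manageable.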

Finally, to obtain the full train dataset, let's merge all the separate parts into one

In [23]:
%%time

full_train = full_train.merge(bank_embed_train, how='left', left_on='bank', right_index=True)\
                       .merge(rtk_embed_train, how='left', left_on='rtk', right_index=True).fillna(0)

print(f"Shape: {full_train.shape}")
print(f"Memory usage: {full_train.memory_usage().sum() // 1024 ** 2} Mb")

del bank_embed_train, rtk_embed_train

Shape: (44000, 2189)
Memory usage: 367 Mb
Wall time: 2.2 s


Let's take a look at 5 random samples from the full train data

In [24]:
%%time

full_train.sample(5)

Wall time: 16 ms


| | bank | rtk | target | sum--1 | sum-48 | sum-50 | sum-60 | mean--1 | mean-48 | mean-50 | ... | click_15h | click_16h | click_17h | click_18h | click_19h | click_20h | click_21h | click_22h | click_23h | click_sumh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4702 | 96cefcd25e9e42cbb665cd4be60cad5e | 00c611f241cd4ae4992477f99c3be590 | 0 | 0.0 | -97752.15 | 0.0 | 0.0 | 0.0 | -191.295792 | 0.0 | ... | 0.052027 | 0.050676 | 0.059459 | 0.098649 | 0.062162 | 0.02973 | 0.002027 | 0.003378 | 0.004054 | 1480 |
| 3915 | 4babb230498741d991756ffbca9f9e2e | 4b02b923770f4b078b695ab931a15673 | 0 | 0.44111 | -336549.3 | 0.0 | 0.0 | 0.44111 | -492.031097 | 0.0 | ... | 0.058651 | 0.129032 | 0.152493 | 0.155425 | 0.085044 | 0.02346 | 0.017595 | 0.014663 | 0.020528 | 341 |
| 1956 | 03ca4ea2cf474fa185f9ce23bae2debe | 4dab2d99cbc149bbb11c3c9c576a2d7f | 0 | 0.0 | -508166.1 | 0.0 | 0.0 | 0.0 | -535.475342 | 0.0 | ... | 0.089219 | 0.089219 | 0.089219 | 0.081784 | 0.066914 | 0.022305 | 0.011152 | 0.0 | 0.0 | 269 |
| 1102 | da3b5a2a2aa44847b5cb8bfb74577d1e | 552e3ffecaab4ceebfa8a3bd7a6e9ad6 | 0 | 0.060248 | -1250813.0 | 0.0 | 0.0 | 0.001545 | -953.363403 | 0.0 | ... | 0.07154 | 0.059556 | 0.047317 | 0.032383 | 0.017484 | 0.011729 | 0.00805 | 0.00805 | 0.010418 | 27453 |
| 2304 | 22e6f99cc51f476baa06ce29c0acc386 | 42efd3a305414928b4f87a547ff62f28 | 0 | 0.0 | -355583.6 | 0.0 | 0.0 | 0.0 | -537.947998 | 0.0 | ... | 0.009259 | 0.007545 | 0.006859 | 0.001372 | 0.000343 | 0.0 | 0.0 | 0.0 | 0.0 | 2916 |


## 4 Model fitting

I think a cross-validation strategy would be better for estimating model quality, but fitting the model that way would take more than 2 days. That's why I just take a hold-out part from the data to evaluate the model during fitting.

The size of the hold-out part is 10% of the full train data

In [25]:
%%time

hold_out_size = int(0.1*len(full_train['bank'].unique()))
hold_out = full_train[full_train['bank'].isin(full_train['bank'].unique()[:hold_out_size])]
train = full_train[full_train['bank'].isin(full_train['bank'].unique()[hold_out_size:])]
print(f"[hold_out]: {hold_out.shape} \n[train]: {train.shape}")

del hold_out_size, full_train

[hold_out]: (4400, 2189) 
[train]: (39600, 2189)
Wall time: 686 ms
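
The split above is done by bank client rather than by row, so the positive pair and all negatives of one client end up in the same part. A small sketch of this idea on toy data (10 made-up clients with 3 candidate rows each):

```python
import pandas as pd

banks = [f"b{i}" for i in range(10)]
full = pd.DataFrame({
    "bank": [b for b in banks for _ in range(3)],
    "target": [1, 0, 0] * 10,
})

# Take the first 10% of the unique clients as the hold-out group.
unique_banks = full["bank"].unique()
n_hold = int(0.1 * len(unique_banks))
hold_banks = set(unique_banks[:n_hold])

mask = full["bank"].isin(hold_banks)
hold_out, train = full[mask], full[~mask]
print(hold_out.shape, train.shape)
```

A row-wise random split would leak information, because rows of the same client share all bank-side features.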


Let's fix the full list of features, because it will be reused for the validation and test data

In [26]:
%%time

features = train.columns[3:].to_list()
print(f"Features total amount: {len(features)}")

Features total amount: 2186
Wall time: 14 ms


I use CatBoostClassifier as the main model.

Some settings (the last three are passed to "fit"):

- iterations: 2000;
- depth: 12;
- learning_rate: 0.01;
- loss_function: CrossEntropy;
- eval_set: (hold_out[features], hold_out['target']);
- verbose: 50;
- early_stopping_rounds: 100

In [27]:
np.random.seed(42)

model = CatBoostClassifier(
    iterations=2000,
    depth=12,
    learning_rate=0.01,
    loss_function='CrossEntropy'
)

model.fit(
    train[features],
    train['target'],
    eval_set=(hold_out[features], hold_out['target']),
    verbose=50,
    early_stopping_rounds=100
)

0:	learn: 0.6837565	test: 0.6838021	best: 0.6838021 (0)	total: 20.7s	remaining: 11h 28m 21s
50:	learn: 0.4180662	test: 0.4188100	best: 0.4188100 (50)	total: 16m 5s	remaining: 10h 14m 43s
100:	learn: 0.3392136	test: 0.3402524	best: 0.3402524 (100)	total: 31m 43s	remaining: 9h 56m 22s
150:	learn: 0.3118588	test: 0.3138610	best: 0.3138610 (150)	total: 47m 4s	remaining: 9h 36m 22s
200:	learn: 0.2995091	test: 0.3031644	best: 0.3031644 (200)	total: 1h 2m 58s	remaining: 9h 23m 42s
250:	learn: 0.2899102	test: 0.2974085	best: 0.2974085 (250)	total: 1h 18m 21s	remaining: 9h 6m
300:	learn: 0.2817377	test: 0.2936669	best: 0.2936669 (300)	total: 1h 35m 8s	remaining: 8h 57m 3s
350:	learn: 0.2711053	test: 0.2902317	best: 0.2902317 (350)	total: 1h 51m 43s	remaining: 8h 44m 52s
400:	learn: 0.2627374	test: 0.2881249	best: 0.2881249 (400)	total: 2h 8m 35s	remaining: 8h 32m 44s
450:	learn: 0.2552396	test: 0.2866283	best: 0.2866283 (450)	total: 2h 25m 22s	remaining: 8h 19m 16s
500:	learn: 0.2476738	test: 0

<catboost.core.CatBoostClassifier at 0x231db531bb0>

I don't need the train datasets and everything related to them anymore

In [28]:
del train, hold_out
del users_bank, users_rtk

## 5 Model validation

Let's create the base structure for validation results

In [29]:
%%time

validation_results = pd.DataFrame(columns=['bank'], data=bank_embed_valid.index.to_list())
validation_results['rtk'] = validation_results['bank'].apply(lambda x: rtk_embed_valid.index.to_list())

Wall time: 49.1 ms


Matches for the validation data will be predicted for 200 bank clients at once.  
So 5 batches are needed to predict matches for all the validation data

In [30]:
%%time

batch_size = 200
num_of_batches = int((len(bank_embed_valid.index.to_list()))/batch_size)
num_of_batches

Wall time: 0 ns


5

So, let's predict matching for validation pairs

In [31]:
validation_done = []
for batch in trange(num_of_batches):
    
    bank_indexes = bank_embed_valid.index.to_list()[(batch*batch_size):((batch+1)*batch_size)]
    
    if len(bank_indexes) != 0:
        
        part_of_validation = validation_results[validation_results['bank'].isin(bank_indexes)].explode('rtk')
        part_of_validation = part_of_validation.merge(bank_embed_valid, how='left', left_on='bank', right_index=True)\
                                               .merge(rtk_embed_valid, how='left', left_on='rtk', right_index=True)\
                                               .fillna(0)

        part_of_validation['predicts'] = model.predict_proba(part_of_validation[features])[:,1]
        part_of_validation = part_of_validation[['bank', 'rtk', 'predicts']]        
        part_of_validation = part_of_validation.sort_values(by=['bank', 'predicts'], ascending=False)\
                             .reset_index(drop=True)
        part_of_validation = part_of_validation.pivot_table(index='bank', values='rtk', aggfunc=list)
        part_of_validation['rtk'] = part_of_validation['rtk'].apply(lambda x: x[:100])
        part_of_validation['bank'] = part_of_validation.index
        part_of_validation = part_of_validation[['bank', 'rtk']]
        
        validation_done.append(part_of_validation)

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:12<00:00, 14.50s/it]
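
The ranking step inside the loop (sort by score, then keep the best candidates per bank client) can be sketched on toy scores. The values below are made up, and k = 2 stands in for the 100 used above.

```python
import pandas as pd

scores = pd.DataFrame({
    "bank": ["b1", "b1", "b1", "b2", "b2", "b2"],
    "rtk": ["r1", "r2", "r3", "r1", "r2", "r3"],
    "predicts": [0.2, 0.9, 0.5, 0.7, 0.1, 0.3],
})
k = 2

# Sort candidates by score within each bank client, then keep the first k ids.
top_k = (scores.sort_values(["bank", "predicts"], ascending=[True, False])
               .groupby("bank")["rtk"]
               .apply(lambda s: s.head(k).tolist()))
print(top_k)
```

This is an equivalent formulation with groupby instead of pivot_table(aggfunc=list); both rely on the rows keeping their sorted order inside each group.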


Finally, let's bring the results into the desired structure

In [32]:
%%time

validation_final = pd.concat(validation_done)
validation_final.reset_index(inplace=True, drop=True)
validation_final['true_rtk'] = validation_final['bank'].apply(
    lambda x: valid_matching[valid_matching['bank'].isin([x])]['rtk'].values[0])
print(f"Shape: {validation_final.shape}")

del bank_embed_valid, rtk_embed_valid, bank_indexes
del validation_results, part_of_validation, validation_done
del batch_size, num_of_batches, batch
del valid_matching

Shape: (1000, 3)
Wall time: 1.19 s


R1 metric calculation

In [33]:
%%time

precision, mrr = 0, 0
for idx in range(len(validation_final)):
    
    true = validation_final['true_rtk'].loc[idx]
    preds = validation_final['rtk'].loc[idx]

    if true in preds:
        precision += 1
        mrr += 1 / (preds.index(true) + 1)
precision = precision / len(validation_final)
mrr = mrr / len(validation_final)
r1 = (2 * precision * mrr) / (precision + mrr)
print(f'Precision: {round(precision, ndigits=10)} \nMRR: {round(mrr, ndigits=10)} \nR1: {round(r1, ndigits=10)}')

del validation_final
del idx, precision, mrr
del true, preds, r1

Precision: 0.367 
MRR: 0.0405715495 
R1: 0.0730657412
Wall time: 69.2 ms


I've got some optimistic results

MRR@k has the main influence on this metric. R1 is the harmonic mean of Precision@k and MRR@k, so even with a high Precision@k, a low MRR@k gives a low R1 too.  
So R1 is expected to move in the same direction as MRR@k, but not by the same amount
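
The calculation from the cell above can be wrapped into a small function to see this effect on a toy case. The function name "r1_score" and the toy rankings are assumptions made for this sketch; the formulas are the same as in the loop above.

```python
def r1_score(true_ids, ranked_lists):
    """Return (precision, mrr, r1) for lists of ranked candidates (hypothetical helper)."""
    hits, rr_sum = 0, 0.0
    for true, preds in zip(true_ids, ranked_lists):
        if true in preds:
            hits += 1
            rr_sum += 1.0 / (preds.index(true) + 1)  # reciprocal rank, 1-based
    n = len(true_ids)
    precision, mrr = hits / n, rr_sum / n
    # Harmonic mean; defined as 0 when nothing was found at all.
    r1 = 2 * precision * mrr / (precision + mrr) if precision + mrr > 0 else 0.0
    return precision, mrr, r1


# True matches found at ranks 1, 2 and 4; one is missed completely.
precision, mrr, r1 = r1_score(
    ["r1", "r2", "r3", "r4"],
    [["r1", "r9"], ["r9", "r2"], ["r8", "r9"], ["r7", "r8", "r9", "r4"]],
)
print(precision, mrr, r1)
```

In the validation run above the true match is among the top 100 for 36.7% of clients, but usually far from the top of the list, so the low MRR pulls R1 down to about 0.073.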

## 6 Test data processing

Let's load the data with the candidate pairs for bank clients. This is the main file for the test data

In [34]:
%%time

os.chdir(MAIN_PATH)

test = pd.read_csv('puzzle.csv')
print(f'Shape: {test.shape}')

test_bank_users = test['bank'].unique()
test_rtk_users = test['rtk'].unique()
print(f'Amount of unique clients: \n[bank]: {len(test_bank_users)} \n[clickstream]: {len(test_rtk_users)}')

test.sample(5)

Shape: (4952, 2)
Amount of unique clients: 
[bank]: 4952 
[clickstream]: 4952
Wall time: 62.5 ms


| | bank | rtk |
|---|---|---|
| 230 | 45284b9689f94a6b9519837ff81efaf6 | 445f444647ed42bcada41a1e82d31abe |
| 1371 | 6b9c28fda1534d958098af89f9126dee | 944ffb6af1834d6da3f11ffebeb9c97b |
| 1957 | 91f9fe21c534470f91a68c9b0e28129c | 4dbfbc35971d467383b845b3aaa3b8ed |
| 3464 | eaf003b643aa498bb23346c14248fc7c | b164633f115c46e9a54a4c5676f048ea |
| 4473 | bbf0757110f345a9baeb88374e580297 | cfc1b16c6c0f4da6a7cdc0ff0c5b26fa |


The test mapping data doesn't contain zero mappings, i.e. every bank client definitely has a clickstream service client

Let's load the transactions data and build the test dataset right away

In [35]:
os.chdir(os.path.join(MAIN_PATH, 'parquets'))


def select_bank_users_test(data): return data[data['user_id'].isin(test_bank_users)]


test_transactions = Parallel(n_jobs=-1)(delayed(select_bank_users_test)(
    pd.read_parquet(f'bank{i}.parquet', engine='fastparquet')) for i in trange(20))

test_bank = pd.concat(test_transactions)
print(test_bank.shape)
print(f"Memory usage: {test_bank.memory_usage().sum() // 1024 ** 2} Mb")

del test_transactions

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:28<00:00,  1.45s/it]


(4381342, 5)
Memory usage: 200 Mb


Now let's load the clickstream data and build the test dataset right away

In [36]:
def select_rtk_users_test(data): return data[data['user_id'].isin(test_rtk_users)]


clicksreaming_test = Parallel(n_jobs=-1)(delayed(select_rtk_users_test)(
    pd.read_parquet(f'click{i}.parquet', engine='fastparquet')) for i in trange(127))

test_rtk = pd.concat(clicksreaming_test)
print(test_rtk.shape)
print(f"Memory usage: {test_rtk.memory_usage().sum() // 1024 ** 2} Mb")

del clicksreaming_test

100%|████████████████████████████████████████████████████████████████████████████████| 127/127 [02:41<00:00,  1.27s/it]


(31955968, 4)
Memory usage: 1219 Mb


Here it isn't necessary to look at unique values or random samples, because the structure of this data is the same as in the train data.

I'll start feature engineering right away

In [37]:
%%time

bank_hours_test = make_hours_features(test_bank, time_col='transaction_dttm',
                                       value_col='transaction_amt', prefix='bank')
bank_embed_test = make_base_bank_features(test_bank)
bank_embed_test = bank_embed_test.join(bank_hours_test)
print(f'Shape: {bank_embed_test.shape}')
print(f"Memory usage: {bank_embed_test.memory_usage().sum() // 1024 ** 2} Mb")

print(30*'-')

rtk_hours_test = make_hours_features(test_rtk, time_col='timestamp', value_col='timestamp', prefix='click')
rtk_embed_test = make_base_rtk_features(test_rtk)
rtk_embed_test = rtk_embed_test.join(rtk_hours_test)
print(f'Shape: {rtk_embed_test.shape}')
print(f"Memory usage: {rtk_embed_test.memory_usage().sum() // 1024 ** 2} Mb")

del test_bank, test_rtk
del bank_hours_test, rtk_hours_test

Shape: (4952, 1855)
Memory usage: 35 Mb
------------------------------
Shape: (4952, 342)
Memory usage: 6 Mb
Wall time: 1min 12s


After feature engineering there are 1855 features for the transactions data and 342 features for the clickstream data.  
The train and test feature sets differ, so at prediction time the test features have to be aligned with the train feature list: missing columns are added as zeros and the columns are put in the same order
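
One compact way to do such an alignment is "reindex" with a fill value, sketched here on toy columns. The column "sum-978" is a made-up example of a feature that exists only in the test data; the prediction loop in this notebook achieves the same result by adding missing columns and selecting the train feature list.

```python
import pandas as pd

train_features = ["sum-48", "sum-50", "click_9h"]

# Test part: one train column is missing, one extra column is present,
# and the column order is different.
test_part = pd.DataFrame({
    "click_9h": [0.4, 0.1],
    "sum-48": [-10.0, -5.0],
    "sum-978": [3.0, 0.0],
})

# Keep exactly the train columns, in train order; fill missing ones with 0.
aligned = test_part.reindex(columns=train_features, fill_value=0)
print(aligned)
```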

Let's create the base structure for test results

In [38]:
%%time

test_results = pd.DataFrame(columns=['bank'], data=test_bank_users)
test_results['rtk'] = test_results['bank'].apply(lambda x: test_rtk_users)

Wall time: 15.6 ms


For the test data the batch size is 5 bank clients. It helps me to overcome the memory issues

In [39]:
%%time

size = 5
batches = int((len(test_bank_users))/size)+1
batches

Wall time: 0 ns


991
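
The number of batches is a ceiling division. A sketch with "math.ceil" (the helper name "batch_slices" is hypothetical) reproduces both batch counts used in this notebook:

```python
import math


def batch_slices(n_items, batch_size):
    """Return (start, stop) index pairs that cover n_items in batches (hypothetical helper)."""
    n_batches = math.ceil(n_items / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_items))
            for i in range(n_batches)]


test_slices = batch_slices(4952, 5)      # test data: 4952 bank clients, 5 per batch
valid_slices = batch_slices(1000, 200)   # validation data: 1000 bank clients, 200 per batch
print(len(test_slices), test_slices[-1], len(valid_slices))
```

The "int(n / size) + 1" formula from the cell above gives the same 991 here, but it would produce one extra empty batch whenever n is divisible by the batch size; the "if len(...) != 0" check in the prediction loop covers that case.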

So, let's predict matching for test data

In [40]:
test_done = []
for i in trange(batches):
    
    bank_test_indexes = test_bank_users[(i*size):((i+1)*size)]
    
    if len(bank_test_indexes) != 0:
        
        sub = test_results[test_results['bank'].isin(bank_test_indexes)].explode('rtk')
        sub = sub.merge(bank_embed_test, how='left', left_on='bank', right_index=True)\
                 .merge(rtk_embed_test, how='left', left_on='rtk', right_index=True).fillna(0)
           
        for col in features:
            if col not in sub.columns:
                sub[col] = 0
        
        sub['predicts'] = model.predict_proba(sub[features])[:,1]
        sub['predicts'] = sub['predicts'].astype('float32')
        sub = sub[['bank', 'rtk', 'predicts']]
        
        sub = sub.sort_values(by=['bank', 'predicts'], ascending=False).reset_index(drop=True)
        sub = sub.pivot_table(index='bank', values='rtk', aggfunc=list)
        sub['rtk'] = sub['rtk'].apply(lambda x: x[:100])
        sub['bank'] = sub.index
        sub = sub[['bank', 'rtk']]
        
        test_done.append(sub)
        
del i, batches, size, col
del bank_test_indexes, test_results, test
del bank_embed_test, rtk_embed_test, sub
del test_bank_users, test_rtk_users, features

100%|██████████████████████████████████████████████████████████████████████████████| 991/991 [1:49:23<00:00,  6.62s/it]


Let's concatenate all predictions and take a look at the first 5 samples

In [41]:
%%time

submission = pd.concat(test_done)
submission.reset_index(inplace=True, drop=True)
submission.head(5)

Wall time: 78.1 ms


| | bank | rtk |
|---|---|---|
| 0 | 10f09bfecc7f4cf894897206d4020307 | [2921476d7cf74afb9e05b45e476c591a, 4fc252276e6...] |
| 1 | 224a2325b44a4326bc539e3f1a6e713b | [8039a4b6372b40a490ff59bb1a4527c5, a60d6e61060...] |
| 2 | 6dd66e8624da427da6b558903a5772b8 | [5858d3d01fe44ab1a9ecdd094d1093f4, f795f418fc1...] |
| 3 | 8a1dd91a260143f4b6044da26844dde2 | [48393d2085f345f4886eadd1243ad1b8, ec5e4863170...] |
| 4 | ed547c635e594b88a9eb5b5f7ae75304 | [4b6605d77c2b4fd28e3d8c3995a4e6fa, 38eb39dbd47...] |


Finally, let's write the submission file

In [42]:
%%time

os.chdir(MAIN_PATH)

submission.columns = ['bank', 'rtk_list']
submission.to_csv('puzzle_sub.csv', index=False)

del test_done

Wall time: 860 ms


I made 5 attempts to get a high score.

My best attempt has the following scores:

- [public]: 0.0213660899 (R1), 0.0116099809 (MRR@100), 0.1338063862 (Precision@100) (41st on the leaderboard);
- [private]: 0.0173880855 (R1), 0.0093427651 (MRR@100), 0.1252098019 (Precision@100) (41st on the leaderboard)

My place on the leaderboard didn't change between public and private, but the scores dropped.

----------