<a class="anchor" id="0.1"></a>

## Table of Contents
1. [Import libraries and variables](#1)
2. [About the dataset/EDA](#2)
3. [Feature Engineering](#3)
4. [Train models](#4)

## Import libraries and variables
<a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [3]:
!pip install kaggle_metrics

Collecting kaggle_metrics
  Downloading kaggle_metrics-0.3.1-py3-none-any.whl.metadata (1.7 kB)
Downloading kaggle_metrics-0.3.1-py3-none-any.whl (7.5 kB)
Installing collected packages: kaggle_metrics
Successfully installed kaggle_metrics-0.3.1


In [5]:
import pandas as pd
import numpy as np
import pyarrow as pa
import matplotlib.pyplot as plt

from kaggle_metrics.utils import check_shapes, align_shape
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


In [35]:
def weighted_mean_absolute_error(y_true, y_pred, weights):

    '''

    Weighted mean absolute error.

    Parameters
    ----------
    y_true: ndarray
        Ground truth
    y_pred: ndarray
        Array of predictions
    weights: ndarray
        Per-sample weights, same shape as y_true

    Returns
    -------
    wmae: float
        Weighted mean absolute error

    References
    ----------
    .. [1] https://www.kaggle.com/wiki/WeightedMeanAbsoluteError

    '''

    # Check shapes
    y_true, y_pred = align_shape(y_true, y_pred)
    check_shapes(y_true, y_pred)

    # Normalize by the sum of the weights rather than by the number of samples
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)
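
Two conventions for a weighted MAE differ only in the denominator: the weighted errors can be divided by the number of samples or by the sum of the weights. The sketch below compares the two on a tiny made-up example. The helper name `wmae_normalized` and the sample arrays are illustrative assumptions, and only NumPy is required.

```python
import numpy as np


def wmae_normalized(y_true, y_pred, weights):
    """Weighted MAE divided by the sum of the weights (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)


# Made-up sample data
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 4.0, 0.0])
weights = np.array([1.0, 1.0, 2.0])

# Plain mean of the weighted errors: divides by the number of samples
plain_mean = (weights * np.abs(y_true - y_pred)).mean()

# Normalized form: divides by the sum of the weights
normalized = wmae_normalized(y_true, y_pred, weights)
```

The two values coincide only when the weights sum to the number of samples, so it is worth checking which form a given competition metric expects.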

In [6]:
# variables 
main_dir = "/kaggle/input/alfa-challenge/"

In [7]:
transaction = pd.read_parquet(main_dir + "df_transaction.pa", engine="pyarrow")
train = pd.read_parquet(main_dir + "train.pa", engine="pyarrow")

## About the dataset/EDA
<a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

### DataSet "Transaction"
client_num - client number

date_time - transaction time

mcc_code - MCC code

merchant_name - hash of the merchant name

amount - transaction amount

In [5]:
transaction

Unnamed: 0,client_num,date_time,mcc_code,merchant_name,amount
0,0,2024-07-18 16:04:00,8099,a011100358d0f73ea8f3e860ef5564e3ba9cb217b7b90c...,2900
1,0,2024-07-22 16:31:00,5411,f3855606fc7244ec2f37ea01a4b2b66933d0e965bf4aec...,455
2,0,2024-07-24 16:23:00,5541,786270fa33ad4ac2a3c0e52e888005aa7f98beadbf8986...,1003
3,0,2024-07-28 15:51:00,5691,54887ad4a8df7e260a3ac85e59128a947c50d4423f6330...,1480
4,0,2024-07-28 18:00:00,5331,21617559a372c7cca155208c87be6c84ce97b5f8775589...,88
...,...,...,...,...,...
13508150,109142,2024-08-19 21:32:00,6011,01784811094a8bd592cb35ee21e98c934839341e2b9d14...,14000
13508151,109142,2024-08-19 21:40:00,6011,01784811094a8bd592cb35ee21e98c934839341e2b9d14...,24000
13508152,109142,2024-08-19 21:46:00,6011,01784811094a8bd592cb35ee21e98c934839341e2b9d14...,23000
13508153,109142,2024-08-19 22:04:00,6011,01784811094a8bd592cb35ee21e98c934839341e2b9d14...,32000


In [24]:
transaction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13508155 entries, 0 to 13508154
Data columns (total 5 columns):
 #   Column         Dtype         
---  ------         -----         
 0   client_num     int64         
 1   date_time      datetime64[us]
 2   mcc_code       object        
 3   merchant_name  object        
 4   amount         int64         
dtypes: datetime64[us](1), int64(2), object(2)
memory usage: 515.3+ MB


In [10]:
transaction.isnull().sum()

client_num       0
date_time        0
mcc_code         0
merchant_name    0
amount           0
dtype: int64

In [8]:
# Note: this adds up the distinct client IDs (0..109142); use .nunique() to count clients instead
transaction.client_num.unique().sum()

5956042653

In [13]:
transaction_copy = transaction.copy()

In [14]:
transaction_copy['month'] = transaction['date_time'].dt.to_period('M')

In [15]:
transaction_copy.groupby('month')['client_num'].count()

month
2024-07    4441222
2024-08    4577178
2024-09    4489751
2024-10          4
Freq: M, Name: client_num, dtype: int64

In [23]:
transaction_copy.groupby(['month', 'client_num'])['amount'].sum()

month    client_num
2024-07  0                7261
         1              422749
         2              114647
         3             1483913
         4               91422
                        ...   
2024-09  109141            378
2024-10  13168            3280
         26937             132
         78137              64
         93720              31
Name: amount, Length: 298590, dtype: int64

In [36]:
transaction_copy['sum_amount'] = transaction_copy.groupby(['month', 'client_num'])['amount'].transform('sum')

In [56]:
transaction_copy

Unnamed: 0,client_num,date_time,mcc_code,merchant_name,amount,month,sum_amount
0,0,2024-07-18 16:04:00,8099,a011100358d0f73ea8f3e860ef5564e3ba9cb217b7b90c...,2900,2024-07,7261
1,0,2024-07-22 16:31:00,5411,f3855606fc7244ec2f37ea01a4b2b66933d0e965bf4aec...,455,2024-07,7261
2,0,2024-07-24 16:23:00,5541,786270fa33ad4ac2a3c0e52e888005aa7f98beadbf8986...,1003,2024-07,7261
3,0,2024-07-28 15:51:00,5691,54887ad4a8df7e260a3ac85e59128a947c50d4423f6330...,1480,2024-07,7261
4,0,2024-07-28 18:00:00,5331,21617559a372c7cca155208c87be6c84ce97b5f8775589...,88,2024-07,7261
...,...,...,...,...,...,...,...
13508150,109142,2024-08-19 21:32:00,6011,01784811094a8bd592cb35ee21e98c934839341e2b9d14...,14000,2024-08,341100
13508151,109142,2024-08-19 21:40:00,6011,01784811094a8bd592cb35ee21e98c934839341e2b9d14...,24000,2024-08,341100
13508152,109142,2024-08-19 21:46:00,6011,01784811094a8bd592cb35ee21e98c934839341e2b9d14...,23000,2024-08,341100
13508153,109142,2024-08-19 22:04:00,6011,01784811094a8bd592cb35ee21e98c934839341e2b9d14...,32000,2024-08,341100


In [60]:
# errors='ignore' keeps the cell safe to re-run once the columns are already gone
transaction_copy.drop(columns=['date_time', 'mcc_code', 'merchant_name'], inplace=True, errors='ignore')

In [62]:
transaction_copy.drop(['amount'], axis=1, inplace=True)

In [68]:
# Keeps only the first row per client, i.e. a single month's sum_amount
transaction_copy.drop_duplicates(subset=['client_num'], inplace=True)
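
Because `drop_duplicates(subset=['client_num'])` retains one row per client, the totals for the other months are discarded. One possible alternative, sketched here on a toy frame with made-up values, is to pivot the monthly sums into one column per month so that each client still ends up with a single row. The names `toy` and `monthly_wide` and the `sum_` column prefix are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the transaction table (values are made up)
toy = pd.DataFrame({
    "client_num": [0, 0, 0, 1, 1],
    "date_time": pd.to_datetime([
        "2024-07-18", "2024-07-22", "2024-08-03", "2024-07-05", "2024-09-10",
    ]),
    "amount": [2900, 455, 1000, 50, 70],
})
toy["month"] = toy["date_time"].dt.to_period("M")

# One row per client, one column per month, holding the monthly total
monthly_wide = toy.pivot_table(
    index="client_num",
    columns="month",
    values="amount",
    aggfunc="sum",
    fill_value=0,
)

# Turn the Period column labels into plain strings for downstream use
monthly_wide.columns = ["sum_" + str(m) for m in monthly_wide.columns]
monthly_wide = monthly_wide.reset_index()
```

The resulting frame can be merged on `client_num` in the same way as the deduplicated one, while keeping every month's total.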

In [27]:
filtered_data = transaction_copy.groupby(['month', 'client_num']).agg({
    'amount': ['sum', 'mean', 'min', 'max'],
    'mcc_code': 'count'
}).reset_index()
filtered_data

Unnamed: 0_level_0,month,client_num,amount,amount,amount,amount,mcc_code
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,sum,mean,min,max,count
0,2024-07,0,7261,726.100000,39,2900,10
1,2024-07,1,422749,4449.989474,6,100000,95
2,2024-07,2,114647,895.679688,23,14697,128
3,2024-07,3,1483913,35331.261905,2,1000000,42
4,2024-07,4,91422,1523.700000,24,50000,60
...,...,...,...,...,...,...,...
298585,2024-09,109141,378,378.000000,378,378,1
298586,2024-10,13168,3280,3280.000000,3280,3280,1
298587,2024-10,26937,132,132.000000,132,132,1
298588,2024-10,78137,64,64.000000,64,64,1


In [8]:
transaction_copy.mcc_code.unique().shape[0], transaction_copy.merchant_name.unique().shape[0]

(320, 666279)

In [11]:
transaction_copy.groupby(['client_num'])['mcc_code'].count()

client_num
0         132
1         240
2         300
3         147
4         122
         ... 
109138     16
109139     15
109140     18
109141     16
109142     15
Name: mcc_code, Length: 109143, dtype: int64

This most likely indicates how actively each client spends across categories. Note that `count` returns the number of transactions per client, not the number of distinct categories.
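
To look at categories more directly, a sketch along the following lines could separate the number of transactions from the number of distinct MCC codes per client, and also total the spend per category. The toy frame and the names `per_client` and `spend_by_category` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the transaction table (values are made up)
toy = pd.DataFrame({
    "client_num": [0, 0, 0, 1, 1],
    "mcc_code": ["5411", "5411", "5541", "6011", "6011"],
    "amount": [455, 100, 1003, 14000, 24000],
})

# Transactions vs. distinct categories per client
per_client = toy.groupby("client_num").agg(
    n_transactions=("mcc_code", "count"),
    n_categories=("mcc_code", "nunique"),
)

# Total amount per client and category, one column per MCC code
spend_by_category = (
    toy.groupby(["client_num", "mcc_code"])["amount"]
    .sum()
    .unstack(fill_value=0)
)
```

With 320 distinct MCC codes in the real data, the wide table would have 320 columns, so grouping rare codes together first may be worth considering.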

In [13]:
transaction_copy.groupby(['mcc_code'])['merchant_name'].count()

mcc_code
0742     6688
0763      239
0780     1209
1520      123
1711      214
        ...  
9311     7480
9390    17579
9399    10437
9402    15032
9406     7465
Name: merchant_name, Length: 320, dtype: int64

And this shows which categories are the busiest. Note that `count` tallies transactions per MCC code; counting distinct merchants in each category would require `nunique`.
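
As a hedged illustration of the difference, the sketch below computes both the transaction count and the number of distinct merchants per MCC code on a toy frame. The merchant values and the name `per_category` are made up.

```python
import pandas as pd

# Toy stand-in: three transactions at two merchants in 5411, two at one merchant in 6011
toy = pd.DataFrame({
    "mcc_code": ["5411", "5411", "5411", "6011", "6011"],
    "merchant_name": ["a", "a", "b", "c", "c"],
})

# Named aggregation: rows per category vs. distinct merchants per category
per_category = (
    toy.groupby("mcc_code")["merchant_name"]
    .agg(n_transactions="count", n_merchants="nunique")
    .sort_values("n_merchants", ascending=False)
)
```

Sorting by `n_merchants` puts the categories with the most distinct merchants first, which is the question the count alone cannot answer.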

### Dataset "Train"
client_num - client number

target - risk group

In [28]:
train

Unnamed: 0,client_num,target
0,94779,3
1,17279,0
2,5717,2
3,27471,1
4,72725,0
...,...,...
69995,107219,1
69996,108682,1
69997,93497,3
69998,14344,6


In [42]:
train.client_num.unique().shape[0]

70000

In [43]:
transaction_copy.client_num.unique().shape[0]

109143

In [70]:
train_merge = train.merge(transaction_copy, on='client_num', how='left')
# train_merge.drop(['date_time'], axis=1, inplace=True)
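
Since `train` holds 70,000 clients while the transactions cover 109,143, it can be useful to confirm that every training client actually found a match in the left merge. The sketch below uses the `indicator` and `validate` options of `DataFrame.merge` on toy frames; the frame names and the unmatched client 3 are made up for illustration.

```python
import pandas as pd

# Toy stand-ins: client 3 has no row on the feature side
toy_train = pd.DataFrame({"client_num": [1, 2, 3], "target": [4, 5, 3]})
toy_features = pd.DataFrame({"client_num": [1, 2], "sum_amount": [422749, 114647]})

checked = toy_train.merge(
    toy_features,
    on="client_num",
    how="left",
    validate="one_to_one",  # raises if client_num repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Training clients with no matching feature row
missing = checked.loc[checked["_merge"] == "left_only", "client_num"].tolist()
```

An empty `missing` list on the real data would mean no training client ends up with NaN features after the merge.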

In [72]:
train_merge.sort_values(by='client_num')

Unnamed: 0,client_num,target,month,sum_amount
54429,1,4,2024-07,422749
69304,2,5,2024-07,114647
54543,3,3,2024-07,1483913
39100,4,5,2024-07,91422
19645,5,2,2024-07,29600
...,...,...,...,...
27657,109136,3,2024-07,9238
53872,109138,2,2024-08,131817
29655,109139,0,2024-07,2996
67622,109141,0,2024-07,12050


In [76]:
train_merge[train_merge['client_num'] == 2]

Unnamed: 0,client_num,target,month,sum_amount
69304,2,5,2024-07,114647


## Feature Engineering
<a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

## Train models
<a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)