# **Grouped `pos_cash_balance`**

# Data Loading and Preprocessing

9,975,174 clean entries.

## New

In [7]:
from home_credit.tables import POSCashBalance
from pepper.univar import print_value_counts_dict
from home_credit.utils import display_frame_basic_infos

data = POSCashBalance.clean()
display_frame_basic_infos(data)
print_value_counts_dict(data, "NAME_CONTRACT_STATUS")
display(data)
# ok data.info()

n_samples: 9 975 174
n_columns: 6, [('SK', 2), ('NAME', 1), ('CNT', 2), ('TARGET', 1)]
NAME_CONTRACT_STATUS (8): {'Active': 9151093, 'Completed': 744883, 'Signed': 66890, 'Demand': 7065, 'Returned to the store': 2494, 'Approved': 2110, 'Amortized debt': 636, 'Canceled': 3}


CLEAN_POS_CASH_BALANCE
SK_ID_CURR,SK_ID_PREV,MONTHS_BALANCE,TARGET,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
100001,1369693,53,-1,4,0,Completed,0,0
100001,1369693,54,-1,4,1,Active,0,0
100001,1369693,55,-1,4,2,Active,0,0
100001,1369693,56,-1,4,3,Active,0,0
100001,1369693,57,-1,4,4,Active,0,0
...,...,...,...,...,...,...,...,...
456255,2631384,26,0,36,36,Active,0,0
456255,2729207,13,0,3,0,Completed,0,0
456255,2729207,14,0,3,0,Active,0,0
456255,2729207,15,0,6,5,Active,0,0


## Old

In [2]:
from home_credit.load import get_table
from pepper.utils import display_key_val

# Load the 'pos_cash_balance' table
data = get_table("pos_cash_balance").copy()

# Adjust the 'MONTHS_BALANCE' column to ensure consistency
data.MONTHS_BALANCE = -data.MONTHS_BALANCE

# Insert the aggregation counter
data.insert(0, "n_PREV", 1)

# Display the number of samples in the dataset
display_key_val("number of samples", data.shape[0])

# Display the dataset
display(data)

number of samples: 10 001 358


RAW_POS_CASH_BALANCE,n_PREV,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,1,1803195,182943,31,48.0,45.0,Active,0,0
1,1,1715348,367990,33,36.0,35.0,Active,0,0
2,1,1784872,397406,32,12.0,9.0,Active,0,0
3,1,1903291,269225,35,48.0,42.0,Active,0,0
4,1,2341044,334279,35,36.0,35.0,Active,0,0
...,...,...,...,...,...,...,...,...,...
10001353,1,2448283,226558,20,6.0,0.0,Active,843,0
10001354,1,1717234,141565,19,12.0,0.0,Active,602,0
10001355,1,1283126,315695,21,10.0,0.0,Active,609,0
10001356,1,1082516,450255,22,12.0,0.0,Active,614,0


# Key Uniqueness

We verify that one `SK_ID_PREV` can never be attached to several `SK_ID_CURR`.

The problem is therefore multi-indexed only in appearance: the `SK_ID_CURR` key alone is sufficient to separate the groups, because no `SK_ID_PREV` is shared between two `SK_ID_CURR`.
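
The same check can be reproduced with plain pandas, independently of the project helpers. The following is a minimal sketch on a toy frame (the values are invented for illustration, not taken from the dataset): it counts the distinct `SK_ID_CURR` per `SK_ID_PREV`, and the reverse.

```python
import pandas as pd

# Toy data, invented for illustration: three previous loans, two current loans.
toy = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100001, 100002],
    "SK_ID_PREV": [1, 1, 2, 3],
})

# Number of distinct SK_ID_CURR attached to each SK_ID_PREV.
curr_per_prev = toy.groupby("SK_ID_PREV")["SK_ID_CURR"].nunique()
# Number of distinct SK_ID_PREV attached to each SK_ID_CURR.
prev_per_curr = toy.groupby("SK_ID_CURR")["SK_ID_PREV"].nunique()

# True when no SK_ID_PREV is shared between several SK_ID_CURR.
prev_has_single_curr = bool((curr_per_prev == 1).all())
# How many SK_ID_CURR have more than one SK_ID_PREV.
n_curr_with_many_prev = int((prev_per_curr > 1).sum())
```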

Number of `SK_ID_PREV` per `SK_ID_CURR`, and vice versa:

In [4]:
from home_credit.merge import _get_unique_and_multi_index, curr_prev_uniqueness_report

# Get unique and multi-indexes for the specified table and columns
indexes = _get_unique_and_multi_index(data.reset_index(), "SK_ID_PREV", "SK_ID_CURR")

# Generate a report on the uniqueness of SK_ID_CURR and SK_ID_PREV
curr_prev_uniqueness_report(*indexes)

number of unique (curr, prev)              : 935 435
number of curr with more than 1 prev       : 831 454
number of curr with one prev               : 103 981
number of curr with more than 1 prev (in %): 88.9
number of prev with more than 1 curr       : 0
number of prev with one curr               : 935 435
number of prev with more than 1 curr (in %): 0.0


# Aggregation, cf. **`old_kernel_v2`**

The first draft was inspired by **`lightgbm_kernel`**, a reference kernel available on Kaggle.

It aggregates by current loan (`SK_ID_CURR`) and produces 337,224 summary samples.

Information is lost in the process. For each group, we get:
- the first month, the last month, and the number of tracked months (96 at most).
- the frequency, over the tracking period, of each category of each categorical variable.
- the maximum and the mean of the numeric variables `SK_DPD` and `SK_DPD_DEF`.
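
As a self-contained illustration of these rules, here is a minimal sketch on a toy frame with invented values. It uses `pandas.get_dummies` as a stand-in for the project's `hot_encode_cats`, whose exact behaviour is not shown here, so that substitution is an assumption.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [3, 4, 5, 6, 10, 11],
    "SK_DPD": [0, 0, 7, 0, 0, 0],
    "NAME_CONTRACT_STATUS": [
        "Active", "Active", "Active", "Completed", "Active", "Active",
    ],
})

# One-hot encode the categorical variable (assumed stand-in for `hot_encode_cats`).
encoded = pd.get_dummies(toy, columns=["NAME_CONTRACT_STATUS"], dtype=float)
cat_cols = [c for c in encoded.columns if c.startswith("NAME_CONTRACT_STATUS_")]

agg_rules = {
    "MONTHS_BALANCE": ["min", "max", "size"],  # first, last, number of months
    "SK_DPD": ["max", "mean"],                 # numeric variable
    **{col: ["mean"] for col in cat_cols},     # frequency of each category
}
aggregated = encoded.groupby("SK_ID_CURR").agg(agg_rules)
```

The mean of a one-hot column is the share of tracked months spent in that category, which is why `mean` yields a frequency.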

In [5]:
from home_credit.kernel import hot_encode_cats

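# One-hot encode the categorical variables (here, NAME_CONTRACT_STATUS)
encoded_data, cat_vars = hot_encode_cats(data.reset_index())

# Tracking period: first month, last month, number of tracked months
months_agg_rules = {"MONTHS_BALANCE": ["min", "max", "size"]}
# Days past due: maximum and mean
sk_dpd_rules = {"SK_DPD": ["max", "mean"], "SK_DPD_DEF": ["max", "mean"]}
# Frequency of each category over the tracking period
cat_vars_agg_rules = {col: ["mean"] for col in cat_vars}
agg_rules = months_agg_rules | sk_dpd_rules | cat_vars_agg_rules

# Aggregate by SK_ID_CURR and add the number of monthly records
grouped = encoded_data.groupby("SK_ID_CURR")
aggregated = grouped.agg(agg_rules)
aggregated["POS_COUNT"] = grouped.size()
display(aggregated)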

,MONTHS_BALANCE,MONTHS_BALANCE,MONTHS_BALANCE,SK_DPD,SK_DPD,SK_DPD_DEF,SK_DPD_DEF,NAME_CONTRACT_STATUS_Active,NAME_CONTRACT_STATUS_Amortized debt,NAME_CONTRACT_STATUS_Approved,NAME_CONTRACT_STATUS_Canceled,NAME_CONTRACT_STATUS_Completed,NAME_CONTRACT_STATUS_Demand,NAME_CONTRACT_STATUS_Returned to the store,NAME_CONTRACT_STATUS_Signed,POS_COUNT
SK_ID_CURR,min,max,size,max,mean,max,mean,mean,mean,mean,mean,mean,mean,mean,mean,
100001,53,96,9,7,0.777778,7,0.777778,0.777778,0.0,0.0,0.0,0.222222,0.0,0.0,0.0,9
100002,1,19,19,0,0.000000,0,0.000000,1.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,19
100003,18,77,28,0,0.000000,0,0.000000,0.928571,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,28
100004,24,27,4,0,0.000000,0,0.000000,0.750000,0.0,0.0,0.0,0.250000,0.0,0.0,0.0,4
100005,15,24,10,0,0.000000,0,0.000000,0.900000,0.0,0.0,0.0,0.100000,0.0,0.0,0.0,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456251,1,8,8,0,0.000000,0,0.000000,0.875000,0.0,0.0,0.0,0.125000,0.0,0.0,0.0,8
456252,76,82,7,0,0.000000,0,0.000000,0.857143,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,7
456253,57,96,17,5,0.294118,5,0.294118,0.882353,0.0,0.0,0.0,0.117647,0.0,0.0,0.0,17
456254,1,11,20,0,0.000000,0,0.000000,1.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,20


# Dense variation profiles using run-length encoding (RLE)

Following the **`bureau_balance`** case, we start with this operation, which has the advantage of exposing the dynamic _patterns_ of each variable, whether numeric or categorical.

For convenience, we re-encode each category of `NAME_CONTRACT_STATUS` as a single letter.
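
A minimal sketch of such a re-encoding follows. The letters chosen below are hypothetical, since the mapping actually used by the project is not shown; only the eight status names come from the value counts above.

```python
import pandas as pd

# Hypothetical one-letter codes for the 8 observed statuses
# (an assumption for illustration, not the project's mapping).
STATUS_CODES = {
    "Active": "A",
    "Completed": "C",
    "Signed": "S",
    "Demand": "D",
    "Returned to the store": "R",
    "Approved": "P",
    "Amortized debt": "M",
    "Canceled": "X",
}

# Toy monthly history of one loan, invented for illustration.
status = pd.Series(["Active", "Active", "Completed", "Signed"])
encoded_status = status.map(STATUS_CODES)

# The monthly history then reads as a compact string.
profile = "".join(encoded_status)
```

With one letter per month, a loan's history becomes a short string that is easy to run-length encode and to compare visually.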

## Tracking period

The tracking period of a loan may span all 96 months, a single sub-period, or even several fragmented sub-periods.

Our first function, built on `jumps_rle`, encodes the jumps between consecutive tracked months in RLE.
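
To make the encoding concrete, here is a minimal sketch of a run-length encoding of the jumps. The function below is a hypothetical re-implementation, not the project's `jumps_rle`. Its convention (measuring the first jump from month -1, so that it equals the first tracked month plus one) is an assumption inferred from the `jumps_rle` values displayed in the tracking table further down.

```python
from itertools import groupby


def jumps_rle_sketch(months, start=-1):
    """Run-length encode the gaps between consecutive tracked months.

    Hypothetical helper: `start=-1` is an assumed convention.
    Returns a tuple of (jump, run_length) pairs.
    """
    ordered = sorted(months)
    # Pair each month with the one before it (the first with `start`).
    previous = [start] + ordered[:-1]
    jumps = [m - p for m, p in zip(ordered, previous)]
    # Collapse runs of identical jumps into (jump, run_length) pairs.
    return tuple((jump, len(list(run))) for jump, run in groupby(jumps))


# Contiguous tracking over months 8, 9, 10.
contiguous = jumps_rle_sketch([8, 9, 10])
# Fragmented tracking: months 1 to 3, then 7 and 8.
fragmented = jumps_rle_sketch([1, 2, 3, 7, 8])
```

After the first pair, any jump greater than 1 marks a gap in the tracking: here `fragmented` contains the pair `(4, 1)` for the hole between months 3 and 7.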

### Tracking period per loan

We build the table of loan tracking periods, indexed by `SK_ID_PREV`, with the first month, the last month, the number of tracked months, and the RLE representation of the sub-periods (in most cases, a single one).
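
The shape of this table can be sketched with a plain pandas `groupby`. This is not the implementation of `POSCashBalance.rle_loan_tracking_period`, which is not shown here: the toy values and the helper `rle_of_jumps` are hypothetical, and the first jump is measured from month -1, as the displayed output suggests.

```python
import numpy as np
import pandas as pd


def rle_of_jumps(months: pd.Series) -> tuple:
    """Hypothetical helper: RLE of the gaps between sorted months (first gap from -1)."""
    jumps = np.diff(np.sort(months.to_numpy()), prepend=-1)
    # Indices where a new run of identical jumps starts.
    starts = np.flatnonzero(np.r_[True, jumps[1:] != jumps[:-1]])
    lengths = np.diff(np.r_[starts, len(jumps)])
    return tuple((int(j), int(n)) for j, n in zip(jumps[starts], lengths))


# Toy data, invented for illustration: two loans, the second with a gap.
toy = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 10, 20, 20, 20],
    "MONTHS_BALANCE": [8, 9, 10, 1, 2, 5],
})

grouped = toy.groupby("SK_ID_PREV")["MONTHS_BALANCE"]
# First month, last month and number of tracked months per loan.
tracking_sketch = grouped.agg(["min", "max", "count"])
# RLE of the sub-periods, one tuple per loan.
tracking_sketch["jumps_rle"] = grouped.agg(rle_of_jumps)
```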

**Time:** 1 min.

In [1]:
from home_credit.tables import POSCashBalance

tracking = POSCashBalance.rle_loan_tracking_period()
display(tracking)
# ok tracking.info()

Save to C:/Users/franc/Projects/pepper_credit_scoring_tool\tmp\persist\pos_cash_balance\rle_loan_tracking_period\44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a.pqt


CLEAN_POS_CASH_BALANCE
,MONTHS_BALANCE,MONTHS_BALANCE,MONTHS_BALANCE,MONTHS_BALANCE
SK_ID_PREV,min,max,count,jumps_rle
1000001,8,10,3,"((9, 1), (1, 2))"
1000002,50,54,5,"((51, 1), (1, 4))"
1000003,1,4,4,"((2, 1), (1, 3))"
1000004,22,29,8,"((23, 1), (1, 7))"
1000005,46,56,11,"((47, 1), (1, 10))"
...,...,...,...,...
2843494,24,26,3,"((25, 1), (1, 2))"
2843495,9,16,8,"((10, 1), (1, 7))"
2843497,1,21,21,"((2, 1), (1, 20))"
2843498,42,48,7,"((43, 1), (1, 6))"
