# **Grouped `installments_payments`**

The **`installments_payments`** table resembles the other level-3 tables with longitudinal detail, but it differs in nature. The other three tables track, month by month, financial movements and other status information about loans (external loans for **`bureau_balance`**, internal loans for **`pos_cash_balance`** and **`credit_card_balance`**).

The **`installments_payments`** table holds complementary information, shared with **`pos_cash_balance`** and **`credit_card_balance`**, about installments and payments. This additional information can notably help impute missing values and correct outliers in those two tables.

The raw table has no `MONTHS_BALANCE` pivot variable, but the cleaned version does: it is derived from the `DAYS_INSTALMENT` column. This lets us build monthly summaries that can be aligned with those of the other tables.
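As an illustration, a month index can be obtained from a day offset by dividing by an average month length and rounding down. The sketch below is an assumption, not the project's cleaning code: the divisor `365.25 / 12` and the helper name `days_to_months` are hypothetical, chosen only because they are consistent with the `MONTHS_BALANCE` values displayed in the first output of the loading section.

```python
import numpy as np
import pandas as pd

AVG_DAYS_PER_MONTH = 365.25 / 12  # assumption: average month length in days


def days_to_months(days: pd.Series) -> pd.Series:
    """Map a positive day offset to a whole number of elapsed months."""
    return np.floor(days / AVG_DAYS_PER_MONTH).astype(int)


# DAYS_INSTALMENT values taken from the rows displayed in the notebook.
days = pd.Series(
    [1709, 1679, 1649, 1619, 2916, 96, 66, 469, 439], name="DAYS_INSTALMENT"
)
months = days_to_months(days).rename("MONTHS_BALANCE")
print(months.tolist())
```

The actual rule used by `InstallmentsPayments.clean()` may differ (the second loading cell shows values shifted by one month), so this should be read as one plausible conversion only.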

# Data Loading and Preprocessing

13,605,401 entries.

The data contain missing installment amounts encoded as 0: 221 installment amounts are not filled in, and we correct them by copying the corresponding payment amount.
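The correction described above can be sketched on toy data as follows. This is a minimal illustration, not the code of `InstallmentsPayments.clean()`: the toy rows are invented, and the condition that a payment must have been recorded (`AMT_PAYMENT > 0`) is an assumption.

```python
import pandas as pd

# Toy rows: the second installment amount is missing and encoded as 0.
df = pd.DataFrame({
    "AMT_INSTALMENT": [3951.0, 0.0, 17397.9],
    "AMT_PAYMENT": [3951.0, 3951.0, 17397.9],
})

# Rows where the installment amount is 0 but a payment was recorded
# (the second condition is an assumption of this sketch).
missing = (df["AMT_INSTALMENT"] == 0) & (df["AMT_PAYMENT"] > 0)

# Copy the payment amount into the missing installment amount.
df.loc[missing, "AMT_INSTALMENT"] = df.loc[missing, "AMT_PAYMENT"]
print(df)
```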

In [1]:
from home_credit.tables import InstallmentsPayments
from home_credit.utils import display_frame_basic_infos

data = InstallmentsPayments.clean()
display_frame_basic_infos(data)
display(data)
# ok data.info()

n_samples: 13 602 496
n_columns: 7, [('NUM', 1), ('DAYS', 2), ('AMT', 2), ('MONTHS', 1), ('TARGET', 1)]


SK_ID_CURR,SK_ID_PREV,NUM_INSTALMENT_NUMBER,MONTHS_BALANCE,TARGET,NUM_INSTALMENT_VERSION,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
100001,1369693,1,56,-1,1,1709,1715,3951.000,3951.000
100001,1369693,2,55,-1,1,1679,1715,3951.000,3951.000
100001,1369693,3,54,-1,1,1649,1660,3951.000,3951.000
100001,1369693,4,53,-1,2,1619,1628,17397.900,17397.900
100001,1851984,2,95,-1,1,2916,2916,3982.050,3982.050
...,...,...,...,...,...,...,...,...,...
456255,2631384,23,3,0,3,96,98,27489.690,27489.690
456255,2631384,24,2,0,4,66,76,308277.315,308277.315
456255,2729207,1,15,0,1,469,482,11514.555,11514.555
456255,2729207,2,14,0,1,439,455,11514.555,11514.555


In [2]:
from home_credit.tables import InstallmentsPayments
from home_credit.utils import display_frame_basic_infos

data = InstallmentsPayments.clean(no_na_payment=False)
display_frame_basic_infos(data)
display(data)
# ok data.info()

n_samples: 13 605 401
n_columns: 7, [('NUM', 1), ('DAYS', 2), ('AMT', 2), ('MONTHS', 1), ('TARGET', 1)]


SK_ID_CURR,SK_ID_PREV,NUM_INSTALMENT_NUMBER,MONTHS_BALANCE,TARGET,NUM_INSTALMENT_VERSION,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
100001,1369693,1,57,-1,1,1709,1715.0,3951.000000,3951.000000
100001,1369693,2,56,-1,1,1679,1715.0,3951.000000,3951.000000
100001,1369693,3,55,-1,1,1649,1660.0,3951.000000,3951.000000
100001,1369693,4,54,-1,2,1619,1628.0,17397.900391,17397.900391
100001,1851984,2,96,-1,1,2916,2916.0,3982.050049,3982.050049
...,...,...,...,...,...,...,...,...,...
456255,2631384,23,4,0,3,96,98.0,27489.689453,27489.689453
456255,2631384,24,3,0,4,66,76.0,308277.312500,308277.312500
456255,2729207,1,16,0,1,469,482.0,11514.554688,11514.554688
456255,2729207,2,15,0,1,439,455.0,11514.554688,11514.554688


# Key Uniqueness

We verify that there cannot be multiple `SK_ID_CURR` for one `SK_ID_PREV`.

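The index is therefore hierarchical rather than truly two-dimensional: each `SK_ID_PREV` belongs to exactly one `SK_ID_CURR`, so `SK_ID_PREV` alone identifies a loan, and grouping by `SK_ID_CURR` never splits a loan across clients.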
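The same check can be written directly with pandas. The sketch below uses a few invented rows and is not the implementation of the `home_credit.merge` helpers used in the next cell; it only shows the counting logic.

```python
import pandas as pd

# Toy (client, loan) pairs; one row per payment line, so pairs repeat.
pairs = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100001, 100002],
    "SK_ID_PREV": [1369693, 1369693, 1851984, 1038818],
})

# Keep one row per distinct (SK_ID_CURR, SK_ID_PREV) pair.
unique_pairs = pairs.drop_duplicates()

# Number of distinct clients per loan, and of distinct loans per client.
n_curr_per_prev = unique_pairs.groupby("SK_ID_PREV")["SK_ID_CURR"].nunique()
n_prev_per_curr = unique_pairs.groupby("SK_ID_CURR")["SK_ID_PREV"].nunique()

# True when every loan belongs to exactly one client.
prev_has_single_curr = bool((n_curr_per_prev == 1).all())
print(prev_has_single_curr)
print(int((n_prev_per_curr > 1).sum()))
```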

Number of `SK_ID_PREV` for one `SK_ID_CURR` and vice versa:

In [2]:
from home_credit.merge import _get_unique_and_multi_index, curr_prev_uniqueness_report

# Get unique and multi-indexes for the specified table and columns
indexes = _get_unique_and_multi_index(data.reset_index(), "SK_ID_PREV", "SK_ID_CURR")

# Generate a report on the uniqueness of SK_ID_CURR and SK_ID_PREV
curr_prev_uniqueness_report(*indexes)

number of unique (curr, prev): 997 674
number of curr with more than 1 prev: 903 027
number of curr with one prev: 94 647
number of curr with more than 1 prev (in %): 90.5
number of prev with more than 1 curr: 0
number of prev with one curr: 997 674
number of prev with more than 1 curr (in %): 0.0


# Aggregation, cf. **`old_kernel_v2`**

The first draft was inspired by **`lightgbm_kernel`**, a reference kernel available on Kaggle.

It is an aggregation per current loan (`SK_ID_CURR`) that produces 339,587 summary samples.

A preliminary feature-engineering step adds relevant new variables ahead of the aggregation: the unpaid part of an installment, the DPD (days past due) and the DBD (days before due).

The problem with this version is that it completely skips the basic aggregation and cleaning that ensure consistent data in the higher-level aggregations.

In [12]:
from home_credit.kernel import hot_encode_cats

encoded_data, cat_vars = hot_encode_cats(data.reset_index())
encoded_data = encoded_data.drop(columns=["TARGET", "MONTHS_BALANCE"])

# Percentage and difference paid in each installment (amount paid and installment value)
encoded_data['PAYMENT_PERC'] = encoded_data['AMT_PAYMENT'] / encoded_data['AMT_INSTALMENT']
encoded_data['PAYMENT_DIFF'] = encoded_data['AMT_INSTALMENT'] - encoded_data['AMT_PAYMENT']
# Days past due and days before due (no negative values)
encoded_data['DPD'] = (encoded_data['DAYS_ENTRY_PAYMENT'] - encoded_data['DAYS_INSTALMENT']).clip(lower=0)
encoded_data['DBD'] = (encoded_data['DAYS_INSTALMENT'] - encoded_data['DAYS_ENTRY_PAYMENT']).clip(lower=0)

nu = ["nunique"]
mms = ["max", "mean", "sum"]
mmsv = mms + ["var"]
mmms = ["min"] + mms
agg_rules = {
    'NUM_INSTALMENT_VERSION': nu,
    'DPD': mms,
    'DBD': mms,
    'PAYMENT_PERC': mmsv,
    'PAYMENT_DIFF': mmsv,
    'AMT_INSTALMENT': mms,
    'AMT_PAYMENT': mmms,
    'DAYS_ENTRY_PAYMENT': mms,
} | {cat: ["mean"] for cat in cat_vars}

grouped = encoded_data.groupby("SK_ID_CURR")
aggregated = grouped.agg(agg_rules)
aggregated["INSTALL_COUNT"] = grouped.size()
display(aggregated)

SK_ID_CURR,NUM_INSTALMENT_VERSION,DPD,DPD,DPD,DBD,DBD,DBD,PAYMENT_PERC,PAYMENT_PERC,PAYMENT_PERC,...,AMT_INSTALMENT,AMT_INSTALMENT,AMT_PAYMENT,AMT_PAYMENT,AMT_PAYMENT,AMT_PAYMENT,DAYS_ENTRY_PAYMENT,DAYS_ENTRY_PAYMENT,DAYS_ENTRY_PAYMENT,INSTALL_COUNT
,nunique,max,mean,sum,max,mean,sum,max,mean,sum,...,mean,sum,min,max,mean,sum,max,mean,sum,
100001,2,36.0,8.857143,62.0,11.0,1.571429,11.0,1.000000,1.000000,7.00000,...,5885.132324,4.119593e+04,3951.000000,17397.900391,5885.132324,4.119593e+04,2916.0,2195.000000,15365.0,7
100002,2,31.0,20.421053,388.0,0.0,0.000000,0.0,1.000000,1.000000,19.00000,...,11559.247070,2.196257e+05,9251.775391,53093.746094,11559.247070,2.196257e+05,587.0,315.421053,5993.0,19
100003,2,14.0,7.160000,179.0,0.0,0.000000,0.0,1.000000,1.000000,25.00000,...,64754.585938,1.618865e+06,6662.970215,560835.375000,64754.585938,1.618865e+06,2324.0,1385.320000,34633.0,25
100004,2,11.0,7.666667,23.0,0.0,0.000000,0.0,1.000000,1.000000,3.00000,...,7096.154785,2.128846e+04,5357.250000,10573.964844,7096.154785,2.128846e+04,795.0,761.666667,2285.0,3
100005,2,37.0,23.666667,213.0,1.0,0.111111,1.0,1.000000,1.000000,9.00000,...,6240.205078,5.616184e+04,4813.200195,17656.244141,6240.205078,5.616184e+04,736.0,609.555556,5486.0,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456251,2,46.0,36.285714,254.0,0.0,0.000000,0.0,1.000000,1.000000,7.00000,...,7492.924316,5.245047e+04,6605.910156,12815.009766,7492.924316,5.245047e+04,237.0,156.285714,1094.0,7
456252,1,11.0,3.333333,20.0,3.0,0.500000,3.0,1.000000,1.000000,6.00000,...,10069.867188,6.041920e+04,10046.879883,10074.464844,10069.867188,6.041920e+04,2470.0,2393.833333,14363.0,6
456253,1,51.0,15.142857,212.0,9.0,0.642857,9.0,1.000000,0.928571,13.00000,...,4399.708008,6.159591e+04,27.270000,5575.185059,4115.915039,5.762281e+04,2915.0,2387.428571,33424.0,14
456254,1,31.0,19.000000,361.0,0.0,0.000000,0.0,1.000000,1.000000,19.00000,...,10239.832031,1.945568e+05,2296.439941,19065.824219,10239.832031,1.945568e+05,317.0,161.263158,3064.0,19


# Loan amounts

997,752 loans tracked.

The loan amounts are obtained as the sum of the amounts of all their installments, all versions combined.
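The definition above amounts to a grouped sum. The following sketch is an assumption about what `InstallmentsPayments.loan_amount()` computes, not its actual code; it uses the four installment amounts displayed earlier for loan 1369693, and the name `loan_amount_sketch` is hypothetical.

```python
import pandas as pd

# Installment amounts of loan 1369693, as displayed in the cleaned table.
rows = pd.DataFrame({
    "SK_ID_CURR": [100001] * 4,
    "SK_ID_PREV": [1369693] * 4,
    "AMT_INSTALMENT": [3951.0, 3951.0, 3951.0, 17397.9],
})

# Loan amount: sum of all installment amounts of the loan.
loan_amount_sketch = (
    rows.groupby(["SK_ID_CURR", "SK_ID_PREV"])["AMT_INSTALMENT"]
    .sum()
    .rename("AMT_LOAN")
    .to_frame()
)
print(loan_amount_sketch)
```

Summing `AMT_LOAN` again at the `SK_ID_CURR` level would give a per-client total.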

In [6]:
from home_credit.tables import InstallmentsPayments

loan_amount = InstallmentsPayments.loan_amount()
print(f"Total Loaned Amount: {int(loan_amount.AMT_LOAN.sum()):n}")
print(f"Min Loan Amount: {int(loan_amount.AMT_LOAN.min()):n}")
print(f"Median Loan Amount: {int(loan_amount.AMT_LOAN.median()):n}")
print(f"Mean Loan Amount: {int(loan_amount.AMT_LOAN.mean()):n}")
print(f"Max Loan Amount: {int(loan_amount.AMT_LOAN.max()):n}")

display(loan_amount)

Total Loaned Amount: 231 990 918 177
Min Loan Amount: 0
Median Loan Amount: 111 337
Mean Loan Amount: 232 513
Max Loan Amount: 30 075 394


SK_ID_CURR,SK_ID_PREV,AMT_LOAN
100001,1369693,29250.90
100001,1851984,11945.03
100002,1038818,219625.70
100003,1810518,1150977.33
100003,2396755,80773.38
...,...,...
456255,1359084,140497.74
456255,1743609,132182.28
456255,2073384,315433.12
456255,2631384,1692260.91


# Amount borrowed

This is the summary, for each client, of the total amount borrowed.

In [4]:
from home_credit.tables import InstallmentsPayments

loaned_amount = InstallmentsPayments.loaned_amount()
print(f"Total Loaned Amount: {int(loaned_amount.AMT_LOANED.sum()):n}")
print(f"Min Loaned Amount: {int(loaned_amount.AMT_LOANED.min()):n}")
print(f"Median Loaned Amount: {int(loaned_amount.AMT_LOANED.median()):n}")
print(f"Mean Loaned Amount: {int(loaned_amount.AMT_LOANED.mean()):n}")
print(f"Max Loaned Amount: {int(loaned_amount.AMT_LOANED.max()):n}")

display(loaned_amount)

Total Loaned Amount: 231 990 918 177
Min Loaned Amount: 0
Median Loaned Amount: 334 406
Mean Loaned Amount: 683 156
Max Loaned Amount: 32 479 781


SK_ID_CURR,AMT_LOANED
100001,41195.93
100002,219625.70
100003,1618864.65
100004,21288.46
100005,56161.84
...,...
456251,52450.47
456252,60419.20
456253,61595.91
456254,194556.82


# Last version of payments

The payments to consider are those of the last version.

The others must not be taken into account, otherwise the same payments are counted several times.

Note that the columns have been renamed for convenience.

`n` is a variable added to index the fragments of a payment, since a payment can be split across several rows.
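One way to express this selection is sketched below on invented rows. It assumes that "last version" means the highest version number of each installment and that `n` numbers the fragments from 0; both points are assumptions of the sketch rather than a description of `InstallmentsPayments.last_version()`.

```python
import pandas as pd

# Toy rows: installment 1 exists in two versions, installment 2 is paid
# in two fragments.
rows = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 1, 1],
    "N": [1, 1, 2, 2],          # installment number
    "V": [1, 2, 1, 1],          # installment version
    "REPAID": [100.0, 100.0, 60.0, 40.0],
})

# Keep only the rows that belong to the highest version of each installment.
last_v = rows.groupby(["SK_ID_PREV", "N"])["V"].transform("max")
last_rows = rows[rows["V"] == last_v].copy()

# Number the payment fragments of each (installment, version) from 0.
last_rows["n"] = last_rows.groupby(["SK_ID_PREV", "N", "V"]).cumcount()
print(last_rows)
```

Without the filter, the 100.0 repaid on installment 1 would be counted once per version.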

In [4]:
from home_credit.tables import InstallmentsPayments

last_version = InstallmentsPayments.last_version()
display(last_version)

SK_ID_CURR,SK_ID_PREV,N,V,M,n,T0,T1,INST,REPAID
100001,1369693,1,1,56,0,1709,1715.0,3951.000,3951.000
100001,1369693,2,1,55,0,1679,1715.0,3951.000,3951.000
100001,1369693,3,1,54,0,1649,1660.0,3951.000,3951.000
100001,1369693,4,2,53,0,1619,1628.0,17397.900,17397.900
100001,1851984,2,1,95,0,2916,2916.0,3982.050,3982.050
...,...,...,...,...,...,...,...,...,...
456255,2631384,23,3,3,0,96,98.0,27489.690,27489.690
456255,2631384,24,4,2,0,66,76.0,308277.315,308277.315
456255,2729207,1,1,15,0,469,482.0,11514.555,11514.555
456255,2729207,2,1,14,0,439,455.0,11514.555,11514.555


# Amount and DPD of payments

This table gives, for each installment payment, the amount repaid (that of the last version) and the DPD, i.e. the number of days between the due date (the date on which the debt becomes payable) and the payment date.

In [1]:
from home_credit.tables import InstallmentsPayments

repaid_and_dpd = InstallmentsPayments.repaid_and_dpd()
display(repaid_and_dpd)

SK_ID_CURR,SK_ID_PREV,N,V,REPAID,DPD
100001,1369693,1,1,3951.000,6.0
100001,1369693,2,1,3951.000,36.0
100001,1369693,3,1,3951.000,11.0
100001,1369693,4,2,17397.900,9.0
100001,1851984,2,1,3982.050,0.0
...,...,...,...,...,...
456255,2631384,23,3,27489.690,2.0
456255,2631384,24,4,308277.315,10.0
456255,2729207,1,1,11514.555,13.0
456255,2729207,2,1,11514.555,16.0


# Expected DPD

The expected DPD is the sum, over all installments, of the fraction of the debt repaid multiplied by the DPD.

The fraction is taken either relative to the loan, or relative to the client across all loans.
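Read as a formula, the per-loan definition is $\mathrm{EXP\_DPD} = \sum_i \frac{\mathrm{REPAID}_i}{\mathrm{AMT\_LOAN}} \cdot \mathrm{DPD}_i$. The sketch below applies this reading to the `REPAID` and `DPD` values displayed above for loan 1369693; it is an illustration of the definition, not the code of `expected_dpd_by_loan()`.

```python
import pandas as pd

# REPAID and DPD values displayed for loan 1369693.
payments = pd.DataFrame({
    "SK_ID_PREV": [1369693] * 4,
    "REPAID": [3951.0, 3951.0, 3951.0, 17397.9],
    "DPD": [6.0, 36.0, 11.0, 9.0],
})
amt_loan = 29250.9  # AMT_LOAN displayed for this loan

# Weight of each payment: share of the loan amount that it repays.
weights = payments["REPAID"] / amt_loan

# Expected DPD: sum of the DPD values weighted by those shares.
exp_dpd = float((weights * payments["DPD"]).sum())
print(round(exp_dpd, 6))
```

The per-client variant would use the client's total borrowed amount as the denominator and sum over all of the client's loans.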

## Expected DPD per loan

In [1]:
from home_credit.tables import InstallmentsPayments

exp_dpd_by_loan = InstallmentsPayments.expected_dpd_by_loan()
display(exp_dpd_by_loan)

SK_ID_CURR,SK_ID_PREV,REPAID,AMT_LOAN,PCT_REPAID,EXP_DPD
100001,1369693,29250.900,29250.90,1.000000,12.511892
100001,1851984,11945.025,11945.03,1.000000,-3.667010
100002,1038818,219625.695,219625.70,1.000000,21.135486
100003,1810518,1150977.330,1150977.33,1.000000,5.863620
100003,2396755,80773.380,80773.38,1.000000,6.751611
...,...,...,...,...,...
456255,1359084,129183.570,140497.74,0.919471,9.605511
456255,1743609,132182.280,132182.28,1.000000,5.691847
456255,2073384,282631.905,315433.12,0.896012,11.527262
456255,2631384,1582302.150,1692260.91,0.935023,8.826935


## Expected DPD per client

This indicator is of primary interest for the final data table supplied to modeling.

In [2]:
from home_credit.tables import InstallmentsPayments

exp_dpd_by_client = InstallmentsPayments.expected_dpd_by_client()
display(exp_dpd_by_client)

SK_ID_CURR,REPAID,AMT_LOANED,PCT_REPAID,EXP_DPD
100001,41195.925,41195.93,1.000000,7.820713
100002,219625.695,219625.70,1.000000,21.135486
100003,1618864.650,1618864.65,1.000000,7.175516
100004,21288.465,21288.46,1.000000,6.523107
100005,56161.845,56161.84,1.000000,19.083609
...,...,...,...,...
456251,52450.470,52450.47,1.000000,32.937244
456252,60419.205,60419.20,1.000000,2.829605
456253,57622.815,61595.91,0.935497,13.457586
456254,194556.825,194556.82,1.000000,12.966509


# == To be decided: keep or discard what follows ==

Let this rest and go back to CCB and PCB to see what comes out of it.

We will see later whether, for example, the monthly summaries are relevant.

# Aggregation by installment and version

For a complete analysis and justification, see **`assert.ipynb`**.

The first step required before working with this table is to reduce it with a first aggregation that guarantees one row per triplet (`SK_ID_PREV`, `NUM_INSTALMENT_NUMBER`, `NUM_INSTALMENT_VERSION`).

Without this step, the subsequent aggregations would produce inconsistent results.

In this aggregated table, there is only one row per version of an installment, with:
- the number of payments (payment fractions of the installment),
- the constant amount and date of the installment,
- the first (max) and last (min) payment dates,
- the sum of the payments.

In the rest of the notebook, we refer to this table as `base_data`.
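The aggregation described above can be sketched with pandas named aggregation. The toy rows are invented and the use of `"first"` for the constant installment fields is an assumption; the real logic lives in `get_installments_payments_by_installment_and_version`, whose code is not shown here.

```python
import pandas as pd

# Toy rows: installment 1 (version 1) is paid in two fragments.
rows = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 1],
    "NUM_INSTALMENT_NUMBER": [1, 1, 2],
    "NUM_INSTALMENT_VERSION": [1, 1, 1],
    "DAYS_INSTALMENT": [100, 100, 70],
    "AMT_INSTALMENT": [500.0, 500.0, 500.0],
    "DAYS_ENTRY_PAYMENT": [105, 98, 70],
    "AMT_PAYMENT": [300.0, 200.0, 500.0],
})

key = ["SK_ID_PREV", "NUM_INSTALMENT_NUMBER", "NUM_INSTALMENT_VERSION"]
base = rows.groupby(key).agg(
    # Constant within a group, so the first value is representative.
    DAYS_INSTALMENT=("DAYS_INSTALMENT", "first"),
    AMT_INSTALMENT=("AMT_INSTALMENT", "first"),
    # Number of payment rows (fragments) for this installment version.
    CNT_PAYMENT=("AMT_PAYMENT", "size"),
    # Days are counted backwards, so the first payment has the largest value.
    DAYS_ENTRY_PAYMENT_START=("DAYS_ENTRY_PAYMENT", "max"),
    DAYS_ENTRY_PAYMENT_END=("DAYS_ENTRY_PAYMENT", "min"),
    AMT_PAYMENT=("AMT_PAYMENT", "sum"),
)
print(base)
```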

In [15]:
from home_credit.groupby import get_installments_payments_by_installment_and_version

base_data = get_installments_payments_by_installment_and_version(data)
display(base_data)

SK_ID_CURR,SK_ID_PREV,NUM_INSTALMENT_NUMBER,NUM_INSTALMENT_VERSION,DAYS_INSTALMENT,AMT_INSTALMENT,CNT_PAYMENT,DAYS_ENTRY_PAYMENT_START,DAYS_ENTRY_PAYMENT_END,AMT_PAYMENT
100001,1369693,1,1,1709,3951.000000,1,1715,1715,3951.000000
100001,1369693,2,1,1679,3951.000000,1,1715,1715,3951.000000
100001,1369693,3,1,1649,3951.000000,1,1660,1660,3951.000000
100001,1369693,4,2,1619,17397.900391,1,1628,1628,17397.900391
100001,1851984,2,1,2916,3982.050049,1,2916,2916,3982.050049
...,...,...,...,...,...,...,...,...,...
456255,2631384,23,3,96,27489.689453,1,98,98,27489.689453
456255,2631384,24,4,66,308277.312500,1,76,76,308277.312500
456255,2729207,1,1,469,11514.554688,1,482,482,11514.554688
456255,2729207,2,1,439,11514.554688,1,455,455,11514.554688


## Integrated version

This method performs the preceding processing chain (with built-in caching and persistence):

In [2]:
from home_credit.tables import InstallmentsPayments

base_data = InstallmentsPayments.clean_base()
display(base_data)
display(base_data[base_data.AMT_INSTALMENT == 0])

SK_ID_CURR,SK_ID_PREV,NUM_INSTALMENT_NUMBER,NUM_INSTALMENT_VERSION,DAYS_INSTALMENT,AMT_INSTALMENT,CNT_PAYMENT,DAYS_ENTRY_PAYMENT_START,DAYS_ENTRY_PAYMENT_END,AMT_PAYMENT
100001,1369693,1,1,1709,3951.000000,1,1715,1715,3951.000000
100001,1369693,2,1,1679,3951.000000,1,1715,1715,3951.000000
100001,1369693,3,1,1649,3951.000000,1,1660,1660,3951.000000
100001,1369693,4,2,1619,17397.900391,1,1628,1628,17397.900391
100001,1851984,2,1,2916,3982.050049,1,2916,2916,3982.050049
...,...,...,...,...,...,...,...,...,...
456255,2631384,23,3,96,27489.689453,1,98,98,27489.689453
456255,2631384,24,4,66,308277.312500,1,76,76,308277.312500
456255,2729207,1,1,469,11514.554688,1,482,482,11514.554688
456255,2729207,2,1,439,11514.554688,1,455,455,11514.554688


SK_ID_CURR,SK_ID_PREV,NUM_INSTALMENT_NUMBER,NUM_INSTALMENT_VERSION,DAYS_INSTALMENT,AMT_INSTALMENT,CNT_PAYMENT,DAYS_ENTRY_PAYMENT_START,DAYS_ENTRY_PAYMENT_END,AMT_PAYMENT
242609,2198450,13,4,291,0.0,1,291,291,0.0
289626,2028864,3,2,29,0.0,1,28,28,0.0


# Aggregation by installment

This is a stepping stone towards integrating the data at a higher level.

The data are aggregated following the relations established in **`assert.ipynb`**: the amounts of the installment versions are summed, and the payment data (amount and date) are those of the last version, which are identical to those of the previous versions:

In [2]:
from home_credit.tables import InstallmentsPayments

data_by_inst = InstallmentsPayments.by_installment()
display(data_by_inst)

SK_ID_CURR,SK_ID_PREV,NUM_INSTALMENT_NUMBER,V_MIN,V_MAX,V_COUNT,AMT_INSTALMENT,AMT_PAYMENT,DAYS_INSTALMENT_START,DAYS_INSTALMENT_END,CNT_PAYMENT,DAYS_ENTRY_PAYMENT_START,DAYS_ENTRY_PAYMENT_END
100001,1369693,1,1,1,1,3951.000000,3951.000000,1709,1709,1,1715,1715
100001,1369693,2,1,1,1,3951.000000,3951.000000,1679,1679,1,1715,1715
100001,1369693,3,1,1,1,3951.000000,3951.000000,1649,1649,1,1660,1660
100001,1369693,4,2,2,1,17397.900391,17397.900391,1619,1619,1,1628,1628
100001,1851984,2,1,1,1,3982.050049,3982.050049,2916,2916,1,2916,2916
...,...,...,...,...,...,...,...,...,...,...,...,...
456255,2631384,23,3,3,1,27489.689453,27489.689453,96,96,1,98,98
456255,2631384,24,4,4,1,308277.312500,308277.312500,66,66,1,76,76
456255,2729207,1,1,1,1,11514.554688,11514.554688,469,469,1,482,482
456255,2729207,2,1,1,1,11514.554688,11514.554688,439,439,1,455,455


# Feature engineering

## Principal Outstanding

`AMT_OUTSTANDING` (_Principal Outstanding_) is the difference between the installment amount and the amount actually paid by the client.

There are 3,250 outstanding cases, i.e. fewer than 3 per 10,000 installments.

In [3]:
def add_and_check_outstanding(data, tol_pos=0.1, tol_neg=0.5):
    data["AMT_OUTSTANDING"] = round(data.AMT_INSTALMENT - data.AMT_PAYMENT, 2)
    
    pos_outstanding = data[data.AMT_OUTSTANDING >= tol_pos]
    neg_outstanding = data[data.AMT_OUTSTANDING <= -tol_neg]
    
    n_samples = data.shape[0]
    n_pos_outstanding = pos_outstanding.shape[0]
    n_neg_outstanding = neg_outstanding.shape[0]

    print(f"% of outstanding installments > 0: {100*n_pos_outstanding/n_samples:.2f} %")
    print(f"% of outstanding installments < 0: {100*n_neg_outstanding/n_samples:.2f} %")

    display(pos_outstanding)
    display(neg_outstanding)

add_and_check_outstanding(data_by_inst)

% of outstanding installments > 0: 0.03 %
% of outstanding installments < 0: 0.00 %


SK_ID_CURR,SK_ID_PREV,NUM_INSTALMENT_NUMBER,V_MIN,V_MAX,V_COUNT,AMT_INSTALMENT,AMT_PAYMENT,DAYS_INSTALMENT_START,DAYS_INSTALMENT_END,CNT_PAYMENT,DAYS_ENTRY_PAYMENT_START,DAYS_ENTRY_PAYMENT_END,AMT_OUTSTANDING
100104,2697044,10,1,1,1,2917.530029,2892.330078,2290,2290,1,2299,2299,25.200001
100149,2523334,12,1,1,1,7669.484863,7668.944824,29,29,2,67,19,0.540000
100430,1096316,5,1,1,1,6632.640137,510.075012,4,4,1,15,15,6122.560059
100784,1925191,23,2,2,1,38727.046875,24548.445312,15,15,3,6,3,14178.599609
100807,1548219,22,1,1,1,48913.199219,35221.230469,10,10,1,20,20,13691.969727
...,...,...,...,...,...,...,...,...,...,...,...,...,...
456112,2073486,8,1,1,1,13312.214844,379.079987,4,4,1,29,29,12933.139648
456141,2481967,8,1,1,1,5969.294922,5339.609863,9,9,1,15,15,629.690002
456183,2391408,23,1,1,1,8606.700195,552.599976,4,4,1,36,36,8054.100098
456199,1843841,7,1,1,1,15270.299805,221.625000,14,14,1,42,42,15048.679688


SK_ID_CURR,SK_ID_PREV,NUM_INSTALMENT_NUMBER,V_MIN,V_MAX,V_COUNT,AMT_INSTALMENT,AMT_PAYMENT,DAYS_INSTALMENT_START,DAYS_INSTALMENT_END,CNT_PAYMENT,DAYS_ENTRY_PAYMENT_START,DAYS_ENTRY_PAYMENT_END,AMT_OUTSTANDING
255002,1783368,137,0,0,1,9856.980469,11647.980469,597,597,5,613,596,-1791.0


# Aggregation by month

Monthly aggregation aligns the information from **`installments_payments`** with the other level-3 tables, which are organized by month through the `MONTHS_BALANCE` variable.

Here, several monthly aggregations are possible, depending on whether we are interested in the installments or in their payment.


The grouping is done on a `MONTHS_BALANCE` pivot variable derived from `DAYS_INSTALMENT`.

What follows is old.

In this summary, there is little point in keeping the identity of the installments and their versions, or even the exact days. However, it is useful to have an idea of the lag in days between the installment and its payment, as well as ...

...

In [28]:
# Group by client, loan and month (index levels and columns can be mixed in `by`)
grouped = data.groupby(by=["SK_ID_CURR", "SK_ID_PREV", "MONTHS_BALANCE"])

aggregated = grouped.agg({
    "TARGET": ["count", "first"],
    "NUM_INSTALMENT_VERSION": ["min", "max"],
    "DAYS_INSTALMENT": ["min", "max"],
    "DAYS_ENTRY_PAYMENT": ["min", "max"],
    "AMT_INSTALMENT": "sum",
    "AMT_PAYMENT": "sum"
})

display(aggregated)

SK_ID_CURR,SK_ID_PREV,MONTHS_BALANCE,TARGET,TARGET,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_VERSION,DAYS_INSTALMENT,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
,,,count,first,min,max,min,max,min,max,sum,sum
100001,1369693,54,1,-1,2,2,1619,1619,1628,1628,17397.900391,1.739790e+04
100001,1369693,55,1,-1,1,1,1649,1649,1660,1660,3951.000000,3.951000e+03
100001,1369693,56,1,-1,1,1,1679,1679,1715,1715,3951.000000,3.951000e+03
100001,1369693,57,1,-1,1,1,1709,1709,1715,1715,3951.000000,3.951000e+03
100001,1851984,94,1,-1,1,1,2856,2856,2856,2856,3980.925049,3.980925e+03
...,...,...,...,...,...,...,...,...,...,...,...,...
456255,2631384,24,1,0,3,3,726,726,734,734,27489.689453,2.748969e+04
456255,2631384,25,2,0,1,2,756,756,768,768,669251.625000,1.338503e+06
456255,2729207,14,1,0,2,2,409,409,435,435,42754.230469,4.275423e+04
456255,2729207,15,1,0,1,1,439,439,455,455,11514.554688,1.151455e+04


In [29]:
display(aggregated[aggregated[("TARGET", "count")] > 14])

SK_ID_CURR,SK_ID_PREV,MONTHS_BALANCE,TARGET,TARGET,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_VERSION,DAYS_INSTALMENT,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
,,,count,first,min,max,min,max,min,max,sum,sum
110288,2335913,14,15,0,0,0,396,425,396,425,1345725.0,1345725.0
192083,1746731,6,15,1,0,0,153,182,153,182,68604.26,60750.0
223443,1968852,6,16,1,0,0,153,181,153,183,150181.2,139500.0
308996,1852274,5,16,1,0,0,122,151,122,164,80419.99,47739.2
308996,1852274,9,17,1,0,0,246,273,246,287,112831.9,105685.1
308996,1852274,12,15,1,0,0,337,365,337,379,99530.73,90133.78
388717,2424398,2,16,0,0,0,32,60,32,60,745826.1,745826.1
388717,2424398,21,16,0,0,0,611,639,611,640,143059.4,143059.4
430301,1610767,1,21,-1,0,0,2,30,2,32,239374.0,215633.4
430301,1610767,2,17,-1,0,0,37,60,37,61,220833.9,220833.9


In [30]:
display(aggregated[
    aggregated[("NUM_INSTALMENT_VERSION", "min")]
    != aggregated[("NUM_INSTALMENT_VERSION", "max")]
])

SK_ID_CURR,SK_ID_PREV,MONTHS_BALANCE,TARGET,TARGET,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_VERSION,DAYS_INSTALMENT,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
,,,count,first,min,max,min,max,min,max,sum,sum
100012,2480304,16,2,0,1,2,477,477,487,487,58687.246094,1.173745e+05
100019,1620327,25,2,0,1,2,744,744,750,750,44369.281250,8.873856e+04
100025,2835799,30,2,0,1,2,888,911,888,917,259605.859375,2.596059e+05
100033,2037234,17,3,0,1,13,488,502,495,495,6055.649902,6.055650e+03
100033,2037234,19,5,0,1,11,548,578,553,589,16949.789062,1.140687e+04
...,...,...,...,...,...,...,...,...,...,...,...,...
456225,2508396,22,2,1,1,2,642,655,588,625,19250.910156,1.925091e+04
456227,1772808,16,2,0,1,2,462,462,465,465,50618.523438,1.012370e+05
456234,2414642,29,3,0,1,2,862,878,854,929,11258.280273,6.304140e+03
456240,1508947,21,2,0,1,2,612,612,623,623,35264.339844,7.052868e+04


In [32]:
_data = data.reset_index()
display(_data[
    (_data.SK_ID_CURR == 100033) &
    (_data.SK_ID_PREV == 2037234) &
    (_data.MONTHS_BALANCE == 17)])

CLEAN_INSTALLMENTS_PAYMENTS,SK_ID_CURR,SK_ID_PREV,NUM_INSTALMENT_NUMBER,MONTHS_BALANCE,TARGET,NUM_INSTALMENT_VERSION,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
1032,100033,2037234,12,17,0,1,488,495,5540.310059,5540.310059
1043,100033,2037234,111,17,0,12,502,495,257.670013,257.670013
1044,100033,2037234,112,17,0,13,496,495,257.670013,257.670013


# Dense variation profiles using RLE encoding

Following the **`bureau_balance`** case, we start with this operation, which has the advantage of making the dynamic _patterns_ of each variable apparent, whether it is numerical or categorical.

For the **`installments_payments`** table, the work is done on a previously derived table that provides the monthly summary, based on a `MONTHS_BALANCE` variable derived from `DAYS_INSTALMENT`.

## Tracking period

The tracking period of a loan may be the full 96 months, a sub-period, or even fragmented sub-periods.

Our first function, based on the `jumps_rle` function, encodes in RLE the jumps between consecutive tracking months.
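To make the encoding concrete, here is a hypothetical reimplementation. The project's `jumps_rle` is not shown in this notebook, so the function name `jumps_rle_sketch` and the `origin=-1` reference month are assumptions, inferred from the output displayed further down (months 8 to 10 appear as `[[9, 1], [1, 2]]`). Each pair reads `[jump, run length]`.

```python
import numpy as np


def jumps_rle_sketch(months, origin=-1):
    """Run-length encode the gaps between sorted tracking months.

    `origin` is a hypothetical reference month placed before the first one,
    so that the first jump encodes where tracking starts.
    """
    months = np.sort(np.asarray(months))
    if months.size == 0:
        return []
    # Gap between each month and the previous one (or the origin).
    jumps = np.diff(months, prepend=origin)
    # Positions where a new run of identical jumps begins.
    starts = np.flatnonzero(np.r_[True, jumps[1:] != jumps[:-1]])
    # Length of each run.
    lengths = np.diff(np.r_[starts, len(jumps)])
    return [[int(j), int(n)] for j, n in zip(jumps[starts], lengths)]


print(jumps_rle_sketch([8, 9, 10]))
print(jumps_rle_sketch([1, 2, 3, 6, 7]))
```

A contiguous period yields two pairs (the starting offset, then a run of jumps of 1); any additional pair signals a gap in the tracking.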

### Tracking period per loan

We produce the table, indexed by `SK_ID_PREV`, of the loans' tracking periods, with the first month, the last month and the number of tracked months, together with the RLE representation of the sub-periods (in most cases, a single one).

In [None]:
from home_credit.tables import InstallmentsPayments

# TODO: prerequisite: the monthly summary, which does not exist in the base table

tracking = InstallmentsPayments.rle_loan_tracking_period()
display(tracking)
# ok tracking.info()

CLEAN_POS_CASH_BALANCE,MONTHS_BALANCE,MONTHS_BALANCE,MONTHS_BALANCE,MONTHS_BALANCE
Unnamed: 0_level_1,min,max,count,jumps_rle
SK_ID_PREV,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1000001,8,10,3,"[[9, 1], [1, 2]]"
1000002,50,54,5,"[[51, 1], [1, 4]]"
1000003,1,4,4,"[[2, 1], [1, 3]]"
1000004,22,29,8,"[[23, 1], [1, 7]]"
1000005,46,56,11,"[[47, 1], [1, 10]]"
...,...,...,...,...
2843494,24,26,3,"[[25, 1], [1, 2]]"
2843495,9,16,8,"[[10, 1], [1, 7]]"
2843497,1,21,21,"[[2, 1], [1, 20]]"
2843498,42,48,7,"[[43, 1], [1, 6]]"


# Old - Aggregation by current application and balance tracking month (`SK_ID_CURR`, `MONTHS_BALANCE`)

We break down the steps to explain, justify, and facilitate understanding of the operation performed. However, the last section invokes the built-in function `groupby_curr_months` which performs all the steps.

## Aggregation

We perform a summary with the `(SK_ID_CURR, MONTHS_BALANCE)` pair as the pivot.

Aggregation strategies:
* `NUM_INSTALMENT_VERSION` and `NUM_INSTALMENT_NUMBER`
    - They are purely informative here and are fully reproduced in the other pivoted table.
    - The **maximum** will suffice (the last version and the last installment number).
* `DAYS_INSTALMENT` and `DAYS_ENTRY_PAYMENT`
    - The significant granularity in our pivot is the month.
    - One can consider the last day or the median day.
    - We choose the **median**.
* `AMT_INSTALMENT`, `AMT_PAYMENT`, `n_PREV`: the **sum**.

The $13,605,401$ records are reduced to $9,477,481$, resulting in a compression rate of approximately $30\%$.

**Time:** 17 s.

In [6]:
# Group the data by 'SK_ID_CURR' and 'MONTHS_BALANCE'
# and aggregate the columns
grouped_data = (
    data.drop(columns="SK_ID_PREV")
    .groupby(by=["SK_ID_CURR", "MONTHS_BALANCE"])
    .agg({
        "n_PREV": "sum",
        "NUM_INSTALMENT_VERSION": "max",
        "NUM_INSTALMENT_NUMBER": "max",
        "DAYS_INSTALMENT": "median",
        "DAYS_ENTRY_PAYMENT": "median",
        "AMT_INSTALMENT": "sum",
        "AMT_PAYMENT": "sum"
    })
)

# Reset the 'MONTHS_BALANCE' as a column
grouped_data.reset_index(level=1, inplace=True)

# Display the grouped data
display(grouped_data)

RAW_INSTALLMENTS_PAYMENTS,MONTHS_BALANCE,n_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
SK_ID_CURR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
100001,54,1,2.0,4,-1619.0,-1628.0,17397.900,17397.900
100001,55,1,1.0,3,-1649.0,-1660.0,3951.000,3951.000
100001,56,1,1.0,2,-1679.0,-1715.0,3951.000,3951.000
100001,57,1,1.0,1,-1709.0,-1715.0,3951.000,3951.000
100001,94,1,1.0,4,-2856.0,-2856.0,3980.925,3980.925
...,...,...,...,...,...,...,...,...
456255,28,1,1.0,5,-840.0,-847.0,11090.835,11090.835
456255,29,1,1.0,4,-870.0,-879.0,11090.835,11090.835
456255,30,1,1.0,3,-900.0,-910.0,11090.835,11090.835
456255,31,1,1.0,2,-930.0,-938.0,11090.835,11090.835


## Integrated version (`groupby_curr_months`)

This code uses the `groupby_curr_months` function to aggregate the data based on specified aggregation rules and then checks if the results match the reference data for each column.

**Time:** 23 s.

In [9]:
from home_credit.merge import groupby_curr_months, ip_months_balance_builder

# Store a reference to the previous grouped data
ref_grouped_data = grouped_data

# Define aggregation rules
agg_dict = {
    "NUM_INSTALMENT_VERSION": "max",
    "NUM_INSTALMENT_NUMBER": "max",
    "DAYS_INSTALMENT": "median",
    "DAYS_ENTRY_PAYMENT": "median",
    "AMT_INSTALMENT": "sum",
    "AMT_PAYMENT": "sum"
}

# Group data using the integrated function
grouped_data = groupby_curr_months(
    table_name="installments_payments",
    months_balance_builder=ip_months_balance_builder,
    agg_dict=agg_dict,
    include_uniques=True
)

# Display the grouped data
display(grouped_data)

# Check if the results match the reference data
print("Check results identity by column :")
display((
    (grouped_data == ref_grouped_data)
    # NaN == NaN is False, so also accept cells that are null on both sides
    | (grouped_data.isnull() & ref_grouped_data.isnull())
).all())

RAW_INSTALLMENTS_PAYMENTS,MONTHS_BALANCE,n_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
SK_ID_CURR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
100001,54,1,2.0,4,-1619.0,-1628.0,17397.900,17397.900
100001,55,1,1.0,3,-1649.0,-1660.0,3951.000,3951.000
100001,56,1,1.0,2,-1679.0,-1715.0,3951.000,3951.000
100001,57,1,1.0,1,-1709.0,-1715.0,3951.000,3951.000
100001,94,1,1.0,4,-2856.0,-2856.0,3980.925,3980.925
...,...,...,...,...,...,...,...,...
456255,28,1,1.0,5,-840.0,-847.0,11090.835,11090.835
456255,29,1,1.0,4,-870.0,-879.0,11090.835,11090.835
456255,30,1,1.0,3,-900.0,-910.0,11090.835,11090.835
456255,31,1,1.0,2,-930.0,-938.0,11090.835,11090.835


Results identity by column :


RAW_INSTALLMENTS_PAYMENTS
MONTHS_BALANCE            True
n_PREV                    True
NUM_INSTALMENT_VERSION    True
NUM_INSTALMENT_NUMBER     True
DAYS_INSTALMENT           True
DAYS_ENTRY_PAYMENT        True
AMT_INSTALMENT            True
AMT_PAYMENT               True
dtype: bool
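
The column-wise identity check above has to work around the fact that `NaN == NaN` evaluates to `False`. The toy example below, built on made-up data, shows why a plain `==` comparison reports a mismatch on identical frames and how the extra null mask corrects it.

```python
import numpy as np
import pandas as pd

# Two identical toy frames containing a missing value
a = pd.DataFrame({"x": [1.0, np.nan], "y": [3.0, 4.0]})
b = a.copy()

# Plain equality: the NaN cell compares unequal to itself
naive = (a == b).all()

# NaN-aware equality: cells that are null on both sides count as equal
nan_aware = ((a == b) | (a.isnull() & b.isnull())).all()

print(naive)      # x is False, y is True
print(nan_aware)  # x and y are both True
```

When a single overall answer is enough, `a.equals(b)` also treats NaNs in the same location as equal, but it returns one boolean rather than a per-column result and additionally requires matching dtypes.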

# RLE Aggregation of Monthly Variations

This is the second level of aggregation and the level where the challenge of information loss comes into play.

A naive aggregation would reduce each longitudinal series to one or more summary statistics, at the cost of a significant loss of information. It is highly likely that early signs of failure show up as localized variations, and a global statistic would wipe out such local signals.

We have chosen an approach that ensures lossless compression of information, inspired by the classical Run Length Encoding (RLE) compression technique. This allows us to retain the details of the 'signal' while aggregating the data.

Starting from this encoded form, nothing prevents us from deriving any statistical summary later on, without paying the aggregation cost again.
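
To make the idea concrete, here is a minimal sketch of run-length encoding on a single series. It is not the `data_rle_reduction` implementation; the helper names `rle_encode` and `rle_decode` are hypothetical. It shows that the encoding can be reversed exactly, and that a summary such as the sum can be computed directly from the encoded pairs.

```python
from itertools import groupby


def rle_encode(values):
    """Encode a sequence as a tuple of (value, run_length) pairs."""
    return tuple((v, sum(1 for _ in run)) for v, run in groupby(values))


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [v for v, n in pairs for _ in range(n)]


# Monthly amounts for one application (toy values)
amounts = [17397.9, 3951.0, 3951.0, 3951.0, 3980.925]

encoded = rle_encode(amounts)
print(encoded)  # ((17397.9, 1), (3951.0, 3), (3980.925, 1))

# Lossless: decoding restores the original series
assert rle_decode(encoded) == amounts

# Statistics can be derived from the encoded form without decoding
total = sum(v * n for v, n in encoded)
count = sum(n for _, n in encoded)
mean = total / count
```

Note that consecutive `NaN` values would not be merged into one run by this sketch, since `NaN` does not compare equal to itself.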

## Basic Aggregation

This is what we dealt with in the previous section: this is where we start.

**Time:** 23 s.

In [14]:
from home_credit.merge import groupby_curr_months, ip_months_balance_builder

# Define aggregation rules
agg_dict = {
    "NUM_INSTALMENT_VERSION": "max",
    "NUM_INSTALMENT_NUMBER": "max",
    "DAYS_INSTALMENT": "median",
    "DAYS_ENTRY_PAYMENT": "median",
    "AMT_INSTALMENT": "sum",
    "AMT_PAYMENT": "sum"
}

# Group data by current application and balance tracking month
cm_data = groupby_curr_months(
    table_name="installments_payments",
    months_balance_builder=ip_months_balance_builder,
    agg_dict=agg_dict,
    include_uniques=True
)

# Display the grouped data
display(cm_data)

RAW_INSTALLMENTS_PAYMENTS,MONTHS_BALANCE,n_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
SK_ID_CURR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
100001,54,1,2.0,4,-1619.0,-1628.0,17397.900,17397.900
100001,55,1,1.0,3,-1649.0,-1660.0,3951.000,3951.000
100001,56,1,1.0,2,-1679.0,-1715.0,3951.000,3951.000
100001,57,1,1.0,1,-1709.0,-1715.0,3951.000,3951.000
100001,94,1,1.0,4,-2856.0,-2856.0,3980.925,3980.925
...,...,...,...,...,...,...,...,...
456255,28,1,1.0,5,-840.0,-847.0,11090.835,11090.835
456255,29,1,1.0,4,-870.0,-879.0,11090.835,11090.835
456255,30,1,1.0,3,-900.0,-910.0,11090.835,11090.835
456255,31,1,1.0,2,-930.0,-938.0,11090.835,11090.835


## Sorting Data by `SK_ID_CURR`, `MONTHS_BALANCE`

We begin by sorting the data by current loan application and then by month of balance tracking.

**Time:** 3.6 s.

In [19]:
# Create a copy of the data to sort
sorted_data = cm_data.copy()

# Move the index ('SK_ID_CURR') back into a column so it can be used as a sort key
sorted_data.reset_index(inplace=True)

# Sort the data by 'SK_ID_CURR' and 'MONTHS_BALANCE'
sorted_data.sort_values(by=["SK_ID_CURR", "MONTHS_BALANCE"], inplace=True)

# Set 'SK_ID_CURR' as the new index
sorted_data.set_index("SK_ID_CURR", inplace=True)

# Display the sorted data
display(sorted_data)

RAW_INSTALLMENTS_PAYMENTS,MONTHS_BALANCE,n_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
SK_ID_CURR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
100001,54,1,2.0,4,-1619.0,-1628.0,17397.900,17397.900
100001,55,1,1.0,3,-1649.0,-1660.0,3951.000,3951.000
100001,56,1,1.0,2,-1679.0,-1715.0,3951.000,3951.000
100001,57,1,1.0,1,-1709.0,-1715.0,3951.000,3951.000
100001,94,1,1.0,4,-2856.0,-2856.0,3980.925,3980.925
...,...,...,...,...,...,...,...,...
456255,28,1,1.0,5,-840.0,-847.0,11090.835,11090.835
456255,29,1,1.0,4,-870.0,-879.0,11090.835,11090.835
456255,30,1,1.0,3,-900.0,-910.0,11090.835,11090.835
456255,31,1,1.0,2,-930.0,-938.0,11090.835,11090.835


## Grouping by Current Loan Application (`SK_ID_CURR`)

In this step, we are forming groups based on the current loan application (`SK_ID_CURR`) as a preliminary step to aggregation.

**Time:** 0.1 s.

In [16]:
# Form groups based on the current loan application ('SK_ID_CURR')
grouped_data = sorted_data.groupby(by="SK_ID_CURR")

## RLE Reduction of Groups

We apply the RLE reduction to every group using the `data_rle_reduction` function from `pepper.feat_eng`. This function has been optimized, but the process still takes a few minutes, giving you a chance to grab a coffee.

**Time:** 5 min 4 s.

In [17]:
from pepper.feat_eng import data_rle_reduction

# Apply RLE reduction to the grouped data
rle_data = grouped_data.apply(data_rle_reduction)

# Set column names to match the original sorted data
rle_data.columns = list(sorted_data.columns)

# Display the resulting RLE reduced data
display(rle_data)

Unnamed: 0_level_0,MONTHS_BALANCE,n_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
SK_ID_CURR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
100001,"((54, 1), (55, 1), (56, 1), (57, 1), (94, 1), ...","((1, 7),)","((2.0, 1), (1.0, 6))","((4, 1), (3, 1), (2, 1), (1, 1), (4, 1), (3, 1...","((-1619.0, 1), (-1649.0, 1), (-1679.0, 1), (-1...","((-1628.0, 1), (-1660.0, 1), (-1715.0, 2), (-2...","((17397.9, 1), (3951.0, 3), (3980.925, 1), (39...","((17397.9, 1), (3951.0, 3), (3980.925, 1), (39..."
100002,"((1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1...","((1, 19),)","((2.0, 1), (1.0, 18))","((19, 1), (18, 1), (17, 1), (16, 1), (15, 1), ...","((-25.0, 1), (-55.0, 1), (-85.0, 1), (-115.0, ...","((-49.0, 1), (-67.0, 1), (-99.0, 1), (-133.0, ...","((53093.745, 1), (9251.775, 18))","((53093.745, 1), (9251.775, 18))"
100003,"((18, 1), (19, 1), (20, 1), (21, 1), (22, 1), ...","((1, 4), (2, 3), (1, 6), (2, 1), (1, 7))","((2.0, 1), (1.0, 20))","((7, 1), (6, 1), (5, 1), (4, 1), (6, 1), (5, 1...","((-536.0, 1), (-566.0, 1), (-596.0, 1), (-626....","((-544.0, 1), (-570.0, 1), (-600.0, 1), (-629....","((560835.36, 1), (98356.995, 3), (162632.61, 1...","((560835.36, 1), (98356.995, 3), (162632.61, 1..."
100004,"((24, 1), (25, 1), (26, 1))","((1, 3),)","((2.0, 1), (1.0, 2))","((3, 1), (2, 1), (1, 1))","((-724.0, 1), (-754.0, 1), (-784.0, 1))","((-727.0, 1), (-763.0, 1), (-795.0, 1))","((10573.965, 1), (5357.25, 2))","((10573.965, 1), (5357.25, 2))"
100005,"((16, 1), (17, 1), (18, 1), (19, 1), (20, 1), ...","((1, 9),)","((2.0, 1), (1.0, 8))","((9, 1), (8, 1), (7, 1), (6, 1), (5, 1), (4, 1...","((-466.0, 1), (-496.0, 1), (-526.0, 1), (-556....","((-470.0, 1), (-515.0, 1), (-555.0, 1), (-585....","((17656.245, 1), (4813.2, 8))","((17656.245, 1), (4813.2, 8))"
...,...,...,...,...,...,...,...,...
456251,"((1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1...","((1, 7),)","((2.0, 1), (1.0, 6))","((7, 1), (6, 1), (5, 1), (4, 1), (3, 1), (2, 1...","((-30.0, 1), (-60.0, 1), (-90.0, 1), (-120.0, ...","((-38.0, 1), (-101.0, 1), (-136.0, 1), (-166.0...","((12815.01, 1), (6605.91, 6))","((12815.01, 1), (6605.91, 6))"
456252,"((77, 1), (78, 1), (79, 1), (80, 1), (81, 1), ...","((1, 6),)","((1.0, 6),)","((6, 1), (5, 1), (4, 1), (3, 1), (2, 1), (1, 1))","((-2316.0, 1), (-2346.0, 1), (-2376.0, 1), (-2...","((-2327.0, 1), (-2349.0, 1), (-2376.0, 1), (-2...","((10046.88, 1), (10074.465, 5))","((10046.88, 1), (10074.465, 5))"
456253,"((57, 1), (58, 1), (59, 1), (60, 1), (61, 1), ...","((1, 7), (2, 1), (1, 5))","((1.0, 13),)","((6, 1), (5, 1), (4, 1), (3, 1), (2, 1), (1, 1...","((-1716.0, 1), (-1746.0, 1), (-1776.0, 1), (-1...","((-1738.0, 1), (-1771.0, 1), (-1792.0, 1), (-1...","((5575.185, 1), (5567.715, 5), (3971.88, 1), (...","((5575.185, 1), (5567.715, 5), (3971.88, 1), (..."
456254,"((1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1...","((2, 9), (1, 1))","((1.0, 10),)","((10, 1), (9, 1), (8, 1), (7, 1), (6, 1), (5, ...","((-14.0, 1), (-44.0, 1), (-74.0, 1), (-104.0, ...","((-33.0, 1), (-63.0, 1), (-92.5, 1), (-121.0, ...","((21362.265, 9), (2296.44, 1))","((21362.265, 9), (2296.44, 1))"
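
To illustrate the shape of the table above, the following sketch applies a column-wise run-length encoding to each `SK_ID_CURR` group of a small made-up frame. It is only an approximation of what `data_rle_reduction` produces, written under the assumption that each cell holds a tuple of `(value, run_length)` pairs, as the displayed output suggests. The `rle_encode` helper and the `toy` data are hypothetical.

```python
from itertools import groupby

import pandas as pd


def rle_encode(values):
    """Encode a sequence as a tuple of (value, run_length) pairs."""
    return tuple((v, sum(1 for _ in run)) for v, run in groupby(values))


# Toy monthly data, already sorted by application and month
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [100004, 100004, 100004, 100005, 100005],
        "MONTHS_BALANCE": [24, 25, 26, 16, 17],
        "n_PREV": [1, 1, 1, 1, 1],
        "AMT_PAYMENT": [10573.965, 5357.25, 5357.25, 17656.245, 4813.2],
    }
).set_index("SK_ID_CURR")

# One output row per application, one RLE tuple per column
keys, records = [], []
for key, group in toy.groupby(level="SK_ID_CURR"):
    keys.append(key)
    records.append(
        {col: rle_encode(group[col].tolist()) for col in group.columns}
    )

toy_rle = pd.DataFrame(records, index=pd.Index(keys, name="SK_ID_CURR"))
print(toy_rle)
```

Because runs depend on row order, the prior sort by `SK_ID_CURR` and `MONTHS_BALANCE` is what makes each encoded cell readable as a chronological sequence.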


## Saving the Reduced Table

The RLE reduction operation is resource-intensive. Therefore, this is the moment to save the transformed table for future use.

The table is saved to the `tmp/agg_merge/` directory using the default combination of `engine=pyarrow` and `compression=gzip`, under the file name `installments_payments_rle.pqt`.

**Time:** 4 min 13 s.

In [21]:
from pepper.persist import all_to_parquet
from pepper.env import get_tmp_dir
import os

# Define the target directory for saving the data
target_dir = os.path.join(get_tmp_dir(), "agg_merge/")
table_name = "installments_payments"

# Use the all_to_parquet function to save the RLE reduced data to Parquet format
# The data is stored with the name 'installments_payments_rle' in the specified target directory
all_to_parquet({f"{table_name}_rle": rle_data}, target_dir)
