# **Grouped `installments_payments`**

# Data Loading and Preprocessing

**Time:** 1.4 s for 13,605,401 entries.

In [2]:
from home_credit.load import get_table
from pepper.utils import display_key_val

# Load the 'installments_payments' table
data = get_table("installments_payments").copy()

# Insert the aggregation counter
data.insert(0, "n_PREV", 1)

# Calculate MONTHS_BALANCE based on DAYS_INSTALMENT and insert it
gregorian_month = 365.2425 / 12
data.insert(0, "MONTHS_BALANCE", -(data.DAYS_INSTALMENT // gregorian_month).astype(int))

# Display the number of samples in the dataset
display_key_val("number of samples", data.shape[0])

# Display the dataset
display(data)

number of samples: 13 605 401


RAW_INSTALLMENTS_PAYMENTS,MONTHS_BALANCE,n_PREV,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
0,39,1,1054186,161674,1.0,6,-1180.0,-1187.0,6948.360,6948.360
1,71,1,1330831,151639,0.0,34,-2156.0,-2156.0,1716.525,1716.525
2,3,1,2085231,193053,2.0,1,-63.0,-63.0,25425.000,25425.000
3,80,1,2452527,199697,1.0,3,-2418.0,-2426.0,24350.130,24350.130
4,46,1,2714724,167756,1.0,2,-1383.0,-1366.0,2165.040,2160.585
...,...,...,...,...,...,...,...,...,...,...
13605396,54,1,2186857,428057,0.0,66,-1624.0,,67.500,
13605397,51,1,1310347,414406,0.0,47,-1539.0,,67.500,
13605398,1,1,1308766,402199,0.0,43,-7.0,,43737.435,
13605399,66,1,1062206,409297,0.0,43,-1986.0,,67.500,
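
The month counter follows from floor division on negative day offsets. A minimal sketch on four `DAYS_INSTALMENT` values copied from the rows displayed above (the variable names are illustrative only):

```python
import pandas as pd

# Average length of a Gregorian month, in days
gregorian_month = 365.2425 / 12

# A few DAYS_INSTALMENT values copied from the displayed rows
days_instalment = pd.Series([-1180.0, -2156.0, -63.0, -7.0])

# Floor division rounds toward minus infinity, so -7 days maps to -1;
# negating the result gives a positive "months ago" counter that starts at 1
months_balance = -(days_instalment // gregorian_month).astype(int)

print(months_balance.tolist())
```

The result reproduces the `MONTHS_BALANCE` values 39, 71, 3 and 1 shown for those rows.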


# Aggregation by current application and balance tracking month (`SK_ID_CURR`, `MONTHS_BALANCE`)

We break the operation down into steps to explain and justify each of them. The last subsection then calls the project function `groupby_curr_months`, which performs all the steps at once.

## Separation *uniques* vs *multis*

There is no need to perform this separation here. First, the *uniques* (a `SK_ID_CURR` with only one `SK_ID_PREV`) represent only about $10\%$ of the cases. More importantly, it would be a logical error: even when a `SK_ID_CURR` is associated with a single `SK_ID_PREV`, exploratory analysis has shown that several due dates can fall in the same month.

**Therefore, for this table, the aggregation must cover all records.**
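
A toy illustration of why the *uniques* cannot simply be passed through. The rows are invented for the example, not taken from the dataset: a single `SK_ID_CURR` with a single `SK_ID_PREV` still yields two records in the same month.

```python
import pandas as pd

# Invented rows: one current application, one previous application
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1],
    "SK_ID_PREV": [10, 10, 10],
    "MONTHS_BALANCE": [3, 3, 4],
    "AMT_INSTALMENT": [100.0, 50.0, 100.0],
})

# Count the records per (SK_ID_CURR, MONTHS_BALANCE) pair
rows_per_month = toy.groupby(["SK_ID_CURR", "MONTHS_BALANCE"]).size()

# Pairs with more than one record still need to be aggregated
needs_aggregation = rows_per_month[rows_per_month > 1]
print(needs_aggregation)
```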

## Key Uniqueness

We verify that a `SK_ID_PREV` never maps to more than one `SK_ID_CURR`.

The problem is therefore multi-indexed only in appearance: the `SK_ID_CURR` key alone is sufficient to separate the groups.

The report below counts the `SK_ID_PREV` per `SK_ID_CURR` and vice versa.

**Time:** 2.1 s.

In [11]:
from home_credit.merge import get_unique_and_multi_index, curr_prev_uniqueness_report

# Get unique and multi-indexes for the specified table and columns
indexes = get_unique_and_multi_index("installments_payments", "SK_ID_PREV", "SK_ID_CURR")

# Generate a report on the uniqueness of SK_ID_CURR and SK_ID_PREV
curr_prev_uniqueness_report(*indexes)

number of unique (curr, prev)              : 997 752
number of curr with more than 1 prev       : 903 110
number of curr with one prev               : 94 642
number of curr with more than 1 prev (in %): 90.5
number of prev with more than 1 curr       : 0
number of prev with one curr               : 997 752
number of prev with more than 1 curr (in %): 0.0
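
For readers without access to `home_credit.merge`, similar counts can be obtained with plain pandas. This is a sketch on invented keys. It does not reproduce the internals of `get_unique_and_multi_index` or `curr_prev_uniqueness_report`, which are not shown here.

```python
import pandas as pd

# Invented key pairs: curr 1 has two prevs, curr 2 has one
toy = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 11, 20],
    "SK_ID_CURR": [1, 1, 1, 2],
})

# Distinct (curr, prev) pairs
pairs = toy[["SK_ID_CURR", "SK_ID_PREV"]].drop_duplicates()

# Number of distinct prevs per curr, and of distinct currs per prev
prev_per_curr = pairs.groupby("SK_ID_CURR")["SK_ID_PREV"].nunique()
curr_per_prev = pairs.groupby("SK_ID_PREV")["SK_ID_CURR"].nunique()

n_curr_multi = int((prev_per_curr > 1).sum())
n_prev_multi = int((curr_per_prev > 1).sum())
print(n_curr_multi, n_prev_multi)
```

A zero count of `SK_ID_PREV` with several `SK_ID_CURR` is the property that allows grouping on `SK_ID_CURR` alone.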


## Aggregation

We perform a summary with the `(SK_ID_CURR, MONTHS_BALANCE)` pair as the pivot.

Aggregation strategies:
* `NUM_INSTALMENT_VERSION` and `NUM_INSTALMENT_NUMBER`
    - They are purely informative here and are fully reproduced in the other pivoted table.
    - The **maximum** will suffice (the last version and the last installment number).
* `DAYS_INSTALMENT` and `DAYS_ENTRY_PAYMENT`
    - The significant granularity in our pivot is the month.
    - One can consider the last day or the median day.
    - We choose the **median**.
* `AMT_INSTALMENT`, `AMT_PAYMENT`, `n_PREV`: the **sum**.

The $13{,}605{,}401$ records are reduced to $9{,}477{,}481$, i.e. roughly $30\%$ fewer rows.
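
The quoted rate is the fraction of rows removed by the grouping, which can be checked from the two counts:

```python
# Row counts before and after grouping, as stated in the text
n_before = 13_605_401
n_after = 9_477_481

# Fraction of rows removed by the aggregation
reduction = 1 - n_after / n_before
print(f"{reduction:.1%}")
```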

**Time:** 17 s.

In [6]:
# Group the data by 'SK_ID_CURR' and 'MONTHS_BALANCE'
# and aggregate the columns
grouped_data = (
    data.drop(columns="SK_ID_PREV")
    .groupby(by=["SK_ID_CURR", "MONTHS_BALANCE"])
    .agg({
        "n_PREV": "sum",
        "NUM_INSTALMENT_VERSION": "max",
        "NUM_INSTALMENT_NUMBER": "max",
        "DAYS_INSTALMENT": "median",
        "DAYS_ENTRY_PAYMENT": "median",
        "AMT_INSTALMENT": "sum",
        "AMT_PAYMENT": "sum"
    })
)

# Move 'MONTHS_BALANCE' from the index back to a regular column
grouped_data.reset_index(level=1, inplace=True)

# Display the grouped data
display(grouped_data)

SK_ID_CURR,MONTHS_BALANCE,n_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
100001,54,1,2.0,4,-1619.0,-1628.0,17397.900,17397.900
100001,55,1,1.0,3,-1649.0,-1660.0,3951.000,3951.000
100001,56,1,1.0,2,-1679.0,-1715.0,3951.000,3951.000
100001,57,1,1.0,1,-1709.0,-1715.0,3951.000,3951.000
100001,94,1,1.0,4,-2856.0,-2856.0,3980.925,3980.925
...,...,...,...,...,...,...,...,...
456255,28,1,1.0,5,-840.0,-847.0,11090.835,11090.835
456255,29,1,1.0,4,-870.0,-879.0,11090.835,11090.835
456255,30,1,1.0,3,-900.0,-910.0,11090.835,11090.835
456255,31,1,1.0,2,-930.0,-938.0,11090.835,11090.835
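
To make the aggregation rules concrete, here is a toy version on invented rows in which two instalments of the same `SK_ID_CURR` fall in month 3. Only a subset of the columns is used, to keep the sketch short.

```python
import pandas as pd

# Invented rows: two instalments fall in month 3 for the same SK_ID_CURR
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1],
    "MONTHS_BALANCE": [3, 3, 4],
    "n_PREV": [1, 1, 1],
    "NUM_INSTALMENT_NUMBER": [5, 2, 4],
    "DAYS_INSTALMENT": [-70.0, -80.0, -100.0],
    "AMT_INSTALMENT": [100.0, 50.0, 100.0],
})

toy_grouped = toy.groupby(["SK_ID_CURR", "MONTHS_BALANCE"]).agg({
    "n_PREV": "sum",                 # how many records were merged
    "NUM_INSTALMENT_NUMBER": "max",  # keep the highest instalment number
    "DAYS_INSTALMENT": "median",     # representative day within the month
    "AMT_INSTALMENT": "sum",         # total amount due in the month
})
print(toy_grouped)
```

In the merged row for month 3, `n_PREV` records that two source rows were combined, which is why the counter is summed rather than dropped.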


## Integrated version (`groupby_curr_months`)

This code uses the `groupby_curr_months` function to aggregate the data based on specified aggregation rules and then checks if the results match the reference data for each column.

**Time:** 23 s.

In [9]:
from home_credit.merge import groupby_curr_months, ip_months_balance_builder

# Store a reference to the previous grouped data
ref_grouped_data = grouped_data

# Define aggregation rules
agg_dict = {
    "NUM_INSTALMENT_VERSION": "max",
    "NUM_INSTALMENT_NUMBER": "max",
    "DAYS_INSTALMENT": "median",
    "DAYS_ENTRY_PAYMENT": "median",
    "AMT_INSTALMENT": "sum",
    "AMT_PAYMENT": "sum"
}

# Group data using the integrated function
grouped_data = groupby_curr_months(
    table_name="installments_payments",
    months_balance_builder=ip_months_balance_builder,
    agg_dict=agg_dict,
    include_uniques=True
)

# Display the grouped data
display(grouped_data)

# Check if the results match the reference data
print("Check results identity by column :")
display((
    (grouped_data == ref_grouped_data)
    # NaN == NaN is False, so also accept cells that are null on both sides
    | (grouped_data.isnull() & ref_grouped_data.isnull())
).all())

SK_ID_CURR,MONTHS_BALANCE,n_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
100001,54,1,2.0,4,-1619.0,-1628.0,17397.900,17397.900
100001,55,1,1.0,3,-1649.0,-1660.0,3951.000,3951.000
100001,56,1,1.0,2,-1679.0,-1715.0,3951.000,3951.000
100001,57,1,1.0,1,-1709.0,-1715.0,3951.000,3951.000
100001,94,1,1.0,4,-2856.0,-2856.0,3980.925,3980.925
...,...,...,...,...,...,...,...,...
456255,28,1,1.0,5,-840.0,-847.0,11090.835,11090.835
456255,29,1,1.0,4,-870.0,-879.0,11090.835,11090.835
456255,30,1,1.0,3,-900.0,-910.0,11090.835,11090.835
456255,31,1,1.0,2,-930.0,-938.0,11090.835,11090.835


Results identity by column :


RAW_INSTALLMENTS_PAYMENTS
MONTHS_BALANCE            True
n_PREV                    True
NUM_INSTALMENT_VERSION    True
NUM_INSTALMENT_NUMBER     True
DAYS_INSTALMENT           True
DAYS_ENTRY_PAYMENT        True
AMT_INSTALMENT            True
AMT_PAYMENT               True
dtype: bool
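
The extra `isnull` term in the check is needed because `NaN == NaN` evaluates to `False`, so a plain `==` would flag matching missing values as differences. A small sketch on invented frames follows. `DataFrame.equals` is shown as a more compact alternative when a single yes/no answer is enough.

```python
import numpy as np
import pandas as pd

# Two invented frames that are identical, including a shared missing value
a = pd.DataFrame({"x": [1.0, np.nan], "y": [3.0, 4.0]})
b = a.copy()

# Plain equality reports a mismatch in column x because NaN != NaN
naive = (a == b).all()

# Treat positions where both frames are null as matching
nan_aware = ((a == b) | (a.isnull() & b.isnull())).all()

print(naive.to_dict(), nan_aware.to_dict(), a.equals(b))
```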

# RLE Aggregation of Monthly Variations

This is the second level of aggregation and the level where the challenge of information loss comes into play.

A naive aggregation would reduce each longitudinal series to a handful of summary statistics, at the cost of a significant loss of information. Early signs of failure are likely to show up as localized variations, and such local signals would be washed out by a global statistic.

We have chosen an approach that ensures lossless compression of information, inspired by the classical Run Length Encoding (RLE) compression technique. This allows us to retain the details of the 'signal' while aggregating the data.
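
A minimal sketch of the idea in pure Python. `rle_encode` and `rle_decode` are hypothetical helpers written for this illustration. They are not the implementation of `pepper.feat_eng.data_rle_reduction`, only an example of the `((value, run_length), ...)` layout.

```python
from itertools import chain, groupby, repeat


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return tuple((value, sum(1 for _ in run)) for value, run in groupby(values))


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return list(chain.from_iterable(repeat(value, n) for value, n in pairs))


# Five monthly amounts: one isolated value, a run of three, another isolated value
amounts = [17397.9, 3951.0, 3951.0, 3951.0, 3980.925]

encoded = rle_encode(amounts)
assert rle_decode(encoded) == amounts  # the round trip loses nothing
print(encoded)
```

Because decoding restores the exact sequence, a local change in the series survives as a short run instead of being averaged away.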

From this transformed table, any statistical summary can still be derived later without paying the aggregation cost again.
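
As an illustration, usual summaries can be computed directly from the runs without expanding them. The runs below are invented and the variable names are illustrative only.

```python
# Invented RLE runs in the ((value, run_length), ...) layout
runs = ((10.0, 1), (4.0, 3))

# Number of months covered by the series
n_months = sum(length for _, length in runs)

# Global statistics weighted by run length
total = sum(value * length for value, length in runs)
mean = total / n_months
maximum = max(value for value, _ in runs)

# A local feature that a global statistic alone would not expose
longest_run = max(length for _, length in runs)

print(n_months, total, mean, maximum, longest_run)
```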

## Basic Aggregation

This is the aggregation covered in the previous section; it is our starting point.

**Time:** 23 s.

In [14]:
from home_credit.merge import groupby_curr_months, ip_months_balance_builder

# Define aggregation rules
agg_dict = {
    "NUM_INSTALMENT_VERSION": "max",
    "NUM_INSTALMENT_NUMBER": "max",
    "DAYS_INSTALMENT": "median",
    "DAYS_ENTRY_PAYMENT": "median",
    "AMT_INSTALMENT": "sum",
    "AMT_PAYMENT": "sum"
}

# Group data by current application and balance tracking month
cm_data = groupby_curr_months(
    table_name="installments_payments",
    months_balance_builder=ip_months_balance_builder,
    agg_dict=agg_dict,
    include_uniques=True
)

# Display the grouped data
display(cm_data)

SK_ID_CURR,MONTHS_BALANCE,n_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
100001,54,1,2.0,4,-1619.0,-1628.0,17397.900,17397.900
100001,55,1,1.0,3,-1649.0,-1660.0,3951.000,3951.000
100001,56,1,1.0,2,-1679.0,-1715.0,3951.000,3951.000
100001,57,1,1.0,1,-1709.0,-1715.0,3951.000,3951.000
100001,94,1,1.0,4,-2856.0,-2856.0,3980.925,3980.925
...,...,...,...,...,...,...,...,...
456255,28,1,1.0,5,-840.0,-847.0,11090.835,11090.835
456255,29,1,1.0,4,-870.0,-879.0,11090.835,11090.835
456255,30,1,1.0,3,-900.0,-910.0,11090.835,11090.835
456255,31,1,1.0,2,-930.0,-938.0,11090.835,11090.835


## Sorting Data by `SK_ID_CURR`, `MONTHS_BALANCE`

We begin by sorting the data by current loan application and then by month of balance tracking.

**Time:** 3.6 s.

In [19]:
# Create a copy of the data to sort
sorted_data = cm_data.copy()

# Move 'SK_ID_CURR' from the index to a column so it can be used as a sort key
sorted_data.reset_index(inplace=True)

# Sort the data by 'SK_ID_CURR' and 'MONTHS_BALANCE'
sorted_data.sort_values(by=["SK_ID_CURR", "MONTHS_BALANCE"], inplace=True)

# Set 'SK_ID_CURR' as the new index
sorted_data.set_index("SK_ID_CURR", inplace=True)

# Display the sorted data
display(sorted_data)

SK_ID_CURR,MONTHS_BALANCE,n_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
100001,54,1,2.0,4,-1619.0,-1628.0,17397.900,17397.900
100001,55,1,1.0,3,-1649.0,-1660.0,3951.000,3951.000
100001,56,1,1.0,2,-1679.0,-1715.0,3951.000,3951.000
100001,57,1,1.0,1,-1709.0,-1715.0,3951.000,3951.000
100001,94,1,1.0,4,-2856.0,-2856.0,3980.925,3980.925
...,...,...,...,...,...,...,...,...
456255,28,1,1.0,5,-840.0,-847.0,11090.835,11090.835
456255,29,1,1.0,4,-870.0,-879.0,11090.835,11090.835
456255,30,1,1.0,3,-900.0,-910.0,11090.835,11090.835
456255,31,1,1.0,2,-930.0,-938.0,11090.835,11090.835
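
Sorting matters because a run-length encoding depends on the order of the values: the same values produce different runs once they are out of chronological order. A small sketch on invented data, with a hypothetical `rle_encode` helper that is not part of `pepper`:

```python
from itertools import groupby


def rle_encode(values):
    """Hypothetical helper: (value, run_length) pairs for consecutive runs."""
    return tuple((v, sum(1 for _ in run)) for v, run in groupby(values))


# Invented n_PREV values for one application, keyed by month, in arbitrary order
by_month = {3: 2, 1: 1, 2: 1, 4: 2}

# Encoding in storage order splits the runs
unsorted_runs = rle_encode(by_month.values())

# Encoding in month order gives the compact, meaningful runs
sorted_runs = rle_encode(by_month[m] for m in sorted(by_month))

print(unsorted_runs)
print(sorted_runs)
```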


## Grouping by Current Loan Application (`SK_ID_CURR`)

In this step, we form groups based on the current loan application (`SK_ID_CURR`) as a preliminary to the reduction.

**Time:** 0.1 s.

In [16]:
# Form groups based on the current loan application ('SK_ID_CURR')
grouped_data = sorted_data.groupby(by="SK_ID_CURR")

## RLE Reduction of Groups

We apply the RLE reduction to all the groups using the `pepper.feat_eng.data_rle_reduction` function. This function has been optimized, but the process still takes a few minutes, giving you a chance to grab a coffee.

**Time:** 5 min 4 s.

In [17]:
from pepper.feat_eng import data_rle_reduction

# Apply RLE reduction to the grouped data
rle_data = grouped_data.apply(data_rle_reduction)

# Set column names to match the original sorted data
rle_data.columns = list(sorted_data.columns)

# Display the resulting RLE reduced data
display(rle_data)

SK_ID_CURR,MONTHS_BALANCE,n_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
100001,"((54, 1), (55, 1), (56, 1), (57, 1), (94, 1), ...","((1, 7),)","((2.0, 1), (1.0, 6))","((4, 1), (3, 1), (2, 1), (1, 1), (4, 1), (3, 1...","((-1619.0, 1), (-1649.0, 1), (-1679.0, 1), (-1...","((-1628.0, 1), (-1660.0, 1), (-1715.0, 2), (-2...","((17397.9, 1), (3951.0, 3), (3980.925, 1), (39...","((17397.9, 1), (3951.0, 3), (3980.925, 1), (39..."
100002,"((1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1...","((1, 19),)","((2.0, 1), (1.0, 18))","((19, 1), (18, 1), (17, 1), (16, 1), (15, 1), ...","((-25.0, 1), (-55.0, 1), (-85.0, 1), (-115.0, ...","((-49.0, 1), (-67.0, 1), (-99.0, 1), (-133.0, ...","((53093.745, 1), (9251.775, 18))","((53093.745, 1), (9251.775, 18))"
100003,"((18, 1), (19, 1), (20, 1), (21, 1), (22, 1), ...","((1, 4), (2, 3), (1, 6), (2, 1), (1, 7))","((2.0, 1), (1.0, 20))","((7, 1), (6, 1), (5, 1), (4, 1), (6, 1), (5, 1...","((-536.0, 1), (-566.0, 1), (-596.0, 1), (-626....","((-544.0, 1), (-570.0, 1), (-600.0, 1), (-629....","((560835.36, 1), (98356.995, 3), (162632.61, 1...","((560835.36, 1), (98356.995, 3), (162632.61, 1..."
100004,"((24, 1), (25, 1), (26, 1))","((1, 3),)","((2.0, 1), (1.0, 2))","((3, 1), (2, 1), (1, 1))","((-724.0, 1), (-754.0, 1), (-784.0, 1))","((-727.0, 1), (-763.0, 1), (-795.0, 1))","((10573.965, 1), (5357.25, 2))","((10573.965, 1), (5357.25, 2))"
100005,"((16, 1), (17, 1), (18, 1), (19, 1), (20, 1), ...","((1, 9),)","((2.0, 1), (1.0, 8))","((9, 1), (8, 1), (7, 1), (6, 1), (5, 1), (4, 1...","((-466.0, 1), (-496.0, 1), (-526.0, 1), (-556....","((-470.0, 1), (-515.0, 1), (-555.0, 1), (-585....","((17656.245, 1), (4813.2, 8))","((17656.245, 1), (4813.2, 8))"
...,...,...,...,...,...,...,...,...
456251,"((1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1...","((1, 7),)","((2.0, 1), (1.0, 6))","((7, 1), (6, 1), (5, 1), (4, 1), (3, 1), (2, 1...","((-30.0, 1), (-60.0, 1), (-90.0, 1), (-120.0, ...","((-38.0, 1), (-101.0, 1), (-136.0, 1), (-166.0...","((12815.01, 1), (6605.91, 6))","((12815.01, 1), (6605.91, 6))"
456252,"((77, 1), (78, 1), (79, 1), (80, 1), (81, 1), ...","((1, 6),)","((1.0, 6),)","((6, 1), (5, 1), (4, 1), (3, 1), (2, 1), (1, 1))","((-2316.0, 1), (-2346.0, 1), (-2376.0, 1), (-2...","((-2327.0, 1), (-2349.0, 1), (-2376.0, 1), (-2...","((10046.88, 1), (10074.465, 5))","((10046.88, 1), (10074.465, 5))"
456253,"((57, 1), (58, 1), (59, 1), (60, 1), (61, 1), ...","((1, 7), (2, 1), (1, 5))","((1.0, 13),)","((6, 1), (5, 1), (4, 1), (3, 1), (2, 1), (1, 1...","((-1716.0, 1), (-1746.0, 1), (-1776.0, 1), (-1...","((-1738.0, 1), (-1771.0, 1), (-1792.0, 1), (-1...","((5575.185, 1), (5567.715, 5), (3971.88, 1), (...","((5575.185, 1), (5567.715, 5), (3971.88, 1), (..."
456254,"((1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1...","((2, 9), (1, 1))","((1.0, 10),)","((10, 1), (9, 1), (8, 1), (7, 1), (6, 1), (5, ...","((-14.0, 1), (-44.0, 1), (-74.0, 1), (-104.0, ...","((-33.0, 1), (-63.0, 1), (-92.5, 1), (-121.0, ...","((21362.265, 9), (2296.44, 1))","((21362.265, 9), (2296.44, 1))"
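
The shape of this result (one row per `SK_ID_CURR`, one tuple of runs per column) can be reproduced on a toy table. The sketch below uses an explicit loop and a hypothetical `rle_encode` helper. It is an assumption about the general technique, not the optimized `data_rle_reduction` used above.

```python
from itertools import groupby

import pandas as pd


def rle_encode(values):
    """Hypothetical helper: (value, run_length) pairs for consecutive runs."""
    return tuple((v, sum(1 for _ in run)) for v, run in groupby(values))


# Invented monthly table, already sorted by SK_ID_CURR then MONTHS_BALANCE
toy = pd.DataFrame(
    {
        "MONTHS_BALANCE": [1, 2, 3, 1, 2],
        "n_PREV": [1, 1, 1, 2, 1],
        "AMT_PAYMENT": [90.0, 45.0, 45.0, 10.0, 10.0],
    },
    index=pd.Index([100, 100, 100, 200, 200], name="SK_ID_CURR"),
)

# One output row per application, one RLE tuple per column
keys, records = [], []
for key, group in toy.groupby(level="SK_ID_CURR"):
    keys.append(key)
    records.append({col: rle_encode(group[col].tolist()) for col in toy.columns})

toy_rle = pd.DataFrame(records, index=pd.Index(keys, name="SK_ID_CURR"))
print(toy_rle)
```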


## Saving the Reduced Table

The RLE reduction is resource-intensive, so this is the right moment to save the transformed table for future use.

The backup is performed in the `tmp/agg_merge/` directory using the default combination of `engine=pyarrow` and `compression=gzip`. The file is named `installments_payments_rle.pqt`.

**Time:** 4 min 13 s.

In [21]:
from pepper.persist import all_to_parquet
from pepper.env import get_tmp_dir
import os

# Define the target directory for saving the data
target_dir = os.path.join(get_tmp_dir(), "agg_merge/")
table_name = "installments_payments"

# Use the all_to_parquet function to save the RLE reduced data to Parquet format
# The data is stored with the name 'installments_payments_rle' in the specified target directory
all_to_parquet({f"{table_name}_rle": rle_data}, target_dir)
