# Home Credit Default Risk - EDA BUREAU & BUREAU_BALANCE

## 1. Introduction

**Context**

This notebook contains basic EDA for BUREAU and BUREAU_BALANCE data sets. 

These are additional sources of data (application_train/application_test are the main training and testing data).

bureau.csv

    All client's previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in this sample).
    For every loan in this sample, there are as many rows as number of credits the client had in Credit Bureau before the application date.


bureau_balance.csv

    Monthly balances of previous credits in Credit Bureau.
    
    This table has one row for each month of history of every previous credit reported to Credit Bureau, i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.


**Goals:**

    To comprehensively understand the dataset's structure, identify key patterns, and discover meaningful insights that will inform a robust feature engineering and modeling strategy.

**Objectives:**

    Conduct a comprehensive Exploratory Data Analysis (EDA): Perform an in-depth exploration of the datasets to understand their statistical properties and distributions.

    Identify and address data quality issues: Investigate missing values, identify and handle data anomalies.

    Analyze feature relationships: Evaluate correlations between features and assess their individual relationships with the target variable to prioritize their importance for the model.

    Leverage automated tools for initial insights: Utilize libraries like Sweetviz to quickly generate an initial feature exploration report.


## 2. Exploratory Data Analysis (EDA)

### A. Data loading

In [1]:
%load_ext jupyter_black

In [5]:
import pandas as pd
import numpy as np
import sys
import os
from typing import Dict, Optional, List, Tuple, Union
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module="sweetviz.graph")
import sweetviz as sv
from ydata_profiling import ProfileReport
from IPython.display import IFrame

In [6]:
sys.path.append(os.path.abspath(".."))
from Data.utils_EDA import feature_types, missing_columns, calculate_missing_rows
from Data.utils_modeling import downcast_numeric_col

**Loading datasets**

In [4]:
bureau = pd.read_csv(r"..\Data\bureau.csv")
bureau.shape

(1716428, 17)

In [5]:
bureau_balance = pd.read_csv(r"..\Data\bureau_balance.csv")
bureau_balance.shape

(27299925, 3)

**Downcasting numeric columns**

In [8]:
bureau = bureau.copy()
downcast_numeric_col(bureau)
bureau.dtypes.unique()

array([dtype('int32'), dtype('O'), dtype('int16'), dtype('float32'),
       dtype('float64'), dtype('int8')], dtype=object)

In [7]:
bureau_balance = bureau_balance.copy()
downcast_numeric_col(bureau_balance)
bureau_balance.dtypes.unique()

array([dtype('int32'), dtype('int8'), dtype('O')], dtype=object)
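The source of `downcast_numeric_col` (from `Data.utils_modeling`) is not shown in this notebook. As a rough sketch of what such a helper could look like, the hypothetical function `downcast_numeric_sketch` below uses `pd.to_numeric` with the `downcast` argument; it is an assumption for illustration, not the project's actual implementation.

```python
import numpy as np
import pandas as pd


def downcast_numeric_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with numeric columns downcast."""
    out = df.copy()
    # Integers go to the smallest signed integer type that holds all values.
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Floats are downcast where pandas considers the conversion safe.
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Tiny made-up frame that only mimics the bureau column names.
demo = pd.DataFrame(
    {
        "SK_ID_CURR": np.array([215354, 215355], dtype="int64"),
        "CNT_CREDIT_PROLONG": np.array([0, 1], dtype="int64"),
        "AMT_CREDIT_SUM": [91323.0, 225000.0],
        "CREDIT_TYPE": ["Consumer credit", "Credit card"],
    }
)
small = downcast_numeric_sketch(demo)
print(small.dtypes)
```

Unlike the notebook's helper, which appears to modify the frame in place, this sketch returns a new frame and leaves the input untouched.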

### B. Bureau data set

In [9]:
bureau.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


In [10]:
bureau.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int32  
 1   SK_ID_BUREAU            int32  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int16  
 5   CREDIT_DAY_OVERDUE      int16  
 6   DAYS_CREDIT_ENDDATE     float32
 7   DAYS_ENDDATE_FACT       float32
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int8   
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int32  
 16  AMT_ANNUITY             float64
dtypes: float32(2), float64(6), int16(2), int32(3), int8(1), object(3)
memory usage: 158.8+ MB


**Feature descriptions:**


1. SK_ID_CURR: ID of loan in our sample; one loan in our sample can have 0, 1, 2 or more related previous credits in Credit Bureau (hashed).

2. SK_ID_BUREAU: Recoded ID of previous Credit Bureau credit related to our loan, unique coding for each loan application (hashed).

3. CREDIT_ACTIVE: Status of the Credit Bureau (CB) reported credits.

4. CREDIT_CURRENCY: Recoded currency of the Credit Bureau credit (recoded).

5. DAYS_CREDIT: How many days before current application did client apply for Credit Bureau credit (time only relative to the application).

6. CREDIT_DAY_OVERDUE: Number of days past due on CB credit at the time of application for related loan in our sample.

7. DAYS_CREDIT_ENDDATE: Remaining duration of CB credit (in days) at the time of application in Home Credit (time only relative to the application).

8. DAYS_ENDDATE_FACT: Days since CB credit ended at the time of application in Home Credit, only for closed credit (time only relative to the application).

9. AMT_CREDIT_MAX_OVERDUE: Maximal amount overdue on the Credit Bureau credit so far (at application date of loan in our sample).

10. CNT_CREDIT_PROLONG: How many times was the Credit Bureau credit prolonged.

11. AMT_CREDIT_SUM: Current credit amount for the Credit Bureau credit.

12. AMT_CREDIT_SUM_DEBT: Current debt on Credit Bureau credit.

13. AMT_CREDIT_SUM_LIMIT: Current credit limit of credit card reported in Credit Bureau.

14. AMT_CREDIT_SUM_OVERDUE: Current amount overdue on Credit Bureau credit.

15. CREDIT_TYPE: Type of Credit Bureau credit (Car, cash,...).

16. DAYS_CREDIT_UPDATE: How many days before loan application did last information about the Credit Bureau credit come (time only relative to the application).

17. AMT_ANNUITY: Annuity of the Credit Bureau credit.

**Feature types**

In [11]:
feature_types(bureau)

Numerical features: ['SK_ID_CURR', 'SK_ID_BUREAU', 'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG', 'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE', 'DAYS_CREDIT_UPDATE', 'AMT_ANNUITY']
Categorical features: ['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE']
Binary features: []


In [18]:
bureau.dtypes.value_counts()

float64    6
int32      3
object     3
int16      2
float32    2
int8       1
Name: count, dtype: int64

In [13]:
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

bureau.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SK_ID_CURR,1716428.0,278214.9,102938.6,100001.0,188866.75,278055.0,367426.0,456255.0
SK_ID_BUREAU,1716428.0,5924434.0,532265.7,5000000.0,5463953.75,5926303.5,6385681.25,6843457.0
DAYS_CREDIT,1716428.0,-1142.108,795.1649,-2922.0,-1666.0,-987.0,-474.0,0.0
CREDIT_DAY_OVERDUE,1716428.0,0.8181666,36.54443,0.0,0.0,0.0,0.0,2792.0
DAYS_CREDIT_ENDDATE,1610875.0,510.5173,4994.22,-42060.0,-1138.0,-330.0,474.0,31199.0
DAYS_ENDDATE_FACT,1082775.0,-1017.437,714.0106,-42023.0,-1489.0,-897.0,-425.0,0.0
AMT_CREDIT_MAX_OVERDUE,591940.0,3825.418,206031.6,0.0,0.0,0.0,0.0,115987200.0
CNT_CREDIT_PROLONG,1716428.0,0.006410406,0.09622391,0.0,0.0,0.0,0.0,9.0
AMT_CREDIT_SUM,1716415.0,354994.6,1149811.0,0.0,51300.0,125518.5,315000.0,585000000.0
AMT_CREDIT_SUM_DEBT,1458759.0,137085.1,677401.1,-4705600.32,0.0,0.0,40153.5,170100000.0


**Key insights:**

Credit timelines: Most credits are old. The average DAYS_CREDIT is -1,142 days (about 3 years before the application), and many credits have already ended (average DAYS_ENDDATE_FACT is -1,017 days).

Debt vs. credit: AMT_CREDIT_SUM_DEBT and AMT_CREDIT_SUM_LIMIT include negative values. These may be data errors or reversed entries; some systems also store debts as negative numbers because they represent liabilities.
    We will create a binary flag for negative values (in case they carry some signal) and then replace the negatives with 0.

Overdue amounts: CREDIT_DAY_OVERDUE and AMT_CREDIT_SUM_OVERDUE are zero for almost all credits (99.8%), despite high maxima (2,792 days and 3.76M respectively), which suggests that delinquency is rare.

Prolonged credits: CNT_CREDIT_PROLONG is almost always zero, so credit extensions are uncommon.


**Missing values**

In [14]:
missing_columns(bureau)

Unnamed: 0,Missing Count,Missing Count Ratio,Missing Count %
AMT_ANNUITY,1226791,0.714735,71.5
AMT_CREDIT_MAX_OVERDUE,1124488,0.655133,65.5
DAYS_ENDDATE_FACT,633653,0.36917,36.9
AMT_CREDIT_SUM_LIMIT,591780,0.344774,34.5
AMT_CREDIT_SUM_DEBT,257669,0.150119,15.0
DAYS_CREDIT_ENDDATE,105553,0.061496,6.1
AMT_CREDIT_SUM,13,8e-06,0.0


Two columns (AMT_ANNUITY and AMT_CREDIT_MAX_OVERDUE) have more than 50% missing values.

In [15]:
calculate_missing_rows(bureau)

Missing rows: 1676762 of 1716428 total rows in data set.
Missing rows %: 97.69


Almost every row (97.69%) has at least one missing value, so dropping incomplete rows is not an option; we will use imputation instead.
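`missing_columns` and `calculate_missing_rows` are project helpers whose source is not shown here. A minimal sketch of how similar numbers could be computed with plain pandas follows; the function names `missing_summary` and `missing_row_share` and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts, ratios and percentages (columns with gaps only)."""
    counts = df.isna().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame(
        {
            "Missing Count": counts,
            "Missing Count Ratio": counts / len(df),
            "Missing Count %": (counts / len(df) * 100).round(1),
        }
    )


def missing_row_share(df: pd.DataFrame) -> float:
    """Percentage of rows that contain at least one missing value."""
    return float(df.isna().any(axis=1).mean() * 100)


# Made-up rows that reuse bureau column names.
demo = pd.DataFrame(
    {
        "AMT_ANNUITY": [np.nan, np.nan, np.nan, 1000.0],
        "AMT_CREDIT_SUM": [91323.0, 225000.0, np.nan, 90000.0],
        "CREDIT_DAY_OVERDUE": [0, 0, 0, 0],
    }
)
summary = missing_summary(demo)
row_share = missing_row_share(demo)
print(summary)
print(f"Rows with missing values: {row_share:.2f}%")
```

The row-level share is what makes dropping rows unattractive: a single sparse column such as AMT_ANNUITY is enough to mark most rows as incomplete.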

**Checking for duplicates.**

In [16]:
print(
    f"Duplicates: {bureau.duplicated().sum()}, {(bureau.duplicated().sum() / len(bureau) * 100):.2f}%"
)

Duplicates: 0, 0.00%


No duplicates in bureau dataset.

**Sweetviz report**

The report is saved in the EDA folder.

In [None]:
report = sv.analyze(bureau)
html_file = "Bureau_sweetviz_report.html"
report.show_html(html_file)
#display(IFrame(html_file, width=950, height=600))

**Creating Ydata report**

The report is saved in the EDA folder.

In [None]:
profile = ProfileReport(bureau, title="Bureau_ydata EDA", explorative=True)

profile.to_file("Bureau_ydata_EDA.html")

### C. Feature analysis bureau

    CREDIT_ACTIVE - Status of the Credit Bureau (CB) reported credits

Categorical. No missing values.

Distribution is imbalanced: Closed - 62.9%, Active - 36.7%, Sold - 0.4%, Bad debt <0.1%. No anomalies.

Flags for Credit Status (could be important categorical information about bad credits):

    - HAS_ACTIVE_CREDIT
    - HAS_CLOSED_CREDIT
    - HAS_BAD_CREDIT ("Sold" or "Bad debt")
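The three status flags listed above could be derived roughly as follows. This is a sketch on made-up rows; only the flag names and the grouping of "Sold" and "Bad debt" come from the notes.

```python
import pandas as pd

# Made-up rows covering the four CREDIT_ACTIVE categories.
demo = pd.DataFrame(
    {
        "SK_ID_BUREAU": [1, 2, 3, 4],
        "CREDIT_ACTIVE": ["Closed", "Active", "Sold", "Bad debt"],
    }
)

demo["HAS_ACTIVE_CREDIT"] = (demo["CREDIT_ACTIVE"] == "Active").astype("int8")
demo["HAS_CLOSED_CREDIT"] = (demo["CREDIT_ACTIVE"] == "Closed").astype("int8")
# "Sold" and "Bad debt" are both treated as bad credit, as in the list above.
demo["HAS_BAD_CREDIT"] = (
    demo["CREDIT_ACTIVE"].isin(["Sold", "Bad debt"]).astype("int8")
)
print(demo)
```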


    CREDIT_CURRENCY - Recoded currency of the Credit Bureau credit,recoded

Categorical. No missing values.

Distribution is imbalanced: currency 1 - 99.9%.

No anomalies. Probably not useful.

    DAYS_CREDIT - How many days before current application did client apply for Credit Bureau credit,time only relative to the application

High correlation with DAYS_ENDDATE_FACT, DAYS_CREDIT_UPDATE.

Numerical. No missing values.

Distribution left skewed.

Minimum -2922 days (about 8 years), maximum 0 days. No anomalies.

    - Will be converted to years.

    CREDIT_DAY_OVERDUE - Number of days past due on CB credit at the time of application for related loan in our sample

Numerical. No missing values.

Maximum 2792 days. 99.8% zeros.

    - To keep the naming convention consistent, we will rename the feature to DAYS_CREDIT_OVERDUE so that it is handled by the same DAYS-to-YEARS conversion as the other day-based features.

    DAYS_CREDIT_ENDDATE - Remaining duration of CB credit (in days) at the time of application in Home Credit,time only relative to the application

Weak correlation with DAYS_CREDIT, DAYS_ENDDATE_FACT, DAYS_CREDIT_UPDATE.

Numerical, 6.1% missing values.

Minimum -42060 days (115.23 years), maximum 31199 days (85.48 years). Anomalies.

    - Will need to fix anomalies
    - Will be converted to years.
    - Total duration of the credit feature: CREDIT_ENDDATE_PROXIMITY = YEARS_CREDIT_ENDDATE - YEARS_CREDIT
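The steps above (convert to years, neutralize anomalies, compute the duration feature) can be sketched as below. The notes do not say how the anomalies will be fixed; the 50-year cutoff `MAX_ABS_YEARS` and the choice to set out-of-range values to NaN are assumptions made only for this example.

```python
import numpy as np
import pandas as pd

DAYS_PER_YEAR = 365  # matches the day-to-year figures quoted in these notes
MAX_ABS_YEARS = 50   # hypothetical cutoff for implausible durations

# Made-up rows; the last end date mimics the -42060 day anomaly.
demo = pd.DataFrame(
    {
        "DAYS_CREDIT": [-497, -208, -730],
        "DAYS_CREDIT_ENDDATE": [-153.0, 1075.0, -42060.0],
    }
)

for col in ["DAYS_CREDIT", "DAYS_CREDIT_ENDDATE"]:
    years = demo[col] / DAYS_PER_YEAR
    # Values beyond the cutoff become NaN so they can be imputed later.
    years = years.where(years.abs() <= MAX_ABS_YEARS)
    demo[col.replace("DAYS_", "YEARS_", 1)] = years

demo["CREDIT_ENDDATE_PROXIMITY"] = demo["YEARS_CREDIT_ENDDATE"] - demo["YEARS_CREDIT"]
print(demo)
```

Setting anomalies to NaN rather than clipping keeps them from masquerading as very long but valid credits; other treatments are equally possible.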


    DAYS_ENDDATE_FACT - Days since CB credit ended at the time of application in Home Credit (only for closed credit),time only relative to the application

High correlation with DAYS_CREDIT and DAYS_CREDIT_UPDATE, weak correlation with DAYS_CREDIT_ENDDATE.

Numerical, 36.9% missing values.

Minimum -42023 days (115.13 years), maximum 0. Anomalies.
    
    - Will need to fix anomalies
    - Will be converted to years.

    AMT_CREDIT_MAX_OVERDUE - Maximal amount overdue on the Credit Bureau credit so far (at application date of loan in our sample)

Numerical, 65.5% missing values.

27.4% zeros, 7.1% non-zero values. Right skewed.

Maximum 115,987,185. This is a very high number, but there is no information about the currency. Such large values are not isolated: the top 10 starts at 10,861,812.

    CNT_CREDIT_PROLONG - How many times was the Credit Bureau credit prolonged

Numerical, no missing values.

99.5% zeros. Maximum 9.

    AMT_CREDIT_SUM - Current credit amount for the Credit Bureau credit

High correlation with AMT_CREDIT_SUM_DEBT.

Numerical, <0.1% missing values.

3.9% zeros. Maximum 585,000,000.

    AMT_CREDIT_SUM_DEBT - Current debt on Credit Bureau credit

High correlation with DAYS_CREDIT_ENDDATE, DAYS_CREDIT_UPDATE.

Numerical, 15.0% missing values. 59.2% zeros. 

Minimum -4,705,600.3, maximum 170,100,000.

    - Flag for values < 0 and replace negatives with 0.
    - Credit utilization feature - AMT_CREDIT_SUM_DEBT / AMT_CREDIT_SUM
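A small sketch of the two steps above follows. The flag name uses the FLAG_NEG_* pattern and the cap at 1 mentioned in the summary; the column name `CREDIT_UTILIZATION` and the handling of a zero AMT_CREDIT_SUM (treated as undefined) are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal credit, a negative debt, a zero credit sum, debt above the sum.
demo = pd.DataFrame(
    {
        "AMT_CREDIT_SUM": [225000.0, 90000.0, 0.0, 100000.0],
        "AMT_CREDIT_SUM_DEBT": [171342.0, -500.0, 0.0, 150000.0],
    }
)

# Flag first, so the information survives the replacement.
demo["FLAG_NEG_AMT_CREDIT_SUM_DEBT"] = (demo["AMT_CREDIT_SUM_DEBT"] < 0).astype("int8")
demo["AMT_CREDIT_SUM_DEBT"] = demo["AMT_CREDIT_SUM_DEBT"].clip(lower=0)

# A zero credit sum would divide by zero, so it is treated as missing here.
credit_sum = demo["AMT_CREDIT_SUM"].replace(0, np.nan)
demo["CREDIT_UTILIZATION"] = (demo["AMT_CREDIT_SUM_DEBT"] / credit_sum).clip(upper=1)
print(demo)
```

The order matters: computing the flag after clipping would always give zero.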


    AMT_CREDIT_SUM_LIMIT - Current credit limit of credit card reported in Credit Bureau

Numerical, 34.5% missing values, 61.2% zeros. 

Minimum -586,406.11, maximum 4,705,600.3.

    - Flag for values < 0 and replace negatives with 0.

    AMT_CREDIT_SUM_OVERDUE - Current amount overdue on Credit Bureau credit

Numerical, no missing values, 99.8% zeros.

Minimum 0, maximum 3,756,681.

     Feature engineering:
     - HAS_ANY_OVERDUE_DEBT, where AMT_CREDIT_SUM_OVERDUE > 0,
     - HAS_SIGNIFICANT_OVERDUE_DEBT, where AMT_CREDIT_SUM_OVERDUE > 1000
     - HAS_ANY_MAJOR_BUREAU_RISK = HAS_ANY_OVERDUE_DEBT > 0 or HAS_SIGNIFICANT_OVERDUE_DEBT > 0
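The three overdue flags can be sketched as below on made-up amounts; the threshold of 1000 and the flag names are taken from the list above.

```python
import pandas as pd

SIGNIFICANT_OVERDUE = 1000  # threshold quoted in the notes above

# Made-up overdue amounts: none, small, large.
demo = pd.DataFrame({"AMT_CREDIT_SUM_OVERDUE": [0.0, 250.0, 5000.0]})

demo["HAS_ANY_OVERDUE_DEBT"] = (demo["AMT_CREDIT_SUM_OVERDUE"] > 0).astype("int8")
demo["HAS_SIGNIFICANT_OVERDUE_DEBT"] = (
    demo["AMT_CREDIT_SUM_OVERDUE"] > SIGNIFICANT_OVERDUE
).astype("int8")
demo["HAS_ANY_MAJOR_BUREAU_RISK"] = (
    (demo["HAS_ANY_OVERDUE_DEBT"] > 0) | (demo["HAS_SIGNIFICANT_OVERDUE_DEBT"] > 0)
).astype("int8")
print(demo)
```

As defined, a significant overdue amount always implies a non-zero one, so on a single row the combined flag equals HAS_ANY_OVERDUE_DEBT.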

    CREDIT_TYPE - Type of Credit Bureau credit (Car, cash,...)

Categorical, no missing values, 15 distinct values.

Distribution is imbalanced: Consumer credit - 72.9%, Credit card - 23.4% (together 96.3%).

    DAYS_CREDIT_UPDATE - How many days before loan application did last information about the Credit Bureau credit come,time only relative to the application

High correlation with DAYS_CREDIT, DAYS_ENDDATE_FACT, weak DAYS_CREDIT_ENDDATE.

Numerical, no missing values.

Minimum -41,947 days (114.92 years), maximum 372 days. Anomalies.
    
    - Will need to fix anomalies
    - Will be converted to years.


    AMT_ANNUITY - Annuity of the Credit Bureau credit

Numerical, 71.5% missing values, 15.0% zeros.

Minimum 0, maximum 118,453,423.5. This is a high number, but not an isolated one: the top 10 starts at 33,784,668.


### Correlation

We will analyze the relationships between features using the ydata-profiling report. This report provides a comprehensive overview of our data, including an automated correlation matrix for all features.

To determine which features are most impactful for our model, we will use a more robust method: LightGBM's feature importance. After aggregating the columns from specific datasets into our main dataset, the LightGBM model will automatically calculate the importance of each feature in predicting the target variable. This approach is superior as it directly assesses a feature's predictive power within the context of our chosen model, providing a more reliable measure of its relationship with the target.

**Feature Relationships**

Three features are highly correlated with each other:
    
    DAYS_CREDIT, DAYS_ENDDATE_FACT and DAYS_CREDIT_UPDATE.

DAYS_CREDIT_ENDDATE correlates only weakly with them, as noted in the feature analysis.
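Outside the profiling report, correlated pairs can also be listed directly with pandas. The sketch below uses Pearson correlation and the 0.7 threshold from the summary; the report may use a different correlation measure, so the numbers need not match exactly. The helper name `high_correlation_pairs` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once, without the diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Made-up data: the second column is a shifted copy of the first.
days_credit = np.arange(-1900, -900, 100)
demo = pd.DataFrame(
    {
        "DAYS_CREDIT": days_credit,
        "DAYS_ENDDATE_FACT": days_credit + 300,
        "CNT_CREDIT_PROLONG": [0, 1] * 5,
        "CREDIT_TYPE": ["Consumer credit"] * 10,
    }
)
pairs = high_correlation_pairs(demo)
print(pairs)
```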
    

### D. Bureau balance dataset

In [23]:
bureau_balance.head()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


In [24]:
bureau_balance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SK_ID_BUREAU    int32 
 1   MONTHS_BALANCE  int8  
 2   STATUS          object
dtypes: int32(1), int8(1), object(1)
memory usage: 338.5+ MB


**Feature descriptions:**

    1. SK_ID_BUREAU: Recoded ID of Credit Bureau credit (unique coding for each application); use this to join to the bureau table (hashed).

    2. MONTHS_BALANCE: Month of balance relative to application date, -1 means the freshest balance date (time only relative to the application).
    
    3. STATUS: Status of Credit Bureau loan during the month (active, closed, DPD0-30,…

**Feature types**

In [25]:
feature_types(bureau_balance)

Numerical features: ['SK_ID_BUREAU', 'MONTHS_BALANCE']
Categorical features: ['STATUS']
Binary features: []


In [26]:
bureau_balance.describe()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE
count,27299920.0,27299920.0
mean,6036297.0,-30.74169
std,492348.9,23.86451
min,5001709.0,-96.0
25%,5730933.0,-46.0
50%,6070821.0,-25.0
75%,6431951.0,-11.0
max,6842888.0,0.0


MONTHS_BALANCE: the median is -25 months, so half of the monthly records date from more than about 2 years before the application.

**Missing values**

In [27]:
missing_columns(bureau_balance)

Unnamed: 0,Missing Count,Missing Count Ratio,Missing Count %


In [28]:
calculate_missing_rows(bureau_balance)

Missing rows: 0 of 27299925 total rows in data set.
Missing rows %: 0.00


No missing values in bureau_balance.

**Checking for duplicates.**

In [29]:
print(
    f"Duplicates: {bureau_balance.duplicated().sum()}, {(bureau_balance.duplicated().sum() / len(bureau_balance) * 100):.2f}%"
)

Duplicates: 0, 0.00%


No duplicates in this dataset.

**Sweetviz report**

The report is saved in the EDA folder.

In [None]:
report = sv.analyze(bureau_balance)
html_file = "bureau_balance_sweetviz_report.html"
report.show_html(html_file)
#display(IFrame(html_file, width=950, height=600))

**Creating Ydata report**

The report is saved in the EDA folder.

In [None]:
profile = ProfileReport(bureau_balance, title="Bureau Balance EDA", explorative=True)

profile.to_file("bureau_balance_EDA.html")

### E. Feature analysis Bureau balance 

    MONTHS_BALANCE - Month of balance relative to application date (-1 means the freshest balance date),time only relative to the application

Numerical, no missing values, 2.2% zeros.

Minimum -96 months (8 years), maximum 0.

Distribution left skewed. Median -25, mean -30.7. No anomalies.

    STATUS - Status of Credit Bureau loan during the month (active, closed, DPD0-30,… [C means closed, X means status unknown, 0 means no DPD, 1 means maximal DPD during the month between 1-30, 2 means DPD 31-60,… 5 means DPD 120+ or sold or written off])

Categorical, 8 distinct values, no missing values. 

Imbalanced distribution: C - 50.0%,  0 - 27.5%, X - 21.3%, 1 - 0.9%. No anomalies.

    Will create status_map for categories:
        "C": 0,  closed (good)
        "X": 0,  unknown/No History (treat as neutral)
        "0": 0,  no DPD (good)
        "1": 1,  DPD 1-30 days (mild bad)
        "2": 2,  DPD 31-60 days (bad)
        "3": 3,  DPD 61-90 days (severe)
        "4": 4,  DPD 91-120 days (very severe)
        "5": 5,  DPD 120+ or written off (worst)

    Feature engineering:
        WAS_SEVERELY_DELINQUENT: mapped status >= 3 (DPD of 61 days or more, critical status)
        WAS_WRITTEN_OFF: mapped status = 5

We will need to join this back to bureau.csv per SK_ID_BUREAU first, then aggregate to SK_ID_CURR.
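The status mapping and the first aggregation level (one row per SK_ID_BUREAU) could look roughly like this. The mapping values and feature names follow the notes above; the toy data and the use of a sort before taking the "last" status are assumptions of this sketch.

```python
import pandas as pd

# Severity mapping from the notes: C, X and 0 are treated as neutral.
STATUS_MAP = {"C": 0, "X": 0, "0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5}

# Made-up monthly history for two credits.
bb = pd.DataFrame(
    {
        "SK_ID_BUREAU": [1, 1, 1, 2, 2, 2],
        "MONTHS_BALANCE": [0, -1, -2, 0, -1, -2],
        "STATUS": ["C", "0", "1", "0", "3", "5"],
    }
)

bb["STATUS_SEVERITY"] = bb["STATUS"].map(STATUS_MAP)
bb["WAS_SEVERELY_DELINQUENT"] = (bb["STATUS_SEVERITY"] >= 3).astype("int8")
bb["WAS_WRITTEN_OFF"] = (bb["STATUS_SEVERITY"] == 5).astype("int8")

# Sort oldest to newest so that "last" picks the most recent month.
bb = bb.sort_values(["SK_ID_BUREAU", "MONTHS_BALANCE"])
per_credit = bb.groupby("SK_ID_BUREAU").agg(
    STATUS_SEVERITY_max=("STATUS_SEVERITY", "max"),
    STATUS_SEVERITY_mean=("STATUS_SEVERITY", "mean"),
    STATUS_SEVERITY_last=("STATUS_SEVERITY", "last"),
    WAS_SEVERELY_DELINQUENT=("WAS_SEVERELY_DELINQUENT", "max"),
    WAS_WRITTEN_OFF=("WAS_WRITTEN_OFF", "max"),
    MONTHS_BALANCE_count=("MONTHS_BALANCE", "count"),
)
print(per_credit)
```

Because status 5 also covers "DPD 120+" and "sold" according to the column description, WAS_WRITTEN_OFF is best read as an approximation.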

### Correlation

There are no noticeable correlations between features (Ydata report).

After aggregating the columns from this dataset, we will use a more robust method, LightGBM's feature importance, to determine each feature's predictive power.

## 3. Summary

**Key EDA findings for bureau:**

    - Total features: 17 (numeric 14, categorical 3), rows: ~ 1.7M,

    - Missing cells: 13.5%, rows with missing values: 97.7%,
    
    - Missing values:
        - DAYS_ENDDATE_FACT - 36.9%
        - AMT_CREDIT_MAX_OVERDUE - 65.5%
        - AMT_CREDIT_SUM_LIMIT - 34.5%
        - AMT_CREDIT_SUM_DEBT - 15%
        - AMT_ANNUITY - 71.5%
          
    - Negative values:
        - DAYS_CREDIT - 99.9%
        - DAYS_CREDIT_ENDDATE - 59%
        - DAYS_ENDDATE_FACT - 63%
        - DAYS_CREDIT_UPDATE - 99.9%
        
    - Zeros:
        - AMT_CREDIT_SUM_OVERDUE - 99.8%
        - AMT_CREDIT_SUM_LIMIT - 61%
        - AMT_CREDIT_SUM_DEBT - 51%
        - CNT_CREDIT_PROLONG - 99.5%
        - CREDIT_DAY_OVERDUE - 99.8%

    - Strong correlations (>0.7):
        - DAYS_CREDIT, DAYS_ENDDATE_FACT, DAYS_CREDIT_UPDATE (DAYS_CREDIT_ENDDATE correlates only weakly with them).
    
    - Duplicates: None

**Planned Feature Engineering:**

Ideas for feature engineering from the bureau dataset to capture a client’s credit health and risk profile. The main steps:

    1. Convert DAYS_* features to years (YEARS_*).
    
    2. Handling Data Quality Issues

        - Replace negative values in AMT_CREDIT_SUM_LIMIT and AMT_CREDIT_SUM_DEBT with 0 and create flags for negatives (FLAG_NEG_*).

        - Convert categorical columns into binary risk indicators for active, closed, and bad credits.

    3. Feature Engineering

        - Credit Status Flags: HAS_ACTIVE_CREDIT, HAS_CLOSED_CREDIT, HAS_BAD_CREDIT. Flags for overdue or risky bureau records.

        - Risk & Overdue Indicators: HAS_ANY_OVERDUE_DEBT, HAS_SIGNIFICANT_OVERDUE_DEBT, HAS_ANY_MAJOR_BUREAU_RISK

        - Credit Utilization: Ratio of AMT_CREDIT_SUM_DEBT to AMT_CREDIT_SUM (capped at 1).

        - Time-Based Features: CREDIT_ENDDATE_PROXIMITY (distance between start and end dates). Update recency: YEARS_CREDIT_UPDATE_min, YEARS_CREDIT_UPDATE_max.

        - Count Features: Total credit lines and active credit lines per customer.

    4. Aggregations

        - Numeric aggregations (mean, max, sum) for:

            AMT_CREDIT_SUM, AMT_CREDIT_SUM_DEBT, AMT_CREDIT_SUM_OVERDUE

            YEARS_CREDIT, YEARS_CREDIT_ENDDATE, YEARS_CREDIT_UPDATE

        - Custom Aggregations:

            Maximum or sum of flags to indicate presence of risky behavior.

    5. Feature Selection

        Select top features using LightGBM importance + ROC-AUC ranking.

        Must-keep strong predictors:

            BUREAU_HAS_ANY_OVERDUE_DEBT_max - Indicator if the client has ever had overdue debt in bureau records.

            BUREAU_HAS_SIGNIFICANT_OVERDUE_DEBT_max – Flag showing whether the client has ever had overdue debt above a significant threshold.

            BUREAU_HAS_BAD_CREDIT_sum – Count of instances where the client was flagged with bad credit history.

            BUREAU_HAS_ANY_MAJOR_BUREAU_RISK_max – Indicator for whether the client has ever triggered a major bureau risk flag.

            AMT_CREDIT_SUM_OVERDUE_sum - Total amount of overdue credit across all bureau-reported loans.

    6. Planned Result

        Selected features will be merged into the main training dataset for model building.
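The aggregation step of this plan (one row per SK_ID_CURR) can be sketched as follows. The `BUREAU_<column>_<stat>` naming is inferred from the feature names listed under Feature Selection; `BUREAU_CREDIT_COUNT` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Made-up, already engineered bureau rows: two credits for one client, one for another.
bureau_fe = pd.DataFrame(
    {
        "SK_ID_CURR": [100001, 100001, 100002],
        "AMT_CREDIT_SUM": [91323.0, 225000.0, 90000.0],
        "AMT_CREDIT_SUM_OVERDUE": [0.0, 500.0, 0.0],
        "HAS_ANY_OVERDUE_DEBT": [0, 1, 0],
        "HAS_BAD_CREDIT": [0, 1, 0],
        "HAS_ACTIVE_CREDIT": [1, 0, 1],
    }
)

agg_spec = {
    "AMT_CREDIT_SUM": ["mean", "max", "sum"],
    "AMT_CREDIT_SUM_OVERDUE": ["mean", "max", "sum"],
    "HAS_ANY_OVERDUE_DEBT": ["max"],  # 1 if any credit was ever flagged
    "HAS_BAD_CREDIT": ["sum"],        # number of flagged credits
    "HAS_ACTIVE_CREDIT": ["sum"],     # number of active credit lines
}

per_client = bureau_fe.groupby("SK_ID_CURR").agg(agg_spec)
# Flatten the (column, statistic) MultiIndex into single names.
per_client.columns = [f"BUREAU_{col}_{stat}" for col, stat in per_client.columns]
# Hypothetical count feature: total number of bureau credits per client.
per_client["BUREAU_CREDIT_COUNT"] = bureau_fe.groupby("SK_ID_CURR").size()
print(per_client)
```

Using max for flags answers "did this ever happen?", while sum answers "how often?"; both readings appear in the must-keep list above.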

**Key EDA findings for bureau_balance:**

    - Total features: 3 (numeric 2, categorical 1), rows: ~ 27.3M,

    - Missing cells 0%, rows with missing values - 0%,
    
    - Negative values:
        - MONTHS_BALANCE - 97.8%
    
    - Zeros:
        - MONTHS_BALANCE - 2.2%
        
    - Strong correlations (>0.7): None
    
    - Duplicates: None

**Planned Feature Engineering:**

Ideas for feature engineering from the bureau_balance dataset to capture a client’s historical payment behavior and delinquency risk. The main steps:

    1. Handling Raw Status Codes

        Map STATUS values to numerical severity scores:

            C, X, 0 - 0 (good/neutral)
            
            1 - 1 (mild delinquency)
            
            2 - 2 (moderate)
            
            3 - 3 (severe)
            
            4 - 4 (very severe)
            
            5 - 5 (worst / written off)

    2. Feature Engineering

        Severity-Based Metrics:

            STATUS_SEVERITY_max – worst ever delinquency

            STATUS_SEVERITY_mean – average delinquency

            STATUS_SEVERITY_last – most recent delinquency level

        Critical Flags:

            WAS_SEVERELY_DELINQUENT – credit ever DPD ≥ 61 days

            WAS_WRITTEN_OFF – credit was written off (STATUS = 5)

        History Length:

            MONTHS_BALANCE_count – number of months tracked for each credit

    3. Credit-Level Aggregations (SK_ID_BUREAU)

        Aggregate bureau_balance per credit line:

            Worst status, average severity, last status

            Flags for severe delinquency and write-off

            History length count

    4. Client-Level Aggregations (SK_ID_CURR)

        After joining to bureau.csv, aggregate features per client:

            BB_STATUS_SEVERITY_max_mean – mean of worst delinquency across credits

            BB_STATUS_SEVERITY_last_max – worst most recent status across credits

            BB_WAS_SEVERELY_DELINQUENT_max – flag if any credit ever had severe delinquency

            BB_WAS_WRITTEN_OFF_max – flag if any credit was written off

            BB_MONTHS_BALANCE_count_sum – total history length across all credits

    5. Feature Selection

        Select top features using LightGBM importance + ROC-AUC ranking.

    6. Planned Result

        Selected features will be merged into the main training dataset for model building.
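The client-level step (join credit-level features to bureau on SK_ID_BUREAU, then aggregate to SK_ID_CURR) can be sketched as below. The feature names come from the plan above; the toy tables and the choice of a left join, which keeps credits that have no monthly history, are assumptions.

```python
import pandas as pd

# Made-up credit-level features, one row per SK_ID_BUREAU.
per_credit = pd.DataFrame(
    {
        "SK_ID_BUREAU": [1, 2, 3],
        "STATUS_SEVERITY_max": [1, 5, 0],
        "STATUS_SEVERITY_last": [0, 0, 0],
        "WAS_SEVERELY_DELINQUENT": [0, 1, 0],
        "WAS_WRITTEN_OFF": [0, 1, 0],
        "MONTHS_BALANCE_count": [3, 3, 12],
    }
)

# Key columns from bureau; credit 4 has no rows in bureau_balance.
bureau_keys = pd.DataFrame(
    {
        "SK_ID_BUREAU": [1, 2, 3, 4],
        "SK_ID_CURR": [100001, 100001, 100002, 100002],
    }
)

joined = bureau_keys.merge(per_credit, on="SK_ID_BUREAU", how="left")
per_client = joined.groupby("SK_ID_CURR").agg(
    BB_STATUS_SEVERITY_max_mean=("STATUS_SEVERITY_max", "mean"),
    BB_STATUS_SEVERITY_last_max=("STATUS_SEVERITY_last", "max"),
    BB_WAS_SEVERELY_DELINQUENT_max=("WAS_SEVERELY_DELINQUENT", "max"),
    BB_WAS_WRITTEN_OFF_max=("WAS_WRITTEN_OFF", "max"),
    BB_MONTHS_BALANCE_count_sum=("MONTHS_BALANCE_count", "sum"),
)
print(per_client)
```

Credits without monthly history produce NaN after the join; the aggregations used here skip NaN, so such credits do not affect the client-level values.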

**Next Steps:**

    Merge these features into the main training dataset.

    Combine with Bureau features and others (previous applications, POS_CASH, etc.) for final model training.