# Home Credit Default Risk - EDA CREDIT CARD BALANCE

## 1. Introduction

**Context**

This notebook contains a basic EDA of the CREDIT CARD BALANCE data set.

It is an additional source of data (application_train/application_test are the main training and testing data).

credit_card_balance.csv

    Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
    This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.

**Goals:**

    To comprehensively understand the dataset's structure, identify key patterns, and discover meaningful insights that will inform a robust feature engineering and modeling strategy.

**Objectives:**

    Conduct a comprehensive Exploratory Data Analysis (EDA): Perform an in-depth exploration of the datasets to understand their statistical properties and distributions.

    Identify and address data quality issues: Investigate missing values, identify and handle data anomalies.

    Analyze feature relationships: Evaluate correlations between features and assess their individual relationships with the target variable to prioritize their importance for the model.

    Leverage automated tools for initial insights: Utilize libraries like Sweetviz to quickly generate an initial feature exploration report.


## 2. Exploratory Data Analysis (EDA)

### A. Data loading & Initial checks

In [1]:
%load_ext jupyter_black

In [2]:
import pandas as pd
import numpy as np
import sys
import os
from typing import Dict, Optional, List, Tuple, Union
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module="sweetviz.graph")
import sweetviz as sv
from ydata_profiling import ProfileReport
from IPython.display import IFrame

In [3]:
sys.path.append(os.path.abspath(".."))
from Data.utils_EDA import feature_types, missing_columns, calculate_missing_rows
from Data.utils_modeling import downcast_numeric_col

**Loading dataset**

In [4]:
# Build the path with os.path.join so the cell also runs outside Windows
credit_card_balance = pd.read_csv(os.path.join("..", "Data", "credit_card_balance.csv"))
credit_card_balance.shape

(3840312, 23)

**Downcasting numeric columns**

In [5]:
credit_card_balance = credit_card_balance.copy()
downcast_numeric_col(credit_card_balance)
credit_card_balance.dtypes.unique()

array([dtype('int32'), dtype('int8'), dtype('float64'), dtype('float32'),
       dtype('int16'), dtype('O')], dtype=object)
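
The source of `downcast_numeric_col` (from the project's `Data.utils_modeling` module) is not shown in this notebook. As a rough illustration only, a helper of this kind could be built on `pd.to_numeric(..., downcast=...)`; the function name `downcast_numeric_sketch` and the toy frame below are hypothetical and may differ from the actual implementation.

```python
import numpy as np
import pandas as pd


def downcast_numeric_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: shrink numeric columns to the smallest safe dtype."""
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include="floating").columns:
        # pandas only downcasts to float32 when the values survive the round trip
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df


# Toy data, not taken from the real file
toy = pd.DataFrame(
    {
        "MONTHS_BALANCE": np.array([-6, -1, -96], dtype="int64"),
        "CNT_DRAWINGS_ATM_CURRENT": np.array([0.0, 1.0, np.nan], dtype="float64"),
        "NAME_CONTRACT_STATUS": ["Active", "Active", "Completed"],
    }
)
downcast_numeric_sketch(toy)
print(toy.dtypes)
```

Because pandas keeps `float64` when a `float32` cast would change the values beyond a small tolerance, a helper like this would leave some monetary columns at `float64`, which is consistent with the mix of dtypes printed above.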

In [6]:
credit_card_balance.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,...,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,2562384,378907,-6,56.97,135000,0.0,877.5,0.0,877.5,1700.325,...,0.0,0.0,0.0,1,0.0,1.0,35.0,Active,0,0
1,2582071,363914,-1,63975.555,45000,2250.0,2250.0,0.0,0.0,2250.0,...,64875.555,64875.555,1.0,1,0.0,0.0,69.0,Active,0,0
2,1740877,371185,-7,31815.225,450000,0.0,0.0,0.0,0.0,2250.0,...,31460.085,31460.085,0.0,0,0.0,0.0,30.0,Active,0,0
3,1389973,337855,-4,236572.11,225000,2250.0,2250.0,0.0,0.0,11795.76,...,233048.97,233048.97,1.0,1,0.0,0.0,10.0,Active,0,0
4,1891521,126868,-1,453919.455,450000,0.0,11547.0,0.0,11547.0,22924.89,...,453919.455,453919.455,0.0,1,0.0,1.0,101.0,Active,0,0


In [7]:
credit_card_balance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   SK_ID_PREV                  int32  
 1   SK_ID_CURR                  int32  
 2   MONTHS_BALANCE              int8   
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int32  
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float32
 16  CNT_DRAWINGS_CURRENT        int16  
 17  CNT_DRAWINGS_OTHER_CURRENT  float32
 18  CNT_DRAWINGS_POS_CURRENT    float32
 19  CNT_INSTALMENT_MATURE

**Feature descriptions:**


1. SK_ID_PREV: ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0, 1, 2 or more previous loans in Home Credit.) Hashed.

2. SK_ID_CURR: ID of loan in our sample. Hashed.

3. MONTHS_BALANCE: Month of balance relative to application date (-1 means the freshest balance date). Time only relative to the application.

4. AMT_BALANCE: Balance during the month of previous credit.

5. AMT_CREDIT_LIMIT_ACTUAL: Credit card limit during the month of the previous credit.

6. AMT_DRAWINGS_ATM_CURRENT: Amount drawing at ATM during the month of the previous credit.

7. AMT_DRAWINGS_CURRENT: Amount drawing during the month of the previous credit.

8. AMT_DRAWINGS_OTHER_CURRENT: Amount of other drawings during the month of the previous credit.

9. AMT_DRAWINGS_POS_CURRENT: Amount drawing or buying goods during the month of the previous credit.

10. AMT_INST_MIN_REGULARITY: Minimal installment for this month of the previous credit.

11. AMT_PAYMENT_CURRENT: How much did the client pay during the month on the previous credit.

12. AMT_PAYMENT_TOTAL_CURRENT: How much did the client pay during the month in total on the previous credit.

13. AMT_RECEIVABLE_PRINCIPAL: Amount receivable for principal on the previous credit.

14. AMT_RECIVABLE: Amount receivable on the previous credit.

15. AMT_TOTAL_RECEIVABLE: Total amount receivable on the previous credit.

16. CNT_DRAWINGS_ATM_CURRENT: Number of drawings at ATM during this month on the previous credit.

17. CNT_DRAWINGS_CURRENT: Number of drawings during this month on the previous credit.

18. CNT_DRAWINGS_OTHER_CURRENT: Number of other drawings during this month on the previous credit.

19. CNT_DRAWINGS_POS_CURRENT: Number of drawings for goods during this month on the previous credit.

20. CNT_INSTALMENT_MATURE_CUM: Number of paid installments on the previous credit.

21. NAME_CONTRACT_STATUS: Contract status (active, signed, ...) on the previous credit.

22. SK_DPD: DPD (days past due) during the month on the previous credit.

23. SK_DPD_DEF: DPD (days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit.

**Feature types**

In [8]:
feature_types(credit_card_balance)

Numerical features: ['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'AMT_BALANCE', 'AMT_CREDIT_LIMIT_ACTUAL', 'AMT_DRAWINGS_ATM_CURRENT', 'AMT_DRAWINGS_CURRENT', 'AMT_DRAWINGS_OTHER_CURRENT', 'AMT_DRAWINGS_POS_CURRENT', 'AMT_INST_MIN_REGULARITY', 'AMT_PAYMENT_CURRENT', 'AMT_PAYMENT_TOTAL_CURRENT', 'AMT_RECEIVABLE_PRINCIPAL', 'AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE', 'CNT_DRAWINGS_ATM_CURRENT', 'CNT_DRAWINGS_CURRENT', 'CNT_DRAWINGS_OTHER_CURRENT', 'CNT_DRAWINGS_POS_CURRENT', 'CNT_INSTALMENT_MATURE_CUM', 'SK_DPD', 'SK_DPD_DEF']
Categorical features: ['NAME_CONTRACT_STATUS']
Binary features: []


In [9]:
credit_card_balance.dtypes.value_counts()

float64    11
float32     4
int32       3
int16       3
int8        1
object      1
Name: count, dtype: int64

In [10]:
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

credit_card_balance.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SK_ID_PREV,3840312.0,1904504.0,536469.470563,1000018.0,1434385.0,1897122.0,2369328.0,2843496.0
SK_ID_CURR,3840312.0,278324.2,102704.475133,100006.0,189517.0,278396.0,367580.0,456250.0
MONTHS_BALANCE,3840312.0,-34.52192,26.667751,-96.0,-55.0,-28.0,-11.0,-1.0
AMT_BALANCE,3840312.0,58300.16,106307.031024,-420250.185,0.0,0.0,89046.69,1505902.185
AMT_CREDIT_LIMIT_ACTUAL,3840312.0,153808.0,165145.699525,0.0,45000.0,112500.0,180000.0,1350000.0
AMT_DRAWINGS_ATM_CURRENT,3090496.0,5961.325,28225.688578,-6827.31,0.0,0.0,0.0,2115000.0
AMT_DRAWINGS_CURRENT,3840312.0,7433.388,33846.077333,-6211.62,0.0,0.0,0.0,2287098.315
AMT_DRAWINGS_OTHER_CURRENT,3090496.0,288.1696,8201.989345,0.0,0.0,0.0,0.0,1529847.0
AMT_DRAWINGS_POS_CURRENT,3090496.0,2968.805,20796.887047,0.0,0.0,0.0,0.0,2239274.16
AMT_INST_MIN_REGULARITY,3535076.0,3540.204,5600.154122,0.0,0.0,0.0,6633.911,202882.005


**Key insights:**

The dataset is highly skewed: many columns are mostly zero or near zero, with a few extreme outliers.

Several monetary columns contain negative values, which might indicate correction entries, data errors, or reversed transactions.
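
To put numbers on these observations, the share of negative and zero entries can be tabulated per monetary column. The snippet below is a minimal sketch on a hypothetical toy frame; in the notebook the same expressions would be applied to `credit_card_balance`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {
        "AMT_BALANCE": [56.97, -420250.185, 0.0, 63975.555],
        "AMT_DRAWINGS_ATM_CURRENT": [0.0, 2250.0, np.nan, -6827.31],
        "AMT_CREDIT_LIMIT_ACTUAL": [135000, 45000, 0, 225000],
    }
)

amt_cols = [c for c in toy.columns if c.startswith("AMT_")]
is_negative = toy[amt_cols] < 0  # NaN compares as False
is_zero = toy[amt_cols] == 0

summary = pd.DataFrame(
    {
        "negative_count": is_negative.sum(),
        "negative_pct": is_negative.mean() * 100,  # share of all rows, missing included
        "zero_pct": is_zero.mean() * 100,
    }
)
print(summary)
```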

**Missing values**

In [11]:
missing_columns(credit_card_balance)

Unnamed: 0,Missing Count,Missing Count Ratio,Missing Count %
AMT_PAYMENT_CURRENT,767988,0.199981,20.0
AMT_DRAWINGS_ATM_CURRENT,749816,0.195249,19.5
CNT_DRAWINGS_POS_CURRENT,749816,0.195249,19.5
AMT_DRAWINGS_OTHER_CURRENT,749816,0.195249,19.5
AMT_DRAWINGS_POS_CURRENT,749816,0.195249,19.5
CNT_DRAWINGS_OTHER_CURRENT,749816,0.195249,19.5
CNT_DRAWINGS_ATM_CURRENT,749816,0.195249,19.5
CNT_INSTALMENT_MATURE_CUM,305236,0.079482,7.9
AMT_INST_MIN_REGULARITY,305236,0.079482,7.9


In [12]:
calculate_missing_rows(credit_card_balance)

Missing rows: 826036 of 3840312 total rows in data set.
Missing rows %: 21.51
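
The helpers `missing_columns` and `calculate_missing_rows` come from the project's `Data.utils_EDA` module, whose source is not shown here. A plain-pandas sketch that produces comparable column-level and row-level numbers might look like the following; the toy frame is hypothetical and the helpers' real implementation may differ.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {
        "AMT_PAYMENT_CURRENT": [1800.0, np.nan, np.nan, 2250.0],
        "AMT_INST_MIN_REGULARITY": [1700.325, 2250.0, np.nan, 0.0],
        "SK_DPD": [0, 0, 0, 0],
    }
)

# Column-level summary: count and share of missing values, largest first
missing_count = toy.isna().sum()
missing_table = pd.DataFrame(
    {
        "Missing Count": missing_count,
        "Missing Count Ratio": missing_count / len(toy),
    }
)
missing_table = missing_table[missing_table["Missing Count"] > 0]
missing_table = missing_table.sort_values("Missing Count", ascending=False)

# Row-level summary: rows with at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())
rows_with_missing_pct = rows_with_missing / len(toy) * 100

print(missing_table)
print(f"Missing rows: {rows_with_missing} of {len(toy)} ({rows_with_missing_pct:.2f}%)")
```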


We will not remove rows with missing values; we will impute them instead.

**Checking for duplicates.**

In [13]:
print(
    f"Duplicates: {credit_card_balance.duplicated().sum()}, {(credit_card_balance.duplicated().sum() / len(credit_card_balance) * 100):.2f}%"
)

Duplicates: 0, 0.00%


No duplicates in credit card balance dataset.

**Sweetviz report**

The report is saved in the EDA folder.

In [None]:
report = sv.analyze(credit_card_balance)
html_file = "Credit_card_balance_sweetviz_report.html"
report.show_html(html_file)
#display(IFrame(html_file, width=950, height=600))

**Ydata report**

The report exceeds 25 MB, so it was not committed to GitHub.

In [None]:
profile = ProfileReport(
    credit_card_balance, title="Credit_card_balance_EDA", explorative=True
)

profile.to_file("Credit_card_balance_EDA.html")

### B. Feature analysis

    MONTHS_BALANCE - Month of balance relative to application date (-1 means the freshest balance date),time only relative to the application

Numerical, no missing values, no zeros.

Left skewed.

Minimum -96 months (8 years), Maximum -1, Mean -34.5.

    AMT_BALANCE - Balance during the month of previous credit

Very high correlation (>0.9) with AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE, AMT_INST_MIN_REGULARITY, and high (0.8) with AMT_PAYMENT_TOTAL_CURRENT.

Right skewed, with negative and positive outliers.

Numerical, no missing values, 56.2% zeros.

Minimum -420,250.18, Maximum 1,505,902.2, Mean 58,300.155.

    Negative values: 0.1%, likely refunds or adjustments; clip at 0 and add a flag for values < 0.
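
A minimal sketch of the flag-then-clip treatment for negative balances, using a hypothetical toy frame. The flag name `NEGATIVE_AMT_BALANCE` follows the naming used in the summary of this notebook; the order matters, because clipping first would erase the information the flag is meant to keep.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"AMT_BALANCE": [56.97, -420250.185, 0.0, 63975.555]})

# 1) Record which rows were negative before any modification
toy["NEGATIVE_AMT_BALANCE"] = (toy["AMT_BALANCE"] < 0).astype("int8")

# 2) Clip negatives to zero
toy["AMT_BALANCE"] = toy["AMT_BALANCE"].clip(lower=0)

print(toy)
```

The same two steps would apply to the other monetary columns with negative entries.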


    AMT_CREDIT_LIMIT_ACTUAL - Credit card limit during the month of the previous credit

Numerical, no missing values, 19.6% zeros, no negative values.

Right skewed, positive outliers.

Minimum	0, Maximum	1,350,000.0, Mean	153,807.96

    Feature engineering:
    - CREDIT UTILIZATION = AMT_BALANCE / AMT_CREDIT_LIMIT_ACTUAL
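
Since 19.6% of the credit limits are zero, a plain division would produce infinite or undefined ratios. The sketch below treats a zero limit as "utilization undefined" (NaN) and applies the cap of 1.5 mentioned in the summary; both the NaN handling and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {
        "AMT_BALANCE": [56.97, 63975.555, 0.0, 453919.455],
        "AMT_CREDIT_LIMIT_ACTUAL": [135000, 40000, 0, 450000],
    }
)

# Assumption: a zero limit makes utilization undefined, so it becomes NaN
limit = toy["AMT_CREDIT_LIMIT_ACTUAL"].replace(0, np.nan)

# Cap at 1.5 so that a few heavily over-limit months do not dominate aggregates
toy["CREDIT_UTILIZATION_RATIO"] = (toy["AMT_BALANCE"] / limit).clip(upper=1.5)

print(toy)
```

Leaving the undefined cases as NaN lets tree-based models such as LightGBM handle them natively; filling them with 0 would be an alternative design choice.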

    AMT_DRAWINGS_ATM_CURRENT - Amount drawing at ATM during the month of the previous credit

High correlation with AMT_DRAWINGS_CURRENT, CNT_DRAWINGS_ATM_CURRENT, CNT_DRAWINGS_CURRENT.

Numerical, 19.5% missing values, 69.4% zeros, negative < 0.1%.

Minimum -6,827.31, Maximum 2,115,000.0, Mean 5,961.32. Outliers.

    Only one value is < 0; clip at 0.

    AMT_DRAWINGS_CURRENT - Amount drawing during the month of the previous credit

High correlation with AMT_DRAWINGS_ATM_CURRENT, CNT_DRAWINGS_ATM_CURRENT, CNT_DRAWINGS_CURRENT.

Numerical, no missing values, 83.9% zeros, negative < 0.1%.

Minimum -6,211.62, Maximum 2,287,098.3, Mean 7,433.3882. Outliers.

    Only 3 values are < 0; clip at 0.
    Feature engineering:
    - What percentage of drawings are from ATM, ATM_DRAWING_RATIO = AMT_DRAWINGS_ATM_CURRENT / AMT_DRAWINGS_CURRENT
    - DRAWINGS_TO_PAYMENTS_RATIO = AMT_DRAWINGS_CURRENT / (AMT_PAYMENT_CURRENT + 1)
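
A small sketch of the two drawing ratios on a hypothetical toy frame. It assumes negative drawings are clipped to 0 first, and that months without any drawings get an undefined (NaN) ATM share rather than 0; these choices are illustrative, not necessarily the ones used in the final pipeline.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {
        "AMT_DRAWINGS_ATM_CURRENT": [0.0, 2250.0, np.nan, 0.0],
        "AMT_DRAWINGS_CURRENT": [877.5, 2250.0, 0.0, 0.0],
        "AMT_PAYMENT_CURRENT": [1800.0, 2250.0, np.nan, 0.0],
    }
)

total = toy["AMT_DRAWINGS_CURRENT"].clip(lower=0)
atm = toy["AMT_DRAWINGS_ATM_CURRENT"].clip(lower=0)

# Share of drawings taken at an ATM; undefined when nothing was drawn
toy["ATM_DRAWING_RATIO"] = atm / total.replace(0, np.nan)

# The +1 in the denominator avoids division by zero when nothing was paid
toy["DRAWINGS_TO_PAYMENTS_RATIO"] = total / (toy["AMT_PAYMENT_CURRENT"] + 1)

print(toy)
```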

    AMT_DRAWINGS_OTHER_CURRENT - Amount of other drawings during the month of the previous credit

High correlation with CNT_DRAWINGS_OTHER_CURRENT.

Numerical, 19.5% missing values, 80.2% zeros, no negative.

Minimum	0, Maximum	1,529,847.0, Mean 288.2. Outliers.

    AMT_DRAWINGS_POS_CURRENT - Amount drawing or buying goods during the month of the previous credit

High correlation with CNT_DRAWINGS_POS_CURRENT.

Numerical, 19.5% missing values, 73.6% zeros, no negative.

Minimum	0, Maximum 2,239,274.2, Mean 2,968.8. Outliers.

    AMT_INST_MIN_REGULARITY - Minimal installment for this month of the previous credit

High correlation with AMT_BALANCE, AMT_PAYMENT_CURRENT, AMT_PAYMENT_TOTAL_CURRENT, AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE.

Numerical, 7.9% missing values, 50.2% zeros.

Minimum	0, Maximum	202,882.01, Mean 3,540.2041. Right skewed. Outliers.

    AMT_PAYMENT_CURRENT - How much did the client pay during the month on the previous credit

High correlation with AMT_BALANCE, AMT_INST_MIN_REGULARITY, AMT_PAYMENT_TOTAL_CURRENT.

Numerical, 20.0% missing values, 10.2% zeros.

Minimum	0, Maximum	4,289,207.4, Mean 10,280.5. Outliers.

    Feature engineering:
    - MIN_PAYMENT_RATIO = AMT_PAYMENT_CURRENT / AMT_INST_MIN_REGULARITY.
    If the minimum payment is 0, we set the ratio to 1 (assume paid in full or no payment due).
    - MADE_MINIMUM_PAYMENT = MIN_PAYMENT_RATIO >= 0.95 (5% tolerance)
    - PAYMENT_TO_BALANCE_RATIO = AMT_PAYMENT_CURRENT / (AMT_BALANCE + 1)
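
The payment features above can be sketched as follows on a hypothetical toy frame. Two details are assumptions made for this illustration: a missing payment leaves the ratio as NaN and therefore counts as "minimum not made", and the balance is clipped at 0 before the `+ 1` so the denominator is never zero or negative.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {
        "AMT_PAYMENT_CURRENT": [1800.0, 2000.0, 500.0, np.nan],
        "AMT_INST_MIN_REGULARITY": [1700.325, 2250.0, 0.0, 2250.0],
        "AMT_BALANCE": [56.97, 63975.555, 0.0, 31815.225],
    }
)

min_due = toy["AMT_INST_MIN_REGULARITY"]

# Divide only where a minimum payment is due, then set the zero-due rows to 1
ratio = toy["AMT_PAYMENT_CURRENT"] / min_due.replace(0, np.nan)
toy["MIN_PAYMENT_RATIO"] = ratio.mask(min_due == 0, 1.0)

# 5% tolerance; NaN >= 0.95 evaluates to False, so missing payments yield 0
toy["MADE_MINIMUM_PAYMENT"] = (toy["MIN_PAYMENT_RATIO"] >= 0.95).astype("int8")

balance = toy["AMT_BALANCE"].clip(lower=0)
toy["PAYMENT_TO_BALANCE_RATIO"] = toy["AMT_PAYMENT_CURRENT"] / (balance + 1)

print(toy)
```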

    AMT_PAYMENT_TOTAL_CURRENT - How much did the client pay during the month in total on the previous credit

High correlation with AMT_PAYMENT_CURRENT, AMT_BALANCE, AMT_INST_MIN_REGULARITY, AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE.

Numerical, no missing values, 56.6% zeros, no negative.

Minimum	0, Maximum	4,278,315.7, Mean 	7,588.9. Outliers.

    AMT_RECEIVABLE_PRINCIPAL - Amount receivable for principal on the previous credit

High correlation with AMT_BALANCE, AMT_DRAWINGS_CURRENT, AMT_INST_MIN_REGULARITY, AMT_PAYMENT_CURRENT, AMT_PAYMENT_TOTAL_CURRENT, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE, CNT_DRAWINGS_CURRENT.

Numerical, no missing values, 59.8% zeros, negative 0.1%.

Minimum	-423,305.82, Maximum 1,472,316.8, Mean 55,965.9. Right skewed. Outliers.

    Negative values: 0.1%, possibly data entry errors, refunds, or misclassified transactions; clip at 0 and add a flag for values < 0.

    AMT_RECIVABLE - Amount receivable on the previous credit

High correlation with AMT_BALANCE, AMT_INST_MIN_REGULARITY, AMT_PAYMENT_CURRENT, AMT_PAYMENT_TOTAL_CURRENT, AMT_RECEIVABLE_PRINCIPAL, AMT_TOTAL_RECEIVABLE.

Numerical, no missing values, 55.0% zeros, negative 2.8%.

Minimum -420,250.18, Maximum 1,493,338.2, Mean 58,088.8. Right skewed. Outliers.

    Negative values: 2.8%, possibly data anomalies or accounting reversals; clip at 0 and add a flag for values < 0.

    AMT_TOTAL_RECEIVABLE - Total amount receivable on the previous credit

High correlation with AMT_BALANCE, AMT_INST_MIN_REGULARITY, AMT_PAYMENT_CURRENT, AMT_PAYMENT_TOTAL_CURRENT, AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE.

Numerical, no missing values, 55.0% zeros, negative 2.8%.

Minimum -420,250.18, Maximum 1,493,338.2, Mean 58,098.3. Right skewed. Outliers.

    Negative values: 2.8%, possibly data anomalies or accounting reversals; clip at 0 and add a flag for values < 0.

    CNT_DRAWINGS_ATM_CURRENT - Number of drawings at ATM during this month on the previous credit

High correlation with AMT_DRAWINGS_ATM_CURRENT, AMT_DRAWINGS_CURRENT, CNT_DRAWINGS_CURRENT.

Numerical, 19.5% missing values, 69.4% zeros, negative	0.

Minimum	0, Maximum	51, Mean 0.3. Right skewed. Outliers.

    CNT_DRAWINGS_CURRENT - Number of drawings during this month on the previous credit

High correlation with AMT_BALANCE, AMT_DRAWINGS_ATM_CURRENT, AMT_DRAWINGS_CURRENT, AMT_DRAWINGS_POS_CURRENT, AMT_RECEIVABLE_PRINCIPAL, CNT_DRAWINGS_ATM_CURRENT, CNT_DRAWINGS_POS_CURRENT.

Numerical, no missing values, 84.1% zeros, negative	0.

Minimum	0, Maximum	165, Mean 0.7. Right skewed. Outliers.

    CNT_DRAWINGS_OTHER_CURRENT - Number of other drawings during this month on the previous credit

High correlation with AMT_DRAWINGS_OTHER_CURRENT.

Numerical, 19.5% missing values, 80.1% zeros, negative 0.

Minimum	0, Maximum	12, Mean 0.005. Right skewed. Outliers.

    CNT_DRAWINGS_POS_CURRENT - Number of drawings for goods during this month on the previous credit

High correlation with AMT_DRAWINGS_CURRENT, AMT_DRAWINGS_POS_CURRENT, CNT_DRAWINGS_CURRENT.

Numerical, 19.5% missing values, 73.6% zeros, negative 0.

Minimum	0, Maximum	165, Mean 0.6. Right skewed. Outliers.

    CNT_INSTALMENT_MATURE_CUM - Number of paid installments on the previous credit

Numerical, 7.9% missing values, 14.4% zeros, negative 0.

Minimum	0, Maximum	120, Mean 20.8. Right skewed. Outliers.

    NAME_CONTRACT_STATUS - Contract status (active signed,...) on the previous credit

Categorical, no missing values, imbalanced: Active 96.3%, Completed 3.4%.

    Flag "is active" - NAME_CONTRACT_STATUS = "Active".

    SK_DPD - DPD (Days past due) during the month on the previous credit

High correlation with SK_DPD_DEF.

Numerical, no missing values, 96.0% zeros, negative	0.

Minimum 0, Maximum 3,260 days (8.93 years), Mean 9.3. Right skewed. Outliers.

    Clip outliers at 365 days (beyond 1 year, risk is already extreme).

    SK_DPD_DEF - DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit

High correlation with SK_DPD.

Numerical, no missing values, 97.7% zeros, negative	0.

Minimum 0, Maximum 3,260 days (8.93 years), Mean 0.3. Right skewed. Outliers.

    Clip outliers at 365 days (beyond 1 year, risk is already extreme).
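
A minimal sketch of the 365-day cap for both DPD columns, together with the delinquency flags defined in the summary of this notebook (`SK_DPD > 0` and `SK_DPD > 30`). The toy values are hypothetical.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {
        "SK_DPD": [0, 12, 45, 3260],
        "SK_DPD_DEF": [0, 0, 45, 3260],
    }
)

# Cap extreme delinquency at one year
for col in ["SK_DPD", "SK_DPD_DEF"]:
    toy[col] = toy[col].clip(upper=365)

# Binary flags; capping at 365 does not change them because both thresholds are lower
toy["IS_DELINQUENT"] = (toy["SK_DPD"] > 0).astype("int8")
toy["IS_SERIOUSLY_DELINQUENT"] = (toy["SK_DPD"] > 30).astype("int8")

print(toy)
```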

### Correlation

We will analyze the relationships between features using the ydata-profiling report generated above. It provides an overview of the data, including an automated correlation matrix for all features.

To determine which features matter most for our model, we will rely on LightGBM feature importance rather than on pairwise correlations alone. After aggregating the columns from the individual datasets into our main dataset, the trained LightGBM model reports the importance of each feature for predicting the target variable. This measures a feature's predictive contribution within the context of our chosen model, which is usually more informative than its marginal correlation with the target.

**Feature Relationships**

High correlation (Ydata Report):

    AMT_BALANCE - AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE, AMT_INST_MIN_REGULARITY, AMT_PAYMENT_TOTAL_CURRENT
    AMT_DRAWINGS_ATM_CURRENT - AMT_DRAWINGS_CURRENT, CNT_DRAWINGS_ATM_CURRENT, CNT_DRAWINGS_CURRENT
    AMT_DRAWINGS_CURRENT - CNT_DRAWINGS_ATM_CURRENT, CNT_DRAWINGS_CURRENT
    AMT_DRAWINGS_OTHER_CURRENT - CNT_DRAWINGS_OTHER_CURRENT
    AMT_DRAWINGS_POS_CURRENT - CNT_DRAWINGS_POS_CURRENT
    AMT_INST_MIN_REGULARITY - AMT_PAYMENT_CURRENT, AMT_PAYMENT_TOTAL_CURRENT, AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE
    AMT_PAYMENT_CURRENT - AMT_PAYMENT_TOTAL_CURRENT
    AMT_PAYMENT_TOTAL_CURRENT - AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE
    AMT_RECEIVABLE_PRINCIPAL - AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE
    AMT_RECIVABLE - AMT_TOTAL_RECEIVABLE
    CNT_DRAWINGS_ATM_CURRENT - CNT_DRAWINGS_CURRENT
    SK_DPD - SK_DPD_DEF
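
The pairs listed above were read from the profiling report. A comparable list can be derived directly in pandas; the sketch below uses Spearman correlation and a 0.7 threshold on synthetic toy data, and both choices are assumptions that may not match the measure the report uses internally.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic toy data: one strongly related pair and one unrelated column
rng = np.random.default_rng(0)
balance = rng.gamma(shape=2.0, scale=50_000.0, size=500)
toy = pd.DataFrame(
    {
        "AMT_BALANCE": balance,
        "AMT_TOTAL_RECEIVABLE": balance + rng.normal(0, 1_000.0, size=500),
        "MONTHS_BALANCE": rng.integers(-96, 0, size=500),
    }
)

corr = toy.corr(method="spearman").abs()

# Visit each unordered column pair once and keep the strongly correlated ones
strong_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for a, b in combinations(corr.columns, 2)
    if corr.loc[a, b] > 0.7
]
print(strong_pairs)
```

Rank-based correlation is used here because the columns are heavily skewed with extreme outliers, which can distort Pearson coefficients.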

## 3. Summary

**Key EDA findings for Credit card balance:**

    - Total features: 23 (numeric 22, categorical 1), rows: ~ 3.8M,

    - Missing cells: 6.7%, rows with missing values: 21.5%,
    
    - Missing values (>15%):
        - AMT_DRAWINGS_ATM_CURRENT - 19.5%
        - AMT_DRAWINGS_OTHER_CURRENT - 19.5%
        - AMT_DRAWINGS_POS_CURRENT - 19.5%
        - AMT_PAYMENT_CURRENT - 20.0%
        - CNT_DRAWINGS_ATM_CURRENT - 19.5%
        - CNT_DRAWINGS_OTHER_CURRENT - 19.5%
        - CNT_DRAWINGS_POS_CURRENT - 19.5%
        
    - Negative values (>50%):
        - MONTHS_BALANCE - 100.0%

    - Zeros (>50%):
        - AMT_BALANCE - 56.2%
        - AMT_DRAWINGS_ATM_CURRENT - 69.4%
        - AMT_DRAWINGS_CURRENT - 83.9%
        - AMT_DRAWINGS_OTHER_CURRENT - 80.2%
        - AMT_DRAWINGS_POS_CURRENT - 73.6%
        - AMT_INST_MIN_REGULARITY - 50.2%
        - AMT_PAYMENT_TOTAL_CURRENT - 56.6%
        - AMT_RECEIVABLE_PRINCIPAL - 59.8%
        - AMT_RECIVABLE - 55.0%
        - AMT_TOTAL_RECEIVABLE - 55.0%
        - CNT_DRAWINGS_ATM_CURRENT - 69.4%
        - CNT_DRAWINGS_CURRENT - 84.1%
        - CNT_DRAWINGS_OTHER_CURRENT - 80.1%
        - CNT_DRAWINGS_POS_CURRENT - 73.6%
        - SK_DPD - 96.0%
        - SK_DPD_DEF - 97.7%

    - Strong correlations (>0.7):
        - AMT_BALANCE - AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE, AMT_INST_MIN_REGULARITY, AMT_PAYMENT_TOTAL_CURRENT
        - AMT_DRAWINGS_ATM_CURRENT - AMT_DRAWINGS_CURRENT, CNT_DRAWINGS_ATM_CURRENT, CNT_DRAWINGS_CURRENT
        - AMT_DRAWINGS_CURRENT - CNT_DRAWINGS_ATM_CURRENT, CNT_DRAWINGS_CURRENT
        - AMT_DRAWINGS_OTHER_CURRENT - CNT_DRAWINGS_OTHER_CURRENT
        - AMT_DRAWINGS_POS_CURRENT - CNT_DRAWINGS_POS_CURRENT
        - AMT_INST_MIN_REGULARITY - AMT_PAYMENT_CURRENT, AMT_PAYMENT_TOTAL_CURRENT, AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE
        - AMT_PAYMENT_CURRENT - AMT_PAYMENT_TOTAL_CURRENT
        - AMT_PAYMENT_TOTAL_CURRENT - AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE
        - AMT_RECEIVABLE_PRINCIPAL - AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE
        - AMT_RECIVABLE - AMT_TOTAL_RECEIVABLE
        - CNT_DRAWINGS_ATM_CURRENT - CNT_DRAWINGS_CURRENT
        - SK_DPD - SK_DPD_DEF
    
    - Duplicates: None

**Planned Feature Engineering:**

The feature engineering ideas below aim to capture a client's credit utilization, payment discipline, delinquency risk, and spending patterns from the credit_card_balance dataset. The main steps:

    1. Handling Negative & Extreme Values

        - Flag negative monetary values and clip them to zero:

            NEGATIVE_AMT_BALANCE, NEGATIVE_AMT_RECEIVABLE_PRINCIPAL, etc.

        - Cap extreme delinquency (SK_DPD, SK_DPD_DEF) at 365 days to reduce outlier impact.

    2. Feature Engineering

        - Credit Utilization:

            CREDIT_UTILIZATION_RATIO = AMT_BALANCE / AMT_CREDIT_LIMIT_ACTUAL (capped at 1.5)

        - Payment Behavior:

            MIN_PAYMENT_RATIO = AMT_PAYMENT_CURRENT / AMT_INST_MIN_REGULARITY

            MADE_MINIMUM_PAYMENT = flag if ratio ≥ 95%

            PAYMENT_TO_BALANCE_RATIO = AMT_PAYMENT_CURRENT / (AMT_BALANCE + 1)

        - Spending Behavior:

            ATM_DRAWING_RATIO = ATM drawings / total drawings

            DRAWINGS_TO_PAYMENTS_RATIO = AMT_DRAWINGS_CURRENT / (AMT_PAYMENT_CURRENT + 1)

        - Delinquency:

            IS_DELINQUENT = flag if SK_DPD > 0

            IS_SERIOUSLY_DELINQUENT = flag if SK_DPD > 30

        - Trends & Activity:

            UTILIZATION_ROLLING_MEAN = 3-month rolling avg of utilization

            IS_ACTIVE = flag if contract status = Active

    3. Statistical Aggregations (per SK_ID_CURR)

        - Credit Utilization Metrics: mean, max, last utilization

        - Balance & Payments: mean, sum, max, std of AMT_BALANCE, AMT_PAYMENT_CURRENT

        - Behavior Ratios: mean & max of MIN_PAYMENT_RATIO, DRAWINGS_TO_PAYMENTS_RATIO, PAYMENT_TO_BALANCE_RATIO

        - Spending: sum & max of drawings, ATM ratio

        - Delinquency: max & mean SK_DPD and SK_DPD_DEF, delinquency flags

        - Trends: first & last utilization rolling mean

        - Activity: avg IS_ACTIVE

    4. Client-Level Summary

        - Aggregate all features at client level (SK_ID_CURR) using:

            mean, max, sum, last, std where relevant

        - Flags to capture ever delinquent, ever serious delinquency, activity ratio

    5. Feature Selection

        - Select top features using LightGBM importance + ROC-AUC ranking.

        - Key features retained (must keep):

            CC_SK_DPD_max, CC_IS_DELINQUENT_max, CC_IS_SERIOUSLY_DELINQUENT_max, CC_SK_DPD_DEF_max

            High-impact ratios & rolling trends.

    6. Must keep list:
    
        "CC_SK_DPD_max" - Maximum number of days past due across all credit card statements.
        
        "CC_IS_DELINQUENT_max" - Indicator if the client was ever delinquent (any overdue payments).
        
        "CC_IS_SERIOUSLY_DELINQUENT_max" - Indicator if the client was ever seriously delinquent (more than 30 days past due).
        
        "CC_SK_DPD_DEF_max" - Maximum number of days past due with tolerance (debts with low loan amounts are ignored).

    7. Planned Result

        Use LightGBM importance + ROC-AUC ranking to select top features.

        Merge selected features to main data frame for model training.
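
The rolling trend, client-level aggregation, and merge steps from the plan above can be sketched end to end on toy data. Everything here is illustrative: the frame `app` is a hypothetical stand-in for application_train, the values are invented, and the `CC_` prefix simply mirrors the names in the must keep list.

```python
import pandas as pd

# Toy monthly history: client 100 has one card with 4 months, client 200 one with 2
cc = pd.DataFrame(
    {
        "SK_ID_PREV": [1, 1, 1, 1, 2, 2],
        "SK_ID_CURR": [100, 100, 100, 100, 200, 200],
        "MONTHS_BALANCE": [-4, -3, -2, -1, -2, -1],
        "CREDIT_UTILIZATION_RATIO": [0.2, 0.4, 0.6, 0.8, 1.0, 0.5],
        "SK_DPD": [0, 0, 45, 0, 0, 0],
    }
)

# Sort by time so that rolling windows and "first"/"last" follow month order.
# With several cards per client, "last" is the last row of the last card.
cc = cc.sort_values(["SK_ID_CURR", "SK_ID_PREV", "MONTHS_BALANCE"]).reset_index(drop=True)

# 3-month rolling average of utilization, computed per card
cc["UTILIZATION_ROLLING_MEAN"] = cc.groupby("SK_ID_PREV")[
    "CREDIT_UTILIZATION_RATIO"
].transform(lambda s: s.rolling(window=3, min_periods=1).mean())

cc["IS_DELINQUENT"] = (cc["SK_DPD"] > 0).astype("int8")
cc["IS_SERIOUSLY_DELINQUENT"] = (cc["SK_DPD"] > 30).astype("int8")

# Client-level aggregation
agg_spec = {
    "CREDIT_UTILIZATION_RATIO": ["mean", "max", "last"],
    "UTILIZATION_ROLLING_MEAN": ["first", "last"],
    "SK_DPD": ["max", "mean"],
    "IS_DELINQUENT": ["max"],
    "IS_SERIOUSLY_DELINQUENT": ["max"],
}
cc_agg = cc.groupby("SK_ID_CURR").agg(agg_spec)
cc_agg.columns = [f"CC_{col}_{stat}" for col, stat in cc_agg.columns]
cc_agg = cc_agg.reset_index()

# Hypothetical main table; a left join keeps clients without credit card history
app = pd.DataFrame({"SK_ID_CURR": [100, 200, 300], "TARGET": [0, 1, 0]})
merged = app.merge(cc_agg, on="SK_ID_CURR", how="left")
print(merged.T)
```

Clients with no credit card history (here `SK_ID_CURR` 300) end up with NaN in every `CC_` column after the left join, which has to be accounted for during model training.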