# Home Credit Default Risk - EDA INSTALLMENTS PAYMENTS

## 1. Introduction

**Context:**

This notebook contains a basic EDA of the INSTALLMENTS PAYMENTS dataset.

It is an additional data source; application_train and application_test are the main training and testing data.

installments_payments.csv

    Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
    There is a) one row for every payment that was made plus b) one row each for missed payment.
    One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.

**Goals:**

    To comprehensively understand the dataset's structure, identify key patterns, and discover meaningful insights that will inform a robust feature engineering and modeling strategy.

**Objectives:**

    Conduct a comprehensive Exploratory Data Analysis (EDA): Perform an in-depth exploration of the datasets to understand their statistical properties and distributions.

    Identify and address data quality issues: Investigate missing values, identify and handle data anomalies.

    Analyze feature relationships: Evaluate correlations between features and assess their individual relationships with the target variable to prioritize their importance for the model.

    Leverage automated tools for initial insights: Utilize libraries like Sweetviz to quickly generate an initial feature exploration report.


## 2. Exploratory Data Analysis (EDA)

### A. Data loading & Initial checks

In [1]:
%load_ext jupyter_black

In [2]:
import pandas as pd
import numpy as np
import sys
import os
from typing import Dict, Optional, List, Tuple, Union
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module="sweetviz.graph")
import sweetviz as sv
from ydata_profiling import ProfileReport
from IPython.display import IFrame

In [3]:
sys.path.append(os.path.abspath(".."))
from Data.utils_EDA import feature_types, missing_columns, calculate_missing_rows
from Data.utils_modeling import downcast_numeric_col

**Loading dataset**

In [5]:
installments = pd.read_csv(r"..\Data\installments_payments.csv")
installments.shape

(13605401, 8)

**Downcasting numeric columns**

In [6]:
installments = installments.copy()
downcast_numeric_col(installments)
installments.dtypes.unique()

array([dtype('int32'), dtype('float32'), dtype('int16'), dtype('float64')],
      dtype=object)
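
The helper `downcast_numeric_col` lives in the project's own `Data.utils_modeling` module and is not shown here. As a rough idea of what such a helper can do, the sketch below uses `pd.to_numeric` with its `downcast` option; the function name `downcast_numeric_sketch` is hypothetical and this is not necessarily how the project helper is written.

```python
import numpy as np
import pandas as pd


def downcast_numeric_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns stored in smaller dtypes where possible."""
    out = df.copy()
    for col in out.select_dtypes(include=np.integer).columns:
        # Picks the smallest integer dtype that can hold every value in the column.
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=np.floating).columns:
        # pandas only switches to float32 if the values survive the conversion closely enough.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy frame; the values are illustrative only.
demo = pd.DataFrame({"a": [1, 2, 300], "b": [1.0, 2.5, -3.0]})
result = downcast_numeric_sketch(demo)
print(result.dtypes)
```

A float column that cannot be represented closely enough in float32 stays float64, which is one plausible reason the two AMT_* columns keep their original dtype in the output above.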

In [7]:
installments.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
0,1054186,161674,1.0,6,-1180.0,-1187.0,6948.36,6948.36
1,1330831,151639,0.0,34,-2156.0,-2156.0,1716.525,1716.525
2,2085231,193053,2.0,1,-63.0,-63.0,25425.0,25425.0
3,2452527,199697,1.0,3,-2418.0,-2426.0,24350.13,24350.13
4,2714724,167756,1.0,2,-1383.0,-1366.0,2165.04,2160.585


In [8]:
installments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_PREV              int32  
 1   SK_ID_CURR              int32  
 2   NUM_INSTALMENT_VERSION  float32
 3   NUM_INSTALMENT_NUMBER   int16  
 4   DAYS_INSTALMENT         float32
 5   DAYS_ENTRY_PAYMENT      float32
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float32(3), float64(2), int16(1), int32(2)
memory usage: 493.1 MB


**Feature descriptions:**


1. SK_ID_PREV: ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit.) Hashed.

2. SK_ID_CURR: ID of loan in our sample. Hashed.

3. NUM_INSTALMENT_VERSION: Version of installment calendar (0 is for credit card) of previous credit. Change of installment version from month to month signifies that some parameter of payment calendar has changed.

4. NUM_INSTALMENT_NUMBER: On which installment we observe payment.

5. DAYS_INSTALMENT: When the installment of previous credit was supposed to be paid (relative to application date of current loan). Time only relative to the application.

6. DAYS_ENTRY_PAYMENT: When was the installments of previous credit paid actually (relative to application date of current loan). Time only relative to the application.

7. AMT_INSTALMENT: What was the prescribed installment amount of previous credit on this installment.

8. AMT_PAYMENT: What the client actually paid on previous credit on this installment.

**Feature types**

In [9]:
feature_types(installments)

Numerical features: ['SK_ID_PREV', 'SK_ID_CURR', 'NUM_INSTALMENT_VERSION', 'NUM_INSTALMENT_NUMBER', 'DAYS_INSTALMENT', 'DAYS_ENTRY_PAYMENT', 'AMT_INSTALMENT', 'AMT_PAYMENT']
Categorical features: []
Binary features: []


In [10]:
installments.dtypes.value_counts()

float32    3
int32      2
float64    2
int16      1
Name: count, dtype: int64

In [11]:
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

installments.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SK_ID_PREV,13605401.0,1903365.0,536202.905546,1000001.0,1434191.0,1896520.0,2369094.0,2843499.0
SK_ID_CURR,13605401.0,278444.9,102718.310411,100001.0,189639.0,278685.0,367530.0,456255.0
NUM_INSTALMENT_VERSION,13605401.0,0.8566373,1.035216,0.0,0.0,1.0,1.0,178.0
NUM_INSTALMENT_NUMBER,13605401.0,18.8709,26.664067,1.0,4.0,8.0,19.0,277.0
DAYS_INSTALMENT,13605401.0,-1042.27,800.946289,-2922.0,-1654.0,-818.0,-361.0,-1.0
DAYS_ENTRY_PAYMENT,13602496.0,-1051.114,800.585876,-4921.0,-1662.0,-827.0,-370.0,-1.0
AMT_INSTALMENT,13605401.0,17050.91,50570.254429,0.0,4226.085,8884.08,16710.21,3771487.845
AMT_PAYMENT,13602496.0,17238.22,54735.783981,0.0,3398.265,8125.515,16108.425,3771487.845


**Key insights:**

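Outliers: several columns show extreme maximum values, e.g. NUM_INSTALMENT_VERSION up to 178, NUM_INSTALMENT_NUMBER up to 277, and AMT_INSTALMENT / AMT_PAYMENT up to about 3.77M.

The minimum of DAYS_ENTRY_PAYMENT is -4,921 days (about 13.5 years before the application), well beyond the DAYS_INSTALMENT minimum of -2,922 days (about 8 years).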

**Missing values**

In [12]:
missing_columns(installments)

Unnamed: 0,Missing Count,Missing Count Ratio,Missing Count %
DAYS_ENTRY_PAYMENT,2905,0.000214,0.0
AMT_PAYMENT,2905,0.000214,0.0


In [13]:
calculate_missing_rows(installments)

Missing rows: 2905 of 13605401 total rows in data set.
Missing rows %: 0.02


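Only 0.02% of rows contain missing values. DAYS_ENTRY_PAYMENT and AMT_PAYMENT each have 2,905 missing values and exactly 2,905 rows are affected, so both columns are missing in the same rows. Imputation is sufficient here.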
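
The co-occurrence of the missing values can be checked directly with plain pandas. The sketch below runs on a toy frame with made-up values; on the real data, `toy` would be replaced by `installments`. The indicator column `PAYMENT_MISSING_FLAG` is a hypothetical name, added so that the information "no payment was recorded" is not lost once the gaps are imputed.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for `installments`; values are illustrative only.
toy = pd.DataFrame(
    {
        "DAYS_ENTRY_PAYMENT": [-1187.0, np.nan, -63.0, np.nan],
        "AMT_PAYMENT": [6948.36, np.nan, 25425.0, np.nan],
    }
)

missing_mask = toy[["DAYS_ENTRY_PAYMENT", "AMT_PAYMENT"]].isna()
rows_any = int(missing_mask.any(axis=1).sum())  # rows with at least one gap
rows_all = int(missing_mask.all(axis=1).sum())  # rows where both columns are missing
same_rows = rows_any == rows_all  # True when the gaps always occur together

# Hypothetical indicator column, kept so the gap stays visible after imputation.
toy["PAYMENT_MISSING_FLAG"] = missing_mask.any(axis=1).astype("int8")
print(rows_any, rows_all, same_rows)
```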

**Checking for duplicates.**

In [14]:
n_duplicates = installments.duplicated().sum()  # computed once; the scan is expensive on 13.6M rows
print(f"Duplicates: {n_duplicates}, {(n_duplicates / len(installments) * 100):.2f}%")

Duplicates: 0, 0.00%


No duplicates in this dataset.

**Sweetviz report**

The generated report is saved in the EDA folder.

In [None]:
report = sv.analyze(installments)
html_file = "installments_sweetviz_report.html"
report.show_html(html_file)
#display(IFrame(html_file, width=950, height=600))

**Ydata report**

The generated report is saved in the EDA folder.

In [None]:
profile = ProfileReport(installments, title="installments EDA", explorative=True)

profile.to_file("installments_payments_EDA.html")

### B. Feature analysis

    NUM_INSTALMENT_VERSION - Version of installment calendar (0 is for credit card) of previous credit. Change of installment version from month to month signifies that some parameter of payment calendar has changed

Numerical, no missing values, 30.0% zeros; the value 1 accounts for 62.4% of rows.

Minimum 0, maximum 178, mean 0.86. Right-skewed, with outliers.

    NUM_INSTALMENT_NUMBER - On which installment we observe payment

Numerical, no missing values, no zeros.

Minimum 1, maximum 277, mean 18.9. Right-skewed, with outliers.

    DAYS_INSTALMENT - When the installment of previous credit was supposed to be paid (relative to application date of current loan),time only relative to the application

High correlation with DAYS_ENTRY_PAYMENT.

Numerical, no missing values, no zeros.

Minimum -2,922 (~8 years), maximum -1, mean -1,042.27. Left-skewed.

    Convert to years.

    DAYS_ENTRY_PAYMENT - When was the installments of previous credit paid actually (relative to application date of current loan),time only relative to the application

High correlation with DAYS_INSTALMENT.

Numerical, <0.1% missing values, no zeros.

Minimum -4,921 (~13.5 years), maximum -1, mean -1,051.1. Left-skewed.

    Convert to years.
    Feature engineering:
    - calculate payment delay: LATE_YEARS = YEARS_ENTRY_PAYMENT - YEARS_INSTALMENT (positive when the payment was entered after the due date)
    - flag for late payments (LATE_YEARS > 0)
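
The conversion and the delay features listed above can be sketched as follows. The toy values are taken from the `head()` output earlier in this notebook; the divisor 365.25 is an assumption, since the notebook does not state how days are converted to years.

```python
import pandas as pd

DAYS_PER_YEAR = 365.25  # assumption: the notebook does not specify the divisor

# Three rows from the head() output above.
toy = pd.DataFrame(
    {
        "DAYS_INSTALMENT": [-1180.0, -2156.0, -1383.0],
        "DAYS_ENTRY_PAYMENT": [-1187.0, -2156.0, -1366.0],
    }
)

toy["YEARS_INSTALMENT"] = toy["DAYS_INSTALMENT"] / DAYS_PER_YEAR
toy["YEARS_ENTRY_PAYMENT"] = toy["DAYS_ENTRY_PAYMENT"] / DAYS_PER_YEAR

# Both columns are measured relative to the application date, so a payment
# entered after its due date has the larger (less negative) value.
toy["LATE_YEARS"] = toy["YEARS_ENTRY_PAYMENT"] - toy["YEARS_INSTALMENT"]

# A missing DAYS_ENTRY_PAYMENT gives NaN here, which compares as False (flag 0).
toy["LATE_FLAG"] = (toy["LATE_YEARS"] > 0).astype("int8")
print(toy[["LATE_YEARS", "LATE_FLAG"]])
```

In this toy frame the first payment is 7 days early, the second is on time, and the third is 17 days late, so only the third row is flagged.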

    AMT_INSTALMENT - What was the prescribed installment amount of previous credit on this installment

High correlation with AMT_PAYMENT and NUM_INSTALMENT_VERSION.

Numerical, no missing values, <0.1% zeros.

Minimum 0, maximum 3,771,487.8, mean 17,050.9. Right-skewed, with outliers.

    AMT_PAYMENT - What the client actually paid on previous credit on this installment.

High correlation with AMT_INSTALMENT.

Numerical, <0.1% missing values, <0.1% zeros.

Minimum 0, maximum 3,771,487.8, mean 17,238.2. Right-skewed, with outliers.

    Feature engineering:
    - UNDERPAYMENT_RATIO = AMT_PAYMENT / AMT_INSTALMENT
    - flag for underpayments: UNDERPAYMENT_RATIO < 0.95 (5% tolerance)
    - flag for overpayments: UNDERPAYMENT_RATIO > 1.05 (5% tolerance)
    - payment difference = AMT_PAYMENT - AMT_INSTALMENT
    - absolute payment difference
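
A minimal row-level sketch of the payment features listed above, on toy values. Turning a zero AMT_INSTALMENT into NaN before dividing is one possible way to avoid infinite ratios and is an assumption here, as is reusing the flag names UNDERPAYMENT_FLAG and FULL_REPAYMENT_FLAG from the summary section.

```python
import numpy as np
import pandas as pd

# Toy rows; the first two come from the head() output, the rest are illustrative.
toy = pd.DataFrame(
    {
        "AMT_INSTALMENT": [6948.36, 2165.04, 0.0, 1000.0, 1000.0],
        "AMT_PAYMENT": [6948.36, 2160.585, 500.0, 1100.0, 900.0],
    }
)

# Replace a zero instalment with NaN so the ratio becomes NaN instead of inf.
safe_due = toy["AMT_INSTALMENT"].replace(0, np.nan)
toy["UNDERPAYMENT_RATIO"] = toy["AMT_PAYMENT"] / safe_due

# NaN ratios compare as False, so such rows get 0 in both flags.
toy["UNDERPAYMENT_FLAG"] = (toy["UNDERPAYMENT_RATIO"] < 0.95).astype("int8")
toy["FULL_REPAYMENT_FLAG"] = (toy["UNDERPAYMENT_RATIO"] > 1.05).astype("int8")

toy["PAYMENT_DIFF"] = toy["AMT_PAYMENT"] - toy["AMT_INSTALMENT"]
toy["ABS_PAYMENT_DIFF"] = toy["PAYMENT_DIFF"].abs()
print(toy)
```

The second row pays about 4.46 less than due, which is within the 5% tolerance, so it is not flagged as an underpayment.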

### C. Correlation

We will analyze the relationships between features using the ydata-profiling report generated above. It provides an overview of the data, including an automated correlation matrix for all features.

To determine which features matter most for our model, we will rely on LightGBM's feature importance rather than on pairwise correlations alone. This table has no target column of its own, so its features can only be related to the target after they are aggregated and merged into the main dataset. Once that is done, the LightGBM model calculates the importance of each feature in predicting the target variable, which assesses a feature's predictive power within the context of our chosen model.

**Feature Relationships**

High correlation (Ydata Report):
    AMT_INSTALMENT - AMT_PAYMENT
    DAYS_ENTRY_PAYMENT - DAYS_INSTALMENT

## 3. Summary

**Key EDA findings for Installments Payments:**

    - Total features: 8 (numeric 8, categorical 0), rows: ~13.6M

    - Missing cells: <0.1%, rows with missing values: 0.02%

    - Missing values (>15%): none

    - Negative values (>50%):
        - DAYS_INSTALMENT - 100.0%
        - DAYS_ENTRY_PAYMENT - >99.9%

    - Zeros (>30%):
        - NUM_INSTALMENT_VERSION - 30.0%

    - Strong correlations (>0.7):
        - AMT_INSTALMENT - AMT_PAYMENT
        - DAYS_ENTRY_PAYMENT - DAYS_INSTALMENT
    
    - Duplicates: None

**Planned Feature Engineering: Installments Payments**

The goal is to capture payment punctuality, underpayment behavior, and overpayment patterns, which strongly influence default risk. Steps:

    1. Convert Time Columns

        - Convert DAYS_INSTALMENT and DAYS_ENTRY_PAYMENT into YEARS_INSTALMENT and YEARS_ENTRY_PAYMENT for interpretability.

    2. Late Payment Features

        - LATE_YEARS = Difference between actual and scheduled payment (in years).

        - LATE_FLAG = 1 if payment was late, else 0.

        - Aggregate metrics:

            INSTAL_LATE_YEARS_max – maximum delay in years.

            INSTAL_LATE_FLAG_sum – total number of late payments.

            INSTAL_LATE_FLAG_mean – proportion of payments late.

    3. Underpayment and Overpayment Behavior

        - UNDERPAYMENT_RATIO = AMT_PAYMENT / AMT_INSTALMENT, computed so that rows with AMT_INSTALMENT = 0 do not cause a division by zero.

        - Flags:

            UNDERPAYMENT_FLAG = 1 if paid < 95% of due amount.

            FULL_REPAYMENT_FLAG = 1 if paid ≥ 105% of due amount (possible early closure).

        - Aggregate metrics:

            INSTAL_UNDERPAYMENT_RATIO_min – worst underpayment ratio.
            
            INSTAL_UNDERPAYMENT_RATIO_std – variability in payment ratios.
            
            INSTAL_UNDERPAYMENT_FLAG_sum – count of underpayments.
            
            INSTAL_FULL_REPAYMENT_FLAG_sum – count of overpayments.

    4. Payment Difference Features

        - PAYMENT_DIFF = AMT_PAYMENT - AMT_INSTALMENT
        
        - ABS_PAYMENT_DIFF = absolute difference (captures both under and overpayments)
        
        - Aggregate metrics:
        
            Mean, sum for both difference features.

    5. Standard Aggregations

        In addition to custom metrics, compute numerical aggregates:

            AMT_INSTALMENT: mean, max, sum
    
            AMT_PAYMENT: mean, max, sum
    
            YEARS_INSTALMENT and YEARS_ENTRY_PAYMENT: mean, max

    6. Must-Keep Features

        Critical indicators to ensure inclusion:

            INSTAL_LATE_YEARS_max – worst delay
            
            INSTAL_UNDERPAYMENT_RATIO_min – worst underpayment ratio
            
            INSTAL_UNDERPAYMENT_FLAG_sum – count of underpayments
            
            INSTAL_LATE_FLAG_sum – count of late payments

    7. Feature Selection

        Use LightGBM importance + ROC-AUC ranking to select top features.

        Merge selected features back to main data frame for model training.
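
The aggregation and merge steps can be sketched as below. The toy frame assumes the row-level columns from steps 2 and 3 already exist. Grouping by SK_ID_CURR is an assumption, chosen because it is the key shared with the main application data, and the `application` frame is a hypothetical stand-in for that main table.

```python
import pandas as pd

# Toy row-level frame with engineered columns already present; values are illustrative.
rows = pd.DataFrame(
    {
        "SK_ID_CURR": [100001, 100001, 100001, 100002, 100002],
        "LATE_YEARS": [0.00, 0.05, -0.02, 0.00, 0.10],
        "LATE_FLAG": [0, 1, 0, 0, 1],
        "UNDERPAYMENT_RATIO": [1.0, 0.9, 1.0, 1.1, 0.5],
        "UNDERPAYMENT_FLAG": [0, 1, 0, 0, 1],
    }
)

agg_spec = {
    "LATE_YEARS": ["max"],
    "LATE_FLAG": ["sum", "mean"],
    "UNDERPAYMENT_RATIO": ["min", "std"],
    "UNDERPAYMENT_FLAG": ["sum"],
}

agg = rows.groupby("SK_ID_CURR").agg(agg_spec)
# Flatten the (column, statistic) pairs into names such as INSTAL_LATE_FLAG_sum.
agg.columns = [f"INSTAL_{col}_{stat}" for col, stat in agg.columns]
agg = agg.reset_index()

# Hypothetical main table; 100003 has no instalment history.
application = pd.DataFrame({"SK_ID_CURR": [100001, 100002, 100003]})
merged = application.merge(agg, on="SK_ID_CURR", how="left")
print(merged)
```

A left merge keeps applicants without any installment history; their aggregated columns are NaN and need to be handled (or left to the model) downstream.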


This feature set is intended to capture payment reliability, financial stress signals, and aggressive repayment patterns, which are expected to be predictive of credit risk.