# Home Credit Default Risk - EDA PREVIOUS APPLICATION

## 1. Introduction

**Context:**

This notebook contains a basic EDA of the PREVIOUS APPLICATION dataset.

It is an additional data source (application_train/application_test are the main training and testing data).

previous_application.csv

    All previous applications for Home Credit loans of clients who have loans in our sample.
    There is one row for each previous application related to loans in our data sample.

**Goals:**

    To comprehensively understand the dataset's structure, identify key patterns, and discover meaningful insights that will inform a robust feature engineering and modeling strategy.

**Objectives:**

    Conduct a comprehensive Exploratory Data Analysis (EDA): Perform an in-depth exploration of the datasets to understand their statistical properties and distributions.

    Identify and address data quality issues: Investigate missing values, identify and handle data anomalies.

    Analyze feature relationships: Evaluate correlations between features and assess their individual relationships with the target variable to prioritize their importance for the model.

    Leverage automated tools for initial insights: Utilize libraries like Sweetviz to quickly generate an initial feature exploration report.



## 2. Exploratory Data Analysis (EDA)

### A. Data loading & Initial checks

In [1]:
%load_ext jupyter_black

In [2]:
import pandas as pd
import numpy as np
import sys
import os
from typing import Dict, Optional, List, Tuple, Union
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module="sweetviz.graph")
import sweetviz as sv
from ydata_profiling import ProfileReport
from IPython.display import IFrame

In [3]:
sys.path.append(os.path.abspath(".."))
from Data.utils_EDA import feature_types, missing_columns, calculate_missing_rows
from Data.utils_modeling import downcast_numeric_col

**Loading dataset**

In [4]:
previous = pd.read_csv(r"..\Data\previous_application.csv")
previous.shape

(1670214, 37)

**Downcasting numeric columns**

In [5]:
previous = previous.copy()
downcast_numeric_col(previous)
previous.dtypes.unique()

array([dtype('int32'), dtype('O'), dtype('float64'), dtype('int8'),
       dtype('float32'), dtype('int16')], dtype=object)
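
`downcast_numeric_col` is a project helper whose source is not shown in this notebook. The sketch below is only an assumption about how such a helper could work, built on `pd.to_numeric(..., downcast=...)`; the name `downcast_numeric_sketch` and the toy frame are hypothetical. The real helper evidently leaves some columns as float64 (see the output above), so its rules probably differ.

```python
import numpy as np
import pandas as pd


def downcast_numeric_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for downcast_numeric_col: shrink numeric dtypes in place."""
    for col in df.select_dtypes(include=[np.integer]).columns:
        # Smallest signed integer type that can hold every value in the column.
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=[np.floating]).columns:
        # float32 is the smallest float type pandas will downcast to.
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df


toy = pd.DataFrame(
    {
        "SK_ID_PREV": np.array([2030495, 2802425], dtype="int64"),
        "HOUR_APPR_PROCESS_START": np.array([15, 11], dtype="int64"),
        "CNT_PAYMENT": np.array([12.0, 36.0], dtype="float64"),
    }
)
downcast_numeric_sketch(toy)
print(toy.dtypes)
```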

In [6]:
previous.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,2030495,271877,Consumer loans,1730.43,17145.0,17145.0,0.0,17145.0,SATURDAY,15,...,Connectivity,12.0,middle,POS mobile with interest,365243.0,-42.0,300.0,-42.0,-37.0,0.0
1,2802425,108129,Cash loans,25188.615,607500.0,679671.0,,607500.0,THURSDAY,11,...,XNA,36.0,low_action,Cash X-Sell: low,365243.0,-134.0,916.0,365243.0,365243.0,1.0
2,2523466,122040,Cash loans,15060.735,112500.0,136444.5,,112500.0,TUESDAY,11,...,XNA,12.0,high,Cash X-Sell: high,365243.0,-271.0,59.0,365243.0,365243.0,1.0
3,2819243,176158,Cash loans,47041.335,450000.0,470790.0,,450000.0,MONDAY,7,...,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-482.0,-152.0,-182.0,-177.0,1.0
4,1784265,202054,Cash loans,31924.395,337500.0,404055.0,,337500.0,THURSDAY,9,...,XNA,24.0,high,Cash Street: high,,,,,,


In [7]:
previous.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int32  
 1   SK_ID_CURR                   1670214 non-null  int32  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int8   
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int8   
 12  RATE_DOWN_PAYMENT            774370 non-nu

**Feature descriptions:**

1. SK_ID_PREV ,"ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loan applications in Home Credit, previous application could, but not necessarily have to lead to credit) ",hashed

2. SK_ID_CURR,ID of loan in our sample,hashed

3. NAME_CONTRACT_TYPE,"Contract product type (Cash loan, consumer loan [POS] ,...) of the previous application",

4. AMT_ANNUITY,Annuity of previous application,

5. AMT_APPLICATION,For how much credit did client ask on the previous application,

6. AMT_CREDIT,"Final credit amount on the previous application. This differs from AMT_APPLICATION in a way that the AMT_APPLICATION is the amount for which the client initially applied for, but during our approval process he could have received different amount - AMT_CREDIT",

7. AMT_DOWN_PAYMENT,Down payment on the previous application,

8. AMT_GOODS_PRICE,Goods price of good that client asked for (if applicable) on the previous application,

9. WEEKDAY_APPR_PROCESS_START,On which day of the week did the client apply for previous application,

10. HOUR_APPR_PROCESS_START,Approximately at what day hour did the client apply for the previous application,rounded

11. FLAG_LAST_APPL_PER_CONTRACT,Flag if it was last application for the previous contract. Sometimes by mistake of client or our clerk there could be more applications for one single contract,

12. NFLAG_LAST_APPL_IN_DAY,Flag if the application was the last application per day of the client. Sometimes clients apply for more applications a day. Rarely it could also be error in our system that one application is in the database twice,

13. NFLAG_MICRO_CASH,Flag Micro finance loan,

14. RATE_DOWN_PAYMENT,Down payment rate normalized on previous credit,normalized

15. RATE_INTEREST_PRIMARY,Interest rate normalized on previous credit,normalized

16. RATE_INTEREST_PRIVILEGED,Interest rate normalized on previous credit,normalized

17. NAME_CASH_LOAN_PURPOSE,Purpose of the cash loan,

18. NAME_CONTRACT_STATUS,"Contract status (approved, cancelled, ...) of previous application",

19. DAYS_DECISION,Relative to current application when was the decision about previous application made,time only relative to the application

20. NAME_PAYMENT_TYPE,Payment method that client chose to pay for the previous application,

21. CODE_REJECT_REASON,Why was the previous application rejected,

22. NAME_TYPE_SUITE,Who accompanied client when applying for the previous application,

23. NAME_CLIENT_TYPE,Was the client old or new client when applying for the previous application,

24. NAME_GOODS_CATEGORY,What kind of goods did the client apply for in the previous application,

25. NAME_PORTFOLIO,"Was the previous application for CASH, POS, CAR, …",

26. NAME_PRODUCT_TYPE,Was the previous application x-sell or walk-in,

27. CHANNEL_TYPE,Through which channel we acquired the client on the previous application,

28. SELLERPLACE_AREA,Selling area of seller place of the previous application,

29. NAME_SELLER_INDUSTRY,The industry of the seller,

30. CNT_PAYMENT,Term of previous credit at application of the previous application,

31. NAME_YIELD_GROUP,Grouped interest rate into small medium and high of the previous application,grouped

32. PRODUCT_COMBINATION,Detailed product combination of the previous application,

33. DAYS_FIRST_DRAWING,Relative to application date of current application when was the first disbursement of the previous application,time only relative to the application

34. DAYS_FIRST_DUE,Relative to application date of current application when was the first due supposed to be of the previous application,time only relative to the application

35. DAYS_LAST_DUE_1ST_VERSION,Relative to application date of current application when was the first due of the previous application,time only relative to the application

36. DAYS_LAST_DUE,Relative to application date of current application when was the last due date of the previous application,time only relative to the application

37. DAYS_TERMINATION,Relative to application date of current application when was the expected termination of the previous application,time only relative to the application

38. NFLAG_INSURED_ON_APPROVAL,Did the client request insurance during the previous application,

**Feature types**

In [8]:
feature_types(previous)

Numerical features: ['SK_ID_PREV', 'SK_ID_CURR', 'AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE', 'HOUR_APPR_PROCESS_START', 'NFLAG_LAST_APPL_IN_DAY', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY', 'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'SELLERPLACE_AREA', 'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL']
Categorical features: ['NAME_CONTRACT_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CASH_LOAN_PURPOSE', 'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE', 'NAME_SELLER_INDUSTRY', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION']
Binary features: ['FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY', 'NFLAG_INSURED_ON_APPROVAL']


In [9]:
previous.dtypes.value_counts()

object     16
float32    10
float64     5
int32       3
int8        2
int16       1
Name: count, dtype: int64

In [10]:
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

previous.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SK_ID_PREV,1670214.0,1923089.0,532597.958696,1000001.0,1461857.0,1923110.0,2384280.0,2845382.0
SK_ID_CURR,1670214.0,278357.2,102814.823849,100001.0,189329.0,278714.5,367514.0,456255.0
AMT_ANNUITY,1297979.0,15955.12,14782.137335,0.0,6321.78,11250.0,20658.42,418058.145
AMT_APPLICATION,1670214.0,175233.9,292779.762386,0.0,18720.0,71046.0,180360.0,6905160.0
AMT_CREDIT,1670213.0,196114.0,318574.616547,0.0,24160.5,80541.0,216418.5,6905160.0
AMT_DOWN_PAYMENT,774370.0,6697.402,20921.49541,-0.9,0.0,1638.0,7740.0,3060045.0
AMT_GOODS_PRICE,1284699.0,227847.3,315396.557937,0.0,50841.0,112320.0,234000.0,6905160.0
HOUR_APPR_PROCESS_START,1670214.0,12.48418,3.334028,0.0,10.0,12.0,15.0,23.0
NFLAG_LAST_APPL_IN_DAY,1670214.0,0.9964675,0.05933,0.0,1.0,1.0,1.0,1.0
RATE_DOWN_PAYMENT,774370.0,0.0796368,0.107823,-1.497876e-05,0.0,0.05160508,0.1089091,1.0


**Key insights:**

Most monetary features (AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_DOWN_PAYMENT, AMT_GOODS_PRICE) have outliers or extreme values.

Some DAYS_* features contain the value 365243, which is likely an anomaly.

Some columns have a lot of missing values.
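
To quantify the 365243 observation, one option is to measure the share of rows holding that value in each DAYS_* column. This is a minimal sketch on a toy frame (the values are made up for illustration); on the real data, `toy` would be replaced by `previous`.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # suspicious value seen in the DAYS_* columns

# Toy data, for illustration only.
toy = pd.DataFrame(
    {
        "DAYS_FIRST_DRAWING": [365243.0, 365243.0, -300.0, np.nan],
        "DAYS_TERMINATION": [-37.0, 365243.0, np.nan, np.nan],
        "DAYS_DECISION": [-73, -164, -301, -512],
    }
)

days_cols = [c for c in toy.columns if c.startswith("DAYS_")]
# Share of all rows (missing values count as non-matches) equal to the sentinel, in percent.
sentinel_share = toy[days_cols].eq(SENTINEL).mean().mul(100).round(1)
print(sentinel_share)
```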

**Missing values**

In [11]:
missing_columns(previous)

Unnamed: 0,Missing Count,Missing Count Ratio,Missing Count %
RATE_INTEREST_PRIVILEGED,1664263,0.996437,99.6
RATE_INTEREST_PRIMARY,1664263,0.996437,99.6
AMT_DOWN_PAYMENT,895844,0.5363648,53.6
RATE_DOWN_PAYMENT,895844,0.5363648,53.6
NAME_TYPE_SUITE,820405,0.4911975,49.1
NFLAG_INSURED_ON_APPROVAL,673065,0.4029813,40.3
DAYS_TERMINATION,673065,0.4029813,40.3
DAYS_LAST_DUE,673065,0.4029813,40.3
DAYS_LAST_DUE_1ST_VERSION,673065,0.4029813,40.3
DAYS_FIRST_DUE,673065,0.4029813,40.3


In [12]:
calculate_missing_rows(previous)

Missing rows: 1670143 of 1670214 total rows in data set.
Missing rows %: 100.00


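Two columns (RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED) are almost 100% missing, and nearly every row has at least one missing value. We will use imputation.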
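
`missing_columns` and `calculate_missing_rows` are project helpers that are not shown here. A rough equivalent in plain pandas might look like the sketch below; the function name `missing_summary` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical equivalent of missing_columns: NaN count and share per column."""
    counts = df.isna().sum()
    out = pd.DataFrame(
        {
            "Missing Count": counts,
            "Missing Count %": (counts / len(df) * 100).round(1),
        }
    )
    # Keep only columns that actually have gaps, worst first.
    return out[out["Missing Count"] > 0].sort_values("Missing Count", ascending=False)


toy = pd.DataFrame(
    {
        "RATE_INTEREST_PRIMARY": [np.nan, np.nan, np.nan, 0.18],
        "AMT_DOWN_PAYMENT": [0.0, np.nan, np.nan, 4500.0],
        "AMT_APPLICATION": [17145.0, 607500.0, 112500.0, 450000.0],
    }
)
summary = missing_summary(toy)
# Rows with at least one missing value (the idea behind calculate_missing_rows).
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(summary)
print(rows_with_missing)
```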

**Checking for duplicates.**

In [18]:
print(
    f"Duplicates: {previous.duplicated().sum()}, {(previous.duplicated().sum() / len(previous) * 100):.2f}%"
)

Duplicates: 0, 0.00%


No duplicates in this dataset.

**Sweetviz report**

The report is saved in the EDA folder.

In [None]:
report = sv.analyze(previous)
html_file = "previous_sweetviz_report.html"
report.show_html(html_file)
# display(IFrame(html_file, width=950, height=600))

**Ydata Report**

The report exceeds 25 MB, so it was not committed to GitHub.

In [None]:
profile = ProfileReport(previous, title="previous_application EDA", explorative=True)

profile.to_file("previous_application_EDA.html")

### B. Feature analysis

    NAME_CONTRACT_TYPE - Contract product type (Cash loan, consumer loan [POS] ,...) of the previous application

High correlation with DAYS_FIRST_DRAWING, DAYS_LAST_DUE_1ST_VERSION, NAME_PORTFOLIO, PRODUCT_COMBINATION, RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED

Categorical, no missing values.

Distribution: Cash loans 44.8%, Consumer loans 43.7%, Revolving loans 11.6%.

    AMT_ANNUITY - Annuity of previous application

High correlation with AMT_APPLICATION, AMT_CREDIT, AMT_GOODS_PRICE.

Numerical, 22.3% missing values, 0.1% zeros, no negative values.

Minimum 0, Maximum 418,058.15, Mean 15,955.12. Right skewed. Outliers.

    AMT_APPLICATION - For how much credit did client ask on the previous application

High correlation with AMT_ANNUITY, AMT_CREDIT, AMT_GOODS_PRICE, CNT_PAYMENT.

Numerical, no missing values, 23.5% zeros, no negative values.

Minimum 0, Maximum 6,905,160, Mean 175,233.86. Right skewed. Outliers.

    AMT_CREDIT - Final credit amount on the previous application. This differs from AMT_APPLICATION in a way that the AMT_APPLICATION is the amount for which the client initially applied for, but during our approval process he could have received different amount - AMT_CREDIT

High correlation with AMT_ANNUITY, AMT_APPLICATION, AMT_GOODS_PRICE.

Numerical, <0.1% missing values, 20.2% zeros, no negative values.

Minimum 0, Maximum 6,905,160, Mean 196,114.02. Right skewed. Outliers.

    Feature engineering:
        - How much did they ask for vs how much they got
            APP_CREDIT_DIFF = AMT_APPLICATION - AMT_CREDIT
            APP_CREDIT_RATIO = AMT_CREDIT / AMT_APPLICATION
            To avoid extreme values, clip the ratio: APP_CREDIT_RATIO.clip(0.5, 1.5)
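
The two features above can be sketched as follows. Since AMT_APPLICATION is zero in a sizeable share of rows, this sketch treats a zero requested amount as undefined (NaN) before dividing; that handling is an assumption, not something the notebook specifies. The toy values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "AMT_APPLICATION": [17145.0, 607500.0, 0.0, 100000.0],
        "AMT_CREDIT": [17145.0, 679671.0, 45000.0, 20000.0],
    }
)

# Requested minus granted amount.
toy["APP_CREDIT_DIFF"] = toy["AMT_APPLICATION"] - toy["AMT_CREDIT"]

# Assumption: a zero requested amount yields NaN instead of a division by zero.
requested = toy["AMT_APPLICATION"].replace(0, np.nan)
toy["APP_CREDIT_RATIO"] = (toy["AMT_CREDIT"] / requested).clip(0.5, 1.5)
print(toy)
```

`clip` leaves NaN untouched, so rows without a valid ratio stay missing and can be imputed later.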

    AMT_DOWN_PAYMENT - Down payment on the previous application

High correlation with RATE_DOWN_PAYMENT.

Numerical, 53.6% missing values, 22.1% zeros, negative < 0.1%.

Minimum -0.9, Maximum 3,060,045, Mean 6,697.4. Outliers.

    AMT_GOODS_PRICE - Goods price of good that client asked for (if applicable) on the previous application

High correlation with AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT.

Numerical, 23.1% missing values, 0.4% zeros, no negative values.

Minimum 0, Maximum 6,905,160, Mean 227,847.28. Right skewed. Outliers.

    WEEKDAY_APPR_PROCESS_START - On which day of the week did the client apply for previous application

Categorical, no missing values.

Distinct 7. Mon - Sat: 14.4% - 15.3%, Sun - 9.9%.

    HOUR_APPR_PROCESS_START - Approximately at what day hour did the client apply for the previous application, rounded

Numerical, no missing values, <0.1% zeros.

Minimum 0, Maximum 23, Mean 12.5. Distribution close to normal, right skewed. Outliers.

    FLAG_LAST_APPL_PER_CONTRACT - Flag if it was last application for the previous contract. Sometimes by mistake of client or our clerk there could be more applications for one single contract

High correlation with DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_TERMINATION, NFLAG_INSURED_ON_APPROVAL, NFLAG_LAST_APPL_IN_DAY, RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED.

Boolean, no missing values, 2 distinct, imbalanced distribution: True - 99.5%, False - 0.5%.

    NFLAG_LAST_APPL_IN_DAY - Flag if the application was the last application per day of the client. Sometimes clients apply for more applications a day. Rarely it could also be error in our system that one application is in the database twice

High correlation with FLAG_LAST_APPL_PER_CONTRACT.

Categorical, no missing values, 2 distinct, imbalanced distribution: 1 - 99.6%, 0 - 0.4%.

    RATE_DOWN_PAYMENT - Down payment rate normalized on previous credit, normalized

High correlation with AMT_DOWN_PAYMENT.

Numerical, 53.6% missing values, 22.1% zeros, negative < 0.1%.

Minimum -0.000015, Maximum 1, Mean 0.08. Right skewed. Outliers.

    Clip negative values to 0.

    RATE_INTEREST_PRIMARY - Interest rate normalized on previous credit,normalized.

High correlation with FLAG_LAST_APPL_PER_CONTRACT, NAME_CASH_LOAN_PURPOSE, NAME_CONTRACT_STATUS, NAME_CONTRACT_TYPE, NAME_PORTFOLIO, NAME_PRODUCT_TYPE, NAME_YIELD_GROUP, RATE_INTEREST_PRIVILEGED.

Numerical, 99.6% missing values, no zeros, no negative values.

Minimum 0.03, Maximum 1, Mean 0.188. Outliers.

    RATE_INTEREST_PRIVILEGED - Interest rate normalized on previous credit,normalized

High correlation with FLAG_LAST_APPL_PER_CONTRACT, NAME_CASH_LOAN_PURPOSE, NAME_CONTRACT_STATUS, NAME_CONTRACT_TYPE, NAME_PORTFOLIO, NAME_PRODUCT_TYPE, NAME_YIELD_GROUP, RATE_INTEREST_PRIMARY.

Numerical, 99.6% missing values, no zeros.

Minimum 0.37, Maximum 1, Mean 0.77. Left skewed. Outliers.

    NAME_CASH_LOAN_PURPOSE - Purpose of the cash loan

High correlation with NAME_CONTRACT_STATUS, NAME_PRODUCT_TYPE, NFLAG_INSURED_ON_APPROVAL, RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED.

Categorical, no missing values, 25 distinct values, imbalanced distribution: XAP 55.2%, XNA 40.6%.

    NAME_CONTRACT_STATUS - Contract status (approved, cancelled, ...) of previous application

High correlation with CODE_REJECT_REASON, DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_TERMINATION, NFLAG_INSURED_ON_APPROVAL, RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED.

Categorical, no missing values, 4 distinct values, imbalanced distribution: Approved 62.1%, Canceled 18.9%, Refused 17.4%, Unused offer 1.6%.

    Flags for application status.

    DAYS_DECISION - Relative to current application when was the decision about previous application made,time only relative to the application

High correlation with DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_LAST_DUE, DAYS_TERMINATION, RATE_INTEREST_PRIVILEGED.

Numerical, no missing values, no zeros.

Minimum -2,922 (~8 years), Maximum -1, Mean -880.7. Left skewed. Outliers.

    Convert to positive years (divide by -365).

    NAME_PAYMENT_TYPE - Payment method that client chose to pay for the previous application

Categorical, no missing values, 4 distinct values, imbalanced distribution: Cash through the bank 61.9%, XNA 37.6%, Non-cash from your account 0.5%, Cashless from the account of the employer 0.1%.

    CODE_REJECT_REASON - Why was the previous application rejected

High correlation with NAME_CONTRACT_STATUS.

Categorical, no missing values, 9 distinct values, imbalanced distribution: XAP 81.0%, HC 10.5%, LIMIT 3.3%.

    Feature engineering:
    - Create binary flag for top reason.
    - Create ratios of rejections to total previous applications for the top 3 reasons.
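
One possible reading of these two ideas is sketched below, aggregated per client (SK_ID_CURR). It assumes XAP marks applications that were not rejected and therefore excludes it from the ranking; the output column names (`REJECT_*_RATIO`, `HAS_TOP_REJECT_REASON`) are hypothetical.

```python
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 1, 2, 2],
        "CODE_REJECT_REASON": ["XAP", "HC", "HC", "LIMIT", "XAP", "XAP"],
    }
)

# Assumption: XAP means "not rejected", so it is left out when ranking reasons.
rejected = toy.loc[toy["CODE_REJECT_REASON"] != "XAP", "CODE_REJECT_REASON"]
top_reasons = rejected.value_counts().head(3).index.tolist()

# One 0/1 column per top reason, then the per-client mean gives
# rejections for that reason / total previous applications.
dummies = pd.get_dummies(toy["CODE_REJECT_REASON"])[top_reasons].astype(int)
dummies.columns = [f"REJECT_{reason}" for reason in top_reasons]
per_client = dummies.groupby(toy["SK_ID_CURR"]).mean().add_suffix("_RATIO")

# Binary flag: was the client ever rejected for the most common reason?
top_col = f"REJECT_{top_reasons[0]}_RATIO"
per_client["HAS_TOP_REJECT_REASON"] = (per_client[top_col] > 0).astype(int)
print(per_client)
```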

    NAME_TYPE_SUITE - Who accompanied client when applying for the previous application

Categorical, 49.1% missing values, 7 distinct values, imbalanced distribution: Unaccompanied 30.5%, Family 12.8%, Spouse, partner 4.0%.

    NAME_CLIENT_TYPE - Was the client old or new client when applying for the previous application

Categorical, no missing values, 4 distinct values, imbalanced distribution: Repeater 73.7%, New 18.0%, Refreshed 8.1%, XNA 0.1%.

    NAME_GOODS_CATEGORY - What kind of goods did the client apply for in the previous application

High correlation with NAME_SELLER_INDUSTRY.

Categorical, no missing values, 28 distinct values, imbalanced distribution: XNA 56.9%, Mobile 13.5%, Consumer Electronics 7.3%.

    NAME_PORTFOLIO - Was the previous application for CASH, POS, CAR, …

High correlation with DAYS_FIRST_DRAWING, DAYS_LAST_DUE_1ST_VERSION, NAME_CONTRACT_TYPE, NAME_PRODUCT_TYPE, PRODUCT_COMBINATION, RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED.

Categorical, no missing values, 5 distinct values, imbalanced distribution: POS 41.4%, Cash 27.6%, XNA 22.3%, Cards 8.7%, Cars < 0.1%.

    NAME_PRODUCT_TYPE - Was the previous application x-sell or walk-in

High correlation with NAME_PORTFOLIO, PRODUCT_COMBINATION, RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED.

Categorical, no missing values, 3 distinct values, imbalanced distribution: XNA 63.7%, x-sell 27.3%, walk-in 9.0%.

    CHANNEL_TYPE - Through which channel we acquired the client on the previous application

Categorical, no missing values, 8 distinct values, imbalanced distribution: Credit and cash offices 43.1%, Country-wide 29.6%, Stone 12.7%, Regional / Local 6.5%.

    SELLERPLACE_AREA - Selling area of seller place of the previous application

Numerical, no missing values, 3.6% zeros.

Minimum -1, Maximum 4,000,000, Mean 313.95. Right skewed. Outliers.

    NAME_SELLER_INDUSTRY - The industry of the seller

High correlation with NAME_GOODS_CATEGORY.

Categorical, no missing values, 11 distinct values, distribution: XNA 51.2%, Consumer electronics 23.8%, Connectivity 16.5%.

    CNT_PAYMENT - Term of previous credit at application of the previous application

Numerical, 22.3% missing values, 8.7% zeros, no negative values.

Minimum 0, Maximum 84, Mean 16.1. Right skewed. Outliers.

    NAME_YIELD_GROUP - Grouped interest rate into small medium and high of the previous application,grouped

High correlation with DAYS_FIRST_DRAWING, DAYS_LAST_DUE_1ST_VERSION, PRODUCT_COMBINATION, RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED.

Categorical, no missing values, 5 distinct values, distribution: XNA 31.0%, middle 23.1%, high 21.2%, low_normal 19.3%, low_action 5.5%.

    PRODUCT_COMBINATION - Detailed product combination of the previous application

High correlation with DAYS_FIRST_DRAWING, DAYS_LAST_DUE_1ST_VERSION, NAME_CONTRACT_TYPE, NAME_PORTFOLIO, NAME_PRODUCT_TYPE, NAME_YIELD_GROUP.

Categorical, <0.1% missing values, 17 distinct values, distribution: Cash 17.1%, POS household with interest 15.8%, POS mobile with interest 13.2%, Cash X-Sell: middle 8.6%, Cash X-Sell: low 7.8%.

    DAYS_FIRST_DRAWING - Relative to application date of current application when was the first disbursement of the previous application,time only relative to the application

High correlation with FLAG_LAST_APPL_PER_CONTRACT, NAME_CONTRACT_STATUS, NAME_CONTRACT_TYPE, NAME_PORTFOLIO, NAME_YIELD_GROUP, PRODUCT_COMBINATION.

Minimum -2922 (~8 years), Maximum 365,243 (~1,000 years, count 934,444, 55.9%), Mean 342,209.86. Anomalies.

    Convert to years, flag anomalies.

    DAYS_FIRST_DUE - Relative to application date of current application when was the first due supposed to be of the previous application,time only relative to the application

High correlation with DAYS_DECISION, DAYS_LAST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_TERMINATION, FLAG_LAST_APPL_PER_CONTRACT, NAME_CONTRACT_STATUS. 

Numerical, 40.3% missing values, no zeros.

Minimum -2922 (~8 years), Maximum 365,243 (~1,000 years, count 934,444, 55.9%), Mean 342,209.86. Anomalies.

    Convert to years, flag anomalies.

    DAYS_LAST_DUE_1ST_VERSION - Relative to application date of current application when was the first due of the previous application,time only relative to the application

High correlation with DAYS_DECISION, DAYS_FIRST_DUE, DAYS_LAST_DUE, DAYS_TERMINATION, FLAG_LAST_APPL_PER_CONTRACT, NAME_CONTRACT_STATUS, NAME_CONTRACT_TYPE, NAME_PORTFOLIO, NAME_YIELD_GROUP, PRODUCT_COMBINATION.

Numerical, 40.3% missing values, <0.1% zeros.

Minimum -2801 (~8 years), Maximum 365,243 (~1,000 years, count 93,864, 5.6%), Mean 33,767.8. Anomalies.

    Convert to years, flag anomalies.

    DAYS_LAST_DUE - Relative to application date of current application when was the last due date of the previous application,time only relative to the application

High correlation with  DAYS_DECISION, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_TERMINATION, FLAG_LAST_APPL_PER_CONTRACT, NAME_CONTRACT_STATUS.

Numerical, 40.3% missing values, no zeros.

Minimum -2889 (~8 years), Maximum 365,243 (~1,000 years, count 211,221, 12.6%), Mean 76,582.4. Anomalies.

    Convert to years, flag anomalies.

    DAYS_TERMINATION - Relative to application date of current application when was the expected termination of the previous application,time only relative to the application

High correlation with  DAYS_DECISION, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_LAST_DUE, FLAG_LAST_APPL_PER_CONTRACT, NAME_CONTRACT_STATUS.

Numerical, 40.3% missing values, no zeros.

Minimum -2874 (~8 years), Maximum 365,243 (~1,000 years, count 225,913, 13.5%), Mean 81,992.3. Anomalies.

    Convert to years, flag anomalies.

    NFLAG_INSURED_ON_APPROVAL - Did the client request insurance during the previous application

High correlation with FLAG_LAST_APPL_PER_CONTRACT, NAME_CONTRACT_STATUS.

Categorical, 40.3% missing values, 2 distinct values, distribution: 0.0 39.8%, 1.0 19.9%.

### Correlation

We will analyze the relationships between features using the ydata-profiling report. This report provides a comprehensive overview of the data, including an automated correlation matrix for all features.

To determine which features are most impactful for our model, we will use a more robust method: LightGBM's feature importance. After aggregating the columns from specific datasets into our main dataset, the LightGBM model will automatically calculate the importance of each feature in predicting the target variable. This approach is superior as it directly assesses a feature's predictive power within the context of our chosen model, providing a more reliable measure of its relationship with the target.

**Feature Relationships**

High correlation (Ydata Report):
    
    NAME_CONTRACT_TYPE, DAYS_FIRST_DRAWING, DAYS_LAST_DUE_1ST_VERSION, NAME_PORTFOLIO, PRODUCT_COMBINATION, RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED
    
    AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_GOODS_PRICE.
    
    AMT_APPLICATION, AMT_CREDIT, AMT_GOODS_PRICE, CNT_PAYMENT.
    
    AMT_CREDIT, AMT_GOODS_PRICE.

    AMT_DOWN_PAYMENT, RATE_DOWN_PAYMENT.

    FLAG_LAST_APPL_PER_CONTRACT, DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_TERMINATION, NFLAG_INSURED_ON_APPROVAL, NFLAG_LAST_APPL_IN_DAY, RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED.
    
    RATE_INTEREST_PRIMARY, NAME_CASH_LOAN_PURPOSE, NAME_CONTRACT_STATUS, NAME_PORTFOLIO, NAME_PRODUCT_TYPE, NAME_YIELD_GROUP, RATE_INTEREST_PRIVILEGED.
    
    RATE_INTEREST_PRIVILEGED, NAME_CASH_LOAN_PURPOSE, NAME_CONTRACT_STATUS, NAME_PORTFOLIO, NAME_PRODUCT_TYPE, NAME_YIELD_GROUP.
    
    NAME_CASH_LOAN_PURPOSE, NAME_CONTRACT_STATUS, NAME_PRODUCT_TYPE, NFLAG_INSURED_ON_APPROVAL.
    
    NAME_CONTRACT_STATUS, CODE_REJECT_REASON, DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_TERMINATION, NFLAG_INSURED_ON_APPROVAL.
    
    DAYS_DECISION, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_LAST_DUE, DAYS_TERMINATION.
    
    NAME_GOODS_CATEGORY, NAME_SELLER_INDUSTRY.
    
    NAME_PORTFOLIO, DAYS_FIRST_DRAWING, DAYS_LAST_DUE_1ST_VERSION, NAME_PRODUCT_TYPE, PRODUCT_COMBINATION.
    
    NAME_PRODUCT_TYPE, PRODUCT_COMBINATION.
    
    NAME_YIELD_GROUP, DAYS_FIRST_DRAWING, DAYS_LAST_DUE_1ST_VERSION, PRODUCT_COMBINATION.
    
    PRODUCT_COMBINATION, DAYS_FIRST_DRAWING, DAYS_LAST_DUE_1ST_VERSION.
    
    DAYS_FIRST_DUE, DAYS_LAST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_TERMINATION.
    
    DAYS_LAST_DUE_1ST_VERSION, DAYS_LAST_DUE, DAYS_TERMINATION.
    
    DAYS_LAST_DUE, DAYS_TERMINATION.
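
For the numeric columns, a list of highly correlated pairs like the one above could also be reproduced directly in pandas. The sketch below uses Pearson correlation and the 0.7 threshold on synthetic toy data. The profiling report uses other association measures for categorical features, so this covers only part of the list and is not the method that produced it.

```python
import numpy as np
import pandas as pd

# Synthetic toy data: AMT_CREDIT is built to track AMT_APPLICATION closely.
rng = np.random.default_rng(0)
amt_application = rng.uniform(10_000, 500_000, size=200)
toy = pd.DataFrame(
    {
        "AMT_APPLICATION": amt_application,
        "AMT_CREDIT": amt_application * 1.1 + rng.normal(0, 5_000, size=200),
        "HOUR_APPR_PROCESS_START": rng.integers(0, 24, size=200),
    }
)

corr = toy.corr().abs()  # Pearson by default
# Keep each pair once by masking the diagonal and the lower triangle.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack().loc[lambda s: s > 0.7].sort_values(ascending=False)
print(high_pairs)
```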

## 3. Summary

**Key EDA findings for Previous application:**

    - Total features: 37 (numeric 19, categorical 17, boolean 1), rows: ~ 1.67M,

    - Missing cells: 18.0%, rows with missing values: 100%,
    
    - Missing values (>15%):
        RATE_INTEREST_PRIVILEGED   99.6%
        RATE_INTEREST_PRIMARY      99.6%
        AMT_DOWN_PAYMENT           53.6%
        RATE_DOWN_PAYMENT          53.6%
        NAME_TYPE_SUITE            49.1%
        NFLAG_INSURED_ON_APPROVAL  40.3%
        DAYS_TERMINATION           40.3%
        DAYS_LAST_DUE              40.3%
        DAYS_LAST_DUE_1ST_VERSION  40.3%
        DAYS_FIRST_DUE             40.3%
        DAYS_FIRST_DRAWING         40.3%
        AMT_GOODS_PRICE            23.1%
        AMT_ANNUITY                22.3%
        CNT_PAYMENT                22.3%
        
    - Negative values (>40%):
        DAYS_DECISION - 100.0%
        SELLERPLACE_AREA - 45.7%
        DAYS_FIRST_DUE - 57.3%
        DAYS_LAST_DUE_1ST_VERSION - 40.6%
        DAYS_LAST_DUE - 47.1%
        DAYS_TERMINATION - 46.2%
        

    - Zeros (>20%):
        AMT_APPLICATION - 23.5%
        AMT_CREDIT - 20.2%
        AMT_DOWN_PAYMENT - 22.1%
        RATE_DOWN_PAYMENT - 22.1%
        

    - Strong correlations (>0.7):
        Mentioned in Feature Relationships (too many features to duplicate here)
    
    - Duplicates: none

**Planned Feature Engineering: previous application**

The previous_application table contains the client’s historical applications and outcomes with Home Credit. It is highly predictive of future risk behavior.

    Preprocessing (Before Feature Engineering)

        - Convert all DAYS_* columns into YEARS_* (divide by -365).
        
        - Create anomaly flags for invalid values (e.g., 1000 years).
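
The two preprocessing steps can be sketched as below. The value 365243 is taken as the anomaly marker, flagged, and set to NaN before converting to years. The flag names (`FLAG_*_ANOMALY`) are hypothetical, and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # about 1,000 years, treated as an anomaly

toy = pd.DataFrame(
    {
        "DAYS_DECISION": [-73, -730, -365],
        "DAYS_TERMINATION": [-37.0, 365243.0, np.nan],
    }
)

days_cols = [c for c in toy.columns if c.startswith("DAYS_")]
for col in days_cols:
    suffix = col[len("DAYS_"):]
    is_anomaly = toy[col].eq(SENTINEL)
    toy[f"FLAG_{suffix}_ANOMALY"] = is_anomaly.astype(int)  # hypothetical name
    # Blank out the sentinel, then turn negative day offsets into positive years.
    toy[f"YEARS_{suffix}"] = toy[col].mask(is_anomaly) / -365

toy = toy.drop(columns=days_cols)
print(toy)
```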

**Planned Feature Engineering Steps**

    1. Application Status Flags

        WAS_APPROVED – 1 if NAME_CONTRACT_STATUS == "Approved", else 0.
        
        WAS_REFUSED – 1 if NAME_CONTRACT_STATUS == "Refused", else 0.
        
        WAS_CANCELED - 1 if NAME_CONTRACT_STATUS == "Canceled", else 0.

            - Captures approval history, refusal history.

    2. Financial Features & Ratios

        APP_CREDIT_DIFF = AMT_APPLICATION - AMT_CREDIT

        APP_CREDIT_RATIO = AMT_CREDIT / AMT_APPLICATION

            - Reveals whether the client typically applies for more/less than what is granted.
            - Strong predictor of over-optimistic or risky applications.

    3. Aggregated Financials

        AMT_APPLICATION, AMT_CREDIT, AMT_ANNUITY, AMT_DOWN_PAYMENT, AMT_GOODS_PRICE
        - aggregated with mean, max.
            Shows the typical loan size, repayment burden, and upfront down payment.

    4. Loan Structure Features

        CNT_PAYMENT (number of installments requested).
        - aggregated with mean, max.

        YEARS_DECISION (recency of applications).
        - aggregated with min, max, mean.
            Recent and frequent refusals are a strong risk signal.
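
The aggregations in steps 3 and 4 could be written as a single `groupby().agg()` call per client, with the resulting columns flattened into the `PREV_<column>_<stat>` naming used in the must-keep list. This is a minimal sketch on toy data covering a subset of the columns.

```python
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "AMT_CREDIT": [17145.0, 679671.0, 136444.5],
        "CNT_PAYMENT": [12.0, 36.0, 12.0],
        "YEARS_DECISION": [0.2, 0.45, 0.82],
    }
)

agg_spec = {
    "AMT_CREDIT": ["mean", "max"],
    "CNT_PAYMENT": ["mean", "max"],
    "YEARS_DECISION": ["min", "max", "mean"],
}
prev_agg = toy.groupby("SK_ID_CURR").agg(agg_spec)
# Flatten the (column, statistic) MultiIndex into names like PREV_AMT_CREDIT_mean.
prev_agg.columns = [f"PREV_{col}_{stat}" for col, stat in prev_agg.columns]
print(prev_agg)
```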

    5. Diversity Features

        PRODUCT_DIVERSITY = number of unique NAME_CONTRACT_TYPE values per client.
            Indicates whether the client is experimenting with multiple product types (higher risk).
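
A minimal sketch of the diversity feature, assuming it is a plain count of distinct contract types per client (toy data):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2],
        "NAME_CONTRACT_TYPE": [
            "Cash loans",
            "Consumer loans",
            "Cash loans",
            "Revolving loans",
        ],
    }
)

# Number of distinct product types each client has applied for.
product_diversity = (
    toy.groupby("SK_ID_CURR")["NAME_CONTRACT_TYPE"]
    .nunique()
    .rename("PREV_NAME_CONTRACT_TYPE_PRODUCT_DIVERSITY")
)
print(product_diversity)
```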

    6. Count Features

        PREV_APP_COUNT – total number of previous applications.
        
        PREV_APP_APPROVED_COUNT – number of approved applications.
        
        PREV_APP_REFUSED_COUNT – number of refused applications.

            - Absolute volumes of approvals/refusals.

    7. Ratio Features

        PREV_APPROVAL_RATE = approved / total.
        
        PREV_REFUSAL_RATE = refused / total.
        
        PREV_REFUSAL_TO_APPROVAL = refused / approved.

            - Proportional history of risk-taking vs acceptance.
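
Steps 6 and 7 can be sketched together: build the status flags, count per client, then derive the ratios. For PREV_REFUSAL_TO_APPROVAL this sketch returns NaN when a client has no approvals; that choice is an assumption, since the plan does not say how to handle a zero denominator.

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2, 2],
        "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Refused", "Canceled"],
    }
)
toy["WAS_APPROVED"] = toy["NAME_CONTRACT_STATUS"].eq("Approved").astype(int)
toy["WAS_REFUSED"] = toy["NAME_CONTRACT_STATUS"].eq("Refused").astype(int)

counts = toy.groupby("SK_ID_CURR").agg(
    PREV_APP_COUNT=("NAME_CONTRACT_STATUS", "count"),
    PREV_APP_APPROVED_COUNT=("WAS_APPROVED", "sum"),
    PREV_APP_REFUSED_COUNT=("WAS_REFUSED", "sum"),
)
counts["PREV_APPROVAL_RATE"] = counts["PREV_APP_APPROVED_COUNT"] / counts["PREV_APP_COUNT"]
counts["PREV_REFUSAL_RATE"] = counts["PREV_APP_REFUSED_COUNT"] / counts["PREV_APP_COUNT"]
# Assumption: clients with no approvals get NaN instead of an infinite ratio.
approved = counts["PREV_APP_APPROVED_COUNT"].replace(0, np.nan)
counts["PREV_REFUSAL_TO_APPROVAL"] = counts["PREV_APP_REFUSED_COUNT"] / approved
print(counts)
```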
    
    8. Must keep list
    
        "PREV_APP_COUNT" - Total number of previous applications submitted by the client.
        
        "PREV_APP_APPROVED_COUNT" – Number of approved previous applications.
        
        "PREV_APP_REFUSED_COUNT" – Number of refused previous applications.
        
        "PREV_APPROVAL_RATE" – Ratio of approved applications to total applications.
        
        "PREV_REFUSAL_RATE" – Ratio of refused applications to total applications.
        
        "PREV_REFUSAL_TO_APPROVAL" – Ratio of refusals to approvals.
        
        "PREV_AMT_APPLICATION_mean" – Average amount requested in previous applications.
        
        "PREV_AMT_CREDIT_mean" – Average credit amount granted in previous applications.
        
        "PREV_AMT_ANNUITY_mean" – Average annuity value across previous applications.
        
        "PREV_AMT_DOWN_PAYMENT_mean" – Average down payment amount in previous applications.
        
        "PREV_AMT_GOODS_PRICE_mean" – Average goods price requested in previous applications.
        
        "PREV_CNT_PAYMENT_mean" – Average number of installments across previous applications.
        
        "PREV_APP_CREDIT_DIFF_mean" – Average difference between requested and granted credit.

        "PREV_APP_CREDIT_RATIO_mean" – Average ratio of granted to requested credit.

        "PREV_YEARS_DECISION_min" – Years since the most recent previous decision (minimum).

        "PREV_YEARS_DECISION_max" – Years since the earliest previous decision (maximum).
        
        "PREV_NAME_CONTRACT_TYPE_PRODUCT_DIVERSITY" – Number of unique product types in previous contracts.

    9. Feature Selection

        Use LightGBM importance + ROC-AUC ranking to select top features.

        Merge selected features to main data frame for model training.
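
The final merge step might look like the sketch below: a left join on SK_ID_CURR keeps every row of the main table, and clients without previous applications end up with NaN in the PREV_* columns. The frame names `application` and `prev_features` and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical main table and per-client feature table.
application = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
prev_features = pd.DataFrame(
    {"PREV_APP_COUNT": [3, 2], "PREV_APPROVAL_RATE": [2 / 3, 0.0]},
    index=pd.Index([1, 2], name="SK_ID_CURR"),
)

merged = application.merge(
    prev_features.reset_index(),
    on="SK_ID_CURR",
    how="left",
    validate="many_to_one",  # fails loudly if the feature table has duplicate clients
)
print(merged)
```

Whether to fill the resulting NaN values (for example, a count of 0 for clients with no history) or leave them for the model to handle is a separate modeling decision.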