# **DATA PREPROCESSING BASED ON EDA INSIGHTS**

This notebook implements the preprocessing steps recommended in the EDA report, so that each transformation is grounded in a pattern observed in the data.

Based on the EDA report, I will:
1. **Handle Missing Values** - Confirm that no missing values remain after the EDA cleaning
2. **Handle Skewed Variables** - Applicant Income, Coapplicant Income and Loan Amount
3. **Outlier Treatment** - IQR-capping for extreme Applicant Income, Coapplicant Income and Loan Amount
4. **Feature Engineering** - Create derived and interaction features
    - Total Income (TotalIncome = ApplicantIncome + CoapplicantIncome)
    - Debt-to-Income Ratio (DTI = LoanAmount / TotalIncome)
    - Equated Monthly Instalment Feature (EMI = LoanAmount / Loan_Amount_Term)
5. **Feature Selection** - Keep high-signal features, evaluate low-signal ones
6. **Scaling** - StandardScaler for distance-based models
7. **Target Handling** - Classification approach with stratified splits
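
Steps 6 and 7 are only named in the plan above, so here is a minimal sketch of how they might fit together. It assumes scikit-learn is available, uses `Loan_Status_val` as the target, and fits the scaler on the training split only so that no test-set statistics leak into training. The toy values, `test_size=0.25` and `random_state=42` are illustrative assumptions, not settings taken from the EDA report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy values for illustration only (not rows from EDA_data.csv).
toy = pd.DataFrame({
    "ApplicantIncome": [2500.0, 3000.0, 4000.0, 4500.0, 5000.0, 5500.0, 6000.0, 7000.0],
    "LoanAmount": [60.0, 70.0, 100.0, 120.0, 130.0, 140.0, 150.0, 200.0],
    "Loan_Status_val": [0, 1, 0, 1, 0, 1, 0, 1],
})

X = toy[["ApplicantIncome", "LoanAmount"]]
y = toy["Loan_Status_val"]

# Stratify on the target so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(y_train.value_counts().to_dict(), y_test.value_counts().to_dict())
```

In the notebook itself, `toy` would be replaced by the preprocessed `df`, and the feature list by whatever survives feature selection.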

In [1]:
# Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [3]:
# Load the cleaned data exported from the EDA notebook
df = pd.read_csv("EDA_data.csv")

In [4]:
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | Loan_Status_val |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001003 | Male | Yes | 1 | Graduate | No | 4583.0 | 1508.0 | 128.0 | 360 | 1 | Rural | N | 0 |
| 1 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000.0 | 0.0 | 66.0 | 360 | 1 | Urban | Y | 1 |
| 2 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583.0 | 2358.0 | 120.0 | 360 | 1 | Urban | Y | 1 |
| 3 | LP001008 | Male | No | 0 | Graduate | No | 6000.0 | 0.0 | 141.0 | 360 | 1 | Urban | Y | 1 |
| 4 | LP001011 | Male | Yes | 2 | Graduate | Yes | 5417.0 | 4196.0 | 267.0 | 360 | 1 | Urban | Y | 1 |


## 1. **Handle Missing Values**

In [5]:
# Check whether any missing values remain
df.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
Loan_Status_val      0
dtype: int64

## 2. **Handle Skewed Variables**
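
The plan lists ApplicantIncome, CoapplicantIncome and LoanAmount as the skewed variables, but the notebook does not record which transform was applied. The sketch below assumes a `log1p` transform, a common choice for right-skewed, non-negative values that also handles the zeros in CoapplicantIncome. The helper name `add_log_features` and the `_log` column suffix are hypothetical, and the inline frame simply reuses the five rows shown by `df.head()`.

```python
import numpy as np
import pandas as pd

# The five rows from df.head(); in the notebook, use the loaded `df` instead.
sample = pd.DataFrame({
    "ApplicantIncome": [4583.0, 3000.0, 2583.0, 6000.0, 5417.0],
    "CoapplicantIncome": [1508.0, 0.0, 2358.0, 0.0, 4196.0],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0, 267.0],
})

skewed_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]


def add_log_features(frame, cols):
    """Return a copy of `frame` with log1p versions of `cols` (assumed transform)."""
    out = frame.copy()
    for col in cols:
        # log1p(x) = log(1 + x), so a zero income maps to 0 instead of -inf.
        out[f"{col}_log"] = np.log1p(out[col])
    return out


transformed = add_log_features(sample, skewed_cols)

# Compare skewness before and after the transform.
print(sample[skewed_cols].skew())
print(transformed[[f"{c}_log" for c in skewed_cols]].skew())
```

Keeping the original columns alongside the `_log` versions makes it easy to compare distributions before deciding which set to pass to the model.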



## 3. **Outlier Treatment** - IQR-capping for extreme Applicant Income, Coapplicant Income and Loan Amount
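
IQR-capping can be sketched as follows: values outside `[Q1 - k*IQR, Q3 + k*IQR]` are clipped to the nearest fence rather than dropped, so no rows are lost. The multiplier `k=1.5` is the conventional default and an assumption here, the helper name `cap_outliers_iqr` is hypothetical, and the 81000.0 entry is an invented extreme value added only to show the cap taking effect.

```python
import pandas as pd


def cap_outliers_iqr(frame, cols, k=1.5):
    """Return a copy of `frame` with `cols` clipped to the IQR fences."""
    out = frame.copy()
    for col in cols:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - k * iqr
        upper = q3 + k * iqr
        # Clip instead of dropping rows, so the sample size is preserved.
        out[col] = out[col].clip(lower=lower, upper=upper)
    return out


# Five incomes from df.head() plus one invented extreme value (81000.0).
sample = pd.DataFrame(
    {"ApplicantIncome": [2583.0, 3000.0, 4583.0, 5417.0, 6000.0, 81000.0]}
)

capped = cap_outliers_iqr(sample, ["ApplicantIncome"])
print(capped["ApplicantIncome"].tolist())
```

In the notebook, the same helper would be called on `df` with `["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]`. If a train/test split is used, computing the fences on the training split only avoids leaking test-set information.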

## 4. **Feature Engineering** - Create derived and interaction features

-   Total Income (TotalIncome = ApplicantIncome + CoapplicantIncome)
-   Debt-to-Income Ratio (DTI = LoanAmount / TotalIncome)
-   Equated Monthly Instalment Feature (EMI = LoanAmount / Loan_Amount_Term)
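
The three formulas above translate directly into column arithmetic. The sketch below applies them to the five rows shown by `df.head()`. The helper name `add_engineered_features` is hypothetical, and replacing zero denominators with `NaN` is an assumption added to avoid infinite ratios; the notebook does not state how such rows should be handled.

```python
import numpy as np
import pandas as pd

# The five rows from df.head(); in the notebook, use the loaded `df` instead.
sample = pd.DataFrame({
    "ApplicantIncome": [4583.0, 3000.0, 2583.0, 6000.0, 5417.0],
    "CoapplicantIncome": [1508.0, 0.0, 2358.0, 0.0, 4196.0],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0, 267.0],
    "Loan_Amount_Term": [360, 360, 360, 360, 360],
})


def add_engineered_features(frame):
    """Return a copy of `frame` with TotalIncome, DTI and EMI columns added."""
    out = frame.copy()
    out["TotalIncome"] = out["ApplicantIncome"] + out["CoapplicantIncome"]
    # Assumption: treat a zero denominator as missing rather than producing inf.
    out["DTI"] = out["LoanAmount"] / out["TotalIncome"].replace(0, np.nan)
    out["EMI"] = out["LoanAmount"] / out["Loan_Amount_Term"].replace(0, np.nan)
    return out


features = add_engineered_features(sample)
print(features[["TotalIncome", "DTI", "EMI"]])
```

Note that `EMI` as defined here is a simple amount-per-term ratio with no interest component, matching the formula in the list above.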