# **DATA PREPROCESSING BASED ON EDA INSIGHTS**

This notebook implements the preprocessing steps recommended in the EDA report, so that every transformation is backed by a pattern observed in the data.
Based on the EDA report, I will:

    ✅ Handle missing values.
    ✅ Handle skewness.
    ✅ Create engineered features (TotalIncome, Log_TotalIncome, DTI, EMI, Flag).
    ✅ Encode categorical variables.
    ✅ Handle outliers.
    ✅ Scale numeric data.
    ✅ Split the dataset into training and testing subsets.

In [202]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Preprocessing libraries
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

In [203]:
# Load the dataset saved at the end of the EDA
df_process = pd.read_csv("EDA_data.csv")


In [204]:
df_process.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001003 | Male | Yes | 1 | Graduate | No | 4583.0 | 1508.0 | 128.0 | 360 | 1 | Rural | N |
| 1 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000.0 | 0.0 | 66.0 | 360 | 1 | Urban | Y |
| 2 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583.0 | 2358.0 | 120.0 | 360 | 1 | Urban | Y |
| 3 | LP001008 | Male | No | 0 | Graduate | No | 6000.0 | 0.0 | 141.0 | 360 | 1 | Urban | Y |
| 4 | LP001011 | Male | Yes | 2 | Graduate | Yes | 5417.0 | 4196.0 | 267.0 | 360 | 1 | Urban | Y |


In [205]:
# Work on a copy so the loaded frame stays untouched
df = df_process.copy()

## **Handle missing values**

In [206]:

# Check whether any missing values remain after the EDA cleaning
df.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
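
The check above finds nothing left to impute in this file. For a file that did still contain gaps, one common approach is median imputation for numeric columns and mode imputation for categorical ones. The sketch below illustrates that idea on a made-up toy frame; `impute_simple` and the toy values are hypothetical, and this is not necessarily the method used in the EDA.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 120.0, 300.0],
})

def impute_simple(frame):
    """Fill numeric gaps with the median and all other gaps with the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

toy_filled = impute_simple(toy)
print(toy_filled.isna().sum())
```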

## Handle Skewed Variables



In [207]:
# The EDA found these variables right-skewed, so apply a log transform.
# log1p computes log(1 + x), so zero values (e.g. in CoapplicantIncome) map to 0.
df["ApplicantIncome"] = np.log1p(df["ApplicantIncome"])
df["CoapplicantIncome"] = np.log1p(df["CoapplicantIncome"])
df["LoanAmount"] = np.log1p(df["LoanAmount"])
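
To see why a log transform helps with right skew, the sketch below compares the skewness of a small made-up income series before and after `np.log1p`, and shows that `np.expm1` undoes the transform. The series values are hypothetical and chosen only to include one extreme income.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme value to create right skew.
raw_income = pd.Series([1500.0, 2500.0, 3000.0, 4000.0, 6000.0, 81000.0])
logged_income = np.log1p(raw_income)

print("skew before:", raw_income.skew())
print("skew after: ", logged_income.skew())

# expm1 is the inverse of log1p, so the original scale can be recovered.
recovered = np.expm1(logged_income)
```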

## **Feature Engineering** - Create interaction features

-   Total Income (TotalIncome = ApplicantIncome + CoapplicantIncome)
-   Debt-to-Income Ratio (DTI = LoanAmount / TotalIncome)
-   Equated Monthly Instalment Feature (EMI = LoanAmount / Loan_Amount_Term)

In [208]:
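# Note: the income and loan columns were log-transformed in the previous cell,
# so the features below are computed on the log scale.

# Total income
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# Debt-to-income ratio (DTI)
df["DTI"] = df["LoanAmount"] / df["TotalIncome"]

# Equated Monthly Instalment (EMI)
df["EMI"] = df["LoanAmount"] / df["Loan_Amount_Term"]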
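
Because this notebook log-transforms the income and loan columns before this cell, the features here combine logged values (for example, TotalIncome becomes a sum of two logs). A possible alternative, sketched below under the assumption that the untransformed columns are still available, is to build the ratios on the raw scale and log only the total afterwards. `add_raw_scale_features` is a hypothetical helper, and the two rows are copied from the `df_process.head()` output.

```python
import numpy as np
import pandas as pd

def add_raw_scale_features(frame):
    """Hypothetical helper: derive features from untransformed columns."""
    out = frame.copy()
    out["TotalIncome"] = out["ApplicantIncome"] + out["CoapplicantIncome"]
    out["Log_TotalIncome"] = np.log1p(out["TotalIncome"])
    # A zero TotalIncome would give an infinite DTI; not handled in this sketch.
    out["DTI"] = out["LoanAmount"] / out["TotalIncome"]
    out["EMI"] = out["LoanAmount"] / out["Loan_Amount_Term"]
    return out

# First two rows of the loaded data, before any log transform.
raw = pd.DataFrame({
    "ApplicantIncome": [4583.0, 3000.0],
    "CoapplicantIncome": [1508.0, 0.0],
    "LoanAmount": [128.0, 66.0],
    "Loan_Amount_Term": [360, 360],
})
raw_features = add_raw_scale_features(raw)
print(raw_features[["TotalIncome", "Log_TotalIncome", "DTI", "EMI"]])
```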


## **Encode categorical variables**

In [209]:
# Identify the categorical (object-dtype) columns that need numeric encoding
categorical_col= df.select_dtypes(include="object").columns
categorical_col

Index(['Loan_ID', 'Gender', 'Married', 'Education', 'Self_Employed',
       'Property_Area', 'Loan_Status'],
      dtype='object')

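I will label-encode the binary columns and one-hot encode the multi-category column (Property_Area). Loan_ID is an identifier rather than a feature, so it is left unencoded and dropped before modelling.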

In [210]:
# Label encoding for binary columns (LabelEncoder assigns codes in sorted class order)
le = LabelEncoder()
binary_col = ['Gender', 'Married', 'Education', 'Self_Employed', 'Loan_Status']
for col in binary_col:
    df[col] = le.fit_transform(df[col])

# One-hot encoding for the multi-category column
df = pd.get_dummies(df, columns=["Property_Area"], drop_first=True)

# Inspect the encoded frame
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Loan_Status | TotalIncome | DTI | EMI | Property_Area_Semiurban | Property_Area_Urban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001003 | 1 | 1 | 1 | 0 | 0 | 8.430327 | 7.319202 | 4.859812 | 360 | 1 | 0 | 15.74953 | 0.308569 | 0.013499 | False | False |
| 1 | LP001005 | 1 | 1 | 0 | 0 | 1 | 8.006701 | 0.0 | 4.204693 | 360 | 1 | 1 | 8.006701 | 0.525147 | 0.01168 | False | True |
| 2 | LP001006 | 1 | 1 | 0 | 1 | 0 | 7.857094 | 7.765993 | 4.795791 | 360 | 1 | 1 | 15.623087 | 0.306968 | 0.013322 | False | True |
| 3 | LP001008 | 1 | 0 | 0 | 0 | 0 | 8.699681 | 0.0 | 4.955827 | 360 | 1 | 1 | 8.699681 | 0.569656 | 0.013766 | False | True |
| 4 | LP001011 | 1 | 1 | 2 | 0 | 1 | 8.597482 | 8.342125 | 5.590987 | 360 | 1 | 1 | 16.939607 | 0.330054 | 0.015531 | False | True |
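
To make the mapping behind these codes explicit, the sketch below repeats both encodings on a tiny made-up frame. It keeps one `LabelEncoder` per column in a dictionary (the `encoders` name is an assumption, not part of the notebook) so that the fitted classes can be inspected or inverted later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame using the category labels seen in this dataset.
toy = pd.DataFrame({
    "Loan_Status": ["N", "Y", "Y"],
    "Property_Area": ["Rural", "Urban", "Semiurban"],
})

# One encoder per column, so each mapping stays available after the loop.
encoders = {}
for col in ["Loan_Status"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# drop_first=True drops the first category in sorted order (Rural here).
toy = pd.get_dummies(toy, columns=["Property_Area"], drop_first=True)

status_classes = list(encoders["Loan_Status"].classes_)
print(status_classes)  # position in this list is the integer code
print(toy)
```

Classes are sorted before codes are assigned, which is why `Loan_Status` "N" appears as 0 in the output above.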


## **Outlier Treatment**
IQR capping for extreme values of ApplicantIncome, CoapplicantIncome and LoanAmount

In [211]:
# Define a function that caps a column at the IQR fences
def IQR_Clipping(column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Values below the lower fence or above the upper fence are set to the fence
    df[column] = df[column].clip(lower=lower, upper=upper)


In [213]:
for col in ['ApplicantIncome','CoapplicantIncome','LoanAmount']:
    IQR_Clipping(col)
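
As a small worked example of the capping rule, the sketch below applies the same fences to a five-value series. `iqr_clip` is a hypothetical standalone variant that takes a Series and returns a new one instead of modifying `df` in place.

```python
import pandas as pd

def iqr_clip(series):
    """Hypothetical pure-function variant of the IQR capping rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
# Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7.
clipped = iqr_clip(values)
print(clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```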

## **Scale numeric data**

In [None]:
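# Standardize numeric features (zero mean, unit variance) so that scale-sensitive
# algorithms such as Logistic Regression or SVM perform better.
# Note: fitting on the full frame lets test rows influence the scaling statistics.

scaler = StandardScaler()
num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'TotalIncome', 'Loan_Amount_Term', 'EMI', 'DTI']

df[num_cols] = scaler.fit_transform(df[num_cols])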
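
The cell above fits the scaler on the whole dataset before the split. A frequently recommended alternative is to split first, fit the scaler on the training rows only, and reuse those statistics for the test rows. The sketch below shows that order on randomly generated toy data; the frame, its values and the `toy_*` names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for two of the numeric columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "LoanAmount": rng.normal(5.0, 1.0, 50),
    "EMI": rng.normal(0.01, 0.002, 50),
})
cols = ["LoanAmount", "EMI"]

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=234)
toy_train, toy_test = toy_train.copy(), toy_test.copy()

toy_scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only ...
toy_train[cols] = toy_scaler.fit_transform(toy_train[cols])
# ... and apply the same statistics to the test rows.
toy_test[cols] = toy_scaler.transform(toy_test[cols])
```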

## **Split dataset into training and testing subsets**

In [None]:
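# Separate the features from the target; Loan_ID is an identifier, not a predictor
x = df.drop(columns=['Loan_ID', 'Loan_Status'])
y = df["Loan_Status"]

# Stratified 80/20 split keeps the Loan_Status class balance in both subsets
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.2, random_state=234)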
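
To illustrate what `stratify=y` does, the sketch below splits a made-up target with a 70/30 class ratio and checks that both subsets keep that ratio. The toy arrays are hypothetical and unrelated to the loan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical target: 70 approvals (1) and 30 rejections (0).
toy_y = np.array([1] * 70 + [0] * 30)
toy_x = np.arange(100).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, stratify=toy_y, test_size=0.2, random_state=234
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```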