This notebook implements preprocessing steps based on the comprehensive EDA findings and recommendations. We'll follow the evidence-based approach from the EDA report to ensure our preprocessing aligns with the data patterns discovered.
Based on the EDA report, we will:

1. **Handle Skewed Variables** - Log-transform  `Applicant Income`, `Coapplicantr Income`, `Loan Amount`
2. **Outlier Treatment** - IQR-capping for  `Applicant Income`, `Coapplicantr Income`, `Loan Amount`
3. **Feature Engineering** - Create total income and interaction features such as Applicant Income + Coapplicant Income `(Total Income)`, Loan Amount  / Loan Amount Term `(Estimated Monthly Installments(EMI))`, and Loan Amount / Total Income `(Debt Repayment Ratio)`
4. **Feature Selection** - Keep high-signal features, evaluate low-signal ones
5. **Scaling** - StandardScaler for distance-based models
6. **Target Handling** - Classification approach with stratified splits

**Key EDA Evidence to Implement**

- **High-signal features**: `Applicant Income`, `Credit History`, `Coapplicant Income`, `Dependents`
- **Low-signal features**: `Applicant Income` (evaluate for removal)

In [1]:
#Import necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time
import zipfile
import warnings
import scipy.stats as stat
warnings.filterwarnings('ignore')

# Preprocessing libraries
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix

# Statistical Libraries
from scipy import stats
from scipy.stats import zscore, skew

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

print("Libraries imported successfully!")



Libraries imported successfully!


In [2]:
file = pd.read_csv('cleaned_eda_data.csv')
file.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y


#### **2. EDA-Based Data Quality Assessment**

In [None]:
# Cleaning the data

In [3]:
file.duplicated().sum()

0

In [4]:
file.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [None]:
skewed_vars = ['ApplicantIncome', 'CoapplicantIncome' 'LoanAmount']
for variable in skewed_vars:
    if variable in file.columns:
        skewness = skew(file[variable])
        print(f"{variable}: skewness = {skewness:.3f} ({'right-skewed' if skewness > 0.5 else 'approximately normal'})")

ApplicantIncome: skewness = 6.494 (right-skewed)


#### **3. Log-Transform Skewed Variables (EDA Recommendation)**

In [None]:
# Log-transform skewed variables as recommended by EDA
print("=== LOG-TRANSFORMING SKEWED VARIABLES ===")
print("EDA identified these variables as right-skewed and recommended log transformation:")

# Variables to log-transform based on EDA findings
skewed_vars = ['ApplicantIncome', 'CoapplicantIncome' 'LoanAmount']
