## **Data Preprocessing Based on EDA Insights**

This notebook implements the preprocessing steps recommended by the EDA report, so that each transformation is tied to a pattern actually observed in the data.
Based on the EDA report, I will:


In [27]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Preprocessing libraries
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix


In [28]:
# Loading the cleaned data from EDA
data_process = pd.read_csv("EDA_clean_data.csv")

In [29]:
data_process.head()

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0.0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1.0,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0.0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0.0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0.0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [30]:
# Let's make the `Loan_ID` the index
data_process.set_index("Loan_ID", inplace=True)

In [31]:
# Let's view the dataset again
data_process.head(2)

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0.0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y
LP001003,Male,Yes,1.0,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N


In [35]:
# Make a copy for preprocessing so the loaded data stays untouched
df = data_process.copy()

# Check for missing values; according to the EDA report there should be none
print("")
print("--------------------------------")
print("Checking Missing Values")
print("--------------------------------")
missing_vals= df.isnull().sum()
if missing_vals.sum()>0:
    print(missing_vals[missing_vals>0])
else:
    print("No missing values as expected from EDA")

# check for duplicates
print("")
print("--------------------------------")
print("Checking duplicates rows")
print("--------------------------------")
duplicates=df.duplicated().sum()
if duplicates>0:
    print(f"The number of duplicate rows: {duplicates}")
    print(f"Percentage of duplicates:{duplicates/len(df)*100:.2f}%")

else:
    print("No duplicate rows")

# Check the skewness of the variables the EDA report describes as right-skewed
print("")
print("--------------------------------")
print("Checking skewness of Features")
print("--------------------------------")

claimed_features=['ApplicantIncome','CoapplicantIncome','LoanAmount']
for col in claimed_features:
    skew_val= df[col].skew()
    print(f"{col} Skewness is: {skew_val:.2f}, ({'right-skewed' if skew_val > 0.5 else 'approximately normal'})")
    
# check correlation with target
# print("")
# print("--------------------------------")
# print("Checking correlation of Features")
# print("--------------------------------")

# correlations = df.select_dtypes(include=['number']).corr()['Loan_Status'].sort_values(key=abs, ascending=False)
# # Creating the high signal features
# high_signal = correlations[abs(correlations) > 0.2].drop('Loan_Status')
# # Creating the low signal features
# low_signal = correlations[abs(correlations) < 0.1]

# print("High-signal features (|correlation| > 0.2):")
# for feature, corr in high_signal.items():
#     print(f"    {feature}: {corr:.3f}")

# print("\nLow-signal features (|correlation| < 0.1):")
# for feature, corr in low_signal.items():
#     print(f"    {feature}: {corr:.3f}")


--------------------------------
Checking Missing Values
--------------------------------
No missing values as expected from EDA

--------------------------------
Checking duplicates rows
--------------------------------
No duplicate rows

--------------------------------
Checking skewness of Features
--------------------------------
ApplicantIncome Skewness is: 6.54, (right-skewed)
CoapplicantIncome Skewness is: 7.49, (right-skewed)
LoanAmount Skewness is: 2.74, (right-skewed)
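
The commented-out correlation check above would not run as written: `Loan_Status` holds the strings `Y`/`N` (see the preview earlier), so `select_dtypes(include=['number'])` drops it and `.corr()['Loan_Status']` would raise a `KeyError`. That is presumably why it is disabled, though the notebook does not say so. A minimal sketch of one way to make it work, assuming a 1/0 encoding of the target and using a small toy frame with illustrative values rather than the real data:

```python
import pandas as pd

# Toy frame standing in for the loan data (values are illustrative only).
toy = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 2000],
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 1.0, 0.0],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y", "N"],
})

# Encode the target as 1/0 so it survives the numeric-only selection.
# The Y -> 1, N -> 0 mapping is an assumption made for this sketch.
numeric = toy.assign(Loan_Status=toy["Loan_Status"].map({"Y": 1, "N": 0}))

# Correlation of every numeric feature with the target, strongest first.
correlations = (
    numeric.select_dtypes(include="number")
    .corr()["Loan_Status"]
    .drop("Loan_Status")
    .sort_values(key=abs, ascending=False)
)
print(correlations)
```

Pearson correlation against a binary target is only a rough screening signal, so the 0.2 and 0.1 thresholds in the commented-out code are best read as heuristics.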


### **Outlier Treatment**

IQR capping for extreme ApplicantIncome, CoapplicantIncome and LoanAmount values: anything below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is replaced by the nearest bound.
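
To make the rule concrete, here is a small self-contained sketch on a made-up income series (the numbers are illustrative, not taken from the dataset). It computes the two bounds and caps with `Series.clip`, which is equivalent to the nested `np.where` used in the cell below:

```python
import pandas as pd

# Made-up incomes with one extreme value at the end.
incomes = pd.Series([2000, 2500, 3000, 3500, 4000, 4500, 5000, 40000])

q1, q3 = incomes.quantile(0.25), incomes.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are pulled back to the nearest bound;
# everything else is left unchanged.
capped = incomes.clip(lower=lower, upper=upper)

print(f"bounds: [{lower}, {upper}]")
print(capped.tolist())
```

Capping keeps every row, unlike dropping outliers, at the cost of flattening the tail of the distribution.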

In [36]:
# Cap values outside the IQR bounds [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# Note: this function modifies the notebook-level `df` in place.
def IQR_Clipping(column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    # Count the values outside the bounds before capping them
    outliers_count = ((df[column] < lower) | (df[column] > upper)).sum()
    if outliers_count > 0:
        # clip() caps both tails in one call
        df[column] = df[column].clip(lower=lower, upper=upper)
    print(f"\nTotal outliers capped for {column}: {outliers_count}")
    

In [37]:
for col in ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]:
    IQR_Clipping(col)


Total outliers capped for ApplicantIncome: 50

Total outliers capped for CoapplicantIncome: 18

Total outliers capped for LoanAmount: 41


### **Handle Skewed Variables**

In [38]:
# The EDA recommends a log transform for the right-skewed variables
# (ApplicantIncome, CoapplicantIncome and LoanAmount).
# log1p computes log(1 + x), so zero values (e.g. in CoapplicantIncome) map to 0.
df["ApplicantIncome"] = np.log1p(df["ApplicantIncome"])
df["CoapplicantIncome"] = np.log1p(df["CoapplicantIncome"])
df["LoanAmount"] = np.log1p(df["LoanAmount"])
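
As a quick illustration of why `log1p` helps, the sketch below draws a synthetic right-skewed sample (log-normal, chosen only for demonstration and unrelated to the loan data) and compares skewness before and after the transform. It also shows that `np.expm1` inverts `np.log1p`, which is useful when a transformed value has to be reported on the original scale:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample; the distribution and its parameters
# are assumptions made purely for this demonstration.
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=1000))

logged = np.log1p(raw)       # log(1 + x), defined at x = 0
restored = np.expm1(logged)  # exp(x) - 1, the inverse of log1p

print(f"skewness before: {raw.skew():.2f}")
print(f"skewness after:  {logged.skew():.2f}")
```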

### **Feature Engineering** - Create interaction features

- Total Income (TotalIncome = ApplicantIncome + CoapplicantIncome)
- Debt-to-Income Ratio (DTI = LoanAmount / TotalIncome)
- Equated Monthly Instalment Feature (EMI = LoanAmount / Loan_Amount_Term)

In [39]:
# Total Income
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# Debt to income ratio (DTI)
df["DTI"] = df["LoanAmount"] / df["TotalIncome"]

# Equated Monthly Instalment (EMI)
df["EMI"] = df["LoanAmount"] / df["Loan_Amount_Term"]
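
One thing to keep in mind: the cell above runs after the log transform, so `TotalIncome` is the sum of two logged values and `DTI` is a ratio of logs, not the original-scale quantities the bullet formulas describe. If the original-scale definitions are what is wanted, one possible alternative is to build the features first and log-transform afterwards. The sketch below is not the notebook's method. It uses the first three rows of the preview as input, and the `features` frame and the `TotalIncome_log` column are names introduced here for illustration:

```python
import numpy as np
import pandas as pd

# First three rows of the preview, on the original (untransformed) scale.
raw = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000],
    "CoapplicantIncome": [0.0, 1508.0, 0.0],
    "LoanAmount": [128.0, 128.0, 66.0],
    "Loan_Amount_Term": [360.0, 360.0, 360.0],
})

features = pd.DataFrame(index=raw.index)
features["TotalIncome"] = raw["ApplicantIncome"] + raw["CoapplicantIncome"]

# Replace zero denominators with NaN so the ratios never become inf.
features["DTI"] = raw["LoanAmount"] / features["TotalIncome"].replace(0, np.nan)
features["EMI"] = raw["LoanAmount"] / raw["Loan_Amount_Term"].replace(0, np.nan)

# Apply the log transform only after the ratios have been computed.
features["TotalIncome_log"] = np.log1p(features["TotalIncome"])

print(features)
```

Whether this ordering improves the model is an empirical question; comparing both variants on validation data would settle it.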