# **DATA PREPROCESSING BASED ON EDA INSIGHTS**

This notebook implements the preprocessing steps recommended by the EDA report, so that each transformation is grounded in a pattern observed in the data.

Based on the EDA report, we will:

1. **Handle Skewed Variables** - Log-transform `ApplicantIncome` and `CoapplicantIncome`
2. **Feature Engineering** - Create `Total_Income`, `Loan_to_Income`, and interaction features
3. **Encoding** - Encode the categorical columns
4. **Scaling** - `RobustScaler`
5. **Target Handling** - Classification approach with stratified splits
6. **Feature Selection** - Keep high-signal features, evaluate low-signal ones
7. **Splitting** - Separate the target from the features
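
The scaling and target-handling steps in the plan can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's actual split: the toy DataFrame `demo`, the 80/20 split size, and `random_state=42` are all assumptions. It shows a stratified split followed by a `RobustScaler` that is fitted on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Toy stand-in for the processed loan data (values are made up for illustration)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "ApplicantIncome": rng.lognormal(mean=8.0, sigma=0.6, size=200),
    "LoanAmount": rng.lognormal(mean=4.8, sigma=0.5, size=200),
    "Loan_Status": np.r_[np.ones(140, dtype=int), np.zeros(60, dtype=int)],
})

X = demo.drop(columns="Loan_Status")
y = demo["Loan_Status"]

# stratify=y keeps the approved/rejected ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = RobustScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(f"Approval rate: train={y_train.mean():.2f}, test={y_test.mean():.2f}")
```

Fitting the scaler after the split keeps test-set statistics out of the training pipeline. `RobustScaler` centres on the median and scales by the IQR, which suits heavily skewed income columns.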


**Key EDA Evidence to Implement**

- **High-signal features**: `Credit_History`, `ApplicantIncome`, `CoapplicantIncome`, `LoanAmount`, `Property_Area`, and derived variables such as `Total_Income` and `Loan_to_Income` show strong or meaningful relationships with loan approval outcomes.
- **Low-signal features**: `Gender`, `Married`, `Dependents`, `Self_Employed`, and `Loan_Amount_Term` display weak or negligible influence on loan approval and may be deprioritized during preprocessing.
- **Skewed variables**: `ApplicantIncome` and `CoapplicantIncome` (log-transform)
- **Feature engineering**: `Total_Income`, `Loan_to_Income`, `Loan_Term_Years`, `Has_Coapplicant`, `Dependents_Num`.
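
The engineered features listed above could be derived along these lines. This is a sketch under stated assumptions: the text only names the features, so every formula here is a guess (total income as the sum of both incomes, the term converted from months to years, a `'3+'` dependents label treated as 3). The helper `add_engineered_features` is hypothetical, and the sample rows are for illustration only.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with illustrative derived columns (formulas are assumptions)."""
    out = df.copy()
    # Assumption: total household income is the sum of both incomes
    out["Total_Income"] = out["ApplicantIncome"] + out["CoapplicantIncome"]
    # Replace a zero total with NaN to avoid dividing by zero
    out["Loan_to_Income"] = out["LoanAmount"] / out["Total_Income"].replace(0, np.nan)
    # Assumption: Loan_Amount_Term is expressed in months
    out["Loan_Term_Years"] = pd.to_numeric(out["Loan_Amount_Term"], errors="coerce") / 12
    out["Has_Coapplicant"] = (out["CoapplicantIncome"] > 0).astype(int)
    # Assumption: a label such as '3+' is treated as 3
    dependents = out["Dependents"].astype(str).str.replace("+", "", regex=False)
    out["Dependents_Num"] = pd.to_numeric(dependents, errors="coerce")
    return out

# Illustrative rows only
sample = pd.DataFrame({
    "ApplicantIncome": [4583, 3000],
    "CoapplicantIncome": [1508.0, 0.0],
    "LoanAmount": [128.0, 66.0],
    "Loan_Amount_Term": [360.0, 360.0],
    "Dependents": ["1", "3+"],
})
features = add_engineered_features(sample)
print(features[["Total_Income", "Loan_to_Income", "Loan_Term_Years",
                "Has_Coapplicant", "Dependents_Num"]])
```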



#### **1. Import Libraries and load the data**

In [1]:
# Core libraries
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import zipfile
import warnings
warnings.filterwarnings('ignore')

# Preprocessing libraries
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix


# Statistical libraries
from scipy import stats
from scipy.stats import zscore, skew

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

print('All libraries imported successfully')


All libraries imported successfully


In [2]:
# Load the dataset
loan_train = pd.read_csv('home_loan_train.csv')

loan_train.head()

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [3]:
# Create a copy of the dataset for preprocessing
df_processed = loan_train.copy()

In [4]:
# Set the Loan_ID as the index 
df_processed.set_index('Loan_ID', inplace=True)

In [5]:
# Map the Loan_Status from object to int to check correlation 
df_processed['Loan_Status'] = df_processed['Loan_Status'].map({'Y': 1, 'N': 0})


In [6]:
# Display output
df_processed['Loan_Status']

Loan_ID
LP001002    1
LP001003    0
LP001005    1
LP001006    1
LP001008    1
           ..
LP002978    1
LP002979    1
LP002983    1
LP002984    1
LP002990    0
Name: Loan_Status, Length: 614, dtype: int64

#### **2. EDA-Based Data Assessment**

In [7]:
# 1. Check for missing values and list the affected columns
print("\n1. Missing Values:")
missing_values = df_processed.isna().sum()
if missing_values.sum() > 0:
    print(missing_values[missing_values > 0])
else:
    print("No missing values found, as expected from the EDA")


# 2. Check for duplicates
print("\n2. Duplicate Rows:")
duplicates = df_processed.duplicated().sum()
print(f"Number of duplicated rows: {duplicates}")
if duplicates > 0:
    print(f"Percentage of duplicates: {(duplicates / len(df_processed)) * 100:.2f}%")


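# 3. Check skewness for the variables identified in the EDA as right-skewed
# Note: skew() returns nan for a column that still contains missing values
# (LoanAmount at this stage), and a nan result falls through to the
# 'Approx normal' label, so that label is uninformative until the column is imputed.
print("\n3. Skewness Analysis (EDA identified right-skewed variables):")
skewed_var = ['ApplicantIncome','CoapplicantIncome','LoanAmount']
for var in skewed_var:
    if var in df_processed.columns:
        skewness = skew(df_processed[var])
        print(f"{var}: Skewness = {skewness:.3f} ({'right_skewed' if skewness > 0.5 else 'Approx normal'})")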
    

# 4. Check the correlation with the target (EDA evidence)
print("\n4. Correlation with Loan_Status (EDA evidence):")
correlations = df_processed.select_dtypes(include=['number']).corr()['Loan_Status'].sort_values(key=abs, ascending=False)
print("High-signal features (|correlation| > 0.2)")
high_signal = correlations[abs(correlations) > 0.2].drop('Loan_Status')
for feature, corr in high_signal.items():
    print(f"{feature}: {corr:.3f}")

print("\nLow-signal features (|correlation| < 0.1)")
low_signal = correlations[abs(correlations) < 0.1]
for feature, corr in low_signal.items():
    print(f"{feature}: {corr:.3f}")



1. Missing Values:
Gender              13
Married              3
Dependents          15
Self_Employed       32
LoanAmount          22
Loan_Amount_Term    14
Credit_History      50
dtype: int64

2. Duplicate Rows:
Number of duplicated rows: 0

3. Skewness Analysis (EDA identified right-skewed variables):
ApplicantIncome: Skewness = 6.524 (right_skewed)
CoapplicantIncome: Skewness = 7.473 (right_skewed)
LoanAmount: Skewness = nan (Approx normal)

4. Correlation with Loan_Status (EDA evidence):
High-signal features (|correlation| > 0.2)
Credit_History: 0.562

Low-signal features (|correlation| < 0.1)
CoapplicantIncome: -0.059
LoanAmount: -0.037
Loan_Amount_Term: -0.021
ApplicantIncome: -0.005
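
The `nan` reported for `LoanAmount` comes from the 22 missing values in that column rather than from its shape. A minimal sketch of a skewness check that ignores missing entries is shown below. The helper `describe_skew` and its 0.5 threshold are assumptions for illustration, and the sample series is made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def describe_skew(series: pd.Series, threshold: float = 0.5) -> str:
    """Label the skewness of a numeric series, ignoring missing values."""
    values = series.dropna()
    s = skew(values)
    if s > threshold:
        label = "right-skewed"
    elif s < -threshold:
        label = "left-skewed"
    else:
        label = "approximately symmetric"
    n_missing = int(series.isna().sum())
    return f"{series.name}: skewness = {s:.3f} ({label}, {n_missing} missing ignored)"

# Illustrative values only; one entry is missing on purpose
demo = pd.Series([66.0, 120.0, 128.0, 141.0, np.nan, 600.0], name="LoanAmount")
summary = describe_skew(demo)
print(summary)
```

Dropping the missing entries before the call (or passing `nan_policy='omit'` to `scipy.stats.skew`) gives a usable estimate before imputation, which avoids a `nan` being mislabelled as approximately normal.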


#### **3. Handle Missing values**

**Handling missing data in the categorical columns**

In [9]:
# Convert Credit_History to a string so that it is treated as a categorical column
df_processed['Credit_History'] = df_processed['Credit_History'].astype(str)

In [10]:
# Map the Credit_History codes to readable labels. Missing entries (typically the
# string 'nan' after the conversion above) match no key and become NaN again,
# so the mode imputation further down can fill them.
df_processed['Credit_History'] = df_processed['Credit_History'].map({'1.0': 'good', '0.0':'bad'})
df_processed['Credit_History']

Loan_ID
LP001002    good
LP001003    good
LP001005    good
LP001006    good
LP001008    good
            ... 
LP002978    good
LP002979    good
LP002983    good
LP002984    good
LP002990     bad
Name: Credit_History, Length: 614, dtype: object

In [11]:
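# Convert Loan_Amount_Term to a string. astype(str) can turn missing values into
# the literal string 'nan', which fillna would skip, so restore them to NaN.
df_processed['Loan_Amount_Term'] = df_processed['Loan_Amount_Term'].astype(str).replace('nan', np.nan)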

In [12]:
# Display the categorical columns
cat_col = df_processed.select_dtypes(include='object').columns
cat_col

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area'],
      dtype='object')

In [13]:
# Get the mode of each categorical column
modes = df_processed[cat_col].apply(lambda x: x.value_counts().index[0])

# Fill the missing categorical values with the mode of each column
df_processed[cat_col] = df_processed[cat_col].fillna(modes)

# Check that the missing values have been handled
df_processed[cat_col].isna().sum()

Gender              0
Married             0
Dependents          0
Education           0
Self_Employed       0
Loan_Amount_Term    0
Credit_History      0
Property_Area       0
dtype: int64
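
One pitfall worth checking at this step: in many pandas versions `astype(str)` turns a missing value into the literal string `'nan'`, which `isna()` no longer counts and `fillna` no longer fills. The sketch below shows one way to keep the gaps visible by mapping the original float codes directly. The toy series is made up and this is an alternative under that assumption, not the notebook's own approach.

```python
import numpy as np
import pandas as pd

# Toy Credit_History column with one missing entry
credit = pd.Series([1.0, 0.0, np.nan, 1.0, 1.0], name="Credit_History")

# Mapping the float codes directly leaves the missing entry as NaN
labels = credit.map({1.0: "good", 0.0: "bad"})
print(labels.isna().sum())  # 1

# Mode imputation now sees the gap and fills it
mode_value = labels.mode(dropna=True).iloc[0]
filled = labels.fillna(mode_value)
print(filled.tolist())  # ['good', 'bad', 'good', 'good', 'good']
```

A quick `df_processed['Credit_History'].value_counts(dropna=False)` after imputation is a cheap way to confirm that no `'nan'` category slipped through.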

In [14]:
# confirm cat_col missing values have been handled
df_processed.isna().sum()

Gender                0
Married               0
Dependents            0
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term      0
Credit_History        0
Property_Area         0
Loan_Status           0
dtype: int64

**Handling missing data in the numerical columns**

In [15]:
# Extract the columns with numerical values
num_col = df_processed.select_dtypes(include=['number']).columns
num_col

Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Status'], dtype='object')

In [16]:
for col in num_col:
    median_value = df_processed[col].median()
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which newer pandas versions warn about and may not apply
    df_processed[col] = df_processed[col].fillna(median_value)
    print(f"Filled missing values in {col} with the median ({median_value})")

Filled missing values in ApplicantIncome with the median (3812.5)
Filled missing values in CoapplicantIncome with the median (1188.5)
Filled missing values in LoanAmount with the median (128.0)
Filled missing values in Loan_Status with the median (1.0)


In [17]:
# confirm num_col missing values have been handled
df_processed.isna().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
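
The loop above also visits `Loan_Status`, which has no missing values, so the target is untouched here. A variant that skips the target explicitly and reports only the columns it actually filled might look like the sketch below. The helper `fill_numeric_median` and its `exclude` parameter are hypothetical, and the demo rows are for illustration only.

```python
import numpy as np
import pandas as pd

def fill_numeric_median(df: pd.DataFrame, exclude=("Loan_Status",)):
    """Median-impute numeric feature columns; return the new frame and a report."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns.difference(list(exclude))
    medians = out[num_cols].median()
    # Record only the columns that really had gaps
    report = {col: float(medians[col]) for col in num_cols if out[col].isna().any()}
    out[num_cols] = out[num_cols].fillna(medians)
    return out, report

# Illustrative rows only
demo = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 141.0],
    "Loan_Status": [1, 0, 1, 1, 1],
})
filled, report = fill_numeric_median(demo)
print(report)  # {'LoanAmount': 124.0}
```

If the data is later split into training and test sets, computing the medians on the training rows only would avoid leaking test-set statistics into the imputation.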

#### **4. Handle Duplicated values**

In [18]:
# Remove duplicates if any (EDA didn't report duplicates, but let's be thorough)
if duplicates > 0:
    print(f"Removing {duplicates} duplicate rows...")
    df_processed = df_processed.drop_duplicates()
    print(f"Dataset shape after removing duplicates: {df_processed.shape}")
else:
    print("✓ No duplicates to remove (as expected from EDA)")

✓ No duplicates to remove (as expected from EDA)


#### **5. Log-Transform Skewed Variables (EDA Recommendation)**

Based on the EDA findings, log-transform the variables identified as right-skewed.

In [19]:
# Log-transform skewed variables as recommended by the EDA
print("=== LOG-TRANSFORMING SKEWED VARIABLES ===")
print("EDA identified these variables as right-skewed and recommended log transformation:")

# Variables to log-transform based on EDA findings
skewed_var = ['ApplicantIncome','CoapplicantIncome','LoanAmount']
for var in skewed_var:
    if var in df_processed.columns:
        # Check whether the variable has zero or negative values
        min_val = df_processed[var].min()
        if min_val <= 0:
            # Use log1p for variables with zeros
            df_processed[f'{var}_log'] = np.log1p(df_processed[var])
            print(f'✓ {var}: Applied log1p transformation (had {min_val:.3f} minimum value)')

        else:
            # Use log for strictly positive values
            df_processed[f'{var}_log'] = np.log(df_processed[var])
            print(f"✓ {var}: Applied log transformation")

        # Compare skewness before and after the transformation
        original_skew = skew(df_processed[var])
        transformed_skew = skew(df_processed[f'{var}_log'])
        print(f" Original Skewness: {original_skew:.3f} -> Transformed skewness: {transformed_skew:.3f}")

print(f"\n Dataset shape after log transformation: {df_processed.shape}")
print("New log-transformed columns:", [col for col in df_processed.columns if col.endswith('_log')])


=== LOG-TRANSFORMING SKEWED VARIABLES ===
EDA identified these variables as right-skewed and recommended log transformation:
✓ ApplicantIncome: Applied log transformation
 Original Skewness: 6.524 -> Transformed skewness: 0.478
✓ CoapplicantIncome: Applied log1p transformation (had 0.000 minimum value)
 Original Skewness: 7.473 -> Transformed skewness: -0.173
✓ LoanAmount: Applied log transformation
 Original Skewness: 2.736 -> Transformed skewness: -0.195

 Dataset shape after log transformation: (614, 15)
New log-transformed columns: ['ApplicantIncome_log', 'CoapplicantIncome_log', 'LoanAmount_log']
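
`CoapplicantIncome` needs `log1p` because it contains zeros, and `log(0)` is undefined. A small sketch of the transform and its inverse follows, using illustrative values only. `np.expm1` is the matching inverse if a value ever has to be reported back in the original units.

```python
import numpy as np

# Illustrative coapplicant incomes, including a zero
coapplicant_income = np.array([0.0, 1508.0, 2358.0])

logged = np.log1p(coapplicant_income)   # log(1 + x), defined at x = 0
restored = np.expm1(logged)             # inverse transform: exp(y) - 1

print(logged[0])                                   # 0.0
print(np.allclose(restored, coapplicant_income))   # True
```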


#### **6. Outlier Treatment (EDA Recommendation)**

Based on the EDA findings, handle outliers using the IQR-clipping method.


In [None]:
# Outlier treatment based on EDA recommendations
print("----Outlier Treatment (IQR-clipping method)----")
print("EDA recommends IQR-clipping for extreme values to preserve data points")

# Define numerical columns (excluding target)
num_col = df_processed.select_dtypes(include=[np.number]).columns.tolist()
if 'Loan_Status' in num_col:
    num_col.remove('Loan_Status')

print(f"Treating outliers in {len(num_col)} numerical features...")

# Apply the IQR-clipping method
outliers_clipped = 0
for col in num_col:
    Q1 = df_processed[col].quantile(0.25)
    Q3 = df_processed[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Count outliers before clipping
    outliers_before = ((df_processed[col] < lower_bound) | (df_processed[col] > upper_bound)).sum()

    if outliers_before > 0:
        # Clip values outside [lower_bound, upper_bound] to the nearest bound
        df_processed[col] = df_processed[col].clip(lower=lower_bound, upper=upper_bound)
        
