# **DATA PREPROCESSING BASED ON EDA INSIGHTS**

This notebook implements the preprocessing steps recommended in the EDA report, so that each transformation is tied to a pattern observed in the data.

Based on the EDA report, we will:

1. **Handle Skewed Variables** - Log-transform `ApplicantIncome`, `CoapplicantIncome`, and `LoanAmount`
2. **Outlier Treatment** - IQR-capping for extreme values in the numerical features
3. **Feature Engineering** - Create income categories, total income, and loan-term categories
4. **Feature Selection** - Keep high-signal features, evaluate low-signal ones
5. **Scaling** - StandardScaler for distance-based models
6. **Target Handling** - Classification approach with stratified splits



#### **1. Import Libraries and Load Data**

In [1]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import warnings
warnings.filterwarnings('ignore')

# Preprocessing libraries
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix

# Statistical libraries
from scipy import stats
from scipy.stats import zscore, skew

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")


Libraries imported successfully!


In [2]:
# load in the dataset

url1 = r"https://raw.githubusercontent.com/ek-chris/Practice_datasets/refs/heads/main/home_loan_train.csv"
url2 = r"https://raw.githubusercontent.com/kenstare/Practice_datasets/master/home_loan_test.csv"
train_data = pd.read_csv(url1)

#### **2. EDA-Based Data Quality Assessment**

Based on the EDA findings, let's assess the issues identified.

In [4]:
# Create a copy of the data so the raw training set stays untouched

t_processed = train_data.copy()

In [12]:
t_processed.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,1
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,0
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,1
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,1
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,1


In [5]:
t_processed.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [11]:
# Convert the target labels ('Y'/'N') to numeric values (1/0).
# Run this cell only once: mapping an already-numeric column yields NaN.
t_processed['Loan_Status'] = t_processed['Loan_Status'].map({'Y': 1, 'N': 0})
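
Re-running the mapping on an already-encoded column would replace every value with NaN, because `1` and `0` are not keys of the dictionary. A minimal sketch of a guard, assuming the target holds only `'Y'`/`'N'` labels; `encode_target` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd

def encode_target(series: pd.Series) -> pd.Series:
    """Map 'Y'/'N' to 1/0, leaving an already-numeric column unchanged."""
    if pd.api.types.is_numeric_dtype(series):
        return series  # already encoded; mapping again would yield NaN
    return series.map({'Y': 1, 'N': 0})

demo = pd.Series(['Y', 'N', 'Y'])  # illustrative data, not from the loan file
once = encode_target(demo)
twice = encode_target(once)        # second call is a no-op
```

With this guard the cell can be executed repeatedly without corrupting the target.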

In [13]:
# 1. Check for missing values

print("\n 1. Missing Values:")
missing_values = t_processed.isnull().sum()
if missing_values.sum() > 0:
    print(missing_values[missing_values > 0])
else:
    print("No missing values found")

# 2. Check for duplicated rows
print("\n 2. Duplicate Rows:")
duplicates = t_processed.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
if duplicates > 0:
    print(f"Percentage of duplicates: {(duplicates/len(t_processed))*100:.2f}%")

# 3. Check skewness for the variables the EDA identified as right-skewed
print("\n 3. Skewness Analysis (EDA-identified right-skewed variables):")
skewed_vars = ['ApplicantIncome','CoapplicantIncome','LoanAmount']
for var in skewed_vars:
    if var in t_processed.columns:
        # skew() propagates NaN by default, so a column with missing values
        # (LoanAmount at this point) yields nan; label that case explicitly
        skewness = skew(t_processed[var])
        if np.isnan(skewness):
            label = 'undefined: column has missing values'
        else:
            label = 'right-skewed' if skewness > 0.5 else 'not right-skewed'
        print(f"{var}: skewness = {skewness:.3f} ({label})")

# 4. Check correlation with the target (EDA evidence)
print("\n 4. Correlation with Loan_Status (EDA Evidence):")
correlations = t_processed.select_dtypes(include=['number']).corr()['Loan_Status'].sort_values(key=abs, ascending=False)

print("High-signal features (|correlation| > 0.2):")
high_signal = correlations[correlations.abs() > 0.2].drop('Loan_Status')
for feature, corr in high_signal.items():
    print(f"  {feature}: {corr:.3f}")

print("\nLow-signal features (|correlation| < 0.1):")
low_signal = correlations[correlations.abs() < 0.1]
for feature, corr in low_signal.items():
    print(f"  {feature}: {corr:.3f}")



 1. Missing Values:
Gender              13
Married              3
Dependents          15
Self_Employed       32
LoanAmount          22
Loan_Amount_Term    14
Credit_History      50
dtype: int64

 2. Duplicate Rows:
Number of duplicate rows: 0

 3. Skewness Analysis (EDA-identified right-skewed variables):
ApplicantIncome: skewness = 6.524 (right-skewed)
CoapplicantIncome: skewness = 7.473 (right-skewed)
LoanAmount: skewness = nan (undefined: column has missing values)

 4. Correlation with Loan_Status (EDA Evidence):
High-signal features (|correlation| > 0.2):
  Credit_History: 0.562

Low-signal features (|correlation| < 0.1):
  CoapplicantIncome: -0.059
  LoanAmount: -0.037
  Loan_Amount_Term: -0.021
  ApplicantIncome: -0.005
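
The `nan` reported for `LoanAmount` comes from its 22 missing values: `scipy.stats.skew` uses `nan_policy='propagate'` by default. A small sketch on made-up numbers (not the loan data) showing how `nan_policy='omit'` gives a usable estimate without imputing first:

```python
import numpy as np
from scipy.stats import skew

values = np.array([1.0, 2.0, 3.0, 4.0, 50.0, np.nan])  # illustrative values

propagated = skew(values)                         # NaN because of the missing entry
omitted = float(skew(values, nan_policy='omit'))  # skewness of the 5 observed values
dropped = skew(values[~np.isnan(values)])         # equivalent: drop NaNs by hand

print(propagated, omitted, dropped)
```

Either variant could be applied to `LoanAmount` before imputation, although the estimate may shift once missing values are filled in.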


#### **3. Handle Duplicates**


In [14]:
# Remove duplicates if any (the EDA didn't report duplicates, but let's be thorough)
if duplicates > 0:
    print(f"Removing {duplicates} duplicate rows...")
    # drop_duplicates returns a new DataFrame, so assign it back to t_processed
    t_processed = t_processed.drop_duplicates()
    print(f"Dataset shape after removing duplicates: {t_processed.shape}")
else:
    print("✓ No duplicates to remove (as expected from EDA)")

✓ No duplicates to remove (as expected from EDA)
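
`drop_duplicates` returns a new DataFrame rather than modifying the original, so the result has to be assigned back to the working variable. A tiny illustration on made-up rows:

```python
import pandas as pd

# Illustrative rows, not taken from the loan file
demo = pd.DataFrame({'Loan_ID': ['A1', 'A2', 'A2'], 'LoanAmount': [100, 150, 150]})

n_dupes = demo.duplicated().sum()  # rows identical to an earlier row
deduped = demo.drop_duplicates().reset_index(drop=True)

print(n_dupes, demo.shape, deduped.shape)
```

The original `demo` keeps all three rows; only `deduped` has the duplicate removed.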


#### **4. Log-Transform Skewed Variables (EDA Recommendation)**

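Based on the EDA findings, log-transform the right-skewed variables `ApplicantIncome`, `CoapplicantIncome`, and `LoanAmount`. Columns that contain zeros use `log1p`, i.e. log(1 + x), because log(0) is undefined.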

In [18]:
# Log-transform skewed variables as recommended by the EDA
print("=== LOG-TRANSFORMING SKEWED VARIABLES ===")
print("EDA identified these variables as right-skewed and recommended log transformation:")

# Variables to log-transform based on EDA findings
skewed_vars = ['ApplicantIncome','CoapplicantIncome','LoanAmount']

for var in skewed_vars:
    if var in t_processed.columns:
        # Check whether the variable has zero or negative values
        min_val = t_processed[var].min()
        if min_val <= 0:
            # Use log1p for variables that contain zeros
            t_processed[f'{var}_log'] = np.log1p(t_processed[var])
            print(f"{var}: Applied log1p transformation (had {min_val:.3f} minimum value)")
        else:
            # Use plain log for strictly positive variables
            t_processed[f"{var}_log"] = np.log(t_processed[var])
            print(f"{var}: Applied log transformation")

        # Check skewness before and after (nan if the column has missing values)
        original_skew = skew(t_processed[var])
        transformed_skew = skew(t_processed[f'{var}_log'])
        print(f"Original skewness: {original_skew:.3f} -> Transformed skewness: {transformed_skew:.3f}")
print(f"\n Dataset shape after log transformation: {t_processed.shape}")
print("New log-transformed columns:", [col for col in t_processed.columns if col.endswith("_log")])

=== LOG-TRANSFORMING SKEWED VARIABLES ===
EDA identified these variables as right-skewed and recommended log transformation:
ApplicantIncome: Applied log transformation
Original skewness: 6.524 -> Transformed skewness: 0.478
CoapplicantIncome: Applied log1p transformation (had 0.000 minimum value)
Original skewness: 7.473 -> Transformed skewness: -0.173
LoanAmount: Applied log transformation
Original skewness: nan -> Transformed skewness: nan

 Dataset shape after log transformation: (614, 16)
New log-transformed columns: ['ApplicantIncome_log', 'CoapplicantIncome_log', 'LoanAmount_log']
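
The choice between `np.log` and `np.log1p` hinges on zeros: `log(0)` evaluates to `-inf`, while `log1p(x) = log(1 + x)` maps zero to zero and is inverted by `np.expm1`. A brief illustration on made-up incomes:

```python
import numpy as np

incomes = np.array([0.0, 1500.0, 4000.0, 81000.0])  # illustrative values with a zero

logged = np.log1p(incomes)    # log(1 + x): defined at zero
restored = np.expm1(logged)   # inverse transform recovers the original scale

with np.errstate(divide='ignore'):
    plain = np.log(incomes)   # first entry is -inf because log(0) diverges
```

Keeping the inverse in mind matters if predictions or summaries later need to be reported on the original income scale.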


#### **5. Outlier Treatment (EDA Recommendation)**

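Based on the EDA findings, cap outliers at the IQR fences (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR) instead of dropping rows. When a column's first and third quartiles coincide, the IQR is zero and every value that differs from the quartile is capped to it. The recorded output shows this for `Loan_Amount_Term` (88 values) and `Credit_History` (89 values), which is why `Loan_Amount_Term` later contains only `360` and `nan`; discrete columns like these are usually better left out of the capping loop.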

In [19]:
# Outlier treatment based on EDA recommendations
print("=== OUTLIER TREATMENT (IQR-CAPPING METHOD) ===")
print("EDA recommended IQR-capping for extreme values to preserve data points")

# Define numerical columns (excluding target)
numerical_cols = t_processed.select_dtypes(include=[np.number]).columns.tolist()
if 'Loan_Status' in numerical_cols:
    numerical_cols.remove('Loan_Status')

print(f"Treating outliers in {len(numerical_cols)} numerical features...")

# Apply IQR-capping method
outliers_capped = 0
for col in numerical_cols:
    Q1 = t_processed[col].quantile(0.25)
    Q3 = t_processed[col].quantile(0.75)
    IQR = Q3 - Q1
    # Note: if Q1 == Q3 the IQR is 0 and both bounds equal the quartile
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Count outliers before capping
    outliers_before = ((t_processed[col] < lower_bound) | (t_processed[col] > upper_bound)).sum()

    if outliers_before > 0:
        # Cap outliers to the bounds (missing values stay missing)
        t_processed[col] = t_processed[col].clip(lower=lower_bound, upper=upper_bound)
        outliers_capped += outliers_before
        print(f"✓ {col}: Capped {outliers_before} outliers")

print(f"\nTotal outliers capped: {outliers_capped}")
print(f"Dataset shape after outlier treatment: {t_processed.shape}")


=== OUTLIER TREATMENT (IQR-CAPPING METHOD) ===
EDA recommended IQR-capping for extreme values to preserve data points
Treating outliers in 8 numerical features...
✓ ApplicantIncome: Capped 50 outliers
✓ CoapplicantIncome: Capped 18 outliers
✓ LoanAmount: Capped 39 outliers
✓ Loan_Amount_Term: Capped 88 outliers
✓ Credit_History: Capped 89 outliers
✓ ApplicantIncome_log: Capped 27 outliers
✓ LoanAmount_log: Capped 34 outliers

Total outliers capped: 345
Dataset shape after outlier treatment: (614, 16)
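
One way to avoid capping columns whose IQR is zero (such as binary flags) is to skip them explicitly. A minimal sketch on made-up data; `cap_iqr` is a hypothetical helper, not part of the original notebook, and the 1.5 multiplier mirrors the cell above:

```python
import pandas as pd

def cap_iqr(df: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Return a copy with each column clipped to [Q1 - k*IQR, Q3 + k*IQR].

    Columns with zero IQR (near-constant or binary columns) are left as-is.
    """
    out = df.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        if iqr == 0:
            continue  # fences would collapse onto the quartile
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out

demo = pd.DataFrame({
    'income': [10.0, 12.0, 11.0, 13.0, 12.0, 500.0],  # one extreme value
    'flag':   [1.0, 1.0, 1.0, 1.0, 1.0, 0.0],         # binary, IQR is zero
})
capped = cap_iqr(demo, ['income', 'flag'])
```

Here the extreme `income` value is pulled down to the upper fence while the single `0` in `flag` survives. Whether to skip such columns or list the continuous ones by hand is a design choice for the notebook author.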


#### **6. Feature Engineering**

Implement the feature engineering recommendations from the EDA report: an income category, total household income, and a loan-term category.

In [27]:
# 1. Turn continuous income values into categories (Low, Medium, High, Very High)
# EDA rationale: helps models like Decision Trees or Random Forests interpret ranges more clearly and makes EDA results more readable.

# Define bins (adjust ranges to fit your dataset distribution)
bins = [0, 2500, 6000, 10000, float('inf')]
labels = ['Low', 'Medium', 'High', 'Very High']

# Create new column for income category (include_lowest keeps an income of exactly 0 in 'Low')
t_processed['Income_Category'] = pd.cut(t_processed['ApplicantIncome'], bins=bins, labels=labels, include_lowest=True)
print("\n Income category = Applicant Income(categorized)")

# Ordinal encoding of the income category
t_processed['Income_Category_Code'] = t_processed['Income_Category'].map({'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4})
print("\n Income category code = Income category(encoded)")


# 2. Total Income: Applicant Income + Coapplicant Income
# EDA rationale: reflects the household’s earning capacity
# Note: both columns were IQR-capped in step 5, so this is the sum of the capped values
t_processed['TotalIncome'] = t_processed['ApplicantIncome'] + t_processed['CoapplicantIncome']

# To reduce skewness
t_processed['Log_TotalIncome'] = np.log1p(t_processed['TotalIncome'])  # log1p handles zeros safely

print("\nTotal income: Applicant Income + Coapplicant Income")



# 3. Categorize Loan_Amount_Term (measured in months) into Short, Medium, or Long Term
# EDA recommendation: enhances model interpretability

# Check unique terms first
print(t_processed['Loan_Amount_Term'].unique())

# Define categories
def categorize_loan_term(term):
    if pd.isna(term):
        return np.nan  # keep missing terms missing instead of labelling them 'Long Term'
    if term <= 180:
        return 'Short Term'
    elif term <= 300:
        return 'Medium Term'
    else:
        return 'Long Term'

t_processed['Loan_Term_Category'] = t_processed['Loan_Amount_Term'].apply(categorize_loan_term)

print("\n loan_term_category = Loan amount Term(categorized)")

# Check the shape of the dataset after feature engineering
print(f"\nDataset shape after feature engineering: {t_processed.shape}")
# List the newly engineered features
print(f"New engineered features: {[col for col in t_processed.columns if col not in train_data.columns]}")



 Income category = Applicant Income(categorized)

 Income category code = Income category(encoded)

Total income: Applicant Income + Coapplicant Income
[360.  nan]

 loan_term_category = Loan amount Term(categorized)

Dataset shape after feature engineering: (614, 21)
New engineered features: ['ApplicantIncome_log', 'CoapplicantIncome_log', 'LoanAmount_log', 'Income_Category', 'Income_Category_Code', 'TotalIncome', 'Log_TotalIncome', 'Loan_Term_Category']
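
The loan-term rule can also be expressed with `pd.cut`, which is vectorized and leaves missing terms as `NaN`. A sketch on made-up terms, reusing the 180 and 300 month thresholds from the cell above:

```python
import numpy as np
import pandas as pd

terms = pd.Series([120.0, 180.0, 240.0, 360.0, np.nan])  # illustrative terms in months

term_category = pd.cut(
    terms,
    bins=[0, 180, 300, np.inf],  # right-closed intervals: (0, 180], (180, 300], (300, inf]
    labels=['Short Term', 'Medium Term', 'Long Term'],
)
```

The result is an ordered categorical, which may be convenient if the category is later encoded ordinally.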
