# **DATA PREPROCESSING BASED ON EDA INSIGHTS**

This notebook implements the preprocessing steps recommended by the EDA report, so that each transformation is tied to a pattern observed in the data.

Based on the EDA report, we will:

1. **Handle Skewed Variables** - Log-transform `ApplicantIncome`, `CoapplicantIncome`, and `LoanAmount`
2. **Feature Engineering**:
   - Total_Income = ApplicantIncome + CoapplicantIncome
   - EMI = LoanAmount / Loan_Amount_Term
   - Income_to_Loan_Ratio = Total_Income / LoanAmount
3. **Feature Selection** - Keep high-signal features
4. **Scaling** - StandardScaler for distance-based models


Final Preprocessing Pipeline (Suggested Order)

1. Apply log transformations on skewed variables.

4. Feature engineering (TotalIncome, EMI, Ratios).

5. Encode categorical variables (Label + One-Hot).

6. Scale numerical features (StandardScaler/RobustScaler).

7. Address class imbalance (SMOTE or class weights).

8. Perform feature selection.

9. Split data into train/test sets (e.g., 80/20).
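
As a rough illustration of how the scaling and splitting steps could be wired together with scikit-learn, here is a minimal sketch on a toy frame. The values and the `toy` frame are made up for the example. Note that the sketch splits before fitting the scaler, so that test rows do not influence the scaling statistics; that ordering is a choice made in this sketch, not something the EDA report prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the engineered features (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Total_income": rng.uniform(1500, 10000, size=50),
    "LoanAmount": rng.uniform(50, 300, size=50),
    "Loan_Status": [0, 1] * 25,
})

X = toy.drop(columns="Loan_Status")
y = toy["Loan_Status"]

# 80/20 split, stratified so both sets keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

`RobustScaler` can be swapped in for `StandardScaler` with the same `fit_transform` / `transform` calls if outliers remain after the log transformation.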

#### **1. Import Libraries and Load Data**

In [20]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import warnings
warnings.filterwarnings('ignore')

# Preprocessing libraries
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix

# Statistical libraries
from scipy import stats
from scipy.stats import zscore, skew

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")

Libraries imported successfully!


In [31]:
df = pd.read_csv("cleaned_data.csv")
# df.set_index("")

In [32]:
df.columns


Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

#### **2. EDA-Based Data Quality Assessment**

In [33]:
def handle_skewed_data(df):
    """Add a log-transformed copy of each skewed column and report the change in skewness."""
    skewed_columns = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']
    for col in skewed_columns:
        # log is undefined at zero, so use log1p when the column contains zeros
        min_val = df[col].min()
        if min_val <= 0:
            df[f"{col}_log"] = np.log1p(df[col])
            print(f"{col}: Applied log1p transformation (had {min_val:.3f} minimum value)")
        else:
            # Strictly positive column: plain log is safe
            df[f"{col}_log"] = np.log(df[col])
            print(f"{col}: Applied log transformation")

        # Compare skewness before and after the transformation
        original_skew = skew(df[col])
        transformed_skew = skew(df[f"{col}_log"])
        print(f"Original skewness: {original_skew:.3f} → Transformed skewness: {transformed_skew:.3f}")
    return df


#### **1. Log-Transform Skewed Variables (EDA Recommendation)**

In [34]:
handle_skewed_data(df)

ApplicantIncome: Applied log transformation
Original skewness: 6.524 → Transformed skewness: 0.478
CoapplicantIncome: Applied log1p transformation (had 0.000 minimum value)
Original skewness: 7.473 → Transformed skewness: -0.173
LoanAmount: Applied log transformation
Original skewness: 2.736 → Transformed skewness: -0.195


Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,ApplicantIncome_log,CoapplicantIncome_log,LoanAmount_log
0,Male,No,0.0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y,8.674026,0.000000,4.852030
1,Male,Yes,1.0,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,8.430109,7.319202,4.852030
2,Male,Yes,0.0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,8.006368,0.000000,4.189655
3,Male,Yes,0.0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,7.856707,7.765993,4.787492
4,Male,No,0.0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,8.699515,0.000000,4.948760
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,Female,No,0.0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y,7.972466,0.000000,4.262680
610,Male,Yes,3.0,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y,8.320205,0.000000,3.688879
611,Male,Yes,1.0,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y,8.996157,5.484797,5.533389
612,Male,Yes,2.0,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y,8.933664,0.000000,5.231109
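
The `CoapplicantIncome_log` column is 0.000000 wherever the coapplicant income is zero because that column goes through `np.log1p` rather than `np.log`. The short sketch below, using made-up income values, shows the difference between the two functions at zero and how `np.expm1` maps the transformed values back to the original scale.

```python
import numpy as np

incomes = np.array([0.0, 240.0, 1508.0])  # made-up values, including a zero

# Plain log diverges at zero; silence the divide-by-zero warning for the demo
with np.errstate(divide="ignore"):
    plain_log = np.log(incomes)      # first entry is -inf

shifted_log = np.log1p(incomes)      # computes log(1 + x); first entry is 0.0

# expm1 inverts log1p, so the original scale can be recovered
recovered = np.expm1(shifted_log)
```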


#### **4. Feature Engineering**

Feature Engineering Suggestions:
- Total_Income = ApplicantIncome + CoapplicantIncome
- EMI = LoanAmount / Loan_Amount_Term
- Income_to_Loan_Ratio = Total_Income / LoanAmount
- Loan_Amount_Term_in_years = Loan_Amount_Term / 12
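
The formulas above can also be sketched as a single helper. This is only an illustration: the function name `add_engineered_features`, the column spelling `Loan_Amount_Term_in_years`, and the choice to treat a zero denominator as missing are assumptions of this sketch, not part of the notebook. The sample rows reuse the first two records shown in the table earlier in the notebook.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df):
    """Return a copy of df with the engineered columns added."""
    out = df.copy()
    out["Total_income"] = out["ApplicantIncome"] + out["CoapplicantIncome"]
    # Treat a zero denominator as missing rather than producing inf
    term = out["Loan_Amount_Term"].replace(0, np.nan)
    amount = out["LoanAmount"].replace(0, np.nan)
    out["EMI"] = out["LoanAmount"] / term
    out["Income_to_Loan_Ratio"] = out["Total_income"] / amount
    out["Loan_Amount_Term_in_years"] = out["Loan_Amount_Term"] / 12
    return out

# First two records from the notebook's data
sample = pd.DataFrame({
    "ApplicantIncome": [5849, 4583],
    "CoapplicantIncome": [0.0, 1508.0],
    "LoanAmount": [128.0, 128.0],
    "Loan_Amount_Term": [360.0, 360.0],
})
features = add_engineered_features(sample)
```

Working on a copy keeps the input frame unchanged, which makes the helper safe to rerun in a notebook.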


In [35]:
# Total income = ApplicantIncome + CoapplicantIncome
df["Total_income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,ApplicantIncome_log,CoapplicantIncome_log,LoanAmount_log,Total_income
0,Male,No,0.0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y,8.674026,0.0,4.85203,5849.0
1,Male,Yes,1.0,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,8.430109,7.319202,4.85203,6091.0
2,Male,Yes,0.0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,8.006368,0.0,4.189655,3000.0
3,Male,Yes,0.0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,7.856707,7.765993,4.787492,4941.0
4,Male,No,0.0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,8.699515,0.0,4.94876,6000.0


In [36]:
# Equated Monthly Installment (EMI) = LoanAmount / Loan_Amount_Term
df["EMI"] = df["LoanAmount"] / df["Loan_Amount_Term"]
df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,ApplicantIncome_log,CoapplicantIncome_log,LoanAmount_log,Total_income,EMI
0,Male,No,0.0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y,8.674026,0.0,4.85203,5849.0,0.355556
1,Male,Yes,1.0,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,8.430109,7.319202,4.85203,6091.0,0.355556
2,Male,Yes,0.0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,8.006368,0.0,4.189655,3000.0,0.183333
3,Male,Yes,0.0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,7.856707,7.765993,4.787492,4941.0,0.333333
4,Male,No,0.0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,8.699515,0.0,4.94876,6000.0,0.391667


In [37]:
# Income_to_Loan_Ratio = Total_income / LoanAmount
df["Income_to_Loan_Ratio"] = df["Total_income"] / df["LoanAmount"]
df.head()


Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,ApplicantIncome_log,CoapplicantIncome_log,LoanAmount_log,Total_income,EMI,Income_to_Loan_Ratio
0,Male,No,0.0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y,8.674026,0.0,4.85203,5849.0,0.355556,45.695312
1,Male,Yes,1.0,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,8.430109,7.319202,4.85203,6091.0,0.355556,47.585938
2,Male,Yes,0.0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,8.006368,0.0,4.189655,3000.0,0.183333,45.454545
3,Male,Yes,0.0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,7.856707,7.765993,4.787492,4941.0,0.333333,41.175
4,Male,No,0.0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,8.699515,0.0,4.94876,6000.0,0.391667,42.553191


In [38]:
# Loan term in years = Loan_Amount_Term / 12
df["Loan_Amount_Term_ in_year"] = df["Loan_Amount_Term"] / 12

In [48]:
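def income_category(income):
    """Bucket total income into Low, Medium, or High."""
    if income <= 1500:
        return "Low"
    elif income <= 5000:
        return "Medium"
    elif income > 5000:
        return "High"
    else:
        # Reached only when income is NaN, since every comparison above is then False
        return "Unknown"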


In [49]:
df["Income_Category"] = df["Total_income"].apply(income_category)
df["Income_Category"]

0        High
1        High
2      Medium
3      Medium
4        High
        ...  
609    Medium
610    Medium
611      High
612      High
613    Medium
Name: Income_Category, Length: 614, dtype: object
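
The same thresholds can be applied without a row-wise `apply` by using `pd.cut`. The sketch below uses made-up income values and assumes the same cut points (1500 and 5000) with right-closed intervals, so 1500 falls in "Low" and 5000 in "Medium", as in the function above.

```python
import numpy as np
import pandas as pd

total_income = pd.Series([1200.0, 1500.0, 3000.0, 5000.0, 5849.0])  # made-up values

# Right-closed bins: (-inf, 1500], (1500, 5000], (5000, inf)
income_bins = pd.cut(
    total_income,
    bins=[-np.inf, 1500, 5000, np.inf],
    labels=["Low", "Medium", "High"],
)
```

`pd.cut` returns an ordered categorical and leaves missing incomes as `NaN` instead of assigning them a label.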

In [29]:
df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,ApplicantIncome_log,CoapplicantIncome_log,LoanAmount_log,Loan_Amount_Term_ in_year
0,Male,No,0.0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y,8.674026,0.0,4.85203,30.0
1,Male,Yes,1.0,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,8.430109,7.319202,4.85203,30.0
2,Male,Yes,0.0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,8.006368,0.0,4.189655,30.0
3,Male,Yes,0.0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,7.856707,7.765993,4.787492,30.0
4,Male,No,0.0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,8.699515,0.0,4.94876,30.0


#### **5. Encode categorical variables (Label + One-Hot)**

In [None]:
# Use a separate encoder per column so each mapping can be inverted later
columns_to_encode = ["Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area", "Loan_Status"]
label_encoders = {}
for col in columns_to_encode:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])
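
Label encoding assigns integer codes, which can suggest an ordering that a nominal column such as `Property_Area` does not have. For the one-hot half of this step, a minimal sketch with `pd.get_dummies` might look like the following. The toy frame, the `Area` prefix, and the `Semiurban` category value are assumptions made for illustration.

```python
import pandas as pd

# Toy column; the category values here are illustrative
areas = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(areas, columns=["Property_Area"], prefix="Area", dtype=int)
```

Each row ends up with exactly one indicator set to 1. For linear models, passing `drop_first=True` removes one redundant indicator column.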