This notebook implements the preprocessing steps recommended by the EDA report, so that each transformation is grounded in a pattern observed in the data.
Based on the EDA report, we will:
1. **Handle Skewed Variables** - Log-transform ``, `APPL`, `chlorides`
2. **Outlier Treatment** - IQR-capping for extreme acidity/sulphates 
3. **Feature Engineering** - Create total income and interaction features
4. **Feature Selection** - Keep high-signal features, evaluate low-signal ones
5. **Scaling** - StandardScaler for distance-based models
6. **Target Handling** - Classification approach with stratified splits
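
Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, not the notebook's final implementation: the helper names `log_transform` and `iqr_cap`, the use of `np.log1p` (rather than a plain log), the 1.5 IQR multiplier, and the toy values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def log_transform(df, columns):
    """Return a copy of df with log1p applied to the given non-negative columns."""
    out = df.copy()
    for col in columns:
        # log1p(x) = log(1 + x), which stays finite when x is 0
        out[col] = np.log1p(out[col])
    return out


def iqr_cap(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy values invented for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "chlorides": [0.05, 0.07, 0.08, 0.09, 0.61],
    "sulphates": [0.50, 0.55, 0.60, 0.65, 2.00],
})

toy_logged = log_transform(toy, ["chlorides"])
toy_capped = toy_logged.assign(sulphates=iqr_cap(toy_logged["sulphates"]))
print(toy_capped)
```

Capping keeps every row and only pulls extreme values back to the fence, which is usually preferable to dropping rows when the dataset is small.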

**Key EDA Evidence to Implement**

- **High-signal features**: `alcohol`, `volatile acidity`, `sulphates`, `citric acid`, `density`, `chlorides`
- **Low-signal features**: `residual sugar`, `free sulfur dioxide` (evaluate for removal)
- **Skewed variables**: `residual sugar`, `total sulfur dioxide`, `chlorides` (log-transform)
- **Feature engineering**: Acidity ratios, alcohol-acidity interactions, fermentation efficiency
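
The feature engineering item above could look something like the sketch below. The EDA summary does not define the ratios or interactions, so the two formulas here (citric acid divided by volatile acidity, and alcohol multiplied by volatile acidity) are hypothetical stand-ins, as are the function name `add_engineered_features`, the new column names, and the toy values. Fermentation efficiency is left out because no definition is given.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with two illustrative engineered columns."""
    out = df.copy()
    # Hypothetical acidity ratio: citric acid relative to volatile acidity.
    # A zero denominator is mapped to NaN so the ratio never becomes inf.
    denom = out["volatile acidity"].replace(0, np.nan)
    out["citric_to_volatile"] = out["citric acid"] / denom
    # Hypothetical alcohol-acidity interaction: a simple product.
    out["alcohol_x_volatile"] = out["alcohol"] * out["volatile acidity"]
    return out


# Toy values invented for illustration only
toy = pd.DataFrame({
    "alcohol": [9.4, 10.0, 11.2],
    "volatile acidity": [0.70, 0.50, 0.0],
    "citric acid": [0.00, 0.25, 0.56],
})

engineered = add_engineered_features(toy)
print(engineered)
```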

In [2]:
# Import necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time
import scipy.stats as stat

In [5]:
file = pd.read_csv('cleaned_eda_data.csv')
file.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
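
The `Unnamed: 0` column in the output above usually means the DataFrame index was written to the CSV when it was saved. The sketch below shows one way to read it back as the index and then build the total income feature from step 3. It is a hedged illustration: it assumes the saved file has an empty leading header field, uses an in-memory CSV with a subset of the columns shown above, and the names `TotalIncome` and `TotalIncome_log` are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# In-memory stand-in for the saved file: three rows and a subset of the
# columns from the head() output, with an empty header for the index column.
csv_text = (
    ",Loan_ID,ApplicantIncome,CoapplicantIncome,Loan_Status\n"
    "0,LP001003,4583,1508.0,N\n"
    "1,LP001005,3000,0.0,Y\n"
    "2,LP001006,2583,2358.0,Y\n"
)

# index_col=0 treats the first column as the index instead of data
loans = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Hypothetical feature names: combined household income and its log1p
loans["TotalIncome"] = loans["ApplicantIncome"] + loans["CoapplicantIncome"]
loans["TotalIncome_log"] = np.log1p(loans["TotalIncome"])
print(loans)
```

If the stray column is already loaded, `file.drop(columns="Unnamed: 0")` has the same effect.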
