# 02 - Feature Engineering

**LoanGuardian**: Transform the raw synthetic UAE loan dataset into ML-ready features.

This notebook will:

- Handle missing values & outliers
- Encode categorical variables
- Apply SMOTE for class imbalance
- Scale numerical features
- Perform feature selection


In [3]:
# 1. Imports & Setup
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import os

# Save directories
os.makedirs('../docs/reports/fe_images', exist_ok=True)




In [4]:
# 2. Load Dataset
DATA_PATH = '../data/loan_guardian_uae.csv'
df = pd.read_csv(DATA_PATH)
print('Dataset shape:', df.shape)
df.head(3)


Dataset shape: (50000, 15)


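|   | LoanID | LoanAmount | Income | Age | CreditScore | ExistingEMI | EmploymentStatus | Tenure | MaritalStatus | Dependents | RepaymentHistory | Purpose | Stage | DelinquencyCount12M | DefaultStatus |
|---|--------|------------|--------|-----|-------------|-------------|------------------|--------|---------------|------------|------------------|---------|-------|---------------------|---------------|
| 0 | 1 | 126958 | 26125 | 41 | 782 | 16638 | Employed | 22 | Widowed | 2 | Good | Car Loan | Disbursement | 1 | 0 |
| 1 | 2 | 676155 | 7655 | 31 | 830 | 11937 | Unemployed | 55 | Single | 0 | Good | Car Loan | Disbursement | 1 | 0 |
| 2 | 3 | 136932 | 4448 | 23 | 826 | 2112 | Retired | 39 | Widowed | 0 | Good | Home Loan | File | 2 | 0 |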


## 3. Handle Missing Values

- For numerical columns → median imputation
- For categorical columns → mode imputation
- Document missingness for reproducibility


In [5]:
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = df.select_dtypes(include=['object']).columns.tolist()

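# Note: medians and modes are computed on the full dataset, i.e. before the train-test split.

# Numerical imputation (assign back instead of chained inplace=True,
# which is deprecated in recent pandas and may not modify df)
for col in num_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].median())

# Categorical imputation
for col in cat_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mode()[0])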

# Verify
print('Missing values after imputation:')
print(df.isnull().sum().sum())


Missing values after imputation:
0
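The imputation cell above does not cover the third bullet, documenting missingness. A minimal sketch of one way to record it before imputing is shown here. The helper name `missingness_report` is hypothetical, and the toy frame stands in for the real dataset (its values are illustrative only).

```python
import numpy as np
import pandas as pd


def missingness_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages (hypothetical helper)."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps, worst first.
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy data standing in for the loan dataset (illustrative values only).
toy = pd.DataFrame({
    "Income": [26125.0, np.nan, 4448.0, np.nan],
    "Purpose": ["Car Loan", None, "Home Loan", "Car Loan"],
    "Age": [41, 31, 23, 52],
})
report = missingness_report(toy)
print(report)
```

Running such a report on `df` before the imputation loop, and saving it alongside the other outputs, would leave a record of what was filled in.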


## 4. Outlier Treatment (IQR capping)

- Cap values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] at the nearest bound
- Ensure values are realistic (UAE-relevant ranges)


In [6]:
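for col in num_cols:
    # Skip the identifier and the target; neither should be capped
    if col in ['LoanID', 'DefaultStatus']:
        continue
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Replace values below/above the fences with the fence value itself
    df[col] = np.where(df[col] < lower, lower,
                       np.where(df[col] > upper, upper, df[col]))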
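To make the capping rule concrete, here is a small worked sketch on toy numbers. The helper name `iqr_bounds` is hypothetical, and `Series.clip` is used here as a compact alternative to nested `np.where`; it is not the notebook's own code.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper IQR fences for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy values: Q1 = 12, Q3 = 16, IQR = 4, so the fences are 6 and 22.
income = pd.Series([10.0, 12.0, 14.0, 16.0, 100.0])
lower, upper = iqr_bounds(income)
capped = income.clip(lower=lower, upper=upper)

print(lower, upper)     # 6.0 22.0
print(capped.tolist())  # [10.0, 12.0, 14.0, 16.0, 22.0]
```

Only the extreme value 100 is pulled in to the upper fence; the other points are left unchanged.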


## 5. Categorical Encoding

- Label Encoding for binary-like features
- One-Hot Encoding for multi-class categorical features
- WOE (weight of evidence) encoding for target-related categories (optional)


In [7]:
# Label Encoding
binary_cols = [col for col in cat_cols if df[col].nunique() == 2]
le = LabelEncoder()
for col in binary_cols:
    df[col] = le.fit_transform(df[col])

# One-Hot Encoding
multi_cols = [col for col in cat_cols if df[col].nunique() > 2]
df = pd.get_dummies(df, columns=multi_cols, drop_first=True)
print('Shape after encoding:', df.shape)


Shape after encoding: (50000, 24)
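The optional WOE encoding is not implemented in the cell above. A minimal sketch follows, assuming the convention WOE = ln(share of non-defaults / share of defaults) per category; sign conventions vary between references. The helper `woe_table`, the `smoothing` parameter, and the toy category labels are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def woe_table(category: pd.Series, target: pd.Series, smoothing: float = 0.5) -> pd.Series:
    """WOE per category: ln(share of non-defaults / share of defaults)."""
    counts = pd.crosstab(category, target)
    # Additive smoothing keeps the log finite for categories with zero events.
    good = counts[0] + smoothing
    bad = counts[1] + smoothing
    return np.log((good / good.sum()) / (bad / bad.sum()))


# Toy data (illustrative labels and outcomes, not taken from the dataset).
toy = pd.DataFrame({
    "RepaymentHistory": ["Good", "Good", "Good", "Poor", "Poor", "Poor"],
    "DefaultStatus": [0, 0, 1, 0, 1, 1],
})
woe = woe_table(toy["RepaymentHistory"], toy["DefaultStatus"])
toy["RepaymentHistory_woe"] = toy["RepaymentHistory"].map(woe)
print(woe)
```

Because WOE uses the target, the mapping would normally be fitted on the training split only and then applied to the test split, to avoid leaking target information.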


In [8]:
# 6. Train-Test Split
X = df.drop(['LoanID','DefaultStatus'], axis=1)
y = df['DefaultStatus']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42, stratify=y)

print('Training set shape:', X_train.shape)
print('Test set shape:', X_test.shape)


Training set shape: (40000, 22)
Test set shape: (10000, 22)


## 7. Handle Class Imbalance (SMOTE)

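- Loan default is heavily imbalanced: the resampled shape printed below implies roughly 1,810 defaults among the 40,000 training rows (about 4.5%)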
- Apply SMOTE only on training data


In [9]:
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
print('Resampled training shape:', X_train_res.shape)
print('Default rate after SMOTE:', y_train_res.mean())


Resampled training shape: (76380, 22)
Default rate after SMOTE: 0.5
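The printed shapes are enough to back out the original class balance. The arithmetic below assumes SMOTE's default sampling strategy, which oversamples the minority class until it matches the majority class; the variable names are illustrative.

```python
resampled_rows = 76380  # from the SMOTE output above
train_rows = 40000      # from the train-test split output

# After SMOTE the two classes are balanced, so each holds half of the rows.
majority = resampled_rows // 2
# The majority class is untouched, so the remainder of the original
# training set is the real minority (default) class.
minority_before = train_rows - majority
# Rows SMOTE had to synthesize to reach parity.
synthetic = majority - minority_before

print(majority, minority_before, synthetic)  # 38190 1810 36380
print(f"default rate before SMOTE: {minority_before / train_rows:.1%}")
```

In other words, about 95% of the defaults the model sees during training are synthetic, which is worth keeping in mind when interpreting training metrics.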


## 8. Scaling Numerical Features

- StandardScaler for scale-sensitive (non-tree) models; tree-based models do not require scaling
- MinMaxScaler as an optional alternative for neural networks


In [11]:
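scaler = StandardScaler()
# Work on a copy so the assignment below does not write into a slice of X
X_test = X_test.copy()
num_features = [col for col in X_train_res.columns if X_train_res[col].dtype in [np.int64, np.float64]]
# Fit on the (resampled) training data only, then apply the same transform to the test set
X_train_res[num_features] = scaler.fit_transform(X_train_res[num_features])
X_test[num_features] = scaler.transform(X_test[num_features])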
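The fit-on-train, transform-on-test pattern can be illustrated with plain NumPy. This is a sketch of the arithmetic behind standardization and min-max scaling on toy numbers, not a replacement for the scikit-learn classes used above.

```python
import numpy as np

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [5.0]])

# Standardization: statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses
train_std = (train - mean) / std
test_std = (test - mean) / std

# Min-max scaling: again, min and max come from the training data only.
lo, hi = train.min(axis=0), train.max(axis=0)
train_mm = (train - lo) / (hi - lo)
test_mm = (test - lo) / (hi - lo)

print(train_mm.ravel().tolist())  # [0.0, 0.5, 1.0]
print(test_mm.ravel().tolist())   # [0.5, 2.0]
```

Note that the test value 5 maps to 2.0 under min-max scaling: test data can fall outside the training range, which is expected when the scaler is fitted on the training set alone.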


## 9. Feature Selection

- Remove multicollinear features (VIF > 10)
- Remove low-importance features based on correlation


In [12]:
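def calculate_vif(features):
    """Compute the variance inflation factor for each column of `features`."""
    vif_data = pd.DataFrame()
    vif_data['feature'] = features.columns
    # The features were standardized above, so no explicit intercept column is added
    values = features.values
    vif_data['VIF'] = [variance_inflation_factor(values, i) for i in range(values.shape[1])]
    return vif_data

vif = calculate_vif(X_train_res[num_features])
high_vif = vif[vif['VIF'] > 10]
print('High VIF features:\n', high_vif)

# Optionally drop high VIF features (reassign instead of inplace=True on a possible slice)
cols_to_drop = high_vif['feature'].tolist()
X_train_res = X_train_res.drop(columns=cols_to_drop, errors='ignore')
X_test = X_test.drop(columns=cols_to_drop, errors='ignore')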


High VIF features:
 Empty DataFrame
Columns: [feature, VIF]
Index: []
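The second bullet, removing weakly related features based on correlation, is not implemented in the cell above. One possible sketch is shown here, assuming a hypothetical helper `low_correlation_features` and an arbitrary threshold of 0.05 on the absolute Pearson correlation with the target; both are illustrative choices, not the notebook's method.

```python
import pandas as pd


def low_correlation_features(X: pd.DataFrame, y: pd.Series, threshold: float = 0.05) -> list:
    """List columns whose absolute correlation with y is below the threshold."""
    corr = X.corrwith(y).abs()
    # Constant columns produce NaN and are not flagged by this comparison.
    return corr[corr < threshold].index.tolist()


# Toy data: 'a' rises with the target, 'b' is uncorrelated with it by construction.
X = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, -1, 0, 1, -1, 0],
})
y = pd.Series([0, 0, 0, 1, 1, 1])

to_drop = low_correlation_features(X, y)
print(to_drop)  # ['b']
```

A linear correlation filter can miss features that matter only through interactions or non-linear effects, so the flagged list is better treated as candidates to review than as an automatic drop list.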


## 10. Save Preprocessed Data

- Save preprocessed training & test sets for modeling


In [14]:
os.makedirs('../data/processed', exist_ok=True)
X_train_res.to_csv('../data/processed/X_train_res.csv', index=False)
y_train_res.to_csv('../data/processed/y_train_res.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)

print('Preprocessed data saved in data/processed/')


Preprocessed data saved in data/processed/
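When the saved files are read back in a later notebook, the target comes back as a one-column DataFrame rather than a Series. A short sketch of the round trip, using a temporary directory and toy values in place of the real `data/processed/` files:

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Toy target standing in for y_train_res.
    y = pd.Series([0, 1, 0], name="DefaultStatus")
    path = os.path.join(tmp, "y_train_res.csv")
    y.to_csv(path, index=False)

    # read_csv returns a DataFrame; select the column to recover a Series.
    y_loaded = pd.read_csv(path)["DefaultStatus"]

print(type(y_loaded).__name__, y_loaded.tolist())  # Series [0, 1, 0]
```

Saving with `index=False` also means the original row index is not preserved, which is fine here because the resampled training set has no meaningful index.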
