## Preprocessing and Feature Engineering

This notebook prepares the raw credit risk data for machine learning model training.

**Purpose**: Transform raw data into model-ready features through cleaning, encoding, and scaling.

**Key Steps**:
1. **Data Loading**: Load the raw dataset
2. **Train/Test Split**: Split data before any preprocessing (prevents data leakage)
3. **Outlier Handling**: Cap implausible values (age above 100, employment length above 50 years)
4. **Missing Value Imputation**: Fill missing values using KNN imputation
5. **Feature Scaling**: Scale numeric features using RobustScaler
6. **Categorical Encoding**: Convert text categories to numbers
   - Ordinal encoding for loan_grade (preserves order)
   - One-hot encoding for nominal features (no order)
   - Binary encoding for yes/no features
7. **Save Processed Data**: Save train/test splits and preprocessing components

**Critical Principle**: All preprocessing is fit ONLY on training data, then applied to test data. This prevents data leakage and ensures realistic model evaluation.

In [1]:
# Import required libraries for data preprocessing
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn components for preprocessing
from sklearn.model_selection import train_test_split  # Split data into train/test sets
from sklearn.preprocessing import RobustScaler, OneHotEncoder, LabelEncoder  # Feature scaling and encoding
from sklearn.impute import KNNImputer  # Missing value imputation using K-Nearest Neighbors

import pickle  # For saving preprocessing components (scalers, encoders)
import warnings
warnings.filterwarnings('ignore')  # Suppress warnings for cleaner output

# Configure visualization style
plt.style.use('seaborn-v0_8-whitegrid')  # Clean grid background
sns.set_palette("husl")  # Colorful, distinguishable palette

In [2]:
# Set up project directory paths
# Current directory should be src/, so we go up one level to reach project root
ROOT = os.path.abspath(os.getcwd())
PROJECT_ROOT = os.path.abspath(os.path.join(ROOT, '..'))

# Define key directories
DATASET_DIR = os.path.join(PROJECT_ROOT, 'dataset')  # Where raw and processed data are stored
MODELS_DIR = os.path.join(PROJECT_ROOT, 'models')  # Where trained models and preprocessors are saved
ARTIFACTS_DIR = os.path.join(PROJECT_ROOT, 'artifacts')  # Where analysis results are saved
DATA_PATH = os.path.join(DATASET_DIR, 'credit_risk_dataset.csv')  # Path to raw dataset

# Create directories if they don't exist
# This ensures we can save files without errors
os.makedirs(DATASET_DIR, exist_ok=True)
os.makedirs(MODELS_DIR, exist_ok=True)
os.makedirs(ARTIFACTS_DIR, exist_ok=True)

print('DIRECTORY SETUP')
print(f"Raw data file: {DATA_PATH}")
print(f"Models directory: {MODELS_DIR}")
print(f"Dataset directory: {DATASET_DIR}")
print(f"Artifacts directory: {ARTIFACTS_DIR}")

DIRECTORY SETUP
Raw data file: d:\FINAL PROJECT\dataset\credit_risk_dataset.csv
Models directory: d:\FINAL PROJECT\models
Dataset directory: d:\FINAL PROJECT\dataset
Artifacts directory: d:\FINAL PROJECT\artifacts


In [3]:
df = pd.read_csv(DATA_PATH)
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

Dataset shape: (32581, 12)
Columns: ['person_age', 'person_income', 'person_home_ownership', 'person_emp_length', 'loan_intent', 'loan_grade', 'loan_amnt', 'loan_int_rate', 'loan_status', 'loan_percent_income', 'cb_person_default_on_file', 'cb_person_cred_hist_length']


In [4]:
print("First 10 rows:")
df.head(10)

First 10 rows:


Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
5,21,9900,OWN,2.0,VENTURE,A,2500,7.14,1,0.25,N,2
6,26,77100,RENT,8.0,EDUCATION,B,35000,12.42,1,0.45,N,3
7,24,78956,RENT,5.0,MEDICAL,B,35000,11.11,1,0.44,N,4
8,24,83000,RENT,8.0,PERSONAL,A,35000,8.9,1,0.42,N,2
9,21,10000,OWN,6.0,VENTURE,D,1600,14.74,1,0.16,N,3


In [5]:
print("Dataset Info:")
df.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB


In [6]:
TARGET_COL = "loan_status"
print(f"Target unique values: {df[TARGET_COL].unique()}")
print(f"Target dtype: {df[TARGET_COL].dtype}")


Target unique values: [1 0]
Target dtype: int64


In [7]:
print(f"Final target distribution:\n{df[TARGET_COL].value_counts()}")
print(f"Default rate: {(df[TARGET_COL].mean() * 100):.2f}%")

Final target distribution:
loan_status
0    25473
1     7108
Name: count, dtype: int64
Default rate: 21.82%


In [8]:
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100

missing_summary = pd.DataFrame({
    'Missing_Count': missing_data,
    'Missing_Percentage': missing_percentage
}).sort_values('Missing_Count', ascending=False)

print("Missing Values Summary:")
print(missing_summary[missing_summary['Missing_Count'] > 0])

Missing Values Summary:
                   Missing_Count  Missing_Percentage
loan_int_rate               3116            9.563856
person_emp_length            895            2.747000


In [9]:
feature_cols = [col for col in df.columns if col != TARGET_COL]
numeric_cols = df[feature_cols].select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = [col for col in feature_cols if col not in numeric_cols]

print(f"Numerical features ({len(numeric_cols)}): {numeric_cols}")
print(f"Categorical features ({len(categorical_cols)}): {categorical_cols}")

Numerical features (7): ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length']
Categorical features (4): ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file']


In [10]:
# CRITICAL STEP: Split data into features and target BEFORE any preprocessing
# This ensures we don't accidentally use test data information during preprocessing
# Separate features (X) from target variable (y)
X = df[feature_cols].copy()  # All columns except target
y = df[TARGET_COL]  # Target variable (loan_status: 0 = good loan, 1 = bad loan)

# Split data into training and test sets
# IMPORTANT: This split happens FIRST, before any preprocessing
# Why? To prevent data leakage - we must fit preprocessors ONLY on training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,  # 20% for testing, 80% for training
    stratify=y,  # CRITICAL: Keeps the class distribution (about 22% defaults) the same in both sets
                 # Without this, the default rate could differ between train and test by chance,
                 # which would make evaluation less reliable
    random_state=125,  # Set seed for reproducibility (same split every time)
)

print('TRAIN/TEST SPLIT COMPLETE')
print(f"Training set: {X_train.shape[0]:,} samples ({(X_train.shape[0]/len(X)*100):.1f}%)")
print(f"Test set: {X_test.shape[0]:,} samples ({(X_test.shape[0]/len(X)*100):.1f}%)")
print(f"\nData split before preprocessing - this prevents data leakage!")
print(f"Stratified split ensures same class balance in train and test sets")

TRAIN/TEST SPLIT COMPLETE
Training set: 26,064 samples (80.0%)
Test set: 6,517 samples (20.0%)

Data split before preprocessing - this prevents data leakage!
Stratified split ensures same class balance in train and test sets
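
The effect of `stratify` can be checked directly by comparing positive rates after the split. The sketch below uses a synthetic target with roughly the same 22% positive rate. The data is generated, not taken from the credit risk file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic binary target with about 22% positives (made-up data)
rng = np.random.default_rng(125)
y_demo = (rng.random(10_000) < 0.22).astype(int)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=125
)

# With stratification the two rates agree up to rounding of the class counts
print(f"overall: {y_demo.mean():.4f}  train: {y_tr.mean():.4f}  test: {y_te.mean():.4f}")
```

In this notebook the equivalent check would be comparing `y_train.mean()` with `y_test.mean()`.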


In [11]:
# Handle Outliers: Cap extreme values to prevent them from skewing model training
# IMPORTANT: The caps are fixed, domain-based limits (not statistics estimated from the data),
# so applying them to both train and test introduces no data leakage

# Cap person_age at 100 years
# Reason: Extremely high ages (e.g., 144) are likely data entry errors
# We cap at 100 as a reasonable maximum age for loan applicants
age_original_max = X_train['person_age'].max()
X_train['person_age'] = X_train['person_age'].clip(upper=100)
print(f"Capped 'person_age': {age_original_max} → 100 (removed extreme outliers)")

# Cap person_emp_length at 50 years
# Reason: Employment length over 50 years is unrealistic (likely data errors)
# We cap at 50 years as a reasonable maximum employment duration
emp_original_max = X_train['person_emp_length'].max()
X_train['person_emp_length'] = X_train['person_emp_length'].clip(upper=50)
print(f"Capped 'person_emp_length': {emp_original_max} → 50 (removed extreme outliers)")

# Apply the same fixed caps to test data
X_test['person_age'] = X_test['person_age'].clip(upper=100)
X_test['person_emp_length'] = X_test['person_emp_length'].clip(upper=50)
print(f"\nApplied same caps to test data (no data leakage)")

Capped 'person_age': 144 → 100 (removed extreme outliers)
Capped 'person_emp_length': 123.0 → 50 (removed extreme outliers)

Applied same caps to test data (no data leakage)
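
Capping changes values silently, so it can help to count how many rows each cap touches. The helper below is a hypothetical sketch (`cap_outliers` and `CAPS` are names introduced here, not part of the notebook) that applies the same fixed limits and reports the number of values changed per column.

```python
import pandas as pd

# Fixed, domain-based upper limits, same values as in the cell above
CAPS = {"person_age": 100, "person_emp_length": 50}

def cap_outliers(frame: pd.DataFrame, caps: dict) -> tuple[pd.DataFrame, dict]:
    """Return a capped copy plus the number of values changed per column."""
    capped = frame.copy()
    n_changed = {}
    for col, upper in caps.items():
        # NaN compares as False, so missing values are neither counted nor altered
        n_changed[col] = int((capped[col] > upper).sum())
        capped[col] = capped[col].clip(upper=upper)
    return capped, n_changed

# Made-up rows that mimic the extreme values seen in the raw data
demo = pd.DataFrame({
    "person_age": [22, 144, 35],
    "person_emp_length": [123.0, 5.0, None],
})
capped, n_changed = cap_outliers(demo, CAPS)
print(n_changed)
```

Working on a copy also avoids modifying a slice of the original frame in place.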


In [12]:
X_train['person_age'].max(), X_train['person_emp_length'].max()

(np.int64(100), np.float64(50.0))

In [13]:
# Split categorical columns by encoding type
# ORDINAL: loan_grade (A, B, C, D, E, F, G) - has natural order; LabelEncoder assigns codes
#          alphabetically, which happens to match the grade order
# NOMINAL: person_home_ownership, loan_intent - no order, use OneHotEncoder
# BINARY: cb_person_default_on_file (Y/N) - simple mapping
ordinal_cols = ['loan_grade']
nominal_cols = ['person_home_ownership', 'loan_intent']
binary_cols = ['cb_person_default_on_file']

print(f"Ordinal features (1): {ordinal_cols}")
print(f"Nominal features ({len(nominal_cols)}): {nominal_cols}")
print(f"Binary features (1): {binary_cols}")

Ordinal features (1): ['loan_grade']
Nominal features (2): ['person_home_ownership', 'loan_intent']
Binary features (1): ['cb_person_default_on_file']


In [14]:
X_train.isna().sum()

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              719
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 2522
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64

In [15]:
# Missing Value Imputation: Fill in missing values for numeric features
# Method: KNN Imputer (K-Nearest Neighbors)
# Why KNN instead of mean/median? 
#   - KNN finds similar rows and uses their values (preserves relationships between features)
#   - Mean/median ignores feature relationships (less accurate)
#   - Example: If someone has high income but missing employment length, KNN finds similar high-income people
#     and uses their employment length (more realistic than just using overall median)

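# Configure KNN Imputer
# n_neighbors=5 (the scikit-learn default): averages the 5 most similar rows to impute each missing value
#   - More neighbors (e.g., 10): Smoother, less noisy estimates, but may blur local structure
#   - Fewer neighbors (e.g., 3): More local estimates, but more sensitive to noise
# Note: distances are computed on the raw (unscaled) features at this point, so
#   large-magnitude columns such as person_income dominate the neighbor search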
numeric_imputer = KNNImputer(n_neighbors=5)

# CRITICAL: Fit imputer ONLY on training data, then apply to test data
# This prevents data leakage - test data doesn't influence how we handle missing values
print('MISSING VALUE IMPUTATION')
print('Method: KNN Imputer (finds 5 most similar rows)')
print('Fitting imputer on training data only...')

# fit_transform(): Learn imputation pattern from training data AND fill missing values
X_train_num_imputed = numeric_imputer.fit_transform(X_train[numeric_cols])

# transform(): Apply learned pattern to test data (no learning, just application)
X_test_num_imputed = numeric_imputer.transform(X_test[numeric_cols])

# Convert back to DataFrames to preserve column names and row indices
X_train_num_processed = pd.DataFrame(
    X_train_num_imputed,
    columns=numeric_cols,
    index=X_train.index
)

X_test_num_processed = pd.DataFrame(
    X_test_num_imputed,
    columns=numeric_cols,
    index=X_test.index
)

print(f"Training data: Missing values imputed")
print(f"Test data: Missing values imputed using training patterns (no data leakage)")

MISSING VALUE IMPUTATION
Method: KNN Imputer (finds 5 most similar rows)
Fitting imputer on training data only...
Training data: Missing values imputed
Test data: Missing values imputed using training patterns (no data leakage)
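
`KNNImputer` measures similarity with a NaN-aware Euclidean distance, so the units of each column matter. The sketch below, on made-up rows, compares imputing in raw units with an alternative ordering: scale first, impute, then invert the scaling. It is an illustration of that alternative under toy assumptions, not the procedure this notebook follows.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler

# Made-up rows: [person_income, person_age, loan_int_rate]
X = np.array([
    [50_000.0, 25.0, 8.0],
    [50_500.0, 60.0, 15.0],
    [52_000.0, 25.0, 8.5],
    [90_000.0, 40.0, 12.0],
    [50_400.0, 26.0, np.nan],   # row to impute
])

# 1) Impute on raw units: income differences swamp age differences
raw_filled = KNNImputer(n_neighbors=1).fit_transform(X)[-1, 2]

# 2) Scale first (RobustScaler ignores NaN when fitting and keeps it when
#    transforming), impute in scaled space, then undo the scaling
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_imputed = KNNImputer(n_neighbors=1).fit_transform(X_scaled)
scaled_filled = scaler.inverse_transform(X_scaled_imputed)[-1, 2]

print(f"imputed on raw units:    {raw_filled:.2f}")
print(f"imputed on scaled units: {scaled_filled:.2f}")
```

In this toy case the raw-unit imputer copies the rate from the row whose income is closest, even though that applicant is 34 years older. After scaling, the nearest neighbor is the row that is close in both income and age. Whether the reordering helps on the real data would need to be checked empirically.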


In [16]:
# Feature Scaling: Scale numeric features to similar ranges
# Why scale? Different features have different units and ranges (e.g., income in tens of thousands, age in years)
# Scaling keeps large-magnitude features from dominating distance-based and gradient-based models

# Method: RobustScaler (instead of StandardScaler)
# Why RobustScaler?
#   - StandardScaler uses mean and standard deviation (sensitive to outliers)
#   - RobustScaler uses median and IQR (Interquartile Range) - robust to outliers
#   - Our dataset has outliers (from EDA), so RobustScaler is more appropriate
#   - Median: Middle value (not affected by extreme values)
#   - IQR: Spread between 25th and 75th percentile (measures variability without outliers)

robust_scaler = RobustScaler()

# CRITICAL: Fit scaler ONLY on training data, then apply to test data
# This prevents data leakage - test data doesn't influence scaling parameters
print('FEATURE SCALING')
print('Method: RobustScaler (uses median and IQR, robust to outliers)')
print('Fitting scaler on training data only...')

# fit_transform(): Learn scaling parameters (median, IQR) from training data AND scale
X_train_num_scaled = robust_scaler.fit_transform(X_train_num_processed)

# transform(): Apply learned scaling to test data (no learning, just application)
X_test_num_scaled = robust_scaler.transform(X_test_num_processed)

# Convert back to DataFrames to preserve column names and row indices
X_train_num_scaled = pd.DataFrame(
    X_train_num_scaled, 
    columns=numeric_cols, 
    index=X_train.index
)
X_test_num_scaled = pd.DataFrame(
    X_test_num_scaled, 
    columns=numeric_cols, 
    index=X_test.index
)

# Save the scaler for later use in prediction pipeline
# When we make predictions on new data, we need to apply the same scaling
scaler_path = os.path.join(MODELS_DIR, 'RobutScaler.pkl')
with open(scaler_path, 'wb') as f:
    pickle.dump(robust_scaler, f)
print(f"Training data: Features scaled")
print(f"Test data: Features scaled using training parameters (no data leakage)")
print(f"Scaler saved to: {scaler_path} (for use in prediction pipeline)")

FEATURE SCALING
Method: RobustScaler (uses median and IQR, robust to outliers)
Fitting scaler on training data only...
Training data: Features scaled
Test data: Features scaled using training parameters (no data leakage)
Scaler saved to: d:\FINAL PROJECT\models\RobutScaler.pkl (for use in prediction pipeline)
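
With its default settings, `RobustScaler` computes `(x - median) / IQR` per column. The short check below reproduces that by hand on a made-up age column with one extreme value and compares it with the fitted scaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up ages with one extreme value
x = np.array([[21.0], [23.0], [24.0], [26.0], [100.0]])

scaler = RobustScaler().fit(x)

# Manual computation of the same transform
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

# The fitted attributes hold the median (center_) and the IQR (scale_)
assert np.allclose(manual, scaler.transform(x))
print(f"median={scaler.center_[0]}, IQR={scaler.scale_[0]}, mean={x.mean():.1f}")
```

The extreme value pulls the mean well above the median, while the median and IQR are set by the bulk of the values. That is the property the comments above rely on.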


In [17]:
# Categorical Feature Encoding: Convert text categories to numbers
# Different encoding strategies for different types of categorical features
# Strategy chosen based on whether categories have natural order or not

print('CATEGORICAL FEATURE ENCODING')

# ============================================================================
# ORDINAL ENCODING: loan_grade (has natural order: A < B < C < D < E < F < G)
# ============================================================================
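# Method: LabelEncoder
# Why? It assigns codes in sorted (alphabetical) order, which matches the grade order here (A=0, B=1, C=2, etc.)
# Note: LabelEncoder is designed for target labels; OrdinalEncoder with an explicit
#       category list is the usual choice for input features
# Models can learn that higher numbers = worse grade = higher risk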
print('\n1. Ordinal Encoding (loan_grade):')
print('   Method: LabelEncoder (preserves order: A < B < C < D < E < F < G)')

label_encoder = LabelEncoder()
# fit_transform(): Learn the sorted set of grades from training data AND encode
X_train_ordinal = label_encoder.fit_transform(X_train[ordinal_cols].values.ravel())
# transform(): Apply learned mapping to test data
X_test_ordinal = label_encoder.transform(X_test[ordinal_cols].values.ravel())
X_train_ordinal_df = pd.DataFrame(X_train_ordinal, columns=ordinal_cols, index=X_train.index)
X_test_ordinal_df = pd.DataFrame(X_test_ordinal, columns=ordinal_cols, index=X_test.index)
print(f"Encoded: {len(ordinal_cols)} feature(s)")

# ============================================================================
# NOMINAL ENCODING: person_home_ownership, loan_intent (no natural order)
# ============================================================================
# Method: OneHotEncoder
# Why? No order assumption - each category except the dropped first one gets its own binary column
# Example with drop='first' and three categories: MORTGAGE → [0,0], OWN → [1,0], RENT → [0,1]
print('\n2. Nominal Encoding (person_home_ownership, loan_intent):')
print('   Method: OneHotEncoder (creates binary columns, no order assumption)')

one_hot_encoder = OneHotEncoder(
    drop='first',  # Drop the first category of each feature to avoid redundant (collinear) columns
    sparse_output=False,  # Return a dense NumPy array instead of a sparse matrix
    handle_unknown='ignore'  # Unseen categories are encoded as all zeros; with drop='first'
                             # that is the same encoding as the dropped first category
)
# fit_transform(): Learn all categories from training data AND create binary columns
X_train_nominal = one_hot_encoder.fit_transform(X_train[nominal_cols])
# transform(): Apply learned categories to test data
X_test_nominal = one_hot_encoder.transform(X_test[nominal_cols])
nominal_feature_names = one_hot_encoder.get_feature_names_out(nominal_cols)
X_train_nominal_df = pd.DataFrame(X_train_nominal, columns=nominal_feature_names, index=X_train.index)
X_test_nominal_df = pd.DataFrame(X_test_nominal, columns=nominal_feature_names, index=X_test.index)
print(f"Encoded: {len(nominal_cols)} feature(s) → {len(nominal_feature_names)} binary columns")

# ============================================================================
# BINARY ENCODING: cb_person_default_on_file (simple yes/no)
# ============================================================================
# Method: Simple replacement (no encoder needed)
# Why? Just two values, so direct mapping is simplest: "Y" → 1, "N" → 0
print('\n3. Binary Encoding (cb_person_default_on_file):')
print('   Method: Direct mapping (Y → 1, N → 0)')

X_train_binary = X_train[binary_cols].apply(lambda col: col.map({'Y': 1, 'N': 0})).values
X_test_binary = X_test[binary_cols].apply(lambda col: col.map({'Y': 1, 'N': 0})).values
X_train_binary_df = pd.DataFrame(X_train_binary, columns=binary_cols, index=X_train.index)
X_test_binary_df = pd.DataFrame(X_test_binary, columns=binary_cols, index=X_test.index)
print(f"Encoded: {len(binary_cols)} feature(s)")

# ============================================================================
# COMBINE ALL ENCODED CATEGORICAL FEATURES
# ============================================================================
# Concatenate all encoded features horizontally (side by side)
X_train_cat_processed = pd.concat([X_train_ordinal_df, X_train_nominal_df, X_train_binary_df], axis=1)
X_test_cat_processed = pd.concat([X_test_ordinal_df, X_test_nominal_df, X_test_binary_df], axis=1)

# ============================================================================
# SAVE ENCODERS FOR INFERENCE PIPELINE
# ============================================================================
# These encoders must be saved and reused when making predictions on new data
# They ensure new data is encoded the same way as training data
label_encoder_path = os.path.join(MODELS_DIR, 'LabelEncoder.pkl')
one_hot_encoder_path = os.path.join(MODELS_DIR, 'OneHotEncoder.pkl')
with open(label_encoder_path, 'wb') as f:
    pickle.dump(label_encoder, f)
with open(one_hot_encoder_path, 'wb') as f:
    pickle.dump(one_hot_encoder, f)

print('ENCODING SUMMARY')
print(f"LabelEncoder saved to: {label_encoder_path}")
print(f"OneHotEncoder saved to: {one_hot_encoder_path}")
print(f"Total encoded categorical features: {X_train_cat_processed.shape[1]} columns")
print(f"All encoders fitted on training data only (no data leakage)")

CATEGORICAL FEATURE ENCODING

1. Ordinal Encoding (loan_grade):
   Method: LabelEncoder (preserves order: A < B < C < D < E < F < G)
Encoded: 1 feature(s)

2. Nominal Encoding (person_home_ownership, loan_intent):
   Method: OneHotEncoder (creates binary columns, no order assumption)
Encoded: 2 feature(s) → 8 binary columns

3. Binary Encoding (cb_person_default_on_file):
   Method: Direct mapping (Y → 1, N → 0)
Encoded: 1 feature(s)
ENCODING SUMMARY
LabelEncoder saved to: d:\FINAL PROJECT\models\LabelEncoder.pkl
OneHotEncoder saved to: d:\FINAL PROJECT\models\OneHotEncoder.pkl
Total encoded categorical features: 10 columns
All encoders fitted on training data only (no data leakage)
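
`LabelEncoder` derives its codes from the sorted set of labels it sees during `fit`. That works for `loan_grade` only as long as every grade appears in the training split. The sketch below, on a made-up column where some grades are absent, contrasts it with `OrdinalEncoder` and an explicit category list. The A to G list is an assumption based on the grades named in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up training column in which grades D, E and F never occur
grades = pd.DataFrame({"loan_grade": ["C", "A", "G", "B"]})

# LabelEncoder: codes follow the sorted labels that were actually observed
le = LabelEncoder().fit(grades["loan_grade"])
label_codes = {label: int(code) for code, label in enumerate(le.classes_)}
print(label_codes)   # G is assigned 3 here because only four grades were seen

# OrdinalEncoder: the mapping is fixed up front, independent of the data
oe = OrdinalEncoder(categories=[list("ABCDEFG")])
codes = oe.fit_transform(grades[["loan_grade"]]).ravel()
print(codes)         # G is always 6
```

An explicit category list also documents the intended order in the code itself, which matters for ordinal features whose alphabetical order differs from their real order.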


In [18]:
# Put numeric and categorical features back together
X_train_processed = pd.concat([X_train_num_scaled, X_train_cat_processed], axis=1)
X_test_processed = pd.concat([X_test_num_scaled, X_test_cat_processed], axis=1)
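
`pd.concat(..., axis=1)` aligns rows by index label, which is why every intermediate DataFrame above is rebuilt with `index=X_train.index` or `index=X_test.index`. The sketch below uses hypothetical three-row pieces to show what happens when one piece loses its index.

```python
import numpy as np
import pandas as pd

# Hypothetical pieces: row labels are shuffled, as they are after train_test_split
idx = pd.Index([17, 3, 42])
num_part = pd.DataFrame(np.array([[0.1], [0.2], [0.3]]), columns=["num"], index=idx)

good = pd.DataFrame([[1], [0], [1]], columns=["cat"], index=idx)  # index preserved
bad = pd.DataFrame([[1], [0], [1]], columns=["cat"])              # default 0..2 index

aligned = pd.concat([num_part, good], axis=1)
misaligned = pd.concat([num_part, bad], axis=1)

# Matching labels give 3 clean rows; mismatched labels give 6 rows padded with NaN
print(aligned.shape, int(aligned.isna().sum().sum()))
print(misaligned.shape, int(misaligned.isna().sum().sum()))
```

A cheap guard after this cell would be to assert that `X_train_processed.index.equals(X_train.index)` and that `X_train_processed.isna().any().any()` is `False`.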

In [19]:
# Save Processed Datasets
# These will be used for model training and evaluation
# All preprocessing is complete: missing values filled, features scaled, categories encoded

print('SAVING PROCESSED DATASETS')

# Define file paths
X_train_path = os.path.join(DATASET_DIR, 'X_train.pkl')
X_test_path = os.path.join(DATASET_DIR, 'X_test.pkl')
y_train_path = os.path.join(DATASET_DIR, 'y_train.pkl')
y_test_path = os.path.join(DATASET_DIR, 'y_test.pkl')

# Save processed feature sets
with open(X_train_path, 'wb') as f:
    pickle.dump(X_train_processed, f)
print(f"Training features saved: {X_train_path} ({X_train_processed.shape[0]:,} samples, {X_train_processed.shape[1]} features)")

with open(X_test_path, 'wb') as f:
    pickle.dump(X_test_processed, f)
print(f"Test features saved: {X_test_path} ({X_test_processed.shape[0]:,} samples, {X_test_processed.shape[1]} features)")

# Save target variables
with open(y_train_path, 'wb') as f:
    pickle.dump(y_train, f)
print(f"Training targets saved: {y_train_path} ({len(y_train):,} samples)")

with open(y_test_path, 'wb') as f:
    pickle.dump(y_test, f)
print(f"Test targets saved: {y_test_path} ({len(y_test):,} samples)")

print('PREPROCESSING COMPLETE!')
print('All datasets are ready for model training')
print('Preprocessing components saved (scalers, encoders)')
print('No data leakage - all preprocessing fitted on training data only')
print('\nNext step: Train machine learning models using the processed datasets.')

SAVING PROCESSED DATASETS
Training features saved: d:\FINAL PROJECT\dataset\X_train.pkl (26,064 samples, 17 features)
Test features saved: d:\FINAL PROJECT\dataset\X_test.pkl (6,517 samples, 17 features)
Training targets saved: d:\FINAL PROJECT\dataset\y_train.pkl (26,064 samples)
Test targets saved: d:\FINAL PROJECT\dataset\y_test.pkl (6,517 samples)
PREPROCESSING COMPLETE!
All datasets are ready for model training
Preprocessing components saved (scalers, encoders)
No data leakage - all preprocessing fitted on training data only

Next step: Train machine learning models using the processed datasets.
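
The cells above save the scaler and both encoders, but not the fitted `numeric_imputer`. A prediction pipeline that receives rows with missing values would need it as well. The sketch below shows a pickle round trip for a `KNNImputer` on toy data. The file name `KNNImputer.pkl` is hypothetical, and a temporary directory stands in for `MODELS_DIR`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.impute import KNNImputer

# Stand-in for the fitted numeric_imputer from the notebook (toy data)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, np.nan]])
imputer = KNNImputer(n_neighbors=2).fit(train)

# Hypothetical file name; the notebook would write into MODELS_DIR instead
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "KNNImputer.pkl")
    with open(path, "wb") as f:
        pickle.dump(imputer, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored imputer fills a new row using the stored training rows
new_row = np.array([[2.5, np.nan]])
filled = restored.transform(new_row)
print(filled)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so it can be worth recording the version alongside the saved files.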
