# LendingClub Loan Default Prediction - Feature Engineering

## Objective
Transform the cleaned data into machine-learning-ready features through feature engineering and preprocessing.

## Steps
1. Handle missing values
2. Create new informative features  
3. Encode categorical variables
4. Remove data leakage columns
5. Prepare final dataset for modeling

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

# Load cleaned data from previous step
df = pd.read_csv('../data/processed/01_cleaned_data.csv')
print(f"Loaded data shape: {df.shape}")

Loaded data shape: (44005, 95)


## 🛠️ Feature Engineering Strategy

**Goal:** Transform raw data into meaningful features for machine learning

**Key Steps:**
1. Handle missing values appropriately
2. Create new informative features
3. Encode categorical variables
4. Remove data leakage columns

**Why This Matters:**
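- Informative features generally improve model performance
- Dropping post-origination columns prevents data leakage (training on information that is not available at application time)
- Encoding converts categorical data into numbers that algorithms can consume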

In [3]:
# Identify column types
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# Remove target and ID columns
if 'loan_default' in numerical_cols:
    numerical_cols.remove('loan_default')
if 'id' in numerical_cols:
    numerical_cols.remove('id')
if 'id' in categorical_cols:
    categorical_cols.remove('id')

print(f"Numerical columns: {len(numerical_cols)}")
print(f"Categorical columns: {len(categorical_cols)}")

Numerical columns: 70
Categorical columns: 23


## 🔍 Handling Missing Values

**Strategy:**
- Numerical columns: Fill with median (robust to outliers)
- Categorical columns: Fill with 'MISSING' as a new category

**Why This Approach:**
- Keeps every row and column; median-filled values do compress a column's spread somewhat, which matters most when many values are missing
- 'MISSING' can sometimes be informative itself
- Better than dropping rows and losing data
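
As a minimal sketch of this strategy, the helper below fills numeric gaps with the column median and categorical gaps with `'MISSING'`. The name `impute_missing` and the toy values are illustrative and not part of the notebook. Returning the medians is an assumption about later use: it would let medians learned on a training split be reused on held-out data.

```python
import numpy as np
import pandas as pd

def impute_missing(frame, numerical_cols, categorical_cols, medians=None):
    """Return an imputed copy of `frame` plus the medians that were used.

    Pass `medians` to reuse values learned elsewhere (e.g. on a training split).
    """
    out = frame.copy()
    if medians is None:
        medians = out[numerical_cols].median()
    # DataFrame.fillna accepts a Series keyed by column name
    out[numerical_cols] = out[numerical_cols].fillna(medians)
    out[categorical_cols] = out[categorical_cols].fillna("MISSING")
    return out, medians

# Toy data; the values are made up for illustration only
toy = pd.DataFrame({
    "dti": [10.0, np.nan, 30.0, 1000.0],
    "emp_length": ["2 years", None, "10+ years", "5 years"],
})
filled, learned_medians = impute_missing(toy, ["dti"], ["emp_length"])
print(filled)
```

The outlier of 1000 in the toy `dti` column pulls the mean far above the typical value, while the median stays at 30, which is why the median is the safer fill value here.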

In [4]:
# Handle missing values for numerical columns
print("Handling missing values for numerical columns...")
for col in numerical_cols:
    if df[col].isnull().any():
        # Assign the result back: fillna(inplace=True) on a column selection
        # is deprecated in pandas and may operate on a copy
        df[col] = df[col].fillna(df[col].median())
        print(f"Filled missing values in {col} with median")

# Handle missing values for categorical columns
print("\nHandling missing values for categorical columns...")
for col in categorical_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna('MISSING')
        print(f"Filled missing values in {col} with 'MISSING'")

Handling missing values for numerical columns...
Filled missing values in dti with median
Filled missing values in mths_since_last_delinq with median
Filled missing values in revol_util with median
Filled missing values in bc_open_to_buy with median
Filled missing values in bc_util with median
Filled missing values in mo_sin_old_il_acct with median
Filled missing values in mths_since_recent_bc with median
Filled missing values in mths_since_recent_inq with median
Filled missing values in num_tl_120dpd_2m with median
Filled missing values in percent_bc_gt_75 with median

Handling missing values for categorical columns...
Filled missing values in emp_title with 'MISSING'
Filled missing values in emp_length with 'MISSING'
Filled missing values in title with 'MISSING'
Filled missing values in last_pymnt_d with 'MISSING'
Filled missing values in last_credit_pull_d with 'MISSING'


## 🎯 Creating New Features

**Credit Age:** From `earliest_cr_line` → How many years the borrower has had credit  
**Debt Consolidation Flag:** 1 if the loan purpose is debt consolidation, else 0 → Candidate risk indicator  
**Income-to-Loan Ratio:** Annual income divided by loan amount (plus 1 to avoid division by zero) → Affordability measure

**Why Feature Engineering:**
- Domain knowledge improves model understanding
- New features can capture patterns raw data misses
- Helps model understand financial relationships
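
The credit-age feature depends on how two-digit years are parsed, so a small sketch may help. It assumes `earliest_cr_line` uses a `Mon-YY` format and measures age against a fixed reference date rather than the current time. The function name `credit_age_years` and the 2018 reference date are illustrative choices, not taken from the notebook.

```python
import pandas as pd

def credit_age_years(earliest_cr_line, reference_date):
    """Years between each 'Mon-YY' credit-line date and a fixed reference date."""
    parsed = pd.to_datetime(earliest_cr_line, format="%b-%y", errors="coerce")
    age = (reference_date - parsed).dt.days / 365.25
    # %y maps 00-68 to 20xx, so a 1965 date is read as 2065 and yields a
    # negative age; adding 100 years moves it back to the intended century.
    return age.where(age >= 0, age + 100)

# Toy values for illustration; "bad" shows how unparseable input becomes NaN
ages = credit_age_years(
    pd.Series(["Jan-98", "Mar-65", "bad"]),
    pd.Timestamp("2018-01-01"),
)
print(ages)
```

A fixed reference date keeps the feature reproducible between runs, whereas measuring against the current time makes every value drift as the notebook is re-executed.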

In [5]:
# Feature Engineering: Create new features
print("Creating new features...")

# 1. Credit age (if earliest_cr_line exists)
if 'earliest_cr_line' in df.columns:
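    try:
        # Convert to datetime; unparseable values become NaT
        df['earliest_cr_line_dt'] = pd.to_datetime(df['earliest_cr_line'], format='%b-%y', errors='coerce')
        # Calculate credit age in years (relative to run time, so values shift between runs)
        credit_age = (pd.Timestamp.now() - df['earliest_cr_line_dt']).dt.days / 365.25
        # %y reads years 00-68 as 20xx, so pre-1969 dates land in the future; shift them back a century
        credit_age = credit_age.where(credit_age >= 0, credit_age + 100)
        df['credit_age_years'] = credit_age.fillna(credit_age.median())
        print("Created 'credit_age_years' feature")
    except Exception as exc:
        # Catch Exception rather than using a bare except, and report the cause
        print(f"Could not create credit_age_years feature: {exc}")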

# 2. Debt consolidation flag
if 'purpose' in df.columns:
    df['is_debt_consolidation'] = (df['purpose'] == 'debt_consolidation').astype(int)
    print("Created 'is_debt_consolidation' feature")

# 3. Income to loan ratio
if 'annual_inc' in df.columns and 'loan_amnt' in df.columns:
    df['income_to_loan_ratio'] = df['annual_inc'] / (df['loan_amnt'] + 1)  # +1 to avoid division by zero
    print("Created 'income_to_loan_ratio' feature")

Creating new features...
Created 'credit_age_years' feature
Created 'is_debt_consolidation' feature
Created 'income_to_loan_ratio' feature


## 🔄 Encoding Categorical Variables

**Method:** One-Hot Encoding for low-cardinality categories (<10 unique values)

**Why One-Hot Encoding:**
- Converts categories to numerical format algorithms understand
- Prevents artificial ordering (no "high > medium > low" assumption)
- Works well with most machine learning models

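**High-cardinality features** (like ZIP codes) are not one-hot encoded here, to avoid an explosion of dummy columns. The encoding cell only touches columns with fewer than 10 unique values, so columns with more levels pass through as strings and still need to be encoded or dropped before modeling.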
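
The sketch below contrasts the two cases on toy data. One-hot encoding matches what the notebook does for low-cardinality columns. Frequency encoding is only one possible way to handle a high-cardinality column and is an assumption here, since the notebook does not say which method it intends. The column values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "RENT", "MORTGAGE"],
    "zip_code": ["941xx", "100xx", "941xx", "606xx"],
})

# One-hot encode the low-cardinality column; drop_first removes one
# redundant level (here MORTGAGE, the first in sorted order)
encoded = pd.get_dummies(toy, columns=["home_ownership"], drop_first=True, dtype=int)

# Frequency-encode the high-cardinality column: each value is replaced
# by the share of rows in which it appears
zip_share = toy["zip_code"].value_counts(normalize=True)
encoded["zip_code_freq"] = toy["zip_code"].map(zip_share)
encoded = encoded.drop(columns=["zip_code"])

print(encoded)
```

Frequency encoding adds a single numeric column regardless of how many distinct values exist, at the cost of mapping equally common values to the same number.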

In [6]:
# Encode categorical variables
print("Encoding categorical variables...")

# Select categorical columns with low cardinality (<10 unique values)
low_cardinality_cols = [col for col in categorical_cols if df[col].nunique() < 10]

print(f"Encoding {len(low_cardinality_cols)} categorical columns with low cardinality")

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=low_cardinality_cols, drop_first=True, dtype=int)
print(f"Shape after encoding: {df_encoded.shape}")

Encoding categorical variables...
Encoding 12 categorical columns with low cardinality
Shape after encoding: (44005, 104)


## 🚫 Preventing Data Leakage

**Removing columns that wouldn't be known at loan application time:**
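- Payment history columns (`total_pymnt`, `total_rec_prncp`, `last_pymnt_d`, and similar)
- Collection and recovery fields (`recoveries`, `collection_recovery_fee`)
- Fields updated after origination (`last_credit_pull_d`)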

**Critical Importance:**
- Using future information creates unrealistically accurate models
- Real-world deployment would fail without this step
- Ensures model learns from application-time data only
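
A quick sanity check on the final feature matrix can catch leakage columns that slipped through, along with string columns that most models cannot ingest directly. The helper name `check_feature_matrix`, the shortened leakage list, and the toy frame are all hypothetical and only illustrate the idea.

```python
import pandas as pd

# Shortened list for illustration; the notebook's full list is longer
LEAKAGE_COLS = {"recoveries", "total_pymnt", "last_pymnt_d"}

def check_feature_matrix(X, leakage_cols):
    """Return (leaked, non_numeric) column-name lists for a feature matrix."""
    leaked = sorted(set(X.columns) & set(leakage_cols))
    non_numeric = X.select_dtypes(exclude="number").columns.tolist()
    return leaked, non_numeric

# Toy feature matrix with one leakage column and one string column
toy_X = pd.DataFrame({
    "loan_amnt": [5000, 12000],
    "total_pymnt": [5400.0, 3100.0],
    "emp_title": ["teacher", "nurse"],
})
leaked, non_numeric = check_feature_matrix(toy_X, LEAKAGE_COLS)
print("Leakage columns still present:", leaked)
print("Non-numeric columns:", non_numeric)
```

Running a check like this just before saving makes both problems visible without relying on a manual read of the column list.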

In [7]:
# Identify and drop columns that cause data leakage
leakage_indicators = ['recoveries', 'collection_recovery_fee', 'total_pymnt', 
                     'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 
                     'total_rec_late_fee', 'last_pymnt_d', 'last_pymnt_amnt',
                     'last_credit_pull_d', 'collections_12_mths_ex_med']

leakage_cols = [col for col in leakage_indicators if col in df_encoded.columns]
print(f"Dropping {len(leakage_cols)} potential leakage columns: {leakage_cols}")

# Also drop other non-feature columns
non_feature_cols = ['id', 'loan_status', 'issue_d', 'url', 'desc', 'title', 'earliest_cr_line', 'earliest_cr_line_dt']
non_feature_cols = [col for col in non_feature_cols if col in df_encoded.columns]
print(f"Dropping {len(non_feature_cols)} non-feature columns: {non_feature_cols}")

all_cols_to_drop = leakage_cols + non_feature_cols
X = df_encoded.drop(columns=all_cols_to_drop + ['loan_default'])
y = df_encoded['loan_default']

print(f"Final feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")

Dropping 11 potential leakage columns: ['recoveries', 'collection_recovery_fee', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d', 'collections_12_mths_ex_med']
Dropping 5 non-feature columns: ['id', 'url', 'title', 'earliest_cr_line', 'earliest_cr_line_dt']
Final feature matrix shape: (44005, 87)
Target vector shape: (44005,)


## 💾 Saving Processed Data

**Output:**
- `02_processed_features.csv`: Final feature matrix (X)
- `02_processed_target.csv`: Target variable (y)

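**Ready for modeling:** Imputed data with low-cardinality categories encoded and leakage columns removed. High-cardinality text columns such as `emp_title` are still stored as strings and must be encoded or dropped before fitting most models.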

In [8]:
# Save the processed data
X.to_csv('../data/processed/02_processed_features.csv', index=False)
y.to_csv('../data/processed/02_processed_target.csv', index=False)
print("Processed features and target saved to data/processed/")

Processed features and target saved to data/processed/
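
When the target file is reloaded in a later step, `read_csv` returns a one-column DataFrame rather than a Series. The sketch below shows the round trip on a toy target, using an in-memory buffer in place of the real file path; the values are made up.

```python
import io
import pandas as pd

y = pd.Series([0, 1, 0], name="loan_default")

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
y.to_csv(buffer, index=False)  # header line is the Series name
buffer.seek(0)

# read_csv yields a DataFrame; select the column to get a Series back
y_loaded = pd.read_csv(buffer)["loan_default"]
print(y_loaded)
```

Because both files are written with `index=False`, features and target are matched by row order alone, so neither file should be shuffled or filtered independently.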
