# Kaggle: Predict Loan Payback - Feature Engineering

**Notebook:** `03_feature_engineering.ipynb`
**Author:** Brice Nelson
**Organization:** Kaggle Series | Brice Machine Learning Projects
**Date Created:** November 16, 2025
**Last Updated:** November 16, 2025

---

## 🧭 Purpose

This notebook focuses on **feature engineering** for the *Predict Loan Payback* Kaggle competition.
Feature engineering is where we convert domain knowledge and statistical insight into **predictive power**, increasing the model's ability to distinguish between loans that are repaid vs. defaulted.

Whereas `02_data_cleaning.ipynb` prepared a clean, numeric, and aligned dataset, this notebook builds **new, informative features** that enhance model signal.

### **Objectives**
1. Import scaled training and test datasets from `/data/processed/`.
2. Engineer domain-driven features (credit, income, debt, and loan characteristics).
3. Validate distributions and inspect transformations.
4. Export a feature-enhanced dataset for modeling.

---

## 🧱 Feature Engineering Roadmap

The engineered features planned for this notebook:

### **1. Financial Ratio Features**
- **Loan-to-Income Ratio:** `loan_amount / annual_income`
- **Debt-to-Income Bucket:** Flag or bin DTI levels
- **Interest Rate Stress Feature:** Higher-than-expected interest rates relative to credit score
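
The interest rate stress idea can be made concrete in several ways. A minimal sketch, assuming a simple linear fit of interest rate on credit score with the residual used as the feature; the toy values and the `rate_stress` column name are hypothetical, and the notebook itself may define this feature differently.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "credit_score": [600.0, 650.0, 700.0, 750.0, 800.0],
    "interest_rate": [16.0, 15.0, 11.0, 10.0, 8.0],
})

# Fit interest_rate ~ slope * credit_score + intercept by least squares.
slope, intercept = np.polyfit(toy["credit_score"], toy["interest_rate"], deg=1)

# A positive residual means the loan is priced above what the score alone suggests.
expected_rate = slope * toy["credit_score"] + intercept
toy["rate_stress"] = toy["interest_rate"] - expected_rate
print(toy)
```

In practice the line would presumably be fitted on the training set only and then applied unchanged to the test set, so that no test information leaks into the feature.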

### **2. Credit Behavior Features**
- **Credit Score Buckets:** Very low → very high
- **Grade/Subgrade Interactions:** E.g., `grade * subgrade`
- **Creditworthiness Flags:** Threshold-based indicators
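
One possible way to build the credit score buckets is with quantile bins, which give the same grouping whether the score is raw or standardized. This is a sketch on made-up scores; the five labels, the choice of five bins, and the `credit_score_bucket` column name are assumptions rather than the notebook's final design.

```python
import numpy as np
import pandas as pd

# Toy standardized credit scores (made up for illustration).
train = pd.DataFrame(
    {"credit_score": [-2.1, -1.3, -0.8, -0.4, -0.1, 0.2, 0.5, 0.9, 1.4, 2.0]}
)
labels = ["very_low", "low", "medium", "high", "very_high"]

# Learn quantile edges on the training data only.
train["credit_score_bucket"], edges = pd.qcut(
    train["credit_score"], q=5, labels=labels, retbins=True
)

# Open the outer edges so scores outside the training range still get a bucket.
edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Reuse the training edges on new data (three hypothetical test scores).
test_scores = pd.Series([-3.0, 0.0, 3.0])
test_buckets = pd.cut(test_scores, bins=edges, labels=labels)
print(train["credit_score_bucket"].value_counts(sort=False))
print(test_buckets.tolist())
```

Reusing the training edges for the test set keeps the bucket definitions identical across both datasets.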

### **3. Loan Purpose & Behavior Features**
- One-hot interactions (e.g., `purpose × credit_score_bucket`)
- Risk grouping based on purpose type
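
As an illustration of a purpose-by-bucket interaction, the sketch below concatenates two categorical columns and one-hot encodes the result. It assumes a raw `loan_purpose` column and a `credit_score_bucket` column are available; in the processed files the purpose already appears one-hot encoded (for example `loan_purpose_Business`), so the real implementation may need to multiply indicator columns instead. The `purpose_bucket` prefix is hypothetical.

```python
import pandas as pd

# Toy categorical data (made up for illustration).
toy = pd.DataFrame({
    "loan_purpose": ["Business", "Car", "Car", "Business"],
    "credit_score_bucket": ["low", "high", "low", "high"],
})

# Build one combined category per row, e.g. "Business_x_low".
combo = toy["loan_purpose"] + "_x_" + toy["credit_score_bucket"]

# One indicator column per (purpose, bucket) pair.
interactions = pd.get_dummies(combo, prefix="purpose_bucket", dtype=float)
print(interactions)
```

Each row has exactly one active indicator, so the number of new columns grows with the product of the two category counts. That is worth keeping in mind before crossing high-cardinality features.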

### **4. Optional Modeling-Useful Transforms**
- Log transforms for skewed features
- Quantile transforms (if needed)
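
A short sketch of the log transform on a skewed feature, using `np.log1p` on made-up, non-negative income values. The `annual_income_log` column name is hypothetical. Note the assumption that the input is on its raw scale: a logarithm is undefined for the negative values that standardized columns contain.

```python
import numpy as np
import pandas as pd

# Toy raw incomes with one extreme value (made up for illustration).
toy = pd.DataFrame(
    {"annual_income": [20_000.0, 35_000.0, 50_000.0, 80_000.0, 1_200_000.0]}
)

# log1p(x) = log(1 + x); it is safe at x = 0 and compresses the long right tail.
toy["annual_income_log"] = np.log1p(toy["annual_income"])

# Compare sample skewness before and after the transform.
print(toy.skew())
```

`np.expm1` inverts the transform if the original scale is needed again.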

---

## 📥 Load Processed Data

The processed datasets created in `02_data_cleaning.ipynb` will now be imported from:
`../data/processed/loan_train_scaled.csv` and `../data/processed/loan_test_scaled.csv`.



## Library Imports

In [4]:
import os
from pathlib import Path
import numpy as np
import pandas as pd


## Load Processed Data

In [3]:
# check working directory
os.getcwd()

'/home/brice-nelson/Documents/computerScience/PROJECTS/machine_learning/supervised_learning/kaggle/predict_loan_payback/notebooks'

In [6]:
# Assign base directory
BASE_DIR = Path().resolve().parent
processed_path = BASE_DIR / "data" / "processed"

# Load datasets
train_path = processed_path / 'loan_train_scaled.csv'
test_path = processed_path / 'loan_test_scaled.csv'

# Assign dataframes
loan_train_scaled = pd.read_csv(train_path)
loan_test_scaled = pd.read_csv(test_path)

# Sanity test
if not loan_train_scaled.empty and not loan_test_scaled.empty:
    print("Dataframes have been loaded successfully!")
else:
    print("Warning: One or both DataFrames are empty.")


Dataframes have been loaded successfully!


## Feature Engineering Checklist

- [x] Loan-to-Income ratio
- [ ] High DTI flag
- [ ] Credit score buckets
- [ ] Interaction terms
- [ ] Log / quantile transforms
- [ ] Correlation + feature importance validation
- [ ] Export datasets


### Loan-to-Income Ratio
Formula:
$$
\text{Loan-to-Income Ratio} = \frac{\text{Loan Amount}}{\text{Annual Income}}
$$

In [7]:
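# Loan to Income Ratio
# NOTE: both inputs come from the *_scaled files and appear to be standardized,
# so this is a ratio of scaled values, not of dollar amounts.
for df in [loan_train_scaled, loan_test_scaled]:
    df["loan_to_income"] = df["loan_amount"] / df["annual_income"]

# Sanity check
print(f'trained:\n {loan_train_scaled.head()}')
print(f'test:\n {loan_test_scaled.head()}')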


trained:
    id  annual_income  debt_to_income_ratio  credit_score  loan_amount  \
0   0      -0.705461             -0.535135      0.993849    -1.803484   
1   1      -0.977248              0.660668     -0.810394    -1.505401   
2   2       0.050689             -0.345556      0.236067     0.286558   
3   3      -0.050687             -0.812211     -2.668764    -1.492497   
4   4      -0.850388             -0.987206     -0.287163    -0.409421   

   interest_rate  loan_paid_back     grade  subgrade  gender_Female  ...  \
0       0.653899             1.0 -0.401966  0.008691            1.0  ...   
1       0.280571             0.0  0.613154  0.008691            0.0  ...   
2      -1.292385             1.0 -0.401966  1.434819            0.0  ...   
3       1.863482             1.0  2.643393 -1.417436            1.0  ...   
4      -1.068388             1.0  0.613154 -1.417436            0.0  ...   

   employment_status_Unemployed  loan_purpose_Business  loan_purpose_Car  \
0                 
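
Because `annual_income` and `loan_amount` in the output above contain negative values, they look standardized, which means the ratio in this cell divides one z-score by another. The sketch below uses made-up numbers to illustrate why that may be worth revisiting: a borrower with exactly average income has a scaled income of zero, and the scaled ratio becomes infinite. The toy frame and the manual standardization are assumptions for illustration, not the scaling used in `02_data_cleaning.ipynb`.

```python
import numpy as np
import pandas as pd

# Toy raw values (made up for illustration).
raw = pd.DataFrame({
    "annual_income": [40_000.0, 60_000.0, 80_000.0],
    "loan_amount": [10_000.0, 12_000.0, 40_000.0],
})

# Ratio on the raw scale: interpretable as loan size relative to yearly income.
raw_ratio = raw["loan_amount"] / raw["annual_income"]

# Standardize each column (zero mean, unit variance), then take the same ratio.
scaled = (raw - raw.mean()) / raw.std(ddof=0)
scaled_ratio = scaled["loan_amount"] / scaled["annual_income"]

print(pd.DataFrame({"raw_ratio": raw_ratio, "scaled_ratio": scaled_ratio}))
```

If the unscaled columns are still available, computing the ratio on them first and scaling the result afterwards is one way to avoid this behavior.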

### High Debt-to-Income (DTI) Flag

In [8]:
loan_train_scaled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593994 entries, 0 to 593993
Data columns (total 35 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   id                               593994 non-null  int64  
 1   annual_income                    593994 non-null  float64
 2   debt_to_income_ratio             593994 non-null  float64
 3   credit_score                     593994 non-null  float64
 4   loan_amount                      593994 non-null  float64
 5   interest_rate                    593994 non-null  float64
 6   loan_paid_back                   593994 non-null  float64
 7   grade                            593994 non-null  float64
 8   subgrade                         593994 non-null  float64
 9   gender_Female                    593994 non-null  float64
 10  gender_Male                      593994 non-null  float64
 11  gender_Other                     593994 non-null  float64
 12  ma

In [9]:
# Create a high DTI flag column
# NOTE: debt_to_income_ratio appears to be standardized (it contains negative
# values), so 0.40 is compared against a scaled value, not a raw 40% DTI.
for df in [loan_train_scaled, loan_test_scaled]:
    df["high_dti_flag"] = (df["debt_to_income_ratio"] >= 0.40).astype(int)

# Sanity check
if loan_train_scaled['high_dti_flag'].empty or loan_test_scaled['high_dti_flag'].empty:
    print("Warning: high dti flag was not created")
else:
    print("High dti column successfully created")


High dti column successfully created
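
If the intent is to flag a raw DTI of 40% or more, the threshold has to be expressed in the same units as the column it is compared against. A minimal sketch, assuming the column was standardized with a known mean and standard deviation; the toy values, `RAW_THRESHOLD`, and `z_threshold` are hypothetical names and numbers, not taken from the cleaning notebook.

```python
import pandas as pd

# Toy raw DTI ratios (made up for illustration).
raw_dti = pd.Series([0.10, 0.20, 0.35, 0.45], name="debt_to_income_ratio")

# Standardize the same way a scaler would (population standard deviation).
mean, std = raw_dti.mean(), raw_dti.std(ddof=0)
scaled_dti = (raw_dti - mean) / std

# Convert the raw 40% cutoff into standardized units.
RAW_THRESHOLD = 0.40
z_threshold = (RAW_THRESHOLD - mean) / std

flag_from_raw = (raw_dti >= RAW_THRESHOLD).astype(int)
flag_from_scaled = (scaled_dti >= z_threshold).astype(int)

# Comparing the scaled column against 0.40 directly flags a different set of rows.
naive_flag = (scaled_dti >= 0.40).astype(int)

print(pd.DataFrame({
    "raw": flag_from_raw,
    "scaled_converted": flag_from_scaled,
    "scaled_naive": naive_flag,
}))
```

In this toy example the converted threshold reproduces the raw-scale flag, while the direct comparison also flags the 0.35 row.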
