### Feature Engineering Overview

This notebook performs the feature engineering process on the *Loan Portfolio* dataset.  
The objective is to transform and enrich the raw dataset prepared during the EDA stage to support subsequent KPI calculation and portfolio analysis.  
All transformations are aimed at improving data quality, interpretability, and aggregation readiness.


In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import re
import os

from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce

In [8]:
PROJECT_ROOT = Path.cwd().parents[1]
DATA_DIR = PROJECT_ROOT / "data" / "processed"

# Create a working copy
df_eda = pd.read_csv(DATA_DIR / "eda_processed.csv")
fe = df_eda.copy()
print("Start shape:", fe.shape)


Start shape: (9989, 23)


### Ordinal Feature Encoding

Ordinal variables such as **loan term**, **employment length**, **grade**, and **sub-grade** are converted into numeric scales that preserve their inherent ranking.  
This keeps downstream quantitative analysis consistent with each feature's natural order (e.g., longer terms imply higher duration risk, better grades correspond to lower credit risk).  
Note that the two grade scales run in opposite directions: `grade_ord` increases with credit quality (A = 7, G = 1), whereas `sub_grade_ord` increases with risk (A1 = 1, G5 = 35).

In [9]:
#Ordinal encodings (term, emp_length, grade, sub_grade)

# term → numeric months
if 'term' in fe.columns:
    fe['term_months'] = (
        fe['term'].astype(str).str.extract(r'(\d+)')[0].astype(float)
    ).astype('float32')

# Employment_length → years (e.g., "< 1 year" -> 0, "10+ years" -> 10)
if 'Employment_length' in fe.columns:
    def emp_len_to_years(x: str) -> float:
        if x is None or (isinstance(x, float) and np.isnan(x)): 
            return np.nan
        x = str(x).strip().lower()
        if x.startswith('<'):
            return 0.0
        m = re.search(r'(\d+)', x)
        if m:
            val = float(m.group(1))
            return 10.0 if '+' in x else val
        return np.nan
    fe['emp_length_years'] = fe['Employment_length'].apply(emp_len_to_years).astype('float32')

# grade → ordinal (A..G). Higher is better risk-wise (A highest)
if 'grade' in fe.columns:
    grade_order = {'g':1,'f':2,'e':3,'d':4,'c':5,'b':6,'a':7}
    fe['grade_ord'] = fe['grade'].astype(str).str.lower().map(grade_order).astype('float32')

# sub_grade → ordinal A1..G5 → 1..35 (higher = riskier, i.e. the opposite direction to grade_ord)
if 'sub_grade' in fe.columns:
    def subgrade_to_ord(x: str) -> float:
        if not isinstance(x, str) or len(x) < 2:
            return np.nan
        letter, num = x[0].upper(), x[1:]
        if letter not in 'ABCDEFG' or not num.isdigit():
            return np.nan
        base = (ord(letter) - ord('A')) * 5
        return float(base + int(num))  # A1..A5: 1..5, B1..B5: 6..10, ... G1..G5: 31..35
    fe['sub_grade_ord'] = fe['sub_grade'].apply(subgrade_to_ord).astype('float32')

added = [c for c in fe.columns if c.endswith(('_months','_years','_ord'))]
print("Added (ordinal):", added)
print("Shape:", fe.shape)


Added (ordinal): ['term_months', 'emp_length_years', 'grade_ord', 'sub_grade_ord']
Shape: (9989, 27)
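
The direction of the two grade scales is easy to mix up, so a quick plain-Python check may help. The sketch below restates the two mappings from the cell above outside pandas and prints them for a few hand-picked values; the helper name `grade_to_ord` and the sample values are illustrative assumptions, not part of the dataset.

```python
import math

# Re-statement of the two grade mappings from the cell above, for illustration only.
GRADE_ORDER = {'g': 1, 'f': 2, 'e': 3, 'd': 4, 'c': 5, 'b': 6, 'a': 7}

def grade_to_ord(grade: str) -> float:
    """A..G -> 7..1, so a larger value means a better grade."""
    return float(GRADE_ORDER.get(str(grade).lower(), math.nan))

def subgrade_to_ord(sub_grade: str) -> float:
    """A1..G5 -> 1..35, so a larger value means a worse sub-grade."""
    if not isinstance(sub_grade, str) or len(sub_grade) < 2:
        return math.nan
    letter, num = sub_grade[0].upper(), sub_grade[1:]
    if letter not in 'ABCDEFG' or not num.isdigit():
        return math.nan
    return float((ord(letter) - ord('A')) * 5 + int(num))

# Hand-picked sample values (not taken from the dataset)
for g, sg in [('A', 'A1'), ('C', 'C3'), ('G', 'G5')]:
    print(g, grade_to_ord(g), sg, subgrade_to_ord(sg))
```

Because the scales point in opposite directions, any feature that combines them (or any correlation read off them) needs one of the two flipped first.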


### Date Decomposition

Date fields (e.g., issue date, last payment date, settlement date) are decomposed into separate **year** and **month** components.  
This transformation allows time-based grouping, trend analysis, and the derivation of temporal metrics such as credit history age.


In [10]:
# Date decomposition (year, month, optional ages)

date_cols = [
    'issue_d','earliest_cr_line','last_pymnt_d',
    'next_pymnt_d','last_credit_pull_d',
    'debt_settlement_flag_date','settlement_date',
    'sec_app_earliest_cr_line'
]

# LendingClub dates look like "Dec-2015"
FMT = "%b-%Y"

with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)  # silence the “Could not infer format” spam
    for col in date_cols:
        if col in fe.columns:
            # strip just in case and parse with explicit format
            dt = pd.to_datetime(fe[col].astype(str).str.strip(), format=FMT, errors="coerce")
            fe[col + "_year"]  = dt.dt.year.astype("float32")
            fe[col + "_month"] = dt.dt.month.astype("float32")

print("Done parsing year/month. Shape:", fe.shape)

Done parsing year/month. Shape: (9989, 31)


### Derived Time Feature: Credit History Age

A new variable `credit_history_age_months` is computed as the difference in months between the loan issue date and the borrower's earliest credit line.  
This feature measures the borrower’s credit experience and is useful for portfolio risk and maturity analysis.


In [11]:
# credit history age in months

issue_col = 'issue_d'
earliest_col = 'earliest_cr_line'

# Explicit date parsing with format "%b-%Y"
issue_dt = pd.to_datetime(fe[issue_col].astype(str).str.strip(), format="%b-%Y", errors="coerce")
earl_dt  = pd.to_datetime(fe[earliest_col].astype(str).str.strip(), format="%b-%Y", errors="coerce")

# Create mask for valid rows
mask = issue_dt.notna() & earl_dt.notna()

# Initialize column
fe['credit_history_age_months'] = np.nan

# Calculate difference in months
months_diff = (
    (issue_dt.dt.year[mask] - earl_dt.dt.year[mask]) * 12
    + (issue_dt.dt.month[mask] - earl_dt.dt.month[mask])
).astype('float32')

# Assign and clean negatives
fe.loc[mask, 'credit_history_age_months'] = months_diff
fe.loc[fe['credit_history_age_months'] < 0, 'credit_history_age_months'] = np.nan

print("Added credit_history_age_months. Shape:", fe.shape)

Added credit_history_age_months. Shape: (9989, 32)
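
As a worked example of the month arithmetic, the following standard-library sketch applies the same year and month difference to two single date strings. The function is a hypothetical stand-alone helper, not the notebook's vectorised code, and it assumes English month abbreviations (the `%b` directive is locale-dependent).

```python
from datetime import datetime
from typing import Optional

FMT = "%b-%Y"  # same "Dec-2015" style format as in the cells above

def credit_history_age_months(issue: str, earliest: str) -> Optional[int]:
    """Whole-month difference between two 'Mon-YYYY' strings; None if invalid or negative."""
    try:
        issue_dt = datetime.strptime(issue.strip(), FMT)
        earl_dt = datetime.strptime(earliest.strip(), FMT)
    except (ValueError, AttributeError):
        return None
    diff = (issue_dt.year - earl_dt.year) * 12 + (issue_dt.month - earl_dt.month)
    return diff if diff >= 0 else None

print(credit_history_age_months("Dec-2015", "Mar-2004"))  # (2015-2004)*12 + (12-3) = 141
print(credit_history_age_months("Jan-2010", "Jun-2012"))  # negative difference -> None
```

Returning `None` for a negative difference mirrors the cell above, which resets negative ages to `NaN` rather than keeping an implausible value.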


### Numeric Feature Transformations

Numeric variables are standardized and transformed to improve interpretability and reduce skewness.  
Transformations include:
- **Log scaling** for highly skewed financial amounts  
- **Ratio features** (e.g., payment-to-loan ratio, income-to-loan ratio)  
- **Standard scaling** for comparability across KPIs


In [12]:
# Select numeric features for transformation
num_cols = fe.select_dtypes(include=['float64', 'int64']).columns

# Log transform for skewed features (add +1 to avoid log(0))
log_features = ['loan_amnt', 'annual_inc', 'total_pymnt', 'total_rec_int', 'funded_amnt_inv']
for col in log_features:
    if col in fe.columns:
        fe[col + '_log'] = np.log1p(fe[col])

# Ratio features (relative to loan amount)
if all(c in fe.columns for c in ['total_pymnt', 'loan_amnt']):
    fe['payment_to_loan_ratio'] = fe['total_pymnt'] / fe['loan_amnt']

if all(c in fe.columns for c in ['annual_inc', 'loan_amnt']):
    fe['income_to_loan_ratio'] = fe['annual_inc'] / fe['loan_amnt']

# Standard scaling
scaler = StandardScaler()
scaled_features = ['int_rate', 'installment', 'annual_inc', 'dti']
for col in scaled_features:
    if col in fe.columns:
        fe[col + '_scaled'] = scaler.fit_transform(fe[[col]])

print("Added numeric transformations. Shape:", fe.shape)


Added numeric transformations. Shape: (9989, 41)
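
To make the three transformations concrete, here is a small standard-library sketch on made-up amounts. It assumes the z-score uses the population standard deviation (which is what `StandardScaler` computes), and `safe_ratio` is a hypothetical guard that the cell above does not include.

```python
import math
from statistics import fmean, pstdev

amounts = [1000.0, 5000.0, 12000.0, 35000.0]  # illustrative values, not from the dataset

# log1p(x) = log(1 + x): defined at x = 0 and compresses the long right tail
logged = [math.log1p(x) for x in amounts]

def standardize(values):
    """Z-score with the population standard deviation (ddof=0), as StandardScaler does."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def safe_ratio(numerator, denominator):
    """Ratio that yields NaN instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else math.nan

scaled = standardize(amounts)
print([round(v, 3) for v in logged])
print([round(v, 3) for v in scaled])
print(safe_ratio(60000.0, 12000.0))  # income-to-loan ratio of 5.0
```

After standardization the values have mean 0 and unit population variance, which is what makes differently scaled KPIs comparable.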


### Categorical Feature Encoding

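Categorical attributes are handled as follows:
- **Low-cardinality** variables (≤30 unique values) use one-hot encoding, dropping the first level of each variable.  
- **High-cardinality** variables (more than 30 unique values, here `addr_state`) are left unencoded in this notebook so they remain available as grouping keys for KPIs. Target encoding with respect to `loan_status` is a possible alternative if these columns later feed a model.  

This keeps the encoded columns interpretable while limiting the number of added dimensions.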


In [13]:
# CATEGORICAL FEATURE ENCODING
# Only apply encoding to truly categorical fields that have not already been
# transformed through ordinal or date-based encodings.

# 1. Explicitly define relevant categorical columns
cat_cols = [
    'home_ownership',
    'verification_status',
    'purpose',
    'addr_state',
    'application_type',
]

# Keep only columns that actually exist in the dataframe
cat_cols = [c for c in cat_cols if c in fe.columns]
print("Categorical columns considered:", cat_cols)

# 2. Split into low- and high-cardinality groups
low_card_cols = [c for c in cat_cols if fe[c].nunique() <= 30]
high_card_cols = [c for c in cat_cols if fe[c].nunique() > 30]

print("Low-cardinality columns (One-Hot Encoding):", low_card_cols)
print("High-cardinality columns (kept as-is for KPIs):", high_card_cols)

# 3. Apply One-Hot Encoding to low-cardinality fields only
fe = pd.get_dummies(fe, columns=low_card_cols, drop_first=True)

print("Categorical encoding complete. Final shape:", fe.shape)



Categorical columns considered: ['home_ownership', 'purpose', 'addr_state']
Low-cardinality columns (One-Hot Encoding): ['home_ownership', 'purpose']
High-cardinality columns (kept as-is for KPIs): ['addr_state']
Categorical encoding complete. Final shape: (9989, 54)
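
The cell above leaves high-cardinality columns untouched. Should target encoding be wanted later, one possible form is a smoothed category mean, sketched below in plain Python. The function name, the smoothing weight `m`, and the 0/1 flag derived from `loan_status` are all assumptions made for illustration; the `category_encoders` package imported at the top provides a ready-made `TargetEncoder` that could be used instead.

```python
from collections import defaultdict

def smoothed_target_encode(categories, targets, m=2.0):
    """Map each category to a smoothed mean of a binary target.

    enc(c) = (sum of targets in c + m * global mean) / (count of c + m)
    """
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}

# Toy data: state codes and a hypothetical 0/1 flag derived from loan_status
states = ['CA', 'CA', 'NY', 'TX']
is_bad = [1, 0, 1, 0]
encoding = smoothed_target_encode(states, is_bad)
print(encoding)
```

The smoothing pulls categories with few observations toward the global mean. In a modelling setting the mapping would need to be fitted on training rows only, since it uses the target and would otherwise leak it.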


## Feature Selection & Export

After completing all feature engineering steps (ordinal encoding, date decomposition, derived temporal metrics, and categorical processing), a final subset of features is selected to support loan-portfolio KPI calculation and downstream risk analytics.

The selected features include:

- Loan identifiers  
- Borrower demographic and financial attributes  
- Loan characteristics (e.g., term, interest rate, amount)  
- Derived time-based features (e.g., issue year/month, credit history age)  
- Ordinal encodings (`grade_ord`, `sub_grade_ord`) and the unencoded `addr_state`  
- Performance fields required for KPI computation  

The resulting dataset is exported as `fe_features.csv` and will be used in the subsequent KPI Calculation stage of the thesis.



In [14]:
# === FEATURE SELECTION & EXPORT ===

# 1. List of selected features (you can adjust)
selected_features = [
    # Identifiers
    'id', 'member_id',

    # Loan characteristics
    'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
    'term_months', 'int_rate', 'installment',
    'grade_ord', 'sub_grade_ord',
    'emp_length_years',
    'purpose',  # raw column was consumed by one-hot encoding above; step 2 filters it out
    
    # Borrower financial profile
    'annual_inc', 'dti', 'revol_bal', 'revol_util',
    'inq_last_6mths', 'open_acc', 'total_acc',

    # Temporal features
    'issue_d_year', 'issue_d_month',
    'earliest_cr_line_year', 'earliest_cr_line_month',
    'credit_history_age_months',

    # Categorical high-card columns left as-is
    'addr_state',

    # Performance (for KPIs)
    'loan_status', 'total_pymnt', 'total_rec_prncp',
]

# 2. Keep only columns that exist
selected_features = [c for c in selected_features if c in fe.columns]
print("Final selected features:", len(selected_features))

# 3. Create trimmed dataframe
fe_final = fe[selected_features].copy()

print("Final dataset shape:", fe_final.shape)

# 4. Export
output_path = DATA_DIR / "fe_features.csv"
fe_final.to_csv(output_path, index=False)

Final selected features: 22
Final dataset shape: (9989, 22)
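
The selection list names 28 columns, yet only 22 survive the existence filter, so six requested columns are dropped without any message. A small helper along the following lines could make that visible before export; the function name and the sample column names are hypothetical.

```python
def split_present_missing(requested, available):
    """Return (present, missing) while preserving the requested order."""
    available = set(available)
    present = [c for c in requested if c in available]
    missing = [c for c in requested if c not in available]
    return present, missing

# Illustrative column names only
requested = ['id', 'loan_amnt', 'purpose', 'addr_state']
available = ['id', 'loan_amnt', 'addr_state', 'purpose_car']
present, missing = split_present_missing(requested, available)
print("Kept:", present)
print("Dropped (not in dataframe):", missing)
```

In the notebook this would be called with `selected_features` and `fe.columns`, and the `missing` list reviewed before writing the CSV.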
