
# Loan Default Prediction & Expected Loss Model (Prototype)

**Objective**
- Predict Probability of Default (PD) for personal loans
- Compute Expected Loss (EL), assuming a fixed 10% recovery rate (i.e., 90% loss given default)
- Provide a prototype model for risk team validation

**Expected Loss Formula**
EL = PD × Exposure × (1 − Recovery Rate)
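
As a quick arithmetic check of the formula, here is a minimal sketch with made-up inputs. The helper name `el_formula` and the 5% PD are illustrative assumptions, not values from the loan data.

```python
def el_formula(pd_estimate: float, exposure: float, recovery_rate: float = 0.10) -> float:
    """EL = PD x Exposure x (1 - Recovery Rate)."""
    return pd_estimate * exposure * (1 - recovery_rate)

# Hypothetical loan: 5% PD, 250,000 outstanding, 10% recovery assumption
el_example = el_formula(0.05, 250_000)
print(f"Expected loss: {el_example:,.2f}")  # Expected loss: 11,250.00
```

With full recovery (`recovery_rate=1.0`) the same call returns zero, which is a useful sanity check on the sign convention.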


In [None]:

# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, classification_report


In [None]:

# Load loan data
df = pd.read_csv('/mnt/data/Task 3 and 4_Loan_Data.csv')
df.head()


In [None]:

# Basic data inspection
df.info()


In [None]:

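# Handle missing values with column medians (simple strategy for prototype)
# Note: the medians are computed on the full dataset, before the train/test
# split; for validation, fit the imputation on the training set only.
df = df.fillna(df.median(numeric_only=True))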


In [None]:

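# Define target and features
# Note: StandardScaler below needs numeric inputs, so identifier or
# non-numeric columns (if any) should be dropped or encoded here.
target = 'default'
X = df.drop(columns=[target])
y = df[target]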


In [None]:

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)


In [None]:

# Standardization (important for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
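
One way to keep scaling and model fitting tied together is a scikit-learn `Pipeline`, which fits the scaler on the training data only and reapplies it at prediction time. The sketch below runs on synthetic data from `make_classification` (an assumption standing in for the loan file), and the names `pd_pipeline` and `pd_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# The scaler is fitted inside the pipeline, using training rows only
pd_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pd_pipeline.fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of the positive class
pd_demo = pd_pipeline.predict_proba(X_te)[:, 1]
print("Pipeline AUC (synthetic data):", roc_auc_score(y_te, pd_demo))
```

With this layout, raw features can be passed straight to `pd_pipeline.predict_proba`, so there is no separate scaled copy to keep in sync.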


In [None]:

# Model 1: Logistic Regression (baseline PD model)
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train_scaled, y_train)

log_pd = log_model.predict_proba(X_test_scaled)[:, 1]
print("Logistic Regression AUC:", roc_auc_score(y_test, log_pd))


In [None]:

# Model 2: Decision Tree (non-linear benchmark)
tree_model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_leaf=50,
    random_state=42
)
tree_model.fit(X_train, y_train)

tree_pd = tree_model.predict_proba(X_test)[:, 1]
print("Decision Tree AUC:", roc_auc_score(y_test, tree_pd))



## Model Comparison

- Logistic Regression: interpretable, stable PD estimates
- Decision Tree: captures non-linear risk patterns
- Logistic Regression preferred for capital modeling (stability)
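
The stability point can be checked rather than assumed. A rough sketch, using synthetic data in place of the loan file (an assumption), compares the fold-to-fold spread of AUC for both model types with 5-fold cross-validation; a smaller standard deviation suggests more stable ranking performance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, weights=[0.8, 0.2], random_state=42
)

candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(
        max_depth=5, min_samples_leaf=50, random_state=42
    ),
}

# Collect per-fold AUC so both the mean and the spread can be compared
cv_auc = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    cv_auc[name] = scores
    print(f"{name}: mean AUC = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Results on synthetic data say nothing about the real portfolio; the same loop would need to be rerun on the actual features and target.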


In [None]:

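# Expected Loss Function
def expected_loss(
    loan_features: dict,
    exposure: float,
    recovery_rate: float = 0.10
):
    '''
    Estimate PD with the logistic regression model and compute Expected Loss.

    Inputs:
    - loan_features: dictionary of borrower characteristics
      (keys must match the training feature columns)
    - exposure: outstanding loan amount
    - recovery_rate: assumed recovery as a fraction in [0, 1] (default = 10%)

    Output:
    - Tuple of (Probability of Default, Expected Loss)
    '''
    if not 0.0 <= recovery_rate <= 1.0:
        raise ValueError("recovery_rate must be between 0 and 1")

    # Align columns with the order used to fit the scaler and model
    df_input = pd.DataFrame([loan_features], columns=X_train.columns)
    df_input_scaled = scaler.transform(df_input)

    pd_estimate = log_model.predict_proba(df_input_scaled)[0, 1]
    expected_loss_value = pd_estimate * exposure * (1 - recovery_rate)

    return pd_estimate, expected_loss_value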


In [None]:

# Example Test Case
sample_loan = X.iloc[0].to_dict()
exposure_amount = 250000  # loan outstanding

pd_est, el = expected_loss(sample_loan, exposure_amount)

pd_est, el



## Interpretation

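- PD is the model's estimated probability of default; it is a 1-year PD only if the `default` label was observed over a 1-year window
- Expected Loss applies a fixed 10% recovery assumption, i.e., a constant 90% loss given default for every loan
- Once validated by the risk team, the prototype could support: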
  - Portfolio loss estimation
  - Capital provisioning
  - Scenario testing
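
For portfolio loss estimation and scenario testing, the per-loan formula can be vectorized and rerun under different recovery assumptions. This is a minimal sketch: the three PDs and exposures are hypothetical placeholders (in practice the PDs would come from the fitted model), and `portfolio_expected_loss` is an illustrative helper name.

```python
import numpy as np

# Hypothetical three-loan portfolio (placeholder values, not model output)
pd_estimates = np.array([0.02, 0.10, 0.25])
exposures = np.array([100_000.0, 50_000.0, 20_000.0])

def portfolio_expected_loss(pds, exposures, recovery_rate=0.10):
    """Sum of PD x Exposure x (1 - Recovery Rate) across all loans."""
    return float(np.sum(pds * exposures * (1.0 - recovery_rate)))

# Scenario test: rerun the portfolio EL under alternative recovery rates
for rr in (0.10, 0.25, 0.40):
    el_total = portfolio_expected_loss(pd_estimates, exposures, rr)
    print(f"Recovery rate {rr:.0%}: portfolio EL = {el_total:,.2f}")
```

Summing expected losses this way gives the portfolio mean loss only; it does not describe the loss distribution or any correlation between defaults.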

## Future Enhancements
- Add macroeconomic variables
- Check PD calibration (e.g., calibration curves, Brier score) and monitor score drift with the population stability index (PSI)
- Extend to LGD and EAD modeling
- Regulatory calibration (Basel / IFRS 9)
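
For reference, the PSI compares the distribution of scores in a reference sample with a more recent one, so it flags population drift rather than miscalibration. The sketch below is one common formulation, assuming decile bins taken from the reference sample and a small floor `eps` to avoid division by zero; the function name and the beta-distributed demo scores are illustrative.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI = sum((actual% - expected%) * ln(actual% / expected%)) over bins."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges from quantiles of the reference (expected) sample
    inner_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])

    # Assign each score to a bin index in 0..n_bins-1
    exp_idx = np.searchsorted(inner_edges, expected, side="right")
    act_idx = np.searchsorted(inner_edges, actual, side="right")

    exp_share = np.bincount(exp_idx, minlength=n_bins) / len(expected)
    act_share = np.bincount(act_idx, minlength=n_bins) / len(actual)

    # Floor the shares so empty bins do not produce log(0) or division by zero
    exp_share = np.clip(exp_share, eps, None)
    act_share = np.clip(act_share, eps, None)

    return float(np.sum((act_share - exp_share) * np.log(act_share / exp_share)))

# Demo with synthetic PD-like scores (assumption: not model output)
rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 8, size=5000)
recent_scores = rng.beta(3, 6, size=5000)

psi_same = population_stability_index(reference_scores, reference_scores)
psi_shift = population_stability_index(reference_scores, recent_scores)
print(f"PSI (same sample): {psi_same:.4f}")
print(f"PSI (shifted sample): {psi_shift:.4f}")
```

Thresholds for what counts as a material shift vary by institution and are not set here.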
