# Loan Default Prediction

**Goal:** Predict whether a loan will default using tabular features.

**Contents:**
- Load & inspect data
- Preprocess (encoding + scaling)
- Train baseline Logistic Regression
- Compare with Random Forest
- Evaluate (Accuracy, ROC-AUC, Confusion Matrix)
- Feature importance

Dataset: `loan_default_dataset.csv` (synthetic; generated for demo).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, roc_auc_score, confusion_matrix,
    RocCurveDisplay, ConfusionMatrixDisplay, classification_report,
)

# Load dataset
df = pd.read_csv('loan_default_dataset.csv')
df.head()


In [None]:
# Quick overview
display(df.describe(include='all'))
print('\nClass balance (Default=1):')
print(df['Default'].value_counts(normalize=True).round(3))
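
The class-balance printout matters because accuracy is only meaningful relative to what a trivial classifier achieves. As a minimal sketch (the labels below are made up for illustration, and the helper name `majority_baseline_accuracy` is hypothetical, not part of the notebook), the accuracy of always predicting the most frequent class can be computed like this:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical labels: 8 non-defaults (0) and 2 defaults (1)
example_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
baseline = majority_baseline_accuracy(example_labels)
print(f'Majority-class baseline accuracy: {baseline:.2f}')
```

In this notebook the same helper could be called as `majority_baseline_accuracy(df['Default'])`. A model's accuracy is informative only to the extent that it beats that number.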

In [None]:
# Split features/target
X = df.drop('Default', axis=1)
y = df['Default']

num_cols = ['ApplicantIncome','CoapplicantIncome','LoanAmount','LoanTerm','Interest_Rate','Dependents']
cat_cols = ['Credit_History','Education','Self_Employed','Property_Area']

# Preprocess: scale numeric; one-hot encode categorical
preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
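
`stratify=y` asks `train_test_split` to keep the class proportions roughly equal in the train and test partitions. The sketch below checks this on made-up data (`X_strat` and `y_strat` are hypothetical stand-ins, not the loan dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 40 positives out of 200 (20%)
rng = np.random.default_rng(0)
y_strat = np.array([1] * 40 + [0] * 160)
X_strat = rng.normal(size=(200, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_strat, y_strat, test_size=0.25, random_state=42, stratify=y_strat
)

# Both partitions should show the same positive rate as the full label vector
print(f'train positive rate: {y_tr.mean():.3f}')
print(f'test positive rate:  {y_te.mean():.3f}')
```

Without `stratify`, a small test set can end up with noticeably more or fewer defaults than the training set, which makes the test metrics noisier.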

In [None]:
# Baseline: Logistic Regression
log_reg = Pipeline(steps=[('prep', preprocess),
                         ('clf', LogisticRegression(max_iter=1000))])
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:,1]

acc_lr = accuracy_score(y_test, y_pred_lr)
auc_lr = roc_auc_score(y_test, y_proba_lr)
print(f'Logistic Regression -> Accuracy: {acc_lr:.3f} | ROC-AUC: {auc_lr:.3f}')
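
ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up labels and scores (the function name `pairwise_auc` is hypothetical; in the notebook `roc_auc_score` does this work):

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError('Need at least one positive and one negative label.')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
y_rank = [0, 0, 1, 1]
scores_rank = [0.1, 0.4, 0.35, 0.8]
auc_rank = pairwise_auc(y_rank, scores_rank)
print(auc_rank)  # 3 of the 4 positive/negative pairs are ordered correctly
```

Because it depends only on the ranking of scores, ROC-AUC is unaffected by the 0.5 decision threshold that `predict` applies, which is why it is reported alongside accuracy.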

In [None]:
# Tree-based: Random Forest
rf = Pipeline(steps=[('prep', preprocess),
                    ('clf', RandomForestClassifier(n_estimators=300, random_state=42))])
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
# ROC-AUC needs positive-class probabilities; the pipeline applies preprocessing itself
y_proba_rf = rf.predict_proba(X_test)[:, 1]

acc_rf = accuracy_score(y_test, y_pred_rf)
auc_rf = roc_auc_score(y_test, y_proba_rf)
print(f'Random Forest -> Accuracy: {acc_rf:.3f} | ROC-AUC: {auc_rf:.3f}')

In [None]:
# Confusion matrices (rows: true label, columns: predicted label)
disp = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_lr))
disp.plot()
plt.title('Logistic Regression - Confusion Matrix')
plt.show()

disp = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_rf))
disp.plot()
plt.title('Random Forest - Confusion Matrix')
plt.show()
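
scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, so for binary labels 0/1 the layout is `[[TN, FP], [FN, TP]]`. As a rough sketch, the per-class rates that matter for default prediction can be derived from those four counts (the counts below are invented for illustration, and `rates_from_confusion` is a hypothetical helper):

```python
def rates_from_confusion(cm):
    """Derive summary rates from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        # Of the loans flagged as default, how many actually defaulted
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the loans that defaulted, how many were flagged
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts: 90 non-defaults, 10 defaults
cm_demo = [[80, 10], [6, 4]]
rates = rates_from_confusion(cm_demo)
for name, value in rates.items():
    print(f'{name}: {value:.3f}')
```

In the notebook, `rates_from_confusion(confusion_matrix(y_test, y_pred_rf))` should work the same way, since a 2x2 NumPy array unpacks row by row. The invented example shows how accuracy can look healthy while recall on defaults stays low.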

In [None]:
# ROC curves, one figure per model
RocCurveDisplay.from_predictions(y_test, y_proba_lr, name='Logistic Regression')
plt.title('Logistic Regression - ROC Curve')
plt.show()

RocCurveDisplay.from_predictions(y_test, y_proba_rf, name='Random Forest')
plt.title('Random Forest - ROC Curve')
plt.show()

In [None]:
# Feature importance (impurity-based, so approximate): recover feature names after preprocessing
ohe = rf.named_steps['prep'].named_transformers_['cat']
num_features = num_cols
cat_features = list(ohe.get_feature_names_out(cat_cols))
all_features = num_features + cat_features

importances = rf.named_steps['clf'].feature_importances_

feat_imp = pd.Series(importances, index=all_features).sort_values(ascending=False).head(15)
print(feat_imp)

# Plot
plt.figure(figsize=(8,5))
feat_imp.plot(kind='bar')
plt.title('Top Feature Importances (Random Forest)')
plt.ylabel('Importance')
plt.tight_layout()
plt.show()
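
The `feature_importances_` attribute of a random forest is impurity-based: it is computed from the training data and can favour features with many distinct values. One common cross-check is permutation importance on held-out data. The sketch below uses a synthetic dataset from `make_classification` rather than the loan data, and every variable name in it is an assumption made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 features, 3 of them informative
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle one column at a time on held-out data and measure the drop in ROC-AUC
perm = permutation_importance(
    forest, X_hold, y_hold, scoring='roc_auc', n_repeats=5, random_state=0
)
for idx in perm.importances_mean.argsort()[::-1]:
    print(f'feature_{idx}: {perm.importances_mean[idx]:.3f} '
          f'+/- {perm.importances_std[idx]:.3f}')
```

Applied to this notebook, passing the whole `rf` pipeline together with `X_test` and `y_test` should yield one importance per original column (before one-hot encoding), which is often easier to interpret than per-dummy importances.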

### Conclusion
- Two strong baselines were trained: **Logistic Regression** and **Random Forest**.
- Metrics reported: Accuracy and ROC-AUC, plus confusion matrices and ROC curves.
- The Random Forest usually performs slightly better on this dataset; compare the Accuracy and ROC-AUC values printed above to confirm this for a given run.
- Next steps: hyperparameter tuning, class weighting, and model calibration.
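
The three next steps can be sketched together as follows. This is an illustration under stated assumptions, not a tuned recipe: the data is synthetic (`make_classification` with roughly 20% positives), the grid is deliberately tiny, and all variable names are hypothetical.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (about 80% negatives, 20% positives)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

# Hyperparameter tuning and class weighting searched together, scored by ROC-AUC
param_grid = {'max_depth': [3, None], 'class_weight': [None, 'balanced']}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring='roc_auc',
    cv=3,
)
search.fit(X_fit, y_fit)
print('Best parameters:', search.best_params_)

# Calibration: refit the chosen configuration with sigmoid (Platt) scaling
calibrated = CalibratedClassifierCV(search.best_estimator_, method='sigmoid', cv=3)
calibrated.fit(X_fit, y_fit)
proba_hold = calibrated.predict_proba(X_hold)[:, 1]
print('Mean predicted default probability:', round(float(proba_hold.mean()), 3))
```

With the notebook's `rf` pipeline, the grid keys would need the step prefix (for example `clf__max_depth` and `clf__class_weight`), because `GridSearchCV` addresses parameters of pipeline steps by `step__parameter`.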
