### Name: Kshitij Chilate
### Roll No.: 35
### Batch: B2
### FA Practical 7
### Experiment 7: Credit Risk Classification Using Logistic Regression

### Aim: Build a credit default prediction model using logistic regression, evaluate it rigorously, interpret it for business use, and tune the decision threshold based on cost.

### Objective:

* Understand logistic regression for binary credit default prediction.
* Handle class imbalance, preprocessing, and robust train/validation/test splitting.
* Engineer predictive features from payments and bills.
* Fit regularized logistic regression within a pipeline.
*   Evaluate with ROC-AUC, PR-AUC, confusion matrices, and cost-sensitive thresholds.
*   Calibrate probabilities and interpret the model (odds ratios, permutation importance).
*   Present a brief case study mapping metrics to decisions.



## 1) Why Logistic Regression for Credit Risk?
### Credit risk modelling is fundamentally about **predicting the likelihood** that a customer will **default** on their payments.


*   **Binary nature of the problem:** We have only 2 outcomes:


    *   **Default (1)**->customer fails to pay
    *   **Non Default (0)**->customer pays on time

* **Business needs probabilities, not just labels**: Banks and lenders don't just want to classify customers - they want to **quantify** the likelihood of default to:
    * Set **credit limits** (higher limits for safer customers)
    * Price **interest rates** based on risk
    * Approve/reject loan applications
    * Implement **risk-based strategies** for collections and interventions


### 2) Why is logistic regression widely used?

*   **Probabilistic** - Gives a score between **0 and 1**, interpretable as the **probability of default**.

*   **Interpretable** - Coefficients directly relate to **log-odds**, and exponentiating them gives **odds ratios**. This helps explain decisions to auditors, regulators, and stakeholders.
*   **Efficient** - Works well for **large tabular datasets** like financial transactions.


* **Regulatory acceptance** - Logistic regression is considered **industry standard** in finance because its assumptions and outputs are **explainable**.
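
To make the link between a coefficient and an odds ratio concrete, here is a minimal sketch. The coefficient `0.4` and the baseline probability `0.20` are made-up illustration values, not results from this notebook, and `apply_odds_ratio` is a hypothetical helper.

```python
import math

def apply_odds_ratio(p_base: float, coef: float, delta: float = 1.0) -> float:
    """Probability after increasing one feature by `delta`, others held fixed."""
    odds_base = p_base / (1.0 - p_base)             # probability -> odds
    odds_new = odds_base * math.exp(coef * delta)   # multiply odds by the odds ratio
    return odds_new / (1.0 + odds_new)              # odds -> probability

coef = 0.4      # hypothetical coefficient (change in log-odds per unit of the feature)
p_base = 0.20   # hypothetical baseline default probability

odds_ratio = math.exp(coef)
p_new = apply_odds_ratio(p_base, coef)

print(f"odds ratio = {odds_ratio:.3f}")
print(f"default probability: {p_base:.3f} -> {p_new:.3f}")
```

The odds ratio multiplies the odds, not the probability, so the same coefficient moves the probability by different amounts depending on the starting point.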





### 3) Model Formulation

Logistic regression predicts the **log-odds** of default:

$$
\log \frac{p(y=1\mid x)}{1 - p(y=1\mid x)} = \beta_0 + \beta^\top x
$$

*   $\beta$ - vector of coefficients for the features.
*   $p(y=1 \mid x)$ - probability of default given customer features $x$.
*   $\beta_0$ - intercept term.

Equivalently, we can express the model as:

$$
p(y=1\mid x) = \sigma(\beta_0 + \beta^\top x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}
$$

This is the **sigmoid function** - it maps any real number to a value between *0 and 1*.

**Intuition**:

- If $\beta_0 + \beta^\top x$ is **very positive**, the probability $p(y=1)$ approaches **1** → customer likely to default.
- If it's **very negative**, $p(y=1)$ approaches **0** → customer unlikely to default.
- If it's around **0**, the default probability is **50%**.
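
The behaviour of the sigmoid can be checked numerically. The sketch below is only an illustration: the scores in `z` are hypothetical values of $\beta_0 + \beta^\top x$, not outputs of the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores beta_0 + beta^T x for five customers
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
p = sigmoid(z)

for score, prob in zip(z, p):
    print(f"score = {score:+.1f} -> p(default) = {prob:.4f}")
```

Large negative scores give probabilities near 0, a score of 0 gives exactly 0.5, and large positive scores give probabilities near 1.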


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pathlib import Path

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import (
    roc_auc_score, roc_curve, precision_recall_curve, average_precision_score,
    confusion_matrix, classification_report, brier_score_loss
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyClassifier
from sklearn.calibration import CalibratedClassifierCV

import warnings
warnings.filterwarnings('ignore')

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [None]:
csv_name = 'UCI_Credit_Card.csv'
candidates = [Path('.')/csv_name, Path('content')/csv_name, Path('/mnt/data')/csv_name]
csv_path = None
for p in candidates:
  if p.exists():
    csv_path = p
    break

if csv_path is None:
  try:
    from google.colab import files
    print('Upload UCI_Credit_Card.csv')
    uploaded = files.upload()
    csv_path = list(uploaded.keys())[0]
  except Exception as e:
    raise FileNotFoundError('Could not find UCI_Credit_Card.csv, upload it in this cell.') from e

df = pd.read_csv(csv_path)
df = df.rename(columns={'default.payment.next.month':'DEFAULT'})
if 'ID' in df.columns:
  df = df.drop(columns=['ID'])

print('Shape:', df.shape)
df.head()

Shape: (30000, 24)


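|   | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT |
| - | --------- | --- | --------- | -------- | --- | ----- | ----- | ----- | ----- | ----- | --- | --------- | --------- | --------- | -------- | -------- | -------- | -------- | -------- | -------- | ------- |
| 0 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |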


In [None]:
print("Nulls (top 10):")

print(df.isnull().sum().sort_values(ascending=False).head(10), "\n")

print("Target distribution (proportion):")

print(df['DEFAULT'].value_counts(normalize=True).rename('proportion'), "\n")

print("Data types (first 10 columns):")

print(df.dtypes.head(10))

cols_to_plot = ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'PAY_AMT1']

df[cols_to_plot].hist(bins=30, figsize=(10,6))

plt.tight_layout(); plt.show()


In [None]:
fe = df.copy()
amt_cols = [c for c in fe.columns if c.startswith('BILL_AMT') or c.startswith('PAY_AMT')]

# Clip amount columns at the 0.1% and 99.9% quantiles to limit extreme outliers.
# Note: the quantiles are computed on the full dataset, before the train/test split.
for c in amt_cols:
  lo = fe[c].quantile(0.001)
  hi = fe[c].quantile(0.999)
  fe[c] = fe[c].clip(lower=lo, upper=hi)

# Utilization per month: bill amount relative to the credit limit
for k in range(1,7):
  fe[f'UTIL_{k}'] = fe[f'BILL_AMT{k}'] / fe['LIMIT_BAL'].replace(0, np.nan)

util_cols = [f'UTIL_{k}' for k in range(1,7)]
fe[util_cols] = fe[util_cols].fillna(0.0)

# Difference between consecutive monthly bill amounts
for k in range(1,6):
  fe[f'BILL_AMT{k}_{k+1}'] = fe[f'BILL_AMT{k+1}'] - fe[f'BILL_AMT{k}']

# Payment-to-bill ratio per month; months with a zero bill are set to 0.0
for k in range(1,7):
  denom = fe[f'BILL_AMT{k}'].replace(0, np.nan)
  ratio = fe[f'PAY_AMT{k}'] / denom
  fe[f'PAY_RATIO_{k}'] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)

# Summaries of the repayment-status columns
pay_cols = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

fe['MAX_PAY_DELAY'] = fe[pay_cols].max(axis=1)

fe['AVG_PAY_DELAY'] = fe[pay_cols].mean(axis=1)

# Flag customers with any repayment status of 2 or more
fe['ANY_PAY_DELAY_GE_2'] = (fe[pay_cols].ge(2).any(axis=1)).astype(int)

fe['DEFAULT'] = df['DEFAULT'].astype(int)

print("Feature engineered shape:", fe.shape)

display(fe.head())

In [None]:
X = fe.drop(columns=['DEFAULT'])
y = fe['DEFAULT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE)

In [None]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

In [None]:
# Split columns into numeric and categorical groups
num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

# These are integer-coded categories, so treat them as categorical rather than numeric
cat_candidates = ['SEX', 'EDUCATION', 'MARRIAGE']
cat_cols = [c for c in cat_candidates if c in X_train.columns]
num_cols = [c for c in num_cols if c not in cat_cols]

# Numeric features: median imputation, then scaling that is robust to outliers
numeric_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler())
])

# Categorical features: most-frequent imputation, then one-hot encoding
categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

# Route each column group through its own pipeline
preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_pipe, num_cols),
        ('cat', categorical_pipe, cat_cols)
    ]
)

preprocess



In [None]:
# Baseline: always predict the majority class, for comparison with real models
dummy = Pipeline(steps=[
    ('prep', preprocess),
    ('clf', DummyClassifier(strategy='most_frequent'))
])

dummy.fit(X_train, y_train)

p_dummy = dummy.predict_proba(X_test)[:,1]

print("Baseline Dummy ROC AUC:", roc_auc_score(y_test, p_dummy))

print("Baseline Dummy PR AUC (AP):", average_precision_score(y_test, p_dummy))

In [None]:
logit_base = Pipeline(steps=[
    ('prep', preprocess),
    ('clf', LogisticRegression(
        max_iter=500,
        class_weight='balanced',
        solver='lbfgs'
    ))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

roc_scores = cross_val_score(
    logit_base,
    X_train, y_train,
    cv=cv,
    scoring='roc_auc',
    n_jobs=-1
)

ap_scores = cross_val_score(
    logit_base,
    X_train, y_train,
    cv=cv,
    scoring='average_precision',
    n_jobs=-1
)

print("CV ROC AUC  mean±sd:", f"{roc_scores.mean():.3f} ± {roc_scores.std():.3f}")

print("CV PR  AUC  mean±sd:", f"{ap_scores.mean():.3f} ± {ap_scores.std():.3f}")

* **What is happening here?**



| **Step**             | **What It Does**                                          | **Why It Matters**                        |
| -------------------- | --------------------------------------------------------- | ----------------------------------------- |
| Preprocessing        | Applies imputations, scaling, and encoding                | Ensures clean, normalized features        |
| Logistic Regression  | Trains a linear classifier with L2 regularization         | Avoids overfitting, interpretable         |
| Class Balancing      | Uses `class_weight='balanced'`                            | Adjusts for default/non-default imbalance |
| Stratified K-Fold CV | Keeps same ratio of positive/negative cases per fold      | Ensures stable, fair evaluation           |
| ROC AUC              | Measures ranking quality of positive vs. negative classes | Good for overall model discrimination     |
| PR AUC               | Measures positive-class precision vs recall               | More meaningful for **imbalanced data**   |
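
As a small illustration of the class-balancing row, the sketch below shows how scikit-learn's `'balanced'` mode turns class counts into weights. The 80/20 toy labels are an assumption chosen for readability and are not the class ratio of the credit dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 80 non-defaults and 20 defaults (illustrative, not the real data)
y_toy = np.array([0] * 80 + [1] * 20)

classes = np.array([0, 1])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

# Same formula by hand: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
print("matches manual formula:", np.allclose(weights, manual))
```

The rarer class gets the larger weight, so each misclassified default contributes more to the training loss than a misclassified non-default.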

**Key Insights**

* Dummy baseline ROC AUC ≈ 0.5 → no better than random guessing, because the dummy assigns every customer the same score.

* Logistic regression ROC AUC should be significantly higher if features are informative.

* PR AUC is especially important for credit default prediction because defaults are the minority class.
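
The baseline values for both metrics can be checked on synthetic data. In this hedged sketch the labels are randomly generated with roughly 20% positives (an assumed rate, not the dataset's), and the constant score mimics what a most-frequent dummy classifier outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Synthetic labels with roughly 20% positives (illustrative, not the real data)
y_toy = (rng.random(5000) < 0.20).astype(int)

# A constant score for every customer, as a most-frequent dummy would produce
p_const = np.zeros_like(y_toy, dtype=float)

prevalence = y_toy.mean()
roc = roc_auc_score(y_toy, p_const)
ap = average_precision_score(y_toy, p_const)

print(f"prevalence = {prevalence:.3f}")
print(f"ROC AUC    = {roc:.3f}")
print(f"PR AUC     = {ap:.3f}")
```

A constant score gives ROC AUC of 0.5 regardless of class balance, while its PR AUC equals the positive-class prevalence, so a model's PR AUC should be judged against the default rate rather than against 0.5.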


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from tempfile import mkdtemp
import matplotlib.pyplot as plt
from sklearn.metrics import (
    roc_auc_score, average_precision_score, roc_curve, precision_recall_curve
)

cachedir = mkdtemp()

logit_cv = Pipeline(steps=[
    ('prep', preprocess),
    ('clf', LogisticRegressionCV(
        Cs=10,
        cv=3,
        scoring='roc_auc',
        solver='lbfgs',
        penalty='l2',
        class_weight='balanced',
        max_iter=200,
        tol=1e-3,
        n_jobs=-1,
        refit=True
    ))
], memory=cachedir)


logit_cv.fit(X_train, y_train)


best_model = logit_cv
p_test = best_model.predict_proba(X_test)[:, 1]


auc = roc_auc_score(y_test, p_test)
ap  = average_precision_score(y_test, p_test)

print("Best params (effective):", {
    'penalty': 'l2',
    'solver': 'lbfgs',
    'C': float(best_model.named_steps['clf'].C_[0]),
    'class_weight': 'balanced',
    'max_iter': 200,
    'tol': 1e-3
})
print("Best CV ROC AUC:", f"{best_model.named_steps['clf'].scores_[1].mean(axis=0).max():.3f}"
      if 1 in best_model.named_steps['clf'].scores_ else f"{auc:.3f}")
print("Holdout ROC AUC:", f"{auc:.3f}")
print("Holdout PR  AUC:", f"{ap:.3f}")


fpr, tpr, roc_th = roc_curve(y_test, p_test)
prec, rec, pr_th = precision_recall_curve(y_test, p_test)

plt.figure(figsize=(12,4))
plt.subplot(1,2,1); plt.plot(fpr, tpr); plt.plot([0,1],[0,1],'--')
plt.title(f"ROC (AUC={auc:.3f})"); plt.xlabel('FPR'); plt.ylabel('TPR')

plt.subplot(1,2,2); plt.plot(rec, prec)
plt.title(f"PR (AP={ap:.3f})"); plt.xlabel('Recall'); plt.ylabel('Precision')

plt.tight_layout(); plt.show()

In [None]:
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
thr_j = roc_th[best_idx]
print("Best threshold (Youden J):", float(thr_j))

# Assumed business costs: a missed default (FN) costs 10x a false alarm (FP)
cost_fn, cost_fp = 10.0, 1.0
grid_thr = np.linspace(0.05, 0.95, 37)

def cost_from_threshold(thr, y_true, p):
    """Compute total cost and confusion tuple at a given threshold."""
    pred = (p >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, pred).ravel()
    return cost_fn*fn + cost_fp*fp, (tn, fp, fn, tp)

# Note: the threshold is searched on the test set here, so the reported cost is
# optimistic; tuning on a separate validation split would give a fairer estimate.
best_cost, best_thr, best_tuple = np.inf, None, None
for t in grid_thr:
    c, tpl = cost_from_threshold(t, y_test, p_test)
    if c < best_cost:
        best_cost, best_thr, best_tuple = c, t, tpl

print(f"Best cost-sensitive threshold = {best_thr:.3f}, cost={best_cost:.1f}, confusion={best_tuple}")

t_star = cost_fp / (cost_fp + cost_fn)
print("Heuristic threshold t* (if perfectly calibrated):", round(t_star, 3))

for name, thr in [('0.5', 0.5), ('J', thr_j), ('COST', best_thr)]:
    pred = (p_test >= thr).astype(int)
    print(f"\nClassification report @ {name} (thr={thr:.3f})")
    print(classification_report(y_test, pred, digits=3))
    print("Confusion matrix:\n", confusion_matrix(y_test, pred))
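
The heuristic threshold `t_star` in the cell above can be derived by comparing expected costs for a single customer. The symbols $c_{FP}$ and $c_{FN}$ are introduced here for illustration and correspond to `cost_fp` and `cost_fn`; the derivation assumes $p$ is a well-calibrated default probability.

$$
\underbrace{c_{FP}\,(1-p)}_{\text{expected cost of flagging}} \;\le\; \underbrace{c_{FN}\,p}_{\text{expected cost of not flagging}}
\iff p \ge \frac{c_{FP}}{c_{FP}+c_{FN}} = t^*
$$

With $c_{FN} = 10$ and $c_{FP} = 1$:

$$
t^* = \frac{1}{1+10} = \frac{1}{11} \approx 0.091
$$

Because the model is trained with `class_weight='balanced'`, its raw scores are not expected to be calibrated, so the grid-searched threshold may differ noticeably from $t^*$.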

In [None]:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss, roc_auc_score, average_precision_score

cal_model = CalibratedClassifierCV(
    estimator=best_model,
    method='isotonic',
    cv=3,
    n_jobs=-1
)


cal_model.fit(X_train, y_train)


p_test_cal = cal_model.predict_proba(X_test)[:, 1]


print("Brier score (uncalibrated):", round(brier_score_loss(y_test, p_test), 6))
print("Brier score (calibrated):  ", round(brier_score_loss(y_test, p_test_cal), 6))


auc_cal = roc_auc_score(y_test, p_test_cal)
ap_cal  = average_precision_score(y_test, p_test_cal)
print("Calibrated AUC:", f"{auc_cal:.3f}", "Calibrated AP:", f"{ap_cal:.3f}")
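
For reference, the Brier score used above is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. The sketch below recomputes it by hand on toy values; the arrays are made up for illustration and `brier` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Toy outcomes and two sets of predicted probabilities (illustrative values)
y_toy = np.array([0, 0, 0, 1, 1])
p_inflated = np.array([0.6, 0.7, 0.5, 0.9, 0.8])   # too high for the non-defaulters
p_sharper = np.array([0.2, 0.3, 0.1, 0.8, 0.7])    # closer to the observed outcomes

def brier(y_true, p):
    """Mean squared difference between predicted probability and outcome."""
    return float(np.mean((p - y_true) ** 2))

for name, p in [("inflated", p_inflated), ("sharper", p_sharper)]:
    print(f"{name}: manual = {brier(y_toy, p):.4f}, "
          f"sklearn = {brier_score_loss(y_toy, p):.4f}")
```

Both sets of scores rank the defaulters above the non-defaulters, so they share the same ROC AUC on this toy data; only the Brier score separates them.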



In [None]:
prep = best_model.named_steps['prep']
clf  = best_model.named_steps['clf']

num_names = list(num_cols)
cat_names = []

if len(cat_cols) > 0:
    try:
        ohe = prep.named_transformers_['cat'].named_steps['ohe']
        cat_names = list(ohe.get_feature_names_out(cat_cols))
    except Exception:
        cat_names = [f"{c}_[OHE]" for c in cat_cols]


feature_names = num_names + cat_names

coef = pd.Series(
    clf.coef_.ravel(),
    index=feature_names
).sort_values(key=np.abs, ascending=False)

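# Numeric features were scaled by RobustScaler, so each odds ratio is per one
# interquartile range of the raw feature, not per raw unit.
odds = np.exp(coef)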

coef_df = pd.DataFrame({
    'coef': coef,
    'odds_ratio': odds
})

display(
    coef_df.head(25)
    .style.format({'coef': '{:.4f}', 'odds_ratio': '{:.4f}'})
)

print("\nInterpretation tips: odds_ratio > 1 means the feature increases odds of default "
      "(holding others fixed); < 1 decreases odds.")


In [None]:

from sklearn.inspection import permutation_importance
import pandas as pd

perm = permutation_importance(
    best_model,
    X_test, y_test,
    n_repeats=10,
    random_state=RANDOM_STATE,
    scoring='roc_auc'
)

n_perm = perm.importances_mean.shape[0]

if hasattr(X_test, "columns") and X_test.shape[1] == n_perm:
    feat_names = list(X_test.columns)
else:
    try:
        prep = best_model.named_steps.get("prep")
        if prep is not None:
            try:
                feat_names = list(prep.get_feature_names_out(getattr(X_test, "columns", None)))
            except TypeError:
                feat_names = list(prep.get_feature_names_out())
        else:
            feat_names = [f"feature_{i}" for i in range(n_perm)]
    except Exception:
        feat_names = [f"feature_{i}" for i in range(n_perm)]
    if len(feat_names) != n_perm:  # final guard
        feat_names = feat_names[:n_perm]

imp = pd.Series(perm.importances_mean, index=feat_names).sort_values(ascending=False)
display(imp.head(20).to_frame("permutation_importance"))

In [None]:
import pandas as pd

individual_risk = pd.DataFrame({
    "Customer_ID": X_test.index,
    "Predicted_Default_Probability": p_test,
    "Actual_Default": y_test.values
})

display(individual_risk.head(10))