# 02c — Calibration & Ablation (Poland, Horizon = 1)

In [1]:
print('Notebook loaded OK — cells present.')

Notebook loaded OK — cells present.


**Goals**
1. **Calibrate** probabilities (RF: isotonic; Logit: Platt/sigmoid) and re-check Brier + reliability.
2. **Ablation:** RF **without** missingness indicators to test robustness.
3. Produce **threshold tables** at **1%** and **5%** FPR caps for each model.

> Run from repo root so `data/processed/` paths resolve.


### Step 1 — Imports & data

In [2]:
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss, roc_curve
import statsmodels.api as sm

RANDOM_STATE = 42
REPO_ROOT = Path.cwd()
PROC = REPO_ROOT / "data" / "processed"

df_full = pd.read_parquet(PROC / "poland_clean_full.parquet")
df_red  = pd.read_parquet(PROC / "poland_clean_reduced.parquet")

full_h1 = df_full[df_full['horizon']==1].copy()
red_h1  = df_red[df_red['horizon']==1].copy()

y = full_h1['y'].astype(int).values
X_full = full_h1.drop(columns=['y','horizon'])
X_red  = red_h1.drop(columns=['y','horizon'])

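# One stratified 80/20 split. The same positional indices are applied to both feature
# sets below, which assumes df_full and df_red hold the same rows in the same order.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=RANDOM_STATE)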
(train_idx, test_idx), = sss.split(X_full, y)

Xf_tr, Xf_te = X_full.iloc[train_idx], X_full.iloc[test_idx]
Xr_tr, Xr_te = X_red.iloc[train_idx], X_red.iloc[test_idx]
y_tr,  y_te  = y[train_idx], y[test_idx]

print('Shapes → X_full:', X_full.shape, '| X_red:', X_red.shape, '| y pos rate:', f"{y.mean():.4f}")

Shapes → X_full: (7027, 65) | X_red: (7027, 48) | y pos rate: 0.0386


### Step 2 — Helpers (metrics, recall@FPR, calibration table, threshold tables)

In [3]:
def recall_at_fpr(y_true, y_score, fpr_cap=0.01):
    fpr, tpr, thr = roc_curve(y_true, y_score)
    mask = fpr <= fpr_cap
    if not np.any(mask):
        return 0.0, None
    idx = np.argmax(tpr[mask])
    return float(tpr[mask][idx]), float(thr[mask][idx])

def calibration_table(y_true, y_prob, n_bins=10):
    bins = pd.qcut(y_prob, q=n_bins, duplicates='drop')
    tab = pd.DataFrame({'bin': bins, 'y': y_true, 'p': y_prob})\
        .groupby('bin', observed=False).agg(count=('y','size'), mean_pred=('p','mean'), event_rate=('y','mean'))\
        .reset_index()
    return tab

def evaluate(y_true, y_prob):
    return {
        'roc': roc_auc_score(y_true, y_prob),
        'pr': average_precision_score(y_true, y_prob),
        'brier': brier_score_loss(y_true, y_prob),
        'rec1': recall_at_fpr(y_true, y_prob, 0.01),
        'rec5': recall_at_fpr(y_true, y_prob, 0.05)
    }

def threshold_summary(y_true, y_score, thr):
    yhat = (y_score >= thr).astype(int)
    n = len(y_true); pos = int(y_true.sum()); neg = n - pos
    tp = int(((yhat==1) & (y_true==1)).sum())
    fp = int(((yhat==1) & (y_true==0)).sum())
    fn = pos - tp; tn = neg - fp
    tpr = tp/pos if pos else 0.0
    fpr = fp/neg if neg else 0.0
    prec = tp/(tp+fp) if (tp+fp) else 0.0
    alerts_per_1000 = 1000*(tp+fp)/n
    return {'TP':tp,'FP':fp,'FN':fn,'TN':tn,'TPR':tpr,'FPR':fpr,'Precision':prec,'Alerts_per_1000':alerts_per_1000}
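
To make the `recall_at_fpr` logic concrete, the following is a dependency-free sketch of the same idea: scan candidate thresholds and keep the best recall whose false-positive rate stays under the cap. The function name `recall_at_fpr_plain` and the toy labels and scores are hypothetical and not part of the notebook; edge-case handling (for example, what is returned when no threshold yields any recall) may differ from the `roc_curve`-based helper.

```python
def recall_at_fpr_plain(y_true, y_score, fpr_cap=0.01):
    """Best recall among thresholds whose false-positive rate stays <= fpr_cap."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_tpr, best_thr = 0.0, None
    # Every distinct score is a candidate threshold; an alert fires when score >= threshold.
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
        tpr = tp / pos if pos else 0.0
        fpr = fp / neg if neg else 0.0
        if fpr <= fpr_cap and tpr > best_tpr:
            best_tpr, best_thr = tpr, thr
    return best_tpr, best_thr


# Toy data (hypothetical): 3 positives, 5 negatives.
y_toy = [0, 0, 0, 0, 1, 0, 1, 1]
s_toy = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]

strict = recall_at_fpr_plain(y_toy, s_toy, fpr_cap=0.0)  # no false alerts allowed
loose = recall_at_fpr_plain(y_toy, s_toy, fpr_cap=0.2)   # one false alert in five allowed
print("strict:", strict, "| loose:", loose)
```

Loosening the cap lets the threshold drop past the negative scored 0.6, which buys the third positive at the cost of one false alert. The same trade-off drives the 1% vs 5% tables later in the notebook.
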

### Step 3 — Fit base models (same specs as before)

In [4]:
logit = Pipeline([
    ('scaler', StandardScaler(with_mean=True, with_std=True)),
    ('clf', LogisticRegression(penalty='l2', solver='liblinear',
                               class_weight='balanced', C=1.0, max_iter=200, random_state=RANDOM_STATE))
])
logit.fit(Xr_tr, y_tr)
p_logit = logit.predict_proba(Xr_te)[:,1]

rf = RandomForestClassifier(n_estimators=400, random_state=RANDOM_STATE,
                            class_weight='balanced_subsample', n_jobs=-1,
                            max_depth=None, min_samples_leaf=5)
rf.fit(Xf_tr, y_tr)
p_rf = rf.predict_proba(Xf_te)[:,1]

print('Base ROC (Logit, RF):', f"{roc_auc_score(y_te, p_logit):.3f}", f"{roc_auc_score(y_te, p_rf):.3f}")

Base ROC (Logit, RF): 0.924 0.968


### Step 4 — Calibrate probabilities (Logit=Platt, RF=Isotonic)

In [5]:
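# Calibrate Logit (Platt/sigmoid) and RF (isotonic).
# With cv=5, CalibratedClassifierCV refits clones of the estimator on each training fold
# and calibrates on the held-out part; the fitted `logit` and `rf` from Step 3 are not reused.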
cal_logit = CalibratedClassifierCV(estimator=logit, method='sigmoid', cv=5)
cal_logit.fit(Xr_tr, y_tr)
p_logit_cal = cal_logit.predict_proba(Xr_te)[:, 1]

cal_rf = CalibratedClassifierCV(estimator=rf, method='isotonic', cv=5)
cal_rf.fit(Xf_tr, y_tr)
p_rf_cal = cal_rf.predict_proba(Xf_te)[:, 1]

# Summaries
for name, p in [
    ('Logit_uncal', p_logit),
    ('Logit_cal',  p_logit_cal),
    ('RF_uncal',   p_rf),
    ('RF_cal',     p_rf_cal),
]:
    e = evaluate(y_te, p)
    r1, t1 = e['rec1']; r5, t5 = e['rec5']
    t1s = 'None' if t1 is None else f'{t1:.6f}'
    t5s = 'None' if t5 is None else f'{t5:.6f}'
    print(f"{name:>12} | ROC={e['roc']:.3f} PR={e['pr']:.3f} Brier={e['brier']:.4f} | "
          f"R@1%={r1:.3f} (thr={t1s}) | R@5%={r5:.3f} (thr={t5s})")


 Logit_uncal | ROC=0.924 PR=0.385 Brier=0.1064 | R@1%=0.278 (thr=0.981800) | R@5%=0.648 (thr=0.744987)
   Logit_cal | ROC=0.925 PR=0.387 Brier=0.0289 | R@1%=0.296 (thr=0.316068) | R@5%=0.648 (thr=0.085443)
    RF_uncal | ROC=0.968 PR=0.738 Brier=0.0208 | R@1%=0.593 (thr=0.311743) | R@5%=0.796 (thr=0.187093)
      RF_cal | ROC=0.963 PR=0.738 Brier=0.0178 | R@1%=0.593 (thr=0.296404) | R@5%=0.815 (thr=0.078241)


## Interpretation (Calibration summaries)

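* **Logit (Platt):** Large **calibration** gain (Brier 0.1064 → 0.0289). Ranking is essentially unchanged (ROC 0.924 → 0.925), but thresholds now sit in a sensible range (≈0.316 at the 1% FPR cap and ≈0.085 at 5%, versus 0.982 and 0.745 before). Recall at 1% FPR rises slightly (0.278 → 0.296, one extra true positive); recall at 5% FPR is unchanged at 0.648.
* **RF (Isotonic):** Small ROC-AUC cost (0.968 → 0.963, expected because isotonic regression maps scores onto a step function and introduces ties; PR-AUC stays at 0.738), **better Brier** (0.0208 → 0.0178), and **slightly better recall at 5% FPR** (0.796 → 0.815) with identical recall at 1% FPR (0.593). Net: **calibration without sacrificing detection**.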
* **Policy takeaway:** Use **RF_cal** for alerts; keep **Logit_cal/GLM** for interpretable economics.
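
Why can Brier improve so much while ROC barely moves? A strictly increasing recalibration map changes the probability values but not their order, and ROC-AUC depends only on the order. The sketch below illustrates this on synthetic data; the scores, the positions of the positives, and the coefficients `a` and `b` are made up for illustration and are not fitted to the Poland data. In the notebook itself, `CalibratedClassifierCV(cv=5)` refits the model on folds and averages the results, so small ROC shifts such as 0.924 → 0.925 are plausible.

```python
import math


def auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def sigmoid_recalibrate(p, a=1.0, b=-3.0):
    """Platt-style map on the logit scale; a and b are illustrative, not fitted."""
    z = a * math.log(p / (1.0 - p)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic data: 100 evenly spaced scores with 4 positives (4% base rate),
# mimicking an overconfident model on a rare-event problem.
scores = [(i + 0.5) / 100 for i in range(100)]
positive_rows = {60, 90, 97, 99}
labels = [1 if i in positive_rows else 0 for i in range(100)]

recal = [sigmoid_recalibrate(p) for p in scores]

auc_raw, auc_cal = auc(labels, scores), auc(labels, recal)
brier_raw, brier_cal = brier(labels, scores), brier(labels, recal)
print(f"AUC   raw={auc_raw:.4f}  recalibrated={auc_cal:.4f}")
print(f"Brier raw={brier_raw:.4f}  recalibrated={brier_cal:.4f}")
```
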



### Step 5 — Calibration tables (head)

In [6]:
print('Logit (uncal) calibration head:\n', calibration_table(y_te, p_logit).head(10))
print('\nLogit (cal) calibration head:\n', calibration_table(y_te, p_logit_cal).head(10))
print('\nRF (uncal) calibration head:\n', calibration_table(y_te, p_rf).head(10))
print('\nRF (cal) calibration head:\n', calibration_table(y_te, p_rf_cal).head(10))

Logit (uncal) calibration head:
                       bin  count  mean_pred  event_rate
0  (-0.000999895, 0.0195]    141   0.008294    0.000000
1         (0.0195, 0.045]    141   0.031962    0.000000
2         (0.045, 0.0767]    140   0.060557    0.000000
3         (0.0767, 0.112]    141   0.093760    0.000000
4          (0.112, 0.156]    140   0.132571    0.000000
5          (0.156, 0.223]    141   0.189368    0.021277
6          (0.223, 0.311]    140   0.265087    0.014286
7          (0.311, 0.441]    141   0.369789    0.014184
8           (0.441, 0.65]    140   0.533753    0.057143
9           (0.65, 0.999]    141   0.834228    0.276596

Logit (cal) calibration head:
                      bin  count  mean_pred  event_rate
0  (-0.0009839, 0.00529]    141   0.002976    0.000000
1     (0.00529, 0.00856]    141   0.006934    0.000000
2      (0.00856, 0.0116]    140   0.010115    0.000000
3       (0.0116, 0.0148]    141   0.013177    0.000000
4       (0.0148, 0.0183]    140   0.016405  

## Interpretation (Calibration tables)

* **Before calibration**, Logit’s top bin was **badly overconfident** (mean_pred ≈ 0.83 vs event_rate ≈ 0.28). This is consistent with training under `class_weight='balanced'`, which inflates predicted probabilities when the true base rate is only about 4%.
* **After calibration**, the top bin is **much closer** (mean_pred ≈ 0.19 vs event_rate ≈ 0.28). It is now underconfident, but acceptably so; we can fine-tune later if needed.
* **RF** was already reasonable at the top end; **isotonic** preserves that and improves lower bins’ reliability.
* Bottom bins showing ~0% events is expected with a base rate of about 3.9% and a test fold of only 1,406 rows (54 events).
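
The miscalibration in a reliability table can be condensed into one number. One common summary, not computed in the notebook, is a count-weighted mean absolute gap between `mean_pred` and `event_rate`, often called expected calibration error (ECE). The sketch below applies it to the uncalibrated Logit rows printed above; the function name is hypothetical. Because the calibrated table is truncated in the output, the comparison figure is not reproduced here.

```python
# (count, mean_pred, event_rate) copied from the "Logit (uncal)" table above.
logit_uncal_bins = [
    (141, 0.008294, 0.000000),
    (141, 0.031962, 0.000000),
    (140, 0.060557, 0.000000),
    (141, 0.093760, 0.000000),
    (140, 0.132571, 0.000000),
    (141, 0.189368, 0.021277),
    (140, 0.265087, 0.014286),
    (141, 0.369789, 0.014184),
    (140, 0.533753, 0.057143),
    (141, 0.834228, 0.276596),
]


def expected_calibration_error(bins):
    """Count-weighted mean absolute gap between predicted and observed rates."""
    total = sum(count for count, _, _ in bins)
    gap = sum(count * abs(mean_pred - event_rate) for count, mean_pred, event_rate in bins)
    return gap / total


ece = expected_calibration_error(logit_uncal_bins)

# Sanity check: the bins should account for every positive in the test fold.
implied_events = sum(count * event_rate for count, _, event_rate in logit_uncal_bins)
print(f"ECE (Logit, uncalibrated) = {ece:.3f} | implied events = {implied_events:.1f}")
```

Running the same function on the calibrated table (all ten rows) would show how much of the gap Platt scaling removes.
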


### Step 6 — Threshold summaries at exact FPR caps (1% and 5%)

In [7]:
def table_at_caps(name, y_true, y_score):
    r1, t1 = recall_at_fpr(y_true, y_score, 0.01)
    r5, t5 = recall_at_fpr(y_true, y_score, 0.05)
    rows = []
    if t1 is not None:
        rows.append({'cap':'1% FPR','thr':t1, **threshold_summary(y_true, y_score, t1)})
    if t5 is not None:
        rows.append({'cap':'5% FPR','thr':t5, **threshold_summary(y_true, y_score, t5)})
    return name, pd.DataFrame(rows)

tabs = {}
for name, p in [('Logit_uncal', p_logit), ('Logit_cal', p_logit_cal),
                ('RF_uncal', p_rf), ('RF_cal', p_rf_cal)]:
    nm, df = table_at_caps(name, y_te, p)
    tabs[nm] = df
    print(f"\n{name}:\n", df)

tabs


Logit_uncal:
       cap       thr  TP  FP  FN    TN       TPR       FPR  Precision  \
0  1% FPR  0.981800  15  12  39  1340  0.277778  0.008876   0.555556   
1  5% FPR  0.744987  35  63  19  1289  0.648148  0.046598   0.357143   

   Alerts_per_1000  
0        19.203414  
1        69.701280  

Logit_cal:
       cap       thr  TP  FP  FN    TN       TPR       FPR  Precision  \
0  1% FPR  0.316068  16  12  38  1340  0.296296  0.008876   0.571429   
1  5% FPR  0.085443  35  61  19  1291  0.648148  0.045118   0.364583   

   Alerts_per_1000  
0        19.914651  
1        68.278805  

RF_uncal:
       cap       thr  TP  FP  FN    TN       TPR       FPR  Precision  \
0  1% FPR  0.311743  32  13  22  1339  0.592593  0.009615   0.711111   
1  5% FPR  0.187093  43  59  11  1293  0.796296  0.043639   0.421569   

   Alerts_per_1000  
0         32.00569  
1         72.54623  

RF_cal:
       cap       thr  TP  FP  FN    TN       TPR       FPR  Precision  \
0  1% FPR  0.296404  32   9  22  1343 


## Interpretation (Threshold tables @ 1% / 5% FPR)

* **RF_cal (recommended):**

  * **1% FPR** (thr ≈ **0.2964**): **TP 32, FP 9, FN 22**, Precision ~**0.78**, Alerts **~29/1,000**.
    → This is an excellent *strict* early-warning setting.
  * **5% FPR** (thr ≈ **0.0782**): **TP 44, FP 55, FN 10**, Precision ~0.44, Alerts **~70/1,000**.
    → Use only if stakeholders can tolerate more false alerts to catch ≈**82%** of events.
* **Logit_cal** at the same caps catches **fewer TPs** with as many or more FPs (16 TP / 12 FP at 1% FPR; 35 TP / 61 FP at 5% FPR), so keep it for interpretation, not as the production detector.
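
The figures quoted for RF_cal follow directly from the confusion counts. As a worked check, the sketch below recomputes them; `alert_stats` is a hypothetical helper, and the test-fold totals (1,406 rows, 54 positives) are recovered from the printed tables, e.g. 15 + 12 + 39 + 1340.

```python
def alert_stats(tp, fp, n_total, n_pos):
    """Recall, FPR, precision, and alert volume from confusion counts."""
    n_neg = n_total - n_pos
    return {
        "recall": tp / n_pos,
        "fpr": fp / n_neg,
        "precision": tp / (tp + fp),
        "alerts_per_1000": 1000 * (tp + fp) / n_total,
    }


# Test-fold size and number of bankruptcies, recovered from the tables above.
N_TOTAL, N_POS = 1406, 54

rf_cal_1pct = alert_stats(tp=32, fp=9, n_total=N_TOTAL, n_pos=N_POS)
rf_cal_5pct = alert_stats(tp=44, fp=55, n_total=N_TOTAL, n_pos=N_POS)

# With 1,352 negatives, each cap allows only a whole number of false alerts.
max_fp_1pct = int(0.01 * (N_TOTAL - N_POS))
max_fp_5pct = int(0.05 * (N_TOTAL - N_POS))

for label, stats in [("1% cap", rf_cal_1pct), ("5% cap", rf_cal_5pct)]:
    print(label, {k: round(v, 4) for k, v in stats.items()})
print("max tolerated FPs:", max_fp_1pct, "and", max_fp_5pct)
```

Because the cap is an upper bound on a small integer count, the realised FPR usually sits somewhat below the nominal 1% or 5%, as the `FPR` column in the tables shows.
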



### Step 7 — Ablation: RF without missingness indicators

In [8]:
no_ind_cols = [c for c in X_full.columns if not c.endswith('__isna')]
Xf_tr_noind = Xf_tr[no_ind_cols]
Xf_te_noind = Xf_te[no_ind_cols]

rf_noind = RandomForestClassifier(n_estimators=400, random_state=RANDOM_STATE,
                                  class_weight='balanced_subsample', n_jobs=-1,
                                  max_depth=None, min_samples_leaf=5)
rf_noind.fit(Xf_tr_noind, y_tr)
p_rf_noind = rf_noind.predict_proba(Xf_te_noind)[:,1]

e = evaluate(y_te, p_rf_noind)
r1, t1 = e['rec1']; r5, t5 = e['rec5']
t1s = 'None' if t1 is None else f'{t1:.6f}'
t5s = 'None' if t5 is None else f'{t5:.6f}'
print(f"RF_noind | ROC={e['roc']:.3f} PR={e['pr']:.3f} Brier={e['brier']:.4f} | "
      f"R@1%={r1:.3f} (thr={t1s}) | R@5%={r5:.3f} (thr={t5s})")

RF_noind | ROC=0.938 PR=0.581 Brier=0.0275 | R@1%=0.407 (thr=0.305815) | R@5%=0.704 (thr=0.209867)


## Interpretation (Ablation: no `__isna`)

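* Removing missingness indicators **materially degrades RF** relative to RF_uncal (ROC-AUC **0.968 → 0.938**; PR-AUC **0.738 → 0.581**; Brier **0.0208 → 0.0275**; Recall@1%FPR **0.593 → 0.407**, i.e. roughly 22 instead of 32 of the 54 test-set bankruptcies caught).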
* Conclusion: **keep the indicators**. They’re legitimate signals (predictors available at decision time), not leakage. We’ll document this explicitly in the report.

### Step 8 — Save calibrated predictions (optional)

In [9]:
out = pd.DataFrame({
    'y_true': y_te,
    'p_logit_uncal': p_logit,
    'p_logit_cal': p_logit_cal,
    'p_rf_uncal': p_rf,
    'p_rf_cal': p_rf_cal,
    'p_rf_noind': p_rf_noind
})
out_path = PROC / 'poland_h1_test_predictions_calibrated.csv'
out.to_csv(out_path, index=False)
print('Saved:', out_path)

Saved: /Users/reebal/FH-Wedel/WS25/seminar-bankruptcy-prediction/data/processed/poland_h1_test_predictions_calibrated.csv


## Quick robustness + sanity checks
Are the indicators just a proxy for the label? Check their prevalence by class:

In [10]:
full_h1[['Attr21__isna','Attr27__isna','y']].groupby('y').mean().rename(index={0:'non-bankrupt',1:'bankrupt'})


              Attr21__isna  Attr27__isna
y
non-bankrupt      0.223949      0.028271
bankrupt          0.402214      0.442804


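* `Attr21__isna`: missing for 40.2% of bankrupt firms vs 22.4% of non-bankrupt firms.
* `Attr27__isna`: missing for 44.3% of bankrupt firms vs 2.8% of non-bankrupt firms.

→ The indicators are informative but not a copy of the label: more than half of bankrupt firms still report each field, and over a fifth of non-bankrupt firms lack `Attr21`. Missingness itself is a strong, valid early-warning signal (as long as those fields are indeed unknown at prediction time, which they are in this dataset).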
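
To gauge how far each indicator alone moves the bankruptcy probability, Bayes' rule can be applied to the class-conditional missingness rates above and the 3.86% positive rate printed in Step 1. This is a back-of-the-envelope sketch; the function name is hypothetical and each indicator is treated in isolation.

```python
def posterior_given_missing(base_rate, miss_rate_pos, miss_rate_neg):
    """P(bankrupt | field is missing) via Bayes' rule."""
    joint_pos = base_rate * miss_rate_pos
    joint_neg = (1.0 - base_rate) * miss_rate_neg
    return joint_pos / (joint_pos + joint_neg)


BASE_RATE = 0.0386  # 'y pos rate' printed in Step 1

# Missingness rates by class, copied from the groupby output above.
post_attr21 = posterior_given_missing(BASE_RATE, 0.402214, 0.223949)
post_attr27 = posterior_given_missing(BASE_RATE, 0.442804, 0.028271)

print(f"P(bankrupt | Attr21 missing) = {post_attr21:.3f}")
print(f"P(bankrupt | Attr27 missing) = {post_attr27:.3f}")
```

Neither posterior approaches 1, which supports reading the indicators as risk signals rather than as the label in disguise; `Attr27__isna` shifts the probability far more than `Attr21__isna` because it is rare among non-bankrupt firms.
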