# 02b — Calibration & Thresholds (Poland, Horizon = 1)

**Goal:** Turn model scores into actionable early-warning rules.  
We will:
1) Load saved test predictions (`data/processed/poland_h1_test_predictions.csv`)  
2) Compute PR/ROC metrics focused on **low FPR** (1% / 5%)  
3) Build **reliability (calibration) tables** and curves  
4) Make a **threshold summary table** with expected TP/FP counts and alerts-per-1,000 firms

> Paste your short interpretation under each output where indicated.


### Step 1 — Imports & load predictions

**Why:** Bring in metrics tools and load the saved test-set predictions.

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd

from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss, roc_curve, precision_recall_curve

REPO_ROOT = Path.cwd()
DATA_DIR = REPO_ROOT / "data" / "processed"
preds = pd.read_csv(DATA_DIR / "poland_h1_test_predictions.csv")

y = preds['y_true'].values
scores = {
    'Logit': preds['p_logit'].values,
    'RF': preds['p_rf'].values,
    'GLM': preds['p_glm'].values,
}

print('Rows:', len(y), '| Positives:', int(y.sum()), '| Pos rate:', f"{y.mean():.4f}")
for name, s in scores.items():
    roc = roc_auc_score(y, s)
    pr  = average_precision_score(y, s)
    b   = brier_score_loss(y, s)
    print(f"{name} ROC-AUC={roc:.3f} PR-AUC={pr:.3f} Brier={b:.4f}")


Rows: 1406 | Positives: 54 | Pos rate: 0.0384
Logit ROC-AUC=0.924 PR-AUC=0.385 Brier=0.1064
RF ROC-AUC=0.968 PR-AUC=0.738 Brier=0.0208
GLM ROC-AUC=0.924 PR-AUC=0.395 Brier=0.1061


**Interpretation (after running):**  
- Confirm counts and that metrics match the modeling notebook (small diffs are fine).
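A reference point may help when reading the Brier scores: a constant forecast equal to the base rate `p` has a Brier score of `p * (1 - p)`. The sketch below plugs in the counts and Brier values printed above. The variable names are illustrative and not part of the notebook.

```python
# Counts and Brier scores copied from the Step 1 output above.
n_rows, n_pos = 1406, 54
reported_brier = {"Logit": 0.1064, "RF": 0.0208, "GLM": 0.1061}

base_rate = n_pos / n_rows
# Brier score of a constant forecast equal to the base rate: p * (1 - p)
baseline_brier = base_rate * (1 - base_rate)

# Brier skill score: > 0 beats the constant forecast, < 0 is worse than it
skill = {name: 1 - b / baseline_brier for name, b in reported_brier.items()}

print(f"Baseline Brier = {baseline_brier:.4f}")
for name, s in skill.items():
    print(f"{name:>5} | Brier skill = {s:+.2f}")
```

With these numbers the baseline is about 0.037, so Logit and GLM score worse than the constant forecast on Brier despite their high ROC-AUC, while RF scores better. This is consistent with the calibration issue examined in Step 5.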

### Step 2 — Helpers: recall@FPR and calibration table

**Why:** Reusable utilities to summarize early-warning operating points and reliability.

In [3]:
def recall_at_fpr(y_true, y_score, fpr_cap=0.01):
    fpr, tpr, thr = roc_curve(y_true, y_score)
    mask = fpr <= fpr_cap
    if not np.any(mask):
        return 0.0, None
    idx = np.argmax(tpr[mask])
    return float(tpr[mask][idx]), float(thr[mask][idx])

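def calibration_table(y_true, y_prob, n_bins=10):
    # Quantile bins; duplicates='drop' merges tied edges, so fewer than n_bins bins can result
    bins = pd.qcut(y_prob, q=n_bins, duplicates='drop')
    # observed=True states the categorical grouping explicitly and avoids the pandas FutureWarning
    tab = pd.DataFrame({'bin': bins, 'y': y_true, 'p': y_prob})\
        .groupby('bin', observed=True).agg(count=('y','size'), mean_pred=('p','mean'), event_rate=('y','mean'))\
        .reset_index()
    return tab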

def threshold_summary(y_true, y_score, thresholds):
    out = []
    n = len(y_true)
    pos = int(y_true.sum())
    neg = n - pos
    for t in thresholds:
        yhat = (y_score >= t).astype(int)
        tp = int(((yhat==1) & (y_true==1)).sum())
        fp = int(((yhat==1) & (y_true==0)).sum())
        fn = pos - tp
        tn = neg - fp
        tpr = tp / pos if pos else 0.0
        fpr = fp / neg if neg else 0.0
        prec = tp / (tp + fp) if (tp+fp) else 0.0
        alerts_per_1000 = 1000 * (tp + fp) / n
        out.append({'thr': t, 'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn,
                    'TPR': tpr, 'FPR': fpr, 'Precision': prec,
                    'Alerts_per_1000': alerts_per_1000})
    return pd.DataFrame(out)

*No interpretation needed here.*
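To make the recall@FPR logic concrete, here is a minimal pure-Python sketch run on invented toy data. `recall_at_fpr_plain`, `y_toy` and `s_toy` are hypothetical names used only for illustration. This is not the notebook's helper, which relies on `roc_curve`. The sketch assumes at least one positive and one negative label.

```python
def recall_at_fpr_plain(y_true, y_score, fpr_cap):
    """Highest recall reachable with FPR <= fpr_cap under the rule score >= threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_recall, best_thr = 0.0, None
    # Walk candidate thresholds from strictest to loosest
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for yt, s in zip(y_true, y_score) if s >= thr and yt == 1)
        fp = sum(1 for yt, s in zip(y_true, y_score) if s >= thr and yt == 0)
        if fp / neg > fpr_cap:
            break  # FPR can only grow as the threshold drops
        if tp / pos > best_recall:
            best_recall, best_thr = tp / pos, thr
    return best_recall, best_thr

# Toy data: 3 positives, 7 negatives (invented for illustration)
y_toy = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

# A cap of 0.15 allows one false positive out of seven negatives
recall, thr = recall_at_fpr_plain(y_toy, s_toy, fpr_cap=0.15)
print(f"recall={recall:.3f} at thr={thr}")
```

Like the helper above, the sketch keeps the strictest threshold that reaches the best recall under the cap, so ties are resolved in favour of fewer alerts.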

### Step 3 — Low-FPR operating points (1% and 5%)

**Why:** Early-warning systems usually cap false alarms. We compute recall (TPR) at **FPR ≤ 1%** and **≤ 5%**, plus the corresponding thresholds.

In [4]:
for name, s in scores.items():
    r1, t1 = recall_at_fpr(y, s, 0.01)
    r5, t5 = recall_at_fpr(y, s, 0.05)
    print(f"{name:>5} | Recall@1%FPR={r1:.3f} (thr={t1}) | Recall@5%FPR={r5:.3f} (thr={t5})")

Logit | Recall@1%FPR=0.278 (thr=0.9817998903019366) | Recall@5%FPR=0.648 (thr=0.7449873031547386)
   RF | Recall@1%FPR=0.593 (thr=0.311743306538647) | Recall@5%FPR=0.796 (thr=0.1870931271050699)
  GLM | Recall@1%FPR=0.278 (thr=0.9828315289483236) | Recall@5%FPR=0.667 (thr=0.7333376160367324)


**Interpretation (after running):**  
- Compare models at the same FPR cap. The “best” model is the one with higher recall at the chosen cap (usually **RF** here).

## Interpretation

* At **the same false-alarm budget**, **RF** recovers many more true bankruptcies than Logit/GLM: recall 0.593 vs 0.278 at 1% FPR, and 0.796 vs 0.648 (Logit) / 0.667 (GLM) at 5% FPR.
* Recommended caps for the report: **1% FPR (strict)** and **5% FPR (operational)**, with thresholds above.

### Step 4 — Threshold tables around the chosen caps

**Why:** Give decision-makers tangible counts. We tabulate **TP/FP/FN** and alerts-per-1,000 around the low-FPR thresholds.

In [5]:
# Build threshold tables at each model's 1% FPR threshold and at ±0.05 around it (clipped to [0, 1])
tables = {}
for name, s in scores.items():
    _, t1 = recall_at_fpr(y, s, 0.01)
    grid = sorted(set([t1, max(0.0, t1-0.05), min(1.0, t1+0.05)]))
    tables[name] = threshold_summary(y, s, grid)

tables

{'Logit':       thr  TP  FP  FN    TN       TPR       FPR  Precision  Alerts_per_1000
 0  0.9318  27  21  27  1331  0.500000  0.015533   0.562500        34.139403
 1  0.9818  15  12  39  1340  0.277778  0.008876   0.555556        19.203414
 2  1.0000   0   0  54  1352  0.000000  0.000000   0.000000         0.000000,
 'RF':         thr  TP  FP  FN    TN       TPR       FPR  Precision  Alerts_per_1000
 0  0.261743  35  20  19  1332  0.648148  0.014793   0.636364        39.118065
 1  0.311743  32  13  22  1339  0.592593  0.009615   0.711111        32.005690
 2  0.361743  31   7  23  1345  0.574074  0.005178   0.815789        27.027027,
 'GLM':         thr  TP  FP  FN    TN       TPR       FPR  Precision  Alerts_per_1000
 0  0.932832  27  21  27  1331  0.500000  0.015533   0.562500        34.139403
 1  0.982832  15  12  39  1340  0.277778  0.008876   0.555556        19.203414
 2  1.000000   0   0  54  1352  0.000000  0.000000   0.000000         0.000000}

**Interpretation (after running):**  
- Use these tables to pick a threshold that balances recall with a tolerable number of false alerts.  
- “Alerts per 1,000” is easy to communicate to non-technical stakeholders.

## Interpretation

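* The **RF** threshold table near **1% FPR** shows a good balance (**TP 32 vs FP 13**, precision 0.71, about 32 alerts per 1,000 firms).
* Tightening slightly (thr ≈ 0.362) cuts FPs from 13 to **7** and loses only **1 TP** (32 → 31); this is worth mentioning as an alternative *“fewer alerts”* policy.
* For **Logit/GLM**, the 1% FPR threshold yields only **15 TPs for 12 FPs**, fewer than half of RF's TPs for a similar number of false alerts → not ideal for early-warning. Their upper grid point is clipped to 1.0, where no firm is flagged, so that row carries no information.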
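The trade-off between neighbouring thresholds can be spelled out from the table counts alone. The sketch below copies the RF rows from the output above and computes, for each step to a stricter threshold, how many true positives are given up and how many false positives are avoided. `rf_rows` and `tradeoffs` are illustrative names.

```python
n_rows = 1406
# (threshold, TP, FP) rows copied from the RF threshold table above
rf_rows = [(0.261743, 35, 20), (0.311743, 32, 13), (0.361743, 31, 7)]

# Recompute the derived columns from the raw counts
for thr, tp, fp in rf_rows:
    precision = tp / (tp + fp)
    alerts_per_1000 = 1000 * (tp + fp) / n_rows
    print(f"thr={thr:.3f} | precision={precision:.3f} | alerts/1000={alerts_per_1000:.1f}")

# Marginal trade-off when moving to the next stricter threshold
tradeoffs = []
for (thr_a, tp_a, fp_a), (thr_b, tp_b, fp_b) in zip(rf_rows, rf_rows[1:]):
    tradeoffs.append((thr_a, thr_b, tp_a - tp_b, fp_a - fp_b))

for thr_a, thr_b, tp_lost, fp_avoided in tradeoffs:
    print(f"{thr_a:.3f} -> {thr_b:.3f}: lose {tp_lost} TP, avoid {fp_avoided} FP")
```

Read this way, the step from 0.312 to 0.362 is cheap (1 TP lost for 6 FPs avoided), while the step from 0.262 to 0.312 costs 3 TPs for 7 FPs avoided.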

### Step 5 — Calibration tables (10 quantile bins)

**Why:** Check if predicted probabilities match observed event rates in bins (reliability). Poor calibration → add a calibrator later.

In [6]:
cal_tabs = {}
for name, s in scores.items():
    cal = calibration_table(y, s, n_bins=10)
    cal_tabs[name] = cal
    print(f"\n{name} calibration (head):\n", cal.head(10))


Logit calibration (head):
                       bin  count  mean_pred  event_rate
0  (-0.000999895, 0.0195]    141   0.008294    0.000000
1         (0.0195, 0.045]    141   0.031962    0.000000
2         (0.045, 0.0767]    140   0.060557    0.000000
3         (0.0767, 0.112]    141   0.093760    0.000000
4          (0.112, 0.156]    140   0.132571    0.000000
5          (0.156, 0.223]    141   0.189368    0.021277
6          (0.223, 0.311]    140   0.265087    0.014286
7          (0.311, 0.441]    141   0.369789    0.014184
8           (0.441, 0.65]    140   0.533753    0.057143
9           (0.65, 0.999]    141   0.834228    0.276596

RF calibration (head):
                  bin  count  mean_pred  event_rate
0  (-0.001, 0.00454]    141   0.002086    0.000000
1  (0.00454, 0.0104]    141   0.007476    0.000000
2   (0.0104, 0.0171]    140   0.013646    0.000000
3   (0.0171, 0.0245]    141   0.020499    0.000000
4   (0.0245, 0.0345]    140   0.029951    0.000000
5   (0.0345, 0.0503]    1



**Interpretation (after running):**  
- If the top bin has mean_pred far above event_rate (e.g., 0.8 vs 0.28), the model is **over-confident** at the high end — typical with class weights.  
- Plan: add a **post-hoc calibrator** (isotonic or Platt) later.
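To illustrate what an isotonic calibrator does, the sketch below applies a hand-written weighted pool-adjacent-violators pass to the ten Logit bins printed above. This is a bin-level illustration under stated assumptions, not the calibrator the notebook plans to add. `pava` is a hypothetical helper, and a real calibrator should be fitted on a held-out validation split rather than on test data (scikit-learn's `IsotonicRegression` is the usual tool for that).

```python
def pava(values, weights):
    """Weighted pool-adjacent-violators: returns a non-decreasing fit of `values`."""
    blocks = []  # each block holds [weighted mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the previous block sits above the current one
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, k2 = blocks.pop()
            m1, w1, k1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, k1 + k2])
    fitted = []
    for mean, _, k in blocks:
        fitted.extend([mean] * k)
    return fitted

# Bin counts and observed event rates copied from the Logit calibration output
counts = [141, 141, 140, 141, 140, 141, 140, 141, 140, 141]
event_rate = [0.0, 0.0, 0.0, 0.0, 0.0,
              0.021277, 0.014286, 0.014184, 0.057143, 0.276596]

calibrated = pava(event_rate, counts)
for raw, cal in zip(event_rate, calibrated):
    print(f"observed={raw:.4f} -> monotone fit={cal:.4f}")
```

The three middle bins whose event rates dip (0.0213, 0.0143, 0.0142) are pooled into one level, while the other bins keep their observed rates. A fitted calibrator would map raw scores onto such a monotone step function.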

## Interpretation

* **Logit/GLM**: strong **overconfidence** (top bin mean_pred >> event rate; in the Logit table the top bin has mean_pred 0.834 against an event rate of 0.277) → probabilities are unreliable without calibration.
* **RF**: **well calibrated** at the top bin; still verify after isotonic calibration.
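The size of the miscalibration can be condensed into one number per model. The sketch below computes a bin-weighted absolute gap between `mean_pred` and `event_rate` (often called expected calibration error) from the Logit rows printed above. The names `logit_bins` and `ece` are illustrative, and the value depends on the binning, so it is a rough summary rather than a definitive metric.

```python
# (count, mean_pred, event_rate) rows copied from the Logit calibration output above
logit_bins = [
    (141, 0.008294, 0.000000),
    (141, 0.031962, 0.000000),
    (140, 0.060557, 0.000000),
    (141, 0.093760, 0.000000),
    (140, 0.132571, 0.000000),
    (141, 0.189368, 0.021277),
    (140, 0.265087, 0.014286),
    (141, 0.369789, 0.014184),
    (140, 0.533753, 0.057143),
    (141, 0.834228, 0.276596),
]

n_total = sum(c for c, _, _ in logit_bins)

# Bin-weighted mean absolute gap between predicted and observed rates
ece = sum(c * abs(p - r) for c, p, r in logit_bins) / n_total

# Overall averages: these should agree if probabilities are calibrated in the large
mean_pred_overall = sum(c * p for c, p, _ in logit_bins) / n_total
event_rate_overall = sum(c * r for c, _, r in logit_bins) / n_total

print(f"ECE = {ece:.3f}")
print(f"mean predicted = {mean_pred_overall:.3f} | observed rate = {event_rate_overall:.3f}")
```

For Logit the weighted gap is roughly 0.21, and the average predicted probability is roughly 0.25 against an observed rate of about 0.038, which quantifies the overconfidence noted above.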

### Step 6 — Save summaries

**Why:** Persist the threshold and calibration tables for the report.

In [7]:
OUT_DIR = DATA_DIR
for name, tab in tables.items():
    tab.to_csv(OUT_DIR / f'poland_h1_thresholds_{name.lower()}.csv', index=False)
for name, tab in cal_tabs.items():
    tab.to_csv(OUT_DIR / f'poland_h1_calibration_{name.lower()}.csv', index=False)
print('Saved threshold & calibration tables to', OUT_DIR)

Saved threshold & calibration tables to /Users/reebal/FH-Wedel/WS25/seminar-bankruptcy-prediction/data/processed


**Interpretation (after running):**  
- Confirm CSVs were written; we’ll cite them in the comparison notebook.
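One way to confirm the files is a small existence check. The sketch below rebuilds the six file names from the naming pattern used in Step 6 and reports any that are missing or empty. `missing_outputs` and `MODEL_NAMES` are hypothetical names introduced here, and the directory is assumed to be `data/processed` under the current working directory, as in Step 1.

```python
from pathlib import Path

MODEL_NAMES = ["Logit", "RF", "GLM"]  # keys of the `scores` dict in Step 1

def missing_outputs(out_dir):
    """Return the expected CSV names that are absent or empty in out_dir."""
    out_dir = Path(out_dir)
    expected = [
        f"poland_h1_{kind}_{name.lower()}.csv"
        for kind in ("thresholds", "calibration")
        for name in MODEL_NAMES
    ]
    return [
        fname for fname in expected
        if not (out_dir / fname).is_file() or (out_dir / fname).stat().st_size == 0
    ]

# An empty list means all six tables were written
print(missing_outputs(Path.cwd() / "data" / "processed"))
```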