# 02: Baseline Models (Poland, Horizon = 1)

**Goal:** Establish clean baselines on the Polish *1-year* horizon:
- **Logistic Regression** (sklearn, class-weighted) for a strong linear baseline
- **Random Forest** (class-weighted) for a non-linear baseline robust to collinearity
- **GLM Binomial (logit)** with robust SEs (statsmodels) for interpretable coefficients

**Metrics to report:**
- PR-AUC (average precision), ROC-AUC, **Brier score** (calibration)
- **Recall at FPR ≤ 1%** and **≤ 5%** (early-warning operating points)
- Reliability (binned calibration) table

> We will use `poland_clean_full.parquet` for RF and `poland_clean_reduced.parquet` for Logit/GLM (to stabilize inference).


### Step 1: Imports

**Why:** Bring in modeling & metrics utilities. If imports fail, install locally:
```
uv pip install scikit-learn statsmodels pyarrow
```


In [1]:
from pathlib import Path
import warnings
import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss, roc_curve, precision_recall_curve

import statsmodels.api as sm

warnings.filterwarnings('ignore')
RANDOM_STATE = 42

print('✅ Imports OK')

✅ Imports OK


*No interpretation needed here.*

### Step 2: Load cleaned datasets and subset to horizon = 1

**Why:** We'll train/evaluate on the 1-year horizon first.  
- `full`: winsorized+imputed with indicators (for RF)  
- `reduced`: correlation-pruned set (for logistic inference)


In [2]:
REPO_ROOT = Path.cwd()
DATA_DIR = REPO_ROOT / "data" / "processed"

df_full = pd.read_parquet(DATA_DIR / "poland_clean_full.parquet")
df_red  = pd.read_parquet(DATA_DIR / "poland_clean_reduced.parquet")

# Filter to horizon = 1
full_h1 = df_full[df_full['horizon'] == 1].copy()
red_h1  = df_red[df_red['horizon'] == 1].copy()

y_full = full_h1['y'].astype(int).values
X_full = full_h1.drop(columns=['y', 'horizon'])
y_red  = red_h1['y'].astype(int).values
X_red  = red_h1.drop(columns=['y', 'horizon'])

print('full_h1:', X_full.shape, '| pos rate =', y_full.mean().round(4))
print('red_h1 :', X_red.shape,  '| pos rate =', y_red.mean().round(4))

full_h1: (7027, 65) | pos rate = 0.0386
red_h1 : (7027, 48) | pos rate = 0.0386


**Interpretation (after running):**  
- Confirm sample size (~7k rows) and ~3.9% positives for both views.  
- `X_red` should have fewer features than `X_full` (by design).

### Step 3: Stratified train/test split (80/20)

**Why:** Hold out a test fold for honest evaluation. We use stratification to preserve the rare-event rate in both folds.

In [3]:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=RANDOM_STATE)
(train_idx, test_idx), = sss.split(X_full, y_full)

Xf_tr, Xf_te = X_full.iloc[train_idx], X_full.iloc[test_idx]
yf_tr, yf_te = y_full[train_idx], y_full[test_idx]

Xr_tr, Xr_te = X_red.iloc[train_idx], X_red.iloc[test_idx]  # match same indices for comparability
yr_tr, yr_te = y_red[train_idx], y_red[test_idx]

print('Train size:', Xf_tr.shape[0], '| Test size:', Xf_te.shape[0])
print('Train pos rate:', yf_tr.mean().round(4), '| Test pos rate:', yf_te.mean().round(4))

Train size: 5621 | Test size: 1406
Train pos rate: 0.0386 | Test pos rate: 0.0384


**Interpretation (after running):**  
- Train/test sizes should be close to 80/20.  
- Positive rates should be similar across folds (±0.2pp).
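
Step 3 reuses `train_idx`/`test_idx` from the full view to slice the reduced view, which is only valid if both frames list the same firms in the same order. A minimal sketch of that sanity check follows, run on two toy frames that stand in for `full_h1` and `red_h1`. The helper name `assert_aligned` and the toy values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def assert_aligned(a: pd.DataFrame, b: pd.DataFrame, target: str = "y") -> None:
    """Raise AssertionError unless both frames share row order and labels."""
    assert len(a) == len(b), "row counts differ"
    assert a.index.equals(b.index), "row order / index differs"
    assert (a[target].to_numpy() == b[target].to_numpy()).all(), "labels differ"

# Toy stand-ins for full_h1 and red_h1 (made-up values, hypothetical columns)
toy_full = pd.DataFrame({"Attr1": [0.1, 0.2, 0.3],
                         "Attr2": [1.0, 2.0, 3.0],
                         "y": [0, 1, 0]})
toy_red = toy_full[["Attr1", "y"]].copy()

assert_aligned(toy_full, toy_red)  # passes silently when the views line up
aligned_ok = True
```

On the real data the call would be `assert_aligned(full_h1, red_h1)`, placed before the split.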

### Step 4: Helper functions for evaluation

**Why:** Consistent metrics + early-warning recall at fixed FPR caps; calibration table; nice one-liner summary.

In [4]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss, roc_curve

def recall_at_fpr(y_true, y_score, fpr_cap=0.01):
    fpr, tpr, thr = roc_curve(y_true, y_score)
    mask = fpr <= fpr_cap
    if not np.any(mask):
        return 0.0, None
    idx = np.argmax(tpr[mask])  # best recall within cap
    return float(tpr[mask][idx]), float(thr[mask][idx])

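def calibration_table(y_true, y_prob, n_bins=10):
    """Quantile-binned reliability table: per-bin count, mean prediction, event rate."""
    # Equal-count bins; duplicate edges (many tied scores) are merged
    bins = pd.qcut(y_prob, q=n_bins, duplicates='drop')
    tab = pd.DataFrame({'bin': bins, 'y': y_true, 'p': y_prob})\
        .groupby('bin', observed=True).agg(count=('y','size'), mean_pred=('p','mean'), event_rate=('y','mean'))\
        .reset_index()
    return tab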

def evaluate_scores(y_true, y_prob):
    roc = roc_auc_score(y_true, y_prob)
    pr  = average_precision_score(y_true, y_prob)
    b   = brier_score_loss(y_true, y_prob)
    rec1, thr1 = recall_at_fpr(y_true, y_prob, 0.01)
    rec5, thr5 = recall_at_fpr(y_true, y_prob, 0.05)
    return {'roc_auc': roc, 'pr_auc': pr, 'brier': b, 'rec1': rec1, 'thr1': thr1, 'rec5': rec5, 'thr5': thr5}

def print_eval(label, res):
    print(f"[{label}] ROC-AUC={res['roc_auc']:.3f} | PR-AUC={res['pr_auc']:.3f} | Brier={res['brier']:.4f} | "
          f"Recall@1%FPR={res['rec1']:.3f} (thr={res['thr1']}) | Recall@5%FPR={res['rec5']:.3f} (thr={res['thr5']})")

*No interpretation needed here.*
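
To make the "recall at a fixed FPR cap" metric concrete, here is a small NumPy-only sketch on a hand-checkable toy example. It is not the notebook's `recall_at_fpr` (which relies on `roc_curve`); the function name `recall_at_fpr_np` is hypothetical, and the sketch assumes all scores are distinct (no ties).

```python
import numpy as np

def recall_at_fpr_np(y_true, y_score, fpr_cap):
    """Best recall among thresholds whose false-positive rate stays <= fpr_cap."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    order = np.argsort(-y_score, kind="stable")   # highest score first
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted == 1)                 # true positives if we alert on the top-k
    fp = np.cumsum(y_sorted == 0)                 # false positives if we alert on the top-k
    tpr = tp / tp[-1]
    fpr = fp / fp[-1]
    ok = fpr <= fpr_cap
    return float(tpr[ok].max()) if ok.any() else 0.0

# 4 positives, 5 negatives, already sorted by score for readability
y_toy = [1, 1, 0, 1, 0, 0, 1, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

r_strict = recall_at_fpr_np(y_toy, s_toy, fpr_cap=0.0)  # no false alarms allowed
r_loose = recall_at_fpr_np(y_toy, s_toy, fpr_cap=0.2)   # one false alarm out of five allowed
print(r_strict, r_loose)  # 0.5 0.75
```

With zero false alarms allowed, only the two top-scored positives are caught (2 of 4). Allowing one false alarm lets the threshold drop past the first negative and pick up a third positive.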

### Step 5: Logistic Regression (sklearn, class-weighted)

**Why:** Strong linear baseline; we scale features and tune `C` on the training fold via 5-fold CV (scoring = PR-AUC).

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

logit_pipe = Pipeline([
    ('scaler', StandardScaler(with_mean=True, with_std=True)),
    ('clf', LogisticRegression(
        penalty='l2', solver='liblinear',
        class_weight='balanced', max_iter=200, random_state=RANDOM_STATE
    ))
])

param_grid = {'clf__C': np.logspace(-2, 2, 9)}  # 0.01 ... 100
gs_logit = GridSearchCV(logit_pipe, param_grid=param_grid, scoring='average_precision',
                        cv=5, n_jobs=-1, refit=True, verbose=0)
gs_logit.fit(Xr_tr, yr_tr)

print('Best C:', gs_logit.best_params_)
proba_logit = gs_logit.predict_proba(Xr_te)[:,1]
res_logit = evaluate_scores(yr_te, proba_logit)
print_eval('Logit (sklearn)', res_logit)

# Calibration table (test fold)
cal_logit = calibration_table(yr_te, proba_logit, n_bins=10)
cal_logit.head(10)

Best C: {'clf__C': np.float64(1.0)}
[Logit (sklearn)] ROC-AUC=0.924 | PR-AUC=0.385 | Brier=0.1064 | Recall@1%FPR=0.278 (thr=0.9817998903019366) | Recall@5%FPR=0.648 (thr=0.7449873031547386)


| | bin | count | mean_pred | event_rate |
|---|---|---|---|---|
| 0 | (-0.000999895, 0.0195] | 141 | 0.008294 | 0.0 |
| 1 | (0.0195, 0.045] | 141 | 0.031962 | 0.0 |
| 2 | (0.045, 0.0767] | 140 | 0.060557 | 0.0 |
| 3 | (0.0767, 0.112] | 141 | 0.09376 | 0.0 |
| 4 | (0.112, 0.156] | 140 | 0.132571 | 0.0 |
| 5 | (0.156, 0.223] | 141 | 0.189368 | 0.021277 |
| 6 | (0.223, 0.311] | 140 | 0.265087 | 0.014286 |
| 7 | (0.311, 0.441] | 141 | 0.369789 | 0.014184 |
| 8 | (0.441, 0.65] | 140 | 0.533753 | 0.057143 |
| 9 | (0.65, 0.999] | 141 | 0.834228 | 0.276596 |


**Interpretation (after running):**  
- Note the chosen `C` and whether performance is balanced (ROC-AUC vs PR-AUC).  
- Focus on **Recall@1% FPR**: if it's very low, we may need stronger regularization or more features.  
- Review the calibration table: mean predicted ≈ event rate in bins → good calibration; big gaps → consider calibration later.
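
One way to turn "big gaps" into a single number is a count-weighted average of the per-bin gap between mean prediction and event rate (often called expected calibration error). The sketch below uses the ten rows of the calibration table printed above; the variable names are illustrative and the summary statistic is an addition, not something the notebook computes.

```python
# Per-bin values copied from the Logit calibration table (test fold)
counts = [141, 141, 140, 141, 140, 141, 140, 141, 140, 141]
mean_pred = [0.008294, 0.031962, 0.060557, 0.09376, 0.132571,
             0.189368, 0.265087, 0.369789, 0.533753, 0.834228]
event_rate = [0.0, 0.0, 0.0, 0.0, 0.0,
              0.021277, 0.014286, 0.014184, 0.057143, 0.276596]

n = sum(counts)

# Count-weighted mean absolute gap between predicted and observed rates
ece = sum(c * abs(p - e) for c, p, e in zip(counts, mean_pred, event_rate)) / n

# Overall averages: what the model predicts vs. what actually happened
avg_pred = sum(c * p for c, p in zip(counts, mean_pred)) / n
avg_event = sum(c * e for c, e in zip(counts, event_rate)) / n

print(f"n={n}  ECE={ece:.3f}  mean predicted={avg_pred:.3f}  observed rate={avg_event:.3f}")
```

If the average predicted probability sits far above the observed event rate, that is consistent with the probability inflation that class weighting tends to cause.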

### Step 6: Random Forest (class-weighted)

**Why:** Non-linear, robust to collinearity/outliers; tune shallow-to-moderate depth to avoid overfitting rare events.

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(
    n_estimators=400, random_state=RANDOM_STATE,
    class_weight='balanced_subsample', n_jobs=-1
)

param_grid = {
    'max_depth': [None, 4, 6, 8, 12],
    'min_samples_leaf': [1, 5, 10]
}

gs_rf = GridSearchCV(rf, param_grid=param_grid,
                     cv=5, scoring='average_precision', n_jobs=-1, refit=True, verbose=0)
gs_rf.fit(Xf_tr, yf_tr)

print('Best params:', gs_rf.best_params_)
proba_rf = gs_rf.predict_proba(Xf_te)[:,1]
res_rf = evaluate_scores(yf_te, proba_rf)
print_eval('RandomForest', res_rf)

# Feature importances (top 15)
imp = pd.Series(gs_rf.best_estimator_.feature_importances_, index=Xf_tr.columns).sort_values(ascending=False).head(15)
imp

Best params: {'max_depth': None, 'min_samples_leaf': 5}
[RandomForest] ROC-AUC=0.968 | PR-AUC=0.738 | Brier=0.0208 | Recall@1%FPR=0.593 (thr=0.311743306538647) | Recall@5%FPR=0.796 (thr=0.18709312710506992)


Attr27__isna    0.099219
Attr27          0.066737
Attr24          0.041010
Attr13          0.034497
Attr34          0.034312
Attr26          0.030567
Attr46          0.027151
Attr9           0.021988
Attr6           0.021905
Attr16          0.021877
Attr11          0.019498
Attr58          0.017776
Attr21          0.017438
Attr5           0.016978
Attr19          0.016508
dtype: float64

**Interpretation (after running):**  
- Compare PR-AUC and Recall@1%FPR to the Logit baseline.  
- Inspect top importances: do they cluster by profitability, leverage, liquidity, etc.? This helps connect to theory.

### Step 7: GLM Binomial (logit) with robust SEs (statsmodels)

**Why:** For interpretable coefficients with **robust (HC1) standard errors**.  
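We use the **reduced** feature set (collinearity-pruned) and class weights via `freq_weights` to approximate balancing. Because statsmodels treats `freq_weights` as replication counts rather than importance weights, read the resulting standard errors and z-statistics with caution.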

In [7]:
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
Xr_tr_scaled = scaler.fit_transform(Xr_tr)
Xr_te_scaled = scaler.transform(Xr_te)

Xsm_tr = sm.add_constant(Xr_tr_scaled, has_constant='add')
Xsm_te = sm.add_constant(Xr_te_scaled, has_constant='add')

# class-balance weights
pos_w = (len(yr_tr) - yr_tr.sum()) / (yr_tr.sum())
weights = np.where(yr_tr==1, pos_w, 1.0)

glm_binom = sm.GLM(yr_tr, Xsm_tr, family=sm.families.Binomial(), freq_weights=weights)
glm_res = glm_binom.fit(cov_type='HC1')

proba_glm = glm_res.predict(Xsm_te)
res_glm = evaluate_scores(yr_te, proba_glm)
print_eval('GLM Binomial (robust)', res_glm)

coefs = pd.DataFrame({
    'feature': ['const'] + list(Xr_tr.columns),
    'coef': glm_res.params,
    'std_err': glm_res.bse,
    'z': glm_res.tvalues,
    'pval': glm_res.pvalues
}).sort_values('z', key=lambda s: s.abs(), ascending=False)

coefs.head(15)

[GLM Binomial (robust)] ROC-AUC=0.924 | PR-AUC=0.395 | Brier=0.1061 | Recall@1%FPR=0.278 (thr=0.9828315289483237) | Recall@5%FPR=0.667 (thr=0.7333376160367324)


| | feature | coef | std_err | z | pval |
|---|---|---|---|---|---|
| 0 | const | -1.703393 | 0.043059 | -39.559786 | 0.0 |
| 48 | Attr27__isna | 1.089711 | 0.039069 | 27.891917 | 3.343837e-171 |
| 47 | Attr21__isna | 0.487463 | 0.031053 | 15.697618 | 1.570277e-55 |
| 25 | Attr41 | -0.478438 | 0.037644 | -12.709502 | 5.236679e-37 |
| 19 | Attr34 | 1.090835 | 0.092647 | 11.774044 | 5.311593e-32 |
| 4 | Attr15 | 0.258086 | 0.022772 | 11.333277 | 8.978184e-30 |
| 23 | Attr39 | -1.573863 | 0.18122 | -8.684798 | 3.794174e-18 |
| 8 | Attr21 | -0.294944 | 0.037155 | -7.938227 | 2.05091e-15 |
| 22 | Attr38 | -0.571141 | 0.07702 | -7.415515 | 1.211529e-13 |
| 24 | Attr40 | 0.685965 | 0.105926 | 6.475868 | 9.426839e-11 |


**Interpretation (after running):**  
- Compare GLM performance to sklearn Logit: they should be close.  
- Check sign/magnitude against bankruptcy theory (higher profitability and liquidity ↓ risk, higher leverage ↑ risk, etc.).
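
For the sign/magnitude check, logit coefficients are often easier to read as odds ratios. The sketch below converts three rows of the coefficient table above into odds ratios with normal-approximation 95% intervals. Because the features were standardized before fitting, "one unit" here means one standard deviation. The helper `odds_ratio_ci` is a hypothetical name, and the intervals inherit whatever limitations the weighted standard errors have.

```python
import math

# (coef, robust std_err) copied from the GLM coefficient table
rows = {
    "Attr27__isna": (1.089711, 0.039069),
    "Attr41": (-0.478438, 0.037644),
    "Attr39": (-1.573863, 0.18122),
}

def odds_ratio_ci(coef, se, z=1.96):
    """Return (odds ratio, lower, upper) for a one-unit increase in the regressor."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)

odds = {name: odds_ratio_ci(c, s) for name, (c, s) in rows.items()}

for name, (or_, lo, hi) in odds.items():
    print(f"{name:>13}: OR={or_:.2f}  95% CI=({lo:.2f}, {hi:.2f})")
```

An odds ratio above 1 means the odds of bankruptcy rise with the feature; below 1 means they fall.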

### Step 8: Save test-set predictions for later analysis

**Why:** We'll use these to plot decision curves / make threshold analyses in the comparison notebook.

In [8]:
OUT_DIR = DATA_DIR
preds = pd.DataFrame({
    'y_true': yf_te,
    'p_logit': proba_logit,
    'p_rf': proba_rf,
    'p_glm': proba_glm
})
out_path = OUT_DIR / 'poland_h1_test_predictions.csv'
preds.to_csv(out_path, index=False)
print('Saved:', out_path)

Saved: /Users/reebal/FH-Wedel/WS25/seminar-bankruptcy-prediction/data/processed/poland_h1_test_predictions.csv


**Interpretation (after running):**  
- Confirm the CSV was saved to `data/processed/poland_h1_test_predictions.csv`.  
- We'll reuse it when comparing models/datasets next.

🔥 Solid first baselines. Here's a straight, critical read of what you got, plus exactly what to do next.

# What the results say

## Logistic Regression (sklearn)

* **ROC-AUC 0.924, PR-AUC 0.385** → good ranking overall, but not great in the rare-event region compared to RF.
* **Recall@1% FPR = 0.278** (threshold ≈ 0.982) → low catch rate under strict false-alarm budgets.
* **Calibration is poor**: In the top bin, **mean_pred ≈ 0.834 vs event_rate ≈ 0.277**. That's massive **overconfidence**, typical when using `class_weight='balanced'` for rare events: it helps ranking but hurts probability calibration.
* **Brier 0.106** is high: always predicting the 3.84% test base rate would score about 0.037, so the inflated probabilities on the many negatives dominate the score.

👉 Conclusion: fine as an interpretable baseline, but **not** the best early-warning detector under tight FPR caps without calibration.
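
A quick way to put the Brier scores in context is to compare them with a constant forecast that always predicts the base rate, whose Brier score is p(1 - p). The sketch below does this for the three Brier values printed in Steps 5 to 7, using the test-fold positive rate from Step 3. The Brier skill score (1 minus the ratio to the reference) is an addition here, and the function names are hypothetical.

```python
def brier_reference(base_rate):
    """Brier score of always predicting the base rate: p * (1 - p)."""
    return base_rate * (1.0 - base_rate)

def brier_skill(brier, base_rate):
    """1 = perfect, 0 = no better than the base-rate forecast, < 0 = worse."""
    return 1.0 - brier / brier_reference(base_rate)

base_rate = 0.0384  # test-fold positive rate printed in Step 3

# Brier scores as printed by the notebook cells
reported = {"Logit (sklearn)": 0.1064, "RandomForest": 0.0208, "GLM Binomial": 0.1061}

skill = {name: brier_skill(b, base_rate) for name, b in reported.items()}

print(f"reference Brier = {brier_reference(base_rate):.4f}")
for name, s in skill.items():
    print(f"{name:>16}: skill = {s:+.2f}")
```

A negative skill score for a class-weighted model does not mean its ranking is poor; it only says the raw probabilities are worse than the base rate as probability forecasts.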

## Random Forest

* **ROC-AUC 0.968, PR-AUC 0.738** → **excellent**; huge uplift vs. Logit in the rare-event regime.
* **Recall@1% FPR = 0.593**, **@5% FPR = 0.796** → much better early-warning capture at the same false-alarm budgets.
* **Top features** include `Attr27__isna` (missingness indicator) and `Attr27` itself, plus several ratios (e.g., `Attr24`, `Attr13`, `Attr34`, `Attr26`).

  * Missingness being #1 says: **lack of certain disclosures is itself predictive**, which is plausible in financial distress. This is valid because "missing vs. present" is known at prediction time. Still, we'll do a robustness check by training **without** indicators to make sure performance doesn't collapse (guarding against accidental leakage).
* **Brier 0.0208** is very low. With rare events, that partly reflects conservative probabilities on the many negatives; we still need a **calibration check** (Brier alone can be misleading).

👉 Conclusion: **RF is your best detector** right now. Keep it as the primary scorer.
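
The robustness check mentioned above (refit without the `__isna` indicators and compare PR-AUC) could look roughly like the sketch below. It runs on synthetic data with a made-up missingness flag, not on the Polish dataset, so the printed numbers say nothing about the real result. The column names, the helper `fit_pr_auc`, and the forest settings are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_full / y_full (about 5% positives)
X_arr, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                               weights=[0.95], random_state=0)
X = pd.DataFrame(X_arr, columns=[f"Attr{i}" for i in range(1, 9)])

# Fake missingness flag that fires more often for positives
rng = np.random.default_rng(0)
X["Attr1__isna"] = (rng.random(len(y)) < np.where(y == 1, 0.6, 0.1)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

def fit_pr_auc(cols):
    """Fit a class-weighted forest on the given columns; return test PR-AUC."""
    rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=5,
                                class_weight="balanced_subsample", random_state=0)
    rf.fit(X_tr[cols], y_tr)
    return average_precision_score(y_te, rf.predict_proba(X_te[cols])[:, 1])

all_cols = list(X.columns)
no_flags = [c for c in all_cols if not c.endswith("__isna")]

pr_with = fit_pr_auc(all_cols)
pr_without = fit_pr_auc(no_flags)
print(f"PR-AUC with indicators: {pr_with:.3f} | without: {pr_without:.3f}")
```

On the real data the same comparison would use `Xf_tr`/`Xf_te` and the tuned hyperparameters from Step 6.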

## GLM Binomial (robust SEs)

* **ROC-AUC 0.924, PR-AUC 0.395** → similar to sklearn Logit (as expected).
* **Recall@1% FPR = 0.278**; again, weak in the strict low-FPR zone without calibration.
* **Inference looks sensible** (signs & large |z| for some features), but we'll only narrate economics once we map `Attr*` → ratio names (see "Immediate fixes / upgrades" below).

  * E.g., positive coefficients (↑ risk): `Attr34`, `Attr15`, `Attr40`, `Attr35`
  * Negative (↓ risk): `Attr41`, `Attr39`, `Attr38`, `Attr25`, `Attr50`

👉 Conclusion: Use GLM for **interpretation** (with robust SEs) and RF for **detection**.

---

# What this means in practical alert terms (approx.)

Using the test fold from Step 3 (1,406 observations, 3.84% bankrupt, i.e. **54 positives** and **1,352 negatives**):

* **RF @ 1% FPR** → recall 0.593

  * **TP ≈ 32**, **FP ≈ 13**, **FN ≈ 22**
  * That's ~**45** alerts total (**~32** are real), i.e., **~32 true bankruptcies caught** with only ~13 false alarms.
* **Logit/GLM @ 1% FPR** → recall ~0.278

  * **TP ≈ 15**, **FP ≈ 13**: materially fewer true catches at the same false-alarm budget.

These back-of-envelope numbers are helpful for stakeholders.
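
The arithmetic behind these counts can be reproduced with a few lines. The sketch below starts from the test-fold size and positive rate printed in Step 3 (1,406 rows, 3.84% positives) and the recall values printed in Steps 5 and 6. The helper `alert_counts` is a hypothetical name, and the false-positive count is treated as an upper bound implied by the FPR cap.

```python
import math

def alert_counts(n_pos, n_neg, recall, fpr_cap):
    """Approximate confusion counts at an operating point defined by an FPR cap."""
    tp = round(recall * n_pos)
    fp = math.floor(fpr_cap * n_neg)  # the cap is an upper bound, so round down
    alerts = tp + fp
    return {
        "TP": tp,
        "FP": fp,
        "FN": n_pos - tp,
        "alerts": alerts,
        "precision": tp / alerts if alerts else float("nan"),
        "alerts_per_1000": 1000 * alerts / (n_pos + n_neg),
    }

n_test = 1406                      # test size printed in Step 3
n_pos = round(0.0384 * n_test)     # test positive rate printed in Step 3
n_neg = n_test - n_pos

rf_1pct = alert_counts(n_pos, n_neg, recall=0.593, fpr_cap=0.01)
logit_1pct = alert_counts(n_pos, n_neg, recall=0.278, fpr_cap=0.01)

print("RF    @1% FPR:", rf_1pct)
print("Logit @1% FPR:", logit_1pct)
```

The `alerts_per_1000` field is the same quantity rescaled to a portfolio of 1,000 firms, which tends to be easier to communicate than raw test-fold counts.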

---

# Immediate fixes / upgrades

1. **Calibrate probabilities** (especially for Logit/GLM; also check RF):

   * Use `CalibratedClassifierCV` with **isotonic** (or Platt/sigmoid) on a validation split inside CV.
   * Re-check Brier + reliability curves. Expect Logit calibration to improve a lot.

2. **Choose operational thresholds** at **1% and 5% FPR**: quantify expected TP/FP and **alerts per 1,000 firms**, which are easy to communicate.

3. **Robustness checks**:

   * Refit RF **without missingness indicators** (`__isna` cols). If performance barely moves, great; if it tanks, we'll document that "missingness is key" but still valid.
   * Try **Logit (L1)** for sparser coefficients (cleaner interpretation). Keep Logit/GLM in the thesis even if RF wins on detection.

4. **Map `Attr*` → real ratio names** from the dataset docs/metadata file so your GLM table speaks "finance" (profitability, leverage, liquidity, activity).
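
As a rough illustration of fix 1, the sketch below wraps a class-weighted logistic pipeline in `CalibratedClassifierCV` with isotonic regression and compares Brier scores before and after. It uses synthetic imbalanced data rather than the Polish set, so the numbers are only indicative. The dataset parameters and variable names are assumptions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic rare-event data (about 4% positives), a stand-in for the reduced view
X, y = make_classification(n_samples=6000, n_features=20, n_informative=6,
                           weights=[0.96], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

base = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Uncalibrated: class weights push predicted probabilities upward
base.fit(X_tr, y_tr)
brier_raw = brier_score_loss(y_te, base.predict_proba(X_te)[:, 1])

# Calibrated: the isotonic map is learned on held-out folds inside 5-fold CV
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)
brier_cal = brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1])

print(f"Brier raw = {brier_raw:.4f} | Brier calibrated = {brier_cal:.4f}")
```

Isotonic calibration is monotone, so it should leave the ranking (and hence ROC-AUC) largely unchanged apart from ties; switching to `method="sigmoid"` gives the Platt variant.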

