# Distress Pipeline — Technical Bug Review (Leakage‑Safe)

This notebook contains:

1. **Full pipeline code** organized along the *data science lifecycle*, in the order the cells run (ingest → clean → split masks → impute → engineer → label → split → transform → validate → EDA).
2. A **bug log** of material technical issues found in the provided script, with fixes applied in the “Corrected Pipeline” cells.
3. **Schema compliance checks** against `Variables.xlsx` (sheet `Corrected`) so the pipeline only relies on approved variables.
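
The schema check in item 3 could look roughly like the sketch below. It assumes the approved names sit in a `Variable` column, as the configuration comment for `Variables.xlsx` suggests. The helper `check_schema` and the toy lists are hypothetical; in the notebook the approved list would come from something like `pd.read_excel(VARS_XLSX, sheet_name="Corrected")["Variable"]`.

```python
import pandas as pd

def check_schema(used_cols, approved_cols):
    """Return the used columns that are not on the approved list (hypothetical helper)."""
    # Normalize names: drop missing entries, strip whitespace, compare case-insensitively.
    approved = {str(v).strip().lower() for v in approved_cols if pd.notna(v)}
    return sorted(c for c in used_cols if c.lower() not in approved)

# Toy stand-in for the `Variable` column of the approved-variables sheet.
approved_toy = pd.Series(["gvkey", "fyear", "at", "dltt", "dlc", " seq ", None])
# Toy stand-in for the raw columns the pipeline reads.
used_toy = ["gvkey", "fyear", "at", "dltt", "dlc", "seq", "mibt"]

unapproved = check_schema(used_toy, approved_toy)
print("Columns used but not approved:", unapproved)
```

Raising an error when `unapproved` is non-empty would turn this report into a hard gate; whether that is desirable depends on how strictly the variable list is meant to be enforced.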

> Note: The objective here is *technical correctness* (especially leakage, ratio construction, and target definition). Minor style nits are intentionally omitted.


## Bug log (material issues)

### 1) Ratio construction: “+EPS” division
- **Issue:** Adding an epsilon to denominators changes the economics (especially for ratios used in *labels*), and can silently turn zero‑denominator rows into extreme values.
- **Fix:** All ratios now use `np.divide(..., where=denom!=0)` and produce **NaN** when the ratio is undefined. Downstream, missingness is handled explicitly.
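
To make the difference concrete, the sketch below compares the two approaches on three toy rows. `EPS`, `safe_div`, and the sample values are illustrative assumptions, not taken from the original script; `safe_div` mirrors the idea of the notebook's `sdiv` helper.

```python
import numpy as np

EPS = 1e-9  # hypothetical epsilon, for illustration only

def safe_div(num, den):
    """Divide where the denominator is finite and non-zero; NaN elsewhere."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    out = np.full_like(num, np.nan, dtype=float)
    ok = (den != 0) & np.isfinite(den) & np.isfinite(num)
    np.divide(num, den, out=out, where=ok)
    return out

ffo = np.array([10.0, 5.0, 0.0])
debt = np.array([100.0, 0.0, 0.0])

eps_ratio = ffo / (debt + EPS)   # zero-debt rows become 5e9 and 0.0
nan_ratio = safe_div(ffo, debt)  # zero-debt rows become NaN

print(eps_ratio)
print(nan_ratio)
```

With the epsilon version, the third row has FFO/debt equal to 0.0, which would satisfy a "< 15%" trigger even though the firm carries no debt. The NaN version leaves that decision to explicit missingness handling.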

### 2) Debt/EBITDA and other “sign” pathologies
- **Issue:** `Debt/EBITDA` turns negative when EBITDA is negative, so a loss-making firm falls below the “> 4.5x” cutoff and mechanically looks *less* leveraged.
- **Fix:** For leverage classification, `Debt/EBITDA` is only computed when `EBITDA > 0` (otherwise NaN). A comparison against NaN evaluates to False, so these rows never meet the Debt/EBITDA trigger. Negative equity is handled separately.
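
A small illustration of the sign problem, using made-up numbers (the frame `toy` and its values are assumptions; only the column names `total_debt` and `oibdp` follow the pipeline):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_debt": [500.0, 500.0, 500.0],
    "oibdp": [100.0, -50.0, 0.0],  # EBITDA proxy: positive, negative, zero
})

# Naive ratio: a negative EBITDA yields -10x, zero EBITDA yields inf.
naive = toy["total_debt"] / toy["oibdp"]

# Guarded ratio: defined only when EBITDA > 0, NaN otherwise.
guarded = naive.where(toy["oibdp"] > 0)

naive_flag = naive > 4.5      # the loss-making firm looks "not leveraged"
guarded_flag = guarded > 4.5  # NaN rows compare as False

print(pd.DataFrame({"naive": naive, "guarded": guarded,
                    "naive_flag": naive_flag, "guarded_flag": guarded_flag}))
```

In both versions the loss-making firm fails the "> 4.5x" test, but the guarded ratio records it as undefined rather than as a misleadingly low leverage number, so it can be treated as missing downstream.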

### 3) Negative / nonsensical raw values
- **Issue:** Imputation enforced non‑negativity only for **imputed** values; existing negative values (e.g., negative debt components) could persist and break ratios.
- **Fix:** After numeric conversion, the hard-constraint columns (`dlc`, `dltt`, `at`, `mkvalt`, `capx`; the list can be edited) are clipped at 0 before imputation and feature engineering.
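
As a minimal sketch of that clipping step on toy data (the list `HARD_NONNEG` and the values are illustrative assumptions):

```python
import pandas as pd

HARD_NONNEG = ["dlc", "dltt"]  # illustrative subset of the hard-constraint columns

toy = pd.DataFrame({
    "dlc": [-3.0, 2.0, None],   # an observed negative value and a missing value
    "dltt": [10.0, -1.0, 4.0],
    "seq": [-5.0, 1.0, 2.0],    # equity may legitimately be negative
})

clipped = toy.copy()
# Clip observed values at 0; NaN stays NaN and is left for imputation.
clipped[HARD_NONNEG] = clipped[HARD_NONNEG].clip(lower=0)

print(clipped)
```

Columns such as `seq` are deliberately left out, since negative equity is itself part of the distress definition.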

### 4) Distress definition alignment
- **Issue:** The requested “S&P style leverage bands” table was not reflected in the label definition.
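- **Fix:** `distress_dummy` is defined as: **equity < 0 OR “highly leveraged”** based on the table cutoffs (FFO/debt < 15%, debt/cap > 55%, debt/EBITDA > 4.5x). How many of the three cutoffs must be met is set by `LEV_RULE` (`"any"`, `"2of3"`, or `"all"`). The configuration cell in this notebook sets `LEV_RULE = "all"`, the strictest option, which flags the fewest firms.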
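
The effect of the rule choice can be seen on a few toy firms. This is a sketch under assumptions: the helper `distress_flag`, the mapping `MIN_HITS`, and the unprefixed column names are hypothetical, while the thresholds are the ones listed above.

```python
import numpy as np
import pandas as pd

TH_FFO_TO_DEBT, TH_DEBT_TO_CAP, TH_DEBT_TO_EBITDA = 0.15, 0.55, 4.5
MIN_HITS = {"any": 1, "2of3": 2, "all": 3}  # triggers required per rule

def distress_flag(d, rule):
    """Equity < 0 OR at least MIN_HITS[rule] leverage triggers (hypothetical helper)."""
    if rule not in MIN_HITS:
        raise ValueError(f"Unknown rule: {rule!r}")
    # NaN ratios compare as False, so an undefined ratio never counts as a hit.
    hits = ((d["ffo_to_debt"] < TH_FFO_TO_DEBT).astype(int)
            + (d["debt_to_cap"] > TH_DEBT_TO_CAP).astype(int)
            + (d["debt_to_ebitda"] > TH_DEBT_TO_EBITDA).astype(int))
    return ((d["seq"] < 0) | (hits >= MIN_HITS[rule])).astype("int8")

toy = pd.DataFrame({
    "seq": [50.0, -10.0, 80.0, 30.0],
    "ffo_to_debt": [0.30, 0.40, 0.10, 0.05],
    "debt_to_cap": [0.40, 0.30, 0.60, 0.70],
    "debt_to_ebitda": [2.0, 1.0, 3.0, np.nan],  # last firm: EBITDA <= 0
})

flags = {rule: distress_flag(toy, rule).tolist() for rule in MIN_HITS}
print(flags)
```

The last firm meets two triggers but has an undefined Debt/EBITDA, so it is flagged under `"any"` and `"2of3"` but not under `"all"`. Under the strictest rule, firms with non-positive EBITDA can only be flagged through negative equity.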

### 5) Leakage discipline
- **Check:** Imputation parameters (year medians + TWFE means) are fit **only** on TRAIN (excluding the hold‑out validation label year) and then applied to val/test. Winsorization bounds and StandardScaler are also fit only on TRAIN.
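
The same discipline in miniature: the sketch below fits winsorization bounds and standardization parameters on a synthetic TRAIN sample and only applies them to TEST. The data is randomly generated for illustration, and the manual z-score stands in for `StandardScaler`, which also uses the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train = pd.DataFrame({"x": rng.normal(size=500)})
# TEST contains one extreme value that must not influence the fitted parameters.
test = pd.DataFrame({"x": np.append(rng.normal(size=99), 50.0)})

# Fit winsorization bounds on TRAIN only.
lo = train["x"].quantile(0.01)
hi = train["x"].quantile(0.99)
train_w = train["x"].clip(lo, hi)
test_w = test["x"].clip(lo, hi)  # apply the TRAIN bounds, do not refit

# Fit standardization on winsorized TRAIN only.
mu = train_w.mean()
sigma = train_w.std(ddof=0)
train_z = (train_w - mu) / sigma
test_z = (test_w - mu) / sigma

print(f"bounds=({lo:.3f}, {hi:.3f}) | max test after clip={test_w.max():.3f}")
```

Fitting the bounds on the pooled data instead would let the extreme TEST value shift the 99th percentile, which is exactly the kind of leakage this check is meant to rule out.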


In [5]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

FILE_NAME = "data.csv"
VARS_XLSX = "Variables.xlsx"  # sheet: 'Corrected' with column 'Variable'

TRAIN_CUTOFF_LABEL_YEAR = 2022
VAL_YEARS = 1
N_SPLITS_TIME_CV = 5

WINSOR_LOWER_Q = 0.01
WINSOR_UPPER_Q = 0.99

# Distress rule: how many of the three "highly leveraged" triggers must be met
# Options: "any", "2of3", "all"
LEV_RULE = "all"

# "Highly leveraged" thresholds from the provided table
TH_FFO_TO_DEBT = 0.15       # < 15%
TH_DEBT_TO_CAP = 0.55       # > 55%
TH_DEBT_TO_EBITDA = 4.5     # > 4.5x

NUMERIC_COLS = [
    "gvkey", "fyear", "ismod",
    "ib", "at", "dltt", "dlc", "seq", "mibt",
    "che", "act", "lct", "re",
    "oancf", "ivncf", "fincf",
    "oibdp", "xint", "capx",
    "dv", "prstkc",
    "txt", "txdc", "txach", "txp",
    "mkvalt", "csho", "prcc_f", "prcc_c",
    "dltis", "dltr", "sstk"
]
REQUIRED_KEYS = ["gvkey", "fyear"]

def need(df, cols):
    miss = [c for c in cols if c not in df.columns]
    if miss:
        raise KeyError(f"Missing required column(s): {miss}")

def sdiv(a, b):
    # division only when denominator is finite and non-zero; undefined ratios -> NaN
    a = pd.to_numeric(a, errors="coerce").to_numpy(dtype=float)
    b = pd.to_numeric(b, errors="coerce").to_numpy(dtype=float)
    out = np.full_like(a, np.nan, dtype=float)
    np.divide(a, b, out=out, where=(b != 0) & np.isfinite(b) & np.isfinite(a))
    return out

def roll_year_folds(df_in, year_col="label_year", n_splits=5, min_train_years=3):
    yrs = np.sort(df_in[year_col].dropna().unique())
    if len(yrs) <= min_train_years:
        return []
    n_splits = min(n_splits, len(yrs) - min_train_years)
    out = []
    for k in range(n_splits):
        tr_yrs = yrs[:min_train_years + k]
        va_yr = yrs[min_train_years + k]
        tr_idx = df_in.index[df_in[year_col].isin(tr_yrs)].to_numpy()
        va_idx = df_in.index[df_in[year_col] == va_yr].to_numpy()
        out.append((tr_idx, va_idx, tr_yrs, va_yr))
    return out



In [3]:
# =============================================================================
# 1) Data ingestion & basic cleaning
# =============================================================================
df = pd.read_csv(FILE_NAME, low_memory=False)
need(df, REQUIRED_KEYS)

if "datadate" in df.columns:
    df["datadate"] = pd.to_datetime(df["datadate"], errors="coerce")

for c in NUMERIC_COLS:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce")

for c in ["gvkey", "fyear", "ismod"]:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce").astype("Int64")

df["firm_id"] = df["gvkey"]
df = (
    df.sort_values(["firm_id", "fyear"])
      .drop_duplicates(subset=["firm_id", "fyear"], keep="last")
      .reset_index(drop=True)
)
df["label_year"] = df["fyear"] + 1

# Hard constraints: clip negative values where they are not economically meaningful
for c in [x for x in ["dlc", "dltt", "at", "mkvalt", "capx"] if x in df.columns]:
    df[c] = pd.to_numeric(df[c], errors="coerce").clip(lower=0)

# =============================================================================
# 2) Train/Val/Test masks for leakage-safe fitting (based on label_year = fyear+1)
# =============================================================================
pool_mask = df["label_year"] <= TRAIN_CUTOFF_LABEL_YEAR
pool_years = np.sort(df.loc[pool_mask, "label_year"].dropna().unique())
val_years = pool_years[-VAL_YEARS:] if len(pool_years) else np.array([], dtype=int)

train_mask = pool_mask & (~df["label_year"].isin(val_years))  # for fitting
val_mask = pool_mask & (df["label_year"].isin(val_years))
test_mask = df["label_year"] > TRAIN_CUTOFF_LABEL_YEAR

# =============================================================================
# 3) Missingness flags + imputation (fit on TRAIN only)
# =============================================================================
raw = [c for c in [
    "at", "mkvalt", "seq", "mibt", "dlc", "dltt", "oibdp", "xint",
    "oancf", "capx", "txt", "txdc", "txach", "dv", "prstkc", "ismod"
] if c in df.columns]

for c in raw:
    df[f"miss_{c}"] = df[c].isna().astype("int8")

twfe_cols = [c for c in ["at", "mkvalt", "seq", "dlc", "dltt", "oibdp", "xint", "oancf", "capx", "txt"]
             if c in df.columns]
med_cols = [c for c in ["dv", "prstkc", "txdc", "txach", "mibt", "ismod"]
            if c in df.columns]
nonneg = set([c for c in ["at", "mkvalt", "dlc", "dltt", "capx"] if c in df.columns])

tr_obs = df.loc[train_mask].copy()

# --- Year median models (TRAIN-fit)
year_meds = {}
for c in med_cols:
    s = pd.to_numeric(tr_obs[c], errors="coerce")
    overall = float(s.median()) if np.isfinite(s.median()) else 0.0
    by_year = tr_obs.groupby("fyear")[c].median()
    year_meds[c] = (overall, by_year)

for c in med_cols:
    m = df[c].isna()
    if m.any():
        overall, by_year = year_meds[c]
        fill = df.loc[m, "fyear"].map(by_year).astype(float).fillna(overall)
        df.loc[m, c] = fill.to_numpy()

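# --- TWFE-style mean models (TRAIN-fit): prediction = firm mean + year mean - overall mean
#     (an additive approximation to two-way fixed effects, not a fitted regression)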
twfe = {}
for c in twfe_cols:
    obs = tr_obs[["firm_id", "fyear", c]].copy()
    obs[c] = pd.to_numeric(obs[c], errors="coerce")
    obs = obs.dropna(subset=[c])
    if obs.empty:
        twfe[c] = (0.0, pd.Series(dtype="float64"), pd.Series(dtype="float64"))
        continue
    overall = float(obs[c].mean())
    fmean = obs.groupby("firm_id")[c].mean()
    ymean = obs.groupby("fyear")[c].mean()
    twfe[c] = (overall, fmean, ymean)

for c in twfe_cols:
    m = df[c].isna()
    if m.any():
        overall, fmean, ymean = twfe[c]
        fpart = df.loc[m, "firm_id"].map(fmean)
        ypart = df.loc[m, "fyear"].map(ymean)
        pred = fpart + ypart - overall
        pred = pred.where(pred.notna(), ypart)
        pred = pred.where(pred.notna(), fpart)
        pred = pred.fillna(overall)
        if c in nonneg:
            pred = pred.clip(lower=0.0)
        df.loc[m, c] = pred.to_numpy()

# =============================================================================
# 4) Feature engineering (S&P-style ratios) + distress label (t) and target (t+1)
# =============================================================================
dlc = pd.to_numeric(df.get("dlc", np.nan), errors="coerce")
dltt = pd.to_numeric(df.get("dltt", np.nan), errors="coerce")
df["total_debt"] = pd.concat([dlc, dltt], axis=1).sum(axis=1, min_count=1)

seq = pd.to_numeric(df.get("seq", np.nan), errors="coerce")
mibt = pd.to_numeric(df.get("mibt", 0.0), errors="coerce")
df["equity_plus_mi_sp"] = seq + mibt

df["total_capital_sp"] = df["total_debt"] + df["equity_plus_mi_sp"]
# ratio only meaningful when total capital > 0 (otherwise S&P-style percent bands are undefined)
cap_pos = (pd.to_numeric(df["total_capital_sp"], errors="coerce") > 0).to_numpy()
df["sp_debt_to_capital"] = np.full(len(df), np.nan, dtype=float)
df.loc[cap_pos, "sp_debt_to_capital"] = sdiv(df.loc[cap_pos, "total_debt"], df.loc[cap_pos, "total_capital_sp"])

oibdp = pd.to_numeric(df.get("oibdp", np.nan), errors="coerce")  # EBITDA proxy
xint = pd.to_numeric(df.get("xint", np.nan), errors="coerce")

# Debt/EBITDA only meaningful if EBITDA > 0
ebitda_pos = (oibdp > 0).to_numpy()
df["sp_debt_to_ebitda"] = np.full(len(df), np.nan, dtype=float)
df.loc[ebitda_pos, "sp_debt_to_ebitda"] = sdiv(df.loc[ebitda_pos, "total_debt"], df.loc[ebitda_pos, "oibdp"])

txt = pd.to_numeric(df.get("txt", np.nan), errors="coerce")
txdc = pd.to_numeric(df.get("txdc", 0.0), errors="coerce")
txach = pd.to_numeric(df.get("txach", 0.0), errors="coerce")
df["cash_tax_paid_proxy"] = txt - txdc - txach

df["ffo_proxy"] = oibdp - xint - pd.to_numeric(df["cash_tax_paid_proxy"], errors="coerce")

debt_pos = (pd.to_numeric(df["total_debt"], errors="coerce") > 0).to_numpy()
df["sp_ffo_to_debt"] = np.full(len(df), np.nan, dtype=float)
df.loc[debt_pos, "sp_ffo_to_debt"] = sdiv(df.loc[debt_pos, "ffo_proxy"], df.loc[debt_pos, "total_debt"])

oancf = pd.to_numeric(df.get("oancf", np.nan), errors="coerce")
capx = pd.to_numeric(df.get("capx", np.nan), errors="coerce")
df["sp_cfo_to_debt"] = np.full(len(df), np.nan, dtype=float)
df.loc[debt_pos, "sp_cfo_to_debt"] = sdiv(df.loc[debt_pos, "oancf"], df.loc[debt_pos, "total_debt"])

df["focf"] = oancf - capx
df["sp_focf_to_debt"] = np.full(len(df), np.nan, dtype=float)
df.loc[debt_pos, "sp_focf_to_debt"] = sdiv(df.loc[debt_pos, "focf"], df.loc[debt_pos, "total_debt"])

dv = pd.to_numeric(df.get("dv", 0.0), errors="coerce")
prstkc = pd.to_numeric(df.get("prstkc", 0.0), errors="coerce")
df["dcf"] = df["focf"] - dv - prstkc
df["sp_dcf_to_debt"] = np.full(len(df), np.nan, dtype=float)
df.loc[debt_pos, "sp_dcf_to_debt"] = sdiv(df.loc[debt_pos, "dcf"], df.loc[debt_pos, "total_debt"])

for c in ["at", "mkvalt"]:
    if c in df.columns:
        s = pd.to_numeric(df[c], errors="coerce")
        df[f"log_{c}"] = np.where(s >= 0, np.log1p(s), np.nan)

# Interest coverage as feature (not used for table-based distress by default)
df["sp_interest_coverage"] = sdiv(oibdp, xint.abs())

# --- Distress at time t: equity < 0 OR "highly leveraged" per table thresholds
c1 = (pd.to_numeric(df["sp_ffo_to_debt"], errors="coerce") < TH_FFO_TO_DEBT)
c2 = (pd.to_numeric(df["sp_debt_to_capital"], errors="coerce") > TH_DEBT_TO_CAP)
c3 = (pd.to_numeric(df["sp_debt_to_ebitda"], errors="coerce") > TH_DEBT_TO_EBITDA)

hits = c1.astype(int) + c2.astype(int) + c3.astype(int)
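if LEV_RULE == "any":
    highly_lev = hits >= 1
elif LEV_RULE == "2of3":
    highly_lev = hits >= 2
elif LEV_RULE == "all":
    highly_lev = hits == 3
else:
    # fail loudly instead of silently treating a typo as "2of3"
    raise ValueError(f"Unknown LEV_RULE: {LEV_RULE!r}")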

neg_equity = (pd.to_numeric(seq, errors="coerce") < 0).fillna(False).to_numpy()
df["distress_dummy"] = (neg_equity | highly_lev.to_numpy()).astype("int8")

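# --- Target: distress in the firm's next row, shifted within firm so values never cross firms.
#     This equals year t+1 only when the firm's fiscal years are consecutive (no reporting gaps).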
df["target_next_year_distress"] = df.groupby("firm_id")["distress_dummy"].shift(-1)
df = df.dropna(subset=["target_next_year_distress"]).reset_index(drop=True)

# =============================================================================
# 5) Final train/val/test split after target availability
# =============================================================================
train_pool = df[df["label_year"] <= TRAIN_CUTOFF_LABEL_YEAR].copy()
test = df[df["label_year"] > TRAIN_CUTOFF_LABEL_YEAR].copy()

years = np.sort(train_pool["label_year"].dropna().unique())
val_years = years[-VAL_YEARS:] if len(years) else np.array([], dtype=int)

val = train_pool[train_pool["label_year"].isin(val_years)].copy()
train = train_pool[~train_pool["label_year"].isin(val_years)].copy()

print("Split:", f"train={len(train):,}", f"val={len(val):,}", f"test={len(test):,}", "| val_years:", list(val_years))

# =============================================================================
# 6) Winsorization (fit on TRAIN only) + StandardScaler (fit on TRAIN only)
# =============================================================================
base_feats = [
    "sp_debt_to_capital", "sp_ffo_to_debt", "sp_cfo_to_debt",
    "sp_focf_to_debt", "sp_dcf_to_debt", "sp_debt_to_ebitda",
    "sp_interest_coverage", "log_at", "log_mkvalt"
]
feats = [c for c in base_feats if c in train.columns and c in val.columns and c in test.columns]

for d in (train, val, test):
    d[feats] = d[feats].replace([np.inf, -np.inf], np.nan)

# Feature-level fill to support winsor/scaler (TRAIN median only)
fill = train[feats].median(numeric_only=True)
for d in (train, val, test):
    d[feats] = d[feats].fillna(fill)

bounds = {}
for c in feats:
    s = pd.to_numeric(train[c], errors="coerce")
    bounds[c] = (s.quantile(WINSOR_LOWER_Q), s.quantile(WINSOR_UPPER_Q))

for d in (train, val, test):
    for c, (lo, hi) in bounds.items():
        s = pd.to_numeric(d[c], errors="coerce")
        d[c] = s.clip(lo, hi)

x_train = train[feats].to_numpy(dtype=float)
x_val = val[feats].to_numpy(dtype=float)
x_test = test[feats].to_numpy(dtype=float)

scaler = StandardScaler().fit(x_train)
x_train_z = scaler.transform(x_train)
x_val_z = scaler.transform(x_val)
x_test_z = scaler.transform(x_test)

z_cols = [f"z_{c}" for c in feats]
train[z_cols] = x_train_z
val[z_cols] = x_val_z
test[z_cols] = x_test_z

# =============================================================================
# 7) Diagnostics (train only): correlations
# =============================================================================
t = "target_next_year_distress"
corr = train[[t] + feats].corr(numeric_only=True)[t].drop(t).sort_values(key=np.abs, ascending=False)
print(corr)

# =============================================================================
# 8) Rolling / forward CV by year (within train_pool)
# =============================================================================
folds = roll_year_folds(train_pool, n_splits=N_SPLITS_TIME_CV, min_train_years=3)
for i, (tr_idx, va_idx, tr_yrs, va_yr) in enumerate(folds, 1):
    print(f"Fold {i}: train_years={tr_yrs[0]}..{tr_yrs[-1]} (n={len(tr_idx)}), val_year={va_yr} (n={len(va_idx)})")


Split: train=44,783 val=6,415 test=12,404 | val_years: [np.int64(2022)]
sp_debt_to_capital      0.377494
log_at                 -0.167140
log_mkvalt             -0.086856
sp_cfo_to_debt         -0.059771
sp_interest_coverage   -0.045743
sp_debt_to_ebitda       0.033574
sp_focf_to_debt        -0.026974
sp_ffo_to_debt         -0.026302
sp_dcf_to_debt         -0.008430
Name: target_next_year_distress, dtype: float64
Fold 1: train_years=2015..2017 (n=19775), val_year=2018 (n=6337)
Fold 2: train_years=2015..2018 (n=26112), val_year=2019 (n=6173)
Fold 3: train_years=2015..2019 (n=32285), val_year=2020 (n=6233)
Fold 4: train_years=2015..2020 (n=38518), val_year=2021 (n=6265)
Fold 5: train_years=2015..2021 (n=44783), val_year=2022 (n=6415)
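
One caveat on the target built in step 4: `groupby("firm_id")["distress_dummy"].shift(-1)` picks up the firm's next available row, which is `fyear + 1` only if no fiscal year is skipped. Whether this dataset contains such gaps is not checked in the cells above. The sketch below shows one possible guard on toy data; the frame `toy` and the names `next_dummy` and `next_fyear` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "firm_id": [1, 1, 1, 2, 2],
    "fyear": [2018, 2019, 2021, 2019, 2020],  # firm 1 skips 2020
    "distress_dummy": [0, 0, 1, 1, 0],
}).sort_values(["firm_id", "fyear"])

g = toy.groupby("firm_id")
next_dummy = g["distress_dummy"].shift(-1)
next_fyear = g["fyear"].shift(-1)

# Naive target: whatever the next row is, even across a gap.
toy["target_naive"] = next_dummy
# Gap-safe target: keep the value only when the next row is exactly fyear + 1.
toy["target_gap_safe"] = next_dummy.where(next_fyear == toy["fyear"] + 1)

print(toy)
```

For firm 1 in 2019, the naive target takes the 2021 value, while the gap-safe version leaves it missing so the row is dropped along with other rows that have no target.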


In [6]:
# =============================================================================
# 9) EDA (final step, run on the finished splits)
# =============================================================================
def _overview(d, name):
    n = len(d)
    nf = d["firm_id"].nunique() if "firm_id" in d.columns else np.nan
    ny = d["fyear"].nunique() if "fyear" in d.columns else np.nan
    tr = float(d[t].mean()) if t in d.columns else np.nan
    print(f"\n=== {name} === rows={n:,} | firms={nf:,} | years={ny} | target_rate={tr:.4f}")
    if "label_year" in d.columns:
        byy = d.groupby("label_year")[t].agg(["mean", "count"])
        print("\nTarget by label_year (tail):")
        print(byy.tail(12))

_overview(train, "TRAIN")
_overview(val, "VAL")
_overview(test, "TEST")

# Use one shared column list so every column of the summary has the same length
miss_cols = [c for c in raw if c in train.columns and c in val.columns and c in test.columns]
post_miss = pd.DataFrame({
    "col": miss_cols,
    "train_pct_na": [train[c].isna().mean() * 100 for c in miss_cols],
    "val_pct_na": [val[c].isna().mean() * 100 for c in miss_cols],
    "test_pct_na": [test[c].isna().mean() * 100 for c in miss_cols],
})
if not post_miss.empty:
    post_miss = post_miss.sort_values("train_pct_na", ascending=False)
    print("\nPost-imputation missingness on raw inputs (pct):")
    print(post_miss.head(50).round(4))

def _dist(d, cols, name):
    x = d[cols].replace([np.inf, -np.inf], np.nan)
    q = x.quantile([0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]).T
    out = pd.DataFrame({
        "n": x.notna().sum(),
        "mean": x.mean(),
        "std": x.std(ddof=0),
        "min": x.min(),
        "p01": q[0.01],
        "p05": q[0.05],
        "p25": q[0.25],
        "p50": q[0.50],
        "p75": q[0.75],
        "p95": q[0.95],
        "p99": q[0.99],
        "max": x.max(),
        "skew": x.skew(numeric_only=True),
        "kurt": x.kurtosis(numeric_only=True),
    })
    print(f"\nDistribution summary ({name})")
    print(out.round(4).sort_values("skew", key=lambda s: s.abs(), ascending=False))
    return out

_ = _dist(train, feats, "TRAIN | winsorized raw feats")
_ = _dist(train, z_cols, "TRAIN | standardized feats")

def _hi_corr(d, cols, thr=0.80):
    cm = d[cols].corr(numeric_only=True)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = cm.iloc[i, j]
            if np.isfinite(r) and abs(r) >= thr:
                pairs.append((cols[i], cols[j], float(r)))
    pairs = sorted(pairs, key=lambda x: abs(x[2]), reverse=True)
    return pairs

pairs = _hi_corr(train, feats, thr=0.80)
print("\nHigh collinearity pairs among feats (|corr|>=0.80) [top 25]:")
for a, b, r in pairs[:25]:
    print(f"{a} vs {b}: r={r:.3f}")

def _drift_smd(a_df, b_df, cols):
    rows = []
    for c in cols:
        a = pd.to_numeric(a_df[c], errors="coerce").replace([np.inf, -np.inf], np.nan)
        b = pd.to_numeric(b_df[c], errors="coerce").replace([np.inf, -np.inf], np.nan)
        ma, mb = float(a.mean()), float(b.mean())
        sa, sb = float(a.std(ddof=0)), float(b.std(ddof=0))
        sp = np.sqrt(0.5 * (sa**2 + sb**2))
        smd = (mb - ma) / sp if sp > 0 else np.nan
        rows.append((c, ma, mb, smd, abs(smd) if np.isfinite(smd) else np.nan))
    out = pd.DataFrame(rows, columns=["feature", "mean_train", "mean_test", "smd", "abs_smd"])
    return out.sort_values("abs_smd", ascending=False)

drift = _drift_smd(train, test, feats)
print("\nTrain→Test drift (SMD) [top 15]:")
print(drift.head(15).round(4))

def _group_diff(d, cols):
    g = d.groupby(t)[cols].mean(numeric_only=True)
    if 0 in g.index and 1 in g.index:
        diff = (g.loc[1] - g.loc[0]).sort_values(key=np.abs, ascending=False)
        return diff
    return pd.Series(dtype="float64")

diff = _group_diff(train, feats)
if not diff.empty:
    print("\nMean difference (target=1 minus target=0) on TRAIN feats [top 15]:")
    print(diff.head(15).round(4))

# Sanity checks that commonly catch silent bugs
assert df.duplicated(subset=["firm_id","fyear"]).sum() == 0, "Duplicates remain at firm-year level"
assert train[feats].isna().sum().sum() == 0, "NaNs remain in TRAIN features after fill"
assert np.isfinite(train[z_cols].to_numpy(dtype=float)).all(), "Non-finite z-features in TRAIN"
print("\nSanity checks passed.")



=== TRAIN === rows=44,783 | firms=9,220 | years=7 | target_rate=0.2091

Target by label_year (tail):
                mean  count
label_year                 
2015        0.221468   6773
2016        0.215525   6570
2017        0.200715   6432
2018        0.203724   6337
2019        0.229872   6173
2020        0.216108   6233
2021        0.175738   6265

=== VAL === rows=6,415 | firms=6,415 | years=1 | target_rate=0.1906

Target by label_year (tail):
                mean  count
label_year                 
2022        0.190647   6415

=== TEST === rows=12,404 | firms=6,633 | years=2 | target_rate=0.2067

Target by label_year (tail):
                mean  count
label_year                 
2023        0.207049   6327
2024        0.206352   6077

Post-imputation missingness on raw inputs (pct):
       col  train_pct_na  val_pct_na  test_pct_na
0       at           0.0         0.0          0.0
1   mkvalt           0.0         0.0          0.0
2      seq           0.0         0.0          0.0
