# 01 — EDA (Polish Companies Bankruptcy)

**Goal:** sanity-check and audit the Polish bankruptcy datasets (1–5 year horizons) and prepare them for modeling.
We will:
1. Locate and verify the ARFF files (1year…5year)
2. Load each file, attach a `horizon` column, harmonize columns
3. Check shapes, dtypes, target encoding, missingness
4. Summarize class balance by horizon

> After each **code** cell you'll find a placeholder markdown cell titled **Interpretation**.  
> Leave it empty for now — once you run a cell and see the output, paste your short interpretation there (or keep it as notes for us to fill together).


### Step 1 — Import libraries

**Why:** We set up all packages used for EDA and safe loading of ARFF data.
If your environment is missing some packages, install them first (see notes in the cell).

In [1]:
# If any imports fail, install the package in your env, e.g.:
# uv pip install scipy statsmodels imbalanced-learn pyarrow

from pathlib import Path
import os
import sys
import json
import warnings

import numpy as np
import pandas as pd

from scipy.io import arff

# Display options
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 180)

RANDOM_STATE = 42
warnings.filterwarnings("ignore")

print("✅ Imports OK. Pandas:", pd.__version__)


✅ Imports OK. Pandas: 2.3.3
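If the import cell fails, it can help to see which packages are missing before installing anything. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `required` list are illustrative assumptions, not part of the notebook.

```python
from importlib.util import find_spec

def missing_packages(module_names):
    """Return the top-level modules that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]

# Top-level modules imported in Step 1 (the pip name can differ from the import name).
required = ["numpy", "pandas", "scipy"]
print("Missing packages:", missing_packages(required) or "none")
```

Only the names reported as missing then need to be installed.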


### Step 2 — Locate dataset files on disk

**Why:** Ensure the expected ARFF files exist so subsequent steps don't fail.
We keep paths **relative** to the repo root so notebooks are reproducible.

In [2]:
# Adjust REPO_ROOT if you open the notebook from another folder.
REPO_ROOT = Path.cwd()
DATA_DIR = REPO_ROOT / "data" / "polish-companies-bankruptcy" / "polish+companies+bankruptcy+data"

expected_files = [DATA_DIR / f"{i}year.arff" for i in [1,2,3,4,5]]
files_found = [p for p in expected_files if p.exists()]

print("Repo root:", REPO_ROOT)
print("Data dir:", DATA_DIR)
print("Expected files:", [p.name for p in expected_files])
print("Found files:", [p.name for p in files_found])

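# Name the missing files so a failed check is easy to fix
missing = [p.name for p in expected_files if not p.exists()]
assert not missing, f"❌ Missing ARFF files: {missing}. Please check your path."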
print("✅ All ARFF files present.")

Repo root: /Users/reebal/FH-Wedel/WS25/seminar-bankruptcy-prediction
Data dir: /Users/reebal/FH-Wedel/WS25/seminar-bankruptcy-prediction/data/polish-companies-bankruptcy/polish+companies+bankruptcy+data
Expected files: ['1year.arff', '2year.arff', '3year.arff', '4year.arff', '5year.arff']
Found files: ['1year.arff', '2year.arff', '3year.arff', '4year.arff', '5year.arff']
✅ All ARFF files present.


**Interpretation (after running):**  
- You should see 5 expected files and 5 found files listed.  
- If the assertion fails, fix `REPO_ROOT` or the data directory structure.
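When the notebook is opened from a subfolder, `Path.cwd()` no longer points at the repo root. One possible workaround is to walk upwards until a marker directory is found. This is a sketch only: the helper `find_repo_root` and the choice of `data` as the marker are assumptions made for illustration.

```python
from pathlib import Path

def find_repo_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory.")

# Possible usage in the notebook (commented out so the sketch has no side effects):
# REPO_ROOT = find_repo_root(Path.cwd())
```

A more specific marker (for example a project file that exists only at the root) makes a false match in an unrelated parent folder less likely.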

### Step 3 — Define robust ARFF → DataFrame loader

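**Why:** `scipy.io.arff.loadarff` returns a NumPy structured array plus metadata, with nominal values stored as bytes. We convert the array into a clean `DataFrame`, decode the bytes, and standardize the target to integers `{0,1}` in a column named `y`.
We also attach the horizon (`1…5`), taken from the file name.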

In [3]:
def load_arff_to_df(path: Path, horizon: int) -> pd.DataFrame:
    data, meta = arff.loadarff(str(path))
    df = pd.DataFrame(data)
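    # Decode byte values (nominal/string attributes arrive as bytes);
    # values that are not bytes are left untouched instead of being silently lost.
    for c in df.columns:
        if df[c].dtype == object:
            df[c] = df[c].map(lambda v: v.decode('utf-8') if isinstance(v, bytes) else v)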

    # Try to identify the target column (commonly 'class')
    target_col_candidates = [c for c in df.columns if c.lower() in {'class', 'target', 'bankrupt'}]
    if not target_col_candidates:
        raise ValueError(f"No obvious target column found in {path.name}. Columns: {list(df.columns)[:10]}...")
    target_col = target_col_candidates[0]

    # Normalize target to {0,1}
    y_raw = df[target_col]
    # Common encodings in UCI sets: '0'/'1', 'N'/'B', 'No'/'Yes', 'negative'/'positive'
    mapping = {
        '0': 0, '1': 1,
        'N': 0, 'B': 1,
        'No': 0, 'Yes': 1,
        'negative': 0, 'positive': 1,
        'bankrupt': 1, 'non-bankrupt': 0,
        'false': 0, 'true': 1,
        0: 0, 1: 1
    }
    def map_target(v):
        # convert bytes to str if needed
        if isinstance(v, bytes):
            v = v.decode('utf-8', errors='ignore')
        return mapping.get(v, v)

    df[target_col] = y_raw.map(map_target)
    # If still not numeric 0/1, coerce cautiously
    if not pd.api.types.is_numeric_dtype(df[target_col]):
        # try cast to int, else to category then to 0/1 by ordering
        try:
            df[target_col] = df[target_col].astype(int)
        except Exception:
            df[target_col] = pd.factorize(df[target_col])[0]
            # ensure minority is coded as 1 if class is imbalanced
            if df[target_col].value_counts().shape[0] == 2:
                counts = df[target_col].value_counts()
                minority = counts.idxmin()
                df[target_col] = (df[target_col] == minority).astype(int)

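    # Fail loudly if the target is still not strictly binary
    if df[target_col].isna().any() or not set(df[target_col].unique()) <= {0, 1}:
        raise ValueError(f"Target in {path.name} is not binary 0/1 after normalization.")
    df[target_col] = df[target_col].astype(int)

    # Attach horizon
    df['horizon'] = horizon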
    # Standardize target name to 'y'
    if target_col != 'y':
        df = df.rename(columns={target_col: 'y'})
    return df


**Interpretation (after running):**  
- No output expected. This defines a helper to load ARFF files and normalize the target to `y ∈ {0,1}`.
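To see why the decoding and mapping steps are needed, the sketch below builds a small NumPy structured array that mimics the shape of ARFF loader output (floats for numeric attributes, bytes for the nominal class) and normalizes it. The toy values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a structured array as returned by an ARFF loader:
# numeric attributes as floats, the nominal class attribute as bytes.
records = np.array(
    [(0.20, b"0"), (-0.15, b"1"), (0.05, b"0")],
    dtype=[("Attr1", "f8"), ("class", "S1")],
)
df = pd.DataFrame(records)

# Decode bytes to str, then map the labels to integers.
df["class"] = df["class"].map(lambda v: v.decode("utf-8") if isinstance(v, bytes) else v)
df["y"] = df["class"].map({"0": 0, "1": 1})

# Validate: no unmapped labels (NaN) and only the values 0 and 1.
assert df["y"].notna().all(), "unmapped target labels found"
df["y"] = df["y"].astype(int)
assert set(df["y"].unique()) <= {0, 1}
print(df[["Attr1", "y"]])
```

Without the decode step, the dictionary lookup on `"0"`/`"1"` would miss the byte values and leave `y` full of NaN.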

### Step 4 — Load all horizons, check shapes and class balance

**Why:** We need to confirm column consistency across horizons, dataset sizes, and the proportion of bankrupt firms per horizon.

In [4]:
dfs = []
for h in [1,2,3,4,5]:
    df_h = load_arff_to_df(DATA_DIR / f"{h}year.arff", horizon=h)
    dfs.append(df_h)

# Column consistency
cols_per_h = {h: set(df_h.columns) for h, df_h in zip([1,2,3,4,5], dfs)}
all_equal = len({tuple(sorted(cols)) for cols in cols_per_h.values()}) == 1

print("Shapes per horizon:")
for h, df_h in zip([1,2,3,4,5], dfs):
    pos_rate = df_h['y'].mean() if 'y' in df_h.columns else np.nan
    print(f"  h={h}: {df_h.shape}, bankrupt share (y=1) ≈ {pos_rate:.3f}")

print("\nColumns consistent across horizons:", all_equal)
if not all_equal:
    # print a small diff
    base_cols = cols_per_h[1]
    for h in [2,3,4,5]:
        diff_add = sorted(list(cols_per_h[h] - base_cols))
        diff_drop = sorted(list(base_cols - cols_per_h[h]))
        if diff_add or diff_drop:
            print(f"  vs h=1 → h={h}: +{len(diff_add)} cols, -{len(diff_drop)} cols")
            if diff_add: print("    added:", diff_add[:10], ("…" if len(diff_add)>10 else ""))
            if diff_drop: print("    dropped:", diff_drop[:10], ("…" if len(diff_drop)>10 else ""))

df_all = pd.concat(dfs, axis=0, ignore_index=True)
print("\nCombined shape:", df_all.shape)
print("Targets: ", df_all['y'].value_counts(dropna=False).to_dict())
print("Horizon counts:", df_all['horizon'].value_counts().sort_index().to_dict())

Shapes per horizon:
  h=1: (7027, 66), bankrupt share (y=1) ≈ 0.039
  h=2: (10173, 66), bankrupt share (y=1) ≈ 0.039
  h=3: (10503, 66), bankrupt share (y=1) ≈ 0.047
  h=4: (9792, 66), bankrupt share (y=1) ≈ 0.053
  h=5: (5910, 66), bankrupt share (y=1) ≈ 0.069

Columns consistent across horizons: True

Combined shape: (43405, 66)
Targets:  {0: 41314, 1: 2091}
Horizon counts: {1: 7027, 2: 10173, 3: 10503, 4: 9792, 5: 5910}


**Interpretation (after running):**  
- Note dataset sizes by horizon and the bankrupt share; this gives you an early view of class imbalance.  
- If columns are not consistent, we’ll need to align them (e.g., intersect features).  
- Confirm that `y` has only `{0,1}` and that the overall positive rate is plausible.
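If the column check had failed, one simple alignment strategy is to keep only the columns present in every file. The sketch below shows this on toy frames; `align_on_common_columns` is a hypothetical helper, and intersecting is only one option (it discards any feature missing from a single horizon).

```python
import pandas as pd

def align_on_common_columns(frames):
    """Restrict every frame to the columns shared by all frames, in the first frame's order."""
    common = set(frames[0].columns)
    for frame in frames[1:]:
        common &= set(frame.columns)
    ordered = [c for c in frames[0].columns if c in common]
    return [frame[ordered].copy() for frame in frames]

# Toy frames with one extra column each (not the real dataset).
a = pd.DataFrame({"Attr1": [0.1], "Attr2": [0.2], "extra_a": [1.0], "y": [0]})
b = pd.DataFrame({"Attr2": [0.3], "Attr1": [0.4], "extra_b": [2.0], "y": [1]})
aligned = align_on_common_columns([a, b])
combined = pd.concat(aligned, ignore_index=True)
print(combined.columns.tolist())
```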

#### Step 4 — Interpretation

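* **Consistent schema:** All five files share the same 66 columns: 64 features (`Attr1`…`Attr64`) plus `y` and `horizon`. We can therefore concatenate safely and compare horizons directly without extra alignment work.
* **Class imbalance:** Positives (bankrupt) rise from **3.9% (h=1)** to **6.9% (h=5)**. Note what `horizon` means here: according to the UCI dataset description, `Nyear.arff` holds ratios from year N of the forecasting period, and its label marks bankruptcy 6 − N years later. Our `horizon` column is therefore the file index, and h=5 is the *shortest* look-ahead (1 year), not the longest. The rising rate is plausible under that reading, since statements taken closer to the event should contain a larger share of firms that go bankrupt. Overall rate ≈ **4.8%** (2,091 of 43,405). That is imbalanced but usable with class-weighted learners and PR-AUC-oriented evaluation.
* **Sample size:** Each horizon has thousands of rows (about 5.9k to 10.5k), with roughly 270 to 515 positives each. That is enough for stratified CV and for holding out a test fold.
* **Early-warning framing:** For a strict early-warning system (*Frühwarnsystem*), look-aheads of 2–3 years are arguably the most interesting. We’ll still start modeling with **h=1** as a baseline and then compare h=2–3; keep in mind that, under the UCI mapping above, these file indices correspond to look-aheads of 5, 4 and 3 years.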
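The imbalance figures above can be reproduced from the "Targets" line of the Step 4 output. The sketch below also derives class weights with the common "balanced" heuristic, `n_total / (n_classes * n_c)`. Which weighting the later models will actually use is an open choice, not something fixed by this notebook.

```python
# Counts taken from the "Targets" line of the Step 4 output.
n_neg, n_pos = 41314, 2091
n_total = n_neg + n_pos

pos_rate = n_pos / n_total

# "Balanced" heuristic: weight_c = n_total / (n_classes * n_c).
n_classes = 2
w_neg = n_total / (n_classes * n_neg)
w_pos = n_total / (n_classes * n_pos)

print(f"positive rate: {pos_rate:.4f}")
print(f"class weights: neg={w_neg:.3f}, pos={w_pos:.3f}, ratio={w_pos / w_neg:.1f}")
```

The weight ratio equals `n_neg / n_pos`, so each bankrupt case counts about as much as 20 non-bankrupt ones under this scheme.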

### Step 5 — Dtypes and missingness snapshot

**Why:** We need to understand variable types and the extent of missing data before modeling. This informs cleaning and winsorization plans.

In [5]:
print("dtypes (first 15):") 
print(df_all.dtypes.head(15))

# Basic missingness
na_pct = df_all.isna().mean().sort_values(ascending=False)
print("\nMissingness (top 20):") 
print(na_pct.head(20))

# Quick look at y, horizon joint distribution
print("\nClass balance by horizon:") 
print(df_all.groupby('horizon')['y'].agg(['count','mean']).rename(columns={'mean':'pos_rate'}))

dtypes (first 15):
Attr1     float64
Attr2     float64
Attr3     float64
Attr4     float64
Attr5     float64
Attr6     float64
Attr7     float64
Attr8     float64
Attr9     float64
Attr10    float64
Attr11    float64
Attr12    float64
Attr13    float64
Attr14    float64
Attr15    float64
dtype: object

Missingness (top 20):
Attr37    0.437369
Attr21    0.134869
Attr27    0.063679
Attr60    0.049580
Attr45    0.049464
Attr24    0.021242
Attr64    0.018708
Attr53    0.018708
Attr28    0.018708
Attr54    0.018708
Attr41    0.017371
Attr32    0.008478
Attr52    0.006935
Attr47    0.006843
Attr46    0.003110
Attr4     0.003087
Attr33    0.003087
Attr40    0.003087
Attr12    0.003087
Attr63    0.003087
dtype: float64

Class balance by horizon:
         count  pos_rate
horizon                 
1         7027  0.038566
2        10173  0.039320
3        10503  0.047129
4         9792  0.052594
5         5910  0.069374


**Interpretation (after running):**  
- List any non-numeric columns that look like financial ratios (they should be numeric).  
- Identify variables with heavy missingness (e.g., >30%) to consider for dropping or imputation.  
- Comment on how the bankrupt share changes with the horizon (expected: the file closest to the bankruptcy event has the highest event rate; per the UCI dataset description that is `5year.arff`).

#### Step 5 — Interpretation

* **Types:** All features are numeric (`float64`) → convenient for modeling and statistics.
* **Missingness:** One feature, **Attr37**, shows **~44% NA** (very high). A second, **Attr21**, sits at ~13%, and **Attr27** at ~6%. Attr60 and Attr45 are just under 5%; every other feature is at roughly 2% or below.

  * Keeping a feature with 44% missing values and blindly imputing it can inject noise; **dropping** such a feature is often cleaner (unless metadata says it is essential).
  * For features with **≥5%** NA, it is useful to add a **missingness indicator**; sometimes “missing” is informative in financial statements.
* **Class balance by horizon:** Positive rate increases monotonically from h=1 (3.9%) to h=5 (6.9%), as noted above. Good sanity check.
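The missingness rules discussed above (drop very sparse features, flag moderately sparse ones) could be sketched as follows. The helper `handle_missingness` is hypothetical; the 30% and 5% thresholds are the ones mentioned in this step, and the data is a toy frame rather than the real dataset.

```python
import numpy as np
import pandas as pd

def handle_missingness(df, feature_cols, drop_threshold=0.30, flag_threshold=0.05):
    """Drop very sparse features and add 0/1 missingness flags for moderately sparse ones."""
    na_share = df[feature_cols].isna().mean()
    to_drop = na_share[na_share > drop_threshold].index.tolist()
    to_flag = na_share[(na_share >= flag_threshold) & (na_share <= drop_threshold)].index.tolist()

    out = df.drop(columns=to_drop)
    for col in to_flag:
        out[f"{col}_missing"] = out[col].isna().astype(int)
    return out, to_drop, to_flag

# Toy data (not the real dataset): 'sparse' is 50% missing, 'some' is 10% missing.
toy = pd.DataFrame({
    "dense": np.arange(10, dtype=float),
    "some": [np.nan] + [1.0] * 9,
    "sparse": [np.nan] * 5 + [2.0] * 5,
})
cleaned, dropped, flagged = handle_missingness(toy, ["dense", "some", "sparse"])
print("dropped:", dropped, "| flagged:", flagged)
```

Applied to the missingness table in this step, these thresholds would drop Attr37 and flag Attr21 and Attr27; Attr60 and Attr45 fall just below the 5% cut-off.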

**Limitations to keep in mind:**

* The dataset doesn’t include firm IDs or calendar years (just `horizon` and ratios), so we **cannot** do a strict time-based split or clustered CV by firm. We’ll be transparent about this limitation. Instead, we will (a) evaluate models **within a horizon** via stratified CV and (b) assess **cross-horizon robustness** (e.g., train on h=1, test on h=2/3) as a proxy for stability.
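The two evaluation ideas above can be sketched without extra dependencies. The helper `stratified_fold_ids` below is a hypothetical, simplified stand-in for a library splitter such as scikit-learn's `StratifiedKFold`, and the toy frame only imitates the `horizon`/`y` layout of `df_all`.

```python
import numpy as np
import pandas as pd

def stratified_fold_ids(y: pd.Series, n_splits: int = 5, seed: int = 42) -> pd.Series:
    """Assign each row a fold id in [0, n_splits) with a similar class share per fold.

    Assumes `y` has a unique index.
    """
    rng = np.random.default_rng(seed)
    folds = pd.Series(-1, index=y.index, dtype=int)
    for label in y.unique():
        idx = y.index[(y == label).to_numpy()].to_numpy()
        shuffled = rng.permutation(idx)  # shuffled copy of the row labels for this class
        folds.loc[shuffled] = np.arange(len(shuffled)) % n_splits
    return folds

# Toy data: 10% positives in both horizons (not the real dataset).
toy = pd.DataFrame({
    "horizon": [1] * 100 + [2] * 50,
    "y": ([1] * 10 + [0] * 90) + ([1] * 5 + [0] * 45),
})

# (a) Stratified folds within one horizon.
h1 = toy[toy["horizon"] == 1]
folds = stratified_fold_ids(h1["y"], n_splits=5)
per_fold = h1.groupby(folds)["y"].agg(["count", "sum"])
print(per_fold)

# (b) Cross-horizon check: fit on one horizon, evaluate on another.
train, test = toy[toy["horizon"] == 1], toy[toy["horizon"] == 2]
```

Because firm IDs are unavailable, neither split can rule out the same firm appearing on both sides, which is exactly the limitation stated above.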