## 02 — Preprocessing & Baseline Modeling (Governed)

This notebook constructs the **governed preprocessing pipeline** and fits
**interpretable baseline models** for the early credit risk warning analysis.

All preprocessing and modeling decisions are made **once and only once** in this
notebook to ensure methodological consistency, prevent data leakage, and
preserve auditability across downstream evaluation.

The outputs of this notebook are **frozen modeling artifacts** (fitted
pipelines, train/test splits, predicted probabilities, and run metadata) that
are saved to disk and consumed unchanged by Notebook 03.

---

### Objectives

- Construct a leak-safe, reproducible preprocessing pipeline  
- Explicitly define feature transformations for numeric and categorical variables  
- Enforce strict train/test separation prior to any transformation  
- Fit interpretable baseline models suitable for early warning contexts  
- Persist governed modeling artifacts for downstream evaluation  

---

### Real-World Usage Context

This notebook supports the development of **early credit risk warning models**
intended to:

- identify accounts requiring closer monitoring,  
- support manual review and prioritization,  
- inform proportionate, preventative interventions.  

Model outputs are **probabilistic risk signals**, not automated decisions, and
are not used for credit approval, rejection, or pricing.

---

### Governance Principles

- **No data leakage**: preprocessing is fit on training data only via pipelines  
- **Single source of truth**: all inputs originate from governed artifacts
  produced in Notebook 01  
- **Interpretability by design**: logistic regression baselines are prioritized  
- **Separation of concerns**: evaluation and benchmarking are deferred to
  Notebook 03  
- **Reproducibility**: all randomness is seeded and all outputs are persisted  
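
As a rough illustration of the leakage principle, the sketch below uses synthetic data (not the governed dataset) and checks that a scaler inside a `Pipeline` learns its statistics from the training rows only. The names `X_demo`, `y_demo`, and the seed value are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic one-feature dataset (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 5.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fitting the pipeline fits the scaler on X_tr only.
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_tr, y_tr)

learned_mean = pipe.named_steps["scaler"].mean_[0]
print("scaler mean equals train mean:", np.isclose(learned_mean, X_tr[:, 0].mean()))
print("train mean:", X_tr[:, 0].mean(), "| full-data mean:", X_demo[:, 0].mean())
```

Because the scaler only ever sees `X_tr`, test rows cannot influence the centering or scaling applied at prediction time.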

---

### Scope and Limitations

The models fitted in this notebook are **baseline reference models**, not
production-ready systems.

They are intended to:
- validate signal presence,  
- establish transparent benchmarks,  
- support methodological discussion.  

Performance optimization, threshold selection, and benchmark comparisons are
handled in the subsequent notebook.


#### Environment setup

In [4]:
# Objective:
# Ensure a clean, reproducible environment and load all dependencies required
# for data preparation, modeling, and diagnostics.

from pathlib import Path
import json
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import joblib


#### Governed data loading 

In [5]:
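# Paths (governed artifacts only)
# Note: this absolute path is machine-specific; point it at your local checkout
# (for example Path.cwd() when the kernel is started inside 02_notebooks).
NOTEBOOK_DIR = Path("/Users/steph/Desktop/Github/early-warning-credit-risk/02_notebooks")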
PROJECT_ROOT = NOTEBOOK_DIR.parent

IN_DIR = PROJECT_ROOT / "03_artifacts" / "notebook01"
OUT_DIR = PROJECT_ROOT / "03_artifacts" / "notebook02"
OUT_DIR.mkdir(parents=True, exist_ok=True)

FIG_DIR = OUT_DIR / "figures"
FIG_DIR.mkdir(parents=True, exist_ok=True)

print("Figures will be saved to:", FIG_DIR.resolve())



#### Governance principle:
- This notebook must only consume artifacts produced upstream
- Raw data files are never reloaded at this stage
- Target definition and column structure are considered frozen
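
One way to make the "frozen column structure" assumption checkable is a small schema test run right after loading. The sketch below is only an illustration: `EXPECTED_SCHEMA` and `check_frozen_schema` are hypothetical names, and the three-column schema is a shortened stand-in for the full governed column list.

```python
import pandas as pd

# Hypothetical frozen schema: column name -> expected NumPy dtype kind
# ("i" = signed integer, "O" = object/string). Shortened for illustration.
EXPECTED_SCHEMA = {"duration_months": "i", "purpose": "O", "y_bad": "i"}


def check_frozen_schema(df, expected):
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    for col, kind in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif df[col].dtype.kind != kind:
            problems.append(
                f"{col}: expected kind {kind!r}, got {df[col].dtype.kind!r}"
            )
    extra = sorted(set(df.columns) - set(expected))
    if extra:
        problems.append(f"unexpected columns: {extra}")
    return problems


demo = pd.DataFrame(
    {"duration_months": [6, 48], "purpose": ["A43", "A46"], "y_bad": [0, 1]}
)
print(check_frozen_schema(demo, EXPECTED_SCHEMA))
print(check_frozen_schema(demo.drop(columns=["y_bad"]), EXPECTED_SCHEMA))
```

Returning a list rather than raising on the first problem lets a reviewer see every deviation from the upstream artifact at once.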


In [6]:

DATA_PATH = IN_DIR / "german_credit_with_y_bad.csv"

# Fail fast if the governed artifact is missing.
# This prevents silent fallbacks and ensures reproducibility.
assert DATA_PATH.exists(), (
    f"Governed dataset not found at {DATA_PATH}. "
    "Notebook 01 must be re-run to regenerate artifacts."
)

# Load governed dataset
df = pd.read_csv(DATA_PATH)

# Minimal structural validation
assert "y_bad" in df.columns, (
    "Missing event label 'y_bad'. "
    "Target definition is not governed or dataset is inconsistent."
)

print("Loaded governed dataset")
print("Shape:", df.shape)

df.head()


Loaded governed dataset
Shape: (1000, 22)


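| | status_checking_account | duration_months | credit_history | purpose | credit_amount | savings_account | employment_since | installment_rate | personal_status_sex | other_debtors | ... | age | other_installment_plans | housing | existing_credits | job | num_dependents | telephone | foreign_worker | credit_risk | y_bad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | ... | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 | 0 |
| 1 | A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | ... | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 | 1 |
| 2 | A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | ... | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 | 0 |
| 3 | A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | ... | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 | 0 |
| 4 | A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | ... | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 | 1 |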


In [7]:
print("Columns:")
display(df.dtypes)

Columns:


status_checking_account    object
duration_months             int64
credit_history             object
purpose                    object
credit_amount               int64
savings_account            object
employment_since           object
installment_rate            int64
personal_status_sex        object
other_debtors              object
residence_since             int64
property                   object
age                         int64
other_installment_plans    object
housing                    object
existing_credits            int64
job                        object
num_dependents              int64
telephone                  object
foreign_worker             object
credit_risk                 int64
y_bad                       int64
dtype: object

In [8]:
# Define target and exclude leakage-prone columns
TARGET_COL = "y_bad"

EXCLUDE_COLS = [TARGET_COL]

# Exclude any raw label columns to prevent leakage
for raw_target in ["credit_risk", "target"]:
    if raw_target in df.columns:
        EXCLUDE_COLS.append(raw_target)

X = df.drop(columns=EXCLUDE_COLS)
y = df[TARGET_COL].astype(int)

print("Excluded columns:", EXCLUDE_COLS)
print("X shape:", X.shape)
print("y distribution:")
print(y.value_counts().sort_index())


Excluded columns: ['y_bad', 'credit_risk']
X shape: (1000, 20)
y distribution:
y_bad
0    700
1    300
Name: count, dtype: int64


#### Feature typing 

In [9]:
# Explicit list of coded categorical features (names must match the governed columns)
categorical_cols = [
    c for c in [
        "status_checking_account","credit_history","purpose","savings_account","employment_since",
        "personal_status_sex","other_debtors","property","other_installment_plans",
        "housing","job","telephone","foreign_worker"
    ]
    if c in X.columns
]

numeric_cols = [c for c in X.columns if c not in categorical_cols]

print("Categorical columns:", len(categorical_cols))
print("Numeric columns:", len(numeric_cols))

# Governance note: the explicit list prevents misclassification of coded categories.
# The `if c in X.columns` filter silently drops misspelled names, so check the counts.


Categorical columns: 13
Numeric columns: 7


In [10]:
# Leakage safeguards
assert "target" not in X.columns, "Leakage risk: 'target' still in features"
assert "credit_risk" not in X.columns, "Leakage risk: 'credit_risk' still in features"
assert "y_bad" not in X.columns, "Leakage risk: 'y_bad' still in features"


#### Train/test split (stratified, reproducible)


In [11]:
# Train/Test split
RANDOM_STATE = 2712
TEST_SIZE = 0.20

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=y
)

print("Train shape:", X_train.shape, "| Bad rate:", round(y_train.mean(), 4))
print("Test  shape:", X_test.shape,  "| Bad rate:", round(y_test.mean(), 4))


Train shape: (800, 20) | Bad rate: 0.3
Test  shape: (200, 20) | Bad rate: 0.3
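
The reproducibility claim for the split can be spot-checked in isolation. The following sketch uses a synthetic label vector with the same 700/300 class balance (an assumption standing in for `y`) and confirms that repeating the split with the same seed selects the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class balance as the governed target.
y_demo = np.array([0] * 700 + [1] * 300)
row_ids = np.arange(len(y_demo))


def split_test_ids(seed):
    """Return the sorted row ids assigned to the test set for a given seed."""
    _, test_ids = train_test_split(
        row_ids, test_size=0.20, random_state=seed, stratify=y_demo
    )
    return np.sort(test_ids)


first = split_test_ids(2712)
second = split_test_ids(2712)

print("identical test rows across runs:", np.array_equal(first, second))
print("test bad rate:", y_demo[first].mean())
```

Splitting row ids rather than the data itself is a convenient way to persist or compare split membership without copying features.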


In [12]:
# Override any previous lists: use dtype-based typing (robust to coded categories)
categorical_cols = X_train.select_dtypes(include=["object"]).columns.tolist()
numeric_cols = X_train.select_dtypes(exclude=["object"]).columns.tolist()

print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)

# Safety: ensure no overlap
assert set(categorical_cols).isdisjoint(set(numeric_cols))


Categorical columns: ['status_checking_account', 'credit_history', 'purpose', 'savings_account', 'employment_since', 'personal_status_sex', 'other_debtors', 'property', 'other_installment_plans', 'housing', 'job', 'telephone', 'foreign_worker']
Numeric columns: ['duration_months', 'credit_amount', 'installment_rate', 'residence_since', 'age', 'existing_credits', 'num_dependents']


In [13]:
# Class balance check (stratification integrity)
assert abs(y_train.mean() - y.mean()) < 0.01, "Train bad-rate drift"
assert abs(y_test.mean() - y.mean()) < 0.01, "Test bad-rate drift"


In [14]:
# Principles:
# - All transformations are learned on TRAIN only (via Pipeline.fit on X_train)
# - Numeric and categorical features are handled separately and explicitly
# - Outputs are deterministic and reproducible across notebooks
#
# Rationale (risk / governance):
# - Missing data handling is standardized (no ad-hoc fixes later)
# - Scaling is applied for coefficient stability in logistic regression
# - One-hot encoding avoids imposing an artificial order on categories
# - Unknown categories in test/production are handled safely (no crashes)
# ------------------------------------------------------------

# Numeric preprocessing: impute missing values + standardize scale
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),   # robust to outliers
    ("scaler", StandardScaler()),                    # improves numerical stability
])

# Categorical preprocessing: impute missing categories + one-hot encode
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # stable default for categories
    ("onehot", OneHotEncoder(handle_unknown="ignore")),    # prevents failure on unseen categories
])

# Combine into a single preprocessor applied consistently to train/test
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        ("cat", categorical_pipeline, categorical_cols),
    ],
    remainder="drop",  # governance: only approved feature sets are used
)
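
To make the `handle_unknown="ignore"` choice concrete, here is a minimal sketch on toy category codes. The code `"A99"` is a made-up value that does not occur during fitting; it is an assumption for illustration, not a value from the governed data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen at fit time.
train_cat = np.array([["A40"], ["A43"], ["A46"]], dtype=object)

# New data: one known category and one hypothetical unseen category.
new_cat = np.array([["A43"], ["A99"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cat)

# transform() returns a sparse matrix by default; densify for display.
encoded = encoder.transform(new_cat).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # the unseen category becomes an all-zero row
```

An all-zero row means the unseen category contributes nothing through the one-hot columns, so the prediction falls back on the intercept and the remaining features instead of raising an error.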


In [15]:
from sklearn.base import clone

# Logistic regression baseline (interpretable, default L2 penalty)
log_reg = Pipeline(steps=[
    ("preprocess", clone(preprocess)),  # independent copy, not shared between pipelines
    ("model", LogisticRegression(max_iter=2000))
])

# L1-regularized logistic regression (sparser coefficients)
l1_log_reg = Pipeline(steps=[
    ("preprocess", clone(preprocess)),
    ("model", LogisticRegression(
        penalty="l1",
        solver="saga",              # supports the L1 penalty; stochastic, so it is seeded
        max_iter=4000,
        random_state=RANDOM_STATE
    ))
])

# Fit on training data only
log_reg.fit(X_train, y_train)
l1_log_reg.fit(X_train, y_train)

print("Models fitted: logistic, l1_logistic")


Models fitted: logistic, l1_logistic


In [16]:
# ------------------------------------------------------------
# Probability generation (governed outputs)

# Governance note:
# - Probabilities are saved unchanged
# - No thresholding or performance evaluation occurs here
# - This prevents analytical drift between modeling and evaluation

def predict_proba(pipe, X):
    """Return event probability P(y_bad = 1)."""
    return pipe.predict_proba(X)[:, 1]

# Training-set probabilities (for diagnostics only)
train_predictions = pd.DataFrame({
    "y": y_train,
    "p_logistic": predict_proba(log_reg, X_train),
    "p_l1_logistic": predict_proba(l1_log_reg, X_train),
})

# Test-set probabilities (used for final evaluation)
test_predictions = pd.DataFrame({
    "y": y_test,
    "p_logistic": predict_proba(log_reg, X_test),
    "p_l1_logistic": predict_proba(l1_log_reg, X_test),
})

# Persist predictions for downstream evaluation (Notebook 03)
train_predictions.to_csv(OUT_DIR / "train_predictions.csv", index=True)
test_predictions.to_csv(OUT_DIR / "test_predictions.csv", index=True)




#### Artifacts saving for Notebook 03

In [18]:

joblib.dump(log_reg, OUT_DIR / "pipeline_logistic.joblib")
joblib.dump(l1_log_reg, OUT_DIR / "pipeline_l1_logistic.joblib")

print("Saved fitted pipelines.")


Saved fitted pipelines.
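
A quick round-trip check can confirm that a persisted pipeline reproduces the in-memory predictions. The sketch below uses a throwaway pipeline on synthetic data and a temporary directory; the file name `pipeline_demo.joblib` is an assumption for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_demo, y_demo)

# Save, reload, and compare predicted event probabilities.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline_demo.joblib"
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same = np.allclose(
    pipe.predict_proba(X_demo)[:, 1], restored.predict_proba(X_demo)[:, 1]
)
print("restored pipeline reproduces probabilities:", same)
```

Joblib files are generally only safe to load with the same scikit-learn version that wrote them, so recording the library version next to the artifact is a sensible companion step.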


In [19]:
X_train.to_csv(OUT_DIR / "X_train.csv", index=True)
X_test.to_csv(OUT_DIR / "X_test.csv", index=True)
y_train.to_csv(OUT_DIR / "y_train.csv", index=True)
y_test.to_csv(OUT_DIR / "y_test.csv", index=True)

print("Saved governed splits.")


Saved governed splits.
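
When these CSV files are read back, two details are easy to miss: the original row index sits in the first column, and a saved `Series` comes back as a one-column `DataFrame`. A minimal sketch of the reload step, using a toy label series and a temporary directory as stand-ins:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy label series with a non-contiguous index, as produced by a shuffled split.
y_demo = pd.Series([0, 1, 0], index=[17, 4, 250], name="y_bad")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_train.csv"
    y_demo.to_csv(path, index=True)

    # index_col=0 restores the original row labels instead of a fresh 0..n-1 index.
    as_frame = pd.read_csv(path, index_col=0)

# read_csv returns a DataFrame; select the column to get a Series back.
y_reloaded = as_frame["y_bad"]

print(type(as_frame).__name__)
print(y_reloaded.index.tolist())
```

Keeping the original index makes it possible to join predictions, features, and labels by row label downstream rather than relying on row order.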


In [20]:
metadata = {
    "notebook": "02_preprocessing_modeling.ipynb",
    "random_state": RANDOM_STATE,
    "test_size": TEST_SIZE,
    "excluded_columns": EXCLUDE_COLS,
    "n_train": int(X_train.shape[0]),
    "n_test": int(X_test.shape[0]),
    "bad_rate_train": float(y_train.mean()),
    "bad_rate_test": float(y_test.mean()),
    "numeric_columns": numeric_cols,
    "categorical_columns": categorical_cols,
    "models": ["logistic_regression", "l1_logistic_regression"],
    "notes": "Preprocessing + interpretable baseline models only. Evaluation and Random Forest benchmark in Notebook 03."
}

with open(OUT_DIR / "run_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

print("Saved run metadata.")


Saved run metadata.
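
A downstream notebook could use the metadata as a lightweight consistency check before evaluating anything. The sketch below is an assumption about how such a check might look: `check_run_metadata` is a hypothetical helper, and the JSON string stands in for the contents of `run_metadata.json` using values reported in this notebook.

```python
import json

# Stand-in for run_metadata.json (subset of keys, values from this notebook's run).
metadata_text = json.dumps(
    {"random_state": 2712, "test_size": 0.20, "n_train": 800, "n_test": 200}
)


def check_run_metadata(meta, tol=0.01):
    """Return the total row count if the recorded split sizes match test_size."""
    n_total = meta["n_train"] + meta["n_test"]
    observed_test_share = meta["n_test"] / n_total
    if abs(observed_test_share - meta["test_size"]) > tol:
        raise ValueError(
            f"test share {observed_test_share:.3f} does not match "
            f"recorded test_size {meta['test_size']}"
        )
    return n_total


meta = json.loads(metadata_text)
n_rows = check_run_metadata(meta)
print("rows covered by the split:", n_rows)
```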
