# Notebook 02 – Preprocessing Validation

Objectives:  
- Prototype preprocessing steps for numeric and categorical features.  
- Validate the ColumnTransformer logic (imputation + scaling + encoding).  
- Ensure the transformed dataset has the expected shape and interpretable column names.  


In [3]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from category_encoders import TargetEncoder
from sklearn.impute import SimpleImputer
import json

# Load dataset (curated version with decoded categories)
df = pd.read_csv("../german_credit.csv")

X = df.drop("CreditRisk", axis=1)
y = df["CreditRisk"]

# Load feature groups (saved from Notebook 01)
with open("../configs/feature_groups.json", "r") as f:
    feature_groups = json.load(f)

num_cols = feature_groups["num_cols"]
simple_cat_cols = feature_groups["simple_cat_cols"]
complex_cat_cols = feature_groups["complex_cat_cols"]

num_cols, simple_cat_cols, complex_cat_cols


(['Duration',
  'CreditAmount',
  'InstallmentRate',
  'ResidenceSince',
  'Age',
  'ExistingCredits',
  'PeopleLiable'],
 ['OtherDetors',
  'OtherInstallmentPlans',
  'Housing',
  'Telephone',
  'ForeignWorker'],
 ['Status',
  'CreditHistory',
  'Purpose',
  'Savings',
  'Employment',
  'SexAndStatus',
  'Property',
  'Job'])

In [5]:
# Numeric pipeline: impute median + scale
num_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Simple categorical pipeline: impute most frequent + one-hot encode
simple_cat_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

# Complex categorical pipeline: impute most frequent + target encode
complex_cat_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("targetenc", TargetEncoder())
])

# Combine everything into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_pipeline, num_cols),
        ("simple_cat", simple_cat_pipeline, simple_cat_cols),
        ("complex_cat", complex_cat_pipeline, complex_cat_cols),
    ]
)


### Preprocessing pipelines

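- Numeric: median imputation, then standardization (`StandardScaler`).  
- Simple categorical: mode imputation, then one-hot encoding. Categories not seen during fitting are encoded as all zeros (`handle_unknown="ignore"`).  
- Complex categorical: mode imputation, then target encoding, which replaces each category with a smoothed mean of the target.  

Together the three groups cover all 20 feature columns (7 numeric + 5 simple categorical + 8 complex categorical). The pipelines are combined into a single ColumnTransformer for use in model training.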
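By default a ColumnTransformer drops any column that is not assigned to a transformer (`remainder="drop"`), so a quick check that the groups partition the feature columns can catch silent omissions. The helper below is a minimal sketch; `check_feature_groups` and the toy column names are hypothetical and not part of the notebook.

```python
def check_feature_groups(columns, groups):
    """Return (missing, duplicated, unknown) column names for a grouping.

    columns: list of feature column names in the dataset.
    groups:  dict mapping a group name to a list of column names.
    """
    assigned = [col for cols in groups.values() for col in cols]
    # Columns in the dataset that no group picks up (would be dropped).
    missing = [c for c in columns if c not in assigned]
    # Columns assigned to more than one group (would be transformed twice).
    duplicated = sorted({c for c in assigned if assigned.count(c) > 1})
    # Group entries that do not exist in the dataset (would raise at fit time).
    unknown = [c for c in assigned if c not in columns]
    return missing, duplicated, unknown


# Toy example with hypothetical column names.
columns = ["a", "b", "c", "d"]
groups = {
    "num_cols": ["a", "b"],
    "simple_cat_cols": ["c"],
    "complex_cat_cols": ["c"],
}
missing, duplicated, unknown = check_feature_groups(columns, groups)
print(missing, duplicated, unknown)
```

In this notebook the call would be `check_feature_groups(list(X.columns), feature_groups)`, assuming `feature_groups` holds only the three lists loaded above. All three returned lists should be empty.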


In [6]:
# Fit the preprocessor on the full dataset and transform it.
# y is passed because the target encoder needs the labels to fit.
X_transformed = preprocessor.fit_transform(X, y)

print("Original shape:", X.shape)
print("Transformed shape:", X_transformed.shape)

# Build a DataFrame with column names for inspection.
# Recover the output column names from each fitted sub-pipeline.
num_features = preprocessor.named_transformers_["num"].get_feature_names_out(num_cols).tolist()
simple_features = preprocessor.named_transformers_["simple_cat"].get_feature_names_out(simple_cat_cols).tolist()
# The target encoder emits one column per input column, so the original names are reused.
complex_features = list(complex_cat_cols)

all_features = num_features + simple_features + complex_features

X_transformed_df = pd.DataFrame(X_transformed, columns=all_features)

X_transformed_df.head()


Original shape: (1000, 20)
Transformed shape: (1000, 28)


,Duration,CreditAmount,InstallmentRate,ResidenceSince,Age,ExistingCredits,PeopleLiable,OtherDetors_co-applicant,OtherDetors_guarantor,OtherDetors_none,...,ForeignWorker_no,ForeignWorker_yes,Status,CreditHistory,Purpose,Savings,Employment,SexAndStatus,Property,Job
0,-1.236478,-0.745131,0.918477,1.046987,2.766456,1.027079,-0.42829,0.0,0.0,1.0,...,0.0,1.0,0.492701,0.170648,0.221429,0.174863,0.252964,0.266423,0.212766,0.295238
1,2.248194,0.949817,-0.870183,-0.765977,-1.191404,-0.704926,-0.42829,0.0,0.0,1.0,...,0.0,1.0,0.390335,0.318868,0.221429,0.359867,0.306785,0.351613,0.212766,0.295238
2,-0.738668,-0.416562,-0.870183,0.140505,1.183312,-0.704926,2.334869,0.0,0.0,1.0,...,0.0,1.0,0.116751,0.170648,0.43336,0.359867,0.224138,0.266423,0.212766,0.28
3,1.750384,1.634247,-0.870183,1.046987,0.831502,-0.704926,2.334869,0.0,1.0,0.0,...,0.0,1.0,0.492701,0.318868,0.320442,0.359867,0.224138,0.266423,0.306034,0.295238
4,0.256953,0.566664,0.024147,1.046987,1.535122,1.027079,2.334869,0.0,0.0,1.0,...,0.0,1.0,0.492701,0.318162,0.380342,0.359867,0.306785,0.266423,0.435065,0.295238


### Transformed dataset

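- Original dataset: 1000 rows, 20 features.  
- Transformed dataset: 1000 rows, 28 features. The 7 numeric and 8 target-encoded columns map one-to-one, so the 5 simple categorical columns account for the remaining 13 one-hot columns.  
- Column names are preserved for the numeric and target-encoded features, and expanded to `<column>_<category>` for the one-hot encoded features (for example `ForeignWorker_yes`).  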
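The shape arithmetic can be written out as a small check. This is a sketch under stated assumptions: `expected_width` is a hypothetical helper, and only the category counts for `OtherDetors` and `ForeignWorker` are visible in the truncated table above. The other three counts are assumed values chosen to be consistent with the 13 one-hot columns implied by the output (28 - 7 - 8).

```python
def expected_width(n_numeric, onehot_category_counts, n_target_encoded):
    """Expected number of output columns.

    Scaled numeric columns and target-encoded columns map one-to-one;
    each one-hot column expands to one indicator per category.
    """
    return n_numeric + sum(onehot_category_counts.values()) + n_target_encoded


onehot_category_counts = {
    "OtherDetors": 3,            # visible in the output table
    "OtherInstallmentPlans": 3,  # assumed
    "Housing": 3,                # assumed
    "Telephone": 2,              # assumed
    "ForeignWorker": 2,          # visible in the output table
}

width = expected_width(7, onehot_category_counts, 8)
print(width)  # 7 + 13 + 8 = 28
```

In the notebook itself, `X[simple_cat_cols].nunique()` would give the actual per-column counts to use in place of the assumed ones.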

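The output shape and column names match the design, so the preprocessor can be embedded in the training pipeline. Because the target encoder uses `y`, it should be fit on training data only during model evaluation; fitting on all rows, as done here for inspection, would leak target information.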
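One way to embed a preprocessor in training is to wrap it together with an estimator in a single `Pipeline`, so that cross-validation refits every preprocessing step on each training fold. The sketch below uses synthetic stand-in data with hypothetical column names, assumes `LogisticRegression` as the classifier, and leaves out the target-encoding branch to avoid the extra dependency; none of these choices come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, not the German credit dataset).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "amount": rng.normal(size=n),
    "age": rng.normal(size=n),
    "housing": rng.choice(["own", "rent", "free"], size=n),
})
y_demo = (X_demo["amount"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Same structure as the notebook's numeric and simple categorical branches.
demo_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["amount", "age"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["housing"]),
])

# Preprocessing and classifier in one estimator: each CV split fits the
# imputers, scaler, and encoder on the training fold only.
model = Pipeline(steps=[
    ("preprocess", demo_preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores.shape)
```

With the notebook's objects, the same pattern would pass `preprocessor` as the first step and `X`, `y` to `cross_val_score`, which keeps the target encoder from seeing validation-fold labels.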
