In [1]:
import os
import pandas as pd

os.makedirs("../artifacts/dataset", exist_ok=True)

raw_path = "../artifacts/dataset/german_credit_raw.csv"
df = pd.read_csv(raw_path)

# target: good=1, bad=0
df["target"] = df["class"].map({"good": 1, "bad": 0}).astype(int)

FEATURES = [
    "duration",
    "credit_amount",
    "age",
    "checking_status",
    "employment",
    "savings_status",
    "purpose",
]

clean = df[FEATURES + ["target"]].copy()

print("Clean shape:", clean.shape)
display(clean.head())

clean_path = "../artifacts/dataset/german_credit_clean.csv"
clean.to_csv(clean_path, index=False)
print("Saved:", clean_path)


Clean shape: (1000, 8)


| | duration | credit_amount | age | checking_status | employment | savings_status | purpose | target |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 1169.0 | 67.0 | <0 | >=7 | no known savings | radio/tv | 1 |
| 1 | 48.0 | 5951.0 | 22.0 | 0<=X<200 | 1<=X<4 | <100 | radio/tv | 0 |
| 2 | 12.0 | 2096.0 | 49.0 | no checking | 4<=X<7 | <100 | education | 1 |
| 3 | 42.0 | 7882.0 | 45.0 | <0 | 4<=X<7 | <100 | furniture/equipment | 1 |
| 4 | 24.0 | 4870.0 | 53.0 | <0 | 1<=X<4 | <100 | new car | 0 |


Saved: ../artifacts/dataset/german_credit_clean.csv


# Data Preparation and Feature Engineering

This notebook implements the data transformation stage that precedes model training. Most machine learning algorithms expect complete, numeric input, so the raw data is imputed, scaled, and encoded before it reaches a model.

Transformations applied:

- Missing value imputation
- Numerical feature scaling
- Categorical variable encoding
- Feature grouping into numerical and categorical sets
- Reproducible preprocessing pipeline

These steps follow common feature engineering and data preparation practice.


## Selected Modeling Features

From the original dataset variables, the following features are selected for modeling based on relevance and availability:

- duration
- credit_amount
- age
- checking_status
- employment
- savings_status
- purpose

These features represent financial behavior and client profile indicators relevant for credit risk prediction.
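
The selected features fall into two groups by data type, which determines whether a column is later scaled or encoded. The following is a minimal sketch of that grouping, assuming the column types match the preview of the cleaned data (three numeric columns, four string columns). The toy frame `sample` and the names `NUM_FEATURES` and `CAT_FEATURES` are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Two rows copied from the cleaned data preview (illustrative subset only).
sample = pd.DataFrame({
    "duration": [6.0, 48.0],
    "credit_amount": [1169.0, 5951.0],
    "age": [67.0, 22.0],
    "checking_status": ["<0", "0<=X<200"],
    "employment": [">=7", "1<=X<4"],
    "savings_status": ["no known savings", "<100"],
    "purpose": ["radio/tv", "radio/tv"],
})

# Numeric columns are candidates for scaling; everything else is encoded.
NUM_FEATURES = sample.select_dtypes(include="number").columns.tolist()
CAT_FEATURES = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", NUM_FEATURES)
print("Categorical:", CAT_FEATURES)
```

Deriving the groups from dtypes keeps the lists in sync with the data, although hard-coding them is equally valid when the schema is fixed.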


## Preprocessing Pipeline Design

A preprocessing pipeline is built from scikit-learn transformers. Fitting it once on the training data and reusing the fitted object ensures that the same transformations are applied consistently during training, prediction, and retraining.
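
One possible shape for such a pipeline is sketched below. It is an assumption about the design, not the notebook's confirmed implementation: the imputation strategies (`median`, `most_frequent`), the choice of `StandardScaler` and `OneHotEncoder`, and the names `NUM_FEATURES`, `CAT_FEATURES`, and `preprocessor` are all illustrative. The toy frame reuses the five preview rows, with one `age` value blanked out to exercise the imputer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature groups (assumed split of the selected features).
NUM_FEATURES = ["duration", "credit_amount", "age"]
CAT_FEATURES = ["checking_status", "employment", "savings_status", "purpose"]

# Numeric branch: fill gaps with the median, then standardize.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical branch: fill gaps with the mode, then one-hot encode.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array, which is easier to inspect here.
preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipeline, NUM_FEATURES),
        ("cat", categorical_pipeline, CAT_FEATURES),
    ],
    sparse_threshold=0,
)

# Toy data: the five preview rows, with one age set to NaN for illustration.
sample = pd.DataFrame({
    "duration": [6.0, 48.0, 12.0, 42.0, 24.0],
    "credit_amount": [1169.0, 5951.0, 2096.0, 7882.0, 4870.0],
    "age": [67.0, np.nan, 49.0, 45.0, 53.0],
    "checking_status": ["<0", "0<=X<200", "no checking", "<0", "<0"],
    "employment": [">=7", "1<=X<4", "4<=X<7", "4<=X<7", "1<=X<4"],
    "savings_status": ["no known savings", "<100", "<100", "<100", "<100"],
    "purpose": ["radio/tv", "radio/tv", "education",
                "furniture/equipment", "new car"],
})

X = preprocessor.fit_transform(sample)
print("Transformed shape:", X.shape)
```

In practice the fitted `preprocessor` would be applied to new data with `preprocessor.transform(...)` rather than refitted, so that prediction-time inputs are scaled and encoded with the statistics learned from the training set.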
