# 02 - Data Preprocessing

Goal: create stratified train/test splits and scale the features without leaking test-set statistics into training. No resampling happens here; class imbalance is handled in `03_imbalance_handling.ipynb`.

Steps:
- Load raw data
- Stratified 80/20 train-test split (`test_size=0.2`, `random_state=42`)
- Scale features using `StandardScaler` (fit on train only, then apply to test)
- Save the processed splits and the fitted scaler to `data/processed`

Outputs:
- `X_train_scaled.csv`, `X_test_scaled.csv`
- `y_train.csv`, `y_test.csv`
- `scaler.joblib`
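
The split and scaling steps above can be sanity-checked on a small synthetic dataset. This is a minimal sketch, not part of the pipeline: the column names `amount` and `v1`, the 2% positive rate, and all `*_demo` names are assumptions made up for illustration. It checks that stratification keeps the class rate nearly equal across splits, and that a scaler fit on the training split gives that split zero mean and unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real dataset (assumption: ~2% positives).
rng = np.random.default_rng(0)
n_rows = 5000
X_demo = pd.DataFrame({
    "amount": rng.exponential(scale=50.0, size=n_rows),
    "v1": rng.normal(size=n_rows),
})
y_demo = pd.Series((rng.random(n_rows) < 0.02).astype(int), name="Class")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Stratification keeps the positive rate nearly identical in both splits.
train_rate = y_tr.mean()
test_rate = y_te.mean()

# Fit on train only; the test split is transformed with the train statistics.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# Train columns are exactly standardized; test columns are only approximately so.
train_means = X_tr_scaled.mean(axis=0)
train_stds = X_tr_scaled.std(axis=0)
test_means = X_te_scaled.mean(axis=0)

print("positive rate train/test:", train_rate, test_rate)
print("train means:", train_means, "train stds:", train_stds)
print("test means:", test_means)
```

The test-split means will generally not be exactly zero. That is expected, because the test data played no part in computing the scaling statistics.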

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from pathlib import Path
import joblib

RAW_PATH = Path('data/raw/credit_card_fraud_dataset.csv')
PROCESSED_DIR = Path('data/processed')
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Load data
data = pd.read_csv(RAW_PATH)
X = data.drop(columns=['Class'])
y = data['Class']

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features (fit on train only to avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Persist scaled data and scaler
pd.DataFrame(X_train_scaled, columns=X.columns).to_csv(PROCESSED_DIR / 'X_train_scaled.csv', index=False)
pd.DataFrame(X_test_scaled, columns=X.columns).to_csv(PROCESSED_DIR / 'X_test_scaled.csv', index=False)
y_train.to_csv(PROCESSED_DIR / 'y_train.csv', index=False)
y_test.to_csv(PROCESSED_DIR / 'y_test.csv', index=False)
joblib.dump(scaler, PROCESSED_DIR / 'scaler.joblib')

# Confirm the split sizes (shown as the cell output)
X_train.shape, X_test.shape
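
A later notebook will need to read these artifacts back. The following is a hedged sketch of that round trip, using a temporary directory and made-up data in place of `data/processed`; the names `X_small`, `y_small`, and `small_scaler` are hypothetical. Because the labels are written with `index=False`, the CSV holds a single `Class` column with a header, so it comes back as a one-column DataFrame.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for the real training split (assumption).
rng = np.random.default_rng(1)
X_small = pd.DataFrame(
    rng.normal(loc=5.0, scale=2.0, size=(100, 3)), columns=["a", "b", "c"]
)
y_small = pd.Series(rng.integers(0, 2, size=100), name="Class")

small_scaler = StandardScaler().fit(X_small)
X_small_scaled = pd.DataFrame(
    small_scaler.transform(X_small), columns=X_small.columns
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)

    # Write the artifacts the same way the preprocessing cell does.
    X_small_scaled.to_csv(out_dir / "X_train_scaled.csv", index=False)
    y_small.to_csv(out_dir / "y_train.csv", index=False)
    joblib.dump(small_scaler, out_dir / "scaler.joblib")

    # Read them back as a downstream notebook would.
    X_back = pd.read_csv(out_dir / "X_train_scaled.csv")
    y_back = pd.read_csv(out_dir / "y_train.csv")["Class"]
    scaler_back = joblib.load(out_dir / "scaler.joblib")

# Features and labels stay row-aligned, and the scaler keeps its statistics.
rows_match = len(X_back) == len(y_back)
values_match = np.allclose(X_back.to_numpy(), X_small_scaled.to_numpy())
scaler_match = np.allclose(scaler_back.mean_, small_scaler.mean_)
print(rows_match, values_match, scaler_match)
```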

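Note: Resampling to address class imbalance is deferred to `03_imbalance_handling.ipynb`. Resampling after the split, on the training data only, keeps duplicated or synthetic rows out of the test set and so limits the risk of data leakage.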
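
To illustrate why the order matters, here is a minimal sketch that resamples only the training split. The method actually used in `03_imbalance_handling.ipynb` is not shown in this notebook, so random oversampling with `sklearn.utils.resample` stands in for it; the data and all names below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up dataset with 5% positives (assumption for illustration only).
rng = np.random.default_rng(2)
frame = pd.DataFrame({"feature": rng.normal(size=1000)})
frame["Class"] = 0
frame.loc[frame.index[:50], "Class"] = 1

# Split first, so the test set is fixed before any resampling happens.
train_df, test_df = train_test_split(
    frame, test_size=0.2, stratify=frame["Class"], random_state=42
)

majority = train_df[train_df["Class"] == 0]
minority = train_df[train_df["Class"] == 1]

# Random oversampling of the minority class, drawn from training rows only.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_upsampled])

# No test row can appear in the balanced training set.
overlap = set(train_balanced.index) & set(test_df.index)

print("balanced train positive rate:", train_balanced["Class"].mean())
print("test positive rate:", test_df["Class"].mean())
print("rows shared with test:", len(overlap))
```

The test split keeps the original class ratio, so metrics computed on it reflect the imbalance the model will face in practice.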