# 02 - Data Preprocessing

**Goal:** Prepare the new `creditcard.csv` dataset for modeling.

**Steps:**
1. Scaling (`Amount` and `Time`, using `RobustScaler`)
2. Train/Test Split (80/20, Stratified)
3. Save processed files to `data/processed/new_analysis` (overwrites files from earlier runs of this notebook)


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from pathlib import Path

# Paths
RAW_DATA = Path('../data/raw/creditcard.csv')
PROCESSED_DIR = Path('../data/processed/new_analysis')
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print("Loading Raw Data...")
df = pd.read_csv(RAW_DATA)
print(f"Shape: {df.shape}")

Loading Raw Data...
Shape: (284807, 31)


In [2]:
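# RobustScaler centers on the median and scales by the IQR,
# so it is less sensitive to outliers than mean/variance scaling.
# NOTE: the scalers are fit on the full dataset, before the train/test split below,
# so test-set statistics influence the scaling.
# One scaler per column, so each fitted scaler can be reused later.
amount_scaler = RobustScaler()
time_scaler = RobustScaler()

df['scaled_amount'] = amount_scaler.fit_transform(df[['Amount']]).ravel()
df['scaled_time'] = time_scaler.fit_transform(df[['Time']]).ravel()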

df.drop(['Time','Amount'], axis=1, inplace=True)
print("Scaled Amount and Time. Dropped originals.")

Scaled Amount and Time. Dropped originals.
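
To make the scaling step concrete, here is a minimal sketch of what `RobustScaler` computes with its default settings: it subtracts the median and divides by the interquartile range (25th to 75th percentile). The `amounts` array is made-up illustration data, not values from `creditcard.csv`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical amounts with one extreme outlier (not taken from creditcard.csv)
amounts = np.array([1.0, 2.0, 3.0, 4.0, 1000.0]).reshape(-1, 1)

# Default RobustScaler: center on the median, scale by the IQR
scaler = RobustScaler()
scaled = scaler.fit_transform(amounts)

# The same computation by hand
median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q3 - q1)

# Both routes should agree
print(np.allclose(scaled, manual))
```

Because the median and IQR barely move when a single extreme value is added, the bulk of the values stays in a narrow range while the outlier remains far away from them.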


In [3]:
# Split
X = df.drop('Class', axis=1)
y = df['Class']

print("Splitting (80/20 Stratified)...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Save
print(f"Saving to {PROCESSED_DIR}...")
X_train.to_csv(PROCESSED_DIR / 'X_train_scaled.csv', index=False)
X_test.to_csv(PROCESSED_DIR / 'X_test_scaled.csv', index=False)
y_train.to_csv(PROCESSED_DIR / 'y_train.csv', index=False)
y_test.to_csv(PROCESSED_DIR / 'y_test.csv', index=False)

print("Done.")

Splitting (80/20 Stratified)...
Saving to ..\data\processed\new_analysis...
Done.
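
The cells above fit the scaler on the full dataset and only then split it. A common alternative, sketched here under stated assumptions, is to split first and fit the scaler on the training rows only, so that no test-set statistics leak into the features. The `demo` frame is a small synthetic stand-in (hypothetical, 1000 rows with 2% positives), not the real data, and this is one possible arrangement rather than the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Hypothetical stand-in for creditcard.csv: 1000 rows, 2% positive class
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Time': np.arange(1000, dtype=float),
    'Amount': rng.exponential(scale=80.0, size=1000),
    'Class': np.r_[np.ones(20, dtype=int), np.zeros(980, dtype=int)],
})

X = demo.drop('Class', axis=1)
y = demo['Class']

# Split first (same settings as the notebook: 80/20, stratified on the label)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit on the training rows only, then apply the same transform to the test rows
cols = ['Amount', 'Time']
scaler = RobustScaler()
train_scaled = scaler.fit_transform(X_train[cols])
test_scaled = scaler.transform(X_test[cols])

X_train['scaled_amount'], X_train['scaled_time'] = train_scaled[:, 0], train_scaled[:, 1]
X_test['scaled_amount'], X_test['scaled_time'] = test_scaled[:, 0], test_scaled[:, 1]
X_train = X_train.drop(columns=cols)
X_test = X_test.drop(columns=cols)

# Stratification check: the positive rate should match across the two splits
print(f"train positive rate: {y_train.mean():.4f}")
print(f"test positive rate:  {y_test.mean():.4f}")
```

The same positive-rate check can be run on the real `y_train` and `y_test` to confirm that `stratify=y` preserved the class ratio.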
