# Preprocessing and Feature Engineering

This notebook defines the preprocessing pipeline for the Credit Card Fraud Detection system.
All transformations implemented here are treated as part of the production data contract and
must be consistently applied during training and real-time inference.

The goal is to perform minimal, justified preprocessing while avoiding data leakage.

In [1]:
# imports (the random seed is set in the configuration cell below)

import pandas as pd
import numpy as np
import joblib 

from pathlib import Path 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# configuration

RANDOM_STATE = 42

DATA_PATH = Path("../data/raw/creditcard.csv")
ARTIFACTS_DIR = Path("../models")
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)

In [3]:
# load the dataset

df = pd.read_csv(DATA_PATH)

In [4]:
df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [5]:
df.shape

(284807, 31)

In [6]:
# target and feature separation

TARGET_COL = "Class"
FEATURE_COLS = [col for col in df.columns if col != TARGET_COL]

X = df[FEATURE_COLS]
y = df[TARGET_COL]

In [7]:
X.shape, y.shape

((284807, 30), (284807,))

In [8]:
# data splitting: 70% train, 30% held out for validation and test

X_train, X_temp, y_train, y_temp = train_test_split(
  X, y, 
  test_size=0.3, 
  stratify=y, 
  random_state=RANDOM_STATE
)

In [9]:
# split the 30% holdout evenly into validation and test sets (15% of the data each)

X_val, X_test, y_val, y_test = train_test_split(
  X_temp, 
  y_temp, 
  test_size=0.5, 
  stratify=y_temp, 
  random_state=RANDOM_STATE
)

In [10]:
print("Train fraud ratio: ", y_train.mean())
print("Validation fraud ratio: ", y_val.mean())
print("Test fraud ratio: ", y_test.mean())

Train fraud ratio:  0.0017254870488152324
Validation fraud ratio:  0.0017321691907960957
Test fraud ratio:  0.0017321286456626562


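Observation:
- The fraud ratio is about 0.17% in all three splits, so stratification preserved the class distribution
- This keeps validation and test metrics comparable with each other and representative of the training data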
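As a rough illustration of why stratification matters at this level of imbalance, the sketch below repeats a 70/15/15 split on synthetic labels (not the real dataset) with an assumed fraud rate of 0.17%, once with and once without `stratify`. The helper name `split_70_15_15` and the synthetic label vector are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

def split_70_15_15(y, stratified):
    """Split labels 70/15/15, optionally stratifying on the label itself."""
    y_train, y_temp = train_test_split(
        y,
        test_size=0.3,
        stratify=y if stratified else None,
        random_state=RANDOM_STATE,
    )
    y_val, y_test = train_test_split(
        y_temp,
        test_size=0.5,
        stratify=y_temp if stratified else None,
        random_state=RANDOM_STATE,
    )
    return {"train": y_train, "val": y_val, "test": y_test}

# Synthetic labels: 100,000 rows with 170 positives (0.17%), shuffled.
rng = np.random.default_rng(RANDOM_STATE)
labels = np.zeros(100_000, dtype=int)
labels[:170] = 1
y_synth = pd.Series(rng.permutation(labels), name="Class")

for stratified in (True, False):
    splits = split_70_15_15(y_synth, stratified)
    # Report (number of positives, positive ratio) for each split.
    summary = {
        name: (int(part.sum()), round(float(part.mean()), 5))
        for name, part in splits.items()
    }
    print("stratified" if stratified else "random", summary)
```

With so few positives, an unstratified split can leave the validation and test sets with noticeably different fraud counts, which makes their metrics harder to compare.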


In [11]:
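DATA_INTERIM_DIR = Path("../data/interim")
DATA_PROCESSED_DIR = Path("../data/processed")
for directory in (DATA_INTERIM_DIR, DATA_PROCESSED_DIR):
    directory.mkdir(parents=True, exist_ok=True)  # to_parquet does not create directories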

In [12]:
# Features
X_train.to_parquet(DATA_INTERIM_DIR / "X_train.parquet", index=False)
X_val.to_parquet(DATA_INTERIM_DIR / "X_val.parquet", index=False)
X_test.to_parquet(DATA_INTERIM_DIR / "X_test.parquet", index=False)

# Targets
y_train.to_frame().to_parquet(DATA_INTERIM_DIR / "y_train.parquet", index=False)
y_val.to_frame().to_parquet(DATA_INTERIM_DIR / "y_val.parquet", index=False)
y_test.to_frame().to_parquet(DATA_INTERIM_DIR / "y_test.parquet", index=False)


## Feature Scaling

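Scaling rationale:
- The PCA components (V1 to V28) may differ in variance
- Time and Amount are unbounded and sit on much larger scales than the PCA components
- Scale-sensitive models (regularized linear models, distance-based models) need features on comparable scales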
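For reference, `StandardScaler` standardizes each column as (x - mean) / std, where the mean and the population standard deviation are estimated during `fit`. The sketch below uses synthetic columns (assumed stand-ins for one PCA-like feature and one Amount-like feature, not the real data) to check that held-out rows are scaled with training statistics only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins: one PCA-like column and one Amount-like column.
train = np.column_stack([rng.normal(0, 1, 1000), rng.exponential(90, 1000)])
holdout = np.column_stack([rng.normal(0, 1, 200), rng.exponential(90, 200)])

# Statistics are estimated from the training rows only.
scaler = StandardScaler().fit(train)

# Manual equivalent; StandardScaler uses the population std (ddof=0),
# which is also the NumPy default.
manual = (holdout - train.mean(axis=0)) / train.std(axis=0)

assert np.allclose(scaler.transform(holdout), manual)
print("train means after scaling:  ", scaler.transform(train).mean(axis=0).round(6))
print("holdout means after scaling:", manual.mean(axis=0).round(3))
```

The held-out means are close to zero but not exactly zero, which is expected: those rows never contributed to the fitted statistics.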

In [13]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

In [14]:
# convert back into dataframes 

X_train_scaled = pd.DataFrame(
  X_train_scaled, 
  columns=X_train.columns,
  index=X_train.index
)

X_val_scaled = pd.DataFrame(
  X_val_scaled, 
  columns=X_val.columns,
  index=X_val.index
)

X_test_scaled = pd.DataFrame(
  X_test_scaled, 
  columns=X_test.columns, 
  index=X_test.index
)

In [15]:
joblib.dump(scaler, ARTIFACTS_DIR / "scaler.joblib")

['../models/scaler.joblib']
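Because the scaler is part of the data contract, the inference service has to load this exact artifact and present features in the training column order. A minimal sketch of that step follows; the helper `scale_for_inference`, the three-column demo frame, and the temporary directory are assumptions for illustration, not the production code.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_for_inference(scaler, records, feature_cols):
    """Apply a fitted scaler to incoming records, enforcing training column order."""
    frame = pd.DataFrame(records)
    missing = [col for col in feature_cols if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    ordered = frame[feature_cols]  # reorder and drop any extra columns
    return pd.DataFrame(
        scaler.transform(ordered), columns=feature_cols, index=frame.index
    )

# Stand-in training data using three of the real column names.
feature_cols = ["Time", "V1", "Amount"]
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(500, 3)) * [50_000, 1, 100], columns=feature_cols
)

# Round-trip the fitted scaler through joblib, as a serving process would.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "scaler.joblib"
    joblib.dump(StandardScaler().fit(X_demo), artifact)
    loaded = joblib.load(artifact)

# A single incoming transaction whose keys arrive in a different order.
transaction = [{"Amount": 149.62, "V1": -1.36, "Time": 0.0}]
scaled = scale_for_inference(loaded, transaction, feature_cols)
print(scaled)
```

A scaler fit on a DataFrame also records the training column names in `feature_names_in_` (scikit-learn 1.0 and later), which could replace the explicit `feature_cols` list.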

In [16]:
X_train_scaled.to_parquet(DATA_PROCESSED_DIR / "X_train.parquet")
y_train.to_frame().to_parquet(DATA_PROCESSED_DIR / "y_train.parquet")

X_val_scaled.to_parquet(DATA_PROCESSED_DIR / "X_val.parquet")
y_val.to_frame().to_parquet(DATA_PROCESSED_DIR / "y_val.parquet")

X_test_scaled.to_parquet(DATA_PROCESSED_DIR / "X_test.parquet")
y_test.to_frame().to_parquet(DATA_PROCESSED_DIR / "y_test.parquet")


## Preprocessing Decisions and Exclusions

The following decisions were made intentionally:

Included:
- Stratified train/validation/test split
- StandardScaler fit on training data only
- Persistence of preprocessing artifacts

Excluded:
- Missing value imputation (no missing data)
- Categorical encoding (all features numeric)
- Outlier removal (fraud cases are outliers by nature)
- Additional PCA or feature engineering
- Class imbalance handling (covered in a separate notebook)

Keeping preprocessing this minimal limits the places where validation or test information could leak into training, and leaves the original records unaltered.
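The exclusions above rest on assumptions about the raw data (no missing values, all-numeric features, a binary target) that may be worth asserting whenever the dataset is refreshed. The following is a small sketch of such a check; the function name `check_preprocessing_assumptions` and the demo frames are hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_preprocessing_assumptions(df, target_col="Class"):
    """Return a list of violated assumptions; an empty list means all checks pass."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values present, imputation may be needed")
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    if non_numeric:
        problems.append(f"non-numeric columns present: {non_numeric}")
    if not set(df[target_col].unique()) <= {0, 1}:
        problems.append("target is not binary 0/1")
    return problems

# Tiny stand-in frame that satisfies every assumption.
demo = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "V1": [-1.36, 1.19, -1.36],
    "Amount": [149.62, 2.69, 378.66],
    "Class": [0, 0, 1],
})
print(check_preprocessing_assumptions(demo))

# The same frame with a missing Amount and an added text column.
broken = demo.assign(Amount=[149.62, np.nan, 378.66], Merchant=["a", "b", "c"])
print(check_preprocessing_assumptions(broken))
```

If a check like this starts failing on new data, the corresponding exclusion (imputation or categorical encoding) would need to be revisited.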
