# Step 9 - Preprocessing pipeline

Not much preprocessing is needed for this project. Previous steps have shown that the data is already high quality. However, Step 7 showed that the Amount feature benefits from a log transformation, so we'll include that in our pipeline.

In this step I will also perform the train/validation/test split and save the data and the pipeline for use in later steps.

## 9.1 Data Splitting

Before we create the pipeline we have to split the data to avoid data leakage. Because fraud cases are so rare, the split must be stratified so that each subset keeps the same fraud rate as the full dataset.

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pathlib import Path

# Load data
data_path = Path("../data/creditcard.csv")
df = pd.read_csv(data_path)

# Prepare features and target
X = df.drop('Class', axis=1)
y = df['Class']

print(f"Original dataset: {len(df):,} samples")
print(f"Fraud rate: {y.mean():.3%}")

# First split: 70% train, 30% temp (which we'll split into 15% val + 15% test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, 
    test_size=0.3, 
    random_state=42, 
    stratify=y
)

# Second split: Split the 30% temp into 15% validation and 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, 
    test_size=0.5,  # 0.5 of 30% = 15% each
    random_state=42, 
    stratify=y_temp
)

# Confirm splits
print(f"\nData splits:")
print(f"Training: {len(X_train):,} samples ({len(X_train)/len(df):.1%})")
print(f"Validation: {len(X_val):,} samples ({len(X_val)/len(df):.1%})")
print(f"Test: {len(X_test):,} samples ({len(X_test)/len(df):.1%})")

# Verify fraud rates are preserved
print(f"\nFraud rates:")
print(f"Training: {y_train.mean():.3%}")
print(f"Validation: {y_val.mean():.3%}")
print(f"Test: {y_test.mean():.3%}")

Original dataset: 284,807 samples
Fraud rate: 0.173%

Data splits:
Training: 199,364 samples (70.0%)
Validation: 42,721 samples (15.0%)
Test: 42,722 samples (15.0%)

Fraud rates:
Training: 0.173%
Validation: 0.173%
Test: 0.173%
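
As an extra sanity check on the split, it can help to confirm that the three subsets share no rows and together cover the whole dataset. The sketch below is a minimal illustration on synthetic data, not part of the original notebook: the toy frame, the `_demo` names, and the helper `splits_are_disjoint_and_complete` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 1% positives (assumed values).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"Amount": rng.exponential(scale=80.0, size=10_000)})
y_demo = pd.Series(
    np.r_[np.ones(100, dtype=int), np.zeros(9_900, dtype=int)], name="Class"
)

# Same two-stage 70/15/15 stratified split as in the cell above.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)


def splits_are_disjoint_and_complete(full_index, *split_indexes):
    """Return True if the splits share no rows and together cover full_index."""
    combined = np.concatenate([np.asarray(idx) for idx in split_indexes])
    no_overlap = len(np.unique(combined)) == len(combined)
    complete = set(combined) == set(full_index)
    return no_overlap and complete


ok = splits_are_disjoint_and_complete(
    X_demo.index, X_tr.index, X_va.index, X_te.index
)
positives = (int(y_tr.sum()), int(y_va.sum()), int(y_te.sum()))
print(ok, positives)
```

The same helper could be called with `X.index`, `X_train.index`, `X_val.index`, and `X_test.index` from the cell above.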


## 9.2 Pipeline Creation

Now the pipeline can be created.

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
import joblib

def log_transform_amount(X):
    """Apply log1p transformation to Amount column"""
    X_transformed = X.copy()
    X_transformed['Amount'] = np.log1p(X_transformed['Amount'])
    return X_transformed

# Create preprocessing pipeline. We only transform Amount; the PCA features (V1-V28) are left unchanged
preprocessing_pipeline = FunctionTransformer(
    func=log_transform_amount,
    validate=False
)

print("Preprocessing pipeline created")

Preprocessing pipeline created
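
One caveat with this design: a function defined in a notebook cell is pickled by reference, so loading the saved pipeline in another session requires `log_transform_amount` to be defined or importable there. A possible alternative, sketched below under assumptions, applies `np.log1p` directly through a `ColumnTransformer`. The toy frame and the name `amount_logger` are hypothetical, and this is not the approach the notebook uses.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Toy frame standing in for the real features (values are made up).
X_toy = pd.DataFrame({
    "V1": [-1.2, 0.3, 2.1],
    "Amount": [0.0, 9.99, 250.0],
})

# Hypothetical alternative: apply np.log1p to Amount only, pass the rest through.
amount_logger = ColumnTransformer(
    transformers=[("log_amount", FunctionTransformer(func=np.log1p), ["Amount"])],
    remainder="passthrough",
)

# The result is a NumPy array; transformed columns come first, so Amount is column 0.
out = amount_logger.fit_transform(X_toy)

# np.log1p is importable from NumPy, so the fitted object survives a pickle
# round trip without any notebook-defined function being present.
restored = pickle.loads(pickle.dumps(amount_logger))
out_restored = restored.transform(X_toy)
```

The trade-off is that the output is an array with `Amount` moved to the first column, whereas the notebook's function keeps the DataFrame and its column order intact.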


## 9.3 Pipeline Testing & Validation

Before saving the pipeline, it's important to check that it works as intended.

In [16]:
# Test pipeline on training data
print("Testing preprocessing pipeline...")

# Fit and transform training data
X_train_processed = preprocessing_pipeline.fit_transform(X_train)

# Transform validation and test data (using fitted pipeline)
X_val_processed = preprocessing_pipeline.transform(X_val)
X_test_processed = preprocessing_pipeline.transform(X_test)

# Verify transformations worked correctly
print(f"\nAmount transformation verification:")
print(f"Original Amount range: ${X_train['Amount'].min():.2f} - ${X_train['Amount'].max():.2f}")
print(f"Transformed Amount range: {X_train_processed['Amount'].min():.3f} - {X_train_processed['Amount'].max():.3f}")

# Check that other features are unchanged
pca_feature = 'V1'
print(f"\nPCA feature ({pca_feature}) unchanged verification:")
print(f"Original: {X_train[pca_feature].mean():.4f} ± {X_train[pca_feature].std():.4f}")
print(f"Processed: {X_train_processed[pca_feature].mean():.4f} ± {X_train_processed[pca_feature].std():.4f}")

# End-to-end test
print(f"\nEnd-to-end pipeline test:")
print(f"Training data shape: {X_train_processed.shape}")
print(f"Validation data shape: {X_val_processed.shape}")
print(f"Test data shape: {X_test_processed.shape}")

Testing preprocessing pipeline...

Amount transformation verification:
Original Amount range: $0.00 - $25691.16
Transformed Amount range: 0.000 - 10.154

PCA feature (V1) unchanged verification:
Original: -0.0011 ± 1.9658
Processed: -0.0011 ± 1.9658

End-to-end pipeline test:
Training data shape: (199364, 30)
Validation data shape: (42721, 30)
Test data shape: (42722, 30)
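
The transformed range can be cross-checked by hand. The short sketch below applies `np.log1p` to the two Amount extremes printed above and inverts the result with `np.expm1`. It is an illustration only; the variable names are assumptions.

```python
import numpy as np

# Amount extremes reported in the verification output above.
amounts = np.array([0.0, 25691.16])

logged = np.log1p(amounts)    # log(1 + x), which is defined at x = 0
recovered = np.expm1(logged)  # exp(x) - 1, the inverse of log1p

print(np.round(logged, 3))
print(np.allclose(recovered, amounts))
```

The maximum maps to roughly 10.154, which agrees with the transformed range reported by the pipeline, and `np.expm1` recovers the original dollar amounts if they are needed later.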


## 9.4 Save Everything

Finally, the pipeline and the split data need to be saved.

In [18]:
# Create directories if they don't exist
Path("../models").mkdir(exist_ok=True)
Path("../data/splits").mkdir(exist_ok=True)

# Save preprocessing pipeline
pipeline_path = "../models/preprocessing_pipeline.pkl"
joblib.dump(preprocessing_pipeline, pipeline_path)
print("Preprocessing pipeline saved")

# Save processed data splits
print(f"\nSaving processed data splits...")

# Save processed features
X_train_processed.to_csv("../data/splits/X_train_processed.csv", index=False)
X_val_processed.to_csv("../data/splits/X_val_processed.csv", index=False)
X_test_processed.to_csv("../data/splits/X_test_processed.csv", index=False)

# Save targets
y_train.to_csv("../data/splits/y_train.csv", index=False)
y_val.to_csv("../data/splits/y_val.csv", index=False)
y_test.to_csv("../data/splits/y_test.csv", index=False)

print(f"Training data: {X_train_processed.shape[0]:,} samples")
print(f"Validation data: {X_val_processed.shape[0]:,} samples") 
print(f"Test data: {X_test_processed.shape[0]:,} samples")

# Test loading pipeline (verification)
print(f"\nTesting pipeline loading...")
loaded_pipeline = joblib.load(pipeline_path)
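test_transform = loaded_pipeline.transform(X_train.head(5))
# Confirm the reloaded pipeline reproduces the expected log1p values
assert np.allclose(test_transform['Amount'], np.log1p(X_train['Amount'].head(5)))
print(f"Pipeline loads and transforms correctly")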

Preprocessing pipeline saved

Saving processed data splits...
Training data: 199,364 samples
Validation data: 42,721 samples
Test data: 42,722 samples

Testing pipeline loading...
Pipeline loads and transforms correctly
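
When the saved splits are read back in a later step, two details are worth keeping in mind: a Series written with `index=False` comes back as a one-column DataFrame, and features and targets stay aligned only by row position. The sketch below illustrates the round trip in a temporary directory with made-up data. The file names `X_demo.csv` and `y_demo.csv` are hypothetical and are not the files saved above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-ins for a processed feature split and its target (made-up values).
X_small = pd.DataFrame({"V1": [-1.2, 0.3, 2.1], "Amount": [0.0, 2.397, 5.525]})
y_small = pd.Series([0, 0, 1], name="Class")

with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp)
    X_small.to_csv(split_dir / "X_demo.csv", index=False)
    y_small.to_csv(split_dir / "y_demo.csv", index=False)

    X_loaded = pd.read_csv(split_dir / "X_demo.csv")
    # A Series saved with index=False comes back as a one-column DataFrame,
    # so select the column to recover a Series.
    y_loaded = pd.read_csv(split_dir / "y_demo.csv")["Class"]

# Rows stay aligned by position because both files were written without an index.
rows_match = len(X_loaded) == len(y_loaded)
```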


## Step 9 - Conclusion

The preprocessing pipeline has been successfully created and validated. Since the Kaggle dataset was already high quality, minimal preprocessing was required:

* **Data splitting**: 70/15/15 train/validation/test split with stratification to preserve the fraud class balance
* **Amount transformation**: Log transformation applied to handle the right-skewed distribution identified in Step 7
* **Pipeline validation**: Confirmed that the transformation works correctly; because the log transform is stateless, nothing is learned from the validation or test data

The pipeline and split datasets are now saved and ready for model training in Step 10.