# 02 – Data Preprocessing

Undersample the majority class, create stratified train/test splits, scale the numerical features, and optionally apply PCA.

## 2.1 – Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import joblib

# Fixed seed for reproducibility
RANDOM_SEED = 31
# Path to the raw CSV file
DATA_PATH = 'creditcard.csv'

## 2.2 – Load Raw Data

In [3]:
# Read CSV and inspect basic info
df = pd.read_csv(DATA_PATH)
print(f"Full dataset shape: {df.shape}")
print(df.head())


Full dataset shape: (284807, 31)
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   


In [5]:
# Check class balance
print("\nClass distribution:\n", df['Class'].value_counts())


Class distribution:
 Class
0    284315
1       492
Name: count, dtype: int64
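
To put these counts in perspective, the fraud share can be worked out directly from the numbers printed above. The snippet below is a minimal sketch that hardcodes the two counts instead of reading the CSV; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the class distribution printed above
n_normal = 284315
n_fraud = 492

# Fraction of all transactions that are labeled as fraud
fraud_share = n_fraud / (n_normal + n_fraud)
print(f"Fraud share: {fraud_share:.4%}")
```

Fraud makes up roughly 0.17% of all rows, which is why the next section works on an undersampled subset.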


## 2.3 – Train/Test Split (Before Scaling)

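We first undersample the majority class: all 492 fraud rows are kept and 3,000 normal rows are drawn at random. The split is then stratified on the “Class” column, so both splits keep the class ratio of this undersampled set (about 14% fraud), not the ratio of the full dataset.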

In [7]:
fraud_df = df[df['Class'] == 1]
normal_df = df[df['Class'] == 0].sample(n=3000, random_state=RANDOM_SEED)

sample_df = pd.concat([fraud_df, normal_df], axis=0)
X = sample_df.drop(columns=['Class'])
y = sample_df['Class']

In [9]:
# Stratified split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    stratify=y,
    random_state=RANDOM_SEED
)

In [11]:
print(f"X_train: {X_train.shape},  y_train: {y_train.shape}")
print(f"X_test:  {X_test.shape},   y_test:  {y_test.shape}")
print("\nTrain class distribution:\n", y_train.value_counts())
print("\nTest class distribution:\n", y_test.value_counts())

X_train: (2793, 30),  y_train: (2793,)
X_test:  (699, 30),   y_test:  (699,)

Train class distribution:
 Class
0    2399
1     394
Name: count, dtype: int64

Test class distribution:
 Class
0    601
1     98
Name: count, dtype: int64
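
The effect of `stratify=y` can be checked in isolation. The following is a small sketch, not part of the original notebook: it uses synthetic labels with the same class counts as the undersampled set (3,000 normal, 492 fraud) and placeholder features, then compares the fraud fraction in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the undersampled set
y = np.array([0] * 3000 + [1] * 492)
# Placeholder features; only the labels matter for this check
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=31
)

# With stratify=y, the fraud fraction is nearly identical across the splits
for name, labels in [("all", y), ("train", y_tr), ("test", y_te)]:
    print(f"{name:>5}: {labels.mean():.4f} fraud fraction")
```

Because the undersampling happens before the split, the test set here reflects the undersampled ratio and not the real-world fraud rate.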


## 2.4 – Feature Scaling

The “V1…V28” features are already PCA components in the dataset as distributed on Kaggle, so we leave them unchanged.  
Only “Amount” and “Time” need standardizing, and the scaler is fit **using only the training set** to avoid data leakage.

In [13]:
# Initialize scaler and fit on training “Amount” and “Time” only
scaler = StandardScaler()

# Fit on train
scaler.fit(X_train[['Amount', 'Time']])

# Transform both train and test
X_train_scaled = X_train.copy()
X_test_scaled  = X_test.copy()

X_train_scaled[['Amount_scaled', 'Time_scaled']] = scaler.transform(X_train[['Amount', 'Time']])
X_test_scaled[['Amount_scaled', 'Time_scaled']]  = scaler.transform(X_test[['Amount', 'Time']])

# Now drop the original “Amount” and “Time” columns
X_train_scaled = X_train_scaled.drop(columns=['Amount', 'Time'])
X_test_scaled  = X_test_scaled.drop(columns=['Amount', 'Time'])

# Sanity check: train “Amount_scaled” mean ~0, std ~1
print("Scaled feature stats (train set):")
print(X_train_scaled[['Amount_scaled', 'Time_scaled']].describe().loc[['mean','std']])

Scaled feature stats (train set):
      Amount_scaled   Time_scaled
mean  -3.688818e-17 -6.741634e-17
std    1.000179e+00  1.000179e+00


> **Notes:**
>
> * We created two new columns, `Amount_scaled` and `Time_scaled`, rather than overwriting the originals, so it stays clear which features have been processed.
> * Dropping the original “Amount” and “Time” ensures our modeling pipelines only see scaled values.
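
The training standard deviation printed above is 1.000179, not exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`). The short sketch below reproduces the printed value from the training-set size alone; the variable names are illustrative.

```python
import math

n_train = 2793  # number of training rows, from the split output above

# A column scaled to population std 1 (ddof=0) has sample std sqrt(n / (n - 1))
expected_sample_std = math.sqrt(n_train / (n_train - 1))
print(f"{expected_sample_std:.6e}")  # 1.000179e+00
```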


## 2.5 – PCA for Dimensionality Reduction


To reduce dimensionality further (e.g., to speed up certain algorithms), we fit PCA on the scaled training features only (“Class” is already separated out as `y`) and then apply the fitted transform to the test set. Below, we keep the smallest number of components that explains at least 95% of the variance.

In [15]:
# Prepare arrays for PCA: convert DataFrames to NumPy arrays
X_train_for_pca = X_train_scaled.values
X_test_for_pca  = X_test_scaled.values

In [17]:
# Fit PCA on training data
pca = PCA(n_components=0.95, random_state=RANDOM_SEED)
X_train_pca = pca.fit_transform(X_train_for_pca)
X_test_pca  = pca.transform(X_test_for_pca)

In [19]:
print(f"Original feature dims: {X_train_for_pca.shape[1]}")
print(f"PCA-reduced dims:     {X_train_pca.shape[1]}  (95% variance retained)")

Original feature dims: 30
PCA-reduced dims:     16  (95% variance retained)
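
When `n_components` is a float between 0 and 1, scikit-learn chooses the number of components from the cumulative explained variance. The sketch below illustrates that selection rule with NumPy only. The variance ratios are made-up values for illustration, not the ones from this dataset; the real values are available in `pca.explained_variance_ratio_` after fitting.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted from largest to smallest
ratios = np.array([0.50, 0.30, 0.10, 0.06, 0.04])
threshold = 0.95

# Running total of variance explained by the first k components
cumulative = np.cumsum(ratios)

# Smallest k whose cumulative explained variance exceeds the threshold
n_keep = int(np.searchsorted(cumulative, threshold, side="right")) + 1
print(n_keep)  # 4
```

In this toy case three components explain about 90% of the variance, so a fourth is needed to pass the 95% mark.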


In [21]:
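# Example: save both scaled (no PCA) and PCA-transformed versions
# Create the output folder first; joblib.dump does not create missing directories
import os
os.makedirs('sample', exist_ok=True)
joblib.dump((X_train_scaled, X_test_scaled, y_train, y_test), 'sample/processed_scaled.pkl')
joblib.dump((X_train_pca, X_test_pca, y_train, y_test), 'sample/processed_pca.pkl')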

print("Saved:\n - processed_scaled.pkl  (scaled, no PCA)\n - processed_pca.pkl     (scaled + PCA)")

Saved:
 - processed_scaled.pkl  (scaled, no PCA)
 - processed_pca.pkl     (scaled + PCA)
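
A later notebook can read these files back with `joblib.load`, unpacking the tuple in the order it was saved. The following is a self-contained sketch of that round trip; it uses tiny stand-in arrays and a temporary file with a hypothetical name in place of the real `sample/` files.

```python
import os
import tempfile

import joblib
import numpy as np

# Tiny stand-in arrays; the real notebook stores the full splits
X_train_demo = np.zeros((4, 2))
X_test_demo = np.ones((2, 2))
y_train_demo = np.array([0, 0, 1, 0])
y_test_demo = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_demo.pkl")  # hypothetical file name
    joblib.dump((X_train_demo, X_test_demo, y_train_demo, y_test_demo), path)

    # Unpack in the same order the tuple was saved
    X_tr, X_te, y_tr, y_te = joblib.load(path)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

If later steps need to transform new data, the fitted `scaler` and `pca` objects could be saved with `joblib.dump` in the same way; the notebook as written stores only the transformed arrays.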


### End of Preprocessing

---

**Summary of Changes & Improvements**

1. **Train/Test Split Before Scaling**

   * Avoids leaking test-set statistics into the scaler by fitting the `StandardScaler` only on the training set.
2. **Combined Scaling**

   * Scaled “Amount” and “Time” together via a single `StandardScaler` call (instead of fitting separate scalers).
3. **Clear Feature Columns**

   * Created new columns `Amount_scaled` and `Time_scaled` and dropped the originals.
4. **Optional PCA on Train/Test**

   * Demonstrated how to fit PCA on training data and transform both train/test.
5. **Saving to Disk**

   * Showed how to pickle both the scaled/no-PCA and scaled+PCA versions for faster iteration.
