# Credit Card Fraud Detection — 01 Exploration & Preprocessing

In this notebook, I:

- Configure the Python environment and project paths  
- Load the credit card fraud dataset  
- Explore the structure and class imbalance  
- Engineer a log-transformed transaction amount feature  
- Split the data into training and test sets  
- Scale features in preparation for model training

Modeling and evaluation are done in a separate notebook: `02_modeling.ipynb`.


## 1. Environment & Project Paths

Before loading the dataset, I verify that the notebook is running under the correct Python interpreter and in the correct project directory. This guards against:

- Missing packages  
- Using the wrong virtual environment  
- Incorrect relative file paths  
- `FileNotFoundError` when reading the CSV  

I also confirm the dataset exists inside the expected `data/raw/` directory before continuing.
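
One way to perform that existence check is sketched below, assuming the notebook is launched from the project root. The helper name `check_dataset` is hypothetical and not part of the original notebook; the default path matches the one passed to `pd.read_csv` later on.

```python
from pathlib import Path


# Hypothetical helper: report whether the expected dataset file is present.
def check_dataset(path: str = "data/raw/creditcard.csv") -> bool:
    csv_path = Path(path)
    if csv_path.is_file():
        print(f"Found dataset: {csv_path.resolve()}")
        return True
    # Show the working directory, since a wrong cwd is a common cause.
    print(f"Dataset not found at '{csv_path}' (cwd: {Path.cwd()})")
    return False


dataset_ok = check_dataset()
```

Returning a boolean instead of raising keeps the check usable as a guard (`if not dataset_ok: ...`) without interrupting the notebook.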


In [4]:
# ---------------------------------------------------------
# DEBUGGING CHECK: Confirm the Python Interpreter in Use
# ---------------------------------------------------------
# Why this is important:
# In VS Code, Jupyter notebooks sometimes run on the wrong
# Python interpreter (e.g., the system Python instead of the
# project's virtual environment `.venv`). This can cause:
#   - Missing packages
#   - Module import errors
#   - Inconsistent environment behavior
#
# `sys.executable` prints the exact path of the Python
# interpreter currently running this notebook. This helps
# verify that the notebook is using the correct virtual
# environment before continuing with the ML pipeline.

import sys
print(sys.executable)



c:\Users\Juant\OneDrive\Documents\ML-Projects\classic-ml-fraud-detection\.venv\Scripts\python.exe


In [5]:
# ---------------------------
# STEP 1: Import Core Libraries
# ---------------------------
# NumPy: numerical operations
# Pandas: data loading & manipulation
# Matplotlib/Seaborn: visualization
# Scikit-learn: train/test splitting, feature scaling, and evaluation metrics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score






In [6]:
# ---------------------------------------------------------
# DATA VISUALIZATION LIBRARY: Seaborn
# ---------------------------------------------------------
# Why we use Seaborn:
# Seaborn is a statistical data visualization library built
# on top of Matplotlib. It provides concise plotting
# functions and clean default styles.
#
# In this fraud detection project, Seaborn helps us:
#   - Visualize class imbalance
#   - Plot feature distributions (e.g., Amount vs. log_amount)
#   - Draw correlation matrices as heatmaps
#   - Render a confusion matrix as an annotated heatmap
#     (ROC/PR curves themselves are drawn with Matplotlib)
#
# Overall, Seaborn makes it easier to understand the data,
# communicate insights, and present findings clearly in a
# portfolio setting.
#
# Note: Seaborn was already imported in the previous cell;
# repeating the import here is harmless.

import seaborn as sns


## 2. Load the Dataset

Here I load the Credit Card Fraud Detection dataset using Pandas.  
This dataset contains anonymized PCA-transformed transaction features (`V1`–`V28`), a `Time` column (seconds elapsed since the first transaction in the dataset), the raw transaction `Amount`, and the `Class` label:

- **0 — legitimate transaction**  
- **1 — fraudulent transaction**

I begin by reading the dataset into a Pandas DataFrame and previewing the first few rows.


In [7]:


# ---------------------------
# STEP 2: Load the Dataset
# ---------------------------
# We load the well-known Credit Card Fraud Detection dataset.
# This dataset is highly imbalanced and contains numerical features
# obtained from PCA transformation, along with 'Time', 'Amount', and 'Class'.
# 'Class' = 1 indicates fraud, 'Class' = 0 indicates a normal transaction.

df = pd.read_csv("data/raw/creditcard.csv")
df.head()


,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


## 3. Dataset Structure & Class Imbalance

Next, I inspect the dataset’s structure and verify the degree of class imbalance.  
Fraud detection datasets are usually extremely imbalanced, which strongly affects model choice, evaluation strategy, and how thresholds are chosen in production.


In [8]:
# ---------------------------
# STEP 3: Inspect the Data
# ---------------------------
# Check the structure, column types, and confirm that the dataset loaded correctly.

df.info()

# Check class distribution (fraud vs non-fraud)
# Fraud cases are extremely rare — less than 0.2%.
df['Class'].value_counts()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

Class
0    284315
1       492
Name: count, dtype: int64
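
The counts above can be turned into a fraud rate with a little arithmetic. The sketch below hard-codes the two counts from the output rather than reading the DataFrame. It also illustrates why plain accuracy is a poor metric here: a trivial model that labels every transaction as legitimate would still be right almost all of the time.

```python
# Class counts copied from the value_counts() output above.
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total
# Accuracy of a trivial model that labels every transaction as legitimate.
majority_baseline_accuracy = n_legit / n_total

print(f"Total transactions: {n_total}")
print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-legitimate accuracy: {majority_baseline_accuracy:.4%}")
```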

## 4. Feature Engineering

The `Amount` feature is highly skewed, with many small transactions and a few very large ones.  
To help machine learning models learn more effectively, I create a log-transformed version of this feature (`log_amount`) using `log1p`, i.e. log(1 + Amount). This transformation reduces skew and compresses outliers while preserving the ordering of transaction amounts.


In [9]:
# ---------------------------
# STEP 4: Feature Engineering — Log Transform the 'Amount' Feature
# ---------------------------
# Why log transform?
# - Transaction amounts are highly skewed (long-tailed distribution)
# - Heavy skew can hurt linear and distance-based models in particular
# - The log transform compresses large values and spreads out small ones,
#   making patterns easier to learn.
# We keep both 'Amount' and 'log_amount' because different model families
# benefit from different representations.

df['log_amount'] = np.log1p(df['Amount'])   # log1p handles zero values safely
df[['Amount', 'log_amount']].head()



,Amount,log_amount
0,149.62,5.01476
1,2.69,1.305626
2,378.66,5.939276
3,123.5,4.824306
4,69.99,4.262539
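
To make the effect concrete, the sketch below applies `np.log1p` to the five `Amount` values shown above plus two illustrative extremes (0 and 25,000, which are assumptions for demonstration, not values read from the dataset). It also shows that `np.expm1` inverts the transform, so the original amounts can be recovered when needed.

```python
import numpy as np

# First five amounts from the preview above, plus two illustrative extremes.
amounts = np.array([0.0, 2.69, 69.99, 123.5, 149.62, 378.66, 25000.0])

log_amounts = np.log1p(amounts)      # log(1 + x); defined at x = 0
recovered = np.expm1(log_amounts)    # inverse transform: exp(y) - 1

for a, la in zip(amounts, log_amounts):
    print(f"{a:>10.2f} -> {la:.4f}")

# The log-transformed range is far narrower than the raw range.
print("Raw range:", amounts.max() - amounts.min())
print("Log range:", log_amounts.max() - log_amounts.min())
```

Because the transform is strictly increasing, larger amounts still map to larger `log_amount` values; only the spacing between them changes.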


## 5. Train/Test Split & Feature Scaling

To fairly evaluate model performance, I split the dataset into training and test sets using a stratified split to preserve the fraud ratio.  
I then standardize the features using `StandardScaler`, fitting the scaler only on the training data to prevent data leakage and applying the same transformation to the test set.
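
The leakage-safe pattern can be sketched with NumPy alone. `StandardScaler` computes a per-feature mean and population standard deviation in `fit` and applies `(x - mean) / std` in `transform`; the sketch below reproduces that arithmetic on synthetic data (the array shapes, feature scales, and random seed are arbitrary assumptions). The statistics come from the training rows only, so the test rows do not influence them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for X_train / X_test: 3 features on very different scales.
scales = np.array([1.0, 50.0, 1000.0])
train = rng.normal(loc=10.0, scale=1.0, size=(800, 3)) * scales
test = rng.normal(loc=10.0, scale=1.0, size=(200, 3)) * scales

# "fit": statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)          # ddof=0, matching StandardScaler

# "transform": the same statistics are applied to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("Train means:", train_scaled.mean(axis=0).round(6))
print("Train stds: ", train_scaled.std(axis=0).round(6))
print("Test means: ", test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 by construction, while the test columns are only approximately standardized. That small mismatch is expected; it is the cost of keeping test information out of the fitted statistics.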


In [10]:
# ---------------------------------------------------------
# STEP: Split the Dataset into Training and Testing Sets
# ---------------------------------------------------------
# - We first separate features (X) and labels (y).
# - Then we split them into:
#       * Training set  → used to fit the model
#       * Testing set   → used to evaluate model performance
#
# Why this matters:
# - Fraud datasets are extremely imbalanced, so we use
#   stratify=y to preserve the fraud/non-fraud ratio.
# - random_state=42 ensures the split is reproducible.
# ---------------------------------------------------------

from sklearn.model_selection import train_test_split

# Separate the dataset into input features (X) and target labels (y)
X = df.drop(columns=['Class'])   # All feature columns
y = df['Class']                  # Fraud label (0 or 1)

# Perform the train/test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,      # 20% of data reserved for testing
    stratify=y,         # Maintain fraud ratio across splits
    random_state=42     # Ensures reproducibility
)

# Display shapes & basic fraud ratios to confirm correct splitting
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Fraud ratio in train:", y_train.mean())
print("Fraud ratio in test:", y_test.mean())


# ---------------------------------------------------------
# STEP: Scale Numerical Features
# ---------------------------------------------------------
# Why scale?
# - Logistic Regression, SVMs, and many ML algorithms perform
#   better when numerical features share a similar scale.
#
# IMPORTANT:
# - Fit the scaler ONLY on the training data to avoid 
#   data leakage.
# - Apply the SAME scaler to transform the test set.
# ---------------------------------------------------------

from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler on training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the same scaling parameters
X_test_scaled = scaler.transform(X_test)


Train shape: (227845, 31)
Test shape: (56962, 31)
Fraud ratio in train: 0.001729245759178389
Fraud ratio in test: 0.0017204452090867595
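
These ratios can be cross-checked by hand. The sketch below derives the number of fraud cases in each split from the printed shapes and ratios, using the 492 fraud cases reported earlier. It is a sanity check on the output, not part of the original pipeline.

```python
# Values copied from the outputs above.
n_train, n_test = 227845, 56962
train_ratio = 0.001729245759178389
test_ratio = 0.0017204452090867595
total_fraud = 492

# Ratio * rows gives the number of fraud cases in each split.
fraud_train = round(train_ratio * n_train)
fraud_test = round(test_ratio * n_test)

print("Fraud cases in train:", fraud_train)
print("Fraud cases in test: ", fraud_test)
print("Splits account for all fraud cases:", fraud_train + fraud_test == total_fraud)
print("Test share of fraud cases:", round(fraud_test / total_fraud, 4))
```

The test share of fraud cases comes out close to the requested `test_size=0.2`, which is what `stratify=y` is meant to achieve.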


## 6. Save Preprocessed Data for Modeling

To keep the workflow modular and reproducible, I save the preprocessed training and test sets to disk.  
The modeling notebook (`02_modeling.ipynb`) will load the saved feature arrays and labels instead of re-running all preprocessing steps.



In [11]:
# ---------------------------------------------------------
# SAVE PREPROCESSED DATA FOR MODELING
# ---------------------------------------------------------
# Why save?
# - Keeps preprocessing and modeling decoupled
# - Allows the modeling notebook to run independently
# - Mirrors real ML workflows where data prep and modeling
#   happen in separate steps or services.
#
# We save:
#   - X_train_scaled, X_test_scaled: input features
#   - y_train, y_test: labels
# ---------------------------------------------------------

import os
import joblib

# Ensure the processed data directory exists
os.makedirs("data/processed", exist_ok=True)

# Save the preprocessed arrays
joblib.dump(X_train_scaled, "data/processed/X_train_scaled.pkl")
joblib.dump(X_test_scaled, "data/processed/X_test_scaled.pkl")
joblib.dump(y_train, "data/processed/y_train.pkl")
joblib.dump(y_test, "data/processed/y_test.pkl")

print("Saved preprocessed data to 'data/processed/'.")


Saved preprocessed data to 'data/processed/'.
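
On the other side of this hand-off, the modeling notebook would read the files back with `joblib.load`. The sketch below demonstrates the round trip with small stand-in arrays in a temporary directory, so it runs without the real dataset; the file names mirror the ones used above, and everything else is illustrative. The fitted `scaler` is not saved in the cell above; persisting it the same way would be one option if new transactions need to be transformed later.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in arrays; the real notebook saves X_train_scaled, y_train, etc.
X_demo = np.array([[0.5, -1.2], [1.5, 0.3], [-0.7, 0.9]])
y_demo = np.array([0, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "data", "processed")
    os.makedirs(out_dir, exist_ok=True)

    # Save, as in the preprocessing notebook.
    joblib.dump(X_demo, os.path.join(out_dir, "X_train_scaled.pkl"))
    joblib.dump(y_demo, os.path.join(out_dir, "y_train.pkl"))

    # Load, as the modeling notebook would.
    X_loaded = joblib.load(os.path.join(out_dir, "X_train_scaled.pkl"))
    y_loaded = joblib.load(os.path.join(out_dir, "y_train.pkl"))

print("Features match:", np.array_equal(X_demo, X_loaded))
print("Labels match:  ", np.array_equal(y_demo, y_loaded))
```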


## 7. Next Steps

With the data prepared, standardized, and split into training and test sets, the next steps will be performed in a separate modeling notebook:

- Train a baseline Logistic Regression model  
- Evaluate performance using imbalanced-data metrics  
  (ROC AUC, Precision, Recall, F1-score, PR AUC)  
- Visualize the ROC and Precision-Recall curves  
- Train and compare a tree-based model such as XGBoost  
- Discuss how threshold selection affects fraud detection performance  

This concludes the preprocessing stage of the workflow.
