# **Part 2: DATA PROCESSING**
**Objective:** Clean, transform, and prepare data for the model using **NumPy only**.

**Pipeline Steps:**
1. Load Data & Handle Missing Values.
2. **Feature Engineering:** Log-transform 'Amount' and extract 'Hour' from 'Time'.
3. **Stratified Split:** Split data into Train/Val/Test while preserving the Fraud ratio.
4. **Normalization:** Apply Z-score scaling (Fit on Train -> Transform Val/Test) to prevent data leakage.
5. Save processed data for modeling.


## **Setup & Import**

In [1]:
import sys
import os
import numpy as np
sys.path.append(os.path.abspath('..'))

from src.data_processing import (
    load_csv_numpy, 
    impute_missing, 
    extract_time_features, 
    stratified_split, 
    save_processed
)

---

## **Load and Clean Data**
We load the raw CSV and check for missing values. If any exist, we fill them with the column median.

In [2]:
# Load data
header, data = load_csv_numpy('../data/raw/creditcard.csv')

X = data[:, :-1]
y = data[:, -1]

print(f"Original X shape: {X.shape}")
print(f"Original y shape: {y.shape}")

# Handle Missing Values
# Although this dataset is usually clean, we apply this for robustness
X_clean = impute_missing(X, strategy='median')
print("Missing values handled.")

Original X shape: (284807, 30)
Original y shape: (284807,)
Missing values handled.
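
For readers without access to `src.data_processing`, median imputation can be sketched in plain NumPy as follows. This is only an illustration: `impute_median_sketch` is a hypothetical name, and the project's `impute_missing` may behave differently (for example, in how it treats a column that is entirely missing).

```python
import numpy as np

def impute_median_sketch(X):
    """Hypothetical helper: fill NaNs with the median of their column."""
    X = np.array(X, dtype=float)           # copy, so the caller's array is untouched
    medians = np.nanmedian(X, axis=0)      # per-column median, ignoring NaNs
    rows, cols = np.where(np.isnan(X))     # coordinates of the missing entries
    X[rows, cols] = medians[cols]          # each gap receives its column's median
    return X

demo = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, np.nan],
                 [5.0, 40.0]])
filled = impute_median_sketch(demo)
print(filled)
```

The median is used rather than the mean because it is less sensitive to extreme values, which matters for skewed columns such as `Amount`.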


---

## **Feature Engineering**
Based on our EDA (Notebook `01_data_exploration.ipynb`), we know:
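- **Amount:** Highly right-skewed. We apply `log(1 + x)` to compress the long tail and bring the distribution closer to normal.
- **Time:** Raw seconds elapsed since the first transaction carry little signal on their own. We convert them to "Hour of Day" (0-23) to capture the 2 AM fraud pattern.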
- **V1-V28:** Already PCA-transformed, so we keep them as is.

In [3]:
# 1. Extract V1-V28 (Indices 1 to 28)
v_features = X_clean[:, 1:29]

# 2. Transform Amount (Index 29)
# Use log1p so that zero amounts map to 0 (plain log(0) would give -inf)
amount_col = X_clean[:, 29]
amount_log = np.log1p(amount_col).reshape(-1, 1)

# 3. Transform Time (Index 0)
time_col = X_clean[:, 0]
hour_feature = extract_time_features(time_col)

# 4. Concatenate Features
# New structure: [V1...V28, LogAmount, Hour]
X_engineered = np.hstack([v_features, amount_log, hour_feature])

print(f"Shape after Feature Engineering: {X_engineered.shape}")
print("New Feature Order: V1-V28 (0-27), LogAmount (28), Hour (29)")

Shape after Feature Engineering: (284807, 30)
New Feature Order: V1-V28 (0-27), LogAmount (28), Hour (29)
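
The two transformations can be illustrated on a handful of values. The sketch below assumes that the hour is obtained by integer-dividing the elapsed seconds by 3600 and wrapping at 24; `hour_of_day_sketch` is a hypothetical stand-in, and the real `extract_time_features` may differ. Under this assumption, hours are counted from the first transaction in the file.

```python
import numpy as np

def hour_of_day_sketch(seconds):
    """Hypothetical helper: elapsed seconds -> hour bucket (0-23), as a column."""
    hours = (np.asarray(seconds, dtype=float) // 3600) % 24
    return hours.reshape(-1, 1)            # column vector, ready for np.hstack

seconds = np.array([0.0, 3599.0, 7200.0, 86400.0, 90000.0])
amounts = np.array([0.0, 1.0, 99.0, 250.0, 25691.16])

hours = hour_of_day_sketch(seconds)                 # second day wraps back to hour 0
log_amounts = np.log1p(amounts).reshape(-1, 1)      # log(1 + x), finite at x = 0

features = np.hstack([log_amounts, hours])
print(features)
```

Both results are reshaped to column vectors because `np.hstack` needs 2-D inputs with the same number of rows.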


---

## **Stratified Train-Val-Test Split**
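Because fraud cases are rare (0.17%), a purely random split can leave the fraud ratio noticeably different between the Train, Validation, and Test sets, and the smaller sets may end up with very few fraud cases.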

We use **Stratified Splitting** to ensure the fraud ratio is consistent across all sets.
- **Train:** 70%
- **Validation:** 15%
- **Test:** 15%

In [4]:
# Note: We split BEFORE Normalization to avoid Data Leakage
(X_train_raw, y_train), (X_val_raw, y_val), (X_test_raw, y_test) = stratified_split(
    X_engineered, y, 
    train_frac=0.7, val_frac=0.15, test_frac=0.15, 
    seed=42
)

print(f"Train Shape: {X_train_raw.shape} | Fraud Count: {np.sum(y_train==1)}")
print(f"Val Shape:   {X_val_raw.shape} | Fraud Count: {np.sum(y_val==1)}")
print(f"Test Shape:  {X_test_raw.shape} | Fraud Count: {np.sum(y_test==1)}")

Train Shape: (199364, 30) | Fraud Count: 344
Val Shape:   (42720, 30) | Fraud Count: 73
Test Shape:  (42723, 30) | Fraud Count: 75
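
One way to implement such a split with NumPy alone is to shuffle the indices of each class separately and cut each class at the same fractions. The function below is a sketch under that assumption, not the code in `src.data_processing`; its name, its `fracs` parameter, and its rounding rule are illustrative choices.

```python
import numpy as np

def stratified_split_sketch(X, y, fracs=(0.7, 0.15, 0.15), seed=42):
    """Hypothetical helper: train/val/test split that cuts every class alike."""
    rng = np.random.default_rng(seed)
    parts = [[], [], []]                        # index chunks for train, val, test
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_train = int(round(fracs[0] * idx.size))
        n_val = int(round(fracs[1] * idx.size))
        parts[0].append(idx[:n_train])
        parts[1].append(idx[n_train:n_train + n_val])
        parts[2].append(idx[n_train + n_val:])  # the remainder goes to test
    splits = []
    for chunks in parts:
        idx = rng.permutation(np.concatenate(chunks))  # mix the classes again
        splits.append((X[idx], y[idx]))
    return splits

# Toy data: 1000 rows, of which 2% are positive
y_toy = np.zeros(1000)
y_toy[:20] = 1.0
X_toy = np.arange(1000, dtype=float).reshape(-1, 1)

(tr_X, tr_y), (va_X, va_y), (te_X, te_y) = stratified_split_sketch(X_toy, y_toy)
for name, labels in [("train", tr_y), ("val", va_y), ("test", te_y)]:
    print(name, labels.size, labels.mean())
```

Because every class is cut at the same fractions, the positive rate in each subset stays close to the overall rate, up to rounding of the per-class counts.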


---

## **Normalization (Standardization)**
We apply **Z-score Standardization**: $z = \frac{x - \mu}{\sigma}$.

**Crucial Rule:** We calculate $\mu$ (mean) and $\sigma$ (std) using **ONLY the Training Set**. 

Then we use those values to transform the Validation and Test sets. This mirrors deployment, where the statistics of future data are unknown at the time the model is fitted.

In [5]:
# 1. Calculate Mean and Std from TRAIN set
mean_train = X_train_raw.mean(axis=0)
std_train = X_train_raw.std(axis=0)

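# Avoid division by zero: a constant feature has std 0, so divide by 1 and let it map to 0
# (a tiny epsilon would inflate any Val/Test value that deviates from the constant)
std_train[std_train == 0] = 1.0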

# 2. Apply to Train
X_train = (X_train_raw - mean_train) / std_train

# 3. Apply to Val and Test using TRAIN statistics
X_val = (X_val_raw - mean_train) / std_train
X_test = (X_test_raw - mean_train) / std_train

print("Normalization complete.")
print(f"Train Mean (approx 0): {np.mean(X_train):.4f}")
print(f"Train Std  (approx 1): {np.std(X_train):.4f}")

Normalization complete.
Train Mean (approx 0): -0.0000
Train Std  (approx 1): 1.0000
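
The check above averages over all features at once. A per-feature check is more informative, because a single badly scaled column can be hidden by the others. The sketch below shows the same fit-on-train, transform-everything pattern on synthetic data; `fit_standardizer` and `apply_standardizer` are hypothetical helper names, not functions from this project.

```python
import numpy as np

def fit_standardizer(X_fit):
    """Hypothetical helper: per-feature mean and std from the fitting set only."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)
    std = np.where(std == 0, 1.0, std)     # constant columns: divide by 1 instead of 0
    return mean, std

def apply_standardizer(X, mean, std):
    """Apply previously fitted statistics to any split."""
    return (X - mean) / std

rng = np.random.default_rng(0)
loc, scale = [5.0, -2.0, 100.0], [1.0, 10.0, 0.1]
train = rng.normal(loc=loc, scale=scale, size=(1000, 3))
val = rng.normal(loc=loc, scale=scale, size=(200, 3))

mean, std = fit_standardizer(train)        # statistics come from train only
train_z = apply_standardizer(train, mean, std)
val_z = apply_standardizer(val, mean, std)

print(train_z.mean(axis=0))   # every entry close to 0
print(train_z.std(axis=0))    # every entry close to 1
print(val_z.mean(axis=0))     # near 0, but not exact: val was not used for fitting
```

The validation means and standard deviations are expected to deviate slightly from 0 and 1. That is the intended behaviour, not a sign of a bug.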


In [6]:
# Final Check before saving
print("Checking for NaNs in final arrays...")
if np.isnan(X_train).any() or np.isnan(X_val).any() or np.isnan(X_test).any():
    print("WARNING: NaNs detected!")
else:
    print("Data is clean.")

Checking for NaNs in final arrays...
Data is clean.
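
`np.isnan` does not flag infinite values, which can also appear after a division or a logarithm. A slightly stricter check can be sketched with `np.isfinite`; `count_non_finite` is a hypothetical helper used only for illustration.

```python
import numpy as np

def count_non_finite(arr):
    """Hypothetical helper: number of NaN or +/-inf entries in an array."""
    return int(arr.size - np.isfinite(arr).sum())

clean = np.array([[0.0, 1.0], [2.0, 3.0]])
with_nan = np.array([[0.0, np.nan], [2.0, 3.0]])
with_inf = np.array([[0.0, 1.0], [np.inf, 3.0]])

for name, arr in [("clean", clean), ("with_nan", with_nan), ("with_inf", with_inf)]:
    print(name, count_non_finite(arr))

# An isnan-only check would report with_inf as clean
print(np.isnan(with_inf).any())
```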


---

## **Save Processed Data**
Save everything into a compressed `.npz` file for the Modeling notebook.

In [7]:
savepath = save_processed('../data/processed', 
                          X_train=X_train, y_train=y_train,
                          X_val=X_val, y_val=y_val,
                          X_test=X_test, y_test=y_test)

print(f"Processed data saved successfully at: {savepath}")

Processed data saved successfully at: ../data/processed/processed.npz
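
The Modeling notebook can read the archive back with `np.load`. The round trip below is a sketch that assumes `save_processed` stores each keyword argument under its own name, as `np.savez_compressed` does; it writes to a temporary directory so that it does not touch the project files.

```python
import os
import tempfile
import numpy as np

X_demo = np.arange(6, dtype=float).reshape(3, 2)
y_demo = np.array([0.0, 1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed.npz")
    # Assumption: the project's save function behaves like this call
    np.savez_compressed(path, X_train=X_demo, y_train=y_demo)

    with np.load(path) as archive:
        keys = sorted(archive.files)       # names of the stored arrays
        X_loaded = archive["X_train"]      # arrays are read when indexed
        y_loaded = archive["y_train"]

print(keys)
print(X_loaded.shape, y_loaded.shape)
```

Opening the archive in a `with` block closes the file handle once the arrays have been read into memory.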
