# Data Preprocessing 

In this notebook, we prepare the dataset for modeling by standardizing the `Amount` feature, engineering new features, and splitting the data into training, validation, and test sets.

### Main steps:
- Normalize relevant features
- Create engineered variables
- Split the dataset into training, validation and test sets
- Save the processed data for reuse


In [2]:
import os
import sys

# Add the parent directory to the import path so the `src` package can be imported
sys.path.append(os.path.abspath(".."))

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from src.data_loader import load_raw_data, save_processed_data

## Loading the raw dataset

In [3]:
df = load_raw_data()
df.shape

(284807, 31)

## Normalize Amount column

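The `Amount` feature is standardized with `StandardScaler` so that it has zero mean and unit variance. The result is stored in a new column, `Amount_Scaled`, and the original `Amount` column is kept. Standardizing helps prevent scale-sensitive models from giving too much weight to high-magnitude features.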

In [4]:
scaler = StandardScaler()

df["Amount_Scaled"] = scaler.fit_transform(df[["Amount"]])
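
As a quick sanity check on what standardization does, the sketch below applies the same step to synthetic data. The `toy` frame and its exponential amounts are made up for illustration and are not the notebook's dataset. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), so the check uses `ddof=0` as well.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic, right-skewed amounts standing in for the real `Amount` column
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Amount": rng.exponential(scale=90.0, size=1_000)})

toy_scaler = StandardScaler()
toy["Amount_Scaled"] = toy_scaler.fit_transform(toy[["Amount"]]).ravel()

# After scaling, the mean is ~0 and the population std (ddof=0) is ~1
scaled_mean = toy["Amount_Scaled"].mean()
scaled_std = toy["Amount_Scaled"].std(ddof=0)

# The fitted statistics stay on the scaler and can be reused on new data
fitted_mean = toy_scaler.mean_[0]
fitted_std = toy_scaler.scale_[0]

print(f"mean={scaled_mean:.2e}, std={scaled_std:.6f}")
```

Because the fitted mean and scale are stored on the scaler object, the same transformation can later be applied to unseen rows with `toy_scaler.transform(...)`.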

## Feature engineering

We will now create new variables to enrich the dataset. These features are designed to highlight behavior over time or relative to other transactions, which can help improve model performance.

The following features will be added:
- `Hour`: hour of the transaction, derived from the `Time` feature (in seconds) and wrapped to the range 0 to 23
- `Amount_to_mean_ratio`: transaction amount divided by the global mean amount
- `Amount_to_std_ratio`: z-score of the amount, i.e. its distance from the global mean measured in global standard deviations

In [5]:
# Convert Time (in seconds) to transaction hour
df["Hour"] = (df["Time"] // 3600) % 24

# Global statistics for amount
mean_amount = df["Amount"].mean()
std_amount = df["Amount"].std()

# Create new features
df["Amount_to_mean_ratio"] = df["Amount"] / mean_amount
df["Amount_to_std_ratio"] = (df["Amount"] - mean_amount) / std_amount
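
To make the three formulas concrete, here is a small worked example on four hypothetical transactions. The values are invented, and the sketch assumes `Time` counts seconds elapsed since the first record, so `Hour` is an offset from that starting point rather than a guaranteed clock hour.

```python
import pandas as pd

# Hypothetical transactions (not taken from the dataset)
toy = pd.DataFrame({
    "Time": [0.0, 3599.0, 3600.0, 90000.0],
    "Amount": [10.0, 20.0, 30.0, 140.0],
})

# 90000 s is 25 full hours, which wraps around to hour 1
toy["Hour"] = ((toy["Time"] // 3600) % 24).astype(int)   # -> [0, 0, 1, 1]

toy_mean = toy["Amount"].mean()   # 50.0
toy_std = toy["Amount"].std()     # pandas default: sample std (ddof=1)

toy["Amount_to_mean_ratio"] = toy["Amount"] / toy_mean   # -> [0.2, 0.4, 0.6, 2.8]
toy["Amount_to_std_ratio"] = (toy["Amount"] - toy_mean) / toy_std

print(toy)
```

The integer cast on `Hour` is a choice made for this sketch; the cell above keeps the floating-point result of the floor division.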

In [7]:
df[["Amount_to_mean_ratio", "Amount_to_std_ratio"]].describe()

| | Amount_to_mean_ratio | Amount_to_std_ratio |
|---|---|---|
| count | 284807.0 | 284807.0 |
| mean | 1.0 | 3.193372e-17 |
| std | 2.831026 | 1.0 |
| min | 0.0 | -0.3532288 |
| 25% | 0.063385 | -0.3308395 |
| 50% | 0.249011 | -0.265271 |
| 75% | 0.873405 | -0.04471699 |
| max | 290.789708 | 102.3621 |
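
The summary shows that both engineered amount features are linear rescalings of `Amount`. For instance, the minimum z-score (about -0.353) equals -1 / 2.831, which is what one expects when the smallest amount is 0. The sketch below, run on synthetic amounts rather than the real data, illustrates that `Amount_to_std_ratio` matches a `StandardScaler` output up to a constant factor and that all three columns are perfectly correlated. Whether this redundancy matters depends on the downstream model; the check is offered as a diagnostic only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic amounts for illustration only
rng = np.random.default_rng(1)
toy = pd.DataFrame({"Amount": rng.exponential(scale=90.0, size=500)})

toy["Amount_Scaled"] = StandardScaler().fit_transform(toy[["Amount"]]).ravel()
toy["Amount_to_std_ratio"] = (toy["Amount"] - toy["Amount"].mean()) / toy["Amount"].std()
toy["Amount_to_mean_ratio"] = toy["Amount"] / toy["Amount"].mean()

# pandas .std() uses ddof=1 while StandardScaler uses ddof=0, so the two
# z-scores differ only by the constant factor sqrt((n - 1) / n)
n = len(toy)
factor = np.sqrt((n - 1) / n)
same_up_to_factor = np.allclose(
    toy["Amount_to_std_ratio"], toy["Amount_Scaled"] * factor
)

# Pairwise Pearson correlations between the three derived columns
corr = toy[["Amount_Scaled", "Amount_to_std_ratio", "Amount_to_mean_ratio"]].corr()

print(same_up_to_factor)
print(corr.round(6))
```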


## Data splitting

We split the dataset into three subsets:
- **80% for training**
- **10% for validation (dev set)**
- **10% for testing**

With 284,807 transactions, the 20% hold-out contains 56,962 rows, which the second split halves into 28,481 samples each for validation and testing. This is large enough to support stable model tuning and an unbiased final evaluation.

Stratified sampling is used to preserve the original fraud/non-fraud ratio in all subsets.
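
The effect of stratification can be sketched on a small synthetic label vector. The 0.2% positive rate and the array names below are assumptions chosen for illustration; they are not the dataset's actual class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positives out of 10,000 rows
n_rows = 10_000
toy_y = np.zeros(n_rows, dtype=int)
toy_y[:20] = 1
toy_X = np.arange(n_rows).reshape(-1, 1)

# First split: 80% train, 20% held out
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=7
)
# Second split: held-out part divided evenly into validation and test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=7
)

sizes = (len(y_tr), len(y_va), len(y_te))
positives = (int(y_tr.sum()), int(y_va.sum()), int(y_te.sum()))
rates = tuple(p / s for p, s in zip(positives, sizes))

print(sizes, positives, rates)
```

Each subset keeps the same positive rate as the full array, which is the property the stratified split is meant to guarantee.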

In [8]:
# Separate the features from the target label
X = df.drop(columns=["Class"])
y = df["Class"]

# First split: 80% training, 20% held out
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

# Second split: divide the held-out 20% evenly into validation and test (10% each)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=7
)

# Save each split with its labels for reuse
save_processed_data(pd.concat([X_train, y_train], axis=1), split="train")
save_processed_data(pd.concat([X_val, y_val], axis=1), split="val")
save_processed_data(pd.concat([X_test, y_test], axis=1), split="test")
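
In this notebook the scaler and the global amount statistics are computed on the full dataset before splitting, so the validation and test rows contribute to those statistics. One possible variant, sketched here on synthetic data with hypothetical names, fits the scaler on the training split only and then applies it to the other two. This is an alternative ordering for comparison, not the approach used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (values are made up)
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "Amount": rng.exponential(scale=90.0, size=2_000),
    "Class": (rng.random(2_000) < 0.05).astype(int),
})

# Split first, using the same 80/10/10 stratified scheme
toy_train, toy_temp = train_test_split(
    toy, test_size=0.2, stratify=toy["Class"], random_state=7
)
toy_val, toy_test = train_test_split(
    toy_temp, test_size=0.5, stratify=toy_temp["Class"], random_state=7
)
toy_train, toy_val, toy_test = toy_train.copy(), toy_val.copy(), toy_test.copy()

# Fit on the training rows only, then reuse the fitted statistics everywhere
split_scaler = StandardScaler().fit(toy_train[["Amount"]])
for part in (toy_train, toy_val, toy_test):
    part["Amount_Scaled"] = split_scaler.transform(part[["Amount"]]).ravel()

print(toy_train["Amount_Scaled"].mean(), toy_val["Amount_Scaled"].mean())
```

With this ordering only the training split is guaranteed to have exactly zero mean after scaling; the validation and test means will be close to zero but not identical.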
