In [1]:
import pandas as pd
import numpy as np
import sys
import os 

sys.path.append(os.path.abspath("../src/"))
from utils.utils import load_data
from core.DataTransformer import FraudPreprocessor

In [2]:
fraud_df = load_data("../data/processed/cleaned_fraud_data.csv")
creditcard_df = load_data("../data/processed/cleaned_creditcard_data.csv")

print("Both datasets loaded successfully✅")

Both datasets loaded successfully✅


# Feature Engineering for fraud data
- Transaction **frequency** and **velocity** for Fraud_Data.csv
- **Time-based** features for Fraud_Data.csv
    - `purchase_hour` (hour of day)
    - `purchase_dayofweek` (day of week)
- **time_since_signup** (duration between `signup_time` and `purchase_time`)

In [3]:
fraud_df['signup_time'] = pd.to_datetime(fraud_df['signup_time'])
fraud_df['purchase_time'] = pd.to_datetime(fraud_df['purchase_time'])
print("Converted signup_time and purchase_time to datetime successfully✅")

Converted signup_time and purchase_time to datetime successfully✅


In [4]:
# Extract time-related components from purchase_time
fraud_df['purchase_hour'] = fraud_df['purchase_time'].dt.hour
fraud_df['purchase_dayofweek'] = fraud_df['purchase_time'].dt.dayofweek  # Monday=0, Sunday=6

# Calculate account age at purchase: time between signup and purchase
fraud_df['time_since_signup'] = (fraud_df['purchase_time'] - fraud_df['signup_time']).dt.total_seconds() / 3600  # in hours
print("Extracted purchase_hour, purchase_dayofweek, and time_since_signup successfully✅")
print("Time-based features created")


Extracted purchase_hour, purchase_dayofweek, and time_since_signup successfully✅
Time-based features created


## What These New Features Capture:
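- purchase_hour: hour of day of the purchase (0–23), which can reveal hourly transaction patterns (e.g., fraud spikes at odd hours)
- purchase_dayofweek: day of week of the purchase (Monday=0, Sunday=6), which captures weekday fraud tendencies
- time_since_signup: hours between signup and purchase; very small values flag unusually fast purchases (e.g., fraudsters exploiting new accounts)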
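
As a quick sanity check, the same three features can be computed on a couple of hand-made rows. The timestamps below are invented for illustration and do not come from the dataset; only the column names match the notebook.

```python
import pandas as pd

# Toy rows (made up for illustration, not taken from Fraud_Data.csv)
toy = pd.DataFrame({
    "signup_time": pd.to_datetime(["2015-01-01 00:00:00", "2015-01-05 12:00:00"]),
    "purchase_time": pd.to_datetime(["2015-01-01 00:00:01", "2015-01-12 18:30:00"]),
})

toy["purchase_hour"] = toy["purchase_time"].dt.hour
toy["purchase_dayofweek"] = toy["purchase_time"].dt.dayofweek  # Monday=0, Sunday=6
toy["time_since_signup"] = (
    (toy["purchase_time"] - toy["signup_time"]).dt.total_seconds() / 3600  # in hours
)

print(toy[["purchase_hour", "purchase_dayofweek", "time_since_signup"]])
```

The first row buys one second after signing up, so its `time_since_signup` is close to zero, which is the kind of value this feature is meant to surface.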


In [5]:
# Transaction frequency per user
txn_counts = fraud_df.groupby('user_id')['purchase_time'].count()
fraud_df['user_txn_count'] = fraud_df['user_id'].map(txn_counts)

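# Velocity: active span (hours between a user's first and last purchase) divided by transaction count.
# Users with a single transaction have a zero span, so their velocity is 0.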
user_time_range = fraud_df.groupby('user_id')['purchase_time'].agg(['min', 'max'])
user_time_range['active_span_hours'] = (user_time_range['max'] - user_time_range['min']).dt.total_seconds() / 3600
user_velocity = user_time_range['active_span_hours'] / txn_counts
fraud_df['user_txn_velocity'] = fraud_df['user_id'].map(user_velocity.fillna(0))

print("Transaction frequency and velocity features created successfully✅")

Transaction frequency and velocity features created successfully✅


In [6]:
# fraud data with new features
fraud_df.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class,lower_bound_ip_address,upper_bound_ip_address,country,purchase_hour,purchase_dayofweek,time_since_signup,user_txn_count,user_txn_velocity
0,62421,2015-02-16 00:17:05,2015-03-08 10:00:39,46,ZCLZTAJPCRAQX,Direct,Safari,M,36,52093.496895,0,,,Unknown,10,6,489.726111,1,0.0
1,173212,2015-03-08 04:03:22,2015-03-20 17:23:45,33,YFGYOALADBHLT,Ads,IE,F,30,93447.138961,0,,,Unknown,17,4,301.339722,1,0.0
2,242286,2015-05-17 16:45:54,2015-05-26 08:54:34,33,QZNVQTUITFTHH,Direct,FireFox,F,32,105818.501505,0,,,Unknown,8,1,208.144444,1,0.0
3,370003,2015-03-03 19:58:39,2015-05-28 21:09:13,33,PIBUQMBIELMMG,Ads,IE,M,40,117566.664867,0,,,Unknown,21,3,2065.176111,1,0.0
4,119824,2015-03-20 00:31:27,2015-04-05 07:31:46,55,WFIIFCPIOGMHT,Ads,Safari,M,38,131423.789042,0,,,Unknown,7,6,391.005278,1,0.0


## What These New Features Reveal:
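- user_txn_count: how many transactions each user made; very low counts may indicate throwaway or bot accounts, while high counts may mean habitual users or repeated attacks
- user_txn_velocity: hours of activity per transaction (the span between a user's first and last purchase divided by their transaction count); it is 0 for users with a single transaction, and small positive values indicate bursts of purchases in a short window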
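
To make the two behavioral features concrete, here is a small sketch that repeats the groupby logic from the cell above on invented data. The user IDs and timestamps are hypothetical.

```python
import pandas as pd

# Toy transactions: user 1 buys three times within 12 hours, user 2 buys once
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "purchase_time": pd.to_datetime([
        "2015-03-01 00:00:00", "2015-03-01 06:00:00", "2015-03-01 12:00:00",
        "2015-03-02 09:00:00",
    ]),
})

# Transaction count per user
txn_counts = toy.groupby("user_id")["purchase_time"].count()

# Active span in hours between each user's first and last purchase
span = toy.groupby("user_id")["purchase_time"].agg(["min", "max"])
span_hours = (span["max"] - span["min"]).dt.total_seconds() / 3600

toy["user_txn_count"] = toy["user_id"].map(txn_counts)
toy["user_txn_velocity"] = toy["user_id"].map(span_hours / txn_counts)

print(toy)
```

User 1 gets a velocity of 12 / 3 = 4 hours per transaction and user 2 gets 0. Dividing by `txn_counts - 1` instead would give the mean gap between consecutive purchases (6 hours for user 1), but it is undefined for single-transaction users and would need explicit handling.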


In [7]:
# Before moving to data transformation, drop identifier, raw timestamp, and IP columns that are redundant or just noise
cols_to_be_removed = ["user_id", "signup_time", "purchase_time", "ip_address", "lower_bound_ip_address", "upper_bound_ip_address"]

fraud_df.drop(cols_to_be_removed, axis=1, inplace=True)
print("Successfully removed noisy and redundant columns✅")

Successfully removed noisy and redundant columns✅


In [8]:
# Fraud case distribution in the credit card data; this will be crucial when choosing an imbalance handling method later
creditcard_df["Class"].value_counts()

Class
0    283253
1       473
Name: count, dtype: int64

In [9]:
# fraud case distribution for the fraud data
fraud_df["class"].value_counts()

class
0    136961
1     14151
Name: count, dtype: int64

# Data Transformation Demonstration (Fraud & Credit Card Datasets)

This notebook serves as a demonstration of how we preprocess two distinct fraud detection datasets: `fraud_data` and `creditcard_data`. The transformations shown here are for illustrative purposes only — no train-test split or model training will be performed.

**Important:** In the actual pipeline, all transformations involving target-dependent statistics (like country-level fraud rates) and class balancing techniques (SMOTE or Random Undersampling) will be applied **only on training data** after splitting, to avoid data leakage. This demo is meant to visualize the preprocessing workflow and justify our design decisions.
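
As a rough illustration of the train-only rule, the sketch below learns country-level fraud rates from a training split and then applies them to a test split. The tiny frames, the `country_te` column name, and the fallback to the global training rate for unseen countries are all assumptions made for this example; they are not taken from `FraudPreprocessor`.

```python
import pandas as pd

# Hypothetical splits; in the real pipeline these come from a train/test split
train = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "C"],
    "class":   [1,   0,   0,   0,   0,   1],
})
test = pd.DataFrame({"country": ["A", "B", "D"]})  # "D" never appears in train

# Learn the target-dependent statistics from the training split only
global_rate = train["class"].mean()
country_rate = train.groupby("country")["class"].mean()

# Apply the learned map to both splits; unseen countries fall back to the global rate
train["country_te"] = train["country"].map(country_rate)
test["country_te"] = test["country"].map(country_rate).fillna(global_rate)

print(test)
```

Because the test labels are never touched, the encoded values in `test` cannot leak information about its own targets.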

We’ll walk through:
- Encoding strategies based on column characteristics
- Scaling and normalization choices
- Class imbalance handling rationale
- And reusable pipeline components for deployment

Let’s begin with the `fraud_data` transformation workflow.

In [11]:
# Separate features and target for the fraud dataset
X_fraud = fraud_df.drop(columns='class')
y_fraud = fraud_df['class']

print("Fraud class distribution:\n", y_fraud.value_counts())
fraud_df.head()

Fraud class distribution:
 class
0    136961
1     14151
Name: count, dtype: int64


Unnamed: 0,purchase_value,device_id,source,browser,sex,age,class,country,purchase_hour,purchase_dayofweek,time_since_signup,user_txn_count,user_txn_velocity
0,46,ZCLZTAJPCRAQX,Direct,Safari,M,36,0,Unknown,10,6,489.726111,1,0.0
1,33,YFGYOALADBHLT,Ads,IE,F,30,0,Unknown,17,4,301.339722,1,0.0
2,33,QZNVQTUITFTHH,Direct,FireFox,F,32,0,Unknown,8,1,208.144444,1,0.0
3,33,PIBUQMBIELMMG,Ads,IE,M,40,0,Unknown,21,3,2065.176111,1,0.0
4,55,WFIIFCPIOGMHT,Ads,Safari,M,38,0,Unknown,7,6,391.005278,1,0.0


## 🧠 Dataset Overview: fraud_data

This dataset combines user transaction behavior with contextual metadata (e.g. device, browser, country). We also include engineered temporal features (e.g. `purchase_hour`, `time_since_signup`) that will be used directly.

Notice the class distribution:
- Legitimate: 136,961
- Fraudulent: 14,151 (~9.4%)

This moderate imbalance informs our sampling choice: **Random Undersampling**. The dataset mixes numeric features with several categorical ones (`device_id`, `browser`, `country`, `source`, `sex`), and SMOTE would struggle to interpolate meaningfully between category values. Because the majority class is large, RUS offers a simple, effective way to balance the classes during training.
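
The arithmetic behind these figures, and what undersampling to a 1:1 ratio would leave, can be checked in a few lines. The counts are the ones printed above; the variable names are just for this example.

```python
# Class counts reported for fraud_data
legit, fraud = 136_961, 14_151
total = legit + fraud

fraud_share = fraud / total           # fraction of fraudulent rows
majority_ratio = legit / fraud        # legitimate rows per fraudulent row

# Undersampling to 1:1 keeps every fraud row plus an equal number of legitimate rows
rows_after_rus = 2 * fraud
share_discarded = 1 - rows_after_rus / total

print(f"fraud share: {fraud_share:.2%}")
print(f"legit per fraud: {majority_ratio:.1f}")
print(f"rows after RUS: {rows_after_rus} ({share_discarded:.1%} of rows discarded)")
```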

In [12]:
# Initialize preprocessor for fraud dataset
pre_fraud = FraudPreprocessor(mode='fraud_data')

# Fit on the full dataset for demonstration
pre_fraud.fit(X_fraud, y_fraud)

INFO:core.DataTransformer:Fitting FraudPreprocessor for mode: fraud_data
INFO:core.DataTransformer:ColumnTransformer fitted


## Fitting Preprocessor on fraud_data

In this step, I'm fitting the custom `FraudPreprocessor` on the full fraud dataset. This allows me to demonstrate how the encoding maps and scaling rules are learned—even though in the real pipeline, this would happen on the training split only.

Here’s what’s happening internally:
- **Frequency encoding** for `device_id`: This approach gives each device a numeric value based on how common it is. I chose this method because one-hot encoding would explode the dimensionality with 130K+ distinct device IDs, and label encoding could suggest false ordinal meaning.
- **Target encoding** for `country`: Each country is mapped to its fraud rate. I’m using this because the feature has 182 unique values, and I want the model to learn fraud patterns across geographies without blowing up the feature space.
- **Numerical scaling**: I apply `StandardScaler` to standardize each numerical feature to mean 0 and unit variance.
- **One-hot encoding**: Applied to compact categorical features (`source`, `browser`, `sex`) for interpretability and modeling convenience.

These choices are tuned to preserve signal while keeping the transformation scalable.
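
The internals of `FraudPreprocessor` are not shown in this notebook, so the following is only a pandas sketch of three of the ideas above (frequency encoding, standardization, one-hot encoding) on invented rows. Whether the real class encodes frequencies as proportions or raw counts is an assumption here.

```python
import pandas as pd

# Toy rows (hypothetical values; column names mirror the notebook)
toy = pd.DataFrame({
    "device_id": ["d1", "d1", "d2", "d3"],
    "source": ["Ads", "SEO", "Ads", "Direct"],
    "purchase_value": [10.0, 20.0, 30.0, 40.0],
})

# Frequency encoding: replace each device_id with the share of rows that carry it
device_freq = toy["device_id"].value_counts(normalize=True)
toy["device_id_freq"] = toy["device_id"].map(device_freq)

# Standardization using the population std (ddof=0), the convention StandardScaler uses
mean = toy["purchase_value"].mean()
std = toy["purchase_value"].std(ddof=0)
toy["purchase_value_scaled"] = (toy["purchase_value"] - mean) / std

# One-hot encoding for a low-cardinality categorical column
encoded = pd.get_dummies(
    toy.drop(columns=["device_id", "purchase_value"]), columns=["source"]
)

print(encoded)
```

A high-cardinality column collapses to one numeric column under frequency encoding, while `source` expands to only three indicator columns, which is why the two are treated differently.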

In [13]:
# Transform the full dataset (demo only)
X_transformed_fraud = pre_fraud.transform(X_fraud)

# Apply Random Undersampling for demonstration
X_sampled_fraud, y_sampled_fraud = pre_fraud.sample(X_transformed_fraud, y_fraud)

print("Resampled class distribution:")
print(pd.Series(y_sampled_fraud).value_counts())

INFO:core.DataTransformer:Transforming data with trained encoders/scalers
INFO:core.DataTransformer:Transformation complete → shape: (151112, 16)
INFO:core.DataTransformer:Applying sampling method: auto
INFO:core.DataTransformer:Resampled data → new shape: (28302, 16)


Resampled class distribution:
class
0    14151
1    14151
Name: count, dtype: int64


## Balancing Classes with Random Undersampling (RUS)

Fraud detection datasets are notoriously imbalanced—and `fraud_data` is no exception. Here, about 91% of transactions are legitimate, which causes models to bias toward predicting the majority class unless corrected.

I chose **Random Undersampling** for this dataset because:
- The majority class is large enough (~137K rows) that a random subset of 14,151 legitimate rows should still be reasonably representative.
- Many features are categorical (like `device_id`, `country`), which SMOTE struggles to synthesize reliably.
- It’s a simple and effective method that trims the majority class to match the fraud count, leading to a more balanced learning environment.

I applied the sampling directly after transforming the features to ensure the class ratio is visible in the demo—even though, in practice, this would only affect training data.
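
For readers who want to see the mechanics, here is a minimal NumPy/pandas sketch of random undersampling. It is not the code behind `pre_fraud.sample`; the function name `random_undersample`, its `seed` parameter, and the toy data are assumptions for illustration, and the sketch expects `X` and `y` to share a unique index.

```python
import numpy as np
import pandas as pd

def random_undersample(X, y, seed=42):
    """Randomly trim every class down to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    counts = y.value_counts()
    n_min = counts.min()
    keep = []
    for label in counts.index:
        # Index labels of all rows belonging to this class
        idx = y.index[(y == label).to_numpy()].to_numpy()
        # Draw n_min rows without replacement (all rows for the smallest class)
        keep.append(rng.choice(idx, size=n_min, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X.loc[keep], y.loc[keep]

# Toy data: 8 legitimate rows and 2 fraud rows
X_toy = pd.DataFrame({"f": range(10)})
y_toy = pd.Series([0] * 8 + [1] * 2, name="class")

X_bal, y_bal = random_undersample(X_toy, y_toy)
print(y_bal.value_counts())
```

Every fraud row survives, and the legitimate class shrinks to the same size, which mirrors the 14,151 / 14,151 split shown in the output above.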

In [15]:
# Separate features and target for the credit card dataset
X_credit = creditcard_df.drop(columns='Class')
y_credit = creditcard_df['Class']

print("Creditcard class distribution:\n", y_credit.value_counts())

Creditcard class distribution:
 Class
0    283253
1       473
Name: count, dtype: int64


## Data Transformation Demo: creditcard_data

Now I’m demonstrating the preprocessing pipeline for the `creditcard_data`, which contains anonymized numeric features (V1–V28 via PCA) and a transaction `Amount` column. Unlike the fraud dataset, this one is **fully numerical** and extremely imbalanced:

- Legitimate: 283,253
- Fraudulent: 473 (~0.17%)

Given this severe imbalance, I’ll apply **SMOTE (Synthetic Minority Oversampling Technique)** to boost the minority class during training. It’s ideal here because:
- All features are numeric → SMOTE can interpolate safely
- We can enrich the fraud class without losing valuable normal transactions
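
A quick calculation shows what each balancing option would mean for this dataset. The counts are the ones listed above; the variable names are just for this example.

```python
# Class counts reported for creditcard_data
legit, fraud = 283_253, 473
total = legit + fraud

fraud_share = fraud / total

# Oversampling to 1:1 must create enough synthetic fraud rows to match the majority class
synthetic_needed = legit - fraud
total_after_smote = 2 * legit

# Undersampling to 1:1 would keep only as many legitimate rows as there are fraud rows
rows_if_undersampled = 2 * fraud

print(f"fraud share: {fraud_share:.3%}")
print(f"synthetic fraud rows needed: {synthetic_needed}")
print(f"rows after oversampling: {total_after_smote}")
print(f"rows if undersampled instead: {rows_if_undersampled}")
```

Undersampling would leave fewer than a thousand rows in total, whereas oversampling keeps every real transaction at the cost of roughly 598 synthetic fraud rows per real one.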



In [16]:
# Initialize preprocessor for credit card data
pre_credit = FraudPreprocessor(mode='creditcard_data')

# Fit on full dataset (demo only)
pre_credit.fit(X_credit, y_credit)

INFO:core.DataTransformer:Fitting FraudPreprocessor for mode: creditcard_data
INFO:core.DataTransformer:ColumnTransformer fitted


## Why SMOTE Is Best for creditcard_data

This dataset is extremely imbalanced, with only 473 fraud cases out of nearly 284K rows. Undersampling to match the fraud count would keep just 473 legitimate rows and discard more than 99% of the non-fraud data.

Instead, I’m using **SMOTE**, which works best when:
- Data is **fully numeric**, avoiding category interpolation issues
- Minority class is **very small**, and every real fraud case is valuable
- The feature space is clean (e.g. PCA-based features don’t have nulls or strings)

This lets me strengthen the learning signal in the rare class while keeping every majority-class row. The preprocessor was fitted above, so the next step applies the transformation and the resampling.

In [17]:
# Transform the full dataset
X_transformed_credit = pre_credit.transform(X_credit)

# Apply SMOTE for balancing
X_sampled_credit, y_sampled_credit = pre_credit.sample(X_transformed_credit, y_credit)

print("Resampled class distribution:")
print(pd.Series(y_sampled_credit).value_counts())

INFO:core.DataTransformer:Transforming data with trained encoders/scalers
INFO:core.DataTransformer:Transformation complete → shape: (283726, 29)
INFO:core.DataTransformer:Applying sampling method: auto
INFO:core.DataTransformer:Resampled data → new shape: (566506, 29)


Resampled class distribution:
Class
0    283253
1    283253
Name: count, dtype: int64


## Synthetic Oversampling Results

Here I’ve applied SMOTE to generate synthetic fraud samples by interpolating between real fraud points. This balanced the class distribution from:

- 0 → 283,253  
- 1 → 473

To:

- 0 → 283,253  
- 1 → 283,253

In the real pipeline this resampling is applied to the training split only; test and deployment data keep their original distribution. And because all features are numeric, there’s no risk of unnatural category blending.

SMOTE gives the model many more fraud examples to learn from than the original 473, but the synthetic points are interpolations of existing cases and add no new information, so the benefit should be confirmed on an untouched validation set.
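
The interpolation idea behind SMOTE can be sketched in plain NumPy. This is a simplified illustration, not the sampler that `pre_credit.sample` uses: the function name `smote_like`, its parameters, and the toy points are hypothetical, and the sketch requires `k` to be smaller than the number of minority points.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority points
    and one of their k nearest minority neighbours (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority points
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every minority point
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base point and one of its neighbours for each new sample
    base = rng.integers(0, len(X_min), size=n_new)
    nbr = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the segment base -> neighbour
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[nbr] - X_min[base])

# Four toy minority points at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_min, n_new=6)
print(synthetic)
```

Each synthetic row lies on a line segment between two real minority points, which is why this approach is only meaningful when every feature is numeric.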

In [20]:
# Save the feature engineered fraud dataset
fraud_df.to_csv("../data/processed/Feature_engineered/Feature_engineered_fraud_data.csv", index=False)
print("Saved feature-engineered fraud data successfully✅")

Saved feature-engineered fraud data successfully✅


# Feature Engineering & Transformation Summary

This notebook demonstrates feature engineering and preprocessing for two financial fraud detection datasets:

- **fraud_data**: User transaction logs with categorical and temporal features. We engineer time-based features, transaction frequency/velocity, and apply encoding/scaling strategies. Class imbalance is handled via Random Undersampling.
- **creditcard_data**: Anonymized numeric features (V1–V28) and transaction amounts. We apply scaling and balance classes using SMOTE.

The workflow covers:
- Feature creation (temporal, behavioral)
- Encoding and scaling choices
- Imbalance handling rationale
- Demonstration of reusable pipeline components

All transformations are shown for illustration—actual model training would split data first to avoid leakage.