# Feature Engineering for Fraud Detection

## Objective
This notebook creates behavioral, temporal, and geolocation features
to enhance fraud detection performance.

## Feature Engineering Strategy

- Time-based features capture suspicious transaction timing.
- Velocity features detect automated or scripted behavior.
- Country features capture geo-risk patterns.
- No features that leak the target are introduced.


## 🌍 IP → Country Merge


In [None]:

# Fraud Detection Feature Engineering Pipeline
# Allow imports from src/
import sys
from pathlib import Path
import pandas as pd
import numpy as np

PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.data_loader import load_fraud_data, load_ip_country_data
from src.preprocessing import clean_fraud_data


# Load raw data
df = load_fraud_data("../data/raw/Fraud_Data.csv")
ip_df = load_ip_country_data("../data/raw/IpAddress_to_Country.csv")

# Clean fraud data
df = clean_fraud_data(df)

df.head()


In [None]:
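from src.geo_utils import convert_ip_to_int, merge_ip_country

# Convert IP addresses to integers, then match them against the country IP ranges
fraud_df = convert_ip_to_int(df)
fraud_df = merge_ip_country(fraud_df, ip_df)
print(fraud_df.head())

# Spot-check the IP-to-country mapping
fraud_df[["ip_address", "ip_int", "country"]].head()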

Mapping IP addresses to countries enables the detection
of geographically anomalous transactions.
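
The internals of `merge_ip_country` are not shown in this notebook. As a rough sketch of how such a range lookup could work, the example below uses `pandas.merge_asof`. The helper name `lookup_country` is hypothetical, and the column names `lower_bound_ip_address`, `upper_bound_ip_address`, and `country` are assumptions about the IP table, not the project's confirmed schema.

```python
import pandas as pd


def lookup_country(tx: pd.DataFrame, ip_ranges: pd.DataFrame) -> pd.DataFrame:
    """Attach a country to each transaction by matching ip_int to an IP range.

    Hypothetical helper; column names are assumptions for illustration.
    """
    # merge_asof requires both frames to be sorted on their join keys
    tx_sorted = tx.sort_values("ip_int")
    ranges_sorted = ip_ranges.sort_values("lower_bound_ip_address")

    # For each ip_int, pick the last range whose lower bound is <= ip_int
    merged = pd.merge_asof(
        tx_sorted,
        ranges_sorted,
        left_on="ip_int",
        right_on="lower_bound_ip_address",
        direction="backward",
    )

    # A match only counts if the IP also lies below the range's upper bound
    outside = merged["ip_int"] > merged["upper_bound_ip_address"]
    merged.loc[outside, "country"] = "Unknown"
    merged["country"] = merged["country"].fillna("Unknown")
    return merged.drop(columns=["lower_bound_ip_address", "upper_bound_ip_address"])


# Tiny made-up example: two ranges, four transactions
tx = pd.DataFrame({"ip_int": [15, 120, 999, 5]})
ip_ranges = pd.DataFrame({
    "lower_bound_ip_address": [10, 100],
    "upper_bound_ip_address": [20, 200],
    "country": ["A", "B"],
})
result = lookup_country(tx, ip_ranges)
print(result)
```

Both join keys must share a dtype for `merge_asof`, so IP values stored as floats would need an integer cast first. IPs that fall outside every range are labeled `"Unknown"` here rather than dropped, which keeps the row count unchanged.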


## ⚙️ Time & Velocity Features

In [None]:
# 🕒 Time-Based Features
from src.feature_engineering import add_time_features, add_transaction_velocity

# Velocity features are added in the next code cell
fraud_df = add_time_features(fraud_df)
fraud_df[["hour_of_day", "day_of_week", "time_since_signup"]].head()



Time-based features capture behavioral patterns,
such as fraud occurring late at night or soon after signup.
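
To make these features concrete, here is a minimal sketch of how they might be derived. It assumes timestamp columns named `signup_time` and `purchase_time` and measures `time_since_signup` in hours; both are assumptions, and the actual `add_time_features` in `src/feature_engineering` may differ. The function name `sketch_time_features` is hypothetical.

```python
import pandas as pd


def sketch_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive hour_of_day, day_of_week, and time_since_signup (in hours).

    Hypothetical illustration; column names and units are assumptions.
    """
    out = df.copy()
    out["signup_time"] = pd.to_datetime(out["signup_time"])
    out["purchase_time"] = pd.to_datetime(out["purchase_time"])

    out["hour_of_day"] = out["purchase_time"].dt.hour
    out["day_of_week"] = out["purchase_time"].dt.dayofweek  # Monday=0 ... Sunday=6

    # Elapsed time between signup and purchase, expressed in hours
    delta = out["purchase_time"] - out["signup_time"]
    out["time_since_signup"] = delta.dt.total_seconds() / 3600.0
    return out


# Made-up rows: one purchase 2.5 hours after signup, one about 2.5 days after
demo = pd.DataFrame({
    "signup_time": ["2015-01-01 00:00:00", "2015-01-01 12:00:00"],
    "purchase_time": ["2015-01-01 02:30:00", "2015-01-03 23:00:00"],
})
demo_features = sketch_time_features(demo)
print(demo_features[["hour_of_day", "day_of_week", "time_since_signup"]])
```

All three features use only information available at purchase time, so they do not leak the label.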

In [None]:
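# Add transaction-velocity features
fraud_df = add_transaction_velocity(fraud_df)

# Show only the velocity columns (a notebook cell displays just its last expression)
fraud_df.filter(like="transactions_last").head()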

Fraud often occurs in bursts.
Velocity features quantify rapid transaction activity,
which is uncommon for legitimate users.
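
One way such a velocity feature could be computed is a trailing, per-user transaction count. The sketch below is an illustration only: the helper `sketch_velocity`, the `user_id` and `purchase_time` column names, and the 24-hour window are assumptions, not the confirmed behavior of `add_transaction_velocity`.

```python
import pandas as pd


def sketch_velocity(
    df: pd.DataFrame,
    key: str = "user_id",
    time_col: str = "purchase_time",
    window: str = "24h",
) -> pd.DataFrame:
    """Count each user's transactions in the trailing window, current one included.

    Hypothetical illustration; names and window length are assumptions.
    """
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    # Sort so rows line up with the (key, time) order of the grouped rolling result
    out = out.sort_values([key, time_col]).reset_index(drop=True)

    out["_one"] = 1  # helper column: each row counts as one transaction
    counts = (
        out.set_index(time_col)
        .groupby(key)["_one"]
        .rolling(window)  # time-based window looking backward from each row
        .sum()
    )
    out[f"transactions_last_{window}"] = counts.to_numpy().astype(int)
    return out.drop(columns="_one")


# Made-up data: user 1 makes a burst of three purchases, then one two days later
demo = pd.DataFrame({
    "user_id": [1, 2, 1, 1, 1],
    "purchase_time": [
        "2015-01-01 10:05:00",
        "2015-01-01 08:00:00",
        "2015-01-01 10:00:00",
        "2015-01-03 09:00:00",
        "2015-01-01 10:06:00",
    ],
})
velocity_demo = sketch_velocity(demo)
print(velocity_demo)
```

Because the window only looks backward in time, each count uses information that was available when the transaction occurred.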

Numerical features will be scaled using StandardScaler
to support distance-based and gradient-based models.
Scaling is not applied to PCA features in the credit card dataset.
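
As a hedged sketch of how this scaling step might look during modeling, the example below fits scikit-learn's `StandardScaler` on the training split only and reuses the fitted statistics on the test split. The data is synthetic, and the column names `purchase_value` and `time_since_signup` are placeholders for whichever numerical features are selected.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; values and column choice are assumptions
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "purchase_value": rng.uniform(5, 500, size=200),
    "time_since_signup": rng.uniform(0, 2000, size=200),
})
numeric_cols = ["purchase_value", "time_since_signup"]

train, test = train_test_split(demo, test_size=0.25, random_state=42)

scaler = StandardScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Learn mean and standard deviation from the training rows only
train_scaled[numeric_cols] = scaler.fit_transform(train[numeric_cols])
# Apply the training statistics to the test rows without refitting
test_scaled[numeric_cols] = scaler.transform(test[numeric_cols])

print(train_scaled[numeric_cols].mean().round(6))
```

Fitting the scaler on the full dataset would let test-set statistics influence the training features, so the fit is restricted to the training split.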

Both datasets exhibit severe class imbalance.
Resampling techniques such as SMOTE will be applied
only to training data during modeling to avoid information leakage.
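
The ordering matters more than the specific resampler: split first, then resample only the training rows. The sketch below illustrates that ordering with plain random oversampling as a stand-in for SMOTE, to avoid an extra dependency; it is not SMOTE itself. The data is synthetic and the target column name `is_fraud` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced demo data: 50 positives out of 1000 rows (made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "is_fraud": [1] * 50 + [0] * 950,
})

# 1. Split first, stratifying so both splits keep the original class ratio
train, test = train_test_split(
    demo, test_size=0.2, stratify=demo["is_fraud"], random_state=42
)

# 2. Resample the training split only; the test split is left untouched
minority = train[train["is_fraud"] == 1]
majority = train[train["is_fraud"] == 0]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle the balanced training set
train_balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)

print(train_balanced["is_fraud"].value_counts())
print(test["is_fraud"].value_counts())
```

The test split keeps its original imbalance, so evaluation metrics reflect the class distribution the model would face in practice.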


## 💾 Save Processed Data

In [None]:

# 💾 Save Processed Data
fraud_df.to_csv("../data/processed/fraud_data_features.csv", index=False)