# Feature Engineering — Fraud Detection (IEEE-CIS)

## Objective

The objective of this notebook is to engineer production-ready fraud features from raw transaction and identity data.

Fraud detection is fundamentally a behavioral problem.
Raw features are rarely sufficient on their own: fraud shows up when behavior deviates from historical norms or changes rapidly over time.

This notebook focuses on creating features that capture:

- Transaction velocity

- Behavioral aggregation across time windows

- Amount deviation from historical behavior

- Identity and device signals

- Missingness as an explicit risk indicator

All features are designed to be:

- Leakage-safe

- Time-aware

- Compatible with real-time scoring systems

In [1]:
# Imports and Configuration

import pandas as pd
import numpy as np

In [2]:
# Load and Prepare Data

DATA_PATH = "../data/"

train_transaction = pd.read_csv(DATA_PATH + "train_transaction.csv")
train_identity = pd.read_csv(DATA_PATH + "train_identity.csv")

df = train_transaction.merge(
    train_identity,
    on="TransactionID",
    how="left"
)

df = df.sort_values("TransactionDT").reset_index(drop=True)
df.shape


(590540, 434)

Fraud features answer four key questions:

- How fast is the behavior changing? (velocity)

- How does this compare to the past? (aggregation)

- How abnormal is the amount? (deviation)

- Is identity information missing or unusual? (identity risk)

In [3]:
# Velocity Features
# Velocity features capture short-term bursts of activity, a common fraud pattern.

WINDOW_1H = 3600  # seconds

def count_prior_tx(times, window=WINDOW_1H):
    # df is sorted by TransactionDT, so each card's times are non-decreasing.
    t = times.to_numpy()
    # Position of the earliest transaction that is at most `window` seconds old.
    start = np.searchsorted(t, t - window, side="left")
    # Number of earlier transactions in the window; the current row is excluded.
    return pd.Series(np.arange(len(t)) - start, index=times.index)

# Transaction Count per Card (Rolling Window)
df["card_tx_count_1h"] = (
    df.groupby("card1")["TransactionDT"].transform(count_prior_tx)
)

# Time Since Last Transaction
# A card's first transaction has no predecessor; -1 marks that case.
df["time_since_last_card_tx"] = (
    df.groupby("card1")["TransactionDT"]
      .diff()
      .fillna(-1)
)


Why this matters

- Fraud often involves rapid successive transactions

- Very small time gaps are suspicious
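
To make the velocity idea concrete, here is a minimal sketch on a toy frame. The transaction times and the helper name `prior_tx_in_window` are illustrative assumptions; only the column names `card1` and `TransactionDT` come from the dataset. For each transaction it counts how many earlier transactions on the same card fall within the preceding hour, and it measures the gap to the previous transaction.

```python
import numpy as np
import pandas as pd

WINDOW_1H = 3600  # seconds

def prior_tx_in_window(times: pd.Series, window: int = WINDOW_1H) -> pd.Series:
    """Hypothetical helper: earlier transactions within `window` seconds."""
    t = times.to_numpy()  # assumed sorted in ascending order
    # Index of the first transaction that is no older than `window` seconds.
    start = np.searchsorted(t, t - window, side="left")
    # Row position minus window start = number of earlier rows in the window.
    return pd.Series(np.arange(len(t)) - start, index=times.index)

# Made-up data: card 1 has a burst of three transactions, then a long pause.
toy = pd.DataFrame({
    "card1": [1, 1, 2, 1, 1, 2],
    "TransactionDT": [0, 600, 700, 1200, 9000, 9100],
}).sort_values("TransactionDT").reset_index(drop=True)

toy["card_tx_count_1h"] = (
    toy.groupby("card1")["TransactionDT"].transform(prior_tx_in_window)
)
toy["time_since_last_card_tx"] = (
    toy.groupby("card1")["TransactionDT"].diff().fillna(-1)
)
print(toy)
```

In this toy data the third transaction of the burst on card 1 gets a count of 2, while the transaction after the pause drops back to 0. Both features only look backwards in time, so they could in principle be computed at scoring time.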

In [4]:
# Aggregation Features (Behavioral Baselines)
# Aggregations provide historical context.

# Expanding mean transaction amount per card.
# shift(1) runs inside each card's group, so the current amount is excluded
# and no value carries over from one card to the next.
df["card_avg_amount"] = (
    df.groupby("card1")["TransactionAmt"]
      .transform(lambda s: s.shift(1).expanding().mean())
)


# Expanding standard deviation per card.
# NaN until the card has at least two earlier transactions.
df["card_std_amount"] = (
    df.groupby("card1")["TransactionAmt"]
      .transform(lambda s: s.shift(1).expanding().std())
)
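
One way to check that a historical baseline is leakage-safe is to compute it on a toy frame where the answer can be worked out by hand. The sketch below uses made-up amounts for two interleaved cards and shifts inside each card's group before taking the expanding statistics, so each row sees only that card's earlier transactions.

```python
import pandas as pd

# Made-up amounts; rows are already in time order and the cards interleave.
toy = pd.DataFrame({
    "card1": [1, 2, 1, 2, 1],
    "TransactionAmt": [10.0, 100.0, 20.0, 300.0, 60.0],
})

grouped = toy.groupby("card1")["TransactionAmt"]

# Shift first, then expand: the current amount never enters its own baseline.
toy["card_avg_amount"] = grouped.transform(lambda s: s.shift(1).expanding().mean())
toy["card_std_amount"] = grouped.transform(lambda s: s.shift(1).expanding().std())
print(toy)
```

The first transaction of each card has no history, so its baseline is NaN. The last row (card 1, amount 60) gets a mean of 15, the average of 10 and 20, with no contribution from card 2.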


In [5]:
# Amount Deviation Features
# Raw transaction amount is a weak signal. Deviation from normal behavior is a much stronger one.
# The +1 in each denominator avoids division by zero.
df["amount_ratio"] = df["TransactionAmt"] / (df["card_avg_amount"] + 1)

df["amount_zscore"] = (
    (df["TransactionAmt"] - df["card_avg_amount"]) /
    (df["card_std_amount"] + 1)
)
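
A quick worked example with made-up numbers shows what the two deviation features return. The values below are assumptions chosen for easy arithmetic: a card with a historical mean of 49 and a standard deviation of 9 sees a 250 transaction, and a second card has no history yet.

```python
import pandas as pd

# Made-up inputs; the second row mimics a card's first transaction.
toy = pd.DataFrame({
    "TransactionAmt": [250.0, 40.0],
    "card_avg_amount": [49.0, float("nan")],
    "card_std_amount": [9.0, float("nan")],
})

# Same formulas as in the cell above, including the +1 smoothing term.
toy["amount_ratio"] = toy["TransactionAmt"] / (toy["card_avg_amount"] + 1)
toy["amount_zscore"] = (
    (toy["TransactionAmt"] - toy["card_avg_amount"])
    / (toy["card_std_amount"] + 1)
)
print(toy[["amount_ratio", "amount_zscore"]])
```

The first row yields a ratio of 250 / 50 = 5.0 and a score of 201 / 10 = 20.1. Because of the +1 term this is a smoothed variant rather than a textbook z-score. The second row stays NaN in both columns; whether to keep or fill those values depends on the downstream model.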



In [6]:
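# Identity & Device Features
# Identity indicator: 1 when DeviceType is present, 0 when it is missing.
# Computed before the fill below so the flag reflects the raw data.
df["has_identity"] = df["DeviceType"].notna().astype(int)

# Device type: treat missing values as their own category
df["DeviceType"] = df["DeviceType"].fillna("missing")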


Categorical features will later be:

- Label encoded, or

- One-hot encoded (depending on model choice)
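
Both encoding options can be sketched with plain pandas. The toy values below are illustrative, and the choice of `cat.codes` and `pd.get_dummies` is an assumption about how the later encoding step might look, not a description of the final pipeline.

```python
import pandas as pd

# Toy column with one missing value, filled the same way as in the cell above.
toy = pd.DataFrame({"DeviceType": ["desktop", None, "mobile", "desktop"]})
toy["DeviceType"] = toy["DeviceType"].fillna("missing")

# Option 1: label encoding (integer codes), often used with tree-based models.
toy["DeviceType_code"] = toy["DeviceType"].astype("category").cat.codes

# Option 2: one-hot encoding, often used with linear models.
one_hot = pd.get_dummies(toy["DeviceType"], prefix="DeviceType")

print(toy)
print(one_hot)
```

In a real pipeline the category mapping would normally be learned on the training data only and then reused at scoring time, so that unseen categories are handled consistently.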

In [7]:
# Email Domain Risk Signals
df["P_emaildomain"] = df["P_emaildomain"].fillna("missing")
df["R_emaildomain"] = df["R_emaildomain"].fillna("missing")


Email domains often act as proxy identity features.

In [8]:
# Missing Value Indicators
# Instead of blindly imputing, we explicitly model missingness.

identity_cols = [col for col in df.columns if col.startswith("id_")]

for col in identity_cols:
    df[f"{col}_missing"] = df[col].isna().astype(int)
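
With several dozen identity columns, adding indicators one at a time can fragment the frame. A possible alternative, sketched here on made-up data, builds all indicators in one step and attaches them with a single `pd.concat`. The toy values are assumptions; the `id_` prefix and the `_missing` suffix follow the loop above.

```python
import numpy as np
import pandas as pd

# Made-up frame with two identity columns and one unrelated column.
toy = pd.DataFrame({
    "id_01": [-5.0, np.nan, 0.0],
    "id_02": [np.nan, np.nan, 1200.0],
    "TransactionAmt": [10.0, 20.0, 30.0],
})

identity_cols = [col for col in toy.columns if col.startswith("id_")]

# Build every indicator at once, then attach them in a single concat.
missing_flags = toy[identity_cols].isna().astype(int).add_suffix("_missing")
toy = pd.concat([toy, missing_flags], axis=1)
print(toy)
```

The result has the same columns as the loop would produce, here `id_01_missing` and `id_02_missing`, so the two approaches are interchangeable for downstream code.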


In [9]:
# Feature Cleanup
# Drop columns not suitable for modeling:
DROP_COLS = [
    "TransactionID",
    "TransactionDT"
]

features_df = df.drop(columns=DROP_COLS)
features_df.shape


(590540, 477)

## Feature Engineering Summary

Features created in this notebook:

1. Velocity features:

- Transaction count per card over the past hour

- Time since last transaction

2. Aggregation features:

- Expanding (historical) mean and standard deviation of amount per card

3. Amount deviation features:

- Amount ratio

- Z-score

4. Identity risk features:

- Identity presence indicator

- Device and email signals

5. Explicit missingness indicators

These features capture how behavior changes over time, not just static attributes.

In [10]:
# Save the Feature Dataset

OUTPUT_PATH = "../data/"

features_df.to_csv(
    OUTPUT_PATH + "features_train.csv",
    index=False
)
