## Feature Engineering Overview

This notebook transforms raw BankSim transaction data into features for fraud detection.
The features follow from insights obtained during exploratory data analysis
and focus on temporal patterns, customer behavior, merchant interactions,
and mismatches between customer and merchant location.


In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.preprocessing import LabelEncoder


In [2]:
pd.set_option("display.max_columns", None)


In [3]:
df = pd.read_csv("Dataset/data.csv")
df.shape


(594643, 10)

In [4]:
df.head()


Unnamed: 0,step,customer,age,gender,zipcodeOri,merchant,zipMerchant,category,amount,fraud
0,0,'C1093826151','4','M','28007','M348934600','28007','es_transportation',4.55,0
1,0,'C352968107','2','M','28007','M348934600','28007','es_transportation',39.68,0
2,0,'C2054744914','4','F','28007','M1823072687','28007','es_transportation',26.89,0
3,0,'C1760612790','3','M','28007','M348934600','28007','es_transportation',17.25,0
4,0,'C757503768','5','M','28007','M348934600','28007','es_transportation',35.72,0


In [5]:
START_TIME = datetime(2024, 1, 1)

# Treat each simulation step as one hour elapsed since START_TIME (vectorized).
df["timestamp"] = pd.Timestamp(START_TIME) + pd.to_timedelta(df["step"], unit="h")


In [6]:
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month


In [7]:
# Night covers hours 23 and 0-5; weekend covers Saturday (5) and Sunday (6).
df["is_night"] = ((df["hour"] < 6) | (df["hour"] > 22)).astype(int)
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)


In [8]:
df["same_zip"] = (df["zipcodeOri"] == df["zipMerchant"]).astype(int)


In [9]:
# Exact complement of same_zip: 1 when customer and merchant zip codes differ.
df["distance_proxy"] = 1 - df["same_zip"]


In [10]:
customer_txn_count = df.groupby("customer").size()
df["customer_txn_count"] = df["customer"].map(customer_txn_count)


In [11]:
customer_avg_amount = df.groupby("customer")["amount"].mean()
df["customer_avg_amount"] = df["customer"].map(customer_avg_amount)


In [12]:
df["amount_deviation"] = df["amount"] - df["customer_avg_amount"]


In [13]:
df["amount_deviation_abs"] = df["amount_deviation"].abs()


In [14]:
merchant_txn_count = df.groupby("merchant").size()
df["merchant_txn_count"] = df["merchant"].map(merchant_txn_count)


In [15]:
cust_merchant_count = (
    df.groupby(["customer", "merchant"])
    .size()
    .reset_index(name="cust_merchant_txn_count")
)

df = df.merge(
    cust_merchant_count,
    on=["customer", "merchant"],
    how="left"
)


In [16]:
# Reset the index so later positional results line up with the sorted rows.
df = df.sort_values(["customer", "timestamp"]).reset_index(drop=True)


In [17]:
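# Count each customer's transactions in the trailing 24 hours (current row included).
# A "24h" offset is used because an integer window counts rows, not hours.
df["txn_count_last_24"] = (
    df.groupby("customer")
    .rolling("24h", on="timestamp")["amount"]
    .count()
    .to_numpy()  # positional assignment; df is already sorted by customer, timestamp
)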


In [18]:
df["txn_count_last_24"] = df["txn_count_last_24"].fillna(0)
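
The difference between a row-based window and a time-based window is easiest to see on a toy example. The sketch below uses made-up data for a single hypothetical customer (`C1`) and assumes, as this notebook does, that one step equals one hour. The column names `count_last_24_rows` and `count_last_24_hours` are illustrative only.

```python
import pandas as pd

# Toy data: one hypothetical customer with transactions at hours 0, 1, 30 and 31.
toy = pd.DataFrame({
    "customer": ["C1"] * 4,
    "step": [0, 1, 30, 31],
    "amount": [10.0, 20.0, 5.0, 7.5],
})
toy["timestamp"] = pd.Timestamp("2024-01-01") + pd.to_timedelta(toy["step"], unit="h")
toy = toy.sort_values(["customer", "timestamp"]).reset_index(drop=True)

# Integer window: counts up to the last 24 rows, whatever their timestamps.
row_based = (
    toy.groupby("customer")["amount"]
    .rolling(window=24, min_periods=1)
    .count()
    .to_numpy()
)

# Offset window: counts rows whose timestamp falls in the trailing 24 hours.
time_based = (
    toy.groupby("customer")
    .rolling("24h", on="timestamp")["amount"]
    .count()
    .to_numpy()
)

toy["count_last_24_rows"] = row_based
toy["count_last_24_hours"] = time_based
print(toy[["step", "count_last_24_rows", "count_last_24_hours"]])
```

In this toy case the row-based count keeps growing with every transaction, while the time-based count drops back once earlier transactions are more than 24 hours old.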


In [19]:
label_cols = ["category", "gender", "age"]

label_encoders = {}

for col in label_cols:
    le = LabelEncoder()
    df[col + "_enc"] = le.fit_transform(df[col])
    label_encoders[col] = le
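
scikit-learn documents `LabelEncoder` as a tool for target labels; for feature columns it also provides `OrdinalEncoder`, which can map categories unseen at fit time to a sentinel value. The following is a minimal sketch of that alternative on made-up data, not the encoding used in this notebook. The category values `es_travel` and `E` in the `new` frame are chosen only to illustrate unseen categories.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data with two categorical feature columns.
train = pd.DataFrame({
    "category": ["es_food", "es_health", "es_transportation"],
    "gender": ["F", "M", "F"],
})

# Hypothetical new data containing one unseen value in each column.
new = pd.DataFrame({
    "category": ["es_health", "es_travel"],
    "gender": ["M", "E"],
})

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)
encoded = encoder.transform(new)
print(encoded)
```

As with `LabelEncoder`, the resulting integers carry no meaningful order for nominal columns such as `category`, which matters for distance-based or linear models.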


In [20]:
df["amount_log"] = np.log1p(df["amount"])


Fraud in payment systems is often concentrated among a small subset of customers
and merchants. Behavioral aggregation features such as customer transaction count,
average spending, and spending deviation are introduced to capture abnormal behavior
relative to historical patterns.
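
The aggregates in this notebook are computed over the full dataset, so each row's customer average also reflects that customer's later transactions. If features must depend on past transactions only, one option is a running aggregate that excludes the current row. The sketch below shows this on toy data; the column names `prior_txn_count`, `prior_avg_amount`, and `prior_amount_deviation` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy transactions for two hypothetical customers.
toy = pd.DataFrame({
    "customer": ["A", "A", "A", "B", "B"],
    "step": [0, 1, 2, 0, 3],
    "amount": [10.0, 30.0, 20.0, 5.0, 15.0],
})
toy = toy.sort_values(["customer", "step"]).reset_index(drop=True)

# Number of earlier transactions by the same customer (0 for the first one).
toy["prior_txn_count"] = toy.groupby("customer").cumcount()

# Sum of the customer's earlier amounts: running sum minus the current amount.
prior_sum = toy.groupby("customer")["amount"].cumsum() - toy["amount"]

# Mean of earlier amounts; NaN when the customer has no history yet.
has_history = toy["prior_txn_count"] > 0
toy["prior_avg_amount"] = prior_sum / toy["prior_txn_count"].where(has_history)

# Deviation of the current amount from the customer's own past average.
toy["prior_amount_deviation"] = toy["amount"] - toy["prior_avg_amount"]
print(toy)
```

The first transaction of each customer has no history and therefore yields `NaN`; how to fill it (for example with a global statistic from training data) is a modeling decision left open here.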

In [21]:
feature_cols = [
    # Transaction
    "amount",
    "amount_log",
    "amount_deviation_abs",

    # Temporal
    "hour",
    "day_of_week",
    "is_night",
    "is_weekend",

    # Behavioral
    "customer_txn_count",
    "customer_avg_amount",
    "txn_count_last_24",
    "cust_merchant_txn_count",

    # Spatial
    "same_zip",
    "distance_proxy",

    # Encoded categories
    "category_enc",
    "gender_enc",
    "age_enc",

    # Merchant
    "merchant_txn_count"
]


In [22]:
X = df[feature_cols]
y = df["fraud"]


In [23]:
X.shape, y.value_counts(normalize=True)


((594643, 17),
 fraud
 0    0.987892
 1    0.012108
 Name: proportion, dtype: float64)

In [24]:
X.isnull().sum().sum()


np.int64(0)

In [25]:
df.to_csv("banksim_feature_engineered.csv", index=False)
X.to_csv("X_features.csv", index=False)
y.to_csv("y_labels.csv", index=False)


Feature engineering was guided by insights from exploratory data analysis and focused on capturing temporal, behavioral, and spatial anomalies associated with fraud.
Key engineered features include customer-specific spending deviations, transaction velocity indicators, merchant interaction frequencies, and time-based contextual variables.
No feature is derived from the fraud label, so the set is compatible with unsupervised anomaly detection models. The customer and merchant aggregates, however, are computed over the full dataset and therefore include transactions that occur after the one being scored; in a real deployment they would need to be computed from past transactions only.

## Feature Engineering Summary

The engineered feature set captures transactional, temporal, behavioral, and spatial
characteristics of payment activity. The features are suitable for unsupervised
and semi-supervised anomaly detection models. Fraud labels are not used during
feature construction, so no label information leaks into the features.
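
As an illustration of how a label-free feature matrix can feed an unsupervised detector, the sketch below fits scikit-learn's `IsolationForest` on synthetic data. The model choice, the synthetic matrix `X_toy`, and all parameter values are assumptions made for this example; the notebook itself does not specify which model is used downstream.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix: 500 typical rows plus 5 extreme ones.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
outliers = rng.normal(loc=8.0, scale=0.5, size=(5, 3))
X_toy = np.vstack([normal, outliers])

# Fit without any labels; random_state fixes the tree construction.
model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
model.fit(X_toy)

# score_samples is higher for more typical rows, so negate it to obtain
# an anomaly score where larger means more suspicious.
anomaly_score = -model.score_samples(X_toy)

print("mean score, typical rows:", anomaly_score[:500].mean())
print("mean score, extreme rows:", anomaly_score[500:].mean())
```

With real data, the fraud labels in `y` would be held back and used only to evaluate how well the anomaly scores rank fraudulent transactions.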
