# Feature Engineering and Data Preparation

This notebook transforms cleaned transaction datasets into feature-rich,
model-ready formats. It includes time-based features, transaction behavior
features, scaling, encoding, and proper handling of class imbalance.

In [2]:
print("Importing the necessary dependencies...")
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

print("Successfully imported")

Importing the necessary dependencies...
Successfully imported


In [3]:
print("Loading from processed and raw file")
fraud_df = pd.read_csv("./data/processed/fraud_cleaned.csv")
credit_df = pd.read_csv("./data/processed/creditcard_cleaned.csv")
ip_df = pd.read_csv("./data/raw/IpAddress_to_Country.csv")

Loading from processed and raw file


### Country Feature

We enrich the dataset by mapping IP addresses to countries.
This geographic signal can improve fraud detection accuracy.

In [36]:
print(ip_df.head())

   lower_bound_ip_address  upper_bound_ip_address    country
0                16777216                16777471  Australia
1                16777472                16777727      China
2                16777728                16778239      China
3                16778240                16779263  Australia
4                16779264                16781311      China


In [37]:
# Convert IP addresses to integer
fraud_df["ip_int"] = fraud_df["ip_address"].astype(int)

ip_df["lower_bound_ip_address"] = ip_df["lower_bound_ip_address"].astype(int)
ip_df["upper_bound_ip_address"] = ip_df["upper_bound_ip_address"].astype(int)

print("IP addresses converted to integers")


IP addresses converted to integers


In [39]:
def find_country(ip):
    match = ip_df[
        (ip_df["lower_bound_ip_address"] <= ip) &
        (ip_df["upper_bound_ip_address"] >= ip)
    ]
    if not match.empty:
        return match.iloc[0]["country"]
    return "Unknown"

fraud_df["country"] = fraud_df["ip_int"].apply(find_country)

print(fraud_df["country"].value_counts().head())


country
United States     58049
Unknown           21966
China             12038
Japan              7306
United Kingdom     4490
Name: count, dtype: int64
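
The row-wise `apply` above scans the full range table once per transaction. As a hedged alternative, the same lookup can be sketched with `pd.merge_asof`, which matches each IP to the nearest lower bound in a single sorted pass. The toy frames `ip_ranges` and `tx` below are illustrative assumptions (range values copied from the `ip_df.head()` output), not the notebook's data.

```python
import pandas as pd

# Toy range table in the same layout as ip_df (assumed sample values).
ip_ranges = pd.DataFrame({
    "lower_bound_ip_address": [16777216, 16777472, 16777728],
    "upper_bound_ip_address": [16777471, 16777727, 16778239],
    "country": ["Australia", "China", "China"],
}).sort_values("lower_bound_ip_address")

# Hypothetical transactions; the last IP falls outside every range.
tx = pd.DataFrame({"ip_int": [16777300, 16777500, 16778000, 99]})

# merge_asof needs both frames sorted on their keys, so keep the
# original row labels in an "index" column to restore the order later.
tx_sorted = tx.reset_index().sort_values("ip_int")
merged = pd.merge_asof(
    tx_sorted,
    ip_ranges,
    left_on="ip_int",
    right_on="lower_bound_ip_address",
    direction="backward",  # nearest lower bound that is <= ip_int
)

# A backward match only guarantees lower_bound <= ip, so the upper
# bound still has to be checked explicitly.
in_range = merged["ip_int"] <= merged["upper_bound_ip_address"]
merged["country"] = merged["country"].where(in_range, "Unknown")

# Map the result back onto the original row order.
tx["country"] = merged.set_index("index")["country"].reindex(tx.index)
print(tx)
```

Both key columns must share the same integer dtype for `merge_asof`, which is why the notebook's `astype(int)` conversion still matters with this approach.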


In [59]:
assert "country" in fraud_df.columns
print("Country feature successfully added")


Country feature successfully added


In [60]:
fraud_df.columns

Index(['user_id', 'signup_time', 'purchase_time', 'purchase_value',
       'device_id', 'source', 'browser', 'sex', 'age', 'ip_address', 'class',
       'time_since_signup', 'hour_of_day', 'day_of_week',
       'user_transaction_count', 'time_since_last_tx', 'country_x',
       'country_y', 'country'],
      dtype='object')

## Fraud_Data Feature Engineering

We begin by converting timestamps and creating time-based features
that capture user behavior and transaction timing.

In [4]:
print("Converting to datetime...")
fraud_df["signup_time"] = pd.to_datetime(fraud_df["signup_time"])
fraud_df["purchase_time"] = pd.to_datetime(fraud_df["purchase_time"])

Converting to datetime...


In [5]:
print("Time since signup")
fraud_df["time_since_signup"] = (
    fraud_df["purchase_time"] - fraud_df["signup_time"]
).dt.total_seconds()

Time since signup


In [6]:
print("Hour and day features")
fraud_df["hour_of_day"] = fraud_df["purchase_time"].dt.hour
fraud_df["day_of_week"] = fraud_df["purchase_time"].dt.dayofweek

Hour and day features


### Transaction Velocity

Fraudulent users often perform multiple transactions in a short time window.
We compute transaction frequency per user.

In [7]:
print("Sorting by user and time")
fraud_df = fraud_df.sort_values(["user_id", "purchase_time"])

Sorting by user and time


In [8]:
print("Transactions per user")
fraud_df["user_transaction_count"] = fraud_df.groupby("user_id")["purchase_time"].transform("count")

Transactions per user


In [9]:
print("Time between transactions")
fraud_df["time_since_last_tx"] = fraud_df.groupby("user_id")["purchase_time"].diff().dt.total_seconds()
fraud_df["time_since_last_tx"] = fraud_df["time_since_last_tx"].fillna(-1)

Time between transactions

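To make the two velocity features concrete, here is a small worked sketch on a toy frame. The sample rows and the `rapid_repeat` flag (a repeat purchase within one hour) are hypothetical additions for illustration and are not part of the notebook's feature set.

```python
import pandas as pd

# Hypothetical transactions for two users.
tx = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "purchase_time": pd.to_datetime([
        "2015-01-01 10:00:00", "2015-01-01 10:20:00",
        "2015-01-01 12:00:00", "2015-01-01 09:00:00",
    ]),
}).sort_values(["user_id", "purchase_time"])

# Total number of transactions per user, repeated on every row.
tx["user_transaction_count"] = (
    tx.groupby("user_id")["purchase_time"].transform("count")
)

# Seconds since the same user's previous transaction.
# The first transaction of each user has no predecessor, so it gets -1.
gap = tx.groupby("user_id")["purchase_time"].diff().dt.total_seconds()
tx["time_since_last_tx"] = gap.fillna(-1)

# Hypothetical flag: a repeat purchase within one hour of the last one.
# The -1 sentinel is excluded because it lies outside [0, 3600].
tx["rapid_repeat"] = tx["time_since_last_tx"].between(0, 3600)
print(tx)
```

Because `-1` is a sentinel rather than a real duration, any threshold-based feature derived from `time_since_last_tx` should exclude it explicitly, as the flag above does.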

### Leakage Prevention

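Raw timestamps (`signup_time`, `purchase_time`) and the `ip_address` and `device_id` identifiers are dropped to prevent data leakage.
The reduced frame is stored as `fraud_df_model`; `fraud_df` itself is left unchanged, and `user_id` is not part of this drop.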


In [12]:
print("Drop unused columns")
fraud_df_model = fraud_df.drop(
    columns=["signup_time", "purchase_time", "ip_address", "device_id"]
)

Drop unused columns


In [64]:
fraud_df.columns

Index(['user_id', 'signup_time', 'purchase_time', 'purchase_value',
       'device_id', 'source', 'browser', 'sex', 'age', 'ip_address', 'class',
       'time_since_signup', 'hour_of_day', 'day_of_week',
       'user_transaction_count', 'time_since_last_tx', 'country_x',
       'country_y', 'country'],
      dtype='object')

## PART B: Preprocessing Pipeline (Fraud_Data)

### Class Imbalance Handling (Fraud_Data)

SMOTE was not applied because the dataset contains high-cardinality categorical features.
SMOTE builds synthetic samples by interpolating between numeric feature vectors, so on
encoded categories it would produce fractional values that correspond to no real category.

Instead, class-weighted models are used to handle imbalance while preserving
data integrity.
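
As a rough illustration of what class weighting means here, the sketch below computes "balanced" weights, `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight="balanced"`. The counts are the training-split class counts reported in this notebook; the variable names are illustrative.

```python
# Training-split class counts reported in this notebook
# (0 = legitimate, 1 = fraud).
class_counts = {0: 109568, 1: 11321}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weight = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weight.items():
    print(f"class {label}: weight {weight:.3f}")

# A dict like this can be passed as `class_weight=` to estimators that
# accept the parameter; passing the string "balanced" asks scikit-learn
# to derive the same weights from the training labels.
```

With these counts a fraud example weighs roughly ten times as much as a legitimate one, which mirrors the class ratio without altering any rows.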

## Train-test split
Stratified splitting preserves fraud distribution.

In [69]:
# from sklearn.model_selection import train_test_split
feature_cols = [
    "user_id",
    "purchase_value",
    "source",
    "browser",
    "sex",
    "age",
    "time_since_signup",
    "hour_of_day",
    "day_of_week",
    "user_transaction_count",
    "time_since_last_tx",
    "country"   # was previously missing from this list
]

X = fraud_df[feature_cols]
y = fraud_df["class"]

print("Features used for modeling:")
print(X.columns.tolist())

Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
    
)
print(f"Train fraud ratio: {yf_train.mean():.4f}")
print(f"Test fraud ratio: {yf_test.mean():.4f}")

Features used for modeling:
['user_id', 'purchase_value', 'source', 'browser', 'sex', 'age', 'time_since_signup', 'hour_of_day', 'day_of_week', 'user_transaction_count', 'time_since_last_tx', 'country']
Train fraud ratio: 0.0936
Test fraud ratio: 0.0936


In [70]:
print(X.columns)

Index(['user_id', 'purchase_value', 'source', 'browser', 'sex', 'age',
       'time_since_signup', 'hour_of_day', 'day_of_week',
       'user_transaction_count', 'time_since_last_tx', 'country'],
      dtype='object')


### Scaling

In [71]:
scaler = StandardScaler()

num_cols = [
    "purchase_value",
    "age",
    "time_since_signup",
    "user_transaction_count",
    "time_since_last_tx"
]

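# Work on explicit copies so the assignments below do not write into slices of X
Xf_train, Xf_test = Xf_train.copy(), Xf_test.copy()

# Fit on the training split only, then reuse the same parameters on the test split
Xf_train[num_cols] = scaler.fit_transform(Xf_train[num_cols])
Xf_test[num_cols] = scaler.transform(Xf_test[num_cols])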

print("Numerical features scaled")


Numerical features scaled


In [72]:
print(Xf_train.columns)

Index(['user_id', 'purchase_value', 'source', 'browser', 'sex', 'age',
       'time_since_signup', 'hour_of_day', 'day_of_week',
       'user_transaction_count', 'time_since_last_tx', 'country'],
      dtype='object')


In [73]:
print("X_train shape:", Xf_train.shape)
print("X_test shape:", Xf_test.shape)
print("y_train distribution:\n", yf_train.value_counts())


X_train shape: (120889, 12)
X_test shape: (30223, 12)
y_train distribution:
 class
0    109568
1     11321
Name: count, dtype: int64


### One-Hot Encoding

In [74]:
Xf_train = pd.get_dummies(
    Xf_train,
    columns=["source", "browser", "sex", "country"],
    drop_first=True
)

Xf_test = pd.get_dummies(
    Xf_test,
    columns=["source", "browser", "sex", "country"],
    drop_first=True
)

Xf_train, Xf_test = Xf_train.align(Xf_test, join="left", axis=1, fill_value=0)

print("One-hot encoding completed")


One-hot encoding completed
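
The `ColumnTransformer` and `OneHotEncoder` imports at the top of the notebook are not used in the cells above. As a hedged sketch of how they could combine scaling and encoding in one fitted object, the example below uses toy `train` and `test` frames (assumed data) in which the test split contains a category that never appears in training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy splits (assumed data); "Opera" appears only in the test split.
train = pd.DataFrame({
    "purchase_value": [10.0, 20.0, 30.0],
    "browser": ["Chrome", "Safari", "Chrome"],
})
test = pd.DataFrame({
    "purchase_value": [20.0, 40.0],
    "browser": ["Opera", "Safari"],
})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["purchase_value"]),
        # Unseen categories are encoded as all zeros instead of raising.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["browser"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training split only, then reuse the fitted transformer.
X_train = preprocess.fit_transform(train)
X_test = preprocess.transform(test)

print(X_train.shape, X_test.shape)
print(X_test)
```

Because the encoder learns its categories from the training split, both outputs always have the same columns, so no separate alignment step is needed in this variant.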


## PART C: Feature Prep for Credit Card Dataset

PCA features are already standardized.
Only `Time` and `Amount` require scaling.
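
No code for this step appears in the notebook, so the following is only a minimal sketch of what scaling `Time` and `Amount` might look like. The toy frame, the `V1` column, and the `Class` target name are assumptions made for illustration; the real `credit_df` schema may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for credit_df (assumed columns and values).
credit = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0, 4.0, 7.0, 9.0, 12.0, 15.0],
    "V1": [-1.3, 1.1, -0.9, 0.4, 2.0, -0.2, 0.7, -1.8],
    "Amount": [149.6, 2.7, 378.7, 123.5, 70.0, 3.7, 5.0, 40.8],
    "Class": [0, 0, 0, 1, 0, 0, 0, 1],
})

X = credit.drop(columns="Class")
y = credit["Class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()

# Scale only Time and Amount; the PCA-derived column is left untouched.
scale_cols = ["Time", "Amount"]
scaler = StandardScaler()
X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_test[scale_cols] = scaler.transform(X_test[scale_cols])

print(X_train.describe().loc[["mean", "std"]])
```

As with the fraud dataset, the scaler is fitted on the training split only so that test-set statistics do not leak into preprocessing.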


## Feature Engineering Summary

Completed:
- Time-based features
- Transaction velocity features
- Geographic enrichment
- Scaling and encoding
- Class imbalance handling
- Leakage prevention

Datasets are now fully prepared for modeling.


## Summary of Feature Engineering (Fraud_Data)

In this notebook, we prepared the e-commerce fraud dataset for modeling by:
- Cleaning and validating raw transaction data
- Engineering time-based and behavioral features (transaction velocity, recency)
- Integrating geolocation data via IP-to-country range mapping
- Scaling numerical features using StandardScaler
- Encoding categorical variables using one-hot encoding
- Preserving class imbalance to be handled at the modeling stage

The resulting dataset is fully numeric, free of missing values, and ready for machine learning.
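
The "fully numeric, free of missing values" claim can be checked programmatically before saving. The helper below, `check_model_ready`, is a hypothetical name introduced for this sketch, and the two small frames are assumed examples rather than the notebook's data.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_model_ready(df: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the frame passed."""
    problems = []

    # Boolean dummy columns count as numeric for this check.
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")

    missing = df.columns[df.isna().any()].tolist()
    if missing:
        problems.append(f"columns with missing values: {missing}")

    return problems


# Assumed example frames: one clean, one with both kinds of problem.
good = pd.DataFrame({"age": [30, 41], "browser_Safari": [True, False]})
bad = pd.DataFrame({"age": [30.0, np.nan], "country": ["Japan", "China"]})

print(check_model_ready(good))
print(check_model_ready(bad))
```

Running the same check on `Xf_train` and `Xf_test` would also reveal whether an identifier or an unencoded column slipped through.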


### Class Imbalance Handling Decision

For the e-commerce fraud dataset, we did not apply SMOTE during feature engineering.
This is because:
- Tree-based ensemble models (e.g., Random Forest, XGBoost) support class weighting, so resampling is not strictly required
- SMOTE can distort categorical patterns and transaction sequences
- Class imbalance will instead be handled using:
  - Stratified sampling
  - Class-weighted models
  - Precision-Recall focused evaluation metrics

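As a small illustration of a Precision-Recall focused metric, the sketch below computes average precision with scikit-learn on made-up labels and scores (both assumed, with 20% positives). It is meant to show the metric's behavior, not results from this dataset.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up labels (2 frauds out of 10) and model scores for illustration.
y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.05, 0.10, 0.20, 0.15, 0.90, 0.30, 0.60, 0.40, 0.02, 0.08]

# Average precision summarizes the precision-recall curve in one number.
ap = average_precision_score(y_true, y_score)

# A random ranking scores roughly the positive-class prevalence,
# so that is the baseline to compare against.
baseline = sum(y_true) / len(y_true)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

print(f"average precision: {ap:.3f}")
print(f"prevalence baseline: {baseline:.3f}")
```

Unlike accuracy, this score is not inflated by the large number of legitimate transactions, which is why it suits an imbalanced fraud target.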

In [75]:
import os

os.makedirs("data/processed", exist_ok=True)

Xf_train.to_csv("data/processed/fraud_X_train.csv", index=False)
Xf_test.to_csv("data/processed/fraud_X_test.csv", index=False)
yf_train.to_csv("data/processed/fraud_y_train.csv", index=False)
yf_test.to_csv("data/processed/fraud_y_test.csv", index=False)

print("Processed fraud dataset saved successfully")

Processed fraud dataset saved successfully
