# Feature Engineering and Data Preparation

This notebook transforms cleaned transaction datasets into feature-rich,
model-ready formats. It includes time-based features, transaction behavior
features, scaling, encoding, and proper handling of class imbalance.

In [1]:
print("Importing the necessary dependencies...")
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from imblearn.over_sampling import SMOTE
print("Successfully imported")

Importing the necessary dependencies...
Successfully imported


In [2]:
print("Loading from processed and raw file")
fraud_df = pd.read_csv("./data/processed/fraud_cleaned.csv")
credit_df = pd.read_csv("./data/processed/creditcard_cleaned.csv")
ip_df = pd.read_csv("./data/raw/IpAddress_to_Country.csv")

Loading from processed and raw file


## Fraud_Data Feature Engineering

We begin by converting timestamps and creating time-based features
that capture user behavior and transaction timing.

In [3]:
print("Converting to datetime...")
fraud_df["signup_time"] = pd.to_datetime(fraud_df["signup_time"])
fraud_df["purchase_time"] = pd.to_datetime(fraud_df["purchase_time"])

Converting to datetime...


In [4]:
print("Time since signup")
fraud_df["time_since_signup"] = (
    fraud_df["purchase_time"] - fraud_df["signup_time"]
).dt.total_seconds()

Time since signup


In [5]:
print("Hour and day features")
fraud_df["hour_of_day"] = fraud_df["purchase_time"].dt.hour
fraud_df["day_of_week"] = fraud_df["purchase_time"].dt.dayofweek

Hour and day features
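
As a quick sanity check, the three time-based features can be reproduced on a tiny hand-made frame. This is only a sketch: the `toy` frame and its two timestamps are invented for illustration and are not rows from `fraud_df`.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as fraud_df
toy = pd.DataFrame({
    "signup_time": pd.to_datetime(["2015-01-01 10:00:00", "2015-01-05 08:30:00"]),
    "purchase_time": pd.to_datetime(["2015-01-01 10:00:30", "2015-02-04 21:15:00"]),
})

# Seconds elapsed between signup and purchase
toy["time_since_signup"] = (toy["purchase_time"] - toy["signup_time"]).dt.total_seconds()

# Hour (0-23) and weekday (Monday=0 ... Sunday=6) of the purchase
toy["hour_of_day"] = toy["purchase_time"].dt.hour
toy["day_of_week"] = toy["purchase_time"].dt.dayofweek

print(toy[["time_since_signup", "hour_of_day", "day_of_week"]])
```

The first row, a purchase made 30 seconds after signup, is the kind of pattern `time_since_signup` is meant to expose.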


### Transaction Velocity

Fraudulent users often perform multiple transactions in a short time window.
We compute transaction frequency per user.

In [6]:
print("Sorting by user and time")
fraud_df = fraud_df.sort_values(["user_id" , "purchase_time"])

Sorting by user and time


In [7]:
print("Transactions per user")
fraud_df["user_transaction_count"] = fraud_df.groupby("user_id")["purchase_time"].transform("count")

Transactions per user


In [8]:
print("Time between transactions")
fraud_df["time_since_last_tx"] = fraud_df.groupby("user_id")["purchase_time"].diff().dt.total_seconds()
# A user's first transaction has no predecessor; mark it with the sentinel value -1
fraud_df["time_since_last_tx"] = fraud_df["time_since_last_tx"].fillna(-1)

Time between transactions
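
The velocity features can be checked on a small example. The sketch below uses an invented four-row frame (`toy`, with made-up user IDs and timestamps) and applies the same sort, count, and diff steps as the cells above.

```python
import pandas as pd

# Hypothetical transactions: user 1 buys three times, user 2 once
toy = pd.DataFrame({
    "user_id": [1, 2, 1, 1],
    "purchase_time": pd.to_datetime([
        "2015-03-01 12:00:00",
        "2015-03-01 09:00:00",
        "2015-03-01 12:00:05",
        "2015-03-02 12:00:05",
    ]),
})

# Sorting first ensures diff() compares each purchase with the previous one by the same user
toy = toy.sort_values(["user_id", "purchase_time"]).reset_index(drop=True)

# Total number of transactions per user, broadcast back to every row
toy["user_transaction_count"] = toy.groupby("user_id")["purchase_time"].transform("count")

# Seconds since the user's previous transaction; -1 marks a first transaction
toy["time_since_last_tx"] = (
    toy.groupby("user_id")["purchase_time"].diff().dt.total_seconds().fillna(-1)
)

print(toy)
```

Because -1 is a sentinel rather than a real duration, models that treat the column as continuous may benefit from a separate "first transaction" indicator; that is a design choice this notebook does not make.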


### Country Feature

We enrich the dataset by mapping IP addresses to countries.
This geographic signal can improve fraud detection accuracy.

In [9]:
print("IP conversion...")
# Cast all IP columns to integers so the range comparison below is exact
fraud_df["ip_address"] = fraud_df["ip_address"].astype(int)
ip_df["lower_bound_ip_address"] = ip_df["lower_bound_ip_address"].astype(int)
ip_df["upper_bound_ip_address"] = ip_df["upper_bound_ip_address"].astype(int)

IP conversion...


In [11]:
print("IP mapping function")
def map_ip_to_country(ip):
    match = ip_df[
        (ip_df["lower_bound_ip_address"] <= ip) &
        (ip_df["upper_bound_ip_address"] >= ip)
    ]
    return match.iloc[0]["country"] if not match.empty else "Unknown"

IP mapping function


In [12]:
print("Applying country mapping")
fraud_df["country"] = fraud_df["ip_address"].apply(map_ip_to_country)

Applying country mapping
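
The row-wise `apply` scans the whole range table once per transaction, which can be slow on large frames. One possible alternative is a sorted range join with `pd.merge_asof`. The sketch below is an assumption-laden illustration, not the method used in this notebook: the `ip_ranges` and `tx` frames are invented, and it assumes the IP ranges do not overlap.

```python
import pandas as pd

# Hypothetical range table with the same column names as ip_df
ip_ranges = pd.DataFrame({
    "lower_bound_ip_address": [100, 200, 400],
    "upper_bound_ip_address": [199, 299, 499],
    "country": ["A", "B", "C"],
}).sort_values("lower_bound_ip_address")

# Hypothetical transactions; 50 and 350 fall outside every range
tx = pd.DataFrame({"ip_address": [150, 50, 450, 350, 299]})

# merge_asof needs both sides sorted on the join key; keep the original index to restore order
tx_sorted = tx.reset_index().sort_values("ip_address")

# For each IP, pick the range with the largest lower bound that is <= the IP
merged = pd.merge_asof(
    tx_sorted,
    ip_ranges,
    left_on="ip_address",
    right_on="lower_bound_ip_address",
    direction="backward",
)

# The candidate range only matches if the IP is also below its upper bound
in_range = merged["ip_address"] <= merged["upper_bound_ip_address"]
merged["country"] = merged["country"].where(in_range, "Unknown")

# Assign back by original index so row order is unchanged
tx["country"] = merged.set_index("index")["country"]
print(tx)
```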


### Leakage Prevention

Raw timestamps, the IP address, and the device ID are removed to prevent data leakage; their signal is retained through the engineered time and country features.


In [13]:
print("Drop unused columns")
fraud_df_model = fraud_df.drop(
    columns=["signup_time", "purchase_time", "ip_address", "device_id"]
)

Drop unused columns


## Fraud_Data Preprocessing Pipeline

### Class Imbalance Handling (Fraud_Data)

SMOTE was not applied because the dataset contains high-cardinality categorical features.
SMOTE synthesizes minority samples by interpolating between numeric feature vectors, so
applying it here would generate artificial categorical values that correspond to no real
category, which is statistically invalid.

Instead, class-weighted models are used to handle imbalance while preserving
data integrity.
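
To make the class-weighting idea concrete, the sketch below computes "balanced" weights with scikit-learn's `compute_class_weight` on an invented label vector (`y_toy`, 90 legitimate and 10 fraudulent samples). The numbers are illustrative only and are not the notebook's actual class counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 legitimate (0) and 10 fraudulent (1) samples
y_toy = np.array([0] * 90 + [1] * 10)

# "balanced" weight for class c is n_samples / (n_classes * count_c)
weights = compute_class_weight(
    class_weight="balanced",
    classes=np.array([0, 1]),
    y=y_toy,
)
class_weight = dict(zip([0, 1], weights))
print(class_weight)
```

Many scikit-learn classifiers accept `class_weight="balanced"` directly, which applies the same formula to the training labels without creating any synthetic rows.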


In [52]:
X = fraud_df_model.drop("class", axis=1)
y = fraud_df_model["class"]


In [53]:
# Stratified 80/20 split (train_test_split is already imported in the first cell)
Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

In [54]:
print("Feature types")
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object"]).columns

Feature types


In [55]:
print("Column transformer")
preprocessor_fraud = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)

Column transformer
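
The sketch below shows how a transformer like this behaves when fitted on training rows and applied to unseen rows. The two toy frames and their values are invented; `sparse_threshold=0` is set here only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training and test rows
train = pd.DataFrame({
    "purchase_value": [10.0, 20.0, 30.0],
    "browser": ["Chrome", "Safari", "Chrome"],
})
test = pd.DataFrame({"purchase_value": [20.0], "browser": ["Opera"]})  # "Opera" is unseen

pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["purchase_value"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["browser"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Statistics and categories are learned from the training rows only
train_t = pre.fit_transform(train)
# The test row reuses them; the unseen category encodes as all zeros
test_t = pre.transform(test)

print(train_t.shape)
print(test_t)
```

Fitting on the training split alone keeps test-set statistics out of the scaler, and `handle_unknown="ignore"` prevents an error when a category such as a rare country appears only at prediction time.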


### Train-Test Split

Stratified splitting preserves fraud distribution.

In [56]:
print("Split")
Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
print(f"Train fraud ratio: {yf_train.mean():.4f}")
print(f"Test fraud ratio: {yf_test.mean():.4f}")

Split
Train fraud ratio: 0.0936
Test fraud ratio: 0.0936
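
The effect of `stratify` is easiest to see on a small synthetic example. In the sketch below, `X_toy` and `y_toy` are invented arrays with a 10% positive rate; they stand in for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 10% positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    stratify=y_toy,   # keep the class ratio identical in both splits
    random_state=42,
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

Without `stratify`, a random 20-sample test set drawn from this data could easily contain zero or four positives, which would make evaluation metrics unstable.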


In [57]:
# StandardScaler is already imported in the first cell
scaler = StandardScaler()

# Continuous columns to standardize
num_cols = [
    "purchase_value",
    "age",
    "time_since_signup",
    "user_transaction_count",
    "time_since_last_tx"
]

# Fit on the training split only, then reuse the same statistics on the test split
Xf_train[num_cols] = scaler.fit_transform(Xf_train[num_cols])
Xf_test[num_cols] = scaler.transform(Xf_test[num_cols])

print("Numerical features scaled successfully")


Numerical features scaled successfully


In [59]:
print("X_train shape:", Xf_train.shape)
print("X_test shape:", Xf_test.shape)
print("y_train distribution:\n", yf_train.value_counts())


X_train shape: (120889, 12)
X_test shape: (30223, 12)
y_train distribution:
 class
0    109568
1     11321
Name: count, dtype: int64


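In [61]:
# Keep column names (not the sub-DataFrames) so they can be passed to ColumnTransformer
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object"]).columns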

## Credit Card Dataset Preparation

The PCA-derived features (`V1` to `V28`) are already centered and on comparable scales.
Only `Time` and `Amount` require scaling.


In [23]:
X_credit = credit_df.drop("Class", axis=1)
y_credit = credit_df["Class"]
print("Target and features")

Target and features


In [84]:
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_credit, y_credit,
    test_size=0.2,
    stratify=y_credit,
    random_state=42
)
print("Train-Test split")

Train-Test split


In [85]:
print("Scale Amount & Time")


scaler = StandardScaler()

cols_to_scale = ["Amount", "Time"]

Xc_train[cols_to_scale] = scaler.fit_transform(Xc_train[cols_to_scale])
Xc_test[cols_to_scale] = scaler.transform(Xc_test[cols_to_scale])

print("Scaling complete")



Scale Amount & Time
Scaling complete


In [90]:
print(Xc_train.columns)

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount'],
      dtype='object')


In [78]:
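# Derive both features from the unscaled values in X_credit (aligned on the row index),
# because Amount and Time in Xc_train/Xc_test were standardized above
Xc_train["log_amount"] = np.log1p(X_credit.loc[Xc_train.index, "Amount"])
Xc_test["log_amount"] = np.log1p(X_credit.loc[Xc_test.index, "Amount"])

Xc_train["hour"] = (X_credit.loc[Xc_train.index, "Time"] // 3600) % 24
Xc_test["hour"] = (X_credit.loc[Xc_test.index, "Time"] // 3600) % 24

print("Feature engineering complete")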


Feature engineering complete
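
The two derived features can be illustrated on a few invented values. The sketch assumes `Time` is measured in seconds from the first recorded transaction and that both columns are still unscaled; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical unscaled values: Time in seconds, Amount in currency units
toy = pd.DataFrame({
    "Time": [0, 3600, 90000],
    "Amount": [0.0, 9.0, 99.0],
})

# log1p(x) = log(1 + x) compresses the long right tail and is defined at 0
toy["log_amount"] = np.log1p(toy["Amount"])

# Whole hours elapsed, wrapped to a 24-hour cycle (90000 s = 25 h -> hour 1)
toy["hour"] = (toy["Time"] // 3600) % 24

print(toy)
```

Applying the same expressions to standardized columns would not be meaningful: a scaled `Time` no longer counts seconds, and a scaled `Amount` can be negative.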


## Feature Engineering Summary

Completed:
- Time-based features
- Transaction velocity features
- Geographic enrichment
- Scaling and encoding
- Class imbalance handling
- Leakage prevention

Datasets are now fully prepared for modeling.
