# Feature Engineering – Fraud_Data.csv

## Task 1.4: Feature Engineering and Data Transformation

**Objective:**  
Create meaningful behavioral, temporal, and geographic features to improve fraud detection performance.


# Load and Prepare Data

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [5]:
# Load data set
raw_file = "/content/drive/MyDrive/KAIM/Week5&6/Fraud_Data_eda.csv"
ip_file = "/content/drive/MyDrive/KAIM/Week5&6/IpAddress_to_Country.csv"
df = pd.read_csv(raw_file)
ip_df = pd.read_csv(ip_file)

In [6]:
# Convert timestamps

df["signup_time"] = pd.to_datetime(df["signup_time"])
df["purchase_time"] = pd.to_datetime(df["purchase_time"])

# Recreate Time-Based Features (Core Signals)

In [7]:
# Time since signup (hours)
df["time_since_signup"] = (
    df["purchase_time"] - df["signup_time"]
).dt.total_seconds() / 3600

# Hour of day
df["hour_of_day"] = df["purchase_time"].dt.hour

# Day of week
df["day_of_week"] = df["purchase_time"].dt.dayofweek

Time-Based Features

- **time_since_signup** captures trust maturity.
- **hour_of_day** captures abnormal transaction timing.
- **day_of_week** captures weekly behavioral patterns.
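As a quick sanity check, the three temporal features can be reproduced on a tiny frame. This is a minimal sketch: the name `toy` and the timestamps are made up for illustration and do not come from Fraud_Data.csv.

```python
import pandas as pd

# Hypothetical sample rows, used only to illustrate the transformations.
toy = pd.DataFrame({
    "signup_time": pd.to_datetime(["2015-01-01 00:00:00", "2015-01-01 12:00:00"]),
    "purchase_time": pd.to_datetime(["2015-01-01 00:30:00", "2015-01-03 18:00:00"]),
})

# Hours elapsed between signup and purchase
toy["time_since_signup"] = (
    toy["purchase_time"] - toy["signup_time"]
).dt.total_seconds() / 3600

# Hour of the purchase (0-23) and weekday (Monday=0 ... Sunday=6)
toy["hour_of_day"] = toy["purchase_time"].dt.hour
toy["day_of_week"] = toy["purchase_time"].dt.dayofweek

print(toy[["time_since_signup", "hour_of_day", "day_of_week"]])
```

The first row, a purchase half an hour after signup, gets a very small `time_since_signup`, which is the kind of value this feature is meant to surface.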


# Transaction Velocity & Frequency (High-Value Features)

These features describe how a user behaves over time, which is the basis of behavioral fraud detection.

In [9]:
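# Sort so that per-user operations see purchases in chronological order
df = df.sort_values(["user_id", "purchase_time"])

# Running count of transactions per user (1 for a user's first purchase)
df["transaction_count_user"] = df.groupby("user_id").cumcount() + 1

# Transactions by the same user in the trailing 24 hours, current one included.
# A time-based rolling window is used because a cumulative sum of "gap <= 24h"
# flags never decreases, so it does not measure a 24-hour window.
df["transactions_last_24h"] = (
    df.assign(_one=1)
    .set_index("purchase_time")
    .groupby("user_id")["_one"]
    .rolling("24h")
    .sum()
    .to_numpy()  # ordered by user_id, then time, matching the sort above
)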

Transaction Velocity Features

- **transaction_count_user** captures repeat behavior.
- **transactions_last_24h** captures burst activity, a common fraud pattern.
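One way to count a user's purchases in a trailing 24-hour window is pandas' time-based rolling window. The sketch below uses a hypothetical four-row frame (`toy`, with invented timestamps) and assumes the rows are sorted by user and time before the result is assigned back.

```python
import pandas as pd

# Hypothetical purchases for two users; values are illustrative only.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "purchase_time": pd.to_datetime([
        "2015-01-01 10:00", "2015-01-01 20:00", "2015-01-03 09:00",
        "2015-01-01 08:00",
    ]),
})
toy = toy.sort_values(["user_id", "purchase_time"]).reset_index(drop=True)

# Running number of purchases per user
toy["transaction_count_user"] = toy.groupby("user_id").cumcount() + 1

# Sum a column of ones over a 24-hour window that ends at each purchase
rolled = (
    toy.assign(one=1)
    .set_index("purchase_time")
    .groupby("user_id")["one"]
    .rolling("24h")
    .sum()
)
# The result is ordered by user_id and then time, like the sorted frame
toy["transactions_last_24h"] = rolled.to_numpy().astype(int)

print(toy)
```

User 1's third purchase falls more than 24 hours after the second, so its window count drops back to 1 while the running count keeps growing to 3.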


# IP to Country

In [19]:
import ipaddress

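def ip_to_int(ip):
    """Return the address as an integer, or NaN if it cannot be parsed."""
    try:
        if isinstance(ip, str):
            return int(ipaddress.ip_address(ip))
        # ipaddress.ip_address() rejects floats, so numeric values are assumed
        # to be integer-encoded addresses already and are truncated to int.
        return int(ip)
    except (ValueError, TypeError, OverflowError):
        return np.nan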

df["ip_int"] = df["ip_address"].apply(ip_to_int)

# Drop rows where ip_int is NaN, as these cannot be merged
df.dropna(subset=["ip_int"], inplace=True)

# merge_asof requires both keys to share a dtype, so cast all three to float
df["ip_int"] = df["ip_int"].astype(float)
ip_df["lower_bound_ip_address"] = ip_df["lower_bound_ip_address"].astype(float)
ip_df["upper_bound_ip_address"] = ip_df["upper_bound_ip_address"].astype(float)

# Drop columns left over from a previous merge with ip_df so that re-running
# this cell starts from a clean frame
cols_to_drop = [col for col in df.columns if col in ip_df.columns or col.endswith("_geo")]
df.drop(columns=cols_to_drop, errors='ignore', inplace=True)

ip_df = ip_df.sort_values("lower_bound_ip_address")
df = df.sort_values("ip_int")

df = pd.merge_asof(
    df,
    ip_df,
    left_on="ip_int",
    right_on="lower_bound_ip_address",
    direction="backward",
    suffixes=('', '_geo') # Specify suffixes to avoid conflicts. '_geo' for ip_df columns.
)

# Keep only rows whose IP actually falls inside the matched range
df = df[df["ip_int"] <= df["upper_bound_ip_address"]]


Country information was derived from IP address ranges and added as a categorical risk feature.
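The range lookup can be illustrated on a small example. The sketch below assumes two made-up IP ranges and three made-up integer addresses; `ranges`, `events`, and `matched` are hypothetical names, and the real bounds come from IpAddress_to_Country.csv.

```python
import pandas as pd

# Hypothetical IP ranges with a gap between 199 and 300.
ranges = pd.DataFrame({
    "lower_bound_ip_address": [100.0, 300.0],
    "upper_bound_ip_address": [199.0, 399.0],
    "country": ["A", "B"],
}).sort_values("lower_bound_ip_address")

# Hypothetical integer-encoded addresses; 250 falls in the gap.
events = pd.DataFrame({"ip_int": [150.0, 250.0, 350.0]}).sort_values("ip_int")

# For each address, pick the range with the largest lower bound <= address
matched = pd.merge_asof(
    events,
    ranges,
    left_on="ip_int",
    right_on="lower_bound_ip_address",
    direction="backward",
)

# The backward match ignores the upper bound, so check it explicitly
matched = matched[matched["ip_int"] <= matched["upper_bound_ip_address"]]

print(matched[["ip_int", "country"]])
```

The address 250 is first paired with range A because 100 is the nearest lower bound, and is then discarded by the upper-bound check. Rows without a containing range are therefore removed from the data rather than kept with a missing country.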


In [20]:
# Drop Irrelevant or Leaky Columns

df = df.drop(columns=[
    "signup_time",
    "purchase_time",
    "ip_address",
    "lower_bound_ip_address",
    "upper_bound_ip_address"
])


Raw timestamp and IP columns were removed after extracting meaningful features.


In [23]:
# Separate Target Variable

X = df.drop(columns="class")
y = df["class"]

# Encode Categorical Features
categorical_features = [
    "source", "browser", "sex", "country"
]

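# Scale only numeric columns; non-numeric columns that are not listed as
# categorical (identifiers, for example) are dropped by ColumnTransformer
numerical_features = [
    col for col in X.select_dtypes(include="number").columns
    if col not in categorical_features
]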

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)


Encoding and Scaling

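- Numerical features were standardized to zero mean and unit variance.
- Categorical features were one-hot encoded; categories not seen during fitting are encoded as all zeros (`handle_unknown="ignore"`).
- Standardization matters mainly for linear models. Tree-based models are insensitive to feature scaling, so the same matrix can be used for both.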


# Save Processed Data

In [27]:
if not X.empty:
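    X_processed = preprocessor.fit_transform(X)

    # ColumnTransformer may return a SciPy sparse matrix when the one-hot
    # block dominates; convert it so pandas can build a regular DataFrame
    if hasattr(X_processed, "toarray"):
        X_processed = X_processed.toarray()

    # Keep the generated feature names as the CSV header (scikit-learn 1.0+)
    pd.DataFrame(
        X_processed, columns=preprocessor.get_feature_names_out()
    ).to_csv(
        "/content/drive/MyDrive/KAIM/Week5&6/fraud_features.csv",
        index=False
    )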

    y.to_csv("/content/drive/MyDrive/KAIM/Week5&6/fraud_target.csv", index=False)
else:
    print("DataFrame X is empty. Skipping feature processing and saving.")
    print("Please review previous steps to ensure data is not entirely dropped.")

DataFrame X is empty. Skipping feature processing and saving.
Please review previous steps to ensure data is not entirely dropped.
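An empty `X` means every row was removed somewhere upstream. One possible cause, stated here as an assumption about the raw data rather than a confirmed diagnosis, is that `ip_address` is stored as a number instead of a dotted-quad string. A parser that relies only on `ipaddress.ip_address` turns such values into NaN, and the later `dropna` then removes those rows. The sketch below uses hypothetical sample values and a hypothetical helper `strict_ip_to_int` to show the effect.

```python
import ipaddress

import numpy as np
import pandas as pd


def strict_ip_to_int(ip):
    """Parse with ipaddress only; anything it rejects becomes NaN."""
    try:
        return int(ipaddress.ip_address(ip))
    except ValueError:
        return np.nan


# Hypothetical values: one dotted-quad string and two float-encoded addresses.
sample = pd.Series(["192.168.0.1", 732758368.8, 350311387.9], dtype=object)

parsed = sample.apply(strict_ip_to_int)
share_unparsed = parsed.isna().mean()

print(parsed.tolist())
print(f"share of rows that would be dropped: {share_unparsed:.2f}")
```

Printing `len(df)` and `df["ip_int"].isna().mean()` after each step of the IP cell is a cheap way to locate where the rows disappear.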


# Feature Engineering Summary
  1. Temporal features captured abnormal transaction timing.
  2. Velocity features captured burst and repeat behavior.
  3. Geographic features added contextual risk.
  4. Data was encoded and scaled for modeling.
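  5. Final datasets are saved for reuse in modeling when the feature matrix is non-empty; in the run recorded above, X was empty and saving was skipped.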