# Exploratory Data Analysis – E-commerce Fraud Data

## Objective
This notebook explores the e-commerce transaction dataset to understand fraud patterns,
assess data quality, and identify features useful for fraud detection.
The focus is on class imbalance, user behavior, and transaction timing.


📌 Load & Clean

In [None]:
# Exploratory Data Analysis for Fraud Detection
# Allow imports from src/
import sys
from pathlib import Path

PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))
    
from src.data_loader import load_fraud_data
from src.preprocessing import clean_fraud_data

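df = load_fraud_data("../data/raw/Fraud_Data.csv")
print("Raw shape:", df.shape)  # a bare expression mid-cell would not be shown

df = clean_fraud_data(df)
display(df.head())  # display() is provided by Jupyter/IPython
df.info()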

The dataset contains transaction-level information including user behavior,
transaction timing, device details, and a binary fraud label.

In [None]:
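# Wrap each summary in display() so every result renders, not only the last expression
display(df.isna().sum().sort_values(ascending=False))
display(df.describe(include="all").T)
display(df.nunique().sort_values(ascending=False))
display(df.dtypes)
display(df["class"].value_counts(normalize=True) * 100)  # class balance in percent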


Missing values were observed in demographic and categorical fields.
Numerical features were imputed using the median to reduce sensitivity to outliers,
while categorical features were filled with "Unknown" to preserve row count.
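
The imputation strategy described above can be sketched as follows. This is a minimal illustration, not the actual body of `clean_fraud_data`, which is not shown in this notebook. The helper name `impute_missing`, its arguments, and the toy columns are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def impute_missing(frame, num_cols, cat_cols):
    """Fill numeric NaNs with the column median and categorical NaNs with "Unknown".

    Hypothetical helper for illustration; returns a copy and keeps every row.
    """
    out = frame.copy()
    for col in num_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in cat_cols:
        out[col] = out[col].fillna("Unknown")
    return out


# Toy data, not taken from Fraud_Data.csv
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 90.0],
    "sex": ["M", None, "F", "F"],
})

imputed = impute_missing(toy, num_cols=["age"], cat_cols=["sex"])
print(imputed)
```

The median of the observed ages (25, 40, 90) is 40, so the outlier at 90 does not pull the filled value upward the way a mean would.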


📊 Class Imbalance

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Class balance in percent (printed explicitly so it shows mid-cell)
print(df["class"].value_counts(normalize=True) * 100)

df["class"].value_counts(normalize=True).plot(kind="bar")
plt.title("Fraud vs Non-Fraud Distribution")
plt.ylabel("Proportion")
plt.show()
sns.countplot(x="class", data=df)
plt.title("Fraud vs Non-Fraud Count")
plt.show()


Only a small percentage of transactions are fraudulent,
confirming a severe class imbalance. Accuracy is therefore a misleading
metric: a model that labels every transaction as legitimate would still
score highly while catching no fraud. This motivates the use of resampling
and cost-sensitive models later.
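
A small sketch of why accuracy misleads under imbalance. The 5% fraud rate below is an assumed figure chosen for illustration, not the rate measured in this dataset, and the labels are synthetic.

```python
import numpy as np

# Synthetic labels: 50 fraud cases out of 1000 (assumed 5% rate, for illustration only)
y_true = np.array([1] * 50 + [0] * 950)

# A trivial "model" that never flags fraud
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
true_pos = int(((y_true == 1) & (y_pred == 1)).sum())
recall = true_pos / int((y_true == 1).sum())

print(f"accuracy={accuracy:.2f}, fraud recall={recall:.2f}")
# prints: accuracy=0.95, fraud recall=0.00
```

The baseline scores 95% accuracy while detecting none of the fraud, which is why metrics focused on the minority class, such as recall and precision, are more informative here.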


📊 Purchase Value vs Fraud

In [None]:
# Purchase Value vs Fraud
sns.boxplot(x="class", y="purchase_value", data=df)
plt.title("Purchase Value by Fraud Class")
plt.show()
sns.histplot(data=df, x="purchase_value", hue="class", element="step", stat="density", common_norm=False)
plt.title("Purchase Value Distribution by Fraud Class")
plt.show()


Fraudulent transactions show a somewhat different purchase value distribution.
Transaction amount alone is unlikely to separate the two classes, but it may be
informative when combined with behavioral features.
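
To complement the plots, the per-class distributions can also be compared numerically. The sketch below uses invented toy rows rather than the real data; only the column names `purchase_value` and `class` come from this notebook.

```python
import pandas as pd

# Toy rows; values are invented for illustration
toy = pd.DataFrame({
    "purchase_value": [10, 20, 30, 40, 15, 45],
    "class": [0, 0, 0, 0, 1, 1],
})

# Count, median and mean purchase value for each class
summary = toy.groupby("class")["purchase_value"].agg(["count", "median", "mean"])
print(summary)
```

Running the same `groupby` on `df` gives a quick check of whether the visual difference between classes is large or marginal.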



📊 Time Since Signup vs Fraud

In [None]:
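# Time Since Signup vs Fraud
# Assumes clean_fraud_data() already parsed both columns as datetimes
df["time_since_signup"] = (
    df["purchase_time"] - df["signup_time"]
).dt.total_seconds()

# A log axis cannot show zero or negative durations, so plot positive values only
sns.kdeplot(
    data=df[df["time_since_signup"] > 0],
    x="time_since_signup",
    hue="class",
    log_scale=True,
    common_norm=False,  # normalize each class separately so the rare class stays visible
)
plt.title("Time Since Signup vs Fraud")
plt.xlabel("Seconds since signup (log scale)")
plt.show()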

Fraudulent transactions tend to occur shortly after signup,
indicating potential abuse of newly created accounts.
This supports the inclusion of time_since_signup as a key feature.
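
One way to quantify this pattern is to bucket `time_since_signup` and compute the fraud rate per bucket. The sketch below is illustrative only: the timestamps are invented, and the bucket edges (1 hour, 1 day, 30 days) and the column name `signup_bucket` are hypothetical choices, not values derived from the dataset.

```python
import pandas as pd

# Toy rows with invented timestamps
toy = pd.DataFrame({
    "signup_time": pd.to_datetime(["2015-01-01 00:00:00"] * 4),
    "purchase_time": pd.to_datetime([
        "2015-01-01 00:00:30",
        "2015-01-01 12:00:00",
        "2015-01-10 00:00:00",
        "2015-03-01 00:00:00",
    ]),
    "class": [1, 0, 0, 0],
})

toy["time_since_signup"] = (
    toy["purchase_time"] - toy["signup_time"]
).dt.total_seconds()

# Hypothetical bucket edges in seconds: 1 hour, 1 day, 30 days
edges = [0, 3600, 86400, 30 * 86400, float("inf")]
labels = ["<1h", "1h-1d", "1d-30d", ">30d"]
toy["signup_bucket"] = pd.cut(
    toy["time_since_signup"], bins=edges, labels=labels, include_lowest=True
)

# Transaction count and fraud rate per bucket
bucket_stats = toy.groupby("signup_bucket", observed=False)["class"].agg(["count", "mean"])
print(bucket_stats)
```

Applied to `df`, a table like this makes it easy to see whether fraud concentrates in the earliest bucket.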


## IP → Country Fraud Analysis
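
The cell in this section relies on the project helpers `convert_ip_to_int` and `merge_ip_country`, whose implementations are not shown here. As a rough sketch of one common way to do such a range lookup, `pandas.merge_asof` can match each integer IP to the nearest lower bound and then discard matches that fall above the upper bound. The column names `lower_bound_ip_address`, `upper_bound_ip_address`, and `ip_int`, and the toy values, are assumptions for the example.

```python
import pandas as pd

# Toy range table; column names are assumed, not read from IpAddress_to_Country.csv
ranges = pd.DataFrame({
    "lower_bound_ip_address": [100, 200, 400],
    "upper_bound_ip_address": [199, 299, 499],
    "country": ["A", "B", "C"],
})
tx = pd.DataFrame({"ip_int": [150, 250, 350, 450]})

# Both frames must be sorted on their keys for merge_asof
merged = pd.merge_asof(
    tx.sort_values("ip_int"),
    ranges.sort_values("lower_bound_ip_address"),
    left_on="ip_int",
    right_on="lower_bound_ip_address",
    direction="backward",  # pick the closest lower bound at or below the IP
)

# merge_asof matches on the lower bound only, so reject hits beyond the upper bound
outside = merged["ip_int"] > merged["upper_bound_ip_address"]
merged.loc[outside, "country"] = "Unknown"
merged["country"] = merged["country"].fillna("Unknown")

print(merged[["ip_int", "country"]])
```

Here the IP 350 falls in the gap between two ranges, so it is labeled "Unknown" instead of inheriting the country of the preceding range.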

In [None]:

import pandas as pd
from src.geo_utils import convert_ip_to_int, merge_ip_country

# Base directory: project root
BASE_DIR = Path("..")

# Path to the IP-to-country mapping
ip_path = BASE_DIR / "data" / "raw" / "IpAddress_to_Country.csv"

# Convert IP to integer
df = convert_ip_to_int(df)

# Load IP-to-country mapping
ip_df = pd.read_csv(ip_path)

# Merge country info
df = merge_ip_country(df, ip_df)

assert "country" in df.columns
display(df[["ip_address", "country"]].head())

fraud_rate_country = df.groupby("country")["class"].mean().sort_values(ascending=False)
fraud_rate_country.head(10).plot(kind="bar", figsize=(10,4))
plt.title("Fraud Rate by Country")
plt.ylabel("Fraud Rate")
plt.show()

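Fraud rates vary considerably by country, highlighting the value
of geolocation features in detecting anomalous behavior.
Rates for countries with few transactions should be read with caution,
since a handful of fraud cases can produce an extreme rate.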
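
When ranking countries by fraud rate, it can help to report the transaction count alongside the rate and to filter out countries with very few rows. The sketch below uses toy data, and the threshold `MIN_TX` is a hypothetical value that would need tuning on the real dataset.

```python
import pandas as pd

# Toy data: country B has a single transaction, which happens to be fraud
toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 1 + ["C"] * 5,
    "class": [0, 0, 0, 1] + [1] + [0, 0, 0, 0, 1],
})

# Transaction count and fraud rate per country
stats = toy.groupby("country")["class"].agg(n="count", fraud_rate="mean")

MIN_TX = 3  # hypothetical minimum sample size
reliable = stats[stats["n"] >= MIN_TX].sort_values("fraud_rate", ascending=False)
print(reliable)
```

Without the filter, country B would top the ranking with a 100% fraud rate based on one transaction.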
