# Exploratory Data Analysis – Credit Card Fraud Dataset

## Objective
This notebook analyzes the credit card fraud dataset,
which is already PCA-transformed and extremely imbalanced.


📊 Class Distribution

In [None]:
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import seaborn as sns

# Make the project root importable so the `src` package resolves.
PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.data_loader import load_fraud_data, load_ip_country_data
from src.preprocessing import clean_fraud_data
from src.geo_utils import convert_ip_to_int, merge_ip_country

# Load raw data
df = load_fraud_data("../data/raw/creditcard.csv")

# Inspect column types and the first rows
df.info()
print(df.head())

# Class share in percent, then the raw counts as a bar chart
print(df["Class"].value_counts(normalize=True) * 100)
sns.countplot(x="Class", data=df)
plt.title("Class Distribution")
plt.show()


Fraudulent transactions account for less than 1% of all records,
making this a highly imbalanced classification problem.
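
The imbalance can also be expressed as a single ratio, which is a common starting point for class weights. The following is a minimal sketch on a synthetic stand-in for `df`; the helper name `class_balance` and the demo data are assumptions for illustration, not part of the project's `src` package.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, label_col: str = "Class") -> pd.Series:
    """Return the percentage of rows in each class, indexed by class label."""
    return df[label_col].value_counts(normalize=True).sort_index() * 100


# Synthetic stand-in for the real data: 998 legitimate rows, 2 fraudulent.
demo = pd.DataFrame({"Class": [0] * 998 + [1] * 2})
balance = class_balance(demo)
print(balance.round(2))

# Ratio of majority to minority share (hypothetical use: a class-weight seed).
imbalance_ratio = balance.loc[0] / balance.loc[1]
print(f"Legitimate-to-fraud ratio: {imbalance_ratio:.0f}:1")
```

On the real data, `class_balance(df)` would replace the demo frame.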

📊 Amount Distribution by Class

In [None]:
# Transaction amount vs. fraud class (plt and sns are imported in the first cell)
sns.boxplot(x="Class", y="Amount", data=df)
plt.title("Transaction Amount by Fraud Class")
plt.show()


Transaction amounts overlap heavily between classes,
indicating that amount alone is a weak fraud indicator.
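
A boxplot shows the overlap visually; per-class quartiles give the same picture in numbers. The sketch below uses a small made-up frame in place of `df`, and `amount_summary` is a hypothetical helper introduced only for this example.

```python
import pandas as pd


def amount_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Quartiles of Amount for each class, one row per class."""
    quartiles = df.groupby("Class")["Amount"].quantile([0.25, 0.5, 0.75]).unstack()
    return quartiles.rename(columns={0.25: "q25", 0.5: "median", 0.75: "q75"})


# Made-up amounts, chosen only to illustrate overlapping ranges.
demo = pd.DataFrame({
    "Class": [0, 0, 0, 0, 1, 1, 1, 1],
    "Amount": [5.0, 20.0, 80.0, 300.0, 1.0, 25.0, 90.0, 250.0],
})
summary = amount_summary(demo)
print(summary)
```

If the interquartile ranges of the two classes largely coincide, as they do in this toy frame, the amount on its own separates them poorly.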

📊 Time vs Fraud

In [None]:
# Time vs Fraud
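# Normalize each class separately; with raw counts the rare fraud class is barely visible.
sns.histplot(data=df, x="Time", hue="Class", bins=100,
             stat="density", common_norm=False)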
plt.title("Transaction Time vs Fraud")
plt.show()

Features V1–V28 are anonymized PCA components; only Time and Amount remain in their original form.
As a result, traditional feature interpretation is limited,
and the focus shifts to model performance rather than feature semantics.
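
Even without semantics, the anonymized components can still be ranked by how far the class means sit apart. The following is a rough sketch on synthetic data, not an analysis of the real file: `rank_features_by_separation` is a hypothetical helper, and the standardized mean difference it computes is just one simple choice of score.

```python
import numpy as np
import pandas as pd


def rank_features_by_separation(df: pd.DataFrame, label_col: str = "Class") -> pd.Series:
    """Rank features by |mean(fraud) - mean(legit)| divided by the overall std."""
    features = df.drop(columns=[label_col])
    fraud = features[df[label_col] == 1]
    legit = features[df[label_col] == 0]
    # Constant columns have zero spread; mark them NaN instead of dividing by zero.
    spread = features.std(ddof=0).replace(0, np.nan)
    score = (fraud.mean() - legit.mean()).abs() / spread
    return score.sort_values(ascending=False)


# Synthetic data: every 20th row is labeled fraud; V1 is shifted for fraud rows.
rng = np.random.default_rng(0)
n = 1000
labels = np.array([1 if i % 20 == 0 else 0 for i in range(n)])
demo = pd.DataFrame({
    "V1": rng.normal(size=n) - 3.0 * labels,  # depends on the class
    "V2": rng.normal(size=n),                 # unrelated to the class
    "Class": labels,
})
ranking = rank_features_by_separation(demo)
print(ranking)
```

A ranking like this says which components a model is likely to lean on, not what they mean.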
