# 📊 Notebook 01: Exploratory Data Analysis

This notebook performs exploratory data analysis (EDA) on two datasets:
1. **European Cardholders** (284,807 transactions, PCA-transformed)
2. **Sparkov Simulated** (simulated transactions with demographics)

---

In [None]:
import sys, os
sys.path.insert(0, os.path.abspath('..'))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src.utils.config import RAW_DATA_DIR, EU_DATASET_FILE, SPARKOV_DATASET_FILE, SPARKOV_TEST_FILE
from src.visualization.plot_utils import (
    plot_class_distribution, plot_feature_distributions,
    plot_correlation_matrix
)

sns.set_theme(style='whitegrid', font_scale=1.1)
%matplotlib inline
print('Setup complete.')

## 1. European Cardholders Dataset

In [None]:
eu_df = pd.read_csv(RAW_DATA_DIR / EU_DATASET_FILE)
print(f'Shape: {eu_df.shape}')
print(f'Missing values: {eu_df.isnull().sum().sum()}')
eu_df.head()

In [None]:
print('=== Class Distribution ===')
print(eu_df['Class'].value_counts())
print(f"\nFraud percentage: {eu_df['Class'].mean() * 100:.4f}%")

plot_class_distribution(eu_df['Class'], dataset_name='European')
plt.show()
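
The class counts printed above can be condensed into a few imbalance figures. The following is a minimal sketch, not part of the project code: `imbalance_summary` is a hypothetical helper, and the toy labels are invented for illustration rather than drawn from either dataset.

```python
import pandas as pd


def imbalance_summary(labels: pd.Series) -> dict:
    """Summarize a binary 0/1 label column (hypothetical helper)."""
    counts = labels.value_counts()
    n_fraud = int(counts.get(1, 0))
    n_legit = int(counts.get(0, 0))
    return {
        'n_legit': n_legit,
        'n_fraud': n_fraud,
        # Share of rows labelled 1
        'fraud_rate': n_fraud / len(labels),
        # Number of legitimate rows per fraudulent row
        'legit_per_fraud': n_legit / n_fraud if n_fraud else float('inf'),
    }


# Toy labels for illustration only (not taken from either dataset)
toy_labels = pd.Series([0] * 995 + [1] * 5)
toy_summary = imbalance_summary(toy_labels)
print(toy_summary)
```

In the notebook, the same call could be applied to `eu_df['Class']` or `sp_df['is_fraud']`.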

In [None]:
print('=== Descriptive Statistics ===')
eu_df.describe().T

In [None]:
# Transaction Amount distribution by class
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for cls, ax, title in zip([0, 1], axes, ['Legitimate', 'Fraudulent']):
    data = eu_df[eu_df['Class'] == cls]['Amount']
    ax.hist(data, bins=50, color='#55A868' if cls == 0 else '#C44E52', alpha=0.8)
    ax.set_title(f'{title}: Transaction Amount')
    ax.set_xlabel('Amount')
    ax.set_ylabel('Count')

plt.tight_layout()
plt.show()

print(f"Legit mean amount:  ${eu_df[eu_df['Class']==0]['Amount'].mean():.2f}")
print(f"Fraud mean amount:  ${eu_df[eu_df['Class']==1]['Amount'].mean():.2f}")
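
A mean can be pulled upward by a handful of very large transactions, so it may help to look at the median and a log-scaled mean alongside it. The sketch below assumes the column names `Amount` and `Class` used in the cells above; `amount_summary_by_class` is a hypothetical helper, and the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd


def amount_summary_by_class(df, amount_col='Amount', target_col='Class'):
    """Per-class count, mean, median, and max of the amount column (hypothetical helper)."""
    out = df.groupby(target_col)[amount_col].agg(['count', 'mean', 'median', 'max'])
    # log1p compresses large values and is defined at 0
    out['mean_log1p'] = np.log1p(df[amount_col]).groupby(df[target_col]).mean()
    return out


# Toy data for illustration only: one large legitimate transaction skews the mean
toy_amounts = pd.DataFrame({
    'Amount': [1.0, 2.0, 3.0, 1000.0, 5.0, 15.0],
    'Class':  [0, 0, 0, 0, 1, 1],
})
toy_amount_summary = amount_summary_by_class(toy_amounts)
print(toy_amount_summary)
```

In the toy data, the legitimate class has a mean of 251.5 but a median of 2.5, which is the kind of gap a mean-only comparison would hide.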

In [None]:
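# Time distribution by class.
# density=True normalizes each histogram so the much smaller fraud class stays visible.
fig, ax = plt.subplots(figsize=(12, 4))
ax.hist(eu_df[eu_df['Class']==0]['Time'], bins=100, alpha=0.5, label='Legit', color='#55A868', density=True)
ax.hist(eu_df[eu_df['Class']==1]['Time'], bins=100, alpha=0.7, label='Fraud', color='#C44E52', density=True)
ax.set_title('Transaction Time Distribution by Class')
ax.set_xlabel('Time (seconds since first transaction)')
ax.set_ylabel('Density')
ax.legend()
plt.show()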
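
Raw `Time` values are hard to compare across days. Assuming `Time` holds the seconds elapsed since the first transaction in the file (as the public dataset description states), it can be folded into a 24-hour cycle. The result is an hour offset relative to the unknown clock time of the first transaction, not a true hour of day. `add_relative_hour` and the `rel_hour` column are hypothetical names, and the toy frame is invented for illustration.

```python
import pandas as pd


def add_relative_hour(df, time_col='Time'):
    """Return a copy with an hour-of-cycle column derived from elapsed seconds (hypothetical helper)."""
    out = df.copy()
    # Whole hours elapsed, wrapped to the range 0..23
    out['rel_hour'] = (out[time_col] // 3600 % 24).astype(int)
    return out


# Toy data for illustration only
toy_times = pd.DataFrame({
    'Time':  [0.0, 3599.0, 3600.0, 86400.0, 90000.0],
    'Class': [0, 0, 1, 0, 1],
})
toy_times = add_relative_hour(toy_times)
print(toy_times['rel_hour'].tolist())
```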

In [None]:
# Feature distributions (V1-V8)
pca_features = [f'V{i}' for i in range(1, 9)]
plot_feature_distributions(eu_df, pca_features, target_col='Class', dataset_name='European')
plt.show()

In [None]:
# Correlation matrix
plot_correlation_matrix(eu_df, dataset_name='European')
plt.show()

## 2. Sparkov Simulated Dataset

In [None]:
sp_train = pd.read_csv(RAW_DATA_DIR / SPARKOV_DATASET_FILE)
sp_test  = pd.read_csv(RAW_DATA_DIR / SPARKOV_TEST_FILE)
sp_df = pd.concat([sp_train, sp_test], ignore_index=True)

print(f'Shape: {sp_df.shape}')
print(f'Missing values: {sp_df.isnull().sum().sum()}')
sp_df.head()

In [None]:
print('=== Class Distribution ===')
print(sp_df['is_fraud'].value_counts())
print(f"\nFraud percentage: {sp_df['is_fraud'].mean() * 100:.4f}%")

plot_class_distribution(sp_df['is_fraud'], dataset_name='Sparkov')
plt.show()

In [None]:
# Transaction amount by class
fig, ax = plt.subplots(figsize=(10, 5))
sp_df.boxplot(column='amt', by='is_fraud', ax=ax)
ax.set_title('Transaction Amount by Class')
ax.set_xlabel('Is Fraud')
ax.set_ylabel('Amount')
plt.suptitle('')
plt.show()

In [None]:
# Category distribution
fig, ax = plt.subplots(figsize=(14, 5))
sp_df.groupby(['category', 'is_fraud']).size().unstack(fill_value=0).plot(
    kind='bar', stacked=True, ax=ax, color=['#55A868', '#C44E52'])
ax.set_title('Transactions by Category & Fraud Status')
ax.legend(['Legit', 'Fraud'])
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Temporal analysis
sp_df['trans_dt'] = pd.to_datetime(sp_df['trans_date_trans_time'])
sp_df['hour'] = sp_df['trans_dt'].dt.hour

fig, ax = plt.subplots(figsize=(12, 4))
hourly = sp_df.groupby(['hour', 'is_fraud']).size().unstack(fill_value=0)
hourly.plot(kind='bar', ax=ax, color=['#55A868', '#C44E52'])
ax.set_title('Transactions by Hour of Day')
ax.legend(['Legit', 'Fraud'])
plt.tight_layout()
plt.show()
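
Because fraud is rare, raw counts per group are dominated by legitimate transactions. A per-group fraud rate is often easier to read. The sketch below is one way to compute it, assuming the `is_fraud` column used above; `fraud_rate_by` is a hypothetical helper, and the toy frame is invented for illustration.

```python
import pandas as pd


def fraud_rate_by(df, key, target_col='is_fraud'):
    """Transaction count, fraud count, and fraud rate per group (hypothetical helper)."""
    grouped = df.groupby(key)[target_col].agg(n='count', n_fraud='sum')
    grouped['fraud_rate'] = grouped['n_fraud'] / grouped['n']
    # Highest-risk groups first
    return grouped.sort_values('fraud_rate', ascending=False)


# Toy data for illustration only
toy_tx = pd.DataFrame({
    'hour':     [0, 0, 0, 0, 12, 12, 12, 12, 23, 23],
    'is_fraud': [1, 1, 0, 0, 0, 0, 0, 1, 0, 0],
})
toy_rates = fraud_rate_by(toy_tx, 'hour')
print(toy_rates)
```

In the notebook, the same helper could be called with `key='hour'` or `key='category'` on `sp_df`.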

## 3. Key EDA Findings

1. **Severe class imbalance** in both datasets: fraud represents < 1% of transactions.
2. **European dataset**: PCA-transformed features make direct interpretation difficult.  
   `Amount` and `Time` are the only original-scale features.
3. **Sparkov dataset**: Rich demographic and temporal features available.  
   Fraud distribution varies significantly by merchant `category` and `hour`.
4. Correlations between PCA features and the target are weak individually,  
   motivating the use of ensemble methods.
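
The per-feature correlations mentioned in finding 4 can be ranked directly instead of being read off a heatmap. The following is a minimal sketch, assuming a numeric frame with a binary `Class` column. `target_correlations` is a hypothetical helper, and the toy values are invented for illustration, so they say nothing about the real features.

```python
import pandas as pd


def target_correlations(df, target_col='Class'):
    """Pearson correlation of each feature with the target, ordered by absolute value (hypothetical helper)."""
    corr = df.drop(columns=[target_col]).corrwith(df[target_col])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data for illustration only: V1 tracks the label, V2 does not
toy_features = pd.DataFrame({
    'V1':    [0.1, 0.2, 0.3, 0.4, 2.0, 2.1],
    'V2':    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    'Class': [0, 0, 0, 0, 1, 1],
})
toy_ranked = target_correlations(toy_features)
print(toy_ranked)
```

Pearson correlation captures only linear association with the label, so a low value here does not rule out a feature being useful in combination with others.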

---
*Proceed to Notebook 02 for replication of existing work.*