# 01 - EDA

This notebook explores the credit card transactions dataset.

- Loads `data/raw/creditcard.csv`
- Basic exploration: `head`, `info`, `describe`
- Visualizations: Amount distribution, Class distribution, correlation heatmap, KDE by class, Time vs Amount
- Exports a small sample to `data/processed/sample_for_app.csv` for the Streamlit app



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

%matplotlib inline

raw_path = Path('data/raw/creditcard.csv')
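# Fail early with a clear error if the dataset is missing (unlike assert, this also runs under python -O)
if not raw_path.exists():
    raise FileNotFoundError(f"File not found: {raw_path}. Run scripts/generate_synthetic.py or place the Kaggle CSV at that path.")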

df = pd.read_csv(raw_path)
df.head()


In [None]:
df.info()


In [None]:
df.describe().T


In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
# Amount distribution (log-scaled counts so the sparse high-value tail stays visible)
sns.histplot(df['Amount'], bins=100, ax=ax[0])
ax[0].set_title('Amount distribution (log count)')
ax[0].set_yscale('log')

# Class distribution
sns.countplot(x='Class', data=df, ax=ax[1])
ax[1].set_title('Class distribution (imbalance)')
plt.tight_layout()
plt.show()


In [None]:
# Correlation heatmap over all feature columns (V1..V28, Time, Amount); the Class label is excluded
cols = [f"V{i}" for i in range(1, 29)] + ['Time', 'Amount']
plt.figure(figsize=(12, 10))
sns.heatmap(df[cols].corr(), cmap='coolwarm', center=0)
plt.title('Correlation heatmap (features)')
plt.show()


In [None]:
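# Amount density by class. The KDE is fitted on log-scaled Amount (log_scale=True) rather than
# fitted on the linear scale and then squeezed onto a log axis; any zero amounts are dropped first.
plt.figure(figsize=(10, 4))
sns.kdeplot(data=df[df['Amount'] > 0], x='Amount', hue='Class', common_norm=False, log_scale=True)
plt.title('Amount density: fraud vs non-fraud')
plt.show()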

# Time vs Amount
plt.figure(figsize=(10, 4))
sns.scatterplot(data=df.sample(min(10000, len(df)), random_state=42), x='Time', y='Amount', hue='Class', alpha=0.5)
plt.title('Time vs Amount (sample)')
plt.yscale('log')
plt.show()


In [None]:
# Save small sample for the app
processed_dir = Path('data/processed')
processed_dir.mkdir(parents=True, exist_ok=True)

df.sample(n=min(2000, len(df)), random_state=42).to_csv(processed_dir / 'sample_for_app.csv', index=False)
processed_dir / 'sample_for_app.csv'
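
With fraud this rare, a uniform random sample of 2,000 rows may contain only a handful of fraud rows, which can make the app demo uninformative. One possible alternative is to sample per class and enforce a minimum count for each. The sketch below is an illustration, not part of the notebook: `stratified_sample` and `min_per_class` are hypothetical names, and `demo` is a synthetic stand-in for `df`.

```python
import numpy as np
import pandas as pd


def stratified_sample(df, n, label_col="Class", min_per_class=20, seed=42):
    """Return roughly n rows, keeping at least min_per_class rows of each class.

    Hypothetical helper: the result can slightly exceed n, and the minority
    class ends up over-represented relative to its true rate.
    """
    n = min(n, len(df))
    frac = n / len(df)
    parts = []
    for _, group in df.groupby(label_col):
        # Proportional share, raised to the minimum (capped at the group size)
        k = max(int(round(len(group) * frac)), min(min_per_class, len(group)))
        parts.append(group.sample(n=k, random_state=seed))
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Synthetic stand-in for the real data: 100,000 rows with 0.2% labelled as fraud
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Amount": rng.exponential(80.0, size=100_000).round(2),
    "Class": np.r_[np.zeros(99_800, dtype=int), np.ones(200, dtype=int)],
})

sample = stratified_sample(demo, n=2000)
print(sample["Class"].value_counts())
```

Because the minimum inflates the fraud share, a sample built this way suits demos but should not be used to estimate the real fraud rate.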


## Observations

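- Strong class imbalance: the positive class (fraud) is extremely rare.
- `Amount` is heavily right-skewed; a log scale makes the tail readable.
- Some `V*` components show correlation structure, so a model may be able to exploit feature interactions.
- PR-AUC is more informative than ROC-AUC under this imbalance: ROC-AUC is dominated by the abundant negatives, while precision directly reflects how many flagged transactions are actually fraud.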
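
The last point can be illustrated with a small synthetic experiment. The sketch below uses invented scores (not output from any model in this project) and two hand-written helpers, `roc_auc` and `average_precision`, both hypothetical names. Average precision is used here as a common summary of the precision-recall curve, and both helpers assume scores without ties.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC-AUC via ranks: the chance that a random positive outscores a random negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(y_true, scores):
    """Average precision: mean of the precision measured at the rank of each positive."""
    order = np.argsort(-scores)
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision_at_k[hits == 1].mean()


# Synthetic scores: 0.2% positives, whose scores are shifted up by two standard deviations
rng = np.random.default_rng(0)
n, n_pos = 200_000, 400
y = np.zeros(n, dtype=int)
y[:n_pos] = 1
scores = rng.normal(loc=2.0 * y, scale=1.0)

prevalence = y.mean()
roc = roc_auc(y, scores)
ap = average_precision(y, scores)
print(f"prevalence:        {prevalence:.4f}")
print(f"ROC-AUC:           {roc:.3f}")
print(f"average precision: {ap:.3f}")
```

In this setup the same scores look strong by ROC-AUC yet weak by average precision, because even a small false-positive rate applied to a very large negative class swamps the few true positives. A random scorer would sit near 0.5 on ROC-AUC but near the prevalence on average precision, so the prevalence is the baseline to compare PR-AUC against.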

