# Day 1 - Exploratory Data Analysis (EDA)

This notebook explores the **Credit Card Fraud Detection dataset** from Kaggle.  
Goal: Understand dataset structure, fraud distribution, and basic feature properties before augmentation.

---

In [1]:
%pip install -e ../../..

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys, os

# Make the package under src/ importable. The path is resolved relative to the
# notebook's working directory, so adjust it if the notebook is not run from the project root.
sys.path.append(os.path.abspath("src"))

# Hyphens are not valid in Python module names, so the import uses daredev_fraud
# even if the repository or distribution folder is named daredev-fraud.
from daredev_fraud.data_ingestion import load_fraud_dataset

# Keep plots styled
sns.set(style="whitegrid")

# Load dataset
df = load_fraud_dataset()

Obtaining file:///Users/nxwright/Documents/projects-2025/fraud-signal-detector-augmented-producer
ERROR: file:///Users/nxwright/Documents/projects-2025/fraud-signal-detector-augmented-producer does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
Note: you may need to restart the kernel to use updated packages.


ModuleNotFoundError: No module named 'daredev_fraud'
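
The `ModuleNotFoundError` above follows from the failed install: pip found neither `setup.py` nor `pyproject.toml` at `../../..`, and the `sys.path` fallback only helps if `src/` exists relative to the working directory. One way to avoid hard-coded relative paths is to walk upward from the working directory until a packaging file is found. The sketch below is a minimal version of that idea; `find_project_root` is a hypothetical helper (not part of `daredev_fraud`), and the `src/` layout is an assumption taken from the path comment in the cell.

```python
import sys
from pathlib import Path


def find_project_root(start, markers=("pyproject.toml", "setup.py")):
    """Return the nearest directory at or above `start` that contains a marker file, else None."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if any((candidate / marker).is_file() for marker in markers):
            return candidate
    return None


root = find_project_root(Path.cwd())
if root is None:
    # Nothing installable was found; a pyproject.toml or setup.py has to exist first.
    print("No pyproject.toml or setup.py found at or above", Path.cwd())
else:
    src_dir = root / "src"  # assumed layout: package code lives under <root>/src
    if src_dir.is_dir() and str(src_dir) not in sys.path:
        sys.path.insert(0, str(src_dir))
    print("Project root:", root)
```

If the helper prints that no marker was found, `%pip install -e` will keep failing no matter which relative path is used, because the project has no packaging file yet.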

## 1. Dataset Overview

We start by checking the shape, columns, and datatypes of the dataset.  
This ensures we understand the raw structure before moving forward.

In [1]:
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nDtypes:\n", df.dtypes.value_counts())
print("\nMissing values:\n", df.isnull().sum())

NameError: name 'df' is not defined

## 2. Fraud Distribution

The target column is **Class**:  
- `0` → Not Fraud  
- `1` → Fraud  

Let’s check how imbalanced the dataset is.

In [None]:
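# sort_index() fixes the order to 0, 1 so the plot labels below always match the right class
class_counts = df['Class'].value_counts().sort_index()
class_percent = df['Class'].value_counts(normalize=True).sort_index() * 100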

print("Counts:\n", class_counts)
print("\nPercentages:\n", class_percent)

# Bar plot
sns.barplot(x=class_counts.index, y=class_counts.values, palette="pastel")
plt.xticks([0,1], ['Not Fraud (0)', 'Fraud (1)'])
plt.ylabel("Count")
plt.title("Fraud Class Distribution")
plt.show()

# Pie chart
plt.pie(class_counts, labels=['Not Fraud', 'Fraud'], autopct='%1.3f%%', startangle=90, colors=["#66b3ff","#ff9999"])
plt.title("Fraud vs Not Fraud")
plt.show()
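
The same summary can be kept as a small table, with an explicit label-to-name mapping so display names never depend on row order. This is a sketch on synthetic labels, not the Kaggle data; `class_names` and `imbalance_ratio` are illustrative names.

```python
import pandas as pd

# Synthetic labels for illustration only: 995 legitimate rows, 5 fraud rows.
labels = pd.Series([0] * 995 + [1] * 5, name="Class")

class_names = {0: "Not Fraud", 1: "Fraud"}  # explicit mapping from label to display name

counts = labels.value_counts().sort_index()  # index order is always 0, 1
summary = pd.DataFrame({
    "label": counts.index.map(class_names),
    "count": counts.to_numpy(),
    "percent": (counts / counts.sum() * 100).to_numpy(),
})
imbalance_ratio = counts.loc[0] / counts.loc[1]  # legitimate rows per fraud row

print(summary)
print(f"Imbalance ratio: {imbalance_ratio:.0f}:1")
```

With a minority share this small, the ratio is usually easier to read than a pie slice.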


## 3. Descriptive Statistics

We’ll generate summary statistics to spot unusual ranges or outliers, especially in `Amount` and the PCA features (`V1`–`V28`).


In [None]:
df.describe().T
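
`describe()` shows ranges, but it does not say which rows count as unusual. A common rule of thumb is the 1.5 × IQR fence, sketched below on synthetic, right-skewed amounts (not the Kaggle data). `iqr_bounds` is a hypothetical helper, and the factor `k=1.5` is a convention rather than something this notebook prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed amounts for illustration only.
rng = np.random.default_rng(seed=0)
amounts = pd.Series(rng.lognormal(mean=3.0, sigma=1.2, size=1_000), name="Amount")


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR outside the first and third quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_bounds(amounts)
outliers = amounts[(amounts < lower) | (amounts > upper)]

print(amounts.describe())
print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print(f"Rows outside the fences: {len(outliers)} of {len(amounts)}")
```

For a heavily skewed column such as `Amount`, applying the same rule to a log-transformed copy may give a more useful picture, since the raw fences flag most of the long right tail.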


## 4. Correlation Heatmap

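Finally, let’s visualize correlations between features.  
Because `V1`–`V28` are PCA components, they are close to uncorrelated with each other, so the `Class` row is the part of the heatmap that hints at which features carry signal for fraud.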

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(), cmap="coolwarm", center=0, cbar=True)
plt.title("Correlation Heatmap", fontsize=14)
plt.show()
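
A heatmap with about 30 columns is hard to read cell by cell, so it can help to pull out the target column and rank features by absolute correlation. The sketch below uses a synthetic frame (the `signal` and `noise` columns are made up for illustration); with the real data the same two lines would apply to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic frame for illustration only: "signal" is built to track the label, "noise" is not.
rng = np.random.default_rng(seed=0)
n = 2_000
target = (rng.random(n) < 0.1).astype(int)
toy = pd.DataFrame({
    "signal": -2.0 * target + rng.normal(size=n),
    "noise": rng.normal(size=n),
    "Class": target,
})

# Pearson correlation of every feature with the target, strongest first (sign preserved).
corr_with_class = toy.corr()["Class"].drop("Class")
order = corr_with_class.abs().sort_values(ascending=False).index
ranked = corr_with_class.reindex(order)

print(ranked)
```

Sorting by absolute value matters because a strongly negative correlation is as informative as a strongly positive one.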

---
## Findings (Day 1)

- The dataset contains ~285k transactions, of which 492 are fraud (**0.17%**).  
- No missing values detected.  
- Features are mostly PCA-transformed (`V1–V28`), plus `Time`, `Amount`, and target `Class`.  
- Class imbalance is extreme → resampling or augmentation will be needed before modeling.  
- Some features (such as `V14` and `V17`) correlate more strongly with `Class` than the rest.  
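
To make the imbalance concrete, the arithmetic below shows why plain accuracy is a poor yardstick here: a model that never predicts fraud still scores very high. The exact row count (284,807) comes from the public dataset description rather than from this notebook's output, so treat it as an assumption.

```python
# Figures for the Kaggle credit card fraud dataset (assumed: 284,807 rows, 492 labelled fraud).
total_rows = 284_807
fraud_rows = 492

fraud_rate = fraud_rows / total_rows
legit_per_fraud = (total_rows - fraud_rows) / fraud_rows

# A trivial baseline that labels every transaction "Not Fraud".
baseline_accuracy = (total_rows - fraud_rows) / total_rows
baseline_recall = 0 / fraud_rows  # it catches none of the fraud cases

print(f"Fraud rate:            {fraud_rate:.3%}")
print(f"Legitimate per fraud:  {legit_per_fraud:.0f}:1")
print(f"Baseline accuracy:     {baseline_accuracy:.3%}")
print(f"Baseline fraud recall: {baseline_recall:.0%}")
```

Metrics such as recall, precision, or area under the precision-recall curve are more informative than accuracy at this class ratio.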

✅ Next Step (Day 2): Begin **data augmentation** with Faker and resampling strategies to balance fraud vs non-fraud.