# Exploratory Data Analysis (EDA) for Fraud Detection

**Objective:** Understand the dataset's characteristics, including class imbalance, feature distributions, correlations, and missing values. This analysis will guide our preprocessing and modeling decisions.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

## 1. Load Data
Load the raw dataset.

In [None]:
df = pd.read_csv('../data/raw/fraud_dataset.csv')

## 2. Initial Data Inspection

In [None]:
print("Dataset Shape:", df.shape)
print("\n--- Data Info ---")
df.info()
print("\n--- First 5 Rows ---")
df.head()

In [None]:
print("--- Descriptive Statistics ---")
df.describe().T

## 3. Class Imbalance Analysis
This is the most critical step for a fraud detection problem. We need to check how skewed our target variable is.

In [None]:
class_counts = df['isFraud'].value_counts()
class_percentage = df['isFraud'].value_counts(normalize=True) * 100

print("Fraud Class Distribution:")
print(class_counts)
print("\nFraud Class Percentage:")
print(class_percentage.round(4))

sns.countplot(x='isFraud', data=df)
plt.title('Class Distribution (0: Not Fraud, 1: Fraud)')
plt.show()

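**Finding:** The dataset is highly imbalanced: fraudulent transactions are a tiny minority. Accuracy is therefore a poor metric, because a model that never predicts fraud would still score close to 100%. Metrics that focus on the minority class, such as **Precision-Recall AUC**, are more informative.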
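
To make the point about accuracy concrete, here is a small sketch on made-up labels. The 1,000-row toy vector and its 0.5% fraud rate are assumptions chosen for illustration, not values taken from this dataset. A "model" that always predicts the majority class scores very high accuracy while catching no fraud at all.

```python
# Toy labels: 1,000 transactions, 5 of them fraudulent (hypothetical numbers).
y_true = [1] * 5 + [0] * 995

# A trivial "model" that always predicts the majority class (not fraud).
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual fraud cases that the model flagged.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall:   {recall:.3f}")
```

On these toy labels the accuracy is 0.995 and the recall is 0.000, which is why a high accuracy figure on its own says little about fraud detection quality.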

## 4. Missing Values & Duplicates

In [None]:
print("--- Missing Values per Column ---")
print(df.isnull().sum())

print("\n--- Total Duplicated Rows ---")
print(df.duplicated().sum())

**Finding:** This synthetic dataset appears to contain no missing values or duplicate rows. In a real-world scenario, we would need to define an imputation strategy (e.g., mean, median, or a constant) and decide whether duplicated rows are genuine repeat transactions or data errors.
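
If missing values did appear, one possible starting point is sketched below on a tiny made-up frame. The column names `oldbalanceOrg` and `type` are the ones this notebook mentions, but the values, the category labels, the choice of the median, and the `'UNKNOWN'` placeholder are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with gaps (values and category labels are illustrative only).
toy = pd.DataFrame({
    'oldbalanceOrg': [100.0, np.nan, 300.0, 10_000.0],
    'type': ['PAYMENT', None, 'TRANSFER', 'PAYMENT'],
})

imputed = toy.copy()

# Numerical column: the median is less sensitive to extreme values than the mean.
median_balance = imputed['oldbalanceOrg'].median()
imputed['oldbalanceOrg'] = imputed['oldbalanceOrg'].fillna(median_balance)

# Categorical column: a constant placeholder keeps "missing" as its own category.
imputed['type'] = imputed['type'].fillna('UNKNOWN')

print(imputed)
```

In a modeling pipeline, the fill value would normally be computed on the training split only and then reused on validation and test data, so that no information leaks across splits.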

## 5. Feature Distribution and Correlation

In [None]:
# Visualizing distributions of numerical features
numerical_features = df.select_dtypes(include=np.number).columns.tolist()
df[numerical_features].hist(bins=30, figsize=(20, 15))
plt.suptitle('Distribution of Numerical Features')
plt.show()

In [None]:
# Correlation Matrix
plt.figure(figsize=(15, 10))
# Use only numeric columns; np.number also covers dtypes such as int32 or float32
numeric_df = df.select_dtypes(include=np.number)
corr_matrix = numeric_df.corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()

**Findings:**
- `oldbalanceOrg` and `newbalanceOrg` are highly correlated.
- `oldbalanceDest` and `newbalanceDest` are also highly correlated.
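
These strong correlations suggest potential multicollinearity. Tree-based models like XGBoost are generally robust to it in terms of predictive performance, although correlated features can make feature importances harder to interpret, and the issue matters more for linear models.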
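
Reading pairs off a heatmap gets tedious as the number of features grows. The sketch below shows one way to list highly correlated pairs programmatically. It runs on made-up balances where the "new" columns are the "old" ones plus small noise, and the 0.9 threshold is an arbitrary assumption, not a value derived from this dataset.

```python
import numpy as np
import pandas as pd

# Made-up balances: each "new" column is its "old" column plus small noise.
rng = np.random.default_rng(0)
old_org = rng.uniform(0, 1_000, size=200)
old_dest = rng.uniform(0, 1_000, size=200)
toy = pd.DataFrame({
    'oldbalanceOrg': old_org,
    'newbalanceOrg': old_org + rng.normal(0, 10, size=200),
    'oldbalanceDest': old_dest,
    'newbalanceDest': old_dest + rng.normal(0, 10, size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one entry per column pair, filtered by a hypothetical 0.9 threshold.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9].sort_values(ascending=False)
print(high_pairs)
```

On the real data, `toy` would be replaced by the numeric columns of `df`.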

## 6. EDA Summary

1.  **Imbalance:** The dataset is extremely imbalanced, which is the primary challenge. Our model evaluation strategy must prioritize metrics like Precision-Recall AUC, Precision, and Recall.
2.  **Features:** We have a mix of numerical and categorical (`type`) features. The `nameOrig` and `nameDest` columns are identifiers and should likely be dropped.
3.  **Preprocessing Needs:**
    -   Categorical features (`type`) must be one-hot encoded.
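    -   Numerical features should be scaled when using scale-sensitive models (e.g., logistic regression, SVMs, neural networks), so that features with large ranges do not dominate. Tree-based models such as XGBoost do not require scaling.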
    -   Identifier columns (`nameOrig`, `nameDest`) should be dropped.
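
The preprocessing needs listed above could be sketched as follows. This is a minimal illustration on made-up rows, not the project's actual pipeline. The column names come from this notebook, while the values, the category labels, and the choice of standardization (done here with plain pandas) are assumptions.

```python
import pandas as pd

# Made-up rows using column names discussed in this notebook (values are illustrative).
toy = pd.DataFrame({
    'type': ['PAYMENT', 'TRANSFER', 'CASH_OUT', 'PAYMENT'],
    'oldbalanceOrg': [100.0, 5_000.0, 20_000.0, 300.0],
    'nameOrig': ['C1', 'C2', 'C3', 'C4'],
    'nameDest': ['M1', 'C5', 'C6', 'M2'],
    'isFraud': [0, 1, 1, 0],
})

# 1. Drop identifier columns.
features = toy.drop(columns=['nameOrig', 'nameDest'])

# 2. One-hot encode the categorical `type` column.
features = pd.get_dummies(features, columns=['type'], dtype=int)

# 3. Standardize numerical features (zero mean, unit variance); the target is left untouched.
num_cols = ['oldbalanceOrg']
means = features[num_cols].mean()
stds = features[num_cols].std(ddof=0)
features[num_cols] = (features[num_cols] - means) / stds

print(features)
```

In practice, the means and standard deviations would be estimated on the training split only and then applied to the other splits, and the scaling step can be skipped entirely for tree-based models.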