# Task 1.1: Exploratory Data Analysis - Fraud Data

## Objective
Explore and understand the e-commerce fraud dataset (`Fraud_Data.csv`) to:
1. Understand data structure, types, and quality
2. Identify missing values and duplicates
3. Analyze class distribution (fraud vs non-fraud)
4. Discover patterns and relationships in features

## Key Questions
- How imbalanced is the fraud class?
- What are the distributions of key features?
- Are there obvious patterns that distinguish fraud from legitimate transactions?

In [None]:
# Standard imports
import sys
from pathlib import Path

# Add project root to path for imports.
# Assumes the notebook is run from a direct subdirectory of the project root.
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Project imports
from src.data.loader import load_fraud_data
from src.data.cleaning import clean_fraud_data, get_missing_value_summary, get_duplicate_summary

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

## 1. Load Raw Data

In [None]:
# Load the raw fraud data
DATA_PATH = project_root / "data" / "raw" / "Fraud_Data.csv"

df_raw = load_fraud_data(DATA_PATH)
print(f"Dataset shape: {df_raw.shape}")
print(f"Columns: {list(df_raw.columns)}")

In [None]:
# First look at the data
df_raw.head(10)

In [None]:
# Data types and info
df_raw.info()

In [None]:
# Basic statistics for numeric columns
df_raw.describe()

### Interpretation: Raw Data Overview

*TODO: After running, describe:*
- Number of rows and columns
- Data types observed
- Initial observations about the data

## 2. Data Quality Assessment

In [None]:
# Check for missing values
missing_summary = get_missing_value_summary(df_raw)
if len(missing_summary) > 0:
    print("Missing Values Found:")
    display(missing_summary)
else:
    print("No missing values found in the dataset.")

In [None]:
# Check for duplicates
dup_summary = get_duplicate_summary(df_raw)
print("Duplicate Analysis:")
for key, value in dup_summary.items():
    print(f"  {key}: {value}")

### Interpretation: Data Quality

*TODO: After running, describe:*
- Were there missing values? In which columns?
- Were there duplicates? How many?
- What cleaning actions are needed?

## 3. Clean the Data

In [None]:
# Apply cleaning function
df_clean, cleaning_report = clean_fraud_data(df_raw)

print("Cleaning Report:")
for key, value in cleaning_report.items():
    print(f"  {key}: {value}")

In [None]:
# Verify cleaned data types
df_clean.info()

## 4. Class Distribution Analysis (Target Variable)

In [None]:
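# Class distribution.
# value_counts() orders by frequency, so sort by label to guarantee the order [0, 1]
# that the 'Non-Fraud', 'Fraud' plot labels below rely on.
class_counts = df_clean['class'].value_counts().sort_index()
class_pct = df_clean['class'].value_counts(normalize=True).sort_index() * 100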

print("Class Distribution:")
print(f"  Non-Fraud (0): {class_counts[0]:,} ({class_pct[0]:.2f}%)")
print(f"  Fraud (1):     {class_counts[1]:,} ({class_pct[1]:.2f}%)")
print(f"\nImbalance Ratio: 1:{class_counts[0]/class_counts[1]:.1f}")

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart
colors = ['#2ecc71', '#e74c3c']
axes[0].bar(['Non-Fraud', 'Fraud'], class_counts.values, color=colors)
axes[0].set_ylabel('Count')
axes[0].set_title('Class Distribution (Count)')
# Anchor each label at the top of its bar instead of using a fixed pixel-like offset
for i, v in enumerate(class_counts.values):
    axes[0].text(i, v, f'{v:,}', ha='center', va='bottom', fontweight='bold')

# Pie chart
axes[1].pie(class_counts.values, labels=['Non-Fraud', 'Fraud'], autopct='%1.2f%%',
            colors=colors, explode=[0, 0.1])
axes[1].set_title('Class Distribution (Percentage)')

plt.tight_layout()
plt.show()

### Interpretation: Class Imbalance

*TODO: After running, describe:*
- What is the exact fraud rate?
- How severe is the imbalance?
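- What are the implications for modeling? (e.g., accuracy alone is misleading; consider precision, recall, F1, or PR-AUC, and resampling such as SMOTE or undersampling applied to the training split only)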
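To make the accuracy point concrete, consider a "model" that always predicts non-fraud. The counts below are hypothetical, chosen only to illustrate the arithmetic; they are not taken from `Fraud_Data.csv`.

```python
# Hypothetical class counts, chosen only to illustrate the arithmetic
n_legit = 9_000
n_fraud = 1_000
n_total = n_legit + n_fraud

# A baseline that always predicts non-fraud (class 0)
true_negatives = n_legit    # every legitimate transaction is labelled correctly
false_negatives = n_fraud   # every fraud case is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy:     {accuracy:.2%}")
print(f"Fraud recall: {recall:.2%}")
```

The baseline scores high accuracy while catching no fraud at all, which is why the fraud-class metrics matter more than overall accuracy on imbalanced data.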

## 5. Univariate Analysis

### 5.1 Numeric Features

In [None]:
# Purchase value distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Histogram
axes[0].hist(df_clean['purchase_value'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Purchase Value ($)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Purchase Value')

# Box plot
axes[1].boxplot(df_clean['purchase_value'], vert=True)
axes[1].set_ylabel('Purchase Value ($)')
axes[1].set_title('Purchase Value Box Plot')

plt.tight_layout()
plt.show()

print("Purchase Value Statistics:")
print(df_clean['purchase_value'].describe())

In [None]:
# Age distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(df_clean['age'], bins=30, edgecolor='black', alpha=0.7, color='#3498db')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Age')

axes[1].boxplot(df_clean['age'], vert=True)
axes[1].set_ylabel('Age')
axes[1].set_title('Age Box Plot')

plt.tight_layout()
plt.show()

print("Age Statistics:")
print(df_clean['age'].describe())

### 5.2 Categorical Features

In [None]:
# Categorical columns analysis
cat_cols = ['source', 'browser', 'sex']

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, col in enumerate(cat_cols):
    value_counts = df_clean[col].value_counts()
    axes[i].bar(value_counts.index, value_counts.values, color='#9b59b6', edgecolor='black')
    axes[i].set_xlabel(col.capitalize())
    axes[i].set_ylabel('Count')
    axes[i].set_title(f'Distribution of {col.capitalize()}')
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Value counts for each categorical
for col in cat_cols:
    print(f"\n{col.upper()}:")
    print(df_clean[col].value_counts())

### Interpretation: Feature Distributions

*TODO: After running, describe:*
- What is the shape of purchase_value distribution? Any outliers?
- What is the age distribution?
- Which sources/browsers are most common?
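The shape and outlier questions can be given numbers rather than judged from the plots alone. The sketch below computes skewness and applies the 1.5 × IQR rule (the same rule matplotlib box plots use for whiskers by default) to a small made-up series; the values are hypothetical and stand in for `df_clean['purchase_value']`.

```python
import pandas as pd

# Hypothetical purchase values: mostly small, with one large purchase
values = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 35, 150], name="purchase_value")

# Positive skewness suggests a long right tail
skewness = values.skew()

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"Skewness: {skewness:.2f}")
print(f"IQR bounds: [{lower:.2f}, {upper:.2f}]")
print(f"Outliers: {outliers.tolist()}")
```

Points flagged this way are candidates for inspection, not automatic removal; in fraud data, extreme values can be exactly the cases of interest.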

## 6. Bivariate Analysis (Features vs Target)

In [None]:
# Purchase value by fraud class
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Box plot
df_clean.boxplot(column='purchase_value', by='class', ax=axes[0])
axes[0].set_xlabel('Class (0=Non-Fraud, 1=Fraud)')
axes[0].set_ylabel('Purchase Value ($)')
axes[0].set_title('Purchase Value by Class')
plt.suptitle('')  # Remove auto-title

# Violin plot
parts = axes[1].violinplot(
    [df_clean.loc[df_clean['class'] == 0, 'purchase_value'].values,
     df_clean.loc[df_clean['class'] == 1, 'purchase_value'].values],
    positions=[0, 1]
)
axes[1].set_xticks([0, 1])
axes[1].set_xticklabels(['Non-Fraud', 'Fraud'])
axes[1].set_ylabel('Purchase Value ($)')
axes[1].set_title('Purchase Value Distribution by Class')

plt.tight_layout()
plt.show()

In [None]:
# Age by fraud class
fig, ax = plt.subplots(figsize=(8, 4))

df_clean.boxplot(column='age', by='class', ax=ax)
ax.set_xlabel('Class (0=Non-Fraud, 1=Fraud)')
ax.set_ylabel('Age')
ax.set_title('Age by Class')
plt.suptitle('')
plt.show()

# Statistics by class
print("Age by Class:")
print(df_clean.groupby('class')['age'].describe())

In [None]:
# Fraud rate by categorical features
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, col in enumerate(cat_cols):
    fraud_rate = df_clean.groupby(col)['class'].mean() * 100
    fraud_rate = fraud_rate.sort_values(ascending=False)
    
    axes[i].bar(fraud_rate.index, fraud_rate.values, color='#e74c3c', edgecolor='black')
    axes[i].set_xlabel(col.capitalize())
    axes[i].set_ylabel('Fraud Rate (%)')
    axes[i].set_title(f'Fraud Rate by {col.capitalize()}')
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].axhline(y=df_clean['class'].mean()*100, color='black', linestyle='--', label='Overall')
    axes[i].legend()  # Without this, the 'Overall' label is never drawn

plt.tight_layout()
plt.show()

In [None]:
# Fraud rate statistics by category
for col in cat_cols:
    print(f"\nFraud Rate by {col.upper()}:")
    fraud_stats = df_clean.groupby(col).agg(
        total_count=('class', 'count'),
        fraud_count=('class', 'sum'),
        fraud_rate=('class', 'mean')
    ).round(4)
    fraud_stats['fraud_rate'] = (fraud_stats['fraud_rate'] * 100).round(2)
    print(fraud_stats.sort_values('fraud_rate', ascending=False))

### Interpretation: Feature vs Target Relationships

*TODO: After running, describe:*
- Is there a difference in purchase_value between fraud and non-fraud?
- Do certain sources/browsers have higher fraud rates?
- Are there any surprising patterns?
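When comparing fraud rates across sources or browsers, group size matters: a rate computed from a small group is much less certain than the same rate from a large one. One way to account for this is a Wilson score interval around each group's fraud rate. The function below is a minimal sketch, not part of the project code; `wilson_interval` is a hypothetical helper, and the counts are made up.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical groups with the same observed fraud rate (10%) but different sizes
small_low, small_high = wilson_interval(successes=5, n=50)
large_low, large_high = wilson_interval(successes=500, n=5000)

print(f"n=50:   [{small_low:.3f}, {small_high:.3f}]")
print(f"n=5000: [{large_low:.3f}, {large_high:.3f}]")
```

If the intervals for two categories overlap heavily, the difference in their bar heights is weak evidence of a real difference in fraud rate.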

## 7. Summary and Next Steps

*TODO: Fill in after completing the analysis*

### Key Findings
1. **Class Imbalance**: [Describe the imbalance ratio]
2. **Data Quality**: [Describe missing values, duplicates, cleaning actions]
3. **Feature Insights**: [Key patterns discovered]

### Next Steps
- Proceed to geolocation analysis (IP to country mapping)
- Engineer time-based features
- Create velocity features
- Handle class imbalance for modeling
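As a preview of the time-based and velocity features, the sketch below derives a few on a toy frame. The column names `user_id`, `device_id`, `signup_time`, and `purchase_time` are assumptions made for illustration; this notebook does not confirm that the cleaned data uses them, and the values are made up.

```python
import pandas as pd

# Toy transactions. Column names are assumed for illustration only.
tx = pd.DataFrame({
    "user_id": [1, 2, 3],
    "device_id": ["A", "A", "B"],
    "signup_time": pd.to_datetime(
        ["2015-01-01 10:00:00", "2015-01-01 10:00:05", "2015-02-01 08:00:00"]
    ),
    "purchase_time": pd.to_datetime(
        ["2015-01-01 10:00:01", "2015-01-01 10:00:06", "2015-03-15 21:30:00"]
    ),
})

# Time-based features derived from the purchase timestamp
tx["hour_of_day"] = tx["purchase_time"].dt.hour
tx["day_of_week"] = tx["purchase_time"].dt.dayofweek  # Monday=0
tx["time_since_signup_s"] = (tx["purchase_time"] - tx["signup_time"]).dt.total_seconds()

# A simple velocity-style feature: number of transactions seen per device
tx["tx_per_device"] = tx.groupby("device_id")["user_id"].transform("count")

print(tx[["hour_of_day", "day_of_week", "time_since_signup_s", "tx_per_device"]])
```

A very short gap between signup and purchase, or many transactions sharing one device, are the kinds of signals these features are meant to expose; whether they separate fraud in this dataset is for the next notebook to establish.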

In [None]:
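# Save cleaned data for next notebook
# (to_parquet requires a Parquet engine such as pyarrow or fastparquet)
output_path = project_root / "data" / "processed" / "fraud_cleaned.parquet"
output_path.parent.mkdir(parents=True, exist_ok=True)  # Create the folder if it is missing
df_clean.to_parquet(output_path, index=False)
print(f"Cleaned data saved to: {output_path}")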