# 1. Exploratory Data Analysis - Fraud Data

This notebook performs a comprehensive exploratory data analysis (EDA) on the e-commerce fraud dataset. We use the modular functions defined in the `src/` directory for consistency and reusability.

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.data.loading import load_fraud_data, load_ip_country_data
from src.data.cleaning import handle_missing_values, remove_duplicates, correct_data_types, validate_data
from src.visualization.eda import (
    plot_univariate,
    plot_bivariate,
    plot_class_distribution,
    plot_correlation_matrix,
    plot_fraud_by_country
)
from src.features.geolocation import merge_ip_country, analyze_fraud_by_country

## Load Data

In [None]:
# Load the datasets
fraud_df = load_fraud_data()
ip_country_df = load_ip_country_data()

print(f"Fraud Data Shape: {fraud_df.shape}")
print(f"IP-to-Country Data Shape: {ip_country_df.shape}")

fraud_df.head()

## Data Cleaning and Validation

In [None]:
# Basic validation
if validate_data(fraud_df):
    print("Data validation passed.")
else:
    print("Data validation failed!")

# Handle missing values (if any)
fraud_df = handle_missing_values(fraud_df, strategy='median')

# Remove duplicates
fraud_df = remove_duplicates(fraud_df)

# Correct data types
type_mapping = {
    'signup_time': 'datetime',
    'purchase_time': 'datetime'
}
fraud_df = correct_data_types(fraud_df, type_mapping)

fraud_df.info()
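
The `type_mapping` step above converts the two timestamp columns to datetimes. The implementation of `correct_data_types` is not shown in this notebook, so the following is only a minimal sketch of what such a conversion might look like in plain pandas. The helper name `apply_type_mapping` and the toy rows are hypothetical.

```python
import pandas as pd


def apply_type_mapping(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Hypothetical stand-in for correct_data_types; not the project's code."""
    out = df.copy()
    for col, kind in mapping.items():
        if kind == "datetime":
            # errors="coerce" turns unparseable strings into NaT instead of raising
            out[col] = pd.to_datetime(out[col], errors="coerce")
        else:
            out[col] = out[col].astype(kind)
    return out


# Toy data (assumed values) with one deliberately malformed timestamp
toy = pd.DataFrame({
    "signup_time": ["2015-02-24 22:55:49", "not a date"],
    "purchase_time": ["2015-04-18 02:47:11", "2015-06-08 01:38:54"],
})
converted = apply_type_mapping(
    toy, {"signup_time": "datetime", "purchase_time": "datetime"}
)

# Count values that could not be parsed, so they can be inspected rather than silently lost
n_unparsed = int(converted["signup_time"].isna().sum())
print(converted.dtypes)
print(f"Unparsed signup_time values: {n_unparsed}")
```

Counting the coerced `NaT` values after conversion is a cheap way to check whether the conversion introduced new missing values.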

## Class Distribution

Visualizing the imbalance in the target variable (`class`).

In [None]:
plot_class_distribution(fraud_df, col='class')
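
Alongside the plot, it can help to quantify the imbalance directly. The sketch below uses a small made-up frame rather than `fraud_df`, and assumes `class` is coded as 0 (legitimate) and 1 (fraud).

```python
import pandas as pd

# Toy target column (assumed coding: 0 = legitimate, 1 = fraud)
toy = pd.DataFrame({"class": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# Absolute counts and relative shares per class
counts = toy["class"].value_counts().sort_index()
shares = toy["class"].value_counts(normalize=True).sort_index()

# Ratio of majority to minority class
imbalance_ratio = counts.loc[0] / counts.loc[1]

print(counts)
print(shares)
print(f"Imbalance ratio (legitimate : fraud): {imbalance_ratio:.1f} : 1")
```

The same three lines applied to `fraud_df['class']` would give the exact fraud share to report next to the plot.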

## Univariate Analysis

Exploring the distribution of individual features.

In [None]:
# Plot numeric distributions
plot_univariate(fraud_df, col='purchase_value', kind='hist')
plot_univariate(fraud_df, col='age', kind='hist')

In [None]:
# Plot categorical distributions
plot_univariate(fraud_df, col='source', kind='count')
plot_univariate(fraud_df, col='browser', kind='count')
plot_univariate(fraud_df, col='sex', kind='count')

## Bivariate Analysis

Exploring relationships between features and the target variable.

In [None]:
plot_bivariate(fraud_df, x='class', y='purchase_value', kind='box')
plot_bivariate(fraud_df, x='class', y='age', kind='box')
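
Box plots can be backed by a numeric summary per class. The following is a sketch on toy data (the values are invented for illustration) that compares the median and mean of each numeric feature across the two classes.

```python
import pandas as pd

# Toy data with the same column names as the notebook; values are illustrative only
toy = pd.DataFrame({
    "class": [0, 0, 0, 1, 1],
    "purchase_value": [10, 20, 30, 20, 40],
    "age": [25, 35, 45, 30, 40],
})

# Per-class median and mean for each numeric feature
summary = toy.groupby("class")[["purchase_value", "age"]].agg(["median", "mean"])
print(summary)
```

If the per-class medians are close relative to the spread inside each class, that supports reading the box plots as showing little separation.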

## Correlation Analysis

In [None]:
plot_correlation_matrix(fraud_df)
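
How `plot_correlation_matrix` selects columns is not shown here. As a rough sketch of the underlying computation, assuming only numeric columns should enter the matrix, the correlations can be computed in pandas like this (toy values, not the real dataset):

```python
import pandas as pd

# Toy frame mixing numeric and categorical columns; values are illustrative only
toy = pd.DataFrame({
    "purchase_value": [1, 2, 3, 4],
    "age": [2, 4, 6, 8],
    "browser": ["Chrome", "Safari", "Chrome", "Opera"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```

Filtering with `select_dtypes` before calling `corr()` avoids errors or silent drops when the frame contains strings or datetimes.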

## Geolocation Analysis

Merging with IP-to-Country data and analyzing geographic patterns.

In [None]:
# Merge fraud data with country mapping
merged_df = merge_ip_country(fraud_df, ip_country_df)

print(f"Merged Data Shape: {merged_df.shape}")
merged_df.head()
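
The internals of `merge_ip_country` are not shown in this notebook. One common way to map an IP address to a country range is an as-of join, sketched below. The column names `ip_address`, `lower_bound_ip_address`, `upper_bound_ip_address`, and `country` are assumptions about the two tables, and the data is made up.

```python
import pandas as pd

# Toy IP ranges (assumed column names and values)
ranges = pd.DataFrame({
    "lower_bound_ip_address": [100, 200, 400],
    "upper_bound_ip_address": [199, 299, 499],
    "country": ["A", "B", "C"],
})

# Toy transactions; IPs are stored as floats here to mimic a numeric IP column
tx = pd.DataFrame({"user_id": [1, 2, 3], "ip_address": [250.0, 150.0, 350.0]})

# merge_asof needs both keys sorted and of the same dtype
tx_sorted = tx.astype({"ip_address": "int64"}).sort_values("ip_address")
ranges_sorted = ranges.sort_values("lower_bound_ip_address")

# For each transaction, take the last range whose lower bound is <= the IP
merged = pd.merge_asof(
    tx_sorted,
    ranges_sorted,
    left_on="ip_address",
    right_on="lower_bound_ip_address",
    direction="backward",
)

# The matched range only applies if the IP is also below its upper bound
inside = merged["ip_address"] <= merged["upper_bound_ip_address"]
merged["country"] = merged["country"].where(inside, "Unknown")

print(merged[["user_id", "ip_address", "country"]])
```

The upper-bound check matters: without it, an IP that falls in a gap between ranges would be assigned to the preceding range's country.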

In [None]:
# Analyze fraud by country
country_stats = analyze_fraud_by_country(merged_df)

# Plot fraud distribution by country
plot_fraud_by_country(country_stats, top_n=15)
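
What `analyze_fraud_by_country` returns is not shown here. As an illustrative sketch, per-country fraud volume and rate can be computed with a grouped aggregation. The `country` column name and the minimum-transaction threshold are assumptions, and the data is invented.

```python
import pandas as pd

# Toy merged data (assumed columns: country, class with 1 = fraud)
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "C"],
    "class": [0, 0, 0, 1, 1, 1, 1],
})

# Transactions, fraud cases, and fraud rate per country
stats = (
    toy.groupby("country")["class"]
    .agg(transactions="count", fraud_cases="sum", fraud_rate="mean")
    .sort_values("fraud_rate", ascending=False)
)

# Rates from very few transactions are unreliable, so filter on a minimum volume
min_transactions = 2  # assumed threshold for this toy example
reliable = stats[stats["transactions"] >= min_transactions]

print(stats)
print(reliable)
```

Reporting volume and rate side by side keeps a country with a single fraudulent transaction from appearing as the highest-risk country.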

## Conclusion

The initial EDA gives a first view of the dataset's structure, missing values, and class imbalance, and establishes a baseline understanding of the features and their relationships with fraud.

**Key Findings:**
- Significant class imbalance (fraud is rare).
- The box plots show no clear visual separation between classes for purchase value or age.
- Geolocation mapping reveals countries with higher fraud volumes and rates.