# Fraud Detection - Comprehensive Exploratory Data Analysis

## Learning Objectives
This notebook demonstrates industry-standard EDA techniques for fraud detection:
- **Data Quality Assessment**: Identify missing values, duplicates, and data type issues
- **Univariate Analysis**: Understand individual feature distributions
- **Bivariate Analysis**: Explore relationships between features and target
- **Multivariate Analysis**: Discover complex patterns and interactions
- **Feature Engineering Opportunities**: Identify potential new features

## Context
Fraud detection is a critical application of machine learning in finance. The extreme class imbalance (0.13% fraud rate) presents unique challenges that require specialized analytical approaches.
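
To make the imbalance concrete, the sketch below scores a trivial classifier that labels every transaction as legitimate at the stated 0.13% fraud rate. The `majority_baseline` helper and the 100,000-row figure are illustrative assumptions, not part of the original analysis.

```python
def majority_baseline(n_total: int, fraud_rate: float) -> dict:
    """Score a hypothetical classifier that always predicts 'legitimate'."""
    n_fraud = round(n_total * fraud_rate)
    n_legit = n_total - n_fraud
    return {
        "n_fraud": n_fraud,
        # Every legitimate row is classified correctly, every fraud row is missed
        "accuracy": n_legit / n_total,
        "fraud_recall": 0.0,
    }


baseline = majority_baseline(n_total=100_000, fraud_rate=0.0013)
print(baseline)
```

An accuracy close to 99.87% with zero fraud recall is the reason accuracy alone is a poor yardstick for this problem; precision, recall, and precision-recall curves are usually more informative.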

## 1. Library Imports and Configuration

In [5]:
# Core libraries for data manipulation and analysis
import pandas as pd
import numpy as np
import sqlite3
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Statistical analysis
from scipy import stats
from scipy.stats import chi2_contingency, f_oneway
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Configuration
DATABASE_PATH = '/Users/sidharthrao/Documents/Documents_Sid MacBook Pro/GitHub/Project-Rogue/Inttrvu/Capstone_Projects/Database.db'
SAMPLE_SIZE = 100000  # For memory efficiency in visualizations

print("Libraries imported successfully!")

ModuleNotFoundError: No module named 'pandas'

## 2. Data Loading and Initial Exploration

In [None]:
def load_fraud_data(sample_size=None):
    """
    Load fraud detection data from SQLite database
    
    Parameters:
    -----------
    sample_size : int, optional
        Number of rows to sample for memory efficiency
    
    Returns:
    --------
    pd.DataFrame
        Loaded fraud detection data
    """
    conn = None
    try:
        conn = sqlite3.connect(DATABASE_PATH)

        if sample_size:
            # int() guards against non-numeric input reaching the SQL string
            query = f"SELECT * FROM Fraud_detection ORDER BY RANDOM() LIMIT {int(sample_size)}"
        else:
            query = "SELECT * FROM Fraud_detection"

        df = pd.read_sql_query(query, conn)

        print(f"✅ Data loaded successfully! Shape: {df.shape}")
        return df

    except Exception as e:
        print(f"❌ Error loading data: {e}")
        return None

    finally:
        # Close the connection even if the query fails
        if conn is not None:
            conn.close()

# Load data with sampling for efficient processing
df = load_fraud_data(SAMPLE_SIZE)

if df is not None:
    print("\n📊 Dataset Overview:")
    print(f"- Total Records: {df.shape[0]:,}")
    print(f"- Features: {df.shape[1]}")
    print(f"- Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
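
The random-sample query can also be issued with a bound `LIMIT` parameter and a context manager that closes the connection. This is a minimal sketch against a throwaway in-memory database: the three-column table and the `fetch_sample` helper are hypothetical stand-ins, and only the `Fraud_detection` table name comes from this notebook.

```python
import sqlite3
from contextlib import closing


def fetch_sample(conn: sqlite3.Connection, sample_size: int):
    """Return (column_names, rows) for a random sample of Fraud_detection."""
    # The row limit is bound as a parameter instead of being formatted into the SQL
    cursor = conn.execute(
        "SELECT * FROM Fraud_detection ORDER BY RANDOM() LIMIT ?",
        (int(sample_size),),
    )
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


# closing() guarantees conn.close() runs even if a query raises
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE Fraud_detection (step INTEGER, amount REAL, isFraud INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Fraud_detection VALUES (?, ?, ?)",
        [(i, 100.0 * i, int(i == 7)) for i in range(1, 11)],
    )
    columns, rows = fetch_sample(conn, sample_size=3)

print(columns, len(rows))
```

Note that `ORDER BY RANDOM()` draws a simple random sample, so the fraud rate in the sample will vary somewhat from the rate in the full table.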

In [None]:
def convert_data_types(df):
    """
    Convert columns to appropriate data types for analysis
    
    Learning Note: Data type conversion is crucial for:
    - Memory efficiency
    - Correct statistical analysis
    - Proper visualization
    """
    df_converted = df.copy()
    
    # Convert numeric columns
    numeric_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 
                   'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isFlaggedFraud']
    
    for col in numeric_cols:
        df_converted[col] = pd.to_numeric(df_converted[col], errors='coerce')
    
    # Convert categorical columns
    categorical_cols = ['type', 'nameOrig', 'nameDest']
    
    for col in categorical_cols:
        df_converted[col] = df_converted[col].astype('category')
    
    print("✅ Data types converted successfully!")
    return df_converted

if df is not None:
    df = convert_data_types(df)
    print("\n📋 Data Types After Conversion:")
    print(df.dtypes)
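
Because `errors='coerce'` silently turns unparseable entries into `NaN`, it can be worth counting how many values were lost in the conversion. The sketch below uses a made-up `amount` column; the sample values are assumptions chosen only to show the check.

```python
import pandas as pd

# Hypothetical raw column as it might arrive from a text-typed database field
raw = pd.Series(["10.5", "200", "n/a", None, "3e2"], name="amount")
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw column but became NaN during conversion
lost_to_coercion = converted.isna() & raw.notna()
print(f"{int(lost_to_coercion.sum())} value(s) could not be parsed:")
print(raw[lost_to_coercion])
```

Running a check like this before and after conversion separates values that were already missing from values that coercion discarded.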

## 3. Data Quality Assessment

In [None]:
def assess_data_quality(df):
    """
    Comprehensive data quality assessment
    
    Learning Note: Data quality is foundational for reliable ML models.
    Poor data quality leads to:
    - Biased model predictions
    - Poor generalization
    - Incorrect business insights
    """
    print("🔍 DATA QUALITY ASSESSMENT")
    print("=" * 50)
    
    # Missing values analysis
    print("\n1. Missing Values Analysis:")
    missing_values = df.isnull().sum()
    missing_percentage = (missing_values / len(df)) * 100
    
    missing_df = pd.DataFrame({
        'Missing Count': missing_values,
        'Missing Percentage': missing_percentage
    })
    
    if missing_df['Missing Count'].sum() > 0:
        print(missing_df[missing_df['Missing Count'] > 0])
    else:
        print("✅ No missing values found!")
    
    # Duplicate records analysis
    print("\n2. Duplicate Records Analysis:")
    duplicates = df.duplicated().sum()
    print(f"- Duplicate Records: {duplicates:,} ({(duplicates/len(df))*100:.4f}%)")
    
    # Data type validation
    print("\n3. Data Type Validation:")
    print("- Numeric columns free of NaN after coercion:", all(df.select_dtypes(include=[np.number]).notna().all()))
    print("- Categorical columns have reasonable cardinality:")
    
    for col in df.select_dtypes(include=['category']).columns:
        unique_count = df[col].nunique()
        print(f"  * {col}: {unique_count:,} unique values")
    
    # Range validation for numeric columns
    print("\n4. Numeric Range Validation:")
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    
    for col in numeric_cols:
        if col in ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']:
            negative_count = (df[col] < 0).sum()
            if negative_count > 0:
                print(f"  ⚠️  {col}: {negative_count:,} negative values")
            else:
                print(f"  ✅ {col}: No negative values")
    
    return missing_df, duplicates

# Run data quality assessment
missing_df, duplicate_count = assess_data_quality(df)

In [None]:
# Visualize missing values (if any exist)
if missing_df['Missing Count'].sum() > 0:
    plt.figure(figsize=(12, 6))
    sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis')
    plt.title('Missing Values Heatmap')
    plt.xlabel('Features')
    plt.tight_layout()
    plt.show()
else:
    print("✅ No missing values to visualize!")

## 4. Univariate Analysis

### Learning Note: Univariate analysis helps understand individual characteristics of each feature, which is essential for:
- Detecting outliers and anomalies
- Understanding data distributions
- Identifying data quality issues
- Informing preprocessing decisions
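
As one example of a preprocessing decision informed by a distribution, the sketch below measures skewness before and after a `log1p` transform. The amounts are invented, and the `skewness` helper uses the plain population formula, so its values will differ slightly from the bias-corrected estimate that pandas' `.skew()` reports.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical transaction amounts with one extreme value
amounts = [10, 20, 30, 50, 100, 1_000, 100_000]
raw_skew = skewness(amounts)
log_skew = skewness([math.log1p(a) for a in amounts])
print(f"raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}")
```

The transformed values remain right-skewed but much less so, which is typically what makes log-scaled monetary features easier to plot and model.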

In [None]:
def analyze_numerical_features(df):
    """
    Comprehensive analysis of numerical features
    """
    print("📊 NUMERICAL FEATURES ANALYSIS")
    print("=" * 50)
    
    numerical_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 
                      'oldbalanceDest', 'newbalanceDest']
    
    # Statistical summary
    print("\n1. Statistical Summary:")
    stats_df = df[numerical_cols].describe().T
    
    # Add additional statistics
    stats_df['skewness'] = df[numerical_cols].skew()
    stats_df['kurtosis'] = df[numerical_cols].kurtosis()
    stats_df['missing_pct'] = df[numerical_cols].isnull().sum() / len(df) * 100
    
    print(stats_df)
    
    return numerical_cols, stats_df

numerical_cols, stats_summary = analyze_numerical_features(df)

In [None]:
# Distribution plots for numerical features
def plot_numerical_distributions(df, numerical_cols):
    """
    Create distribution plots for numerical features
    
    Learning Note: Distribution plots help identify:
    - Skewness and data transformation needs
    - Outliers and extreme values
    - Multi-modal distributions
    """
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.ravel()
    
    for i, col in enumerate(numerical_cols):
        # Density-normalized histogram (no KDE overlay is drawn)
        axes[i].hist(df[col].dropna(), bins=50, alpha=0.7, density=True)
        axes[i].set_title(f'{col} Distribution\n(Skew: {df[col].skew():.2f})')
        axes[i].set_xlabel(col)
        axes[i].set_ylabel('Density')
        
        # Add vertical lines for mean and median
        mean_val = df[col].mean()
        median_val = df[col].median()
        axes[i].axvline(mean_val, color='red', linestyle='--', alpha=0.8, label=f'Mean: {mean_val:.2f}')
        axes[i].axvline(median_val, color='green', linestyle='--', alpha=0.8, label=f'Median: {median_val:.2f}')
        axes[i].legend()
    
    plt.tight_layout()
    plt.show()
    
    # Box plots for outlier detection
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    axes = axes.ravel()
    
    for i, col in enumerate(numerical_cols):
        # Box plot
        axes[i].boxplot(df[col].dropna())
        axes[i].set_title(f'{col} Box Plot')
        axes[i].set_ylabel(col)
        
        # Add outlier count
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = ((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))).sum()
        axes[i].text(0.02, 0.98, f'Outliers: {outliers:,}', transform=axes[i].transAxes,
                    verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    plt.tight_layout()
    plt.show()

plot_numerical_distributions(df, numerical_cols)

In [None]:
def analyze_categorical_features(df):
    """
    Analyze categorical features
    """
    print("📋 CATEGORICAL FEATURES ANALYSIS")
    print("=" * 50)
    
    # Transaction type analysis
    print("\n1. Transaction Type Distribution:")
    type_counts = df['type'].value_counts()
    type_percentages = (type_counts / len(df)) * 100
    
    type_analysis = pd.DataFrame({
        'Count': type_counts,
        'Percentage': type_percentages
    })
    print(type_analysis)
    
    # Account name analysis
    print("\n2. Account Name Analysis:")
    print(f"- Unique Origin Accounts: {df['nameOrig'].nunique():,}")
    print(f"- Unique Destination Accounts: {df['nameDest'].nunique():,}")
    
    # Account type patterns
    df['orig_type'] = df['nameOrig'].str[0]
    df['dest_type'] = df['nameDest'].str[0]
    
    print("\n3. Account Type Distribution:")
    print("Origin Account Types:")
    print(df['orig_type'].value_counts())
    print("\nDestination Account Types:")
    print(df['dest_type'].value_counts())
    
    return type_analysis

type_analysis = analyze_categorical_features(df)

In [None]:
# Visualize categorical features
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Transaction type distribution
type_counts = df['type'].value_counts()
axes[0, 0].bar(type_counts.index, type_counts.values)
axes[0, 0].set_title('Transaction Type Distribution')
axes[0, 0].set_xlabel('Transaction Type')
axes[0, 0].set_ylabel('Count')
axes[0, 0].tick_params(axis='x', rotation=45)

# Transaction type percentages
type_percentages = (type_counts / len(df)) * 100
axes[0, 1].bar(type_percentages.index, type_percentages.values)
axes[0, 1].set_title('Transaction Type Percentages')
axes[0, 1].set_xlabel('Transaction Type')
axes[0, 1].set_ylabel('Percentage (%)')
axes[0, 1].tick_params(axis='x', rotation=45)

# Origin account types
orig_types = df['orig_type'].value_counts()
axes[1, 0].bar(orig_types.index, orig_types.values)
axes[1, 0].set_title('Origin Account Types')
axes[1, 0].set_xlabel('Account Type')
axes[1, 0].set_ylabel('Count')

# Destination account types
dest_types = df['dest_type'].value_counts()
axes[1, 1].bar(dest_types.index, dest_types.values)
axes[1, 1].set_title('Destination Account Types')
axes[1, 1].set_xlabel('Account Type')
axes[1, 1].set_ylabel('Count')

plt.tight_layout()
plt.show()

## 5. Target Variable Analysis

In [None]:
def analyze_target_variable(df):
    """
    Comprehensive analysis of the target variable (isFraud)
    
    Learning Note: Understanding target variable distribution is critical for:
    - Model selection (imbalanced data requires special techniques)
    - Evaluation metric selection
    - Sampling strategy decisions
    """
    print("🎯 TARGET VARIABLE ANALYSIS (isFraud)")
    print("=" * 50)
    
    # Basic statistics
    fraud_count = df['isFraud'].value_counts()
    fraud_percentage = (fraud_count / len(df)) * 100
    
    print("\n1. Fraud Distribution:")
    for label, count in fraud_count.items():
        percentage = fraud_percentage[label]
        label_text = "Fraud" if label == 1 else "Legitimate"
        print(f"  {label_text}: {count:,} ({percentage:.4f}%)")
    
    # Class imbalance ratio
    imbalance_ratio = fraud_count[0] / fraud_count[1] if len(fraud_count) > 1 else float('inf')
    print(f"\n2. Class Imbalance Ratio: {imbalance_ratio:.2f}:1 (Legitimate:Fraud)")
    
    # Visualize target distribution
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Count plot
    labels = ['Legitimate', 'Fraud']
    colors = ['lightblue', 'red']
    
    axes[0].bar(labels, [fraud_count[0], fraud_count[1]], color=colors)
    axes[0].set_title('Fraud vs Legitimate Transactions (Count)')
    axes[0].set_ylabel('Count')
    
    # Add count labels on bars
    for i, (label, count) in enumerate(zip(labels, [fraud_count[0], fraud_count[1]])):
        axes[0].text(i, count + max(fraud_count) * 0.01, f'{count:,}', 
                    ha='center', va='bottom', fontweight='bold')
    
    # Percentage plot (log scale for better visualization)
    axes[1].bar(labels, [fraud_percentage[0], fraud_percentage[1]], color=colors)
    axes[1].set_title('Fraud vs Legitimate Transactions (Percentage)')
    axes[1].set_ylabel('Percentage (%)')
    axes[1].set_yscale('log')  # Log scale to see small percentage
    
    # Add percentage labels on bars
    for i, (label, pct) in enumerate(zip(labels, [fraud_percentage[0], fraud_percentage[1]])):
        axes[1].text(i, pct * 1.1, f'{pct:.4f}%', 
                    ha='center', va='bottom', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    return fraud_count, fraud_percentage, imbalance_ratio

fraud_count, fraud_percentage, imbalance_ratio = analyze_target_variable(df)
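
One common way to act on the imbalance ratio is to weight classes inversely to their frequency. The sketch below applies the heuristic `n_samples / (n_classes * class_count)`, which is the rule behind scikit-learn's `class_weight="balanced"` option. The helper name and the class counts are illustrative assumptions.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Hypothetical counts for a 100,000-row sample at a 0.13% fraud rate
weights = balanced_class_weights({0: 99_870, 1: 130})
print(weights)
```

With these counts a fraud row weighs several hundred times more than a legitimate one, which is one option alongside resampling for keeping a model from ignoring the minority class.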

## 6. Bivariate Analysis

### Learning Note: Bivariate analysis explores relationships between pairs of variables, helping identify:
- Feature-target relationships
- Feature-feature correlations
- Potential predictive patterns
- Multicollinearity issues

In [None]:
def analyze_feature_target_relationships(df):
    """
    Analyze relationships between features and target variable
    """
    print("🔗 FEATURE-TARGET RELATIONSHIP ANALYSIS")
    print("=" * 50)
    
    # Numerical features vs target
    print("\n1. Numerical Features by Fraud Status:")
    numerical_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 
                      'oldbalanceDest', 'newbalanceDest']
    
    for col in numerical_cols:
        legit_stats = df[df['isFraud'] == 0][col].describe()
        fraud_stats = df[df['isFraud'] == 1][col].describe()
        
        print(f"\n{col}:")
        print(f"  Legitimate - Mean: {legit_stats['mean']:.2f}, Median: {legit_stats['50%']:.2f}")
        print(f"  Fraud - Mean: {fraud_stats['mean']:.2f}, Median: {fraud_stats['50%']:.2f}")
        
        # Statistical test (Mann-Whitney U test for non-normal distributions)
        if len(df[df['isFraud'] == 1]) > 0 and len(df[df['isFraud'] == 0]) > 0:
            try:
                statistic, p_value = stats.mannwhitneyu(
                    df[df['isFraud'] == 0][col].dropna(),
                    df[df['isFraud'] == 1][col].dropna()
                )
                significance = "Significant" if p_value < 0.05 else "Not Significant"
                print(f"  Mann-Whitney U test: {significance} (p={p_value:.2e})")
            except Exception:
                print("  Mann-Whitney U test: Unable to compute")
    
    # Categorical features vs target
    print("\n2. Transaction Type by Fraud Status:")
    type_fraud_crosstab = pd.crosstab(df['type'], df['isFraud'], margins=True)
    print(type_fraud_crosstab)
    
    # Chi-square test for categorical association
    if len(df['type'].unique()) > 1 and len(df['isFraud'].unique()) > 1:
        try:
            chi2, p_value, dof, expected = chi2_contingency(
                pd.crosstab(df['type'], df['isFraud'])
            )
            significance = "Significant" if p_value < 0.05 else "No significant"
            print(f"\nChi-square test: {significance} association (p={p_value:.2e})")
        except Exception:
            print("\nChi-square test: Unable to compute")
    
    return type_fraud_crosstab

type_fraud_crosstab = analyze_feature_target_relationships(df)
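
With a sample of this size, p-values tend to be tiny even for small differences, so an effect size is a useful companion to the Mann-Whitney U test. The sketch below computes the common-language effect size, which equals U divided by the number of cross-group pairs. The helper and the toy amounts are assumptions for illustration, and the double loop is only practical for small inputs.

```python
def common_language_effect_size(group_a, group_b):
    """P(a > b) + 0.5 * P(a == b) over all cross-group pairs."""
    wins = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties count as half a win
    return wins / (len(group_a) * len(group_b))


# Hypothetical amounts, not taken from the dataset
fraud_amounts = [35, 50, 60]
legit_amounts = [10, 20, 30, 40]
effect = common_language_effect_size(fraud_amounts, legit_amounts)
print(f"P(fraud amount > legitimate amount) = {effect:.3f}")
```

A value near 0.5 means the two groups overlap almost completely regardless of how small the p-value is, while values near 0 or 1 indicate a feature that separates the classes well.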

In [None]:
# Visualize feature-target relationships
def plot_feature_target_relationships(df):
    """
    Create visualizations for feature-target relationships
    """
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.ravel()
    
    numerical_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 
                      'oldbalanceDest', 'newbalanceDest']
    
    for i, col in enumerate(numerical_cols):
        # Box plots by fraud status
        legit_data = df[df['isFraud'] == 0][col].dropna()
        fraud_data = df[df['isFraud'] == 1][col].dropna()
        
        # Create box plot data
        box_data = [legit_data, fraud_data]
        labels = ['Legitimate', 'Fraud']
        
        axes[i].boxplot(box_data, labels=labels)
        axes[i].set_title(f'{col} by Fraud Status')
        axes[i].set_ylabel(col)
        
        # Add sample sizes
        axes[i].text(0.02, 0.98, f'Legitimate n={len(legit_data):,}\nFraud n={len(fraud_data):,}', 
                    transform=axes[i].transAxes, verticalalignment='top',
                    bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    plt.tight_layout()
    plt.show()
    
    # Transaction type vs fraud
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Stacked bar chart
    type_fraud_pct = pd.crosstab(df['type'], df['isFraud'], normalize='index') * 100
    type_fraud_pct.plot(kind='bar', stacked=True, ax=axes[0], 
                       color=['lightblue', 'red'], alpha=0.8)
    axes[0].set_title('Fraud Percentage by Transaction Type')
    axes[0].set_ylabel('Percentage (%)')
    axes[0].set_xlabel('Transaction Type')
    axes[0].legend(['Legitimate', 'Fraud'])
    axes[0].tick_params(axis='x', rotation=45)
    
    # Count plot
    type_fraud_count = pd.crosstab(df['type'], df['isFraud'])
    type_fraud_count.plot(kind='bar', ax=axes[1], 
                         color=['lightblue', 'red'], alpha=0.8)
    axes[1].set_title('Transaction Count by Type and Fraud Status')
    axes[1].set_ylabel('Count')
    axes[1].set_xlabel('Transaction Type')
    axes[1].legend(['Legitimate', 'Fraud'])
    axes[1].tick_params(axis='x', rotation=45)
    axes[1].set_yscale('log')  # Log scale to see fraud counts
    
    plt.tight_layout()
    plt.show()

plot_feature_target_relationships(df)

In [None]:
def correlation_analysis(df):
    """
    Correlation analysis for numerical features
    
    Learning Note: Correlation analysis helps identify:
    - Multicollinearity (which can affect model performance)
    - Feature redundancy
    - Potential feature engineering opportunities
    """
    print("🔗 CORRELATION ANALYSIS")
    print("=" * 50)
    
    numerical_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 
                      'oldbalanceDest', 'newbalanceDest', 'isFraud']
    
    # Calculate correlation matrix
    correlation_matrix = df[numerical_cols].corr()
    
    print("\n1. Correlation Matrix:")
    print(correlation_matrix.round(3))
    
    # Find highly correlated pairs
    print("\n2. Highly Correlated Feature Pairs (|r| > 0.7):")
    high_corr_pairs = []
    
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            corr_value = correlation_matrix.iloc[i, j]
            if abs(corr_value) > 0.7:
                feature1 = correlation_matrix.columns[i]
                feature2 = correlation_matrix.columns[j]
                high_corr_pairs.append((feature1, feature2, corr_value))
                print(f"  {feature1} ↔ {feature2}: {corr_value:.3f}")
    
    if not high_corr_pairs:
        print("  No highly correlated pairs found.")
    
    # Correlation with target
    print("\n3. Feature Correlation with Target (isFraud):")
    target_corr = correlation_matrix['isFraud'].sort_values(key=abs, ascending=False)
    print(target_corr.drop('isFraud').round(3))
    
    return correlation_matrix, high_corr_pairs

correlation_matrix, high_corr_pairs = correlation_analysis(df)
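
The nested loop over the upper triangle can also be expressed with NumPy index arrays. This is a sketch of that alternative, assuming a square correlation matrix given as a nested list; the `find_high_corr_pairs` name and the correlation values are made up for the example.

```python
import numpy as np


def find_high_corr_pairs(names, corr, threshold=0.7):
    """Return (feature_a, feature_b, r) for pairs with |r| above the threshold."""
    corr = np.asarray(corr, dtype=float)
    # Index pairs strictly above the diagonal, so each pair appears once
    rows, cols = np.triu_indices(len(names), k=1)
    strong = np.abs(corr[rows, cols]) > threshold
    return [
        (names[i], names[j], float(corr[i, j]))
        for i, j in zip(rows[strong], cols[strong])
    ]


# Hypothetical correlation values, not computed from the dataset
names = ["oldbalanceDest", "newbalanceDest", "amount"]
corr = [
    [1.00, 0.98, 0.30],
    [0.98, 1.00, 0.46],
    [0.30, 0.46, 1.00],
]
pairs = find_high_corr_pairs(names, corr)
print(pairs)
```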

In [None]:
# Visualize correlation matrix
plt.figure(figsize=(12, 10))

# Create heatmap
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
            square=True, fmt='.3f', cbar_kws={"shrink": .8})

plt.title('Feature Correlation Heatmap\n(Upper triangle masked for clarity)', fontsize=14, fontweight='bold')
plt.xlabel('Features')
plt.ylabel('Features')
plt.tight_layout()
plt.show()

## 7. Multivariate Analysis

### Learning Note: Multivariate analysis explores complex relationships involving three or more variables simultaneously, helping identify:
- Interaction effects between features
- Complex patterns not visible in bivariate analysis
- Natural groupings in the data
- Dimension reduction opportunities

In [None]:
def create_interaction_features(df):
    """
    Create interaction features for multivariate analysis
    
    Learning Note: Feature engineering creates new features from existing ones,
    potentially capturing complex relationships that improve model performance.
    """
    df_engineered = df.copy()
    
    # Balance change features
    df_engineered['orig_balance_change'] = df_engineered['newbalanceOrig'] - df_engineered['oldbalanceOrg']
    df_engineered['dest_balance_change'] = df_engineered['newbalanceDest'] - df_engineered['oldbalanceDest']
    
    # Balance ratio features
    df_engineered['orig_balance_ratio'] = np.where(
        df_engineered['oldbalanceOrg'] > 0,
        df_engineered['newbalanceOrig'] / df_engineered['oldbalanceOrg'],
        0
    )
    
    df_engineered['dest_balance_ratio'] = np.where(
        df_engineered['oldbalanceDest'] > 0,
        df_engineered['newbalanceDest'] / df_engineered['oldbalanceDest'],
        0
    )
    
    # Amount to balance ratios
    df_engineered['amount_to_orig_balance'] = np.where(
        df_engineered['oldbalanceOrg'] > 0,
        df_engineered['amount'] / df_engineered['oldbalanceOrg'],
        df_engineered['amount']
    )
    
    # Zero balance indicators
    df_engineered['orig_zero_after'] = (df_engineered['newbalanceOrig'] == 0).astype(int)
    df_engineered['dest_zero_before'] = (df_engineered['oldbalanceDest'] == 0).astype(int)
    
    # Time-based features
    df_engineered['hour_of_day'] = df_engineered['step'] % 24
    df_engineered['day_of_week'] = (df_engineered['step'] // 24) % 7
    df_engineered['is_business_hours'] = ((df_engineered['hour_of_day'] >= 9) & 
                                         (df_engineered['hour_of_day'] <= 17)).astype(int)
    
    print(f"✅ Created {len(df_engineered.columns) - len(df.columns)} new engineered features")
    
    new_features = [col for col in df_engineered.columns if col not in df.columns]
    print("New features:", new_features)
    
    return df_engineered, new_features

df_engineered, new_features = create_interaction_features(df)
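
The time-based features rest on the assumption that each `step` is one hour and that step 0 would fall at midnight on day 0. The sketch below reproduces the same arithmetic for single values so the mapping is easy to inspect. The helper name is hypothetical, and `day_of_week` here is a day index modulo 7, not a calendar weekday.

```python
def step_to_time_features(step: int) -> dict:
    """Mirror the notebook's step arithmetic for one value (assumes hourly steps)."""
    day_index, hour_of_day = divmod(step, 24)
    return {
        "hour_of_day": hour_of_day,
        "day_of_week": day_index % 7,
        # Inclusive upper bound: hour 17 counts as business hours
        "is_business_hours": int(9 <= hour_of_day <= 17),
    }


for step in (1, 17, 18, 49, 200):
    print(step, step_to_time_features(step))

features_49 = step_to_time_features(49)
```

If the data's clock does not actually start at midnight, `hour_of_day` is still a consistent 24-hour cycle, but the business-hours flag would need to be shifted accordingly.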

In [None]:
def analyze_engineered_features(df_engineered, new_features):
    """
    Analyze the newly created engineered features
    """
    print("🔧 ENGINEERED FEATURES ANALYSIS")
    print("=" * 50)
    
    # Statistical summary of new features
    print("\n1. Statistical Summary of New Features:")
    # is_numeric_dtype also catches int32/float32, which a literal 'int64'/'float64' check would miss
    new_numerical_features = [f for f in new_features if pd.api.types.is_numeric_dtype(df_engineered[f])]
    
    if new_numerical_features:
        new_stats = df_engineered[new_numerical_features].describe().T
        new_stats['skewness'] = df_engineered[new_numerical_features].skew()
        print(new_stats)
    
    # Correlation of new features with target
    print("\n2. New Features Correlation with Target:")
    for feature in new_numerical_features:
        correlation = df_engineered[feature].corr(df_engineered['isFraud'])
        print(f"  {feature}: {correlation:.4f}")
    
    # Visualize key engineered features
    if len(new_numerical_features) > 0:
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.ravel()
        
        # Select top 4 most correlated features for visualization
        feature_correlations = [(f, abs(df_engineered[f].corr(df_engineered['isFraud']))) 
                                for f in new_numerical_features]
        feature_correlations.sort(key=lambda x: x[1], reverse=True)
        top_features = [f[0] for f in feature_correlations[:4]]
        
        for i, feature in enumerate(top_features):
            if i < 4:
                # Box plot by fraud status
                legit_data = df_engineered[df_engineered['isFraud'] == 0][feature].dropna()
                fraud_data = df_engineered[df_engineered['isFraud'] == 1][feature].dropna()
                
                box_data = [legit_data, fraud_data]
                labels = ['Legitimate', 'Fraud']
                
                axes[i].boxplot(box_data, labels=labels)
                axes[i].set_title(f'{feature} by Fraud Status\n(Corr: {df_engineered[feature].corr(df_engineered["isFraud"]):.3f})')
                axes[i].set_ylabel(feature)
        
        plt.tight_layout()
        plt.show()
    
    return new_numerical_features

new_numerical_features = analyze_engineered_features(df_engineered, new_features)

In [None]:
def pca_analysis(df_engineered):
    """
    Principal Component Analysis for dimensionality reduction
    
    Learning Note: PCA helps identify:
    - The main sources of variance in the data
    - Redundant features that can be removed
    - Natural groupings of transactions
    """
    print("📊 PRINCIPAL COMPONENT ANALYSIS")
    print("=" * 50)
    
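    # Select numerical features for PCA
    # Note: new_numerical_features is read from the notebook's global scope
    # (set by analyze_engineered_features), so that cell must run first.
    numerical_features = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 
                          'oldbalanceDest', 'newbalanceDest'] + new_numerical_features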
    
    # Remove any infinite or very large values
    pca_data = df_engineered[numerical_features].replace([np.inf, -np.inf], np.nan).dropna()
    
    if len(pca_data) == 0:
        print("❌ No valid data for PCA after cleaning")
        return None, None
    
    # Standardize the data
    scaler = StandardScaler()
    pca_data_scaled = scaler.fit_transform(pca_data)
    
    # Apply PCA
    pca = PCA(n_components=min(10, len(numerical_features)))
    pca_result = pca.fit_transform(pca_data_scaled)
    
    # Explained variance
    print("\n1. Explained Variance by Component:")
    for i, variance in enumerate(pca.explained_variance_ratio_):
        cumulative_variance = sum(pca.explained_variance_ratio_[:i+1])
        print(f"  PC{i+1}: {variance:.4f} ({cumulative_variance:.4f} cumulative)")
    
    # Feature importance in components
    print("\n2. Feature Loadings for Top 3 Components:")
    for i in range(min(3, pca.n_components_)):
        print(f"\n  PC{i+1} Loadings:")
        loadings = pca.components_[i]
        feature_loadings = list(zip(numerical_features, loadings))
        feature_loadings.sort(key=lambda x: abs(x[1]), reverse=True)
        
        for feature, loading in feature_loadings[:5]:
            print(f"    {feature}: {loading:.4f}")
    
    # Visualize PCA results
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Scree plot
    axes[0].plot(range(1, len(pca.explained_variance_ratio_) + 1), 
                pca.explained_variance_ratio_, 'bo-')
    axes[0].set_title('Scree Plot - Explained Variance by Component')
    axes[0].set_xlabel('Principal Component')
    axes[0].set_ylabel('Explained Variance Ratio')
    axes[0].grid(True, alpha=0.3)
    
    # Cumulative variance plot
    cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
    axes[1].plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 'ro-')
    axes[1].set_title('Cumulative Explained Variance')
    axes[1].set_xlabel('Number of Components')
    axes[1].set_ylabel('Cumulative Explained Variance')
    axes[1].grid(True, alpha=0.3)
    axes[1].axhline(y=0.95, color='g', linestyle='--', alpha=0.7, label='95% Variance')
    axes[1].legend()
    
    plt.tight_layout()
    plt.show()
    
    return pca, pca_result

pca_model, pca_result = pca_analysis(df_engineered)
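
The 95% line on the cumulative variance plot corresponds to a simple lookup: the smallest number of components whose explained variance ratios sum to the target. A minimal sketch, with a hypothetical helper name and made-up ratios rather than values from the fitted PCA:

```python
from itertools import accumulate


def components_for_variance(explained_ratios, target=0.95):
    """Smallest k whose cumulative explained variance reaches the target."""
    for k, cumulative in enumerate(accumulate(explained_ratios), start=1):
        if cumulative >= target:
            return k
    return None  # target not reached with the available components


# Hypothetical explained variance ratios, sorted as PCA returns them
ratios = [0.50, 0.30, 0.10, 0.06, 0.04]
k_95 = components_for_variance(ratios)
print(f"Components needed for 95% variance: {k_95}")
```

On the real model the same function could be called with `pca.explained_variance_ratio_`. A result of `None` would mean the components kept do not reach the target.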

## 8. Outlier Detection and Analysis

In [None]:
def comprehensive_outlier_detection(df):
    """
    Multiple methods for outlier detection
    
    Learning Note: Outlier detection is crucial for fraud detection because:
    - Fraudulent transactions often appear as outliers
    - Outliers can skew model training
    - Different detection methods capture different types of anomalies
    """
    print("🔍 OUTLIER DETECTION ANALYSIS")
    print("=" * 50)
    
    numerical_cols = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 
                      'oldbalanceDest', 'newbalanceDest']
    
    outlier_summary = {}
    
    for idx, col in enumerate(numerical_cols, start=1):
        print(f"\n{idx}. {col} Outlier Analysis:")
        
        # Method 1: IQR Method
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        iqr_outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()
        iqr_percentage = (iqr_outliers / len(df)) * 100
        
        print(f"  IQR Method: {iqr_outliers:,} outliers ({iqr_percentage:.2f}%)")
        
        # Method 2: Z-score Method
        z_scores = np.abs(stats.zscore(df[col].dropna()))
        z_outliers = (z_scores > 3).sum()
        z_percentage = (z_outliers / len(df[col].dropna())) * 100
        
        print(f"  Z-score Method: {z_outliers:,} outliers ({z_percentage:.2f}%)")
        
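        # Method 3: Modified Z-score (for skewed data)
        median = df[col].median()
        mad = np.median(np.abs(df[col].dropna() - median))
        if mad == 0:
            # MAD is zero when more than half the values equal the median
            # (common for balance columns dominated by zeros); the score is undefined.
            modified_z_outliers = np.nan
            modified_z_percentage = np.nan
            print("  Modified Z-score: not computed (MAD is zero)")
        else:
            modified_z_scores = 0.6745 * (df[col] - median) / mad
            modified_z_outliers = (np.abs(modified_z_scores) > 3.5).sum()
            modified_z_percentage = (modified_z_outliers / len(df)) * 100
            print(f"  Modified Z-score: {modified_z_outliers:,} outliers ({modified_z_percentage:.2f}%)")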
        
        outlier_summary[col] = {
            'iqr_outliers': iqr_outliers,
            'iqr_percentage': iqr_percentage,
            'z_outliers': z_outliers,
            'z_percentage': z_percentage,
            'modified_z_outliers': modified_z_outliers,
            'modified_z_percentage': modified_z_percentage
        }
    
    return outlier_summary

outlier_summary = comprehensive_outlier_detection(df)
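
The modified Z-score divides by the median absolute deviation (MAD), which is zero whenever more than half the values equal the median. That is plausible for balance columns with many zeros. The sketch below shows the calculation on small made-up lists, including that edge case; the helper name and data are assumptions.

```python
from statistics import median


def modified_z_outliers(values, threshold=3.5):
    """Return values whose modified Z-score exceeds the threshold, or None if MAD is 0."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return None  # score undefined: more than half the values equal the median
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


# Hypothetical data: one spread-out list and one dominated by zeros
spread = [1, 2, 3, 4, 100]
zero_heavy = [0, 0, 0, 0, 5]

spread_outliers = modified_z_outliers(spread)
zero_heavy_outliers = modified_z_outliers(zero_heavy)
print(spread_outliers, zero_heavy_outliers)
```

Returning `None` rather than dividing by zero keeps a zero-heavy column from being reported as consisting almost entirely of outliers.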

In [None]:
# Visualize outliers for key features
def visualize_outliers(df, outlier_summary):
    """
    Create visualizations for outlier analysis
    """
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.ravel()
    
    numerical_cols = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 
                      'oldbalanceDest', 'newbalanceDest', 'step']
    
    for i, col in enumerate(numerical_cols):
        # Scatter plot with IQR outliers highlighted
        data = df[col].dropna()
        
        # Calculate IQR for outlier highlighting
        Q1 = data.quantile(0.25)
        Q3 = data.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        # Separate outliers and non-outliers
        non_outliers = data[(data >= lower_bound) & (data <= upper_bound)]
        outliers = data[(data < lower_bound) | (data > upper_bound)]
        
        # Create scatter plot to show distribution
        axes[i].scatter(range(len(non_outliers)), non_outliers, 
                       alpha=0.6, s=1, label='Normal', color='blue')
        axes[i].scatter(range(len(non_outliers), len(data)), outliers, 
                       alpha=0.8, s=2, label='Outliers', color='red')
        
        axes[i].set_title(f'{col} Distribution\nOutliers: {len(outliers):,} ({(len(outliers)/len(data)*100):.2f}%)')
        axes[i].set_xlabel('Index')
        axes[i].set_ylabel(col)
        axes[i].legend()
        axes[i].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

visualize_outliers(df, outlier_summary)

## 9. Data Insights Summary and Recommendations

In [None]:
def generate_eda_summary(df, fraud_count, correlation_matrix, outlier_summary, new_features):
    """
    Generate comprehensive EDA summary with actionable insights
    """
    print("📋 COMPREHENSIVE EDA SUMMARY")
    print("=" * 60)
    
    # Dataset Overview
    print("\n🎯 DATASET OVERVIEW:")
    print(f"  • Total Records: {df.shape[0]:,}")
    print(f"  • Total Features: {df.shape[1]}")
    print(f"  • Fraud Rate: {(fraud_count[1]/len(df)*100):.4f}%")
    print(f"  • Class Imbalance Ratio: {(fraud_count[0]/fraud_count[1]):.2f}:1")
    
    # Data Quality Assessment
    print("\n✅ DATA QUALITY ASSESSMENT:")
    missing_count = df.isnull().sum().sum()
    duplicate_count = df.duplicated().sum()
    print(f"  • Missing Values: {missing_count:,}")
    print(f"  • Duplicate Records: {duplicate_count:,}")
    print("  • Data Types: All properly converted")
    
    # Key Statistical Findings
    print("\n📊 KEY STATISTICAL FINDINGS:")
    print(f"  • Average Transaction Amount: ${df['amount'].mean():,.2f}")
    print(f"  • Median Transaction Amount: ${df['amount'].median():,.2f}")
    print(f"  • Amount Skewness: {df['amount'].skew():.2f} (Highly skewed)")
    print(f"  • Most Common Transaction Type: {df['type'].mode().iloc[0]}")
    
    # Fraud Patterns
    print("\n🔍 FRAUD PATTERNS:")
    fraud_by_type = df[df['isFraud'] == 1]['type'].value_counts()
    if len(fraud_by_type) > 0:
        print("  • Fraud by Transaction Types:")
        for trans_type, count in fraud_by_type.items():
            if trans_type and pd.notna(trans_type):
                percentage = (count / fraud_count[1]) * 100
                print(f"    - {trans_type}: {count:,} ({percentage:.1f}% of fraud)")
    
    # Feature Correlations
    print("\n🔗 FEATURE CORRELATIONS:")
    target_correlations = correlation_matrix['isFraud'].drop('isFraud').abs().sort_values(ascending=False)
    print("  • Top Features Correlated with Fraud:")
    for feature, corr in target_correlations.head(5).items():
        print(f"    - {feature}: {corr:.4f}")
    
    # Outlier Analysis
    print("\n🚨 OUTLIER ANALYSIS:")
    print("  • Average Outlier Percentage (IQR method):")
    avg_outlier_pct = np.mean([summary['iqr_percentage'] for summary in outlier_summary.values()])
    print(f"    {avg_outlier_pct:.2f}% across numerical features")
    
    # Feature Engineering Success
    print("\n🔧 FEATURE ENGINEERING INSIGHTS:")
    print(f"  • Created {len(new_features)} new features")
    if new_features:
        print("  • New feature categories:")
        print("    - Balance change features")
        print("    - Ratio features")
        print("    - Time-based features")
        print("    - Binary indicator features")
    
    return "EDA Summary Generated Successfully"

summary_result = generate_eda_summary(df, fraud_count, correlation_matrix, outlier_summary, new_features)

In [None]:
def generate_ml_recommendations():
    """
    Generate specific recommendations for ML pipeline development
    
    Learning Note: These recommendations are based on EDA findings and
    industry best practices for fraud detection systems.
    """
    print("🤖 MACHINE LEARNING PIPELINE RECOMMENDATIONS")
    print("=" * 60)
    
    print("\n📝 DATA PREPROCESSING RECOMMENDATIONS:")
    print("  1. Handle extreme class imbalance:")
    print("     • Use SMOTE or ADASYN for oversampling minority class")
    print("     • Implement class weighting in models")
    print("     • Consider ensemble methods designed for imbalanced data")
    
    print("\n  2. Feature scaling strategies:")
    print("     • Use RobustScaler for amount features (handles outliers)")
    print("     • Apply log transformation to highly skewed features")
    print("     • StandardScaler for normally distributed features")
    
    print("\n  3. Encoding techniques:")
    print("     • OneHotEncoding for transaction type")
    print("     • Target encoding for high-cardinality account names")
    print("     • Binary encoding for account types (C/M)")
    
    print("\n🎯 MODEL SELECTION RECOMMENDATIONS:")
    print("  1. Primary models to implement:")
    print("     • XGBoost/LightGBM (handle imbalance via scale_pos_weight)")
    print("     • Random Forest with balanced class weights")
    print("     • Logistic Regression with L1/L2 regularization")
    
    print("\n  2. Advanced techniques:")
    print("     • Isolation Forest for anomaly detection")
    print("     • Neural Networks with dropout layers")
    print("     • Ensemble methods (Voting, Stacking)")
    
    print("\n  3. Evaluation metrics priority:")
    print("     • Precision-Recall AUC (critical for imbalanced data)")
    print("     • F1-Score and F2-Score (F2 emphasizes recall)")
    print("     • ROC-AUC with caution due to class imbalance")
    
    print("\n⚡ PERFORMANCE OPTIMIZATION:")
    print("  1. For large datasets (6M+ records):")
    print("     • Use chunk-based processing for memory efficiency")
    print("     • Implement incremental learning where possible")
    print("     • Consider dimensionality reduction for high-cardinality features")
    
    print("\n  2. Real-time deployment considerations:")
    print("     • Model serialization with joblib/pickle")
    print("     • Feature pipeline persistence")
    print("     • API endpoint optimization for low latency")
    
    print("\n🔒 BUSINESS CONSIDERATIONS:")
    print("  1. Fraud detection specific:")
    print("     • Optimize for high recall (catch more fraud)")
    print("     • Implement threshold tuning based on business costs")
    print("     • Consider temporal validation (time-based split)")
    
    print("\n  2. Model monitoring:")
    print("     • Track fraud rate changes over time")
    print("     • Monitor feature drift and concept drift")
    print("     • Implement model retraining schedule")
    
    return "ML Recommendations Generated"

ml_recommendations = generate_ml_recommendations()
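
Two of the recommendations above, the F2-Score and threshold tuning based on business costs, can be sketched together. The example below is a minimal illustration using only the standard library. The labels, scores, candidate thresholds, and the cost figures `COST_FN` and `COST_FP` are hypothetical placeholders, not values derived from this dataset.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are flagged as fraud."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def fbeta(tp, fp, fn, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical costs: a missed fraud is assumed 50x as costly as a false alarm.
COST_FN = 50.0
COST_FP = 1.0

# Toy labels and model scores (illustrative, not from the dataset).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]

results = []
for threshold in (0.25, 0.50, 0.75):
    tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
    f2 = fbeta(tp, fp, fn)
    cost = COST_FN * fn + COST_FP * fp
    results.append((threshold, f2, cost))
    print(f"threshold={threshold:.2f}  F2={f2:.3f}  cost={cost:.0f}")

# Pick the threshold with the lowest total cost.
best_threshold = min(results, key=lambda r: r[2])[0]
print("Lowest-cost threshold:", best_threshold)
```

With missed fraud priced far above false alarms, the lowest threshold wins on both F2 and cost in this toy data. On real data the two criteria can disagree, which is why the cost figures should come from the business rather than from the model.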

## 10. Conclusion

### 🎓 Key Learning Points

This comprehensive EDA has provided valuable insights into the fraud detection dataset:

1. **Data Quality**: The dataset is clean with no missing values or duplicates, providing a solid foundation for ML modeling.

2. **Class Imbalance Challenge**: The extreme imbalance (0.13% fraud rate) requires specialized techniques and careful evaluation metric selection.

3. **Feature Relationships**: Strong correlations between balance features suggest opportunities for dimensionality reduction and feature engineering.

4. **Fraud Patterns**: Certain transaction types show higher fraud rates, providing valuable signals for model training.

5. **Outlier Significance**: High outlier percentages in financial features are expected and may contain fraud signals.
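
To make the class imbalance point concrete, the sketch below computes "balanced" class weights as `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn applies for `class_weight="balanced"`. The helper name `balanced_class_weights` and the toy label vector (13 fraud rows in 10,000, roughly the 0.13% rate) are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels at roughly a 0.13% fraud rate (illustrative only).
labels = [1] * 13 + [0] * 9987
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.4f}")
print(f"fraud/legit weight ratio: {weights[1] / weights[0]:.1f}")
```

The ratio of the two weights equals the class imbalance ratio, so each fraud row counts as much in the loss as several hundred legitimate rows.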

### 📊 Next Steps

The insights from this EDA will directly inform the ML pipeline development:
- Implement robust preprocessing for skewed distributions
- Use advanced techniques for handling class imbalance
- Engineer features based on discovered patterns
- Select appropriate models and evaluation metrics
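
As one possible shape for the preprocessing step, the sketch below applies a `log1p` transform followed by robust scaling (centre on the median, divide by the interquartile range). It uses only the standard library so it runs without the notebook's dependencies. The function name `log_robust_scale` and the sample amounts are hypothetical, and a real pipeline would fit the median and IQR on training data only.

```python
import math
import statistics


def log_robust_scale(values):
    """Apply log1p, then centre on the median and divide by the IQR.

    Assumes non-negative inputs, as is typical for transaction amounts.
    """
    logged = [math.log1p(v) for v in values]
    # "inclusive" treats the data as the full population when interpolating.
    q1, med, q3 = statistics.quantiles(logged, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined")
    return [(x - med) / iqr for x in logged]


# Illustrative amounts spanning several orders of magnitude.
amounts = [0.0, 9.0, 99.0, 999.0, 9999.0]
scaled = log_robust_scale(amounts)
print([round(s, 3) for s in scaled])
```

The log transform compresses the long right tail before scaling, so a single very large amount no longer dominates the scaled range.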

### 🚀 Ready for ML Pipeline

With these comprehensive insights, we're now ready to build an industry-standard ML pipeline that addresses the unique challenges of fraud detection.