## Summary

The table evaluator now provides:

1. **Basic Statistical Analysis** - Traditional metrics like correlation, mean/std comparison
2. **Advanced Statistical Analysis** - Wasserstein distance and MMD for sophisticated distribution comparison
3. **Privacy Analysis** - Existing privacy metrics for assessing data leakage
4. **Visual Analysis** - Comprehensive plotting capabilities
5. **Performance Optimization** - Automatic sampling for large datasets

### Key Features:

- **Automatic Warnings**: Alerts when datasets are large (>250,000 rows)
- **Sampling Support**: Optional sampling with configurable limits
- **Multiple Kernels**: RBF, Polynomial, and Linear kernels for MMD
- **Comprehensive Reporting**: Combined metrics with quality ratings and recommendations
- **Type Safety**: Full type hints for better development experience

### When to Use Each Method:

- **Basic Analysis**: Always start here for a quick overview
- **Wasserstein Distance**: Good for quantifying how far apart the real and synthetic distributions of a column are
- **MMD**: Well suited to detecting complex, non-linear differences, including differences across several columns at once
- **Combined Analysis**: Use for a comprehensive assessment with actionable recommendations

In [None]:
# Direct access to advanced metrics
from table_evaluator.advanced_metrics.wasserstein import (
    wasserstein_distance_1d, 
    wasserstein_distance_df,
    earth_movers_distance_summary
)
from table_evaluator.advanced_metrics.mmd import (
    mmd_squared, 
    RBFKernel,
    mmd_column_wise,
    mmd_multivariate
)

# Example: Single column Wasserstein distance
if len(numerical_cols) > 0:
    col = numerical_cols[0]
    w_dist = wasserstein_distance_1d(real[col], fake[col])
    print(f"Wasserstein distance for {col}: {w_dist:.4f}")

# Example: MMD with custom kernel
rbf_kernel = RBFKernel(gamma=1.0)
if len(numerical_cols) >= 2:
    real_sample = real[numerical_cols[:2]].dropna().values
    fake_sample = fake[numerical_cols[:2]].dropna().values
    
    if len(real_sample) > 0 and len(fake_sample) > 0:
        mmd_value = mmd_squared(real_sample, fake_sample, rbf_kernel)
        print(f"MMD² between distributions: {mmd_value:.6f}")

# Example: Column-wise analysis
if len(numerical_cols) > 0:
    # Wasserstein distances for all numerical columns
    wass_distances = wasserstein_distance_df(real, fake, numerical_cols)
    print("\nWasserstein distances by column:")
    for _, row in wass_distances.iterrows():
        print(f"  {row['column']}: {row['wasserstein_distance']:.4f}")
    
    # MMD analysis for all columns
    mmd_column_results = mmd_column_wise(real, fake, numerical_cols)
    print("\nMMD analysis by column:")
    for _, row in mmd_column_results.iterrows():
        sig_mark = "***" if row['significant'] else ""
        print(f"  {row['column']}: MMD² = {row['mmd_squared']:.6f}, p = {row['p_value']:.4f} {sig_mark}")

### 5. Direct Access to Advanced Metrics

You can also use the advanced metrics directly for specific analysis:

In [None]:
# Example with sampling for large datasets
# This will automatically warn if dataset is large and suggest sampling

# Enable sampling for performance
sampling_results = advanced_evaluator.comprehensive_evaluation(
    real, fake, numerical_cols,
    enable_sampling=True,  # Enable sampling for large datasets
    max_samples=5000       # Maximum samples per dataset
)

print("Sampling configuration:")
print("  Enable sampling: True")
print("  Max samples per dataset: 5000")
print("  Automatic warnings: Enabled for datasets >250,000 rows")
print(f"  Current dataset size: {len(real) + len(fake)} rows")

# For very large datasets, you can also configure individual methods
large_dataset_config = {
    'wasserstein_config': {
        'include_2d': False,        # Skip 2D analysis for speed
        'enable_sampling': True,
        'max_samples': 3000
    },
    'mmd_config': {
        'kernel_types': ['rbf'],    # Use only RBF kernel for speed
        'include_multivariate': True,
        'enable_sampling': True,
        'max_samples': 3000
    }
}

print("\nConfiguration for very large datasets:")
print("  - Wasserstein: 1D only, sampling enabled")
print("  - MMD: RBF kernel only, sampling enabled")
print("  - Max samples: 3000 per method")

### 4. Working with Large Datasets

When working with large datasets (>250,000 rows), the advanced metrics can become computationally intensive. You can enable sampling to improve performance:
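
The idea behind sampling can be sketched in a few lines of plain Python. This is an illustration only, not the library's implementation: the helper `sample_rows` and its `seed` parameter are hypothetical, and the sketch assumes each dataset is capped independently at `max_samples` rows drawn uniformly without replacement.

```python
import random


def sample_rows(rows, max_samples, seed=42):
    """Return at most max_samples rows, drawn uniformly without replacement.

    Hypothetical helper for illustration; a fixed seed keeps runs repeatable.
    """
    if max_samples <= 0:
        raise ValueError("max_samples must be positive")
    if len(rows) <= max_samples:
        # Small datasets are returned unchanged.
        return list(rows)
    return random.Random(seed).sample(rows, max_samples)


# Stand-ins for a large real dataset and a small synthetic one.
real_rows = list(range(300_000))
fake_rows = list(range(1_000))

real_sample = sample_rows(real_rows, max_samples=5000)
fake_sample = sample_rows(fake_rows, max_samples=5000)
print(len(real_sample), len(fake_sample))  # 5000 1000
```

With pandas DataFrames, the equivalent draw is `DataFrame.sample(n=..., random_state=...)`. Sampling trades some precision in the metrics for a large reduction in runtime, so it is worth fixing the seed when results need to be compared across runs.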

In [None]:
# Suppress warnings before any imports
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='dython')
warnings.filterwarnings('ignore', category=FutureWarning, module='sklearn')
warnings.filterwarnings('ignore', category=UserWarning, message='.*pkg_resources.*')
warnings.filterwarnings('ignore', category=FutureWarning, message='.*multi_class.*')

%load_ext autoreload
%autoreload 2

# Example Notebook: How to analyse synthetic data
This notebook shows, fairly compactly, how to analyse synthetic data that you have somehow obtained.

In [ ]:
from table_evaluator import TableEvaluator, load_data

In [ ]:
real, fake = load_data("data/real_test_sample.csv", "data/fake_test_sample.csv")

In [ ]:
real.head()

In [ ]:
fake.head()

In [None]:
cat_cols = ["trans_type", "trans_operation", "trans_k_symbol"]

In [None]:
evaluator = TableEvaluator(real, fake, cat_cols=cat_cols)

### We can do a numerical analysis

In [None]:
evaluator.evaluate(target_col="trans_type", notebook=False, verbose=False)

### But we can also do a visual analysis

In [None]:
evaluator.visual_evaluation()

## Advanced Statistical Analysis

The table evaluator now includes advanced statistical methods for more sophisticated distribution comparison:

### 1. Wasserstein Distance (Earth Mover's Distance)
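The Wasserstein distance measures the minimum amount of "work" needed to transform one probability distribution into another. It is expressed in the units of the data, and lower values mean the real and synthetic distributions are more similar.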
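
To make the idea concrete, here is a minimal plain-Python sketch. It is not the library's implementation, and the helper name `wasserstein_1d_equal_size` is hypothetical. It assumes both samples have the same size, in which case the 1-D Wasserstein-1 distance reduces to the mean absolute gap between the sorted values.

```python
import random


def wasserstein_1d_equal_size(xs, ys):
    """Wasserstein-1 distance between two equally sized 1-D samples.

    Hypothetical helper: sorts both samples and averages the absolute
    differences between values of the same rank.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


rng = random.Random(0)
real_values = [rng.gauss(0.0, 1.0) for _ in range(2000)]
similar_values = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # same distribution
shifted_values = [v + 2.0 for v in real_values]              # shifted by 2

d_similar = wasserstein_1d_equal_size(real_values, similar_values)
d_shifted = wasserstein_1d_equal_size(real_values, shifted_values)
print(f"similar: {d_similar:.4f}, shifted: {d_shifted:.4f}")
```

Shifting every value by 2 yields a distance of 2, which shows why the metric is easy to interpret: it is measured in the same units as the column itself.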

In [ ]:
# Import advanced statistical evaluator
from table_evaluator.evaluators.advanced_statistical import AdvancedStatisticalEvaluator

# Create advanced evaluator
advanced_evaluator = AdvancedStatisticalEvaluator(verbose=True)

# Get numerical columns for analysis
numerical_cols = evaluator.numerical_columns
print(f"Numerical columns: {numerical_cols}")

# Run Wasserstein distance analysis
wasserstein_results = advanced_evaluator.wasserstein_evaluation(
    real, fake, numerical_cols
)

print(f"Wasserstein Quality Rating: {wasserstein_results['quality_metrics']['quality_rating']}")
print(f"Mean Wasserstein Distance: {wasserstein_results['quality_metrics']['mean_wasserstein_p1']:.4f}")
print(f"Distribution Similarity Score: {wasserstein_results['quality_metrics']['distribution_similarity_score']:.4f}")

### 2. Maximum Mean Discrepancy (MMD)
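MMD is a kernel-based method that compares two distributions through the means of their kernel embeddings. With a characteristic kernel such as RBF, the population MMD is zero only when the two distributions are identical.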
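
The computation can be sketched in plain Python as follows. This is an illustrative version of the biased MMD² estimator with an RBF kernel, not the library's implementation; the names `rbf`, `mean_kernel`, and `mmd_squared_biased` are hypothetical, and `gamma=1.0` is an arbitrary choice for the example.

```python
import math
import random


def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared_biased(xs, ys, gamma=1.0):
    """Biased MMD^2 estimate: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))


rng = random.Random(0)


def gaussian_points(n, shift):
    """Draw n two-dimensional standard-normal points, offset by shift."""
    return [(rng.gauss(0, 1) + shift, rng.gauss(0, 1) + shift) for _ in range(n)]


real_points = gaussian_points(150, shift=0.0)
similar_points = gaussian_points(150, shift=0.0)
shifted_points = gaussian_points(150, shift=2.0)

mmd_similar = mmd_squared_biased(real_points, similar_points)
mmd_shifted = mmd_squared_biased(real_points, shifted_points)
print(f"similar: {mmd_similar:.4f}, shifted: {mmd_shifted:.4f}")
```

The pairwise kernel sums make the cost grow quadratically with the number of rows, which is one reason sampling matters for MMD on large datasets.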

In [ ]:
# Run MMD analysis with different kernels
mmd_results = advanced_evaluator.mmd_evaluation(
    real, fake, numerical_cols, 
    kernel_types=["rbf", "polynomial", "linear"]
)

if 'quality_metrics' in mmd_results:
    print(f"MMD Quality Rating: {mmd_results['quality_metrics']['mmd_rating']}")
    print(f"Mean MMD: {mmd_results['quality_metrics']['mean_mmd']:.6f}")
    print(f"Fraction of columns with significant differences: {mmd_results['quality_metrics']['fraction_significant_differences']:.2f}")

# Show results for different kernels
if 'multivariate' in mmd_results:
    print("\nMultivariate MMD results by kernel:")
    for kernel, results in mmd_results['multivariate'].items():
        if 'error' not in results:
            print(f"  {kernel}: MMD² = {results['mmd_squared']:.6f}, p-value = {results['p_value']:.4f}")
            print(f"    Significant: {results['significant']}")
            
# Show best kernel
if 'best_kernel' in mmd_results:
    best = mmd_results['best_kernel']
    print(f"\nBest kernel: {best['kernel']} (discriminative power: {best['discriminative_power']})")

### 3. Comprehensive Advanced Analysis
Combine all advanced methods for a complete assessment:

In [ ]:
# Run comprehensive advanced evaluation
comprehensive_results = advanced_evaluator.comprehensive_evaluation(
    real, fake, numerical_cols
)

# Show combined metrics
combined = comprehensive_results['combined_metrics']
print(f"Overall Similarity Score: {combined['overall_similarity']:.4f}")
print(f"Quality Consensus: {combined['quality_consensus']}")
print(f"Wasserstein Rating: {combined['wasserstein_rating']}")
print(f"MMD Rating: {combined['mmd_rating']}")

# Show statistical significance
sig_info = combined['statistical_significance']
print("\nStatistical Significance:")
print(f"  Fraction of columns with significant differences: {sig_info['fraction_columns_different']:.2f}")
print(f"  Interpretation: {sig_info['interpretation']}")

# Show recommendations
print("\nRecommendations:")
for i, rec in enumerate(comprehensive_results['recommendations'], 1):
    print(f"  {i}. {rec}")