# GPU vs CPU Monte Carlo Performance Comparison

## Overview
- **What this notebook does:** Profiles the Monte Carlo simulation engine running on GPU (vectorized via CuPy/NumPy) vs CPU (per-trajectory OOP engine) and compares performance.
- **Prerequisites:** `ergodic_insurance` installed; CuPy optional for GPU acceleration.
- **Estimated runtime:** < 5 minutes on CPU
- **Audience:** Developer / Researcher

The GPU engine (`run_gpu_simulation`) processes **all simulation paths simultaneously** using array operations. When CuPy is available, arrays live on the GPU; otherwise plain NumPy is used. The CPU engine (`MonteCarloEngine`) runs per-trajectory OOP simulations with optional multiprocessing.

This notebook:
1. Configures a representative insurance scenario
2. Runs both engines at increasing simulation counts
3. Compares summary statistics from both engines to check that results are comparable
4. Plots runtime scaling and speedup curves
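
The contrast between the two execution styles described above can be illustrated with a toy model. This is a minimal sketch, not the `ergodic_insurance` implementation: the 5% growth rate, the exponential loss distribution, and the function names `simulate_per_path` and `simulate_vectorized` are hypothetical stand-ins chosen only for illustration.

```python
import numpy as np

def simulate_per_path(n_sims, n_years, rng):
    """Toy per-trajectory loop: one Python-level iteration per path and year."""
    final_assets = np.empty(n_sims)
    for i in range(n_sims):
        assets = 10_000_000.0
        for _ in range(n_years):
            loss = rng.exponential(250_000.0)        # hypothetical annual loss
            assets = max(assets * 1.05 - loss, 0.0)  # hypothetical 5% growth
        final_assets[i] = assets
    return final_assets

def simulate_vectorized(n_sims, n_years, rng):
    """Toy vectorized loop: one array operation per year covers every path."""
    assets = np.full(n_sims, 10_000_000.0)
    for _ in range(n_years):
        losses = rng.exponential(250_000.0, size=n_sims)
        assets = np.maximum(assets * 1.05 - losses, 0.0)
    return assets

per_path = simulate_per_path(2_000, 10, np.random.default_rng(0))
vectorized = simulate_vectorized(2_000, 10, np.random.default_rng(0))
print(f"per-path mean:   {per_path.mean():,.0f}")
print(f"vectorized mean: {vectorized.mean():,.0f}")
```

The two functions consume random draws in a different order, so individual paths differ even with the same seed, but the distributions of final assets should agree. The vectorized version runs `n_years` Python iterations instead of `n_sims * n_years`.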

In [None]:
"""Google Colab setup: mount Drive and install package dependencies.

Run this cell first. If prompted to restart the runtime, do so, then re-run all cells.
This cell is a no-op when running locally.
"""
import sys, os
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive')

    NOTEBOOK_DIR = '/content/drive/My Drive/Colab Notebooks/ei_notebooks/research'

    os.chdir(NOTEBOOK_DIR)
    if NOTEBOOK_DIR not in sys.path:
        sys.path.append(NOTEBOOK_DIR)

    !pip install git+https://github.com/AlexFiliakov/Ergodic-Insurance-Limits.git -q 2>&1 | tail -3
    print('\nSetup complete. If you see numpy/scipy import errors below,')
    print('restart the runtime (Runtime > Restart runtime) and re-run all cells.')

## 1. Setup

In [None]:
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from ergodic_insurance import ManufacturerConfig, InsuranceProgram, EnhancedInsuranceLayer
from ergodic_insurance.manufacturer import WidgetManufacturer
from ergodic_insurance.loss_distributions import ManufacturingLossGenerator
from ergodic_insurance.monte_carlo import MonteCarloEngine, MonteCarloConfig
from ergodic_insurance.gpu_mc_engine import extract_params, run_gpu_simulation, GPUSimulationParams
from ergodic_insurance.gpu_backend import is_gpu_available, gpu_info

warnings.filterwarnings('ignore')
np.random.seed(42)

print(f"GPU available (CuPy): {is_gpu_available()}")
if is_gpu_available():
    info = gpu_info()
    print(f"  Device: {info.get('device_name', 'N/A')}")
    print(f"  Memory: {info.get('total_memory_bytes', 0) / 1e9:.1f} GB")
else:
    print("  Vectorized engine will use NumPy (CPU) as fallback")
    print("  Install CuPy for true GPU acceleration: pip install 'ergodic-insurance[gpu]'")

## 2. Configure the Insurance Scenario

We define a $10M manufacturer with a two-layer insurance program — representative of a mid-market commercial account.

In [None]:
# --- Manufacturer ---
manufacturer_config = ManufacturerConfig(
    initial_assets=10_000_000,
    asset_turnover_ratio=1.0,
    base_operating_margin=0.10,
    tax_rate=0.21,
    retention_ratio=0.80,
)
manufacturer = WidgetManufacturer(manufacturer_config)

# --- Insurance Program ---
layers = [
    EnhancedInsuranceLayer(
        attachment_point=0,
        limit=5_000_000,
        base_premium_rate=0.015,
    ),
    EnhancedInsuranceLayer(
        attachment_point=5_000_000,
        limit=20_000_000,
        base_premium_rate=0.008,
    ),
]
insurance_program = InsuranceProgram(layers=layers, deductible=100_000)

# --- Loss Generator ---
loss_generator = ManufacturingLossGenerator(
    attritional_params={
        'base_frequency': 5.0,
        'severity_mean': 50_000,
        'severity_cv': 0.8,
    },
    large_params={
        'base_frequency': 0.3,
        'severity_mean': 2_000_000,
        'severity_cv': 1.5,
    },
    catastrophic_params={
        'base_frequency': 0.03,
        'severity_alpha': 2.5,
        'severity_xm': 1_000_000,
    },
    seed=42,
)

print(f"Manufacturer: ${manufacturer_config.initial_assets:,.0f} initial assets")
print(f"Insurance: {len(layers)} layers, ${insurance_program.deductible:,.0f} deductible")
print(f"Annual premium: ${insurance_program.calculate_premium():,.0f}")

## 3. Define Benchmark Configurations

We test at increasing simulation counts to observe scaling behavior. Simulation years are fixed at 10 to keep runtimes manageable.
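
The repeat-and-summarize pattern used in the benchmark cells can be factored into a small helper. The sketch below is an optional refinement, not part of `ergodic_insurance`: the name `time_repeated` and its `n_warmup` parameter are hypothetical. The benchmark cells in this notebook time every call, including the first, whereas this helper makes one untimed warm-up call by default.

```python
import statistics
import time

def time_repeated(fn, n_repeats=3, n_warmup=1):
    """Time fn() n_repeats times after n_warmup untimed calls."""
    for _ in range(n_warmup):
        fn()  # untimed: absorbs one-off costs such as lazy imports or compilation
    times = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return {
        "mean_time": statistics.fmean(times),
        "std_time": statistics.pstdev(times),  # population std, same as np.std default
        "min_time": min(times),
    }

stats = time_repeated(lambda: sum(range(100_000)), n_repeats=3)
print(stats)
```

Reporting `min_time` alongside the mean is useful because the minimum is less sensitive to background load on a shared machine such as a Colab instance.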

In [None]:
N_YEARS = 10
SIM_COUNTS = [500, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000]
N_REPEATS = 3  # Repeat each measurement for stability

print(f"Benchmark plan: {len(SIM_COUNTS)} sizes x {N_REPEATS} repeats")
print(f"Simulation counts: {SIM_COUNTS}")
print(f"Years per simulation: {N_YEARS}")

## 4. Benchmark: Vectorized Engine (GPU/NumPy Path)

The vectorized engine processes all N simulations simultaneously per year using array operations. This is the GPU-accelerated path; without CuPy it falls back to NumPy but retains the vectorized structure.
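
One caveat when timing CuPy code: GPU kernels launch asynchronously, so a wall-clock timer can stop before the device has finished. The sketch below shows one way to guard against that. It assumes nothing about `run_gpu_simulation` itself; if that function already copies results back to host memory, the copy forces a synchronization and the extra call is redundant. The helper names `synchronize` and `timed_call` are hypothetical.

```python
import time

try:
    import cupy as cp  # optional dependency
except ImportError:
    cp = None

def synchronize():
    """Block until queued GPU work finishes; does nothing without CuPy."""
    if cp is not None:
        cp.cuda.Stream.null.synchronize()

def timed_call(fn):
    """Return (result, elapsed_seconds), synchronizing before and after fn()."""
    synchronize()
    start = time.perf_counter()
    result = fn()
    synchronize()
    return result, time.perf_counter() - start

# Stand-in workload; in this notebook fn would wrap the vectorized engine call.
result, elapsed = timed_call(lambda: sum(range(10_000)))
print(f"elapsed: {elapsed:.6f}s")
```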

In [None]:
vectorized_results = []

for n_sims in SIM_COUNTS:
    times = []
    for rep in range(N_REPEATS):
        # Build a fresh config for each repeat so every run gets its own seed
        mc_config = MonteCarloConfig(
            n_simulations=n_sims,
            n_years=N_YEARS,
            seed=42 + rep,
        )
        # Re-create loss generator + insurance for clean state
        lg = ManufacturingLossGenerator(
            attritional_params={'base_frequency': 5.0, 'severity_mean': 50_000, 'severity_cv': 0.8},
            large_params={'base_frequency': 0.3, 'severity_mean': 2_000_000, 'severity_cv': 1.5},
            catastrophic_params={'base_frequency': 0.03, 'severity_alpha': 2.5, 'severity_xm': 1_000_000},
            seed=42 + rep,
        )
        ip = InsuranceProgram(
            layers=[
                EnhancedInsuranceLayer(attachment_point=0, limit=5_000_000, base_premium_rate=0.015),
                EnhancedInsuranceLayer(attachment_point=5_000_000, limit=20_000_000, base_premium_rate=0.008),
            ],
            deductible=100_000,
        )

        gpu_params = extract_params(
            manufacturer=manufacturer,
            insurance_program=ip,
            loss_generator=lg,
            mc_config=mc_config,
        )

        start = time.perf_counter()
        gpu_res = run_gpu_simulation(gpu_params)
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    mean_time = np.mean(times)
    std_time = np.std(times)
    vectorized_results.append({
        'n_sims': n_sims,
        'mean_time': mean_time,
        'std_time': std_time,
        'min_time': np.min(times),
        'throughput': n_sims * N_YEARS / mean_time,
        'mean_final_assets': float(np.mean(gpu_res['final_assets'])),
    })
    print(f"  Vectorized  n={n_sims:>6,}: {mean_time:.3f}s ± {std_time:.3f}s  "
          f"({n_sims * N_YEARS / mean_time:,.0f} sim-years/s)")

df_vec = pd.DataFrame(vectorized_results)
print("\nVectorized engine benchmarks complete.")

## 5. Benchmark: CPU Engine (OOP Per-Trajectory Path)

The CPU engine runs each simulation trajectory through the full OOP model (WidgetManufacturer, Ledger, InsuranceProgram). It supports multiprocessing via `parallel=True`.

We test with `parallel=False` (single-threaded) to isolate per-trajectory overhead, and `parallel=True` (multiprocessing) for realistic comparison.

In [None]:
# Use smaller counts for the CPU engine since it's much slower
CPU_SIM_COUNTS = [500, 1_000, 2_000, 5_000]

cpu_single_results = []
cpu_parallel_results = []

for n_sims in CPU_SIM_COUNTS:
    # --- Single-threaded ---
    times_single = []
    for rep in range(N_REPEATS):
        lg = ManufacturingLossGenerator(
            attritional_params={'base_frequency': 5.0, 'severity_mean': 50_000, 'severity_cv': 0.8},
            large_params={'base_frequency': 0.3, 'severity_mean': 2_000_000, 'severity_cv': 1.5},
            catastrophic_params={'base_frequency': 0.03, 'severity_alpha': 2.5, 'severity_xm': 1_000_000},
            seed=42 + rep,
        )
        ip = InsuranceProgram(
            layers=[
                EnhancedInsuranceLayer(attachment_point=0, limit=5_000_000, base_premium_rate=0.015),
                EnhancedInsuranceLayer(attachment_point=5_000_000, limit=20_000_000, base_premium_rate=0.008),
            ],
            deductible=100_000,
        )
        mc_config = MonteCarloConfig(
            n_simulations=n_sims,
            n_years=N_YEARS,
            parallel=False,
            progress_bar=False,
            seed=42 + rep,
            monitor_performance=False,
            generate_summary_report=False,
            cache_results=False,
        )
        engine = MonteCarloEngine(
            loss_generator=lg,
            insurance_program=ip,
            manufacturer=manufacturer,
            config=mc_config,
        )
        start = time.perf_counter()
        cpu_res = engine.run()
        elapsed = time.perf_counter() - start
        times_single.append(elapsed)

    mean_t = np.mean(times_single)
    cpu_single_results.append({
        'n_sims': n_sims,
        'mean_time': mean_t,
        'std_time': np.std(times_single),
        'min_time': np.min(times_single),
        'throughput': n_sims * N_YEARS / mean_t,
        'mean_final_assets': float(np.mean(cpu_res.final_assets)),
    })
    print(f"  CPU single  n={n_sims:>6,}: {mean_t:.3f}s ± {np.std(times_single):.3f}s  "
          f"({n_sims * N_YEARS / mean_t:,.0f} sim-years/s)")

    # --- Parallel ---
    times_parallel = []
    for rep in range(N_REPEATS):
        lg = ManufacturingLossGenerator(
            attritional_params={'base_frequency': 5.0, 'severity_mean': 50_000, 'severity_cv': 0.8},
            large_params={'base_frequency': 0.3, 'severity_mean': 2_000_000, 'severity_cv': 1.5},
            catastrophic_params={'base_frequency': 0.03, 'severity_alpha': 2.5, 'severity_xm': 1_000_000},
            seed=42 + rep,
        )
        ip = InsuranceProgram(
            layers=[
                EnhancedInsuranceLayer(attachment_point=0, limit=5_000_000, base_premium_rate=0.015),
                EnhancedInsuranceLayer(attachment_point=5_000_000, limit=20_000_000, base_premium_rate=0.008),
            ],
            deductible=100_000,
        )
        mc_config = MonteCarloConfig(
            n_simulations=n_sims,
            n_years=N_YEARS,
            parallel=True,
            progress_bar=False,
            seed=42 + rep,
            monitor_performance=False,
            generate_summary_report=False,
            cache_results=False,
        )
        engine = MonteCarloEngine(
            loss_generator=lg,
            insurance_program=ip,
            manufacturer=manufacturer,
            config=mc_config,
        )
        start = time.perf_counter()
        cpu_res_p = engine.run()
        elapsed = time.perf_counter() - start
        times_parallel.append(elapsed)

    mean_t = np.mean(times_parallel)
    cpu_parallel_results.append({
        'n_sims': n_sims,
        'mean_time': mean_t,
        'std_time': np.std(times_parallel),
        'min_time': np.min(times_parallel),
        'throughput': n_sims * N_YEARS / mean_t,
        'mean_final_assets': float(np.mean(cpu_res_p.final_assets)),
    })
    print(f"  CPU parallel n={n_sims:>6,}: {mean_t:.3f}s ± {np.std(times_parallel):.3f}s  "
          f"({n_sims * N_YEARS / mean_t:,.0f} sim-years/s)")

df_cpu_single = pd.DataFrame(cpu_single_results)
df_cpu_parallel = pd.DataFrame(cpu_parallel_results)
print("\nCPU engine benchmarks complete.")

## 6. Results Validation

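Before comparing performance, we compare summary statistics from both engines at the same simulation count and seed. The engines use different internal models (the vectorized engine is simplified), so the goal is comparable distributions rather than identical values.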
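
A side-by-side table is easier to interpret with a quantitative measure of how far apart two samples are. The sketch below computes the relative difference in means and a Welch-style z-score using NumPy only. The function name `compare_samples` is hypothetical, and the lognormal arrays are synthetic stand-ins for the two engines' final-asset arrays, not output from either engine.

```python
import numpy as np

def compare_samples(a, b):
    """Return the relative mean difference and a Welch-style z-score."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a.mean() - b.mean()
    # Standard error of the difference between two independent sample means.
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return {"rel_diff": diff / b.mean(), "z_score": diff / se}

# Synthetic stand-ins drawn from the same distribution.
rng = np.random.default_rng(99)
sample_a = rng.lognormal(mean=16.0, sigma=0.3, size=2_000)
sample_b = rng.lognormal(mean=16.0, sigma=0.3, size=2_000)
comparison = compare_samples(sample_a, sample_b)
print(comparison)
```

A z-score of a few units or less is consistent with sampling noise. Because the vectorized engine uses a simplified financial model, a larger z-score between the real engines would point to a modelling difference rather than a bug.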

In [None]:
VAL_N_SIMS = 2_000
VAL_SEED = 99

# --- Vectorized engine ---
lg_val = ManufacturingLossGenerator(
    attritional_params={'base_frequency': 5.0, 'severity_mean': 50_000, 'severity_cv': 0.8},
    large_params={'base_frequency': 0.3, 'severity_mean': 2_000_000, 'severity_cv': 1.5},
    catastrophic_params={'base_frequency': 0.03, 'severity_alpha': 2.5, 'severity_xm': 1_000_000},
    seed=VAL_SEED,
)
ip_val = InsuranceProgram(
    layers=[
        EnhancedInsuranceLayer(attachment_point=0, limit=5_000_000, base_premium_rate=0.015),
        EnhancedInsuranceLayer(attachment_point=5_000_000, limit=20_000_000, base_premium_rate=0.008),
    ],
    deductible=100_000,
)
mc_val = MonteCarloConfig(n_simulations=VAL_N_SIMS, n_years=N_YEARS, seed=VAL_SEED)
gpu_params_val = extract_params(manufacturer, ip_val, lg_val, mc_val)
vec_res = run_gpu_simulation(gpu_params_val)

# --- CPU engine ---
lg_val2 = ManufacturingLossGenerator(
    attritional_params={'base_frequency': 5.0, 'severity_mean': 50_000, 'severity_cv': 0.8},
    large_params={'base_frequency': 0.3, 'severity_mean': 2_000_000, 'severity_cv': 1.5},
    catastrophic_params={'base_frequency': 0.03, 'severity_alpha': 2.5, 'severity_xm': 1_000_000},
    seed=VAL_SEED,
)
ip_val2 = InsuranceProgram(
    layers=[
        EnhancedInsuranceLayer(attachment_point=0, limit=5_000_000, base_premium_rate=0.015),
        EnhancedInsuranceLayer(attachment_point=5_000_000, limit=20_000_000, base_premium_rate=0.008),
    ],
    deductible=100_000,
)
mc_val2 = MonteCarloConfig(
    n_simulations=VAL_N_SIMS, n_years=N_YEARS, seed=VAL_SEED,
    parallel=False, progress_bar=False, monitor_performance=False,
    generate_summary_report=False, cache_results=False,
)
cpu_engine_val = MonteCarloEngine(lg_val2, ip_val2, manufacturer, mc_val2)
cpu_res_val = cpu_engine_val.run()

# --- Compare distributions ---
# The engines use different internal models (simplified vs full OOP), so
# results won't match exactly — but distributions should be comparable.
vec_mean = np.mean(vec_res['final_assets'])
cpu_mean = np.mean(cpu_res_val.final_assets)
vec_std = np.std(vec_res['final_assets'])
cpu_std = np.std(cpu_res_val.final_assets)

print("=" * 60)
print(f"{'Metric':<30} {'Vectorized':>14} {'CPU':>14}")
print("-" * 60)
print(f"{'Mean final assets':<30} ${vec_mean:>13,.0f} ${cpu_mean:>13,.0f}")
print(f"{'Std final assets':<30} ${vec_std:>13,.0f} ${cpu_std:>13,.0f}")
print(f"{'Mean annual losses':<30} ${np.mean(vec_res['annual_losses']):>13,.0f} ${np.mean(cpu_res_val.annual_losses):>13,.0f}")
print(f"{'Mean recoveries':<30} ${np.mean(vec_res['insurance_recoveries']):>13,.0f} ${np.mean(cpu_res_val.insurance_recoveries):>13,.0f}")
print(f"{'Mean retained losses':<30} ${np.mean(vec_res['retained_losses']):>13,.0f} ${np.mean(cpu_res_val.retained_losses):>13,.0f}")
print("=" * 60)
print("\nNote: The vectorized engine uses a simplified financial model")
print("(no ledger, no depreciation, no LoC collateral), so values")
print("will differ somewhat. The key comparison is performance scaling.")

## 7. Performance Summary Table

In [None]:
# Build combined comparison table at overlapping simulation counts
common_counts = sorted(set(df_vec['n_sims']) & set(df_cpu_single['n_sims']))

rows = []
for n in common_counts:
    vec_row = df_vec[df_vec['n_sims'] == n].iloc[0]
    cpu_s_row = df_cpu_single[df_cpu_single['n_sims'] == n].iloc[0]
    cpu_p_row = df_cpu_parallel[df_cpu_parallel['n_sims'] == n].iloc[0]
    rows.append({
        'Simulations': f"{n:,}",
        'CPU Single (s)': f"{cpu_s_row['mean_time']:.3f}",
        'CPU Parallel (s)': f"{cpu_p_row['mean_time']:.3f}",
        'Vectorized (s)': f"{vec_row['mean_time']:.3f}",
        'Speedup vs Single': f"{cpu_s_row['mean_time'] / vec_row['mean_time']:.1f}x",
        'Speedup vs Parallel': f"{cpu_p_row['mean_time'] / vec_row['mean_time']:.1f}x",
    })

df_comparison = pd.DataFrame(rows)
print("\nPerformance Comparison:")
print(df_comparison.to_string(index=False))

## 8. Visualization: Runtime Scaling

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# --- Plot 1: Absolute runtime ---
ax = axes[0]
ax.errorbar(
    df_vec['n_sims'], df_vec['mean_time'], yerr=df_vec['std_time'],
    marker='o', linewidth=2, capsize=4, label='Vectorized (GPU/NumPy)', color='#2196F3',
)
ax.errorbar(
    df_cpu_single['n_sims'], df_cpu_single['mean_time'], yerr=df_cpu_single['std_time'],
    marker='s', linewidth=2, capsize=4, label='CPU Single-threaded', color='#F44336',
)
ax.errorbar(
    df_cpu_parallel['n_sims'], df_cpu_parallel['mean_time'], yerr=df_cpu_parallel['std_time'],
    marker='^', linewidth=2, capsize=4, label='CPU Parallel', color='#FF9800',
)
ax.set_xlabel('Number of Simulations', fontsize=12)
ax.set_ylabel('Runtime (seconds)', fontsize=12)
ax.set_title('Runtime vs Simulation Count', fontsize=14)
ax.legend(fontsize=10)
ax.set_xscale('log')
ax.set_yscale('log')
ax.grid(True, alpha=0.3)

# --- Plot 2: Throughput ---
ax = axes[1]
ax.plot(
    df_vec['n_sims'], df_vec['throughput'],
    marker='o', linewidth=2, label='Vectorized (GPU/NumPy)', color='#2196F3',
)
ax.plot(
    df_cpu_single['n_sims'], df_cpu_single['throughput'],
    marker='s', linewidth=2, label='CPU Single-threaded', color='#F44336',
)
ax.plot(
    df_cpu_parallel['n_sims'], df_cpu_parallel['throughput'],
    marker='^', linewidth=2, label='CPU Parallel', color='#FF9800',
)
ax.set_xlabel('Number of Simulations', fontsize=12)
ax.set_ylabel('Throughput (sim-years/second)', fontsize=12)
ax.set_title('Throughput Scaling', fontsize=14)
ax.legend(fontsize=10)
ax.set_xscale('log')
ax.set_yscale('log')
ax.grid(True, alpha=0.3)

# --- Plot 3: Speedup ---
ax = axes[2]
# Compute speedup at common points
speedup_single = []
speedup_parallel = []
speedup_counts = []
for n in common_counts:
    vec_t = df_vec[df_vec['n_sims'] == n]['mean_time'].values[0]
    cpu_s_t = df_cpu_single[df_cpu_single['n_sims'] == n]['mean_time'].values[0]
    cpu_p_t = df_cpu_parallel[df_cpu_parallel['n_sims'] == n]['mean_time'].values[0]
    speedup_single.append(cpu_s_t / vec_t)
    speedup_parallel.append(cpu_p_t / vec_t)
    speedup_counts.append(n)

ax.bar(
    np.arange(len(speedup_counts)) - 0.15, speedup_single, 0.3,
    label='vs CPU Single', color='#F44336', alpha=0.8,
)
ax.bar(
    np.arange(len(speedup_counts)) + 0.15, speedup_parallel, 0.3,
    label='vs CPU Parallel', color='#FF9800', alpha=0.8,
)
ax.set_xticks(range(len(speedup_counts)))
ax.set_xticklabels([f"{n:,}" for n in speedup_counts], rotation=45)
ax.set_xlabel('Number of Simulations', fontsize=12)
ax.set_ylabel('Speedup Factor (x)', fontsize=12)
ax.set_title('Vectorized Engine Speedup', fontsize=14)
ax.axhline(y=1, color='gray', linestyle='--', alpha=0.5, label='Parity')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('gpu_cpu_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
print("Figure saved to gpu_cpu_comparison.png")

## 9. Vectorized Engine Scaling (Extended Range)

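The vectorized engine was benchmarked up to 50,000 simulations in Section 4, well beyond the CPU engine's 5,000. This plot shows its runtime over that full range against an ideal linear-scaling reference anchored at the smallest run.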
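
The scaling curve can also be summarized by a single number: the slope of a straight-line fit in log-log space. A slope near 1 indicates linear scaling, and a slope below 1 suggests fixed overhead still dominates at the smaller sizes. The sketch below uses made-up runtimes from an assumed cost model, not measurements from this notebook, and `scaling_exponent` is a hypothetical helper name.

```python
import numpy as np

def scaling_exponent(n_sims, runtimes):
    """Least-squares slope of log(runtime) against log(n_sims)."""
    slope, _intercept = np.polyfit(np.log(n_sims), np.log(runtimes), deg=1)
    return slope

# Made-up runtimes following t = 0.05 + 2e-5 * n (fixed overhead plus linear cost).
n = np.array([500, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000])
t = 0.05 + 2e-5 * n
exponent = scaling_exponent(n, t)
print(f"fitted exponent: {exponent:.2f}")
```

With real data, the same call would take `df_vec['n_sims']` and `df_vec['mean_time']` as its arguments.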

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(
    df_vec['n_sims'], df_vec['mean_time'],
    marker='o', linewidth=2, color='#2196F3', markersize=8,
)
ax.fill_between(
    df_vec['n_sims'],
    df_vec['mean_time'] - df_vec['std_time'],
    df_vec['mean_time'] + df_vec['std_time'],
    alpha=0.2, color='#2196F3',
)

# Add ideal linear scaling reference
base_n = df_vec['n_sims'].iloc[0]
base_t = df_vec['mean_time'].iloc[0]
ideal_times = [base_t * (n / base_n) for n in df_vec['n_sims']]
ax.plot(
    df_vec['n_sims'], ideal_times,
    linestyle='--', color='gray', alpha=0.5, label='Ideal linear scaling',
)

ax.set_xlabel('Number of Simulations', fontsize=12)
ax.set_ylabel('Runtime (seconds)', fontsize=12)
ax.set_title('Vectorized Engine Runtime Scaling', fontsize=14)
ax.set_xscale('log')
ax.set_yscale('log')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Annotate throughput at each point
for _, row in df_vec.iterrows():
    ax.annotate(
        f"{row['throughput']:,.0f}\nsim-yr/s",
        xy=(row['n_sims'], row['mean_time']),
        textcoords='offset points', xytext=(0, 15),
        fontsize=8, ha='center', color='#1565C0',
    )

plt.tight_layout()
plt.savefig('vectorized_scaling.png', dpi=150, bbox_inches='tight')
plt.show()
print("Figure saved to vectorized_scaling.png")

## 10. Per-Simulation-Year Cost Analysis

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

# Cost per sim-year = runtime / (n_sims * n_years)
vec_cost = df_vec['mean_time'] / (df_vec['n_sims'] * N_YEARS) * 1e6  # microseconds
cpu_s_cost = df_cpu_single['mean_time'] / (df_cpu_single['n_sims'] * N_YEARS) * 1e6
cpu_p_cost = df_cpu_parallel['mean_time'] / (df_cpu_parallel['n_sims'] * N_YEARS) * 1e6

ax.plot(df_vec['n_sims'], vec_cost, marker='o', linewidth=2, label='Vectorized (GPU/NumPy)', color='#2196F3')
ax.plot(df_cpu_single['n_sims'], cpu_s_cost, marker='s', linewidth=2, label='CPU Single-threaded', color='#F44336')
ax.plot(df_cpu_parallel['n_sims'], cpu_p_cost, marker='^', linewidth=2, label='CPU Parallel', color='#FF9800')

ax.set_xlabel('Number of Simulations', fontsize=12)
ax.set_ylabel('Cost per Simulation-Year (\u00b5s)', fontsize=12)
ax.set_title('Average Cost per Simulation-Year', fontsize=14)
ax.set_xscale('log')
ax.set_yscale('log')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('cost_per_sim_year.png', dpi=150, bbox_inches='tight')
plt.show()
print("Figure saved to cost_per_sim_year.png")

## Key Takeaways

1. **The vectorized engine** is faster because it processes all N simulations simultaneously with array operations, avoiding per-trajectory Python object overhead. The speedup columns in Section 7 give the measured factor for your hardware.
2. **Speedup tends to increase with N**: the vectorized engine amortizes its fixed overhead across more paths, while the CPU engine's runtime grows roughly linearly with N.
3. **With CuPy/GPU**, the advantage is expected to grow further, since array operations execute in parallel on GPU cores. A CPU-only run measures only the NumPy fallback, so confirm this on GPU hardware.
4. **Without CuPy**, the vectorized engine still benefits from NumPy's compiled C routines and cache-friendly memory access.
5. **Comparable results**: both engines produce similar summary statistics, but the vectorized engine uses a simplified financial model (no ledger, no depreciation, no LoC collateral), so values differ somewhat.
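
The amortization argument in point 2 can be made explicit with a simple cost model. This is an assumed model for illustration, not one fitted to the benchmark data: suppose the vectorized engine costs a fixed overhead $a > 0$ plus $b$ seconds per simulation, and the CPU engine costs $c$ seconds per simulation with negligible fixed overhead. The speedup is then

$$
S(N) = \frac{T_{\text{cpu}}(N)}{T_{\text{vec}}(N)} = \frac{cN}{a + bN},
\qquad
\frac{dS}{dN} = \frac{ac}{(a + bN)^2} > 0,
\qquad
\lim_{N \to \infty} S(N) = \frac{c}{b}.
$$

Under these assumptions the speedup rises with $N$ and levels off at $c/b$, the ratio of per-simulation costs, once the fixed overhead $a$ becomes negligible next to $bN$.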

## Next Steps

- Install CuPy to measure true GPU acceleration: `pip install 'ergodic-insurance[gpu]'`
- Compare on Google Colab with a T4 or A100 GPU for production-scale benchmarks
- See `core/04_monte_carlo_simulation.ipynb` for detailed Monte Carlo analysis
- See `advanced/03_advanced_convergence.ipynb` for convergence diagnostics