# Module 08: Performance Benchmarking and Analysis

**Difficulty**: ⭐⭐

**Estimated Time**: 75 minutes

**Prerequisites**: 
- Module 05: Serial Implementation
- Module 06: Multi-Core Parallelization
- Module 07: GPU Acceleration

## Learning Objectives

By the end of this notebook, you will be able to:
1. Design comprehensive benchmark suites for parallel programs
2. Measure and analyze speedup and efficiency metrics
3. Conduct strong and weak scaling experiments
4. Apply statistical methods to performance data
5. Create professional visualizations of benchmark results
6. Identify performance bottlenecks through systematic analysis

## 1. Setup and Imports

In [None]:
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from typing import Tuple, List, Dict, Callable
from dataclasses import dataclass
from multiprocessing import Pool, cpu_count
import warnings
warnings.filterwarnings('ignore')

# Try importing GPU support
try:
    from numba import cuda
    GPU_AVAILABLE = cuda.is_available()
except Exception:  # numba not installed or CUDA runtime unavailable
    GPU_AVAILABLE = False

# Set random seed
np.random.seed(42)

# Configure plotting
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

# System info
num_cores = cpu_count()
print(f"System Information:")
print(f"  CPU cores: {num_cores}")
print(f"  GPU available: {GPU_AVAILABLE}")
if GPU_AVAILABLE:
    gpu = cuda.get_current_device()
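    # Device.name may be bytes or str depending on the numba version
    gpu_name = gpu.name.decode('utf-8') if isinstance(gpu.name, bytes) else gpu.name
    print(f"  GPU: {gpu_name}")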
print(f"  NumPy version: {np.__version__}")
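
Benchmark numbers are only comparable when the environment that produced them is recorded. As a minimal sketch, the snippet below gathers a few facts with the standard library; `describe_environment` is a hypothetical helper name, not part of this module's framework.

```python
import json
import os
import platform
import sys


def describe_environment() -> dict:
    """Collect basic facts about the machine and interpreter (hypothetical helper)."""
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "logical_cpus": os.cpu_count(),  # logical CPUs, may be None if undetermined
    }


env = describe_environment()
print(json.dumps(env, indent=2))
```

Note that `cpu_count()` reports logical CPUs, which can include hyperthreads, so scaling may flatten before the worker count reaches that number.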

## 2. Benchmark Framework

Let's create a reusable framework for performance testing.

In [None]:
@dataclass
class BenchmarkResult:
    """
    Store results from a benchmark run.
    """
    name: str
    problem_size: int
    num_workers: int  # cores or threads
    times: List[float]  # multiple runs
    
    @property
    def mean_time(self) -> float:
        return np.mean(self.times)
    
    @property
    def std_time(self) -> float:
        # Sample standard deviation (ddof=1), consistent with stats.sem below
        return np.std(self.times, ddof=1)
    
    @property
    def min_time(self) -> float:
        return np.min(self.times)
    
    @property
    def max_time(self) -> float:
        return np.max(self.times)
    
    def confidence_interval(self, confidence=0.95) -> Tuple[float, float]:
        """
        Calculate confidence interval for mean time.
        """
        from scipy import stats
        ci = stats.t.interval(confidence, len(self.times)-1, 
                             loc=self.mean_time, 
                             scale=stats.sem(self.times))
        return ci


class BenchmarkSuite:
    """
    Framework for running and analyzing benchmarks.
    """
    
    def __init__(self):
        self.results: List[BenchmarkResult] = []
    
    def run_benchmark(self, 
                     name: str,
                     func: Callable,
                     args: tuple,
                     problem_size: int,
                     num_workers: int = 1,
                     num_runs: int = 5,
                     warmup_runs: int = 1) -> BenchmarkResult:
        """
        Run a benchmark with multiple iterations.
        
        Args:
            name: Benchmark name
            func: Function to benchmark
            args: Arguments to pass to func
            problem_size: Size of problem (for scaling analysis)
            num_workers: Number of cores/threads used
            num_runs: Number of timed runs
            warmup_runs: Number of warmup runs (not timed)
        
        Returns:
            BenchmarkResult object
        """
        # Warmup runs (for JIT compilation, cache warming, etc.)
        for _ in range(warmup_runs):
            _ = func(*args)
        
        # Timed runs
        times = []
        for _ in range(num_runs):
            start = time.perf_counter()
            result = func(*args)
            end = time.perf_counter()
            times.append(end - start)
        
        # Store result
        benchmark = BenchmarkResult(
            name=name,
            problem_size=problem_size,
            num_workers=num_workers,
            times=times
        )
        
        self.results.append(benchmark)
        return benchmark
    
    def calculate_speedup(self, baseline_name: str, target_name: str) -> float:
        """
        Calculate speedup: T_baseline / T_target
        """
        baseline = next(r for r in self.results if r.name == baseline_name)
        target = next(r for r in self.results if r.name == target_name)
        return baseline.mean_time / target.mean_time
    
    def calculate_efficiency(self, baseline_name: str, target_name: str, 
                           num_workers: int) -> float:
        """
        Calculate parallel efficiency: Speedup / num_workers
        """
        speedup = self.calculate_speedup(baseline_name, target_name)
        return speedup / num_workers
    
    def to_dataframe(self) -> pd.DataFrame:
        """
        Convert results to pandas DataFrame for analysis.
        """
        data = []
        for r in self.results:
            data.append({
                'Name': r.name,
                'Problem Size': r.problem_size,
                'Workers': r.num_workers,
                'Mean Time (s)': r.mean_time,
                'Std Dev (s)': r.std_time,
                'Min Time (s)': r.min_time,
                'Max Time (s)': r.max_time
            })
        return pd.DataFrame(data)


print("Benchmark framework ready!")
print(f"Available for testing: {num_cores} CPU cores", end="")
if GPU_AVAILABLE:
    print(" + GPU")
else:
    print()
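
A hand-rolled harness can be cross-checked against the standard library's `timeit.repeat`, which also repeats a measurement several times and disables garbage collection while timing. The sketch below assumes a stand-in `workload` function (hypothetical); it is not the function benchmarked in this module.

```python
import timeit


def workload():
    """Stand-in for the function under test (hypothetical)."""
    return sum(i * i for i in range(10_000))


CALLS_PER_REPEAT = 10

# 5 repeats, each timing 10 calls; dividing by the call count gives seconds per call.
raw = timeit.repeat(workload, repeat=5, number=CALLS_PER_REPEAT)
per_call = [t / CALLS_PER_REPEAT for t in raw]

print(f"best:   {min(per_call):.6f} s")
print(f"median: {sorted(per_call)[len(per_call) // 2]:.6f} s")
```

If the two approaches disagree by a wide margin on the same function, the difference usually points at warmup effects or garbage-collection pauses.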

## 3. Test Problem: Monte Carlo π Estimation

We'll use Monte Carlo π estimation as our test case. It is simple, and because every sample is independent, the work splits cleanly across workers.

In [None]:
# Serial implementation
def monte_carlo_pi_serial(num_samples):
    """Serial Monte Carlo π estimation."""
    inside = 0
    for _ in range(num_samples):
        x = np.random.random()
        y = np.random.random()
        if x*x + y*y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


# Vectorized implementation
def monte_carlo_pi_vectorized(num_samples):
    """Vectorized Monte Carlo (single thread)."""
    x = np.random.random(num_samples)
    y = np.random.random(num_samples)
    inside = np.sum(x*x + y*y <= 1.0)
    return 4.0 * inside / num_samples


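# Parallel implementation
def estimate_pi_chunk(task):
    """Worker function: count samples that fall inside the quarter circle."""
    num_samples, seed = task
    # Each task gets its own generator. Forked workers would otherwise inherit
    # the parent's global NumPy RNG state and draw identical samples.
    rng = np.random.default_rng(seed)
    x = rng.random(num_samples)
    y = rng.random(num_samples)
    return int(np.sum(x*x + y*y <= 1.0))


def monte_carlo_pi_parallel(num_samples, num_processes):
    """Parallel Monte Carlo using multiprocessing."""
    samples_per_process = num_samples // num_processes
    
    # One independent child seed per task
    seeds = np.random.SeedSequence().spawn(num_processes)
    tasks = [(samples_per_process, seed) for seed in seeds]
    
    with Pool(processes=num_processes) as pool:
        results = pool.map(estimate_pi_chunk, tasks)
    
    total_inside = sum(results)
    # Divide by the samples actually drawn: integer division may drop a remainder
    total_samples = samples_per_process * num_processes
    return 4.0 * total_inside / total_samples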


print("Monte Carlo implementations ready!")
print("\nQuick test (10K samples):")
test_samples = 10000
pi_est = monte_carlo_pi_vectorized(test_samples)
print(f"  π estimate: {pi_est:.6f}")
print(f"  Actual π:   {np.pi:.6f}")
print(f"  Error:      {abs(pi_est - np.pi):.6f}")
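
When chunks of samples are handed to separate worker processes, each chunk needs its own random stream. With the fork start method, workers copy the parent's global generator state, so chunks can silently repeat the same points. The sketch below illustrates the effect with the standard library's `random.Random`; `count_inside` is a hypothetical helper, and distinct integer seeds are a simple stand-in for a purpose-built scheme such as NumPy's `SeedSequence.spawn`.

```python
import random


def count_inside(num_samples: int, seed: int) -> int:
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # private generator, independent of global state
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside


# Two chunks sharing a seed repeat the same points, so the second adds no information.
same_a = count_inside(50_000, seed=1)
same_b = count_inside(50_000, seed=1)

# Distinct seeds give distinct streams; pooling them is a genuine 100K-sample estimate.
chunks = [count_inside(50_000, seed=s) for s in (1, 2)]
pi_est = 4.0 * sum(chunks) / 100_000

print(f"identical chunks: {same_a == same_b}")
print(f"pooled estimate:  {pi_est:.4f}")
```

Duplicated streams do not bias the estimate, but they waste the extra workers: the error stays at the single-chunk level no matter how many processes run.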

## 4. Strong Scaling Analysis

**Strong scaling**: Fixed problem size, increase number of workers.

**Ideal**: Time falls in inverse proportion to the number of workers, T(p) = T(1)/p, so speedup equals the worker count.
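
The two strong-scaling metrics can be worked through on a small example before running real benchmarks. The timings below are hypothetical, chosen only to illustrate the arithmetic, and `strong_scaling_metrics` is an illustrative helper rather than part of the framework above.

```python
def strong_scaling_metrics(serial_time, parallel_times):
    """Return {workers: (speedup, efficiency)} for a fixed problem size."""
    metrics = {}
    for workers, elapsed in sorted(parallel_times.items()):
        speedup = serial_time / elapsed
        metrics[workers] = (speedup, speedup / workers)
    return metrics


# Hypothetical timings in seconds for the same problem on 1, 2, 4 and 8 workers.
example_timings = {1: 8.0, 2: 4.4, 4: 2.5, 8: 1.6}

for workers, (speedup, efficiency) in strong_scaling_metrics(8.0, example_timings).items():
    print(f"{workers} workers: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

In this made-up data speedup keeps rising while efficiency falls, which is the typical shape once fixed overheads stop shrinking with the per-worker workload.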

In [None]:
print("="*70)
print("STRONG SCALING ANALYSIS: Monte Carlo π Estimation")
print("="*70)

# Fixed problem size
problem_size = 10_000_000  # 10 million samples

print(f"\nProblem size: {problem_size:,} samples (fixed)")
print(f"Testing with 1 to {num_cores} workers\n")

# Run benchmarks
suite = BenchmarkSuite()

# Serial baseline
print("Serial (baseline)...")
result = suite.run_benchmark(
    name="Serial",
    func=monte_carlo_pi_vectorized,
    args=(problem_size,),
    problem_size=problem_size,
    num_workers=1,
    num_runs=5
)
print(f"  Time: {result.mean_time:.4f} ± {result.std_time:.4f} s")
serial_time = result.mean_time

# Parallel with varying workers
worker_counts = [1, 2, 4] + ([num_cores] if num_cores > 4 else [])

for num_workers in worker_counts:
    if num_workers > num_cores:
        continue
    
    print(f"\nParallel ({num_workers} workers)...")
    result = suite.run_benchmark(
        name=f"Parallel-{num_workers}",
        func=monte_carlo_pi_parallel,
        args=(problem_size, num_workers),
        problem_size=problem_size,
        num_workers=num_workers,
        num_runs=5
    )
    
    speedup = serial_time / result.mean_time
    efficiency = speedup / num_workers
    
    print(f"  Time: {result.mean_time:.4f} ± {result.std_time:.4f} s")
    print(f"  Speedup: {speedup:.2f}x")
    print(f"  Efficiency: {efficiency*100:.1f}%")

# Display summary
df_strong = suite.to_dataframe()
print("\n" + "="*70)
print("STRONG SCALING SUMMARY")
print("="*70)
print(df_strong.to_string(index=False))

In [None]:
# Visualize strong scaling
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Extract data for parallel runs
parallel_results = df_strong[df_strong['Name'].str.startswith('Parallel')].copy()
workers = parallel_results['Workers'].values
times = parallel_results['Mean Time (s)'].values
speedups = serial_time / times
efficiencies = speedups / workers * 100

# Plot 1: Speedup
axes[0].plot(workers, speedups, 'o-', linewidth=2, markersize=10, label='Actual')
axes[0].plot(workers, workers, '--', linewidth=2, color='gray', label='Ideal (linear)')
axes[0].set_xlabel('Number of Workers', fontsize=12)
axes[0].set_ylabel('Speedup', fontsize=12)
axes[0].set_title('Strong Scaling: Speedup', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
axes[0].set_xticks(workers)

# Plot 2: Efficiency
axes[1].plot(workers, efficiencies, 'o-', linewidth=2, markersize=10, color='green')
axes[1].axhline(y=100, linestyle='--', color='gray', linewidth=2, label='100% (ideal)')
axes[1].set_xlabel('Number of Workers', fontsize=12)
axes[1].set_ylabel('Efficiency (%)', fontsize=12)
axes[1].set_title('Parallel Efficiency', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
axes[1].set_xticks(workers)
axes[1].set_ylim([0, 110])

# Plot 3: Execution Time
axes[2].plot(workers, times, 'o-', linewidth=2, markersize=10, color='red')
axes[2].axhline(y=serial_time, linestyle='--', color='gray', linewidth=2, label='Serial time')
axes[2].set_xlabel('Number of Workers', fontsize=12)
axes[2].set_ylabel('Execution Time (seconds)', fontsize=12)
axes[2].set_title('Execution Time vs Workers', fontsize=14, fontweight='bold')
axes[2].legend(fontsize=11)
axes[2].grid(True, alpha=0.3)
axes[2].set_xticks(workers)

plt.tight_layout()
plt.savefig('strong_scaling_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print("Strong scaling plots saved to: strong_scaling_analysis.png")

## 5. Weak Scaling Analysis

**Weak scaling**: Problem size increases proportionally with workers.

**Ideal**: Time remains constant, since each worker does the same amount of work.
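
Weak-scaling efficiency is the ratio of the single-worker time to the time at p workers, with the per-worker workload held fixed. The sketch below uses hypothetical timings and an illustrative helper name, `weak_scaling_efficiency`.

```python
def weak_scaling_efficiency(baseline_time, scaled_times):
    """Return {workers: T_1 / T_p}; 1.0 means time stayed flat as work grew."""
    return {w: baseline_time / t for w, t in sorted(scaled_times.items())}


# Hypothetical timings in seconds: every worker always handles the same number of samples.
example_timings = {1: 2.00, 2: 2.10, 4: 2.35, 8: 2.90}

for workers, efficiency in weak_scaling_efficiency(2.00, example_timings).items():
    print(f"{workers} workers: efficiency {efficiency:.0%}")
```

Values below 100% indicate overhead that grows with the worker count, such as process startup or result collection.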

In [None]:
print("="*70)
print("WEAK SCALING ANALYSIS: Monte Carlo π Estimation")
print("="*70)

# Samples per worker (constant work per core)
samples_per_worker = 2_000_000  # 2 million per worker

print(f"\nSamples per worker: {samples_per_worker:,} (constant)")
print(f"Total problem size increases with workers\n")

suite_weak = BenchmarkSuite()

# Baseline: 1 worker
print("1 worker (baseline)...")
total_samples = 1 * samples_per_worker
result = suite_weak.run_benchmark(
    name="Weak-1",
    func=monte_carlo_pi_vectorized,
    args=(total_samples,),
    problem_size=total_samples,
    num_workers=1,
    num_runs=5
)
print(f"  Problem size: {total_samples:,}")
print(f"  Time: {result.mean_time:.4f} ± {result.std_time:.4f} s")
baseline_time = result.mean_time

# Increase workers and problem size proportionally
for num_workers in worker_counts:
    if num_workers == 1 or num_workers > num_cores:
        continue
    
    total_samples = num_workers * samples_per_worker
    
    print(f"\n{num_workers} workers...")
    result = suite_weak.run_benchmark(
        name=f"Weak-{num_workers}",
        func=monte_carlo_pi_parallel,
        args=(total_samples, num_workers),
        problem_size=total_samples,
        num_workers=num_workers,
        num_runs=5
    )
    
    # Weak scaling efficiency: T_1 / T_p
    weak_efficiency = baseline_time / result.mean_time
    
    print(f"  Problem size: {total_samples:,}")
    print(f"  Time: {result.mean_time:.4f} ± {result.std_time:.4f} s")
    print(f"  Weak scaling efficiency: {weak_efficiency*100:.1f}%")

# Summary
df_weak = suite_weak.to_dataframe()
print("\n" + "="*70)
print("WEAK SCALING SUMMARY")
print("="*70)
print(df_weak.to_string(index=False))

In [None]:
# Visualize weak scaling
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

workers_weak = df_weak['Workers'].values
times_weak = df_weak['Mean Time (s)'].values
problem_sizes = df_weak['Problem Size'].values

# Plot 1: Execution time (should be constant)
axes[0].plot(workers_weak, times_weak, 'o-', linewidth=2, markersize=10, color='purple')
axes[0].axhline(y=baseline_time, linestyle='--', color='gray', linewidth=2, label='Ideal (constant)')
axes[0].set_xlabel('Number of Workers', fontsize=12)
axes[0].set_ylabel('Execution Time (seconds)', fontsize=12)
axes[0].set_title('Weak Scaling: Time vs Workers', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
axes[0].set_xticks(workers_weak)

# Plot 2: Problem size scaling
ax2 = axes[1]
ax2.bar(workers_weak, problem_sizes / 1e6, alpha=0.7, color='orange')
ax2.set_xlabel('Number of Workers', fontsize=12)
ax2.set_ylabel('Problem Size (millions of samples)', fontsize=12)
ax2.set_title('Problem Size Growth', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')
ax2.set_xticks(workers_weak)

# Add execution time on secondary axis
ax2_twin = ax2.twinx()
ax2_twin.plot(workers_weak, times_weak, 'ro-', linewidth=2, markersize=8, label='Time')
ax2_twin.set_ylabel('Execution Time (seconds)', fontsize=12, color='red')
ax2_twin.tick_params(axis='y', labelcolor='red')
ax2_twin.legend(loc='upper left', fontsize=11)

plt.tight_layout()
plt.savefig('weak_scaling_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print("Weak scaling plots saved to: weak_scaling_analysis.png")

## 6. Statistical Analysis

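Timing measurements are noisy, so a single number is not enough: report the spread (standard deviation, confidence intervals) alongside the mean, and check how the timings are distributed.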
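
To make the confidence-interval arithmetic concrete, here is a worked example using only the standard library. The five timings are hypothetical, and the critical value is hardcoded for exactly this sample size (4 degrees of freedom); other sample sizes need a t-table or `scipy.stats.t.ppf`.

```python
import math
import statistics

# Hypothetical timings (seconds) from five runs of one configuration.
run_times = [1.02, 0.98, 1.05, 1.01, 0.99]

n = len(run_times)
mean = statistics.mean(run_times)
sample_std = statistics.stdev(run_times)   # n - 1 in the denominator
sem = sample_std / math.sqrt(n)            # standard error of the mean

# Two-sided 95% critical value of Student's t with n - 1 = 4 degrees of freedom.
T_CRIT_95_DF4 = 2.776

half_width = T_CRIT_95_DF4 * sem
ci_low, ci_high = mean - half_width, mean + half_width

print(f"mean = {mean:.4f} s, std = {sample_std:.4f} s")
print(f"95% CI = [{ci_low:.4f}, {ci_high:.4f}] s")
```

With only five runs the t critical value is noticeably larger than the familiar 1.96 of the normal distribution, so the interval is correspondingly wider.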

In [None]:
from scipy import stats

def analyze_benchmark_statistics(benchmark_result: BenchmarkResult):
    """
    Comprehensive statistical analysis of benchmark results.
    """
    times = np.array(benchmark_result.times)
    
    print(f"Benchmark: {benchmark_result.name}")
    print(f"Problem size: {benchmark_result.problem_size:,}")
    print(f"Number of runs: {len(times)}")
    mean = np.mean(times)
    std = np.std(times, ddof=1)  # sample standard deviation, consistent with stats.sem
    
    print(f"\nDescriptive Statistics:")
    print(f"  Mean:     {mean:.6f} s")
    print(f"  Median:   {np.median(times):.6f} s")
    print(f"  Std Dev:  {std:.6f} s")
    print(f"  Min:      {np.min(times):.6f} s")
    print(f"  Max:      {np.max(times):.6f} s")
    print(f"  Range:    {np.max(times) - np.min(times):.6f} s")
    print(f"  CV:       {std/mean*100:.2f}%")  # Coefficient of variation
    
    # Confidence intervals
    ci_95 = stats.t.interval(0.95, len(times)-1, loc=mean, scale=stats.sem(times))
    ci_99 = stats.t.interval(0.99, len(times)-1, loc=mean, scale=stats.sem(times))
    
    print(f"\nConfidence Intervals:")
    print(f"  95% CI:   [{ci_95[0]:.6f}, {ci_95[1]:.6f}] s")
    print(f"  99% CI:   [{ci_99[0]:.6f}, {ci_99[1]:.6f}] s")
    
    # Test for normality
    if len(times) >= 3:
        shapiro_stat, shapiro_p = stats.shapiro(times)
        print(f"\nNormality Test (Shapiro-Wilk):")
        print(f"  Statistic: {shapiro_stat:.4f}")
        print(f"  P-value:   {shapiro_p:.4f}")
        # A large p-value only means normality was not rejected;
        # with a handful of runs the test has little power.
        print(f"  Normality: {'not rejected' if shapiro_p > 0.05 else 'rejected'} (α=0.05)")
    
    return {
        'mean': mean,
        'median': np.median(times),
        'std': std,
        'ci_95': ci_95,
        'ci_99': ci_99
    }


# Analyze one benchmark in detail
print("="*70)
print("DETAILED STATISTICAL ANALYSIS")
print("="*70)
print()

# Get a benchmark result
sample_result = suite.results[0]  # Serial baseline
stats_result = analyze_benchmark_statistics(sample_result)

In [None]:
# Visualize distribution of timing results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Get results for different configurations
results_to_plot = [
    suite.results[0],  # Serial
    suite.results[1] if len(suite.results) > 1 else suite.results[0],  # Parallel-1
    suite.results[2] if len(suite.results) > 2 else suite.results[0],  # Parallel-2
    suite.results[-1]  # Parallel-max
]

for idx, result in enumerate(results_to_plot):
    ax = axes[idx // 2, idx % 2]
    times = result.times
    
    # Histogram with KDE
    ax.hist(times, bins=10, alpha=0.7, color='skyblue', edgecolor='black', density=True)
    
    # Add KDE if we have enough points
    if len(times) > 3:
        from scipy.stats import gaussian_kde
        kde = gaussian_kde(times)
        x_range = np.linspace(min(times), max(times), 100)
        ax.plot(x_range, kde(x_range), 'r-', linewidth=2, label='KDE')
    
    # Add mean and median lines
    ax.axvline(np.mean(times), color='green', linestyle='--', linewidth=2, label=f'Mean: {np.mean(times):.4f}s')
    ax.axvline(np.median(times), color='orange', linestyle='--', linewidth=2, label=f'Median: {np.median(times):.4f}s')
    
    ax.set_xlabel('Execution Time (seconds)', fontsize=11)
    ax.set_ylabel('Density', fontsize=11)
    ax.set_title(f'{result.name} - Distribution of Times', fontsize=12, fontweight='bold')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('timing_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

print("Timing distribution plots saved to: timing_distributions.png")
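
Timing distributions often show a long right tail when another process interferes with a run. One simple screen, sketched below under that assumption, is Tukey's fences: flag any timing more than 1.5 interquartile ranges outside the quartiles. The helper name `flag_outliers` and the sample timings are hypothetical.

```python
import statistics


def flag_outliers(times, k=1.5):
    """Return timings outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(times, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [t for t in times if t < low or t > high]


# Hypothetical timings (seconds); the 1.90 s run mimics interference from another process.
sample_times = [1.01, 0.99, 1.02, 1.00, 0.98, 1.03, 1.90, 1.01]
print(flag_outliers(sample_times))
```

Flagged runs are worth inspecting rather than deleting automatically, since a slow run can also reveal a real effect such as thermal throttling.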

## 7. Comprehensive Performance Comparison

In [None]:
def create_performance_table(suite: BenchmarkSuite, baseline_name: str = "Serial"):
    """
    Create comprehensive performance comparison table.
    """
    baseline = next(r for r in suite.results if r.name == baseline_name)
    baseline_time = baseline.mean_time
    
    data = []
    for result in suite.results:
        speedup = baseline_time / result.mean_time
        efficiency = speedup / result.num_workers * 100 if result.num_workers > 0 else 0
        
        data.append({
            'Implementation': result.name,
            'Workers': result.num_workers,
            'Mean Time (s)': f"{result.mean_time:.4f}",
            '± Std Dev': f"{result.std_time:.4f}",
            'Speedup': f"{speedup:.2f}x",
            'Efficiency': f"{efficiency:.1f}%",
            'Samples/sec': f"{result.problem_size/result.mean_time/1e6:.2f}M"
        })
    
    return pd.DataFrame(data)


print("="*70)
print("COMPREHENSIVE PERFORMANCE COMPARISON")
print("="*70)
print()

perf_table = create_performance_table(suite)
print(perf_table.to_string(index=False))

# Save to CSV
perf_table.to_csv('performance_comparison.csv', index=False)
print("\nTable saved to: performance_comparison.csv")

## 8. Amdahl's Law Analysis

In [None]:
def amdahls_law(p, n):
    """
    Calculate theoretical speedup using Amdahl's Law.
    
    Speedup = 1 / ((1 - p) + p/n)
    
    Args:
        p: Fraction of program that is parallelizable (0 to 1)
        n: Number of processors
    
    Returns:
        Theoretical speedup
    """
    return 1 / ((1 - p) + p / n)


# Plot Amdahl's Law for different parallelizable fractions
plt.figure(figsize=(12, 6))

processors = np.arange(1, 65)
parallel_fractions = [0.5, 0.75, 0.9, 0.95, 0.99, 1.0]

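for p in parallel_fractions:
    # Separate name so the measured `speedups` array from the strong scaling cell is not overwritten
    theoretical_speedups = [amdahls_law(p, n) for n in processors]
    label = f'{p*100:.0f}% parallelizable' if p < 1 else 'Perfect (100%)'
    plt.plot(processors, theoretical_speedups, linewidth=2, label=label)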

# Add ideal (linear) speedup
plt.plot(processors, processors, 'k--', linewidth=1, alpha=0.5, label='Ideal (linear)')

# Mark our actual results
if len(parallel_results) > 0:
    actual_workers = parallel_results['Workers'].values
    actual_speedups = serial_time / parallel_results['Mean Time (s)'].values
    plt.scatter(actual_workers, actual_speedups, s=150, c='red', marker='*', 
               zorder=10, edgecolors='black', linewidths=1.5,
               label='Our Results')

plt.xlabel('Number of Processors', fontsize=12)
plt.ylabel('Speedup', fontsize=12)
plt.title("Amdahl's Law: Speedup vs Number of Processors", 
         fontsize=14, fontweight='bold')
plt.legend(loc='upper left', fontsize=10)
plt.grid(True, alpha=0.3)
plt.xlim([1, 64])
plt.ylim([1, 64])

# Add annotation
plt.text(32, 5, 
         "Even 10% serial code\nlimits max speedup to 10x",
         fontsize=11, bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))

plt.tight_layout()
plt.savefig('amdahls_law.png', dpi=150, bbox_inches='tight')
plt.show()

print("Amdahl's Law plot saved to: amdahls_law.png")
print("\nKey insight: Serial portion limits maximum achievable speedup!")
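
The ceiling that Amdahl's Law imposes follows from letting the processor count grow without bound: the p/n term vanishes and the speedup approaches 1 / (1 - p). The sketch below compares the speedup on 64 processors with that limit; the function names are illustrative and independent of the cell above.

```python
def amdahl_speedup(p: float, n: float) -> float:
    """Speedup predicted by Amdahl's Law for parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Upper bound on speedup as n grows without bound: 1 / (1 - p)."""
    return float("inf") if p >= 1.0 else 1.0 / (1.0 - p)


for p in (0.5, 0.9, 0.99):
    print(f"p = {p:.2f}: 64 processors -> {amdahl_speedup(p, 64):.2f}x, "
          f"limit -> {amdahl_limit(p):.0f}x")
```

For p = 0.9 the limit is 10x, which matches the annotation on the plot: a 10% serial portion caps the speedup at 10x regardless of processor count.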

## 9. Bottleneck Identification

In [None]:
print("="*70)
print("BOTTLENECK ANALYSIS")
print("="*70)
print("""
Common Performance Bottlenecks:

1. SERIAL SECTIONS (Amdahl's Law)
   - Data initialization
   - Result aggregation
   - I/O operations
   → Minimize or parallelize

2. SYNCHRONIZATION OVERHEAD
   - Thread/process creation
   - Locks and barriers
   - Communication between workers
   → Use larger work chunks

3. LOAD IMBALANCE
   - Some workers finish early
   - Idle time while waiting
   → Dynamic work distribution

4. MEMORY BANDWIDTH
   - CPU-GPU transfers
   - Cache misses
   - Memory contention
   → Optimize access patterns

5. FALSE SHARING
   - Threads modifying adjacent memory
   - Cache line bouncing
   → Pad data structures
""")

# Analyze our results
print("\n" + "="*70)
print("YOUR RESULTS ANALYSIS")
print("="*70)
print()

if len(parallel_results) > 0:
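    # Recompute from the strong-scaling table so this cell does not depend on
    # intermediate variables from earlier cells.
    worker_arr = parallel_results['Workers'].values
    speedup_arr = serial_time / parallel_results['Mean Time (s)'].values
    efficiency_arr = speedup_arr / worker_arr * 100
    
    best = int(np.argmax(speedup_arr))
    max_workers = int(worker_arr.max())
    max_speedup = speedup_arr[best]
    best_workers = int(worker_arr[best])  # worker count at which the best speedup occurred
    max_efficiency = efficiency_arr.max()
    
    print(f"Maximum speedup achieved: {max_speedup:.2f}x with {best_workers} workers")
    print(f"Peak efficiency: {max_efficiency:.1f}%")
    
    # Estimate parallel fraction from best speedup
    # Using Amdahl's law: S = 1 / ((1-p) + p/n)
    # Solving for p: p = (n*(S-1))/(S*(n-1))
    S = max_speedup
    n = best_workers
    # Require 1 < S < n so that 0 < p < 1 and the serial fraction is positive
    if n > 1 and 1 < S < n:
        p_estimated = (n * (S - 1)) / (S * (n - 1))
        serial_fraction = 1 - p_estimated
        
        print(f"\nEstimated parallelizable fraction: {p_estimated*100:.1f}%")
        print(f"Estimated serial fraction: {serial_fraction*100:.1f}%")
        print(f"\nTheoretical max speedup (infinite cores): {1/serial_fraction:.2f}x")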
    
    # Scaling efficiency analysis
    if len(efficiencies) > 1:
        efficiency_drop = efficiencies[0] - efficiencies[-1]
        print(f"\nEfficiency drop from 1 to {max_workers} workers: {efficiency_drop:.1f}%")
        
        if efficiency_drop > 20:
            print("⚠ Significant efficiency loss - check for:")
            print("  - Synchronization overhead")
            print("  - Load imbalance")
            print("  - Memory contention")
        elif efficiency_drop > 10:
            print("⚠ Moderate efficiency loss - room for optimization")
        else:
            print("✓ Good scaling efficiency!")
else:
    print("Not enough data for analysis")
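
The parallel-fraction estimate used above comes from inverting Amdahl's Law for p. A small worked example, with a hypothetical measurement of 3.2x speedup on 4 workers, shows the arithmetic; `estimate_parallel_fraction` is an illustrative helper name.

```python
def estimate_parallel_fraction(speedup: float, workers: int) -> float:
    """Invert Amdahl's Law: p = n (S - 1) / (S (n - 1)). Requires workers > 1."""
    if workers <= 1:
        raise ValueError("need more than one worker to estimate p")
    return workers * (speedup - 1.0) / (speedup * (workers - 1))


# Hypothetical measurement: 3.2x speedup on 4 workers.
p = estimate_parallel_fraction(3.2, 4)
serial_fraction = 1.0 - p

print(f"parallel fraction ~ {p:.3f}, serial fraction ~ {serial_fraction:.3f}")
print(f"speedup ceiling   ~ {1.0 / serial_fraction:.1f}x")
```

This estimate folds every source of overhead (process startup, communication, load imbalance) into the "serial fraction", so it is best read as a rough diagnostic rather than a property of the algorithm.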

## 10. Summary and Best Practices

In [None]:
print("="*70)
print("PERFORMANCE BENCHMARKING BEST PRACTICES")
print("="*70)
print("""
1. EXPERIMENTAL DESIGN:
   ✓ Multiple runs (5-10) to estimate run-to-run variability
   ✓ Warmup runs before timing (JIT compilation, caching)
   ✓ Controlled environment (close other programs)
   ✓ Test various problem sizes
   ✓ Test various worker counts

2. METRICS TO COLLECT:
   ✓ Execution time (mean, std, min, max)
   ✓ Speedup (T_serial / T_parallel)
   ✓ Efficiency (Speedup / num_workers)
   ✓ Throughput (work/time)
   ✓ Memory usage
   ✓ CPU/GPU utilization

3. STATISTICAL RIGOR:
   ✓ Report confidence intervals
   ✓ Test for outliers
   ✓ Check distribution normality
   ✓ Use appropriate statistical tests

4. VISUALIZATION:
   ✓ Speedup curves (actual vs ideal)
   ✓ Efficiency plots
   ✓ Strong/weak scaling graphs
   ✓ Error bars showing variability
   ✓ Comparison tables

5. ANALYSIS:
   ✓ Compare to theoretical limits (Amdahl's Law)
   ✓ Identify bottlenecks
   ✓ Calculate parallel fraction
   ✓ Determine optimal configuration

6. REPORTING:
   ✓ Document hardware/software versions
   ✓ Describe problem characteristics
   ✓ Explain methodology
   ✓ Provide reproducible code
   ✓ Save data and plots
""")

print("\n" + "="*70)
print("FILES GENERATED IN THIS SESSION")
print("="*70)
print("""
  strong_scaling_analysis.png   - Strong scaling plots
  weak_scaling_analysis.png     - Weak scaling plots
  timing_distributions.png      - Statistical distributions
  amdahls_law.png              - Theoretical vs actual speedup
  performance_comparison.csv    - Detailed results table
""")

## Summary

In this module, you learned:

1. **Benchmark Design**
   - Framework for systematic testing
   - Multiple runs for statistical validity
   - Warmup and timing methodology

2. **Scaling Analysis**
   - Strong scaling (fixed problem size)
   - Weak scaling (scaled problem size)
   - Speedup and efficiency metrics

3. **Statistical Methods**
   - Confidence intervals
   - Distribution analysis
   - Normality testing

4. **Visualization**
   - Professional performance plots
   - Comparison charts
   - Amdahl's Law analysis

5. **Bottleneck Identification**
   - Serial sections
   - Synchronization overhead
   - Load imbalance

**What's Next?**

- **Module 09**: Advanced optimization techniques and hybrid approaches