# Module 09: Advanced Optimization Techniques

**Difficulty**: ⭐⭐⭐

**Estimated Time**: 90 minutes

**Prerequisites**: 
- Modules 05-08: Serial, Parallel, GPU, Benchmarking
- Understanding of performance bottlenecks

## Learning Objectives

By the end of this notebook, you will be able to:
1. Optimize memory access patterns for cache efficiency
2. Implement load balancing strategies for irregular workloads
3. Reduce communication overhead in parallel programs
4. Apply vectorization techniques for better CPU utilization
5. Design hybrid CPU-GPU approaches for maximum performance
6. Follow an iterative optimization process with measurable improvements

## 1. Setup and Imports

In [None]:
import numpy as np
import time
import matplotlib.pyplot as plt
import pandas as pd
from multiprocessing import Pool, cpu_count
from typing import Tuple, List
import seaborn as sns

# Try importing optimization libraries
try:
    from numba import jit, prange, set_num_threads
    NUMBA_AVAILABLE = True
except ImportError:
    NUMBA_AVAILABLE = False

try:
    from numba import cuda
    GPU_AVAILABLE = cuda.is_available()
except Exception:  # Numba missing, or the CUDA driver/runtime check failed
    GPU_AVAILABLE = False

# Set random seed
np.random.seed(42)

# Configure plotting
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

# System info
num_cores = cpu_count()
print(f"System Configuration:")
print(f"  CPU cores: {num_cores}")
print(f"  Numba available: {NUMBA_AVAILABLE}")
print(f"  GPU available: {GPU_AVAILABLE}")

## 2. Cache Optimization: Memory Access Patterns

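Cache misses can dominate the runtime of memory-bound code. This section times two traversal orders over the same matrix. Because the loops below are pure Python, interpreter overhead hides part of the effect, so expect a smaller gap than compiled code would show.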

In [None]:
def matrix_sum_row_major(matrix):
    """
    Sum matrix elements using row-major order (cache-friendly for NumPy).
    
    NumPy stores arrays in row-major (C) order by default.
    Accessing consecutive elements in a row = good cache locality.
    """
    total = 0.0
    rows, cols = matrix.shape
    
    # Outer loop: rows, inner loop: columns
    for i in range(rows):
        for j in range(cols):
            total += matrix[i, j]
    
    return total


def matrix_sum_column_major(matrix):
    """
    Sum matrix elements using column-major order (cache-unfriendly).
    
    Accessing down columns = jumping around in memory.
    Each access might miss the cache!
    """
    total = 0.0
    rows, cols = matrix.shape
    
    # Outer loop: columns, inner loop: rows
    for j in range(cols):
        for i in range(rows):
            total += matrix[i, j]
    
    return total


# Benchmark cache effects
print("="*70)
print("CACHE OPTIMIZATION: Memory Access Patterns")
print("="*70)

test_matrix = np.random.rand(2000, 2000)
print(f"\nTest matrix: {test_matrix.shape} ({test_matrix.nbytes/1e6:.2f} MB)")

# Row-major access (cache-friendly)
times_row = []
for _ in range(5):
    start = time.perf_counter()
    result_row = matrix_sum_row_major(test_matrix)
    times_row.append(time.perf_counter() - start)
time_row = np.mean(times_row)

print(f"\nRow-major access (cache-friendly):")
print(f"  Time: {time_row:.4f} seconds")
print(f"  Result: {result_row:.2f}")

# Column-major access (cache-unfriendly)
times_col = []
for _ in range(5):
    start = time.perf_counter()
    result_col = matrix_sum_column_major(test_matrix)
    times_col.append(time.perf_counter() - start)
time_col = np.mean(times_col)

print(f"\nColumn-major access (cache-unfriendly):")
print(f"  Time: {time_col:.4f} seconds")
print(f"  Result: {result_col:.2f}")

# Compare
slowdown = time_col / time_row
print(f"\nColumn-major / row-major time ratio: {slowdown:.2f}x")
print(f"\nWhy? Cache misses!")
print(f"  Row-major: Sequential access → data prefetched into cache")
print(f"  Column-major: Strided access → most of each fetched cache line goes unused")
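
The layout behind these timings can be inspected directly from array metadata. The following is a minimal sketch on a small, hypothetical array `a` (not the benchmark matrix), using only the standard NumPy attributes `strides` and `flags`:

```python
import numpy as np

# Small illustrative array; NumPy's default layout is row-major (C order).
a = np.zeros((4, 5), dtype=np.float64)

# Bytes to step to reach the next row and the next column.
row_stride, col_stride = a.strides

row = a[0, :]      # neighbouring elements are one float64 (8 bytes) apart
column = a[:, 0]   # neighbouring elements are a full row (5 * 8 bytes) apart

print(f"strides (bytes): {a.strides}")
print(f"row is contiguous:    {row.flags['C_CONTIGUOUS']}")
print(f"column is contiguous: {column.flags['C_CONTIGUOUS']}")
```

A row view walks memory in 8-byte steps, while a column view jumps a whole row per element, which is why the column-major loop touches a new cache line far more often.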

### 2.1 Cache-Optimized Matrix Transpose

In [None]:
def transpose_naive(matrix):
    """Naive transpose - poor cache usage."""
    rows, cols = matrix.shape
    result = np.zeros((cols, rows), dtype=matrix.dtype)
    
    for i in range(rows):
        for j in range(cols):
            result[j, i] = matrix[i, j]  # Column-major write!
    
    return result


def transpose_tiled(matrix, tile_size=32):
    """
    Cache-optimized transpose using tiling.
    
    Process matrix in tiles that fit in cache.
    Within each tile, both reads and writes are cache-friendly.
    """
    rows, cols = matrix.shape
    result = np.zeros((cols, rows), dtype=matrix.dtype)
    
    # Process in tiles
    for i in range(0, rows, tile_size):
        for j in range(0, cols, tile_size):
            # Get tile boundaries
            i_end = min(i + tile_size, rows)
            j_end = min(j + tile_size, cols)
            
            # Transpose this tile
            for ii in range(i, i_end):
                for jj in range(j, j_end):
                    result[jj, ii] = matrix[ii, jj]
    
    return result


# Benchmark transpose methods
print("\n" + "="*70)
print("TILED TRANSPOSE FOR CACHE OPTIMIZATION")
print("="*70)

test_size = 2048
test_matrix = np.random.rand(test_size, test_size).astype(np.float32)

# Naive
start = time.time()
result_naive = transpose_naive(test_matrix)
time_naive = time.time() - start

# Tiled
start = time.time()
result_tiled = transpose_tiled(test_matrix, tile_size=32)
time_tiled = time.time() - start

# NumPy (highly optimized)
start = time.time()
result_numpy = test_matrix.T.copy()
time_numpy = time.time() - start

print(f"Matrix: {test_size}x{test_size}\n")
print(f"Naive transpose:  {time_naive:.4f} seconds")
print(f"Tiled transpose:  {time_tiled:.4f} seconds ({time_naive/time_tiled:.2f}x vs naive)")
print(f"NumPy transpose:  {time_numpy:.4f} seconds ({time_naive/time_numpy:.2f}x vs naive)")

# Verify correctness
assert np.allclose(result_naive, result_tiled)
assert np.allclose(result_naive, result_numpy)
print("\n✓ All results match!")
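
In pure Python, the per-element loops inside each tile cost far more than any cache miss, so tiling alone may not pay off here. One possible variant, sketched below, keeps the tile structure but copies each tile with a single NumPy slice assignment. The name `transpose_tiled_slices` and the default `tile_size=64` are illustrative assumptions, not tuned values.

```python
import numpy as np

def transpose_tiled_slices(matrix, tile_size=64):
    """Tiled transpose where each tile is copied by one slice assignment."""
    rows, cols = matrix.shape
    result = np.empty((cols, rows), dtype=matrix.dtype)

    for i in range(0, rows, tile_size):
        for j in range(0, cols, tile_size):
            # Slices are clipped at the array edge, so partial tiles work too.
            tile = matrix[i:i + tile_size, j:j + tile_size]
            result[j:j + tile_size, i:i + tile_size] = tile.T

    return result

# Non-square matrix with a tile size that does not divide either dimension.
matrix = np.arange(5 * 7, dtype=np.float32).reshape(5, 7)
transposed = transpose_tiled_slices(matrix, tile_size=3)
print(np.array_equal(transposed, matrix.T))
```

Whether this beats `matrix.T.copy()` depends on the machine and matrix size, so it should be benchmarked rather than assumed.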

## 3. Load Balancing: Dynamic Work Distribution

In [None]:
def variable_workload(n):
    """
    Simulate task with variable computation time.
    
    Task n takes time proportional to n.
    This creates load imbalance if statically distributed.
    """
    # Simulate work (compute-bound)
    result = 0
    for i in range(n * 100000):
        result += np.sqrt(i + 1)
    return result


def static_load_balancing(tasks, num_workers):
    """
    Static partitioning: Divide tasks equally among workers.
    
    Problem: If tasks have variable time, some workers finish early.
    """
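    # One contiguous block of tasks per worker (at least one task per block)
    chunk_size = max(1, len(tasks) // num_workers)
    
    with Pool(processes=num_workers) as pool:
        # An explicit chunksize hands out fixed, contiguous blocks of tasks
        results = pool.map(variable_workload, tasks, chunksize=chunk_size)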
    
    return results


def dynamic_load_balancing(tasks, num_workers):
    """
    Dynamic scheduling: Workers grab tasks as they become available.
    
    Uses chunksize=1 so workers get one task at a time.
    A worker that finishes early picks up the next task instead of idling.
    """
    with Pool(processes=num_workers) as pool:
        # chunksize=1 enables dynamic scheduling
        results = pool.map(variable_workload, tasks, chunksize=1)
    
    return results


# Benchmark load balancing strategies
print("="*70)
print("LOAD BALANCING: Static vs Dynamic")
print("="*70)

# Create tasks with variable workload
# Tasks 1, 2, 3, ..., 20 (increasing difficulty)
tasks = list(range(1, 21))
num_workers = 4

print(f"\nTasks: {tasks}")
print(f"Workers: {num_workers}")
print(f"Note: Task difficulty increases linearly\n")

# Static partitioning
print("Static partitioning...")
start = time.time()
results_static = static_load_balancing(tasks, num_workers)
time_static = time.time() - start
print(f"  Time: {time_static:.3f} seconds")
print(f"  Problem: The block of tasks [16,17,18,19,20] takes longest, so one worker finishes last")

# Dynamic scheduling
print("\nDynamic scheduling...")
start = time.time()
results_dynamic = dynamic_load_balancing(tasks, num_workers)
time_dynamic = time.time() - start
print(f"  Time: {time_dynamic:.3f} seconds")
print(f"  Benefit: Workers grab tasks as available - less idle time")

# Compare
speedup = time_static / time_dynamic
print(f"\nDynamic is {speedup:.2f}x faster!")
print(f"Savings: {time_static - time_dynamic:.3f} seconds ({(1-1/speedup)*100:.1f}% reduction)")
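
The expected gap between the two strategies can be estimated without starting any processes. The sketch below is a simplified model, not a measurement: it assumes task `n` costs exactly `n` time units, ignores scheduling and serialization overhead, and hands each chunk to whichever worker becomes free first. `simulate_makespan` is a hypothetical helper introduced for this illustration.

```python
import heapq

def simulate_makespan(costs, num_workers, chunksize):
    """Finish time of the last worker when chunks are handed out in order."""
    # Total cost of each contiguous chunk of tasks.
    chunk_costs = [sum(costs[i:i + chunksize])
                   for i in range(0, len(costs), chunksize)]

    # Min-heap of the time at which each worker becomes free.
    free_at = [0] * num_workers
    for cost in chunk_costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)

    return max(free_at)

costs = list(range(1, 21))   # task n costs n units, as in variable_workload
num_workers = 4

static = simulate_makespan(costs, num_workers, chunksize=len(costs) // num_workers)
dynamic = simulate_makespan(costs, num_workers, chunksize=1)
ideal = sum(costs) / num_workers

print(f"static: {static}, dynamic: {dynamic}, ideal lower bound: {ideal}")
```

Under these assumptions the static split finishes at 90 units (the block 16..20) while one-task-at-a-time scheduling finishes at 60, against a lower bound of 52.5. Measured times will differ because real tasks carry process and communication overhead.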

## 4. Reducing Communication Overhead

In [None]:
def sum_of_squares(chunk):
    """Worker function, defined at module level so Pool can pickle it."""
    return np.sum(chunk ** 2)  # Simple computation


def process_small_chunks(data, num_workers):
    """
    Process data in many small chunks.
    
    High communication overhead: data serialization/deserialization
    happens many times.
    """
    chunk_size = 1000  # Small chunks
    chunks = [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]
    
    with Pool(processes=num_workers) as pool:
        results = pool.map(sum_of_squares, chunks)
    
    return sum(results)


def process_large_chunks(data, num_workers):
    """
    Process data in fewer large chunks.
    
    Lower communication overhead: data transferred fewer times.
    Better computation/communication ratio.
    """
    chunk_size = len(data) // num_workers  # Large chunks
    chunks = [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]
    
    with Pool(processes=num_workers) as pool:
        results = pool.map(sum_of_squares, chunks)
    
    return sum(results)


# Benchmark communication overhead
print("="*70)
print("COMMUNICATION OVERHEAD: Chunk Size Impact")
print("="*70)

data = np.random.rand(10_000_000)  # 10M elements
num_workers = 4

print(f"\nData size: {len(data):,} elements ({data.nbytes/1e6:.2f} MB)")
print(f"Workers: {num_workers}\n")

# Small chunks (high overhead)
print("Small chunks (1000 elements each)...")
start = time.time()
result_small = process_small_chunks(data, num_workers)
time_small = time.time() - start
num_chunks_small = len(data) // 1000
print(f"  Chunks: {num_chunks_small}")
print(f"  Time: {time_small:.4f} seconds")

# Large chunks (low overhead)
print(f"\nLarge chunks ({len(data)//num_workers:,} elements each)...")
start = time.time()
result_large = process_large_chunks(data, num_workers)
time_large = time.time() - start
print(f"  Chunks: {num_workers}")
print(f"  Time: {time_large:.4f} seconds")

# Serial for comparison
print("\nSerial (no communication)...")
start = time.time()
result_serial = np.sum(data ** 2)
time_serial = time.time() - start
print(f"  Time: {time_serial:.4f} seconds")

# Analysis
print("\n" + "="*70)
print("ANALYSIS")
print("="*70)
print(f"Large chunks are {time_small/time_large:.2f}x faster than small chunks")
print(f"\nWhy?")
print(f"  - Fewer serialization/deserialization operations")
print(f"  - Better computation-to-communication ratio")
print(f"  - Less overhead from process coordination")
print(f"\nRule of thumb: Each chunk should take >10ms to process")
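
Slicing with a fixed `chunk_size` of `len(data) // num_workers` produces an extra, short chunk whenever the length is not a multiple of the worker count. One alternative, sketched here on a tiny hypothetical array, is `np.array_split`, which always returns exactly the requested number of pieces:

```python
import numpy as np

data = np.arange(10, dtype=np.float64)   # 10 elements, not divisible by 4
num_workers = 4

# Fixed-size slicing: chunk_size = 2 gives 5 chunks for 4 workers.
chunk_size = len(data) // num_workers
fixed = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# array_split: exactly num_workers chunks, sizes differ by at most one.
balanced = np.array_split(data, num_workers)

print("fixed sizes:   ", [len(c) for c in fixed])
print("balanced sizes:", [len(c) for c in balanced])

# Both partitions cover the data exactly once.
total = sum(np.sum(c ** 2) for c in balanced)
print(np.isclose(total, np.sum(data ** 2)))
```

With one chunk per worker, no worker has to come back for a small leftover piece after the others have finished.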

## 5. Vectorization with NumPy/Numba

In [None]:
# Scalar (slow)
def polynomial_scalar(x, coefficients):
    """
    Evaluate polynomial using loops (slow).
    
    P(x) = c0 + c1*x + c2*x² + c3*x³ + ...
    """
    result = np.zeros_like(x)
    
    for i in range(len(x)):
        for j, coef in enumerate(coefficients):
            result[i] += coef * (x[i] ** j)
    
    return result


# Vectorized (fast)
def polynomial_vectorized(x, coefficients):
    """
    Evaluate polynomial using NumPy vectorization.
    
    Uses broadcasting and optimized C loops.
    """
    result = np.zeros_like(x)
    
    for j, coef in enumerate(coefficients):
        result += coef * (x ** j)
    
    return result


if NUMBA_AVAILABLE:
    @jit(nopython=True)
    def polynomial_numba(x, coefficients):
        """
        Numba JIT-compiled version.
        
        Compiles to machine code - no Python overhead.
        """
        result = np.zeros_like(x)
        
        for i in range(len(x)):
            for j, coef in enumerate(coefficients):
                result[i] += coef * (x[i] ** j)
        
        return result


# Benchmark vectorization
print("="*70)
print("VECTORIZATION: Scalar vs Vectorized vs Numba")
print("="*70)

x = np.linspace(0, 10, 1_000_000)
coefficients = np.array([1.0, 2.0, -0.5, 0.1, -0.01])  # 4th degree polynomial (array avoids Numba's list-argument deprecation warning)

print(f"\nInput size: {len(x):,} points")
print(f"Polynomial degree: {len(coefficients)-1}\n")

# Scalar version
print("Scalar (Python loops)...")
start = time.time()
result_scalar = polynomial_scalar(x, coefficients)
time_scalar = time.time() - start
print(f"  Time: {time_scalar:.4f} seconds")

# Vectorized version
print("\nVectorized (NumPy)...")
start = time.time()
result_vectorized = polynomial_vectorized(x, coefficients)
time_vectorized = time.time() - start
print(f"  Time: {time_vectorized:.4f} seconds")
print(f"  Speedup: {time_scalar/time_vectorized:.2f}x")

# Numba version
if NUMBA_AVAILABLE:
    print("\nNumba JIT (warmup)...")
    _ = polynomial_numba(x[:100], coefficients)  # Warmup
    
    print("Numba JIT (compiled)...")
    start = time.time()
    result_numba = polynomial_numba(x, coefficients)
    time_numba = time.time() - start
    print(f"  Time: {time_numba:.4f} seconds")
    print(f"  Speedup: {time_scalar/time_numba:.2f}x")

# Verify correctness
assert np.allclose(result_scalar, result_vectorized)
if NUMBA_AVAILABLE:
    assert np.allclose(result_scalar, result_numba)
print("\n✓ All results match!")
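
The vectorized version still computes `x ** j` from scratch for every coefficient. A further refinement, sketched below, is Horner's rule, which needs one multiply and one add per coefficient. `polynomial_horner` is a hypothetical helper, and `np.polynomial.polynomial.polyval` is used only as a reference for the check.

```python
import numpy as np

def polynomial_horner(x, coefficients):
    """Evaluate c0 + c1*x + c2*x**2 + ... using Horner's rule."""
    x = np.asarray(x, dtype=np.float64)

    # Start from the highest-order coefficient and fold inward:
    # (((c4*x + c3)*x + c2)*x + c1)*x + c0
    result = np.full_like(x, coefficients[-1])
    for coef in coefficients[-2::-1]:
        result = result * x + coef

    return result

x = np.linspace(0, 10, 1_000)
coefficients = [1.0, 2.0, -0.5, 0.1, -0.01]

horner = polynomial_horner(x, coefficients)
reference = np.polynomial.polynomial.polyval(x, coefficients)
print(np.allclose(horner, reference))
```

Each loop iteration is still a whole-array NumPy operation, so the technique combines with vectorization rather than replacing it.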

## 6. Hybrid CPU-GPU Approach

In [None]:
print("="*70)
print("HYBRID CPU-GPU APPROACH")
print("="*70)
print("""
Strategy: Use both CPU and GPU simultaneously!

1. DIVIDE WORK:
   - 70% of data → GPU (faster for large batches)
   - 30% of data → CPU (parallel with multiprocessing)
   - Overlap computation: GPU and CPU work in parallel

2. WHEN TO USE:
   - GPU transfer overhead is significant
   - Problem is large enough for both
   - CPU has idle cycles while waiting for GPU

3. IMPLEMENTATION:
   - Use threading to launch GPU and CPU work simultaneously
   - GPU works on large chunk (high throughput)
   - CPU works on smaller chunk (low latency)
   - Combine results at the end

4. EXPECTED SPEEDUP:
   - Better than GPU-only (uses idle CPU)
   - Better than CPU-only (GPU handles bulk)
   - Typically 1.2-1.5x improvement over best single approach
""")

if GPU_AVAILABLE:
    print("\nGPU available - hybrid approach is viable!")
    print("Recommended split:")
    print(f"  GPU: 70% of work (maximize GPU utilization)")
    print(f"  CPU: 30% of work ({num_cores} cores for parallelization)")
else:
    print("\nGPU not available - hybrid approach not demonstrated.")
    print("With GPU, you could achieve:")
    print("  - 5-10x from GPU on 70% of work")
    print("  - 2-4x from CPU parallel on 30% of work")
    print("  - Total: ~6-12x speedup (better than either alone)")
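
The combined estimate follows from a simple overlap model, sketched below. It is an idealization: it assumes the CPU and GPU parts run fully in parallel, that each speedup is measured against the same serial baseline, and that transfer and launch overheads are zero. Both function names are hypothetical.

```python
def hybrid_speedup(gpu_fraction, gpu_speedup, cpu_speedup):
    """Overall speedup when GPU and CPU shares run concurrently."""
    gpu_time = gpu_fraction / gpu_speedup          # relative to serial time 1.0
    cpu_time = (1.0 - gpu_fraction) / cpu_speedup
    # The run ends when the slower of the two sides finishes.
    return 1.0 / max(gpu_time, cpu_time)

def balanced_gpu_fraction(gpu_speedup, cpu_speedup):
    """Share of work for the GPU so that both sides finish together."""
    return gpu_speedup / (gpu_speedup + cpu_speedup)

low = hybrid_speedup(0.7, gpu_speedup=5, cpu_speedup=2)
high = hybrid_speedup(0.7, gpu_speedup=10, cpu_speedup=4)
print(f"70/30 split: {low:.2f}x to {high:.2f}x")

best_split = balanced_gpu_fraction(5, 2)
print(f"balanced split: {best_split:.3f} on GPU, "
      f"{hybrid_speedup(best_split, 5, 2):.2f}x overall")
```

With the 70/30 split this model gives about 6.7x and 13.3x for the two ends of the quoted ranges, in line with the rough 6-12x figure. In the model the best split is the one where both sides finish at the same time, and the achievable speedup is then the sum of the two individual speedups.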

## 7. Iterative Optimization Process

In [None]:
print("="*70)
print("ITERATIVE OPTIMIZATION METHODOLOGY")
print("="*70)
print("""
Step-by-step process for optimizing parallel programs:

STEP 1: PROFILE & MEASURE
  ✓ Identify bottleneck (CPU? Memory? Communication?)
  ✓ Measure baseline performance
  ✓ Use profiling tools (cProfile, line_profiler, NVIDIA Nsight)

STEP 2: CHOOSE OPTIMIZATION TARGET
  Focus on ONE thing at a time:
  □ Cache optimization (memory access patterns)
  □ Load balancing (work distribution)
  □ Communication reduction (chunk sizing)
  □ Vectorization (SIMD operations)
  □ Algorithm improvement (O(n²) → O(n log n))

STEP 3: IMPLEMENT CHANGE
  ✓ Make ONE change at a time
  ✓ Keep code version controlled (git)
  ✓ Document what you changed and why

STEP 4: MEASURE IMPACT
  ✓ Run benchmarks (multiple iterations)
  ✓ Calculate speedup
  ✓ Check correctness (results still accurate?)

STEP 5: DECIDE
  ✓ If faster: Keep change, goto STEP 1
  ✓ If slower: Revert change, try different approach
  ✓ If no more bottlenecks: Done!

GOLDEN RULES:
  1. Measure, don't guess
  2. Optimize the bottleneck, not everything
  3. One change at a time
  4. Document your progress
  5. Know when to stop (diminishing returns)
""")

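Step 4 of the process above (repeat the benchmark, compute the speedup, check correctness) can be wrapped in a small helper. This is a minimal sketch: `benchmark`, `compare`, and the two sum-of-squares functions are hypothetical names, and taking the best of several runs is one common choice rather than the only valid one.

```python
import time
import numpy as np

def benchmark(func, *args, repeats=5):
    """Return (best_time_in_seconds, last_result) over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

def compare(baseline, candidate, *args, repeats=5):
    """Time both versions on the same input and check that results agree."""
    base_time, base_result = benchmark(baseline, *args, repeats=repeats)
    cand_time, cand_result = benchmark(candidate, *args, repeats=repeats)
    return {
        "baseline_s": base_time,
        "candidate_s": cand_time,
        "speedup": base_time / cand_time,
        "correct": bool(np.allclose(base_result, cand_result)),
    }

def python_sum_of_squares(values):
    return sum(v * v for v in values)

def numpy_sum_of_squares(values):
    return float(np.dot(values, values))

values = np.random.default_rng(0).random(50_000)
report = compare(python_sum_of_squares, numpy_sum_of_squares, values)
print(report)
```

A change is only worth keeping when `correct` is true and the speedup is reproducible across runs.
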
## 8. Case Study: Image Processing Optimization

In [None]:
def create_test_image(size=(1024, 1024)):
    """Create test image."""
    height, width = size
    image = np.random.randint(0, 256, size=(height, width), dtype=np.uint8)
    return image


def gaussian_kernel(size=5, sigma=1.0):
    """Create Gaussian kernel."""
    if size % 2 == 0:
        size += 1
    ax = np.arange(-size // 2 + 1, size // 2 + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return kernel / np.sum(kernel)


# Version 1: Baseline (naive loops)
def blur_v1_naive(image, kernel):
    """Version 1: Naive nested loops."""
    height, width = image.shape
    k_size = kernel.shape[0]
    k_half = k_size // 2
    
    padded = np.pad(image, k_half, mode='edge')
    result = np.zeros_like(image, dtype=np.float32)
    
    for i in range(height):
        for j in range(width):
            for ki in range(k_size):
                for kj in range(k_size):
                    result[i, j] += padded[i+ki, j+kj] * kernel[ki, kj]
    
    return result.astype(np.uint8)


# Version 2: Optimized inner loops (NumPy slicing)
def blur_v2_vectorized(image, kernel):
    """Version 2: Vectorized inner loop."""
    height, width = image.shape
    k_size = kernel.shape[0]
    k_half = k_size // 2
    
    padded = np.pad(image, k_half, mode='edge')
    result = np.zeros_like(image, dtype=np.float32)
    
    for i in range(height):
        for j in range(width):
            region = padded[i:i+k_size, j:j+k_size]
            result[i, j] = np.sum(region * kernel)
    
    return result.astype(np.uint8)


# Version 3: Full vectorization (scipy)
def blur_v3_scipy(image, kernel):
    """Version 3: Use scipy's optimized convolution."""
    from scipy.ndimage import convolve
    return convolve(image.astype(np.float32), kernel, mode='nearest').astype(np.uint8)


# Benchmark all versions
print("="*70)
print("CASE STUDY: Progressive Image Processing Optimization")
print("="*70)

image = create_test_image((512, 512))
kernel = gaussian_kernel(size=5, sigma=1.5)

print(f"\nImage: {image.shape}")
print(f"Kernel: {kernel.shape}\n")

versions = [
    ("V1: Naive loops", blur_v1_naive),
    ("V2: Vectorized inner", blur_v2_vectorized),
    ("V3: Scipy (optimized)", blur_v3_scipy)
]

results = []
baseline_time = None

for name, func in versions:
    print(f"{name}...")
    
    # Warmup
    _ = func(image[:100, :100], kernel)
    
    # Time
    times = []
    for _ in range(3):
        start = time.time()
        result = func(image, kernel)
        times.append(time.time() - start)
    
    mean_time = np.mean(times)
    
    if baseline_time is None:
        baseline_time = mean_time
        speedup = 1.0
    else:
        speedup = baseline_time / mean_time
    
    print(f"  Time: {mean_time:.4f} seconds")
    print(f"  Speedup: {speedup:.2f}x\n")
    
    results.append({
        'Version': name,
        'Time (s)': mean_time,
        'Speedup': speedup
    })

# Summary
df_results = pd.DataFrame(results)
print("="*70)
print("OPTIMIZATION SUMMARY")
print("="*70)
print(df_results.to_string(index=False))
print(f"\nTotal improvement: {df_results['Speedup'].max():.2f}x")
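
Between V2 and V3 there is an intermediate step that needs no extra library: loop over the kernel entries instead of the pixels, so that each iteration is one whole-image NumPy operation. The sketch below assumes a square kernel and, unlike the versions above, returns the floating-point result without converting to `uint8`. `blur_shifted_slices` is a hypothetical name.

```python
import numpy as np

def blur_shifted_slices(image, kernel):
    """Weighted sum of shifted image views: k*k array operations in total."""
    height, width = image.shape
    k_size = kernel.shape[0]
    k_half = k_size // 2

    padded = np.pad(image, k_half, mode='edge').astype(np.float64)
    result = np.zeros((height, width), dtype=np.float64)

    for ki in range(k_size):
        for kj in range(k_size):
            # View of the padded image shifted by (ki, kj), same size as the output.
            result += kernel[ki, kj] * padded[ki:ki + height, kj:kj + width]

    return result

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 48), dtype=np.uint8)
kernel = np.full((5, 5), 1.0 / 25.0)   # simple box filter, for illustration only

blurred = blur_shifted_slices(image, kernel)
print(blurred.shape, blurred.dtype)
```

For a 5x5 kernel this replaces `height * width` Python iterations with 25, which is the same idea as V2 applied to the outer loops.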

In [None]:
# Visualize optimization progress
plt.figure(figsize=(14, 5))

# Plot 1: Execution time
plt.subplot(121)
versions_names = df_results['Version'].values
times = df_results['Time (s)'].values
colors = ['red', 'orange', 'green']

bars = plt.bar(range(len(versions_names)), times, color=colors, alpha=0.7, edgecolor='black')
plt.xticks(range(len(versions_names)), versions_names, rotation=15, ha='right')
plt.ylabel('Execution Time (seconds)', fontsize=12)
plt.title('Optimization Progress: Execution Time', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for i, (bar, time_val) in enumerate(zip(bars, times)):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
            f'{time_val:.3f}s', ha='center', va='bottom', fontsize=10, fontweight='bold')

# Plot 2: Speedup
plt.subplot(122)
speedups = df_results['Speedup'].values

plt.plot(range(len(versions_names)), speedups, 'o-', linewidth=3, markersize=12, color='green')
plt.xticks(range(len(versions_names)), versions_names, rotation=15, ha='right')
plt.ylabel('Speedup vs Baseline', fontsize=12)
plt.title('Cumulative Speedup', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.axhline(y=1, linestyle='--', color='gray', linewidth=1, label='Baseline')

# Add value labels
for i, speedup in enumerate(speedups):
    plt.text(i, speedup + 0.5, f'{speedup:.1f}x', ha='center', va='bottom', 
            fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig('optimization_progress.png', dpi=150, bbox_inches='tight')
plt.show()

print("Optimization progress plot saved to: optimization_progress.png")

## 9. Performance Optimization Checklist

In [None]:
print("="*70)
print("PERFORMANCE OPTIMIZATION CHECKLIST")
print("="*70)
print("""
Before starting optimization:
  □ Profile to find bottleneck
  □ Measure baseline performance
  □ Set performance target
  □ Ensure correctness tests exist

Algorithm level:
  □ Use optimal algorithm (O(n log n) vs O(n²))
  □ Avoid redundant computation
  □ Cache intermediate results
  □ Consider approximation algorithms

Memory optimization:
  □ Optimize memory access patterns (cache-friendly)
  □ Use appropriate data structures
  □ Minimize memory allocations
  □ Reuse buffers when possible
  □ Consider memory layout (row vs column major)

Parallelization:
  □ Identify parallelizable sections
  □ Choose right parallel approach (CPU/GPU/hybrid)
  □ Minimize synchronization
  □ Balance load across workers
  □ Use appropriate chunk sizes

CPU optimization:
  □ Vectorize operations (NumPy, SIMD)
  □ Use JIT compilation (Numba)
  □ Optimize loop structures
  □ Minimize branch mispredictions
  □ Use multi-threading/processing

GPU optimization:
  □ Minimize CPU-GPU transfers
  □ Use shared memory
  □ Coalesce memory access
  □ Maximize occupancy
  □ Avoid branch divergence

Communication:
  □ Reduce data transfer volume
  □ Overlap communication and computation
  □ Use large chunks (amortize overhead)
  □ Consider compression

After optimization:
  □ Verify correctness
  □ Measure performance gain
  □ Check scalability
  □ Document changes
  □ Consider maintainability

When to stop:
  ✓ Reached performance target
  ✓ Diminishing returns (<5% gain)
  ✓ Code becoming unmaintainable
  ✓ Hit hardware limits
""")

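The first checklist item, profiling to find the bottleneck, can be done with the standard library alone. The sketch below profiles a hypothetical `workload` made of one slow and one fast function and prints the entries with the highest cumulative time.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Sum of squares 0..n-1 with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def fast_sum(n):
    """Same sum using the closed-form formula."""
    return (n - 1) * n * (2 * n - 1) // 6

def workload():
    return slow_sum(200_000), fast_sum(200_000)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Collect the report in a string so it can be printed or saved.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

In the report, `slow_sum` accounts for nearly all of the time, which identifies it as the function to optimize first.
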
## 10. Summary and Recommendations

In [None]:
print("="*70)
print("OPTIMIZATION TECHNIQUES SUMMARY")
print("="*70)
print("""
Techniques covered in this module:

1. CACHE OPTIMIZATION
   Impact: 2-10x speedup
   When: Memory-bound applications
   How: Row-major access, tiling, data locality

2. LOAD BALANCING
   Impact: 1.5-3x speedup
   When: Variable workload per task
   How: Dynamic scheduling, work stealing

3. COMMUNICATION REDUCTION
   Impact: 2-5x speedup
   When: Communication-bound parallel code
   How: Large chunks, batching, minimize transfers

4. VECTORIZATION
   Impact: 5-50x speedup
   When: Elementwise operations
   How: NumPy operations, SIMD, Numba

5. HYBRID CPU-GPU
   Impact: 1.2-1.5x over best single approach
   When: Large problems with both resources
   How: Split work 70-30, overlap execution

Recommended optimization order:
  1st: Choose right algorithm (biggest impact)
  2nd: Vectorize (often easy, big wins)
  3rd: Parallelize (CPU cores or GPU)
  4th: Cache optimization (if still memory-bound)
  5th: Load balancing (if scaling poorly)
  6th: Communication reduction (if overhead high)
  7th: Hybrid approaches (squeeze last bits)

Tools for optimization:
  Profiling: cProfile, line_profiler, py-spy
  CPU: NumPy, Numba, multiprocessing
  GPU: Numba CUDA, CuPy, PyTorch
  Memory: memory_profiler, tracemalloc
  Benchmarking: timeit, pytest-benchmark

Remember:
  - Premature optimization is the root of all evil
  - Measure, don't guess
  - Optimize the bottleneck
  - Keep it maintainable
  - Know when to stop
""")

print("\n" + "="*70)
print("CONGRATULATIONS!")
print("="*70)
print("""
You've completed the parallel processing assignment modules!

You now know:
  ✓ How to implement serial baselines (Module 05)
  ✓ Multi-core CPU parallelization (Module 06)
  ✓ GPU acceleration with CUDA (Module 07)
  ✓ Performance benchmarking (Module 08)
  ✓ Advanced optimization techniques (Module 09)

Next steps:
  1. Apply these techniques to YOUR assignment problem
  2. Create your own benchmarks
  3. Write your performance analysis report
  4. Compare serial, parallel, and GPU versions

Good luck with your assignment!
""")

## Summary

In this module, you learned:

1. **Cache Optimization**
   - Row-major vs column-major access
   - Tiling for better locality
   - Impact on performance (2-10x)

2. **Load Balancing**
   - Static vs dynamic scheduling
   - Handling variable workloads
   - Minimizing idle time

3. **Communication Optimization**
   - Chunk sizing strategies
   - Reducing overhead
   - Computation/communication ratio

4. **Vectorization**
   - NumPy broadcasting
   - Numba JIT compilation
   - SIMD operations

5. **Hybrid Approaches**
   - CPU-GPU work splitting
   - Overlapping execution
   - Maximizing utilization

6. **Iterative Process**
   - Profile → Optimize → Measure
   - One change at a time
   - Know when to stop

**You're now ready to optimize your own parallel programs!**