# Module 06: Multi-Core CPU Parallelization with Python

**Difficulty**: ⭐⭐⭐

**Estimated Time**: 90 minutes

**Prerequisites**: 
- Module 05: Serial Implementation
- Understanding of processes vs threads
- Basic parallel computing concepts

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand OpenMP concepts and how they map to Python multiprocessing
2. Implement parallel versions of serial algorithms using multiprocessing
3. Use Numba's parallel features for CPU parallelization
4. Manage thread synchronization and avoid race conditions
5. Implement load balancing strategies for optimal performance
6. Measure and analyze speedup from parallelization

## 1. Setup and Imports

In [None]:
import numpy as np
import time
import multiprocessing as mp
from multiprocessing import Pool, cpu_count
from functools import partial
import matplotlib.pyplot as plt
import pandas as pd
from typing import Tuple, List

# Try to import numba (for JIT compilation and parallelization)
try:
    from numba import jit, prange, set_num_threads
    NUMBA_AVAILABLE = True
    print("Numba available - will use for JIT compilation")
except ImportError:
    NUMBA_AVAILABLE = False
    print("Numba not available - install with: pip install numba")
    print("Will use multiprocessing only")

# Set random seed
np.random.seed(42)

# Configure matplotlib
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# Get system information
num_cores = cpu_count()
print(f"\nSystem information:")
print(f"  Logical CPU cores available: {num_cores}")
print(f"  NumPy version: {np.__version__}")

## 2. OpenMP Concepts → Python Equivalents

### OpenMP Directives and Python Mapping

| OpenMP (C/C++) | Python Equivalent | Purpose |
|----------------|-------------------|----------|
| `#pragma omp parallel` | `multiprocessing.Pool()` | Create parallel region |
| `#pragma omp for` | `pool.map()` | Parallel loop |
| `#pragma omp parallel for` | Numba `prange()` | Combined parallel+loop |
| `#pragma omp critical` | `multiprocessing.Lock()` | Critical section |
| `#pragma omp atomic` | `multiprocessing.Value()` with `get_lock()` | Atomic operation |
| `#pragma omp reduction` | Manual reduction in Pool | Reduction operation |
| `num_threads(N)` | `Pool(processes=N)` | Set worker count |

### Important Differences

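- **Python multiprocessing uses processes, not threads**, because the GIL lets only one thread execute Python bytecode at a time
- **Higher overhead** for creating processes than for creating threads
- **No shared memory by default**: arguments and results are pickled and copied between processes
- **Numba can use real threads** because its compiled loops do not need the GIL, which suits numerical code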
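
The "no shared memory" point can be checked without starting any workers. The sketch below assumes that a `pickle` round trip is a fair stand-in for what `Pool` does to its arguments; it shows that the receiving side gets a copy, so in-place changes never reach the original array.

```python
import pickle

import numpy as np

# Arguments sent to a worker process are pickled, transferred, and unpickled.
original = np.arange(6, dtype=np.int64)
payload = pickle.dumps(original)       # bytes that would cross the process boundary
worker_copy = pickle.loads(payload)    # what the worker would receive

worker_copy[0] = 99                    # modify the worker-side copy in place

print(f"Serialized size: {len(payload)} bytes")
print(f"Original after worker-side edit: {original}")
print(f"Copy shares memory with original: {np.shares_memory(original, worker_copy)}")
```

The serialized size grows with the array, which is why passing large inputs to many small tasks can cost more than the computation itself.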

## 3. Timing and Utilities

In [None]:
def time_function(func, *args, num_runs=3, **kwargs):
    """
    Time a function execution with multiple runs.
    """
    times = []
    result = None
    
    for i in range(num_runs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        times.append(end - start)
    
    mean_time = np.mean(times)
    std_time = np.std(times)
    
    return mean_time, std_time, result


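def calculate_speedup(serial_time, parallel_time, num_workers=None):
    """
    Calculate speedup and efficiency.
    
    Speedup = T_serial / T_parallel
    Efficiency = Speedup / num_workers
    
    num_workers is the number of processes or threads actually used.
    It defaults to num_cores, which is only correct for runs on all cores.
    """
    if num_workers is None:
        num_workers = num_cores
    speedup = serial_time / parallel_time
    efficiency = speedup / num_workers
    return speedup, efficiency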


# Create test data (reuse from Module 05)
def create_test_image(size=(512, 512)):
    """Create test image for filtering."""
    height, width = size
    image = np.zeros((height, width), dtype=np.uint8)
    
    for i in range(height):
        image[i, :] = int(255 * i / height)
    
    image[100:200, 100:300] = 200
    
    center_y, center_x = height // 2, width // 2
    radius = 80
    y, x = np.ogrid[:height, :width]
    mask = (x - center_x)**2 + (y - center_y)**2 <= radius**2
    image[mask] = 150
    
    noise = np.random.randint(-20, 20, size=(height, width))
    image = np.clip(image.astype(np.int16) + noise, 0, 255).astype(np.uint8)
    
    return image

print("Utilities loaded successfully!")
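
As a worked example of the two formulas, the sketch below uses made-up timings (8.0 s serial, 2.5 s on 4 workers). The helper name `speedup_and_efficiency` is hypothetical and separate from the notebook's own utilities; the point is that efficiency divides by the number of workers used in that run.

```python
def speedup_and_efficiency(serial_time, parallel_time, num_workers):
    """Return (speedup, efficiency) for a run that used num_workers workers."""
    speedup = serial_time / parallel_time
    efficiency = speedup / num_workers
    return speedup, efficiency


# Hypothetical timings, chosen only to illustrate the arithmetic
speedup, efficiency = speedup_and_efficiency(8.0, 2.5, num_workers=4)
print(f"Speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

Here the speedup is 8.0 / 2.5 = 3.2 and the efficiency is 3.2 / 4 = 0.8, so each worker is doing useful work about 80% of the time.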

## 4. Image Processing with Multiprocessing

We'll parallelize image filtering by dividing the image into chunks and processing each chunk on a separate core.

### 4.1 Serial Implementation (Baseline)

In [None]:
def create_gaussian_kernel(size=5, sigma=1.0):
    """Create Gaussian kernel."""
    if size % 2 == 0:
        size += 1
    ax = np.arange(-size // 2 + 1, size // 2 + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return kernel / np.sum(kernel)


def apply_filter_serial(image, kernel):
    """Serial convolution implementation."""
    height, width = image.shape
    k_height, k_width = kernel.shape
    
    pad_h = k_height // 2
    pad_w = k_width // 2
    
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)), mode='edge')
    output = np.zeros_like(image, dtype=np.float32)
    
    for i in range(height):
        for j in range(width):
            region = padded[i:i+k_height, j:j+k_width]
            output[i, j] = np.sum(region * kernel)
    
    return np.clip(output, 0, 255).astype(np.uint8)


# Create test image and kernel
test_image = create_test_image((512, 512))
gaussian_kernel = create_gaussian_kernel(size=5, sigma=1.5)

print(f"Test image shape: {test_image.shape}")
print(f"Kernel shape: {gaussian_kernel.shape}")

### 4.2 Parallel Implementation with Multiprocessing

**Strategy**: Divide image into horizontal strips, process each strip in parallel.

In [None]:
def process_image_chunk(args):
    """
    Process a chunk of the image.
    
    This function will be called by each worker process.
    
    Args:
        args: Tuple of (image_chunk, kernel, start_row)
    
    Returns:
        tuple: (filtered_chunk, start_row)
    """
    image_chunk, kernel, start_row = args
    
    height, width = image_chunk.shape
    k_height, k_width = kernel.shape
    
    pad_h = k_height // 2
    pad_w = k_width // 2
    
    # Pad the chunk
    padded = np.pad(image_chunk, ((pad_h, pad_h), (pad_w, pad_w)), mode='edge')
    output = np.zeros_like(image_chunk, dtype=np.float32)
    
    # Apply filter
    for i in range(height):
        for j in range(width):
            region = padded[i:i+k_height, j:j+k_width]
            output[i, j] = np.sum(region * kernel)
    
    output = np.clip(output, 0, 255).astype(np.uint8)
    
    return output, start_row


def apply_filter_parallel(image, kernel, num_processes=None):
    """
    Apply filter using multiprocessing.
    
    Args:
        image: Input image
        kernel: Convolution kernel
        num_processes: Number of processes (default: cpu_count)
    
    Returns:
        numpy.ndarray: Filtered image
    """
    if num_processes is None:
        num_processes = cpu_count()
    
    height, width = image.shape
    
    # Divide image into chunks (horizontal strips)
    chunk_size = height // num_processes
    
    # Prepare chunks with overlap for boundary handling
    k_height = kernel.shape[0]
    overlap = k_height // 2
    
    chunks = []
    bounds = []  # (start, end, chunk_start), used to trim the overlap later
    for i in range(num_processes):
        start = i * chunk_size
        end = start + chunk_size if i < num_processes - 1 else height
        
        # Add overlap at boundaries
        chunk_start = max(0, start - overlap)
        chunk_end = min(height, end + overlap)
        
        chunk = image[chunk_start:chunk_end, :]
        chunks.append((chunk, kernel, start))
        bounds.append((start, end, chunk_start))
    
    # Process chunks in parallel
    with Pool(processes=num_processes) as pool:
        results = pool.map(process_image_chunk, chunks)
    
    # Combine results, discarding the overlap rows of each filtered chunk
    output = np.zeros_like(image, dtype=np.uint8)
    
    # pool.map preserves input order, so results line up with bounds
    for (result_chunk, _), (start, end, chunk_start) in zip(results, bounds):
        top = start - chunk_start  # number of overlap rows above the strip
        output[start:end, :] = result_chunk[top:top + (end - start), :]
    
    return output


print("Parallel image processing functions defined!")
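
The strip-with-overlap bookkeeping is easier to follow on a tiny case. The sketch below is an illustration rather than the notebook's code: the helper name `strip_bounds` is hypothetical, and it spreads leftover rows over the first strips instead of giving them all to the last one. For each strip it reports the rows the strip owns and the wider range it needs to read so that the kernel sees real neighbours.

```python
def strip_bounds(height, num_chunks, overlap):
    """Return (start, end, read_start, read_end) for each horizontal strip."""
    base, extra = divmod(height, num_chunks)
    bounds = []
    start = 0
    for i in range(num_chunks):
        end = start + base + (1 if i < extra else 0)
        read_start = max(0, start - overlap)   # extra rows above, clamped to the image
        read_end = min(height, end + overlap)  # extra rows below, clamped to the image
        bounds.append((start, end, read_start, read_end))
        start = end
    return bounds


bounds = strip_bounds(height=10, num_chunks=3, overlap=2)
for start, end, read_start, read_end in bounds:
    print(f"owns rows [{start}, {end}), reads rows [{read_start}, {read_end})")
```

After filtering, only the owned rows are written back; the first `start - read_start` rows of each filtered strip belong to the neighbour above and are dropped.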

### 4.3 Benchmark and Compare

In [None]:
print("="*70)
print("IMAGE PROCESSING BENCHMARK: Serial vs Parallel")
print("="*70)

# Test serial
print("\nSerial implementation...")
t_serial, std_serial, result_serial = time_function(
    apply_filter_serial, test_image, gaussian_kernel, num_runs=3
)
print(f"  Time: {t_serial:.4f} ± {std_serial:.4f} seconds")

# Test parallel with different number of processes
parallel_results = []

for n_proc in [1, 2, 4, num_cores]:
    if n_proc > num_cores:
        continue
    
    print(f"\nParallel with {n_proc} processes...")
    t_parallel, std_parallel, result_parallel = time_function(
        apply_filter_parallel, test_image, gaussian_kernel, 
        num_processes=n_proc, num_runs=3
    )
    print(f"  Time: {t_parallel:.4f} ± {std_parallel:.4f} seconds")
    
    speedup, _ = calculate_speedup(t_serial, t_parallel)
    efficiency = speedup / n_proc  # relative to the processes actually used
    print(f"  Speedup: {speedup:.2f}x")
    print(f"  Efficiency: {efficiency*100:.1f}%")
    
    parallel_results.append({
        'Processes': n_proc,
        'Time (s)': t_parallel,
        'Speedup': speedup,
        'Efficiency (%)': efficiency * 100
    })

# Verify results are identical
are_equal = np.allclose(result_serial, result_parallel, rtol=1e-5)
print(f"\nResults match: {are_equal}")

# Display table
df_image = pd.DataFrame(parallel_results)
print("\n" + "="*70)
print("SUMMARY TABLE")
print("="*70)
print(df_image.to_string(index=False))

In [None]:
# Visualize speedup
plt.figure(figsize=(14, 5))

# Plot 1: Speedup curve
plt.subplot(131)
processes = df_image['Processes'].values
speedups = df_image['Speedup'].values
plt.plot(processes, speedups, 'o-', linewidth=2, markersize=8, label='Actual')
plt.plot(processes, processes, '--', linewidth=1, color='gray', label='Ideal (linear)')
plt.xlabel('Number of Processes')
plt.ylabel('Speedup')
plt.title('Speedup vs Number of Processes')
plt.grid(True, alpha=0.3)
plt.legend()

# Plot 2: Efficiency
plt.subplot(132)
efficiency = df_image['Efficiency (%)'].values
plt.plot(processes, efficiency, 'o-', linewidth=2, markersize=8, color='green')
plt.axhline(y=100, linestyle='--', color='gray', linewidth=1, label='100% (ideal)')
plt.xlabel('Number of Processes')
plt.ylabel('Efficiency (%)')
plt.title('Parallel Efficiency')
plt.grid(True, alpha=0.3)
plt.legend()

# Plot 3: Execution time
plt.subplot(133)
times = df_image['Time (s)'].values
plt.plot(processes, times, 'o-', linewidth=2, markersize=8, color='red')
plt.axhline(y=t_serial, linestyle='--', color='gray', linewidth=1, label='Serial time')
plt.xlabel('Number of Processes')
plt.ylabel('Execution Time (seconds)')
plt.title('Execution Time vs Processes')
plt.grid(True, alpha=0.3)
plt.legend()

plt.tight_layout()
plt.show()

## 5. Numba Parallel Implementation

Numba can use real threads (not processes) and has lower overhead. Let's compare!
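
Before the full filter, here is a minimal sketch of a `prange` loop with a reduction, the Numba counterpart of `#pragma omp parallel for reduction(+:total)`. The function name `sum_of_squares` is made up for this example, and the `ImportError` fallback is an assumption added so that the sketch still runs (serially) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs (serially) without Numba
    def njit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator
    prange = range


@njit(parallel=True)
def sum_of_squares(values):
    total = 0.0
    # Numba treats `total +=` inside a prange loop as a reduction:
    # each thread accumulates privately and the partial sums are combined.
    for i in prange(values.shape[0]):
        total += values[i] * values[i]
    return total


data = np.arange(1000, dtype=np.float64)
print(sum_of_squares(data))
```

The first call includes compilation time, so benchmarks should always warm the function up first.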

In [None]:
if NUMBA_AVAILABLE:
    @jit(nopython=True, parallel=True)
    def apply_filter_numba(image, kernel):
        """
        Apply filter using Numba parallel JIT compilation.
        
        The @jit decorator compiles this to machine code.
        parallel=True enables automatic parallelization.
        prange() creates parallel loops.
        """
        height, width = image.shape
        k_height, k_width = kernel.shape
        
        pad_h = k_height // 2
        pad_w = k_width // 2
        
        # Manual padding (Numba doesn't support np.pad)
        padded = np.zeros((height + 2*pad_h, width + 2*pad_w), dtype=image.dtype)
        padded[pad_h:pad_h+height, pad_w:pad_w+width] = image
        
        # Edge replication for borders
        for i in range(pad_h):
            padded[i, pad_w:pad_w+width] = image[0, :]
            padded[height+pad_h+i, pad_w:pad_w+width] = image[height-1, :]
        for j in range(pad_w):
            padded[:, j] = padded[:, pad_w]
            padded[:, width+pad_w+j] = padded[:, width+pad_w-1]
        
        output = np.zeros((height, width), dtype=np.float32)
        
        # PARALLEL LOOP - prange() distributes iterations across threads
        for i in prange(height):
            for j in range(width):
                # Apply kernel
                value = 0.0
                for ki in range(k_height):
                    for kj in range(k_width):
                        value += padded[i+ki, j+kj] * kernel[ki, kj]
                output[i, j] = value
        
        # Clip and convert
        result = np.empty((height, width), dtype=np.uint8)
        for i in prange(height):
            for j in range(width):
                val = output[i, j]
                if val < 0:
                    result[i, j] = 0
                elif val > 255:
                    result[i, j] = 255
                else:
                    result[i, j] = np.uint8(val)
        
        return result
    
    print("Numba parallel functions compiled!")
    
    # Warm-up (first run compiles the function)
    print("\nWarming up Numba (first run compiles)...")
    _ = apply_filter_numba(test_image[:100, :100], gaussian_kernel)
    print("Warm-up complete!")
else:
    print("Numba not available - skipping Numba benchmarks")

In [None]:
if NUMBA_AVAILABLE:
    print("\n" + "="*70)
    print("NUMBA PARALLEL BENCHMARK")
    print("="*70)
    
    numba_results = []
    
    for n_threads in [1, 2, 4, num_cores]:
        if n_threads > num_cores:
            continue
        
        set_num_threads(n_threads)
        
        print(f"\nNumba with {n_threads} threads...")
        t_numba, std_numba, result_numba = time_function(
            apply_filter_numba, test_image, gaussian_kernel, num_runs=3
        )
        print(f"  Time: {t_numba:.4f} ± {std_numba:.4f} seconds")
        
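        # Baseline is the pure-Python serial loop, so this speedup combines the
        # JIT gain with the threading gain and efficiency can exceed 100%
        speedup, _ = calculate_speedup(t_serial, t_numba)
        efficiency = speedup / n_threads  # relative to the threads actually used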
        print(f"  Speedup vs serial: {speedup:.2f}x")
        print(f"  Efficiency: {efficiency*100:.1f}%")
        
        numba_results.append({
            'Threads': n_threads,
            'Time (s)': t_numba,
            'Speedup': speedup,
            'Efficiency (%)': efficiency * 100
        })
    
    # Display comparison
    df_numba = pd.DataFrame(numba_results)
    print("\n" + "="*70)
    print("NUMBA SUMMARY")
    print("="*70)
    print(df_numba.to_string(index=False))
    
    # Compare multiprocessing vs Numba
    print("\n" + "="*70)
    print("COMPARISON: Multiprocessing vs Numba (max cores)")
    print("="*70)
    mp_time = df_image[df_image['Processes'] == num_cores]['Time (s)'].values[0]
    numba_time = df_numba[df_numba['Threads'] == num_cores]['Time (s)'].values[0]
    print(f"Multiprocessing ({num_cores} processes): {mp_time:.4f} s")
    print(f"Numba ({num_cores} threads):            {numba_time:.4f} s")
    print(f"Numba advantage:                         {mp_time/numba_time:.2f}x faster")

## 6. Matrix Multiplication Parallelization

In [None]:
def matrix_multiply_serial(A, B):
    """Serial matrix multiplication."""
    m, n = A.shape
    n2, p = B.shape
    assert n == n2
    
    C = np.zeros((m, p), dtype=np.float64)
    
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    
    return C


def multiply_row_range(args):
    """
    Multiply a range of rows from matrix A with entire matrix B.
    
    This parallelizes the outer loop of matrix multiplication.
    """
    A_chunk, B, start_row = args
    
    m_chunk, n = A_chunk.shape
    n2, p = B.shape
    
    C_chunk = np.zeros((m_chunk, p), dtype=np.float64)
    
    for i in range(m_chunk):
        for j in range(p):
            for k in range(n):
                C_chunk[i, j] += A_chunk[i, k] * B[k, j]
    
    return C_chunk, start_row


def matrix_multiply_parallel(A, B, num_processes=None):
    """
    Parallel matrix multiplication using multiprocessing.
    
    Divides matrix A into row chunks and processes in parallel.
    """
    if num_processes is None:
        num_processes = cpu_count()
    
    m, n = A.shape
    n2, p = B.shape
    assert n == n2
    
    # Divide rows among processes
    chunk_size = m // num_processes
    
    chunks = []
    for i in range(num_processes):
        start = i * chunk_size
        end = start + chunk_size if i < num_processes - 1 else m
        chunks.append((A[start:end, :], B, start))
    
    # Process in parallel
    with Pool(processes=num_processes) as pool:
        results = pool.map(multiply_row_range, chunks)
    
    # Combine results
    C = np.zeros((m, p), dtype=np.float64)
    for C_chunk, start_row in results:
        chunk_rows = C_chunk.shape[0]
        C[start_row:start_row+chunk_rows, :] = C_chunk
    
    return C


# Create test matrices
test_matrix_a = np.random.rand(300, 300)
test_matrix_b = np.random.rand(300, 300)

print(f"Test matrices: {test_matrix_a.shape} × {test_matrix_b.shape}")
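
The row-chunk decomposition works because each block of rows of `A` produces exactly the matching rows of `C`, with no dependence on the other blocks. The sketch below checks that identity on small matrices using NumPy's `@` operator in place of the triple loop; the sizes and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((7, 5))
B = rng.random((5, 3))

# Split A into row blocks; each block times B gives the matching rows of C
row_blocks = np.array_split(A, 3, axis=0)
C_blocks = [block @ B for block in row_blocks]   # independent, so parallelizable
C = np.vstack(C_blocks)

print(f"Block heights: {[block.shape[0] for block in row_blocks]}")
print(f"Matches full product: {np.allclose(C, A @ B)}")
```

Note that every task still needs all of `B`, so with multiprocessing the whole matrix is copied to each worker.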

In [None]:
print("="*70)
print("MATRIX MULTIPLICATION BENCHMARK")
print("="*70)

# Serial baseline
print("\nSerial implementation...")
t_serial_mat, _, result_serial_mat = time_function(
    matrix_multiply_serial, test_matrix_a, test_matrix_b, num_runs=3
)
print(f"  Time: {t_serial_mat:.4f} seconds")

# Parallel versions
matrix_parallel_results = []

for n_proc in [1, 2, 4, num_cores]:
    if n_proc > num_cores:
        continue
    
    print(f"\nParallel with {n_proc} processes...")
    t_parallel_mat, _, result_parallel_mat = time_function(
        matrix_multiply_parallel, test_matrix_a, test_matrix_b,
        num_processes=n_proc, num_runs=3
    )
    print(f"  Time: {t_parallel_mat:.4f} seconds")
    
    speedup, _ = calculate_speedup(t_serial_mat, t_parallel_mat)
    efficiency = speedup / n_proc  # relative to the processes actually used
    print(f"  Speedup: {speedup:.2f}x")
    print(f"  Efficiency: {efficiency*100:.1f}%")
    
    matrix_parallel_results.append({
        'Processes': n_proc,
        'Time (s)': t_parallel_mat,
        'Speedup': speedup,
        'Efficiency (%)': efficiency * 100
    })

# Verify correctness
matches = np.allclose(result_serial_mat, result_parallel_mat, rtol=1e-10)
print(f"\nResults match: {matches}")

df_matrix_par = pd.DataFrame(matrix_parallel_results)
print("\n" + "="*70)
print("SUMMARY")
print("="*70)
print(df_matrix_par.to_string(index=False))

## 7. Monte Carlo Parallelization

In [None]:
def estimate_pi_serial(num_samples):
    """Serial Monte Carlo π estimation."""
    inside_circle = 0
    
    for _ in range(num_samples):
        x = np.random.random()
        y = np.random.random()
        
        if x*x + y*y <= 1.0:
            inside_circle += 1
    
    return 4.0 * inside_circle / num_samples


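def estimate_pi_chunk(args):
    """
    Estimate π for a chunk of samples.
    
    Each worker process runs this independently.
    
    Args:
        args: Tuple of (num_samples, seed), where seed is a SeedSequence
    """
    num_samples, seed = args
    # Use a per-task generator: forked workers would otherwise inherit the
    # same global random state and draw identical samples
    rng = np.random.default_rng(seed)
    x = rng.random(num_samples)
    y = rng.random(num_samples)
    inside = int(np.count_nonzero(x*x + y*y <= 1.0))
    return inside


def estimate_pi_parallel(num_samples, num_processes=None):
    """
    Parallel Monte Carlo π estimation.
    
    Distributes samples across processes, then combines results.
    """
    if num_processes is None:
        num_processes = cpu_count()
    
    # Divide samples among processes, spreading any remainder
    base, extra = divmod(num_samples, num_processes)
    counts = [base + (1 if i < extra else 0) for i in range(num_processes)]
    
    # Independent seeds, one per task
    seeds = np.random.SeedSequence(42).spawn(num_processes)
    
    # Process in parallel
    with Pool(processes=num_processes) as pool:
        results = pool.map(estimate_pi_chunk, list(zip(counts, seeds)))
    
    # Combine results (reduction)
    total_inside = sum(results)
    pi_estimate = 4.0 * total_inside / num_samples
    
    return pi_estimate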


print("Monte Carlo parallel functions ready!")
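
Random number streams deserve care in parallel Monte Carlo. With the fork start method, workers inherit a copy of the parent's global NumPy random state, so unseeded workers can draw the same numbers. The sketch below shows the effect in a single process and one common remedy, `SeedSequence.spawn`; the seed value 42 is an arbitrary choice.

```python
import numpy as np

# Two generators with the same seed produce identical samples, which is
# effectively what workers do when they all start from one inherited state.
same_a = np.random.default_rng(42).random(5)
same_b = np.random.default_rng(42).random(5)
print(f"Same seed, same samples: {np.array_equal(same_a, same_b)}")

# SeedSequence.spawn derives independent child seeds, one per worker
child_seeds = np.random.SeedSequence(42).spawn(4)
streams = [np.random.default_rng(seed).random(5) for seed in child_seeds]
all_differ = all(not np.array_equal(streams[0], s) for s in streams[1:])
print(f"Spawned seeds give different samples: {all_differ}")
```

Duplicated streams do not make the estimate wrong on average, but they reduce the effective sample count, so the error stops shrinking as workers are added.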

In [None]:
print("="*70)
print("MONTE CARLO π ESTIMATION BENCHMARK")
print("="*70)

num_samples = 10_000_000

# Serial
print(f"\nSerial ({num_samples:,} samples)...")
t_serial_pi, _, pi_serial = time_function(
    estimate_pi_serial, num_samples, num_runs=3
)
print(f"  Time: {t_serial_pi:.4f} seconds")
print(f"  π estimate: {pi_serial:.6f}")
print(f"  Error: {abs(pi_serial - np.pi):.6f}")

# Parallel
monte_carlo_results = []

for n_proc in [1, 2, 4, num_cores]:
    if n_proc > num_cores:
        continue
    
    print(f"\nParallel with {n_proc} processes...")
    t_parallel_pi, _, pi_parallel = time_function(
        estimate_pi_parallel, num_samples, num_processes=n_proc, num_runs=3
    )
    print(f"  Time: {t_parallel_pi:.4f} seconds")
    print(f"  π estimate: {pi_parallel:.6f}")
    
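    # Note: the parallel chunks are vectorized with NumPy while the serial
    # baseline is a Python loop, so this speedup mixes both effects
    speedup, _ = calculate_speedup(t_serial_pi, t_parallel_pi)
    efficiency = speedup / n_proc  # relative to the processes actually used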
    print(f"  Speedup: {speedup:.2f}x")
    print(f"  Efficiency: {efficiency*100:.1f}%")
    
    monte_carlo_results.append({
        'Processes': n_proc,
        'Time (s)': t_parallel_pi,
        'Speedup': speedup,
        'Efficiency (%)': efficiency * 100,
        'π Estimate': pi_parallel
    })

df_monte = pd.DataFrame(monte_carlo_results)
print("\n" + "="*70)
print("SUMMARY")
print("="*70)
print(df_monte.to_string(index=False))

## 8. Thread Safety and Race Conditions

Let's demonstrate common pitfalls in parallel programming.

In [None]:
# Example: Race condition (INCORRECT CODE - for demonstration)
def increment_counter_unsafe(counter, num_increments):
    """
    UNSAFE: Multiple processes modifying shared counter.
    `counter.value += 1` is a separate read and write, so updates can be
    lost even though mp.Value carries its own lock.
    """
    for _ in range(num_increments):
        counter.value += 1


def increment_counter_safe(lock, counter, num_increments):
    """
    SAFE: Uses lock to prevent race conditions.
    """
    for _ in range(num_increments):
        with lock:
            counter.value += 1


print("Demonstrating race condition...\n")

# Create shared counter
counter_unsafe = mp.Value('i', 0)  # 'i' = integer
counter_safe = mp.Value('i', 0)
lock = mp.Lock()

num_processes = 4
increments_per_process = 10000
expected_value = num_processes * increments_per_process

# Test unsafe version
print("Unsafe version (race condition):")
processes = []
for _ in range(num_processes):
    p = mp.Process(target=increment_counter_unsafe, 
                   args=(counter_unsafe, increments_per_process))
    processes.append(p)
    p.start()

for p in processes:
    p.join()

print(f"  Expected: {expected_value}")
print(f"  Actual:   {counter_unsafe.value}")
print(f"  Lost updates: {expected_value - counter_unsafe.value}")

# Test safe version
print("\nSafe version (with lock):")
processes = []
for _ in range(num_processes):
    p = mp.Process(target=increment_counter_safe, 
                   args=(lock, counter_safe, increments_per_process))
    processes.append(p)
    p.start()

for p in processes:
    p.join()

print(f"  Expected: {expected_value}")
print(f"  Actual:   {counter_safe.value}")
print(f"  Correct:  {counter_safe.value == expected_value}")

print("\nKey lesson: Always protect shared state with locks!")
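
The same read-modify-write hazard and the same fix apply to threads. As a compact sketch that runs in one process, the hypothetical `SafeCounter` class below wraps the increment in a `threading.Lock`; it is an illustration of the pattern, not a replacement for the `multiprocessing` demo above.

```python
import threading


class SafeCounter:
    """Counter whose increments are serialized by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self, times):
        for _ in range(times):
            # Only one thread at a time may read, add, and write back
            with self._lock:
                self.value += 1


shared = SafeCounter()
threads = [threading.Thread(target=shared.increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Expected: {4 * 10_000}, actual: {shared.value}")
```

Taking the lock on every increment is correct but slow; where possible, let each worker count locally and combine the totals once at the end.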

## 9. Load Balancing Strategies

Different work distribution strategies affect performance.

In [None]:
def variable_workload(n):
    """
    Simulate variable computation time.
    Some tasks take longer than others.
    """
    # Sleep time proportional to n
    time.sleep(0.001 * n)
    return n * n


# Create tasks with varying workloads
tasks = list(range(1, 21))  # 1, 2, 3, ..., 20

print("Comparing load balancing strategies...\n")

# Strategy 1: Static partitioning (one contiguous chunk per worker)
# Note: pool.map's default chunksize is smaller than this, so it is set explicitly
print("Strategy 1: Static chunks (one chunk per worker)")
with Pool(processes=4) as pool:
    start = time.perf_counter()  # time only the map, not pool startup
    results = pool.map(variable_workload, tasks, chunksize=len(tasks) // 4)
    time_static = time.perf_counter() - start
print(f"  Time: {time_static:.3f} seconds")

# Strategy 2: Dynamic (chunksize=1)
print("\nStrategy 2: Dynamic (chunksize=1)")
with Pool(processes=4) as pool:
    start = time.perf_counter()
    results = pool.map(variable_workload, tasks, chunksize=1)
    time_dynamic = time.perf_counter() - start
print(f"  Time: {time_dynamic:.3f} seconds")

print(f"\nStatic / dynamic time ratio: {time_static/time_dynamic:.2f}x")
print("\nWhy? Dynamic scheduling prevents idle workers.")
print("Static: Some workers get long tasks, others finish early and wait.")
print("Dynamic: Workers grab new tasks as soon as they finish.")
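
The difference between the two strategies can also be predicted with a small model, independent of timing noise. The sketch below is an idealized simulation that ignores all scheduling and communication overhead; the function names `static_makespan` and `dynamic_makespan` are hypothetical. It computes when the last worker finishes under each strategy for task durations 1 to 20 in abstract time units.

```python
import heapq


def static_makespan(durations, num_workers):
    """Finish time when each worker gets one contiguous block of tasks."""
    block = -(-len(durations) // num_workers)  # ceiling division
    loads = [sum(durations[i:i + block]) for i in range(0, len(durations), block)]
    return max(loads)


def dynamic_makespan(durations, num_workers):
    """Finish time when each idle worker takes the next task in order."""
    free_at = [0] * num_workers  # min-heap of times at which workers become free
    for duration in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    return max(free_at)


durations = list(range(1, 21))
static_time = static_makespan(durations, 4)
dynamic_time = dynamic_makespan(durations, 4)
print(f"Static: {static_time} units, dynamic: {dynamic_time} units")
```

In this model the static split finishes at 90 units, because the last worker holds tasks 16 to 20, while dynamic assignment finishes at 60. Real runs add per-task overhead, which is why `chunksize=1` is not always the best choice for many tiny tasks.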

## 10. Summary and Best Practices

In [None]:
print("="*70)
print("PARALLEL PROGRAMMING BEST PRACTICES")
print("="*70)
print("""
1. CHOOSE THE RIGHT TOOL:
   - Multiprocessing: CPU-bound tasks, large data
   - Numba: Numerical code, lower overhead
   - Threading: I/O-bound tasks only (GIL limitation)

2. MINIMIZE COMMUNICATION:
   - Pass data to workers once at start
   - Avoid frequent synchronization
   - Larger chunks = less overhead

3. AVOID SHARED STATE:
   - Prefer independent tasks
   - If sharing needed, use locks/semaphores
   - Reduction at end instead of continuous updates

4. LOAD BALANCING:
   - Equal workload? Use static partitioning
   - Variable workload? Use dynamic scheduling
   - Monitor worker utilization

5. MEASURE EVERYTHING:
   - Always compare to serial baseline
   - Test with different core counts
   - Calculate speedup and efficiency
   - Profile to find bottlenecks

6. AMDAHL'S LAW:
   - Speedup limited by serial portion
   - 10% serial code → max 10x speedup
   - Focus on parallelizing the bottleneck

7. COMMON PITFALLS:
   - Race conditions (use locks)
   - False sharing (pad shared data)
   - Too many processes (overhead dominates)
   - Not enough work per task (overhead dominates)
""")

print("\n" + "="*70)
print("PERFORMANCE SUMMARY (from this module)")
print("="*70)
print(f"\nImage Processing ({test_image.shape[0]}x{test_image.shape[1]}):")
print(f"  Best speedup: {df_image['Speedup'].max():.2f}x with {df_image.loc[df_image['Speedup'].idxmax(), 'Processes']:.0f} processes")

print(f"\nMatrix Multiplication ({test_matrix_a.shape[0]}x{test_matrix_a.shape[1]}):")
print(f"  Best speedup: {df_matrix_par['Speedup'].max():.2f}x with {df_matrix_par.loc[df_matrix_par['Speedup'].idxmax(), 'Processes']:.0f} processes")

print(f"\nMonte Carlo ({num_samples:,} samples):")
print(f"  Best speedup: {df_monte['Speedup'].max():.2f}x with {df_monte.loc[df_monte['Speedup'].idxmax(), 'Processes']:.0f} processes")

print("\n" + "="*70)
print("NEXT STEPS")
print("="*70)
print("""
Module 07: GPU Acceleration with CUDA
  - Learn when GPU beats CPU
  - Implement CUDA kernels with Numba
  - Understand memory hierarchy
  - Achieve 10-100x speedups on suitable problems
""")

## 11. Exercises

### Exercise 1: Parallel Histogram Calculation

Implement parallel histogram computation for image analysis.

**Challenge**: How do you combine histograms from different chunks?

In [None]:
def compute_histogram_chunk(image_chunk):
    """
    Compute histogram for an image chunk.
    
    Returns:
        numpy.ndarray: Histogram (256 bins for 0-255)
    """
    # TODO: Implement histogram computation
    # Hint: Use np.bincount() or np.histogram()
    pass


def compute_histogram_parallel(image, num_processes=None):
    """
    Compute image histogram in parallel.
    
    Strategy:
    1. Divide image into chunks
    2. Compute histogram for each chunk
    3. Sum all histograms
    """
    # TODO: Implement parallel histogram
    pass

# Test your implementation
# hist = compute_histogram_parallel(test_image, num_processes=4)
# plt.bar(range(256), hist)
# plt.xlabel('Pixel Value')
# plt.ylabel('Frequency')
# plt.title('Image Histogram (Parallel)')
# plt.show()

### Exercise 2: Parallel Sorting (Merge Sort)

Implement parallel merge sort.

**Strategy**: Recursively divide array, sort sub-arrays in parallel, merge results.

In [None]:
def merge(left, right):
    """
    Merge two sorted arrays.
    """
    # TODO: Implement merge
    pass


def parallel_merge_sort(arr, num_processes=None):
    """
    Parallel merge sort.
    """
    # TODO: Implement parallel merge sort
    # Hint: Divide array, use Pool.map() to sort chunks, merge results
    pass

# Test
# test_arr = np.random.randint(0, 1000, size=10000)
# sorted_arr = parallel_merge_sort(test_arr, num_processes=4)
# print(f"Correctly sorted: {np.all(sorted_arr[:-1] <= sorted_arr[1:])}")

### Exercise 3: Scalability Analysis

Analyze strong vs weak scaling for one of the implemented algorithms.

**Strong scaling**: Fixed problem size, increase cores

**Weak scaling**: Problem size increases with cores (constant work per core)

In [None]:
# TODO: Implement scalability analysis
# 1. Choose an algorithm (e.g., Monte Carlo)
# 2. Strong scaling: Fix num_samples, vary num_processes
# 3. Weak scaling: num_samples proportional to num_processes
# 4. Plot both scaling curves

# Your code here...

## Summary

In this module, you learned:

1. **Parallel Programming Concepts**
   - OpenMP directives → Python equivalents
   - Processes vs threads in Python
   - Shared memory vs message passing

2. **Implementation Techniques**
   - Multiprocessing Pool for task parallelism
   - Numba parallel loops for data parallelism
   - Work distribution strategies

3. **Performance Optimization**
   - Measuring speedup and efficiency
   - Load balancing (static vs dynamic)
   - Minimizing communication overhead

4. **Common Pitfalls**
   - Race conditions and synchronization
   - Too much/too little granularity
   - Amdahl's law limitations
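
Amdahl's law, listed among the pitfalls above, is short enough to evaluate directly. The sketch below assumes the standard form, speedup = 1 / (s + (1 - s) / N) for serial fraction s and N workers; the function name `amdahl_speedup` is chosen for this example.

```python
def amdahl_speedup(serial_fraction, num_workers):
    """Upper bound on speedup for a program with the given serial fraction."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_workers)


# With 10% serial code the bound approaches, but never reaches, 10x
for n in (2, 4, 8, 16, 1024):
    print(f"{n:5d} workers: {amdahl_speedup(0.10, n):5.2f}x")
```

With a 10% serial portion, 4 workers give at most about 3.08x, and adding more workers yields diminishing returns toward the 10x ceiling.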

**What's Next?**

- **Module 07**: GPU acceleration with CUDA - 10-100x speedups
- **Module 08**: Comprehensive benchmarking and analysis
- **Module 09**: Advanced optimization techniques