# Methodology Design

**Objective**: Design your parallel computing solution in detail

**Weight in Proposal**: 30 marks (30% of Part A)

---

## What is the Methodology Section?

The methodology explains **HOW** you will solve the problem. It should be detailed enough that someone else could replicate your work.

In this section, you need to:

1. **System Architecture**: Overall design of your solution
2. **Algorithm Selection**: Which algorithms and why
3. **Parallel Design Pattern**: How you'll parallelize (SPMD, loop, task)
4. **Platform Justification**: Why OpenMP/CUDA/MPI
5. **Data Flow**: How data moves through the system
6. **Performance Prediction**: Expected speedup and bottlenecks

---

## Learning Objectives

By the end of this notebook, you will be able to:
1. Design a system architecture for parallel computing
2. Select appropriate algorithms and justify choices
3. Choose the right parallel design pattern
4. Create data flow diagrams
5. Write pseudocode for parallel algorithms
6. Predict performance and identify bottlenecks

---

## Expected Output

Your methodology should:
- Be **2-3 pages** in the proposal
- Include **diagrams** (architecture, data flow)
- Contain **pseudocode** for key algorithms
- **Justify all decisions** (not just describe)
- Show **technical depth** (not superficial)

---

## Step 1: System Architecture Design

### Components of a Parallel System

1. **Input Module**: Data loading and preprocessing
2. **Serial Baseline**: Non-parallel version for comparison
3. **Parallel Implementation(s)**: OpenMP, CUDA, MPI versions
4. **Performance Monitoring**: Timing, profiling, metrics
5. **Output Module**: Results visualization and validation
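
A minimal sketch of how these components might be wired together in Python. Every name here (`load_input`, `serial_version`, `parallel_stub`, `run_pipeline`) and the toy workload are hypothetical placeholders rather than part of the assignment; the point is only that each implementation shares one interface, so it can be timed and validated against the serial baseline in the same way.

```python
import time
import numpy as np

def load_input(size=1000, seed=0):
    """Input module: a synthetic array stands in for real data (assumption)."""
    rng = np.random.default_rng(seed)
    return rng.random(size)

def serial_version(data):
    """Serial baseline: explicit element-by-element loop."""
    out = np.empty_like(data)
    for i in range(len(data)):
        out[i] = data[i] * 2.0 + 1.0
    return out

def parallel_stub(data):
    """Placeholder where an OpenMP/CUDA/MPI-backed version would be called."""
    return data * 2.0 + 1.0

def run_pipeline(implementations, data, baseline="serial"):
    """Run each implementation once, time it, and compare it to the baseline."""
    results, report = {}, {}
    for name, fn in implementations.items():
        start = time.perf_counter()
        results[name] = fn(data)
        report[name] = {"seconds": time.perf_counter() - start}
    for name in implementations:
        report[name]["matches_baseline"] = bool(
            np.allclose(results[name], results[baseline])
        )
    return report

report = run_pipeline(
    {"serial": serial_version, "parallel_stub": parallel_stub}, load_input()
)
for name, row in report.items():
    print(f"{name:14s} {row['seconds']:.6f} s  valid={row['matches_baseline']}")
```

Keeping the baseline inside the same harness makes it harder to report a speedup for an implementation that silently produces different output.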

### Architecture Diagram

In [None]:
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch  # rounded boxes for the diagram

# Create system architecture diagram
def draw_architecture_diagram():
    """
    Draw a system architecture diagram for parallel processing
    """
    fig, ax = plt.subplots(figsize=(14, 10))
    ax.set_xlim(0, 14)
    ax.set_ylim(0, 10)
    ax.axis('off')
    
    # Title
    ax.text(7, 9.5, 'System Architecture: Parallel Image Processing', 
            fontsize=16, fontweight='bold', ha='center')
    
    # Input Layer
    input_box = FancyBboxPatch((5.5, 8), 3, 0.8, 
                               boxstyle="round,pad=0.1", 
                               edgecolor='black', facecolor='lightblue', linewidth=2)
    ax.add_patch(input_box)
    ax.text(7, 8.4, 'Input Module\n(Load Image, Validate)', 
            ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Preprocessing
    prep_box = FancyBboxPatch((5.5, 6.5), 3, 0.8,
                              boxstyle="round,pad=0.1",
                              edgecolor='black', facecolor='lightyellow', linewidth=2)
    ax.add_patch(prep_box)
    ax.text(7, 6.9, 'Preprocessing\n(Convert, Normalize)', 
            ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Serial Baseline
    serial_box = FancyBboxPatch((1, 4.5), 2.5, 1.2,
                                boxstyle="round,pad=0.1",
                                edgecolor='black', facecolor='lightcoral', linewidth=2)
    ax.add_patch(serial_box)
    ax.text(2.25, 5.1, 'Serial Version\n(Baseline for\nComparison)', 
            ha='center', va='center', fontsize=9, fontweight='bold')
    
    # OpenMP Implementation
    openmp_box = FancyBboxPatch((4.5, 4.5), 2.5, 1.2,
                                boxstyle="round,pad=0.1",
                                edgecolor='black', facecolor='lightgreen', linewidth=2)
    ax.add_patch(openmp_box)
    ax.text(5.75, 5.1, 'OpenMP\n(CPU Multi-core)', 
            ha='center', va='center', fontsize=9, fontweight='bold')
    
    # CUDA Implementation
    cuda_box = FancyBboxPatch((8, 4.5), 2.5, 1.2,
                              boxstyle="round,pad=0.1",
                              edgecolor='black', facecolor='lightgreen', linewidth=2)
    ax.add_patch(cuda_box)
    ax.text(9.25, 5.1, 'CUDA/Numba\n(GPU)', 
            ha='center', va='center', fontsize=9, fontweight='bold')
    
    # MPI Implementation (optional)
    mpi_box = FancyBboxPatch((11, 4.5), 2.5, 1.2,
                             boxstyle="round,pad=0.1",
                             edgecolor='gray', facecolor='lightgray', linewidth=2, linestyle='--')
    ax.add_patch(mpi_box)
    ax.text(12.25, 5.1, 'MPI\n(Optional)', 
            ha='center', va='center', fontsize=9, fontweight='bold', color='gray')
    
    # Performance Monitor
    perf_box = FancyBboxPatch((5.5, 2.8), 3, 0.8,
                              boxstyle="round,pad=0.1",
                              edgecolor='black', facecolor='orange', linewidth=2)
    ax.add_patch(perf_box)
    ax.text(7, 3.2, 'Performance Monitor\n(Timing, Profiling)', 
            ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Validation
    valid_box = FancyBboxPatch((5.5, 1.5), 3, 0.8,
                               boxstyle="round,pad=0.1",
                               edgecolor='black', facecolor='lightblue', linewidth=2)
    ax.add_patch(valid_box)
    ax.text(7, 1.9, 'Validation\n(Compare Results)', 
            ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Output
    output_box = FancyBboxPatch((5.5, 0.2), 3, 0.8,
                                boxstyle="round,pad=0.1",
                                edgecolor='black', facecolor='lightblue', linewidth=2)
    ax.add_patch(output_box)
    ax.text(7, 0.6, 'Output Module\n(Visualize, Export)', 
            ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Arrows
    arrow_props = dict(arrowstyle='->', lw=2, color='black')
    
    # Input -> Preprocessing
    ax.annotate('', xy=(7, 6.5), xytext=(7, 8), arrowprops=arrow_props)
    
    # Preprocessing -> Implementations
    ax.annotate('', xy=(2.25, 5.7), xytext=(6.5, 6.5), arrowprops=arrow_props)
    ax.annotate('', xy=(5.75, 5.7), xytext=(7, 6.5), arrowprops=arrow_props)
    ax.annotate('', xy=(9.25, 5.7), xytext=(7.5, 6.5), arrowprops=arrow_props)
    
    # Implementations -> Performance
    ax.annotate('', xy=(6.5, 3.6), xytext=(2.25, 4.5), arrowprops=arrow_props)
    ax.annotate('', xy=(7, 3.6), xytext=(5.75, 4.5), arrowprops=arrow_props)
    ax.annotate('', xy=(7.5, 3.6), xytext=(9.25, 4.5), arrowprops=arrow_props)
    
    # Performance -> Validation -> Output
    ax.annotate('', xy=(7, 2.3), xytext=(7, 2.8), arrowprops=arrow_props)
    ax.annotate('', xy=(7, 1.0), xytext=(7, 1.5), arrowprops=arrow_props)
    
    plt.tight_layout()
    plt.show()

draw_architecture_diagram()

print("""
📐 System Architecture Diagram

Key Components:
1. Input Module: Loads and validates input data
2. Preprocessing: Prepares data for processing
3. Serial Version: Baseline for performance comparison
4. Parallel Implementations: OpenMP (CPU) and CUDA (GPU)
5. Performance Monitor: Measures execution time and resource usage
6. Validation: Ensures all implementations produce correct results
7. Output Module: Visualizes and exports results

💡 Include a similar diagram in your proposal!
""")

In [None]:
# YOUR TURN: Describe YOUR system architecture

architecture_description = """
System Architecture: [Your Project Title]

1. Input Module:
   - [What data will you load?]
   - [What validation checks?]
   - [What format conversions?]

2. Preprocessing:
   - [What preprocessing steps?]
   - [Why are they needed?]

3. Serial Baseline:
   - [Describe serial algorithm]
   - [Why this specific algorithm?]

4. Parallel Implementation(s):
   - [OpenMP version: How will you parallelize?]
   - [CUDA version: What will run on GPU?]
   - [Other platforms?]

5. Performance Monitoring:
   - [What metrics will you measure?]
   - [How will you profile?]

6. Validation:
   - [How will you verify correctness?]
   - [What error tolerance?]

7. Output:
   - [What results will you produce?]
   - [How will you visualize performance?]
"""

print(architecture_description)

# Save to file
# with open('architecture_description.txt', 'w') as f:
#     f.write(architecture_description)
# print("\n💾 Saved to: architecture_description.txt")

---

## Step 2: Parallel Design Patterns

Choose the appropriate pattern(s) for your problem:

### 1. SPMD (Single Program, Multiple Data)
- **Description**: Same code runs on all processors, different data
- **Best for**: Data-parallel problems (image processing, matrix operations)
- **Example**: Each thread processes different rows of an image

### 2. Loop Parallelism
- **Description**: Parallelize independent loop iterations
- **Best for**: Simple for-loops with no dependencies
- **Example**: `#pragma omp parallel for` for independent iterations

### 3. Task Parallelism
- **Description**: Different tasks run concurrently
- **Best for**: Pipeline processing, different operations
- **Example**: Apply different filters simultaneously

### 4. Divide and Conquer
- **Description**: Split problem recursively, solve in parallel
- **Best for**: Merge sort, quicksort, tree algorithms
- **Example**: Parallel merge sort

### 5. Master-Worker
- **Description**: Master distributes tasks to workers
- **Best for**: Load balancing, dynamic work distribution
- **Example**: Monte Carlo simulations
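
To make the master-worker pattern concrete, here is a small sketch using Python's standard `concurrent.futures` module to estimate pi by Monte Carlo sampling. The function names and task sizes are illustrative assumptions. Note that with standard CPython threads the global interpreter lock prevents a real speedup for this pure-Python loop; the sketch shows the structure of the pattern (a master queues tasks, idle workers pull the next one), not its performance.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def count_hits(task):
    """Worker: count random points that fall inside the unit quarter-circle."""
    seed, n_samples = task
    rng = random.Random(seed)  # per-task generator keeps results reproducible
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(n_tasks=16, samples_per_task=20_000, n_workers=4):
    """Master: queue the tasks, let free workers pick them up, combine counts."""
    tasks = [(seed, samples_per_task) for seed in range(n_tasks)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        total_hits = sum(pool.map(count_hits, tasks))
    return 4.0 * total_hits / (n_tasks * samples_per_task)

print(f"pi estimate: {estimate_pi():.4f}")
```

Using more tasks than workers is the design choice that gives dynamic load balancing: a worker that finishes early simply takes the next task from the queue.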

In [None]:
import pandas as pd

# Compare parallel design patterns
patterns = pd.DataFrame([
    {
        'Pattern': 'SPMD',
        'Complexity': 'Low',
        'Scalability': 'High',
        'Load Balance': 'Static',
        'Best For': 'Data-parallel problems',
        'Example': 'Image filtering, matrix multiplication'
    },
    {
        'Pattern': 'Loop Parallelism',
        'Complexity': 'Very Low',
        'Scalability': 'High',
        'Load Balance': 'Static/Dynamic',
        'Best For': 'Independent iterations',
        'Example': 'Element-wise operations'
    },
    {
        'Pattern': 'Task Parallelism',
        'Complexity': 'Medium',
        'Scalability': 'Medium',
        'Load Balance': 'Can vary',
        'Best For': 'Different operations',
        'Example': 'Multiple filter pipeline'
    },
    {
        'Pattern': 'Divide & Conquer',
        'Complexity': 'High',
        'Scalability': 'High',
        'Load Balance': 'Static',
        'Best For': 'Recursive problems',
        'Example': 'Parallel merge sort'
    },
    {
        'Pattern': 'Master-Worker',
        'Complexity': 'Medium',
        'Scalability': 'High',
        'Load Balance': 'Dynamic',
        'Best For': 'Unbalanced workloads',
        'Example': 'Monte Carlo simulations'
    },
])

print("\nParallel Design Pattern Comparison")
print("=" * 100)
print(patterns.to_string(index=False))
print("\n" + "=" * 100)

print("""
💡 Choose based on:
   - Problem characteristics (data vs task parallel)
   - Load balance requirements
   - Implementation complexity
   - Expected scalability
""")

In [None]:
# YOUR TURN: Select and justify YOUR pattern(s)

pattern_selection = """
Parallel Design Pattern Selection

Primary Pattern: [Choose one from above]

Justification:
- [Why this pattern fits your problem]
- [What characteristics make it suitable]
- [What alternatives you considered and why you rejected them]

Implementation Details:
- [How will you implement this pattern?]
- [What are the key challenges?]
- [How will you handle load balancing?]

Secondary Pattern (if applicable): [Optional]
- [Why combine multiple patterns?]
- [How will they work together?]
"""

print(pattern_selection)

---

## Step 3: Platform Selection and Justification

Compare platforms and justify your choice(s)
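
One way to make a platform justification explicit is a weighted decision matrix. The sketch below is only an illustration: the 0-10 scores are the same illustrative values plotted in this notebook's trade-off chart, while the weights and the `rank_platforms` helper are assumptions you would replace with your own priorities.

```python
# Illustrative 0-10 scores (same values as the trade-off chart in this notebook)
scores = {
    "OpenMP": {"ease_of_use": 9, "performance": 6, "portability": 10},
    "CUDA": {"ease_of_use": 5, "performance": 10, "portability": 4},
    "Numba": {"ease_of_use": 8, "performance": 8, "portability": 4},
    "MPI": {"ease_of_use": 3, "performance": 9, "portability": 10},
}

def rank_platforms(scores, weights):
    """Return (platform, weighted average score) pairs, best first."""
    total_weight = sum(weights.values())
    ranked = []
    for platform, criteria in scores.items():
        weighted = sum(criteria[name] * w for name, w in weights.items())
        ranked.append((platform, round(weighted / total_weight, 2)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical priorities: performance matters three times as much as the rest
weights = {"ease_of_use": 1, "performance": 3, "portability": 1}

for platform, score in rank_platforms(scores, weights):
    print(f"{platform:8s} {score:.2f}")
```

A matrix like this supports a written justification but does not replace it: it ignores hard constraints such as whether you actually have access to a GPU or a cluster.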

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Platform comparison
platform_comparison = pd.DataFrame([
    {
        'Platform': 'OpenMP',
        'Hardware': 'Multi-core CPU',
        'Parallelism': 'Thread-based',
        'Ease of Use': 'High (pragmas)',
        'Performance': 'Good (4-16x)',
        'Portability': 'Excellent',
        'Cost': 'Free',
        'Best For': 'Moderate parallelism, easy start'
    },
    {
        'Platform': 'CUDA',
        'Hardware': 'NVIDIA GPU',
        'Parallelism': 'Massive (1000s cores)',
        'Ease of Use': 'Medium (C++ extension)',
        'Performance': 'Excellent (50-100x)',
        'Portability': 'Limited (NVIDIA only)',
        'Cost': 'GPU required',
        'Best For': 'Data-parallel, high throughput'
    },
    {
        'Platform': 'Numba CUDA',
        'Hardware': 'NVIDIA GPU',
        'Parallelism': 'Massive (1000s cores)',
        'Ease of Use': 'High (Python decorators)',
        'Performance': 'Very Good (40-80x)',
        'Portability': 'Limited (NVIDIA only)',
        'Cost': 'GPU required',
        'Best For': 'GPU programming in Python'
    },
    {
        'Platform': 'MPI',
        'Hardware': 'Cluster/Network',
        'Parallelism': 'Process-based',
        'Ease of Use': 'Low (explicit communication)',
        'Performance': 'Excellent (scales to 1000s)',
        'Portability': 'Excellent',
        'Cost': 'Cluster access',
        'Best For': 'Large-scale, distributed problems'
    },
])

print("\nPlatform Comparison")
print("=" * 120)
print(platform_comparison.to_string(index=False))
print("\n" + "=" * 120)

# Visualize trade-offs
platforms_viz = ['OpenMP', 'CUDA', 'Numba', 'MPI']
ease_of_use = [9, 5, 8, 3]  # 0-10 scale
performance = [6, 10, 8, 9]
portability = [10, 4, 4, 10]

x = np.arange(len(platforms_viz))
width = 0.25

fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(x - width, ease_of_use, width, label='Ease of Use', color='skyblue')
ax.bar(x, performance, width, label='Performance', color='lightgreen')
ax.bar(x + width, portability, width, label='Portability', color='salmon')

ax.set_ylabel('Score (0-10)', fontsize=12)
ax.set_title('Platform Trade-offs Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(platforms_viz)
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("""
💡 Platform Selection Strategy:

Good: Choose ONE platform and implement well
Better: Implement TWO platforms (e.g., OpenMP + CUDA) and compare
Excellent: Implement MULTIPLE platforms with optimization and analysis
""")

In [None]:
# YOUR TURN: Justify YOUR platform choice

platform_justification = """
Platform Selection and Justification

Selected Platform(s): [e.g., OpenMP and CUDA]

Primary Platform: [OpenMP/CUDA/MPI/Numba]
Justification:
- [Why this platform?]
- [What advantages does it offer for YOUR problem?]
- [What are the hardware requirements?]
- [What is your expected speedup?]

Secondary Platform (if comparing): [Platform name]
Justification:
- [Why add a second platform?]
- [What insights will comparison provide?]
- [How will results differ from primary platform?]

Platforms NOT chosen:
- [List alternatives you considered]
- [Why did you reject each one?]

Implementation Strategy:
1. [Step-by-step plan]
2. [What will you implement first?]
3. [How will you ensure correctness?]
4. [How will you optimize performance?]
"""

print(platform_justification)

---

## Step 4: Algorithm Design and Pseudocode

Provide detailed pseudocode for your parallel algorithm
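
Before writing parallel pseudocode, it helps to have a serial reference to validate against. The following is a minimal sketch for the Gaussian blur example used in this notebook, assuming a single-channel image, an odd kernel size, and clamp-to-edge boundary handling; the helper names `gaussian_kernel` and `gaussian_blur_serial` are illustrative.

```python
import numpy as np

def gaussian_kernel(k_size=5, sigma=1.0):
    """Build a normalized k_size x k_size Gaussian kernel (k_size must be odd)."""
    r = k_size // 2
    offsets = np.arange(-r, r + 1)
    xx, yy = np.meshgrid(offsets, offsets)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()

def gaussian_blur_serial(image, kernel):
    """Serial reference: clamp-to-edge filtering, one output pixel at a time."""
    height, width = image.shape
    r = kernel.shape[0] // 2
    filtered = np.zeros((height, width), dtype=np.float64)
    for y in range(height):
        for x in range(width):
            total = 0.0
            for ky in range(-r, r + 1):
                for kx in range(-r, r + 1):
                    py = min(max(y + ky, 0), height - 1)  # clamp row index
                    px = min(max(x + kx, 0), width - 1)   # clamp column index
                    # Offset by r so kernel indices run from 0 to k_size - 1
                    total += image[py, px] * kernel[ky + r, kx + r]
            filtered[y, x] = total
    return filtered

# Sanity check: blurring a constant image must return the same constant
image = np.full((8, 8), 3.0)
blurred = gaussian_blur_serial(image, gaussian_kernel(5, 1.0))
print(np.allclose(blurred, 3.0))
```

Each output pixel depends only on the input image, never on other output pixels, which is what makes the two outer loops safe to parallelize.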

In [None]:
# Example: Pseudocode for parallel image filtering

openmp_pseudocode = """
ALGORITHM: Parallel Gaussian Blur (OpenMP)

INPUT: 
    image[height][width]       // Input image
    kernel[k_size][k_size]     // Gaussian kernel
    num_threads                // Number of OpenMP threads

OUTPUT:
    filtered[height][width]    // Filtered image

BEGIN
    // Serial preprocessing
    VALIDATE image dimensions
    NORMALIZE kernel weights
    ALLOCATE filtered array
    
    // Parallel filtering
    SET omp_num_threads = num_threads
    
    #pragma omp parallel for schedule(dynamic)
    FOR y = 0 TO height - 1 DO
        FOR x = 0 TO width - 1 DO
            sum = 0.0
            
            // Apply kernel
            FOR ky = -k_size/2 TO k_size/2 DO
                FOR kx = -k_size/2 TO k_size/2 DO
                    // Handle boundaries
                    py = CLAMP(y + ky, 0, height - 1)
                    px = CLAMP(x + kx, 0, width - 1)
                    
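                    // Offset by k_size/2 (integer division, k_size odd) so kernel indices start at 0
                    sum += image[py][px] * kernel[ky + k_size/2][kx + k_size/2]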
                END FOR
            END FOR
            
            filtered[y][x] = sum
        END FOR
    END FOR
    
    RETURN filtered
END

TIME COMPLEXITY: O(height * width * k_size²) / num_threads
SPACE COMPLEXITY: O(height * width)
PARALLELISM: Loop parallelism (SPMD pattern)
"""

print("\nExample OpenMP Pseudocode:")
print("=" * 80)
print(openmp_pseudocode)
print("=" * 80)

cuda_pseudocode = """
ALGORITHM: Parallel Gaussian Blur (CUDA)

INPUT: Same as OpenMP version
OUTPUT: Same as OpenMP version

BEGIN
    // Host (CPU) code
    VALIDATE image dimensions
    ALLOCATE device memory for image_d, filtered_d, kernel_d
    COPY image, kernel to device
    
    // Configure kernel launch
    block_size = (16, 16)           // 16x16 threads per block
    grid_size = (CEIL(width/16), CEIL(height/16))  // Blocks needed, rounded up to cover edge pixels
    
    // Launch CUDA kernel
    LAUNCH gaussian_blur_kernel<<<grid_size, block_size>>>
           (image_d, filtered_d, kernel_d, width, height, k_size)
    
    // Copy result back
    COPY filtered_d to filtered
    FREE device memory
    
    RETURN filtered
END

__global__ KERNEL gaussian_blur_kernel:
    // Calculate thread's pixel coordinates
    x = blockIdx.x * blockDim.x + threadIdx.x
    y = blockIdx.y * blockDim.y + threadIdx.y
    
    IF x >= width OR y >= height THEN RETURN
    
    sum = 0.0
    
    // Apply kernel (same as OpenMP)
    FOR ky = -k_size/2 TO k_size/2 DO
        FOR kx = -k_size/2 TO k_size/2 DO
            py = CLAMP(y + ky, 0, height - 1)
            px = CLAMP(x + kx, 0, width - 1)
            sum += image_d[py][px] * kernel_d[ky + k_size/2][kx + k_size/2]
        END FOR
    END FOR
    
    filtered_d[y][x] = sum
END KERNEL

PARALLELISM: Massive data parallelism (one thread per pixel)
OPTIMIZATION: Keep the Gaussian kernel in constant or shared memory (reduces global memory access)
"""

print("\nExample CUDA Pseudocode:")
print("=" * 80)
print(cuda_pseudocode)
print("=" * 80)

In [None]:
# YOUR TURN: Write pseudocode for YOUR algorithm

your_pseudocode = """
ALGORITHM: [Your Algorithm Name]

INPUT:
    [List all inputs]

OUTPUT:
    [List outputs]

BEGIN
    // Preprocessing
    [Your preprocessing steps]
    
    // Parallel section
    [Your parallel code]
    [Use pragma/kernel notation]
    
    // Postprocessing
    [Combine results if needed]
    
    RETURN [result]
END

TIME COMPLEXITY: [Big-O notation]
SPACE COMPLEXITY: [Big-O notation]
PARALLELISM: [Pattern and justification]
"""

print(your_pseudocode)

print("""
💡 Pseudocode Tips:
   - Use clear variable names
   - Show parallel pragmas/directives
   - Include complexity analysis
   - Explain synchronization points
   - Address edge cases
""")

---

## Step 5: Data Flow Diagram

Show how data moves through your system
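
A row-wise decomposition like the one in the diagram for this step can be computed with a few lines of Python. This is a sketch under the assumption of a static, contiguous block partition; `partition_rows` is a hypothetical helper name, and the half-open `(start, stop)` ranges follow Python's slicing convention.

```python
def partition_rows(height, num_workers):
    """Split rows [0, height) into contiguous half-open ranges, one per worker.

    When height is not divisible by num_workers, the first
    (height % num_workers) workers each receive one extra row.
    """
    base, extra = divmod(height, num_workers)
    ranges, start = [], 0
    for worker in range(num_workers):
        stop = start + base + (1 if worker < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

for worker, (start, stop) in enumerate(partition_rows(400, 4)):
    print(f"Thread {worker}: rows {start} to {stop - 1}")
```

Spreading the remainder over the first workers keeps the largest and smallest chunks within one row of each other, which matters for load balance when the row count is not a multiple of the thread count.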

In [None]:
# Create data flow diagram
def draw_data_flow():
    """
    Draw a data flow diagram for parallel processing
    """
    fig, ax = plt.subplots(figsize=(14, 8))
    ax.set_xlim(0, 14)
    ax.set_ylim(0, 8)
    ax.axis('off')
    
    # Title
    ax.text(7, 7.5, 'Data Flow: OpenMP Parallel Image Processing', 
            fontsize=14, fontweight='bold', ha='center')
    
    # Input data
    input_box = FancyBboxPatch((5.5, 6), 3, 0.6,
                               boxstyle="round,pad=0.05",
                               edgecolor='black', facecolor='lightblue', linewidth=2)
    ax.add_patch(input_box)
    ax.text(7, 6.3, 'Input Image\n(height × width)', ha='center', va='center', fontsize=9)
    
    # Data partitioning
    ax.text(7, 5.3, 'Partition Data (by rows)', ha='center', fontsize=10, fontweight='bold')
    
    # Worker threads
    num_workers = 4
    worker_colors = ['#ff9999', '#99ff99', '#9999ff', '#ffff99']
    
    for i in range(num_workers):
        x_pos = 2 + i * 3
        worker_box = FancyBboxPatch((x_pos, 3.5), 2, 1,
                                    boxstyle="round,pad=0.05",
                                    edgecolor='black', facecolor=worker_colors[i], linewidth=2)
        ax.add_patch(worker_box)
        ax.text(x_pos + 1, 4.3, f'Thread {i}', ha='center', fontweight='bold', fontsize=9)
        ax.text(x_pos + 1, 4.0, f'Rows {i*100}', ha='center', fontsize=8)
        ax.text(x_pos + 1, 3.7, f'to {(i+1)*100-1}', ha='center', fontsize=8)
        
        # Arrows from input to workers
        ax.annotate('', xy=(x_pos + 1, 4.5), xytext=(7, 6),
                   arrowprops=dict(arrowstyle='->', lw=1.5, color='black'))
    
    # Barrier/Synchronization
    ax.plot([1.5, 12.5], [2.8, 2.8], 'r-', linewidth=3)
    ax.text(7, 2.4, 'Synchronization Barrier', ha='center', fontsize=10, 
            fontweight='bold', color='red')
    
    # Merge results
    for i in range(num_workers):
        x_pos = 2 + i * 3
        ax.annotate('', xy=(7, 1.5), xytext=(x_pos + 1, 2.8),
                   arrowprops=dict(arrowstyle='->', lw=1.5, color='black'))
    
    # Output
    output_box = FancyBboxPatch((5.5, 0.5), 3, 0.8,
                                boxstyle="round,pad=0.05",
                                edgecolor='black', facecolor='lightgreen', linewidth=2)
    ax.add_patch(output_box)
    ax.text(7, 0.9, 'Filtered Image\n(Combined Result)', ha='center', va='center', fontsize=9)
    
    plt.tight_layout()
    plt.show()

draw_data_flow()

print("""
📊 Data Flow Diagram

Key Points:
1. Input image is partitioned by rows
2. Each thread processes independent rows
3. Synchronization barrier ensures all threads complete
4. Results are combined into final output

💡 Create a similar diagram for YOUR project!
""")

---

## Step 6: Performance Prediction

Estimate expected performance using theoretical models
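
Amdahl's Law, S = 1 / ((1 - P) + P / N), can also be read in reverse: a measured speedup S on N cores implies a parallelizable fraction P = (1 - 1/S) / (1 - 1/N). The sketch below uses that rearrangement to estimate P from a pilot run; the measurement value is hypothetical and the helper names are illustrative.

```python
def amdahl_limit(p):
    """Upper bound on speedup as the core count grows without limit."""
    return 1.0 / (1.0 - p)

def parallel_fraction_from_speedup(speedup, n_cores):
    """Invert Amdahl's Law: the P that would explain a speedup on n_cores."""
    if n_cores < 2:
        raise ValueError("need at least 2 cores to estimate P")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cores)

measured_speedup = 3.2  # hypothetical pilot measurement on 4 cores
p_est = parallel_fraction_from_speedup(measured_speedup, 4)

print(f"Estimated parallelizable portion: {p_est:.1%}")
print(f"Speedup ceiling implied by that estimate: {amdahl_limit(p_est):.1f}x")
```

An estimate obtained this way folds overhead into the serial fraction, so treat it as a lower bound on the true P rather than an exact value.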

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Amdahl's Law - Theoretical speedup prediction
def amdahl_speedup(P, N):
    """Calculate speedup using Amdahl's Law"""
    return 1 / ((1 - P) + (P / N))

# Gustafson's Law - For scaled speedup
def gustafson_speedup(P, N):
    """Calculate scaled speedup using Gustafson's Law"""
    return (1 - P) + P * N

# Example prediction for image processing
print("\nPerformance Prediction: Parallel Image Processing")
print("=" * 80)

# Assume 95% of code can be parallelized
P = 0.95
cores = [1, 2, 4, 8, 16]

print(f"\nParallelizable portion (P): {P:.0%}")
print(f"Serial portion: {(1-P):.0%}\n")

print("Cores\tAmdahl Speedup\tGustafson Speedup\tEfficiency")
print("-" * 60)

for N in cores:
    amdahl = amdahl_speedup(P, N)
    gustafson = gustafson_speedup(P, N)
    efficiency = (amdahl / N) * 100
    print(f"{N}\t{amdahl:.2f}x\t\t{gustafson:.2f}x\t\t{efficiency:.1f}%")

# Visualize speedup predictions
cores_range = np.arange(1, 17)
amdahl_curve = [amdahl_speedup(P, n) for n in cores_range]
gustafson_curve = [gustafson_speedup(P, n) for n in cores_range]
ideal_curve = cores_range

plt.figure(figsize=(12, 6))
plt.plot(cores_range, ideal_curve, 'k--', label='Ideal (Linear)', linewidth=2)
plt.plot(cores_range, amdahl_curve, 'r-o', label=f'Amdahl (P={P:.0%})', linewidth=2)
plt.plot(cores_range, gustafson_curve, 'b-s', label=f'Gustafson (P={P:.0%})', linewidth=2)

plt.xlabel('Number of Cores', fontsize=12)
plt.ylabel('Speedup', fontsize=12)
plt.title('Predicted Speedup: Amdahl vs Gustafson Laws', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n" + "=" * 80)
print("""
💡 Interpretation:
   - Amdahl's Law: Fixed problem size (pessimistic)
   - Gustafson's Law: Scaled problem size (optimistic)
   - Reality: Fixed-size runs usually land at or below the Amdahl curve; runs that grow the problem with the core count trend toward Gustafson
   - Efficiency drops as cores increase due to overhead
""")

In [None]:
# YOUR TURN: Predict YOUR performance

# Estimate your parallelizable portion
your_P = 0.90  # Replace with your estimate (0.0 to 1.0)
your_cores = 8  # Your CPU cores

print("\nYour Performance Prediction")
print("=" * 80)
print(f"\nParallelizable portion: {your_P:.0%}")
print(f"Number of cores: {your_cores}")

expected_speedup = amdahl_speedup(your_P, your_cores)
realistic_speedup = expected_speedup * 0.7  # Rule of thumb: assume ~30% lost to overhead; refine after measuring

print(f"\nTheoretical maximum speedup: {expected_speedup:.2f}x")
print(f"Realistic expected speedup: {realistic_speedup:.2f}x")
print(f"Expected efficiency: {(expected_speedup / your_cores) * 100:.1f}%")

print("\n" + "=" * 80)

performance_prediction = f"""
Performance Prediction Summary:

Based on Amdahl's Law with P={your_P:.0%} parallelizable code:
- With {your_cores} cores: Expected {realistic_speedup:.1f}x speedup
- Serial execution time: [Estimate based on your testing]
- Parallel execution time: [Serial time / {realistic_speedup:.1f}]

Bottlenecks to consider:
- [List potential bottlenecks]
- [Memory bandwidth limitations?]
- [Synchronization overhead?]
- [Load imbalance?]

Optimization strategies:
- [How will you minimize overhead?]
- [Cache optimization techniques?]
- [Load balancing approach?]
"""

print(performance_prediction)

---

## Step 7: Testing and Validation Plan

How will you ensure correctness and measure performance?
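
A small measurement harness covers both questions at once. The sketch below is one possible shape, assuming a sum-of-squares toy workload; `benchmark` and the two workload functions are hypothetical names, and the NumPy version merely stands in for a real parallel implementation.

```python
import statistics
import time
import numpy as np

def benchmark(fn, *args, repeats=5):
    """Time fn over several runs; return (mean_s, stdev_s, last_result)."""
    times, result = [], None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times), result

def serial_sum_of_squares(values):
    """Serial baseline: plain Python loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def fast_sum_of_squares(values):
    """Stand-in for the parallel version (NumPy dot product)."""
    return float(np.dot(values, values))

data = np.random.default_rng(42).random(200_000)
t_serial, sd_serial, reference = benchmark(serial_sum_of_squares, data)
t_fast, sd_fast, candidate = benchmark(fast_sum_of_squares, data)

# Correctness first: a speedup is only meaningful if the outputs agree
is_correct = bool(np.isclose(candidate, reference, rtol=1e-9))

print(f"serial: {t_serial:.4f} s (stdev {sd_serial:.4f})")
print(f"fast:   {t_fast:.4f} s (stdev {sd_fast:.4f})")
print(f"speedup: {t_serial / t_fast:.1f}x, correct: {is_correct}")
```

Reporting the standard deviation alongside the mean shows whether a measured speedup is larger than the run-to-run noise.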

In [None]:
# Testing and validation framework
testing_plan = pd.DataFrame([
    {
        'Test Type': 'Correctness',
        'Method': 'Compare parallel vs serial output',
        'Metric': 'Element-wise difference',
        'Pass Criteria': 'Max error < 1e-6',
        'Tools': 'Python assert, np.allclose()'
    },
    {
        'Test Type': 'Performance',
        'Method': 'Time execution with different core counts',
        'Metric': 'Execution time, speedup',
        'Pass Criteria': 'Speedup > 4x with 8 cores',
        'Tools': 'time.perf_counter(), timeit'
    },
    {
        'Test Type': 'Scalability',
        'Method': 'Test with varying problem sizes',
        'Metric': 'Speedup vs problem size',
        'Pass Criteria': 'Speedup maintained for large inputs',
        'Tools': 'Custom benchmarking script'
    },
    {
        'Test Type': 'Efficiency',
        'Method': 'Measure resource utilization',
        'Metric': 'CPU usage, memory bandwidth',
        'Pass Criteria': 'Efficiency > 70%',
        'Tools': 'cProfile, line_profiler'
    },
    {
        'Test Type': 'Edge Cases',
        'Method': 'Test boundary conditions',
        'Metric': 'Correct handling of edge cases',
        'Pass Criteria': 'No errors or crashes',
        'Tools': 'Unit tests, pytest'
    },
])

print("\nTesting and Validation Plan")
print("=" * 100)
print(testing_plan.to_string(index=False))
print("\n" + "=" * 100)

print("""
💡 Testing Best Practices:
   1. Always compare parallel output with serial baseline
   2. Test with multiple input sizes (small, medium, large)
   3. Measure performance multiple times (average 5-10 runs)
   4. Profile to identify bottlenecks
   5. Document all test results
""")

---

## Step 8: Methodology Writing Template

In [None]:
# Complete methodology section template
methodology_template = """
METHODOLOGY

3.1 Overview
[1 paragraph: Briefly describe your overall approach. What are you building?]

3.2 System Architecture
[1-2 paragraphs + diagram: Describe the components of your system. Include
an architecture diagram showing input, processing, and output modules.]

3.3 Algorithm Design
[2-3 paragraphs: Describe the algorithm(s) you will implement. Why did you
choose this specific algorithm? What are its characteristics?]

Serial Algorithm:
[Include pseudocode for serial version]

Parallel Algorithm:
[Include pseudocode for parallel version, showing parallel directives]

3.4 Parallel Design Pattern
[1-2 paragraphs: What parallel pattern are you using (SPMD, loop, task)?
Why is this pattern appropriate for your problem? How will you implement it?]

3.5 Platform Selection and Justification
[1-2 paragraphs: Which platform(s) (OpenMP, CUDA, MPI)? Why this choice?
What are the advantages and disadvantages? Include comparison table.]

3.6 Data Decomposition
[1 paragraph + diagram: How will you partition data across processors?
Include data flow diagram.]

3.7 Performance Prediction
[1-2 paragraphs: What speedup do you expect? Use Amdahl's Law to justify.
What are potential bottlenecks?]

3.8 Testing and Validation
[1 paragraph: How will you verify correctness? How will you measure performance?
What metrics will you use?]

3.9 Implementation Timeline
[1 paragraph or table: What will you implement first? What is your schedule?]

---
💡 Remember:
- Include diagrams and pseudocode
- Justify all decisions (don't just describe)
- Show technical depth
- Be specific (avoid vague statements)
- Cite relevant papers from literature review
"""

print(methodology_template)

# Save template
# with open('methodology_template.txt', 'w') as f:
#     f.write(methodology_template)
# print("\n💾 Saved to: methodology_template.txt")

---

## Quality Checklist

In [None]:
# Methodology quality checklist
checklist = {
    'Content Completeness': [
        '[ ] System architecture described and diagrammed',
        '[ ] Algorithm design explained with pseudocode',
        '[ ] Parallel design pattern selected and justified',
        '[ ] Platform choice justified with comparison',
        '[ ] Data decomposition strategy explained',
        '[ ] Performance prediction included',
        '[ ] Testing plan described',
    ],
    'Technical Depth': [
        '[ ] Pseudocode shows parallel directives',
        '[ ] Complexity analysis included',
        '[ ] Bottlenecks identified',
        '[ ] Optimization strategies mentioned',
        '[ ] Synchronization points addressed',
        '[ ] Load balancing discussed',
    ],
    'Justification': [
        '[ ] All design decisions justified',
        '[ ] Platform choice compared with alternatives',
        '[ ] Algorithm choice explained',
        '[ ] Pattern selection reasoned',
        '[ ] Connected to literature review',
    ],
    'Presentation': [
        '[ ] Architecture diagram included',
        '[ ] Data flow diagram included',
        '[ ] Pseudocode properly formatted',
        '[ ] Clear section organization',
        '[ ] 2-3 pages in length',
        '[ ] Professional writing style',
    ]
}

print("\n" + "=" * 80)
print("METHODOLOGY QUALITY CHECKLIST")
print("=" * 80 + "\n")

for category, items in checklist.items():
    print(f"\n📋 {category}:")
    for item in items:
        print(f"   {item}")

print("\n" + "=" * 80)
print("\n✅ Verify ALL items before finalizing proposal!")
print("\n" + "=" * 80)

---

## Common Mistakes to Avoid

### ❌ DON'T:

1. **Be too vague**: "We will use parallel processing to speed up the code"
   - ✅ DO: "We will use OpenMP loop parallelism with dynamic scheduling to parallelize the outer loop, distributing rows across 8 CPU cores"

2. **Describe without justifying**: "We chose CUDA for our implementation"
   - ✅ DO: "We chose CUDA because our problem exhibits massive data parallelism with independent pixel operations, making it ideal for GPU acceleration. Compared to OpenMP, CUDA offers hundreds of times more parallel execution units (e.g., 5120 CUDA cores vs 8 CPU cores)"

3. **Ignore performance prediction**: No mention of expected speedup
   - ✅ DO: "Amdahl's Law with P=0.95 gives a theoretical speedup of 5.9x on 8 cores; allowing for about 30% overhead from synchronization, we predict roughly 4.1x"

4. **No diagrams**: Text-only methodology
   - ✅ DO: Include architecture diagram, data flow diagram, and performance prediction graph

5. **Missing pseudocode**: Only high-level description
   - ✅ DO: Provide detailed pseudocode showing parallel directives and data structures

6. **Unrealistic expectations**: "We expect 100x speedup with 8 cores"
   - ✅ DO: Use Amdahl's Law to set realistic expectations based on parallelizable portion

7. **No testing plan**: No mention of how you'll verify correctness
   - ✅ DO: Describe validation against serial baseline and performance measurement methodology

---

## Summary

You've learned how to:
1. ✅ Design a system architecture for parallel computing
2. ✅ Select and justify parallel design patterns
3. ✅ Choose appropriate platforms (OpenMP, CUDA, MPI)
4. ✅ Write detailed pseudocode for parallel algorithms
5. ✅ Create architecture and data flow diagrams
6. ✅ Predict performance using theoretical models
7. ✅ Plan testing and validation

---

## Next Steps

1. ✅ Design your system architecture
2. ✅ Write pseudocode for serial and parallel versions
3. ✅ Create architecture and data flow diagrams
4. ✅ Justify platform selection
5. ✅ Predict performance with Amdahl's Law
6. ✅ Write methodology section (2-3 pages)
7. ✅ Verify quality checklist
8. ✅ Move to `04_proposal_writing.ipynb`

---

**Good luck with your methodology design! 🚀**