# Practical 3: Memory Management with OpenMP and CUDA

**Course**: BMCS3003 Distributed Systems and Parallel Computing

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 120 minutes

**Prerequisites**:
- Basic understanding of parallel programming concepts
- C/C++ programming experience
- Familiarity with threading from Practical 2

## Learning Objectives

By the end of this practical, you will be able to:

1. Configure and use OpenMP in Visual Studio for parallel programming
2. Understand and implement shared vs private data in parallel regions
3. Apply OpenMP directives for loop parallelization and reduction operations
4. Identify and prevent false sharing in multi-threaded programs
5. Implement parallel algorithms (numerical integration, matrix multiplication)
6. Use CUDA for GPU-accelerated computing (optional based on hardware)
7. Profile and analyze parallel program performance

## Table of Contents

1. [Introduction to OpenMP](#section1)
2. [OpenMP Configuration](#section2)
3. [Shared vs Private Data](#section3)
4. [Parallel Numerical Integration](#section4)
5. [False Sharing and Cache Line Padding](#section5)
6. [Matrix Multiplication with OpenMP](#section6)
7. [Introduction to CUDA](#section7)
8. [CUDA Vector Addition](#section8)
9. [Performance Summary](#section9)
10. [Summary](#section10)

<a id='section1'></a>
## 1. Introduction to OpenMP

### What is OpenMP?

**OpenMP** (Open Multi-Processing) is an API that supports multi-platform shared-memory parallel programming in C, C++, and Fortran.

### Key Features

- **Compiler directives**: Simple pragmas to parallelize code
- **Fork-join model**: Master thread forks worker threads
- **Shared memory**: All threads access same memory space
- **Easy to learn**: Incremental parallelization

### OpenMP vs Manual Threading

| Aspect | Manual Threading | OpenMP |
|--------|------------------|--------|
| Code complexity | High | Low |
| Lines of code | Many | Few |
| Thread management | Manual | Automatic |
| Portability | Platform-specific | Cross-platform |
| Learning curve | Steep | Gentle |

### Fork-Join Execution Model

```
MASTER THREAD
    |
    |  #pragma omp parallel
    |
    +------- FORK -------+
    |                    |
 Thread 0    Thread 1   Thread 2   Thread 3
    |           |          |          |
    | Parallel  |          |          |
    | Region    |          |          |
    |           |          |          |
    +------- JOIN --------+
    |
 MASTER THREAD
```

<a id='section2'></a>
## 2. OpenMP Configuration in Visual Studio

### Step-by-Step Configuration

#### Step 1: Open Project Properties
- Right-click on your project in Solution Explorer
- Select **Properties**

#### Step 2: Navigate to OpenMP Settings
```
Configuration Properties
  └─ C/C++
      └─ Language
          └─ Open MP Support
```

#### Step 3: Enable OpenMP
- Set **Open MP Support** to **Yes (/openmp)**
- Click **Apply** and **OK**

#### Step 4: Verify Platform Configuration
- Ensure the **Platform** dropdown matches your build configuration
- **Win32** = **x86** (32-bit)
- **x64** = 64-bit

### Visual Studio Configuration Dialog

```
┌─────────────────────────────────────────────────────┐
│ Configuration: Active(Debug)    Platform: x64       │
├─────────────────────────────────────────────────────┤
│ ▶ Configuration Properties                          │
│   ▶ General                                         │
│   ▶ C/C++                                           │
│     ▼ Language                                      │
│         Open MP Support: Yes (/openmp)    ◄─────── │
│         C++ Language Standard: Default              │
│         Conformance mode: Yes (/permissive-)        │
└─────────────────────────────────────────────────────┘
```

### Linux/macOS Configuration

```bash
# GCC
g++ -fopenmp program.cpp -o program

# Clang
clang++ -fopenmp program.cpp -o program

# Run
./program
```

### Question 1: Hello World with OpenMP

**Objective**: Verify OpenMP is configured correctly by creating a simple parallel program.

#### Code (P3Q1.cpp)

```cpp
#include <omp.h>
#include <cstdio>    // printf
#include <iostream>

int main() {
    // Set number of threads (optional)
    omp_set_num_threads(4);
    
    // Parallel region starts here
    #pragma omp parallel
    {
        // Get thread ID (0 to N-1)
        int thread_id = omp_get_thread_num();
        
        // Get total number of threads
        int num_threads = omp_get_num_threads();
        
        // Each thread prints its ID
        printf("Hello(%d) World(%d)\n", thread_id, thread_id);
    }
    // Implicit barrier - all threads join here
    
    std::cout << "All threads completed!" << std::endl;
    
    return 0;
}
```

#### Expected Output (4 threads)

```
Hello(1) World(1)
Hello(0) World(0)
Hello(3) World(3)
Hello(2) World(2)
All threads completed!
```

**Note**: The output order is non-deterministic because threads execute in parallel!

### Understanding the Code

```cpp
#pragma omp parallel
{
    // This code is executed by ALL threads
}
```

#### Key OpenMP Functions

| Function | Description | Return Type |
|----------|-------------|-------------|
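| `omp_get_thread_num()` | ID of the calling thread within its team (0 to N-1) | int |
| `omp_get_num_threads()` | Number of threads in the current team (1 outside a parallel region) | int |
| `omp_get_max_threads()` | Upper bound on the threads a new parallel region would use | int |
| `omp_set_num_threads(n)` | Request n threads for subsequent parallel regions | void |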
| `omp_get_wtime()` | Wall clock time in seconds | double |

### Common Issues and Solutions

#### Issue 1: OpenMP not recognized
```
Error: #include <omp.h> not found
```
**Solution**: Enable OpenMP in project properties

#### Issue 2: Single thread execution
```
Hello(0) World(0)
All threads completed!
```
**Solution**: 
- Check OpenMP is enabled in build configuration
- Verify `/openmp` flag is set
- Try explicitly setting thread count: `omp_set_num_threads(4)`

<a id='section3'></a>
## 3. Shared vs Private Data in OpenMP

### Data Scoping Rules

In OpenMP, variables can be:
1. **Shared**: All threads access the SAME memory location
2. **Private**: Each thread has its OWN copy

### Default Scoping Rules

```cpp
int global = 5;        // SHARED by default

#pragma omp parallel
{
    int local = 10;    // PRIVATE (declared inside parallel region)
    global++;          // All threads modify the SAME variable (unsynchronized: data race)
    local++;           // Each thread modifies its OWN copy
}
```

### Shared Data Example

```cpp
#include <omp.h>
#include <cstdio>    // printf
#include <iostream>

int main() {
    int x = 5;  // Declared outside parallel region → SHARED
    
    #pragma omp parallel
    {
        x = x + 1;  // All threads modify the SAME x
        printf("Shared: x is %d\n", x);
    }
    
    printf("Final x: %d\n", x);
    return 0;
}
```

#### Output (8 threads)
```
Shared: x is 8
Shared: x is 6
Shared: x is 7
Shared: x is 9
Shared: x is 9
Shared: x is 10
Shared: x is 11
Shared: x is 12
Final x: 12
```

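**Important**: The values are unpredictable because of a **race condition**: `x = x + 1` is a read-modify-write that threads can interleave. In the run above one update was lost, since eight increments starting from 5 should end at 13, not 12.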

### Private Data Examples

#### Method 1: Declare inside parallel region

```cpp
int x = 5;  // Outer variable

#pragma omp parallel
{
    int x;   // NEW variable, shadows outer x
    x = 3;   // Each thread has its own x
    printf("Local: x is %d\n", x);
}

printf("After: x is still %d\n", x);  // Outer x unchanged
```

#### Output
```
Local: x is 3
Local: x is 3
Local: x is 3
Local: x is 3
...
After: x is still 5
```

#### Method 2: Use `private` clause

```cpp
int x = 5;

#pragma omp parallel private(x)
{
    // x is private, but UNINITIALIZED!
    x = omp_get_thread_num();
    printf("Private: x is %d\n", x);
}

printf("After: x is %d\n", x);  // Still 5
```

#### Output
```
Private: x is 0
Private: x is 1
Private: x is 2
Private: x is 3
...
After: x is 5
```

### Dangerous Private Variable Example

```cpp
int x = 5;

#pragma omp parallel private(x)
{
    x = x + 1;  // DANGEROUS! x is uninitialized
    printf("Private: x is %d\n", x);
}

printf("After: x is %d\n", x);  // Outer x is a separate object; the region never wrote it
```

#### Output (undefined behavior)
```
Private: x is 6     ← Indeterminate; may look plausible by accident
Private: x is 13    ← Garbage value
Private: x is 9     ← Garbage value
Private: x is 7     ← Garbage value
...
After: x is 5       ← Outer x is not modified by the private copies
```

### Data Clause Modifiers

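| Clause | Description | Initial value inside region | Original variable after region |
|--------|-------------|-----------------------------|--------------------------------|
| `shared(x)` | All threads use the same x | Value before the region | Reflects the threads' updates |
| `private(x)` | Each thread gets its own copy | Uninitialized | Not modified by the private copies |
| `firstprivate(x)` | Private copy, initialized | Copy of the original | Not modified by the private copies |
| `lastprivate(x)` | Private copy; used on `for`/`sections`, not on a bare `parallel` | Uninitialized | Value from the sequentially last iteration |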

#### firstprivate Example

```cpp
int x = 5;

#pragma omp parallel firstprivate(x)
{
    x = x + omp_get_thread_num();  // Initialized to 5
    printf("Thread %d: x = %d\n", omp_get_thread_num(), x);
}

// x is still 5 here
```

<a id='section4'></a>
## 4. Question 2: Parallel Numerical Integration (Pi Calculation)

### The Problem

Calculate π using numerical integration:

$$\pi = \int_0^1 \frac{4.0}{1+x^2} dx$$

### Approximation Method

Divide the integral into rectangles:

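$$\pi \approx \sum_{i=0}^{N-1} F(x_i)\,\Delta x$$

Where:
- $N$ = number of rectangles (`num_steps` in the code)
- $\Delta x = 1/N$ = width of each rectangle
- $x_i = (i + 0.5)\,\Delta x$ = midpoint of rectangle $i$
- $F(x_i) = \frac{4.0}{1+x_i^2}$ = height at point $x_i$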

### Visual Representation

```
F(x) = 4.0/(1+x²)
  |
4 ├─┐
  │ │█
3 │ │█
  │ │█
2 │ │█ █
  │ │███ █
1 │ │█████ █
  │ │████████
0 └─┴────────┴─→ x
  0          1
```

Each rectangle approximates a small part of the curve.

### Serial Implementation (P3Q2.cpp)

```cpp
#include <cstdio>    // printf
#include <iostream>
#include <omp.h>

static long num_steps = 100000;
double step;

int main() {
    int i;
    double x, pi, sum = 0.0;
    
    // Calculate step size
    step = 1.0 / (double)num_steps;
    
    // Start timing
    double start_time = omp_get_wtime();
    
    // Calculate sum of all rectangles
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;              // Midpoint of rectangle
        sum = sum + 4.0 / (1.0 + x * x);   // Height of rectangle
    }
    
    // Multiply by width to get area
    pi = step * sum;
    
    double end_time = omp_get_wtime();
    
    printf("Pi = %.10f\n", pi);
    printf("Time = %.6f seconds\n", end_time - start_time);
    
    return 0;
}
```

#### Output (Serial)
```
Pi = 3.1415926536
Time = 0.002341 seconds
```

### Part A: Basic Parallel Version

**Challenge**: Parallelize using `#pragma omp parallel`

```cpp
#include <cstdio>    // printf
#include <iostream>
#include <omp.h>

static long num_steps = 10000000;  // Increased for better measurement
double step;

int main() {
    double pi;
    double sum[16];     // Array to store partial sums (max 16 threads)
    int team_size = 1;  // Number of threads actually granted

    step = 1.0 / (double)num_steps;

    // Never request more threads than sum[] has slots
    if (omp_get_max_threads() > 16) {
        omp_set_num_threads(16);
    }

    double start_time = omp_get_wtime();

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int num_threads = omp_get_num_threads();

        // Thread 0 records the real team size for the combine step
        if (id == 0) team_size = num_threads;

        // Each thread gets its own sum
        sum[id] = 0.0;

        // Divide work among threads.
        // i and x are declared inside the region so each thread has its own
        // copy; declared outside, they would be shared (a data race).
        for (long i = id; i < num_steps; i += num_threads) {
            double x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x * x);
        }
    }
    // Implicit barrier here

    // Combine partial sums
    double total_sum = 0.0;
    for (int t = 0; t < team_size; t++) {
        total_sum += sum[t];
    }

    pi = step * total_sum;

    double end_time = omp_get_wtime();

    printf("Pi = %.10f\n", pi);
    printf("Time = %.6f seconds\n", end_time - start_time);
    printf("Threads = %d\n", team_size);

    return 0;
}
```

#### Work Distribution

```
Total iterations: 0 1 2 3 4 5 6 7 8 9 10 11 ...

Thread 0: 0   4   8   12  ...  (i += 4)
Thread 1:   1   5   9   13 ...
Thread 2:     2   6   10  14 ...
Thread 3:       3   7   11 15 ...
```

<a id='section5'></a>
## 5. Part B: False Sharing Problem

### What is False Sharing?

**False sharing** occurs when threads on different processors modify variables that reside on the same cache line.

### Cache Line Basics

- CPU caches work with **cache lines** (typically 64 bytes)
- When one core modifies data, the entire cache line is invalidated on other cores
- Even if threads access different variables!

### False Sharing Illustration

```
Array: sum[16]  (each double = 8 bytes)

Cache Line (64 bytes):
┌────────────────────────────────────────────────┐
│ sum[0] sum[1] sum[2] sum[3] sum[4] sum[5] ... │
└────────────────────────────────────────────────┘
     ↑      ↑      ↑      ↑
  Thread0 Thread1 Thread2 Thread3
```

**Problem**: 
- Thread 0 modifies `sum[0]`
- Entire cache line invalidated on all cores
- Thread 1, 2, 3 must reload cache line
- Massive performance loss!

### Solution: Cache Line Padding

Give each thread its own cache line:

```
Cache Line 0 (64 bytes):
┌────────────────────────────────────────────────┐
│ sum[0][0]    padding (56 bytes)                │  ← Thread 0
└────────────────────────────────────────────────┘

Cache Line 1 (64 bytes):
┌────────────────────────────────────────────────┐
│ sum[1][0]    padding (56 bytes)                │  ← Thread 1
└────────────────────────────────────────────────┘
```

### Implementation with Padding

```cpp
#include <cstdio>    // printf
#include <iostream>
#include <omp.h>

static long num_steps = 10000000;
double step;

#define PAD 8  // 64 bytes / 8 bytes per double = 8 doubles

int main() {
    double pi;
    double sum[16][PAD];  // 2D array with padding
    int team_size = 1;    // Number of threads actually granted

    step = 1.0 / (double)num_steps;

    // Never request more threads than sum[] has rows
    if (omp_get_max_threads() > 16) {
        omp_set_num_threads(16);
    }

    double start_time = omp_get_wtime();

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int num_threads = omp_get_num_threads();

        // Thread 0 records the real team size for the combine step
        if (id == 0) team_size = num_threads;

        // Initialize first element of row
        sum[id][0] = 0.0;

        // i and x are local to the region, so each thread has its own copy
        for (long i = id; i < num_steps; i += num_threads) {
            double x = (i + 0.5) * step;
            sum[id][0] += 4.0 / (1.0 + x * x);  // Access [id][0] only
        }
    }

    // Combine partial sums
    double total_sum = 0.0;
    for (int t = 0; t < team_size; t++) {
        total_sum += sum[t][0];
    }

    pi = step * total_sum;

    double end_time = omp_get_wtime();

    printf("Pi = %.10f\n", pi);
    printf("Time with padding = %.6f seconds\n", end_time - start_time);
    printf("Threads = %d\n", team_size);

    return 0;
}
```

### Performance Comparison

**Configuration**: 8 threads, 10,000,000 steps

| Implementation | Time (ms) | Speedup | Notes |
|----------------|-----------|---------|-------|
| Serial | 45.2 | 1.0x | Baseline |
| Parallel (no pad) | 12.8 | 3.5x | False sharing! |
| Parallel (padded) | 6.1 | 7.4x | Good speedup |

**Improvement**: ~2x faster with padding!

### Part C: Using Reduction Clause

OpenMP provides a **reduction** clause that handles this automatically!

```cpp
#include <cstdio>    // printf
#include <iostream>
#include <omp.h>

static long num_steps = 10000000;
double step;

int main() {
    double pi, sum = 0.0;

    step = 1.0 / (double)num_steps;

    double start_time = omp_get_wtime();

    // Parallel loop with reduction.
    // i is private as the loop variable; x is declared inside the loop body
    // so each thread has its own copy (a shared x would be a data race).
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    
    pi = step * sum;
    
    double end_time = omp_get_wtime();
    
    printf("Pi = %.10f\n", pi);
    printf("Time with reduction = %.6f seconds\n", end_time - start_time);
    printf("Threads = %d\n", omp_get_max_threads());
    
    return 0;
}
```

### How Reduction Works

```cpp
#pragma omp parallel for reduction(+:sum)
```

1. Each thread gets a **private copy** of `sum` (initialized to 0)
2. Threads compute their partial sums independently
3. At the end, OpenMP **combines** all partial sums: `sum = sum0 + sum1 + sum2 + ...`

### Reduction Operations

| Operator | Description | Initial Value |
|----------|-------------|---------------|
| `+` | Addition | 0 |
| `*` | Multiplication | 1 |
| `-` | Subtraction | 0 |
| `&` | Bitwise AND | ~0 |
| `\|` | Bitwise OR | 0 |
| `^` | Bitwise XOR | 0 |
| `&&` | Logical AND | 1 |
| `\|\|` | Logical OR | 0 |

### Why Reduction is Best

1. **Simplest code**: Looks almost like serial version
2. **No false sharing**: Each thread accumulates into its own private copy, not into a shared array
3. **Optimal performance**: Uses efficient reduction algorithms
4. **Less error-prone**: No manual thread management

### Discussion: Why Different Results?

Running with different thread counts may give slightly different π values:

```
Threads = 1: π = 3.141592653590
Threads = 2: π = 3.141592653589
Threads = 4: π = 3.141592653591
Threads = 8: π = 3.141592653590
```

**Reasons**:

1. **Floating-point rounding**: Floating-point addition is not associative, so the order of additions affects the rounding error
   - (a + b) + c can differ from a + (b + c) in the last few bits

2. **Different summation order**: 
   - 1 thread: sum[0] + sum[1] + sum[2] + ...
   - 4 threads: (sum[0]+sum[1]) + (sum[2]+sum[3]) + ...

3. **Not an error**: All results are correct to many decimal places

4. **Mitigation**: If the variation matters, use higher precision (e.g., `long double`) or the Kahan summation algorithm

<a id='section6'></a>
## 6. Question 3: Matrix Multiplication

### Problem Statement

Parallelize matrix multiplication: **C = A × B**

```
A[N×M] × B[M×P] = C[N×P]
```

### Algorithm

```cpp
for (i = 0; i < N; i++) {
    for (j = 0; j < P; j++) {
        C[i][j] = 0;
        for (k = 0; k < M; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
```

### Serial Implementation (P3Q3.cpp)

```cpp
#include <cstdio>    // printf
#include <iostream>
#include <omp.h>
#include <cstdlib>

#define N 1000  // Matrix size

double A[N][N], B[N][N], C[N][N];

void initialize() {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i][j] = rand() % 10;
            B[i][j] = rand() % 10;
            C[i][j] = 0;
        }
    }
}

void multiply_serial() {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

int main() {
    initialize();
    
    double start = omp_get_wtime();
    multiply_serial();
    double end = omp_get_wtime();
    
    printf("Serial time: %.6f seconds\n", end - start);
    
    return 0;
}
```

### Parallel Implementation

**Key insight**: Outer loop iterations are independent!

```cpp
void multiply_parallel() {
    int i, j, k;
    
    // Parallelize outer loop.
    // i is private automatically (it is the parallel loop's variable);
    // j and k are declared outside the loop, so they must be listed as private.
    #pragma omp parallel for private(j, k)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```

### Why `private(j, k)`?

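- **i**: Already private (loop variable of the `parallel for`)
- **j, k**: Declared before the loop, so they would be shared by default; `private(j, k)` gives each thread its own copies (declaring them inside the loops has the same effect)
- **A, B, C**: Shared (declared outside); each thread writes different rows of C, so there is no conflict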

### Work Distribution

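```
Matrix C (N = 1000 rows, 4 threads, static schedule):

Rows   0-249  ← Thread 0
Rows 250-499  ← Thread 1
Rows 500-749  ← Thread 2
Rows 750-999  ← Thread 3
```

With the static schedule that most implementations use by default, each thread computes one contiguous block of rows of C. A round-robin assignment (row 0 to thread 0, row 1 to thread 1, and so on) would require `schedule(static, 1)`.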

### Complete Implementation with Timing

```cpp
#include <cstdio>    // printf
#include <iostream>
#include <omp.h>
#include <cstdlib>
#include <cmath>

#define N 1000

double A[N][N], B[N][N], C_serial[N][N], C_parallel[N][N];

void initialize() {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i][j] = rand() % 10;
            B[i][j] = rand() % 10;
            C_serial[i][j] = 0;
            C_parallel[i][j] = 0;
        }
    }
}

void multiply_serial() {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                C_serial[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

void multiply_parallel() {
    int i, j, k;
    #pragma omp parallel for private(j, k)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                C_parallel[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

bool verify_results() {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (fabs(C_serial[i][j] - C_parallel[i][j]) > 1e-6) {
                return false;
            }
        }
    }
    return true;
}

int main() {
    initialize();
    
    // Serial execution
    double serial_start = omp_get_wtime();
    multiply_serial();
    double serial_end = omp_get_wtime();
    double serial_time = serial_end - serial_start;
    
    // Parallel execution
    double parallel_start = omp_get_wtime();
    multiply_parallel();
    double parallel_end = omp_get_wtime();
    double parallel_time = parallel_end - parallel_start;
    
    // Verify correctness
    if (verify_results()) {
        printf("✓ Results match!\n");
    } else {
        printf("✗ Results differ!\n");
    }
    
    // Performance report
    printf("\nPerformance Report:\n");
    printf("Matrix size: %d×%d\n", N, N);
    printf("Serial time: %.6f seconds\n", serial_time);
    printf("Parallel time: %.6f seconds\n", parallel_time);
    printf("Speedup: %.2fx\n", serial_time / parallel_time);
    printf("Threads: %d\n", omp_get_max_threads());
    
    return 0;
}
```

### Example Output

```
✓ Results match!

Performance Report:
Matrix size: 1000×1000
Serial time: 8.491261 seconds
Parallel time: 1.215151 seconds
Speedup: 6.99x
Threads: 8
```

### Performance Considerations

#### 1. Loop Order Matters

```cpp
// i-j-k: the inner loop walks down a column of B (stride N), which is cache-unfriendly
for (i) for (j) for (k)
    C[i][j] += A[i][k] * B[k][j];

// i-k-j: the inner loop walks along rows of B and C (stride 1), usually faster
for (i) for (k) for (j)
    C[i][j] += A[i][k] * B[k][j];
```

#### 2. Scheduling Strategies

```cpp
// Static: Divide iterations into equal chunks up front (the usual default)
#pragma omp parallel for schedule(static)

// Dynamic: Assign iterations dynamically (better for unbalanced work)
#pragma omp parallel for schedule(dynamic, 10)

// Guided: Start with large chunks, decrease over time
#pragma omp parallel for schedule(guided)
```

#### 3. Why Not 8x Speedup?

Reasons for < ideal speedup:
- **Memory bandwidth**: All threads accessing memory
- **Cache misses**: Matrix B is traversed column by column, but C/C++ stores arrays row by row, so consecutive accesses are N elements apart
- **Thread overhead**: Creating and synchronizing threads
- **Load imbalance**: Some threads finish earlier

<a id='section7'></a>
## 7. Introduction to CUDA (Optional)

### What is CUDA?

**CUDA** (Compute Unified Device Architecture) is NVIDIA's platform for GPU programming.

### CPU vs GPU Architecture

```
CPU (8 cores):              GPU (1000s of cores):
┌──────────────┐            ┌──────────────────────┐
│ ██ ██ ██ ██  │            │ ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ │
│ ██ ██ ██ ██  │            │ ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ │
│              │            │ ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ │
│ Large Cache  │            │ ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ │
└──────────────┘            └──────────────────────┘
 Few powerful cores         Many simple cores
 Good for complex logic     Good for data-parallel
```

### When to Use GPU?

✅ **Good for GPU**:
- Matrix operations
- Image processing
- Machine learning
- Scientific simulations
- Cryptography

❌ **Not good for GPU**:
- Branch-heavy code
- Sequential algorithms
- Small datasets
- Frequent host-device transfers

### CUDA Programming Model

```
HOST (CPU)              DEVICE (GPU)
    |                       |
    |   1. Allocate GPU     |
    |      memory           |
    |---------------------->|
    |                       |
    |   2. Copy data        |
    |      CPU → GPU        |
    |---------------------->|
    |                       |
    |   3. Launch kernel    |
    |    <<<blocks,threads>>>|
    |---------------------->|
    |                       |
    |                  [GPU computes]
    |                       |
    |   4. Copy results     |
    |      GPU → CPU        |
    |<----------------------|
    |                       |
    |   5. Free GPU memory  |
    |---------------------->|
```

### CUDA Key Concepts

#### 1. Kernel Functions

```cpp
__global__ void kernel_function(int* data) {
    // Runs on GPU
    int idx = threadIdx.x;  // Thread index within its block (assumes a single-block launch)
    data[idx] = data[idx] * 2;
}
```

- `__global__`: Function runs on GPU, called from CPU
- `__device__`: Function runs on GPU, called from GPU
- `__host__`: Function runs on CPU (default)

#### 2. Thread Hierarchy

```
GRID
  ├─ BLOCK 0
  │   ├─ Thread 0
  │   ├─ Thread 1
  │   └─ Thread N
  ├─ BLOCK 1
  │   ├─ Thread 0
  │   └─ ...
  └─ BLOCK M
```

#### 3. Memory Types

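| Memory | Location | Scope | Speed |
|--------|----------|-------|-------|
| Registers | On-chip | Thread | Fastest |
| Shared | On-chip | Block | Fast |
| Local | GPU DRAM (cached) | Thread | Slow (register spills, large per-thread arrays) |
| Global | GPU DRAM | All threads | Slow |
| Constant | GPU DRAM (cached) | All threads, read-only | Fast on cache hits |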

#### 4. Key Functions

```cpp
// Allocate GPU memory
cudaMalloc((void**)&device_ptr, bytes);

// Copy CPU → GPU
cudaMemcpy(device_ptr, host_ptr, bytes, cudaMemcpyHostToDevice);

// Launch kernel
kernel<<<num_blocks, threads_per_block>>>(device_ptr);

// Copy GPU → CPU
cudaMemcpy(host_ptr, device_ptr, bytes, cudaMemcpyDeviceToHost);

// Free GPU memory
cudaFree(device_ptr);
```

<a id='section8'></a>
## 8. Question 4 & 5: CUDA Vector Addition and Matrix Multiplication

### Vector Addition Kernel

```cpp
#include <cmath>     // fabs
#include <cstdio>    // printf
#include <cstdlib>   // malloc, free
#include <iostream>
#include <cuda_runtime.h>

// CUDA kernel for vector addition
__global__ void vectorAdd(float* A, float* B, float* C, int N) {
    // Calculate global thread ID
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Boundary check
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1000000;  // Vector size
    size_t bytes = N * sizeof(float);
    
    // 1. Allocate host memory
    float *h_A, *h_B, *h_C;
    h_A = (float*)malloc(bytes);
    h_B = (float*)malloc(bytes);
    h_C = (float*)malloc(bytes);
    
    // Initialize vectors
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }
    
    // 2. Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    
    // 3. Copy data from host to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
    
    // 4. Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    
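    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // A kernel launch returns no status, so check for launch errors explicitly
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Wait for GPU to finish
    cudaDeviceSynchronize();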
    
    // 5. Copy results back to host
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    
    // Verify results
    bool correct = true;
    for (int i = 0; i < N && correct; i++) {
        if (fabs(h_C[i] - (h_A[i] + h_B[i])) > 1e-5) {
            correct = false;
        }
    }
    
    printf("Results: %s\n", correct ? "CORRECT" : "INCORRECT");
    
    // 6. Free memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);
    
    return 0;
}
```

### Thread ID Calculation

```cpp
int idx = blockIdx.x * blockDim.x + threadIdx.x;
```

Example:
```
N = 1000, threadsPerBlock = 256  →  blocksPerGrid = 4 (1024 threads launched)

Block 0: idx 0-255      (0*256 + 0..255)
Block 1: idx 256-511    (1*256 + 0..255)
Block 2: idx 512-767    (2*256 + 0..255)
Block 3: idx 768-1023   (3*256 + 0..255); only idx 768-999 pass the idx < N check
```

### Compilation Instructions

#### Windows (Visual Studio):
1. Create CUDA project: **File → New → Project → CUDA Runtime**
2. Save code as `.cu` file
3. Build: **Ctrl+B**
4. Run: **F5**

#### Linux:
```bash
nvcc -o vectorAdd vectorAdd.cu
./vectorAdd
```

### Performance Profiling

#### Using nvprof (Legacy)
```bash
nvprof ./vectorAdd
```

#### Using nsys (Modern)
```bash
# Generate profile
nsys profile -o report ./vectorAdd

# Summarize the report
nsys stats report.nsys-rep
```

**Note**: If you see compatibility warnings with nvprof, use nsys (NVIDIA Nsight Systems) instead.

### Expected Profile Output

```
Time(%)  Time      Calls  Avg       Min       Max       Name
50.06%   5.1200ms  1      5.1200ms  5.1200ms  5.1200ms  vectorAdd
38.32%   3.9200ms  3      1.3067ms  928.00us  1.5040ms  [CUDA memcpy HtoD]
11.61%   1.1800ms  1      1.1800ms  1.1800ms  1.1800ms  [CUDA memcpy DtoH]
```

**Analysis**:
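- Kernel execution: 5.12 ms
- Memory transfers: 5.10 ms in total (3.92 ms host-to-device + 1.18 ms device-to-host)
- Transfers take about half of the measured GPU time, so reducing or overlapping them is the first thing to optimize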

<a id='section9'></a>
## 9. Performance Summary

### Speedup Comparison (Example System: 8-core CPU, NVIDIA GPU)

| Task | Serial | OpenMP (8 threads) | CUDA | Best |
|------|--------|-------------------|------|------|
| Vector Add (1M) | 5ms | 1.2ms (4.2x) | 0.8ms (6.3x) | CUDA |
| Pi Calculation | 45ms | 6ms (7.5x) | N/A | OpenMP |
| Matrix Mul (1000×1000) | 8491ms | 1215ms (7.0x) | 145ms (58.6x) | CUDA |

### Key Takeaways

1. **OpenMP**: Best for quick parallelization of CPU code
   - Easy to add to existing code
   - Good speedup for CPU-bound tasks
   - Limited by CPU core count

2. **CUDA**: Best for massive data-parallel workloads
   - Requires more code changes
   - Excellent speedup for suitable problems
   - Memory transfer can be bottleneck

3. **When to use what**:
   - **Small datasets**: Stay on CPU (OpenMP)
   - **Complex logic**: CPU (OpenMP)
   - **Simple, data-parallel, large datasets**: GPU (CUDA)

<a id='section10'></a>
## 10. Summary and Key Concepts

### OpenMP Directives Covered

| Directive or Clause | Purpose | Typical Use |
|-----------|---------|--------|
| `#pragma omp parallel` | Create parallel region | Basic parallelism |
| `#pragma omp for` | Parallelize loop | Loop parallelization |
| `#pragma omp parallel for` | Combined directive | Most common usage |
| `reduction(+:var)` | Parallel reduction | Sum, product, etc. |
| `private(var)` | Thread-private variable | Avoid sharing |
| `shared(var)` | Shared variable | Explicit sharing |
| `firstprivate(var)` | Private with initialization | Copy initial value |

### Common Pitfalls

1. **Race conditions**: Multiple threads updating a shared variable without synchronization
2. **False sharing**: Variables on same cache line
3. **Load imbalance**: Some threads finish early
4. **Overhead**: Too many threads or too small workload
5. **Memory bandwidth**: All threads competing for memory

### Best Practices

1. **Start simple**: Add `#pragma omp parallel for` to loops
2. **Profile first**: Identify bottlenecks before optimizing
3. **Use reduction**: For accumulation operations
4. **Minimize sharing**: Make variables private when possible
5. **Check correctness**: Verify parallel results match serial

### Next Steps

In Practicals 4 and 5, you will learn:
- Critical sections and locks
- Barriers and synchronization
- Producer-consumer patterns
- Atomic operations
- Deadlock prevention

---

**End of Practical 3**

**Key Message**: Parallel programming requires careful consideration of data sharing and synchronization. OpenMP makes it easier, but understanding the underlying concepts is crucial for writing correct and efficient parallel code.