![NERSC Logo](NERSClogo087.png)
# NERSC Hands-On: Introduction to CUDA C/C++
## Part 3 Notebook 1: Understanding Kernels, Grids, and Blocks

Welcome to the NERSC introduction to CUDA C/C++! This 1-hour session is designed to run on Perlmutter via the NERSC JupyterHub.

**Goal:** Move from zero to writing, compiling, and running your first simple CUDA C++ programs. We will focus on the fundamental concepts of the CUDA programming model.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:

* **Verify** your GPU environment on Perlmutter.
* **Define** the core CUDA concepts: Kernel, Thread, Block, and Grid.
* **Write** a "Hello, World!" kernel in CUDA C++ using `%%writefile`.
* **Compile** CUDA C++ code using the NVIDIA compiler (`nvcc`).
* **Launch** a kernel and see parallel execution in action.
* **Implement** a simple, data-parallel kernel for vector addition.
* **Understand** the difference between Host (CPU) and Device (GPU) memory.

## ⚙️ 1. Setup: Verify Your Environment

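Let's run three commands:
1.  `module load cudatoolkit`: Loads the NVIDIA compiler and libraries.
2.  `nvcc --version`: Confirms that the NVIDIA compiler is on your `PATH`.
3.  `nvidia-smi`: The NVIDIA System Management Interface. This command shows us what GPU(s) are attached and their status.

Each `!` line in a notebook runs in its own temporary subshell, so the environment set up by `module load` lasts only until the end of that line. The first two commands therefore share a line.

In [None]:
!module load cudatoolkit; nvcc --version
!nvidia-smi

> ⚪️ **REFLECTION:** You should see the `nvcc` version banner, followed by a table listing one or more NVIDIA GPUs (e.g., "NVIDIA A100-SXM4-80GB"). The output from `nvidia-smi` confirms the GPU is visible to your session. If a later `!nvcc` cell reports that the command is not found, prefix that line with `module load cudatoolkit;` in the same way.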

## 2. Core Concept: The CUDA Execution Model

How does a GPU execute code? It follows a specific hierarchy.

1.  **Host (CPU) vs. Device (GPU):**
    * Your normal C++ `main()` function runs on the **Host** (the CPU).
    * Your parallel functions run on the **Device** (the GPU).

2.  **Kernel:**
    * A C/C++ function that runs on the GPU. You write it once, and it is executed by *many* threads in parallel.
    * You define a kernel using the `__global__` keyword.

3.  **Thread, Block, and Grid:**
    * **Thread:** The smallest unit of execution. It runs one copy of your kernel.
    * **Block:** A group of threads. Threads *within* a block can cooperate using fast **Shared Memory**.
    * **Grid:** A group of blocks. This is the full set of threads that you launch to run your kernel.



When you launch a kernel, you tell CUDA how many blocks to launch, and how many threads to put in each block:

```cpp
// C++ syntax for launching a kernel
int numBlocks = 64;
int threadsPerBlock = 256;
myKernel<<<numBlocks, threadsPerBlock>>>(...arguments...);
```

Inside the kernel, each thread has built-in variables to know *who* it is and *where* it is:
* `threadIdx.x`: Your thread ID *within* the block.
* `blockIdx.x`: Your block ID *within* the grid.
* `blockDim.x`: The total number of threads in your block.
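
These three values are commonly combined into one global index, `blockIdx.x * blockDim.x + threadIdx.x`. The arithmetic can be tried on the CPU before writing any GPU code. The following is a minimal pure-Python sketch, not CUDA code: the function name `global_thread_ids` is hypothetical, and the nested loops only stand in for blocks and threads that a GPU would run concurrently. It assumes a one-dimensional grid, so only the `.x` components matter.

```python
def global_thread_ids(num_blocks, threads_per_block):
    """Enumerate the global ID each thread of a 1D launch would compute."""
    ids = []
    for block_idx in range(num_blocks):              # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):  # plays the role of threadIdx.x
            # blockDim.x equals threads_per_block for a 1D block
            ids.append(block_idx * threads_per_block + thread_idx)
    return ids


ids = global_thread_ids(num_blocks=2, threads_per_block=8)
print(ids)  # 16 unique IDs, 0 through 15
print(len(global_thread_ids(64, 256)))  # 64 * 256 = 16384 threads in total
```

Every (block, thread) pair maps to a distinct integer, which is what lets each thread pick out its own piece of the data.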

## 3. "Hello, World!" in CUDA C++

Let's write our first kernel. We'll use the `%%writefile` "magic command" to create a new file named `hello_cuda.cu` right from this notebook cell. (`.cu` is the standard extension for CUDA C++ files.)

In [None]:
%%writefile hello_cuda.cu

#include <stdio.h>

/*
 * A CUDA Kernel: A function that runs on the GPU.
 * Defined by the __global__ keyword.
 */
__global__ void hello_kernel() {
    /*
     * Calculate this thread's unique, global ID.
     * This is the most common pattern in CUDA.
     */
    int blockID = blockIdx.x;
    int threadID_in_block = threadIdx.x;
    int threads_per_block = blockDim.x;
    
    int globalThreadID = blockID * threads_per_block + threadID_in_block;
    
    printf("Hello from global thread %d! (Block %d, Thread %d)\n", 
           globalThreadID, blockID, threadID_in_block);
}

/*
 * The main function runs on the Host (CPU).
 */
int main() {
    printf("--- Starting kernel from CPU ---\n\n");

    // Define the Grid and Block dimensions
    int numBlocks = 2;         // The Grid will have 2 blocks
    int threadsPerBlock = 8;   // Each block will have 8 threads

    // Launch the kernel on the Device (GPU)
    // Syntax: kernel_name<<<Grid, Block>>>(...arguments...)
    hello_kernel<<<numBlocks, threadsPerBlock>>>();

    /*
     * Kernel launches are asynchronous: control returns to the CPU
     * immediately. Wait for the GPU to finish before continuing.
     * This is CRUCIAL. Without it, main() might exit before the
     * kernel runs, and its printf output would never appear.
     */
    cudaDeviceSynchronize();

    printf("\n--- Kernel finished. Back to CPU. ---\n");
    return 0;
}

### Compile and Run

Now we compile the code with `nvcc` and run the resulting executable (`./hello_cuda`) right here.

In [None]:
# Compile the .cu file; -o specifies the output executable name
!nvcc -o hello_cuda hello_cuda.cu

# Run the executable
!./hello_cuda


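> ⚪️ **REFLECTION: Why is the output order not guaranteed?**
> You may have seen the "Hello..." messages from Block 1 printed before those from Block 0, and the order can change from run to run. **This is parallelism in action!**
> 
> All 16 threads (2 blocks * 8 threads/block) ran at *roughly* the same time, and CUDA makes no promise about which block runs first. They all tried to print to the screen (a single resource) at once. With a launch this small the lines within each block often appear in order, but you should never rely on any particular ordering. This is a fundamental concept of parallel programming.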

## 4. A Real Task: Vector Addition

"Hello World" is fun, but not useful. A classic parallel task is vector addition: `C[i] = A[i] + B[i]`.

The logic is simple: **map one thread to each element `i` of the array.**

This requires a new concept: **Host (CPU) memory vs. Device (GPU) memory.**

1.  Your data (vectors A, B, C) starts on the Host.
2.  You must **allocate** memory on the Device (GPU) with `cudaMalloc()`.
3.  You must **copy** data from Host to Device with `cudaMemcpy()`.
4.  You **launch the kernel** to compute `C = A + B` on the Device.
5.  You **copy** the result (vector C) from Device back to Host.
6.  You **free** the Device memory with `cudaFree()`.



In [None]:
%%writefile vector_add.cu

#include <stdio.h>
#include <stdlib.h>

// Helper function to check for CUDA errors
void check(cudaError_t err, const char* msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA Error: %s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

/* 
 * Kernel for element-wise vector addition
 * C[i] = A[i] + B[i]
 */
__global__ void vecAdd(float* A, float* B, float* C, int N) {
    // Get this thread's global index
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // IMPORTANT: Add a "bounds check"
    // We might launch more threads than N if N isn't a perfect multiple 
    // of our block size. This check prevents threads from writing 
    // out of bounds.
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    int N = 1024 * 1024; // Let's use a bigger vector: 1M elements
    size_t size = N * sizeof(float); // Total bytes

    // 1. Allocate memory on the Host (CPU)
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize Host data
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // 2. Allocate memory on the Device (GPU)
    float *d_A, *d_B, *d_C;
    check(cudaMalloc(&d_A, size), "cudaMalloc A");
    check(cudaMalloc(&d_B, size), "cudaMalloc B");
    check(cudaMalloc(&d_C, size), "cudaMalloc C");

    // 3. Copy data from Host to Device
    check(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice), "Memcpy H2D A");
    check(cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice), "Memcpy H2D B");

    // 4. Set up Grid/Block dimensions and launch Kernel
    int threadsPerBlock = 256;
    // This calculation finds the number of blocks needed to cover all N elements
    // (N + threadsPerBlock - 1) / threadsPerBlock is a standard integer 
    // arithmetic trick for ceiling(N / threadsPerBlock)
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    printf("Launching kernel with %d blocks and %d threads per block...\n", 
           blocksPerGrid, threadsPerBlock);

    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    check(cudaGetLastError(), "Kernel launch");

    // Wait for kernel to finish
    check(cudaDeviceSynchronize(), "Device synchronize");

    // 5. Copy result from Device back to Host
    check(cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost), "Memcpy D2H C");

    // 6. Verify the result on the Host
    bool success = true;
    for (int i = 0; i < N; i++) {
        if (h_C[i] != 3.0f) {
            printf("Verification FAILED at index %d! Got %f, expected 3.0\n", i, h_C[i]);
            success = false;
            break;
        }
    }
    if (success) {
        printf("Verification PASSED!\n");
    }

    // 7. Free Device and Host memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}

### Compile and Run Vector Add

In [None]:
!nvcc -o vector_add_gpu vector_add.cu
!./vector_add_gpu

> 🟡 **TASK: Experiment!**
> 1.  Go back to the `%%writefile vector_add.cu` cell.
> 2.  Change `int N = 1024 * 1024;` to `int N = 1024 * 1024 * 10;` (about 10 million elements).
> 3.  Rerun the `%%writefile` cell to save the file.
> 4.  Rerun the compile and run cell above.
> 
> Did it still work? Did it take noticeably longer? You just processed roughly 10 million elements in parallel.
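
To predict what the larger run asks of the GPU, the launch arithmetic from `vector_add.cu` can be reproduced on the CPU. This is a rough pure-Python sketch under stated assumptions: the helper names `launch_config` and `device_bytes` are hypothetical, elements are assumed to be 4-byte floats, and three vectors (A, B, C) are assumed to live on the device, as in the program above.

```python
def launch_config(n, threads_per_block=256):
    """Return (blocks, idle_threads) for a 1D launch covering n elements."""
    blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
    idle_threads = blocks * threads_per_block - n              # masked by `if (i < N)`
    return blocks, idle_threads


def device_bytes(n, bytes_per_element=4, num_vectors=3):
    """Bytes of device memory for A, B and C, assuming 4-byte floats."""
    return n * bytes_per_element * num_vectors


for n in (1024 * 1024, 1024 * 1024 * 10, 1000):
    blocks, idle = launch_config(n)
    mib = device_bytes(n) / 2**20
    print(f"N={n}: {blocks} blocks, {idle} idle threads, {mib:.2f} MiB on device")
```

Both vector sizes used in this notebook are exact multiples of 256, so no threads sit idle. A size such as N = 1000 would need 4 blocks and leave 24 threads idle, and those are the threads the bounds check in the kernel protects against.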

## 🐍 5. The Python Way: Numba and CuPy

Writing, compiling, and running C++ in a notebook is clunky. For prototyping and in Python-based workflows, we can access CUDA directly from Python.

* **CuPy:** Provides a `numpy`-like array interface (e.g., `cupy.array()`) that lives *on the GPU*. Operations like `A + B` are automatically run on the GPU.
* **Numba:** A just-in-time (JIT) compiler that can turn Python functions into CUDA kernels using a decorator (`@cuda.jit`).
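
With CuPy, the same addition needs no hand-written kernel. The sketch below is an illustration rather than part of the original exercise. It assumes CuPy is installed in the active kernel and falls back to NumPy on the CPU when CuPy or a usable GPU is missing, so the cell still runs either way. The `xp` alias and the `on_gpu` flag are illustrative names.

```python
import numpy as np

try:
    import cupy as xp
    # Raises (or reports zero devices) when no usable GPU/driver is present.
    if xp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device found")
    on_gpu = True
except Exception:
    xp = np          # CPU fallback so the example still runs
    on_gpu = False

N = 1024 * 1024  # 1M elements, as in the CUDA C++ example

# With CuPy these arrays are allocated directly in device memory.
a = xp.ones(N, dtype=xp.float32)
b = xp.full(N, 2.0, dtype=xp.float32)

# Element-wise addition; CuPy launches a GPU kernel behind the scenes.
c = a + b

# Copy back to the host only when the data actually lives on the GPU.
c_host = xp.asnumpy(c) if on_gpu else c

backend = "CuPy on GPU" if on_gpu else "NumPy fallback on CPU"
print(f"{backend}: all elements equal 3.0 -> {bool(np.all(c_host == 3.0))}")
```

The allocation, launch, and bounds-check steps are all hidden inside `a + b`. That is convenient for array-style code, while Numba remains the better fit when you need to write the per-thread logic yourself.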

Let's look at the *same* vector add kernel, but written in Python for Numba. Notice how similar it is!

In [None]:
import numpy as np
from numba import cuda

@cuda.jit
def vecAdd_numba_kernel(A, B, C, N):
    # Numba provides an easy way to get the global thread ID
    i = cuda.grid(1) 
    
    # The same bounds check is still needed!
    if i < N:
        C[i] = A[i] + B[i]

# --- Main program (now in Python) ---

N = 1024 * 1024 # 1M elements

# 1. Create Host arrays (using numpy)
h_A = np.ones(N, dtype=np.float32)
h_B = np.full(N, 2.0, dtype=np.float32)
h_C = np.zeros(N, dtype=np.float32)

# 2 & 3. Allocate AND Copy data from Host to Device
# Numba's cuda.to_device() handles both steps
d_A = cuda.to_device(h_A)
d_B = cuda.to_device(h_B)
d_C = cuda.to_device(h_C) # Allocate space for the result

# 4. Set up Grid/Block and launch kernel
threadsPerBlock = 256
blocksPerGrid = (N + threadsPerBlock - 1) // threadsPerBlock

print(f"Launching Numba kernel with {blocksPerGrid} blocks and {threadsPerBlock} threads...")

vecAdd_numba_kernel[blocksPerGrid, threadsPerBlock](d_A, d_B, d_C, N)

# 5. Copy result from Device back to Host
# d_C.copy_to_host() handles this
h_C_result = d_C.copy_to_host()

# 6. Verify
if np.all(h_C_result == 3.0):
    print("Numba verification PASSED!")
else:
    print("Numba verification FAILED!")

# 7. No manual free needed! Python's garbage collector handles it.

## 6. Recap & Next Steps

Congratulations! You have successfully written, compiled, and run code on a Perlmutter GPU using both raw CUDA C++ and Numba.

> ⚪️ **Key Takeaways**
> 
> * The **Host (CPU)** launches **Kernels** on the **Device (GPU)**.
> * A kernel is run by a **Grid** of **Blocks**, which are made of **Threads**.
> * The `blockIdx`, `threadIdx`, and `blockDim` variables are used to calculate a thread's unique global ID.
> * You must manage memory explicitly: `cudaMalloc`, `cudaMemcpy`, `cudaFree`.
> * Tools like Numba and CuPy provide a Python-friendly way to do the same thing.

### 📚 Resources

* **NERSC CUDA Docs:** [NERSC Perlmutter CUDA Documentation](https://docs.nersc.gov/development/programming-models/cuda/)
* **NVIDIA's Full Guide:** [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) (The official bible)
* **Numba CUDA:** [CUDA Target for Numba on GitHub](https://github.com/NVIDIA/numba-cuda)

--- 

### 🟢 Next Up

In the next notebook, we will explore how to make our kernels *fast* by using **Shared Memory** to optimize a matrix multiplication kernel.