First step into CUDA (Compute Unified Device Architecture) programming: CUDA is NVIDIA's platform for harnessing the massive parallel processing power of GPUs. This introductory notebook will guide you through the essentials: verifying your CUDA environment, understanding basic concepts like kernels, threads, and blocks, and running your first piece of code directly on a GPU.

## Prerequisites
This notebook assumes a basic understanding of C programming. You should be comfortable with:

- Variables, loops, and if/else logic.
- Function definition and invocation.
- Array allocation and usage.

To retrieve details about the GPU available in this environment, run the `nvidia-smi` command (NVIDIA System Management Interface) in your terminal.

In [None]:
!nvidia-smi

Mon Apr 14 12:49:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   57C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Sample code for CPU and GPU



```cpp
#include <stdio.h>

void CPUFunction()
{
  printf("This function is defined to run on the CPU.\n");
}

__global__ void GPUFunction()
{
  printf("This function is defined to run on the GPU.\n");
}

int main()
{
  CPUFunction();

  GPUFunction<<<1, 1>>>();
  cudaDeviceSynchronize();
}
```

`__global__ void GPUFunction()`
- The `__global__` keyword indicates that the following function will run on the GPU, and can be invoked **globally**, which in this context means either by the CPU, or, by the GPU.
  - Often, code executed on the CPU is referred to as **host** code, and code running on the GPU is referred to as **device** code.
  - Notice the return type `void`. It is required that functions defined with the `__global__` keyword return type `void`.

`GPUFunction<<<1, 1>>>();`
  - A function called to run on the GPU is typically referred to as a **kernel**, and calling it is referred to as **launching** it.
  - When launching a kernel, an **execution configuration** must be provided, which is done by using the `<<< ... >>>` syntax just prior to passing the kernel any expected arguments.
  - At a high level, execution configuration allows programmers to specify the **thread hierarchy** for a kernel launch, which defines the number of thread groupings (called **blocks**), as well as how many **threads** to execute in each block.

`cudaDeviceSynchronize();`
  - Unlike much C/C++ code, launching kernels is **asynchronous**: the CPU code will continue to execute *without waiting for the kernel launch to complete*.
  - A call to `cudaDeviceSynchronize`, a function provided by the CUDA runtime, will cause the host (CPU) code to wait until the device (GPU) code completes, and only then resume execution on the CPU.

CUDA source files use the `.cu` extension and are compiled with `nvcc`, the NVIDIA CUDA compiler. For example: *nvcc -arch=sm_70 -o cuda_program cuda_program.cu*

  - `nvcc` invokes the compiler from the command line.
  - `cuda_program.cu` is passed as the file to compile.
  - The `-o` flag specifies the output file for the compiled program.
  - The `-arch` flag indicates the **architecture** for which the files must be compiled; `sm_70` corresponds to compute capability 7.0.

## Creating and Launching Parallel Kernels

The execution configuration allows programmers to specify how many groups of threads - called **thread blocks**, or just **blocks** - and how many threads they would like each thread block to contain. The syntax for this is:
`<<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>`

**The kernel code is executed by every thread in every thread block configured when the kernel is launched**.

Thus, under the assumption that a kernel called `someKernel` has been defined, the following are true:
  - `someKernel<<<1, 1>>>()` is configured to run in a single thread block which has a single thread and will therefore run only once.
  - `someKernel<<<1, 10>>>()` is configured to run in a single thread block which has 10 threads and will therefore run 10 times.
  - `someKernel<<<10, 1>>>()` is configured to run in 10 thread blocks which each have a single thread and will therefore run 10 times.
  - `someKernel<<<10, 10>>>()` is configured to run in 10 thread blocks which each have 10 threads and will therefore run 100 times.
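
The counts above can be checked with a small host-only sketch. It uses nested loops to stand in for the blocks and threads of a launch and counts how often the kernel body would run. The helper name `countKernelExecutions` is hypothetical and not part of the CUDA API, and the loops run sequentially here, whereas a real GPU runs the threads in parallel.

```cpp
// Host-only emulation of a 1D kernel launch; compiles with g++ or nvcc.
// `countKernelExecutions` is a hypothetical helper, not a CUDA API function.

// Returns how many times the kernel body would run for
// someKernel<<<numBlocks, threadsPerBlock>>>().
constexpr int countKernelExecutions(int numBlocks, int threadsPerBlock)
{
  int executions = 0;
  for (int block = 0; block < numBlocks; ++block)            // stands in for blockIdx.x
  {
    for (int thread = 0; thread < threadsPerBlock; ++thread) // stands in for threadIdx.x
    {
      ++executions; // one execution of the kernel body per (block, thread) pair
    }
  }
  return executions;
}

// Compile-time checks matching the four configurations listed above.
static_assert(countKernelExecutions(1, 1) == 1, "<<<1, 1>>> runs once");
static_assert(countKernelExecutions(1, 10) == 10, "<<<1, 10>>> runs 10 times");
static_assert(countKernelExecutions(10, 1) == 10, "<<<10, 1>>> runs 10 times");
static_assert(countKernelExecutions(10, 10) == 100, "<<<10, 10>>> runs 100 times");
```

If the file compiles, every check holds; the function can also be called at run time from any `main`.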

## CUDA Thread Hierarchy

Each thread is given an index within its thread block, starting at `0`. Additionally, each block is given an index, starting at `0`. Just as threads are grouped into thread blocks, blocks are grouped into a **grid**, which is the highest entity in the CUDA thread hierarchy. In summary, CUDA kernels are executed in a grid of 1 or more blocks, with each block containing the same number of 1 or more threads.

CUDA kernels have access to special variables identifying both the index of the thread (within the block) that is executing the kernel, and, the index of the block (within the grid) that the thread is within. These variables are `threadIdx.x` and `blockIdx.x` respectively.

## Optimizing For Loops using CUDA
For loops in CPU-only applications are ripe for acceleration: rather than run each iteration of the loop serially, each iteration of the loop can be run in parallel in its own thread. Consider the following for loop, and notice, though it is obvious, that it controls how many times the loop will execute, as well as defining what will happen for each iteration of the loop:

```cpp
int N = 2<<20;
for (int i = 0; i < N; ++i)
{
  printf("%d\n", i);
}
```
In order to parallelize this loop, 2 steps must be taken:

- A kernel must be written to do the work of a **single iteration of the loop**.
- Because each thread running the kernel is unaware of the other threads, the execution configuration must be chosen so that the kernel executes the correct number of times, for example, the number of times the loop would have iterated.


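To find the compute capability of your GPU, which determines the `sm_XX` value passed to `-arch`, run *nvidia-smi --query-gpu=compute_cap --format=csv,noheader*. A compute capability of `7.5`, for example, corresponds to `-arch=sm_75`.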

In [None]:
!nvidia-smi --query-gpu=compute_cap --format=csv,noheader

7.5


In [None]:
%%writefile sample_code1.cu
#include <stdio.h>

void CPUFunction(int N)
{
    for (int i=0; i<N; i++)
    {
        printf("CPU: %d\n", i);
    }
}

__global__
void GPUFunction(int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N)
    {
        printf("GPU: %d\n", idx);
    }
}

int main()
{
    int N = 20;

    CPUFunction(N);
    GPUFunction<<<N,1>>>(N);
    cudaDeviceSynchronize();
    return 0;
}


Overwriting sample_code1.cu


In [None]:
!nvcc sample_code1.cu -o sample_code1 -arch=sm_75

In [None]:
!./sample_code1

CPU: 0
CPU: 1
CPU: 2
CPU: 3
CPU: 4
CPU: 5
CPU: 6
CPU: 7
CPU: 8
CPU: 9
CPU: 10
CPU: 11
CPU: 12
CPU: 13
CPU: 14
CPU: 15
CPU: 16
CPU: 17
CPU: 18
CPU: 19
GPU: 2
GPU: 12
GPU: 7
GPU: 17
GPU: 0
GPU: 3
GPU: 10
GPU: 13
GPU: 4
GPU: 5
GPU: 14
GPU: 15
GPU: 8
GPU: 11
GPU: 18
GPU: 1
GPU: 9
GPU: 19
GPU: 6
GPU: 16


## Using blocks and threads for further parallelization

There is a limit to the number of threads that can exist in a thread block: 1024 to be precise. In order to increase the amount of parallelism in accelerated applications, we must be able to coordinate among multiple thread blocks.

CUDA Kernels have access to a special variable that gives the number of threads in a block: `blockDim.x`. Using this variable, in conjunction with `blockIdx.x` and `threadIdx.x`, increased parallelization can be accomplished by organizing parallel execution across multiple blocks of multiple threads with the idiomatic expression `threadIdx.x + blockIdx.x * blockDim.x`. Here is a detailed example.

The execution configuration `<<<10, 10>>>` would launch a grid with a total of 100 threads, contained in 10 blocks of 10 threads. We would therefore hope for each thread to have the ability to calculate some index unique to itself between `0` and `99`.

- If `blockIdx.x` equals `0`, then `blockIdx.x * blockDim.x` is `0`. Adding each possible `threadIdx.x` value (`0` through `9`) yields the indices `0` through `9` within the 100-thread grid.
- If `blockIdx.x` equals `1`, then `blockIdx.x * blockDim.x` is `10`. Adding each possible `threadIdx.x` value (`0` through `9`) yields the indices `10` through `19` within the 100-thread grid.
- If `blockIdx.x` equals `5`, then `blockIdx.x * blockDim.x` is `50`. Adding each possible `threadIdx.x` value (`0` through `9`) yields the indices `50` through `59` within the 100-thread grid.
- If `blockIdx.x` equals `9`, then `blockIdx.x * blockDim.x` is `90`. Adding each possible `threadIdx.x` value (`0` through `9`) yields the indices `90` through `99` within the 100-thread grid.
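
As a rough check of this arithmetic, the following host-only sketch evaluates the same expression for every (block, thread) pair and confirms that the resulting indices cover `0` through `99` without gaps or repeats. The helper names `globalIndex` and `indicesAreContiguous` are hypothetical; on the GPU the values come from the built-in `threadIdx.x`, `blockIdx.x`, and `blockDim.x` variables.

```cpp
// Host-only sketch of the global index arithmetic; compiles with g++ or nvcc.
// Both helpers are hypothetical and not part of the CUDA API.

// The idiomatic expression threadIdx.x + blockIdx.x * blockDim.x.
constexpr int globalIndex(int threadIdxX, int blockIdxX, int blockDimX)
{
  return threadIdxX + blockIdxX * blockDimX;
}

// Walks blocks and threads in order and returns true if the global indices
// are exactly 0, 1, 2, ..., numBlocks * threadsPerBlock - 1.
constexpr bool indicesAreContiguous(int numBlocks, int threadsPerBlock)
{
  int expected = 0;
  for (int block = 0; block < numBlocks; ++block)
  {
    for (int thread = 0; thread < threadsPerBlock; ++thread)
    {
      if (globalIndex(thread, block, threadsPerBlock) != expected)
      {
        return false;
      }
      ++expected;
    }
  }
  return expected == numBlocks * threadsPerBlock;
}

static_assert(globalIndex(0, 5, 10) == 50, "first thread of block 5");
static_assert(globalIndex(9, 9, 10) == 99, "last thread of block 9");
static_assert(indicesAreContiguous(10, 10), "<<<10, 10>>> covers 0 through 99");
```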

## Allocating Memory to be accessed on the GPU and the CPU

CUDA 6 and later make it easy to allocate memory that is available to both the CPU host and any number of GPU devices. There are many [intermediate and advanced techniques](http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#memory-optimizations) for memory management that yield the best performance in accelerated applications, but the basic technique covered here already supports substantial performance gains over CPU-only applications with almost no developer overhead.

To allocate and free memory, and obtain a pointer that can be referenced in both host and device code, replace calls to `malloc` and `free` with `cudaMallocManaged` and `cudaFree` as in the following example:

```cpp
// CPU-only

int N = 2<<20;
size_t size = N * sizeof(int);

int *a;
a = (int *)malloc(size);

// Use `a` in CPU-only program.

free(a);
```

```cpp
// Accelerated

int N = 2<<20;
size_t size = N * sizeof(int);

int *a;
// Note the address of `a` is passed as first argument.
cudaMallocManaged(&a, size);

// Use `a` on the CPU and/or on any GPU in the accelerated system.

cudaFree(a);
```

## Handling Block Configuration Mismatches to Number of Needed Threads

Sometimes you can't configure a GPU launch to create exactly the number of threads needed for a loop. For instance, due to hardware constraints, you might want blocks with a multiple of 32 threads for efficiency. If you choose 256 threads per block but need exactly 1000 threads, there’s no integer number of blocks that gives exactly 1000 threads.

To handle this, you can:

- Launch more threads than needed.
- Pass the actual work size (the total number of threads needed, `N`) as a kernel argument.
- Inside the kernel, calculate the global thread index (`threadIdx.x + blockIdx.x * blockDim.x`) and perform work only if the index is less than `N`.

This method ensures you always have at least enough threads to cover `N`, with less than one extra block’s worth of idle threads.

```cpp
// Assume `N` is known
int N = 100000;

// Assume we have a desire to set `threads_per_block` exactly to `256`
size_t threads_per_block = 256;

// Ensure there are at least `N` threads in the grid, but only 1 block's worth extra
size_t number_of_blocks = (N + threads_per_block - 1) / threads_per_block;

some_kernel<<<number_of_blocks, threads_per_block>>>(N);
```

Because the execution configuration above results in more threads in the grid than `N`, care will need to be taken inside of the `some_kernel` definition so that `some_kernel` does not attempt to access out of range data elements, when being executed by one of the "extra" threads:

```cpp
__global__ void some_kernel(int N)
{
  int idx = threadIdx.x + blockIdx.x * blockDim.x;

  if (idx < N) // Check to make sure `idx` maps to some value within `N`
  {
    // Only do work if it does
  }
}
```
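
To make the rounding concrete, here is a minimal host-only sketch of the block-count calculation, along with the number of "extra" threads that the `idx < N` guard must filter out. The helper names `blocksNeeded` and `idleThreads` are hypothetical and introduced only for illustration.

```cpp
// Host-only sketch of the launch-size arithmetic; compiles with g++ or nvcc.
// Both helpers are hypothetical and not part of the CUDA API.

// Smallest number of blocks that provides at least `n` threads
// (integer ceiling division). Assumes n >= 0 and threadsPerBlock > 0.
constexpr int blocksNeeded(int n, int threadsPerBlock)
{
  return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of launched threads whose index is >= n and that therefore do no work.
constexpr int idleThreads(int n, int threadsPerBlock)
{
  return blocksNeeded(n, threadsPerBlock) * threadsPerBlock - n;
}

// 1000 elements with 256 threads per block: 4 blocks, 1024 threads, 24 idle.
static_assert(blocksNeeded(1000, 256) == 4, "4 blocks cover 1000 elements");
static_assert(idleThreads(1000, 256) == 24, "24 threads fail the idx < N check");

// The N = 100000 example above: 391 blocks, 100096 threads, 96 idle.
static_assert(blocksNeeded(100000, 256) == 391, "391 blocks cover 100000 elements");
static_assert(idleThreads(100000, 256) == 96, "96 threads fail the idx < N check");
```

The idle count is always smaller than `threadsPerBlock`, which is why the extra cost of over-launching stays small.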

## Grid-Stride Loops for Larger Datasets
Either by choice, often to create the most performant execution configuration, or out of necessity, the number of threads in a grid may be smaller than the size of a data set. Consider an array with 1000 elements, and a grid with 250 threads (using trivial sizes here for ease of explanation). Here, each thread in the grid will need to be used 4 times. One common method to do this is to use a **grid-stride loop** within the kernel.

In a grid-stride loop, each thread calculates its unique index within the grid using `threadIdx.x + blockIdx.x * blockDim.x`, performs its operation on the element at that index within the array, then adds the number of threads in the grid to its index and repeats until it is out of range of the array. For example, for a 500 element array and a 250 thread grid, the thread with index 20 in the grid would:

- Perform its operation on element 20 of the 500 element array
- Increment its index by 250, the size of the grid, resulting in 270
- Perform its operation on element 270 of the 500 element array
- Increment its index by 250, the size of the grid, resulting in 520
- Because 520 is now out of range for the array, the thread will stop its work

CUDA provides a special variable giving the number of blocks in a grid, `gridDim.x`. Calculating the total number of threads in a grid then is simply the number of blocks in a grid multiplied by the number of threads in each block, `gridDim.x * blockDim.x`. With this in mind, here is a verbose example of a grid-stride loop within a kernel:

```cpp
__global__ void kernel(int *a, int N)
{
  int indexWithinTheGrid = threadIdx.x + blockIdx.x * blockDim.x;
  int gridStride = gridDim.x * blockDim.x;

  for (int i = indexWithinTheGrid; i < N; i += gridStride)
  {
    // do work on a[i];
  }
}
```
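
The walkthrough for thread 20 can be reproduced on the host. The sketch below runs the same loop header as the kernel for a single thread and for every thread in the grid, then checks that each array element is handled exactly once in total. All helper names are hypothetical, and the code only emulates the loop bounds; it performs no GPU work.

```cpp
// Host-only emulation of a grid-stride loop; compiles with g++ or nvcc.
// All helpers are hypothetical and not part of the CUDA API.

// Number of elements processed by the thread whose global index is `startIndex`.
constexpr int elementsHandledByThread(int startIndex, int gridSize, int n)
{
  int handled = 0;
  for (int i = startIndex; i < n; i += gridSize) // same loop header as the kernel
  {
    ++handled;
  }
  return handled;
}

// Last element index visited by that thread, or -1 if it visits none.
constexpr int lastElementVisited(int startIndex, int gridSize, int n)
{
  int last = -1;
  for (int i = startIndex; i < n; i += gridSize)
  {
    last = i;
  }
  return last;
}

// Total work done by all `gridSize` threads; equals n when nothing is skipped
// and nothing is processed twice.
constexpr int totalElementsHandled(int gridSize, int n)
{
  int total = 0;
  for (int thread = 0; thread < gridSize; ++thread)
  {
    total += elementsHandledByThread(thread, gridSize, n);
  }
  return total;
}

// Thread 20 in a 250-thread grid over 500 elements visits 20 and 270, then stops.
static_assert(elementsHandledByThread(20, 250, 500) == 2, "elements 20 and 270");
static_assert(lastElementVisited(20, 250, 500) == 270, "520 is out of range");

// With 1000 elements each of the 250 threads is used 4 times.
static_assert(elementsHandledByThread(20, 250, 1000) == 4, "20, 270, 520, 770");
static_assert(totalElementsHandled(250, 1000) == 1000, "every element handled once");
```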

---
## Error Handling

As in any application, error handling in accelerated CUDA code is essential. Many, if not most CUDA functions (see, for example, the [memory management functions](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY)) return a value of type `cudaError_t`, which can be used to check whether or not an error occurred while calling the function. Here is an example where error handling is performed for a call to `cudaMallocManaged`:

```cpp
cudaError_t err;
err = cudaMallocManaged(&a, N);                   // Assume the existence of `a` and `N`.

if (err != cudaSuccess)                           // `cudaSuccess` is provided by CUDA.
{
  printf("Error: %s\n", cudaGetErrorString(err)); // `cudaGetErrorString` is provided by CUDA.
}
```

Kernel launches do not return a value of type `cudaError_t`, because kernels are defined to return `void`. To check for errors occurring at the time of a kernel launch, for example if the launch configuration is erroneous, CUDA provides the `cudaGetLastError` function, which does return a value of type `cudaError_t`.

```cpp
/*
 * This launch should cause an error, but the kernel itself
 * cannot return it.
 */

someKernel<<<1, -1>>>();  // -1 is not a valid number of threads.

cudaError_t err;
err = cudaGetLastError(); // `cudaGetLastError` will return the error from above.
if (err != cudaSuccess)
{
  printf("Error: %s\n", cudaGetErrorString(err));
}
```

Finally, in order to catch errors that occur asynchronously, for example during the execution of an asynchronous kernel, it is essential to check the status returned by a subsequent synchronizing CUDA runtime API call, such as `cudaDeviceSynchronize`, which will return an error if one of the kernels launched previously should fail.

---
### CUDA Error Handling Function

It can be helpful to create a helper function that wraps CUDA function calls and checks them for errors. Here is an example; feel free to use it in the remaining exercises:

```cpp
#include <stdio.h>
#include <assert.h>

inline cudaError_t checkCuda(cudaError_t result)
{
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
  return result;
}

int main()
{

/*
 * The helper function can be wrapped around any call returning
 * a value of type `cudaError_t`.
 */

  checkCuda( cudaDeviceSynchronize() );
}
```

## Grids and Blocks of 2 and 3 Dimensions

Grids and blocks can be defined to have up to 3 dimensions. Defining them with multiple dimensions does not impact their performance in any way, but can be very helpful when dealing with data that has multiple dimensions, for example, 2D matrices. To define either grids or blocks with 2 or 3 dimensions, use CUDA's `dim3` type as follows:

```cpp
dim3 threads_per_block(16, 16, 1);
dim3 number_of_blocks(16, 16, 1);
someKernel<<<number_of_blocks, threads_per_block>>>();
```

Given the example just above, the variables `gridDim.x`, `gridDim.y`, `blockDim.x`, and `blockDim.y` inside of `someKernel`, would all be equal to `16`.
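
A host-only sketch of how such a 2D launch might map onto a matrix is shown below. It applies the 1D index formula separately along `x` and `y`, then flattens the coordinates into a single array offset. The helper names are hypothetical, and the row-major layout is an assumption made for illustration rather than something the launch itself dictates.

```cpp
// Host-only sketch of 2D launch arithmetic; compiles with g++ or nvcc.
// All helpers are hypothetical and not part of the CUDA API.

// Global coordinate along one axis, for example
// threadIdx.x + blockIdx.x * blockDim.x for columns.
constexpr int globalCoord(int threadIdxAxis, int blockIdxAxis, int blockDimAxis)
{
  return threadIdxAxis + blockIdxAxis * blockDimAxis;
}

// Number of threads the grid spans along one axis (gridDim * blockDim).
constexpr int threadsAlongAxis(int gridDimAxis, int blockDimAxis)
{
  return gridDimAxis * blockDimAxis;
}

// Assumption: the matrix is stored row-major, so (row, col) maps to
// row * width + col.
constexpr int flatIndex(int row, int col, int width)
{
  return row * width + col;
}

// 16 blocks of 16 threads along each axis span 256 threads per axis,
// enough for one thread per element of a 256 x 256 matrix.
static_assert(threadsAlongAxis(16, 16) == 256, "256 threads along x and along y");
static_assert(globalCoord(15, 15, 16) == 255, "last thread along an axis");
static_assert(flatIndex(255, 255, 256) == 65535, "last element of a 256 x 256 matrix");
```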

In [None]:
%%writefile vector_add.cu

#include <stdio.h>

/*
 * Device kernel to initialize vector elements. This kernel
 * sets every element of `a` to the value `num`, using a
 * grid-stride loop.
 */

__global__
void initWith(float num, float *a, int N)
{
  int idx = threadIdx.x + blockIdx.x*blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = idx; i < N; i += stride)
  {
    a[i] = num;
  }
}

/*
 * Device kernel stores into `result` the sum of each
 * same-indexed value of `a` and `b`.
 */

__global__
void addVectorsInto(float *result, float *a, float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    result[i] = a[i] + b[i];
  }
}

/*
 * Host function to confirm values in `vector`. This function
 * assumes all values are the same `target` value.
 */

void checkElementsAre(float target, float *vector, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(vector[i] != target)
    {
      printf("FAIL: vector[%d] - %0.0f does not equal %0.0f\n", i, vector[i], target);
      exit(1);
    }
  }
  printf("Success! All values calculated correctly.\n");
}

int main()
{
  const int N = 2<<24;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;

  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  size_t threadsPerBlock;
  size_t numberOfBlocks;

  /*
   * nsys should register performance changes when execution configuration
   * is updated.
   */

  threadsPerBlock = 32;
  numberOfBlocks = (threadsPerBlock+N-1)/threadsPerBlock;

  //printf("%d \n", numberOfBlocks);

  int deviceId;
  cudaGetDevice(&deviceId);

  cudaDeviceProp props;

  cudaGetDeviceProperties(&props, deviceId);

  int multiProcessorCount = props.multiProcessorCount;

  int temp = numberOfBlocks % multiProcessorCount;
  if(temp != 0) {
      numberOfBlocks = numberOfBlocks + multiProcessorCount - temp;
     }

  //printf("%d \n", numberOfBlocks);

  cudaError_t addVectorsErr;
  cudaError_t asyncErr;

  // Prefetch before the first GPU access so the init kernels do not page fault.
  cudaMemPrefetchAsync(a, size, deviceId);
  cudaMemPrefetchAsync(b, size, deviceId);
  cudaMemPrefetchAsync(c, size, deviceId);

  initWith<<<numberOfBlocks, threadsPerBlock>>>(3, a, N);
  initWith<<<numberOfBlocks, threadsPerBlock>>>(4, b, N);
  initWith<<<numberOfBlocks, threadsPerBlock>>>(0, c, N);

  addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

  addVectorsErr = cudaGetLastError();
  if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));

  asyncErr = cudaDeviceSynchronize();
  if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));

  cudaMemPrefetchAsync(c, size, cudaCpuDeviceId);

  checkElementsAre(7, c, N);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
}


Writing vector_add.cu
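
In `main`, the block count is rounded up to a multiple of `props.multiProcessorCount`, the number of streaming multiprocessors (SMs) on the device. The host-only sketch below isolates that arithmetic. The helper name `roundUpToMultiple` is hypothetical, and the SM count of 40 is only an example value; the real number should come from `cudaGetDeviceProperties`.

```cpp
// Host-only sketch of the block-count rounding used in main(); compiles with
// g++ or nvcc. `roundUpToMultiple` is a hypothetical helper, not a CUDA API.

// Rounds `numberOfBlocks` up to the nearest multiple of `multiProcessorCount`.
// Assumes numberOfBlocks >= 0 and multiProcessorCount > 0.
constexpr int roundUpToMultiple(int numberOfBlocks, int multiProcessorCount)
{
  const int remainder = numberOfBlocks % multiProcessorCount;
  return remainder == 0 ? numberOfBlocks
                        : numberOfBlocks + multiProcessorCount - remainder;
}

// Example SM count of 40 (an assumption for illustration).
static_assert(roundUpToMultiple(100, 40) == 120, "100 rounds up to 3 * 40");
static_assert(roundUpToMultiple(80, 40) == 80, "exact multiples are unchanged");

// N = 2<<24 elements with 32 threads per block gives 1048576 blocks,
// which rounds up to 1048600 = 26215 * 40.
static_assert(roundUpToMultiple(1048576, 40) == 1048600, "block count for vector_add.cu");
```

Because the kernels use grid-stride loops, launching these few extra blocks does not change the result; the surplus threads simply find no elements in range.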


### Unified Memory Migration

When unified memory (UM) is allocated, the memory is not yet resident on either the host or the device. When either the host or device attempts to access the memory, a [page fault](https://en.wikipedia.org/wiki/Page_fault) will occur, at which point the host or device will migrate the needed data in batches. Similarly, at any point when the CPU, or any GPU in the accelerated system, attempts to access memory not yet resident on it, page faults will occur and trigger its migration.

The ability to page fault and migrate memory on demand is tremendously helpful for ease of development in your accelerated applications. Additionally, when working with data that exhibits sparse access patterns, for example when it is impossible to know which data will be required to be worked on until the application actually runs, and for scenarios when data might be accessed by multiple GPU devices in an accelerated system with multiple GPUs, on-demand memory migration is remarkably beneficial.

There are times - for example when data needs are known prior to runtime, and large contiguous blocks of memory are required - when the cost of page faulting and migrating data on demand would be better avoided.

`nsys profile` provides output describing UM behavior for the profiled application.

In the output of `nsys profile --stats=true` we should be looking for the following:

- Is there a _CUDA Memory Operation Statistics_ section in the output?
- If so, does it indicate host to device (HtoD) or device to host (DtoH) migrations?
- When there are migrations, what does the output say about how many _Operations_ there were? If you see many small memory migration operations, this is a sign that on-demand page faulting is occurring, with small memory migrations occurring each time there is a page fault in the requested location.

## Asynchronous Memory Prefetching

A powerful technique to reduce the overhead of page faulting and on-demand memory migrations, both in host-to-device and device-to-host memory transfers, is called **asynchronous memory prefetching**. Using this technique allows programmers to asynchronously migrate unified memory (UM) to any CPU or GPU device in the system, in the background, prior to its use by application code. By doing this, GPU kernels and CPU function performance can be increased on account of reduced page fault and on-demand data migration overhead.

Prefetching also tends to migrate data in larger chunks, and therefore fewer trips, than on-demand migration. This makes it an excellent fit when data access needs are known before runtime, and when data access patterns are not sparse.

CUDA makes it easy to asynchronously prefetch managed memory to either a GPU device or the CPU with its `cudaMemPrefetchAsync` function. Here is an example that prefetches data first to the currently active GPU device, and then to the CPU:

```cpp
int deviceId;
cudaGetDevice(&deviceId);                                         // The ID of the currently active GPU device.

cudaMemPrefetchAsync(pointerToSomeUMData, size, deviceId);        // Prefetch to GPU device.
cudaMemPrefetchAsync(pointerToSomeUMData, size, cudaCpuDeviceId); // Prefetch to host. `cudaCpuDeviceId` is a
                                                                  // built-in CUDA variable.
```

## Concurrent CUDA Streams
CUDA streams are like independent lanes on a highway for GPU tasks. They allow you to run multiple tasks concurrently (or overlap tasks such as computation and memory transfers), which can improve overall performance by hiding latencies.

In simple terms:

##### Single Stream:
* If you use one stream (the default stream), tasks (like kernel executions and memory copies) run one after another.

##### Multiple Streams:
* By creating multiple streams, you can submit different tasks to different lanes. The GPU can then execute tasks in parallel if they don’t depend on one another.

#### A Simple Example:
Imagine you want to perform two independent computations concurrently. Here's a small CUDA C/C++ example:

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

// A simple kernel that sets the first `n` elements of `data` to `value`.
__global__ void fillKernel(int *data, int value, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = value;
    }
}

int main() {
    const int arraySize = 1024;
    const int halfSize = arraySize / 2;
    const int arrayBytes = arraySize * sizeof(int);

    // Allocate device memory.
    int *d_data;
    cudaMalloc(&d_data, arrayBytes);

    // Create two CUDA streams.
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // Launch the kernel in stream1 to fill the first half of the array with 1's.
    fillKernel<<<2, 256, 0, stream1>>>(d_data, 1, halfSize);
    // Launch the kernel in stream2 to fill the second half of the array with 2's.
    // The two launches write disjoint elements, so they do not depend on each other.
    fillKernel<<<2, 256, 0, stream2>>>(d_data + halfSize, 2, halfSize);

    // Wait for both streams to finish.
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    // Clean up.
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(d_data);

    printf("Kernels launched in two CUDA streams have completed.\n");
    return 0;
}
```
### Explanation:
1. Memory Allocation:
Device memory is allocated for an array.

2. Stream Creation:
Two streams (`stream1` and `stream2`) are created.

3. Kernel Launches:
`fillKernel` is launched twice: once on `stream1` with the value 1, and once on `stream2` with the value 2. Because these streams are independent, the GPU may execute both kernels concurrently.

4. Synchronization:
The program waits until both streams are finished before cleaning up, ensuring all operations are complete.

By using streams, you can leverage the GPU’s ability to perform tasks concurrently, which can be beneficial for overlapping computation with data transfer, or for running multiple independent computations at once.

### Rules Governing the Behavior of CUDA Streams

There are a few rules, concerning the behavior of CUDA streams, that should be learned in order to utilize them effectively:

- Operations within a given stream occur in order.
- Operations in different non-default streams are not guaranteed to operate in any specific order relative to each other.
- The default stream is blocking and will both wait for all other streams to complete before running, and, will block other streams from running until it completes.

