In these exercises you will use Unified Memory (UM) to work with non-trivial data structures on the GPU.

## 6.1 Porting Linked Lists to GPUs

For your first task, you are given code that builds a linked list on the CPU and then attempts to print an element from it, first from CPU code and then from a GPU kernel. Your task is to modify the code using UM techniques so that the linked list can be traversed correctly from either CPU code or GPU code. Hint: only one line in the file needs to change.

Compile it using the following:

The `module load` command selects a CUDA compiler for your session; it only needs to be run once per login. `nvcc` is the CUDA compiler driver, and its command-line syntax is generally similar to gcc/g++.

Correct output should look like this:
```
key = 3
key = 3
```
If you need help, refer to linked_list_solution.cu.

### linked_list.cu

```
#include <cstdio>
#include <cstdlib>
// error checking macro
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

struct list_elem {
  int key;
  list_elem *next;
};

template <typename T>
void alloc_bytes(T &ptr, size_t num_bytes){

  ptr = (T)malloc(num_bytes);
}

__host__ __device__
void print_element(list_elem *list, int ele_num){
  list_elem *elem = list;
  for (int i = 0; i < ele_num; i++)
    elem = elem->next;
  printf("key = %d\n", elem->key);
}

__global__ void gpu_print_element(list_elem *list, int ele_num){
  print_element(list, ele_num);
}

const int num_elem = 5;
const int ele = 3;
int main(){

  list_elem *list_base, *list;
  alloc_bytes(list_base, sizeof(list_elem));
  list = list_base;
  for (int i = 0; i < num_elem; i++){
    list->key = i;
    alloc_bytes(list->next, sizeof(list_elem));
    list = list->next;}
  print_element(list_base, ele);
  gpu_print_element<<<1,1>>>(list_base, ele);
  cudaDeviceSynchronize();
  cudaCheckErrors("cuda error!");
}
```

Unified Memory lets the same pointer be dereferenced on both the host (CPU) and the device (GPU) without an explicit `cudaMemcpy`.

In the solution below, `alloc_bytes` calls `cudaMallocManaged()` instead of `malloc()`, so every pointer it returns is valid on both the CPU and the GPU.

Because every node is allocated through `alloc_bytes`, each `list_elem` and the `next` pointer it holds refer to managed memory, so the list can be traversed from host or device code.


In [None]:
%%writefile linked_list_solution.cu

#include <cstdio>
#include <cstdlib>
// error checking macro
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

struct list_elem {
  int key;
  list_elem *next;
};

template <typename T>
void alloc_bytes(T &ptr, size_t num_bytes){

  cudaMallocManaged(&ptr, num_bytes);
}

__host__ __device__
void print_element(list_elem *list, int ele_num){
  list_elem *elem = list;
  for (int i = 0; i < ele_num; i++)
    elem = elem->next;
  printf("key = %d\n", elem->key);
}

__global__ void gpu_print_element(list_elem *list, int ele_num){
  print_element(list, ele_num);
}

const int num_elem = 5;
const int ele = 3;
int main(){

  list_elem *list_base, *list;
  alloc_bytes(list_base, sizeof(list_elem));
  list = list_base;
  for (int i = 0; i < num_elem; i++){
    list->key = i;
    alloc_bytes(list->next, sizeof(list_elem));
    list = list->next;}
  print_element(list_base, ele);
  gpu_print_element<<<1,1>>>(list_base, ele);
  cudaDeviceSynchronize();
  cudaCheckErrors("cuda error!");
}

Writing linked_list_solution.cu


In [None]:
! nvcc -arch=sm_75 linked_list_solution.cu -o linked_list_solution
! ./linked_list_solution

key = 3
key = 3
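
The solution above leaves the final `next` pointer uninitialized and never frees the nodes, which is fine for printing element 3 but not for walking the whole list. The following is a minimal sketch of one way to handle that, assuming a null-terminated list; `new_elem`, `sum_keys`, `gpu_sum_keys`, and `free_list` are hypothetical helper names that are not part of the exercise files.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

struct list_elem {
  int key;
  list_elem *next;
};

// Hypothetical helper: allocate one managed node, aborting on failure.
list_elem *new_elem(int key) {
  list_elem *e = nullptr;
  cudaError_t err = cudaMallocManaged(&e, sizeof(list_elem));
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
    exit(1);
  }
  e->key = key;
  e->next = nullptr;  // every new node starts out as a valid list tail
  return e;
}

// Walk to the end of the list. Callable from host or device because all
// nodes live in managed memory.
__host__ __device__ int sum_keys(const list_elem *list) {
  int sum = 0;
  for (const list_elem *e = list; e != nullptr; e = e->next) sum += e->key;
  return sum;
}

__global__ void gpu_sum_keys(const list_elem *list, int *result) {
  *result = sum_keys(list);
}

// Release every node; cudaFree is the counterpart of cudaMallocManaged.
void free_list(list_elem *list) {
  while (list != nullptr) {
    list_elem *next = list->next;
    cudaFree(list);
    list = next;
  }
}

int main() {
  const int num_elem = 5;
  list_elem *head = new_elem(0);
  list_elem *tail = head;
  for (int i = 1; i < num_elem; i++) {
    tail->next = new_elem(i);
    tail = tail->next;
  }

  int *gpu_sum = nullptr;
  cudaMallocManaged(&gpu_sum, sizeof(int));
  gpu_sum_keys<<<1, 1>>>(head, gpu_sum);
  cudaDeviceSynchronize();  // wait for the kernel before the host reads gpu_sum
  printf("host sum = %d, device sum = %d\n", sum_keys(head), *gpu_sum);

  cudaFree(gpu_sum);
  free_list(head);
  return 0;
}
```

With keys 0 through 4, both sums should come out to 10. Allocating each node separately keeps the sketch close to the exercise; a real application with many nodes would more likely carve nodes out of one larger managed allocation.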


## 6.2 Array Increment

In this exercise, you are given code that increments a large array on the GPU.

a. First, compile and profile the code as-is:

```
nvcc -o array_inc array_inc.cu
nsys profile --stats=true ./array_inc
```

Make a note of the kernel execution duration.

b. Now modify the code to use managed memory: replace the malloc allocation with cudaMallocManaged, and eliminate the separate device allocation and the cudaMemcpy operations. Do you need to replace the device-to-host cudaMemcpy with a cudaDeviceSynchronize()? Why? Compile and profile the code again, compare the kernel execution duration to the previous result, and note the CPU and GPU page faults reported by the profiler.

c. Now, modify the code to insert prefetching of the array to the GPU immediately before the kernel call, and back to the CPU immediately after the kernel call. Compile and profile the code again. Compare the kernel execution time to the previous results. Are there still any page faults? Why?

d. Bonus: Modify the code to run the inc() kernel 10000 times in a row instead of just once. What can be said about the impact of memory operations on our runtime? What would this suggest for a real-world application?

If you need help, refer to array_inc_solution.cu.
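
For the bonus step (d), one way to see how memory operations are amortized is to time the first kernel launch separately from the remaining launches. The sketch below assumes a managed allocation with no prefetching and uses CUDA events for timing; `time_inc` is a hypothetical helper, not part of the exercise files, and the measured numbers will depend on the GPU and driver.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void inc(int *array, size_t n) {
  size_t idx = threadIdx.x + blockDim.x * blockIdx.x;
  while (idx < n) {
    array[idx]++;
    idx += blockDim.x * gridDim.x;  // grid-stride loop
  }
}

// Hypothetical helper: launch inc() `iters` times back to back and return
// the elapsed GPU time in milliseconds, measured with CUDA events.
float time_inc(int *array, size_t n, int iters) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < iters; i++) inc<<<256, 256>>>(array, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);  // block until all launches have finished
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  const size_t ds = 32ULL * 1024ULL * 1024ULL;
  const int num_iters = 10000;

  int *array = nullptr;
  cudaMallocManaged(&array, ds * sizeof(int));
  memset(array, 0, ds * sizeof(int));  // first touch happens on the CPU

  // On GPUs that migrate pages on demand, the first launch also pays for
  // moving the array to the GPU.
  float first_ms = time_inc(array, ds, 1);
  // Later launches find the data already resident on the GPU.
  float rest_ms = time_inc(array, ds, num_iters - 1);

  printf("first launch: %.3f ms\n", first_ms);
  printf("remaining %d launches: %.3f ms total, %.3f ms each\n",
         num_iters - 1, rest_ms, rest_ms / (num_iters - 1));
  printf("array[0] = %d (expected %d)\n", array[0], num_iters);

  cudaFree(array);
  return 0;
}
```

If the first launch is much slower than the average of the later ones, the difference is the one-time cost of data movement, which matters less the more work is done on the data while it stays on the GPU.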

### array_inc.cu

```
#include <cstdio>
#include <cstdlib>
#include <cstring> // for memset
// error checking macro
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

template <typename T>
void alloc_bytes(T &ptr, size_t num_bytes){

  ptr = (T)malloc(num_bytes);
}

__global__ void inc(int *array, size_t n){
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < n){
    array[idx]++;
    idx += blockDim.x*gridDim.x; // grid-stride loop
    }
}

const size_t  ds = 32ULL*1024ULL*1024ULL;

int main(){

  int *h_array, *d_array;
  alloc_bytes(h_array, ds*sizeof(h_array[0]));
  cudaMalloc(&d_array, ds*sizeof(d_array[0]));
  cudaCheckErrors("cudaMalloc Error");
  memset(h_array, 0, ds*sizeof(h_array[0]));
  cudaMemcpy(d_array, h_array, ds*sizeof(h_array[0]), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy H->D Error");
  inc<<<256, 256>>>(d_array, ds);
  cudaCheckErrors("kernel launch error");
  cudaMemcpy(h_array, d_array, ds*sizeof(h_array[0]), cudaMemcpyDeviceToHost);
  cudaCheckErrors("kernel execution or cudaMemcpy D->H Error");
  for (int i = 0; i < ds; i++)
    if (h_array[i] != 1) {printf("mismatch at %d, was: %d, expected: %d\n", i, h_array[i], 1); return -1;}
  printf("success!\n");
  return 0;
}
```

In [1]:
%%writefile array_inc.cu
#include <cstdio>
#include <cstdlib>
#include <cstring> // for memset
// error checking macro
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

template <typename T>
void alloc_bytes(T &ptr, size_t num_bytes){

  cudaMallocManaged(&ptr, num_bytes);
}

__global__ void inc(int *array, size_t n){
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < n){
    array[idx]++;
    idx += blockDim.x*gridDim.x; // grid-stride loop
    }
}

const size_t  ds = 32ULL*1024ULL*1024ULL;

int main(){

  int *h_array;
  alloc_bytes(h_array, ds*sizeof(h_array[0]));
  cudaCheckErrors("cudaMallocManaged Error");
  memset(h_array, 0, ds*sizeof(h_array[0]));
  cudaMemPrefetchAsync(h_array, ds*sizeof(h_array[0]), 0); // added in step c: move the array to GPU 0 before the kernel
  inc<<<256, 256>>>(h_array, ds);
  cudaCheckErrors("kernel launch error");
  cudaMemPrefetchAsync(h_array, ds*sizeof(h_array[0]), cudaCpuDeviceId); // added in step c: move the array back to the CPU
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel execution error");
  for (int i = 0; i < ds; i++)
    if (h_array[i] != 1) {printf("mismatch at %d, was: %d, expected: %d\n", i, h_array[i], 1); return -1;}
  printf("success!\n");
  return 0;
}

Writing array_inc.cu


In [2]:
! nvcc -arch=sm_75 array_inc.cu -o array_inc
! ./array_inc

success!


Why Use cudaMemPrefetchAsync()?

Unified Memory (UM) migrates pages between CPU and GPU on demand.
Because `h_array` is initialized on the CPU, the kernel's first access to each page triggers a GPU page fault and a migration, which inflates the kernel execution time.
`cudaMemPrefetchAsync(h_array, size, 0)` migrates the array to GPU 0 in bulk before the kernel runs, avoiding those faults.
After the kernel, `cudaMemPrefetchAsync(h_array, size, cudaCpuDeviceId)` moves the data back to the CPU before the host-side check.

📌 When Should You Use cudaMemPrefetchAsync()?\
✅ When data is allocated with `cudaMallocManaged` and you know which processor will touch it next.\
✅ Before a kernel launch: move the data to the GPU so the kernel does not stall on page faults.\
✅ After GPU computation: move the data back to the CPU before host code reads it.
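
Prefetching is not available on every system, so it can help to guard the calls. The sketch below assumes that `cudaDevAttrConcurrentManagedAccess` is the right capability to check and that `cudaMemPrefetchAsync` has the `(ptr, bytes, device)` form used in this document's solution; `prefetch_if_supported` is a hypothetical helper name.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void inc(int *array, size_t n) {
  size_t idx = threadIdx.x + blockDim.x * blockIdx.x;
  while (idx < n) {
    array[idx]++;
    idx += blockDim.x * gridDim.x;  // grid-stride loop
  }
}

// Hypothetical helper: issue a prefetch only if `gpu` reports concurrent
// managed access. Returns true when a prefetch was successfully enqueued.
bool prefetch_if_supported(const void *ptr, size_t bytes, int dst_device, int gpu) {
  int concurrent = 0;
  cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, gpu);
  if (!concurrent) return false;  // fall back to whatever migration the driver does
  return cudaMemPrefetchAsync(ptr, bytes, dst_device) == cudaSuccess;
}

int main() {
  const size_t n = 1 << 20;
  const size_t bytes = n * sizeof(int);

  int gpu = 0;
  cudaGetDevice(&gpu);

  int *array = nullptr;
  cudaMallocManaged(&array, bytes);
  memset(array, 0, bytes);

  bool to_gpu = prefetch_if_supported(array, bytes, gpu, gpu);
  inc<<<256, 256>>>(array, n);
  bool to_cpu = prefetch_if_supported(array, bytes, cudaCpuDeviceId, gpu);
  cudaDeviceSynchronize();  // required before the host reads the array

  printf("prefetch to GPU issued: %s\n", to_gpu ? "yes" : "no");
  printf("prefetch to CPU issued: %s\n", to_cpu ? "yes" : "no");
  printf("array[0] = %d\n", array[0]);

  cudaFree(array);
  return 0;
}
```

The kernel result is the same whether or not the prefetches are issued; prefetching is a performance hint about where the data should live, not a correctness requirement.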