# Programming Model Exercises

In this notebook, we will practice running a simple summation code in a number of different programming models. 
We will start by running it using a traditional CPU approach, before moving it onto the GPU and exploring several GPU programming models.

First, let's check that suitable GPUs are available with `rocm-smi`, move into the source directory, and make sure the working environment is clean.

In [None]:
rocm-smi
cd $HOME/DiRAC-AMD-GPU/notebooks/01-Programming_Model/1b-Programming_overview/source
make clean && rm -rf kokkos/build && rm -rf raja/build

## The code

You may want to take some time to familiarise yourself with the source code we will be using in this notebook before proceeding.
It is based on the examples used in the presentation, fleshed out into working code.

It begins by allocating memory for two `double` arrays.
The first array is initialised to contain `1` in every element, then each element of the second array is assigned twice the first array's value.
Finally, the sum of the second array is calculated.
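
As a rough illustration of these three steps, the following is a minimal sketch in C. It is not the contents of `cpu_code.c`: the function name `sum_doubled` and its error handling are assumptions made for this example, and the array length is passed in as a parameter.

```c
#include <stdlib.h>

/* Hypothetical helper (not taken from cpu_code.c): builds the two arrays
   described above and returns the sum of the second one.
   Expects n > 0; returns -1.0 if memory allocation fails. */
double sum_doubled(size_t n)
{
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }

    /* Step 1: initialise every element of the first array to 1. */
    for (size_t i = 0; i < n; i++)
        a[i] = 1.0;

    /* Step 2: each element of the second array is twice the first. */
    for (size_t i = 0; i < n; i++)
        b[i] = 2.0 * a[i];

    /* Step 3: sum the second array. */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += b[i];

    free(a);
    free(b);
    return sum;
}
```

For an array of length 100000, `sum_doubled` returns 200000, since every element of the second array is 2.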

Throughout this notebook we will look at how the different programming models can be applied to this code.
Build instructions for all of the examples are contained in the directory's `Makefile`.

Firstly, let's look at a traditional CPU approach.

## CPU code baseline

The standard CPU code is contained in the [`cpu_code.c`](./source/cpu_code.c) file. 
Its option in the [`Makefile`](./source/Makefile) is `cpu_code`, so let's compile and run it now:

In [None]:
make cpu_code
./cpu_code

Note that in this instance we are using the `amdclang` compiler, but this code will compile with any C (or C++) compiler, and run on any CPU.

The output gives us the expected result for an input array of length 100000: 200000.

Let's now begin moving some of the code's execution to a GPU.

## Standard GPU code example

The code contained in [`gpu_code.hip`](./source/gpu_code.hip) demonstrates a standard approach to running the code on a GPU using explicit memory management, i.e. moving the involved memory between the CPU and GPU ourselves.

In this example, we begin by allocating the two arrays as before, but now we also have to allocate the memory required on the GPU (which we normally refer to as the "device").
This is done using the `hipMalloc` command on lines 36 and 37.
The `hipMemcpy` command on lines 42 and 49 allows us to manually copy memory between the CPU (the host) and GPU (the device).
The assignment of the second array is carried out on the GPU, before transferring the results back to calculate the final summation on the CPU.

The code can be compiled using the `gpu_code` option in the [`Makefile`](./source/Makefile).
Let's compile and run it now:

In [None]:
make gpu_code
./gpu_code

Note that we are now using the `hipcc` compiler in order to make use of the HIP commands in the code.

This and future compilations will be given the `--offload-arch=${AMDGPU_GFXMODEL}` argument to tell the compiler what architecture of GPU to expect at runtime.
For the MI200 series GPUs available on `COSMA` this string is `gfx90a`, but this can be changed to target different GPU architectures available elsewhere.
To compile for several GPU architectures at once, the `--offload-arch` flag can be given more than once, with one architecture per occurrence.

## Managed memory code

Our next example, [`gpu_code_managed.hip`](./source/gpu_code_managed.hip), demonstrates how this code can use managed memory, i.e. allowing the operating system to move memory between the CPU and GPU for us.

Comparing this to the previous example, we can see that the explicit HIP memory management calls - the `hipMalloc` and `hipMemcpy` commands discussed before - have been removed.
These are not required when letting the OS handle the memory for us.

This code is compiled using the `gpu_code_managed` option in the [`Makefile`](./source/Makefile).
Let's compile the code now:

In [None]:
make gpu_code_managed

To tell the operating system to enable managed memory, we need to set the environment variable `HSA_XNACK` to 1.
This setting tells the GPU to retry memory accesses that fail due to a page fault, and to migrate the memory automatically in such cases.
This feature works for AMD GPUs of the MI200 and MI300 series, and can be disabled again by setting `HSA_XNACK=0`.

Let's enable this now, and run the code:

In [None]:
export HSA_XNACK=1
./gpu_code_managed
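
Since the managed memory build relies on `HSA_XNACK=1` being exported before launch, a program may want to check for it at start-up. The sketch below shows one possible way to do this in C. The helper name `xnack_requested` is hypothetical and is not part of `gpu_code_managed.hip`. It only inspects the environment variable, and cannot confirm that the GPU itself supports page migration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the HSA_XNACK environment variable
   is set to exactly "1", and 0 if it is unset or has any other value. */
int xnack_requested(void)
{
    const char *value = getenv("HSA_XNACK");
    return value != NULL && strcmp(value, "1") == 0;
}
```

A caller could print a warning and exit early when `xnack_requested()` returns 0, rather than failing later on the first page fault.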

## OpenMP single address space

We will now look at two examples of offloading these loops using OpenMP.
The first, [`openmp_code.c`](./source/openmp_code.c), contains the offloaded loops within the `main` function, whilst the second, [`openmp_code1.c`](./source/openmp_code1.c), has these loops in a separate function where the compiler cannot tell the size of the array.
Both contain the `omp requires unified_shared_memory` pragma and, as such, we might naively assume that they will not be able to run on our MI200 series GPUs, which do not have unified shared memory.

Let's nevertheless try compiling and running them:

In [None]:
make openmp_code openmp_code1
./openmp_code
./openmp_code1

Perhaps to our surprise, the code has compiled and run successfully.
Could this mean that the code is running on the CPU only, since we don't have access to a GPU with unified shared memory?

We can check this using the environment variable `LIBOMPTARGET_INFO`.
This variable will report various information depending on the level requested.
By setting `LIBOMPTARGET_INFO=1`, all data arguments passed to an OpenMP device kernel will be reported.
If we are only running on the CPU, therefore, there will be no messages printed, but if we are successfully offloading to the GPU we will see reports from both loop pragmas.

To turn off the messages, we can set `LIBOMPTARGET_INFO=0`.
The different types of runtime information we can get from `LIBOMPTARGET_INFO` are [documented here](https://openmp.llvm.org/design/Runtimes.html#libomptarget-info).

Let's run it now and see:

In [None]:
export LIBOMPTARGET_INFO=1
./openmp_code
./openmp_code1

We can see from these results that the code is indeed being run on the GPU.
The OpenMP compiler is smart enough to recognise the architecture we are using and allow managed memory even in an environment without the requested single unified memory (you will learn more about how this works in the OpenMP course).

Feel free to experiment with removing the various pragmas and running these again, to see how the reports change.
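
As a starting point for such experiments, the sketch below shows roughly what the separate-function pattern of `openmp_code1.c` could look like. It is not a copy of that file: the function names `double_array` and `sum_array` are assumptions, and the exact pragmas used in the course source may differ. With most compilers, if OpenMP support is not enabled the pragmas are ignored and the loops simply run on the CPU.

```c
#include <stddef.h>

/* Ask the OpenMP runtime for a single address space shared by host and
   device, so the arrays below need no explicit map clauses. */
#pragma omp requires unified_shared_memory

/* Hypothetical function: n is only known at run time, so the compiler
   cannot deduce the array sizes from the pointers alone. */
void double_array(const double *a, double *b, size_t n)
{
    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
}

/* Hypothetical function: sums the array with an offloaded reduction. */
double sum_array(const double *b, size_t n)
{
    double sum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom: sum)
    for (size_t i = 0; i < n; i++)
        sum += b[i];
    return sum;
}
```

The `map(tofrom: sum)` clause is written explicitly so that the reduction result is copied back regardless of which OpenMP version's defaults the compiler follows.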

So far we've looked at native and pragma-based approaches to GPU offloading.
Now let's look at higher level performance portability frameworks.

## Higher level performance portability frameworks

Here we will see how two higher-level frameworks can be used to create single-source application code that can run on both CPUs and GPUs.
The above example code has been accelerated using the [Kokkos](https://github.com/kokkos/kokkos) and [RAJA](https://github.com/LLNL/RAJA) frameworks.

To make them easier to run in these notebooks, we have pre-installed and loaded these frameworks.
Usually, they would be available on HPC systems through `modules` that can be loaded depending on user need.
Where unavailable, they can also be built specifically within one's own userspace.

### Kokkos

Kokkos is a programming model developed by Sandia National Labs that uses a HIP backend to allow us to exploit the unique abilities of AMD GPUs in C++.

Let's inspect the example [`kokkos_code.cc`](./source/kokkos/kokkos_code.cc). Note that in the code we have not had to declare the arrays as Kokkos Views.
Instead, we will allow the OS to manage the memory by setting the `HSA_XNACK` environment variable to 1.

We can compile (using CMake) and run the example code like this:

In [None]:
cd kokkos
cmake -B build -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc
cmake --build build
export HSA_XNACK=1
./build/kokkos_code

### RAJA

RAJA is a C++ framework developed by Lawrence Livermore National Laboratory.
It is modular in structure, with separate compute and data management.
It supports a range of GPUs including AMD systems, with key kernel patterns that have been optimised by AMD themselves.

Let's inspect the example [`raja_code.cc`](./source/raja/raja_code.cc). Note that, similarly to the Kokkos example, we have only allocated the arrays on the host with `malloc`. We will therefore need to set the `HSA_XNACK` variable to 1.

We can compile (using CMake) and run the example code as follows:

In [None]:
cd ../raja
cmake -B build -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc
cmake --build build
export HSA_XNACK=1
./build/raja_code

Now that we've seen how various programming models can be implemented in a simple C code, we can start exploring more complicated constructs using different programming paradigms.

The next chapter will discuss OpenMP, and how we can use it to get the most out of our AMD GPUs and APUs.