# OpenMP Single Line Compute Constructs

In this notebook, we will look at the use of single line directives as a means of moving the computation of a loop to the GPU.
For the examples in this notebook, we will use a simple saxpy code, which we will discuss below.

Let's begin, as ever, by checking that we have suitable GPUs available, moving into the source directory, and making sure that the working environment is clean.

In [None]:
rocm-smi
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2b-OpenMP_single_line_constructs/Fortran
make clean

## The saxpy code

Saxpy is an acronym that stands for **Single precision A X plus Y**.
It refers to the addition of two vectors, X and Y, where X is multiplied by a scalar A.
Practically, this is done by looping over the elements of X and Y and calculating each result individually.
Mathematically, it can be expressed as:
$$z_i = a \cdot x_i + y_i$$
for the $i^{th}$ element of the arrays.

These sorts of simple vector additions are used extensively in simulation and graphical processing, and can be written into code as a loop over the elements of the two arrays.
Since each calculation has no dependence on any other element in the array, they could, in principle, all be carried out independently.
This makes saxpy a prime candidate for parallelisation across CPUs and GPUs alike.
Indeed, it is often colloquially referred to as the 'Hello, World!' of GPU programming, as the first real code example you are likely to write and run in any new GPU framework.

Let's look now at a traditional CPU implementation of the saxpy code.

## CPU saxpy code

[`saxpy_cpu.F90`](./Fortran/saxpy_cpu.F90) contains a basic implementation of saxpy code for a CPU.
In the `main` block, we can see the instantiation and initialisation of the two arrays that we will be combining, `x` and `y`.
The `saxpy` subroutine then carries out the vector addition itself within the loop:
```fortran
   do i=1,n
       y(i) = y(i) + a * x(i)
   end do
```
Note that in this example, and others that you might see, rather than returning a new vector $z$, we are doing the addition directly into `y`.

You should notice that we already have an OpenMP pragma in this code:
```fortran
!$omp parallel do
```
This instructs the compiler to share the iterations of the loop between threads on the CPU.

For our example, we initialise the `y` array to `2`, the `x` array to `1` and the scalar `a` to `2`, so we expect every element of the output array `y` after the computation to contain the value `4`.
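
Putting the directive and the loop together, a minimal sketch of such a subroutine might look like the following. The module and subroutine names (`saxpy_cpu_sketch`, `saxpy_cpu`) and the argument list are assumptions made for illustration, and may differ from what is in `saxpy_cpu.F90`.

```fortran
! Hypothetical module name, used only for this sketch.
module saxpy_cpu_sketch
   implicit none
contains
   ! Computes y <- y + a*x, sharing the loop iterations between CPU threads.
   subroutine saxpy_cpu(n, a, x, y)
      integer, intent(in)    :: n
      real,    intent(in)    :: a
      real,    intent(in)    :: x(n)
      real,    intent(inout) :: y(n)
      integer :: i

      ! Each iteration is independent, so the loop can be threaded safely.
      !$omp parallel do
      do i = 1, n
         y(i) = y(i) + a * x(i)
      end do
      !$omp end parallel do
   end subroutine saxpy_cpu
end module saxpy_cpu_sketch
```

Called with every element of `x` set to `1`, every element of `y` set to `2` and `a = 2`, this leaves every element of `y` equal to `4`. If the code is compiled without an OpenMP flag, the directive is treated as a comment and the loop simply runs serially.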

We can build this code using the `saxpy_cpu` option from the [`Makefile`](./Fortran/Makefile).
Let's build and run it now:

In [None]:
make saxpy_cpu
./saxpy_cpu

The code has run successfully with the expected output.

Now, let's start changing the pragma instructions to offload our loop onto the GPU.

## Offloading arrays to the GPU

We will now go about moving our calculation to the GPU, starting with a simplified version of the code.

### Automatically and dynamically assigned arrays

The code in [`saxpy_gpu_singleunit_autoalloc.F90`](./Fortran/saxpy_gpu_singleunit_autoalloc.F90) contains a slightly altered version of the saxpy code.
In it, `x` and `y` are declared with fixed sizes and so are allocated automatically on the stack, which means that the compiler is fully aware of the array sizes that need to be moved for the following loop.
Note that we've had to reduce the number of entries in the arrays in comparison to the previous example, to make sure we don't run out of memory on the stack.

We can see that we've also now changed the OpenMP pragma instruction to:
```Fortran
!$omp target teams distribute parallel do
```

Let's remind ourselves of what the directives in this pragma are doing:
 - `target` tells the compiler to transfer control and data from the host (CPU) to the device (GPU),
 - `teams` creates a league of thread teams that can execute the code in parallel,
 - `distribute parallel do` instructs the compiler to run the subsequent `do` loop in parallel, distributing its iterations between the previously created teams and then between the threads within each team.
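
As a rough sketch of how this looks in a single program unit with fixed-size arrays, consider the following. The names `saxpy_autoalloc_sketch` and `saxpy_fixed_size`, and the array size of `1024`, are assumptions chosen for illustration rather than values taken from `saxpy_gpu_singleunit_autoalloc.F90`.

```fortran
! Hypothetical module name, used only for this sketch.
module saxpy_autoalloc_sketch
   implicit none
   ! Assumed size, kept small so the fixed-size arrays fit on the stack.
   integer, parameter :: n = 1024
contains
   ! Runs saxpy on fixed-size local arrays and returns the largest
   ! deviation from the expected value of 4.
   function saxpy_fixed_size() result(max_err)
      real    :: max_err
      real    :: x(n), y(n)
      real    :: a
      integer :: i

      a = 2.0
      x = 1.0
      y = 2.0

      ! The array sizes are known at compile time, so no explicit data
      ! clauses are given here and the mapping is left to the compiler.
      !$omp target teams distribute parallel do
      do i = 1, n
         y(i) = y(i) + a * x(i)
      end do
      !$omp end target teams distribute parallel do

      max_err = maxval(abs(y - 4.0))
   end function saxpy_fixed_size
end module saxpy_autoalloc_sketch
```

A result of zero from `saxpy_fixed_size()` indicates that every element of `y` holds the expected value once the loop has finished.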

Put all together, this commonly-used construct asks that the subsequent `do` loop block be run in parallel on the GPU.
Let's try building and running this code now, using the `saxpy_gpu_singleunit_autoalloc` option of the [`Makefile`](./Fortran/Makefile).

Remember that we can set the `LIBOMPTARGET_INFO` environment variable to report at runtime the data arguments passed to an OpenMP device kernel.
In this way, we can ensure that we are correctly running on the GPU.

In [None]:
export LIBOMPTARGET_INFO=1
make saxpy_gpu_singleunit_autoalloc
./saxpy_gpu_singleunit_autoalloc

Success!
We've offloaded our saxpy loop to the GPU, and run the addition there in parallel.

But what if we want to set the size of the input arrays at runtime, rather than hard-code their sizes at the beginning of the code?
In the example code [`saxpy_gpu_singleunit_dynamic.F90`](./Fortran/saxpy_gpu_singleunit_dynamic.F90), the arrays are now sized with `allocate` statements, whilst the rest of the code stays the same.

This code can be built using the `saxpy_gpu_singleunit_dynamic` target of the [`Makefile`](./Fortran/Makefile), so let's build and run it now.

In [None]:
make saxpy_gpu_singleunit_dynamic
./saxpy_gpu_singleunit_dynamic

If you've also been following the C examples, you might notice that the additional environment variable `HSA_XNACK` - which allows the operating system to automatically move dynamically assigned data to the device - is not needed in this example.
This is because of the way arrays are handled in Fortran: an allocated array carries its bounds with it, so the compiler already has enough information to offload it correctly even without this feature enabled.
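
A minimal sketch of the dynamically allocated variant is shown below. It assumes the arrays are declared `allocatable`; the names `saxpy_dynamic_sketch` and `saxpy_allocatable` are hypothetical, and the real `saxpy_gpu_singleunit_dynamic.F90` may be organised differently.

```fortran
! Hypothetical module name, used only for this sketch.
module saxpy_dynamic_sketch
   implicit none
contains
   ! Allocates x and y at runtime, runs saxpy, and returns the largest
   ! deviation from the expected value of 4.
   function saxpy_allocatable(n) result(max_err)
      integer, intent(in) :: n
      real                :: max_err
      real, allocatable   :: x(:), y(:)
      real                :: a
      integer             :: i

      ! The size is only known at runtime.
      allocate(x(n), y(n))

      a = 2.0
      x = 1.0
      y = 2.0

      ! Same single line directive as for the fixed-size arrays.
      !$omp target teams distribute parallel do
      do i = 1, n
         y(i) = y(i) + a * x(i)
      end do
      !$omp end target teams distribute parallel do

      max_err = maxval(abs(y - 4.0))
      deallocate(x, y)
   end function saxpy_allocatable
end module saxpy_dynamic_sketch
```

The only change from the fixed-size case is how `x` and `y` come into existence; the directive and the loop are untouched.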

### Moving the allocation to the main

Now that we can offload our loop successfully, even for dynamically assigned arrays, let's return to the original example code.
Here, we want to allocate our arrays in the `main` block, and pass them as-is to the saxpy subroutine.
We will then keep the information calculated during the addition, rather than discarding it at the end of the subroutine's scope.
This will have many more real-world applications than the previous examples.
[`saxpy_gpu_paralleldo.F90`](./Fortran/saxpy_gpu_paralleldo.F90) shows how we can apply the new pragma in this way.
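
One way this split could look is sketched below, with the arrays owned by the caller and passed to a subroutine that offloads the loop. The names `saxpy_paralleldo_sketch` and `saxpy_offload`, and the choice of explicit-shape dummy arguments, are assumptions for illustration and are not taken from `saxpy_gpu_paralleldo.F90`.

```fortran
! Hypothetical module name, used only for this sketch.
module saxpy_paralleldo_sketch
   implicit none
contains
   ! Computes y <- y + a*x on the device. The arrays belong to the caller,
   ! so the updated y is still available after the subroutine returns.
   subroutine saxpy_offload(n, a, x, y)
      integer, intent(in)    :: n
      real,    intent(in)    :: a
      real,    intent(in)    :: x(n)   ! explicit shape: the extent is n
      real,    intent(inout) :: y(n)
      integer :: i

      !$omp target teams distribute parallel do
      do i = 1, n
         y(i) = y(i) + a * x(i)
      end do
      !$omp end target teams distribute parallel do
   end subroutine saxpy_offload
end module saxpy_paralleldo_sketch
```

The main program would allocate and initialise `x` and `y`, call `saxpy_offload(n, a, x, y)`, and then carry on using the updated `y`.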

Let's compile and run this code, using the `saxpy_gpu_paralleldo` target of the [`Makefile`](./Fortran/Makefile):

In [None]:
make saxpy_gpu_paralleldo
./saxpy_gpu_paralleldo

This works as desired.

Now let's look at some more pragma clauses that might prove useful for the saxpy example.

## The **loop** construct

From OpenMP 5.0 onwards, the `loop` construct is available as a simpler replacement for `distribute parallel do`.
The pragma that we would need to write for our offloaded loop would then become:
```Fortran
!$omp target teams loop
```

This new construct simplifies our usual pragmas, and allows the compiler more freedom in its implementation of parallelism for the applicable loop.
Indeed, for AMD GPUs, the ROCm compilers can generate a more optimised target region from this pragma than from the traditional `distribute parallel do` construct.
Additionally, from OpenMP version 6.0 it is intended to replace the `teams distribute parallel do` construct.
At the time of writing, not many compilers support OpenMP version 6.0; however, you should be aware of this upcoming change and check the documentation for your compiler of choice.

This construct does currently have drawbacks, however.
Because it is relatively new, not all compilers support it, and compilers may not yet make the best use of the additional freedom that it affords for all available cards and scenarios.
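
Applied to the saxpy loop, the shorter pragma could be used as in the sketch below. The names `saxpy_loop_sketch` and `saxpy_loop` are hypothetical, and the layout is an assumption rather than a copy of `saxpy_gpu_loop.F90`.

```fortran
! Hypothetical module name, used only for this sketch.
module saxpy_loop_sketch
   implicit none
contains
   ! Computes y <- y + a*x, leaving the choice of how to parallelise
   ! the loop on the device to the compiler.
   subroutine saxpy_loop(n, a, x, y)
      integer, intent(in)    :: n
      real,    intent(in)    :: a
      real,    intent(in)    :: x(n)
      real,    intent(inout) :: y(n)
      integer :: i

      ! Requires a compiler that supports the loop construct.
      !$omp target teams loop
      do i = 1, n
         y(i) = y(i) + a * x(i)
      end do
      !$omp end target teams loop
   end subroutine saxpy_loop
end module saxpy_loop_sketch
```

Only the directive lines differ from the `distribute parallel do` version, which makes it straightforward to switch between the two and compare their behaviour with a given compiler.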

There is an example of the `loop` construct for our saxpy code in [`saxpy_gpu_loop.F90`](./Fortran/saxpy_gpu_loop.F90).  Let's compile and run it now:

In [None]:
make saxpy_gpu_loop
./saxpy_gpu_loop

We see that the performance is largely consistent with the previous examples.

## The **collapse** clause

The final clause we will look at in this notebook is the `collapse(n)` clause.
This clause requests that `n` nested loops be merged (or 'collapsed') into a single iteration space and executed as a single loop.

We can see an example of this in [`saxpy_gpu_collapse.F90`](./Fortran/saxpy_gpu_collapse.F90).
Here, we have transformed `x` and `y` into 2D arrays, and are performing the saxpy operation with nested loops over these dimensions.
The `collapse` clause is then added to the end of our normal pragma to request that the two loops be collapsed into one.
This clause is very useful when working with Fortran's multi-dimensional arrays.
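
A sketch of a two-dimensional saxpy with `collapse(2)` might look like the following. The names `saxpy_collapse_sketch` and `saxpy_2d`, the argument list, and the loop ordering are assumptions for illustration, and may not match `saxpy_gpu_collapse.F90`.

```fortran
! Hypothetical module name, used only for this sketch.
module saxpy_collapse_sketch
   implicit none
contains
   ! Computes y <- y + a*x element by element for 2D arrays.
   subroutine saxpy_2d(n, m, a, x, y)
      integer, intent(in)    :: n, m
      real,    intent(in)    :: a
      real,    intent(in)    :: x(n, m)
      real,    intent(inout) :: y(n, m)
      integer :: i, j

      ! collapse(2) treats the two perfectly nested loops as a single
      ! loop of n*m iterations before distributing the work.
      !$omp target teams distribute parallel do collapse(2)
      do j = 1, m
         do i = 1, n   ! first index innermost, following Fortran's column-major layout
            y(i, j) = y(i, j) + a * x(i, j)
         end do
      end do
      !$omp end target teams distribute parallel do
   end subroutine saxpy_2d
end module saxpy_collapse_sketch
```

Note that `collapse` needs the loops to be perfectly nested, with no statements between the two `do` lines.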

Let's build and run the code, using the `saxpy_gpu_collapse` target of the [`Makefile`](./Fortran/Makefile):

In [None]:
make saxpy_gpu_collapse
./saxpy_gpu_collapse

As we've seen in this notebook, it is very easy to add a single line pragma that offloads and runs a loop on a GPU.
Dynamic memory options available on the recent MI200 and MI300 series AMD GPUs make managing the memory involved in the process much simpler, so that we can focus on parallelising the appropriate loops rather than fiddling with what memory needs to go where.

In the next section, we will look at some more complex constructs.