# OpenMP Optimisation 

So far in these notebooks, we have learnt about many of the pragmas available in OpenMP to offload calculations, memory, and even functions to GPU devices.
Whilst judicious use of these pragmas should improve the performance of appropriate code, we have not yet spent any time considering performance and optimisations within OpenMP itself.
This notebook will explore a few areas in which the performance of OpenMP itself can be tweaked, including memory allocation optimisations, memory pools, and kernel tuning clauses.

These examples are designed to work with unified shared memory or managed memory to keep them simpler.
For this reason, we will set the `HSA_XNACK` environment variable to 1 now.
Whilst here, let's do our standard environment check to make sure that we have an appropriate GPU to work with:

In [None]:
export HSA_XNACK=1
rocm-smi

## Optimising memory allocations

Although memory transfers are often the most costly part of offloading operations, there is also an overhead associated with memory allocations.
Let's consider our first example, [`alloc_problem.f90`](./Fortran/1_alloc_problem/alloc_problem.f90).

The main loop within this program calculates the value of $\sin^2(i) + \cos^2(i)$ for every member of a long array, and returns the sum of these divided by the number of elements - a value we expect to be 1.
For demonstrative purposes, we are running this loop 10 times.
Note that in this example, we carry out the memory allocations and deletion of the arrays within this loop.

Let's compile this example, run it, and check the output:

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2h-OpenMP_Optimisation/Fortran/1_alloc_problem
make clean
make
./alloc_problem

The code has run successfully, and the final result has been reported correctly.

Let's compare this with another version of the code, [`opt_allocation.f90`](./Fortran/2_opt_allocation/opt_allocation.f90).
This code is identical, except that the memory allocations and deletions are carried out outside the main loop.
Naively, we might not expect this to have a very large impact on the performance.

Let's compile and run it now:

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2h-OpenMP_Optimisation/Fortran/2_opt_allocation
make clean
make
./opt_allocation

On the MI200 that we are running on, this optimisation of memory allocation has given us a better than 6x speed-up.  
The speed-up is particularly dramatic on a discrete GPU such as this because, under the unified shared memory (USM) model, each new memory allocation on the host triggers page migrations when accessed from the device.
These migrations are costly and can significantly impact performance.
Without the USM model, we would still be required to add a `map` clause to the `target teams loop` directive, which would trigger explicit memory transfers in every iteration; these are also very costly.

Where possible, memory allocations should be carried out once, outside of loops that use the variables multiple times.
This allows the `target` directive to recognise the memory and avoids repeated page migrations by reusing already-migrated pages on the GPU.
When not using USM, we could still allocate the host memory outside the loop and use an unmanaged memory region with a directive like `target enter data map(alloc:)` to ensure the memory allocation remains accessible to the device.
This won't always be achievable - where arrays aren't necessarily of fixed size in every iteration, for example - but is good practice where it can be implemented.
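
The pattern described above can be sketched as follows. This is not the contents of `opt_allocation.f90`; it is a minimal illustration, and the program name, array names, and array length are assumptions. The arrays are allocated once before the repeated loop, and a `target enter data map(alloc:)` directive creates the device buffers once for the non-USM case. When compiled without OpenMP support, the directives are treated as comments and the program runs serially.

```fortran
program allocate_once
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 100000   ! hypothetical array length
  integer, parameter :: niter = 10   ! the loop is repeated 10 times, as in the text
  real(dp), allocatable :: s(:), c(:)
  real(dp) :: total, mean
  integer :: i, iter

  ! Allocate on the host once, before the repeated loop
  allocate(s(n), c(n))
  ! Without USM: create matching device buffers once, with no data transfer
  !$omp target enter data map(alloc: s, c)

  do iter = 1, niter
    total = 0.0_dp
    !$omp target teams distribute parallel do reduction(+:total) map(tofrom: total)
    do i = 1, n
      s(i) = sin(real(i, dp))**2
      c(i) = cos(real(i, dp))**2
      total = total + s(i) + c(i)
    end do
    mean = total / real(n, dp)
    ! Self-check: sin^2 + cos^2 should average to 1
    if (abs(mean - 1.0_dp) > 1.0e-9_dp) error stop "mean is not 1"
  end do

  ! Release the device buffers, then the host memory, once at the end
  !$omp target exit data map(delete: s, c)
  deallocate(s, c)
  print '(a, f8.5)', 'mean = ', mean
end program allocate_once
```

Moving the `allocate`, `deallocate`, and `enter`/`exit data` lines inside the `do iter` loop would reproduce the slower pattern, which makes this a convenient starting point for timing experiments.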

## Using memory pools

Another approach to reducing the cost of memory allocations and deallocations is to use a memory pool.
A memory pool pre-allocates an area of memory on the device that can be easily and efficiently allocated to necessary variables later on in the code.
Many implementations of a memory pool are available, but for our example we will use the Umpire memory manager from Lawrence Livermore National Laboratory (LLNL), which works with C, C++ and Fortran.  

This library is unlikely to be installed on the system you are running on, and so installing it will be the first step to using these memory pools.  
The installation will take multiple steps and is likely similar to what you would encounter if you wished to install the library in another environment. 
1. SSH into the COSMA login node if you are not already connected
2. cd to `$HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2h-OpenMP_Optimisation/Fortran/3_memorypool`
3. Clone the library with the command `git clone --recursive https://github.com/LLNL/Umpire.git Umpire_source`
4. Start a new terminal in the Jupyter notebook.  This will create an interactive shell session on the AMD GPU node (note that this step is only needed as the software stack on the training node is different to that on the login node).
5. In your Jupyter terminal, make sure you are in `$HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2h-OpenMP_Optimisation/Fortran/3_memorypool` (you can check with `pwd`), and if not cd to it.
6. Run the script `./umpire_setup` (note that this script is designed to run on the COSMA training node.  To install on the login node, you will have to change the variables and paths in the script accordingly). 

After installation, we need to set the `UMPIRE_PATH` environment variable to the installation directory with `export UMPIRE_PATH=<path_to_install_dir>`.  This has already been done in the code block below.

With the library installed, let's take a look at our example code [`memorypool.f90`](./Fortran/3_memorypool/memorypool.f90).
This code carries out the same calculations as our previous examples, but now with the inclusion of a memory pool.
We create the shared pool before we enter the main calculation loop, then allocate and deallocate the arrays for our calculations in each iteration. 

Let's compile and run this code now, and compare its performance to our previous examples:

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2h-OpenMP_Optimisation/Fortran/3_memorypool
export UMPIRE_PATH="./Umpire_install"
make clean
make
./memorypool

The performance of the memory pool is comparable to the previous [`opt_allocation.f90`](./Fortran/2_opt_allocation/opt_allocation.f90) example.
Memory pools offer the advantage of allowing dynamically sized arrays each iteration, without the usual associated cost of the memory allocation.
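
To make the idea concrete without relying on Umpire's API, here is a hand-rolled sketch of the simplest kind of pool: one large buffer allocated up front, from which differently sized slices are handed out each iteration. It is not how Umpire or `memorypool.f90` is implemented, and the module, the routine names (`pool_init`, `pool_take`, `pool_reset`), and the sizes are all hypothetical. The buffer is ordinary host memory, so the sketch assumes unified shared memory if it is run on a GPU.

```fortran
module bump_pool
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), allocatable, target :: buffer(:)   ! the pre-allocated pool
  integer :: next_free = 1                     ! first unused element of buffer
contains
  subroutine pool_init(capacity)
    integer, intent(in) :: capacity
    allocate(buffer(capacity))   ! the only real allocation in the program
    next_free = 1
  end subroutine pool_init

  subroutine pool_take(array, n)
    real(dp), pointer, intent(out) :: array(:)
    integer, intent(in) :: n
    if (next_free + n - 1 > size(buffer)) error stop "pool exhausted"
    array => buffer(next_free:next_free + n - 1)   ! hand out a slice, no allocation
    next_free = next_free + n
  end subroutine pool_take

  subroutine pool_reset()
    next_free = 1   ! release every slice at once by rewinding
  end subroutine pool_reset
end module bump_pool

program pool_demo
  use bump_pool
  implicit none
  integer, parameter :: nmax = 4000   ! hypothetical largest array length
  real(dp), pointer :: s(:), c(:)
  real(dp) :: total
  integer :: i, iter, n

  call pool_init(2 * nmax)   ! room for two arrays of the largest size
  do iter = 1, 4
    n = 1000 * iter          ! the array size changes every iteration
    call pool_take(s, n)
    call pool_take(c, n)
    total = 0.0_dp
    !$omp target teams distribute parallel do reduction(+:total) map(tofrom: s, c, total)
    do i = 1, n
      s(i) = sin(real(i, dp))**2
      c(i) = cos(real(i, dp))**2
      total = total + s(i) + c(i)
    end do
    if (abs(total / real(n, dp) - 1.0_dp) > 1.0e-9_dp) error stop "mean is not 1"
    call pool_reset()
  end do
  print '(a)', 'all iterations gave a mean of 1'
end program pool_demo
```

A production pool such as Umpire's additionally handles growth, fragmentation, and individual deallocation, which this sketch deliberately leaves out.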

Feel free to experiment with the different allocation methods, different size arrays and loops, and compare their performance.

## Kernel Tuning

Now that we've seen how memory allocation can impact performance, let's take a look at some options to optimise the OpenMP pragmas themselves.
First, we'll set `LIBOMPTARGET_KERNEL_TRACE=1` so we can better inspect what the OpenMP runtime is doing. 
The trace lets us see the number of teams and threads per team, as well as the number of Scalar General Purpose Registers (SGPRs) and Vector General Purpose Registers (VGPRs) in use.
These are all effective proxies of how well we are utilising the GPU device.
We'll also move into the `kernel_pragmas` directory containing our examples, and set up our build environment:

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2h-OpenMP_Optimisation/Fortran/kernel_pragmas
export LIBOMPTARGET_KERNEL_TRACE=1
rm -rf build
cmake -B build

The baseline code for our tuning examples will be [`kernel1.f90`](./Fortran/kernel_pragmas/kernel1.f90).
The code is a simple daxpy loop with the now familiar `target teams distribute parallel do` OpenMP directive used for both assignment and calculation:

In [None]:
grep "omp target teams distribute parallel" kernel1.f90
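
For readers without the file to hand, a daxpy kernel of this shape might look like the sketch below. It is not a copy of `kernel1.f90`: the program name, problem size, and initial values are assumptions, and the timing code is omitted. The same combined directive is applied to both the initialisation loop and the calculation loop.

```fortran
program daxpy_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000   ! hypothetical problem size
  real(dp), allocatable :: x(:), y(:)
  real(dp) :: a
  integer :: i

  allocate(x(n), y(n))
  a = 2.0_dp

  ! Assignment: initialise both vectors on the device
  !$omp target teams distribute parallel do
  do i = 1, n
    x(i) = 1.0_dp
    y(i) = 2.0_dp
  end do

  ! Calculation: y = a*x + y
  !$omp target teams distribute parallel do
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do

  ! Self-check: every element should now be 2*1 + 2 = 4
  if (any(abs(y - 4.0_dp) > 1.0e-12_dp)) error stop "daxpy result is wrong"
  print '(a, f4.1)', 'y(1) = ', y(1)
  deallocate(x, y)
end program daxpy_sketch
```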

Let's build and run the code, taking note of the information printed out by the trace:

In [None]:
cmake --build build --target kernel1
./build/kernel1

Note that the `Timing in Seconds` line gives us the timing of the daxpy kernel itself - increasing the `NTIMERS` macro will run the kernel multiple times, and provide average figures of the timing.
Keep this baseline performance in mind as we investigate tuning parameters.

### The `num_threads` clause

With no explicit instructions, the compiler will decide how many threads to assign each team on the device.
From the trace output above, we can see that the default for the daxpy kernel on our MI200 is 256 such threads.
We can, however, control this ourselves by adding the `num_threads(XX)` clause to our target pragma, where `XX` is the number of desired threads per team.
Let's see how we've implemented this in our next example code, [`kernel2.f90`](./Fortran/kernel_pragmas/kernel2.f90):

In [None]:
grep "omp target teams distribute parallel" kernel2.f90

In this example, we've set the number of threads per team to be equal to 64.
Let's compile and run it now:

In [None]:
cmake --build build --target kernel2
./build/kernel2

We can see from the trace that we have reduced the number of threads per team, and have consequently increased the number of teams produced for the code.

### The `thread_limit` clause

We can limit the number of threads possible in a given pragma using the `thread_limit` clause.
The compiler will be free to select any number of threads per team up to the number given in the clause.
We have implemented this clause in the [`kernel3.f90`](./Fortran/kernel_pragmas/kernel3.f90) example - let's look at it now:

In [None]:
grep "omp target teams distribute parallel" kernel3.f90

We have applied a `thread_limit` of 64 to our pragmas here.
Let's compile and run it now:

In [None]:
cmake --build build --target kernel3
./build/kernel3

We are running over a short loop in this example, so the differences in performance will be small.
Furthermore, compiler optimisations mean that the performance, and any improvement we see from these tunings, will vary over time and may differ significantly from the figures shown in the slides.

### The `num_teams` clause

The final kernel tuning clause of note that we will discuss here is the `num_teams` clause.
As you might expect from the name, it instructs the compiler to create a specific number of teams for the `target` operation.
Although we don't have an example of its use here, we encourage you to test it out in the above examples to check how it impacts performance.
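
As a starting point for that experiment, the sketch below applies each of the three tuning clauses in turn to the same daxpy loop. It is an illustration rather than one of the notebook's example files, and the values 64 and 120, like the program name and problem size, are arbitrary assumptions to be replaced with your own. Running it with `LIBOMPTARGET_KERNEL_TRACE=1` set should show how each clause changes the launch configuration.

```fortran
program tuning_clauses
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000   ! hypothetical problem size
  real(dp), allocatable :: x(:), y(:)
  real(dp) :: a
  integer :: i

  allocate(x(n), y(n))
  a = 2.0_dp
  x = 1.0_dp
  y = 2.0_dp

  ! Variant 1: request 64 threads per team
  !$omp target teams distribute parallel do num_threads(64)
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do

  ! Variant 2: allow at most 64 threads per team
  !$omp target teams distribute parallel do thread_limit(64)
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do

  ! Variant 3: request 120 teams (an arbitrary value chosen for illustration)
  !$omp target teams distribute parallel do num_teams(120)
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do

  ! Self-check: three daxpy passes give 2 + 3*(2*1) = 8 in every element
  if (any(abs(y - 8.0_dp) > 1.0e-12_dp)) error stop "daxpy result is wrong"
  print '(a, f4.1)', 'y(1) = ', y(1)
  deallocate(x, y)
end program tuning_clauses
```

The clauses can also be combined on a single directive, for example `num_teams` together with `thread_limit`, to fix both dimensions of the launch at once.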

Indeed, the best way to get a better feel for these kernel tuning clauses is to try them out on more complex code.
Try expanding the loop or increasing the complexity of the calculations involved, and then test some different values for `num_threads`, `num_teams` and `thread_limit`.
The goal is to lower register usage in order to improve the occupancy.
You can refer to the slides to see the number of "waves" that can run on a compute unit with respect to the number of VGPRs.
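
The relationship between register usage and occupancy can be estimated with a little arithmetic. The sketch below rests on assumptions that do not come from this notebook and should be checked against the slides: a budget of 512 VGPRs per SIMD, a cap of 8 waves per SIMD, and VGPRs allocated in blocks of 8. Under those assumptions, the number of waves that fit is the VGPR budget divided by the (rounded-up) VGPR count reported by the kernel trace, limited by the cap.

```fortran
program occupancy_sketch
  implicit none
  ! ASSUMED hardware figures, not taken from the notebook: check them against the slides
  integer, parameter :: vgprs_per_simd = 512   ! assumed VGPR budget per SIMD
  integer, parameter :: max_waves = 8          ! assumed cap on waves per SIMD
  integer, parameter :: granule = 8            ! assumed VGPR allocation granularity
  integer, parameter :: samples(5) = [32, 64, 96, 128, 256]
  integer :: k

  ! Sanity checks on the formula under the assumptions above
  if (waves_per_simd(64) /= 8) error stop "expected 8 waves at 64 VGPRs"
  if (waves_per_simd(128) /= 4) error stop "expected 4 waves at 128 VGPRs"

  do k = 1, size(samples)
    print '(a, i4, a, i2)', 'VGPRs per thread: ', samples(k), &
          '  waves per SIMD: ', waves_per_simd(samples(k))
  end do
contains
  pure integer function waves_per_simd(vgprs) result(waves)
    integer, intent(in) :: vgprs
    integer :: rounded
    ! Round the request up to the allocation granularity, then see how many fit
    rounded = ((vgprs + granule - 1) / granule) * granule
    waves = min(max_waves, vgprs_per_simd / rounded)
  end function waves_per_simd
end program occupancy_sketch
```

If these figures hold, a kernel only reaches the maximum number of waves when it uses 64 VGPRs or fewer, which is why reducing register pressure is the stated goal.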

This notebook taught us about some optimisations that can help with OpenMP performance.
Sometimes, even these optimisations aren't enough for the most intensive parts of our code - for those, we might want to use one of the other programming models such as HIP to improve our performance.
The next section will discuss the interoperability of OpenMP and HIP code.