# Reductions, atomics and mutexes

So far, we have concentrated on offloading independent, element-wise calculations over arrays to our devices.
Although this makes up a substantial portion of the work we expect to do on our acceleration devices, we also need to understand how multiple threads can safely access a single variable during a distributed calculation.

OpenMP provides functionality for this in the form of mutexes, reductions and atomics.
This notebook will review this functionality, and provide some examples of each in action.

Let's begin with our usual checks: making sure we have an appropriate GPU, moving into the working directory, and cleaning our environment:

In [None]:
rocm-smi
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2f-OpenMP_reductions_atomics_mutexes/Fortran
make clean && rm -rf build

## Mutexes

If you've done any work with CPU-based parallelism in the past, you may already be familiar with the concept of a `mutex`.
When multiple threads attempt to access and change a global variable with no protection, we enter what is known as a 'race condition'.
Each thread attempts to make its own changes to this variable, potentially overwriting the work of other threads and producing inaccurate results.

The traditional solution to this problem is known as a `mutex` (a portmanteau of mutual exclusion) - a lock that is placed around these potentially conflicting calculations so that only one thread can access the variable at a time.
Although effective, these types of hard software locks are typically very expensive in terms of performance - a problem that is only exacerbated when running with greater parallelism on a GPU or APU.
More modern techniques have been developed for the majority of coding challenges that would once have been addressed by mutexes, and so mutexes are generally discouraged in device-based kernels.

OpenMP does provide a number of directives that fulfil the functionality of a mutex-style lock, such as `omp critical`, but it is generally recommended to adapt code where possible to the more modern paradigms of reductions and atomics, which we will discuss below.
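
To make the idea of a mutex-style lock concrete, here is a minimal host-side sketch using `omp critical`. It is not taken from this notebook's source files: the module, function and program names are hypothetical, and it assumes free-form source built with an OpenMP-enabled Fortran compiler. Without the `critical` region, the threads would race on the shared accumulator.

```fortran
! Hypothetical example, free-form source (e.g. critical_sum.f90).
module critical_demo
  implicit none
contains
  ! Sum the integers 1..n on host threads, guarding the shared
  ! accumulator with a critical section (a mutex-style lock).
  function critical_sum(n) result(total)
    integer, intent(in) :: n
    integer :: total, acc, i
    acc = 0
    !$omp parallel do shared(acc)
    do i = 1, n
      ! Only one thread at a time may execute this region,
      ! so no update to acc is lost.
      !$omp critical
      acc = acc + i
      !$omp end critical
    end do
    !$omp end parallel do
    total = acc
  end function critical_sum
end module critical_demo

program critical_example
  use critical_demo
  implicit none
  ! The sum of 1..1000 is 1000*1001/2 = 500500.
  print *, 'sum of 1..1000 =', critical_sum(1000)
end program critical_example
```

The result is correct, but every iteration has to wait its turn for the lock, so the loop is effectively serialised. This is the cost that the approaches in the following sections aim to avoid.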

## Reductions

A common practice that would once have been carried out using mutexes is the transformation of an array into a single variable; the calculation of the sum of the elements of said array, for example.
If we consider running this in a single loop, it is not clear how running in parallel would speed up the calculation; you might (correctly!) assume that locking the software for each addition would slow the calculation down, and be less efficient than calculating it in series.
What we can do to improve the performance of this calculation is to split it across a number of parallel threads that each perform their own part in series, and then sum their results.

In OpenMP, this kind of calculation is called a `reduction`, because it is used to reduce an array into a lower-dimension output.
In order to use one, we add the `reduction(reduction-identifier: var-list)` clause to our parallel operation.
In this clause, the mandatory `reduction-identifier` tells the compiler which operation will be used to combine the reduced variable. In Fortran it is either the name of a user-declared reduction, one of the operators `+`, `-`, `*`, `.and.`, `.or.`, `.eqv.` and `.neqv.`, or one of the intrinsic procedures `max`, `min`, `iand`, `ior` and `ieor`.
`var-list` then tells the compiler which variables will be part of the calculation.

When this clause is used, OpenMP creates a number of threads to divide the work between, each with a private copy of every reduction variable, initialised to a value determined by the reduction identifier (for example, 0 for `+` and 1 for `*`).
The operation is carried out on these private copies, and once all threads have finished their share of the work the copies are combined with the original variable to give the final reduced output.
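
As a rough illustration of the clause described above, the sketch below reduces an array to its sum and its maximum on the device. It is not the notebook's `reduction_scalar.F`: all names are hypothetical, the source is assumed to be free-form, and explicit `map` clauses are used rather than unified shared memory. If no device is available, an OpenMP runtime would normally fall back to running the loop on the host.

```fortran
! Hypothetical example, free-form source (e.g. reduction_sketch.f90).
module reduction_demo
  implicit none
contains
  ! Reduce x to its sum and its maximum in a single offloaded loop.
  subroutine sum_and_max(x, total, biggest)
    real, intent(in)  :: x(:)
    real, intent(out) :: total, biggest
    real :: s, m
    integer :: i
    s = 0.0
    m = -huge(m)
    ! Each thread gets private copies of s (initialised to 0) and
    ! m (initialised to the smallest representable value); the
    ! copies are combined with + and max when the loop finishes.
    !$omp target teams distribute parallel do &
    !$omp   reduction(+:s) reduction(max:m) &
    !$omp   map(to: x) map(tofrom: s, m)
    do i = 1, size(x)
      s = s + x(i)
      m = max(m, x(i))
    end do
    !$omp end target teams distribute parallel do
    total = s
    biggest = m
  end subroutine sum_and_max
end module reduction_demo

program reduction_example
  use reduction_demo
  implicit none
  real :: x(100), total, biggest
  integer :: i
  x = [(real(i), i = 1, 100)]
  call sum_and_max(x, total, biggest)
  ! For x = 1..100 the sum is 5050 and the maximum is 100.
  print *, 'sum =', total, ' max =', biggest
end program reduction_example
```

Note that the loop body contains no locks at all: the reduction clause is what tells the compiler that the updates to `s` and `m` may be carried out independently and combined afterwards.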

[`reduction_scalar.F`](./Fortran/reduction_scalar.F) shows an example of a reduction clause.
The code initialises two variables, then creates a parallel loop to increment these variables over a range.
For this example we use `reduction(+:ce1,ce2)`, indicating that we are reducing the variables `ce1` and `ce2` with an addition operation.

Let's compile and run it now, and check the output:

In [None]:
make reduction_scalar
./reduction_scalar

Reducing to a scalar value in this way is very useful and has many real-world applications.
There are also times, however, that we might want to take an array with many dimensions and, operating over a number of these dimensions, produce an output with a reduced dimensionality.
This is fully supported in the OpenMP standard, but at the time of writing it does not appear to be supported by the ROCm `amdflang` Fortran compiler.
For an example of how OpenMP can be used to carry out this kind of reduction, please refer to the C version of this notebook.
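
For readers who want to see what the standard allows in Fortran, the following is a minimal sketch of an array reduction run on host threads rather than offloaded. It assumes a compiler that implements array reductions (gfortran with `-fopenmp` is one such compiler) and has not been tested with `amdflang`; all names are hypothetical.

```fortran
! Hypothetical host-side example, free-form source (e.g. array_reduction.f90).
module array_reduction_demo
  implicit none
contains
  ! Reduce a 2D array over its second dimension, returning one
  ! total per row.
  function row_totals(a) result(totals)
    real, intent(in) :: a(:,:)
    real :: totals(size(a, 1))
    real :: acc(size(a, 1))
    integer :: j
    acc = 0.0
    ! Each thread accumulates into its own private copy of the whole
    ! acc array; the copies are added element by element at the end.
    !$omp parallel do reduction(+:acc)
    do j = 1, size(a, 2)
      acc = acc + a(:, j)
    end do
    !$omp end parallel do
    totals = acc
  end function row_totals
end module array_reduction_demo

program array_reduction_example
  use array_reduction_demo
  implicit none
  real :: a(3, 4)
  integer :: j
  do j = 1, 4
    a(:, j) = [1.0, 2.0, 3.0]
  end do
  ! Every column is (1, 2, 3), so the four-column totals are 4, 8 and 12.
  print *, 'row totals =', row_totals(a)
end program array_reduction_example
```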

In [None]:
# Does not work - seeking confirmation that this is not yet implemented in the rocm Fortran compilers.
# make reduction_array
# ./reduction_array

## Atomics

The modern equivalent of a mutex, the `omp atomic` directive ensures that the single memory update immediately following it is performed as one indivisible operation, so that multiple threads cannot modify that location at the same time.
For single memory operations, this lock is far less costly than the full software lock available in directives such as `omp critical`, whilst still avoiding problematic race conditions.

`omp critical` directives still have their uses; they are, for example, able to lock entire regions of code.
In general, however, these full software locks around regions of code are too costly for efficient device computing.
It is often worth the time to modify the code to remove such locks when porting it to GPU.
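
A minimal sketch of the directive in use is shown below. It is not the notebook's `atomic.F`: the names are hypothetical, the source is assumed to be free-form, and the data is moved with explicit `map` clauses rather than unified shared memory. Only the single update on the line after `!$omp atomic update` is protected.

```fortran
! Hypothetical example, free-form source (e.g. atomic_sketch.f90).
module atomic_demo
  implicit none
contains
  ! Sum the elements of x on the device, protecting each update to
  ! the shared accumulator with an atomic operation.
  function atomic_sum(x) result(total)
    real, intent(in) :: x(:)
    real :: total, acc
    integer :: i
    acc = 0.0
    !$omp target teams distribute parallel do map(to: x) map(tofrom: acc)
    do i = 1, size(x)
      ! The atomic directive applies only to the next statement.
      !$omp atomic update
      acc = acc + x(i)
    end do
    !$omp end target teams distribute parallel do
    total = acc
  end function atomic_sum
end module atomic_demo

program atomic_example
  use atomic_demo
  implicit none
  real :: x(1000)
  x = 1.0
  ! One thousand elements of value 1.0 sum to exactly 1000.
  print *, 'atomic sum =', atomic_sum(x)
end program atomic_example
```

For a plain sum like this a reduction is usually the better choice; atomics are more natural when the location being updated varies between iterations, for example when filling a histogram.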

_Note that at the time of writing, fast hardware atomics are __not__ enabled in the rocm `amdflang` Fortran compiler.
They are currently highly inefficient, and should be avoided where possible._

Let's take a look at [`atomic.F`](./Fortran/atomic.F) - in this example we perform an element-wise summation of an array using two methods: an atomic update within a parallel loop, and a reduction.
Both are timed, to compare performance.
Note that we need to enable `HSA_XNACK` for this example, as we are requiring `unified_shared_memory` to let the OS handle our memory movements for us.

Let's compile and run it now:

In [None]:
export HSA_XNACK=1
make atomic
./atomic

_As noted, these performance numbers are currently unrepresentative, due to the lack of fast hardware atomic support in the Fortran compiler.
It is therefore highly recommended to avoid `omp atomic` and use reductions wherever possible in your code._

In this notebook, we have learned about the concept of mutexes, and how modern techniques in the form of reductions and atomic calculations have largely replaced them in OpenMP device code.
In the next notebook, we will be looking at function calls in target regions in OpenMP, and bringing together the things we have learned in the exercises.