# Managed Memory

The previous notebook taught us about explicitly managing memory across the host and device, and the directives that OpenMP provides for doing so.
More modern AMD Instinct series GPUs provide functionality that allows the operating system (OS) to manage these data transfers automatically.
In this notebook, we will look at some examples of managed memory code, how we can make it backwards compatible with non-unified memory devices, and the benefits that unified memory can offer.

Before we start, let's move into the appropriate directory and set up a clean working environment:

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2e-OpenMP_managed_memory/C
rm -rf build

Let's now set the necessary environment variables for the exercises.
As we've previously seen, `LIBOMPTARGET_INFO` provides information on memory transfers to the device.
`LIBOMPTARGET_KERNEL_TRACE=1` prints a message whenever a kernel is launched on the device, along with the number of teams and threads, and its register usage.
And as introduced in the previous notebook, `OMP_TARGET_OFFLOAD=MANDATORY` will ensure that the code only runs if it is successfully offloading to its target device.
In this notebook, we will again be using the `HSA_XNACK` environment variable, which is read at run time (not by the compiler) and enables XNACK page-fault handling on the GPU, so that the OS can handle memory management on demand.

Let's enable these, and set up our build folder now:

In [None]:
export LIBOMPTARGET_KERNEL_TRACE=1
export LIBOMPTARGET_INFO=1
export OMP_TARGET_OFFLOAD=MANDATORY
export HSA_XNACK=1
cmake -B build

## The example daxpy code with managed memory - `mem7.cc`

The examples in this notebook use the same daxpy code as the previous, explicit memory management notebook.
They offload a simple daxpy loop to the target device, and report on the time spent in the calculation.
Let's look at the OpenMP pragmas that we use in the managed memory code, implemented in the first example [`mem7.cc`](./C/mem7.cc), and compare them with the explicit model:

In [None]:
grep "pragma omp" mem7.cc

We can see that there are two changes with respect to the baseline explicit memory management case: the addition of a `requires unified_shared_memory` directive, and the removal of all `map` clauses.

As you might expect from reading it, the `#pragma omp requires unified_shared_memory` directive requires that the target architecture specified for offloading (i.e. the device specified using `--offload-arch`) supports unified shared memory.
It is what's known as a declarative directive, and will cause the compilation to fail should the requirement not be met.
The declaration of `unified_shared_memory` in this pragma makes the use of `map` in any subsequent `target` directives optional, as it requires that the device is able to access the host memory address of any variable used within the target region directly.
And indeed, we see that in this code the `map` clauses have been removed as unnecessary.

Let's compile and run this code now:

In [None]:
cmake --build build --target mem7
./build/mem7

Note that in the output we see two reports from `LIBOMPTARGET_KERNEL_TRACE` when our target regions are launched, but no memory movement from `LIBOMPTARGET_INFO`.
This is because in the unified shared memory model, no data transfer is taking place.

## Maintaining backwards compatibility for discrete GPUs - `mem8.cc`

This code will work on any system with a unified shared memory, but will not be portable to any system that does not.
If we want to keep the benefits of the unified model whilst also allowing it to run on other architectures, we will need to make some changes.
[`mem8.cc`](./C/mem8.cc) demonstrates how we can implement this backwards compatibility in our code.
Let's take a look at how it compares with the baseline:

In [None]:
diff mem8.cc mem7.cc

The first change is the addition of a pre-processor guard around the `requires unified_shared_memory` directive.
If we are compiling for an architecture that has unified shared memory, the pragma is applied as normal, but for an unsupported architecture it will now be skipped.
This means that we now need to tell the compiler explicitly how to handle the data in such cases, and we can see this in the unstructured data region implemented for the variables `x`, `y` and `z` above.
Remember that the requirement of `unified_shared_memory` only makes such mappings optional in the code; we are free to add memory directives as we wish, but with the unified model enabled the compiler may choose to ignore them where they are unnecessary.

Let's compile and run this example now:

In [None]:
cmake --build build --target mem8
./build/mem8

We see the same output as the previous example - i.e. none of the `target data` directives have explicitly moved any data.
This is to be expected on a system that supports unified shared memory.

## Use of `std::vector` in offloaded variables - `mem9.cc`

In all of our previous examples, we have been using C-style arrays to offload data to the device.
The fixed size of an allocated array makes these variables well suited to the memory offloads associated with running on a target device, but can be a hindrance in everyday coding when the size of such an array is not known at compile time, or changes at run time.
For such cases, it is often advantageous to use standard library containers such as `std::vector`, allowing dynamic reallocation of size and easy appending and removing of elements thanks to the container's in-built memory management.

In the traditional model of explicit memory offloading to a discrete GPU, `std::vector`s present a serious challenge: if the vector reallocates its storage as it grows, the data moves to a new address and the existing device memory map is no longer valid.
But with a unified shared memory, the device can work on `std::vector`s as well as it could on arrays.

[`mem9.cc`](./C/mem9.cc) demonstrates our example code with the arrays replaced with `std::vector`s.
When we compile this example, we will see a number of warnings informing us that the memory mapping of the `std::vector` type to the device will not necessarily be correct.
It is important to understand this warning, particularly when using discrete GPUs and allowing the OS to manage the memory movements for you, but in the unified memory space on an APU where no data is copied, the code will always work as expected.

Let's compile and run it now:

In [None]:
cmake --build build --target mem9
./build/mem9

## Use of `std::valarray` in offloaded variables - `mem10.cc`

In some legacy code, you may encounter `std::valarray`, a standard library container originally intended for optimised numerical work, used as an alternative to the array.
With managed memory enabled, these may be used in OpenMP applications in the same way as the `std::vector`.
[`mem10.cc`](./C/mem10.cc) shows our example code with `std::valarray`s replacing the array-typed variables.

We will see the same warnings at compile time for this example as we did with `std::vector`s, and as in that case, we can expect it to run correctly if our system has a unified shared memory.
Let's compile and run it now:

In [None]:
cmake --build build --target mem10
./build/mem10

In this and the previous notebook, we have discussed how to manage memory - both explicitly and automatically by the OS - for large arrays whose elements we want to act on independently.
This is the foundation of accelerated computation, but is hardly the end of the story.

The next section will discuss how OpenMP treats distributed calculations that need to access a single shared variable, covering reductions, atomics and mutexes.