# OpenMP Explicit Memory Management

In this notebook, we will look at the explicit memory management directives available in OpenMP, and how to employ them to effectively optimise our code.
In particular, we will be looking at how the `target data` directive and the `map` clause are used to move data between the host and device, and how the `target update` directive can be used to force a copy part way through a data region.

Let's begin in the usual manner: making sure we have a working GPU, moving into the appropriate directory, and setting up a clean environment to work in.

In [None]:
rocm-smi
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2d-OpenMP_explicit_memory_directive/C
export LD_LIBRARY_PATH=/usr/lib64:/$HOME/opt/lib:${LD_LIBRARY_PATH}
make clean && rm -rf build

## Structured and unstructured regions

Let us first look at an important concept in OpenMP, the code `region`.
In OpenMP, a `region` is the section or block of code to which a pragma applies.

OpenMP gives regions defined by different directives their own names:
 - Target regions are regions of code to be executed on the GPU,
 - Parallel regions are regions to be executed in parallel,
 - Data regions are code regions within which data may exist on both the host and the device.

This notebook will primarily concern itself with data regions, though we will also want target regions to do real work on the GPU.

Traditionally, and indeed by default, a `region` is the block of code that immediately follows an OpenMP directive.
We've already seen this in the previous notebooks; the `for` loop immediately after a `#pragma omp target teams distribute parallel for` directive is the `region` in such a case.

However, we don't need to be limited to just one statement in a region - indeed, in many cases that would be detrimental to performance!
You might imagine a sequential series of operations being performed on the same set of vectors, for example - if each operation was treated as its own region, there would be a huge overhead incurred in unnecessary memory transfers between the host and device.
If there was a way to tell the compiler that all of these operations were to be carried out on the device without the need to move any memory, we would surely see a marked improvement in performance.
We can do this through the use of `structured regions`.

### Structured data regions

A structured region in OpenMP is a sequence of operations all intended to be carried out with a particular directive.
They are defined by placing curly brackets `{ }` around the desired region, immediately after the OpenMP pragma that will govern its behaviour.

We can see an example of this in [`target_data_structured.cpp`](./C/target_data_structured.cpp).
What the particular clauses of the `#pragma omp target data` directive are doing will be discussed later in this notebook - for now it is enough to understand that the directive is applying to the subsequent 5 function calls contained within the curly brackets.
It gives all of these calls access to the memory assigned on the device without further copies of the data from the host.
Note that we can (and indeed must) issue further target directives in the `zeros` and `saxpy` functions in order to actually carry out the calculations on the device.
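As a rough illustration of the idea, here is a minimal sketch of a structured data region. It is not the contents of `target_data_structured.cpp`: the function `run_structured`, the array values and the choice of `map` clauses are assumptions made for this sketch, and only the names `zeros` and `saxpy` are borrowed from the example.

```cpp
#include <cassert>
#include <vector>

// Set every element of y to zero on the device.
void zeros(float* y, int n) {
    #pragma omp target teams distribute parallel for map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = 0.0f;
    }
}

// Compute y = a * x + y on the device.
void saxpy(float a, const float* x, float* y, int n) {
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical driver: returns the sum of y after the data region.
float run_structured(int n) {
    assert(n > 0);
    std::vector<float> x_host(n, 1.0f);
    std::vector<float> y_host(n, -1.0f);
    float* x = x_host.data();
    float* y = y_host.data();

    // x is copied in once on entry and y is copied out once on exit.
    // Inside the braces both arrays are already present on the device,
    // so the map clauses in zeros() and saxpy() cause no extra transfers.
    #pragma omp target data map(to: x[0:n]) map(from: y[0:n])
    {
        zeros(y, n);
        saxpy(2.0f, x, y, n);
        saxpy(3.0f, x, y, n);
    }

    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += y[i];
    }
    return sum;
}
```

Every element of `y` ends up as 5, so `run_structured(8)` returns 40. The example file itself contains more calls and may use different clauses.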

Let's try compiling and running this now:

In [None]:
make target_data_structured
./target_data_structured

A structured region is a powerful tool to run a set of operations on the device without incurring the overheads of transferring data.
It does, however, have one major drawback: being defined by a set of curly brackets limits these structured regions to including only sequential commands.

We can imagine a scenario in which it would be beneficial to have an array allocated on the device and stay there whilst other, unrelated, operations are going on.
A C++ class might contain a member array, for example, that we want to manipulate using different method calls at different times.
If we then wanted to accelerate these calculations on the device with a structured region, it would likely end up with us copying the array backwards and forwards between host and device many times.
We might hope that there would be a better way to do such calculations, and good news - in more recent versions of the OpenMP standard, there is!

### Unstructured data regions

An unstructured data region is an area of code within which data is allocated onto a device.
Unlike a structured region, this area of code does not need to be sequential or contained within curly brackets.

We define these regions using the `#pragma omp target enter data` and `#pragma omp target exit data` directives.
Typically, the `target enter data` directive is issued immediately after allocation - often in the constructor of an object with data targeted for the device.
The matching `target exit data` directive will then be just before the associated `free`, often in an object's destructor.

[`target_data_unstructured.cpp`](./C/target_data_unstructured.cpp) shows an example of an unstructured data region.
This example executes the same calculations as [`target_data_structured.cpp`](./C/target_data_structured.cpp), but using variables that are defined out of scope of the OpenMP data directives.
As these variables are not passed to the `compute` function as arguments, we need to use an unstructured data region approach to the memory management here.
We allocate the necessary variables in `main`, then begin the unstructured region with the `target enter data` directive.
The calculations within `compute` then use the variables that have been declared on the device, before leaving the unstructured region with `target exit data` and freeing the resources.
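The pattern described above might look something like the following sketch. It is written for illustration only and is not the contents of `target_data_unstructured.cpp`: the globals `g_x`, `g_y` and `g_n`, the driver `run_unstructured` and the arithmetic inside `compute` are all assumptions.

```cpp
#include <cassert>

// File-scope data, deliberately not passed to compute() as arguments.
static double* g_x = nullptr;
static double* g_y = nullptr;
static int g_n = 0;

// y = a * x + y, using arrays that already live on the device.
void compute(double a) {
    // Local aliases keep the mapping simple: presence is looked up by host address.
    double* x = g_x;
    double* y = g_y;
    const int n = g_n;
    // Both arrays are already present, so these map clauses trigger no copies.
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical stand-in for main(): returns the sum of y.
double run_unstructured(int n) {
    assert(n > 0);
    g_n = n;
    g_x = new double[n];
    g_y = new double[n];
    for (int i = 0; i < n; ++i) {
        g_x[i] = 1.0;
        g_y[i] = 0.0;
    }
    double* x = g_x;
    double* y = g_y;

    // Start of the unstructured data region: copy both arrays to the device.
    #pragma omp target enter data map(to: x[0:n], y[0:n])

    compute(2.0);
    compute(3.0);

    // End of the region: bring y back, discard the device copy of x.
    #pragma omp target exit data map(from: y[0:n]) map(delete: x[0:n])

    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += y[i];
    }
    delete[] g_x;
    delete[] g_y;
    g_x = nullptr;
    g_y = nullptr;
    return sum;
}
```

Note that the `enter` and `exit` directives are not tied to any pair of braces, so in real code they could equally sit in a constructor and destructor.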

Let's compile and run it now:

In [None]:
make target_data_unstructured
./target_data_unstructured

## The `map` clause

Now that we understand the concept of a region in OpenMP, let's look in more detail at how we allocate and transfer data in them.

When we enter a data region, using either `#pragma omp target data` (structured) or `#pragma omp target enter data` (unstructured), we must also supply a `map` clause that tells the compiler which data we are transferring to the device, and how it should be handled.
The `map` clause has the following syntax:

```C++
map ([map-type:] var_list)
```
where `var_list` is a list of the variables to be mapped to the device.
The available `map-type`s are:
 - `to` - On entering the data region, transfer the contents of the `var_list` items from the host to the device,
 - `from` - On leaving the data region, transfer the contents of the `var_list` items from the device to the host,
 - `tofrom` - On entering the data region, transfer the contents of the `var_list` items from the host to the device, and on leaving the region transfer it back from the device to the host,
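 - `alloc` - On entering the data region, allocate space for the `var_list` items on the device without copying anything across, so their initial values on the device are undefined,
 - `delete` - removes the `var_list` items from the device, regardless of their reference count,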
 - `release` - decrements the reference count to the `var_list` items by one, and deletes the allocation if this number reaches zero. This is useful if multiple data regions access the same memory.

If no `map-type` is supplied, `map` will default to `tofrom` behaviour.
Only one `map-type` can be provided per `map` clause, but multiple `map` clauses can be given to each `target data` directive when different data should be treated differently, as seen in the [`target_data_structured.cpp`](./C/target_data_structured.cpp) example.
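To make the difference between the map-types more concrete, here is a small sketch that uses `to`, `alloc` and `from` on three different arrays in one `target data` directive. The function name `square_plus_one` and the calculation are invented for illustration.

```cpp
#include <cassert>
#include <vector>

// Returns input[i] * input[i] + 1 for each element, computed in two device stages.
std::vector<double> square_plus_one(const std::vector<double>& input) {
    const int n = static_cast<int>(input.size());
    assert(n > 0);
    std::vector<double> scratch(n);
    std::vector<double> output(n);
    const double* in = input.data();
    double* tmp = scratch.data();
    double* out = output.data();

    // to:    in is only read on the device, so it is copied host -> device.
    // alloc: tmp is device-only scratch space, so nothing is ever copied.
    // from:  out is only written on the device, so it is copied device -> host.
    #pragma omp target data map(to: in[0:n]) map(alloc: tmp[0:n]) map(from: out[0:n])
    {
        // The pointers used below resolve to the device copies mapped above.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) {
            tmp[i] = in[i] * in[i];
        }

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) {
            out[i] = tmp[i] + 1.0;
        }
    }
    return output;
}
```

Calling `square_plus_one({1.0, 2.0, 3.0})` gives `{2.0, 5.0, 10.0}`. Using `tofrom` for all three arrays would give the same answer, but with four unnecessary copies.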

It should be noted that when an array is accessed through a pointer, the compiler has no way of knowing its size.
For this reason, we must supply an array section of the form `array[start:length]`, where `length` is the number of elements to map rather than an end index, as demonstrated in the previous examples.
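In an OpenMP array section the number after the colon is a length (a count of elements), not an end index. The sketch below maps only part of an array to show this; the function `double_range` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Doubles `count` elements of v, beginning at index `start`, on the device.
// Elements outside that range are neither transferred nor modified.
void double_range(std::vector<int>& v, int start, int count) {
    assert(start >= 0 && count > 0);
    assert(start + count <= static_cast<int>(v.size()));
    int* a = v.data();

    // a[start:count] maps `count` elements starting at a[start].
    // The loop still uses the original indices.
    #pragma omp target teams distribute parallel for map(tofrom: a[start:count])
    for (int i = start; i < start + count; ++i) {
        a[i] *= 2;
    }
}
```

For example, with `v = {1, 2, 3, 4, 5, 6}`, calling `double_range(v, 2, 3)` touches indices 2, 3 and 4 and leaves `{1, 2, 6, 8, 10, 6}`. The shorthand `a[:n]` is equivalent to `a[0:n]`.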

Note also that a `map` clause does not have to be associated with a `target data` directive - `target` directives, including combined ones such as `target teams distribute parallel for`, can also take `map` clauses if the transfer of data is only required for that particular region.

The `map` clause directs data transfers between the host and device at the beginning and end of a data region, but does nothing to keep the two copies in sync in between.
There are effectively now two copies of the data - one on the host, and one on the device - that can be operated on independently, and have no immediate bearing on each other.
This could potentially be problematic if we want to carry out a CPU-based calculation on a variable mapped to the device part way through a data region, and then use this value in a device-based operation.

We could overcome this problem by closing a data region whenever we need to carry out work on the host, and then declaring a new region with the changed variables, but the additional overheads associated with data transfer and memory allocation with each region declaration make this an inefficient approach.

Fortunately, there exists in OpenMP a feature that allows the transfer of data within a data region, in the form of the `update` directive.

## The update directive

The `#pragma omp target update` directive tells the compiler to transfer data between the host and device.
No memory is allocated or freed by this directive; it strictly updates variables that are already present on the device within a data region.
It can take the following additional clauses:
 - `device(int)` - an integer identifier of the acceleration device with which to update,
 - `to(var_list)` - a list of variables to update from host to device,
 - `from(var_list)` - a list of variables to update from device to host,
 - `if(expression)` - run the pragma only if the expression is true.

Note that the `to` and `from` clauses can take subarrays of the variables being transferred, so that we can update only the necessary parts of our offloaded variables and save transfer time for larger arrays.

[`target_data_update.cpp`](./C/target_data_update.cpp) shows an example of an `update` directive in use.
In `main`, we declare a structured data region that allocates `tmp` and maps `input` and `res` to the device.
We carry out `some_computation` on the target device, which uses values of `input` to fill `tmp`.
We then call `update_input_array_on_the_host`, which - as its name suggests - updates the values of `input` on the host.

At this point, `input` has changed on the host, but not on the device.
We want to call the `final_calculation` function and carry it out on the target device, but this takes `input` as an argument.
If we were to immediately call it, the function would take the values of `input` that already exist on the device, which have not been updated since the host calculation.
To use these updated values of `input`, we insert a `#pragma omp target update to(input[:N])` directive before calling the `final_calculation`.
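The sequence just described can be sketched as follows. This is not the contents of `target_data_update.cpp`: the three stages are inlined as simple loops standing in for `some_computation`, `update_input_array_on_the_host` and `final_calculation`, and the map-types, values and the name `run_with_update` are assumptions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical driver: returns res after a device, host, device sequence.
std::vector<double> run_with_update(int n) {
    assert(n > 0);
    std::vector<double> input_host(n, 1.0);
    std::vector<double> tmp_host(n);
    std::vector<double> res_host(n);
    double* input = input_host.data();
    double* tmp = tmp_host.data();
    double* res = res_host.data();

    #pragma omp target data map(to: input[0:n]) map(alloc: tmp[0:n]) map(from: res[0:n])
    {
        // Device stage (stand-in for some_computation): tmp = 2 * input.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) {
            tmp[i] = 2.0 * input[i];
        }

        // Host stage (stand-in for update_input_array_on_the_host).
        // Only the host copy of input changes here.
        for (int i = 0; i < n; ++i) {
            input[i] = 10.0;
        }

        // Push the new host values into the existing device copy.
        #pragma omp target update to(input[0:n])

        // Device stage (stand-in for final_calculation): res = tmp + input.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) {
            res[i] = tmp[i] + input[i];
        }
    }
    return res_host;
}
```

With the `update` in place every element of the result is 2 + 10 = 12. If the `update` line were removed and the code offloaded, the final stage would still see the old device value of 1 and produce 3 instead.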

Let's see it in action:

In [None]:
make target_data_update
./target_data_update

## Memory pragma examples

Now that we understand OpenMP regions, and what the `map` clause and `update` directive do, we can look at some examples of how to use them for explicit memory management in our own code.

Let's begin by setting up our environment.
We'll set `LIBOMPTARGET_INFO` as before to check information about our OpenMP offloading, but this time we'll also introduce a new environment variable: `OMP_TARGET_OFFLOAD=MANDATORY`.
This will terminate the program if the code fails to offload execution to the device, rather than falling back on the host.
Once these are set, we'll build the available code for the examples:

In [None]:
export LIBOMPTARGET_INFO=0
export OMP_TARGET_OFFLOAD=MANDATORY
cmake -B build
cmake --build build

### The daxpy example code - `mem1.cc`

[`mem1.cc`](./C/mem1.cc) contains the base version of the code that we will be modifying in this notebook.
It is a simple daxpy code - that is, the same as the saxpy code we've been using up until now, but at double precision - that carries out one operation on the device and calculates the time taken to do so.
Let's look at the OpenMP directives we're using in it:

In [None]:
grep "pragma omp" mem1.cc

As you can see, we only have one OpenMP pragma in the base version of the code.
The usual `#pragma omp target teams distribute parallel for` directive has been given a `map` clause to direct the memory transfers between the host and device for the following calculation.
We are mapping our inputs `x` and `y` to the device at the beginning of the operation, and then copying the result `z` back to the host upon completion.
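A directive of this kind might look roughly like the sketch below. The function signature and loop body are assumptions rather than a copy of `mem1.cc`, and the timing code is omitted.

```cpp
#include <cassert>
#include <vector>

// z = a * x + y in double precision, offloaded with a single directive.
void daxpy(int n, double a, const double* x, const double* y, double* z) {
    assert(n > 0);
    // The map clauses apply to this one target region only:
    // x and y are copied in before the loop, z is copied out after it.
    #pragma omp target teams distribute parallel for map(to: x[0:n], y[0:n]) map(from: z[0:n])
    for (int i = 0; i < n; ++i) {
        z[i] = a * x[i] + y[i];
    }
}
```

Because the mapping is attached to the target region itself, every call to `daxpy` allocates, copies and frees all three arrays on the device.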

Let's see it in operation:

In [None]:
./build/mem1

### Adding an unstructured data region - `mem2.cc`

Let's compare the base example with [`mem2.cc`](./C/mem2.cc):

In [None]:
diff --unified mem1.cc mem2.cc

We've added in a `target enter data` directive immediately after assignment, and a `target exit data` directive just before we delete our variables.
In this way, we've defined an unstructured data region that ensures that our variables `x`, `y` and `z` will all be on the device.

We've also added the `always` map-type modifier to the `map` clause in the `daxpy` function - this tells the compiler to transfer the data between the host and device in the specified way at this point, even though it is already present on the device.
Without this modifier, a `map` clause on data that is already present only adjusts its reference count, so updates to the variables declared in our data region would not be transferred across as expected.
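The combination of an unstructured region and the `always` modifier could be sketched like this. The names `daxpy_always` and `run_always`, and the decision to initialise the arrays after entering the region, are assumptions made to show why `always` matters; `mem2.cc` may be organised differently.

```cpp
#include <cassert>
#include <vector>

// z = a * x + y, forcing transfers even though the arrays are already present.
void daxpy_always(int n, double a, const double* x, const double* y, double* z) {
    #pragma omp target teams distribute parallel for \
        map(always, to: x[0:n], y[0:n]) map(always, from: z[0:n])
    for (int i = 0; i < n; ++i) {
        z[i] = a * x[i] + y[i];
    }
}

// Hypothetical driver: returns the sum of z.
double run_always(int n) {
    assert(n > 0);
    std::vector<double> xv(n), yv(n), zv(n);
    double* x = xv.data();
    double* y = yv.data();
    double* z = zv.data();

    // Allocate device storage only; no values are copied yet.
    #pragma omp target enter data map(alloc: x[0:n], y[0:n], z[0:n])

    // Fill the host arrays after the device allocation was made.
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    // `always, to` pushes the host values; `always, from` pulls z back.
    daxpy_always(n, 2.0, x, y, z);

    #pragma omp target exit data map(delete: x[0:n], y[0:n], z[0:n])

    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += z[i];
    }
    return sum;
}
```

Each element of `z` is 2 * 1 + 2 = 4, so `run_always(6)` returns 24.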

Let's see how this version of the code performs:

In [None]:
./build/mem2

We can see that the performance of the code is largely comparable to the structured example.

### Replacing `map to/from` with `update` directives - `mem3.cc`

Let's compare [`mem3.cc`](./C/mem3.cc) with the baseline code:

In [None]:
diff --unified mem1.cc mem3.cc

In addition to the changes we made in [`mem2.cc`](./C/mem2.cc), we have now also removed the data transfers from the loop directive.
Instead, we've replaced them with `update` directives that more efficiently handle the data transfers.
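One way this change could look is sketched below, with the transfers written as separate `target update` directives around a loop directive that carries no `map` clause. The function names and the exact placement of the directives are assumptions, not a copy of `mem3.cc`.

```cpp
#include <cassert>
#include <vector>

// z = a * x + y, with transfers expressed as explicit update directives.
// Assumes x, y and z were already placed on the device by the caller.
void daxpy_update(int n, double a, const double* x, const double* y, double* z) {
    // Refresh the device copies of the inputs from the host.
    #pragma omp target update to(x[0:n], y[0:n])

    // No map clause: the pointers resolve to the arrays already on the device.
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i) {
        z[i] = a * x[i] + y[i];
    }

    // Copy the result back into the host array.
    #pragma omp target update from(z[0:n])
}

// Hypothetical driver: returns the sum of z.
double run_update(int n) {
    assert(n > 0);
    std::vector<double> xv(n), yv(n), zv(n);
    double* x = xv.data();
    double* y = yv.data();
    double* z = zv.data();

    #pragma omp target enter data map(alloc: x[0:n], y[0:n], z[0:n])

    for (int i = 0; i < n; ++i) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    daxpy_update(n, 3.0, x, y, z);

    #pragma omp target exit data map(delete: x[0:n], y[0:n], z[0:n])

    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += z[i];
    }
    return sum;
}
```

Separating the transfers from the kernel in this way also makes it possible to update only part of an array, or to skip an update entirely when the host copy has not changed.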

Let's compare the performance now:

In [None]:
./build/mem3

Again, we see the performance is comparable to the original example.

### Replacing `delete` with `release` to use reference counting - `mem4.cc`

Let's compare the next example, [`mem4.cc`](./C/mem4.cc), with the first unstructured data region code, [`mem2.cc`](./C/mem2.cc):

In [None]:
diff --unified mem2.cc mem4.cc

The only difference here is that we have replaced the `delete` map-type with `release`, to allow the use of reference counting for our resources.
As we are only mapping the relevant variables once in our code, this will decrement their reference count to zero, and so has the same effect as `delete`.
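To see where `release` and `delete` would actually differ, consider the following sketch, in which the same array is entered into a data region twice. The function `run_release` is invented for illustration and does not correspond to `mem4.cc`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical driver: maps z twice, releases it once, and keeps using it.
double run_release(int n) {
    assert(n > 0);
    std::vector<double> zv(n, 1.0);
    double* z = zv.data();

    // First mapping: allocate and copy, reference count becomes 1.
    #pragma omp target enter data map(to: z[0:n])
    // Second mapping of the same data: already present, count becomes 2.
    #pragma omp target enter data map(alloc: z[0:n])

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i) {
        z[i] += 1.0;
    }

    // release only decrements the count (2 -> 1); the device copy survives.
    #pragma omp target exit data map(release: z[0:n])

    // The array is still present, so it can be used on the device again.
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i) {
        z[i] *= 3.0;
    }

    // The count drops to zero here, so z is copied back and freed.
    #pragma omp target exit data map(from: z[0:n])

    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += z[i];
    }
    return sum;
}
```

Every element ends up as (1 + 1) * 3 = 6. Had the first `exit data` used `delete`, the device copy would have been removed regardless of the count, and the second loop would no longer find `z` on the device.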

Let's run this example:

In [None]:
./build/mem4

### Reducing the number of data transfers with the use of to/from - `mem5.cc`

Let's compare our final example, [`mem5.cc`](./C/mem5.cc), with our baseline unstructured data region code:

In [None]:
diff --unified mem2.cc mem5.cc

In this example, we are trying to minimise the number of data transfers in our code.
To this end, we have changed the initial data region to `map(to)` for `x` and `y`, to transfer their initial assigned values immediately at allocation.
The result, `z`, is only allocated here, as it does not need any initial values.

We have then removed the `always` keyword from the target loop, as we no longer want a data transfer to occur here.
Finally, we transfer `z` back to the host only as we leave the data region.
In this way, each variable is transferred exactly once.
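Put together, the approach described above might look like the following sketch. The names `daxpy_resident` and `run_min_transfers` and the values used are assumptions; `mem5.cc` may differ in its details.

```cpp
#include <cassert>
#include <vector>

// z = a * x + y on arrays that are already resident on the device.
void daxpy_resident(int n, double a, const double* x, const double* y, double* z) {
    // No `always` modifier: the arrays are present, so nothing is copied here.
    #pragma omp target teams distribute parallel for map(to: x[0:n], y[0:n]) map(from: z[0:n])
    for (int i = 0; i < n; ++i) {
        z[i] = a * x[i] + y[i];
    }
}

// Hypothetical driver: returns the sum of z.
double run_min_transfers(int n) {
    assert(n > 0);
    std::vector<double> xv(n, 1.0), yv(n, 2.0), zv(n);
    double* x = xv.data();
    double* y = yv.data();
    double* z = zv.data();

    // Inputs are copied in once; the output is only allocated.
    #pragma omp target enter data map(to: x[0:n], y[0:n]) map(alloc: z[0:n])

    daxpy_resident(n, 2.0, x, y, z);

    // The output is copied back once; the inputs are simply discarded.
    #pragma omp target exit data map(from: z[0:n]) map(delete: x[0:n], y[0:n])

    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += z[i];
    }
    return sum;
}
```

Here `x` and `y` each cross the bus once on entry and `z` once on exit, however many times `daxpy_resident` is called in between.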

Let's see this in action:

In [None]:
./build/mem5

We observe similar performance to the baseline case, perhaps with a slight improvement.
With such small examples, performance improvements will be difficult to spot.
Feel free to experiment with the example code, changing the size of the array or complexity of calculations, and see how the performance changes with the use of different memory pragmas.

Now that we've learnt how to manage memory explicitly ourselves with OpenMP directives, we can look at how to let the operating system handle the memory for us.
In the next notebook, we will be exploring this by using Managed Memory.