# OpenMP and HIP Interoperability

In this notebook, we will see how we can write an application that makes use of both OpenMP pragmas and HIP kernels.
This type of strategic use of multiple acceleration paradigms can make our code flexible and performant in ways that would not be possible with just one.

Let's begin in the usual way: by checking that we have an appropriate GPU on the system, moving into the relevant working directory, and cleaning our environment.

In [None]:
rocm-smi
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2i-OpenMP_HIP_interportability/C
make clean

## Why use HIP and OpenMP?

As we've learned throughout these notebooks, OpenMP is a flexible and portable pragma-based API for shared-memory parallelism.
It allows relatively straightforward parallelism to be implemented across different architectures, but lacks the level of direct control that other paradigms offer.

HIP, on the other hand, offers tighter control at the cost of OpenMP's ease and flexibility.
The details of HIP implementation will not be discussed in this notebook - rather, that will be saved for the next section of this course.
For now, it is enough to know that it is a low-level language that makes use of kernels to offer a more direct control of the offloaded compute.

Both of these approaches have their upsides and their drawbacks, so you might wonder whether there is a way to incorporate both into one code and draw on their relative strengths.
There is!
Code making use of OpenMP pragmas can call HIP kernels, and HIP applications can call OpenMP kernels.

Using this knowledge we can, for example, write our simple loops using OpenMP pragmas, but write our complicated offloaded routines where we require fine-grained control to optimise their performance as HIP kernels.
By using both in the same application, we really can have the best of both worlds.

### The example code

An example of this can be seen in the [`daxpy.cc`](./C/daxpy.cc) example.
In this code, we define two arrays `x` and `y` and move them to the device in a structured data region.
Several operations are then carried out on the host, after which the device copies of `x` and `y` must be updated to match.
Note that on an APU, where the CPU and GPU share the same physical memory, these data-movement pragmas would be unnecessary.
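
To make the data handling described above more concrete, here is a minimal sketch of a structured data region followed by a `target update`.
It is not the code from `daxpy.cc`: the function name `daxpy_with_data_region` and the host-side operations are made up for illustration, and the final daxpy is done in an OpenMP target loop here rather than in a HIP kernel.
If the file is built without OpenMP offloading, the pragmas are ignored and the loops simply run on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from daxpy.cc): keeps x and y resident on
// the device for the whole block, refreshes the device copies after a
// host-side change, then computes y = a*x + y in an OpenMP target region.
void daxpy_with_data_region(double a, std::vector<double>& x,
                            std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    double* xp = x.data();
    double* yp = y.data();

    // Structured data region: x is copied in, y is copied in and back out
    // when the block ends.
    #pragma omp target data map(to: xp[0:n]) map(tofrom: yp[0:n])
    {
        // Host-side work (illustrative). The device copies are now stale.
        for (std::size_t i = 0; i < n; ++i) {
            xp[i] *= 2.0;
            yp[i] += 1.0;
        }

        // Push the modified host values to the existing device copies.
        #pragma omp target update to(xp[0:n], yp[0:n])

        // The arrays are already present on the device, so these map
        // clauses do not trigger another transfer.
        #pragma omp target teams distribute parallel for \
            map(to: xp[0:n]) map(tofrom: yp[0:n])
        for (std::size_t i = 0; i < n; ++i) {
            yp[i] = a * xp[i] + yp[i];
        }
    }
}
```

Calling `daxpy_with_data_region(3.0, x, y)` with `x = {1, 2, 3}` and `y = {10, 20, 30}` leaves `y` holding `{17, 33, 49}` on the host once the data region closes.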

We then carry out a daxpy loop using these two arrays, but this time in a HIP kernel.
The HIP kernel itself is defined within the `daxpy_hip` function.
This code seems relatively straightforward, so let's take a look at how we should go about compiling it.

## Compiling code with HIP and OpenMP

Unfortunately, whilst OpenMP applications can call HIP kernels, and vice versa, there is currently no support for compiling HIP and OpenMP at the same time.
The HIP compiler `hipcc` cannot offload OpenMP GPU kernels, and the AMD compilers that support OpenMP pragmas cannot compile HIP kernels.
In order to use both, we will need to compile the HIP kernels using `hipcc`, and the OpenMP code with a suitable C++ compiler, such as `amdclang++`.

This can be done by separating the kernels into different files, and compiling them individually.
As long as the kernels are properly declared in each other's scopes at compile-time, they will be able to call each other during runtime without issue.
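
As a rough sketch of what this declaration can look like on the OpenMP side, the snippet below declares a HIP launch wrapper and calls it with device pointers obtained through `use_device_ptr`.
The signature of `daxpy_hip` and the caller `run_daxpy` are assumptions made for illustration, not copied from `daxpy.cc`.
The plain loop at the bottom is only a stand-in so that the sketch builds and runs on the host without ROCm; in a real build it would be deleted and replaced by the object file that `hipcc` produces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Declaration that the OpenMP translation unit needs to see.
// Assumed signature: the real one is whatever the HIP file defines.
void daxpy_hip(int n, double a, const double* x, double* y);

// Hypothetical OpenMP-side caller.
void run_daxpy(double a, std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    double* xp = x.data();
    double* yp = y.data();

    // Place x and y on the device; y is copied back at the end.
    #pragma omp target data map(to: xp[0:n]) map(tofrom: yp[0:n])
    {
        // Inside this inner region, xp and yp hold the device addresses,
        // which is what a HIP kernel launch expects.
        #pragma omp target data use_device_ptr(xp, yp)
        {
            daxpy_hip(n, a, xp, yp);
        }
    }
}

// Stand-in definition for host-only testing. It dereferences its pointer
// arguments on the CPU, so it must NOT be used when the pragmas above are
// really offloading to a GPU. Remove it when linking the HIP object.
void daxpy_hip(int n, double a, const double* x, double* y) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The two `target data` regions are nested rather than merged so that the mapping is clearly established before the device addresses are requested.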

It is also possible, however, to have both in the same file and compile in one command, via `make` and the prudent use of pre-processor flags.
This concept is demonstrated in [`daxpy.cc`](./C/daxpy.cc).

Note the two pre-processor commands that surround the majority of the code in this file, `#ifdef __DEVICE_CODE__` and `#ifdef __HOST_CODE__`.
As you might expect from their names, the HIP kernels destined to run on the device are contained within the `__DEVICE_CODE__` pre-processor conditional statement, and the OpenMP code that will run on the host can be found within `__HOST_CODE__`.

Now let's take a look at the [`Makefile`](./C/Makefile) and see how we can interact with these blocks of code:

In [None]:
cat Makefile

Our default build rule compiles our file into two objects: the first uses the `hipcc` compiler and the second uses the default `CXX` compiler.
In the `HIPCC_FLAGS`, note the `-D__DEVICE_CODE__` option.
This defines the `__DEVICE_CODE__` macro and makes it available to the compiler, allowing it to enter the section of code in [`daxpy.cc`](./C/daxpy.cc) demarcated by the `__DEVICE_CODE__` `#ifdef` and compile it with the HIP libraries.

We then compile the file again with the default `CXX` compiler, with the `-D__HOST_CODE__` flag.
This excludes the HIP section of the code and compiles the OpenMP directives as usual. We are then free to link these objects into the `daxpy` executable.
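
The two-pass idea can be sketched in a single small file.
This is a simplified illustration rather than the contents of `daxpy.cc`: the helper names `active_section` and `report_device_pass` are hypothetical, the HIP kernels and OpenMP loops are left out, and the compile commands in the comments are assumptions about what a Makefile of this kind might run.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Assumed build commands (the real ones live in the Makefile):
//   hipcc  -D__DEVICE_CODE__ -c sketch.cc -o sketch_device.o
//   $(CXX) -fopenmp -D__HOST_CODE__ -c sketch.cc -o sketch_host.o
// followed by a link step that combines both objects.

#if defined(__DEVICE_CODE__) && defined(__HOST_CODE__)
#error "Define only one of __DEVICE_CODE__ and __HOST_CODE__ per pass"
#endif

// Reports which guarded section the current compilation pass can see.
// Marked static so that each object file gets its own private copy.
static const char* active_section() {
#if defined(__DEVICE_CODE__)
    return "device";
#elif defined(__HOST_CODE__)
    return "host";
#else
    return "none";
#endif
}

#ifdef __DEVICE_CODE__
// Seen only by the hipcc pass: HIP kernels and their launch wrappers
// would go here.
void report_device_pass() {
    std::printf("compiled for: %s\n", active_section());
}
#endif

#ifdef __HOST_CODE__
// Seen only by the OpenMP pass: declare what the other object provides,
// then use it from main.
void report_device_pass();

int main() {
    assert(std::string(active_section()) == "host");
    report_device_pass();
    std::printf("compiled for: %s\n", active_section());
    return 0;
}
#endif
```

Compiled with neither macro defined, both guarded sections are skipped and `active_section()` returns `"none"`, which is a quick way to confirm that the flags are reaching the compiler.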

Let's try compiling and running it now:

In [None]:
make daxpy
./daxpy

If all has worked well, the code will state which machine it was compiled for - the host or the device - and verify that the results were reported correctly.
Congratulations!
You've now successfully run code with multiple forms of acceleration enabled.

This concludes the tutorial's optional lessons on OpenMP.

In the next set of notebooks, we will dive into HIP proper: examining its functionality, discovering the strengths of its programming model, and learning how we can use it in our own code.