# Device functions and subroutines in C

Until now, we have focused on how to offload data or particular calculations to the GPU.
In most real software applications, however, we will want to pass data through a pipeline of multiple functions and routines before completion.
In this notebook, we will learn how to call functions to run on the device from within existing target regions.
In doing so, we will combine many of the things we have learned up to this point, and you will have the chance to port some code to the GPU with OpenMP yourself.

Let's begin as usual with our environment check and moving into the appropriate directory:

In [None]:
rocm-smi
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/C

## The `declare target` directive

Calling a function that has been compiled for the architecture of the host from within a target region can have unexpected consequences.
The function may not be able to run on the device architecture, and with some compilers the call may even cause a linking error.
To ensure that our functions can run on the device, we must indicate to the compiler that a device-compatible version should be generated.
We can do this by marking a function with the `declare target` directive.

As with data regions, the `declare target` directive can be used in a number of different ways.
We can enclose one or more function declarations and definitions between `#pragma omp declare target` and `#pragma omp end declare target` (OpenMP 5.1 also accepts `#pragma omp begin declare target` as the opening directive).
Alternatively, we can mark functions that have already been declared using `#pragma omp declare target(func-list)`, where `func-list` is a comma-separated list of the function names in question.
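As a minimal sketch of the enclosing form and the list form, consider the following. The functions `square`, `cube` and `fill_powers` are hypothetical names chosen for illustration and do not appear in the course source files.

```c
/* Form 1: enclose the definition in a declare target region. */
#pragma omp declare target
double square(double x) { return x * x; }
#pragma omp end declare target

/* Form 2: name an already-declared function in a list. */
double cube(double x);
#pragma omp declare target(cube)
double cube(double x) { return x * x * x; }

/* Hypothetical driver: both functions are callable inside the target region. */
void fill_powers(double *out, int n)
{
    #pragma omp target teams distribute parallel for map(from: out[0:n])
    for (int i = 0; i < n; ++i) {
        double x = (double)i;
        out[i] = square(x) + cube(x);
    }
}
```

If the code is built without offloading support, the pragmas fall back to the host and the function still produces the same values, which makes this a convenient way to check correctness before running on the GPU.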

To get started, let's move into our first example directory, `1_device_routine/0_device_routine_portyourself`, and clean our environment therein.

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/C/1_device_routine/0_device_routine_portyourself
make clean
ls

There are two source files: [`device_routine.c`](./C/1_device_routine/0_device_routine_portyourself/device_routine.c), which contains the `main` function, and [`compute.c`](./C/1_device_routine/0_device_routine_portyourself/compute.c), where the `compute()` function is implemented.
At the moment, this code will run in serial on the CPU. It is your task to add the appropriate OpenMP pragmas to get it running properly on the GPU.

Before porting the code, read the source code and try to understand it.
Then, compile and run this baseline CPU version, to know what to expect:

In [None]:
make
./device_routine

To port the code, we can either use unmanaged memory with an unstructured data region - as covered in [2d-OpenMP_explicit_memory_directive](../2d-OpenMP_explicit_memory_directive/Explicit_memory_management_in_C.ipynb) - or managed memory with unified shared memory - which was covered in [2e-OpenMP_managed_memory](../2e-OpenMP_managed_memory/Managed_memory_in_C.ipynb).
Note that if you use unified shared memory, [`compute.c`](./C/1_device_routine/0_device_routine_portyourself/compute.c), which implements `compute()`, must also contain the `requires unified_shared_memory` directive, and you must set `HSA_XNACK=1` in your environment before running.

Make sure you identify all the appropriate parallel regions for offloading, and use the appropriate OpenMP clause for each.
One of the regions has a race condition, which you should address with what we learned in [2f-OpenMP_reductions_atomics_mutexes](./../2f-OpenMP_reductions_atomics_mutexes/Reductions_atomics_and_mutexes_in_C.ipynb).
Notice that `compute()` is called from within one of the regions, and remember what we learned earlier regarding the `declare target` directive.
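The general pattern these hints point to can be sketched as follows. This is not the course solution: `compute_sketch` and `sum_on_device` are hypothetical stand-ins, the arithmetic is invented for illustration, and explicit `map` clauses are assumed rather than unified shared memory.

```c
/* Hypothetical stand-in for a routine such as compute(). */
#pragma omp declare target
double compute_sketch(double x) { return 2.0 * x + 1.0; }
#pragma omp end declare target

/* Accumulate the results of a device function over an array.
 * Without the reduction clause, every thread would update `sum`
 * concurrently, which is a race condition. */
double sum_on_device(const double *in, int n)
{
    double sum = 0.0;
    #pragma omp target teams distribute parallel for \
        map(to: in[0:n]) map(tofrom: sum) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += compute_sketch(in[i]);
    }
    return sum;
}
```

When the function is defined in a separate source file, as `compute()` is in this exercise, both the prototype seen at the call site and the definition itself need to be marked with `declare target`.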

You can use the code cell above to compile and run your code as you modify it.
After porting the code, you can compare your results with the solutions we have provided below.
The managed memory solution can be found in `1_device_routine_usm`, and the unmanaged memory solution in `2_device_routine_map`.

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/C/1_device_routine/1_device_routine_usm 
make clean
make
export HSA_XNACK=1
./device_routine

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/C/1_device_routine/2_device_routine_map
make clean
make
export HSA_XNACK=0
./device_routine

## Device global data

In many applications, it is common to have global data containing constants and variables that we wish to share with many functions throughout the code base.
The `declare target` directive also allows us to declare data on the target device, so that these global variables reside on the device and are accessible to the code running there.
We will explore this use of the `declare target` directive in the next example, in the `2_device_routine_wglobaldata/0_device_routine_wglobaldata_portyourself` directory.
Let's move to that directory, make sure it is clean, and inspect the contents now:

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/C/2_device_routine_wglobaldata/0_device_routine_wglobaldata_portyourself
make clean
ls

In addition to the familiar [`compute.c`](./C/2_device_routine_wglobaldata/0_device_routine_wglobaldata_portyourself/compute.c) and [`device_routine.c`](./C/2_device_routine_wglobaldata/0_device_routine_wglobaldata_portyourself/device_routine.c), this directory contains an additional file, [`global_data.c`](./C/2_device_routine_wglobaldata/0_device_routine_wglobaldata_portyourself/global_data.c), with the definition of the `constants` array.
Let's compile and run the serial CPU version, so we know what to expect from our own ported code later:

In [None]:
make device_routine
./device_routine

Note that the result is different from the previous example.
Looking at the `compute()` function, we can see why: the sum includes an additional term from the global `constants` array.
To port the code to the GPU, in addition to the steps taken for the previous example, you will also have to wrap both the definition of the `constants` array in [`global_data.c`](./C/2_device_routine_wglobaldata/0_device_routine_wglobaldata_portyourself/global_data.c) and the `extern` declaration in [`compute.c`](./C/2_device_routine_wglobaldata/0_device_routine_wglobaldata_portyourself/compute.c) in an OpenMP `declare target` region.
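A minimal sketch of this pattern is shown below, collapsed into a single file for brevity. The names `constants_sketch`, `N_CONST`, `add_constant` and `sum_with_constants`, as well as the array contents, are assumptions made for illustration and do not mirror the course source.

```c
#define N_CONST 4  /* hypothetical array size */

/* As it might appear in a global_data.c-style file: the definition. */
#pragma omp declare target
double constants_sketch[N_CONST] = {1.0, 2.0, 3.0, 4.0};
#pragma omp end declare target

/* As it might appear in a compute.c-style file: the extern declaration
 * and the device function that reads the global array. */
#pragma omp declare target
extern double constants_sketch[N_CONST];
double add_constant(double x, int i)
{
    return x + constants_sketch[i % N_CONST];
}
#pragma omp end declare target

/* Hypothetical driver: no map clause is needed for the global array,
 * because declare target already gives it a device copy. */
double sum_with_constants(int n)
{
    double sum = 0.0;
    #pragma omp target teams distribute parallel for \
        map(tofrom: sum) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += add_constant((double)i, i);
    }
    return sum;
}
```

The device copy is initialised from the static initialiser. If the host changes the array later, the device copy would need refreshing, for example with `#pragma omp target update to(constants_sketch)`.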

You can use the code cell above to compile and run your code as you modify it.
After porting the code, you can compare your results with the solutions in `1_device_routine_wglobaldata`.

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/C/2_device_routine_wglobaldata/1_device_routine_wglobaldata
make clean
make
./device_routine

## Device dynamic global data

The final example we will explore utilises dynamic global data, and can be found in the `3_device_routine_wdynglobaldata/0_device_routine_wdynglobaldata_portyourself` directory.

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/C/3_device_routine_wdynglobaldata/0_device_routine_wdynglobaldata_portyourself
make clean
ls

This example has the same file structure as the previous static global data example.
The computations carried out in [`compute.c`](./C/3_device_routine_wdynglobaldata/0_device_routine_wdynglobaldata_portyourself/compute.c) and [`device_routine.c`](./C/3_device_routine_wdynglobaldata/0_device_routine_wdynglobaldata_portyourself/device_routine.c) are identical, but there are some changes to the format of the global data.
If you inspect [`global_data.c`](./C/3_device_routine_wdynglobaldata/0_device_routine_wdynglobaldata_portyourself/global_data.c), you will notice that instead of a statically allocated array, `constants` is now dynamically allocated to the size `isize`, and its elements are filled in at that point.
The allocation and initialisation are carried out in [`device_routine.c`](./C/3_device_routine_wdynglobaldata/0_device_routine_wdynglobaldata_portyourself/device_routine.c).

Take some time to study and understand the code, before compiling and running it in its native CPU form below:

In [None]:
make
./device_routine

Let's port this code to the GPU now.
In addition to the previous steps, we now also need to move the dynamically allocated `constants` array to the device.
Remember that the copy should be done only **after** it is allocated and any changes have been applied to it on the CPU.
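One way this ordering might look is sketched below, assuming a `declare target` global pointer combined with an unstructured data region. The names `dyn_constants`, `setup_constants`, `sum_constants` and `teardown_constants` are hypothetical, and the provided solution may take a different approach.

```c
#include <stdlib.h>

/* Global pointer with a device copy; it points at nothing yet. */
#pragma omp declare target
double *dyn_constants = NULL;  /* hypothetical name */
#pragma omp end declare target

/* Allocate and fill on the host first, then copy to the device. */
int setup_constants(int n)
{
    dyn_constants = malloc((size_t)n * sizeof *dyn_constants);
    if (dyn_constants == NULL) {
        return -1;
    }
    for (int i = 0; i < n; ++i) {
        dyn_constants[i] = 0.5 * i;
    }
    /* The copy happens only after the host data is in its final state. */
    #pragma omp target enter data map(to: dyn_constants[0:n])
    return 0;
}

/* Device code reads the array through the global pointer. */
double sum_constants(int n)
{
    double sum = 0.0;
    #pragma omp target teams distribute parallel for \
        map(tofrom: sum) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += dyn_constants[i];
    }
    return sum;
}

/* Release the device copy before freeing the host memory. */
void teardown_constants(int n)
{
    #pragma omp target exit data map(delete: dyn_constants[0:n])
    free(dyn_constants);
    dyn_constants = NULL;
}
```

Calling `setup_constants(isize)` once after allocation, and `teardown_constants(isize)` before the program exits, keeps the device copy alive across every target region in between.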

You can use the code cell above to compile and run your code as you modify it.
After porting the code, you can compare your results with the solutions in `1_device_routine_wdynglobaldata`.

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/C/3_device_routine_wdynglobaldata/1_device_routine_wdynglobaldata
make clean
make
./device_routine

Now that you have successfully ported the `device_routine` C code to run on the GPU, you can experiment with making the loops and `isize` larger, change the logic in `compute()` to be more complex, and try different pragmas to study their effect.
For C++-specific function offloading (i.e. class member functions, external and virtual member functions), see the associated [CXX.ipynb](./CXX.ipynb) notebook in this section.

Throughout this section of the course, we've learnt the fundamentals of OpenMP, and how they can help you run performant code on an AMD GPU or APU.
You should now be equipped to start porting your own code to such architectures.

There follow two optional sections on OpenMP optimisations, and interoperability with HIP.
Feel free to go through them if they sound interesting or useful to you or your work, or continue on to the next section of this course, looking at the HIP programming model, and how dedicated kernels can be used to improve the performance of your code.