# Device functions and subroutines in Fortran

Until now, we have focused on how to offload data or particular calculations to the GPU.
In most real software applications, however, we will want to pass data through a pipeline of multiple subroutines and submodules before completion.
In this notebook, we will learn how to call subroutines to run on the device from within existing target regions.
In so doing, we will combine many of the things we have learned up to this point, and let you try porting some code to the GPU using OpenMP yourself.

Let's begin as usual with our environment check and moving into the appropriate directory:

In [None]:
rocm-smi
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/Fortran

## The `declare target` directive

Calling a subroutine that has been compiled for the architecture of the host from within a target region can have unexpected consequences.
It may not be able to run on the device architecture, and in some compilers may even cause a linking error.
To ensure that our subroutines can run on the device, we must indicate to the compiler that a device-compatible version should be generated.
We can do this by adding a `declare target` directive to the subroutine's definition, immediately after the standard `implicit none` statement.
It is also possible to pre-declare a range of target subroutines using `declare target(func-list)` before the subroutine definitions, where `func-list` is a comma-separated list of the names of the target subroutines in question.
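As a minimal sketch of the placement described above, the following single-file program calls a small subroutine from within a target region. The names `scale_add`, `declare_target_sketch`, `a`, `x` and `y` are hypothetical and are not taken from the course source files. Because the subroutine here is an external procedure, the sketch assumes the caller also needs an interface body that repeats the `declare target` directive, so that the compiler knows a device version exists.

```fortran
! Hypothetical example: scale_add and its arguments are illustrative names,
! not taken from the course source files.
subroutine scale_add(a, x, y)
  implicit none
  !$omp declare target            ! request a device-compatible version
  double precision, intent(in)    :: a, x
  double precision, intent(inout) :: y
  y = a * x + y
end subroutine scale_add

program declare_target_sketch
  implicit none
  interface
    subroutine scale_add(a, x, y)
      !$omp declare target        ! the caller must also see the declaration
      double precision, intent(in)    :: a, x
      double precision, intent(inout) :: y
    end subroutine scale_add
  end interface
  integer, parameter :: n = 1000
  double precision :: x(n), y(n)
  integer :: i

  x = 1.0d0
  y = 2.0d0

  ! Offload the loop; each iteration calls the device version of scale_add.
  !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
  do i = 1, n
    call scale_add(2.0d0, x(i), y(i))
  end do
  !$omp end target teams distribute parallel do

  ! Every element of y should now equal 2*1 + 2 = 4.
  print *, 'y(1) =', y(1), ' y(n) =', y(n)
end program declare_target_sketch
```

If the file is compiled without OpenMP enabled, the directives are treated as ordinary comments and the program runs in serial on the host, which gives a convenient reference result.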

To get started, let's move into our first example directory, `device_routine_with_interface/0_device_routine_portyourself`, and clean our environment therein.

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/Fortran/device_routine_with_interface/0_device_routine_portyourself
make clean
ls

The source code consists of two files: [`device_routine.f90`](./Fortran/device_routine_with_interface/0_device_routine_portyourself/device_routine.f90), where `program device_routine` is implemented, and [`compute.f90`](./Fortran/device_routine_with_interface/0_device_routine_portyourself/compute.f90), where the `compute()` subroutine is implemented.
At the moment, this code will run in serial on the CPU - it is your task to add the appropriate OpenMP directives to get it running properly on the GPU.

Before porting the code, read the source code and try to understand it.
Then, compile and run this baseline CPU version, to know what to expect:

In [None]:
make
./device_routine

To port the code, we can use any one of three approaches: managed memory with unified shared memory, as covered in [2e-OpenMP_managed_memory](./../2e-OpenMP_managed_memory/Managed_memory_in_Fortran.ipynb); explicit memory movements using `map` clauses on the offloaded regions; or unmanaged memory with an unstructured data region, as covered in [2d-OpenMP_explicit_memory_directive](./../2d-OpenMP_explicit_memory_directive/Explicit_memory_management_in_Fortran.ipynb).
Note that if using unified shared memory, the `compute` subroutine in [`compute.f90`](./Fortran/device_routine_with_interface/0_device_routine_portyourself/compute.f90) will also need to include the `requires unified_shared_memory` directive, and you must set `HSA_XNACK=1` in your environment before running the code.

Make sure you identify all the appropriate parallel regions for offloading, and use the appropriate OpenMP clause for each.
One of the regions has a race condition, which you should address with what we learned in [2f-OpenMP_reductions_atomics_mutexes](./../2f-OpenMP_reductions_atomics_mutexes/Reductions_atomics_and_mutexes_in_Fortran.ipynb).
Notice that `compute()` is called from within one of the regions, and remember what we discussed earlier regarding the `declare target` directive.
Note that in Fortran, the `!$omp declare target` directive goes inside the subroutine, whereas for C functions it is placed outside.

You can use the code cell above to compile and run your code as you modify it.
After porting the code, you can compare your results with the possible solutions in the subdirectories of `device_routine_with_interface/`.
Several solutions are provided, including one 'wrong' solution; the example in `1_device_routine_wrong/` (demonstrated in the code cell below) shows the compilation errors that will be generated when the subroutine is not declared to run on the target.

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/Fortran/device_routine_with_interface/1_device_routine_wrong
make clean
make

The managed memory with unified shared memory solution is provided in `2_device_routine_usm/`, the unmanaged memory solution using explicit memory movements is in `3_device_routine_map/`, and the solution using unmanaged memory with an unstructured data region is in `5_device_routine_enter_data/`.

There is an additional solution in `4_device_routine_device_type/`, covering an OpenMP clause we have not previously seen: `device_type(nohost)`, used as `declare target device_type(nohost)`.
This clause tells the compiler that the subroutine needs to be compiled for the device **only**, and not for the host.

You may now wish to experiment with these different solutions by changing the problem size or the complexity of the calculation in `compute()`.

## Device routines within modules

The next part of this notebook will cover the use of Fortran **modules** - a common way to encapsulate code for reusability, organisation and maintainability.
The example code we will be using is in the `device_routine_with_module/0_device_routine_with_module_portyourself/` directory, so let's move to that and examine the contents:

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/Fortran/device_routine_with_module/0_device_routine_with_module_portyourself
make clean
ls

There are again two source code files: [`device_routine.f90`](./Fortran/device_routine_with_module/0_device_routine_with_module_portyourself/device_routine.f90), containing the `device_routine` program implementation, and [`computemod.f90`](./Fortran/device_routine_with_module/0_device_routine_with_module_portyourself/computemod.f90), which contains a module that includes the `compute()` subroutine, replacing the `compute.f90` file from the previous exercise.

To port a subroutine within a module to the GPU, we need to add the `declare target` directive within the subroutine itself, just as we did in the previous example.
The directive should be applied only to the subroutines that need porting, and should not be applied to the whole module.
Conversely, if we want to use unified shared memory, the enabling `requires unified_shared_memory` directive must be applied to the module as a whole, being placed after the module's declaration at the beginning of the file.
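The placement rules above can be sketched with a small self-contained module. All names here (`axpy_mod`, `axpy_point`, `host_only_report`, `module_sketch`) are hypothetical and are not taken from `computemod.f90`. The sketch uses `map` clauses so that it does not depend on unified shared memory; a comment marks where the enabling directive would sit for that variant.

```fortran
! Hypothetical sketch: the module, subroutine and variable names are
! illustrative and are not taken from computemod.f90.
module axpy_mod
  implicit none
  ! For the unified shared memory variant, the enabling directive would sit
  ! here, at module level (shown as a plain comment, so it is inactive):
  !   !$omp requires unified_shared_memory
contains
  subroutine axpy_point(a, x, y)
    !$omp declare target          ! applied per subroutine, not per module
    double precision, intent(in)    :: a, x
    double precision, intent(inout) :: y
    y = a * x + y
  end subroutine axpy_point

  subroutine host_only_report(y)
    ! No declare target: this routine is only ever called on the host.
    double precision, intent(in) :: y(:)
    print *, 'sum(y) =', sum(y)
  end subroutine host_only_report
end module axpy_mod

program module_sketch
  use axpy_mod
  implicit none
  integer, parameter :: n = 1000
  double precision, allocatable :: x(:), y(:)
  integer :: i

  allocate(x(n), y(n))
  x = 1.0d0
  y = 2.0d0

  ! The module's explicit interface makes the declare target visible here,
  ! so no separate interface block is needed in the caller.
  !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
  do i = 1, n
    call axpy_point(2.0d0, x(i), y(i))
  end do
  !$omp end target teams distribute parallel do

  call host_only_report(y)
end program module_sketch
```

Only `axpy_point` carries the directive because it is the only routine called inside the target region; `host_only_report` is left untouched to illustrate that the rest of the module does not need a device version.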

The only difference between this code and the previous example is that it encapsulates the subroutine into a module, so we know what to expect from it and can go straight into porting it.
You can use the following code block to compile and run the code.
If using unified shared memory, remember to add `export HSA_XNACK=1` before running the code.

In [None]:
make
./device_routine

Now that you have attempted to port your code, you can compare it with the model solutions.
Two versions are provided, one using unmanaged memory with an unstructured data region - see `1_device_routine_with_module/` - and one with unified shared memory - see `2_device_routine_with_module_usm/`.

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/Fortran/device_routine_with_module/1_device_routine_with_module
make clean
make
./device_routine

In [None]:
cd $HOME/DiRAC-AMD-GPU/notebooks/02-OpenMP/2g-OpenMP_device_subroutines/Fortran/device_routine_with_module/2_device_routine_with_module_usm
make clean
make
export HSA_XNACK=1
./device_routine

Once you are happy with the solution, feel free to take some time to experiment with the code - perhaps adjusting problem size, changing the complexity, or adding in additional device subroutines to the module - to get a handle on how offloading subroutines works.

Throughout this section of the course, we've learnt the fundamentals of OpenMP, and how they can help you run performant code on an AMD GPU or APU.
You should now be equipped to start porting your own code to such architectures.

There follows an optional section on OpenMP optimisations.
Feel free to follow it at your leisure.
The next main section of this course will cover HIP programming, and the use of dedicated kernels to improve performance on accelerated devices.
Unfortunately, HIP Fortran is still in development and not all features are fully supported, so the notebooks will concentrate instead on C/C++.