# Directive Based Programming with OpenACC

Prepared and presented by Daniel Howard, March 31st 2022

In this notebook we present techniques and code examples for using OpenACC to develop for GPUs. We will cover:
1. Comparison of descriptive & prescriptive programming and their portability
    * OpenACC
    * OpenMP
    * ISO Standard Language Parallelism
2. Fork-Join Execution Model for Attached GPU Accelerators
3. OpenACC API Directives with Unified Memory
    * Compute construct directives
        * `!$acc kernels ...`
        * `!$acc parallel ...`
        * `!$acc serial ...`
    * `!$acc loop` and other specification clauses

# Workshop Etiquette
* Please mute yourself and turn off video during the session.
* Questions may be submitted in the chat and will be answered when appropriate. You may also raise your hand, unmute, and ask questions during Q&A at the end of the presentation.
* By participating, you are agreeing to [UCAR's Code of Conduct](https://www.ucar.edu/who-we-are/ethics-integrity/codes-conduct/participants)
* Recordings & other material will be archived & shared publicly.
* Feel free to follow up with the GPU workshop team via Slack or submit support requests to [support.ucar.edu](https://support.ucar.edu)
    * Office Hours: Asynchronous support via [Slack](https://ncargpuusers.slack.com) or schedule a time with an organizer

## Notebook Setup
Set `PROJECT` to a currently active project code, e.g. `UCIS0004` for the GPU workshop, and set `QUEUE` to the appropriate routing queue: `gpuworkshop` during a live workshop session, `gpudev` on weekdays from 8am to 5:30pm MT, or `casper` at all other times. Because shared GPU resources are limited, please use `GPU_TYPE=gp100` during the workshop. For independent work, set `GPU_TYPE=v100` (required for `gpudev`). See [Casper queue documentation](https://arc.ucar.edu/knowledge_base/72581396#StartingCasperjobswithPBS-Concurrentresourcelimits) for more info.

In [None]:
export PROJECT=UCIS0004
export QUEUE=gpudev
export GPU_TYPE=v100

## Useful Definitions
* __Device__: Accelerator onto which execution can be offloaded (e.g. a GPU).
* __Host__: Machine (i.e. the CPU) hosting one or more accelerators and in charge of execution control.
* __Kernel__: Computational runtime derived from a section of parallelized code that is scheduled to run on an accelerator.
* __Execution thread__: Sequence of kernels to be executed on an accelerator.
* __Thread__: A single Processing Element (PE) or execution unit. On an NVIDIA GPU, a thread runs on a single CUDA core.
* __Streaming Multiprocessor (SM)__: Highest level processing unit in an NVIDIA GPU that processes blocks/gangs of threads. Each SM provides a shared memory cache, similar to L1, accessible by the block/gang of threads running on that SM (up to 96 kB on V100, up to 164 kB on A100).
* __Grid__: _Collection of blocks/gangs_ of threads that are distributed for execution across SMs. Can be organized in Euclidean dimensions.
* __Gang (OpenACC) / Teams (OpenMP)__: Coarse-grain parallelism structure, assigned to an SM. Contains a _block_ of threads of size `num_workers` times `vector_length` and has a shared memory/L1 cache.
* __Worker (OpenACC)__: Fine-grain parallelism that executes vectors of threads. Equivalent to a _warp_.
* __Vector__: Group of threads executing the same _SIMT_ instruction and executed by a worker.

Note: Different vendors (e.g. NVIDIA vs AMD) use different terms for equivalent concepts.

## Portability and Comparing OpenACC, OpenMP, and ISO Standard Language Parallelism
Recall from earlier sessions the difference between __prescriptive__ and __descriptive__ programming. In general, __descriptive__ paradigms, given the flexibility afforded to compilers, are able to achieve greater portability across different hardware types.

* OpenMP has a longer history, dating from 1997, and is predominantly __prescriptive__, while OpenACC, which began in 2011, is more __descriptive__
* ISO Standard Language Parallelism (stdPar) tends to be __descriptive__ but is still at an early stage of implementation across compilers
* More compilers support [OpenMP](https://www.openmp.org/resources/openmp-compilers-tools/#compilers) than [OpenACC](https://www.openacc.org/tools) (links list current compiler support)
* OpenACC is more mature for GPUs (especially NVIDIA), while OpenMP has only recently been expanding GPU offload support
    * See Oak Ridge National Lab's "[Introduction to OpenMP GPU Offload](https://www.olcf.ornl.gov/calendar/introduction-to-openmp-gpu-offloading/)" from Dec 2021 if interested.
    * Note: __Legacy OpenMP will NOT run well off the shelf on GPUs__
* stdPar is part of the language standard and aims to replace the need for directives; see [Burying the OpenACC vs OpenMP Hatchet](https://www.nextplatform.com/2019/01/16/burying-the-openmp-versus-openacc-hatchet/) by Michael Wolfe

Nonetheless, the directive-based and stdPar landscape changes constantly, and developers hold differing opinions on which approach is best. When deciding for yourself, __the most important thing is to consider any long term portability needs of your code__. An HPC system's hardware is typically in service for on the order of 3-5 years per cycle, while a software project can easily extend to 10+ years if designed with longevity in mind. See [Better Scientific Software](https://bssw.io) supported by the Department of Energy and National Labs alongside the [Exascale Computing Project](https://www.exascaleproject.org/).

## Philosophical Differences between OpenACC and OpenMP Programming Models

* __OpenACC__
    * __Compilers are allowed flexibility__ in how to parallelize
    * __Programmer augments information available to compiler__ and can optionally provide suggestions on how to map threads on accelerator
    * __More portable across target devices__ since compiler expects freedom in how to parallelize for each target type
    * Non-parallel code must be made parallel. __Programmer can safely suggest parallel regions__ since compiler checks if loops are actually parallel (unless `independent` clause is used)
* __OpenMP__
    * __Compilers must follow user-directed parallelization__ and the programmer must explicitly specify how the parallelism is achieved
        * Only recently allowed compiler-generated automatic parallelization via the `loop` directive
    * __Less portable__, different target devices (ie GPUs vs CPUs) require different directives
    * Non-parallel code can be optionally restructured. __Responsibility of programmer to ensure correct implementation__ of parallel regions
* __ISO Standard Language Parallelism__
    * Still __allows flexibility to the compiler__ depending on available target device
    * __Removes the need for directives__ or other extra instructions to compilers
    * __Standard within language__ and not in a separate organization
    * Not yet as robust as OpenACC/OpenMP, e.g. missing support for reductions. Over time, it should gain a breadth of scope comparable to the directives in OpenACC/OpenMP
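
To make the stdPar bullet concrete, here is a minimal sketch of ISO Standard Language Parallelism in Fortran using `do concurrent`. The program name and array names are illustrative only. With the NVIDIA HPC SDK, a loop like this can be offloaded by compiling with `nvfortran -stdpar=gpu`; other compilers may simply run it on the CPU.

```fortran
program stdpar_saxpy
    implicit none
    integer, parameter :: n = 1000
    real :: x(n), y(n), a
    integer :: i

    a = 2.0
    x = 1.0
    y = 3.0

    ! ISO Fortran: the programmer asserts the iterations are independent.
    ! No directive is needed; the compiler decides how to parallelize.
    do concurrent (i = 1:n)
        y(i) = a * x(i) + y(i)
    end do

    print *, 'y(1) =', y(1), ' y(n) =', y(n)
end program stdpar_saxpy
```

Note that the loop body contains no reduction, which is consistent with the limitation mentioned above.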

## OpenACC and OpenMP Are Relatives
![Merge history of OpenMP and OpenACC, John Urbanic](img/mergeACC_MP.png)
From "[OpenMP and GPUs](https://www.psc.edu/wp-content/uploads/2021/06/OpenMP-and-GPUs.pdf)" Urbanic, 2021
* OpenMP - Began in 1997 (GPU offload in 2013-2015)
    * Latest 5.2 standard Nov 2021, broad community adoption, Intel strongly influences development
* OpenACC - Began in 2011
    * Latest 3.1 standard Nov 2020, GPU-only community adoption, NVIDIA strongly influences development
* Compiler support for GPU targets varies across vendors
* Uncertain if standards will merge and/or be replaced by ISO Language Standards
* However, source translation between all these approaches is relatively straightforward

## Translating between OpenACC and OpenMP

|OpenACC|OpenMP|Description|
|---|---|---|
|__Regional Directives__|   | Initializes parallel runtime regions |
| `!$acc parallel ...` | `!$omp target teams ...` | Establishes a parallel runtime region/compute kernel|
| `!$acc kernels ...` | `!$omp target teams loop` | Similar but gives optimization flexibility to compiler |
| `!$acc loop ...` | `!$omp ...` | Defines a parallel loop within a compute kernel |
|__Parallelization Clauses__|   | Specifies types of parallelization in a region |
| `gang` | `distribute` or `distribute parallel do` | Specifies a gang work unit |
| `worker` | `parallel do` | Specifies a worker work unit within a gang |
| `vector` | `simd` or `parallel do num_threads(1) simd` | Specifies a SIMD work unit, best with coalesced memory|
| `num_gangs()` | `num_teams()` | Specifies number of gangs/teams |
| `num_workers()` | `num_threads()` | Specifies number of workers (threads in CPU context) |
| `vector_length()` | `simdlen()` | Specifies size of SIMD type operation |
|__Data Clauses__|   | Specifies data movement between CPU & GPU in parallel/data regions |
| `create()` | `map(alloc:)` | Allocates memory on target device for data object |
| `copy()` | `map(tofrom:)` | Allocates memory if needed, copies data at region entry/exit |
| `copyin()` | `map(to:)` | Allocates memory if needed, copies data at region entry |
| `copyout()` | `map(from:)` | Allocates memory if needed and copies data object at region exit |
| `present()` | `assert(omp_target_is_present())` | Asserts that a data object is already present on GPU |

Details of these directives and control statements for directive based computing will be covered later. The point is that __most statements have clear translations__, so choosing one programming paradigm and later refactoring to another would not mean a significant loss of effort. See the [CCAMP paper](https://www.osti.gov/servlets/purl/1666015) (Lambert, et al) for language translation details.
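
As a small illustration of the table above, the sketch below writes the same loop twice, once with OpenACC and once with OpenMP target offload. The program and variable names are hypothetical. If neither OpenACC nor OpenMP is enabled at compile time, both directives are treated as comments and the loops run serially on the host.

```fortran
program acc_vs_omp
    implicit none
    integer, parameter :: n = 1000
    real :: x(n), y_acc(n), y_omp(n), a
    integer :: i

    a = 2.0
    x = 1.0
    y_acc = 3.0
    y_omp = 3.0

    ! OpenACC: copyin() and copy() describe the data movement
    !$acc parallel loop copyin(x) copy(y_acc)
    do i = 1, n
        y_acc(i) = a * x(i) + y_acc(i)
    end do

    ! OpenMP offload: map(to:) and map(tofrom:) play the same roles
    !$omp target teams distribute parallel do map(to: x) map(tofrom: y_omp)
    do i = 1, n
        y_omp(i) = a * x(i) + y_omp(i)
    end do

    print *, 'OpenACC result:', y_acc(1), ' OpenMP result:', y_omp(1)
end program acc_vs_omp
```

The loop bodies are identical; only the directive line changes, which is why source translation between the two models is usually mechanical.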

## Fork-Join Execution Model
![Fork join model, NERSC](img/fork-join-model.png)
[Rita, et al, 2018](https://www.mdpi.com/2076-3417/8/3/399)

Both OpenMP and OpenACC employ similar execution models:

1. First, the host CPU runs serial code until it encounters a parallel region
2. The host process then __forks off many threads__ to process the parallel code
    * OpenMP allows some degree of nested parallelism while OpenACC (and GPU programming) is less flexible and expects all threads to perform the same tasks in a kernel
3. Once all threads complete their work, the threads __join back together__ and continue

The difference for GPU offload is there can be additional steps at both the fork and join to transfer data from the host to the device or vice versa. OpenMP often refers to a "master" thread that leads execution on a host device but the concept of a "master" GPU thread is not practical.
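
The three steps above can be sketched in a few lines of Fortran. This is an illustrative example with made-up names, not code from MiniWeather. The `copyout` clause stands in for the extra data transfer step that happens at the join when a GPU is used.

```fortran
program fork_join
    implicit none
    integer, parameter :: n = 8
    integer :: i, squares(n)

    ! 1. Serial host code runs until the parallel region is reached
    squares = 0

    ! 2. Fork: loop iterations are distributed over many device threads
    !$acc parallel loop copyout(squares)
    do i = 1, n
        squares(i) = i * i
    end do
    ! 3. Join: the region ends with an implicit barrier, results are
    !    copied back to the host, and serial execution continues

    print *, 'sum of squares =', sum(squares)
end program fork_join
```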

## Directive Based Programming Example
![FORTRAN OpenACC Example. NVIDIA](img/Fortran_OpenACC.png)

Essentially, both __OpenACC and OpenMP take existing codes and decorate them with directives that define parallel regions__ alongside other details given by the programmer in associated clauses. Compilers can then choose to honor these directives and build an executable that forks and joins parallel threads across the program's execution.

In this session, we will focus on OpenACC given its maturity and because a single session only allows time for one programming model.

## MiniWeather for Simulating Weather-like Flows
For this session, we will use the [MiniWeather](https://github.com/mrnorman/miniWeather) mini-app to explore how you can implement OpenACC. This mini-app simulates weather-like flows and is designed specifically to facilitate training in parallelizing code for accelerated HPC architectures. It has been developed by Matt Norman (ORNL), Jeff Larkin (NVIDIA), and Isaac Lyngaas (ORNL). For example, MiniWeather can model jet streams injected into a stable atmosphere, like below.

![Injection Jet, MiniWeather](img/injection_pt_1000.png)

## Test that MiniWeather Builds Correctly
First, let's build MiniWeather using [cmake_casper_nvhpc.sh](fortran/build/cmake_casper_nvhpc.sh) and run some tests to make sure the model builds correctly. We will focus on the `FORTRAN` and `OpenACC` versions of the model, but MiniWeather is also a great tool for exploring the other programming models, including `MPI`, `OpenMP`, and `do concurrent`/`std::par`. See the other language folders and associated programming model source files for examples. A script file for Casper has been adapted for the other build folders to facilitate this exploration if you would like to do this on your own time. See MiniWeather's README "[Compiling and Running the Code](https://github.com/mrnorman/miniWeather#compiling-and-running-the-code)" section for more info about this.

Initially, we will run this test with the base `mpi` model using [miniWeather_mpi.F90](fortran/miniWeather_mpi.F90) and the already implemented `openacc` model using [miniWeather_mpi_openacc.F90](fortran/miniWeather_mpi_openacc.F90). The file in [fortran/build/cmake_casper_nvhpc.sh](fortran/build/cmake_casper_nvhpc.sh) has its final make line modified to only build these two implementations but you can simply change the final line back to the singular `make` command without targets to build all of miniWeather's executables or specify different targets as desired.

In [None]:
cd fortran/build
source cmake_casper_nvhpc.sh
cd ../..
# After running this, there will be the executables `mpi` and `openacc` in "fortran/build"

### Validate the Executable
Now we can run validation tests on the built programs. Below, we use the [check_output.sh](fortran/build/check_output.sh) script to do this. For miniWeather, you could also run `make` then `make test` to validate all the different executable types for the model. To note, we can enable the environment variable `NVCOMPILER_ACC_TIME=1` in order to provide some contextual performance runtime information for comparison later.

First, run the serial `mpi_test`.

In [None]:
cd fortran/build
qcmd -A $PROJECT -q $QUEUE -l select=1:ncpus=1:ngpus=1 -l gpu_type=$GPU_TYPE -l walltime=60 -v NVCOMPILER_ACC_TIME=0 -- \
$PWD/check_output.sh $PWD/mpi_test 1e-13 4.5e-5
cd ../..

Now run the parallel `openacc_test` offloaded on a GPU. __Pay attention to how much faster this one runs__. Feel free to set `NVCOMPILER_ACC_TIME=1` to see more info about GPU compute kernel performance. 

In [None]:
cd fortran/build
qcmd -A $PROJECT -q $QUEUE -l select=1:ncpus=1:ngpus=1 -l gpu_type=$GPU_TYPE -l walltime=60 -v NVCOMPILER_ACC_TIME=0 -- \
$PWD/check_output.sh $PWD/openacc_test 1e-13 4.5e-5
cd ../..

## OpenACC Directives - `kernels` and `parallel` with Unified Managed Memory
For reference, here is the [OpenACC 2.7 Quick Reference Guide](https://www.openacc.org/sites/default/files/inline-files/API%20Guide%202.7.pdf). You can also read through the official [OpenACC 3.1 Full Standard Specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.1-final.pdf) or [OpenACC Programming and Best Practices Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC_Programming_Guide_0_0.pdf) when your time allows.

We first introduce the `kernels` and `parallel` directives in OpenACC. For an in depth comparison, read the blog post [OpenACC Kernels and Parallel Constructs](https://www.pgroup.com/blogs/posts/kernels-vs-parallel.htm) by Michael Wolfe. In order to simplify our initial work, we will use __managed memory__.
![Unified Memory, NVIDIA](img/UnifiedMemory.png)

With managed memory, the address spaces of the CPU and the GPU are abstracted into one unified construct. This approach is optional for `OpenACC` & `OpenMP` using `-gpu=managed` but is currently required for `stdPar`. With unified memory, the programmer no longer has to manage data movement explicitly. Instead, the runtime automatically moves data between the CPU and GPU whenever a page fault occurs, i.e. when the GPU tries to access memory that is not available or up to date in its physical high bandwidth memory space (and likewise for the CPU).

In later sessions, we will spend some time with `!$acc data` regions since using managed memory can lead to sub-optimal performance. To note, due to non-optimized data movement patterns, some kernels have already been specified with OpenACC directives in the MPI section of `subroutine set_halo_values_x(state)`.
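
A minimal sketch of what managed memory buys you is shown below, assuming the program is built with the NVIDIA compiler using `nvfortran -acc -gpu=managed`. The program and array names are illustrative. Notice that there are no data clauses or `!$acc data` regions: the allocatable arrays live in unified memory and pages migrate when the GPU or CPU touches them.

```fortran
program managed_demo
    implicit none
    integer, parameter :: n = 100000
    real, allocatable :: a(:), b(:)
    integer :: i

    allocate(a(n), b(n))
    a = 1.0                 ! first touched on the host

    ! No explicit data clauses: with -gpu=managed, allocatable data is
    ! placed in unified memory and migrates to the GPU on demand
    !$acc kernels
    do i = 1, n
        b(i) = 2.0 * a(i)
    end do
    !$acc end kernels

    ! Reading b on the host migrates the needed pages back
    print *, 'b(n) =', b(n)
    deallocate(a, b)
end program managed_demo
```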

## Using the Descriptive `!$acc kernels` Compute Construct Directive
When you have a parallelizable loop or a set of tightly nested loops, you can very easily __suggest__ to the compiler that it run them on the GPU using the __descriptive `!$acc kernels` directive__. When targeting NVIDIA GPUs, the compiler then essentially builds a CUDA kernel for you and assigns parallel execution in a close to optimal arrangement across gangs, workers, and vectors. This is a __descriptive__ approach and the easiest way to port an application to a GPU using OpenACC. In this case, __the compiler does all the heavy lifting__ but takes the directives only as advice. If the compiler determines it cannot safely parallelize a loop, e.g. it sees a potential data race, it will generate serial code for that loop instead. Thus, optimal performance is often difficult to achieve with the `!$acc kernels` directive alone: the compiler frequently declines to parallelize code unless it is given additional information about each compute region. Here are some important points:

* Use `!$acc kernels` and `!$acc end kernels` to encapsulate multiple loops and loop nests that run in sequence
    * These regions must not have any data dependencies between loop iterations that, for example, would cause a data race condition.
* By default, there is an __implicit barrier__ at the end of a parallel execution region. Host thread execution will pause until the kernel completes
    * Use the `async()` and `wait()` clauses to permit asynchronous execution, discussed in a future session
* You may specify `num_gangs()`, `num_workers()`, and `vector_length()`, but only the compiler gets to decide which execution type is applied to each loop level within a `!$acc kernels` region.
* You may not nest compute construct regions. Only one `kernels`, `parallel`, or `serial` context may be in scope at a time

An example of using the `!$acc kernels` in FORTRAN is below:
```fortran
!$acc kernels [optional clauses]
    do i = 1, n
        do j = 1, m
            ...
        enddo
    enddo
!$acc end kernels
```
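
The skeleton above can be expanded into a complete, compilable sketch. The program below is an illustrative example with hypothetical array names, showing one `!$acc kernels` region that wraps two loop nests run in sequence. The compiler is free to turn each nest into its own GPU kernel.

```fortran
program kernels_demo
    implicit none
    integer, parameter :: n = 64, m = 32
    real :: a(m, n), b(m, n)
    integer :: i, j

    !$acc kernels
    ! First loop nest: no dependencies between iterations
    do i = 1, n
        do j = 1, m
            a(j, i) = real(i + j)
        end do
    end do
    ! Second loop nest: runs after the first one has completed
    do i = 1, n
        do j = 1, m
            b(j, i) = 2.0 * a(j, i)
        end do
    end do
    !$acc end kernels

    print *, 'b(m, n) =', b(m, n)
end program kernels_demo
```

The inner loop runs over the first array index so that neighboring threads touch neighboring memory locations, which suits Fortran's column-major storage.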

## EXERCISE: Autoparallelization Using `!$acc kernels` Descriptive Directive
Add `!$acc kernels` regions to the TODO sections of [miniWeather_mpi_exercise.F90](fortran/miniWeather_mpi_exercise.F90#L225) (use `CTRL+F` or `CMD+F` to search for TODO). You may enter a `[x]` in the raw text of this cell to track when you're finished with each section. Some have already been completed for you.
- [ ] Line 225
- [ ] Line 274
- [ ] Line 306
- [ ] Line 334
- [x] Line 371
- [x] Line 454
- [x] Line 871

Once this is done, run the below commands to make a new executable from the exercise source file. To note, within the [fortran/CMakeLists.txt](fortran/CMakeLists.txt#L147) at Line 147 we added: 

* The `-gpu=managed` flag so that you do not have to worry about data movement yet during this exercise
* The `-Minfo=accel` flag so information about how the compiler is targeting the GPU is also printed during the make/compilation process.

Investigate the output from the `-Minfo=accel` compiler flag. __What types of parallelizations did the compiler find and perform? Which loops were not able to be parallelized by the compiler?__

In [None]:
make -C fortran/build openacc_ex openacc_test_ex

## EXERCISE: Run the Autoparallelized OpenACC MiniWeather Program
Once you are satisfied with your changes and compiled the new executable from previous exercise, run the below cell to test to make sure you have not introduced any bugs. If you get stuck, check [miniWeather_mpi_openacc.F90](fortran/miniWeather_mpi_openacc.F90).

To note, we add here the environment variable `NVCOMPILER_ACC_TIME=1`. Use this to compare to previous timing results of the already implemented openacc program for each kernel. __Do you notice any timing differences?__

In [None]:
cd fortran/build
qcmd -A $PROJECT -q $QUEUE -l select=1:ncpus=1:ngpus=1 -l gpu_type=$GPU_TYPE -l walltime=60 -v NVCOMPILER_ACC_TIME=1 -- \
$PWD/check_output.sh $PWD/openacc_test_ex 1e-13 4.5e-5
cd ../..

Once you are confident there are no bugs, you can run the next `qcmd` cell to check performance of the non-test program, `openacc_ex`. If you want to modify the resolution, simulation time, or problem type (see MiniWeather's [Altering the Code's Configuration](https://github.com/mrnorman/miniWeather#altering-the-codes-configurations)), you may edit the parameters at line 53 of [fortran/miniWeather_mpi_exercise.F90](fortran/miniWeather_mpi_exercise.F90#L53). You will have to rebuild the executable using the below `make` command.

In [None]:
make -C fortran/build openacc_ex

In [None]:
cd fortran/build
qcmd -A $PROJECT -q $QUEUE -l select=1:ncpus=1:ngpus=1 -l gpu_type=$GPU_TYPE -l walltime=60 -- \
mpiexec -n 1 $PWD/openacc_ex
cd ../..

## GPU Execution Task Granularity
<img src="img/Map-of-Gang-Worker-Vector.png" alt="Gang Worker Vector hierarchy" style="width:600px;"/> <img src="img/volta-sm-architecture.png" alt="Volta Architecture" style="width:350px;"/>

From ["Porting LASG/ IAP Climate System Ocean Model to GPUs Using OpenACC"](http://dx.doi.org/10.1109/ACCESS.2019.2932443) Jiang, et al 2019

Whether or not you specify it manually, OpenACC schedules execution threads organized into gang, worker, and vector structures. Each generation of hardware has different SM limits that constrain the size and number of each of these execution structures. One important detail to remember, especially if you're interested in optimization techniques, is that __each gang has its own shared memory/L1 cache in the SM__ it is assigned to. A gang, or thread block in CUDA terms, does not migrate between SMs.
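
To connect this hierarchy to source code, here is a small sketch that maps an outer loop to gangs and an inner loop to vector lanes. The names and the choice of `vector_length(128)` are illustrative assumptions, not tuned values. A compiler may still adjust the mapping for the target hardware.

```fortran
program gang_vector_demo
    implicit none
    integer, parameter :: n = 256, m = 128
    real :: a(m, n)
    integer :: i, j

    ! Outer loop: iterations are spread across gangs (CUDA thread blocks).
    ! vector_length(128) requests 128 vector lanes per gang.
    !$acc parallel loop gang vector_length(128) copyout(a)
    do i = 1, n
        ! Inner loop: iterations map to the vector lanes within a gang
        !$acc loop vector
        do j = 1, m
            a(j, i) = real(j) + 0.5 * real(i)
        end do
    end do

    print *, 'a(1,1) =', a(1, 1), ' a(m,n) =', a(m, n)
end program gang_vector_demo
```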

## Using the `!$acc parallel` and `!$acc loop` Compute Construct Directives
You can do a little extra work and instead use the more __prescriptive `!$acc parallel` directive__. This directive expects more direction from the programmer about how the parallel work takes place and tends to perform better than `!$acc kernels`, although the compiler still performs automatic parallelization analysis in both cases. One key difference is that a `!$acc kernels` construct can encapsulate multiple distinct parallel regions, i.e. compute kernels, and create multiple CUDA kernels from them, while a `!$acc parallel` region specifies a single CUDA kernel compute region at a time, typically parallelized at the `gang` level when nothing else is specified. More important, however, are the added clauses you should specify alongside the compute regions. Here are some clause examples to consider:

* You may use `!$acc loop ...` within a `!$acc kernels/parallel ...` region on specific loop(s)
* Use `!$acc parallel loop [gang/worker/vector]` to distribute work...
    * __Gangs__: across the GPU's SMs. Each gang has a shared memory cache and is usually assigned to the outermost loop
    * __Workers__: within the GPU's SMs. Each worker is a warp, but this clause is often not used when there are only two levels of parallelism/loops
    * __Vectors__: within the GPU's SMs. Should be on the order of SIMT or CPU-like SIMD operations, e.g. length 128 or another multiple of the warp size of 32
* Use `!$acc parallel loop collapse(N)` to flatten N tightly nested loops into one large loop so that work is distributed evenly across the GPU
    * Best utilized when your innermost loop is not of optimal size for vector work
    * Often useful when the loop nest is deeper than the three levels of parallelism a GPU typically provides
    * Provides more flexibility towards dynamic loop dimension sizes
* Use `!$acc parallel loop reduction(op:var)` to indicate that a reduction should be performed on a variable to avoid a race condition
    * op -> `+` `*` `max` `min` `iand` `ior` `ieor` `.and.` `.or.` `.eqv.` `.neqv.`
    * var -> A scalar variable
    * You may also use the `!$acc atomic update` construct, particularly to wrap an update to an array element when multiple threads might otherwise update the same memory location at once.
* Use `!$acc parallel private(var1,var2,...)` to specify that each variable listed should be private at whichever execution level is scheduled.
    * If you use `!$acc parallel loop gang private(var1,var2,...)`, each variable will be private to each gang.
    * If you use `!$acc parallel loop vector private(var1,var2,...)`, each variable will be private to each thread.
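
The `private` clause and the `!$acc atomic update` construct from the list above can be combined in a short histogram sketch. This is an illustrative example with made-up names. Each iteration uses its own copies of `tmp` and `bin`, while the atomic update protects the shared `hist` array, since many iterations increment the same bin.

```fortran
program private_atomic_demo
    implicit none
    integer, parameter :: n = 1000, nbins = 10
    integer :: hist(nbins), i, bin
    real :: x(n), tmp

    ! Fill x with the values 0..9, each appearing n/nbins times
    do i = 1, n
        x(i) = real(mod(i, nbins))
    end do
    hist = 0

    !$acc parallel loop private(tmp, bin) copyin(x) copy(hist)
    do i = 1, n
        tmp = x(i)
        bin = int(tmp) + 1
        ! Different iterations may hit the same bin, so update atomically
        !$acc atomic update
        hist(bin) = hist(bin) + 1
    end do

    print *, 'histogram:', hist
end program private_atomic_demo
```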
    

An example of using the `!$acc parallel` in FORTRAN is below:
```fortran
!$acc parallel loop collapse(2) reduction(+:sum)
    do i = 1, n
        do j = 1, m
            ...
            sum = sum + tke
        enddo
    enddo
!$acc end parallel loop
```
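
A complete version of this pattern might look like the sketch below. The array `tke` and its constant fill value are placeholders chosen so the result is easy to check by hand; they are not taken from MiniWeather.

```fortran
program reduction_demo
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 100, m = 50
    real(dp) :: tke(m, n), total
    integer :: i, j

    tke = 0.5_dp
    total = 0.0_dp

    ! collapse(2) merges both loops into one iteration space of n*m,
    ! reduction(+:total) gives each thread a partial sum that is combined
    ! at the end of the region instead of racing on a shared variable
    !$acc parallel loop collapse(2) reduction(+:total) copyin(tke)
    do i = 1, n
        do j = 1, m
            total = total + tke(j, i)
        end do
    end do

    print *, 'total =', total
end program reduction_demo
```

Because floating point addition is not associative, a parallel reduction can differ from the serial sum in the last few digits for less uniform data, which is one reason validation scripts compare against a tolerance.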



## Using the `!$acc serial` Compute Construct Directive
Sometimes it's useful to specify a serial compute region that will run on the GPU. This is most useful in cases where
* The size of the loops is not ideal for the high level of parallelism achievable by a GPU, e.g. O(10-100) loop size, but the cost of moving data between GPU and host outweighs the performance gain from letting the CPU host, which is optimized for serial code, do the work.
* There is non-loop serial code that, though it would run redundantly on the GPU, is better kept on the GPU to avoid data transfers.

Essentially, `!$acc serial` always executes with a single gang of a single worker with a vector length of one. MiniWeather does not have any good examples of this use case, but you may try to play around with this concept, particularly on the serial code sections between for loops, to investigate how it changes performance.
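
As a hedged illustration of the second use case, the sketch below fills an array in parallel and then patches its two boundary values inside a `!$acc serial` region. The names are hypothetical, and the example uses an explicit `!$acc data` region (covered in later sessions) only to make it clear that the array stays on the GPU between the two compute regions.

```fortran
program serial_demo
    implicit none
    integer, parameter :: n = 1000
    real :: a(n)
    integer :: i

    ! Keep a on the device for the whole region; copy it back at the end
    !$acc data copyout(a)

    !$acc parallel loop
    do i = 1, n
        a(i) = real(i)
    end do

    ! Tiny non-loop fix-up executed by a single GPU thread, which avoids
    ! a round trip of the array to the host and back
    !$acc serial
    a(1) = a(2)
    a(n) = a(n - 1)
    !$acc end serial

    !$acc end data

    print *, 'a(1) =', a(1), ' a(n) =', a(n)
end program serial_demo
```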

## EXERCISE: Parallelization Using the `!$acc parallel` Prescriptive Directive
Modify the previously added `!$acc kernels` regions to provide more information to the compiler using the `!$acc parallel` directive. Again, look for the TODO sections of [fortran/miniWeather_mpi_exercise.F90](fortran/miniWeather_mpi_exercise.F90#L225) (use `CTRL+F` or `CMD+F` TODO). You may enter a `[x]` in the raw text of this cell to track when you're finished with each section.
- [ ] Line 225
- [ ] Line 274
- [ ] Line 306
- [ ] Line 334
- [ ] Line 371
- [ ] Line 454
- [ ] Line 871

Investigate the output from the `-Minfo=accel` compiler flag. __What types of parallelizations did the compiler find and perform? Any improvements from last time?__

In [None]:
make -C fortran/build openacc_ex openacc_test_ex

## EXERCISE: Run Your Improved OpenACC Program
Once you are satisfied with your changes and compiled the new executable from previous exercise, run the below cell to test to make sure you have not introduced any bugs. 

To note, we add here again the environment variable `NVCOMPILER_ACC_TIME=1`. Use this to compare to previous timing results of each kernel. Do you notice any differences?

In [None]:
cd fortran/build
qcmd -A $PROJECT -q $QUEUE -l select=1:ncpus=1:ngpus=1 -l gpu_type=$GPU_TYPE -l walltime=60 -v NVCOMPILER_ACC_TIME=1 -- \
$PWD/check_output.sh $PWD/openacc_test_ex 1e-13 4.5e-5
cd ../..

Once you are confident there are no bugs, you can run the next `qcmd` cell to check performance of the non-test program, `openacc_ex` (or simply use the `openacc` program for an already fully optimized version). If you want to modify the resolution, simulation time, or problem type (see [Altering the Code's Configuration](https://github.com/mrnorman/miniWeather#altering-the-codes-configurations)), you may edit the parameters at line 53 of [fortran/miniWeather_mpi_exercise.F90](fortran/miniWeather_mpi_exercise.F90#L53). You will have to rebuild the executable using the below `make` command.

Results of the program may be viewed by starting an SSH X session with Casper from a terminal using `ssh -Y [username]@casper.ucar.edu`, then running `module load ncview` and `ncview output.nc` on the output file, which should now be in your `$HOME` directory or in the folder from which you ran the MiniWeather executable.

In [None]:
make -C fortran/build openacc_ex

In [None]:
cd fortran/build
qcmd -A $PROJECT -q $QUEUE -l select=1:ncpus=1:ngpus=1 -l gpu_type=$GPU_TYPE -l walltime=60 -- \
mpiexec -n 1 $PWD/openacc_ex
cd ../..

## Suggested Resources
* Matt Norman's [A Practical Introduction to GPU Refactoring in FORTRAN with Directives for Climate](https://github.com/mrnorman/miniWeather/wiki/A-Practical-Introduction-to-GPU-Refactoring-in-Fortran-with-Directives-for-Climate)
* May 2021, [OpenACC Programming and Best Practices Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC_Programming_Guide_0_0.pdf) and [Github](https://github.com/OpenACC/openacc-best-practices-guide)
* [OpenACC 2.7 Quick Reference Guide](https://www.openacc.org/sites/default/files/inline-files/API%20Guide%202.7.pdf)
* Official [OpenACC 3.1 Full Standard Specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.1-final.pdf) - Not all updated features are implemented yet by compatible compilers
* If you want to dive deep into lower level control and optimization of GPU performance, check out Oak Ridge National Lab's [CUDA Training Series](https://olcf.ornl.gov/cuda-training-series/).