![NCAR UCAR Logo](img/NCAR_CISL_NSF_banner.jpeg)
# Hands-On Session Using OpenACC in MPAS-A

By: Daniel Howard [dhoward@ucar.edu](mailto:dhoward@ucar.edu), Consulting Services Group, CISL & NCAR 

Date: April 28th 2022

In this notebook, we explore the GPU-enabled [MPAS-A](http://mpas-dev.github.io/atmosphere/OpenACC/index.html) (Model for Prediction Across Scales-Atmosphere) and apply the OpenACC techniques learned from MiniWeather to GPU development in a production model.

* Review of exercises from prior OpenACC/MiniWeather sessions Part 1 and Part 2
* MPAS-Atmosphere model overview
* Managing GPU data in large software projects
* Assessing performance of extracted GPU kernels in MPAS-A

Head to the [NCAR JupyterHub portal](https://jupyterhub.hpc.ucar.edu/stable) and __start a JupyterHub session on a Casper login node__ (or on batch nodes using 1 CPU, no GPUs), then open the notebook `05_DirectivesOpenACC/05p2_openACC_miniWeather_Tutorial.ipynb`. Be sure to clone (if needed) and update/pull the NCAR GPU_workshop repository.

```shell
# Use the JupyterHub GitHub GUI on the left panel or the shell commands below
git clone git@github.com:NCAR/GPU_workshop.git   # first time only
cd GPU_workshop
git pull                                          # update an existing clone
```

# Workshop Etiquette
* Please mute yourself and turn off video during the session.
* Questions may be submitted in the chat and will be answered when appropriate. You may also raise your hand, unmute, and ask questions during Q&A at the end of the presentation.
* By participating, you are agreeing to [UCAR’s Code of Conduct](https://www.ucar.edu/who-we-are/ethics-integrity/codes-conduct/participants)
* Recordings & other material will be archived & shared publicly.
* Feel free to follow up with the GPU workshop team via Slack or submit support requests to [support.ucar.edu](https://support.ucar.edu)
    * Office Hours: Asynchronous support via [Slack](https://ncargpuusers.slack.com) or schedule a time with an organizer

## Notebook Setup
Set `PROJECT` to a currently active project code, e.g. `UCIS0004` for the GPU workshop, and `QUEUE` to the appropriate routing queue: `gpuworkshop` during a live workshop session, `gpudev` on weekdays from 8am to 5:30pm MT, or `casper` at all other times. Because shared GPU resources are limited, please use `GPU_TYPE=gp100` during the workshop. For independent work, set `GPU_TYPE=v100` (required for `gpudev`). See [Casper queue documentation](https://arc.ucar.edu/knowledge_base/72581396#StartingCasperjobswithPBS-Concurrentresourcelimits) for more info.

```shell
export PROJECT=UCIS0004
export QUEUE=gpudev
export GPU_TYPE=v100
```
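
The queue and GPU type rules above can be checked before submitting a job. The following is a minimal sketch, not part of the workshop material: the `check_gpu_config` function name is hypothetical, and it only encodes the one hard constraint stated above (that `gpudev` requires `v100`).

```shell
# Hypothetical helper: verify that QUEUE and GPU_TYPE form a valid pair.
# Only encodes the rule stated above: the gpudev queue requires v100 GPUs.
check_gpu_config() {
    local queue="$1" gpu_type="$2"
    if [ "$queue" = "gpudev" ] && [ "$gpu_type" != "v100" ]; then
        echo "invalid: queue gpudev requires GPU_TYPE=v100 (got $gpu_type)"
        return 1
    fi
    echo "ok: QUEUE=$queue GPU_TYPE=$gpu_type"
}

# Fall back to the values exported above if the variables are unset
check_gpu_config "${QUEUE:-gpudev}" "${GPU_TYPE:-v100}"
```

Calling `check_gpu_config gpudev gp100` returns a non-zero status, so the check can also guard a job submission command.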

## Review of MiniWeather Performance Optimization
At the end of the last session, we suggested using `async` and, predominantly, `collapse` clauses to achieve optimal MiniWeather performance across the runtime. Using `NX=1024` and `NZ=512`, the most expensive kernel in terms of compute time was at [__Line 231__](../05_DirectivesOpenACC/fortran/miniWeather_mpi_openacc.F90#L231) in the `semi_discrete_step` subroutine, with `NVCOMPILER_ACC_TIME` statistics highlighted below:

```shell
/glade/u/home/dhoward/GPU_workshop/05_DirectivesOpenACC/fortran/miniWeather_mpi_exercise2.F90
semi_discrete_step  NVIDIA  devicenum=0
time(us): 62,147
257: compute region reached 924 times
257: kernel launched 924 times
grid: [16384]  block: [128]
device time(us): total=62,147 max=70 min=66 avg=67
elapsed time(us): total=76,527 max=87 min=80 avg=82
257: data region reached 1848 times
```

The __line number__ in the source file is listed at the far left of each compute or data region entry, and the name of the subroutine containing the kernel is given at the top. The arrangement of __gang/worker/vector__ units is reported by the __grid__, i.e. the number and arrangement of __gangs__, and the __block__, i.e. the __vector length__ times the number of __workers__.
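
As a quick check of how the launch configuration relates to the problem size, the grid and block values reported above can be multiplied out and compared with the loop iteration count. This is a sketch under two assumptions not stated in the output: that the kernel collapses loops over `NX`, `NZ`, and `NUM_VARS`, and that `NUM_VARS=4` as in the MiniWeather source.

```shell
# Values reported by NVCOMPILER_ACC_TIME for the kernel above
gangs=16384          # grid: [16384]
vector_length=128    # block: [128]

# Problem size used for this run; NUM_VARS=4 is an assumption
NX=1024
NZ=512
NUM_VARS=4

threads=$((gangs * vector_length))
iterations=$((NX * NZ * NUM_VARS))

echo "GPU threads launched: $threads"
echo "Loop iterations:      $iterations"
```

Under these assumptions both values come to 2,097,152, i.e. one GPU thread per loop iteration.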

Running this version with the NVIDIA Nsight Systems profiler, we can get a visual representation of the model runtime.
![Profile of MiniWeather - Baseline](img/Profile_MiniWeather_Baseline.png)

The upper segments of this timeline show GPU activity: __blue__ compute kernels, __pink__ device-to-host transfers, and __aquamarine__ host-to-device transfers. The lower segments show CPU activity: __blue__ compute kernel launches, __red__ data directives/regions, and __beige__ wait/synchronize sections.

The bright blue segment highlights the most expensive GPU kernel in the `semi_discrete_step` subroutine, with the associated launch call from the CPU highlighted earlier in the timeline.

![Profile of MiniWeather - Baseline timeline only](img/Profile_MiniWeather_Baseline_cropped.png)

Since we used `async`, the CPU queues kernel launches without waiting for each one to finish. The GPU kernels therefore run right after one another, and the launch/exit overhead is hidden behind the execution of earlier kernels.

Without `async`, the profile would look like the one below. Time is lost because the CPU waits for each kernel to complete before scheduling the next, so the launch/exit costs are paid between every pair of kernels.

![Profile of MiniWeather - No async](img/Profile_MiniWeather_noasync.png)

## MiniWeather - Testing different kernel launch configurations and clauses
Below are performance statistics for the alternate kernel configuration experiments proposed in the previous session.

| MiniWeather Kernel L231, `semi_discrete_step` | Avg. Device Time (us) |
|---|---|
| Baseline (on V100) - auto `vector_length(128)`        | 67      |
| clause - `gang/worker/vector`      | 1017      |
| clause - `worker/vector` (Move NUM_VARS innermost, seq)     | 2844      |
| clause - `gang/vector` (Move NUM_VARS innermost, seq)     | 79      |
| clause - `tile(32,32,NUM_VARS)`      | 30      |
| clause - `tile(*,*,*)` -> 32,4,32   | 175      |
| clause - `collapse(3) vector_length(32)` | 113      |
| clause - `collapse(3) vector_length(256)` | 67      |
| clause - `collapse(3) vector_length(512)` | 67      |
| clause - `collapse(3) vector_length(1024)` | 75      |

1. __Why do you think the `tile()` clause specifying the outermost loop with the `NUM_VARS` variable was most performant?__
2. __Using `worker/vector`, the profiler shows `grid: [1]  block: [32x4]`. Why is this arrangement the least performant?__
3. __Did you find any better configurations for this or other kernels in MiniWeather? Explain why it performed better.__
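
To compare the configurations in the table above at a glance, each average device time can be expressed relative to the baseline. The following is a minimal sketch using `awk`, with a few rows from the table typed in by hand; the `relative_times` function name is hypothetical, and values below 1 are faster than the baseline.

```shell
# Average device times copied from the table above (same units throughout)
baseline=67

# Print each configuration's time relative to the baseline
relative_times() {
    awk -v base="$baseline" -F'|' '{ printf "%-32s %6.2fx\n", $1, $2 / base }' <<'EOF'
gang/worker/vector|1017
worker/vector|2844
gang/vector|79
tile(32,32,NUM_VARS)|30
collapse(3) vector_length(32)|113
collapse(3) vector_length(1024)|75
EOF
}

relative_times
```

Here the `tile(32,32,NUM_VARS)` row comes out at roughly 0.45x the baseline time, while `worker/vector` takes over 42x as long.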

## MPAS-A Overview
We will now look at a real-world production model, __MPAS-A__, and how it used OpenACC to refactor its code for GPU devices.

<img src="img/MPAS-var-res_mesh.png" alt="Global Voronoi mesh" style="width:250px;"/>

<img src="img/MPAS-grid_diagram.png" alt="MPAS grid diagram" style="width:300px;"/>

So far, only the v6.x Atmosphere core has been ported to GPUs. It is freely available to review via the MPAS [website](https://mpas-dev.github.io/atmosphere/OpenACC/index.html) and the stable [v6.x](https://github.com/MPAS-Dev/MPAS-Model/tree/atmosphere/v6.x-openacc) or v7.x [develop-openacc](https://github.com/MPAS-Dev/MPAS-Model/tree/atmosphere/develop-openacc) branches on GitHub. Some work has also been done on the MPAS-Ocean core, as described in this [presentation](https://www.lanl.gov/org/padwp/adx/computational-physics/parallelcomputing/_assets/docs/2020-student-projects/Ashwath_PCSRI_Final_Presentation.pdf) by PhD student Ashwath Venkataraman.

## Managing GPU data in MPAS-A


## MPAS-A Kernel Extraction
We will focus on the `atm_compute_vert_imp_coefs_work` subroutine and kernels. This is the [link](https://github.com/MPAS-Dev/MPAS-Model/blob/498393d2c5cf36f73db8925d717ae449c3660d40/src/core_atmosphere/dynamics/mpas_atm_time_integration.F#L2109) to the code in the full model codebase and here is the [link](mpas_atm_compute_vert_imp_coefs_work.F90) to the extracted kernel.

The extracted kernel simply utilizes randomized input data as we will be focusing on optimizing the performance of the subroutine's kernels.

## MPAS-A Kernel Optimization
