<a href="https://colab.research.google.com/github/Mkps/nvidia_workshop/blob/main/Profiling_OpenACC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The Jupyter notebook is a web-based environment for interactive computing. It can run code in a wide variety of programming languages. A notebook consists of cells, which can be of type code or markdown (by default, new cells are created as code cells). You can execute the content of a code cell by clicking the Run button. Cells that are actively running code show a 'Play' icon, and cells that have completed show a green tick.

**Connect to GPU runtime**: If using Google Colab, please ensure you have this notebook enabled with GPU before running (Runtime->Change runtime type). If you are not using Google Colab you can ignore this instruction.

<img src="https://drive.google.com/uc?export&id=1rroMUCXRqssxsUNo29qbkeYjqy792BsE"/>

# Installing NVIDIA HPC SDK 22.7

In [4]:
# install HPC SDK 22.7
!echo 'deb [trusted=yes] https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /' | sudo tee /etc/apt/sources.list.d/nvhpc.list
!sudo apt-get update -y
!sudo apt-get install -y nvhpc-22-7-cuda-multi

# setting environment variable path
import os
os.environ["PATH"] = "/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/bin" + os.pathsep + os.getenv("PATH")
# ------

deb [trusted=yes] https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64  InRelease [2,126 B]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Ign:3 https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64  InRelease
Get:7 https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64  Packages [11.9 kB]
Hit:8 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Get:9 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [1,081 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Hit:11 https://ppa.launchpadcontent.net/deadsnakes/p

Let's check the installed version of the compiler to ensure the installation was successful.

In [8]:
!nvc++ --version


nvc++ 22.7-0 64-bit target on x86-64 Linux -tp haswell 
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.


The expected output is:
```
nvc++ 22.7-0 64-bit target on x86-64 Linux -tp haswell
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
```

If your output is different, make sure you are connected to the runtime and re-do the installation.

In [9]:
# download the source code
import gdown
import os
url = 'https://drive.google.com/uc?id=1FDsXtSVmqk8TI4tZvAMbYiS8761_gWLM&export=download'
#url = 'https://drive.google.com/u/0/uc?export=download&confirm=1FDsXtSVmqk8TI4tZvAMbYiS8761_gWLM'
output = '/home/code'
gdown.download(url, output, quiet=False,proxy=None)
!unzip /home/code -d /home
!rm /home/code

Downloading...
From: https://drive.google.com/uc?id=1FDsXtSVmqk8TI4tZvAMbYiS8761_gWLM&export=download
To: /home/code
100%|██████████| 173k/173k [00:00<00:00, 65.4MB/s]


Archive:  /home/code
replace /home/source_code/lab1/Makefile? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

Let's check that we downloaded the source files.

In [10]:
!ls /home/source_code

lab1  lab2  lab3  lab4	lab5  lab6


The expected output is:

```
lab1  lab2  lab3  lab4	lab5  lab6
```

If your output is different, make sure you are connected to the runtime and re-run the cell to download the source code.

Before we begin, run the cell below to display information about the NVIDIA® CUDA® driver and the GPUs on the server with the `nvidia-smi` command. To do this, click the cell and press Ctrl+Enter, or press the play button in the toolbar above. You should see some output returned below the grey cell.


In [11]:
!nvidia-smi

Tue Oct 10 08:33:36 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Mini Weather Application

In this lab, we will accelerate a fluid simulation in the context of atmosphere and weather simulation.
This mini weather code mimics the basic dynamics seen in atmospheric weather and climate.

The figure below demonstrates how a narrow jet of fast and slightly cold wind is injected into a balanced, neutral atmosphere at rest from the left domain near the model.

<!--<img src="images/Time.jpg" width="80%" height="80%">-->
<img src="https://drive.google.com/uc?export=download&id=1V1BcUCAKnRGAOGtNckJpmmnAhW35TKU4">


The simulation is an iterative process that runs from time 0 to the desired simulated time, advancing by Δt on every iteration.
Each Δt step performs practically the same operation: it solves a differential equation that represents how the flow of the atmosphere (fluid) changes in response to small perturbations. To simplify the solution, the code uses dimensional splitting: the X and Z dimensions are treated independently.


<!--<img src="images/X_Y.jpg" width="80%" height="80%">-->
<img src="https://drive.google.com/uc?export=download&id=1pHafZwxlyylYKkos6APbs_UZeYNSwEK5">


The differential equation has a time derivative that needs integration, and a simple low-storage Runge-Kutta ordinary differential equations (ODE) solver is used to integrate the time derivative. In each time step, the order in which the dimensions are solved is reversed, giving second-order accuracy.


<!--<img src="images/Range-Kutta.jpg" width="70%" height="70%">-->
<img src="https://drive.google.com/uc?export=download&id=1Nex_qku2NFNvIFPM8sO0zYob666ZaDwg">

### The objective of this exercise is not to dwell on the math but to make use of OpenACC to parallelize and improve performance.

The general flow of the code is shown in the diagram below. For each time step, the differential equations are solved.


<!--<img src="images/Outer_Loop.jpg" width="70%" height="70%">-->
<img src="https://drive.google.com/uc?export=download&id=1Gy8LrKpRTCXbwkO9lAsJKADyAkGEYks4">


```cpp
while (etime < sim_time) {
    //If the time step leads to exceeding the simulation time, shorten it for the last step
    if (etime + dt > sim_time) { dt = sim_time - etime; }
    //Perform a single time step
    perform_timestep(state,state_tmp,flux,tend,dt);
    //Inform the user
    if (masterproc) { printf( "Elapsed Time: %lf / %lf\n", etime , sim_time ); }
    //Update the elapsed time and output counter
    etime = etime + dt;
    output_counter = output_counter + dt;
    //If it's time for output, reset the counter, and do output
    if (output_counter >= output_freq) {
      output_counter = output_counter - output_freq;
      output(state,etime);
    }
  }
  
```

At every time step, the order in which the directions are solved is reversed to obtain second-order accuracy.


<!--<img src="images/Time_Step.jpg" width="70%" height="70%">-->
<img src="https://drive.google.com/uc?export=download&id=1o5psNO0wq-dgZ510dlV52NlyHjLAirAl">


```cpp
void perform_timestep( double *state , double *state_tmp , double *flux , double *tend , double dt ) {
  if (direction_switch) {
    //x-direction first
    semi_discrete_step( state , state     , state_tmp , dt / 3 , DIR_X , flux , tend );
    semi_discrete_step( state , state_tmp , state_tmp , dt / 2 , DIR_X , flux , tend );
    semi_discrete_step( state , state_tmp , state     , dt / 1 , DIR_X , flux , tend );
    //z-direction second
    semi_discrete_step( state , state     , state_tmp , dt / 3 , DIR_Z , flux , tend );
    semi_discrete_step( state , state_tmp , state_tmp , dt / 2 , DIR_Z , flux , tend );
    semi_discrete_step( state , state_tmp , state     , dt / 1 , DIR_Z , flux , tend );
  } else {
    //z-direction first
    semi_discrete_step( state , state     , state_tmp , dt / 3 , DIR_Z , flux , tend );
    semi_discrete_step( state , state_tmp , state_tmp , dt / 2 , DIR_Z , flux , tend );
    semi_discrete_step( state , state_tmp , state     , dt / 1 , DIR_Z , flux , tend );
    //x-direction second
    semi_discrete_step( state , state     , state_tmp , dt / 3 , DIR_X , flux , tend );
    semi_discrete_step( state , state_tmp , state_tmp , dt / 2 , DIR_X , flux , tend );
    semi_discrete_step( state , state_tmp , state     , dt / 1 , DIR_X , flux , tend );
  }
  if (direction_switch) { direction_switch = 0; } else { direction_switch = 1; }
}
```
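
The three calls per direction follow a low-storage Runge-Kutta pattern. As a minimal sketch, assuming each `semi_discrete_step` computes `state_out = state_init + dt * tendency(state_forcing)` (the function body is not shown above), the same update can be tried on the scalar test equation dy/dt = -y. The names `rk3_step` and `tendency` are hypothetical and exist only for this illustration.

```python
import math

def tendency(y):
    """Right-hand side of the test ODE dy/dt = -y (a stand-in for the flux tendencies)."""
    return -y

def rk3_step(y, dt):
    """Advance y by one time step using the dt/3, dt/2, dt/1 stage pattern."""
    y_tmp = y + (dt / 3) * tendency(y)       # stage 1: forcing evaluated at y
    y_tmp = y + (dt / 2) * tendency(y_tmp)   # stage 2: forcing evaluated at stage 1
    return y + (dt / 1) * tendency(y_tmp)    # stage 3: forcing evaluated at stage 2

dt = 0.1
y_new = rk3_step(1.0, dt)
exact = math.exp(-dt)
print(f"numerical: {y_new:.8f}  exact: {exact:.8f}  error: {abs(y_new - exact):.2e}")
```

For this linear test problem, the three stages reproduce the Taylor series of the exact solution up to and including the dt³ term, which is why the error after one step is small. Note that every stage restarts from the same initial value `y`, so only one temporary copy of the state is needed.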


<!--<img src="images/Semi_Discrete.jpg" width="70%" height="70%">-->
<img src="https://drive.google.com/uc?export=download&id=1ZB0yy3bMy0ncmj3mMExb4Pi-IdxQJnds">

# Lab 1

### Learning objectives

The **goal** of this lab is to:

- Learn how to compile your serial application with the NVIDIA HPC compiler
- Learn how to benchmark and profile the serial code using NVIDIA Nsight Systems
- Learn how to identify routines responsible for the bulk of the execution time via NVIDIA Tools Extension SDK (NVTX) markers shown on the Nsight Systems timeline
- Learn about scaling and Amdahl’s law

We do not intend to cover:

- The OpenACC programming model
- Advanced optimization techniques in detail

Understanding the structure of the code is very important to identify opportunities and parallelize the code.

**Understand and analyze** the code:

In [None]:
!cat /home/source_code/lab1/miniWeather_serial.cpp

In [None]:
!cat /home/source_code/lab1/Makefile

**Compile** the code with the NVIDIA HPC compiler by running `make`. You can get compiler feedback by adding the `-Minfo` flag. Some of the available options are:

- `accel` – Print compiler operations related to the accelerator
- `all` – Print all compiler output
- `intensity` – Print loop intensity information

Example usage: `-Minfo=accel`

In [14]:
# compile the C/C++ code
!cd /home/source_code/lab1 && make clean && make

rm -f *.o miniWeather *.nsys-rep *.sqlite *.ncu-rep
nvc++ -O3 -w -ldl -Minfo=accel -o miniWeather miniWeather_serial.cpp -I/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.0/include 


Now, we can **profile** the serial code via the Nsight Systems command line (see the example command below) and download the report.

In [17]:
!cd /home/source_code/lab1 && nsys profile -t nvtx --stats=true --force-overwrite true -o miniWeather_1 ./miniWeather 40 20 20

The arguments supplied are 40 20 20
nx_glob, nz_glob: 40 20
dx,dz: 500.000000 500.000000
dt: 1.666667
Elapsed Time: 0.000000 / 20.000000
Elapsed Time: 1.666667 / 20.000000
Elapsed Time: 3.333333 / 20.000000
Elapsed Time: 5.000000 / 20.000000
Elapsed Time: 6.666667 / 20.000000
Elapsed Time: 8.333333 / 20.000000
Elapsed Time: 10.000000 / 20.000000
Elapsed Time: 11.666667 / 20.000000
Elapsed Time: 13.333333 / 20.000000
Elapsed Time: 15.000000 / 20.000000
Elapsed Time: 16.666667 / 20.000000
Elapsed Time: 18.333333 / 20.000000
Generating '/tmp/nsys-report-7d48.qdstrm'
[3/3] Executing 'nvtxsum' stats report

NVTX Range Statistics:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)   Style        Range      
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -------  ----------------
     39.4        5,235,921          1  5,235,921.0  5,235,921.0  5,235,921  5,235,921          0.0  PushPop  Total

Download the report by running the below cell and open the report via the Nsight Systems user interface (UI).

In [16]:
from google.colab import files
files.download('/home/source_code/lab1/miniWeather_1.nsys-rep')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

To identify which step in the CPU algorithm takes the most time, we annotated the code with NVTX markers that delimit the regions and different steps of the algorithm on the timeline. From the "Timeline View", check the NVTX markers displayed as part of threads.

<!--<img src="images/e1-nvtx_gui.png">-->
<img src="https://drive.google.com/uc?export=download&id=13tzndtF9qgvrgyU17knzJeRQwFnN8A66">


You can also review the NVTX statistics in the terminal once the profiling session has ended and see that most of the execution time is spent in `perform_timestep`.

<!--<img src="images/e1-nvtx_terminal.png">-->
<img src="https://drive.google.com/uc?export=download&id=1ZL1pjMdawT8uFWxTxJ36RCu2f3hjCTHs">


#### Scaling and Amdahl's law
<a name="amdahls"></a>
To plan an incremental parallelization strategy after identifying routines responsible for the bulk of the execution time, it is important to know how the application can scale. The amount of performance an application achieves by running on a GPU depends on the extent to which it can be parallelized. Code that cannot be sufficiently parallelized should run on the host, unless doing so would result in excessive transfers between the host and the device. It is very important to understand the relation between the problem size and computational performance as this can determine the amount of speedup and benefit you would get by parallelizing on the GPU.  

We can **Profile** the application again and run the executable with different values for `nx_glob`, `nz_glob` , and `sim_time` (40,20,10).

**Note:** You can provide input values for `nx_glob`, `nz_glob` , and `sim_time` where,

* `nx_glob` and `nz_glob` are the total numbers of cells in the x and z directions; and
* `sim_time` is the simulation time in seconds

The total number of cells in the x-direction must be twice the total number of cells in the z-direction. The default values are 400 cells, 200 cells, and 200 seconds.

Now, we profile the code again and open the example expected output via the Nsight Systems UI.

From the "Timeline View", take a closer look at the "NVTX" markers from the function table on the left side of the top pane and compare it with the timeline from the previous report. You can see now that the most time-consuming part of the application is the initialization.

<!--<img src="images/e1-nvtx.png">-->
<img src="https://drive.google.com/uc?export=download&id=1ziv2brYzl0_uO1SlbWTNpoo_dnouDToN">


Due to the small problem size (`nx_glob`, `nz_glob`, and `sim_time` in this example), the run time is dominated by the initialization, and there is not enough computation to make the problem suitable for the GPU.
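
To get a feel for how strongly the problem size affects the amount of work, here is a rough sketch that compares the small run (40, 20, 20) with the default run (400, 200, 200). It assumes the work is proportional to the number of grid cells times the number of time steps, and it takes the `dt` values from the program output printed in this notebook (1.666667 and 0.166667, assumed here to be exactly 5/3 and 1/6). The helper `cell_updates` is hypothetical and not part of miniWeather.

```python
from fractions import Fraction
from math import ceil

def cell_updates(nx_glob, nz_glob, sim_time, dt):
    """Rough work estimate: number of grid cells times number of time steps."""
    steps = ceil(Fraction(sim_time) / dt)   # exact arithmetic avoids rounding surprises
    return nx_glob * nz_glob * steps, steps

# dt values assumed from the printed output: 1.666667 -> 5/3, 0.166667 -> 1/6
small_work, small_steps = cell_updates(40, 20, 20, Fraction(5, 3))
default_work, default_steps = cell_updates(400, 200, 200, Fraction(1, 6))

print(f"small:   {small_steps} steps, {small_work} cell updates")
print(f"default: {default_steps} steps, {default_work} cell updates")
print(f"ratio:   {default_work // small_work}x more work in the default run")
```

Under these assumptions the default run performs 10,000 times more cell updates than the small run, while the one-time initialization cost grows far more slowly. This is why the initialization dominates the small run.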

According to *Amdahl's law*, the speedup achieved by accelerating portions of an application is limited by the code sections that are not accelerated. Before parallelizing an application, it is important to know that the overall performance improvement gained by optimizing a portion of the code is limited by the fraction of time spent in that portion. In other words, you may speed up a portion of the code by a factor of "N", but if only a small fraction of the run time is spent in that portion, the overall performance will not improve substantially.
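
Amdahl's law is commonly written as S = 1 / ((1 - p) + p / N), where p is the fraction of the run time that is accelerated and N is the speedup of that fraction. The short sketch below evaluates this formula; the fractions used are hypothetical values chosen for illustration, not measurements from miniWeather.

```python
def amdahl_speedup(parallel_fraction, n):
    """Overall speedup when a fraction of the run time is accelerated by a factor n."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)

# Hypothetical fractions for illustration only (not measured from miniWeather)
for p in (0.10, 0.50, 0.90, 0.99):
    print(f"accelerated fraction {p:4.2f}: "
          f"10x -> {amdahl_speedup(p, 10):5.2f}x, "
          f"1000x -> {amdahl_speedup(p, 1000):6.2f}x")
```

Even with a 1000x faster accelerated portion, the overall speedup stays below 1 / (1 - p). With only half of the run time accelerated, for example, the application can never run more than twice as fast.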

So, in this example, changing the problem size can hide the initialization part of the code and make it a better candidate for the GPU. Now that you have determined what the most important bottleneck is, modify the application to make this problem more appropriate for the GPU.

# Lab 2


### Learning objectives

The **goal** of this lab is to:
- Implement OpenACC parallelism using parallel directives to parallelize the serial application
- Learn how to compile your parallel application with the NVIDIA HPC compiler
- Benchmark and compare the parallel version of the application with the serial version
- Learn how to interpret NVIDIA HPC compiler feedback to ensure the applied optimization was successful

We do not intend to cover:

- The OpenACC programming model
- Advanced optimization techniques in detail

After inspecting the profiler report from the terminal, we noticed that most of the computation is done in `perform_timestep`. So, we ported the code to the GPU using the OpenACC programming model and added OpenACC compute directives (`#pragma acc parallel`) around the expensive routines (loops) in the code.

**Understand and analyze** the code:


In [18]:
!cat /home/source_code/lab2/miniWeather_openacc.cpp

//////////////////////////////////////////////////////////////////////////////////////////
// miniWeather
// Author: Matt Norman <normanmr@ornl.gov>  , Oak Ridge National Laboratory
// This code simulates dry, stratified, compressible, non-hydrostatic fluid flows
// For documentation, please see the attached documentation in the "documentation" folder
//////////////////////////////////////////////////////////////////////////////////////////

/*
** Copyright (c) 2018, National Center for Computational Sciences, Oak Ridge National Laboratory.  All rights reserved.
**
** Portions Copyright (c) 2020, NVIDIA Corporation.  All rights reserved.
*/

#include <stdlib.h>
#include <math.h>
#include <stdio.h>
#include <nvtx3/nvToolsExt.h>

const double pi = 3.14159265358979323846264338327;   //Pi
const double grav = 9.8;                             //Gravitational acceleration (m / s^2)
const double cp = 1004.;                             //Specific heat of dry air at constant pressure
const doubl

In [19]:
!cat /home/source_code/lab2/Makefile

# Copyright (c) 2020 NVIDIA Corporation.  All rights reserved. 

CC := nvc++
CFLAGS := -O3 -w
ACCFLAGS := -ta=tesla:managed -Minfo=accel

FC := nvfortran
FFLAGS := -fast
LDFLAGS := -lnvhpcwrapnvtx
    
miniWeather_c: miniWeather_openacc.cpp
	${CC} ${CFLAGS} ${ACCFLAGS} -o miniWeather miniWeather_openacc.cpp 

miniWeather_f: miniWeather_openacc.f90
	$(FC) $(FFLAGS) $(ACCFLAGS) miniWeather_openacc.f90 -o miniWeather $(LDFLAGS)
    
clean:
	rm -f *.o miniWeather *.nsys-rep *.sqlite *.ncu-rep


Now, compile the code with `make`.  View the NVIDIA HPC compiler feedback (enabled by adding `-Minfo=accel` flag) and investigate the compiler feedback for the OpenACC code. The compiler feedback provides useful information about applied optimizations.

In [20]:
# compile the C/C++ code
!cd /home/source_code/lab2 && make clean && make

rm -f *.o miniWeather *.nsys-rep *.sqlite *.ncu-rep
nvc++ -O3 -w -ta=tesla:managed -Minfo=accel -o miniWeather miniWeather_openacc.cpp 
semi_discrete_step(double *, double *, double *, double, int, double *, double *):
    249, Generating NVIDIA GPU code
        249, #pragma acc loop gang /* blockIdx.x */
        251, #pragma acc loop seq
        253, #pragma acc loop seq
    249, Generating default present(tend[:],state_out[:],state_init[:])
    251, Complex loop carried dependence of tend-> prevents parallelization
         Loop carried dependence of state_out-> prevents parallelization
         Loop carried backward dependence of state_out-> prevents vectorization
         Complex loop carried dependence of state_out->,state_init-> prevents parallelization
    253, Complex loop carried dependence of tend-> prevents parallelization
         Loop carried dependence of state_out-> prevents parallelization
         Loop carried backward dependence of state_out-> prevents vectorization
 

Let's inspect part of the compiler feedback and see what it's telling us (the lines in the compiler feedback might be slightly different for you).

<!--<img src="images/cfeedback1_.png">-->
<img src="https://drive.google.com/uc?export=download&id=14QF8jMIWX0ziqBBmlCRQBI8Nl_dZKfXc">


- Using `-ta=tesla:managed`, we instruct the compiler to build for an NVIDIA GPU using "CUDA Managed Memory".
- Using the `-Minfo` command-line option, we see all output from the compiler. In this example, we use `-Minfo=accel` to see only the output corresponding to the accelerator (in this case an NVIDIA GPU).
- Let's look at the line starting with `compute_tendencies_x`. It tells us which function the following information is in reference to.
- The line starting with 278, shows we created a parallel OpenACC loop. This loop is made up of gangs (a grid of blocks in CUDA language) and vector parallelism (threads in CUDA language) with the vector size being 128 per gang. `278, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */`
- The rest of the information concerns data movement. The compiler detected the possible need to move data and handled it for us. We will get into this later in this lab.

It is very important to inspect the feedback to make sure the compiler is doing what you have asked of it.

Now, let's **profile** the application for smaller values of `nx_glob`,`nz_glob`, and `sim_time`: **40, 20, 100**.

In [21]:
!cd /home/source_code/lab2 && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o miniWeather_2 ./miniWeather 40 20 100

The arguments supplied are 40 20 100
nx_glob, nz_glob: 40 20
dx,dz: 500.000000 500.000000
dt: 1.666667
Elapsed Time: 0.000000 / 100.000000
Elapsed Time: 1.666667 / 100.000000
Elapsed Time: 3.333333 / 100.000000
Elapsed Time: 5.000000 / 100.000000
Elapsed Time: 6.666667 / 100.000000
Elapsed Time: 8.333333 / 100.000000
Elapsed Time: 10.000000 / 100.000000
Elapsed Time: 11.666667 / 100.000000
Elapsed Time: 13.333333 / 100.000000
Elapsed Time: 15.000000 / 100.000000
Elapsed Time: 16.666667 / 100.000000
Elapsed Time: 18.333333 / 100.000000
Elapsed Time: 20.000000 / 100.000000
Elapsed Time: 21.666667 / 100.000000
Elapsed Time: 23.333333 / 100.000000
Elapsed Time: 25.000000 / 100.000000
Elapsed Time: 26.666667 / 100.000000
Elapsed Time: 28.333333 / 100.000000
Elapsed Time: 30.000000 / 100.000000
Elapsed Time: 31.666667 / 100.000000
Elapsed Time: 33.333333 / 100.000000
Elapsed Time: 35.000000 / 100.000000
Elapsed Time: 36.666667 / 100.000000
Elapsed Time: 38.333333 / 100.000000
Elapsed Time: 4

You can see that the changes actually made the code run slower than the non-accelerated, CPU-only version. Let's review the profiler's report. Download the report by running the cell below and open it in the NVIDIA Nsight™ Systems user interface (UI).

In [22]:
from google.colab import files
files.download('/home/source_code/lab2/miniWeather_2.nsys-rep')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

The timeline of the application is shown below.

<!--<img src="images/1_timeline_full.png" width=90%>-->
<img src="https://drive.google.com/uc?export=download&id=1KCh3gtkcEi6-a5b1EV1WsN5XbFD9Z673">


Hovering over the blue chart in the CUDA device row, we see that the CUDA kernel coverage on the GPU is about 80-90% throughout. This means that the GPU is idle for the remaining 10-20% of the time.

<!--<img src="images/1_gpu_row.png" width=90%>-->
<img src="https://drive.google.com/uc?export=download&id=1KoAYglxjyY1h1yUDXwtt6qgHe7Gc_zP3">


**Let's zoom into the timeline to see what's going on.** Press the "Ctrl" key while moving the mouse scroll wheel up or down to zoom into or out of the area around the mouse pointer. Another way to zoom in is to select the region you want to zoom into and press the "Shift"+"Z" keys.

<!--<img src="images/1_timeline.png" width=90%>-->
<img src="https://drive.google.com/uc?export=download&id=1U6G-7VqIIscQUevlP2MHtBjoEzfdeNaG">


Zoom into the OpenACC row. Nsight Systems is capable of capturing information about OpenACC execution in the profiled process. Under the CPU rows in the timeline tree, each thread that uses OpenACC will show OpenACC trace information. You can click on an OpenACC application programming interface (API) call to see the correlation with the underlying CUDA API calls (highlighted in teal). If the OpenACC API results in GPU work, that will also be highlighted:

<!--<img src="images/1_correlation.png" width=90%>-->
<img src="https://drive.google.com/uc?export=download&id=1D1ZOVvdO9IYlj0_h0B_g50hDCSBjTyKF">


If you hover over a particular OpenACC construct, it will bring up a tooltip with details about that construct:

<!--<img src="images/1_openacc_row.png" width=90%>-->
<img src="https://drive.google.com/uc?export=download&id=1CrEWDLCeUe-9N6fyIHrc5iFhrOA4Aq7q">


From the "Timeline View" in the top pane, double-click "CUDA" in the function table on the left to expand it. Zoom in on the timeline and you will see a pattern similar to the screenshot below: a repeating pattern where the GPU is idle for some time, followed by a burst of kernel and memory operations. The blue boxes are the compute kernels, and each group of kernels is surrounded by purple and teal boxes (annotated in red) representing data movement. **The screenshots show the profiler report for the values 400, 200, 200.**

<!--<img src="images/nsys_slow.png" width=90%>-->
<img src="https://drive.google.com/uc?export=download&id=1SyX_qSwyVx3sSjFxoQBPnFYErIBf_W3j">


Hover your mouse over the kernels (blue boxes) one by one in each row and review the information provided.

<!--<img src="images/occu-1.png" width=90% >-->
<img src="https://drive.google.com/uc?export=download&id=18RfQN_00HCuiLyNSVMNrHH9d-iZ59KVT">


**Note**: In the next two exercises, we start optimizing the application by improving occupancy and reducing data movements.

# Lab 3

### Learning objectives

The **goal** of this lab is to:
- Learn about GPU occupancy, and  OpenACC vs CUDA execution model
- Learn how to find GPU occupancy from the NVIDIA Nsight™ Systems profiler
- Learn how to improve the occupancy and saturate compute resources
- Learn about collapse clauses for further optimization of the parallel nested loops and when to use them
- Apply collapse clause to eligible nested loops in the application and investigate the profiler report

We do not intend to cover:

- The OpenACC programming model
- Advanced optimization techniques in detail

Look at the profiler report from the previous exercise again and take a close look at the kernel functions on the timeline. For example, we can see that the `semi_discrete_step_249_gpu` kernel has a theoretical occupancy of 50%, which indicates that occupancy is a limiting factor. *Occupancy* is a measure of how well the GPU compute resources are being utilized (i.e., how much parallelism is running relative to how much parallelism the hardware could run).

<!--<img src="images/occu-2_.png" width="25%" height="25%">-->
<img src="https://drive.google.com/uc?export=download&id=1uJVLZHJ5HnMFP1iSHOFQqGy8Lh-_Z5YU">

NVIDIA GPUs are composed of multiple streaming multiprocessors (SMs), each of which can manage up to 2048 concurrent threads on many architectures (the exact limit depends on the GPU generation, and not all of these threads are actively executing at the same instant). Low occupancy shows that there are not enough active threads to fully utilize the computing resources. Higher occupancy gives the scheduler more active threads to choose from, which generally leads to higher performance. So, what does this mean in the OpenACC execution model?

**Gang, Worker, and Vector**
CUDA and the OpenACC programming model use different terminologies for similar ideas. For example, in CUDA, parallel execution is organized into grids, blocks, and threads. On the other hand, the OpenACC execution model has three levels of gang, worker, and vector. OpenACC assumes the device has multiple processing elements (streaming multiprocessors on NVIDIA GPUs) running in parallel and the mapping of the OpenACC execution model on CUDA is as below:

- An OpenACC gang is a threadblock
- A worker is a warp
- An OpenACC vector is a CUDA thread

<!--<img src="images/diagram.png" width="50%" height="50%">-->
<img src="https://drive.google.com/uc?export=download&id=1Ii2W0upfrzDYM9YxENyU1RNLXowlhdsK">


In order to improve the occupancy, we have to increase the parallelism within the gang; we have to increase the number of threads that can be scheduled on the GPU to improve GPU thread occupancy.
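
As a rough sketch of why a small number of threads per gang limits occupancy, the function below estimates theoretical occupancy from the block size alone. It uses the 2048-threads-per-SM figure quoted above and a warp size of 32 threads; the limit of 32 resident blocks per SM is an assumption for illustration (the real limits depend on the GPU architecture), and register and shared-memory limits are ignored. The name `theoretical_occupancy` is hypothetical.

```python
import math

def theoretical_occupancy(threads_per_block, max_threads_per_sm,
                          max_blocks_per_sm, warp_size=32):
    """Fraction of an SM's warp slots that can be filled, ignoring register
    and shared-memory limits."""
    warps_per_block = math.ceil(threads_per_block / warp_size)
    max_warps_per_sm = max_threads_per_sm // warp_size
    # Resident blocks are capped by the block limit and by the available warp slots
    resident_blocks = min(max_blocks_per_sm, max_warps_per_sm // warps_per_block)
    return resident_blocks * warps_per_block / max_warps_per_sm

# Assumed limits for illustration: 2048 threads and 32 blocks per SM
for block in (1, 32, 128):
    occ = theoretical_occupancy(block, max_threads_per_sm=2048, max_blocks_per_sm=32)
    print(f"{block:4d} threads per block -> {occ:.0%} theoretical occupancy")
```

With these assumed limits, a block of a single thread still occupies a whole warp slot, so the block limit is reached long before the warp slots are full. Larger blocks (longer vectors in OpenACC terms) let the same number of resident blocks fill more warp slots.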

**Optimizing loops and improving occupancy**
Let's have a look at the compiler feedback (*Line 249*) and the corresponding code snippet showing three tightly nested loops (the lines in the compiler feedback might be slightly different for you).

<!--<img src="images/cfeedback2_.png" width="80%" height="80%">-->
<img src="https://drive.google.com/uc?export=download&id=11CgkQrNwms2Y5cNe1lTPkeSWEX9Oe6Mc">


The iteration count for the outer loop is `NUM_VARS` which is 4. As you can see from the above screenshot, the block dimension is <1,1,1> which shows the small amount of parallelism within the gang.

```cpp
#pragma acc parallel loop private(indt, indf1, indf2)
  for (ll = 0; ll < NUM_VARS; ll++)
  {
    for (k = 0; k < nz; k++)
    {
      for (i = 0; i < nx; i++)
      {
        indt = ll * nz * nx + k * nx + i;
        indf1 = ll * (nz + 1) * (nx + 1) + k * (nx + 1) + i;
        indf2 = ll * (nz + 1) * (nx + 1) + k * (nx + 1) + i + 1;
        tend[indt] = -(flux[indf2] - flux[indf1]) / dx;
      }
    }
  }
```

To expose more parallelism and improve occupancy, we can add a clause called `collapse` to the `#pragma acc loop` directive. The loop directive gives the compiler additional information about the next loop in the source code through several clauses. Applying the `collapse(N)` clause to a loop directive collapses the next `N` tightly nested loops into a single, flattened loop. This is useful if you have many nested loops or when you have really short loops.

When the iteration count of any loop in a tightly nested loop nest is small relative to the number of threads available on the device, creating a single iteration space across all the nested loops increases the iteration count, allowing the compiler to extract more parallelism.

**Tips on where to use:**
- Collapse outer loops to enable the creation of more gangs.
- Collapse inner loops to enable longer vector lengths.
- Collapse all loops, when possible, to do both
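
The effect of collapsing on the iteration space can be illustrated with a small conceptual sketch. It is not how the compiler generates code; it only shows that three tightly nested loops over `ll`, `k`, and `i` visit the same iterations as one flattened loop of `NUM_VARS * nz * nx` iterations. The sizes used for `nz` and `nx` and the helper `unflatten` are hypothetical.

```python
NUM_VARS, nz, nx = 4, 3, 5   # small hypothetical sizes so the check runs quickly

def unflatten(idx, nz, nx):
    """Recover (ll, k, i) from a position in the collapsed iteration space."""
    ll, rest = divmod(idx, nz * nx)
    k, i = divmod(rest, nx)
    return ll, k, i

# Nested form: the outer loop alone offers only NUM_VARS iterations
nested = [(ll, k, i) for ll in range(NUM_VARS) for k in range(nz) for i in range(nx)]

# Collapsed form: one loop over NUM_VARS * nz * nx independent iterations
collapsed = [unflatten(idx, nz, nx) for idx in range(NUM_VARS * nz * nx)]

print("same iterations, same order:", nested == collapsed)
print("iterations available to parallelize:", NUM_VARS, "->", len(collapsed))
```

In the nested form, parallelizing only the outer loop exposes just `NUM_VARS` (4) iterations. After collapsing, every `(ll, k, i)` combination becomes an independent iteration that can be distributed across gangs and vector lanes.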

We added the `collapse` clause to the code and made the necessary changes to the loop directives. **Understand and analyze** the code:

In [23]:
!cat /home/source_code/lab3/miniWeather_openacc.cpp

//////////////////////////////////////////////////////////////////////////////////////////
// miniWeather
// Author: Matt Norman <normanmr@ornl.gov>  , Oak Ridge National Laboratory
// This code simulates dry, stratified, compressible, non-hydrostatic fluid flows
// For documentation, please see the attached documentation in the "documentation" folder
//////////////////////////////////////////////////////////////////////////////////////////

/*
** Copyright (c) 2018, National Center for Computational Sciences, Oak Ridge National Laboratory.  All rights reserved.
**
** Portions Copyright (c) 2020, NVIDIA Corporation.  All rights reserved.
*/

#include <stdlib.h>
#include <math.h>
#include <stdio.h>
#include <nvtx3/nvToolsExt.h>

const double pi = 3.14159265358979323846264338327;   //Pi
const double grav = 9.8;                             //Gravitational acceleration (m / s^2)
const double cp = 1004.;                             //Specific heat of dry air at constant pressure
const doubl

Now, compile the code with `make`.

In [24]:
!cd /home/source_code/lab3 && make clean && make

rm -f *.o miniWeather *.nsys-rep *.sqlite *.ncu-rep
nvc++ -O3 -w -ta=tesla:managed -Minfo=accel -o miniWeather miniWeather_openacc.cpp 
semi_discrete_step(double *, double *, double *, double, int, double *, double *):
    249, Generating NVIDIA GPU code
        249, #pragma acc loop gang, vector(128) collapse(3) /* blockIdx.x threadIdx.x */
        251,   /* blockIdx.x threadIdx.x collapsed */
        253,   /* blockIdx.x threadIdx.x collapsed */
    249, Generating implicit copyin(tend[:]) [if not already present]
         Generating implicit copyout(state_out[:]) [if not already present]
         Generating implicit copyin(state_init[:]) [if not already present]
compute_tendencies_x(double *, double *, double *):
    278, Generating NVIDIA GPU code
        278, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
        280,   /* blockIdx.x threadIdx.x collapsed */
        283, #pragma acc loop seq
        285, #pragma acc loop seq
    278, Local memory used 

Let us inspect the compiler feedback and check whether the optimizations were applied (the line numbers in the compiler feedback might be slightly different for you). Below is a screenshot of the expected compiler feedback after adding the `collapse` clause to the code. You can see that the nested loops on lines 249 and 278 have been successfully collapsed.

<!--<img src="images/cfeedback3_.png" width="80%" height="80%">-->
<img src="https://drive.google.com/uc?export=download&id=1m6bWXy5Cbz2i7mZCJviGbiTBSJ3DKmbf">


Now, **Profile** the code with Nsight Systems command line `nsys`.

In [33]:
!cd /home/source_code/lab3 && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o miniWeather_3 ./miniWeather

Using default values ...
nx_glob, nz_glob: 400 200
dx,dz: 50.000000 50.000000
dt: 0.166667
Elapsed Time: 0.000000 / 200.000000
Elapsed Time: 0.166667 / 200.000000
Elapsed Time: 0.333333 / 200.000000
Elapsed Time: 0.500000 / 200.000000
Elapsed Time: 0.666667 / 200.000000
Elapsed Time: 0.833333 / 200.000000
Elapsed Time: 1.000000 / 200.000000
Elapsed Time: 1.166667 / 200.000000
Elapsed Time: 1.333333 / 200.000000
Elapsed Time: 1.500000 / 200.000000
Elapsed Time: 1.666667 / 200.000000
Elapsed Time: 1.833333 / 200.000000
Elapsed Time: 2.000000 / 200.000000
Elapsed Time: 2.166667 / 200.000000
Elapsed Time: 2.333333 / 200.000000
Elapsed Time: 2.500000 / 200.000000
Elapsed Time: 2.666667 / 200.000000
Elapsed Time: 2.833333 / 200.000000
Elapsed Time: 3.000000 / 200.000000
Elapsed Time: 3.166667 / 200.000000
Elapsed Time: 3.333333 / 200.000000
Elapsed Time: 3.500000 / 200.000000
Elapsed Time: 3.666667 / 200.000000
Elapsed Time: 3.833333 / 200.000000
Elapsed Time: 4.000000 / 200.000000
Elapsed T

Download the report by running the cell below and open it in the Nsight Systems UI.

In [None]:
from google.colab import files
files.download('/home/source_code/lab3/miniWeather_3.nsys-rep')

Now, have a close look at the kernel functions on the timeline and the occupancy.

<!--<img src="images/occu-3_.png" width="70%">-->
<img src="https://drive.google.com/uc?export=download&id=1GEpGEkVpCscdoRvEPAeAYd28ktNlMLvc">


As you can see from the above screenshot, the theoretical occupancy is now 75% and the block dimension is now `<128,1,1>` where *128* is the vector size per gang. **Screenshots represent profiler reports for the values of 400,200,200.**

```cpp
#pragma acc parallel loop collapse(3) private(inds, indt)
  for (ll = 0; ll < NUM_VARS; ll++)
  {
    for (k = 0; k < nz; k++)
    {
      for (i = 0; i < nx; i++)
      {
        inds = ll * (nz + 2 * hs) * (nx + 2 * hs) + (k + hs) * (nx + 2 * hs) + i + hs;
        indt = ll * nz * nx + k * nx + i;
        state_out[inds] = state_init[inds] + dt * tend[indt];
      }
    }
  }
```

The iteration count for the collapsed loop is `NUM_VARS * nz * nx` where (in the example screenshot):

- `nz` = 200,
- `nx` = 400, and
- `NUM_VARS` = 4

So, the iteration count for this particular loop inside the `semi_discrete_step_249_gpu` kernel is 320,000 (4 × 200 × 400). Dividing this number by the vector length of *128* gives the grid dimension of `<2500,1,1>`.
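
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the helper name `grid_size` is hypothetical, and it assumes one loop iteration per thread with the gang count rounded up.

```python
def grid_size(num_vars, nz, nx, vector_length=128):
    """Return (iteration count, number of gangs) for the collapsed loop.

    Assumes one loop iteration per thread and rounds up so that a
    partially filled final gang is still counted.
    """
    iterations = num_vars * nz * nx
    gangs = -(-iterations // vector_length)  # ceiling division
    return iterations, gangs

iterations, gangs = grid_size(num_vars=4, nz=200, nx=400)
print(iterations, gangs)  # 320000 2500
```

Calling `grid_size` with the smaller default problem from Lab 5 (`nx` = 40, `nz` = 20) shows how quickly the grid shrinks when the problem size drops.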

By creating a single iteration space across the nested loops and increasing the iteration count, we improved the occupancy and extracted more parallelism.

**Notes:**
- 100% occupancy is not required, nor does it guarantee the best performance.
- Less than 50% occupancy is often a red flag.

How much this optimization will speed up the code will vary according to the application and the target accelerator, but it is not uncommon to see large speed-ups by using collapse on loop nests.

Moreover, you can see that the `semi_discrete_step_249_gpu` kernel now takes 3.9%, compared to 50.1% in the previous section. This means the bottleneck has shifted and we must now focus on another kernel. In the next section, we focus on optimizing data movement and eliminating unnecessary data migrations.

<!--<img src="images/compare_23.png" width="70%"> -->
<img src="https://drive.google.com/uc?export=download&id=1XprHAzRV-aNq_6OoJIdFh0JSf7ullPse">

# Lab 4

### Learning objectives

The **goal** of this lab is to:

- Learn how to identify redundant memory copies via NVIDIA Nsight™ Systems
- Learn how to improve efficiency by reducing extra data copies via OpenACC data directives
- Learn how to use NVIDIA HPC compiler feedback for guidance on where to insert OpenACC data directives
- Apply data directives to the parallel application, benchmark, and profile the application

We do not intend to cover:

- The OpenACC programming model
- Advanced optimization techniques in detail

Let's inspect the profiler report from the previous exercise. From the "Timeline View" on the top pane, double-click on "CUDA" from the function table on the left and expand it. Zoom in on the timeline and you can see a pattern similar to the screenshot below. The blue boxes are the compute kernels and each of these groupings of kernels is surrounded by purple and green boxes (annotated with a green rectangle) representing data movements. If you hover your mouse over each box, you can see more details.

The timeline shows that there is a lot of data movement between the GPU and the CPU.
    
<!--<img src="images/nsys_data_mv.png">-->
<img src="https://drive.google.com/uc?export=download&id=1keCaaM91iPQd7Xm_fNbyyrXFMh5u9jEt">


The compiler feedback we collected earlier tells us quite a bit about data movement too. If we look again at the compiler feedback from above, we see the following (the lines in the compiler feedback might be slightly different for you).

<!--<img src="images/cfeedback3-1_.png" width="80%" height="80%">-->
<img src="https://drive.google.com/uc?export=download&id=1XEeEf6LZo9L3VqU9gOb8o0JQAsaf8Swm">


The compiler feedback tells us that the compiler has inserted data movement around our parallel region at line 278, which copies the `hy_dens_cell`, `hy_dens_theta_cell`, and `state` arrays in and out of GPU memory and also copies the `flux` array out.

The compiler can only work with the information we provide. It knows we need the `hy_dens_cell`, `hy_dens_theta_cell`, `state`, and `flux` arrays on the GPU for the accelerated section within the  `compute_tendencies_x` function, but we did not tell the compiler anything about what happens to the data outside of those sections. Without this knowledge, the compiler has to copy the full arrays to the GPU and back to the CPU for each accelerated section. This is a good deal of unnecessary data transfers.

Ideally, we would want to move the data (example: `hy_dens_cell`, `hy_dens_theta_cell`, `state` arrays) to the GPU at the beginning, and only transfer it back to the CPU at the end (if needed). And as for the `flux` array in this example, we do not need to copy any data back and forth. So we only need to create space on the device (GPU) for this array.
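
To get a feel for the volume involved, the sketch below estimates the size of these four arrays. The extents are taken from the array shapes printed in the lab4 compiler feedback (for example `state[:((nz+4)*(nx+4))*4]`); the helper name `array_bytes`, the 8-byte `double`, and the assumption that `nx` and `nz` equal the global sizes 400 and 200 are ours. Treat the result as a rough upper bound on what a full copy of each array would move, not as a measured value.

```python
BYTES_PER_DOUBLE = 8  # assumption: arrays hold 8-byte doubles

def array_bytes(nx, nz):
    """Approximate size in bytes of the arrays used in compute_tendencies_x."""
    return {
        "state": (nz + 4) * (nx + 4) * 4 * BYTES_PER_DOUBLE,
        "flux": (nz + 1) * (nx + 1) * 4 * BYTES_PER_DOUBLE,
        "hy_dens_cell": (nz + 4) * BYTES_PER_DOUBLE,
        "hy_dens_theta_cell": (nz + 4) * BYTES_PER_DOUBLE,
    }

sizes = array_bytes(nx=400, nz=200)
total = sum(sizes.values())

for name, nbytes in sizes.items():
    print(f"{name:20s} {nbytes:>10,d} bytes")
print(f"{'total':20s} {total:>10,d} bytes")
```

Roughly 5 MB per full copy is small on its own, but repeating it around every accelerated section of every time step adds up quickly.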

We need to give the compiler information about how to reduce the extra and unnecessary data movement. By adding an OpenACC `data` directive to a structured code block, the compiler will know how to manage data according to the clauses. For information on the data directive clauses, please visit [OpenACC 3.0 Specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC.3.0.pdf).

We added `data` directives to the code. **Understand and analyze** the code:


In [50]:
!cat /home/source_code/lab4/miniWeather_openacc.cpp

//////////////////////////////////////////////////////////////////////////////////////////
// miniWeather
// Author: Matt Norman <normanmr@ornl.gov>  , Oak Ridge National Laboratory
// This code simulates dry, stratified, compressible, non-hydrostatic fluid flows
// For documentation, please see the attached documentation in the "documentation" folder
//////////////////////////////////////////////////////////////////////////////////////////

/*
** Copyright (c) 2018, National Center for Computational Sciences, Oak Ridge National Laboratory.  All rights reserved.
**
** Portions Copyright (c) 2020, NVIDIA Corporation.  All rights reserved.
*/

#include <stdlib.h>
#include <math.h>
#include <stdio.h>
#include <nvtx3/nvToolsExt.h>

const double pi = 3.14159265358979323846264338327;   //Pi
const double grav = 9.8;                             //Gravitational acceleration (m / s^2)
const double cp = 1004.;                             //Specific heat of dry air at constant pressure
const doubl

Now, compile the code with `make`.

In [51]:
!cd /home/source_code/lab4 && make clean && make

rm -f *.o miniWeather *.nsys-rep *.sqlite *.ncu-rep
nvc++ -O3 -w -ta=tesla -Minfo=accel -o miniWeather miniWeather_openacc.cpp 
main:
    142, Generating copy(state[:((nz+4)*(nx+4))*4]) [if not already present]
         Generating create(flux[:((nz+1)*(nx+1))*4]) [if not already present]
         Generating copyin(hy_dens_int[:nz+1],hy_dens_cell[:nz+4],hy_dens_theta_cell[:nz+4],hy_dens_theta_int[:nz+1]) [if not already present]
         Generating create(tend[:(nz*nx)*4]) [if not already present]
         Generating copyin(state_tmp[:((nz+4)*(nx+4))*4],hy_pressure_int[:nz+1]) [if not already present]
    178, Generating update self(state[:((nz+4)*(nx+4))*4])
semi_discrete_step(double *, double *, double *, double, int, double *, double *):
    253, Generating NVIDIA GPU code
        253, #pragma acc loop gang, vector(128) collapse(3) /* blockIdx.x threadIdx.x */
        255,   /* blockIdx.x threadIdx.x collapsed */
        257,   /* blockIdx.x threadIdx.x collapsed */
    253, Generati

Inspect the compiler feedback and check whether the optimizations were applied (the line numbers in the compiler feedback might be slightly different for you). Below is a screenshot of the expected compiler feedback after adding the `data` directives. You can see that on line 281, the compiler generates default present for the `hy_dens_cell`, `hy_dens_theta_cell`, `state`, and `flux` arrays. In other words, it assumes that the data is already present on the GPU and copies it to the GPU only if it is not there.

<!--<img src="images/cfeedback4_.png" width="90%" > -->
<img src="https://drive.google.com/uc?export=download&id=1YhyNwTaF5udxraSTNfhkqwBe_a7kDbQ3">

Now, **Profile** the code with Nsight Systems command line `nsys`.

In [37]:
!cd /home/source_code/lab4 && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o miniWeather_4 ./miniWeather

Using default values ...
nx_glob, nz_glob: 400 200
dx,dz: 50.000000 50.000000
dt: 0.166667
Elapsed Time: 0.000000 / 200.000000
Elapsed Time: 0.166667 / 200.000000
Elapsed Time: 0.333333 / 200.000000
Elapsed Time: 0.500000 / 200.000000
Elapsed Time: 0.666667 / 200.000000
Elapsed Time: 0.833333 / 200.000000
Elapsed Time: 1.000000 / 200.000000
Elapsed Time: 1.166667 / 200.000000
Elapsed Time: 1.333333 / 200.000000
Elapsed Time: 1.500000 / 200.000000
Elapsed Time: 1.666667 / 200.000000
Elapsed Time: 1.833333 / 200.000000
Elapsed Time: 2.000000 / 200.000000
Elapsed Time: 2.166667 / 200.000000
Elapsed Time: 2.333333 / 200.000000
Elapsed Time: 2.500000 / 200.000000
Elapsed Time: 2.666667 / 200.000000
Elapsed Time: 2.833333 / 200.000000
Elapsed Time: 3.000000 / 200.000000
Elapsed Time: 3.166667 / 200.000000
Elapsed Time: 3.333333 / 200.000000
Elapsed Time: 3.500000 / 200.000000
Elapsed Time: 3.666667 / 200.000000
Elapsed Time: 3.833333 / 200.000000
Elapsed Time: 4.000000 / 200.000000
Elapsed T

Download the report by running the cell below and open it in the Nsight Systems user interface (UI).

In [40]:
from google.colab import files
files.download('/home/source_code/lab4/miniWeather_4.nsys-rep')


Now, have a look at the expected output example below:

<!--<img src="images/nsys_fast_mv_.png">-->
<img src="https://drive.google.com/uc?export=download&id=1fBqK87wjOjoaiUYPQ15B3b__Yurn2SJP">


Look at the data movements annotated with green and purple color and compare them with the previous versions. We have accelerated the application and reduced the execution time by eliminating the unnecessary data transfers between CPU and GPU.

Let's look at the NVTX ranges to see how much speedup we achieved after multiple optimizations.

|   | Serial | Parallel (lab2) | Parallel (lab3) | Parallel (lab4) |
| --- | --- | --- | --- | --- |
| Total | 27.66 s | 157.47 s | 7.14 s | 1.27 s |
| While | 27.66 s | 157.09 s | 6.77 s | 926.69 ms |
| perform_timestep | 22.72 s | 131.09 ms | 4.89 ms | 0.78399 ms |
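
As a quick check on the table, the following sketch computes the speedup of each version relative to the serial run from the "Total" row. The numbers are copied from the table; the variable names are ours.

```python
# "Total" NVTX range in seconds, copied from the table above.
totals = {
    "Serial": 27.66,
    "Parallel (lab2)": 157.47,
    "Parallel (lab3)": 7.14,
    "Parallel (lab4)": 1.27,
}

serial = totals["Serial"]
speedups = {name: serial / seconds for name, seconds in totals.items()}

for name, speedup in speedups.items():
    print(f"{name:16s} {speedup:6.2f}x")
```

A value below 1 (as for lab2) means that version was slower than the serial code; the lab4 version comes out at roughly 22x.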

**Note**: The next exercise gives an introductory overview of the NVIDIA Nsight Compute tool and is optional.

# Lab 5

###  Learning objectives

The **goal** of this lab is to:

- Learn how to inspect the application's kernels with NVIDIA Nsight™ Compute
- Learn how to execute rules inside the Nsight Compute profiler and find bottlenecks
- Learn how to add baselines and compare results/reports

We do not intend to cover:

- The OpenACC programming model
- Advanced optimization techniques in detail

As mentioned earlier on, Nsight Compute and Nsight Systems each serve a different purpose in profiling with different functionalities. In previous exercises, we inspected the timelines, measured activity durations, and tracked CPU events via the Nsight Systems profiler. The purpose of this exercise is to get familiar with the Nsight Compute tool. This tool provides access to kernel-level analysis using GPU performance metrics.

We first profile the GPU application and identify areas of the code that do not behave as expected. Then we isolate those kernels and profile them via Nsight Compute.

**Understand and analyze** the code present at:

In [41]:
!cat /home/source_code/lab5/miniWeather_openacc.cpp

//////////////////////////////////////////////////////////////////////////////////////////
// miniWeather
// Author: Matt Norman <normanmr@ornl.gov>  , Oak Ridge National Laboratory
// This code simulates dry, stratified, compressible, non-hydrostatic fluid flows
// For documentation, please see the attached documentation in the "documentation" folder
//////////////////////////////////////////////////////////////////////////////////////////

/*
** Copyright (c) 2018, National Center for Computational Sciences, Oak Ridge National Laboratory.  All rights reserved.
**
** Portions Copyright (c) 2020, NVIDIA Corporation.  All rights reserved.
*/

#include <stdlib.h>
#include <math.h>
#include <stdio.h>
#include <nvtx3/nvToolsExt.h>

const double pi = 3.14159265358979323846264338327;   //Pi
const double grav = 9.8;                             //Gravitational acceleration (m / s^2)
const double cp = 1004.;                             //Specific heat of dry air at constant pressure
const doubl

Now, compile the code with `make`.

In [42]:
!cd /home/source_code/lab5 && make clean && make

rm -f *.o miniWeather *.nsys-rep *.sqlite *.ncu-rep
nvc++ -O3 -w -ta=tesla:managed -Minfo=accel -o miniWeather miniWeather_openacc.cpp 
main:
    142, Generating copy(state[:((nz+4)*(nx+4))*4]) [if not already present]
         Generating create(flux[((nz+1)*(nx+1))*4]) [if not already present]
         Generating copyin(hy_dens_int[nz+1],hy_dens_cell[nz+4],hy_dens_theta_cell[nz+4],hy_dens_theta_int[nz+1]) [if not already present]
         Generating create(tend[(nz*nx)*4]) [if not already present]
         Generating copyin(state_tmp[((nz+4)*(nx+4))*4],hy_pressure_int[nz+1]) [if not already present]
    178, Generating update self(state[((nz+4)*(nx+4))*4])
semi_discrete_step(double *, double *, double *, double, int, double *, double *):
    253, Generating NVIDIA GPU code
        253, #pragma acc loop gang, vector(128) collapse(3) /* blockIdx.x threadIdx.x */
        255,   /* blockIdx.x threadIdx.x collapsed */
        257,   /* blockIdx.x threadIdx.x collapsed */
    253, Generatin

Now, **Profile** the code with Nsight Systems command line interface (CLI):

In [43]:
!cd /home/source_code/lab5 && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o miniWeather_5 ./miniWeather

Using default values ...
nx_glob, nz_glob: 40 20
dx,dz: 500.000000 500.000000
dt: 1.666667
Elapsed Time: 0.000000 / 10.000000
Elapsed Time: 1.666667 / 10.000000
Elapsed Time: 3.333333 / 10.000000
Elapsed Time: 5.000000 / 10.000000
Elapsed Time: 6.666667 / 10.000000
Elapsed Time: 8.333333 / 10.000000
Generating '/tmp/nsys-report-0ddc.qdstrm'
[3/8] Executing 'nvtxsum' stats report

NVTX Range Statistics:

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)   Style        Range      
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  -------  ----------------
     95.3      572,038,951          1  572,038,951.0  572,038,951.0  572,038,951  572,038,951          0.0  PushPop  Total           
      2.4       14,241,411          1   14,241,411.0   14,241,411.0   14,241,411   14,241,411          0.0  PushPop  while           
      2.4       14,116,867          6    2,352,811.2    2

Download the report by running the cell below and open it in the Nsight Systems user interface (UI).

In [44]:
from google.colab import files
files.download('/home/source_code/lab5/miniWeather_5.nsys-rep')


As shown in the example output, the initialization is very expensive and the kernels are very short, meaning that the GPU compute part of the problem is very small. Check what percentage of the total runtime is spent in each kernel.

<!--<img src="images/5_init.png" width="80%">-->
<img src="https://drive.google.com/uc?export=download&id=1ZDokTNM39JfiOkhejoweiBXIgoZATcIP">


From the "Timeline View", inspect the less efficient kernel. Next, inspect the most expensive kernel and see what Nsight Compute recommends.

<!--<img src="images/cexer5.png" width="80%">-->
<img src="https://drive.google.com/uc?export=download&id=1f08tI6XXmiJ8VovINaJveH4a56djmuOt">


Now, **Profile** the application via Nsight Compute CLI (`ncu`):

In [45]:
!cd /home/source_code/lab5 && ncu --set full -k regex:compute_tendencies_x --launch-skip 10 --launch-count 1 -f -o miniWeather1 ./miniWeather

Using default values ...
==PROF== Connected to process 20619 (/home/source_code/lab5/miniWeather)
nx_glob, nz_glob: 40 20
dx,dz: 500.000000 500.000000
dt: 1.666667
Elapsed Time: 0.000000 / 10.000000
==PROF== Profiling "_Z28compute_tendencies_x_282_gpuPdS_S_": 0%....50%....100% - 32 passes
Elapsed Time: 1.666667 / 10.000000
Elapsed Time: 3.333333 / 10.000000
Elapsed Time: 5.000000 / 10.000000
Elapsed Time: 6.666667 / 10.000000
Elapsed Time: 8.333333 / 10.000000
==PROF== Disconnected from process 20619
==PROF== Report: /home/source_code/lab5/miniWeather1.ncu-rep


Download the report by running the cell below and open it in the Nsight Compute UI.

In [46]:
from google.colab import files
files.download('/home/source_code/lab5/miniWeather1.ncu-rep')


This tool has many sections that focus on different areas of the GPU and presents them all on one page.

**Note:** If you do not specify a kernel name when profiling, all kernels will be profiled, which increases the profiling time.

<!--<img src="images/ccompute.png">-->
<img src="https://drive.google.com/uc?export=download&id=1g9guvRrjpFvz1swTT50PQl6bXpTw-FGl">


The "GPU Speed Of Light Throughput" section shows less than 1% Compute (SM) Throughput. As you can see from the example output below, the Nsight Compute profiler suggests looking at the "Launch Statistics" section because the kernel grid is too small to fill the available resources on the GPU.

We previously discussed Amdahl's law in the first exercise. It is very important to understand the relation between the problem size and computational performance, as this can determine the amount of speedup and benefit you get by parallelizing on the GPU. Due to the small problem size (`nx_glob`, `nz_glob`, and `sim_time` in this example), most of the runtime is dominated by initialization and there is not enough computation to make the problem suitable for the GPU. Run the application with larger values of `nx_glob`, `nz_glob`, and `sim_time` and profile the same kernel (`nx_glob` = 400, `nz_glob` = 200, and `sim_time` = 100).
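
To make the Amdahl's law argument concrete, the sketch below plugs in the `Total` and `while` NVTX range times from the `nsys` output above for the small problem. It treats the `while` range as the only part that further GPU tuning could speed up, which is a simplifying assumption, and the function name `amdahl_speedup` is ours.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when a fraction of the runtime is sped up by `factor`."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# NVTX range totals (ns) from the nsys run with the default small problem.
total_ns = 572_038_951
while_ns = 14_241_411

fraction = while_ns / total_ns
limit = amdahl_speedup(fraction, float("inf"))  # main loop made infinitely fast

print(f"time spent in the main loop: {fraction:.1%}")
print(f"best possible overall speedup: {limit:.3f}x")
```

With only about 2.5% of the runtime in the main loop, even an infinitely fast kernel would improve the total by less than 3%, which is why a larger problem is needed before kernel-level tuning pays off.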

In [47]:
!cd /home/source_code/lab5 && make clean && make

rm -f *.o miniWeather *.nsys-rep *.sqlite *.ncu-rep
nvc++ -O3 -w -ta=tesla:managed -Minfo=accel -o miniWeather miniWeather_openacc.cpp 
main:
    142, Generating copy(state[:((nz+4)*(nx+4))*4]) [if not already present]
         Generating create(flux[((nz+1)*(nx+1))*4]) [if not already present]
         Generating copyin(hy_dens_int[nz+1],hy_dens_cell[nz+4],hy_dens_theta_cell[nz+4],hy_dens_theta_int[nz+1]) [if not already present]
         Generating create(tend[(nz*nx)*4]) [if not already present]
         Generating copyin(state_tmp[((nz+4)*(nx+4))*4],hy_pressure_int[nz+1]) [if not already present]
    178, Generating update self(state[((nz+4)*(nx+4))*4])
semi_discrete_step(double *, double *, double *, double, int, double *, double *):
    253, Generating NVIDIA GPU code
        253, #pragma acc loop gang, vector(128) collapse(3) /* blockIdx.x threadIdx.x */
        255,   /* blockIdx.x threadIdx.x collapsed */
        257,   /* blockIdx.x threadIdx.x collapsed */
    253, Generatin

In [48]:
!cd /home/source_code/lab5 && ncu --set full -k regex:compute_tendencies_x --launch-skip 100 --launch-count 1 -f -o miniWeather2 ./miniWeather 400 200 100

The arguments supplied are 400 200 100
==PROF== Connected to process 21446 (/home/source_code/lab5/miniWeather)
nx_glob, nz_glob: 400 200
dx,dz: 50.000000 50.000000
dt: 0.166667
Elapsed Time: 0.000000 / 100.000000
Elapsed Time: 0.166667 / 100.000000
Elapsed Time: 0.333333 / 100.000000
Elapsed Time: 0.500000 / 100.000000
Elapsed Time: 0.666667 / 100.000000
Elapsed Time: 0.833333 / 100.000000
Elapsed Time: 1.000000 / 100.000000
Elapsed Time: 1.166667 / 100.000000
Elapsed Time: 1.333333 / 100.000000
Elapsed Time: 1.500000 / 100.000000
Elapsed Time: 1.666667 / 100.000000
Elapsed Time: 1.833333 / 100.000000
Elapsed Time: 2.000000 / 100.000000
Elapsed Time: 2.166667 / 100.000000
Elapsed Time: 2.333333 / 100.000000
Elapsed Time: 2.500000 / 100.000000
==PROF== Profiling "_Z28compute_tendencies_x_282_gpuPdS_S_": 0%....50%....100% - 32 passes
Elapsed Time: 2.666667 / 100.000000
Elapsed Time: 2.833333 / 100.000000
Elapsed Time: 3.000000 / 100.000000
Elapsed Time: 3.166667 / 100.000000
Elapsed Tim

Download the report by running the cell below and open it in the Nsight Compute user interface (UI).

In [49]:
from google.colab import files
files.download('/home/source_code/lab5/miniWeather2.ncu-rep')


**Diff the reports**

Open both reports via the Nsight Compute UI. From the top of the first report, click on *Add Baseline*, then do the same for the second report shown in the second tab. Have a look at the expected output:

<!--<img src="images/c2compute.png">-->
<img src="https://drive.google.com/uc?export=download&id=1LrJAwgOrhs9fayAYuWBGcaStIpI7dx6o">


You can now compare the two reports and see how the change in problem size affected specific metrics by looking at each section. By increasing the problem size (the number of grid cells), we increased the "Compute Throughput" and "Memory Throughput" by 35% and 26% respectively.


Next, look at the Roofline chart, which shows a high-level overview of the utilization of the GPU's compute and memory resources. We can see that increasing the problem size moved the performance closer to the rooflines.

<!--<img src="images/roofline.png">-->
<img src="https://drive.google.com/uc?export=download&id=1Htd3oGohIqXI0cDyNlI5VGCFMAA3i-Fk">


However, the kernel is still too small to fully utilize the GPU, and "Compute" and "Memory" are still less than 50% utilized. The "GPU Speed of Light Throughput" section gives a high-level overview of the throughput for compute and memory resources of the GPU for each unit. Based on this information we can find the performance limiters and categorize them into four possible combinations:

- Compute Bound: SM>50% & Mem<50%
- Bandwidth Bound: SM<50% & Mem>50%
- Latency Bound: SM<50% & Mem<50%
- Compute and Bandwidth Bound : SM>50% & Mem>50%
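
The four categories can be written as a small helper. This is a sketch of the rule of thumb above, not something Nsight Compute provides: the function name `classify_limiter` is hypothetical, and treating a value of exactly 50% as "not above the threshold" is our assumption, since the list does not cover that case.

```python
def classify_limiter(sm_pct, mem_pct, threshold=50.0):
    """Classify a kernel from its SM and memory throughput percentages."""
    compute_high = sm_pct > threshold
    memory_high = mem_pct > threshold
    if compute_high and memory_high:
        return "Compute and Bandwidth Bound"
    if compute_high:
        return "Compute Bound"
    if memory_high:
        return "Bandwidth Bound"
    return "Latency Bound"

print(classify_limiter(sm_pct=35.0, mem_pct=26.0))  # Latency Bound
print(classify_limiter(sm_pct=80.0, mem_pct=30.0))  # Compute Bound
```

The percentages passed in are placeholders; read the actual values from the "GPU Speed Of Light Throughput" section of your own report.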


According to the Roofline, this kernel is fp64 bound, and we should consider using 32-bit precision floating point operations to improve its performance.

The detailed "Memory Workload Analysis" section shows all the data traffic between various stages of the GPU and what your kernel is actually transferring. This section suggests that we need to look at the memory access pattern in the code, as the load/store pattern is not optimal. The solution is to minimize how many cache lines need to be accessed per memory request.


This is out of scope for this tutorial, but you can have a look at the algorithm and see if you can change anything to do more work per memory access.

# Links and Resources

[OpenACC API Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[NVIDIA Nsight Compute](https://docs.nvidia.com/nsight-compute/index.html)


**NOTE**: To be able to view the Nsight Systems and Nsight Compute profiler outputs, please download the latest versions from the pages below:

- https://developer.nvidia.com/nsight-systems
- https://developer.nvidia.com/nsight-compute


Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

---

## Licensing

Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.