![NCAR UCAR Logo](../../NCAR_CISL_NSF_banner.jpeg)
# Hands-On Session with Nsight Systems and Compute

By: Brett Neuman [bneuman@ucar.edu](mailto:bneuman@ucar.edu), Consulting Services Group, CISL & NCAR

Date: June 16th 2022


In this notebook we explore profiling of the mini-app [MiniWeather](https://github.com/mrnorman/miniWeather) to present profiling techniques and code examples. We will cover:

1. Overview of Profiling and Performance Sampling Tools
   * Typical development workflows with profiling tools
2. Nsight Systems for Overview Analysis of GPU Program Runtimes
   * How to generate nsys reports and command line parameters
   * Analysis of nsys reports and investigating the program timeline
   * Generating Nsight Compute profiling commands from nsys reports


Head to the [NCAR JupyterHub portal](https://jupyterhub.hpc.ucar.edu/stable) and __start a JupyterHub session on Casper login__ (or batch nodes using 1 CPU, no GPUs) and open the notebook in `10_HandsOnNsight/nsys/10_HandsOnNsight_nsys.ipynb`. Be sure to clone (if needed) and update/pull the NCAR GPU_workshop directory.

```shell
# Use the JupyterHub GitHub GUI on the left panel or the shell commands below
git clone git@github.com:NCAR/GPU_workshop.git   # first time only
cd GPU_workshop
git pull                                         # update an existing clone
```

# Workshop Etiquette
* Please mute yourself and turn off video during the session.
* Questions may be submitted in the chat and will be answered when appropriate. You may also raise your hand, unmute, and ask questions during Q&A at the end of the presentation.
* By participating, you are agreeing to [UCAR’s Code of Conduct](https://www.ucar.edu/who-we-are/ethics-integrity/codes-conduct/participants)
* Recordings & other material will be archived & shared publicly.
* Feel free to follow up with the GPU workshop team via Slack or submit support requests to [support.ucar.edu](https://support.ucar.edu)
    * Office Hours: Asynchronous support via [Slack](https://ncargpuusers.slack.com) or schedule a time with an organizer

## Notebook Setup
Set `PROJECT` to a currently active project code (e.g. `UCIS0004` for the GPU workshop) and `QUEUE` to the appropriate routing queue: `gpuworkshop` during a live workshop session, `gpudev` on weekdays from 8am to 5:30pm MT, or `casper` at all other times. Because shared GPU resources are limited, please use `GPU_TYPE=gp100` during the workshop. For independent work, set `GPU_TYPE=v100` (required for `gpudev`). See the [Casper queue documentation](https://arc.ucar.edu/knowledge_base/72581396#StartingCasperjobswithPBS-Concurrentresourcelimits) for more info.

In [None]:
export PROJECT=UCIS0004
export QUEUE=gpudev
export GPU_TYPE=gp100

module load nvhpc/22.2 &> /dev/null
export PNETCDF_INC=/glade/u/apps/dav/opt/pnetcdf/1.12.2/openmpi/4.1.1/nvhpc/22.2/include
export PNETCDF_LIB=/glade/u/apps/dav/opt/pnetcdf/1.12.2/openmpi/4.1.1/nvhpc/22.2/lib
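
The queue rules above can be sketched as a small shell function. This is only an illustration: `pick_queue` is a hypothetical helper, not part of the workshop repository, and it covers only the `gpudev`/`casper` choice (keep `QUEUE=gpuworkshop` during a live session). It takes the day of week (1 = Monday), hour, and minute in Mountain Time.

```shell
# Hypothetical helper: choose a routing queue from the Mountain Time
# day of week (1=Mon .. 7=Sun), hour (00-23), and minute (00-59).
pick_queue() (
  dow=$1
  hour=${2#0}   # strip a leading zero so "08" is not read as octal
  min=${3#0}
  now=$(( hour * 60 + min ))          # minutes since midnight
  # gpudev: weekdays from 8:00am (480) up to 5:30pm (1050)
  if [ "$dow" -le 5 ] && [ "$now" -ge 480 ] && [ "$now" -lt 1050 ]; then
    echo gpudev
  else
    echo casper
  fi
)

# Pick a queue for the current time in Mountain Time
QUEUE=$(pick_queue $(TZ=America/Denver date '+%u %H %M'))
export QUEUE
echo "QUEUE=$QUEUE"
```

The time is passed in as arguments rather than read inside the function so the rule can be checked for any day and time.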

## Profilers - Why Bother?

So you have some code.  Maybe you own it, maybe you’re inheriting it, maybe you’re trying to improve it, maybe you’re just trying to keep it operational. 

If you’re looking to **understand your code, improve its performance, or make informed decisions** about it in a **timely** fashion, profiling is a **good place to start**.

**The profiler does not make decisions for you.** 
Profilers provide information that could lead to more efficient use of resources for your code!  Be mindful that profiling can add significant runtime overhead to your application.



## How to get there
1. Profile your code!

2. Make sure you have your baseline performance
   * Performance is relative here
   * Your baseline should be a realistic run of the application (real data, reasonable runtime)


3. Attempt to find potential performance gains using profiling tools, your experience, and working around your constraints
    
   * Common project constraints include:
        
        
     * cluster configurations
     * hardware architectures (CPU/GPU/NIC types)
     * memory
     * flow control (simple instructions vs branching instructions)
     * programming language
     * development time
     
   * Tools can show you which sections of code consume significant runtime
     * The function with the highest runtime often has the highest potential for optimization ... **but not always**

## Profiling data collection methods

1. **Sampling**
   * Collect data at a regular interval, or sampling frequency, to understand how much time is spent in a function or application
2. **Concurrency**
   * Identify shared resource bottlenecks, communication overhead, and thread or kernel inefficiencies via call stack traces
3. **Memory**
   * Gather information on data movement, allocation, and resource availability

## The focus of our session
In this session we will focus on profiling code on clusters with NVIDIA GPUs from a researcher's perspective.  Our interest is in performant threads, kernels, GPU utilization, and memory efficiency.

# NSight Systems and Compute

Nsight Systems and Nsight Compute are used to profile, debug, and optimize applications that use NVIDIA GPUs.  You can follow along if you have Nsight Systems installed on your local machine.

Download: https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2022-2

Casper runs Nsys version `2021.2.4.12`

![NSight Workflow Diagram](img/Nsight-Diagram.png)

## NSight Systems `nsys`

Workload level analysis:
* Visualize algorithms, instruction flow, data flow, and scaling out to multiple nodes
* Identify areas to optimize within the code
* Maximize computational and memory utilization on the GPU


### The NSight Systems Profiling Model

The Nsight profiling model is based on the **Client Server** model.  The **Client** is the machine you use to view the reports generated by profiling your code.  The **Server** is the node on which you run GPU code and generate the profiling report.  NVIDIA refers to this as the **Two Phase** approach to profiling.  A good workflow for profiling your code using the Client Server model looks like this:

![NSight Workflow Diagram](img/Nsight-Workflow.png)

## GPU kernel generation

Previously, we added OpenACC directives to our miniWeather application.  The compiler handles the conversion into GPU code behind the scenes, but it is important to note that OpenACC parallel regions are converted into NVIDIA CUDA kernels.  These kernels can be analyzed for performance using Nsight Systems and Nsight Compute.


# miniWeather App OpenACC Profiling Example

## **Baseline**: Profile Generation and Analysis

We're going to profile the miniWeather application using the most basic form of `!$acc parallel loop`, without any additional clauses to help the compiler generate efficient parallel loops.  This might be a first step in porting a CPU-based routine to OpenACC.  Remember, your baseline should be a stable working version of your code with a realistic dataset and runtime.  Here we're looking at one example of this implementation in the `semi_discrete_step` subroutine.

![NSight Workflow Diagram](img/miniweather_basicacc.png)

## Setting up a baseline

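The PBS script launches the Nsight Systems profiler with the command below.  `nsys` options must come before the application; anything placed after the executable is passed to the application as its own arguments.

``` shell
nsys profile -t openacc,mpi -o miniweather_baseline fortran/build/openacc
```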

### Notable flags for `nsys profile`:

* -t (--trace) parameters: cublas, cuda, cudnn, nvtx, opengl, openacc, openmp, osrt, mpi, vulkan, none
    * `-t openmp,openacc`
* -b (--backtrace) parameters: fp, lbr, dwarf, none
    * `-b fp`
* --cuda-memory-usage parameters: true, false
    * `--cuda-memory-usage=true`
* --mpi-impl parameters: openmpi, mpich
    * `--mpi-impl=openmpi`
* -o
    * `-o myreport`
    * Names the generated profiling report
* --stats
    * `--stats=true`
    * Generate data file to analyze within the CLI
    * Takes time to generate
* -h: help with explanations for all `nsys` commands plus sub commands
    * Run below cells to see help text

Some of these options can add significant profiler overhead to your application.
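
As a rough sketch of how several of these flags combine, the snippet below assembles a command line and prints it rather than running it, so it can be tried on a node without a GPU. The report name `myreport` and this particular mix of options are illustrative assumptions, not the settings used by the workshop PBS scripts.

```shell
# Dry run: build an nsys command line and print it instead of executing it.
# nsys options go BEFORE the application; anything after the executable
# is handed to the application itself.
APP=fortran/build/openacc      # executable used in this notebook
REPORT=myreport                # hypothetical report name
NSYS_OPTS="-t cuda,openacc,mpi --mpi-impl=openmpi --cuda-memory-usage=true --stats=true -o $REPORT"

NSYS_CMD="nsys profile $NSYS_OPTS $APP"
echo "$NSYS_CMD"

# Inside a GPU job, run the assembled command with:
#   $NSYS_CMD
```

Printing the command first is a cheap way to check flag order and spelling before spending queue time on a profiling run.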

Additional options for CLI profiling can be found in the NVIDIA Nsight Systems CLI documentation:

https://docs.nvidia.com/nsight-systems/2020.3/profiling/index.html#cli-installing

In [None]:
nsys -h

In [None]:
nsys profile -h

## Launching the Profiler on Casper

In [None]:
# Comment to prevent repeat runs while testing
# qsub pbs/pbs_miniweather_baseline.sh

You will see a `.qdrep` file after this job has finished.

## Quick Analysis via CLI

In [1]:
nsys stats reports/miniweather_baseline.qdrep | grep -v "SKIPPED"

Using reports/miniweather_baseline.sqlite for SQL queries.
Running [/glade/u/apps/dav/opt/cuda/11.4.0/nsight-systems-2021.2.4/target-linux-x64/reports/cudaapisum.py reports/miniweather_baseline.sqlite]... 

 Time(%)  Total Time (ns)  Num Calls    Average      Minimum     Maximum     StdDev            Name        
 -------  ---------------  ---------  ------------  ----------  ----------  ---------  --------------------
    98.9  311,045,408,877    829,502     374,978.5         480   4,110,831  672,951.6  cuStreamSynchronize 
     0.6    1,952,049,331    414,751       4,706.6       2,995   1,276,614    3,407.5  cuLaunchKernel      
     0.1      360,215,231     92,188       3,907.4       2,308     392,225    2,940.6  cuMemcpyHtoDAsync_v2
     0.1      353,236,646     92,374       3,824.0       2,242   1,474,974    6,696.5  cuMemcpyDtoHAsync_v2
     0.1      328,508,188     46,186       7,112.7       1,109   1,278,036   59,376.3  cuCtxSynchronize    
     0.1      321,497,397    139,066 


This output will look familiar if you have used `nvprof` to profile codes previously.
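
The summary table is plain text, so it can be filtered with standard tools. The sketch below defines a hypothetical helper, `top_calls`, that keeps only the rows whose `Time(%)` column meets a threshold. It assumes the eight-column layout shown above and is demonstrated on three rows copied from that output.

```shell
# Hypothetical helper: print "name percent" for rows at or above a Time(%) threshold.
top_calls() {
  awk -v min="${1:-1}" '
    # data rows start with a number and have all 8 columns;
    # the header, the dashed rule, and truncated rows are skipped
    $1 ~ /^[0-9.]+$/ && NF == 8 && $1 + 0 >= min + 0 { printf "%s %s\n", $NF, $1 }
  '
}

# Demonstration on rows copied from the output above
SUMMARY=$(top_calls 0.5 <<'EOF'
 Time(%)  Total Time (ns)  Num Calls    Average      Minimum     Maximum     StdDev            Name
 -------  ---------------  ---------  ------------  ----------  ----------  ---------  --------------------
    98.9  311,045,408,877    829,502     374,978.5         480   4,110,831  672,951.6  cuStreamSynchronize
     0.6    1,952,049,331    414,751       4,706.6       2,995   1,276,614    3,407.5  cuLaunchKernel
     0.1      360,215,231     92,188       3,907.4       2,308     392,225    2,940.6  cuMemcpyHtoDAsync_v2
EOF
)
echo "$SUMMARY"
# cuStreamSynchronize 98.9
# cuLaunchKernel 0.6
```

On Casper the same filter could be applied to live output, for example `nsys stats reports/miniweather_baseline.qdrep | top_calls 0.5`.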


## Timeline Analysis via Nsight Systems GUI

### Transfer or View the Report

Reports for analysis are located in the `reports` folder.  For our baseline we will use the generated report:

`miniweather_baseline.qdrep`

1. Transfer the `.qdrep` file to your local machine and load it into your local installation of the Nsight Systems application
    * Download the file by right clicking and selecting `Download` on the JupyterHub browser on the left.  


2. Launch an X or VNC session on a GP100 GPU node on Casper.  Launch `nsight-nsys`. 
   * KB Article to set up VNC: https://kb.ucar.edu/display/RC/Using+remote+desktops+on+Casper+with+VNC
   * X session works but can be slow



### Nsight Systems GUI

Open the file in the NSight Systems application.  Below is the default view upon opening the application.

![NSight Default View](img/nsight_defaultview.png)

### Projects

![NSight Project View](img/nsight_projectview.png)

### Navigation

![NSight Navigation](img/nsight_navigation.png)

### Event Descriptions

![NSight Event Description](img/nsight_eventdescription.png)

## Baseline Timeline View

`miniweather_baseline.qdrep`

![Miniweather Baseline Timeline View](img/miniweather_baseline_timeline.png)

## Patterns, Gaps, Walltime and Kernels

We can find instruction patterns of interest, sections where the GPU is idle, and also view details on which kernel is running at a given time using the Timeline view.  Below is an example of a repeated pattern found in the baseline report.  It will be useful to note that the time to complete this repeated pattern is about 20ms.

Note that we zoomed into the timeline significantly.

![Miniweather Instruction Pattern](img/miniweather_pattern.png)

### Stats View

Quickly find CUDA API and GPU Kernel instruction runtimes. This is a good place to get ideas on how to make improvements.

![NSight Systems Stats View](img/miniweather_statsview_baseline.png)

## **Asynchronous** Loops Profile

The summary shows about 50% of our runtime in `cuStreamSynchronize`, so we will start by changing the existing `!$acc parallel loop` sections.

![Miniweather CUDA Summary](img/miniweather_cudasummary_baseline.png)

Modify the OpenACC loops to run asynchronously by adding the `async` clause.  The host no longer waits for a flagged loop to finish before launching the next one, so kernel launches can be queued back to back.  Add `!$acc wait` directives where one section must finish before a later section operates on its results.

![Miniweather Async Loop Code](img/miniweather_asyncloop.png)

Recompile and profile the code again to see the changes you've made.  Launch the script with the new `nsys profile` command on Casper.

``` shell
nsys profile -t openacc,mpi -o miniweather_async fortran/build/openacc
```


### Asynchronous Analysis

`miniweather_async.qdrep`

Not a significant change.  The dominant API call changed from `cuStreamSynchronize` to `cuCtxSynchronize`, but it still takes almost 50% of the runtime.

![Miniweather Async Stats](img/miniweather_async_stats.png)

We can see that the memory operations are launching from within the same stream now, suggesting that there is pipelining.

![Miniweather Async Pipeline](img/miniweather_baseline_memops.png)

We're still spending a lot of time in synchronization calls.  Can we improve the parallelization of our loops?

![Miniweather Async Timeline](img/miniweather_async_timeline.png)

## __Collapsed__ Loops Profile

Keep the OpenACC loops asynchronous and add a `collapse(n)` clause to each nested loop, where `n` is the number of tightly nested loops to merge into a single iteration space.

![Miniweather Collapsed Loop Code](img/miniweather_async_collapse.png)

Recompile and profile the code again to see the changes you've made.  Launch the script with the new `nsys profile` command on Casper.

``` shell
nsys profile -t openacc,mpi -o miniweather_async_collapsed fortran/build/openacc
```


### __Collapsed__ Loops Analysis

`miniweather_async_collapsed.qdrep`

![Miniweather Collapsed Loop Analysis](img/miniweather_collapsed_walltime.png)


Here is the `cuCtxSynchronize` wait time for the Async profile: 15 seconds spent waiting to launch a new round of instructions.

![Miniweather Async Sync Wait Time](img/miniweather_async_synctime.png)

The same `cuCtxSynchronize` call in the Collapsed loops profile is down to 1 ms.

![Miniweather Collapsed Sync Wait Time](img/miniweather_collapsed_synctime.png)

You can also spot additional calls to kernels in between synchronization, so we've improved parallelism.

![Miniweather Collapsed Sync Wait Time](img/miniweather_collapsed_instructions.png)

### Output to file and I/O operations

After zooming into the timeline for the `miniweather_async_collapsed.qdrep` file you will notice that there is an operation that occurs between kernel operations frequently. 

![miniweather_collapsed_bubbles](img/miniweather_collapsed_bubbles.png)

Hovering over the operation gives us the call stack where we can identify the IO operation.  Here we see it coming from the `_output` subroutine.  Recording the results of your simulation is important but let's see what sort of performance we can get by eliminating the call to `output`.

Compare the full timeline view of `miniweather_async_collapsed.qdrep` with that of `miniweather_nooutput.qdrep`.  You'll notice the bubbles are gone and the walltime is 32 s compared to 41 s (a 1.28x speedup).  Reducing GPU idle time and host-device memory transfers gives us a good performance gain.

![miniweather_collapsed_bubbles](img/miniweather_nobubbles.png)
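
The speedup quoted above is the ratio of the two wall times. A minimal sketch of that arithmetic, using a hypothetical `speedup` helper so other pairs of runs can be compared the same way:

```shell
# speedup OLD NEW: ratio of two wall times (same units), rounded to two decimals
speedup() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", old / new }'
}

speedup 41 32   # prints 1.28
```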

### Expert View

The Expert View is a good place to find general recommendations based on common GPU problems, and it can provide hints on where to start optimizing.

![NSight Workflow Diagram](img/Expert_Systemview.png)

## Other profiling tools

There is a lot of profiling work being done in the deep learning and scientific computing spheres. Other available tools include profilers focused on DL/ML training time and visualization, as well as APIs for annotating your own code:
1) DLProf: https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/
2) Tensorboard: https://www.tensorflow.org/tensorboard/get_started
3) NVidia Tools Extension (NVTX)
   * NVIDIA Tools Extension (NVTX) is an API that allows for additional control for profiling your applications.  NVTX can be particularly useful when you have a specific section of your code that you need to gather performance information on.  It can also be a useful intermediate step between the higher level Nsight Systems view and the kernel optimization of Nsight Compute.
   * NVTX header file used and code marked to profile specific sections of your larger codebase
   * Jiri Kraus (our next workshop presenter) has a very good walkthrough of using NVTX for C/C++: https://developer.nvidia.com/blog/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/

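Fortran example:
```fortran
program main
  ! Assumes an NVTX wrapper module providing nvtxStartRange/nvtxEndRange is available
  use nvtx
  implicit none
  integer :: n
  character(len=4) :: itcount

  call nvtxStartRange("First label")
  do n = 1, 100
    ! Create a custom label for each marker
    write(itcount,'(i4)') n
    ! Range with a custom color (selected by the second argument)
    call nvtxStartRange("Label "//itcount, n)
    ! ... work to be profiled ...
    call nvtxEndRange
  end do
  call nvtxEndRange
end program main
```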

# Launching Nsight Compute with Nsight Systems

Information from hovering over a kernel launch instruction:

![NCU Timeline View Launch](img/ncu_kernel_info.png)

You can also right click on the kernel and see a textual timeline of all instances of that kernel in your application:

![NCU Timeline View Launch](img/ncu_kernel_eventsview.png)

From here you can right click on the kernel launch instruction in the timeline and analyze it in Nsight Compute.  Select `Analyze the Selected Kernel with NVIDIA Nsight Compute`:

![NCU Timeline View Launch](img/ncu_kernel_launch.png)

Here is the window to launch Nsight Compute:

![NCU Window](img/ncu_launch.png)
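
The same kind of targeted kernel profile can be requested from the command line with `ncu`. The sketch below only assembles and prints such a command. The kernel name, the number of skipped launches, and the report name are placeholders to be replaced with values read from the Nsight Systems timeline, and the command the GUI generates may differ.

```shell
# Dry run: build an Nsight Compute command for one launch of one kernel.
KERNEL=semi_discrete_step_kernel   # hypothetical name; copy the real one from the nsys timeline
SKIP=10                            # hypothetical: skip the first 10 launches of that kernel
APP=fortran/build/openacc          # executable used in this notebook

NCU_CMD="ncu --kernel-name $KERNEL --launch-skip $SKIP --launch-count 1 -o miniweather_kernel $APP"
echo "$NCU_CMD"

# Inside a GPU job, run the assembled command with:
#   $NCU_CMD
```

Limiting collection to a single launch keeps Nsight Compute overhead manageable, since it replays each profiled kernel to gather its metrics.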

# Resources

NVidia Nsight Systems User Guide:
https://docs.nvidia.com/nsight-systems/UserGuide/index.html

Climate related optimizations for GPUs
https://github.com/mrnorman/miniWeather/wiki/A-Practical-Introduction-to-GPU-Refactoring-in-Fortran-with-Directives-for-Climate

Overview of common profiling methods
https://www.atatus.com/blog/what-is-code-profiling-a-detailed-explanation/#Types-of-Code-Profiling

NVTX Walkthrough:
https://developer.nvidia.com/blog/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/

OpenACC Best Practices for GPU Refactoring:
https://www.openacc.org/sites/default/files/inline-files/OpenACC_Programming_Guide_0_0.pdf

## Move On to Nsight Compute Profiler Tool

[Nsight Compute Profiler](../ncu/10_HandsOnNsight_ncu.ipynb)