# Optimizing Online 5G Machine-Learning with Nsight Compute

## 04.1 Nsight Compute CUDA Kernel Profiler

We identified the CUDA kernel `kernel_apsm_detect` as our optimization target. To understand its performance in detail, we can profile it with Nsight Compute. Let's start with a short introduction to the tool:

Nsight Compute is an `interactive CUDA kernel profiler` with
* Targeted metric sections for various performance aspects
* Customizable data collection and presentation (tables, charts, ...)
* UI and Command Line
* Python-based rules for guided analysis (or post-processing)
* Support for remote profiling across machines and platforms

<img src="images/ncu_intro_01.png" width="900">

Detailed `memory workload analysis` charts and tables help you understand bottlenecks between different hardware units, and how efficiently those units are utilized. The tool supports comparing data in most charts and tables against one or `multiple baselines`, to see the impact of any optimizations to your code. Comparisons are supported across kernels, reports and GPU architectures.
 
<img src="images/ncu_intro_02.png" width="900">

The `Source` page provides correlation between high-level CUDA-C/C++ source, PTX and SASS (assembly). Several metrics are available per instruction for a detailed `line-by-line analysis` of the source code. The metric heatmap helps to quickly find the hot spot for a particular metric.

<img src="images/ncu_intro_03.png" width="900">

## 04.2 Profiling kernel_apsm_detect Interactively

Before we start looking into the kernel's performance, have a look at [apsm_versions.h](apsm/cpp/lib/apsm/apsm_versions.h): you will find multiple implementations of this kernel there, which are selected by setting the `APSM_DETECT_VERSION` define. We start with the `Cooperative Groups` (CG) implementation `APSM_DETECT_CG`, but there is also an `ORIGINAL` version implemented without CG, which you could compare later if you are interested.

We will be using the Nsight Compute UI in a remote desktop environment. Execute the following cell to generate the URL for the remote desktop, which you should copy and paste into a new browser tab. The noVNC password is `nvidia`. Then continue to follow the presenter, or the instructions below.

In [None]:
%%js
var url = window.location.hostname + '/nsight/';
element.append(url)

### Steps without instructor in `...`

Switch to the Ubuntu instance (password `nvidia`), open the `Find Application` tool with the search (magnifying glass) icon, and search for `compute` to select Nsight Compute. Within the tool, open the prepared project by activating `Load Project` and selecting the `/root/Desktop/reports/ncu/apsm.ncu-proj` project file. This opens the connection dialog, which is now pre-filled with the application details.

<img src="images/ncu_connect.png" width="700">

After launching, Nsight Compute connects to the target application and suspends it at the first CUDA API call, visible in the API Stream tool window. Since we want to profile the `kernel_apsm_detect` kernel, enter that name in the `Next Trigger` edit field, and select the green `Run to Next Kernel` button. This lets the application continue until just before that kernel is launched.

Before starting to profile, enable the `full` section set in the `Sections/Rules Info` tool window, in order to have Nsight Compute collect the full set of curated metrics. As we only have a single kernel to profile, we are not too concerned about the overhead when replaying the kernel multiple times. Afterwards, click `Profile Kernel` and wait for the report to be created. Finally, we can `Terminate` the target application.

<img src="images/ncu_full_set.png" width="700">
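The same collection can also be scripted with the Nsight Compute command line interface, `ncu`. The following is a minimal sketch that only assembles and prints such a command. The application path `./apsm_app` and the report name `apsm_detect_full` are hypothetical placeholders and are not taken from the prepared project file; substitute the values shown in the connection dialog.

```python
import shlex


def build_ncu_command(app, kernel="kernel_apsm_detect", report="apsm_detect_full"):
    """Assemble an ncu invocation that mirrors the interactive steps above."""
    return [
        "ncu",
        "--set", "full",           # collect the full section set
        "--kernel-name", kernel,   # restrict profiling to this kernel name
        "--launch-count", "1",     # profile a single matching launch
        "--export", report,        # write the results to a report file
        app,
    ]


# "./apsm_app" is a hypothetical placeholder for the target application.
command = build_ncu_command("./apsm_app")
print(shlex.join(command))
```

The printed line can be pasted into a terminal on the target machine, and the resulting report can then be opened in the Nsight Compute UI for the analysis that follows.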

Now we can start analyzing the created profiler report. On the `Details` page, inspect the sections from top to bottom and pay attention to the `Recommendations` generated by the tool. 

The first section shows that the kernel has very low utilization of the SM compute units (`SOL SM`), and also quite low throughput of the memory units (`SOL Memory`). The tool suggests that the kernel is latency bound, and that we should continue with the `Scheduler Statistics` and `Warp State Statistics` sections. However, feel free to inspect the information shown in the `Compute and Memory Workload Analysis` sections, too.

<img src="images/ncu_report01_01.png">

In the `Scheduler Statistics` section, we can see that the theoretical number of warps per scheduler (4) is only half of what the hardware is capable of (8). Consequently, the `Active Warps` are below the GPU maximum, too. While this is not problematic by itself, we can see that well below one warp per cycle is `issued` by the scheduler, resulting in multiple-cycle delays between work being scheduled.

<img src="images/ncu_report01_02.png">
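The relation between issue rate and scheduling delay can be made concrete with a little arithmetic. In this sketch the warp counts (4 and 8) come from the section above, while the issue rate of 0.25 warps per cycle is a made-up value for illustration only; read the actual value from your own report.

```python
def cycles_between_issues(issued_warps_per_cycle):
    """Average number of cycles between two issues on one scheduler."""
    if issued_warps_per_cycle <= 0:
        raise ValueError("issue rate must be positive")
    return 1.0 / issued_warps_per_cycle


theoretical_warps_per_scheduler = 4  # from the Scheduler Statistics section
max_warps_per_scheduler = 8          # hardware limit quoted above
slot_usage = theoretical_warps_per_scheduler / max_warps_per_scheduler

# Hypothetical issue rate, for illustration only (not taken from the report).
example_issue_rate = 0.25

print(f"usable warp slots per scheduler: {slot_usage:.0%}")
print(f"average cycles between issues: {cycles_between_issues(example_issue_rate):.1f}")
```

A scheduler that issues one warp every cycle would report 1.0 cycles between issues, so any value above that is time in which the scheduler had no eligible warp to pick.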

The tool suggests either reducing warp stalls (which we could investigate in the `Warp State Statistics` section) or increasing the number of active (and thereby also eligible/issued) warps. While we could start either way, the fact that our `theoretical warps` are only half of the GPU hardware maximum suggests that we aren't even `occupying` the full available hardware. It can be a good strategy to first have the kernel use all the available SM compute units, and then optimize the per-unit usage.

To continue down this path, we can use the `Occupancy` section for further input.

<img src="images/ncu_report01_03.png">

We can see that we have 50% theoretical and ~49% achieved occupancy. `Active Warps per SM` shows the same ratio between theoretical and achieved values, which reflects the close relationship between warps per scheduler and active warps per SM. As such, changing our kernel to have 100% theoretical occupancy will likely also result in similarly improved theoretical and achieved warps per scheduler.
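The link between the per-scheduler and per-SM numbers can be checked with a short calculation. This sketch assumes 4 warp schedulers per SM, which is an assumption about the GPU and is not stated in the report excerpts above; the warp counts and the ~49% achieved occupancy are the values quoted in this notebook.

```python
SCHEDULERS_PER_SM = 4  # assumption about the GPU, not taken from the report


def occupancy(active_warps_per_sm, max_warps_per_sm):
    """Occupancy as the fraction of warp slots on an SM that are in use."""
    return active_warps_per_sm / max_warps_per_sm


max_warps_per_sm = 8 * SCHEDULERS_PER_SM          # hardware limit per scheduler: 8
theoretical_warps_per_sm = 4 * SCHEDULERS_PER_SM  # theoretical warps per scheduler: 4
theoretical_occupancy = occupancy(theoretical_warps_per_sm, max_warps_per_sm)

achieved_occupancy = 0.49  # approximate value from the Occupancy section
achieved_warps_per_sm = achieved_occupancy * max_warps_per_sm
achieved_warps_per_scheduler = achieved_warps_per_sm / SCHEDULERS_PER_SM

print(f"theoretical occupancy: {theoretical_occupancy:.0%}")
print(f"achieved warps per SM: {achieved_warps_per_sm:.2f}")
print(f"achieved warps per scheduler: {achieved_warps_per_scheduler:.2f}")
```

The occupancy ratio itself does not depend on the assumed scheduler count, because that factor cancels out; only the absolute per-SM warp counts do.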

After the analysis, we can move on to optimize and re-evaluate the kernel in [step 05](05_spb.ipynb).