# Chapter 9: Developer Tools

## CUDA Python Performance

To achieve optimal performance in CUDA, you must consider several factors:
- Localize memory access to minimize memory latency.
- Maximize the number of active threads per multiprocessor to ensure high utilization of your hardware.
- Minimize conditional branching.

In order to overcome the bottleneck between CPU and GPU across the PCIe bus, we want to:
- Minimize the volume of data transferred.  Transferring data in large batches can minimize the number of data transfer operations.
- Organize data in a way that complements the hardware architecture.
- Utilize asynchronous transfer features that will allow computation and data transfer to occur simultaneously.  Overlapping data transfers with computation can hide latencies caused by data transfers. 
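
To make the batching advice concrete, here is a rough back-of-the-envelope model of transfer cost: each copy pays a fixed launch latency plus the time to move its bytes. The function name and the latency and bandwidth figures are illustrative assumptions, not measurements of any particular system.

```python
def transfer_time_s(total_bytes, num_transfers,
                    latency_s=10e-6, bandwidth_bytes_per_s=16e9):
    """Estimate the time to move total_bytes using num_transfers separate copies.

    latency_s and bandwidth_bytes_per_s are assumed, illustrative values.
    """
    # Every copy pays the fixed latency; the payload time depends only on volume.
    return num_transfers * latency_s + total_bytes / bandwidth_bytes_per_s


total = 400 * 1024 * 1024  # 400 MiB of data to move to the GPU

many_small = transfer_time_s(total, num_transfers=100_000)
one_large = transfer_time_s(total, num_transfers=1)

print(f"100,000 small copies: {many_small * 1e3:.1f} ms")
print(f"1 large copy:         {one_large * 1e3:.1f} ms")
```

Under these assumptions the per-copy latency dominates when the same data is split into many small transfers, which is why batching helps even though the total volume is unchanged.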

## Common Pitfalls
The most common mistake is running CPU-only code on a GPU node. Only code that has been explicitly written to run on a GPU can take advantage of one. Ensure your code is using the correct GPU-accelerated libraries, drivers, and hardware.

**Zero GPU Utilization**
- Check that your software is GPU-enabled. Only code that has been explicitly written to use GPUs can take advantage of them.
- Make sure your software environment is properly configured. In some cases, certain libraries must be available for your code to run on GPUs. Check your dependencies, your version of the CUDA Toolkit, and your software environment requirements.
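
As a first sanity check of the environment, a small script can report whether the usual pieces are present. This is a minimal sketch: the helper name `gpu_environment_report` and the list of packages are assumptions chosen for this chapter, and the check only confirms that the packages can be found, not that your code actually runs on the GPU.

```python
import importlib.util
import shutil


def gpu_environment_report(packages=("cupy", "numba", "nvtx")):
    """Return a dict of simple presence checks for a GPU Python environment."""
    # nvidia-smi ships with the NVIDIA driver, so finding it suggests a driver install.
    report = {"nvidia-smi on PATH": shutil.which("nvidia-smi") is not None}
    for name in packages:
        # find_spec locates a top-level package without importing it.
        report[f"{name} importable"] = importlib.util.find_spec(name) is not None
    return report


report = gpu_environment_report()
for check, ok in report.items():
    print(f"{check}: {'yes' if ok else 'NO'}")
```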
 
**Low GPU Utilization** (e.g. less than ~15%)
- You may be using more GPUs than necessary. You can find the optimal number of GPUs and CPU cores by performing a scaling analysis.
- Check your process's throughput. If you are writing output to slow memory, making unnecessary copies, or frequently switching between your CPU and GPU, you may see low utilization.
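
A scaling analysis boils down to timing the same job at several GPU counts and comparing speedup against the ideal. The sketch below assumes you have already collected wall-clock times; the `scaling_table` helper and the timings are hypothetical and only illustrate the arithmetic.

```python
def scaling_table(timings):
    """timings maps GPU count -> wall-clock seconds for the same workload."""
    base_gpus = min(timings)
    base_time = timings[base_gpus]
    rows = []
    for gpus in sorted(timings):
        speedup = base_time / timings[gpus]
        # Efficiency of 1.0 means perfect scaling relative to the smallest run.
        efficiency = speedup / (gpus / base_gpus)
        rows.append((gpus, speedup, efficiency))
    return rows


measured = {1: 100.0, 2: 55.0, 4: 40.0}  # hypothetical timings in seconds

rows = scaling_table(measured)
for gpus, speedup, efficiency in rows:
    print(f"{gpus} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

When efficiency drops sharply from one GPU count to the next, the extra GPUs are mostly idle and the smaller allocation is usually the better choice.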

**Memory Errors**
- Access violation errors. Reading or writing memory locations that are not permitted can result in unpredictable behavior and system crashes.
- Memory leaks. When memory is allocated but not correctly deallocated, the application consumes GPU memory without using it, and that memory is not available for further computation.


## Debugging and Profiling CUDA Python
Debugging and analyzing memory issues is essential to creating accelerated Python applications that take full advantage of the optimizations available through CUDA.

### External Tools
**Nsight**

NVIDIA Nsight™ Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs, from large servers to our smallest systems-on-a-chip (SoCs).

This suite of tools offers an array of interactive as well as command-line tools.  They can detect and provide insight into kernel execution and memory issues.  Ultimately, we need memory usage to align with the GPU hardware preferences.  When these two components are out of alignment, applications may use non-performant access or execution patterns.

![NVCube](images/chapter-09/nvidia-developer-tools-1070x400.svg)

# Getting Started with Nsight Systems

## Pre-requisites

How to get set up:
- Installing the tools
- Installing nvtx
- Installing cuda-python for the profiler start/stop APIs

## Profiling with Nsight Systems

[Nsight Systems](https://developer.nvidia.com/nsight-systems) is a platform profiling tool designed to give users a high-level, time-correlated view of the performance activity of their entire platform. This includes CPU, GPU, Memory, Networking, OS and application-level metrics. It helps identify the largest opportunities to optimize, and tune to scale efficiently across all available resources. This tutorial will only scratch the surface of what Nsight Systems is capable of. For full details see the [documentation](https://docs.nvidia.com/nsight-systems/).


## Setting up a profile with the Nsight Systems GUI

The first thing to do in the GUI is select the target machine for profiling. This can be the local machine or a remote server; this example uses the local target. To profile a Python workload with Nsight Systems, set the “Command line with arguments:” field to the Python interpreter followed by the Python file to run, including any arguments. Make sure the Python executable belongs to an environment with all the necessary dependencies for the application, for example a Conda environment. For example: `C:\Users\myusername\AppData\Local\miniconda3\python.exe C:\Users\myusername\cupyTests\cupyProfilingStep1.py`

Also fill in the “Working directory” as appropriate. 

**Recommended settings/flags**

A good initial set of options for profiling Python includes:
- Collect CPU context switch trace
- Collect CUDA trace
- Collect GPU metrics
- Python profiling options:
  - Collect Python backtrace samples

You can learn more about all the options in the [Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#profiling-from-the-gui).
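
The same kind of profile can also be collected from the command line with the `nsys` CLI, which is convenient for scripted or remote runs. The sketch below only builds and prints a command; the helper name, script name, and output name are hypothetical, and the option spellings (`--trace`, `--python-sampling`, `--output`) are assumptions that should be checked against `nsys profile --help` for your installed version, since options differ between releases.

```python
import shlex
import sys


def build_nsys_command(script, script_args=(), output="cupy_profile"):
    """Assemble an `nsys profile` command line for a Python script (not executed here)."""
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx",        # assumed: trace CUDA API/kernels and NVTX ranges
        "--python-sampling=true",   # assumed: collect Python backtrace samples
        f"--output={output}",       # base name of the generated report file
        sys.executable, script, *script_args,
    ]


cmd = build_nsys_command("cupyProfilingStep1.py")
print(shlex.join(cmd))
```

To actually launch the profile, the list can be passed to `subprocess.run(cmd, check=True)` from an environment where `nsys` is on the PATH.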

# CuPy Profiling Example

In this example, we create two CuPy arrays, sort one of them, and then take the dot product of the two.

```python
import cupy as cp


def create_array(x, y):
    # Random float32 matrix allocated on the GPU.
    return cp.random.random((x, y), dtype=cp.float32)


def sort_array(a):
    # cp.sort sorts along the last axis by default.
    return cp.sort(a)


def run_program():
    print("init step...")
    arr1 = create_array(10_000, 10_000)
    arr2 = create_array(10_000, 10_000)

    print("sort step...")
    arr1 = sort_array(arr1)

    print("dot step...")
    arr3 = cp.dot(arr1, arr2)

    print("done")


if __name__ == '__main__':
    run_program()
```


**Step 1 - Profiling a CuPy workload**

First, run an initial profile of this CuPy sample using the setup and flags described above. Once the profile completes, zoom in to the active portion of the Python thread. The activity after that portion is not part of the workload itself (more on that later). If you hover over a sample in the Python Backtrace row, you will see the call stack that was executing at that moment.

![cupy1](images/chapter-09/cupy-profiling-1.png)

CuPy will call CUDA kernels under the hood as it executes. Nsight Systems will automatically detect these. Expand the **CUDA HW** row to see where the kernels are scheduled.

![cupy2](images/chapter-09/cupy-profiling-2.png)

Look at the **GPU Metrics > GPU Active** and **SM Instructions** rows to verify that the GPU is being used. You can hover over a spot in these rows to see the utilization percentage.

![cupy3](images/chapter-09/cupy-profiling-3.png)

**Step 2 - Adding nvtx**

Nsight Systems can automatically detect CUDA kernels as well as APIs from many other frameworks and libraries. Additionally, the NVTX annotation module (`nvtx`) lets users mark up their own applications so that custom trace events and ranges appear on the timeline. The file <> adds NVTX to the CuPy application, with colored ranges defined around the various phases of the workload. Run a profile of this new version to see the NVTX ranges on the timeline.

The **NVTX** row for the CPU thread of the Python process shows when the CPU is inside one of these ranges. The **NVTX** row under the CUDA HW section shows when these ranges are active on the GPU. Notice that they are not exactly lined up because of GPU execution scheduling. You can also see how the CUDA kernels map to these various nvtx ranges that represent the phases of our workload.
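
For reference, this is roughly what NVTX annotation looks like in Python. It is a minimal sketch rather than the annotated file referred to above: the phases use plain Python lists as stand-ins for the CuPy calls, the range names and colors are arbitrary choices, and a no-op fallback is defined so the snippet still runs where the `nvtx` package is not installed. `nvtx.annotate` can be used both as a decorator and as a context manager.

```python
import random
from contextlib import ContextDecorator

try:
    import nvtx
    annotate = nvtx.annotate
except ImportError:
    class annotate(ContextDecorator):
        """No-op stand-in so the sketch still runs without the nvtx package."""

        def __init__(self, message=None, color=None):
            self.message = message
            self.color = color

        def __enter__(self):
            return self

        def __exit__(self, *exc_info):
            return False


@annotate("create_data", color="blue")  # decorator form: one range per call
def create_data(n):
    return [random.random() for _ in range(n)]


def run_program(n=1000):
    # Context-manager form: one colored range per phase of the workload.
    with annotate("init step", color="green"):
        a = create_data(n)
        b = create_data(n)
    with annotate("sort step", color="yellow"):
        a = sorted(a)
    with annotate("dot step", color="red"):
        result = sum(x * y for x, y in zip(a, b))
    return a, result


sorted_a, dot = run_program()
print("dot product:", dot)
```

In the CuPy application, the same decorators and `with` blocks would wrap the `create_array`, `sort_array`, and `cp.dot` calls instead of the list operations used here.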

In this particular example, we can see in the **GPU Metrics > SM Instructions > Tensor Active** row that the Tensor cores on the GPU are not active while the kernels are running. Tensor cores can add a lot of performance to computation-intensive kernels. The next step will be to get them active. 

![cupy4](images/chapter-09/cupy-profiling-4.png)

**Step 3 - Enabling Tensor cores** 

The [CuPy documentation](https://docs.cupy.dev/en/stable/reference/environment.html#envvar-CUPY_TF32) describes how to enable Tensor cores with an environment variable. <file> adds the following line:
- `os.environ["CUPY_TF32"] = "1"`
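
A minimal sketch of where that line can go is shown here. Placing it before the CuPy import is an assumption made for safety, so that the setting is in place before any matrix product runs; the `try`/`except` only exists so the snippet runs on machines without CuPy. Keep in mind that TF32 trades some floating-point precision for speed, so results may differ slightly from full FP32.

```python
import os

# Request TF32 Tensor core math for float32 matrix products.
# Set it before CuPy is imported so it is in effect for all later calls.
os.environ["CUPY_TF32"] = "1"

try:
    import cupy as cp  # the rest of the application would use cp as before
except ImportError:
    cp = None  # CuPy not installed; the environment variable is still set

print("CUPY_TF32 =", os.environ["CUPY_TF32"])
```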

Run another Nsight Systems profile to see the activity of the Tensor cores with this version.

![cupy5](images/chapter-09/cupy-profiling-5.png)

**Notice** that the Tensor cores are now used during the dot product and that the runtime of the dot range on the GPU is shorter: 312 ms -> 116 ms, roughly a 2.7x speedup.


**Step 4 - Using an annotation file**

Nsight Systems can also automatically trace specific functions from Python modules, in this case CuPy, with an annotation file. This example points to the file “cupy_annotations.json”, which contains:
```
[
    {
        "_comment": "CuPy Annotations",
        "module": "cupy",
        "color": "black",
        "functions": ["random.random", "dot", "sort"]
    }
]
```
This JSON object indicates that the functions “random.random”, “dot”, and “sort” from the module “cupy” should be traced and displayed as black ranges on the timeline. Add this file to the “Python Functions trace” field in the configuration as shown below.

![cupy6](images/chapter-09/cupy-profiling-6.png)
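
If you prefer to generate the annotation file from a script, for example to keep the list of traced functions next to your code, the JSON described above can be written with the standard library. This is a small sketch; the `write_annotations` helper is hypothetical, and the temporary directory is used only so the example does not write into your project.

```python
import json
import os
import tempfile

# Same content as the annotation file described above.
annotations = [
    {
        "_comment": "CuPy Annotations",
        "module": "cupy",
        "color": "black",
        "functions": ["random.random", "dot", "sort"],
    }
]


def write_annotations(path, entries):
    """Write the annotation entries as JSON and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=4)
    return path


out_path = write_annotations(
    os.path.join(tempfile.gettempdir(), "cupy_annotations.json"), annotations
)

# Read the file back to confirm it is valid JSON.
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(out_path, "->", loaded[0]["functions"])
```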

Run another profile to see the automatic tracing. Note that the second random call is much shorter than the first, possibly because the first call includes one-time costs such as kernel compilation and caching.

![cupy7](images/chapter-09/cupy-profiling-7.png)


# Numba Profiling Example

**Nsight Systems** shows platform-wide profile information and some GPU-specific data, like GPU metrics, but it does not dive deep into the GPU kernels themselves. That’s where Nsight Compute comes in. Nsight Compute does detailed performance analysis of kernels as they run on the GPU. Historically, these have been written in native languages like C, but new technologies like Numba are enabling Python developers to write kernels as well. This section will describe how to profile Numba kernels with Nsight Compute. For an overview of Nsight Compute, check out <>.


**Setting up a profile with the Nsight Compute GUI**

To profile a Numba application with Nsight Compute, open the “Connect” dialog from the GUI. Select the Python interpreter binary as the “Application Executable”. Ensure this interpreter belongs to an environment with all the necessary dependencies for the application, for example the Conda environment that provides Numba. Then fill in the “Working Directory” field and put your Python file and any additional command line arguments in the “Command Line Arguments” field. This tells Nsight Compute how to launch your workload for profiling.

![numba1](images/chapter-09/numba-debug-1.png)

**Recommended settings/flags**

Nsight Compute has a lot of options to configure your profile. This guide isn’t designed to cover all of them, but there is a lot of additional information in the <documentation> and <online collateral>. A good starting point for Numba profiling is to choose the “Profile” activity. In the Filter > Kernel Base Name dropdown, select “Demangled”. Under Other, set Enable CPU Call Stack to Yes and set CPU Call Stack Types to All or Python.

The “Metrics” tab is where you will choose what performance metrics to collect. The metrics are grouped into sets, and the detailed set is a good starting point. You can learn more about the metrics in the <kernel profiling guide>. After updating these settings, click “Launch” to start the automatic profiling process. Nsight Compute will profile each kernel it encounters via a multi-pass replay mechanism and will report the results once complete.


### Sample Nsight Compute Profile Walkthrough

In this simple example, there is a Numba kernel doing vector addition. It takes in three vectors, adds two together, and returns the sum in the third vector. Notice that the `@cuda.jit` decorator is given the parameter `lineinfo=True`. This is important for resolving kernel performance data to lines of source code. With the setup described above, launch a profile to see the performance of the kernel. When the profile completes, the Summary page shows an overview of the kernels profiled. In this example, there is only one. Expanding the “Demangled Name” column shows that this is the “vecadd” kernel that we wrote with Numba. The Summary has some basic information, including the kernel duration and the compute and memory throughput. It also lists the top performance rules that were triggered and the estimated speedups for correcting them.

![numba2](images/chapter-09/numba-debug-2.png)

Double clicking on the kernel will open the Details page with much more information.

The “GPU Speed of Light Throughput” section at the top shows that this kernel has much higher Memory usage than Compute. The Memory Workload Analysis section shows significant traffic to device memory. 

![numba3](images/chapter-09/numba-debug-3.png)

The Compute Workload Analysis section shows the majority of the compute is using the FP64 pipeline. 

![numba4](images/chapter-09/numba-debug-4.png)

The Source Counters section at the bottom shows the source locations with the most stalls and clicking on one opens the Source page. 

![numba5](images/chapter-09/numba-debug-5.png)

Since this was a very simple kernel, most of the stalls are on the addition statement, but with more complex kernels, this level of detail is invaluable. Additionally, the Context page will show the CPU call stack that led to this kernel being executed. 

![numba6](images/chapter-09/numba-debug-6.png)


For this example, we did not specify a data type when creating the arrays, so NumPy defaulted to FP64. This caused an unintended increase in memory traffic. After manually switching to the FP32 data type and rerunning the profile, we can see that the runtime of the kernel decreased significantly, as did the memory traffic. Setting the initial result as the Baseline <link on how to do that> and opening the new result will automatically compare the two.

![Img7](images/chapter-09/numba-debug-7.png)
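
The effect of the data type on memory traffic can be estimated with simple arithmetic. The sketch below assumes the idealized case where each element of the two inputs is read once and each element of the output is written once; the helper name is hypothetical and real traffic reported by the profiler will differ somewhat.

```python
def vecadd_traffic_bytes(n, itemsize):
    """Idealized device-memory traffic for c = a + b over n elements."""
    # Two reads (a and b) plus one write (c) per element.
    return 3 * n * itemsize


N = 500_000  # vector length used in the sample kernel

fp64 = vecadd_traffic_bytes(N, 8)  # float64 uses 8 bytes per element
fp32 = vecadd_traffic_bytes(N, 4)  # float32 uses 4 bytes per element

print(f"FP64: {fp64 / 1e6:.1f} MB, FP32: {fp32 / 1e6:.1f} MB")
```

Because vector addition is limited by memory bandwidth rather than compute, halving the bytes moved is the main reason the FP32 version runs faster.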


Nsight Compute has an abundance of performance data and built-in expertise. Each section on the Details page has detailed information for a particular category of metrics, including Guided Analysis rules and descriptions. The best way to learn about all these features is to try them out on your workload and use the documentation and collateral to assist.

Sample code below:

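```python
import numpy as np
from numba import cuda


@cuda.jit(lineinfo=True)
def vecadd(a, b, c):
    tid = cuda.grid(1)
    size = len(c)
    # Guard against threads that fall past the end of the arrays.
    if tid < size:
        c[tid] = a[tid] + b[tid]


def run_program():
    np.random.seed(1)

    N = 500000

    # np.random.random returns float64 (FP64) values by default.
    a = cuda.to_device(np.random.random(N))
    b = cuda.to_device(np.random.random(N))
    # FP32 variant used for the comparison profile:
    #a = cuda.to_device(np.float32(np.random.random(N)))
    #b = cuda.to_device(np.float32(np.random.random(N)))
    c = cuda.device_array_like(a)

    # forall chooses a launch configuration that covers len(a) elements.
    vecadd.forall(len(a))(a, b, c)
    print(c.copy_to_host())


if __name__ == '__main__':
    run_program()
```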
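
The sample relies on `forall` to pick the launch configuration. If you want to choose it yourself, the arithmetic looks like the sketch below; the block size of 256 threads is an assumed, commonly used value rather than a tuned one, and the helper name is hypothetical.

```python
import math


def launch_config(n, threads_per_block=256):
    """Return (blocks, threads_per_block) so that blocks * threads covers n elements."""
    blocks = math.ceil(n / threads_per_block)
    return blocks, threads_per_block


n = 500_000
blocks, threads = launch_config(n)

# The grid is rounded up, so a few threads have no element to process.
idle_threads = blocks * threads - n
print(f"{blocks} blocks x {threads} threads, {idle_threads} idle threads")
```

With Numba, such a configuration would be passed as `vecadd[blocks, threads](a, b, c)`. The rounding up is also why the kernel needs the `if tid < size` bounds check.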