# Chapter 10: Developer Tools

## CUDA Python Performance

In order to achieve optimal performance in CUDA, you must consider several factors:
- Localizing memory access in order to minimize memory latency.
- Maximizing the number of active threads per multiprocessor to ensure high utilization of your hardware.
- Minimizing conditional branching.

In order to overcome the bottleneck between CPU and GPU across the PCIe bus, we want to:
- Minimize the volume of data transferred.  Transferring data in large batches can minimize the number of data transfer operations.
- Organize data in a way that complements the hardware architecture.
- Utilize asynchronous transfer features that will allow computation and data transfer to occur simultaneously.  Overlapping data transfers with computation can hide latencies caused by data transfers.

[Nsight Systems](https://developer.nvidia.com/nsight-systems) and [Nsight Compute](https://developer.nvidia.com/nsight-compute) are the tools used to detect bottlenecks and performance flaws in CUDA code.

## CUDA Python Correctness

CUDA code can sometimes introduce errors that are not detected by the compiler, such as:
- Memory access violations
- Memory leaks
- Data race conditions
- Incorrect API usage

These errors can lead to incorrect program behavior, crashes, or performance degradation. [Compute Sanitizer](https://developer.nvidia.com/compute-sanitizer) is a suite of runtime error detection tools provided by NVIDIA to help developers identify and debug such issues in CUDA applications.

## Common Pitfalls
The most common mistake is running CPU-only code on a GPU node. Only code that has been explicitly written to run on a GPU can take advantage of one. Ensure your code uses the correct GPU-accelerated libraries, drivers, and hardware.

**Zero GPU Utilization**
- Check that your software is GPU-enabled. Only code that has been explicitly written to use GPUs can take advantage of them.
- Make sure your software environment is properly configured. In some cases, certain libraries must be available for your code to run on GPUs. Check your dependencies, your CUDA Toolkit version, and your software environment requirements.

**Low GPU Utilization** (e.g. less than ~15%)
- You may be using more GPUs than necessary. You can find the optimal number of GPUs and CPU cores by performing a scaling analysis.
- Check your process’s throughput. If you are writing output to slow memory, making unnecessary copies, or switching between your CPU and GPU, you may see low utilization.

**Memory Errors**
- Access violation errors. Reading from or writing to memory locations that the program is not permitted to access can result in unpredictable behavior and system crashes.
- Memory leaks. When memory is allocated but not correctly deallocated, the application holds GPU memory without using it, and that memory is not available for further computation.


# Getting Started with Developer Tools for CUDA Python

## Pre-requisites

The steps in this document assume the user has an environment capable of running CuPy and Numba code on a GPU. See those respective projects for setup instructions. The following tools are also used:

- [Nsight Systems](https://developer.nvidia.com/nsight-systems) (also available in the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit))
- [Nsight Compute](https://developer.nvidia.com/nsight-compute) (also available in the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit))
- [Compute Sanitizer](https://developer.nvidia.com/compute-sanitizer) (also available in the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit))
- [nvtx Python bindings](https://pypi.org/project/nvtx/)  




## Profiling with Nsight Systems

[Nsight Systems](https://developer.nvidia.com/nsight-systems) is a platform profiling tool designed to give users a high-level, time-correlated view of the performance activity of their entire platform. This includes CPU, GPU, Memory, Networking, OS and application-level metrics. It helps identify the largest opportunities to optimize, and tune to scale efficiently across all available resources. This tutorial will only scratch the surface of what Nsight Systems is capable of. For full details see the [documentation](https://docs.nvidia.com/nsight-systems/).

## Setting up a profile with the Nsight Systems GUI

After opening the Nsight Systems GUI, select the target machine for profiling. This can be the local machine or a remote server. This example uses the local target. To profile a Python workload with Nsight Systems, set the “Command line with arguments:” field to point to the Python interpreter and the Python file to run, including any arguments. Make sure the Python executable is in an environment with all the necessary dependencies for the application. For example: “C:\Users\myusername\AppData\Local\miniconda3\python.exe C:\Users\myusername\cupyTests\cupyProfilingStep1.py \<args if needed\>”

Also fill in the “Working directory” where the Python executable should run. 

**Recommended settings/flags**

A good initial set of flags for profiling Python includes:
- Collect CPU context switch trace
- Collect CUDA trace
- Collect GPU metrics
- Python profiling options:
  - Collect Python backtrace samples

You can learn more about all the options in the [Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#profiling-from-the-gui).

# CuPy Profiling Example

In this example, we create two CuPy arrays, sort one of them, and take the dot product of the two.

In [None]:
import sys
import cupy as cp


def create_array(x, y) :
    return cp.random.random((x, y),dtype=cp.float32)

def sort_array(a) :
    return cp.sort(a)

def run_program() :
    print("init step...")
    arr1 = create_array(10_000, 10_000)
    arr2 = create_array(10_000, 10_000)

    print("sort step...")
    arr1 = sort_array(arr1)

    print("dot step...")
    arr3 = cp.dot(arr1, arr2)
    
    print("done")
    return

if __name__ == '__main__':

    run_program()


**Step 1 - Profiling a CuPy workload**

First, run an initial profile of this CuPy sample using the setup and flags described above. If launching a profile from the GUI is not an option, a profile can also be launched from the command line. An example CLI command to run this analysis is below. Some flags may vary depending on your specific setup.

*nsys profile --gpu-metrics-device=all --python-sampling=true --python-sampling-frequency=1000 --trace=cuda --cpuctxsw=process-tree python "/home/myusername/cupytest1.py"*
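If profiles are launched from scripts, one way to keep the flag set consistent is to assemble the command in Python. The sketch below only builds and prints the command shown above. The helper name `build_nsys_command` and its parameters are hypothetical, and the flags are copied from the example command rather than being a required or exhaustive set.

```python
import shlex


def build_nsys_command(script, trace=("cuda",), sampling_hz=1000):
    """Assemble the example nsys command line as a list of arguments.

    Hypothetical helper for illustration; the flags mirror the example
    command in this section and may need adjusting for your setup.
    """
    return [
        "nsys", "profile",
        "--gpu-metrics-device=all",
        "--python-sampling=true",
        f"--python-sampling-frequency={sampling_hz}",
        "--trace=" + ",".join(trace),
        "--cpuctxsw=process-tree",
        "python", script,
    ]


cmd = build_nsys_command("/home/myusername/cupytest1.py")
# Print a copy-pasteable form; pass `cmd` to subprocess.run to launch it.
print(shlex.join(cmd))
```

Passing `trace=("nvtx", "cuda")` yields the `--trace=nvtx,cuda` variant used once nvtx ranges are added to the application.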


Once the profile completes, find the Python process thread under the **Processes** row on the timeline. Zoom in to the active portion of the Python thread by left-clicking and dragging across the area of interest to select it. Then right-click to "Zoom into selection". If you hover over a sample in the **Python Backtrace** row, a popup will appear with the call stack that was currently executing when the sample was taken.

![cupy1](images/chapter-10/cupy-profiling-1.png)

CuPy will call CUDA kernels under the hood as it executes. Nsight Systems will automatically detect these. Expand the **CUDA HW** row to see where the kernels are scheduled.

![cupy2](images/chapter-10/cupy-profiling-2.png)

Look at the **GPU Metrics > GPU Active** and **SM Instructions** rows to verify that the GPU is being used. You can hover over a spot in this row to see the % Utilization.

![cupy3](images/chapter-10/cupy-profiling-3.png)

**Step 2 - Adding nvtx**

Nsight Systems can automatically detect CUDA kernels as well as APIs from many other frameworks and libraries. Additionally, the [nvtx](https://github.com/NVIDIA/NVTX) annotation module lets users mark up their own applications to see custom trace events and ranges on the timeline. The [nvtx Python module](https://pypi.org/project/nvtx/) is available through pip and can be installed with the command:

*pip install nvtx*

The code below adds nvtx to the CuPy application, with colored ranges defined around various phases of the workload. Run a profile of this new version to see nvtx on the timeline. If using the CLI, update the flag to "--trace=nvtx,cuda".



In [2]:
import sys
import cupy as cp
import nvtx

def create_array(x, y) :
    return cp.random.random((x, y),dtype=cp.float32)

def sort_array(a) :
    return cp.sort(a)

def run_program() :
    print("init step...")
    nvtx.push_range("init_step", color='green')
    arr1 = create_array(10_000, 10_000)
    arr2 = create_array(10_000, 10_000)
    nvtx.pop_range()

    print("sort step...")
    nvtx.push_range("sort_step", color='yellow')
    arr1 = sort_array(arr1)
    nvtx.pop_range()

    nvtx.push_range("dot_step", color='magenta')
    print("dot step...")
    arr3 = cp.dot(arr1, arr2)
    nvtx.pop_range()
    
    print("done")
    return

if __name__ == '__main__':
    
    nvtx.push_range("run_program", color='white')
    run_program()
    nvtx.pop_range()

init step...
sort step...
dot step...
done


The **NVTX** row for the CPU thread of the Python process shows when the CPU is inside one of these ranges. The **NVTX** row under the CUDA HW section shows when these ranges are active on the GPU. Notice that they are not exactly lined up because of GPU execution scheduling. You can also see how the CUDA kernels map to these various nvtx ranges that represent the phases of our workload.
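The push/pop pairs in the code above can also be written with `nvtx.annotate`, which the nvtx Python module provides as both a context manager and a decorator, so a range is closed even if the enclosed code raises. The sketch below is a minimal illustration with placeholder work standing in for the CuPy calls. The `ImportError` fallback and the function bodies are assumptions added so the snippet also runs where nvtx is not installed.

```python
import contextlib

try:
    import nvtx
    annotate = nvtx.annotate
except ImportError:
    # Assumption for illustration: fall back to a no-op when nvtx is absent.
    class annotate(contextlib.ContextDecorator):
        def __init__(self, message=None, color=None):
            self.message = message

        def __enter__(self):
            return self

        def __exit__(self, *exc):
            return False


@annotate("init_step", color="green")
def init_step(n):
    # Placeholder for the array creation in the example above.
    return list(range(n, 0, -1))


def run_program():
    data = init_step(5)
    with annotate("sort_step", color="yellow"):
        # Placeholder for the sort; the range ends when the block exits.
        data = sorted(data)
    return data


print(run_program())
```

The decorator form suits ranges that cover a whole function, while the `with` form suits a phase inside a larger function.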

In this particular example, we can see in the **GPU Metrics > SM Instructions > Tensor Active** row that the Tensor cores on the GPU are not active while the kernels are running. Tensor cores can add a lot of performance to computation-intensive kernels. The next step will be to get them active. 

![cupy4](images/chapter-10/cupy-profiling-4.png)

**Step 3 - Enabling Tensor cores** 

The [CuPy documentation](https://docs.cupy.dev/en/stable/reference/environment.html#envvar-CUPY_TF32) describes how to enable Tensor cores for 32-bit floating-point computation with the CUPY_TF32 environment variable. The version of the example below adds the following line:
- os.environ["CUPY_TF32"] = "1"

Run another Nsight Systems profile to see the activity of the Tensor cores with this version.
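As an alternative to editing the script, the variable can be supplied through the environment of the launched process. The sketch below uses only the standard library: it starts a child Python process with `CUPY_TF32=1` and has the child echo the value back. The child command is a stand-in chosen for the example; in practice it would be your own `python` or `nsys` command line.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable for the child only.
env = dict(os.environ, CUPY_TF32="1")

# Stand-in child process: it just reports what it sees in its environment.
child = [sys.executable, "-c", "import os; print(os.environ.get('CUPY_TF32'))"]
result = subprocess.run(child, env=env, capture_output=True, text=True, check=True)

seen_by_child = result.stdout.strip()
print(seen_by_child)
```

Setting the variable this way leaves the script source unchanged, which makes it easier to compare profiles with and without it.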




In [None]:
import sys
import cupy as cp
import nvtx
import os



def create_array(x, y) :
    return cp.random.random((x, y),dtype=cp.float32)

def sort_array(a) :
    return cp.sort(a)

def run_program() :
    print("init step...")
    nvtx.push_range("init_step", color='green')
    arr1 = create_array(10_000, 10_000)
    arr2 = create_array(10_000, 10_000)
    nvtx.pop_range()

    print("sort step...")
    nvtx.push_range("sort_step", color='yellow')
    arr1 = sort_array(arr1)
    nvtx.pop_range()

    nvtx.push_range("dot_step", color='magenta')
    print("dot step...")
    arr3 = cp.dot(arr1, arr2)
    nvtx.pop_range()
    
    print("done")
    return

if __name__ == '__main__':
    os.environ["CUPY_TF32"] = "1"
    nvtx.push_range("run_program", color='white')
    run_program()
    nvtx.pop_range()

![cupy5](images/chapter-10/cupy-profiling-5.png)

**Notice** that the Tensor cores are now being used during the dot product and that the runtime of the dot range on the GPU is shorter: 312ms -> 116ms, roughly a 2.7x speedup.

**Step 4 - Using an Annotation File**

Nsight Systems can also automatically trace specific functions from Python modules, in this case CuPy, with an annotation file. This example points to the file “cupy_annotations.json” which contains:
```
[
    {
        "_comment": "CuPy Annotations",
        "module": "cupy",
        "color": "black",
        "functions": ["random.random", "dot", "sort"]
    }
]
```
This JSON object indicates that the functions “random.random”, “dot”, and “sort” from the module “cupy” should be traced and displayed as a black range on the timeline. Add this file to the “Python Functions trace” field in the configuration as shown below.

![cupy6](images/chapter-10/cupy-profiling-6.png)

To do this from the CLI, add a flag like *--python-functions-trace="/home/myusername/cupy_annotations.json"*. Then run another profile to see the automatic tracing.
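When the list of traced functions changes often, the annotation file can be generated instead of edited by hand. The sketch below writes the same structure shown above with the standard `json` module and reads it back to confirm that it parses. The temporary output directory is an assumption made for the example; any path that can be passed to Nsight Systems would do.

```python
import json
import os
import tempfile

# Same structure as the annotation file shown above.
annotations = [
    {
        "_comment": "CuPy Annotations",
        "module": "cupy",
        "color": "black",
        "functions": ["random.random", "dot", "sort"],
    }
]

# Written to a temporary directory here; use any path you can pass to nsys.
path = os.path.join(tempfile.mkdtemp(), "cupy_annotations.json")
with open(path, "w") as f:
    json.dump(annotations, f, indent=4)

# Read it back to confirm the file parses as valid JSON.
with open(path) as f:
    loaded = json.load(f)
print(loaded[0]["module"], loaded[0]["functions"])
```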

![cupy7](images/chapter-10/cupy-profiling-7.png)


# Numba Profiling Example

While Nsight Systems shows platform-wide profile information and some GPU-specific data, like GPU metrics, it does not dive deep into the GPU kernels themselves. That’s where [Nsight Compute](https://developer.nvidia.com/nsight-compute) comes in. Nsight Compute does detailed performance analysis of kernels as they run on the GPU. Historically, these have been written in native languages like C, but new technologies like Numba are enabling Python developers to write kernels as well. This section will describe how to profile Numba kernels with Nsight Compute. For details on Nsight Compute, check out the [Nsight Compute Documentation](https://docs.nvidia.com/nsight-compute/).


**Setting up a profile with the Nsight Compute GUI**

To profile a Numba application with Nsight Compute, open the “Connect” dialog from the GUI. Select the Python interpreter binary as the “Application Executable”. Ensure this interpreter runs in the environment with all the necessary dependencies for the application, for example the Conda shell supporting Numba. Then fill in the “Working Directory” field and put your Python file and any additional command line arguments in the “Command Line Arguments” field. This tells Nsight Compute how to launch your workload for profiling.

![numba1](images/chapter-10/numba-profiling-1.png)

**Recommended settings/flags**

Nsight Compute has a lot of options to configure your profile. This guide isn’t designed to cover all of them, but there is a lot of additional information in the [documentation](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#command-line-options). A good starting point for Numba profiling is to choose the **Profile** activity. In the **Filter > Kernel Base Name** dropdown, select “Demangled”. Under **Other > Enable CPU Call Stack** select Yes, and under **Other > CPU Call Stack Types** select All or Python.

The **Metrics** tab is where you will choose what performance metrics to collect. The metrics are grouped into sets, and the detailed set is a good starting point. You can learn more about the metrics in the [kernel profiling guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html). After updating these settings, click **Launch** to start the automatic profiling process. Nsight Compute will profile each kernel it encounters via a multi-pass replay mechanism and will report the results once complete. If profiling from the GUI is not an option, you can configure a profile in the GUI and copy the appropriate command from the "Command Line:" field in the **Common** tab. An example command for this profile might be:

*ncu --config-file off --export "/home/myusername/r%i" --force-overwrite --launch-count 3 --set detailed --call-stack --call-stack-type native --call-stack-type python --nvtx --import-source yes python /home/myusername/numbaTest1.py*


### Sample Nsight Compute Profile Walkthrough

In this simple example, there is a Numba kernel doing vector addition. It takes in three vectors, adds two together, and returns the sum in the third vector. Notice that the "@cuda.jit" decorator has the parameter “(lineinfo=True)”. This is important for resolving kernel performance data to lines of source code. With the setup described above, launch a profile to see the performance of the kernel. 


In [None]:
import numpy as np
from numba import cuda
from numba import config as numba_config
numba_config.CUDA_ENABLE_PYNVJITLINK = True


@cuda.jit(lineinfo=True)
def vecadd(a, b, c):
    tid = cuda.grid(1)
    size = len(c)
    if tid < size:
        c[tid] = a[tid] + b[tid]

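def run_program():
    np.random.seed(1)

    N = 500000

    # np.random.random returns float64 (FP64) values by default
    a = cuda.to_device(np.random.random(N))
    b = cuda.to_device(np.random.random(N))
    #a = cuda.to_device(np.float32(np.random.random(N)))
    #b = cuda.to_device(np.float32(np.random.random(N)))
    c = cuda.device_array_like(a)

    # forall chooses a launch configuration that covers len(a) elements
    vecadd.forall(len(a))(a, b, c)
    print(c.copy_to_host())

if __name__ == '__main__':
    run_program()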

When the profile completes, the **Summary** page shows an overview of the kernels profiled. In this example, it’s only one. Expanding the “Demangled Name” column shows that this is the “vecadd” kernel that we wrote with Numba. The Summary has some basic information including the kernel duration and compute and memory throughput. It also lists top performance rules that were triggered and estimated speedups for correcting them. 

![numba2](images/chapter-10/numba-profiling-2.png)

Double clicking on the kernel will open the **Details** page with much more information.

The “GPU Speed of Light Throughput” section at the top shows that this kernel has much higher Memory usage than Compute. The Memory Workload Analysis section shows significant traffic to device memory. 

![numba3](images/chapter-10/numba-profiling-3.png)

The Compute Workload Analysis section shows the majority of the compute is using the FP64 pipeline. 

![numba4](images/chapter-10/numba-profiling-4.png)

The Source Counters section at the bottom shows the source locations with the most stalls and clicking on one opens the **Source** page. 

![numba5](images/chapter-10/numba-profiling-5.png)

Since this was a very simple kernel, most of the stalls are on the addition statement, but with more complex kernels, this level of detail is invaluable. Additionally, the **Context** page will show the CPU call stack that led to this kernel being executed. 

![numba6](images/chapter-10/numba-profiling-6.png)

For this example, we did not specify the data type in NumPy, which defaults to FP64. This unintentionally increased memory traffic. To switch to the FP32 data type, change these lines:
    
    a = cuda.to_device(np.random.random(N))
    b = cuda.to_device(np.random.random(N))
    
to this:

    a = cuda.to_device(np.float32(np.random.random(N)))
    b = cuda.to_device(np.float32(np.random.random(N)))

After switching to the FP32 data type and rerunning the profile, we can see that the runtime of the kernel decreased significantly, as did the memory traffic. Setting the initial result as the [Baseline](https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#id7) and opening the new result will automatically compare the two. Notice that the FP64 usage has disappeared and the kernel has sped up from 59us to 33us.

![Img7](images/chapter-10/numba-profiling-7.png)
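A back-of-the-envelope estimate shows why the data type matters for this kernel. The sketch below assumes each element of `a` and `b` is read once and each element of `c` is written once, and it ignores caching and other hardware effects, so it estimates a lower bound on traffic rather than the figures Nsight Compute reports. The helper name `min_traffic_bytes` is hypothetical.

```python
import struct

N = 500000  # vector length used in the example above

# Element sizes in bytes; these match np.float64 and np.float32.
FP64_BYTES = struct.calcsize("d")
FP32_BYTES = struct.calcsize("f")


def min_traffic_bytes(n, itemsize):
    """Two loads (a, b) and one store (c) per element."""
    return 3 * n * itemsize


fp64 = min_traffic_bytes(N, FP64_BYTES)
fp32 = min_traffic_bytes(N, FP32_BYTES)
print(f"FP64: {fp64 / 1e6:.0f} MB, FP32: {fp32 / 1e6:.0f} MB")
# prints "FP64: 12 MB, FP32: 6 MB"
```

Halving the bytes moved is consistent with a shorter runtime for a memory-bound kernel, although the measured speedup need not be exactly 2x.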


Nsight Compute has an abundance of performance data and built-in expertise. Each section on the Details page has detailed information for a particular category of metrics including Guided Analysis rules and descriptions. The best way to learn about all these features is to try it out on your workload and use the documentation and collateral to assist.


## Checking CUDA Python correctness with Compute Sanitizer

Compute Sanitizer is a suite of command line tools used for detection of code errors. The available tools are:
- **Memcheck** (default): Detects memory access errors, such as out-of-bounds accesses and misaligned memory accesses.
- **Racecheck**: Identifies potential data races in shared memory, which can cause nondeterministic behavior.
- **Initcheck**: Finds uninitialized memory accesses that might lead to undefined behavior.
- **Synccheck**: Detects invalid synchronization patterns that could lead to deadlocks or race conditions.

To choose which tool to use, run Compute Sanitizer with the "--tool" option, as shown below:

> compute-sanitizer --tool <memcheck|racecheck|synccheck|initcheck> python <python_app.py>

You can find more information on how to use the tool [here](https://developer.nvidia.com/compute-sanitizer). It is a good idea to first run it without any options, which selects **Memcheck**. The Memcheck tool reports the list of detected memory access errors, along with a Python backtrace, as shown in the examples below.

### Compute Sanitizer Numba example

In [None]:
# File: main.py

import numpy as np
from numba import cuda
from numba import config as numba_config
numba_config.CUDA_ENABLE_PYNVJITLINK = True

@cuda.jit('void(int32[:], int32[:])', lineinfo=True)
def invalid_read_kernel(x, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    bw = cuda.blockDim.x
    pos = tx + ty * bw

    if pos < x.size:
        out[pos] = x[pos + 2]  # out of bounds access

def launchKernel():
    invalid_read_kernel[blockspergrid, threadsperblock](d_x, d_out)


# Initialize data
n = 100
x = np.arange(n).astype(np.int32)
out = np.empty_like(x)

# Transfer data to device
d_x = cuda.to_device(x)
d_out = cuda.to_device(out)

# Set up enough threads for the job
threadsperblock = 32
blockspergrid = (n + (threadsperblock - 1)) // threadsperblock

# Run kernel
launchKernel()

# Synchronize device
cuda.synchronize()

# Copy result back to host
out = d_out.copy_to_host()
print(out)


The Numba code above contains out-of-bounds reads from the array x. The guard only checks pos < x.size, so the threads with pos equal to 98 and 99 read x[100] and x[101], which lie past the end of the 100-element array. Running Compute Sanitizer with:

> compute-sanitizer python main.py

produces the output below:

![sanitizer1](images/chapter-10/numba-sanitizer-1.png)

You can see that Compute Sanitizer correctly identified the failing kernel runs, providing detailed information on what went wrong and printing the host Python backtrace and the device backtrace.

**Note:** The 'lineinfo=True' option is needed in the @cuda.jit decorator to include line numbers in the device location and device backtrace.
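To see which threads are responsible for the invalid reads, the kernel's index arithmetic can be replayed on the host. The sketch below is plain Python, not GPU code: it loops over the same launch configuration as the example (n = 100, 32 threads per block) and records every thread that passes the `pos < n` guard but reads at or beyond index n.

```python
n = 100               # length of x in the example above
threadsperblock = 32
blockspergrid = (n + (threadsperblock - 1)) // threadsperblock  # ceiling division

out_of_bounds = []
for block in range(blockspergrid):
    for thread in range(threadsperblock):
        pos = thread + block * threadsperblock
        # Same guard as the kernel: only pos < n proceeds to the read.
        if pos < n and pos + 2 >= n:
            out_of_bounds.append((block, thread, pos + 2))

print(blockspergrid)   # 4
print(out_of_bounds)   # [(3, 2, 100), (3, 3, 101)]
```

Only threads 2 and 3 of the last block read past the end of x. One possible fix is to tighten the guard in the kernel to `pos + 2 < x.size`, which keeps every read in bounds.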

### Compute Sanitizer Numba and ctypes example

Compute Sanitizer works correctly with Numba code that calls functions from a compiled CUDA library using ctypes. It accurately concatenates the host backtrace from both its Python and CUDA components, as demonstrated in the example below.

In [None]:
// File: cuda_code.cu

#include <stdio.h>
#if defined(_WIN32) || defined(WIN32)
#define EXPORT_FN __declspec(dllexport)
#else
#define EXPORT_FN
#endif

extern "C"
__global__ void invalid_read_kernel(int *x, int *out, int n) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;

    if (pos < n) {
        out[pos] = x[pos+2]; // out of bounds access
    }
}
extern "C" 
void launch_kernel(int *x, int *out, int n, int threadsperblock) {
    printf("Launching CUDA kernel...\n");

    int blockspergrid = (n + (threadsperblock - 1)) / threadsperblock;
    invalid_read_kernel<<<blockspergrid, threadsperblock>>>(x, out, n);
}

extern "C" 
EXPORT_FN void do_stuff(int *x, int *out, int n, int threadsperblock) {
    printf("Doing stuff...\n");
    launch_kernel(x, out, n, threadsperblock);
}


In [None]:
# File: main.py

import os
import numpy as np
import ctypes
from numba import cuda
from numba import config as numba_config
numba_config.CUDA_ENABLE_PYNVJITLINK = True

def run_lib_func():
    # Load the shared library
    if os.name == 'nt':  # Windows
        print("Running on Windows")
        lib = ctypes.CDLL('./cuda_code.dll')

    elif os.name == 'posix':  # Linux or Unix-like
        print("Running on Linux or Unix")
        lib = ctypes.CDLL('./libcuda_code.so')

    else:
        print("Unknown operating system")
        exit()

    # Initialize data
    n = 100
    x = np.arange(n).astype(np.int32)
    out = np.empty_like(x)
    # Allocate memory on the device
    x_gpu = cuda.to_device(x)
    out_gpu = cuda.to_device(out)
    # Set up enough threads for the job
    threadsperblock = 32
    # Get device pointers
    x_gpu_ptr = ctypes.c_void_p(int(x_gpu.device_ctypes_pointer.value))
    out_gpu_ptr = ctypes.c_void_p(int(out_gpu.device_ctypes_pointer.value))

    # Run kernel
    lib.do_stuff(x_gpu_ptr, out_gpu_ptr, ctypes.c_int(n), ctypes.c_int(threadsperblock))
    # Synchronize device
    cuda.synchronize()
    # Copy result back to host
    out = out_gpu.copy_to_host()
    print(out)

run_lib_func()
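Before main.py can run, cuda_code.cu has to be compiled into the shared library that `ctypes.CDLL` loads. The build command is not given in this chapter, so the sketch below is an assumption: it assembles an `nvcc` command that produces the library name main.py expects on each platform, with `-lineinfo` added so device line numbers are available to Compute Sanitizer. The helper name `build_nvcc_command` is hypothetical, and the flags may need adjusting for your toolchain.

```python
import os
import shlex


def build_nvcc_command(source="cuda_code.cu"):
    """Return an nvcc command that builds the shared library main.py loads.

    Assumed build line for illustration; adjust flags for your toolchain.
    """
    if os.name == "nt":
        # Windows: main.py loads ./cuda_code.dll
        return ["nvcc", "-shared", "-lineinfo", "-o", "cuda_code.dll", source]
    # Linux or Unix-like: main.py loads ./libcuda_code.so
    return ["nvcc", "-shared", "-lineinfo", "-Xcompiler", "-fPIC",
            "-o", "libcuda_code.so", source]


# Print a copy-pasteable form; pass the list to subprocess.run to build.
print(shlex.join(build_nvcc_command()))
```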


Running Compute Sanitizer with:

> compute-sanitizer python main.py

produces the output below:

![sanitizer2](images/chapter-10/numba-sanitizer-2.png)