# Thicket Nsight Compute Reader: Thicket Tutorial

Nsight Compute (NCU) is a performance profiler for NVIDIA GPUs. NCU report files do not contain a calltree, but the Caliper NVTX service can forward Caliper annotations to NCU. By also profiling the same executable with a calltree profiler such as Caliper, we can map the NCU data onto the calltree profile and create a Thicket object.

**NOTE: An interactive version of this notebook is available in the Binder environment.**

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/llnl/thicket-tutorial/develop)

***

## 1. Import Necessary Packages

The Thicket NCU reader requires an existing installation of Nsight Compute, and the `extras/python` directory of that installation must be on Python's module search path (for example via `PYTHONPATH`). In this notebook we use `sys.path.append` to add the directory at runtime. If you are not on a Livermore Computing system, you must change this path to match your installation of Nsight Compute.

In [1]:
import sys

sys.path.append("/usr/tce/packages/nsight-compute/nsight-compute-2023.2.2/extras/python")

from IPython.display import HTML, display

import thicket as tt

display(HTML("<style>.container { width:80% !important; }</style>"))
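If the hard-coded path does not exist on your machine, the import of the NCU Python module will fail later on. The following is a minimal sketch of a more defensive setup, assuming the module shipped in `extras/python` is named `ncu_report`. The helper `add_ncu_python_path` and the environment variable `NCU_PYTHON_DIR` are hypothetical names chosen for this illustration and are not part of Thicket or Nsight Compute.

```python
import importlib.util
import os
import sys


def add_ncu_python_path(candidate_dirs):
    """Append the first existing directory to sys.path and return it (or None)."""
    for path in candidate_dirs:
        if os.path.isdir(path):
            if path not in sys.path:
                sys.path.append(path)
            return path
    return None


# NCU_PYTHON_DIR is a hypothetical variable name used only in this sketch.
candidates = [
    os.environ.get("NCU_PYTHON_DIR", ""),
    "/usr/tce/packages/nsight-compute/nsight-compute-2023.2.2/extras/python",
]
found = add_ncu_python_path([c for c in candidates if c])

# Assumption: the NCU Python module in extras/python is called ncu_report.
if importlib.util.find_spec("ncu_report") is None:
    print("ncu_report is not importable; check your Nsight Compute path")
```

Checking with `find_spec` reports a wrong path right away instead of failing later when the NCU data is read.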



## 2. The Dataset

The dataset comes from profiling the RAJA Performance Suite on Lassen. We profile the `block_128` tuning of the `Base_CUDA`, `Lambda_CUDA`, and `RAJA_CUDA` variants at two problem sizes, 1 million (1048576) and 2 million (2097152). The calltree profiles come from the CUDA Activity Profile Caliper configuration. Changing the `variant` variable in the file-mapping cell below selects which variant's NCU data we look at.

The following are reproducible steps to generate this dataset:

```
# Example of building
$ . RAJAPerf/scripts/lc-builds/blueos_nvhpc_nvcc_clang_caliper.sh 
$ make -j

# Load CUDA version equal to the CUDA version used to build RAJAPerf
$ module load nvhpc/24.1-cuda-11.2.0

# Pause NVIDIA Data Center GPU Manager (DCGM) profiling on Lassen so that NCU can run (NCU reports an error while it is active)
$ dcgmi profile --pause
```

```
# Example run to generate the CUDA Activity Profile
$ CALI_CONFIG=cuda-activity-profile,output.format=cali lrun -n 1 --smpiargs="-disable_gpu_hooks" bin/raja-perf.exe --variants [Base_CUDA OR Lambda_CUDA OR RAJA_CUDA] --tunings block_128 --size [1048576 OR 2097152] --repfact 0.01

# Example run to generate the NCU report
$ CALI_SERVICES_ENABLE=nvtx lrun -n 1 --smpiargs="-disable_gpu_hooks" ncu \
--nvtx --set default \
--export report \
--metrics sm__throughput.avg.pct_of_peak_sustained_elapsed \
--replay-mode application \
bin/raja-perf.exe --variants [Base_CUDA OR Lambda_CUDA OR RAJA_CUDA] --tunings block_128 --size [1048576 OR 2097152] --repfact 0.01
```
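The bracketed `[... OR ...]` placeholders above stand for one value per run, so the full dataset needs one run per variant and problem size. As a rough sketch, the NCU command lines for all six combinations could be generated as follows. The helper `ncu_command` is a hypothetical name, and the flags are copied unchanged from the example above.

```python
from itertools import product

variants = ["Base_CUDA", "Lambda_CUDA", "RAJA_CUDA"]
sizes = [1048576, 2097152]


def ncu_command(variant, size):
    """Build the NCU profiling command shown above for one variant and size."""
    return (
        'CALI_SERVICES_ENABLE=nvtx lrun -n 1 --smpiargs="-disable_gpu_hooks" ncu '
        "--nvtx --set default --export report "
        "--metrics sm__throughput.avg.pct_of_peak_sustained_elapsed "
        "--replay-mode application "
        f"bin/raja-perf.exe --variants {variant} --tunings block_128 "
        f"--size {size} --repfact 0.01"
    )


# One command per (variant, size) pair; this only prints, it does not run them.
commands = [ncu_command(v, s) for v, s in product(variants, sizes)]
for cmd in commands:
    print(cmd)
```

Every command exports to the same `report` name, so each run's output presumably needs to be moved into its own directory before the next run starts.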

In [2]:
# Map all files
ncu_dir = "../data/ncu/"
ncu_report_mapping = {}
variant = "base_cuda" # OR "lambda_cuda" OR "raja_cuda"
problem_sizes = ["1M", "2M"]
for problem_size in problem_sizes:
    full_path = f"{ncu_dir}{variant}/{problem_size}/"
    ncu_report_mapping[full_path+"report.ncu-rep"] = full_path+"cuda_profile.cali"
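The mapping keys are NCU report files and the values are the matching Caliper profiles, so a missing or misnamed file only shows up when the readers run. The sketch below builds the same mapping with `pathlib` and lists any files that are absent. The helper names `build_ncu_mapping` and `missing_files` are hypothetical and not part of Thicket.

```python
from pathlib import Path


def build_ncu_mapping(ncu_dir, variant, problem_sizes):
    """Map each NCU report path to the Caliper profile of the same run."""
    mapping = {}
    for size in problem_sizes:
        run_dir = Path(ncu_dir) / variant / size
        mapping[str(run_dir / "report.ncu-rep")] = str(run_dir / "cuda_profile.cali")
    return mapping


def missing_files(mapping):
    """Return every path in the mapping (keys and values) that is not a file."""
    paths = list(mapping.keys()) + list(mapping.values())
    return [p for p in paths if not Path(p).is_file()]


mapping = build_ncu_mapping("../data/ncu/", "base_cuda", ["1M", "2M"])
for path in missing_files(mapping):
    print(f"missing: {path}")
```

An empty result from `missing_files` means all four files for the chosen variant are in place.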

## 3. Read Calltree Profiles into Thicket

The CUDA Activity Profile contains only two performance metrics: the CPU time (`time`) and the GPU time (`time (gpu)`).

In [3]:
tk_cap = tt.Thicket.from_caliperreader(list(ncu_report_mapping.values()))
tk_cap.dataframe.head(20)

(1/2) Reading Files: 100%|██████████| 2/2 [00:00<00:00, 14.48it/s]
(2/2) Creating Thicket: 100%|██████████| 1/1 [00:00<00:00,  5.76it/s]


| node | profile | nid | time | time (gpu) | name |
|---|---|---|---|---|---|
| {'name': 'RAJAPerf', 'type': 'function'} | 3785253476 | 23.0 | 0.000606 | NaN | RAJAPerf |
| {'name': 'RAJAPerf', 'type': 'function'} | 4063456299 | 23.0 | 0.00059 | NaN | RAJAPerf |
| {'name': 'Algorithm', 'type': 'function'} | 3785253476 | 164.0 | 2.3e-05 | NaN | Algorithm |
| {'name': 'Algorithm', 'type': 'function'} | 4063456299 | 164.0 | 2.3e-05 | NaN | Algorithm |
| {'name': 'Algorithm_MEMCPY', 'type': 'function'} | 3785253476 | 168.0 | 1.7e-05 | NaN | Algorithm_MEMCPY |
| {'name': 'Algorithm_MEMCPY', 'type': 'function'} | 4063456299 | 168.0 | 1.7e-05 | NaN | Algorithm_MEMCPY |
| {'name': 'cudaDeviceSynchronize', 'type': 'function'} | 3785253476 | 170.0 | 5.9e-05 | NaN | cudaDeviceSynchronize |
| {'name': 'cudaDeviceSynchronize', 'type': 'function'} | 4063456299 | 170.0 | 3.8e-05 | NaN | cudaDeviceSynchronize |
| {'name': 'cudaLaunchKernel', 'type': 'function'} | 3785253476 | 169.0 | 3.2e-05 | NaN | cudaLaunchKernel |
| {'name': 'cudaLaunchKernel', 'type': 'function'} | 4063456299 | 169.0 | 3.2e-05 | NaN | cudaLaunchKernel |


## 4. Add NCU Data

The Thicket `add_ncu` function takes one required argument and one optional argument. The required argument, `ncu_report_mapping`, maps each NCU report file to the corresponding calltree profile. The optional argument, `chosen_metrics`, selects a subset of the NCU performance metrics to add, since a report can contain hundreds of them. By default, all metrics are added.

In [4]:
# Add NCU to thicket
ncu_metrics = [
    "gpu__time_duration.sum",
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",
    "smsp__maximum_warps_avg_per_active_cycle",
]
# Add in metrics
tk_cap.add_ncu(
    ncu_report_mapping=ncu_report_mapping, 
    chosen_metrics=ncu_metrics,
)
tk_cap.dataframe.head(20)

Processing action 600/601: 100%|██████████| 601/601 [00:24<00:00, 24.57it/s] 
Processing action 600/601: 100%|██████████| 601/601 [00:01<00:00, 389.47it/s]


| node | profile | nid | time | time (gpu) | name | gpu__time_duration.sum | sm__throughput.avg.pct_of_peak_sustained_elapsed | smsp__maximum_warps_avg_per_active_cycle |
|---|---|---|---|---|---|---|---|---|
| {'name': 'RAJAPerf', 'type': 'function'} | 3785253476 | 23.0 | 0.000606 | NaN | RAJAPerf | NaN | NaN | NaN |
| {'name': 'RAJAPerf', 'type': 'function'} | 4063456299 | 23.0 | 0.00059 | NaN | RAJAPerf | NaN | NaN | NaN |
| {'name': 'Algorithm', 'type': 'function'} | 3785253476 | 164.0 | 2.3e-05 | NaN | Algorithm | NaN | NaN | NaN |
| {'name': 'Algorithm', 'type': 'function'} | 4063456299 | 164.0 | 2.3e-05 | NaN | Algorithm | NaN | NaN | NaN |
| {'name': 'Algorithm_MEMCPY', 'type': 'function'} | 3785253476 | 168.0 | 1.7e-05 | NaN | Algorithm_MEMCPY | NaN | NaN | NaN |
| {'name': 'Algorithm_MEMCPY', 'type': 'function'} | 4063456299 | 168.0 | 1.7e-05 | NaN | Algorithm_MEMCPY | NaN | NaN | NaN |
| {'name': 'cudaDeviceSynchronize', 'type': 'function'} | 3785253476 | 170.0 | 5.9e-05 | NaN | cudaDeviceSynchronize | NaN | NaN | NaN |
| {'name': 'cudaDeviceSynchronize', 'type': 'function'} | 4063456299 | 170.0 | 3.8e-05 | NaN | cudaDeviceSynchronize | NaN | NaN | NaN |
| {'name': 'cudaLaunchKernel', 'type': 'function'} | 3785253476 | 169.0 | 3.2e-05 | NaN | cudaLaunchKernel | NaN | NaN | NaN |
| {'name': 'cudaLaunchKernel', 'type': 'function'} | 4063456299 | 169.0 | 3.2e-05 | NaN | cudaLaunchKernel | NaN | NaN | NaN |
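With hundreds of metrics in a report, it can be easier to build the `chosen_metrics` list from name patterns than to type each name. The sketch below assumes you already have a list of metric names; how that list is obtained is outside the scope of this example. The helper `select_metrics` is a hypothetical name, and the metric names are the three used in this tutorial.

```python
from fnmatch import fnmatchcase

# Metric names taken from this tutorial; a real report exposes many more.
available_metrics = [
    "gpu__time_duration.sum",
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",
    "smsp__maximum_warps_avg_per_active_cycle",
]


def select_metrics(names, patterns):
    """Return the names matching any shell-style pattern, preserving order."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


# "sm__*" matches names starting with "sm__", so "smsp__..." is not selected.
chosen = select_metrics(available_metrics, ["sm__*", "gpu__time*"])
print(chosen)
```

The resulting list can then be passed as `chosen_metrics` to `add_ncu`.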


## 5. Add Problem Size to the Index

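Profile hashes such as `3785253476` do not say which run a row belongs to. To make this clear, we copy the `ProblemSizeRunParam` metadata column into the performance data and use it in the index in place of `profile`.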

In [5]:
tk_cap.metadata_column_to_perfdata("ProblemSizeRunParam")
tk_cap.dataframe = tk_cap.dataframe.reset_index().set_index(["node", "ProblemSizeRunParam"])
tk_cap.dataframe.head(20)

| node | ProblemSizeRunParam | profile | nid | time | time (gpu) | name | gpu__time_duration.sum | sm__throughput.avg.pct_of_peak_sustained_elapsed | smsp__maximum_warps_avg_per_active_cycle |
|---|---|---|---|---|---|---|---|---|---|
| {'name': 'RAJAPerf', 'type': 'function'} | 2097152 | 3785253476 | 23.0 | 0.000606 | NaN | RAJAPerf | NaN | NaN | NaN |
| {'name': 'RAJAPerf', 'type': 'function'} | 1048576 | 4063456299 | 23.0 | 0.00059 | NaN | RAJAPerf | NaN | NaN | NaN |
| {'name': 'Algorithm', 'type': 'function'} | 2097152 | 3785253476 | 164.0 | 2.3e-05 | NaN | Algorithm | NaN | NaN | NaN |
| {'name': 'Algorithm', 'type': 'function'} | 1048576 | 4063456299 | 164.0 | 2.3e-05 | NaN | Algorithm | NaN | NaN | NaN |
| {'name': 'Algorithm_MEMCPY', 'type': 'function'} | 2097152 | 3785253476 | 168.0 | 1.7e-05 | NaN | Algorithm_MEMCPY | NaN | NaN | NaN |
| {'name': 'Algorithm_MEMCPY', 'type': 'function'} | 1048576 | 4063456299 | 168.0 | 1.7e-05 | NaN | Algorithm_MEMCPY | NaN | NaN | NaN |
| {'name': 'cudaDeviceSynchronize', 'type': 'function'} | 2097152 | 3785253476 | 170.0 | 5.9e-05 | NaN | cudaDeviceSynchronize | NaN | NaN | NaN |
| {'name': 'cudaDeviceSynchronize', 'type': 'function'} | 1048576 | 4063456299 | 170.0 | 3.8e-05 | NaN | cudaDeviceSynchronize | NaN | NaN | NaN |
| {'name': 'cudaLaunchKernel', 'type': 'function'} | 2097152 | 3785253476 | 169.0 | 3.2e-05 | NaN | cudaLaunchKernel | NaN | NaN | NaN |
| {'name': 'cudaLaunchKernel', 'type': 'function'} | 1048576 | 4063456299 | 169.0 | 3.2e-05 | NaN | cudaLaunchKernel | NaN | NaN | NaN |
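The index swap is ordinary pandas, so it can be tried on a toy table first. The sketch below uses plain pandas rather than Thicket, with strings standing in for the node objects and a small `metadata` frame standing in for the Thicket metadata table. It illustrates the idea and is not a description of how `metadata_column_to_perfdata` is implemented.

```python
import pandas as pd

# Toy stand-in for tk_cap.dataframe: plain strings replace the node objects.
df = pd.DataFrame(
    {
        "node": ["RAJAPerf", "RAJAPerf", "Algorithm", "Algorithm"],
        "profile": [3785253476, 4063456299, 3785253476, 4063456299],
        "time": [0.000606, 0.00059, 2.3e-05, 2.3e-05],
    }
).set_index(["node", "profile"])

# Toy stand-in for the metadata table: one row per profile.
metadata = pd.DataFrame(
    {"ProblemSizeRunParam": [2097152, 1048576]},
    index=pd.Index([3785253476, 4063456299], name="profile"),
)

# Flatten the index, then look up each row's problem size by its profile.
flat = df.reset_index()
flat["ProblemSizeRunParam"] = flat["profile"].map(metadata["ProblemSizeRunParam"])

# Re-index by (node, problem size); "profile" stays behind as a regular column.
df = flat.set_index(["node", "ProblemSizeRunParam"])

print(df.loc[("RAJAPerf", 1048576), "time"])
```

Rows can now be selected by problem size, and the original profile hash remains available as a column.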


## 6. Visualize the NCU Performance Data on the Calltree

Passing an NCU metric as `metric_column` to the `tree` function annotates each calltree node with that metric. The CPU-side nodes show `nan` because NCU only provides data for the GPU kernels.

In [6]:
print(tk_cap.tree(
    metric_column="sm__throughput.avg.pct_of_peak_sustained_elapsed",
    expand_name=True,
))

  _____ _     _      _        _   
 |_   _| |__ (_) ___| | _____| |_ 
   | | | '_ \| |/ __| |/ / _ \ __|
   | | | | | | | (__|   <  __/ |_ 
   |_| |_| |_|_|\___|_|\_\___|\__|  v2024.1.0

nan RAJAPerf
├─ nan Algorithm
│  ├─ nan Algorithm_MEMCPY
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 7.244 void RAJA::policy::cuda::impl::forall_cuda_kernel<RAJA::policy::cuda::cuda_exec_explicit<RAJA::iteration_mapping::Direct, RAJA::cuda::IndexGlobal<(RAJA::named_dim)0, 128, 0>, RAJA::cuda::MaxOccupancyConcretizer, 1ul, true>, 1ul, RAJA::Iterators::numeric_iterator<long, long, long*>, void rajaperf::algorithm::MEMCPY::runCudaVariantBlock<128ul>(rajaperf::VariantID)::{lambda(long)#2}, long, RAJA::iteration_mapping::Direct, RAJA::cuda::IndexGlobal<(RAJA::named_dim)0, 128, 0>, 128ul>(void rajaperf::algorithm::MEMCPY::runCudaVariantBlock<128ul>(rajaperf::VariantID)::{lambda(long)#2}, RAJA::Iter