LIKWID and Nvidia GPUs

Introduction

LIKWID 5.0 was extended to support Nvidia GPUs. The task was challenging because LIKWID had focused purely on CPUs and there was no "other device" logic. To simplify the transition from CPUs to GPUs for users, the LIKWID API for GPUs is basically a copy of the LIKWID API for CPUs, with a few differences. For the command line applications, new CLI options were introduced.

Permissions

The Nvidia libraries provide an option to allow profiling as a regular user. If likwid-perfctr reports error 35 (CUPTI_ERROR_INSUFFICIENT_PRIVILEGES) in the output, see this page and follow the instructions there.
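
As a sketch of the usual fix (the file name below is an assumption; the exact steps are described on the page linked above), profiling can be enabled for all users via a kernel module option, followed by a module reload or reboot:

# /etc/modprobe.d/nvidia-profiling.conf (assumed file name)
# Allow non-admin users to access the GPU performance counters
options nvidia NVreg_RestrictProfilingToAdminUsers=0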

CLI tools

likwid-topology

As soon as the CUDA and CUPTI libraries are reachable (paths to the libraries in LD_LIBRARY_PATH), likwid-topology prints a separate section GPU Topology listing all GPUs:

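$ likwid-topology
[...]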
********************************************************************************
GPU Topology
********************************************************************************
GPU count:		1
--------------------------------------------------------------------------------
ID:			0
Name:			Tesla V100-SXM2-16GB
Compute capability:	7.0
L2 size:		6.00 MB
Memory:			16.00 GB
SIMD width:		32
Clock rate:		1530000 kHz
Memory clock rate:	877000 kHz
Attached to NUMA node:	-1
--------------------------------------------------------------------------------

If you have worked with likwid-topology before, you might have used -c, --caches to display detailed cache information. For GPUs, the CLI option -G, --gpus causes likwid-topology to print detailed information about each GPU:

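$ likwid-topology -G
[...]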
********************************************************************************
GPU Topology
********************************************************************************
GPU count:		1
--------------------------------------------------------------------------------
ID:			0
Name:			Tesla V100-SXM2-16GB
Compute capability:	7.0
L2 size:		6.00 MB
Memory:			16.00 GB
SIMD width:		32
Clock rate:		1530000 kHz
Memory clock rate:	877000 kHz
Attached to NUMA node:	-1
Number of SPs:		80
Max. threads per SP:	2048
Max. threads per block:	1024
Max. thread dimensions:	1024/1024/64
Max. regs per block:	0
Shared mem per block:	49152
Memory bus width:	4096
Texture alignment:	512
Surface alignment:	512
ECC:			on
GPU integrated:		no
Max. grid sizes:	2147483647/65535/65535
PCI bus:		0x4
PCI domain:		0x4
PCI device:		0x0
--------------------------------------------------------------------------------

likwid-perfctr

The second CLI tool that was adapted to support Nvidia GPUs is likwid-perfctr. We considered different ways to specify the GPUs and related events and ended up with distinct CLI options:

$ likwid-perfctr -h
[...]
-G, --gpus <list>	 List of GPUs to monitor
-W, --gpugroup <string>	 Performance group or custom event set string for GPU monitoring
[...]

Notice: At the moment, monitoring Nvidia GPUs with LIKWID works only for instrumented code, so you always need -m!

$ make -C test triadCU
$ likwid-perfctr -G 0 -W FLOPS_DP -m test/triadCU 
--------------------------------------------------------------------------------
CPU name:	POWER9, altivec supported
CPU type:	POWER9 architecture
CPU clock:	3.80 GHz
--------------------------------------------------------------------------------
137.38ms 31.26GB/s 
--------------------------------------------------------------------------------
Region triad, Group 1: FLOPS_DP
+-------------------+----------+
|    Region Info    |   GPU 0  |
+-------------------+----------+
| RDTSC Runtime [s] | 0.105340 |
|     call count    |       10 |
+-------------------+----------+

+----------------------------------------------------+---------+-----------+
|                        Event                       | Counter |   GPU 0   |
+----------------------------------------------------+---------+-----------+
| SMSP_SASS_THREAD_INST_EXECUTED_OP_DADD_PRED_ON_SUM |   GPU0  |         0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_DMUL_PRED_ON_SUM |   GPU1  |         0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_DFMA_PRED_ON_SUM |   GPU2  | 867648400 |
+----------------------------------------------------+---------+-----------+

+--------------+------------+
|    Metric    |    GPU 0   |
+--------------+------------+
| DP [MFLOP/s] | 16473.3097 |
+--------------+------------+
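
As a consistency check of the derived metric: the FLOPS_DP group evidently counts each fused multiply-add (DFMA) as two floating-point operations, so (0 + 0 + 2 * 867648400) / 0.105340 s ≈ 1.6473e10 FLOP/s, i.e. 16473.3 MFLOP/s, matching the DP metric above.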

The NvMarkerAPI is also quite similar to the MarkerAPI:

  • LIKWID_NVMARKER_INIT: Initialize the LIKWID library for Nvidia GPUs
  • LIKWID_NVMARKER_CLOSE: Close the LIKWID library and write result file
  • LIKWID_NVMARKER_REGISTER(name): Register a region with the name name
  • LIKWID_NVMARKER_START(name): Start a region with the name name
  • LIKWID_NVMARKER_STOP(name): Stop a region with the name name
  • LIKWID_NVMARKER_RESET(name): Reset counter data for the region with the name name
  • LIKWID_NVMARKER_SWITCH: Switch to next group if any (round-robin fashion)
  • LIKWID_NVMARKER_GET(name, ngpu, nevents, eventlist, time, count): Get the current results of a region.

To activate the macros, you have to compile with -DLIKWID_NVMON and the proper include and library paths, and finally link against the LIKWID library.

Example code showing how to use the NvMarkerAPI:

#include <likwid-marker.h>
#include <cuda_runtime.h>

// Excerpt from the triad benchmark: the device buffers dA..dD, the
// sch_triad_kernel definition and the launch parameters are set up
// elsewhere in the program.
int main(int argc, char **argv) {
  LIKWID_NVMARKER_INIT;
  LIKWID_NVMARKER_REGISTER("triad");
  for (int i = 0; i < iters; i++) {
    LIKWID_NVMARKER_START("triad");
    sch_triad_kernel<double>
        <<<max_blocks, block_size>>>(dA, dB, dC, dD, buffer_size);
    cudaDeviceSynchronize(); // kernel launches are asynchronous; wait here so
                             // the measured region covers the kernel execution
    LIKWID_NVMARKER_STOP("triad");
  }
  LIKWID_NVMARKER_CLOSE;
  return 0;
}

Compilation step: nvcc -O3 -I. -Xcompiler -mno-float128 -I$LIKWID_HOME/include -DLIKWID_NVMON triad.cu -o triadCU -lm -L$LIKWID_HOME/lib -llikwid

One difference is that the NvMarkerAPI is not bound to the threads executing the calls: LIKWID_NVMARKER_START starts the counters on all GPUs supplied to likwid-perfctr. With the CPU-related MarkerAPI, in contrast, each thread has to execute the macros itself.

LIKWID library

The command line applications are basically just Lua scripts interfacing with the LIKWID library. For users of the CPU-related LIKWID API (PerfMon), the transition to the GPU-related LIKWID API (NvMon) is not difficult, as the calls are mostly the same.

Example calls:

PerfMon                                   | NvMon                                    | Description
------------------------------------------|------------------------------------------|----------------------------------------------------------
perfmon_init(ncpus, cpulist)              | nvmon_init(ngpus, gpulist)               | Initialize the measurement system on the given CPUs/GPUs
perfmon_getMetric(group, metric, cpu-idx) | nvmon_getMetric(group, metric, gpu-idx)  | Get the result of a derived metric for a CPU/GPU
perfmon_startCounters()                   | nvmon_startCounters()                    | Start the previously set up event set on all CPUs/GPUs

There are a few exceptions, like nvmon_getEventsOfGpu(gpuId, NvmonEventList_t* list), because each GPU might provide a different set of events/metrics. The PerfMon equivalent would be perfmon_init_maps(); perfmon_check_counter_map(cpuId) and a directly readable list of events.
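
For illustration, here is a minimal sketch of a measurement with the NvMon C API, assuming LIKWID was built with NVIDIA_INTERFACE=true. Only nvmon_init, nvmon_getMetric, nvmon_startCounters and nvmon_getEventsOfGpu are mentioned above; the remaining calls are assumed to mirror their PerfMon counterparts (depending on the LIKWID version, the NvMon prototypes in likwid.h may additionally be guarded by a define):

#include <stdio.h>
#include <likwid.h>

int main(void) {
  int gpus[] = {0};                          /* measure GPU 0 only */
  if (nvmon_init(1, gpus) < 0)               /* documented above */
    return 1;
  /* nvmon_addEventSet/nvmon_setupCounters are assumed analogues of
     perfmon_addEventSet/perfmon_setupCounters */
  int gid = nvmon_addEventSet("FLOPS_DP");   /* group or custom event string */
  if (gid < 0 || nvmon_setupCounters(gid) < 0)
    return 1;

  nvmon_startCounters();                     /* documented above */
  /* ... launch and synchronize the CUDA kernels to be measured ... */
  nvmon_stopCounters();

  /* Print all derived metrics of the group for the first GPU in the list;
     the metric query calls are assumed analogues of the PerfMon ones. */
  for (int m = 0; m < nvmon_getNumberOfMetrics(gid); m++)
    printf("%s: %f\n", nvmon_getMetricName(gid, m),
           nvmon_getMetric(gid, m, 0));

  nvmon_finalize();
  return 0;
}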

Performance groups

When LIKWID is compiled with NVIDIA_INTERFACE=true, the output of likwid-perfctr -a contains multiple sections: the first lists the CPU-related groups, the second the Nvidia GPU-related performance groups. The groups differ between the two GPU backends:

  • compute capability < 7.0: $LIKWID_SRC/groups/nvidia_gpu_cc_lt_7
  • compute capability >= 7.0: $LIKWID_SRC/groups/nvidia_gpu_cc_ge_7

If you want to create your own performance groups, put them in $HOME/.likwid/groups/nvidia_gpu_cc_lt_7 or $HOME/.likwid/groups/nvidia_gpu_cc_ge_7, respectively; a sketch of the file format follows below.
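
As a sketch, a custom double-precision FLOP group for compute capability >= 7.0 could look as follows. The event names and the counters GPU0..GPU2 are taken from the FLOPS_DP output above; the metric formula (FMA counted as two operations) is an assumption modeled on the CPU FLOPS_DP group:

SHORT Double-precision FLOP rate

EVENTSET
GPU0 SMSP_SASS_THREAD_INST_EXECUTED_OP_DADD_PRED_ON_SUM
GPU1 SMSP_SASS_THREAD_INST_EXECUTED_OP_DMUL_PRED_ON_SUM
GPU2 SMSP_SASS_THREAD_INST_EXECUTED_OP_DFMA_PRED_ON_SUM

METRICS
DP [MFLOP/s] 1E-6*(GPU0+GPU1+2*GPU2)/time

LONG
Double-precision FLOP rate; a fused multiply-add (DFMA) counts as two
floating-point operations.

The file would be placed, e.g., as $HOME/.likwid/groups/nvidia_gpu_cc_ge_7/FLOPS_DP_TEST.txt (assumed name) and selected with -W FLOPS_DP_TEST.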

Internals

Internally, LIKWID uses different backends to access the performance events on Nvidia GPUs. For GPUs with compute capability < 7.0, LIKWID uses the CUPTI Event API (support for the CUPTI Metrics API is upcoming), while for GPUs with compute capability >= 7.0, it uses the CUPTI Profiling API in combination with the PerfWorks API. Which backend is used for a GPU is determined once in the initialization phase. The PerfWorks API is not officially published yet (there is no documentation), but there is some example code which served as the basis for the LIKWID backend.

Known problems

General:

  • No CUPTI Metrics API support for GPUs with compute capability < 7.0.
  • Nvidia changes data structures even in minor releases. LIKWID uses some #ifdefs in the code to switch between the data structures, so you might have to recompile LIKWID after a CUDA update to make it work.

CUPTI Profiling backend:

  • Only works if GPU 0 is part of the GPU list; measuring only on GPU 1 does not work.
  • Some metrics require multi-pass kernel execution. Whether an event requires multiple passes is not detectable in the setup phase, so the error is thrown when the counters are first started.
  • On some occasions, the CUPTI Profiling API returns error 999, which is an "Unknown CUPTI error".