GPU-FPX: A Low-Overhead Tool for Floating-Point Exception Detection in NVIDIA GPUs

GPU-FPX is a tool based on NVBit that detects and analyzes floating-point exceptions on NVIDIA GPUs through binary instrumentation. It reports the occurrences of floating-point exceptions during numerical computations, efficiently pinpointing their source locations and showing how exceptional values flow through instructions. GPU-FPX is 16× faster than comparable prior tools such as BinFPE.

To reproduce the experiments in the paper, see the Benchmark section.

Build

GPU-FPX is built on NVBit and is subject to the NVBit license agreement, so we provide our code as .patch files.

There are two components in GPU-FPX:

A detector that detects floating-point exceptions and reports their locations;
An analyzer that displays how an exception flows within one instruction, which can help debug and fix the exceptions in the program being analyzed.

Ensure you have the right platform. If you are not on an Ampere machine, change the Arch value in config.mk at ./nvbit_release/tools/GPU-FPX/utility. This limitation will be fixed in the future.

To build both components, just run the following commands:

git clone https://github.com/LLNL/GPU-FPX
cd GPU-FPX
make

You can also run

make detector 
make analyzer

to build them separately.

This will generate two shared objects

./nvbit_release/tools/GPU-FPX/analyzer/analyzer.so

and

./nvbit_release/tools/GPU-FPX/detector/detector.so

which can be loaded when executing your programs.

Usage

To use our tools, you need to load the shared objects with LD_PRELOAD while running your programs. For example, if you want to detect exceptions for your GPU programs, just run

LD_PRELOAD=/your/path/to/GPU-FPX/nvbit_release/tools/GPU-FPX/detector/detector.so ./your/program

Getting started example

We provide a simple example to illustrate how to use GPU-FPX to detect and analyze exceptions. All the example code can be found in the example folder.

Create a simple GPU program

Here we create a GPU program that computes a dot product. You can name it dot-prod.cu.

#include <stdio.h>
#include <stdlib.h>


__global__ void dot_prod(float *x, float *y, int size)
{
  float d = 0.0f; // initialize the accumulator
  for (int i=0; i < size; ++i)
  {
    float tmp;
    tmp = x[i]*y[i];
    tmp = (tmp - tmp) / (tmp - tmp); // 0/0: division by zero, produces NaN
    d += tmp; // d=NaN
  }

  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid == 0) {
    printf("dot: %f\n", d);
  }
}
int main(int argc, char **argv)
{
  int n = 3;
  int nbytes = n*sizeof(float);
  float *d_a = 0;
  cudaMalloc(&d_a, nbytes);

  float *data = (float *)malloc(nbytes);
  for (int i=0; i < n; ++i)
  {
    data[i] = (float)(i+1);
  }

  cudaMemcpy((void *)d_a, (void *)data, nbytes, cudaMemcpyHostToDevice);

  printf("Calling kernel\n");
  dot_prod<<<1,1>>>(d_a, d_a, n); // pass the element count, not the byte count
  cudaDeviceSynchronize();
  printf("done\n");

  return 0;
}

Observe that there is a division by zero operation on line 13 resulting in a NaN in the final result.
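The arithmetic behind this NaN can be checked without a GPU. The following is a minimal host-side C++ sketch, assuming IEEE 754 floats and no fast-math compiler flags; `dot_prod_host` is a hypothetical helper written for illustration and is not part of GPU-FPX or the example folder.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// The sketch relies on IEEE 754 semantics, where 0/0 yields NaN.
static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 floats");

// Hypothetical host-side replica of the loop in the dot_prod kernel.
inline float dot_prod_host(const std::vector<float>& x,
                           const std::vector<float>& y) {
    float d = 0.0f;
    for (std::size_t i = 0; i < x.size() && i < y.size(); ++i) {
        float tmp = x[i] * y[i];
        tmp = (tmp - tmp) / (tmp - tmp);  // 0/0 produces NaN
        d += tmp;                         // NaN propagates into the sum
    }
    return d;
}
```

Calling `dot_prod_host` with the same data as the kernel, `{1, 2, 3}` for both vectors, returns a value for which `std::isnan` is true: once a single NaN is added to the accumulator, every later addition keeps it NaN.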

Compiling and running it

nvcc --generate-line-info dot-prod.cu -o dot-prod
./dot-prod

It will output

./dot-prod
Calling kernel
dot: nan
done

Using the `detector`

LD_PRELOAD=/your/path/to/GPU-FPX/nvbit_release/tools/GPU-FPX/detector/detector.so ./dot-prod

It will generate an exception report. Some excerpts are shown here:

#GPU-FPX LOC-EXCEP INFO: in kernel [dot_prod], DIV0 found @ /home/xinyi/gpufpx-docker/example/dot-prod.cu:13 [FP32]
#GPU-FPX LOC-EXCEP INFO: in kernel [dot_prod], NaN found @ /home/xinyi/gpufpx-docker/example/dot-prod.cu:13 [FP32]
dot: nan
#GPU-FPX LOC-EXCEP INFO: in kernel [dot_prod], NaN found @ /home/xinyi/gpufpx-docker/example/dot-prod.cu:21 [FP32]
#GPU-FPX LOC-EXCEP INFO: in kernel [dot_prod], NaN found @ /home/xinyi/gpufpx-docker/example/dot-prod.cu:14 [FP32]

We can see it successfully detects the division by zero operation on line 13.
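When a program produces many such lines, it can help to extract the fields programmatically. The sketch below parses one report line into its kernel name, exception kind, source location, and precision. It assumes the line layout shown in the excerpt above; `ExceptionRecord` and `parse_report_line` are hypothetical names introduced here and are not part of GPU-FPX.

```cpp
#include <regex>
#include <string>

// Hypothetical record for one detector report line (not a GPU-FPX type).
struct ExceptionRecord {
    std::string kernel;     // e.g. "dot_prod"
    std::string kind;       // e.g. "DIV0" or "NaN"
    std::string file;       // source path from --generate-line-info
    int line = 0;           // source line number
    std::string precision;  // e.g. "FP32"
};

// Returns true and fills `out` when `text` matches the layout seen in the
// excerpt above; returns false for any other line (program output, etc.).
inline bool parse_report_line(const std::string& text, ExceptionRecord& out) {
    static const std::regex pattern(
        R"re(#GPU-FPX LOC-EXCEP INFO: in kernel \[(.+?)\], (\w+) found @ (.+):(\d+) \[(\w+)\])re");
    std::smatch m;
    if (!std::regex_search(text, m, pattern)) {
        return false;
    }
    out.kernel = m[1].str();
    out.kind = m[2].str();
    out.file = m[3].str();
    out.line = std::stoi(m[4].str());
    out.precision = m[5].str();
    return true;
}
```

Applied to the first line of the excerpt, this yields kernel `dot_prod`, kind `DIV0`, line `13`, and precision `FP32`; a plain program line such as `dot: nan` is rejected. If the report format differs in other GPU-FPX versions, the pattern would need adjusting.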

Using the `analyzer`

LD_PRELOAD=/your/path/to/GPU-FPX/nvbit_release/tools/GPU-FPX/analyzer/analyzer.so ./dot-prod

Some excerpts from the analyzer output are shown here:

#GPU-FPX-ANA APPEAR : INF appear at the destination  @ /home/xinyi/gpufpx-docker/example/dot-prod.cu:13 Instruction: MUFU.RCP R0, R10 ; We have 2 registers in total. Register 0 is INF. Register 1 is VAL.
#GPU-FPX-ANA APPEAR : NaN appear at the destination  @ /home/xinyi/gpufpx-docker/example/dot-prod.cu:13 Instruction: FFMA R9, -R10, R0, 1 ; We have 3 registers in total. Register 0 is NaN. Register 1 is VAL. Register 2 is INF.

If you are familiar with SASS, the low-level CUDA assembly, you may recognize MUFU.RCP as one of the key instructions used to implement division. It computes the reciprocal of register R10 and stores the result in R0.

Here, GPU-FPX-ANA APPEAR means that no exceptional values (NaN, INF) are present in the source register R10, but an exceptional value occurs in the destination register R0, implying that an exception appears at this instruction.
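The two analyzer lines above can be mirrored on the host to see why INF appears first and NaN second. This is a rough sketch, assuming IEEE 754 floats: `1.0f / x` stands in for MUFU.RCP (the hardware instruction is an approximation, but both give INF for a zero input) and `std::fma` stands in for FFMA. The names `classify`, `DivisionTrace`, and `trace_reciprocal` are hypothetical and only loosely follow the register labels in the output.

```cpp
#include <cmath>
#include <limits>
#include <string>

static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 floats");

// Label a value the way the analyzer output does: NaN, INF, or VAL.
inline std::string classify(float v) {
    if (std::isnan(v)) return "NaN";
    if (std::isinf(v)) return "INF";
    return "VAL";
}

// Hypothetical stand-ins for the registers named in the analyzer output.
struct DivisionTrace {
    float r10;  // divisor, source of MUFU.RCP
    float r0;   // reciprocal, destination of MUFU.RCP
    float r9;   // destination of FFMA R9, -R10, R0, 1
};

inline DivisionTrace trace_reciprocal(float divisor) {
    DivisionTrace t;
    t.r10 = divisor;
    t.r0 = 1.0f / t.r10;                  // 1/0 gives INF
    t.r9 = std::fma(-t.r10, t.r0, 1.0f);  // (-0 * INF) + 1 gives NaN
    return t;
}
```

For a divisor of `0.0f`, `classify` reports `VAL` for `r10`, `INF` for `r0`, and `NaN` for `r9`, matching the VAL, INF, and NaN labels in the two analyzer lines. For an ordinary divisor such as `2.0f`, all three values are classified as `VAL`.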

Benchmark

We have benchmarked 151 GPU programs in our paper. To test and reproduce them, please refer to the README.md in the benchmarks folder.

Contact

For questions, contact Ganesh Gopalakrishnan ganesh@cs.utah.edu and Xinyi Li xin_yi.li@utah.edu.

To cite GPU-FPX please use

@inproceedings{10.1145/3588195.3592991,
author = {Li, Xinyi and Laguna, Ignacio and Fang, Bo and Swirydowicz, Katarzyna and Li, Ang and Gopalakrishnan, Ganesh},
title = {Design and Evaluation of GPU-FPX: A Low-Overhead Tool for Floating-Point Exception Detection in NVIDIA GPUs},
year = {2023},
isbn = {9798400701559},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3588195.3592991},
doi = {10.1145/3588195.3592991},
booktitle = {Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing},
pages = {59–71},
numpages = {13},
keywords = {high-performance computing, binary instrumentation, numerical programs, floating-point exceptions, GPUs, machine learning},
location = {Orlando, FL, USA},
series = {HPDC '23}
}

License

GPU-FPX is distributed under the terms of the MIT license.

See LICENSE-MIT and NOTICE for details.

LLNL-CODE- 851480
