Triad GPU Bandwidth

Measures achievable triad kernel bandwidth with each of the available CUDA allocation and transfer methods:

pageable allocation and explicit transfer: cudaMalloc / cudaMemcpy
pinned allocation and explicit transfer: cudaHostAlloc / cudaMemcpy
pinned allocation with mapping: cudaHostAlloc(..., cudaHostAllocMapped)
unified memory: cudaMallocManaged
unified memory with AccessedBy hint: cudaMallocManaged + cudaMemAdvise(..., cudaMemAdviseAccessedBy)
unified memory with prefetching: cudaMallocManaged + cudaMemPrefetchAsync
system allocator: new
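As an illustration of what "implicit transfer" means for the mapped pinned case, here is a minimal sketch in which the kernel reads and writes host memory directly and no cudaMemcpy is issued. The helper name `mapped_add_one` and the stand-in kernel are hypothetical and are not taken from triad.cu; the sketch also assumes a platform with unified addressing, where mapped pinned memory is usable without extra device flags.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Stand-in kernel (not the benchmark's triad): adds 1 to every element.
__global__ void add_one(float *x, size_t n) {
  for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x) {
    x[i] += 1.0f;
  }
}

// Hypothetical helper: fills n floats of mapped pinned host memory with
// `init`, increments them from the GPU, and copies the result into out[0..n).
cudaError_t mapped_add_one(float *out, float init, size_t n) {
  float *host = nullptr;  // host-side view of the pinned allocation
  float *dev = nullptr;   // device-side view of the same memory
  cudaError_t err =
      cudaHostAlloc((void **)&host, n * sizeof(float), cudaHostAllocMapped);
  if (err != cudaSuccess) return err;
  for (size_t i = 0; i < n; ++i) host[i] = init;

  err = cudaHostGetDevicePointer((void **)&dev, host, 0);
  if (err == cudaSuccess) {
    add_one<<<64, 256>>>(dev, n);  // launch shape chosen arbitrarily
    err = cudaGetLastError();
  }
  // No explicit copy: the kernel accessed host memory over the interconnect.
  if (err == cudaSuccess) err = cudaDeviceSynchronize();
  if (err == cudaSuccess) {
    for (size_t i = 0; i < n; ++i) out[i] = host[i];
  }
  cudaFreeHost(host);
  return err;
}
```

Because the data movement happens inside the kernel, a benchmark of this method has no separate transfer phase to time.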

The basic operation is a host-to-device transfer, followed by the triad kernel, followed by a device-to-host transfer.
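The sequence above can be sketched for the pageable, explicit-transfer case as follows. This assumes the STREAM-style triad a[i] = b[i] + s * c[i]; the kernel in triad.cu, its element type, and which arrays it copies in each direction may differ. `triad_kernel` and `triad_explicit` are illustrative names.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Assumed STREAM-style triad: a[i] = b[i] + s * c[i].
__global__ void triad_kernel(float *a, const float *b, const float *c,
                             float s, size_t n) {
  // Grid-stride loop so any launch shape covers all n elements.
  for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x) {
    a[i] = b[i] + s * c[i];
  }
}

// Hypothetical helper: host-to-device copy, kernel, device-to-host copy.
// a, b, c are ordinary (pageable) host arrays of length n.
cudaError_t triad_explicit(float *a, const float *b, const float *c,
                           float s, size_t n) {
  const size_t bytes = n * sizeof(float);
  float *da = nullptr, *db = nullptr, *dc = nullptr;
  cudaError_t err = cudaMalloc(&da, bytes);
  if (err == cudaSuccess) err = cudaMalloc(&db, bytes);
  if (err == cudaSuccess) err = cudaMalloc(&dc, bytes);

  // Host-to-device transfer of the two inputs.
  if (err == cudaSuccess) err = cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
  if (err == cudaSuccess) err = cudaMemcpy(dc, c, bytes, cudaMemcpyHostToDevice);

  if (err == cudaSuccess) {
    triad_kernel<<<256, 256>>>(da, db, dc, s, n);  // arbitrary launch shape
    err = cudaGetLastError();
  }

  // Device-to-host transfer of the result; this blocking copy also waits
  // for the kernel to finish.
  if (err == cudaSuccess) err = cudaMemcpy(a, da, bytes, cudaMemcpyDeviceToHost);

  cudaFree(da);  // cudaFree(nullptr) is a no-op, so cleanup is unconditional
  cudaFree(db);
  cudaFree(dc);
  return err;
}
```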

The output is a CSV file with columns for the transfer time (zero for implicit transfers), the kernel time, and the total time.

total time: the time from the start of the first transfer to the end of the last transfer.
transfer time: the combined time for the host-to-device and the device-to-host transfers.
kernel time: the time from the start of the kernel execution to the end of the kernel execution.

transfer + kernel may not equal total, though they should be close.
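One way such a discrepancy can arise is when each phase is bracketed by its own pair of CUDA events: any gap between one phase's stop event and the next phase's start event counts toward the total but toward neither part. The sketch below shows that scheme. It is an assumption about how the three columns could be measured, not the benchmark's actual timing code, and `PhaseTimes`, `time_phases`, and the stand-in kernel are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical result record; the field names are illustrative only.
struct PhaseTimes {
  float transfer_ms;  // host-to-device plus device-to-host
  float kernel_ms;    // kernel only
  float total_ms;     // start of first transfer to end of last transfer
};

// Stand-in kernel so the sketch is self-contained (not the triad kernel).
__global__ void scale_kernel(float *x, float s, size_t n) {
  for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x) {
    x[i] *= s;
  }
}

// Milliseconds between two completed events, or -1 on error.
static float elapsed_ms(cudaEvent_t start, cudaEvent_t stop) {
  float ms = -1.0f;
  if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) return -1.0f;
  return ms;
}

cudaError_t time_phases(float *host, size_t n, PhaseTimes *out) {
  const size_t bytes = n * sizeof(float);
  float *dev = nullptr;
  cudaEvent_t ev[6];
  int created = 0;
  cudaError_t err = cudaMalloc(&dev, bytes);
  while (err == cudaSuccess && created < 6) {
    err = cudaEventCreate(&ev[created]);
    if (err == cudaSuccess) ++created;
  }
  if (err == cudaSuccess) {
    // Each phase has its own start/stop pair in the default stream.
    cudaEventRecord(ev[0]);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1]);
    cudaEventRecord(ev[2]);
    scale_kernel<<<256, 256>>>(dev, 2.0f, n);
    cudaEventRecord(ev[3]);
    cudaEventRecord(ev[4]);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[5]);
    err = cudaEventSynchronize(ev[5]);  // all earlier events are done too
  }
  if (err == cudaSuccess) err = cudaGetLastError();  // surfaces launch errors
  if (err == cudaSuccess) {
    out->transfer_ms = elapsed_ms(ev[0], ev[1]) + elapsed_ms(ev[4], ev[5]);
    out->kernel_ms = elapsed_ms(ev[2], ev[3]);
    out->total_ms = elapsed_ms(ev[0], ev[5]);
  }
  for (int i = 0; i < created; ++i) cudaEventDestroy(ev[i]);
  cudaFree(dev);
  return err;
}
```

With this layout, total_ms also includes the intervals between ev[1] and ev[2] and between ev[3] and ev[4], so it is expected to be slightly larger than transfer_ms + kernel_ms.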

Examples

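Run benchmarks on GPU 0, sweeping n from 1e5 to 2.5e8 and multiplying by 1.3 at each step (n = 1e5; n <= 2.5e8; n *= 1.3). Repeat each benchmark 5 times. Direct allocations to NUMA node 0 with numactl (note that -p sets a preferred node; --cpunodebind and --membind give a strict binding). Show output on the terminal and also write it to triad.csv.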

numactl -p 0 ./triad | tee triad.csv

Run only the pinned benchmark

./triad --pinned

Run only n = 1e9

./triad -n 1e9

Run 3 iterations of each benchmark

./triad -i 3

Show all options

./triad -h

Automatic Testing for Functional System Allocator

Forks a child process to test whether CUDA can use the system allocator. If CUDA cannot, the attempt causes a sticky error that permanently damages the CUDA context, so the test runs in a child process whose context is fully isolated and is destroyed when the child exits. The child process needs to be created before the parent does any CUDA activity at all.

The test is done in test_system_allocator.cu.
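The isolation pattern described above might look roughly like the following sketch. It is not the code in test_system_allocator.cu: the function names, the probe kernel, and the exit-code convention are all assumptions made for illustration, and it assumes a POSIX system.

```cuda
#include <cuda_runtime.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Probe kernel: writes through a pointer that came from the system allocator.
__global__ void touch_kernel(int *p) { *p = 1; }

// Runs in the child only. Returns 0 if the GPU could access memory from
// `new`, nonzero otherwise. Any sticky error stays in the child's context.
static int child_probe() {
  int *p = new int(0);  // system allocator, not a CUDA allocation
  touch_kernel<<<1, 1>>>(p);
  if (cudaGetLastError() != cudaSuccess) return 1;       // launch failed
  if (cudaDeviceSynchronize() != cudaSuccess) return 1;  // e.g. illegal access
  return (*p == 1) ? 0 : 1;
}

// Hypothetical entry point. Call it before the parent makes any CUDA call,
// so the child does not inherit an already-initialized CUDA context.
bool system_allocator_works() {
  pid_t pid = fork();
  if (pid < 0) return false;  // fork failed: report "not usable"
  if (pid == 0) {
    _exit(child_probe());     // child: report the result via the exit code
  }
  int status = 0;
  if (waitpid(pid, &status, 0) < 0) return false;
  // A crash or signal in the child also counts as "not usable".
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The parent never touches the probe's CUDA context, so it can go on to run the remaining benchmarks whichever way the probe turns out.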

Getting stable benchmark results

It is important to do 3 things:

Call cudaDeviceReset() before each benchmark. This ensures that any CUDA state is wiped between runs.
Call cudaFree(0) after cudaDeviceReset(). This initializes the GPU, ensuring that we don't accidentally time any lazy initialization.
Pin to a single NUMA region or CPU. This ensures that data copies always take a consistent route from CPU to GPU.
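The first two steps can be wrapped in a small helper that runs before each timed benchmark. This is a minimal sketch; the name `fresh_context` is hypothetical and does not come from the repository.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: wipe CUDA state for the current device, then force
// context creation so that lazy initialization is not included in a timing.
cudaError_t fresh_context() {
  // Destroys all allocations and resets all state for the current device
  // in this process.
  cudaError_t err = cudaDeviceReset();
  if (err != cudaSuccess) return err;
  // Freeing a null pointer does nothing except initialize the runtime.
  return cudaFree(0);
}
```

Any device pointers obtained before the call are invalid afterwards, so each benchmark has to allocate its own buffers after fresh_context() returns.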

Building with a gcc that has std::regex

mkdir build && cd build
cmake ..

Building on Power9 with gcc 4.8.5

GCC 4.8.5 doesn't have a working std::regex (which cxxopts requires), so install a supported version of clang. GCC 4.8.5 cannot build libcxx, so we use a clang without libcxx to build a clang with libcxx. Which version of clang you need depends on your installed CUDA.

CUDA	Clang	Installer
9.2	5.0.0	https://gist.github.com/cwpearson/c5521dfc50175b1d977643b2fc5a2bb1
10.1	5.0.0	https://gist.github.com/cwpearson/c13ac7c25bde8c8644300e211faf4e78

Add clang to your PATH, and have CMake use clang in the build.

mkdir build && cd build
cmake .. -DCMAKE_TOOLCHAIN_FILE=`readlink -f ../toolchains/clang.toolchain`

The CUDA documentation claims that clang 8.0.0 is supported for CUDA 10.1, but if you actually try it, you get a message that it requires clang>=3.2 and clang<8. Clang 7.1.0 fails on CUDA 10.0.0 with some errors about __fp16.
