Comb v0.3.1

Comb is a communication performance benchmarking tool. It is used to determine performance tradeoffs in implementing communication patterns on high performance computing (HPC) platforms. At its core, Comb runs combinations of communication patterns, execution patterns, and memory spaces in order to find efficient combinations. The current set of capabilities Comb provides includes:

  • Configurable structured mesh halo exchange communication.
  • A variety of communication patterns based on grouping messages.
  • A variety of execution patterns including serial, openmp threading, cuda, and cuda fused kernels.
  • A variety of memory spaces including default system allocated memory, pinned host memory, cuda device memory, and cuda managed memory with different cuda memory advice.

It is important to note that Comb is very much a work-in-progress. Additional features will appear in future releases.

Quick Start

The Comb code lives in a GitHub repository. To clone the repo, use the command:

git clone --recursive https://github.com/llnl/comb.git

On an LC (Livermore Computing) system you can build Comb using the provided cmake scripts and host-configs.

./scripts/lc-builds/blueos_nvcc_gcc.sh 10.1.243 sm_70 8.3.1
cd build_lc_blueos-nvcc10.1.243-sm_70-gcc8.3.1
make

You can also create your own script and host-config, provided you have a C++ compiler that supports the C++11 standard, an MPI library with a compiler wrapper, and optionally an install of cuda 9.0 or later.

./scripts/my-builds/compiler_version.sh
cd build_my_compiler_version
make

To run basic tests, make a directory and create symlinks to the comb executable and scripts. The scripts expect a symlink to comb to exist in the run directory. The run_tests.bash script runs the basic_tests.bash script with 2^3 = 8 processes (2 processes per side of the 3D decomposition).

ln -s /path/to/comb/build_my_compiler_version/bin/comb .
ln -s /path/to/comb/scripts/* .
./run_tests.bash 2 basic_tests.bash

User Documentation

Minimal documentation is available.

Comb runs every enabled combination of execution pattern and memory space. Each rank prints its results to stdout. The sep_out.bash script may be used to simplify data collection by piping the output of each rank into a different file. The combine_output.lua script may be used to simplify data aggregation from multiple files.

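For example, a hedged sketch of how these scripts might be wired together; it assumes sep_out.bash is used as a wrapper around the comb command and that combine_output.lua runs under a standalone lua interpreter, and the launcher flags, grid arguments, and file names are illustrative.

# each rank's stdout is piped to its own file by the sep_out.bash wrapper
mpirun -np 8 ./sep_out.bash ./comb 100_100_100 -divide 2_2_2

# aggregate results from the resulting per-rank files (file names are placeholders)
lua ./combine_output.lua rank_output_file_0 rank_output_file_1
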
Comb uses a variety of manual packing/unpacking execution techniques such as sequential, openmp, and cuda. Comb also uses MPI_Pack/MPI_Unpack with MPI derived datatypes for packing/unpacking. (Note: tests using cuda managed memory and MPI datatypes are disabled as they sometimes produce incorrect results)

Comb creates a different MPI communicator for each test. This communicator is assigned a generic name unless MPI datatypes are used for packing and unpacking. When MPI datatypes are used the name of the memory allocator is appended to the communicator name.

Configure Options

The cmake configuration options change which execution patterns and memory spaces are enabled; a sample configure command is sketched after the list below.

  • ENABLE_MPI Allow use of mpi and enable test combinations using mpi
  • ENABLE_OPENMP Allow use of openmp and enable test combinations using openmp
  • ENABLE_CUDA Allow use of cuda and enable test combinations using cuda
  • ENABLE_RAJA Allow use of RAJA performance portability library
  • ENABLE_CALIPER Allow use of the Caliper performance profiling library
  • ENABLE_ADIAK Allow use of the Adiak library for recording program metadata

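As a rough sketch only (the host-config scripts under scripts/ are the supported route, and the option values and out-of-source layout below are illustrative), these options may be passed directly to cmake:

# sketch: add compiler and MPI settings for your system (see the host-configs for examples)
mkdir build_my_config && cd build_my_config
cmake -DENABLE_MPI=ON -DENABLE_OPENMP=ON -DENABLE_CUDA=OFF \
      -DENABLE_RAJA=OFF -DENABLE_CALIPER=OFF -DENABLE_ADIAK=OFF \
      ..
make
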
Runtime Options

The runtime options change the properties of the grid and its decomposition, as well as the communication pattern used; an example invocation is sketched after this list.

  • #_#_# Grid size in each dimension (Required)
  • -divide #_#_# Number of subgrids in each dimension (Required)
  • -periodic #_#_# Periodicity in each dimension
  • -ghost #_#_# The halo width or number of ghost zones in each dimension
  • -vars # The number of grid variables
  • -comm option Communication options
    • cutoff # Number of elements used as the cutoff between the large and small message packing kernels
    • enable|disable option Enable or disable specific message passing execution policies
      • all all message passing execution patterns
      • mock mock message passing execution pattern (do not communicate)
      • mpi mpi message passing execution pattern
      • gdsync libgdsync message passing execution pattern (experimental)
      • gpump libgpump message passing execution pattern
      • mp libmp message passing execution pattern (experimental)
      • umr umr message passing execution pattern (experimental)
    • post_recv option Communication post receive (MPI_Irecv) options
      • wait_any Post recvs one-by-one
      • wait_some Post recvs in groups
      • wait_all Post all recvs
      • test_any Post recvs one-by-one
      • test_some Post recvs in groups
      • test_all Post all recvs
    • post_send option Communication post send (MPI_Isend) options
      • wait_any pack and send messages one-by-one
      • wait_some pack messages then send them in groups
      • wait_all pack all messages then send them all
      • test_any pack messages asynchronously and send when ready
      • test_some pack multiple messages asynchronously and send when ready
      • test_all pack all messages asynchronously and send when ready
    • wait_recv option Communication wait to recv and unpack (MPI_Wait, MPI_Test) options
      • wait_any recv and unpack messages one-by-one (MPI_Waitany)
      • wait_some recv messages then unpack them in groups (MPI_Waitsome)
      • wait_all recv all messages then unpack them all (MPI_Waitall)
      • test_any recv and unpack messages one-by-one (MPI_Testany)
      • test_some recv messages then unpack them in groups (MPI_Testsome)
      • test_all recv all messages then unpack them all (MPI_Testall)
    • wait_send option Communication wait on sends (MPI_Wait, MPI_Test) options
      • wait_any Wait for each send to complete one-by-one (MPI_Waitany)
      • wait_some Wait for all sends to complete in groups (MPI_Waitsome)
      • wait_all Wait for all sends to complete (MPI_Waitall)
      • test_any Wait for each send to complete one-by-one by polling (MPI_Testany)
      • test_some Wait for all sends to complete in groups by polling (MPI_Testsome)
      • test_all Wait for all sends to complete by polling (MPI_Testall)
    • allow|disallow option Allow or disallow specific communications options
      • per_message_pack_fusing Combine packing/unpacking kernels for boundaries communicated in the same message
      • message_group_pack_fusing Fuse packing/unpacking kernels across messages (and variables) in the same message group
  • -cycles # Number of times the communication pattern is tested
  • -omp_threads # Number of openmp threads requested
  • -exec option Execution options
    • enable|disable option Enable or disable specific execution patterns
      • all all execution patterns
      • seq sequential CPU execution pattern
      • omp openmp threaded CPU execution pattern
      • cuda cuda GPU execution pattern
      • cuda_graph cuda GPU batched via cuda graph API execution pattern
      • hip hip GPU execution pattern
      • raja_seq RAJA sequential CPU execution pattern
      • raja_omp RAJA openmp threaded CPU execution pattern
      • raja_cuda RAJA cuda GPU execution pattern
      • raja_hip RAJA hip GPU execution pattern
      • mpi_type MPI datatypes MPI implementation execution pattern
  • -memory option Memory space options
    • UseType enable|disable Optional UseType modifier for enable|disable, default is all. UseType specifies what uses to enable|disable, for example "-memory buffer disable cuda_pinned" disables cuda_pinned buffer allocations.
      • all all use types
      • mesh mesh use type
      • buffer buffer use type
    • enable|disable option Enable or disable specific memory spaces for UseType allocations
      • all all memory spaces
      • host host CPU memory space
      • cuda_hostpinned cuda pinned memory space (pooled)
      • cuda_device cuda device memory space (pooled)
      • cuda_managed cuda managed memory space (pooled)
      • cuda_managed_host_preferred cuda managed with host preferred advice memory space (pooled)
      • cuda_managed_host_preferred_device_accessed cuda managed with host preferred and device accessed advice memory space (pooled)
      • cuda_managed_device_preferred cuda managed with device preferred advice memory space (pooled)
      • cuda_managed_device_preferred_host_accessed cuda managed with device preferred and host accessed advice memory space (pooled)
      • hip_hostpinned hip pinned memory space (pooled)
      • hip_hostpinned_coarse hip coarse grained (non-coherent) pinned memory space (pooled)
      • hip_device hip device memory space (pooled)
      • hip_device_fine hip fine grained device memory space (pooled)
      • hip_managed hip managed memory space (pooled)
  • -cuda_aware_mpi Assert that you are using a cuda aware mpi implementation and enable tests that pass cuda device or managed memory to MPI
  • -hip_aware_mpi Assert that you are using a hip aware mpi implementation and enable tests that pass hip device or managed memory to MPI
  • -cuda_host_accessible_from_device Assert that your system supports pageable host memory access from the device and enable tests that access pageable host memory on the device
  • -use_device_preferred_for_cuda_util_aloc Use device preferred host accessed memory for cuda utility allocations instead of host pinned memory, mainly affects fused kernels
  • -use_device_for_hip_util_aloc Use device memory for hip utility allocations instead of host pinned memory, mainly affects fused kernels
  • -print_packing_sizes Print message and packing sizes to proc files
  • -print_message_sizes Print message sizes to proc files
  • -caliper_config Caliper performance profiling config (e.g., "runtime-report")

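As an illustration only (the launcher, rank count, and option values below are hypothetical; the product of the -divide values must equal the number of MPI ranks), a run combining several of these options might look like:

mpirun -np 8 ./comb 100_100_100 -divide 2_2_2 -periodic 1_1_1 -ghost 1_1_1 \
       -vars 3 -cycles 25 \
       -comm enable mpi -exec enable seq -memory enable host
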
Example Script

The run_tests.bash script is an example script that allocates resources and uses a script such as focused_tests.bash to run the code in a variety of configurations. It takes two arguments: the number of processes per side used to split the grid into an N x N x N decomposition, and the test script.

mkdir 1_1_1
cd 1_1_1
ln -s path/to/comb/build/bin/comb .
ln -s path/to/comb/scripts/* .
./run_tests.bash 1 focused_tests.bash

The scale_tests.bash script, used with run_tests.bash, shows the options available and how the code may be run with multiple sets of arguments with mpi. The focused_tests.bash script, also used with run_tests.bash, shows the options available and how the code may be run with one set of arguments with mpi.

Output

Comb outputs Comb_(number)_summary and Comb_(number)_proc(number) files. The summary file contains aggregated results from the proc files, which contain per-process results. The files contain the argument and code setup information and the results of multiple tests. The results for each test follow a line starting with "Starting test" and the name of the test.

The first set of tests are memory copy tests with names of the following form.

Starting test memcpy (execution policy) dst (destination memory space) src (source memory space)
copy_sync-(number of variables)-(elements per variable)-(bytes per element): num (number of repeats) avg (time) s min (time) s max (time) s

Example:

Starting test memcpy seq dst Host src Host
copy_sync-3-1061208-8: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s

This is a test in which memory is copied from one host memory buffer to another via sequential cpu execution. The test involves one measurement.

copy_sync-3-1061208-8 means copying 3 buffers of 1061208 elements of 8 bytes each.

The second set of tests are the message passing tests with names of the following form.

Comm (message passing execution policy) Mesh (physics execution policy) (mesh memory space) Buffers (large message execution policy) (large message memory space) (small message execution policy) (small message memory space)
(test phase): num (number of repeats) avg (time) s min (time) s max (time) s
...

Example:

Comm mpi Mesh seq Host Buffers seq Host seq Host
pre-comm:  num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-recv: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-send: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
wait-recv: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
wait-send: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-comm: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
start-up:   num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
test-comm:  num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
bench-comm: num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s

This is a test in which a mesh is updated with physics running via sequential cpu execution using memory allocated in host memory. The buffers used for large messages are packed/unpacked via sequential cpu execution and allocated in host memory, and the buffers used with MPI for small messages are packed/unpacked via sequential cpu execution and allocated in host memory. This test involves multiple measurements; the first six time individual parts of the physics cycle and communication.

  • pre-comm "Physics" before point-to-point communication, in this case setting memory to initial values.
  • post-recv Allocating MPI receive buffers and calling MPI_Irecv.
  • post-send Allocating MPI send buffers, packing buffers, and calling MPI_Isend.
  • wait-recv Waiting to receive MPI messages, unpacking MPI buffers, and freeing MPI receive buffers.
  • wait-send Waiting for MPI send messages to complete and freeing MPI send buffers.
  • post-comm "Physics" after point-to-point communication, in this case resetting memory to initial values.

The final three measure problem setup, correctness testing, and total benchmark time.

  • start-up Setting up mesh and point-to-point communication.
  • test-comm Testing correctness of point-to-point communication.
  • bench-comm Running the benchmark, which starts after an initial MPI_Barrier and ends after a final MPI_Barrier.

Execution Policies

  • seq Sequential CPU execution
  • omp Parallel CPU execution via OpenMP
  • cuda Parallel GPU execution via cuda
  • cudaGraph Parallel GPU execution via cuda graphs
  • hip Parallel GPU execution via hip
  • raja_seq RAJA Sequential CPU execution
  • raja_omp RAJA Parallel CPU execution via OpenMP
  • raja_cuda RAJA Parallel GPU execution via cuda
  • raja_hip RAJA Parallel GPU execution via hip
  • mpi_type Packing or unpacking execution done via mpi datatypes used with MPI_Pack/MPI_Unpack

Note: The cudaGraph exec policy updates the graph each cycle. There is currently no option to use the same graph for every cycle.

Memory Spaces

  • Host CPU memory (malloc)
  • HostPinned Cuda/Hip Pinned CPU memory pool (cudaHostAlloc/hipMallocHost)
  • Device Cuda/Hip GPU memory pool (cudaMalloc/hipMalloc)
  • Managed Cuda/Hip Managed GPU memory pool (cudaMallocManaged/hipMallocManaged)
  • ManagedHostPreferred Cuda Managed CPU Pinned memory pool (cudaMallocManaged + cudaMemAdviseSetPreferredLocation cudaCpuDeviceId)
  • ManagedHostPreferredDeviceAccessed Cuda Managed CPU Pinned memory pool (cudaMallocManaged + cudaMemAdviseSetPreferredLocation cudaCpuDeviceId + cudaMemAdviseSetAccessedBy 0)
  • ManagedDevicePreferred Cuda Managed GPU memory pool (cudaMallocManaged + cudaMemAdviseSetPreferredLocation 0)
  • ManagedDevicePreferredHostAccessed Cuda Managed GPU memory pool (cudaMallocManaged + cudaMemAdviseSetPreferredLocation 0 + cudaMemAdviseSetAccessedBy cudaCpuDeviceId)

Note: Some memory spaces are pooled. This is done to amortize the cost of allocation. After the first allocation, the cost of allocating memory should be trivial for pooled memory spaces. The first allocation is done in a warmup step and is not included in any timers.

Related Software

The RAJA Performance Suite contains a collection of loop kernels implemented in multiple RAJA and non-RAJA variants. We use it to monitor and assess RAJA performance on different platforms using a variety of compilers.

The RAJA Proxies repository contains RAJA versions of several important HPC proxy applications.

Contributions

The Comb team follows the GitFlow development model. Folks wishing to contribute to Comb should include their work in a feature branch created from the Comb develop branch. Then, create a pull request with the develop branch as the destination. That branch contains the latest work in Comb. Periodically, we will merge the develop branch into the master branch and tag a new release.

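A minimal sketch of that workflow (the branch name is hypothetical):

git checkout develop
git checkout -b feature/my-change   # hypothetical feature branch created from develop
# ...commit your changes...
git push origin feature/my-change
# then open a pull request on GitHub with develop as the destination branch
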
Authors

Thanks to all of Comb's contributors.

Comb was created by Jason Burmark (burmark1@llnl.gov).

Release

Comb is released under an MIT license. For more details, please see the LICENSE, RELEASE, and NOTICE files.

LLNL-CODE-758885