
Comb v0.1.1

Comb is a communication performance benchmarking tool. It is used to determine performance tradeoffs in implementing communication patterns on high performance computing (HPC) platforms. At its core, Comb runs combinations of communication patterns, execution patterns, and memory spaces in order to find efficient combinations. The current set of capabilities Comb provides includes:

  • Configurable structured mesh halo exchange communication.
  • A variety of communication patterns based on grouping messages.
  • A variety of execution patterns including serial, OpenMP threading, CUDA, CUDA batched kernels, and CUDA persistent kernels.
  • A variety of memory spaces including default system allocated memory, pinned host memory, CUDA device memory, and CUDA managed memory with different CUDA memory advice.

It is important to note that Comb is very much a work-in-progress. Additional features will appear in future releases.

Quick Start

The Comb code lives in a GitHub repository. To clone the repo, use the command:

git clone --recursive https://github.com/llnl/comb.git

On an LC (Livermore Computing) system you can build Comb using the provided cmake scripts and host-configs.

./scripts/lc-builds/blueos/nvcc_9_2_gcc_4_9_3.sh
cd build_lc_blueos_nvcc_9_2_gcc_4_9_3
make

You can also create your own script and host-config, provided you have a C++ compiler that supports the C++11 standard, an MPI library with a compiler wrapper, and optionally an install of CUDA 9.0 or later (a sketch of such a script follows the commands below).

./scripts/my-builds/compiler_version.sh
cd build_my_compiler_version
make
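
Such a script is essentially a cmake invocation. The following is only a minimal sketch with placeholder compiler and option values, not a verified Comb host-config; adjust it to your toolchain.

mkdir build_my_compiler_version && cd build_my_compiler_version
cmake \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DCMAKE_BUILD_TYPE=Release \
  ..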

User Documentation

Minimal documentation is available.

Comb runs every enabled combination of execution pattern and memory space. Each rank prints its results to stdout. The sep_out.bash script may be used to simplify data collection by piping the output of each rank into a different file. The combine_output.lua script may be used to simplify aggregation of data from multiple files.
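
A hypothetical collection workflow, assuming sep_out.bash wraps the comb executable under the MPI launcher and combine_output.lua accepts the per-rank files as arguments (check the scripts themselves for their exact interfaces), might look like:

mpirun -np 8 ./sep_out.bash ./comb <comb arguments>
lua combine_output.lua rank_*.out    # hypothetical per-rank output file names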

Comb uses a variety of manual packing/unpacking execution techniques such as sequential, OpenMP, and CUDA. Comb also uses MPI_Pack/MPI_Unpack with MPI derived datatypes for packing/unpacking. (Note: tests using CUDA managed memory and MPI datatypes are disabled as they sometimes produce incorrect results.)
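
For reference, and not Comb's actual code: a derived-datatype pack of one strided face of a grid variable can be expressed with MPI_Type_create_subarray and MPI_Pack, roughly as in the sketch below (grid sizes are arbitrary example values).

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  // Example 10x10x10 grid variable; describe one 10x10x1 face of it.
  std::vector<double> grid(10 * 10 * 10, 1.0);
  int sizes[3]    = {10, 10, 10};
  int subsizes[3] = {10, 10, 1};
  int starts[3]   = {0, 0, 0};

  MPI_Datatype face;
  MPI_Type_create_subarray(3, sizes, subsizes, starts, MPI_ORDER_C,
                           MPI_DOUBLE, &face);
  MPI_Type_commit(&face);

  // MPI gathers the strided elements into a contiguous buffer,
  // replacing a hand-written packing loop.
  int packed_size = 0;
  MPI_Pack_size(1, face, MPI_COMM_WORLD, &packed_size);
  std::vector<char> buffer(packed_size);
  int position = 0;
  MPI_Pack(grid.data(), 1, face, buffer.data(), packed_size, &position,
           MPI_COMM_WORLD);

  MPI_Type_free(&face);
  MPI_Finalize();
  return 0;
}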

Comb creates a different MPI communicator for each test. This communicator is assigned a generic name unless MPI datatypes are used for packing and unpacking. When MPI datatypes are used, the name of the memory allocator is appended to the communicator name.
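
The naming can make the per-test communicators easier to identify in MPI tools and error output. The underlying MPI calls look roughly like the generic sketch below; this is not Comb's implementation, and the allocator suffix shown is made up.

#include <mpi.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  // Duplicate a communicator for one test ...
  MPI_Comm test_comm;
  MPI_Comm_dup(MPI_COMM_WORLD, &test_comm);

  // ... and attach a descriptive name (hypothetical allocator suffix).
  MPI_Comm_set_name(test_comm, "test_comm_host_pinned");

  MPI_Comm_free(&test_comm);
  MPI_Finalize();
  return 0;
}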

Configure Options

The cmake configuration options change which execution patterns and memory spaces are enabled.

  • ENABLE_OPENMP Allow use of OpenMP and enable test combinations using OpenMP
  • ENABLE_CUDA Allow use of CUDA and enable test combinations using CUDA
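
These options are passed on the cmake command line (or set in a host-config), for example:

cmake -DENABLE_OPENMP=ON -DENABLE_CUDA=ON ..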

Runtime Options

The runtime options change the properties of the grid and its decomposition, as well as the communication pattern used; an example invocation follows the list of options.

  • #_#_# Grid size in each dimension (Required)
  • -divide #_#_# Number of subgrids in each dimension (Required)
  • -periodic #_#_# Periodicity in each dimension
  • -ghost #_#_# The halo width or number of ghost zones in each dimension
  • -vars # The number of grid variables
  • -comm option Communication options
    • mock Do mock communication (do not make MPI calls)
    • cutoff # Number of elements cutoff between large and small message packing kernels
    • post_recv option Communication post receive (MPI_Irecv) options
      • wait_any Post recvs one-by-one
      • wait_some Post recvs one-by-one
      • wait_all Post recvs one-by-one
      • test_any Post recvs one-by-one
      • test_some Post recvs one-by-one
      • test_all Post recvs one-by-one
    • post_send option Communication post send (MPI_Isend) options
      • wait_any pack and send messages one-by-one
      • wait_some pack messages then send them in groups
      • wait_all pack all messages then send them all
      • test_any pack messages asynchronously and send when ready
      • test_some pack multiple messages asynchronously and send when ready
      • test_all pack all messages asynchronously and send when ready
    • wait_recv option Communication wait to recv and unpack (MPI_Wait, MPI_Test) options
      • wait_any recv and unpack messages one-by-one (MPI_Waitany)
      • wait_some recv messages then unpack them in groups (MPI_Waitsome)
      • wait_all recv all messages then unpack them all (MPI_Waitall)
      • test_any recv and unpack messages one-by-one (MPI_Testany)
      • test_some recv messages then unpack them in groups (MPI_Testsome)
      • test_all recv all messages then unpack them all (MPI_Testall)
    • wait_send option Communication wait on sends (MPI_Wait, MPI_Test) options
      • wait_any Wait for each send to complete one-by-one (MPI_Waitany)
      • wait_some Wait for all sends to complete in groups (MPI_Waitsome)
      • wait_all Wait for all sends to complete (MPI_Waitall)
      • test_any Wait for each send to complete one-by-one by polling (MPI_Testany)
      • test_some Wait for all sends to complete in groups by polling (MPI_Testsome)
      • test_all Wait for all sends to complete by polling (MPI_Testall)
  • -cycles # Number of times the communication pattern is tested
  • -omp_threads # Number of OpenMP threads requested
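
For illustration only, a run on 8 ranks (assuming one subgrid per rank, with arbitrary sizes and counts, and a launcher that matches your system) might look like:

mpirun -np 8 ./comb 100_100_100 -divide 2_2_2 -periodic 1_1_1 -ghost 1_1_1 -vars 3 -cycles 25 -omp_threads 4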

Example Script

The run_scale_tests.bash script is an example that allocates resources and runs the code in a variety of configurations via the scale_tests.bash script. The run_scale_tests.bash script takes a single argument, the number of processes per side used to split the grid into an N x N x N decomposition.

mkdir 1_1_1
cd 1_1_1
ln -s path/to/comb/build/bin/comb .
ln -s path/to/comb/scripts/* .
./run_scale_tests.bash 1

The scale_tests.bash script, used by run_scale_tests.bash, shows the options available and how the code may be run with MPI.

Related Software

The RAJA Performance Suite contains a collection of loop kernels implemented in multiple RAJA and non-RAJA variants. We use it to monitor and assess RAJA performance on different platforms using a variety of compilers.

The RAJA Proxies repository contains RAJA versions of several important HPC proxy applications.

Contributions

The Comb team follows the GitFlow development model. Folks wishing to contribute to Comb should include their work in a feature branch created from the Comb develop branch. Then, create a pull request with the develop branch as the destination. That branch contains the latest work in Comb. Periodically, we will merge the develop branch into the master branch and tag a new release.

Authors

Thanks to all of Comb's contributors.

Comb was created by Jason Burmark (burmark1@llnl.gov).

Release

Comb is released under an MIT license. For more details, please see the LICENSE, RELEASE, and NOTICE files.

LLNL-CODE-758885