
============================== Release Notes: v0.102 ==============================

Support for new training algorithms:

  • LTFB is now a first-class training algorithm.
  • LTFB now allows multiple metrics. Each trainer favors its local
    model, and a partner model must win every metric to be declared
    the tournament winner (see the sketch after this list).
  • The batched iterative optimizer (sgd_training_algorithm) was
    refactored for consistency.
  • Improved documentation of training algorithm infrastructure.
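
As a concrete illustration of the multi-metric tournament rule, a
minimal sketch in C++ (the function name and signature are
hypothetical, not LBANN's API; only the decision logic follows the
rule described above):

    #include <cstddef>
    #include <vector>

    // Hypothetical sketch of the LTFB multi-metric rule: the partner
    // model must beat the local model on every metric to win the
    // tournament; any tie or loss keeps the local model.
    bool partner_wins_tournament(
        const std::vector<double>& local_scores,
        const std::vector<double>& partner_scores,
        const std::vector<bool>& higher_is_better) {
      for (std::size_t i = 0; i < local_scores.size(); ++i) {
        const bool partner_better =
            higher_is_better[i] ? partner_scores[i] > local_scores[i]
                                : partner_scores[i] < local_scores[i];
        if (!partner_better) { return false; }  // local model favored
      }
      return true;
    }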

Support for new network structures:

  • ATOM WAE model - character-based Wasserstein Autoencoder
  • Community GAN model for graph data sets

Support for new layers:

  • "DFTAbs" layer that computes the absolute value of the channel-wise
    DFT of the input data
  • Added support for 3D matrix multiplication
  • Added scatter and gather neural network layers (see the sketch
    after this list)
  • CPU-based GRU layers using oneDNN
  • Added batch-wise reduce-sum
  • ArcFace loss
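
For reference, the scatter and gather layers implement the usual
index-based semantics. A minimal 1-D sketch in C++ (illustrative
only, not LBANN's implementation; the real layers operate on
distributed tensors):

    #include <cstddef>
    #include <vector>

    // Gather: out[i] = values[indices[i]]
    std::vector<float> gather(const std::vector<float>& values,
                              const std::vector<int>& indices) {
      std::vector<float> out(indices.size());
      for (std::size_t i = 0; i < indices.size(); ++i) {
        out[i] = values[indices[i]];
      }
      return out;
    }

    // Scatter: out[indices[i]] = values[i] into a zero-filled buffer
    std::vector<float> scatter(const std::vector<float>& values,
                               const std::vector<int>& indices,
                               std::size_t out_size) {
      std::vector<float> out(out_size, 0.0f);
      for (std::size_t i = 0; i < indices.size(); ++i) {
        out[indices[i]] = values[i];
      }
      return out;
    }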

Python front-end:

  • Added 3D U-Net model
  • Added CosmoFlow model
  • Ported CANDLE Pilot1 models
  • Added support for nvprof
  • Added channel-wise fully-connected layer
  • Added support for non-square kernels, padding, stride, and
    dilation for the convolution module
  • Added support for the OpenMPI launcher

Performance optimizations:

  • Used the cuDNN 8 RNN API and CUDA Graphs in the GRU layer
  • Cached CUDA Graphs for each active mini-batch size (see the sketch
    after this list)
  • Tuned performance of the slice, concatenate, and tessellate layers
    on ARM processors
  • Parallelized computation of Gaussian random numbers
  • Optimized the tessellate, concatenate, and slice layers on CPU
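
The CUDA Graphs cache can be pictured as one executable graph per
active mini-batch size, captured on first use and replayed
afterwards. A minimal sketch using the standard CUDA Graph API (the
cache layout and function are assumptions for illustration, not
LBANN's code):

    #include <cuda_runtime.h>
    #include <map>

    std::map<int, cudaGraphExec_t> graph_cache;

    void launch_gru(int mini_batch_size, cudaStream_t stream,
                    void (*run_kernels)(int, cudaStream_t)) {
      auto it = graph_cache.find(mini_batch_size);
      if (it == graph_cache.end()) {
        // First use of this mini-batch size: capture the kernels.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        run_kernels(mini_batch_size, stream);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
        cudaGraphDestroy(graph);  // exec keeps what it needs
        it = graph_cache.emplace(mini_batch_size, exec).first;
      }
      // Replay the cached graph; avoids per-kernel launch overhead.
      cudaGraphLaunch(it->second, stream);
    }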

Experiments & Applications:

  • Added experiment scripts for ATOM cWAE Gordon Bell simulations
  • LBANN-ATOM model inference and analysis

Internal features:

  • Wrapper classes for CUDA Graphs API
  • Elementary examples of using complex numbers
  • cuDNN handles are now wrapped in RAII management classes (see the
    sketch after this list)
  • Improved HWLOC compatibility for v1.11 and v2.x
  • Added an enum type for visitor hooks that will eventually be used
    to allow callbacks or other visitors to operate at user-defined
    hook points
  • Changed checkpoint logic to checkpoint at the start of epochs
    and changed the naming scheme to use the callback phase (visitor
    hook) in the name rather than the current execution context.
  • Added in-memory binary model exchange for LTFB.
  • Added support for ROCm and MIOpen
  • Added support for oneDNN
  • Updated the Bamboo test environment to use a local executable
    rather than hard-coded executables
  • Overhauled and refactored serialization throughout the code base
    to use the Cereal serialization library
  • Significant cleanup and refactoring of the code base to improve
    compile times. The code is moving toward the standard split of
    headers between declaration and implementation (for templated
    code), with specific focus on the serialization functions and the
    comm class. Reduced dependencies by removing overreaching header
    inclusions.
  • The relationship between execution_contexts and
    training_algorithms was clarified; there is still work to do here.
  • Added DistConv tests for both convolution and pooling layers
  • Added support for padding in the distributed embedding layer
  • Added dump model graph callback
  • Added perturb learning rate callback
  • Added batched inference algorithm
  • Switched ATOM tests to use CPU embedding and tessellate layers to
    minimize noise
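
As an illustration of the RAII pattern now used for cuDNN handles, a
minimal sketch (the class name is hypothetical; LBANN's wrappers
differ in detail):

    #include <cudnn.h>

    // The handle is created on construction and destroyed on
    // destruction, so it cannot leak on early returns or exceptions.
    class CudnnHandle {
    public:
      CudnnHandle() { cudnnCreate(&handle_); }
      ~CudnnHandle() { cudnnDestroy(handle_); }
      CudnnHandle(const CudnnHandle&) = delete;  // non-copyable
      CudnnHandle& operator=(const CudnnHandle&) = delete;
      cudnnHandle_t get() const noexcept { return handle_; }
    private:
      cudnnHandle_t handle_;
    };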

I/O & data readers:

  • Experimental data reader that generates graph random walks with
    HavoqGT
  • Added explicit tournament execution mode
  • Added support to split training data reader into validation and
    tournament readers
  • node2vec data reader

Build system:

  • Hydrogen v1.5.0+
  • Aluminum v0.5.0+
  • DiHydrogen v0.2.0 is required
  • C++14 or newer standard with CUDA (CMake: "-DCMAKE_CUDA_STANDARD=14")
  • OpenCV is now an optional dependency via CMake "LBANN_WITH_VISION"
  • CNPY is now an optional dependency via CMake "LBANN_WITH_CNPY"
  • Added support in the build_lbann.sh script for concretizing extra
    packages with the primary LBANN installation
  • Added features to the build script that set up and configure the
    build environment, then stop and allow the user to manually add
    extra packages
  • Added a set of user-focused build scripts that use the main
    build_lbann.sh script to set up good defaults on known systems
  • Added application-specific build scripts for user communities such
    as ATOM
  • Added support for pulling from Spack mirrors and setting them up
  • Split embedded Python support from Python Front End
  • Switched the Spack-based build script to use Spack's clingo
    concretizer

Bug fixes:

  • Fixed a bug where LBANN didn't set the Hydrogen RNG seed
  • Fixed the Python front end (PFE) for both the CosmoFlow and U-Net
    models and addressed issues in the data reader and data
    coordinator.
  • Fixed the HDF5 data reader to properly specify the supported I/O
    types
  • Fixed calculation of the linearized response size
  • Fixed the data coordinator's interface to input_layer
  • Fixed error with deterministic execution of dropout layers

Retired features:

  • Removed the deprecated JAG leader mode, which was made obsolete
    when the data reader moved into the data coordinator
  • Removed the deprecated partitioned data reader modes that were used
    to partition and overlap data sets for multiple models
  • Removed deprecated ActivationDescriptor class