SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.

Scalable Checkpoint / Restart (SCR) Library

The Scalable Checkpoint / Restart (SCR) library enables MPI applications to utilize distributed storage on Linux clusters to attain high file I/O bandwidth for checkpointing and restarting large-scale jobs. With SCR, jobs run more efficiently, recompute less work upon a failure, and reduce load on critical shared resources such as the parallel file system.

Detailed usage is provided at SCR.ReadTheDocs.io.


Quickstart

SCR uses the CMake build system, and we recommend out-of-source builds.

git clone git@github.com:llnl/scr.git
mkdir build
mkdir install

cd build
cmake -DCMAKE_INSTALL_PREFIX=../install ../scr
make
make install
make test

Some useful CMake command line options:

  • -DCMAKE_INSTALL_PREFIX=[path]: Place to install the SCR library
  • -DCMAKE_BUILD_TYPE=[Debug/Release]: Build with debugging or optimizations
  • -DBUILD_PDSH=[OFF/ON]: CMake can automatically download and build the PDSH dependency
  • -DWITH_PDSH_PREFIX=[path to PDSH]: Path to an existing PDSH installation (should not be used with BUILD_PDSH)
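  • -DWITH_DTCMP_PREFIX=[path to DTCMP]: Path to an existing DTCMP installation
  • -DWITH_YOGRT_PREFIX=[path to YOGRT]: Path to an existing libYOGRT installation
  • -DSCR_ASYNC_API=[CRAY_DW/INTEL_CPPR/IBM_BBAPI/NONE]: Vendor-specific asynchronous file transfer API to build against
  • -DSCR_RESOURCE_MANAGER=[SLURM/APRUN/PMIX/LSF/NONE]: Resource manager used on the target cluster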
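As a rough illustration of how several of these options combine, the sketch below assembles one configure command for a SLURM cluster with an existing PDSH installation. The path /opt/pdsh and the chosen option values are assumptions made for the example, not recommended settings.

```shell
# Hypothetical location of an existing PDSH install; adjust for your system.
PDSH_PREFIX=/opt/pdsh

# Options taken from the list above; the chosen values are only examples.
CMAKE_ARGS="-DCMAKE_INSTALL_PREFIX=../install \
-DCMAKE_BUILD_TYPE=Release \
-DWITH_PDSH_PREFIX=$PDSH_PREFIX \
-DSCR_RESOURCE_MANAGER=SLURM \
-DSCR_ASYNC_API=NONE"

# Print the full command for review; remove "echo" to run it from the build directory.
echo cmake $CMAKE_ARGS ../scr
```

Because WITH_PDSH_PREFIX points at an existing installation here, BUILD_PDSH is left at its default rather than set alongside it.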

Dependencies

  • C (with support for C++ and Fortran)
  • MPI
  • CMake, Version 2.8+
  • PDSH
  • DTCMP (optional)
  • libYOGRT (optional)
  • MySQL (optional)

Configuration Files

To determine a parameter value, SCR searches the following locations in order and takes the first value it finds.

  1. Environment variables,
  2. User configuration file,
  3. System configuration file,
  4. Compile-time constants.
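The lookup order above can be sketched as a small shell function. This is only an illustration of the precedence rule, not SCR's actual implementation: the function names resolve_param and lookup_in_file are hypothetical, and the sketch assumes configuration files made of simple NAME=VALUE lines.

```shell
# Illustrative only: mimics the documented lookup order, not SCR's actual code.
# Assumes simple NAME=VALUE lines in the configuration files.

lookup_in_file() {
    # Print the value of key $1 from file $2; fail if the file or key is missing.
    [ -f "$2" ] || return 1
    value=$(sed -n "s/^$1=//p" "$2" | head -n 1)
    [ -n "$value" ] || return 1
    printf '%s\n' "$value"
}

resolve_param() {
    # $1 = parameter name, $2 = user file, $3 = system file, $4 = built-in default
    name=$1
    # 1. Environment variable
    eval "env_value=\${$name-}"
    if [ -n "$env_value" ]; then
        printf '%s\n' "$env_value"
        return 0
    fi
    # 2. User configuration file
    lookup_in_file "$name" "$2" && return 0
    # 3. System configuration file
    lookup_in_file "$name" "$3" && return 0
    # 4. Compile-time constant (represented here by a default argument)
    printf '%s\n' "$4"
}
```

With this sketch, a value set in the environment always wins, and the built-in default is returned only when neither file defines the parameter.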

To find a user configuration file, SCR looks for a file named .scrconf in the prefix directory (note the leading dot). Alternatively, one may specify the name and location of the user configuration file by setting the SCR_CONF_FILE environment variable at run time. This repository includes some example configuration files (scr.conf.template, scr.user.conf.template, and examples/test.conf).
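A minimal sketch of selecting a custom user configuration file at run time follows. The file location, the NAME=VALUE line syntax, and the parameter SCR_EXAMPLE_PARAM are assumptions for illustration; consult the template files in the repository for real parameter names and syntax.

```shell
# Hypothetical location for a custom user configuration file.
conf_file="${TMPDIR:-/tmp}/my_scr.conf"

# Assumption: one NAME=VALUE setting per line, with "#" starting a comment.
cat > "$conf_file" <<'EOF'
# placeholder parameter, for illustration only
SCR_EXAMPLE_PARAM=1
EOF

# Point SCR at this file instead of the default .scrconf in the prefix directory.
export SCR_CONF_FILE="$conf_file"

# Launch the MPI job afterwards as usual (launcher and binary are site-specific).
```

Since environment variables are searched first, any parameter also exported in the environment would still take precedence over the value in this file.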

Authors

Numerous people have contributed to the SCR project.

To reference SCR in a publication, please cite the following paper:

Additional information and research publications can be found here:

http://computation.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

Developers

Developer documentation is provided at SCR-dev.ReadTheDocs.io.

