User space POSIX-like file system in main memory



CRUISE: Checkpoint-Restart In User-SpacE


On today's massive high-performance computing systems, long-running parallel scientific applications periodically save their execution state to files called checkpoints so that they can recover from system failures. Checkpoints are stored on external parallel file systems, but limited bandwidth makes this a time-consuming operation. Multilevel checkpointing systems, like the Scalable Checkpoint/Restart (SCR) library, alleviate this bottleneck by caching checkpoints in storage located close to the compute nodes. However, most large-scale systems do not provide file storage on compute nodes, preventing the use of SCR.

CRUISE is a novel user-space file system that stores data in main memory and transparently spills over to other storage, like local flash memory or the parallel file system, as needed. This technique extends the reach of libraries like SCR to systems where they otherwise could not be used. CRUISE also exposes file contents for Remote Direct Memory Access, allowing external tools to copy checkpoints to the parallel file system in the background with reduced CPU interruption.
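The spill-over idea described above can be illustrated with a small, self-contained C sketch. This is not CRUISE's implementation or API: the `spill_store` type and its functions are hypothetical names invented for this example, and a temporary file created with `tmpfile()` stands in for the secondary storage (local flash or the parallel file system). The sketch only shows the general technique of filling a fixed memory region first and sending the remainder to a backing store.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only: these are NOT CRUISE's data structures. */
typedef struct {
    char  *mem;        /* fixed-size in-memory region */
    size_t mem_cap;    /* capacity of the region in bytes */
    size_t mem_used;   /* bytes currently held in memory */
    FILE  *spill;      /* stand-in for secondary storage (e.g. local flash) */
    size_t spill_used; /* bytes written to the spill store */
} spill_store;

/* Allocate the memory region and open the spill store. Returns 0 on success. */
int spill_store_init(spill_store *s, size_t mem_cap) {
    assert(s != NULL);
    s->mem = malloc(mem_cap);
    if (s->mem == NULL) return -1;
    s->spill = tmpfile();
    if (s->spill == NULL) { free(s->mem); return -1; }
    s->mem_cap = mem_cap;
    s->mem_used = 0;
    s->spill_used = 0;
    return 0;
}

/* Append len bytes: fill memory first, then spill whatever does not fit. */
int spill_store_append(spill_store *s, const void *buf, size_t len) {
    size_t room   = s->mem_cap - s->mem_used;
    size_t in_mem = len < room ? len : room;
    memcpy(s->mem + s->mem_used, buf, in_mem);
    s->mem_used += in_mem;

    size_t rest = len - in_mem;
    if (rest > 0) {
        if (fwrite((const char *)buf + in_mem, 1, rest, s->spill) != rest)
            return -1;
        s->spill_used += rest;
    }
    return 0;
}

/* Copy the full contents (memory part, then spilled part) into out,
 * which must hold at least mem_used + spill_used bytes. */
int spill_store_read_all(spill_store *s, void *out) {
    memcpy(out, s->mem, s->mem_used);
    if (s->spill_used > 0) {
        if (fflush(s->spill) != 0) return -1;
        rewind(s->spill);
        if (fread((char *)out + s->mem_used, 1, s->spill_used, s->spill)
                != s->spill_used)
            return -1;
        /* Reposition at the end so later appends continue correctly. */
        if (fseek(s->spill, 0, SEEK_END) != 0) return -1;
    }
    return 0;
}

/* Release the memory region and close (and thereby delete) the spill file. */
void spill_store_free(spill_store *s) {
    free(s->mem);
    fclose(s->spill);
    s->mem = NULL;
    s->spill = NULL;
}
```

For instance, appending 15 bytes to a store initialized with an 8-byte memory region would keep the first 8 bytes in memory and write the remaining 7 to the spill file, while the caller sees a single contiguous byte stream when reading back. A real user-space file system would additionally have to intercept POSIX I/O calls and manage many files and offsets, which this sketch does not attempt.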

More information about the project, and relevant publications, can be found HERE.