A Fault-Tolerant Framework for Asynchronous Iterative Computations in Cloud Environments
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.



Faiter is a fault-tolerant framework for asynchronous iterative computations in cloud environments.

1. Introduction

Faiter is a prototype system with simple yet efficient fault-tolerance components for distributed asynchronous iterative computation engines. Many well-known algorithms are iterative in nature, like PageRank, Penalized Hitting Probability (PHP), and Katz Metric. Asynchronous computation model can significantly improve the performance of these algorithms in distributed environments. However, it is challenging to recover from failures in such a model, since a typical checkpointing based appoach requires many expensive synchronization barriers that largely offset the gains of asynchronous computations.

The built-in fault-tolerant component in Faiter utilizes data on surviving machines to recover data on failed machines, rather than checkpoints. Additionally, a novel asynchronous checkpointing method is introduced to further boost the recovery efficiency at the price of nearly zero overhead. Faiter provides simple APIs to facilitate tolerating failures for asynchronous computations. Also, Faiter performs load balancing on recovery by re-assigning lost data onto multiple machines.

The Faiter project started at UMASS Amherst in 2015. Faiter is a C++ framework implemented on top of Maiter. For more details, please refer to our paper A Fault-Tolerant Framework for Asynchronous Iterative Computations in Cloud Environments (Zhigang Wang and Lixin Gao et al.).

2. Quick Start

Before running Faiter, you need to download faiter.tar.gz and deploy it. You can click here for help.

PageRank and PHP, are provided as two example algorithms. Before running them, you need to split input data into multiple partitions and assign partitions onto different machines. Click here to know how to prepare input data.

Taking PageRank as example, you can specify the following parameters in "pr.sh" to run it.

Specifically, CKINTERVAL indicates the interval between two consecutive asynchronous checkpoints (milliseconds). CKINTERVAL=-1 means disabling checkpointing. FAULTTIME specifies when a failure happens and FAILEDWORKERNUM tells the system how many machines will be marked as failed workers. Similarly, -1 means no failure happens. When CASCADING=1, cascading failures will be simulated. When running PHP in php.sh, another parameter SOURCE is required to set the source vertex id.

3. Programming Guide

Users can also implement their own algorithms.

3.1 Compiling Requirements

  • CMake
  • OpenMPI
  • Python
  • gcc/g++
  • Protocol Buffers

3.2 Programming on Faiter

Users can implement their own algorithms by learning built-in examples. Specifically, you first need to create your own xx.cc file in the src/example directory. After finishing your coding work, add the name of your xx.cc file into the CMakeLists.txt file in the same directory.

3.3 Building and Running

Type build to run the build.sh shell script in the top level directory of Faiter. Run your algorithm as PageRank but replace the ALGORITHM parameter with your algorithm name.

4. Contact

If you encounter any problem with Faiter, please feel free to contact wangzhiganglab@gmail.com.