Shedulers.jl provides elastic and fault tolerant parallel map (epmap
) and parallel map reduce
methods (epmapreduce!
). The primary feature that distinguishes Schedulers parallel map method
from Julia's Distributed.pmap
is elasticity where the cluster is permitted to dynamically grow/shrink.
The parallel map reduce method also aims for features of fault tolerance, dynamic load balancing and
elasticity.
This package can be used in conjunction with AzSessions.jl, AzStorage.jl and AzManagers.jl to have reasonable parallel map and parallel map reduce methods that work with Azure cloud scale-sets. The fault handling and elasticity allows us to, in particular, work with Azure cloud scale-sets that are using Azure Spot instances.
The current implementation of the parallel map reduce method uses shared storage (e.g. NFS, Azure Blob storage...) to check-point local reductions. This check-pointing is what allows for fault tolerance, and to some extent, elasticity. However, it is important to note that it comes at the cost of IO operations. It would be useful to investigate alternatives ideas that avoid IO such as resilient distributed data from Spark.
The map and the reduce in the epmapreduce!
method are asynchronous. The map is prioritized, but when the
number of tasks that need to be mapped over is less than the number of machines, the reduction over available
check-points can begin. In certain cases this can help to hide the cost of the reduction by overlapping with
the map. In addition, it can make better use of the cluster such that machines that would otherwise be idle
can be used for the reduction.