Java Disco Worker
NOTE: This project is still very beta and in active development. I'm operating under the "release early, release often" principle.
MapReduce has three steps. The first, called the "client" by Disco, defines the MapReduce "job" (the input space plus references to the map and reduce functions) and submits the job to Disco. The second is the "map" function, which runs once for each input. The third is the "reduce", which takes the results of all the map runs and combines them as needed.
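To make those three steps concrete, here is the canonical word-count example in plain Java, with no Disco involved; the class and method names are just for illustration:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    // Step 2, "map": runs once per input, emitting (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String input) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : input.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Step 3, "reduce": combines the results of all the map runs.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Step 1, the "client": defines the input space and wires map to reduce.
    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("to be or not", "to be");
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String input : inputs) {
            mapped.addAll(map(input)); // Disco would run these in parallel
        }
        System.out.println(reduce(mapped)); // e.g. {not=1, be=2, or=1, to=2}
    }
}
```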
The `disco` command line tool must be installed on the box used to submit the job.
Define the map and reduce phases by implementing the `DiscoMapFunction` interface, which has a `map` function. This function does the work that runs in parallel on the cluster. Do the same for the reduce phase with a `DiscoReduceFunction` implementation. Then write a `main` function which creates the `DiscoJob`. This is the "client".
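For illustration, a word-count pair might look like the sketch below. The interface signatures aren't spelled out in this README, so the method shapes and the `Emitter` type are assumptions, not the project's actual API:

```java
// WordCountMap.java
// Sketch only: the map signature and emit call are assumptions.
public class WordCountMap implements DiscoMapFunction {
    public void map(String input, Emitter out) { // hypothetical signature
        for (String word : input.split("\\s+")) {
            out.emit(word, "1"); // hypothetical emit call
        }
    }
}

// WordCountReduce.java
// Sketch only: the reduce signature is an assumption.
public class WordCountReduce implements DiscoReduceFunction {
    public void reduce(String key, Iterable<String> values, Emitter out) {
        int sum = 0;
        for (String v : values) {
            sum += Integer.parseInt(v);
        }
        out.emit(key, Integer.toString(sum));
    }
}
```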
`DiscoJob` describes the input space and takes three arguments, including:
- `inputs` - a map instance runs for each element of this list.
- `args` - fixed arguments passed to each map run.

Construct the `DiscoJob` instance with the class containing the map function to run for each input; do the same with the class containing the reduce function, as in the sketch below.
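A hypothetical client `main` tying it together; the `DiscoJob` constructor shape and the `submit()` call are assumptions based on the description above:

```java
import java.util.Arrays;
import java.util.List;

public class WordCountClient {
    public static void main(String[] args) throws Exception {
        // inputs: one map instance runs per element.
        List<String> inputs = Arrays.asList(
                "http://example.com/data/part-0",
                "http://example.com/data/part-1");
        // args: fixed arguments passed to every map run.
        String[] jobArgs = {};

        // Constructor shape and submit() are assumptions for illustration.
        DiscoJob job = new DiscoJob(inputs, jobArgs,
                WordCountMap.class, WordCountReduce.class);
        job.submit();
    }
}
```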
Submit the job with a call on the `DiscoJob` instance. Set the `DISCO_MASTER` environment variable to point to the master node; see the run example below.
Package your `DiscoMapFunction` and `DiscoReduceFunction` implementations into the same jar as your client `main` function, then run the jar with `java -cp <your.jar> yourMainClass`. The VM args and classpath will be passed on to the worker as well.
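Putting it together, a run might look like this; the host, jar name, and class name are placeholders (8989 is Disco's default master port):

```
export DISCO_MASTER=http://localhost:8989
java -cp wordcount.jar WordCountClient
```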
Known Issues
- Reduce phase doesn't correctly handle the
- Outputs aren't correctly reported as relative to the jobhome.
- Java JobPack implementation is incomplete. Currently the CLI `disco job` command is used to submit jobs.
- Doesn't support including or excluding inputs based on whether they have already been downloaded.
- Doesn't handle fail, retry, or pause messages.
- Doesn't support failed inputs.
- Doesn't implement the local-filesystem optimization: inputs are fetched over HTTP even when the files are already available locally.
Repositories & Issues
Please submit bugs to the GitHub Issue Tracker.