Description: Python API that allows you to easily write and run MapReduce programs.

Running programs

Local run:
python program.py -input input.txt -output output.txt 
python -m dumbo cat output.txt | more
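This page does not show program.py itself. As a rough illustration only, the sketch below assumes a word-count job written as a generator-style mapper and reducer, and uses a hypothetical pure-Python helper, run_locally, to mimic the map, sort, reduce sequence on a few lines of text. The helper is a stand-in for illustration, not Dumbo's implementation, and the function signatures are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # key: position of the line (unused here); value: the line text.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # values: iterator over all counts emitted for this word.
    yield key, sum(values)


def run_locally(lines, mapper, reducer):
    """Hypothetical stand-in for a local run: map, sort by key, reduce."""
    mapped = []
    for offset, line in enumerate(lines):
        mapped.extend(mapper(offset, line))
    # Sort so that identical keys become adjacent, as groupby requires.
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (v for _, v in group)))
    return results


print(run_locally(["a b a", "b c"], mapper, reducer))
```

Skipping the sort and reduce steps in run_locally would leave only the raw mapper output, which is the behaviour the -numreducetasks 0 option describes.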
Distributed run on Hadoop:
python program.py -hadoop <path to local hadoop> \
-input <DFS input path> -output <DFS output path> [<options>]
python -m dumbo cat <DFS output path> -hadoop <path to local hadoop> | more
Options (see also the Hadoop streaming page and wiki):
  • -input <additional DFS input path>
  • -python <python command to use on nodes> (“python” by default)
  • -name <job name> (“program.py” by default)
  • -nummaptasks <number>
  • -numreducetasks <number> (no sorting or reducing will take place if this is 0)
  • -priority <priority value>
  • -libjar <path to jar> (this jar gets put in the class path)
  • -libegg <path to egg> (this egg gets put in the Python path)
  • -file <local file> (this file is placed in the directory where the Python program gets executed)
  • -cacheFile hdfs://<host>:<fs_port>/<path to file>#<link name> (a link “<link name>” to the given file is created in that directory)
  • -cacheArchive hdfs://<host>:<fs_port>/<path to jar>#<link name> (the link points to a directory that contains the files from the given jar)
  • -inputformat <name of an InputFormat class> (“TextInputFormat” by default)
  • -cmdenv <env var name>=<value>
  • -jobconf <property name>=<value>
  • -fake yes (fake run, only prints the underlying shell commands but does not actually execute them)
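When several options are combined, it can help to assemble the command line programmatically. The following is a minimal sketch, assuming a hypothetical helper named build_command; the paths and job name are placeholders, and only flags listed above are used. Passing -fake yes makes the run print the underlying shell commands instead of executing them, which is a convenient way to check the result.

```python
import shlex


def build_command(program, input_paths, output_path, hadoop=None, options=None):
    """Assemble the argument list for a run (hypothetical helper)."""
    args = ["python", program]
    if hadoop is not None:
        # Only distributed runs take the path to the local Hadoop install.
        args += ["-hadoop", hadoop]
    for path in input_paths:
        # -input may be repeated to add further input paths.
        args += ["-input", path]
    args += ["-output", output_path]
    for flag, value in (options or []):
        args += ["-" + flag, str(value)]
    return args


# Placeholder paths and job name, chosen only for illustration.
cmd = build_command(
    "program.py",
    ["/data/in1", "/data/in2"],
    "/data/out",
    hadoop="/opt/hadoop",
    options=[("numreducetasks", 0), ("name", "wordcount"), ("fake", "yes")],
)
print(" ".join(shlex.quote(a) for a in cmd))
```

The returned list can be handed to subprocess.run as-is; the joined string is only for display.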
Last edited by klbostee, 6 days ago