Skip to content

Latest commit

 

History

History
31 lines (23 loc) · 1.11 KB

README.md

File metadata and controls

31 lines (23 loc) · 1.11 KB

An adaptation of codazzo's MultiRowInputFormat at http://github.com/codazzo/MultiRow.

This input format splits a range of integers into any number of input splits for use in a Hadoop job. This is useful when you need to, for example, crawl an id space. If you want to act in parallel on input values from 19 to 500 million with 917 mappers, you would configure as follows:

  1. In your main/run method of your Hadoop job driver class, add

Job job = new Job(new Configuration());
...

job.setInputFormatClass(IntegerListInputFormat.class);

IntegerListInputFormat.setListInterval(19, 500000000);
IntegerListInputFormat.setNumSplits(917);
  1. Then, make your mapper take a LongWritable as the key and a NullWritable as the value:

public static class MyMapper extends Mapper<LongWritable, NullWritable, ..., ...> {
    protected void map(LongWritable key, NullWritable value, Context context) throws IOException, InterruptedException {
        // Do something with the id.
        long id = key.get();
        ...
    }
}

The LongWritable key is the input value. That's it!