kevinweil/IntegerListInputFormat

An input format for divvying up a range of input values to Hadoop mappers. Set the min, max, and number of splits, and each mapper will get an approximately equal number of input values.

An adaptation of codazzo's MultiRowInputFormat at http://github.com/codazzo/MultiRow.

This input format splits a range of integers into any number of input splits for use in a Hadoop job. It is useful when you need to, for example, crawl an id space. If you want to act in parallel on input values from 19 to 500 million with 917 mappers (roughly 545,000 ids per mapper), you would configure it as follows:

  1. In your main/run method of your Hadoop job driver class, add

Job job = new Job(new Configuration());
...

job.setInputFormatClass(IntegerListInputFormat.class);

IntegerListInputFormat.setListInterval(19, 500000000);
IntegerListInputFormat.setNumSplits(917);
  2. Then, make your mapper take a LongWritable as the key and a NullWritable as the value:

public static class MyMapper extends Mapper<LongWritable, NullWritable, ..., ...> {
    protected void map(LongWritable key, NullWritable value, Context context) throws IOException, InterruptedException {
        // Do something with the id.
        long id = key.get();
        ...
    }
}

The LongWritable key is the input value. That's it!
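Putting the two steps together, a complete map-only driver might look like the sketch below. Only the IntegerListInputFormat calls come from this project; the class names CrawlIdsJob and CrawlMapper, the LongWritable/Text output types, the job name, and the output path taken from args[0] are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// plus the import for IntegerListInputFormat from this project

public class CrawlIdsJob {

    // Map-only job: each map() call receives one id from the configured range.
    // The output types and the body of map() are illustrative.
    public static class CrawlMapper
            extends Mapper<LongWritable, NullWritable, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, NullWritable value, Context context)
                throws IOException, InterruptedException {
            long id = key.get();
            // Placeholder for real work, e.g. fetching the record with this id.
            context.write(key, new Text("processed " + id));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "crawl id space");
        job.setJarByClass(CrawlIdsJob.class);

        // Hand the id range and the number of splits to the input format.
        job.setInputFormatClass(IntegerListInputFormat.class);
        IntegerListInputFormat.setListInterval(19, 500000000);
        IntegerListInputFormat.setNumSplits(917);

        job.setMapperClass(CrawlMapper.class);
        job.setNumReduceTasks(0);  // no reduce phase needed
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would run this like any other jar-based Hadoop job, passing the output directory as the single argument.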
