Redis bulk-loader for Apache Pig

RedisStorer

A StoreFunc UDF for Apache Pig that bulk-loads data into Redis. Inspired by wonderdog, the Infochimps bulk loader for Elasticsearch.

Compiling and running

Compile:

Dependencies are automatically retrieved using Ivy.

$ ant hadoop

Use:

$ pig
grunt> REGISTER dist/pig-redis.jar;
grunt> a = LOAD 'somefile.tsv' USING PigStorage('\t');
grunt> STORE a INTO 'dummy-filename-is-ignored' USING com.hackdiary.pig.RedisStorer('kv', 'localhost');

Bulkloading strategy

RedisStorer runs in one of five modes: kv, set, zset, hash and list (specified as the first argument to RedisStorer). If no mode is specified, kv is the default.

In kv mode, it takes the first field of the stored tuple as the key, and the second field as the value, and issues SET key value. Any further fields are ignored.

In set mode, it takes the first field of the stored tuple as the key, and issues SADD key value once for each subsequent field value in the tuple.

In zset mode, it takes the first field of the stored tuple as the key, and issues ZADD key score value once for each subsequent pair of fields in the tuple. The first field of each pair is interpreted as a Double score, and the second as the value itself.
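
The score/value pairing for sorted sets can be sketched as follows. This is a minimal illustration, not the RedisStorer source: `ZsetCommandSketch` is a hypothetical helper that only builds command strings, so no Redis connection is needed. How the real storer treats a trailing field without a partner is not stated here; rejecting it is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of RedisStorer.
final class ZsetCommandSketch {
    // fields: the tuple as strings, i.e. key, score1, value1, score2, value2, ...
    static List<String> commands(List<String> fields) {
        // Assumption: an unpaired trailing field is treated as an error.
        if (fields.size() < 3 || (fields.size() - 1) % 2 != 0) {
            throw new IllegalArgumentException("expected a key followed by (score, value) pairs");
        }
        String key = fields.get(0);
        List<String> out = new ArrayList<>();
        for (int i = 1; i + 1 < fields.size(); i += 2) {
            double score = Double.parseDouble(fields.get(i)); // score is read as a Double
            String value = fields.get(i + 1);
            out.add("ZADD " + key + " " + score + " " + value);
        }
        return out;
    }
}
```

For a tuple such as `(board, 1.5, alice, 2, bob)`, this sketch would produce one ZADD for `alice` and one for `bob`, both against the key `board`.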

In hash mode, it takes the first field of the stored tuple as the key, and issues HSET key fieldname value once for each subsequent field, always against that same key and taking each fieldname from the tuple's schema. It will therefore fail unless the stored tuple has a schema with named fields.
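
To make the schema dependency concrete, here is a small sketch of how field names and values might be combined in hash mode. `HashCommandSketch` and its parameters are hypothetical: `fieldNames` stands in for the names Pig would supply from the tuple's schema, and `null` models a tuple stored without one. The exception type is likewise an assumption, not the storer's actual behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of RedisStorer.
final class HashCommandSketch {
    // fieldNames: names from the schema (null if there is no schema)
    // values: the tuple as strings; values.get(0) is the Redis key
    static List<String> commands(List<String> fieldNames, List<String> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("tuple has no key field");
        }
        if (fieldNames == null || fieldNames.size() != values.size()) {
            throw new IllegalStateException("hash mode needs a schema naming every field");
        }
        String key = values.get(0);
        List<String> out = new ArrayList<>();
        for (int i = 1; i < values.size(); i++) {
            String name = fieldNames.get(i);
            if (name == null) {
                throw new IllegalStateException("field " + i + " has no name in the schema");
            }
            // Same key every time; only the hash field name and value change.
            out.add("HSET " + key + " " + name + " " + values.get(i));
        }
        return out;
    }
}
```

In Pig terms, this corresponds to loading with an explicit schema, for example an `AS (id, name, city)` clause on the LOAD statement, so that each column has a name to use as the hash field.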

In list mode, it takes the first field of the stored tuple as the key, and issues LPUSH key value once for each subsequent field value in the tuple.
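
The set and list modes share the same shape: one command per field after the key, differing only in the Redis command issued. The sketch below illustrates that shape under the assumption that fields are already strings; `MultiValueCommandSketch` is a hypothetical helper and does not reflect how RedisStorer is actually structured.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of RedisStorer.
final class MultiValueCommandSketch {
    // mode: "set" or "list"; fields: key followed by one or more values
    static List<String> commands(String mode, List<String> fields) {
        String verb;
        switch (mode) {
            case "set":  verb = "SADD";  break;
            case "list": verb = "LPUSH"; break;
            default: throw new IllegalArgumentException("unsupported mode: " + mode);
        }
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("tuple has no key field");
        }
        String key = fields.get(0);
        List<String> out = new ArrayList<>();
        for (int i = 1; i < fields.size(); i++) {
            out.add(verb + " " + key + " " + fields.get(i)); // one command per value
        }
        return out;
    }
}
```

Because LPUSH prepends to the list, values pushed one at a time in field order end up in reverse order when read back from the head of the list with `LRANGE key 0 -1`.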