| name | age | message |
|---|---|---|
| .gitignore | Mon Jun 22 03:31:11 -0700 2009 | |
| LICENSE.txt | Tue Feb 17 11:48:51 -0800 2009 | |
| README-tutorial.textile | Sun Mar 01 22:14:24 -0800 2009 | |
| README.textile | Tue Feb 24 22:16:16 -0800 2009 | |
| Rakefile | Wed Apr 08 18:52:55 -0700 2009 | |
| TODO.textile | Wed Apr 08 19:03:36 -0700 2009 | |
| VERSION.yml | Wed Apr 08 18:52:55 -0700 2009 | |
| bin/ | Mon Jun 22 07:47:48 -0700 2009 | |
| config/ | Sun Mar 15 09:43:55 -0700 2009 | |
| doc/ | Tue Jun 23 17:16:04 -0700 2009 | |
| examples/ | Tue Jun 23 17:14:36 -0700 2009 | |
| lib/ | Tue Jun 23 12:55:12 -0700 2009 | |
| spec/ | Fri Feb 13 23:11:35 -0800 2009 | |
| wukong.gemspec | Wed Apr 08 18:52:55 -0700 2009 | |
Wukong
Wukong makes Hadoop so easy a chimpanzee can use
it.
Treat your dataset like a
- stream of lines when it’s efficient to process by lines
- stream of field arrays when it’s efficient to deal directly with fields
- stream of lightweight objects when it’s efficient to deal with objects
Wukong is friends with Hadoop the elephant,
Pig the query language, and the cat on your
command line.
How to write a Wukong script
Here’s a script to count words in a text stream:
<pre>
require 'wukong'
module WordCount
  class Mapper < Wukong::Streamer::LineStreamer
    # Emit each word in the line.
    def process line
      words = line.strip.split(/\W+/).reject(&:blank?)
      words.each{|word| yield [word, 1] }
    end
  end
end
</pre>
The first class, the Mapper, eats lines and craps [word, count] records: word
is the /key/, its count is the /value/.
In the reducer, the values for each key are stacked up into a list; then the
record(s) yielded by #finalize are emitted. There are many other ways to write
the reducer (most of them are better); see the examples/ directory.
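As a rough picture of that map, sort, reduce flow, here is a plain-Ruby sketch that uses neither Wukong nor Hadoop. The names `map_line` and `reduce_counts` are made up for this illustration, `empty?` stands in for `blank?`, and the in-memory `sort` stands in for the sort Hadoop performs between the two stages.

```ruby
# Plain-Ruby illustration only; map_line and reduce_counts are hypothetical
# names, not part of Wukong.

# Mapper step: turn one line into [word, 1] records.
def map_line(line)
  line.strip.split(/\W+/).reject(&:empty?).map { |word| [word, 1] }
end

# Reducer step: records arrive sorted by key, so equal keys are adjacent.
# Stack up the values for each key, then emit one [key, total] record.
def reduce_counts(sorted_records)
  sorted_records
    .chunk_while { |a, b| a.first == b.first }
    .map { |group| [group.first.first, group.sum { |_key, count| count }] }
end

lines   = ["the cat sat", "the cat ran"]
records = lines.flat_map { |line| map_line(line) }.sort # stands in for Hadoop's sort
counts  = reduce_counts(records)
counts.each { |word, count| puts [word, count].join("\t") }
```

Because the records are sorted before they reach the reducer, each key's values can be summed in one pass without holding the whole dataset in a hash.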
Structured data stream
You can also use structs to treat your dataset as a stream of objects:
<pre>
require 'wukong'
require 'my_blog' # defines the blog models

# structs for our input objects
Tweet = Struct.new( :id, :created_at, :twitter_user_id,
  :in_reply_to_user_id, :in_reply_to_status_id, :text )
TwitterUser = Struct.new( :id, :username, :fullname,
  :homepage, :location, :description )

module TwitBlog
  class Mapper < Wukong::Streamer::RecordStreamer
    # Watch for tweets by me
    MY_USER_ID = 24601
    #
    # If this is a tweet by me, convert it to a Post.
    #
    # If it is a tweet not by me, convert it to a Comment that
    # will be paired with the correct Post.
    #
    # If it is a TwitterUser, convert it to a User record and
    # a user_location record
    #
    def process record
      case record
      when TwitterUser
        user     = MyBlog::User.new.merge(record) # grab the fields in common
        user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
        yield user
        yield user_loc
      when Tweet
        if record.twitter_user_id == MY_USER_ID
          post = MyBlog::Post.new.merge record
          post.link  = "http://twitter.com/statuses/show/#{record.id}"
          post.body  = record.text
          post.title = record.text[0..65] + "…"
          yield post
        else
          comment = MyBlog::Comment.new.merge record
          comment.body    = record.text
          comment.post_id = record.in_reply_to_status_id
          yield comment
        end
      end
    end
  end
end

Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
</pre>
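The `merge` call on a Struct is not core Ruby, so it is presumably an extension loaded alongside Wukong. To show what "grab the fields in common" could look like, here is a standalone sketch. `copy_common_fields` and `BlogUser` are hypothetical names invented for the example, and the sample values are made up.

```ruby
# Illustration only: copy_common_fields is a hypothetical helper, and BlogUser
# is a stand-in for a model such as MyBlog::User.
TwitterUser = Struct.new(:id, :username, :fullname, :homepage, :location, :description)
BlogUser    = Struct.new(:id, :username, :homepage, :signup_date)

# Copy every field that both structs define from source into target.
def copy_common_fields(target, source)
  (target.members & source.members).each do |field|
    target[field] = source[field]
  end
  target
end

twitter_user = TwitterUser.new(7, "monkeyking", "Sun Wukong",
                               "http://example.com", "Mount Huaguo", "simian")
blog_user    = copy_common_fields(BlogUser.new, twitter_user)
puts blog_user.to_a.inspect
```

Fields that exist only on the target (here `signup_date`) stay `nil`, and fields that exist only on the source are ignored.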
More info
There are many useful examples (including an actually-useful version of the
WordCount script) in the examples/ directory.
Setup
1. Allow Wukong to discover where his elephant friend lives: either
- set a $HADOOP_HOME environment variable,
- or create a file ‘config/wukong-site.yaml’ with a line that points to the
top-level directory of your hadoop install:
2. Add wukong’s bin/ directory to your $PATH, so that you may use its
filesystem shortcuts.
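The lookup in step 1 can be pictured roughly like this. It is only a sketch: `resolve_hadoop_home` is a hypothetical helper, the `hadoop_home` key name is an assumption made for the example, and letting the environment variable win over the file is also an assumption. Check Wukong's own config handling for the real behavior.

```ruby
require 'yaml'

# Illustration only: resolve_hadoop_home and the 'hadoop_home' key are
# assumptions made for this sketch, not Wukong's documented behavior.
def resolve_hadoop_home(env: ENV, config_path: 'config/wukong-site.yaml')
  # Assumed precedence: a non-empty $HADOOP_HOME wins.
  from_env = env['HADOOP_HOME']
  return from_env unless from_env.nil? || from_env.empty?

  # Otherwise fall back to the site config file, if there is one.
  return nil unless File.exist?(config_path)
  config = YAML.load_file(config_path)
  config.is_a?(Hash) ? config['hadoop_home'] : nil
end

puts resolve_hadoop_home.inspect
```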
How to run a Wukong script
To run your script using local files and no connection to a Hadoop cluster:
<pre>
your/script.rb --run=local path/to/input_files path/to/output_dir
</pre>
To run the command across a Hadoop cluster:
<pre>
your/script.rb --run=hadoop path/to/input_files path/to/output_dir
</pre>
You can set the default in the config/wukong-site.yaml file, and then just use
--run instead of --run=something; it will use the default run mode.
If you’re running --run=hadoop, all file paths are HDFS paths. If you’re
running --run=local, all file paths are local paths. (The your/script path, of
course, always lives on the local filesystem.)
You can supply arbitrary command line arguments (they wind up as key-value pairs
in the options hash your mapper and reducer receive), and you can use the hadoop
syntax to specify more than one input file:
Note that all --options must precede (in any order) all non-options.
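To make the options-before-paths rule concrete, here is a small sketch of how such a command line could be split. `split_args` is a hypothetical helper, not Wukong's actual parser, and `--min_count` and `--verbose` are made-up options.

```ruby
# Illustration only: split_args is a hypothetical helper, not Wukong's parser.
# It collects leading --key=value (or bare --flag) arguments into a Hash and
# treats everything from the first non-option onward as a path.
def split_args(argv)
  options = {}
  rest = argv.drop_while do |arg|
    match = arg.match(/\A--([^=]+)(?:=(.*))?\z/)
    next false unless match
    options[match[1].to_sym] = match[2].nil? ? true : match[2]
    true
  end
  [options, rest]
end

options, paths = split_args(%w[--run=local --min_count=3 --verbose in.tsv out_dir])
puts options.inspect
puts paths.inspect
```

In this sketch, anything after the first path is kept as a path even if it starts with `--`, which is one reason to put every option first.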
How to test your scripts
To run the mapper on its own:
<pre>
cat ./local/test/input.tsv | ./examples/word_count.rb --map | more
</pre>
or, if your test data lies on the HDFS,
<pre>
hdp-cat test/input.tsv | ./examples/word_count.rb --map | more
</pre>
Next, graduate to running in --run=local mode so you can inspect the reducer.
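You can rehearse the same pipe in memory before touching any files. The sketch below assumes the mapper reads lines from an input stream and writes one tab-separated record per line; `run_mapper` is a hypothetical stand-in for the `--map` step, not Wukong code.

```ruby
require 'stringio'

# Illustration only: run_mapper is a hypothetical stand-in for what
# `script.rb --map` does with its input stream; it is not Wukong code.
def run_mapper(input, output)
  input.each_line do |line|
    line.strip.split(/\W+/).reject(&:empty?).each do |word|
      output.puts [word, 1].join("\t") # one tab-separated record per line
    end
  end
end

input  = StringIO.new("hello world\nhello again\n")
output = StringIO.new
run_mapper(input, output)
puts output.string
```

Swapping the two `StringIO` objects for `$stdin` and `$stdout` gives a script you can feed with `cat`, like the commands above.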
What’s up with Wukong::AndPig?
Wukong::AndPig is a small library to more easily generate code for the
Pig data analysis language. See its
README for more.
Why is it called Wukong?
Hadoop, as you may know, is named after a stuffed
elephant. Since Wukong was started by the
infochimps team, we needed a simian analog. A Monkey
King who journeyed to the land of the Elephant seems to fit the bill:
Sun Wukong (孙悟空), known in the West as the Monkey King, is the main
character in the classical Chinese epic novel Journey to the West. In the novel,
he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from
India.
Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
(8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling
108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations,
which allows him to transform into various animals and objects; he is, however,
shown with slight problems transforming into other people, since he is unable to
complete the transformation of his tail. He is a skilled fighter, capable of
holding his own against the best generals of heaven. Each of his hairs possesses
magical properties, and is capable of transforming into a clone of the Monkey
King himself, or various weapons, animals, and other objects. He also knows
various spells in order to command wind, part water, conjure protective circles
against demons, freeze humans, demons, and gods alike. — Sun Wukong’s
Wikipedia entry
The Jamie Hewlett / Damon Albarn
short that the BBC made for
their 2008 Olympics coverage gives the general idea.
What tools does Wukong work with?
Wukong is friends with Hadoop the elephant,
Pig the query language, and the cat on your
command line. We’re looking forward to being friends with
martinis and express
trains down the road.







