Cascading for the Impatient, Part 2

The goal is to expand on the simplest possible Cascading 2.1 app to implement a Word Count example, while following best practices.

We'll keep building on this example until we have a MapReduce implementation of TF-IDF.
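To make the target concrete, here is a rough sketch of the kind of result a word count produces, written with standard Unix tools rather than Cascading. The function name word_count and the whitespace-only tokenizing are assumptions for illustration; the sample app's own tokenizing rules may differ.

```shell
#!/bin/sh
# Sketch only: counts whitespace-separated tokens read from stdin.
# This is NOT the Cascading flow; word_count is a hypothetical helper
# that only illustrates the token/count shape of the output.
word_count() {
  tr -s '[:space:]' '\n' |                 # put one token on each line
    sed '/^$/d' |                          # drop any empty lines
    sort |                                 # bring identical tokens together
    uniq -c |                              # count each run of identical tokens
    awk '{ printf "%s\t%s\n", $2, $1 }'    # emit token<TAB>count
}
```

For example, `word_count < data/rain.txt` would print one tab-separated token and count per line, which can be compared loosely against the app's output.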

More detailed background information and step-by-step documentation are provided at https://github.com/ConcurrentCore/impatient/wiki

Build Instructions

To generate an IntelliJ project use:

gradle ideaModule

To build the sample app from the command line use:

gradle clean jar

Before running this sample app, be sure to set your HADOOP_HOME environment variable. Then clear the output directory and run the app on a desktop/laptop with Apache Hadoop in standalone mode:

rm -rf output
hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc
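
The two commands above can be guarded by a small pre-flight check so that a missing HADOOP_HOME is reported before anything is deleted. This is only a sketch: check_hadoop_home and run_wc are hypothetical helper names, not part of the repository, and the check only tests that the variable is non-empty.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when HADOOP_HOME is set and non-empty.
check_hadoop_home() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
}

# Hypothetical wrapper around the commands from this README:
# clear the old output, then run the jar in standalone mode.
run_wc() {
  check_hadoop_home || return 1
  rm -rf output
  hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc
}
```

Calling `run_wc` from the project root would then either stop with a message on stderr or run the same steps shown above.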

To view the results:

more output/wc/part-00000

An example log captured from a successful build and run is at https://gist.github.com/3020297

Amazon AWS Elastic MapReduce

To run this Cascading app on the Amazon AWS cloud, you'll need an AWS account with credentials set up, probably in your ~/.aws_cred/ directory.

Then install these two excellent AWS tools:

Next, edit the emr.sh shell script to update the BUCKET variable for one of your S3 buckets. Now upload your JAR and run it on Elastic MapReduce using the emr.sh shell script.

Apache Pig Comparison

To run the Pig version of the script, make sure PIG_HOME is set and run:

rm -rf output
mkdir -p dot
pig -p docPath=./data/rain.txt -p wcPath=./output/wc ./src/scripts/wc.pig

More Info

For more discussion, see the cascading-user email forum.

Stay tuned for the next installments of our Cascading for the Impatient series.
