The goal is to expand on the simplest Cascading 2.1 app possible, while following best practices, to implement a Word Count example.
We'll keep building on this example until we have a MapReduce implementation of TF-IDF.
More detailed background information and step-by-step documentation are provided at https://github.com/ConcurrentCore/impatient/wiki
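As a rough illustration of what a word count computes, here is a minimal shell sketch. It is not the Cascading flow from this sample app: `word_count` is a hypothetical helper, and it assumes tokens are split on whitespace only, so its counts may differ from the app's output wherever the app handles punctuation differently.

```shell
# Hypothetical helper, not part of the sample app.
# Prints "token<TAB>count" for each whitespace-delimited token in file $1.
word_count() {
  tr -s '[:space:]' '\n' < "$1" |   # put one token on each line
    grep -v '^$' |                  # drop the empty line left by leading whitespace
    LC_ALL=C sort |                 # group identical tokens together
    uniq -c |                       # prefix each distinct token with its count
    awk '{ print $2 "\t" $1 }'      # reorder to token, tab, count
}
```

For a quick local look at the sample input, one could run `word_count data/rain.txt | head`.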
To generate an IntelliJ project use:
To build the sample app from the command line use:
gradle clean jar
Before running this sample app, be sure to set your HADOOP_HOME environment variable.
Then clear the output directory and run the app on a desktop/laptop with Apache Hadoop in standalone mode:
rm -rf output
hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc
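The environment-variable precondition can be checked before launching anything. The following is a minimal sketch, assuming a POSIX-style shell; `require_env` is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper, not part of the sample app.
# Returns non-zero and prints a message if the named variable is unset or empty.
require_env() {
  _re_name=$1
  # Indirect lookup of the variable whose name is in $_re_name;
  # the ":-" default keeps this safe when the variable is unset.
  eval "_re_value=\${$_re_name:-}"
  if [ -z "$_re_value" ]; then
    echo "error: $_re_name is not set" >&2
    return 1
  fi
}
```

It could guard the run step, for example `require_env HADOOP_HOME && hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc`, and the same check applies to PIG_HOME for the Pig version.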
To view the results:
An example log captured from a successful build and run is at https://gist.github.com/3020297
To run this Cascading app on the Amazon AWS cloud, you'll need an AWS account with credentials set up --
probably in your
Then install these two excellent AWS tools:
Next, edit the emr.sh shell script to update the BUCKET variable for one of your S3 buckets.
Now upload your JAR and run it on Elastic MapReduce using the emr.sh shell script.
To run the Pig version of the script, make sure PIG_HOME is set and run:
rm -rf output
mkdir -p dot
pig -p docPath=./data/rain.txt -p wcPath=./output/wc ./src/scripts/wc.pig
For more discussion, see the cascading-user email forum.
Stay tuned for the next installments of our Cascading for the Impatient series.