The goal is to expand on our Word Count example in Cascading, this time adding a "stop words" list of tokens to nix from the stream. We'll use a join in Cascading to perform that at scale.
We'll keep building on this example until we have a MapReduce implementation of TF-IDF.
More detailed background information and step-by-step documentation are provided at https://github.com/ConcurrentCore/impatient/wiki
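As a rough illustration of the stop-word idea described above, here is a minimal sketch in plain Python. It is not the Cascading code from this sample app. It only mimics the logic of a left join between the token stream and the stop list, followed by keeping the tokens that found no match. The function names, the tokenizing regex, and the lowercasing step are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase tokens (hypothetical regex, assumed for this sketch)."""
    return [tok.lower() for tok in re.split(r"[ \[\]\(\),.]+", text) if tok]


def word_count_without_stop_words(lines, stop_words):
    """Count tokens in `lines`, dropping any token found in `stop_words`."""
    # Left side of the join: every token in the stream.
    tokens = [tok for line in lines for tok in tokenize(line)]
    # Right side of the join: the stop-word list, keyed by token.
    stop = set(stop_words)
    # Left join: pair each token with its matching stop word, or None if no match.
    joined = [(tok, tok if tok in stop else None) for tok in tokens]
    # Keep only the tokens whose right-hand side is empty (not a stop word).
    kept = [tok for tok, match in joined if match is None]
    return Counter(kept)


if __name__ == "__main__":
    sample_lines = ["A rain shadow is a dry area", "the rain is dry"]
    sample_stop = ["a", "is", "the"]
    for token, count in sorted(word_count_without_stop_words(sample_lines, sample_stop).items()):
        print(token, count)
```

In a real flow the stop list would typically be the small side of the join, so it can be held in memory while the token stream is processed at scale; the set lookup above stands in for that.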
To generate an IntelliJ project use:
To build the sample app from the command line use:
gradle clean jar
Before running this sample app, be sure to set your
HADOOP_HOME environment variable. Then clear the
output directory and run the app on a desktop/laptop with Apache Hadoop in standalone mode:
rm -rf output
hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc data/en.stop
To view the results:
To run the Pig version of the script:
rm -rf output
mkdir -p dot
pig -p docPath=./data/rain.txt -p wcPath=./output/wc -p stopPath=./data/en.stop ./src/scripts/wc.pig
To run the Hive version of the script:
rm -rf derby.log metastore_db/
hive -hiveconf hive.metastore.warehouse.dir=/tmp < src/scripts/wc.q
An example log captured from a successful build and run is at https://gist.github.com/3043745
For more discussion, see the cascading-user email forum.
Stay tuned for the next installments of our Cascading for the Impatient series.