Cascading 2.7 SDK
This project contains the build scripts for creating the Cascading SDK release.
The SDK distribution, available online, includes Cascading 2.6 and related projects in a single archive.
Every project or extension included in this SDK has been tested independently before being downloaded and included in the package created by these build scripts.
This SDK includes
The following paragraphs describe those subdirectories.
binary subdirectory contains the latest version of all cascading jars,
their dependencies and the javadoc.
source subdirectory includes the latest release of Cascading and many sample applications and any related API
docs. It also contains a template project for creating a Cascading application.
docs subdirectory includes the Cascading User Guide, the Lingual User Guide, the source and docs of the tutorial
series "Cascading for the Impatient", the Lingual HBase
tutorial the Cascading AWS
tutorial, the Cascading Teradata
tutorial, the Scalding Data
Processing tutorial, the ETL with Cascading tutorial a Pattern
tutorial and Integrating Cascading with Map Reduce APIs.
driven directory contains an installer for the Driven plugin for Cascading. Type
install-driven-plugin to install the latest version.
preview subdirectory includes projects from the Cascading eco-system, that are not yet released. Projects in the
preview can change between versions of the SDK or move out of the this directory or dissapear. Please do not rely on
them for production code.
Pattern is a Cascading library and framework for machine learning model scoring at scale on Apache Hadoop.
The SDK includes the source code of scalding as well as a ready to use tutorial project to get you started with scalding.
The scalding code is in
thirdparty/source/scalding-src and the tutorial in
Note: In order to follow the tutorial you have to have
SBT installed. For more
Cascalog is fully-featured data processing and querying library for Clojure or Java.
The SDK includes the source code of cascalog and a ready to go project for the cascalog tutorial. You find the source
thirdparty/source/cascalog-src and the tutorial in
Note: In order to follow the cascalog tutorial you have to have
leiningen 2 installed.
For more information see
tools subdirectory includes prebuilt command line tools. To add them to your PATH, call:
> export CASCADING_SDK_HOME=<sdk install path> > source $CASCADING_SDK_HOME/etc/setenv.sh
The tools you will find are
Multitool provides a simple command line interface for building data processing jobs. Think of this as grep, sed, and awk for Hadoop, which also supports joins between multiple data-sets.
Lingual is true SQL for Cascading and Apache Hadoop. Lingual includes JDBC Drivers, SQL command shell, and a catalog manager for creating schemas and tables.
Load provides a simple command line interface for building high load cluster jobs, based on Cascading.
Downloading the latest SDK release
To download a current copy of the Cascading SDK release, execute the following command:
> wget -i http://files.cascading.org/sdk/2.7/latest.txt
Cascading 2.7 supports 3 platforms:
hadoop2-mr1. The easiest way to make all tools in the SDK
use the same platform is by creating a properties file in your home directory.
> mkdir $HOME/.cascading && echo "cascading.platform.name=<insert-platform-here>" >> $HOME/.cascading/default.properties
All tools support overwriting this platform selection via command line switches if necessary.
Installing on AWS Elastic MapReduce
To pre-install the SDK on a new instance of AWS EMR, use the following bootstrap action:
This will download the latest SDK, unarchive it into the default user home directory, and add any tools to the PATH.
This bootstrap action has the following arguments:
# Usage: # --user-home - an alternative user to install into, default /home/hadoop # --tmpdir - an alternative temporary directory, default TMPDIR or /tmp if not set # --no-screen - do not install screen, screen is installed by default on the master as a convenience # --latest - url to text file referencing the latest version # --no-bash - do not update .bashrc # --driven-api-key - api key to use with driven # --driven-host - url of the driven instance
When using a bootstrap action with the EMR ruby client, remember to use commas instead of spaces between arguments:
--bootstrap-action s3://files.cascading.org/sdk/2.7/install-cascading-sdk.sh --args "--tmpdir,/tmp"
If you pass an API key for Driven to the bootstrap action the driven plugin will
automatically be installed on EMR. If you don't pass it, you can install the driven plugin later by running
The boostrap action will detect if your cluster is based on Hadoop 1.x or 2.x and will set the Cascading platform accordingly.
This SDK was created to make it easier for third-parties to redistribute Cascading and any related documentation and sub-projects.
To get the latest redistributable package follow the download instructions above.
Because the SDK includes many pre-built binaries and tools it is important that each third-party shipping this SDK along with their own Hadoop distribution "certify" their distribution as being both binary and functionally compatible.
This SDK includes the Cascading Compatibility project under
tools/compatibility. This project is also
available on GitHub at https://github.com/Cascading/cascading.compatibility.
Please follow the instructions on the README.md file to run the binary compatibility test.
To test Cascading on a running cluster, this SDK includes the Cascading Load project under
This project is also available on GitHub at https://github.com/Cascading/cascading.load.
Please run the Load application on a given cluster with the following switches:
> load --certify-tests -I input -O output -W working
This will execute a set of preconfigured work loads to verify any standard Cascading application can run at scale.
Building the SDK
To build a local version of the SDK package, execute the following command:
> gradle packageDist
Note downloading the pre-built release is the preferred method.