A half-day workshop on Scalding, the Scala API for Cascading


Scalding Workshop README

Copyright (C) 2010-2014 Think Big Analytics, Inc. All Rights Reserved.

StrangeLoop 2012
Dean Wampler, Think Big Analytics
Hire Us!

About this Workshop

This workshop is a half-day tutorial on Scalding and its place in the Hadoop ecosystem. Scalding is a Scala API developed at Twitter for distributed data programming that uses the Cascading Java API, which in turn sits on top of Hadoop's Java API. However, Scalding, through Cascading, also offers a local mode that makes it easy to run jobs without using the Hadoop libraries, for simpler testing and learning. We'll use this feature for most of this workshop.

Getting Started

To keep the setup process as simple as possible, the workshop git repo contains a pre-built jar that bundles Scalding v0.7.3 for Scala v2.9.2 and other required jars, such as Cascading, Hadoop core, Log4J, etc. So, all you need to install is Java, Scala, Ruby, and this workshop.

It helps to pick a work directory where you will install some of the packages. In what follows, we'll assume you're using $HOME/fun with the bash shell (or a similar shell) on Linux, Mac OS X, or Cygwin for Windows, or C:\fun on native Windows.


Git

You'll need git to clone the workshop repository and optionally for other installs. See here for details. As an alternative, you can download a workshop release from its GitHub repo, rather than clone it.

This Workshop

Download or clone this workshop from GitHub.

To clone this workshop from GitHub using bash:

cd $HOME/fun
git clone https://github.com/thinkbiganalytics/scalding-workshop

On Windows:

cd C:\fun
git clone https://github.com/thinkbiganalytics/scalding-workshop

Or, simply download a release.

Java v1.6 or Better

Install Java if necessary from here.

Scala v2.9.2

Scalding uses Scala v2.9.2. Install it from here.

Ruby v1.8.7 or v1.9.X

Scalding uses Ruby as a platform-independent language for its driver scripts, and we've followed the same convention. See ruby-lang.org for details on installing Ruby. Either version 1.8.7 or 1.9.X will work.
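
Before running the sanity check, it can help to confirm that the prerequisites are on your PATH. The following is a minimal sketch for bash or a similar shell; the check_tools function is a hypothetical helper, not part of the workshop scripts, and it only checks that each command can be found, not which version is installed.

```shell
# Report whether each named tool can be found on the PATH.
# check_tools is a hypothetical helper, not part of the workshop repo.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  # Non-zero exit status if any tool was not found.
  return "$missing"
}

# The tool list mirrors the prerequisites described above.
check_tools git java scala ruby || echo "Install the missing tools before continuing."
```

To see the exact versions, run java -version, scala -version, ruby --version, and git --version.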

Sanity Check

Once you've completed these steps, run the following commands as a sanity check to ensure that everything is set up properly. Using bash:

cd $HOME/fun/scalding-workshop
./run.rb scripts/SanityCheck0.scala

On Windows:

cd C:\fun\scalding-workshop
ruby run.rb scripts/SanityCheck0.scala

The commands should run without error. Note that it takes a moment to compile the Scala script and run to completion. The output is written to output/SanityCheck0.txt. What's in that file?
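
If you are not sure whether the job actually produced anything, a quick check along these lines may help. It is a small sketch for bash, run from the scalding-workshop directory; show_output is a hypothetical helper, not part of the workshop scripts.

```shell
# Print the first few lines of a job's output file,
# or a hint if the file is missing or empty.
# show_output is a hypothetical helper, not part of the workshop repo.
show_output() {
  out="$1"
  if [ -s "$out" ]; then
    head -n 5 "$out"
  else
    echo "No output found at $out; check the console for errors."
  fi
}

show_output output/SanityCheck0.txt
```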

Optional Installs

If you're serious about using Scalding, you should clone and build the Scalding repo. We'll talk briefly about it in the workshop, but it isn't required.

SBT v0.11

SBT is the de facto build tool for Scala. You'll need it to build Scalding. Follow these installation instructions.

Scalding from GitHub

Clone Scalding from GitHub. Using bash:

cd $HOME/fun
git clone https://github.com/twitter/scalding.git

On Windows:

cd C:\fun
git clone https://github.com/twitter/scalding.git

Build Scalding

Build Scalding according to its Getting Started page. Here is a synopsis of the steps. Using bash:

cd $HOME/fun/scalding
sbt update
sbt assembly

On Windows:

cd C:\fun\scalding
sbt update
sbt assembly

(The Getting Started page says to build the test target between update and assembly, but the latter builds test itself.)

Sanity Check

Once you've built Scalding, run the following command as a sanity check to ensure everything is set up properly. Using bash:

cd $HOME/fun/scalding
scripts/scald.rb --local tutorial/Tutorial0.scala

On Windows:

cd C:\fun\scalding
ruby scripts\scald.rb --local tutorial/Tutorial0.scala

Next Steps

The Workshop/Tutorial proper is described in the companion Workshop document.

Notes on Releases


Added missing file to distribution. Refined the run scripts to work better with different Java versions.


Refined several exercises and fixed bugs. Added Makefile for building releases.


First release for StrangeLoop 2012 workshop.

For Further Information

See the Scalding GitHub page for more information about Scalding. The wiki is very useful.

Dean Wampler from Think Big Analytics prepared this workshop. Contact Dean with questions about the workshop. For information about consulting and training on Scalding and other Hadoop-related topics, send us email.

Some of the data used in these exercises was obtained from InfoChimps.

Dean Wampler