The Scalding tutorial as a standalone SBT project

Scalding Tutorial Project


This is Twitter's tutorial for Scalding, adapted to run on Hadoop as a standalone job, i.e. without requiring scald.rb etc.

This was built as a Scala SBT project by the Concurrent Inc team in order to integrate the Scalding tutorial into the Cascading SDK. It is based on the excellent work done by Snowplow Analytics in porting the Wordcount example to SBT.

The version number of this project follows the version of the Scalding release on which it is based.

Please note that this tutorial uses Scala 2.10, not 2.9.


To use this tutorial, you need to have SBT and the hadoop command installed. Cascading, and therefore Scalding, is compatible with a number of Hadoop distributions. If you are unsure whether your distribution is compatible, please check the compatibility page.

You do not need a full Hadoop cluster to run this tutorial. Hadoop's local mode is sufficient.
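A quick way to confirm the prerequisites is to check that the tools are on the PATH. This is a minimal sketch, not part of the tutorial itself; the helper name have_tool is made up for illustration, and yarn is included only because the run command further down uses it.

```shell
# Return success if the named command is on the PATH.
# have_tool is a hypothetical helper, not something the project ships.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# sbt builds the jar; hadoop and yarn are needed to run it.
for tool in sbt hadoop yarn; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done
```

This only shows that the commands exist. It does not tell you whether your Hadoop distribution is compatible.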


Assuming you already have SBT installed:

$ git clone git://
$ cd scalding-tutorial
$ sbt assembly

The 'fat jar' is now available as:


Project structure

Some modifications have been made to the code in order for it to work properly in an SBT-based build.

  • all code is now in src/main/scala/tutorial
  • the data files for the different parts now live in data
  • the classes in the matrix tutorial have been renamed to match the file names, so that the command-line invocation is similar to the original tutorial
  • the documentation of the examples has been adapted to match the new structure
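
To confirm that a checkout has the layout described above, something like the following can be run from the repository root. It is only a sketch: check_layout is a hypothetical helper, and it tests nothing beyond the two directories named in the list.

```shell
# Return success if the two directories named above exist under $1.
# check_layout is a hypothetical helper for illustration.
check_layout() {
  root="$1"
  [ -d "$root/src/main/scala/tutorial" ] && [ -d "$root/data" ]
}

if check_layout .; then
  echo "layout looks as described"
else
  echo "src/main/scala/tutorial or data not found; run this from the repository root"
fi
```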

Running the examples

Each part of the tutorial explains how to run it properly. However, the general form is always:

$ yarn jar target/scalding-tutorial-0.14.0.jar <TutorialPart> --local <additional arguments>
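
As an illustration of how the placeholders are filled in, the general form can be wrapped in a small helper that only prints the command it would run. This is a sketch: tutorial_cmd is a hypothetical helper, and the part name Tutorial0 and the --input argument are assumptions made for the example. Each part's own documentation gives the real class name and arguments.

```shell
# Jar path as used in the general form above.
JAR="target/scalding-tutorial-0.14.0.jar"

# Print (do not run) the command line for one tutorial part.
# $1 is the part name; any further arguments are passed through unchanged.
tutorial_cmd() {
  part="$1"
  shift
  if [ "$#" -gt 0 ]; then
    echo "yarn jar $JAR $part --local $*"
  else
    echo "yarn jar $JAR $part --local"
  fi
}

# Tutorial0 and "--input data/hello.txt" are hypothetical examples.
tutorial_cmd Tutorial0
tutorial_cmd Tutorial0 --input data/hello.txt
```

Printing the command first makes it easy to check the arguments before handing the job to yarn.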

Copyright and license

Copyright 2012-2014 Concurrent Inc, with significant portions copyright 2012 Twitter, Inc. and Snowplow Analytics Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.