source examples to support the "Cascading for the Impatient" blog post series
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.

Cascading for the Impatient

Welcome to Cascading for the Impatient, a tutorial for Cascading 3.1.x to get you started. Quickly. Like, yesterday.

This set of progressive coding examples starts with a simple file copy and builds up to a MapReduce implementation of the TF-IDF algorithm.

You can read the full series here:

If you have a question or run into any problems send an email to the cascading-user-list.

Part 1

  • Implements simplest Cascading app possible
  • Copies each TSV line from source tap to sink tap
  • Roughly, in about a dozen lines of code
  • Physical plan: 1 Mapper

Part 2

  • Implements a simple example of WordCount
  • Uses a regex to split the input text lines into a token stream
  • Generates a DOT file, to show the Cascading flow graphically
  • Physical plan: 1 Mapper, 1 Reducer

Part 3

  • Uses a custom Function to scrub the token stream
  • Discusses when to use standard Operations vs. creating custom ones
  • Physical plan: 1 Mapper, 1 Reducer

Part 4

  • Shows how to use a HashJoin on two pipes
  • Filters a list of stop words out of the token stream
  • Physical plan: 1 Mapper, 1 Reducer

Part 5

  • Calculates TF-IDF using an ExpressionFunction
  • Shows how to use a CountBy, SumBy, and a CoGroup
  • Physical plan: 10 Mappers, 8 Reducers

Part 6

  • Includes unit tests in the build
  • Shows how to use other TDD features: checkpoints, assertions, traps, debug
  • Physical plan: 11 Mappers, 8 Reducers

Part 7

This example is currently not implemented.

Part 8

  • Scalding equivalents of previous examples in Cascading