Quantisan edited this page Oct 6, 2012 · 28 revisions

Cascalog for the Impatient

Welcome to Cascalog for the Impatient, a series of blog posts and Cascalog code examples to get you started. Quickly. Like, yesterday.

Use this tutorial in conjunction with Cascading for the Impatient.

Part 1

  • Implements the simplest possible Cascalog query
  • Copies each TSV line from source to sink
  • Takes about a dozen lines of code

Part 2

  • Implements a simple example of WordCount
  • Uses a regex to split the input text lines into a Tuple stream of tokens
  • Uses a built-in Cascalog operator
  • Introduces logic programming

Part 3

  • Uses a Clojure function to scrub the token stream
  • Discusses when to use standard Operations vs. creating custom ones

Part 4

  • Shows how to join sources together
  • Filters a list of stop words out of the token stream
  • Uses a Predicate Macro

Part 5

  • Calculates TF-IDF by abstracting the problem into sub-queries
  • Composes the results of these sub-queries into the TF-IDF formula
  • Uses a special query executor that caches query results in memory

Part 6

  • Includes unit tests in the build
  • Uses checkpoints for better performance
  • Shows how to use other TDD features: assertions, traps

TO-DO

Part 6

  • Shows how to use other TDD features: debug

For more detail on the Cascalog APIs used in this series, see the Cascalog Wiki and JavaDoc.

For more discussion, see the cascalog-user email forum.
