Local Evaluation

Matt Bossenbroek edited this page May 7, 2015 · 1 revision

PigPen provides two dump commands. Each one evaluates a query locally in the REPL and returns the results.

For example, we can run the following to load data and return it in the REPL:

(require '[pigpen.core :as pig])

(spit "input.tsv" "1\t2\tfoo\n4\t5\tbar")

=> (->>
     (pig/load-tsv "input.tsv")
     (pig/dump))
[["1" "2" "foo"] ["4" "5" "bar"]]

pig/dump

The first is pigpen.core/dump. This one is good for unit tests and working with small amounts of data. Each stage is evaluated eagerly and the results are stored in memory. Because of the eager evaluation, the stack traces produced by this version are often more readable.

However, this version can handle only data that fits in memory. If your source file is very large, it is not a good choice.
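As a minimal sketch of the unit-test use case, the following checks a small query against in-memory data. The names sum-rows and sum-rows-test are hypothetical and chosen only for illustration; pig/return is used to supply constant input, and the result is compared as a set on the assumption that output order should not be relied on.

```clojure
(require '[pigpen.core :as pig]
         '[clojure.test :refer [deftest is]])

;; Hypothetical query under test: adds the first two columns of each row
;; and keeps the third column as a name.
(defn sum-rows [rows]
  (->> rows
       (pig/map (fn [[a b c]]
                  {:sum  (+ (Long/parseLong a) (Long/parseLong b))
                   :name c}))))

(deftest sum-rows-test
  ;; pig/return wraps constant data, so no input file is needed.
  (let [data (pig/return [["1" "2" "foo"] ["4" "5" "bar"]])]
    ;; pig/dump evaluates every stage eagerly, in memory.
    (is (= #{{:sum 3 :name "foo"} {:sum 9 :name "bar"}}
           (set (pig/dump (sum-rows data)))))))
```

Because the input is tiny and held in memory, a failing assertion here should produce a stack trace that points fairly directly at the stage that failed.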

rx/dump

The second option is pigpen.rx/dump. This version requires an extra dependency, [com.netflix.pigpen/pigpen-rx "..."], and introduces a new namespace:

(require '[pigpen.rx :as rx])

This implementation uses rx-java to process the query and produce the result.

The benefit of this implementation is laziness, which extends all the way back to the load command. As each record is read, it is processed by any subsequent commands before the next record is read. After the query is complete, the implementation closes the underlying input stream.

Certain commands, however, still require buffering. If you have a group-by, cogroup, or join command in your query, rx buffers the data in memory; there is no support for spilling to disk.
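The difference between the two kinds of query can be sketched as follows. This assumes that rx/dump accepts a query in the same way as pig/dump, and it reuses the input.tsv format from the first example; the names streamed and grouped are hypothetical.

```clojure
(require '[pigpen.core :as pig]
         '[pigpen.rx :as rx])

;; Same sample file as in the first example.
(spit "input.tsv" "1\t2\tfoo\n4\t5\tbar")

;; Load and filter only: each record can pass through the filter
;; before the next one is read, so nothing needs to be buffered.
(def streamed
  (->> (pig/load-tsv "input.tsv")
       (pig/filter (fn [[a _ _]] (= a "4")))
       (rx/dump)))

;; group-by needs every record for a key before it can emit that group,
;; so the data is buffered in memory.
(def grouped
  (->> (pig/load-tsv "input.tsv")
       (pig/group-by (fn [[_ _ name]] name))
       (rx/dump)))
```

For a very large source file, the first query is the kind that rx/dump is suited to; the second remains limited by available memory.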