Tutorial

Matt Bossenbroek edited this page May 19, 2016 · 20 revisions

Getting started with Clojure and PigPen is really easy. Just follow the steps below to get up and running.

  1. Install Leiningen
  2. Create a new Leiningen project with `lein new pigpen-demo`. This will create a `pigpen-demo` folder for your project.
  3. a. To use Pig, add PigPen as a dependency by changing the dependencies in your project's `project.clj` file to look like this:
``` clojure
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [com.netflix.pigpen/pigpen-pig "0.3.3"]]
  :profiles {:dev {:dependencies [[org.apache.pig/pig "0.13.0"]
                                  [org.apache.hadoop/hadoop-core "1.1.2"]]}}
```

 b. To use Cascading, add PigPen as a dependency by changing the dependencies in your project's `project.clj` file to look like this:

``` clojure
  :repositories [["conjars" "http://conjars.org/repo"]]
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [com.netflix.pigpen/pigpen-cascading "0.3.1"]]
  :profiles {:dev {:dependencies [[org.apache.hadoop/hadoop-core "1.1.2"]]}}
```
  4. Run `lein repl` to start a REPL for your new project.
  5. Try some samples below...
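
Put together, the `project.clj` for the Pig variant (step 3a) might look like the sketch below. The project name comes from step 2; the version string is an assumption based on Leiningen's default template, and fields that `lein new` also generates, such as `:description` and `:license`, are omitted here.

```clojure
(defproject pigpen-demo "0.1.0-SNAPSHOT"
  ;; Runtime dependencies: Clojure itself and PigPen's Pig backend.
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [com.netflix.pigpen/pigpen-pig "0.3.3"]]
  ;; Pig and Hadoop sit in the :dev profile, as in step 3a.
  :profiles {:dev {:dependencies [[org.apache.pig/pig "0.13.0"]
                                  [org.apache.hadoop/hadoop-core "1.1.2"]]}})
```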

If you have any questions, or if something doesn't look quite right, contact us here: pigpen-support@googlegroups.com

Note: It is strongly recommended to familiarize yourself with Clojure before using PigPen.

Note: PigPen requires Clojure 1.5.1 or greater. The Leiningen example uses Leiningen 2.0 or greater.

To get started, we import the pigpen.core namespace:

(require '[pigpen.core :as pig])

First, let's load some data. Text files (tsv, csv) can be read using `pig/load-tsv`. If you have Clojure data, take a look at `pig/load-clj`.

The following code defines a function that returns a query. This query loads data from the file input.tsv.

(defn my-data []
  (pig/load-tsv "input.tsv"))

Note: Calling this function only returns the PigPen representation of a query. To actually use it, you'll need to execute it locally or convert it to a script (more on that later).

We can test our query in a REPL like so... First, create some test data:

=> (spit "input.tsv" "1\t2\tfoo\n4\t5\tbar")

And then run the script to return our data:

=> (pig/dump (my-data))
[["1" "2" "foo"] ["4" "5" "bar"]]
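
As the output above shows, each row comes back as a vector of strings, one per tab-separated field. The plain-Clojure sketch below mimics that shape for the same test data. It is an illustration only, not PigPen's implementation, and `parse-tsv` is a hypothetical helper name.

```clojure
(require '[clojure.string :as string])

(defn parse-tsv
  "Splits text into lines, then each line into tab-separated fields."
  [text]
  (mapv #(string/split % #"\t") (string/split-lines text)))

(parse-tsv "1\t2\tfoo\n4\t5\tbar")
;; => [["1" "2" "foo"] ["4" "5" "bar"]]
```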

Now let's transform our data:

(defn my-data-1 []
  (->>
    (pig/load-tsv "input.tsv")
    (pig/map (fn [[a b c]]
               {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
                :name c}))))

If we run the script now, our output data reflects the transformation:

=> (pig/dump (my-data-1))
[{:sum 3, :name "foo"} {:sum 9, :name "bar"}]
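
The function passed to `pig/map` is ordinary Clojure, so the transformation can be tried on plain vectors with `clojure.core/map` before it goes into a query. A minimal sketch, where `parse-row` is a hypothetical name for the same anonymous function used above:

```clojure
(defn parse-row
  "Destructures a three-field row and sums the first two fields."
  [[a b c]]
  {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
   :name c})

(map parse-row [["1" "2" "foo"] ["4" "5" "bar"]])
;; => ({:sum 3, :name "foo"} {:sum 9, :name "bar"})
```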

And we can filter the data too:

(defn my-data-2 []
  (->>
    (pig/load-tsv "input.tsv")
    (pig/map (fn [[a b c]]
               {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
                :name c}))
    (pig/filter (fn [{:keys [sum]}]
                  (< sum 5)))))

=> (pig/dump (my-data-2))
[{:sum 3, :name "foo"}]
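
The predicate given to `pig/filter` destructures each map with `{:keys [sum]}` and keeps the rows for which it returns a truthy value. The same predicate can be checked against plain maps with `clojure.core/filter`. In this sketch, `small-sum?` is a hypothetical name:

```clojure
(defn small-sum?
  "True when the :sum of a row is below 5."
  [{:keys [sum]}]
  (< sum 5))

(filter small-sum? [{:sum 3, :name "foo"} {:sum 9, :name "bar"}])
;; => ({:sum 3, :name "foo"})
```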

It's generally good practice to separate the loading of the data from our business logic. Let's separate our script into multiple functions and add a store operator:

(defn my-data-3 [input-file]
  (pig/load-tsv input-file))

(defn my-func [data]
  (->> data
    (pig/map (fn [[a b c]]
               {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
                :name c}))
    (pig/filter (fn [{:keys [sum]}]
                  (< sum 5)))))

(defn my-query [input-file output-file]
  (->>
    (my-data-3 input-file)
    (my-func)
    (pig/store-clj output-file)))
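
The `->>` macro passes each result as the last argument of the next form, which is why `(my-func)` appears with no explicit argument and why `output-file` comes before the data in the `pig/store-clj` call. A small sketch using only `clojure.core` functions shows the equivalence between the threaded and nested forms:

```clojure
;; Threaded form: the vector flows through each step as the last argument.
(def threaded
  (->> [1 2 3]
       (map inc)
       (filter even?)))

;; Equivalent nested form, written inside-out.
(def nested
  (filter even? (map inc [1 2 3])))

(= threaded nested)
;; => true
```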

Now we can define a unit test for our query:

(require '[clojure.test :refer :all])

(deftest test-my-func
  (let [data (pig/return [["1" "2" "foo"] ["4" "5" "bar"]])]
    (is (= (pig/dump (my-func data))
           [{:sum 3, :name "foo"}]))))

The function pig/dump takes any PigPen query, executes it locally, and returns the data.

If we want to generate a script, that's easy too:

(require '[pigpen.pig])

(pigpen.pig/write-script "my-script.pig" (my-query "input.tsv" "output.clj"))

We can optionally run our script locally in Pig (if you have it installed, which is not a requirement of PigPen). The easiest way to build the pigpen jar is to build an uberjar for our project. From the command line:

$ lein uberjar
$ cp target/pigpen-demo-0.1.0-SNAPSHOT-standalone.jar pigpen.jar
$ pig -x local -f my-script.pig
$ cat output.clj/part-m-00000
{:sum 3, :name "foo"}

Note: Pig can't overwrite files, so you'll need to delete the `output.clj` folder before running the script again. Another recommended option is to put a timestamp in the output path.
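
One way to put a timestamp in the path is a small helper that appends the current time to a base name. This is a sketch under assumptions: `timestamped-path` is a hypothetical name and the `yyyyMMdd-HHmmss` format is an arbitrary choice.

```clojure
(import '(java.text SimpleDateFormat)
        '(java.util Date))

(defn timestamped-path
  "Appends the current time, formatted as yyyyMMdd-HHmmss, to base."
  [base]
  (str base "-" (.format (SimpleDateFormat. "yyyyMMdd-HHmmss") (Date.))))

;; The result depends on the clock, for example "output-20160519-103000".
(timestamped-path "output")
```

The result could then be passed as the output argument, for example `(my-query "input.tsv" (timestamped-path "output"))`, so that each generated script writes to a fresh folder.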

See the "PigPen for Cascading users" page for how to convert a PigPen query into a Cascading flow.