This page is designed to get you up and running with Thrax quickly. So let's get started!

1. Get the code

This is easy, since you're already on GitHub! Just clone the latest source:

git clone https://github.com/jweese/thrax.git

2. Set up hadoop

If you already have access to a hadoop cluster, skip this step. If you don't, please see Standalone hadoop. If you want to set up a real cluster (not in standalone mode), it's complicated, and we can't help you.
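
If you do go the standalone route, the basic idea is just to download a hadoop release and unpack it somewhere you can point to later. Here is a minimal sketch assuming the 0.20.2 release from the Apache archive (the version and URL are only an example; see Standalone hadoop for the recommended setup):

$ wget https://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
$ tar xzf hadoop-0.20.2.tar.gz

Make a note of the unpacked directory and the version number; you will need both when compiling in step 4.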

3. Set up AWS

You will need the AWS SDK for Java to compile Thrax. Download the SDK and set the environment variable AWS_SDK to the directory you unpack it into. For example:

$ wget http://ds60ft5bv5jal.cloudfront.net/aws-java-sdk-1.1.3.zip
$ unzip aws-java-sdk-1.1.3.zip
# for bash
$ export AWS_SDK=$(pwd)/aws-java-sdk-1.1.3

4. Compile

You have to set three environment variables in order to get Thrax to compile cleanly:

  • $HADOOP should point to the base of your hadoop installation (where you unpacked the tarball).
  • $HADOOP_VERSION should be set to the version of your hadoop installation.
  • $AWS_SDK should point to the installation directory for the Amazon Web Services SDK. (See running on Amazon Elastic MapReduce. For the moment, the SDK is still a compile-time requirement, even if you never use Amazon.)

We need to set these variables because ant expects $HADOOP/hadoop-$HADOOP_VERSION-core.jar to be on the classpath. Once that is done, simply type

ant

to compile the source. Ant should report BUILD SUCCESSFUL at the end.
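
For example, if you unpacked hadoop and the AWS SDK into your home directory, the whole compile step might look like this (the paths and version numbers below are placeholders for whatever your own setup uses):

# for bash
$ export HADOOP=$HOME/hadoop-0.20.2
$ export HADOOP_VERSION=0.20.2
$ export AWS_SDK=$HOME/aws-java-sdk-1.1.3
$ cd thrax
$ ant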

5. Prepare your data

Since hadoop operates on data as records, each line of your input file needs to have all of the information necessary for one unit of rule extraction. This means your input file should be full of lines like this:

source sentence ||| target sentence ||| alignment

The source sentence is just normalized, tokenized source text. The format of the target sentence depends on the type of grammar being extracted: for Hiero grammars, we just need normalized, tokenized text, but for SAMT grammars, the target side should be a Treebank-style parse of the target sentence. The alignment should be a whitespace-separated list of ordered pairs of integers, where i-j means that source word i is aligned to target word j. Don't forget that the word indices are zero-based!

Here's an example appropriate for Hiero extraction:

wiederaufnahme der sitzungsperiode ||| resumption of the session ||| 0-0 1-1 1-2 2-3

And here's one for SAMT:

{mnh dAn$ kw dATkA nhyN dyty rhy . ||| (S (NP (NNP Amna)) (VP (VBD used) (S (VP (TO to) (VP (RB not) (VB poke) (NP (NN Danish)))))) (. .)) ||| 5-1 3-4 0-0 6-1 1-5 4-3 7-6 2-2

If you have three parallel files called source, target, and alignment, this pipeline will work:

paste source target alignment | perl -pe 's/\t/ ||| /g' > unified
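
Before pasting, it's a good idea to check that the three files really are parallel; if the line counts differ, the unified file will contain malformed lines. A quick sanity check:

$ wc -l source target alignment
# all three line counts should be identical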

6. Write the conf file

See thrax.conf for a detailed description.
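
To give a rough idea of the format: a Thrax conf file is a list of whitespace-separated key-value pairs, one per line, with # starting a comment. A minimal sketch for Hiero extraction might look like the following; the key names here are taken from the example configs shipped with Thrax, so treat thrax.conf as the authoritative list.

# example Thrax configuration (sketch)
grammar        hiero
input-file     /path/to/unified
output-format  joshua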

7. Run!

Assuming $THRAX points to the root of your thrax distribution,

hadoop jar $THRAX/bin/thrax.jar /path/to/thrax/conf/file [output directory]

will do the job. Done!
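
If you are running on a real cluster, the output directory lives on HDFS rather than on your local disk. One way to pull the extracted rules down into a single local file is hadoop's getmerge; the final subdirectory below is an assumption about the output layout, so run -ls first and adjust to what you actually see:

$ hadoop fs -ls output-directory
$ hadoop fs -getmerge output-directory/final grammar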

8. Create a glue grammar
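
If you are decoding with a hierarchical decoder such as Joshua, you will also want a small glue grammar covering the nonterminals of the extracted grammar. One way to build it, assuming the CreateGlueGrammar utility class that ships inside recent thrax.jar builds (check that your version includes it and what arguments it expects):

$ java -cp $THRAX/bin/thrax.jar edu.jhu.thrax.util.CreateGlueGrammar grammar > glue-grammar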