Quickstart
This page is designed to get you up and running with Thrax quickly. So let's get started!
This is easy, since you're already on github! Just clone the latest source:
git clone https://github.com/jweese/thrax.git
If you already have access to a hadoop cluster, skip this step. If you don't, please see Standalone hadoop. If you want to set up a real cluster (not in standalone mode), it's complicated, and we can't help you.
You will need the Amazon Web Services (AWS) SDK for Java to compile Thrax. Download the SDK and set the environment variable AWS_SDK to the directory you unpacked it to, e.g.,
$ wget http://ds60ft5bv5jal.cloudfront.net/aws-java-sdk-1.1.3.zip
$ unzip aws-java-sdk-1.1.3.zip
# for bash
$ export AWS_SDK=$(pwd)/aws-java-sdk-1.1.3
You have to set three environment variables in order to get thrax to compile cleanly:
- $HADOOP should point to the base of your hadoop installation (where you unpacked the tarball).
- $HADOOP_VERSION should be set to the version of your hadoop installation.
- $AWS_SDK should point to the installation directory for the Amazon Web Services SDK. (See running on Amazon Elastic MapReduce. For the moment, the SDK is still a compile-time requirement, even if you never use Amazon.)
We need to set these variables because ant expects $HADOOP/hadoop-$HADOOP_VERSION-core.jar to be on the classpath. Once that is done, simply type
ant
to compile the source. Ant should report BUILD SUCCESSFUL at the end.
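Before running ant, it can help to confirm that the variables resolve to a jar that actually exists. The following is only a sketch: the install locations and the version number 0.20.2 are hypothetical placeholders, so substitute the values for your own machine.

```shell
# Hypothetical install locations and version; substitute your own.
export HADOOP="$HOME/hadoop-0.20.2"
export HADOOP_VERSION="0.20.2"
export AWS_SDK="$HOME/aws-java-sdk-1.1.3"

# Path of the core jar that the build expects, following the naming scheme above.
core_jar="$HADOOP/hadoop-$HADOOP_VERSION-core.jar"

# Report whether the jar is where the build will look for it.
if [ -f "$core_jar" ]; then
  echo "found $core_jar"
else
  echo "missing $core_jar (check HADOOP and HADOOP_VERSION)" >&2
fi
```

If the check reports the jar as missing, the most likely cause is a HADOOP_VERSION that does not match the version string in the jar's file name.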
Since hadoop operates on data as records, each line of your input file needs to have all of the information necessary for one unit of rule extraction. This means your input file should be full of lines like this:
source sentence ||| target sentence ||| alignment
The source sentence is just normalized, tokenized source text. The format of the target sentence depends on the type of grammar being extracted: for Hiero grammars, we just need normalized, tokenized text, but for SAMT grammars, the target sentence should be parsed Treebank-style. The alignment should be a whitespace-separated list of ordered pairs of integers, where i-j means source word i is aligned to target word j. Don't forget that word positions are zero-indexed!
Here's an example appropriate for Hiero extraction:
wiederaufnahme der sitzungsperiode ||| resumption of the session ||| 0-0 1-1 1-2 2-3
And here's one for SAMT:
{mnh dAn$ kw dATkA nhyN dyty rhy . ||| (S (NP (NNP Amna)) (VP (VBD used) (S (VP (TO to) (VP (RB not) (VB poke) (NP (NN Danish)))))) (. .)) ||| 5-1 3-4 0-0 6-1 1-5 4-3 7-6 2-2
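Off-by-one alignments are easy to introduce, so a quick sanity check on the input can save a failed job. The function below is a rough sketch and is not part of Thrax; the name check_hiero_line is made up for illustration. It only applies to Hiero-style lines, where the target side is plain tokenized text. For SAMT input the target is a parse tree, so counting whitespace-separated tokens would not give the number of target words.

```shell
# check_hiero_line LINE (hypothetical helper, not part of Thrax):
# succeeds if LINE has exactly three " ||| "-separated fields and every
# alignment point i-j lies inside the source and target sentences,
# counting words from zero.
check_hiero_line() {
  printf '%s\n' "$1" | awk -F' [|][|][|] ' '
    NF != 3 { exit 1 }
    {
      ns = split($1, src, " ")   # number of source words
      nt = split($2, tgt, " ")   # number of target words
      na = split($3, pts, " ")   # number of alignment points
      for (k = 1; k <= na; k++) {
        # each point must look like "i-j" with non-negative integers
        if (pts[k] !~ /^[0-9]+-[0-9]+$/) exit 1
        split(pts[k], ij, "-")
        # zero-indexed, so valid positions are 0 .. length-1
        if (ij[1] + 0 >= ns || ij[2] + 0 >= nt) exit 1
      }
    }'
}
```

For example, `check_hiero_line 'wiederaufnahme der sitzungsperiode ||| resumption of the session ||| 0-0 1-1 1-2 2-3'` succeeds, while a line whose alignment mentions source word 3 of a three-word sentence fails.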
If you have three parallel files called source, target, and alignment, this pipeline will work:
paste source target alignment | perl -pe 's/\t/ ||| /g' > unified
Thrax reads its settings from a configuration file. See thrax.conf for a detailed description.
Assuming $THRAX points to the root of your thrax distribution,
hadoop jar $THRAX/bin/thrax.jar /path/to/thrax/conf/file [output directory]
will do the job. Done!