TBLign tree alignment system

TBLign version 0.0.2b
=====================

INSTALLATION

If you have downloaded the tarball (.tar.gz or .tgz), extract it by typing the following in a terminal:

tar -xzvf TBLign0.0.2b.tgz

If you have downloaded it in .zip format, use the utility of your choice (e.g. WinZip or 7-Zip) or type the following in a terminal:

unzip TBLign0.0.2b.zip

The software should be ready for use, with pre-constructed Dutch-to-English treebanks available in the following directories:

./data/align/nlen/trainset/auto
./data/align/nlen/trainset/gold
./data/align/nlen/cutoff/auto
./data/align/nlen/cutoff/gold
./data/align/nlen/devtest/auto
./data/align/nlen/devtest/gold
./data/align/nlen/testset/auto
./data/align/nlen/testset/gold
./data/align/nlen/all-ep/auto
./data/align/nlen/all-ep/gold

See the respective README files in:

./data/align/nlen/trainset
./data/align/nlen/cutoff
./data/align/nlen/devtest
./data/align/nlen/testset
./data/align/nlen/all-ep

for more information.

However, if you would like to use your own data, either update the Makefile variables with the correct paths or run the training and testing scripts (see the Makefile) yourself. Easier ways to incorporate your own data may be added in the future.

DOCUMENTATION

TBLign is a tool that can be used for the alignment of phrase-structure constituents between parallel texts, indicating equivalence. The output can be used for various purposes such as training machine translation systems. It is an implementation of Eric Brill's transformation-based learning algorithm for the problem of tree-to-tree alignment and alignment error correction.

The author has written a paper and a more detailed thesis chapter on the development of this tool and the experiments done with it:

Kotzé, Gideon. 2012. Transformation-based tree-to-tree alignment. Computational Linguistics in The Netherlands Journal. Vol. 2. pp. 71-96. http://www.clinjournal.org/node/30

Kotzé, Gideon. 2013. Complementary approaches to tree alignment: Combining statistical and rule-based methods. PhD Thesis. University of Groningen. Chapter 7, pp. 113-155. http://gideonkotze.nl/downloads/GideonThesis_Electronic.pdf

All trees should be in TIGER-XML format and all alignment files in Stockholm TreeAligner style or Lingua-Align format. See:

Stockholm TreeAligner: http://www.cl.uzh.ch/research/paralleltreebanks/treealigner_en.html
Lingua-Align: https://bitbucket.org/tiedemann/lingua-align/wiki/Home

For your convenience, we have constructed a Makefile which performs all the operations necessary for a full training and testing phase of the software. The Makefile reads the alignment files specified so you don't need to type the names out multiple times.

The basic phases are:

- make train: Learns a set of best rules to be applied to new data. Candidate rules are applied to the output of an initial state annotator (this can range from simple word-aligned data to the output of a tree aligner), and the result of each application is compared to a gold standard.

- make cutoff: Applies the learned rules to a held-out data set. The rule that, when applied, leads to the highest score in that set is selected as the last rule in the new set of rules. It therefore functions as a cutoff point. This phase is introduced in order to counteract overtraining.

- make devtest: Tests the set of learned rules (refined during the cutoff phase) against a development test set. This is meant to be used in the process of refining an existing model.

- make testset: Tests the set of learned rules (refined during the cutoff phase) against a test set. This is only meant to be used once a model has been refined and is to be tested for reporting its performance.

- make traintest: Trains and tests. Essentially, this runs 'make train', 'make cutoff' and 'make devtest' in sequence.
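
The cutoff phase described above can be illustrated with a minimal Python sketch. It is not the code of choose-cutoff.pl: the rule names, the scores and the tie-breaking choice (prefer the shorter rule list) are assumptions made for illustration only.

```python
def choose_cutoff(scores):
    """Return how many learned rules to keep, given held-out scores.

    scores[i] is the score on the held-out set after applying rules
    0..i in order. Ties go to the earliest (shortest) rule list, which
    is an assumption of this sketch.
    """
    if not scores:
        return 0
    best_index = max(range(len(scores)), key=lambda i: (scores[i], -i))
    return best_index + 1


# Hypothetical rule names and held-out scores, for illustration only.
learned_rules = ["rule_a", "rule_b", "rule_c", "rule_d", "rule_e"]
heldout_scores = [0.71, 0.74, 0.78, 0.77, 0.75]

keep = choose_cutoff(heldout_scores)
final_rules = learned_rules[:keep]
print(final_rules)
```

Here the score peaks after the third rule, so the later rules, which only lower the held-out score, are dropped from the final rule list.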

For training, we need to specify a number of template-like structures that we call "rule keys". They function like a combination of features where each feature, for a given node pair, can be either true or false, resulting in a binary value. During training, every node pair is assigned such a profile of true or false values according to the rule keys specified. Examples of rule key lists can be found under data/db/nlen. The optimal list that we used for our thesis experiments is rulekeys.txt.
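
As a rough illustration of the idea of a true/false profile per node pair, consider the following Python sketch. The feature names, the NodePair fields and the way the keys are written are all hypothetical; the actual rule key syntax is the one used in the files under data/db/nlen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePair:
    """A candidate source/target node pair (fields are illustrative)."""
    src_cat: str
    trg_cat: str
    src_leaves: int
    trg_leaves: int
    linked: bool


# Hypothetical "rule keys": named yes/no tests on a node pair.
RULE_KEYS = {
    "same_category": lambda p: p.src_cat == p.trg_cat,
    "same_leaf_count": lambda p: p.src_leaves == p.trg_leaves,
    "already_linked": lambda p: p.linked,
}


def profile(pair):
    """Return the pair's true/false value for every rule key, in order."""
    return tuple(bool(test(pair)) for test in RULE_KEYS.values())


pair = NodePair(src_cat="NP", trg_cat="NP", src_leaves=3, trg_leaves=2, linked=False)
print(dict(zip(RULE_KEYS, profile(pair))))
```

Two node pairs with the same profile are indistinguishable to the learner, so the choice of rule keys determines how finely the rules can discriminate between pairs.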

In the near future, we will explain the syntax of the rule keys in more detail. For those proficient in Perl, the script generate-iterate.pl provides more clues on how the rule keys are processed and the extent to which you can change the values.

We have also included a number of useful Perl scripts under ./bin/. They are:

- remove-align-ids.pl: Removes sentences with specific IDs, as listed in a text file, from an alignment file. This may be used, for example, during manual training data construction: if certain sentence pairs extracted from parallel treebanks turn out to be unfit for use as training data, they can be removed with this script.

- merge-STA-training.pl: Merges two alignment sets. For example, one may wish to increase the size of the training data set by merging two separate sets from different domains.

- get-rootcombo-freqs.pl: Takes as input an alignment set and displays the frequencies of category label combinations of aligned nodes. This may be used to inspect the consistency of alignments. For example, there may be many NP-NP alignments and PP-PP, but NP-PP is rare. This may indicate that the NP-PP alignments are either wrong or should receive fuzzy (less confident) links.

- get-n-align.pl: Extracts a specified number of sentence pairs from an alignment set. This is useful if, for example, one decides to use only a set number of sentence pairs for training, or if one wishes to extract training and test sets separately from a single set (for example: 300 training data sentence pairs and 50 test data sentence pairs from a set of 350 sentence pairs).

- count-nt-combos.pl: Displays some basic alignment statistics of an alignment set.

- choose-cutoff.pl: This is used by TBLign to choose a cutoff point in the learned model on the basis of a held-out data set, in order to counteract overtraining.

- check-STA-links.pl: This is used to check if all the alignments specified in an alignment file are valid. More specifically, all node IDs referred to by the alignment file should point to actually existing nodes with the same IDs in the treebank files. This script checks whether this is true or not.

- apply-best-rule.pl: This is used by TBLign to apply the best learned rule in an iteration to a data set. In training, the updated data set is then used in the next iteration.

- write-wordalign.pl: This script takes as input an alignment file and writes only the word alignments to output, ignoring any constituent alignments. This may be useful, for example, if a set of parallel sentences is extracted from an already aligned parallel treebank, with the purpose of creating a gold standard or training data set, or if you would like to apply the aligner to an unaligned version of a gold standard for later comparison.

- eval-nonterms.pl: This script compares an automatically produced alignment set to a gold standard and calculates the precision and recall of all alignments between non-terminal nodes. Two balanced F-scores are provided as a result: one of them takes only the recall of confident links into account and the other one takes the recall of all links into account. See the publications cited above for an explanation.
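
To give an impression of the two F-scores reported by eval-nonterms.pl, here is a minimal Python sketch. It is not the script itself, and the exact definitions are those given in the publications cited above. The sketch assumes that precision counts a predicted link as correct if it matches any gold link (confident or fuzzy), and that the two recall values are computed over confident gold links only and over all gold links, respectively. The node IDs are made up.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, gold_confident, gold_fuzzy):
    """Score predicted links; all arguments are sets of (source_id, target_id)."""
    gold_all = gold_confident | gold_fuzzy
    precision = len(predicted & gold_all) / len(predicted) if predicted else 0.0
    recall_conf = (len(predicted & gold_confident) / len(gold_confident)
                   if gold_confident else 0.0)
    recall_all = len(predicted & gold_all) / len(gold_all) if gold_all else 0.0
    return {
        "precision": precision,
        "recall_confident": recall_conf,
        "recall_all": recall_all,
        "f_confident": f_score(precision, recall_conf),
        "f_all": f_score(precision, recall_all),
    }


# Hypothetical non-terminal links, for illustration only.
predicted = {("s1_500", "s1_500"), ("s1_501", "s1_502"),
             ("s1_503", "s1_504"), ("s1_505", "s1_505")}
gold_confident = {("s1_500", "s1_500"), ("s1_501", "s1_502"),
                  ("s1_506", "s1_506")}
gold_fuzzy = {("s1_503", "s1_504")}

result = evaluate(predicted, gold_confident, gold_fuzzy)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, a predicted link that matches a fuzzy gold link is not penalised in precision, while the confident-only recall does not require fuzzy links to be found.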

For more information on how to use these scripts, run perldoc. For example: perldoc bin/check-STA-links.pl.
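
The kind of consistency check that get-rootcombo-freqs.pl supports can be sketched in a few lines of Python. This is an illustration only, not the script: the label pairs are made up, and real data would be read from the treebank and alignment files.

```python
from collections import Counter

# Hypothetical category labels of aligned non-terminal node pairs,
# given as (source label, target label).
aligned_labels = [
    ("NP", "NP"), ("NP", "NP"), ("NP", "NP"),
    ("PP", "PP"), ("PP", "PP"),
    ("NP", "PP"),
]

combo_freqs = Counter(f"{src}-{trg}" for src, trg in aligned_labels)

# Most frequent combinations first; rare ones at the bottom are the
# candidates for manual inspection.
for combo, count in combo_freqs.most_common():
    print(f"{combo}\t{count}")
```

In this toy data, NP-PP occurs only once, which marks it as a combination worth inspecting.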

COPYRIGHT AND LICENCE

Copyright (C) 2014 by Gideon Kotzé

This is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.12.4 or,
at your option, any later version of Perl 5 you may have available.

The data sets found under the following directories:

./data/align/nlen/trainset/auto
./data/align/nlen/trainset/gold
./data/align/nlen/cutoff/auto
./data/align/nlen/cutoff/gold
./data/align/nlen/devtest/auto
./data/align/nlen/devtest/gold
./data/align/nlen/testset/auto
./data/align/nlen/testset/gold
./data/align/nlen/all-ep/auto
./data/align/nlen/all-ep/gold

are modified (annotated) versions of corpora that are freely available online. They are to be used under the same terms as the corpora that can be found at the websites where they are distributed. At the time of writing, they can be found here:

Europarl: http://www.statmt.org/europarl
OPUS: http://opus.lingfil.uu.se
DGT: http://langtech.jrc.it/DGT-TM.html

If these URLs do not work or if you have any other questions, feel free to contact the author at gidi8ster@gmail.com.

Gideon Kotzé
24 January 2014