TBLign tree alignment system
TBLign version 0.0.2b
=====================

INSTALLATION

If you have downloaded it in tarball format (.tar.gz), simply untar it by
typing on the terminal:

    tar -xzvf TBLign0.0.2b.tgz

If you have downloaded it in .zip format, use the utility of your choice
(e.g. WinZip or 7-Zip) or type the following on the terminal:

    unzip TBLign0.0.2b.zip

The software should be ready for use, with pre-constructed Dutch-to-English
treebanks available in the following directories:

    ./data/align/nlen/trainset/auto
    ./data/align/nlen/trainset/gold
    ./data/align/nlen/cutoff/auto
    ./data/align/nlen/cutoff/gold
    ./data/align/nlen/devtest/auto
    ./data/align/nlen/devtest/gold
    ./data/align/nlen/testset/auto
    ./data/align/nlen/testset/gold
    ./data/align/nlen/all-ep/auto
    ./data/align/nlen/all-ep/gold

See the respective README files in:

    ./data/align/nlen/trainset
    ./data/align/nlen/cutoff
    ./data/align/nlen/devtest
    ./data/align/nlen/testset
    ./data/align/nlen/all-ep

for more information.

However, if you would like to use your own data, either update the Makefile
variables with the correct paths or run the training and testing scripts
(see Makefile) on your own. In the future, we might work on easier ways to
incorporate your data.

DOCUMENTATION

TBLign is a tool that can be used for the alignment of phrase-structure
constituents between parallel texts, indicating equivalence. The output can
be used for various purposes such as training machine translation systems.
It is an implementation of Eric Brill's transformation-based learning
algorithm for the problem of tree-to-tree alignment and alignment error
correction.

The author has written a paper and a more detailed thesis chapter on the
development of this tool and the experiments done with it:

    Kotzé, Gideon. 2012. Transformation-based tree-to-tree alignment.
    Computational Linguistics in The Netherlands Journal. Vol. 2.
    pp. 71-96. http://www.clinjournal.org/node/30

    Kotzé, Gideon. 2013.
    Complementary approaches to tree alignment: Combining statistical and
    rule-based methods. PhD Thesis. University of Groningen. Chapter 7,
    pp. 113-155.
    http://gideonkotze.nl/downloads/GideonThesis_Electronic.pdf

All trees should be in TIGER-XML format and all alignment files in
Stockholm TreeAligner style or Lingua-Align format. See:

    Stockholm TreeAligner:
    http://www.cl.uzh.ch/research/paralleltreebanks/treealigner_en.html

    Lingua-Align:
    https://bitbucket.org/tiedemann/lingua-align/wiki/Home

For your convenience, we have constructed a Makefile which performs all the
operations necessary for a full training and testing phase of the software.
The Makefile reads the alignment files specified so you don't need to type
the names out multiple times. The basic phases are:

- make train: Learns a set of best rules to be applied to new data. Rules
  are applied to the output of an initial state annotator (this can range
  from simple word-aligned data to the output of a tree aligner) and the
  result is compared to a gold standard.
- make cutoff: Applies the learned rules to a held-out data set. The rule
  that, when applied, leads to the highest score in that set is selected as
  the last rule in the new set of rules. It therefore functions as a cutoff
  point. This phase is introduced in order to counteract overtraining.
- make devtest: Tests the set of learned rules (refined during the cutoff
  phase) against a development test set. This is meant to be used in the
  process of refining an existing model.
- make testset: Tests the set of learned rules (refined during the cutoff
  phase) against a test set. This is only meant to be used once a model has
  been refined and is to be tested for reporting its performance.
- make traintest: Trains and tests. Essentially, this runs 'make train',
  'make cutoff' and 'make devtest' in sequence.

For training, we need to specify a number of template-like structures that
we call "rule keys".
They function like a combination of features where each feature, for a
given node pair, can be either true or false, resulting in a binary value.
During training, every node pair is assigned such a profile of true or
false values according to the rule keys specified. Examples of rule key
lists can be found under data/db/nlen. The optimal list that we used for
our thesis experiments is rulekeys.txt. In the near future, we will explain
the syntax of the rule keys in more detail. For those proficient in Perl,
the script generate-iterate.pl provides more clues on how the rule keys are
processed and the extent to which you can change the values.

We have also included a number of useful Perl scripts under ./bin/. They
are:

- remove-align-ids.pl: Removes sentences with specific IDs in an alignment
  file as specified in a text file. This may be used, for example, during
  manual training data construction - one may find that certain sentence
  pairs that one has extracted from parallel treebanks are not fit for use
  in training data construction, and one wishes to remove those sentence
  pairs.
- merge-STA-training.pl: Merges two alignment sets. For example, one may
  wish to increase the size of the training data set by merging two
  separate sets from different domains.
- get-rootcombo-freqs.pl: Takes as input an alignment set and displays the
  frequencies of category label combinations of aligned nodes. This may be
  used to inspect the consistency of alignments. For example, there may be
  many NP-NP alignments and PP-PP, but NP-PP is rare. This may indicate
  that the NP-PP alignments are either wrong or should receive fuzzy (less
  confident) links.
- get-n-align.pl: Extracts a specified number of sentence pairs from an
  alignment set.
  This is useful if, for example, one decides to use only a set number of
  sentence pairs for training, or if one wishes to extract training and
  test sets separately from a single set (for example: 300 training data
  sentence pairs and 50 test data sentence pairs from a set of 350 sentence
  pairs).
- count-nt-combos.pl: Displays some basic alignment statistics of an
  alignment set.
- choose-cutoff.pl: This is used by TBLign, which uses a held-out data set
  to choose a cutoff point in the learned model, in order to counteract
  overtraining.
- check-STA-links.pl: This is used to check if all the alignments specified
  in an alignment file are valid. More specifically, all alignment IDs
  specified by the alignment file should point to actually existing nodes
  with the same IDs in the treebank files. This script checks whether this
  is true or not.
- apply-best-rule.pl: This is used by TBLign to apply the best learned rule
  in an iteration to a data set. In training, the updated data set is then
  used in the next iteration.
- write-wordalign.pl: This script takes as input an alignment file and
  writes only the word alignments to output, ignoring any constituent
  alignments. This may be useful, for example, if a set of parallel
  sentences is extracted from an already aligned parallel treebank, with
  the purpose of creating a gold standard or training data set, or if you
  would like to apply the aligner to an unaligned version of a gold
  standard for later comparison.
- eval-nonterms.pl: This script compares an automatically produced
  alignment set to a gold standard and calculates the precision and recall
  of all alignments between non-terminal nodes. Two balanced F-scores are
  provided as a result: One of them takes only the recall of confident
  links into account and the other one takes all of them into account. See
  the above papers for an explanation.

For more information on how to use these scripts, run perldoc. For example:

    perldoc bin/check-STA-links.pl
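
The two F-scores described for eval-nonterms.pl can be illustrated with a
small sketch. This is illustrative Python, not the Perl code shipped with
TBLign, and the exact definitions used by the script may differ. It assumes
that links are (source ID, target ID) pairs, that gold links are labelled
"good" (confident) or "fuzzy", and that precision is computed against all
gold links. The function names and node IDs are hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean); 0.0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(auto_links, gold_links):
    """Compare automatic links to a gold standard.

    auto_links: set of (source_id, target_id) pairs.
    gold_links: dict mapping (source_id, target_id) to "good" or "fuzzy"
                (an assumed labelling of confident vs. less confident links).
    """
    gold_all = set(gold_links)
    gold_good = {pair for pair, kind in gold_links.items() if kind == "good"}

    correct = auto_links & gold_all
    precision = len(correct) / len(auto_links) if auto_links else 0.0
    recall_all = len(correct) / len(gold_all) if gold_all else 0.0
    recall_good = (
        len(auto_links & gold_good) / len(gold_good) if gold_good else 0.0
    )

    return {
        "precision": precision,
        "recall_all": recall_all,
        "recall_good": recall_good,
        # F-score using the recall of all gold links.
        "f_all": f_score(precision, recall_all),
        # F-score using only the recall of confident ("good") gold links.
        "f_good": f_score(precision, recall_good),
    }


# Hypothetical node IDs, for illustration only.
auto = {("s1_500", "s1_501"), ("s1_502", "s1_503"), ("s1_504", "s1_509")}
gold = {
    ("s1_500", "s1_501"): "good",
    ("s1_502", "s1_503"): "fuzzy",
    ("s1_506", "s1_507"): "good",
}
scores = evaluate(auto, gold)
```

Here two of the three automatic links are in the gold standard, but only one
of the two confident gold links is found, so the F-score based on confident
links is lower than the one based on all links.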
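
The cutoff idea behind 'make cutoff' and choose-cutoff.pl can be sketched in
the same way. Again, this is illustrative Python under stated assumptions
and not the actual script: it assumes that the learned rules are kept in the
order in which they were learned, that a held-out score is available after
each rule has been applied in sequence, and that ties are resolved in favour
of the shorter rule list. The function name, rule names and scores are
hypothetical.

```python
def choose_cutoff(rules, heldout_scores):
    """Keep the rules up to and including the best-scoring one.

    rules:          learned rules in the order they were learned.
    heldout_scores: heldout_scores[i] is the score on the held-out set
                    after applying rules[0] .. rules[i] in sequence.
    """
    if len(rules) != len(heldout_scores):
        raise ValueError("need exactly one held-out score per rule")
    if not rules:
        return []
    # max() returns the first maximal index, so ties keep the shorter list
    # (an assumption of this sketch).
    best_index = max(range(len(heldout_scores)),
                     key=heldout_scores.__getitem__)
    return rules[: best_index + 1]


# Hypothetical rules and scores, for illustration only.
learned = ["rule1", "rule2", "rule3", "rule4"]
scores_after_each_rule = [0.70, 0.74, 0.73, 0.74]
kept = choose_cutoff(learned, scores_after_each_rule)
```

Rules learned after the cutoff point improved the score on the training
data but not on the held-out data, which is why discarding them counteracts
overtraining.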

COPYRIGHT AND LICENCE

Copyright (C) 2014 by Gideon Kotzé

This is free software; you can redistribute it and/or modify it under the
same terms as Perl itself, either Perl version 5.12.4 or, at your option,
any later version of Perl 5 you may have available.

The data sets found under the following directories:

    ./data/align/nlen/trainset/auto
    ./data/align/nlen/trainset/gold
    ./data/align/nlen/cutoff/auto
    ./data/align/nlen/cutoff/gold
    ./data/align/nlen/devtest/auto
    ./data/align/nlen/devtest/gold
    ./data/align/nlen/testset/auto
    ./data/align/nlen/testset/gold
    ./data/align/nlen/all-ep/auto
    ./data/align/nlen/all-ep/gold

are modified (annotated) versions of corpora that are freely available
online. They are to be used under the same terms as the corpora that can be
found at the websites where they are distributed. At the time of writing,
they can be found here:

    Europarl: http://www.statmt.org/europarl
    OPUS:     http://opus.lingfil.uu.se
    DGT:      http://langtech.jrc.it/DGT-TM.html

If these URLs do not work or if you have any other questions, feel free to
contact the author at email@example.com.

Gideon Kotzé
24 January 2014