Welcome to DTLM, a semi-supervised seq2seq system for low-resource transduction.

Note that, due to submission size constraints, we could not include all the libraries needed to build DTLM and M2M.

M2M further requires the tclap library, while DTLM requires SVMlight, STLPort, and tclap.

To take full advantage of DTLM, you will need two additional resources, both of which can be compiled from an unannotated corpus:

A language model file in ARPA format. For this paper, LMs were built with the CMU language modeling toolkit, but any language model in ARPA format should work.

A word list consisting of counts and words, separated by a tab.

For example:

200214	the
192245	a
46030	is
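If no word list is at hand, one can be derived from the unannotated corpus. The following is a minimal sketch, assuming whitespace-separated words and a most-frequent-first ordering (as in the example above); the function name `build_word_list` and the tokenization are illustrative choices, not part of DTLM.

```python
from collections import Counter


def build_word_list(lines):
    """Return 'count<TAB>word' entries, most frequent first.

    Assumption: words are whitespace-separated. DTLM does not prescribe
    a tokenizer, so adapt this to your corpus.
    """
    counts = Counter(word for line in lines for word in line.split())
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{count}\t{word}" for word, count in ranked]


if __name__ == "__main__":
    corpus = ["the cat is a cat", "the dog is the dog"]
    for entry in build_word_list(corpus):
        print(entry)
```

To process a real corpus, pass an open file handle as `lines` and write the returned entries to the word list file, one per line.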

To start, your training file should contain source and target, separated by a tab. All individual characters in the file should be separated by spaces (_ and | are reserved characters, so please replace them with something else):

w $ k # z	w a l k e r s
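A training line in this layout can be generated from a raw source/target pair. This is a rough sketch only: `format_pair` is a hypothetical helper, and the replacement symbols `~` and `!` for the reserved characters are arbitrary placeholders that should be swapped for symbols absent from your data.

```python
# Hypothetical stand-ins for the reserved characters "_" and "|".
# Assumption: neither "~" nor "!" occurs elsewhere in the data.
RESERVED = {"_": "~", "|": "!"}


def format_pair(source, target):
    """Space-separate the characters of each side and join the sides with a tab."""
    def spaced(text):
        return " ".join(RESERVED.get(ch, ch) for ch in text)
    return f"{spaced(source)}\t{spaced(target)}"


if __name__ == "__main__":
    print(format_pair("w$k#z", "walkers"))
```

Keeping the substitution table in one place makes it easier to reverse the mapping on the system output later, if the reserved characters matter in your data.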

Call m2m-aligner with the following command:

./m2m+ --maxX 1 --maxY 1 --delX --delY -i train.txt -o train.pass1

This will produce the primary alignment:

w|_|$|k|_|#|z| w|a|l|k|e|r|s|

Replace all pipes with spaces:

sed -i 's/|/ /g' train.pass1

w _ $ k _ # z w a l k e r s

And run the second pass of the aligner:

./m2m+ --maxX 1 --maxY 1 -i train.pass1 -o train.pass2

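w|_:$|k|_:#|z| w|a:l|k|e:r|s|

Remove all null mergings, that is, every null symbol (_) joined to a neighbor by a colon. Genuine merges such as a:l must stay intact:

sed -i 's/_://g' train.pass2
sed -i 's/:_//g' train.pass2

w|$|k|#|z| w|a:l|k|e:r|s|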
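The same cleanup can be done without sed. This sketch assumes that the null symbol is `_` and that merged tokens are joined with `:`; `strip_null_merges` is a hypothetical helper name. It drops only a null attached to a neighbor and leaves real merges alone.

```python
def strip_null_merges(line, null="_", joiner=":"):
    """Remove null symbols that were merged with a neighboring token.

    Assumption: nulls appear as '_' and merges use ':'. Only the
    '_:' and ':_' pairs are deleted, so 'a:l' is preserved.
    """
    return line.replace(null + joiner, "").replace(joiner + null, "")


if __name__ == "__main__":
    aligned = "w|_:$|k|_:#|z| w|a:l|k|e:r|s|"
    print(strip_null_merges(aligned))
```

Removing every colon instead would also split the genuine two-character merges, which the aligner produced on purpose.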

And run DTLM:

./DTLM --cs 3 --ng 7 --jointMgram 3 --inChar : --outChar ' ' -t test.txt -a test.out -f train.pass2 --mo modelName

Note that the test file should contain only the source side, with characters separated by colons:

r:U:n:#:z
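Test lines in this form can be produced from raw source strings. A minimal sketch, assuming the separator is the same character passed to DTLM via --inChar in the command above; `format_test_line` is a hypothetical helper, and the check for a pre-existing separator is a precaution added here, not a documented DTLM requirement.

```python
def format_test_line(source, sep=":"):
    """Join the characters of a source string with the separator."""
    if sep in source:
        # A literal separator inside the word would make the line ambiguous.
        raise ValueError(f"source already contains the separator {sep!r}")
    return sep.join(source)


if __name__ == "__main__":
    print(format_test_line("rUn#z"))
```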

If you want to test on another file later, use the second-to-last generated model (DTLM runs until convergence, so the last model will usually be slightly worse than the second-to-last).

GarrettNicolai/M2MP

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages