Create training sets for tagger and lemmatiser for Middle Low German.
Prerequisites:
The Bracmat scripts require the programming language Bracmat. Install from https://github.com/BartJongejan/Bracmat
The resources are created by extracting relevant data from the "Reference Corpus Middle Low German/Low Rhenish (1200–1650); Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200–1650)" https://www.fdr.uni-hamburg.de/record/9195
Follow these links to download the two archives relANNIS-split_1.1.zip and CorA-ReN-XML_1.1.zip:
https://www.fdr.uni-hamburg.de/record/9195/files/relANNIS-split_1.1.zip?download=1
https://www.fdr.uni-hamburg.de/record/9195/files/CorA-ReN-XML_1.1.zip?download=1
Unzip both archives in the 'scripts' folder, so that the folders 'relANNIS-split_1.1' and 'CorA-ReN-XML_1.1' become siblings of the 'PosTrainFiles' folder.
The folders 'relANNIS-split_1.1/relAnnis_trans_split' and 'CorA-ReN-XML_1.1/ReN_trans_2021-01-06' are not used and can be deleted right away.
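The download and unzip steps above can be sketched as a small shell script. This is only an illustration, not part of the original scripts: the function names fetch_archives and check_layout are made up here, curl and unzip are assumed to be installed, and it is assumed (not confirmed) that each archive unpacks into a top-level folder named after the archive. The check_layout function is there to catch the case where that assumption does not hold.

```shell
#!/bin/sh
# Sketch only: fetch the two archives and verify the folder layout described above.
# Assumption: each archive unpacks into a top-level folder named after the archive.

BASE_URL="https://www.fdr.uni-hamburg.de/record/9195/files"

# Download and unpack both archives into the given 'scripts' folder.
# (Hypothetical helper; requires curl and unzip.)
fetch_archives() {
    scripts_dir=$1
    for name in relANNIS-split_1.1 CorA-ReN-XML_1.1; do
        curl -L -o "$scripts_dir/$name.zip" "$BASE_URL/$name.zip?download=1" || return 1
        unzip -q "$scripts_dir/$name.zip" -d "$scripts_dir" || return 1
    done
}

# Report whether the three expected sibling folders exist.
# Returns non-zero if any of them is missing.
check_layout() {
    scripts_dir=$1
    status=0
    for dir in PosTrainFiles relANNIS-split_1.1 CorA-ReN-XML_1.1; do
        if [ -d "$scripts_dir/$dir" ]; then
            echo "ok: $dir"
        else
            echo "missing: $dir"
            status=1
        fi
    done
    return $status
}
```

From the parent of the 'scripts' folder, one would call fetch_archives scripts and then check_layout scripts. If check_layout reports a missing folder, move the unpacked folders by hand until the layout matches the description above.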
Open a terminal (Windows Command Prompt or a Linux shell). Navigate to the 'scripts' folder. Run bracmat. After the {?} prompt, type
get$"1ekstrakt.bra"
Press Enter. This step takes about an hour. Output is written to the 'output' folder.
Next, type
get$"2makelist.bra"
and
get$"3clamp.bra"
This produces 'toklemSort.tab.ph' (used for training lemmatisation rules without regard to POS tags) and 'toklemposSort.tab.ph' (used for training POS-sensitive rules).
Next, type
get$"4tri.bra"
The output 'trigramFrequenciesSorted' can be used by CSTlemma to steer the disambiguation between lemma candidates (option -T).
Then run
get$"5postrainset.bra"
This creates text files in the 'scripts/PosTrainFiles' folder that can be used to train a POS tagger.
To train the lemmatisation rules for the CSTlemma program from the linguistic resources produced by the above steps, install the affixtrain program.
git clone https://github.com/kuhumcst/affixtrain.git
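Follow the instructions for building the affixtrain program. Then, to train rules that let CSTlemma construct lemmas without access to POS tags, run the following command, replacing the path with the location of your own affixtrain binary: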
nohup nice /home/unicph.domain/zgk261/bin/affixtrain -n FB -i toklemSort.tab.ph >out 2>err &
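If the input is POS-tagged, a different set of lemmatisation rules should be used. These are made by running the following command. Note that it writes to the same 'out' and 'err' files as the previous command, so move those aside first if you want to keep them: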
nohup nice /home/unicph.domain/zgk261/bin/affixtrain -n FBT -i toklemposSort.tab.ph >out 2>err &