gml

Create training sets for tagger and lemmatiser for Middle Low German.

Prerequisites:

The Bracmat scripts require the programming language Bracmat. Install it from https://github.com/BartJongejan/Bracmat

The resources are created by extracting relevant data from the "Reference Corpus Middle Low German/Low Rhenish (1200–1650); Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200–1650)" https://www.fdr.uni-hamburg.de/record/9195

Follow these links to download the two archives relANNIS-split_1.1.zip and CorA-ReN-XML_1.1.zip:

https://www.fdr.uni-hamburg.de/record/9195/files/relANNIS-split_1.1.zip?download=1

https://www.fdr.uni-hamburg.de/record/9195/files/CorA-ReN-XML_1.1.zip?download=1

Unzip both archives in the 'scripts' folder, so that the folders 'relANNIS-split_1.1' and 'CorA-ReN-XML_1.1' become siblings of the 'PosTrainFiles' folder.

The folders 'relANNIS-split_1.1/relAnnis_trans_split' and 'CorA-ReN-XML_1.1/ReN_trans_2021-01-06' are not used and can be deleted right away.
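The download and unpack steps above can be sketched as a small shell script. This is only an illustration, not part of the repository: the function names `fetch_archives` and `check_layout` are hypothetical, and the sketch assumes that `curl` and `unzip` are installed and that each archive unpacks into a top-level folder with the same name as the archive.

```shell
# Base URL of the corpus record, taken from the links above.
BASE="https://www.fdr.uni-hamburg.de/record/9195/files"

# Hypothetical helper: download and unpack both archives inside the
# 'scripts' folder given as $1, then delete the two unused subfolders.
# The body runs in a subshell so the caller's working directory is unchanged.
fetch_archives() (
    cd "$1" || exit 1
    for z in relANNIS-split_1.1.zip CorA-ReN-XML_1.1.zip; do
        curl -L -o "$z" "$BASE/$z?download=1" || exit 1
        unzip -q "$z" || exit 1
    done
    rm -rf relANNIS-split_1.1/relAnnis_trans_split \
           CorA-ReN-XML_1.1/ReN_trans_2021-01-06
)

# Hypothetical helper: verify that the three folders are siblings
# inside the 'scripts' folder given as $1. Returns non-zero if any is missing.
check_layout() {
    scripts_dir="$1"
    missing=0
    for d in PosTrainFiles relANNIS-split_1.1 CorA-ReN-XML_1.1; do
        if [ ! -d "$scripts_dir/$d" ]; then
            echo "missing: $scripts_dir/$d"
            missing=1
        fi
    done
    return "$missing"
}
```

From the repository root, `fetch_archives scripts && check_layout scripts` would fetch the data and confirm the layout before Bracmat is started.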

Open a terminal (Windows Command Prompt or a Linux shell). Navigate to the 'scripts' folder. Run bracmat. After the {?} prompt, type

get$"1ekstrakt.bra"

Press Enter. This step takes about an hour. Output is written to the 'output' folder.

Next, type

get$"2makelist.bra"

and

get$"3clamp.bra"

This produces 'toklemSort.tab.ph' (used for training lemmatisation rules without regard to POS tags) and 'toklemposSort.tab.ph' (the same, for POS-sensitive rules). Next, type

get$"4tri.bra"

The output 'trigramFrequenciesSorted' can be used by CSTlemma to steer the disambiguation between lemma candidates (-T option).

Then run

get$"5postrainset.bra"

This creates text files in the 'scripts/PosTrainFiles' folder that can be used to train a POS tagger.
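The five steps above can also be run in sequence without typing at the {?} prompt. The sketch below is an assumption-laden illustration, not part of the repository: it relies on Bracmat evaluating each command-line argument as a Bracmat expression, and the names `STEPS`, `bracmat_expr`, `run_pipeline` and `DRY_RUN` are hypothetical. Whether bracmat reports a failed script through its exit status is not confirmed here, so check the output of each step.

```shell
# Scripts in the order given above.
STEPS="1ekstrakt.bra 2makelist.bra 3clamp.bra 4tri.bra 5postrainset.bra"

# Build the Bracmat expression that loads one script,
# e.g. get$"1ekstrakt.bra" (the same text typed at the {?} prompt).
bracmat_expr() {
    printf 'get$"%s"' "$1"
}

# Run all steps from the 'scripts' folder. With DRY_RUN=1 the commands
# are only printed, which is useful for checking the quoting first.
run_pipeline() {
    for s in $STEPS; do
        expr_text=$(bracmat_expr "$s")
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "bracmat '$expr_text'"
        else
            # Assumption: bracmat evaluates its argument as an expression.
            bracmat "$expr_text" || return 1
        fi
    done
}
```

Calling `run_pipeline` from inside the 'scripts' folder would then work through all five scripts; since the first step alone takes about an hour, running it under `nohup` may be convenient.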

To train the lemmatisation rules for the CSTlemma program from the linguistic resources produced by the above steps, install the affixtrain program:

git clone https://github.com/kuhumcst/affixtrain.git

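Follow the instructions for building the affixtrain program. Then, to train rules that let CSTlemma construct lemmas without access to POS tags, run the following (replace the path with the location of your own affixtrain binary):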

nohup nice /home/unicph.domain/zgk261/bin/affixtrain -n FB -i toklemSort.tab.ph >out 2>err &

If the input is POS-tagged, different lemmatisation rules should be used. These are made by running the following:

nohup nice /home/unicph.domain/zgk261/bin/affixtrain -n FBT -i toklemposSort.tab.ph >out 2>err &
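The two training commands above differ only in the value of -n and in the input file, and both contain a user-specific path. A minimal sketch of a wrapper that makes the path configurable is shown below; the variable `AFFIXTRAIN` and the function `train_cmd` are hypothetical names, and the affixtrain options are copied unchanged from the commands above.

```shell
# Location of the affixtrain binary; defaults to whatever is on PATH.
AFFIXTRAIN="${AFFIXTRAIN:-affixtrain}"

# Print the training command for one of the two rule sets:
#   nopos  rules that ignore POS tags   (-n FB,  toklemSort.tab.ph)
#   pos    POS-sensitive rules          (-n FBT, toklemposSort.tab.ph)
train_cmd() {
    case "$1" in
        nopos) echo "$AFFIXTRAIN -n FB -i toklemSort.tab.ph" ;;
        pos)   echo "$AFFIXTRAIN -n FBT -i toklemposSort.tab.ph" ;;
        *)     echo "usage: train_cmd nopos|pos" >&2; return 1 ;;
    esac
}
```

For example, `nohup nice $(train_cmd pos) >out.pos 2>err.pos &` starts the POS-sensitive training in the background. Using separate log file names for the two runs avoids one run overwriting the 'out' and 'err' files of the other.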
