Creates OCR-D Ground-Truth Transcription Level Rules automatically from the encodings published by MUFI: The Medieval Unicode Font Initiative.
The resulting OCR-D level rules conform to the OCR-D specification. These rules can be used for substitutions or level checks, among other things.
Note:
- There may not always be a definition for every level, esp. on level 1.
- OCR-D will try to fill in these gaps manually or automatically. The automated completion is based on the unicruft program.
- For this reason, the rules should not yet be used for automatic character normalization from level 3 or level 2 to level 1 until the corresponding rules have been manually checked and corrected.
🚦 You can download the set of rules here. 🚦
- select the corresponding rule file: rules directory
- as zip release file: latest Releases
- copy or clone the repository:
git clone https://github.com/tboenig/gt-MufiLevelRules.git
- Install Saxon for XSL Transformations v3.0, then run:
java -jar saxon-he-XX.jar -xsl:scripts/MufiGTLevelRules2.xsl -s:scripts/MufiGTLevelRules.xsl output=characters merge=yes
Parameters:
- output=characters -> creates the rules; all rule files are saved under the directory [directory]/rules/characters
- merge=yes -> creates the megarules (all rules in one file); the megarules file is saved under the directory [directory]/rules
The result of the conversion can be found in the directory [directory]/rules/characters.
Output formats:
- xml
- json
The script uses:
- the MUFI rules (new version) and the MUFI rules (old version)
- a summary of the following additional rules from the OCR-D Ground-Truth Transcription Guide, which take precedence over the MUFI rules where applicable:
All JSON files (both the pure MUFI rules and the final result) follow the same schema.
Example:
{"ruleset":[
...
{"rule": ["ä", "aͤ", ""], "type": "level"}
...
]}
- Each rule has a key rule and a list of values. The values define the character representation on each of the 3 transcription levels:
  - Level 1 is in the first position
  - Level 2 is in the second position
  - Level 3 is in the third position
- Additional key-value combinations: ...
- Character values can be empty to signify that there is no definition (representation) at that level.
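The schema above can be used for substitutions roughly as follows. This is a minimal sketch, not part of the repository: the function names `build_mapping` and `substitute` are hypothetical, and the sample is inlined instead of read from a generated file. Rules with an empty value on either level are skipped, since an empty value means there is no definition at that level. As noted earlier, normalization to level 1 should only be applied after the rules have been checked manually.

```python
import json
import re

# Inline sample in the schema shown above (same rule as in the example).
# A real run would load one of the generated JSON files instead.
SAMPLE = r'{"ruleset": [{"rule": ["\u00e4", "a\u0364", ""], "type": "level"}]}'


def build_mapping(ruleset, source_level, target_level):
    """Map each source-level string to its target-level string.

    Levels are 1-based. Rules with an empty value on either level
    are skipped because no representation is defined there.
    """
    mapping = {}
    for entry in ruleset:
        values = entry["rule"]
        src = values[source_level - 1]
        dst = values[target_level - 1]
        if src and dst:
            mapping[src] = dst
    return mapping


def substitute(text, mapping):
    """Replace all mapped sequences in a single pass, longest first."""
    if not mapping:
        return text
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


rules = json.loads(SAMPLE)["ruleset"]
level2_to_1 = build_mapping(rules, source_level=2, target_level=1)
normalized = substitute("Ma\u0364dchen", level2_to_1)
```

Doing the replacement in one regular-expression pass avoids one substitution feeding into another; sorting the keys by length lets multi-character sequences win over their single-character prefixes.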
<levelrules>
<ruleset>
<range>AlphPresForm</range>
<desc>LATIN SMALL LIGATURE FF</desc>
<rule>ff</rule>
<rule>ff</rule>
<rule>ff</rule>
<type>level</type>
</ruleset>
</levelrules>
- Elements:
  - <levelrules> = root element of a gt-MufiLevelRules dataset
  - <ruleset> = root element of a ruleset
  - <range> = category of characters
  - <desc> = general description of the sign or symbol
  - <rule> = character representation on one transcription level:
    - Level 1: rule[position() = 1]
    - Level 2: rule[position() = 2]
    - Level 3: rule[position() = 3]
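The positional convention for the rule elements can be read with any XML library. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `read_rulesets` and the returned dictionary keys are hypothetical, and the sample is the XML example from above, inlined.

```python
import xml.etree.ElementTree as ET

# The XML example from above, inlined so the sketch is self-contained.
SAMPLE = """<levelrules>
  <ruleset>
    <range>AlphPresForm</range>
    <desc>LATIN SMALL LIGATURE FF</desc>
    <rule>ff</rule>
    <rule>ff</rule>
    <rule>ff</rule>
    <type>level</type>
  </ruleset>
</levelrules>"""


def read_rulesets(xml_text):
    """Return one dict per <ruleset>, with the levels in document order."""
    root = ET.fromstring(xml_text)
    result = []
    for ruleset in root.iter("ruleset"):
        # The n-th <rule> child holds level n; an empty element has
        # text None, which is normalized to "" (no definition).
        levels = [(rule.text or "") for rule in ruleset.findall("rule")]
        result.append({
            "range": ruleset.findtext("range", default=""),
            "desc": ruleset.findtext("desc", default=""),
            "levels": levels,
            "type": ruleset.findtext("type", default=""),
        })
    return result


rulesets = read_rulesets(SAMPLE)
level1 = rulesets[0]["levels"][0]  # same as rule[position() = 1]
```

Because list indices are 0-based, level n is found at index n - 1 of the `levels` list.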
The category of characters <range> and the general description of the sign or symbol <desc> were imported from the MUFI dataset.
The JSONPaths are:
- range: $['..']['range']
- desc: $['..']['description']
- MUFI: The Medieval Unicode Font Initiative https://mufi.info/
- MUFI's data as JSON export https://gefin.ku.dk/q.php?q=mufiexport
- OCR-D Ground Truth Transcription Guidelines https://ocr-d.de/en/gt-guidelines/trans/
- Ground Truth level overview https://ocr-d.de/en/gt-guidelines/trans/trLevels.html