gt-MufiLevelRules

Creates OCR-D Ground-Truth Transcription Level Rules automatically from the encodings published by MUFI: The Medieval Unicode Font Initiative.

The resulting OCR-D level rules conform to the OCR-D specification. These rules can be used for substitutions or level checks, among other things.

Note:

Not every rule defines a character for every level; gaps are most common at level 1.
OCR-D will try to fill these gaps manually or automatically. The automated completion is based on the unicruft program.
For this reason, using the rules for automatic character normalization from level 3 or level 2 to level 1 is currently not recommended until the corresponding rules have been manually checked and corrected.

Download the Rules

🚦 You can download the set of rules here. 🚦

- Select individual rule files: rules directory
- Download everything as a zip release file: latest Releases

Recreation of the rules

Copy or clone the repository:

git clone https://github.com/tboenig/gt-MufiLevelRules.git

Install Saxon for XSL Transformations v3.0, then run:

java -jar saxon-he-XX.jar -xsl:scripts/MufiGTLevelRules2.xsl -s:scripts/MufiGTLevelRules.xsl output=characters merge=yes

Parameters:

output=characters -> creates the individual rules; all rules are saved under the directory [directory]/rules/characters
merge=yes -> creates the megarules (all rules in one file); the megarules are saved under the directory [directory]/rules

The result of the conversion can be found in the directory: [directory]/rules/characters.
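
For scripted rebuilds, the Saxon call above can be wrapped in a few lines of Python. This is only a sketch: the function names `saxon_command` and `recreate_rules` are made up for illustration, the jar file name is whatever Saxon release you installed (the `XX` in the command above is left as a placeholder), and the script is assumed to be run from the repository root.

```python
import subprocess


def saxon_command(saxon_jar, output="characters", merge="yes"):
    """Build the argument list for the Saxon call documented in this README.

    saxon_jar is the path to the locally installed Saxon-HE jar (assumption:
    the caller knows the exact file name of their release).
    """
    return [
        "java", "-jar", saxon_jar,
        "-xsl:scripts/MufiGTLevelRules2.xsl",
        "-s:scripts/MufiGTLevelRules.xsl",
        # Stylesheet parameters are passed to Saxon as name=value pairs.
        f"output={output}",
        f"merge={merge}",
    ]


def recreate_rules(saxon_jar, repo_dir="."):
    """Run the transformation inside repo_dir (hypothetical helper)."""
    # check=True raises CalledProcessError if Saxon exits with a non-zero status.
    subprocess.run(saxon_command(saxon_jar), cwd=repo_dir, check=True)
```

Calling `recreate_rules("saxon-he-XX.jar")` with the real jar name in place of the placeholder would run the same command as the one shown above.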

Output Format:
- xml
- json

The script uses:

- the MUFI rules [new Version] and the MUFI rules old-Version
- a summary of additional rules from the OCR-D Ground-Truth Transcription Guide, which take precedence over the MUFI rules where applicable

Description of the rules

JSON Format

All JSON files (both the pure MUFI rules and the final result) follow the same schema.

Example:

 {"ruleset":[
       ...
       {"rule": ["ä", "aͤ", ""], "type": "level"}
       ...
]}

Each rule has the key rule, whose value is a list.
The list gives the character representation on each of the 3 transcription levels:
- Level 1 is in the first position
- Level 2 is in the second position
- Level 3 is in the third position
Additional key-value combinations: ...
A character value can be empty to signify that there is no definition (representation) at that level.
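
The schema above can be read with a few lines of Python. The following is a minimal sketch, not part of this repository: the helper names `build_mapping`, `substitute` and `missing_at_level` are hypothetical, and the inline sample stands in for a real rule file. The first two sample rules are the examples shown in this README; the third is invented here purely to illustrate a gap at level 1.

```python
import json

# Inline sample in the documented schema (raw string, so JSON decodes the \u escapes).
# Rule 1: level 1 = U+00E4 (a with diaeresis), level 2 = a + U+0364, level 3 undefined.
# Rule 2: level 1/2 = "ff", level 3 = U+FB00 (ff ligature).
# Rule 3: invented for illustration only, shows a missing level 1 definition.
SAMPLE = r'''
{"ruleset": [
    {"rule": ["\u00e4", "a\u0364", ""], "type": "level"},
    {"rule": ["ff", "ff", "\ufb00"], "type": "level"},
    {"rule": ["", "\ua75b", "\ua75b"], "type": "level"}
]}
'''

RULES = json.loads(SAMPLE)["ruleset"]


def build_mapping(rules, source, target):
    """Map each rule's level-`source` form to its level-`target` form.

    Levels are 1-based. Rules with an empty value at either level, or with
    identical values at both levels, are skipped.
    """
    mapping = {}
    for entry in rules:
        values = entry["rule"]
        src, dst = values[source - 1], values[target - 1]
        if src and dst and src != dst:
            mapping[src] = dst
    return mapping


def substitute(text, mapping):
    """Apply the mapping, replacing longer source sequences first."""
    for src in sorted(mapping, key=len, reverse=True):
        text = text.replace(src, mapping[src])
    return text


def missing_at_level(rules, level):
    """Return the rules that have no definition at the given level."""
    return [entry["rule"] for entry in rules if not entry["rule"][level - 1]]
```

Skipping empty values matters because of the note at the top of this README: a rule without a level 1 definition should be reviewed manually rather than silently deleting the character, and `missing_at_level(RULES, 1)` lists exactly those candidates.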

XML Format

<levelrules>
    <ruleset>
        <range>AlphPresForm</range>
        <desc>LATIN SMALL LIGATURE FF</desc>
        <rule>ff</rule>
        <rule>ff</rule>
        <rule>ﬀ</rule>
        <type>level</type>
    </ruleset>
</levelrules>

Elements:
<levelrules> = root element of a gt-MufiLevelRules dataset
- <ruleset> = root element of a ruleset
  - <range> = category of characters
  - <desc> = general description of the sign or symbol
  - <rule> = character representation on one transcription level, identified by its position
    - Level 1: rule[position() = 1]
    - Level 2: rule[position() = 2]
    - Level 3: rule[position() = 3]

The category of characters <range> and the general description of the sign or symbol <desc> were imported from the MUFI dataset.
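
The positional convention for <rule> can be followed with Python's standard library. This is a sketch under assumptions: the function name `read_rulesets` is hypothetical, the sample is the XML example from this README (with the ligature written as a character reference), and an empty <rule> element is assumed to mean "no definition", mirroring the JSON format.

```python
import xml.etree.ElementTree as ET

# The example from this README; &#xFB00; is the ff ligature (U+FB00).
SAMPLE_XML = """<levelrules>
    <ruleset>
        <range>AlphPresForm</range>
        <desc>LATIN SMALL LIGATURE FF</desc>
        <rule>ff</rule>
        <rule>ff</rule>
        <rule>&#xFB00;</rule>
        <type>level</type>
    </ruleset>
</levelrules>"""


def read_rulesets(xml_text):
    """Return one dict per <ruleset> with range, desc and the level values."""
    root = ET.fromstring(xml_text)
    result = []
    for ruleset in root.iter("ruleset"):
        # Document order of the <rule> children encodes the level (1, 2, 3).
        # Assumption: an empty <rule/> means there is no definition at that level.
        levels = [(rule.text or "") for rule in ruleset.findall("rule")]
        result.append({
            "range": ruleset.findtext("range", default=""),
            "desc": ruleset.findtext("desc", default=""),
            "levels": levels,
        })
    return result
```

With this layout, `levels[0]` corresponds to rule[position() = 1], `levels[1]` to rule[position() = 2], and `levels[2]` to rule[position() = 3].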

The JSONPaths are:

range : $['..']['range']
desc : $['..']['description']
