OCR-D-Level-Rules can be created automatically with gt-MufiLevelRules from the encodings published by MUFI: The Medieval Unicode Font Initiative.

gt-MufiLevelRules

Creates OCR-D Ground-Truth Transcription Level Rules automatically from the encodings published by MUFI: The Medieval Unicode Font Initiative.

The resulting OCR-D level rules conform to the OCR-D specification. These rules can be used for substitutions or level checks, among other things.

Note:

  • Not every character has a definition at every level; gaps occur especially at level 1.
  • OCR-D will try to fill in these gaps manually or automatically. The automated completion is based on the unicruft program.
  • For this reason, using the rules to normalize characters automatically from level 3 or level 2 to level 1 is currently not recommended until the corresponding rules have been checked and corrected manually.
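
Before relying on the rules for normalization, it can help to list the rules that still lack a definition at a given level. The following is a minimal sketch, assuming the JSON schema described in the JSON Format section of this README. The sample data and the helper name `rules_missing_level` are hypothetical and are not part of gt-MufiLevelRules.

```python
import json

# Hypothetical sample in the JSON schema described in this README.
# The second rule has no level 1 definition (empty first position).
SAMPLE = r"""
{"ruleset": [
    {"rule": ["ff", "ff", "\ufb00"], "type": "level"},
    {"rule": ["", "\ua75b", "\ua75b"], "type": "level"}
]}
"""


def rules_missing_level(ruleset, level):
    """Return the rules whose entry for the given level (1 to 3) is empty."""
    index = level - 1
    return [
        entry
        for entry in ruleset
        if entry.get("type") == "level" and not entry["rule"][index]
    ]


data = json.loads(SAMPLE)
gaps = rules_missing_level(data["ruleset"], 1)
for entry in gaps:
    print("no level 1 definition:", entry["rule"])
```

Rules reported this way are the ones to review by hand before any automatic normalization to that level.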

Download the Rules

🚦 You can download the set of rules here. 🚦

Recreation of the rules

  1. Copy or clone the repository.

    git clone https://github.com/tboenig/gt-MufiLevelRules.git

  2. Install Saxon, which supports XSL Transformations (XSLT) 3.0, then run:

    java -jar saxon-he-XX.jar -xsl:scripts/MufiGTLevelRules2.xsl -s:scripts/MufiGTLevelRules.xsl output=characters merge=yes

Parameters:

  • output=characters: creates the rules; all rules are saved in the directory [directory]/rules/characters
  • merge=yes: creates the megarules, i.e. all rules merged into one file; the megarules are saved in the directory [directory]/rules

The result of the conversion can be found in the directory: [directory]/rules/characters.

  • Output Format:
    • xml
    • json

The script uses:

  1. the MUFI rules (new version) and the MUFI rules (old version)

  2. a summary of additional rules from the OCR-D Ground-Truth Transcription Guide, which take precedence over the MUFI rules where applicable.

Description of the rules

JSON Format

All JSON files (both the pure MUFI rules and the final result) follow the same schema.

Example:

    {"ruleset": [
        ...
        {"rule": ["ä", "", ""], "type": "level"}
        ...
    ]}

  • Each rule has the key rule, whose value is a list of three character values
  • The values define the character representation on each of the 3 transcription levels:
    • Level 1 is at the first position
    • Level 2 is at the second position
    • Level 3 is at the third position
  • Additional key-value combinations: ...
  • Character values can be empty to signify there is no definition (representation) at that level.
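
As an illustration of how the three positions can be used for substitutions, here is a minimal sketch that builds a replacement table from level 3 to level 2 and skips every rule where either level is empty. The sample rules and the function names `build_substitutions` and `apply_substitutions` are hypothetical; the plain sequential `str.replace` is a simplification and not necessarily how OCR-D tools apply the rules.

```python
import json

# Hypothetical sample rules in the schema shown above.
# Positions: [level 1, level 2, level 3]; "" means "no definition".
SAMPLE = r"""
{"ruleset": [
    {"rule": ["ff", "ff", "\ufb00"], "type": "level"},
    {"rule": ["", "", "\ua75b"], "type": "level"},
    {"rule": ["ä", "ä", "ä"], "type": "level"}
]}
"""


def build_substitutions(ruleset, source_level, target_level):
    """Map source-level strings to target-level strings, skipping gaps."""
    table = {}
    for entry in ruleset:
        levels = entry["rule"]
        source = levels[source_level - 1]
        target = levels[target_level - 1]
        # Skip undefined levels and rules that would change nothing.
        if source and target and source != target:
            table[source] = target
    return table


def apply_substitutions(text, table):
    """Replace longer source strings first so they win over shorter ones."""
    for source in sorted(table, key=len, reverse=True):
        text = text.replace(source, table[source])
    return text


data = json.loads(SAMPLE)
table = build_substitutions(data["ruleset"], 3, 2)
normalized = apply_substitutions("Schi\ufb00", table)
print(normalized)
```

Skipping empty values keeps characters without a target definition unchanged instead of deleting them from the text.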

XML Format

<levelrules>
    <ruleset>
        <range>AlphPresForm</range>
        <desc>LATIN SMALL LIGATURE FF</desc>
        <rule>ff</rule>
        <rule>ff</rule>
        <rule>ff</rule>
        <type>level</type>
    </ruleset>
</levelrules>

Elements:

  • <levelrules> = root element of a gt-MufiLevelRules dataset
    • <ruleset> = root element of a ruleset
      • <range> = category of characters
      • <desc> = general description of the sign or symbol
      • <rule>
        • Level 1: rule[position() = 1]
        • Level 2: rule[position() = 2]
        • Level 3: rule[position() = 3]
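
The positional meaning of the `<rule>` elements can be read out with a few lines of code. The sketch below uses Python's standard `xml.etree.ElementTree` and the XML example from this section; it assumes the elements carry no XML namespace, and the function name `read_rulesets` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The XML example from this README, used as inline test data.
SAMPLE = """<levelrules>
    <ruleset>
        <range>AlphPresForm</range>
        <desc>LATIN SMALL LIGATURE FF</desc>
        <rule>ff</rule>
        <rule>ff</rule>
        <rule>ff</rule>
        <type>level</type>
    </ruleset>
</levelrules>"""


def read_rulesets(xml_text):
    """Return one dict per <ruleset>; "levels" holds levels 1 to 3 in order."""
    root = ET.fromstring(xml_text)
    result = []
    for ruleset in root.findall("ruleset"):
        # Document order of <rule> gives the level; an empty element
        # is read as "" (no definition at that level).
        levels = [(rule.text or "") for rule in ruleset.findall("rule")]
        result.append(
            {
                "range": ruleset.findtext("range", default=""),
                "desc": ruleset.findtext("desc", default=""),
                "levels": levels,
                "type": ruleset.findtext("type", default=""),
            }
        )
    return result


rulesets = read_rulesets(SAMPLE)
for item in rulesets:
    print(item["desc"], item["levels"])
```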

The category of characters <range> and the general description of the sign or symbol <desc> were imported from the MUFI dataset.

The JSONPaths are:

  • range : $['..']['range']
  • desc : $['..']['description']

See Also
