Commit

add info about unicharambigs file v2; fixes #165
zdenop committed Oct 21, 2018
1 parent 32c1e4f commit aefcbac
Showing 1 changed file with 22 additions and 0 deletions.
22 changes: 22 additions & 0 deletions doc/unicharambigs.5.asc
@@ -33,17 +33,38 @@ EXAMPLE
-------

...............................
v1
2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0
...............................

The first line is a version identifier.
In this example, all instances of the '2' character sequence '''' will
*always* be replaced by the '1' character sequence '"'; a '1' character
sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
the '3' character sequence 'iii' *may* be replaced by the '1' character
sequence 'm'.
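
The v1 line layout described above can be sketched in Python as follows.
This is an illustrative reader only, not Tesseract's own parser: the function
name `parse_v1_line` is hypothetical, and it assumes that fields are separated
by arbitrary whitespace and that no unichar itself contains whitespace.

```python
def parse_v1_line(line):
    """Parse one v1 unicharambigs line (illustrative sketch).

    Assumed layout: <n_src> <src unichars...> <n_dst> <dst unichars...> <type>
    Returns (source_unichars, replacement_unichars, is_mandatory).
    """
    tokens = line.split()
    n_src = int(tokens[0])                      # length of the "error" sequence
    src = tokens[1:1 + n_src]
    n_dst = int(tokens[1 + n_src])              # length of the "correction" sequence
    dst_start = 2 + n_src
    dst = tokens[dst_start:dst_start + n_dst]
    if len(src) != n_src or len(dst) != n_dst or len(tokens) <= dst_start + n_dst:
        raise ValueError("malformed v1 line: %r" % line)
    type_flag = tokens[dst_start + n_dst]       # "1" = mandatory, "0" = optional
    if type_flag not in ("0", "1"):
        raise ValueError("unknown type specifier: %r" % type_flag)
    return src, dst, type_flag == "1"


if __name__ == "__main__":
    for entry in ["2 ' ' 1 \" 1", "1 m 2 r n 0", "3 i i i 1 m 0"]:
        print(parse_v1_line(entry))
```

Applied to the three example lines, the sketch yields one mandatory
substitution and two optional ones.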

Versions 3.03 and later support a new, simpler format for the unicharambigs
file:

...............................
v2
'' " 1
m rn 0
iii m 0
...............................

In this format, the "error" and "correction" are simple UTF-8 strings
separated by a space, followed, after another space, by the same type
specifier as in v1 (0 for optional and 1 for mandatory substitution). The
downside of this simpler format is that Tesseract has to encode the UTF-8
strings into the components of the unicharset. In complex scripts, this
encoding may be ambiguous. In that case, the encoding is chosen so as to use
the fewest UTF-8 characters for each component, i.e., the shortest
unicharset components will make up the encoding.
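
As a rough illustration of the v2 layout, the following Python sketch reads a
v2 file into (error, correction, mandatory) tuples. The names `parse_v2_line`
and `parse_v2` are hypothetical, the sketch assumes that neither string
contains whitespace, and it does not attempt the unicharset encoding step
that Tesseract performs afterwards.

```python
def parse_v2_line(line):
    """Parse one v2 line: <error> <correction> <type> (illustrative sketch)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d: %r" % (len(fields), line))
    error, correction, type_flag = fields
    if type_flag not in ("0", "1"):
        raise ValueError("unknown type specifier: %r" % type_flag)
    # "1" marks a mandatory substitution, "0" an optional one.
    return error, correction, type_flag == "1"


def parse_v2(text):
    """Parse a whole v2 file; the first non-blank line must be 'v2'."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].strip() != "v2":
        raise ValueError("missing 'v2' version identifier")
    return [parse_v2_line(ln) for ln in lines[1:]]


if __name__ == "__main__":
    sample = "v2\n'' \" 1\nm rn 0\niii m 0\n"
    for entry in parse_v2(sample):
        print(entry)
```

Because each string is kept whole, a reader of this format needs no
per-line length counts, which is what makes v2 simpler to write by hand
than v1.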

HISTORY
-------
The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
@@ -60,6 +81,7 @@ letters in the unicharset.
SEE ALSO
--------
tesseract(1), unicharset(5)
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05#the-unicharambigs-file

AUTHOR
------