From aefcbac84056a34a27e53c3e4b36e88b0a850bac Mon Sep 17 00:00:00 2001
From: zdenop
Date: Sun, 21 Oct 2018 20:18:48 +0200
Subject: [PATCH] add info about unicharambigs file v2; fixes #165

---
 doc/unicharambigs.5.asc | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/doc/unicharambigs.5.asc b/doc/unicharambigs.5.asc
index 079f6d53de..6981128f26 100644
--- a/doc/unicharambigs.5.asc
+++ b/doc/unicharambigs.5.asc
@@ -33,17 +33,38 @@ EXAMPLE
 -------
 
 ...............................
+v1
 2 ' ' 1 " 1
 1 m 2 r n 0
 3 i i i 1 m 0
 ...............................
 
+The first line is a version identifier.
 In this example, all instances of the '2' character sequence '''' will
 *always* be replaced by the '1' character sequence '"'; a '1' character
 sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
 the '3' character sequence *may* be replaced by the '1' character
 sequence 'm'.
 
+Version 3.03 and on supports a new, simpler format for the unicharambigs
+file:
+
+...............................
+v2
+'' " 1
+m rn 0
+iii m 0
+...............................
+
+In this format, the "error" and "correction" are simple UTF-8 strings
+separated by a space, and, after another space, the same type specifier
+as v1 (0 for optional and 1 for mandatory substitution). Note the downside
+of this simpler format is that Tesseract has to encode the UTF-8 strings
+into the components of the unicharset. In complex scripts, this encoding
+may be ambiguous. In this case, the encoding is chosen such as to use the
+least UTF-8 characters for each component, ie the shortest unicharset
+components will make up the encoding.
+
 HISTORY
 -------
 The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
@@ -60,6 +81,7 @@ letters in the unicharset.
 SEE ALSO
 --------
 tesseract(1), unicharset(5)
+https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05#the-unicharambigs-file
 
 AUTHOR
 ------
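
Note on the two layouts documented in this patch: the v1 and v2 examples are meant to describe the same three substitutions. The sketch below is one way to check that reading. It is not Tesseract code; the names `Ambig`, `parse_v1_line`, `parse_v2_line`, and `parse_unicharambigs` are hypothetical. It assumes that fields are separated by whitespace, that each unichar is a whitespace-free token, and that the file starts with a `v1` or `v2` line.

```python
from typing import List, NamedTuple


class Ambig(NamedTuple):
    """One substitution rule (hypothetical container, not a Tesseract type)."""
    error: str        # sequence to be replaced
    correction: str   # sequence that replaces it
    mandatory: bool   # True for type 1, False for type 0


def parse_v2_line(line: str) -> Ambig:
    # v2 layout: "<error> <correction> <type>"
    error, correction, kind = line.split()
    return Ambig(error, correction, kind == "1")


def parse_v1_line(line: str) -> Ambig:
    # v1 layout: "<n> <n unichars> <m> <m unichars> <type>"
    # Assumption: every unichar is a single whitespace-free token.
    tokens = line.split()
    n = int(tokens[0])
    error = tokens[1:1 + n]
    m = int(tokens[1 + n])
    correction = tokens[2 + n:2 + n + m]
    kind = tokens[2 + n + m]
    # Joining the unichars discards their boundaries, which is exactly
    # the information the v2 layout leaves for Tesseract to re-derive.
    return Ambig("".join(error), "".join(correction), kind == "1")


def parse_unicharambigs(text: str) -> List[Ambig]:
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    version, body = lines[0].strip(), lines[1:]
    if version == "v1":
        return [parse_v1_line(ln) for ln in body]
    if version == "v2":
        return [parse_v2_line(ln) for ln in body]
    raise ValueError(f"unsupported version line: {version!r}")
```

Parsing the v1 example and the v2 example from the patch with this sketch yields the same list of rules. The sketch does not attempt the unicharset encoding step that the v2 paragraph mentions, so it says nothing about the ambiguous cases in complex scripts.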