add info about unicharambigs file v2; fixes #165

tesseract-ocr · Oct 21, 2018 · aefcbac
1 parent 32c1e4f
Showing 1 changed file with 22 additions and 0 deletions.
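The v1 and v2 lines shown in the diff below describe the same three substitutions, so the relationship between the two formats can be sketched in a few lines of Python. This is an illustrative sketch only, not code from Tesseract: the function names `v1_to_v2` and `parse_v2_line` are hypothetical, and it assumes that fields are separated by whitespace and that no unichar itself contains whitespace.

```python
def v1_to_v2(line):
    """Convert one v1 ambiguity line to the v2 layout (hypothetical helper).

    v1 layout: <n_err> <err unichars...> <n_cor> <cor unichars...> <type>
    v2 layout: <error string> <correction string> <type>
    """
    fields = line.split()  # assumption: tabs or spaces separate all fields
    n_err = int(fields[0])
    err = fields[1:1 + n_err]
    n_cor = int(fields[1 + n_err])
    cor = fields[2 + n_err:2 + n_err + n_cor]
    kind = fields[2 + n_err + n_cor]
    if kind not in ("0", "1"):
        raise ValueError("type specifier must be 0 or 1: %r" % kind)
    # v2 writes each side as one plain UTF-8 string instead of a counted list
    return "%s %s %s" % ("".join(err), "".join(cor), kind)


def parse_v2_line(line):
    """Split one v2 line into (error, correction, mandatory)."""
    error, correction, kind = line.split()
    if kind not in ("0", "1"):
        raise ValueError("type specifier must be 0 or 1: %r" % kind)
    return error, correction, kind == "1"


if __name__ == "__main__":
    for v1 in ("2 ' ' 1 \" 1", "1 m 2 r n 0", "3 i i i 1 m 0"):
        print(v1_to_v2(v1))
```

Note that the conversion discards the unichar boundaries that v1 records explicitly (for example `r n` becomes `rn`), which is why the v2 text has to describe how Tesseract re-encodes the strings into unicharset components.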
diff --git a/doc/unicharambigs.5.asc b/doc/unicharambigs.5.asc
@@ -33,17 +33,38 @@ EXAMPLE
 -------
 
 ...............................
+v1
 2       ' '     1       "     1
 1       m       2       r n   0
 3       i i i   1       m     0
 ...............................
 
+The first line is a version identifier.
 In this example, all instances of the '2' character sequence '''' will
 *always* be replaced by the '1' character sequence '"'; a '1' character
 sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
 the '3' character sequence *may* be replaced by the '1' character
 sequence 'm'.
 
+Tesseract 3.03 and later support a new, simpler format for the
+unicharambigs file:
+
+...............................
+v2
+'' " 1
+m rn 0
+iii m 0
+...............................
+
+In this format, the "error" and "correction" are plain UTF-8 strings
+separated by a space, followed by another space and the same type
+specifier as in v1 (0 for optional and 1 for mandatory substitution). The
+downside of this simpler format is that Tesseract has to encode the UTF-8
+strings into the components of the unicharset. In complex scripts, this
+encoding may be ambiguous. In that case, the encoding is chosen so as to
+use the fewest UTF-8 characters for each component, i.e. the shortest
+unicharset components will make up the encoding.
+
 HISTORY
 -------
 The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
@@ -60,6 +81,7 @@ letters in the unicharset.
 SEE ALSO
 --------
 tesseract(1), unicharset(5)
+https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05#the-unicharambigs-file
 
 AUTHOR
 ------