diff --git a/doc/ambiguous_words.1.html b/doc/ambiguous_words.1.html index 3fd5f7f1f6..be74b62d0d 100644 --- a/doc/ambiguous_words.1.html +++ b/doc/ambiguous_words.1.html @@ -1,790 +1,790 @@ - - - - - -AMBIGUOUS_WORDS(1) - - - - - -
-
-

SYNOPSIS

-
-

ambiguous_words [-l lang] TESSDATADIR WORDLIST AMBIGUOUSFILE

-
-
-
-

DESCRIPTION

-
-

ambiguous_words(1) runs Tesseract in a special mode, and for each word -in word list, produces a set of words which Tesseract thinks might be -ambiguous with it. TESSDATADIR must be set to the absolute path of -a directory containing tessdata/lang.traineddata.

-
-
-
-

SEE ALSO

-
-

tesseract(1)

-
-
-
-

COPYING

-
-

Copyright (C) 2012 Google, Inc. -Licensed under the Apache License, Version 2.0

-
-
-
-

AUTHOR

-
-

The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present).

-
-
-
-

- - - + + + + + +AMBIGUOUS_WORDS(1) + + + + + +
+
+

SYNOPSIS

+
+

ambiguous_words [-l lang] TESSDATADIR WORDLIST AMBIGUOUSFILE

+
+
+
+

DESCRIPTION

+
+

ambiguous_words(1) runs Tesseract in a special mode, and for each word +in word list, produces a set of words which Tesseract thinks might be +ambiguous with it. TESSDATADIR must be set to the absolute path of +a directory containing tessdata/lang.traineddata.

+
+
+
+

SEE ALSO

+
+

tesseract(1)

+
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/doc/ambiguous_words.1.xml b/doc/ambiguous_words.1.xml index 6293866ceb..4900c6eb93 100644 --- a/doc/ambiguous_words.1.xml +++ b/doc/ambiguous_words.1.xml @@ -1,43 +1,43 @@ - - - - - - - AMBIGUOUS_WORDS(1) - - -ambiguous_words -1 -  -  - - - ambiguous_words - generate sets of words Tesseract is likely to find ambiguous - - -ambiguous_words [-l lang] TESSDATADIR WORDLIST AMBIGUOUSFILE - - -DESCRIPTION -ambiguous_words(1) runs Tesseract in a special mode, and for each word -in word list, produces a set of words which Tesseract thinks might be -ambiguous with it. TESSDATADIR must be set to the absolute path of -a directory containing tessdata/lang.traineddata. - - -SEE ALSO -tesseract(1) - - -COPYING -Copyright (C) 2012 Google, Inc. -Licensed under the Apache License, Version 2.0 - - -AUTHOR -The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present). - - + + + + + + + AMBIGUOUS_WORDS(1) + + +ambiguous_words +1 +  +  + + + ambiguous_words + generate sets of words Tesseract is likely to find ambiguous + + +ambiguous_words [-l lang] TESSDATADIR WORDLIST AMBIGUOUSFILE + + +DESCRIPTION +ambiguous_words(1) runs Tesseract in a special mode, and for each word +in word list, produces a set of words which Tesseract thinks might be +ambiguous with it. TESSDATADIR must be set to the absolute path of +a directory containing tessdata/lang.traineddata. + + +SEE ALSO +tesseract(1) + + +COPYING +Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0 + + +AUTHOR +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present). + + diff --git a/doc/cntraining.1.html b/doc/cntraining.1.html index 706d3bd0f4..7653061e1e 100644 --- a/doc/cntraining.1.html +++ b/doc/cntraining.1.html @@ -1,805 +1,805 @@ - - - - - -CNTRAINING(1) - - - - - -
-
-

SYNOPSIS

-
-

cntraining [-D dir] FILE

-
-
-
-

DESCRIPTION

-
-

cntraining takes a list of .tr files, from which it generates the -normproto data file (the character normalization sensitivity -prototypes).

-
-
-
-

OPTIONS

-
-
-
--D dir -
-
-

- Directory to write output files to. -

-
-
-
-
-
-

SEE ALSO

-
-

tesseract(1), shapeclustering(1), mftraining(1)

- -
-
-
-

COPYING

-
-

Copyright (c) Hewlett-Packard Company, 1988 -Licensed under the Apache License, Version 2.0

-
-
-
-

AUTHOR

-
-

The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present).

-
-
-
-

- - - + + + + + +CNTRAINING(1) + + + + + +
+
+

SYNOPSIS

+
+

cntraining [-D dir] FILE

+
+
+
+

DESCRIPTION

+
+

cntraining takes a list of .tr files, from which it generates the +normproto data file (the character normalization sensitivity +prototypes).

+
+
+
+

OPTIONS

+
+
+
+-D dir +
+
+

+ Directory to write output files to. +

+
+
+
+
+
+

SEE ALSO

+
+

tesseract(1), shapeclustering(1), mftraining(1)

+ +
+
+
+

COPYING

+
+

Copyright (c) Hewlett-Packard Company, 1988 +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/doc/cntraining.1.xml b/doc/cntraining.1.xml index 6795f12f2c..6efc99be1d 100644 --- a/doc/cntraining.1.xml +++ b/doc/cntraining.1.xml @@ -1,58 +1,58 @@ - - - - - - - CNTRAINING(1) - - -cntraining -1 -  -  - - - cntraining - character normalization training for Tesseract - - -cntraining [-D dir] FILE - - -DESCRIPTION -cntraining takes a list of .tr files, from which it generates the -normproto data file (the character normalization sensitivity -prototypes). - - -OPTIONS - - - --D dir - - - - Directory to write output files to. - - - - - - -SEE ALSO -tesseract(1), shapeclustering(1), mftraining(1) -https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract - - -COPYING -Copyright (c) Hewlett-Packard Company, 1988 -Licensed under the Apache License, Version 2.0 - - -AUTHOR -The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present). - - + + + + + + + CNTRAINING(1) + + +cntraining +1 +  +  + + + cntraining + character normalization training for Tesseract + + +cntraining [-D dir] FILE + + +DESCRIPTION +cntraining takes a list of .tr files, from which it generates the +normproto data file (the character normalization sensitivity +prototypes). + + +OPTIONS + + + +-D dir + + + + Directory to write output files to. + + + + + + +SEE ALSO +tesseract(1), shapeclustering(1), mftraining(1) +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract + + +COPYING +Copyright (c) Hewlett-Packard Company, 1988 +Licensed under the Apache License, Version 2.0 + + +AUTHOR +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present). 
+ + diff --git a/doc/combine_tessdata.1.asc b/doc/combine_tessdata.1.asc index d93de7ea0f..7b5295f227 100644 --- a/doc/combine_tessdata.1.asc +++ b/doc/combine_tessdata.1.asc @@ -11,7 +11,7 @@ SYNOPSIS DESCRIPTION ----------- -combine_tessdata(1) is the main program to combine/extract/overwrite +combine_tessdata(1) is the main program to combine/extract/overwrite tessdata components in [lang].traineddata files. To combine all the individual tessdata components (unicharset, DAWGs, diff --git a/doc/combine_tessdata.1.html b/doc/combine_tessdata.1.html index 8de474b33b..a7f699f939 100644 --- a/doc/combine_tessdata.1.html +++ b/doc/combine_tessdata.1.html @@ -1,1014 +1,1014 @@ - - - - - -COMBINE_TESSDATA(1) - - - - - -
-
-

SYNOPSIS

-
-

combine_tessdata [OPTION] FILE

-
-
-
-

DESCRIPTION

-
-

combine_tessdata(1) is the main program to combine/extract/overwrite -tessdata components in [lang].traineddata files.

-

To combine all the individual tessdata components (unicharset, DAWGs, -classifier templates, ambiguities, language configs) located at, say, -/home/$USER/temp/eng.* run:

-
-
-
combine_tessdata /home/$USER/temp/eng.
-
-

The result will be a combined tessdata file /home/$USER/temp/eng.traineddata

-

Specify option -e if you would like to extract individual components -from a combined traineddata file. For example, to extract language config -file and the unicharset from tessdata/eng.traineddata run:

-
-
-
combine_tessdata -e tessdata/eng.traineddata \
-  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset
-
-

The desired config file and unicharset will be written to -/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

-

Specify option -o to overwrite individual components of the given -[lang].traineddata file. For example, to overwrite language config -and unichar ambiguities files in tessdata/eng.traineddata use:

-
-
-
combine_tessdata -o tessdata/eng.traineddata \
-  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs
-
-

As a result, tessdata/eng.traineddata will contain the new language config -and unichar ambigs, plus all the original DAWGs, classifier templates, etc.

-

Note: the file names of the files to extract to and to overwrite from should -have the appropriate file suffixes (extensions) indicating their tessdata -component type (.unicharset for the unicharset, .unicharambigs for unichar -ambigs, etc). See k*FileSuffix variable in ccutil/tessdatamanager.h.

-

Specify option -u to unpack all the components to the specified path:

-
-
-
combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.
-
-

This will create /home/$USER/temp/eng.* files with individual tessdata -components from tessdata/eng.traineddata.

-
-
-
-

OPTIONS

-
-

-e .traineddata FILE…: - Extracts the specified components from the .traineddata file

-

-o .traineddata FILE…: - Overwrites the specified components of the .traineddata file - with those provided on the command line.

-

-u .traineddata PATHPREFIX - Unpacks the .traineddata using the provided prefix.

-
-
-
-

CAVEATS

-
-

Prefix refers to the full file prefix, including period (.)

-
-
-
-

COMPONENTS

-
-

The components in a Tesseract lang.traineddata file as of -Tesseract 3.02 are briefly described below; for more information on -many of these files, see -https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

-
-
-lang.config -
-
-

- (Optional) Language-specific overrides to default config variables. -

-
-
-lang.unicharset -
-
-

- (Required) The list of symbols that Tesseract recognizes, with properties. - See unicharset(5). -

-
-
-lang.unicharambigs -
-
-

- (Optional) This file contains information on pairs of recognized symbols - which are often confused. For example, rn and m. -

-
-
-lang.inttemp -
-
-

- (Required) Character shape templates for each unichar. Produced by - mftraining(1). -

-
-
-lang.pffmtable -
-
-

- (Required) The number of features expected for each unichar. - Produced by mftraining(1) from .tr files. -

-
-
-lang.normproto -
-
-

- (Required) Character normalization prototypes generated by cntraining(1) - from .tr files. -

-
-
-lang.punc-dawg -
-
-

- (Optional) A dawg made from punctuation patterns found around words. - The "word" part is replaced by a single space. -

-
-
-lang.word-dawg -
-
-

- (Optional) A dawg made from dictionary words from the language. -

-
-
-lang.number-dawg -
-
-

- (Optional) A dawg made from tokens which originally contained digits. - Each digit is replaced by a space character. -

-
-
-lang.freq-dawg -
-
-

- (Optional) A dawg made from the most frequent words which would have - gone into word-dawg. -

-
-
-lang.fixed-length-dawgs -
-
-

- (Optional) Several dawgs of different fixed lengths — useful for - languages like Chinese. -

-
-
-lang.cube-unicharset -
-
-

- (Optional) A unicharset for cube, if cube was trained on a different set - of symbols. -

-
-
-lang.cube-word-dawg -
-
-

- (Optional) A word dawg for cube’s alternate unicharset. Not needed if Cube - was trained with Tesseract’s unicharset. -

-
-
-lang.shapetable -
-
-

- (Optional) When present, a shapetable is an extra layer between the character - classifier and the word recognizer that allows the character classifier to - return a collection of unichar ids and fonts instead of a single unichar-id - and font. -

-
-
-lang.bigram-dawg -
-
-

- (Optional) A dawg of word bigrams where the words are separated by a space - and each digit is replaced by a ?. -

-
-
-lang.unambig-dawg -
-
-

- (Optional) TODO: Describe. -

-
-
-lang.params-training-model -
-
-

- (Optional) TODO: Describe. -

-
-
-
-
-
-

HISTORY

-
-

combine_tessdata(1) first appeared in version 3.00 of Tesseract

-
-
-
-

SEE ALSO

-
-

tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), -unicharambigs(5)

-
-
-
-

COPYING

-
-

Copyright (C) 2009, Google Inc. -Licensed under the Apache License, Version 2.0

-
-
-
-

AUTHOR

-
-

The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present).

-
-
-
-

- - - + + + + + +COMBINE_TESSDATA(1) + + + + + +
+
+

SYNOPSIS

+
+

combine_tessdata [OPTION] FILE

+
+
+
+

DESCRIPTION

+
+

combine_tessdata(1) is the main program to combine/extract/overwrite +tessdata components in [lang].traineddata files.

+

To combine all the individual tessdata components (unicharset, DAWGs, +classifier templates, ambiguities, language configs) located at, say, +/home/$USER/temp/eng.* run:

+
+
+
combine_tessdata /home/$USER/temp/eng.
+
+

The result will be a combined tessdata file /home/$USER/temp/eng.traineddata

+

Specify option -e if you would like to extract individual components +from a combined traineddata file. For example, to extract language config +file and the unicharset from tessdata/eng.traineddata run:

+
+
+
combine_tessdata -e tessdata/eng.traineddata \
+  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset
+
+

The desired config file and unicharset will be written to +/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

+

Specify option -o to overwrite individual components of the given +[lang].traineddata file. For example, to overwrite language config +and unichar ambiguities files in tessdata/eng.traineddata use:

+
+
+
combine_tessdata -o tessdata/eng.traineddata \
+  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs
+
+

As a result, tessdata/eng.traineddata will contain the new language config +and unichar ambigs, plus all the original DAWGs, classifier templates, etc.

+

Note: the file names of the files to extract to and to overwrite from should +have the appropriate file suffixes (extensions) indicating their tessdata +component type (.unicharset for the unicharset, .unicharambigs for unichar +ambigs, etc). See k*FileSuffix variable in ccutil/tessdatamanager.h.

+

Specify option -u to unpack all the components to the specified path:

+
+
+
combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.
+
+

This will create /home/$USER/temp/eng.* files with individual tessdata +components from tessdata/eng.traineddata.
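
The extract and unpack invocations described above differ only in the flag and in how the output names are formed. A minimal Python sketch of assembling those command lines follows. The helper names `extract_argv` and `unpack_argv` are hypothetical and not part of Tesseract; only the `-e` and `-u` flags documented on this page are used, and the `/tmp/eng.` prefix is an illustrative path. The sketch only builds the argument list and does not run anything.

```python
import shlex


def extract_argv(traineddata, prefix, suffixes):
    """Build argv for `combine_tessdata -e` (hypothetical helper)."""
    # The prefix is the full file prefix including the trailing period,
    # e.g. "/tmp/eng."; each suffix names a tessdata component type.
    outputs = [prefix + suffix for suffix in suffixes]
    return ["combine_tessdata", "-e", traineddata] + outputs


def unpack_argv(traineddata, prefix):
    """Build argv for `combine_tessdata -u` (hypothetical helper)."""
    return ["combine_tessdata", "-u", traineddata, prefix]


if __name__ == "__main__":
    argv = extract_argv("tessdata/eng.traineddata", "/tmp/eng.",
                        ["config", "unicharset"])
    # Print the command as a single shell-quoted line for inspection.
    print(shlex.join(argv))
```

If `combine_tessdata` is on the PATH, the resulting list could be handed to `subprocess.run(argv, check=True)`; keeping the list construction separate makes it easy to inspect the command first.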

+
+
+
+

OPTIONS

+
+

-e .traineddata FILE…: + Extracts the specified components from the .traineddata file

+

-o .traineddata FILE…: + Overwrites the specified components of the .traineddata file + with those provided on the command line.

+

-u .traineddata PATHPREFIX + Unpacks the .traineddata using the provided prefix.

+
+
+
+

CAVEATS

+
+

Prefix refers to the full file prefix, including period (.)

+
+
+
+

COMPONENTS

+
+

The components in a Tesseract lang.traineddata file as of +Tesseract 3.02 are briefly described below; for more information on +many of these files, see +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

+
+
+lang.config +
+
+

+ (Optional) Language-specific overrides to default config variables. +

+
+
+lang.unicharset +
+
+

+ (Required) The list of symbols that Tesseract recognizes, with properties. + See unicharset(5). +

+
+
+lang.unicharambigs +
+
+

+ (Optional) This file contains information on pairs of recognized symbols + which are often confused. For example, rn and m. +

+
+
+lang.inttemp +
+
+

+ (Required) Character shape templates for each unichar. Produced by + mftraining(1). +

+
+
+lang.pffmtable +
+
+

+ (Required) The number of features expected for each unichar. + Produced by mftraining(1) from .tr files. +

+
+
+lang.normproto +
+
+

+ (Required) Character normalization prototypes generated by cntraining(1) + from .tr files. +

+
+
+lang.punc-dawg +
+
+

+ (Optional) A dawg made from punctuation patterns found around words. + The "word" part is replaced by a single space. +

+
+
+lang.word-dawg +
+
+

+ (Optional) A dawg made from dictionary words from the language. +

+
+
+lang.number-dawg +
+
+

+ (Optional) A dawg made from tokens which originally contained digits. + Each digit is replaced by a space character. +

+
+
+lang.freq-dawg +
+
+

+ (Optional) A dawg made from the most frequent words which would have + gone into word-dawg. +

+
+
+lang.fixed-length-dawgs +
+
+

+ (Optional) Several dawgs of different fixed lengths — useful for + languages like Chinese. +

+
+
+lang.cube-unicharset +
+
+

+ (Optional) A unicharset for cube, if cube was trained on a different set + of symbols. +

+
+
+lang.cube-word-dawg +
+
+

+ (Optional) A word dawg for cube’s alternate unicharset. Not needed if Cube + was trained with Tesseract’s unicharset. +

+
+
+lang.shapetable +
+
+

+ (Optional) When present, a shapetable is an extra layer between the character + classifier and the word recognizer that allows the character classifier to + return a collection of unichar ids and fonts instead of a single unichar-id + and font. +

+
+
+lang.bigram-dawg +
+
+

+ (Optional) A dawg of word bigrams where the words are separated by a space + and each digit is replaced by a ?. +

+
+
+lang.unambig-dawg +
+
+

+ (Optional) TODO: Describe. +

+
+
+lang.params-training-model +
+
+

+ (Optional) TODO: Describe. +

+
+
+
+
+
+
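
The (Required) and (Optional) markers in the component list above can be turned into a quick check before combining files. The sketch below is illustrative only: `missing_required` is a hypothetical helper, not a Tesseract API, and the required set is copied from the Tesseract 3.02 list above, so it may not hold for other versions.

```python
# Components marked (Required) in the list above (as of Tesseract 3.02).
REQUIRED_SUFFIXES = {"unicharset", "inttemp", "pffmtable", "normproto"}


def missing_required(filenames, lang):
    """Return the required component suffixes absent from filenames."""
    prefix = lang + "."
    # Keep only files for this language and strip the "lang." prefix.
    present = {name[len(prefix):] for name in filenames
               if name.startswith(prefix)}
    return sorted(REQUIRED_SUFFIXES - present)


if __name__ == "__main__":
    files = ["eng.unicharset", "eng.inttemp", "eng.config"]
    print(missing_required(files, "eng"))
```

An empty result means every component marked (Required) is present; the optional components are deliberately ignored.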

HISTORY

+
+

combine_tessdata(1) first appeared in version 3.00 of Tesseract

+
+
+
+

SEE ALSO

+
+

tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), +unicharambigs(5)

+
+
+
+

COPYING

+
+

Copyright (C) 2009, Google Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/doc/combine_tessdata.1.xml b/doc/combine_tessdata.1.xml index 1a43995fb5..693e1343b5 100644 --- a/doc/combine_tessdata.1.xml +++ b/doc/combine_tessdata.1.xml @@ -1,281 +1,281 @@ - - - - - - - COMBINE_TESSDATA(1) - - -combine_tessdata -1 -  -  - - - combine_tessdata - combine/extract/overwrite Tesseract data - - -combine_tessdata [OPTION] FILE - - -DESCRIPTION -combine_tessdata(1) is the main program to combine/extract/overwrite -tessdata components in [lang].traineddata files. -To combine all the individual tessdata components (unicharset, DAWGs, -classifier templates, ambiguities, language configs) located at, say, -/home/$USER/temp/eng.* run: -combine_tessdata /home/$USER/temp/eng. -The result will be a combined tessdata file /home/$USER/temp/eng.traineddata -Specify option -e if you would like to extract individual components -from a combined traineddata file. For example, to extract language config -file and the unicharset from tessdata/eng.traineddata run: -combine_tessdata -e tessdata/eng.traineddata \ - /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset -The desired config file and unicharset will be written to -/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset -Specify option -o to overwrite individual components of the given -[lang].traineddata file. For example, to overwrite language config -and unichar ambiguities files in tessdata/eng.traineddata use: -combine_tessdata -o tessdata/eng.traineddata \ - /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs -As a result, tessdata/eng.traineddata will contain the new language config -and unichar ambigs, plus all the original DAWGs, classifier templates, etc. -Note: the file names of the files to extract to and to overwrite from should -have the appropriate file suffixes (extensions) indicating their tessdata -component type (.unicharset for the unicharset, .unicharambigs for unichar -ambigs, etc). See k*FileSuffix variable in ccutil/tessdatamanager.h. 
-Specify option -u to unpack all the components to the specified path: -combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng. -This will create /home/$USER/temp/eng.* files with individual tessdata -components from tessdata/eng.traineddata. - - -OPTIONS --e .traineddata FILE…: - Extracts the specified components from the .traineddata file --o .traineddata FILE…: - Overwrites the specified components of the .traineddata file - with those provided on the comand line. --u .traineddata PATHPREFIX - Unpacks the .traineddata using the provided prefix. - - -CAVEATS -Prefix refers to the full file prefix, including period (.) - - -COMPONENTS -The components in a Tesseract lang.traineddata file as of -Tesseract 3.02 are briefly described below; For more information on -many of these files, see -https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract - - - -lang.config - - - - (Optional) Language-specific overrides to default config variables. - - - - - -lang.unicharset - - - - (Required) The list of symbols that Tesseract recognizes, with properties. - See unicharset(5). - - - - - -lang.unicharambigs - - - - (Optional) This file contains information on pairs of recognized symbols - which are often confused. For example, rn and m. - - - - - -lang.inttemp - - - - (Required) Character shape templates for each unichar. Produced by - mftraining(1). - - - - - -lang.pffmtable - - - - (Required) The number of features expected for each unichar. - Produced by mftraining(1) from .tr files. - - - - - -lang.normproto - - - - (Required) Character normalization prototypes generated by cntraining(1) - from .tr files. - - - - - -lang.punc-dawg - - - - (Optional) A dawg made from punctuation patterns found around words. - The "word" part is replaced by a single space. - - - - - -lang.word-dawg - - - - (Optional) A dawg made from dictionary words from the language. - - - - - -lang.number-dawg - - - - (Optional) A dawg made from tokens which originally contained digits. 
- Each digit is replaced by a space character. - - - - - -lang.freq-dawg - - - - (Optional) A dawg made from the most frequent words which would have - gone into word-dawg. - - - - - -lang.fixed-length-dawgs - - - - (Optional) Several dawgs of different fixed lengths — useful for - languages like Chinese. - - - - - -lang.cube-unicharset - - - - (Optional) A unicharset for cube, if cube was trained on a different set - of symbols. - - - - - -lang.cube-word-dawg - - - - (Optional) A word dawg for cube’s alternate unicharset. Not needed if Cube - was trained with Tesseract’s unicharset. - - - - - -lang.shapetable - - - - (Optional) When present, a shapetable is an extra layer between the character - classifier and the word recognizer that allows the character classifier to - return a collection of unichar ids and fonts instead of a single unichar-id - and font. - - - - - -lang.bigram-dawg - - - - (Optional) A dawg of word bigrams where the words are separated by a space - and each digit is replaced by a ?. - - - - - -lang.unambig-dawg - - - - (Optional) TODO: Describe. - - - - - -lang.params-training-model - - - - (Optional) TODO: Describe. - - - - - - -HISTORY -combine_tessdata(1) first appeared in version 3.00 of Tesseract - - -SEE ALSO -tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), -unicharambigs(5) - - -COPYING -Copyright (C) 2009, Google Inc. -Licensed under the Apache License, Version 2.0 - - -AUTHOR -The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present). - - + + + + + + + COMBINE_TESSDATA(1) + + +combine_tessdata +1 +  +  + + + combine_tessdata + combine/extract/overwrite Tesseract data + + +combine_tessdata [OPTION] FILE + + +DESCRIPTION +combine_tessdata(1) is the main program to combine/extract/overwrite +tessdata components in [lang].traineddata files. 
+To combine all the individual tessdata components (unicharset, DAWGs, +classifier templates, ambiguities, language configs) located at, say, +/home/$USER/temp/eng.* run: +combine_tessdata /home/$USER/temp/eng. +The result will be a combined tessdata file /home/$USER/temp/eng.traineddata +Specify option -e if you would like to extract individual components +from a combined traineddata file. For example, to extract language config +file and the unicharset from tessdata/eng.traineddata run: +combine_tessdata -e tessdata/eng.traineddata \ + /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset +The desired config file and unicharset will be written to +/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset +Specify option -o to overwrite individual components of the given +[lang].traineddata file. For example, to overwrite language config +and unichar ambiguities files in tessdata/eng.traineddata use: +combine_tessdata -o tessdata/eng.traineddata \ + /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs +As a result, tessdata/eng.traineddata will contain the new language config +and unichar ambigs, plus all the original DAWGs, classifier templates, etc. +Note: the file names of the files to extract to and to overwrite from should +have the appropriate file suffixes (extensions) indicating their tessdata +component type (.unicharset for the unicharset, .unicharambigs for unichar +ambigs, etc). See k*FileSuffix variable in ccutil/tessdatamanager.h. +Specify option -u to unpack all the components to the specified path: +combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng. +This will create /home/$USER/temp/eng.* files with individual tessdata +components from tessdata/eng.traineddata. + + +OPTIONS +-e .traineddata FILE…: + Extracts the specified components from the .traineddata file +-o .traineddata FILE…: + Overwrites the specified components of the .traineddata file + with those provided on the comand line. 
+-u .traineddata PATHPREFIX + Unpacks the .traineddata using the provided prefix. + + +CAVEATS +Prefix refers to the full file prefix, including period (.) + + +COMPONENTS +The components in a Tesseract lang.traineddata file as of +Tesseract 3.02 are briefly described below; For more information on +many of these files, see +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract + + + +lang.config + + + + (Optional) Language-specific overrides to default config variables. + + + + + +lang.unicharset + + + + (Required) The list of symbols that Tesseract recognizes, with properties. + See unicharset(5). + + + + + +lang.unicharambigs + + + + (Optional) This file contains information on pairs of recognized symbols + which are often confused. For example, rn and m. + + + + + +lang.inttemp + + + + (Required) Character shape templates for each unichar. Produced by + mftraining(1). + + + + + +lang.pffmtable + + + + (Required) The number of features expected for each unichar. + Produced by mftraining(1) from .tr files. + + + + + +lang.normproto + + + + (Required) Character normalization prototypes generated by cntraining(1) + from .tr files. + + + + + +lang.punc-dawg + + + + (Optional) A dawg made from punctuation patterns found around words. + The "word" part is replaced by a single space. + + + + + +lang.word-dawg + + + + (Optional) A dawg made from dictionary words from the language. + + + + + +lang.number-dawg + + + + (Optional) A dawg made from tokens which originally contained digits. + Each digit is replaced by a space character. + + + + + +lang.freq-dawg + + + + (Optional) A dawg made from the most frequent words which would have + gone into word-dawg. + + + + + +lang.fixed-length-dawgs + + + + (Optional) Several dawgs of different fixed lengths — useful for + languages like Chinese. + + + + + +lang.cube-unicharset + + + + (Optional) A unicharset for cube, if cube was trained on a different set + of symbols. 
+ + + + + +lang.cube-word-dawg + + + + (Optional) A word dawg for cube’s alternate unicharset. Not needed if Cube + was trained with Tesseract’s unicharset. + + + + + +lang.shapetable + + + + (Optional) When present, a shapetable is an extra layer between the character + classifier and the word recognizer that allows the character classifier to + return a collection of unichar ids and fonts instead of a single unichar-id + and font. + + + + + +lang.bigram-dawg + + + + (Optional) A dawg of word bigrams where the words are separated by a space + and each digit is replaced by a ?. + + + + + +lang.unambig-dawg + + + + (Optional) TODO: Describe. + + + + + +lang.params-training-model + + + + (Optional) TODO: Describe. + + + + + + +HISTORY +combine_tessdata(1) first appeared in version 3.00 of Tesseract + + +SEE ALSO +tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), +unicharambigs(5) + + +COPYING +Copyright (C) 2009, Google Inc. +Licensed under the Apache License, Version 2.0 + + +AUTHOR +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present). + + diff --git a/doc/dawg2wordlist.1.html b/doc/dawg2wordlist.1.html index b700fe186d..0b2645dfb7 100644 --- a/doc/dawg2wordlist.1.html +++ b/doc/dawg2wordlist.1.html @@ -1,802 +1,802 @@ - - - - - -DAWG2WORDLIST(1) - - - - - -
-
-

SYNOPSIS

-
-

dawg2wordlist UNICHARSET DAWG WORDLIST

-
-
-
-

DESCRIPTION

-
-

dawg2wordlist(1) converts a Tesseract Directed Acyclic Word -Graph (DAWG) to a list of words using a unicharset as key.

-
-
-
-

OPTIONS

-
-

UNICHARSET - The unicharset of the language. This is the unicharset - generated by mftraining(1).

-

DAWG - The input DAWG, created by wordlist2dawg(1)

-

WORDLIST - Plain text (output) file in UTF-8, one word per line

-
-
-
-

SEE ALSO

-
-

tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), -combine_tessdata(1)

- -
-
-
-

COPYING

-
-

Copyright (C) 2012 Google, Inc. -Licensed under the Apache License, Version 2.0

-
-
-
-

AUTHOR

-
-

The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present).

-
-
-
-

- - - + + + + + +DAWG2WORDLIST(1) + + + + + +
+
+

SYNOPSIS

+
+

dawg2wordlist UNICHARSET DAWG WORDLIST

+
+
+
+

DESCRIPTION

+
+

dawg2wordlist(1) converts a Tesseract Directed Acyclic Word +Graph (DAWG) to a list of words using a unicharset as key.

+
+
+
+

OPTIONS

+
+

UNICHARSET + The unicharset of the language. This is the unicharset + generated by mftraining(1).

+

DAWG + The input DAWG, created by wordlist2dawg(1)

+

WORDLIST + Plain text (output) file in UTF-8, one word per line
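
Since the WORDLIST output is described as plain UTF-8 text with one word per line, it can be loaded with a few lines of Python. This is a rough sketch under that assumption only; `read_wordlist` is a hypothetical helper, and skipping blank lines is a choice made here rather than documented behavior of dawg2wordlist.

```python
import os
import tempfile


def read_wordlist(path):
    """Read a UTF-8 word list with one word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    # Write a small sample file so the sketch runs without Tesseract installed.
    with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                     delete=False) as sample:
        sample.write("apple\nbanana\n\ncherry\n")
    try:
        print(read_wordlist(sample.name))
    finally:
        os.remove(sample.name)
```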

+
+
+
+

SEE ALSO

+
+

tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), +combine_tessdata(1)

+ +
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/doc/dawg2wordlist.1.xml b/doc/dawg2wordlist.1.xml index c73113191c..ee960ad9fc 100644 --- a/doc/dawg2wordlist.1.xml +++ b/doc/dawg2wordlist.1.xml @@ -1,53 +1,53 @@ - - - - - - - DAWG2WORDLIST(1) - - -dawg2wordlist -1 -  -  - - - dawg2wordlist - convert a Tesseract DAWG to a wordlist - - -dawg2wordlist UNICHARSET DAWG WORDLIST - - -DESCRIPTION -dawg2wordlist(1) converts a Tesseract Directed Acyclic Word -Graph (DAWG) to a list of words using a unicharset as key. - - -OPTIONS -UNICHARSET - The unicharset of the language. This is the unicharset - generated by mftraining(1). -DAWG - The input DAWG, created by wordlist2dawg(1) -WORDLIST - Plain text (output) file in UTF-8, one word per line - - -SEE ALSO -tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), -combine_tessdata(1) -https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract - - -COPYING -Copyright (C) 2012 Google, Inc. -Licensed under the Apache License, Version 2.0 - - -AUTHOR -The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present). - - + + + + + + + DAWG2WORDLIST(1) + + +dawg2wordlist +1 +  +  + + + dawg2wordlist + convert a Tesseract DAWG to a wordlist + + +dawg2wordlist UNICHARSET DAWG WORDLIST + + +DESCRIPTION +dawg2wordlist(1) converts a Tesseract Directed Acyclic Word +Graph (DAWG) to a list of words using a unicharset as key. + + +OPTIONS +UNICHARSET + The unicharset of the language. This is the unicharset + generated by mftraining(1). +DAWG + The input DAWG, created by wordlist2dawg(1) +WORDLIST + Plain text (output) file in UTF-8, one word per line + + +SEE ALSO +tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), +combine_tessdata(1) +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract + + +COPYING +Copyright (C) 2012 Google, Inc. 
+Licensed under the Apache License, Version 2.0 + + +AUTHOR +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present). + + diff --git a/doc/mftraining.1.asc b/doc/mftraining.1.asc index 85e1263ade..43fe533a16 100644 --- a/doc/mftraining.1.asc +++ b/doc/mftraining.1.asc @@ -24,12 +24,12 @@ OPTIONS -F 'font_properties_file':: (Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1: - + *font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur* -X 'xheights_file':: (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] - + *font_name* *xheight* -D 'dir':: diff --git a/doc/mftraining.1.html b/doc/mftraining.1.html index 4abdfd6a6c..41a3804457 100644 --- a/doc/mftraining.1.html +++ b/doc/mftraining.1.html @@ -1,847 +1,847 @@ - - - - - -MFTRAINING(1) - - - - - -
-
-

SYNOPSIS

-
-

mftraining -U unicharset -O lang.unicharset FILE

-
-
-
-

DESCRIPTION

-
-

mftraining takes a list of .tr files, from which it generates the -files inttemp (the shape prototypes), shapetable, and pffmtable -(the number of expected features for each character). (A fourth file -called Microfeat is also written by this program, but it is not used.)

-
-
-
-

OPTIONS

-
-
-
--U FILE -
-
-

- (Input) The unicharset generated by unicharset_extractor(1) -

-
-
--F font_properties_file -
-
-

- (Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1: -

-
-
-
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
-
-
-
--X xheights_file -
-
-

- (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] -

-
-
-
*font_name* *xheight*
-
-
-
--D dir -
-
-

- Directory to write output files to. -

-
-
--O FILE -
-
-

- (Output) The output unicharset that will be given to combine_tessdata(1) -

-
-
-
-
-
-

SEE ALSO

-
-

tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), -shapeclustering(1), unicharset(5)

- -
-
-
-

COPYING

-
-

Copyright (C) Hewlett-Packard Company, 1988 -Licensed under the Apache License, Version 2.0

-
-
-
-

AUTHOR

-
-

The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present).

-
-
-
-

- - - + + + + + +MFTRAINING(1) + + + + + +
+
+

SYNOPSIS

+
+

mftraining -U unicharset -O lang.unicharset FILE

+
+
+
+

DESCRIPTION

+
+

mftraining takes a list of .tr files, from which it generates the +files inttemp (the shape prototypes), shapetable, and pffmtable +(the number of expected features for each character). (A fourth file +called Microfeat is also written by this program, but it is not used.)

+
+
+
+

OPTIONS

+
+
+
+-U FILE +
+
+

+ (Input) The unicharset generated by unicharset_extractor(1) +

+
+
+-F font_properties_file +
+
+

+ (Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1: +

+
+
+
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
+
+
+
+-X xheights_file +
+
+

+ (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] +

+
+
+
*font_name* *xheight*
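The figure of 133 in the -X description appears to correspond to the nominal size of 32pt type rendered at 300 dpi, since one point is 1/72 inch. A minimal sketch of that arithmetic follows; the function names and the 0.5 x-height ratio are hypothetical, and the real x height depends on the font and is best measured from rendered glyphs.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def points_to_pixels(point_size, dpi):
    """Convert a nominal type size in points to pixels at the given resolution."""
    return point_size * dpi / POINTS_PER_INCH


def xheight_pixels(point_size, dpi, xheight_ratio):
    """Estimate the pixel x height from a per-font ratio (x height / nominal size).

    xheight_ratio is an assumed, font-specific value; it is not something
    mftraining reports.
    """
    return round(points_to_pixels(point_size, dpi) * xheight_ratio)


if __name__ == "__main__":
    print(points_to_pixels(32, 300))     # nominal size in pixels at 32pt, 300 dpi
    print(xheight_pixels(32, 300, 0.5))  # estimate for a hypothetical font
```

The resulting integer is the kind of value that would go in the *xheight* column of the xheights file for that font.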
+
+
+
+-D dir +
+
+

+ Directory to write output files to. +

+
+
+-O FILE +
+
+

+ (Output) The output unicharset that will be given to combine_tessdata(1) +

+
+
+
+
+
+

SEE ALSO

+
+

tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), +shapeclustering(1), unicharset(5)

+ +
+
+
+

COPYING

+
+

Copyright (C) Hewlett-Packard Company, 1988 +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/doc/mftraining.1.xml b/doc/mftraining.1.xml index 239178a5c1..10b3c6d2e5 100644 --- a/doc/mftraining.1.xml +++ b/doc/mftraining.1.xml @@ -1,102 +1,102 @@ - - - - - - - MFTRAINING(1) - - -mftraining -1 -  -  - - - mftraining - feature training for Tesseract - - -mftraining -U unicharset -O lang.unicharset FILE - - -DESCRIPTION -mftraining takes a list of .tr files, from which it generates the -files inttemp (the shape prototypes), shapetable, and pffmtable -(the number of expected features for each character). (A fourth file -called Microfeat is also written by this program, but it is not used.) - - -OPTIONS - - - --U FILE - - - - (Input) The unicharset generated by unicharset_extractor(1) - - - - - --F font_properties_file - - - - (Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1: - -*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur* - - - - --X xheights_file - - - - (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] - -*font_name* *xheight* - - - - --D dir - - - - Directory to write output files to. - - - - - --O FILE - - - - (Output) The output unicharset that will be given to combine_tessdata(1) - - - - - - -SEE ALSO -tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), -shapeclustering(1), unicharset(5) -https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract - - -COPYING -Copyright (C) Hewlett-Packard Company, 1988 -Licensed under the Apache License, Version 2.0 - - -AUTHOR -The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present). 
- - + + + + + + + MFTRAINING(1) + + +mftraining +1 +  +  + + + mftraining + feature training for Tesseract + + +mftraining -U unicharset -O lang.unicharset FILE + + +DESCRIPTION +mftraining takes a list of .tr files, from which it generates the +files inttemp (the shape prototypes), shapetable, and pffmtable +(the number of expected features for each character). (A fourth file +called Microfeat is also written by this program, but it is not used.) + + +OPTIONS + + + +-U FILE + + + + (Input) The unicharset generated by unicharset_extractor(1) + + + + + +-F font_properties_file + + + + (Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1: + +*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur* + + + + +-X xheights_file + + + + (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] + +*font_name* *xheight* + + + + +-D dir + + + + Directory to write output files to. + + + + + +-O FILE + + + + (Output) The output unicharset that will be given to combine_tessdata(1) + + + + + + +SEE ALSO +tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), +shapeclustering(1), unicharset(5) +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract + + +COPYING +Copyright (C) Hewlett-Packard Company, 1988 +Licensed under the Apache License, Version 2.0 + + +AUTHOR +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present). 
+ + diff --git a/doc/shapeclustering.1.asc b/doc/shapeclustering.1.asc index 81ca0dbc09..0a1bfb035b 100644 --- a/doc/shapeclustering.1.asc +++ b/doc/shapeclustering.1.asc @@ -35,7 +35,7 @@ OPTIONS -X 'xheights_file':: (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] - + 'font_name' 'xheight' -O 'FILE':: diff --git a/doc/shapeclustering.1.html b/doc/shapeclustering.1.html index 845d49a815..5fca944fc8 100644 --- a/doc/shapeclustering.1.html +++ b/doc/shapeclustering.1.html @@ -1,850 +1,850 @@ - - - - - -SHAPECLUSTERING(1) - - - - - -
-
-

SYNOPSIS

-
-

shapeclustering -D output_dir - -U unicharset -O mfunicharset - -F font_props -X xheights - FILE

-
-
-
-

DESCRIPTION

-
-

shapeclustering(1) takes extracted feature .tr files (generated by -tesseract(1) run in a special mode from box files) and produces a -file shapetable and an enhanced unicharset. This program is still -experimental, and is not required (yet) for training Tesseract.

-
-
-
-

OPTIONS

-
-
-
--U FILE -
-
-

- The unicharset generated by unicharset_extractor(1). -

-
-
--D dir -
-
-

- Directory to write output files to. -

-
-
--F font_properties_file -
-
-

- (Input) font properties file; each line is of the following form, where each field other than the font name is 0 or 1: -

-
-
-
'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'
-
-
-
--X xheights_file -
-
-

- (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] -

-
-
-
'font_name' 'xheight'
-
-
-
--O FILE -
-
-

- The output unicharset that will be given to combine_tessdata(1). -

-
-
-
-
-
-

SEE ALSO

-
-

tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), -unicharset(5)

- -
-
-
-

COPYING

-
-

Copyright (C) Google, 2011 -Licensed under the Apache License, Version 2.0

-
-
-
-

AUTHOR

-
-

The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present).

-
-
-
-

- - - + + + + + +SHAPECLUSTERING(1) + + + + + +
+
+

SYNOPSIS

+
+

shapeclustering -D output_dir + -U unicharset -O mfunicharset + -F font_props -X xheights + FILE

+
+
+
+

DESCRIPTION

+
+

shapeclustering(1) takes extracted feature .tr files (generated by +tesseract(1) run in a special mode from box files) and produces a +file shapetable and an enhanced unicharset. This program is still +experimental, and is not required (yet) for training Tesseract.

+
+
+
+

OPTIONS

+
+
+
+-U FILE +
+
+

+ The unicharset generated by unicharset_extractor(1). +

+
+
+-D dir +
+
+

+ Directory to write output files to. +

+
+
+-F font_properties_file +
+
+

+ (Input) font properties file; each line is of the following form, where each field other than the font name is 0 or 1: +

+
+
+
'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'
+
+
+
+-X xheights_file +
+
+

+ (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] +

+
+
+
'font_name' 'xheight'
+
+
+
+-O FILE +
+
+

+ The output unicharset that will be given to combine_tessdata(1). +

+
+
+
+
+
+

SEE ALSO

+
+

tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), +unicharset(5)

+ +
+
+
+

COPYING

+
+

Copyright (C) Google, 2011 +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/doc/shapeclustering.1.xml b/doc/shapeclustering.1.xml index d02bcf8db9..933789ad3c 100644 --- a/doc/shapeclustering.1.xml +++ b/doc/shapeclustering.1.xml @@ -1,105 +1,105 @@ - - - - - - - SHAPECLUSTERING(1) - - -shapeclustering -1 -  -  - - - shapeclustering - shape clustering training for Tesseract - - -shapeclustering -D output_dir - -U unicharset -O mfunicharset - -F font_props -X xheights - FILE - - -DESCRIPTION -shapeclustering(1) takes extracted feature .tr files (generated by -tesseract(1) run in a special mode from box files) and produces a -file shapetable and an enhanced unicharset. This program is still -experimental, and is not required (yet) for training Tesseract. - - -OPTIONS - - - --U FILE - - - - The unicharset generated by unicharset_extractor(1). - - - - - --D dir - - - - Directory to write output files to. - - - - - --F font_properties_file - - - - (Input) font properties file, where each line is of the following form, where each field other than the font name is 0 or 1: - -'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur' - - - - --X xheights_file - - - - (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] - -'font_name' 'xheight' - - - - --O FILE - - - - The output unicharset that will be given to combine_tessdata(1). - - - - - - -SEE ALSO -tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), -unicharset(5) -https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract - - -COPYING -Copyright (C) Google, 2011 -Licensed under the Apache License, Version 2.0 - - -AUTHOR -The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present). 
- - + + + + + + + SHAPECLUSTERING(1) + + +shapeclustering +1 +  +  + + + shapeclustering + shape clustering training for Tesseract + + +shapeclustering -D output_dir + -U unicharset -O mfunicharset + -F font_props -X xheights + FILE + + +DESCRIPTION +shapeclustering(1) takes extracted feature .tr files (generated by +tesseract(1) run in a special mode from box files) and produces a +file shapetable and an enhanced unicharset. This program is still +experimental, and is not required (yet) for training Tesseract. + + +OPTIONS + + + +-U FILE + + + + The unicharset generated by unicharset_extractor(1). + + + + + +-D dir + + + + Directory to write output files to. + + + + + +-F font_properties_file + + + + (Input) font properties file, where each line is of the following form, where each field other than the font name is 0 or 1: + +'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur' + + + + +-X xheights_file + + + + (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] + +'font_name' 'xheight' + + + + +-O FILE + + + + The output unicharset that will be given to combine_tessdata(1). + + + + + + +SEE ALSO +tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), +unicharset(5) +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract + + +COPYING +Copyright (C) Google, 2011 +Licensed under the Apache License, Version 2.0 + + +AUTHOR +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present). + + diff --git a/doc/tesseract.1.asc b/doc/tesseract.1.asc index 237299fe51..312aae07f6 100644 --- a/doc/tesseract.1.asc +++ b/doc/tesseract.1.asc @@ -67,7 +67,7 @@ OPTIONS 6 = Assume a single uniform block of text. 7 = Treat the image as a single text line. 8 = Treat the image as a single word. 
- 9 = Treat the image as a single word in a circle. + 9 = Treat the image as a single word in a circle. 10 = Treat the image as a single character. 'configfile':: @@ -264,10 +264,10 @@ on read_pattern_list(). HISTORY ------- -The engine was developed at Hewlett Packard Laboratories Bristol and at -Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more -changes made in 1996 to port to Windows, and some C\+\+izing in 1998. A -lot of the code was written in C, and then some more was written in C\+\+. +The engine was developed at Hewlett Packard Laboratories Bristol and at +Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more +changes made in 1996 to port to Windows, and some C\+\+izing in 1998. A +lot of the code was written in C, and then some more was written in C\+\+. The C\+\+ code makes heavy use of a list system using macros. This predates stl, was portable before stl, and is more efficient than stl lists, but has the big negative that if you do get a segmentation violation, it is hard to @@ -276,18 +276,18 @@ debug. Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability to train Tesseract. -Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. +Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. See . With Tesseract 2.00, -scripts are now included to allow anyone to reproduce some of these tests. -See for more +scripts are now included to allow anyone to reproduce some of these tests. +See for more details. -Tesseract 3.00 adds a number of new languages, including Chinese, Japanese, -and Korean. It also introduces a new, single-file based system of managing +Tesseract 3.00 adds a number of new languages, including Chinese, Japanese, +and Korean. It also introduces a new, single-file based system of managing language data. -Tesseract 3.02 adds BiDirectional text support, the ability to recognize -multiple languages in a single image, and improved layout analysis. 
+Tesseract 3.02 adds BiDirectional text support, the ability to recognize +multiple languages in a single image, and improved layout analysis. For further details, see the file ReleaseNotes included with the distribution. diff --git a/doc/tesseract.1.html b/doc/tesseract.1.html index 5e37d31170..d0addae65b 100644 --- a/doc/tesseract.1.html +++ b/doc/tesseract.1.html @@ -1,1163 +1,1163 @@ - - - - - -TESSERACT(1) - - - - - -
-
-

SYNOPSIS

-
-

tesseract imagename|stdin outputbase|stdout [options…] [configfile…]

-
-
-
-

DESCRIPTION

-
-

tesseract(1) is a commercial quality OCR engine originally developed at HP -between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by -UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed -at Google since then.

-
-
-
-

IN/OUT ARGUMENTS

-
-
-
-imagename -
-
-

- The name of the input image. Most image file formats (anything - readable by Leptonica) are supported. -

-
-
-stdin -
-
-

- Instruction to read data from standard input -

-
-
-outputbase -
-
-

- The basename of the output file (to which the appropriate extension - will be appended). By default the output will be named outputbase.txt. -

-
-
-stdout -
-
-

- Instruction to send output data to standard output -

-
-
-
-
-
-

OPTIONS

-
-
-
---tessdata-dir /path -
-
-

- Specify the location of the tessdata directory -

-
-
---user-words /path/to/file -
-
-

- Specify the location of the user words file -

-
-
---user-patterns /path/to/file -
-
-

- Specify the location of the user patterns file -

-
-
--c configvar=value -
-
-

- Set value for control parameter. Multiple -c arguments are allowed. -

-
-
--l lang -
-
-

- The language to use. If none is specified, English is assumed. - Multiple languages may be specified, separated by plus characters. - Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES) -

-
-
---psm N -
-
-

- Set Tesseract to only run a subset of layout analysis and assume - a certain form of image. The options for N are: -

-
-
-
0 = Orientation and script detection (OSD) only.
-1 = Automatic page segmentation with OSD.
-2 = Automatic page segmentation, but no OSD, or OCR.
-3 = Fully automatic page segmentation, but no OSD. (Default)
-4 = Assume a single column of text of variable sizes.
-5 = Assume a single uniform block of vertically aligned text.
-6 = Assume a single uniform block of text.
-7 = Treat the image as a single text line.
-8 = Treat the image as a single word.
-9 = Treat the image as a single word in a circle.
-10 = Treat the image as a single character.
-
-
-
-configfile -
-
-

- The name of a config to use. A config is a plaintext file which - contains a list of variables and their values, one per line, with a - space separating variable from value. Interesting config files - include:
-

-
    -
  • -

    -hocr - Output in hOCR format instead of as a text file. -

    -
  • -
  • -

    -pdf - Output in pdf instead of a text file. -

    -
  • -
-
-
-

Nota Bene: The options -l lang and --psm N must occur -before any configfile.

-
-
-
-

SINGLE OPTIONS

-
-
-
--v -
-
-

- Returns the current version of the tesseract(1) executable. -

-
-
---list-langs -
-
-

- List available languages for the tesseract engine. Can be used with --tessdata-dir. -

-
-
---print-parameters -
-
-

- Print tesseract parameters to stdout. -

-
-
-
-
-
-

LANGUAGES

-
-

There are currently language packs available for the following languages -(in https://github.com/tesseract-ocr/tessdata):

-

afr (Afrikaans) -amh (Amharic) -ara (Arabic) -asm (Assamese) -aze (Azerbaijani) -aze_cyrl (Azerbaijani - Cyrillic) -bel (Belarusian) -ben (Bengali) -bod (Tibetan) -bos (Bosnian) -bul (Bulgarian) -cat (Catalan; Valencian) -ceb (Cebuano) -ces (Czech) -chi_sim (Chinese - Simplified) -chi_tra (Chinese - Traditional) -chr (Cherokee) -cym (Welsh) -dan (Danish) -dan_frak (Danish - Fraktur) -deu (German) -deu_frak (German - Fraktur) -dzo (Dzongkha) -ell (Greek, Modern (1453-)) -eng (English) -enm (English, Middle (1100-1500)) -epo (Esperanto) -equ (Math / equation detection module) -est (Estonian) -eus (Basque) -fas (Persian) -fin (Finnish) -fra (French) -frk (Frankish) -frm (French, Middle (ca.1400-1600)) -gle (Irish) -glg (Galician) -grc (Greek, Ancient (to 1453)) -guj (Gujarati) -hat (Haitian; Haitian Creole) -heb (Hebrew) -hin (Hindi) -hrv (Croatian) -hun (Hungarian) -iku (Inuktitut) -ind (Indonesian) -isl (Icelandic) -ita (Italian) -ita_old (Italian - Old) -jav (Javanese) -jpn (Japanese) -kan (Kannada) -kat (Georgian) -kat_old (Georgian - Old) -kaz (Kazakh) -khm (Central Khmer) -kir (Kirghiz; Kyrgyz) -kor (Korean) -kur (Kurdish) -lao (Lao) -lat (Latin) -lav (Latvian) -lit (Lithuanian) -mal (Malayalam) -mar (Marathi) -mkd (Macedonian) -mlt (Maltese) -msa (Malay) -mya (Burmese) -nep (Nepali) -nld (Dutch; Flemish) -nor (Norwegian) -ori (Oriya) -osd (Orientation and script detection module) -pan (Panjabi; Punjabi) -pol (Polish) -por (Portuguese) -pus (Pushto; Pashto) -ron (Romanian; Moldavian; Moldovan) -rus (Russian) -san (Sanskrit) -sin (Sinhala; Sinhalese) -slk (Slovak) -slk_frak (Slovak - Fraktur) -slv (Slovenian) -spa (Spanish; Castilian) -spa_old (Spanish; Castilian - Old) -sqi (Albanian) -srp (Serbian) -srp_latn (Serbian - Latin) -swa (Swahili) -swe (Swedish) -syr (Syriac) -tam (Tamil) -tel (Telugu) -tgk (Tajik) -tgl (Tagalog) -tha (Thai) -tir (Tigrinya) -tur (Turkish) -uig (Uighur; Uyghur) -ukr (Ukrainian) -urd (Urdu) -uzb (Uzbek) -uzb_cyrl (Uzbek - Cyrillic) -vie
(Vietnamese) -yid (Yiddish)

-

To use a non-standard language pack named foo.traineddata, set the -TESSDATA_PREFIX environment variable so the file can be found at -TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the -argument -l foo.

-
-
-
-

CONFIG FILES AND AUGMENTING WITH USER DATA

-
-

Tesseract config files consist of lines with variable-value pairs (space -separated). The variables are documented as flags in the source code like -the following one in tesseractclass.h:

-

STRING_VAR_H(tessedit_char_blacklist, "", - "Blacklist of chars not to recognize");

-

These variables may enable or disable various features of the engine, and -may cause it to load (or not load) various data. For instance, let’s suppose -you want to OCR in English, but suppress the normal dictionary and load an -alternative word list and an alternative list of patterns — these two files -are the most commonly used extra data files.

-

If your language pack is in /path/to/eng.traineddata and the hocr config -is in /path/to/configs/hocr then create three new files:

-

/path/to/eng.user-words:

-
-
the
-quick
-brown
-fox
-jumped
-
-
-

/path/to/eng.user-patterns:

-
-
1-\d\d\d-GOOG-411
-www.\n\\\*.com
-
-
-

/path/to/configs/bazaar:

-
-
load_system_dawg     F
-load_freq_dawg       F
-user_words_suffix    user-words
-user_patterns_suffix user-patterns
-
-
-

Now, if you pass the word bazaar as a trailing command line parameter -to Tesseract, Tesseract will not bother loading the system dictionary nor -the dictionary of frequent words and will load and use the eng.user-words -and eng.user-patterns files you provided. The former is a simple word list, -one per line. The format of the latter is documented in dict/trie.h -on read_pattern_list().

-
-
-
-

HISTORY

-
-

The engine was developed at Hewlett Packard Laboratories Bristol and at -Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more -changes made in 1996 to port to Windows, and some C++izing in 1998. A -lot of the code was written in C, and then some more was written in C++. -The C++ code makes heavy use of a list system using macros. This predates -stl, was portable before stl, and is more efficient than stl lists, but has -the big negative that if you do get a segmentation violation, it is hard to -debug.

-

Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability -to train Tesseract.

-

Tesseract was included in UNLV’s Fourth Annual Test of OCR Accuracy. -See https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf. With Tesseract 2.00, -scripts are now included to allow anyone to reproduce some of these tests. -See https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract for more -details.

-

Tesseract 3.00 adds a number of new languages, including Chinese, Japanese, -and Korean. It also introduces a new, single-file based system of managing -language data.

-

Tesseract 3.02 adds BiDirectional text support, the ability to recognize -multiple languages in a single image, and improved layout analysis.

-

For further details, see the file ReleaseNotes included with the distribution.

-
-
- -
-

SEE ALSO

-
-

ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), -shape_training(1), mftraining(1), unicharambigs(5), unicharset(5), -unicharset_extractor(1), wordlist2dawg(1)

-
-
-
-

AUTHOR

-
-

Tesseract development was led at Hewlett-Packard and Google by Ray Smith. -The development team has included:

-

Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger, -Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke, -Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle, -Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel -Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh -Lloyd, Shobhit Saxena, and Thomas Kielbus.

-
-
-
-

COPYING

-
-

Licensed under the Apache License, Version 2.0

-
-
-
-

- - - + + + + + +TESSERACT(1) + + + + + +
+
+

SYNOPSIS

+
+

tesseract imagename|stdin outputbase|stdout [options…] [configfile…]

+
+
+
+

DESCRIPTION

+
+

tesseract(1) is a commercial quality OCR engine originally developed at HP +between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by +UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed +at Google since then.

+
+
+
+

IN/OUT ARGUMENTS

+
+
+
+imagename +
+
+

+ The name of the input image. Most image file formats (anything + readable by Leptonica) are supported. +

+
+
+stdin +
+
+

+ Instruction to read data from standard input +

+
+
+outputbase +
+
+

+ The basename of the output file (to which the appropriate extension + will be appended). By default the output will be named outputbase.txt. +

+
+
+stdout +
+
+

+ Instruction to send output data to standard output +

+
+
+
+
+
+

OPTIONS

+
+
+
+--tessdata-dir /path +
+
+

+ Specify the location of the tessdata directory +

+
+
+--user-words /path/to/file +
+
+

+ Specify the location of the user words file +

+
+
+--user-patterns /path/to/file +
+
+

+ Specify the location of the user patterns file +

+
+
+-c configvar=value +
+
+

+ Set value for control parameter. Multiple -c arguments are allowed. +

+
+
+-l lang +
+
+

+ The language to use. If none is specified, English is assumed. + Multiple languages may be specified, separated by plus characters. + Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES) +

+
+
+--psm N +
+
+

+ Set Tesseract to only run a subset of layout analysis and assume + a certain form of image. The options for N are: +

+
+
+
0 = Orientation and script detection (OSD) only.
+1 = Automatic page segmentation with OSD.
+2 = Automatic page segmentation, but no OSD, or OCR.
+3 = Fully automatic page segmentation, but no OSD. (Default)
+4 = Assume a single column of text of variable sizes.
+5 = Assume a single uniform block of vertically aligned text.
+6 = Assume a single uniform block of text.
+7 = Treat the image as a single text line.
+8 = Treat the image as a single word.
+9 = Treat the image as a single word in a circle.
+10 = Treat the image as a single character.
+
+
+
+configfile +
+
+

+ The name of a config to use. A config is a plaintext file which + contains a list of variables and their values, one per line, with a + space separating variable from value. Interesting config files + include:
+

+
    +
  • +

    +hocr - Output in hOCR format instead of as a text file. +

    +
  • +
  • +

    +pdf - Output in pdf instead of a text file. +

    +
  • +
+
+
+

Nota Bene: The options -l lang and --psm N must occur +before any configfile.

+
+
+
+

SINGLE OPTIONS

+
+
+
+-v +
+
+

+ Returns the current version of the tesseract(1) executable. +

+
+
+--list-langs +
+
+

+ List available languages for the tesseract engine. Can be used with --tessdata-dir. +

+
+
+--print-parameters +
+
+

+ Print tesseract parameters to stdout. +

+
+
+
+
+
+

LANGUAGES

+
+

There are currently language packs available for the following languages +(in https://github.com/tesseract-ocr/tessdata):

+

afr (Afrikaans) +amh (Amharic) +ara (Arabic) +asm (Assamese) +aze (Azerbaijani) +aze_cyrl (Azerbaijani - Cyrillic) +bel (Belarusian) +ben (Bengali) +bod (Tibetan) +bos (Bosnian) +bul (Bulgarian) +cat (Catalan; Valencian) +ceb (Cebuano) +ces (Czech) +chi_sim (Chinese - Simplified) +chi_tra (Chinese - Traditional) +chr (Cherokee) +cym (Welsh) +dan (Danish) +dan_frak (Danish - Fraktur) +deu (German) +deu_frak (German - Fraktur) +dzo (Dzongkha) +ell (Greek, Modern (1453-)) +eng (English) +enm (English, Middle (1100-1500)) +epo (Esperanto) +equ (Math / equation detection module) +est (Estonian) +eus (Basque) +fas (Persian) +fin (Finnish) +fra (French) +frk (Frankish) +frm (French, Middle (ca.1400-1600)) +gle (Irish) +glg (Galician) +grc (Greek, Ancient (to 1453)) +guj (Gujarati) +hat (Haitian; Haitian Creole) +heb (Hebrew) +hin (Hindi) +hrv (Croatian) +hun (Hungarian) +iku (Inuktitut) +ind (Indonesian) +isl (Icelandic) +ita (Italian) +ita_old (Italian - Old) +jav (Javanese) +jpn (Japanese) +kan (Kannada) +kat (Georgian) +kat_old (Georgian - Old) +kaz (Kazakh) +khm (Central Khmer) +kir (Kirghiz; Kyrgyz) +kor (Korean) +kur (Kurdish) +lao (Lao) +lat (Latin) +lav (Latvian) +lit (Lithuanian) +mal (Malayalam) +mar (Marathi) +mkd (Macedonian) +mlt (Maltese) +msa (Malay) +mya (Burmese) +nep (Nepali) +nld (Dutch; Flemish) +nor (Norwegian) +ori (Oriya) +osd (Orientation and script detection module) +pan (Panjabi; Punjabi) +pol (Polish) +por (Portuguese) +pus (Pushto; Pashto) +ron (Romanian; Moldavian; Moldovan) +rus (Russian) +san (Sanskrit) +sin (Sinhala; Sinhalese) +slk (Slovak) +slk_frak (Slovak - Fraktur) +slv (Slovenian) +spa (Spanish; Castilian) +spa_old (Spanish; Castilian - Old) +sqi (Albanian) +srp (Serbian) +srp_latn (Serbian - Latin) +swa (Swahili) +swe (Swedish) +syr (Syriac) +tam (Tamil) +tel (Telugu) +tgk (Tajik) +tgl (Tagalog) +tha (Thai) +tir (Tigrinya) +tur (Turkish) +uig (Uighur; Uyghur) +ukr (Ukrainian) +urd (Urdu) +uzb (Uzbek) +uzb_cyrl (Uzbek - Cyrillic) +vie
(Vietnamese) +yid (Yiddish)

+

To use a non-standard language pack named foo.traineddata, set the +TESSDATA_PREFIX environment variable so the file can be found at +TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the +argument -l foo.
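
As a rough illustration of the lookup rule described above, the following Python sketch computes where a language pack is expected to live and assembles a matching command line. The helper names (`traineddata_path`, `build_invocation`) and the example prefix `/opt/ocr` are made up for this sketch; only the `TESSDATA_PREFIX/tessdata/<lang>.traineddata` layout and the `-l` argument come from the text.

```python
import os
from pathlib import Path
from typing import Dict, List, Tuple


def traineddata_path(prefix: str, lang: str) -> Path:
    """Return the location where, per the text above, <lang>.traineddata is looked up."""
    return Path(prefix) / "tessdata" / f"{lang}.traineddata"


def build_invocation(image: str, outputbase: str, lang: str,
                     prefix: str) -> Tuple[List[str], Dict[str, str]]:
    """Build argv and environment for a run using a non-standard language pack.

    Nothing is executed here; the caller could hand both values to
    subprocess.run(argv, env=env) on a machine where tesseract is installed.
    """
    env = dict(os.environ, TESSDATA_PREFIX=prefix)  # copy, do not mutate os.environ
    argv = ["tesseract", image, outputbase, "-l", lang]
    return argv, env


if __name__ == "__main__":
    # "/opt/ocr" and "foo" are placeholder values for illustration only.
    print(traineddata_path("/opt/ocr", "foo"))
    argv, env = build_invocation("page.png", "out", "foo", "/opt/ocr")
    print(argv, env["TESSDATA_PREFIX"])
```

Checking `traineddata_path(...).is_file()` before invoking the engine is one way to catch a misplaced language pack early.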

+
+
+
+

CONFIG FILES AND AUGMENTING WITH USER DATA

+
+

Tesseract config files consist of lines with variable-value pairs (space +separated). The variables are documented as flags in the source code like +the following one in tesseractclass.h:

+

STRING_VAR_H(tessedit_char_blacklist, "", + "Blacklist of chars not to recognize");

+

These variables may enable or disable various features of the engine, and +may cause it to load (or not load) various data. For instance, let’s suppose +you want to OCR in English, but suppress the normal dictionary and load an +alternative word list and an alternative list of patterns — these two files +are the most commonly used extra data files.

+

If your language pack is in /path/to/eng.traineddata and the hocr config +is in /path/to/configs/hocr then create three new files:

+

/path/to/eng.user-words:

+
+
the
+quick
+brown
+fox
+jumped
+
+
+

/path/to/eng.user-patterns:

+
+
1-\d\d\d-GOOG-411
+www.\n\\\*.com
+
+
+

/path/to/configs/bazaar:

+
+
load_system_dawg     F
+load_freq_dawg       F
+user_words_suffix    user-words
+user_patterns_suffix user-patterns
+
+
+

Now, if you pass the word bazaar as a trailing command line parameter +to Tesseract, Tesseract will not bother loading the system dictionary nor +the dictionary of frequent words and will load and use the eng.user-words +and eng.user-patterns files you provided. The former is a simple word list, +one per line. The format of the latter is documented in dict/trie.h +on read_pattern_list().
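
The three files above can also be generated programmatically. The sketch below is only an illustration: the function names and the temporary directory are assumptions of this example, while the file names, file contents, and the placement of the config name after `-l eng` follow the description above. The command is built but not run.

```python
import tempfile
from pathlib import Path
from typing import List

USER_WORDS = ["the", "quick", "brown", "fox", "jumped"]
# Raw strings keep the backslashes exactly as they appear in the example file.
USER_PATTERNS = [r"1-\d\d\d-GOOG-411", r"www.\n\\\*.com"]
BAZAAR_CONFIG = [
    ("load_system_dawg", "F"),
    ("load_freq_dawg", "F"),
    ("user_words_suffix", "user-words"),
    ("user_patterns_suffix", "user-patterns"),
]


def write_bazaar_files(base: Path) -> Path:
    """Create eng.user-words, eng.user-patterns and configs/bazaar under base."""
    (base / "eng.user-words").write_text("\n".join(USER_WORDS) + "\n", encoding="utf-8")
    (base / "eng.user-patterns").write_text("\n".join(USER_PATTERNS) + "\n", encoding="utf-8")
    configs = base / "configs"
    configs.mkdir(parents=True, exist_ok=True)
    config = configs / "bazaar"
    # One variable-value pair per line, separated by a space.
    config.write_text("".join(f"{name} {value}\n" for name, value in BAZAAR_CONFIG),
                      encoding="utf-8")
    return config


def build_command(image: str, outputbase: str, configfile: str = "bazaar") -> List[str]:
    """Options such as -l come first; the config name is the trailing parameter."""
    return ["tesseract", image, outputbase, "-l", "eng", configfile]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_bazaar_files(Path(tmp)).read_text(encoding="utf-8"))
    print(build_command("page.png", "out"))
```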

+
+
+
+

HISTORY

+
+

The engine was developed at Hewlett Packard Laboratories Bristol and at +Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more +changes made in 1996 to port to Windows, and some C++izing in 1998. A +lot of the code was written in C, and then some more was written in C++. +The C++ code makes heavy use of a list system using macros. This predates +the STL, was portable before the STL, and is more efficient than STL lists, but has +the big negative that if you do get a segmentation violation, it is hard to +debug.

+

Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability +to train Tesseract.

+

Tesseract was included in UNLV’s Fourth Annual Test of OCR Accuracy. +See https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf. With Tesseract 2.00, +scripts are now included to allow anyone to reproduce some of these tests. +See https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract for more +details.

+

Tesseract 3.00 adds a number of new languages, including Chinese, Japanese, +and Korean. It also introduces a new, single-file based system of managing +language data.

+

Tesseract 3.02 adds BiDirectional text support, the ability to recognize +multiple languages in a single image, and improved layout analysis.

+

For further details, see the file ReleaseNotes included with the distribution.

+
+
+ +
+

SEE ALSO

+
+

ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), +shape_training(1), mftraining(1), unicharambigs(5), unicharset(5), +unicharset_extractor(1), wordlist2dawg(1)

+
+
+
+

AUTHOR

+
+

Tesseract development was led at Hewlett-Packard and Google by Ray Smith. +The development team has included:

+

Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger, +Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke, +Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle, +Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel +Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh +Lloyd, Shobhit Saxena, and Thomas Kielbus.

+
+
+
+

COPYING

+
+

Licensed under the Apache License, Version 2.0

+
+
+
+

+ + + diff --git a/doc/tesseract.1.xml b/doc/tesseract.1.xml index 842c5acd61..8ddce87cd6 100644 --- a/doc/tesseract.1.xml +++ b/doc/tesseract.1.xml @@ -1,424 +1,424 @@ - - - - - - - TESSERACT(1) - - -tesseract -1 -  -  - - - tesseract - command-line OCR engine - - -tesseract imagename|stdin outputbase|stdout [options…] [configfile…] - - -DESCRIPTION -tesseract(1) is a commercial quality OCR engine originally developed at HP -between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by -UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed -at Google since then. - - -IN/OUT ARGUMENTS - - - -imagename - - - - The name of the input image. Most image file formats (anything - readable by Leptonica) are supported. - - - - - -stdin - - - - Instruction to read data from standard input - - - - - -outputbase - - - - The basename of the output file (to which the appropriate extension - will be appended). By default the output will be named outbase.txt. - - - - - -stdout - - - - Instruction to sent output data to standard output - - - - - - -OPTIONS - - - ---tessdata-dir /path - - - - Specify the location of tessdata path - - - - - ---user-words /path/to/file - - - - Specify the location of user words file - - - - - ---user-patterns /path/to/file specify - - - - The location of user patterns file - - - - - --c configvar=value - - - - Set value for control parameter. Multiple -c arguments are allowed. - - - - - --l lang - - - - The language to use. If none is specified, English is assumed. - Multiple languages may be specified, separated by plus characters. - Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES) - - - - - ---psm N - - - - Set Tesseract to only run a subset of layout analysis and assume - a certain form of image. The options for N are: - -0 = Orientation and script detection (OSD) only. -1 = Automatic page segmentation with OSD. -2 = Automatic page segmentation, but no OSD, or OCR. 
-3 = Fully automatic page segmentation, but no OSD. (Default) -4 = Assume a single column of text of variable sizes. -5 = Assume a single uniform block of vertically aligned text. -6 = Assume a single uniform block of text. -7 = Treat the image as a single text line. -8 = Treat the image as a single word. -9 = Treat the image as a single word in a circle. -10 = Treat the image as a single character. - - - - -configfile - - - - The name of a config to use. A config is a plaintext file which - contains a list of variables and their values, one per line, with a - space separating variable from value. Interesting config files - include: - - - - -hocr - Output in hOCR format instead of as a text file. - - - - -pdf - Output in pdf instead of a text file. - - - - - - -Nota Bene: The options -l lang and --psm N must occur -before any configfile. - - -SINGLE OPTIONS - - - --v - - - - Returns the current version of the tesseract(1) executable. - - - - - ---list-langs - - - - list available languages for tesseract engine. Can be used with --tessdata-dir. - - - - - ---print-parameters - - - - print tesseract parameters to the stdout. 
- - - - - - -LANGUAGES -There are currently language packs available for the following languages -(in https://github.com/tesseract-ocr/tessdata): -afr (Afrikaans) -amh (Amharic) -ara (Arabic) -asm (Assamese) -aze (Azerbaijani) -aze_cyrl (Azerbaijani - Cyrilic) -bel (Belarusian) -ben (Bengali) -bod (Tibetan) -bos (Bosnian) -bul (Bulgarian) -cat (Catalan; Valencian) -ceb (Cebuano) -ces (Czech) -chi_sim (Chinese - Simplified) -chi_tra (Chinese - Traditional) -chr (Cherokee) -cym (Welsh) -dan (Danish) -dan_frak (Danish - Fraktur) -deu (German) -deu_frak (German - Fraktur) -dzo (Dzongkha) -ell (Greek, Modern (1453-)) -eng (English) -enm (English, Middle (1100-1500)) -epo (Esperanto) -equ (Math / equation detection module) -est (Estonian) -eus (Basque) -fas (Persian) -fin (Finnish) -fra (French) -frk (Frankish) -frm (French, Middle (ca.1400-1600)) -gle (Irish) -glg (Galician) -grc (Greek, Ancient (to 1453)) -guj (Gujarati) -hat (Haitian; Haitian Creole) -heb (Hebrew) -hin (Hindi) -hrv (Croatian) -hun (Hungarian) -iku (Inuktitut) -ind (Indonesian) -isl (Icelandic) -ita (Italian) -ita_old (Italian - Old) -jav (Javanese) -jpn (Japanese) -kan (Kannada) -kat (Georgian) -kat_old (Georgian - Old) -kaz (Kazakh) -khm (Central Khmer) -kir (Kirghiz; Kyrgyz) -kor (Korean) -kur (Kurdish) -lao (Lao) -lat (Latin) -lav (Latvian) -lit (Lithuanian) -mal (Malayalam) -mar (Marathi) -mkd (Macedonian) -mlt (Maltese) -msa (Malay) -mya (Burmese) -nep (Nepali) -nld (Dutch; Flemish) -nor (Norwegian) -ori (Oriya) -osd (Orientation and script detection module) -pan (Panjabi; Punjabi) -pol (Polish) -por (Portuguese) -pus (Pushto; Pashto) -ron (Romanian; Moldavian; Moldovan) -rus (Russian) -san (Sanskrit) -sin (Sinhala; Sinhalese) -slk (Slovak) -slk_frak (Slovak - Fraktur) -slv (Slovenian) -spa (Spanish; Castilian) -spa_old (Spanish; Castilian - Old) -sqi (Albanian) -srp (Serbian) -srp_latn (Serbian - Latin) -swa (Swahili) -swe (Swedish) -syr (Syriac) -tam (Tamil) -tel (Telugu) -tgk (Tajik) -tgl 
(Tagalog) -tha (Thai) -tir (Tigrinya) -tur (Turkish) -uig (Uighur; Uyghur) -ukr (Ukrainian) -urd (Urdu) -uzb (Uzbek) -uzb_cyrl (Uzbek - Cyrilic) -vie (Vietnamese) -yid (Yiddish) -To use a non-standard language pack named foo.traineddata, set the -TESSDATA_PREFIX environment variable so the file can be found at -TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the -argument -l foo. - - -CONFIG FILES AND AUGMENTING WITH USER DATA -Tesseract config files consist of lines with variable-value pairs (space -separated). The variables are documented as flags in the source code like -the following one in tesseractclass.h: -STRING_VAR_H(tessedit_char_blacklist, "", - "Blacklist of chars not to recognize"); -These variables may enable or disable various features of the engine, and -may cause it to load (or not load) various data. For instance, let’s suppose -you want to OCR in English, but suppress the normal dictionary and load an -alternative word list and an alternative list of patterns — these two files -are the most commonly used extra data files. -If your language pack is in /path/to/eng.traineddata and the hocr config -is in /path/to/configs/hocr then create three new files: -/path/to/eng.user-words: -
-the -quick -brown -fox -jumped -
-/path/to/eng.user-patterns: -
-1-\d\d\d-GOOG-411 -www.\n\\\*.com -
-/path/to/configs/bazaar: -
-load_system_dawg F -load_freq_dawg F -user_words_suffix user-words -user_patterns_suffix user-patterns -
-Now, if you pass the word bazaar as a trailing command line parameter -to Tesseract, Tesseract will not bother loading the system dictionary nor -the dictionary of frequent words and will load and use the eng.user-words -and eng.user-patterns files you provided. The former is a simple word list, -one per line. The format of the latter is documented in dict/trie.h -on read_pattern_list(). -
- -HISTORY -The engine was developed at Hewlett Packard Laboratories Bristol and at -Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more -changes made in 1996 to port to Windows, and some C++izing in 1998. A -lot of the code was written in C, and then some more was written in C++. -The C\++ code makes heavy use of a list system using macros. This predates -stl, was portable before stl, and is more efficient than stl lists, but has -the big negative that if you do get a segmentation violation, it is hard to -debug. -Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability -to train Tesseract. -Tesseract was included in UNLV’s Fourth Annual Test of OCR Accuracy. -See https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf. With Tesseract 2.00, -scripts are now included to allow anyone to reproduce some of these tests. -See https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract for more -details. -Tesseract 3.00 adds a number of new languages, including Chinese, Japanese, -and Korean. It also introduces a new, single-file based system of managing -language data. -Tesseract 3.02 adds BiDirectional text support, the ability to recognize -multiple languages in a single image, and improved layout analysis. -For further details, see the file ReleaseNotes included with the distribution. - - -RESOURCES -Main web site: https://github.com/tesseract-ocr -Information on training: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract - - -SEE ALSO -ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), -shape_training(1), mftraining(1), unicharambigs(5), unicharset(5), -unicharset_extractor(1), wordlist2dawg(1) - - -AUTHOR -Tesseract development was led at Hewlett-Packard and Google by Ray Smith. 
-The development team has included: -Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger, -Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke, -Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle, -Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel -Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh -Lloyd, Shobhit Saxena, and Thomas Kielbus. - - -COPYING -Licensed under the Apache License, Version 2.0 - -
+ + + + + + + TESSERACT(1) + + +tesseract +1 +  +  + + + tesseract + command-line OCR engine + + +tesseract imagename|stdin outputbase|stdout [options…] [configfile…] + + +DESCRIPTION +tesseract(1) is a commercial quality OCR engine originally developed at HP +between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by +UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed +at Google since then. + + +IN/OUT ARGUMENTS + + + +imagename + + + + The name of the input image. Most image file formats (anything + readable by Leptonica) are supported. + + + + + +stdin + + + + Instruction to read data from standard input + + + + + +outputbase + + + + The basename of the output file (to which the appropriate extension + will be appended). By default the output will be named outbase.txt. + + + + + +stdout + + + + Instruction to sent output data to standard output + + + + + + +OPTIONS + + + +--tessdata-dir /path + + + + Specify the location of tessdata path + + + + + +--user-words /path/to/file + + + + Specify the location of user words file + + + + + +--user-patterns /path/to/file specify + + + + The location of user patterns file + + + + + +-c configvar=value + + + + Set value for control parameter. Multiple -c arguments are allowed. + + + + + +-l lang + + + + The language to use. If none is specified, English is assumed. + Multiple languages may be specified, separated by plus characters. + Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES) + + + + + +--psm N + + + + Set Tesseract to only run a subset of layout analysis and assume + a certain form of image. The options for N are: + +0 = Orientation and script detection (OSD) only. +1 = Automatic page segmentation with OSD. +2 = Automatic page segmentation, but no OSD, or OCR. +3 = Fully automatic page segmentation, but no OSD. (Default) +4 = Assume a single column of text of variable sizes. +5 = Assume a single uniform block of vertically aligned text. 
+6 = Assume a single uniform block of text. +7 = Treat the image as a single text line. +8 = Treat the image as a single word. +9 = Treat the image as a single word in a circle. +10 = Treat the image as a single character. + + + + +configfile + + + + The name of a config to use. A config is a plaintext file which + contains a list of variables and their values, one per line, with a + space separating variable from value. Interesting config files + include: + + + + +hocr - Output in hOCR format instead of as a text file. + + + + +pdf - Output in pdf instead of a text file. + + + + + + +Nota Bene: The options -l lang and --psm N must occur +before any configfile. + + +SINGLE OPTIONS + + + +-v + + + + Returns the current version of the tesseract(1) executable. + + + + + +--list-langs + + + + list available languages for tesseract engine. Can be used with --tessdata-dir. + + + + + +--print-parameters + + + + print tesseract parameters to the stdout. + + + + + + +LANGUAGES +There are currently language packs available for the following languages +(in https://github.com/tesseract-ocr/tessdata): +afr (Afrikaans) +amh (Amharic) +ara (Arabic) +asm (Assamese) +aze (Azerbaijani) +aze_cyrl (Azerbaijani - Cyrilic) +bel (Belarusian) +ben (Bengali) +bod (Tibetan) +bos (Bosnian) +bul (Bulgarian) +cat (Catalan; Valencian) +ceb (Cebuano) +ces (Czech) +chi_sim (Chinese - Simplified) +chi_tra (Chinese - Traditional) +chr (Cherokee) +cym (Welsh) +dan (Danish) +dan_frak (Danish - Fraktur) +deu (German) +deu_frak (German - Fraktur) +dzo (Dzongkha) +ell (Greek, Modern (1453-)) +eng (English) +enm (English, Middle (1100-1500)) +epo (Esperanto) +equ (Math / equation detection module) +est (Estonian) +eus (Basque) +fas (Persian) +fin (Finnish) +fra (French) +frk (Frankish) +frm (French, Middle (ca.1400-1600)) +gle (Irish) +glg (Galician) +grc (Greek, Ancient (to 1453)) +guj (Gujarati) +hat (Haitian; Haitian Creole) +heb (Hebrew) +hin (Hindi) +hrv (Croatian) +hun (Hungarian) +iku (Inuktitut) 
+ind (Indonesian) +isl (Icelandic) +ita (Italian) +ita_old (Italian - Old) +jav (Javanese) +jpn (Japanese) +kan (Kannada) +kat (Georgian) +kat_old (Georgian - Old) +kaz (Kazakh) +khm (Central Khmer) +kir (Kirghiz; Kyrgyz) +kor (Korean) +kur (Kurdish) +lao (Lao) +lat (Latin) +lav (Latvian) +lit (Lithuanian) +mal (Malayalam) +mar (Marathi) +mkd (Macedonian) +mlt (Maltese) +msa (Malay) +mya (Burmese) +nep (Nepali) +nld (Dutch; Flemish) +nor (Norwegian) +ori (Oriya) +osd (Orientation and script detection module) +pan (Panjabi; Punjabi) +pol (Polish) +por (Portuguese) +pus (Pushto; Pashto) +ron (Romanian; Moldavian; Moldovan) +rus (Russian) +san (Sanskrit) +sin (Sinhala; Sinhalese) +slk (Slovak) +slk_frak (Slovak - Fraktur) +slv (Slovenian) +spa (Spanish; Castilian) +spa_old (Spanish; Castilian - Old) +sqi (Albanian) +srp (Serbian) +srp_latn (Serbian - Latin) +swa (Swahili) +swe (Swedish) +syr (Syriac) +tam (Tamil) +tel (Telugu) +tgk (Tajik) +tgl (Tagalog) +tha (Thai) +tir (Tigrinya) +tur (Turkish) +uig (Uighur; Uyghur) +ukr (Ukrainian) +urd (Urdu) +uzb (Uzbek) +uzb_cyrl (Uzbek - Cyrilic) +vie (Vietnamese) +yid (Yiddish) +To use a non-standard language pack named foo.traineddata, set the +TESSDATA_PREFIX environment variable so the file can be found at +TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the +argument -l foo. + + +CONFIG FILES AND AUGMENTING WITH USER DATA +Tesseract config files consist of lines with variable-value pairs (space +separated). The variables are documented as flags in the source code like +the following one in tesseractclass.h: +STRING_VAR_H(tessedit_char_blacklist, "", + "Blacklist of chars not to recognize"); +These variables may enable or disable various features of the engine, and +may cause it to load (or not load) various data. 
For instance, let’s suppose +you want to OCR in English, but suppress the normal dictionary and load an +alternative word list and an alternative list of patterns — these two files +are the most commonly used extra data files. +If your language pack is in /path/to/eng.traineddata and the hocr config +is in /path/to/configs/hocr then create three new files: +/path/to/eng.user-words: +
+the +quick +brown +fox +jumped +
+/path/to/eng.user-patterns: +
+1-\d\d\d-GOOG-411 +www.\n\\\*.com +
+/path/to/configs/bazaar: +
+load_system_dawg F +load_freq_dawg F +user_words_suffix user-words +user_patterns_suffix user-patterns +
+Now, if you pass the word bazaar as a trailing command line parameter +to Tesseract, Tesseract will not bother loading the system dictionary nor +the dictionary of frequent words and will load and use the eng.user-words +and eng.user-patterns files you provided. The former is a simple word list, +one per line. The format of the latter is documented in dict/trie.h +on read_pattern_list(). +
+ +HISTORY +The engine was developed at Hewlett Packard Laboratories Bristol and at +Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more +changes made in 1996 to port to Windows, and some C++izing in 1998. A +lot of the code was written in C, and then some more was written in C++. +The C\++ code makes heavy use of a list system using macros. This predates +stl, was portable before stl, and is more efficient than stl lists, but has +the big negative that if you do get a segmentation violation, it is hard to +debug. +Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability +to train Tesseract. +Tesseract was included in UNLV’s Fourth Annual Test of OCR Accuracy. +See https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf. With Tesseract 2.00, +scripts are now included to allow anyone to reproduce some of these tests. +See https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract for more +details. +Tesseract 3.00 adds a number of new languages, including Chinese, Japanese, +and Korean. It also introduces a new, single-file based system of managing +language data. +Tesseract 3.02 adds BiDirectional text support, the ability to recognize +multiple languages in a single image, and improved layout analysis. +For further details, see the file ReleaseNotes included with the distribution. + + +RESOURCES +Main web site: https://github.com/tesseract-ocr +Information on training: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract + + +SEE ALSO +ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), +shape_training(1), mftraining(1), unicharambigs(5), unicharset(5), +unicharset_extractor(1), wordlist2dawg(1) + + +AUTHOR +Tesseract development was led at Hewlett-Packard and Google by Ray Smith. 
+The development team has included: +Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger, +Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke, +Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle, +Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel +Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh +Lloyd, Shobhit Saxena, and Thomas Kielbus. + + +COPYING +Licensed under the Apache License, Version 2.0 + +
diff --git a/doc/unicharambigs.5.asc b/doc/unicharambigs.5.asc index 7ce25e4478..079f6d53de 100644 --- a/doc/unicharambigs.5.asc +++ b/doc/unicharambigs.5.asc @@ -38,7 +38,7 @@ EXAMPLE 3 i i i 1 m 0 ............................... -In this example, all instances of the '2' character sequence '''' will +In this example, all instances of the '2' character sequence '''' will *always* be replaced by the '1' character sequence '"'; a '1' character sequence 'm' *may* be replaced by the '2' character sequence 'rn', and the '3' character sequence *may* be replaced by the '1' character diff --git a/doc/unicharambigs.5.html b/doc/unicharambigs.5.html index c6a645e69c..bb9fb291a3 100644 --- a/doc/unicharambigs.5.html +++ b/doc/unicharambigs.5.html @@ -1,875 +1,875 @@ - - - - - -UNICHARAMBIGS(5) - - - - - -
-
-

DESCRIPTION

-
-

The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) -is used by Tesseract to represent possible ambiguities between characters, -or groups of characters.

-

The file contains a number of lines, laid out as follows:

-
-
-
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
-
-
- - - - - - - - - - - - - - - - - - - - -
-Field one -
-
-

-the number of characters contained in field two -

-
-Field two -
-
-

-the character sequence to be replaced -

-
-Field three -
-
-

-the number of characters contained in field four -

-
-Field four -
-
-

-the character sequence used to replace field two -

-
-Field five -
-
-

-contains either 1 or 0. 1 denotes a mandatory -replacement, 0 denotes an optional replacement. -

-
-

Characters appearing in fields two and four should appear in -unicharset. The numbers in fields one and three refer to the -number of unichars (not bytes).

-
-
-
-

EXAMPLE

-
-
-
-
2       ' '     1       "     1
-1       m       2       r n   0
-3       i i i   1       m     0
-
-

In this example, all instances of the 2-character sequence '' will -always be replaced by the 1-character sequence "; a 1-character -sequence m may be replaced by the 2-character sequence rn, and -the 3-character sequence iii may be replaced by the 1-character -sequence m.

-
-
-
-

HISTORY

-
-

The unicharambigs file first appeared in Tesseract 3.00; prior to that, a -similar format, called DangAmbigs (dangerous ambiguities) was used: the -format was almost identical, except only mandatory replacements could be -specified, and field 5 was absent.

-
-
-
-

BUGS

-
-

This is a documentation "bug": it’s not currently clear what should be done -in the case of ligatures (such as fi) which may also appear as regular -letters in the unicharset.

-
-
-
-

SEE ALSO

-
-

tesseract(1), unicharset(5)

-
-
-
-

AUTHOR

-
-

The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present).

-
-
-
-

- - - + + + + + +UNICHARAMBIGS(5) + + + + + +
+
+

DESCRIPTION

+
+

The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) +is used by Tesseract to represent possible ambiguities between characters, +or groups of characters.

+

The file contains a number of lines, laid out as follows:

+
+
+
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
+
+
+ + + + + + + + + + + + + + + + + + + + +
+Field one +
+
+

+the number of characters contained in field two +

+
+Field two +
+
+

+the character sequence to be replaced +

+
+Field three +
+
+

+the number of characters contained in field four +

+
+Field four +
+
+

+the character sequence used to replace field two +

+
+Field five +
+
+

+contains either 1 or 0. 1 denotes a mandatory +replacement, 0 denotes an optional replacement. +

+
+

Characters appearing in fields two and four should appear in +unicharset. The numbers in fields one and three refer to the +number of unichars (not bytes).

+
+
+
+

EXAMPLE

+
+
+
+
2       ' '     1       "     1
+1       m       2       r n   0
+3       i i i   1       m     0
+
+

In this example, all instances of the 2-character sequence '' will +always be replaced by the 1-character sequence "; a 1-character +sequence m may be replaced by the 2-character sequence rn, and +the 3-character sequence iii may be replaced by the 1-character +sequence m.
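
To make the five-field layout concrete, here is a minimal Python sketch of a line parser. It is not Tesseract's own reader: the `Ambiguity` class and `parse_ambig_line` function are hypothetical names, and the sketch assumes that the unichars inside fields two and four are separated by spaces, as the example lines suggest.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ambiguity:
    source: List[str]   # field two: unichars to be replaced
    target: List[str]   # field four: unichars used as the replacement
    mandatory: bool     # field five: True for 1, False for 0


def parse_ambig_line(line: str) -> Ambiguity:
    """Parse one TAB-separated unicharambigs line into an Ambiguity."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 TAB-separated fields, got {len(fields)}")
    n_source, source_text, n_target, target_text, flag = fields
    source = source_text.split()
    target = target_text.split()
    # Fields one and three count unichars, not bytes.
    if len(source) != int(n_source) or len(target) != int(n_target):
        raise ValueError("unichar count does not match the sequence length")
    if flag not in ("0", "1"):
        raise ValueError("field five must be 0 or 1")
    return Ambiguity(source, target, flag == "1")


if __name__ == "__main__":
    for raw in ["2\t' '\t1\t\"\t1", "1\tm\t2\tr n\t0", "3\ti i i\t1\tm\t0"]:
        print(parse_ambig_line(raw))
```

Validating the counts against the sequences catches lines where a TAB was accidentally typed as spaces, which would otherwise shift the fields.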

+
+
+
+

HISTORY

+
+

The unicharambigs file first appeared in Tesseract 3.00; prior to that, a +similar format, called DangAmbigs (dangerous ambiguities) was used: the +format was almost identical, except only mandatory replacements could be +specified, and field 5 was absent.

+
+
+
+

BUGS

+
+

This is a documentation "bug": it’s not currently clear what should be done +in the case of ligatures (such as fi) which may also appear as regular +letters in the unicharset.

+
+
+
+

SEE ALSO

+
+

tesseract(1), unicharset(5)

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/doc/unicharambigs.5.xml b/doc/unicharambigs.5.xml index 75b3c66431..cbc0f50e50 100644 --- a/doc/unicharambigs.5.xml +++ b/doc/unicharambigs.5.xml @@ -1,126 +1,126 @@ - - - - - - - UNICHARAMBIGS(5) - - -unicharambigs -5 -  -  - - - unicharambigs - Tesseract unicharset ambiguities - - -DESCRIPTION -The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) -is used by Tesseract to represent possible ambiguities between characters, -or groups of characters. -The file contains a number of lines, laid out as follow: -[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num] - - - - -Field one - - - - -the number of characters contained in field two - - - - - - -Field two - - - - -the character sequence to be replaced - - - - - - -Field three - - - - -the number of characters contained in field four - - - - - - -Field four - - - - -the character sequence used to replace field two - - - - - - -Field five - - - - -contains either 1 or 0. 1 denotes a mandatory -replacement, 0 denotes an optional replacement. - - - - -Characters appearing in fields two and four should appear in -unicharset. The numbers in fields one and three refer to the -number of unichars (not bytes). - - -EXAMPLE -2 ' ' 1 " 1 -1 m 2 r n 0 -3 i i i 1 m 0 -In this example, all instances of the 2 character sequence '' will -always be replaced by the 1 character sequence "; a 1 character -sequence m may be replaced by the 2 character sequence rn, and -the 3 character sequence may be replaced by the 1 character -sequence m. - - -HISTORY -The unicharambigs file first appeared in Tesseract 3.00; prior to that, a -similar format, called DangAmbigs (dangerous ambiguities) was used: the -format was almost identical, except only mandatory replacements could be -specified, and field 5 was absent. 
- - -BUGS -This is a documentation "bug": it’s not currently clear what should be done -in the case of ligatures (such as fi) which may also appear as regular -letters in the unicharset. - - -SEE ALSO -tesseract(1), unicharset(5) - - -AUTHOR -The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present). - - + + + + + + + UNICHARAMBIGS(5) + + +unicharambigs +5 +  +  + + + unicharambigs + Tesseract unicharset ambiguities + + +DESCRIPTION +The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) +is used by Tesseract to represent possible ambiguities between characters, +or groups of characters. +The file contains a number of lines, laid out as follow: +[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num] + + + + +Field one + + + + +the number of characters contained in field two + + + + + + +Field two + + + + +the character sequence to be replaced + + + + + + +Field three + + + + +the number of characters contained in field four + + + + + + +Field four + + + + +the character sequence used to replace field two + + + + + + +Field five + + + + +contains either 1 or 0. 1 denotes a mandatory +replacement, 0 denotes an optional replacement. + + + + +Characters appearing in fields two and four should appear in +unicharset. The numbers in fields one and three refer to the +number of unichars (not bytes). + + +EXAMPLE +2 ' ' 1 " 1 +1 m 2 r n 0 +3 i i i 1 m 0 +In this example, all instances of the 2 character sequence '' will +always be replaced by the 1 character sequence "; a 1 character +sequence m may be replaced by the 2 character sequence rn, and +the 3 character sequence may be replaced by the 1 character +sequence m. 
+ + +HISTORY +The unicharambigs file first appeared in Tesseract 3.00; prior to that, a +similar format, called DangAmbigs (dangerous ambiguities) was used: the +format was almost identical, except only mandatory replacements could be +specified, and field 5 was absent. + + +BUGS +This is a documentation "bug": it’s not currently clear what should be done +in the case of ligatures (such as fi) which may also appear as regular +letters in the unicharset. + + +SEE ALSO +tesseract(1), unicharset(5) + + +AUTHOR +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present). + + diff --git a/doc/unicharset.5.html b/doc/unicharset.5.html index 0f16c9e5e5..f3c3e7a9fc 100644 --- a/doc/unicharset.5.html +++ b/doc/unicharset.5.html @@ -1,965 +1,965 @@ - - - - - -UNICHARSET(5) - - - - - -
-
-

DESCRIPTION

-
-

Tesseract’s unicharset file contains information on each symbol -(unichar) the Tesseract OCR engine is trained to recognize.

-

A unicharset file (e.g. eng.unicharset) is distributed as part of a -Tesseract language pack (e.g. eng.traineddata). For information on -extracting the unicharset file, see combine_tessdata(1).

-

The first line of a unicharset file contains the number of unichars in -the file. After this line, each subsequent line provides information for -a single unichar. The first such line contains a placeholder reserved for -the space character. Each unichar is referred to within Tesseract by its -Unichar ID, which is the line number (minus 1) within the unicharset file. -Therefore, space gets unichar 0.

-

Each unichar line in the unicharset file (v2+) may have four space-separated fields:

-
-
-
'character' 'properties' 'script' 'id'
-
-

Starting with Tesseract v3.02, more information may be given for each unichar:

-
-
-
'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
-
-

Entries:

-
-
-character -
-
-

-The UTF-8 encoded string to be produced for this unichar. -

-
-
-properties -
-
-

-An integer mask of character properties, one per bit. - From least to most significant bit, these are: isalpha, islower, isupper, - isdigit, ispunctuation. -

-
-
-glyph_metrics -
-
-

-Ten comma-separated integers representing various standards - for where this glyph is to be found within a baseline-normalized coordinate - system where 128 is normalized to x-height. -

-
    -
  • -

    -min_bottom, max_bottom: the ranges where the bottom of the character can - be found. -

    -
  • -
  • -

    -min_top, max_top: the ranges where the top of the character may be found. -

    -
  • -
  • -

    -min_width, max_width: horizontal width of the character. -

    -
  • -
  • -

    -min_bearing, max_bearing: how far from the usual start position does the - leftmost part of the character begin. -

    -
  • -
  • -

    -min_advance, max_advance: how far from the printer’s cell left do we - advance to begin the next character. -

    -
  • -
-
-
-script -
-
-

-Name of the script (Latin, Common, Greek, Cyrillic, Han, null). -

-
-
-other_case -
-
-

-The Unichar ID of the other case version of this character - (upper or lower). -

-
-
-direction -
-
-

-The Unicode BiDi direction of this character, as defined by - ICU’s enum UCharDirection. (0 = Left to Right, 1 = Right to Left, - 2 = European Number…) -

-
-
-mirror -
-
-

-The Unichar ID of the BiDirectional mirror of this character. - For example the mirror of open paren is close paren, but Latin Capital C - has no mirror, so it remains a Latin Capital C. -

-
-
-normed_form -
-
-

-The UTF-8 representation of a "normalized form" of this unichar - for the purpose of blaming a module for errors given ground truth text. - For instance, a left or right single quote may normalize to an ASCII quote. -

-
-
-
-
-
-

EXAMPLE (v2)

-
-
-
-
; 10 Common 46
-b 3 Latin 59
-W 5 Latin 40
-7 8 Common 66
-= 0 Common 93
-
-

";" is a punctuation character. Its properties are thus represented by the -binary number 10000 (10 in hexadecimal).

-

"b" is an alphabetic character and a lower case character. Its properties are -thus represented by the binary number 00011 (3 in hexadecimal).

-

"W" is an alphabetic character and an upper case character. Its properties are -thus represented by the binary number 00101 (5 in hexadecimal).

-

"7" is just a digit. Its properties are thus represented by the binary number -01000 (8 in hexadecimal).

-

"=" is not punctuation nor a digit nor an alphabetic character. Its properties -are thus represented by the binary number 00000 (0 in hexadecimal).

-

Japanese or Chinese alphabetic character properties are represented by the -binary number 00001 (1 in hexadecimal): they are alphabetic, but neither -upper nor lower case.

-
-
-
-

EXAMPLE (v3.02)

-
-
-
-
110
-NULL 0 NULL 0
-N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
-Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
-1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
-9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
-a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
-. . .
-
-
-
-
-

CAVEATS

-
-

Although the unicharset reader maintains the ability to read unicharsets -of older formats and will assign default values to missing fields, -the accuracy will be degraded.

-

Further, most other data files are indexed by the unicharset file, -so changing it without re-generating the others is likely to have dire -consequences.

-
-
-
-

HISTORY

-
-

The unicharset format first appeared with Tesseract 2.00, which was the -first version to support languages other than English. The unicharset file -contained only the first two fields, and the "ispunctuation" property was -absent (punctuation was regarded as "0", as "=" is in the above example).

-
-
-
-

SEE ALSO

-
-

tesseract(1), combine_tessdata(1), unicharset_extractor(1)

- -
-
-
-

AUTHOR

-
-

The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present).

-
-
-
-

- - - + + + + + +UNICHARSET(5) + + + + + +
+
+

DESCRIPTION

+
+

Tesseract’s unicharset file contains information on each symbol +(unichar) the Tesseract OCR engine is trained to recognize.

+

A unicharset file (e.g. eng.unicharset) is distributed as part of a +Tesseract language pack (e.g. eng.traineddata). For information on +extracting the unicharset file, see combine_tessdata(1).

+

The first line of a unicharset file contains the number of unichars in +the file. After this line, each subsequent line provides information for +a single unichar. The first such line contains a placeholder reserved for +the space character. Each unichar is referred to within Tesseract by its +Unichar ID, which is the line number (minus 1) within the unicharset file. +Therefore, space gets unichar 0.

+

Each unichar line in the unicharset file (v2+) may have four space-separated fields:

+
+
+
'character' 'properties' 'script' 'id'
+
+

Starting with Tesseract v3.02, more information may be given for each unichar:

+
+
+
'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
+
+

Entries:

+
+
+character +
+
+

+The UTF-8 encoded string to be produced for this unichar. +

+
+
+properties +
+
+

+An integer mask of character properties, one per bit. + From least to most significant bit, these are: isalpha, islower, isupper, + isdigit, ispunctuation. +

+
+
+glyph_metrics +
+
+

+Ten comma-separated integers representing various standards + for where this glyph is to be found within a baseline-normalized coordinate + system where 128 is normalized to x-height. +

+
    +
  • +

    +min_bottom, max_bottom: the ranges where the bottom of the character can + be found. +

    +
  • +
  • +

    +min_top, max_top: the ranges where the top of the character may be found. +

    +
  • +
  • +

    +min_width, max_width: horizontal width of the character. +

    +
  • +
  • +

    +min_bearing, max_bearing: how far from the usual start position does the + leftmost part of the character begin. +

    +
  • +
  • +

    +min_advance, max_advance: how far from the printer’s cell left do we + advance to begin the next character. +

    +
  • +
+
+
+script +
+
+

+Name of the script (Latin, Common, Greek, Cyrillic, Han, null). +

+
+
+other_case +
+
+

+The Unichar ID of the other case version of this character + (upper or lower). +

+
+
+direction +
+
+

+The Unicode BiDi direction of this character, as defined by + ICU’s enum UCharDirection. (0 = Left to Right, 1 = Right to Left, + 2 = European Number…) +

+
+
+mirror +
+
+

+The Unichar ID of the BiDirectional mirror of this character. + For example the mirror of open paren is close paren, but Latin Capital C + has no mirror, so it remains a Latin Capital C. +

+
+
+normed_form +
+
+

+The UTF-8 representation of a "normalized form" of this unichar + for the purpose of blaming a module for errors given ground truth text. + For instance, a left or right single quote may normalize to an ASCII quote. +

+
+
+
+
+
+

EXAMPLE (v2)

+
+
+
+
; 10 Common 46
+b 3 Latin 59
+W 5 Latin 40
+7 8 Common 66
+= 0 Common 93
+
+

";" is a punctuation character. Its properties are thus represented by the +binary number 10000 (10 in hexadecimal).

+

"b" is an alphabetic character and a lower case character. Its properties are +thus represented by the binary number 00011 (3 in hexadecimal).

+

"W" is an alphabetic character and an upper case character. Its properties are +thus represented by the binary number 00101 (5 in hexadecimal).

+

"7" is just a digit. Its properties are thus represented by the binary number +01000 (8 in hexadecimal).

+

"=" is not punctuation nor a digit nor an alphabetic character. Its properties +are thus represented by the binary number 00000 (0 in hexadecimal).

+

Japanese or Chinese alphabetic character properties are represented by the +binary number 00001 (1 in hexadecimal): they are alphabetic, but neither +upper nor lower case.
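
The bit layout described above can be checked with a short script. This is a minimal sketch, not part of Tesseract: the helper name decode_properties is hypothetical, and it assumes the properties field is written in hexadecimal, as the explanations of the example lines suggest.

```python
# Property names, ordered from least to most significant bit,
# as listed under "properties" in the DESCRIPTION section.
PROPERTY_BITS = ("isalpha", "islower", "isupper", "isdigit", "ispunctuation")


def decode_properties(field):
    """Return the names of the properties set in a unicharset properties field.

    Assumption: the field is a hexadecimal string such as "10" or "3".
    """
    mask = int(field, 16)
    return [name for bit, name in enumerate(PROPERTY_BITS) if mask & (1 << bit)]


if __name__ == "__main__":
    # Properties fields taken from the v2 example lines.
    for character, field in ((";", "10"), ("b", "3"), ("W", "5"), ("7", "8"), ("=", "0")):
        print(character, decode_properties(field))
```

Run against the v2 example, ";" decodes to ispunctuation only, "b" to isalpha and islower, and "=" to an empty list.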

+
+
+
+

EXAMPLE (v3.02)

+
+
+
+
110
+NULL 0 NULL 0
+N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
+Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
+1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
+9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
+a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
+. . .
+
+
+
+
+
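
A v3.02 unichar line such as those in the example can be split into its eight fields as sketched below. This is an illustration under stated assumptions, not Tesseract's own reader: parse_unichar_line is a hypothetical helper, the ten glyph_metrics integers are assumed to appear in the order the entries are listed in DESCRIPTION, and the properties field is assumed to be hexadecimal. Shorter lines (such as the NULL placeholder) are rejected rather than filled with defaults.

```python
# Assumed order of the ten glyph_metrics integers (the order in which
# the DESCRIPTION section lists them).
GLYPH_METRIC_NAMES = (
    "min_bottom", "max_bottom", "min_top", "max_top", "min_width",
    "max_width", "min_bearing", "max_bearing", "min_advance", "max_advance",
)


def parse_unichar_line(line):
    """Parse one eight-field (v3.02) unichar line into a dictionary."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 space-separated fields, got %d" % len(fields))
    (character, properties, metrics, script,
     other_case, direction, mirror, normed_form) = fields
    values = [int(v) for v in metrics.split(",")]
    if len(values) != len(GLYPH_METRIC_NAMES):
        raise ValueError("expected 10 glyph metrics, got %d" % len(values))
    return {
        "character": character,
        "properties": int(properties, 16),  # assumption: hexadecimal mask
        "glyph_metrics": dict(zip(GLYPH_METRIC_NAMES, values)),
        "script": script,
        "other_case": int(other_case),  # Unichar ID
        "direction": int(direction),    # UCharDirection value
        "mirror": int(mirror),          # Unichar ID
        "normed_form": normed_form,
    }


if __name__ == "__main__":
    record = parse_unichar_line(
        "N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N")
    print(record["character"], record["script"], record["glyph_metrics"])
```

Returning a plain dictionary keeps the sketch free of dependencies; a reader that also accepts older formats would need to supply default values for the missing fields, as the CAVEATS section notes.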

CAVEATS

+
+

Although the unicharset reader maintains the ability to read unicharsets +of older formats and will assign default values to missing fields, +the accuracy will be degraded.

+

Further, most other data files are indexed by the unicharset file, +so changing it without re-generating the others is likely to have dire +consequences.

+
+
+
+

HISTORY

+
+

The unicharset format first appeared with Tesseract 2.00, which was the +first version to support languages other than English. The unicharset file +contained only the first two fields, and the "ispunctuation" property was +absent (punctuation was regarded as "0", as "=" is in the above example).

+
+
+
+

SEE ALSO

+
+

tesseract(1), combine_tessdata(1), unicharset_extractor(1)

+ +
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/doc/unicharset.5.xml b/doc/unicharset.5.xml index 9ae6257e60..40e03c6eea 100644 --- a/doc/unicharset.5.xml +++ b/doc/unicharset.5.xml @@ -1,219 +1,219 @@ - - - - - - - UNICHARSET(5) - - -unicharset -5 -  -  - - - unicharset - character properties file used by tesseract(1) - - -DESCRIPTION -Tesseract’s unicharset file contains information on each symbol -(unichar) the Tesseract OCR engine is trained to recognize. -A unicharset file (i.e. eng.unicharset) is distributed as part of a -Tesseract language pack (i.e. eng.traineddata). For information on -extracting the unicharset file, see combine_tessdata(1). -The first line of a unicharset file contains the number of unichars in -the file. After this line, each subsequent line provides information for -a single unichar. The first such line contains a placeholder reserved for -the space character. Each unichar is referred to within Tesseract by its -Unichar ID, which is the line number (minus 1) within the unicharset file. -Therefore, space gets unichar 0. -Each unichar line in the unicharset file (v2+) may have four space-separated fields: -'character' 'properties' 'script' 'id' -Starting with Tesseract v3.02, more information may be given for each unichar: -'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form' -Entries: - - - -character - - - -The UTF-8 encoded string to be produced for this unichar. - - - - - -properties - - - -An integer mask of character properties, one per bit. - From least to most significant bit, these are: isalpha, islower, isupper, - isdigit, ispunctuation. - - - - - -glyph_metrics - - - -Ten comma-separated integers representing various standards - for where this glyph is to be found within a baseline-normalized coordinate - system where 128 is normalized to x-height. - - - - -min_bottom, max_bottom: the ranges where the bottom of the character can - be found. 
- - - - -min_top, max_top: the ranges where the top of the character may be found. - - - - -min_width, max_width: horizontal width of the character. - - - - -min_bearing, max_bearing: how far from the usual start position does the - leftmost part of the character begin. - - - - -min_advance, max_advance: how far from the printer’s cell left do we - advance to begin the next character. - - - - - - - -script - - - -Name of the script (Latin, Common, Greek, Cyrillic, Han, null). - - - - - -other_case - - - -The Unichar ID of the other case version of this character - (upper or lower). - - - - - -direction - - - -The Unicode BiDi direction of this character, as defined by - ICU’s enum UCharDirection. (0 = Left to Right, 1 = Right to Left, - 2 = European Number…) - - - - - -mirror - - - -The Unichar ID of the BiDirectional mirror of this character. - For example the mirror of open paren is close paren, but Latin Capital C - has no mirror, so it remains a Latin Capital C. - - - - - -normed_form - - - -The UTF-8 representation of a "normalized form" of this unichar - for the purpose of blaming a module for errors given ground truth text. - For instance, a left or right single quote may normalize to an ASCII quote. - - - - - - -EXAMPLE (v2) -; 10 Common 46 -b 3 Latin 59 -W 5 Latin 40 -7 8 Common 66 -= 0 Common 93 -";" is a punctuation character. Its properties are thus represented by the -binary number 10000 (10 in hexadecimal). -"b" is an alphabetic character and a lower case character. Its properties are -thus represented by the binary number 00011 (3 in hexadecimal). -"W" is an alphabetic character and an upper case character. Its properties are -thus represented by the binary number 00101 (5 in hexadecimal). -"7" is just a digit. Its properties are thus represented by the binary number -01000 (8 in hexadecimal). -"=" is not punctuation nor a digit nor an alphabetic character. Its properties -are thus represented by the binary number 00000 (0 in hexadecimal). 
-Japanese or Chinese alphabetic character properties are represented by the -binary number 00001 (1 in hexadecimal): they are alphabetic, but neither -upper nor lower case. - - -EXAMPLE (v3.02) -110 -NULL 0 NULL 0 -N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N -Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y -1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1 -9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9 -a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a -. . . - - -CAVEATS -Although the unicharset reader maintains the ability to read unicharsets -of older formats and will assign default values to missing fields, -the accuracy will be degraded. -Further, most other data files are indexed by the unicharset file, -so changing it without re-generating the others is likely to have dire -consequences. - - -HISTORY -The unicharset format first appeared with Tesseract 2.00, which was the -first version to support languages other than English. The unicharset file -contained only the first two fields, and the "ispunctuation" property was -absent (punctuation was regarded as "0", as "=" is in the above example. - - -SEE ALSO -tesseract(1), combine_tessdata(1), unicharset_extractor(1) -https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract - - -AUTHOR -The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present). - - + + + + + + + UNICHARSET(5) + + +unicharset +5 +  +  + + + unicharset + character properties file used by tesseract(1) + + +DESCRIPTION +Tesseract’s unicharset file contains information on each symbol +(unichar) the Tesseract OCR engine is trained to recognize. +A unicharset file (i.e. eng.unicharset) is distributed as part of a +Tesseract language pack (i.e. eng.traineddata). For information on +extracting the unicharset file, see combine_tessdata(1). +The first line of a unicharset file contains the number of unichars in +the file. 
After this line, each subsequent line provides information for +a single unichar. The first such line contains a placeholder reserved for +the space character. Each unichar is referred to within Tesseract by its +Unichar ID, which is the line number (minus 1) within the unicharset file. +Therefore, space gets unichar 0. +Each unichar line in the unicharset file (v2+) may have four space-separated fields: +'character' 'properties' 'script' 'id' +Starting with Tesseract v3.02, more information may be given for each unichar: +'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form' +Entries: + + + +character + + + +The UTF-8 encoded string to be produced for this unichar. + + + + + +properties + + + +An integer mask of character properties, one per bit. + From least to most significant bit, these are: isalpha, islower, isupper, + isdigit, ispunctuation. + + + + + +glyph_metrics + + + +Ten comma-separated integers representing various standards + for where this glyph is to be found within a baseline-normalized coordinate + system where 128 is normalized to x-height. + + + + +min_bottom, max_bottom: the ranges where the bottom of the character can + be found. + + + + +min_top, max_top: the ranges where the top of the character may be found. + + + + +min_width, max_width: horizontal width of the character. + + + + +min_bearing, max_bearing: how far from the usual start position does the + leftmost part of the character begin. + + + + +min_advance, max_advance: how far from the printer’s cell left do we + advance to begin the next character. + + + + + + + +script + + + +Name of the script (Latin, Common, Greek, Cyrillic, Han, null). + + + + + +other_case + + + +The Unichar ID of the other case version of this character + (upper or lower). + + + + + +direction + + + +The Unicode BiDi direction of this character, as defined by + ICU’s enum UCharDirection. 
(0 = Left to Right, 1 = Right to Left, + 2 = European Number…) + + + + + +mirror + + + +The Unichar ID of the BiDirectional mirror of this character. + For example the mirror of open paren is close paren, but Latin Capital C + has no mirror, so it remains a Latin Capital C. + + + + + +normed_form + + + +The UTF-8 representation of a "normalized form" of this unichar + for the purpose of blaming a module for errors given ground truth text. + For instance, a left or right single quote may normalize to an ASCII quote. + + + + + + +EXAMPLE (v2) +; 10 Common 46 +b 3 Latin 59 +W 5 Latin 40 +7 8 Common 66 += 0 Common 93 +";" is a punctuation character. Its properties are thus represented by the +binary number 10000 (10 in hexadecimal). +"b" is an alphabetic character and a lower case character. Its properties are +thus represented by the binary number 00011 (3 in hexadecimal). +"W" is an alphabetic character and an upper case character. Its properties are +thus represented by the binary number 00101 (5 in hexadecimal). +"7" is just a digit. Its properties are thus represented by the binary number +01000 (8 in hexadecimal). +"=" is not punctuation nor a digit nor an alphabetic character. Its properties +are thus represented by the binary number 00000 (0 in hexadecimal). +Japanese or Chinese alphabetic character properties are represented by the +binary number 00001 (1 in hexadecimal): they are alphabetic, but neither +upper nor lower case. + + +EXAMPLE (v3.02) +110 +NULL 0 NULL 0 +N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N +Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y +1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1 +9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9 +a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a +. . . + + +CAVEATS +Although the unicharset reader maintains the ability to read unicharsets +of older formats and will assign default values to missing fields, +the accuracy will be degraded. 
+Further, most other data files are indexed by the unicharset file, +so changing it without re-generating the others is likely to have dire +consequences. + + +HISTORY +The unicharset format first appeared with Tesseract 2.00, which was the +first version to support languages other than English. The unicharset file +contained only the first two fields, and the "ispunctuation" property was +absent (punctuation was regarded as "0", as "=" is in the above example. + + +SEE ALSO +tesseract(1), combine_tessdata(1), unicharset_extractor(1) +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract + + +AUTHOR +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present). + + diff --git a/doc/unicharset_extractor.1.asc b/doc/unicharset_extractor.1.asc index c972783a8e..bde21ab3ba 100644 --- a/doc/unicharset_extractor.1.asc +++ b/doc/unicharset_extractor.1.asc @@ -11,9 +11,9 @@ SYNOPSIS DESCRIPTION ----------- -Tesseract needs to know the set of possible characters it can output. -To generate the unicharset data file, use the unicharset_extractor -program on the same training pages bounding box files as used for +Tesseract needs to know the set of possible characters it can output. +To generate the unicharset data file, use the unicharset_extractor +program on the same training pages bounding box files as used for clustering: unicharset_extractor fontfile_1.box fontfile_2.box ... @@ -21,19 +21,19 @@ clustering: The unicharset will be put into the file 'dir/unicharset', or simply './unicharset' if no output directory is provided. -Tesseract also needs to have access to character properties isalpha, -isdigit, isupper, islower, ispunctuation. all of this auxilury data +Tesseract also needs to have access to character properties isalpha, +isdigit, isupper, islower, ispunctuation. all of this auxilury data and more is encoded in this file. 
(See unicharset(5)) -If your system supports the wctype functions, these values will be set -automatically by unicharset_extractor and there is no need to edit the -unicharset file. On some older systems (eg Windows 95), the unicharset +If your system supports the wctype functions, these values will be set +automatically by unicharset_extractor and there is no need to edit the +unicharset file. On some older systems (eg Windows 95), the unicharset file must be edited by hand to add these property description codes. -*NOTE* The unicharset file must be regenerated whenever inttemp, normproto -and pffmtable are generated (i.e. they must all be recreated when the box -file is changed) as they have to be in sync. This is made easier than in -previous versions by running unicharset_extractor before mftraining and +*NOTE* The unicharset file must be regenerated whenever inttemp, normproto +and pffmtable are generated (i.e. they must all be recreated when the box +file is changed) as they have to be in sync. This is made easier than in +previous versions by running unicharset_extractor before mftraining and cntraining, and giving the unicharset to mftraining. SEE ALSO diff --git a/doc/unicharset_extractor.1.html b/doc/unicharset_extractor.1.html index a6ac9e898b..6fdeb5e953 100644 --- a/doc/unicharset_extractor.1.html +++ b/doc/unicharset_extractor.1.html @@ -1,815 +1,815 @@ - - - - - -UNICHARSET_EXTRACTOR(1) - - - - - -
-
-

SYNOPSIS

-
-

unicharset_extractor [-D dir] FILE

-
-
-
-

DESCRIPTION

-
-

Tesseract needs to know the set of possible characters it can output. -To generate the unicharset data file, use the unicharset_extractor -program on the same training pages bounding box files as used for -clustering:

-
-
-
unicharset_extractor fontfile_1.box fontfile_2.box ...
-
-

The unicharset will be put into the file dir/unicharset, or simply -./unicharset if no output directory is provided.

-

Tesseract also needs to have access to character properties isalpha, -isdigit, isupper, islower, ispunctuation. All of this auxiliary data -and more is encoded in this file. (See unicharset(5))

-

If your system supports the wctype functions, these values will be set -automatically by unicharset_extractor and there is no need to edit the -unicharset file. On some older systems (e.g. Windows 95), the unicharset -file must be edited by hand to add these property description codes.

-

NOTE The unicharset file must be regenerated whenever inttemp, normproto -and pffmtable are generated (i.e. they must all be recreated when the box -file is changed) as they have to be in sync. This is made easier than in -previous versions by running unicharset_extractor before mftraining and -cntraining, and giving the unicharset to mftraining.

-
-
-
-

SEE ALSO

- -
-
-

HISTORY

-
-

unicharset_extractor first appeared in Tesseract 2.00.

-
-
-
-

COPYING

-
-

Copyright (C) 2006, Google Inc. -Licensed under the Apache License, Version 2.0

-
-
-
-

AUTHOR

-
-

The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present).

-
-
-
-

- - - + + + + + +UNICHARSET_EXTRACTOR(1) + + + + + +
+
+

SYNOPSIS

+
+

unicharset_extractor [-D dir] FILE

+
+
+
+

DESCRIPTION

+
+

Tesseract needs to know the set of possible characters it can output. +To generate the unicharset data file, use the unicharset_extractor +program on the same training pages bounding box files as used for +clustering:

+
+
+
unicharset_extractor fontfile_1.box fontfile_2.box ...
+
+

The unicharset will be put into the file dir/unicharset, or simply +./unicharset if no output directory is provided.

+

Tesseract also needs to have access to character properties isalpha, +isdigit, isupper, islower, ispunctuation. All of this auxiliary data +and more is encoded in this file. (See unicharset(5))

+

If your system supports the wctype functions, these values will be set +automatically by unicharset_extractor and there is no need to edit the +unicharset file. On some older systems (e.g. Windows 95), the unicharset +file must be edited by hand to add these property description codes.

+

NOTE The unicharset file must be regenerated whenever inttemp, normproto +and pffmtable are generated (i.e. they must all be recreated when the box +file is changed) as they have to be in sync. This is made easier than in +previous versions by running unicharset_extractor before mftraining and +cntraining, and giving the unicharset to mftraining.

+
+
+
+

SEE ALSO

+ +
+
+

HISTORY

+
+

unicharset_extractor first appeared in Tesseract 2.00.

+
+
+
+

COPYING

+
+

Copyright (C) 2006, Google Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/doc/unicharset_extractor.1.xml b/doc/unicharset_extractor.1.xml index bea4d1e16e..45087a8c64 100644 --- a/doc/unicharset_extractor.1.xml +++ b/doc/unicharset_extractor.1.xml @@ -1,63 +1,63 @@ - - - - - - - UNICHARSET_EXTRACTOR(1) - - -unicharset_extractor -1 -  -  - - - unicharset_extractor - extract unicharset from Tesseract boxfiles - - -unicharset_extractor [-D dir] FILE - - -DESCRIPTION -Tesseract needs to know the set of possible characters it can output. -To generate the unicharset data file, use the unicharset_extractor -program on the same training pages bounding box files as used for -clustering: -unicharset_extractor fontfile_1.box fontfile_2.box ... -The unicharset will be put into the file dir/unicharset, or simply -./unicharset if no output directory is provided. -Tesseract also needs to have access to character properties isalpha, -isdigit, isupper, islower, ispunctuation. all of this auxilury data -and more is encoded in this file. (See unicharset(5)) -If your system supports the wctype functions, these values will be set -automatically by unicharset_extractor and there is no need to edit the -unicharset file. On some older systems (eg Windows 95), the unicharset -file must be edited by hand to add these property description codes. -NOTE The unicharset file must be regenerated whenever inttemp, normproto -and pffmtable are generated (i.e. they must all be recreated when the box -file is changed) as they have to be in sync. This is made easier than in -previous versions by running unicharset_extractor before mftraining and -cntraining, and giving the unicharset to mftraining. - - -SEE ALSO -tesseract(1), unicharset(5) -https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract - - -HISTORY -unicharset_extractor first appeared in Tesseract 2.00. - - -COPYING -Copyright (C) 2006, Google Inc. 
-Licensed under the Apache License, Version 2.0 - - -AUTHOR -The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present). - - + + + + + + + UNICHARSET_EXTRACTOR(1) + + +unicharset_extractor +1 +  +  + + + unicharset_extractor + extract unicharset from Tesseract boxfiles + + +unicharset_extractor [-D dir] FILE + + +DESCRIPTION +Tesseract needs to know the set of possible characters it can output. +To generate the unicharset data file, use the unicharset_extractor +program on the same training pages bounding box files as used for +clustering: +unicharset_extractor fontfile_1.box fontfile_2.box ... +The unicharset will be put into the file dir/unicharset, or simply +./unicharset if no output directory is provided. +Tesseract also needs to have access to character properties isalpha, +isdigit, isupper, islower, ispunctuation. all of this auxilury data +and more is encoded in this file. (See unicharset(5)) +If your system supports the wctype functions, these values will be set +automatically by unicharset_extractor and there is no need to edit the +unicharset file. On some older systems (eg Windows 95), the unicharset +file must be edited by hand to add these property description codes. +NOTE The unicharset file must be regenerated whenever inttemp, normproto +and pffmtable are generated (i.e. they must all be recreated when the box +file is changed) as they have to be in sync. This is made easier than in +previous versions by running unicharset_extractor before mftraining and +cntraining, and giving the unicharset to mftraining. + + +SEE ALSO +tesseract(1), unicharset(5) +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract + + +HISTORY +unicharset_extractor first appeared in Tesseract 2.00. + + +COPYING +Copyright (C) 2006, Google Inc. 
+Licensed under the Apache License, Version 2.0 + + +AUTHOR +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present). + + diff --git a/doc/wordlist2dawg.1.html b/doc/wordlist2dawg.1.html index 58e5cab4fa..733570511a 100644 --- a/doc/wordlist2dawg.1.html +++ b/doc/wordlist2dawg.1.html @@ -1,820 +1,820 @@ - - - - - -WORDLIST2DAWG(1) - - - - - -
-
-

SYNOPSIS

-
-

wordlist2dawg WORDLIST DAWG lang.unicharset

-

wordlist2dawg -t WORDLIST DAWG lang.unicharset

-

wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset

-

wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset

-

wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

-
-
-
-

DESCRIPTION

-
-

wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph -(DAWG) for use with Tesseract. A DAWG is a compressed, space- and -time-efficient representation of a word list.

-
-
-
-

OPTIONS

-
-

-t - Verify that a given dawg file is equivalent to a given wordlist.

-

-r 1 - Reverse a word if it contains an RTL character.

-

-r 2 - Reverse all words.

-

-l <short> <long> - Produce a file with several dawgs in it, one each for words - of length <short>, <short+1>,… <long>

-
-
-
-

ARGUMENTS

-
-

WORDLIST - A plain text file in UTF-8, one word per line.

-

DAWG - The output DAWG to write.

-

lang.unicharset - The unicharset of the language. This is the unicharset - generated by mftraining(1).

-
-
-
-

SEE ALSO

-
-

tesseract(1), combine_tessdata(1), dawg2wordlist(1)

- -
-
-
-

COPYING

-
-

Copyright (C) 2006 Google, Inc. -Licensed under the Apache License, Version 2.0

-
-
-
-

AUTHOR

-
-

The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present).

-
-
-
-

- - - + + + + + +WORDLIST2DAWG(1) + + + + + +
+
+

SYNOPSIS

+
+

wordlist2dawg WORDLIST DAWG lang.unicharset

+

wordlist2dawg -t WORDLIST DAWG lang.unicharset

+

wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset

+

wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset

+

wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

+
+
+
+

DESCRIPTION

+
+

wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph +(DAWG) for use with Tesseract. A DAWG is a compressed, space and time +efficient representation of a word list.

+
+
+
+

OPTIONS

+
+

-t + Verify that a given dawg file is equivalent to a given wordlist.

+

-r 1 + Reverse a word if it contains an RTL character.

+

-r 2 + Reverse all words.

+

-l <short> <long> + Produce a file with several dawgs in it, one each for words + of length <short>, <short+1>,… <long>

+
+
+
+

ARGUMENTS

+
+

WORDLIST + A plain text file in UTF-8, one word per line.

+

DAWG + The output DAWG to write.

+

lang.unicharset + The unicharset of the language. This is the unicharset + generated by mftraining(1).

+
+
+
+

SEE ALSO

+
+

tesseract(1), combine_tessdata(1), dawg2wordlist(1)

+ +
+
+
+

COPYING

+
+

Copyright (C) 2006 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/doc/wordlist2dawg.1.xml b/doc/wordlist2dawg.1.xml index 907d3a574d..bad256fe70 100644 --- a/doc/wordlist2dawg.1.xml +++ b/doc/wordlist2dawg.1.xml @@ -1,69 +1,69 @@ - - - - - - - WORDLIST2DAWG(1) - - -wordlist2dawg -1 -  -  - - - wordlist2dawg - convert a wordlist to a DAWG for Tesseract - - -wordlist2dawg WORDLIST DAWG lang.unicharset -wordlist2dawg -t WORDLIST DAWG lang.unicharset -wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset -wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset -wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset - - -DESCRIPTION -wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph -(DAWG) for use with Tesseract. A DAWG is a compressed, space and time -efficient representation of a word list. - - -OPTIONS --t - Verify that a given dawg file is equivalent to a given wordlist. --r 1 - Reverse a word if it contains an RTL character. --r 2 - Reverse all words. --l <short> <long> - Produce a file with several dawgs in it, one each for words - of length <short>, <short+1>,… <long> - - -ARGUMENTS -WORDLIST - A plain text file in UTF-8, one word per line. -DAWG - The output DAWG to write. -lang.unicharset - The unicharset of the language. This is the unicharset - generated by mftraining(1). - - -SEE ALSO -tesseract(1), combine_tessdata(1), dawg2wordlist(1) -https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract - - -COPYING -Copyright (C) 2006 Google, Inc. -Licensed under the Apache License, Version 2.0 - - -AUTHOR -The Tesseract OCR engine was written by Ray Smith and his research groups -at Hewlett Packard (1985-1995) and Google (2006-present). 
- - + + + + + + + WORDLIST2DAWG(1) + + +wordlist2dawg +1 +  +  + + + wordlist2dawg + convert a wordlist to a DAWG for Tesseract + + +wordlist2dawg WORDLIST DAWG lang.unicharset +wordlist2dawg -t WORDLIST DAWG lang.unicharset +wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset +wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset +wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset + + +DESCRIPTION +wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph +(DAWG) for use with Tesseract. A DAWG is a compressed, space and time +efficient representation of a word list. + + +OPTIONS +-t + Verify that a given dawg file is equivalent to a given wordlist. +-r 1 + Reverse a word if it contains an RTL character. +-r 2 + Reverse all words. +-l <short> <long> + Produce a file with several dawgs in it, one each for words + of length <short>, <short+1>,… <long> + + +ARGUMENTS +WORDLIST + A plain text file in UTF-8, one word per line. +DAWG + The output DAWG to write. +lang.unicharset + The unicharset of the language. This is the unicharset + generated by mftraining(1). + + +SEE ALSO +tesseract(1), combine_tessdata(1), dawg2wordlist(1) +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract + + +COPYING +Copyright (C) 2006 Google, Inc. +Licensed under the Apache License, Version 2.0 + + +AUTHOR +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present). + +