Manpages (#1378)
* Add missing man pages

* Update lstmeval.1.asc

* Update combine_lang_model.1.asc

* Update lstmtraining.1.asc

* Update merge_unicharsets.1.asc

* Update set_unicharset_properties.1.asc

* Update text2image.1.asc

* Update text2image.1.asc

* Update combine_lang_model.1.asc
Shreeshrii authored and zdenop committed Mar 12, 2018
1 parent 79c6fa6 commit df58108
Showing 8 changed files with 594 additions and 5 deletions.
26 changes: 21 additions & 5 deletions doc/Makefile.am
@@ -2,11 +2,27 @@ if MAINTAINER_MODE

asciidoc=asciidoc -d manpage

-man_MANS = cntraining.1 combine_tessdata.1 mftraining.1 tesseract.1 \
-unicharset_extractor.1 wordlist2dawg.1 unicharambigs.5 \
-unicharset.5 ambiguous_words.1 shapeclustering.1 dawg2wordlist.1
-
-EXTRA_DIST = $(man_MANS) Doxyfile
+man_MANS = \
+ambiguous_words.1 \
+classifier_tester.1 \
+cntraining.1 \
+combine_lang_model.1 \
+combine_tessdata.1 \
+dawg2wordlist.1 \
+lstmeval.1 \
+lstmtraining.1 \
+merge_unicharsets.1 \
+mftraining.1 \
+set_unicharset_properties.1 \
+shapeclustering.1 \
+tesseract.1 \
+text2image.1 \
+unicharambigs.5 \
+unicharset.5 \
+unicharset_extractor.1 \
+wordlist2dawg.1
+
+EXTRA_DIST = $(man_MANS) Doxyfile

%: %.asc
$(asciidoc) -o $@ $<
61 changes: 61 additions & 0 deletions doc/classifier_tester.1.asc
@@ -0,0 +1,61 @@
CLASSIFIER_TESTER(1)
====================

NAME
----
classifier_tester - tests a character classifier of the *legacy tesseract* engine.

SYNOPSIS
--------
*classifier_tester* -U 'unicharset_file' -F 'font_properties_file' -X 'xheights_file' -classifier 'x' -lang 'lang' [-output_trainer 'trainer'] *.tr

DESCRIPTION
-----------
classifier_tester(1) runs Tesseract in a special mode.
It takes a list of .tr files and tests a character classifier
on data formatted as for training;
the test data does not have to be the same as the training data.

IN/OUT ARGUMENTS
----------------
a list of .tr files

OPTIONS
-------
-lang 'lang'::
(Input) three character language code; default value 'eng'.
-classifier 'x'::
(Input) One of "pruner", "full".
-U 'unicharset'::
(Input) The unicharset for the language.
-F 'font_properties_file'::
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*

-X 'xheights_file'::
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]

*font_name* *xheight*

-output_trainer 'trainer'::
(Output, Optional) Filename for output trainer.
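
The two line formats described for '-F' and '-X' can be checked with a small parser. The following is a minimal Python sketch, not part of Tesseract: the helper names and the sample font name `ExampleFont` are hypothetical, and it assumes that font names contain no whitespace and that the x height is an integer pixel count.

```python
from typing import NamedTuple, Tuple


class FontProperties(NamedTuple):
    font_name: str
    italic: bool
    bold: bool
    fixed_pitch: bool
    serif: bool
    fraktur: bool


def parse_font_properties_line(line: str) -> FontProperties:
    """Parse one 'font_name italic bold fixed_pitch serif fraktur' line."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    name, *flags = fields
    # Every field other than the font name must be 0 or 1.
    if any(flag not in ("0", "1") for flag in flags):
        raise ValueError(f"flags must be 0 or 1: {line!r}")
    return FontProperties(name, *(flag == "1" for flag in flags))


def parse_xheights_line(line: str) -> Tuple[str, int]:
    """Parse one 'font_name xheight' line (assumes an integer x height)."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
    return fields[0], int(fields[1])


if __name__ == "__main__":
    print(parse_font_properties_line("ExampleFont 0 1 0 1 0"))
    print(parse_xheights_line("ExampleFont 45"))
```

Running such a check over both files before calling classifier_tester(1) can catch malformed lines early.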

SEE ALSO
--------
tesseract(1)

COPYING
-------
Copyright \(C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).
71 changes: 71 additions & 0 deletions doc/combine_lang_model.1.asc
@@ -0,0 +1,71 @@
COMBINE_LANG_MODEL(1)
=====================
:doctype: manpage

NAME
----
combine_lang_model - generate starter traineddata

SYNOPSIS
--------
*combine_lang_model* --input_unicharset 'filename' --script_dir 'dirname' --output_dir 'rootdir' --lang 'lang' [--lang_is_rtl] [--pass_through_recoder] [--words file --puncs file --numbers file]

DESCRIPTION
-----------
combine_lang_model(1) generates a starter traineddata file that can be used to train an LSTM-based neural network model. It takes as input a unicharset and an optional set of wordlists. It eliminates the need to run set_unicharset_properties(1), wordlist2dawg(1), a separate tool to generate the recoder (Unicode compressor; no such standalone binary exists), and finally combine_tessdata(1).

OPTIONS
-------
'--lang lang'::
The language to use.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES in tesseract(1).)

'--script_dir PATH'::
Directory name for input script unicharsets. It should point to the location of langdata (github repo) directory. (type:string default:)

'--input_unicharset FILE'::
Unicharset to complete and use in encoding. It can be a hand-created file with incomplete fields. Its basic and script properties will be set before it is used. (type:string default:)

'--lang_is_rtl BOOL'::
True if language being processed is written right-to-left (e.g. Arabic/Hebrew). (type:bool default:false)

'--pass_through_recoder BOOL'::
If true, the recoder is a simple pass-through of the unicharset. Otherwise, potentially a compression of it by encoding Hangul in Jamos, decomposing multi-unicode symbols into sequences of unicodes, and encoding Han using the data in the radical_table_data, which must be the content of the file: langdata/radical-stroke.txt. (type:bool default:false)

'--version_str STRING'::
An arbitrary version label to add to traineddata file (type:string default:)

'--words FILE'::
(Optional) File listing words to use for the system dictionary (type:string default:)

'--numbers FILE'::
(Optional) File listing number patterns (type:string default:)

'--puncs FILE'::
(Optional) File listing punctuation patterns. The words/puncs/numbers lists may be all empty. If any are non-empty then puncs must be non-empty. (type:string default:)

'--output_dir PATH'::
Root directory for output files. Output files will be written to <output_dir>/<lang>/<lang>.* (type:string default:)
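
The interplay of the wordlist options and the output location can be sketched in Python. This is an illustrative helper only, not part of Tesseract: the function names are hypothetical, the punctuation rule is approximated by checking whether a file argument was given (the man page states it in terms of list contents), and the `.traineddata` suffix is one assumed instance of the `<lang>.*` output pattern.

```python
from typing import List, Optional


def combine_lang_model_args(input_unicharset: str, script_dir: str,
                            output_dir: str, lang: str,
                            words: Optional[str] = None,
                            puncs: Optional[str] = None,
                            numbers: Optional[str] = None,
                            lang_is_rtl: bool = False) -> List[str]:
    """Build a combine_lang_model argument list (hypothetical helper)."""
    # Approximation of: "If any are non-empty then puncs must be non-empty."
    if (words or numbers) and not puncs:
        raise ValueError("--puncs is required when --words or --numbers is given")
    args = ["combine_lang_model",
            "--input_unicharset", input_unicharset,
            "--script_dir", script_dir,
            "--output_dir", output_dir,
            "--lang", lang]
    for flag, path in (("--words", words), ("--puncs", puncs),
                       ("--numbers", numbers)):
        if path:
            args += [flag, path]
    if lang_is_rtl:
        args.append("--lang_is_rtl")
    return args


def starter_traineddata_path(output_dir: str, lang: str) -> str:
    """Assumed location of the starter file: <output_dir>/<lang>/<lang>.traineddata."""
    return f"{output_dir}/{lang}/{lang}.traineddata"


if __name__ == "__main__":
    print(combine_lang_model_args("unicharset", "langdata", "out", "eng"))
    print(starter_traineddata_path("out", "eng"))
```

The resulting list can be passed to `subprocess.run(args, check=True)` on a system where combine_lang_model(1) is installed.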

HISTORY
-------
combine_lang_model(1) was first made available for tesseract4.00.00alpha.

RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>

SEE ALSO
--------
tesseract(1)

COPYING
-------
Copyright \(C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).
55 changes: 55 additions & 0 deletions doc/lstmeval.1.asc
@@ -0,0 +1,55 @@
LSTMEVAL(1)
===========
:doctype: manpage

NAME
----
lstmeval - Evaluation program for LSTM-based networks.

SYNOPSIS
--------
*lstmeval* --model 'lang.lstm|langtrain_checkpoint|pluscharsN.NNN_NN.checkpoint' [--traineddata lang/lang.traineddata] --eval_listfile 'lang.eval_files.txt' [--verbosity N] [--max_image_MB NNNN]

DESCRIPTION
-----------
lstmeval(1) evaluates LSTM-based networks. Either a recognition model or a training checkpoint can be given as input for evaluation along with a list of lstmf files. If evaluating a training checkpoint, '--traineddata' should also be specified.

OPTIONS
-------
'--model FILE'::
Name of model file (training or recognition) (type:string default:)

'--traineddata FILE'::
If model is a training checkpoint, then traineddata must be the traineddata file that was given to the trainer (type:string default:)

'--eval_listfile FILE'::
File listing sample files in lstmf training format. (type:string default:)

'--max_image_MB INT'::
Max memory to use for images. (type:int default:2000)

'--verbosity INT'::
Amount of diagnostic information to output (0-2). (type:int default:1)
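
The rule that a training checkpoint needs '--traineddata' can be expressed as a small argument builder. This is a hedged Python sketch, not part of Tesseract: the helper name is hypothetical, and it assumes checkpoint file names end in `checkpoint`, as the two checkpoint forms in the synopsis do.

```python
from typing import List, Optional


def lstmeval_args(model: str, eval_listfile: str,
                  traineddata: Optional[str] = None,
                  verbosity: Optional[int] = None,
                  max_image_mb: Optional[int] = None) -> List[str]:
    """Build an lstmeval argument list (hypothetical helper)."""
    # Assumption: a model whose name ends in "checkpoint" is a training checkpoint.
    if model.endswith("checkpoint") and traineddata is None:
        raise ValueError("--traineddata is needed when evaluating a checkpoint")
    args = ["lstmeval", "--model", model, "--eval_listfile", eval_listfile]
    if traineddata is not None:
        args += ["--traineddata", traineddata]
    if verbosity is not None:
        # The documented range for --verbosity is 0-2.
        if verbosity not in (0, 1, 2):
            raise ValueError("verbosity must be 0, 1 or 2")
        args += ["--verbosity", str(verbosity)]
    if max_image_mb is not None:
        args += ["--max_image_MB", str(max_image_mb)]
    return args


if __name__ == "__main__":
    print(lstmeval_args("eng_checkpoint", "eng.eval_files.txt",
                        traineddata="eng/eng.traineddata"))
```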

HISTORY
-------
lstmeval(1) was first made available for tesseract4.00.00alpha.

RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>

SEE ALSO
--------
tesseract(1)

COPYING
-------
Copyright \(C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).
117 changes: 117 additions & 0 deletions doc/lstmtraining.1.asc
@@ -0,0 +1,117 @@
LSTMTRAINING(1)
===============
:doctype: manpage

NAME
----
lstmtraining - Training program for LSTM-based networks.

SYNOPSIS
--------
*lstmtraining*
--continue_from 'train_output_dir/continue_from_lang.lstm'
--old_traineddata 'bestdata_dir/continue_from_lang.traineddata'
--traineddata 'train_output_dir/lang/lang.traineddata'
--max_iterations 'NNN'
--debug_interval '0|-1'
--train_listfile 'train_output_dir/lang.training_files.txt'
--model_output 'train_output_dir/newlstmmodel'

DESCRIPTION
-----------
lstmtraining(1) trains LSTM-based networks using a list of lstmf files and a starter traineddata file as the main input. Training from scratch is not recommended for users. Fine-tuning (example command shown in the synopsis above) or replacing a layer can be used instead. Different options apply to different types of training. Read the Training Wiki page <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00> for details.

OPTIONS
-------

'--debug_interval '::
How often to display the alignment. (type:int default:0)
'--net_mode '::
Controls network behavior. (type:int default:192)
'--perfect_sample_delay '::
How many imperfect samples between perfect ones. (type:int default:0)
'--max_image_MB '::
Max memory to use for images. (type:int default:6000)
'--append_index '::
Index in continue_from Network at which to attach the new network defined by net_spec (type:int default:-1)
'--max_iterations '::
If set, exit after this many iterations (type:int default:0)
'--target_error_rate '::
Final error rate in percent. (type:double default:0.01)
'--weight_range '::
Range of initial random weights. (type:double default:0.1)
'--learning_rate '::
Weight factor for new deltas. (type:double default:0.001)
'--momentum '::
Decay factor for repeating deltas. (type:double default:0.5)
'--adam_beta '::
Decay factor for repeating deltas squared. (type:double default:0.999)
'--stop_training '::
Just convert the training model to a runtime model. (type:bool default:false)
'--convert_to_int '::
Convert the recognition model to an integer model. (type:bool default:false)
'--sequential_training '::
Use the training files sequentially instead of round-robin. (type:bool default:false)
'--debug_network '::
Get info on distribution of weight values (type:bool default:false)
'--randomly_rotate '::
Train OSD and randomly turn training samples upside-down (type:bool default:false)
'--net_spec '::
Network specification (type:string default:)
'--continue_from '::
Existing model to extend (type:string default:)
'--model_output '::
Basename for output models (type:string default:lstmtrain)
'--train_listfile '::
File listing training files in lstmf training format. (type:string default:)
'--eval_listfile '::
File listing eval files in lstmf training format. (type:string default:)
'--traineddata '::
Starter traineddata with combined Dawgs/Unicharset/Recoder for language model (type:string default:)
'--old_traineddata '::
When changing the character set, this specifies the traineddata with the old character set that is to be replaced (type:string default:)
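
The fine-tuning command from the synopsis can be assembled programmatically. The following Python sketch is illustrative only: the helper name and the sample paths are hypothetical, and only flags listed above are used.

```python
from typing import List


def lstmtraining_finetune_args(continue_from: str, old_traineddata: str,
                               traineddata: str, train_listfile: str,
                               model_output: str, max_iterations: int,
                               debug_interval: int = 0) -> List[str]:
    """Build the fine-tuning argument list shown in the synopsis (hypothetical helper)."""
    if max_iterations < 0:
        raise ValueError("max_iterations must not be negative")
    return ["lstmtraining",
            "--continue_from", continue_from,
            "--old_traineddata", old_traineddata,
            "--traineddata", traineddata,
            "--max_iterations", str(max_iterations),
            "--debug_interval", str(debug_interval),
            "--train_listfile", train_listfile,
            "--model_output", model_output]


if __name__ == "__main__":
    # Sample paths are placeholders that mirror the synopsis.
    print(" ".join(lstmtraining_finetune_args(
        "train_output_dir/continue_from_lang.lstm",
        "bestdata_dir/continue_from_lang.traineddata",
        "train_output_dir/lang/lang.traineddata",
        "train_output_dir/lang.training_files.txt",
        "train_output_dir/newlstmmodel",
        max_iterations=400)))
```

Keeping the arguments in a list avoids shell quoting problems when the list is handed to `subprocess.run(args, check=True)`.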

HISTORY
-------
lstmtraining(1) was first made available for tesseract4.00.00alpha.

RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>

SEE ALSO
--------
tesseract(1)

COPYING
-------
Copyright \(C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).
51 changes: 51 additions & 0 deletions doc/merge_unicharsets.1.asc
@@ -0,0 +1,51 @@
MERGE_UNICHARSETS(1)
====================
:doctype: manpage

NAME
----
merge_unicharsets - Simple tool to merge two or more unicharsets.

SYNOPSIS
--------
*merge_unicharsets* 'unicharset-in-1' ... 'unicharset-in-n' 'unicharset-out'

DESCRIPTION
-----------
merge_unicharsets(1) is a simple tool to merge two or more unicharsets.
It could be used to create a combined unicharset for a script-level engine,
like the new Latin or Devanagari.

IN/OUT ARGUMENTS
----------------
'unicharset-in-1'::
(Input) The name of the first unicharset file to be merged.

'unicharset-in-n'::
(Input) The name of the nth unicharset file to be merged.

'unicharset-out'::
(Output) The name of the merged unicharset file.
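
The argument order (all inputs first, output last) can be captured in a short Python sketch. The helper name is hypothetical, and the minimum of two inputs follows the description above rather than a check of the tool's source.

```python
from typing import List, Sequence


def merge_unicharsets_args(inputs: Sequence[str], output: str) -> List[str]:
    """Build a merge_unicharsets argument list (hypothetical helper)."""
    # The description speaks of merging "two or more" unicharsets.
    if len(inputs) < 2:
        raise ValueError("need at least two input unicharsets")
    # Inputs come first; the merged output file is always the last argument.
    return ["merge_unicharsets", *inputs, output]


if __name__ == "__main__":
    print(merge_unicharsets_args(["a.unicharset", "b.unicharset"],
                                 "merged.unicharset"))
```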

HISTORY
-------
merge_unicharsets(1) was first made available for tesseract4.00.00alpha.

RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>

SEE ALSO
--------
tesseract(1)

COPYING
-------
Copyright \(C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).
