Manpages (#1378)

* Add missing man pages
* Update lstmeval.1.asc
* Update combine_lang_model.1.asc
* Update lstmtraining.1.asc
* Update merge_unicharsets.1.asc
* Update set_unicharset_properties.1.asc
* Update text2image.1.asc
* Update text2image.1.asc
* Update combine_lang_model.1.asc
tesseract-ocr · Mar 12, 2018 · df58108
1 parent 79c6fa6
commit df58108
Showing 8 changed files with 594 additions and 5 deletions.
diff --git a/doc/Makefile.am b/doc/Makefile.am
@@ -2,11 +2,27 @@ if MAINTAINER_MODE
 
 asciidoc=asciidoc -d manpage
 
-man_MANS = cntraining.1  combine_tessdata.1  mftraining.1  tesseract.1 \
-	unicharset_extractor.1  wordlist2dawg.1 unicharambigs.5 \
-	unicharset.5 ambiguous_words.1 shapeclustering.1 dawg2wordlist.1
-
-EXTRA_DIST = $(man_MANS) Doxyfile
+man_MANS = \
+  ambiguous_words.1 \
+  classifier_tester.1 \
+  cntraining.1  \
+  combine_lang_model.1 \
+  combine_tessdata.1  \
+  dawg2wordlist.1 \
+  lstmeval.1 \
+  lstmtraining.1 \
+  merge_unicharsets.1 \
+  mftraining.1  \
+  set_unicharset_properties.1 \
+  shapeclustering.1 \
+  tesseract.1 \
+  text2image.1 \
+  unicharambigs.5 \
+  unicharset.5 \
+  unicharset_extractor.1  \
+  wordlist2dawg.1 
+
+  EXTRA_DIST = $(man_MANS) Doxyfile
 
 %: %.asc
 	$(asciidoc) -o $@ $<
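
The pattern rule above turns each `NAME.asc` source into the man page `NAME`. As a rough illustration, the command it expands to for one target can be sketched in Python; the helper name `asciidoc_command` is hypothetical and is not part of the build.

```python
from typing import List


def asciidoc_command(target: str) -> List[str]:
    """Return the command the `%: %.asc` rule runs for one man page target.

    `$@` is the target (e.g. "lstmeval.1") and `$<` is its prerequisite,
    the same name with ".asc" appended.
    """
    return ["asciidoc", "-d", "manpage", "-o", target, f"{target}.asc"]


if __name__ == "__main__":
    print(" ".join(asciidoc_command("lstmeval.1")))
```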

diff --git a/doc/classifier_tester.1.asc b/doc/classifier_tester.1.asc
@@ -0,0 +1,61 @@
+CLASSIFIER_TESTER(1)
+====================
+
+NAME
+----
+classifier_tester - test a character classifier for the *legacy tesseract* engine.
+
+SYNOPSIS
+--------
+*classifier_tester* -U 'unicharset_file' -F 'font_properties_file' -X 'xheights_file'  -classifier 'x' -lang 'lang' [-output_trainer trainer] *.tr
+
+DESCRIPTION
+-----------
+classifier_tester(1) runs Tesseract in a special mode.
+It takes a list of .tr files and tests a character classifier
+on data formatted as for training;
+the test data does not have to be the same as the training data.
+
+IN/OUT ARGUMENTS
+----------------
+
+a list of .tr files
+
+OPTIONS
+-------
+-lang 'lang'::
+	(Input) three-character language code; default value 'eng'.
+  
+-classifier 'x'::
+	(Input) One of "pruner", "full".
+  
+ 
+-U 'unicharset'::
+	(Input) The unicharset for the language.
+
+-F 'font_properties_file'::
+	(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
+
+	*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
+
+-X 'xheights_file'::
+	(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
+
+	*font_name* *xheight*
+
+-output_trainer 'trainer'::
+	(Output, Optional) Filename for output trainer.
+
+SEE ALSO
+--------
+tesseract(1)
+
+COPYING
+-------
+Copyright \(C) 2012 Google, Inc.
+Licensed under the Apache License, Version 2.0
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-present).
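
The font properties format described under `-F` in classifier_tester.1.asc (a font name followed by five 0/1 flags) can be read with a few lines of Python. This is only a sketch: it assumes fields are separated by whitespace and that the font name contains no spaces, and both the `FontProperties` type and the sample font name are made up for illustration.

```python
from typing import NamedTuple


class FontProperties(NamedTuple):
    name: str
    italic: bool
    bold: bool
    fixed_pitch: bool
    serif: bool
    fraktur: bool


def parse_font_properties_line(line: str) -> FontProperties:
    """Parse one line: font_name italic bold fixed_pitch serif fraktur."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    name, *flags = parts
    if any(flag not in ("0", "1") for flag in flags):
        raise ValueError(f"flags must be 0 or 1: {line!r}")
    return FontProperties(name, *(flag == "1" for flag in flags))


if __name__ == "__main__":
    # "timesitalic" is a hypothetical font name.
    print(parse_font_properties_line("timesitalic 1 0 0 1 0"))
```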
diff --git a/doc/combine_lang_model.1.asc b/doc/combine_lang_model.1.asc
@@ -0,0 +1,71 @@
+COMBINE_LANG_MODEL(1)
+=====================
+:doctype: manpage
+
+NAME
+----
+combine_lang_model - generate starter traineddata
+
+SYNOPSIS
+--------
+*combine_lang_model*  --input_unicharset 'filename' --script_dir 'dirname' --output_dir 'rootdir' --lang 'lang' [--lang_is_rtl] [--pass_through_recoder] [--words file --puncs file --numbers file]
+
+DESCRIPTION
+-----------
+combine_lang_model(1) generates a starter traineddata file that can be used to train an LSTM-based neural network model. It takes as input a unicharset and an optional set of wordlists. It eliminates the need to run set_unicharset_properties(1) and wordlist2dawg(1), to generate the recoder (unicode compressor) by some other means (no separate binary exists for that step), and finally to run combine_tessdata(1).
+
+OPTIONS
+-------
+'--lang  LANG'::
+	The language to use.
+	Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES in tesseract(1))
+
+'--script_dir  PATH'::   
+  Directory name for input script unicharsets. It should point to the location of langdata (github repo) directory.  (type:string default:)
+
+'--input_unicharset  FILE':: 
+  Unicharset to complete and use in encoding. It can be a hand-created file with incomplete fields. Its basic and script properties will be set before it is used.  (type:string default:)
+
+'--lang_is_rtl  BOOL'::
+  True if the language being processed is written right-to-left (e.g. Arabic/Hebrew). (type:bool default:false)
+
+'--pass_through_recoder BOOL'::
+  If true, the recoder is a simple pass-through of the unicharset. Otherwise, the recoder may compress the unicharset by encoding Hangul in Jamos, decomposing multi-unicode symbols into sequences of unicodes, and encoding Han using the data in the radical_table_data, which must be the content of the file langdata/radical-stroke.txt. (type:bool default:false)
+
+'--version_str  STRING':: 
+  An arbitrary version label to add to traineddata file  (type:string default:)
+
+'--words  FILE'::   
+  (Optional) File listing words to use for the system dictionary  (type:string default:)
+
+'--numbers  FILE'::   
+  (Optional) File listing number patterns  (type:string default:)
+
+'--puncs  FILE'::   
+  (Optional) File listing punctuation patterns. The words/puncs/numbers lists may be all empty. If any are non-empty then puncs must be non-empty.  (type:string default:)
+
+'--output_dir   PATH'::   
+  Root directory for output files. Output files will be written to <output_dir>/<lang>/<lang>.*  (type:string default:)
+
+HISTORY
+-------
+combine_lang_model(1) was first made available for Tesseract 4.00.00alpha.
+
+RESOURCES
+---------
+Main web site: <https://github.com/tesseract-ocr> +
+Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>
+
+SEE ALSO
+--------
+tesseract(1)
+
+COPYING
+-------
+Copyright \(C) 2012 Google, Inc.
+Licensed under the Apache License, Version 2.0
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-present).
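
Two rules stated in combine_lang_model.1.asc lend themselves to a quick check before running the tool: the words/puncs/numbers lists may all be empty, but if any is non-empty then puncs must be non-empty, and output goes to `<output_dir>/<lang>/<lang>.*`. The Python sketch below only mirrors those two statements; the function names are hypothetical and nothing here calls the real binary.

```python
from pathlib import Path
from typing import Sequence


def wordlists_ok(words: Sequence[str], puncs: Sequence[str],
                 numbers: Sequence[str]) -> bool:
    """All three lists may be empty; otherwise puncs must be non-empty."""
    any_non_empty = bool(words) or bool(puncs) or bool(numbers)
    return (not any_non_empty) or bool(puncs)


def starter_output_prefix(output_dir: str, lang: str) -> Path:
    """Common prefix of the output files: <output_dir>/<lang>/<lang>."""
    return Path(output_dir) / lang / lang


if __name__ == "__main__":
    print(wordlists_ok(["the", "and"], [], []))
    print(starter_output_prefix("train_output_dir", "eng"))
```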
diff --git a/doc/lstmeval.1.asc b/doc/lstmeval.1.asc
@@ -0,0 +1,55 @@
+LSTMEVAL(1)
+===========
+:doctype: manpage
+
+NAME
+----
+lstmeval - Evaluation program for LSTM-based networks. 
+
+SYNOPSIS
+--------
+*lstmeval* --model 'lang.lstm|langtrain_checkpoint|pluscharsN.NNN_NN.checkpoint' [--traineddata lang/lang.traineddata] --eval_listfile 'lang.eval_files.txt' [--verbosity N] [--max_image_MB NNNN]
+
+DESCRIPTION
+-----------
+lstmeval(1) evaluates LSTM-based networks. Either a recognition model or a training checkpoint can be given as input for evaluation along with a list of lstmf files. If evaluating a training checkpoint, '--traineddata' should also be specified. 
+
+OPTIONS
+-------
+'--model  FILE'::
+  Name of model file (training or recognition)  (type:string default:)
+
+'--traineddata  FILE'::
+  If model is a training checkpoint, then traineddata must be the traineddata file that was given to the trainer  (type:string default:)
+
+'--eval_listfile  FILE'::
+  File listing sample files in lstmf training format.  (type:string default:)
+
+'--max_image_MB  INT'::
+  Max memory to use for images.  (type:int default:2000)
+
+'--verbosity  INT'::
+  Amount of diagnostic information to output (0-2).  (type:int default:1)
+
+HISTORY
+-------
+lstmeval(1) was first made available for Tesseract 4.00.00alpha.
+
+RESOURCES
+---------
+Main web site: <https://github.com/tesseract-ocr> +
+Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>
+
+SEE ALSO
+--------
+tesseract(1)
+
+COPYING
+-------
+Copyright \(C) 2012 Google, Inc.
+Licensed under the Apache License, Version 2.0
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-present).
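
lstmeval.1.asc says that '--traineddata' should be given when the model is a training checkpoint. A small wrapper can enforce that before the command is launched. The sketch below assumes a checkpoint can be recognised by a file name ending in "checkpoint", as in the synopsis examples; that heuristic and the helper name `lstmeval_args` are assumptions, and only flags listed in the man page are used.

```python
from typing import List, Optional


def lstmeval_args(model: str, eval_listfile: str,
                  traineddata: Optional[str] = None,
                  verbosity: Optional[int] = None) -> List[str]:
    """Build an lstmeval argument list, requiring traineddata for checkpoints."""
    # Heuristic (assumption): checkpoint names end with "checkpoint".
    is_checkpoint = model.endswith("checkpoint")
    if is_checkpoint and traineddata is None:
        raise ValueError("a training checkpoint needs --traineddata")
    args = ["lstmeval", "--model", model]
    if traineddata is not None:
        args += ["--traineddata", traineddata]
    args += ["--eval_listfile", eval_listfile]
    if verbosity is not None:
        args += ["--verbosity", str(verbosity)]
    return args


if __name__ == "__main__":
    print(lstmeval_args("lang.lstm", "lang.eval_files.txt"))
```

The resulting list can be handed to `subprocess.run` if the binary is installed.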
diff --git a/doc/lstmtraining.1.asc b/doc/lstmtraining.1.asc
@@ -0,0 +1,117 @@
+LSTMTRAINING(1)
+===============
+:doctype: manpage
+
+NAME
+----
+lstmtraining - Training program for LSTM-based networks.
+
+SYNOPSIS
+--------
+*lstmtraining*  
+  --continue_from  'train_output_dir/continue_from_lang.lstm'
+  --old_traineddata 'bestdata_dir/continue_from_lang.traineddata' 
+  --traineddata   'train_output_dir/lang/lang.traineddata' 
+  --max_iterations 'NNN' 
+  --debug_interval '0|-1' 
+  --train_listfile 'train_output_dir/lang.training_files.txt'
+  --model_output  'train_output_dir/newlstmmodel'
+
+DESCRIPTION
+-----------
+lstmtraining(1) trains LSTM-based networks using a list of lstmf files and a starter traineddata file as the main input. Training from scratch is not recommended for users. Fine-tuning (example command shown in the synopsis above) or replacing a layer can be used instead. Different options apply to different types of training. Read the https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00[Training Wiki page] for details.
+
+OPTIONS
+-------
+
+'--debug_interval  '::
+  How often to display the alignment.  (type:int default:0)
+
+'--net_mode  '::
+  Controls network behavior.  (type:int default:192)
+
+'--perfect_sample_delay  '::
+  How many imperfect samples between perfect ones.  (type:int default:0)
+
+'--max_image_MB  '::
+  Max memory to use for images.  (type:int default:6000)
+
+'--append_index  '::
+  Index in continue_from Network at which to attach the new network defined by net_spec  (type:int default:-1)
+
+'--max_iterations  '::
+  If set, exit after this many iterations  (type:int default:0)
+
+'--target_error_rate  '::
+  Final error rate in percent.  (type:double default:0.01)
+
+'--weight_range  '::
+  Range of initial random weights.  (type:double default:0.1)
+
+'--learning_rate  '::
+  Weight factor for new deltas.  (type:double default:0.001)
+
+'--momentum  '::
+  Decay factor for repeating deltas.  (type:double default:0.5)
+
+'--adam_beta  '::
+  Decay factor for repeating deltas.  (type:double default:0.999)
+
+'--stop_training  '::
+  Just convert the training model to a runtime model.  (type:bool default:false)
+
+'--convert_to_int  '::
+  Convert the recognition model to an integer model.  (type:bool default:false)
+
+'--sequential_training  '::
+  Use the training files sequentially instead of round-robin.  (type:bool default:false)
+
+'--debug_network  '::
+  Get info on distribution of weight values  (type:bool default:false)
+
+'--randomly_rotate  '::
+  Train OSD and randomly turn training samples upside-down  (type:bool default:false)
+
+'--net_spec  '::
+  Network specification  (type:string default:)
+
+'--continue_from  '::
+  Existing model to extend  (type:string default:)
+
+'--model_output  '::
+  Basename for output models  (type:string default:lstmtrain)
+
+'--train_listfile  '::
+  File listing training files in lstmf training format.  (type:string default:)
+
+'--eval_listfile  '::
+  File listing eval files in lstmf training format.  (type:string default:)
+
+'--traineddata  '::
+  Starter traineddata with combined Dawgs/Unicharset/Recoder for language model  (type:string default:)
+
+'--old_traineddata  '::
+  When changing the character set, this specifies the traineddata with the old character set that is to be replaced  (type:string default:)
+
+HISTORY
+-------
+lstmtraining(1) was first made available for Tesseract 4.00.00alpha.
+
+RESOURCES
+---------
+Main web site: <https://github.com/tesseract-ocr> +
+Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>
+   
+SEE ALSO
+--------
+tesseract(1)
+
+COPYING
+-------
+Copyright \(C) 2012 Google, Inc.
+Licensed under the Apache License, Version 2.0
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-present).
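
Both '--train_listfile' and '--eval_listfile' take a file that lists lstmf files. A minimal Python sketch for producing such a list from a directory follows. It assumes the list file holds one path per line, which the man page does not spell out, and the helper name `write_listfile` is hypothetical.

```python
from pathlib import Path


def write_listfile(lstmf_dir: str, listfile: str) -> int:
    """Write the *.lstmf paths found in lstmf_dir to listfile, one per line.

    Returns the number of paths written.
    """
    paths = sorted(Path(lstmf_dir).glob("*.lstmf"))
    Path(listfile).write_text("".join(f"{path}\n" for path in paths))
    return len(paths)


if __name__ == "__main__":
    count = write_listfile("train_output_dir", "lang.training_files.txt")
    print(f"listed {count} lstmf files")
```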
diff --git a/doc/merge_unicharsets.1.asc b/doc/merge_unicharsets.1.asc
@@ -0,0 +1,51 @@
+MERGE_UNICHARSETS(1)
+====================
+:doctype: manpage
+
+NAME
+----
+merge_unicharsets - Simple tool to merge two or more unicharsets.
+
+SYNOPSIS
+--------
+*merge_unicharsets* 'unicharset-in-1' ... 'unicharset-in-n' 'unicharset-out'
+
+DESCRIPTION
+-----------
+merge_unicharsets(1) is a simple tool to merge two or more unicharsets.
+It can be used to create a combined unicharset for a script-level engine,
+such as the new Latin or Devanagari engines.
+
+IN/OUT ARGUMENTS
+----------------
+'unicharset-in-1'::
+	(Input) The name of the first unicharset file to be merged.
+
+'unicharset-in-n'::
+	(Input) The name of the nth unicharset file to be merged.
+
+'unicharset-out'::
+	(Output) The name of the merged unicharset file.
+
+HISTORY
+-------
+merge_unicharsets(1) was first made available for Tesseract 4.00.00alpha.
+
+RESOURCES
+---------
+Main web site: <https://github.com/tesseract-ocr> +
+Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>
+
+SEE ALSO
+--------
+tesseract(1)
+
+COPYING
+-------
+Copyright \(C) 2012 Google, Inc.
+Licensed under the Apache License, Version 2.0
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-present).