zdenop committed Feb 16, 2019
2 parents 4d8bbe2 + 48be357 commit e54c06f
Showing 13 changed files with 169 additions and 28 deletions.
3 changes: 2 additions & 1 deletion CMakeLists.txt
@@ -88,7 +88,7 @@ get_property(known_features GLOBAL PROPERTY CMAKE_CXX_KNOWN_FEATURES)
if(cxx_std_14 IN_LIST known_features)
set(CMAKE_CXX_STANDARD 14)
message("C++14 support enabled...")
else() # minimum requeired standard
else() # minimum required standard
set(CMAKE_CXX_STANDARD 11)
message("C++11 support enabled...")
endif()
@@ -269,6 +269,7 @@ set(tesseract_src ${tesseract_src}
src/api/hocrrenderer.cpp
src/api/lstmboxrenderer.cpp
src/api/pdfrenderer.cpp
src/api/wordstrboxrenderer.cpp
)

if (WIN32)
5 changes: 3 additions & 2 deletions configure.ac
@@ -395,7 +395,8 @@ AC_HEADER_STDBOOL
# ----------------------------------------
AC_CHECK_PROG([have_asciidoc], asciidoc, true, false)
if $have_asciidoc; then
AC_CHECK_PROG([have_xsltproc], xsltproc, true, false)
if $have_asciidoc && $have_xsltproc; then
AM_CONDITIONAL([ASCIIDOC], true)
else
AM_CONDITIONAL([ASCIIDOC], false)
@@ -505,7 +506,7 @@ AM_COND_IF([ASCIIDOC],
[
echo "This will also build the documentation."
], [
echo "Documentation will not be built because asciidoc is missing."
echo "Documentation will not be built because asciidoc or xsltproc is missing."
]
)
2 changes: 1 addition & 1 deletion doc/Makefile.am
@@ -37,7 +37,7 @@ html: ${man_MANS:%=%.html}
SUFFIXES = .asc .html

.asc:
asciidoc -b docbook -d manpage -o - $< | \
-asciidoc -b docbook -d manpage -o - $< | \
xsltproc --nonet $(man_xslt) -

.asc.html:
43 changes: 22 additions & 21 deletions doc/tesseract.1.asc
@@ -36,7 +36,7 @@ IN/OUT ARGUMENTS
The basename of the output file (to which the appropriate extension
will be appended). By default the output will be a text file
with `.txt` added to the basename unless there are one or more
'configfile' options which explicitly specify the desired output.
parameters set which explicitly specify the desired output.

'stdout'::
Instruction to send output data to standard output.
@@ -54,7 +54,7 @@ OPTIONS
Specify the location of user patterns file.

'-c configvar=value'::
Set value for control parameter. Multiple -c arguments are allowed.
Set value for parameter 'configvar'. Multiple -c arguments are allowed.

'-l lang'::
The language to use. If none is specified, English is assumed.
@@ -86,20 +86,21 @@ OPTIONS
3 = Default, based on what is available.

'configfile'::
The name of a config to use. A config is a plaintext file which
contains a list of variables and their values, one per line, with a
space separating variable from value. Interesting config files
include: +
* `alto` - Output in ALTO format (file extension `.xml`).
* `hocr` - Output in hOCR format (file extension `.hocr`).
* `pdf` - Output PDF (file extension `.pdf`).
* `tsv` - Output TSV (file extension `.tsv`).
* `txt` - Output plain text (file extension `.txt`).
* `get.images` - Write images.
* `logfile` - Write debug file `tesseract.log`.
* `lstm.train` - Used for LSTM training.
* `makebox` - Output box file.
* `quiet` - Write debug file to /dev/null.
The name of a config to use. A config is a plain text file which
contains a list of parameters and their values, one per line,
with a space separating parameter from value. +
Interesting config files include:

* `alto` - Output in ALTO format ('outputbase'`.xml`).
* `hocr` - Output in hOCR format ('outputbase'`.hocr`).
* `pdf` - Output PDF ('outputbase'`.pdf`).
* `tsv` - Output TSV ('outputbase'`.tsv`).
* `txt` - Output plain text ('outputbase'`.txt`).
* `get.images` - Write processed input images to file (`tessinput.tif`).
* `logfile` - Redirect debug messages to file (`tesseract.log`).
* `lstm.train` - Output files used by LSTM training ('outputbase'`.lstmf`).
* `makebox` - Write box file ('outputbase'`.box`).
* `quiet` - Redirect debug messages to /dev/null.

It is possible to select several config files, for example
`tesseract image.png demo hocr pdf txt` will create three output files
@@ -334,14 +335,14 @@ Tesseract 4 LSTM OCR engine.
CONFIG FILES AND AUGMENTING WITH USER DATA
------------------------------------------
Tesseract config files consist of lines with variable-value pairs (space
separated). The variables are documented as flags in the source code like
Tesseract config files consist of lines with parameter-value pairs (space
separated). The parameters are documented as flags in the source code like
the following one in tesseractclass.h:
STRING_VAR_H(tessedit_char_blacklist, "",
"Blacklist of chars not to recognize");
These variables may enable or disable various features of the engine, and
These parameters may enable or disable various features of the engine, and
may cause it to load (or not load) various data. For instance, let's suppose
you want to OCR in English, but suppress the normal dictionary and load an
alternative word list and an alternative list of patterns -- these two files
@@ -371,8 +372,8 @@ load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
Now, if you pass the word 'bazaar' as a trailing command line parameter
to Tesseract, Tesseract will not bother loading the system dictionary nor
Now, if you pass the word 'bazaar' as a 'configfile' to Tesseract,
Tesseract will not bother loading the system dictionary nor
the dictionary of frequent words and will load and use the eng.user-words
and eng.user-patterns files you provided. The former is a simple word list,
one per line. The format of the latter is documented in dict/trie.h
1 change: 1 addition & 0 deletions src/api/Makefile.am
@@ -37,6 +37,7 @@ libtesseract_api_la_SOURCES += altorenderer.cpp
libtesseract_api_la_SOURCES += hocrrenderer.cpp
libtesseract_api_la_SOURCES += lstmboxrenderer.cpp
libtesseract_api_la_SOURCES += pdfrenderer.cpp
libtesseract_api_la_SOURCES += wordstrboxrenderer.cpp
libtesseract_api_la_SOURCES += renderer.cpp

lib_LTLIBRARIES += libtesseract.la
10 changes: 9 additions & 1 deletion src/api/baseapi.h
@@ -630,7 +630,15 @@ class TESS_API TessBaseAPI {
* Returned string must be freed with the delete [] operator.
*/
char* GetBoxText(int page_number);


/**
* The recognized text is returned as a char* which is coded in the same
* format as a WordStr box file used in training.
* page_number is a 0-based page index that will appear in the box file.
* Returned string must be freed with the delete [] operator.
*/
char* GetWordStrBoxText(int page_number);

/**
* The recognized text is returned as a char* which is coded
* as UNLV format Latin-1 with specific reject and suspect codes.
11 changes: 11 additions & 0 deletions src/api/renderer.h
@@ -269,6 +269,17 @@ class TESS_API TessBoxTextRenderer : public TessResultRenderer {
virtual bool AddImageHandler(TessBaseAPI* api);
};

/**
* Renders tesseract output into a plain UTF-8 text string in WordStr format
*/
class TESS_API TessWordStrBoxRenderer : public TessResultRenderer {
public:
explicit TessWordStrBoxRenderer(const char* outputbase);

protected:
virtual bool AddImageHandler(TessBaseAPI* api);
};

#ifndef DISABLED_LEGACY_ENGINE

/**
14 changes: 14 additions & 0 deletions src/api/tesseractmain.cpp
@@ -524,6 +524,20 @@ static void PreloadRenderers(
}
}

api->GetBoolVariable("tessedit_create_wordstrbox", &b);
if (b) {
tesseract::TessWordStrBoxRenderer* renderer =
new tesseract::TessWordStrBoxRenderer(outputbase);
if (renderer->happy()) {
renderers->push_back(renderer);
} else {
delete renderer;
tprintf("Error, could not create WordStr BOX output file: %s\n",
strerror(errno));
error = true;
}
}

api->GetBoolVariable("tessedit_create_txt", &b);
if (b || (!error && renderers->empty())) {
tesseract::TessTextRenderer* renderer =
101 changes: 101 additions & 0 deletions src/api/wordstrboxrenderer.cpp
@@ -0,0 +1,101 @@
/**********************************************************************
* File: wordstrboxrenderer.cpp
* Description: Renderer for creating box file with WordStr strings.
* based on the tsv renderer.
*
* (C) Copyright 2006, Google Inc.
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
*
**********************************************************************/

#include "baseapi.h" // for TessBaseAPI
#include "renderer.h"
#include "tesseractclass.h" // for Tesseract

namespace tesseract {

/**
* Create a UTF8 box file with WordStr strings from the internal data structures.
* page_number is a 0-base page index that will appear in the box file.
* Returned string must be freed with the delete [] operator.
*/

char* TessBaseAPI::GetWordStrBoxText(int page_number) {
if (tesseract_ == nullptr || (page_res_ == nullptr && Recognize(nullptr) < 0))
return nullptr;

STRING wordstr_box_str("");
int left, top, right, bottom;
int page_num = page_number;
bool first_line = true;

LTRResultIterator* res_it = GetLTRIterator();
while (!res_it->Empty(RIL_BLOCK)) {
if (res_it->Empty(RIL_WORD)) {
res_it->Next(RIL_WORD);
continue;
}

if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
if (!first_line) {
wordstr_box_str.add_str_int("\n\t ", right + 1);
wordstr_box_str.add_str_int(" ", image_height_ - bottom);
wordstr_box_str.add_str_int(" ", right + 5);
wordstr_box_str.add_str_int(" ", image_height_ - top);
wordstr_box_str.add_str_int(" ", page_num); // row for tab for EOL
wordstr_box_str += "\n";
} else {
first_line = false;
}
// Use bounding box for whole line for WordStr
res_it->BoundingBox(RIL_TEXTLINE, &left, &top, &right, &bottom);
wordstr_box_str.add_str_int("WordStr ", left);
wordstr_box_str.add_str_int(" ", image_height_ - bottom);
wordstr_box_str.add_str_int(" ", right);
wordstr_box_str.add_str_int(" ", image_height_ - top);
wordstr_box_str.add_str_int(" ", page_num); // word
wordstr_box_str += " #";
}
do { wordstr_box_str +=
std::unique_ptr<const char[]>(res_it->GetUTF8Text(RIL_WORD)).get();
wordstr_box_str += " ";
res_it->Next(RIL_WORD);
} while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
}
wordstr_box_str.add_str_int("\n\t ", right + 1);
wordstr_box_str.add_str_int(" ", image_height_ - bottom);
wordstr_box_str.add_str_int(" ", right + 5);
wordstr_box_str.add_str_int(" ", image_height_ - top);
wordstr_box_str.add_str_int(" ", page_num); // row for tab for EOL
wordstr_box_str += "\n";
char* ret = new char[wordstr_box_str.length() + 1];
strcpy(ret, wordstr_box_str.string());
delete res_it;
return ret;
}

/**********************************************************************
* WordStrBox Renderer interface implementation
**********************************************************************/
TessWordStrBoxRenderer::TessWordStrBoxRenderer(const char *outputbase)
: TessResultRenderer(outputbase, "box") {
}

bool TessWordStrBoxRenderer::AddImageHandler(TessBaseAPI* api) {
const std::unique_ptr<const char[]> wordstrbox(api->GetWordStrBoxText(imagenum()));
if (wordstrbox == nullptr) return false;

AppendString(wordstrbox.get());

return true;
}

} // namespace tesseract.
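
Note on the output format: the row layout that GetWordStrBoxText() emits can be reproduced without Tesseract. The following is a minimal sketch, assuming one text line is already available as a bounding box plus its words; `TextLine` and `FormatWordStrLine` are hypothetical names introduced for illustration and are not part of this commit or of the Tesseract API.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one recognized text line; not a Tesseract type.
struct TextLine {
  int left, top, right, bottom;    // pixel coordinates, origin at top-left
  std::vector<std::string> words;  // recognized words in reading order
};

// Formats one line the way the loop in GetWordStrBoxText() appears to:
// a "WordStr" row covering the whole line, followed by a tab row that
// marks the end of the line. Box files use a bottom-left origin, so the
// y values are flipped with image_height.
std::string FormatWordStrLine(const TextLine& line, int image_height,
                              int page) {
  const std::string y_low = std::to_string(image_height - line.bottom);
  const std::string y_high = std::to_string(image_height - line.top);
  const std::string page_str = std::to_string(page);

  std::string out = "WordStr " + std::to_string(line.left) + " " + y_low +
                    " " + std::to_string(line.right) + " " + y_high + " " +
                    page_str + " #";
  for (const std::string& word : line.words) {
    out += word;
    out += " ";  // a space follows every word, including the last one
  }
  // End-of-line marker: a tab in a thin box just to the right of the line.
  out += "\n\t " + std::to_string(line.right + 1) + " " + y_low + " " +
         std::to_string(line.right + 5) + " " + y_high + " " + page_str +
         "\n";
  return out;
}
```

For a line at left=10, top=20, right=110, bottom=40 on a 100-pixel-high page 0, this yields a `WordStr 10 60 110 80 0 #...` row followed by a tab row spanning x=111 to x=115.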
2 changes: 2 additions & 0 deletions src/ccmain/tesseractclass.cpp
@@ -395,6 +395,8 @@ Tesseract::Tesseract()
this->params()),
BOOL_MEMBER(tessedit_create_tsv, false, "Write .tsv output file",
this->params()),
BOOL_MEMBER(tessedit_create_wordstrbox, false, "Write WordStr format .box output file",
this->params()),
BOOL_MEMBER(tessedit_create_pdf, false, "Write .pdf output file",
this->params()),
BOOL_MEMBER(textonly_pdf, false,
1 change: 1 addition & 0 deletions src/ccmain/tesseractclass.h
@@ -1042,6 +1042,7 @@ class Tesseract : public Wordrec {
BOOL_VAR_H(tessedit_create_alto, false, "Write .xml ALTO output file");
BOOL_VAR_H(tessedit_create_lstmbox, false, "Write .box file for LSTM training");
BOOL_VAR_H(tessedit_create_tsv, false, "Write .tsv output file");
BOOL_VAR_H(tessedit_create_wordstrbox, false, "Write WordStr format .box output file");
BOOL_VAR_H(tessedit_create_pdf, false, "Write .pdf output file");
BOOL_VAR_H(textonly_pdf, false,
"Create PDF with only one invisible text layer");
3 changes: 1 addition & 2 deletions src/lstm/input.cpp
@@ -2,7 +2,6 @@
// File: input.cpp
// Description: Input layer class for neural network implementations.
// Author: Ray Smith
// Created: Thu Mar 13 09:10:34 PDT 2014
//
// (C) Copyright 2014, Google Inc.
// Licensed under the Apache License, Version 2.0 (the "License");
@@ -93,7 +92,7 @@ Pix* Input::PrepareLSTMInputs(const ImageData& image_data,
tprintf("Bad pix from ImageData!\n");
return nullptr;
}
if (width <= min_width || height < min_width) {
if (width < min_width || height < min_width) {
tprintf("Image too small to scale!! (%dx%d vs min width of %d)\n", width,
height, min_width);
pixDestroy(&pix);
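
Note on the boundary change: the only behavioral difference in this hunk is for images whose width equals `min_width` exactly. A minimal sketch of the check as a standalone predicate, assuming nothing beyond the two comparisons shown above; `TooSmallToScale` is a hypothetical helper name, not a function in input.cpp.

```cpp
// Hypothetical helper mirroring the size check in Input::PrepareLSTMInputs.
// Returns true when the image should be rejected as too small to scale.
bool TooSmallToScale(int width, int height, int min_width) {
  // Before this commit the first comparison was `width <= min_width`,
  // which also rejected images exactly min_width pixels wide. The height
  // comparison already used `<`, so both dimensions are now treated alike.
  return width < min_width || height < min_width;
}
```

With `min_width` of 3, an image 3 pixels wide and 10 pixels high was rejected by the old condition and is accepted by the new one.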
1 change: 1 addition & 0 deletions tessdata/configs/wordstrbox
@@ -0,0 +1 @@
tessedit_create_wordstrbox 1
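
Note on the config format: the man page text in this commit describes a config as one parameter per line, with a space separating the parameter from its value. The sketch below parses that shape into a map. It is an illustration only, not Tesseract's own config reader, and `ParseConfig` is a hypothetical name.

```cpp
#include <map>
#include <sstream>
#include <string>

// Illustrative parser for "parameter value" config lines. This is NOT
// Tesseract's own reader, only a sketch of the format the man page describes.
std::map<std::string, std::string> ParseConfig(const std::string& text) {
  std::map<std::string, std::string> params;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    const std::string::size_type space = line.find(' ');
    // Skip blank lines and lines without a separating space.
    if (line.empty() || space == std::string::npos) continue;
    params[line.substr(0, space)] = line.substr(space + 1);
  }
  return params;
}
```

Applied to the single line added in this file, the map holds one entry with key `tessedit_create_wordstrbox` and value `1`, which is the parameter PreloadRenderers() reads via GetBoolVariable().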
