Merge branch 'develop'

ocrmypdf · Nov 7, 2016 · 8abc2f1
2 parents 6821e8e + 949d2ff
commit 8abc2f1

Showing 17 changed files with 166 additions and 47 deletions.
diff --git a/RELEASE_NOTES.rst b/RELEASE_NOTES.rst
@@ -3,6 +3,16 @@ RELEASE NOTES
 
 OCRmyPDF uses `semantic versioning <http://semver.org/>`_.
 
+v4.3.1:
+=======
+
+-  Fixed an issue where pages produced by the "hocr" renderer after a Tesseract timeout would be rotated incorrectly if the input page was rotated with a /Rotate marker
+-  Fixed a file handle leak in LeptonicaErrorTrap that would cause a "too many open files" error for files around a hundred pages long when ``--deskew``, ``--remove-background`` or other Leptonica-based image processing features were in use, depending on the system value of ``ulimit -n``
+-  Ability to specify multiple languages for multilingual documents is now advertised in documentation
+-  Reduced the file sizes of some test resources
+-  Cleaned up debug output
+-  Tesseract caching in test cases is now more cautious about false cache hits and reproducing exact output, not that any problems were observed
+
 
 v4.3:
 =====

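The file handle leak described in the v4.3.1 notes above comes from redirecting stderr at the file descriptor level without closing the saved duplicate afterwards. The following is a minimal sketch of the corrected pattern, not the project's actual class: ``StderrTrap`` and ``lowest_free_fd`` are hypothetical names, and the sketch assumes a POSIX-like system where file descriptor 2 is stderr.

```python
import os
import sys
from tempfile import TemporaryFile


class StderrTrap:
    """Illustrative stand-in for LeptonicaErrorTrap (the name is hypothetical)."""

    def __enter__(self):
        self.tmpfile = TemporaryFile()
        sys.stderr.flush()
        # Keep a duplicate of the real stderr so it can be restored later.
        self.copy_of_stderr = os.dup(2)
        # Point file descriptor 2 at the temporary file.
        os.dup2(self.tmpfile.fileno(), 2)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        sys.stderr.flush()
        os.dup2(self.copy_of_stderr, 2)
        # Without this close, every use of the trap leaks one descriptor.
        os.close(self.copy_of_stderr)
        with self.tmpfile as tmpfile:
            tmpfile.seek(0)
            self.captured = tmpfile.read().decode(errors='replace')
        return False


def lowest_free_fd():
    """Return the lowest unused descriptor number; it creeps upward if fds leak."""
    fd = os.dup(2)
    os.close(fd)
    return fd


before = lowest_free_fd()
for _ in range(100):
    with StderrTrap() as trap:
        os.write(2, b"message from a C library\n")
after = lowest_free_fd()
```

If the ``os.close`` call is removed, ``after`` ends up larger than ``before``, and a long enough run eventually exceeds the limit reported by ``ulimit -n``.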
diff --git a/docs/cookbook.rst b/docs/cookbook.rst
@@ -49,10 +49,23 @@ OCR will attempt to automatic correct the rotation of each page. This can help f
 You can increase (decrease) the parameter ``--rotate-pages-threshold`` to make page rotation more (less) aggressive.
 
 
+OCR languages other than English
+""""""""""""""""""""""""""""""""
+
+By default OCRmyPDF assumes the document is in English. Pass ``-l`` to name one or more other languages:
+
+.. code-block:: bash
+
+	ocrmypdf -l fre LeParisien.pdf LeParisien.pdf
+	ocrmypdf -l eng+fre Bilingual-English-French.pdf Bilingual-English-French.pdf
+
+Language packs must be installed for all languages specified. See :ref:`Installing additional language packs <lang-packs>`.
+
+
 OCR images, not PDFs
 --------------------
 
-Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the resutls to run ocrmypdf:
+Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the results to run ocrmypdf:
 
 .. code-block:: bash
 
@@ -107,19 +120,20 @@ watchdog installs the command line program ``watchmedo``, which can be told to r
 	mkdir out
 	watchmedo shell-command \
 		--patterns="*.pdf" \
+		--ignore-directories \
 		--command='ocrmypdf "${watch_src_path}" "out/${watch_src_path}" ' \
 		.  # don't forget the final dot
 
 For more complex behavior you can write a Python script around to use the watchdog API.
 
 On file servers, you could configure watchmedo as a system service so it will run all the time.
 
-
 Caveats
 """""""
 
 * ``watchmedo`` may not work properly on a networked file system, depending on the capabilities of the file system client and server.
 * This simple recipe does not filter for the type of file system event, so file copies, deletes and moves, and directory operations, will all be sent to ocrmypdf, producing errors in several cases. Disable your watched folder if you are doing anything other than copying files to it.
+* If the source and destination directory are the same, watchmedo may create an infinite loop.
 
 
 Batch jobs

diff --git a/docs/introduction.rst b/docs/introduction.rst
@@ -28,11 +28,16 @@ Rasterizing a PDF is the process of generating an image suitable for display or
 About PDF/A
 -----------
 
-`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`_ is a standardized subset of the full PDF specification that is designed for archiving.  PDF/A differs from PDF primarily by omitting features that would make it difficult to read the file in the future, such as embedded Javascript or references to external fonts.  All fonts and resources needed to interpret the PDF must be contained within it.  Generally speaking, scanned documents should be converted to PDF/A. There are various conformance levels and versions, such as "PDF/A-2b".
+`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`_ is an ISO-standardized subset of the full PDF specification that is designed for archiving (the 'A' stands for Archive).  PDF/A differs from PDF primarily by omitting features that would make it difficult to read the file in the future, such as embedded JavaScript, video, audio, and references to external fonts.  All fonts and resources needed to interpret the PDF must be contained within it.  Because PDF/A disables JavaScript and other types of embedded content, it is probably more secure.
 
-Since most people who scan documents are interested in reading them in the future, OCRmyPDF generates PDF/A-2b by default.
+There are various conformance levels and versions, such as "PDF/A-2b".
+
+Generally speaking, the best format for scanned documents is PDF/A. Some governments and jurisdictions, US Courts in particular, `mandate the use of PDF/A <https://pdfblog.com/2012/02/13/what-is-pdfa/>`_ for scanned documents.
+
+Since most people who scan documents are interested in reading them indefinitely into the future, OCRmyPDF generates PDF/A-2b by default.
+
+PDF/A has a few drawbacks.  Some PDF viewers include an alert that the file is a PDF/A, which may confuse some users.  It also tends to produce larger files than PDF, because it embeds certain resources even if they are commonly available. PDF/A files can be digitally signed, but may not be encrypted, to ensure they can be read in the future.  Fortunately, converting from PDF/A to a regular PDF is trivial, and any PDF viewer can view PDF/A.
 
-PDF/A has a few drawbacks.  Some PDF viewers include an alert that the file is a PDF/A, which may confuse some users.  It also tends to produce larger files than PDF, because it embeds certain resources even if they are commonly available. 
 
 What OCRmyPDF does
 ------------------

diff --git a/docs/languages.rst b/docs/languages.rst
@@ -1,3 +1,5 @@
+.. _lang-packs:
+
 Installing additional language packs
 ====================================
 
@@ -19,7 +21,7 @@ Debian and Ubuntu users
    apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack
    
 You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple
-languages can be requested.
+languages can be requested using either ``-l eng+fre`` (English and French) or ``-l eng -l fre``.
 
 Mac OS X (macOS) users
 ----------------------
@@ -38,7 +40,7 @@ As of v4.2, users of ocrmypdf working languages outside the Latin alphabet shoul
 
 .. code-block:: bash
 
-	ocrmypdf --output-type pdf --pdf-renderer tesseract
+	ocrmypdf -l eng+gre --output-type pdf --pdf-renderer tesseract
 
 The reasons for this are:
 

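The two spellings documented above, ``-l eng+fre`` and ``-l eng -l fre``, can both be accepted by an argparse option declared with ``action='append'``. The sketch below shows one way to reduce either form to the same list. The ``parse_languages`` helper and the splitting on ``+`` are assumptions made for illustration; the diff does not show how OCRmyPDF combines the values internally.

```python
import argparse


def parse_languages(argv):
    """Return a flat list of language codes from repeated or '+'-joined -l options."""
    parser = argparse.ArgumentParser()
    # action='append' collects each occurrence of -l into a list.
    parser.add_argument('-l', '--language', action='append')
    options = parser.parse_args(argv)

    languages = []
    # options.language is None when -l was never given.
    for value in options.language or []:
        languages.extend(part for part in value.split('+') if part)
    return languages


joined = parse_languages(['-l', 'eng+fre'])
repeated = parse_languages(['-l', 'eng', '-l', 'fre'])
```

Both calls produce the same list of two codes, so later code only has to handle one representation.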
diff --git a/ocrmypdf/__main__.py b/ocrmypdf/__main__.py
@@ -21,7 +21,7 @@
 from functools import partial
 
 from ruffus import transform, suffix, merge, active_if, regex, jobs_limit, \
-    formatter, follows, split, collate, check_if_uptodate, graphviz
+    formatter, follows, split, collate, check_if_uptodate, graphviz, posttask
 import ruffus.ruffus_exceptions as ruffus_exceptions
 import ruffus.cmdline as cmdline
 import ruffus.proxy_logger as proxy_logger
@@ -166,8 +166,10 @@ def check_pil_encoder(codec_name, friendly_name):
     help="output searchable PDF file (or '-' to write to standard output)")
 parser.add_argument(
     '-l', '--language', action='append',
-    help="languages of the file to be OCRed (see tesseract --list-langs for "
-         "all language packs installed in your system)")
+    help="Language(s) of the file to be OCRed (see tesseract --list-langs for "
+         "all language packs installed in your system). To specify multiple "
+         "languages, join them with '+' or issue this argument once for each "
+         "language.")
 parser.add_argument(
     '-j', '--jobs', metavar='N', type=int,
     help="Use up to N CPU cores simultaneously (default: use all)")
@@ -304,6 +306,8 @@ def check_pil_encoder(codec_name, friendly_name):
 # ----------
 # Arguments
 
+options.verbose_abbreviated_path = 1
+
 if options.pdf_renderer == 'auto':
     options.pdf_renderer = 'hocr'
 
@@ -460,6 +464,11 @@ def re_symlink(input_file, soft_link_name, log=_log):
 work_folder = mkdtemp(prefix="com.github.ocrmypdf.")
 
 
+def done_task(caller):
+    "Useful as debug hook"
+    pass
+
+
 @atexit.register
 def cleanup_working_files(*args):
     if options.keep_temporary_files:
@@ -533,6 +542,7 @@ def triage_image_file(input_file, output_file, log):
         sys.exit(ExitCode.input_file)
 
 
+@posttask(partial(done_task, 'triage'))
 @transform(
     input=os.path.join(work_folder, 'origin'),
     filter=formatter('(?i)'),
@@ -555,6 +565,7 @@ def triage(
     triage_image_file(input_file, output_file, log)
 
 
+@posttask(partial(done_task, 'repair_pdf'))
 @transform(
     input=triage,
     filter=suffix('.pdf'),
@@ -648,6 +659,7 @@ def is_ocr_required(pageinfo, log):
     return ocr_required
 
 
+@posttask(partial(done_task, 'split_pages'))
 @split(
     repair_pdf,
     os.path.join(work_folder, '*.page.pdf'),
@@ -690,6 +702,7 @@ def split_pages(
                 os.path.basename(filename)[0:6] + alt_suffix))
 
 
+@posttask(partial(done_task, 'rasterize_preview'))
 @active_if(options.rotate_pages)
 @transform(
     input=split_pages,
@@ -712,6 +725,7 @@ def rasterize_preview(
         log=log)
 
 
+@posttask(partial(done_task, 'orient_page'))
 @collate(
     input=[split_pages, rasterize_preview],
     filter=regex(r".*/(\d{6})(\.ocr|\.skip)(?:\.page\.pdf|\.preview\.jpg)"),
@@ -786,6 +800,7 @@ def orient_page(
             pdfinfo[pageno] = pageinfo
 
 
+@posttask(partial(done_task, 'rasterize_with_ghostscript'))
 @transform(
     input=orient_page,
     filter=suffix('.ocr.oriented.pdf'),
@@ -822,6 +837,7 @@ def rasterize_with_ghostscript(
         log=log)
 
 
+@posttask(partial(done_task, 'preprocess_remove_background'))
 @transform(
     input=rasterize_with_ghostscript,
     filter=suffix(".page.png"),
@@ -848,6 +864,7 @@ def preprocess_remove_background(
         re_symlink(input_file, output_file, log)
 
 
+@posttask(partial(done_task, 'preprocess_deskew'))
 @transform(
     input=preprocess_remove_background,
     filter=suffix(".pp-background.png"),
@@ -870,6 +887,7 @@ def preprocess_deskew(
     leptonica.deskew(input_file, output_file, dpi)
 
 
+@posttask(partial(done_task, 'preprocess_clean'))
 @transform(
     input=preprocess_deskew,
     filter=suffix(".pp-deskew.png"),
@@ -892,6 +910,7 @@ def preprocess_clean(
     unpaper.clean(input_file, output_file, dpi, log)
 
 
+@posttask(partial(done_task, 'ocr_tesseract_hocr'))
 @active_if(options.pdf_renderer == 'hocr')
 @transform(
     input=preprocess_clean,
@@ -919,6 +938,7 @@ def ocr_tesseract_hocr(
         )
 
 
+@posttask(partial(done_task, 'select_image_for_pdf'))
 @collate(
     input=[rasterize_with_ghostscript, preprocess_remove_background,
            preprocess_deskew, preprocess_clean],
@@ -962,6 +982,7 @@ def select_image_for_pdf(
         re_symlink(image, output_file)
 
 
+@posttask(partial(done_task, 'select_image_layer'))
 @active_if(options.pdf_renderer == 'hocr')
 @collate(
     input=[select_image_for_pdf, orient_page],
@@ -992,11 +1013,14 @@ def select_image_layer(
         with open(image, 'rb') as imfile, \
                 open(output_file, 'wb') as pdf:
             rawdata = imfile.read()
+            log.debug('{:4d}: convert'.format(page_number(page_pdf)))
             img2pdf.convert(
                 rawdata, with_pdfrw=False,
                 layout_fun=layout_fun, outputstream=pdf)
+            log.debug('{:4d}: convert done'.format(page_number(page_pdf)))
 
 
+@posttask(partial(done_task, 'render_hocr_page'))
 @active_if(options.pdf_renderer == 'hocr')
 @transform(
     input=ocr_tesseract_hocr,
@@ -1019,6 +1043,7 @@ def render_hocr_page(
                          showBoundingboxes=False, invisibleText=True)
 
 
+@posttask(partial(done_task, 'render_hocr_debug_page'))
 @active_if(options.pdf_renderer == 'hocr')
 @active_if(options.debug_rendering)
 @collate(
@@ -1048,6 +1073,7 @@ class PdfMergeFailedError(Exception):
     pass
 
 
+@posttask(partial(done_task, 'add_text_layer'))
 @active_if(options.pdf_renderer == 'hocr')
 @collate(
     input=[render_hocr_page, select_image_layer],
@@ -1126,6 +1152,7 @@ def add_text_layer(
         pdf_output.write(out)
 
 
+@posttask(partial(done_task, 'tesseract_ocr_and_render_pdf'))
 @active_if(options.pdf_renderer == 'tesseract')
 @collate(
     input=[select_image_for_pdf, orient_page],
@@ -1191,6 +1218,7 @@ def from_document_info(key):
     return pdfmark
 
 
+@posttask(partial(done_task, 'generate_postscript_stub'))
 @active_if(options.output_type == 'pdfa')
 @transform(
     input=repair_pdf,
@@ -1207,6 +1235,7 @@ def generate_postscript_stub(
     generate_pdfa_def(output_file, pdfmark)
 
 
+@posttask(partial(done_task, 'skip_page'))
 @transform(
     input=orient_page,
     filter=suffix('.skip.oriented.pdf'),
@@ -1224,6 +1253,7 @@ def skip_page(
     re_symlink(input_file, output_file, log)
 
 
+@posttask(partial(done_task, 'merge_pages_ghostscript'))
 @active_if(options.output_type == 'pdfa')
 @merge(
     input=[add_text_layer, render_hocr_debug_page, skip_page,
@@ -1255,6 +1285,7 @@ def input_file_order(s):
     ghostscript.generate_pdfa(pdf_pages, output_file, options.jobs or 1)
 
 
+@posttask(partial(done_task, 'merge_pages_qpdf'))
 @active_if(options.output_type == 'pdf')
 @merge(
     input=[add_text_layer, render_hocr_debug_page, skip_page,
@@ -1301,6 +1332,7 @@ def input_file_order(s):
     qpdf.merge(pdf_pages, output_file)
 
 
+@posttask(partial(done_task, 'copy_final'))
 @merge(
     input=[merge_pages_ghostscript, merge_pages_qpdf],
     output=options.output_file,
@@ -1501,7 +1533,10 @@ def run_pipeline():
         _log.info("Output sent to stdout")
 
     with _pdfinfo_lock:
-        _log.debug(_pdfinfo)
+        if options.verbose:
+            from pprint import pformat
+            referent = _pdfinfo._getvalue()  # get the real list out of proxy
+            _log.debug(pformat(referent))
         direction = {0: 'n', 90: 'e',
                      180: 's', 270: 'w'}
         orientations = []

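The ``@posttask(partial(done_task, 'triage'))`` lines added above rely on ``functools.partial`` to turn the one-argument ``done_task`` into a callable that needs no arguments, which, as far as I understand the ruffus documentation, is the form ``posttask`` expects. The sketch below illustrates only the ``partial`` part and does not use ruffus. Recording names in a ``completed`` list is an assumption for demonstration; the ``done_task`` in the diff does nothing.

```python
from functools import partial

completed = []


def done_task(caller):
    """Debug hook: record which task reported completion (illustrative body)."""
    completed.append(caller)


# Each partial binds the task name now, so the hook can later be called
# with no arguments at all.
triage_hook = partial(done_task, 'triage')
repair_hook = partial(done_task, 'repair_pdf')

for hook in (triage_hook, repair_hook):
    hook()
```

Binding the name this way lets a single hook function serve every pipeline stage, and a breakpoint or log statement placed in ``done_task`` then fires for each stage with its name attached.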
diff --git a/ocrmypdf/leptonica.py b/ocrmypdf/leptonica.py
@@ -5,7 +5,6 @@
 #
 # Python FFI wrapper for Leptonica library
 
-from __future__ import print_function, absolute_import, division
 import argparse
 import sys
 import os
@@ -45,23 +44,30 @@ def __enter__(self):
         self.tmpfile = TemporaryFile()
 
         # Save the old stderr, and redirect stderr to temporary file
+        sys.stderr.flush()
         try:
-            self.old_stderr_fileno = os.dup(sys.stderr.fileno())
-            os.dup2(self.tmpfile.fileno(), sys.stderr.fileno())
+            self.copy_of_stderr = os.dup(sys.stderr.fileno())
+            os.dup2(self.tmpfile.fileno(), sys.stderr.fileno(),
+                    inheritable=False)
         except UnsupportedOperation:
-            self.old_stderr_fileno = None
+            self.copy_of_stderr = None
         return
 
     def __exit__(self, exc_type, exc_value, traceback):
         # Restore old stderr
-        if self.old_stderr_fileno is not None:
-            os.dup2(self.old_stderr_fileno, sys.stderr.fileno())
+        sys.stderr.flush()
+        if self.copy_of_stderr is not None:
+            os.dup2(self.copy_of_stderr, sys.stderr.fileno())
+            os.close(self.copy_of_stderr)
 
         # Get data from tmpfile (in with block to ensure it is closed)
         with self.tmpfile as tmpfile:
             tmpfile.seek(0)  # Cursor will be at end, so move back to beginning
             leptonica_output = tmpfile.read().decode(errors='replace')
 
+        assert self.tmpfile.closed
+        assert not sys.stderr.closed
+
         # If there are Python errors, let them bubble up
         if exc_type:
             logger.warning(leptonica_output)