Commit

Merge branch 'develop'
James R. Barlow committed Nov 7, 2016
2 parents 6821e8e + 949d2ff commit 8abc2f1
Showing 17 changed files with 166 additions and 47 deletions.
10 changes: 10 additions & 0 deletions RELEASE_NOTES.rst
Expand Up @@ -3,6 +3,16 @@ RELEASE NOTES

OCRmyPDF uses `semantic versioning <http://semver.org/>`_.

v4.3.1:
=======

- Fixed an issue where pages produced by the "hocr" renderer after a Tesseract timeout would be rotated incorrectly if the input page was rotated with a /Rotate marker
- Fixed a file handle leak in LeptonicaErrorTrap that would cause a "too many open files" error for files around a hundred pages long when ``--deskew``, ``--remove-background`` or other Leptonica-based image processing features were in use, depending on the system value of ``ulimit -n``
- Ability to specify multiple languages for multilingual documents is now advertised in documentation
- Reduced the file sizes of some test resources
- Cleaned up debug output
- Tesseract caching in test cases is now more cautious about false cache hits and reproducing exact output, not that any problems were observed
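
The file handle leak described above comes from a common pattern: duplicating stderr's descriptor in order to redirect it, and then never closing the duplicate. The following is a minimal sketch of a leak-free version of that pattern, not the project's actual ``LeptonicaErrorTrap``; the class name ``CaptureFd`` and its attributes are hypothetical, and it assumes a POSIX-like system where descriptor 2 is open.

```python
import os
import tempfile


class CaptureFd:
    """Temporarily redirect a file descriptor (default 2, stderr) to a temp file.

    Hypothetical illustration only; names are not taken from OCRmyPDF.
    """

    def __init__(self, fd=2):
        self.fd = fd
        self.captured = b""

    def __enter__(self):
        self.tmpfile = tempfile.TemporaryFile()
        # Keep a duplicate of the original descriptor so it can be restored.
        self.saved_fd = os.dup(self.fd)
        # Point the target descriptor at the temporary file.
        os.dup2(self.tmpfile.fileno(), self.fd)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the original target, then close the duplicate.
        # Omitting os.close() here leaks one descriptor per use.
        os.dup2(self.saved_fd, self.fd)
        os.close(self.saved_fd)
        # Read back what was written; the with block closes the temp file.
        with self.tmpfile as tmp:
            tmp.seek(0)
            self.captured = tmp.read()
        return False  # let any exception propagate
```

Each use opens two descriptors (the temporary file and the duplicate) and closes both on exit, so wrapping every page of a long document in this context manager should not approach the ``ulimit -n`` limit.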


v4.3:
=====
18 changes: 16 additions & 2 deletions docs/cookbook.rst
Expand Up @@ -49,10 +49,23 @@ OCR will attempt to automatic correct the rotation of each page. This can help f
You can increase (decrease) the parameter ``--rotate-pages-threshold`` to make page rotation more (less) aggressive.


OCR languages other than English
""""""""""""""""""""""""""""""""

By default OCRmyPDF assumes the document is English.

.. code-block:: bash

    ocrmypdf -l fre LeParisien.pdf LeParisien.pdf
    ocrmypdf -l eng+fre Bilingual-English-French.pdf Bilingual-English-French.pdf

Language packs must be installed for all languages specified. See :ref:`Installing additional language packs <lang-packs>`.


OCR images, not PDFs
--------------------

Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the resutls to run ocrmypdf:
Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to convert your images to PDFs, and then pipe the results to run ocrmypdf:

.. code-block:: bash
Expand Down Expand Up @@ -107,19 +120,20 @@ watchdog installs the command line program ``watchmedo``, which can be told to r
    mkdir out
    watchmedo shell-command \
        --patterns="*.pdf" \
        --ignore-directories \
        --command='ocrmypdf "${watch_src_path}" "out/${watch_src_path}" ' \
        .  # don't forget the final dot

For more complex behavior you can write a Python script that uses the watchdog API.

On file servers, you could configure watchmedo as a system service so it will run all the time.


Caveats
"""""""

* ``watchmedo`` may not work properly on a networked file system, depending on the capabilities of the file system client and server.
* This simple recipe does not filter for the type of file system event, so file copies, deletes and moves, and directory operations, will all be sent to ocrmypdf, producing errors in several cases. Disable your watched folder if you are doing anything other than copying files to it.
* If the source and destination directory are the same, watchmedo may create an infinite loop.
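
The caveats above mostly come down to deciding which file system events should trigger OCR. The sketch below shows one way such a filter could look in a custom script. It is an assumption-laden illustration: ``should_ocr`` is a hypothetical helper, not part of watchdog or OCRmyPDF, and its arguments are modeled on the ``event_type``, ``src_path`` and ``is_directory`` attributes of watchdog events. It uses only the standard library.

```python
import os


def should_ocr(event_type, src_path, is_directory, output_dir):
    """Return True only for newly created PDF files outside output_dir.

    Hypothetical helper; argument names mirror watchdog event attributes.
    """
    if is_directory:
        return False  # ignore directory operations
    if event_type != "created":
        return False  # ignore deletes, moves and modifications
    if not src_path.lower().endswith(".pdf"):
        return False
    src = os.path.abspath(src_path)
    out = os.path.abspath(output_dir)
    # Skip files already inside the output folder, which would otherwise
    # be picked up again and create an infinite loop.
    return os.path.commonpath([src, out]) != out
```

A handler in a watchdog-based script could call this function and run ocrmypdf only when it returns ``True``. A "created" event may fire before a large file has finished copying, so a real script would likely also need to wait until the file is complete.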


Batch jobs
11 changes: 8 additions & 3 deletions docs/introduction.rst
Expand Up @@ -28,11 +28,16 @@ Rasterizing a PDF is the process of generating an image suitable for display or
About PDF/A
-----------

`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`_ is a standardized subset of the full PDF specification that is designed for archiving. PDF/A differs from PDF primarily by omitting features that would make it difficult to read the file in the future, such as embedded Javascript or references to external fonts. All fonts and resources needed to interpret the PDF must be contained within it. Generally speaking, scanned documents should be converted to PDF/A. There are various conformance levels and versions, such as "PDF/A-2b".
`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`_ is an ISO-standardized subset of the full PDF specification that is designed for archiving (the 'A' stands for Archive). PDF/A differs from PDF primarily by omitting features that would make it difficult to read the file in the future, such as embedded Javascript, video, audio and references to external fonts. All fonts and resources needed to interpret the PDF must be contained within it. Because PDF/A disables Javascript and other types of embedded content, it is probably more secure.

Since most people who scan documents are interested in reading them in the future, OCRmyPDF generates PDF/A-2b by default.
There are various conformance levels and versions, such as "PDF/A-2b".

Generally speaking, the best format for scanned documents is PDF/A. Some governments and jurisdictions, US Courts in particular, `mandate the use of PDF/A <https://pdfblog.com/2012/02/13/what-is-pdfa/>`_ for scanned documents.

Since most people who scan documents are interested in reading them indefinitely into the future, OCRmyPDF generates PDF/A-2b by default.

PDF/A has a few drawbacks. Some PDF viewers include an alert that the file is a PDF/A, which may confuse some users. It also tends to produce larger files than PDF, because it embeds certain resources even if they are commonly available. PDF/A files can be digitally signed, but may not be encrypted, to ensure they can be read in the future. Fortunately, converting from PDF/A to a regular PDF is trivial, and any PDF viewer can view PDF/A.

PDF/A has a few drawbacks. Some PDF viewers include an alert that the file is a PDF/A, which may confuse some users. It also tends to produce larger files than PDF, because it embeds certain resources even if they are commonly available.

What OCRmyPDF does
------------------
6 changes: 4 additions & 2 deletions docs/languages.rst
@@ -1,3 +1,5 @@
.. _lang-packs:

Installing additional language packs
====================================

Expand All @@ -19,7 +21,7 @@ Debian and Ubuntu users
apt-get install tesseract-ocr-chi-sim # Example: Install Chinese Simplified language pack
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple
languages can be requested.
languages can be requested using either ``-l eng+fre`` (English and French) or ``-l eng -l fre``.
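
The two spellings can be treated as equivalent by collecting every ``-l`` occurrence and joining the codes with ``+``, which is the form Tesseract's own ``-l`` option accepts. The following is a minimal sketch of that idea using ``argparse`` with ``action='append'``, as the ``--language`` argument in this commit does. The helper names ``normalize_languages`` and ``tesseract_language_arg`` are hypothetical and not taken from OCRmyPDF, and the ``eng`` default is an assumption based on the documented English default.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="ocrmypdf-sketch")
    # action='append' collects every occurrence of -l into a list.
    parser.add_argument("-l", "--language", action="append")
    return parser


def normalize_languages(values, default="eng"):
    """Flatten ['eng+fre', 'deu'] into ['eng', 'fre', 'deu'], dropping repeats."""
    languages = []
    for value in values or [default]:
        for code in value.split("+"):
            code = code.strip()
            if code and code not in languages:
                languages.append(code)
    return languages


def tesseract_language_arg(values):
    """Join the language codes into the 'eng+fre' form used by Tesseract's -l."""
    return "+".join(normalize_languages(values))
```

With this sketch, ``-l eng+fre`` and ``-l eng -l fre`` both produce the same language string, and omitting ``-l`` falls back to the assumed default.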

Mac OS X (macOS) users
----------------------
Expand All @@ -38,7 +40,7 @@ As of v4.2, users of ocrmypdf working languages outside the Latin alphabet shoul

.. code-block:: bash
ocrmypdf --output-type pdf --pdf-renderer tesseract
ocrmypdf -l eng+gre --output-type pdf --pdf-renderer tesseract
The reasons for this are:

43 changes: 39 additions & 4 deletions ocrmypdf/__main__.py
Expand Up @@ -21,7 +21,7 @@
from functools import partial

from ruffus import transform, suffix, merge, active_if, regex, jobs_limit, \
formatter, follows, split, collate, check_if_uptodate, graphviz
formatter, follows, split, collate, check_if_uptodate, graphviz, posttask
import ruffus.ruffus_exceptions as ruffus_exceptions
import ruffus.cmdline as cmdline
import ruffus.proxy_logger as proxy_logger
Expand Down Expand Up @@ -166,8 +166,10 @@ def check_pil_encoder(codec_name, friendly_name):
help="output searchable PDF file (or '-' to write to standard output)")
parser.add_argument(
'-l', '--language', action='append',
help="languages of the file to be OCRed (see tesseract --list-langs for "
"all language packs installed in your system)")
help="Language(s) of the file to be OCRed (see tesseract --list-langs for "
"all language packs installed in your system). To specify multiple "
"languages, join them with '+' or issue this argument once for each "
"language.")
parser.add_argument(
'-j', '--jobs', metavar='N', type=int,
help="Use up to N CPU cores simultaneously (default: use all)")
Expand Down Expand Up @@ -304,6 +306,8 @@ def check_pil_encoder(codec_name, friendly_name):
# ----------
# Arguments

options.verbose_abbreviated_path = 1

if options.pdf_renderer == 'auto':
options.pdf_renderer = 'hocr'

Expand Down Expand Up @@ -460,6 +464,11 @@ def re_symlink(input_file, soft_link_name, log=_log):
work_folder = mkdtemp(prefix="com.github.ocrmypdf.")


def done_task(caller):
    "Useful as debug hook"
    pass


@atexit.register
def cleanup_working_files(*args):
if options.keep_temporary_files:
Expand Down Expand Up @@ -533,6 +542,7 @@ def triage_image_file(input_file, output_file, log):
sys.exit(ExitCode.input_file)


@posttask(partial(done_task, 'triage'))
@transform(
input=os.path.join(work_folder, 'origin'),
filter=formatter('(?i)'),
Expand All @@ -555,6 +565,7 @@ def triage(
triage_image_file(input_file, output_file, log)


@posttask(partial(done_task, 'repair_pdf'))
@transform(
input=triage,
filter=suffix('.pdf'),
Expand Down Expand Up @@ -648,6 +659,7 @@ def is_ocr_required(pageinfo, log):
return ocr_required


@posttask(partial(done_task, 'split_pages'))
@split(
repair_pdf,
os.path.join(work_folder, '*.page.pdf'),
Expand Down Expand Up @@ -690,6 +702,7 @@ def split_pages(
os.path.basename(filename)[0:6] + alt_suffix))


@posttask(partial(done_task, 'rasterize_preview'))
@active_if(options.rotate_pages)
@transform(
input=split_pages,
Expand All @@ -712,6 +725,7 @@ def rasterize_preview(
log=log)


@posttask(partial(done_task, 'orient_page'))
@collate(
input=[split_pages, rasterize_preview],
filter=regex(r".*/(\d{6})(\.ocr|\.skip)(?:\.page\.pdf|\.preview\.jpg)"),
Expand Down Expand Up @@ -786,6 +800,7 @@ def orient_page(
pdfinfo[pageno] = pageinfo


@posttask(partial(done_task, 'rasterize_with_ghostscript'))
@transform(
input=orient_page,
filter=suffix('.ocr.oriented.pdf'),
Expand Down Expand Up @@ -822,6 +837,7 @@ def rasterize_with_ghostscript(
log=log)


@posttask(partial(done_task, 'preprocess_remove_background'))
@transform(
input=rasterize_with_ghostscript,
filter=suffix(".page.png"),
Expand All @@ -848,6 +864,7 @@ def preprocess_remove_background(
re_symlink(input_file, output_file, log)


@posttask(partial(done_task, 'preprocess_deskew'))
@transform(
input=preprocess_remove_background,
filter=suffix(".pp-background.png"),
Expand All @@ -870,6 +887,7 @@ def preprocess_deskew(
leptonica.deskew(input_file, output_file, dpi)


@posttask(partial(done_task, 'preprocess_clean'))
@transform(
input=preprocess_deskew,
filter=suffix(".pp-deskew.png"),
Expand All @@ -892,6 +910,7 @@ def preprocess_clean(
unpaper.clean(input_file, output_file, dpi, log)


@posttask(partial(done_task, 'ocr_tesseract_hocr'))
@active_if(options.pdf_renderer == 'hocr')
@transform(
input=preprocess_clean,
Expand Down Expand Up @@ -919,6 +938,7 @@ def ocr_tesseract_hocr(
)


@posttask(partial(done_task, 'select_image_for_pdf'))
@collate(
input=[rasterize_with_ghostscript, preprocess_remove_background,
preprocess_deskew, preprocess_clean],
Expand Down Expand Up @@ -962,6 +982,7 @@ def select_image_for_pdf(
re_symlink(image, output_file)


@posttask(partial(done_task, 'select_image_layer'))
@active_if(options.pdf_renderer == 'hocr')
@collate(
input=[select_image_for_pdf, orient_page],
Expand Down Expand Up @@ -992,11 +1013,14 @@ def select_image_layer(
    with open(image, 'rb') as imfile, \
            open(output_file, 'wb') as pdf:
        rawdata = imfile.read()
        log.debug('{:4d}: convert'.format(page_number(page_pdf)))
        img2pdf.convert(
            rawdata, with_pdfrw=False,
            layout_fun=layout_fun, outputstream=pdf)
        log.debug('{:4d}: convert done'.format(page_number(page_pdf)))


@posttask(partial(done_task, 'render_hocr_page'))
@active_if(options.pdf_renderer == 'hocr')
@transform(
input=ocr_tesseract_hocr,
Expand All @@ -1019,6 +1043,7 @@ def render_hocr_page(
showBoundingboxes=False, invisibleText=True)


@posttask(partial(done_task, 'render_hocr_debug_page'))
@active_if(options.pdf_renderer == 'hocr')
@active_if(options.debug_rendering)
@collate(
Expand Down Expand Up @@ -1048,6 +1073,7 @@ class PdfMergeFailedError(Exception):
pass


@posttask(partial(done_task, 'add_text_layer'))
@active_if(options.pdf_renderer == 'hocr')
@collate(
input=[render_hocr_page, select_image_layer],
Expand Down Expand Up @@ -1126,6 +1152,7 @@ def add_text_layer(
pdf_output.write(out)


@posttask(partial(done_task, 'tesseract_ocr_and_render_pdf'))
@active_if(options.pdf_renderer == 'tesseract')
@collate(
input=[select_image_for_pdf, orient_page],
Expand Down Expand Up @@ -1191,6 +1218,7 @@ def from_document_info(key):
return pdfmark


@posttask(partial(done_task, 'generate_postscript_stub'))
@active_if(options.output_type == 'pdfa')
@transform(
input=repair_pdf,
Expand All @@ -1207,6 +1235,7 @@ def generate_postscript_stub(
generate_pdfa_def(output_file, pdfmark)


@posttask(partial(done_task, 'skip_page'))
@transform(
input=orient_page,
filter=suffix('.skip.oriented.pdf'),
Expand All @@ -1224,6 +1253,7 @@ def skip_page(
re_symlink(input_file, output_file, log)


@posttask(partial(done_task, 'merge_pages_ghostscript'))
@active_if(options.output_type == 'pdfa')
@merge(
input=[add_text_layer, render_hocr_debug_page, skip_page,
Expand Down Expand Up @@ -1255,6 +1285,7 @@ def input_file_order(s):
ghostscript.generate_pdfa(pdf_pages, output_file, options.jobs or 1)


@posttask(partial(done_task, 'merge_pages_qpdf'))
@active_if(options.output_type == 'pdf')
@merge(
input=[add_text_layer, render_hocr_debug_page, skip_page,
Expand Down Expand Up @@ -1301,6 +1332,7 @@ def input_file_order(s):
qpdf.merge(pdf_pages, output_file)


@posttask(partial(done_task, 'copy_final'))
@merge(
input=[merge_pages_ghostscript, merge_pages_qpdf],
output=options.output_file,
Expand Down Expand Up @@ -1501,7 +1533,10 @@ def run_pipeline():
_log.info("Output sent to stdout")

with _pdfinfo_lock:
_log.debug(_pdfinfo)
if options.verbose:
from pprint import pformat
referent = _pdfinfo._getvalue() # get the real list out of proxy
_log.debug(pformat(referent))
direction = {0: 'n', 90: 'e',
180: 's', 270: 'w'}
orientations = []
18 changes: 12 additions & 6 deletions ocrmypdf/leptonica.py
Expand Up @@ -5,7 +5,6 @@
#
# Python FFI wrapper for Leptonica library

from __future__ import print_function, absolute_import, division
import argparse
import sys
import os
Expand Down Expand Up @@ -45,23 +44,30 @@ def __enter__(self):
self.tmpfile = TemporaryFile()

# Save the old stderr, and redirect stderr to temporary file
sys.stderr.flush()
try:
self.old_stderr_fileno = os.dup(sys.stderr.fileno())
os.dup2(self.tmpfile.fileno(), sys.stderr.fileno())
self.copy_of_stderr = os.dup(sys.stderr.fileno())
os.dup2(self.tmpfile.fileno(), sys.stderr.fileno(),
inheritable=False)
except UnsupportedOperation:
self.old_stderr_fileno = None
self.copy_of_stderr = None
return

def __exit__(self, exc_type, exc_value, traceback):
# Restore old stderr
if self.old_stderr_fileno is not None:
os.dup2(self.old_stderr_fileno, sys.stderr.fileno())
sys.stderr.flush()
if self.copy_of_stderr is not None:
os.dup2(self.copy_of_stderr, sys.stderr.fileno())
os.close(self.copy_of_stderr)

# Get data from tmpfile (in with block to ensure it is closed)
with self.tmpfile as tmpfile:
tmpfile.seek(0) # Cursor will be at end, so move back to beginning
leptonica_output = tmpfile.read().decode(errors='replace')

assert self.tmpfile.closed
assert not sys.stderr.closed

# If there are Python errors, let them bubble up
if exc_type:
logger.warning(leptonica_output)