Additional docs updates for v4.4
James R. Barlow committed Jan 27, 2017
1 parent 9a15a4d commit 5480da4
Showing 6 changed files with 149 additions and 45 deletions.
3 changes: 2 additions & 1 deletion RELEASE_NOTES.rst
@@ -10,11 +10,12 @@ v4.4:

+ A new rendering option ``--pdf-renderer tess4`` exploits Tesseract 4's new text-only output PDF mode. See the documentation on PDF Renderers for details.
+ The ``--tesseract-oem`` argument allows control over the Tesseract 4 OCR
engine mode.
engine mode (tesseract's ``--oem``). Use ``--tesseract-oem 2`` to enforce the new LSTM mode.
+ Fixed poor performance with Tesseract 4.00 on Linux

- Fixed an issue that caused corruption of output to stdout in some cases
- Removed test for Pillow JPEG and PNG support, as the minimum supported version of Pillow now enforces this
- OCRmyPDF now tests that the intended destination file is writable before proceeding
- Significant code reorganization to make OCRmyPDF re-entrant and improve performance. All changes should be backward compatible for the v4.x series.

+ However, OCRmyPDF's dependency "ruffus" is not re-entrant, so no Python API is available. Scripts should continue to use the command line interface.
124 changes: 124 additions & 0 deletions docs/batch.rst
@@ -0,0 +1,124 @@
Batch processing
================

This article provides information about running OCRmyPDF on multiple files or configuring it as a service triggered by file system events.

Batch jobs
----------

Consider using the excellent `GNU Parallel <https://www.gnu.org/software/parallel/>`_ to apply OCRmyPDF to multiple files at once.

Both ``parallel`` and ``ocrmypdf`` will try to use all available processors. To maximize parallelism without overloading your system with processes, consider using ``parallel -j 2`` to limit parallel to running two jobs at once.

This command will run ocrmypdf on all files named ``*.pdf`` in the current directory and write the results to the previously created ``output/`` folder.

.. code-block:: bash

    parallel -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf

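Where GNU Parallel is not available, the same limit of two jobs at a time can be approximated from Python. This is a minimal sketch, not part of OCRmyPDF: the names ``build_command`` and ``run_batch`` are hypothetical, and it assumes ``ocrmypdf`` is on the ``PATH`` and that the ``output/`` folder already exists.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(src, out_dir="output"):
    """Return the ocrmypdf command line for one input file (hypothetical helper)."""
    src = Path(src)
    # Same layout as the parallel recipe: output/<input file name>
    return ["ocrmypdf", str(src), str(Path(out_dir) / src.name)]


def run_batch(paths, out_dir="output", max_jobs=2, runner=subprocess.call):
    """Run ocrmypdf on each path, at most max_jobs at a time.

    Returns a dict mapping each input path to the exit code of its job.
    """
    paths = [str(p) for p in paths]
    commands = [build_command(p, out_dir) for p in paths]
    # Threads are sufficient here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        exit_codes = list(pool.map(runner, commands))
    return dict(zip(paths, exit_codes))
```

Calling ``run_batch(Path('.').glob('*.pdf'))`` would then process every PDF in the current directory. The ``runner`` parameter exists only so the sketch can be exercised without invoking OCR.
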
Sample script
"""""""""""""

This user-contributed script also provides an example of batch processing.

.. code-block:: python

    #!/usr/bin/env python3
    # Walk through directory tree, replacing all files with OCR'd version
    # Contributed by DeliciousPickle@github

    import logging
    import os
    import subprocess
    import sys

    script_dir = os.path.dirname(os.path.realpath(__file__))
    print(script_dir + '/ocr-tree.py: Start')

    if len(sys.argv) > 1:
        start_dir = sys.argv[1]
    else:
        start_dir = '.'

    if len(sys.argv) > 2:
        log_file = sys.argv[2]
    else:
        log_file = script_dir + '/ocr-tree.log'

    logging.basicConfig(
        level=logging.INFO, format='%(asctime)s %(message)s',
        filename=log_file, filemode='w')

    # os.walk assumes the working directory does not change while it runs,
    # so pass full paths to ocrmypdf instead of calling os.chdir().
    for dir_name, subdirs, file_list in os.walk(start_dir):
        logging.info('\n')
        logging.info(dir_name + '\n')
        for filename in file_list:
            file_ext = os.path.splitext(filename)[1]
            if file_ext == '.pdf':
                full_path = os.path.join(dir_name, filename)
                print(full_path)
                # Input and output are the same file: replace it in place
                cmd = ["ocrmypdf", "--deskew", full_path, full_path]
                logging.info(cmd)
                proc = subprocess.Popen(
                    cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
                # communicate() waits for exit, so returncode is populated
                result, _ = proc.communicate()
                if proc.returncode == 6:
                    print("Skipped document because it already contained text")
                elif proc.returncode == 0:
                    print("OCR complete")
                logging.info(result)

API
"""

OCRmyPDF is currently supported as a command line interface. Due to limitations in one of the libraries OCRmyPDF depends on, it is not yet usable as an API.
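Until an API exists, a script can wrap the command line interface in a small function. The following is a hedged sketch: ``ocr_file`` and ``describe_exit_code`` are hypothetical names, and the only exit codes it interprets are the two used by the sample script (0 for success, 6 for a document that already contains text); any other code is reported generically.

```python
import subprocess

# Exit codes as used in the sample script; everything else is treated
# as an unspecified failure in this sketch.
EXIT_OK = 0
EXIT_ALREADY_HAS_TEXT = 6


def describe_exit_code(code):
    """Translate an ocrmypdf exit code into a short message."""
    if code == EXIT_OK:
        return "OCR complete"
    if code == EXIT_ALREADY_HAS_TEXT:
        return "skipped: document already contains text"
    return "failed with exit code {}".format(code)


def ocr_file(input_pdf, output_pdf, options=(), runner=subprocess.run):
    """Run the ocrmypdf command line tool on one file.

    Returns a (exit_code, message) tuple. `options` is a sequence of extra
    command line arguments such as ["--deskew"].
    """
    cmd = ["ocrmypdf", *options, str(input_pdf), str(output_pdf)]
    completed = runner(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return completed.returncode, describe_exit_code(completed.returncode)
```

For example, ``ocr_file("scan.pdf", "scan-ocr.pdf", options=["--deskew"])`` runs one job and reports how it ended.
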


Huge batch jobs
"""""""""""""""

If you have thousands of files to work with, contact the author.


Hot (watched) folders
---------------------

To set up a "hot folder" that will trigger OCR for every file inserted, use a program like Python `watchdog <https://pypi.python.org/pypi/watchdog>`_ (supports all major OS).

One could then configure a scanner to automatically place scanned files in a hot folder, so that they will be queued for OCR and copied to the destination.

.. code-block:: bash

    pip install watchdog

watchdog installs the command line program ``watchmedo``, which can be told to run ``ocrmypdf`` on any .pdf added to the current directory (``.``) and place the result in the previously created ``out/`` folder.

.. code-block:: bash

    cd hot-folder
    mkdir out
    watchmedo shell-command \
        --patterns="*.pdf" \
        --ignore-directories \
        --command='ocrmypdf "${watch_src_path}" "out/${watch_src_path}" ' \
        .  # don't forget the final dot

For more complex behavior, you can write a Python script that uses the watchdog API.

On file servers, you could configure watchmedo as a system service so it will run all the time.

Caveats
"""""""

* ``watchmedo`` may not work properly on a networked file system, depending on the capabilities of the file system client and server.
* This simple recipe does not filter for the type of file system event, so file copies, deletes and moves, and directory operations, will all be sent to ocrmypdf, producing errors in several cases. Disable your watched folder if you are doing anything other than copying files to it.
* If the source and destination directory are the same, watchmedo may create an infinite loop.
* On BSD, FreeBSD and older versions of macOS, you may need to increase the number of file descriptors to monitor more files, using ``ulimit -n 1024`` to watch a folder of up to 1024 files.
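Some of these caveats can be reduced by using the watchdog API directly. The sketch below is an illustration under stated assumptions, not a tested recipe: ``output_path_for``, ``should_process`` and ``watch`` are hypothetical names, ``ocrmypdf`` is assumed to be on the ``PATH``, and the output folder is assumed to exist. It reacts only to file creation events and refuses files located under the output folder. It does not wait for a file to finish copying, so a slow copy may still be picked up too early.

```python
import subprocess
import time
from pathlib import Path


def output_path_for(src_path, out_dir):
    """Map a new input file to a file of the same name in out_dir."""
    return Path(out_dir) / Path(src_path).name


def should_process(src_path, out_dir):
    """Accept only PDFs that are not located under the output folder."""
    src = Path(src_path).resolve()
    out = Path(out_dir).resolve()
    return src.suffix.lower() == ".pdf" and out not in src.parents


def watch(hot_dir, out_dir):
    """Run ocrmypdf on each PDF created in hot_dir until interrupted."""
    # Imported here so the helpers above work without watchdog installed.
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class PdfCreatedHandler(FileSystemEventHandler):
        # Only on_created is overridden, so moves, deletes and
        # modifications never reach ocrmypdf.
        def on_created(self, event):
            if event.is_directory:
                return
            if not should_process(event.src_path, out_dir):
                return
            target = output_path_for(event.src_path, out_dir)
            subprocess.call(["ocrmypdf", event.src_path, str(target)])

    observer = Observer()
    # recursive=False: an out/ folder inside hot_dir is not watched
    observer.schedule(PdfCreatedHandler(), str(hot_dir), recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```

Calling ``watch("hot-folder", "hot-folder/out")`` mirrors the ``watchmedo`` recipe. Because ``should_process`` rejects anything under the output folder, passing the same directory for both arguments processes nothing rather than looping.
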

Alternatives
""""""""""""

* `Watchman <https://facebook.github.io/watchman/>`_ is a more powerful alternative to ``watchmedo``.


2 changes: 1 addition & 1 deletion docs/conf.py
@@ -52,7 +52,7 @@

# General information about the project.
project = 'ocrmypdf'
copyright = '2016, James R. Barlow'
copyright = '2017, James R. Barlow'
author = 'James R. Barlow'

# The version info for the project you're documenting, acts as replacement for
60 changes: 19 additions & 41 deletions docs/cookbook.rst
@@ -59,7 +59,7 @@ By default OCRmyPDF assumes the document is English.
ocrmypdf -l fre LeParisien.pdf LeParisien.pdf
ocrmypdf -l eng+fre Bilingual-English-French.pdf Bilingual-English-French.pdf
Language packs must be installed for all languages specified. See :ref:`Installing additional language packs <lang-packs>`.
Language packs must be installed for all languages specified. See :ref:`Installing additional language packs <languages>`.


Produce PDF and text file containing OCR text
@@ -82,7 +82,13 @@ Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to
img2pdf my-images*.jpg | ocrmypdf - myfile.pdf
If given a single image as input, OCRmyPDF will try converting it to a PDF on its own. This feature may be removed at some point, because OCRmyPDF does not specialize in converting images to PDFs.
If given a single image as input, OCRmyPDF will try converting it to a PDF on its own. If the DPI specified in the image is incorrect, it can be overridden with ``--image-dpi``:

.. code-block:: bash
ocrmypdf --image-dpi 300 image.png myfile.pdf
This feature may be removed at some point, because OCRmyPDF does not specialize in converting images to PDFs.

You can also use Tesseract 3.04+ directly to convert single page images or multi-page TIFFs to PDF:

@@ -109,56 +115,28 @@ OCRmyPDF perform some image processing on each page of a PDF, if desired. The s
OCR and correct document skew (crooked scan)
""""""""""""""""""""""""""""""""""""""""""""

.. code-block:: bash
ocrmypdf --deskew input.pdf output.pdf
Hot (watched) folders
---------------------

To set up a "hot folder" that will trigger an OCR operation for every file inserted, use a program like Python `watchdog <https://pypi.python.org/pypi/watchdog>`_ (supports all major OS).
Deskew:

.. code-block:: bash
pip install watchdog
ocrmypdf --deskew input.pdf output.pdf
watchdog installs the command line program ``watchmedo``, which can be told to run ``ocrmypdf`` on any .pdf added to the current directory (``.``) and place the result in the previously created ``out/`` folder.
Image processing commands can be combined. The order in which options are given does not matter. OCRmyPDF always applies the steps of the image processing pipeline in the same order (rotate, remove background, deskew, clean).

.. code-block:: bash
cd hot-folder
mkdir out
watchmedo shell-command \
--patterns="*.pdf" \
--ignore-directories \
--command='ocrmypdf "${watch_src_path}" "out/${watch_src_path}" ' \
. # don't forget the final dot
For more complex behavior you can write a Python script around to use the watchdog API.

On file servers, you could configure watchmedo as a system service so it will run all the time.

Caveats
"""""""
ocrmypdf --deskew --clean --rotate-pages input.pdf output.pdf
* ``watchmedo`` may not work properly on a networked file system, depending on the capabilities of the file system client and server.
* This simple recipe does not filter for the type of file system event, so file copies, deletes and moves, and directory operations, will all be sent to ocrmypdf, producing errors in several cases. Disable your watched folder if you are doing anything other than copying files to it.
* If the source and destination directory are the same, watchmedo may create an infinite loop.
Control of OCR options
----------------------

By default, OCRmyPDF permits tesseract to run for only three minutes (180 seconds) per page. This is usually more than enough time to find all text on a reasonably sized page with modern hardware. A skipped page will be inserted into the output without any OCR text.

Batch jobs
----------

Consider using the excellent `GNU Parallel <https://www.gnu.org/software/parallel/>`_ to apply OCRmyPDF to multiple files at once.

Both ``parallel`` and ``ocrmypdf`` will try to use all available processors. To maximize parallelism without overloading your system with processes, consider using ``parallel -j 2`` to limit parallel to running two jobs at once.

This command will run all ocrmypdf all files named ``*.pdf`` in the current directory and write them to the previous created ``output/`` folder.
If you want to adjust the amount of time spent on OCR, change ``--tesseract-timeout``. You can also automatically skip images that exceed a certain number of megapixels.

.. code-block:: bash
parallel -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf
# Allow 300 seconds for OCR; skip any page larger than 50 megapixels
ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
If you have thousands of files to work with, contact the author.

1 change: 1 addition & 0 deletions docs/index.rst
@@ -20,6 +20,7 @@ Contents:
installation
languages
cookbook
batch
renderers
security
errors
4 changes: 2 additions & 2 deletions docs/languages.rst
@@ -23,8 +23,8 @@ Debian and Ubuntu users
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple
languages can be requested using either ``-l eng+fre`` (English and French) or ``-l eng -l fre``.

Mac OS X (macOS) users
----------------------
macOS users
-----------

You can install additional language packs by :ref:`installing Tesseract using Homebrew with all language packs <macos-all-languages>`.
