Additional docs updates for v4.4

ocrmypdf · Jan 27, 2017 · 5480da4
1 parent 9a15a4d
commit 5480da4

Showing 6 changed files with 149 additions and 45 deletions.
diff --git a/RELEASE_NOTES.rst b/RELEASE_NOTES.rst
@@ -10,11 +10,12 @@ v4.4:
 
    +  A new rendering option ``--pdf-renderer tess4`` exploits Tesseract 4's new text-only output PDF mode. See the documentation on PDF Renderers for details.
    +  The ``--tesseract-oem`` argument allows control over the Tesseract 4 OCR 
-   engine mode.
+   engine mode (tesseract's ``--oem``). Use ``--tesseract-oem 2`` to enforce the new LSTM mode.
    +  Fixed poor performance with Tesseract 4.00 on Linux
 
 -  Fixed an issue that caused corruption of output to stdout in some cases
 -  Removed test for Pillow JPEG and PNG support, as the minimum supported version of Pillow now enforces this
+-  OCRmyPDF now tests that the intended destination file is writable before proceeding
 -  Significant code reorganization to make OCRmyPDF re-entrant and improve performance. All changes should be backward compatible for the v4.x series.
 
    + However, OCRmyPDF's dependency "ruffus" is not re-entrant, so no Python API is available. Scripts should continue to use the command line interface.

diff --git a/docs/batch.rst b/docs/batch.rst
@@ -0,0 +1,124 @@
+Batch processing
+================
+
+This article provides information about running OCRmyPDF on multiple files or configuring it as a service triggered by file system events.
+
+Batch jobs
+----------
+
+Consider using the excellent `GNU Parallel <https://www.gnu.org/software/parallel/>`_ to apply OCRmyPDF to multiple files at once.
+
+Both ``parallel`` and ``ocrmypdf`` will try to use all available processors. To maximize parallelism without overloading your system with processes, consider using ``parallel -j 2`` to limit parallel to running two jobs at once.
+
+This command will run ocrmypdf on all files named ``*.pdf`` in the current directory and write the results to the previously created ``output/`` folder.
+
+.. code-block:: bash
+
+	parallel -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf
+
+Sample script
+"""""""""""""
+
+This user-contributed script also provides an example of batch processing.
+
+.. code-block:: python
+
+	#!/usr/bin/env python3
+	# Walk through directory tree, replacing all files with OCR'd version
+	# Contributed by DeliciousPickle@github
+
+	import logging
+	import os
+	import subprocess
+	import sys
+
+	script_dir = os.path.dirname(os.path.realpath(__file__))
+	print(script_dir + '/ocr-tree.py: Start')
+
+	if len(sys.argv) > 1:
+	    start_dir = sys.argv[1]
+	else:
+	    start_dir = '.'
+
+	if len(sys.argv) > 2:
+	    log_file = sys.argv[2]
+	else:
+	    log_file = script_dir + '/ocr-tree.log'
+
+	logging.basicConfig(
+	    level=logging.INFO, format='%(asctime)s %(message)s',
+	    filename=log_file, filemode='w')
+
+	for dir_name, subdirs, file_list in os.walk(start_dir):
+	    logging.info('\n')
+	    logging.info(dir_name + '\n')
+	    for filename in file_list:
+	        file_ext = os.path.splitext(filename)[1]
+	        if file_ext == '.pdf':
+	            # Full path avoids os.chdir(), which breaks a relative os.walk()
+	            full_path = os.path.join(dir_name, filename)
+	            print(full_path)
+	            cmd = ["ocrmypdf", "--deskew", full_path, full_path]
+	            logging.info(cmd)
+	            proc = subprocess.Popen(
+	                cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+	            result, _ = proc.communicate()  # waits, so returncode is set
+	            if proc.returncode == 6:
+	                print("Skipped document because it already contained text")
+	            elif proc.returncode == 0:
+	                print("OCR complete")
+	            logging.info(result)
+
+API
+"""
+
+OCRmyPDF is currently supported as a command line interface. Due to limitations in one of the libraries OCRmyPDF depends on, it is not yet usable as an API.
+
+
+Huge batch jobs
+"""""""""""""""
+
+If you have thousands of files to work with, contact the author.
+
+
+Hot (watched) folders
+---------------------
+
+To set up a "hot folder" that will trigger OCR for every file inserted, use a program like Python `watchdog <https://pypi.python.org/pypi/watchdog>`_ (supports all major OS).
+
+One could then configure a scanner to automatically place scanned files in a hot folder, so that they will be queued for OCR and copied to the destination.
+
+.. code-block:: bash
+
+	pip install watchdog
+
+watchdog installs the command line program ``watchmedo``, which can be told to run ``ocrmypdf`` on any .pdf added to the current directory (``.``) and place the result in the previously created ``out/`` folder.
+
+.. code-block:: bash
+
+	cd hot-folder
+	mkdir out
+	watchmedo shell-command \
+		--patterns="*.pdf" \
+		--ignore-directories \
+		--command='ocrmypdf "${watch_src_path}" "out/${watch_src_path}" ' \
+		.  # don't forget the final dot
+
+For more complex behavior, you can write a Python script that uses the watchdog API.
+
+On file servers, you could configure watchmedo as a system service so it will run all the time.
+
+Caveats
+"""""""
+
+* ``watchmedo`` may not work properly on a networked file system, depending on the capabilities of the file system client and server.
+* This simple recipe does not filter for the type of file system event, so file copies, deletes and moves, and directory operations, will all be sent to ocrmypdf, producing errors in several cases. Disable your watched folder if you are doing anything other than copying files to it.
+* If the source and destination directory are the same, watchmedo may create an infinite loop.
+* On BSD, FreeBSD and older versions of macOS, you may need to increase the number of file descriptors to monitor more files, using ``ulimit -n 1024`` to watch a folder of up to 1024 files.
+
+Alternatives
+""""""""""""
+
+* `Watchman <https://facebook.github.io/watchman/>`_ is a more powerful alternative to ``watchmedo``.
+
+
diff --git a/docs/conf.py b/docs/conf.py
@@ -52,7 +52,7 @@
 
 # General information about the project.
 project = 'ocrmypdf'
-copyright = '2016, James R. Barlow'
+copyright = '2017, James R. Barlow'
 author = 'James R. Barlow'
 
 # The version info for the project you're documenting, acts as replacement for

diff --git a/docs/cookbook.rst b/docs/cookbook.rst
@@ -59,7 +59,7 @@ By default OCRmyPDF assumes the document is English.
 	ocrmypdf -l fre LeParisien.pdf LeParisien.pdf
 	ocrmypdf -l eng+fre Bilingual-English-French.pdf Bilingual-English-French.pdf
 
-Language packs must be installed for all languages specified. See :ref:`Installing additional language packs <lang-packs>`.
+Language packs must be installed for all languages specified. See :ref:`Installing additional language packs <languages>`.
 
 
 Produce PDF and text file containing OCR text
@@ -82,7 +82,13 @@ Use a program like `img2pdf <https://gitlab.mister-muffin.de/josch/img2pdf>`_ to
 
 	img2pdf my-images*.jpg | ocrmypdf - myfile.pdf
 
-If given a single image as input, OCRmyPDF will try converting it to a PDF on its own.  This feature may be removed at some point, because OCRmyPDF does not specialize in converting images to PDFs.
+If given a single image as input, OCRmyPDF will try converting it to a PDF on its own.  If the DPI specified in the image is incorrect, it can be overridden with ``--image-dpi``:
+
+.. code-block:: bash
+
+	ocrmypdf --image-dpi 300 image.png myfile.pdf
+
+This feature may be removed at some point, because OCRmyPDF does not specialize in converting images to PDFs.
 
 You can also use Tesseract 3.04+ directly to convert single page images or multi-page TIFFs to PDF:
 
@@ -109,56 +115,28 @@ OCRmyPDF perform some image processing on each page of a PDF, if desired.  The s
 OCR and correct document skew (crooked scan)
 """"""""""""""""""""""""""""""""""""""""""""
 
-.. code-block:: bash
-
-	ocrmypdf --deskew input.pdf output.pdf
-
-
-Hot (watched) folders
----------------------
-
-To set up a "hot folder" that will trigger an OCR operation for every file inserted, use a program like Python `watchdog <https://pypi.python.org/pypi/watchdog>`_ (supports all major OS).
+Deskew:
 
 .. code-block:: bash
 
-	pip install watchdog
+	ocrmypdf --deskew input.pdf output.pdf
 
-watchdog installs the command line program ``watchmedo``, which can be told to run ``ocrmypdf`` on any .pdf added to the current directory (``.``) and place the result in the previously created ``out/`` folder.
+Image processing commands can be combined. The order in which options are given does not matter. OCRmyPDF always applies the steps of the image processing pipeline in the same order (rotate, remove background, deskew, clean).
 
 .. code-block:: bash
 
-	cd hot-folder
-	mkdir out
-	watchmedo shell-command \
-		--patterns="*.pdf" \
-		--ignore-directories \
-		--command='ocrmypdf "${watch_src_path}" "out/${watch_src_path}" ' \
-		.  # don't forget the final dot
-
-For more complex behavior you can write a Python script around to use the watchdog API.
-
-On file servers, you could configure watchmedo as a system service so it will run all the time.
-
-Caveats
-"""""""
+	ocrmypdf --deskew --clean --rotate-pages input.pdf output.pdf
 
-* ``watchmedo`` may not work properly on a networked file system, depending on the capabilities of the file system client and server.
-* This simple recipe does not filter for the type of file system event, so file copies, deletes and moves, and directory operations, will all be sent to ocrmypdf, producing errors in several cases. Disable your watched folder if you are doing anything other than copying files to it.
-* If the source and destination directory are the same, watchmedo may create an infinite loop.
+Control of OCR options
+----------------------
 
+By default, OCRmyPDF permits tesseract to run for only three minutes (180 seconds) per page. This is usually more than enough time to find all text on a reasonably sized page with modern hardware. If a page exceeds the timeout, it is skipped and inserted into the output without any OCR text.
 
-Batch jobs
-----------
-
-Consider using the excellent `GNU Parallel <https://www.gnu.org/software/parallel/>`_ to apply OCRmyPDF to multiple files at once.
-
-Both ``parallel`` and ``ocrmypdf`` will try to use all available processors. To maximize parallelism without overloading your system with processes, consider using ``parallel -j 2`` to limit parallel to running two jobs at once.
-
-This command will run all ocrmypdf all files named ``*.pdf`` in the current directory and write them to the previous created ``output/`` folder.
+If you want to adjust the amount of time spent on OCR, change ``--tesseract-timeout``.  You can also use ``--skip-big`` to automatically skip images that exceed a certain number of megapixels.
 
 .. code-block:: bash
 
-	parallel -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf
+	# Allow 300 seconds for OCR; skip any page larger than 50 megapixels
+	ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
+
 
-If you have thousands of files to work with, contact the author.
-
diff --git a/docs/index.rst b/docs/index.rst
@@ -20,6 +20,7 @@ Contents:
    installation
    languages
    cookbook
+   batch
    renderers
    security
    errors

diff --git a/docs/languages.rst b/docs/languages.rst
@@ -23,8 +23,8 @@ Debian and Ubuntu users
 You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple
 languages can be requested using either ``-l eng+fre`` (English and French) or ``-l eng -l fre``.
 
-Mac OS X (macOS) users
-----------------------
+macOS users
+-----------
 
 You can install additional language packs by :ref:`installing Tesseract using Homebrew with all language packs <macos-all-languages>`.