watcher: Add an option to archive processed originals (#951)
* watcher: Add an option to archive processed originals

This adds a feature found in existing OCRmyPDF watchdog Docker containers such as meyay/ocrmypdf-batch and unze/ocrmypdf-watchdog. With this option, the input directory can be kept clear of already processed files without losing the originals.

* docs: Improve watcher.py's Docker parameters documentation
bllngr committed Jun 17, 2022
1 parent d8753dc commit 7cabbb1
Showing 2 changed files with 26 additions and 11 deletions.
19 changes: 12 additions & 7 deletions docs/batch.rst
@@ -124,7 +124,9 @@ Users may need to customize the script to meet their requirements.

     "OCR_INPUT_DIRECTORY", "Set input directory to monitor (recursive)"
     "OCR_OUTPUT_DIRECTORY", "Set output directory (should not be under input)"
+    "OCR_ARCHIVE_DIRECTORY", "Set archive directory for processed originals (should not be under input, requires ``OCR_ON_SUCCESS_ARCHIVE`` to be set)"
     "OCR_ON_SUCCESS_DELETE", "This will delete the input file if the exit code is 0 (OK)"
+    "OCR_ON_SUCCESS_ARCHIVE", "This will move the processed original file to ``OCR_ARCHIVE_DIRECTORY`` if the exit code is 0 (OK). Note that ``OCR_ON_SUCCESS_DELETE`` takes precedence over this option, i.e. if both options are set, the input file will be deleted."
     "OCR_OUTPUT_DIRECTORY_YEAR_MONTH", "This will place files in the output in ``{output}/{year}/{month}/{filename}``"
     "OCR_DESKEW", "Apply deskew to crooked input PDFs"
     "OCR_JSON_SETTINGS", "A JSON string specifying any other arguments for ``ocrmypdf.ocr``, e.g. ``'OCR_JSON_SETTINGS={""rotate_pages"": true}'``."
@@ -144,27 +146,30 @@ The watcher service is included in the OCRmyPDF Docker image. To run it:
     docker run \
         -v <path to files to convert>:/input \
         -v <path to store results>:/output \
+        -v <path to store processed originals>:/archive \
         -e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
-        -e OCR_ON_SUCCESS_DELETE=1 \
+        -e OCR_ON_SUCCESS_ARCHIVE=1 \
         -e OCR_DESKEW=1 \
         -e PYTHONUNBUFFERED=1 \
         -it --entrypoint python3 \
         jbarlow83/ocrmypdf \
         watcher.py

-This service will watch for a file that matches ``/input/\*.pdf`` and will
-convert it to a OCRed PDF in ``/output/``. The parameters to this image are:
+This service will watch for a file that matches ``/input/\*.pdf``,
+convert it to an OCRed PDF in ``/output/``, and move the processed
+original to ``/archive``. The parameters to this image are:

 .. csv-table:: watcher.py parameters for Docker
     :header: "Parameter", "Description"
     :widths: 50, 50

     "``-v <path to files to convert>:/input``", "Files placed in this location will be OCRed"
     "``-v <path to store results>:/output``", "This is where OCRed files will be stored"
-    "``-e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1``", "Define environment variable OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1"
-    "``-e OCR_ON_SUCCESS_DELETE=1``", "Define environment variable"
-    "``-e OCR_DESKEW=1``", "Define environment variable"
-    "``-e PYTHONBUFFERED=1``", "This will force STDOUT to be unbuffered and allow you to see messages in docker logs"
+    "``-v <path to store processed originals>:/archive``", "Archive processed originals here"
+    "``-e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1``", "Define environment variable ``OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1`` to place files in the output in ``{output}/{year}/{month}/{filename}``"
+    "``-e OCR_ON_SUCCESS_ARCHIVE=1``", "Define environment variable ``OCR_ON_SUCCESS_ARCHIVE`` to move processed originals"
+    "``-e OCR_DESKEW=1``", "Define environment variable ``OCR_DESKEW`` to apply deskew to crooked input PDFs"
+    "``-e PYTHONUNBUFFERED=1``", "This will force ``STDOUT`` to be unbuffered and allow you to see messages in docker logs"

 This service relies on polling to check for changes to the filesystem. It
 may not be suitable for some environments, such as filesystems shared on a
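
Review note: the Docker example mounts the archive volume at ``/archive``, while the script in this commit defaults ``OCR_ARCHIVE_DIRECTORY`` to ``/processed``. Unless the image sets the variable elsewhere, the example probably also needs ``-e OCR_ARCHIVE_DIRECTORY=/archive``. The sketch below shows how the three archive-related settings could be resolved from the environment. The body of ``getenv_bool`` is an assumption (the diff shows only its signature), and ``load_archive_settings`` is a hypothetical helper, not part of watcher.py.

```python
import os


def getenv_bool(name: str, default: str = "False") -> bool:
    # Assumed parsing rule: the commit shows only this function's signature.
    return os.getenv(name, default).strip().lower() in ("1", "true", "yes", "y")


def load_archive_settings() -> dict:
    # Hypothetical helper that mirrors the module-level constants in the diff.
    return {
        # Default taken from the diff; note it is '/processed', not '/archive'.
        "archive_directory": os.getenv("OCR_ARCHIVE_DIRECTORY", "/processed"),
        "on_success_delete": getenv_bool("OCR_ON_SUCCESS_DELETE"),
        "on_success_archive": getenv_bool("OCR_ON_SUCCESS_ARCHIVE"),
    }
```

With only ``OCR_ON_SUCCESS_ARCHIVE=1`` set, this resolves the archive directory to ``/processed``, which is why a volume mounted at ``/archive`` would stay empty under these assumptions.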
18 changes: 14 additions & 4 deletions misc/watcher.py
@@ -23,6 +23,7 @@
 import json
 import logging
 import os
+import shutil
 import sys
 import time
 from datetime import datetime
@@ -44,8 +45,10 @@ def getenv_bool(name: str, default: str = 'False'):

 INPUT_DIRECTORY = os.getenv('OCR_INPUT_DIRECTORY', '/input')
 OUTPUT_DIRECTORY = os.getenv('OCR_OUTPUT_DIRECTORY', '/output')
+ARCHIVE_DIRECTORY = os.getenv('OCR_ARCHIVE_DIRECTORY', '/processed')
 OUTPUT_DIRECTORY_YEAR_MONTH = getenv_bool('OCR_OUTPUT_DIRECTORY_YEAR_MONTH')
 ON_SUCCESS_DELETE = getenv_bool('OCR_ON_SUCCESS_DELETE')
+ON_SUCCESS_ARCHIVE = getenv_bool('OCR_ON_SUCCESS_ARCHIVE')
 DESKEW = getenv_bool('OCR_DESKEW')
 OCR_JSON_SETTINGS = json.loads(os.getenv('OCR_JSON_SETTINGS', '{}'))
 POLL_NEW_FILE_SECONDS = int(os.getenv('OCR_POLL_NEW_FILE_SECONDS', '1'))
@@ -108,9 +111,13 @@ def execute_ocrmypdf(file_path):
         deskew=DESKEW,
         **OCR_JSON_SETTINGS,
     )
-    if exit_code == 0 and ON_SUCCESS_DELETE:
-        log.info(f'OCR is done. Deleting: {file_path}')
-        file_path.unlink()
+    if exit_code == 0:
+        if ON_SUCCESS_DELETE:
+            log.info(f'OCR is done. Deleting: {file_path}')
+            file_path.unlink()
+        elif ON_SUCCESS_ARCHIVE:
+            log.info(f'OCR is done. Archiving {file_path.name} to {ARCHIVE_DIRECTORY}')
+            shutil.move(file_path, f'{ARCHIVE_DIRECTORY}/{file_path.name}')
     else:
         log.info('OCR is done')

@@ -135,13 +142,16 @@ def main():
         f"Starting OCRmyPDF watcher with config:\n"
         f"Input Directory: {INPUT_DIRECTORY}\n"
         f"Output Directory: {OUTPUT_DIRECTORY}\n"
-        f"Output Directory Year & Month: {OUTPUT_DIRECTORY_YEAR_MONTH}"
+        f"Output Directory Year & Month: {OUTPUT_DIRECTORY_YEAR_MONTH}\n"
+        f"Archive Directory: {ARCHIVE_DIRECTORY}"
     )
     log.debug(
         f"INPUT_DIRECTORY: {INPUT_DIRECTORY}\n"
         f"OUTPUT_DIRECTORY: {OUTPUT_DIRECTORY}\n"
         f"OUTPUT_DIRECTORY_YEAR_MONTH: {OUTPUT_DIRECTORY_YEAR_MONTH}\n"
+        f"ARCHIVE_DIRECTORY: {ARCHIVE_DIRECTORY}\n"
         f"ON_SUCCESS_DELETE: {ON_SUCCESS_DELETE}\n"
+        f"ON_SUCCESS_ARCHIVE: {ON_SUCCESS_ARCHIVE}\n"
         f"DESKEW: {DESKEW}\n"
         f"ARGS: {OCR_JSON_SETTINGS}\n"
         f"POLL_NEW_FILE_SECONDS: {POLL_NEW_FILE_SECONDS}\n"
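
Review note: the precedence rule described in docs/batch.rst (delete wins over archive, and neither happens unless the exit code is 0) can be exercised in isolation with a standalone sketch like the one below. ``handle_processed_original`` and its return values are hypothetical names chosen for illustration; watcher.py performs these steps inline in ``execute_ocrmypdf`` using module-level constants. Creating the archive directory on demand is an added assumption and is not something the commit does.

```python
import shutil
from pathlib import Path


def handle_processed_original(
    file_path: Path,
    exit_code: int,
    *,
    on_success_delete: bool,
    on_success_archive: bool,
    archive_directory: Path,
) -> str:
    """Dispose of an input file after OCR and report what was done."""
    if exit_code != 0:
        # OCR failed: leave the original untouched in the input directory.
        return "kept"
    if on_success_delete:
        # Delete takes precedence over archive when both options are set.
        file_path.unlink()
        return "deleted"
    if on_success_archive:
        # Assumption: create the archive directory if it is missing.
        archive_directory.mkdir(parents=True, exist_ok=True)
        shutil.move(str(file_path), str(archive_directory / file_path.name))
        return "archived"
    return "kept"
```

The sketch does not handle a file of the same name already present in the archive directory; how that case should behave is left open here.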