
OCRmyFiles

Bash script for adding a text layer to PDF files and converting images in PDFs (with OCR).

Adds an OCR text layer to all PDF files in the given input directory and saves the new PDF files to the output directory.

If the input directory also contains image files (e.g. jpg, png), they are converted to OCR'ed PDFs.

All other file types are just copied from the input directory to the output directory.
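
The three cases above can be sketched as a small dispatch on the file extension. This is only an illustration, not the code from OCRmyFiles.sh: the function names `classify_file` and `process_file` and the list of image extensions are assumptions, and the `ocrmypdf` and `tesseract` calls are shown without any of the extra arguments the real script may pass.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print "ocr", "image" or "copy" for a given file name.
classify_file() {
    local name ext
    name=$(basename -- "$1")
    ext="${name##*.}"
    # A name without a dot has no extension.
    if [ "$ext" = "$name" ]; then ext=""; fi
    # Compare case-insensitively, so "SCAN.PDF" is treated like "scan.pdf".
    ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        pdf)                   echo "ocr" ;;
        jpg|jpeg|png|tif|tiff) echo "image" ;;   # assumed set of image types
        *)                     echo "copy" ;;
    esac
}

# Hypothetical helper: process one input file into the output directory.
process_file() {
    local in="$1" out_dir="$2" name
    name=$(basename -- "$in")
    case "$(classify_file "$in")" in
        # PDF: add a text layer with OCRmyPDF.
        ocr)   ocrmypdf "$in" "$out_dir/$name" ;;
        # Image: Tesseract's "pdf" config writes <output base>.pdf.
        image) tesseract "$in" "$out_dir/${name%.*}" pdf ;;
        # Anything else is copied unchanged.
        copy)  cp -- "$in" "$out_dir/$name" ;;
    esac
}
```

All variables are quoted so that file and directory names containing spaces are handled correctly.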

Requirements

  • OCRmyPDF
  • Tesseract

Usage

  • Download the script or clone the repository
  • Make the script executable: sudo chmod +x OCRmyFiles.sh
  • Modify the script to fit your needs:
    • Set default input/output directories
    • Modify the OCRmyPDF command line arguments (the OCRmyPDF documentation gives an overview of the available arguments)
    • Modify the Tesseract command line arguments (the Tesseract documentation gives an overview of the available arguments)
  • Call the script:
    • OCRmyFiles.sh (no parameter): using default directories for input/output (as defined in the script itself)
    • OCRmyFiles.sh <inputDir> <outputDir>: using specified directories for input/output
  • The script might print some warnings/errors from Tesseract. These can be ignored in most cases, as the OCR text layer is created anyway.
  • You can also call this script with a cronjob for automated processing of PDFs/images:
    • As the user who should run the cronjob, call crontab -e
    • Add the following to run the script e.g. every 30 minutes: */30 * * * * /path/to/the/script/OCRmyFiles.sh > /dev/null 2>&1
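
The two calling conventions listed above (no parameters, or an input and an output directory) could be handled roughly as follows. This is a minimal sketch under stated assumptions, not the logic of the actual script: `resolve_dirs`, `DEFAULT_INPUT_DIR`, `DEFAULT_OUTPUT_DIR` and the placeholder paths are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical defaults; the real script defines its own values.
DEFAULT_INPUT_DIR="/path/to/input"
DEFAULT_OUTPUT_DIR="/path/to/output"

# Sets input_dir and output_dir from the given arguments.
# Returns 1 if the number of arguments is neither 0 nor 2.
resolve_dirs() {
    case "$#" in
        0)  # No parameters: fall back to the defaults.
            input_dir=$DEFAULT_INPUT_DIR
            output_dir=$DEFAULT_OUTPUT_DIR ;;
        2)  # Both directories given explicitly.
            input_dir=$1
            output_dir=$2 ;;
        *)  echo "Usage: OCRmyFiles.sh [<inputDir> <outputDir>]" >&2
            return 1 ;;
    esac
}
```

A script would call `resolve_dirs "$@" || exit 1` near the top and then use `"$input_dir"` and `"$output_dir"`, always quoted so that directories with spaces keep working. Because the cron entry passes no parameters, a cron run would use the defaults in this sketch.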