HenryKautz/pdf_ocr

PDF OCR Tool

Preprocesses scanned PDFs (removes colored backgrounds, enhances contrast) and runs OCR to produce:

  • <name>.txt — extracted plain text (single column)
  • <name>-ocr.pdf — searchable PDF with embedded text layer
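The naming scheme above can be sketched in a few lines. This is only an illustration of the convention, not the script's actual code; the function name output_paths is hypothetical.

```python
from pathlib import Path


def output_paths(pdf_path, output_dir=None):
    """Return (text_path, searchable_pdf_path) for one input PDF.

    Hypothetical helper: mirrors the <name>.txt / <name>-ocr.pdf convention.
    """
    pdf_path = Path(pdf_path)
    # With no output directory given, write next to the input file.
    target_dir = Path(output_dir) if output_dir is not None else pdf_path.parent
    stem = pdf_path.stem  # "scan.pdf" -> "scan"
    return target_dir / f"{stem}.txt", target_dir / f"{stem}-ocr.pdf"
```

For example, output_paths("scans/report.pdf") yields scans/report.txt and scans/report-ocr.pdf.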

System Prerequisites

You need two system-level packages installed before the Python dependencies:

Tesseract OCR Engine

OS                | Command
Ubuntu / Debian   | sudo apt-get install tesseract-ocr
macOS (Homebrew)  | brew install tesseract
Windows           | Download the installer from https://github.com/UB-Mannheim/tesseract/wiki, then add the install directory to your PATH

For additional languages (e.g. French, German, Spanish):

# Ubuntu/Debian
sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa

# macOS
brew install tesseract-lang

# Windows: select languages during installation

List available languages with: tesseract --list-langs
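If you want to verify a --lang value before starting a long run, something like the following sketch could work. The helper names are hypothetical. It assumes the listing starts with one header line followed by one language code per line, which may differ between Tesseract builds.

```python
import subprocess


def parse_langs(listing):
    """Extract language codes from `tesseract --list-langs` style text."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header, not a language code.
    return set(lines[1:])


def missing_langs(lang_spec, installed):
    """Return the codes in a '+'-separated spec that are not installed."""
    return [code for code in lang_spec.split("+") if code not in installed]


def installed_langs():
    """Run Tesseract and parse its listing (requires tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Fall back to stderr in case a build writes the listing there.
    return parse_langs(result.stdout or result.stderr)
```

Calling missing_langs("eng+fra", installed_langs()) returns an empty list when both language packs are installed.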

Poppler (PDF-to-image conversion)

OS                | Command
Ubuntu / Debian   | sudo apt-get install poppler-utils
macOS (Homebrew)  | brew install poppler
Windows           | Download from https://github.com/oschwartz10612/poppler-windows/releases, then extract and add the bin/ folder to your PATH

Python Setup

# (Optional) create a virtual environment
python -m venv venv
source venv/bin/activate   # Linux/macOS
# venv\Scripts\activate    # Windows

# Install Python dependencies
pip install -r requirements.txt

Usage

# Single file
python pdf_ocr.py scan.pdf

# Multiple files
python pdf_ocr.py scan1.pdf scan2.pdf

# Entire folder (recursive)
python pdf_ocr.py /path/to/pdfs/

# Mix of files and folders
python pdf_ocr.py report.pdf /path/to/more_scans/

# Custom output directory
python pdf_ocr.py scan.pdf -o results/

# Higher DPI for better accuracy on small text
python pdf_ocr.py scan.pdf --dpi 400

# Multiple languages (e.g. English + French)
python pdf_ocr.py scan.pdf --lang eng+fra
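How a mix of files and folders might be expanded into a work list can be sketched as follows. This is an assumption about the behavior, not the script's actual code; collect_pdfs is a hypothetical name, and the case-insensitive extension match is a design choice of the sketch.

```python
from pathlib import Path


def collect_pdfs(inputs):
    """Expand a mix of file and folder paths into a sorted list of PDF paths."""
    found = set()
    for item in map(Path, inputs):
        if item.is_dir():
            # Recurse into subfolders; match the extension case-insensitively.
            found.update(
                p for p in item.rglob("*")
                if p.is_file() and p.suffix.lower() == ".pdf"
            )
        elif item.suffix.lower() == ".pdf":
            found.add(item)
    return sorted(found)
```

One thing to watch with any recursive scan like this: earlier <name>-ocr.pdf outputs stored in the same folder would be picked up again on a second run unless they are filtered out or written elsewhere with -o.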

Options

Flag              | Default             | Description
-o, --output-dir  | Same as input file  | Directory for output files
--dpi             | 300                 | Render resolution (higher = better accuracy, slower)
--lang            | eng                 | Tesseract language code(s), separated by + (e.g. eng+fra)
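For reference, a command-line interface with these flags and defaults could be declared with argparse roughly as below. This is a sketch reconstructed from the table, not the actual contents of pdf_ocr.py; build_parser and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser matching the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        description="Preprocess scanned PDFs and run OCR."
    )
    parser.add_argument("inputs", nargs="+", help="PDF files and/or folders")
    parser.add_argument(
        "-o", "--output-dir", default=None,
        help="directory for output files (default: same as input file)",
    )
    parser.add_argument(
        "--dpi", type=int, default=300,
        help="render resolution (higher = better accuracy, slower)",
    )
    parser.add_argument(
        "--lang", default="eng",
        help="Tesseract language code(s), separated by +",
    )
    return parser
```

Here build_parser().parse_args(["scan.pdf", "--dpi", "400"]) gives dpi as the integer 400 and leaves lang at "eng".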

How Preprocessing Works

Each PDF page goes through these steps before OCR:

  1. Grayscale conversion — removes color information
  2. Background removal — pixels above the 85th percentile brightness are pushed to white, eliminating tinted/colored backgrounds
  3. Contrast stretching — the remaining foreground pixel range is stretched to use the full 0–255 range
  4. Sharpening — enhances edge definition for cleaner character boundaries
  5. Contrast boost — additional 1.5× contrast enhancement
  6. Binarization — converts to pure black-and-white at threshold 180

This pipeline is optimized for scanned documents with colored or shaded backgrounds, watermarks, and low-contrast text.
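To make steps 2, 3, and 6 concrete, here is a minimal pure-Python sketch that works on a flat list of grayscale values. It is not the tool's implementation, which presumably works on image objects. Steps 1, 4, and 5 are omitted. The percentile indexing, the stretch formula, and the rule that values above the threshold become white are all assumptions.

```python
def preprocess(gray, percentile=85, threshold=180):
    """Background removal, contrast stretching, and binarization.

    `gray` is a flat list of 0-255 grayscale values for one page.
    Returns a list of the same length containing only 0 or 255.
    """
    if not gray:
        return []
    ranked = sorted(gray)
    # Step 2: brightness value at the chosen percentile (simple rank index).
    cutoff = ranked[int(percentile / 100 * (len(ranked) - 1))]
    lo = ranked[0]        # darkest pixel, always part of the foreground
    span = cutoff - lo    # foreground range before stretching
    if span == 0:
        # Uniform page: nothing to separate, so treat it as blank paper.
        return [255] * len(gray)
    out = []
    for value in gray:
        if value > cutoff:
            stretched = 255  # Step 2: background pushed to white
        else:
            # Step 3: map the foreground range [lo, cutoff] onto [0, 255].
            stretched = (value - lo) * 255 // span
        # Step 6: binarize (assumption: above the threshold means white).
        out.append(255 if stretched > threshold else 0)
    return out
```

With dark text (value 30) on a tinted background (value 200), preprocess([30, 30] + [200] * 8) maps the text to 0 and the whole background to 255. That is the effect the pipeline aims for on colored paper.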

Troubleshooting

"tesseract is not installed or not in PATH" → Install Tesseract (see above) and ensure it's on your system PATH.

"Unable to get page count" or pdf2image errors → Install Poppler (see above) and ensure pdftoppm is on your PATH.
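Both of the errors above come down to an executable missing from PATH. A quick check along these lines can confirm which one. It is a standalone sketch and not part of pdf_ocr.py; missing_tools is a hypothetical name.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_tools():
        print(f"{name} was not found on PATH")
```

If the script prints nothing, both tools are visible to Python, and the problem is more likely a broken install or a different Python environment.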

Poor OCR accuracy → Try increasing --dpi to 400 or 600. Ensure the correct --lang is set.

Very large PDFs are slow → Higher DPI uses more RAM and CPU. For 100+ page documents at 400 DPI, expect several minutes. Reduce to --dpi 200 for faster processing with slightly lower accuracy.
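The cost of higher DPI grows with the square of the resolution, because both page dimensions scale. The arithmetic can be checked with a short sketch. The US Letter page size (8.5 × 11 inches) is an assumption for illustration; actual scans vary.

```python
def page_pixels(dpi, width_in=8.5, height_in=11.0):
    """Pixel count of one page rendered at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


for dpi in (200, 300, 400, 600):
    pixels = page_pixels(dpi)
    ratio = pixels / page_pixels(300)
    print(f"{dpi} DPI: {pixels:,} pixels ({ratio:.2f}x the 300 DPI default)")
```

Under these assumptions, 400 DPI renders about 1.78 times as many pixels per page as the default, 600 DPI renders 4 times as many, and 200 DPI renders fewer than half.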
