PDF OCR Tool

Preprocesses scanned PDFs (removes colored backgrounds, enhances contrast) and runs OCR to produce:

<name>.txt — extracted plain text (single column)
<name>-ocr.pdf — searchable PDF with embedded text layer

System Prerequisites

You need two system-level packages installed before the Python dependencies:

Tesseract OCR Engine

OS	Command
Ubuntu / Debian	`sudo apt-get install tesseract-ocr`
macOS (Homebrew)	`brew install tesseract`
Windows	Download installer from https://github.com/UB-Mannheim/tesseract/wiki — then add the install directory to your `PATH`

For additional languages (e.g. French, German, Spanish):

# Ubuntu/Debian
sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa

# macOS
brew install tesseract-lang

# Windows: select languages during installation

List available languages with: tesseract --list-langs
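
Before a long run, it can help to confirm that every code passed to `--lang` is actually installed. The sketch below is an illustration and not part of `pdf_ocr.py`. The helper names (`parse_list_langs`, `missing_langs`, `installed_langs`) are hypothetical. It assumes the listing starts with a header line beginning "List of" followed by one language code per line.

```python
import subprocess


def parse_list_langs(output):
    """Return the language codes found in `tesseract --list-langs` output."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # Assumption: the header line starts with "List of"; every other line is a code.
    return {ln for ln in lines if not ln.lower().startswith("list of")}


def missing_langs(lang_arg, available):
    """Return the codes in a '+'-separated --lang value that are not installed."""
    return [code for code in lang_arg.split("+") if code not in available]


def installed_langs():
    """Run Tesseract and parse its listing (raises FileNotFoundError if absent)."""
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the listing to stderr, so read both streams.
    # Stray warning lines are harmless here because the set is only used for lookups.
    return parse_list_langs(proc.stdout + proc.stderr)
```

For example, `missing_langs("eng+fra", installed_langs())` returns an empty list when both language packs are present.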

Poppler (PDF-to-image conversion)

OS	Command
Ubuntu / Debian	`sudo apt-get install poppler-utils`
macOS (Homebrew)	`brew install poppler`
Windows	Download from https://github.com/oschwartz10612/poppler-windows/releases — extract and add the `bin/` folder to your `PATH`
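
To verify both prerequisites in one step, a short check with the standard library's `shutil.which` is enough. This is a sketch, not code taken from `pdf_ocr.py`; `REQUIRED` and `missing_tools` are hypothetical names. It looks for the `tesseract` executable and for Poppler's `pdftoppm`.

```python
import shutil

# Executable name -> package that provides it.
REQUIRED = {"tesseract": "Tesseract OCR engine", "pdftoppm": "Poppler"}


def missing_tools(required=REQUIRED):
    """Return a description of each required executable not found on PATH."""
    return [
        f"{exe} ({package})"
        for exe, package in required.items()
        if shutil.which(exe) is None
    ]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All system prerequisites found.")
```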

Python Setup

# (Optional) create a virtual environment
python -m venv venv
source venv/bin/activate   # Linux/macOS
# venv\Scripts\activate    # Windows

# Install Python dependencies
pip install -r requirements.txt

Usage

# Single file
python pdf_ocr.py scan.pdf

# Multiple files
python pdf_ocr.py scan1.pdf scan2.pdf

# Entire folder (recursive)
python pdf_ocr.py /path/to/pdfs/

# Mix of files and folders
python pdf_ocr.py report.pdf /path/to/more_scans/

# Custom output directory
python pdf_ocr.py scan.pdf -o results/

# Higher DPI for better accuracy on small text
python pdf_ocr.py scan.pdf --dpi 400

# Multiple languages (e.g. English + French)
python pdf_ocr.py scan.pdf --lang eng+fra
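
The way a mix of files and folders could be expanded into a flat list of PDFs can be sketched as follows. This illustrates the behavior described above and is not the tool's actual code. `collect_pdfs` is a hypothetical name, and matching the `.pdf` extension case-insensitively is an assumption.

```python
from pathlib import Path


def collect_pdfs(inputs):
    """Expand files and folders into a list of PDF paths, searching folders recursively."""
    found = []
    for item in map(Path, inputs):
        if item.is_dir():
            # rglob("*") walks all subfolders; keep regular files ending in .pdf.
            found.extend(
                sorted(
                    p
                    for p in item.rglob("*")
                    if p.is_file() and p.suffix.lower() == ".pdf"
                )
            )
        elif item.suffix.lower() == ".pdf":
            # Files are taken as given; existence is not checked in this sketch.
            found.append(item)
    return found
```

Inputs are kept in command-line order, and the files within each folder are sorted so that repeated runs process them in the same sequence.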

Options

Flag	Default	Description
`-o`, `--output-dir`	Same as input file	Directory for output files
`--dpi`	300	Render resolution (higher = better accuracy, slower)
`--lang`	`eng`	Tesseract language code(s), `+`-separated
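
The options above map naturally onto `argparse`. The following is a minimal sketch of an equivalent parser, assuming the defaults listed in the table; it is not copied from `pdf_ocr.py`, and `build_parser` and `output_paths` are hypothetical names. `output_paths` shows how the "same as input file" default could resolve to the `<name>.txt` and `<name>-ocr.pdf` outputs.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with the flags and defaults from the options table."""
    parser = argparse.ArgumentParser(
        prog="pdf_ocr.py", description="Preprocess scanned PDFs and run OCR."
    )
    parser.add_argument("inputs", nargs="+", help="PDF files and/or folders")
    parser.add_argument("-o", "--output-dir", default=None,
                        help="directory for output files (default: next to each input)")
    parser.add_argument("--dpi", type=int, default=300,
                        help="render resolution in dots per inch")
    parser.add_argument("--lang", default="eng",
                        help="Tesseract language code(s), '+'-separated")
    return parser


def output_paths(pdf_path, output_dir=None):
    """Return (<name>.txt, <name>-ocr.pdf) for one input PDF."""
    src = Path(pdf_path)
    # With no -o given, outputs land in the same folder as the input file.
    out_dir = Path(output_dir) if output_dir else src.parent
    return out_dir / f"{src.stem}.txt", out_dir / f"{src.stem}-ocr.pdf"
```

For instance, `build_parser().parse_args(["scan.pdf", "--dpi", "400"])` yields `dpi=400` and `lang="eng"`, with `output_dir` left as `None`.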

How Preprocessing Works

Each PDF page goes through these steps before OCR:

1. Grayscale conversion: removes color information
2. Background removal: pixels above the 85th percentile brightness are pushed to white, eliminating tinted/colored backgrounds
3. Contrast stretching: the remaining foreground pixel range is stretched to use the full 0–255 range
4. Sharpening: enhances edge definition for cleaner character boundaries
5. Contrast boost: additional 1.5× contrast enhancement
6. Binarization: converts to pure black-and-white at threshold 180

This pipeline is optimized for scanned documents with colored or shaded backgrounds, watermarks, and low-contrast text.
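
The arithmetic behind these steps can be sketched in plain Python on a small grid of RGB pixels. This is an illustration under stated assumptions, not the code in `pdf_ocr.py`, which would more plausibly use an imaging library. All function names are hypothetical. The luma weights, the 3×3 sharpening kernel, contrast scaling around the mean brightness, and treating pixels equal to the percentile cutoff as background are assumptions. The 85th percentile, the 1.5× factor, and the threshold of 180 come from the description above.

```python
KERNEL = ((0, -1, 0), (-1, 5, -1), (0, -1, 0))  # assumed sharpening kernel


def clamp(value):
    """Round to an integer and limit it to the 0-255 range."""
    return max(0, min(255, round(value)))


def to_gray(rgb):
    """Grayscale conversion with ITU-R 601 luma weights (assumption)."""
    return [[clamp(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row] for row in rgb]


def remove_background(img, percentile=85):
    """Push pixels at or above the brightness percentile to white."""
    flat = sorted(p for row in img for p in row)
    cutoff = flat[min(len(flat) - 1, len(flat) * percentile // 100)]
    # Assumption: >= rather than > so a perfectly flat background is caught too.
    return [[255 if p >= cutoff else p for p in row] for row in img]


def stretch(img):
    """Stretch the non-white (foreground) range to 0-255; white stays white."""
    fg = [p for row in img for p in row if p < 255]
    if not fg:
        return img
    lo, span = min(fg), max(max(fg) - min(fg), 1)
    return [[p if p == 255 else clamp((p - lo) * 255 / span) for p in row] for row in img]


def sharpen(img):
    """Apply the 3x3 kernel, replicating edge pixels at the borders."""
    h, w = len(img), len(img[0])

    def at(y, x):
        return img[max(0, min(h - 1, y))][max(0, min(w - 1, x))]

    return [[clamp(sum(KERNEL[j][i] * at(y + j - 1, x + i - 1)
                       for j in range(3) for i in range(3)))
             for x in range(w)] for y in range(h)]


def boost_contrast(img, factor=1.5):
    """Scale each pixel's distance from the mean brightness by `factor`."""
    flat = [p for row in img for p in row]
    mean = sum(flat) / len(flat)
    return [[clamp(mean + factor * (p - mean)) for p in row] for row in img]


def binarize(img, threshold=180):
    """Pixels brighter than the threshold become white, the rest black."""
    return [[255 if p > threshold else 0 for p in row] for row in img]


def preprocess(rgb):
    """Run all six steps in order on a grid of (r, g, b) tuples."""
    img = to_gray(rgb)
    for step in (remove_background, stretch, sharpen, boost_contrast, binarize):
        img = step(img)
    return img
```

Called on a grid that is mostly a yellow tint such as `(240, 230, 150)` with a few dark pixels such as `(40, 40, 40)`, `preprocess` returns white for the tinted pixels and black for the dark ones.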

Troubleshooting

"tesseract is not installed or not in PATH" → Install Tesseract (see above) and ensure it's on your system PATH.

"Unable to get page count" or pdf2image errors → Install Poppler (see above) and ensure pdftoppm is on your PATH.

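Poor OCR accuracy → Try increasing --dpi to 400 or 600. Make sure --lang matches the language(s) in the document and that the corresponding language data is installed (check with tesseract --list-langs).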

Very large PDFs are slow → The pixel count per page grows with the square of the DPI, so higher DPI uses more RAM and CPU. For 100+ page documents at 400 DPI, expect several minutes. Reduce to --dpi 200 for faster processing with slightly lower accuracy.
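
To get a feel for how DPI drives memory use, the size of one rendered page can be estimated directly. The sketch below assumes a US Letter page (8.5 × 11 inches) held as uncompressed 8-bit RGB; both are assumptions, the function names are hypothetical, and the tool's real memory footprint will differ.

```python
def page_pixels(dpi, width_in=8.5, height_in=11.0):
    """Pixels in one page rendered at `dpi` (US Letter size by default)."""
    return round(width_in * dpi) * round(height_in * dpi)


def page_megabytes(dpi, bytes_per_pixel=3, width_in=8.5, height_in=11.0):
    """Uncompressed size of one rendered page in MB (3 bytes per pixel for RGB)."""
    return page_pixels(dpi, width_in, height_in) * bytes_per_pixel / 1e6


if __name__ == "__main__":
    for dpi in (200, 300, 400, 600):
        print(f"{dpi} DPI: {page_pixels(dpi):,} px, {page_megabytes(dpi):.1f} MB per page")
```

Under these assumptions a page is 2550 × 3300 pixels at 300 DPI and 3400 × 4400 pixels at 400 DPI, roughly 1.8 times as many pixels to render and process.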
