Preprocesses scanned PDFs (removes colored backgrounds, enhances contrast) and runs OCR to produce:
<name>.txt— extracted plain text (single column)<name>-ocr.pdf— searchable PDF with embedded text layer
You need two system-level packages installed before the Python dependencies:
| OS | Command |
|---|---|
| Ubuntu / Debian | sudo apt-get install tesseract-ocr |
| macOS (Homebrew) | brew install tesseract |
| Windows | Download installer from https://github.com/UB-Mannheim/tesseract/wiki — then add the install directory to your PATH |
For additional languages (e.g. French, German, Spanish):
# Ubuntu/Debian
sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa
# macOS
brew install tesseract-lang
# Windows: select languages during installationList available languages with: tesseract --list-langs
| OS | Command |
|---|---|
| Ubuntu / Debian | sudo apt-get install poppler-utils |
| macOS (Homebrew) | brew install poppler |
| Windows | Download from https://github.com/oschwartz10612/poppler-windows/releases — extract and add the bin/ folder to your PATH |
# (Optional) create a virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install Python dependencies
pip install -r requirements.txt# Single file
python pdf_ocr.py scan.pdf
# Multiple files
python pdf_ocr.py scan1.pdf scan2.pdf
# Entire folder (recursive)
python pdf_ocr.py /path/to/pdfs/
# Mix of files and folders
python pdf_ocr.py report.pdf /path/to/more_scans/
# Custom output directory
python pdf_ocr.py scan.pdf -o results/
# Higher DPI for better accuracy on small text
python pdf_ocr.py scan.pdf --dpi 400
# Multiple languages (e.g. English + French)
python pdf_ocr.py scan.pdf --lang eng+fra| Flag | Default | Description |
|---|---|---|
-o, --output-dir |
Same as input file | Directory for output files |
--dpi |
300 | Render resolution (higher = better accuracy, slower) |
--lang |
eng |
Tesseract language code(s), +-separated |
Each PDF page goes through these steps before OCR:
- Grayscale conversion — removes color information
- Background removal — pixels above the 85th percentile brightness are pushed to white, eliminating tinted/colored backgrounds
- Contrast stretching — the remaining foreground pixel range is stretched to use the full 0–255 range
- Sharpening — enhances edge definition for cleaner character boundaries
- Contrast boost — additional 1.5× contrast enhancement
- Binarization — converts to pure black-and-white at threshold 180
This pipeline is optimized for scanned documents with colored or shaded backgrounds, watermarks, and low-contrast text.
"tesseract is not installed or not in PATH" → Install Tesseract (see above) and ensure it's on your system PATH.
"Unable to get page count" or pdf2image errors
→ Install Poppler (see above) and ensure pdftoppm is on your PATH.
Poor OCR accuracy
→ Try increasing --dpi to 400 or 600. Ensure the correct --lang is set.
Very large PDFs are slow
→ Higher DPI uses more RAM and CPU. For 100+ page documents at 400 DPI, expect several minutes. Reduce to --dpi 200 for faster processing with slightly lower accuracy.