A cross-platform guide for document text extraction using OpenCV and Tesseract OCR, complete with OS-specific setup instructions.
A beginner-friendly guide to extracting text from documents using OpenCV for image processing and Tesseract OCR for text recognition. Designed for AI/Engineering students to understand foundational document intelligence concepts.
This project demonstrates how to:
- Detect text regions in images using basic computer vision techniques
- Extract machine-readable text with Tesseract OCR
- Build interactive document scanners with Streamlit/Jupyter
No deep learning or complex libraries (like LayoutParser) are required.
- OCR Workflow: Preprocessing → Text Localization → OCR → Postprocessing
- Tool Roles:
  - OpenCV: Image thresholding, contour detection, ROI extraction
  - Tesseract: Optical Character Recognition (OCR) engine
- Real-World Challenges: Handling low contrast, complex layouts, multi-language text
- Limitations: Simpler but less accurate than deep learning approaches (e.g., LayoutParser)
- Python 3.8+
- 4GB RAM (minimum)
- 500MB disk space
- macOS: 10.15 (Catalina) or newer, with Xcode Command Line Tools
- Linux: Ubuntu 20.04/Debian 10 or equivalent, with GTK+ 3.x libraries
| OS | Command | Additional Notes |
|---|---|---|
| Windows | Download installer | Check "Add to PATH" during install |
| macOS | brew install tesseract | Requires Homebrew |
| Linux | sudo apt install tesseract-ocr libtesseract-dev | For Debian/Ubuntu |
macOS:
# Install Homebrew if missing
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install image libs
brew install leptonica
Add this to your Python code:
import pytesseract
# Keep only the assignment that matches your OS; each one overrides the previous.
# Windows
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# macOS (Homebrew on Intel; Apple Silicon Homebrew installs to /opt/homebrew/bin/tesseract)
pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'
# Linux
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'
- Window Manager Conflicts: If using a headless Linux server:
sudo apt install xvfb
Xvfb :0 -screen 0 1280x1024x24 & # start the virtual display
export DISPLAY=:0
- Font/Language Issues: Install additional Tesseract language packs:
sudo apt install tesseract-ocr-eng tesseract-ocr-fra # etc.
- M1/M2 Chip Optimization: Use native ARM Homebrew in Terminal:
arch -arm64 brew install tesseract
- Gatekeeper Issues: If blocked by macOS security:
xattr -d com.apple.quarantine /path/to/tesseract
# Clone repo
git clone https://github.com/yourusername/document-intelligence-demo.git
cd document-intelligence-demo
# Install requirements (in virtual env)
pip install -r requirements.txt
All OS:
streamlit run app.py # Web app
jupyter notebook # Jupyter version

| OS | Issue | Solution |
|---|---|---|
| macOS | Error: Failed building wheel for opencv | brew install cmake pkg-config |
| Linux | ImportError: libGL.so.1 | sudo apt install libgl1-mesa-glx (libgl1 on newer releases) |
| All | TesseractNotFoundError | Verify path with which tesseract (where tesseract on Windows) |
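For the TesseractNotFoundError case, the path check can also be done from Python. The sketch below looks on PATH first and then falls back to the per-OS default locations from the configuration snippet in this guide. The `find_tesseract` helper and `DEFAULT_PATHS` table are hypothetical names introduced for illustration, and the `/opt/homebrew` entry assumes an Apple Silicon Homebrew install.

```python
import os
import platform
import shutil

# Default install locations from this guide's configuration snippet.
# /opt/homebrew/bin is an added assumption for Apple Silicon Homebrew.
DEFAULT_PATHS = {
    "Windows": [r"C:\Program Files\Tesseract-OCR\tesseract.exe"],
    "Darwin": ["/opt/homebrew/bin/tesseract", "/usr/local/bin/tesseract"],
    "Linux": ["/usr/bin/tesseract"],
}


def find_tesseract():
    """Return the path to the tesseract binary, or None if it cannot be found."""
    on_path = shutil.which("tesseract")  # same lookup the shell does for `which tesseract`
    if on_path:
        return on_path
    # Fall back to the known default location(s) for the current OS
    for candidate in DEFAULT_PATHS.get(platform.system(), []):
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path = find_tesseract()
    print(path or "tesseract not found; install it or add it to PATH")
```

If a path is returned, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` instead of hard-coding one value per OS.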
MIT License - Free for academic and commercial use. Tesseract OCR is Apache 2.0 licensed.
