A modular Python toolkit for intelligent document processing using OCR and LLMs. Extract structured data from PDFs, Excel files, images, and more with extensible pipelines.
- Modular Architecture: Chain of responsibility pattern for flexible document processing
- Multi-Engine OCR: Support for Tesseract and PaddleOCR with layout preservation
- LLM Integration: Async processing with rate limiting for API calls
- Concurrent Processing: Multi-threaded extraction for high throughput
- Extensible: Easy to add new document types and processing methods
- Python 3.13+
- Tesseract OCR
- Poppler (for PDF processing)
- PaddleOCR (optional, for advanced OCR tasks)
- Clone the repository:
  git clone https://github.com/Prajwal-Prathiksh/llm_ocr_py.git
  cd llm_ocr_py
- Install dependencies (using uv for best experience):
  uv sync
  source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install system dependencies:
  - Linux:
    sudo apt-get install tesseract-ocr poppler-utils
  - Windows: Use the provided Setup.ps1 script or install via Scoop
from src.document_processor import DocumentProcessor
processor = DocumentProcessor()
result = processor.process("path/to/document.pdf")
print(result)
For detailed tutorials, see the tutorials/ directory.
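To process a batch of documents rather than a single file, the per-document call can be fanned out over a thread pool. The following is a minimal sketch, not the toolkit's own concurrency utilities: `process_many` is a hypothetical helper written for illustration, and it assumes the callable you pass in (for example `processor.process`) is safe to call from several threads at once.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def process_many(
    process: Callable[[str], object],
    paths: Iterable[str],
    max_workers: int = 4,
) -> tuple[dict[str, object], dict[str, Exception]]:
    """Run `process` on every path in a thread pool (hypothetical helper).

    Returns (results, errors), both keyed by path, so that one bad
    file does not abort the whole batch.
    """
    results: dict[str, object] = {}
    errors: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it was submitted for.
        futures = {pool.submit(process, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[path] = exc
    return results, errors
```

With the `processor` from the snippet above, a call would look like `results, errors = process_many(processor.process, ["a.pdf", "b.pdf"])`. Threads suit this workload when most of the time is spent waiting on OCR subprocesses or API calls; whether that holds for your documents is worth measuring.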
Run the test suite to ensure everything is set up correctly:
python -m pytest
- src/: Core modules
  - document_processor/: Processing pipelines for different document types
  - llms/: LLM client integrations
  - concurrency_utils.py: Multi-threading utilities
- tests/: Unit tests
- tutorials/: Jupyter notebooks with examples
- assets/: OCR training data and test files
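The feature list describes the processing pipelines as a chain of responsibility. As a rough illustration of that pattern, and of how a new document type could be slotted in, here is a self-contained sketch. All class names (`Handler`, `PdfHandler`, `ImageHandler`) are hypothetical and are not taken from this repository; the `extract` methods return placeholder strings where a real pipeline would run OCR or parsing.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class Handler(ABC):
    """One link in the chain; passes the request on if it cannot help."""

    def __init__(self, next_handler: Handler | None = None) -> None:
        self._next = next_handler

    @abstractmethod
    def can_handle(self, path: Path) -> bool:
        """Return True if this handler supports the file type."""

    @abstractmethod
    def extract(self, path: Path) -> str:
        """Run this handler's pipeline (placeholder in this sketch)."""

    def handle(self, path: Path) -> str:
        if self.can_handle(path):
            return self.extract(path)
        if self._next is None:
            raise ValueError(f"No handler for {path.suffix or path.name!r}")
        return self._next.handle(path)


class PdfHandler(Handler):
    def can_handle(self, path: Path) -> bool:
        return path.suffix.lower() == ".pdf"

    def extract(self, path: Path) -> str:
        return f"pdf pipeline -> {path.name}"


class ImageHandler(Handler):
    def can_handle(self, path: Path) -> bool:
        return path.suffix.lower() in {".png", ".jpg", ".jpeg"}

    def extract(self, path: Path) -> str:
        return f"image pipeline -> {path.name}"


# PDFs are tried first; anything else falls through to the image handler.
chain = PdfHandler(ImageHandler())
print(chain.handle(Path("scan.PNG")))
```

Supporting another format in this sketch means writing one more `Handler` subclass and adding it to the chain; the existing handlers do not change.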
Contributions welcome! Please see the tutorials for API details and submit PRs for new features.
GPL v3 License - see LICENSE for details.