`llm_ocr_py`: AI Document Extractor Toolkit

A modular Python toolkit for intelligent document processing using OCR and LLMs. Extract structured data from PDFs, Excel files, images, and more with extensible pipelines.

Features

Modular Architecture: Chain of responsibility pattern for flexible document processing
Multi-Engine OCR: Support for Tesseract and PaddleOCR with layout preservation
LLM Integration: Async processing with rate limiting for API calls
Concurrent Processing: Multi-threaded extraction for high throughput
Extensible: Easy to add new document types and processing methods
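
To illustrate the chain of responsibility pattern named above, here is a minimal sketch. The class names (`Handler`, `PdfHandler`, `ImageHandler`) and the placeholder return strings are hypothetical and are not taken from this toolkit's source; they only show how each handler either processes a file or passes it to the next one.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class Handler(ABC):
    """One link in the chain (hypothetical, not the toolkit's actual class)."""

    def __init__(self, successor: Optional["Handler"] = None) -> None:
        self.successor = successor

    @abstractmethod
    def can_handle(self, path: Path) -> bool:
        """Return True if this handler knows how to process the file."""

    @abstractmethod
    def extract(self, path: Path) -> str:
        """Return the extracted content (a placeholder string in this sketch)."""

    def handle(self, path: Path) -> str:
        # Process the file here if possible, otherwise pass it down the chain.
        if self.can_handle(path):
            return self.extract(path)
        if self.successor is not None:
            return self.successor.handle(path)
        raise ValueError(f"No handler for {path.suffix!r}")


class PdfHandler(Handler):
    def can_handle(self, path: Path) -> bool:
        return path.suffix.lower() == ".pdf"

    def extract(self, path: Path) -> str:
        return f"pdf:{path.name}"  # stand-in for real PDF extraction


class ImageHandler(Handler):
    def can_handle(self, path: Path) -> bool:
        return path.suffix.lower() in {".png", ".jpg", ".jpeg"}

    def extract(self, path: Path) -> str:
        return f"image:{path.name}"  # stand-in for real OCR


# Build the chain: PDFs first, then images.
chain = PdfHandler(ImageHandler())
print(chain.handle(Path("report.pdf")))
print(chain.handle(Path("scan.PNG")))
```

Under this design, supporting a new document type means writing one more handler and linking it into the chain, without changing the existing handlers.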

Quick Start

Prerequisites

Python 3.13+
Tesseract OCR
Poppler (for PDF processing)
PaddleOCR (optional, for advanced OCR tasks)

Installation

Clone the repository:

git clone https://github.com/Prajwal-Prathiksh/llm_ocr_py.git
cd llm_ocr_py

Install dependencies (using uv for best experience):

uv sync
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install system dependencies:
- Linux: sudo apt-get install tesseract-ocr poppler-utils
- Windows: Use the provided Setup.ps1 script or install via Scoop
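
After installing the system dependencies, a quick check like the following can confirm that they are on the `PATH`. This is a sketch, not part of the toolkit: the helper `missing_tools` is hypothetical, and the executable names are assumptions (`tesseract` for Tesseract OCR, `pdftoppm` as one of the command-line tools that ships with Poppler).

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names; adjust if your installation uses different ones.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("All system dependencies found.")
```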

Usage

from src.document_processor import DocumentProcessor

processor = DocumentProcessor()
result = processor.process("path/to/document.pdf")
print(result)

For detailed tutorials, see the tutorials/ directory.
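
For batches of documents, one possible approach is to fan the work out over a thread pool. The sketch below uses only the standard library; `process_many` and `fake_process` are hypothetical names, and `fake_process` stands in for a real call such as `DocumentProcessor().process`. Whether the toolkit's processor is safe to share across threads is an assumption this README does not confirm, and the toolkit's own `concurrency_utils.py` may offer a different interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Union


def process_many(
    paths: Iterable[str],
    process: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, Union[str, Exception]]:
    """Run `process` on each path in a thread pool.

    Maps each path to its result, or to the exception it raised,
    so one bad document does not abort the whole batch.
    """
    results: Dict[str, Union[str, Exception]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[path] = exc
    return results


def fake_process(path: str) -> str:
    """Hypothetical stand-in for a real document processor."""
    if path.endswith(".txt"):
        raise ValueError("unsupported file type")
    return path.upper()


results = process_many(["a.pdf", "b.png", "c.txt"], fake_process)
for path in sorted(results):
    print(path, "->", results[path])
```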

Testing

Run the test suite to ensure everything is set up correctly:

python -m pytest

Project Structure

src/: Core modules
- document_processor/: Processing pipelines for different document types
- llms/: LLM client integrations
- concurrency_utils.py: Multi-threading utilities
tests/: Unit tests
tutorials/: Jupyter notebooks with examples
assets/: OCR training data and test files

Contributing

Contributions are welcome. See the tutorials for API details, and submit pull requests for new features.

License

GPL v3 License - see LICENSE for details.
