A Python toolkit for intelligent document processing using OCR and LLMs. Extract structured data from PDFs, Excel files, and more with modular, extensible pipelines

Prajwal-Prathiksh/llm_ocr_py

llm_ocr_py: AI Document Extractor Toolkit

A modular Python toolkit for intelligent document processing using OCR and LLMs. Extract structured data from PDFs, Excel files, images, and more with extensible pipelines.

Features

  • Modular Architecture: Chain-of-responsibility pattern for flexible document processing
  • Multi-Engine OCR: Support for Tesseract and PaddleOCR with layout preservation
  • LLM Integration: Async processing with rate limiting for API calls
  • Concurrent Processing: Multi-threaded extraction for high throughput
  • Extensible: Easy to add new document types and processing methods
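
The chain-of-responsibility idea named above can be sketched in a few lines. This is a minimal illustration only, not the toolkit's actual code: the class names (Handler, PdfHandler, ExcelHandler, ImageHandler), the extension-based dispatch, and the string results are all hypothetical stand-ins.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class Handler(ABC):
    """One link in the chain. All names in this sketch are hypothetical."""

    def __init__(self, successor: Handler | None = None) -> None:
        self._successor = successor

    def handle(self, path: Path) -> str:
        # Process the document here if possible, otherwise pass it along.
        if self.can_handle(path):
            return self.extract(path)
        if self._successor is not None:
            return self._successor.handle(path)
        raise ValueError(f"No handler registered for {path.suffix!r}")

    @abstractmethod
    def can_handle(self, path: Path) -> bool: ...

    @abstractmethod
    def extract(self, path: Path) -> str: ...


class PdfHandler(Handler):
    def can_handle(self, path: Path) -> bool:
        return path.suffix.lower() == ".pdf"

    def extract(self, path: Path) -> str:
        return f"pdf pipeline: {path.name}"  # placeholder for real extraction


class ExcelHandler(Handler):
    def can_handle(self, path: Path) -> bool:
        return path.suffix.lower() in {".xlsx", ".xls"}

    def extract(self, path: Path) -> str:
        return f"excel pipeline: {path.name}"  # placeholder for real extraction


class ImageHandler(Handler):
    def can_handle(self, path: Path) -> bool:
        return path.suffix.lower() in {".png", ".jpg", ".jpeg"}

    def extract(self, path: Path) -> str:
        return f"image pipeline: {path.name}"  # placeholder for real extraction


# Build the chain once; each document goes to the first handler that accepts it.
chain = PdfHandler(ExcelHandler(ImageHandler()))
print(chain.handle(Path("invoice.pdf")))
```

In a design like this, supporting a new document type means writing one more Handler subclass and linking it into the chain, which is one way to read the "Extensible" bullet.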

Quick Start

Prerequisites

  • Python 3.13+
  • Tesseract OCR
  • Poppler (for PDF processing)
  • PaddleOCR (optional, for advanced OCR tasks)
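
A quick way to confirm the prerequisites above is a short standard-library script. This is a sketch under assumptions: it looks for the executables tesseract (the Tesseract CLI) and pdftoppm (one of the tools shipped with Poppler) on PATH, and the helper name check_prerequisites is made up for illustration. It does not check PaddleOCR, which is optional.

```python
import shutil
import sys

# Executable names are assumptions: "tesseract" is the Tesseract CLI and
# "pdftoppm" is one of the command-line tools that ships with Poppler.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")
MIN_PYTHON = (3, 13)


def check_prerequisites() -> dict[str, bool]:
    """Return a mapping of requirement name to whether it is satisfied."""
    python_key = f"python>={MIN_PYTHON[0]}.{MIN_PYTHON[1]}"
    report = {python_key: sys.version_info >= MIN_PYTHON}
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


for name, ok in check_prerequisites().items():
    print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```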

Installation

  1. Clone the repository:

    git clone https://github.com/Prajwal-Prathiksh/llm_ocr_py.git
    cd llm_ocr_py
  2. Install the Python dependencies (uv is recommended):

    uv sync
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install system dependencies:

    • Linux (Debian/Ubuntu): sudo apt-get install tesseract-ocr poppler-utils
    • Windows: Use the provided Setup.ps1 script or install via Scoop

Usage

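# Assumes you run this from the repository root so that `src` is importable.
from src.document_processor import DocumentProcessor

# Create a processor and run it on a single document.
processor = DocumentProcessor()
result = processor.process("path/to/document.pdf")
print(result)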

For detailed tutorials, see the tutorials/ directory.
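
To process several documents at once, a thread pool from the standard library is one option. The sketch below is not the toolkit's own concurrency helper. The function name process_many and its parameters are hypothetical, and it assumes the per-document callable is safe to call from multiple threads, which this README does not state.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def process_many(
    paths: Iterable[str],
    process_one: Callable[[str], object],
    max_workers: int = 4,
) -> dict[str, object]:
    """Run process_one on every path in a thread pool.

    Returns a mapping of path to result. If a call raises, the exception
    object is stored for that path instead, so one bad file does not
    abort the whole batch.
    """
    results: dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; record the failure
                results[path] = exc
    return results
```

With the processor from the example above, a call might look like process_many(["a.pdf", "b.pdf"], processor.process), again assuming processor.process is thread-safe.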

Testing

Run the test suite to ensure everything is set up correctly:

python -m pytest

Project Structure

  • src/: Core modules
    • document_processor/: Processing pipelines for different document types
    • llms/: LLM client integrations
    • concurrency_utils.py: Multi-threading utilities
  • tests/: Unit tests
  • tutorials/: Jupyter notebooks with examples
  • assets/: OCR training data and test files
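
The contents of llms/ are not shown in this README. As a rough illustration of the async rate limiting mentioned under Features, here is a minimal sliding-window limiter built on asyncio. The RateLimiter class, fake_llm_call, and the chosen limits are hypothetical and stand in for a real API client.

```python
from __future__ import annotations

import asyncio
import time
from collections import deque


class RateLimiter:
    """Allow at most max_calls acquisitions per period seconds (sliding window)."""

    def __init__(self, max_calls: int, period: float) -> None:
        self._max_calls = max_calls
        self._period = period
        self._timestamps: deque[float] = deque()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Forget calls that have left the window.
                while self._timestamps and now - self._timestamps[0] >= self._period:
                    self._timestamps.popleft()
                if len(self._timestamps) < self._max_calls:
                    self._timestamps.append(now)
                    return
                # Window is full: wait until the oldest call expires.
                await asyncio.sleep(self._period - (now - self._timestamps[0]))


async def fake_llm_call(prompt: str, limiter: RateLimiter) -> str:
    """Hypothetical stand-in for a real LLM API request."""
    await limiter.acquire()
    await asyncio.sleep(0.01)  # simulated network latency
    return prompt.upper()


async def main() -> list[str]:
    # Create the limiter inside the running event loop.
    limiter = RateLimiter(max_calls=2, period=0.2)
    prompts = ["a", "b", "c", "d"]
    return await asyncio.gather(*(fake_llm_call(p, limiter) for p in prompts))


print(asyncio.run(main()))
```

Holding the lock while sleeping keeps waiters in arrival order and is simple to reason about. A production client would likely also handle retries and provider-specific limits.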

Contributing

Contributions are welcome. See the notebooks in tutorials/ for API details, and submit a pull request for new features.

License

Licensed under the GNU GPL v3. See LICENSE for details.
