SIMOUNIX/harvestor

🌾 Harvestor

AI-powered document data extraction toolkit

Extract structured data from documents (invoices, receipts, forms) using any supported provider. Harvestor integrates into Python applications with flexible input options and built-in cost tracking.

⚠️ Early Development: This project is in active development. Core functionality is working, but many features are still being built.

What Works Now

  • Vision API Integration: Extract data from images (.jpg, .png, .gif, .webp)
  • Flexible Input: Accepts file paths, raw bytes, or file-like objects (for example, image data produced with PIL or downloaded with requests)
  • Cost Tracking: Built-in monitoring and limits for API usage (still being improved)
  • Structured Output: Returns Pydantic-validated data models that you can define
  • Providers: Currently supports Anthropic, OpenAI, and local models via Ollama
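
To make the cost-limit idea concrete, here is a minimal sketch of a spending budget that could sit around extraction calls. It is not Harvestor's built-in tracker: `CostBudget`, `BudgetExceeded`, `limit_usd`, and `record` are hypothetical names introduced for illustration, and the only thing assumed about Harvestor is that each result exposes a `total_cost` value, as shown in Basic Usage.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when recording a cost would push spending past the limit."""


@dataclass
class CostBudget:
    """Hypothetical helper (not part of Harvestor) that caps total API spend."""

    limit_usd: float
    spent_usd: float = 0.0
    calls: list = field(default_factory=list)

    def record(self, cost_usd: float) -> None:
        """Add one call's cost, refusing it if the limit would be exceeded."""
        if cost_usd < 0:
            raise ValueError("cost_usd must be non-negative")
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"recording ${cost_usd:.4f} would exceed the "
                f"${self.limit_usd:.2f} limit (spent ${self.spent_usd:.4f})"
            )
        self.spent_usd += cost_usd
        self.calls.append(cost_usd)

    @property
    def remaining_usd(self) -> float:
        """Amount that can still be spent before the limit is reached."""
        return self.limit_usd - self.spent_usd
```

In an application, one would call `budget.record(result.total_cost)` after each `harvest(...)` call and stop processing when `BudgetExceeded` is raised. Note that this sketch only checks spend after a call has been made; preventing a call up front would require a cost estimate, which is not covered here.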

Quick Start

# Install dependencies
uv sync

# Setup environment
cp .env.template .env
# Add your Anthropic or OpenAI API key to .env

# Run a test
uv run python example.py

Basic Usage

from harvestor import harvest

# From file path
result = harvest("invoice.jpg")
print(f"Invoice #: {result.data.get('invoice_number')}")
print(f"Total: ${result.data.get('total_amount')}")
print(f"Cost: ${result.total_cost:.4f}")

# From bytes (e.g., API upload)
with open("invoice.jpg", "rb") as f:
    data = f.read()
result = harvest(data, filename="invoice.jpg")

# From file-like object
from io import BytesIO
buffer = BytesIO(data)  # reuse the bytes read from invoice.jpg above
result = harvest(buffer, filename="invoice.jpg")

# Display summary output
print(result.to_summary())
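
The three input styles above (path, bytes, file-like object) can all be reduced to raw bytes before they reach a provider. The sketch below shows one way an application might do that normalization itself, for example to validate uploads before calling `harvest`. It is an illustration only: `read_image_bytes` is a hypothetical helper, not a Harvestor function, and it says nothing about how Harvestor handles inputs internally.

```python
from io import BytesIO
from pathlib import Path


def read_image_bytes(source) -> bytes:
    """Hypothetical helper: return raw bytes for a path, bytes, or file-like object."""
    # Raw bytes (e.g. an API upload body) are returned as an immutable copy.
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    # Strings and Path objects are treated as file system paths.
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    # Anything with a read() method is treated as a binary file-like object.
    if hasattr(source, "read"):
        data = source.read()
        if not isinstance(data, (bytes, bytearray)):
            raise TypeError("file-like object must be opened in binary mode")
        return bytes(data)
    raise TypeError(f"unsupported input type: {type(source).__name__}")


# Example: bytes and a BytesIO wrapper around them yield the same content.
payload = b"\x89PNG example bytes"
same = read_image_bytes(payload) == read_image_bytes(BytesIO(payload))
```

Reading from a file-like object consumes it from its current position, so a buffer that has already been read should be rewound with `seek(0)` first.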

Testing

# Install test dependencies
uv sync --extra dev

# Run tests
make test

# Run with coverage
make test-cov

Requirements

  • Python 3.13
  • Anthropic API key or OpenAI API key
  • Optional: Ollama for local model support

Citation

For testing and evaluation, we are currently using the following dataset:

Limam, M., et al. FATURA Dataset. Zenodo, 13 Dec. 2023, https://doi.org/10.5281/zenodo.10371464.

License

MIT
