A Python framework for intelligent data extraction using LLMs.
- Python 3.11+
- Poetry (or pip)
Using Poetry (recommended):
poetry install
poetry shell
Using pip:
pip install -r requirements.txt
- Copy .env.example to .env:
cp .env.example .env
- Add your API keys to .env:
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-key-here
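To catch a missing key before starting the server, a small check like the one below may help. It is a minimal sketch using only the standard library, not the project's own settings loader (that lives in core/config.py and is not shown in this README). The helper names `parse_env_text` and `missing_keys` are hypothetical, and treating both keys as required is an assumption.

```python
import os
from pathlib import Path

# Assumption: both keys listed in .env.example are required.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    file_values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    # Variables already set in the process environment override the file.
    from_environ = {k: os.environ[k] for k in REQUIRED_KEYS if k in os.environ}
    merged = {**file_values, **from_environ}
    print("Missing keys:", missing_keys(merged) or "none")
```

Run it from the project root, next to the .env file.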
uvicorn spec.main:app --reload --port 8000
The API will be available at http://localhost:8000
- Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Health: http://localhost:8000/api/v1/health
spec/
├── api/                      # REST API endpoints
│   └── v1/
│       ├── endpoints/        # Endpoint implementations
│       ├── router.py         # Route aggregator
│       └── dependencies.py   # Dependency injection
├── core/                     # Core infrastructure
│   ├── config.py             # Settings management
│   ├── exceptions.py         # Custom exceptions
│   ├── logging_config.py     # Logging setup
│   └── security.py           # Security utilities
├── models/                   # Pydantic data models
├── extraction/               # Extraction engine
│   ├── engine.py             # Main orchestrator
│   ├── llm/                  # LLM providers
│   ├── parsers/              # Content parsers
│   └── layout/               # Layout fingerprinting
├── search_library/           # Pattern storage
├── output/                   # Output management
└── main.py                   # FastAPI entry point
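This README does not describe how the layout fingerprinting, pattern storage, and LLM pieces interact. The sketch below is one plausible reading, offered only as an illustration and not as the project's actual implementation: a document's layout is hashed, a stored pattern is reused when the layout is already known, and the LLM is called otherwise. Every name here (`layout_fingerprint`, `extract`, `fake_llm`, the regex-based pattern format) is hypothetical. Only the `force_llm` and `auto_create_patterns` flags come from the API example in this README, and their exact semantics are assumed.

```python
import hashlib
import re
from typing import Callable

# Hypothetical types: the real project may model these very differently.
Pattern = dict[str, str]      # field name -> regular expression with one group
Library = dict[str, Pattern]  # layout fingerprint -> stored pattern
LLM = Callable[[str], tuple[dict[str, str], Pattern]]


def layout_fingerprint(text: str) -> str:
    """Hash the text with digit runs collapsed, so documents that share
    labels and structure but differ in numeric values get the same key."""
    shape = re.sub(r"\d+", "9", text)
    shape = re.sub(r"[ \t]+", " ", shape).strip()
    return hashlib.sha256(shape.encode("utf-8")).hexdigest()[:16]


def extract(
    text: str,
    library: Library,
    llm: LLM,
    force_llm: bool = False,
    auto_create_patterns: bool = True,
) -> dict[str, str]:
    """Reuse a stored pattern for a known layout, else fall back to the LLM."""
    key = layout_fingerprint(text)
    pattern = library.get(key)
    if pattern is not None and not force_llm:
        # Known layout: apply the stored regexes without calling the LLM.
        result: dict[str, str] = {}
        for field, regex in pattern.items():
            match = re.search(regex, text)
            if match:
                result[field] = match.group(1)
        return result
    # Unknown layout (or LLM forced): the LLM returns fields plus a pattern.
    fields, new_pattern = llm(text)
    if auto_create_patterns:
        library[key] = new_pattern
    return fields


LLM_CALLS: list[str] = []  # records each text passed to the stand-in LLM


def fake_llm(text: str) -> tuple[dict[str, str], Pattern]:
    """Stand-in for a real provider: fixed regexes for an invoice-like text."""
    LLM_CALLS.append(text)
    pattern = {"invoice": r"Invoice: (\d+)", "total": r"Total: (\d+)"}
    fields = {}
    for name, regex in pattern.items():
        match = re.search(regex, text)
        if match:
            fields[name] = match.group(1)
    return fields, pattern


if __name__ == "__main__":
    library: Library = {}
    print(extract("Invoice: 1001\nTotal: 250", library, fake_llm))
    print(extract("Invoice: 2002\nTotal: 975", library, fake_llm))
    print("LLM calls:", len(LLM_CALLS))
```

In this sketch the second document shares the first one's layout, so it is served from the stored pattern and the stand-in LLM is called only once.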
GET /api/v1/health
POST /api/v1/extract
Content-Type: application/json
{
  "config_id": "config_001",
  "source": {
    "type": "text",
    "content": "Document content here..."
  },
  "force_llm": false,
  "options": {
    "auto_create_patterns": true
  }
}
Run all tests:
pytest
Run specific test file:
pytest tests/unit/test_models.py -v
Run with coverage:
pytest --cov=spec --cov-report=html
- Formatter: Black (88 chars line length)
- Linter: Ruff
- Type Checker: Mypy
Format code:
black spec/ tests/
ruff check . --fix
Type checking:
mypy spec/
License: MIT
Phase 1: MVP Core - In Development
- ✓ Project setup and tooling
- ✓ Core infrastructure
- ✓ Pydantic models
- ✓ LLM provider interface (Anthropic)
- ✓ Text and PDF parsers
- ✓ Layout fingerprinting
- ✓ Search library (JSON storage)
- ✓ Extraction engine
- ✓ REST API endpoints
- ⏳ Comprehensive testing
- ⏳ End-to-end validation
See PHASE-1-PLAN.md for the detailed roadmap.