Extract structured data from unstructured documents in seconds — not hours.
DocExtract runs as a production system with observability, evaluation gates, and resilience built in:
- Sync Sidecar Pattern: Non-blocking Langfuse trace submission via FastAPI BackgroundTasks. Request path is never blocked by observability.
- Langfuse Cloud Tracing: Every extraction, search, and review action is traced with model calls, token usage, latency, and confidence scores. PII is sanitized before trace submission.
- Tiered Evaluation CI Gates: Deterministic schema/regex checks on every PR. LLM-as-a-judge (DeepEval) runs nightly against a 92.6% accuracy golden baseline. Deployment blocks if quality regresses.
- Circuit Breaker Failover: Sonnet-to-Haiku model fallback chain with configurable thresholds. Prometheus gauge tracks breaker state.
- Cost Tracking: Per-request USD breakdown. Semantic cache hit/miss/cost-saved counters. Model A/B testing with z-test significance.
- Infrastructure: Docker multi-stage builds, Kubernetes (Kustomize + HPA), AWS Terraform IaC (RDS + ElastiCache), GitHub Actions CI/CD publishing to GHCR.
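The PII sanitization step in the trace pipeline can be sketched as a regex redaction pass run before any payload leaves the request path. This is a minimal illustration, not the project's actual sanitizer; the patterns and replacement tokens are assumptions:

```python
import re

# Hypothetical redaction patterns -- the real sanitizer's rules may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def sanitize_for_trace(text: str) -> str:
    """Redact obvious PII before the trace payload is submitted."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)
```

Running sanitization inside the background task keeps the synchronous request path free of both the redaction cost and the network call to the tracing backend.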
Explore the full pipeline without uploading real documents:
```bash
DEMO_MODE=true streamlit run frontend/app.py
```

Or visit the hosted dashboard (demo data pre-loaded).
Demo includes: document extraction with confidence scores, hybrid semantic search, RAGAS evaluation metrics, cost dashboard, and architecture diagram.
```mermaid
graph LR
    A[Browser / API Client] -->|POST /documents| B[FastAPI]
    B -->|enqueue| C[ARQ Worker]
    C -->|classify| D{Model Router}
    D -->|primary| E[Claude Sonnet]
    D -->|fallback| F[Claude Haiku]
    E -->|Pass 2: extract + correct| G[pgvector HNSW]
    B -->|SSE stream stages| A
    G -->|semantic search| B
    B -->|/metrics| H[Prometheus]
    D --- I[Circuit Breaker]
```
Screenshots: Upload & Extraction, Extracted Records & ROI.
Real-time progress: PREPROCESSING → EXTRACTING → CLASSIFYING → VALIDATING → EMBEDDING → COMPLETED
- 5 document types: PDF, images (PNG/JPEG/TIFF/BMP/GIF/WebP), email (.eml/.msg), and plain text
- Two-pass Claude extraction: Pass 1 extracts structured JSON with a confidence score. If confidence < 0.80, Pass 2 fires a `tool_use` correction call for automatic error correction
- Circuit breaker model fallback: Per-model circuit breakers (CLOSED/OPEN/HALF_OPEN) with automatic Sonnet → Haiku failover. Configurable via `EXTRACTION_MODELS`/`CLASSIFICATION_MODELS`
- Golden eval CI gate: 16 fixture-based eval cases run in CI with 2% regression tolerance — extraction quality is a first-class CI signal
- OpenTelemetry + Prometheus: LLM call latency, token usage, and call counts exposed at `/metrics`. Enable with `OTEL_ENABLED=true`
- Per-document-type confidence thresholds: Identity documents require 0.90 confidence; receipts tolerate 0.75. Configurable per type via `CONFIDENCE_THRESHOLDS`
- Hybrid search (BM25 + vector RRF): Semantic search combined with BM25 keyword matching via reciprocal rank fusion. Use `?mode=hybrid` on the search endpoint for best recall
- Structured table extraction: Tables in PDFs are extracted as structured JSON (headers + rows), not flattened to markdown text
- Page-by-page streaming for long PDFs: Multi-page PDFs emit partial extraction results per page via SSE — no need to wait for the full document
- Vision-native extraction path: Set `OCR_ENGINE=vision` to route image documents directly through Claude's vision API, bypassing Tesseract entirely
- Active learning from HITL corrections: Approved corrections feed back into extraction prompts via `ACTIVE_LEARNING_ENABLED=true`
- MCP tool server: Connect Claude Desktop or any MCP-compatible agent to extract documents and search records via `mcp_server.py`
- Streaming Agent Reasoning (SSE) — Real-time Server-Sent Events stream of Think → Act → Observe steps as the agentic RAG loop executes. Each reasoning step is emitted as an SSE event, enabling live UI updates. POST `/api/v1/agent-search/stream`
- Multi-Document Synthesis — Map-reduce RAG across multiple documents. For each document, extracts relevant passages (map), then synthesizes a combined answer with per-document citations (reduce). Concurrent LLM orchestration via `asyncio.gather` + `Semaphore`. POST `/api/v1/agent-search/synthesize`
- Semantic Caching — Caches LLM responses by embedding cosine similarity (not exact match). Sub-millisecond lookup via numpy batch cosine distance. TTL-based expiry, FIFO eviction, Prometheus hit/miss/cost-saved counters. Feature-flagged (`SEMANTIC_CACHE_ENABLED`). GET `/api/v1/cache/stats`
- Fine-Tuning Data Pipeline — Exports HITL corrections as training datasets: supervised JSONL (OpenAI format), DPO pairs (chosen/rejected for RLHF), and evaluation JSONL. Deduplication, train/val split, doc_type filtering. GET `/api/v1/finetune/export` + `/finetune/stats`
- Agentic RAG (ReAct) — ReAct think-act-observe loop with 5 retrieval tools (vector, BM25, hybrid, metadata, rerank). Agent autonomously selects strategy per query. Confidence-gated at 0.8 with max 3 iterations. POST `/api/v1/agent-search`
- RAGAS Evaluation Pipeline — Context recall, faithfulness, and answer relevancy metrics computed via LLM-as-judge with structured rubric and few-shot examples. CI quality gate blocks merges on regression. Feature-flagged (`RAGAS_ENABLED`, `LLM_JUDGE_ENABLED`).
- Structured Output Extraction — Per-document-type Pydantic schemas (Invoice, Contract, Receipt, Medical Record) with field-level confidence scores. Batch processing with `asyncio.gather` + `Semaphore(5)`. POST `/api/v1/extract/structured`
- Cost Tracker & Model A/B Testing — Per-request USD cost computation using Decimal arithmetic from `llm_traces`. SHA-256 deterministic variant routing with z-test statistical significance (n≥30). Cost comparison dashboard in Streamlit.
- Prompt Versioning & Regression Testing — Semver-tagged prompts stored as `prompts/{category}/vX.Y.Z.txt`. Env-configurable active version. Automated golden eval comparison with 2% regression threshold.
- Interactive Demo Sandbox — Full pipeline demo without API keys (`DEMO_MODE=true`). Pre-cached extraction, search, and evaluation results. Three-tab experience.
- SSE streaming progress: Real-time job status updates via Server-Sent Events (Redis pub/sub)
- HNSW vector search: pgvector semantic search over extracted records (gemini-embedding-2-preview, 768-dim)
- Human review workflow: Claim, approve, or correct low-confidence extractions with full audit trail
- ROI tracking: Executive report generation with extraction cost/time analytics
- SHA-256 deduplication: Identical file uploads return existing job IDs instantly
- Webhook delivery: HMAC-SHA256 signed payloads with 4-attempt exponential retry
- Sliding-window rate limiting: Per-API-key Redis rate limiter with `X-RateLimit-*` headers
- AES-GCM encrypted secrets: Webhook signing secrets encrypted at rest
- Pluggable storage: Local filesystem or Cloudflare R2
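The reciprocal rank fusion behind hybrid search can be sketched in a few lines. This is an illustrative implementation using the conventional `k = 60` constant, not the project's actual fusion code:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: each doc scores sum(1 / (k + rank)) over lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# A doc ranked well by both BM25 and vector search outranks a doc
# that appears near the top of only one list.
bm25 = ["inv-77", "inv-12", "inv-90"]
vector = ["inv-12", "inv-45", "inv-77"]
fused = rrf_fuse([bm25, vector])
```

Because RRF only uses ranks, it needs no score normalization between BM25 and cosine similarity, which is why it is a common default for hybrid retrieval.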
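The semantic cache's similarity lookup reduces to one normalized matrix product over the cached embeddings. A sketch, assuming a 0.9 hit threshold and hypothetical function names:

```python
import numpy as np

def cache_lookup(query_emb: np.ndarray, cached_embs: np.ndarray,
                 threshold: float = 0.9):
    """Return (index, similarity) of the best cached entry, or (None, sim) on a miss."""
    q = query_emb / np.linalg.norm(query_emb)
    c = cached_embs / np.linalg.norm(cached_embs, axis=1, keepdims=True)
    sims = c @ q                      # batch cosine similarity in one matmul
    best = int(np.argmax(sims))
    sim = float(sims[best])
    return (best, sim) if sim >= threshold else (None, sim)
```

For cache sizes in the thousands, the single matmul stays well under a millisecond, which is what makes near-match caching viable on the request path.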
| Metric | Value |
|---|---|
| Document extraction (p50) | ~8s (two-pass Claude) |
| SSE first token (p50) | <500ms |
| Semantic search (p95) | <100ms |
| Extraction accuracy (golden eval) | 92.6% across 6 document types |
| Test suite | ~5s (1,109 tests) |
| Coverage | 90.66% (CI-enforced) |
- Reduces manual document review from hours to seconds
- 92.6% extraction accuracy measured against 16 golden eval fixtures
- Circuit breaker model fallback ensures continuity during provider outages
- Async pipeline handles concurrent uploads without blocking
Self-hosted: docker compose up — see Deployment.
```bash
# Health check (no auth)
curl http://localhost:8000/api/v1/health

# List records (demo key)
curl -H "X-API-Key: demo-key-docextract-2026" \
  http://localhost:8000/api/v1/records
```

One-click deploy via Render Blueprint. Sets `DEMO_MODE=true` automatically. You only need to add your `ANTHROPIC_API_KEY`.
Full manifest set under deploy/k8s/ — Deployments, Services, Ingress, HPA, ConfigMap, and Secrets template for all three services (API, Worker, Frontend).
```bash
# Deploy base manifests (fill in secrets.yaml first)
kubectl apply -k deploy/k8s/

# Deploy production overlay (3 replicas, higher resource limits)
kubectl apply -k deploy/k8s/overlays/production/
```

Architecture:
- API: 2 replicas (base), HPA scales to 8 on CPU >70%
- Worker: 2 replicas (base), HPA scales to 6 — higher memory limit for Tesseract OCR
- Ingress: nginx class, routes `/api` and `/docs` to API, `/` to the Streamlit frontend
- SSE: `nginx.ingress.kubernetes.io/proxy-buffering: "off"` ensures job progress streams are delivered in real time
```bash
# Validate manifests (no cluster required)
kubectl kustomize deploy/k8s/ | kubectl apply --dry-run=client -f -
```

Full IaC under deploy/aws/ — Terraform provisions an EC2 instance (t2.micro), two ECR repositories, S3 document storage, RDS PostgreSQL 16 (db.t3.micro, pgvector via migration 002), and ElastiCache Redis 7 (cache.t3.micro). All free-tier eligible.
```bash
# 1. Provision infrastructure
cd deploy/aws
terraform init
terraform apply \
  -var="key_pair_name=your-key-pair" \
  -var="anthropic_api_key=$ANTHROPIC_API_KEY" \
  -var="gemini_api_key=$GEMINI_API_KEY" \
  -var="db_password=your-secure-db-password"

# 2. Build and push images to ECR
cd ../..
make aws-push AWS_REGION=us-east-1

# 3. EC2 user_data runs migrations (alembic upgrade head) then starts API + ARQ worker
terraform -chdir=deploy/aws output api_url
```

The instance uses an IAM role with scoped ECR pull + S3 read/write permissions (no static credentials). RDS and ElastiCache are in private subnets — only the EC2 security group can reach them.
All endpoints are prefixed with `/api/v1`. Authenticated endpoints require the `X-API-Key` header.
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | `/health` | No | Basic health check |
| GET | `/health/detailed` | No | Health with DB/Redis/storage status |
| POST | `/documents/upload` | Yes | Upload a document for extraction (202) |
| POST | `/documents/batch` | Yes | Batch upload multiple documents (202) |
| DELETE | `/documents/{document_id}` | Yes | Delete a document and its data |
| GET | `/jobs` | Yes | List jobs with optional status filter |
| GET | `/jobs/{job_id}` | Yes | Get job status and details |
| GET | `/jobs/{job_id}/record` | Yes | Get extracted record for a job |
| PATCH | `/jobs/{job_id}` | Yes | Cancel a running job |
| GET | `/jobs/{job_id}/events` | Yes | SSE stream of job progress events |
| GET | `/records` | Yes | List extracted records (paginated) |
| GET | `/records/search` | Yes | Semantic search over records (`?mode=vector\|bm25\|hybrid`, default: vector) |
| GET | `/records/export` | Yes | Stream records as CSV or JSON |
| GET | `/records/{record_id}` | Yes | Get a single extracted record |
| PATCH | `/records/{record_id}/review` | Yes | Submit review for a record |
| POST | `/webhooks/test` | Yes | Send a test webhook payload |
| GET | `/stats` | Yes | Aggregate dashboard statistics |
| POST | `/api-keys` | Admin | Create a new API key |
| GET | `/api-keys` | Admin | List all API keys |
| DELETE | `/api-keys/{key_id}` | Admin | Revoke an API key |
| GET | `/review/items` | Yes | List review queue items |
| POST | `/review/items/{item_id}/claim` | Yes | Claim a review item |
| POST | `/review/items/{item_id}/approve` | Yes | Approve a review item |
| POST | `/review/items/{item_id}/correct` | Yes | Submit corrections for a review item |
| GET | `/review/metrics` | Yes | Review queue metrics |
| GET | `/roi/summary` | Yes | ROI summary with date range filter |
| GET | `/roi/trends` | Yes | ROI trends by week or month |
| POST | `/reports/generate` | Admin | Generate an executive report |
| GET | `/reports` | Admin | List generated reports |
| GET | `/reports/{report_id}` | Admin | Get a specific report |
| POST | `/agent-search` | Yes | Query documents with autonomous ReAct retrieval agent |
| POST | `/extract/structured` | Yes | Extract typed structured data from a document |
| POST | `/extract/structured/batch` | Yes | Batch structured extraction (async, semaphore-limited) |
```bash
git clone https://github.com/ChunkyTortoise/docextract.git
cd docextract
cp .env.example .env   # fill in ANTHROPIC_API_KEY + GEMINI_API_KEY at minimum
alembic upgrade head   # apply database migrations
docker-compose up
```

Services start on:
- API: http://localhost:8000 (docs at `/docs`)
- Frontend: http://localhost:8501
- PostgreSQL: localhost:5432
- Redis: localhost:6379
Seed a dev API key:
```bash
docker-compose exec api python -m scripts.seed_api_key
```

| Variable | Required | Description |
|---|---|---|
| `DATABASE_URL` | Yes | PostgreSQL connection string (asyncpg driver added automatically) |
| `REDIS_URL` | Yes | Redis connection string |
| `ANTHROPIC_API_KEY` | Yes | Anthropic API key for Claude extraction |
| `API_KEY_SECRET` | Yes | Secret for hashing API keys (32+ chars) |
| `AES_KEY` | No | Base64-encoded 32-byte key for AES-GCM webhook secret encryption |
| `GEMINI_API_KEY` | Yes | Required for Gemini embeddings |
| `STORAGE_BACKEND` | No | `local` (default) or `r2` |
| `STORAGE_LOCAL_PATH` | No | Local file storage path (default: `./storage/local`) |
| `R2_ACCOUNT_ID` | No | Cloudflare R2 account ID |
| `R2_ACCESS_KEY_ID` | No | Cloudflare R2 access key |
| `R2_SECRET_ACCESS_KEY` | No | Cloudflare R2 secret key |
| `R2_BUCKET_NAME` | No | R2 bucket name (default: `docextract`) |
| `CORS_ORIGINS` | No | JSON array of allowed origins |
| `LOG_LEVEL` | No | Logging level (default: INFO) |
| `MAX_FILE_SIZE_MB` | No | Max upload size in MB (default: 50) |
| `MAX_PAGES` | No | Max PDF pages to process (default: 100) |
| `OCR_ENGINE` | No | `tesseract`, `paddle`, or `vision` (default: `tesseract`). Use `vision` to route images through Claude's vision API |
| `EXTRACTION_CONFIDENCE_THRESHOLD` | No | Global two-pass fallback threshold (default: 0.8) |
| `CONFIDENCE_THRESHOLDS` | No | JSON dict of per-type thresholds, e.g. `{"invoice":0.80,"identity_document":0.90}` |
| `VISION_EXTRACTION_ENABLED` | No | Route image MIMEs through vision extractor instead of OCR (default: false) |
| `ACTIVE_LEARNING_ENABLED` | No | Feed approved HITL corrections back into extraction prompts (default: false) |
| `DEMO_MODE` | No | Enable demo mode with read-only access (default: false) |
| `DEMO_API_KEY` | No | API key for demo access (default: `demo-key-docextract-2026`) |
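To illustrate how `EXTRACTION_CONFIDENCE_THRESHOLD` and `CONFIDENCE_THRESHOLDS` compose, a loader along these lines would resolve per-type values with a global fallback. The helper name is hypothetical, not the project's actual code:

```python
import json
import os

def load_threshold_lookup(env=os.environ):
    """Build a doc_type -> confidence threshold lookup with a global default."""
    global_threshold = float(env.get("EXTRACTION_CONFIDENCE_THRESHOLD", "0.8"))
    per_type = json.loads(env.get("CONFIDENCE_THRESHOLDS", "{}"))
    return lambda doc_type: float(per_type.get(doc_type, global_threshold))
```

Any document type not listed in the JSON dict falls back to the global two-pass threshold.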
```bash
pytest tests/ -v   # 1,109 tests, ~5s
```

```
app/
  api/       -- FastAPI route modules (10 routers)
  auth/      -- API key auth + rate limiting middleware
  models/    -- SQLAlchemy models (8 tables)
  schemas/   -- Pydantic request/response schemas
  services/  -- Extraction, classification, embedding, validation
  storage/   -- Pluggable storage backends (local, R2)
  utils/     -- Hashing, MIME detection, token counting
worker/      -- ARQ async job processor
frontend/    -- Streamlit 14-page dashboard
alembic/     -- Database migrations (001-010)
scripts/     -- Seed scripts (API keys, sample docs, cleanup)
tests/       -- Unit + integration tests
```
Self-hosted via Docker Compose. See Quickstart above.
- API: http://localhost:8000
- Frontend: http://localhost:8501
- Demo API key: `demo-key-docextract-2026`
- Docs: http://localhost:8000/docs (Swagger UI)

Tip: Set `DEMO_MODE=true` in your `.env` to explore the full pipeline without API keys or real documents.
Measured against 16 hand-crafted golden fixtures covering all 6 document types. Scores are field-level F1 (token overlap) between extracted JSON and golden ground truth. Run in CI on every push.
| Document Type | Accuracy |
|---|---|
| Invoice | 95.0% |
| Purchase Order | 96.4% |
| Bank Statement | 91.6% |
| Medical Record | 98.9% |
| Receipt | 82.1% |
| Identity Document | 81.4% |
| Overall | 92.6% |
```bash
# Reproduce locally (no API calls):
python scripts/run_eval_ci.py --ci
```

DocExtract is evaluated against the SROIE receipt benchmark (4 fields: company, date, address, total).

```bash
python scripts/benchmark_sroie.py --dry-run
```

Full SROIE evaluation requires the dataset download and Anthropic API credits. See `scripts/benchmark_sroie.py --help`.
DocExtract ships with an MCP (Model Context Protocol) tool server. Connect it to Claude Desktop, Cursor, or any MCP-compatible agent to extract documents and search records directly from your AI assistant.
| Tool | Description |
|---|---|
| `extract_document` | Download a document from a URL, extract structured data, return the full record |
| `search_records` | Semantic search over all extracted records |
```bash
pip install mcp
export DOCEXTRACT_API_URL=http://localhost:8000/api/v1
export DOCEXTRACT_API_KEY=your-api-key
python mcp_server.py
```

Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "docextract": {
      "command": "python",
      "args": ["/path/to/docextract/mcp_server.py"],
      "env": {
        "DOCEXTRACT_API_URL": "http://localhost:8000/api/v1",
        "DOCEXTRACT_API_KEY": "your-api-key"
      }
    }
  }
}
```

Once connected, you can ask Claude: "Extract the invoice at [URL] and tell me the total amount due."
Full setup guide and tool reference: docs/mcp-integration.md. Built with the same patterns as mcp-server-toolkit (pip install mcp-server-toolkit) — a PyPI package providing production MCP server boilerplate with caching, rate limiting, and OpenTelemetry instrumentation.
DocExtract is built for production monitoring from day one.
Spin up the complete monitoring stack with a single command:
```bash
docker compose -f docker-compose.yml -f docker-compose.observability.yml up
```

| Service | URL | What you see |
|---|---|---|
| Jaeger | http://localhost:16686 | Distributed traces per document extraction |
| Prometheus | http://localhost:9090 | Raw metrics with PromQL |
| Grafana | http://localhost:3000 (admin/admin) | Pre-built dashboard: latency, throughput, tokens, circuit breaker state |
Enable with `OTEL_ENABLED=true`. Exposes a `/metrics` endpoint in Prometheus format:
| Metric | Type | Labels |
|---|---|---|
| `llm_call_duration_ms` | Histogram | model, operation, status |
| `llm_calls_total` | Counter | model, operation, status |
| `llm_tokens_total` | Counter | model, direction |
| `circuit_breaker_state` | Gauge | model — 0=CLOSED, 1=HALF_OPEN, 2=OPEN |
```bash
OTEL_ENABLED=true uvicorn app.main:app --reload
curl http://localhost:8000/metrics
```

Enable span export by setting `OTEL_EXPORTER_OTLP_ENDPOINT`:

```bash
OTEL_ENABLED=true \
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317 \
uvicorn app.main:app --reload
```

Works with any OTLP-compatible backend: Jaeger, Grafana Tempo, Honeycomb, Datadog.
Each model in the fallback chain has its own circuit breaker (CLOSED → OPEN → HALF_OPEN state machine). When the primary model (Claude Sonnet) trips its circuit, traffic automatically routes to Claude Haiku without any downtime.
Configure via environment:
```bash
EXTRACTION_MODELS=claude-sonnet-4-6,claude-haiku-4-5-20251001
CLASSIFICATION_MODELS=claude-haiku-4-5-20251001,claude-sonnet-4-6
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_RECOVERY_SECONDS=60
```

Every LLM call is traced with model, operation, latency, token counts, and confidence score — stored in PostgreSQL and queryable via the `/stats` endpoint. View per-model cost trends, p95 latency, and error rates.
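The state machine behind the fallback chain can be sketched as follows. This is a minimal illustration of the CLOSED → OPEN → HALF_OPEN cycle, not the project's actual breaker (which also exports its state as a Prometheus gauge):

```python
import time

class CircuitBreaker:
    """Minimal per-model breaker: trips OPEN after N failures, probes after a cooldown."""

    def __init__(self, failure_threshold: int = 5, recovery_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_seconds:
                self.state = "HALF_OPEN"   # let one probe request through
                return True
            return False
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()
```

The model router checks `allow()` on the primary model's breaker and, when it returns False, routes the call to the next model in the chain instead.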
Token cost comparison across models (per 1,000 tokens, as of 2026):
| Model | Input | Output | Best For |
|---|---|---|---|
| Claude Sonnet 4.6 | $0.003 | $0.015 | Complex extraction, high accuracy |
| Claude Haiku 4.5 | $0.00025 | $0.00125 | Classification, simple queries |
| Claude Opus 4.6 | $0.015 | $0.075 | Evaluation, edge cases |
DocExtract routes 60% of classification traffic to Haiku after A/B testing showed <2% quality difference vs Sonnet — reducing classification costs by ~67%.
Track live cost-per-query in the Cost Dashboard.
| Model | Operation | Avg Cost/Request | Avg Latency |
|---|---|---|---|
| claude-sonnet-4-6 | Extraction (2-pass) | $0.004–$0.012 | 1.8s |
| claude-haiku-4-5 | Classification | $0.0003–$0.001 | 0.4s |
| claude-sonnet-4-6 | LLM Judge | $0.002–$0.006 | 1.2s |
Model routing strategy: Classification and re-ranking use Haiku (4x cheaper, <5% quality gap). Full extraction uses Sonnet. LLM judge uses Sonnet for accuracy. A/B testing with z-test significance determines optimal model allocation per operation.
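The z-test significance check behind A/B routing reduces to the standard two-proportion test with a pooled standard error. An illustrative sketch; the shipped implementation may differ in details such as minimum-sample handling:

```python
import math

def two_proportion_z(successes_a: int, n_a: int,
                     successes_b: int, n_b: int) -> float:
    """z statistic for the difference between two success rates (pooled SE)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# |z| > 1.96 -> the quality difference is significant at the 5% level (two-tailed)
```

With n ≥ 30 per variant, a |z| below 1.96 supports routing traffic to the cheaper model, since the observed quality gap is indistinguishable from noise.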
Cost monitoring: /api/v1/metrics (Prometheus) + Cost Dashboard in Streamlit frontend.
- Tesseract degradation on handwriting: OCR accuracy drops significantly on handwritten documents or forms with mixed print/handwriting. Set `OCR_ENGINE=vision` to route image documents through Claude's vision API instead, which handles handwriting substantially better.
- English-only extraction prompts: Extraction and classification prompts are optimized for English-language documents. Non-English documents may extract with lower accuracy.
12 Architecture Decision Records (ADRs) document the key design choices: docs/adr/
| ADR | Decision |
|---|---|
| ADR-0001 | ARQ over Celery for async job queue |
| ADR-0002 | pgvector over Pinecone/Weaviate |
| ADR-0003 | Two-pass Claude extraction with confidence gating |
| ADR-0006 | Circuit breaker model fallback chain |
| ADR-0011 | API key auth over OAuth/JWT |
| ADR-0012 | Pluggable storage backend (Local/R2) |
For a detailed breakdown of the architecture decisions, RAG pipeline design, extraction accuracy benchmarks, and async job queue patterns, see the Case Study. This document covers the full engineering journey from prototype to production.
| Metric | Target |
|---|---|
| Field-level extraction accuracy | >= 92% (CI-gated, 2% regression tolerance) |
| Single-page extraction p95 | < 8s |
| Multi-page extraction p95 | < 45s (10 pages, streamed) |
| Semantic search p95 | < 200ms |
| API uptime | 99.5% monthly |
| Circuit breaker recovery | < 60s after provider restoration |
| Cost per 1,000 documents | < $25 blended (Sonnet/Haiku) |
Full SLO definitions with error budgets: docs/slo.md
This service is designed for production deployment with:
- Reliability: Circuit breaker model fallback, queue-based async processing, Redis-backed rate limiting, HMAC-signed webhook delivery with 4-attempt retry
- Observability: OpenTelemetry traces (Jaeger/Tempo), Prometheus metrics (`/metrics`), LLM cost/latency tracking, Grafana dashboards
- Quality gates: Golden eval CI gate (92% threshold), RAGAS evaluation, confidence-based two-pass correction
- Infrastructure: K8s manifests with HPA auto-scaling, AWS Terraform (RDS + ElastiCache), multi-stage Docker builds
- Security: API key authentication, webhook HMAC signing, rate limiting, bandit static analysis in CI
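On the receiving side, the HMAC-SHA256 webhook signatures can be verified with a constant-time comparison. A sketch; the hex encoding and function names here are assumptions about the wire format:

```python
import hashlib
import hmac

def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    # compare_digest avoids timing side channels on signature checks
    return hmac.compare_digest(sign_payload(secret, body), signature)
```

Always verify against the raw bytes of the request body, before any JSON parsing, so re-serialization differences cannot break the signature.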
| Document | Purpose |
|---|---|
| SLO Targets | Latency, availability, quality, cost targets |
| Common Failure Runbook | Circuit breaker, Redis, DB, queue, vector index recovery |
| Security Guide | API keys, webhooks, CORS, data handling |
| Release Checklist | Pre-deploy verification steps |
| Client Onboarding | Setup guide for new deployments |
| Prometheus Alerts | SLO breach alerting rules |
| Document Type | Model | Avg Tokens | Cost/Doc | Cost/1,000 |
|---|---|---|---|---|
| Invoice (1 page) | Sonnet | ~2,500 | $0.025 | $25.00 |
| Invoice (1 page) | Haiku (fallback) | ~2,500 | $0.004 | $4.00 |
| Receipt | Sonnet | ~1,200 | $0.012 | $12.00 |
| Multi-page PDF (10p) | Sonnet | ~15,000 | $0.150 | $150.00 |
| Embedding (any) | Gemini | 768-dim | $0.0004 | $0.40 |
Costs assume Anthropic March 2026 pricing. Two-pass correction adds ~20% to base cost for low-confidence documents.
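Given the per-1,000-token prices above, per-request cost in exact decimal arithmetic can be sketched like this (the price-table keys are illustrative, not the project's actual identifiers):

```python
from decimal import Decimal

# (input, output) USD per 1,000 tokens, from the pricing table above.
PRICES = {
    "claude-sonnet-4-6": (Decimal("0.003"), Decimal("0.015")),
    "claude-haiku-4-5": (Decimal("0.00025"), Decimal("0.00125")),
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Exact per-request cost; Decimal avoids float drift when aggregating."""
    inp, out = PRICES[model]
    return (inp * input_tokens + out * output_tokens) / Decimal(1000)
```

Using `Decimal` keeps thousands of sub-cent charges exact when summed into dashboards, where float rounding would otherwise accumulate.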
Extraction returns low confidence: Check if the document is a scanned image. Set `OCR_ENGINE=vision` for better results on scans. Identity documents require 0.90 confidence; receipts tolerate 0.75 (configurable via `CONFIDENCE_THRESHOLDS`).
Circuit breaker stuck OPEN: Check Anthropic API status. The circuit auto-recovers within 60s of provider restoration. See runbook.
Search returns no results: Ensure documents have been embedded. Check the `EMBEDDING_MODEL` env var. Run `GET /api/v1/records` to verify records exist.
Demo mode not working: Set `DEMO_MODE=true` before starting. Demo uses pre-cached results and requires no API keys.
OTEL traces not appearing: OTEL is disabled by default (`OTEL_ENABLED=false`). Enable it and verify `OTEL_EXPORTER_OTLP_ENDPOINT` points to your Jaeger/Tempo instance.
```bash
# Setup
git clone https://github.com/ChunkyTortoise/docextract.git
cd docextract
pip install -r requirements.txt -r requirements-dev.txt

# Run tests (1,060+ tests, ~90% coverage)
pytest tests/ -v --tb=short

# Lint
ruff check app/ worker/ frontend/

# Type check
mypy app/ worker/

# Run golden eval (no API key needed)
python scripts/run_eval_ci.py

# Run locally
docker-compose up   # API + Worker + Frontend + Postgres + Redis
```

PRs should pass all CI checks: lint, type check, tests (80% coverage gate), golden eval (92% accuracy gate), and Docker build.
MIT


