Rohcode/pdf-field-extractor

PDF Field Extractor

Open-source labelled-field extraction for PDFs — born-digital, AcroForm, or scanned. Built-in OCR. Citation-backed. Deterministic. MIT licensed.

A focused service that pulls structured fields out of any PDF and returns them as JSON, with a page reference and bounding box per value. Three OSS libraries glued into one opinionated pipeline; one HTTP endpoint, one reviewer UI, ~2k LOC end-to-end. Production-safe license stack (MIT / BSD / Apache-2.0 / MPL-2.0 — zero AGPL/GPL).

What it does

| Capability | How |
| --- | --- |
| Native (born-digital) PDFs | pdfplumber word extraction + labelled-regex matching. ~80 ms p50 per page. |
| AcroForm PDFs | pypdf reads named form fields directly. ~20 ms per doc. |
| Scanned / faxed / photographed PDFs | ocrmypdf (Tesseract LSTM) transparently adds a text layer. ~3 s/page at 200 DPI. One code path serves all three input modes: downstream strategies don't know or care which kind of PDF arrived. |
| Citation per field | Every value carries (page, snippet, bounding_box, method, confidence): auditable, side-by-side reviewable, no hallucination. |
| Reviewer UI | Single page. Drop a PDF / pick a sample → click Extract → recognised fields are highlighted in situ on the thumbnail and listed inline for review → Copy JSON or Export CSV. |
| REST API | POST /api/extract returns JSON. OpenAPI 3.1 at /openapi.json, Swagger UI at /docs. |
| Deterministic | No LLMs in the pipeline. Same input → same output, every run. Reproducible for audit. |
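
To make "labelled-regex matching" concrete, here is a minimal sketch of the idea: look for a known label, then capture the value that follows it. The `LabelledField` class and `extract_labelled` function are hypothetical names for illustration, not the project's `FieldDef` or its real matcher, and the value pattern is an assumption based on the sample claim number in this README.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LabelledField:
    """Hypothetical stand-in for a field definition; not the project's FieldDef."""
    name: str
    labels: Tuple[str, ...]   # label spellings to look for, e.g. "Claim Number"
    value_pattern: str        # regex for the value that follows the label


def extract_labelled(text: str, field: LabelledField) -> Optional[dict]:
    """Return the first 'Label: value' match for the field, or None."""
    # Try the longest label first so "Claim Number" wins over "Claim No".
    alternatives = "|".join(
        re.escape(label) for label in sorted(field.labels, key=len, reverse=True)
    )
    pattern = re.compile(
        rf"(?:{alternatives})\s*[:#]?\s*(?P<value>{field.value_pattern})",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        return None
    return {
        "name": field.name,
        "value": match.group("value"),
        "snippet": match.group(0),  # the matched text doubles as the citation snippet
    }


claim_number = LabelledField(
    name="claim_number",
    labels=("Claim Number", "Claim No"),
    value_pattern=r"[A-Z]{3}-\d{4}-\d{5}",  # assumed format, from the sample value
)
result = extract_labelled("Claim Number: CLM-2026-00481", claim_number)
```

Because the match is a plain regular expression over extracted text, the same input always yields the same value and snippet, which is the property the "Deterministic" row relies on.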

When to use it

  • Insurance — claim-file intake (FNOL, loss runs, declaration pages, ACORD-style packets).
  • Legal — case-file intake forms, retainer agreements, court intake.
  • Finance / accounting — labelled invoices, receipts, expense reports.
  • HR / compliance — onboarding paperwork, certification forms, KYC packets.
  • Anywhere someone is manually typing Label: value pairs from PDFs into a spreadsheet.

Why it's different

| | This service | Cloud OCR (Textract / Document AI) | LLM extractor (GPT-4o / Claude) |
| --- | --- | --- | --- |
| Hallucination risk | None (deterministic) | None | Possible |
| Citation per field | Yes: (page, bbox) | Partial | Sometimes |
| Native-PDF latency | <100 ms | seconds | seconds |
| Air-gapped / on-prem | Yes (one container) | No | Rarely |
| API keys required | None | Yes | Yes |
| Cost per page | $0 | ~$0.0015 | ~$0.003–0.03 |
| License footprint | MIT, all transitive permissive | proprietary | proprietary |
| Single-binary deploy | Yes (Docker) | n/a | n/a |

Quick start

git clone https://github.com/rohcode/pdf-field-extractor
cd pdf-field-extractor
cp .env.example .env
docker compose up
# Open http://localhost:3000/ui/

Pick a sample from the dropdown (or drop your own PDF). Click Extract. Review fields. Click Export CSV.

API

OpenAPI spec at /openapi.json; interactive docs at /docs.

POST /api/extract

Send the PDF as a multipart/form-data upload in the pdf field. The response is JSON.

curl -F 'pdf=@samples/fnol-text.pdf' http://localhost:3000/api/extract | jq
{
  "doc_id": "a3f1b8c2",
  "pages": 1,
  "ocr_applied": false,
  "elapsed_ms": 84,
  "fields": [
    {
      "name": "claim_number",
      "value": "CLM-2026-00481",
      "value_normalized": { "type": "policy_number", "raw": "CLM-2026-00481" },
      "method": "labeled_regex",
      "confidence": "high",
      "source": {
        "page": 1,
        "snippet": "Claim Number: CLM-2026-00481",
        "bbox": { "x0": 180.0, "y0": 158.4, "x1": 286.0, "y1": 172.1 }
      }
    }
  ]
}
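
For downstream use, the response can be flattened into one row per field. The sketch below is one way to do that with the standard library only. The column order is an assumption chosen for illustration and is not necessarily what the UI's Export CSV button produces; `fields_to_csv` is a hypothetical helper, not part of the service.

```python
import csv
import io
import json


def fields_to_csv(response_json: str) -> str:
    """Flatten an /api/extract response into CSV, one row per extracted field."""
    result = json.loads(response_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["doc_id", "name", "value", "method", "confidence",
                     "page", "x0", "y0", "x1", "y1"])
    for field in result["fields"]:
        source = field["source"]
        bbox = source["bbox"]
        writer.writerow([
            result["doc_id"], field["name"], field["value"],
            field["method"], field["confidence"], source["page"],
            bbox["x0"], bbox["y0"], bbox["x1"], bbox["y1"],
        ])
    return buffer.getvalue()


# Trimmed copy of the sample response shown above.
sample = json.dumps({
    "doc_id": "a3f1b8c2",
    "fields": [{
        "name": "claim_number",
        "value": "CLM-2026-00481",
        "method": "labeled_regex",
        "confidence": "high",
        "source": {
            "page": 1,
            "snippet": "Claim Number: CLM-2026-00481",
            "bbox": {"x0": 180.0, "y0": 158.4, "x1": 286.0, "y1": 172.1},
        },
    }],
})
print(fields_to_csv(sample))
```

Keeping the page and bounding box in every row preserves the citation, so a reviewer can trace each spreadsheet cell back to its location in the PDF.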

Other endpoints

  • GET /api/samples — bundled-sample manifest.
  • GET /api/samples/{name} — stream a bundled sample PDF.
  • GET /api/samples/all — all bundled samples as a single ZIP.
  • GET /healthz returns 200 OK (health probe).

Architecture

PDF in
  │
  ▼
normalize.py        → ocrmypdf (Tesseract) if text-coverage < threshold,
                      else pass-through
  │
  ▼   text-layer PDF
  ├── pypdf AcroForm strategy
  └── pdfplumber labelled-regex strategy (with per-word bbox)
  │
  ▼
merge → ExtractResult { doc_id, pages, ocr_applied, elapsed_ms, fields[…] }

See ARCHITECTURE.md for the longer narrative — coordinate systems, why no LLM, threat model, scaling characteristics.
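
The normalize step's decision can be sketched as a small pure function: average the extractable characters per page and run OCR only when that average falls below the threshold. This is an illustration, not the contents of normalize.py. The name `needs_ocr`, the choice to ignore whitespace, and the handling of empty documents are all assumptions.

```python
from typing import Sequence


def needs_ocr(page_texts: Sequence[str], threshold: float = 50.0) -> bool:
    """Decide whether a PDF should go through OCR before extraction.

    page_texts holds the text already extractable from each page
    (an empty string for an image-only page).
    """
    if not page_texts:
        return False  # assumption: a document with no pages is passed through
    # Assumption: count non-whitespace characters so padding does not
    # make a blank page look like it has a text layer.
    total_chars = sum(len("".join(text.split())) for text in page_texts)
    average = total_chars / len(page_texts)
    return average < threshold


print(needs_ocr(["", ""]))                            # scanned: no text layer
print(needs_ocr(["Claim Number: CLM-2026-00481" * 5]))  # native: plenty of text
```

Because the check runs before either extraction strategy, both strategies always receive a PDF with a text layer, which is what lets one code path serve all three input modes.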

Performance

Measured on an M1 Pro, single page in/out:

| Document | Strategy | p50 |
| --- | --- | --- |
| Native text PDF | pdfplumber regex | ~80 ms |
| AcroForm PDF | pypdf field lookup | ~20 ms |
| Scanned PDF (200 DPI) | ocrmypdf + pdfplumber | ~3 s |

OCR dominates the scanned-PDF latency. Tesseract is single-threaded per page; multi-page batches can be parallelised at the worker layer.
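
As a rough illustration of page-level parallelism, the sketch below fans a per-page OCR callable out across a thread pool and keeps results in page order. It is not how the service is wired. `ocr_pages_in_parallel` and `ocr_one_page` are hypothetical names, and threads are assumed to be sufficient because the real work would happen in an external Tesseract process rather than in Python.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def ocr_pages_in_parallel(
    pages: Sequence[bytes],
    ocr_one_page: Callable[[bytes], str],
    max_workers: int = 4,
) -> List[str]:
    """Run a per-page OCR callable across pages, preserving page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which page finishes first.
        return list(pool.map(ocr_one_page, pages))


# Stand-in for a real OCR call, used only to show the call shape.
def fake_ocr(page: bytes) -> str:
    return page.decode("utf-8").upper()


texts = ocr_pages_in_parallel([b"page one", b"page two"], fake_ocr)
```

With roughly 3 s per page, wall-clock time for a batch would then be bounded by the slowest page per worker rather than the sum of all pages, subject to available CPU cores.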

Configuration

All knobs live in .env:

| Var | Default | Description |
| --- | --- | --- |
| PORT | 3000 | HTTP port |
| OCR_TIMEOUT_S | 60 | Max wall-time for an OCR run |
| MAX_UPLOAD_MB | 10 | Reject larger uploads with 413 |
| RATE_LIMIT_PER_MIN | 30 | Per-IP token-bucket size & refill rate |
| TEXT_COVERAGE_THRESHOLD | 50 | Avg chars/page below which OCR fires |
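
One way these variables could be read into typed settings is sketched below. The variable names and defaults come from the table above. The `Settings` dataclass and `load_settings` function are hypothetical and may not match how the service actually loads its configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Typed view of the .env knobs, with the documented defaults."""
    port: int = 3000
    ocr_timeout_s: int = 60
    max_upload_mb: int = 10
    rate_limit_per_min: int = 30
    text_coverage_threshold: int = 50


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to defaults."""
    d = Settings()
    return Settings(
        port=int(env.get("PORT", d.port)),
        ocr_timeout_s=int(env.get("OCR_TIMEOUT_S", d.ocr_timeout_s)),
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", d.max_upload_mb)),
        rate_limit_per_min=int(env.get("RATE_LIMIT_PER_MIN", d.rate_limit_per_min)),
        text_coverage_threshold=int(
            env.get("TEXT_COVERAGE_THRESHOLD", d.text_coverage_threshold)
        ),
    )


# Example: raise the upload cap and move the port, keep everything else.
settings = load_settings({"PORT": "8080", "MAX_UPLOAD_MB": "25"})
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.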

Built-in field schema

The demo ships with an insurance-flavoured field set, predefined in patterns.py:

policy_number, claim_number, date_of_loss, loss_amount, claimant_name, insured_name, policy_period_start, policy_period_end, deductible, coverage_limit, loss_type.

Add or modify a field by appending one FieldDef entry to that file — no other code change needed.

Need a field that isn't built in for a single request? POST fields_extra alongside the PDF — a JSON array of {name, labels, value_type} objects (max 10 per request). The UI exposes this via "+ Add custom field". Persistent additions still go in patterns.py.
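
A client could assemble and sanity-check the fields_extra value before sending it, as in the sketch below. The object keys and the limit of 10 come from the paragraph above. The helper name `build_fields_extra` and the value_type value "string" are assumptions for illustration, since the accepted value types are not listed here.

```python
import json
from typing import Iterable, Mapping

MAX_EXTRA_FIELDS = 10  # per-request cap stated above
REQUIRED_KEYS = {"name", "labels", "value_type"}


def build_fields_extra(defs: Iterable[Mapping[str, object]]) -> str:
    """Validate custom field definitions and serialise them as a JSON array."""
    items = [dict(d) for d in defs]
    if len(items) > MAX_EXTRA_FIELDS:
        raise ValueError(
            f"at most {MAX_EXTRA_FIELDS} custom fields per request, got {len(items)}"
        )
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"custom field is missing keys: {sorted(missing)}")
    return json.dumps(items)


payload = build_fields_extra([
    {
        "name": "adjuster_name",
        "labels": ["Adjuster", "Adjuster Name"],
        "value_type": "string",  # assumed value; check /docs for accepted types
    },
])
```

The resulting string would then be sent as the fields_extra form field in the same multipart request as the pdf field.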

What this is (and isn't)

Is

  • A small, focused extractor — three OSS libs and one opinionated pipeline.
  • Production-safe license stack (MIT / BSD / Apache-2.0 / MPL-2.0).
  • Type-checked end-to-end (mypy --strict) and ruff-linted, with pytest goldens covering all three input modes.
  • Single docker compose up deploy.

Isn't

  • An LLM extractor. No Anthropic, no OpenAI, no Mastra. Deterministic by construction.
  • A multi-format ingester. Support for DOCX / EML / MSG / image / archive inputs is scoped as phase-2 adapters and is not in v1.
  • A table extractor (yet). Camelot is the natural add when a real prospect's case files need loss-run / SOV-style tables.
  • A multi-tenant SaaS product. No auth, no DB, no audit log. Wrap it in whatever shell your product needs.

Development

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

pytest -v               # 44 tests; goldens cover AcroForm, native, OCR + custom fields
ruff check src tests
mypy --strict src

uvicorn pdf_field_extractor.main:app --reload --port 3000

Inside Docker:

docker compose run --rm pdf-field-extractor pytest -v

Security

See SECURITY.md. Vulnerabilities go through GitHub's private "Report a vulnerability" workflow on this repo, not public issues.

License

MIT. Every transitive dependency is permissively licensed (MIT / BSD / Apache-2.0 / MPL-2.0). CI fails the build if a non-permissive license shows up in a transitive dependency; see .github/workflows/ci.yml.

Frontend asset note

The reviewer UI loads pdfjs-dist@4.10.38 from jsdelivr's CDN on first paint. For air-gapped deployments, vendor pdf.min.mjs and pdf.worker.min.mjs into public/ and adjust the imports in app.js.
