PDF Field Extractor

Open-source labelled-field extraction for PDFs — born-digital, AcroForm, or scanned. Built-in OCR. Citation-backed. Deterministic. MIT licensed.

A focused service that pulls structured fields out of any PDF and returns them as JSON, with a page reference and bounding box per value. Three OSS libraries glued into one opinionated pipeline: one extraction endpoint, one reviewer UI, ~2k LOC end-to-end. Production-safe license stack (MIT / BSD / Apache-2.0 / MPL-2.0; zero AGPL/GPL).

What it does

Capability	How
Native (born-digital) PDFs	`pdfplumber` word extraction + labelled-regex matching. ~80 ms p50 per page.
AcroForm PDFs	`pypdf` reads named form fields directly. ~20 ms per doc.
Scanned / faxed / photographed PDFs	`ocrmypdf` (Tesseract LSTM) transparently adds a text layer. ~3 s/page at 200 DPI. One code path serves all three input modes — downstream strategies don't know or care which kind of PDF arrived.
Citation per field	Every value carries `(page, snippet, bounding_box, method, confidence)` — auditable, side-by-side reviewable, no hallucination.
Reviewer UI	Single page. Drop a PDF / pick a sample → click Extract → recognised fields are highlighted in situ on the thumbnail and listed inline for review → Copy JSON or Export CSV.
REST API	`POST /api/extract` returns JSON. OpenAPI 3.1 at `/openapi.json`, Swagger UI at `/docs`.
Deterministic	No LLMs in the pipeline. Same input → same output, every run. Reproducible for audit.

When to use it

Insurance — claim-file intake (FNOL, loss runs, declaration pages, ACORD-style packets).
Legal — case-file intake forms, retainer agreements, court intake.
Finance / accounting — labelled invoices, receipts, expense reports.
HR / compliance — onboarding paperwork, certification forms, KYC packets.
Anywhere someone is manually typing Label: value pairs from PDFs into a spreadsheet.

Why it's different

	This service	Cloud OCR (Textract / Document AI)	LLM extractor (GPT-4o / Claude)
Hallucination risk	None — deterministic	None	Possible
Citation per field	Yes — `(page, bbox)`	Partial	Sometimes
Native-PDF latency	<100 ms	seconds	seconds
Air-gapped / on-prem	Yes (one container)	No	Rarely
API keys required	None	Yes	Yes
Cost per page	$0	~$0.0015	~$0.003–0.03
License footprint	MIT, all transitive permissive	proprietary	proprietary
Single-container deploy	Yes (Docker)	n/a	n/a

Quick start

git clone https://github.com/rohcode/pdf-field-extractor
cd pdf-field-extractor
cp .env.example .env
docker compose up
# Open http://localhost:3000/ui/

Pick a sample from the dropdown (or drop your own PDF). Click Extract. Review fields. Click Export CSV.

API

OpenAPI spec at /openapi.json; interactive docs at /docs.

`POST /api/extract`

Send the PDF as a multipart form field named pdf. The response is JSON.

curl -F 'pdf=@samples/fnol-text.pdf' http://localhost:3000/api/extract | jq

{
  "doc_id": "a3f1b8c2",
  "pages": 1,
  "ocr_applied": false,
  "elapsed_ms": 84,
  "fields": [
    {
      "name": "claim_number",
      "value": "CLM-2026-00481",
      "value_normalized": { "type": "policy_number", "raw": "CLM-2026-00481" },
      "method": "labeled_regex",
      "confidence": "high",
      "source": {
        "page": 1,
        "snippet": "Claim Number: CLM-2026-00481",
        "bbox": { "x0": 180.0, "y0": 158.4, "x1": 286.0, "y1": 172.1 }
      }
    }
  ]
}
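
The same call can be made from Python without third-party packages. The following is a minimal sketch, assuming the service is running at http://localhost:3000 as in the quick start; the helper names `build_multipart`, `extract`, and `citations` are illustrative and not part of this repository.

```python
import json
import urllib.request
import uuid


def build_multipart(filename: str, pdf_bytes: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data under the form field name 'pdf'."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="pdf"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def extract(path: str, base_url: str = "http://localhost:3000") -> dict:
    """POST a PDF to /api/extract and return the decoded JSON response."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(path.rsplit("/", 1)[-1], fh.read())
    req = urllib.request.Request(
        f"{base_url}/api/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def citations(result: dict) -> list[tuple]:
    """Flatten a response into (name, value, page, (x0, y0, x1, y1)) tuples."""
    rows = []
    for field in result["fields"]:
        src = field["source"]
        box = src["bbox"]
        rows.append(
            (field["name"], field["value"], src["page"],
             (box["x0"], box["y0"], box["x1"], box["y1"]))
        )
    return rows
```

With the server up, `citations(extract("samples/fnol-text.pdf"))` would give one tuple per extracted field, which is usually enough to drive a review table or a highlight overlay.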

Other endpoints

GET /api/samples — bundled-sample manifest.
GET /api/samples/{name} — stream a bundled sample PDF.
GET /api/samples/all — all bundled samples as a single ZIP.
GET /healthz — 200 OK health probe.

Architecture

PDF in
  │
  ▼
normalize.py        → ocrmypdf (Tesseract) if text-coverage < threshold,
                      else pass-through
  │
  ▼   text-layer PDF
  ├── pypdf AcroForm strategy
  └── pdfplumber labelled-regex strategy (with per-word bbox)
  │
  ▼
merge → ExtractResult { doc_id, pages, ocr_applied, elapsed_ms, fields[…] }

See ARCHITECTURE.md for the longer narrative — coordinate systems, why no LLM, threat model, scaling characteristics.
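
The two decision points in the diagram, whether to run OCR and how to merge strategy output, can be sketched in a few lines. This is an illustration only: `needs_ocr` and `merge` are hypothetical names, the coverage measure assumes "average stripped characters per page", and the "first strategy to report a name wins" merge rule is an assumption, not the documented behaviour of normalize.py.

```python
def needs_ocr(page_texts: list[str], threshold: float = 50.0) -> bool:
    """Return True when average characters per page falls below the threshold."""
    if not page_texts:
        return False  # assumption: a document with no pages has nothing to OCR
    avg_chars = sum(len(text.strip()) for text in page_texts) / len(page_texts)
    return avg_chars < threshold


def merge(*strategy_results: list[dict]) -> list[dict]:
    """Combine field lists; the earliest strategy to report a name wins (assumed rule)."""
    merged: dict[str, dict] = {}
    for fields in strategy_results:
        for field in fields:
            # setdefault keeps the first value seen for each field name
            merged.setdefault(field["name"], field)
    return list(merged.values())
```

Passing the AcroForm results first would make named form fields take precedence over regex matches under this assumed rule.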

Performance

Measured on an M1 Pro, single page in/out:

Document	Strategy	p50
Native text PDF	`pdfplumber` regex	~80 ms
AcroForm PDF	`pypdf` field lookup	~20 ms
Scanned PDF (200 DPI)	`ocrmypdf` + `pdfplumber`	~3 s

OCR dominates the scanned-PDF latency. Tesseract is single-threaded per page; multi-page batches can be parallelised at the worker layer.
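
One way to parallelise a multi-page batch is to fan pages out to a pool of workers. The sketch below assumes the per-page OCR step is available as a callable (the `ocr_page` parameter is hypothetical). A thread pool is enough when the heavy lifting happens in an external Tesseract process, because the threads mostly wait on that process.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def ocr_pages(
    pages: Sequence[bytes],
    ocr_page: Callable[[bytes], str],
    max_workers: int = 4,
) -> list[str]:
    """Run ocr_page over every page concurrently, returning text in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of finish order
        return list(pool.map(ocr_page, pages))
```

Keeping `max_workers` at or below the number of CPU cores avoids oversubscribing the machine, since each Tesseract run is CPU-bound.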

Configuration

All knobs live in .env:

Var	Default	Description
`PORT`	`3000`	HTTP port
`OCR_TIMEOUT_S`	`60`	Max wall-time for an OCR run
`MAX_UPLOAD_MB`	`10`	Reject larger uploads with `413`
`RATE_LIMIT_PER_MIN`	`30`	Per-IP token-bucket size & refill rate
`TEXT_COVERAGE_THRESHOLD`	`50`	Avg chars/page below which OCR fires
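
For illustration, the variables above could be read into a typed settings object as follows. This is a sketch, not the service's actual loader: the `Settings` class and `load_settings` function are hypothetical, and it assumes every value is an integer, as the defaults suggest.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; defaults mirror the table above."""
    port: int = 3000
    ocr_timeout_s: int = 60
    max_upload_mb: int = 10
    rate_limit_per_min: int = 30
    text_coverage_threshold: int = 50


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to the defaults."""
    defaults = Settings()

    def read(name: str, default: int) -> int:
        return int(env.get(name, default))

    return Settings(
        port=read("PORT", defaults.port),
        ocr_timeout_s=read("OCR_TIMEOUT_S", defaults.ocr_timeout_s),
        max_upload_mb=read("MAX_UPLOAD_MB", defaults.max_upload_mb),
        rate_limit_per_min=read("RATE_LIMIT_PER_MIN", defaults.rate_limit_per_min),
        text_coverage_threshold=read(
            "TEXT_COVERAGE_THRESHOLD", defaults.text_coverage_threshold
        ),
    )
```

Accepting the environment as a parameter keeps the loader easy to test: pass a plain dict instead of mutating `os.environ`.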

Built-in field schema

The demo ships with an insurance-flavoured field set, predefined in patterns.py:

policy_number, claim_number, date_of_loss, loss_amount, claimant_name, insured_name, policy_period_start, policy_period_end, deductible, coverage_limit, loss_type.

Add or modify a field by appending one FieldDef entry to that file — no other code change needed.
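
To make the idea of a field definition concrete, here is a rough sketch of labelled-regex matching. The `FieldDef` shape shown (name, labels, value_type) is inferred from the custom-field payload and may not match the real class in patterns.py; `VALUE_PATTERNS` and `match_field` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldDef:
    """Hypothetical shape; the real FieldDef in patterns.py may differ."""
    name: str
    labels: tuple
    value_type: str


# Hypothetical value patterns, keyed by value_type
VALUE_PATTERNS = {
    "date": r"\d{4}-\d{2}-\d{2}",
    "amount": r"\$?[\d,]+(?:\.\d{2})?",
    "text": r"[^\n]+",
}


def match_field(field: FieldDef, text: str) -> Optional[str]:
    """Find the first 'Label: value' occurrence for any of the field's labels."""
    labels = "|".join(re.escape(label) for label in field.labels)
    pattern = rf"(?:{labels})\s*[:#]?\s*({VALUE_PATTERNS[field.value_type]})"
    found = re.search(pattern, text, flags=re.IGNORECASE)
    return found.group(1).strip() if found else None
```

Under this shape, a new field is one more entry such as `FieldDef("date_of_loss", ("Date of Loss", "Loss Date"), "date")`.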

Need a field that isn't built in for a single request? POST fields_extra alongside the PDF — a JSON array of {name, labels, value_type} objects (max 10 per request). The UI exposes this via "+ Add custom field". Persistent additions still go in patterns.py.
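
A client can assemble and sanity-check the fields_extra value before sending it. The sketch below assumes the array is serialised to a JSON string and posted as a form field next to the PDF; `build_fields_extra` is a hypothetical helper, and the `"text"` value type in the usage note is a guess, since the accepted value_type names are not listed here.

```python
import json

MAX_EXTRA_FIELDS = 10  # per-request limit stated above
REQUIRED_KEYS = {"name", "labels", "value_type"}


def build_fields_extra(fields: list[dict]) -> str:
    """Validate custom field definitions and serialise them to a JSON array."""
    if len(fields) > MAX_EXTRA_FIELDS:
        raise ValueError(f"at most {MAX_EXTRA_FIELDS} custom fields per request")
    for index, field in enumerate(fields):
        missing = REQUIRED_KEYS - field.keys()
        if missing:
            raise ValueError(f"field {index} is missing keys: {sorted(missing)}")
    return json.dumps(fields)
```

For example, `build_fields_extra([{"name": "adjuster_name", "labels": ["Adjuster"], "value_type": "text"}])` returns a string ready to attach to the request.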

What this is (and isn't)

Is

A small, focused extractor — three OSS libs and one opinionated pipeline.
Production-safe license stack (MIT / BSD / Apache-2.0 / MPL-2.0).
Type-checked end-to-end (mypy --strict), ruff-linted, pytest goldens cover all three input modes.
Single docker compose up deploy.

Isn't

An LLM extractor. No Anthropic, no OpenAI, no Mastra. Deterministic by construction.
A multi-format ingester. DOCX / EML / MSG / image / archive support is planned as clearly scoped phase-2 adapters, not in v1.
A table extractor (yet). Camelot is the natural add when a real prospect's case files need loss-run / SOV-style tables.
A multi-tenant SaaS product. No auth, no DB, no audit log. Wrap it in whatever shell your product needs.

Development

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

pytest -v               # 44 tests; goldens cover AcroForm, native, OCR + custom fields
ruff check src tests
mypy --strict src

uvicorn pdf_field_extractor.main:app --reload --port 3000

Inside Docker:

docker compose run --rm pdf-field-extractor pytest -v

Security

See SECURITY.md. Vulnerabilities go through GitHub's private "Report a vulnerability" workflow on this repo, not public issues.

License

MIT. Every transitive dependency is permissively licensed (MIT / BSD / Apache-2.0 / MPL-2.0). CI fails the build if a non-permissive license shows up in any transitive dependency; see .github/workflows/ci.yml.

Frontend asset note

The reviewer UI loads pdfjs-dist@4.10.38 from jsdelivr's CDN on first paint. For air-gapped deployments, vendor pdf.min.mjs and pdf.worker.min.mjs into public/ and adjust the imports in app.js.
