PolicyTrace is a Document AI workflow for UK motor insurance PDFs. It extracts a structured Golden Record, resolves fields across multiple policy documents, and gives reviewers source-level evidence inside a split-screen PDF audit UI.
PolicyTrace is part of AI Tool Stack: practical AI builds, deployable workflows, and lessons beyond the demo.
Most AI document demos stop at "the model returned JSON once." Real document workflows need more:
- PDF parsing that survives real layouts.
- Typed outputs that downstream systems can trust.
- PII handling before model calls.
- Multi-document source authority rules.
- Conflict detection.
- Field-level evidence.
- A human review loop.
PolicyTrace shows that full path using a realistic UK motor insurance pack.
The repo includes a fully synthetic demo pack, safe for public screenshots and deployments:
sample_data/policytrace_demo_pack/
Upload the demo PDFs, then inspect and verify extracted fields:
- Upload Schedule, Certificate, Statement of Fact, and Policy Booklet PDFs.
- Convert PDF text and layout with Docling.
- Mask configured PII entities before LLM extraction.
- Classify document type.
- Extract typed JSON with Groq, Instructor, and Pydantic.
- Merge fields using a "hierarchy of truth" policy arbiter.
- Match extracted fields back to source PDF locations.
- Review each field with verify, flag, and override actions.
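The "hierarchy of truth" merge step can be pictured as a precedence lookup plus a conflict list. The sketch below is illustrative only: the `AUTHORITY` ordering, the `Candidate` type, and the `arbitrate` function are hypothetical names, and the real arbiter's ranking and rules may differ.

```python
from dataclasses import dataclass

# Hypothetical precedence order, highest authority first.
# The project's real ranking may differ.
AUTHORITY = ["certificate", "schedule", "statement_of_fact", "policy_booklet"]


@dataclass(frozen=True)
class Candidate:
    field: str     # field name, e.g. "registration"
    value: str     # value extracted from one document
    doc_type: str  # one of AUTHORITY


def arbitrate(candidates):
    """Pick one value per field by document authority and list disagreements."""
    rank = {doc: i for i, doc in enumerate(AUTHORITY)}
    by_field = {}
    for c in candidates:
        by_field.setdefault(c.field, []).append(c)

    golden, conflicts = {}, {}
    for field, group in by_field.items():
        group.sort(key=lambda c: rank[c.doc_type])
        winner = group[0]
        golden[field] = {"value": winner.value, "source": winner.doc_type}
        # Lower-ranked documents that disagree with the winner are conflicts.
        losers = [c for c in group[1:] if c.value != winner.value]
        if losers:
            conflicts[field] = [(c.doc_type, c.value) for c in losers]
    return golden, conflicts


candidates = [
    Candidate("registration", "AB12 CDE", "schedule"),
    Candidate("registration", "AB12 CDF", "statement_of_fact"),
    Candidate("excess", "250", "schedule"),
]
golden, conflicts = arbitrate(candidates)
```

Keeping the losing values alongside the winner is what lets a reviewer see why a field was flagged instead of only seeing the final answer.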
flowchart LR
A["PDF pack"] --> B["Docling text + layout"]
B --> C["PII masking"]
C --> D["Document classifier"]
D --> E["Specialist extraction prompts"]
E --> F["Pydantic schema"]
F --> G["PolicyArbiter"]
B --> H["Geometry corpus"]
G --> I["Golden Record"]
I --> J["Provenance matcher"]
H --> J
J --> K["FastAPI session API"]
K --> L["React review UI"]
See docs/architecture.md for the detailed walkthrough.
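As a rough illustration of the provenance matcher stage, the sketch below scores an extracted value against each text span in a geometry corpus and returns the best span with its page and bounding box. It uses only the standard library's `difflib`; the corpus shape, the `locate` function, and the 0.8 threshold are assumptions for illustration, not the project's actual data model or matching algorithm.

```python
from difflib import SequenceMatcher

# Hypothetical geometry corpus: one entry per text span with its page and
# bounding box (x0, y0, x1, y1). Real Docling output is richer than this.
corpus = [
    {"text": "Policy number: PT-000123", "page": 1, "bbox": (72, 700, 300, 715)},
    {"text": "Vehicle registration AB12 CDE", "page": 1, "bbox": (72, 650, 320, 665)},
    {"text": "Compulsory excess GBP 250", "page": 2, "bbox": (72, 500, 260, 515)},
]


def normalise(text):
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def locate(value, spans, threshold=0.8):
    """Return the span that best contains value, or None below threshold."""
    needle = normalise(value)
    if not needle:
        return None
    best, best_score = None, 0.0
    for span in spans:
        hay = normalise(span["text"])
        matcher = SequenceMatcher(None, needle, hay, autojunk=False)
        block = matcher.find_longest_match(0, len(needle), 0, len(hay))
        # Share of the value covered by its longest contiguous match.
        score = block.size / len(needle)
        if score > best_score:
            best, best_score = span, score
    return best if best_score >= threshold else None


hit = locate("ab12 cde", corpus)
```

A threshold below 1.0 tolerates small OCR or formatting differences, at the cost of occasionally pointing at the wrong span, which is one reason a human review step sits after this stage.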
| Layer | Tools |
|---|---|
| PDF parsing | Docling |
| Extraction | Groq, Instructor, Pydantic |
| PII masking | Microsoft Presidio, spaCy |
| Arbitration | Custom hierarchy-of-truth merge logic |
| Provenance | Docling geometry + fuzzy matching |
| API | FastAPI |
| UI | React, Vite, Tailwind, react-pdf, Zustand |
| Deployment | Docker, Hugging Face Spaces-compatible |
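To show the idea behind masking PII before any model call, here is a deliberately simplified stand-in that uses standard-library regular expressions instead of Presidio and spaCy. The patterns, the placeholder format, and the `mask` function are hypothetical, and keeping a placeholder-to-value mapping is an assumption about the workflow, not something the project confirms.

```python
import re

# Hypothetical patterns for illustration. The project configures
# Presidio recognisers rather than hand-written regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "UK_POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),
}


def mask(text):
    """Replace each match with a numbered placeholder and return the mapping."""
    mapping = {}
    for label, pattern in PATTERNS.items():
        def repl(match, label=label):
            token = f"<{label}_{len(mapping) + 1}>"
            mapping[token] = match.group(0)
            return token
        text = pattern.sub(repl, text)
    return text, mapping


masked, mapping = mask("Contact jane.doe@example.com at SW1A 1AA.")
# masked is now "Contact <EMAIL_1> at <UK_POSTCODE_2>."
```

Only the masked text would be sent to the LLM; the mapping stays local so original values never leave the process.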
.
|-- src/ FastAPI backend, extraction, schema, provenance
|-- ui/ React review dashboard
|-- config/ Runtime settings and versioned prompts
|-- sample_data/ Synthetic public demo PDFs
|-- scripts/ Demo PDF generation utilities
|-- tests/ Deterministic unit tests
|-- docs/ Architecture and deployment notes
|-- Dockerfile Single-container production build
|-- .env.example Local environment template
`-- README.md
pip install -r requirements.txt
python -m spacy download en_core_web_sm
Copy-Item .env.example .env
Add your Groq key to .env:
GROQ_API_KEY="replace_with_your_groq_api_key"
Start the API:
uvicorn api:app --app-dir src --reload --port 8000
Start the UI:
cd ui
npm install
npm run dev
Open:
http://localhost:5173
Upload the synthetic PDFs from:
sample_data/policytrace_demo_pack/
Run extraction without the review UI:
python src/main.py --input sample_data/policytrace_demo_pack --output output/golden_record.json
The Dockerfile builds the React UI and serves it from FastAPI:
docker build -t policytrace .
docker run --rm -p 7860:7860 --env-file .env policytrace
Open:
http://localhost:7860
Use a Hugging Face Docker Space. Add GROQ_API_KEY as a Space Secret and point the Space at this repo.
Deployment notes are in docs/hugging-face.md. Once the live Space is available, add the demo link here:
https://huggingface.co/spaces/<org>/<space-name>
pip install -r requirements-dev.txt
pytest tests/test_arbiter.py -v
This project can process sensitive insurance documents. For public demos:
- Use only synthetic or redacted PDFs.
- Never commit .env or API keys.
- Never commit real policy documents.
- Never commit output/, session folders, or debug artifacts.
- Rotate any API key that was ever stored locally before publishing.
See SECURITY.md.
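The "never commit" rules lend themselves to an automated check. The following is a minimal sketch, assuming a hypothetical `forbidden` helper and deny rules that cover only `.env` and `output/`; session folders and debug artifacts would need their own rules, and none of this is part of the repo.

```python
from pathlib import PurePosixPath

# Hypothetical deny rules: exact file names and directory names
# that should never appear in a commit.
FORBIDDEN_NAMES = {".env"}
FORBIDDEN_DIRS = {"output"}


def forbidden(paths):
    """Return the paths that match a deny rule, in input order."""
    hits = []
    for raw in paths:
        p = PurePosixPath(raw)
        in_forbidden_dir = bool(FORBIDDEN_DIRS & set(p.parts[:-1]))
        if p.name in FORBIDDEN_NAMES or in_forbidden_dir:
            hits.append(raw)
    return hits


staged = [".env", ".env.example", "output/golden_record.json", "src/main.py"]
blocked = forbidden(staged)
```

In practice the list of paths could come from `git diff --cached --name-only` inside a pre-commit hook, with a non-empty result aborting the commit.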
- Public demo extraction is synchronous and can take 30 to 90 seconds.
- Provenance matching is useful but not a legal-grade guarantee.
- Public deployments should use synthetic/redacted documents unless stronger retention and access controls are added.
- Production use needs authentication, audit logs, monitoring, and storage policy controls.
MIT. See LICENSE.


