PolicyTrace

PolicyTrace is a Document AI workflow for UK motor insurance PDFs. It extracts a structured Golden Record, resolves fields across multiple policy documents, and gives reviewers source-level evidence inside a split-screen PDF audit UI.

PolicyTrace is part of AI Tool Stack: practical AI builds, deployable workflows, and lessons beyond the demo.

Why This Project Exists

Most AI document demos stop at "the model returned JSON once." Real document workflows need more:

PDF parsing that survives real layouts.
Typed outputs that downstream systems can trust.
PII handling before model calls.
Multi-document source authority rules.
Conflict detection.
Field-level evidence.
A human review loop.

PolicyTrace shows that full path using a realistic UK motor insurance pack.

Demo

The repo includes a fully synthetic demo pack, safe for public screenshots and deployments:

sample_data/policytrace_demo_pack/

Upload the demo PDFs, then inspect and verify the extracted fields in the review UI.

What It Does

Upload Schedule, Certificate, Statement of Fact, and Policy Booklet PDFs.
Convert PDF text and layout with Docling.
Mask configured PII entities before LLM extraction.
Classify document type.
Extract typed JSON with Groq, Instructor, and Pydantic.
Merge fields using a "hierarchy of truth" policy arbiter.
Match extracted fields back to source PDF locations.
Review each field with verify, flag, and override actions.
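The "hierarchy of truth" merge step can be illustrated with a small standalone sketch. This is not the project's PolicyArbiter: the precedence order in `AUTHORITY`, the `Candidate` and `Resolved` types, and the `arbitrate` function are all hypothetical names chosen for illustration, and the real authority rules may be per-field rather than global.

```python
from dataclasses import dataclass

# Assumed precedence, most authoritative first. The project's real ordering
# may differ and may vary by field.
AUTHORITY = ["schedule", "certificate", "statement_of_fact", "policy_booklet"]


@dataclass
class Candidate:
    field: str     # e.g. "registration"
    value: str     # extracted value as text
    doc_type: str  # must be one of AUTHORITY


@dataclass
class Resolved:
    value: str      # value taken from the most authoritative document
    source: str     # doc_type that supplied the winning value
    conflict: bool  # True if documents disagreed on this field


def arbitrate(candidates: list[Candidate]) -> dict[str, Resolved]:
    """Pick one value per field and flag fields where documents disagree."""
    by_field: dict[str, list[Candidate]] = {}
    for c in candidates:
        by_field.setdefault(c.field, []).append(c)

    golden: dict[str, Resolved] = {}
    for field, group in by_field.items():
        # Lower index means higher authority; unknown doc types raise ValueError.
        group.sort(key=lambda c: AUTHORITY.index(c.doc_type))
        winner = group[0]
        # Compare case- and whitespace-insensitively to detect conflicts.
        distinct = {" ".join(c.value.lower().split()) for c in group}
        golden[field] = Resolved(winner.value, winner.doc_type, len(distinct) > 1)
    return golden
```

Under these assumptions, a registration that appears differently in the Schedule and the Statement of Fact resolves to the Schedule value with `conflict=True`, which is the kind of field a reviewer would want surfaced first.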

Architecture

flowchart LR
    A["PDF pack"] --> B["Docling text + layout"]
    B --> C["PII masking"]
    C --> D["Document classifier"]
    D --> E["Specialist extraction prompts"]
    E --> F["Pydantic schema"]
    F --> G["PolicyArbiter"]
    B --> H["Geometry corpus"]
    G --> I["Golden Record"]
    I --> J["Provenance matcher"]
    H --> J
    J --> K["FastAPI session API"]
    K --> L["React review UI"]

See docs/architecture.md for the detailed walkthrough.
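As a rough illustration of the provenance matcher stage, the sketch below matches an extracted value against text spans that carry page and bounding-box geometry. It uses Python's standard `difflib` as a stand-in for whatever fuzzy matcher the project uses; the `Span` type, the `locate` function, the bounding-box layout, and the 0.8 threshold are assumptions made for this example, not the repository's API.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Span:
    text: str
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1); units assumed


def normalise(s: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return " ".join(s.lower().split())


def locate(value: str, spans: list[Span], threshold: float = 0.8):
    """Return (best_span, score), or (None, score) when nothing is close enough."""
    target = normalise(value)
    best, best_score = None, 0.0
    for span in spans:
        candidate = normalise(span.text)
        if target and target in candidate:
            score = 1.0  # exact containment counts as a full match
        else:
            score = SequenceMatcher(None, target, candidate).ratio()
        if score > best_score:
            best, best_score = span, score
    if best_score < threshold:
        return None, best_score
    return best, best_score
```

A caller would pass the Golden Record value and the spans from the geometry corpus, then use the returned `page` and `bbox` to highlight the evidence in the PDF pane. Returning `None` below the threshold keeps weak matches from being shown as evidence.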

Tech Stack

Layer	Tools
PDF parsing	Docling
Extraction	Groq, Instructor, Pydantic
PII masking	Microsoft Presidio, spaCy
Arbitration	Custom hierarchy-of-truth merge logic
Provenance	Docling geometry + fuzzy matching
API	FastAPI
UI	React, Vite, Tailwind, react-pdf, Zustand
Deployment	Docker, Hugging Face Spaces-compatible
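To make the PII masking row concrete, here is a minimal sketch of masking configured entity types before text is sent to a model. It deliberately uses plain regular expressions rather than Presidio, so the patterns, the `PATTERNS` table, the `mask` function, the placeholder format, and the reverse mapping are all illustrative assumptions; a recogniser library covers far more entity types and edge cases.

```python
import re

# Illustrative patterns only, not the recognisers the project configures.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "UK_POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),
}


def mask(text: str, entities: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each configured entity with a placeholder token.

    Returns the masked text and a token-to-original mapping so values can be
    restored after extraction (restoration is an assumption of this sketch).
    """
    mapping: dict[str, str] = {}

    for entity in entities:
        def repl(match: re.Match, entity: str = entity) -> str:
            # Tokens are numbered globally, in the order they are created.
            token = f"<{entity}_{len(mapping) + 1}>"
            mapping[token] = match.group(0)
            return token

        text = PATTERNS[entity].sub(repl, text)
    return text, mapping
```

Only the entity types named in `entities` are masked, which mirrors the idea of a configurable entity list: fields the extractor needs, such as a vehicle registration, can be left visible while contact details are hidden.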

Repository Layout

.
|-- src/                 FastAPI backend, extraction, schema, provenance
|-- ui/                  React review dashboard
|-- config/              Runtime settings and versioned prompts
|-- sample_data/         Synthetic public demo PDFs
|-- scripts/             Demo PDF generation utilities
|-- tests/               Deterministic unit tests
|-- docs/                Architecture and deployment notes
|-- Dockerfile           Single-container production build
|-- .env.example         Local environment template
`-- README.md

Quickstart

1. Backend

pip install -r requirements.txt
python -m spacy download en_core_web_sm
Copy-Item .env.example .env    # PowerShell; on macOS/Linux use: cp .env.example .env

Add your Groq key to .env:

GROQ_API_KEY="replace_with_your_groq_api_key"

Start the API:

uvicorn api:app --app-dir src --reload --port 8000

2. Frontend

cd ui
npm install
npm run dev

Open:

http://localhost:5173

Upload the synthetic PDFs from:

sample_data/policytrace_demo_pack/

CLI Mode

Run extraction without the review UI:

python src/main.py --input sample_data/policytrace_demo_pack --output output/golden_record.json

Docker

The Dockerfile builds the React UI and serves it from FastAPI:

docker build -t policytrace .
docker run --rm -p 7860:7860 --env-file .env policytrace

Open:

http://localhost:7860

Hugging Face Deployment

Use a Hugging Face Docker Space. Add GROQ_API_KEY as a Space Secret and point the Space at this repo.

Deployment notes are in docs/hugging-face.md. Once the live Space is available, add the demo link here:

https://huggingface.co/spaces/<org>/<space-name>

Tests

pip install -r requirements-dev.txt
pytest tests/test_arbiter.py -v

Privacy And Safety

This project can process sensitive insurance documents. For public demos:

Use only synthetic or redacted PDFs.
Never commit .env or API keys.
Never commit real policy documents.
Never commit output/, session folders, or debug artifacts.
Rotate any API key that was ever stored locally before publishing.

See SECURITY.md.

Current Limitations

Public demo extraction is synchronous and can take 30 to 90 seconds.
Provenance matching is useful but not a legal-grade guarantee.
Public deployments should use synthetic/redacted documents unless stronger retention and access controls are added.
Production use needs authentication, audit logs, monitoring, and storage policy controls.

License

MIT. See LICENSE.
