
PDF Autofillr

AI-powered PDF form filling system. Extracts fields from PDF forms, maps them to your data schema using an LLM, and fills them automatically.


How it works

PDF template + schema keys
         │
         ▼
   make_embed_file          ← run once per PDF template
   (extract → map → embed)
         │
         ▼
   Embedded PDF             ← reusable template with metadata baked in
         │
         ▼
   fill(input_json)         ← run once per user
         │
         ▼
   Filled PDF
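The split between the two phases is the point: the expensive LLM mapping runs once per template, and every subsequent fill just reuses the field-to-key mapping baked into the PDF. A toy sketch of that idea (hypothetical names, not the project's actual API):

```python
# Toy illustration: map once per template, fill many times.
# expensive_llm_map is a stand-in for the real extract -> map step.
def expensive_llm_map(template_fields, schema_keys):
    # Pretend an LLM matched each PDF field to a schema key.
    return {f: k for f, k in zip(template_fields, schema_keys)}

class EmbeddedTemplate:
    """Holds the field->key mapping, like metadata embedded in the PDF."""
    def __init__(self, mapping):
        self.mapping = mapping

    def fill(self, user_data):
        # Cheap per-user step: plain lookups, no LLM involved.
        return {field: user_data[key] for field, key in self.mapping.items()}

mapping = expensive_llm_map(["name_1", "name_2"], ["firstName", "lastName"])
template = EmbeddedTemplate(mapping)
print(template.fill({"firstName": "Jane", "lastName": "Doe"}))
# → {'name_1': 'Jane', 'name_2': 'Doe'}
```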

Project structure

pdf-autofillr/
├── modules/
│   ├── mapper/             # Core engine (production-ready)
│   │   ├── src/            # Server-side business logic
│   │   ├── sdk/            # Python SDK (pip install pdf-autofiller-mapper)
│   │   ├── entrypoints/    # local, HTTP server, Lambda, Azure, GCP
│   │   ├── deployment/     # Docker
│   │   ├── docs/           # Module-level docs
│   │   ├── tests/          # 169 tests
│   │   ├── api_server.py   # FastAPI entry point
│   │   └── README.md       # Module guide
│   │
│   └── chatbot/            # Conversational data collection (separate service)
│
├── sdks/
│   ├── openapi-mapper.yaml         # OpenAPI spec for mapper
│   ├── openapi-chatbot.yaml
│   ├── openapi-rag.yaml
│   ├── openapi-upload.yaml
│   └── typescript/                 # TypeScript HTTP client
│
├── docs/
│   ├── architecture/               # System design docs
│   ├── guides/                     # Per-module guides
│   └── MIGRATION_SDK_INTO_MODULES.md
│
├── benchmarks/                     # Model evaluation — datasets, tasks, metrics, leaderboard
│   ├── datasets/                   # PDF categories (financial, medical, legal, …)
│   ├── tasks/                      # field_extraction, field_mapping, form_filling
│   ├── metrics/                    # Scoring functions
│   ├── models/                     # Model config cards (gpt-4o, claude, llama, …)
│   ├── results/                    # Benchmark run outputs + leaderboard
│   └── run_benchmark.py            # Entry point
│
├── data/                           # Shared sample PDFs and JSON fixtures
├── examples/                       # Usage examples (HTTP API, direct SDK)
├── Makefile                        # Common commands
├── setup.sh / setup.ps1            # One-time project setup
└── start.sh / stop.sh              # Server lifecycle

Quick start

Option 1 — Automated setup

./setup.sh                        # Mac / Linux
# or
pwsh -File setup.ps1              # Windows

Option 2 — Manual setup

cd modules/mapper
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp config.ini.example config.ini
# set llm_model and API keys in config.ini
python api_server.py
# → http://localhost:8000

Option 3 — Docker

cd modules/mapper/deployment/docker
./docker-build.sh
./docker-run-local.sh

Set your LLM key before starting:

export OPENAI_API_KEY=sk-...          # OpenAI
export ANTHROPIC_API_KEY=sk-ant-...   # Anthropic
# or use Ollama (free, local): set llm_model = ollama/llama3.1 in config.ini

Python SDK

pip install pdf-autofiller-mapper           # HTTP client only
pip install pdf-autofiller-mapper[embedded] # + in-process pipeline

# Embedded (in-process, no server needed)
from pdf_autofiller_mapper import PDFMapper

mapper = PDFMapper(config_path="config.ini")
result = mapper.make_embed_file("form.pdf", "schema_keys.json")
result.save("form_embedded.pdf")

filled = mapper.fill("form_embedded.pdf", {"firstName": "Jane", "lastName": "Doe"})
filled.save("filled.pdf")

# HTTP client (talks to a running server / Docker container)
from pdf_autofiller_mapper import PDFMapperClient

with PDFMapperClient("http://localhost:8000") as client:
    result = client.mapper.make_embed_file(pdf_path="s3://bucket/form.pdf")

Full SDK guide: modules/mapper/sdk/README.md


API endpoints

Method  Path               Description
------  ----               -----------
GET     /health            Health check
POST    /extract           Extract form fields from PDF
POST    /map               LLM semantic mapping
POST    /embed             Embed field metadata into PDF
POST    /fill              Fill embedded PDF with user data
POST    /make-embed-file   extract + map + embed in one call
POST    /fill-pdf          Alias for fill
POST    /run-all           Full pipeline (make-embed + fill)
POST    /check-embed-file  Check if PDF has embedded metadata

Full API reference: modules/mapper/docs/api_server.md
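For quick experiments without the SDK, the endpoints can be hit directly over HTTP. A minimal sketch using Python's standard library, assuming /fill accepts a JSON body (the field names "pdf_path" and "data" are illustrative assumptions; check the API reference above for the real request shape):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def build_fill_request(pdf_path, data, base=BASE_URL):
    """Build a POST /fill request. The JSON field names used here are
    assumptions; see modules/mapper/docs/api_server.md for the real shape."""
    body = json.dumps({"pdf_path": pdf_path, "data": data}).encode()
    return urllib.request.Request(
        f"{base}/fill",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fill_pdf(pdf_path, data):
    # Sends the request to a running server (see Quick start).
    with urllib.request.urlopen(build_fill_request(pdf_path, data)) as resp:
        return json.loads(resp.read())
```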


Makefile commands

make setup          # Full automated setup
make start          # Start the API server
make dev            # Start with auto-reload
make stop           # Stop the server
make health         # curl /health
make test           # Run all tests (169 tests)
make install        # Install Python dependencies
make install-sdk    # Install mapper Python SDK
make docker-build   # Build Docker image
make docker-run     # Run Docker container

Configuration

Edit modules/mapper/config.ini (copied from config.ini.example):

[general]
llm_model = gpt-4o-mini       # or anthropic/claude-3-5-haiku, ollama/llama3.1
source_type = local            # local | aws | azure | gcp

[local]
workspace = /path/to/data
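As a sanity check, a file in this format parses cleanly with Python's standard configparser (illustrative only; the mapper uses its own loader, covered in the configuration reference):

```python
import configparser

# Parse a config.ini shaped like the one above (sample values, no
# inline comments, since configparser keeps those by default).
sample = """
[general]
llm_model = gpt-4o-mini
source_type = local

[local]
workspace = /path/to/data
"""

cfg = configparser.ConfigParser()
cfg.read_string(sample)
print(cfg["general"]["llm_model"])   # → gpt-4o-mini
print(cfg["local"]["workspace"])     # → /path/to/data
```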

Full configuration reference: modules/mapper/docs/setup_guide.md


Tests

# Mapper module tests
cd modules/mapper
venv/bin/python -m pytest tests/ --override-ini="addopts=" -q
# 169 passed

# SDK tests
cd modules/mapper/sdk
venv/bin/python -m pytest tests/ -q
# 101 passed

Documentation

Topic           Link
-----           ----
Mapper module   modules/mapper/README.md
SDK guide       modules/mapper/sdk/README.md
API server      modules/mapper/docs/api_server.md
Setup guide     modules/mapper/docs/setup_guide.md
Docker          modules/mapper/docs/docker.md
Architecture    docs/architecture/system-overview.md
Module guides   docs/guides/
OpenAPI specs   sdks/
Benchmarks      benchmarks/README.md

License

MIT — see LICENSE
