Upload a PDF. Get structured, queryable knowledge back — powered by Azure AI.
This repo is a ready-to-use AI prompt that instructs any coding assistant to build a complete document intelligence pipeline on Azure. Hand it to GitHub Copilot, Claude Code, Cursor, or Windsurf and watch it scaffold the entire system: extraction, chunking, indexing, and a query UI — deployable in under 30 minutes.
It ships as an APM package so you can install it as a versioned dependency in any project.
Organizations sit on mountains of complex PDFs — reports, manuals, contracts — full of tables, charts, and figures that are effectively unsearchable. Traditional search can't understand a bar chart or a multi-column table.
This pipeline turns every page, table, and infographic into indexed, queryable chunks. Users upload a PDF and immediately ask questions grounded in its actual content.
```mermaid
flowchart LR
    A["📄 PDF Upload"] --> B["🔍 Azure AI\nDocument Intelligence"]
    B --> C["🧠 GPT-5\nFigure Interpretation"]
    B --> D["✂️ Chunking\n~512 tokens"]
    C --> D
    D --> E["📦 JSON\nBlob Storage"]
    E --> F["🔎 Azure AI Search\nIndex"]
    F --> G["💬 Query UI\n+ GPT-5 Answers"]
    style A fill:#4A90D9,color:#fff
    style B fill:#5B9BD5,color:#fff
    style C fill:#7B68EE,color:#fff
    style D fill:#F4A460,color:#fff
    style E fill:#87CEEB,color:#000
    style F fill:#3CB371,color:#fff
    style G fill:#9370DB,color:#fff
```
Stage 1 — Extract · Azure AI Document Intelligence pulls out text, tables, reading order, and layout. Detected figures are cropped and sent to GPT-5 for visual interpretation.
Stage 2 — Chunk · All content (including figure descriptions) is split into ~512-token overlapping chunks. Tables and figure captions are kept intact — never split mid-content.
Stage 3 — Store · Chunks are written as structured JSON to Azure Blob Storage, creating an auditable intermediate artifact.
Stage 4 — Index · Each chunk is pushed into Azure AI Search, making the full document instantly queryable.
Stage 5 — Query · A web UI lets users ask natural-language questions and get grounded answers backed by the indexed content and GPT-5.
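To make Stage 2 concrete, here is a minimal sketch of overlapping fixed-window chunking. It is an illustration, not the pipeline's actual implementation: word count stands in for token count, and the table/figure-preservation logic the prompt requires is omitted.

```python
def chunk_text(words, chunk_size=512, overlap=64):
    """Split a word list into overlapping chunks.

    Word count stands in for token count in this sketch; a real
    pipeline would use the model's tokenizer and keep tables and
    figure captions intact rather than splitting mid-content.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks

# A 1200-word document yields three windows: 0-512, 448-960, 896-1200.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_text(words)
print(len(chunks))  # → 3
```

The 64-word overlap means the tail of each chunk reappears at the head of the next, so retrieval never loses context that straddles a boundary.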
| Service | Role |
|---|---|
| Azure AI Document Intelligence | Extract text, tables, reading order, and figures from PDFs |
| Azure AI Foundry + GPT-5 | Interpret figures/infographics; power the query layer |
| Azure Blob Storage | Store raw PDFs and intermediate JSON chunks |
| Azure AI Search | Index all chunks for fast, grounded retrieval |
| Azure App Services | Host the front-end web app |
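For a sense of the Stage 3 intermediate artifact, here is one possible shape for a single chunk record as stored in Blob Storage. The field names are illustrative assumptions, not the schema the prompt mandates:

```python
import json

# Illustrative chunk record; field names and values are assumptions,
# not the exact JSON schema specified by the prompt.
chunk_record = {
    "id": "report-2024_p07_c03",   # document + page + chunk ordinal
    "source_pdf": "report-2024.pdf",
    "page": 7,
    "kind": "figure",              # e.g. text | table | figure
    "content": "Bar chart: quarterly revenue by region, Q1-Q4 2024 ...",
    "token_count": 486,
}

# Serialized as it might land in Blob Storage: one JSON document per
# chunk, ready for Azure AI Search to index in Stage 4.
blob_payload = json.dumps(chunk_record, indent=2)
print(blob_payload)
```

Keeping the record self-describing (source file, page, content kind) is what makes the blob layer auditable and lets search results cite their origin.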
All infrastructure is defined in Bicep for fully repeatable deployments.
APM (Agent Package Manager) lets you declare AI agent instructions as versioned dependencies — like npm install but for prompts and agent context.
```bash
# Install APM
brew install microsoft/apm/apm   # macOS
pip install apm-cli              # Python (cross-platform)

# Install this prompt into your project
apm install M2LabOrg/document-intelligence-pipeline
```

Or declare it in your project's `apm.yml`:

```yaml
packages:
  - M2LabOrg/document-intelligence-pipeline
```

Then run `apm install`. APM pulls the instructions into your agent's context — no manual copy-pasting needed.
- Open `.apm/instructions/document-intelligence-pipeline.instructions.md`
- Copy the full contents
- Paste into your AI coding assistant (GitHub Copilot, Claude, Cursor, Gemini, etc.)
- Review the generated spec, then ask the assistant to produce code and Bicep files
The prompt instructs your AI assistant to produce — in order:
- Manifesto — what the system is and why it exists
- Auditor description — plain-English data-flow explanation
- Architecture diagram — full pipeline visualization
- Development plan — ordered build steps
- Threat model — risks and mitigations for data in transit, storage, API exposure, and untrusted uploads
- Working code — Flask/React front end, processing pipeline, and Bicep infrastructure
This spec-first approach means you review the design before a single line of code is written.
| Technique | How it appears |
|---|---|
| Vibe prompting | Intent and tone set clearly; the AI infers sensible defaults |
| Structured specificity | Exact JSON schema, file-size limits, Bicep folder layout |
| Spec-first thinking | Manifesto + threat model required before any code |
| Deferral as a tool | Out-of-scope items named explicitly to prevent over-engineering |
| Constraint clarity | PDF-only, 20 MB limit, 30-minute deployment target |
```
.
├── apm.yml          # APM package manifest
├── .apm/
│   └── instructions/
│       └── document-intelligence-pipeline.instructions.md
└── README.md
```
Is this safe to make public? Yes. The repo contains only a generic, technology-level prompt — no credentials, API keys, secrets, subscription IDs, or organization-specific data. If you fork and add environment-specific details, keep them in `.env` and add it to `.gitignore`.
Pull requests welcome — especially improvements to the prompt (better constraints, alternative chunking strategies, additional deferral items). Open an issue if you adapt this for a different cloud provider.
MIT