A Databricks-native document intelligence + agent stack: parse PDFs once with ai_parse_document, classify and extract structured KPIs with ai_classify / ai_extract, score quality on a 5-dimension rubric, index high-quality summaries into Mosaic AI Vector Search, and serve a cited-answer agent through a Streamlit app on Databricks Apps. Demonstrated on synthetic SEC 10-K filings, but the architecture works for any structured document corpus (contracts, invoices, research reports, regulatory filings).
Important
Open-source reference implementation. The repo demonstrates production-grade Databricks patterns end-to-end, but it is not a turnkey production deployment. Read PRODUCTION_READINESS.md, SECURITY.md, and VALIDATION.md before pointing real users at it.
SEC 10-K PDF Analyst's question
(e.g., ACME_10K_2024.pdf) "What were ACME's top 3 risks in FY24?"
│ │
▼ ▼
┌─────────────────────┐ ┌──────────────────────┐
│ Pipeline (offline) │ ───────────▶ │ Agent (online) │
│ Parse → KPIs │ indexed │ Retrieve → Answer │
│ Quality scoring │ knowledge │ with citations │
└─────────────────────┘ └──────────────────────┘
│
▼
"ACME cited supply-chain
risk [1], AI competition
[2], regulation [3]…"
- Why this exists
- Features
- Readiness levels
- Prerequisites
- Getting started
- Architecture
- How it's built — three pillars
- Deploy ordering: foundation → consumers
- CLEARS quality gate
- Configuration
- Testing & validation
- Deployment
- Repo layout
- What you can learn from this repo
- Limitations
- Contributing
- Security
- License
- Acknowledgments
Databricks shipped a lot of new generative-AI surface area in 2025–2026: ai_parse_document, Mosaic AI Vector Search, the Agent Framework, AI Gateway, Lakebase, Databricks Apps. Tutorials show each piece in isolation; nobody shows them wired together with eval gates, governance, and reproducible deploys the way you'd actually ship to analysts.
This repo is that worked example. Drop a PDF into a governed UC volume; ten minutes later, an analyst can ask cited questions in plain English with end-to-end audit. The whole stack is described declaratively as one Databricks Asset Bundle (DAB) plus a small bootstrap script. DAB manages catalog/schema/volume, pipeline, jobs, the Vector Search endpoint, the Lakebase instance, the serving endpoint, the monitor, the app, and the dashboard; the Vector Search index itself is created and synced by jobs/index_refresh/sync_index.py (DAB doesn't yet manage indexes as a resource type), and the agent model version is registered by agent/log_and_register.py. The bootstrap script orchestrates them in the right order.
It also demonstrates a development workflow: Spec-Kit for spec-driven design, Claude Code with Databricks skill bundles for AI-assisted implementation, six non-negotiable constitution principles that gate every plan. See How it's built.
- End-to-end document intelligence pipeline — Auto Loader ingest → `ai_parse_document` → section explosion → `ai_classify` + `ai_extract` → 5-dim quality rubric → Vector Search Delta-Sync index (the endpoint is DAB-managed; the index is created/synced by `jobs/index_refresh/sync_index.py`). SQL-only pipeline (Lakeflow Spark Declarative Pipelines).
- Cited-answer agent — Mosaic AI Agent Framework (MLflow `pyfunc`), hybrid retrieval + Mosaic re-ranker, single-filing and cross-company supervisor paths. Logged with `auth_policy` for end-to-end OBO when the workspace supports it.
- Streamlit chat UI on Databricks Apps — citation chips, thumbs feedback, conversation history persisted to Lakebase Postgres.
- Eval-gated promotion — `mlflow.evaluate(model_type="databricks-agent")` against a 30-question set with thresholds for Correctness, Adherence, Relevance, Execution, Safety, Latency p95.
- Reproducible synthetic corpus — `samples/synthesize.py` generates ACME / BETA / GAMMA 10-Ks plus a deliberately-low-quality `garbage_10K_2024.pdf` for the rubric-exclusion test (SC-006). No EDGAR dependency in CI.
- Staged deploy with chicken-egg resolution — `scripts/bootstrap-dev.sh` orchestrates foundation → data production → consumers so a fresh workspace deploys cleanly with no "errors tolerated."
- Lakehouse Monitoring + AI/BI dashboard — drift on extraction confidence, p95 latency by company, ungrounded-answer rate.
| Level | Meaning | Required evidence |
|---|---|---|
| Reference-ready | Synthetic corpus deploys and demonstrates the architecture end-to-end | Dev bundle validates, bootstrap succeeds, synthetic CLEARS passes |
| Pilot-ready | Real 10-K filings validate parse/extract/retrieval behavior | Reference-ready + small real EDGAR corpus + reviewed costs/latency |
| Production-ready | Analysts can use it under governed identity and operational SLOs | Pilot-ready + app-level OBO enabled, audit proof, alerts/dashboards, rollback tested |
Full checklists in PRODUCTION_READINESS.md.
| Tool | Version | Why |
|---|---|---|
| Python | 3.11 or 3.12 | Agent + app runtime; tests; eval gate |
| Databricks CLI | ≥ 0.298 | DAB --strict validation, bundle run for apps, UC permissions API, Lakebase + serving-endpoint resource schemas |
| Git | any recent | Repo + Spec-Kit commit hooks |
| `jq` | any recent | Warehouse ID discovery in step 2 of Getting started (CLI-only fallback shown inline if you don't have it) |
| `make` (optional) | any | Convenience targets if you choose to add them |
macOS install:

```
brew install python@3.12 jq
brew install databricks/tap/databricks
```

Linux: see Databricks CLI install docs.
You need a workspace with all of the following enabled:
- Serverless SQL warehouse (AI Functions GA — `ai_parse_document`, `ai_classify`, `ai_extract`, `ai_query`)
- Mosaic AI Vector Search (endpoint + Delta-Sync index)
- Mosaic AI Agent Framework (`databricks-agents`)
- Mosaic AI Model Serving (CPU instances; AI Gateway)
- Lakebase Postgres (preview / GA depending on region)
- Databricks Apps (Streamlit runtime)
- Lakehouse Monitoring
- Unity Catalog with permission to create catalogs/schemas/volumes (or an existing schema you can write to)
Optional but recommended for production-tier OBO:
- Databricks Apps user token passthrough (workspace admin setting). Without it, the app falls back to service-principal auth — see SECURITY.md.
Don't have a workspace? The fastest path is the 14-day Premium trial at https://databricks.com/try-databricks. Verify each entitlement above is enabled in your trial workspace and region — Mosaic AI Vector Search, Lakebase, Databricks Apps, and Model Serving rollout varies by cloud and region, so a Premium tier doesn't automatically guarantee every feature is on. Workspace settings → Previews / Compute → Mosaic AI is the place to check.
Note: Free Edition at databricks.com/learn/free-edition does not include Mosaic AI Vector Search or Model Serving and cannot run this reference. Use the Premium trial.
After signup:
```
databricks auth login --host https://<your-workspace-host>.cloud.databricks.com
databricks auth profiles   # verify the DEFAULT profile is configured
```

```
git clone https://github.com/<your-fork>/databricks-document-intelligence-agent.git
cd databricks-document-intelligence-agent
python -m venv .venv
.venv/bin/pip install -r agent/requirements.txt -r evals/requirements.txt pytest
```

```
# With jq:
databricks warehouses list --output json | jq '.[] | {id, name, state}'
# Without jq (CLI-only fallback):
databricks warehouses list
```

Pick the ID of a serverless warehouse (state can be STOPPED — it auto-starts). You'll need it as `DOCINTEL_WAREHOUSE_ID`.
```
databricks bundle validate --strict -t dev
```

If this prints `Validation OK!`, every YAML resource is schema-correct.
```
DOCINTEL_CATALOG=workspace \
DOCINTEL_SCHEMA=docintel_10k_dev \
DOCINTEL_WAREHOUSE_ID=<from-step-2> \
./scripts/bootstrap-dev.sh
```

The script handles the chicken-egg ordering automatically — see Deploy ordering.
```
DOCINTEL_CATALOG=workspace DOCINTEL_SCHEMA=docintel_10k_dev \
.venv/bin/python evals/clears_eval.py \
  --endpoint analyst-agent-dev \
  --dataset evals/dataset.jsonl
```

Exit 0 means every CLEARS axis met its threshold.
In the workspace UI: Apps → doc-intel-analyst-dev. Ask:
What were the top 3 risk factors disclosed by ACME in their FY24 10-K?
You should see a grounded answer with citation chips linking to ACME_10K_2024.pdf / Risk.
After the first bring-up, iteration depends on what changed:
```
# YAML / pipeline / job / app config changes
databricks bundle deploy -t dev
databricks bundle run -t dev analyst_app   # apply app config + restart

# Agent code changes (agent/*.py): register a new model version
# and repoint the existing serving endpoint in-place.
DOCINTEL_CATALOG=workspace \
DOCINTEL_SCHEMA=docintel_10k_dev \
DOCINTEL_WAREHOUSE_ID=<from-step-2> \
.venv/bin/python agent/log_and_register.py --target dev --serving-endpoint analyst-agent-dev

# Pipeline SQL changes that need to re-process existing filings
databricks bundle run -t dev doc_intel_pipeline
```

You can also re-run `./scripts/bootstrap-dev.sh` — it auto-detects steady-state and does the full cycle (deploy → refresh data → register/promote → app run → grants → smoke) in one command.
For a guided 30-minute tour, see specs/001-doc-intel-10k/quickstart.md.
╔═══════════════════════════════════════════════════════════════════╗
║ pipelines/sql/ (one SQL file per tier) ║
╚═══════════════════════════════════════════════════════════════════╝
raw_filings/ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────────┐
ACME_10K.pdf ──▶ │ bronze_filings │──▶│ silver_parsed_ │──▶│ gold_filing_ │
BETA_10K.pdf │ (raw bytes, │ │ filings (parsed │ │ sections (one │
GAMMA_10K.pdf │ filename, │ │ VARIANT — │ │ row per parsed │
│ ingested_at) │ │ ai_parse_ │ │ $.sections[*]; │
│ │ │ document) │ │ fallback to │
│ >50MB rejects: │ │ │ │ full_document │
│ bronze_filings │ │ Status: ok / │ │ if absent) │
│ _rejected │ │ partial / error │ │ │
└─────────────────┘ └─────────────────┘ │ gold_filing_kpis │
01_bronze.sql 02_silver_parse │ (typed columns: │
.sql │ segment_revenue │
│ ARRAY<STRUCT…>, │
│ top_risks │
│ ARRAY<STRING>) │
└──────────────────┘
03_gold_classify
_extract.sql
│
▼
┌──────────────────┐
│ gold_filing_ │
│ quality │
│ (5-dim rubric: │
│ parse, layout, │
│ ocr, sections, │
│ kpi → 0-30) │
└──────────────────┘
04_gold_quality.sql
Key idea — "parse once, extract many": PDFs are expensive to parse. Silver runs ai_parse_document exactly once per file and stores the structured result as a VARIANT. Everything downstream — classification, KPI extraction, summarization, quality scoring — reads the parsed output, never the raw bytes. This is a non-negotiable constitution principle.
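The quality gate that sits at the end of this chain lives in SQL (`04_gold_quality.sql`), but its shape is easy to sketch in Python. A minimal sketch, assuming five dimensions each scored 0–6 and the default cutoff of 22/30; the dimension names below mirror the diagram and are illustrative, not the actual SQL column names:

```python
# Sketch of the 5-dimension quality rubric gate (the real logic is SQL in
# pipelines/sql/04_gold_quality.sql; dimension names here are illustrative).
RUBRIC_DIMS = ("parse", "layout", "ocr", "sections", "kpi")  # each scored 0-6
QUALITY_THRESHOLD = 22  # bundle variable quality_threshold, default 22/30

def quality_score(dims: dict[str, int]) -> int:
    """Sum the five 0-6 dimension scores into a 0-30 total."""
    for name in RUBRIC_DIMS:
        if not 0 <= dims[name] <= 6:
            raise ValueError(f"{name} must be 0-6, got {dims[name]}")
    return sum(dims[name] for name in RUBRIC_DIMS)

def embed_eligible(dims: dict[str, int], threshold: int = QUALITY_THRESHOLD) -> bool:
    """Only sections at or above the threshold reach the Vector Search index."""
    return quality_score(dims) >= threshold

good = {"parse": 6, "layout": 5, "ocr": 5, "sections": 4, "kpi": 4}     # total 24
garbage = {"parse": 2, "layout": 1, "ocr": 2, "sections": 1, "kpi": 0}  # total 6
```

`garbage_10K_2024.pdf` in `samples/` exists precisely to land below this cutoff (the SC-006 negative test).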
Triggering: prod runs the pipeline in continuous: true mode so Auto Loader (read_files) reacts to new PDFs in the volume automatically. Dev overrides to continuous: false to avoid a 24/7 cluster during smoke iterations. See resources/foundation/doc_intel.pipeline.yml and the dev override block in databricks.yml.
gold_filing_sections ┌─────────────────────────┐
(governed Delta table) ─────▶ │ Mosaic AI Vector │
│ Search Index │
Filter: embed_eligible=true │ (Delta-Sync — auto- │
Embed column: "summary" │ refreshes when Gold │
│ updates) │
└─────────────────────────┘
Why "summary" not the raw text?
─────────────────────────────
Embedding a 50-page 10-K verbatim is noisy. We embed an LLM-written
summary instead — tighter, more searchable. Constitution principle IV:
"Quality before retrieval."
Ownership note: DAB manages the Vector Search endpoint (resources/consumers/filings_index.yml) and the index-refresh job (resources/consumers/index_refresh.job.yml). The index itself isn't yet a DAB-managed resource type as of CLI 0.298 — jobs/index_refresh/sync_index.py creates the Delta-Sync index on first run and triggers a sync on subsequent runs. That's why the bootstrap script's stage-2 deploy creates the endpoint + job, and the job's first execution materializes the actual index.
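That create-on-first-run, sync-afterwards behavior reduces to an idempotent helper. The sketch below is not the repo's actual `sync_index.py`; the client is duck-typed after the general shape of the `databricks-vectorsearch` SDK (`get_index` / `create_delta_sync_index` / `.sync()`), and the key and column names are assumptions:

```python
def ensure_index(client, endpoint: str, index_name: str, source_table: str) -> str:
    """Create the Delta-Sync index on first run; trigger a sync on later runs.

    `client` is assumed to look like the databricks-vectorsearch SDK client;
    this is an illustrative sketch, not jobs/index_refresh/sync_index.py.
    """
    try:
        index = client.get_index(endpoint_name=endpoint, index_name=index_name)
    except Exception:
        # Index doesn't exist yet: first execution materializes it.
        client.create_delta_sync_index(
            endpoint_name=endpoint,
            index_name=index_name,
            source_table_name=source_table,
            pipeline_type="TRIGGERED",
            primary_key="section_id",          # assumed key column
            embedding_source_column="summary",  # embed the LLM summary, not raw text
        )
        return "created"
    index.sync()  # subsequent runs just force a refresh
    return "synced"
```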
User question
│
▼
┌────────────────────────────────────────────┐
│ AnalystAgent.predict() │
│ ───────────────────── │
│ contains "compare" / "vs" / │
│ "between" + ≥2 company names? │
└────────────┬─────────────────┬─────────────┘
│ no │ yes
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Single-filing path │ │ Supervisor path │
│ │ │ │
│ 1. Hybrid search │ │ For each company: │
│ (keyword + vec) │ │ ▸ run analyst path │
│ 2. Re-rank → top 5 │ │ ▸ pull KPIs from │
│ 3. LLM generates │ │ gold_filing_kpis │
│ answer w/ [1] [2] │ │ Format markdown │
│ citations │ │ table with cites. │
└──────────────────────┘ └──────────────────────┘
│ │
└────────┬────────┘
▼
┌──────────────────────┐
│ Response JSON: │
│ answer │
│ citations[] │
│ grounded: bool │
│ latency_ms │
└──────────────────────┘
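The routing check at the top of the diagram can be sketched as a pure function. This is a sketch of the heuristic, not the repo's exact `AnalystAgent` code; `KNOWN_COMPANIES` is an assumption standing in for however the real agent discovers company names:

```python
import re

KNOWN_COMPANIES = {"ACME", "BETA", "GAMMA"}  # assumed; the real agent may look these up
COMPARE_MARKERS = ("compare", " vs ", " vs.", "between")

def route(question: str) -> str:
    """Return 'supervisor' for cross-company questions, else 'single'."""
    q = question.lower()
    mentioned = {c for c in KNOWN_COMPANIES if re.search(rf"\b{c}\b", question, re.I)}
    wants_comparison = any(marker in q for marker in COMPARE_MARKERS)
    if wants_comparison and len(mentioned) >= 2:
        return "supervisor"  # fan out per company, join results as a markdown table
    return "single"          # hybrid search -> re-rank -> cited answer
```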
The agent is an mlflow.pyfunc model registered in Unity Catalog and served behind an AI Gateway (rate limiting per-user, usage tracking, inference-table audit). Identity passthrough is implemented at the App layer when the workspace has Databricks Apps user-token passthrough enabled: the Streamlit app extracts the user's x-forwarded-access-token header and constructs a user-scoped WorkspaceClient. The served model is OBO-ready via MLflow auth_policy and Model Serving user credentials. If app-level passthrough is not enabled, the app falls back to service-principal auth and the repo must be treated as a reference/dev deployment, not a production row-level-security deployment. See SECURITY.md and app/README.md.
┌──────────────────────────────────────────────────────────────────┐
│ │
│ Databricks App (Streamlit) ← user interacts here │
│ app/app.py │
│ │
│ ┌────────────────┐ ┌──────────────────┐ │
│ │ Chat input box │ │ Citation chips │ │
│ │ Thumbs up/down │ │ Markdown tables │ │
│ └────────┬───────┘ └─────┬────────────┘ │
│ │ │ │
└──────────────│─────────────────│─────────────────────────────────┘
│ │
│ query │ feedback writes
▼ ▼
┌────────────────────────┐ ┌────────────────────────┐
│ Model Serving endpoint │ │ Lakebase Postgres │
│ "analyst-agent-dev" │ │ ───────────────── │
│ (CPU, scales to 0) │ │ conversation_history │
│ │ │ query_logs │
│ + AI Gateway: │ │ feedback │
│ rate limit │ │ │
│ (per-user key) │ │ (Postgres for tiny │
│ inference-table │ │ per-turn writes — │
│ audit │ │ Delta isn't great │
│ usage tracking │ │ at row-by-row) │
└────────────────────────┘ └────────────────────────┘
OBO (user identity end-to-end, when enabled):
──────────────────────────────
App reads `x-forwarded-access-token` from the request, builds
`WorkspaceClient(token=...)`, calls the serving endpoint with the
user's identity. The agent-side MLflow auth policy and Model Serving
OBO credentials let downstream calls run as the user. If the app-side
feature is unavailable, the bootstrap script prints an explicit warning
and the deployment remains reference/dev only.
Why Postgres for state? Delta tables are great for analytics but bad at "insert one tiny row per chat turn at high frequency." Lakebase is Databricks's managed Postgres — same governance, right tool for the job.
This repo is a worked example of combining three things that, together, change how you ship Databricks projects.
Spec-Kit is a workflow that forces you to write — and clarify — a specification before writing code. Each phase is a slash-command in Claude Code that produces a checked-in artifact:
/speckit-specify → specs/<NNN>/spec.md What & why (no how)
│
▼
/speckit-clarify → appended Q&A in spec.md Resolve ambiguity
│
▼
/speckit-plan → specs/<NNN>/plan.md Tech stack + structure
│ + research.md, data-model.md,
│ contracts/, quickstart.md
▼
/speckit-tasks → specs/<NNN>/tasks.md Dependency-ordered tasks
│
▼
/speckit-analyze → cross-artifact consistency check
│
▼
/speckit-implement → the actual code
.specify/extensions.yml auto-commits at each phase boundary so the trail is clean. .specify/memory/constitution.md defines six non-negotiable principles every plan must respect:
| # | Principle | What it means |
|---|---|---|
| I | Unity Catalog source of truth | Every table, volume, model, index, endpoint lives under <catalog>.<schema> — no DBFS, no workspace-local resources |
| II | Parse once, extract many | ai_parse_document runs once at Silver → VARIANT; everything downstream reads the parsed output |
| III | Declarative over imperative | SDP SQL pipelines, Lakeflow Jobs, DAB resources — no production notebooks |
| IV | Quality before retrieval | 5-dim rubric scores every section; only ≥22/30 reach the index. Embed summary, not raw text |
| V | Eval-gated agents | MLflow CLEARS scores must clear thresholds before any deploy is considered complete |
| VI | Reproducible deploys | databricks bundle deploy -t <env> recreates the entire stack; dev and prod parity enforced |
When you read specs/001-doc-intel-10k/plan.md you'll see a "Constitution Check" gate that maps each design decision back to the principle it satisfies. When you read specs/001-doc-intel-10k/tasks.md you'll see how each task derives from the plan, and how user-stories (P1, P2, P3) are independently demoable.
Databricks Asset Bundles (DABs) describe most of the workspace state as YAML. One root databricks.yml declares variables and targets (dev, prod); resources/**/*.yml declares each resource (pipeline, jobs, Vector Search endpoint, index-refresh job, serving endpoint, app, monitor, dashboard, Lakebase instance + catalog). databricks bundle deploy -t dev reconciles workspace state to YAML. The two non-DAB-managed pieces — the Vector Search index itself and the registered model version — are produced at runtime by jobs/index_refresh/sync_index.py and agent/log_and_register.py respectively, which the bootstrap script orchestrates.
This repo was built with Databricks-specific Claude Code skill bundles. Those bundles are distributed by Databricks via the CLI / Claude Code plugin channel and are not vendored in this open-source tree — install them locally if you have access, or reference the canonical Databricks docs (mapping in CONTRIBUTING.md).
| Skill bundle | What it provides | Canonical docs |
|---|---|---|
| databricks-core | Auth, profiles, data exploration, bundle basics | docs |
| databricks-dabs | DAB structure, validation, deploy workflow, target separation | docs |
| databricks-pipelines | Lakeflow Spark Declarative Pipelines (`ai_parse_document`, `ai_classify`, `ai_extract`, `APPLY CHANGES INTO`) | docs |
| databricks-jobs | Lakeflow Jobs with retries, schedules, table-update / file-arrival triggers | docs |
| databricks-apps | Databricks Apps (Streamlit), App resource bindings | docs |
| databricks-lakebase | Lakebase Postgres instances, branches, computes, endpoint provisioning | docs |
| databricks-model-serving | Model Serving endpoints, AI Gateway, served entities, scaling config | docs |
Skills are loaded by Claude Code on demand. When you ask Claude to "wire up Vector Search," it should read the Databricks pipeline/model-serving guidance before writing YAML, so the output reflects current Databricks API shapes — not stale training data.
Spec-Kit produces the specs. The Databricks skills provide platform expertise. Claude Code orchestrates both: every phase artifact and every code file in this repo was authored by prompting Claude Code with the spec/plan/tasks as context.
The workflow looks like:
- `/speckit-specify` → Claude writes spec.md from a natural-language description; you iterate via `/speckit-clarify` until ambiguity is resolved.
- `/speckit-plan` → Claude consults the constitution + Databricks skills, drafts plan.md with research decisions and architecture.
- `/speckit-tasks` → Claude generates a dependency-ordered task list grouped by user story (P1, P2, P3).
- `/speckit-implement` → Claude writes the actual SQL/Python/YAML, one task at a time, committing per task.
- Operational loops: when the deploy hits unexpected issues (it always does), Claude reads the runbook, fixes the issue, updates the runbook, commits.
The "AI-driven" part isn't "the AI did it for you" — it's "the AI carries the boring parts (boilerplate YAML, retry-loop scripts, dependency analysis) so you focus on the actually-hard parts (what the spec should say, what the constitution should require)."
DABs deploy everything in one shot. But our resources have a chicken-and-egg problem on a fresh workspace:
┌────────────────────────────────────────────────┐
│ What "bundle deploy" tries to create: │
│ │
│ ▸ Pipeline ────┐ │
│ ▸ Tables ────┼──── all need each other │
│ ▸ Vector idx ───┤ │
│ ▸ Model ───┤ Monitor wants the │
│ ▸ Endpoint ────┤ KPI table to exist │
│ ▸ App ───┤ BEFORE it can attach │
│ ▸ Monitor ────┘ │
│ ▸ Lakebase ──── │
└────────────────────────────────────────────────┘
Endpoint needs a registered model version.
Model version needs the model logged.
Model logging needs the agent code.
Monitor needs the table populated.
Table needs the pipeline to run.
▶ Single `bundle deploy` → 4+ errors on a fresh workspace.
The fix is a staged deploy orchestrated by scripts/bootstrap-dev.sh. Resources are split into two directories by data dependency:
resources/
├── foundation/ ← no data deps — deploy first
│ ├── catalog.yml (schema + volume + grants)
│ ├── doc_intel.pipeline.yml
│ ├── retention.job.yml
│ └── lakebase_instance.yml
│
└── consumers/ ← need foundation to be RUNNING and producing data
├── agent.serving.yml (needs registered model version)
├── kpi_drift.yml (needs gold_filing_kpis table)
├── filings_index.yml (VS endpoint)
├── index_refresh.job.yml (needs source table)
├── analyst.app.yml (needs Lakebase + agent endpoint)
├── usage.dashboard.yml
└── lakebase_catalog.yml (needs instance AVAILABLE)
The bootstrap script auto-detects which mode to run by checking whether the agent serving endpoint already has a populated config:
does analyst-agent-${target} have served entities?
│
no ◀───────┴───────▶ yes
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ FIRST-DEPLOY │ │ STEADY-STATE │
│ (staged) │ │ (full deploy) │
├──────────────────┤ ├──────────────────┤
│ 1. temp-rename │ │ 1. bundle deploy │
│ consumers/* │ │ (full bundle) │
│ .yml.skip │ │ │
│ 2. bundle deploy │ │ 2. refresh data: │
│ (foundation) │ │ upload, run │
│ 3. produce data: │ │ pipeline, │
│ upload, run, │ │ register new │
│ register │ │ model version │
│ model │ │ + repoint │
│ 4. wait Lakebase │ │ serving in- │
│ AVAILABLE │ │ place │
│ 5. restore yamls │ │ │
│ 6. bundle deploy │ │ │
│ (full bundle) │ │ │
└────────┬─────────┘ └────────┬─────────┘
│ │
└───────────┬───────────┘
▼
┌──────────────────────────┐
│ Common to both: │
│ • bundle run analyst_app│
│ • UC grants chain │
│ • smoke check │
└──────────────────────────┘
Why two modes? DAB tracks resource state; if you run the temp-rename trick against an existing deployment, DAB sees the consumer YAMLs as removed and plans to delete the serving endpoint, app, monitor, etc. Safe-ish on a fresh workspace; destructive in steady-state. The script detects mode and does the right thing.
CI (.github/workflows/deploy.yml) assumes steady-state — the first-ever bring-up of a workspace must be done locally with ./scripts/bootstrap-dev.sh. After that, every push to main runs the steady-state path: full bundle deploy → refresh data → repoint serving endpoint → grants → CLEARS gate.
Full breakdown in docs/runbook.md.
Before any deploy reaches production, an evaluation must pass. This is constitution principle V — eval-gated agents.
evals/dataset.jsonl (30 questions: 20 single-filing P2 + 10 cross-company P3)
│
▼
evals/clears_eval.py ──▶ hits the dev endpoint, scores 6 axes:
┌─────────────────────────────────────────────────────┐
│ C - Correctness ≥ 0.80 (factual accuracy) │
│ L - Latency p95 ≤ 8000 ms (responsiveness) │
│ E - Execution ≥ 0.95 (no crashes) │
│ A - Adherence ≥ 0.90 (cites sources) │
│ R - Relevance ≥ 0.80 (retrieved good docs) │
│ S - Safety ≥ 0.99 (no harmful output) │
└─────────────────────────────────────────────────────┘
Any axis fails ▶ exit 1 ▶ deploy blocked.
The bar is hard-coded; changing it requires editing .specify/memory/constitution.md, which is its own small ceremony (PR + version bump + Sync Impact Report).
Implementation uses mlflow.evaluate(model_type="databricks-agent") for the four LLM-judged axes; Execution + Latency are computed from the raw response stream. Per-row Correctness is sliced from result.tables['eval_results'] for the SC-002/SC-003 P2 vs P3 thresholds.
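The gate itself reduces to a per-axis threshold check that becomes a CI exit code. A minimal sketch using the thresholds above (axis keys are illustrative; the real `clears_eval.py` maps them from `mlflow.evaluate` result metrics):

```python
# Axis -> (threshold, higher_is_better); latency is the one lower-is-better axis.
CLEARS_THRESHOLDS = {
    "correctness": (0.80, True),
    "latency_p95_ms": (8000, False),
    "execution": (0.95, True),
    "adherence": (0.90, True),
    "relevance": (0.80, True),
    "safety": (0.99, True),
}

def clears_gate(scores: dict[str, float]) -> int:
    """Return 0 when every axis clears its bar, 1 otherwise (CI exit code)."""
    failures = []
    for axis, (bar, higher_is_better) in CLEARS_THRESHOLDS.items():
        ok = scores[axis] >= bar if higher_is_better else scores[axis] <= bar
        if not ok:
            failures.append(f"{axis}: {scores[axis]} vs bar {bar}")
    for line in failures:
        print("FAIL", line)
    return 1 if failures else 0
```

A CI step would end with `sys.exit(clears_gate(scores))` so any failed axis blocks the deploy.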
| Variable | Default | Purpose |
|---|---|---|
| `catalog` | `workspace` | UC catalog for all resources |
| `schema` | `docintel_10k` (prod) / `docintel_10k_dev` (dev) | Schema under the catalog |
| `lakebase_instance` | per-target | Lakebase database instance name |
| `lakebase_stopped` | `false` | Flip to `true` only after the instance exists |
| `service_principal_id` | `""` | Required for `-t prod`; `bundle validate -t prod` fails loudly without it |
| `warehouse_id` | looked up from Serverless Starter Warehouse | Used by index-refresh + dashboards |
| `embedding_model_endpoint_name` | `databricks-bge-large-en` | Vector Search embeddings |
| `foundation_model_endpoint_name` | `databricks-meta-llama-3-3-70b-instruct` | Agent answer generation |
| `rerank_model_endpoint_name` | `databricks-bge-rerank-v2` | Mosaic re-ranker |
| `quality_threshold` | `22` | Section quality cutoff (0–30) for index inclusion |
| `top_k` | `5` | Citations returned after re-rank |
| `max_pdf_bytes` | `52428800` (50 MB) | Reject filings larger than this |
| `analyst_group` | `account users` | UC group granted SELECT/USE on schema, READ/WRITE on volume |
Override via `--var name=value` on any bundle command.
| Variable | Required | Used by |
|---|---|---|
| `DOCINTEL_CATALOG` | yes | Bootstrap, CI, eval |
| `DOCINTEL_SCHEMA` | yes | Same |
| `DOCINTEL_WAREHOUSE_ID` | yes | Bootstrap (passed to bundle as `--var warehouse_id`, used by kpi-poll + smoke); `agent/log_and_register.py` (auth-policy SQL warehouse resource); `agent/tools.py` UC Function tool |
| `DOCINTEL_TARGET` | no (default `dev`) | Bootstrap |
| `DOCINTEL_ANALYST_GROUP` | no (default `account users`) | UC grants in bootstrap + CI |
| `DOCINTEL_WAIT_SECONDS` | no (default 600) | Bootstrap KPI-table poll timeout |
| `DOCINTEL_LAKEBASE_TIMEOUT` | no (default 600) | Bootstrap Lakebase-AVAILABLE poll |
| `DATABRICKS_HOST` / `DATABRICKS_TOKEN` | yes (CI only) | GitHub Actions auth |
```
# Unit tests (18 tests covering retrieval, agent routing, supervisor)
.venv/bin/python -m pytest agent/tests/ -q

# Bundle schema + interpolation
databricks bundle validate --strict -t dev
databricks bundle validate --strict -t prod   # expected to FAIL without --var service_principal_id (intended safety)

# Bash syntax
bash -n scripts/bootstrap-dev.sh

# Compile checks for all modified Python
.venv/bin/python -m py_compile \
  agent/_obo.py agent/analyst_agent.py agent/log_and_register.py \
  agent/retrieval.py agent/supervisor.py agent/tools.py \
  app/app.py app/lakebase_client.py \
  evals/clears_eval.py scripts/wait_for_kpis.py samples/synthesize.py
```

End-to-end is exercised by `./scripts/bootstrap-dev.sh` against a real workspace; see VALIDATION.md for the full procedure with expected outputs.
| Path | When |
|---|---|
| `./scripts/bootstrap-dev.sh` | Fresh-workspace bring-up (or after `bundle destroy`). Auto-detects FIRST-DEPLOY vs STEADY-STATE; handles staged deploy + data production + UC grants in either mode. |
| `databricks bundle deploy -t dev` | YAML / pipeline / job / app config changes after the first bring-up. |
| `databricks bundle run -t dev analyst_app` | After any change to `app/` or `resources/consumers/analyst.app.yml` — required to apply runtime config + restart the app. |
| `python agent/log_and_register.py --target dev --serving-endpoint analyst-agent-dev` | After agent code changes (`agent/*.py`). Registers a new UC model version and repoints the existing serving endpoint in-place. |
| `databricks bundle deploy -t prod --var service_principal_id=<sp-app-id>` | Production deploy, run as the prod SP. |
| GitHub Actions on push to `main` | Steady-state CI: full bundle deploy → wait for Lakebase AVAILABLE → upload samples + run pipeline + register/promote agent → UC grants → `bundle run analyst_app` → CLEARS eval gate. (The first-ever bring-up of a workspace must be done locally with `./scripts/bootstrap-dev.sh`.) |
For day-2 ops (rolling agent versions, debugging low quality scores, inspecting CLEARS metrics in MLflow), see docs/runbook.md. For the production-readiness checklist, see PRODUCTION_READINESS.md.
databricks/
├── databricks.yml # Bundle root — variables + dev/prod targets
├── README.md # This file
├── CLAUDE.md # Runtime guidance for Claude Code sessions
├── CONTRIBUTING.md # Contribution guidelines
├── SECURITY.md # Identity modes, OBO, grants
├── PRODUCTION_READINESS.md # Reference / Pilot / Production checklists
├── VALIDATION.md # Validation procedure with expected outputs
├── REAL_10K_PILOT.md # Real EDGAR pilot guidance
├── LICENSE # MIT
│
├── pipelines/sql/ # Lakeflow SDP — Bronze → Silver → Gold (SQL)
│ ├── 01_bronze.sql # Auto Loader BINARYFILE ingest + size filter
│ ├── 02_silver_parse.sql # ai_parse_document → VARIANT
│ ├── 03_gold_classify_extract.sql # ai_classify + ai_extract → typed KPIs
│ └── 04_gold_quality.sql # 5-dim rubric → embed_eligible filter
│
├── agent/ # Mosaic AI Agent Framework
│ ├── analyst_agent.py # mlflow.pyfunc model + routing
│ ├── retrieval.py # Hybrid search + re-rank + OBO VS client
│ ├── supervisor.py # Cross-company fan-out
│ ├── tools.py # UC Function tool over gold_filing_kpis
│ ├── _obo.py # On-behalf-of credentials helpers
│ ├── log_and_register.py # Register + auth_policy + alias
│ └── tests/ # pytest unit tests
│
├── app/ # Streamlit App on Databricks Apps
│ ├── app.py # Chat UI + citations + thumbs feedback + OBO
│ ├── lakebase_client.py # psycopg writes to query_logs / feedback
│ ├── app.yaml # App runtime config (port, CORS, XSRF)
│ └── README.md # App-specific runtime + local-dev notes
│
├── evals/ # MLflow CLEARS eval gate
│ ├── dataset.jsonl # 30 hand-authored questions (P2 + P3)
│ └── clears_eval.py # mlflow.evaluate(model_type="databricks-agent")
│
├── jobs/ # Lakeflow Jobs Python tasks
│ ├── retention/prune_volume.py # 90-day raw PDF cleanup
│ └── index_refresh/sync_index.py # Vector Search SYNC INDEX
│
├── resources/ # DAB resources, split by data dependency
│ ├── foundation/ # Stage 1 — no data deps
│ └── consumers/ # Stage 2 — depend on foundation data
│
├── scripts/ # Operational scripts
│ ├── bootstrap-dev.sh # Fresh-workspace bring-up (staged deploy)
│ └── wait_for_kpis.py # Poll helper used by bootstrap + CI
│
├── samples/ # Synthetic 10-Ks for smoke tests + eval
│ ├── synthesize.py # Reproducible PDF generator
│ ├── ACME_10K_2024.pdf
│ ├── BETA_10K_2024.pdf
│ ├── GAMMA_10K_2024.pdf
│ └── garbage_10K_2024.pdf # SC-006 negative test (low quality)
│
├── specs/ # Spec-Kit artifacts
│ └── 001-doc-intel-10k/
│ ├── spec.md # What & why
│ ├── plan.md # Tech stack + Constitution Check
│ ├── tasks.md # Dependency-ordered implementation tasks
│ ├── research.md # Decision log
│ ├── data-model.md # Entity → table mapping
│ ├── quickstart.md # 30-min deploy walkthrough
│ └── contracts/ # JSON schemas for KPIs + agent I/O
│
├── docs/
│ └── runbook.md # Day-2 ops + bring-up workflow
│
├── .specify/ # Spec-Kit machinery (constitution, hooks)
│ ├── memory/constitution.md # Six non-negotiable principles
│ └── extensions.yml # Auto-commit hooks per phase
│
└── .github/workflows/
└── deploy.yml # PR validate; main → steady-state deploy + CLEARS gate
# (first-ever bring-up must be done locally via bootstrap-dev.sh)
- How to wire `ai_parse_document` into Lakeflow SDP — pattern for streaming tables + `STREAM(...)` views + `APPLY CHANGES INTO` keyed on filename.
- How to score document quality before retrieval — five 0–6 dimensions in SQL, threshold filter on the index source.
- How to log a Mosaic AI agent to UC — `mlflow.pyfunc` with both inputs and outputs in the signature (UC requirement), `AnyType` for variable-shape fields, `auth_policy` + `resources` for OBO.
- How to ground an agent with citations — hybrid Vector Search → re-rank → top-k → LLM with an explicit "cite sources [1] [2]" prompt.
- How to handle DAB deploy ordering — chicken-egg dependencies between heterogeneous resources, solved with a staged bootstrap rather than `depends_on` (which DAB doesn't reliably honor across resource types).
- How to gate deploys on MLflow eval — `mlflow.evaluate(model_type="databricks-agent")` with documented metric keys, per-axis thresholds, exit-code gate in CI.
- How to do end-to-end OBO — `ModelServingUserCredentials` from `databricks_ai_bridge`, `CredentialStrategy.MODEL_SERVING_USER_CREDENTIALS` for Vector Search, MLflow `auth_policy` with `model-serving` + `vector-search` user scopes, App-side `user_api_scopes` declaration.
- How Spec-Kit + Claude Code + Databricks skills compose — every artifact in `specs/`, `pipelines/`, and `agent/` was generated through that loop.
This is a pilot-scale reference implementation, not a turnkey production deployment:
| Limit | Value | Source |
|---|---|---|
| Filings in dev | ~500 | spec.md scale |
| Filings in prod | ~5,000 | spec.md scale |
| Concurrent app users | ~20 | spec.md scale |
| PDF size cap | 50 MB | FR / bronze_filings_rejected |
| Raw retention | 90 days | spec clarification |
| Compute | CPU only | constitution add'l constraints |
| Languages | English filings | implicit (foundation model) |
| Eval set size | 30 questions | spec clarification |
| OBO end-to-end | Requires the workspace-level Databricks Apps user-token-passthrough feature | SECURITY.md |
Latency SLOs: P95 ≤ 8s for single-filing, ≤ 20s for cross-company. End-to-end pipeline ≤ 10 min P95 on a 30 MB PDF.
Bug reports, doc fixes, and pattern improvements are welcome. The constitution at .specify/memory/constitution.md defines what the project will and won't accept; PRs that conflict need a constitution amendment first.
See CONTRIBUTING.md for local setup, the spec-kit workflow, skill alignment expectations, and the deploy-ordering gotchas reviewers will check for.
See SECURITY.md for the identity model (App SP fallback vs end-to-end OBO), required UC grants, secrets-handling guidance, and how to report security issues in a fork or deployment.
Released under the MIT License — Copyright (c) 2026 Sathish Krishnan. Use it, fork it, learn from it; just keep the copyright notice.
- Spec-Kit — spec-driven development workflow for AI coding agents.
- Claude Code — Anthropic's CLI for AI-assisted development.
- Anthropic Skills — general-purpose Claude Code skill bundles.
- Databricks Lakehouse + Mosaic AI — Unity Catalog, Lakeflow Spark Declarative Pipelines, Mosaic AI Vector Search, Agent Framework, Model Serving, AI Gateway, Databricks Apps, Lakebase, Lakehouse Monitoring.
The 10-K analyst pattern is inspired by Databricks's own reference architecture for governed agent applications.