Agentic system for legal document compliance Q&A using Cloudflare Workers, AI, and RAG
This is an intelligent agentic system that can read and reason over legal documents, providing accurate answers grounded in compliance requirements and regulations.
- Architecture Overview
- Design Decisions
- Data Extraction & RAG Pipeline
- Model Usage
- Agent Architecture
- Getting Started
- Testing the Project
- API Endpoints
- Trade-offs & Considerations
┌─────────────────────────────────────────────────────────────────┐
│  USER REQUEST                                                   │
│  POST /question {"question": "..."}                             │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  CLOUDFLARE WORKER (Hono)                                       │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 1. Validate Request                                     │   │
│   │ 2. Create runId                                         │   │
│   │ 3. Queue to IntelligentAgent (Durable Object)           │   │
│   │ 4. Return 202 Accepted + statusUrl                      │   │
│   └─────────────────────────────────────────────────────────┘   │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  INTELLIGENT AGENT (Durable Object + Agents SDK)                │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ GUARDRAIL: Relevance Check                              │   │
│   │ • LLM classifies if question is compliance-related      │   │
│   │ • Rejects casual greetings, off-topic questions         │   │
│   └────────────────────────────┬────────────────────────────┘   │
│                                │                                │
│                                ▼                                │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ RAG PIPELINE (Structured Reasoning)                     │   │
│   │                                                         │   │
│   │ PHASE 1: EXTRACTION                                     │   │
│   │ ├─ Tool: Text Embedding Generator                       │   │
│   │ └─ Workers AI (@cf/baai/bge-base-en-v1.5)               │   │
│   │                                                         │   │
│   │ PHASE 2: SEARCH                                         │   │
│   │ ├─ Tool: Vector Similarity Search                       │   │
│   │ └─ Vectorize (top-10 semantic search)                   │   │
│   │                                                         │   │
│   │ PHASE 3: RETRIEVAL                                      │   │
│   │ ├─ Tool: Document Retriever                             │   │
│   │ └─ D1 Database (structured docs + metadata)             │   │
│   │                                                         │   │
│   │ PHASE 4: GENERATION                                     │   │
│   │ ├─ Tool: LLM Answer Generator                           │   │
│   │ ├─ Workers AI (@cf/meta/llama-4-scout-17b...)           │   │
│   │ └─ Chain-of-thought prompting                           │   │
│   │                                                         │   │
│   │ PHASE 5: EVALUATION                                     │   │
│   │ └─ Quality checks (content, citations, relevance)       │   │
│   └────────────────────────────┬────────────────────────────┘   │
│                                │                                │
│                                ▼                                │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ PERSISTENCE                                             │   │
│   │ • Store in agent_runs table (D1)                        │   │
│   │ • Track metrics, tools used, latency                    │   │
│   │ • Update agent state                                    │   │
│   └─────────────────────────────────────────────────────────┘   │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  USER POLLS STATUS                                              │
│  GET /status/:runId → returns answer                            │
└─────────────────────────────────────────────────────────────────┘
- Cloudflare Worker (Hono): HTTP API layer
- IntelligentAgent (Durable Object): Agentic processing with state management
- Workers AI: Embeddings + LLM inference
- Vectorize: Semantic vector search
- D1 Database: Document storage + run history
- AI Gateway: Observability and caching
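To make the Worker layer's role concrete, here is a minimal sketch of its first steps (validate the request, create a runId, build the 202 payload). It is written as a plain function rather than a Hono handler so it stays self-contained; the type names, the `acceptQuestion` function, and the error message are illustrative assumptions, not the project's actual code.

```typescript
// Hypothetical response shapes; the real project may use different fields.
interface AcceptedResponse {
  status: 202;
  body: { success: true; runId: string; status: "pending"; statusUrl: string };
}

interface RejectedResponse {
  status: 400;
  body: { success: false; error: string };
}

// Validates the request body, creates a runId, and builds the 202 payload.
// `newId` is injected so the sketch does not depend on a specific runtime API.
function acceptQuestion(
  body: unknown,
  newId: () => string,
): AcceptedResponse | RejectedResponse {
  const question =
    typeof body === "object" && body !== null
      ? (body as Record<string, unknown>).question
      : undefined;

  if (typeof question !== "string" || question.trim().length === 0) {
    return {
      status: 400,
      body: { success: false, error: "Field 'question' must be a non-empty string." },
    };
  }

  const runId = `run-${newId()}`;
  return {
    status: 202,
    body: { success: true, runId, status: "pending", statusUrl: `/status/${runId}` },
  };
}
```

In a real handler, the accepted branch would also hand the question to the IntelligentAgent before returning.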
Decision: Use Cloudflare's official Agents SDK instead of building a custom agent system.
Rationale:
- ✅ Built-in state management with automatic persistence
- ✅ RPC support for direct method calls
- ✅ WebSocket integration for real-time updates
- ✅ SQL storage integrated (no external DB needed)
- ✅ Follows Cloudflare best practices
Implementation: IntelligentAgent extends Agent<Env, RAGAgentState>
Decision: Use Durable Objects to manage agent state and ensure consistency.
Rationale:
- ✅ Single instance ensures no race conditions
- ✅ Built-in persistence (state survives restarts)
- ✅ Co-location with data (SQL, state in same location)
- ✅ Can run long operations without timeouts
Decision: Return immediately (202 Accepted) and process in background.
Rationale:
- ✅ Agent runs can take 10+ seconds
- ✅ User doesn't need to keep connection open
- ✅ Resilient to network issues
- ✅ Can handle multiple requests concurrently
Implementation:
// POST /question returns immediately
{ "runId": "...", "statusUrl": "/status/run-xxx" }
// Agent processes in background using ctx.waitUntil
ctx.waitUntil(agent.processQuestion({ question, runId }));
Decision: Add LLM-based relevance filter before RAG pipeline.
Rationale:
- ✅ Prevents hallucinations on irrelevant questions
- ✅ Saves compute (no embedding/search for "Hello")
- ✅ Clear user feedback for off-topic questions
- ✅ Aligns with compliance Q&A purpose
Implementation:
- Uses separate LLM call with binary classification
- Low temperature (0.1) for consistent filtering
- Fast response (~500ms)
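The guardrail described above could look roughly like the following sketch. The `LlmCall` signature, the prompt wording, and the RELEVANT/IRRELEVANT labels are assumptions made for illustration; only the binary classification and the 0.1 temperature come from the description.

```typescript
// Assumed shape of the LLM call: prompt in, raw text out.
type LlmCall = (prompt: string, options: { temperature: number }) => Promise<string>;

// Builds a prompt that asks for a one-word binary verdict.
function buildRelevancePrompt(question: string): string {
  return [
    "You are a filter for a legal compliance Q&A system.",
    "Answer with exactly one word: RELEVANT or IRRELEVANT.",
    "Greetings and off-topic questions are IRRELEVANT.",
    `Question: ${question}`,
  ].join("\n");
}

// "IRRELEVANT" contains "RELEVANT", so check the prefix, not a substring.
function parseVerdict(raw: string): boolean {
  return raw.trim().toUpperCase().startsWith("RELEVANT");
}

// Runs the classification with a low temperature for consistent filtering.
async function checkRelevance(question: string, llm: LlmCall): Promise<boolean> {
  const raw = await llm(buildRelevancePrompt(question), { temperature: 0.1 });
  return parseVerdict(raw);
}
```

Treating any unexpected output as "not relevant" is a conservative choice in this sketch; a real system might instead fall through to the RAG pipeline.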
Decision: Explicit phases with tool usage metrics.
Rationale:
- ✅ Observable reasoning process
- ✅ Easy to debug which phase failed
- ✅ Metrics for optimization
- ✅ Matches interview requirements
Phases:
- Guardrail: Relevance Check (optional)
- Extraction: Generate question embedding
- Search: Vector similarity in Vectorize
- Retrieval: Fetch full documents from D1
- Generation: LLM answer with chain-of-thought
- Evaluation: Quality assessment
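One way to get explicit phases with per-phase metrics is to wrap each step in a small helper, as in the sketch below. The `Tools` interface stands in for Workers AI, Vectorize, and D1 with hypothetical signatures, and the evaluation step is a deliberately trivial placeholder.

```typescript
interface PhaseMetric {
  phase: string;
  latencyMs: number;
  ok: boolean;
}

// Runs one phase, records latency and outcome, and rethrows on failure
// so the caller can see exactly which phase broke.
async function runPhase<T>(
  phase: string,
  metrics: PhaseMetric[],
  work: () => Promise<T>,
): Promise<T> {
  const started = Date.now();
  try {
    const result = await work();
    metrics.push({ phase, latencyMs: Date.now() - started, ok: true });
    return result;
  } catch (err) {
    metrics.push({ phase, latencyMs: Date.now() - started, ok: false });
    throw err;
  }
}

// Hypothetical tool signatures; the real bindings will differ.
interface Tools {
  embed(text: string): Promise<number[]>;
  search(vector: number[], topK: number): Promise<string[]>;
  fetchDocs(ids: string[]): Promise<string[]>;
  generate(question: string, docs: string[]): Promise<string>;
}

async function answerQuestion(question: string, tools: Tools) {
  const metrics: PhaseMetric[] = [];
  const vector = await runPhase("extraction", metrics, () => tools.embed(question));
  const ids = await runPhase("search", metrics, () => tools.search(vector, 10));
  const docs = await runPhase("retrieval", metrics, () => tools.fetchDocs(ids));
  const answer = await runPhase("generation", metrics, () => tools.generate(question, docs));
  // Placeholder evaluation: a non-empty answer backed by at least one document.
  const passed = await runPhase("evaluation", metrics, async () =>
    answer.length > 0 && docs.length > 0,
  );
  return { answer, passed, metrics };
}
```

The returned `metrics` array is what would be persisted alongside the run for debugging and optimization.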
Decision: Structured prompts following Cloudflare best practices.
Rationale:
- ✅ Better reasoning quality
- ✅ Explicit citation requirements
- ✅ Metadata context (relevance scores)
- ✅ Output format specification
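As an illustration of what such a structured prompt might contain, the sketch below assembles role, context with relevance scores, citation instructions, and an output format. The `RetrievedDoc` shape, the section labels, and the wording are assumptions, not the project's actual prompt.

```typescript
// Hypothetical shape of a retrieved document.
interface RetrievedDoc {
  id: string;
  title: string;
  content: string;
  score: number; // similarity score from vector search
}

// Builds a prompt with numbered sources so the model can cite them as [n].
function buildAnswerPrompt(question: string, docs: RetrievedDoc[]): string {
  const context = docs
    .map(
      (d, i) =>
        `[${i + 1}] ${d.title} (id: ${d.id}, relevance: ${d.score.toFixed(2)})\n${d.content}`,
    )
    .join("\n\n");

  return [
    "ROLE: You are a legal compliance assistant.",
    "CONTEXT:",
    context,
    "INSTRUCTIONS:",
    "- Reason step by step before answering.",
    "- Use only the context above; say so if it is insufficient.",
    "- Cite sources by their bracket number, e.g. [1].",
    "OUTPUT FORMAT: a short answer followed by a 'Sources:' line.",
    `QUESTION: ${question}`,
  ].join("\n");
}
```

Numbering the sources in the context is what makes the citation requirement checkable in the evaluation phase.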
Creates a new question for the agent. Returns a runId for tracking.
curl -X POST http://localhost:8787/question \
-H 'Content-Type: application/json' \
-d '{"question":"¿Qué obligaciones tengo sobre protección de datos personales?"}'
Response:
{
"success": true,
"runId": "run-uuid",
"status": "pending",
"message": "Agent run created. Processing asynchronously."
}
Checks the status of a specific run.
curl http://localhost:8787/status/run-uuid
Response:
{
"success": true,
"data": {
"id": "run-uuid",
"question": "...",
"status": "completed",
"result": "...",
"createdAt": 1699999999999,
"completedAt": 1699999999999
}
}
Loads documents into the knowledge base (internal use).
WebSocket connection for receiving real-time updates from the agent.
Protocol:
- Connect to: ws://localhost:8787/ws
- Subscribe to a specific run: {"type": "subscribe", "runId": "run-uuid"}
- Global subscription (testing): {"type": "subscribe", "runId": "all"} (receives ALL runs)
- Receive automatic updates as the agent progresses
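The subscription rules in the protocol above (a specific runId, or "all" for every run) can be sketched with a small in-memory registry. The `SubscriptionRegistry` class and its method names are hypothetical; the actual agent may track sockets quite differently.

```typescript
// Hypothetical registry mapping client ids to the runs they subscribed to.
class SubscriptionRegistry {
  private byClient = new Map<string, Set<string>>();

  // Records that a client wants updates for `runId` (or "all").
  subscribe(clientId: string, runId: string): void {
    const runs = this.byClient.get(clientId) || new Set<string>();
    runs.add(runId);
    this.byClient.set(clientId, runs);
  }

  // Removes every subscription of a disconnected client.
  unsubscribeAll(clientId: string): void {
    this.byClient.delete(clientId);
  }

  // Returns the clients that should receive an update for `runId`:
  // those subscribed to that run, plus those subscribed to "all".
  recipients(runId: string): string[] {
    const out: string[] = [];
    this.byClient.forEach((runs, clientId) => {
      if (runs.has(runId) || runs.has("all")) out.push(clientId);
    });
    return out;
  }
}
```

For example, after `subscribe("a", "run-1")` and `subscribe("b", "all")`, an update for `run-2` would be delivered only to client `b`.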
Example with websocat:
# Install websocat: brew install websocat
websocat ws://localhost:8787/ws
Then send:
{"type": "subscribe", "runId": "your-run-id-here"}
Web interface for easily testing the WebSocket.
Usage:
- Open in your browser: http://localhost:8787/ws-client
- Type your legal question
- Click "Send Question"
- Watch the updates arrive in real time!
⚠️ Testing mode: The web client auto-subscribes to ALL runs when it connects (you don't need to specify a runId). This is ideal for testing: you will see updates for any question that is run.
Instead of constantly polling /status/:runId, you can subscribe via WebSocket and receive updates automatically.
Advantages:
- ✅ No polling: updates arrive as they happen
- ✅ Lower latency
- ✅ Fewer requests to the server
- ✅ Smoother experience
Flow:
1. POST /question → you get a runId
2. Connect a WebSocket to /ws
3. Send {"type": "subscribe", "runId": "..."}
4. Receive updates automatically:
- status: "running", step: "embedding_complete"
- status: "running", step: "documents_retrieved"
- status: "completed", result: "..."
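On the client side, the last step of this flow amounts to inspecting each incoming message and deciding whether the run has finished. The sketch below shows one way to do that; the type names, the `Outcome` union, and `handleMessage` are illustrative assumptions, and the WebSocket connection itself is left out so the logic stays testable.

```typescript
type RunUpdate = {
  status: "running" | "completed" | "failed";
  step?: string;
  result?: string;
  totalTime?: number; // milliseconds
};

type ServerMessage =
  | { type: "connected"; clientId: string; timestamp: number }
  | { type: "subscribed"; runId: string; timestamp: number }
  | { type: "run_update"; runId: string; update: RunUpdate };

type Outcome =
  | { done: false }
  | { done: true; result: string }
  | { done: true; error: string };

// Interprets one raw WebSocket message for the run we are waiting on.
function handleMessage(raw: string, runId: string): Outcome {
  const msg = JSON.parse(raw) as ServerMessage;
  // Ignore handshake messages and updates for other runs.
  if (msg.type !== "run_update" || msg.runId !== runId) return { done: false };
  if (msg.update.status === "completed") {
    return { done: true, result: msg.update.result || "" };
  }
  if (msg.update.status === "failed") {
    return { done: true, error: "run failed" };
  }
  return { done: false }; // still running
}
```

A caller would typically invoke this from the socket's message handler and close the connection once `done` is true.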
Message types:
// Connection
{ type: "connected", clientId: "...", timestamp: ... }
// Subscription confirmed
{ type: "subscribed", runId: "...", timestamp: ... }
// Run update
{
  type: "run_update",
  runId: "...",
  update: {
    status: "running" | "completed" | "failed",
    step: "started" | "embedding_complete" | "documents_retrieved",
    result?: "...", // only when completed
    totalTime?: 2847 // milliseconds
  }
}
There are two options for seeding documents:
- Using the seed-documents.ts script
- Using the /seed endpoint