Skip to content

TheBarmaEffect/TensorMonk

Repository files navigation

Verdict

 __     __            _ _      _
 \ \   / /__ _ __ ___| (_) ___| |_
  \ \ / / _ \ '__/ _ \ | |/ __| __|
   \ V /  __/ | |  __/ | | (__| |_
    \_/ \___|_|  \___|_|_|\___|\__|

Multi-agent adversarial AI courtroom for decision evaluation.

Live Demo: https://frontend-phi-ten-83.vercel.app

Submit any decision or idea and Verdict runs it through a full AI courtroom proceeding: specialized agents research, argue for and against, cross-examine witnesses, deliver a ruling, and synthesize a battle-tested version of the original idea — all streamed to the browser in real time over WebSocket.


Architecture

User Input (question + context + output_format)
    |
    v
[Domain Detection] ── classifies decision domain (business, legal, etc.)
    |
    v
[Research Agent] ── produces anonymous research package
    |                (authorship hidden from downstream agents)
    +──────────────+──────────────+
    |              |              |
    v              |              v
[Prosecutor]    ISOLATED    [Defense]
 argues FOR   (cannot see   argues AGAINST
               each other)
    |              |              |
    +──────────────+──────────────+
                   |
                   v
            [Judge Agent]
            cross-examines
                   |
       +-----------+-----------+
       |           |           |
       v           v           v
   [Witness]   [Witness]   [Witness]
    fact         data       precedent
       |           |           |
       +-----------+-----------+
                   |
            [interrupt_before]  ← human-in-the-loop checkpoint
                   |
                   v
            [Judge Agent]
            delivers verdict
                   |
                   v
          [Synthesis Agent]
          battle-tested output
                   |
                   v
           WebSocket Stream --> React Frontend

Adversarial Design Constraints

  • Authorship Blindness: The research package is stripped of source metadata before reaching prosecutor/defense. Neither agent knows who authored the research.
  • Constitutional Directives: Prosecutor MUST argue FOR, Defense MUST argue AGAINST, regardless of personal assessment. Enforced via system prompts.
  • Adversarial Isolation: Prosecutor and defense run in parallel and never see each other's output. The judge is the first node to receive both.
  • Hallucination Guard: Agent outputs are validated against Pydantic schemas. Malformed JSON triggers a retry with temperature=0.3 for deterministic recovery.
  • Checkpointing: LangGraph MemorySaver for development; AsyncRedisSaver activates when REDIS_URL is set (Redis path is implemented but untested in production — development uses MemorySaver exclusively). State persisted at every node for resume/replay.
  • Human-in-the-Loop: interrupt_before=['verdict_with_review'] activates via INTERRUPT_BEFORE_VERDICT=true env var — the confidence gate routes low-confidence verdicts to a review node where human reviewers can intervene before the final ruling.
  • Dynamic Witness Spawning: Conditional edges route through _should_spawn_witnesses — the Judge's cross-examination determines which claims are contested and what witness type to spawn for each. If no claims are contested, witnesses are skipped entirely.
  • Domain-Aware Constitutional Overlays: Loaded from backend/config/domains.yaml at runtime — each domain defines argumentation constraints, evidence hierarchy, and few-shot synthesis anchors.

Tech Stack

Layer Technology
Backend framework FastAPI
Agent orchestration LangGraph (StateGraph + MemorySaver; AsyncRedisSaver when REDIS_URL set)
LLM inference Groq (Llama 3.3 70B Versatile)
Data models Pydantic v2 with field validators
Real-time streaming WebSocket (native FastAPI)
Frontend framework React 18
Build tool Vite
Styling Tailwind CSS
Animations Framer Motion
State management Zustand
Charts Recharts (scope-trimmed, see below)
PDF generation fpdf2
DOCX generation python-docx
Security XSS detection, sanitization, security headers
Rate limiting Token bucket middleware (in-memory)
Resilience Circuit breaker + exponential backoff
Containerization Docker + docker-compose

What's Built and Working

Core Pipeline (fully functional)

  • 6 AI agents: Research, Prosecutor, Defense, Judge, up to 3 Witnesses, Synthesis
  • Constitutional isolation — Prosecutor and Defense share zero memory
  • Authorship blindness — Judge receives arguments with agent identity stripped via strip_authorship()
  • Dynamic Witness spawning — Judge spawns FactWitness, DataWitness, PrecedentWitness based on contested claims via LangGraph conditional edges
  • Domain detection — auto-classifies business / financial / legal / medical / hiring / technology / strategic / product / marketing (9 domains)
  • Domain-aware constitutional overlays loaded from backend/config/domains.yaml at runtime
  • Few-shot synthesis anchors per domain (e.g., "Week 1-2: Implement WorkOS for enterprise SSO")
  • Parallel Prosecutor + Defense execution via asyncio.gather
  • LangGraph StateGraph with MemorySaver checkpointing (conditional AsyncRedisSaver when REDIS_URL set — Redis path implemented but not production-tested)
  • Multi-factor confidence gate: 4-factor routing (domain-adjusted threshold, witness agreement ratio, confidence variance, overrule detection) with 3 verdict paths (normal, low-confidence review, hallucination guard at temperature=0.3)
  • Hallucination guard — agent outputs validated against Pydantic v2 schemas, malformed JSON triggers temperature=0.3 retry
  • Real-time WebSocket streaming with typed StreamEvent objects
  • Export: Markdown, PDF (via fpdf2), DOCX (via python-docx), and structured JSON — all endpoints functional
  • Follow-up questions: context-aware Q&A against session results via POST /api/verdict/{id}/followup
  • Session history: persistent JSON-backed session store via GET /api/verdict/sessions/history, displayed in frontend SessionHistory component
  • Verdict sharing: GET /api/verdict/{id}/share generates short URL token, GET /shared/{token} retrieves results
  • Web search grounding: Research Agent queries Tavily (or DuckDuckGo fallback) for current facts before LLM analysis
  • Inline analysis in session results: argument quality, stability, and dependency graph computed and embedded in every completed session result — no separate endpoint required
  • 503 tests across 22 test files: schemas, graph, API, exports, resilience, cache, middleware, domain config, errors, metrics, security, prompts, integration, graph viz, session FSM, validators, event bus, calibration, LLM helpers (pytest)
  • Input validation on all API request models (question length, context length, format enum)
  • Rate limiting middleware: token bucket per IP with configurable RPM/burst
  • Request timing middleware: X-Request-ID + X-Response-Time headers on all responses
  • Retry with exponential backoff + jitter for transient LLM failures
  • Circuit breaker (CLOSED/OPEN/HALF_OPEN) wired into LLM retry path via call_llm_with_resilience() for external service fault tolerance
  • TTL cache for domain detection to reduce redundant LLM calls
  • Deep health check: Groq API, Redis, session store, uptime reporting
  • Graceful startup/shutdown lifecycle handlers with structured logging
  • Session-aware structured logging via contextvars for async correlation IDs
  • Structured error hierarchy: VerdictError → AgentError/SessionError/ExportError with JSON serialization
  • Pipeline performance metrics: per-agent durations with p50/p95/p99 percentiles, error rates, and pipeline-wide aggregates exposed via /metrics endpoint
  • Pipeline progress tracking: /status endpoint returns completion percentage, current stage, stages remaining, and estimated time to completion
  • Security middleware: XSS pattern detection, HTML entity sanitization, body size limits, security headers
  • Shared LLM utilities: utils/llm_helpers.py — all 6 agents use parse_llm_json(), create_llm(), emit_thinking_phases(), and retry_with_low_temperature() (eliminated ~160 lines of duplicated boilerplate)
  • Centralized prompt templates: ALL 6 agent system prompts defined in agents/prompts.py — zero inline prompt definitions, single source of truth for constitutional directives
  • Session analytics: aggregate ruling distribution, domain breakdown, format usage via /sessions/analytics
  • Pipeline graph visualization: structured topology generation with dynamic witness nodes and routing paths
  • Async event bus: topic-based pub/sub with fire-and-forget delivery for pipeline observability
  • Confidence calibration: binned ECE tracking per agent per domain with Platt scaling and isotonic regression (PAVA) for post-hoc correction
  • Witness-calibrated confidence: _calibrate_from_witnesses() uses witness verdicts as ground truth for per-agent ECE tracking; supports Platt scaling and isotonic regression (PAVA) for post-hoc confidence correction
  • Adaptive temperature: _adaptive_temperature() adjusts prosecutor/defense LLM temperature based on research quality scores
  • Pipeline-wide observability: all 7 graph nodes emit start/complete/error events via event bus with rich payloads (confidence, claim counts, ruling outcomes)
  • Research quality scoring: 5-dimension assessment (breadth, depth, grounding, balance, completeness) with weighted overall score
  • Witness-weighted evidence scoring: Judge computes quantitative pro/defense scores adjusted by witness verdicts (sustained/overruled)
  • Synthesis coverage assessment: measures objection coverage %, action time-boundedness, and strength delta
  • Argument dependency graph: DAG of claim dependencies with BFS cascading impact, coherence scoring, critical path detection — flows into cross-examination and witness prioritization
  • Verdict stability analysis: Monte Carlo perturbation testing (50 runs, ±10% witness confidence) with evidence margin and flip rate — flows into synthesis for stability-aware recommendations
  • Leave-one-out sensitivity analysis: identifies pivotal witnesses whose removal would flip the verdict, computes fragility score as fraction of pivotal witnesses
  • TF-IDF weighted claim similarity: replaces naive keyword overlap with smoothed TF-IDF cosine similarity for more accurate argument dependency detection
  • Topological sort (Kahn's algorithm): dependency-ordered claim traversal in argument graphs — foundations first, dependents last
  • Platt scaling feedback loop: auto-fits calibration coefficients after 10+ witness records, then applies sigmoid correction to raw agent confidence in future verdicts
  • Cross-graph analysis: CrossGraphAnalyzer detects claims from opposing sides that reference the same underlying facts — classifies as contested territory, contradictory foundations, or foundation attacks
  • Structural entropy: Shannon entropy over degree distribution measures how evenly dependencies are distributed across claims — low entropy indicates fragile star topologies with high-impact central claims
  • Argument quality scoring: 6-dimension heuristic assessment (specificity, diversity, calibration, coherence, actionability) with A-D grading — flows through state to inform witness and synthesis behavior
  • Intelligence-driven routing: computed analysis (graph structure, quality scores, stability) persisted in VerdictState and used by downstream nodes for smarter cross-examination, impact-weighted witness prioritization, and stability-aware synthesis with contingency plans
  • Session lifecycle FSM: 5-state machine (created→running→complete/error/expired) with validated transitions wired into API routes
  • Domain-aware input validators: question quality scoring, research package completeness, format-domain compatibility — wired into API and pipeline
  • Quality-aware synthesis: argument quality scores feed into the synthesis prompt, weighting the stronger side's arguments
  • py.typed PEP 561 marker for static type checking support

Frontend (fully functional)

  • Sequential ACT-based courtroom UI (5 Acts: Investigation, Debate, Cross-Examination, Ruling, Synthesis)
  • Speech-bubble debate layout (Prosecutor LEFT, Defense RIGHT)
  • Live agent status indicators with active/complete states
  • Verdict card with typewriter-animated ruling
  • Synthesis card with battle-tested improved version
  • Warm judicial design system (Cormorant Garamond headings, liquid glass cards)
  • Framer Motion ACT transitions and staggered reveal animations
  • Output format selector (Executive, Technical, Legal, Investor)
  • Domain badge auto-detected as user types
  • Voice input via Web Speech API — mic button with animated waveform indicator, transcript streams into text input
  • Analytics panel: Recharts BarChart (claim confidence), RadarChart (argument comparison), StatCards, witness verdicts — wired to live agent data
  • Comparison mode: side-by-side prosecution vs defense view with strength bars, aligned claims, confidence indicators, and witness verdict summary
  • Pipeline visualization: interactive agent pipeline flow diagram with real-time status indicators, parallel branch display, and dynamic witness node rendering
  • Domain-specific PDF reports: 9 domain color themes (business gold, legal indigo, medical red, financial emerald, etc.) with accent-colored titles, section headers, and dividers
  • WebSocket reconnection with exponential backoff (5 attempts, jitter, clean disconnect)
  • Robust export download helper with error handling, response validation, and user feedback

Deployment

Scope Management

The master plan defined a cut rule: "analytics charts are cut before the courtroom UI is degraded." All originally-at-risk features were completed and promoted to Tier 1:

Feature Evidence
Recharts analytics panel AnalyticsPanel.jsx wired to live agent data
Verdict history persistence JSON file store in data/sessions/
Voice input MicButton.jsx + useVoiceInput.js (Chrome/Edge)
Verdict sharing URL GET /{id}/share + GET /shared/{token}
DOCX export GET /api/verdict/{id}/export/docx

Quick Start

Prerequisites

Backend

cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Create .env with your API key
echo "GROQ_API_KEY=your-key-here" > .env

# Run the server
uvicorn main:app --reload

Backend runs at http://localhost:8000

Frontend

cd frontend
npm install
npm run dev

Frontend runs at http://localhost:5173

Docker

# Set your API key in backend/.env
echo "GROQ_API_KEY=your-key-here" > backend/.env

# Start both services
docker-compose up --build

API Endpoints

Method Endpoint Description
POST /api/verdict/start Submit a decision with output_format, returns session_id + detected domain
POST /api/verdict/detect-domain LLM-powered domain detection with format suggestion
GET /api/verdict/formats List available output formats with descriptions
GET /api/verdict/sessions/history List all sessions with metadata
GET /api/verdict/{id}/status Get session status
GET /api/verdict/{id}/result Get complete result JSON
WS /api/verdict/{id}/stream Real-time agent event stream
GET /api/verdict/{id}/export/markdown Export as Markdown report
GET /api/verdict/{id}/export/pdf Export as formatted PDF
GET /api/verdict/{id}/export/json Export as structured JSON
GET /api/verdict/{id}/export/docx Export as formatted DOCX
POST /api/verdict/{id}/followup Context-aware follow-up Q&A
GET /api/verdict/{id}/share Generate shareable verdict URL token
GET /api/verdict/shared/{token} Retrieve verdict by share token
GET /api/verdict/{id}/analysis Argument quality, stability, and dependency graph analysis
GET /api/verdict/{id}/graph Pipeline graph visualization for session
GET /api/verdict/graph/topology Static pipeline topology diagram
GET /api/verdict/sessions/analytics Aggregate session analytics
GET /health Deep health check with dependency readiness
GET /metrics Pipeline performance metrics

POST /api/verdict/start

{
  "question": "Should I pivot my SaaS from B2C to B2B?",
  "context": "Optional additional context...",
  "output_format": "executive"
}

Response includes auto-detected domain and chosen output_format.

POST /api/verdict/detect-domain

{
  "question": "Should we raise a Series A at $15M valuation?"
}

Returns: { "domain": "financial", "confidence": 0.9, "suggested_format": "investor", "reasoning": "..." }

WebSocket Events

Connect to ws://localhost:8000/api/verdict/{session_id}/stream to receive real-time events:

Event Type Agent Description
research_start research Research agent begins analysis
research_complete research Research package ready
prosecutor_thinking prosecutor Prosecution building case
prosecutor_complete prosecutor Prosecution rests
defense_thinking defense Defense building counter-case
defense_complete defense Defense rests
judge_start judge Cross-examination begins
witness_spawned witness_* Specialist witness activated
witness_complete witness_* Witness report filed
verdict_complete judge Final ruling delivered
synthesis_start synthesis Battle-tested synthesis begins
synthesis_complete synthesis Improved idea ready
pipeline_complete All agents finished
error varies Error during processing

Environment Variables

Variable Required Default Description
GROQ_API_KEY Yes Groq API key for LLM inference
REDIS_URL No Redis connection string for production checkpointing (redis://host:6379/0)
INTERRUPT_BEFORE_VERDICT No Set to true to enable human-in-the-loop review before verdict
TAVILY_API_KEY No Tavily API key for high-quality web search grounding (falls back to DuckDuckGo)
LOG_LEVEL No INFO Python logging level

Verdict — every decision deserves a challenger.

About

No description, website, or topics provided.

Resources

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors