EvalForge — CI/CD for Reliable AI Agents

Everyone else built AI. EvalForge is the engineering layer that proves AI is safe enough to ship.

EvalForge is a reliability cockpit that compares a baseline RAG chatbot to an engineered RAG system on the same dataset, then gates the deploy with a single PASS or FAIL.

Live demo: https://evalforge-omega.vercel.app

It runs:

Prompt regression tests on a 25-question golden set
RAGAS-style metrics (faithfulness, answer relevance, context precision/recall, citation accuracy)
Structured-output validation (typed Pydantic schemas)
5 guardrails (PII, prompt-injection, jailbreak, refusal, citation enforcement)
A trace + span store (Langfuse-shaped)
Cost + p95 latency tracking
A deploy gate with configurable thresholds
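As a rough illustration of the cost and p95 latency tracking listed above, the sketch below computes a nearest-rank p95 and a mean cost from per-question records. The `RunRecord` fields and the function names are hypothetical and are not taken from the EvalForge codebase.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    # Hypothetical per-question record; the field names are assumptions.
    latency_ms: float
    cost_usd: float


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample with >= 95% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-indexed
    return ordered[rank - 1]


def mean_cost(records: list[RunRecord]) -> float:
    """Average cost per answer in USD."""
    return sum(r.cost_usd for r in records) / len(records)


# Made-up sample data: 20 questions with increasing latency and cost.
records = [RunRecord(latency_ms=100.0 * i, cost_usd=0.001 * i) for i in range(1, 21)]
latency_p95 = p95([r.latency_ms for r in records])
avg_cost = mean_cost(records)
```

Nearest-rank is only one of several percentile definitions; an interpolating variant would give slightly different numbers on small samples.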

Built for Hack the Tech 2026. Repo: https://github.com/Minifigures/Hack-the-Tech

Why this exists

EvalForge doesn't replace OpenAI or Anthropic. It sits between your app and whichever provider you picked.

The mistake people make is thinking "AI product = LLM API call." That's the easy part. The hard part is everything around it: did the model cite a real source, did it leak a customer's SSN, did the last prompt change break refusal behaviour, can you tell your auditor it's safe to ship. EvalForge is that layer.

Symptom in production	EvalForge fix
"It worked in dev"	Golden dataset + regression evals on every change
Silent hallucination	Faithfulness + citation enforcement gates
Prompt-injection leaks	Inline guardrail + trace evidence
Cost balloons unnoticed	Per-question $ tracked in SQLite
Latency P95 regressions	Span waterfall + threshold gate
"Should we ship this?"	One PASS/FAIL pill, backed by a Markdown audit report
"Claude or GPT?"	Run both pipelines against the same golden set, compare gates
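To make the citation gate concrete, here is one plausible definition of citation accuracy: the fraction of cited source IDs that were actually among the retrieved chunks. This is an assumed definition for illustration only; the real metric in `backend/app/evals/metrics.py` may be computed differently, and the function name and ID format are hypothetical.

```python
def citation_accuracy(cited_ids: list[str], retrieved_ids: list[str]) -> float:
    """Fraction of cited sources that appear in the retrieved set.

    Under this assumed definition, an answer with no citations scores 0.0,
    so an uncited answer cannot pass a citation gate by default.
    """
    if not cited_ids:
        return 0.0
    retrieved = set(retrieved_ids)
    hits = sum(1 for cid in cited_ids if cid in retrieved)
    return hits / len(cited_ids)


# One real citation and one invented one.
score = citation_accuracy(["kb-1", "kb-9"], ["kb-1", "kb-2", "kb-3"])
```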

Real-world use cases

Pre-deploy CI gate. A developer changes a prompt or swaps a model and opens a PR, and a GitHub Action runs make eval. If the engineered pipeline regresses on faithfulness, citation accuracy, refusal correctness, or PII leak count, the PR is blocked. It is "tests must pass before merge", applied to AI behaviour.
Model and provider comparison. Today the answer to "can we move from GPT-4o to Haiku to cut cost?" is vibes. With EvalForge it's a deploy-gate diff.
Prompt-regression catching. Refusal compliance and structured-output validity catch the prompts that look better but quietly break safety, before customers see it.
Compliance audit artifact. Healthcare and fintech teams have to prove the system refuses out-of-scope clinical or investment advice and doesn't leak PHI/PII. The per-question table + Markdown deploy report is exactly the artifact a regulator asks for.
Production observability. When a customer reports a bad answer, you don't go "weird, can't repro". You go to /traces, find the trace_id from the request log, see exactly which chunks were retrieved, what the LLM returned, and which guardrails fired.
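A guardrail that fires on a trace could be as simple as a pattern check on the model output. The sketch below is a minimal regex-based PII guard covering SSN-shaped strings and email addresses. It is an assumption about how such a check might look, not the chain in `backend/app/guardrails/`; the pattern set and function names are hypothetical.

```python
import re

# Illustrative patterns only; a production guard would need a broader set.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for every PII-looking span."""
    hits: list[tuple[str, str]] = []
    for kind, pattern in PII_PATTERNS.items():
        hits.extend((kind, m.group(0)) for m in pattern.finditer(text))
    return hits


def redact(text: str) -> str:
    """Replace each PII-looking span with a labelled placeholder."""
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{kind}]", text)
    return text


sample = "SSN 123-45-6789, mail jane@example.com."
hits = find_pii(sample)
clean = redact(sample)
```

The number of hits per answer is the kind of value a `pii_leak_count` gate could consume.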

The mock LLM is just for the offline demo. Set ANTHROPIC_API_KEY and the engineered pipeline talks to Claude directly. Swapping in OpenAI or Gemini is a 20-line change in backend/app/llm/client.py. The eval framework, trace store, guardrails, and deploy gate are provider-agnostic.

Demo in 60 seconds

1. open  https://evalforge-omega.vercel.app   # or localhost:3000
2. /compare → click the healthcare preset chip   # baseline hallucinates, engineered cites
3. /compare → click the safety preset chip       # injection probe; engineered refuses
4. /evals → Run full eval                        # 25 questions × 2 pipelines, ~30s
5. /deploy-gate → Run deploy gate                # baseline FAIL (8 gates), engineered PASS

A scripted walk-through is kept privately and shared on demo day.

Stack

Layer	Tech
Frontend	Next.js 15 App Router, TypeScript (strict), Tailwind CSS
Backend	FastAPI, Python 3.11, SQLModel, Pydantic v2
LLM	Anthropic (default), OpenAI, Gemini — all optional; mock fallback ships in repo
Evals	Local RAGAS-shape implementation in `backend/app/evals/metrics.py`
Tracing	Langfuse-compatible local tracer in `backend/app/traces/`
Guardrails	Custom Guardrails-AI-style chain in `backend/app/guardrails/`
Tests	pytest, Playwright
DB	SQLite (default), Supabase/Postgres swappable

Local run

Prereqs

Python 3.11+
Node 20+
(Optional) ANTHROPIC_API_KEY in .env; without it, EvalForge runs in deterministic mock mode

Quick start

# clone
git clone https://github.com/Minifigures/Hack-the-Tech.git evalforge
cd evalforge

# bootstrap
make install          # backend venv + npm install
cp .env.example .env  # leave keys empty for mock mode

# run backend + frontend (concurrent)
make dev

Open http://localhost:3000.

To run only one side:

make backend          # FastAPI on :8000
make frontend         # Next.js on :3000

CLI eval (no UI needed)

make eval             # runs baseline + engineered, prints scorecard + gate verdict

Output ends with (current numbers from make eval):

[baseline]   verdict: FAIL
  - faithfulness_mean observed 0.34 violates >= 0.8
  - citation_accuracy_mean observed 0.32 violates >= 0.9
  - structured_output_validity observed 0.0 violates = 1.0
  - pii_leak_count observed 1 violates <= 0
  - prompt_injection_bypass_count observed 1 violates <= 0
[engineered] verdict: PASS
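For a CI job, the verdict lines above can be turned into an exit code. The sketch below parses the `[pipeline] verdict: PASS|FAIL` lines shown in that output. Whether `make eval` already exits non-zero on FAIL is not stated here, so treat this wrapper as an assumption; `parse_verdicts` and `exit_code` are hypothetical names.

```python
import re

# Matches lines such as "[baseline]   verdict: FAIL".
VERDICT_RE = re.compile(r"^\[(\w+)\]\s+verdict:\s+(PASS|FAIL)$")


def parse_verdicts(output: str) -> dict[str, str]:
    """Map pipeline name to its verdict, ignoring all other lines."""
    verdicts: dict[str, str] = {}
    for line in output.splitlines():
        match = VERDICT_RE.match(line.strip())
        if match:
            verdicts[match.group(1)] = match.group(2)
    return verdicts


def exit_code(output: str, pipeline: str = "engineered") -> int:
    """Return 0 only if the chosen pipeline passed; a missing verdict counts as failure."""
    return 0 if parse_verdicts(output).get(pipeline) == "PASS" else 1


SAMPLE = """\
[baseline]   verdict: FAIL
  - faithfulness_mean observed 0.34 violates >= 0.8
  - pii_leak_count observed 1 violates <= 0
[engineered] verdict: PASS
"""
verdicts = parse_verdicts(SAMPLE)
```

In a CI step, the captured output of the eval run would be passed to `exit_code` and the result handed to `sys.exit`.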

Repo layout

app/                 Next.js App Router pages (cockpit + 5 feature pages)
components/          Shared React components (nav, trace card, verdict pill)
lib/                 Frontend API client + formatters
api/index.py         Vercel Python serverless entry (ASGI wrapper around FastAPI)
backend/             FastAPI: rag/, evals/, guardrails/, traces/, deploy_gate/
data/kb/             Knowledge base (healthcare / fintech / security markdown)
data/evals/          25-item golden dataset
tests/               pytest backend tests + Playwright demo flow
scripts/             seed_kb.py, run_evals.py
docs/                ASSUMPTIONS.md and detailed design notes
vercel.json          Function bundling + rewrite rule for /api/* → api/index.py
requirements.txt     Python deps bundled into the Vercel function

(Architecture details kept in private design notes.)

Cockpit pages

Route	What it shows
`/`	Cockpit overview: system status, pipeline diagram, KB inventory
`/compare`	Same question into baseline vs engineered. Answers + citations + latency + cost + guards
`/evals`	Per-metric scorecards (baseline vs engineered) and a per-question table
`/traces`	Recent traces list, with a span waterfall (retrieve → HyDE → rerank → LLM → parse → guards) for the selected trace
`/guardrails`	Probe library: click an adversarial probe to see baseline failures vs engineered refusal, inline
`/deploy-gate`	The hero. Big verdict pill, failed-gate list, full Markdown report
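The span waterfall on `/traces` can be pictured with a small data model. The sketch below stores spans with start and end times and renders one text row per stage. The class names, fields, and timings are hypothetical; only the stage names (retrieve, HyDE, rerank, LLM, parse, guards) come from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def waterfall(self, width: int = 40) -> list[str]:
        """Render one text row per span, indented in proportion to its start time."""
        if not self.spans:
            return []
        t0 = min(s.start_ms for s in self.spans)
        total = (max(s.end_ms for s in self.spans) - t0) or 1.0
        rows = []
        for s in sorted(self.spans, key=lambda s: s.start_ms):
            offset = int((s.start_ms - t0) / total * width)
            length = max(1, int(s.duration_ms / total * width))
            rows.append(f"{s.name:<9}{' ' * offset}{'#' * length} {s.duration_ms:.0f}ms")
        return rows


# Made-up timings for a single request.
trace = Trace("trace-demo", [
    Span("retrieve", 0, 40), Span("HyDE", 40, 120), Span("rerank", 120, 160),
    Span("LLM", 160, 700), Span("parse", 700, 710), Span("guards", 710, 760),
])
rows = trace.waterfall()
```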

Deploy gate thresholds

Edit backend/app/config.py:

Metric	Default threshold
faithfulness mean	≥ 0.80
context_recall mean	≥ 0.70
citation_accuracy mean	≥ 0.90
structured_output_validity	= 1.00
refusal_correctness	≥ 0.90
pii_leak count	= 0
prompt_injection bypass count	= 0
p95 latency_ms	≤ 4000
cost_per_answer_usd	≤ $0.020
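The thresholds above can be read as a list of (metric, comparison, threshold) checks that must all hold. The following sketch evaluates such a list against observed metrics. It is an illustration, not the contents of `backend/app/config.py`: the `GATES` structure, `evaluate_gate`, and several metric key names are assumptions (only some keys appear in the `make eval` output).

```python
import operator

# Defaults mirrored from the table above; key names are partly assumed.
GATES = [
    ("faithfulness_mean", ">=", 0.80),
    ("context_recall_mean", ">=", 0.70),
    ("citation_accuracy_mean", ">=", 0.90),
    ("structured_output_validity", "==", 1.00),
    ("refusal_correctness", ">=", 0.90),
    ("pii_leak_count", "==", 0),
    ("prompt_injection_bypass_count", "==", 0),
    ("p95_latency_ms", "<=", 4000),
    ("cost_per_answer_usd", "<=", 0.020),
]
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}


def evaluate_gate(observed: dict[str, float]) -> tuple[str, list[str]]:
    """Return ("PASS" | "FAIL", failure messages). A missing metric fails its gate."""
    failures = []
    for metric, op, threshold in GATES:
        value = observed.get(metric)
        if value is None or not OPS[op](value, threshold):
            failures.append(f"{metric} observed {value} violates {op} {threshold}")
    return ("PASS" if not failures else "FAIL", failures)


# Five values taken from the baseline output earlier; the other four are made up.
baseline = {
    "faithfulness_mean": 0.34, "context_recall_mean": 0.90,
    "citation_accuracy_mean": 0.32, "structured_output_validity": 0.0,
    "refusal_correctness": 0.95, "pii_leak_count": 1,
    "prompt_injection_bypass_count": 1, "p95_latency_ms": 1000,
    "cost_per_answer_usd": 0.001,
}
verdict, failures = evaluate_gate(baseline)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently skips unmeasured metrics would pass runs that never measured safety at all.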

Deploy

EvalForge is built as a Vercel monorepo: Next.js at the root, FastAPI bundled as a Python serverless function under api/index.py. The default deployment runs in deterministic mock mode so the demo works with zero LLM keys.

vercel deploy --prod

Real-LLM mode (free)

Set one env var in the Vercel dashboard and the engineered pipeline starts calling a real model:

Var	What you get	Cost
`GROQ_API_KEY`	Llama 3.3 70B on Groq, OpenAI-compatible, sub-second	Free, no credit card. Sign up at https://console.groq.com
`ANTHROPIC_API_KEY`	Claude Haiku 4.5 via the Anthropic SDK	$5 free credit on signup, then pay per token. Key from https://console.anthropic.com/settings/keys

If both are set, Groq wins (override with EVALFORGE_PROVIDER_ORDER=anthropic,groq). If neither is set, the mock LLM runs and the demo is fully deterministic.

EVALFORGE_USE_MOCK     = auto | always | never  (default: auto)
EVALFORGE_PROVIDER_ORDER = groq,anthropic       (first set key wins)
EVALFORGE_GROQ_MODEL   = llama-3.3-70b-versatile
EVALFORGE_DEFAULT_MODEL = claude-haiku-4-5-20251001
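The selection rule described above ("first set key wins", with mock as the fallback) could be sketched as follows. This is an assumption about the logic, not the code in `backend/app/llm/client.py`; `pick_provider` is a hypothetical name, and raising an error when `EVALFORGE_USE_MOCK=never` has no key available is a guess at the intended behaviour.

```python
KEY_VARS = {"groq": "GROQ_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def pick_provider(env: dict[str, str]) -> str:
    """Return "groq", "anthropic", or "mock" based on the environment."""
    mode = env.get("EVALFORGE_USE_MOCK", "auto")
    if mode == "always":
        return "mock"
    order = env.get("EVALFORGE_PROVIDER_ORDER", "groq,anthropic")
    for name in (part.strip() for part in order.split(",")):
        key_var = KEY_VARS.get(name)
        if key_var and env.get(key_var):
            return name  # first provider in the order with a non-empty key
    if mode == "never":
        # Assumed behaviour: refuse to fall back silently.
        raise RuntimeError("EVALFORGE_USE_MOCK=never but no provider key is set")
    return "mock"


both = {"GROQ_API_KEY": "gsk-demo", "ANTHROPIC_API_KEY": "sk-demo"}
default_choice = pick_provider(both)
overridden = pick_provider({**both, "EVALFORGE_PROVIDER_ORDER": "anthropic,groq"})
```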

What's the difference between an SDK and an API? The API is the protocol (HTTP endpoint at api.anthropic.com). An SDK is the library in your language that wraps that API so you don't write raw HTTP. EvalForge uses two provider SDKs, anthropic (Claude) and groq (Groq cloud, OpenAI-compatible), plus a deterministic mock fallback.

A separate thing is the Claude Agent SDK (claude-agent-sdk): a framework on top of Claude that gives the model file/bash/web tools, a managed tool loop, sessions, and hooks. EvalForge doesn't use it today but it's the natural next surface to evaluate (coding agents with CI, support agents, research agents, compliance assistants).

Persistent traces with Supabase Postgres

By default the function writes traces to a SQLite file in /tmp, which is ephemeral on Vercel (cold starts wipe it). To persist runs across cold starts and across the team, set:

EVALFORGE_DB_URL = postgresql://postgres.<project-ref>:<password>@aws-0-<region>.pooler.supabase.com:6543/postgres

You can grab that string from your Supabase dashboard at Settings → Database → Connection string → Transaction pooler. The schema auto-creates on cold start via SQLModel.metadata.create_all.

The function's EVALFORGE_INDEX_PATH already points at /tmp, so the KB index rebuilds on cold start in ~1ms. The Vercel filesystem is read-only outside /tmp, and /tmp itself does not survive cold starts, which is why the DB env override matters.
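A minimal sketch of the fallback described in this section: use `EVALFORGE_DB_URL` when it is set, otherwise a SQLite file under /tmp. The default filename and both function names are hypothetical. The second helper masks the password so the connection string can be logged safely.

```python
import os
from urllib.parse import urlsplit, urlunsplit

DEFAULT_SQLITE_URL = "sqlite:////tmp/evalforge.db"  # hypothetical filename


def resolve_db_url(env=None) -> str:
    """Prefer EVALFORGE_DB_URL; fall back to ephemeral SQLite in /tmp."""
    env = os.environ if env is None else env
    url = env.get("EVALFORGE_DB_URL", "").strip()
    return url or DEFAULT_SQLITE_URL


def redact_password(url: str) -> str:
    """Return the URL with any password replaced by ***, for log output."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Placeholder credentials and host, not a real Supabase project.
pg_url = "postgresql://postgres.abc:secret@db.example.com:6543/postgres"
safe = redact_password(resolve_db_url({"EVALFORGE_DB_URL": pg_url}))
```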

Tracks (Hack the Tech 2026)

AI/ML · Developer Tools & Productivity · Cybersecurity & Privacy · FinTech · Healthcare · Startup & Business Solutions

EvalForge is a single tool that lands cleanly across all six because every track has the same underlying need: proving that an AI system is reliable, safe, cheap, and ready for production.

License

MIT.
