MedEval

Healthcare AI triage assistant in which deterministic Python rules make the safety-critical decision; the LLM only extracts facts from the complaint and writes the explanation. Built on the AHRQ ESI Handbook v4.

🌐 Live Demo

Try it: https://medeval.rajkumarai.dev 🔒

The demo is gated by an API key. Reach out if you'd like the key to try it.

🧠 The Key Idea

LLMs hallucinate. In healthcare, hallucinating an urgency level can kill someone.

So MedEval structurally prevents the LLM from making the urgency decision:

Patient complaint
       ↓
[ LLM ] extract structured facts from free text
       ↓
[ Python rules engine ] decide ESI level (1-5) ← LLM never sees this code
       ↓
[ LLM ] translate clinical rationale into plain English
       ↓
Patient-facing message

The LLM never picks the urgency level: it only supplies structured facts to the rules engine and rewords the engine's rationale afterward. Every rule traces back to a specific page of the AHRQ ESI Implementation Handbook v4, the protocol used by U.S. emergency departments.
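The separation can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the two-field `Facts` class stands in for the real `ExtractedFacts` model, the two rules are placeholders rather than the project's handbook rules, and the LLM stages are replaced by stub functions so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Facts:
    # Hypothetical stand-in for ExtractedFacts; every field defaults to False.
    unresponsive: bool = False
    severe_pain: bool = False


def decide_esi(facts: Facts) -> tuple[int, str]:
    """Deterministic stage: plain Python, no LLM call anywhere in here."""
    # Illustrative placeholder rules, not the project's esi_rules.yaml.
    if facts.unresponsive:
        return 1, "Patient is unresponsive."
    if facts.severe_pain:
        return 2, "Patient reports severe pain."
    return 5, "No high-acuity findings were extracted."


def run_pipeline(
    complaint: str,
    extract: Callable[[str], Facts],
    explain: Callable[[int, str], str],
) -> dict:
    facts = extract(complaint)             # LLM stage 1: free text -> facts
    level, rationale = decide_esi(facts)   # rules stage: the only place a level is chosen
    message = explain(level, rationale)    # LLM stage 2: receives the result, cannot change it
    return {"esi_level": level, "message": message}


def fake_extract(complaint: str) -> Facts:
    # Stub for the extraction LLM call.
    return Facts(severe_pain="severe" in complaint.lower())


def fake_explain(level: int, rationale: str) -> str:
    # Stub for the explanation LLM call.
    return f"ESI {level}: {rationale}"


if __name__ == "__main__":
    print(run_pipeline("Severe chest pain for an hour", fake_extract, fake_explain))
```

Because `explain` is called only after `decide_esi` has returned, whatever text it produces cannot alter the level stored in the result.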

🏗️ Architecture

┌─────────────────┐      ┌──────────────────────────────────┐
│  React + Vite   │ ───► │  FastAPI                         │
│  patient UI     │      │   ├─ /health                     │
│  doctor UI      │      │   └─ /triage (X-API-Key auth)    │
└─────────────────┘      │       │                          │
                         │       ▼                          │
                         │  LangGraph pipeline              │
                         │   ├─ extract  (OpenAI)           │
                         │   ├─ triage   (68 YAML rules)    │
                         │   └─ explain  (OpenAI)           │
                         │       │                          │
                         │       ▼                          │
                         │  Langfuse (traces every LLM call)│
                         └──────────────────────────────────┘
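As a rough illustration of how a client might call the `/triage` endpoint shown in the diagram, the sketch below builds the request with the standard library only. The JSON body field `complaint` and the base URL are assumptions; the actual `TriageRequest` schema lives in `backend/models.py` and may differ.

```python
import json
import urllib.request


def build_triage_request(base_url: str, api_key: str, complaint: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /triage with the X-API-Key header."""
    # "complaint" is a hypothetical field name for the request body.
    body = json.dumps({"complaint": complaint}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/triage",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


if __name__ == "__main__":
    req = build_triage_request("http://localhost:8000", "your-key", "chest pain")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the example runnable without a live backend.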

Schema Highlights

ExtractedFacts: 95-field Pydantic model. The LLM is constrained to produce only this shape via OpenAI structured output. Every field defaults to False, so a finding counts only when the LLM positively extracts it (the design prefers a rule under-triggering to over-triggering).
esi_rules.yaml: 68 rules across all four ESI decision points (A, B, C, D) plus pediatric overlays. Each rule cites its source page. The schema supports nested any_of / all_of conditions and applies_when preconditions.
engine.py: ~100 lines of pure Python. Loads the YAML, evaluates rules in handbook order, and returns the level, the rules fired, their rationales, and the decision path.
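A recursive evaluator over nested any_of / all_of conditions with applies_when preconditions could look roughly like the sketch below. The rule ids, fact names, levels, and the `when` key are invented for illustration and are not taken from esi_rules.yaml; the rules are written as Python dicts in the shape parsed YAML would have. Unlike the real engine, this version stops at the first match and returns only the level and rule id.

```python
from typing import Any, Optional

# Hypothetical rules, listed in evaluation order.
RULES = [
    {"id": "A-unresponsive", "level": 1, "when": {"fact": "unresponsive"}},
    {
        "id": "B-infant-fever",
        "level": 2,
        "applies_when": {"fact": "is_infant"},  # precondition: rule is skipped otherwise
        "when": {
            "all_of": [
                {"fact": "fever"},
                {"any_of": [{"fact": "lethargic"}, {"fact": "poor_feeding"}]},
            ]
        },
    },
]


def matches(cond: dict[str, Any], facts: dict[str, bool]) -> bool:
    """Recursively evaluate a condition tree against extracted facts."""
    if "all_of" in cond:
        return all(matches(c, facts) for c in cond["all_of"])
    if "any_of" in cond:
        return any(matches(c, facts) for c in cond["any_of"])
    # Leaf node: a missing fact counts as False, mirroring the False defaults.
    return facts.get(cond["fact"], False)


def evaluate(
    rules: list[dict[str, Any]],
    facts: dict[str, bool],
    default_level: int = 5,
) -> tuple[int, Optional[str]]:
    """Return (level, rule_id) for the first rule that fires, in list order."""
    for rule in rules:
        precondition = rule.get("applies_when")
        if precondition is not None and not matches(precondition, facts):
            continue
        if matches(rule["when"], facts):
            return rule["level"], rule["id"]
    return default_level, None


if __name__ == "__main__":
    print(evaluate(RULES, {"is_infant": True, "fever": True, "lethargic": True}))
```

Keeping conditions as data means a new rule can be added, and cited to its handbook page, without touching the evaluator.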

🛠️ Tech Stack

Layer	Stack
Frontend	React 18, TypeScript, Vite, Tailwind CSS, Lucide icons
Backend	Python 3.12, FastAPI, Pydantic v2
LLM orchestration	LangGraph, LangChain, OpenAI `gpt-4o-mini` (Anthropic planned)
Observability	Langfuse Cloud (per-call traces, latency, tokens)
Auth	API key in `X-API-Key` header
Containerization	Docker (multi-stage frontend build), Docker Compose
Deployment	AWS EC2 `t3.micro`, Ubuntu 24.04, Nginx reverse proxy, HTTPS via Let's Encrypt
Evaluation	`medeval-harness` (published on PyPI), GitHub Actions CI

🚀 Local Development

Prerequisites

Python 3.12+
Node.js 20+
Docker Desktop (for containerized run)
An OpenAI API key + Langfuse account

Backend

cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then fill in real keys
uvicorn main:app --reload

Frontend

cd frontend
npm install
npm run dev

Open http://localhost:5173.

Or run the whole thing in Docker

docker compose up

☁️ Production Deploy (AWS EC2)

Images are published to Docker Hub:

docker pull raja1566/medeval-backend:latest
docker pull raja1566/medeval-frontend:latest

On the EC2 host:

docker compose up -d

In production, an Nginx reverse proxy on the host terminates TLS (Let's Encrypt, auto-renewing) and routes traffic:

https://medeval.rajkumarai.dev/      → frontend container (:5173)
https://medeval.rajkumarai.dev/api/  → backend container  (:8000)

See docker-compose.yml for the full configuration.

📁 Project Structure

MedEval/
├── backend/
│   ├── agent/                  LangGraph nodes + prompts
│   │   ├── prompts/
│   │   ├── llm.py              provider config (one place to swap)
│   │   ├── extraction.py       Node 1: complaint → facts
│   │   ├── explanation.py      Node 3: result → patient message
│   │   └── graph.py            LangGraph wiring
│   ├── rules/
│   │   ├── esi_rules.yaml      68 traceable rules
│   │   ├── facts.py            95-field ExtractedFacts contract
│   │   ├── result.py           EngineResult shape
│   │   └── engine.py           recursive evaluator
│   ├── main.py                 FastAPI app
│   ├── models.py               TriageRequest
│   ├── security.py             API key auth dependency
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── api.ts              fetch wrapper + localStorage key mgmt
│   │   ├── types.ts            backend mirror
│   │   └── App.tsx             single-page UI
│   ├── nginx.conf
│   └── Dockerfile              multi-stage (Node build → Nginx)
├── harness/                    Phase 2 — evaluation harness (PyPI package)
│   ├── src/medeval_harness/
│   │   ├── cases.py            case schema + dataset loader
│   │   ├── runner.py           HTTP client that calls the agent
│   │   ├── scorer.py           safety-aware metrics (under/over-triage)
│   │   ├── report.py           rich terminal + JSON reports
│   │   ├── cli.py              `medeval-harness evaluate ...`
│   │   └── data/esi_cases.json 50 cases from the AHRQ handbook
│   └── pyproject.toml
├── .github/workflows/eval.yml  CI: auto-eval, fail-under threshold
├── docs/
│   └── esi-handbook-v4.pdf     source of truth
└── docker-compose.yml

🧪 Evaluation Harness (`medeval-harness`)

A standalone, PyPI-published package that scores the agent against a 50-case dataset built from the worked examples in the AHRQ handbook (chapters 9–10, with official ESI answers).

pip install medeval-harness
medeval-harness evaluate --api-url https://medeval.rajkumarai.dev/api --api-key <key>

Unlike a generic accuracy benchmark, the harness reports safety-aware metrics — because in triage, the direction of an error matters more than the rate:

Metric	Baseline	After prompt tuning
Exact accuracy	66%	68%
Under-triage rate (marked less urgent than reality — dangerous)	20%	10%
Over-triage rate (marked more urgent — safe, wasteful)	14%	22%

The tuning deliberately traded under-triage for over-triage — the safe direction in medicine. Tuning was stopped at this point on purpose: pushing accuracy higher would have meant overfitting the eval set. A proper held-out test set is noted as future work.
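The three metrics in the table can be computed along the following lines. This is a sketch, not the code in `scorer.py`: the function name and the input shape are assumptions. It relies only on the ESI convention that level 1 is the most urgent, so a predicted level numerically higher than the expected one is an under-triage.

```python
def score(pairs: list[tuple[int, int]]) -> dict[str, float]:
    """Score (expected_level, predicted_level) pairs; ESI 1 is the most urgent."""
    total = len(pairs)
    if total == 0:
        raise ValueError("no cases to score")
    exact = sum(1 for expected, predicted in pairs if predicted == expected)
    # Higher number = less urgent, so predicted > expected is the dangerous direction.
    under = sum(1 for expected, predicted in pairs if predicted > expected)
    over = sum(1 for expected, predicted in pairs if predicted < expected)
    return {
        "exact_accuracy": exact / total,
        "under_triage_rate": under / total,
        "over_triage_rate": over / total,
    }


if __name__ == "__main__":
    print(score([(1, 1), (2, 3), (3, 2), (4, 4)]))
```

The three rates always sum to 1, which matches the table: 66% + 20% + 14% before tuning and 68% + 10% + 22% after.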

The harness runs in GitHub Actions on every change to the agent, rules, or dataset, and fails the build if exact accuracy drops below 60% — catching triage regressions automatically.
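A CI gate of this kind reduces to turning the accuracy into a process exit code. The sketch below assumes a hypothetical helper name and a threshold expressed as a fraction; how the published CLI actually exposes the threshold is not shown here.

```python
def exit_code_for(exact_accuracy: float, fail_under: float = 0.60) -> int:
    """Return 1 (fail the build) when accuracy is below the threshold, else 0."""
    # "Drops below" is read as strictly less than, so exactly 60% still passes.
    return 1 if exact_accuracy < fail_under else 0


if __name__ == "__main__":
    import sys

    # 0.68 is the post-tuning accuracy reported in the table above.
    sys.exit(exit_code_for(0.68))
```

A non-zero exit code is what makes a GitHub Actions step, and therefore the build, fail.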

🗺️ Roadmap

Phase 1 ✅ Live triage app (rules engine + LLM extraction + LLM explanation + UI + HTTPS deploy)
Phase 2 ✅ Evaluation harness — 50-case ESI dataset, safety-aware scoring, GitHub Actions CI, published as medeval-harness on PyPI
Phase 3 ⬜ Multi-provider LLM router (Claude, GPT, Gemini) with cost dashboard. Deterministic rules published as triage-rules on PyPI.

⚠️ Disclaimer

MedEval is a portfolio project. It is NOT a medical device, has not been reviewed by any regulator, and must not be used for real clinical decisions. All examples and screenshots use synthetic data.

📜 License

MIT — see LICENSE (forthcoming).

Built by Rajkumar N.
