30 real hallucination test cases, published April 2026. Run them against any LLM.
30 factual questions across medical, legal, financial, technical, and general knowledge. Each case has a single verifiable correct answer.
These test cases measure confident incorrect responses — the failure mode where an LLM sounds sure but is wrong.
| System | Medical | Legal | Financial | Technical | General | Overall |
|---|---|---|---|---|---|---|
| GPT-4o (bare) | 60% | 80% | 60% | 80% | 90% | 77% |
| Claude 3.5 Sonnet (bare) | 80% | 80% | 60% | 80% | 90% | 80% |
| Llama 3.3 70B (bare) | 60% | 60% | 60% | 80% | 80% | 70% |
| Claude Opus 4 | ~100% | ~100% | ~100% | ~100% | ~100% | ~100% |
| CertainLogic Brain API | 100% | 100% | 100% | 100% | 100% | 100% |
Full results with per-case breakdowns: `results/certainlogic_results.json`
About these results:
- Bare-LLM results (GPT-4o, Claude, Llama): Run via live API calls. Independently reproducible with your own keys.
- CertainLogic Brain API: Proprietary system. Results from April 2026 run. This system is not independently verifiable without API access.
What the test cases measure: Factual correctness on questions with verifiable answers. A system that says "I don't know" scores lower than one that answers correctly. This is one dimension of evaluation, not a comprehensive quality assessment.
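The three outcomes being distinguished (correct, refusal, confident-but-wrong) can be sketched as a simple classifier. This is an illustrative assumption about how scoring might work, not the repo's actual scorer; the function name and refusal phrases are made up for the example:

```python
# Hypothetical scorer: separates correct answers, refusals, and
# confident-but-wrong answers. Illustrative only, not benchmark.py's logic.
REFUSAL_MARKERS = ("i don't know", "i'm not sure", "cannot determine")

def score_response(response: str, correct_answer: str) -> str:
    """Return 'correct', 'refusal', or 'hallucination' (assumed labels)."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refusal"          # scores lower than a correct answer
    if correct_answer.lower() in text:
        return "correct"
    return "hallucination"        # confident and wrong: the measured failure mode

print(score_response("The full retirement age is 67.", "67"))  # correct
print(score_response("I don't know the exact age.", "67"))     # refusal
print(score_response("It's 65 for everyone.", "67"))           # hallucination
```

In practice a substring check like this is too brittle for free-form answers, which is presumably why the benchmark offers an LLM-based evaluator as an auto-scoring option.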
LLMs hallucinate on factual questions in ways that are predictable and dangerous. Models that ace reasoning benchmarks will confidently state wrong retirement ages, unsafe drug combinations, or incorrect capital cities. These are systematic failure modes that appear consistently across providers.
```shell
# 1. Install dependencies
pip install -r requirements.txt

# 2. Run against your model
python benchmark.py --provider openai --model gpt-4o --api-key YOUR_KEY

# 3. Add auto-scoring
python benchmark.py --provider anthropic --model claude-3-5-sonnet-20241022 --api-key YOUR_KEY \
  --evaluator-provider openai --evaluator-model gpt-4o --evaluator-key YOUR_OPENAI_KEY
```

Supported providers: `openai`, `anthropic`, `openrouter`.

Results are saved to `results/<provider>_<model>_<timestamp>.json`. No CertainLogic account required.
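Once a run completes, the per-category numbers in the table above can be recomputed from the results file. A minimal sketch, assuming each result record carries a `category` string and a boolean `passed` field (these field names are assumptions about the JSON schema, not documented guarantees):

```python
from collections import defaultdict

# Hypothetical post-processing of benchmark results. The field names
# ("category", "passed") are assumptions about the results schema.
def per_category_accuracy(cases: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case["category"]] += 1
        passed[case["category"]] += int(case["passed"])
    return {cat: passed[cat] / totals[cat] for cat in totals}

sample = [
    {"id": "fin-003", "category": "financial", "passed": True},
    {"id": "med-002", "category": "medical", "passed": False},
    {"id": "med-001", "category": "medical", "passed": True},
]
print(per_category_accuracy(sample))  # {'financial': 1.0, 'medical': 0.5}
```

Because the categories have different case counts (5 each, except 10 for general), an overall score should be computed over all 30 cases rather than by averaging the category percentages.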
| Category | Cases | What We Test |
|---|---|---|
| 🏥 Medical | 5 | Drug dosages, interactions, diagnostic priorities, vaccine schedules, lab reference ranges |
| ⚖️ Legal | 5 | Statute of limitations (state-specific), LLC liability, contract enforceability, at-will employment, GDPR vs CCPA |
| 💰 Financial | 5 | 401(k)/IRA contribution limits, capital gains rates, Social Security retirement age, Roth rules, FDIC limits |
| 💻 Technical | 5 | Python EOL dates, API rate limits, AWS service limits, SQL injection prevention, dependency compatibility |
| 🌍 General | 10 | Historical dates, geographic facts, scientific constants, company histories, sports records |
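Each case in the table above pairs a question with a verifiable answer. The shape of a single case might look like the following; the field names here are illustrative assumptions, not the repo's actual schema:

```python
# Illustrative shape of one test case; field names are assumptions,
# not the repository's actual schema.
case = {
    "id": "fin-003",
    "category": "financial",
    "question": "What is the full retirement age for Social Security "
                "benefits for someone born in 1960?",
    "correct_answer": "67",
    "common_hallucination": "65",  # the outdated figure models often repeat
}

# A quick structural check before running a case:
required = {"id", "category", "question", "correct_answer"}
assert required <= case.keys()
print("case", case["id"], "is well-formed")
```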
Financial — Social Security retirement age (fin-003)
Q: What is the full retirement age for Social Security benefits for someone born in 1960?
Correct: 67. Common hallucination: 65 (an outdated figure).

Medical — Drug interaction (med-002)
Q: Is it safe to take ibuprofen and low-dose aspirin together?
Correct: No — ibuprofen reduces aspirin's cardioprotective effect. Common hallucination: "Yes, both are NSAIDs and safe together."
Licensed under MIT — reproduce, extend, and publish your own results.
CertainLogic Research | Published April 2026