30 real hallucination test cases, published April 2026. Run them against any LLM.
30 factual questions across medical, legal, financial, technical, and general knowledge. Each case has a single verifiable correct answer.
These test cases measure confident incorrect responses — the failure mode where an LLM sounds sure but is wrong.
| System | Medical | Legal | Financial | Technical | General | Overall |
|---|---|---|---|---|---|---|
| GPT-4o (bare) | 60% | 80% | 60% | 80% | 90% | 77% |
| Claude 3.5 Sonnet (bare) | 80% | 80% | 60% | 80% | 90% | 80% |
| Llama 3.3 70B (bare) | 60% | 60% | 60% | 80% | 80% | 70% |
| Claude Opus 4 | ~100% | ~100% | ~100% | ~100% | ~100% | ~100% |
| CertainLogic Brain API | 100% | 100% | 100% | 100% | 100% | 100% |
Full results with per-case breakdowns: `results/certainlogic_results.json`
About these results:
- Bare-LLM results (GPT-4o, Claude, Llama): Run via live API calls. Independently reproducible with your own keys.
- CertainLogic Brain API: Proprietary system. Results from April 2026 run. This system is not independently verifiable without API access.
What the test cases measure: Factual correctness on questions with verifiable answers. A system that says "I don't know" scores lower than one that answers correctly. This is one dimension of evaluation, not a comprehensive quality assessment.
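The three outcomes being distinguished (correct, refusal, confident-but-wrong) can be sketched as a simple classifier. This is an illustrative assumption about how scoring might work, not the repo's actual scorer; the function name and refusal phrases are made up for the example:

```python
# Hypothetical scorer: separates correct answers, refusals, and
# confident-but-wrong answers. Illustrative only, not benchmark.py's logic.
REFUSAL_MARKERS = ("i don't know", "i'm not sure", "cannot determine")

def score_response(response: str, correct_answer: str) -> str:
    """Return 'correct', 'refusal', or 'hallucination' (assumed labels)."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refusal"          # scores lower than a correct answer
    if correct_answer.lower() in text:
        return "correct"
    return "hallucination"        # confident and wrong: the measured failure mode

print(score_response("The full retirement age is 67.", "67"))  # correct
print(score_response("I don't know the exact age.", "67"))     # refusal
print(score_response("It's 65 for everyone.", "67"))           # hallucination
```

In practice a substring check like this is too brittle for free-form answers, which is presumably why the benchmark offers an LLM-based evaluator as an auto-scoring option.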
LLMs hallucinate on factual questions in ways that are predictable and dangerous. Models that ace reasoning benchmarks will confidently state wrong retirement ages, unsafe drug combinations, or incorrect capital cities. These are systematic failure modes that appear consistently across providers.
```shell
# 1. Install dependencies
pip install -r requirements.txt

# 2. Run against your model
python benchmark.py --provider openai --model gpt-4o --api-key YOUR_KEY

# 3. Add auto-scoring
python benchmark.py --provider anthropic --model claude-3-5-sonnet-20241022 --api-key YOUR_KEY \
  --evaluator-provider openai --evaluator-model gpt-4o --evaluator-key YOUR_OPENAI_KEY
```

Supported providers: `openai`, `anthropic`, `openrouter`.

Results are saved to `results/<provider>_<model>_<timestamp>.json`. No CertainLogic account required.
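Once a run completes, the per-category numbers in the table above can be recomputed from the results file. A minimal sketch, assuming each result record carries a `category` string and a boolean `passed` field (these field names are assumptions about the JSON schema, not documented guarantees):

```python
from collections import defaultdict

# Hypothetical post-processing of benchmark results. The field names
# ("category", "passed") are assumptions about the results schema.
def per_category_accuracy(cases: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case["category"]] += 1
        passed[case["category"]] += int(case["passed"])
    return {cat: passed[cat] / totals[cat] for cat in totals}

sample = [
    {"id": "fin-003", "category": "financial", "passed": True},
    {"id": "med-002", "category": "medical", "passed": False},
    {"id": "med-001", "category": "medical", "passed": True},
]
print(per_category_accuracy(sample))  # {'financial': 1.0, 'medical': 0.5}
```

Because the categories have different case counts (5 each, except 10 for general), an overall score should be computed over all 30 cases rather than by averaging the category percentages.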
| Category | Cases | What We Test |
|---|---|---|
| 🏥 Medical | 5 | Drug dosages, interactions, diagnostic priorities, vaccine schedules, lab reference ranges |
| ⚖️ Legal | 5 | Statute of limitations (state-specific), LLC liability, contract enforceability, at-will employment, GDPR vs CCPA |
| 💰 Financial | 5 | 401(k)/IRA contribution limits, capital gains rates, Social Security retirement age, Roth rules, FDIC limits |
| 💻 Technical | 5 | Python EOL dates, API rate limits, AWS service limits, SQL injection prevention, dependency compatibility |
| 🌍 General | 10 | Historical dates, geographic facts, scientific constants, company histories, sports records |
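Each case in the table above pairs a question with a verifiable answer. The shape of a single case might look like the following; the field names here are illustrative assumptions, not the repo's actual schema:

```python
# Illustrative shape of one test case; field names are assumptions,
# not the repository's actual schema.
case = {
    "id": "fin-003",
    "category": "financial",
    "question": "What is the full retirement age for Social Security "
                "benefits for someone born in 1960?",
    "correct_answer": "67",
    "common_hallucination": "65",  # the outdated figure models often repeat
}

# A quick structural check before running a case:
required = {"id", "category", "question", "correct_answer"}
assert required <= case.keys()
print("case", case["id"], "is well-formed")
```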
Financial — Social Security retirement age (fin-003)
Q: What is the full retirement age for Social Security benefits for someone born in 1960?
Correct: 67. Common hallucination: 65 (an outdated figure).

Medical — Drug interaction (med-002)
Q: Is it safe to take ibuprofen and low-dose aspirin together?
Correct: No — ibuprofen reduces aspirin's cardioprotective effect. Common hallucination: "Yes, both are NSAIDs and safe together."
Licensed under MIT — reproduce, extend, and publish your own results.
CertainLogic Research | Published April 2026