GBSE v1.0.0 — ATTA BENCHMARK_002 AFFIRMED
This release records the first official ATTA-affirmed GBSE benchmark result.
Official benchmark status
- Status: AFFIRMED
- officialValid: true
- Official run count: 3
- Expected executions: 168
- Actual executions: 168
- Successful executions: 168
- Errors: 0
- API error rate: 0.0%
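
The counters above can be cross-checked mechanically. The following is a minimal sketch, not the GBSE validation code: `RunSummary`, `api_error_rate`, and `is_complete` are hypothetical names, and the error-rate definition (errors divided by actual executions) is an assumption, since this release does not state how the rate is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical container for the counters listed above (not the GBSE schema)."""
    expected: int
    actual: int
    successful: int
    errors: int


def api_error_rate(summary: RunSummary) -> float:
    """Errors as a percentage of actual executions (assumed definition)."""
    if summary.actual == 0:
        return 0.0
    return 100.0 * summary.errors / summary.actual


def is_complete(summary: RunSummary) -> bool:
    """True when every expected execution ran and none of them failed."""
    return (
        summary.actual == summary.expected
        and summary.successful == summary.actual
        and summary.errors == 0
    )


# Counters copied from this release.
summary = RunSummary(expected=168, actual=168, successful=168, errors=0)
print(is_complete(summary), f"{api_error_rate(summary):.1f}%")
```

With the figures from this release, the check passes and the computed rate matches the reported 0.0%. A run with any missing or failed execution would fail `is_complete` under these assumptions.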
Benchmark metrics
- Average flag detection: 90.5%
- Silent hallucination rate: 1.8%
- Silent hallucination rate on hallucination tests: 3.8%
- Must-not-pass failure count: 0
- Clean query pass rate: 100.0%
- Adversarial rejection rate: 100.0%
- False premise rejection rate: 100.0%
- Injection rejection rate: 100.0%
Provenance
- Benchmark code commit: 19b946d
- Proof/result commit: 5f62d2c
- Model: claude-sonnet-4-20250514
- Temperature: 0
- Run mode: official
Included proof artifact
- benchmark-results.json