Scrubbed prediction results for FF-STACK v8 on Humanity's Last Exam — text-only, full 2,158-question set. Submitted to the HLE Leaderboard for Agents with Tools.
Full methodology and writeup: fieldframelabs.ai/posts/hle-methodology
- `predictions/v8_full_textonly_public.json` — per-question results for the full run.
- `verify.py` — reproduces the headline score, per-domain breakdown, and calibration table from the file above.
Each entry is keyed by HLE question ID:
```json
{
  "<question_id>": {
    "domain": "Math",
    "stated_confidence": 85,
    "correct": "yes"
  }
}
```

- `domain` — HLE category.
- `stated_confidence` — the agent's self-reported confidence (0-100), or `null` if none was emitted.
- `correct` — the canonical o3-mini judge's verdict (yes/no), bit-identical to CAIS's `centerforaisafety/hle/hle_eval/run_judge_results.py`.
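To illustrate how the schema is consumed, here is a minimal per-domain tally in Python (not verify.py itself, just a sketch that assumes only the fields documented above):

```python
import json
from collections import Counter

# Load the scrubbed predictions file documented above.
with open("predictions/v8_full_textonly_public.json") as f:
    entries = json.load(f)

# Count questions and correct verdicts per HLE domain (saved entries only).
totals, correct = Counter(), Counter()
for entry in entries.values():
    totals[entry["domain"]] += 1
    correct[entry["domain"]] += entry["correct"] == "yes"

for domain in sorted(totals):
    pct = 100 * correct[domain] / totals[domain]
    print(f"{domain:25s} {correct[domain]:4d} / {totals[domain]:4d}  ({pct:.2f}%)")
```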
This file deliberately contains no model responses, no gold answers, and no internal pipeline telemetry. It is enough to independently reproduce every aggregate number in the methodology paper — and nothing else. The raw predictions file (full model responses) is available on request for reviewer verification.
1,119 correct / 2,158 = 51.85%
A note on the denominator. 2,153 of the 2,158 text-only questions saved successfully; 5 hit deterministic content-filter refusal patterns and were not saved. Those 5 are scored as incorrect — they stay in the 2,158 denominator and contribute zero to the 1,119 numerator. This file therefore has 2,153 entries while the headline score uses the full 2,158 denominator. verify.py prints both.
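In code, the two denominators work out as follows (a minimal sketch rather than the actual verify.py; the constant 2,158 and the expected counts come from the note above):

```python
import json

FULL_TEXT_ONLY = 2158  # full text-only set; includes the 5 unsaved refusals, scored as incorrect

with open("predictions/v8_full_textonly_public.json") as f:
    entries = json.load(f)

n_saved = len(entries)                                            # expected: 2,153
n_correct = sum(e["correct"] == "yes" for e in entries.values())  # expected: 1,119

print(f"saved entries:   {n_correct} / {n_saved} = {100 * n_correct / n_saved:.2f}%")
print(f"headline (2158): {n_correct} / {FULL_TEXT_ONLY} = {100 * n_correct / FULL_TEXT_ONLY:.2f}%")
```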
```bash
python verify.py
```
Prints the score on both denominators, the per-domain breakdown, and the confidence-calibration buckets — all of which should match the methodology paper.
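The calibration buckets can be reproduced in the same spirit. The sketch below bins `stated_confidence` into 10-point buckets; the exact bucket edges are an assumption here for illustration (verify.py remains the authoritative script):

```python
import json
from collections import defaultdict

with open("predictions/v8_full_textonly_public.json") as f:
    entries = json.load(f)

# Map each saved entry to a 10-point confidence bin; 100 folds into the top bin,
# and entries that emitted no confidence (null) are tallied separately.
buckets = defaultdict(lambda: [0, 0])  # bin lower bound -> [n_correct, n_total]
no_conf = [0, 0]
for entry in entries.values():
    conf = entry["stated_confidence"]
    tally = no_conf if conf is None else buckets[min(int(conf) // 10 * 10, 90)]
    tally[0] += entry["correct"] == "yes"
    tally[1] += 1

for lo in sorted(buckets):
    c, t = buckets[lo]
    print(f"confidence {lo:3d}-{min(lo + 10, 100):3d}: {c:4d} / {t:4d} correct ({100 * c / t:5.1f}%)")
if no_conf[1]:
    c, t = no_conf
    print(f"no stated confidence: {c:4d} / {t:4d} correct ({100 * c / t:5.1f}%)")
```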
Scrubbed files for the calibration and holdout sample runs referenced in the methodology paper are being added here as well.
MIT — see LICENSE. Free to use for verification, research, or any other purpose.