Scrubbed prediction results for FF-STACK v8 on Humanity's Last Exam — text-only, full 2,158-question set. Submitted to the HLE Leaderboard for Agents with Tools.
Full methodology and writeup: fieldframelabs.ai/posts/hle-methodology
- `predictions/v8_full_textonly_public.json` — per-question results for the full run.
- `verify.py` — reproduces the headline score, per-domain breakdown, and calibration table from the file above.
Each entry is keyed by HLE question ID:
```json
{
  "<question_id>": {
    "domain": "Math",
    "stated_confidence": 85,
    "correct": "yes"
  }
}
```

- `domain` — HLE category.
- `stated_confidence` — the agent's self-reported confidence (0-100), or `null` if none was emitted.
- `correct` — the canonical o3-mini judge's verdict (yes/no), bit-identical to CAIS's `centerforaisafety/hle/hle_eval/run_judge_results.py`.
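To illustrate how the schema is consumed, here is a minimal per-domain tally in Python (not verify.py itself, just a sketch that assumes only the fields documented above):

```python
import json
from collections import Counter

# Load the scrubbed predictions file documented above.
with open("predictions/v8_full_textonly_public.json") as f:
    entries = json.load(f)

# Count questions and correct verdicts per HLE domain (saved entries only).
totals, correct = Counter(), Counter()
for entry in entries.values():
    totals[entry["domain"]] += 1
    correct[entry["domain"]] += entry["correct"] == "yes"

for domain in sorted(totals):
    pct = 100 * correct[domain] / totals[domain]
    print(f"{domain:25s} {correct[domain]:4d} / {totals[domain]:4d}  ({pct:.2f}%)")
```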
This file deliberately contains no model responses, no gold answers, and no internal pipeline telemetry. It is enough to independently reproduce every aggregate number in the methodology paper — and nothing else. The raw predictions file (full model responses) is available on request for reviewer verification.
1,119 correct / 2,158 = 51.85%
A note on the denominator. 2,153 of the 2,158 text-only questions saved successfully; 5 hit deterministic content-filter refusal patterns and were not saved. Those 5 are scored as incorrect — they stay in the 2,158 denominator and contribute zero to the 1,119 numerator. This file therefore has 2,153 entries while the headline score uses the full 2,158 denominator. verify.py prints both.
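In code, the two denominators work out as follows (a minimal sketch rather than the actual verify.py; the constant 2,158 and the expected counts come from the note above):

```python
import json

FULL_TEXT_ONLY = 2158  # full text-only set; includes the 5 unsaved refusals, scored as incorrect

with open("predictions/v8_full_textonly_public.json") as f:
    entries = json.load(f)

n_saved = len(entries)                                            # expected: 2,153
n_correct = sum(e["correct"] == "yes" for e in entries.values())  # expected: 1,119

print(f"saved entries:   {n_correct} / {n_saved} = {100 * n_correct / n_saved:.2f}%")
print(f"headline (2158): {n_correct} / {FULL_TEXT_ONLY} = {100 * n_correct / FULL_TEXT_ONLY:.2f}%")
```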
```bash
python verify.py
```
Prints the score on both denominators, the per-domain breakdown, and the confidence-calibration buckets — all of which should match the methodology paper.
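The calibration buckets can be reproduced in the same spirit. The sketch below bins `stated_confidence` into 10-point buckets; the exact bucket edges are an assumption here for illustration (verify.py remains the authoritative script):

```python
import json
from collections import defaultdict

with open("predictions/v8_full_textonly_public.json") as f:
    entries = json.load(f)

# Map each saved entry to a 10-point confidence bin; 100 folds into the top bin,
# and entries that emitted no confidence (null) are tallied separately.
buckets = defaultdict(lambda: [0, 0])  # bin lower bound -> [n_correct, n_total]
no_conf = [0, 0]
for entry in entries.values():
    conf = entry["stated_confidence"]
    tally = no_conf if conf is None else buckets[min(int(conf) // 10 * 10, 90)]
    tally[0] += entry["correct"] == "yes"
    tally[1] += 1

for lo in sorted(buckets):
    c, t = buckets[lo]
    print(f"confidence {lo:3d}-{min(lo + 10, 100):3d}: {c:4d} / {t:4d} correct ({100 * c / t:5.1f}%)")
if no_conf[1]:
    c, t = no_conf
    print(f"no stated confidence: {c:4d} / {t:4d} correct ({100 * c / t:5.1f}%)")
```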
Scrubbed files for the calibration and holdout sample runs referenced in the methodology paper are being added here as well.
MIT — see LICENSE. Free to use for verification, research, or any other purpose.