HLE-Text-Run — FF-STACK v8 Predictions (Scrubbed)

Scrubbed prediction results for FF-STACK v8 on Humanity's Last Exam — text-only, full 2,158-question set. Submitted to the HLE Leaderboard for Agents with Tools.

Full methodology and writeup: fieldframelabs.ai/posts/hle-methodology

What's here

  • predictions/v8_full_textonly_public.json — per-question results for the full run.
  • verify.py — reproduces the headline score, per-domain breakdown, and calibration table from the file above.

The file

Each entry is keyed by HLE question ID:

{
  "<question_id>": {
    "domain": "Math",
    "stated_confidence": 85,
    "correct": "yes"
  }
}
  • domain — HLE category.
  • stated_confidence — the agent's self-reported confidence (0-100), or null if none was emitted.
  • correct — the canonical o3-mini judge's verdict (yes / no), bit-identical to CAIS's centerforaisafety/hle/hle_eval/run_judge_results.py.

This file deliberately contains no model responses, no gold answers, and no internal pipeline telemetry. It is enough to independently reproduce every aggregate number in the methodology paper — and nothing else. The raw predictions file (full model responses) is available on request for reviewer verification.

Score

1,119 correct / 2,158 = 51.85%

A note on the denominator. 2,153 of the 2,158 text-only questions saved successfully; 5 hit deterministic content-filter refusal patterns and were not saved. Those 5 are scored as incorrect — they stay in the 2,158 denominator and contribute zero to the 1,119 numerator. This file therefore has 2,153 entries while the headline score uses the full 2,158 denominator. verify.py prints both.
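The scoring rule above can be sketched as follows. This illustrates the arithmetic only, not the verify.py implementation; the sample entries and key names below are hypothetical stand-ins for real HLE question IDs.

```python
import json  # the real file is loaded with json.load; sample data is inlined here

TOTAL_QUESTIONS = 2158   # full text-only set
SAVED_ENTRIES = 2153     # the 5 unsaved refusals count as incorrect

def headline_score(entries, denominator=TOTAL_QUESTIONS):
    """Count judge 'yes' verdicts, scored against a fixed denominator."""
    correct = sum(1 for e in entries.values() if e["correct"] == "yes")
    return correct, 100.0 * correct / denominator

# Hypothetical entries mirroring the file's schema (real keys are HLE IDs).
sample = {
    "q1": {"domain": "Math", "stated_confidence": 85, "correct": "yes"},
    "q2": {"domain": "Physics", "stated_confidence": None, "correct": "no"},
}

correct, pct = headline_score(sample, denominator=len(sample))
```

With the published counts, 1,119 / 2,158 gives the headline 51.85%; over the 2,153 saved entries alone it is 1,119 / 2,153 ≈ 51.97%, which is the second figure verify.py prints.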

Verify it

python verify.py

Prints the score under both denominators, the per-domain breakdown, and the confidence-calibration buckets; all should match the methodology paper.
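One way to build the calibration table can be sketched as follows, assuming the schema above. The bucket width (10 points) and the decisions to skip null confidences and fold 100 into the top bucket are illustrative choices, not necessarily what verify.py does.

```python
from collections import defaultdict

def calibration_buckets(entries, width=10):
    """Accuracy per stated-confidence bucket; null confidences are skipped."""
    tally = defaultdict(lambda: [0, 0])  # bucket lower bound -> [correct, total]
    for e in entries.values():
        conf = e["stated_confidence"]
        if conf is None:
            continue
        lo = min((conf // width) * width, 100 - width)  # fold 100 into top bucket
        tally[lo][1] += 1
        tally[lo][0] += e["correct"] == "yes"
    return {lo: (c, n, 100.0 * c / n) for lo, (c, n) in sorted(tally.items())}
```

For a well-calibrated agent, the per-bucket accuracy should track the bucket's confidence range, which is what the calibration table in the methodology paper is checking.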

Other runs

Scrubbed files for the calibration and holdout sample runs referenced in the methodology paper are being added here as well.

License

MIT — see LICENSE. Free to use for verification, research, or any other purpose.
