Add hybridgym_funclocalize benchmark by GaokaiZhang · Pull Request #640 · OpenHands/benchmarks

GaokaiZhang · 2026-04-05T17:55:01Z

Summary

Add four Hybrid-Gym benchmarks (hybridgym_funclocalize, hybridgym_depsearch, hybridgym_funcgen, hybridgym_issuelocalize) from Hybrid-Gym ("Training Coding Agents to Generalize Across Tasks").
func_localize: Agent locates a function/class by natural-language description and writes its (missing) docstring.
dep_search: Agent analyzes a target function, traces its dependencies, and annotates each with a comment.
func_gen: Agent implements a function body from its signature and docstring (body replaced with TODO stub).
issue_localize: Agent reads a GitHub issue and adds comments at relevant code locations.
The benchmarks use python:3.11-bookworm as the base image and clone the target repo at runtime. With the docker workspace, the agent server image is built on the fly; with the remote workspace, a single static image must be pre-built per benchmark (see below).
func_gen evaluation additionally requires the yiqingxyq/repost:v0 Docker image for running RepoST test scripts.
Datasets on HuggingFace: hybrid-gym/hybrid_gym_func_localize, hybrid-gym/hybrid_gym_dep_search, hybrid-gym/hybrid_gym_func_gen, SWE-Gym/SWE-Gym-Raw.
Follows repo conventions: tool preset support, delegation support, LaminarService integration, standard Evaluation subclass pattern, module-level get_default_tools import.
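
To make the func_gen setup concrete, here is a minimal sketch of what "body replaced with TODO stub" could look like. It is not the Hybrid-Gym dataset pipeline; the helper name `stub_function_body` and the exact stub statement are assumptions made for illustration only.

```python
import ast


def stub_function_body(source: str, func_name: str) -> str:
    """Return `source` with the body of `func_name` replaced by a TODO stub.

    Hypothetical helper: keeps the signature and docstring, drops the rest.
    """
    tree = ast.parse(source)
    # Assumed stub statement; the real benchmark may use a different placeholder.
    stub = ast.parse('raise NotImplementedError("TODO: implement")').body[0]
    for node in ast.walk(tree):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name == func_name:
            new_body = []
            if ast.get_docstring(node) is not None:
                new_body.append(node.body[0])  # keep the docstring statement
            new_body.append(stub)
            node.body = new_body
            break
    else:
        raise ValueError(f"function {func_name!r} not found")
    # ast.unparse requires Python 3.9+; comments and formatting are not preserved.
    return ast.unparse(ast.fix_missing_locations(tree))


ORIGINAL = '''
def add(a, b):
    """Add two numbers."""
    return a + b
'''

STUBBED = stub_function_body(ORIGINAL, "add")
print(STUBBED)
```

The agent's task would then be to restore a working body from the signature and docstring alone.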

Remote workspace image

Each benchmark needs a single static agent server image (unlike swebench, which needs per-instance images). After merge, build and push once per benchmark:

for BENCH in hybridgym-funclocalize hybridgym-depsearch hybridgym-funcgen hybridgym-issuelocalize; do
  IMAGE_TAG_PREFIX=<prefix> uv run python -c "
from benchmarks.utils.build_utils import build_image
build_image(base_image='python:3.11-bookworm', target_image='ghcr.io/openhands/eval-agent-server',
custom_tag='${BENCH}', target='binary', push=True)
"
done

Until then, users can build to their own registry (see each benchmark's README).
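
The same loop can be sketched in Python rather than shell. The keyword arguments below are copied from the command above; the wrapper `build_all` and the `dry_run` stand-in are hypothetical, and the real `build_image` is passed in as a parameter so the sketch runs outside the benchmarks repo. `IMAGE_TAG_PREFIX` would still need to be set in the environment, as in the shell version.

```python
from typing import Any, Callable

# Tag names taken from the shell loop above.
BENCHMARK_TAGS = [
    "hybridgym-funclocalize",
    "hybridgym-depsearch",
    "hybridgym-funcgen",
    "hybridgym-issuelocalize",
]


def build_all(build_image: Callable[..., Any]) -> list[str]:
    """Call `build_image` once per benchmark tag and return the tags built."""
    built = []
    for tag in BENCHMARK_TAGS:
        build_image(
            base_image="python:3.11-bookworm",
            target_image="ghcr.io/openhands/eval-agent-server",
            custom_tag=tag,
            target="binary",
            push=True,
        )
        built.append(tag)
    return built


def dry_run(**kwargs: Any) -> None:
    """Stand-in for the real build function: only prints what would be built."""
    print(f"would build {kwargs['target_image']} with tag {kwargs['custom_tag']}")


if __name__ == "__main__":
    # Inside the benchmarks repo, pass the real function instead:
    #   from benchmarks.utils.build_utils import build_image
    #   build_all(build_image)
    build_all(dry_run)
```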

Test plan

All pre-commit checks pass (ruff format, ruff lint, pyright strict) -- 0 errors across all 4 benchmarks
Full test suite passes: 337/337 tests including new hybridgym parametrized metrics tests
Smoke test passed (docker workspace, 1 instance): func_localize resolved 1/1
Remote workspace smoke test: pipeline runs end-to-end (infer + eval)
Imports verified clean (run_infer and eval_infer for all 4 benchmarks)
Follows repo conventions: tool preset support, delegation support, LaminarService integration, standard Evaluation subclass pattern
test_metrics.py updated with hybridgym-specific test instances and metadata

juanmichelini · 2026-04-08T21:37:32Z

@GaokaiZhang thank you! Could you fix the pre-commit checks?
I'll run some tests and come back to you.

GaokaiZhang · 2026-04-09T04:47:38Z

Hi, I have fixed the pre-commit checks and also pushed the other three Hybrid-Gym tasks. Thanks for the heads-up.

juanmichelini · 2026-04-09T21:25:07Z

✅ Minimal Testing Complete

Successfully tested PR #640 with 1 instance using Docker workspace.

Test Results

Total instances: 1
Resolved: 1/1 (100%)
Errors: 0
Cost: $0.38 (Claude Sonnet 4.5)
Duration: 2m 23s

Validation ✅

✅ Dataset loading from HuggingFace works
✅ Docker workspace setup works
✅ Repository cloning and checkout works
✅ Agent successfully located target class by description
✅ Generated comprehensive docstring (28 lines)
✅ Evaluation criteria validated:
- Target docstring edited: 1/1
- Comments only (no code changes): 1/1
✅ Output format and report generation work
✅ CLI integration works correctly
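
For readers wondering how the two criteria above might be checked, here is a minimal sketch. It is not the benchmark's evaluation code; the function names `only_docs_changed` and `target_docstring_edited` are hypothetical. It assumes that "comments only" means the parsed code is identical once docstrings are ignored, which works because comments never appear in the Python AST.

```python
import ast
from typing import Optional

_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
_NAMED = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def _strip_docstrings(tree: ast.AST) -> ast.AST:
    """Remove docstring statements so that only executable code remains."""
    for node in ast.walk(tree):
        if isinstance(node, _DOC_OWNERS) and ast.get_docstring(node) is not None:
            # Keep the body non-empty if the docstring was its only statement.
            node.body = node.body[1:] or [ast.Pass()]
    return tree


def only_docs_changed(before: str, after: str) -> bool:
    """True if `after` differs from `before` only in comments or docstrings."""
    # ast.dump omits line numbers by default, so added lines do not matter.
    old = ast.dump(_strip_docstrings(ast.parse(before)))
    new = ast.dump(_strip_docstrings(ast.parse(after)))
    return old == new


def _docstring_of(source: str, name: str) -> Optional[str]:
    """Docstring of the first class or function called `name`, if any."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _NAMED) and node.name == name:
            return ast.get_docstring(node)
    return None


def target_docstring_edited(before: str, after: str, name: str) -> bool:
    """True if `name` has a docstring in `after` that differs from `before`."""
    new_doc = _docstring_of(after, name)
    return new_doc is not None and new_doc != _docstring_of(before, name)


BEFORE = "class Foo:\n    def bar(self, x):\n        return x + 1\n"
AFTER = (
    'class Foo:\n    """Hold a bar helper."""\n'
    "    # explanatory comment\n"
    "    def bar(self, x):\n        return x + 1\n"
)

print(only_docs_changed(BEFORE, AFTER))
print(target_docstring_edited(BEFORE, AFTER, "Foo"))
```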

Test Command

# Inference
uv run hybridgym-funclocalize-infer .llm_config/sonnet-4-5.json \
    --workspace docker \
    --n-limit 1 \
    --note "PR640_minimal_test"

# Evaluation
uv run hybridgym-funclocalize-eval ./eval_outputs/.../output.jsonl \
    --run-id PR640_minimal_test

Report

{
  "total_instances": 1,
  "completed_instances": 1,
  "resolved_instances": 1,
  "unresolved_instances": 0,
  "error_instances": 0,
  "empty_patch_instances": 0
}
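
As a quick sanity check, the resolve rate quoted under Test Results can be recomputed from this report. This is a small sketch that assumes the field names mean what they say; `resolve_rate` is a hypothetical helper, not part of the benchmark tooling.

```python
import json

# The report exactly as posted above.
REPORT = """
{
  "total_instances": 1,
  "completed_instances": 1,
  "resolved_instances": 1,
  "unresolved_instances": 0,
  "error_instances": 0,
  "empty_patch_instances": 0
}
"""


def resolve_rate(report: dict) -> float:
    """Fraction of instances resolved; 0.0 when the report is empty."""
    total = report["total_instances"]
    return report["resolved_instances"] / total if total else 0.0


report = json.loads(REPORT)
resolved = report["resolved_instances"]
total = report["total_instances"]
print(f"Resolved: {resolved}/{total} ({resolve_rate(report):.0%})")
```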

The benchmark is ready for merge! 🚀

juanmichelini

LGTM thank you!

enyst requested review from juanmichelini and neubig April 8, 2026 15:15

juanmichelini removed the request for review from neubig April 8, 2026 21:40

Add four Hybrid-Gym benchmarks (funclocalize, depsearch, funcgen, issuelocalize)

0b17d77

GaokaiZhang force-pushed the main branch from 9dffcbd to 0b17d77 April 9, 2026 04:44

juanmichelini enabled auto-merge (squash) April 9, 2026 21:30

juanmichelini approved these changes Apr 9, 2026

juanmichelini merged commit 32b3f85 into OpenHands:main Apr 9, 2026
2 checks passed

juanmichelini merged 1 commit into OpenHands:main from GaokaiZhang:main
