Add hybridgym_funclocalize benchmark #640

Merged
juanmichelini merged 1 commit into OpenHands:main from GaokaiZhang:main
Apr 9, 2026

Conversation

@GaokaiZhang commented Apr 5, 2026

Contributor

Summary

  • Add four benchmarks (hybridgym_funclocalize, hybridgym_depsearch, hybridgym_funcgen, hybridgym_issuelocalize) from Hybrid-Gym ("Training Coding Agents to Generalize Across Tasks").
  • func_localize: Agent locates a function/class by natural-language description and writes its (missing) docstring.
  • dep_search: Agent analyzes a target function, traces its dependencies, and annotates each with a comment.
  • func_gen: Agent implements a function body from its signature and docstring (body replaced with TODO stub).
  • issue_localize: Agent reads a GitHub issue and adds comments at relevant code locations.
  • The benchmarks use python:3.11-bookworm as the base image and clone the target repo at runtime. For the docker workspace, the agent server image is built on the fly; for the remote workspace, a single static image must be pre-built per benchmark (see below).
  • func_gen evaluation additionally requires the yiqingxyq/repost:v0 Docker image for running RepoST test scripts.
  • Datasets on HuggingFace: hybrid-gym/hybrid_gym_func_localize, hybrid-gym/hybrid_gym_dep_search, hybrid-gym/hybrid_gym_func_gen, SWE-Gym/SWE-Gym-Raw.
  • Follows repo conventions: tool preset support, delegation support, LaminarService integration, standard Evaluation subclass pattern, module-level get_default_tools import.
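To make the func_gen setup concrete, here is a minimal sketch of how a function body could be replaced with a TODO stub while keeping its signature and docstring. It is an illustration only, not the dataset's actual preprocessing: the helper name `stub_function_body` and the `NotImplementedError` placeholder are assumptions, and `ast.unparse` (Python 3.9+) drops comments and normalizes formatting.

```python
import ast


def stub_function_body(source: str, func_name: str) -> str:
    """Return `source` with the body of `func_name` replaced by a TODO stub.

    The signature and the docstring (if present) are kept; the rest of the
    body is dropped. Hypothetical helper, not part of the benchmark code.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name == func_name:
            new_body = []
            if ast.get_docstring(node) is not None:
                # The docstring is always the first statement of the body.
                new_body.append(node.body[0])
            # Assumed placeholder; the real stub text may look different.
            stub = ast.parse('raise NotImplementedError("TODO: implement")')
            new_body.append(stub.body[0])
            node.body = new_body
            break
    else:
        raise ValueError(f"function {func_name!r} not found")
    return ast.unparse(ast.fix_missing_locations(tree))


if __name__ == "__main__":
    original = (
        "def add(a, b):\n"
        '    """Add two numbers."""\n'
        "    return a + b\n"
    )
    print(stub_function_body(original, "add"))
```

The agent's task is then the inverse: given only the signature and docstring, restore a body that passes the RepoST test scripts.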

Remote workspace image

Each benchmark needs a single static agent server image (unlike swebench, which needs per-instance images). After merge, build and push once per benchmark:

for BENCH in hybridgym-funclocalize hybridgym-depsearch hybridgym-funcgen hybridgym-issuelocalize; do
  IMAGE_TAG_PREFIX=<prefix> uv run python -c "
from benchmarks.utils.build_utils import build_image
build_image(base_image='python:3.11-bookworm', target_image='ghcr.io/openhands/eval-agent-server',
custom_tag='${BENCH}', target='binary', push=True)
"
done

Until then, users can build to their own registry (see each benchmark's README).

Test plan

  • All pre-commit checks pass (ruff format, ruff lint, pyright strict) -- 0 errors across all 4 benchmarks
  • Full test suite passes: 337/337 tests including new hybridgym parametrized metrics tests
  • Smoke test passed (docker workspace, 1 instance): func_localize resolved 1/1
  • Remote workspace smoke test: pipeline runs end-to-end (infer + eval)
  • Imports verified clean (run_infer and eval_infer for all 4 benchmarks)
  • Follows repo conventions: tool preset support, delegation support, LaminarService integration, standard Evaluation subclass pattern
  • test_metrics.py updated with hybridgym-specific test instances and metadata

@enyst requested review from juanmichelini and neubig April 8, 2026 15:15
@juanmichelini

Collaborator

@GaokaiZhang thank you! Could you fix the pre-commit checks?
I'll run some tests and get back to you.

@juanmichelini removed the request for review from neubig April 8, 2026 21:40
@GaokaiZhang

Contributor Author

Hi, I have fixed the pre-commit checks and also pushed the other three Hybrid-Gym tasks. Thanks for the notification.

@juanmichelini

Collaborator

✅ Minimal Testing Complete

Successfully tested PR #640 with 1 instance using Docker workspace.

Test Results

  • Total instances: 1
  • Resolved: 1/1 (100%)
  • Errors: 0
  • Cost: $0.38 (Claude Sonnet 4.5)
  • Duration: 2m 23s

Validation ✅

  • ✅ Dataset loading from HuggingFace works
  • ✅ Docker workspace setup works
  • ✅ Repository cloning and checkout works
  • ✅ Agent successfully located target class by description
  • ✅ Generated comprehensive docstring (28 lines)
  • ✅ Evaluation criteria validated:
    • Target docstring edited: 1/1
    • Comments only (no code changes): 1/1
  • ✅ Output format and report generation work
  • ✅ CLI integration works correctly
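The two evaluation criteria listed above ("target docstring edited" and "comments only, no code changes") can be approximated with Python's `ast` module. The following is a rough sketch under stated assumptions, not the benchmark's actual evaluator: the function names are hypothetical, and the check treats two versions of a file as equivalent when their syntax trees match after docstrings are removed (comments never appear in the tree, so they are ignored automatically).

```python
import ast
from typing import Optional

_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def _strip_docstrings(tree: ast.AST) -> ast.AST:
    """Remove the leading docstring statement from every module/class/function."""
    for node in ast.walk(tree):
        if isinstance(node, _DOC_OWNERS) and node.body:
            first = node.body[0]
            if (
                isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)
            ):
                node.body = node.body[1:]
    return tree


def is_docstring_only_change(before: str, after: str) -> bool:
    """True if `after` differs from `before` only in comments and docstrings."""
    old = ast.dump(_strip_docstrings(ast.parse(before)))
    new = ast.dump(_strip_docstrings(ast.parse(after)))
    return old == new


def docstring_of(source: str, name: str) -> Optional[str]:
    """Return the docstring of the first class or function called `name`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DOC_OWNERS[1:]) and node.name == name:
            return ast.get_docstring(node)
    return None


if __name__ == "__main__":
    before = "def area(r):\n    return 3.14 * r * r\n"
    after = (
        "def area(r):\n"
        '    """Return the area of a circle."""\n'
        "    return 3.14 * r * r\n"
    )
    edited = docstring_of(before, "area") != docstring_of(after, "area")
    print("target docstring edited:", edited)
    print("comments only:", is_docstring_only_change(before, after))
```

One known limitation of this sketch: a body that consisted only of `pass` and is replaced by a docstring would be reported as a code change, since the stripped bodies differ.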

Test Command

# Inference
uv run hybridgym-funclocalize-infer .llm_config/sonnet-4-5.json \
    --workspace docker \
    --n-limit 1 \
    --note "PR640_minimal_test"

# Evaluation
uv run hybridgym-funclocalize-eval ./eval_outputs/.../output.jsonl \
    --run-id PR640_minimal_test

Report

{
  "total_instances": 1,
  "completed_instances": 1,
  "resolved_instances": 1,
  "unresolved_instances": 0,
  "error_instances": 0,
  "empty_patch_instances": 0
}
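For reference, the "Resolved: 1/1 (100%)" figure follows directly from the report fields. A small sketch, assuming the report is the JSON object shown above and that the resolve rate is defined as `resolved_instances / total_instances` (the helper name is hypothetical):

```python
import json

# The report from this test run, copied verbatim from above.
REPORT_JSON = """
{
  "total_instances": 1,
  "completed_instances": 1,
  "resolved_instances": 1,
  "unresolved_instances": 0,
  "error_instances": 0,
  "empty_patch_instances": 0
}
"""


def resolve_rate(report: dict) -> float:
    """Fraction of instances resolved; 0.0 when the run had no instances."""
    total = report["total_instances"]
    return report["resolved_instances"] / total if total else 0.0


if __name__ == "__main__":
    report = json.loads(REPORT_JSON)
    print(f"Resolved: {report['resolved_instances']}/{report['total_instances']} "
          f"({resolve_rate(report):.0%})")
```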

The benchmark is ready for merge! 🚀

@juanmichelini enabled auto-merge (squash) April 9, 2026 21:30

@juanmichelini left a comment

Collaborator

LGTM, thank you!

@juanmichelini merged commit 32b3f85 into OpenHands:main Apr 9, 2026
2 checks passed