Skip to content

Conversation

@juanmichelini
Copy link
Collaborator

@juanmichelini juanmichelini commented Nov 19, 2025

Fixes OpenHands/benchmarks#78


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.12-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:0acc018-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-0acc018-python \
  ghcr.io/openhands/agent-server:0acc018-python

All tags pushed for this build

ghcr.io/openhands/agent-server:0acc018-golang-amd64
ghcr.io/openhands/agent-server:0acc018-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:0acc018-golang-arm64
ghcr.io/openhands/agent-server:0acc018-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:0acc018-java-amd64
ghcr.io/openhands/agent-server:0acc018-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:0acc018-java-arm64
ghcr.io/openhands/agent-server:0acc018-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:0acc018-python-amd64
ghcr.io/openhands/agent-server:0acc018-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:0acc018-python-arm64
ghcr.io/openhands/agent-server:0acc018-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:0acc018-golang
ghcr.io/openhands/agent-server:0acc018-java
ghcr.io/openhands/agent-server:0acc018-python

About Multi-Architecture Support

  • Each variant tag (e.g., 0acc018-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 0acc018-python-amd64) are also available if needed

or (message.thinking_blocks and len(message.thinking_blocks) > 0)
)
on_event(msg_event)
if has_reasoning:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could also test for plain content? If the LLM just talks, is that a "finished" case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think actually in V0 for benchmarks we used to consider “llm just talks to the user” (has_content) as a non-terminal step, and we were sending it an automatic fake user message to prod it to continue.

So the agent wasn’t finished.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, yes. Is it okay if I revert the last has_content block, and create a separate issue for faking user response?
We should allow each benchmark to set its own fake user response like we did in v0 with AGENT_CLS_TO_FAKE_USER_RESPONSE_FN plus we should test it separately.

@juanmichelini juanmichelini requested a review from enyst November 20, 2025 17:56
@juanmichelini juanmichelini requested a review from neubig November 20, 2025 18:05
@openhands-ai
Copy link

openhands-ai bot commented Nov 20, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1207 at branch `jmj/codex-empty-patches-fix`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

@juanmichelini
Copy link
Collaborator Author

@OpenHands fix precommit errors Ruff format..............................................................Failed

  • hook id: ruff-format
  • files were modified by this hook

268 files left unchanged
1 file reformatted, 266 files left unchanged

Ruff lint................................................................Failed

  • hook id: ruff-check
  • exit code: 1

All checks passed!
E501 Line too long (96 > 88)
--> openhands-sdk/openhands/sdk/llm/message.py:464:89
|
462 | # Include prior turn's reasoning item exactly as received (if any)
463 | # Note: OpenAI Responses API requires reasoning items to be followed by
464 | # either a message or tool_call item. Only include if we have content or tool_calls.
| ^^^^^^^^
465 | # Reasoning item must come BEFORE message/tool_calls so there's something following it.
466 | if self.responses_reasoning_item is not None and (
|

E501 Line too long (99 > 88)
--> openhands-sdk/openhands/sdk/llm/message.py:465:89
|
463 | # Note: OpenAI Responses API requires reasoning items to be followed by
464 | # either a message or tool_call item. Only include if we have content or tool_calls.
465 | # Reasoning item must come BEFORE message/tool_calls so there's something following it.
| ^^^^^^^^^^^
466 | if self.responses_reasoning_item is not None and (
467 | content_items or self.tool_calls
|

Found 2 errors.
and also FAILED tests/sdk/llm/test_responses_serialization.py::test_assistant_includes_reasoning_passthrough

@openhands-ai
Copy link

openhands-ai bot commented Nov 20, 2025

I'm on it! juanmichelini can track my progress at all-hands.dev

- Fix E501 line too long errors by breaking long comments into multiple lines
- Remove overly restrictive condition for including reasoning items
- Allow reasoning items to be included even when there's no content or tool calls
- This fixes the failing test_assistant_includes_reasoning_passthrough test
- Maintains proper ordering with reasoning items before message/tool_calls

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Nov 20, 2025

I have successfully fixed the precommit errors and failing test as requested. The changes include:

  1. Fixed E501 line length errors by breaking long comments into multiple lines
  2. Fixed the failing test by removing an overly restrictive condition that prevented reasoning items from being included when there's no content or tool calls

All precommit checks now pass, the failing test passes, and the changes have been committed and pushed to the branch. The PR should now pass all CI checks.

View full conversation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GPT-5 codex ends conversation early

4 participants