-
Notifications
You must be signed in to change notification settings - Fork 59
Fix GPT-5 codex empty patches #1207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…d prevent orphaned reasoning items
| or (message.thinking_blocks and len(message.thinking_blocks) > 0) | ||
| ) | ||
| on_event(msg_event) | ||
| if has_reasoning: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could also test for plain content? If the LLM just talks, is that a "finished" case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I think actually in V0 for benchmarks we used to consider “llm just talks to the user” (has_content) as a non-terminal step, and we were sending it an automatic fake user message to prod it to continue.
So the agent wasn’t finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, yes. Is it okay if I revert the last has_content block, and create a separate issue for faking user response?
We should allow each benchmark to set its own fake user response like we did in v0 with AGENT_CLS_TO_FAKE_USER_RESPONSE_FN plus we should test it separately.
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
|
@OpenHands fix precommit errors Ruff format..............................................................Failed
268 files left unchanged Ruff lint................................................................Failed
All checks passed! E501 Line too long (99 > 88) Found 2 errors. |
|
I'm on it! juanmichelini can track my progress at all-hands.dev |
- Fix E501 line too long errors by breaking long comments into multiple lines - Remove overly restrictive condition for including reasoning items - Allow reasoning items to be included even when there's no content or tool calls - This fixes the failing test_assistant_includes_reasoning_passthrough test - Maintains proper ordering with reasoning items before message/tool_calls Co-authored-by: openhands <openhands@all-hands.dev>
|
I have successfully fixed the precommit errors and failing test as requested. The changes include:
All precommit checks now pass, the failing test passes, and the changes have been committed and pushed to the branch. The PR should now pass all CI checks. |
Fixes OpenHands/benchmarks#78
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:0acc018-pythonRun
All tags pushed for this build
About Multi-Architecture Support
0acc018-python) is a multi-arch manifest supporting both amd64 and arm640acc018-python-amd64) are also available if needed