
Conversation

xingyaoww (Collaborator) commented Dec 1, 2025

Problem

The agent was frequently creating markdown files (README.md, CHANGES.md, NOTES.md, etc.) during its work to document changes for users, even when not explicitly requested. This created unnecessary files that needed to be cleaned up.

Root Cause

The system prompt had ambiguous language that could be interpreted as encouraging documentation file creation:

  • TROUBLESHOOTING section said "Document your reasoning process" (could be interpreted as "create a document")
  • DOCUMENTATION section mentioned "If you need to create documentation files for reference" as an option, which made it seem acceptable

Solution

HUMAN: I removed the DOCUMENTATION section entirely since it no longer seems necessary.

Impact

The default behavior is now that the agent should NOT write markdown files at all unless the user explicitly requests them. All explanations should be provided in conversation responses instead.

Testing

No code changes - only prompt modifications. Testing will be done through agent interactions to verify markdown files are no longer created automatically.
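
As a rough way to verify this (not part of this PR; the workspace path and the agent invocation are placeholders), one can snapshot the markdown files in a workspace before a run and diff afterwards:

# Sketch: record the markdown files present before the agent run
find /workspace/project -name '*.md' | sort > /tmp/md_before.txt
# ... run the agent task as usual ...
find /workspace/project -name '*.md' | sort > /tmp/md_after.txt
# Prints only paths that exist after the run but not before,
# i.e. markdown files the agent created during the task
comm -13 /tmp/md_before.txt /tmp/md_after.txt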


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures    Base Image                                    Docs / Tags
java      amd64, arm64     eclipse-temurin:17-jdk                        Link
python    amd64, arm64     nikolaik/python-nodejs:python3.12-nodejs22    Link
golang    amd64, arm64     golang:1.21-bookworm                          Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:21a4a2b-python
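
If you need to pin a single architecture rather than rely on the multi-arch manifest (for example, on a host where you want an explicit arm64 image), the per-arch tags listed under "All tags pushed for this build" below can be pulled directly, e.g.:

# Pull only the arm64 image for the python variant
docker pull ghcr.io/openhands/agent-server:21a4a2b-python-arm64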

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-21a4a2b-python \
  ghcr.io/openhands/agent-server:21a4a2b-python

All tags pushed for this build

ghcr.io/openhands/agent-server:21a4a2b-golang-amd64
ghcr.io/openhands/agent-server:21a4a2b-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:21a4a2b-golang-arm64
ghcr.io/openhands/agent-server:21a4a2b-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:21a4a2b-java-amd64
ghcr.io/openhands/agent-server:21a4a2b-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:21a4a2b-java-arm64
ghcr.io/openhands/agent-server:21a4a2b-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:21a4a2b-python-amd64
ghcr.io/openhands/agent-server:21a4a2b-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:21a4a2b-python-arm64
ghcr.io/openhands/agent-server:21a4a2b-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:21a4a2b-golang
ghcr.io/openhands/agent-server:21a4a2b-java
ghcr.io/openhands/agent-server:21a4a2b-python

About Multi-Architecture Support

  • Each variant tag (e.g., 21a4a2b-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 21a4a2b-python-amd64) are also available if needed

- Changed TROUBLESHOOTING section to say 'Explain your reasoning process' instead of 'Document your reasoning process'
- Made DOCUMENTATION section much more explicit about NOT creating markdown files
- Added clear instruction: Do NOT create README.md, CHANGES.md, NOTES.md, or any other documentation files unless explicitly requested
- Emphasized that explanations should ALWAYS be in conversation responses, not separate files

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww xingyaoww marked this pull request as ready for review December 1, 2025 20:03
@xingyaoww xingyaoww added the run-eval-50 (Runs evaluation on 50 SWE-bench instances) label Dec 1, 2025
github-actions bot (Contributor) commented Dec 1, 2025

Evaluation Triggered

enyst (Collaborator) commented Dec 1, 2025

Just a thought about this PR, not a review. I’d love your thoughts 😅

I should note, for the record, that this is LLM-specific: it looks like a very Sonnet thing; it must have been trained to do it. Sorry, Xingyao, I know you may feel context engineering is not really taking us far 😅 and maybe you’ll ultimately be proven right, but I just feel we don’t get the best experience if we apply a blanket prompt to all LLMs either.

I’m pretty sure there is no need for this for Gemini 2.5 or GPT-5 all variants, or a number of older models like R1.

The important part, it seems to me, is that this is exactly the kind of thing we have needed to adjust system prompts for over the past year and a half, and those adjustments are LLM-specific. The majority, as far as I recall, were tweaks for something that a particular LLM kept doing or not doing.

(There is a reasonable argument to be made that the particular phrases in this PR may not hurt other models, and that’s totally possible, but idk, it doesn’t seem completely obvious: for example, according to OpenAI docs, adding instructions for the Codex variants to talk to the user may hurt performance or make them stop/finish early. Isn’t this about talking to the user? “ALWAYS include explanations in your conversation responses” does seem like a similar topic.)

TBH this is one reason why I think we may need to add dedicated prompts for a few SOTA families. I think it may prove easier for us:

  • to focus on LLM (family) (3-4 families like gemini, gpt-5),
  • eval the whole (tools + prompts) per model,
  • and then we don’t need to re-eval on other models for every Sonnet adjustment, because we don’t make these adjustments in the other 2-3 files. WDYT?

xingyaoww (Collaborator, Author) replied, quoting the suggestion above:

  • to focus on LLM (family) (3-4 families like gemini, gpt-5),
  • eval the whole (tools + prompts) per model,
  • and then we don’t need to re-eval on other models for every Sonnet adjustment, because we don’t make these adjustments in the other 2-3 files. WDYT?

One thing I'm worried about is the fairness of evals: there would be an additional "prompts" variable when we evaluate and compare different models, so when one model performs worse than another it is hard to tell whether that is due to model capability or prompt optimization. This makes the OH evaluation numbers less trustworthy, since they are no longer easily comparable.

And especially since we don't currently have a systematic way to optimize the system prompt for any given model, maintaining separate files worries me: those files can easily get out of sync, and fixes made for one model don't propagate to the others.

IMO, we should only separate system prompts out when (1) we have enough manpower to track and maintain system prompts for different models, so we are sure each change is properly evaluated, OR (2) we have an automated system that optimizes system prompts based on a list of evaluation instances (real-world tasks, with an LLM-as-judge to monitor agent behavior).

Also, on the other hand, the system prompt describes the expected behavior of agents, which I think is valuable to keep consistent across models (although a lot of the time we don't need these prompts for other models like GPT-5 / Codex), and I'd be happy to revert the relevant parts of this PR that may hurt GPT-5 performance.

xingyaoww (Collaborator, Author) commented Dec 2, 2025

it looks like a very Sonnet thing, it must have been trained to do it.

Not really; claude-code doesn't do that often. I suspect it is the "DOCUMENTATION BLOCK" we have in our system prompt. The other alternative would be to remove the documentation block completely to simplify things; it was there to inhibit some Sonnet 4 behavior.

The DOCUMENTATION block is redundant since FILE_SYSTEM_GUIDELINES already contains the guidance about not creating documentation files. Removing this block simplifies the prompt while maintaining the same behavior.

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww xingyaoww added and removed the run-eval-50 (Runs evaluation on 50 SWE-bench instances) label Dec 2, 2025
github-actions bot (Contributor) commented Dec 2, 2025

Evaluation Triggered

all-hands-bot (Collaborator) commented:

🎉 Evaluation Job Completed

Evaluation Name: sdk-main-19866551464-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: 40edd910333a97a7a32977eafbd4570ba8bfd690
Timestamp: 2025-12-02 18:31:40 UTC

Results Summary

  • Total instances: 500
  • Submitted instances: 50
  • Resolved instances: 30
  • Unresolved instances: 18
  • Empty patch instances: 0
  • Error instances: 2
  • Success rate: 30/50 (60.0%)

View Metadata | View Results | Download Full Results

xingyaoww (Collaborator, Author) commented Dec 3, 2025

Actually, I re-ran the patch eval locally and it gives 39/50, which is comparable to or better than @ryanhoangt's number here (35/50):

#419 (comment)

I think this PR is ready for review and merge.

Cleaning cached images...
Removed 0 images.
Total instances: 500
Instances submitted: 50
Instances completed: 50
Instances incomplete: 450
Instances resolved: 39
Instances unresolved: 11
Instances with empty patches: 0
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 500
Report written to openhands.eval_output.swebench.json

hieptl (Contributor) left a comment:

Thank you! 🙏

@xingyaoww xingyaoww enabled auto-merge (squash) December 3, 2025 16:22
@xingyaoww xingyaoww merged commit 5825094 into main Dec 3, 2025
16 checks passed
@xingyaoww xingyaoww deleted the fix-prevent-markdown-file-creation branch December 3, 2025 16:23