
Conversation

xingyaoww (Collaborator) commented Dec 1, 2025

Problem

The agent was frequently creating markdown files (README.md, CHANGES.md, NOTES.md, etc.) during its work to document changes for users, even when not explicitly requested. This created unnecessary files that needed to be cleaned up.

Root Cause

The system prompt had ambiguous language that could be interpreted as encouraging documentation file creation:

  • TROUBLESHOOTING section said "Document your reasoning process" (could be interpreted as "create a document")
  • DOCUMENTATION section mentioned "If you need to create documentation files for reference" as an option, which made it seem acceptable

Solution

HUMAN: I removed the DOCUMENTATION section entirely since it no longer seems necessary.

Impact

The default behavior is now that the agent should NOT write markdown files at all unless the user explicitly requests them. All explanations should be provided in conversation responses instead.

Testing

No code changes - only prompt modifications. Testing will be done through agent interactions to verify markdown files are no longer created automatically.
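
As a rough way to verify this (not part of this PR; the workspace path and the agent invocation are placeholders), one can snapshot the markdown files in a workspace before a run and diff afterwards:

# Sketch: record the markdown files present before the agent run
find /workspace/project -name '*.md' | sort > /tmp/md_before.txt
# ... run the agent task as usual ...
find /workspace/project -name '*.md' | sort > /tmp/md_after.txt
# Prints only paths that exist after the run but not before,
# i.e. markdown files the agent created during the task
comm -13 /tmp/md_before.txt /tmp/md_after.txt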


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures    Base Image                                    Docs / Tags
java      amd64, arm64     eclipse-temurin:17-jdk                        Link
python    amd64, arm64     nikolaik/python-nodejs:python3.12-nodejs22    Link
golang    amd64, arm64     golang:1.21-bookworm                          Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:21a4a2b-python
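
If you need to pin a single architecture rather than rely on the multi-arch manifest (for example, on a host where you want an explicit arm64 image), the per-arch tags listed under "All tags pushed for this build" below can be pulled directly, e.g.:

# Pull only the arm64 image for the python variant
docker pull ghcr.io/openhands/agent-server:21a4a2b-python-arm64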

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-21a4a2b-python \
  ghcr.io/openhands/agent-server:21a4a2b-python

All tags pushed for this build

ghcr.io/openhands/agent-server:21a4a2b-golang-amd64
ghcr.io/openhands/agent-server:21a4a2b-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:21a4a2b-golang-arm64
ghcr.io/openhands/agent-server:21a4a2b-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:21a4a2b-java-amd64
ghcr.io/openhands/agent-server:21a4a2b-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:21a4a2b-java-arm64
ghcr.io/openhands/agent-server:21a4a2b-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:21a4a2b-python-amd64
ghcr.io/openhands/agent-server:21a4a2b-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:21a4a2b-python-arm64
ghcr.io/openhands/agent-server:21a4a2b-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:21a4a2b-golang
ghcr.io/openhands/agent-server:21a4a2b-java
ghcr.io/openhands/agent-server:21a4a2b-python

About Multi-Architecture Support

  • Each variant tag (e.g., 21a4a2b-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 21a4a2b-python-amd64) are also available if needed

- Changed TROUBLESHOOTING section to say 'Explain your reasoning process' instead of 'Document your reasoning process'
- Made DOCUMENTATION section much more explicit about NOT creating markdown files
- Added clear instruction: Do NOT create README.md, CHANGES.md, NOTES.md, or any other documentation files unless explicitly requested
- Emphasized that explanations should ALWAYS be in conversation responses, not separate files

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww xingyaoww marked this pull request as ready for review December 1, 2025 20:03
@xingyaoww xingyaoww added the run-eval-50 (Runs evaluation on 50 SWE-bench instances) label Dec 1, 2025
github-actions bot (Contributor) commented Dec 1, 2025

Evaluation Triggered

enyst (Collaborator) commented Dec 1, 2025

Just a thought about this PR, not a review. I’d love your thoughts 😅

I should note, for the record, that this is LLM-specific: it looks like a very Sonnet thing; it must have been trained to do it. Sorry, Xingyao, I know you may feel context engineering is not really taking us far 😅 and maybe you’ll ultimately be proven right, but I just feel we don’t get the best experience if we apply a blanket prompt to all LLMs either.

I’m pretty sure there is no need for this for Gemini 2.5 or GPT-5 all variants, or a number of older models like R1.

The important part, it seems to me, is that this is exactly the kind of thing we have needed to adjust system prompts for over the past year and a half, and those adjustments are LLM-specific. The majority, as far as I recall, were tweaks for something that a particular LLM kept doing or not doing.

(There is a reasonable argument to be made that the particular phrases in this PR may not hurt other models, and that’s totally possible, but idk, it doesn’t seem completely obvious: for example, according to OpenAI docs, adding instructions for the Codex variants to talk to the user may hurt performance or make them stop/finish early. Isn’t this about talking to the user? “ALWAYS include explanations in your conversation responses” does seem like a similar topic.)

TBH this is one reason why I think we may need to add dedicated prompts for a few SOTA families. I think it may prove easier for us:

  • to focus on LLM (family) (3-4 families like gemini, gpt-5),
  • eval the whole (tools + prompts) per model,
  • and then we don’t need to re-eval on other models for every Sonnet adjustment, because we don’t make these adjustments in the other 2-3 files. WDYT?

xingyaoww (Collaborator, Author) replied, quoting the suggestion above:

  • to focus on LLM (family) (3-4 families like gemini, gpt-5),
  • eval the whole (tools + prompts) per model,
  • and then we don’t need to re-eval on other models for every Sonnet adjustment, because we don’t make these adjustments in the other 2-3 files. WDYT?

One thing I'm worried about is the fairness of evals: there would be an additional "prompts" variable when we evaluate and compare different models, so when one model performs worse than another it is hard to tell whether that is due to model capability or prompt optimization. This makes the OH evaluation numbers less trustworthy, since they are no longer easily comparable.

And especially since we don't currently have a systematic way to optimize the system prompt for any given model, maintaining separate files worries me: those files can easily get out of sync, and fixes made for one model don't propagate to the others.

IMO, we should only separate system prompts out when (1) we have enough manpower to track and maintain system prompts for different models, so we are sure each change is properly evaluated, OR (2) we have an automated system that optimizes system prompts based on a list of evaluation instances (real-world tasks, with an LLM-as-judge to monitor agent behavior).

Also, on the other hand, the system prompt describes the expected behavior of agents, which I think is valuable to keep consistent across models (although a lot of the time we don't need these prompts for other models like GPT-5 / Codex), and I'd be happy to revert the relevant parts of this PR that may hurt GPT-5 performance.

xingyaoww (Collaborator, Author) commented Dec 2, 2025

it looks like a very Sonnet thing, it must have been trained to do it.

Not really; claude-code doesn't do that often. I suspect it is the "DOCUMENTATION BLOCK" we have in our system prompt. The other alternative would be to remove the documentation block completely to simplify things; it was there to inhibit some Sonnet 4 behavior.

The DOCUMENTATION block is redundant since FILE_SYSTEM_GUIDELINES already contains the guidance about not creating documentation files. Removing this block simplifies the prompt while maintaining the same behavior.

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww xingyaoww added and removed the run-eval-50 (Runs evaluation on 50 SWE-bench instances) label Dec 2, 2025
github-actions bot (Contributor) commented Dec 2, 2025

Evaluation Triggered

all-hands-bot (Collaborator) commented:

🎉 Evaluation Job Completed

Evaluation Name: sdk-main-19866551464-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: 40edd910333a97a7a32977eafbd4570ba8bfd690
Timestamp: 2025-12-02 18:31:40 UTC

Results Summary

  • Total instances: 500
  • Submitted instances: 50
  • Resolved instances: 30
  • Unresolved instances: 18
  • Empty patch instances: 0
  • Error instances: 2
  • Success rate: 30/50 (60.0%)

View Metadata | View Results | Download Full Results

xingyaoww (Collaborator, Author) commented Dec 3, 2025

Actually, I re-ran the patch eval locally and it gives 39/50, which is comparable to or better than @ryanhoangt's number here (35/50):

#419 (comment)

I think this PR is ready for review and merge.

Cleaning cached images...
Removed 0 images.
Total instances: 500
Instances submitted: 50
Instances completed: 50
Instances incomplete: 450
Instances resolved: 39
Instances unresolved: 11
Instances with empty patches: 0
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 500
Report written to openhands.eval_output.swebench.json

hieptl (Contributor) left a comment:

Thank you! 🙏

@xingyaoww xingyaoww enabled auto-merge (squash) December 3, 2025 16:22
@xingyaoww xingyaoww merged commit 5825094 into main Dec 3, 2025
16 checks passed
@xingyaoww xingyaoww deleted the fix-prevent-markdown-file-creation branch December 3, 2025 16:23