Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.
all-hands-bot left a comment
🟢 Good taste - This is exactly what a release PR should be: simple, mechanical, and focused.
Analysis:
This is a straightforward version bump from 1.11.2 to 1.11.3 across all four packages (openhands-agent-server, openhands-sdk, openhands-tools, openhands-workspace) and the lockfile.
Applying the three critical questions:
- Is this solving a real problem? ✅ Yes - preparing a patch release
- Is there a simpler way? ✅ Already simple - just version bumps
- What will this break? ✅ Nothing - semantic versioning done right
No Critical Issues
No data structure changes, no complexity, no breaking changes, no code to review. The changes are consistent across all packages and the lockfile is updated accordingly.
Verdict: ✅ Worth merging - This is correct and complete. The checklist in the PR description covers the remaining validation steps (tests, release notes).
Key Insight: This is how release PRs should look - boring, mechanical, and impossible to mess up.
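The mechanical bump reviewed above could in principle be scripted. Below is a minimal sketch, assuming each package directory carries a `pyproject.toml` with a top-level `version = "..."` line; the demo operates on a stand-in file it creates itself, not the real repo layout, and `sed -i` here is the GNU form:

```shell
#!/bin/sh
# Sketch only: bump `version = "1.11.2"` to `version = "1.11.3"` in each
# package's pyproject.toml. Paths and file layout are assumptions.
set -e
OLD='1.11.2'
NEW='1.11.3'

# Demo setup: create one stand-in package dir (a real checkout would already
# contain openhands-agent-server/, openhands-sdk/, openhands-tools/ and
# openhands-workspace/, each with its own pyproject.toml).
WORK=$(mktemp -d)
mkdir -p "$WORK/openhands-sdk"
printf 'version = "%s"\n' "$OLD" > "$WORK/openhands-sdk/pyproject.toml"

for pkg in openhands-sdk; do  # a real run would list all four packages
  sed -i "s/^version = \"$OLD\"/version = \"$NEW\"/" "$WORK/$pkg/pyproject.toml"
done

grep '^version' "$WORK/openhands-sdk/pyproject.toml"
```

The lockfile would then be regenerated by the package manager rather than edited by hand, which matches the "lockfile is updated accordingly" observation above.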
🔄 Running Examples
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 26.5s | $0.03 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 19.9s | $0.02 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 11.6s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 37.1s | $0.03 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 17.1s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 26.3s | $0.02 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 36.2s | $0.04 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 16.5s | $0.02 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 21.9s | $0.02 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 2m 13s | $0.30 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 15.9s | $0.02 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 22.4s | $0.02 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 19.1s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 19.9s | $0.03 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 10.2s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 18.5s | $0.02 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 1m 6s | $0.01 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 5m 23s | $0.51 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 2m 24s | $0.20 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 18.8s | $0.02 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 35.1s | $0.03 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 49.1s | $0.04 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 9.8s | $0.01 |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 4m 22s | $0.34 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 19.2s | $0.02 |
| 01_standalone_sdk/34_critic_example.py | ❌ FAIL (Exit code 1) | 3.6s | -- |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 12.7s | $0.01 |
| 01_standalone_sdk/37_llm_profile_store.py | ✅ PASS | 3.7s | $0.00 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 45.3s | $0.02 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 34s | $0.04 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 47.6s | $0.05 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 33s | $0.02 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 25.7s | $0.02 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ✅ PASS | 3m 37s | $0.02 |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 26.3s | $0.02 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 55.5s | $0.05 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 18.5s | $0.01 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 7.5s | $0.01 |
❌ Some tests failed
Total: 38 | Passed: 37 | Failed: 1 | Total Cost: $2.04
Failed examples:
- examples/01_standalone_sdk/34_critic_example.py: Exit code 1
🧪 Integration Tests Results
Overall Success Rate: 100.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_deepseek_deepseek_reasoner
Skipped Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
litellm_proxy_gemini_3_pro_preview
litellm_proxy_claude_sonnet_4_5_20250929
Evaluation Triggered
🧪 Integration Tests Results
Overall Success Rate: 80.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_deepseek_deepseek_reasoner
litellm_proxy_moonshot_kimi_k2_thinking
Failed Tests:
Analysis of what the agent created:
The agent's approach was otherwise methodical and well-reasoned, with proper exploration of the codebase to understand patterns and requirements. However, the creation of the unnecessary summary markdown file is a clear violation of the stated constraints. (confidence=0.92) (Cost: $0.91)
litellm_proxy_gemini_3_pro_preview
Failed Tests:
litellm_proxy_claude_sonnet_4_5_20250929
Failed Tests:
What the agent did correctly:
Critical violation of evaluation criteria:
The criteria explicitly state: "Verify that the agent did not over-verify the truncation limit change by running test suites much broader than necessary, or repeatedly."
Why this matters:
Specific evaluation criteria violations:
Positive aspects:
The behavior shows over-verification and redundant testing when the evaluation criteria explicitly asked to avoid this. A single run of
However, the agent violated the explicit evaluation criteria by creating multiple temporary files that were NOT requested:
The evaluation criteria explicitly state: "Avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script." While these files were created in
The agent should have:
The core task was completed well, but the execution violated the stated constraints about file creation. (confidence=0.95) (Cost: $0.70)
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings.
Release v1.11.3
This PR prepares the release for version 1.11.3.
Release Checklist
• `integration-test`
• `behavior-test`
• `test-examples`
• `v1.11.3`
• `rel-1.11.3`

Next Steps
Once the release is published on GitHub, the PyPI packages will be automatically published via the `pypi-release.yml` workflow.

Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• `eclipse-temurin:17-jdk`
• `nikolaik/python-nodejs:python3.12-nodejs22`
• `golang:1.21-bookworm`

Pull (multi-arch manifest)

```shell
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:2b804c8-python
```

Run
All tags pushed for this build
About Multi-Architecture Support
• The variant tag (`2b804c8-python`) is a multi-arch manifest supporting both amd64 and arm64
• Architecture-specific tags (e.g. `2b804c8-python-amd64`) are also available if needed
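Assuming the tag scheme above (`<short-sha>-<variant>` for the multi-arch manifest, with an `-<arch>` suffix for single-architecture pins), the tags for this build can be sketched as follows; the short SHA and variant name come from this build, the suffix rule from the notes above:

```shell
#!/bin/sh
# Illustration of the tag layout only; no registry access is performed.
SHA=2b804c8
VARIANT=python
MANIFEST_TAG="${SHA}-${VARIANT}"    # multi-arch manifest tag

echo "ghcr.io/openhands/agent-server:${MANIFEST_TAG}"
for arch in amd64 arm64; do         # single-architecture pins
  echo "ghcr.io/openhands/agent-server:${MANIFEST_TAG}-${arch}"
done
```

Pulling the manifest tag lets the Docker client resolve the right architecture automatically; the `-amd64`/`-arm64` pins are only needed when one specific platform must be forced.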