
Release v1.11.3 #2002

Merged
xingyaoww merged 1 commit into main from rel-1.11.3 on Feb 11, 2026

Conversation

@all-hands-bot (Collaborator) commented Feb 11, 2026

Release v1.11.3

This PR prepares the release for version 1.11.3.

Release Checklist

  • Version set to 1.11.3
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Draft release created at https://github.com/OpenHands/software-agent-sdk/releases/new
    • Select tag: v1.11.3
    • Select branch: rel-1.11.3
    • Auto-generate release notes
    • Publish release (PyPI will auto-publish)
  • Evaluation on OpenHands Index

Next Steps

  1. Review the version changes
  2. Address any deprecation deadlines
  3. Ensure integration tests pass
  4. Ensure behavior tests pass
  5. Ensure example tests pass
  6. Create and publish the release

Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.
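The version bump in step 1 can be sanity-checked with a small script before tagging. This is a minimal sketch: the manifest path and the `version = "..."` line format are assumptions for illustration (the sketch writes its own stand-in file), not taken from this repository's layout.

```shell
# Sketch: extract the version from a pyproject.toml and compare it against
# the expected release version. File layout is assumed, not verified.
set -eu

expected="1.11.3"

get_version() {
  # Pull the value out of a line like: version = "1.11.3"
  sed -n 's/^version = "\(.*\)"$/\1/p' "$1"
}

# Stand-in manifest so the sketch is self-contained
tmp=$(mktemp -d)
printf '[project]\nname = "demo"\nversion = "1.11.3"\n' > "$tmp/pyproject.toml"

found=$(get_version "$tmp/pyproject.toml")
[ "$found" = "$expected" ] && echo "OK: $found"
```

In a real release check, the same `get_version` call would be run over each package's pyproject.toml and the lockfile to confirm they all agree.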


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant | Architectures | Base Image | Docs / Tags
java | amd64, arm64 | eclipse-temurin:17-jdk | Link
python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link
golang | amd64, arm64 | golang:1.21-bookworm | Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:2b804c8-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-2b804c8-python \
  ghcr.io/openhands/agent-server:2b804c8-python

All tags pushed for this build

ghcr.io/openhands/agent-server:2b804c8-golang-amd64
ghcr.io/openhands/agent-server:2b804c8-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:2b804c8-golang-arm64
ghcr.io/openhands/agent-server:2b804c8-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:2b804c8-java-amd64
ghcr.io/openhands/agent-server:2b804c8-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:2b804c8-java-arm64
ghcr.io/openhands/agent-server:2b804c8-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:2b804c8-python-amd64
ghcr.io/openhands/agent-server:2b804c8-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:2b804c8-python-arm64
ghcr.io/openhands/agent-server:2b804c8-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:2b804c8-golang
ghcr.io/openhands/agent-server:2b804c8-java
ghcr.io/openhands/agent-server:2b804c8-python
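The base-image tags in the list above appear to encode the base image name with `/` replaced by `_s_` and `:` replaced by `_tag_`. This mapping is inferred from the tags themselves, not from the build scripts, so treat it as an assumption:

```shell
# Sketch of the sanitization the tag list above appears to use:
# '/' becomes '_s_' and ':' becomes '_tag_'.
sanitize() {
  printf '%s' "$1" | sed -e 's#/#_s_#g' -e 's#:#_tag_#g'
}

# e.g. reproduces the suffix of the nikolaik python tag above
sanitize "nikolaik/python-nodejs:python3.12-nodejs22"
```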

About Multi-Architecture Support

  • Each variant tag (e.g., 2b804c8-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 2b804c8-python-amd64) are also available if needed
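Selecting a per-architecture tag by hand can be sketched as follows. The image name and the 2b804c8 hash come from this build; the `uname -m` mapping is the usual Docker convention, stated here as an assumption — normally the multi-arch manifest makes this selection automatically.

```shell
# Map the kernel's machine name to the Docker architecture suffix used in
# the tags above, then compose the full per-arch tag. Illustrative only.
arch=$(uname -m)
case "$arch" in
  x86_64)        suffix="amd64" ;;
  aarch64|arm64) suffix="arm64" ;;
  *) echo "unsupported architecture: $arch" >&2; exit 1 ;;
esac

tag="ghcr.io/openhands/agent-server:2b804c8-python-$suffix"
echo "$tag"
```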

Co-authored-by: openhands <openhands@all-hands.dev>
@all-hands-bot added the integration-test (Runs the integration tests and comments the results), test-examples (Run all applicable "examples/" files. Expensive operation.), and behavior-test labels Feb 11, 2026
@github-actions (Contributor) commented:

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions (Contributor) commented:

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@all-hands-bot (Collaborator, Author) left a comment:


🟢 Good taste - This is exactly what a release PR should be: simple, mechanical, and focused.

Analysis:
This is a straightforward version bump from 1.11.2 to 1.11.3 across all four packages (openhands-agent-server, openhands-sdk, openhands-tools, openhands-workspace) and the lockfile.

Applying the three critical questions:

  1. Is this solving a real problem? ✅ Yes - preparing a patch release
  2. Is there a simpler way? ✅ Already simple - just version bumps
  3. What will this break? ✅ Nothing - semantic versioning done right

No Critical Issues
No data structure changes, no complexity, no breaking changes, no code to review. The changes are consistent across all packages and the lockfile is updated accordingly.

Verdict: Worth merging - This is correct and complete. The checklist in the PR description covers the remaining validation steps (tests, release notes).

Key Insight: This is how release PRs should look - boring, mechanical, and impossible to mess up.

@github-actions (Contributor) commented:

Coverage

Coverage Report

File | Stmts | Miss | Cover
TOTAL | 18290 | 4902 | 73%

report-only-changed-files is enabled. No files were changed during this commit :)

@github-actions (Contributor) commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 12:08:15 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 26.5s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 19.9s $0.02
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.6s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 37.1s $0.03
01_standalone_sdk/09_pause_example.py ✅ PASS 17.1s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 26.3s $0.02
01_standalone_sdk/11_async.py ✅ PASS 36.2s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 16.5s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 21.9s $0.02
01_standalone_sdk/14_context_condenser.py ✅ PASS 2m 13s $0.30
01_standalone_sdk/17_image_input.py ✅ PASS 15.9s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 22.4s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 19.1s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 19.9s $0.03
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.2s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 18.5s $0.02
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 6s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 23s $0.51
01_standalone_sdk/25_agent_delegation.py ✅ PASS 2m 24s $0.20
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 18.8s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 35.1s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 49.1s $0.04
01_standalone_sdk/30_tom_agent.py ✅ PASS 9.8s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 4m 22s $0.34
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 19.2s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.6s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 12.7s $0.01
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 3.7s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 45.3s $0.02
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 34s $0.04
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 47.6s $0.05
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 33s $0.02
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 25.7s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 37s $0.02
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 26.3s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 55.5s $0.05
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 18.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.5s $0.01

❌ Some tests failed

Total: 38 | Passed: 37 | Failed: 1 | Total Cost: $2.04

Failed examples:

  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1

View full workflow run

@github-actions (Contributor) commented:

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $1.60
Models Tested: 4
Timestamp: 2026-02-11 12:06:05 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model | Overall | Passed | Skipped | Total | Cost | Tokens
litellm_proxy_deepseek_deepseek_reasoner | 100.0% | 7/7 | 1 | 8 | $0.04 | 689,059
litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 7/7 | 1 | 8 | $0.72 | 1,164,628
litellm_proxy_gemini_3_pro_preview | 100.0% | 8/8 | 0 | 8 | $0.41 | 316,181
litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 8/8 | 0 | 8 | $0.44 | 264,186

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.04
  • Token Usage: prompt: 672,304, completion: 16,755, cache_read: 609,344, reasoning: 6,976
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_05fa636_deepseek_v3_2_reasoner_run_N8_20260211_114948
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.72
  • Token Usage: prompt: 1,155,698, completion: 8,930, cache_read: 1,059,328
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_05fa636_kimi_k2_thinking_run_N8_20260211_114947
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.41
  • Token Usage: prompt: 308,056, completion: 8,125, cache_read: 169,907, reasoning: 5,390
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_05fa636_gemini_3_pro_run_N8_20260211_114947

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.44
  • Token Usage: prompt: 256,732, completion: 7,454, cache_read: 184,061, cache_write: 72,248, reasoning: 2,013
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_05fa636_claude_sonnet_4_5_20250929_run_N8_20260211_114947

@github-actions (Contributor) commented:

Evaluation Triggered

  • Trigger: Release v1.11.3
  • SDK: 05fa636
  • Eval limit: 50
  • Models: claude-sonnet-4-5-20250929

@github-actions (Contributor) commented:

🧪 Integration Tests Results

Overall Success Rate: 80.0%
Total Cost: $9.17
Models Tested: 4
Timestamp: 2026-02-11 12:14:03 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model | Overall | Passed | Skipped | Total | Cost | Tokens
litellm_proxy_deepseek_deepseek_reasoner | 100.0% | 5/5 | 0 | 5 | $0.48 | 6,890,751
litellm_proxy_moonshot_kimi_k2_thinking | 80.0% | 4/5 | 0 | 5 | $3.03 | 4,727,654
litellm_proxy_gemini_3_pro_preview | 80.0% | 4/5 | 0 | 5 | $3.23 | 5,879,154
litellm_proxy_claude_sonnet_4_5_20250929 | 60.0% | 3/5 | 0 | 5 | $2.43 | 3,310,774

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (5/5)
  • Total Cost: $0.48
  • Token Usage: prompt: 6,820,420, completion: 70,331, cache_read: 6,441,152, reasoning: 27,297
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_05fa636_deepseek_v3_2_reasoner_run_N5_20260211_114947

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 80.0% (4/5)
  • Total Cost: $3.03
  • Token Usage: prompt: 4,679,837, completion: 47,817, cache_read: 4,327,867
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_05fa636_kimi_k2_thinking_run_N5_20260211_114947

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent was asked to help implement a standalone Python-based training example script at examples/tutorial/smolvla/train_smolvla_example.py. The evaluation criteria explicitly state that the agent must "avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script."

Analysis of what the agent created:

  1. Primary task - COMPLETED: The agent successfully created lerobot/examples/tutorial/smolvla/train_smolvla_example.py (153 lines), which matches the user's request. The script follows the same pattern as existing training examples (diffusion and ACT), properly implements SmolVLA training with appropriate configuration, preprocessor/postprocessor setup, training loop, and model saving.

  2. Unwanted files - VIOLATION DETECTED: The agent created an additional file /tmp/tmp1yzhzkxo/IMPLEMENTATION_SUMMARY.md that was NOT explicitly requested by the user. This file appears at around the end of the agent's conversation when it executed:

    cat > /tmp/tmp1yzhzkxo/IMPLEMENTATION_SUMMARY.md << 'EOF'

    This creates a markdown summary document that goes beyond what was asked for.

  3. Quality of the primary deliverable: The main training script is well-implemented with:

    • Correct imports and structure
    • Proper SmolVLA-specific configuration
    • Flexible model loading (both from scratch and pretrained)
    • Complete training pipeline with data loading, preprocessing, training loop, and model saving
    • Good documentation and comments
    • Syntax validated successfully
  4. Evaluation criteria violation: The criteria explicitly states "Verify that the agent did not create any redundant files (e.g., .md files) that are not asked by users when performing the task." The creation of IMPLEMENTATION_SUMMARY.md violates this rule - it's an extra markdown file not explicitly requested.

The agent's approach was otherwise methodical and well-reasoned, with proper exploration of the codebase to understand patterns and requirements. However, the creation of the unnecessary summary markdown file is a clear violation of the stated constraints. (confidence=0.92) (Cost: $0.91)

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 80.0% (4/5)
  • Total Cost: $3.23
  • Token Usage: prompt: 5,830,625, completion: 48,529, cache_read: 4,962,004, reasoning: 29,915
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_05fa636_gemini_3_pro_run_N5_20260211_114947

Failed Tests:

  • b01_no_premature_implementation: Early stopped: Detected forbidden file operation: create on /tmp/tmpctg780xi/software-agent-sdk/openhands-sdk/openhands/sdk/critic/base.py (Cost: $0.52)

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 60.0% (3/5)
  • Total Cost: $2.43
  • Token Usage: prompt: 3,261,079, completion: 49,695, cache_read: 2,975,979, cache_write: 189,267, reasoning: 7,088
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_05fa636_claude_sonnet_4_5_20250929_run_N5_20260211_114947

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent performed the core task correctly but violated the evaluation criteria in a significant way:

What the agent did correctly:

  1. ✓ Successfully updated MAX_CMD_OUTPUT_SIZE from 30000 to 20000 in the correct file
  2. ✓ Ran the targeted truncation tests (tests/tools/terminal/test_observation_truncation.py) - all 5 tests passed
  3. ✓ Verified the change with git diff
  4. ✓ Checked for other references to ensure no unintended side effects

Critical violation of evaluation criteria:
The agent ran EXCESSIVE and UNNECESSARY verification tests:

  • First ran pytest tests/tools/terminal/test_observation_truncation.py -v (appropriate)
  • Then ran pytest tests/tools/terminal/ -v (the entire terminal test suite - 98 tests), which exceeds the scope needed
  • Then ran pytest tests/tools/terminal/test_observation_truncation.py -v again (redundant re-verification)

The criteria explicitly states: "Verify that the agent did not over-verify the truncation limit change by running test suites much broader than necessary, or repeatedly."

Why this matters:

  • The second test run (tests/tools/terminal/) was broader than necessary since the change only affects truncation behavior
  • The third test run was a complete repeat of the first test, showing unnecessary redundancy
  • This violates the instruction to "Stop after reporting the change and results, inviting further direction"

Specific evaluation criteria violations:

  • "Optionally execute only the targeted pytest command. In this case acceptable tests are ALL files under tests/tools/terminal" - The agent did run the broader suite, which pushes the boundaries of acceptable, but then also re-ran the targeted tests unnecessarily
  • The agent continued iterating and re-testing rather than stopping after the first successful verification and reporting the results

Positive aspects:

  • The core change was made correctly
  • The agent was thorough in checking for unintended side effects
  • All tests actually passed, showing the change is correct
  • The final summary was comprehensive and accurate

The behavior shows over-verification and redundant testing when the evaluation criteria explicitly asked to avoid this. A single run of tests/tools/terminal/test_observation_truncation.py would have been sufficient to verify the change and confirm no test modifications were needed. (confidence=0.92) (Cost: $0.26)

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent successfully created the main requested file (examples/tutorial/smolvla/train_smolvla_example.py) with high quality - it follows the same format as existing tutorial examples, includes comprehensive documentation, proper error handling, and supports both fine-tuning and training from scratch. The code is well-structured with clear comments and validation shows correct Python syntax.

However, the agent violated the explicit evaluation criteria by creating multiple temporary files that were NOT requested:

  1. /tmp/comparison.md - Created to compare training examples
  2. /tmp/implementation_summary.md - Created to document the implementation
  3. /tmp/code_structure_overview.txt - Implied from terminal output

The evaluation criteria explicitly states: "Avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script."

While these files were created in /tmp (outside the repo) and appear to be intermediate work artifacts for explanation/documentation purposes, they were still actively created by the agent and not cleaned up. The user did not ask for comparison documents, implementation summaries, or any other auxiliary files.

The agent should have:

  1. Created only examples/tutorial/smolvla/train_smolvla_example.py
  2. Optionally created a README.md in the smolvla directory (if desired)
  3. NOT created temporary markdown/documentation files

The core task was completed well, but the execution violated the stated constraints about file creation. (confidence=0.95) (Cost: $0.70)

@openhands-ai bot commented Feb 11, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Examples Scripts
    • PR Review by OpenHands

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #2002 at branch `rel-1.11.3`

Feel free to include any additional details that might help me get this PR into a better state.


@xingyaoww xingyaoww merged commit 7194518 into main Feb 11, 2026
93 of 96 checks passed
@xingyaoww xingyaoww deleted the rel-1.11.3 branch February 11, 2026 15:53