Conversation

@simonrosenberg
Collaborator

@simonrosenberg simonrosenberg commented Oct 27, 2025

Summary

  • Standardized the Observation base class to unify error handling and output formatting across tools.
  • Introduced ObservationStatus enum and result_status property to compute success/error state.
  • Added standardized output: str and error: str | None fields in the base.
  • Implemented a default to_llm_content that returns a formatted error when present, otherwise the output.
  • Kept backward compatibility by having has_error use bool(self.error) to support both string and boolean-style error flags in existing subclasses.
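
To make the summary above concrete, here is a minimal sketch of what such a base class might look like. It uses stdlib dataclasses rather than the SDK's Pydantic-based Schema, TextContent is a stand-in for the SDK type, and the enum values are assumptions; only the names ObservationStatus, output, error, has_error, result_status, _format_error and to_llm_content come from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class ObservationStatus(str, Enum):
    # Hypothetical member values; the PR only names the enum.
    SUCCESS = "success"
    ERROR = "error"


@dataclass(frozen=True)
class TextContent:
    # Stand-in for the SDK's TextContent type.
    text: str


@dataclass(frozen=True)
class Observation:
    output: str = ""
    error: str | None = None

    @property
    def has_error(self) -> bool:
        # bool() treats None and "" as "no error".
        return bool(self.error)

    @property
    def result_status(self) -> ObservationStatus:
        if self.has_error:
            return ObservationStatus.ERROR
        return ObservationStatus.SUCCESS

    def _format_error(self) -> TextContent:
        return TextContent(text=f"Error: {self.error}")

    def to_llm_content(self) -> list[TextContent]:
        # Default behavior: formatted error when present, otherwise the output.
        if self.has_error:
            return [self._format_error()]
        return [TextContent(text=self.output)]
```

Subclasses that need richer content (images, metadata) would still override to_llm_content; the default only covers the plain-text case.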

Motivation and Context
There were 9 different observation classes with minor variations in how they handled errors and outputs. This refactor centralizes those patterns in the base class to reduce duplication and ensure consistent agent-visible content.

What changed

  • openhands-sdk/openhands/sdk/tool/schema.py
    • Added ObservationStatus enum
    • Added standardized output and error fields
    • Added has_error property and result_status computation
    • Added _format_error helper and default to_llm_content implementation
    • Kept visualize behavior consistent with to_llm_content

Backward Compatibility

  • Subclasses can still define their own fields and override to_llm_content/visualize as needed
  • has_error uses bool(self.error) to accommodate legacy subclasses that used error: bool or is_error: bool in their own models without breaking existing behavior
  • No changes were required to existing tools or tests; existing custom to_llm_content implementations (e.g., FileEditor, ExecuteBash, Browser, MCP) continue to work as before and take precedence over the default
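
A small sketch of why bool(self.error) covers both styles of error flag. The helper name is hypothetical and only isolates the expression used by has_error.

```python
from __future__ import annotations


def has_error(error: str | bool | None) -> bool:
    """Hypothetical helper mirroring has_error's bool(self.error) check."""
    return bool(error)


# String-style errors (the new base field).
string_cases = {None: False, "": False, "file not found": True}

# Boolean-style errors (legacy subclasses that declared error: bool).
bool_cases = {False: False, True: True}
```

One consequence worth noting: an empty error string counts as "no error", so an executor that wants to signal failure has to supply a non-empty message.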

Tests

  • Ran pre-commit and the full test suite locally. Only unrelated e2e/environment-dependent tests failed. Schema immutability tests pass with the new base properties.

Related Issues

Checklist

  • Pre-commit hooks pass for modified files
  • Types checked with pyright
  • Minimal, focused change touching only the Observation base to standardize behavior

Co-authored-by: openhands <openhands@all-hands.dev>


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Base Image Docs / Tags
golang golang:1.21-bookworm Link
java eclipse-temurin:17-jdk Link
python nikolaik/python-nodejs:python3.12-nodejs22 Link

Pull (multi-arch manifest)

docker pull ghcr.io/openhands/agent-server:a9185dc-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-a9185dc-python \
  ghcr.io/openhands/agent-server:a9185dc-python

All tags pushed for this build

ghcr.io/openhands/agent-server:a9185dc-golang
ghcr.io/openhands/agent-server:v1.0.0a5_golang_tag_1.21-bookworm_binary
ghcr.io/openhands/agent-server:a9185dc-java
ghcr.io/openhands/agent-server:v1.0.0a5_eclipse-temurin_tag_17-jdk_binary
ghcr.io/openhands/agent-server:a9185dc-python
ghcr.io/openhands/agent-server:v1.0.0a5_nikolaik_s_python-nodejs_tag_python3.12-nodejs22_binary

The a9185dc tag is a multi-arch manifest (amd64/arm64); your client pulls the right arch automatically.

…atus helpers

- Add ObservationStatus enum and result_status property
- Add standardized output and error fields with default to_llm_content
- Keep subclasses backward compatible by using bool(error) in has_error

Co-authored-by: openhands <openhands@all-hands.dev>
…mpat

- FinishObservation: rename message->output, add message alias; rely on base to_llm_content/visualize
- ThinkObservation: rename content->output, add content alias; rely on base defaults
- MCPToolObservation: adopt base error semantics while preserving is_error; unify has_error and formatting
- TaskTrackerObservation: rename content->output; use base to_llm_content; keep rich visualize
- FileEditorObservation: standardize error formatting prefix as 'Error: '

This aligns all observation classes with the unified Observation base (output: str, error: str|None, has_error, result_status) and removes duplicated logic where possible.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Contributor

github-actions bot commented Oct 27, 2025

Coverage

Coverage Report
File  Stmts  Miss  Cover  Missing
openhands-sdk/openhands/sdk/mcp
   definition.py  51  31  39%  47, 60–61, 63–67, 73, 78, 81, 84, 86, 98, 106–107, 109–113, 115–116, 118–124, 126
   tool.py  82  35  57%  53–56, 60, 63, 66–69, 80, 136–137, 140–144, 146–148, 153, 174, 176–177, 181–183, 215–216, 221, 228–229, 231, 257
openhands-sdk/openhands/sdk/tool
   schema.py  129  18  86%  24–26, 28, 37, 119, 215, 221, 225, 236, 243, 253–257, 259–260
openhands-sdk/openhands/sdk/tool/builtins
   finish.py  18  5  72%  25–28, 59
   think.py  24  11  54%  27, 30–31, 34, 36–40, 42, 71
openhands-tools/openhands/tools/browser_use
   definition.py  122  29  76%  37–38, 41–42, 44, 48, 52–61, 63–64, 66, 114, 173, 229, 282, 338, 391, 438, 485, 540, 591
   impl.py  177  118  33%  25, 39, 43–46, 48, 58–61, 66, 68–70, 72–73, 80–82, 84–88, 102, 114, 154–155, 166–167, 181, 187, 201–202, 204–213, 216–225, 227–228, 230–234, 238, 240–241, 246–247, 251–252, 257–258, 262–263, 267–268, 272, 274–275, 277–280, 283–284, 288, 290, 292, 297–298, 302–303, 307–308, 313–314, 320–324, 328–333, 337–340, 342–344, 347, 351–354
openhands-tools/openhands/tools/delegate
   definition.py  20  4  80%  104, 107, 113, 116
   impl.py  97  81  16%  29, 31–32, 41–42, 46, 52–53, 56–59, 61, 77–78, 83–84, 93–97, 99, 101, 107, 114–115, 117–119, 124–126, 142–143, 149–151, 159, 161–163, 165, 167–170, 173–176, 178–179, 183–186, 189–191, 196–197, 200–201, 204, 206–210, 212, 215–217, 219–220, 223, 225, 230–232
openhands-tools/openhands/tools/execute_bash
   definition.py  113  71  37%  52, 55, 58–59, 61, 64–66, 68–70, 72–74, 76, 100, 108–109, 113–115, 118–119, 121–122, 124–131, 136, 139–141, 144–145, 147–148, 150, 152–154, 156, 160–161, 164–166, 168–169, 172–175, 179–181, 186, 190–192, 195–197, 201–202, 204, 283
   impl.py  80  31  61%  59, 62, 70–71, 98–100, 102–103, 109, 111, 115, 131, 134, 137–138, 143–144, 147–148, 150, 157–158, 160, 167, 175, 185–186, 195–196, 198
openhands-tools/openhands/tools/execute_bash/terminal
   terminal_session.py  182  63  65%  90, 96, 100–102, 124–125, 152, 167–168, 202–204, 209, 212–213, 217, 223, 237–239, 244, 247–248, 252, 258, 277, 279, 282, 284, 300, 315, 320, 328, 331, 364, 368, 371, 374–375, 381–382, 388, 393–394, 400–401, 460–462, 466, 471, 476–477, 481–482, 485–488, 494–495, 498
openhands-tools/openhands/tools/file_editor
   definition.py  56  8  85%  98, 118, 121, 124, 131, 133, 135, 137
openhands-tools/openhands/tools/glob
   definition.py  24  7  70%  74, 76–78, 81, 84, 91
   impl.py  92  77  16%  37–40, 55–56, 58–60, 62, 65, 69–70, 77–78, 80, 83–84, 89–90, 94–95, 100, 108, 110–112, 114–116, 118, 139, 149, 154–158, 160–161, 163, 165, 181–183, 187, 190, 193, 197–198, 200–202, 205–206, 208, 210, 212, 244–245, 248, 251, 253, 256–257, 260–264, 266, 268–269, 271, 273–274, 276
openhands-tools/openhands/tools/grep
   definition.py  26  7  73%  76, 78–80, 83, 86, 93
   impl.py  80  67  16%  35–38, 53, 55–58, 66, 69–72, 80–81, 83, 85, 87–89, 91–93, 95, 112–113, 116, 121–123, 127–128, 132, 139, 149–150, 153, 158–162, 164–165, 167, 169, 177, 191, 204–205, 208, 213–216, 219–220, 222–224, 226–228, 230, 232, 240
openhands-tools/openhands/tools/task_tracker
   definition.py  129  83  35%  52, 55–57, 59–60, 63–64, 66, 82, 84, 86–87, 90, 93–95, 97–98, 101–107, 109–111, 114, 116–119, 121, 124, 127–128, 130–131, 133–134, 136, 165, 167, 169–171, 178, 180–181, 188–189, 195, 206–207, 209–211, 215–216, 218–221, 223, 228, 234–238, 242, 246–247, 249–250, 252, 254–258
TOTAL  11454  5246  54%

openhands-agent and others added 4 commits October 27, 2025 22:58
- Fix remaining assertion in test_registry to use .output
- Ensure subclass output fields provide defaults to satisfy pyright
- Adjust ExecuteBashObservation output default

Co-authored-by: openhands <openhands@all-hands.dev>
…ed Observation

Co-authored-by: openhands <openhands@all-hands.dev>
… base to_llm_content; update tests

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg
Collaborator Author

@OpenHands Let's add a small modification:
def _format_error(self) -> TextContent:
    return TextContent(text=f"Tool Execution Error: {self.error}")

Also, let's be more consistent about how observations subclass the base Observation:

  • Observations should not override to_llm_content: the base to_llm_content is flexible enough that any error message or any content can be sent back to the LLM.
  • When there is an error, output must not be populated in the Observation.
  • In delegate/definition.py, return super().to_llm_content is useless.
  • For delegation, output=f"Unsupported command: {action.command}" should be an error, not an output. Same for all other errors.
  • Remove 'command' from the observation. Simply add it to the message if needed. This PR should not change much of what is sent to the LLM through to_llm_content.

@openhands-ai

openhands-ai bot commented Oct 28, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev

- Update _format_error to use 'Tool Execution Error:' prefix
- Remove command field from DelegateObservation and TaskTrackerObservation
- Remove unnecessary to_llm_content override in DelegateObservation
- Update delegate/impl.py to use error field for errors (not output)
- Prefix success messages with command name (spawn:/delegate:/plan:/view:)
- Update TaskTrackerExecutor to handle errors via error field
- Update tests to match new observation structure
- Ensure output is not populated when error is present

These changes make observations more consistent by:
1. Eliminating the redundant command field
2. Ensuring errors and outputs are mutually exclusive
3. Relying on base class to_llm_content when possible
4. Keeping command context in the message text when needed

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Oct 28, 2025

Summary

I've successfully improved observation consistency across the codebase as requested. The changes have been committed and pushed to PR #929 (openhands/standardize-observation-base).

Changes Made

✅ All requested modifications completed:

  1. Updated error format in _format_error():

    • Changed from "Error: {self.error}" to "Tool Execution Error: {self.error}"
  2. Removed unnecessary to_llm_content overrides:

    • Removed useless super().to_llm_content call from DelegateObservation
    • Left MCPToolObservation's override intact (necessary for structured content)
  3. Ensured errors and outputs are mutually exclusive:

    • When error is set, output is not populated (and vice versa)
    • All error cases now use the error field
    • All success cases now use the output field
  4. Fixed delegation error handling:

    • Moved all error messages from output to error field
    • Unsupported commands, spawn errors, and delegate errors all use error field
  5. Removed command field from observations:

    • Removed from DelegateObservation
    • Removed from TaskTrackerObservation
    • Added command context as message prefix (e.g., "spawn:", "delegate:", "plan:", "view:")
  6. Minimized LLM content changes:

    • Only removed the unnecessary override
    • Base class to_llm_content handles most cases
    • Custom implementations remain where needed (structured content)
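
The error/output separation and the command prefixes described above could be sketched roughly as follows. This is an illustration only: the class is a simplified stand-in for DelegateObservation, run_delegate_command and the success messages are hypothetical, and the __post_init__ check is an assumption (the PR states the rule but does not say it is enforced in code).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DelegateObservation:
    # Simplified stand-in; the real class carries more fields.
    output: str = ""
    error: str | None = None

    def __post_init__(self) -> None:
        # Assumed invariant: error and output are mutually exclusive.
        if self.error and self.output:
            raise ValueError("output must be empty when error is set")


def run_delegate_command(command: str) -> DelegateObservation:
    """Hypothetical executor body showing where each message goes."""
    if command == "spawn":
        # Success: command context lives in the message prefix.
        return DelegateObservation(output="spawn: sub-agent created")
    if command == "delegate":
        return DelegateObservation(output="delegate: task handed off")
    # Failure: the message goes in error, and output stays empty.
    return DelegateObservation(error=f"Unsupported command: {command}")
```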

Files Modified

  • openhands-sdk/openhands/sdk/tool/schema.py - Updated error format
  • openhands-tools/openhands/tools/delegate/definition.py - Removed command field and unnecessary override
  • openhands-tools/openhands/tools/delegate/impl.py - Fixed error handling, added command prefixes
  • openhands-tools/openhands/tools/task_tracker/definition.py - Removed command field, updated executor
  • tests/tools/delegation/test_delegation.py - Updated tests to match new structure

Test Results

✅ All 77 SDK tool tests pass
✅ All 189+ tool tests pass
✅ All pre-commit checks pass (format, lint, type check)
✅ Net reduction of 21 lines of code

The code is now more consistent, with clear separation between errors and outputs, and observations that follow a standardized pattern.

openhands-agent and others added 9 commits October 28, 2025 11:20
- Remove custom to_llm_content override from MCPToolObservation
- Rename convrted_content to converted_content (typo fix)
- Populate error field with full content when MCP result has error
- Populate output field with formatted content for successful results
- Update tests to match new behavior (single TextContent vs list)
- Error messages now include: header + error notice + content
- Success messages now include: header + content

This makes MCPToolObservation consistent with other observations by
relying on the base class to_llm_content and properly separating
error vs output fields.
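
A rough sketch of the split described in this commit message, assuming text-only content. build_mcp_fields and the header format are hypothetical; the error notice string is the one quoted later in this thread.

```python
from __future__ import annotations

ERROR_NOTICE = "[An error occurred during execution.]"


def build_mcp_fields(
    tool_name: str, texts: list[str], is_error: bool
) -> tuple[str, str | None]:
    """Return (output, error) for an MCP result; exactly one is populated."""
    header = f"[Tool '{tool_name}' executed.]"  # hypothetical header text
    if is_error:
        # Error: header + error notice + content, all in the error field.
        return "", "\n".join([header, ERROR_NOTICE, *texts])
    # Success: header + content, all in the output field.
    return "\n".join([header, *texts]), None
```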

Co-authored-by: openhands <openhands@all-hands.dev>
- Remove output field from multiple commands error case
- Move full error message to error field instead
- Update test to check error field instead of output field
- Fix line length by extracting commands_list variable

This makes ExecuteBashObservation consistent with the pattern that
errors should use the error field, not the output field.

Co-authored-by: openhands <openhands@all-hands.dev>
The FileEditorObservation.to_llm_content was simply returning error or output,
which is exactly what the base Observation class now provides by default.
Removing this override allows the observation to use the standardized base
implementation.

Co-authored-by: openhands <openhands@all-hands.dev>
…interface

- Restored 'command' field to TaskTrackerObservation
- Changed 'output' back to 'content' field
- Restored original visualize method that uses command field to determine styling
- Added custom to_llm_content implementation
- Updated executor to populate command field in observations

This preserves the original interface while maintaining consistency with the
base observation standardization changes.

Co-authored-by: openhands <openhands@all-hands.dev>
…o_llm_content

- Added command: str | None field to base Observation class
- Updated to_llm_content to prepend 'Command: {command}' when command is present
- Removed to_llm_content overrides from simple observations (Glob, Grep, TaskTracker)
- Updated executors to populate output field instead of defining custom to_llm_content
- Kept complex to_llm_content overrides (Browser, ExecuteBash) that handle images and metadata
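
As an illustration of this commit, a base to_llm_content that prepends the command might look like the sketch below. TextContent is a stand-in for the SDK type, and the newline separator and the ordering of prefix and body are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TextContent:
    # Stand-in for the SDK's TextContent type.
    text: str


@dataclass(frozen=True)
class Observation:
    output: str = ""
    error: str | None = None
    command: str | None = None

    def to_llm_content(self) -> list[TextContent]:
        # Error takes precedence over output.
        if self.error:
            body = f"Tool Execution Error: {self.error}"
        else:
            body = self.output
        # Prepend the command only when one was recorded.
        if self.command:
            body = f"Command: {self.command}\n{body}"
        return [TextContent(text=body)]
```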

Co-authored-by: openhands <openhands@all-hands.dev>
- Remove 'output', 'error', 'command' field redeclarations from observation subclasses where possible
- Update MCPToolObservation to use output field instead of _content field
- Store text content in output, images in separate images field
- Update MCP tests to match new structure
- Keep FileEditorObservation.command field (required for internal logic)
- When error occurs, populate only error field, not output

Co-authored-by: openhands <openhands@all-hands.dev>
- Update ThinkExecutor to set output='Your thought has been logged.'
- Update FinishExecutor to set output=action.message
- Fix test_view_batch_atomicity.py to use output instead of content
- Fix test_mcp_tool_kind_field.py import issues
- Add Sequence import to glob and grep definitions

These changes ensure observations consistently use the base output field
rather than custom fields, aligning with the standardization effort.

Co-authored-by: openhands <openhands@all-hands.dev>
- Changed base Observation._format_error to use 'Tool Execution Error:' prefix
- Ensured observations don't populate output when there is an error
- Removed command parameter from ExecuteBashObservation instantiations
- Updated ExecuteBashObservation to override output field as str
- Updated ExecuteBashObservation.to_llm_content to return error without output
- Fixed terminal_session.py to only populate error field in error cases
- Removed 'ERROR:' prefix from error messages (base class adds prefix)
- Updated test_bash_session.py to check error field instead of output
- Updated all observation tests to work with new error handling

Co-authored-by: openhands <openhands@all-hands.dev>
@blacksmith-sh

This comment has been minimized.

openhands-agent and others added 3 commits October 28, 2025 14:36
- Removed custom content field, use base class output field instead
- Converted content items are now stored directly in output field
- Error messages include all content combined into error string
- Removed unnecessary complexity from from_call_tool_result
- Updated tests to match new structure (header + content items)

Co-authored-by: openhands <openhands@all-hands.dev>
@blacksmith-sh

This comment has been minimized.

@simonrosenberg
Collaborator Author

@OpenHands
I updated the schema of the tool Observation: openhands-sdk/openhands/sdk/tool/schema.py
I added a property 'command: str'. This property can be narrowed in some subtypes to Literal[…].
I also updated to_llm_content to serialize command, error and output.
Your job is to edit all subclasses of Observation and make sure they all adhere to that base behavior.
Some subclasses had a "command" field on the main branch. Make sure to add it back as a command property. It's now a property because we need type narrowing: some classes had command: str while others had command: Literal[…]
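
A minimal sketch of the narrowing pattern being requested, using plain dataclasses instead of the SDK's Schema. The cmd storage field is an assumption for illustration. The point of the property is that a read-only return type can be narrowed in a subclass, whereas type checkers such as pyright generally reject narrowing a mutable field's declared type.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Observation:
    @property
    def command(self) -> str | None:
        # Base default: no command recorded.
        return None


@dataclass(frozen=True)
class DelegateObservation(Observation):
    # Hypothetical storage field backing the narrowed property.
    cmd: Literal["spawn", "delegate"] = "spawn"

    @property
    def command(self) -> Literal["spawn", "delegate"]:
        # Narrower return type than the base property.
        return self.cmd
```

Note that Literal is only checked statically here; plain dataclasses do not validate the value at runtime the way a Pydantic model would.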

MCP
openhands-sdk/openhands/sdk/mcp/definition.py
openhands-sdk/openhands/sdk/mcp/tool.py
No need to do initial_message += "\n[An error occurred during execution.]" because _format_error already adds an error message header.

For openhands-sdk/openhands/sdk/tool/builtins/finish.py
And openhands-sdk/openhands/sdk/tool/builtins/think.py
Do NOT populate any field in FinishObservation. The output must be an empty list.

For openhands-tools/openhands/tools/delegate/definition.py
Add command: Literal["spawn", "delegate"]

openhands-tools/openhands/tools/delegate/impl.py
When creating DelegateObservation, put back the command=action.command

For openhands-tools/openhands/tools/execute_bash/terminal/terminal_session.py
Reinsert command everywhere in all ExecuteBashObservation
return ExecuteBashObservation(
    output=command_output,
    metadata=metadata,
)

For file Editor openhands-tools/openhands/tools/file_editor/definition.py
Add command as Literal and make sure implementation works

For task tracker
openhands-tools/openhands/tools/task_tracker/definition.py
Make sure command works well

Make sure all tests and pre-commit hooks are passing

@openhands-ai

openhands-ai bot commented Oct 31, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev

…fields

- Updated all Observation subclasses to use standardized output and error fields
- Added command property to all observations with type narrowing where needed
- MCP observations: removed redundant to_llm_content overrides, removed error header duplication
- FinishObservation/ThinkObservation: empty output list as specified
- DelegateObservation: added cmd field and command property with Literal type
- ExecuteBashObservation: added cmd field and command property, use raw_output for command output
- FileEditorObservation: added cmd field with Literal type narrowing
- TaskTrackerObservation: added cmd field with Literal type narrowing
- BrowserObservation: wrapped output strings in TextContent
- GrepObservation/GlobObservation: wrapped output strings in TextContent
- Updated all tests to use new field names (cmd, raw_output) and TextContent wrapping

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg
Collaborator Author

@OpenHands there are failing actions in this PR. Please fix them
Errors

=========================== short test summary info ============================
Exception ignored in atexit callback: <bound method LocalConversation.close of <openhands.sdk.conversation.impl.local_conversation.LocalConversation object at 0x7f8f813b9460>>
FAILED tests/cross/test_agent_secrets_integration.py::test_mask_secrets - AttributeError: 'ExecuteBashObservation' object has no attribute 'raw_output'
FAILED tests/cross/test_agent_secrets_integration.py::test_mask_changing_secrets - AttributeError: 'ExecuteBashObservation' object has no attribute 'raw_output'
FAILED tests/cross/test_agent_secrets_integration.py::test_masking_persists - AttributeError: 'ExecuteBashObservation' object has no attribute 'raw_output'
FAILED tests/cross/test_stuck_detector.py::test_history_too_short - pydantic_core._pydantic_core.ValidationError: 1 validation error for ExecuteBashObservation
raw_output
Extra inputs are not permitted [type=extra_forbidden, input_value='file1.txt\nfile2.txt', input_type=str]
For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden
FAILED tests/cross/test_stuck_detector.py::test_repeating_action_observation_not_stuck_less_than_4_repeats - pydantic_core._pydantic_core.ValidationError: 1 validation error for ExecuteBashObservation
raw_output
Extra inputs are not permitted [type=extra_forbidden, input_value='file1.txt\nfile2.txt', input_type=str]
For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden
FAILED tests/cross/test_stuck_detector.py::test_repeating_action_observation_stuck - pydantic_core._pydantic_core.ValidationError: 1 validation error for ExecuteBashObservation
raw_output
Extra inputs are not permitted [type=extra_forbidden, input_value='file1.txt\nfile2.txt', input_type=str]
For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden
FAILED tests/cross/test_stuck_detector.py::test_not_stuck_with_different_actions - pydantic_core._pydantic_core.ValidationError: 1 validation error for ExecuteBashObservation
raw_output
Extra inputs are not permitted [type=extra_forbidden, input_value='output from ls', input_type=str]
For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden
FAILED tests/cross/test_stuck_detector.py::test_reset_after_user_message - pydantic_core._pydantic_core.ValidationError: 1 validation error for ExecuteBashObservation
raw_output
Extra inputs are not permitted [type=extra_forbidden, input_value='file1.txt\nfile2.txt', input_type=str]
For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden
================== 8 failed, 24 passed, 12 warnings in 41.15s ==================
Traceback (most recent call last):
  File "/home/runner/_work/software-agent-sdk/software-agent-sdk/openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py", line 371, in close
    for tool in self.agent.tools_map.values():
                ^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/_work/software-agent-sdk/software-agent-sdk/openhands-sdk/openhands/sdk/agent/base.py", line 414, in tools_map
    raise RuntimeError("Agent not initialized; call initialize() before use")

0 errors, 0 warnings, 0 informations

0 errors, 0 warnings, 0 informations
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:237:45 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:238:44 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:242:47 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:243:44 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:268:48 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:269:44 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:273:48 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:274:44 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:306:48 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:307:44 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:311:48 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_agent_secrets_integration.py:312:44 - error: Cannot access attribute "raw_output" for class "ExecuteBashObservation"
Attribute "raw_output" is unknown (reportAttributeAccessIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_stuck_detector.py
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_stuck_detector.py:62:13 - error: No parameter named "raw_output" (reportCallIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_stuck_detector.py:111:17 - error: No parameter named "raw_output" (reportCallIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_stuck_detector.py:160:17 - error: No parameter named "raw_output" (reportCallIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_stuck_detector.py:301:17 - error: No parameter named "raw_output" (reportCallIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_stuck_detector.py:350:17 - error: No parameter named "raw_output" (reportCallIssue)
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/cross/test_stuck_detector.py:393:13 - error: No parameter named "raw_output" (reportCallIssue)

@openhands-ai

openhands-ai bot commented Oct 31, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev

- Added raw_output property to ExecuteBashObservation for backward compatibility
  - Property extracts text from first TextContent item in output field
  - Enables tests and code to access raw output text seamlessly

- Updated test files to use correct observation field names
  - Changed test_stuck_detector.py to use output=[TextContent(text=...)] instead of raw_output=
  - Tests can still access raw_output via the property for assertions

- All previously failing tests now pass:
  - test_mask_secrets
  - test_mask_changing_secrets
  - test_masking_persists
  - test_history_too_short
  - test_repeating_action_observation_not_stuck_less_than_4_repeats
  - test_repeating_action_observation_stuck
  - test_not_stuck_with_different_actions
  - test_reset_after_user_message

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Oct 31, 2025

Summary

I've successfully fixed all 8 failing tests in PR #929. The changes have been committed and pushed to the openhands/standardize-observation-base branch.

Problem Identified

The PR standardized the Observation base class to use:

  • output: list[TextContent | ImageContent] instead of a raw string
  • error: str | None instead of error: bool

However, tests were still using the old raw_output field, which caused:

  1. AttributeError in test_agent_secrets_integration.py - trying to access .raw_output property that didn't exist
  2. ValidationError in test_stuck_detector.py - trying to pass raw_output= as a constructor parameter (rejected by Pydantic's extra="forbid")

Solution Implemented

1. Added raw_output property to ExecuteBashObservation (openhands-tools/openhands/tools/execute_bash/definition.py)

  • Property extracts text from the first TextContent item in the output list
  • Provides backward compatibility for code accessing .raw_output
  • Returns empty string if output is empty or not TextContent

2. Updated test instantiations (tests/cross/test_stuck_detector.py)

  • Changed from raw_output="..." to output=[TextContent(text="...")]
  • Tests can still access .raw_output via the property for assertions
  • 6 test functions updated with correct constructor parameters
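
For reference, the compatibility property described above could look roughly like this. It is a sketch with stand-in content types and a stripped-down observation, not the actual definition.py code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class TextContent:
    # Stand-in for the SDK's TextContent type.
    text: str


@dataclass(frozen=True)
class ImageContent:
    # Stand-in for the SDK's ImageContent type.
    image_urls: list[str]


@dataclass(frozen=True)
class ExecuteBashObservation:
    # Simplified: only the field relevant to raw_output is modeled.
    output: list[TextContent | ImageContent] = field(default_factory=list)

    @property
    def raw_output(self) -> str:
        # Text of the first item if it is TextContent, otherwise "".
        first = self.output[0] if self.output else None
        return first.text if isinstance(first, TextContent) else ""
```

Because raw_output is a read-only property, it can be read in assertions but cannot be passed to the constructor, which matches the test changes listed above.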

Verification

All 8 previously failing tests now pass:

  • test_mask_secrets
  • test_mask_changing_secrets
  • test_masking_persists
  • test_history_too_short
  • test_repeating_action_observation_not_stuck_less_than_4_repeats
  • test_repeating_action_observation_stuck
  • test_not_stuck_with_different_actions
  • test_reset_after_user_message

All 16 tests in affected files pass

Pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)

The changes are minimal, focused, and maintain backward compatibility while adhering to the new standardized Observation base class pattern.

@simonrosenberg
Collaborator Author

simonrosenberg commented Oct 31, 2025

@enyst I don't know how I feel about typing output: List[TextContent | ImageContent].

  1. It adds complexity, because almost all "real" outputs are just output: str.
  2. But the most general case is MCPTool, where output can be a list of text content and image content.

Perhaps the internals of each Observation subclass should be whatever makes the most sense for it, as long as to_llm_content -> List[TextContent | ImageContent].
But then again, I really don't like the chaos in the current internal output fields:
output: str                           # ExecuteBash, FileEditor, Browser
content: str                          # TaskTracker, Think
message: str                          # Finish
content: list[TextContent | ImageContent]  # MCPTool
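
One way to read the first option (each subclass keeps the internal shape that suits it, and only to_llm_content is uniform) is sketched below. The class names are hypothetical and the content types are stand-ins.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TextContent:
    text: str


@dataclass(frozen=True)
class ImageContent:
    image_urls: list[str]


class Observation(ABC):
    @abstractmethod
    def to_llm_content(self) -> list[TextContent | ImageContent]:
        """The only contract every observation must honor."""


@dataclass
class PlainTextObservation(Observation):
    # Hypothetical text-only tool: stores a plain string.
    output: str = ""

    def to_llm_content(self) -> list[TextContent | ImageContent]:
        return [TextContent(text=self.output)]


@dataclass
class MultiModalObservation(Observation):
    # Hypothetical MCP-like tool: stores mixed content directly.
    content: list[TextContent | ImageContent] = field(default_factory=list)

    def to_llm_content(self) -> list[TextContent | ImageContent]:
        return list(self.content)
```

The trade-off raised in the comment still applies: this keeps each subclass simple but does nothing to unify the field names.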

- Removed command property and to_llm_content serialization from base Observation
- Renamed cmd field to command in all Observation subclasses:
  * ExecuteBashObservation
  * FileEditorObservation
  * DelegateObservation
  * TaskTrackerObservation
- Updated all executor implementations to use command instead of cmd
- Updated test files to use command field consistently
- Fixed subprocess.TimeoutExpired to use correct parameter name (cmd)

Co-authored-by: openhands <openhands@all-hands.dev>
openhands-agent and others added 3 commits October 31, 2025 16:10
- Updated PlanningFileEditorObservation initialization to use command field
- Aligns with FileEditorObservation parent class field naming

Co-authored-by: openhands <openhands@all-hands.dev>
- Fixed model_copy update to use 'command' field instead of 'cmd'
- Ensures [RESET] prefix is properly included in command field

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg
Collaborator Author

@OpenHands /codereview-roasted
Read the diff and think hard about how to factorize all observations into a base observation class tool/schema.py
A current issue is the typing of output which is too complicated for 8/9 use cases.

@openhands-ai

openhands-ai bot commented Nov 1, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Nov 1, 2025

Summary

I've completed a critical code review of PR #929 focusing on the Observation base class refactoring, specifically addressing the typing complexity of the output field as requested.

Key Findings

🔴 Critical Issues Identified:

  1. Wrong Data Structure Abstraction - The output: list[TextContent | ImageContent] field is overcomplicated for 9 of the 11 observation types:

    • Only Browser and MCP tools actually need ImageContent capability
    • The other 9 tools (Glob, Grep, ExecuteBash, FileEditor, TaskTracker, Delegate, Finish, Think, PlanningFileEditor) just work with plain text
    • This forces unnecessary ceremony: wrap text → store as list → unwrap text
  2. Forced Complexity - Current pattern requires:

    # Writing: wrap, wrap again
    output=[TextContent(text=my_string)]
    
    # Reading: unwrap, type check, extract
    first_item = self.output[0] if self.output else None
    output_text = first_item.text if isinstance(first_item, TextContent) else ""
  3. Abstraction Doesn't Eliminate Special Cases - Most observations still override to_llm_content anyway, making the base implementation largely unused.

📋 Recommended Solution:

Remove the output field entirely and follow these principles:

class Observation(Schema, ABC):
    """Base schema for output observation."""
    
    # ✅ Keep: Standardized error handling
    error: str | None = Field(default=None, ...)
    
    @property
    def has_error(self) -> bool:
        return bool(self.error)
    
    @property
    def result_status(self) -> ObservationStatus:
        return ObservationStatus.ERROR if self.has_error else ObservationStatus.SUCCESS
    
    # ✅ Keep: Interface contract
    @property
    @abstractmethod
    def to_llm_content(self) -> Sequence[TextContent | ImageContent]:
        """Subclasses implement their own formatting."""

Let each observation use natural data structures:

  • GlobObservation: files: list[str] (formats to text in to_llm_content)
  • ExecuteBash: exit_code: int, command: str (formats with metadata)
  • Browser/MCP: Build rich content directly in their to_llm_content

🎯 Core Insight:

The refactor standardizes the wrong thing:

  • Don't standardize: Data storage format (output field)
  • Do standardize: Interface contract (to_llm_content) and error handling

"Bad programmers worry about the code. Good programmers worry about data structures." The current list[TextContent | ImageContent] doesn't naturally represent most tools' output—only 2 out of 11 need it. The right abstraction should make the 9 simple cases trivial and the 2 complex cases possible, not make all 11 complex.
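
The recommendation above can be sketched roughly as follows. This is an illustration only: plain dataclasses stand in for the SDK's Schema base, TextContent is a one-field stand-in, and the enum values and the "Error: " prefix are assumptions rather than the SDK's actual definitions.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum


class ObservationStatus(str, Enum):
    # Member values are assumed for this sketch.
    SUCCESS = "success"
    ERROR = "error"


@dataclass(frozen=True)
class TextContent:
    """Stand-in for the SDK's TextContent (only `text` is modeled)."""
    text: str


@dataclass(frozen=True)
class Observation(ABC):
    """Base: standardizes error handling and the to_llm_content contract only."""
    error: str | None = None

    @property
    def has_error(self) -> bool:
        return bool(self.error)

    @property
    def result_status(self) -> ObservationStatus:
        return ObservationStatus.ERROR if self.has_error else ObservationStatus.SUCCESS

    @property
    @abstractmethod
    def to_llm_content(self) -> list[TextContent]:
        """Subclasses implement their own formatting."""


@dataclass(frozen=True)
class GlobObservation(Observation):
    """Stores results in the shape natural for the tool: a list of paths."""
    files: list[str] = field(default_factory=list)

    @property
    def to_llm_content(self) -> list[TextContent]:
        if self.has_error:
            # Hypothetical error format, chosen for the sketch.
            return [TextContent(text=f"Error: {self.error}")]
        return [TextContent(text="\n".join(self.files))]


ok = GlobObservation(files=["a.py", "b.py"])
failed = GlobObservation(error="bad pattern")
```

Here the base class never sees `files`; it only relies on `error` and on every subclass providing `to_llm_content`.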


No code changes were made - this was a code review task as requested with the /codereview-roasted command. The detailed analysis provides actionable feedback for refactoring the observation base class to eliminate unnecessary type complexity.

openhands-agent and others added 2 commits November 1, 2025 09:30
- Added Observation.text_output() static method for easy text wrapping
- Added output_as_text property for easy text extraction
- Updated ExecuteBash and Glob to use new helpers
- Reduces ceremony of wrapping/unwrapping TextContent

This addresses the typing complexity while maintaining the standardized
'output' field across all observations (vs 'message', 'content', etc.)

Co-authored-by: openhands <openhands@all-hands.dev>
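
A rough sketch of what these two helpers might look like, assuming `output` is still a list of content items at this commit. The signatures are guesses based on the commit message; TextContent and ImageContent are simplified stand-ins, and the `data` field on ImageContent is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class TextContent:
    """Stand-in for the SDK's TextContent (only `text` is modeled)."""
    text: str


@dataclass(frozen=True)
class ImageContent:
    """Stand-in for the SDK's ImageContent; `data` is a hypothetical field."""
    data: str


@dataclass
class Observation:
    output: list[TextContent | ImageContent] = field(default_factory=list)

    @staticmethod
    def text_output(text: str) -> list[TextContent | ImageContent]:
        """Wrap a plain string as a single-item content list."""
        return [TextContent(text=text)]

    @property
    def output_as_text(self) -> str:
        """Return the first item's text, or "" if it is missing or not text."""
        first = self.output[0] if self.output else None
        return first.text if isinstance(first, TextContent) else ""


# Writing and reading no longer require manual wrapping or unwrapping.
obs = Observation(output=Observation.text_output("3 files matched"))
text = obs.output_as_text
```
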
@simonrosenberg
Collaborator Author

@OpenHands Read the diff of this branch versus main.
Let's update the type of Observation.output in tool/schema.py.

Current:

    output: list[TextContent | ImageContent] = Field(
        default_factory=list,
        ...

New type:

    output: str | list[TextContent | ImageContent] = Field(.....)

This way, every Observation subclass whose output should be a str (all observation subclasses other than MCP) can simply set output = <some str>.
For MCP, where the output is a list[TextContent | ImageContent], keep assigning a list of content items to output (the current code).

Update to_llm_content so that it checks the type of output and returns the correct LLM content.

Make sure all pre-commit checks and all tests pass.

@openhands-ai

openhands-ai bot commented Nov 2, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev

- Updated Observation.output type to str | list[TextContent | ImageContent]
- Most non-MCP tools now use simple str output for efficiency
- MCP tools continue to use list[TextContent | ImageContent] for rich content
- Updated to_llm_content to handle both types with isinstance checks
- Updated all observation subclasses (ExecuteBash, FileEditor, Browser) to use str
- Updated test helper functions to handle both output types
- All pre-commit checks and tests passing

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Nov 2, 2025

Summary of Changes

I have successfully updated the Observation.output type in the software-agent-sdk repository as requested. Here's what was accomplished:

✅ Request Checklist

  1. Updated Observation.output type in tool/schema.py:

    • Changed from: output: list[TextContent | ImageContent] = Field(default_factory=list, ...)
    • To: output: str | list[TextContent | ImageContent] = Field(default="", ...)
  2. Updated to_llm_content to handle both types:

    • Added isinstance(self.output, str) check
    • For str: converts to TextContent(text=self.output)
    • For list: extends with the existing list of content
  3. Updated all non-MCP Observation subclasses to use str output:

    • FileEditor: Changed 6 occurrences from output=[TextContent(text=...)] to output=...
    • ExecuteBash: Updated all output creation and property methods (raw_output, to_llm_content, visualize) to handle both types
    • Browser: Updated output creation to use simple strings
  4. MCP tools continue to use rich content:

    • Verified that MCPToolObservation still correctly uses output=content_with_header (list type)
    • No changes needed for MCP - works as expected
  5. Updated test helper functions to handle both output types in:

    • tests/tools/file_editor/conftest.py
    • tests/tools/browser_use/test_browser_executor_e2e.py
    • tests/tools/execute_bash/conftest.py

✅ Verification

  • Pre-commit checks: All passing ✓
  • File Editor tests: 135 tests passed ✓
  • Execute Bash tests: All tests passed ✓
  • Browser tests: 15 tests passed ✓
  • SDK tests: 1039 tests passed ✓
  • MCP tests: 47 tests passed ✓

📦 Changes Summary

11 files modified, with 102 insertions and 73 deletions:

  • Core schema update in openhands-sdk/openhands/sdk/tool/schema.py
  • All three main tools updated (file_editor, execute_bash, browser_use)
  • Test helpers updated to support both output types
  • Example code updated to handle the new type

The changes have been committed and pushed to the openhands/standardize-observation-base branch, updating PR #929. The implementation successfully provides flexibility for tools to use simple strings (more efficient) while maintaining backward compatibility for MCP tools that need rich content with images and structured text.
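
As a rough illustration of the union-typed field and the isinstance dispatch described above, the following sketch uses plain dataclasses instead of the SDK's models. TextContent and ImageContent are stand-ins, the `data` field is hypothetical, and the "Error: " prefix is an assumed format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TextContent:
    """Stand-in for the SDK's TextContent (only `text` is modeled)."""
    text: str


@dataclass(frozen=True)
class ImageContent:
    """Stand-in for the SDK's ImageContent; `data` is a hypothetical field."""
    data: str


@dataclass
class Observation:
    # Plain string for most tools, rich content list for MCP-style tools.
    output: str | list[TextContent | ImageContent] = ""
    error: str | None = None

    @property
    def to_llm_content(self) -> list[TextContent | ImageContent]:
        if self.error:
            # Assumed error format for the sketch.
            return [TextContent(text=f"Error: {self.error}")]
        if isinstance(self.output, str):
            # str case: wrap once, at the boundary to the LLM.
            return [TextContent(text=self.output)]
        # list case: pass the rich content through as a new list.
        return list(self.output)


simple = Observation(output="done")
rich_items = [TextContent(text="header"), ImageContent(data="placeholder")]
rich = Observation(output=rich_items)
```

Callers that read `output` directly still need to narrow the type themselves, which is why the test helpers mentioned above had to handle both cases.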


- Updated delegate, grep, glob, and task_tracker tools to use str output
- Simplified example 02_custom_tools.py to use assert for type narrowing
- Fixed test helper assert_browser_observation_success to handle both str and list[TextContent] output types
- Removed unused TextContent imports from tool implementations

This completes the migration to the new Observation.output type that supports both str and list[TextContent | ImageContent].

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Nov 2, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #929 at branch `openhands/standardize-observation-base`

Feel free to include any additional details that might help me get this PR into a better state.

@simonrosenberg
Collaborator Author

@OpenHands Please fix all test failures
https://github.com/OpenHands/software-agent-sdk/actions/runs/19013799998/job/54298654120?pr=929

Delegate now uses output: str in Observation
=================================== FAILURES ===================================
_______________________ test_delegate_executor_delegate ________________________
tests/tools/delegation/test_delegation.py:106: in test_delegate_executor_delegate
    assert isinstance(output_block, TextContent)
E   AssertionError: assert False
E    +  where False = isinstance('S', TextContent)
=============================== warnings summary ===============================
tests/tools/browser_use/test_browser_executor_e2e.py:170
/home/runner/_work/software-agent-sdk/software-agent-sdk/tests/tools/browser_use/test_browser_executor_e2e.py:170: PytestUnknownMarkWarning: Unknown pytest.mark.e2e - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
@pytest.mark.e2e

@openhands-ai

openhands-ai bot commented Nov 3, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev

The new Observation base class now uses output: str | list[TextContent | ImageContent]
instead of just list[TextContent]. The DelegateExecutor correctly returns observations
with output as a string, so the tests need to be updated to check the string directly
instead of treating it as a list and accessing output[0].

Changes:
- Updated test_delegate_observation_creation to create observations with string output
- Updated test_delegate_executor_delegate to check output as a string
- Verified to_llm_content property returns the expected TextContent list

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Nov 3, 2025

Summary

I've successfully fixed the test failure in PR #929 related to the standardized Observation base class changes.

Problem

The test test_delegate_executor_delegate was failing because it expected DelegateObservation.output to be a list of TextContent objects (the old behavior), but the new standardized Observation base class defines output: str | list[TextContent | ImageContent]. The DelegateExecutor implementation correctly returns observations with output as a string, but the tests were still treating it as a list and accessing output[0], which resulted in getting just the first character 'S' instead of the expected content.
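
The failure mode is easy to reproduce in isolation. In this sketch the output string is a made-up value and TextContent is a minimal stand-in for the SDK class:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextContent:
    """Stand-in for the SDK's TextContent (only `text` is modeled)."""
    text: str


output = "Some delegate result"  # hypothetical string output

# Old test logic: treat output as a list and take the first item.
# Indexing a str yields a one-character str, not a TextContent.
first = output[0]
old_check_passes = isinstance(first, TextContent)

# Updated test logic: check the string directly.
new_check_passes = isinstance(output, str) and "delegate" in output
```
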

Changes Made

Updated tests/tools/delegation/test_delegation.py:

  • test_delegate_observation_creation: Changed to create observations with string output and verify both the string output and the to_llm_content property
  • test_delegate_executor_delegate: Updated to check output as a string directly instead of treating it as a list

Verification

✅ All 5 delegation tests now pass
✅ Pre-commit hooks pass (ruff format, ruff lint, pycodestyle, pyright)
✅ Changes committed and pushed to branch openhands/standardize-observation-base

The fix is minimal, focused, and consistent with the new Observation base schema.

@simonrosenberg requested a review from enyst November 3, 2025 08:32

Labels

  • integration-test: Runs the integration tests and comments the results
  • test-examples: Run all applicable "examples/" files. Expensive operation.

Development

Successfully merging this pull request may close these issues.

refactor: standardize Observation class

4 participants