@simonrosenberg simonrosenberg commented Nov 25, 2025

Simplify Cross-Repository Evaluation Workflow Orchestration

Fixes #1265

Summary

This PR simplifies the SDK evaluation workflow by moving orchestration complexity from GitHub Actions (YAML) into Python code in the evaluation repository where it naturally belongs.

Before

  • SDK workflow: 327-line YAML with complex polling logic (~100 lines)
  • Made 40+ GitHub API calls per evaluation to track the benchmarks build
  • Timestamp-based run identification (vulnerable to race conditions)
  • Orchestration split between YAML, bash, and multiple repositories

After

  • SDK workflow: Simplified to ~200 lines
  • Removed all polling logic
  • Single dispatch to evaluation repo with all parameters
  • Evaluation repo handles orchestration in Python (testable, maintainable)

Changes

Software Agent SDK Repository

1. Simplified Workflow (run-eval.yml)

Removed:

  • 100+ lines of benchmarks build polling logic
  • GitHub API rate limit concerns
  • Complex timestamp-based run identification
  • Wait/retry loops

Streamlined to:

  1. Validate inputs & authorization
  2. Resolve the SDK commit SHA
  3. Resolve model configurations from models.json
  4. Dispatch the evaluation workflow with all parameters (see the sketch below)
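
A minimal sketch of what step 4 amounts to, assuming a plain requests call to GitHub's standard workflow-dispatch endpoint; the target repository, workflow file name, secret name, and input keys below are illustrative placeholders rather than the actual contract:

import os
import requests

EVAL_REPO = "OpenHands/evaluation"      # hypothetical target repository
WORKFLOW_FILE = "run-evaluation.yml"    # hypothetical workflow file name

def dispatch_evaluation(inputs: dict[str, str], ref: str = "main") -> None:
    """Trigger the evaluation workflow once, with all parameters attached."""
    url = f"https://api.github.com/repos/{EVAL_REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches"
    headers = {
        "Authorization": f"Bearer {os.environ['EVAL_DISPATCH_TOKEN']}",  # assumed secret name
        "Accept": "application/vnd.github+json",
    }
    resp = requests.post(url, headers=headers, json={"ref": ref, "inputs": inputs}, timeout=30)
    resp.raise_for_status()  # the API returns 204 No Content on success

if __name__ == "__main__":
    dispatch_evaluation({
        "sdk_sha": "0f06ad7",
        "model_configs": '[{"model": "litellm_proxy/claude-sonnet-4-5-20250929"}]',
        "reason": "Manual-trigger-example",
    })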

2. Model Configuration Management

Created resolve_model_configs.py:

  • Validates model IDs against the allowed list in models.json
  • Resolves full model configurations, including:
    • LiteLLM model identifiers
    • Temperature and other model-specific parameters
  • Returns properly formatted JSON for the evaluation workflow
  • Handles both single model IDs and comma-separated lists (see the sketch below)
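
A hedged illustration of what resolve_model_configs.py does (the models.json layout, environment variable names, and error format below are assumptions, not the script's actual interface):

import json
import os
import sys

def find_models_by_id(model_ids: list[str], models: dict[str, dict]) -> list[dict]:
    """Look up full model configs by ID, preserving input order and failing fast on unknown IDs."""
    missing = [m for m in model_ids if m not in models]
    if missing:
        sys.exit(f"Unknown model IDs: {missing}. Available: {sorted(models)}")
    return [models[m] for m in model_ids]

def main() -> None:
    # e.g. MODELS_JSON_PATH=.github/run-eval/models.json (assumed to map model ID -> full config)
    with open(os.environ["MODELS_JSON_PATH"]) as f:
        models = json.load(f)
    # MODEL_IDS may be a single ID or a comma-separated list
    model_ids = [m.strip() for m in os.environ["MODEL_IDS"].split(",") if m.strip()]
    print(json.dumps(find_models_by_id(model_ids, models)))

if __name__ == "__main__":
    main()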

Replaced model stubs with single source of truth:

  • Removed .github/run-eval/llm_config_model_stubs.json
  • Now using .github/run-eval/models.json as the single authoritative source
  • Simplifies maintenance: one file to update for all model configurations

Benefits:

  • Single file to maintain for model configs
  • No risk of inconsistency between stubs and full configs
  • Easier to add new models
  • Validation happens before evaluation starts

3. Enhanced Input Validation

Added authorization check:

  • Only authorized users (listed in authorized-labelers.txt) can trigger evaluations
  • Applies to both workflow_dispatch and pull_request_target events
  • Prevents unauthorized resource usage

Model ID validation:

  • Validates comma-separated model IDs against models.json
  • Provides clear error messages with available options
  • Fails fast before expensive evaluation starts

4. Testing Feature Branches

New eval_branch input:

  • Allows testing evaluation workflow changes before merging
  • Defaults to main for normal operations
  • Enables safe development and validation of orchestration changes

5. Trigger Reason Propagation

New reason input:

  • Accepts natural language descriptions with spaces
  • Automatically sanitized for Helm compatibility (spaces → hyphens)
  • Passed to evaluation workflow for Slack notifications and logging
  • Provides context for why evaluations were triggered

Code Review Updates

After initial implementation, addressed code review feedback to improve reliability:

1. Fixed Workflow Run Detection (HIGH PRIORITY)

Problem: Timestamp-based detection was unreliable and vulnerable to race conditions

Solution: Implemented baseline run ID comparison using Check Suites API

  • Get the baseline run ID before triggering the workflow
  • After triggering, poll for a new run whose check_suite.id is greater than the baseline
  • Deterministic and immune to the earlier race condition
  • No reliance on timestamps (see the sketch below)
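
A hedged sketch of that comparison, using GitHub's standard "list workflow runs" endpoint (which exposes check_suite_id on each run); the repository, workflow, and token handling here are illustrative, not the actual implementation:

import time
import requests

def list_runs(repo: str, workflow: str, token: str) -> list[dict]:
    """Return the most recent runs of a workflow."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/actions/workflows/{workflow}/runs",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"},
        params={"per_page": 10},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["workflow_runs"]

def wait_for_new_run(repo: str, workflow: str, token: str, baseline_id: int, attempts: int = 30) -> dict:
    """Poll until a run whose check_suite_id exceeds the baseline appears, then return it."""
    for _ in range(attempts):
        newer = [r for r in list_runs(repo, workflow, token) if r["check_suite_id"] > baseline_id]
        if newer:
            return min(newer, key=lambda r: r["check_suite_id"])  # earliest run after the baseline
        time.sleep(10)
    raise TimeoutError("No new workflow run appeared after the baseline")

# Usage: record the highest check_suite_id before dispatching, then call wait_for_new_run().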

2. Removed Sanitize-Inputs Job (MEDIUM PRIORITY)

Problem: Unnecessary GitHub Actions job that could be done in Python

Solution: Moved sanitization logic to orchestrate_eval.py

  • Trigger reason now sanitized in Python before the Helm deployment
  • Reduces workflow complexity
  • More testable and maintainable (see the sketch below)
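
For illustration, a minimal sketch of the sanitization step, assuming a helper named sanitize_reason in orchestrate_eval.py (the real rules may differ):

import re

def sanitize_reason(reason: str) -> str:
    """Make a free-text trigger reason safe for Helm values/labels: spaces become hyphens."""
    collapsed = re.sub(r"\s+", "-", reason.strip())       # whitespace -> single hyphens
    return re.sub(r"[^A-Za-z0-9._-]", "", collapsed)      # drop any other unsafe characters

assert sanitize_reason("Testing fix for helm deployment issue") == "Testing-fix-for-helm-deployment-issue"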

3. Added Comprehensive Tests

  • Unit tests for input sanitization
  • Tests for the baseline run ID detection logic
  • Edge-case handling (see the example below)
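
An example of the kind of unit test described, reusing the hypothetical sanitize_reason helper from the sketch above (the real tests may be organized differently):

from orchestrate_eval import sanitize_reason  # assumed module and helper name

def test_spaces_become_hyphens():
    assert sanitize_reason("Testing fix for helm deployment issue") == "Testing-fix-for-helm-deployment-issue"

def test_already_clean_string_is_unchanged():
    assert sanitize_reason("nightly-run") == "nightly-run"

def test_empty_reason_stays_empty():
    assert sanitize_reason("") == ""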

Validation

Full Workflow Testing

Test 1: Valid Model ID

Test 2: Evaluation Deployment

Test 3: Invalid Model ID (Validation)

Test 4: Code Review Fixes - End-to-End

Trigger Reason Feature

Input: "Testing fix for helm deployment issue"
Sanitized: "Testing-fix-for-helm-deployment-issue"
Result: ✅ Successfully passed to Kubernetes pod via helm_args

Impact

Reduced Complexity

  • SDK workflow: 327 → ~200 lines (-40%)
  • Polling logic: Removed entirely
  • API calls: 40+ → 1 per evaluation

Improved Maintainability

  • Orchestration logic now in Python (testable)
  • Single source of truth for model configs
  • Clear separation of concerns
  • Deterministic workflow run detection

Better Error Handling

  • Fast-fail validation before expensive operations
  • Clear error messages
  • Authorization checks

Enhanced Features

  • Support for testing feature branches
  • Trigger reason tracking
  • Model ID validation

Testing

  • SDK workflow triggers successfully
  • Model ID validation works
  • Authorization checks function correctly
  • Evaluation workflow receives correct parameters
  • Kubernetes deployment succeeds
  • Trigger reason sanitization and propagation works
  • Invalid inputs are properly rejected
  • Baseline run ID detection works correctly
  • Python-based sanitization works correctly
  • End-to-end workflow with code review fixes succeeds

Related PRs


Co-authored-by: openhands <openhands@all-hands.dev>


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant | Architectures | Base Image
java    | amd64, arm64  | eclipse-temurin:17-jdk
python  | amd64, arm64  | nikolaik/python-nodejs:python3.12-nodejs22
golang  | amd64, arm64  | golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:0f06ad7-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-0f06ad7-python \
  ghcr.io/openhands/agent-server:0f06ad7-python

All tags pushed for this build

ghcr.io/openhands/agent-server:0f06ad7-golang-amd64
ghcr.io/openhands/agent-server:0f06ad7-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:0f06ad7-golang-arm64
ghcr.io/openhands/agent-server:0f06ad7-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:0f06ad7-java-amd64
ghcr.io/openhands/agent-server:0f06ad7-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:0f06ad7-java-arm64
ghcr.io/openhands/agent-server:0f06ad7-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:0f06ad7-python-amd64
ghcr.io/openhands/agent-server:0f06ad7-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:0f06ad7-python-arm64
ghcr.io/openhands/agent-server:0f06ad7-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:0f06ad7-golang
ghcr.io/openhands/agent-server:0f06ad7-java
ghcr.io/openhands/agent-server:0f06ad7-python

About Multi-Architecture Support

  • Each variant tag (e.g., 0f06ad7-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 0f06ad7-python-amd64) are also available if needed

- Remove benchmarks build dispatch and polling steps (327 -> ~180 lines)
- Remove polling configuration environment variables
- Orchestration now handled by evaluation repository's Kubernetes job
- Workflow now only validates inputs and dispatches evaluation

This change:
- Reduces GitHub Actions API calls from 40+ to ~2 per evaluation
- Eliminates complex polling logic from YAML
- Improves maintainability and error handling
- Makes the full evaluation flow easier to understand

Co-authored-by: openhands <openhands@all-hands.dev>
Allow specifying which evaluation repo branch to use when dispatching
the evaluation workflow, making it easier to test cross-repo changes.

Co-authored-by: openhands <openhands@all-hands.dev>
- Add models.json with full model configurations to SDK repo
- Add resolve_model_configs.py to lookup model configs by ID
- Update run-eval workflow to resolve and pass model configs
- Change evaluation workflow input from model IDs to model configs

Co-authored-by: openhands <openhands@all-hands.dev>
- Extend authorization check to also validate workflow_dispatch triggers
- Same authorized-labelers.txt list now applies to both PR labels and manual triggers
- Ensures only authorized users can trigger evaluations via any method

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg simonrosenberg force-pushed the openhands/orchestration-refactor branch from 45d1ecd to 7b1bc9a on November 26, 2025 at 08:11
- Pass reason input from SDK workflow to evaluation workflow
- Reason is shown in Slack notifications when provided

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg simonrosenberg self-assigned this Nov 26, 2025
@simonrosenberg simonrosenberg requested a review from enyst November 26, 2025 11:20
openhands-agent and others added 4 commits November 26, 2025 11:25
- Delete allowed-model-stubs.json (redundant with models.json)
- Extract allowed model IDs directly from models.json
- Rename workflow input from 'model_stubs' to 'model_ids' for clarity
- Update validation logic to use model IDs from models.json
- Improve error messages to show available models on validation failure

This simplifies configuration management by having a single source
of truth for model definitions.

Co-authored-by: openhands <openhands@all-hands.dev>
- Extract find_models_by_id() for better testability and separation of concerns
- Use consistent error handling with get_required_env() and error_exit()
- Standardize error message format (remove redundant 'ERROR:' prefixes)

Co-authored-by: openhands <openhands@all-hands.dev>
Tests cover:
- Single and multiple model lookups
- Order preservation
- Missing model error handling
- Empty list handling
- Full config preservation

Co-authored-by: openhands <openhands@all-hands.dev>
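
For illustration, building on the hypothetical find_models_by_id sketch in the PR description above, tests with the coverage listed in this commit might look like:

import pytest
from resolve_model_configs import find_models_by_id  # assumed module path

MODELS = {
    "claude-sonnet-4-5-20250929": {"model": "litellm_proxy/claude-sonnet-4-5-20250929"},
    "example-model": {"model": "litellm_proxy/example-model"},  # hypothetical second entry
}

def test_multiple_lookups_preserve_order_and_full_config():
    configs = find_models_by_id(["example-model", "claude-sonnet-4-5-20250929"], MODELS)
    assert configs == [MODELS["example-model"], MODELS["claude-sonnet-4-5-20250929"]]

def test_missing_model_fails_fast():
    with pytest.raises(SystemExit):
        find_models_by_id(["unknown-model"], MODELS)

def test_empty_list_returns_empty_list():
    assert find_models_by_id([], MODELS) == []
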
@simonrosenberg simonrosenberg marked this pull request as ready for review November 26, 2025 16:00
if [ -n "$RUN_ID" ] && [ "$RUN_ID" != "null" ]; then
  echo "Found workflow run: $RUN_ID"
else
  echo "Waiting for workflow run to appear (attempt $i/$MAX_ATTEMPTS)..."
Collaborator

Out of curiosity, what happens with all this code building from benchmarks, how will it work now?

Collaborator

We moved it to the Kubernetes workflow in our internal OpenHands/evaluation repo 😓 We don't want to keep polling stuff in CI since CI compute is likely more expensive than K8S, and keeping this centralized in one place is a bit easier to maintain in the long run anyway

Collaborator

That's fine if the results are transparent and reproducible IMHO

Collaborator

@enyst enyst left a comment

I understand this is accurate, right?

Applies to both workflow_dispatch and pull_request_target events

Will it comment on PRs, in the case of runs on PRs?

The comment is very valuable because it has logs - or it did, on V0. In particular, when people get a not-so-good result, I think maybe we need to offer all the data they need to dig in.


enyst commented Nov 26, 2025

I think I asked this before, sorry, but I just checked yesterday's PR and I didn't see any comment in response to the eval-1 label?



def main() -> None:
    models_json_path = get_required_env("MODELS_JSON_PATH")
Collaborator

How about we write models.json as a dict explicitly in this file since it seems it is only used here?

echo "User $LABELER is not authorized to trigger eval." >&2
ACTOR="${{ github.actor }}"
if ! grep -Fx "$ACTOR" .github/run-eval/authorized-actors.txt >/dev/null; then
echo "User $ACTOR is not authorized to trigger eval." >&2
Collaborator

Maybe we can just remove this authorization? Only maintainers, or at least people with triage access to the repo, can label a PR, so I think it should be fine even if we don't check github.actor

Collaborator

Just to note, this might have consequences the other way around though: we basically haven't had new maintainers from the community in a long time. If we add an additional criterion, it might become even a little harder?

Collaborator

This goes both ways. I don't know. How about we let it sink in and give it some thought for a while?

- Remove ACTOR authorization logic and authorized-actors.txt
- Embed model configs in resolve_model_configs.py, remove models.json
- Pass PR_NUMBER to evaluation workflow for result comments
- Fix Python script execution using heredoc for proper YAML formatting

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg simonrosenberg added the run-eval-1 Runs evaluation on 1 SWE-bench instance label Nov 26, 2025
@github-actions
Contributor

Evaluation Triggered

@OpenHands OpenHands deleted a comment from openhands-ai bot Nov 26, 2025
This allows dispatching to a specific benchmarks repo branch instead of
always using main, enabling end-to-end testing of feature branches.

Co-authored-by: openhands <openhands@all-hands.dev>
…el_config

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Contributor

Evaluation Triggered

  • Trigger: Manual trigger: Retry: Testing PR comment functionality after benchmarks build fix
  • SDK: 45afbc3
  • Eval limit: 1
  • Models: claude-sonnet-4-5-20250929

@github-actions
Contributor

Evaluation Triggered

  • Trigger: Manual trigger: Testing temporary branch workflow matching implementation
  • SDK: 45afbc3
  • Eval limit: 1
  • Models: claude-sonnet-4-5-20250929

@all-hands-bot
Collaborator

🎉 Evaluation Job Completed

Evaluation Name: sdk-main-19764754304-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: 45afbc3046d90b78918ec12ba14229c24887eb5e
Timestamp: 2025-11-28 13:26:16 UTC
Reason: Testing-temporary-branch-workflow-matching-implementation

Results Summary

  • Total instances: 500
  • Submitted instances: 1
  • Resolved instances: 1
  • Unresolved instances: 0
  • Empty patch instances: 0
  • Error instances: 0
  • Success rate: 1/1 (100.0%)

View Metadata | View Results | Download Full Results

@simonrosenberg
Collaborator Author

🎉 Evaluation Job Completed

Evaluation Name: sdk-main-19764754304-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: 45afbc3046d90b78918ec12ba14229c24887eb5e
Timestamp: 2025-11-28 13:26:16 UTC
Reason: Testing-temporary-branch-workflow-matching-implementation

Results Summary

  • Total instances: 500
  • Submitted instances: 1
  • Resolved instances: 1
  • Unresolved instances: 0
  • Empty patch instances: 0
  • Error instances: 0
  • Success rate: 1/1 (100.0%)

View Metadata | View Results | Download Full Results

@enyst FYI

simonrosenberg and others added 3 commits November 28, 2025 15:19
Rename test_resolve_model_configs.py to test_resolve_model_config.py
to match the actual module name (resolve_model_config.py, singular).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Consolidate SHA resolution logic into a single step instead of
separate checkout and resolution steps. This eliminates unnecessary
git operations and makes the workflow more efficient.

Changes:
- Remove temporary pr_number input parameter (testing complete)
- Resolve SDK commit SHA directly in params step for all event types
  - pull_request_target: Use github.event.pull_request.head.sha
  - release: Use github.event.release.target_commitish
  - workflow_dispatch: Use git rev-parse to convert ref to SHA
- Remove redundant "Checkout evaluated ref for PRs" step
- Remove redundant "Resolve SDK commit SHA for evaluation" step
- Update model_ids description to reference MODELS dict location

This reduces workflow execution time and simplifies maintenance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The previous version tried to checkout github.event.inputs.sdk_ref directly,
which fails when a short SHA is provided because actions/checkout@v4 with
fetch-depth: 0 cannot resolve short SHAs.

Solution: Use github.ref (the branch the workflow runs on) for initial checkout.
This works because:
- For workflow_dispatch: The workflow runs on the branch specified in UI
- For pull_request_target: Uses the PR base branch ref
- For release: Uses the release tag/branch

The actual SDK SHA for evaluation is still correctly resolved later in the
params step (line 136-138) using git rev-parse.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

enyst commented Nov 28, 2025

🎉 Evaluation Job Completed

Evaluation Name: sdk-main-19764754304-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: 45afbc3046d90b78918ec12ba14229c24887eb5e
Timestamp: 2025-11-28 13:26:16 UTC
Reason: Testing-temporary-branch-workflow-matching-implementation

Results Summary

  • Total instances: 500
  • Submitted instances: 1
  • Resolved instances: 1
  • Unresolved instances: 0
  • Empty patch instances: 0
  • Error instances: 0
  • Success rate: 1/1 (100.0%)

View Metadata | View Results | Download Full Results

@enyst FYI

Thank you, this is starting to work! It doesn't have any links to "Download Full Results" and the rest, though? They look like they would be links, but they're not.

We need links to look into results, IMHO; numbers alone don't tell much. Especially for debugging bad results or simply understanding them.

Edited to add:
For example, like we had before at least?

TEMPORARY CHANGE - will be reverted after testing

This adds a pr_number workflow input so we can test the clickable
GCS links fix in PR comments without needing to trigger via label.

Testing plan:
1. Trigger workflow with pr_number=1267
2. Verify PR comment has clickable HTTPS links
3. Revert this commit after successful test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions
Contributor

Evaluation Triggered

  • Trigger: Manual trigger: Test-clickable-GCS-links-in-PR-comments
  • SDK:
  • Eval limit: 1
  • Models: claude-sonnet-4-5-20250929

@all-hands-bot
Collaborator

🎉 Evaluation Job Completed

Evaluation Name: sdk-main-19769532756-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: e5bbafee3c19cde57a57bfe41f9b9b256043e9e6
Timestamp: 2025-11-28 17:02:06 UTC
Reason: Test-clickable-GCS-links-in-PR-comments

Results Summary

  • Total instances: 500
  • Submitted instances: 1
  • Resolved instances: 1
  • Unresolved instances: 0
  • Empty patch instances: 0
  • Error instances: 0
  • Success rate: 1/1 (100.0%)

View Metadata | View Results | Download Full Results

@simonrosenberg
Collaborator Author

🎉 Evaluation Job Completed

Evaluation Name: sdk-main-19769532756-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: e5bbafee3c19cde57a57bfe41f9b9b256043e9e6
Timestamp: 2025-11-28 17:02:06 UTC
Reason: Test-clickable-GCS-links-in-PR-comments

Results Summary

  • Total instances: 500
  • Submitted instances: 1
  • Resolved instances: 1
  • Unresolved instances: 0
  • Empty patch instances: 0
  • Error instances: 0
  • Success rate: 1/1 (100.0%)

View Metadata | View Results | Download Full Results

@enyst


enyst commented Nov 28, 2025

Unfortunately, those links are not public, not accessible outside AH I assume. 😢
@simonrosenberg

I believe that's why Mamoodi had, in the past, made a full log available via GitHub Actions. Not sure it was a great way, but it was a way.


github-actions bot commented Dec 1, 2025

Evaluation Triggered

  • Trigger: Manual trigger: Test-public-bucket-access-and-direct-storage-URLs
  • SDK:
  • Eval limit: 1
  • Models: claude-sonnet-4-5-20250929


github-actions bot commented Dec 1, 2025

Evaluation Triggered

  • Trigger: Manual trigger: Test-public-bucket-and-PR-comments-with-feature-branches
  • SDK:
  • Eval limit: 1
  • Models: claude-sonnet-4-5-20250929

@all-hands-bot
Collaborator

🎉 Evaluation Job Completed

Evaluation Name: sdk-main-19822239823-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: e5bbafee3c19cde57a57bfe41f9b9b256043e9e6
Timestamp: 2025-12-01 12:26:44 UTC
Reason: Test-public-bucket-and-PR-comments-with-feature-branches

Results Summary

  • Total instances: 500
  • Submitted instances: 1
  • Resolved instances: 1
  • Unresolved instances: 0
  • Empty patch instances: 0
  • Error instances: 0
  • Success rate: 1/1 (100.0%)

View Metadata | View Results | Download Full Results


github-actions bot commented Dec 1, 2025

Evaluation Triggered

  • Trigger: Manual trigger: Test-public-bucket-and-PR-comments-with-feature-branches
  • SDK:
  • Eval limit: 1
  • Models: claude-sonnet-4-5-20250929

@all-hands-bot
Collaborator

🎉 Evaluation Job Completed

Evaluation Name: sdk-main-19832354049-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: 506abb6ea82906e46350f334716d7fd7805969f2
Timestamp: 2025-12-01 18:10:42 UTC
Reason: Test-public-bucket-and-PR-comments-with-feature-branches

Results Summary

  • Total instances: 500
  • Submitted instances: 1
  • Resolved instances: 1
  • Unresolved instances: 0
  • Empty patch instances: 0
  • Error instances: 0
  • Success rate: 1/1 (100.0%)

View Metadata | View Results | Download Full Results


enyst commented Dec 1, 2025

Just to note, it still doesn't like me. 😢

View Metadata | View Results | Download Full Results

AccessDenied Access denied.
Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).


openhands-ai bot commented Dec 2, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Eval
    • Run Eval
    • Run Eval

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1267 at branch `openhands/orchestration-refactor`

Feel free to include any additional details that might help me get this PR into a better state.


@simonrosenberg
Collaborator Author

Just to note, it still doesn't like me. 😢

View Metadata | View Results | Download Full Results

AccessDenied Access denied.

It's work in progress!

@simonrosenberg
Collaborator Author

@enyst btw the URL fixing is done in a completely different repo; if you want to review the current code it would be awesome :)

Collaborator

@enyst enyst left a comment

At a cursory look, it seems to me like we could maybe simplify some code. Let's take it in though, and have some fun!

If it works now, we can always adjust things as they come.

@simonrosenberg simonrosenberg merged commit 70797e7 into main Dec 2, 2025
21 checks passed
@simonrosenberg simonrosenberg deleted the openhands/orchestration-refactor branch December 2, 2025 14:44
@simonrosenberg
Collaborator Author

At a cursory look, it seems to me like we could maybe simplify some code. Let's take it in though, and have some fun!

If it works now, we can always adjust things as they come.

thank you!!


Labels

run-eval-1 Runs evaluation on 1 SWE-bench instance


Development

Successfully merging this pull request may close these issues.

Improve Cross-Repository Evaluation Workflow Orchestration - Second Attempt
