Simplify evaluation workflow by removing benchmarks build polling #1267
Conversation
- Remove benchmarks build dispatch and polling steps (327 -> ~180 lines)
- Remove polling configuration environment variables
- Orchestration now handled by evaluation repository's Kubernetes job
- Workflow now only validates inputs and dispatches evaluation

This change:
- Reduces GitHub Actions API calls from 40+ to ~2 per evaluation
- Eliminates complex polling logic from YAML
- Improves maintainability and error handling
- Makes the full evaluation flow easier to understand

Co-authored-by: openhands <openhands@all-hands.dev>
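For context, the dispatch-only step described above amounts to a single API call. Below is a minimal sketch, assuming a requests-based helper, a GH_TOKEN environment variable, and a run-eval.yml workflow in the internal OpenHands/evaluation repo; the actual workflow does this from YAML, and these names are illustrative assumptions.

```python
# Hypothetical sketch: dispatch the evaluation workflow with one API call,
# instead of dispatching a benchmarks build and polling for its completion.
# Repo path, workflow filename, and token variable are assumptions.
import os

import requests


def dispatch_evaluation(ref: str, inputs: dict) -> None:
    """Fire-and-forget dispatch; orchestration happens in the evaluation repo."""
    url = (
        "https://api.github.com/repos/OpenHands/evaluation"
        "/actions/workflows/run-eval.yml/dispatches"
    )
    resp = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"ref": ref, "inputs": inputs},
        timeout=30,
    )
    resp.raise_for_status()  # the API returns 204 No Content on success
```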
Allow specifying which evaluation repo branch to use when dispatching the evaluation workflow, making it easier to test cross-repo changes. Co-authored-by: openhands <openhands@all-hands.dev>
- Add models.json with full model configurations to SDK repo
- Add resolve_model_configs.py to lookup model configs by ID
- Update run-eval workflow to resolve and pass model configs
- Change evaluation workflow input from model IDs to model configs

Co-authored-by: openhands <openhands@all-hands.dev>
- Extend authorization check to also validate workflow_dispatch triggers
- Same authorized-labelers.txt list now applies to both PR labels and manual triggers
- Ensures only authorized users can trigger evaluations via any method

Co-authored-by: openhands <openhands@all-hands.dev>
Force-pushed from 45d1ecd to 7b1bc9a
- Pass reason input from SDK workflow to evaluation workflow
- Reason is shown in Slack notifications when provided

Co-authored-by: openhands <openhands@all-hands.dev>
- Delete allowed-model-stubs.json (redundant with models.json)
- Extract allowed model IDs directly from models.json
- Rename workflow input from 'model_stubs' to 'model_ids' for clarity
- Update validation logic to use model IDs from models.json
- Improve error messages to show available models on validation failure

This simplifies configuration management by having a single source of truth for model definitions.

Co-authored-by: openhands <openhands@all-hands.dev>
- Extract find_models_by_id() for better testability and separation of concerns
- Use consistent error handling with get_required_env() and error_exit()
- Standardize error message format (remove redundant 'ERROR:' prefixes)

Co-authored-by: openhands <openhands@all-hands.dev>
Tests cover:
- Single and multiple model lookups
- Order preservation
- Missing model error handling
- Empty list handling
- Full config preservation

Co-authored-by: openhands <openhands@all-hands.dev>
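A sketch of what such tests can look like, assuming find_models_by_id(model_ids, models) accepts an explicit mapping and raises KeyError on unknown IDs; the real signature, module layout, and error behaviour may differ.

```python
# Hypothetical test sketch; the function signature, the explicit MODELS mapping,
# and the KeyError on unknown IDs are assumptions, not the repo's actual code.
import pytest

from resolve_model_config import find_models_by_id

MODELS = {
    "gpt-5": {"model": "openai/gpt-5", "temperature": 0.0},
    "sonnet": {"model": "anthropic/claude-sonnet-4", "temperature": 0.0},
}


def test_single_lookup_preserves_full_config():
    assert find_models_by_id(["gpt-5"], MODELS) == [MODELS["gpt-5"]]


def test_multiple_lookups_preserve_order():
    assert find_models_by_id(["sonnet", "gpt-5"], MODELS) == [
        MODELS["sonnet"],
        MODELS["gpt-5"],
    ]


def test_missing_model_errors():
    # Exact error type is an assumption; the script may call error_exit() instead.
    with pytest.raises(KeyError):
        find_models_by_id(["does-not-exist"], MODELS)


def test_empty_list_returns_empty():
    assert find_models_by_id([], MODELS) == []
```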
if [ -n "$RUN_ID" ] && [ "$RUN_ID" != "null" ]; then
  echo "Found workflow run: $RUN_ID"
else
  echo "Waiting for workflow run to appear (attempt $i/$MAX_ATTEMPTS)..."
Out of curiosity, what happens to all this code that builds from benchmarks? How will it work now?
We moved it to the Kubernetes workflow in our internal OpenHands/evaluation repo 😓 We don't want to keep polling stuff in CI, since CI compute is likely more expensive than K8s, and keeping this centralized in one place is a bit easier to maintain in the long run anyway.
That's fine if the results are transparent and reproducible IMHO
enyst left a comment
I understand this is accurate, right?
Applies to both workflow_dispatch and pull_request_target events
Will it comment on PRs, in the case of runs on PRs?
The comment is very valuable because it has logs - or it did, on V0. In particular, when people get a not-so-good result, I think maybe we need to offer all the data they need to dig in.
I think I asked this before, sorry, but I just checked yesterday's PR, and I didn't see any comment in response to
def main() -> None:
    models_json_path = get_required_env("MODELS_JSON_PATH")
How about we write models.json as a dict explicitly in this file since it seems it is only used here?
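If that route is taken (and a later commit in this PR does embed the configs), the module could look roughly like this. A minimal sketch: the MODELS entries, field names, and the KeyError behaviour are illustrative assumptions, not the actual configuration.

```python
# Sketch of embedding the model configs directly in resolve_model_config.py
# instead of shipping a separate models.json; entries and fields are
# illustrative assumptions.
MODELS: dict[str, dict] = {
    "gpt-5": {"model": "openai/gpt-5", "temperature": 0.0},
    "sonnet": {"model": "anthropic/claude-sonnet-4", "temperature": 0.0},
}


def find_models_by_id(
    model_ids: list[str], models: dict[str, dict] = MODELS
) -> list[dict]:
    """Return full configs in the requested order; fail loudly on unknown IDs."""
    missing = [m for m in model_ids if m not in models]
    if missing:
        raise KeyError(
            f"Unknown model id(s): {missing}. Available: {sorted(models)}"
        )
    return [models[m] for m in model_ids]
```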
.github/workflows/run-eval.yml (Outdated)
| echo "User $LABELER is not authorized to trigger eval." >&2 | ||
| ACTOR="${{ github.actor }}" | ||
| if ! grep -Fx "$ACTOR" .github/run-eval/authorized-actors.txt >/dev/null; then | ||
| echo "User $ACTOR is not authorized to trigger eval." >&2 |
Maybe we can just remove this authorization? Only maintainers, or at least people with triage access to the repo, can tag a PR, so I think it should be fine even if we don't check github.actor.
Just to note, this might have consequences the other way around though: we basically haven't had new maintainers from the community in a long time. If we add an additional criterion, it might become even a little harder?
This goes both ways. I don't know. How about we let it sink in and give it some thought for a while?
- Remove ACTOR authorization logic and authorized-actors.txt
- Embed model configs in resolve_model_configs.py, remove models.json
- Pass PR_NUMBER to evaluation workflow for result comments
- Fix Python script execution using heredoc for proper YAML formatting

Co-authored-by: openhands <openhands@all-hands.dev>
Evaluation Triggered
This allows dispatching to a specific benchmarks repo branch instead of always using main, enabling end-to-end testing of feature branches. Co-authored-by: openhands <openhands@all-hands.dev>
…el_config Co-authored-by: openhands <openhands@all-hands.dev>
Evaluation Triggered
Evaluation Triggered
🎉 Evaluation Job Completed
Evaluation Name:
Results Summary

View Metadata | View Results | Download Full Results
@enyst FYI
Rename test_resolve_model_configs.py to test_resolve_model_config.py to match the actual module name (resolve_model_config.py, singular). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Consolidate SHA resolution logic into a single step instead of separate checkout and resolution steps. This eliminates unnecessary git operations and makes the workflow more efficient.

Changes:
- Remove temporary pr_number input parameter (testing complete)
- Resolve SDK commit SHA directly in params step for all event types
  - pull_request_target: Use github.event.pull_request.head.sha
  - release: Use github.event.release.target_commitish
  - workflow_dispatch: Use git rev-parse to convert ref to SHA
- Remove redundant "Checkout evaluated ref for PRs" step
- Remove redundant "Resolve SDK commit SHA for evaluation" step
- Update model_ids description to reference MODELS dict location

This reduces workflow execution time and simplifies maintenance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The previous version tried to checkout github.event.inputs.sdk_ref directly, which fails when a short SHA is provided because actions/checkout@v4 with fetch-depth: 0 cannot resolve short SHAs.

Solution: Use github.ref (the branch the workflow runs on) for the initial checkout. This works because:
- For workflow_dispatch: The workflow runs on the branch specified in the UI
- For pull_request_target: Uses the PR base branch ref
- For release: Uses the release tag/branch

The actual SDK SHA for evaluation is still correctly resolved later in the params step (line 136-138) using git rev-parse.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
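For reference, the ref-to-SHA conversion this commit relies on boils down to a one-liner around git rev-parse. A minimal sketch in Python; the actual workflow does this in a shell step, so the helper below is only illustrative.

```python
# Sketch: resolve a user-supplied ref (branch, tag, or short SHA) to a full
# commit SHA, assuming a prior full-history checkout (fetch-depth: 0).
import subprocess


def resolve_sha(ref: str) -> str:
    # "<ref>^{commit}" lets git rev-parse handle branches, tags, and short SHAs.
    result = subprocess.run(
        ["git", "rev-parse", f"{ref}^{{commit}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```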
Thank you, this is starting to work! It doesn't have any links to "Download Full Results" and the rest, though? They look like they would be links, but they're not. We need links to look into results, IMHO; numbers alone don't tell much, especially for debugging bad results or simply understanding them. Edited to add:
TEMPORARY CHANGE - will be reverted after testing

This adds a pr_number workflow input so we can test the clickable GCS links fix in PR comments without needing to trigger via label.

Testing plan:
1. Trigger workflow with pr_number=1267
2. Verify PR comment has clickable HTTPS links
3. Revert this commit after successful test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Evaluation Triggered
🎉 Evaluation Job Completed
Evaluation Name:
Results Summary
Unfortunately, those links are not public; not accessible outside AH, I assume. 😢 I believe that's why Mamoodi had in the past made a full log available via GitHub Actions. Not sure it was a great way, but it was a way.
Evaluation Triggered
Evaluation Triggered
🎉 Evaluation Job Completed
Evaluation Name:
Results Summary
This reverts commit e5bbafe.
Evaluation Triggered

🎉 Evaluation Job Completed
Evaluation Name:
Results Summary
Just to note, it still doesn't like me. 😢

AccessDenied
Access denied.
Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
It's a work in progress!
@enyst btw the URL fixing is done in a completely different repo; if you want to review the current code, it would be awesome :)
enyst left a comment
At a cursory look, it seems to me like we could maybe simplify some code. Let's take it in though, and have some fun!
If it works now, we can always adjust things as they come.
Thank you!!
Simplify Cross-Repository Evaluation Workflow Orchestration
Fixes #1265
Summary
This PR simplifies the SDK evaluation workflow by moving orchestration complexity from GitHub Actions (YAML) into Python code in the evaluation repository where it naturally belongs.
Before
After
Changes
Software Agent SDK Repository
1. Simplified Workflow (run-eval.yml)

Removed: the benchmarks build dispatch and polling steps, plus the polling configuration environment variables.
Streamlined to: validate inputs, resolve model configs from models.json, and dispatch the evaluation.

2. Model Configuration Management

Created resolve_model_configs.py: looks up full model configs by ID from models.json.
Replaced model stubs with a single source of truth:
- .github/run-eval/llm_config_model_stubs.json is removed
- .github/run-eval/models.json becomes the single authoritative source

Benefits: a single source of truth for model definitions.

3. Enhanced Input Validation

Added authorization check:
- Only users in the allowlist (authorized-labelers.txt) can trigger evaluations
- Applies to both workflow_dispatch and pull_request_target events

Model ID validation: requested model IDs are validated against models.json, with error messages listing the available models.

4. Testing Feature Branches

New eval_branch input: lets the dispatch target a specific evaluation repo branch; defaults to main for normal operations.

5. Trigger Reason Propagation

New reason input: passed through to the evaluation workflow and shown in Slack notifications when provided.

Code Review Updates
After initial implementation, addressed code review feedback to improve reliability:
1. Fixed Workflow Run Detection (HIGH PRIORITY)

Problem: Timestamp-based detection was unreliable and vulnerable to race conditions
Solution: Implemented baseline run ID comparison using the Check Suites API; the new run is identified by check_suite.id > baseline_id (see the sketch after this list)

2. Removed Sanitize-Inputs Job (MEDIUM PRIORITY)

Problem: Unnecessary GitHub Actions job that could be done in Python
Solution: Moved sanitization logic to orchestrate_eval.py

3. Added Comprehensive Tests
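A minimal sketch of the baseline comparison described in item 1, using the GitHub Check Suites API. The repository path, token variable, and selection logic below are illustrative assumptions rather than the code in the evaluation repo.

```python
# Hypothetical sketch of baseline-based run detection: record the newest
# check suite id before dispatching, then identify the new run as the first
# suite whose id exceeds that baseline.
import os

import requests

API = "https://api.github.com/repos/OpenHands/agent-sdk"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
    "Accept": "application/vnd.github+json",
}


def latest_check_suite_id(ref: str) -> int:
    """Return the highest check suite id currently attached to the commit."""
    resp = requests.get(f"{API}/commits/{ref}/check-suites", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    suites = resp.json()["check_suites"]
    return max((s["id"] for s in suites), default=0)


def find_new_suite(ref: str, baseline_id: int) -> int | None:
    """Return the oldest suite id created after the baseline, if any."""
    resp = requests.get(f"{API}/commits/{ref}/check-suites", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    new = [s["id"] for s in resp.json()["check_suites"] if s["id"] > baseline_id]
    return min(new) if new else None
```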
Validation
Full Workflow Testing
Test 1: Valid Model ID
Test 2: Evaluation Deployment
Test 3: Invalid Model ID (Validation)
Test 4: Code Review Fixes - End-to-End
Trigger Reason Feature
Input: "Testing fix for helm deployment issue"
Sanitized: "Testing-fix-for-helm-deployment-issue"
Result: ✅ Successfully passed to Kubernetes pod via helm_args
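The sanitization step above is essentially a whitespace-to-hyphen rewrite before the value is passed through helm_args. A rough sketch, with the allowed character set assumed from the example shown; the real orchestrate_eval.py logic may differ.

```python
# Hypothetical sketch of reason sanitization before passing it via helm_args;
# the exact allowed character set is an assumption based on the example above.
import re


def sanitize_reason(reason: str) -> str:
    """Collapse whitespace to hyphens and drop characters unsafe for helm args."""
    cleaned = re.sub(r"\s+", "-", reason.strip())
    return re.sub(r"[^A-Za-z0-9._-]", "", cleaned)


assert sanitize_reason("Testing fix for helm deployment issue") == (
    "Testing-fix-for-helm-deployment-issue"
)
```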
Impact
Reduced Complexity
Improved Maintainability
Better Error Handling
Enhanced Features
Testing
Related PRs
Co-authored-by: openhands <openhands@all-hands.dev>
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:0f06ad7-python

Run
All tags pushed for this build
About Multi-Architecture Support
- Each tag (0f06ad7-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (0f06ad7-python-amd64) are also available if needed