Laminar evaluations by Rainhunter13 · Pull Request #175 · OpenHands/benchmarks

Rainhunter13 · 2025-12-18T15:09:14Z

This PR integrates Laminar evaluations into OpenHands Benchmarks, enabling dynamic creation of Laminar evaluations and datapoints for every run.

Core changes:

Added a singleton LaminarService that manages the Laminar client, evaluations, and datapoints.
For every benchmark run, create a Laminar evaluation.
For every instance run, create an evaluation datapoint and an execution span, and propagate the span to isolated workspaces via the environment variable LMNR_SPAN_CONTEXT.
Save Laminar metadata (evaluation and datapoint IDs) so that scores can be updated after inference evaluation.
Update evaluation datapoints with scores based on the inference evaluation output.
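
The flow in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `TracingService`, `start_run`, `start_instance`, `update_score`, and the ID and span-context formats are all hypothetical stand-ins, and no real Laminar client calls are made. Only the environment variable name `LMNR_SPAN_CONTEXT` comes from the description.

```python
import json
import threading
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InstanceRecord:
    """Metadata kept per instance so scores can be attached later."""
    datapoint_id: str
    span_context: str
    scores: Dict[str, float] = field(default_factory=dict)


class TracingService:
    """Hypothetical stand-in for a singleton service such as LaminarService."""

    _instance: Optional["TracingService"] = None
    _lock = threading.Lock()

    def __init__(self) -> None:
        self.evaluation_id: Optional[str] = None
        self.datapoints: Dict[str, InstanceRecord] = {}

    @classmethod
    def get(cls) -> "TracingService":
        # Create the shared instance once; later calls return the same object.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def start_run(self, benchmark: str) -> str:
        # One evaluation per benchmark run (the ID format is made up).
        self.evaluation_id = f"{benchmark}-{uuid.uuid4().hex[:8]}"
        return self.evaluation_id

    def start_instance(self, instance_id: str) -> Dict[str, str]:
        # One datapoint and one span per instance; return the environment
        # variables that an isolated workspace would need to continue the trace.
        record = InstanceRecord(
            datapoint_id=uuid.uuid4().hex,
            span_context=json.dumps({"span_id": uuid.uuid4().hex}),
        )
        self.datapoints[instance_id] = record
        return {"LMNR_SPAN_CONTEXT": record.span_context}

    def update_score(self, instance_id: str, name: str, value: float) -> None:
        # Called after the evaluation step, using the saved metadata.
        self.datapoints[instance_id].scores[name] = value
```

A caller would obtain the service with `TracingService.get()`, call `start_run` once per benchmark run, and pass the dictionary returned by `start_instance` to whatever launches the workspace.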

Usage:

export LMNR_PROJECT_API_KEY=your_key_here
uv run swebench-infer ...

simonrosenberg

Can you fix the 2 mutable defaults?
Then there is still an issue because the eval output format is really suboptimal, but this can be solved in the standardization ticket!
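
For context on the mutable defaults mentioned above, here is a generic illustration of why they are risky in Python. The function names are hypothetical and are not taken from this PR.

```python
from typing import Dict, Optional


def record_score_buggy(name, value, scores={}):
    # The default dict is created once, at function definition time,
    # so every call that omits `scores` shares the same object.
    scores[name] = value
    return scores


def record_score(
    name: str, value: float, scores: Optional[Dict[str, float]] = None
) -> Dict[str, float]:
    # Use None as the sentinel and build a fresh dict on each call.
    if scores is None:
        scores = {}
    scores[name] = value
    return scores
```

With the buggy version, a second call that omits `scores` still sees the entry written by the first call; the fixed version returns an independent dict each time.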

simonrosenberg

Actually, can we hold off on merging this until I standardize the evaluation output format? Sorry! It should be done by tomorrow.

Issue here #177

Rainhunter13 · 2025-12-18T18:57:31Z

@simonrosenberg addressed the comments!

Sure, let me know once you standardise the format and I'll adjust the Laminar service as needed. Thank you!

simonrosenberg · 2025-12-29T17:33:27Z

@OpenHands please resolve the conflicts

openhands-ai · 2025-12-29T17:33:36Z

I'm on it! simonrosenberg can track my progress at all-hands.dev

Resolve conflicts:
- benchmarks/commit0/run_infer.py: Keep remote workspace support from main, add forward_env parameter to DockerDevWorkspace
- pyproject.toml: Combine dependencies from both branches, use lmnr>=0.7.24
- uv.lock: Regenerate lock file

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2025-12-29T17:39:28Z

I've resolved the merge conflicts in PR #175 by merging origin/main into the laminar_evaluations branch. Here's a summary of the conflict resolutions:

Conflicts Resolved

benchmarks/commit0/run_infer.py:
- Kept the remote workspace support from main (adds APIRemoteWorkspace support alongside DockerDevWorkspace)
- Preserved the forward_env parameter for DockerDevWorkspace from the laminar_evaluations branch
- Note: APIRemoteWorkspace doesn't support the forward_env parameter, so it was only added to DockerDevWorkspace
pyproject.toml:
- Combined dependencies from both branches
- Kept lmnr>=0.7.24 (the higher version from laminar_evaluations)
- Added all new dependencies from main (deprecation, fastmcp, httpx, litellm, pydantic, etc.)
- Added new script entrypoints from main (multiswebench, swebenchmultimodal)
uv.lock:
- Regenerated the lock file using uv lock to reflect the merged dependencies
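
The idea behind a forward_env-style parameter can be illustrated with a plain subprocess. This is only a sketch of the concept: `run_isolated` and `BASE_ENV` are hypothetical, and the code does not use the real DockerDevWorkspace or APIRemoteWorkspace APIs, whose signatures are not shown in this thread.

```python
import os
import subprocess
import sys
from typing import List

# Variables the child interpreter may need just to start (an assumption
# made for portability, unrelated to the PR).
BASE_ENV = ("PATH", "SYSTEMROOT", "LD_LIBRARY_PATH")


def run_isolated(code: str, forward_env: List[str]) -> str:
    """Run Python code in a child process that only sees forwarded variables."""
    allowed = set(BASE_ENV) | set(forward_env)
    child_env = {k: v for k, v in os.environ.items() if k in allowed}
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=child_env,          # replaces the inherited environment
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Only the names listed in `forward_env` (plus the small base set) reach the child, which mirrors how a span context such as LMNR_SPAN_CONTEXT would be passed into an isolated workspace while other variables stay behind.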

Important Note

The branch has been pushed to origin/laminar_evaluations on the OpenHands/benchmarks repository. However, PR #175 is from a fork (lmnr-ai/benchmarks), and I don't have write access to that fork.

There are two ways forward:

The PR author (@Rainhunter13) pulls the changes from origin/laminar_evaluations into their fork's branch, or
The PR is closed and a new one is created from origin/laminar_evaluations

The resolved branch is available at: https://github.com/OpenHands/benchmarks/tree/laminar_evaluations

simonrosenberg · 2026-01-07T15:35:43Z

@Rainhunter13 approved, thanks! And sorry for the delay.

Rainhunter13 · 2026-01-07T16:02:50Z

@simonrosenberg thanks! Just fixed the tests. Do you think it will be possible to merge this soon?

* laminar evaluations
* revert uv.lock changes
* revert uv.lock changes
* remove mutable defaults and move score update to laminar service
* chore
* Update software-agent-sdk submodule to 52604a7fa48f40487c70338a96488fc6287dac9c
* use generic report file format
* Update software-agent-sdk submodule to e8f73e71b979984242245efeb701ee4b18480f9e
* Update software-agent-sdk submodule to 73769d5e9d5f75333054d098a2e86c02555fa8d6
* forward env vars for RemoteWorkspace
* fix tests

---------

Co-authored-by: openhands <openhands@all-hands.dev>

laminar evaluations

6c8797a

Rainhunter13 force-pushed the laminar_evaluations branch from fac7001 to 6c8797a on December 18, 2025 15:22

Rainhunter13 added 2 commits December 18, 2025 15:28

revert uv.lock changes

69002cd

revert uv.lock changes

7c81b7b

neubig requested a review from simonrosenberg December 18, 2025 15:47

simonrosenberg reviewed Dec 18, 2025

Comment thread benchmarks/swebench/eval_infer.py Outdated

simonrosenberg reviewed Dec 18, 2025

Comment thread benchmarks/utils/laminar.py Outdated

Comment thread benchmarks/utils/models.py Outdated

simonrosenberg reviewed Dec 18, 2025

Comment thread benchmarks/commit0/run_infer.py

simonrosenberg mentioned this pull request Dec 18, 2025

Standardize evaluation helpers and file layout #177

Closed

simonrosenberg approved these changes Dec 18, 2025

simonrosenberg requested changes Dec 18, 2025

remove mutable defaults and move score update to laminar service

b904ae2

chore

2ab78a5

simonrosenberg mentioned this pull request Dec 29, 2025

Add forward_env support to APIRemoteWorkspace OpenHands/software-agent-sdk#1540

Merged

Rainhunter13 added 5 commits December 30, 2025 12:54

Update software-agent-sdk submodule to 52604a7fa48f40487c70338a96488fc6287dac9c

169f99b

use generic report file format

3af5f48

Update software-agent-sdk submodule to e8f73e71b979984242245efeb701ee4b18480f9e

d430203

Update software-agent-sdk submodule to 73769d5e9d5f75333054d098a2e86c02555fa8d6

d9e3204

forward env vars for RemoteWorkspace

9c4cc04

simonrosenberg approved these changes Jan 7, 2026

fix tests

8da79b7

simonrosenberg enabled auto-merge (squash) January 7, 2026 16:15

simonrosenberg merged commit 11ec5d1 into OpenHands:main Jan 7, 2026
2 checks passed

simonrosenberg merged 12 commits into OpenHands:main from lmnr-ai:laminar_evaluations

3 participants
