Laminar evaluations #175

Merged
simonrosenberg merged 12 commits into OpenHands:main from lmnr-ai:laminar_evaluations on Jan 7, 2026
Conversation

@Rainhunter13 Rainhunter13 commented Dec 18, 2025

Contributor

This PR integrates Laminar evaluations into OpenHands Benchmarks, dynamically creating a Laminar evaluation and its datapoints for every run.

Core changes:

  • Added a singleton LaminarService that manages the Laminar client, evaluations, and datapoints.
  • For every benchmark run, create a Laminar evaluation.
  • For every instance run, create an evaluation datapoint and an execution span, and propagate the span to isolated workspaces via the environment variable LMNR_SPAN_CONTEXT.
  • Save Laminar metadata (evaluation and datapoint IDs) so that scores can be updated after the inference output is evaluated.
  • Update evaluation datapoints with scores based on the inference evaluation output.
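
The bookkeeping described in the list above can be pictured with a small sketch. This is not the PR's LaminarService and it makes no calls to the Laminar SDK: the class name `LaminarServiceSketch`, the `Datapoint` record, and every method name are hypothetical, and IDs are generated locally where a real service would ask the Laminar client for them.

```python
import threading
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Datapoint:
    """Hypothetical local record of one evaluation datapoint."""
    id: str
    instance_id: str
    scores: Dict[str, float] = field(default_factory=dict)


class LaminarServiceSketch:
    """Process-wide singleton (illustrative only, not the PR's class)."""

    _instance: Optional["LaminarServiceSketch"] = None
    _lock = threading.Lock()

    def __init__(self) -> None:
        self.evaluation_id: Optional[str] = None
        self.datapoints: Dict[str, Datapoint] = {}

    @classmethod
    def get(cls) -> "LaminarServiceSketch":
        # Double-checked locking so concurrent workers share one instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def create_evaluation(self, name: str) -> str:
        # One evaluation per benchmark run; a real client call would go here.
        self.evaluation_id = f"{name}-{uuid.uuid4().hex[:8]}"
        return self.evaluation_id

    def create_datapoint(self, instance_id: str) -> Datapoint:
        # One datapoint per instance run, keyed by the benchmark instance ID.
        dp = Datapoint(id=uuid.uuid4().hex, instance_id=instance_id)
        self.datapoints[instance_id] = dp
        return dp

    def metadata(self) -> dict:
        # Saved alongside the inference output so scores can be attached later.
        return {
            "evaluation_id": self.evaluation_id,
            "datapoint_ids": {k: v.id for k, v in self.datapoints.items()},
        }

    def update_score(self, instance_id: str, name: str, value: float) -> None:
        # Called once the inference output has been evaluated.
        self.datapoints[instance_id].scores[name] = value
```

Keeping the evaluation and datapoint IDs in one saved mapping is what lets a later, separate evaluation step find the right datapoint to score.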

Usage:

export LMNR_PROJECT_API_KEY=your_key_here
uv run swebench-infer ...
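
The export above supplies the API key through the environment, and the span context mentioned in the core changes reaches isolated workspaces the same way. The sketch below illustrates that hand-off with a child process standing in for a workspace. Only the two variable names come from this PR; the helper names are hypothetical, and JSON is an assumed encoding, since the actual span-context format is defined by the Laminar SDK and is not shown here.

```python
import json
import os
import subprocess
import sys
from typing import Dict

SPAN_CONTEXT_ENV = "LMNR_SPAN_CONTEXT"  # variable name taken from the PR description


def build_forward_env(span_context: dict) -> Dict[str, str]:
    """Return extra environment for a workspace (JSON encoding is an assumption)."""
    env = {SPAN_CONTEXT_ENV: json.dumps(span_context)}
    api_key = os.environ.get("LMNR_PROJECT_API_KEY")
    if api_key:
        # Forward the key only when the caller has exported it.
        env["LMNR_PROJECT_API_KEY"] = api_key
    return env


def read_span_context_in_child(extra_env: Dict[str, str]) -> dict:
    """Spawn a child process (stand-in for an isolated workspace) and read the context back."""
    child_code = (
        "import json, os; "
        f"print(json.dumps(json.loads(os.environ['{SPAN_CONTEXT_ENV}'])))"
    )
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env={**os.environ, **extra_env},
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```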
[Screenshot, 2025-12-18 at 14:17:47]

Comment thread benchmarks/swebench/eval_infer.py Outdated
Comment thread benchmarks/utils/laminar.py Outdated
Comment thread benchmarks/utils/models.py Outdated
Comment thread benchmarks/commit0/run_infer.py

@simonrosenberg simonrosenberg left a comment

Member

Can you fix the 2 mutable defaults?
Then there is still an issue, because the eval output format is really suboptimal, but this can be solved in the standardization ticket!
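
For readers unfamiliar with the issue raised here: the later commit message "remove mutable defaults" suggests the request concerns mutable default values. The flagged code itself is not shown in this thread, so the functions and the `EvalMetadata` model below are hypothetical illustrations of the general Python pitfall and its usual fixes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


def add_score_buggy(name: str, value: float, scores: Dict[str, float] = {}) -> Dict[str, float]:
    # The default dict is created once, at definition time, and shared by every call.
    scores[name] = value
    return scores


def add_score_fixed(
    name: str, value: float, scores: Optional[Dict[str, float]] = None
) -> Dict[str, float]:
    # Use None as a sentinel and build a fresh dict on each call.
    if scores is None:
        scores = {}
    scores[name] = value
    return scores


@dataclass
class EvalMetadata:
    """Hypothetical model: default_factory gives each instance its own dict."""
    datapoint_ids: Dict[str, str] = field(default_factory=dict)
```

With the buggy version, a second call that relies on the default still sees the entry written by the first call, while the fixed version and the dataclass start from an empty container each time.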

@simonrosenberg simonrosenberg left a comment

Member

Actually, can we hold off on merging this until I standardize the evaluation output format? Sorry! It should be done by tomorrow.

Issue here: #177

@Rainhunter13

Contributor Author

@simonrosenberg I've addressed the comments!

Sure, let me know once you standardise the format and I'll adjust the Laminar service as needed. Thank you.

@simonrosenberg

Member

@OpenHands please resolve the conflicts

@openhands-ai

openhands-ai Bot commented Dec 29, 2025


I'm on it! simonrosenberg can track my progress at all-hands.dev

Resolve conflicts:
- benchmarks/commit0/run_infer.py: Keep remote workspace support from main, add forward_env parameter to DockerDevWorkspace
- pyproject.toml: Combine dependencies from both branches, use lmnr>=0.7.24
- uv.lock: Regenerate lock file

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai Bot commented Dec 29, 2025


I've resolved the merge conflicts in PR #175 by merging origin/main into the laminar_evaluations branch. Here's a summary of the conflict resolutions:

Conflicts Resolved

  1. benchmarks/commit0/run_infer.py:

    • Kept the remote workspace support from main (adds APIRemoteWorkspace support alongside DockerDevWorkspace)
    • Preserved the forward_env parameter for DockerDevWorkspace from the laminar_evaluations branch
    • Note: APIRemoteWorkspace doesn't support forward_env parameter, so it was only added to DockerDevWorkspace
  2. pyproject.toml:

    • Combined dependencies from both branches
    • Kept lmnr>=0.7.24 (the higher version from laminar_evaluations)
    • Added all new dependencies from main (deprecation, fastmcp, httpx, litellm, pydantic, etc.)
    • Added new script entrypoints from main (multiswebench, swebenchmultimodal)
  3. uv.lock:

    • Regenerated the lock file using uv lock to reflect the merged dependencies
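
The first resolution above, passing forward_env only to the workspace type that accepts it, can be sketched generically. The two stub classes are hypothetical stand-ins, not the SDK's DockerDevWorkspace or APIRemoteWorkspace, and treating forward_env as a list of variable names is an assumption.

```python
import inspect
from typing import List, Optional


class DockerWorkspaceStub:
    """Hypothetical stand-in for a workspace that accepts forward_env."""

    def __init__(self, image: str, forward_env: Optional[List[str]] = None) -> None:
        self.image = image
        self.forward_env = forward_env or []


class RemoteWorkspaceStub:
    """Hypothetical stand-in for a workspace without a forward_env parameter."""

    def __init__(self, image: str) -> None:
        self.image = image


def make_workspace(cls, image: str, forward_env: List[str]):
    # Pass forward_env only when the constructor declares that parameter.
    kwargs = {"image": image}
    if "forward_env" in inspect.signature(cls).parameters:
        kwargs["forward_env"] = forward_env
    return cls(**kwargs)
```

Checking the constructor signature is one way to avoid a `TypeError` from an unexpected keyword argument; an explicit branch per workspace type, as the summary describes, achieves the same result.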

Important Note

The branch has been pushed to origin/laminar_evaluations on the OpenHands/benchmarks repository. However, PR #175 is from a fork (lmnr-ai/benchmarks), and I don't have write access to that fork.

The PR author (@Rainhunter13) will need to either:

  1. Pull the changes from origin/laminar_evaluations into their fork's branch, or
  2. The PR can be closed and a new one created from origin/laminar_evaluations

The resolved branch is available at: https://github.com/OpenHands/benchmarks/tree/laminar_evaluations

@simonrosenberg

Member

@Rainhunter13 Approved, thanks! And sorry for the delay.

@Rainhunter13

Contributor Author

@simonrosenberg thanks! I just fixed the tests. Do you think it will be possible to merge this soon?

@simonrosenberg simonrosenberg enabled auto-merge (squash) January 7, 2026 16:15
@simonrosenberg simonrosenberg merged commit 11ec5d1 into OpenHands:main Jan 7, 2026
2 checks passed
KTanmay1 pushed a commit to Ethara-Ai/milo-bench that referenced this pull request Mar 3, 2026
* laminar evaluations

* revert uv.lock changes

* revert uv.lock changes

* remove mutable defaults and move score update to laminar service

* chore

* Update software-agent-sdk submodule to 52604a7fa48f40487c70338a96488fc6287dac9c

* use generic report file format

* Update software-agent-sdk submodule to e8f73e71b979984242245efeb701ee4b18480f9e

* Update software-agent-sdk submodule to 73769d5e9d5f75333054d098a2e86c02555fa8d6

* forward env vars for RemoteWorkspace

* fix tests

---------

Co-authored-by: openhands <openhands@all-hands.dev>