Remove direct prime-sandboxes dependency from rlm-swe v1 #1316

Merged: willccbb merged 8 commits into main from codex/audit-v1-env-example-sets on May 9, 2026

@willccbb (Member) commented May 8, 2026

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

High Risk
High risk because it replaces the rlm_swe_v1 taskset implementation with dataset-driven, sandbox-backed test staging/execution and adds new sandbox file-transfer/background-job APIs that affect runtime interactions.

Overview
Reworks rlm_swe_v1 to remove Harbor/packaged tasks and instead build tasks from the R2E-Gym/R2E-Gym-Subset dataset, including per-row sandbox image selection, environment variable construction, optional repo filtering, and reward based on running run_tests.sh and parsing pytest summaries.
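
A minimal sketch of the summary-parsing step, assuming the output of run_tests.sh ends with a standard pytest summary line such as `==== 3 passed, 1 failed in 0.12s ====`. The function names and the all-or-nothing scoring rule are illustrative assumptions, not the code merged in this PR.

```python
import re

# Matches count/outcome pairs such as "3 passed" or "1 failed" in pytest's
# final summary line, e.g. "==== 3 passed, 1 failed in 0.12s ====".
_COUNT_RE = re.compile(r"(\d+) (passed|failed|errors?|skipped|xfailed|xpassed)")


def parse_pytest_summary(output: str) -> dict[str, int]:
    """Return outcome counts from the last pytest summary line in `output`."""
    counts: dict[str, int] = {}
    for line in reversed(output.splitlines()):
        stripped = line.strip()
        # The summary line is framed by '=' and ends with the elapsed time.
        if stripped.startswith("=") and " in " in stripped:
            for number, outcome in _COUNT_RE.findall(stripped):
                key = "error" if outcome.startswith("error") else outcome
                counts[key] = counts.get(key, 0) + int(number)
            break
    return counts


def reward_from_summary(output: str) -> float:
    """Hypothetical scoring rule: 1.0 only if tests ran and none failed."""
    counts = parse_pytest_summary(output)
    ran_ok = counts.get("passed", 0) > 0
    clean = counts.get("failed", 0) == 0 and counts.get("error", 0) == 0
    return 1.0 if ran_ok and clean else 0.0
```

Output with no recognizable summary line yields an empty dict and therefore a reward of 0.0 under this rule, which treats a crashed test run the same as a failing one.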

Adds rollout setup/cleanup hooks to stage hidden tests by archiving/downloading them out of the sandbox and later re-uploading them for scoring; removes the bundled skills/ and tasks/ smoke content and updates packaging/dependencies accordingly.

Expands several v1 example environments to at least 10 examples (more task rows and higher num_examples values in the pyproject.toml files), refactors some static task sources to be generated from shared lists, and adds new tests to enforce the 10-example minimum and to cover the new rlm_swe_v1 behavior.

Extends verifiers.v1 sandbox helpers (SandboxLease/SandboxHandle) with upload_file, download_file, and run_background_job methods to support the new SWE workflow.
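
To illustrate how the three helper names fit together, here is a hedged sketch with an in-memory stand-in. Only the method names come from the summary above; the parameters, return types, and synchronous style are assumptions, and `SandboxFiles` and `FakeSandbox` are hypothetical and not part of verifiers.v1.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class SandboxFiles(Protocol):
    """Hypothetical shape only; the real method signatures may differ."""

    def upload_file(self, local_path: str, remote_path: str) -> None: ...
    def download_file(self, remote_path: str, local_path: str) -> None: ...
    def run_background_job(self, command: str, timeout: float) -> str: ...


class FakeSandbox:
    """In-memory stand-in that stores 'remote' files in a dict."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}
        self.jobs: list[str] = []

    def upload_file(self, local_path: str, remote_path: str) -> None:
        # Copy local bytes into the fake remote filesystem.
        self.files[remote_path] = Path(local_path).read_bytes()

    def download_file(self, remote_path: str, local_path: str) -> None:
        # Copy fake remote bytes back to a local path.
        Path(local_path).write_bytes(self.files[remote_path])

    def run_background_job(self, command: str, timeout: float) -> str:
        # Record the command instead of executing it; return empty output.
        self.jobs.append(command)
        return ""


def round_trip(sandbox: SandboxFiles, src: Path, dst: Path) -> None:
    """Upload `src`, run a placeholder job, then download the file to `dst`."""
    sandbox.upload_file(str(src), "/tmp/payload")
    sandbox.run_background_job("bash run_tests.sh", timeout=60.0)
    sandbox.download_file("/tmp/payload", str(dst))
```

A fake like this lets workflow code that depends on file transfer be exercised in unit tests without provisioning a real sandbox.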

Reviewed by Cursor Bugbot for commit 7eeb50a.

@willccbb requested review from rasdani and xeophon May 8, 2026 18:20
Comment thread environments/rlm_swe_v1/rlm_swe_v1.py Outdated
Comment thread environments/rlm_swe_v1/rlm_swe_v1.py
Comment thread environments/rlm_swe_v1/rlm_swe_v1.py
Comment thread environments/rlm_swe_v1/rlm_swe_v1.py
Comment thread tests/test_v1_rlm_swe.py
Comment thread environments/rlm_swe_v1/rlm_swe_v1.py
Comment thread environments/rlm_swe_v1/rlm_swe_v1.py Outdated
@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 8aa8458.

Comment thread environments/rlm_swe_v1/rlm_swe_v1.py Outdated
@willccbb merged commit 3be6614 into main May 9, 2026
8 checks passed
@willccbb deleted the codex/audit-v1-env-example-sets branch May 9, 2026 02:05