Skip to content

Standardize MuJoCo timestep and increase camera resolution across robot configs#648

Draft
davetcoleman wants to merge 3 commits into
mainfrom
standardize-mujoco-timestep-0.003
Draft

Standardize MuJoCo timestep and increase camera resolution across robot configs#648
davetcoleman wants to merge 3 commits into
mainfrom
standardize-mujoco-timestep-0.003

Conversation

@davetcoleman
Copy link
Copy Markdown
Member

@davetcoleman davetcoleman commented May 20, 2026

This PR standardizes two MuJoCo settings that were inconsistent across the workspace's robot configs, plus a CI-infra commit that the second commit's resolution bump made necessary (GPU runner + live test progress). Three commits, kept separate intentionally (please do not squash).


Commit 1 — timestep="0.003"

Problem

MuJoCo <option timestep> was inconsistent across robot configs in this workspace:

Config Previous timestep (s) Sim Hz
hangar_sim 0.025 40
factory_sim 0.01 100
dual_arm_sim, kitchen_sim 0.005 200
lab_sim, april_tag_sim, grinding_sim, lunar_sim, kinova_sim, space_satellite_sim, space_satellite_sim_camera_cal 0.002 (MuJoCo default — never set) 500

hangar_sim's 25 ms step in particular is well outside the range stiff arm contacts can normally tolerate, and means the 600 Hz ros2_control loop is reading stale physics state for ~15 ticks at a time.

Alternatives considered

  1. Match the MuJoCo default (0.002 s, 500 Hz). Most stable, but the configs currently at 0.005 / 0.01 / 0.025 chose those values presumably because the smaller step was too expensive wall-clock. Going from 0.025 to 0.002 on hangar_sim is a ~12.5× slowdown — likely intolerable for the mobile-base + Nav2 + lidar load.
  2. Pick a value per robot. Leaves the same inconsistency in place; no clear rule for what each new config should pick.
  3. Standardize on 0.003 s (this PR). Matches the value already in use in qualcomm_sim. Trades a ~1.5× speedup on the 8 default-0.002 configs for a ~1.67× / ~3.3× / ~8.3× slowdown on the 4 outliers, in exchange for a uniform value that's stable for all of them.

Chosen approach

<option timestep="0.003" /> set in every MuJoCo-using scene file in the workspace (11 files).

Three edit patterns:

  • Modified existing timestep="..." (4 files): hangar_sim, factory_sim, dual_arm_sim, kitchen_sim.
  • Added timestep="0.003" attribute to an existing <option> (4 files): lab_sim, april_tag_sim, space_satellite_sim, space_satellite_sim_camera_cal.
  • Inserted a new <option timestep="0.003" /> (3 files): grinding_sim, lunar_sim, kinova_sim.

Related

moveit_pro#19223 — asked Griz / Breelyn about the reasoning behind hangar_sim's prior 0.025 s value. If they come back with a stability/performance reason specific to mecanum + Nav2 that 0.003 won't satisfy, we may need to revisit hangar_sim alone.


Commit 2 — resolution="1280 720"

Problem

MuJoCo camera resolutions and offscreen buffer sizes were inconsistent across robot configs:

Config Previous
lunar_sim, hangar_sim, picknik_accessories/ur5e wrist_camera already 1280×720
space_satellite_sim_camera_cal 1920×1080 (intentional — calibration data)
everything else 640×480

picknik_mujoco_ros requires <visual><global offwidth offheight> to be ≥ the largest camera in the scene, so any config that includes a 1280×720 camera (e.g. via picknik_accessories/ur5e) but declares a smaller global currently fails to load with a Camera resolution mismatch URDF rejection.

Chosen approach

  • Every <camera resolution="640 480"> in the workspace's MJCF files bumped to 1280 720 (13 files, 16 camera tags).
  • Every top-level scene's <visual><global offwidth offheight></visual> ensured to be ≥ 1280×720 — bumped lab_sim's existing 640×480 global, and added offwidth/offheight attributes to the existing <global> element in 7 other scenes that previously had a global with only azimuth/elevation.
  • Added a new <visual><global offwidth="1280" offheight="720"/></visual> block to april_tag_sim, which had no <visual> block at all.
  • space_satellite_sim_camera_cal left at 1920×1080.

Skipped

phoebe_sim (submodule external_dependencies/phoebe_ws) is intentionally not included in either commit — it needs a separate PR in PickNikRobotics/phoebe_ws. (It was already at 1280×720 for cameras anyway.)


Commit 3 — enable_gpu: true on a picknik-16-amd64-gpu runner, plus live per-objective test progress

Problem

After commit 2, lab_sim's objectives_integration_test started timing out on CI at 600 s (Exit Code: Timeout, Execution Time: 600.058 s), both on humble and jazzy. Baseline runtime on main is ~388 s. The ~3× per-camera pixel increase (and the addition of the offscreen buffer growth) pushes the perception-heavy objective sweep past the existing timeout budget on picknik-16-amd64 runners.

Alternatives considered

  1. Shrink CI cameras to 640×480 via a mujoco_ci_camera_resolution input. Original plan — attempted in PickNikRobotics/moveit_pro_ci#23, closed without merging. Would have kept CPU runners but introduced a CI-vs-product scene divergence.
  2. Bump the test timeout from 600 s. Moves the goalposts; doesn't address the underlying render cost.
  3. Switch to a GPU runner (this PR). EGL offscreen rendering moves from llvmpipe (CPU) to the GPU, dropping per-frame render cost back below the timeout budget. CI runs the same scene files as the product — no CI-only override.

Chosen approach

  • Bump the reusable workflow ref to PickNikRobotics/moveit_pro_ci@v0.3.1, which appends the CUDA image suffix when enable_gpu is true (moveit_pro_ci#26). Without that fix, v0.3.0 set the runner label correctly but the container kept the non-CUDA image, so MuJoCo's EGL rendering still went through llvmpipe on CPU — i.e. the GPU was on the bare metal but not visible to the test process. That was the suspected cause of the first GPU-runner run still timing out at 600 s.
  • Set enable_gpu: true and switch the runner label from picknik-16-amd64 to picknik-16-amd64-gpu.
  • Production camera resolutions stay at the standardized 1280×720 from commit 2.

Test-diagnostics addition: src/lab_sim/test/conftest.py

The first GPU-runner CI run timed out and the log showed pytest's collected 118 items header and then nothing — no per-test progress, no indication of which objective was running when CTest killed the process. Root cause: pytest's default --capture=fd redirects fds 1 and 2 and only flushes at end-of-run, which doesn't happen on timeout.

Folded in here because it's CI-test infra. Two pytest hooks (pytest_runtest_logstart, pytest_runtest_logreport) write directly to fd 2 via os.write(2, ...), bypassing the capture. Every test emits START <nodeid> on entry and PASSED|FAILED|SKIPPED <nodeid> (<elapsed>s) on completion, plus the short failure reason on failure. Diagnostics-only — no production code touched.

Caveat — fallback if v0.3.1 isn't enough

The bet now is that v0.3.1's CUDA image actually plumbs the GPU through to the container and MuJoCo's EGL uses the GPU's hardware path. mj_step itself is still CPU/solver work — the GPU only helps the render path. If lab_sim's test still times out, the new conftest will name the slow objective (and the timing distribution of those that completed), letting us decide between bumping the test timeout, reviving the mujoco_ci_camera_resolution direction (re-open #23, merge, cut a tag, repin), or fixing a specific slow objective.

Manual verification

  • Smoke-test hangar_sim — the biggest timestep delta (0.025 → 0.003, ~8× more sim work per second). Verify mobile-base locomotion + Nav2 still keeps up.
  • Smoke-test lab_sim and kinova_sim/rafti_fingers — contact-heavy scenes most likely to surface stability regressions from the 0.002 → 0.003 step increase.
  • Start each config and confirm ros2_control_node no longer logs Camera resolution mismatch.
  • Confirm lab_sim's objectives_integration_test completes under 600 s on picknik-16-amd64-gpu for both humble and jazzy. If it still times out, the GPU-runner bet didn't pan out and we need to revisit Commit 3's strategy.

Claude agent checks

Autofilled by Claude when it opens or updates this PR. Each agent that was actually run gets ticked; agents that don't apply are ticked with SKIPPED right after the checkbox plus a brief reason. Humans rarely need to touch these. See .claude/rules/agent-delegation.md for when each agent is required.

  • code-reviewer (required) — 3 findings (1 P0, 2 P1) addressed in fixup commits
  • SKIPPED test-runner (required) — no compiled code in diff (XML configs + CI yaml only)
  • platform-architect-bot (required for backend / C++ / ROS / REST / build-CI changes) — 7 findings; applied 3 (kitchen_sim offwidth, 2 kinova_sim timestep); applied 2 P2 comment fixes (lab_sim, ci.yaml); deferred 2 (hangar_sim wall-clock already in manual-verify list, franka3.xml pre-existing pattern)
  • SKIPPED frontend-noah-bot (required when src/web/frontend/** changes) — no src/web/frontend/** files changed
  • SKIPPED security-auditor (required for security-sensitive paths) — no security-sensitive paths changed
  • test-writer (optional)
  • unit-test-reviewer (optional)

@davetcoleman davetcoleman added this to the 9.4.0 milestone May 20, 2026
@davetcoleman davetcoleman changed the title Standardize MuJoCo timestep to 0.003 s across all robot configs Standardize MuJoCo timestep and camera resolution across robot configs May 20, 2026
@davetcoleman davetcoleman changed the title Standardize MuJoCo timestep and camera resolution across robot configs Standardize MuJoCo timestep and increase camera resolution across robot configs May 20, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 21, 2026

⚠️ This PR modifies 1 file(s) that also exist in PickNikRobotics/moveit_pro_empty_ws.

Consider whether the change should land upstream in moveit_pro_empty_ws first so downstream forks pick it up on the next sync.

Overlapping files
  • .github/workflows/ci.yaml

@davetcoleman davetcoleman requested a review from JWhitleyWork May 21, 2026 01:13
@davetcoleman
Copy link
Copy Markdown
Member Author

Before talking to @JWhitleyWork Claude suggested it implement the "lower camera resolution" change for CI, so that is in here for now. LMK when our servers support GPUs instead!

@JWhitleyWork JWhitleyWork force-pushed the standardize-mujoco-timestep-0.003 branch from 53a2cbd to 2e1e693 Compare May 21, 2026 01:48
@davetcoleman davetcoleman force-pushed the standardize-mujoco-timestep-0.003 branch from 2e1e693 to ce98d97 Compare May 21, 2026 17:47
Comment thread .github/workflows/ci.yaml
davetcoleman and others added 2 commits May 21, 2026 12:12
Sets `<option timestep="0.003">` in every MuJoCo-using scene file in the
workspace, matching the value already in use in qualcomm_sim.

Previous values were inconsistent:
- 0.025 s (hangar_sim, 40 Hz) - 12.5x larger than default
- 0.01  s (factory_sim, 100 Hz)
- 0.005 s (dual_arm_sim, kitchen_sim, 200 Hz)
- 0.002 s (everything else, via MuJoCo default, 500 Hz)

Standardizing on 0.003 s (333 Hz) trades a small wall-time slowdown on
the previously-default-0.002 configs for a large stability improvement
on hangar_sim and factory_sim, where stiff arm contacts were running at
a step well outside the typical safe range.

See PickNikRobotics/moveit_pro#19223 for the
question filed to Griz/Breelyn about hangar_sim's prior 0.025 s value -
that issue may surface a reason to revisit hangar_sim specifically.

phoebe_sim (submodule) is intentionally not included in this PR; it
needs a separate PR in PickNikRobotics/phoebe_ws.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…obot configs

Sets `resolution="1280 720"` on every <camera> in the workspace's MJCF scene
files, and ensures every top-level scene's <global offwidth/offheight> is at
least 1280x720 so the offscreen render buffer fits the largest camera.

Previously inconsistent:
- 1280x720 (lunar_sim, hangar_sim, picknik_accessories/ur5e wrist_camera)
- 640x480  (lab_sim, kitchen_sim, grinding_sim, april_tag_sim, dual_arm_sim,
            factory_sim, space_satellite_sim, kinova_sim and variants)

This was discovered when `lab_sim` failed to start with:

    [ros2_control_node] [ERROR] The published robot description file (urdf)
    seems not to be genuine. ... Camera resolution mismatch, set:
    <global offwidth="1280" offheight="720"/> in the MJCF model.

The wrist_camera in picknik_accessories/mujoco_assets/ur5e/ur5e.xml was
already at 1280x720, but lab_sim's scene declared the global offscreen
buffer at only 640x480, so the hardware plugin rejected the URDF.
Standardizing to 1280x720 everywhere keeps the included ur5e wrist_camera
working in every UR-based config and avoids the same trap silently waiting
in the others.

space_satellite_sim_camera_cal kept at 1920x1080 (intentional higher
resolution for camera-intrinsics calibration data).

phoebe_sim (submodule) is not included; it needs a separate PR in
PickNikRobotics/phoebe_ws (and was already at 1280x720 anyway).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@davetcoleman davetcoleman force-pushed the standardize-mujoco-timestep-0.003 branch from ce98d97 to 11d1536 Compare May 21, 2026 18:13
@JWhitleyWork
Copy link
Copy Markdown
Member

Moving to "ready" and we'll see how CI handles it.

@JWhitleyWork JWhitleyWork marked this pull request as ready for review May 21, 2026 18:19
@davetcoleman davetcoleman force-pushed the standardize-mujoco-timestep-0.003 branch 2 times, most recently from cf36a51 to 79d0b94 Compare May 21, 2026 20:50
…ive test progress

Two CI-infra changes folded together:

1. Bump the reusable workflow ref to PickNikRobotics/moveit_pro_ci v0.3.1, set enable_gpu: true, and switch the runner label from picknik-16-amd64 to picknik-16-amd64-gpu. v0.3.1 appends the CUDA image suffix when enable_gpu is true (moveit_pro_ci#26) -- without it, v0.3.0 set the label but kept the non-CUDA image, so MuJoCo's EGL rendering still ran through llvmpipe on CPU.

2. Add src/lab_sim/test/conftest.py with two pytest hooks (logstart, logreport) that write directly to fd 2, bypassing pytest's --capture=fd. Without this, a CTest timeout kills pytest before any per-test output is flushed, leaving the CI log silent past 'collected N items'. The hooks emit START / PASSED|FAILED|SKIPPED + duration for every test, so the next timeout (if any) names the offending objective.
@davetcoleman davetcoleman force-pushed the standardize-mujoco-timestep-0.003 branch from 79d0b94 to d8aebf3 Compare May 21, 2026 21:16
@davetcoleman
Copy link
Copy Markdown
Member Author

Update — the new conftest paid off: CI on 9.3.0-rc9 is failing on simulator/control startup, not the render budget.

Every parametrized invocation of test_all_objectives is timing out at exactly 180 s in wait_for_gripper_action_server() (objectives_integration_test.py:94):

START   test_all_objectives[...core/clear_snapshot.xml]
FAILED  test_all_objectives[...core/clear_snapshot.xml] (187.4s)
  └─ ../../src/lab_sim/test/objectives_integration_test.py:94: TimeoutError

START   test_all_objectives[...lab_sim/objectives/close_gripper.xml]
FAILED  test_all_objectives[...lab_sim/objectives/close_gripper.xml] (180.2s)
  └─ ../../src/lab_sim/test/objectives_integration_test.py:94: TimeoutError

START   test_all_objectives[...lab_sim/objectives/open_gripper.xml]
FAILED  test_all_objectives[...lab_sim/objectives/open_gripper.xml] (180.3s)
  └─ ../../src/lab_sim/test/objectives_integration_test.py:94: TimeoutError

/robotiq_gripper_controller/gripper_cmd is never registering. The 600 s CTest timeout we were chasing is downstream of that — three tests × 180 s, then the 4th can't fit in the remaining budget.

So this is a sim or ros2_control bringup failure under 9.3.0-rc9 with this PR's scene config (timestep 0.003, cameras 1280×720, <global offwidth offheight> on every scene). Going to repro lab_sim locally on 9.3.0-rc9 next to capture the actual startup error.

@davetcoleman
Copy link
Copy Markdown
Member Author

Bisection result — correcting the prior diagnosis. Opened #655 (just commit 3 from this PR, no scene changes) to discriminate scene-vs-image. #655's CI passes the gripper-wait fine: 34 PASSED, 82 SKIPPED, only 2 failed (Move Flasks to Burners and Push Button via move_with_velocity_and_force.xml, both failing in 0.6s — pre-existing flakes unrelated to this PR's diff). Total lab_sim test time: ~291 s.

This inverts my earlier claim of an image regression. The gripper failure on this PR isn't from 9.3.0-rc9 — it's caused by this PR's scene XML changes (timestep 0.003, cameras 1280×720, <global offwidth offheight>).

Pinged Josh to ask whether any of those specifically would block ros2_control from registering robotiq_gripper_controller. Meanwhile going to bisect further within the scene changes (timestep-only vs. camera-only) to localize which one is the trigger.

@davetcoleman
Copy link
Copy Markdown
Member Author

Blocked by #655

@davetcoleman davetcoleman marked this pull request as draft May 22, 2026 01:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants