fix: enhance startup for perception services on first run #769

Merged
maciejmajek merged 3 commits into RobotecAI:main from KE7:fix/robust-startup-amd-gpu
Mar 19, 2026

@KE7
Contributor

@KE7 KE7 commented Mar 12, 2026

  • Replace wget with urllib streaming download in weights.py; logs a progress bar every 5% with speed and ETA. Downloads to a .part temp file and renames atomically on completion, preventing the corrupted weights that caused the original crash.
  • Replace the fixed 180s retry timeout in _is_ros2_stack_ready with an activity-based stale timeout (default 120s) that resets whenever new ROS2 services/topics/actions appear. Weight downloads no longer cause false timeouts regardless of how long they take.
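The activity-based stale timeout described above can be sketched as follows. This is a minimal illustration, not the code from `o3de_bridge.py`: the function name `wait_until_ready`, its parameters, and the injected `list_interfaces` callable are all hypothetical. The only behaviour taken from the description is that the deadline (default 120s) resets whenever a new interface appears.

```python
import time
from typing import Callable, Set


def wait_until_ready(
    list_interfaces: Callable[[], Set[str]],  # hypothetical: returns currently visible names
    required: Set[str],
    stale_timeout: float = 120.0,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once every required interface is visible.

    Gives up only after `stale_timeout` seconds pass without any new
    interface appearing; each newly seen interface resets the deadline.
    """
    seen: Set[str] = set()
    last_activity = clock()
    while True:
        current = list_interfaces()
        if required <= current:
            return True
        if current - seen:
            # Something new showed up, so the stack is still starting: reset.
            seen |= current
            last_activity = clock()
        elif clock() - last_activity > stale_timeout:
            return False
        sleep(poll_interval)
```

With this shape, a slow first-run download only causes a timeout if the stack shows no new activity for a full `stale_timeout` window. In a ROS 2 node, `list_interfaces` could plausibly be built from calls such as `node.get_service_names_and_types()` and `node.get_topic_names_and_types()`, though that wiring is an assumption here.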

Purpose

Fix startup failures when running RAI with AMD GPUs (gfx1151 / Strix Halo) where GroundingDINO weight downloads caused timeouts and GPU errors were silently swallowed.

Proposed Changes

  • detection_service.py: Replace exc_info=True (unsupported by rclpy logger, raises TypeError) with traceback.format_exc() so GPU errors are actually logged instead of swallowed.
  • weights.py: Replace subprocess/wget with urllib.request streaming download. Writes to a .part temp file and renames atomically on completion, preventing corrupted weight files from interrupted downloads. Logs a progress bar every 5% with speed and ETA.
  • o3de_bridge.py: Replace the fixed 360-retry (~180s) loop in _is_ros2_stack_ready with an activity-based stale timeout that resets whenever new ROS2 interfaces appear. Eliminates false timeouts during first-run weight downloads regardless of how long they take.
  • gripping_points_tools.py: Increase timeout_sec default from 10s to 60s to accommodate GPU inference startup time.
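The download pattern described for weights.py (stream with `urllib.request`, write to a `.part` file, rename on completion, report progress roughly every 5%) might look like the sketch below. It is an illustration under assumptions, not the merged code: the name `download_atomically`, the chunk size, the output format, the size check, and the choice to delete the partial file on failure are all hypothetical.

```python
import os
import time
import urllib.request
from pathlib import Path


def download_atomically(url: str, dest: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream `url` into `dest` via a sibling .part file, then rename it."""
    part = dest.with_name(dest.name + ".part")
    start = time.monotonic()
    done = 0
    next_report = 5  # next percentage at which progress is printed
    try:
        with urllib.request.urlopen(url) as response, open(part, "wb") as out:
            total = int(response.headers.get("Content-Length") or 0)
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
                done += len(chunk)
                if total and done * 100 // total >= next_report:
                    percent = done * 100 // total
                    elapsed = max(time.monotonic() - start, 1e-9)
                    speed = done / elapsed  # bytes per second
                    eta = (total - done) / speed
                    print(f"{percent:3d}%  {speed / 1e6:.1f} MB/s  ETA {eta:.0f}s")
                    next_report = percent - percent % 5 + 5
        if total and done != total:
            raise IOError(f"incomplete download: {done} of {total} bytes")
        # os.replace is atomic when source and target share a filesystem,
        # so `dest` never exists in a half-written state.
        os.replace(part, dest)
    except BaseException:
        part.unlink(missing_ok=True)  # assumption: discard the partial file
        raise
    return dest
```

Because the final path only appears through the rename, an interrupted run leaves no file at `dest`, and the next start can simply download again.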

Issues

N/A

Testing

Tested on AMD Radeon 8060S (gfx1151 / Strix Halo) with ROCm 7.0. GroundingDINO detection service initializes successfully with GPU acceleration confirmed.

@KE7 KE7 changed the title enhance startup for perception services on first run fix: enhance startup for perception services on first run Mar 12, 2026
Member

@maciejmajek maciejmajek left a comment

Thanks! Tested the rai_perception part. Worked on the following cases:

  • Standard uninterrupted download
  • Various interrupts at various download steps
  • Corrupting a file after download - this case throws an error without redownloading

Seems like there are some tests which need an update.

Please let me know if you will take care of that, I can step in.

============================================================================== short test summary info ==============================================================================
FAILED tests/rai_perception/agents/test_base_vision_agent.py::TestVisionWeightsDownload::test_download_weights_success - Exception: Could not download weights: HTTP Error 404: Not Found
FAILED tests/rai_perception/agents/test_base_vision_agent.py::TestLoadModelWithErrorHandling::test_load_model_corrupted_weights - Exception: Could not download weights: HTTP Error 404: Not Found
FAILED tests/rai_perception/services/test_weights.py::TestDownloadWeights::test_download_weights_success - AttributeError: module 'rai_perception.services.weights' has no attribute 'subprocess'
FAILED tests/rai_perception/services/test_weights.py::TestDownloadWeights::test_download_weights_failure - AttributeError: module 'rai_perception.services.weights' has no attribute 'subprocess'
FAILED tests/rai_perception/services/test_weights.py::TestDownloadWeights::test_download_weights_file_too_small - AttributeError: module 'rai_perception.services.weights' has no attribute 'subprocess'
FAILED tests/rai_perception/services/test_weights.py::TestLoadModelWithErrorHandling::test_load_model_corrupted_weights - AttributeError: module 'rai_perception.services.weights' has no attribute 'subprocess'
FAILED tests/rai_perception/tools/test_pcl_detection_tools.py::test_get_object_gripping_points_tool_auto_declaration - AssertionError: assert 60.0 == 10.0
============================================ 7 failed, 771 passed, 4 skipped, 1 deselected, 11 xfailed, 22 warnings in 110.91s (0:01:50) ============================================
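The `AttributeError` failures in the summary are consistent with tests that still patch `subprocess` on the weights module, which no longer imports it. One way such tests could be adapted is to patch `urllib.request.urlopen` instead. The sketch below is only an illustration: `fetch` is a hypothetical stand-in for the real downloader, and the URL is a placeholder that is never contacted.

```python
import io
import tempfile
import urllib.request
from pathlib import Path
from unittest import mock


def fetch(url: str, dest: Path) -> None:
    """Hypothetical stand-in for a urllib-based weight downloader."""
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())


def run_fetch_with_fake_response(payload: bytes) -> bytes:
    """Call fetch() with urlopen patched, so no network access is needed."""
    fake_response = io.BytesIO(payload)  # BytesIO supports read() and `with`
    with tempfile.TemporaryDirectory() as tmp:
        dest = Path(tmp) / "weights.pth"
        with mock.patch(
            "urllib.request.urlopen", return_value=fake_response
        ) as fake_urlopen:
            fetch("https://example.invalid/weights.pth", dest)
        fake_urlopen.assert_called_once()
        return dest.read_bytes()
```

The same patch target also avoids real HTTP requests, which is one possible source of the `HTTP Error 404` failures above, assuming those tests reach the network.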

@KE7
Contributor Author

KE7 commented Mar 12, 2026

@maciejmajek It would be great if you could help me with the tests. Thank you! I don't have rai running e2e yet on my system as I'm battling dependency and GPU issues

@maciejmajek
Member

Sure @KE7. What seems to be the problem? I should be able to help you

@codecov

codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.11%. Comparing base (59d0e89) to head (a3ff4e4).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #769   +/-   ##
=======================================
  Coverage   73.11%   73.11%           
=======================================
  Files          82       82           
  Lines        3571     3571           
=======================================
  Hits         2611     2611           
  Misses        960      960           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@KE7
Contributor Author

KE7 commented Mar 18, 2026

@maciejmajek I think I have the tests working now. Any pointers on this ROS failure?

@maciejmajek
Member

@KE7 No idea, this was a random crash we see from time to time on humble.

@maciejmajek maciejmajek self-requested a review March 19, 2026 17:49
Member

@maciejmajek maciejmajek left a comment

LGTM, Thanks!

@maciejmajek maciejmajek merged commit 3c66900 into RobotecAI:main Mar 19, 2026
6 checks passed

2 participants