A ROS2-based multi-camera perception system with YOLOv5 and RGB-D 3D localization for autonomous exploration and grasping on Boston Dynamics Spot.
This repository implements the perception module of an autonomous robotic system for exploration and object retrieval on a Boston Dynamics Spot platform.
Built on ROS2, this package integrates real-time object detection, RGB-D-based 3D localization, and perception-driven interaction.
The system forms part of a closed-loop pipeline combining perception, navigation, and manipulation.
- Real-time object detection using YOLOv5 (mAP@0.5 = 0.962)
- Multi-camera RGB-D perception (5 cameras on Spot)
- 2D → 3D projection using depth + camera intrinsics
- TF-based transformation into the global map frame (`spot/map`)
- Multi-view fusion with spatial clustering
- Temporal filtering for robust detection
- Automatic navigation goal generation
- Pixel-level detection for grasping (hand camera)
- Dataset pipeline from ROS bag → training images
- Ubuntu 22.04
- ROS2 (Humble recommended)
- Python ≥ 3.8
Install dependencies:

```bash
sudo apt update
sudo apt install python3-pip ros-$ROS_DISTRO-vision-msgs
pip3 install yolov5
pip3 install opencv-python numpy
```

Build the workspace:

```bash
mkdir -p ~/yolov5_ws/src
cd ~/yolov5_ws/src
git clone https://github.com/RobbinSeason/spot_yolo_perception_ros2.git
cd ~/yolov5_ws
colcon build
source install/setup.bash
```

The main detection node (`yolo_detect_2d`) processes multi-camera RGB-D data to detect and localize objects in 3D.
Pipeline:
- YOLOv5 detection
- Bounding box center extraction
- Depth lookup (median filtering)
- 2D → 3D projection (see the sketch after this list)
- TF transform → `spot/map`
- Multi-camera clustering
- Navigation goal generation
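The core of the depth lookup and 2D → 3D step is pinhole back-projection over a median-filtered depth patch. A minimal sketch, assuming a depth image in meters and an intrinsic matrix `K` taken from `sensor_msgs/CameraInfo`; the function and parameter names are illustrative, not the package's actual API:

```python
import numpy as np

def pixel_to_3d(u, v, depth_img, K, patch=5):
    """Back-project pixel (u, v) into the camera frame.

    depth_img: H x W float array of depths in meters (assumed)
    K: 3x3 intrinsic matrix from sensor_msgs/CameraInfo
    patch: side length of the window used for median depth filtering
    """
    h, w = depth_img.shape
    half = patch // 2
    # Median over a small window around the bbox center rejects depth outliers
    window = depth_img[max(0, v - half):min(h, v + half + 1),
                       max(0, u - half):min(w, u + half + 1)]
    valid = window[np.isfinite(window) & (window > 0)]
    if valid.size == 0:
        return None  # no usable depth at this pixel
    z = float(np.median(valid))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Standard pinhole back-projection
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```

The resulting camera-frame point is then transformed into `spot/map` via TF before clustering across cameras.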
Outputs:
- `object_position_3d` (vision_msgs/Detection3DArray)
- `/goal_pose` (geometry_msgs/PoseStamped)
- `object_markers` (RViz visualization markers)
The hand camera node (`hand_yolo_pixel_trigger_node`) provides precise pixel-level localization for grasping.
Pipeline:
- Trigger detection via service
- YOLO inference per frame
- Select highest-confidence detection
- Reject unstable detections
- Temporal fusion (median filtering; see the sketch below)
- Output stable pixel coordinates
Outputs:
- `hand_object_pixel_center` (geometry_msgs/PointStamped)
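The stability check and temporal fusion can be pictured as a small buffer that resets on large jumps and emits a median once enough consistent frames accumulate. A minimal sketch built around the `stable_frames` and `max_jump_px` parameters from the configuration below; the class name is hypothetical:

```python
from collections import deque

import numpy as np

class PixelStabilizer:
    """Rejects jumpy detections and median-filters the stable ones."""

    def __init__(self, stable_frames=3, max_jump_px=25.0):
        self.buf = deque(maxlen=stable_frames)
        self.max_jump_px = max_jump_px

    def update(self, u, v):
        """Feed one per-frame detection; returns a fused (u, v) once stable."""
        p = np.array([u, v], dtype=float)
        if self.buf and np.linalg.norm(p - self.buf[-1]) > self.max_jump_px:
            self.buf.clear()  # large jump: restart the stability window
        self.buf.append(p)
        if len(self.buf) < self.buf.maxlen:
            return None  # not enough consistent frames yet
        # Temporal median over the buffered pixel coordinates
        return tuple(np.median(np.stack(self.buf), axis=0))
```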
The system uses 5 RGB-D cameras:
- `/spot/camera/frontleft/image/compressed`
- `/spot/camera/frontright/image/compressed`
- `/spot/camera/left/image/compressed`
- `/spot/camera/right/image/compressed`
- `/spot/camera/back/image/compressed`
Each camera is paired with a matching depth image and camera info topic (see the subscription sketch below).
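Consuming one camera's paired streams typically means approximately time-synchronizing them. A sketch using `message_filters`; the depth and camera-info topic names here are assumptions, only the compressed RGB topic comes from the list above:

```python
import rclpy
from rclpy.node import Node
from message_filters import ApproximateTimeSynchronizer, Subscriber
from sensor_msgs.msg import CameraInfo, CompressedImage, Image

class FrontLeftCamera(Node):
    """Synchronizes RGB, depth, and camera info for one Spot camera."""

    def __init__(self):
        super().__init__('frontleft_camera')
        rgb = Subscriber(self, CompressedImage,
                         '/spot/camera/frontleft/image/compressed')
        depth = Subscriber(self, Image,
                           '/spot/depth/frontleft/image')  # assumed topic name
        info = Subscriber(self, CameraInfo,
                          '/spot/camera/frontleft/camera_info')  # assumed topic name
        # Tolerate small timestamp offsets between the three streams
        sync = ApproximateTimeSynchronizer([rgb, depth, info],
                                           queue_size=10, slop=0.05)
        sync.registerCallback(self.on_frame)

    def on_frame(self, rgb, depth, info):
        self.get_logger().info(f'synchronized frame at {rgb.header.stamp.sec}')

def main():
    rclpy.init()
    rclpy.spin(FrontLeftCamera())

if __name__ == '__main__':
    main()
```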
Launch the multi-camera 3D detection node:

```bash
ros2 launch yolov5_ros2 yolov5_ros2_launch.py
```

Launch the hand camera node:

```bash
ros2 launch yolov5_ros2 hand_yolo_launch.py
```

Trigger a hand detection:

```bash
ros2 service call /hand_pixel/trigger std_srvs/srv/Trigger
```

Subscribed topics:

| Input | Type |
|---|---|
| RGB images | sensor_msgs/CompressedImage |
| Depth images | sensor_msgs/Image |
| Camera info | sensor_msgs/CameraInfo |
| Hand camera | sensor_msgs/Image |
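The hand detection trigger shown in the usage commands above can also be invoked programmatically. A minimal `rclpy` client sketch for the `/hand_pixel/trigger` service:

```python
import rclpy
from rclpy.node import Node
from std_srvs.srv import Trigger

def trigger_hand_detection():
    """Calls /hand_pixel/trigger once and reports the result."""
    rclpy.init()
    node = Node('hand_trigger_client')
    client = node.create_client(Trigger, '/hand_pixel/trigger')
    if not client.wait_for_service(timeout_sec=5.0):
        raise RuntimeError('trigger service not available')
    future = client.call_async(Trigger.Request())
    rclpy.spin_until_future_complete(node, future)
    result = future.result()
    node.get_logger().info(f'success={result.success}: {result.message}')
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    trigger_hand_detection()
```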
Published topics:

| Topic | Type | Description |
|---|---|---|
| `object_position_3d` | vision_msgs/Detection3DArray | 3D detections |
| `/goal_pose` | geometry_msgs/PoseStamped | Navigation target |
| `object_markers` | visualization_msgs/Marker | RViz visualization |
| `hand_object_pixel_center` | geometry_msgs/PointStamped | Pixel grasp point |
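A downstream consumer only needs a plain subscription to pick up the navigation targets. A minimal sketch that listens on `/goal_pose`:

```python
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped

class GoalListener(Node):
    """Minimal consumer of the navigation goals published by the 3D node."""

    def __init__(self):
        super().__init__('goal_listener')
        self.create_subscription(PoseStamped, '/goal_pose', self.on_goal, 10)

    def on_goal(self, msg):
        p = msg.pose.position
        self.get_logger().info(
            f'goal in {msg.header.frame_id}: ({p.x:.2f}, {p.y:.2f}, {p.z:.2f})')

def main():
    rclpy.init()
    rclpy.spin(GoalListener())

if __name__ == '__main__':
    main()
```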
device: "cpu"
model: "best"
work_offset_distance: 0.4
instance_radius: 2.0
buffer_size: 3
publish_score_thresh: 0.6
sync_mode: "auto"
max_wait_s: 0.05
```

Hand camera node parameters (`config/hand_yolo_params.yaml`):

```yaml
score_thresh: 0.5
stable_frames: 3
max_jump_px: 25.0
timeout_sec: 50.0
```

To build a training dataset, run the extraction script while replaying a recorded bag:

```bash
# Terminal 1: extract frames as they arrive
python3 bag_to_yolo_images.py

# Terminal 2: replay the recorded data
ros2 bag play your_data.bag
```

The script writes images into:

```
dataset_raw/
├── frontleft/images/
├── frontright/images/
├── left/images/
├── right/images/
└── back/images/
```
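The extraction step can be pictured as five `CompressedImage` subscriptions that decode and dump frames into the folders above. A minimal sketch; the actual `bag_to_yolo_images.py` may differ in details:

```python
import os

import cv2
import numpy as np
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import CompressedImage

CAMERAS = ['frontleft', 'frontright', 'left', 'right', 'back']

class BagToImages(Node):
    """Saves decoded frames from all five compressed RGB streams."""

    def __init__(self):
        super().__init__('bag_to_yolo_images')
        self.counts = {}
        for cam in CAMERAS:
            os.makedirs(f'dataset_raw/{cam}/images', exist_ok=True)
            self.create_subscription(
                CompressedImage, f'/spot/camera/{cam}/image/compressed',
                lambda msg, cam=cam: self.save(cam, msg), 10)

    def save(self, cam, msg):
        # Decode the JPEG/PNG payload straight from the message buffer
        img = cv2.imdecode(np.frombuffer(msg.data, np.uint8), cv2.IMREAD_COLOR)
        if img is None:
            return
        idx = self.counts.get(cam, 0)
        cv2.imwrite(f'dataset_raw/{cam}/images/{idx:06d}.jpg', img)
        self.counts[cam] = idx + 1

def main():
    rclpy.init()
    rclpy.spin(BagToImages())

if __name__ == '__main__':
    main()
```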
Package structure:

```
yolov5_ros2/
├── README.md
├── package.xml
├── setup.py
├── config/
│   ├── best.pt
│   ├── yolov5_params.yaml
│   └── hand_yolo_params.yaml
├── launch/
│   ├── yolov5_ros2_launch.py
│   └── hand_yolo_launch.py
├── yolov5_ros2/
│   ├── yolo_detect_2d.py
│   └── hand_yolo_pixel_trigger_node.py
└── bag_to_yolo_images.py
```
- YOLO weights must be placed in `config/best.pt`
- TF must provide the transform chain `camera_frame → spot/map` (see the lookup sketch below)
- RGB and depth images must be aligned
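Transforming a camera-frame point into `spot/map` goes through the TF buffer. A minimal sketch with `tf2_ros`; the node and method names are illustrative:

```python
import rclpy
from rclpy.duration import Duration
from rclpy.node import Node
from geometry_msgs.msg import PointStamped
import tf2_ros
import tf2_geometry_msgs  # registers PointStamped transform support

class MapProjector(Node):
    """Transforms a camera-frame point into the global map frame via TF."""

    def __init__(self):
        super().__init__('map_projector')
        self.tf_buffer = tf2_ros.Buffer()
        self.tf_listener = tf2_ros.TransformListener(self.tf_buffer, self)

    def to_map(self, point_camera: PointStamped) -> PointStamped:
        # Raises if the camera_frame -> spot/map chain is missing from TF
        return self.tf_buffer.transform(point_camera, 'spot/map',
                                        timeout=Duration(seconds=0.5))
```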
- Sensitive to depth noise
- False positives may affect grasping
- TF synchronization may impact accuracy
Haoyu Gong
Karlsruhe Institute of Technology (KIT)
MIT License