A ROS2-based multi-camera perception system with YOLOv5 and RGB-D 3D localization for autonomous exploration and grasping on Boston Dynamics Spot.
This repository implements the perception module of an autonomous robotic system for exploration and object retrieval on a Boston Dynamics Spot platform.
Built on ROS2, this package integrates real-time object detection, RGB-D-based 3D localization, and perception-driven interaction.
The system forms part of a closed-loop pipeline combining perception, navigation, and manipulation.
- Real-time object detection using YOLOv5 (mAP@0.5 = 0.962)
- Multi-camera RGB-D perception (5 cameras on Spot)
- 2D → 3D projection using depth + camera intrinsics
- TF-based transformation into the global map frame (`spot/map`)
- Multi-view fusion with spatial clustering
- Temporal filtering for robust detection
- Automatic navigation goal generation
- Pixel-level detection for grasping (hand camera)
- Dataset pipeline from ROS bag → training images
- Ubuntu 22.04
- ROS2 (Humble recommended)
- Python ≥ 3.8
Install dependencies:

```bash
sudo apt update
sudo apt install python3-pip ros-$ROS_DISTRO-vision-msgs
pip3 install yolov5
pip3 install opencv-python numpy
```

Build the workspace:

```bash
mkdir -p ~/yolov5_ws/src
cd ~/yolov5_ws/src
git clone https://github.com/RobbinSeason/spot_yolo_perception_ros2.git
cd ~/yolov5_ws
colcon build
source install/setup.bash
```

The main detection node (`yolo_detect_2d`) processes multi-camera RGB-D data to detect and localize objects in 3D.
Pipeline:
- YOLOv5 detection
- Bounding box center extraction
- Depth lookup (median filtering)
- 2D → 3D projection (see the sketch after this list)
- TF transform → `spot/map`
- Multi-camera clustering
- Navigation goal generation
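The core of the depth lookup and 2D → 3D step is pinhole back-projection over a median-filtered depth patch. A minimal sketch, assuming a depth image in meters and an intrinsic matrix `K` taken from `sensor_msgs/CameraInfo`; the function and parameter names are illustrative, not the package's actual API:

```python
import numpy as np

def pixel_to_3d(u, v, depth_img, K, patch=5):
    """Back-project pixel (u, v) into the camera frame.

    depth_img: H x W float array of depths in meters (assumed)
    K: 3x3 intrinsic matrix from sensor_msgs/CameraInfo
    patch: side length of the window used for median depth filtering
    """
    h, w = depth_img.shape
    half = patch // 2
    # Median over a small window around the bbox center rejects depth outliers
    window = depth_img[max(0, v - half):min(h, v + half + 1),
                       max(0, u - half):min(w, u + half + 1)]
    valid = window[np.isfinite(window) & (window > 0)]
    if valid.size == 0:
        return None  # no usable depth at this pixel
    z = float(np.median(valid))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Standard pinhole back-projection
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```

The resulting camera-frame point is then transformed into `spot/map` via TF before clustering across cameras.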
Outputs:
- `object_position_3d` (vision_msgs/Detection3DArray)
- `/goal_pose` (geometry_msgs/PoseStamped)
- `object_markers` (RViz visualization markers)
The hand camera node (`hand_yolo_pixel_trigger_node`) provides precise pixel-level localization for grasping.
Pipeline:
- Trigger detection via service
- YOLO inference per frame
- Select highest-confidence detection
- Reject unstable detections
- Temporal fusion (median filtering; see the sketch below)
- Output stable pixel coordinates
Outputs:
- `hand_object_pixel_center` (geometry_msgs/PointStamped)
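The stability check and temporal fusion can be pictured as a small buffer that resets on large jumps and emits a median once enough consistent frames accumulate. A minimal sketch built around the `stable_frames` and `max_jump_px` parameters from the configuration below; the class name is hypothetical:

```python
from collections import deque

import numpy as np

class PixelStabilizer:
    """Rejects jumpy detections and median-filters the stable ones."""

    def __init__(self, stable_frames=3, max_jump_px=25.0):
        self.buf = deque(maxlen=stable_frames)
        self.max_jump_px = max_jump_px

    def update(self, u, v):
        """Feed one per-frame detection; returns a fused (u, v) once stable."""
        p = np.array([u, v], dtype=float)
        if self.buf and np.linalg.norm(p - self.buf[-1]) > self.max_jump_px:
            self.buf.clear()  # large jump: restart the stability window
        self.buf.append(p)
        if len(self.buf) < self.buf.maxlen:
            return None  # not enough consistent frames yet
        # Temporal median over the buffered pixel coordinates
        return tuple(np.median(np.stack(self.buf), axis=0))
```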
The system uses 5 RGB-D cameras:
- `/spot/camera/frontleft/image/compressed`
- `/spot/camera/frontright/image/compressed`
- `/spot/camera/left/image/compressed`
- `/spot/camera/right/image/compressed`
- `/spot/camera/back/image/compressed`
Each camera is paired with a matching depth image and camera info topic (see the subscription sketch below).
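Consuming one camera's paired streams typically means approximately time-synchronizing them. A sketch using `message_filters`; the depth and camera-info topic names here are assumptions, only the compressed RGB topic comes from the list above:

```python
import rclpy
from rclpy.node import Node
from message_filters import ApproximateTimeSynchronizer, Subscriber
from sensor_msgs.msg import CameraInfo, CompressedImage, Image

class FrontLeftCamera(Node):
    """Synchronizes RGB, depth, and camera info for one Spot camera."""

    def __init__(self):
        super().__init__('frontleft_camera')
        rgb = Subscriber(self, CompressedImage,
                         '/spot/camera/frontleft/image/compressed')
        depth = Subscriber(self, Image,
                           '/spot/depth/frontleft/image')  # assumed topic name
        info = Subscriber(self, CameraInfo,
                          '/spot/camera/frontleft/camera_info')  # assumed topic name
        # Tolerate small timestamp offsets between the three streams
        sync = ApproximateTimeSynchronizer([rgb, depth, info],
                                           queue_size=10, slop=0.05)
        sync.registerCallback(self.on_frame)

    def on_frame(self, rgb, depth, info):
        self.get_logger().info(f'synchronized frame at {rgb.header.stamp.sec}')

def main():
    rclpy.init()
    rclpy.spin(FrontLeftCamera())

if __name__ == '__main__':
    main()
```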
Launch the multi-camera 3D detection node:

```bash
ros2 launch yolov5_ros2 yolov5_ros2_launch.py
```

Launch the hand camera node:

```bash
ros2 launch yolov5_ros2 hand_yolo_launch.py
```

Trigger a hand detection:

```bash
ros2 service call /hand_pixel/trigger std_srvs/srv/Trigger
```

Subscribed topics:

| Input | Type |
|---|---|
| RGB images | sensor_msgs/CompressedImage |
| Depth images | sensor_msgs/Image |
| Camera info | sensor_msgs/CameraInfo |
| Hand camera | sensor_msgs/Image |
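The hand detection trigger shown in the usage commands above can also be invoked programmatically. A minimal `rclpy` client sketch for the `/hand_pixel/trigger` service:

```python
import rclpy
from rclpy.node import Node
from std_srvs.srv import Trigger

def trigger_hand_detection():
    """Calls /hand_pixel/trigger once and reports the result."""
    rclpy.init()
    node = Node('hand_trigger_client')
    client = node.create_client(Trigger, '/hand_pixel/trigger')
    if not client.wait_for_service(timeout_sec=5.0):
        raise RuntimeError('trigger service not available')
    future = client.call_async(Trigger.Request())
    rclpy.spin_until_future_complete(node, future)
    result = future.result()
    node.get_logger().info(f'success={result.success}: {result.message}')
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    trigger_hand_detection()
```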
Published topics:

| Topic | Type | Description |
|---|---|---|
| `object_position_3d` | vision_msgs/Detection3DArray | 3D detections |
| `/goal_pose` | geometry_msgs/PoseStamped | Navigation target |
| `object_markers` | visualization_msgs/Marker | RViz visualization |
| `hand_object_pixel_center` | geometry_msgs/PointStamped | Pixel grasp point |
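A downstream consumer only needs a plain subscription to pick up the navigation targets. A minimal sketch that listens on `/goal_pose`:

```python
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped

class GoalListener(Node):
    """Minimal consumer of the navigation goals published by the 3D node."""

    def __init__(self):
        super().__init__('goal_listener')
        self.create_subscription(PoseStamped, '/goal_pose', self.on_goal, 10)

    def on_goal(self, msg):
        p = msg.pose.position
        self.get_logger().info(
            f'goal in {msg.header.frame_id}: ({p.x:.2f}, {p.y:.2f}, {p.z:.2f})')

def main():
    rclpy.init()
    rclpy.spin(GoalListener())

if __name__ == '__main__':
    main()
```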
device: "cpu"
model: "best"
work_offset_distance: 0.4
instance_radius: 2.0
buffer_size: 3
publish_score_thresh: 0.6
sync_mode: "auto"
max_wait_s: 0.05
```

Hand camera node parameters (`config/hand_yolo_params.yaml`):

```yaml
score_thresh: 0.5
stable_frames: 3
max_jump_px: 25.0
timeout_sec: 50.0
```

To build a training dataset, run the extraction script while replaying a recorded bag:

```bash
# Terminal 1: extract frames as they arrive
python3 bag_to_yolo_images.py

# Terminal 2: replay the recorded data
ros2 bag play your_data.bag
```

The script writes images into:

```
dataset_raw/
├── frontleft/images/
├── frontright/images/
├── left/images/
├── right/images/
└── back/images/
```
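The extraction step can be pictured as five `CompressedImage` subscriptions that decode and dump frames into the folders above. A minimal sketch; the actual `bag_to_yolo_images.py` may differ in details:

```python
import os

import cv2
import numpy as np
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import CompressedImage

CAMERAS = ['frontleft', 'frontright', 'left', 'right', 'back']

class BagToImages(Node):
    """Saves decoded frames from all five compressed RGB streams."""

    def __init__(self):
        super().__init__('bag_to_yolo_images')
        self.counts = {}
        for cam in CAMERAS:
            os.makedirs(f'dataset_raw/{cam}/images', exist_ok=True)
            self.create_subscription(
                CompressedImage, f'/spot/camera/{cam}/image/compressed',
                lambda msg, cam=cam: self.save(cam, msg), 10)

    def save(self, cam, msg):
        # Decode the JPEG/PNG payload straight from the message buffer
        img = cv2.imdecode(np.frombuffer(msg.data, np.uint8), cv2.IMREAD_COLOR)
        if img is None:
            return
        idx = self.counts.get(cam, 0)
        cv2.imwrite(f'dataset_raw/{cam}/images/{idx:06d}.jpg', img)
        self.counts[cam] = idx + 1

def main():
    rclpy.init()
    rclpy.spin(BagToImages())

if __name__ == '__main__':
    main()
```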
Package structure:

```
yolov5_ros2/
├── README.md
├── package.xml
├── setup.py
├── config/
│   ├── best.pt
│   ├── yolov5_params.yaml
│   └── hand_yolo_params.yaml
├── launch/
│   ├── yolov5_ros2_launch.py
│   └── hand_yolo_launch.py
├── yolov5_ros2/
│   ├── yolo_detect_2d.py
│   └── hand_yolo_pixel_trigger_node.py
└── bag_to_yolo_images.py
```
- YOLO weights must be placed in `config/best.pt`
- TF must provide the transform chain `camera_frame → spot/map` (see the lookup sketch below)
- RGB and depth images must be aligned
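Transforming a camera-frame point into `spot/map` goes through the TF buffer. A minimal sketch with `tf2_ros`; the node and method names are illustrative:

```python
import rclpy
from rclpy.duration import Duration
from rclpy.node import Node
from geometry_msgs.msg import PointStamped
import tf2_ros
import tf2_geometry_msgs  # registers PointStamped transform support

class MapProjector(Node):
    """Transforms a camera-frame point into the global map frame via TF."""

    def __init__(self):
        super().__init__('map_projector')
        self.tf_buffer = tf2_ros.Buffer()
        self.tf_listener = tf2_ros.TransformListener(self.tf_buffer, self)

    def to_map(self, point_camera: PointStamped) -> PointStamped:
        # Raises if the camera_frame -> spot/map chain is missing from TF
        return self.tf_buffer.transform(point_camera, 'spot/map',
                                        timeout=Duration(seconds=0.5))
```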
- Sensitive to depth noise
- False positives may affect grasping
- TF synchronization may impact accuracy
Haoyu Gong
Karlsruhe Institute of Technology (KIT)
MIT License