Multimodal Navigation Transformer

This repository contains a navigation model that predicts short waypoint trajectories from multimodal observations. The main model used here is ViLiNT, a diffusion-based policy conditioned on image context, 3D observations, robot state, and a goal direction.

For environment and installation instructions, see SETUP.md.

1. ViLiNT

ViLiNT takes as input:

a short history of RGB images,
a 3D observation stream,
a robot's embodiment vector,
a goal direction in the robot frame.

It outputs a short trajectory of future position waypoints that can then be converted into velocity commands by a PD controller.
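
As an illustration of the goal-direction input listed above, the sketch below expresses a world-frame goal position as a unit direction in the robot frame, given a planar pose (x, y, yaw). The function name and the planar unit-vector convention are assumptions made for this example; the repository may encode the goal differently.

```python
import math


def goal_direction_in_robot_frame(robot_x, robot_y, robot_yaw, goal_x, goal_y):
    """Return the unit vector from robot to goal, expressed in the robot frame."""
    dx, dy = goal_x - robot_x, goal_y - robot_y
    # Rotate the world-frame offset by -yaw to express it in the robot frame.
    c, s = math.cos(robot_yaw), math.sin(robot_yaw)
    fx = c * dx + s * dy
    fy = -s * dx + c * dy
    norm = math.hypot(fx, fy)
    if norm < 1e-9:
        # Robot is already at the goal: no meaningful direction.
        return 0.0, 0.0
    return fx / norm, fy / norm
```

With this convention, a goal straight ahead of the robot maps to (1, 0) regardless of the robot's heading in the world frame.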

Encoders used

Image encoder: DUNE ViT.
3D encoder: point cloud encoder PointTransformerV3.

How the 3D modality is introduced

The 3D input is first encoded by the point-cloud backbone. Then it is converted into a small set of LiDAR tokens with a tokenizer that groups the scene into angular sectors and radial rings. These 3D tokens are concatenated with the image-history tokens, physics token, and goal token before multimodal fusion with the transformer.
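
The sector-and-ring grouping can be pictured with a minimal NumPy sketch. It assumes per-point features are already available and mean-pools them into polar bins; the function name, the bin counts, max_range, and the mean pooling are all illustrative assumptions, not the repository's tokenizer.

```python
import numpy as np


def sector_ring_tokens(xyz, feats, n_sectors=8, n_rings=4, max_range=20.0):
    """Mean-pool per-point features into n_sectors * n_rings polar bins."""
    angle = np.arctan2(xyz[:, 1], xyz[:, 0])      # azimuth in [-pi, pi]
    radius = np.hypot(xyz[:, 0], xyz[:, 1])       # planar distance to the sensor
    sector = np.floor((angle + np.pi) / (2 * np.pi) * n_sectors).astype(int)
    sector = np.clip(sector, 0, n_sectors - 1)    # angle == pi falls in the last sector
    ring = np.floor(radius / max_range * n_rings).astype(int)
    ring = np.clip(ring, 0, n_rings - 1)          # points beyond max_range go to the outer ring
    bin_id = sector * n_rings + ring
    n_bins = n_sectors * n_rings
    tokens = np.zeros((n_bins, feats.shape[1]), dtype=np.float64)
    np.add.at(tokens, bin_id, feats)              # unbuffered sum of features per bin
    counts = np.bincount(bin_id, minlength=n_bins)
    nonempty = counts > 0
    tokens[nonempty] /= counts[nonempty, None]    # mean; empty bins stay at zero
    return tokens
```

The output has a fixed number of rows whatever the number of input points, which is what makes it convenient to concatenate with the other token groups.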

2. How to train the model

The main training entry point is:

cd train
python train.py --config config/vilint.yaml

With Slurm:

cd train
sbatch train.sh

Main training config

The main file to edit is:

train/config/vilint.yaml

Expected dataset format

The training loader in train/mnt_train/data/vilint_dataset.py does not read loose jpg / npy files directly. The final per-trajectory format is archive-based:

images.tar for RGB frames
points.zarr for stacked point clouds
traj_data.pkl or pose arrays under trajectory.zarr/... for robot poses
width_curve.zarr for the clearance distance ground truth.

The train/test split folders referenced in train/config/vilint.yaml should point to splits whose trajectory names match these archived trajectory folders.
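
Before launching a training run, it can help to check that every trajectory named in a split has the expected archives. The sketch below is a hypothetical helper, not part of the repository. It assumes split names are available as a list of strings and treats width_curve.zarr as required, which may not hold for every dataset.

```python
from pathlib import Path

# Entries listed in the dataset format above; width_curve.zarr is assumed required here.
REQUIRED = ("images.tar", "points.zarr", "width_curve.zarr")


def missing_entries(traj_dir):
    """List the expected per-trajectory entries that are absent from traj_dir."""
    traj_dir = Path(traj_dir)
    missing = [name for name in REQUIRED if not (traj_dir / name).exists()]
    # Poses may live either in traj_data.pkl or under trajectory.zarr.
    has_poses = (traj_dir / "traj_data.pkl").exists() or (traj_dir / "trajectory.zarr").exists()
    if not has_poses:
        missing.append("traj_data.pkl or trajectory.zarr")
    return missing


def unmatched_split_names(split_names, dataset_root):
    """Return the split entries that have no trajectory folder under dataset_root."""
    root = Path(dataset_root)
    return [name for name in split_names if not (root / name).is_dir()]
```

Both functions return an empty list when everything is in place, so they can be used directly in an assertion or a pre-training check.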

3. How to deploy with ROS

The ROS2 deployment entry point is:

cd deployment/src
python3 deploy_vilint.py --model vilint --imgwaypoints

The full tmux launcher is:

cd deployment/src
bash deploy_vilint.sh

This starts:

the ViLiNT inference node,
the PD waypoint controller,
RViz,
a rosbag recorder.

Files to edit before deployment

deployment/config/models.yaml
- set ckpt_path to the chosen checkpoint to deploy,
- enable or disable mask_image, mask_lidar, and heuristics.
deployment/config/robot.yaml
- set robot velocity limits,
- set control topic names,
- adjust robot dimensions if needed.
deployment/src/topics_names.py
- set the subscribed ROS topics:
  - IMAGE_TOPIC
  - LIDAR_TOPIC
  - ODOM_TOPIC
  - GOAL_TOPIC
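
For orientation, deployment/src/topics_names.py might look like the fragment below. The four variable names come from the list above; the topic strings are placeholders chosen for this example and must be replaced with the topics your robot actually publishes.

```python
# Hypothetical contents for deployment/src/topics_names.py.
# The topic strings are placeholders, not values taken from the repository.
IMAGE_TOPIC = "/camera/image_raw"
LIDAR_TOPIC = "/lidar/points"
ODOM_TOPIC = "/odom"
GOAL_TOPIC = "/goal_pose"
```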

Deployment behavior

deploy_vilint.py reads the trained model checkpoint, subscribes to image / LiDAR / odometry / goal topics, predicts waypoints, and publishes them on the waypoint topic. pd_controller.py then converts the waypoint stream into velocity commands.
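
To make the waypoint-to-velocity step concrete, here is a minimal PD sketch that tracks a single waypoint given in the robot frame. It is not the code in pd_controller.py: the class name, gains, and limits are assumptions, and the real controller may select waypoints and handle limits differently.

```python
import math


class WaypointPD:
    """PD law on distance and heading error toward a robot-frame waypoint."""

    def __init__(self, kp_lin=0.8, kd_lin=0.1, kp_ang=1.5, kd_ang=0.2,
                 max_lin=1.0, max_ang=1.0):
        self.kp_lin, self.kd_lin = kp_lin, kd_lin
        self.kp_ang, self.kd_ang = kp_ang, kd_ang
        self.max_lin, self.max_ang = max_lin, max_ang
        self.prev = None  # (distance, heading) errors from the previous step

    def step(self, wx, wy, dt):
        """Return (linear, angular) velocity for a waypoint at (wx, wy)."""
        dist = math.hypot(wx, wy)
        heading = math.atan2(wy, wx)
        if self.prev is None or dt <= 0.0:
            d_dist, d_heading = 0.0, 0.0  # no derivative on the first call
        else:
            d_dist = (dist - self.prev[0]) / dt
            diff = heading - self.prev[1]
            # Wrap the heading change to [-pi, pi] before differentiating.
            d_heading = math.atan2(math.sin(diff), math.cos(diff)) / dt
        self.prev = (dist, heading)
        v = self.kp_lin * dist + self.kd_lin * d_dist
        w = self.kp_ang * heading + self.kd_ang * d_heading
        # Clamp to the velocity limits (forward motion only in this sketch).
        v = max(0.0, min(self.max_lin, v))
        w = max(-self.max_ang, min(self.max_ang, w))
        return v, w
```

In a deployment like the one described here, max_lin and max_ang would correspond to the velocity limits set in deployment/config/robot.yaml.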

Docker image

A Docker image can be found in test/docker. It lets you deploy the model alongside the IsaacSim simulation running with ROS2. It targets x86 machines equipped with NVIDIA GPUs.

NB: Building the image is computationally intensive because of the flash-attention build step.

4. How to generate data from rosbags

There are two extraction scripts:

ROS1 bags: train/mnt_train/process_data/process_bags.py
ROS2 bags: train/mnt_train/process_data/process_bags_ros2.py

These scripts first extract each trajectory into a folder of raw files such as:

traj_data.pkl
0.jpg, 1.jpg, ...
0.npy, 1.npy, ...

Then generate the width-curve ground truth with:

cd train/mnt_train/process_data
python process_lidar_collision.py /path/to/dataset/datas/trajectories -d dataset_name

After that, build the archives used by ViLiNT_Dataset with:

cd train/mnt_train/process_data
python build_archives.py --root /path/to/dataset/datas/trajectories --overwrite

build_archives.py creates, inside each trajectory folder:

images.tar
points.zarr
aligned_indices.txt
build_summary.json

and converts width_curve.npy to width_curve.zarr when present.
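
The image-archiving step can be sketched with the standard library alone. The function below is an illustration under stated assumptions, not the code of build_archives.py: it assumes frames are named 0.jpg, 1.jpg, ... and stored in frame order in an uncompressed tar, and its overwrite argument only mimics the --overwrite flag.

```python
import tarfile
from pathlib import Path


def pack_images(traj_dir, overwrite=False):
    """Pack 0.jpg, 1.jpg, ... from traj_dir into traj_dir/images.tar in frame order."""
    traj_dir = Path(traj_dir)
    out_path = traj_dir / "images.tar"
    if out_path.exists() and not overwrite:
        return out_path  # keep the existing archive
    frames = sorted(
        (p for p in traj_dir.glob("*.jpg") if p.stem.isdigit()),
        key=lambda p: int(p.stem),  # numeric sort, so 10.jpg follows 9.jpg
    )
    # Uncompressed tar: JPEG data is already compressed.
    with tarfile.open(str(out_path), "w") as tar:
        for frame in frames:
            tar.add(str(frame), arcname=frame.name)
    return out_path
```

Sorting by the integer stem matters: a plain lexicographic sort would place 10.jpg before 2.jpg and scramble the frame order.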

ROS2 example

cd train/mnt_train/process_data
python process_bags_ros2.py \
  --dataset-name husky \
  --input-dir /path/to/ros2_bags \
  --output-dir /path/to/processed_dataset \
  --sample-rate 4.0

ROS1 example

cd train/mnt_train/process_data
python process_bags.py \
  --dataset-name husky \
  --input-dir /path/to/ros1_bags \
  --output-dir /path/to/processed_dataset \
  --sample-rate 4.0

Where to specify which ROS topics to use

The bag-processing topic selection is configured in:

train/mnt_train/process_data/process_bags_config.yaml

If your bag format is different

If your camera, LiDAR, or odometry message format is different, add or adapt the processing functions in:

train/mnt_train/process_data/process_data_utils.py for ROS1
train/mnt_train/process_data/process_data_utils_ros2.py for ROS2

This is where image conversion, point-cloud parsing, and odometry-to-(x, y, yaw) conversion are defined.
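
As a reference point when adapting the odometry conversion, the sketch below reduces a 3D pose with a unit orientation quaternion to planar (x, y, yaw) using the standard yaw extraction formula. The function name and argument layout are assumptions for this example and do not mirror the helpers in process_data_utils.py.

```python
import math


def odom_to_xy_yaw(px, py, qx, qy, qz, qw):
    """Reduce a 3D pose (position + unit quaternion) to planar (x, y, yaw)."""
    # Yaw is the rotation about the z axis (ZYX Euler convention).
    siny_cosp = 2.0 * (qw * qz + qx * qy)
    cosy_cosp = 1.0 - 2.0 * (qy * qy + qz * qz)
    return px, py, math.atan2(siny_cosp, cosy_cosp)
```

The returned yaw lies in [-pi, pi]; roll and pitch are discarded, which is reasonable only when the robot moves on roughly flat ground.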

Recommended preprocessing flow

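Run process_bags.py or process_bags_ros2.py to extract raw images, LiDAR frames, and traj_data.pkl.
Run process_lidar_collision.py on the extracted trajectories to generate the width-curve ground truth.
Run build_archives.py on the generated trajectory folders to create images.tar and points.zarr, and to convert the width curves to width_curve.zarr.
Point the dataset entries in train/config/vilint.yaml to the resulting dataset root and splits.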
