
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

arXiv | Project Page

Shuang Zeng1,2, Xinyuan Chang1, Mengwei Xie1, Xinran Liu1, Yifan Bai2,3, Zheng Pan1, Mu Xu1, Xing Wei2

1Amap, Alibaba Group, 2Xi’an Jiaotong University, 3DAMO Academy, Alibaba Group

Official implementation of FSDrive, a VLM that thinks visually about trajectory planning, advancing autonomous driving towards visual reasoning.

video_demo.mp4

Table of Contents

- 🛠️ Installation
- 📦 Data Preparation
- 🚀 Training
- 🎯 Infer
- 📈 Evaluation
- 👀 Visualization
- 📜 Citing
- 🙏 Acknowledgement

🛠️ Installation

Set up the required environment with the following steps:

git clone https://github.com/MIV-XJTU/FSDrive.git && cd FSDrive

conda create -n FSDrive python=3.10 -y && conda activate FSDrive

# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

cd LLaMA-Factory && pip install -e ".[metrics,deepspeed,liger-kernel,bitsandbytes]" --no-build-isolation

cd .. && pip install -r requirements.txt

📦 Data Preparation

1. Download nuScenes

Download the complete dataset from nuScenes and extract it to ./LLaMA-Factory/data/nuscenes.

Or create a symbolic link to it:

ln -s /path/to/your/nuscenes LLaMA-Factory/data

We use pre-cached nuScenes metadata, which can be downloaded from Google Drive. Place the file cached_nuscenes_info.pkl in the directory ./create_data, and place the metrics folder in the directory ./tools/data.
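After these steps, the relevant paths (inferred from the instructions above) should look roughly like this:

FSDrive/
├── create_data/
│   └── cached_nuscenes_info.pkl
├── tools/
│   └── data/
│       └── metrics/
└── LLaMA-Factory/
    └── data/
        └── nuscenes/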

2. Extract visual tokens

Extract the visual tokens of the front-view images for both the pre-training and the fine-tuning data, which serve as supervision targets for the MLLM:

python MoVQGAN/pretrain_data.py
python MoVQGAN/sft_data.py

3. Construct data

Construct the pre-training and fine-tuning data in the LLaMA-Factory format:

python create_data/pretrain_data.py
python create_data/sft_data.py --split train # Change to "val" for constructing the validation set

Following the LLaMA-Factory tutorial, register the constructed datasets in ./LLaMA-Factory/data/dataset_info.json.
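As a minimal sketch, the entries might look like the following. The dataset names match the commands in this README, but the formatting and column fields are assumptions; check the JSON emitted by the create_data scripts and the LLaMA-Factory dataset documentation for the exact values.

"train_cot_motion": {
  "file_name": "train_cot_motion.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  }
},
"val_cot_motion": {
  "file_name": "val_cot_motion.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  }
}

Merge such entries into the existing dataset_info.json rather than replacing the file.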

🚀 Training

Enter the working directory of LLaMA-Factory:

cd LLaMA-Factory

1. Pre-train

First, pre-train the VLM to activate its visual generation capabilities:

llamafactory-cli train ../configs/pretrain.yaml

2. SFT

Then, based on the pre-trained parameters, fine-tune the VLM to think visually about trajectory planning:

llamafactory-cli train ../configs/sft.yaml

🎯 Infer

In the LLaMA-Factory directory, run the following command to perform inference on the test set:

python scripts/vllm_infer.py \
--model_name_or_path saves/qwen2_vl-7b/sft \
--dataset val_cot_motion \
--template qwen2_vl \
--cutoff_len 32768 \
--max_new_tokens 2048 \
--max_samples 100000 \
--image_resolution 524288 \
--save_name results.jsonl \
--temperature 0.1 \
--top_p 0.1 \
--top_k 10

📈 Evaluation

First, in the FSDrive root directory, match the predicted results with their sample tokens to prepare for evaluation:

cd ..

python tools/match.py \
--pred_trajs_path ./LLaMA-Factory/results.jsonl \
--token_traj_path ./LLaMA-Factory/data/train_cot_motion.json

Then evaluate the L2 error and collision rate metrics for end-to-end trajectory planning:

# Change "uniad" to "stp3" to use the ST-P3 calculation method
python tools/evaluation/evaluation.py \
--metric uniad \
--result_file ./LLaMA-Factory/eval_traj.json
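For intuition, the snippet below is a minimal sketch of the average-L2 idea only; it is not the evaluation.py implementation, which follows the full UniAD and ST-P3 protocols and also performs the collision check.

# Minimal sketch: average L2 distance between predicted and ground-truth waypoints.
import numpy as np

def average_l2(pred_traj, gt_traj):
    """Mean Euclidean distance (metres) between (T, 2) BEV waypoint arrays."""
    pred_traj = np.asarray(pred_traj, dtype=float)
    gt_traj = np.asarray(gt_traj, dtype=float)
    return float(np.linalg.norm(pred_traj - gt_traj, axis=-1).mean())

# Example: a 3 s horizon sampled at 2 Hz gives 6 (x, y) waypoints.
pred = [[0.0, 1.1], [0.1, 2.3], [0.2, 3.4], [0.2, 4.6], [0.3, 5.9], [0.3, 7.1]]
gt   = [[0.0, 1.0], [0.0, 2.2], [0.1, 3.5], [0.1, 4.8], [0.2, 6.0], [0.2, 7.3]]
print(f"average L2: {average_l2(pred, gt):.3f} m")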

👀 Visualization

Use the following command in the FSDrive root directory to visualize the planned trajectory:

python tools/visualization/visualize_planning.py \
--pred-trajs-path ./LLaMA-Factory/results.jsonl \
--tokens-path ./LLaMA-Factory/eval_traj.json \
--output-path ./vis_traj

Use the following command in the FSDrive root directory to restore the visual tokens to pixel space and visualize the CoT:

python ./MoVQGAN/vis.py \
--input_json ./LLaMA-Factory/eval_traj.json \
--output_dir ./vis_cot

📜 Citing

If you find FSDrive useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry:

@article{zeng2025FSDrive,
  title={FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving},
  author={Shuang Zeng and Xinyuan Chang and Mengwei Xie and Xinran Liu and Yifan Bai and Zheng Pan and Mu Xu and Xing Wei},
  journal={arXiv preprint arXiv:2505.17685},
  year={2025}
}

🙏 Acknowledgement

Our work is primarily built on the following codebases: LLaMA-Factory, MoVQGAN, GPT-Driver, and Agent-Driver. We are sincerely grateful for their work.
