Shuang Zeng¹,², Xinyuan Chang¹, Mengwei Xie¹, Xinran Liu¹, Yifan Bai²,³, Zheng Pan¹, Mu Xu¹, Xing Wei²

¹Amap, Alibaba Group; ²Xi’an Jiaotong University; ³DAMO Academy, Alibaba Group
Official implementation of FSDrive: a VLM that thinks visually about trajectory planning, advancing autonomous driving towards visual reasoning.
Demo video: `video_demo.mp4`
- 🛠️ Installation
- 📦 Data Preparation
- 🚀 Training
- 🎯 Inference
- 📈 Evaluation
- 👀 Visualization
- 📜 Citing
- 🙏 Acknowledgement
## 🛠️ Installation

Create the required environment through the following steps:
```bash
git clone https://github.com/MIV-XJTU/FSDrive.git && cd FSDrive
conda create -n FSDrive python=3.10 -y && conda activate FSDrive

# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

cd LLaMA-Factory && pip install -e ".[metrics,deepspeed,liger-kernel,bitsandbytes]" --no-build-isolation
cd .. && pip install -r requirements.txt
```
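If the environment resolved correctly, PyTorch should report the pinned version and see the GPU. This quick check (a generic sanity check, not part of the repo's scripts) catches a mismatched CUDA wheel early:

```bash
# Expect "2.5.1+cu124 True" on a machine with a working CUDA 12.4 setup
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```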
## 📦 Data Preparation

### 1. Download nuScenes

Download the complete dataset from nuScenes and extract it to `./LLaMA-Factory/data/nuscenes`.
Or create a symbolic link instead:

```bash
ln -s /path/to/your/nuscenes LLaMA-Factory/data
```
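Either way, the dataset should end up visible under `LLaMA-Factory/data/nuscenes`. A quick listing confirms the link resolves (the subfolder names below are the standard nuScenes layout, not something this repo adds):

```bash
# Expect the usual nuScenes folders: maps/, samples/, sweeps/, v1.0-trainval/
ls LLaMA-Factory/data/nuscenes
```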
We use pre-cached data from the nuScenes dataset, which can be downloaded from Google Drive. Place the file `cached_nuscenes_info.pkl` in the directory `./create_data`, and place the `metrics` folder in the directory `./tools/data`.
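A quick check that the pre-cached files landed where the later scripts expect them:

```bash
# Both paths come from the instructions above
ls create_data/cached_nuscenes_info.pkl
ls tools/data/metrics
```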
### 2. Extract visual tokens

Separately extract the front-view visual tokens from both the pre-training and fine-tuning data, which serve as supervision for the MLLM:

```bash
python MoVQGAN/pretrain_data.py
python MoVQGAN/sft_data.py
```
### 3. Construct data

Construct the pre-training and fine-tuning data in the format expected by LLaMA-Factory:

```bash
python create_data/pretrain_data.py
python create_data/sft_data.py --split train  # change to "val" to construct the validation set
```
Follow the LLaMA-Factory tutorial and register the datasets in `./LLaMA-Factory/data/dataset_info.json`.
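As a rough orientation, a multimodal entry in `dataset_info.json` usually looks like the sketch below. The dataset name, file name, and column mapping here are illustrative assumptions; mirror whatever the construction scripts actually emit:

```json
"train_cot_motion": {
  "file_name": "train_cot_motion.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  }
}
```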
## 🚀 Training

Enter the working directory of LLaMA-Factory:

```bash
cd LLaMA-Factory
```
### 1. Pre-train

First, pre-train the VLM to activate its visual generation capabilities:

```bash
llamafactory-cli train ../configs/pretrain.yaml
```
### 2. SFT

Then, starting from the pre-trained parameters, fine-tune the VLM to think visually about trajectory planning:

```bash
llamafactory-cli train ../configs/sft.yaml
```
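The authoritative settings live in `./configs/`. Purely as orientation, a LLaMA-Factory SFT config generally contains keys like the following; every value here is an illustrative assumption, not the repo's actual configuration:

```yaml
# Illustrative sketch only; see ./configs/sft.yaml for the real values
model_name_or_path: saves/qwen2_vl-7b/pretrain  # assumed: resume from the pre-trained weights
stage: sft
do_train: true
finetuning_type: full
dataset: train_cot_motion      # name registered in dataset_info.json
template: qwen2_vl             # matches the template used at inference time
cutoff_len: 32768
output_dir: saves/qwen2_vl-7b/sft  # matches the checkpoint path used at inference
bf16: true
```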
## 🎯 Inference

Run the following command in the LLaMA-Factory directory to run inference on the test set:

```bash
python scripts/vllm_infer.py \
    --model_name_or_path saves/qwen2_vl-7b/sft \
    --dataset val_cot_motion \
    --template qwen2_vl \
    --cutoff_len 32768 \
    --max_new_tokens 2048 \
    --max_samples 100000 \
    --image_resolution 524288 \
    --save_name results.jsonl \
    --temperature 0.1 \
    --top_p 0.1 \
    --top_k 10
```
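Inference writes one JSON object per sample to `results.jsonl`. A quick peek at the first record shows what the matching step below will consume (the key names depend on your LLaMA-Factory version, so inspect rather than assume them):

```bash
# Pretty-print the first prediction record
head -n 1 results.jsonl | python -m json.tool
```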
## 📈 Evaluation

First, from the FSDrive root directory, match the predicted results with the sample tokens to facilitate evaluation:

```bash
cd ..
python tools/match.py \
    --pred_trajs_path ./LLaMA-Factory/results.jsonl \
    --token_traj_path ./LLaMA-Factory/data/train_cot_motion.json
```
Then evaluate the L2 error and collision rate for end-to-end trajectory planning; the UniAD and ST-P3 protocols differ in how they average the metrics over the planning horizon:

```bash
# Change --metric to "stp3" to use the ST-P3 calculation method
python tools/evaluation/evaluation.py \
    --metric uniad \
    --result_file ./LLaMA-Factory/eval_traj.json
```
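To report both protocols in one pass, a small loop works; the only assumption is that `--metric` accepts exactly the two values named above:

```bash
# Evaluate under the UniAD and ST-P3 protocols in turn
for metric in uniad stp3; do
    python tools/evaluation/evaluation.py \
        --metric "$metric" \
        --result_file ./LLaMA-Factory/eval_traj.json
done
```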
## 👀 Visualization

Use the following command from the FSDrive root directory to visualize the planned trajectory:

```bash
python tools/visualization/visualize_planning.py \
    --pred-trajs-path ./LLaMA-Factory/results.jsonl \
    --tokens-path ./LLaMA-Factory/eval_traj.json \
    --output-path ./vis_traj
```
Use the following command from the FSDrive root directory to decode the visual tokens back to pixel space and visualize the CoT:

```bash
python ./MoVQGAN/vis.py \
    --input_json ./LLaMA-Factory/eval_traj.json \
    --output_dir ./vis_cot
```
## 📜 Citing

If you find FSDrive useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry:
```bibtex
@article{zeng2025FSDrive,
  title={FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving},
  author={Shuang Zeng and Xinyuan Chang and Mengwei Xie and Xinran Liu and Yifan Bai and Zheng Pan and Mu Xu and Xing Wei},
  journal={arXiv preprint arXiv:2505.17685},
  year={2025}
}
```
## 🙏 Acknowledgement

Our work is primarily built on the following codebases: LLaMA-Factory, MoVQGAN, GPT-Driver, and Agent-Driver. We are sincerely grateful for their work.