Shuang Zeng¹,², Xinyuan Chang¹, Mengwei Xie¹, Xinran Liu¹, Yifan Bai²,³, Zheng Pan¹, Mu Xu¹, Xing Wei²

¹Amap, Alibaba Group; ²Xi’an Jiaotong University; ³DAMO Academy, Alibaba Group
Official implementation of FSDrive: a VLM that thinks visually about trajectory planning, advancing autonomous driving towards visual reasoning.
Demo video: `video_demo.mp4`
- 🛠️ Installation
- 📦 Data Preparation
- 🚀 Training
- 🎯 Inference
- 📈 Evaluation
- 👀 Visualization
- 📜 Citing
- 🙏 Acknowledgement
## 🛠️ Installation

Create the required environment through the following steps:
```bash
git clone https://github.com/MIV-XJTU/FSDrive.git && cd FSDrive
conda create -n FSDrive python=3.10 -y && conda activate FSDrive

# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

cd LLaMA-Factory && pip install -e ".[metrics,deepspeed,liger-kernel,bitsandbytes]" --no-build-isolation
cd .. && pip install -r requirements.txt
```
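If the environment resolved correctly, PyTorch should report the pinned version and see the GPU. This quick check (a generic sanity check, not part of the repo's scripts) catches a mismatched CUDA wheel early:

```bash
# Expect "2.5.1+cu124 True" on a machine with a working CUDA 12.4 setup
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```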
## 📦 Data Preparation

### 1. Download nuScenes

Download the complete dataset from nuScenes and extract it to `./LLaMA-Factory/data/nuscenes`.
Or create a symbolic link instead:

```bash
ln -s /path/to/your/nuscenes LLaMA-Factory/data
```
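Either way, the dataset should end up visible under `LLaMA-Factory/data/nuscenes`. A quick listing confirms the link resolves (the subfolder names below are the standard nuScenes layout, not something this repo adds):

```bash
# Expect the usual nuScenes folders: maps/, samples/, sweeps/, v1.0-trainval/
ls LLaMA-Factory/data/nuscenes
```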
We use pre-cached data from the nuScenes dataset, which can be downloaded from Google Drive. Place the file `cached_nuscenes_info.pkl` in the directory `./create_data`, and place the `metrics` folder in the directory `./tools/data`.
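A quick check that the pre-cached files landed where the later scripts expect them:

```bash
# Both paths come from the instructions above
ls create_data/cached_nuscenes_info.pkl
ls tools/data/metrics
```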
### 2. Extract visual tokens

Separately extract the front-view visual tokens from both the pre-training and fine-tuning data, which serve as supervision for the MLLM:

```bash
python MoVQGAN/pretrain_data.py
python MoVQGAN/sft_data.py
```
### 3. Construct data

Construct the pre-training and fine-tuning data in the format expected by LLaMA-Factory:

```bash
python create_data/pretrain_data.py
python create_data/sft_data.py --split train  # change to "val" to construct the validation set
```
Follow the LLaMA-Factory tutorial and register the datasets in `./LLaMA-Factory/data/dataset_info.json`.
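As a rough orientation, a multimodal entry in `dataset_info.json` usually looks like the sketch below. The dataset name, file name, and column mapping here are illustrative assumptions; mirror whatever the construction scripts actually emit:

```json
"train_cot_motion": {
  "file_name": "train_cot_motion.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  }
}
```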
## 🚀 Training

Enter the working directory of LLaMA-Factory:

```bash
cd LLaMA-Factory
```
### 1. Pre-train

First, pre-train the VLM to activate its visual generation capabilities:

```bash
llamafactory-cli train ../configs/pretrain.yaml
```
### 2. SFT

Then, starting from the pre-trained parameters, fine-tune the VLM to think visually about trajectory planning:

```bash
llamafactory-cli train ../configs/sft.yaml
```
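The authoritative settings live in `./configs/`. Purely as orientation, a LLaMA-Factory SFT config generally contains keys like the following; every value here is an illustrative assumption, not the repo's actual configuration:

```yaml
# Illustrative sketch only; see ./configs/sft.yaml for the real values
model_name_or_path: saves/qwen2_vl-7b/pretrain  # assumed: resume from the pre-trained weights
stage: sft
do_train: true
finetuning_type: full
dataset: train_cot_motion      # name registered in dataset_info.json
template: qwen2_vl             # matches the template used at inference time
cutoff_len: 32768
output_dir: saves/qwen2_vl-7b/sft  # matches the checkpoint path used at inference
bf16: true
```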
## 🎯 Inference

Run the following command in the LLaMA-Factory directory to run inference on the test set:

```bash
python scripts/vllm_infer.py \
    --model_name_or_path saves/qwen2_vl-7b/sft \
    --dataset val_cot_motion \
    --template qwen2_vl \
    --cutoff_len 32768 \
    --max_new_tokens 2048 \
    --max_samples 100000 \
    --image_resolution 524288 \
    --save_name results.jsonl \
    --temperature 0.1 \
    --top_p 0.1 \
    --top_k 10
```
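Inference writes one JSON object per sample to `results.jsonl`. A quick peek at the first record shows what the matching step below will consume (the key names depend on your LLaMA-Factory version, so inspect rather than assume them):

```bash
# Pretty-print the first prediction record
head -n 1 results.jsonl | python -m json.tool
```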
## 📈 Evaluation

First, from the FSDrive root directory, match the predicted results with the sample tokens to facilitate evaluation:

```bash
cd ..
python tools/match.py \
    --pred_trajs_path ./LLaMA-Factory/results.jsonl \
    --token_traj_path ./LLaMA-Factory/data/train_cot_motion.json
```
Then evaluate the L2 error and collision rate for end-to-end trajectory planning; the UniAD and ST-P3 protocols differ in how they average the metrics over the planning horizon:

```bash
# Change --metric to "stp3" to use the ST-P3 calculation method
python tools/evaluation/evaluation.py \
    --metric uniad \
    --result_file ./LLaMA-Factory/eval_traj.json
```
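To report both protocols in one pass, a small loop works; the only assumption is that `--metric` accepts exactly the two values named above:

```bash
# Evaluate under the UniAD and ST-P3 protocols in turn
for metric in uniad stp3; do
    python tools/evaluation/evaluation.py \
        --metric "$metric" \
        --result_file ./LLaMA-Factory/eval_traj.json
done
```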
## 👀 Visualization

Use the following command from the FSDrive root directory to visualize the planned trajectory:

```bash
python tools/visualization/visualize_planning.py \
    --pred-trajs-path ./LLaMA-Factory/results.jsonl \
    --tokens-path ./LLaMA-Factory/eval_traj.json \
    --output-path ./vis_traj
```
Use the following command from the FSDrive root directory to decode the visual tokens back to pixel space and visualize the CoT:

```bash
python ./MoVQGAN/vis.py \
    --input_json ./LLaMA-Factory/eval_traj.json \
    --output_dir ./vis_cot
```
## 📜 Citing

If you find FSDrive useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry:
```bibtex
@article{zeng2025FSDrive,
  title={FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving},
  author={Shuang Zeng and Xinyuan Chang and Mengwei Xie and Xinran Liu and Yifan Bai and Zheng Pan and Mu Xu and Xing Wei},
  journal={arXiv preprint arXiv:2505.17685},
  year={2025}
}
```
## 🙏 Acknowledgement

Our work is primarily built on the following codebases: LLaMA-Factory, MoVQGAN, GPT-Driver, and Agent-Driver. We are sincerely grateful for their work.