Skip to content

Oranger-l/MGSD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Task Model Training License arXiv

MGSD studies how vision-language models perceive structured visual states and learn to plan over them. It first uses cold-start perception SFT to help the model recover task state from images, then uses OPCD training where a text-only teacher guides a visual student.

The code supports three visual spatial planning tasks:

  • 🧊 FrozenLake: navigate from player to goal while avoiding holes.
  • 🧩 Maze: recover maze connectivity and plan through open corridors.
  • 🖨️ MiniBehaviour: pick up the printer object and drop it next to the table.

Datasets and trained checkpoints will be released separately.

🚀 Pipeline

MGSD pipeline

✨ Highlights

  • Perception-first cold start: SFT targets task-state recognition before downstream planning.
  • Text-teacher / visual-student OPCD: the teacher receives symbolic context, while the student learns from task images.

🗂️ Repository Map

MGSD/
  task_envs/             # task simulators, parsers, renderers, reward utilities
  LlamaFactory/          # cold-start perception SFT pipeline
  EasyR1/                # OPCD training pipeline
  evaluate/              # checkpoint, perception, API, modality-gap evaluation
  api_config_files/      # empty API config templates
  data/                  # local data placeholder
  models/                # local model placeholder
  requirements/          # environment requirement files

🛠️ Environment Setup

MGSD uses three environments:

Environment Purpose Setup
vspsft Cold-start perception SFT requirements/vspsft.txt + editable local LlamaFactory
easyr1 OPCD training follow upstream EasyR1
vlmevalkit checkpoint/API evaluation follow upstream VLMEvalKit

Example SFT environment:

conda create -n vspsft python=3.11 -y
conda activate vspsft
pip install -r requirements/vspsft.txt
pip install -e LlamaFactory

For EasyR1 and VLMEvalKit, use their official installation guides or containers. This repository keeps the MGSD task-specific scripts and wrappers.

🔍 Cold-Start Perception SFT

After unpacking the released SFT data, train the 4B model:

bash LlamaFactory/examples/train_lora/qwen3vl_4b_vsp_tasks_perception_sft.sh

Train the 8B model:

bash LlamaFactory/examples/train_lora/qwen3vl_8b_vsp_tasks_perception_sft.sh

Common overrides include MODEL_NAME_OR_PATH, OUTPUT_DIR, DATASET_DIR, MEDIA_DIR, SWANLAB_PROJECT, and SWANLAB_RUN_NAME.

Merge a trained LoRA checkpoint:

bash LlamaFactory/examples/merge_lora/merge_vsp_tasks_perception_sft.sh \
  LlamaFactory/saves/qwen3-vl-4b/vsp_tasks_perception_lora_sft/checkpoint-828

Set MODEL_SIZE=8b or EXPORT_DIR=... when merging an 8B checkpoint or writing to a custom destination.

🧠 OPCD Training

Prepare mixed OPCD parquet data from the released task data:

bash EasyR1/examples/prepare_qwen3_vl_4b_vsp_tasks_opcd_data.sh

Train 4B:

bash EasyR1/examples/qwen3_vl_4b_vsp_tasks_opcd_base_teacher.sh

Train 8B:

bash EasyR1/examples/qwen3_vl_8b_vsp_tasks_opcd_base_teacher.sh

Common overrides include MODEL_PATH, DATA_ROOT, VAL_RAW_ROOT, TRAIN_FILES, VAL_FILES, PROJECT_NAME, and EXPERIMENT_NAME.

Merge an OPCD checkpoint:

bash EasyR1/examples/merge_vsp_tasks_opcd_checkpoint.sh \
  EasyR1/checkpoints/VSP_Tasks_OPCD/<experiment>/global_step_<N>

By default, the merged Hugging Face checkpoint is written to global_step_<N>/actor/huggingface. Set EXPORT_DIR=... to copy it to a custom checkpoint directory.

📊 Evaluation

Evaluation entry points live in evaluate/.

Perception SFT evaluation:

bash evaluate/perception_eval/vlm.sh
bash evaluate/perception_eval/run_vsp_perception_eval.sh

Local checkpoint VisualPlanning evaluation:

MODEL_PATH=models/ckpts/<your-checkpoint> \
  bash evaluate/ckpt_eval/serve_ckpt_8x.sh

bash evaluate/ckpt_eval/run_ckpt_eval.sh

Closed-source or OpenAI-compatible API evaluation:

export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://your-endpoint/v1
bash evaluate/api_eval/run_api_eval.sh

See evaluate/README.md for more details.

📝 Note

TODO:

  • Release VSP-Tasks training and validation data.
  • Release trained SFT and OPCD checkpoints.
  • Release the arXiv paper.
  • Release the source code.

📬 Contact

For questions or discussion, please contact oranger_wy@163.com.

🙏 Acknowledgements

MGSD builds on several excellent open-source projects:

  • LLaMA Factory for efficient multimodal SFT.
  • EasyR1 for scalable multimodal RL/OPCD training infrastructure.
  • VLMEvalKit for VLM evaluation tooling and ecosystem support.

We sincerely thank the authors and contributors of these frameworks.

📚 Citation

If you find MGSD useful for your research, please consider citing our paper:

@misc{luo2026mgsd,
  title = {Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation},
  author = {Luo, Haocheng and Liu, Jiahui and Zhang, Ruicheng and Zhong, Zhizhou
            and Huang, Jiaqi and Xu, Zunnan and Shi, Quan and Zhou, Jun and Li, Xiu},
  year = {2026},
  eprint = {2606.06076},
  archivePrefix = {arXiv}
}

Paper: https://arxiv.org/pdf/2606.06076

About

Official code repository for MGSD: Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages