Jintao Tong1, Shilin Yan2†‡, Hongwei Xue2, Xiaojun Tang2, Kunyu Shi2,
Guannan Zhang2, Ruixuan Li1‡, Yixiong Zou1‡
†Project Leader ‡Corresponding author
1Huazhong University of Science and Technology, 2Accio Team, Alibaba Group
- **2025.02.06** 🚀 Model and Dataset are released!
- **2025.02.05** 🚀 Training Code is available!
- **2025.02.05** 📝 We release our latest work SwimBird!
We introduce SwimBird, a hybrid autoregressive MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision–text reasoning. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks.
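To make the mode switching concrete, the toy decoding loop below shows how a query-adaptive decoder could interleave discrete text tokens with continuous visual thoughts. This is an illustration only, not SwimBird's actual implementation: the token ids, hidden-state size, and helper functions are all hypothetical stand-ins.

```python
# Conceptual illustration only (not the actual SwimBird code): a decoding loop
# that interleaves discrete text tokens with continuous "visual thoughts".
# All token ids, shapes, and helpers below are hypothetical placeholders.
import torch

VISUAL_THOUGHT_TOKEN = 151665   # hypothetical id marking a visual-thought step
EOS_TOKEN = 151645              # hypothetical end-of-sequence id

def decode_step(embeddings: torch.Tensor) -> tuple[int, torch.Tensor]:
    """Stand-in for one forward pass: returns (next_token_id, last_hidden_state)."""
    hidden = torch.randn(1, 4096)                      # placeholder hidden state
    next_id = int(torch.randint(0, 152000, (1,)).item())
    return next_id, hidden

def embed(token_id: int) -> torch.Tensor:
    """Stand-in for the text-token embedding lookup."""
    return torch.randn(1, 4096)

def generate(prompt_embeddings: torch.Tensor, max_steps: int = 32) -> list:
    """Query-adaptive decoding: at each step the model decides whether the next
    reasoning step is a text token or a continuous visual thought."""
    sequence, inputs = [], prompt_embeddings
    for _ in range(max_steps):
        token_id, hidden = decode_step(inputs)
        if token_id == EOS_TOKEN:
            break
        if token_id == VISUAL_THOUGHT_TOKEN:
            # Vision-mode step: feed the continuous hidden state back in directly
            # instead of detokenizing it into text.
            next_embedding = hidden
            sequence.append(("visual_thought", hidden))
        else:
            # Text-mode step: an ordinary autoregressive text token.
            next_embedding = embed(token_id)
            sequence.append(("text_token", token_id))
        inputs = torch.cat([inputs, next_embedding.unsqueeze(1)], dim=1)
    return sequence

# Example: decode from a dummy 8-token prompt embedding.
print(generate(torch.randn(1, 8, 4096), max_steps=4))
```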
```bash
git clone https://github.com/Accio-Lab/SwimBird.git
cd SwimBird
pip install -r requirements.txt
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
```
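Once the dependencies are installed, inference should follow the familiar Qwen-VL-style interface, since SwimBird builds on Qwen3-VL. The snippet below is a minimal sketch under that assumption; the repo id `Accio-Lab/SwimBird`, the image path, and the prompt are placeholders rather than confirmed release names.

```python
# Minimal inference sketch. Assumes the checkpoint loads through the standard
# Qwen-VL-style interface in transformers; the repo id below is a placeholder.
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model = AutoModelForImageTextToText.from_pretrained(
    "Accio-Lab/SwimBird", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Accio-Lab/SwimBird")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/example.jpg"},
            {"type": "text", "text": "Solve the geometry problem in the image."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding the generated answer.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```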
To train the model, follow these steps:
- Replace Qwen3-VL's `chat_template.json` with ours.
- Download the training dataset SwimBird-SFT-92K and add the dataset's absolute directory path as a prefix to all image paths in the JSON files (see the sketch after this list for what this step does):

  ```bash
  python data_process.py absolute_path_to_dataset
  ```

  Example:

  ```bash
  python data_process.py /abs_path/SwimBird-ZebraCoT/
  python data_process.py /abs_path/SwimBird-MathCanvas/
  python data_process.py /abs_path/SwimBird-ThinkMorph/
  python data_process.py /abs_path/SwimBird-OpenMMReasoner/
  ```

- Run the training script with the following command:
  ```bash
  bash scripts/train.sh
  ```
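For reference, the path-prefixing step above can be pictured as the sketch below. It assumes each annotation JSON is a list of samples whose image paths live under an `images` key; the key name and file layout are assumptions, so consult `data_process.py` for the actual logic.

```python
# Hypothetical sketch of the path-prefixing step performed by data_process.py.
# The JSON layout (a list of samples with an "images" field) is an assumption.
import json
import os
import sys

def prefix_image_paths(json_file: str, dataset_root: str) -> None:
    with open(json_file, "r", encoding="utf-8") as f:
        samples = json.load(f)
    for sample in samples:
        # Turn each relative image path into an absolute path under dataset_root.
        sample["images"] = [os.path.join(dataset_root, p) for p in sample.get("images", [])]
    with open(json_file, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    root = sys.argv[1]                      # e.g. /abs_path/SwimBird-ZebraCoT/
    for name in os.listdir(root):
        if name.endswith(".json"):
            prefix_image_paths(os.path.join(root, name), root)
```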
We adopt VLMEvalKit to conduct the evaluation. You can get started as follows:

```bash
cd VLMEvalKit
pip install -e .
bash test.sh
```

The path to our model: `VLMEvalKit-main/vlmeval/vlm/swimbird`
See [QuickStart] for more details about arguments.
- If you have any questions about this project, please feel free to contact: tattoo.ysl@gmail.com.
- We are actively seeking self-motivated researchers and research interns to join our team!
- If you find this project useful in your research, please consider citing:
- We sincerely thank Qwen-VL-Series-Finetune, Skila, and others for their contributions, which have provided valuable insights.

