
Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

Paper · Project Page · Code · License

A principled, privacy-preserving, and rehearsal-free continual learning framework for MLLMs.

Yuehao Liu1 · Shanyan Guan2 · Weijia Zhang1 · Xuanming Shang1 · Yanhao Ge2 · Wei Li2 · Chao Ma1*

1 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University 2 vivo Mobile Communication Co., Ltd.

Abstract: Continual learning in multimodal large language models (MLLMs) aims to acquire knowledge sequentially while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks; rehearsal-based methods rely on storing historical data, raising privacy and storage concerns; and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without access to historical task data. The proposed two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing the prior SOTA by 2.14% and 6.82% in Avg and Last, respectively.

(Figure: Octopus pipeline overview)


📢 News & Updates

  • [2026/02] 🎉 Octopus has been accepted to CVPR 2026!
  • [2026/03] 🚀 Project page goes live! Click here to visit.
  • [2026/03] 🔥 We release the training and evaluation code. Stay tuned for more updates!

🔮 Highlights

Octopus introduces a novel paradigm to mitigate Catastrophic Forgetting (CF) natively without historical replay buffers or architectural expansion.

  • 💡 History-Free Gradient Orthogonalization (HiFGO): Introduces a principled metric, GPWC (Gradients of Previous parameters Within Current data distribution). HiFGO estimates the gradient sensitivity of historical tasks on current-task data and uses it to steer the current optimization (see the sketch after this list).
  • ⚖️ Two-Stage Finetuning Strategy: Decouples unconstrained task alignment (Stage 1) from constrained refinement on the optimal manifold (Stage 2), effectively balancing plasticity and stability.
  • 🏆 State-of-The-Art Performance: Outperforms state-of-the-art MoE/Regularization strategies across rigorous sequential learning benchmarks like UCIT and CoIN.
  • ⚡ Zero Inference Overhead: Introduces only a single LoRA module at inference, matching the efficiency of standard sequential fine-tuning and free from expert-routing delays.
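For intuition, here is a minimal PyTorch sketch of the gradient-level orthogonality idea described above. It is not the repository's HiFGO implementation: the helper orthogonalize and the reference gradients gpwc_grads, which stand in for GPWC (gradients of the previous parameters evaluated on current-task data), are hypothetical names of our own.

import torch

def orthogonalize(g_curr, g_prev, eps=1e-12):
    # Project the current gradient onto the orthogonal complement of the
    # reference direction, so the update avoids directions the previous
    # task is most sensitive to.
    g_c, g_p = g_curr.flatten(), g_prev.flatten()
    coeff = torch.dot(g_c, g_p) / (torch.dot(g_p, g_p) + eps)
    return (g_c - coeff * g_p).view_as(g_curr)

# Schematic usage inside a training step (gpwc_grads holds one reference
# gradient tensor per trainable LoRA parameter):
# for p, g_ref in zip(lora_params, gpwc_grads):
#     if p.grad is not None:
#         p.grad = orthogonalize(p.grad, g_ref)
# optimizer.step()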

🛠️ Installation

Our codebase is built upon the ms-swift framework for Parameter-Efficient Fine-Tuning (PEFT) of MLLMs.

# 1. Clone the repository
git clone https://github.com/fxmangd/Octopus.git
cd Octopus

# 2. Create conda environment
conda create -n octopus python=3.10
conda activate octopus

# 3. Install requirements
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.53.1 peft==0.15.2 accelerate==1.7.0 vllm==0.7.3 modelscope pandas datasets pydantic einops tensorboardX matplotlib tensorboard
pip install mistral-common==1.0.1 mathruler pycocotools pycocoevalcap pylatexenc

# 4. Install core dependencies including ms-swift
pip install -e .
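As an optional sanity check (a snippet of our own, not part of the repository), you can confirm from a Python shell that the core dependencies import correctly and print their versions, assuming the pinned packages above installed without errors:

# Optional: verify the core dependencies
import torch, transformers, peft
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)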

📊 Dataset Preparation

We evaluate Octopus on the UCIT benchmark and CoIN benchmark.

To reproduce our results, you will need to prepare the dataset and instruction files. You can download the raw datasets and their original instructions directly from their official repositories. For your convenience, we have processed and organized the instruction files. You can download our well-formatted instructions directly from Google Drive:

📥 Download Organized Instructions via Google Drive

📁 Directory Structure

After downloading (and extracting) the data, please ensure that you organize the instruction files into the following directory structure:

data/
├── UCIT_instructions/
│   ├── train/
│   │   ├── ImageNet-R.json
│   │   ├── ArxivQA.json
│   │   ├── ...
│   │   └── Flickr30k.json
│   └── test/
│       ├── ImageNet-R.json
│       ├── ArxivQA.json
│       ├── ...
│       └── Flickr30k.json
├── CoIN_instructions/
│   ├── train/
│   │   ├── SciQA.json
│   │   ├── TextVQA.json
│   │   ├── ...
│   │   └── OCR-VQA.json
│   └── test/
│       ├── ScienceQA.json
│       ├── TextVQA.json
│       ├── ...
│       └── OCR-VQA.json
├── UCIT_raw_datas/
│   ├── ImageNet-R/
│   ├── ArxivQA/
│   ├── ...
│   └── Flickr30k/
└── CoIN_raw_datas/
    ├── ScienceQA/
    ├── TextVQA/
    ├── ...
    └── OCR-VQA/
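As a quick check that everything is in place (an illustrative helper of our own, not shipped with the repository; the folder names come directly from the tree above), the following Python sketch verifies that the expected directories exist:

# Verify the expected data layout
from pathlib import Path

ROOT = Path("data")
expected = [
    "UCIT_instructions/train", "UCIT_instructions/test",
    "CoIN_instructions/train", "CoIN_instructions/test",
    "UCIT_raw_datas", "CoIN_raw_datas",
]
for rel in expected:
    path = ROOT / rel
    print(("ok      " if path.is_dir() else "MISSING ") + str(path))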


🚀 Quick Start (Training & Evaluation)

We provide ready-to-run shell scripts that conduct sequential learning end-to-end.

1. Sequential Continual Learning on UCIT

To launch the Octopus training pipeline on the UCIT benchmark using LLaVA-v1.5-7b:

cd Octopus
export PYTHONPATH=./

bash examples/Octopus_UCIT/train_all.sh

2. Sequential Continual Learning on CoIN

To launch the Octopus training pipeline on the CoIN benchmark using LLaVA-v1.5-7b:

cd Octopus
export PYTHONPATH=./

bash examples/Octopus_CoIN/train_all.sh

3. Evaluation

To evaluate the final performance after sequential fine-tuning across all tasks:

# Evaluate on UCIT benchmark
python evaluate.py --dataset_name UCIT

# Evaluate on CoIN benchmark
python evaluate.py --dataset_name CoIN

[NOTE] By default, evaluate.py evaluates the most recently trained model. To evaluate a specific pre-trained model, manually set the adapter_paths parameter in evaluate.py. Note that the LoRA weights produced by Octopus are incremental updates on top of previous tasks, so you must list the LoRA weights of all historical tasks in adapter_paths; they are merged automatically during execution.
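For illustration, an adapter_paths entry in evaluate.py might look like the sketch below. The variable name comes from the note above, but the checkpoint directories are placeholder paths, not real outputs of this repository:

# Hypothetical example -- replace the placeholder paths with your actual checkpoints.
# List the LoRA adapters of ALL historical tasks in training order, so their
# incremental updates can be merged before evaluation.
adapter_paths = [
    "output/task1-lora",  # placeholder: first task's adapter
    "output/task2-lora",  # placeholder: second task's adapter
    # ...
    "output/taskN-lora",  # placeholder: most recently trained task's adapter
]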


📊 Main Results

Performance on UCIT Benchmark 📈

Comparison with various methods on UCIT in terms of Avg and Last. The best and second-best methods are marked in bold and underline, respectively. Zero-shot evaluates the pretrained model without finetuning; Multi-task jointly finetunes the model across all datasets; Sequential Finetune adapts a single LoRA module sequentially across all tasks. These settings provide empirical lower-bound, upper-bound, and baseline references for continual learning methods.

(Table: main results on UCIT)

Qualitative Previews

🏷️ Dataset 1: VizWiz (Captioning)
(example image)

🏷️ Dataset 2: ImageNet-R (VQA)
(example image)

🏷️ Dataset 3: IconQA (VQA)
(example image)

🏷️ Dataset 4: CLEVR-Math (VQA)
(example image)


✒️ Citation

If you find our work or this code repository helpful for your research, please consider citing our paper:

@article{liu2026octopus,
  title={Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models},
  author={Liu, Yuehao and Guan, Shanyan and Zhang, Weijia and Shang, Xuanming and Ge, Yanhao and Li, Wei and Ma, Chao},
  journal={arXiv preprint arXiv:2605.14938},
  year={2026}
}

🙏 Acknowledgements

This repository is built upon the ms-swift, HiDe-LLaVA, CoIN and LLaVA projects. We sincerely thank the authors for their valuable contributions to the research community.
