"M⁴-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection" by Jiyuan Liu, Jia Lin, Xiaofei Zhou*, Runmin Cong, Deyang Liu, Zhi Liu 🎉 CVPR 2026 Accepted!
📑 Paper (arXiv) (to be added) | 💻 Code (GitHub)
We propose M⁴-SAM, a prompt-free framework that adapts SAM2 for RGB-D video salient object detection by introducing modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization.
## Key Highlights
- 💡 **Modality-Aware MoE-LoRA:** extends vanilla LoRA with convolutional experts and modality-specific routing for adaptive RGB-D feature fusion and parameter-efficient fine-tuning.
- 🧩 **Gated Multi-Level Feature Fusion:** hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism to balance spatial details and semantic context.
- 🚀 **Pseudo-Guided Initialization:** bootstraps the memory bank with a coarse mask as a pseudo prior, enabling VSOD without any manual prompts.
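The Modality-Aware MoE-LoRA highlight above can be sketched as follows. This is an illustrative NumPy toy, not the released implementation: the expert count, LoRA rank, softmax router, and the use of plain linear (rather than convolutional) experts are all simplifying assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

D, R, E = 16, 4, 3  # feature dim, LoRA rank, number of experts (illustrative sizes)

# Frozen pretrained weight, standing in for a SAM2 projection matrix.
W = rng.standard_normal((D, D)) * 0.02

# Each expert is a low-rank LoRA pair (A: D->R, B: R->D).
# B is zero-initialised, so every expert's update B @ A starts at zero.
A = rng.standard_normal((E, R, D)) * 0.02
B = np.zeros((E, D, R))

# Modality-specific routers: one gating matrix per modality (0 = RGB, 1 = depth),
# so RGB and depth tokens can be sent to different expert mixtures.
routers = rng.standard_normal((2, D, E)) * 0.02

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_lora(x, modality):
    """x: (N, D) tokens of one modality; modality: 0 for RGB, 1 for depth."""
    base = x @ W.T                          # frozen backbone path
    gates = softmax(x @ routers[modality])  # (N, E) per-token expert weights
    inner = np.einsum('erd,nd->enr', A, x)  # down-project through each A_e
    delta = np.einsum('ne,edr,enr->nd', gates, B, inner)  # up-project and mix
    return base + delta

x = rng.standard_normal((5, D))
y = moe_lora(x, modality=0)
```

Because `B` is zero-initialised, the adapted layer reproduces the frozen backbone exactly at the start of fine-tuning; only `A`, `B`, and the routers would receive gradients, which is what keeps the adaptation parameter-efficient.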
Code is coming soon! Stay tuned.
## Datasets

RDVS, ViDSOD-100, and DViSal
## Dependent Models

SAM2: download the `sam2.1_hiera_large.pt` checkpoint.
## Acknowledgements

Our work would not have been possible without the following open-source projects. Thanks for their great contributions!
## Citation

If you find our work useful, please cite our paper. Thank you!

```bibtex
@inproceedings{liu2026m4sam,
  title={M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection},
  author={Liu, Jiyuan and Lin, Jia and Zhou, Xiaofei and Cong, Runmin and Liu, Deyang and Liu, Zhi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```