Yimin Wei1,2*, Aoran Xiao2*, Hongruixuan Chen1,2, Junshi Xia2, Naoto Yokoya1,2 †
1 The University of Tokyo, 2 RIKEN AIP
* Equal contribution, † Corresponding author
Mar 20th, 2026: The arXiv paper of MM-OVSeg is now online. If you are interested in the details of MM-OVSeg, do not hesitate to take a look!
Notice ☀️☀️: MM-OVSeg was accepted to the CVPR 2026 conference on February 21, 2026! Related data and benchmark suites will be released soon!
- Release Datasets for CVPR version (Feb 22, 2026)
- Release Train/Evaluation code for CVPR version
- Release pre-trained weights for CVPR version
Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical–SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities—optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision–language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions.
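To make the idea above concrete, here is a minimal, self-contained sketch of multimodal open-vocabulary decoding: per-pixel optical and SAR features are merged by a gate, then each fused pixel feature is matched against text embeddings by cosine similarity. All function names, the gating rule, and the toy shapes are illustrative assumptions, not the actual MM-OVSeg implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def fuse_and_segment(opt_feat, sar_feat, text_emb):
    """Toy open-vocabulary segmentation over fused optical/SAR features.

    opt_feat, sar_feat: (H, W, C) per-pixel features from two encoders.
    text_emb:           (K, C) embeddings of K category names.
    Returns an (H, W) label map of the best-matching category per pixel.
    """
    # Illustrative gated fusion: a sigmoid gate decides, per pixel, how much
    # to trust the optical branch versus the SAR branch.
    gate = 1.0 / (1.0 + np.exp(-(opt_feat - sar_feat).mean(axis=-1, keepdims=True)))
    fused = gate * opt_feat + (1.0 - gate) * sar_feat

    # Per-pixel cosine similarity against each text embedding, then argmax
    # over categories: the "open-vocabulary" decoding step.
    logits = l2_normalize(fused) @ l2_normalize(text_emb).T  # (H, W, K)
    return logits.argmax(axis=-1)                            # (H, W)

# Tiny synthetic example: 4x4 feature maps, 3 text categories.
rng = np.random.default_rng(0)
opt = rng.normal(size=(4, 4, 8))
sar = rng.normal(size=(4, 4, 8))
texts = rng.normal(size=(3, 8))
labels = fuse_and_segment(opt, sar, texts)
print(labels.shape)  # (4, 4)
```

Because the category set only enters through `text_emb`, swapping in embeddings of new class names changes the label space without retraining the encoders, which is what distinguishes open-vocabulary segmentation from fixed-class segmentation.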
# 1. git clone this repository
git clone https://github.com/Jimmyxichen/MM-OVSeg.git
cd MM-OVSeg
# 2. create new anaconda env
conda create -n MMOVSeg python=3.8
conda activate MMOVSeg
# 3. install torch and dependencies
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
# Optional: install a newer version of PyTorch to use the DINOv3 model as the backbone. In our case, the versions are PyTorch 2.5.0, Python 3.10, CUDA 12.6, and cuDNN 9.3.0.
# Exact dependency versions are not strict; in general, you only need to pay attention to PyTorch and Detectron2.
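Since exact dependency versions are flexible, a quick stdlib-only check like the one below (not part of MM-OVSeg; the helper names are ours) can confirm that an installed version string, e.g. from `torch.__version__`, meets a chosen minimum, ignoring local build suffixes such as `+cu118`.

```python
def parse_version(v):
    # Keep only the leading numeric components: "2.3.0+cu118" -> (2, 3, 0).
    parts = []
    for p in v.split("+")[0].split("."):
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def meets_minimum(installed, minimum):
    # Tuple comparison gives the usual lexicographic version ordering.
    return parse_version(installed) >= parse_version(minimum)

print(meets_minimum("2.5.0", "2.3.0"))        # True
print(meets_minimum("2.3.0+cu118", "2.5.0"))  # False
```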
We include the following multimodal RS dataset configurations under diverse weather and domain conditions in this repo:
- clear-sky weather: PIE-RGB-SAR-clean
- synthetic cloud cover with varying opacity (thin vs. thick vs. varied): PIE-RGB-SAR-cloud (varied cloud), DDHR-SK (varied cloud), OpenEarthMap-SAR (OEM-thin & OEM-thick)
- cross-domain generalization: DDHR-CH (varied cloud)
We provide the above processed datasets for your convenience. Download them from here.
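After downloading, the optical and SAR tiles need to be paired before use. The snippet below sketches one way to do this by matching filename stems; the `rgb/` and `sar/` folder names and `.png` extension are assumptions for illustration, and the released datasets may use a different layout.

```python
from pathlib import Path
import tempfile

def pair_rgb_sar(root):
    """Pair optical and SAR tiles that share a filename stem.

    Assumes a hypothetical layout:
        root/rgb/<tile>.png
        root/sar/<tile>.png
    """
    root = Path(root)
    rgb = {p.stem: p for p in (root / "rgb").glob("*.png")}
    sar = {p.stem: p for p in (root / "sar").glob("*.png")}
    # Keep only tiles present in both modalities, in a stable order.
    common = sorted(rgb.keys() & sar.keys())
    return [(rgb[s], sar[s]) for s in common]

# Demo with a temporary directory standing in for a downloaded dataset.
tmp = Path(tempfile.mkdtemp())
(tmp / "rgb").mkdir()
(tmp / "sar").mkdir()
for name in ["t0001", "t0002"]:
    (tmp / "rgb" / f"{name}.png").touch()
    (tmp / "sar" / f"{name}.png").touch()
(tmp / "rgb" / "t0003.png").touch()  # unmatched optical tile is skipped

pairs = pair_rgb_sar(tmp)
print(len(pairs))  # 2
```

Pairing by stem rather than by directory order avoids silent misalignment when one modality has extra or missing tiles.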
The authors would also like to give special thanks to GSNet, DINOv3 and SegEarth-OV.
For any questions, please feel free to open an issue or contact us.