MM-OVSeg: Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

✨CVPR 2026✨

Yimin Wei1,2*, Aoran Xiao2*, Hongruixuan Chen1,2, Junshi Xia2, Naoto Yokoya1,2 †

1 The University of Tokyo, 2 RIKEN AIP

* Equal contribution, † Corresponding author

arXiv paper HuggingFace Dataset

🛎️News

  • Mar 20th, 2026: The arXiv paper of MM-OVSeg is now online. If you are interested in the details of MM-OVSeg, take a look!
  • Notice☀️☀️: MM-OVSeg was accepted to CVPR 2026 on February 21, 2026! Related data and benchmark suites will be released soon!

🔥TODO

  • Release Datasets for CVPR version (Feb 22, 2026)
  • Release Train/Evaluation code for CVPR version
  • Release pre-trained weights for CVPR version

Abstract

Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical–SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities—optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision–language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions.
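As a rough intuition for the hierarchical fusion idea described above, here is a toy sketch in plain Python (an illustration only: the function name and the simple per-level averaging rule are assumptions — the paper's fusion module is learned and integrates features from multiple vision foundation models):

```python
# Toy sketch of hierarchical dual-encoder fusion (NOT the released model).
# Each "encoder" yields a list of per-level feature vectors; fusing matched
# pyramid levels lets both modalities contribute at every spatial scale.
def fuse_hierarchical(optical_feats, sar_feats):
    """Average corresponding pyramid levels from two encoders, element-wise.

    optical_feats, sar_feats: lists of per-level feature vectors (lists of
    floats), one entry per pyramid level, same shapes in both modalities.
    """
    assert len(optical_feats) == len(sar_feats), "encoders must emit the same number of levels"
    fused = []
    for opt_level, sar_level in zip(optical_feats, sar_feats):
        # Element-wise mean stands in for the learned fusion at this level.
        fused.append([(o + s) / 2 for o, s in zip(opt_level, sar_level)])
    return fused
```

In the actual framework, the averaging above would be replaced by learned, text-aligned fusion; the sketch only shows the level-by-level structure.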

Dependencies and Installation

# 1. git clone this repository
git clone https://github.com/Jimmyxichen/MM-OVSeg.git
cd MM-OVSeg

# 2. create new anaconda env
conda create -n MMOVSeg python=3.8
conda activate MMOVSeg

# 3. install torch and dependencies
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt

# Optional: install a newer version of PyTorch to use the DINOv3 model as the backbone.
# In our case: PyTorch 2.5.0, Python 3.10, CUDA 12.6, and cuDNN 9.3.0.

# The dependency versions are not strict; in general, only PyTorch and detectron2 need attention.

Datasets

We include the following multimodal RS dataset configurations under diverse weather and domain conditions in this repo:

  1. clear-sky weather: PIE-RGB-SAR-clean
  2. synthetic cloud cover with varying opacity (thin vs. thick vs. varied): PIE-RGB-SAR-cloud (varied cloud), DDHR-SK (varied cloud), OpenEarthMap-SAR (OEM-thin & OEM-thick)
  3. cross-domain generalization: DDHR-CH (varied cloud)

We provide processed versions of the above datasets for your convenience. Download them from here.
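For reference, the optical and SAR tiles of such a paired dataset can be matched by filename stem. The sketch below is a minimal illustration; the `optical/` and `sar/` directory names and the `.png` extension are assumptions, not the released dataset layout:

```python
# Hypothetical sketch: pair optical and SAR tiles that share a filename stem.
# Directory names and file extension are assumptions about the layout.
from pathlib import Path

def pair_optical_sar(root):
    """Return (optical_path, sar_path) pairs found under root/optical and root/sar."""
    root = Path(root)
    # Index SAR tiles by stem for O(1) lookup while scanning optical tiles.
    sar_by_stem = {p.stem: p for p in (root / "sar").glob("*.png")}
    pairs = []
    for opt in sorted((root / "optical").glob("*.png")):
        if opt.stem in sar_by_stem:
            pairs.append((opt, sar_by_stem[opt.stem]))
    return pairs
```

Tiles without a counterpart in the other modality are silently skipped, which is usually the safe default when assembling a paired training set.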

🤝Acknowledgments

The authors would like to give special thanks to GSNet, DINOv3, and SegEarth-OV.

🙋Q & A

For any questions, please feel free to open an issue or contact us.

About

Official PyTorch Implementation of MM-OVSeg: Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing [CVPR 2026].
