Haoyang Chen 1,2, Jing Zhang 1,2 †, Hebaixu Wang 1,2, Shiqin Wang 1, PoHsun Huang 1, Jiayuan Li 2,3, Haonan Guo 1,2, Di Wang 1,2 †, Zheng Wang 1,2 †, Bo Du 1,2 †.
1 Wuhan University, 2 Zhongguancun Academy, 3 Beijing Institute of Technology.
† Corresponding author
News | Abstract | Datasets | Checkpoints | Usage | Statement
2026.3.5
- The paper is posted on arXiv (arXiv: 2603.04114).
Multi-modal remote sensing imagery provides complementary observations of the same geographic scene, yet such observations are frequently incomplete in practice. Existing cross-modal translation methods treat each modality pair as an independent task, resulting in quadratic complexity and limited generalization to unseen modality combinations. We formulate Any-to-Any translation as inference over a shared latent representation of the scene, where different modalities correspond to partial observations of the same underlying semantics. Based on this formulation, we propose Any2Any, a unified latent diffusion framework that projects heterogeneous inputs into a geometrically aligned latent space. This structure performs anchored latent regression with a shared backbone, decoupling modality-specific representation learning from semantic mapping. Moreover, lightweight target-specific residual adapters are used to correct systematic latent mismatches without increasing inference complexity. To support learning under sparse but connected supervision, we introduce RST-1M, the first million-scale remote sensing dataset with paired observations across five sensing modalities, providing supervision anchors for any-to-any translation. Experiments across 14 translation tasks show that Any2Any consistently outperforms pairwise translation methods and exhibits strong zero-shot generalization to unseen modality pairs.
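To make the data flow described above concrete, here is a minimal PyTorch sketch: modality-specific encoders project inputs into a shared latent space, one shared backbone performs the semantic mapping (standing in for the latent diffusion model that performs anchored latent regression in the paper), and lightweight target-specific residual adapters correct latent mismatches before decoding. All module names, dimensions, and the toy linear layers are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch only; module names, dimensions, and layers are assumptions.
import torch
import torch.nn as nn

MODALITIES = ["optical", "sar", "dsm", "map", "infrared"]  # hypothetical modality names
LATENT_DIM = 256


class ResidualAdapter(nn.Module):
    """Lightweight target-specific adapter: adds a learned correction to the latent."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Residual correction: the adapter only refines the shared latent,
        # so inference cost stays essentially unchanged.
        return z + self.net(z)


class Any2AnySketch(nn.Module):
    """Toy stand-in: shared latent space + shared backbone + target-specific adapters."""

    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        # One encoder/decoder per modality (plain linear layers standing in for image encoders).
        self.encoders = nn.ModuleDict({m: nn.Linear(feat_dim, LATENT_DIM) for m in MODALITIES})
        self.decoders = nn.ModuleDict({m: nn.Linear(LATENT_DIM, feat_dim) for m in MODALITIES})
        # Shared backbone reused by every (source, target) pair; in the paper this role
        # is played by the latent diffusion model performing anchored latent regression.
        self.backbone = nn.Sequential(
            nn.Linear(LATENT_DIM, LATENT_DIM), nn.GELU(), nn.Linear(LATENT_DIM, LATENT_DIM)
        )
        # Lightweight target-specific residual adapters.
        self.adapters = nn.ModuleDict({m: ResidualAdapter(LATENT_DIM) for m in MODALITIES})

    def forward(self, x: torch.Tensor, src: str, tgt: str) -> torch.Tensor:
        z = self.encoders[src](x)     # project the source observation into the shared latent space
        z = self.backbone(z)          # shared semantic mapping across all modality pairs
        z = self.adapters[tgt](z)     # correct the systematic latent mismatch of the target modality
        return self.decoders[tgt](z)  # render the target modality


if __name__ == "__main__":
    model = Any2AnySketch()
    x = torch.randn(2, 1024)                # a batch of flattened source-modality features
    y = model(x, src="sar", tgt="optical")  # any-to-any: pick any (source, target) pair
    print(y.shape)                          # torch.Size([2, 1024])
```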
Figure 1. Quantitative comparison between our proposed Any2Any and other compared methods.
Figure 2. Statistics and example images of the RST-1M dataset.
The RST-1M dataset is coming soon.
Figure 3. Overview of the Any2Any framework.
The checkpoints are coming soon.
Please stay tuned for updates.
We will release an inference script soon; please stay tuned for updates.
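Until the official script is released, the placeholder sketch below shows roughly what an any-to-any inference entry point could look like. Every flag name, the modality identifiers, and the checkpoint handling are hypothetical assumptions, not the released interface.

```python
# Placeholder sketch of an inference entry point; the flag names, modality
# identifiers, and checkpoint handling below are hypothetical assumptions,
# not the released interface.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Any2Any inference (placeholder sketch)")
    parser.add_argument("--source-modality", required=True, help="e.g. sar (hypothetical name)")
    parser.add_argument("--target-modality", required=True, help="e.g. optical (hypothetical name)")
    parser.add_argument("--input", required=True, help="path to the source-modality image")
    parser.add_argument("--checkpoint", required=True, help="path to a trained checkpoint")
    parser.add_argument("--output", required=True, help="where to write the translated image")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    # Model loading and the actual translation call depend on the released
    # checkpoints and API, so they are intentionally omitted from this sketch.
    print(f"Would translate {args.input}: {args.source_modality} -> {args.target_modality}")


if __name__ == "__main__":
    main()
```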
Figure 4. Qualitative comparison between our proposed Any2Any and other compared methods.
Figure 5. Qualitative results of our method on unseen remote sensing modality translation tasks with missing paired training data.
If you find Any2Any helpful, please give this repo a ⭐ and cite it as follows:
@article{chen2026Any2Any,
title={Any2Any: Unified Arbitrary Modality Translation for Remote Sensing},
author={Chen, Haoyang and Zhang, Jing and Wang, Hebaixu and Wang, Shiqin and Huang, Pohsun and Li, Jiayuan and Guo, Haonan and Wang, Di and Wang, Zheng and Du, Bo},
journal={arXiv preprint arXiv:2603.04114},
year={2026}
}
For any other questions, please contact Haoyang Chen at whu.edu.cn.




