This repository contains the official implementation of the paper "Towards Generalizable Scene Change Detection" (CVPR 2025).
Jaewoo Kim · Uehwan Kim

CVPR 2025
We formulate the research problem by posing a fundamental question: "Can contemporary SCD models detect arbitrary real-world changes beyond the scope of research data?" Our findings, as shown in the figure above, indicate that their reported effectiveness does not hold in real-world applications. Specifically, we observe that they (1) produce inconsistent change masks when the input order is reversed, and (2) exhibit significant performance drops when deployed to unseen domains with different visual features.
In this work, we address these two pivotal SCD problems by proposing a novel framework (GeSCF) and a novel benchmark (GeSCD) to foster research on generalizability in SCD.
Follow the steps below to set up the environment for running GeSCF:
```bash
git clone https://github.com/1124jaewookim/towards-generalizable-scene-change-detection.git
cd towards-generalizable-scene-change-detection/src
```

You can install dependencies manually or via a `requirements.txt` file.
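For example, a minimal sketch assuming the repository ships a `requirements.txt`:

```bash
# install the listed Python dependencies (assumes requirements.txt is present)
pip install -r requirements.txt
```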
- **Segment Anything (SAM):**
  Download from the official Meta AI repository:
  👉 https://github.com/facebookresearch/segment-anything
  Place the downloaded SAM ViT checkpoints (e.g., `sam_vit_h_4b8939.pth`) in the `src/pretrained_weight/` directory.
- **SuperPoint:**
  Download pretrained weights from:
  👉 https://github.com/magicleap/SuperPointPretrainedNetwork
  Place them in the `src/pretrained_weight/` directory (see the download commands below).
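For convenience, both checkpoints can be fetched from the command line. The SAM URL below is Meta AI's public release link for the ViT-H checkpoint; the SuperPoint URL assumes the `superpoint_v1.pth` file shipped at the root of the MagicLeap repository:

```bash
mkdir -p src/pretrained_weight

# SAM ViT-H checkpoint (public Meta AI release URL)
wget -P src/pretrained_weight https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# SuperPoint pretrained weights (assumed filename/location in the MagicLeap repo)
wget -P src/pretrained_weight https://github.com/magicleap/SuperPointPretrainedNetwork/raw/master/superpoint_v1.pth
```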
For a comprehensive evaluation of SCD performance, we consider three standard SCD datasets with different characteristics, together with our proposed ChangeVPR dataset:

- **VL-CMU-CD**: please follow this page to download the dataset.
- **TSUNAMI**: please follow this page to download the dataset.
- **ChangeSim**: please follow this page to download the dataset.
- **ChangeVPR** (ours): go here to download it.
Prepare your dataset in the following structure:
```
your_dataset_root/
└── ChangeVPR/
    └── SF-XL/
        ├── t0/                  # Images at time t0
        │   ├── 00000000.png
        │   ├── 00000001.png
        │   └── ...
        ├── t1/                  # Images at time t1
        │   ├── 00000000.png
        │   ├── 00000001.png
        │   └── ...
        └── mask/                # Ground-truth binary change masks
            ├── 00000000.png
            ├── 00000001.png
            └── ...
```
or
```
your_dataset_root/
└── VL_CMU_CD/
    └── test/
        ├── t0/                  # Images at time t0
        │   ├── 000_1_00_0.png
        │   ├── 000_1_01_0.png
        │   └── ...
        ├── t1/                  # Images at time t1
        │   ├── 000_1_00_0.png
        │   ├── 000_1_01_0.png
        │   └── ...
        └── gt/                  # Ground-truth binary change masks
            ├── 000_1_00_0.png
            ├── 000_1_01_0.png
            └── ...
```
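Before running inference, it can help to verify that the subfolders are aligned. A minimal sketch for the ChangeVPR layout above (the same check applies to VL_CMU_CD with `gt/` in place of `mask/`):

```bash
# verify that t0, t1 and mask contain exactly the same filenames
cd your_dataset_root/ChangeVPR/SF-XL
diff <(ls t0) <(ls t1) && diff <(ls t0) <(ls mask) \
  && echo "OK: t0 / t1 / mask filenames match"
```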
To run inference on an entire dataset, use the following command (provided as `src/test.sh` for convenience):

```bash
CUDA_VISIBLE_DEVICES=0 python test.py \
    --test-dataset VL_CMU_CD \
    --output-size 512 \
    \
    --dataset-path F:/GeSCD/VL_CMU_CD/test \
    \
    --feature-facet key \
    --feature-layer 17 \
    --embedding-layer 32 \
    \
    --sam-backbone vit_h \
    --pseudo-backbone vit_h \
    \
    --points-per-side 32 \
    --pred-iou-thresh 0.7 \
    --stability-score-thresh 0.7
```

📌 **Note**
| Argument | Description |
|---|---|
| `--test-dataset` | Dataset name (e.g., `VL_CMU_CD`, `ChangeVPR`, `TSUNAMI`) |
| `--dataset-path` | Path to the dataset root directory |
| `--output-size` | Final resolution of the output change mask |
| `--feature-facet` | Which ViT token to extract (`key`, `query`, or `value`) |
| `--feature-layer` | ViT layer to extract features from |
| `--embedding-layer` | ViT layer to extract token embeddings for similarity |
| `--sam-backbone` | Backbone used in Segment Anything (e.g., `vit_h`, `vit_l`, `vit_b`) |
| `--pseudo-backbone` | Backbone used in the pseudo mask generator |
| `--points-per-side` | Controls the sampling density for SAM proposals |
| `--pred-iou-thresh` | Higher value → fewer but more confident masks |
| `--stability-score-thresh` | Higher value → fewer but more stable masks |
⚡ **Tips for Faster Inference**

- **Use smaller backbones:** replace `vit_h` with `vit_l` or `vit_b` for `--sam-backbone` and `--pseudo-backbone`.
- **Reduce `--points-per-side`** to `16` for fewer region proposals.
- **Increase `--pred-iou-thresh` and `--stability-score-thresh`** to filter out weak or noisy masks.

🏆 However, the best performance reported in the paper was achieved with `--sam-backbone vit_h`, `--pseudo-backbone vit_h`, `--points-per-side 32`, `--pred-iou-thresh 0.7`, and `--stability-score-thresh 0.7`.
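Putting these tips together, a lighter run might look like the sketch below. Note that `--feature-layer 17` and `--embedding-layer 32` index into the ViT-H backbone; smaller backbones have fewer layers, so those flags are omitted here on the assumption that the script falls back to suitable defaults (adjust them explicitly if it does not). The `0.8` thresholds are illustrative, not the paper's settings.

```bash
# lighter configuration (sketch): smaller backbones, sparser point grid, stricter mask filtering
CUDA_VISIBLE_DEVICES=0 python test.py \
    --test-dataset VL_CMU_CD \
    --output-size 512 \
    --dataset-path F:/GeSCD/VL_CMU_CD/test \
    --feature-facet key \
    --sam-backbone vit_b \
    --pseudo-backbone vit_b \
    --points-per-side 16 \
    --pred-iou-thresh 0.8 \
    --stability-score-thresh 0.8
```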
To run inference on a single image pair (e.g., for visualization or quick testing), use the following command (provided as `src/test_single.sh`):

📌 Make sure to set `--test-dataset` to `Random` when testing with a manually specified image path.
```bash
CUDA_VISIBLE_DEVICES=0 python test_single.py \
    --test-dataset Random \
    --output-size 512 \
    \
    --img-t0-path F:/GeSCD/ChangeVPR/SF-XL/t0/00000001.png \
    --img-t1-path F:/GeSCD/ChangeVPR/SF-XL/t1/00000001.png \
    --gt-path F:/GeSCD/ChangeVPR/SF-XL/mask/00000001.png \
    \
    --feature-facet key \
    --feature-layer 17 \
    --embedding-layer 32 \
    \
    --sam-backbone vit_h \
    --pseudo-backbone vit_h \
    \
    --points-per-side 32 \
    --pred-iou-thresh 0.7 \
    --stability-score-thresh 0.7
```

📝 The ground-truth mask (`--gt-path`) is optional. If provided, precision/recall/F1 will be calculated and logged.
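If you only want a change mask for your own unannotated image pair, a minimal invocation is sketched below; the paths are placeholders, `--gt-path` is omitted, and the remaining arguments are assumed to fall back to their defaults.

```bash
# minimal single-pair run (sketch): no ground truth, default feature settings assumed
CUDA_VISIBLE_DEVICES=0 python test_single.py \
    --test-dataset Random \
    --output-size 512 \
    --img-t0-path /path/to/your/t0.png \
    --img-t1-path /path/to/your/t1.png \
    --sam-backbone vit_h \
    --pseudo-backbone vit_h
```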
Our GeSCF, as a SAM-based zero-shot framework, demonstrates exceptional robustness across a wide range of terrain conditions, extending even to challenging remote sensing change detection scenarios. Below are example triplets of t0, t1, and GeSCF's corresponding prediction.
We sincerely thank CSCDNet, CDResNet, DR-TANet, and C-3PO for providing strong SCD baselines for benchmarking. We also thank Segment Anything for providing an excellent vision foundation model.
This project shares a similar research direction with other works exploring zero-shot scene change detection.
Notable examples include segment-any-change, zero-shot-scd, and MV3DCD.
If you find the work useful for your research, please cite:
```bibtex
@InProceedings{Kim_2025_CVPR,
    author    = {Kim, Jae-Woo and Kim, Ue-Hwan},
    title     = {Towards Generalizable Scene Change Detection},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {24463-24473}
}
```