Segment Anything without Supervision

Unsupervised SAM (UnSAM) is a "segment anything" model for promptable and automatic whole-image segmentation that does not require human annotations.

XuDong Wang, Jingfeng Yang, Trevor Darrell
UC Berkeley
NeurIPS 2024

[project page] [arxiv] [colab (UnSAM)] [colab (pseudo-label)] [bibtex]

VideoUnSAM Extension

Extending UnSAM into the temporal domain with DINOv3 features and Sinkhorn optimal-transport mask propagation. See video/ for the new code.

Phase 1 — Static divide-and-conquer (DINOv3)

CutLER divide → DINOv3 ViT-L/16 conquer on a single image.

# End-to-end on the demo image (writes a colored mask overlay)
.venv/bin/python divide_and_conquer/demo_dico.py \
  --backbone dinov3 --output pseudo_masks_output.png

# Divide → conquer → synthetic stills video (UnSAM v1-style augmented "video")
.venv/bin/python divide_and_conquer/divide_conquer_videoV3.py \
  --input docs/demos/sa_234337.jpg \
  --output-video output.mp4 --output-preview preview.png

Phase 2 — Real-video propagation on DAVIS 2017

DAVIS is expected at datasets/davis/DAVIS/{JPEGImages,Annotations}/480p/<clip>/.
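
As a quick check of that layout, the sketch below lists the frame and annotation files for one clip using only the standard library. The helper name `list_clip_files` is hypothetical and not part of this repository, and the `.jpg` / `.png` extensions are an assumption based on how DAVIS 2017 is usually distributed.

```python
from pathlib import Path


def list_clip_files(root, clip, resolution="480p"):
    """Return (frames, annotations) as sorted lists of paths for one clip.

    Hypothetical helper: `root` is the directory that contains JPEGImages/
    and Annotations/, e.g. datasets/davis/DAVIS.
    """
    root = Path(root)
    frame_dir = root / "JPEGImages" / resolution / clip
    anno_dir = root / "Annotations" / resolution / clip
    for d in (frame_dir, anno_dir):
        if not d.is_dir():
            raise FileNotFoundError(f"expected DAVIS directory is missing: {d}")
    # Assumed extensions: JPEG frames, PNG annotation maps.
    frames = sorted(frame_dir.glob("*.jpg"))
    annos = sorted(anno_dir.glob("*.png"))
    return frames, annos
```

For example, `list_clip_files("datasets/davis/DAVIS", "blackswan")` raises `FileNotFoundError` with the missing path if the archive was unpacked somewhere else.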

# Sanity: DINOv3 patch-cosine map between two frames of a clip.
# Pick a query patch in frame A; heatmap shows where DINOv3 thinks it went in B.
.venv/bin/python -m video.scripts.visualize_frame_similarity \
  --clip blackswan --frame-a 0 --frame-b 20 \
  --query-xy 0.45 0.7 --out sim_blackswan.png

# Single-hop OT mask propagation A → B.
# Uses DAVIS GT at frame A as the source mask (stand-in for a Phase-1 pseudo-mask).
# --upscale 2.0 doubles the DINOv3 feature grid for sharper boundaries.
.venv/bin/python -m video.scripts.propagate_mask \
  --clip blackswan --frame-a 0 --frame-b 20 --instance-id 1 \
  --upscale 2.0 --out prop_blackswan.png

# Same, but also runs SAM ViT-H as a *diagnostic* boundary refiner on the OT
# output. Reports both OT and SAM IoU. NOT used in the pseudo-mask pipeline
# downstream — SAM imports a supervised prior; we keep this here only to
# visualise an "ideal refinement" ceiling. Requires the SAM checkpoint at
# checkpoints/sam_vit_h_4b8939.pth.
.venv/bin/python -m video.scripts.propagate_mask \
  --clip bmx-trees --frame-a 0 --frame-b 15 --instance-id 1 \
  --upscale 2.0 --sam --out bmx_sam.png

# Chained-hop OT propagation: small strides with soft patch-level mass passed
# through every hop (no mid-chain binarisation). Helps slightly on long
# easy/medium clips, collapses on multi-instance hard clips (dogs-jump) —
# that failure motivates the Phase-3 KV memory.
.venv/bin/python -m video.scripts.propagate_chain \
  --clip blackswan --frame-a 0 --frame-b 45 --stride 5 \
  --instance-id 1 --upscale 2.0 --out chain_blackswan.png
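
The idea behind the commands above (patch-cosine similarity, then entropic optimal transport to move mask mass from frame A to frame B) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in video/: the function names are hypothetical, the features are toy stand-ins for DINOv3 patch features, both frames get uniform marginals, and `blur` is used directly as the entropic regularisation strength, which may differ from how the repository interprets --blur.

```python
import numpy as np


def cosine_similarity(feats_a, feats_b):
    """Pairwise cosine similarity; feats_a is (Na, D), feats_b is (Nb, D)."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return a @ b.T


def sinkhorn_plan(cost, mass_a, mass_b, blur=0.05, n_iters=200):
    """Entropic OT plan between two histograms via Sinkhorn iterations.

    Assumption: `blur` is the entropic regularisation strength.
    """
    kernel = np.exp(-cost / blur)
    u = np.ones_like(mass_a)
    v = np.ones_like(mass_b)
    for _ in range(n_iters):
        u = mass_a / (kernel @ v)
        v = mass_b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def propagate_mask(feats_a, feats_b, mask_a, blur=0.05):
    """Push a soft patch-level mask from frame A onto the patches of frame B."""
    cost = 1.0 - cosine_similarity(feats_a, feats_b)
    mass_a = np.full(len(feats_a), 1.0 / len(feats_a))
    mass_b = np.full(len(feats_b), 1.0 / len(feats_b))
    plan = sinkhorn_plan(cost, mass_a, mass_b, blur)
    # Column j of the plan says where patch j of frame B received its mass
    # from; the masked share of that mass is the propagated soft mask value.
    return (plan.T @ mask_a) / plan.sum(axis=0)


# Toy usage: 4 one-hot "patches"; frame B is a shuffled copy of frame A.
feats_a = np.eye(4)
feats_b = feats_a[[2, 0, 3, 1]]
soft_b = propagate_mask(feats_a, feats_b, np.array([1.0, 1.0, 0.0, 0.0]))
print(soft_b.round(3))
```

In this sketch a chained hop amounts to passing the soft output of one call as `mask_a` of the next call, without thresholding in between, which mirrors the "no mid-chain binarisation" note above.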

One-time setup for Phase 2

# DAVIS 2017 trainval (~800 MB)
mkdir -p datasets/davis && cd datasets/davis
wget https://data.vision.ee.ethz.ch/csergi/share/davis/DAVIS-2017-trainval-480p.zip
unzip -q DAVIS-2017-trainval-480p.zip && cd ../..

# SAM diagnostic (only if you'll pass --sam)
.venv/bin/pip install segment-anything
mkdir -p checkpoints
wget -O checkpoints/sam_vit_h_4b8939.pth https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# HuggingFace login for the gated DINOv3 weights
huggingface-cli login

Common knobs

--upscale {1.0, 2.0} — DINOv3 feature-grid resolution; 2.0 is the practical default on a 5090.
--blur 0.05 — Sinkhorn entropic regularisation; robust across 0.02–0.2.
--threshold 0.5 — binarisation cutoff as a fraction of heatmap max.
--instance-id N — which DAVIS annotation instance to use as the source mask at frame A.
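
To make the --threshold semantics concrete: with a cutoff expressed as a fraction of the heatmap maximum, a heatmap that peaks at 0.8 and a threshold of 0.5 keeps every value of at least 0.4. The sketch below shows that rule in NumPy. The function name `binarise`, the inclusive comparison and the empty-mask fallback are assumptions for illustration, not the repository's implementation.

```python
import numpy as np


def binarise(heatmap, threshold=0.5):
    """Boolean mask of entries at or above `threshold` times the heatmap max."""
    heatmap = np.asarray(heatmap, dtype=float)
    peak = heatmap.max()
    if peak <= 0:
        # Nothing was propagated: return an empty mask rather than all-True.
        return np.zeros(heatmap.shape, dtype=bool)
    return heatmap >= threshold * peak
```

Because the cutoff is relative, rescaling the whole heatmap by a positive constant leaves the resulting mask unchanged.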

Status

Phase 1 ✅ (DINOv3 ViT-L/16, bf16, ~2.6 GB peak on 5090)
Phase 2 ✅ (single-hop OT + 2× upscale; chained-hop OT works on easy/medium, fails on multi-instance content)
Phase 3 — KV memory module (next)
Phase 4 — self-training loop
Phase 5 — DAVIS / YouTube-VOS evaluation

Updates

11/19/2025 UnSAMv2 was released! Check it out on GitHub and the UnSAMv2 project page.

10/29/2025 Add Hugging Face support for whole image segmentation [HF Link], [Tutorial Notebook]
07/01/2024 Initial commit of UnSAM

Features

The performance gap between unsupervised segmentation models and SAM can be significantly reduced. UnSAM not only advances the state-of-the-art in unsupervised segmentation by 10% but also achieves comparable performance with the labor-intensive, fully-supervised SAM.
The supervised SAM can also benefit from our self-supervised labels. By training UnSAM with only 1% of SA-1B images, a lightly semi-supervised UnSAM can often segment entities overlooked by supervised SAM, exceeding SAM’s AR by over 6.7% and AP by 3.9% on SA-1B.

Installation

See installation instructions.

Dataset Preparation

See Preparing Datasets for UnSAM.

Method Overview

UnSAM has two major stages: 1) generating pseudo-masks with divide-and-conquer and 2) learning unsupervised segmentation models from pseudo-masks of unlabeled data.

1. Multi-granular Pseudo-mask Generation with Divide-and-Conquer

Our Divide-and-Conquer approach can be used to provide multi-granular masks without human supervision.

Divide-and-Conquer Demo

Try out the demo using Colab:

To run Divide-and-Conquer locally, we provide demo_dico.py, which visualizes the pseudo-masks. Please download the CutLER checkpoint from here, and then run:

cd divide_and_conquer
# --postprocess requires a GPU.
# MODEL.DEVICE is passed to Detectron2, which takes a torch device string (cuda or cpu).
python demo_dico.py \
    --input /path/to/input/image \
    --output /path/to/save/output \
    --preprocess true \
    --postprocess true \
    --opts MODEL.WEIGHTS /path/to/cutler_checkpoint \
    MODEL.DEVICE cuda

A few demo images are provided in docs/demos/. Below are some visualizations of the pseudo-masks on these demo images.

2. Segment Anything without Supervision

Inference Demo for UnSAM with Pre-trained Models (whole image segmentation)

Try out the UnSAM demo using Colab (no GPU needed):

To run the UnSAM or UnSAM+ demos locally, we provide demo_whole_image.py, which can run the built-in configs. Please download the UnSAM/UnSAM+ checkpoints from the model zoo, then run:

cd whole_image_segmentation
python demo_whole_image.py \
    --input /path/to/input/image \
    --output /path/to/save/output \
    --opts \
    MODEL.WEIGHTS /path/to/UnSAM_checkpoint \
    MODEL.DEVICE cpu

The configs are made for training, so for evaluation MODEL.WEIGHTS must point to a model from the model zoo. This command runs inference and saves the results to the path given by --output.

To run on cpu, add MODEL.DEVICE cpu after --opts.
To save outputs to a directory (for images) or a file (for webcam or video), use --output.

Below are some visualizations of the model predictions on the demo images.

Gradio Demo for UnSAM with Pre-trained Models (promptable image segmentation)

The following command prints a Gradio link in the terminal, where users can interact with our model. Please download the UnSAM/UnSAM+ checkpoints from the model zoo. For details of the command-line arguments, see demo_promptable.py -h or read its source code.

To run on cpu, add cpu after --device.

python demo_promptable.py \
    --ckpt /path/to/UnSAM_checkpoint \
    --conf_files configs/semantic_sam_only_sa-1b_swinT.yaml \
    --device gpu

Below are some visualizations of the model predictions on the demo images.

Model Evaluation

To evaluate a model on the 7 datasets, first prepare the datasets following datasets/README.md. Next, select a model from the model zoo, set "model_weights", "config_file" and the path to "DETECTRON2_DATASETS" in the corresponding evaluation script under tools/, then run that script:

bash tools/promptable_eval.sh    # promptable segmentation
bash tools/whole_image_eval.sh   # whole-image segmentation

Model Zoo

Whole image segmentation

UnSAM achieves state-of-the-art results on unsupervised image segmentation, using a ResNet50 backbone and training on only 1% of SA-1B data. We report zero-shot unsupervised image segmentation performance on 7 datasets: COCO, LVIS, ADE20K, Entity, SA-1B, Part-ImageNet and PACO.

Methods	Models	Backbone	# of Train Images	Avg.	COCO	LVIS	ADE20K	Entity	SA-1B	PtIn	PACO
Prev. Unsup. SOTA	-	ViT-Base	0.2M	30.1	30.5	29.1	31.1	33.5	33.3	36.0	17.1
UnSAM (ours)	-	ResNet50	0.1M	39.2	40.5	37.7	35.7	39.6	41.9	51.6	27.5
UnSAM (ours)	download	ResNet50	0.4M	41.1	42.0	40.5	37.5	41.0	44.5	52.7	29.7
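
As a sanity check when reading the table, the Avg. column can be recomputed from the seven per-dataset scores. The snippet below does this for the three rows above, assuming Avg. is the unweighted mean over the datasets (the table does not state this explicitly). The row labels with the training-set size appended are added here only to tell the two UnSAM rows apart.

```python
# Per-dataset scores copied from the table above, in column order:
# COCO, LVIS, ADE20K, Entity, SA-1B, PtIn, PACO; then the reported Avg.
rows = {
    "Prev. Unsup. SOTA": ([30.5, 29.1, 31.1, 33.5, 33.3, 36.0, 17.1], 30.1),
    "UnSAM (0.1M)": ([40.5, 37.7, 35.7, 39.6, 41.9, 51.6, 27.5], 39.2),
    "UnSAM (0.4M)": ([42.0, 40.5, 37.5, 41.0, 44.5, 52.7, 29.7], 41.1),
}


def mean_score(scores):
    """Unweighted mean over datasets (assumed definition of Avg.)."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    print(f"{name}: recomputed {mean_score(scores):.1f}, reported {reported}")
```

For these three rows the recomputed mean agrees with the reported Avg. to one decimal place.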

UnSAM+ outperforms SAM on most of the evaluated benchmarks (including SA-1B) when UnSAM is trained on 1% of SA-1B with both ground-truth masks and our unsupervised labels. This demonstrates that the supervised SAM can also benefit from our self-supervised labels.

Methods	Models	Backbone	# of Train Images	Avg.	COCO	LVIS	ADE20K	Entity	SA-1B	PtIn	PACO
SAM	-	ViT-Base	11M	42.1	49.6	46.1	45.8	45.9	60.8	28.3	18.1
UnSAM+ (ours)	download	ResNet50	0.1M	48.8	52.2	50.8	45.3	49.8	64.8	46.0	32.3

Promptable image segmentation

Despite using a backbone that is 3× smaller and being trained on only 1% of SA-1B, our lightly semi-supervised UnSAM+ surpasses the fully-supervised SAM on the promptable segmentation task on COCO.

Methods	Models	Backbone	# of Train Images	Point (Max)	Point (Oracle)
SAM	-	ViT-B/8 (85M)	11M	52.1	68.2
UnSAM (ours)	download	Swin-Tiny (25M)	0.1M	37.6	57.9
UnSAM (ours)	download	Swin-Tiny (25M)	0.4M	41.3	59.1
UnSAM+ (ours)	download	Swin-Tiny (25M)	0.1M	52.4	69.5

License

The majority of UnSAM, CutLER, Detectron2 and DINO are licensed under the CC-BY-NC license; however, portions of the project are available under separate license terms: Mask2Former, Semantic-SAM, CascadePSP, Bilateral Solver and CRF are licensed under the MIT license. If you later add other third-party code, please keep this license info updated, and please let us know if that component is licensed under something other than CC-BY-NC, MIT, or CC0.

Acknowledgement

This codebase is based on CutLER, SAM, Mask2Former, Semantic-SAM, CascadePSP, BFS, CRF, DINO and Detectron2. We thank the authors for open-sourcing their code.

Ethical Considerations

UnSAM's wide range of detection capabilities may introduce challenges similar to those of many other visual recognition methods. Because an image can contain arbitrary instances, its content may affect the model output.

How to get support from us?

If you have general questions, feel free to email XuDong Wang. For code or implementation questions, email us or open an issue in this repository. We recommend opening an issue, because your question may help others.

Citation

If you find our work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation.

@article{wang2024segment,
  title={Segment anything without supervision},
  author={Wang, XuDong and Yang, Jingfeng and Darrell, Trevor},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={138731--138755},
  year={2024}
}

@article{yu2025unsamv2,
  title={UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity},
  author={Yu, Junwei and Darrell, Trevor and Wang, XuDong},
  journal={arXiv preprint arXiv:2511.13714},
  year={2025}
}
