Meta Superintelligence Labs
For full author list and details, see the paper.
[Paper]
[Project]
[Demo]
[Blog]
SAM 3 is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks. Compared to its predecessor SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an open-vocabulary concept specified by a short text phrase or exemplars. Unlike prior work, SAM 3 can handle a vastly larger set of open-vocabulary prompts. It achieves 75-80% of human performance on our new SA-Co benchmark, which contains 270K unique concepts, over 50 times more than existing benchmarks.
This breakthrough is driven by an innovative data engine that has automatically annotated over 4 million unique concepts, creating the largest high-quality open-vocabulary segmentation dataset to date. In addition, SAM 3 introduces a new model architecture featuring a presence token that improves discrimination between closely related text prompts (e.g., "a player in white" vs. "a player in red"), as well as a decoupled detector–tracker design that minimizes task interference and scales efficiently with data.
- Python 3.12 or higher
- PyTorch 2.7 or higher
- CUDA-compatible GPU with CUDA 12.6 or higher
- Create a new Conda environment:
conda create -n sam3 python=3.12
conda deactivate
conda activate sam3
- Install PyTorch with CUDA support:
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
- Clone the repository and install the package:
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .
- Install additional dependencies for example notebooks or development:
# For running example notebooks
pip install -e ".[notebooks]"
# For development
pip install -e ".[train,dev]"hf auth login after generating an access token.)
import torch
#################################### For Image ####################################
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
# Load the model
model = build_sam3_image_model()
processor = Sam3Processor(model)
# Load an image
image = Image.open("<YOUR_IMAGE_PATH.jpg>")
inference_state = processor.set_image(image)
# Prompt the model with text
output = processor.set_text_prompt(state=inference_state, prompt="<YOUR_TEXT_PROMPT>")
# Get the masks, bounding boxes, and scores
masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
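# (Illustrative post-processing, not part of the documented API: the outputs above
# are assumed to be tensors aligned along the first dimension, and the 0.5 cutoff
# below is an arbitrary example value -- tune it for your data.)
keep = scores > 0.5
masks, boxes, scores = masks[keep], boxes[keep], scores[keep]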
#################################### For Video ####################################
from sam3.model_builder import build_sam3_video_predictor
video_predictor = build_sam3_video_predictor()
video_path = "<YOUR_VIDEO_PATH>" # a JPEG folder or an MP4 video file
# Start a session
response = video_predictor.handle_request(
request=dict(
type="start_session",
resource_path=video_path,
)
)
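# The start_session response contains a session_id; later requests use it to refer to this video.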
response = video_predictor.handle_request(
request=dict(
type="add_prompt",
session_id=response["session_id"],
frame_index=0, # Arbitrary frame index
text="<YOUR_TEXT_PROMPT>",
)
)
output = response["outputs"]
The examples directory contains notebooks demonstrating how to use SAM 3 with various types of prompts. To run the examples:
pip install -e ".[notebooks]"
jupyter notebook examples/sam3_image_predictor_example.ipynb
This section provides a guide for training SAM 3 on your own single-class segmentation dataset.
pip install labelme pycocotools pillow numpy tqdm
Ensure your dataset structure is as follows:
labelme_dataset/
├── image1.jpg
├── image1.json
├── image2.jpg
├── image2.json
└── ...
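Each .json file is a standard Labelme annotation with polygon shapes. As a rough illustration (the file names, coordinates, and version string below are made-up example values), such a file can be produced programmatically like this:

import json

# Minimal Labelme-style polygon annotation (illustrative values only).
annotation = {
    "version": "5.0.0",
    "flags": {},
    "shapes": [
        {
            "label": "your_class_name",  # must match --class_name in the conversion step
            "points": [[120.0, 80.0], [300.0, 80.0], [300.0, 260.0], [120.0, 260.0]],
            "group_id": None,
            "shape_type": "polygon",     # the converter expects polygons, not rectangles
            "flags": {},
        }
    ],
    "imagePath": "image1.jpg",
    "imageData": None,
    "imageHeight": 480,
    "imageWidth": 640,
}

with open("labelme_dataset/image1.json", "w") as f:
    json.dump(annotation, f, indent=2)

Then convert the dataset to COCO format: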
python labelme_to_coco.py \
--labelme_dir /path/to/labelme_dataset \
--output_dir /path/to/coco_dataset \
--class_name "your_class_name" \
--train_split 0.8
Parameters:
- --labelme_dir: Labelme annotation directory (containing JSON files and images)
- --output_dir: Output directory for the COCO-format dataset
- --class_name: Class name (e.g., "person", "car", "dog")
- --train_split: Training set ratio (default 0.8, i.e., 80% train, 20% validation)
Example:
python labelme_to_coco.py \
--labelme_dir ./my_labelme_data \
--output_dir ./coco_dataset \
--class_name "object" \
--train_split 0.8
The converted dataset will have the following structure:
coco_dataset/
├── train/
│ ├── images/
│ │ ├── image1.jpg
│ │ ├── image2.jpg
│ │ └── ...
│ └── annotations.json
└── val/
├── images/
│ ├── image3.jpg
│ ├── image4.jpg
│ └── ...
└── annotations.json
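Before editing the training configuration, you can sanity-check the converted annotations with the COCO API from pycocotools (installed earlier). The paths below follow the example layout above:

from pycocotools.coco import COCO

# Load the generated training annotations and print a quick summary.
coco = COCO("coco_dataset/train/annotations.json")
print("categories:", [c["name"] for c in coco.loadCats(coco.getCatIds())])
print("images:", len(coco.getImgIds()))
print("annotations:", len(coco.getAnnIds()))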
Edit sam3/train/conf/train_config_single_class.yaml and modify the following paths:
paths:
dataset_root: /path/to/coco_dataset # Change to your dataset path
experiment_log_dir: /path/to/experiments # Change to experiment output directory
bpe_path: assets/bpe_simple_vocab_16e6.txt.gz # BPE vocabulary path
dataset:
class_name: "your_class_name" # Change to your class name
Based on your GPU memory and dataset size, you can adjust:
scratch:
train_batch_size: 2 # Can be reduced to 1 if GPU memory is insufficient
val_batch_size: 1
num_train_workers: 4 # Number of data loading worker processes
resolution: 1008 # Image resolution, can be reduced to save memory
max_epochs: 50 # Number of training epochs
To start training, run:
python sam3/train/train.py -c sam3/train/conf/train_config_single_class.yaml
If you have multiple GPUs, modify gpus_per_node in the configuration file:
launcher:
gpus_per_node: 2 # Change to your number of GPUs
Then run:
python sam3/train/train.py -c sam3/train/conf/train_config_single_class.yaml
If training is interrupted, you can resume from a checkpoint:
trainer:
checkpoint:
resume_from: /path/to/checkpoint.pt # Specify checkpoint path
You can monitor training progress using TensorBoard:
tensorboard --logdir /path/to/experiments/tensorboard
Training logs are saved in:
/path/to/experiments/logs/
After training, checkpoints are saved in:
/path/to/experiments/checkpoints/
Validation results are saved in:
/path/to/experiments/dumps/val/
- Model Compression: After training, you can compress the model to save space:
python tools/compress_model.py
This will extract the weights and convert them to FP16, reducing model size by ~60% (a generic sketch of this kind of conversion appears after this list).
- Save FP16 Models Directly: You can configure training to save FP16 models directly:
checkpoint:
  save_model_only: true # Only save model weights
  save_fp16: true # Save in FP16 format
  save_epochs: [10, 20, 30, 40, 50] # Specify epochs to save
- Batch Inference: After training, use the trained model for batch inference:
python tools/batch_inference.py
See BATCH_INFERENCE_README.md for details.
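The sketch below illustrates the general idea behind this kind of compression (keep only the model weights and cast floating-point tensors to FP16). It is not the actual tools/compress_model.py, and the "model" key is an assumption about the checkpoint layout:

import torch

# Load a full training checkpoint, keep only the model weights,
# and cast floating-point tensors to half precision before re-saving.
ckpt = torch.load("/path/to/experiments/checkpoints/checkpoint.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # assumption: weights stored under "model", else top level

fp16_state = {
    k: v.half() if torch.is_tensor(v) and torch.is_floating_point(v) else v
    for k, v in state_dict.items()
}
torch.save(fp16_state, "model_fp16.pt")
print(f"saved {len(fp16_state)} entries to model_fp16.pt")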
If training runs out of GPU memory, solutions:
- Reduce train_batch_size (change to 1)
- Lower resolution (change to 800 or smaller)
- Reduce num_train_workers
If training is too slow, solutions:
- Increase num_train_workers
- Use multi-GPU training
- Reduce resolution
If dataset conversion fails, possible causes:
- Incorrect Labelme JSON format
- Image paths not found
- Polygon annotation format issues
Solutions:
- Check Labelme JSON file format
- Ensure image files exist
- Ensure annotations are in polygon format
Class name mismatch:
Ensure the class_name in the configuration file matches the name used in the conversion script.
A complete workflow, from conversion to inference:
# 1. Convert dataset
python labelme_to_coco.py \
--labelme_dir ./my_labelme_data \
--output_dir ./coco_dataset \
--class_name "object" \
--train_split 0.8
# 2. Edit paths in configuration file
# 3. Start training
python sam3/train/train.py -c sam3/train/conf/train_config_single_class.yaml
# 4. Monitor training
tensorboard --logdir ./experiments/tensorboard
# 5. Compress model (after training)
python tools/compress_model.py
# 6. Batch inference
python tools/batch_inference.py
Notes:
- Class Name: The class name used during training and inference must be consistent
- Image Format: Supports common image formats (jpg, png, etc.)
- Annotation Format: Only supports polygon annotations, not rectangles or other shapes
- Pretrained Model: The first training run will automatically download the pretrained model from HuggingFace; this requires an internet connection
- GPU Requirements: A GPU with at least 16 GB of VRAM is recommended
For more detailed training documentation, see 训练指南.md (Chinese-language training guide).
SAM 3 consists of a detector and a tracker that share a vision encoder. It has 848M parameters. The detector is a DETR-based model conditioned on text, geometry, and image exemplars. The tracker inherits the SAM 2 transformer encoder-decoder architecture, supporting video segmentation and interactive refinement.
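As a purely illustrative sketch of this decoupled design (the class and method names below are invented for exposition and do not correspond to the real implementation), the composition is roughly one shared vision encoder whose features feed both a prompt-conditioned detector and a video tracker:

import torch.nn as nn

class DecoupledDetectorTrackerSketch(nn.Module):
    """Illustrative only: a shared vision encoder, a DETR-style prompted detector,
    and a SAM 2-style tracker as separate heads."""

    def __init__(self, vision_encoder: nn.Module, detector: nn.Module, tracker: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # shared between detection and tracking
        self.detector = detector              # conditioned on text, boxes/points, or image exemplars
        self.tracker = tracker                # propagates object masks across video frames

    def detect(self, image, prompt):
        features = self.vision_encoder(image)
        return self.detector(features, prompt)

    def track(self, frame, memory):
        features = self.vision_encoder(frame)
        return self.tracker(features, memory)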
SAM 3 achieves state-of-the-art performance on various benchmarks:
Image Results:
- SA-Co/Gold: 54.1 cgF1 (Instance Segmentation), 55.7 cgF1 (Box Detection)
- LVIS: 37.2 cgF1 (Instance Segmentation), 40.6 cgF1 (Box Detection)
- COCO: 56.4 AP (Box Detection)
Video Results:
- SA-V test: 30.3 cgF1, 58.0 pHOTA
- YT-Temporal-1B test: 50.8 cgF1, 69.9 pHOTA
- SmartGlasses test: 36.4 cgF1, 63.6 pHOTA
We release two image benchmarks, SA-Co/Gold and SA-Co/Silver, and a video benchmark, SA-Co/VEval. The datasets contain images (or videos) with annotated noun phrases. Each image/video and noun phrase pair is annotated with instance masks and unique IDs for each object matching the phrase.
- HuggingFace host: SA-Co/Gold, SA-Co/Silver and SA-Co/VEval
- Roboflow host: SA-Co/Gold, SA-Co/Silver and SA-Co/VEval
To set up the development environment:
pip install -e ".[dev,train]"To format the code:
ufmt format .
See the contributing guidelines and the code of conduct.
This project is licensed under the SAM License - see the LICENSE file for details.
We would like to thank the following people for their contributions to the SAM 3 project: Alex He, Alexander Kirillov, Alyssa Newcomb, Ana Paula Kirschner Mofarrej, Andrea Madotto, Andrew Westbury, Ashley Gabriel, Azita Shokpour, Ben Samples, Bernie Huang, Carleigh Wood, Ching-Feng Yeh, Christian Puhrsch, Claudette Ward, Daniel Bolya, Daniel Li, Facundo Figueroa, Fazila Vhora, George Orlin, Hanzi Mao, Helen Klein, Hu Xu, Ida Cheng, Jake Kinney, Jiale Zhi, Jo Sampaio, Joel Schlosser, Justin Johnson, Kai Brown, Karen Bergan, Karla Martucci, Kenny Lehmann, Maddie Mintz, Mallika Malhotra, Matt Ward, Michelle Chan, Michelle Restrepo, Miranda Hartley, Muhammad Maaz, Nisha Deo, Peter Park, Phillip Thomas, Raghu Nayani, Rene Martinez Doehner, Robbie Adkins, Ross Girshick, Sasha Mitts, Shashank Jain, Spencer Whitehead, Ty Toledano, Valentin Gabeur, Vincent Cho, Vivian Lee, William Ngan, Xuehai He, Yael Yungster, Ziqi Pang, Ziyi Dou, Zoe Quake.