Meta Superintelligence Labs
For full author list and details, see the paper.
[Paper]
[Project]
[Demo]
[Blog]
SAM 3 is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks. Compared to its predecessor SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an open-vocabulary concept specified by a short text phrase or exemplars. Unlike prior work, SAM 3 can handle a vastly larger set of open-vocabulary prompts. It achieves 75-80% of human performance on our new SA-Co benchmark, which contains 270K unique concepts, over 50 times more than existing benchmarks.
This breakthrough is driven by an innovative data engine that has automatically annotated over 4 million unique concepts, creating the largest high-quality open-vocabulary segmentation dataset to date. In addition, SAM 3 introduces a new model architecture featuring a presence token that improves discrimination between closely related text prompts (e.g., "a player in white" vs. "a player in red"), as well as a decoupled detector–tracker design that minimizes task interference and scales efficiently with data.
- Python 3.12 or higher
- PyTorch 2.7 or higher
- CUDA-compatible GPU with CUDA 12.6 or higher
- Create a new Conda environment:
conda create -n sam3 python=3.12
conda deactivate
conda activate sam3
- Install PyTorch with CUDA support:
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
- Clone the repository and install the package:
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .
- Install additional dependencies for example notebooks or development:
# For running example notebooks
pip install -e ".[notebooks]"
# For development
pip install -e ".[train,dev]"hf auth login after generating an access token.)
import torch
#################################### For Image ####################################
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
# Load the model
model = build_sam3_image_model()
processor = Sam3Processor(model)
# Load an image
image = Image.open("<YOUR_IMAGE_PATH.jpg>")
inference_state = processor.set_image(image)
# Prompt the model with text
output = processor.set_text_prompt(state=inference_state, prompt="<YOUR_TEXT_PROMPT>")
# Get the masks, bounding boxes, and scores
masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
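# (Illustrative post-processing, not part of the documented API: the outputs above
# are assumed to be tensors aligned along the first dimension, and the 0.5 cutoff
# below is an arbitrary example value -- tune it for your data.)
keep = scores > 0.5
masks, boxes, scores = masks[keep], boxes[keep], scores[keep]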
#################################### For Video ####################################
from sam3.model_builder import build_sam3_video_predictor
video_predictor = build_sam3_video_predictor()
video_path = "<YOUR_VIDEO_PATH>" # a JPEG folder or an MP4 video file
# Start a session
response = video_predictor.handle_request(
request=dict(
type="start_session",
resource_path=video_path,
)
)
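# The start_session response contains a session_id; later requests use it to refer to this video.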
response = video_predictor.handle_request(
request=dict(
type="add_prompt",
session_id=response["session_id"],
frame_index=0, # Arbitrary frame index
text="<YOUR_TEXT_PROMPT>",
)
)
output = response["outputs"]
The examples directory contains notebooks demonstrating how to use SAM 3 with various types of prompts. To run the examples:
pip install -e ".[notebooks]"
jupyter notebook examples/sam3_image_predictor_example.ipynb
This section provides a guide for training SAM 3 on your own single-class segmentation dataset.
pip install labelme pycocotools pillow numpy tqdm
Ensure your dataset structure is as follows:
labelme_dataset/
├── image1.jpg
├── image1.json
├── image2.jpg
├── image2.json
└── ...
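Each .json file is a standard Labelme annotation with polygon shapes. As a rough illustration (the file names, coordinates, and version string below are made-up example values), such a file can be produced programmatically like this:

import json

# Minimal Labelme-style polygon annotation (illustrative values only).
annotation = {
    "version": "5.0.0",
    "flags": {},
    "shapes": [
        {
            "label": "your_class_name",  # must match --class_name in the conversion step
            "points": [[120.0, 80.0], [300.0, 80.0], [300.0, 260.0], [120.0, 260.0]],
            "group_id": None,
            "shape_type": "polygon",     # the converter expects polygons, not rectangles
            "flags": {},
        }
    ],
    "imagePath": "image1.jpg",
    "imageData": None,
    "imageHeight": 480,
    "imageWidth": 640,
}

with open("labelme_dataset/image1.json", "w") as f:
    json.dump(annotation, f, indent=2)

Then convert the dataset to COCO format: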
python labelme_to_coco.py \
--labelme_dir /path/to/labelme_dataset \
--output_dir /path/to/coco_dataset \
--class_name "your_class_name" \
--train_split 0.8
Parameters:
- --labelme_dir: Labelme annotation directory (containing JSON files and images)
- --output_dir: Output directory for the COCO-format dataset
- --class_name: Class name (e.g., "person", "car", "dog")
- --train_split: Training set ratio (default 0.8, i.e., 80% train, 20% validation)
Example:
python labelme_to_coco.py \
--labelme_dir ./my_labelme_data \
--output_dir ./coco_dataset \
--class_name "object" \
--train_split 0.8
The converted dataset will have the following structure:
coco_dataset/
├── train/
│ ├── images/
│ │ ├── image1.jpg
│ │ ├── image2.jpg
│ │ └── ...
│ └── annotations.json
└── val/
├── images/
│ ├── image3.jpg
│ ├── image4.jpg
│ └── ...
└── annotations.json
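Before editing the training configuration, you can sanity-check the converted annotations with the COCO API from pycocotools (installed earlier). The paths below follow the example layout above:

from pycocotools.coco import COCO

# Load the generated training annotations and print a quick summary.
coco = COCO("coco_dataset/train/annotations.json")
print("categories:", [c["name"] for c in coco.loadCats(coco.getCatIds())])
print("images:", len(coco.getImgIds()))
print("annotations:", len(coco.getAnnIds()))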
Edit sam3/train/conf/train_config_single_class.yaml and modify the following paths:
paths:
dataset_root: /path/to/coco_dataset # Change to your dataset path
experiment_log_dir: /path/to/experiments # Change to experiment output directory
bpe_path: assets/bpe_simple_vocab_16e6.txt.gz # BPE vocabulary path
dataset:
class_name: "your_class_name" # Change to your class name
Based on your GPU memory and dataset size, you can adjust:
scratch:
train_batch_size: 2 # Can be reduced to 1 if GPU memory is insufficient
val_batch_size: 1
num_train_workers: 4 # Number of data loading worker processes
resolution: 1008 # Image resolution, can be reduced to save memory
max_epochs: 50 # Number of training epochs
To start training, run:
python sam3/train/train.py -c sam3/train/conf/train_config_single_class.yaml
If you have multiple GPUs, modify gpus_per_node in the configuration file:
launcher:
gpus_per_node: 2 # Change to your number of GPUs
Then run:
python sam3/train/train.py -c sam3/train/conf/train_config_single_class.yaml
If training is interrupted, you can resume from a checkpoint:
trainer:
checkpoint:
resume_from: /path/to/checkpoint.pt # Specify checkpoint path
You can monitor training progress using TensorBoard:
tensorboard --logdir /path/to/experiments/tensorboard
Training logs are saved in:
/path/to/experiments/logs/
After training, checkpoints are saved in:
/path/to/experiments/checkpoints/
Validation results are saved in:
/path/to/experiments/dumps/val/
- Model Compression: After training, you can compress the model to save space:
python tools/compress_model.py
This will extract the weights and convert them to FP16, reducing model size by ~60% (a generic sketch of this kind of conversion appears after this list).
- Save FP16 Models Directly: You can configure training to save FP16 models directly:
checkpoint:
  save_model_only: true # Only save model weights
  save_fp16: true # Save in FP16 format
  save_epochs: [10, 20, 30, 40, 50] # Specify epochs to save
- Batch Inference: After training, use the trained model for batch inference:
python tools/batch_inference.py
See BATCH_INFERENCE_README.md for details.
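The sketch below illustrates the general idea behind this kind of compression (keep only the model weights and cast floating-point tensors to FP16). It is not the actual tools/compress_model.py, and the "model" key is an assumption about the checkpoint layout:

import torch

# Load a full training checkpoint, keep only the model weights,
# and cast floating-point tensors to half precision before re-saving.
ckpt = torch.load("/path/to/experiments/checkpoints/checkpoint.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # assumption: weights stored under "model", else top level

fp16_state = {
    k: v.half() if torch.is_tensor(v) and torch.is_floating_point(v) else v
    for k, v in state_dict.items()
}
torch.save(fp16_state, "model_fp16.pt")
print(f"saved {len(fp16_state)} entries to model_fp16.pt")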
If training runs out of GPU memory, solutions:
- Reduce train_batch_size (change to 1)
- Lower resolution (change to 800 or smaller)
- Reduce num_train_workers
If training is too slow, solutions:
- Increase num_train_workers
- Use multi-GPU training
- Reduce resolution
If dataset conversion fails, possible causes:
- Incorrect Labelme JSON format
- Image paths not found
- Polygon annotation format issues
Solutions:
- Check Labelme JSON file format
- Ensure image files exist
- Ensure annotations are in polygon format
Class name mismatch:
Ensure the class_name in the configuration file matches the name used in the conversion script.
A complete workflow, from conversion to inference:
# 1. Convert dataset
python labelme_to_coco.py \
--labelme_dir ./my_labelme_data \
--output_dir ./coco_dataset \
--class_name "object" \
--train_split 0.8
# 2. Edit paths in configuration file
# 3. Start training
python sam3/train/train.py -c sam3/train/conf/train_config_single_class.yaml
# 4. Monitor training
tensorboard --logdir ./experiments/tensorboard
# 5. Compress model (after training)
python tools/compress_model.py
# 6. Batch inference
python tools/batch_inference.py
Notes:
- Class Name: The class name used during training and inference must be consistent
- Image Format: Supports common image formats (jpg, png, etc.)
- Annotation Format: Only supports polygon annotations, not rectangles or other shapes
- Pretrained Model: The first training run will automatically download the pretrained model from HuggingFace; this requires an internet connection
- GPU Requirements: A GPU with at least 16 GB of VRAM is recommended
For more detailed training documentation, see 训练指南.md (Chinese-language training guide).
SAM 3 consists of a detector and a tracker that share a vision encoder. It has 848M parameters. The detector is a DETR-based model conditioned on text, geometry, and image exemplars. The tracker inherits the SAM 2 transformer encoder-decoder architecture, supporting video segmentation and interactive refinement.
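As a purely illustrative sketch of this decoupled design (the class and method names below are invented for exposition and do not correspond to the real implementation), the composition is roughly one shared vision encoder whose features feed both a prompt-conditioned detector and a video tracker:

import torch.nn as nn

class DecoupledDetectorTrackerSketch(nn.Module):
    """Illustrative only: a shared vision encoder, a DETR-style prompted detector,
    and a SAM 2-style tracker as separate heads."""

    def __init__(self, vision_encoder: nn.Module, detector: nn.Module, tracker: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # shared between detection and tracking
        self.detector = detector              # conditioned on text, boxes/points, or image exemplars
        self.tracker = tracker                # propagates object masks across video frames

    def detect(self, image, prompt):
        features = self.vision_encoder(image)
        return self.detector(features, prompt)

    def track(self, frame, memory):
        features = self.vision_encoder(frame)
        return self.tracker(features, memory)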
SAM 3 achieves state-of-the-art performance on various benchmarks:
Image Results:
- SA-Co/Gold: 54.1 cgF1 (Instance Segmentation), 55.7 cgF1 (Box Detection)
- LVIS: 37.2 cgF1 (Instance Segmentation), 40.6 cgF1 (Box Detection)
- COCO: 56.4 AP (Box Detection)
Video Results:
- SA-V test: 30.3 cgF1, 58.0 pHOTA
- YT-Temporal-1B test: 50.8 cgF1, 69.9 pHOTA
- SmartGlasses test: 36.4 cgF1, 63.6 pHOTA
We release two image benchmarks, SA-Co/Gold and SA-Co/Silver, and a video benchmark, SA-Co/VEval. The datasets contain images (or videos) with annotated noun phrases. Each image/video and noun phrase pair is annotated with instance masks and unique IDs for each object matching the phrase.
- HuggingFace host: SA-Co/Gold, SA-Co/Silver and SA-Co/VEval
- Roboflow host: SA-Co/Gold, SA-Co/Silver and SA-Co/VEval
To set up the development environment:
pip install -e ".[dev,train]"To format the code:
ufmt format .
See the contributing guidelines and the code of conduct.
This project is licensed under the SAM License - see the LICENSE file for details.
We would like to thank the following people for their contributions to the SAM 3 project: Alex He, Alexander Kirillov, Alyssa Newcomb, Ana Paula Kirschner Mofarrej, Andrea Madotto, Andrew Westbury, Ashley Gabriel, Azita Shokpour, Ben Samples, Bernie Huang, Carleigh Wood, Ching-Feng Yeh, Christian Puhrsch, Claudette Ward, Daniel Bolya, Daniel Li, Facundo Figueroa, Fazila Vhora, George Orlin, Hanzi Mao, Helen Klein, Hu Xu, Ida Cheng, Jake Kinney, Jiale Zhi, Jo Sampaio, Joel Schlosser, Justin Johnson, Kai Brown, Karen Bergan, Karla Martucci, Kenny Lehmann, Maddie Mintz, Mallika Malhotra, Matt Ward, Michelle Chan, Michelle Restrepo, Miranda Hartley, Muhammad Maaz, Nisha Deo, Peter Park, Phillip Thomas, Raghu Nayani, Rene Martinez Doehner, Robbie Adkins, Ross Girshick, Sasha Mitts, Shashank Jain, Spencer Whitehead, Ty Toledano, Valentin Gabeur, Vincent Cho, Vivian Lee, William Ngan, Xuehai He, Yael Yungster, Ziqi Pang, Ziyi Dou, Zoe Quake.