# Segment Anything Model (SAM) Research Paper Summary

This notebook is my personal research notebook for understanding the ins and outs of SAM.

Original paper: https://arxiv.org/pdf/2304.02643v1.pdf

<img src="images/paper_hero.jpg" >

Figure 1: We aim to build a foundation model for segmentation by introducing three interconnected components: 

1. a promptable segmentation task
2. a segmentation model (SAM) that powers data annotation and enables zero-shot transfer to a range of tasks via prompt engineering
3. a data engine for collecting SA-1B, our dataset of over 1 billion masks.

# Abstract Summary

- They created the largest segmentation dataset to date, with over 1 billion masks on 11M images.
- Model is designed and trained to be promptable so it can transfer zero-shot to new image distributions and tasks.
- Zero-shot performance is competitive with fully supervised results.
- They have released both the model and the dataset.

# Introduction Summary

- Foundation models can generalize to tasks and data distributions beyond those seen during training.
- Empirical trends show this behavior improving with model scale, dataset size and total training compute.
- This paper's goal is to build a foundation model for image segmentation (generalized foundational model).

### 1. Task: 

They defined a promptable segmentation task that is general enough to provide a powerful pretraining objective and to enable a wide range of downstream applications.

Inspired by NLP foundation models for zero-shot and few-shot learning, they propose the *promptable segmentation task*, where the goal is to return a valid segmentation mask given any segmentation prompt.

A prompt specifies what to segment in an image and can contain spatial or text information identifying an object.

The requirement of a valid output mask means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. They use the promptable segmentation task both as a pre-training objective and as a way to solve general downstream segmentation tasks via prompt engineering.

### 2. Model:

The task requires a model that meets three constraints: it must support flexible prompts, compute masks in amortized real time to allow interactive use, and be ambiguity-aware.

A single design satisfies all three constraints: 

- a powerful image encoder computes an image embedding
- a prompt encoder embeds prompts
- a lightweight mask decoder predicts segmentation masks. This is the Segment Anything Model.

By separating SAM into an image encoder and a fast prompt encoder / mask decoder, the same image embedding can be reused (and its cost amortized) with different prompts.
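To make the amortization point concrete for myself, here is a minimal sketch of the "encode once, prompt many times" pattern. The class name `PromptableSegmenter`, its methods, and the stand-in lambdas are all hypothetical and only illustrate the control flow; this is not the official API and no real model is involved.

```python
class PromptableSegmenter:
    """Hypothetical wrapper (not the official API): the heavy image encoder runs
    once per image, the light prompt encoder and mask decoder run once per prompt."""

    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        self.image_encoder = image_encoder
        self.prompt_encoder = prompt_encoder
        self.mask_decoder = mask_decoder
        self.encoder_calls = 0   # counts how often the expensive step runs
        self._embedding = None

    def set_image(self, image):
        # Expensive step: compute and cache the image embedding.
        self._embedding = self.image_encoder(image)
        self.encoder_calls += 1

    def predict(self, prompt):
        # Cheap step: reuse the cached embedding for every new prompt.
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        return self.mask_decoder(self._embedding, self.prompt_encoder(prompt))


# Stand-in callables so the sketch runs without any model weights.
segmenter = PromptableSegmenter(
    image_encoder=lambda image: ("embedding-of", image),
    prompt_encoder=lambda prompt: ("encoded", prompt),
    mask_decoder=lambda emb, p: {"embedding": emb, "prompt": p},
)
segmenter.set_image("photo.jpg")
results = [segmenter.predict(p) for p in [(10, 20), (30, 40), (50, 60)]]
```

Three prompts are answered here while the image encoder runs only once, which is the cost being amortized.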

**Algorithm of model:** 

Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in ∼50ms in a web browser. They focus on point, box, and mask prompts, and also present initial results with free-form text prompts. To make SAM ambiguity-aware, they design it to predict multiple masks for a single prompt, which lets SAM handle ambiguity naturally.

### 3. Data Engine:

Training such a model needs diverse, large-scale sources of data. They therefore built a data engine: they iterated between using their efficient model to assist in data collection and using the newly collected data to improve the model.

Their data engine has three stages: 

- assisted-manual - SAM assists annotators in annotating masks (classic interactive segmentation setup)
- semi-automatic - SAM automatically generates masks for a subset of objects by prompting it with likely object locations and annotators focus on annotating the remaining objects.
- fully automatic - They prompted SAM with a regular grid of foreground points, yielding on average ~100 high quality masks per image.
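The "regular grid of foreground points" in the fully automatic stage can be sketched in a few lines. The helper names `build_point_grid` and `to_pixels` are my own, and placing each point at the centre of its grid cell is an assumption; the paper reports a 32×32 grid, which I use as the example value.

```python
def build_point_grid(n_per_side):
    """Return n_per_side**2 (x, y) points in [0, 1], one at the centre of each grid cell."""
    centres = [(i + 0.5) / n_per_side for i in range(n_per_side)]
    return [(x, y) for y in centres for x in centres]


def to_pixels(points, width, height):
    """Scale normalised points to pixel coordinates for a given image size."""
    return [(x * width, y * height) for x, y in points]


grid = build_point_grid(32)                       # 32 * 32 = 1024 point prompts
pixel_prompts = to_pixels(grid, width=1024, height=768)
```

Each point would then be sent to the model as a single foreground-point prompt, and the resulting masks filtered and deduplicated afterwards (that filtering step is not shown here).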



## Experiments

- Using a diverse new suite of 23 segmentation datasets, they found that SAM produces high-quality masks from a single foreground point, with quality often only slightly below that of the manually annotated ground truth.
- They found consistently strong quantitative and qualitative results on a variety of downstream tasks under a zero-shot transfer protocol using prompt engineering, including edge detection, object proposal generation, instance segmentation, and a preliminary exploration of text-to-mask prediction.

# Segment Anything Task

Inspiration: in NLP, the **next token prediction task** is used for **foundation model pre-training** and to solve downstream tasks via prompt engineering.

 - A prompt can be foreground/background points, a rough box or mask, free-form text, or in general, what to segment in an image. 

Then, the **promptable segmentation task** is to return a **valid** segmentation mask given any prompt.

Valid = Even when a prompt is ambiguous and could refer to multiple objects, the output should be a reasonable mask for at least one of those objects.

- The promptable segmentation task suggests a natural pre-training algorithm that simulates a sequence of prompts (e.g., points, boxes, masks) for each training sample and compares the model’s mask predictions against the ground truth.
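A rough sketch of what "simulating prompts" from a training sample could look like: derive a point prompt and a box prompt from the ground-truth mask, then score a prediction against that mask. Uniform sampling of one foreground pixel and a tight, noise-free box are my simplifying assumptions, not the paper's exact sampling scheme; `simulate_prompts` and `iou` are hypothetical helper names.

```python
import numpy as np


def simulate_prompts(gt_mask, rng):
    """Derive a point prompt and a box prompt from a binary ground-truth mask.

    Assumption: one uniformly sampled foreground pixel and a tight box.
    """
    ys, xs = np.nonzero(gt_mask)
    if ys.size == 0:
        raise ValueError("ground-truth mask has no foreground pixels")
    i = rng.integers(ys.size)
    point = (int(xs[i]), int(ys[i]))                                      # (x, y)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))    # x0, y0, x1, y1
    return point, box


def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union else 1.0


gt = np.zeros((8, 8), dtype=bool)
gt[2:5, 3:7] = True                                # a 3x4 rectangle as the "object"
point, box = simulate_prompts(gt, np.random.default_rng(0))
```

In a training loop, the model's prediction for `point` or `box` would be compared against `gt`, and follow-up prompts could be sampled from the error region between the two.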

Unlike interactive segmentation whose aim is to eventually predict a valid mask after enough user input, their aim is to always predict a valid mask even when the prompt is ambiguous. This requires specialized modeling and training loss choices.

- Their pre-training task endows the model with the ability to respond appropriately to any prompt at inference time, and thus downstream tasks can be solved by engineering appropriate prompts.

For example, if one has a bounding box detector for cats, cat instance segmentation can be solved by providing the detector’s box output as a prompt to their model. In general, a wide array of practical segmentation tasks can be cast as prompting.

- Related tasks: Interactive segmentation, edge detection, superpixelization, object proposal generation, foreground segmentation, semantic segmentation, instance segmentation, panoptic segmentation, etc.

# Segment Anything Model

<img src = "images/segment_anything_model.png" >

**Image encoder** - MAE pre-trained Vision Transformer (ViT) minimally adapted to process higher-resolution inputs. The encoder runs once per image and can be applied prior to prompting the model.

**Prompt encoder** - Considers sparse (points, boxes, text) and dense (masks) prompts.

Points and boxes are represented by positional encodings summed with learned embeddings for each prompt type. Free-form text is encoded with an off-the-shelf text encoder from CLIP.

Dense (masks) prompts are embedded using convolutions and summed element-wise with the image embedding.
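For the sparse point prompts, "positional encoding summed with a learned embedding per prompt type" might look like the sketch below. The random Fourier-feature encoding, the 256-dimensional default, and the class name `PointPromptEncoder` are assumptions for illustration, and the "learned" type vectors are just random here since nothing is trained.

```python
import numpy as np


class PointPromptEncoder:
    """Hypothetical sketch: positional encoding + per-type embedding for point prompts."""

    def __init__(self, embed_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        # Random Fourier-feature projection (assumed form of the positional encoding).
        self.freqs = rng.normal(size=(2, embed_dim // 2))
        # One vector per point type: row 0 = background, row 1 = foreground.
        # These would be learned parameters; random stand-ins here.
        self.type_embed = rng.normal(size=(2, embed_dim))

    def __call__(self, coords, labels, image_size):
        h, w = image_size
        xy = np.asarray(coords, dtype=np.float64) / np.array([w, h])  # normalise to [0, 1]
        proj = 2.0 * np.pi * ((2.0 * xy - 1.0) @ self.freqs)          # (N, D/2)
        pos = np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)   # (N, D)
        return pos + self.type_embed[np.asarray(labels)]              # sum with type embedding


encoder = PointPromptEncoder()
# Same location, once as a foreground point (label 1) and once as background (label 0).
tokens = encoder([(10, 20), (10, 20)], labels=[1, 0], image_size=(64, 64))
```

The two tokens share the same positional part and differ only by the type embedding, which is how the decoder could tell a foreground click from a background click at the same location.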

**Mask decoder** - Maps image embedding, prompt embedding, and an output token to a mask (Modification of a Transformer decoder block followed by a dynamic mask prediction head).

The modified decoder block uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice-versa) to update all embeddings.

After running two blocks, they upsample the image embedding and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.
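My reading of the "dynamic linear classifier" step, as a small NumPy sketch: an MLP turns the output token into a weight vector, and the mask logit at each location is the dot product of that vector with the upsampled embedding at that location. The two-layer MLP, the tensor sizes, and the function name `dynamic_mask_head` are simplifying assumptions, and the weights are random rather than trained.

```python
import numpy as np


def dynamic_mask_head(output_token, upscaled_embedding, w1, w2):
    """output_token: (D,), upscaled_embedding: (C, H, W), w1: (D, D), w2: (D, C).

    Returns per-pixel foreground probabilities of shape (H, W).
    """
    hidden = np.maximum(output_token @ w1, 0.0)        # MLP layer 1 with ReLU
    classifier = hidden @ w2                           # (C,) dynamic linear classifier
    logits = np.einsum("c,chw->hw", classifier, upscaled_embedding)
    return 1.0 / (1.0 + np.exp(-logits))               # sigmoid -> foreground probability


rng = np.random.default_rng(0)
D, C, H, W = 8, 4, 16, 16                              # toy sizes, not the paper's
token = rng.normal(size=D)
embedding = rng.normal(size=(C, H, W))
probs = dynamic_mask_head(token, embedding, rng.normal(size=(D, D)), rng.normal(size=(D, C)))
```

The classifier is "dynamic" in the sense that its weights depend on the output token, and therefore on the prompt, instead of being fixed parameters.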

**Resolving ambiguity** - With a single output, the model would average multiple valid masks when the prompt is ambiguous. To avoid this, it outputs 3 masks per prompt (whole, part and subpart). During training they backprop only the minimum loss over masks. To rank masks, the model predicts a confidence score (estimated IoU) for each mask.
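The two mechanisms in this paragraph, sketched in plain Python: pick the candidate mask with the lowest loss for the training signal, and order candidates by predicted IoU at inference. The function names and the example numbers are made up for illustration.

```python
def min_loss_over_masks(per_mask_losses):
    """Return (index, loss) of the best candidate; only this one would receive gradient."""
    best = min(range(len(per_mask_losses)), key=lambda i: per_mask_losses[i])
    return best, per_mask_losses[best]


def rank_by_predicted_iou(predicted_ious):
    """Return candidate indices ordered from highest to lowest predicted IoU."""
    return sorted(range(len(predicted_ious)), key=lambda i: -predicted_ious[i])


# Training: three candidate masks (e.g. whole / part / subpart), one ground truth.
best_index, best_loss = min_loss_over_masks([0.9, 0.2, 0.5])

# Inference: no ground truth, so the predicted IoU scores decide the order.
ranking = rank_by_predicted_iou([0.6, 0.9, 0.3])
```

Taking the minimum means the model is not penalized for also proposing the other valid interpretations of an ambiguous prompt.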

**Efficiency** - Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU, in ∼50ms.

**Loss** - Linear combination of focal and dice loss.
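A minimal NumPy sketch of a focal plus dice combination, to pin down what the two terms measure. The focal form below omits the alpha class-balancing term, and `gamma=2.0` is a common default rather than a value taken from these notes. The 20:1 focal-to-dice weighting is what I recall from the paper's appendix; treat both default weights as assumptions to verify.

```python
import numpy as np


def focal_loss(probs, target, gamma=2.0, eps=1e-7):
    """Mean focal loss over pixels (no alpha term): down-weights easy pixels."""
    p = np.clip(probs, eps, 1.0 - eps)
    p_t = np.where(target == 1, p, 1.0 - p)            # probability of the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))


def dice_loss(probs, target, eps=1e-7):
    """1 - Dice coefficient: measures overlap between prediction and ground truth."""
    inter = np.sum(probs * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(probs) + np.sum(target) + eps))


def mask_loss(probs, target, focal_weight=20.0, dice_weight=1.0):
    """Linear combination of the two terms; the weights are assumed defaults."""
    return focal_weight * focal_loss(probs, target) + dice_weight * dice_loss(probs, target)


target = np.array([[1, 0], [0, 1]])
perfect = mask_loss(target.astype(float), target)      # prediction equals ground truth
inverted = mask_loss(1.0 - target, target)             # every pixel wrong
```

Focal loss works per pixel and focuses on hard pixels, while dice loss looks at region overlap, so the two terms complement each other when foreground and background are imbalanced.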