Segment Anything (SAM) takes inspiration from chat-based LLMs, where prompting is an integral part.
It consists of three components:
- Image Encoder
- Prompt Encoder
- Mask Decoder
First, the input image is passed through the image encoder, which produces a one-time embedding for the image.
A prompt encoder then handles points, boxes, or text:
- For points, the `x` and `y` coordinates, along with foreground and background information, become the input to the encoder.
- For boxes, the bounding box coordinates become the input to the encoder.
- For text, the tokens become the input.
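As a rough illustration of the point and box prompts described above, the sketch below packs them into plain Python structures. The `build_prompt` helper and its dictionary keys are hypothetical. The conventions it assumes (label `1` for foreground, `0` for background, boxes as `[x_min, y_min, x_max, y_max]`) are assumptions about how the official repo expects prompts, so check them against the repo before relying on them.

```python
from typing import Dict, List, Optional, Sequence, Tuple

FOREGROUND, BACKGROUND = 1, 0  # assumed label convention for point prompts


def build_prompt(
    points: Sequence[Tuple[float, float, int]] = (),
    box: Optional[Sequence[float]] = None,
) -> Dict[str, Optional[List]]:
    """Pack (x, y, label) points and an optional XYXY box into a prompt dict."""
    for _, _, label in points:
        if label not in (FOREGROUND, BACKGROUND):
            raise ValueError(f"label must be 0 or 1, got {label}")
    if box is not None:
        x_min, y_min, x_max, y_max = box
        if x_min >= x_max or y_min >= y_max:
            raise ValueError("box must satisfy x_min < x_max and y_min < y_max")
    return {
        # Empty lists collapse to None so "no points" is explicit.
        "point_coords": [[x, y] for x, y, _ in points] or None,
        "point_labels": [label for _, _, label in points] or None,
        "box": list(box) if box is not None else None,
    }


prompt = build_prompt(
    points=[(500, 375, FOREGROUND), (100, 50, BACKGROUND)],
    box=[425, 300, 700, 500],
)
print(prompt)
```

Keeping the coordinates and the labels in two parallel lists means one foreground click and one background click can be combined to refine the same object.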
If we provide a mask as an input, it goes directly through a downsampling stage, which uses 2D convolution layers. The model then sums the result element-wise with the image embedding to get the final vector.
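To make the downsampling step concrete, the sketch below works through the output-size arithmetic for a stack of strided 2D convolutions. The specific numbers (a 256x256 mask input, two 2x2 stride-2 convolutions followed by a 1x1 convolution, and a 64x64 image embedding) are assumptions based on our reading of the reference implementation, not something stated in this document.

```python
def conv2d_out_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a 2D convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed layer stack as (kernel, stride): two 2x2 stride-2 convolutions,
# then a 1x1 convolution that only changes the channel count.
layers = [(2, 2), (2, 2), (1, 1)]

size = 256  # assumed side length of the mask prompt
sizes = [size]
for kernel, stride in layers:
    size = conv2d_out_size(size, kernel, stride)
    sizes.append(size)

print(sizes)  # [256, 128, 64, 64]
```

Once the mask embedding has the same spatial size as the image embedding, the two can be combined position by position.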
The resulting embeddings then go to the lightweight mask decoder, which creates the final segmentation mask.
The image encoder is one of the most powerful components of SAM. It is built upon an MAE-pretrained ViT model.
In the prompt encoder, points, boxes, and text act as sparse inputs, and masks act as dense inputs. The creators of SAM represent points and bounding boxes using positional encodings summed with learned embeddings. For text prompts, SAM uses the text encoder from CLIP. For masks as prompts, after downsampling, the embedding is summed element-wise with the input image embedding.
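The idea of "positional encoding plus learned embedding" for a point prompt can be sketched in a few lines of plain Python. This is a minimal sketch, not the model's actual code: it assumes random Fourier features as the positional encoding, uses a tiny embedding size, and replaces the learned embeddings with random stand-ins. All names here (`EMBED_DIM`, `FREQS`, `LABEL_EMBED`, `embed_point`) are hypothetical.

```python
import math
import random

EMBED_DIM = 8  # tiny size for illustration only (assumed)
rng = random.Random(0)

# Fixed random projection: 2 input coordinates -> EMBED_DIM // 2 frequencies.
FREQS = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM // 2)] for _ in range(2)]

# Random stand-ins for learned embeddings, one per label (0 = background, 1 = foreground).
LABEL_EMBED = {label: [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for label in (0, 1)}


def positional_encoding(x, y, width, height):
    """Random Fourier features of a pixel coordinate normalized to [-1, 1]."""
    u = 2.0 * (x / width) - 1.0
    v = 2.0 * (y / height) - 1.0
    proj = [
        2.0 * math.pi * (u * FREQS[0][k] + v * FREQS[1][k])
        for k in range(EMBED_DIM // 2)
    ]
    return [math.sin(p) for p in proj] + [math.cos(p) for p in proj]


def embed_point(x, y, label, width, height):
    """Sum the positional encoding with the embedding for the point's label."""
    pe = positional_encoding(x, y, width, height)
    return [p + e for p, e in zip(pe, LABEL_EMBED[label])]


fg = embed_point(500, 375, 1, width=1024, height=1024)
bg = embed_point(500, 375, 0, width=1024, height=1024)
print(len(fg), fg != bg)
```

The same location yields different vectors for a foreground and a background click because only the label embedding changes. A box could be handled the same way by treating its two corners as two points, each with its own embedding.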
As of now, SAM is available in three ViT model sizes:
- ViT-B SAM (375 MB)
- ViT-L SAM (1.25 GB)
- ViT-H SAM (2.56 GB)
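A small lookup table can make the choice of variant explicit. Only the ViT-H checkpoint file name appears in this document; the ViT-B and ViT-L file names below are assumptions and should be checked against the official repo before downloading. The `checkpoint_url` helper is hypothetical.

```python
BASE_URL = "https://dl.fbaipublicfiles.com/segment_anything"

# model type -> (checkpoint file, approximate download size)
CHECKPOINTS = {
    "vit_b": ("sam_vit_b_01ec64.pth", "375 MB"),   # file name assumed
    "vit_l": ("sam_vit_l_0b3195.pth", "1.25 GB"),  # file name assumed
    "vit_h": ("sam_vit_h_4b8939.pth", "2.56 GB"),
}


def checkpoint_url(model_type: str) -> str:
    """Build the download URL for one of the three SAM variants."""
    if model_type not in CHECKPOINTS:
        raise KeyError(f"unknown model type {model_type!r}; expected one of {sorted(CHECKPOINTS)}")
    filename, _ = CHECKPOINTS[model_type]
    return f"{BASE_URL}/{filename}"


for name, (_, size) in CHECKPOINTS.items():
    print(name, size, checkpoint_url(name))
```

The smaller variants trade some mask quality for a lighter download and faster inference, which can matter when processing many video frames.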
We can use any of these SAM ViT variants to run inference on images, video, etc.
1. Download the weights using the HF Hub link or the official `fbaipublicfiles` link, like this:
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
Here, we are using the ViT-H SAM model.
2. Clone the repo using this command:
git clone https://github.com/facebookresearch/segment-anything.git
3. Run `setup.py` to install the module (for example, with `pip install -e .` from inside the cloned repo).
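Once the module is installed and the weights are downloaded, inference with a single point prompt might look like the sketch below. It assumes the `sam_model_registry` and `SamPredictor` entry points from the official repo, plus OpenCV and NumPy for image handling. The `segment_point` and `best_mask_index` helpers are hypothetical names introduced here for illustration.

```python
def best_mask_index(scores) -> int:
    """Index of the highest-scoring mask among the candidates."""
    return max(range(len(scores)), key=lambda i: scores[i])


def segment_point(image_path: str, checkpoint: str, x: int, y: int):
    """Return the best mask for one foreground click at pixel (x, y)."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import cv2
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; convert to RGB

    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # runs the image encoder once for this image

    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),  # 1 = foreground
        multimask_output=True,       # ask for several candidate masks
    )
    return masks[best_mask_index(scores)]


# The selection helper on its own, with made-up scores:
print(best_mask_index([0.71, 0.93, 0.88]))  # 1
```

A call such as `segment_point("example.jpg", "sam_vit_h_4b8939.pth", 500, 375)` would then return a boolean mask with the same height and width as the image, where `example.jpg` is a placeholder path. Because the image embedding is computed once in `set_image`, further prompts on the same image only need to rerun the lightweight prompt encoder and mask decoder.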
- https://github.com/facebookresearch/segment-anything [Official Repo]
- https://arxiv.org/pdf/2304.02643.pdf [Paper]
- https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/ [Blog]
- https://huggingface.co/ybelkada/segment-anything [HF Hub Weights]
- https://www.kaggle.com/code/raghvender/segment-anything-sam-onnx [Kaggle Notebook]