# Q2 — Text-Driven Image Segmentation with SAM 2 (Colab)

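This notebook demonstrates a text-driven segmentation pipeline: image -> text prompt -> region seeds (bounding boxes from GroundingDINO) -> SAM box prompts -> mask overlay. The code uses the original `segment_anything` package (SAM), not the separate SAM 2 release.

Run the notebook in Colab. The top cells install dependencies, so run them first; then download the model checkpoints into the runtime and set their paths in section 2.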

In [None]:
# Colab helper: install required packages when running in Colab
import sys
if 'google.colab' in sys.modules:
    print('Running in Colab: if needed, uncomment install lines below and run this cell to install dependencies (may take several minutes)')
    # Uncomment the following lines in Colab to install
    # !pip install -q git+https://github.com/facebookresearch/segment-anything.git
    # !pip install -q git+https://github.com/IDEA-Research/GroundingDINO.git
    # !pip install -q transformers timm opencv-python-headless matplotlib
else:
    print('Not running in Colab; skip installs')
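
Before running the rest of the notebook, it can help to confirm that the installs actually succeeded. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the list contains the top-level import names this notebook relies on.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level import names used later in this notebook
required = ['torch', 'cv2', 'numpy', 'matplotlib', 'segment_anything', 'groundingdino']
missing = missing_packages(required)
if missing:
    print('Missing packages:', ', '.join(missing))
else:
    print('All required packages are importable')
```

If anything is reported as missing, uncomment and run the install lines above, then restart the runtime if Colab asks for it.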

In [None]:
# Imports (works in local env if packages are installed; in Colab run the install cell first)
import torch, cv2, numpy as np, matplotlib.pyplot as plt
from PIL import Image
# The heavy imports are wrapped so the notebook can be opened without installing packages
try:
    from segment_anything import sam_model_registry, SamPredictor
    # The module-level inference function is `predict`; `predict_with_caption` is a method of the `Model` class and cannot be imported from the module directly
    from groundingdino.util.inference import load_model, load_image, predict
except Exception as e:
    print('Optional imports failed (expected if not installed). Install the packages in Colab before running the full pipeline:', e)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('device ->', device)

## 1) Provide or download an image

You can either upload an image via Colab's UI or download a sample image using the cell below. Replace the URL with your image if desired.

In [None]:
# Example: download a sample image (replace the URL or use upload)
import urllib.request
sample_url = 'https://images.unsplash.com/photo-1518791841217-8f162f1e1131'
img_path = 'sample.jpg'
urllib.request.urlretrieve(sample_url, img_path)
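img = cv2.imread(img_path)
# cv2.imread returns None instead of raising when the file is missing or unreadable
if img is None:
    raise FileNotFoundError(f'Could not read image at {img_path}')
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)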
plt.figure(figsize=(6,6)); plt.imshow(img_rgb); plt.axis('off')

## 2) Set model checkpoint paths (download in Colab)

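You must download the GroundingDINO and SAM checkpoints into the Colab runtime and set the paths below; download links are in the respective repos. For GroundingDINO, use a Swin Transformer checkpoint together with its matching config file (the config path below assumes the repo is cloned to `/content/GroundingDINO`). For SAM, use a ViT checkpoint such as `sam_vit_h_4b8939.pth` or `sam_vit_b_01ec64.pth`, and make sure the key passed to `sam_model_registry` in section 4 matches the variant (`'vit_h'` for the ViT-H checkpoint).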

In [None]:
# Set these paths after you download the weights to the Colab runtime
GROUNDING_DINO_WEIGHTS = '/content/groundingdino_swint_ogc.pth'  # <- download and set in Colab
GROUNDING_DINO_CONFIG = '/content/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py'  # path inside cloned repo
SAM_WEIGHTS = '/content/sam_vit_h_4b8939.pth'  # <- download and set in Colab
print('Set the model weight paths and ensure files exist in Colab before running detection')
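
The cell above only prints a reminder; it does not verify that the files are present. A minimal sketch of such a check follows. `missing_files` is a hypothetical helper, and the paths simply repeat the defaults from the cell above, so adjust them if you saved the files elsewhere.

```python
from pathlib import Path

def missing_files(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Same locations as the defaults above (assumed, not verified)
checkpoint_paths = [
    '/content/groundingdino_swint_ogc.pth',
    '/content/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py',
    '/content/sam_vit_h_4b8939.pth',
]
for path in missing_files(checkpoint_paths):
    print('Missing:', path)
```

Running this before the detection cell turns a late, hard-to-read loading error into an explicit list of files to download.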

## 3) GroundingDINO: convert text prompt to boxes/phrases

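The cell below runs GroundingDINO inference and returns bounding boxes, confidence scores, and matched phrases for the given text prompt. GroundingDINO's `predict` returns boxes as normalized [cx,cy,w,h]; the cell converts them to [x0,y0,x1,y1] pixel coordinates, the format SAM expects.

In [None]:
prompt = 'a red bicycle'  # replace with a phrase describing an object in your image
print('Prompt:', prompt)
boxes, logits, phrases = None, None, None
try:
    # Imported here so this cell works on its own; `predict` is the module-level inference function
    from groundingdino.util.inference import load_model, load_image, predict
    from torchvision.ops import box_convert
    gd_model = load_model(GROUNDING_DINO_CONFIG, GROUNDING_DINO_WEIGHTS, device=device)
    image_source, image_tensor = load_image(img_path)  # RGB array and preprocessed tensor
    # Thresholds are typical starting values; lower them if nothing is detected
    boxes, logits, phrases = predict(gd_model, image_tensor, prompt, box_threshold=0.35, text_threshold=0.25, device=device)
    # Scale normalized [cx,cy,w,h] to pixels, then convert to [x0,y0,x1,y1]
    h, w = image_source.shape[:2]
    boxes = box_convert(boxes * torch.tensor([w, h, w, h], dtype=boxes.dtype), in_fmt='cxcywh', out_fmt='xyxy')
    print('Found phrases:', phrases)
    print('Boxes count:', 0 if boxes is None else len(boxes))
except Exception as e:
    print('GroundingDINO inference failed; ensure weights and config are correct and installed in Colab:', e)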
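
GroundingDINO's `predict` function returns boxes as normalized [cx,cy,w,h] values in the range 0 to 1, while SAM's box prompts use pixel [x0,y0,x1,y1]. A minimal pure-Python sketch of that conversion for a single box follows. The function name `cxcywh_norm_to_xyxy_pixels` is hypothetical and exists only to show the arithmetic.

```python
def cxcywh_norm_to_xyxy_pixels(box, image_width, image_height):
    """Convert one normalized [cx, cy, w, h] box to pixel [x0, y0, x1, y1]."""
    cx, cy, w, h = box
    # Scale from the 0..1 range to pixels
    cx, w = cx * image_width, w * image_width
    cy, h = cy * image_height, h * image_height
    # Move from center/size to corner coordinates
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]

# A box centered in a 640x480 image, half as wide and a quarter as tall
print(cxcywh_norm_to_xyxy_pixels([0.5, 0.5, 0.5, 0.25], 640, 480))
# prints [160.0, 180.0, 480.0, 300.0]
```

For batches of boxes held in tensors, `torchvision.ops.box_convert` with `in_fmt='cxcywh'` and `out_fmt='xyxy'` performs the center-to-corner step; the scaling to pixels still has to be applied separately.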

## 4) Run SAM using the detected boxes as prompts

This section loads SAM and feeds the boxes as prompts. SAM will produce masks which we overlay on the image.

In [None]:
try:
    sam = sam_model_registry['vit_h'](checkpoint=SAM_WEIGHTS).to(device)
    predictor = SamPredictor(sam)
    predictor.set_image(img_rgb)
    if boxes is not None and len(boxes) > 0:
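        # SAM expects pixel [x0,y0,x1,y1] boxes; GroundingDINO's raw output is normalized [cx,cy,w,h] and must be converted first
        boxes_xyxy = torch.as_tensor(boxes, dtype=torch.float32, device=device)
        transformed_boxes = predictor.transform.apply_boxes_torch(boxes_xyxy, img_rgb.shape[:2])
        # point_coords and point_labels have no defaults in predict_torch, so pass None for box-only prompts
        masks, scores, _ = predictor.predict_torch(point_coords=None, point_labels=None, boxes=transformed_boxes, multimask_output=False)
        masks = masks[:, 0].cpu().numpy()  # (N, 1, H, W) -> (N, H, W)
        # Overlay each mask on the image
        plt.figure(figsize=(8,8))
        plt.imshow(img_rgb); plt.axis('off')
        for m in masks:
            plt.imshow(np.ma.masked_where(m == 0, m.astype(np.uint8)), alpha=0.5)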
        plt.title('SAM masks (from text prompt)')
        plt.show()
    else:
        print('No boxes found to feed to SAM. Try a different prompt or check GroundingDINO output.')
except Exception as e:
    print('SAM inference failed. Ensure segment-anything is installed and SAM_WEIGHTS path is correct:', e)
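
The overlay can also be computed directly on the pixel array, which is convenient when you want to save the result rather than display it. This is a sketch under assumptions: `overlay_mask` is a hypothetical helper, the mask is a 2D boolean array with the image's height and width, and the color and opacity are arbitrary choices.

```python
import numpy as np

def overlay_mask(image_rgb, mask, color=(255, 0, 0), alpha=0.5):
    """Alpha-blend `color` into `image_rgb` wherever `mask` is True.

    image_rgb: (H, W, 3) uint8 array; mask: (H, W) boolean array.
    Returns a new uint8 array; the input image is not modified.
    """
    blended = image_rgb.astype(np.float32)
    tint = np.array(color, dtype=np.float32)
    blended[mask] = (1.0 - alpha) * blended[mask] + alpha * tint
    return blended.round().astype(np.uint8)

# Tiny synthetic example: a 2x2 black image with one masked pixel
image = np.zeros((2, 2, 3), dtype=np.uint8)
mask = np.array([[True, False], [False, False]])
result = overlay_mask(image, mask, color=(200, 0, 0), alpha=0.5)
print(result[0, 0])  # the masked pixel is tinted
print(result[1, 1])  # unmasked pixels are unchanged
```

With the notebook's variables, a single 2D mask `m` could be applied as `overlay_mask(img_rgb, m.astype(bool))`, and the result written with `cv2.imwrite` after converting back to BGR.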

## Notes and limitations
- You must download pretrained weights for GroundingDINO and SAM into the Colab runtime; links are available in the respective GitHub repos.
- This notebook uses GroundingDINO to convert text prompts to bounding boxes. Alternatives: CLIPSeg, GLIP, or fine-tuned text->box models.
- If detection fails for ambiguous prompts, try more specific phrases or use multiple prompts.
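
On the multiple-prompt suggestion: the GroundingDINO README demo separates category names with periods inside a single caption (for example `chair . person . dog .`). A minimal sketch of building such a caption follows. `build_caption` is a hypothetical helper, not part of the GroundingDINO API, and the lowercasing is an assumption meant to mirror the library's own caption preprocessing.

```python
def build_caption(phrases):
    """Join phrases into one caption of the form 'a . b . c .'."""
    cleaned = [p.strip().lower().rstrip('.').strip() for p in phrases]
    cleaned = [p for p in cleaned if p]  # drop empty entries
    return ' . '.join(cleaned) + ' .' if cleaned else ''

print(build_caption(['Red bicycle', 'helmet', ' person. ']))
# prints: red bicycle . helmet . person .
```

The resulting string can be used in place of the single `prompt` in section 3; the returned phrases then indicate which category each box matched.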

