<a href="https://colab.research.google.com/github/NGLYRY/dsc140b-final-/blob/main/DSC140B_HW4_VLG_CBM_Annotation_2a_TODO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Concept Bottleneck Models and Automated Annotation using Grounding DINO

In this notebook, you will learn how to automatically generate concept annotations using a foundation model for open-vocabulary object detection called [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO). This pipeline mimics the approach proposed in VLG-CBM for reducing manual concept labeling.

We will:
- Load and configure a pre-trained Grounding DINO model
- Define interpretable concept sets for a target class
- Generate bounding box annotations for an input image
- Visualize annotations across multiple confidence thresholds

Let's begin!

In [5]:
# Install required libraries and Grounding DINO
# IMPORTANT: DO NOT MODIFY except the lines marked TODO

HOME_DIR = '/home' # TODO: Set HOME_DIR location
!pip install -q torch torchvision matplotlib
!git clone https://github.com/IDEA-Research/GroundingDINO.git {HOME_DIR}
%cd {HOME_DIR}/groundingdino/models/GroundingDINO/csrc/MsDeformAttn
!sed -i 's/value.type()/value.scalar_type()/g' ms_deform_attn_cuda.cu
!sed -i 's/value.scalar_type().is_cuda()/value.is_cuda()/g' ms_deform_attn_cuda.cu
%cd {HOME_DIR}
!pip install -q -e .

fatal: destination path '/home' already exists and is not an empty directory.
/home/groundingdino/models/GroundingDINO/csrc/MsDeformAttn
/home
  Preparing metadata (setup.py) ... done

In [6]:
# TODO: Move files in DSC140B_HW4_VLG_CBM_Annotation_2a.zip to HOME_DIR
import os
os.chdir(HOME_DIR)
print("Current working directory: ", os.getcwd())

Current working directory:  /home


In [7]:
import os
import json
import torch
import numpy as np
import matplotlib.pyplot as plt
import random
from PIL import Image, ImageDraw, ImageFont
from torchvision import transforms
from tqdm import tqdm
from IPython.display import display


from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict
from groundingdino.models import build_model
import groundingdino.datasets.transforms as T



In [8]:
# Check device in use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device: ", device)

Using device:  cuda


In [9]:
# IMPORTANT: DO NOT MODIFY THIS CELL
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

### Load the Pre-trained Grounding DINO Model

We now load the model checkpoint and configuration. This will allow us to perform open-vocabulary object detection based on text prompts.

In [10]:
def load_annotation_model(config_path, checkpoint_path, device="cuda"):
    """
    Loads the Grounding DINO model and tokenizer.

    Args:
        config_path (str): Path to the model configuration file.
        checkpoint_path (str): Path to the model weights (.pth file).
        device (str): Device to load model on ("cuda" or "cpu").

    Returns:
        model (torch.nn.Module): Grounding DINO model.
        tokenizer: Tokenizer used for prompt encoding.
    """
    args = SLConfig.fromfile(config_path)
    args.device = device
    model = build_model(args)
    checkpoint = torch.load(checkpoint_path, map_location="cpu") # TODO: load from checkpoint and map to CPU
    model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
    model.eval()
    model.to(device) # TODO: Set model into eval mode and move to device
    return model, model.tokenizer

# Download checkpoint if not already present
if not os.path.exists("groundingdino_swinb_cogcoor.pth"):
    !wget https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swinb_cogcoor.pth \
        -O groundingdino_swinb_cogcoor.pth

# Load model
model_config_path = "groundingdino/config/GroundingDINO_SwinB_cfg.py"
model_checkpoint_path = "groundingdino_swinb_cogcoor.pth"
device = "cuda" if torch.cuda.is_available() else "cpu" # TODO: Set device based on the torch.cuda.is_available() function

model, tokenizer = load_annotation_model(model_config_path, model_checkpoint_path, device)


--2025-06-11 08:45:52--  https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swinb_cogcoor.pth
Resolving huggingface.co (huggingface.co)... 18.164.174.17, 18.164.174.118, 18.164.174.23, ...
Connecting to huggingface.co (huggingface.co)|18.164.174.17|:443... connected.
HTTP request sent, awaiting response... 302 Found



final text_encoder_type: bert-base-uncased


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Setup Grounding DINO prompt

A **concept set** for a class in a Concept Bottleneck Model (CBM) refers to a collection of interpretable and human-understandable attributes that characterize the class. The image provided belongs to the class *Black-footed Albatross*, and its associated concept set is defined below.

To ensure that each concept (e.g., “black feet”, “long wings”) is treated as a distinct entity by Grounding DINO, we concatenate concepts using periods (`"."`) as delimiters. This punctuation helps the model tokenize and attend to each phrase individually during detection.

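Additionally, the class name (e.g., *Black-footed Albatross*) is placed at the beginning of the prompt. This is intended to steer the model toward attributes specific to that class. Note that the class name is itself delimited by a period, so it is detected as its own phrase alongside the concepts.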
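
As a rough illustration of the period-delimited format, the sketch below builds a prompt and splits it back into phrases using plain string operations. The helper names `build_prompt` and `split_prompt` are hypothetical and not part of Grounding DINO. Lowercasing the phrases and ending the prompt with a trailing period are assumptions of this sketch, chosen so that every phrase, including the last one, is closed by a delimiter.

```python
def build_prompt(class_label, concepts):
    """Join the class label and concepts into one period-delimited prompt."""
    phrases = [class_label] + list(concepts)
    # Lowercasing and the trailing period are assumptions of this sketch
    return " . ".join(p.strip().lower() for p in phrases) + " ."


def split_prompt(prompt):
    """Recover the individual phrases from a period-delimited prompt."""
    return [p.strip() for p in prompt.split(".") if p.strip()]


demo_prompt = build_prompt("Black footed Albatross", ["black feet", "yellow bill"])
demo_phrases = split_prompt(demo_prompt)
print(demo_prompt)
print(demo_phrases)
```

Splitting on the period recovers exactly one entry per phrase, which is the same idea the annotation code relies on when it looks for period tokens in the tokenized prompt.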

In [11]:
CLASS_LABEL = "Black footed Albatross"

# Define concepts associated with the Black footed Albatross class
CONCEPT_SET = [
    "black feet",
    "dark wingtips",
    "large size",
    "large wingspan",
    "long wings",
    "white body",
    "yellow beak",
    "yellow bill"
]

# Construct prompt: Class name followed by dot-separated concepts
prompt = CLASS_LABEL + "." + " . ".join(CONCEPT_SET)
print("Grounding DINO Prompt:", prompt)

Grounding DINO Prompt: Black footed Albatross.black feet . dark wingtips . large size . large wingspan . long wings . white body . yellow beak . yellow bill


### Load Image and Apply Transforms

We prepare the input image using standard transformations expected by Grounding DINO.
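
To make the normalization step concrete, here is a minimal sketch of what it does to a single RGB pixel, written in plain Python rather than torchvision. The helper name `normalize_pixel` is hypothetical. It assumes 8-bit input values, which `ToTensor` rescales to the range [0, 1] before the per-channel mean and standard deviation are applied.

```python
MEAN = [0.485, 0.456, 0.406]  # per-channel means (R, G, B)
STD = [0.229, 0.224, 0.225]   # per-channel standard deviations (R, G, B)


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to its normalized per-channel values."""
    # Step 1: scale 0..255 to 0..1. Step 2: subtract the mean, divide by the std.
    return [((value / 255.0) - m) / s for value, m, s in zip(rgb, MEAN, STD)]


white = normalize_pixel([255, 255, 255])
black = normalize_pixel([0, 0, 0])
print(white)
print(black)
```

After normalization, values are no longer confined to [0, 1]: a white pixel maps to positive values above 2 and a black pixel to negative values around -2.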

In [12]:
# TODO: Define a transform that resizes images to 800x800, converts the image to a tensor, and then
# applies the following normalization: MEAN: [0.485, 0.456, 0.406] and STD: [0.229, 0.224, 0.225]
transform = transforms.Compose([
    transforms.Resize((800, 800)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

raw_transform = transforms.Compose([transforms.Resize((800, 800))])
image_path = os.path.join(HOME_DIR, "Black_Footed_Albatross.jpg")  # Avoids the doubled leading slash
image_pil = Image.open(image_path).convert("RGB")  # TODO: LOAD PIL IMAGE
image_pil = raw_transform(image_pil)
image_tensor = transform(image_pil) # TODO: Apply transform to image


### Run Inference with Grounding DINO and Process Annotations

Now that we have a preprocessed image and a natural language prompt, we can run inference using the Grounding DINO model and process the model output. We define a threshold (`THRESHOLD`) that determines whether a predicted concept-region pair is confident enough to be included in the final annotations. The threshold is applied to a "perplexity-like" confidence score: the geometric mean of the per-token probabilities (sigmoid of the logits) over each concept span.
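
The score can be sketched for a single box and a single concept span as follows. The helper name `span_confidence` is hypothetical, and the probabilities in the example are made up for illustration. This sketch computes the geometric mean in log space, which is numerically equivalent to taking the product and then the n-th root but less prone to underflow on long spans.

```python
import math


def span_confidence(token_probs):
    """Geometric mean of the per-token probabilities for one concept span."""
    if not token_probs or any(p <= 0 for p in token_probs):
        return 0.0  # an empty span or a zero probability gives zero confidence
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


# Illustrative values for a two-token concept such as "black feet"
score = span_confidence([0.9, 0.4])
keep = score > 0.35  # compare against the chosen threshold
print(score, keep)
```

Because the geometric mean is dominated by the weakest token, a concept is kept only when every token in its span receives a reasonably high probability for the box.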

In [None]:
def get_predictions(model, image_tensor, prompts):
    """
    Runs inference on the image and returns prediction logits and boxes.

    Args:
        model: Grounding DINO model
        image_tensor: Normalized tensor of shape (1, 3, H, W)
        prompts: List of prompt strings

    Returns:
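        logits (Tensor): Per-token confidence scores after sigmoid, shape (1, num_boxes, num_tokens)
        boxes (Tensor): Predicted bounding boxes in normalized [cx, cy, w, h] format, shape (1, num_boxes, 4)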
    """
    with torch.no_grad():
        outputs = model(image_tensor, captions=prompts)
    return outputs["pred_logits"].sigmoid(), outputs["pred_boxes"]

In [None]:
def process_annotations(image_pil, prompt, logits, boxes, tokenizer, threshold=0.35):
    """
    Processes model outputs to extract bounding boxes and concept labels based on a confidence threshold.

    Args:
        image_pil (PIL.Image): Original (resized) image for which annotations are generated.
        prompt (str): Prompt string containing the class and dot-separated concepts.
        logits (torch.Tensor): Output logits from Grounding DINO (after sigmoid), shape: (1, num_boxes, num_tokens).
        boxes (torch.Tensor): Normalized bounding boxes from Grounding DINO, shape: (1, num_boxes, 4).
        tokenizer: Tokenizer used to tokenize the prompt.
        threshold (float): Per-concept threshold for including bounding boxes based on perplexity (confidence proxy).

    Returns:
        annotations (list of dict): List of dictionaries with keys:
            - "concept": The human-readable concept string
            - "box": Bounding box in [x_min, y_min, x_max, y_max] format, scaled to image size
    """
    annotations = []
    W, H = image_pil.size  # Note: PIL uses (width, height)

    # Convert model outputs to NumPy arrays
    logits = logits[0].cpu().numpy()
    boxes = boxes[0].cpu().numpy()

    # Tokenize prompt and remove start/end tokens
    prompt_tokenized = tokenizer(prompt)["input_ids"][1:-1]
    logits = logits[:, 1:-1]  # Remove logits for special tokens

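    # Identify concept boundaries in token sequence (period token has ID 1012)
    split_indices = [i for i, token_id in enumerate(prompt_tokenized) if token_id == 1012]
    # If the prompt does not end with a period, close the final concept span explicitly
    if not split_indices or split_indices[-1] != len(prompt_tokenized) - 1:
        split_indices.append(len(prompt_tokenized))
    start = 0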

    for split in split_indices:
        # Slice out tokens for one concept
        concept_ids = prompt_tokenized[start:split]
        concept_text = tokenizer.decode(concept_ids).strip()

        # Get the logits associated with this concept span
        concept_logits = logits[:, start:split]

        for j in range(len(concept_logits)):
            # Approximate confidence using geometric mean (perplexity-like)
            prob = np.prod(concept_logits[j])
            perplexity = prob ** (1 / len(concept_ids)) if len(concept_ids) > 0 else 0

            if perplexity > threshold:
                # Convert box from [cx, cy, w, h] to [x0, y0, x1, y1] in image coordinates
                box = boxes[j].copy()  # Copy so a box matched by several concepts is not rescaled in place
                box[[0, 2]] *= W
                box[[1, 3]] *= H
                box[0] -= box[2] / 2  # x0 = cx - w/2
                box[1] -= box[3] / 2  # y0 = cy - h/2
                box[2] += box[0]      # x1 = x0 + w
                box[3] += box[1]      # y1 = y0 + h

                # Save annotation
                annotations.append({
                    "concept": concept_text,
                    "box": box
                })

        start = split + 1  # Move to next concept span

    return annotations
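
The box conversion used in `process_annotations` can be isolated as a small pure function. This is a minimal sketch with a hypothetical helper name, `cxcywh_to_xyxy`. It assumes the input box is normalized to [0, 1] in `[cx, cy, w, h]` order, and it returns a new list instead of modifying its input.

```python
def cxcywh_to_xyxy(box, width, height):
    """Convert a normalized [cx, cy, w, h] box to pixel [x0, y0, x1, y1]."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * width   # left edge
    y0 = (cy - h / 2) * height  # top edge
    x1 = (cx + w / 2) * width   # right edge
    y1 = (cy + h / 2) * height  # bottom edge
    return [x0, y0, x1, y1]


normalized_box = [0.5, 0.5, 0.5, 0.25]
pixel_box = cxcywh_to_xyxy(normalized_box, 800, 800)
print(pixel_box)
```

Returning a fresh list matters when the same predicted box passes the threshold for more than one concept: the normalized coordinates stay untouched, so each conversion starts from the same values.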

### Visualize Annotations

Now that we have processed the annotations, we can visualize the annotated image at several confidence thresholds.


In [None]:
def plot_annotations(image, annotations, title=None):
    """
    Annotate a PIL image with bounding boxes and concept labels using PIL's drawing utilities,
    and display the result using IPython display.

    Args:
        image (PIL.Image): Input image to annotate.
        annotations (list of dict): Each dict must contain 'box' (xyxy format) and 'concept'.
        title (str, optional): Optional title to print before displaying the image.
    """
    image_copy = image.copy()
    draw = ImageDraw.Draw(image_copy)

    # Optional: Load a font (fallback to default if unavailable)
    try:
        font = ImageFont.truetype("arial.ttf", size=12)
    except IOError:
        font = ImageFont.load_default()

    for ann in annotations:
        box = ann["box"]
        concept = ann["concept"]
        try:
            draw.rectangle(box, outline="red", width=2)
            draw.text((box[0], box[1]), concept, fill="black", font=font)
        except ValueError:
            # Skip degenerate boxes that PIL refuses to draw
            pass

    if title:
        print(title)
    display(image_copy)

# Plot the image with annotations at each confidence threshold
THRESHOLDS = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.6]
# TODO: Get predictions. Note that the model takes a batch as input
logits, boxes = get_predictions(model, image_tensor.unsqueeze(0).to(device), [prompt])
for THRESHOLD in THRESHOLDS:
    annotations = process_annotations(image_pil, prompt, logits.cpu().detach().clone(), boxes.cpu().detach().clone(), tokenizer, threshold=THRESHOLD)
    print("Plotting for threshold: ", THRESHOLD)
    plot_annotations(image_pil, annotations)