# Leveraging Large Language Models for Automated Image Annotation

# Introduction
## Background

The rise of supervised learning in computer vision has created a huge demand for high-quality labeled datasets. But manual annotation is slow, expensive, and often inconsistent.

The Data Annotation Bottleneck
- High cost – Labeling data is expensive.
- Low efficiency – It takes too long.
- Errors & inconsistencies – Humans make mistakes.

Most current systems, like YOLO, rely on supervised models to label images. But they have serious limitations:
- Require constant fine-tuning for different tasks.
- Depend on a fixed set of class labels, limiting flexibility.

A Smarter Approach: Multimodal Large Language Models (LLMs)
- Adapt easily to new categories – No need for retraining.
- Understand context & semantics – Go beyond simple keywords.
- Affordable & accessible – Many powerful LLMs offer API access, cutting costs.

By combining vision and language models, we can make annotation faster, cheaper, and more accurate.

## Overview of the solution

Understanding Images with OwlViT

- Combines text and image features to identify objects based on descriptions.

Smart Segmentation with SAM

- Uses prompts (points or bounding boxes) to pinpoint and segment objects in images.

Seamless Pipeline: OwlViT + SAM
- OwlViT links text descriptions to objects in the image.
- SAM precisely segments the identified objects based on OwlViT’s insights or user prompts.
- Generates clean annotations—bounding boxes, masks, and labels—ready for AI model training.

## Business Value
Saves Time & Money

- Cuts down on labeling costs and effort, making AI development more affordable and efficient.

Makes AI More Accessible

- Helps small businesses and research teams get high-quality labeled data without huge budgets.

Drives Innovation

- Speeds up AI adoption and inspires new breakthroughs in technology.

Boosts Model Performance

- Creates more balanced datasets, reducing bias and improving real-world accuracy.


# Data and Data Preprocessing

We plan to use the Microsoft COCO (Common Objects in Context) dataset [Lin et al., 2014] to support our analysis and modeling tasks. OwlViT and SAM are pre-trained; we use the data to fine-tune our pipeline.

- Feature: Images, texts

- Annotations: The image and text data will be processed separately through an Image Encoder and a Text Encoder, mapping them into a shared high-dimensional embedding space.

- Labels: Segmentation masks

## Basic Statistics

In [None]:
from pycocotools.coco import COCO
import json
import numpy as np
import collections
import matplotlib.pyplot as plt
import torch
from torchvision.ops import box_iou
import torchvision.transforms as transforms
from PIL import Image
import albumentations as A
import cv2

In [None]:
# import coco dataset
coco_annotation_path = "datasets/coco/annotations/instances_train2017.json"
coco = COCO(coco_annotation_path)

def convert_bbox_format(bbox):
    """Convert COCO bbox [x_min, y_min, width, height] format to [x_center, y_center, width, height] format."""
    x_min, y_min, width, height = bbox
    x_center = x_min + width / 2
    y_center = y_min + height / 2
    return [x_center, y_center, width, height]

In [None]:
# Number and names of Categories
categories = coco.loadCats(coco.getCatIds())
num_categories = len(categories)
category_names = [cat['name'] for cat in categories]
print(f"Number of Categories: {num_categories}")
print(f"Names of Categories: {category_names}")

In [None]:
# Number of images, number of annotations, and average number of annotations per image
num_images = len(coco.getImgIds())
print(f"Number of Images: {num_images}")

num_annotations = len(coco.getAnnIds())
print(f"Number of Annotations: {num_annotations}")

image_ids = coco.getImgIds()
num_objects_per_image = [len(coco.getAnnIds(imgIds=[img_id])) for img_id in image_ids]
mean_objects_per_image = np.mean(num_objects_per_image)
print(f"Average Number of Annotations per Image: {mean_objects_per_image:.2f}")

In [None]:
# Number of objects per category
category_counts = collections.Counter()
for ann in coco.loadAnns(coco.getAnnIds()):
    category_counts[ann["category_id"]] += 1

for cat_id, count in category_counts.items():
    cat_name = coco.loadCats([cat_id])[0]["name"]
    print(f"{cat_name}: {count} objects")

In [None]:
# The distribution of object areas
area_list = [ann['area'] for ann in coco.loadAnns(coco.getAnnIds())]
plt.hist(area_list, bins=50, log=True)
plt.xlabel('Object Area')
plt.ylabel('Number')
plt.title('Histogram of Object Area')
plt.show()

## Outlier/abnormal sample detection
- Identifies and removes images that do not have any annotations.
- Removes duplicate annotations using IoU (Intersection over Union).
- Removes objects with area smaller than min_area

## Preprocessing
- Converts images to RGB format.
- Resizes images to 256 x 256.
- Normalizes the pixel values.
- Converts bounding boxes from COCO format [x, y, w, h] to [x1, y1, x2, y2].
- Applies data augmentation (if enabled).

## Data augmentation
- Horizontal flipping
- 90-degree rotation
- Color jittering
- Cutout augmentation

In [None]:
class COCOPipeline:
    def __init__(self, annotation_path, iou_threshold=0.9, min_area=100, augment=True):
        """
        COCO Data Cleaning and Augmentation Pipeline.
        :param annotation_path: Path to COCO JSON annotation file
        :param iou_threshold: IoU threshold to remove duplicate annotations
        :param min_area: Minimum area to keep an object
        :param augment: Whether to apply data augmentation
        """
        self.coco = COCO(annotation_path)
        self.iou_threshold = iou_threshold
        self.min_area = min_area
        self.augment = augment
        self.filtered_images = []
        self.filtered_annotations = []

    def remove_no_annotation_images(self):
        """Removes images without annotations"""
        annotated_imgs = {ann["image_id"] for ann in self.coco.loadAnns(self.coco.getAnnIds())}
        all_imgs = set(self.coco.getImgIds())
        empty_imgs = all_imgs - annotated_imgs

        print(f"Number of images without annotations: {len(empty_imgs)}")
        self.filtered_images = [img for img in self.coco.loadImgs(list(annotated_imgs))]

    def remove_duplicate_annotations(self):
        """Removes duplicate annotations (IoU > iou_threshold)"""
        image_to_annotations = {}
        for ann in self.coco.loadAnns(self.coco.getAnnIds()):
            img_id = ann["image_id"]
            if img_id not in image_to_annotations:
                image_to_annotations[img_id] = []
            image_to_annotations[img_id].append(ann)

        filtered_annotations = []
        for img_id, anns in image_to_annotations.items():
            if len(anns) < 2:
                filtered_annotations.extend(anns)
                continue  

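            # Get bounding boxes (as float [x1, y1, x2, y2]) and categories
            bboxes = torch.tensor([self.convert_bbox_format(ann["bbox"]) for ann in anns], dtype=torch.float32)
            categories = [ann["category_id"] for ann in anns]

            # Compute pairwise IoU matrix
            iou_matrix = box_iou(bboxes, bboxes)

            # Store indices of annotations to keep
            keep = set(range(len(anns)))
            for i in range(len(anns)):
                if i not in keep:
                    continue  # Already removed as a duplicate of an earlier annotation
                for j in range(i + 1, len(anns)):
                    # A duplicate has the same category and nearly the same box
                    if iou_matrix[i, j] > self.iou_threshold and categories[i] == categories[j]:
                        keep.discard(j)

            # Sort so the surviving annotations keep their original order
            filtered_annotations.extend([anns[i] for i in sorted(keep)])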

        self.filtered_annotations = filtered_annotations
        print(f"Number of non-duplicate annotations: {len(self.filtered_annotations)}")

    def filter_small_objects(self):
        """Removes objects with area smaller than min_area"""
        self.filtered_annotations = [ann for ann in self.filtered_annotations if ann['area'] >= self.min_area]
        print(f"Number of annotations after filtering small objects: {len(self.filtered_annotations)}")

    def preprocess_image(self, image_path):
        """Preprocesses an image by resizing, normalizing, and applying data augmentation"""
        image = Image.open(image_path).convert('RGB')
        image = np.array(image)

        if self.augment:
            image = self.apply_augmentation(image)

        transform = transforms.Compose([
            transforms.Resize((256, 256)),  # Resize to a fixed size
            transforms.ToTensor(),  # Convert to PyTorch Tensor
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize
        ])
        return transform(Image.fromarray(image))

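    def apply_augmentation(self, image):
        """Applies data augmentation using Albumentations (image only)"""
        # Note: flipping and rotation move objects, but only the image is transformed here.
        # Boxes and masks must be transformed the same way (e.g., via Albumentations'
        # bbox_params) before they are paired with the augmented image.
        augment = A.Compose([
            A.HorizontalFlip(p=0.5),  # 50% probability of horizontal flip
            A.RandomRotate90(p=0.5),  # 90-degree rotation
            A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.5),  # Color jitter
            # A.Cutout is deprecated in newer Albumentations releases in favor of A.CoarseDropout
            A.Cutout(num_holes=3, max_h_size=20, max_w_size=20, p=0.5),  # Cutout augmentation
        ])
        return augment(image=image)['image']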

    @staticmethod
    def convert_bbox_format(bbox):
        """Converts COCO format [x, y, w, h] to [x1, y1, x2, y2]"""
        x, y, w, h = bbox
        return [x, y, x + w, y + h]

    def run_pipeline(self):
        """Executes the full pipeline"""
        print("Starting data cleaning and augmentation...")
        self.remove_no_annotation_images()
        self.remove_duplicate_annotations()
        self.filter_small_objects()
        print("Data cleaning and augmentation completed!")

# Run the COCO preprocessing pipeline
annotation_path = "datasets/coco/annotations/instances_train2017.json"
coco_pipeline = COCOPipeline(annotation_path, augment=True)
coco_pipeline.run_pipeline()

# Models

## 1. CLIP (Contrastive Language-Image Pretraining)

CLIP, proposed by OpenAI, is a **multi-modal learning model** that associates **images** and **text**, enabling **zero-shot classification and retrieval**.

### CLIP's Internal Structure

CLIP consists of **two encoders**:

- **Vision Encoder**
    - Uses **ViT (Vision Transformer)** or ResNet to extract image features
    - Converts images into **fixed-dimensional feature vectors** (typically 512-dimensional)

- **Text Encoder**
    - Uses a **Transformer-based architecture** (similar to GPT-2)
    - Converts input text into **feature vectors of the same dimension** (512-dimensional)

### CLIP's Computation Process

1. **Input Image and Text**
    - The image is processed through the **ViT vision encoder**, extracting **image feature vectors**
    - The text is processed through the **Transformer text encoder**, extracting **text feature vectors**

2. **Similarity Computation**
    - Computes the **cosine similarity** between image and text features to match **the most relevant text description**
    - Through **contrastive learning**, CLIP maximizes similarity for correct matches and minimizes similarity for incorrect ones
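
The similarity step can be illustrated with a minimal NumPy sketch. The toy 3-dimensional vectors stand in for real encoder outputs, and the function names are hypothetical, not part of CLIP's API.

```python
import numpy as np

def cosine_similarity_matrix(image_feats, text_feats):
    """Return an (n_images, n_texts) matrix of cosine similarities."""
    image_feats = np.asarray(image_feats, dtype=np.float64)
    text_feats = np.asarray(text_feats, dtype=np.float64)
    # L2-normalize each row so the dot product equals the cosine similarity.
    image_norm = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    text_norm = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return image_norm @ text_norm.T

def best_text_match(image_feats, text_feats):
    """Index of the most similar text description for each image."""
    return cosine_similarity_matrix(image_feats, text_feats).argmax(axis=1)

# Toy embeddings (assumption: real CLIP embeddings would come from the two encoders).
toy_image_feats = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 2.0]])
toy_text_feats = np.array([[0.0, 1.0, 1.0], [3.0, 0.0, 0.0]])
print(best_text_match(toy_image_feats, toy_text_feats))  # [1 0]
```

Because both sets of vectors are normalized first, vector length does not affect the match; only direction in the shared embedding space does.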

### Limitations of CLIP

- **Only performs image-text matching**, without directly **outputting bounding boxes**, making it unsuitable for object detection
- Requires **additional region proposals**, increasing computational complexity

---

## 2. OwlViT (Vision Transformer for Open-World Localization)

OwlViT, developed by Google, is a **ViT-based model for open-vocabulary object detection**, capable of **directly predicting bounding boxes**, overcoming CLIP’s limitations.

### OwlViT's Internal Structure

OwlViT adapts a **CLIP-style image-text model** for detection and is trained with a **DETR-style bipartite matching loss**. It consists of three main components:

1. **ViT Vision Encoder**
    - Similar to CLIP's ViT encoder, it converts input images into **visual feature vectors (tokens)**
    - Unlike CLIP, OwlViT **does not pool the output tokens** into a single image vector; each token keeps its spatial position and serves as a candidate object region

2. **Text Encoder and Query Matching**
    - The text query is embedded by a **CLIP-style text encoder**, separately from the image (the two encoders are not fused by cross-attention)
    - Each query embedding is compared with the per-token image embeddings by **similarity**, as in CLIP, which **assigns labels to candidate regions**

3. **Detection Head**
    - Lightweight heads predict the **bounding box coordinates and confidence scores** for each candidate region
    - In our pipeline, the confidence scores are then used for thresholding and **Non-Maximum Suppression (NMS)** to refine the results

### OwlViT's Computation Process

1. **Input Image and Text**
    - The image is processed through ViT to extract features, while the text is encoded into feature vectors
2. **Text-Image Matching**
    - Computes the matching score between the text embeddings and each image token's embedding to score candidate object regions
3. **Bounding Box Prediction**
    - The detection head outputs **bounding box coordinates and confidence scores**
4. **Post-Processing**
    - **Confidence thresholding filters out low-confidence boxes**, **NMS suppresses overlapping boxes**, and remaining **high IoU boxes are merged** to improve accuracy
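
The confidence filtering and NMS steps can be sketched as follows. This is a dependency-light NumPy illustration rather than the project's `apply_nms`. The function names and the thresholds (0.3 and 0.5) are assumptions chosen for the example, and boxes are assumed to be `[x1, y1, x2, y2]` with positive area.

```python
import numpy as np

def box_iou_one_to_many(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def filter_and_nms(boxes, scores, score_threshold=0.3, iou_threshold=0.5):
    """Return indices (into the input arrays) of the boxes that survive."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    # Step 1: drop low-confidence boxes.
    candidates = np.flatnonzero(scores >= score_threshold)
    # Step 2: greedy NMS, visiting boxes from highest to lowest score.
    order = candidates[np.argsort(-scores[candidates])]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        ious = box_iou_one_to_many(boxes[best], boxes[rest])
        # Keep only boxes that do not overlap the selected box too much.
        order = rest[ious <= iou_threshold]
    return keep

example_boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [0, 0, 5, 5]]
example_scores = [0.9, 0.8, 0.7, 0.1]
print(filter_and_nms(example_boxes, example_scores))  # [0, 2]
```

In the example, the last box is dropped by the score threshold and the second box is suppressed because its IoU with the top-scoring box is about 0.68.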

### Advantages of OwlViT

- **Directly generates bounding boxes from text queries**, unlike CLIP, which needs separate region proposals
- **Supports open-vocabulary object detection**, recognizing **zero-shot categories**

---

## 3. SAM (Segment Anything Model)

SAM, developed by Meta, is a **high-precision object segmentation model** that generates object masks based on **prompts**.

### SAM's Internal Structure

SAM consists of **three main components**:

1. **ViT Vision Encoder**
    - Like CLIP and OwlViT, SAM also uses **ViT** for image feature extraction
    - However, SAM requires **high-resolution feature maps**, so it adopts a **high-capacity ViT variant (ViT-Huge)**

2. **Prompt Encoder**
    - SAM supports **various types of prompts**, including:
        - **Point**: The user clicks on an object, and SAM predicts the mask for that region
        - **Box**: The user provides a bounding box, and SAM generates a precise mask
    - In this project, we use **OwlViT-generated bounding boxes as SAM’s input**

3. **Mask Decoder**
    - Combines **ViT visual features + prompt information** to generate the **final segmentation mask**

### SAM's Computation Process

1. **Input Image and Bounding Box**
    - OwlViT first generates bounding boxes, which are then used as **prompts** for SAM
2. **Feature Extraction**
    - SAM extracts **high-resolution image features** using ViT
3. **Prompt Encoding**
    - SAM processes the bounding box information and adjusts the prediction scope
4. **Mask Generation**
    - The Mask Decoder outputs **high-quality object segmentation masks**
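
Passing detector boxes to a segmenter usually requires a coordinate conversion first. The sketch below assumes the detector emits normalized `(center_x, center_y, width, height)` boxes and that the segmenter expects pixel `[x1, y1, x2, y2]` box prompts; both formats should be checked against the libraries actually used. The function name is hypothetical.

```python
import numpy as np

def normalized_cxcywh_to_pixel_xyxy(boxes, image_width, image_height):
    """Convert normalized center-format boxes to pixel corner-format boxes."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    cx, cy, w, h = boxes.T
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    # Scale x coordinates by the width and y coordinates by the height.
    xyxy *= np.array([image_width, image_height, image_width, image_height])
    # Clip to the image so the prompt never extends past the borders.
    xyxy[:, [0, 2]] = np.clip(xyxy[:, [0, 2]], 0, image_width)
    xyxy[:, [1, 3]] = np.clip(xyxy[:, [1, 3]], 0, image_height)
    return xyxy

# A box centered in a 640 x 480 image, covering half of each dimension.
prompt_boxes = normalized_cxcywh_to_pixel_xyxy([[0.5, 0.5, 0.5, 0.5]], 640, 480)
print(prompt_boxes)  # [[160. 120. 480. 360.]]
```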


# Model Selection – CLIP + SAM vs. OwlViT + SAM

## Initial Attempt: CLIP + SAM

### Why Initially Choose CLIP?

- CLIP is a powerful **vision-language model** that understands the relationship between images and text, supporting **zero-shot learning**.
- It is well-suited for **image classification** and **open-category object recognition**, allowing it to handle **unseen categories**.

### Issues with CLIP + SAM

- **CLIP cannot directly provide bounding boxes**, only assessing the similarity between an entire image and a text query.
- **Requires an additional region proposal algorithm**:
    - Since CLIP cannot directly detect objects, **Selective Search, edge detection**, or other methods must first generate **candidate regions**.
    - Each candidate region is then matched with CLIP using **text similarity**, selecting the highest-matching region as the detection result.
- **Multi-step process increases computational complexity**:
    1. Generate multiple candidate regions.
    2. Use CLIP to compute the similarity score for each region with the text query.
    3. Select the highest-matching region as the detection result.
    4. Use SAM for object segmentation.
- **Accuracy limitations**:
    - The quality of candidate region proposals determines final detection performance, making it prone to missing objects or false detections.
    - Additional computation steps lead to slower inference speed.

---

## Final Choice: OwlViT + MobileSAM

### Why Switch to OwlViT?

- **OwlViT is an open-vocabulary object detection model** that can **directly generate bounding boxes from text queries**, eliminating the need for additional region proposals.
- **End-to-end object detection**: Given an image and a text query, the model directly outputs **bounding boxes**. This avoids CLIP's **multi-step processing** and improves detection speed and accuracy.

---

## Complete Workflow of OwlViT + MobileSAM

1. **Input image + text query**.
2. **OwlViT processes the image and text**, directly outputting **bounding boxes + confidence scores**.
3. **Filter out low-confidence bounding boxes**, then apply **Non-Maximum Suppression (NMS)** to remove overlapping boxes.
4. **Merge high IoU (Intersection over Union) detection boxes**, improving stability.
5. **MobileSAM receives the final bounding boxes** and generates high-quality segmentation masks.
6. **Final output**: Image with accurately segmented objects.

---

## Why Is OwlViT + SAM Superior?

| **Comparison** | **CLIP + SAM** | **OwlViT + SAM** |
| --- | --- | --- |
| **Zero-shot detection ability** | Image-level only | Object-level, direct detection |
| **Can output bounding boxes?** | Requires extra steps | Direct output |
| **Requires region proposals?** | Yes | No |
| **Merging high IoU boxes** | No | Yes, using NMS and box merging |
| **Detection accuracy** | Depends on region proposal quality | Optimized for precision |
| **Computation cost** | High due to extra steps | More efficient |

---


# Performance Metrics

Our goal is to evaluate the **object detection** and **segmentation performance** of **OwlViT + MobileSAM**. The evaluation focuses on three key aspects: **detection accuracy, segmentation quality, and computational efficiency**.



## 1. Object Detection Evaluation Metrics

The detection capability of **OwlViT** determines the final segmentation quality. Therefore, we evaluate the following metrics:

### **1.1 IoU (Intersection over Union)**
- Measures the **overlap ratio** between the predicted bounding box and the ground truth bounding box:
$$
\text{IoU} = \frac{\text{Area of intersection}}{\text{Area of union}}
$$
- **IoU > 0.5** is considered a correct detection (standard threshold).
- **IoU > 0.9** represents high-quality detection.
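
As a worked example of the IoU formula, the following pure-Python sketch computes the IoU of two `[x1, y1, x2, y2]` boxes and applies the 0.5 threshold. The function names are illustrative.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def is_correct_detection(pred_box, gt_box, threshold=0.5):
    """A prediction counts as correct when its IoU exceeds the threshold."""
    return box_iou(pred_box, gt_box) > threshold

# Two 10 x 10 boxes shifted by 2 pixels: intersection 80, union 120.
print(round(box_iou([0, 0, 10, 10], [2, 0, 12, 10]), 4))  # 0.6667
print(is_correct_detection([0, 0, 10, 10], [2, 0, 12, 10]))  # True
```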

### **1.2 mAP (Mean Average Precision) (to be added later)**
- Evaluates object detection performance across **different IoU thresholds**, calculating the average **AP** over multiple IoU values.
- AP is calculated as follows:
    - Compute the **area under the Precision-Recall curve**:
    $$
    \text{AP} = \int_{0}^{1} P(R) \, dR
    $$
    - The final **mAP (Mean AP)** is obtained by averaging AP across all categories.
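
The area under the Precision-Recall curve can be approximated numerically. The sketch below uses the common all-point interpolation (precision is made non-increasing before summing), which is one of several conventions and not necessarily the one the final evaluation will use. It assumes each detection has already been marked as a true or false positive at a fixed IoU threshold; the function name is hypothetical.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """AP for one category from scored detections and their TP/FP flags."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    tp = np.asarray(is_true_positive, dtype=np.float64)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    # Pad so the curve spans recall 0 to 1.
    recall = np.concatenate([[0.0], recall, [1.0]])
    precision = np.concatenate([[0.0], precision, [0.0]])
    # Interpolate: precision at recall r is the best precision at any recall >= r.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Sum rectangle areas wherever recall increases.
    return float(np.sum(np.diff(recall) * precision[1:]))

# Four detections, four ground-truth objects, one false positive at rank 2.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_ground_truth=4)
print(ap)  # 0.625

# mAP is the mean of the per-category AP values.
print(float(np.mean([ap, 1.0])))  # 0.8125
```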

---

## 2. Segmentation Evaluation Metrics

**MobileSAM** is responsible for object segmentation. The evaluation metrics include:

### **2.1 mIoU (Mean Intersection over Union)**
- Computes the average IoU between each predicted mask $P_i$ and its ground truth mask $G_i$ over $N$ objects:
$$
\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_i \cap G_i|}{|P_i \cup G_i|}
$$
- **mIoU > 0.8** represents high-quality segmentation.

### **2.2 Dice Coefficient**
- Measures the similarity between two regions:
$$
\text{Dice} = \frac{2 \times |A \cap B|}{|A| + |B|}
$$
- A Dice score closer to 1 indicates better segmentation performance.
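
Both segmentation metrics can be computed directly from binary masks. The sketch below is a minimal NumPy version; returning 1.0 when both masks are empty is an assumed convention, and the function names are illustrative.

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU between two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # Assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def dice_coefficient(pred_mask, gt_mask):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    total = pred.sum() + gt.sum()
    if total == 0:
        return 1.0  # Assumption: two empty masks count as a perfect match
    return float(2 * np.logical_and(pred, gt).sum() / total)

def mean_iou(mask_pairs):
    """mIoU over an iterable of (predicted_mask, ground_truth_mask) pairs."""
    return float(np.mean([mask_iou(pred, gt) for pred, gt in mask_pairs]))

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
print(mask_iou(pred, gt))  # 0.5
print(round(dice_coefficient(pred, gt), 4))  # 0.6667
print(mean_iou([(pred, gt), (gt, gt)]))  # 0.75
```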

---

## 3. Computational Efficiency

### **3.1 FPS (Frames Per Second)**
- Measures **the number of images processed per second**, assessing the system’s real-time performance:
$$
\text{FPS} = \frac{\text{Number of processed images}}{\text{Total time (seconds)}}
$$
- **Target values**:
    - FPS > 2 for batch processing
    - FPS > 10 for real-time applications

### **3.2 Inference Time**
- Measures the total inference time for **OwlViT + MobileSAM** on a **single image**:
$$
T_{\text{total}} = T_{\text{OwlViT}} + T_{\text{MobileSAM}}
$$
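
A simple way to collect both numbers is to time each stage separately and derive FPS from the summed stage times. The sketch below uses stand-in stages that only sleep; `fake_detect` and `fake_segment` are hypothetical placeholders for the real OwlViT and MobileSAM calls.

```python
import time

def timed_call(stage, *args):
    """Run one pipeline stage and return (result, elapsed_seconds)."""
    # For GPU models, synchronize the device (e.g., torch.cuda.synchronize())
    # before reading the clock; otherwise asynchronous kernels are not counted.
    start = time.perf_counter()
    result = stage(*args)
    return result, time.perf_counter() - start

def benchmark(detect, segment, images):
    """Return FPS and mean per-stage times for a detect-then-segment pipeline."""
    if not images:
        raise ValueError("need at least one image")
    detect_time = segment_time = 0.0
    for image in images:
        boxes, t_detect = timed_call(detect, image)
        _, t_segment = timed_call(segment, image, boxes)
        detect_time += t_detect
        segment_time += t_segment
    total = detect_time + segment_time
    n = len(images)
    return {
        "fps": n / total,
        "t_detect": detect_time / n,
        "t_segment": segment_time / n,
        "t_total": total / n,
    }

# Stand-in stages (assumption): replace with the real model calls.
def fake_detect(image):
    time.sleep(0.01)
    return [[0, 0, 10, 10]]

def fake_segment(image, boxes):
    time.sleep(0.02)
    return ["mask"] * len(boxes)

stats = benchmark(fake_detect, fake_segment, images=[None, None, None])
print(stats)
```

Timing the stages separately shows which model dominates the total inference time, which helps decide where optimization effort should go.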


# **Next Steps**

After the mid-term report, our primary goal is to **optimize the post-processing pipeline for detection and segmentation to improve detection accuracy, segmentation quality, and inference efficiency**. The next optimization steps mainly include **improving confidence filtering, NMS, and high IoU box merging strategies**.

---

## **1. Object Detection Optimization**

### **1.1 NMS Strategy Optimization**

Currently, `apply_nms` uses **Hard-NMS** (based on `torchvision.ops.nms()`), which directly removes high-IoU boxes. This can cause:

- **False negatives**, where adjacent objects are incorrectly removed
- **A drastic reduction in the number of detected boxes**, negatively affecting recall

#### **Optimization Plan**

- **Introduce Soft-NMS**
    - Instead of directly removing boxes, apply **exponential decay** to confidence scores based on IoU:
  
      $$
      \text{scores} = \text{scores} \times e^{- (\text{IoU}^2) / \sigma}
      $$

    - **Low IoU boxes retain more confidence**, while **high IoU boxes decay more significantly**, preserving more detection information.

- **Applicable scenarios**:
    - **Crowded object detection** (e.g., pedestrian or vehicle detection)
    - Cases where **reducing false negatives and improving recall** is the priority
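
The Gaussian decay rule can be sketched in NumPy as follows. This is an illustrative implementation, not the project's planned code: the function names, `sigma=0.5`, and the final `score_threshold` used to discard boxes whose score has decayed to almost nothing are assumptions.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS. Returns (kept_indices, decayed_scores) in selection order."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64).copy()
    remaining = np.arange(len(scores))
    keep, kept_scores = [], []
    while remaining.size > 0:
        # Select the highest-scoring box among those still in play.
        top = remaining[np.argmax(scores[remaining])]
        keep.append(int(top))
        kept_scores.append(float(scores[top]))
        remaining = remaining[remaining != top]
        if remaining.size == 0:
            break
        # Decay the scores of the others: scores *= exp(-IoU^2 / sigma).
        ious = iou_one_to_many(boxes[top], boxes[remaining])
        scores[remaining] *= np.exp(-(ious ** 2) / sigma)
        # Drop boxes whose score has decayed below the threshold.
        remaining = remaining[scores[remaining] >= score_threshold]
    return keep, kept_scores

demo_boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
demo_scores = [0.9, 0.8, 0.7]
kept, new_scores = soft_nms(demo_boxes, demo_scores)
print(kept)  # [0, 2, 1]
print([round(s, 3) for s in new_scores])
```

In the example, the second box overlaps the first heavily, so it is kept but re-ranked below the non-overlapping third box; Hard-NMS with an IoU threshold of 0.5 would have removed it.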

---


## **2. Segmentation Optimization**

### **2.1 Multi-Mask Optimization in MobileSAM**

#### **Current Issues**
- Using `multimask_output=False` may result in missing small objects.

#### **Optimization Plan**
- Enable `multimask_output=True` to generate multiple masks and select the best one.
- **Objective**: Ensure complete segmentation of targets in complex scenes.
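
Selecting among multiple candidate masks can be sketched as below. It assumes the predictor returns a stack of candidate masks together with one predicted quality score per mask, as the original SAM predictor interface does; choosing the highest score is one possible criterion, and `select_best_mask` is a hypothetical helper.

```python
import numpy as np

def select_best_mask(masks, scores):
    """Pick the candidate mask with the highest predicted quality score."""
    masks = np.asarray(masks)
    scores = np.asarray(scores, dtype=np.float64)
    best = int(np.argmax(scores))
    return masks[best], float(scores[best])

# Three toy 2 x 2 candidate masks standing in for real predictor output.
candidate_masks = np.zeros((3, 2, 2), dtype=bool)
candidate_masks[1, 0, 0] = True
candidate_scores = [0.2, 0.9, 0.5]

best_mask, best_score = select_best_mask(candidate_masks, candidate_scores)
print(best_score)  # 0.9
```

Other selection rules, such as preferring the mask whose extent best fills the prompt box, could be swapped in without changing the surrounding pipeline.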

---

### **2.2 Mask Post-Processing**

- **Add Mask Area Filtering**
    - **Remove very small masks** (e.g., those with <500 pixels) to avoid detecting noise.
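
Mask area filtering amounts to counting foreground pixels. A minimal sketch, using the 500-pixel example threshold from above and a hypothetical function name:

```python
import numpy as np

def filter_small_masks(masks, min_pixels=500):
    """Keep only the masks whose foreground covers at least min_pixels pixels."""
    return [mask for mask in masks if int(np.count_nonzero(mask)) >= min_pixels]

large_mask = np.ones((30, 30), dtype=bool)   # 900 foreground pixels
small_mask = np.zeros((30, 30), dtype=bool)
small_mask[:10, :10] = True                  # 100 foreground pixels

kept_masks = filter_small_masks([large_mask, small_mask])
print(len(kept_masks))  # 1
```

A fixed pixel threshold depends on image resolution, so a threshold expressed as a fraction of the image area may be worth considering.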

---

## **3. Evaluation Expansion**

### **3.1 Testing on Different Datasets**

- **Zero-Shot Evaluation**:
    - The model has primarily been evaluated on COCO; testing on **unseen categories** is necessary.
- **Cross-Dataset Testing**:
    - Plan to evaluate on **LVIS and Objects365** to ensure generalization.

---

### **3.2 More Detailed Performance Analysis**

- **IoU Histogram Analysis**
    - Analyze the distribution of IoU between predicted and ground truth boxes to refine detection accuracy.
- **Segmentation Quality vs. Object Size Analysis**
    - Observe `mIoU` performance across different object scales to enhance segmentation stability.
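
The size analysis can be sketched by grouping per-object IoU values into the COCO size ranges (small: area below 32², medium: 32² to 96², large: above 96²). The function name and the toy values are illustrative only.

```python
import numpy as np

# COCO object-size ranges, in square pixels.
SIZE_BUCKETS = (
    ("small", 0, 32 ** 2),
    ("medium", 32 ** 2, 96 ** 2),
    ("large", 96 ** 2, float("inf")),
)

def miou_by_size(ious, areas):
    """Mean IoU per size bucket; None for buckets with no objects."""
    ious = np.asarray(ious, dtype=np.float64)
    areas = np.asarray(areas, dtype=np.float64)
    result = {}
    for name, low, high in SIZE_BUCKETS:
        in_bucket = (areas >= low) & (areas < high)
        result[name] = float(ious[in_bucket].mean()) if in_bucket.any() else None
    return result

# Toy per-object IoU values and ground-truth areas (not real results).
size_report = miou_by_size(ious=[0.5, 0.7, 0.9, 0.8], areas=[100, 500, 5000, 20000])
print(size_report)
```

A large gap between the small and large buckets would suggest that small objects are the main source of segmentation error.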



