# Leveraging Large Language Models for Automated Image Annotation

# Introduction

## Background

The rapid advancement of supervised learning in computer vision has led to an increasing demand for high-quality labeled datasets. However, obtaining such datasets remains a significant challenge due to the high costs and inefficiencies of manual annotation. 

The Bottleneck of Data Annotation
- High cost
- Low efficiency
- Inconsistencies and errors

Existing automated annotation systems predominantly rely on supervised models, such as the YOLO series, to generate bounding boxes and labels for images. However, these models face significant limitations:

- Require fine-tuning or training a new model for each new subproblem
- Require a fixed set of standard input words (class labels)

Many well-trained large language models (LLMs) are now available for commercial use, often through accessible APIs, so users do not have to bear the substantial cost of pretraining such models.

Leveraging Multimodal Large Language Models (LLMs) for Universal Annotation
- Flexible adaptation to new categories
- Semantic understanding and context awareness
- Low-cost deployment and accessibility

## Overview of the solution

- Multimodal Representation Learning with OwlViT
Uses both text and image embeddings to match visual objects with text descriptions.
- Object Detection and Segmentation with SAM
Performs text-prompted segmentation.
- Pipeline Integration: OwlViT + SAM
Step 1: Use OwlViT to associate textual descriptions with image content.
Step 2: Use SAM to segment specific objects or regions of interest based on OwlViT's semantic embeddings or user-provided prompts.
Step 3: Generate annotations, including bounding boxes, masks, and descriptive labels, in a format compatible with training supervised models.
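
Step 3 can be sketched in plain Python. The sketch assumes each detection arrives as a pixel-space box `[x1, y1, x2, y2]` with an optional mask area and confidence score. The function name `detection_to_coco` and its parameters are hypothetical, and the segmentation field is left out because mask encoding depends on tooling not covered here.

```python
def detection_to_coco(ann_id, image_id, category_id, box_xyxy, mask_area=None, score=None):
    """Build one COCO-style annotation dict from a single detection (illustrative helper)."""
    x1, y1, x2, y2 = (float(v) for v in box_xyxy)
    width, height = x2 - x1, y2 - y1
    annotation = {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x1, y1, width, height],  # COCO stores [x_min, y_min, width, height]
        # Prefer the mask area when a mask is available, else fall back to the box area
        "area": float(mask_area) if mask_area is not None else width * height,
        "iscrowd": 0,
    }
    if score is not None:
        annotation["score"] = float(score)  # optional model confidence
    return annotation


# Hypothetical detection values, for illustration only
example = detection_to_coco(
    ann_id=1, image_id=42, category_id=3,
    box_xyxy=[10, 20, 50, 40], mask_area=650, score=0.87,
)
print(example)
```

Keeping the output in COCO layout means the generated labels can be read back with the same `pycocotools` calls used later in this notebook.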

## Business Value
- Cost and Time Efficiency
Reduces the expenses and effort required for data labeling, making AI development more accessible.
- Democratization of AI
Enables small enterprises and research institutions to access high-quality labeled data.
- Fostering Innovation
Accelerates AI adoption and encourages the development of new technologies.
- Enhanced Model Performance
Improves generalizability by mitigating biases through diverse datasets.


# Data and Data Preprocessing

We plan to use the Conceptual Captions dataset and the Microsoft COCO (Common Objects in Context) dataset [Lin et al., 2014] to support our analysis and modeling tasks. OwlViT and SAM are pre-trained; we use this data to fine-tune our pipeline.

- Large number of images along with their descriptive texts.

- The image and text data will be processed separately through an Image Encoder and a Text Encoder, mapping them into a shared high-dimensional embedding space.
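
To make the shared embedding space concrete, here is a minimal sketch of how an image embedding could be matched to the closest text embedding by cosine similarity. The vectors are made-up three-dimensional toys rather than real encoder outputs, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_matching_text(image_embedding, text_embeddings):
    """Return (text, similarity) for the text embedding closest to the image embedding."""
    scores = {
        text: cosine_similarity(image_embedding, emb)
        for text, emb in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy embeddings, for illustration only
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}
label, score = best_matching_text(image_embedding, text_embeddings)
print(label, round(score, 3))
```

In a real model the embeddings have hundreds of dimensions, but the matching step follows the same idea: the text whose vector points in the most similar direction wins.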

## Basic Statistics

In [None]:
from pycocotools.coco import COCO
import json
import numpy as np
import collections
import matplotlib.pyplot as plt
import torch
from torchvision.ops import box_iou
import torchvision.transforms as transforms
from PIL import Image
import albumentations as A
import cv2

In [None]:
# import coco dataset
coco_annotation_path = "datasets/coco/annotations/instances_train2017.json"
coco = COCO(coco_annotation_path)

def convert_bbox_format(bbox):
    """Convert COCO bbox [x_min, y_min, width, height] format to [x_center, y_center, width, height] format."""
    x_min, y_min, width, height = bbox
    x_center = x_min + width / 2
    y_center = y_min + height / 2
    return [x_center, y_center, width, height]

In [None]:
# Number and names of Categories
categories = coco.loadCats(coco.getCatIds())
num_categories = len(categories)
category_names = [cat['name'] for cat in categories]
print(f"Number of Categories: {num_categories}")
print(f"Names of Categories: {category_names}")

In [None]:
# Number of images, number of annotations, and average number of annotations per image
num_images = len(coco.getImgIds())
print(f"Number of Images: {num_images}")

num_annotations = len(coco.getAnnIds())
print(f"Number of Annotations: {num_annotations}")

image_ids = coco.getImgIds()
num_objects_per_image = [len(coco.getAnnIds(imgIds=[img_id])) for img_id in image_ids]
mean_objects_per_image = np.mean(num_objects_per_image)
print(f"Average Number of Annotations per Image: {mean_objects_per_image:.2f}")

In [None]:
# Number of objects per category
category_counts = collections.Counter()
for ann in coco.loadAnns(coco.getAnnIds()):
    category_counts[ann["category_id"]] += 1

for cat_id, count in category_counts.items():
    cat_name = coco.loadCats([cat_id])[0]["name"]
    print(f"{cat_name}: {count} objects")

In [None]:
# The distribution of object areas
area_list = [ann['area'] for ann in coco.loadAnns(coco.getAnnIds())]
plt.hist(area_list, bins=50, log=True)
plt.xlabel('Object Area')
plt.ylabel('Number')
plt.title('Histogram of Object Area')
plt.show()
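
One way to read the area histogram is through the size buckets used by the COCO evaluation protocol: small objects have area below 32² pixels, medium objects fall between 32² and 96², and large objects exceed 96². The sketch below applies those thresholds to a few made-up areas; the helper name `coco_size_bucket` is hypothetical.

```python
from collections import Counter


def coco_size_bucket(area):
    """Classify an object by pixel area using the COCO evaluation size thresholds."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# Made-up areas, for illustration only
sample_areas = [50.0, 1500.0, 9000.0, 20000.0]
bucket_counts = Counter(coco_size_bucket(a) for a in sample_areas)
print(bucket_counts)
```

On the real dataset, `area_list` from the cell above could be passed in place of `sample_areas`.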

## Outlier/Abnormal Sample Detection
- Identifies and removes images that do not have any annotations.
- Removes duplicate annotations using IoU (Intersection over Union).
- Removes objects with area smaller than `min_area`.
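
The duplicate check relies on IoU, the ratio of the overlap area of two boxes to the area of their union. A minimal pure-Python sketch for two boxes in `[x1, y1, x2, y2]` format is shown below; the pipeline itself uses `torchvision.ops.box_iou`, and the helper name `iou_xyxy` is only illustrative.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


identical = iou_xyxy([0, 0, 10, 10], [0, 0, 10, 10])
half_shifted = iou_xyxy([0, 0, 10, 10], [5, 0, 15, 10])
disjoint = iou_xyxy([0, 0, 10, 10], [20, 20, 30, 30])
print(identical, half_shifted, disjoint)
```

With a threshold of 0.9, only near-identical boxes of the same category would be treated as duplicates; a box shifted by half its width scores about 0.33 and is kept.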

## Preprocessing
- Converts images to RGB format.
- Normalizes the pixel values.
- Converts bounding boxes from COCO format [x, y, w, h] to [x1, y1, x2, y2].
- Applies data augmentation (if enabled).
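
As a small worked example of two of these steps, the sketch below normalizes a single 8-bit RGB pixel with the per-channel mean and standard deviation used in the pipeline (the standard ImageNet statistics) and converts one box between formats. The helper names are hypothetical, and real code would operate on whole tensors rather than single pixels.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def xywh_to_xyxy(bbox):
    """Convert COCO [x, y, w, h] to corner format [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


white = normalize_pixel((255, 255, 255))
corners = xywh_to_xyxy([10, 20, 30, 40])
print(white, corners)
```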

## Data augmentation
- Horizontal flipping
- 90-degree rotation
- Color jittering
- Cutout augmentation
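
Geometric augmentations such as flipping move objects within the image, so the bounding boxes have to be transformed along with the pixels. A minimal sketch for the horizontal-flip case with COCO `[x, y, w, h]` boxes follows; the helper name `hflip_bbox_xywh` is hypothetical.

```python
def hflip_bbox_xywh(bbox, image_width):
    """Mirror a COCO [x, y, w, h] box around the vertical centre line of the image."""
    x, y, w, h = bbox
    # The old right edge (x + w) becomes the new left edge, measured from the other side
    return [image_width - x - w, y, w, h]


flipped = hflip_bbox_xywh([10, 20, 30, 40], image_width=100)
restored = hflip_bbox_xywh(flipped, image_width=100)
print(flipped, restored)
```

Flipping twice returns the original box, which is a convenient sanity check. Albumentations can also do this bookkeeping itself when `A.Compose` is given a `bbox_params=A.BboxParams(format='coco', ...)` argument.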

In [None]:
class COCOPipeline:
    def __init__(self, annotation_path, iou_threshold=0.9, min_area=100, augment=True):
        """
        COCO Data Cleaning and Augmentation Pipeline.
        :param annotation_path: Path to COCO JSON annotation file
        :param iou_threshold: IoU threshold to remove duplicate annotations
        :param min_area: Minimum area to keep an object
        :param augment: Whether to apply data augmentation
        """
        self.coco = COCO(annotation_path)
        self.iou_threshold = iou_threshold
        self.min_area = min_area
        self.augment = augment
        self.filtered_images = []
        self.filtered_annotations = []

    def remove_no_annotation_images(self):
        """Removes images without annotations"""
        annotated_imgs = {ann["image_id"] for ann in self.coco.loadAnns(self.coco.getAnnIds())}
        all_imgs = set(self.coco.getImgIds())
        empty_imgs = all_imgs - annotated_imgs

        print(f"Number of images without annotations: {len(empty_imgs)}")
        self.filtered_images = [img for img in self.coco.loadImgs(list(annotated_imgs))]

    def remove_duplicate_annotations(self):
        """Removes duplicate annotations (IoU > iou_threshold)"""
        image_to_annotations = {}
        for ann in self.coco.loadAnns(self.coco.getAnnIds()):
            img_id = ann["image_id"]
            if img_id not in image_to_annotations:
                image_to_annotations[img_id] = []
            image_to_annotations[img_id].append(ann)

        filtered_annotations = []
        for img_id, anns in image_to_annotations.items():
            if len(anns) < 2:
                filtered_annotations.extend(anns)
                continue  

            # Get bounding boxes and categories
            bboxes = torch.tensor([self.convert_bbox_format(ann["bbox"]) for ann in anns], dtype=torch.float32)
            categories = [ann["category_id"] for ann in anns]

            # Compute IoU matrix
            iou_matrix = box_iou(bboxes, bboxes)

            # Store indices of annotations to keep
            keep = set(range(len(anns)))
            for i in range(len(anns)):
                for j in range(i + 1, len(anns)):
                    if iou_matrix[i, j] > self.iou_threshold and categories[i] == categories[j]:
                        if j in keep:
                            keep.remove(j)

            filtered_annotations.extend([anns[i] for i in sorted(keep)])  # sorted() preserves the original annotation order

        self.filtered_annotations = filtered_annotations
        print(f"Number of non-duplicate annotations: {len(self.filtered_annotations)}")

    def filter_small_objects(self):
        """Removes objects with area smaller than min_area"""
        self.filtered_annotations = [ann for ann in self.filtered_annotations if ann['area'] >= self.min_area]
        print(f"Number of annotations after filtering small objects: {len(self.filtered_annotations)}")

    def preprocess_image(self, image_path):
        """Preprocesses an image by resizing, normalizing, and applying data augmentation"""
        image = Image.open(image_path).convert('RGB')
        image = np.array(image)

        if self.augment:
            image = self.apply_augmentation(image)

        transform = transforms.Compose([
            transforms.Resize((256, 256)),  # Resize to a fixed size
            transforms.ToTensor(),  # Convert to PyTorch Tensor
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize
        ])
        return transform(Image.fromarray(image))

    def apply_augmentation(self, image):
        """Applies image-only data augmentation using Albumentations (bounding boxes are not transformed here)"""
        augment = A.Compose([
            A.HorizontalFlip(p=0.5),  # 50% probability of horizontal flip
            A.RandomRotate90(p=0.5),  # 90-degree rotation
            A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.5),  # Color jitter
            A.Cutout(num_holes=3, max_h_size=20, max_w_size=20, p=0.5),  # Cutout augmentation (deprecated in recent Albumentations releases; CoarseDropout is the suggested replacement)
        ])
        return augment(image=image)['image']

    @staticmethod
    def convert_bbox_format(bbox):
        """Converts COCO format [x, y, w, h] to [x1, y1, x2, y2]"""
        x, y, w, h = bbox
        return [x, y, x + w, y + h]

    def run_pipeline(self):
        """Executes the full pipeline"""
        print("Starting data cleaning and augmentation...")
        self.remove_no_annotation_images()
        self.remove_duplicate_annotations()
        self.filter_small_objects()
        print("Data cleaning and augmentation completed!")

# Run the COCO preprocessing pipeline
annotation_path = "datasets/coco/annotations/instances_train2017.json"
coco_pipeline = COCOPipeline(annotation_path, augment=True)
coco_pipeline.run_pipeline()
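
After cleaning, the surviving images and annotations usually need to be written back out in COCO layout. The sketch below is one possible way to do that, assuming plain lists of COCO-style dicts. The function name `build_cleaned_coco` and the toy records are hypothetical. It also drops images whose annotations were all filtered out, which the pipeline above does not do on its own.

```python
import json


def build_cleaned_coco(images, annotations, categories):
    """Assemble a COCO-style dict, keeping only images that still have annotations."""
    kept_image_ids = {ann["image_id"] for ann in annotations}
    return {
        "images": [img for img in images if img["id"] in kept_image_ids],
        "annotations": list(annotations),
        "categories": list(categories),
    }


# Toy records, for illustration only
images = [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}]
annotations = [
    {"id": 10, "image_id": 1, "category_id": 3, "bbox": [0, 0, 20, 20], "area": 400}
]
categories = [{"id": 3, "name": "car"}]

cleaned = build_cleaned_coco(images, annotations, categories)
serialized = json.dumps(cleaned)  # could be written to a file with json.dump instead
print(len(cleaned["images"]), len(cleaned["annotations"]))
```

With the pipeline above, the inputs would be `coco_pipeline.filtered_images`, `coco_pipeline.filtered_annotations`, and `coco_pipeline.coco.loadCats(coco_pipeline.coco.getCatIds())`.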