# 🧠 Vision AI Labeling Pipeline with LLM + Grounding DINO + SAM


This notebook creates a semi-automated labeling pipeline:
- Use GPT-4V or Claude to identify visible objects.
- Use **Grounding DINO** for bounding box detection.
- Use **SAM** (Segment Anything Model) for precise segmentation (optional).
- Export YOLOv5/YOLOv8-compatible annotations.

---

## 🔧 Step 1: Install Required Dependencies


In [None]:
# # Clone repo
# !git clone https://github.com/IDEA-Research/GroundingDINO.git
# %cd GroundingDINO

# # Install core dependencies
# !pip install -r requirements.txt supervision

# # Install transformers separately
# !pip install transformers==4.30.2

# # Avoid wheel errors: copy manually into Colab Python path
# !cp -r groundingdino /usr/local/lib/python3.11/dist-packages/


In [None]:
# ⚠️ Run only if using Colab or setting up fresh environment
!pip install opencv-python pillow matplotlib transformers
!pip install git+https://github.com/IDEA-Research/GroundingDINO.git
!pip install git+https://github.com/facebookresearch/segment-anything.git
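
A quick way to confirm the installs above worked is to check that each package can be found before running later cells. This is a minimal sketch: the helper name `check_modules` is hypothetical, and the listed names are the import names these packages are normally exposed under (for example `cv2` for opencv-python and `PIL` for pillow).

```python
import importlib.util


def check_modules(names):
    """Return {module_name: bool} telling whether each top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names assumed for the packages installed above
required = ["cv2", "PIL", "matplotlib", "transformers", "groundingdino", "segment_anything"]
for name, found in check_modules(required).items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Anything reported as `MISSING` needs its install command re-run before continuing.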



In [None]:
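# Each `!` line runs in its own subshell, so a bare `!cd` does not carry over to the next line.
# Chain the two commands so pip runs inside the GroundingDINO folder.
!cd "/Users/prudhvivuda/Documents/vaultlyai code/ai-labelling/GroundingDINO" && pip install -e .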

## 📥 Step 2: Upload or Load Image(s)

In [None]:
import torch
from PIL import Image
import numpy as np
from groundingdino.util.inference import predict

# Load the image as RGB and keep a NumPy copy for later steps
image_path = "/Users/prudhvivuda/Documents/vaultlyai code/ai-labelling/blender motor + air fryer + microwave.jpg"  # change this
image_pil = Image.open(image_path).convert("RGB")
image_np = np.array(image_pil)

# Convert HWC uint8 to a float32 tensor of shape [1, C, H, W], scaled to [0, 1]
image_tensor = torch.from_numpy(image_np).permute(2, 0, 1).unsqueeze(0).float() / 255.0

# Grounding DINO's own load_image helper also applies ImageNet mean/std normalization,
# so do the same here to match the preprocessing the model expects
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
image_tensor = (image_tensor - mean) / std

# Choose device (use 'cuda' if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
image_tensor = image_tensor.to(device)
# model = model.to(device)



## 🧠 Step 3: Use GPT-4V or Claude to Suggest Object Classes

In [None]:

# Copy the suggested object names from GPT-4V or Claude
detected_objects = ["air fryer", "microwave", "blender", "box"]  # Example
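
Instead of copying names by hand, the model's reply can be cleaned into a list programmatically. The sketch below assumes the reply is a plain comma- or newline-separated list, possibly with bullet markers; the helper name `parse_object_list` and the sample reply are made up for illustration.

```python
def parse_object_list(reply):
    """Split a comma- or newline-separated reply into clean, unique, lowercase labels."""
    labels = []
    for chunk in reply.replace("\n", ",").split(","):
        # Drop surrounding whitespace and common bullet characters
        label = chunk.strip().strip("-*•. ").lower()
        if label and label not in labels:
            labels.append(label)
    return labels


reply = "Air fryer, microwave\n- Blender\n- box, blender"  # made-up example reply
detected_objects = parse_object_list(reply)
print(detected_objects)  # ['air fryer', 'microwave', 'blender', 'box']
```

Lowercasing and de-duplicating keeps the class list stable, which matters later because YOLO class IDs are derived from the position of each name.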


## 🎯 Step 4: Run Grounding DINO to Get Bounding Boxes

In [None]:
!wget -P weights/ \
  https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth


In [None]:
!git clone https://github.com/IDEA-Research/GroundingDINO.git


In [None]:

# Grounding DINO inference; adjust the config and checkpoint paths to your setup
import torch
from groundingdino.util.inference import load_model, predict

model = load_model(
    "GroundingDINO/groundingdino/config/GroundingDINO_SwinB_cfg.py",
    "weights/groundingdino_swinb_cogcoor.pth",
)

# Ensure model and image tensor are on CPU
device = "cpu"
model = model.to(device)
image_tensor = image_tensor.to(device)

detections = {}  # object name -> (boxes, logits, phrases)
for obj in detected_objects:
    boxes, logits, phrases = predict(
        model=model,
        image=image_tensor[0],  # single image tensor [C, H, W]
        caption=obj,
        box_threshold=0.3,
        text_threshold=0.25,
        device="cpu"  # explicitly use CPU
    )
    # boxes are normalized (cx, cy, w, h); store them per object for later steps
    detections[obj] = (boxes, logits, phrases)
    print(f"{obj} ➜ {len(boxes)} detections")
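
Grounding DINO's `predict` returns boxes as normalized `(cx, cy, w, h)`, while drawing code and SAM work with pixel corner coordinates. A minimal sketch of that conversion follows; the function name `cxcywh_norm_to_xyxy_pixels` is hypothetical and the example numbers are illustrative only.

```python
def cxcywh_norm_to_xyxy_pixels(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = (float(v) for v in box)
    x1 = (cx - w / 2) * image_width
    y1 = (cy - h / 2) * image_height
    x2 = (cx + w / 2) * image_width
    y2 = (cy + h / 2) * image_height
    return x1, y1, x2, y2


# A box centered in a 640x480 image, a quarter as wide and half as tall
print(cxcywh_norm_to_xyxy_pixels((0.5, 0.5, 0.25, 0.5), 640, 480))
# (240.0, 120.0, 400.0, 360.0)
```

The function accepts any four-element iterable, so a single row of the `boxes` tensor can be passed directly.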


## 🖼 Step 5: (Optional) Refine Boxes with SAM

In [None]:

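# SAM box-prompt sketch; download the ViT-H checkpoint and set the path below
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h.pth")
sam_predictor = SamPredictor(sam)
sam_predictor.set_image(image_np)  # expects an HxWx3 uint8 RGB array

img_h, img_w = image_np.shape[:2]
for box in boxes:
    # Grounding DINO gives normalized (cx, cy, w, h); SAM wants pixel (x1, y1, x2, y2)
    cx, cy, bw, bh = box.tolist()
    box_xyxy = np.array([
        (cx - bw / 2) * img_w, (cy - bh / 2) * img_h,
        (cx + bw / 2) * img_w, (cy + bh / 2) * img_h,
    ])
    masks, scores, _ = sam_predictor.predict(box=box_xyxy, multimask_output=False)
    # Use masks[0] directly, or convert it to a tighter bbox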
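
Converting a mask back to a bounding box amounts to finding the extent of its foreground pixels. The sketch below uses plain Python loops for clarity so it works on nested lists as well as NumPy boolean arrays; the helper name `mask_to_bbox` is hypothetical, and the returned `x2`/`y2` are treated as exclusive edges so that `x2 - x1` is the width.

```python
def mask_to_bbox(mask):
    """Return pixel (x1, y1, x2, y2) enclosing all truthy cells, or None for an empty mask."""
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    # +1 makes the right and bottom edges exclusive
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_bbox(mask))  # (1, 1, 4, 3)
```

For full-resolution masks a vectorized NumPy version would be much faster; the loop form is only meant to show the idea.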


## 📝 Step 6: Save as YOLO Labels

In [None]:

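# One line per object: class_id x_center y_center width height (all normalized to [0, 1])
# `box` must be pixel (x1, y1, x2, y2), not Grounding DINO's normalized (cx, cy, w, h)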
def save_yolo_label(file_name, class_id, box, image_width, image_height):
    x1, y1, x2, y2 = box
    x_center = ((x1 + x2) / 2) / image_width
    y_center = ((y1 + y2) / 2) / image_height
    width = (x2 - x1) / image_width
    height = (y2 - y1) / image_height

    label_str = f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}\n"
    with open(file_name, "a") as f:
        f.write(label_str)
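
Since Grounding DINO already returns normalized `(cx, cy, w, h)` boxes, which is the same layout a YOLO label line uses, labels can also be formatted without going through pixel coordinates. The following is a sketch under that assumption; `yolo_lines`, the `(label, box)` tuple layout, and the sample data are hypothetical.

```python
def yolo_lines(detections, class_names):
    """Format (label, (cx, cy, w, h)) pairs with normalized coords as YOLO label lines."""
    class_to_id = {name: i for i, name in enumerate(class_names)}
    lines = []
    for label, (cx, cy, w, h) in detections:
        if label not in class_to_id:
            continue  # skip phrases that are not in the class list
        lines.append(f"{class_to_id[label]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines


class_names = ["air fryer", "microwave", "blender", "box"]
example = [("blender", (0.5, 0.5, 0.25, 0.5)), ("toaster", (0.1, 0.1, 0.1, 0.1))]
print("\n".join(yolo_lines(example, class_names)))
# 2 0.500000 0.500000 0.250000 0.500000
```

Class IDs come from the position of each name in `class_names`, so that list should stay in the same order for every image in the dataset.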


## ✅ Step 7: Use GPT-4V or Claude to Review Labels

In [None]:

# Show cropped image + label and ask GPT-4V:
# “This was labeled as ‘air fryer’. Is that correct?”
# Optionally automate via API (if model supports vision)
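
The review step can be prepared in two small pieces: a padded crop region around each box, and the question text to send with it. This is only a sketch. Both helper names are hypothetical, the padding factor is an arbitrary choice, and the actual call to a vision model is left out because it depends on the provider's API.

```python
def padded_crop_box(box, image_width, image_height, pad=0.1):
    """Grow a pixel (x1, y1, x2, y2) box by `pad` of its size per side, clipped to the image."""
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * pad
    dy = (y2 - y1) * pad
    return (
        max(0, int(x1 - dx)),
        max(0, int(y1 - dy)),
        min(image_width, int(x2 + dx)),
        min(image_height, int(y2 + dy)),
    )


def build_review_prompt(label):
    """Build the yes/no review question for one labeled crop."""
    return (
        f"This was labeled as '{label}'. Is that correct? "
        "Answer 'yes' or 'no', then give a one-line reason."
    )


crop_box = padded_crop_box((240, 120, 400, 360), 640, 480, pad=0.25)
print(crop_box)
print(build_review_prompt("air fryer"))
# With Pillow: crop = image_pil.crop(crop_box), then send `crop` plus the prompt to the model
```

A little context around the object usually helps a reviewer, human or model, which is why the crop is padded rather than cut exactly at the box edges.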
