# Object Detection - YOLO & OWL-ViT
This tutorial demonstrates how to use YOLO (You Only Look Once) from the [Ultralytics](https://github.com/ultralytics/yolov5) library for object detection. It includes steps for:

- Running object detection inference on images/videos
- Fine-tuning YOLO for custom datasets
- Comparing YOLO with OWL-ViT for zero-shot learning


## 1. Perform Object Detection Inference
First, we'll use the pre-trained YOLOv8 model from Ultralytics to detect objects in a sample image. This involves loading the model, providing an image as input, and interpreting the model's predictions.

**Key Concepts:**
- **Inference**: The process of using a trained model to make predictions on new data.
- **YOLOv8**: A version of the YOLO (You Only Look Once) architecture from Ultralytics, known for its speed and accuracy in object detection tasks.

**Steps:**
1. Load the YOLOv8 model using the Ultralytics library.
2. Perform inference on a sample image to detect objects.
3. Visualize the results, including bounding boxes and class labels.

**Support Material:**
- https://docs.ultralytics.com/models/yolov8/
- https://docs.ultralytics.com/tasks/detect/

In [None]:
# Import YOLO and load a pre-trained model
from ultralytics import YOLO

# Load the YOLOv8 pre-trained model
model = YOLO('yolov8n.pt')  # nano model for quick inference

# Run inference on a sample image; save=True writes the annotated image
# to the run's output directory (e.g. runs/detect/predict)
results = model('images/street_scene.jpg', save=True)

for result in results:
    print(result.boxes)  # Boxes object for bounding box outputs
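The `Boxes` object printed above holds box coordinates, confidences, and class indices as tensors. As a minimal sketch of the interpretation step, the helper below turns such values into readable records. It is plain Python: `describe_detections` and the sample numbers are hypothetical, and it assumes the tensors have already been converted to lists, for example with `result.boxes.xyxy.tolist()`, `result.boxes.conf.tolist()`, `result.boxes.cls.tolist()`, together with the `result.names` mapping.

```python
def describe_detections(xyxy, conf, cls, names):
    """Pair each box with its class name and confidence, highest confidence first."""
    records = []
    for box, score, class_id in zip(xyxy, conf, cls):
        records.append({
            "label": names[int(class_id)],
            "confidence": round(float(score), 3),
            "box": [round(float(v), 1) for v in box],
        })
    return sorted(records, key=lambda r: r["confidence"], reverse=True)

# Made-up values standing in for result.boxes.xyxy / .conf / .cls
sample_xyxy = [[34.2, 50.0, 210.7, 400.1], [300.0, 120.5, 380.0, 260.0]]
sample_conf = [0.62, 0.91]
sample_cls = [2.0, 0.0]
sample_names = {0: "person", 2: "car"}

detections = describe_detections(sample_xyxy, sample_conf, sample_cls, sample_names)
for det in detections:
    print(f"{det['label']}: {det['confidence']} at {det['box']}")
```

If you would rather look at the picture than the numbers, `result.plot()` returns the annotated image as an array that can be displayed inline.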



## 2. Fine-Tuning YOLO on Custom Dataset
Fine-tuning YOLO requires a dataset in the YOLO format; we'll use a small public dataset for demonstration.
We will adapt the pre-trained YOLO model to this custom dataset. This process, known as fine-tuning, lets YOLO specialize in detecting specific objects that were not part of its original training.
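
In the YOLO label format, each image has a text file with one line per object: a class index followed by the box centre and size, all normalized to the range 0 to 1. The sketch below converts one such line to pixel corner coordinates. The function name and the sample line are illustrative assumptions, not part of the Ultralytics API.

```python
def yolo_line_to_pixels(line, image_width, image_height):
    """Convert 'class cx cy w h' (normalized) to (class_id, xmin, ymin, xmax, ymax) in pixels."""
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(p) for p in parts[1:5])
    xmin = (cx - w / 2) * image_width
    ymin = (cy - h / 2) * image_height
    xmax = (cx + w / 2) * image_width
    ymax = (cy + h / 2) * image_height
    return class_id, xmin, ymin, xmax, ymax

# Hypothetical label line for a 640x480 image
class_id, xmin, ymin, xmax, ymax = yolo_line_to_pixels("0 0.5 0.5 0.25 0.5", 640, 480)
print(class_id, xmin, ymin, xmax, ymax)  # 0 240.0 120.0 400.0 360.0
```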

**Key Concepts:**
- **Fine-tuning**: Adapting a pre-trained model to new data by continuing the training process.
- **Custom Dataset**: A dataset that contains specific objects relevant to a new application, different from those YOLO was trained on (e.g. https://docs.ultralytics.com/datasets/detect/signature/). After training, check whether the model detects the new objects. If it does not, consider why and what you could change.

**Steps:**
1. Prepare the custom dataset by organizing images and labels in the required format.
2. Configure the YOLO training pipeline.
3. Train the model and evaluate its performance.
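
For the first two steps, Ultralytics describes a detection dataset with a small YAML file that names the dataset root, the train and val image folders, and the class names. The sketch below writes such a file with plain string formatting. The folder layout, the file name `my_dataset.yaml`, and the single `signature` class are assumptions for illustration and should be adjusted to your data.

```python
import tempfile
from pathlib import Path

def build_dataset_yaml(root, train_dir, val_dir, class_names):
    """Return the text of a minimal Ultralytics-style dataset config."""
    lines = [
        f"path: {root}",
        f"train: {train_dir}",
        f"val: {val_dir}",
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"

# Hypothetical layout: adjust the folders to match your own dataset
yaml_text = build_dataset_yaml(
    "./datasets/signature", "train/images", "valid/images", ["signature"]
)

# Write to a temporary directory so the example does not touch real data
config_path = Path(tempfile.mkdtemp()) / "my_dataset.yaml"
config_path.write_text(yaml_text)
print(config_path.read_text())
```

The resulting path can then be passed as the `data=` argument of `model.train`.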

**Support Material:** 
- https://docs.ultralytics.com/modes/train/
- https://docs.ultralytics.com/modes/val/




In [None]:
# Download a sample dataset (e.g., Signature)
!wget -q https://github.com/ultralytics/assets/releases/download/v0.0.0/signature.zip
!unzip -oq signature.zip -d ./datasets


In [None]:
# Train YOLO on the dataset
results = model.train(data='./datasets/signature.yaml', epochs=10, imgsz=640, batch=8)

In [12]:
model = YOLO("runs/detect/train2/weights/best.pt")  # load the fine-tuned model; check the path against your own training output and insert your best run here

# Predict with the model
results = model.predict("images/example_signature.jpg", conf=0.75)  # adjust parameters such as conf if you need to improve detection


image 1/1 /workspaces/B3-AP3-MultimodalInteraction_ObjDet/images/example_signature.jpg: 640x480 (no detections), 132.7ms
Speed: 3.3ms preprocess, 132.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 480)
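
The output above reports no detections at `conf=0.75`. Predictions scoring below the threshold are discarded, so a strict value can hide boxes the model did find. As a rough illustration with made-up scores (not taken from the model), the sketch below counts how many candidates survive at several thresholds.

```python
def keep_above(scores, threshold):
    """Return the scores that meet or exceed the confidence threshold."""
    return [s for s in scores if s >= threshold]

# Hypothetical confidences for candidate boxes on one image
candidate_scores = [0.71, 0.48, 0.33, 0.12]

survivors = {t: len(keep_above(candidate_scores, t)) for t in (0.75, 0.5, 0.25)}
for threshold, count in survivors.items():
    print(f"conf={threshold}: {count} detection(s) kept")
```

In practice this corresponds to rerunning `model.predict` with a lower value such as `conf=0.25`. If the result is still empty, training for more epochs or with more data is a reasonable next thing to try.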


## 3. Zero-Shot Learning with OWL-ViT
Switch to `OWL-ViT` to see how it performs with zero-shot learning capabilities. Zero-shot means detecting object classes the model was not specifically trained on.

OWL-ViT (Vision Transformer for Open-World Localization) is a model designed for open-vocabulary object detection. Unlike traditional detectors with a fixed set of classes, OWL-ViT combines a vision transformer with text embeddings, enabling it to:

- Understand textual descriptions of objects, even if it hasn't seen them during training.
- Detect and classify objects based on descriptive input, making it suitable for diverse applications.
- Perform zero-shot learning by generalizing to new object classes without additional training.

**Steps in Using OWL-ViT:**
1. Model Initialization: Set up the OWL-ViT model.
2. Text Input for Object Descriptions: Provide descriptive prompts (e.g., 'a red car' or 'a black cat') to guide detection.
3. Inference and Visualization: Process an image or video, detect objects based on text descriptions, and visualize the results.

OWL-ViT excels in scenarios where predefined object classes are insufficient, such as detecting rare or domain-specific objects.

**Support Material:**
- https://huggingface.co/docs/transformers/en/model_doc/owlvit
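
To make the idea of matching image regions against text concrete, here is a toy sketch: every candidate region and every text query is represented by an embedding vector, and a region is assigned the query it is most similar to. The vectors below are invented three-dimensional stand-ins, and cosine similarity is a simplification of what the real model computes, so treat this as intuition rather than OWL-ViT's actual implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_query(region_embedding, query_embeddings):
    """Return (query_text, similarity) for the closest text query."""
    scored = {
        text: cosine_similarity(region_embedding, emb)
        for text, emb in query_embeddings.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]

# Invented embeddings; a real model learns these in a shared image-text space
queries = {"a red car": [0.9, 0.1, 0.0], "a black cat": [0.0, 0.2, 0.9]}
region = [0.8, 0.3, 0.1]

label, similarity = best_query(region, queries)
print(label, round(similarity, 3))
```

Because a new class only needs a new text embedding, adding a query does not require retraining the detector, which is what makes the approach zero-shot.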


In [13]:
import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
from PIL import Image
import torch

from transformers import OwlViTProcessor, OwlViTForObjectDetection

image = Image.open("images/street_scene.jpg")

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")


text_labels = [["a person on the floor", "a church "]]

inputs = processor(text=text_labels, images=image, return_tensors="pt")
# Inference only, so disable gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.tensor([(image.height, image.width)])

# Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
results = processor.post_process_grounded_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1, text_labels=text_labels
)
# Retrieve predictions for the first image for the corresponding text queries
result = results[0]
boxes, scores, text_labels = result["boxes"], result["scores"], result["text_labels"]

for box, score, text_label in zip(boxes, scores, text_labels):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")


def show_box(box, ax):
    print(box)
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(
        plt.Rectangle((x0, y0), w, h, edgecolor="green", facecolor=(0, 0, 0, 0), lw=2)
    )


def show_boxes_and_labels_on_image(raw_image, boxes, labels, scores):
    plt.figure(figsize=(10, 10))
    plt.imshow(raw_image)
    for i, box in enumerate(boxes):
        box = [round(i, 2) for i in box.tolist()]
        show_box(box, plt.gca())
        plt.text(
            x=box[0],
            y=box[1] - 12,
            s=f"{labels[i]}: {scores[i]:,.4f}",
            c="beige",
            path_effects=[pe.withStroke(linewidth=4, foreground="darkgreen")],
        )
    plt.axis("on")
    # To save the figure, call plt.savefig("streetscene_with_detections.jpg") here, before plt.show()
    plt.show()



# Show the image with the bounding boxes
show_boxes_and_labels_on_image(
    image,
    boxes,
    text_labels,
    scores
)


Detected a church  with confidence 0.116 at location [706.68, 35.65, 870.62, 342.14]
Detected a person on the floor with confidence 0.119 at location [1265.73, 347.33, 1309.63, 423.22]
Detected a person on the floor with confidence 0.144 at location [729.04, 301.7, 873.73, 509.65]
Detected a person on the floor with confidence 0.109 at location [932.93, 316.24, 1046.45, 524.26]
Detected a person on the floor with confidence 0.144 at location [1084.81, 326.57, 1179.79, 470.67]
Detected a person on the floor with confidence 0.11 at location [1295.77, 343.81, 1340.44, 421.08]
Detected a person on the floor with confidence 0.139 at location [1337.02, 344.97, 1389.21, 425.86]
Detected a person on the floor with confidence 0.14 at location [1558.68, 294.65, 1722.22, 665.15]
Detected a person on the floor with confidence 0.127 at location [1264.44, 347.31, 1313.74, 428.22]
Detected a person on the floor with confidence 0.13 at location [1418.98, 349.44, 1474.32, 440.56]
Detected a person on t

<Figure size 1000x1000 with 1 Axes>
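
Several of the boxes printed above nearly coincide, for example the two that start near x = 1265. A common way to remove such duplicates is non-maximum suppression (NMS): keep the highest-scoring box and drop others that overlap it heavily. The sketch below is a plain-Python version applied to three boxes copied from the output. The 0.5 IoU threshold and the helper names are assumptions, and this step is not part of the code above.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap heavily with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept

# Three boxes and scores taken from the printed detections
boxes = [
    [1265.73, 347.33, 1309.63, 423.22],
    [1264.44, 347.31, 1313.74, 428.22],
    [729.04, 301.7, 873.73, 509.65],
]
scores = [0.119, 0.127, 0.144]

kept = non_max_suppression(boxes, scores)
print(kept)  # [2, 1]
```

The first box lies almost entirely inside the second and scores lower, so it is dropped, while the distant third box is unaffected.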