# Real-time Object Detection with YOLO and OpenCV

## Overview:

These scripts demonstrate real-time object detection with two models, You Only Look Once (YOLO) and DEtection TRansformers (DETR), using OpenCV for webcam capture and display. YOLO is a popular deep learning-based object detection algorithm known for its speed and accuracy. DETR is an object detection model that directly predicts object bounding boxes and class labels using a transformer-based encoder-decoder architecture. OpenCV is a widely used library for computer vision tasks, including image processing and object detection.

## Concepts:

### YOLO (You Only Look Once):
   - YOLO is a real-time object detection algorithm that detects objects in images or video frames.
   - It divides the input image into a grid and predicts bounding boxes and class probabilities for each grid cell simultaneously.
   - YOLO can detect multiple objects in a single pass through the neural network, making it extremely fast.
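
To make the grid idea concrete, here is a minimal sketch of how an object's center maps to the grid cell responsible for it, in the style of the original YOLO formulation. The function name `responsible_cell`, the 7x7 grid, and the image size are illustrative assumptions and are not taken from the script in this document.

```python
def responsible_cell(cx: float, cy: float, img_w: int, img_h: int, grid: int = 7) -> tuple[int, int]:
    """Return (row, col) of the grid cell that contains the box center (cx, cy), given in pixels."""
    if not (0 <= cx < img_w and 0 <= cy < img_h):
        raise ValueError("box center must lie inside the image")
    col = int(cx / img_w * grid)  # horizontal cell index
    row = int(cy / img_h * grid)  # vertical cell index
    return row, col


# A box centered in the middle of a 640x480 image (hypothetical values)
print(responsible_cell(320, 240, 640, 480))
```

Only the cell that contains the center is responsible for that object, which is why each cell can make its predictions independently and all cells can be evaluated in one pass.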

### OpenCV:
   - OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library.
   - It provides a wide range of tools and algorithms for image and video processing tasks.
   - OpenCV is widely used for tasks such as object detection, facial recognition, and image segmentation.

## Code Explanation:

### Importing Libraries:
  - The script imports necessary libraries including `cv2` for OpenCV, `YOLO` from `ultralytics` for object detection, and `supervision` for annotations.

### Initializing YOLO Model:
  - The YOLO model is initialized using pre-trained weights (`yolov8s.pt`). These weights come from training on the COCO dataset, so the model can detect COCO's 80 object classes without further training.

### Initializing Webcam Capture:
  - The script initializes webcam capture using OpenCV's `VideoCapture` class. If the webcam cannot be opened, the constructor raises a `RuntimeError` and no detection is performed.

### Real-time Object Detection Loop:
  - The script enters a while loop to continuously capture frames from the webcam and perform object detection on each frame.
  - Each frame captured from the webcam is passed through the YOLO model to detect objects.
  - Detected objects are annotated with bounding boxes and labels using the `supervision` library.
  - Annotated frames are displayed in real-time using OpenCV's `imshow` function.
  - The loop continues until the user presses the 'q' key, at which point the webcam is released and OpenCV windows are closed.
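
The control flow described above can be sketched without a camera or a model. In the sketch below, `run_detection_loop` and its four callback parameters are hypothetical names introduced for illustration; in the real script their roles are played by `VideoCapture.read`, the YOLO model call, `imshow` plus `waitKey`, and `VideoCapture.release`. The `try`/`finally` is a design choice of this sketch, so that the camera is released even if detection raises.

```python
from typing import Callable, Optional


def run_detection_loop(
    read_frame: Callable[[], Optional[object]],
    detect: Callable[[object], list],
    show: Callable[[object, list], bool],
    release: Callable[[], None],
) -> int:
    """Process frames until the source ends or `show` asks to quit; return the number processed."""
    processed = 0
    try:
        while True:
            frame = read_frame()
            if frame is None:  # source exhausted or camera failure
                break
            detections = detect(frame)
            processed += 1
            if show(frame, detections):  # True means the user asked to quit
                break
    finally:
        release()  # always free the camera, even on errors
    return processed


# Usage with stand-ins instead of a webcam and a model
frames = iter(["f0", "f1", "f2"])
released = []
count = run_detection_loop(
    read_frame=lambda: next(frames, None),
    detect=lambda frame: [],
    show=lambda frame, dets: False,
    release=lambda: released.append(True),
)
print(count, released)
```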


### Object Detection:
   - Object detection is a computer vision task that involves detecting and locating objects within an image or video frame.
   - It differs from image classification, which identifies the main object in an entire image, by providing the precise location of each object along with its class label.
   - Object detection algorithms typically use machine learning techniques, such as deep neural networks, to perform this task.

### Real-time Object Detection vs Batch Object Detection:
   - Real-time object detection means running detection on a live video stream fast enough to keep up with it. A common target is about 30 frames per second (FPS), although the rate that counts as real-time depends on the application.
   - Batch object detection, on the other hand, involves processing a batch of images or video frames offline, without the constraint of real-time processing.
   - Real-time object detection is often used in applications such as video surveillance, autonomous driving, and augmented reality, where timely detection of objects is critical.
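
The frame-rate requirement translates directly into a time budget per frame. The following sketch works through that arithmetic; the function names and the sample timings are illustrative assumptions, not a benchmark, and the estimate ignores the time spent capturing and displaying frames.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Time available per frame, in milliseconds, at the target frame rate."""
    return 1000.0 / target_fps


def achievable_fps(preprocess_ms: float, inference_ms: float, postprocess_ms: float) -> float:
    """Upper bound on FPS if the three stages run back to back for every frame."""
    return 1000.0 / (preprocess_ms + inference_ms + postprocess_ms)


budget = frame_budget_ms(30)
fps = achievable_fps(1.3, 63.9, 0.4)  # illustrative per-stage timings in ms
print(f"budget per frame at 30 FPS: {budget:.1f} ms")
print(f"achievable rate: {fps:.1f} FPS")
```

A pipeline whose per-frame time exceeds the budget still works, but it drops below the target rate, so the display lags behind the camera or frames have to be skipped.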

### YOLOv8 Architecture:
   - YOLOv8 (You Only Look Once version 8), released by Ultralytics, builds on earlier versions of the YOLO family and is known for its efficiency and accuracy in object detection tasks.
   - It is a deep convolutional neural network that predicts bounding boxes and class probabilities at every cell of a set of multi-scale feature maps, which play the role of the image grid in the original YOLO.
   - A single forward pass through one network yields the boxes and class probabilities for all objects in the image, which is what makes it fast and efficient.
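
As a rough illustration of how a per-cell prediction becomes a box, the sketch below decodes one cell's output in the anchor-free style that, to our understanding, the Ultralytics YOLOv8 head uses: each cell predicts distances from its center to the four box edges, measured in grid units. The function name `decode_box`, the stride of 16, and the sample distances are assumptions for illustration, and the learned distribution over distances that the real head uses is omitted.

```python
def decode_box(col: int, row: int, distances: tuple, stride: int) -> tuple:
    """Turn (left, top, right, bottom) distances, in grid units, into a pixel-space box.

    (col, row) indexes a cell of the feature map; the cell center is the reference point.
    """
    left, top, right, bottom = distances
    cx, cy = col + 0.5, row + 0.5  # cell center in grid units
    return (
        (cx - left) * stride,    # xmin
        (cy - top) * stride,     # ymin
        (cx + right) * stride,   # xmax
        (cy + bottom) * stride,  # ymax
    )


# Cell (col=10, row=6) on a stride-16 feature map, with made-up distances
print(decode_box(10, 6, (2.0, 1.5, 2.0, 1.5), stride=16))
```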


In [1]:
import cv2
from ultralytics import YOLO  # Import YOLO model from Ultralytics
import supervision as sv  # Import the supervision library for annotations

class ObjectDetectionWithWebcam:
    """
    This class performs real-time object detection using a webcam and YOLO model.

    Attributes:
        model (YOLO): YOLO object detection model.
        webcam (cv2.VideoCapture): Webcam object for capturing frames.
    """

    def __init__(self, model_weights: str = 'yolov8s.pt'):
        """
        Initializes the ObjectDetectionWithWebcam class.

        Args:
            model_weights (str): Path to the YOLO model weights file (default is 'yolov8s.pt').
        """
        self.model = YOLO(model_weights)
        self.webcam = cv2.VideoCapture(0)

        if not self.webcam.isOpened():
            raise RuntimeError("Cannot open webcam")

    def __del__(self):
        """
        Cleans up resources by releasing the webcam.
        """
        self.webcam.release()
        cv2.destroyAllWindows()

    def detect_objects(self):
        """
        Performs real-time object detection using the webcam and displays the annotated frames.
        """
        # Create the annotators once; they are reused for every frame
        bounding_box_annotator = sv.BoundingBoxAnnotator(thickness=4)
        label_annotator = sv.LabelAnnotator()

        while True:
            # Read frame from webcam
            ret, frame = self.webcam.read()

            if not ret:
                print("Can't receive frame (stream end?), Exiting ...")
                break

            # Perform object detection on the frame using the YOLO model
            results = self.model(frame)[0]

            # Convert YOLO detections to Supervision Detections format
            detections = sv.Detections.from_ultralytics(results)

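            # Drop detections with class_id 0. In the COCO label map used by these weights,
            # class 0 is 'person' (there is no background class), so people are hidden
            # and every other detected class is kept.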
            detections = detections[detections.class_id != 0]

            # Get labels for each detected object
            labels = [
                self.model.model.names[class_id]
                for class_id
                in detections.class_id
            ]

            # Annotate the frame with bounding boxes
            annotated_image = bounding_box_annotator.annotate(
                scene=frame, detections=detections)

            # Annotate the frame with labels
            annotated_image = label_annotator.annotate(
                scene=annotated_image, detections=detections, labels=labels)

            # Display the annotated frame
            cv2.imshow("Object Detection", annotated_image)

            # Exit loop if 'q' key is pressed
            if cv2.waitKey(1) == ord("q"):
                break

# Usage example:
if __name__ == "__main__":
    # Initialize ObjectDetectionWithWebcam class
    detector = ObjectDetectionWithWebcam()

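    # Perform real-time object detection
    detector.detect_objects()

    # Delete the object instead of calling __del__ directly; in CPython this runs
    # the cleanup as soon as the last reference is gone
    del detector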






0: 384x640 (no detections), 86.7ms
Speed: 2.9ms preprocess, 86.7ms inference, 355.8ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 63.9ms
Speed: 1.3ms preprocess, 63.9ms inference, 0.4ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 83.4ms
Speed: 1.7ms preprocess, 83.4ms inference, 0.4ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 95.3ms
Speed: 1.1ms preprocess, 95.3ms inference, 0.4ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 57.5ms
Speed: 1.3ms preprocess, 57.5ms inference, 0.5ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 59.5ms
Speed: 1.2ms preprocess, 59.5ms inference, 0.3ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 56.3ms
Speed: 1.2ms preprocess, 56.3ms inference, 0.3ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 65.1ms
Speed: 1.1ms preprocess, 65.1ms inference, 0.3ms postprocess per image at shape (
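
The per-frame `Speed:` lines in this output can be summarized with a few lines of code. The sketch below is a hedged example: `parse_speed_lines` is a hypothetical helper, the regular expression assumes the line format shown in the output above, and the sample text copies three of those lines.

```python
import re
from statistics import mean

# Matches the three per-stage timings in an Ultralytics "Speed:" log line
SPEED_RE = re.compile(
    r"Speed: ([\d.]+)ms preprocess, ([\d.]+)ms inference, ([\d.]+)ms postprocess"
)


def parse_speed_lines(log_text: str) -> list:
    """Extract (preprocess, inference, postprocess) timings in ms from each 'Speed:' line."""
    return [tuple(float(v) for v in m.groups()) for m in SPEED_RE.finditer(log_text)]


sample = """\
Speed: 1.3ms preprocess, 63.9ms inference, 0.4ms postprocess per image at shape (1, 3, 384, 640)
Speed: 1.7ms preprocess, 83.4ms inference, 0.4ms postprocess per image at shape (1, 3, 384, 640)
Speed: 1.1ms preprocess, 95.3ms inference, 0.4ms postprocess per image at shape (1, 3, 384, 640)
"""

timings = parse_speed_lines(sample)
mean_inference = mean(t[1] for t in timings)
print(f"{len(timings)} frames, mean inference {mean_inference:.1f} ms")
```

Note that the log reports `1 person` even though the script filters out class 0: the model prints its raw detections before the filter in the loop is applied.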

### DETR (DEtection TRansformers) Model:
   - DETR is an object detection model built on the transformer architecture. It was introduced by Facebook AI Research in 2020.
   - Unlike traditional object detection models that rely on anchor boxes and proposal generation, DETR directly predicts object bounding boxes and class labels in a single pass using a transformer-based encoder-decoder architecture.
   - It has been shown to achieve competitive performance on object detection benchmarks with fewer heuristics and hyperparameters.

## YOLOv8 vs. DETR:

### YOLOv8:
   - YOLOv8 is well-suited for real-time applications where speed and efficiency are crucial, such as video surveillance and object tracking.
   - It provides a simpler and faster approach to object detection compared to DETR, making it easier to deploy in resource-constrained environments.

### DETR:
   - DETR offers a novel approach to object detection using transformer architecture, which allows for end-to-end training and inference.
   - It is suitable for applications where precise localization and accurate detection of objects are important, such as autonomous driving and medical imaging.


In [1]:
import cv2
from transformers import pipeline
from PIL import Image, ImageDraw
import numpy as np

class ObjectDetectionWithWebcam:
    """
    This class performs real-time object detection using a webcam and DETR model.

    Attributes:
        detector: DETR object detection pipeline.
        webcam (cv2.VideoCapture): Webcam object for capturing frames.
    """

    def __init__(self, checkpoint: str = "facebook/detr-resnet-50"):
        """
        Initializes the ObjectDetectionWithWebcam class.

        Args:
            checkpoint (str): Name or path of the DETR checkpoint (default is "facebook/detr-resnet-50").
        """
        self.detector = pipeline(model=checkpoint, task="object-detection")
        self.webcam = cv2.VideoCapture(0)

        if not self.webcam.isOpened():
            raise RuntimeError("Cannot open webcam")

    def __del__(self):
        """
        Cleans up resources by releasing the webcam.
        """
        self.webcam.release()
        cv2.destroyAllWindows()

    def detect_objects(self):
        """
        Performs real-time object detection using the webcam and displays the annotated frames.
        """
        while True:
            # Read frame from webcam
            ret, frame = self.webcam.read()

            if not ret:
                print("Can't receive frame (stream end?), Exiting ...")
                break

            # Convert frame to RGB format
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frame = Image.fromarray(frame)

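            # Predict objects in the frame. DETR detects the fixed set of COCO classes;
            # the object-detection pipeline has no `candidate_labels` option (that belongs
            # to the zero-shot-object-detection task), so it is not passed here.
            predictions = self.detector(frame)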

            # Annotate the frame with predicted bounding boxes and labels
            draw = ImageDraw.Draw(frame)
            for prediction in predictions:
                box = prediction["box"]
                label = prediction["label"]
                score = prediction["score"]

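                # Read the corners by key rather than relying on dict ordering
                xmin, ymin, xmax, ymax = (box[k] for k in ("xmin", "ymin", "xmax", "ymax"))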
                draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
                draw.text((xmin, ymin), f"{label}: {round(score, 2)}", fill="white")

            # Convert annotated frame back to OpenCV format
            frame = np.array(frame)
            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)

            # Display the annotated frame
            cv2.imshow("Object Detection", frame)

            # Exit loop if 'q' key is pressed
            if cv2.waitKey(1) == ord("q"):
                break

# Usage example:
if __name__ == "__main__":
    # Initialize ObjectDetectionWithWebcam class
    detector = ObjectDetectionWithWebcam()

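    # Perform real-time object detection
    detector.detect_objects()

    # Delete the object instead of calling __del__ directly; in CPython this runs
    # the cleanup as soon as the last reference is gone
    del detector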


  from .autonotebook import tqdm as notebook_tqdm
Some weights of the model checkpoint at facebook/detr-resnet-50 were not used when initializing DetrForObjectDetection: ['model.backbone.conv_encoder.model.layer1.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing DetrForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DetrForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


KeyboardInterrupt: 
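
Because the DETR checkpoint detects a fixed set of classes, one way to restrict the display to particular objects is to filter the pipeline's predictions by label and score before drawing. The sketch below assumes the list-of-dicts shape that the loop above reads (`score`, `label`, and a `box` with `xmin`, `ymin`, `xmax`, `ymax`); `filter_predictions` and `box_corners` are hypothetical helpers, and the sample values are made up.

```python
def filter_predictions(predictions, min_score=0.9, labels=None):
    """Keep predictions scoring at least min_score and, if labels is given, labelled with one of them."""
    kept = []
    for p in predictions:
        if p["score"] < min_score:
            continue
        if labels is not None and p["label"] not in labels:
            continue
        kept.append(p)
    return kept


def box_corners(prediction):
    """Read the corners by key, so the result does not depend on dict ordering."""
    box = prediction["box"]
    return box["xmin"], box["ymin"], box["xmax"], box["ymax"]


# Hand-written sample in the shape the loop expects (values are made up)
sample = [
    {"score": 0.98, "label": "person", "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 220}},
    {"score": 0.55, "label": "chair", "box": {"xmin": 200, "ymin": 50, "xmax": 260, "ymax": 140}},
]
people = filter_predictions(sample, min_score=0.9, labels={"person"})
print([box_corners(p) for p in people])
```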

# DETR vs. YOLOv8 Architecture Comparison

## DETR Architecture:
- **Encoder-Decoder Architecture**: DETR utilizes a transformer-based encoder-decoder architecture.
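- **Encoder**: A CNN backbone (ResNet-50 in the `facebook/detr-resnet-50` checkpoint used here) first extracts a feature map from the input image; a stack of transformer encoder layers then refines these features with self-attention.
- **Decoder**: The decoder takes a fixed set of learned object queries and lets each one attend to the encoded image features; each query yields one predicted bounding box and class label.
- **Positional Encoding**: DETR adds positional encodings to the image features, because the transformer is otherwise unaware of spatial layout.
- **Learned Object Queries**: Instead of predefined anchor boxes, DETR uses learned object query embeddings. A small feed-forward head maps each decoder output to a box and a class, including a special "no object" class for queries that match nothing.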
- **Direct Prediction**: DETR directly predicts object bounding boxes and class labels in a single pass without the need for anchor box generation or non-maximum suppression.
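
To illustrate what direct prediction yields in practice, the sketch below post-processes a few made-up query outputs. It assumes, as in the original DETR formulation, that each query produces a box in normalized `(cx, cy, w, h)` form plus a class that may be "no object"; the function names, the threshold, and the sample values are illustrative assumptions.

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


def keep_confident(queries, threshold=0.9):
    """Drop query outputs labelled 'no object' or scored below the threshold."""
    return [q for q in queries if q["label"] != "no object" and q["score"] >= threshold]


# Made-up outputs for three object queries
queries = [
    {"label": "cat", "score": 0.97, "box": (0.5, 0.5, 0.2, 0.4)},
    {"label": "no object", "score": 0.99, "box": (0.1, 0.1, 0.1, 0.1)},
    {"label": "dog", "score": 0.40, "box": (0.8, 0.3, 0.1, 0.1)},
]
for q in keep_confident(queries):
    print(q["label"], cxcywh_to_xyxy(q["box"], img_w=640, img_h=480))
```

The only post-processing here is a threshold and a coordinate conversion; there is no step that compares boxes against each other.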

## YOLOv8 Architecture:
- **Single-stage Object Detector**: YOLOv8 is a single-stage object detection model based on a deep convolutional neural network (CNN).
- **Backbone Network**: YOLOv8 uses a CSPDarknet-style CNN backbone, built from C2f blocks in the Ultralytics implementation, to extract features from the input image.
- **Grid-based Prediction**: YOLOv8 predicts bounding boxes and class probabilities at each cell of several feature maps of different resolutions, so that small and large objects are handled at different scales.
- **Anchor-free Head**: Unlike earlier versions such as YOLOv5, YOLOv8 does not use predefined anchor boxes; each cell predicts the box extent directly, relative to its own position.
- **Non-maximum Suppression**: YOLOv8 performs post-processing steps such as non-maximum suppression to remove redundant detections and refine the final set of predicted bounding boxes.
- **Efficiency and Speed**: YOLOv8 is known for its efficiency and speed, making it suitable for real-time object detection tasks.
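
The non-maximum suppression step mentioned above can be sketched in plain Python. This is a minimal, class-agnostic greedy version for illustration only; the function names, the 0.5 threshold, and the sample boxes are assumptions, and library implementations are typically vectorized and applied per class.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up values)
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```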

## Differences:
- **Architecture Type**: DETR uses a transformer-based encoder-decoder architecture, while YOLOv8 uses a single-stage CNN-based architecture.
- **Prediction Strategy**: DETR predicts a fixed-size set of boxes and labels from learned object queries, while YOLOv8 makes dense predictions at every cell of its multi-scale feature maps.
- **Post-processing**: Neither model relies on predefined anchor boxes. YOLOv8 still needs non-maximum suppression to remove duplicate detections, whereas DETR is trained so that each object is predicted once, which makes non-maximum suppression unnecessary.
- **Accuracy vs. Speed**: YOLOv8 is generally the faster of the two and is the usual choice for real-time use. Which model is more accurate depends on the model size, the dataset, and the object sizes involved, so it is best measured on the target data.

In summary, DETR and YOLOv8 represent different approaches to object detection: DETR performs direct set prediction with a transformer and needs little post-processing, while YOLOv8 prioritizes efficiency and speed using a single-stage CNN architecture with dense predictions followed by non-maximum suppression. The choice between the two depends on the specific requirements of the application, balancing accuracy, speed, and computational resources.
