# Introduction

Computer Vision is a field of computer science that focuses on enabling artificial systems to extract information from images and their variants (e.g., video sequences, views from multiple cameras, multi-dimensional data from 3D scanners or medical scanning devices). It relies on algorithmic models that allow a computer to "teach" itself the context of visual data, learning the patterns that distinguish one image from another. Computer Vision is rapidly gaining popularity for automated visual inspection, remote monitoring, and automation.

Computer Vision has become the backbone of numerous practical applications that significantly impact our daily lives and companies across industries, from retail to security, healthcare, construction, automotive, manufacturing, logistics, and agriculture.

In this scenario, one of the most groundbreaking approaches in Computer Vision is the **You Only Look Once** (YOLO) family of models.

# Development History

The first version of YOLO (You Only Look Once) was introduced by Joseph Redmon et al. in their 2015 [paper](https://arxiv.org/abs/1506.02640) and represented a groundbreaking approach in Computer Vision, particularly in *object detection* tasks. At the time, conventional object detection frameworks (e.g., R-CNN) relied on a two-step approach: for a given image, one model proposes regions that may contain objects, and a second model classifies those regions and refines their localization. YOLOv1 (as the first version became known) challenged this convention by proposing a single neural network that predicts bounding boxes and class probabilities directly from the full image in one evaluation. This approach significantly increased detection speed, making real-time object detection feasible.

![RCNN pipeline](imgs/rcnn-pipeline.png)
*RCNN's multi-stage detection. Credits: RCNN original [paper](https://arxiv.org/abs/1311.2524).*

![YOLOv1 pipeline](imgs/yolov1-pipeline.png)
*YOLOv1 unified detection. Credits: YOLOv1 original [paper](https://arxiv.org/abs/1506.02640).*
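
To make the single-evaluation idea concrete, the sketch below computes the shape of the YOLOv1 prediction tensor and finds the grid cell responsible for an object, following the S × S grid formulation of the YOLOv1 paper. The function names are hypothetical and exist only for illustration. The defaults S = 7, B = 2, and C = 20 are the PASCAL VOC settings reported in that paper.

```python
def yolov1_output_shape(s: int = 7, b: int = 2, c: int = 20) -> tuple[int, int, int]:
    """Shape of the YOLOv1 prediction tensor: S x S x (B * 5 + C).

    Each grid cell predicts B boxes (x, y, w, h, confidence)
    plus C class probabilities.
    """
    return (s, s, b * 5 + c)


def responsible_cell(cx: float, cy: float, s: int = 7) -> tuple[int, int]:
    """Return (row, col) of the grid cell that contains an object center.

    cx and cy are assumed to be normalized to [0, 1]. The cell containing
    the center is the one responsible for detecting that object.
    """
    col = min(int(cx * s), s - 1)  # clamp so cx == 1.0 stays inside the grid
    row = min(int(cy * s), s - 1)
    return row, col


print(yolov1_output_shape())       # (7, 7, 30)
print(responsible_cell(0.5, 0.5))  # (3, 3)
```

All S × S cells are predicted in a single forward pass, which is what removes the separate region-proposal stage.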

Following the initial release, the YOLO architecture underwent several iterations and improvements, leading to versions such as YOLOv2 ([YOLO9000](https://arxiv.org/abs/1612.08242)) and [YOLOv3](https://arxiv.org/abs/1804.02767), each introducing enhancements in speed, accuracy, and the ability to detect smaller objects. [YOLOv4](https://arxiv.org/abs/2004.10934), introduced by Alexey Bochkovskiy et al., focused on optimizing the speed-accuracy trade-off, making it highly efficient without specialized hardware.

The [Ultralytics](https://www.ultralytics.com/) team contributed significantly to the YOLO legacy with their [YOLOv5](https://docs.ultralytics.com/yolov5/) model, which brought improvements in simplicity, speed, and performance. [YOLOv6](https://arxiv.org/abs/2209.02976), developed by researchers at Meituan, targeted industrial applications, and Ultralytics continued its own line with [YOLOv8](https://docs.ultralytics.com/), which incorporates advanced features and improves upon the accuracy and efficiency of its predecessors.

YOLOv8 also supports a full range of vision AI tasks, including detection, segmentation, pose estimation, tracking, and classification. This versatility allows users to leverage YOLOv8's capabilities across diverse applications and domains.

More recently, [YOLO-NAS](https://github.com/Deci-AI/super-gradients/blob/master/YOLONAS.md) and [YOLOv9](https://arxiv.org/abs/2402.13616) were introduced, with reported improvements in efficiency, accuracy, and adaptability.

# Architecture

<img src="imgs/yolov8_architecture.jpg" alt="YOLOv8 Architecture" width=800>*Credits: GitHub user [RangeKing](https://github.com/RangeKing) ([original post](https://github.com/ultralytics/ultralytics/issues/189))*</img>

The YOLOv8 architecture was designed to perform object detection tasks with high efficiency and accuracy. While maintaining the core principle of performing object detection in a single pass through the network, YOLOv8 introduces several key improvements and features to enhance performance, incorporating advanced techniques such as:
- **Cross-stage Partial Networks (CSPNet):** A backbone design that reduces redundancy in network layers, lowering its complexity and improving learning efficiency and model scalability without compromising performance.
- **Path Aggregation Network (PANet):** An architecture that enhances feature extraction and integration, ensuring rich semantic information is carried through the network for accurate detection.
- **Spatial Pyramid Pooling (SPP):** A pooling strategy that increases the network's robustness to object scale variations, improving detection of objects of various sizes.
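
As an illustration of the pooling idea in the last item, the sketch below implements one common YOLO-style form of SPP in NumPy: the feature map is max-pooled in parallel with several window sizes (stride 1, output size preserved) and the results are concatenated along the channel axis. This is a simplified sketch, not the YOLOv8 implementation. The kernel sizes `(5, 9, 13)` and the function names are assumptions chosen for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def max_pool_same(x: np.ndarray, k: int) -> np.ndarray:
    """Max-pool a (C, H, W) float array with an odd k x k window and stride 1.

    The input is padded with -inf so the output keeps the same H and W.
    """
    pad = k // 2
    padded = np.pad(
        x, ((0, 0), (pad, pad), (pad, pad)), mode="constant", constant_values=-np.inf
    )
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))  # (C, H, W, k, k)
    return windows.max(axis=(-2, -1))


def spp_block(x: np.ndarray, kernel_sizes: tuple[int, ...] = (5, 9, 13)) -> np.ndarray:
    """Concatenate the input with its max-pooled versions along the channel axis."""
    return np.concatenate([x] + [max_pool_same(x, k) for k in kernel_sizes], axis=0)


features = np.random.rand(4, 16, 16)  # dummy feature map: 4 channels, 16 x 16
pooled = spp_block(features)
print(pooled.shape)  # (16, 16, 16): 4 original channels + 3 x 4 pooled channels
```

Each pooled copy summarizes a different neighborhood size, so the layers that follow see the same location at several receptive-field scales.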

Additionally, YOLOv8 employs advanced data augmentation techniques and loss functions ([CIoU](https://arxiv.org/abs/1911.08287) and [DFL](https://arxiv.org/abs/2006.04388)) to fine-tune the model's performance further, ensuring it remains robust against a wide variety of images and scenarios.
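
For intuition about the CIoU term, here is a minimal pure-Python sketch that follows the definition in the linked CIoU paper: the IoU is penalized by the normalized distance between box centers and by an aspect-ratio consistency term. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names are illustrative rather than part of any library.

```python
import math

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou(pred: Box, target: Box, eps: float = 1e-9) -> float:
    """Complete IoU: IoU - rho^2 / c^2 - alpha * v."""
    overlap = iou(pred, target)
    # Squared distance between the two box centers.
    rho2 = ((pred[0] + pred[2]) / 2 - (target[0] + target[2]) / 2) ** 2 + (
        (pred[1] + pred[3]) / 2 - (target[1] + target[3]) / 2
    ) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(pred[2], target[2]) - min(pred[0], target[0])
    ch = max(pred[3], target[3]) - min(pred[1], target[1])
    c2 = cw**2 + ch**2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_t, h_t = target[2] - target[0], target[3] - target[1]
    v = (4 / math.pi**2) * (
        math.atan(w_t / (h_t + eps)) - math.atan(w_p / (h_p + eps))
    ) ** 2
    alpha = v / (1 - overlap + v + eps)
    return overlap - rho2 / c2 - alpha * v


# The corresponding loss is 1 - CIoU: zero for a perfect match.
print(1 - ciou((0, 0, 2, 2), (0, 0, 2, 2)))
print(1 - ciou((0, 0, 1, 1), (2, 2, 3, 3)))
```

Unlike plain IoU, the center-distance term still gives a useful signal when the boxes do not overlap at all, which is one reason this family of losses is used for box regression.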

# Key Features

YOLOv8 supports a versatile range of Computer Vision tasks and provides pre-trained models for each:
- **Image classification**, with models pre-trained on the [ImageNet dataset](https://www.image-net.org/).
- **Object detection**, **object segmentation**, and **human pose estimation**, with models pre-trained on the [COCO dataset](https://cocodataset.org/#home).

Additionally, there are model variations for each of these tasks, each targeted at systems with different hardware specifications.
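
The pre-trained weights follow a naming pattern that combines a size letter with a task suffix. The helper below sketches that pattern. `yolov8_weights_name` and the two lookup tables are hypothetical names introduced for this example, and the resulting file names should be checked against the Ultralytics documentation for the version you install.

```python
# Task suffixes and size letters used in YOLOv8 weight file names.
TASK_SUFFIXES = {"detect": "", "segment": "-seg", "pose": "-pose", "classify": "-cls"}
MODEL_SIZES = ("n", "s", "m", "l", "x")  # nano, small, medium, large, extra-large


def yolov8_weights_name(task: str = "detect", size: str = "n") -> str:
    """Build a weight file name such as 'yolov8n.pt' or 'yolov8x-seg.pt'."""
    if task not in TASK_SUFFIXES:
        raise ValueError(f"Unknown task: {task!r}")
    if size not in MODEL_SIZES:
        raise ValueError(f"Unknown model size: {size!r}")
    return f"yolov8{size}{TASK_SUFFIXES[task]}.pt"


print(yolov8_weights_name("detect", "n"))   # yolov8n.pt
print(yolov8_weights_name("segment", "x"))  # yolov8x-seg.pt
print(yolov8_weights_name("pose", "m"))     # yolov8m-pose.pt
```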

# Code Requirements

## Imports

In [1]:
import os
from typing import Callable
from urllib.request import urlopen

import cv2 as cv
import imageio
import numpy as np
from pytube import YouTube
from ultralytics.models import YOLO

## Utility Functions

### Image Resize Function

This function resizes an image to a fixed width while preserving its aspect ratio. This is helpful when creating GIFs, since large frames produce unnecessarily large files.

In [2]:
def imresize(img: np.ndarray, width: int) -> np.ndarray:
    """
    Resizes an image to a specified width while maintaining aspect ratio.

    Args:
        img (np.ndarray): Input image in the form of a NumPy array.
        width (int): Desired width of the output image.

    Returns:
        np.ndarray: Resized image as a NumPy array.
    """
    old_width = img.shape[1]  # index 1 is the width for both (H, W) and (H, W, C) arrays
    factor = width / old_width
    return cv.resize(img, None, fx=factor, fy=factor, interpolation=cv.INTER_LINEAR)

### Create GIF for Predictions

This function converts a sequence of model predictions on video data into an animated GIF. It takes as input the path to the video file, the output path for the GIF, an inference function that applies the model to a single frame and returns the annotated frame, and optional parameters for the frame width, the maximum number of frames, and the *frames per second* (*fps*). The function reads the video, runs the inference function on each frame, and compiles the annotated frames into a GIF. This utility is particularly useful for demonstrating object detection, segmentation, and pose estimation in a dynamic, easily shareable format.

In [3]:
def create_gif_for_predictions(
    video_path: str,
    output_gif_path: str,
    inference_func: Callable[[np.ndarray], np.ndarray],
    img_width: int = 640,
    max_frames: int = 300,
    fps: int = 30,
) -> None:
    """
    Creates a GIF from a video by applying an inference function to each frame.

    Args:
        video_path (str): Path to the input video file.
        output_gif_path (str): Path to save the output GIF file.
        inference_func (Callable[[np.ndarray], np.ndarray]): Function that performs inference on each frame.
        img_width (int, optional): Desired width of each frame in the GIF. Defaults to 640.
        max_frames (int, optional): Maximum number of frames to process from the video. Defaults to 300.
        fps (int, optional): Frames per second for the output GIF. Defaults to 30.

    Returns:
        None
    """
    cap = cv.VideoCapture(video_path)
    img_list = []
    frame_count = 0  # frames processed so far; the loop stops after exactly max_frames
    while cap.isOpened() and frame_count < max_frames:
        success, frame = cap.read()
        if success:
            frame_count += 1
            annotated_frame = inference_func(frame)
            resized_frame = imresize(
                cv.cvtColor(annotated_frame, cv.COLOR_BGR2RGB), width=img_width
            )
            img_list.append(resized_frame)
        else:
            break
    cap.release()
    imageio.mimsave(output_gif_path, img_list, fps=fps, loop=0)

### Display Video for Predictions

This function displays the video along with real-time model predictions in a dedicated window. It accepts the path to the video file and an inference function that annotates each frame. Pressing `q` closes the window. This offers an immediate visual understanding of the model's performance on dynamic scenes.

In [4]:
def display_video_for_predictions(
    video_path: str,
    inference_func: Callable[[np.ndarray], np.ndarray],
    window_name: str = "Prediction Results",
    fps: int = 60,
) -> None:
    """
    Displays a video with predictions by applying an inference function to each frame in real-time.

    Args:
        video_path (str): Path to the input video file.
        inference_func (Callable[[np.ndarray], np.ndarray]): Function that performs inference on each frame.
        window_name (str, optional): Name of the display window. Defaults to "Prediction Results".
        fps (int, optional): Frames per second for displaying the video. Defaults to 60.

    Returns:
        None
    """
    cap = cv.VideoCapture(video_path)
    cv.namedWindow(window_name, cv.WINDOW_NORMAL)
    while cap.isOpened():
        success, frame = cap.read()
        if success:
            annotated_frame = inference_func(frame)
            cv.imshow(window_name, annotated_frame)
            if cv.waitKey(1000 // fps) & 0xFF == ord("q"):
                break
        else:
            break
    cap.release()
    cv.destroyAllWindows()

### Download Sample Image

This function downloads an image from a specified URL and returns the local path to the downloaded file, which helps keep the sample data organized.

In [5]:
def download_sample_image(url: str) -> str:
    """
    Downloads an image from a given URL and saves it to the local file system.

    Args:
        url (str): URL of the image to be downloaded.

    Returns:
        str: Path to the downloaded image.

    Notes:
        If the image already exists in the target directory, it will not be downloaded again.
    """
    filename = url.split("/")[-1]
    img_path = f"data/{filename}"
    if os.path.isfile(img_path):
        print("Sample image already downloaded")
    else:
        os.makedirs("data", exist_ok=True)
        # Open the connection only when the file is actually missing.
        with urlopen(url) as response, open(img_path, "wb") as f:
            f.write(response.read())
        print("Sample image downloaded successfully")
    return img_path

### Download Sample Video

This function downloads a video from a specified YouTube URL and returns the local path to the downloaded file. It is essential for preparing the data used in the demonstrations, ensuring that the same content is available for the object detection, segmentation, and human pose estimation tasks.

In [6]:
def download_sample_video(url: str) -> str:
    """
    Downloads a video from YouTube and saves it to the local file system.

    Args:
        url (str): URL of the YouTube video to be downloaded.

    Returns:
        str: Path to the downloaded video.

    Notes:
        If the video already exists in the target directory, it will not be downloaded again.
    """
    youtube_obj = YouTube(url).streams.get_highest_resolution()
    video_filepath = os.path.join("data", youtube_obj.default_filename)
    if os.path.isfile(video_filepath):
        print("Sample file already downloaded")
    else:
        try:
            youtube_obj.download(output_path="data")
            print("Sample video downloaded successfully")
        except Exception as e:
            # Chain the original exception so the full traceback is preserved.
            raise RuntimeError(f"Failed to download video from {url}") from e
    return video_filepath

## Sample Data

In this section, we'll download the sample data that will be used to demonstrate how to use the YOLO API on real data.

In [7]:
# Image depicting an outdoor environment.
bus_img = "https://ultralytics.com/images/bus.jpg"
# Video of people walking in an open indoor space.
people_walking_video = "https://youtu.be/ORrrKXGx2SE?si=UZqWGkFnUn7wYdck"
# Video of cars on a busy road.
traffic_cars_video = "https://youtu.be/MNn9qKG2UFI?si=2U6waPKQJOsTSJYC"
# Video of soccer game moments.
soccer_moments_video = "https://youtu.be/aTTOQtSOX3I?si=w1Gvm6hI0qySu5qt"

# Download each sample.
bus_img_path = download_sample_image(bus_img)
people_walking_video_path = download_sample_video(people_walking_video)
traffic_cars_video_path = download_sample_video(traffic_cars_video)
soccer_moments_video_path = download_sample_video(soccer_moments_video)

Sample image already downloaded
Sample file already downloaded
Sample file already downloaded
Sample file already downloaded


# Object Detection

Our first use case will be the task that made YOLO models famous.

First, we need to load the pre-trained model. As mentioned before, each of Ultralytics' YOLO models has variants suited to different hardware specifications, and this is where the Ultralytics API shines. You can download and load each variant by specifying a suffix in the model name.

In [8]:
# model = YOLO("models/yolov8n.pt")            <-- "nano" model, which has the lowest inference time and hardware requirements, but also the lowest accuracy.
# model = YOLO("models/yolov8s.pt")            <-- "small" model.
# model = YOLO("models/yolov8m.pt")            <-- "medium" model, which is a balance between speed and accuracy.
# model = YOLO("models/yolov8l.pt")            <-- "large" model.
model = YOLO("models/yolov8x.pt")            # <-- "extra-large" model, which has the highest accuracy but also the highest inference time and hardware requirements.

The first time we run the above cell, it'll download the specified model to the **models** local folder. With the model in place, `model` has the loaded model ready for inference.
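
The utility functions defined earlier expect an `inference_func` that maps a frame to an annotated frame. One possible way to build it from a loaded model is sketched below. `make_inference_func` is a hypothetical helper, and the sketch assumes the model is callable on a single frame and returns a list of results whose first element has a `plot()` method that draws the predictions, as Ultralytics `Results` objects do. The commented usage lines and the output file name are illustrative.

```python
from typing import Any, Callable

import numpy as np


def make_inference_func(model: Any) -> Callable[[np.ndarray], np.ndarray]:
    """Wrap a loaded YOLO model as a frame -> annotated-frame function."""

    def inference(frame: np.ndarray) -> np.ndarray:
        results = model(frame, verbose=False)  # one result per input image
        return results[0].plot()  # image with the predictions drawn on it

    return inference


# Hypothetical usage with the objects defined in this notebook:
# detect = make_inference_func(model)
# create_gif_for_predictions(people_walking_video_path, "people_walking.gif", detect)
# display_video_for_predictions(traffic_cars_video_path, detect)
```

Keeping the model behind a plain callable means the same GIF and display utilities can later be reused for the segmentation and pose estimation models without modification.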

# Image Classification

# Semantic Segmentation

# Human Pose Estimation