<a href="https://colab.research.google.com/github/milieureka/redback-orion/blob/main/Crowd_Monitoring/small_object_detection/sahi_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Detecting Small Objects with SAHI

**Project Name**: Small object detection using SAHI, visualized on the FiftyOne platform

**Author**: Miley Nguyen  

**Team**: RedBack - Crowd Monitoring


Object detection is one of the fundamental tasks in computer vision, but detecting small objects can be particularly challenging.

I'll apply SAHI ([Slicing Aided Hyper Inference](https://ieeexplore.ieee.org/document/9897990)) with Ultralytics' YOLOv8 model to detect small objects in images of human crowds, and then evaluate these predictions to better understand how slicing impacts detection performance.

This notebook covers the following:

- Loading the VisDrone dataset from the Hugging Face Hub
- Applying Ultralytics' YOLOv8 model to the images and video
- Using SAHI to run inference on slices of the images and video
- Evaluating model performance with and without SAHI (compared against the ground truth labels)

## Setup and Installation

Dependencies:

- `Python 3.10.12`
- `fiftyone` for dataset exploration and manipulation.
- `huggingface_hub` Python library for accessing models and datasets.
- `ultralytics` official package for running YOLOv8 models, including inference and training.
- `sahi` for slicing aided hyper inference.
- `IPython` interactive shell capabilities, displaying rich media like videos.
- `opencv-python` (cv2) reading and manipulating video and image frames.
- `os` – for file management tasks.
- [`video`](https://github.com/milieureka/redback-orion/blob/main/Crowd_Monitoring/small_object_detection/resources/Open%20Day%20at%20Deakin%20University%20(online-video-cutter.com).mp4) sample for inference on video. In my code, I download the video and upload it to a Google Drive directory.

In [None]:
pip install -U fiftyone sahi ultralytics huggingface_hub --quiet

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.huggingface as fouh
from fiftyone import ViewField as F
from ultralytics import YOLO
from IPython.display import Video, display, YouTubeVideo
import os
import cv2

I use the publicly available [VisDrone](https://github.com/VisDrone/VisDrone-Dataset) dataset. It has already been annotated, which makes evaluation more convenient.
The dataset can be loaded from the Hugging Face Hub through FiftyOne.

In [None]:
# load a subset of VisDrone dataset directly from the Hugging Face Hub
dataset = fouh.load_from_hub("Voxel51/VisDrone2019-DET", name="sahi-test", max_samples=100, overwrite=True)
dataset_view = dataset.take(50, 50)

Before adding any predictions, I open the dataset in the FiftyOne App, a graphical user interface that makes it easy to explore datasets and rapidly gain intuition about them.

In [None]:
session = fo.launch_app(dataset_view)

![VisDrone](https://github.com/milieureka/redback-orion/blob/main/Crowd_Monitoring/small_object_detection/resources/Fiftyone%20app.png?raw=1)

## Standard Inference with YOLOv8

In [None]:
# Load YOLOv8 model from FiftyOne model integration with Ultralytics
model = foz.load_zoo_model("yolov8l-coco-torch")
ckpt_path = model.config.model_path

# Apply the model to the dataset for prediction
dataset.apply_model(model, label_field="base_model")

In [None]:
# Visualize prediction on FiftyOne app
session = fo.launch_app(dataset_view)

![Base Model Predictions](https://github.com/milieureka/redback-orion/blob/main/Crowd_Monitoring/small_object_detection/resources/yolov8_predict.gif?raw=True)

While the model does a pretty good job of detecting objects, it struggles with the small objects, especially people in the distance. This can happen with large images, as most detection models are trained on fixed-size images. As an example, YOLOv8 is trained on images with maximum side length $640$. When we feed it an image of size $1920$ x $1080$, the model will downsample the image to $640$ x $360$ before making predictions. This downsampling can cause small objects to be missed, as the model may not have enough information to detect them.
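To make the arithmetic concrete, here is a minimal sketch of how few pixels a distant person keeps after resizing. The `downsampled_size` helper and the 24 x 48 pixel pedestrian are illustrative assumptions, not part of YOLOv8 or of this notebook's pipeline, and the sketch ignores any padding the model may apply.

```python
def downsampled_size(width, height, max_side=640):
    """Scale an image so that its longest side equals max_side."""
    scale = max_side / max(width, height)
    return round(width * scale), round(height * scale), scale

# Full HD frame, as in the example above
new_w, new_h, scale = downsampled_size(1920, 1080)

# Hypothetical distant pedestrian occupying 24 x 48 pixels in the original frame
obj_w, obj_h = 24 * scale, 48 * scale

print(f"Resized image: {new_w} x {new_h}")
print(f"Pedestrian after resizing: about {obj_w:.0f} x {obj_h:.0f} pixels")
```

Under these assumptions the pedestrian shrinks to roughly 8 x 16 pixels, which leaves very little detail for the detector to work with.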

For evaluation, we need to standardize the class labels. This is because the classes detected by our *YOLOv8l* model differ from those in the VisDrone dataset. The YOLO model was trained on the [COCO dataset](https://docs.voxel51.com/user_guide/dataset_zoo/datasets.html#coco-2017), which contains 80 classes, while the VisDrone dataset includes only 12 classes, along with an `ignore_regions` class. To ensure consistency, we will remove the unmatched classes between the two datasets and map the VisDrone classes to their corresponding COCO classes as follows:

In [None]:
# Map common classes to the dataset
mapping = {"pedestrians": "person", "people": "person", "van": "car"}
mapped_view = dataset.map_labels("ground_truth", mapping)

In [None]:
# Define functions to filter the labels so that only the classes common to both datasets remain:

def get_label_fields(sample_collection):
    label_fields = list(
        sample_collection.get_field_schema(embedded_doc_type=fo.Detections).keys()
    )
    return label_fields

def filter_all_labels(sample_collection):
    label_fields = get_label_fields(sample_collection)

    filtered_view = sample_collection

    for lf in label_fields:
        filtered_view = filtered_view.filter_labels(
            lf, F("label").is_in(["person", "car", "truck"]), only_matches=False
        )
    return filtered_view

In [None]:
filtered_view = filter_all_labels(mapped_view).take(50, 50)

In [None]:
session.view = filtered_view.view()

## Detecting Small Objects with SAHI

Theoretically, one could train a model on larger images to improve detection of small objects, but this would require more memory and computational power. Another option is to introduce a sliding window approach, where we split the image into smaller patches, run the model on each patch, and then combine the results. This is the idea behind [Slicing Aided Hyper Inference](https://github.com/obss/sahi) (SAHI).

<figure>
  <img src="https://raw.githubusercontent.com/obss/sahi/main/resources/sliced_inference.gif" alt="Alt text" style="width:100%">
  <figcaption style="text-align:center; color:gray;">Illustration of Slicing Aided Hyper Inference. Image courtesy of SAHI Github Repo.</figcaption>
</figure>
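The slicing step can be sketched in plain Python as follows. This is a simplified illustration of the sliding window idea, not SAHI's actual implementation; the `slice_boxes` helper is a hypothetical name, and the 1920 x 1080 frame is an assumed example size.

```python
def slice_boxes(image_w, image_h, slice_w, slice_h, overlap_w=0.2, overlap_h=0.2):
    """Return (x_min, y_min, x_max, y_max) windows that cover the image with overlap."""
    step_x = slice_w - int(slice_w * overlap_w)
    step_y = slice_h - int(slice_h * overlap_h)
    boxes = []
    y = 0
    while True:
        # Shift the last row back inside the image instead of shrinking it
        y_max = min(y + slice_h, image_h)
        y_min = max(0, y_max - slice_h)
        x = 0
        while True:
            # Same treatment for the last column
            x_max = min(x + slice_w, image_w)
            x_min = max(0, x_max - slice_w)
            boxes.append((x_min, y_min, x_max, y_max))
            if x_max >= image_w:
                break
            x += step_x
        if y_max >= image_h:
            break
        y += step_y
    return boxes

small = slice_boxes(1920, 1080, 320, 320)
large = slice_boxes(1920, 1080, 480, 480)
print(len(small), "slices of 320 x 320")
print(len(large), "slices of 480 x 480")
```

With 20% overlap, this sketch yields 32 windows of 320 x 320 and 15 windows of 480 x 480 for the assumed frame size. Each window needs its own forward pass, which is why smaller slices cost more inference time.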

In [None]:
# Import detection model from sahi framework
from sahi import AutoDetectionModel
from sahi.predict import get_prediction, get_sliced_prediction

In [None]:
# Define model and define instances and classes
detection_model = AutoDetectionModel.from_pretrained(
    model_type='yolov8',
    model_path=ckpt_path,
    confidence_threshold=0.25, ## same as the default value for YOLOv8 model
    image_size=640,
    device="cpu",
)

In [None]:
# Define function for prediction
def predict_with_slicing(sample, label_field, **kwargs):
    result = get_sliced_prediction(
        sample.filepath, detection_model, verbose=0, **kwargs
    )
    sample[label_field] = fo.Detections(detections=result.to_fiftyone_detections())

In [None]:
kwargs = {"overlap_height_ratio": 0.2, "overlap_width_ratio": 0.2}

for sample in dataset.iter_samples(progress=True, autosave=True):
    predict_with_slicing(sample, label_field="small_slices", slice_height=320, slice_width=320, **kwargs)
    predict_with_slicing(sample, label_field="large_slices", slice_height=480, slice_width=480, **kwargs)

These inference runs take much longer than the original one. This is because the model now runs on multiple slices *per* image, which increases the number of forward passes it has to make. This is the trade-off we accept to improve the detection of small objects.

In [None]:
filtered_view = filter_all_labels(mapped_view).take(50, 50)

In [None]:
session = fo.launch_app(filtered_view, auto=False)
session.show()

![Sliced Model Predictions](https://github.com/voxel51/fiftyone/blob/v0.24.1/docs/source/tutorials/images/sahi_slices.gif?raw=1)

The results certainly look promising! From a few visual examples, slicing seems to improve the coverage of ground truth detections, and smaller slices in particular seem to lead to more of the `person` detections being captured.

## Evaluating SAHI Predictions

Next, we run an evaluation routine that compares the predictions in each label field against the ground truth labels. The `evaluate_detections()` method marks each detection as a true positive, false positive, or false negative. The default IoU threshold is $0.5$, but it can be adjusted as needed:
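For reference, IoU (intersection over union) is the area where a predicted box and a ground truth box overlap, divided by the area of their union. The sketch below is an illustrative helper, not FiftyOne's implementation. It assumes boxes in `[x, y, width, height]` form, and both the `box_iou` name and the example boxes are hypothetical.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

# Made-up boxes in relative coordinates, for illustration only
ground_truth = [0.10, 0.10, 0.20, 0.40]
prediction = [0.15, 0.10, 0.20, 0.40]

iou = box_iou(ground_truth, prediction)
print(f"IoU = {iou:.2f}, counts as a match at threshold 0.5: {iou >= 0.5}")
```

In this made-up case the IoU is about 0.6, so the prediction would count as a match at the default threshold.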

In [None]:
base_results = filtered_view.evaluate_detections("base_model", gt_field="ground_truth", eval_key="eval_base_model")
large_slice_results = filtered_view.evaluate_detections("large_slices", gt_field="ground_truth", eval_key="eval_large_slices")
small_slice_results = filtered_view.evaluate_detections("small_slices", gt_field="ground_truth", eval_key="eval_small_slices")

In [None]:
print("Base model results:")
base_results.print_report()

print("-" * 50)
print("Large slice results:")
large_slice_results.print_report()

print("-" * 50)
print("Small slice results:")
small_slice_results.print_report()

We can see that as we introduce more slices, the number of false positives increases, while the number of false negatives decreases. This is expected, as the model is able to detect more objects with more slices, but also makes more mistakes! To minimize false positives, more aggressive confidence thresholding can be applied, but even without doing this the $F_1$-score has significantly improved.
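The relationship between these counts and the $F_1$-score can be sketched as below. The true positive, false positive, and false negative counts are made-up values chosen for illustration, not results from this notebook, and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up counts: slicing finds more objects (fewer FN) but adds false alarms (more FP)
base = precision_recall_f1(tp=40, fp=10, fn=60)
sliced = precision_recall_f1(tp=70, fp=30, fn=30)

print("base   precision=%.2f recall=%.2f f1=%.2f" % base)
print("sliced precision=%.2f recall=%.2f f1=%.2f" % sliced)
```

In this made-up example precision drops from 0.80 to 0.70, but recall rises from 0.40 to 0.70, so the $F_1$-score still improves.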

# Video Inference

## Standard Inference with YOLOv8

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Set the path to your video file on Google Drive
video_path = '/content/drive/MyDrive/Deakin_open_day/Open Day at Deakin University (online-video-cutter.com).mp4'

# Load the YOLOv8 model
model_video = YOLO('yolov8n.pt')

# Perform object detection on the video
model_video.predict(
    source=video_path,
    save=True,
    project='runs/detect',
    name='predict1',
    exist_ok=True
)

# Display the annotated video
output_dir = 'runs/detect/predict1'
output_files = os.listdir(output_dir)
print('Files in output directory:', output_files)

# Find the output video file
annotated_video = None
for file in output_files:
    if file.endswith('.mp4') or file.endswith('.avi'):
        annotated_video = os.path.join(output_dir, file)
        break

if annotated_video:
    display(Video(annotated_video, embed=True))
else:
    print('No output video found.')

# Optional: Save the annotated video back to Google Drive
import shutil

drive_output_path = '/content/drive/MyDrive/Deakin_open_day/yolov8_annotated_video.mp4'
if annotated_video:
    # Only copy when an output video was actually found above
    shutil.copy(annotated_video, drive_output_path)

## Inference with YOLOv8 + SAHI

Since the SAHI framework is designed to detect objects in images, it cannot yet process videos directly for inference the way the YOLO model can. Therefore, we need to extract frames from the video, run object detection on each frame, and then reassemble the frames into a video. This process can be somewhat complex. However, Ultralytics provides a pre-built example for object detection on videos, which simplifies the task. A more efficient approach is to clone their GitHub repository and run the detection with the provided command-line script, which also saves considerable computation time (Ultralytics, 2023).

In [None]:
# Clone the Ultralytics repository
!git clone https://github.com/ultralytics/ultralytics.git

In [None]:
%cd /content/ultralytics/examples/YOLOv8-SAHI-Inference-Video

In [None]:
# Build the command string
cmd = f'python yolov8_sahi.py --source "{video_path}" --save-img'

# Execute the command
get_ipython().system(cmd)

In [None]:
output_dir = 'ultralytics_results_with_sahi/exp'  # This is the default directory in the script
print('Files in output directory:', os.listdir(output_dir))

In [None]:


output_files = os.listdir(output_dir)
annotated_video = None
for file in output_files:
    if file.endswith('.mp4') or file.endswith('.avi'):
        annotated_video = os.path.join(output_dir, file)
        break

if annotated_video:
    print(f"Annotated video path: {annotated_video}")
    display(Video(annotated_video, embed=True))
else:
    print('No output video found.')


In [None]:

def create_side_by_side_video(video1_path, video2_path, output_path):
    # Open the video files
    cap1 = cv2.VideoCapture(video1_path)
    cap2 = cv2.VideoCapture(video2_path)

    # Check if videos opened successfully
    if not cap1.isOpened():
        print(f"Error opening video file {video1_path}")
        return
    if not cap2.isOpened():
        print(f"Error opening video file {video2_path}")
        return

    # Get properties from the first video
    fps1 = cap1.get(cv2.CAP_PROP_FPS)
    width1 = int(cap1.get(cv2.CAP_PROP_FRAME_WIDTH))
    height1 = int(cap1.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # Get properties from the second video
    fps2 = cap2.get(cv2.CAP_PROP_FPS)
    width2 = int(cap2.get(cv2.CAP_PROP_FRAME_WIDTH))
    height2 = int(cap2.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # Use the minimum FPS of the two videos to avoid speeding up any video
    fps = min(fps1, fps2)

    # Define the codec and create VideoWriter object
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')

    # Scale both videos to a common height so the frames can be concatenated
    output_height = max(height1, height2)
    scaled_width1 = int(width1 * output_height / height1)
    scaled_width2 = int(width2 * output_height / height2)

    # Output video size is (scaled_width1 + scaled_width2) x output_height,
    # so the writer's frame size always matches the concatenated frames
    output_width = scaled_width1 + scaled_width2

    out = cv2.VideoWriter(output_path, fourcc, fps, (output_width, output_height))

    while True:
        # Read frames from both videos
        ret1, frame1 = cap1.read()
        ret2, frame2 = cap2.read()

        # Break the loop if any video ends
        if not ret1 or not ret2:
            break

        # Resize frames to the common height when the source heights differ
        if height1 != height2:
            frame1 = cv2.resize(frame1, (scaled_width1, output_height))
            frame2 = cv2.resize(frame2, (scaled_width2, output_height))

        # Concatenate frames horizontally
        combined_frame = cv2.hconcat([frame1, frame2])

        # Write the combined frame to the output video
        out.write(combined_frame)

    # Release all resources
    cap1.release()
    cap2.release()
    out.release()
    print("left is YOLOv8, right is YOLO + SAHI prediction")

# Example usage
video1_path = '/content/drive/MyDrive/Deakin_open_day/yolov8_annotated_video.mp4'
video2_path = '/content/ultralytics/examples/YOLOv8-SAHI-Inference-Video/ultralytics_results_with_sahi/exp/Open Day at Deakin University (online-video-cutter.com).mp4'
output_path = 'side_by_side_output.mp4'

create_side_by_side_video(video1_path, video2_path, output_path)

# Display the video in the notebook
Video(output_path, embed=True, width=800)


In [None]:
# Show the results
# Replace with your YouTube video ID
youtube_id = 'xTj8JKMn0_4'

# Display the YouTube video
YouTubeVideo(youtube_id, width=560, height=315)

# Limitation

Although SAHI is an effective framework for detecting objects more comprehensively, the trade-off is a long computing time (approximately 10 times longer than prediction with the pretrained YOLOv8 model alone).
To get the most out of SAHI, the following experiments are suggested:

- Slicing hyperparameters, such as slice height and width, and overlap.
- Base object detection models, as SAHI is compatible with many models, including YOLOv5, and Hugging Face Transformers models.
- Confidence thresholding (potentially on a class-by-class basis), to reduce the number of false positives.
- Post-processing techniques, such as [non-maximum suppression (NMS)](https://docs.voxel51.com/api/fiftyone.utils.labels.html#fiftyone.utils.labels.perform_nms), to reduce the number of overlapping detections.
- Human-in-the-loop (HITL) workflows, to correct ground truth labels.
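
As one example from the list above, class-by-class confidence thresholding can be sketched in plain Python. The detections, thresholds, and the `filter_by_confidence` helper are made-up for illustration. They are not outputs of this notebook or part of SAHI or FiftyOne.

```python
def filter_by_confidence(detections, thresholds, default=0.25):
    """Keep detections whose confidence meets the threshold for their class."""
    return [
        det for det in detections
        if det["confidence"] >= thresholds.get(det["label"], default)
    ]

# Made-up detections, for illustration only
detections = [
    {"label": "person", "confidence": 0.30},
    {"label": "person", "confidence": 0.62},
    {"label": "car", "confidence": 0.45},
    {"label": "truck", "confidence": 0.28},
]

# Hypothetical per-class thresholds; classes not listed fall back to the default
thresholds = {"person": 0.5, "car": 0.4}

kept = filter_by_confidence(detections, thresholds)
print([(d["label"], d["confidence"]) for d in kept])
```

Here the low-confidence `person` detection is dropped while the others are kept. Within FiftyOne, a single global threshold could instead be applied with `filter_labels` and an expression such as `F("confidence") > 0.5`.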

# References

1. Marks, J. (2024). *How to Detect Small Objects*. Voxel51. [Link to Article](https://voxel51.com/blog/how-to-detect-small-objects/)
2. Jocher, G. & Rizwan, M. (2023). *Ultralytics Docs: Using YOLOv8 with SAHI for Sliced Inference*. Ultralytics YOLO Docs. [Link to Article](https://docs.ultralytics.com/guides/sahi-tiled-inference/)
3. Ultralytics. (2023). *Using Ultralytics YOLOv8 with SAHI on videos*. [Link to Article](https://ultralytics.medium.com/using-ultralytics-yolov8-with-sahi-on-videos-3d524087dd33)