# Real-time image segmentation

The purpose of this notebook is to show how to perform real-time image segmentation with an [Ultralytics](https://docs.ultralytics.com/) YOLO model.  
Ultralytics provides a simple API for running powerful computer vision models optimized for real-time use cases.  
  
We'll see how to:
- Perform real-time image segmentation.
- Quantize the model to reduce its size and inference time, making it more suitable for mobile or embedded devices.

## YOLOv11

Here we use the YOLOv11 nano version, the smallest variant of the YOLOv11 model, optimized for fast inference.  
Larger variants are more suitable if you want to maximize accuracy at the expense of inference time: see the [YOLOv11 page](https://docs.ultralytics.com/models/yolo11/#overview).

In [1]:
import cv2
import time
from ultralytics import YOLO

model_name = 'yolo11n-seg'

# Load the YOLOv11 nano segmentation model
model = YOLO(f"{model_name}.pt")

In [2]:
def segment_webcam():
    # Start video capture from webcam
    cap = cv2.VideoCapture(0)
    if not cap.isOpened():
        print("Error: Could not open webcam.")
        return

    inference_times = []

    while True:
        # Read frame from webcam
        ret, frame = cap.read()
        if not ret:
            break

        # Measure inference time
        start_time = time.time()
        results = model(frame, task='segment')
        end_time = time.time()

        # Calculate and store inference time
        inference_times.append(end_time - start_time)

        # Extract masks, bounding boxes, etc.
        annotated_frame = results[0].plot()  # Annotated frame with segmentations

        # Display the output
        cv2.imshow(f'{model_name} Real-Time Segmentation', annotated_frame)

        # Press 'q' to exit the loop
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    # Release the webcam and close windows
    cap.release()
    cv2.destroyAllWindows()

    # Compute and print mean inference time
    if inference_times:
        mean_inference_time = sum(inference_times) / len(inference_times)
        print(f"Mean Inference Time: {mean_inference_time * 1000:.4f} ms")
        return mean_inference_time
    
mean_inference_time = segment_webcam()


0: 480x640 1 person, 266.8ms
Speed: 6.6ms preprocess, 266.8ms inference, 16.0ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 person, 139.1ms
Speed: 3.0ms preprocess, 139.1ms inference, 3.0ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 person, 115.1ms
Speed: 2.0ms preprocess, 115.1ms inference, 3.0ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 person, 109.7ms
Speed: 2.0ms preprocess, 109.7ms inference, 2.5ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 person, 120.0ms
Speed: 1.0ms preprocess, 120.0ms inference, 3.5ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 person, 142.0ms
Speed: 2.0ms preprocess, 142.0ms inference, 4.5ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 person, 125.5ms
Speed: 2.0ms preprocess, 125.5ms inference, 4.0ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 person, 128.3ms
Speed: 2.0ms preprocess, 128.3ms inference, 3.0ms postprocess per image a

In [3]:
print(f"Mean Inference Time: {mean_inference_time * 1000:.4f} ms")

Mean Inference Time: 177.8286 ms
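
The first frame in the log above takes roughly twice as long as the following ones (266.8 ms against about 110 to 140 ms), which pulls the mean up. Here is a minimal sketch, assuming we want a steady-state figure: drop the first sample as warm-up and report both the mean and the median. The helper name `summarize_latencies` and its `warmup` parameter are illustrative. The sample values are only the eight inference times visible in the truncated log, so the result is not directly comparable to the full-run mean printed above, which also times pre- and post-processing.

```python
import statistics

def summarize_latencies(times_s, warmup=1):
    """Return (mean_ms, median_ms) after dropping the first `warmup` samples."""
    steady = times_s[warmup:]
    if not steady:
        raise ValueError("not enough samples after removing warm-up frames")
    mean_ms = statistics.fmean(steady) * 1000
    median_ms = statistics.median(steady) * 1000
    return mean_ms, median_ms

# Per-frame inference times in seconds, copied from the log above
logged = [0.2668, 0.1391, 0.1151, 0.1097, 0.1200, 0.1420, 0.1255, 0.1283]

mean_ms, median_ms = summarize_latencies(logged)
print(f"Steady-state mean: {mean_ms:.1f} ms, median: {median_ms:.1f} ms")
```

The same helper could be applied to the `inference_times` list collected in `segment_webcam()` if that function returned the list instead of the mean.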


## Model export and quantization

The purpose here is to export the model to the ONNX format and quantize it in order to reduce its size and inference time.

#### Export to ONNX using the Ultralytics export function

Note: the onnxruntime Flutter plugin requires opset version < 19 and ONNX IR version < 10, see [ONNXRuntime compatibility](https://onnxruntime.ai/docs/reference/compatibility.html).

In [4]:
model.export(format="onnx", opset=17, imgsz=320)  # Export ONNX format

Ultralytics 8.3.28  Python-3.11.3 torch-2.5.1+cpu CPU (Intel Core(TM) i7-8750H 2.20GHz)

PyTorch: starting from 'yolo11n-seg.pt' with input shape (1, 3, 320, 320) BCHW and output shape(s) ((1, 116, 2100), (1, 32, 80, 80)) (5.9 MB)

ONNX: starting export with onnx 1.14.1 opset 17...
ONNX: slimming with onnxslim 0.1.36...
ONNX: export success  3.0s, saved as 'yolo11n-seg.onnx' (11.1 MB)

Export complete (3.2s)
Results saved to C:\Users\33650\Documents\M2IA2\Computer_vision\seg_project
Predict:         yolo predict task=segment model=yolo11n-seg.onnx imgsz=320  
Validate:        yolo val task=segment model=yolo11n-seg.onnx imgsz=320 data=/ultralytics/ultralytics/cfg/datasets/coco.yaml  
Visualize:       https://netron.app


'yolo11n-seg.onnx'

#### Quantize the model using ONNXRuntime quantizer

Now that the model is exported to the ONNX format, we can quantize it using the ONNXRuntime quantizer (cf. [ONNXRuntime quantization](https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html)).  
  
There are two kinds of quantization:
- Dynamic quantization: the quantization parameters (zero-point and scale) for activations are computed on the fly at inference time. These calculations increase the cost of inference but usually give higher accuracy than static quantization. This type of quantization is usually better suited to transformers and linear layers.
- Static quantization: the quantization parameters are computed ahead of time, during the quantization process, rather than during inference. The model is first run on a set of representative inputs called calibration data, and the quantization parameters of each activation are computed from these runs. The parameters are then written as constants into the quantized model and used for all inputs. Static quantization is recommended for CNN models.
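
To make "zero-point" and "scale" concrete, here is a minimal NumPy sketch of the usual affine uint8 mapping `x ≈ scale * (q - zero_point)`, with the parameters taken from the min/max of some calibration values. It only illustrates the arithmetic and is not ONNX Runtime's implementation. The function names and the random calibration data are assumptions made for the example.

```python
import numpy as np

def minmax_qparams(x, qmin=0, qmax=255):
    """Scale and zero-point for asymmetric uint8 quantization from the observed range."""
    lo = min(float(x.min()), 0.0)  # keep 0.0 inside the range so it is exactly representable
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for a constant-zero tensor
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Static quantization: parameters are computed once from calibration data...
rng = np.random.default_rng(0)
calibration = rng.normal(0.0, 1.0, size=1000).astype(np.float32)
scale, zero_point = minmax_qparams(calibration)

# ...and then reused unchanged for every input seen at inference time.
restored = dequantize(quantize(calibration, scale, zero_point), scale, zero_point)
max_error = float(np.abs(restored - calibration).max())
print(f"scale={scale:.5f}, zero_point={zero_point}, max round-trip error={max_error:.5f}")
```

In the dynamic case, `minmax_qparams` would instead be called on each incoming activation tensor, which is the extra inference cost mentioned above.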

Since our model is CNN-based, we use static quantization.  
  
During calibration, the images must be preprocessed exactly as they will be at inference time.  
We use the CalibrationDataReader class from ONNXRuntime (cf. [examples](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization)) with a set of 10 random images from the COCO dataset, available in the [calibration_img](./calibration_img/) folder (the images should be similar to the ones the model will receive as input).

In [5]:
from onnxruntime.quantization import CalibrationDataReader
import os
import cv2
import numpy as np

def letterbox(im, new_shape=(640, 640), color=(114, 114, 114)):
    # Resize and pad image while maintaining aspect ratio
    shape = im.shape[:2]  # current shape [height, width]
    ratio = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    new_unpad = (int(round(shape[1] * ratio)), int(round(shape[0] * ratio)))
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # width and height padding
    dw /= 2  # divide padding into two sides
    dh /= 2

    im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    im = cv2.copyMakeBorder(
        im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color
    )  # add border
    return im, ratio, (dw, dh)

class YOLOCalibrationDataReader(CalibrationDataReader):
    def __init__(self, calibration_image_folder, model_input_name, input_size=(640, 640)):
        self.data_dir = calibration_image_folder
        self.model_input_name = model_input_name
        self.input_size = input_size
        self.image_files = os.listdir(self.data_dir)
        self.preprocess_images()
        self.data_iter = iter(self.input_batches)

    def preprocess_images(self):
        self.input_batches = []
        for image_file in self.image_files:
            image_path = os.path.join(self.data_dir, image_file)
            image = cv2.imread(image_path)
            if image is None:
                continue
            # Preprocess the image (resize, normalize, etc.)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            image, _, _ = letterbox(image, new_shape=self.input_size)
            image = image.astype(np.float32) / 255.0
            image = np.expand_dims(image.transpose(2, 0, 1), axis=0)
            batch = {self.model_input_name: image}
            self.input_batches.append(batch)

    def get_next(self):
        return next(self.data_iter, None)

Perform static quantization.  
The last layer is very sensitive to quantization, therefore we exclude it from the quantization process to get good predictions.

In [6]:
from onnxruntime.quantization import quantize_static, CalibrationMethod, QuantType, QuantFormat, quant_pre_process
import onnxruntime as ort

# Define paths
model_fp32 = 'yolo11n-seg.onnx'
model_preprocessed = 'yolo11n-seg-preprocessed.onnx'
model_quant = 'yolo11n-seg-quantized.onnx'

session = ort.InferenceSession(model_fp32, providers=['CPUExecutionProvider'])
input_shape = session.get_inputs()[0].shape  # [batch_size, channels, height, width]
input_height, input_width = input_shape[2], input_shape[3]

# Create data reader
calibration_data_reader = YOLOCalibrationDataReader(
    calibration_image_folder='./calibration_img',
    model_input_name=session.get_inputs()[0].name,
    input_size=(input_height, input_width)
)

# Quantization configuration
quantization_config = {
    'activation_type': QuantType.QUInt8,  # Quantize activations to uint8
    'weight_type': QuantType.QInt8,      # Quantize weights to int8
    'quant_format': QuantFormat.QOperator,               # Quantization format
    'per_channel': False,                 # Per-tensor quantization (set to True to enable per-channel)
    'calibrate_method': CalibrationMethod.MinMax,
}

quant_pre_process(model_fp32, model_preprocessed, skip_symbolic_shape=True)

nodes_to_exclude = ["/model.23/Concat_6"]

# Perform static quantization
quantize_static(
    model_input=model_preprocessed,
    model_output=model_quant,
    calibration_data_reader=calibration_data_reader,
    quant_format=quantization_config['quant_format'],
    activation_type=quantization_config['activation_type'],
    weight_type=quantization_config['weight_type'],
    per_channel=quantization_config['per_channel'],
    calibrate_method=quantization_config['calibrate_method'],
    nodes_to_exclude=nodes_to_exclude
)

#### Real-time inference using float model

Unlike the Ultralytics PyTorch model, the ONNX model does not include the post-processing step (non-maximum suppression).  
Therefore, we need to perform the post-processing manually.
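
For reference, greedy non-maximum suppression can be sketched in plain NumPy. This is an illustrative, class-agnostic version, not the code used in the cell that follows, which calls `torchvision.ops.nms`. The function name `nms_numpy` and the toy boxes are assumptions made for the example.

```python
import numpy as np

def nms_numpy(boxes, scores, iou_threshold):
    """Greedy NMS. boxes: [N, 4] as (x1, y1, x2, y2). Returns kept indices, best score first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by decreasing score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with all remaining boxes
        iw = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter + 1e-9)
        # Discard boxes that overlap the best box by more than the threshold
        order = rest[iou <= iou_threshold]
    return np.array(keep, dtype=np.int64)

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)

# The first two boxes overlap with IoU = 81 / 119, about 0.68
print(nms_numpy(boxes, scores, iou_threshold=0.5))  # [0 2]
print(nms_numpy(boxes, scores, iou_threshold=0.7))  # [0 1 2]
```

The two calls show the effect of the threshold: at 0.5 the second box is suppressed as a duplicate of the first, while at 0.7 it survives.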

In [1]:
import cv2
import time
import numpy as np
import onnxruntime as ort
import torch
import torchvision
from ast import literal_eval



iou_threshold = 0.7 # Lower values result in fewer detections by eliminating overlapping boxes, useful for reducing duplicates.
conf_threshold = 0.25 # Sets the minimum confidence threshold for detections. Objects detected with confidence below this threshold will be disregarded. Adjusting this value can help reduce false positives.

COCO_CLASS_NAMES = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
    'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep',
    'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
    'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork',
    'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
    'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair',
    'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv',
    'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
    'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase',
    'scissors', 'teddy bear', 'hair dryer', 'toothbrush'
]

def letterbox(im, new_shape=(640, 640), color=(114, 114, 114)):
    # Resize and pad image while maintaining aspect ratio
    shape = im.shape[:2]  # current shape [height, width]
    ratio = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    new_unpad = (int(round(shape[1] * ratio)), int(round(shape[0] * ratio)))
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # width and height padding
    dw /= 2  # divide padding into two sides
    dh /= 2

    im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    im = cv2.copyMakeBorder(
        im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color
    )  # add border
    return im, ratio, (dw, dh)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def postprocess(frame, outputs, input_width, input_height, orig_shape, ratio, dw, dh, class_names, colors):
    # Extract outputs
    predictions = outputs[0]  # Shape: [1, 116, 2100] for the 320x320 export (4 box + 80 class + 32 mask values)
    mask_protos = outputs[1]  # Shape: [1, 32, 80, 80]

    # Transpose and reshape predictions
    predictions = predictions[0].transpose(1, 0)  # Shape: [2100, 116]
    mask_protos = mask_protos[0]  # Shape: [32, 80, 80]

    # Parse predictions
    num_classes = 80  # Number of classes (COCO dataset)
    num_mask_coeffs = predictions.shape[1] - 4 - num_classes  # Should be 32

    # Extract boxes, class scores, and mask coefficients
    boxes = predictions[:, :4]
    class_scores = predictions[:, 4 : 4 + num_classes]
    mask_coeffs = predictions[:, 4 + num_classes :]

    # Compute final scores and class IDs
    scores = class_scores.max(axis=1)
    class_ids = class_scores.argmax(axis=1)

    # Filter out low-confidence detections
    keep = scores > conf_threshold
    boxes = boxes[keep]
    scores = scores[keep]
    class_ids = class_ids[keep]
    mask_coeffs = mask_coeffs[keep]

    #print(f"Number of detections before NMS: {len(boxes)}")

    if boxes.shape[0] == 0:
        # No detections
        return frame

    # Adjust boxes to original image scale
    boxes[:, 0] = (boxes[:, 0] - dw) / ratio  # x_center
    boxes[:, 1] = (boxes[:, 1] - dh) / ratio  # y_center
    boxes[:, 2] /= ratio  # width
    boxes[:, 3] /= ratio  # height

    # Convert boxes from (x_center, y_center, w, h) to (x1, y1, x2, y2)
    x1 = boxes[:, 0] - boxes[:, 2] / 2
    y1 = boxes[:, 1] - boxes[:, 3] / 2
    x2 = boxes[:, 0] + boxes[:, 2] / 2
    y2 = boxes[:, 1] + boxes[:, 3] / 2
    boxes_xyxy = np.stack([x1, y1, x2, y2], axis=1)

    # Clip boxes to image boundaries
    orig_height, orig_width = orig_shape
    boxes_xyxy[:, [0, 2]] = boxes_xyxy[:, [0, 2]].clip(0, orig_width - 1)
    boxes_xyxy[:, [1, 3]] = boxes_xyxy[:, [1, 3]].clip(0, orig_height - 1)

    # Perform Non-Maximum Suppression
    indices = torchvision.ops.nms(
        torch.from_numpy(boxes_xyxy).float(), torch.from_numpy(scores).float(), iou_threshold
    )
    indices = indices.numpy()

    boxes_xyxy = boxes_xyxy[indices]
    scores = scores[indices]
    class_ids = class_ids[indices]
    mask_coeffs = mask_coeffs[indices]

    # Debugging: Print the number of detections
    #print(f"Number of detections after NMS: {len(indices)}")

    # Compute masks
    mask_protos_flat = mask_protos.reshape(mask_protos.shape[0], -1)  # [32, H*W]
    masks = sigmoid(np.dot(mask_coeffs, mask_protos_flat))  # [N, H*W]
    masks = masks.reshape(-1, mask_protos.shape[1], mask_protos.shape[2])  # [N, H, W]

    # Resize masks to the size of the preprocessed image
    masks_resized = np.array(
        [cv2.resize(mask, (input_width, input_height), interpolation=cv2.INTER_LINEAR) for mask in masks]
    )

    # Remove padding from masks
    dh_int, dw_int = int(round(dh)), int(round(dw))
    masks_cropped = masks_resized[:, dh_int:input_height - dh_int, dw_int:input_width - dw_int]

    # Resize masks to the original image size
    masks_resized = np.array(
        [cv2.resize(mask, (orig_width, orig_height), interpolation=cv2.INTER_LINEAR) for mask in masks_cropped]
    )

    # Apply threshold to get binary masks
    masks_bin = (masks_resized > 0.5).astype(np.uint8)

    # Overlay masks on the frame
    annotated_frame = frame.copy()
    for box, mask, class_id in zip(boxes_xyxy.astype(int), masks_bin, class_ids):
        color = colors[class_id].tolist()
        x1, y1, x2, y2 = box

        # Apply mask
        mask_3d = np.stack([mask] * 3, axis=-1)
        annotated_frame = np.where(
            mask_3d, annotated_frame * 0.5 + np.array(color) * 0.5, annotated_frame
        ).astype(np.uint8)

        # Draw bounding box
        cv2.rectangle(annotated_frame, (x1, y1), (x2, y2), color, 2)

        # Put class label
        class_name = class_names[class_id]
        cv2.putText(
            annotated_frame,
            class_name,
            (x1, y1 - 10),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.5,
            color,
            2,
        )

    return annotated_frame

def segment_webcam_onnx():
    # Load the ONNX model
    # session = ort.InferenceSession('yolo11n-seg.onnx', providers=['CUDAExecutionProvider']) # Use CUDA for GPU acceleration (if available)
    session = ort.InferenceSession('yolo11n-seg.onnx', providers=['CPUExecutionProvider'])
    input_name = session.get_inputs()[0].name
    input_shape = session.get_inputs()[0].shape  # [batch_size, channels, height, width]
    input_height, input_width = input_shape[2], input_shape[3]

    #model_meta = session.get_modelmeta().custom_metadata_map["names"]
    class_names = COCO_CLASS_NAMES
    #class_names = literal_eval(model_meta)

    # Generate random colors for each class
    COLORS = np.random.randint(0, 255, size=(len(class_names), 3), dtype='uint8')

    # Start video capture from the webcam
    cap = cv2.VideoCapture(0)
    if not cap.isOpened():
        print("Error: Could not open webcam.")
        return

    inference_times = []

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        orig_height, orig_width = frame.shape[:2]

        # Preprocess the frame
        img, ratio, (dw, dh) = letterbox(frame, (input_height, input_width))  # new_shape is (height, width)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = img.astype(np.float32) / 255.0  # Normalize to [0, 1]
        img = np.expand_dims(img.transpose(2, 0, 1), axis=0)  # HWC to CHW and add batch dimension

        # Run inference
        start_time = time.time()
        outputs = session.run(None, {input_name: img})
        end_time = time.time()
        inference_times.append(end_time - start_time)

        # Postprocess outputs
        annotated_frame = postprocess(frame, outputs, input_width, input_height, orig_shape=(orig_height, orig_width), ratio=ratio, dw=dw, dh=dh, class_names=class_names, colors=COLORS)

        # Display the output
        cv2.imshow('ONNX Real-Time Segmentation', annotated_frame)

        # Press 'q' to exit the loop
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    # Release resources
    cap.release()
    cv2.destroyAllWindows()

    # Calculate and print mean inference time
    mean_inference_time_float_model = np.mean(inference_times)
    print(f"Mean Inference Time: {mean_inference_time_float_model * 1000:.2f} ms")
    return mean_inference_time_float_model

mean_inference_time_float_model = segment_webcam_onnx()

Mean Inference Time: 21.39 ms
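
The coordinate adjustment done in `postprocess` can be checked by hand on a small example. The sketch below assumes a 480x640 webcam frame (as in the Ultralytics log earlier) and the 320x320 model input used for the export. The helper names `letterbox_params` and `to_original` are illustrative and simply repeat the arithmetic of `letterbox` and of the "Adjust boxes to original image scale" step.

```python
def letterbox_params(orig_hw, input_hw):
    """Scale ratio and per-side padding (dw, dh), as computed by `letterbox`."""
    (h, w), (in_h, in_w) = orig_hw, input_hw
    ratio = min(in_h / h, in_w / w)
    new_w, new_h = int(round(w * ratio)), int(round(h * ratio))
    return ratio, (in_w - new_w) / 2, (in_h - new_h) / 2

def to_original(cx, cy, bw, bh, ratio, dw, dh):
    """Map an (x_center, y_center, w, h) box from model-input space to the original frame."""
    return (cx - dw) / ratio, (cy - dh) / ratio, bw / ratio, bh / ratio

ratio, dw, dh = letterbox_params(orig_hw=(480, 640), input_hw=(320, 320))
print(ratio, dw, dh)  # 0.5 0.0 40.0

# A box centred in the 320x320 input maps to the centre of the 640x480 frame
box = to_original(160, 160, 100, 50, ratio, dw, dh)
print(box)  # (320.0, 240.0, 200.0, 100.0)
```

The frame is scaled by 0.5 to 320x240 and padded with 40 pixels above and below, so removing the padding offset before dividing by the ratio recovers the original coordinates.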


#### Real-time inference using quantized model

In [2]:
import cv2
import time
import numpy as np
import onnxruntime as ort
import torch
import torchvision
from ast import literal_eval



iou_threshold = 0.7 # Lower values result in fewer detections by eliminating overlapping boxes, useful for reducing duplicates.
conf_threshold = 0.25 # Sets the minimum confidence threshold for detections. Objects detected with confidence below this threshold will be disregarded. Adjusting this value can help reduce false positives.



def letterbox(im, new_shape=(640, 640), color=(114, 114, 114)):
    # Resize and pad image while maintaining aspect ratio
    shape = im.shape[:2]  # current shape [height, width]
    ratio = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    new_unpad = (int(round(shape[1] * ratio)), int(round(shape[0] * ratio)))
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # width and height padding
    dw /= 2  # divide padding into two sides
    dh /= 2

    im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    im = cv2.copyMakeBorder(
        im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color
    )  # add border
    return im, ratio, (dw, dh)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def postprocess(frame, outputs, input_width, input_height, orig_shape, ratio, dw, dh, class_names, colors):
    # Extract outputs
    predictions = outputs[0]  # Shape: [1, 116, 2100] for the 320x320 export (4 box + 80 class + 32 mask values)
    mask_protos = outputs[1]  # Shape: [1, 32, 80, 80]

    # Transpose and reshape predictions
    predictions = predictions[0].transpose(1, 0)  # Shape: [2100, 116]
    mask_protos = mask_protos[0]  # Shape: [32, 80, 80]

    # Parse predictions
    num_classes = 80  # Number of classes (COCO dataset)
    num_mask_coeffs = predictions.shape[1] - 4 - num_classes  # Should be 32

    # Extract boxes, class scores, and mask coefficients
    boxes = predictions[:, :4]
    class_scores = predictions[:, 4 : 4 + num_classes]
    mask_coeffs = predictions[:, 4 + num_classes :]

    # Compute final scores and class IDs
    scores = class_scores.max(axis=1)
    class_ids = class_scores.argmax(axis=1)

    # Filter out low-confidence detections
    keep = scores > conf_threshold
    boxes = boxes[keep]
    scores = scores[keep]
    class_ids = class_ids[keep]
    mask_coeffs = mask_coeffs[keep]

    #print(f"Number of detections before NMS: {len(boxes)}")

    if boxes.shape[0] == 0:
        # No detections
        return frame

    # Adjust boxes to original image scale
    boxes[:, 0] = (boxes[:, 0] - dw) / ratio  # x_center
    boxes[:, 1] = (boxes[:, 1] - dh) / ratio  # y_center
    boxes[:, 2] /= ratio  # width
    boxes[:, 3] /= ratio  # height

    # Convert boxes from (x_center, y_center, w, h) to (x1, y1, x2, y2)
    x1 = boxes[:, 0] - boxes[:, 2] / 2
    y1 = boxes[:, 1] - boxes[:, 3] / 2
    x2 = boxes[:, 0] + boxes[:, 2] / 2
    y2 = boxes[:, 1] + boxes[:, 3] / 2
    boxes_xyxy = np.stack([x1, y1, x2, y2], axis=1)

    # Clip boxes to image boundaries
    orig_height, orig_width = orig_shape
    boxes_xyxy[:, [0, 2]] = boxes_xyxy[:, [0, 2]].clip(0, orig_width - 1)
    boxes_xyxy[:, [1, 3]] = boxes_xyxy[:, [1, 3]].clip(0, orig_height - 1)

    # Perform Non-Maximum Suppression
    indices = torchvision.ops.nms(
        torch.from_numpy(boxes_xyxy).float(), torch.from_numpy(scores).float(), iou_threshold
    )
    indices = indices.numpy()

    boxes_xyxy = boxes_xyxy[indices]
    scores = scores[indices]
    class_ids = class_ids[indices]
    mask_coeffs = mask_coeffs[indices]

    # Debugging: Print the number of detections
    #print(f"Number of detections after NMS: {len(indices)}")

    # Compute masks
    mask_protos_flat = mask_protos.reshape(mask_protos.shape[0], -1)  # [32, H*W]
    masks = sigmoid(np.dot(mask_coeffs, mask_protos_flat))  # [N, H*W]
    masks = masks.reshape(-1, mask_protos.shape[1], mask_protos.shape[2])  # [N, H, W]

    # Resize masks to the size of the preprocessed image
    masks_resized = np.array(
        [cv2.resize(mask, (input_width, input_height), interpolation=cv2.INTER_LINEAR) for mask in masks]
    )

    # Remove padding from masks
    dh_int, dw_int = int(round(dh)), int(round(dw))
    masks_cropped = masks_resized[:, dh_int:input_height - dh_int, dw_int:input_width - dw_int]

    # Resize masks to the original image size
    masks_resized = np.array(
        [cv2.resize(mask, (orig_width, orig_height), interpolation=cv2.INTER_LINEAR) for mask in masks_cropped]
    )

    # Apply threshold to get binary masks
    masks_bin = (masks_resized > 0.5).astype(np.uint8)

    # Overlay masks on the frame
    annotated_frame = frame.copy()
    for box, mask, class_id in zip(boxes_xyxy.astype(int), masks_bin, class_ids):
        color = colors[class_id].tolist()
        x1, y1, x2, y2 = box

        # Apply mask
        mask_3d = np.stack([mask] * 3, axis=-1)
        annotated_frame = np.where(
            mask_3d, annotated_frame * 0.5 + np.array(color) * 0.5, annotated_frame
        ).astype(np.uint8)

        # Draw bounding box
        cv2.rectangle(annotated_frame, (x1, y1), (x2, y2), color, 2)

        # Put class label
        class_name = class_names[class_id]
        cv2.putText(
            annotated_frame,
            class_name,
            (x1, y1 - 10),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.5,
            color,
            2,
        )

    return annotated_frame

def segment_webcam_onnx():
    # Load the ONNX model
    # session = ort.InferenceSession('yolo11n-seg.onnx', providers=['CUDAExecutionProvider']) # Use CUDA for GPU acceleration (if available)
    session = ort.InferenceSession('yolo11n-seg-quantized.onnx', providers=['CPUExecutionProvider'])
    input_name = session.get_inputs()[0].name
    input_shape = session.get_inputs()[0].shape  # [batch_size, channels, height, width]
    input_height, input_width = input_shape[2], input_shape[3]

    model_meta = session.get_modelmeta().custom_metadata_map["names"]
    class_names = literal_eval(model_meta)

    # Generate random colors for each class
    COLORS = np.random.randint(0, 255, size=(len(class_names), 3), dtype='uint8')

    # Start video capture from the webcam
    cap = cv2.VideoCapture(0)
    if not cap.isOpened():
        print("Error: Could not open webcam.")
        return

    inference_times = []

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        orig_height, orig_width = frame.shape[:2]

        # Preprocess the frame
        img, ratio, (dw, dh) = letterbox(frame, (input_height, input_width))  # new_shape is (height, width)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = img.astype(np.float32) / 255.0  # Normalize to [0, 1]
        img = np.expand_dims(img.transpose(2, 0, 1), axis=0)  # HWC to CHW and add batch dimension

        # Run inference
        start_time = time.time()
        outputs = session.run(None, {input_name: img})
        end_time = time.time()
        inference_times.append(end_time - start_time)

        # Postprocess outputs
        annotated_frame = postprocess(frame, outputs, input_width, input_height, orig_shape=(orig_height, orig_width), ratio=ratio, dw=dw, dh=dh, class_names=class_names, colors=COLORS)

        # Display the output
        cv2.imshow('ONNX Real-Time Segmentation', annotated_frame)

        # Press 'q' to exit the loop
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    # Release resources
    cap.release()
    cv2.destroyAllWindows()

    # Calculate and print mean inference time
    mean_inference_time_quantized_model = np.mean(inference_times)
    print(f"Mean Inference Time: {mean_inference_time_quantized_model * 1000:.2f} ms")
    return mean_inference_time_quantized_model

mean_inference_time_quantized_model = segment_webcam_onnx()

Mean Inference Time: 19.87 ms


#### Compare inference time

We expect the quantized model's inference time to be lower than that of its float counterpart. For this model the difference is small and, in the runs recorded in this notebook, not consistent from one run to the next, so further investigation is needed.  
Also, if you check the file sizes, the quantized model should be almost 4 times smaller than its float counterpart.

In [9]:
print(f"Mean Inference Time (Float Model): {mean_inference_time_float_model * 1000:.2f} ms")
print(f"Mean Inference Time (Quantized Model): {mean_inference_time_quantized_model * 1000:.2f} ms")

Mean Inference Time (Float Model): 21.62 ms
Mean Inference Time (Quantized Model): 23.24 ms
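
The size claim can be checked directly from the files on disk. This is a minimal sketch, assuming both models sit in the working directory under the names used earlier in this notebook. The helpers `size_mb` and `compare_sizes` are illustrative. Weights stored as 8-bit integers take a quarter of the space of 32-bit floats, which is where the "almost 4 times" expectation comes from. Nodes excluded from quantization and graph metadata will typically keep the ratio somewhat below 4.

```python
import os

def size_mb(path):
    """File size in megabytes (10^6 bytes)."""
    return os.path.getsize(path) / 1e6

def compare_sizes(float_path, quant_path):
    """Return (float size, quantized size, float/quantized ratio)."""
    float_mb, quant_mb = size_mb(float_path), size_mb(quant_path)
    return float_mb, quant_mb, float_mb / quant_mb

float_path, quant_path = "yolo11n-seg.onnx", "yolo11n-seg-quantized.onnx"
if os.path.exists(float_path) and os.path.exists(quant_path):
    float_mb, quant_mb, ratio = compare_sizes(float_path, quant_path)
    print(f"Float: {float_mb:.1f} MB, quantized: {quant_mb:.1f} MB, ratio: {ratio:.2f}x")
else:
    print("Model files not found; run the export and quantization cells first.")
```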
