# On-device Learning: Teacher-Student use case using Transfer Learning for object detection

## I. Introduction
### 1. Teacher-Student Machine Learning:
Teacher-student machine learning with automatic labeling is an advanced technique where a large, well-trained model (the teacher) is used to generate labels for a dataset that lacks annotations. The teacher model predicts labels for the unlabeled data, which are then used to train a smaller, simpler model (the student). This process allows the student model to learn from the teacher's knowledge, effectively transferring the teacher's expertise to the student. This method is particularly useful for creating efficient models for deployment on resource-constrained devices, while also reducing the need for manual data labeling.
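
The automatic labeling idea can be sketched in a few lines. This is a minimal illustration, not the pipeline used later in this notebook: `pseudo_label`, `toy_teacher`, and the confidence cut-off are hypothetical names and values chosen for the example, and the "teacher" is a trivial stand-in for a real model.

```python
from typing import Callable, List, Tuple


def pseudo_label(
    samples: List[float],
    teacher: Callable[[float], Tuple[str, float]],
    min_confidence: float = 0.5,
) -> List[Tuple[float, str]]:
    """Label each sample with the teacher and keep only confident predictions."""
    labeled = []
    for sample in samples:
        label, confidence = teacher(sample)
        if confidence >= min_confidence:
            labeled.append((sample, label))
    return labeled


# Hypothetical stand-in for a large teacher model: returns (label, confidence)
def toy_teacher(x: float) -> Tuple[str, float]:
    return ("person", 0.9) if x > 0 else ("background", 0.3)


# The resulting (sample, label) pairs are what the student would be trained on
training_set = pseudo_label([1.0, -2.0, 3.5], toy_teacher)
```

Only the samples that the teacher labels with sufficient confidence end up in the student's training set.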

### 2. Transfer Learning:
Transfer learning is a machine learning technique where a pre-trained model, developed for one task, is adapted to perform a different but related task. Instead of training a model from scratch, transfer learning reuses the knowledge (the learned features) gained from the initial task to improve the performance and efficiency of the new task. This approach is particularly useful when limited data is available for the new task, because the model benefits from the extensive training of the pre-trained model.

### 3. Introduction of the use case:
In this notebook, we demonstrate transfer learning on an example dataset and evaluate the ONNX Runtime (ORT) training API on an **STM32MP257** device. We use a custom dataset whose data is **retrieved**, **processed**, and **labeled** directly on the device. For this tutorial, we rely on the SSD MobileNetV2 model, which has been trained for object detection on large-scale image datasets such as PASCAL VOC (20 classes). We retrain this model to detect a single class in the custom data. The class for this use case is person, but users are free to adapt the use case to their needs. The initial layers of SSD MobileNetV2 serve as a feature extractor, capturing generic visual features applicable to various tasks, and only the final layer is trained for the task at hand. The figure below summarizes the complete teacher-student machine learning workflow:

<div style="text-align: center;">
    <img src="/usr/local/x-linux-ai/resources/ODL_teacher_student_workflow.png">
</div>

### 4. Prerequisites:
To retrain the model with the ONNX Runtime training API, the following training artifacts must be generated first:
*  `Training model (onnx.ModelProto)`: Contains the base model graph, loss subgraph and the gradient graph.
*  `Eval Model (onnx.ModelProto)`: Contains the base model graph and the loss subgraph.
*  `Optimizer Model (onnx.ModelProto)`: Contains the optimizer graph.
*  `Checkpoint (Directory)`: Contains the model parameters split into 2 .pbsec files, one for frozen parameters and one for trainable parameters.

These artifacts can be generated by following [the dedicated wiki article](https://wiki.st.com/stm32mpu/wiki/How_to_generate_training_artifacts_for_on-device_learning_feature#) on your host computer. They should then be deployed on the **STM32MP257** board using either the `scp` tool or the drag-and-drop functionality provided by this JupyterLab instance.
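
Before starting, it can help to verify that the artifacts were actually copied to the board. The helper below is a minimal sketch: `missing_artifacts` is a hypothetical function, and the file names in `EXPECTED_ARTIFACTS` are an assumption that should be adjusted to whatever your generation script produced.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed artifact names; adjust them to the output of your generation script
EXPECTED_ARTIFACTS = (
    "training_model.onnx",
    "eval_model.onnx",
    "optimizer_model.onnx",
    "checkpoint",
)


def missing_artifacts(artifacts_dir: str,
                      expected: Iterable[str] = EXPECTED_ARTIFACTS) -> List[str]:
    """Return the expected artifact names that are not present in artifacts_dir."""
    root = Path(artifacts_dir)
    return [name for name in expected if not (root / name).exists()]
```

An empty list means every expected artifact was found in the given directory.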


_______________________________________________________________________
## II. Data collection using on-device camera IMX335 sensor
The primary advantage of on-device learning is that the data remains on the device, ensuring enhanced privacy and security. This approach eliminates the need to transfer sensitive data to external servers for processing, thereby reducing the risk of data breaches and unauthorized access. It also allows for personalized models that adapt to individual user behavior and preferences without compromising data security. This is why the data is collected with the on-device camera sensor.

In [None]:
import cv2
import os
import glob
import numpy as np
import sys
import shutil
import subprocess
import random
import torch

import xml.etree.ElementTree as ET
from xml.dom import minidom
import supervision as sv

# Widget libraries for interaction with user
from IPython.display import display
from PIL import Image
import matplotlib.pyplot as plt
import ipywidgets as widgets
import threading
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
JUPYTER_ROOT_PATH = "/usr/local/x-linux-ai/on-device-learning/"

### 1. Raw data collection
In the process of data collection, it is crucial to ensure that the data is generic and uniform across all classes. This means that the dataset should be balanced, with an equal or proportionate number of samples for each class, to prevent bias toward any particular class.
The next cells start by creating the directory where all the raw data will be stored. Please make sure that your disk partition has enough space.
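
The free space can be checked from Python before collecting data. This is a small sketch using the standard library; `has_enough_space` is a hypothetical helper, and the per-image size is a deliberately pessimistic assumption (an uncompressed 256x256 RGB frame, whereas the stored JPEG files are smaller).

```python
import shutil


def has_enough_space(path: str, num_images: int, bytes_per_image: int) -> bool:
    """Check that the partition holding `path` can store num_images images."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= num_images * bytes_per_image


# Upper bound for one 256x256 RGB frame stored uncompressed
raw_frame_bytes = 256 * 256 * 3
```

For example, `has_enough_space("/usr/local", 50, raw_frame_bytes)` would check the partition that holds `/usr/local` for 50 frames.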

In [None]:
num_samples_images = 50                  # Number of images to be retrieved
retrieval_frequency = 5                     # Number of frames between two retrieved frames
input_height = 240                          # Height of frames to be displayed
input_width  = 320                          # Width of frames to be displayed
input_nn_width = 256                        # NN input width
input_nn_height = 256                       # NN input height
dataset_dir = JUPYTER_ROOT_PATH + "data"
os.makedirs(dataset_dir, exist_ok=True)     # Create the raw data directory if needed


In [None]:
# Widgets for interaction with the user
# ================
stopButton = widgets.ToggleButton(
    value=False,
    description='Stop',
    disabled=False,
    button_style='danger', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Description',
    icon='square' # (FontAwesome names without the `fa-` prefix)
)

startRetrieval = widgets.ToggleButton(
    value=False,
    description='Start data retrieval',
    disabled=False,
    button_style='info', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Description',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)

progressBar = widgets.IntProgress(
    value=0,
    min=0,
    max=num_samples_images,
    description='Data collection:',
    bar_style='success', # 'success', 'info', 'warning', 'danger' or ''
    orientation='horizontal'
)

class_box = widgets.HBox([startRetrieval, progressBar, stopButton])

# Display function
# ================
def preview(button):
    # Define the GStreamer pipeline
    gst_pipeline = (
        "libcamerasrc ! "
        "video/x-raw,width={frame_width},height={frame_height},format=RGB16 ! "
        "queue leaky=2 max-size-buffers=1 ! "
        "videoconvert ! "
        "appsink"
    ).format(frame_width=640, frame_height=480)

    # Use the pipeline with cv2.VideoCapture
    cap = cv2.VideoCapture(gst_pipeline, cv2.CAP_GSTREAMER)
    display_handle = display(None, display_id=True)
    image_idx = 0
    frame_idx = 0
    # Loop until the stop button is pressed
    while not button.value:
        ret, frame = cap.read()
        if not ret:
            # Stop if the camera did not deliver a frame
            break
        frame_idx += 1
        frame_cvt = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame_rsz = cv2.resize(frame_cvt, (input_width, input_height))
        display_handle.update(Image.fromarray(frame_rsz))
        # While retrieval is active, save one frame every `retrieval_frequency` frames
        if startRetrieval.value and frame_idx % retrieval_frequency == 0:
            if image_idx < num_samples_images:
                image_path = os.path.join(dataset_dir, f"image_{image_idx}.jpg")
                frame_saved = cv2.resize(frame, (input_nn_width, input_nn_height))
                cv2.imwrite(image_path, frame_saved)
                image_idx += 1
                progressBar.value = image_idx
            if image_idx == num_samples_images:
                image_idx = 0
                startRetrieval.value = False
                print("Data collected successfully")
    # Release the camera and clear the preview
    cap.release()
    display_handle.update(None)


In [None]:
display(class_box)
thread = threading.Thread(target=preview, args=(stopButton,))
thread.start()

_______________________________________________________________________
## III. Preparing dataset in PASCAL VOC format with automatic labeling using Teacher Model
Real-Time Detection Transformer **(RT-DETR)**, developed by Baidu, is an end-to-end object detector that provides real-time performance while maintaining high accuracy. It builds on the NMS-free DETR framework and adds a convolution-based backbone and an efficient hybrid encoder to reach real-time speed. For this use case, **RT-DETR** is used as the teacher model: its robust detection capabilities allow it to generate precise annotations for unlabeled datasets. The model identifies and classifies objects in images and videos with high accuracy, which makes it a good choice for creating labeled datasets without extensive manual effort. The labels generated by **RT-DETR** can then be used to train smaller student models.

To annotate the collected data, we run inferences using the **STAI_MPU API** and the **RT-DETR** model previously exported to ONNX format. Some of the model's operators are not supported by the NPU compiler, so the CPU is the only execution provider for this inference session. For this reason, and because of the teacher model's size, each inference is expected to take a few seconds to produce its high-accuracy predictions.

In [None]:
from stai_mpu import stai_mpu_network
import numpy as np

confidence_threshold = 0.5

### 1. Input pre-processing:
Below we define the utility function `preprocess_input`, which preprocesses the input image to match the RT-DETR input shape and data type. It also normalizes the pixel values to the range [-1, 1], since the model expects a float32 tensor as input. The same function is also used to pre-process the inputs of the student model, but with a different resolution.
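
As a quick worked example of the normalization applied in the next cell, the expression `(value - 127.5) / 127.5` maps the 8-bit pixel range onto a symmetric interval. The helper name `normalize_pixel` is hypothetical and only used for this illustration.

```python
def normalize_pixel(value: float) -> float:
    """Map a pixel value from [0, 255] to [-1, 1] as (value - 127.5) / 127.5."""
    return (value - 127.5) / 127.5


darkest = normalize_pixel(0)      # lower end of the 8-bit range
brightest = normalize_pixel(255)  # upper end of the 8-bit range
```

A black pixel therefore maps to the lower bound of the interval and a fully saturated pixel to the upper bound.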

In [None]:
def preprocess_input(image, input_width, input_height):
    # OpenCV loads images as BGR; convert to RGB
    input_img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    input_img = cv2.resize(input_img, (input_width, input_height))
    # HWC -> CHW layout
    input_img = input_img.transpose(2, 0, 1)
    # Scale pixel values from [0, 255] to [-1, 1] as float32
    input_img = (np.float32(input_img) - 127.5) / 127.5
    # Add the batch dimension: (1, C, H, W)
    expand_tensor = np.expand_dims(input_img, axis=0)
    # copy() returns a C-contiguous array
    input_tensor = expand_tensor.copy()
    return input_tensor

### 2. Output post-processing:
Below we define the utility function required to post-process the output tensor, which means applying a series of transformations to the raw outputs of an object detection model to convert them into meaningful and usable results. These steps typically include filtering, refining, and formatting the model's predictions to produce the final set of detected objects with their associated bounding boxes, class labels, and confidence scores.

The function **bbox_cxcywh_to_xyxy** converts bounding boxes from the format *(center_x, center_y, width, height)* to the format *(x_min, y_min, x_max, y_max)*.
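
As a worked example of this conversion, the sketch below applies the same arithmetic to a single box in plain Python. The function name `cxcywh_to_xyxy` and the sample values are illustrative only; the notebook's own function operates on NumPy arrays of boxes.

```python
from typing import Tuple


def cxcywh_to_xyxy(box: Tuple[float, float, float, float]) -> Tuple[float, float, float, float]:
    """Convert one (center_x, center_y, width, height) box to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


# A 40x20 box centered at (100, 80)
corners = cxcywh_to_xyxy((100.0, 80.0, 40.0, 20.0))
```

Half of the width is subtracted from and added to the center x coordinate, and likewise for the height and the center y coordinate.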

In [None]:
def bbox_cxcywh_to_xyxy(boxes):
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    return np.stack([x1, y1, x2, y2], axis=1)

Next we define the postprocess_rtdetr function that processes the raw output of an object detection model to produce filtered and formatted bounding boxes, confidence scores, and class labels. This function performs several key steps to ensure the output is usable such as score normalization, confidence thresholding, target label filtering to make sure the model detects only the required classes and finally bounding box clipping. A quick reminder that the **RT-DETR** model is NMS-free. 
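
The score normalization, confidence thresholding, and target label filtering steps can be illustrated on a few rows of class scores. This is a simplified sketch in plain Python: `select_detections` is a hypothetical helper, it assumes the scores are raw logits (so a sigmoid is always applied), and it leaves out the bounding box handling.

```python
import math
from typing import List, Optional, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def select_detections(
    class_logits: Sequence[Sequence[float]],
    threshold: float = 0.5,
    target_labels: Optional[Sequence[int]] = None,
) -> List[Tuple[int, int, float]]:
    """Return (row index, label, score) for rows passing the threshold and label filter."""
    kept = []
    for row_idx, logits in enumerate(class_logits):
        scores = [sigmoid(v) for v in logits]
        score = max(scores)
        label = scores.index(score)
        if score <= threshold:
            continue  # best class is not confident enough
        if target_labels is not None and label not in target_labels:
            continue  # confident, but not a class we want to keep
        kept.append((row_idx, label, score))
    return kept


logits = [
    [2.0, -1.0],   # confident detection of class 0
    [-3.0, -2.0],  # low confidence for every class
    [-1.0, 3.0],   # confident detection of class 1
]
kept = select_detections(logits, threshold=0.5, target_labels=[0])
```

With `target_labels=[0]`, only the first row survives: the second row fails the confidence threshold and the third belongs to an unwanted class.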

In [None]:
def postprocess_rtdetr(output, target_labels=None):
    output = np.array(output[0])
    # Each row holds 4 box values (cx, cy, w, h) followed by the per-class scores
    boxes, scores = output[:, :4], output[:, 4:]
    # Apply a sigmoid if the scores are not already in the (0, 1) range
    if not (np.all((scores > 0) & (scores < 1))):
        scores = 1 / (1 + np.exp(-scores))
    boxes = bbox_cxcywh_to_xyxy(boxes)
    # Keep detections whose best class score exceeds the confidence threshold
    _mask = scores.max(-1) > confidence_threshold
    boxes, scores = boxes[_mask], scores[_mask]
    labels, scores = scores.argmax(-1), scores.max(-1)

    # The model is pretrained on several classes, so it may detect unwanted ones;
    # target_labels restricts the detections to the listed class ids
    if target_labels is not None:
        keep = np.isin(labels, target_labels)
        labels, boxes, scores = labels[keep], boxes[keep], scores[keep]

    if scores.shape[0] == 0:
        return [], [], []

    # Scale the normalized boxes to pixel coordinates and clip them to the image
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = np.floor(np.clip(x1 * input_nn_width, 1, input_nn_width - 1)).astype(int)
    y1 = np.floor(np.clip(y1 * input_nn_height, 1, input_nn_height - 1)).astype(int)
    x2 = np.ceil(np.clip(x2 * input_nn_width, 1, input_nn_width - 1)).astype(int)
    y2 = np.ceil(np.clip(y2 * input_nn_height, 1, input_nn_height - 1)).astype(int)
    boxes = np.stack([x1, y1, x2, y2], axis=1)

    return boxes, scores, labels

### 3. Preparing the teacher model instance for inference:
To run the inference on the retrieved image, we will be using the STAI_MPU API on the RT-DETR model in ONNX format and the CPU execution engine only since some of the operators are not supported by the NPU Hardware.

In [None]:
labels_file_path = JUPYTER_ROOT_PATH + "student_model/ssd_mobilenet_v2/labels.txt"
teacher_model_path = JUPYTER_ROOT_PATH +  "teacher_model/rt-detr/rtdetr-l.onnx"
stai_teacher_model = stai_mpu_network(model_path=teacher_model_path, use_hw_acceleration=False)

# Read input tensor information
num_inputs = stai_teacher_model.get_num_inputs()
input_tensor_infos = stai_teacher_model.get_input_infos()

# Read output tensor information
num_outputs = stai_teacher_model.get_num_outputs()
output_tensor_infos = stai_teacher_model.get_output_infos()
output_tensor_shape = output_tensor_infos[0].get_shape()

# The input tensor uses the NCHW layout: (batch, channels, height, width)
input_tensor_shape = input_tensor_infos[0].get_shape()
input_channel = input_tensor_shape[1]
input_height = input_tensor_shape[2]
input_width = input_tensor_shape[3]

### 4. Converting the dataset to PASCAL VOC format:

The **format_predictions_for_xml** function is designed to format object detection predictions into a structured list of dictionaries. It takes four parameters: nb_detections (the number of detected objects), class_names (a list of class names), class_ids (a list of class IDs corresponding to the detected objects), and boxes (a list of bounding box coordinates for each detected object). For each detection, the function retrieves the class name using the class ID, extracts the bounding box coordinates, and creates a dictionary with the class name and bounding box coordinates (xmin, xmax, ymin, ymax). These dictionaries are then appended to a list called predictions, which is returned at the end of the function. This list will be used later to generate XML representations of the predictions.


In [None]:
def format_predictions_for_xml(nb_detections, class_names, class_ids, boxes):
        predictions = []
        for i in range(nb_detections):
            name = class_names[class_ids[i]]
            bbox = boxes[i]
            predictions.append({
                'name': name,
                'xmin': bbox[0],
                'xmax': bbox[2],
                'ymin': bbox[1],
                'ymax': bbox[3]
            })
        return predictions

The **generate_voc_xml_annotation** function creates an XML file in the PASCAL VOC format, which is commonly used for object detection datasets. This function takes various parameters including the image dimensions (width, height, depth), a list of objects with their bounding box coordinates, and the output file path. It generates an XML structure that includes metadata about the image and detailed annotations for each object, such as the object name and bounding box coordinates. The resulting XML file is then written to the specified output path, formatted in a human-readable way.
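
To make the annotation layout concrete, the sketch below builds a stripped-down VOC-style annotation with a single object and reads the bounding box back. It is an illustration only: `build_minimal_voc` is a hypothetical helper, the sample values are made up, and most of the metadata fields written by the full function are omitted.

```python
import xml.etree.ElementTree as ET


def build_minimal_voc(filename, width, height, name, box):
    """Build a minimal VOC-style annotation element with a single object."""
    annotation = ET.Element("annotation")
    ET.SubElement(annotation, "filename").text = filename
    size = ET.SubElement(annotation, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    obj = ET.SubElement(annotation, "object")
    ET.SubElement(obj, "name").text = name
    bndbox = ET.SubElement(obj, "bndbox")
    for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), box):
        ET.SubElement(bndbox, tag).text = str(value)
    return annotation


element = build_minimal_voc("image_0.jpg", 256, 256, "person", (10, 20, 110, 220))
xml_text = ET.tostring(element, encoding="unicode")

# Read the annotation back, as a training pipeline would
root = ET.fromstring(xml_text)
parsed_box = [int(root.find(f"object/bndbox/{tag}").text)
              for tag in ("xmin", "ymin", "xmax", "ymax")]
```

Each detected object becomes one `object` element, and the box coordinates are stored as text inside its `bndbox` child.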

In [None]:
def generate_voc_xml_annotation(width, height, depth, objects, output_file, filename="NA", path="NA"):
        """
        Create an XML file in VOC format.

        :param filename: Name of the image file
        :param path: Path to the image file
        :param width: Width of the image
        :param height: Height of the image
        :param depth: Depth of the image (number of channels)
        :param objects: List of objects, where each object is a dictionary with keys:
                        'name', 'xmin', 'xmax', 'ymin', 'ymax'
        :param output_file: Path to the output XML file
        """
        annotation = ET.Element("annotation")

        folder = ET.SubElement(annotation, "folder")
        folder.text = ""

        filename_elem = ET.SubElement(annotation, "filename")
        filename_elem.text = filename
        path_elem = ET.SubElement(annotation, "path")
        path_elem.text = path
        source = ET.SubElement(annotation, "source")
        database = ET.SubElement(source, "database")
        database.text = "ST"
        size = ET.SubElement(annotation, "size")
        width_elem = ET.SubElement(size, "width")
        width_elem.text = str(width)
        height_elem = ET.SubElement(size, "height")
        height_elem.text = str(height)
        depth_elem = ET.SubElement(size, "depth")
        depth_elem.text = str(depth)
        segmented = ET.SubElement(annotation, "segmented")
        segmented.text = "0"
        for obj in objects:
            object_elem = ET.SubElement(annotation, "object")
            name = ET.SubElement(object_elem, "name")
            name.text = obj['name']
            pose = ET.SubElement(object_elem, "pose")
            pose.text = "Unspecified"
            truncated = ET.SubElement(object_elem, "truncated")
            truncated.text = "0"
            difficult = ET.SubElement(object_elem, "difficult")
            difficult.text = "0"
            occluded = ET.SubElement(object_elem, "occluded")
            occluded.text = "0"
            bndbox = ET.SubElement(object_elem, "bndbox")
            xmin = ET.SubElement(bndbox, "xmin")
            xmin.text = str(obj['xmin'])
            xmax = ET.SubElement(bndbox, "xmax")
            xmax.text = str(obj['xmax'])
            ymin = ET.SubElement(bndbox, "ymin")
            ymin.text = str(obj['ymin'])
            ymax = ET.SubElement(bndbox, "ymax")
            ymax.text = str(obj['ymax'])
        # Write the pretty-printed XML to the output file
        xml_str = minidom.parseString(ET.tostring(
            annotation)).toprettyxml(indent="    ")
        with open(output_file, "w") as f:
            f.write(xml_str)

### 5. Splitting the dataset into train, test and eval sets:
Next, we define the `split_dataset` function that organizes a raw dataset of images into three subsets: training, testing, and evaluation. It first creates the necessary directories (train, test, eval) within a specified dataset path. It then lists and filters image files from the raw dataset directory, shuffles them to ensure randomness, and splits them into three groups: 70% for training, 10% for testing, and the remaining 20% for evaluation. Finally, it moves the images to their respective subdirectories and prints a message indicating the completion of the dataset splitting process.
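
The split arithmetic can be checked on its own. The sketch below reproduces the index computation for the 50 images collected earlier; `split_counts` is a hypothetical helper used only for this illustration.

```python
from typing import Tuple


def split_counts(total: int, train_percent: float = 0.7,
                 test_percent: float = 0.1) -> Tuple[int, int, int]:
    """Return the (train, test, eval) sizes; eval receives whatever remains."""
    train = int(train_percent * total)
    test = int(test_percent * total)
    return train, test, total - train - test


counts = split_counts(50)
```

Because the train and test sizes are truncated to integers, any leftover images fall into the evaluation set, so the three subsets always add up to the total.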

In [None]:
def split_dataset(dataset_raw_path, dataset_split_path, train_percent=0.7, test_percent=0.1):
    # Create dataset directory and subdirectories if they don't exist
    train_dir = os.path.join(dataset_split_path, "train")
    test_dir = os.path.join(dataset_split_path, "test")
    eval_dir = os.path.join(dataset_split_path, "eval")

    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(test_dir, exist_ok=True)
    os.makedirs(eval_dir, exist_ok=True)

    # Split new_images into train, test, and eval
    new_images_dir = dataset_raw_path
    image_files = [
        os.path.join(new_images_dir, f)
        for f in os.listdir(new_images_dir)
        if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.gif'))
    ]

    random.shuffle(image_files)
    total_images = len(image_files)
    train_split = int(train_percent * total_images)
    test_split = int(test_percent * total_images)
    eval_split = total_images - train_split - test_split

    train_images = image_files[:train_split]
    test_images = image_files[train_split:train_split + test_split]
    eval_images = image_files[train_split + test_split:]

    # Move images to respective subfolders
    for image_path in train_images:
        shutil.move(image_path, os.path.join(train_dir, os.path.basename(image_path)))
    for image_path in test_images:
        shutil.move(image_path, os.path.join(test_dir, os.path.basename(image_path)))
    for image_path in eval_images:
        shutil.move(image_path, os.path.join(eval_dir, os.path.basename(image_path)))

    print("Dataset split complete.")


train_percent = 0.7
test_percent = 0.1
split_dataset(dataset_dir, dataset_dir, train_percent, test_percent)

### 6. Generating labels in Pascal VOC format and visualizing annotations:
Now that all the required utility functions have been defined, we are ready to generate the labels using the teacher model. These labels are intended to be used for training the student model. The following cell runs the complete loop: it pre-processes the images, runs the inference with the RT-DETR model, post-processes the output tensor, and reformats the result to the Pascal VOC XML format to prepare it for training the student model.

In [None]:
label_idx = 0
annotation_progress = widgets.IntProgress(
    value=0,
    min=0,
    max=num_samples_images,
    description='Data annotation:',
    bar_style='success', # 'success', 'info', 'warning', 'danger' or ''
    orientation='horizontal'
)
display(annotation_progress)

annotations_dir = "generated_annotations"
with open(labels_file_path) as labels_file:
    class_names = [name.strip() for name in labels_file]

if not os.path.exists(annotations_dir):
    os.mkdir(annotations_dir)
    print(f"Directory '{annotations_dir}' created.\n")
else:
    print(f"Directory '{annotations_dir}' already exists.\n")

for subdir in ['train', 'test', 'eval']:
    subdir_path = os.path.join(dataset_dir, subdir)
    for img_filename in os.listdir(subdir_path):
        if img_filename.lower().endswith(("jpg", "png", "jpeg")):
            img_path = os.path.join(subdir_path, img_filename)
            img = cv2.imread(img_path)
            img_height, img_width, img_channels = img.shape
            preprocessed_img = preprocess_input(img, input_width, input_height)
            stai_teacher_model.set_input(0, preprocessed_img)
            stai_teacher_model.run()
            output_tensors = stai_teacher_model.get_output(index=0)
            boxes, scores, class_ids = postprocess_rtdetr(output=output_tensors, target_labels=[0])
            # postprocess_rtdetr returns empty lists when nothing is detected
            if isinstance(class_ids, list):
                print(f"No object found in {img_filename}")
                nb_detections = 0
            else:
                nb_detections = class_ids.shape[0]
            objects = format_predictions_for_xml(nb_detections, class_names, class_ids, boxes)
            img_name_without_extension = os.path.splitext(img_filename)[0]
            output_file = os.path.join(subdir_path, f"{img_name_without_extension}.xml")
            # Record the size of the stored image: the boxes are expressed in its pixel coordinates
            generate_voc_xml_annotation(
                img_width, img_height, img_channels, objects, output_file, filename=img_filename, path=img_path)
            label_idx = label_idx + 1
            annotation_progress.value = label_idx

Next, we will visualize the generated labels to ensure their quality and accuracy before proceeding with the student model training. This step involves displaying a subset of the annotated images with their corresponding **bounding boxes**, as predicted by the teacher model. By doing so, we can manually inspect the annotations to verify that the objects are correctly identified and localized. This visual inspection is crucial for identifying any potential errors or inconsistencies in the labeling process, allowing us to make necessary adjustments and ensure that the student model is trained on high-quality data. The following cell will display a grid of annotated images, providing a clear and intuitive overview of the labeling results.

In [None]:
def parse_voc_xml(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    boxes = []
    scores = []
    class_ids = []
    class_name_to_id = {'person': 1}  # Example mapping, extend as needed

    for obj in root.findall('object'):
        bbox = obj.find('bndbox')
        xmin = int(bbox.find('xmin').text)
        ymin = int(bbox.find('ymin').text)
        xmax = int(bbox.find('xmax').text)
        ymax = int(bbox.find('ymax').text)
        boxes.append([xmin, ymin, xmax, ymax])
        scores.append(1.0)  # Assuming score is 1.0 for all detections
        class_name = obj.find('name').text
        class_ids.append(class_name_to_id.get(class_name, 0))  # Default to 0 if class not found

    if not boxes:
        boxes = np.empty((0, 4))
    else:
        boxes = np.array(boxes)

    scores = np.array(scores)
    class_ids = np.array(class_ids)

    return sv.Detections(xyxy=boxes, confidence=scores, class_id=class_ids)

def display_annotated_images(image_paths, detections_list):
    images = [cv2.imread(path) for path in image_paths]

    box_annotator = sv.BoxAnnotator()
    annotated_images = []
    for image, detections in zip(images, detections_list):
        annotated_image = box_annotator.annotate(
            scene=image.copy(),
            detections=detections
        )
        annotated_images.append(annotated_image)

    # Convert BGR images to RGB for matplotlib
    annotated_images_rgb = [cv2.cvtColor(img, cv2.COLOR_BGR2RGB) for img in annotated_images]

    # Create a figure with 1 row and len(image_paths) columns
    # (squeeze=False keeps axes two-dimensional even when a single image is shown)
    fig, axes = plt.subplots(1, len(image_paths), figsize=(20, 5), squeeze=False)

    # Display each image in the grid
    for ax, img in zip(axes[0], annotated_images_rgb):
        ax.imshow(img)
        ax.axis('off')  # Hide axes

    plt.show()

# Example usage: pair each image with its own annotation file
image_paths = sorted(glob.glob(f"{dataset_dir}/train/*.jpg"))[:5]
xml_paths = [os.path.splitext(path)[0] + ".xml" for path in image_paths]

detections_list = [parse_voc_xml(xml) for xml in xml_paths]
display_annotated_images(image_paths, detections_list)

### 7. Data loader instance for the SSD MobileNetv2:
To facilitate the training process of the student model, we need to define a data loader that efficiently handles the loading and batching of our dataset. The data loader is responsible for reading the images and their corresponding Pascal VOC XML annotations, applying any necessary transformations and augmentations, and organizing the data into batches suitable for training. This ensures that the model receives data in a consistent and optimized manner, improving the training efficiency and performance. The data loader also shuffles the dataset so that the model generalizes well and does not overfit to specific data patterns.

We start by defining a class that applies some transformations to the dataset before feeding it into the data loader: converting the image data to float32, normalizing the bounding box coordinates to a fraction of the image dimensions, resizing the image to a specified size, subtracting a mean value from the image for normalization, dividing by a standard deviation, and transposing the image dimensions to have the channel first. This ensures the images are in the correct format and scale for training.
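
As a worked example of the bounding box normalization, the sketch below divides the pixel coordinates of one box by the image dimensions. `normalize_box` is a hypothetical helper and the sample values are chosen for illustration.

```python
from typing import Tuple


def normalize_box(box: Tuple[float, float, float, float],
                  width: int, height: int) -> Tuple[float, float, float, float]:
    """Express an (x1, y1, x2, y2) pixel box as fractions of the image size."""
    x1, y1, x2, y2 = box
    return (x1 / width, y1 / height, x2 / width, y2 / height)


# A box inside a 256x256 image
relative_box = normalize_box((64, 32, 192, 224), 256, 256)
```

Normalized boxes stay valid after the image is resized, which is why this step is applied before `cv2.resize` changes the resolution.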

In [None]:
class TransformData:
    def __init__(self, size, mean=0, std=1.0):
        """
        Args:
            size: the size of the final image.
            mean: mean pixel value per channel.
            std: standard deviation for normalization.
        """
        self.mean = np.array(mean, dtype=np.float32)
        self.size = size
        self.std = std

    def __call__(self, img, boxes, labels):
        """
        Args:
            img: the output of cv.imread in RGB layout.
            boxes: bounding boxes in the form of (x1, y1, x2, y2).
            labels: labels of boxes.
        """
        img = img.astype(np.float32)
        height, width, _ = img.shape
        if boxes.size != 0:
            boxes[:, 0] /= width
            boxes[:, 2] /= width
            boxes[:, 1] /= height
            boxes[:, 3] /= height
        img = cv2.resize(img, (self.size, self.size))
        img -= self.mean
        img /= self.std
        img = np.transpose(img, (2, 0, 1))

        return img, boxes, labels

The following cell defines a custom dataset class that follows the PASCAL VOC format and can be fed to the data loader imported from the torch library. It details the steps for reading the images and their annotations and for applying the transformations; the data loader then batches the images and labels for the training loop.
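
To show what shuffling and batching mean for such a dataset, here is a minimal pure-Python sketch. It is not the torch data loader used by this notebook: `iter_batches` is a hypothetical helper that works on any object exposing `__len__` and `__getitem__`, and the toy samples stand in for `(image, boxes, labels)` tuples.

```python
import random
from typing import Iterator, List, Sequence


def iter_batches(dataset: Sequence, batch_size: int,
                 shuffle: bool = True, seed: int = 0) -> Iterator[List]:
    """Yield lists of samples, visiting every index of the dataset exactly once."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


# Toy samples shaped like (image, boxes, labels)
toy_dataset = [("img%d" % i, [[0.1, 0.1, 0.5, 0.5]], [1]) for i in range(5)]
batches = list(iter_batches(toy_dataset, batch_size=2))
```

The last batch is simply smaller when the dataset size is not a multiple of the batch size.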

In [None]:
from pathlib import Path
import logging

class VOCDataset:
    def __init__(self, root, dataset_type='train', transform=None, target_transform=None, label_file=None):
        """Dataset for VOC data.
        Args:
            root: the root of the dataset; it contains the sub-directories train, test,
                and eval, each holding the images and their XML annotations.
            dataset_type: specify the dataset type ('train', 'test', or 'eval').
            transform: optional callable applied to (image, boxes, labels).
            target_transform: optional callable applied to (boxes, labels).
            label_file: optional path to a text file with one class name per line.
        """
        self.root = root
        self.dataset_type = dataset_type
        self.transform = transform
        self.target_transform = target_transform

        # Determine the image directory based on the dataset type
        image_dir = self.dataset_type
        files = glob.glob(os.path.join(self.root, image_dir, "*.[jp][pn]g"))
        self.ids = [Path(file).stem for file in files]

        # Read class names from the labels file if it exists
        # (defaults to the notebook-level labels_file_path when label_file is not given)
        if label_file is None:
            label_file = labels_file_path
        if os.path.isfile(label_file):
            with open(label_file, 'r') as infile:
                classes = [line.strip() for line in infile]
            classes.insert(0, 'BACKGROUND')
            self.class_names = tuple(classes)
            logging.info("VOC Labels read from file: " + str(self.class_names))
        else:
            logging.info("No labels file, using default VOC classes.")
            self.class_names = ('BACKGROUND', 'aeroplane', 'bicycle',
                                'bird', 'boat', 'bottle', 'bus',
                                'car', 'cat', 'chair', 'cow',
                                'diningtable', 'dog', 'horse',
                                'motorbike', 'person', 'pottedplant',
                                'sheep', 'sofa', 'train', 'tvmonitor')

        self.class_dict = {class_name: i for i, class_name in enumerate(self.class_names)}

    def __getitem__(self, index):
        image_id = self.ids[index]
        image = self._read_image(image_id)
        boxes, labels = self._get_annotation(image_id)

        if self.transform:
            image, boxes, labels = self.transform(image, boxes, labels)
        if self.target_transform:
            boxes, labels = self.target_transform(boxes, labels)
        return image, boxes, labels

    def __len__(self):
        return len(self.ids)

    def _get_annotation(self, image_id):
        annotation_file = os.path.join(self.root, self.dataset_type, f"{image_id}.xml")
        objects = ET.parse(annotation_file).findall("object")
        boxes, labels = [], []

        for obj in objects:
            class_name = obj.find('name').text.lower().strip()
            if class_name in self.class_dict:
                bbox = obj.find('bndbox')
                x1 = float(bbox.find('xmin').text) - 1
                y1 = float(bbox.find('ymin').text) - 1
                x2 = float(bbox.find('xmax').text) - 1
                y2 = float(bbox.find('ymax').text) - 1
                boxes.append([x1, y1, x2, y2])
                labels.append(self.class_dict[class_name])

        return np.array(boxes, dtype=np.float32), np.array(labels, dtype=np.int64)

    def _read_image(self, image_id):
        image_dir = self.dataset_type
        image_file = os.path.join(self.root, image_dir, f"{image_id}.jpg")
        if not os.path.exists(image_file):
            image_file = os.path.join(self.root, image_dir, f"{image_id}.png")

        image = cv2.imread(image_file)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        return image
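
To make the annotation format concrete, here is a minimal, self-contained sketch of the parsing performed by `_get_annotation`. The XML string and the helper name `parse_voc_boxes` are illustrative assumptions and not part of the notebook; only the tag layout follows the PASCAL VOC convention used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation written in the PASCAL VOC layout (illustrative values).
SAMPLE_XML = """
<annotation>
  <filename>img_0001.jpg</filename>
  <object>
    <name>Person</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>dog</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>
"""

def parse_voc_boxes(xml_text, class_dict):
    """Return (boxes, labels) for the objects whose class is in class_dict."""
    root = ET.fromstring(xml_text)
    boxes, labels = [], []
    for obj in root.findall("object"):
        name = obj.find("name").text.lower().strip()
        if name not in class_dict:
            continue  # classes outside the label set are skipped
        bbox = obj.find("bndbox")
        # VOC pixel coordinates are 1-based, hence the -1 offset.
        box = [float(bbox.find(tag).text) - 1 for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append(box)
        labels.append(class_dict[name])
    return boxes, labels

boxes, labels = parse_voc_boxes(SAMPLE_XML, {"BACKGROUND": 0, "person": 1})
print(boxes, labels)
```

With a single-class label set, the `dog` object is ignored and only the `person` box is kept, which is the behavior expected for this use case.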

_______________________________________________________________________
## IV. Training the student Neural Network: SSD MobileNet V2 for object detection
With the labeled dataset prepared and verified, we can now proceed to train the student model. The student model is trained on the annotations generated by the teacher model and learns the object detection task from these automatically produced labels. Training consists of feeding the pre-processed images and their corresponding labels to the student model and iteratively optimizing its parameters. The goal is a model that detects objects accurately and efficiently while using fewer resources or a simpler architecture than the teacher model. The following cells detail the configuration, the hyperparameters, and the training loop needed to train the student model. <br> The first step consists of generating the anchor boxes used by the anchor box matcher.

### 1. Generating the Anchor box for the SSD MobileNetV2 model:
We start by defining the `AnchorBoxMatcher` class, which prepares the ground truth data for training: it matches each ground truth box to the predefined anchor boxes and encodes the matched boxes as regression targets. The first cell below defines the helper functions (IoU, box format conversions, anchor assignment) and the second one defines the class itself, which is used later as the target transformation of the dataset.

In [None]:
#Compute the areas of rectangles given two corners.
def area_of(left_top, right_bottom) -> torch.Tensor:
    hw = torch.clamp(right_bottom - left_top, min=0.0)
    return hw[..., 0] * hw[..., 1]

def corner_form_to_center_form(boxes):
    return torch.cat([(boxes[..., :2] + boxes[..., 2:]) / 2,
                       boxes[..., 2:] - boxes[..., :2]], boxes.dim() - 1)

def center_form_to_corner_form(locations):
    return torch.cat([locations[..., :2] - locations[..., 2:]/2,
                     locations[..., :2] + locations[..., 2:]/2], locations.dim() - 1)

def iou_of(boxes0, boxes1, eps=1e-5):
    overlap_left_top = torch.max(boxes0[..., :2], boxes1[..., :2])
    overlap_right_bottom = torch.min(boxes0[..., 2:], boxes1[..., 2:])

    overlap_area = area_of(overlap_left_top, overlap_right_bottom)
    area0 = area_of(boxes0[..., :2], boxes0[..., 2:])
    area1 = area_of(boxes1[..., :2], boxes1[..., 2:])
    return overlap_area / (area0 + area1 - overlap_area + eps)

def convert_boxes_to_locations(center_form_boxes, center_form_anchors, center_variance, size_variance):
    # anchors can have one dimension less
    if center_form_anchors.dim() + 1 == center_form_boxes.dim():
        center_form_anchors = center_form_anchors.unsqueeze(0)
    return torch.cat([
        (center_form_boxes[..., :2] - center_form_anchors[...,
         :2]) / center_form_anchors[..., 2:] / center_variance,
        torch.log(center_form_boxes[..., 2:] /
                  center_form_anchors[..., 2:]) / size_variance
    ], dim=center_form_boxes.dim() - 1)

# Assign ground truth boxes and targets to anchors.
def assign_anchors(gt_boxes, gt_labels, corner_form_anchors, iou_threshold):
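    if gt_boxes.size(0) == 0:
        # No ground truth: use one dummy box of shape (1, 4) so that the IoU
        # matrix below keeps its (num_anchors, num_targets) layout.
        gt_boxes = torch.zeros((1, 4), dtype=torch.float32)
        gt_labels = torch.zeros((1,), dtype=torch.int64)
        no_gt = True
    else:
        no_gt = False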
    ious = iou_of(gt_boxes.unsqueeze(0), corner_form_anchors.unsqueeze(1))
    best_target_per_anchor, best_target_per_anchor_index = ious.max(1)
    best_anchor_per_target, best_anchor_per_target_index = ious.max(0)
    
    # Ensure best_anchor_per_target_index is a 1-dimensional tensor
    best_anchor_per_target_index = best_anchor_per_target_index.view(-1)

    for target_index, anchor_index in enumerate(best_anchor_per_target_index):
        best_target_per_anchor_index[anchor_index] = target_index
    best_target_per_anchor.index_fill_(0, best_anchor_per_target_index, 2)
    if no_gt:
        labels = torch.zeros((corner_form_anchors.size(0),), dtype=torch.int64)
        boxes = torch.zeros((corner_form_anchors.size(0), 4), dtype=torch.float32)
    else:
        labels = gt_labels[best_target_per_anchor_index]
        labels[best_target_per_anchor < iou_threshold] = 0  # the background id
        boxes = gt_boxes[best_target_per_anchor_index]
    return boxes, labels
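
The matching rule implemented by `iou_of` and `assign_anchors` can be illustrated without tensors. The following is a dependency-free sketch, not the notebook's implementation: the function names `iou` and `match_anchors` and the box values are assumptions chosen for illustration. An anchor is assigned to a ground truth box when their IoU reaches the threshold, and each ground truth box always keeps its best anchor.

```python
def iou(box_a, box_b, eps=1e-5):
    """IoU of two boxes given in corner form (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + eps)

def match_anchors(gt_boxes, anchors, iou_threshold):
    """Return, per anchor, the matched ground truth index or -1 for background."""
    if not gt_boxes:
        return [-1] * len(anchors)
    matches = []
    for anchor in anchors:
        ious = [iou(gt, anchor) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda k: ious[k])
        matches.append(best if ious[best] >= iou_threshold else -1)
    # Every ground truth box keeps its best anchor, even below the threshold.
    for k, gt in enumerate(gt_boxes):
        best_anchor = max(range(len(anchors)), key=lambda a: iou(gt, anchors[a]))
        matches[best_anchor] = k
    return matches

gt_boxes = [(0.10, 0.10, 0.50, 0.50)]
anchors = [
    (0.10, 0.10, 0.50, 0.50),  # identical to the ground truth box
    (0.30, 0.30, 0.70, 0.70),  # partial overlap, IoU about 0.14
    (0.60, 0.60, 0.90, 0.90),  # no overlap
]
matches = match_anchors(gt_boxes, anchors, iou_threshold=0.45)
print(matches)
```

Only the first anchor is matched here; the two others fall below the threshold and are treated as background.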

In [None]:
class AnchorBoxMatcher(object):
    def __init__(self, center_form_anchors, center_variance, size_variance, iou_threshold):
        self.center_variance = center_variance
        self.size_variance = size_variance
        self.iou_threshold = iou_threshold
        self.center_form_anchors = center_form_anchors
        self.corner_form_anchors = center_form_to_corner_form(center_form_anchors)

    def __call__(self, gt_boxes, gt_labels):
        if type(gt_boxes) is np.ndarray:
            gt_boxes = torch.from_numpy(gt_boxes)
        if type(gt_labels) is np.ndarray:
            gt_labels = torch.from_numpy(gt_labels)
        boxes, labels = assign_anchors(gt_boxes, gt_labels, self.corner_form_anchors, self.iou_threshold)
        boxes = corner_form_to_center_form(boxes)
        locations = convert_boxes_to_locations(boxes, self.center_form_anchors, self.center_variance, self.size_variance)
        return locations, labels
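
As a worked example of the encoding applied by `convert_boxes_to_locations`, the sketch below encodes a single center-form box against a single anchor in plain Python. The function name `encode` and the numeric values are assumptions for illustration; the formula mirrors the one in the cell above.

```python
import math

def encode(box, anchor, center_variance=0.1, size_variance=0.2):
    """Encode a center-form box (cx, cy, w, h) relative to a center-form anchor."""
    cx, cy, w, h = box
    acx, acy, aw, ah = anchor
    return (
        (cx - acx) / aw / center_variance,   # horizontal offset in anchor widths
        (cy - acy) / ah / center_variance,   # vertical offset in anchor heights
        math.log(w / aw) / size_variance,    # log-scale width ratio
        math.log(h / ah) / size_variance,    # log-scale height ratio
    )

anchor = (0.5, 0.5, 0.2, 0.2)
box = (0.52, 0.48, 0.4, 0.1)
offsets = encode(box, anchor)
print([round(v, 4) for v in offsets])
```

A box that is twice as wide as its anchor yields `log(2) / 0.2`, about 3.47, so dividing by the variances brings small geometric differences into a range that is easier to regress.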

The `generate_anchors` function generates the predefined bounding boxes (anchors) of various sizes and aspect ratios that are placed uniformly across the image. It takes a list of specifications (`specs`) that define the feature map size, shrinkage, box sizes, and aspect ratios for each prediction layer of the SSD model. It computes the center coordinates, widths, and heights of the anchor boxes relative to the image size and returns them as a PyTorch tensor.

In [None]:
import collections
from typing import List
import itertools
import math

AnchorBoxSizes = collections.namedtuple('AnchorBoxSizes', ['min', 'max'])
AnchorSpec = collections.namedtuple( 'AnchorSpec', ['feature_map_size', 'shrinkage', 'box_sizes', 'aspect_ratios'])

def generate_anchors(specs: List[AnchorSpec], image_size, clamp=True):
    """Generate SSD Anchor Boxes.
    It returns the center, width and height of the anchor boxes. The values are relative to the image size.
    Args:
        specs: list of AnchorSpec describing the sizes and shapes of the anchor boxes.
        image_size: image size.
        clamp: if true, clamp the values so that they fall within [0.0, 1.0].
    Returns:
        anchors (num_anchors, 4): The anchor boxes represented as [[center_x, center_y, w, h]]. All the values are relative to the image size.
    """
    anchors = []
    for spec in specs:
        scale = image_size / spec.shrinkage
        for j, i in itertools.product(range(spec.feature_map_size), repeat=2):
            x_center = (i + 0.5) / scale
            y_center = (j + 0.5) / scale

            # small sized square box
            size = spec.box_sizes.min
            h = w = size / image_size
            anchors.append([ x_center, y_center, w, h])

            # big sized square box
            size = math.sqrt(spec.box_sizes.max * spec.box_sizes.min)
            h = w = size / image_size
            anchors.append([x_center, y_center, w, h])

            # change h/w ratio of the small sized box
            size = spec.box_sizes.min
            h = w = size / image_size
            for ratio in spec.aspect_ratios:
                ratio = math.sqrt(ratio)
                anchors.append([x_center, y_center, w * ratio, h / ratio])
                anchors.append([x_center, y_center, w / ratio, h * ratio])

    anchors = torch.tensor(anchors)
    if clamp:
        torch.clamp(anchors, 0.0, 1.0, out=anchors)
    return anchors
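
To see which shapes one feature-map cell receives, the sketch below reproduces the per-cell part of `generate_anchors` in plain Python. The helper name `cell_anchor_shapes` and the input values are assumptions chosen for illustration.

```python
import math

def cell_anchor_shapes(min_size, max_size, aspect_ratios, image_size):
    """(w, h) pairs, relative to the image, generated for one feature-map cell."""
    small = min_size / image_size
    big = math.sqrt(max_size * min_size) / image_size
    shapes = [(small, small), (big, big)]  # the two square boxes
    for ratio in aspect_ratios:
        r = math.sqrt(ratio)
        shapes.append((small * r, small / r))  # wide box
        shapes.append((small / r, small * r))  # tall box
    return shapes

shapes = cell_anchor_shapes(min_size=60, max_size=105, aspect_ratios=[2, 3], image_size=300)
for w, h in shapes:
    print(f"w={w:.3f}  h={h:.3f}  area={w * h:.4f}")
```

Each cell receives `2 + 2 * len(aspect_ratios)` anchors. Scaling the width by `sqrt(ratio)` and the height by its inverse changes the aspect ratio while keeping the area of the small square box.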

In [None]:
from torch.utils.data import DataLoader

ssd_input_size = 300
ssd_specs = [
    AnchorSpec(19, 16, AnchorBoxSizes(60, 105), [2, 3]),
    AnchorSpec(10, 32, AnchorBoxSizes(105, 150), [2, 3]),
    AnchorSpec(5, 64, AnchorBoxSizes(150, 195), [2, 3]),
    AnchorSpec(3, 100, AnchorBoxSizes(195, 240), [2, 3]),
    AnchorSpec(2, 150, AnchorBoxSizes(240, 285), [2, 3]),
    AnchorSpec(1, 300, AnchorBoxSizes(285, 330), [2, 3])
]

center_variance = 0.1
size_variance   = 0.2
iou_threshold   = 0.45
image_mean      = np.array([127, 127, 127])  # RGB layout 
image_std       = 128.0
batch_size      = 4

ssd_anchors = generate_anchors(specs=ssd_specs,  image_size=ssd_input_size)
target_transform = AnchorBoxMatcher(ssd_anchors, center_variance, size_variance, iou_threshold)

# Loading the training dataset using the dataloader from Torch
train_transform = TransformData(size=ssd_input_size, mean=image_mean, std=image_std)
train_dataset = VOCDataset(dataset_dir, transform=train_transform, target_transform=target_transform)
train_loader  = DataLoader(train_dataset, batch_size, shuffle=True)

# Loading the validation dataset using the dataloader from Torch
valid_transform = TransformData(size=ssd_input_size, mean=image_mean, std=image_std)
valid_dataset = VOCDataset(dataset_dir, dataset_type="eval", transform=valid_transform, target_transform=target_transform)
valid_loader  = DataLoader(valid_dataset, batch_size, shuffle=False)
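
As a quick sanity check on the specifications above, the total number of anchors can be computed by hand. This is a small sketch in plain Python; the variable names are illustrative, and the feature map sizes are copied from `ssd_specs`.

```python
# Feature map sizes taken from ssd_specs above; each layer uses aspect ratios [2, 3].
feature_map_sizes = [19, 10, 5, 3, 2, 1]
num_aspect_ratios = 2

# Two square boxes plus two boxes (wide and tall) per aspect ratio.
anchors_per_cell = 2 + 2 * num_aspect_ratios

per_layer = [anchors_per_cell * size * size for size in feature_map_sizes]
total_anchors = sum(per_layer)
print(per_layer, total_anchors)
```

The result should match the first dimension of `ssd_anchors`, and, assuming the usual SSD head layout, the number of rows in the confidence and location outputs of the model for one image.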

### 2. Loading the student model and the training artifacts:
As mentioned earlier, the **SSD MobileNet V2** model, also known as the student model, was previously trained on the **PASCAL VOC** dataset. This use case only specializes the model, so that it reaches high accuracy on a specific task using training data collected on the device. <br>
The first step consists of loading the SSD MobileNet V2 model along with the training artifacts, which are the training, evaluation, and optimizer subgraphs. For that purpose, we use the following classes from the `onnxruntime-training` Python module: `Module` to load the training and evaluation graphs, `Optimizer` to load the optimizer graph, and `CheckpointState` to load the pre-trained weights, if there are any. We start by defining the required training parameters.

In [None]:
import onnxruntime.training.api as orttraining

# Training Parameters
artifacts_dir_path = JUPYTER_ROOT_PATH + "student_model/ssd_mobilenet_v2/training_artifacts/"
learning_rate = 0.005

checkpoint_state = orttraining.CheckpointState.load_checkpoint(
    f"{artifacts_dir_path}checkpoint")

model = orttraining.Module(
    f"{artifacts_dir_path}training_model.onnx",
    checkpoint_state,
    f"{artifacts_dir_path}eval_model.onnx",
)

optimizer = orttraining.Optimizer(
    f"{artifacts_dir_path}optimizer_model.onnx", model
)
optimizer.set_learning_rate(learning_rate=learning_rate)

### 3. Launching the training loop along with model evaluation
Now that all the objects and variables necessary for the training have been instantiated, we are ready to launch the **training loop**. The `model.train()` method sets the model in training mode, so that calling the model runs the training subgraph. The `optimizer.step()` method updates the model parameters based on the computed gradients. The `model.lazy_reset_grad()` method sets the internal state of the module so that the gradients are reset just before the new gradients are computed on the next training step.

In [None]:
def train(model, dataloader, optimizer, epoch, num_epochs):
    losses = []
    for i, data in enumerate(dataloader):
        model.train()
        images, boxes, labels = data
        loss, confs, _ = model(np.array(images), np.array(labels).astype(np.float32), np.array(boxes))
        optimizer.step()
        model.lazy_reset_grad()
        losses.append(loss.item())
    return sum(losses) / len(losses)

Next, we define the **evaluation loop**: we set the model to evaluation mode with `model.eval()` and loop over all the batches, calling the evaluation graph, which returns the **evaluation loss**.

In [None]:
def eval(model, dataloader):
    model.eval()
    losses = []
    for i, data in enumerate(dataloader):
        images, boxes, labels = data
        loss, _, _ = model(np.array(images), np.array(labels).astype(np.float32), np.array(boxes))
        losses.append(loss.item())
    return sum(losses) / len(losses)

Set the `num_epochs` variable according to the complexity of the use case, then launch the on-device learning loop, which calls the `train()` and `eval()` functions successively at each epoch. You should see both the training loss and the validation loss decrease as the epochs advance. If the validation loss stops decreasing or starts rising while the training loss keeps falling, the model is likely overfitting, often because of dataset issues such as too few or insufficiently varied samples.

In [None]:
num_epochs = 50
for epoch in range(0, num_epochs):
    train_loss = train(model=model,
                       dataloader=train_loader, 
                       optimizer=optimizer, 
                       epoch=epoch, 
                       num_epochs=num_epochs)
    val_loss = eval(model=model, dataloader=valid_loader)
    print(f"Epoch: {epoch + 1} / {num_epochs}, Training Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}")
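
One simple way to detect the overfitting pattern described above is to track how many epochs have passed since the best validation loss. The sketch below is an optional addition and not part of the original workflow: the function names, the `patience` value, and the loss history are all assumptions for illustration.

```python
def epochs_since_best(val_losses):
    """Number of epochs elapsed since the lowest validation loss so far."""
    best_epoch = min(range(len(val_losses)), key=lambda e: val_losses[e])
    return len(val_losses) - 1 - best_epoch

def should_stop(val_losses, patience=5):
    """True when the validation loss has not improved for `patience` epochs."""
    return len(val_losses) > 0 and epochs_since_best(val_losses) >= patience

# Hypothetical validation losses: improvement stops after the fourth epoch.
history = [2.10, 1.60, 1.32, 1.25, 1.27, 1.31, 1.30, 1.36, 1.40]
for epoch in range(1, len(history) + 1):
    if should_stop(history[:epoch], patience=5):
        print(f"No improvement for 5 epochs, stopping at epoch {epoch}")
        break
```

In the loop above, this would amount to appending `val_loss` to a list after each epoch and breaking out of the loop when `should_stop` returns `True`.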

### 4. Exporting model for inference
Once the training session reaches a satisfactory validation loss, it is time to **export the model for inference**. This involves converting the trained model into a format that can be used efficiently to make predictions on new data. The process typically includes saving the model architecture, the weights, and any necessary preprocessing steps in a deployable format such as ONNX. The exported model can then be loaded into an inference engine or runtime environment, where it processes input data and generates predictions in real time or in batch mode.


In [None]:
inference_models_dir = JUPYTER_ROOT_PATH + 'student_model/ssd_mobilenet_v2/inference_artifacts'

if os.path.exists(f"{inference_models_dir}/ssd_mobilenet_v2.onnx"):
    model.export_model_for_inferencing(f"{inference_models_dir}/new_ssd_mobilenet_v2.onnx", ["confs", "out_boxes"])
    print(f"Model exported to: {inference_models_dir}/new_ssd_mobilenet_v2.onnx")
else:
    model.export_model_for_inferencing(f"{inference_models_dir}/ssd_mobilenet_v2.onnx", ["confs", "out_boxes"])
    print(f"Model exported to: {inference_models_dir}/ssd_mobilenet_v2.onnx")

_______________________________________________________________________
## V. Running inference using the newly exported model:
Before running inference on the device with the updated model, it is necessary to apply some transformations to the model so that it runs with optimal performance on the AI hardware accelerator: the **NPU** (Neural Processing Unit) available on the STM32MP25 product family. These transformations include **static quantization** and **making the dynamic shapes fixed** to allow execution on the NPU.

### 1. Quantizing and optimizing the student model:
A convolutional neural network is generally retrained in floating-point format. To take advantage of the NPU acceleration, 8-bit linear quantization is required. For that purpose, we use **static quantization**. <br>
Static quantization first runs the model on a set of inputs known as **calibration data**. During these runs, the quantization parameters of each activation are computed. These parameters are then embedded as constants in the quantized model and applied to all inputs. The ONNX Runtime quantization tool supports three calibration methods: MinMax, Entropy, and Percentile.

In [None]:
from onnxruntime.quantization import quant_pre_process, quantize_static, QuantFormat, QuantType, CalibrationDataReader
import onnxruntime as ort

def _preprocess_images(images_folder: str, height: int, width: int, nb_images=10):
    unconcatenated_batch_data = []
    img_counter = 0
    for img_filename in sorted(os.listdir(images_folder)):
        if img_counter >= nb_images:  # stop once nb_images images are collected
            break
        image = cv2.imread(images_folder + f"/{img_filename}")
        if image is None:
            continue
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image, _, _ = train_transform(image, np.empty((0, 4), dtype=np.float32), np.empty((0,), dtype=np.int32))
        input_tensor = np.expand_dims(image, axis=0)
        unconcatenated_batch_data.append(input_tensor)
        img_counter += 1
    batch_data = np.concatenate(np.expand_dims(unconcatenated_batch_data, axis=0), axis=0)
    return batch_data

class SSDDataReader(CalibrationDataReader):
    def __init__(self, calibration_image_folder: str, model_path: str, nb_images: int):
        self.enum_data = None

        # Use inference session to get input shape.
        session = ort.InferenceSession(model_path, None, providers=['CPUExecutionProvider'])
        (_, _, height, width) = session.get_inputs()[0].shape

        # Convert image to input data
        self.nhwc_data_list = _preprocess_images(calibration_image_folder, height, width, nb_images)
        self.input_name = session.get_inputs()[0].name
        self.datasize = len(self.nhwc_data_list)

    def get_next(self):
        if self.enum_data is None:
            self.enum_data = iter(
                [{self.input_name: nhwc_data}
                    for nhwc_data in self.nhwc_data_list]
            )
        return next(self.enum_data, None)

    def rewind(self):
        self.enum_data = None

######## Quantization ########
calib_dir_path = dataset_dir + "/train/"
nb_images_calib = int(num_samples_images * 0.4)  # 40% of num_samples_images, as an integer count

if os.path.exists(f"{inference_models_dir}/new_ssd_mobilenet_v2.onnx"):
    float_model_path = f"{inference_models_dir}/new_ssd_mobilenet_v2.onnx"
else:
    float_model_path = f"{inference_models_dir}/ssd_mobilenet_v2.onnx"

# Preprocessing before quantization
quant_pre_process(float_model_path, f"{inference_models_dir}/new_ssd_mobilenet_v2_pp.onnx")

# Static quantization
quant_model_name = "new_ssd_mobilenet_v2_quant"
datareader = SSDDataReader(calib_dir_path, float_model_path, nb_images=nb_images_calib)

quantize_static(f"{inference_models_dir}/new_ssd_mobilenet_v2_pp.onnx",
                f"{inference_models_dir}/{quant_model_name}.onnx",
                calibration_data_reader=datareader, activation_type=QuantType.QInt8,
                weight_type=QuantType.QInt8, quant_format=QuantFormat.QDQ,
                per_channel=True, reduce_range=True)

print("Quantization done!")

You should notice the generation of a new model `new_ssd_mobilenet_v2_quant.onnx` in your filesystem.
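
To illustrate what 8-bit linear quantization does to a single value, here is a dependency-free sketch of the affine mapping, assuming a MinMax-style calibration that observed a given activation range. The function names and the range are illustrative assumptions; the actual ONNX Runtime implementation differs in its details (for example per-channel weights and `reduce_range`).

```python
def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Scale and zero point for 8-bit linear (affine) quantization."""
    # The range must include 0 so that zero is represented exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))  # saturate outside the calibrated range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Hypothetical activation range observed during calibration.
scale, zero_point = quant_params(-1.0, 3.0)
for value in (-1.0, 0.0, 1.0, 3.0, 10.0):
    q = quantize(value, scale, zero_point)
    print(f"{value:6.2f} -> q={q:4d} -> {dequantize(q, scale, zero_point):.4f}")
```

Values inside the calibrated range are recovered with an error of at most half a quantization step, while values outside it saturate. This is why the calibration images should be representative of the data seen at inference time.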

### 2. Making dynamic input shapes fixed:
If the model usability checker reports that an ONNX model can potentially be used with the VSINPU Execution Provider, the model may benefit from making its input shapes ‘fixed’, because the VSINPU EP does not support dynamic input shapes. Here, fixing the dynamic shape simply means setting the batch size dimension to 1, which allows the model to run on the NPU hardware inference accelerator. ONNX Runtime provides a tool that makes the dynamic shapes of an ONNX model fixed; run it with the following cell:

In [None]:
!python3 -m onnxruntime.tools.make_dynamic_shape_fixed --dim_param batch --dim_value 1 student_model/ssd_mobilenet_v2/inference_artifacts/new_ssd_mobilenet_v2_quant.onnx student_model/ssd_mobilenet_v2/inference_artifacts/ssd_mobilenet_v2_quant_fixed.onnx

### 3. Defining the SSD MobileNet V2 post process function:
The `convert_locations_to_boxes` function converts the location regression outputs of SSD into boxes in the form **(center_x, center_y, w, h)**. It takes as arguments the locations retrieved from the output tensor of the model, the anchors (priors), and two float parameters, `center_variance` and `size_variance`. It returns the box coordinates relative to the image size.

In [None]:
def convert_locations_to_boxes(locations, priors, center_variance, size_variance):
    # priors can have one dimension less.
    if len(priors.shape) + 1 == len(locations.shape):
        priors = np.expand_dims(priors, 0)
    return np.concatenate([
        locations[..., :2] * center_variance *
        priors[..., 2:] + priors[..., :2],
        np.exp(locations[..., 2:] * size_variance) * priors[..., 2:]
    ], axis=len(locations.shape) - 1)
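
The decoding step can be checked by hand on a single anchor. The sketch below applies the same formula as `convert_locations_to_boxes` in plain Python; the function name `decode` and the numeric values are assumptions for illustration.

```python
import math

def decode(location, prior, center_variance=0.1, size_variance=0.2):
    """Decode SSD regression offsets into a center-form box (cx, cy, w, h)."""
    dx, dy, dw, dh = location
    pcx, pcy, pw, ph = prior
    return (
        dx * center_variance * pw + pcx,    # shift the center by a fraction of the prior width
        dy * center_variance * ph + pcy,    # shift the center by a fraction of the prior height
        math.exp(dw * size_variance) * pw,  # rescale the prior width
        math.exp(dh * size_variance) * ph,  # rescale the prior height
    )

prior = (0.5, 0.5, 0.2, 0.2)
location = (1.0, -1.0, math.log(2) / 0.2, -math.log(2) / 0.2)
box = decode(location, prior)
print([round(v, 4) for v in box])
```

An all-zero location returns the prior itself, which is a useful check when debugging the output tensors of the model.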

The `hard_nms` function performs **Hard Non-Maximum Suppression (NMS)** on a set of bounding boxes with associated scores. It takes in a list of bounding boxes and their scores, an Intersection over Union (IoU) threshold, a parameter top_k to limit the number of results, and a candidate_size to consider only the top-scoring candidates. The function sorts the boxes by their scores, then iteratively selects the box with the highest score, removes it from the list, and suppresses all other boxes that have an IoU greater than the specified threshold with the selected box. This process continues until the desired number of boxes (top_k) is selected or all candidates are processed. The function returns the list of selected boxes.

In [None]:
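def hard_nms(box_scores, iou_threshold, top_k=-1, candidate_size=200):
    # Accept a torch tensor or a NumPy array and work on NumPy arrays internally
    box_scores = np.asarray(box_scores)
    scores = box_scores[:, -1]
    boxes = box_scores[:, :-1]
    picked = []
    # Keep the candidate_size best candidates, sorted by increasing score
    indexes = np.argsort(scores)[-candidate_size:]
    while len(indexes) > 0:
        current = indexes[-1]
        picked.append(current)
        if 0 < top_k == len(picked) or len(indexes) == 1:
            break
        current_box = boxes[current, :]
        indexes = indexes[:-1]
        rest_boxes = boxes[indexes, :]
        # iou_of works on torch tensors, so convert for the IoU computation only
        iou = iou_of(torch.from_numpy(rest_boxes),
                     torch.from_numpy(current_box).unsqueeze(0)).numpy()
        # Drop every remaining box that overlaps the picked box too much
        indexes = indexes[iou <= iou_threshold]

    return box_scores[picked, :]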
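
The suppression logic is easier to follow on a handful of boxes. The following is a plain-Python sketch of hard NMS under the same rule (keep the best box, drop the boxes overlapping it above the threshold, repeat); the names `iou` and `nms` and the detections are assumptions for illustration.

```python
def iou(a, b, eps=1e-5):
    """IoU of two corner-form boxes (x1, y1, x2, y2)."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + eps)

def nms(detections, iou_threshold, top_k=-1):
    """detections: list of (x1, y1, x2, y2, score). Returns the kept ones, best first."""
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        if 0 < top_k == len(kept):
            break
        # Suppress every box that overlaps the picked box too much.
        remaining = [d for d in remaining if iou(best[:4], d[:4]) <= iou_threshold]
    return kept

detections = [
    (0.10, 0.10, 0.50, 0.50, 0.90),
    (0.12, 0.11, 0.52, 0.49, 0.80),  # near-duplicate of the first box
    (0.60, 0.60, 0.90, 0.90, 0.75),  # separate object
]
kept = nms(detections, iou_threshold=0.3)
print(kept)
```

The near-duplicate is suppressed because its IoU with the best box is about 0.86, while the third box does not overlap it and is kept.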

Next, we define the `postprocess_ssdmobilenetv2` function, which applies `convert_locations_to_boxes` and `hard_nms` to the output tensors of the **SSD MobileNet V2** model. It returns NumPy arrays containing the box coordinates scaled to the preview image and the confidence of each kept box, together with the list of class IDs (only Person in this case, since the BACKGROUND class is skipped).

In [None]:
def postprocess_ssdmobilenetv2(scores, boxes, conf_threshold, iou_threshold, preview_shape):
    # preview_shape follows the OpenCV layout: (height, width, channels)
    preview_image_height = preview_shape[0]
    preview_image_width = preview_shape[1]
    # Apply a numerically stable softmax to the scores
    exp_scores = np.exp(scores - np.max(scores, axis=2, keepdims=True))
    scores = exp_scores / np.sum(exp_scores, axis=2, keepdims=True)
    # Decode with the same variances as the ones used for training
    boxes = convert_locations_to_boxes(boxes, ssd_anchors, center_variance, size_variance)
    boxes = np.array(center_form_to_corner_form(torch.from_numpy(boxes)))
    boxes = np.array(boxes[0])
    scores = np.array(scores[0])
    picked_box_probs = []
    picked_labels = []
    # Class 0 is BACKGROUND, so start at class 1
    for class_index in range(1, scores.shape[1]):
        probs = scores[:, class_index]
        mask = probs > conf_threshold
        probs = probs[mask]
        if probs.shape[0] == 0:
            continue
        subset_boxes = boxes[mask, :]
        box_probs = np.concatenate([subset_boxes, probs.reshape(-1, 1)], axis=1)

        box_probs = hard_nms(torch.from_numpy(box_probs), iou_threshold=iou_threshold)
        picked_box_probs.append(box_probs)
        picked_labels.extend([class_index] * box_probs.shape[0])

    if not picked_box_probs:
        # No detection above the confidence threshold
        boxes = np.empty((0, 4))
        probs = np.empty((0,))
    else:
        picked_box_probs = np.concatenate([np.asarray(p) for p in picked_box_probs])
        # Scale the normalized corner coordinates to the preview image
        picked_box_probs[:, 0] *= preview_image_width
        picked_box_probs[:, 1] *= preview_image_height
        picked_box_probs[:, 2] *= preview_image_width
        picked_box_probs[:, 3] *= preview_image_height
        boxes = picked_box_probs[:, :4]
        probs = picked_box_probs[:, 4]
    return boxes, probs, picked_labels
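
Two steps of this post-processing are worth checking in isolation: the softmax over the class scores and the scaling of normalized boxes to pixels. The sketch below is a plain-Python illustration under the assumption that the image shape follows the OpenCV convention `(height, width, channels)`; the function names and values are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtracting the max avoids overflow in exp."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_pixels(box, image_shape):
    """Scale a normalized corner-form box to pixels; image_shape is (height, width, channels)."""
    height, width = image_shape[:2]
    x1, y1, x2, y2 = box
    return (x1 * width, y1 * height, x2 * width, y2 * height)

probs = softmax([1000.0, 1002.0])  # large logits would overflow a naive exp()
pixel_box = to_pixels((0.25, 0.5, 0.75, 1.0), (480, 640, 3))
print([round(p, 4) for p in probs], pixel_box)
```

Note that `img.shape` returned by OpenCV lists the height first, so the x coordinates are multiplied by `shape[1]` and the y coordinates by `shape[0]`.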

### 4. Running the inference using the STAI_MPU API with NPU acceleration:
The last step of this tutorial is to run inference with the updated model, now quantized and with fixed input shapes, using the STAI_MPU API and the NPU hardware acceleration. We use the `postprocess_ssdmobilenetv2` function to feed the box coordinates, the confidences, and the class IDs to the Supervision `Detections` class. For the preprocessing, we apply the same function as for the teacher model but with a different image size, since both models consume similar data at different resolutions. We use the `display_annotated_images` function defined previously to display the detection results of the new student model. You should notice improved detection performance.

In [None]:
from stai_mpu import stai_mpu_network
import cv2
import time
import numpy as np

# Instantiate the student ONNX model with the use_hw_acceleration flag
student_model_path = inference_models_dir + "/ssd_mobilenet_v2_quant_fixed.onnx"
stai_student_model = stai_mpu_network(model_path=student_model_path, use_hw_acceleration=True)

# Read input tensor information
num_inputs = stai_student_model.get_num_inputs()
input_tensor_infos = stai_student_model.get_input_infos()
input_tensor_shape = input_tensor_infos[0].get_shape()
input_tensor_dtype = input_tensor_infos[0].get_dtype()
# The input tensor layout is NCHW: (batch, channels, height, width)
nn_input_channel = input_tensor_shape[1]
nn_input_height = input_tensor_shape[2]
nn_input_width = input_tensor_shape[3]

# Read output tensor information
num_outputs = stai_student_model.get_num_outputs()
output_tensor_infos = stai_student_model.get_output_infos()
output_tensor_shape = output_tensor_infos[0].get_shape()

# Filtering parameters
conf_threshold = 0.7
iou_threshold = 0.3

def run_inference(image_paths):
    detections_list = []
    for img_path in image_paths:
        img = cv2.imread(img_path)
        preprocessed_img = preprocess_input(img, nn_input_width, nn_input_height)
        stai_student_model.set_input(0, np.array(preprocessed_img))
        # Run inference using the STAI_MPU on the newly exported model
        start_time = time.time()
        stai_student_model.run()
        print(f'Inference time for {img_path}: {time.time() - start_time}\n')
        output_tensor_conf = stai_student_model.get_output(index=0)
        output_tensor_bbox = stai_student_model.get_output(index=1)
        boxes, scores, class_ids = postprocess_ssdmobilenetv2(output_tensor_conf,
                                                              output_tensor_bbox,
                                                              conf_threshold,
                                                              iou_threshold,
                                                              img.shape)
        detections = sv.Detections(xyxy=boxes, confidence=scores, class_id=np.array(class_ids))
        detections_list.append(detections)
    return detections_list

image_paths = glob.glob(f"{dataset_dir}/test/*.jpg")
detections_list = run_inference(image_paths[:10])
display_annotated_images(image_paths[:10], detections_list)
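
A single timing per image can be noisy, especially for the first runs. The sketch below shows one possible way to measure a more stable latency with warm-up runs and `time.perf_counter`; the `benchmark` helper and its parameters are assumptions, and the `lambda` is only a stand-in workload so that the example runs anywhere.

```python
import statistics
import time

def benchmark(run_once, warmup=2, repeats=10):
    """Return (mean, median) latency in milliseconds of a zero-argument callable."""
    for _ in range(warmup):
        run_once()  # warm-up runs are not measured
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings_ms), statistics.median(timings_ms)

# Stand-in workload; on the board, a callable wrapping the model run would be passed instead.
mean_ms, median_ms = benchmark(lambda: sum(range(10_000)))
print(f"mean: {mean_ms:.3f} ms, median: {median_ms:.3f} ms")
```

On the device, passing `stai_student_model.run` after setting an input with `set_input` would measure the inference step alone, excluding preprocessing and post-processing.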

You are welcome to deploy your new model in an object detection application running on a real-time video stream with GStreamer and the STAI_MPU API.