# TAO Image Classification (TF2) with STEdgeAI Developer Cloud on NVIDIA Pretrained Model

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for use on a different task.

This notebook provides a complete life cycle of the model training, optimization and benchmarking using [NVIDIA TAO Toolkit](https://developer.nvidia.com/tao-toolkit) and [STEdgeAI Developer Cloud](https://stm32ai.st.com/stm32-cube-ai-dc/).

NVIDIA Train Adapt Optimize (TAO) Toolkit is a simple and easy-to-use Python based AI toolkit for taking purpose-built AI models and customizing them with users' own data.

[STEdgeAI Developer Cloud](https://stm32ai-cs.st.com/home) is a free-of-charge online platform and set of services for creating, optimizing, benchmarking, and generating AI models for STM32 microcontrollers. It is based on the [STEdgeAI Core](https://www.st.com/en/development-tools/stedgeai-core.html) technology.



<br>

<img style="float: center;background-color: white; width: 1080" src="../../docs/TAO-STM32CubeAI.png" width="1080">

<br>

## License

This software component is licensed by ST under BSD-3-Clause license,
the "License"; 

You may not use this file except in compliance with the
License. 

You may obtain a copy of the License at: https://opensource.org/licenses/BSD-3-Clause

Copyright (c) 2023 STMicroelectronics. All rights reserved.

Copyright (c) 2023 Nvidia. All rights reserved.

## Learning Objectives
In this notebook, you will learn how to leverage the simplicity and convenience of TAO to:

* Take a pretrained efficientnet_b0 model from NGC and finetune it on a sample dataset converted from COCO2014 to perform person detection,
* Prune the finetuned model,
* Retrain the pruned model to recover lost accuracy,
* Export the pruned model as an onnx model,
* Quantize the model using onnxruntime,
* Run Benchmarking of the quantized onnx model (finetuned, pruned, retrained, and quantized) using STEdgeAI Developer Cloud to know the footprints and embeddability of the models.

At the end of this notebook, you will have generated a trained and optimized `classification` model which was imported from outside TAO Toolkit, and that may be deployed via [STEdgeAI Developer Cloud](https://stm32ai-cs.st.com/home).

### Table of Contents
This notebook shows an example use case for classification using the Train Adapt Optimize (TAO) Toolkit.

0. [Set up env variables and map drives](#head-0)
1. [Installing the TAO Launcher](#head-1)
2. [Prepare dataset and pretrained model](#head-2)
    1. [Download, prepare and split the dataset into train/test/val](#head-2-1)
    2. [Download pretrained model from NGC](#head-2-2)
3. [Provide training specification](#head-3)
4. [Finetune the pretrained model using TAO training](#head-4)
5. [Evaluate trained model](#head-5)
    1. [(optional) Export the trained model as onnx format and check the accuracy](#head-5-1)
6. [Prune the trained model](#head-6)
7. [Retrain the pruned model](#head-7)
8. [Testing the finetuned pruned model](#head-8)
    1. [Export the pruned, and retrained model as onnx format](#head-8-1)
    2. [Quantizing the exported onnx model using onnxruntime](#head-8-2)
9. [Benchmarking the optimized model using STEdgeAI Developer Cloud for embeddability](#head-9)

## 0. Set up env variables and map drives <a class="anchor" id="head-0"></a>
When using the purpose-built pretrained models from NGC, please make sure to set the `$KEY` environment variable to the key mentioned in the model overview. Failing to do so can lead to errors when trying to load them as pretrained models.

This notebook requires the user to set an env variable called `$LOCAL_PROJECT_DIR` to the path of the user's workspace. Please note that the dataset used by this notebook is expected to reside in `$LOCAL_PROJECT_DIR/data`, while the collaterals generated by the TAO experiment will be output to `$LOCAL_PROJECT_DIR/classification_tf2`. More information on how to set up the dataset and the supported steps in the TAO workflow is provided in the subsequent cells.

*Note: Please make sure to remove any stray artifacts/files from the `$USER_EXPERIMENT_DIR` or `$DATA_DOWNLOAD_DIR` paths as mentioned below, that may have been generated from previous experiments. Having checkpoint files etc may interfere with creating a training graph for a new experiment.*

*Note: By default, this notebook is set up to run training using 1 GPU. To use more GPUs, please update the env variable `$NUM_GPUS` accordingly.*

In [None]:
# Setting up env variables for cleaner command line commands.
import os

%env KEY=nvidia_tlt
%env NUM_GPUS=1
%env USER_EXPERIMENT_DIR=/workspace/tao-experiments/classification_tf2
%env DATA_DOWNLOAD_DIR=/workspace/tao-experiments/data

# Set this path if you don't run the notebook from the samples directory.
# %env NOTEBOOK_ROOT=~/tao-samples/classification_tf2

# Please define this local project directory that needs to be mapped to the TAO docker session.
# The dataset expected to be present in $LOCAL_PROJECT_DIR/data, while the results for the steps
# in this notebook will be stored at $LOCAL_PROJECT_DIR/classification_tf2
# !PLEASE MAKE SURE TO UPDATE THIS PATH!.
os.environ["LOCAL_PROJECT_DIR"] = "/home/user/stm32ai-tao/"

os.environ["LOCAL_DATA_DIR"] = os.path.join(
    os.getenv("LOCAL_PROJECT_DIR", os.getcwd()),
    "data"
)
os.environ["LOCAL_EXPERIMENT_DIR"] = os.path.join(
    os.getenv("LOCAL_PROJECT_DIR", os.getcwd()),
    "classification_tf2"
)

# The sample spec files are present in the same path as the downloaded samples.
os.environ["LOCAL_SPECS_DIR"] = os.path.join(
    os.getenv("NOTEBOOK_ROOT", os.getcwd()),
    "specs"
)
%env SPECS_DIR=/workspace/tao-experiments/classification_tf2/tao_person/specs

# Showing list of specification files.
!ls -rlt $LOCAL_SPECS_DIR

The cell below maps the project directory on your local host to a workspace directory in the TAO docker instance, so that the data and the results are mapped from outside to inside of the docker instance.

In [None]:
# Mapping up the local directories to the TAO docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")

# Define the dictionary with the mapped drives
drive_map = {
    "Mounts": [
        # Mapping the data directory
        {
            "source": os.environ["LOCAL_PROJECT_DIR"],
            "destination": "/workspace/tao-experiments"
        },
        # Mapping the specs directory.
        {
            "source": os.environ["LOCAL_SPECS_DIR"],
            "destination": os.environ["SPECS_DIR"]
        },
    ],
    "DockerOptions":{
        "user": "{}:{}".format(os.getuid(), os.getgid())
    }
}

# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(drive_map, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json
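
To see how a path on the host translates into the path used inside the TAO docker instance, the mapping written above can be applied by hand. The sketch below is illustrative only: `to_container_path` is a hypothetical helper (it is not part of the TAO launcher), it assumes the `Mounts` layout shown in the previous cell, and it prefers the longest matching `source` prefix.

```python
import posixpath

def to_container_path(host_path, mounts):
    """Translate a host path to its path inside the container.

    Hypothetical helper, not part of the TAO launcher. `mounts` is the
    "Mounts" list stored in ~/.tao_mounts.json.
    """
    host_path = posixpath.normpath(host_path)
    best = None
    for mount in mounts:
        source = posixpath.normpath(mount["source"])
        prefix = source.rstrip("/") + "/"
        # Match the mount root itself or anything below it.
        if host_path == source or host_path.startswith(prefix):
            if best is None or len(source) > len(best[0]):
                best = (source, mount["destination"])
    if best is None:
        raise ValueError(f"{host_path} is not covered by any mount")
    source, destination = best
    relative = host_path[len(source):].lstrip("/")
    return posixpath.join(destination, relative) if relative else destination

# Example values taken from the defaults used in this notebook.
example_mounts = [
    {"source": "/home/user/stm32ai-tao/", "destination": "/workspace/tao-experiments"},
]
print(to_container_path("/home/user/stm32ai-tao/data/split/train", example_mounts))
```

Paths passed to `tao` commands must be container paths (for example `$SPECS_DIR`), while paths used by plain Python or shell cells are host paths (for example `$LOCAL_SPECS_DIR`).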

#### (Optional) A. Set proxy variables if working behind corporate proxies.

The following section sets the proxies and the SSL verification flags for users working behind a corporate proxy. This setup is necessary to be able to communicate with the internet.

Replace the `userName`, `password`, and `proxy_port` with your correct username, password and proxy port.

In [4]:
# set proxies
import os
# os.environ["http_proxy"]='http://userName:password.@example.com:proxy_port'
# os.environ["https_proxy"] = 'http://userName:password.@example.com:proxy_port'
# os.environ["NO_SSL_VERIFY"]="1"
# os.environ["SSL_VERIFY"]="False"
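
If the username or password contains characters such as `@`, `:` or `/`, they generally need to be percent-encoded before being placed in a proxy URL. A minimal sketch, assuming a hypothetical helper `build_proxy_url` and placeholder host, port, and credentials:

```python
from urllib.parse import quote

def build_proxy_url(user, password, host, port, scheme="http"):
    """Build a proxy URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" also encodes "/" so it cannot be mistaken for a path separator.
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"{scheme}://{user_enc}:{password_enc}@{host}:{port}"

# Placeholder values; replace them with your own settings.
proxy_url = build_proxy_url("userName", "p@ss:word", "example.com", 8080)
print(proxy_url)
# os.environ["http_proxy"] = proxy_url
# os.environ["https_proxy"] = proxy_url
```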

## 1. Installing the TAO launcher <a class="anchor" id="head-1"></a>
The TAO launcher is a python package distributed as a python wheel listed in PyPI. You may install the launcher by executing the following cell.

Please note that TAO Toolkit recommends running the TAO launcher in a virtual env with python 3.6.9. You may follow the instructions on this [page](https://virtualenvwrapper.readthedocs.io/en/latest/install.html) to set up a python virtual env using the `virtualenv` and `virtualenvwrapper` packages. Once you have set up virtualenvwrapper, please set the version of python to be used in the virtual env by using the `VIRTUALENVWRAPPER_PYTHON` variable. You may do so by running:

```sh
export VIRTUALENVWRAPPER_PYTHON=/path/to/bin/python3.x
```
where x >= 6 and <= 8

We recommend performing this step first and then launching the notebook from the virtual environment. In addition to installing TAO python package, please make sure of the following software requirements:
* python >=3.6.9 < 3.8.x
* docker-ce > 19.03.5
* docker-API 1.40
* nvidia-container-toolkit > 1.3.0-1
* nvidia-container-runtime > 3.4.0-1
* nvidia-docker2 > 2.5.0-1
* nvidia-driver > 455+

Once you have installed the prerequisites, please log in to the docker registry nvcr.io by running the command below:

```sh
docker login nvcr.io
```

You will be prompted to enter a username and password. The username is `$oauthtoken` and the password is the API key generated from `ngc.nvidia.com`. Please follow the instructions in the [NGC setup guide](https://docs.nvidia.com/ngc/ngc-overview/index.html#generating-api-key) to generate your own API key.


In [None]:
# SKIP this cell IF you have already installed the TAO launcher.
!pip3 install nvidia-tao

In [None]:
# View the versions of the TAO launcher
!tao info

## 2. Prepare datasets and pre-trained model <a class="anchor" id="head-2"></a>
**NOTE**: If you have already downloaded, unpacked and prepared the dataset files once, you can skip these steps.

We will be using the modified version of COCO2014 dataset for the tutorial. To find more details please visit this [link](https://pjreddie.com/projects/coco-mirror/). 

#### Download the dataset
To download all the files needed for the dataset to the right location, please run the section below.

In [None]:
!mkdir -p $LOCAL_DATA_DIR
!wget -O $LOCAL_DATA_DIR/train2014.zip https://pjreddie.com/media/files/train2014.zip
!wget -O $LOCAL_DATA_DIR/val2014.zip https://pjreddie.com/media/files/val2014.zip
!wget -O $LOCAL_DATA_DIR/labels.tgz https://pjreddie.com/media/files/coco/labels.tgz

#### Verify the download.
Checking if the dataset zip files are present in the data directory.

In [None]:
# Check that file is present
import os
DATA_DIR = os.environ.get('LOCAL_DATA_DIR')
print(DATA_DIR)
if not ( os.path.isfile(os.path.join(DATA_DIR , 'train2014.zip')) and 
        os.path.isfile(os.path.join(DATA_DIR , 'val2014.zip')) and
        os.path.isfile(os.path.join(DATA_DIR , 'labels.tgz')) ):
    print('One or more data files for the dataset are not found.\nPlease download the dataset by running the Download Dataset section!')
else:
    print('Found dataset.')

#### Unpack the files

The downloaded files are in the form of the `zip` and `tgz` format. Running the following code section will unzip and unpack these files.

**NOTE**: `> /dev/null` discards the output of the command so that it runs silently.

In [None]:
!unzip $LOCAL_DATA_DIR/train2014.zip -d $LOCAL_DATA_DIR/> /dev/null
!unzip $LOCAL_DATA_DIR/val2014.zip -d $LOCAL_DATA_DIR/> /dev/null
!tar -xzvf $LOCAL_DATA_DIR/labels.tgz -C $LOCAL_DATA_DIR/> /dev/null

Verify that the files were unpacked into folders.

In [None]:
!ls $LOCAL_DATA_DIR/

### A. Split the dataset into train/val/test <a class="anchor" id="head-2-1"></a>

For creating the person detection use case we are converting the COCO2014 Dataset into a format where it has only two classes, i.e. `person` and `not_person`. 
In addition to this we are applying an additional filter and removing all the images with person class where the size of the person is too small (covering less than 20% of the image area). 
That is why after preparation instead of 118,287 we have only 84,810 images in total.
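
The filtering rule can be illustrated on the content of a single label file. The sketch below is a simplified, standard-library restatement of the rule and relies on two assumptions inferred from the filtering code in this notebook: each label row is `class x_center y_center width height` with coordinates normalized to the image size, and class `0` is `person`. The helper name `classify_labels` is hypothetical.

```python
def classify_labels(rows, area_threshold=20.0):
    """Return 'person', 'not_person', or None when the image is dropped.

    rows: iterable of (class_id, x_center, y_center, width, height), with
    width and height expressed as fractions of the image size.
    """
    person_areas = [w * h * 100.0 for cls, _, _, w, h in rows if int(cls) == 0]
    if not person_areas:
        return "not_person"   # no person annotated at all
    if any(area > area_threshold for area in person_areas):
        return "person"       # at least one sufficiently large person
    return None               # only small persons: the image is discarded

print(classify_labels([(0, 0.5, 0.5, 0.6, 0.5)]))   # person box covers 30% of the image
print(classify_labels([(0, 0.5, 0.5, 0.2, 0.3)]))   # person box covers 6% of the image
print(classify_labels([(16, 0.4, 0.6, 0.9, 0.9)]))  # no person annotated
```

Images whose only persons are small are dropped rather than labeled `not_person`, which keeps ambiguous samples out of both classes.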

In [None]:
# install pip requirements
!pip3 install tqdm
!pip3 install matplotlib==3.3.3
!pip3 install pandas

The following code section filters the dataset into `person` and `not_person` classes.

In [None]:
import pandas as pd
import shutil
from pathlib import Path
def filter_coco(area_threshold, labels_dir, input_dir, output_dir):
    """Filter COCO dataset subset filtering person area.

    Args:
      area_threshold: Threshold of fraction of image area below which
      persons are filtered.
      labels_dir: COCO dataset labels directory path.
      input_dir: COCO dataset path.
      output_dir: new dataset output path.
    """
    labels_dpath = Path(labels_dir)
    labels_fpaths = labels_dpath.glob('*.txt')
    input_dpath = Path(input_dir)
    output_dpath = Path(output_dir)
    f = open(input_dpath.stem + '.txt','w+')
    f.write('filename label\n')
    for label_fpath in labels_fpaths:
        img_fname = label_fpath.name.replace('.txt', '.jpg')
        label = 'not_person'
        annotations = pd.read_csv(label_fpath, delimiter=' ', header=None)
        persons = annotations.loc[annotations[0] == 0]
        if persons.shape[0] != 0:
            big_persons = ((persons[3] * persons[4] * 100.0) > area_threshold).sum()
            if big_persons > 0:
                label = 'person'
            else:
                continue
        src_fpath = input_dpath / img_fname
        dst_dpath = output_dpath / label
        dst_dpath.mkdir(parents=True, exist_ok=True)
        shutil.copy(src_fpath, dst_dpath)
        f.write(img_fname + ' ' + label + '\n')
    f.close()
train_labels_path = os.path.join(DATA_DIR, 'labels', 'train2014')
val_labels_path = os.path.join(DATA_DIR, 'labels', 'val2014')
train_images_dir = os.path.join(DATA_DIR, 'train2014')
val_images_dir = os.path.join(DATA_DIR, 'val2014')
result_dir = os.path.join(DATA_DIR, 'person_dataset')
filter_coco(area_threshold = 20.0, labels_dir = train_labels_path, input_dir = train_images_dir, output_dir = result_dir)
filter_coco(area_threshold = 20.0, labels_dir = val_labels_path, input_dir = val_images_dir, output_dir = result_dir)

Splitting the dataset into `train`, `val`, `test` portions.

In [None]:
import os
import glob
import shutil
from random import shuffle
from tqdm import tqdm

DATA_DIR=os.environ["LOCAL_DATA_DIR"]
SOURCE_DIR=os.path.join(DATA_DIR, 'person_dataset')
TARGET_DIR=os.path.join(DATA_DIR,'split')

# removing existing split directory
!rm -rf $TARGET_DIR

# list dir
print(os.walk(SOURCE_DIR))
dir_list = next(os.walk(SOURCE_DIR))[1]
# for each dir, create a new dir in split
for dir_i in tqdm(dir_list):
        newdir_train = os.path.join(TARGET_DIR, 'train', dir_i)
        newdir_val = os.path.join(TARGET_DIR, 'val', dir_i)
        newdir_test = os.path.join(TARGET_DIR, 'test', dir_i)
        
        if not os.path.exists(newdir_train):
                os.makedirs(newdir_train)
        if not os.path.exists(newdir_val):
                os.makedirs(newdir_val)
        if not os.path.exists(newdir_test):
                os.makedirs(newdir_test)

        img_list = glob.glob(os.path.join(SOURCE_DIR, dir_i, '*.jpg'))
        # shuffle data
        shuffle(img_list)

        for j in range(int(len(img_list)*0.7)):
                shutil.copy2(img_list[j], os.path.join(TARGET_DIR, 'train', dir_i))

        for j in range(int(len(img_list)*0.7), int(len(img_list)*0.8)):
                shutil.copy2(img_list[j], os.path.join(TARGET_DIR, 'val', dir_i))
                
        for j in range(int(len(img_list)*0.8), len(img_list)):
                shutil.copy2(img_list[j], os.path.join(TARGET_DIR, 'test', dir_i))
                
print('Done splitting dataset.')

Verifying if the portions are created and all the sub-folders have all the classes.

In [None]:
!ls $LOCAL_DATA_DIR/split/train
!ls $LOCAL_DATA_DIR/split/val
!ls $LOCAL_DATA_DIR/split/test
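
Beyond listing the folders, the split proportions can be checked by counting the images per portion and class. The following sketch assumes the `split/<portion>/<class>/*.jpg` layout created above; `count_images` and `expected_sizes` are hypothetical helpers.

```python
from pathlib import Path

def count_images(split_dir, portions=("train", "val", "test")):
    """Count .jpg files per portion and class under split_dir."""
    counts = {}
    for portion in portions:
        portion_dir = Path(split_dir) / portion
        if not portion_dir.is_dir():
            continue  # portion not created yet
        counts[portion] = {
            class_dir.name: sum(1 for _ in class_dir.glob("*.jpg"))
            for class_dir in sorted(portion_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts

def expected_sizes(n_images):
    """Portion sizes given by the 70/10/20 index boundaries of the split cell."""
    train_end, val_end = int(n_images * 0.7), int(n_images * 0.8)
    return train_end, val_end - train_end, n_images - val_end

print(expected_sizes(1000))
# counts = count_images(os.path.join(os.environ["LOCAL_DATA_DIR"], "split"))
# print(counts)
```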

### B. Download pretrained models <a class="anchor" id="head-2-2"></a>

We will use the NGC CLI to get the pre-trained models. For more details, go to ngc.nvidia.com and click SETUP on the navigation bar.

In [None]:
# Installing NGC CLI on the local machine.
## Download and install
%env CLI=ngccli_cat_linux.zip
!mkdir -p $LOCAL_PROJECT_DIR/ngccli

# Remove any previously existing CLI installations
!rm -rf $LOCAL_PROJECT_DIR/ngccli/*
!wget "https://ngc.nvidia.com/downloads/$CLI" -P $LOCAL_PROJECT_DIR/ngccli
!unzip -u "$LOCAL_PROJECT_DIR/ngccli/$CLI" -d $LOCAL_PROJECT_DIR/ngccli/
!rm $LOCAL_PROJECT_DIR/ngccli/*.zip 
os.environ["PATH"]="{}/ngccli/ngc-cli:{}".format(os.getenv("LOCAL_PROJECT_DIR", ""), os.getenv("PATH", ""))

In [None]:
!ngc registry model list nvidia/tao/pretrained_classification_tf2

In [None]:
!mkdir -p $LOCAL_EXPERIMENT_DIR/pretrained_efficientnet_b0

In [None]:
# Pull pretrained model from NGC
!ngc registry model download-version nvidia/tao/pretrained_classification_tf2:efficientnet_b0 --dest $LOCAL_EXPERIMENT_DIR/pretrained_efficientnet_b0

In [None]:
print("Check that model is downloaded into dir.")
!ls -l $LOCAL_EXPERIMENT_DIR/pretrained_efficientnet_b0/pretrained_classification_tf2_vefficientnet_b0

## 3. Provide training specification <a class="anchor" id="head-3"></a>
* Training dataset
* Validation dataset
* Pre-trained models
* Other training (hyper-)parameters such as batch size, number of epochs, learning rate etc.

In [None]:
!cat $LOCAL_SPECS_DIR/spec.yaml

## 4. Run TAO training <a class="anchor" id="head-4"></a>
* Provide the sample spec file and the output directory location for models

In [None]:
# logging in the docker session
!docker login nvcr.io

In [None]:
!rm -rf $LOCAL_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output
!mkdir -p $LOCAL_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output
# !sed -i "s|RESULTSDIR|$USER_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output|g" $LOCAL_SPECS_DIR/spec.yaml
# !sed -i "s|ENC_KEY|$KEY|g" $LOCAL_SPECS_DIR/spec.yaml

In [None]:
!tao model classification_tf2 train -e $SPECS_DIR/spec.yaml

### INFO
- To run this training in data-parallel mode on multiple GPUs, pass the **number of GPU devices** to use with the `--gpus` parameter. For example, to run the training on two GPU devices in parallel, the training command is:
 ```!tao model classification_tf2 train -e $SPECS_DIR/spec.yaml --gpus 2```
- The training can be interrupted and then relaunched at any point. To resume from a checkpoint, just relaunch training with the same spec file.

## 5. Evaluate trained models <a class="anchor" id="head-5"></a>

In this step, we assume that the training is complete and the model from the final epoch (`efficientnet-b0_xxx.tlt`) is available.

In [None]:
# get the last checkpoints
last_checkpoint = ''
for f in os.listdir(os.path.join(os.environ["LOCAL_EXPERIMENT_DIR"],'tao_person/results_efficientnet_b0/output/', 'train')):
    if f.startswith('efficientnet-b'):
        last_checkpoint = last_checkpoint if last_checkpoint > f else f
print(f'Last checkpoint: {last_checkpoint}')

In [None]:
# Set LAST_CHECKPOINT in the environment variables file
%env LAST_CHECKPOINT={last_checkpoint}
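
The checkpoint selection above compares file names as strings, which only matches the epoch order when the epoch numbers are zero-padded to the same width. A more defensive sketch parses the number explicitly. It assumes file names of the form `efficientnet-b0_<epoch>.<ext>`, as suggested by `efficientnet-b0_xxx.tlt`; the helper name and the sample file names are hypothetical.

```python
import re

def latest_checkpoint(filenames, prefix="efficientnet-b"):
    """Return the file name with the highest trailing epoch number, or '' if none."""
    pattern = re.compile(r"_(\d+)\.[^.]+$")
    best_name, best_epoch = "", -1
    for name in filenames:
        match = pattern.search(name)
        if name.startswith(prefix) and match:
            epoch = int(match.group(1))
            if epoch > best_epoch:
                best_name, best_epoch = name, epoch
    return best_name

# Hypothetical file names without zero padding: a plain string comparison
# would rank the epoch-9 file above the epoch-10 file.
print(latest_checkpoint(["efficientnet-b0_9.tlt", "efficientnet-b0_10.tlt", "train.log"]))
```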

In [None]:
!tao model classification_tf2 evaluate -e $SPECS_DIR/spec.yaml evaluate.checkpoint="$USER_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output/train/$LAST_CHECKPOINT"

### (optional) A. Exporting the trained model to onnx model 
This step is needed if one wants to estimate the footprints of the trained model using STEdgeAI.

In [None]:
!mkdir -p ./results_efficientnet_b0/exports
# export the model checkpoint as .onnx file
!tao model classification_tf2 export -e $SPECS_DIR/spec.yaml\
                                 export.checkpoint=$USER_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output/train/$LAST_CHECKPOINT \
                                 export.onnx_file=$USER_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/exports/efficientnet_b0.onnx

### (optional) B. Evaluating the onnx model
This is an optional sanity check that evaluates the exported trained model to verify that the preprocessing in the onnx evaluation and in the TAO evaluation is the same.

In [None]:
!pip install tensorflow==2.9
!pip install onnxruntime==1.14.1
!pip install onnx==1.12.0
!pip install scikit-learn==0.24.2
!pip install matplotlib==3.3.3
!pip install pillow

In [None]:
# evaluate the onnx model before pruning
import sys
sys.path.append('../')
import onnx_utils
onnx_utils.evaluate_onnx_model('./results_efficientnet_b0/exports/efficientnet_b0.onnx',
                               os.path.join(os.environ["LOCAL_DATA_DIR"],'split/test/'),
                               save_path='./results_efficientnet_b0/efficientnet_b0',
                              img_width = 128, img_height = 128)

## 6. Prune trained models <a class="anchor" id="head-6"></a>
* Specify pre-trained model
* Equalization criterion
* Threshold for pruning
* Exclude prediction layer that you don't want pruned (e.g. predictions)

Usually, you only need to adjust `prune.threshold` to trade accuracy against model size. A higher `threshold` gives a smaller model (and thus higher inference speed) but worse accuracy. The right threshold depends on the dataset; `0.7` is just a starting point. If the accuracy after retraining is good, you can increase this value to get smaller models. Otherwise, lower it to get better accuracy.

`prune.min_num_filters` is the minimum number of filters to keep per layer after the pruning step. The smaller the value, the smaller the resulting model; however, this might lower the accuracy. Adjust this value to suit your needs.
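
The interplay between `prune.threshold` and `prune.min_num_filters` can be pictured with a toy example. The sketch below is a simplified illustration of magnitude-based filter pruning and is not TAO's actual implementation: it assumes that each filter is scored by a norm, that scores are normalized by the largest score in the layer, and that filters scoring below the threshold are removed unless that would leave fewer than the minimum number. The function name and the sample norms are hypothetical.

```python
def select_filters(filter_norms, threshold, min_num_filters):
    """Return the sorted indices of the filters kept in one layer (toy model)."""
    largest = max(filter_norms) or 1.0  # avoid dividing by zero for an all-zero layer
    normalized = [norm / largest for norm in filter_norms]
    keep = [i for i, score in enumerate(normalized) if score >= threshold]
    if len(keep) < min_num_filters:
        # Fall back to the highest-scoring filters to honour the minimum.
        ranked = sorted(range(len(normalized)), key=lambda i: normalized[i], reverse=True)
        keep = sorted(ranked[:min_num_filters])
    return keep

# Hypothetical per-filter norms for a layer with six filters.
norms = [0.9, 0.1, 0.45, 1.0, 0.75, 0.05]
for threshold in (0.3, 0.7, 0.95):
    print(threshold, select_filters(norms, threshold, min_num_filters=2))
```

In this toy model, raising the threshold keeps fewer filters, and the minimum only takes effect once the threshold would remove almost everything.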

In [None]:
!tao model classification_tf2 prune -e $SPECS_DIR/spec.yaml \
            prune.checkpoint="$USER_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output/train/$LAST_CHECKPOINT" \
            prune.threshold=0.7 prune.min_num_filters=8

In [None]:
print('Pruned model:')
print('------------')
!ls -rlt $LOCAL_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output/prune

## 7. Retrain pruned models <a class="anchor" id="head-7"></a>
* Model needs to be re-trained to bring back accuracy after pruning
* Specify re-training specification

In [None]:
!rm -rf $LOCAL_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output_retrain
!mkdir -p $LOCAL_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output_retrain

!cat $LOCAL_SPECS_DIR/spec_retrain.yaml

In [None]:
!tao model classification_tf2 train -e $SPECS_DIR/spec_retrain.yaml

## 8. Testing the model! <a class="anchor" id="head-8"></a>

In this step, we assume that the training is complete and the model from the final epoch (`efficientnet-b0_xxx.tlt`) is available.

In [None]:
# get the last checkpoints
last_checkpoint = ''
for f in os.listdir(os.path.join(os.environ["LOCAL_EXPERIMENT_DIR"],'tao_person/results_efficientnet_b0/output_retrain', 'train')):
    if f.startswith('efficientnet-b'):
        last_checkpoint = last_checkpoint if last_checkpoint > f else f
print(f'Last checkpoint: {last_checkpoint}')
%env LAST_CHECKPOINT={last_checkpoint}

In [None]:
!tao model classification_tf2 evaluate -e $SPECS_DIR/spec.yaml \
evaluate.checkpoint=$USER_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output_retrain/train/$LAST_CHECKPOINT

### A. Exporting the trained model as onnx model <a class="anchor" id="head-8-1"></a>

The following section exports the pruned and retrained model as an onnx model.

In [None]:
!mkdir -p ./results_efficientnet_b0/exports

!tao model classification_tf2 export -e $SPECS_DIR/spec_retrain.yaml\
                                 export.checkpoint=$USER_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/output_retrain/train/$LAST_CHECKPOINT \
                                 export.onnx_file=$USER_EXPERIMENT_DIR/tao_person/results_efficientnet_b0/exports/pruned_efficientnet_b0.onnx

#### A.1. Evaluating the exported onnx model

The following section evaluates the exported onnx model. This should get the same accuracy as the one before converting to onnx.

In [None]:
!pip install tensorflow==2.9
!pip install onnxruntime==1.14.1
!pip install onnx==1.12.0
!pip install scikit-learn==0.24.2
!pip install matplotlib==3.3.3
!pip install pillow

In [None]:
# evaluate the pruned and retrained onnx model
import sys
sys.path.append('../')
import onnx_utils
onnx_utils.evaluate_onnx_model('./results_efficientnet_b0/exports/pruned_efficientnet_b0.onnx',
                               os.path.join(os.environ["LOCAL_DATA_DIR"],'split/test/'),
                               save_path='./results_efficientnet_b0/efficientnet_b0_pruned',
                              img_width = 128, img_height = 128)

### B. Quantizing the exported onnx model using onnxruntime <a class="anchor" id="head-8-2"></a>

The following sections convert the exported onnx model to int8 to reduce its footprint and improve the inference time.
This first requires creating a subsample of the data to calibrate the quantization of the model.
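
To make the role of the calibration data concrete, the arithmetic behind int8 quantization can be sketched in a few lines. This is a simplified illustration of affine quantization, with a scale and a zero point derived from an observed value range; it is not the code used by `onnx_utils` or onnxruntime, and the function names and sample values are hypothetical.

```python
def affine_params(x_min, x_max, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from a calibrated value range."""
    # The range is widened to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        return 1.0, 0
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to int8, saturating at the ends of the range."""
    return max(qmin, min(qmax, int(round(x / scale)) + zero_point))

def dequantize(q, scale, zero_point):
    """Map an int8 value back to an approximate float."""
    return (q - zero_point) * scale

# Hypothetical activations observed while running the calibration images.
calibration_values = [-1.0, 0.0, 0.5, 3.0]
scale, zero_point = affine_params(min(calibration_values), max(calibration_values))
for x in calibration_values + [10.0]:  # 10.0 lies outside the calibrated range
    q = quantize(x, scale, zero_point)
    print(f"{x:6.2f} -> {q:4d} -> {dequantize(q, scale, zero_point):7.4f}")
```

Values inside the calibrated range are recovered to within about one quantization step, while values outside it saturate. This is why the calibration subset should be representative of the data seen at inference time.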

#### B.1. Create a calibration dataset
This should have samples from both classes `person`, and `not_person`.

In [None]:
# create a subsample dataset for the quantization
import os
import glob
import shutil
from random import shuffle
from tqdm import tqdm

DATA_DIR=os.environ["LOCAL_DATA_DIR"]
SOURCE_DIR=os.path.join(DATA_DIR, 'split/train/')
TARGET_DIR=os.path.join(DATA_DIR,'split/subset_calibration_dataset')

!rm -rf $TARGET_DIR

samples_per_class = 200
# list dir
print(os.walk(SOURCE_DIR))
dir_list = next(os.walk(SOURCE_DIR))[1]
# for each class dir, copy a random subset of images into the calibration dir
for dir_i in tqdm(dir_list):
        if not os.path.exists(TARGET_DIR):
                os.makedirs(TARGET_DIR)
                
        img_list = glob.glob(os.path.join(SOURCE_DIR, dir_i, '*.jpg'))
        # shuffle data
        shuffle(img_list)

        # copy at most samples_per_class images, even if the class has fewer
        for img_path in img_list[:samples_per_class]:
                shutil.copy2(img_path, TARGET_DIR)
print('Done creating calibration dataset.')

#### B.2. Quantize the model using QDQ quantization to int8 weights
The following section quantizes the float32 onnx model to an int8 quantized onnx model.

In [None]:
# quantize the model
import sys
sys.path.append('../')
import onnx_utils
onnx_utils.quantize_onnx_model(input_model = './results_efficientnet_b0/exports/pruned_efficientnet_b0.onnx', 
                              calibration_dataset_path = os.path.join(os.environ["LOCAL_DATA_DIR"],'split','subset_calibration_dataset'))

#### B.3. Evaluate the quantized model

In [None]:
# evaluate the pruned and quantized onnx model
import sys
sys.path.append('../')
import onnx_utils
onnx_utils.evaluate_onnx_model('./results_efficientnet_b0/exports/pruned_efficientnet_b0_QDQ_quant.onnx',
                               os.path.join(os.environ["LOCAL_DATA_DIR"],'split/test/'),
                               save_path='./results_efficientnet_b0/efficientnet_b0_pruned_quantized',
                              img_width = 128, img_height = 128)

#### B.4. Changing the opset of the onnx model to use them with STEdgeAI
The following section will change the opset of the exported onnx models to make sure it is best supported by STEdgeAI.

In [None]:
import sys
sys.path.append('../')
import onnx_utils
onnx_utils.change_opset('./results_efficientnet_b0/exports/efficientnet_b0.onnx', target_opset = 14)
onnx_utils.change_opset('./results_efficientnet_b0/exports/pruned_efficientnet_b0.onnx', target_opset = 14)
onnx_utils.change_opset('./results_efficientnet_b0/exports/pruned_efficientnet_b0_QDQ_quant.onnx', target_opset = 14)

## 9. Benchmarking the optimized model using STEdgeAI Developer Cloud <a class="anchor" id="head-9"></a>

Getting the package for connecting to STEdgeAI Developer Cloud.

In [None]:
!pip install gitdir
!pip install ipywidgets
!pip install marshmallow

In [None]:
!gitdir https://github.com/STMicroelectronics/stm32ai-modelzoo-services/tree/main/common/stm32ai_dc

import os
import shutil

# Reorganize local folders
if os.path.exists('./stm32ai_dc'):
    shutil.rmtree('./stm32ai_dc')
shutil.move('./common/stm32ai_dc', './stm32ai_dc')
shutil.rmtree('./common')

##### Import, helper and UI functions

In [None]:
import os
import sys
from typing import List
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from stm32ai_dc import (CliLibraryIde, CliLibrarySerie, CliParameters, MpuParameters, MpuEngine, AtonParameters,
                        CloudBackend, Stm32Ai)

sys.path.append(os.path.abspath('stm32ai'))
os.environ['STATS_TYPE'] = 'stm32ai_tao'

# create a directory for outputs for stm32ai developer cloud operations
stm32ai_output_dir = './results_efficientnet_b0/stm32ai_outputs'
os.makedirs(stm32ai_output_dir, exist_ok=True)


def get_mpu_options(board_name: str = None) -> tuple:
    """
    Get MPU benchmark options depending on MPU board selected
    Each MPU board has different settings,
    i.e. different number of cpu_cores or engine (CPU only or HW_Accelerator also)
    Input:
        board_name:str, name of the mpu board
    Returns:
        tuple: engine_used and num_cpu_cores.
    """

    #define configuration by MPU board
    STM32MP257F_EV1 = {
        "engine": MpuEngine.HW_ACCELERATOR,
        "cpu_cores": 2
    }

    STM32MP157F_DK2 = {
        "engine": MpuEngine.CPU,
        "cpu_cores": 2
    }

    STM32MP135F_DK = {
        "engine": MpuEngine.CPU,
        "cpu_cores": 1
    }

    #recover parameters based on board name:
    if board_name == "STM32MP257F-EV1":
        engine_used = STM32MP257F_EV1.get("engine")
        num_cpu_cores = STM32MP257F_EV1.get("cpu_cores")
    elif board_name == "STM32MP157F-DK2":
        engine_used = STM32MP157F_DK2.get("engine")
        num_cpu_cores = STM32MP157F_DK2.get("cpu_cores")
    elif board_name == "STM32MP135F-DK":
        engine_used = STM32MP135F_DK.get("engine")
        num_cpu_cores = STM32MP135F_DK.get("cpu_cores")
    else :
        engine_used = MpuEngine.CPU
        num_cpu_cores = 1

    return engine_used, num_cpu_cores

def analyze_footprints(report: object = None) -> None:
    """
    Analyzes the memory footprint of a STEdgeAI model.

    Args:
        report: A report object containing information about the model.

    Returns:
        None
    """
    activations_ram: float = report.ram_size / 1024
    runtime_ram: float = report.estimated_library_ram_size / 1024
    total_ram: float = activations_ram + runtime_ram
    weights_rom: float = report.rom_size / 1024
    code_rom: float = report.estimated_library_flash_size / 1024
    total_flash: float = weights_rom + code_rom
    macc: float = report.macc / 1e6
    print("[INFO] : STEdgeAI model memory footprint")
    print("[INFO] : MACCs : {} (M)".format(macc))
    print("[INFO] : Total Flash : {0:.1f} (KiB)".format(total_flash))
    print("[INFO] :     Flash Weights  : {0:.1f} (KiB)".format(weights_rom))
    print("[INFO] :     Estimated Flash Code : {0:.1f} (KiB)".format(code_rom))
    print("[INFO] : Total RAM : {0:.1f} (KiB)".format(total_ram))
    print("[INFO] :     RAM Activations : {0:.1f} (KiB)".format(activations_ram))
    print("[INFO] :     RAM Runtime : {0:.1f} (KiB)".format(runtime_ram))


def benchmark_model(stmai:object,
                    model_path:str,
                    model_name:str,
                    optimization:str,
                    from_model:str,
                    board_name:str,
                    allocateInputs:bool =True,
                    allocateOutputs:bool=True) -> float:
    """
    Benchmarks the given model to calculate its footprint on an STM32 target board.

    Args:
        stmai:object, an object of stm32ai_dc
        model_path:str, path to the model file
        model_name:str, name of the model file
        optimization:str, how the model is to be optimized; available options are ['balanced', 'time', 'ram']
        from_model:str, whether the model comes from the zoo or is a custom model from the user
        board_name:str, target board name from one of the available boards on the dev cloud
        allocateInputs:bool, if set to True, the activations buffer will also be used to handle the input buffers
        allocateOutputs:bool, if set to True, the activations buffer will also be used to handle the output buffers

    Returns:
        fps: frames per second (1/inference_time)
    """
    print(f"Benchmarking on: {board_name}")
    if "mp" in board_name.lower():
        # if mpu is selected as the target
        model_extension = os.path.splitext(model_path)[1]
        # only supported options are quantized tflite or onnx models
        if model_extension in ['.onnx', '.tflite']:
            if "stm32mp2" in board_name.lower(): # if mp2 is selected as the target board optimize the model to generate a .nbg file
                optimized_model_path = os.path.dirname(model_path) + "/"
                try:
                    stmai.upload_model(model_path)
                    model = model_name
                    res = stmai.generate_nbg(model)
                    stmai.download_model(res, optimized_model_path + res)
                    model_path=os.path.join(optimized_model_path,res)
                    nb_model_name = os.path.splitext(os.path.basename(model_path))[0] + ".nb"
                    rename_model_path=os.path.join(optimized_model_path,nb_model_name)
                    os.rename(model_path, rename_model_path)
                    model_path = rename_model_path
                    print("[INFO] : Optimized Model Name:", model_name)
                    print("[INFO] : Optimization done ! Model available at :",optimized_model_path)
                    model_name = nb_model_name
                except Exception as e:
                    print(f"[FAIL] : Model optimization via Cloud failed : {e}.")
                    print("[INFO] : Use default model instead of optimized ...")
        else:
            print("[ERROR]: Only .tflite or .onnx models can be benchmarked for MPU")
            fps = 0
            return fps

        engine, nbCores = get_mpu_options(board_name)
        stmai_params = MpuParameters(model=model_name,
                                     nbCores=nbCores,
                                     engine=engine)

    elif 'stm32n6' in board_name.lower():
        model_extension = os.path.splitext(model_path)[1]
        if model_extension in ['.onnx', '.tflite']:
            stmai_params = CliParameters(model=model_name,
                                         target='stm32n6',
                                         stNeuralArt='default',
                                         atonnOptions=AtonParameters())
        else:
            # return early: stmai_params is not defined for unsupported formats
            print("[ERROR]: Only .tflite or .onnx models can be benchmarked for N6")
            fps = 0
            return fps
        
    else:
        # target board is an MCU, prepare stm32ai parameters
        stmai_params = CliParameters(model=model_name,
                                     optimization=optimization,
                                     allocateInputs=allocateInputs,
                                     allocateOutputs=allocateOutputs,
                                     fromModel=from_model)
    # running the benchmarking with prepared params
    
    benchmark_report_dir = f'{stm32ai_output_dir}/benchmark_reports/'
    os.makedirs(benchmark_report_dir, exist_ok=True)


    try:
        result = stmai.benchmark(stmai_params, board_name)
        fps = analyze_inference_time(report=result,
                                     target_mpu="mp" in board_name.lower())
        # Save the result in outputs folder
        with open(f'./{benchmark_report_dir}/{model_name}_{board_name}.txt', 'w') as file_benchmark:
            file_benchmark.write(f'{result}')
        return fps

    except Exception as e:
        print(f"Benchmarking failed on board: {board_name} ({e})")
        fps = 0
        return fps

def analyze_inference_time(report: object = None,
                           target_mpu = False) -> float:
    """
    Analyzes the inference time of a STEdgeAI model, prints the report and return the FPS.
    Args:
        report: A report object containing information about the model.
        target_mpu: a boolean (True: if target is MPU, False: otherwise)

    Returns:
        The frames per second (FPS) of the model.
    """

    inference_time: float = report.duration_ms
    fps: float = 1000.0/inference_time
    if not target_mpu:
        # in mpu benchmark result report we do not have cycles
        cycles: int = report.cycles
        print("[INFO] : Number of cycles : {} ".format(cycles))
    print("[INFO] : Inference Time : {0:.1f} (ms)".format(inference_time))
    print("[INFO] : FPS : {0:.1f}".format(fps))
    return fps


# UI widgets

# optimization options
optimization: List[str] = ["balanced", "time", "ram"]
optim_dropdown: widgets.Dropdown = widgets.Dropdown(
    options=optimization,
    value=optimization[0],
    description='Optim:',
    disabled=False
)

# STM32MCU series for code generation target
series_name: List[str] = [
    "STM32H7", "STM32F7", "STM32F4", "STM32L4", "STM32G4",
    "STM32F3", "STM32U5", "STM32L5", "STM32F0", "STM32L0",
    "STM32G0", "STM32C0", "STM32WL", "STM32H5"
]
series_dropdown: widgets.Dropdown = widgets.Dropdown(
    options=series_name,
    value=series_name[0],
    description='Series:',
    disabled=False
)

# options for the IDE while code generation
IDE_name: List[str] = ["gcc", "iar", "keil"]
ide_dropdown: widgets.Dropdown = widgets.Dropdown(
    options=IDE_name,
    value=IDE_name[0],
    description='IDE:',
    disabled=False
)

### A. Login to STEdgeAI Developer Cloud
Set environment variables with your credentials to access STEdgeAI Developer Cloud.

If you don't have an account yet, go to https://stm32ai-cs.st.com/home and click on sign in to create one.

Then set the environment variables below with your credentials.

In [None]:
import getpass
# Set environment variables with your credentials to access 
# STEdgeAI Developer Cloud services
# Fill the username with your login address 
username = 'your.email@example.com'
os.environ['stmai_username'] = username
print('Enter your password')
password = getpass.getpass()
os.environ['stmai_password'] = password

In [None]:
# Log in to STEdgeAI Developer Cloud
try:
    stmai = Stm32Ai(CloudBackend(str(username), str(password)))
    print("Successfully Connected!")
except Exception as e:
    print("Error: please verify your credentials")
    print(e)
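
The login cells above keep the e-mail address in the notebook source. As an alternative, here is a minimal sketch that reads both values from the `stmai_username` and `stmai_password` environment variables set above and only prompts when one is missing. `read_credentials` is a hypothetical helper written for this illustration; it is not part of the `stm32ai_dc` package.

```python
import getpass
import os
from typing import Callable, Mapping, Tuple


def read_credentials(env: Mapping[str, str] = os.environ,
                     prompt: Callable[[str], str] = getpass.getpass) -> Tuple[str, str]:
    """Hypothetical helper: return (username, password), prompting only if absent."""
    username = env.get('stmai_username') or input('Username (e-mail): ')
    password = env.get('stmai_password') or prompt('Password: ')
    return username, password


# Example with an in-memory mapping, so nothing is prompted for
user, pwd = read_credentials({'stmai_username': 'your.email@example.com',
                              'stmai_password': 'not-a-real-password'})
print(user)
```

The returned pair can then be passed to `CloudBackend` exactly as in the login cell.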

### B. Upload the model on STEdgeAI Developer Cloud

In [None]:
# Get the onnx models available locally
model_list = []
for entry in os.listdir('./results_efficientnet_b0/exports/'):
  if os.path.isfile(os.path.join('./results_efficientnet_b0/exports/', entry)):
    if entry.endswith('.onnx'): model_list.append(entry)
model_sel_dropdown = widgets.Dropdown(
    options=model_list,
    value=model_list[0],
    description='Model:',
    disabled=False
)
display(model_sel_dropdown)
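
The cell above raises an `IndexError` on `model_list[0]` when the exports folder contains no `.onnx` file. The sketch below shows one way to collect the model names so that the empty case can be checked before building the dropdown. `list_onnx_models` is a hypothetical helper, and the file names in the demonstration are placeholders.

```python
import tempfile
from pathlib import Path
from typing import List


def list_onnx_models(export_dir: str) -> List[str]:
    """Return the sorted names of the .onnx files directly inside export_dir."""
    folder = Path(export_dir)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir()
                  if p.is_file() and p.suffix == '.onnx')


# Demonstration on a throwaway directory with placeholder file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ('model_a.onnx', 'model_b.onnx', 'notes.txt'):
        (Path(tmp) / name).touch()
    found = list_onnx_models(tmp)
    missing = list_onnx_models(str(Path(tmp) / 'does_not_exist'))

print(found)
```

An empty list can then be reported to the user before any widget is created.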

In [None]:
model_name = model_sel_dropdown.value
model_path = f'./results_efficientnet_b0/exports/{model_name}'
model_name = os.path.basename(model_path)
from_model = 'user'

try:
  stmai.upload_model(model_path)
  print(f'Model {model_name} is uploaded !')
except Exception as e:
    print("ERROR: ", e)

### C. Select the STEdgeAI optimization setting
| Configuration | Description |
| --- | --- |
| balanced | default compromise between RAM footprint and latency. |
| time | optimize for latency. |
| ram | optimize for minimal RAM footprint. |

In [None]:
display(optim_dropdown)

### D. Analyze your model memory footprints for STM32MCU targets
When analyzing the model footprints for STM32MCU targets, the following parameters can be configured for the `stm32ai.analyze` call:

`CliParameters` (options of STEdgeAI):

| Parameter | Description |
| --- | --- |
| model | Model name corresponding to the file name uploaded. This parameter is __required__. |
| optimization | Optimization setting: "balanced", "time" or "ram". This parameter is __required__. |
| allocateInputs | If set to "True", activations buffer will be also used to handle the input buffers. This parameter is __optional__. Default value is "True". |
| allocateOutputs | If set to "True", activations buffer will be also used to handle the output buffers. This parameter is __optional__. Default value is "True". |
| noOnnxOptimizer | If set to "True", allows to disable the ONNX optimizer pass. This parameter is __optional__. Default value is "False". |
| fromModel | To identify the origin model when coming from ST model zoo. This parameter is __optional__. Default value is "user".|

In [None]:
# Analyze RAM/Flash model memory footprints after optimization by STEdgeAI
optimization = optim_dropdown.value
print(f'Analyzing model : {model_name}, using optimization : {optimization}')
# The runtime library footprint varies slightly depending on the STM32 series
# For an estimation, we default to the STM32F4 series
try:
  result = stmai.analyze(CliParameters(model=model_path,
                                       optimization=optimization,
                                       allocateInputs=True,
                                       allocateOutputs=True,
                                       fromModel=from_model))

  # analyze and print the summary of footprint report
  analyze_footprints(report=result)
  
  # Save the result in outputs folder
  stm32ai_analysis_dir = f'{stm32ai_output_dir}/analysis_report'
  os.makedirs(stm32ai_analysis_dir, exist_ok=True)
  with open(f'./{stm32ai_analysis_dir}/{model_name}_analyze.txt', 'w') as file_analyze:
    file_analyze.write(f'{result}')
except Exception as e:
    print("Error: ", e)
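
The summary printed by `analyze_footprints` is a unit conversion plus two sums: byte counts are divided by 1024 to get KiB, and MACCs are divided by 1e6 to get millions. The sketch below reproduces that arithmetic on a stand-in object. The attribute names are the ones used in `analyze_footprints`, but the numbers are made up for illustration and `summarize_footprints` is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize_footprints(report) -> dict:
    """Convert byte counts to KiB and MACC to millions, mirroring analyze_footprints."""
    activations_ram = report.ram_size / 1024
    runtime_ram = report.estimated_library_ram_size / 1024
    weights_rom = report.rom_size / 1024
    code_rom = report.estimated_library_flash_size / 1024
    return {
        'total_ram_kib': activations_ram + runtime_ram,
        'total_flash_kib': weights_rom + code_rom,
        'macc_millions': report.macc / 1e6,
    }


# Made-up numbers, for illustration only (not a real analysis result)
fake_report = SimpleNamespace(ram_size=204800, estimated_library_ram_size=10240,
                              rom_size=1048576, estimated_library_flash_size=51200,
                              macc=25_000_000)
summary = summarize_footprints(fake_report)
print(summary)
```

Total RAM is activations plus runtime library RAM, and total flash is weights plus library code, so both totals depend on the selected optimization setting and on the series used for the library estimate.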

### E. Benchmark your model on a STM32 target
Starting with STEdgeAI Developer Cloud version 10.0.0, models can be benchmarked on STM32MCU, STM32MPU, and STM32NPU target boards.

The table below lists the parameters used when benchmarking on STM32MCU targets (`CliParameters` options of STEdgeAI):

| Parameter | Description |
| --- | --- |
| model | Model name corresponding to the file name uploaded. This parameter is required. |
| optimization | Optimization setting: "balanced", "time" or "ram". This parameter is required. |
| allocateInputs | If set to "True", activations buffer will be also used to handle the input buffers. This parameter is optional. Default value is "True". |
| allocateOutputs | If set to "True", activations buffer will be also used to handle the output buffers. This parameter is optional. Default value is "True". |
| noOnnxOptimizer | If set to "True", allows to disable the ONNX optimizer pass. This parameter is optional. Default value is "False". Applies only to ONNX files and is ignored otherwise. |
| fromModel | To identify the origin model when coming from ST model zoo. This parameter is optional. Default value is "user". |


For the STM32MPU targets, the only parameters needed are:

| Parameter | Description |
| --- | --- |
| model | Model name corresponding to the file name uploaded. This parameter is __required__. |
| nbCores | Number of CPU cores used for benchmarking. This parameter is __set by the code__ depending on the type of MPU. The value should be the integer 1 or 2. |
| engine | Hardware engine used on the board for benchmarking. This parameter is __set by the code__ depending on the target MPU. For STM32MP1X boards it is "MpuEngine.CPU" and for STM32MP2X boards it is "MpuEngine.HW_ACCELERATOR". |

* Note that in the code section below, the `board_name` used to benchmark the model must be a string.
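
Which parameter set applies is decided from the board name alone. The sketch below restates the substring checks used in `benchmark_model` as a small standalone function. `target_family` is a hypothetical name, and apart from `STM32N6570-DK`, which appears later in this notebook, the board names are illustrative rather than taken from the dev cloud board list.

```python
def target_family(board_name: str) -> str:
    """Classify a board name with the same substring checks as benchmark_model."""
    name = board_name.lower()
    if 'mp' in name:
        return 'MPU'  # benchmarked with MpuParameters
    if 'stm32n6' in name:
        return 'NPU'  # benchmarked with CliParameters targeting stm32n6
    return 'MCU'      # benchmarked with the regular CliParameters


# 'STM32N6570-DK' appears in this notebook; the other two names are illustrative
for board in ('STM32N6570-DK', 'STM32MP257F-EV1', 'STM32H747I-DISCO'):
    print(board, '->', target_family(board))
```

Because the MPU check is a plain `'mp'` substring test, any board name containing those two letters would be treated as an MPU, which is worth keeping in mind if new boards are added to the list.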

In [None]:
# Get the available board on STEdgeAI Developer Cloud
boards = stmai.get_benchmark_boards()
board_names = [board.name for board in boards]
print("Available boards:", board_names)

#### Option 1. Benchmark on all available STM32 boards

In [None]:
# Benchmark the model on all STEdgeAI Developer Cloud boards
print(model_name)
fps_array = []
# loop through all boards
for board_name in board_names:
        fps_array.append(benchmark_model(stmai=stmai,
                                         model_path=model_path,
                                         model_name=model_name,
                                         optimization=optimization,
                                         from_model=from_model,
                                         board_name=board_name,
                                         allocateInputs= True,
                                         allocateOutputs=True))

In [None]:
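# Display the frames-per-second benchmark
# Pair each board with its own FPS before sorting, so that boards with equal FPS
# (for example several failed runs reported as 0) keep their own labels
ranked = sorted(zip(board_names, fps_array), key=lambda pair: pair[1], reverse=True)
sorted_boards = [name for name, _ in ranked]
sorted_fps = [value for _, value in ranked]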
fig = plt.figure(1, figsize=(15, 8), tight_layout=True)
# colors = sns.color_palette()
colors = ['#4C72B0', '#55A868', '#C44E52', '#8172B2', '#CCB974',
          '#64B5CD', '#B4A7D6', '#AEC7E8', '#FFA07A', '#FFC0CB',
          '#FFFFB3', '#8DD3C7', '#BEBADA', '#FDB462', '#FB8072',
          '#FF6347', '#4682B4', '#6A5ACD', '#7FFF00', '#D2691E']

plt.bar(sorted_boards, sorted_fps, color=colors[:len(boards)], width=0.7)
plt.ylabel('FPS', fontsize=15)
plt.yticks(fontsize=12)
plt.xticks(sorted_boards, rotation = 75)
plt.title('STM32 FPS benchmark')
plt.show()

#### Option 2. Benchmark on a selected board

In [None]:
# Select a board among the available boards
board_dropdown = widgets.Dropdown(
    options = board_names,
    value = 'STM32N6570-DK',
    description ='Board:',
    disabled = False,)

display(board_dropdown)

In [None]:
board_name = board_dropdown.value
print(model_name, board_name)
fps = benchmark_model(stmai=stmai,
                      model_path=model_path,
                      model_name=model_name,
                      optimization=optimization,
                      from_model=from_model,
                      board_name=board_name,
                      allocateInputs= True,
                      allocateOutputs=True)
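
The value returned by `benchmark_model` is the inverse of the measured inference time, with 0 used when the benchmark fails. A minimal sketch of that conversion, using a hypothetical helper name and an arbitrary duration:

```python
def fps_from_duration(duration_ms: float) -> float:
    """Frames per second for one inference lasting duration_ms milliseconds."""
    if duration_ms <= 0:
        return 0.0  # same "no result" convention as benchmark_model
    return 1000.0 / duration_ms


# 20 ms per inference is an arbitrary example value
print(fps_from_duration(20.0))
```

The figure only covers the inference itself, so any pre-processing or post-processing time on the target has to be accounted for separately.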

### F. Generate your model optimized C code for STM32MCU targets

To deploy the model on an STM32MCU target, the user has to generate the C code of the optimized model. The table below lists the parameters of the `stm32ai.generate` call (`CliParameters` of STEdgeAI):

| Parameter | Description |
| --- | --- |
| model | Model name corresponding to the file name uploaded. This parameter is required. |
| optimization | Optimization setting: "balanced", "time" or "ram". This parameter is required. |
| allocateInputs | If set to "True", activations buffer will be also used to handle the input buffers. This parameter is optional. Default value is "True". |
| allocateOutputs | If set to "True", activations buffer will be also used to handle the output buffers. This parameter is optional. Default value is "True". |
| noOnnxOptimizer | If set to "True", allows to disable the ONNX optimizer pass. This parameter is optional. Default value is "False". Applies only to ONNX files and is ignored otherwise. |
| includeLibraryForSerie | Include the runtime library for the given STM32 series. This parameter is optional. |
| fromModel | To identify the origin model when coming from ST model zoo. This parameter is optional. |



### NOTE

This step is not needed if the deployment targets an MPU: the .tflite model can be deployed directly on the STM32MPUs. For STM32MP2x, an optimized version of the model should already be available in the folder that contains the starting model, with the same name as the model and the extension ".nb".


Due to licensing issues, the C-Code for STM32NPU can only be generated by downloading and installing the STEdgeAI locally.
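
To check whether the optimized STM32MP2x model mentioned above is present, the expected `.nb` path can be derived from the model path. This is a sketch under the assumption stated in the note, namely that the optimized file keeps the base name of the starting model. `nb_model_path` is a hypothetical helper and `model_qdq.onnx` is a placeholder file name.

```python
import os


def nb_model_path(model_path: str) -> str:
    """Return the path of the '.nb' file expected next to model_path."""
    folder = os.path.dirname(model_path)
    stem = os.path.splitext(os.path.basename(model_path))[0]
    return os.path.join(folder, stem + '.nb')


# 'model_qdq.onnx' is a placeholder file name
expected = nb_model_path(os.path.join('exports', 'model_qdq.onnx'))
print(expected, os.path.isfile(expected))
```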

In [None]:
display(series_dropdown)
display(ide_dropdown)

In [None]:
series = series_dropdown.value
IDE = ide_dropdown.value
print(f'Generating optimized C code of {model_name} model, for {series} series boards!\n')
# Generate model .c/.h code + Lib/Inc on STEdgeAI Developer Cloud
stm32ai_code_dir = f'{stm32ai_output_dir}/generated_code'
os.makedirs(stm32ai_code_dir, exist_ok=True)
result = stmai.generate(CliParameters(
    model=model_name,
    output=stm32ai_code_dir,
    optimization=optimization,
    allocateInputs=True,
    allocateOutputs=True,
    includeLibraryForSerie=CliLibrarySerie(series),
    includeLibraryForIde=CliLibraryIde(IDE),
    fromModel=from_model
))
!ls "{stm32ai_code_dir}"
# print the first 20 lines of the report (or fewer if the report is shorter)
if os.path.isfile(f'./{stm32ai_code_dir}/network_generate_report.txt'):
  print("\n\n---- code generation report ----\n","*" * 80)
  with open(f'./{stm32ai_code_dir}/network_generate_report.txt', 'r') as f:
    for line in f.readlines()[:20]: print(line)


#### You are ready to integrate your model in your STM32 application !

#### (Optional) : Delete your model from your STEdgeAI Developer Cloud space

In [None]:
if stmai.delete_model(model_name):
    print(f'{model_name} deleted from STEdgeAI Developer Cloud workspace!')

## Deployment on STM32N6, STM32H7* and STM32MPU

The `QDQ` quantized models can be deployed using the [stm32ai-modelzoo-services](https://github.com/STMicroelectronics/stm32ai-modelzoo-services) as an [image_classification](https://github.com/STMicroelectronics/stm32ai-modelzoo-services/tree/main/image_classification) model. For details, refer to [Deploying Image Classification models on STM32MCU](https://github.com/STMicroelectronics/stm32ai-modelzoo-services/blob/main/image_classification/deployment/README.md) and [Deploying Image Classification Models on STM32MPU](https://github.com/STMicroelectronics/stm32ai-modelzoo-services/blob/main/image_classification/deployment/README_MPU.md).