In [None]:
#  Copyright (c) 2021 Arm Limited. All rights reserved.
#  SPDX-License-Identifier: Apache-2.0
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

#  Fast Inference on Arm® Ethos™-U55 microNPU with Arm ML Embedded Evaluation Kit

 

In this notebook you will see how to build and run a micro speech command model targeting the Arm® Cortex™-M55 CPU and
the Arm® Ethos™-U55 microNPU using the Arm ML Embedded Evaluation Kit and Arm Virtual Hardware.
The [Ethos-U55 microNPU](https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55) is a first-generation microNPU designed
to accelerate computation for ML workloads in resource-constrained embedded and IoT devices. Its advanced compression
techniques save power and significantly reduce ML model sizes, enabling execution of neural networks that previously
ran only on larger systems. The Ethos-U microNPU works with Cortex-M CPUs and Arm Corstone™ systems and allows developers
to configure and build high-performance, power-efficient SoCs while differentiating with combinations of Arm processors
and their own IP.
 
 
The **Arm ML Embedded Evaluation Kit** allows developers to quickly build and deploy embedded machine learning
applications for Arm Cortex-M55 CPU and Arm Ethos-U55 microNPU. With ML Embedded Evaluation Kit you can run inferences by
using either a custom neural network on Ethos-U microNPU or pre-built ML applications such as image classification,
keyword spotting (KWS), automated speech recognition (ASR), anomaly detection, and person detection all using Arm Fixed
Virtual Platform (FVP) available in Arm Virtual Hardware.

**Arm Virtual Hardware** is an accurate representation of a physical SoC. It runs as a simple application in a
Linux environment, which makes it easy to scale in the cloud and removes the dependency on silicon availability.
Powered by Amazon Web Services (AWS), Arm Virtual Hardware is delivered as an Amazon Machine Image (AMI) that developers
launch as a virtual server in the cloud. The image is preconfigured with Arm development tools for IoT, machine learning,
and embedded applications, including Arm Compilers, Fixed Virtual Platforms, and other development tools targeting
Cortex-M CPUs and Ethos-U microNPUs.

This notebook contains the following sections:

- Pre-processing the input data
- Training the Convolutional Neural Network (CNN) model using TensorFlow
- Optimizing and Quantizing the trained network model using TensorFlow Optimization Toolkit in order to target Ethos-U55 microNPU
- Compiling the quantized model with Arm Vela compiler
- Configuring and compiling the build project targeting Ethos-U55 using Arm ML Embedded Evaluation Kit
- Executing the model on Ethos-U55 microNPU using Arm Virtual Hardware

## Before you begin

#### 1. Install dependencies 


In [None]:
!pip install --upgrade tensorflow
!pip install matplotlib numpy IPython
!pip install tensorflow-model-optimization

#### 2. Clone Arm ML Embedded Evaluation Kit

In [None]:
!git clone "https://review.mlplatform.org/ml/ethos-u/ml-embedded-evaluation-kit"


In [None]:
%cd ml-embedded-evaluation-kit

#### 3. Pull all the external dependencies

In [None]:
!git submodule update --init

In [None]:
%cd ~/projects/MicroSpeechEthosU55

#### 4. Import necessary modules and run the code sample

In [None]:
import os
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras import layers
from tensorflow.keras import models
from IPython import display

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# Set seed for experiment reproducibility
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)

## Import the Speech Commands dataset

The Google [Speech Commands dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) consists of over
105,000 WAV audio files of people saying 35 different words. It was collected by Google and released under a CC BY license.
In this tutorial we download and extract a portion of the Speech Commands dataset containing one-second WAV files of 8
different words, sampled at 16 kHz.

In [None]:
data_dir = pathlib.Path('data/mini_speech_commands')
if not data_dir.exists():
    tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
      extract=True,
      cache_dir='.', cache_subdir='data')

In [None]:
commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[commands != 'README.md']
print('Commands:', commands)

filenames = tf.io.gfile.glob(str(data_dir) + '/*/*')
filenames = tf.random.shuffle(filenames)
num_samples = len(filenames)

print('Number of total examples:', num_samples)
print('Number of examples per label:',
      len(tf.io.gfile.listdir(str(data_dir/commands[0]))))

Split the dataset into training, validation, and test sets so that model performance can be evaluated on unseen data.

In [None]:
train_files = filenames[:6400]
val_files = filenames[6400: 6400 + 800]
test_files = filenames[-800:]

print('Training set size', len(train_files))
print('Validation set size', len(val_files))
print('Test set size', len(test_files))

## Reading audio data and their labels from audio file

Each .wav file contains a header followed by the raw time-domain samples. At a sampling rate of 16 kHz, one
second of audio has 16,000 samples.

TensorFlow provides the `tf.io` module to read an audio file as binary data and the `tf.audio` module to process the audio.
The `tf.audio.decode_wav` API decodes a 16-bit PCM WAV file and returns the audio as a float tensor scaled to the range
[-1, 1], together with the sample rate.

In [None]:
def decode_audio(audio_binary):
    audio, _ = tf.audio.decode_wav(audio_binary)
    return tf.squeeze(audio, axis=-1)
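
To make the scaling step concrete, here is a minimal NumPy sketch of what decoding a 16-bit PCM WAV file involves. It is an illustration, not the TensorFlow implementation: the helper name `decode_pcm16_wav` is hypothetical, and the division by 32768 is an assumption about how the int16 range is mapped to [-1, 1].

```python
import io
import wave

import numpy as np


def decode_pcm16_wav(wav_bytes):
    """Decode 16-bit PCM WAV bytes into float32 samples in [-1, 1]."""
    with wave.open(io.BytesIO(wav_bytes), 'rb') as wav:
        if wav.getsampwidth() != 2:
            raise ValueError('expected 16-bit PCM audio')
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        pcm = np.frombuffer(wav.readframes(wav.getnframes()), dtype='<i2')
    # Assumption: int16 values are scaled by 1/32768 to land in [-1, 1].
    audio = pcm.astype(np.float32) / 32768.0
    return audio.reshape(-1, channels), sample_rate


# Build a synthetic one-second, 16 kHz, mono clip in memory (no file needed).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
pcm = np.round(0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype('<i2')
buffer = io.BytesIO()
with wave.open(buffer, 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(sample_rate)
    wav.writeframes(pcm.tobytes())

audio, rate = decode_pcm16_wav(buffer.getvalue())
mono = audio[:, 0]  # drop the channel axis, like tf.squeeze(audio, axis=-1)
print(rate, mono.shape, float(np.abs(mono).max()))
```

The decoded array has one column per channel, which is why `decode_audio` squeezes the last axis for mono clips.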

Supervised learning algorithms need a set of inputs and their corresponding outputs in order to build a predictive
model. In our dataset, the label of each WAV file is the name of its parent directory, which the `get_label` function
extracts. The `get_waveform_and_label` function then pairs each WAV file with its label: it takes the file path and
returns a tuple containing the decoded audio and the associated label.

In [None]:
def get_label(file_path):
    parts = tf.strings.split(file_path, os.path.sep)
    
    return parts[-2]

In [None]:
def get_waveform_and_label(file_path):
    label = get_label(file_path)
    audio_binary = tf.io.read_file(file_path)
    waveform = decode_audio(audio_binary)
    
    return waveform, label
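
The same directory-based labeling rule can be sketched outside TensorFlow with the standard library. The helper name `label_from_path` and the file names below are hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath


def label_from_path(file_path):
    """Return the name of the directory that directly contains the file."""
    return PurePosixPath(file_path).parent.name


# Hypothetical paths that follow the <data_dir>/<label>/<file>.wav layout.
paths = [
    'data/mini_speech_commands/yes/example_0.wav',
    'data/mini_speech_commands/stop/example_1.wav',
]
labels = [label_from_path(p) for p in paths]
print(labels)
```

This mirrors `parts[-2]` in `get_label`: the second-to-last path component is the label.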

## Pre-processing with tf.data

The tf.data API helps you build complex input pipelines that handle large amounts of data, read from
different formats such as CSV, NumPy arrays, and text, and perform complex transformations.

Although GPUs and TPUs can significantly reduce training time, the accelerator may sit partly idle when the CPU-side
input pipeline becomes the bottleneck. It is therefore important to make the input pipeline efficient, and the tf.data
API addresses this issue. Several techniques reduce the overhead and are easy to add to a pipeline:

- Prefetching
- Parallelizing data extraction and transformation
- Caching
- Vectorizing mapping

You can read more about each technique in the TensorFlow tutorials. Here we parallelize the data transformation to
pre-process our dataset before passing it to the model for training.
`tf.data.Dataset.map` applies the transformation (extracting the audio-label pairs), and setting
`num_parallel_calls=tf.data.AUTOTUNE` lets the tf.data runtime spread the work across the available CPU cores.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE
files_ds = tf.data.Dataset.from_tensor_slices(train_files)
waveform_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)

## Spectral feature extraction for audio analysis

A spectrogram is a common tool for analyzing audio files and sound waves. It shows how the frequency content of a
waveform changes over time and can be represented as a 2D image with time on the x-axis, frequency on the y-axis,
and color intensity representing the signal strength. The spectrogram image therefore shows how the strength of the
signal is distributed over different frequencies at each point in time.

The short-time Fourier transform (`tf.signal.stft`) converts a signal to the time-frequency domain. It produces an
array of complex numbers that encode magnitude and phase; here we keep only the magnitude, obtained with `tf.abs`.

To obtain waveforms of the same length we can zero pad audio that is shorter than one second.

In [None]:
def get_spectrogram(waveform):
    # Padding for files with less than 16000 samples
    zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)
    
    # Concatenate audio with padding so that all audio clips will be of the 
    # same length
    waveform = tf.cast(waveform, tf.float32)
    equal_length = tf.concat([waveform, zero_padding], 0)
    spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)
      
    spectrogram = tf.abs(spectrogram)
    
    return spectrogram
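
To see where the spectrogram shape comes from, the framing arithmetic can be reproduced in NumPy. This is a sketch under stated assumptions, not the `tf.signal.stft` implementation: it assumes a periodic Hann window, an FFT length rounded up to the next power of two (256 for a 255-sample frame), and no padding at the end of the signal. The function name `stft_magnitude` is hypothetical.

```python
import numpy as np


def stft_magnitude(signal, frame_length=255, frame_step=128):
    """Magnitude of a short-time Fourier transform, NumPy only."""
    # Assumption: FFT length is the next power of two >= frame_length.
    fft_length = 1 << (frame_length - 1).bit_length()
    num_frames = 1 + (len(signal) - frame_length) // frame_step
    # Assumption: periodic Hann window applied to every frame.
    n = np.arange(frame_length)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_length)
    frames = np.stack([
        signal[i * frame_step: i * frame_step + frame_length] * window
        for i in range(num_frames)
    ])
    # rfft keeps the fft_length // 2 + 1 non-negative frequency bins.
    return np.abs(np.fft.rfft(frames, n=fft_length, axis=-1))


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * t)  # one-second 1 kHz test tone

spec = stft_magnitude(tone)
peak_bin = int(spec.mean(axis=0).argmax())
print(spec.shape)  # (124, 129)
print(peak_bin * sample_rate / 256)  # frequency of the strongest bin in Hz
```

With 16,000 samples, a frame length of 255, and a step of 128, there are 1 + (16000 - 255) // 128 = 124 frames, each with 256 / 2 + 1 = 129 frequency bins.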

In [None]:
for waveform, label in waveform_ds.take(100):
    label = label.numpy().decode('utf-8')
    spectrogram = get_spectrogram(waveform)

print('Label:', label)
print('Waveform shape:', waveform.shape)
print('Spectrogram shape:', spectrogram.shape)
print('Audio playback')
display.display(display.Audio(waveform, rate=16000))

##  Visualize the waveform and spectrogram

In [None]:
def plot_spectrogram(spectrogram, ax):
    # Add a small epsilon so that zero-valued bins do not produce -inf.
    log_spec = np.log(spectrogram.T + np.finfo(float).eps)
    height = log_spec.shape[0]
    width = log_spec.shape[1]
    X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
    Y = range(height)
    ax.pcolormesh(X, Y, log_spec, shading='auto')

# training data
fig, axes = plt.subplots(2, figsize=(12, 8))
timescale = np.arange(waveform.shape[0])
axes[0].plot(timescale, waveform.numpy())
axes[0].set_title('Waveform')
axes[0].set_xlim([0, 16000])
plot_spectrogram(spectrogram.numpy(), axes[1])
axes[1].set_title('Spectrogram')
plt.show()

Now we transform the waveform dataset to get the spectrogram images and their corresponding labels as integer IDs.

In [None]:
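def get_spectrogram_and_label_id(audio, label):
    spectrogram = get_spectrogram(audio)
    # Add a channel dimension so the spectrogram looks like a one-channel image.
    spectrogram = tf.expand_dims(spectrogram, -1)
    # Downsample to 32x32 to keep the model small.
    spectrogram = tf.image.resize(spectrogram, (32, 32))
    # The label ID is the index of the label in the commands array.
    label_id = tf.argmax(label == commands)
    return spectrogram, label_id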
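
The label-to-ID step relies on comparing one label against the whole array of command names and taking the index of the match. A small NumPy sketch of the same idea follows; the `commands` array here is illustrative (the notebook builds it from the dataset directory listing) and `label_to_id` is a hypothetical helper.

```python
import numpy as np

# Illustrative command list; the notebook derives the real one from data_dir.
commands = np.array(['down', 'go', 'left', 'no', 'right', 'stop', 'up', 'yes'])


def label_to_id(label, commands):
    """Index of `label` in `commands` (0 if the label is not present)."""
    # The comparison yields a boolean mask; argmax returns the first True.
    return int(np.argmax(commands == label))


print(label_to_id('right', commands))
print(label_to_id('yes', commands))
```

One caveat of this pattern: a label that is not in `commands` produces an all-False mask, and `argmax` then silently returns 0, so it only works when every label is known in advance.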

In [None]:
spectrogram_ds = waveform_ds.map(
    get_spectrogram_and_label_id, num_parallel_calls=AUTOTUNE)

Apply the same pre-processing to the validation and test sets.

In [None]:
def preprocess_dataset(files):
    files_ds = tf.data.Dataset.from_tensor_slices(files)
    output_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)
    output_ds = output_ds.map(get_spectrogram_and_label_id,  num_parallel_calls=AUTOTUNE)
    return output_ds

In [None]:
train_ds = spectrogram_ds
val_ds = preprocess_dataset(val_files)
test_ds = preprocess_dataset(test_files)

Batch the training and validation sets for model training.

In [None]:
batch_size = 64
train_ds = train_ds.batch(batch_size)
val_ds = val_ds.batch(batch_size)

Add the dataset `cache()` and `prefetch()` operations to reduce read latency while training the model.

In [None]:
train_ds = train_ds.cache().prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)

## Training a CNN model

Convolutional Neural Networks (CNNs) are a class of deep neural networks well suited to image recognition: they
transform an image through successive layers into class scores. Because we converted the audio files into spectrogram
images, we train a CNN here. The model starts with a **normalization layer** that normalizes each pixel in the image
using the mean and standard deviation computed from the training data, before passing it to the following layer.

In [None]:
for spectrogram, _ in spectrogram_ds.take(1):
    input_shape = spectrogram.shape
print('Input shape:', input_shape)
num_labels = len(commands)

norm_layer = preprocessing.Normalization()
norm_layer.adapt(spectrogram_ds.map(lambda x, _: x))

model = models.Sequential([
    layers.Input(shape=input_shape), 
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])

model.summary()
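
The layer shapes and parameter counts reported by the summary can be checked by hand. The sketch below assumes Keras defaults (unpadded 3x3 convolutions with stride 1, 2x2 max pooling), a 32x32x1 input, and 8 output labels; the helper names are hypothetical, and the small non-trainable state of the normalization layer is not counted.

```python
def conv2d_valid(shape, filters, kernel=3):
    """Output shape and parameter count of an unpadded, stride-1 convolution."""
    height, width, channels = shape
    params = kernel * kernel * channels * filters + filters  # weights + biases
    return (height - kernel + 1, width - kernel + 1, filters), params


def max_pool(shape, pool=2):
    """Output shape of a non-overlapping max pooling layer."""
    height, width, channels = shape
    return (height // pool, width // pool, channels)


def dense(units_in, units_out):
    """Output width and parameter count of a fully connected layer."""
    return units_out, units_in * units_out + units_out


shape = (32, 32, 1)                       # resized spectrogram
shape, conv1_params = conv2d_valid(shape, 32)
shape, conv2_params = conv2d_valid(shape, 64)
shape = max_pool(shape)
flat = shape[0] * shape[1] * shape[2]     # Flatten
units, dense1_params = dense(flat, 128)
units, dense2_params = dense(units, 8)    # assumes 8 command labels

total = conv1_params + conv2_params + dense1_params + dense2_params
print(shape, flat, total)
```

Almost all of the parameters sit in the first dense layer, which is why the later clustering and quantization steps have such a large effect on the model size.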

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

In [None]:
EPOCHS = 10
history = model.fit(
    train_ds, 
    validation_data=val_ds,  
    epochs=EPOCHS,
    callbacks=[tf.keras.callbacks.EarlyStopping(verbose=1, patience=2)],
)

### Evaluate the model performance on the test set

In [None]:
test_audio = []
test_labels = []

for audio, label in test_ds:
    test_audio.append(audio.numpy())
    test_labels.append(label.numpy())      

test_audio = np.array(test_audio)
test_labels = np.array(test_labels)

In [None]:
y_pred = np.argmax(model.predict(test_audio), axis=1)
y_true = test_labels

test_acc = sum(y_pred == y_true) / len(y_true)
print(f'Test set accuracy: {test_acc:.0%}')

### Save the model
You can save an entire model to a single artifact including the model's architecture, weights, compilation information
and training configuration such as optimizer, losses and metrics.

In [None]:
model.save('model.h5')

In [None]:
import numpy as np
import math
from pathlib import Path

def round_up(n, decimals=2):
    multiplier = 10 ** decimals
    return math.ceil(n * multiplier) / multiplier

home_dir = Path.home()
# Project directory; kept as a string because later cells concatenate onto it.
file_path = str(home_dir / 'projects' / 'MicroSpeechEthosU55')
size = os.path.getsize(file_path + '/model.h5') / 1000000

print('The size of model: {} MB'.format(round_up(size)))

### Evaluate the baseline model

In [None]:
_, baseline_model_accuracy = model.evaluate(
    test_audio, test_labels, verbose=0)

print('Baseline test accuracy:', baseline_model_accuracy)

## Model Optimization for inference on Ethos-U microNPU

To run and accelerate inference on edge devices, several optimization methods can be applied to machine learning
models. The TensorFlow Model Optimization Toolkit provides techniques such as
[quantization](https://www.tensorflow.org/model_optimization/guide/quantization/post_training),
[pruning](https://www.tensorflow.org/model_optimization/guide/pruning), and
[clustering](https://www.tensorflow.org/model_optimization/guide/clustering) that are compatible with TensorFlow Lite.
Depending on the technique, the complexity and size of the model can be reduced, which results in lower memory usage,
smaller storage size, and smaller download size.
Optimization is also required for some hardware accelerators, such as the Arm Ethos-U microNPU, because they perform
calculations in integer rather than floating-point precision.

### TensorFlow Model Optimization Toolkit - Weight Clustering API

Weight clustering, which was proposed and contributed to the TensorFlow Model Optimization Toolkit by the Arm ML Tooling
team, reduces the storage size of the model, which benefits deployment on resource-constrained embedded systems. The
technique replaces similar weights in each layer with the same value. These shared values are found by running a
clustering algorithm over the weights of the trained model.
Depending on the model and the number of clusters chosen, accuracy can drop after clustering. To reduce
the impact on accuracy, start from a pre-trained model with acceptable accuracy and fine-tune it after clustering.
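
The core idea, replacing each weight with the nearest of a small set of shared values, can be sketched in NumPy. This is a simplified single-step illustration and not the toolkit's algorithm, which also adjusts the centroids during fine-tuning. It assumes linearly spaced initial centroids between the smallest and largest weight, and the function name `cluster_weights_once` is hypothetical.

```python
import numpy as np


def cluster_weights_once(weights, number_of_clusters=32):
    """Snap every weight to the nearest of `number_of_clusters` centroids."""
    # Assumption: centroids are spaced linearly over the weight range.
    centroids = np.linspace(weights.min(), weights.max(), number_of_clusters)
    # For each weight, find the index of the closest centroid.
    indices = np.abs(weights[..., None] - centroids).argmin(axis=-1)
    return centroids[indices], indices


rng = np.random.default_rng(42)
w = rng.normal(0.0, 0.05, size=(128, 64))   # stand-in for one layer's weights

clustered, indices = cluster_weights_once(w)
print('unique values before:', np.unique(w).size)
print('unique values after: ', np.unique(clustered).size)
```

After clustering, the layer holds at most 32 distinct values, so each weight can be described by a 5-bit index plus a small lookup table. That repetition is what makes the model compress well.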

#### Define the model and apply weight clustering to a pre-trained model

In [None]:
import tensorflow_model_optimization as tfmot

cluster_weights = tfmot.clustering.keras.cluster_weights
CentroidInitialization = tfmot.clustering.keras.CentroidInitialization

clustering_params = {
  'number_of_clusters': 32,
  'cluster_centroids_init': CentroidInitialization.LINEAR
}

# Cluster a whole model
clustered_model = cluster_weights(model, **clustering_params)

# Use smaller learning rate for fine-tuning clustered model
opt = tf.keras.optimizers.Adam(learning_rate=1e-6)

clustered_model.compile(
  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
  optimizer=opt,
  metrics=['accuracy'])

clustered_model.summary()

#### Fine Tune the model with 1 epoch and evaluate the accuracy against the baseline

In [None]:
# Fine-tune the clustered model for one epoch.
# train_ds is already batched, so batch_size must not be passed to fit().
clustered_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=1,
)

In [None]:
_, clustered_model_accuracy = clustered_model.evaluate(
  test_audio, test_labels, verbose=0)

print('Baseline test accuracy:', baseline_model_accuracy)
print('Clustered test accuracy:', clustered_model_accuracy)

In [None]:
# Strip the clustering wrappers to create a compressible model for TensorFlow Lite.
final_model = tfmot.clustering.keras.strip_clustering(clustered_model)
clustered_tflite_file = 'clustered_model.tflite'
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
tflite_clustered_model = converter.convert()

with open(clustered_tflite_file, 'wb') as f:
    f.write(tflite_clustered_model)
print('Saved clustered TFLite model to:', clustered_tflite_file)

In [None]:
size = (os.path.getsize(file_path+'/clustered_model.tflite')/1000000)

print('The size of clustered model: {} MB'.format(round_up(size)))

### Create a TFLite model from combining weight clustering and post-training quantization

Because Ethos-U55 supports only 8-bit weights and 8- or 16-bit activations, post-training integer quantization should be
applied to the trained TensorFlow model to convert the weights and biases from floating-point to integer numbers.
Quantization is supported on all CPU platforms and also enables deploying the optimized model on special-purpose
hardware accelerators such as NPUs.
Weight clustering can be combined with quantization to gain the memory-footprint benefits of both techniques and to speed
up inference. Quantization then allows the clustered model to be used with Arm Ethos-N and Ethos-U machine learning processors.

Post-training integer quantization increases inference speed on microcontrollers and is compatible with fixed-point hardware accelerators such as Arm Ethos-U and Ethos-N NPUs. It converts the model’s parameters from 32-bit floating point to the nearest 8-bit fixed-point numbers, giving a 3-4x reduction in model size while keeping reasonable accuracy.


There are two modes of post-training integer quantization: 


- Post-training integer quantization with int8 activation and weights 

- Post-training integer quantization with int16 activation and int8 weights (16x8 quantization mode) 

Integer-only quantization converts weights, variables, and input and output tensors to integers. TensorFlow Lite also supports quantization with int16 activations and int8 weights during model conversion from TensorFlow to TensorFlow Lite’s FlatBuffer format; the conversion cell in this notebook selects this 16x8 mode through `converter.target_spec.supported_ops`.

With post-training quantization, the weights of the model are quantized to 8-bit integer values, followed by the
variable tensors such as layer activations. To calculate the range of values that these tensors can take, we need a
small subset of data that is representative of the model input during deployment; these samples can be taken from the
training or validation set. The converter runs inference on this representative dataset and records the minimum and
maximum values of the variable tensors.

In [None]:
def representative_dataset():
    # Yield 100 different validation samples, one per call, for calibration.
    for data, _ in val_ds.unbatch().batch(1).take(100):
        yield [data.numpy().astype(np.float32)]

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset

converter.target_spec.supported_ops = [tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8]

tflite_quant_model = converter.convert()

quantized_and_clustered_tflite_file = 'quantized_clustered.tflite'

with open(quantized_and_clustered_tflite_file, 'wb') as f:
    f.write(tflite_quant_model)

print('Saved quantized and clustered TFLite model to:', quantized_and_clustered_tflite_file)
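
The role of the recorded minimum and maximum values can be illustrated with a small NumPy sketch of affine (asymmetric) int8 quantization. It is a simplified illustration and not the converter's implementation: TensorFlow Lite applies its own per-tensor and per-axis rules, and the 16x8 mode selected above uses int16 for activations. The helper names are hypothetical.

```python
import numpy as np


def quantize_int8(x, x_min, x_max):
    """Map floats in [x_min, x_max] to int8 with a scale and zero point."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / 255.0
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized tensor."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
activations = rng.uniform(-1.0, 3.0, size=1000).astype(np.float32)

# Calibration: the observed range of the representative data.
q, scale, zero_point = quantize_int8(
    activations, float(activations.min()), float(activations.max()))
restored = dequantize(q, scale, zero_point)
max_error = float(np.abs(activations - restored).max())
print(scale, zero_point, max_error)
```

The round-trip error stays within half a quantization step, so a representative dataset that covers the real input range keeps the step, and therefore the error, small.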

#### See the persistence of accuracy from TF to TFLite

In [None]:
# Evaluate the TFLite model on the test dataset.
# Note: reads the ground-truth labels from the global test_labels array.
def eval_model(interpreter, test_audio):
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]

    # Run predictions on every spectrogram in the test dataset.
    predictions = []

    for i, sample in enumerate(test_audio):

        if i % 100 == 0:
            print('Evaluated on {n} results so far.'.format(n=i))

        # Pre-processing: add batch dimension and convert to float32 to match
        # the model's input data format.
        sample = np.expand_dims(sample, axis=0).astype(np.float32)
        interpreter.set_tensor(input_index, sample)

        # Run inference.
        interpreter.invoke()

        # Post-processing: remove batch dimension and find the class with the
        # highest score.
        output = interpreter.get_tensor(output_index)
        predictions.append(np.argmax(output[0]))

    print('\n')
    # Compare prediction results with ground truth labels to calculate accuracy.
    predictions = np.array(predictions)
    accuracy = (predictions == test_labels).mean()
    return accuracy

Load the quantized TFLite model into the TensorFlow Lite Interpreter Python API and evaluate it on the test set.

In [None]:
# Load the TFLite model into the TFLite interpreter
interpreter = tf.lite.Interpreter(model_content=tflite_quant_model)
interpreter.allocate_tensors()

test_accuracy = eval_model(interpreter, test_audio)

print('Clustered and quantized TFLite test_accuracy:', test_accuracy)
print('Clustered TF test accuracy:', clustered_model_accuracy)


In [None]:
size = (os.path.getsize(file_path+'/quantized_clustered.tflite')/1000000)

print('The size of clustered and quantized TFLite model: {} MB'.format(round_up(size)))

You can check the result of quantized TensorFlow Lite file with [Netron](https://netron.app/) and see the result of
post-training quantization which converts the weights and activations from floating point numbers to integer numbers.

## Compile the model for Ethos-U55 with Vela Compiler
 
To deploy your NN model on Ethos-U55, you need to compile the trained, quantized model with Vela to generate an optimized
NN model for Ethos-U. Vela is an open-source Python tool that compiles a TFLite NN model into an optimized version that
can run on an embedded system containing an Arm Ethos-U microNPU.

The optimized model has TensorFlow Lite custom operators (supported operators) for those parts of the model that can be
accelerated by the Ethos-U microNPU. Parts of the model that cannot be accelerated are left unchanged and will instead run on
the Cortex-M series CPU using an appropriate kernel.

You can install Vela with `pip install ethos-u-vela`.

In [None]:
!pip install ethos-u-vela

The Vela compiler accepts a set of parameters to influence model optimization. The model provided within this project
has been optimized with the following configuration:
 
- `accelerator-config`: specifies which of the following NPU configurations to use (this notebook uses the one in bold):
    - ethos-u55-256
    - **ethos-u55-128**
    - ethos-u55-64
    - ethos-u55-32
    - ethos-u65-256
    - ethos-u65-512
    
- `optimise`: sets the optimization strategy, either maximizing the **performance** of the model or minimizing its memory usage.

We will create a vela.ini file that describes our system configuration. This information helps Vela optimize the model
efficiently.

In [None]:
%%writefile vela.ini

[System_Config.Ethos_U55_High_End_Embedded]
core_clock=500e6
axi0_port=Sram
axi1_port=OffChipFlash
Sram_clock_scale=1.0
Sram_burst_length=32
Sram_read_latency=32
Sram_write_latency=32
OffChipFlash_clock_scale=0.125
OffChipFlash_burst_length=128
OffChipFlash_read_latency=64
OffChipFlash_write_latency=64

; Shared SRAM: the SRAM is shared between the Ethos-U and the Cortex-M software
; The non-SRAM memory is assumed to be read-only
[Memory_Mode.Shared_Sram]
const_mem_area=Axi1
arena_mem_area=Axi0
cache_mem_area=Axi0
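
Because the section names in vela.ini must match the names later passed on the Vela command line, it can help to list them before compiling. The sketch below uses ordinary INI parsing from the Python standard library; it is not Vela's own validation, the helper `named_sections` is hypothetical, and the embedded text is a shortened copy of the file above.

```python
import configparser

# Shortened copy of the vela.ini written above, embedded for illustration.
VELA_INI = """
[System_Config.Ethos_U55_High_End_Embedded]
core_clock=500e6
axi0_port=Sram
axi1_port=OffChipFlash

; Shared SRAM: the SRAM is shared between the Ethos-U and the Cortex-M software
[Memory_Mode.Shared_Sram]
const_mem_area=Axi1
arena_mem_area=Axi0
cache_mem_area=Axi0
"""


def named_sections(ini_text, prefix):
    """Return the names that follow '<prefix>.' in the INI section headers."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    marker = prefix + '.'
    return [name[len(marker):] for name in parser.sections()
            if name.startswith(marker)]


system_configs = named_sections(VELA_INI, 'System_Config')
memory_modes = named_sections(VELA_INI, 'Memory_Mode')
print('system configs:', system_configs)
print('memory modes:  ', memory_modes)
```

The printed names are the values expected by the `--system-config` and `--memory-mode` options used when compiling the network.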

Compile the network for an Ethos-U55 128 microNPU:

In [None]:
%%bash
vela --accelerator-config=ethos-u55-128 \
--optimise Performance \
--memory-mode=Shared_Sram \
--system-config=Ethos_U55_High_End_Embedded \
--config vela.ini \
quantized_clustered.tflite

To summarize, a neural network can be efficiently accelerated in an extremely small area and power envelope using the
following:

- TensorFlow Lite Micro
- TensorFlow Model Optimization Toolkit
- Ethos-U55 and Vela

Finally, after the model has been compiled with Vela, the output of the tool is an optimized TensorFlow Lite file
that is ready to deploy on a system with an Ethos-U microNPU, in this case Arm Virtual Hardware configured with the
Corstone-300 FVP.

## Arm ML Embedded Evaluation Kit
[ML Eval Kit](https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ml-embedded-evaluation-kit) is an open-source project available under the Apache 2.0 license.


It provides three main functions:

- Performance evaluation
    - number of NPU cycles that are necessary to compute inference
    - amount of memory transactions that occurred
    
    
- Software stack evaluation
    - contains developed ML applications for Ethos-U55 systems
    - configure the build system for a default build using build_default.py 
    
    
- Custom workflow
    - test custom NN performance on the Ethos-U55 with the Generic Inference Runner capability
    - configure the build system for non-default build:
        - specify Vela configuration and compile the model,
        - configure the build system with CMake
        - compile the project with make

### Configure the build system with CMake

Configure the build by creating a build directory in the root of the project, navigating into it, and running cmake
with the location of the TFLite file generated by Vela and, where the use case needs one, the associated labels text file.

In [None]:
!mkdir ~/projects/MicroSpeechEthosU55/ml-embedded-evaluation-kit/build

In [None]:
%cd ~/projects/MicroSpeechEthosU55/ml-embedded-evaluation-kit/build/


We will use the following build options:

- TARGET_PLATFORM
- CMAKE_TOOLCHAIN_FILE
- USE_CASE_BUILD
- <use_case\>\_MODEL_TFLITE_PATH

See [reference manual](https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ml-embedded-evaluation-kit/+/refs/heads/main/docs/sections/building.md#build-options)
for more details.

Use the Generic Inference Runner build option of the ML Eval Kit to profile inference speed for your specific ML
application on the Cortex-M55 CPU and Ethos-U55 microNPU.

In [None]:
%%bash
cmake -DTARGET_PLATFORM=mps3 \
    -DCMAKE_TOOLCHAIN_FILE=../scripts/cmake/toolchains/bare-metal-gcc.cmake \
    -Dinference_runner_MODEL_TFLITE_PATH=/home/ubuntu/projects/MicroSpeechEthosU55/output/quantized_clustered_vela.tflite \
    -DUSE_CASE_BUILD=inference_runner ..

### Compile the project with make

In [None]:
!make -j

The results of the build are placed in the build/bin folder.

## Running the application binary on an FVP emulating MPS3 using Arm Virtual Hardware

Arm Virtual Hardware provides an Ubuntu Linux image including Arm development tools for IoT, Machine Learning, and
embedded applications. Arm Compilers, Fixed Virtual Platforms, and other development tools targeting Cortex-M CPU are
available to get started quickly. The Arm Virtual Hardware Beta (Initial) Release is provided free of charge and may be
used only for evaluation, for example, to evaluate development processes in CI/CD, MLOps and DevOps workflows which
require automated testing and scalability beyond a farm of development boards.
For background on Amazon Machine Images, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instances-and-amis.html

#### Launch the desired application on the Fixed Virtual Platform (FVP) - Corstone-300 MPS3 based platform

Finally, to deploy the micro speech application on an FVP emulating the MPS3 FPGA board with Cortex-M55 and
Ethos-U55 processors, launch the Ethos-U55 variant of the FVP, `FVP_Corstone_SSE-300_Ethos-U55`.

The number of MACs on the Arm Virtual Hardware FVP execution should be the same as in the Vela compiler
`--accelerator-config` configuration. To pass the number of MACs to the Ethos-U55 model use the `ethosu.num_macs` parameter.
If the number of MACs used in the compilation does not match the model configuration at runtime, the inference will fail
with an NPU config mismatch error. It is essential to check that the number of MACs is the same for the build and
for the run.

Note the following about the FVP:

- The Ethos-U model is capable of producing cycle-approximate results (within 10% tolerance).
- The FVP cannot be used to profile the Cortex-M55.
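
One way to keep the build-time and run-time MAC counts consistent is to derive the FVP parameter from the same string that is passed to Vela. The helpers below are hypothetical and only illustrate the idea; they assume the `ethos-u55-<macs>` and `ethos-u65-<macs>` naming used in the `--accelerator-config` list earlier.

```python
import re


def macs_from_accelerator_config(accelerator_config):
    """Extract the MAC count from a name such as 'ethos-u55-128'."""
    match = re.fullmatch(r'ethos-u(55|65)-(\d+)', accelerator_config)
    if match is None:
        raise ValueError('unrecognized accelerator config: ' + accelerator_config)
    return int(match.group(2))


def fvp_mac_argument(accelerator_config):
    """Build the FVP parameter that matches the Vela accelerator config."""
    return 'ethosu.num_macs={}'.format(
        macs_from_accelerator_config(accelerator_config))


accelerator_config = 'ethos-u55-128'   # same value as passed to Vela
print(fvp_mac_argument(accelerator_config))
```

Defining the accelerator configuration once and generating both command lines from it avoids the NPU config mismatch described above.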

In [None]:
%%bash
FVP_Corstone_SSE-300_Ethos-U55 -C ethosu.num_macs=128 \
    -C mps3_board.telnetterminal0.start_telnet=0 \
    -C mps3_board.uart0.out_file='-' \
    -C mps3_board.uart0.shutdown_on_eot=1 \
    -C mps3_board.visualisation.disable-visualisation=1 \
    --stat /home/ubuntu/projects/MicroSpeechEthosU55/ml-embedded-evaluation-kit/build/bin/ethos-u-inference_runner.axf