In [10]:
#  Copyright (c) 2021 Arm Limited. All rights reserved.
#  SPDX-License-Identifier: Apache-2.0
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

# Train and Deploy your NPU-enabled models

> Using the Arm Corstone-300 with Cortex-M55 and Ethos-U55.

## Summary

This notebook presents a flow to help bridge the gap between data scientists and embedded engineers.

In [14]:
!conda install -c conda-forge tensorflow -y


## Training a Model

In this example we are going to train a "toy" model. We will create a basic convolutional neural network model to solve the MNIST problem.

The [MNIST database](http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits which can be used to train a digit classifier. It is often used as a starter dataset.

Let's start off by importing the required Python dependencies. We will use the [TensorFlow](https://github.com/tensorflow/tensorflow) framework for the model and [TensorFlow Datasets](https://github.com/tensorflow/datasets) to download the MNIST dataset. If you're using Google Colab, these dependencies come preinstalled.

In [12]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np


We can now download the MNIST dataset using TensorFlow datasets.

In [None]:
(ds_train, ds_test), ds_info = tfds.load('mnist', split=['train', 'test'], shuffle_files=True, 
  as_supervised=True, with_info=True,
)


Once downloaded, we write a function to preprocess the MNIST dataset so it is ready for use in a neural network. The images come in `uint8` format, so to normalize the dataset to the range `[0, 1]` we divide by `255` (the maximum `uint8` value).

In [None]:
def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label

Let's apply this function to the dataset using `.map` and take a batch size of `128`.

In [None]:
ds_train = ds_train.map(
  normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE
)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

ds_test = ds_test.map(
  normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE
)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)

We are now ready to create the model using the `Sequential` functionality. 

Although we could achieve a model with high accuracy using a fully connected model, this would require a lot of weights and biases. The Ethos-U55 is designed to be used with a Cortex-M55, which means memory is limited. For this reason we build a convolutional network: each small `3x3` kernel is reused across the whole image, so the model needs far fewer weights.

In [None]:
model = tf.keras.models.Sequential([
  tf.keras.layers.InputLayer(input_shape=(28,28,1)),
  tf.keras.layers.Conv2D(32, (3, 3), activation=tf.nn.relu, input_shape=(28, 28, 1)),
  tf.keras.layers.MaxPooling2D((2, 2)),
  tf.keras.layers.Conv2D(64, (3, 3), activation=tf.nn.relu),
  tf.keras.layers.MaxPooling2D((2, 2)),
  tf.keras.layers.Conv2D(64, (3, 3), activation=tf.nn.relu),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(64, activation=tf.nn.relu),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
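To put the memory argument in numbers, the following sketch counts the trainable parameters of the layers defined above using plain arithmetic. It assumes Keras defaults (`valid` padding, one bias per filter or unit). The fully connected baseline `784 -> 512 -> 10` is a hypothetical comparison point and does not come from this notebook.

```python
# Count trainable parameters for the layers used in the model above.
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_features, units):
    """Weights plus one bias per unit."""
    return in_features * units + units


# Spatial size: a 'valid' 3x3 convolution shrinks each side by 2,
# and 2x2 max pooling halves it (rounding down).
side = 28
side = side - 2    # Conv2D(32):   26x26x32
side = side // 2   # MaxPooling2D: 13x13x32
side = side - 2    # Conv2D(64):   11x11x64
side = side // 2   # MaxPooling2D: 5x5x64
side = side - 2    # Conv2D(64):   3x3x64
flattened = side * side * 64

cnn_total = (
    conv2d_params(3, 3, 1, 32)
    + conv2d_params(3, 3, 32, 64)
    + conv2d_params(3, 3, 64, 64)
    + dense_params(flattened, 64)
    + dense_params(64, 10)
)

# Hypothetical fully connected baseline (an assumption for comparison only).
fc_total = dense_params(28 * 28, 512) + dense_params(512, 10)

print(f"Convolutional model parameters:   {cnn_total}")
print(f"Fully connected model parameters: {fc_total}")
```

Calling `model.summary()` on the Keras model should report the same total for the convolutional network.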

We are now ready to train the model. For this toy example we will train for just a single epoch.

In [None]:
model.compile(
  optimizer=tf.keras.optimizers.legacy.Adam(0.001), 
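  # The final Dense layer already applies softmax, so the loss receives
  # probabilities rather than logits.
  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),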
  metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

history = model.fit(ds_train, epochs=1, validation_data=ds_test)

## Quantize the Model

The next step is to quantize the model. This converts the weights from floating-point numbers to integers. The Ethos-U55 supports 8-bit weights, and 8-bit and 16-bit activations.

In this example we will quantize the model into `int8` format. 
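As a rough illustration of what `int8` quantization means, the sketch below applies the usual affine mapping `q = round(x / scale) + zero_point` to a few normalized pixel values. The `scale` and `zero_point` values are assumptions chosen to cover the range `[0, 1]`; the converter picks its own values for each tensor.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to int8 using affine quantization, clipping to the int8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an int8 value back to the float it approximates."""
    return (q - zero_point) * scale


# Assumed parameters for an input tensor covering [0, 1].
scale = 1.0 / 255.0
zero_point = -128

for pixel in (0, 64, 255):
    x = pixel / 255.0  # same normalization as normalize_img
    q = quantize(x, scale, zero_point)
    print(f"pixel={pixel:3d}  float={x:.4f}  int8={q:4d}  "
          f"restored={dequantize(q, scale, zero_point):.4f}")
```

With these assumed parameters, a normalized pixel maps to its original `uint8` value shifted by 128.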

Let's first `unbatch` the dataset, which is currently grouped into batches of 128 samples. During inference we will only run one image at a time.

In [None]:
ds_train = ds_train.unbatch()

We can then build a generator function to use in the conversion process. 

The generator supplies representative inputs, which allows the TensorFlow Lite converter to observe the range of values flowing through the model and choose suitable quantization parameters based on the input data.

In [None]:
def representative_data_gen():
  for input_value, output_value in ds_train.batch(1).take(100):
    yield [input_value]
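To give an intuition for why representative data matters, here is a simplified sketch of min/max calibration: the observed value range determines a `scale` and `zero_point` for `int8`. This is an assumption about the general technique, not a description of the converter's internals, and `calibrate` is a hypothetical helper.

```python
def calibrate(samples, qmin=-128, qmax=127):
    """Derive affine quantization parameters from observed float values."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    if hi == lo:
        # Degenerate case: every sample is zero.
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


# Values as they might be seen at the model input after normalization.
observed = [0.0, 0.25, 0.5, 1.0]
scale, zero_point = calibrate(observed)
print(f"scale={scale:.6f}, zero_point={zero_point}")
```

If the representative samples do not cover the values seen in deployment, the derived range is too narrow and real inputs get clipped.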

Finally we are ready to convert the model. We can use the `from_keras_model` method to create a converter from our model:

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(model)

We can then set the `inference_input_type`, `inference_output_type` and `supported_ops` to `int8`:

In [None]:
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

We then add the `representative_dataset` to be our generator.

In [None]:
converter.representative_dataset = representative_data_gen

The last step is to run the conversion process:

In [None]:
tflite_model_quant = converter.convert()

We now have a quantized model in TFLite format. Let's save this to our files as `my_model.tflite`:

In [None]:
with open("my_model.tflite", "wb") as f:
  f.write(tflite_model_quant)

## Vela Compiler

When creating a model for use on Ethos-U55 we need to use the Vela Compiler to optimise the model.

This is a command-line tool written in Python which takes a `.tflite` file and outputs another `.tflite` file. The new file is restructured in a way that Ethos-U understands.

To do this, let's first install `ethos-u-vela` for the compiler and `xxd` which will be used to convert binary files into hexdumps.

In [None]:
%pip uninstall ethos-u-vela -y
%pip install numpy==1.21.4 --force
%pip install "setuptools_scm[toml]<6" wheel
%pip install ethos-u-vela --no-build-isolation --no-cache-dir

We can now compile the model. We will specify the accelerator configuration as `ethos-u55-128`, one of the commonly used templates for Ethos-U55, which has 128 MAC (multiply-accumulate) units. We will also create a `vela.ini` file describing our system configuration. This information helps Vela optimize the model efficiently.

In [None]:
%%writefile vela.ini

[System_Config.Ethos_U55_High_End_Embedded]
core_clock=500e6
axi0_port=Sram
axi1_port=OffChipFlash
Sram_clock_scale=1.0
Sram_burst_length=32
Sram_read_latency=32
Sram_write_latency=32
OffChipFlash_clock_scale=0.125
OffChipFlash_burst_length=128
OffChipFlash_read_latency=64
OffChipFlash_write_latency=64

; Shared SRAM: the SRAM is shared between the Ethos-U and the Cortex-M software
; The non-SRAM memory is assumed to be read-only
[Memory_Mode.Shared_Sram]
const_mem_area=Axi1
arena_mem_area=Axi0
cache_mem_area=Axi0
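As a quick sanity check of the configuration, the file can be read with Python's standard `configparser`. The sketch below embeds a shortened copy of the settings above as a string and works out which memory holds the constant data and which holds the tensor arena. Treating `*_clock_scale` as a multiplier on `core_clock` is an assumption based on the parameter names.

```python
import configparser

# Shortened copy of the vela.ini contents shown above.
VELA_INI = """
[System_Config.Ethos_U55_High_End_Embedded]
core_clock=500e6
axi0_port=Sram
axi1_port=OffChipFlash
Sram_clock_scale=1.0
OffChipFlash_clock_scale=0.125

; Shared SRAM: the SRAM is shared between the Ethos-U and the Cortex-M software
[Memory_Mode.Shared_Sram]
const_mem_area=Axi1
arena_mem_area=Axi0
cache_mem_area=Axi0
"""

cfg = configparser.ConfigParser()
cfg.read_string(VELA_INI)

system = cfg["System_Config.Ethos_U55_High_End_Embedded"]
memory_mode = cfg["Memory_Mode.Shared_Sram"]

# Assumption: each memory clock is core_clock multiplied by its clock scale.
core_clock_hz = system.getfloat("core_clock")
flash_clock_hz = core_clock_hz * system.getfloat("OffChipFlash_clock_scale")


def memory_for(area):
    """Resolve an AXI area name such as 'Axi1' to the memory attached to that port."""
    return system[f"{area.lower()}_port"]


const_memory = memory_for(memory_mode["const_mem_area"])
arena_memory = memory_for(memory_mode["arena_mem_area"])

print(f"Core clock:  {core_clock_hz / 1e6:.1f} MHz")
print(f"Flash clock: {flash_clock_hz / 1e6:.1f} MHz")
print(f"Constant data is placed in: {const_memory}")
print(f"Tensor arena is placed in:  {arena_memory}")
```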

TODO: make this work. Vela has issues with numpy version

In [None]:
%%script false --no-raise-error
%%bash
vela --accelerator-config=ethos-u55-128 \
--optimise Performance \
--memory-mode=Shared_Sram \
--system-config=Ethos_U55_High_End_Embedded \
--config vela.ini \
my_model.tflite \
|| echo "Error: Vela compilation failed with exit code $?"


We can then convert the `.tflite` binary into a C header file containing a hexdump of the model.

In [None]:
!xxd -i output/my_model_vela.tflite my_network_model.h

The last step is to clean up the file for the application. Here we rename the model array from `output_my_model_vela_tflite` to `network_model` and add header guards to the file.

Most importantly, we add the attribute `__attribute__((aligned(16)))` to the model array so that it is aligned to 16 bytes.

In [None]:
!sed -i 's/unsigned int output_my_model_vela_tflite_len/const unsigned int network_model_len/' my_network_model.h
!sed -i 's/unsigned char output_my_model_vela_tflite\[\]/const unsigned char network_model\[\] __attribute__((aligned(16)))/' my_network_model.h

!sed -i '1s/^/#define NETWORK_MODEL_H\n/' my_network_model.h
!sed -i '1s/^/#ifndef NETWORK_MODEL_H\n/' my_network_model.h
!echo "#endif //NETWORK_MODEL_H" >> my_network_model.h
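For environments where `xxd` or `sed` is not available, a header of the same shape could be produced directly from Python. This is a sketch of an alternative, not the method used in this notebook. `bytes_to_c_header` is a hypothetical helper, and the 20 bytes below stand in for the contents of the compiled `.tflite` file.

```python
def bytes_to_c_header(data, array_name="network_model", guard="NETWORK_MODEL_H"):
    """Render bytes as a guarded C header with a 16-byte aligned const array."""
    lines = [f"#ifndef {guard}", f"#define {guard}", ""]
    lines.append(
        f"const unsigned char {array_name}[] __attribute__((aligned(16))) = {{"
    )
    # Emit twelve bytes per row, similar to the layout produced by xxd -i.
    for offset in range(0, len(data), 12):
        chunk = data[offset:offset + 12]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {array_name}_len = {len(data)};")
    lines.append("")
    lines.append(f"#endif // {guard}")
    return "\n".join(lines) + "\n"


# Placeholder payload; in practice this would be the bytes of the model file.
header = bytes_to_c_header(bytes(range(20)))
print(header)
```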

## Build the application

With the model now ready to use in the application, we need to generate some test data to run through it. To do this we create two functions: `write_input_headerfile`, which writes an example input array to a header file, and `write_output_headerfile`, which writes the expected output array to a header file.

In [None]:
def write_input_headerfile(array):
  with open("input_data.h", "w") as f:
    line = "#ifndef INPUT_DATA_H\n#define INPUT_DATA_H\n\n"
    f.write(line)
    line = f"static const int input_data_len = {len(array)};\n"
    f.write(line)
    line = "static const int8_t input_data[] = {\n  "
    f.write(line)
    count = 0
    for val in array:
      if (count+1)%8 == 0:
        line = f"{val},\n  "
      else:
        line = f"{val}, "
      count += 1
      if count == len(array):
        line = line.replace(",","")
      f.write(line)
    line = "\n};\n\n"
    f.write(line)
    line = "#endif // INPUT_DATA_H"
    f.write(line)

  return None

def write_output_headerfile(array):
  with open("expected_output_data.h", "w") as f:
    line = "#ifndef EXPECTED_OUTPUT_DATA_H\n#define EXPECTED_OUTPUT_DATA_H\n\n"
    f.write(line)
    line = f"static const int expected_output_data_len = {len(array)};\n"
    f.write(line)
    line = "static const int8_t expected_output_data[] = {\n  "
    f.write(line)
    count = 0
    for val in array:
      if (count+1)%8 ==0:
        line = f"{val},\n  "
      else:
        line = f"{val}, "
      count += 1
      if count == len(array):
        line = line.replace(",","")
      f.write(line)

    line = "\n};\n\n"
    f.write(line)
    line = "#endif // EXPECTED_OUTPUT_DATA_H"
    f.write(line)

  return None

Let's take an input from a test set for use in the application:

In [None]:
# Load the model into tflite
tflite_model = tf.lite.Interpreter("my_model.tflite")

# Get the input and output information from the model
input_details = tflite_model.get_input_details()
input_scale, input_zero_point = input_details[0]["quantization"]
output_details = tflite_model.get_output_details()

# Unbatch the test dataset
ds_test = ds_test.unbatch()

# Take one example from the test set
for x,y in ds_test.batch(1).take(1):
  # Convert the input to a numpy array
  x_numpy = x.numpy()
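  # Quantize the input data into int8 format: round to the nearest integer
  # and clip to the int8 range, since astype() alone would truncate
  x_numpy = np.round(x_numpy / input_scale + input_zero_point)
  x_numpy = np.clip(x_numpy, -128, 127).astype(input_details[0]["dtype"])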
  # Write the array to a headerfile
  write_input_headerfile(x_numpy.flatten())

  # Run the model to get the expected output
  tflite_model.allocate_tensors()
  #tflite_model.set_tensor(input_details[0]['index'], np.expand_dims(x_numpy,axis=0))
  tflite_model.set_tensor(input_details[0]['index'], x_numpy)
  tflite_model.invoke()

  # Get the output array from the model
  output_data = tflite_model.get_tensor(output_details[0]["index"])
  # Write the headerfile for the expected output
  write_output_headerfile(output_data.flatten())

  break
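The expected output written above is a raw `int8` array, which is not easy to read by eye. The sketch below shows how such an array could be turned back into approximate probabilities and a predicted digit. The output values, `output_scale` and `output_zero_point` are made-up assumptions for illustration; in the notebook the real parameters would come from `output_details[0]["quantization"]`.

```python
def dequantize_output(q_values, scale, zero_point):
    """Convert int8 model outputs back to approximate float probabilities."""
    return [(q - zero_point) * scale for q in q_values]


# Hypothetical int8 output for one image (one entry per digit 0-9).
example_output = [-128, -128, -127, -128, -128, -128, -128, 125, -128, -126]

# Assumed quantization parameters for the softmax output.
output_scale = 1.0 / 256.0
output_zero_point = -128

probabilities = dequantize_output(example_output, output_scale, output_zero_point)

# The predicted digit is the index of the largest probability.
predicted_digit = max(range(len(probabilities)), key=probabilities.__getitem__)

print(f"Predicted digit: {predicted_digit}")
print(f"Confidence:      {probabilities[predicted_digit]:.4f}")
```

Because dequantization is monotonic, taking the index of the largest raw `int8` value gives the same prediction without any floating-point arithmetic, which is convenient on the device.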

In [None]:
%%writefile main.cpp

#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_utils.h"
#include "tensorflow/lite/micro/testing/micro_test.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

#include "my_network_model.h"
#include "input_data.h"
#include "expected_output_data.h"

#define TENSOR_ARENA_SIZE (70 * 1024)

uint8_t tensor_arena[TENSOR_ARENA_SIZE];

TF_LITE_MICRO_TESTS_BEGIN

TF_LITE_MICRO_TEST(TestInvoke) {
  
  tflite::MicroErrorReporter micro_error_reporter;
  // load the model
  const tflite::Model* model = ::tflite::GetModel(network_model);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    TF_LITE_REPORT_ERROR(&micro_error_reporter,
                         "Model provided is schema version %d not equal "
                         "to supported version %d.\n",
                         model->version(), TFLITE_SCHEMA_VERSION);
    return kTfLiteError;
  }

  TF_LITE_REPORT_ERROR(&micro_error_reporter, "Hello TFLITE Micro Tests.\n");
  tflite::MicroMutableOpResolver<1> micro_op_resolver;
  //tell tensorflow micro to add ethos-u operator   
  micro_op_resolver.AddEthosU();

  tflite::MicroInterpreter interpreter(
      model, micro_op_resolver, tensor_arena, TENSOR_ARENA_SIZE, &micro_error_reporter);

  TfLiteStatus allocate_status = interpreter.AllocateTensors();
  if (allocate_status != kTfLiteOk) {
    TF_LITE_REPORT_ERROR(&micro_error_reporter, "Tensor allocation failed\n");
    return kTfLiteError;
  }

  TfLiteTensor* input = interpreter.input(0);
  TfLiteTensor* output = interpreter.output(0);

  memcpy(input->data.int8, &input_data, input->bytes);

  TfLiteStatus invoke_status = interpreter.Invoke();

  if (invoke_status != kTfLiteOk) {
    TF_LITE_REPORT_ERROR(&micro_error_reporter, "Invoke failed\n");
      return kTfLiteError;
  }
  TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, invoke_status);

  for (int i=0; i < expected_output_data_len; i++) {
    TF_LITE_MICRO_EXPECT_EQ(output->data.int8[i], expected_output_data[i]);
  }

}

TF_LITE_MICRO_TESTS_END

## Setup environment

Open the 'Explorer' view (Ctrl+Shift+E) and select the file 'vcpkg-configuration.json'. This file instructs [Microsoft vcpkg](https://github.com/microsoft/vcpkg-tool#vcpkg-artifacts) to install the prerequisite artifacts required for building the solution:
  - ctools 1.7.0  [CMSIS-Toolbox](https://github.com/Open-CMSIS-Pack/devtools/blob/main/tools/projmgr/docs/Manual/Overview.md)
  - cmake 3.25.2
  - ninja 1.10.2
  - arm-none-eabi-gcc 10.3.1-2021.10 (GNU Arm Embedded Toolchain 10.3.1)

In [None]:
%vcpkg activate .

## Build from the command line

Install the required packs:

In [None]:
!csolution list packs -s mnist.csolution.yml -m >packs.txt
!cpackget update-index   
!cpackget add -f packs.txt


Create the `cprj` CMSIS project files from `mnist.csolution.yml`:

In [None]:
!csolution convert -s mnist.csolution.yml

Build the application with `cbuild`:

In [None]:
!cbuild mnist-test.debug+avh-cs300.cprj

## Run/Debug application on Corstone-300

Run the application in Corstone-300: