# Multi-Domain: Audio Model

This notebook shows the model and training process for the audio side of the multi-domain model.

In [1]:
!apt-get install -y xxd

Reading package lists... Done
Building dependency tree       
Reading state information... Done
xxd is already the newest version (2:8.0.1453-1ubuntu1.4).
0 upgraded, 0 newly installed, 0 to remove and 17 not upgraded.


In [2]:
import random

import tensorflow as tf
import numpy as np

In [3]:
from google.colab import drive 
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


## Data

The audio data is taken from the same video as the image data. The following preprocessing steps were taken to ensure consistency between all data samples.

- Setting the same sample rate
- Setting the same audio length
- Conversion into MFCCs (mel-frequency cepstral coefficients) with the `python-speech-features` package

After preprocessing, the data was saved into NumPy arrays ready for training.
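
The preprocessing code itself is not part of this notebook, so the following is only a minimal sketch of the length-fixing and MFCC steps. The helper names `fix_length` and `to_mfcc`, the 16 kHz sample rate, and `numcep=16` are assumptions made for illustration; the window settings that produce the 16×16 arrays used below are not given here.

```python
import numpy as np


def fix_length(signal, target_len):
    """Trim or zero-pad a 1-D signal to exactly target_len samples."""
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) >= target_len:
        return signal[:target_len]
    # np.pad pads with zeros by default (mode="constant")
    return np.pad(signal, (0, target_len - len(signal)))


def to_mfcc(signal, sample_rate, numcep=16):
    """Convert a fixed-length signal to MFCCs (frames x coefficients)."""
    # Imported lazily so fix_length works without the package installed.
    from python_speech_features import mfcc
    return mfcc(signal, samplerate=sample_rate, numcep=numcep)


# Hypothetical values: pad a 0.75 s clip to 1 s at 16 kHz.
SAMPLE_RATE = 16000
clip = np.ones(12000, dtype=np.float32)
fixed = fix_length(clip, SAMPLE_RATE)
print(fixed.shape)
```

Resampling to a common sample rate would have to happen before `fix_length`; it is left out here because the notebook does not say which tool was used for it.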

In [4]:
!cp gdrive/MyDrive/multi_domain/audio_dataset.npz .

In [5]:
dataset = np.load("audio_dataset.npz")

x_train = dataset['x_train']
y_train = dataset['y_train']
x_val = dataset['x_val']
y_val = dataset['y_val']

Set the output values to be 1 if "happy" and 0 if "angry".

In [6]:
y_train = np.array([1 if y == 'happy' else 0 for y in y_train])
y_val = np.array([1 if y == 'happy' else 0 for y in y_val])

To be used by a CNN, the input arrays need to be reshaped to resemble an image, with an explicit channel dimension. One way to look at this is to imagine the MFCCs as a greyscale image.

In [7]:
x_train.shape

(48, 16, 16)

In [8]:
x_train = x_train.reshape(x_train.shape[0], 
                          x_train.shape[1], 
                          x_train.shape[2], 
                          1)
x_val = x_val.reshape(x_val.shape[0], 
                      x_val.shape[1], 
                      x_val.shape[2], 
                      1)

In [9]:
x_train.shape

(48, 16, 16, 1)

In [10]:
#create a list of tuples
c = list(zip(x_train, y_train))
#shuffle the tuples
random.shuffle(c)
#return back to x_train and y_train
x_train, y_train = zip(*c)

In [11]:
x_train = np.array(x_train)
y_train = np.array(y_train)
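
As an alternative to the zip/shuffle/unzip approach above, the same paired shuffle can be sketched with a single index permutation, which keeps both arrays as NumPy arrays throughout. The helper name `shuffle_together` and the demo arrays are illustrative only.

```python
import numpy as np


def shuffle_together(x, y, seed=None):
    """Shuffle x and y with the same random order so pairs stay aligned."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    return x[order], y[order]


# Demo data: each x row is 10 times its label, so alignment is easy to check.
x_demo = np.arange(6).reshape(6, 1) * 10
y_demo = np.arange(6)
x_shuf, y_shuf = shuffle_together(x_demo, y_demo, seed=0)
print(x_shuf[:, 0], y_shuf)
```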

## Modelling

The next part is to create the neural network to fit to the MFCC "images". Because this is a simple problem (binary classification), the model is built from scratch. However, TensorFlow and Keras have many pretrained models which can be adapted to your problem. Pre-quantized models can also be found on the [TensorFlow Hub](https://tfhub.dev/s?q=quantized).

When creating a model for a microcontroller you need to think more carefully about your model selection. A few important points:
- Are the layers supported by TensorFlow Lite for Microcontrollers?
- Is the model too big?
- Is there a more efficient architecture?

For example, when running on a laptop you might create a really simple dense network by flattening the image. On a microcontroller this results in too many weights and wastes precious memory.

The model in this example is a simple feed-forward convolutional network with a sigmoid output, which squashes the prediction into the range 0 to 1 (matching the y values we have).

In [12]:
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size = (2,2), activation="relu", input_shape = (16,16,1)),
    tf.keras.layers.MaxPool2D(pool_size=(2,2)),
    tf.keras.layers.Conv2D(32, kernel_size = (2,2), activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=(2,2)),
    tf.keras.layers.Conv2D(64, kernel_size = (2,2), activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=(2,2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation = "sigmoid")
    ]
)

In [13]:
model.compile(loss='binary_crossentropy', 
              optimizer='rmsprop', 
              metrics=['acc'])
model.build((1,16,16,1))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 15, 15, 32)        160       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 7, 7, 32)          0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 6, 6, 32)          4128      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 3, 3, 32)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 2, 2, 64)          8256      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 1, 1, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 64)                0
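
The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used in the model definition (`"valid"` padding, stride 1 for the convolutions, pooling stride equal to the pool size). The two dense layers do not appear in the summary output above, so their counts here are derived from the model definition rather than read from the output.

```python
def conv_out(size, kernel):
    """Output size of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def conv_params(kh, kw, c_in, c_out):
    """Weights plus one bias per output filter."""
    return (kh * kw * c_in + 1) * c_out


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


size, channels = 16, 1
conv_counts, shapes = [], []
for filters in (32, 32, 64):
    conv_counts.append(conv_params(2, 2, channels, filters))
    size = conv_out(size, 2)
    shapes.append((size, size, filters))   # after Conv2D
    size = size // 2                       # 2x2 max pooling
    shapes.append((size, size, filters))   # after MaxPool2D
    channels = filters

flat = size * size * channels
total = sum(conv_counts) + dense_params(flat, 64) + dense_params(64, 1)
print(shapes)
print(conv_counts, flat, total)
```

With int8 quantization each parameter takes roughly one byte, so the weights alone should come to somewhere around 17 kB; the `.tflite` file also stores metadata and some wider values, so the actual file will be somewhat larger.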

## Training

Because the model is so simple, it is simply trained for 30 epochs at the default learning rate of Keras's `'rmsprop'` optimizer.

In [14]:
history = model.fit(x_train, 
                    y_train, 
                    epochs=30, 
                    batch_size=8, 
                    validation_data=(x_val, y_val))

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


## Post-Training Quantization

The model is quantized so that it is ready for use on the microcontroller. Here we use full int8 quantization to keep the model as small as possible.

In [15]:
def representative_dataset():
  for data in x_train:
    # Yield one sample at a time, with a leading batch dimension of 1
    yield [data[np.newaxis, ...].astype(np.float32)]

In [16]:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8  # or tf.uint8
converter.inference_output_type = tf.int8  # or tf.uint8
quant_model = converter.convert()

INFO:tensorflow:Assets written to: /tmp/tmp4itqsb52/assets


In [17]:
with open("audio_model.tflite", "wb") as f:
  f.write(quant_model)

In [18]:
!xxd -i "audio_model.tflite" > "audio_model.cc"

Once converted, the model is saved as a `.tflite` file. TensorFlow Lite for Microcontrollers needs one final step: converting the model into a `.cc` file using `xxd`. The C source file contains the TensorFlow Lite model as a char array.
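
To make the result of `xxd -i` more concrete, here is a small Python sketch that produces a similarly laid-out C array from raw bytes. It only approximates the layout `xxd -i` emits (the array name is derived from the file name, followed by a `_len` variable); the helper name `to_c_array` and the four demo bytes are illustrative, not part of the notebook.

```python
import re


def to_c_array(data: bytes, filename: str, per_line: int = 12) -> str:
    """Render bytes as a C array in a layout similar to `xxd -i`."""
    # Characters that are not valid in a C identifier become underscores.
    name = re.sub(r"[^0-9a-zA-Z]", "_", filename)
    hex_bytes = [f"0x{b:02x}" for b in data]
    rows = [", ".join(hex_bytes[i:i + per_line])
            for i in range(0, len(hex_bytes), per_line)]
    body = ",\n".join("  " + row for row in rows)
    return (f"unsigned char {name}[] = {{\n{body}\n}};\n"
            f"unsigned int {name}_len = {len(data)};\n")


# Demo with four placeholder bytes instead of a real model file.
c_source = to_c_array(b"TFL3", "audio_model.tflite")
print(c_source)
```

In practice the generated array is often edited by hand afterwards, for example marking it `const` so that it can be placed in flash rather than RAM.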

## Test Data

To test the model on the microcontroller, we can create some test data to feed to the model.

To do this we need to convert the float32 input data into int8 data, using the quantization parameters (scale and zero point) stored in the TensorFlow Lite model.
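
The conversion follows the usual affine quantization scheme, `real_value = scale * (int8_value - zero_point)`. A minimal NumPy sketch of both directions is shown here; `demo_scale` and `demo_zero` are made-up values for illustration, whereas the real ones come from the interpreter's input details.

```python
import numpy as np


def quantize(x, scale, zero_point, dtype=np.int8):
    """Map float values to integers: q = round(x / scale + zero_point)."""
    info = np.iinfo(dtype)
    q = np.round(np.asarray(x, dtype=np.float32) / scale + zero_point)
    # Saturate values that fall outside the integer range.
    return np.clip(q, info.min, info.max).astype(dtype)


def dequantize(q, scale, zero_point):
    """Map integers back to approximate float values."""
    return scale * (q.astype(np.float32) - zero_point)


# Hypothetical quantization parameters, chosen only for this demo.
demo_scale, demo_zero = 0.5, -3
x_demo = np.array([-100.0, -1.0, 0.0, 1.2, 100.0], dtype=np.float32)
q_demo = quantize(x_demo, demo_scale, demo_zero)
x_back = dequantize(q_demo, demo_scale, demo_zero)
print(q_demo, x_back)
```

Values outside the representable range saturate at -128 or 127, and the round trip back to float is only accurate to within about one quantization step.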

In [19]:
interpreter = tf.lite.Interpreter("audio_model.tflite")

In [20]:
input_details = interpreter.get_input_details()
scale, zero = input_details[0]["quantization"]

In [21]:
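# Round and clip to the input dtype's range before casting (a plain cast truncates and does not saturate)
info = np.iinfo(input_details[0]['dtype'])
test_x_data = np.clip(np.round(x_val[0] / scale + zero), info.min, info.max).astype(info.dtype)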
test_x_data.tofile("x_audio_test.txt")

In [22]:
!xxd -i x_audio_test.txt > x_audio_test.cc