# Audio Feature Generator Example

This example demonstrates how to:
1. Load a quantized keyword spotting model
2. Manually invoke the [Audio Feature Generator](https://siliconlabs.github.io/mltk/docs/python_api/data_preprocessing/audio_feature_generator.html) APIs to generate a spectrogram from an audio file
3. Run inference with the manually processed audio sample using Tensorflow-Lite and Tensorflow-Lite Micro

In this example, we use the [keyword_spotting_numbers](https://siliconlabs.github.io/mltk/docs/python_api/models/siliconlabs/keyword_spotting_numbers.html) ML model.

__NOTES:__  
- Click here: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/siliconlabs/mltk/blob/master/mltk/examples/audio_feature_generator.ipynb) to run this example interactively in your browser  
- Refer to the [Notebook Examples Guide](https://siliconlabs.github.io/mltk/docs/guides/notebook_examples_guide.html) for how to run this example locally in VSCode  

## Install the MLTK python package

In [None]:
# Install the MLTK Python package (if necessary)
!pip install --upgrade silabs-mltk

## Import the Python packages

In [1]:
import os
import pprint
import numpy as np
from mltk.datasets import audio as audio_datasets
from mltk.core.preprocess.utils import audio as audio_utils
from mltk.core.preprocess.audio.audio_feature_generator import AudioFeatureGeneratorSettings
from mltk.core import load_mltk_model, TfliteModel, TfliteModelParameters
from mltk.core.tflite_micro import TfliteMicro

## Load Audio Sample

First, we need to obtain an audio sample. In this example, we load the first sample found in the `seven` directory of the [ten_digits](https://siliconlabs.github.io/mltk/docs/python_api/datasets/audio/ten_digits.html) dataset, which was used to train the [keyword_spotting_numbers](https://siliconlabs.github.io/mltk/docs/python_api/models/siliconlabs/keyword_spotting_numbers.html) ML model.
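
Note that `os.listdir` returns entries in arbitrary order, so the file picked by index `[0]` in the following cell may differ between machines. A minimal sketch of a deterministic alternative is shown here. The helper `first_audio_file` is hypothetical (it is not part of the MLTK), and a temporary directory stands in for the real dataset:

```python
import os
import tempfile


def first_audio_file(directory: str, extension: str = '.wav') -> str:
    """Return the alphabetically first file name in `directory` with the given extension."""
    names = sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(extension)
    )
    if not names:
        raise FileNotFoundError(f'No {extension} files found in {directory}')
    return names[0]


# Demonstrate with a throwaway directory instead of the real dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ('b.wav', 'a.wav', 'notes.txt'):
        open(os.path.join(tmp_dir, name), 'w').close()
    selected = first_audio_file(tmp_dir)
    print(selected)  # a.wav
```

Sorting costs little for a single directory and makes the chosen sample reproducible across runs and operating systems.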

In [2]:
# Download the "ten digits" dataset (if necessary)
dataset_dir = audio_datasets.ten_digits.download()

# And grab the first sample in the 'seven' directory
audio_sample_dir = f'{dataset_dir}/seven'
audio_sample_fn = list(os.listdir(audio_sample_dir))[0]

audio_sample_path = f'{audio_sample_dir}/{audio_sample_fn}'
print(f'Using audio sample: {audio_sample_path}')

# Load the audio file into memory
audio_sample_data, audio_sample_rate_hz = audio_utils.read_audio_file(audio_sample_path, return_numpy=True, return_sample_rate=True)

audio_sample_length = len(audio_sample_data)
audio_sample_length_seconds = audio_sample_length/audio_sample_rate_hz
print(f'Sample length: {audio_sample_length_seconds:.1f}s, rate: {audio_sample_rate_hz/1000:.1f}kHz')

Using audio sample: C:/Users/dried/.mltk/datasets/ten_digits/seven/aws_ar-AE+Hala+seven+medium+medium+1209d48a.wav
Sample length: 0.7s, rate: 16.0kHz
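
The sample above is 0.7 s long, whereas a model generally expects a fixed-length input, which is why the sample length is adjusted later in this example. The following is a simplified sketch of that idea using plain NumPy. The helper `pad_or_crop` is hypothetical, the 1000 ms target length is an assumption, and the sketch does not reproduce whatever trimming the `trim_threshold_db` argument of `audio_utils.adjust_length` controls:

```python
import numpy as np


def pad_or_crop(samples: np.ndarray, out_length: int) -> np.ndarray:
    """Zero-pad at the end, or crop from the end, so the result has exactly out_length samples."""
    if len(samples) >= out_length:
        return samples[:out_length]
    padding = np.zeros(out_length - len(samples), dtype=samples.dtype)
    return np.concatenate([samples, padding])


sample_rate_hz = 16000
# Synthetic clip of 11200 samples (0.7 s at 16 kHz) standing in for the loaded audio
clip = np.ones(11200, dtype=np.float32)
# Assumed target length of 1000 ms, converted to a sample count
target_length = (sample_rate_hz * 1000) // 1000
adjusted = pad_or_crop(clip, target_length)
print(len(clip), '->', len(adjusted))  # 11200 -> 16000
```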


## Load the MLTK Model

Next, we load the MLTK model. We also print a list of the "classes" supported by the model with their corresponding list indices.

In [3]:
# Load the MLTK model
mltk_model = load_mltk_model('keyword_spotting_numbers')

# Retrieve the classes used by the model
classes = mltk_model.classes

# Print classes and their corresponding indices
print(f'Model: {mltk_model.name} classifies {mltk_model.n_classes} classes with the following mapping:')
for class_index, class_label in enumerate(classes):
    print(f'{class_index:2d} -> {class_label}')

Model: keyword_spotting_numbers classifies 11 classes with the following mapping:
 0 -> zero
 1 -> one
 2 -> two
 3 -> three
 4 -> four
 5 -> five
 6 -> six
 7 -> seven
 8 -> eight
 9 -> nine
10 -> _unknown_
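
When checking a prediction against a known label (for instance, the `seven` directory the sample was taken from), the reverse mapping from label to index can be handy. A minimal sketch, with the class list copied from the output above rather than read from the model:

```python
# Class labels in index order, copied from the printed mapping
classes = [
    'zero', 'one', 'two', 'three', 'four', 'five',
    'six', 'seven', 'eight', 'nine', '_unknown_',
]

# Invert the index -> label mapping
class_to_index = {label: index for index, label in enumerate(classes)}

expected_index = class_to_index['seven']
print(expected_index)  # 7
```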


## Load the .tflite model

Next, we load the trained and quantized [keyword_spotting_numbers](https://siliconlabs.github.io/mltk/docs/python_api/models/siliconlabs/keyword_spotting_numbers.html) `.tflite` model. We do this by extracting the `.tflite` from the `keyword_spotting_numbers.mltk.zip` model archive and loading it into a [TfliteModel](https://siliconlabs.github.io/mltk/docs/python_api/tflite_model/index.html) instance.

In [4]:

# Get the file path to the .tflite in the keyword_spotting_numbers.mltk.zip model archive
tflite_path = mltk_model.get_archive_file('keyword_spotting_numbers.tflite')
print(f'.tflite path: {tflite_path}')

# Load the .tflite file into a TfliteModel instance
tflite_model = TfliteModel.load_flatbuffer_file(tflite_path)

# Generate a summary of the model
print(tflite_model.summary())

.tflite path: E:/dried/mltk/models/keyword_spotting_numbers/extracted_archive/keyword_spotting_numbers.tflite
+-------+------------------------------+-------------------+-----------------+------------------------------------------------------+
| Index | OpCode                       | Input(s)          | Output(s)       | Config                                               |
+-------+------------------------------+-------------------+-----------------+------------------------------------------------------+
| 0     | quantize                     | 98x1x40 (float32) | 98x1x40 (int8)  | Type=none                                            |
| 1     | conv_2d                      | 98x1x40 (int8)    | 98x1x40 (int8)  | Padding:Same stride:1x1 activation:None              |
|       |                              | 3x1x40 (int8)     |                 |                                                      |
|       |                              | 40 (int32)        |                 |        

## Process the audio sample in the AudioFeatureGenerator

Next, we process the audio sample in the [Audio Feature Generator](https://siliconlabs.github.io/mltk/docs/python_api/data_preprocessing/audio_feature_generator.html). This converts the raw audio into a spectrogram, a 2D array that can be given to the `.tflite` model for classification.

To process the audio sample, we must use the AudioFeatureGenerator [settings](https://siliconlabs.github.io/mltk/docs/guides/model_parameters.html#audiodatasetmixin) embedded into the `.tflite`. These are the settings that were used to train the model and also the settings used by the embedded device at runtime.
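
One reason the settings must match is that they fix the shape of the spectrogram the model expects. The sketch below uses the common sliding-window frame count formula; it is not taken from the MLTK source. The 1000 ms sample length, 30 ms window and 10 ms step are assumed values chosen for illustration, and the helper `spectrogram_frames` is hypothetical. The 40 channels correspond to `fe.filterbank_n_channels` in the settings:

```python
def spectrogram_frames(sample_length_ms: int, window_size_ms: int, window_step_ms: int) -> int:
    """Number of full analysis windows that fit in a clip of the given length."""
    if sample_length_ms < window_size_ms:
        return 0
    return (sample_length_ms - window_size_ms) // window_step_ms + 1


# Assumed values, not read from the .tflite file
n_frames = spectrogram_frames(sample_length_ms=1000, window_size_ms=30, window_step_ms=10)
n_channels = 40
print(f'{n_frames}x{n_channels}')  # 98x40
```

Under these assumptions the result agrees with the `98x1x40` input shape listed in the model summary above.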

In [6]:
# Retrieve the AudioFeatureGenerator settings from the .tflite
tflite_params = TfliteModelParameters.load_from_tflite_file(tflite_path)

# Load the .tflite parameters into an AudioFeatureGeneratorSettings instance
tflite_frontend_settings = AudioFeatureGeneratorSettings(**tflite_params)

print(f'Audio frontend settings:\n{pprint.pformat(tflite_frontend_settings)}')

# Adjust the audio sample so that it is the correct length expected by the audio frontend settings
frontend_sample_length = int((audio_sample_rate_hz * tflite_frontend_settings.sample_length_ms) / 1000)
adjusted_audio_sample_data = audio_utils.adjust_length(
    audio_sample_data,
    out_length=frontend_sample_length,
    trim_threshold_db=30,
    offset=0
)

# Process the length-adjusted audio in the audio frontend (aka AudioFeatureGenerator).
# This will generate a spectrogram from the raw audio using the settings embedded into the .tflite
spectrogram = audio_utils.apply_frontend(
    sample=adjusted_audio_sample_data,
    settings=tflite_frontend_settings,
    dtype=np.uint16 # We just want the raw, uint16 output of the generated spectrogram
)
print(f'Generated spectrogram shape: {"x".join(map(str, spectrogram.shape))} ({spectrogram.dtype})')

# The generated spectrogram is uint16.
# However, the keyword_spotting_numbers model expects a normalized, float32 input.
# So, we use numpy to normalize the input sample
# norm_spectrogram = (spectrogram - mean(spectrogram)) / std(spectrogram)
norm_spectrogram = spectrogram.astype(np.float32)
norm_spectrogram -= np.mean(norm_spectrogram, dtype=np.float32, keepdims=False)
norm_spectrogram /= (np.std(norm_spectrogram, dtype=np.float32, keepdims=False) + 1e-6)
print(f'Normalized spectrogram shape: {"x".join(map(str, norm_spectrogram.shape))} ({norm_spectrogram.dtype})')

# The keyword_spotting_numbers model also expects the input shape to be:
# <time, 1, features>
# So, we insert an extra dimension:
tflite_input_spectrogram = np.expand_dims(norm_spectrogram, axis=-2)
print(f'.tflite input spectrogram shape: {"x".join(map(str, tflite_input_spectrogram.shape))} ({tflite_input_spectrogram.dtype})')

Audio frontend settings:
{'average_window_duration_ms': 450,
 'classes': ['zero',
             'one',
             'two',
             'three',
             'four',
             'five',
             'six',
             'seven',
             'eight',
             'nine',
             '_unknown_'],
 'date': '2023-07-18T17:50:46.438Z',
 'detection_threshold': 242,
 'fe.activity_detection_alpha_a': 0.5,
 'fe.activity_detection_alpha_b': 0.800000011920929,
 'fe.activity_detection_arm_threshold': 0.75,
 'fe.activity_detection_enable': False,
 'fe.activity_detection_trip_threshold': 0.800000011920929,
 'fe.dc_notch_filter_coefficient': 0.949999988079071,
 'fe.dc_notch_filter_enable': True,
 'fe.fft_length': 512,
 'fe.filterbank_lower_band_limit': 125.0,
 'fe.filterbank_n_channels': 40,
 'fe.filterbank_upper_band_limit': 7500.0,
 'fe.log_scale_enable': True,
 'fe.log_scale_shift': 6,
 'fe.noise_reduction_enable': True,
 'fe.noise_reduction_even_smoothing': 0.02500000037252903,
 'fe.noise_reduc
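
The normalization and reshaping steps from the cell above can be checked in isolation. The following sketch applies the same NumPy operations to a synthetic `uint16` array (the values are made up, and the 98x40 shape is an assumption based on the model summary) and prints the resulting shape and data type:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic uint16 "spectrogram": 98 frames x 40 channels, values are made up
spectrogram = rng.integers(0, 1024, size=(98, 40), dtype=np.uint16)

# Zero-mean, unit-variance normalization over the whole sample
norm = spectrogram.astype(np.float32)
norm -= np.mean(norm, dtype=np.float32)
norm /= (np.std(norm, dtype=np.float32) + 1e-6)

# Insert the singleton dimension: <time, features> -> <time, 1, features>
model_input = np.expand_dims(norm, axis=-2)
print(model_input.shape, model_input.dtype)  # (98, 1, 40) float32
```

The small `1e-6` term guards against division by zero when the spectrogram is constant, for example for a silent clip.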

## Classify the audio sample using TF-Lite

Next, we give the processed audio sample to the `.tflite` model instance which will classify the audio.
The model output is a list of probabilities. The list entry with the largest probability is the "class" to which the model thinks the audio sample belongs.

__NOTE:__ This uses the default `int8` "kernels" that come with [TF-Lite](https://www.tensorflow.org/lite/performance/quantization_spec)

In [7]:
# Give the processed audio sample (which is now a normalized spectrogram)
# to the trained and quantized keyword_spotting_numbers.tflite model,
# which will classify the sample and return the classification results
classification_results = tflite_model.predict(tflite_input_spectrogram)
print(f'Raw classification results: {classification_results}')

# Find the index of the largest entry in the list
predicted_class_index = np.argmax(classification_results)
prediction_confidence = classification_results[predicted_class_index]

print(f'The model "{mltk_model.name}" using the reference int8 Tensorflow-Lite kernels predicts that the audio sample file:\n{audio_sample_path}\nbelongs to the class: "{classes[predicted_class_index]}" with a confidence of {prediction_confidence*100:.1f}%')

Raw classification results: [0.         0.         0.         0.         0.         0.
 0.         0.99609375 0.         0.         0.        ]
The model "keyword_spotting_numbers" using the reference int8 Tensorflow-Lite kernels predicts that the audio sample file:
C:/Users/dried/.mltk/datasets/ten_digits/seven/aws_ar-AE+Hala+seven+medium+medium+1209d48a.wav
belongs to the class: "seven" with a confidence of 99.6%
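
The reported confidence of 0.99609375 rather than 1.0 is consistent with `int8` quantization. The sketch below assumes the model ends in an `int8` softmax whose output uses a scale of 1/256 and a zero point of -128, as the TF-Lite quantization specification prescribes for that operator. The truncated model summary above does not confirm this, so treat it as an assumption:

```python
import numpy as np

# Assumed output quantization parameters for an int8 softmax
OUTPUT_SCALE = 1.0 / 256.0
OUTPUT_ZERO_POINT = -128


def dequantize(q, scale=OUTPUT_SCALE, zero_point=OUTPUT_ZERO_POINT):
    """Convert int8 quantized values to real values: real = scale * (q - zero_point)."""
    return scale * (np.asarray(q, dtype=np.int32) - zero_point)


largest = float(dequantize(127))    # largest representable int8 value
smallest = float(dequantize(-128))  # smallest representable int8 value
print(largest, smallest)  # 0.99609375 0.0
```

Under that assumption, 255/256 is the highest probability the quantized output can express, so 99.6% is effectively the model's maximum confidence.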


## Classify the audio using TF-Lite Micro

Previously, we used the `int8` Tensorflow-Lite kernels that come with the [Tensorflow](https://pypi.org/project/tensorflow/) Python package.

Now, let's use the `int8` kernels that come with [Tensorflow-Lite Micro](https://github.com/tensorflow/tflite-micro). We access them through the [TfliteMicro API](https://siliconlabs.github.io/mltk/docs/python_api/tflite_micro_model/index.html) of the [Tensorflow-Lite Micro Python Wrapper](https://siliconlabs.github.io/mltk/docs/cpp_development/wrappers/tflite_micro_wrapper.html) that comes with the MLTK.

In [9]:
# Load the TfliteMicroModel instance
tflm_model = TfliteMicro.load_tflite_model(tflite_path)
print(f'Tensorflow-Lite Micro model details:\n{tflm_model.details}')

try:
    # Load the audio sample into the TFLM model instance's input tensor
    tflm_model.input(value=tflite_input_spectrogram)

    # Run inference (which will use the TFLM int8 SW reference kernels)
    tflm_model.invoke()

    # Retrieve the classification results:
    # NOTE: The output has the shape 1x11, hence the [0]
    classification_results = tflm_model.output()[0]
    print(f'Raw classification results: {classification_results}')

    # Find the index of the largest entry in the list
    predicted_class_index = np.argmax(classification_results)
    prediction_confidence = classification_results[predicted_class_index]

    print(f'The model "{mltk_model.name}" using the reference int8 Tensorflow-Lite Micro kernels predicts that the audio sample file:\n{audio_sample_path}\nbelongs to the class: "{classes[predicted_class_index]}" with a confidence of {prediction_confidence*100:.1f}%')
finally:
    # We MUST unload the model after we're done with it
    TfliteMicro.unload_model(tflm_model)

Tensorflow-Lite Micro model details:
Name: keyword_spotting_numbers
Version: 2
Date: 2023-07-18T17:50:46.438Z
Description: 
Hash: fb177254686232d6bb8e89d2b721ed00
Accelerator: none
Classes: zero, one, two, three, four, five, six, seven, eight, nine, _unknown_
Total runtime memory: 73.432 kBytes

Raw classification results: [0.         0.         0.         0.         0.         0.
 0.         0.99609375 0.         0.         0.        ]
The model "keyword_spotting_numbers" using the reference int8 Tensorflow-Lite Micro kernels predicts that the audio sample file:
C:/Users/dried/.mltk/datasets/ten_digits/seven/aws_ar-AE+Hala+seven+medium+medium+1209d48a.wav
belongs to the class: "seven" with a confidence of 99.6%
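
Both runtimes report the same results for this sample. A minimal sketch of how such a comparison could be automated, with the two result vectors copied from the printed outputs above rather than taken from live model calls:

```python
import numpy as np

# Result vectors copied from the TF-Lite and TF-Lite Micro outputs above
tflite_results = np.array([0, 0, 0, 0, 0, 0, 0, 0.99609375, 0, 0, 0], dtype=np.float32)
tflm_results = np.array([0, 0, 0, 0, 0, 0, 0, 0.99609375, 0, 0, 0], dtype=np.float32)

# Compare the predicted class and the largest per-class difference
same_class = int(np.argmax(tflite_results)) == int(np.argmax(tflm_results))
max_abs_diff = float(np.max(np.abs(tflite_results - tflm_results)))
print(same_class, max_abs_diff)  # True 0.0
```

In general, small per-class differences between runtimes are possible, so comparing the predicted class together with a tolerance on the difference is more robust than requiring exact equality.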
