# KWS Training Task

TODO: blabla

## 0. Install software

The following steps should ideally be done before launching this Jupyter notebook!

**1. Clone repository**

```
git clone ???
```

**2. Create virtual python environment**

```
virtualenv -p python3.8 .venv
```

**3. Enter virtual python environment**

```
source .venv/bin/activate
```

**4. Install python packages into environment**

```
pip install -r requirements.txt
```

**5. Start jupyter notebook**
    
```
jupyter notebook Flow.ipynb
```

  If using a remote host, append: ` --no-browser --ip 0.0.0.0 --port XXXX` (where XXXX is an unprivileged port, i.e. > 1023)
  
  If you see warnings about the IOPub message rate, it might help to use ` --NotebookApp.iopub_msg_rate_limit=1.0e10`

## 1. Python imports

Python builtin dependencies

In [None]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'  # Reduce verbosity
import argparse
from pathlib import Path

Third party dependencies

In [None]:
import numpy as np
import tensorflow as tf
#tf.get_logger().setLevel('WARN')  # Reduce verbosity

Jupyter specific

In [None]:
from IPython.display import FileLink

Helper scripts

In [None]:
import data
import models
from test import get_val_accuracy, get_test_accuracy
from test_tflite import tflite_test

## 2. Define training parameters

In [None]:
FLAGS = argparse.Namespace()  # TODO: use upper-case constants instead?

# TODO: useful?
FLAGS.model_name = "kws_model"

# Location of speech training data archive on the web.
FLAGS.data_url = 'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz'

# Where to download the speech training data to.
FLAGS.data_dir = os.path.join("/tmp", os.getlogin(), "speech_dataset")

# How loud the background noise should be, between 0 and 1.
FLAGS.background_volume = 0.1

# How many of the training samples have background noise mixed in.
FLAGS.background_frequency = 0.8

# How much of the training data should be silence.
FLAGS.silence_percentage = 10.0

# How much of the training data should be unknown words
FLAGS.unknown_percentage = 10.0

# Range to randomly shift the training audio by in time.
FLAGS.time_shift_ms = 100.0

# What percentage of wavs to use as a test set.
FLAGS.testing_percentage = 10

# What percentage of wavs to use as a validation set.
FLAGS.validation_percentage = 10

# Expected sample rate of the wavs
FLAGS.sample_rate = 16000

# Expected duration in milliseconds of the wavs
FLAGS.clip_duration_ms = 1000

# How long each spectrogram timeslice is
FLAGS.window_size_ms = 30.0

# How far to move in time between spectrogram timeslices
FLAGS.window_stride_ms = 20.0

# How many bins to use for the MFCC fingerprint
FLAGS.dct_coefficient_count = 40

# How many training loops to run
FLAGS.how_many_training_steps = "1500,300"  # 12000,3000

# How often to evaluate the training results.
FLAGS.eval_step_interval = 400

# How large a learning rate to use when training.
FLAGS.learning_rate = "0.001,0.0001"

# How many items to train with at once
FLAGS.batch_size = 100

# Where to save summary logs for TensorBoard.
# FLAGS.summaries_dir = '/tmp/retrain_logs'

# Words to use (others will be added to an unknown label)
FLAGS.wanted_words = "yes,no,one,two"

# Directory to write event logs and checkpoint.
FLAGS.train_dir = os.path.join("/tmp", os.getlogin(), "speech_commands_train")

## 3. Create Keras Model

Get model settings

In [None]:
model_settings = models.prepare_model_settings(len(data.prepare_words_list(FLAGS.wanted_words.split(','))),
                                               FLAGS.sample_rate, FLAGS.clip_duration_ms, FLAGS.window_size_ms,
                                               FLAGS.window_stride_ms, FLAGS.dct_coefficient_count)
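
The settings used later (`label_count`, `spectrogram_length`, `fingerprint_size`) follow directly from the flags above. The sketch below re-derives them by hand. It assumes that `models.prepare_model_settings` and `data.prepare_words_list` follow the conventions of the TensorFlow speech-commands example (two extra labels for silence and unknown words, windows counted with integer division); the function `sketch_model_settings` is hypothetical and not part of the helper scripts.

```python
def sketch_model_settings(wanted_words, sample_rate, clip_duration_ms,
                          window_size_ms, window_stride_ms, dct_coefficient_count):
    """Hypothetical re-derivation of the model settings (assumed conventions)."""
    # Assumption: silence and unknown are added as two extra labels.
    label_count = len(wanted_words) + 2
    desired_samples = int(sample_rate * clip_duration_ms / 1000)
    window_size_samples = int(sample_rate * window_size_ms / 1000)
    window_stride_samples = int(sample_rate * window_stride_ms / 1000)
    # Count how many full windows fit into one clip.
    length_minus_window = desired_samples - window_size_samples
    if length_minus_window < 0:
        spectrogram_length = 0
    else:
        spectrogram_length = 1 + length_minus_window // window_stride_samples
    return {
        "label_count": label_count,
        "spectrogram_length": spectrogram_length,
        "dct_coefficient_count": dct_coefficient_count,
        "fingerprint_size": dct_coefficient_count * spectrogram_length,
    }


settings = sketch_model_settings("yes,no,one,two".split(","),
                                 16000, 1000, 30.0, 20.0, 40)
print(settings)
```

Under these assumptions the input is a 49 x 40 spectrogram (1960 values) and the model distinguishes 6 labels, which matches the tensor shapes used in the estimation tasks further down.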

Define a model architecture

In [None]:
def create_model(model_settings):
    """Builds a model with a single depthwise-convolution layer followed by a single fully-connected layer.
    Args:
        model_settings: Dict of different settings for model training.
    Returns:
        tf.keras Model of the 'CNN' architecture.
    """

    # Get relevant model setting.
    input_frequency_size = model_settings['dct_coefficient_count']
    input_time_size = model_settings['spectrogram_length']

    ### Task X: REPLACE CODE BELOW ###

    inputs = tf.keras.Input(shape=(model_settings["fingerprint_size"],), name="input")

    # Reshape the flattened input.
    x = tf.reshape(inputs, shape=(-1, input_time_size, input_frequency_size, 1))

    # First convolution.
    x = tf.keras.layers.DepthwiseConv2D(
        depth_multiplier=8,
        kernel_size=(10, 8),
        strides=(2, 2),
        padding="SAME",
        activation="relu",
    )(x)

    # Flatten for fully connected layers.
    x = tf.keras.layers.Flatten()(x)

    # Output fully connected.
    output = tf.keras.layers.Dense(units=model_settings["label_count"], activation="softmax")(x)
    
    ### Task X: REPLACE CODE ABOVE ###

    return tf.keras.Model(inputs, output, name=FLAGS.model_name)

Generate keras model

In [None]:
model = create_model(model_settings)
model.summary()

## 4. Prepare dataset

In [None]:
audio_processor = data.AudioProcessor(data_url=FLAGS.data_url,
                                      data_dir=FLAGS.data_dir,
                                      silence_percentage=FLAGS.silence_percentage,
                                      unknown_percentage=FLAGS.unknown_percentage,
                                      wanted_words=FLAGS.wanted_words.split(','),
                                      validation_percentage=FLAGS.validation_percentage,
                                      testing_percentage=FLAGS.testing_percentage,
                                      model_settings=model_settings)

## 5. Run Training

Define training procedure

In [None]:
def train(model, audio_processor):
    # We decay learning rate in a constant piecewise way to help learning.
    training_steps_list = list(map(int, FLAGS.how_many_training_steps.split(',')))
    learning_rates_list = list(map(float, FLAGS.learning_rate.split(',')))
    lr_boundary_list = training_steps_list[:-1]  # Only need the values at which to change lr.
    lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries=lr_boundary_list,
                                                                       values=learning_rates_list)

    # Specify the optimizer configurations.
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

    # Compile the model.
    model.compile(optimizer=optimizer,
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  metrics=['accuracy'])

    # Prepare/split the dataset.
    train_data = audio_processor.get_data(audio_processor.Modes.TRAINING,
                                          FLAGS.background_frequency, FLAGS.background_volume,
                                          int((FLAGS.time_shift_ms * FLAGS.sample_rate) / 1000))
    train_data = train_data.repeat().batch(FLAGS.batch_size).prefetch(tf.data.AUTOTUNE)
    val_data = audio_processor.get_data(audio_processor.Modes.VALIDATION)
    val_data = val_data.batch(FLAGS.batch_size).prefetch(tf.data.AUTOTUNE)

    # We train for a max number of iterations so need to calculate how many 'epochs' this will be.
    training_steps_max = np.sum(training_steps_list)
    training_epoch_max = int(np.ceil(training_steps_max / FLAGS.eval_step_interval))

    # Callbacks.
    train_dir = Path(FLAGS.train_dir) / "best"
    train_dir.mkdir(parents=True, exist_ok=True)
    model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=(train_dir / (FLAGS.model_name + "_{val_accuracy:.3f}_ckpt")),
        save_weights_only=True,
        monitor='val_accuracy',
        mode='max',
        save_best_only=True)

    # Train the model.
    model.fit(x=train_data,
              steps_per_epoch=FLAGS.eval_step_interval,
              epochs=training_epoch_max,
              validation_data=val_data,
              callbacks=[model_checkpoint_callback])

    # Prepare the test set.
    test_data = audio_processor.get_data(audio_processor.Modes.TESTING)
    test_data = test_data.batch(FLAGS.batch_size)

    # Evaluate the model performance.
    test_loss, test_acc = model.evaluate(x=test_data)
    print(f'Final test accuracy: {test_acc*100:.2f}%')
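
To make the schedule in `train()` easier to follow, here is a small pure-Python sketch of the same arithmetic. It assumes the documented behaviour of `PiecewiseConstantDecay` (the first value applies up to and including the first boundary); the helper `learning_rate_at` is hypothetical and only mirrors that rule.

```python
import bisect
import math


def learning_rate_at(step, boundaries, values):
    """Return values[i] for the i-th interval; a boundary belongs to the interval before it."""
    return values[bisect.bisect_left(boundaries, step)]


training_steps = [1500, 300]      # FLAGS.how_many_training_steps
learning_rates = [0.001, 0.0001]  # FLAGS.learning_rate
eval_step_interval = 400          # FLAGS.eval_step_interval

boundaries = training_steps[:-1]
epochs = math.ceil(sum(training_steps) / eval_step_interval)
total_steps = epochs * eval_step_interval

print(f"epochs={epochs}, total_steps={total_steps}")
for step in (0, 1500, 1501, total_steps - 1):
    print(step, learning_rate_at(step, boundaries, learning_rates))
```

Because the number of epochs is rounded up, `model.fit` runs 5 x 400 = 2000 steps with the default flags, slightly more than the 1800 steps requested; the extra steps use the last learning rate.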

Invoke training procedure (**Warning:** This will take a very long time!)

In [None]:
train(model, audio_processor)

Determine latest checkpoint

In [None]:
latest = tf.train.latest_checkpoint(Path(FLAGS.train_dir) / "best")
print(latest)

Pick a checkpoint

In [None]:
FLAGS.checkpoint = latest  # Feel free to choose a different one!

## 6. Test trained TensorFlow model

Define test procedure

In [None]:
def test():
    """Calculate accuracy and confusion matrices on validation and test sets.

    Model is created from the settings in FLAGS and weights are loaded from FLAGS.checkpoint.
    """
    model_settings = models.prepare_model_settings(len(data.prepare_words_list(FLAGS.wanted_words.split(','))),
                                                   FLAGS.sample_rate, FLAGS.clip_duration_ms, FLAGS.window_size_ms,
                                                   FLAGS.window_stride_ms, FLAGS.dct_coefficient_count)

    # Create the model.
    model = create_model(model_settings)

    audio_processor = data.AudioProcessor(data_url=FLAGS.data_url,
                                          data_dir=FLAGS.data_dir,
                                          silence_percentage=FLAGS.silence_percentage,
                                          unknown_percentage=FLAGS.unknown_percentage,
                                          wanted_words=FLAGS.wanted_words.split(','),
                                          validation_percentage=FLAGS.validation_percentage,
                                          testing_percentage=FLAGS.testing_percentage,
                                          model_settings=model_settings)

    model.load_weights(FLAGS.checkpoint).expect_partial()

    print("Running testing on validation set...")
    get_val_accuracy(model_settings, model, audio_processor, FLAGS.batch_size)
    print()
    print("Running testing on test set...")
    get_test_accuracy(model_settings, model, audio_processor, FLAGS.batch_size)

Run test procedure

In [None]:
test()

## 7. Quantization and Conversion to TFLite 

Define conversion procedure

In [None]:
NUM_REP_DATA_SAMPLES = 100

def convert(model, audio_processor, checkpoint, quantize, inference_type, tflite_path):
    """Load our trained floating point model and convert it.
    TFLite conversion or post training quantization is performed and the
    resulting model is saved as a TFLite file.
    We use samples from the validation set to do post training quantization.
    Args:
        model: The keras model.
        audio_processor: Audio processor class object.
        checkpoint: Path to training checkpoint to load.
        quantize: Whether to quantize the model or convert to fp32 TFLite model.
        inference_type: Input/output type of the quantized model.
        tflite_path: Output TFLite file save path.
    """
    model.load_weights(checkpoint).expect_partial()

    val_data = audio_processor.get_data(audio_processor.Modes.VALIDATION).batch(1)

    def _rep_dataset():  # TODO: make this a student task?
        """Generator function to produce representative dataset."""
        i = 0
        for mfcc, label in val_data:
            if i >= NUM_REP_DATA_SAMPLES:
                break
            i += 1
            yield [mfcc]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    if quantize:
        # Quantize model and save to disk.
        if inference_type=='int8':
            converter.inference_input_type = tf.int8
            converter.inference_output_type = tf.int8

        # Int8 post training quantization needs representative dataset.
        converter.representative_dataset = _rep_dataset
        converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

    tflite_model = converter.convert()
    with open(tflite_path, 'wb') as f:
        f.write(tflite_model)
    print('{} model saved to {}.'.format("Quantized" if quantize else "Converted", tflite_path))
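
The representative dataset is needed because int8 quantization has to know the value range of every tensor. The following sketch illustrates the idea with a simplified affine scheme in plain Python. It is not the converter's actual calibration algorithm; the function names and the sample values are made up for illustration.

```python
def calibrate(samples):
    """Derive an int8 scale and zero point from observed sample values."""
    lo = min(min(samples), 0.0)  # the representable range must contain 0.0
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for all-zero data
    zero_point = round(-128 - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a real value to int8, clamping values outside the calibrated range."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


def dequantize(q, scale, zero_point):
    """Map an int8 value back to its real-valued approximation."""
    return scale * (q - zero_point)


# Hypothetical activations observed while running representative samples.
scale, zero_point = calibrate([-1.0, 0.25, 3.0])
for x in (-1.0, 0.0, 0.25, 3.0, 10.0):
    q = quantize(x, scale, zero_point)
    print(x, q, dequantize(q, scale, zero_point))
```

Values inside the calibrated range are reproduced with an error of at most about one quantization step, while values outside it (here `10.0`) saturate. This is why the representative samples should cover the inputs the model will see in practice.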

Invoke conversion

In [None]:
tflite_path_quantized = f'{FLAGS.model_name}_quantized.tflite'
tflite_path = f'{FLAGS.model_name}.tflite'

# Load floating point model from checkpoint and convert it.
convert(model, audio_processor, FLAGS.checkpoint,
        False, "fp32", tflite_path)

# Quantize model from checkpoint and convert it.
convert(model, audio_processor, FLAGS.checkpoint,
        True, "int8", tflite_path_quantized)


## 8. Test Converted TFLite Model

Test the newly converted model on the test set.

**Floating Point**

In [None]:
tflite_test(model_settings, audio_processor, tflite_path)

**Quantized**

In [None]:
tflite_test(model_settings, audio_processor, tflite_path_quantized)

## 9. Visualize TFLite Model

**Figure 1:** Example Keras KWS Model
<img src="basic_micro_speech_graph.png" alt="drawing" width="100"/>

Use the following links to download the generated `.tflite` files

**Floating Point**

In [None]:
FileLink(tflite_path)

**Quantized**

In [None]:
FileLink(tflite_path_quantized)

Use the web application https://netron.app/ to generate a graph representation of the converted model.

**Task X:** Save the resulting graph as a PNG image. (TODO: ?)

## 10. Performance and Memory Estimations

### 10.1 ROM Usage

TODO

TFLITE filesize: Weights+Graph-Metadata(Tensors/Operators) -> ? kB?

On-device:

TODO

Weights -> ROM (Float(4B)/Quantized(1B))

Graph -> Depends on inference engine
Kernel implementations -> ROM

Alignment

**Task X:** Estimate the memory requirement to store all trained weights of the quantized model in ROM considering the used data types.

*Hints:*
- The `summary()` method of a Keras model can be used to extract the number of constants used by an operator.
- The dimensions of the weight tensors can also be found in the Netron graph generated in the previous step.
- Memory alignment requirements can be ignored.



**<font color='red'>Solution X:</font>**

All constant dimensions and their datatypes:

1. DepthwiseConv2D filter/weights: `[1,10,8,8]` (int8)
2. DepthwiseConv2D bias: `[8]` (int32)
3. FullyConnected filter/weights: `[6,4000]` (int8)
4. FullyConnected bias: `[6]` (int32)

Total sum: `(1*10*8*8 + 6*4000) * 1 byte + (8+6) * 4 byte = 24696 byte` -> ~24.7 kB (~24.1 KiB)
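
The same sum can be checked with a few lines of Python. This is only a sketch of the arithmetic; the shapes and data types are copied from the list above, and the variable names are chosen for illustration.

```python
from math import prod

BYTES_PER_DTYPE = {"int8": 1, "int32": 4}

# (name, shape, dtype) of every constant tensor in the quantized example model.
constants = [
    ("DepthwiseConv2D weights", (1, 10, 8, 8), "int8"),
    ("DepthwiseConv2D bias", (8,), "int32"),
    ("FullyConnected weights", (6, 4000), "int8"),
    ("FullyConnected bias", (6,), "int32"),
]

rom_bytes = 0
for name, shape, dtype in constants:
    size = prod(shape) * BYTES_PER_DTYPE[dtype]
    print(f"{name}: {size} bytes")
    rom_bytes += size

print(f"Total: {rom_bytes} bytes")
```

The FullyConnected weights account for almost all of the ROM usage, so reducing the size of the flattened feature map is the most effective way to shrink the model.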


**Task X:** Briefly explain the following model compression techniques in 2-3 sentences each:
  - Sparsity
  - Sub-byte quantization / "Packing"

*Hints:*
- At which level does the technique have to be integrated into the training routine?
- How is model inference affected by the compression of weights?
- TFLite uses int32 for biases even if int8 quantization is used.
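
As an illustration of the storage side of sub-byte quantization, the sketch below packs signed 4-bit weights pairwise into bytes and unpacks them again. The nibble order and the function names `pack_int4` and `unpack_int4` are assumptions made for this example; they do not describe a particular framework's format.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) pairwise into bytes, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even number of values
    packed = bytearray()
    for low, high in zip(values[0::2], values[1::2]):
        packed.append((low & 0x0F) | ((high & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from packed bytes."""
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1 (two's complement).
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


weights = [-8, 7, 3, -1, 5]
packed = pack_int4(weights)
print(len(weights), "weights stored in", len(packed), "bytes")
print(unpack_int4(packed, len(weights)))
```

The weights take half the space of an int8 representation, but every access needs additional shift and mask operations at inference time.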

**<font color='red'>Solution X:</font>**

- **Sparsity**

 TODO
 
 
- **Sub-byte quantization**

  TODO


### 10.2 RAM Usage

TODO

Memory Planning!

Arena!

Worst case: sum of all intermediate buffers

inputs/outputs -> Depending on the "inference engine"

For linear/sequential models: biggest in/out pair of a layer

**Task X:** Estimate the dynamic memory requirement of the quantized model based on the TFLite graph only considering intermediate tensor buffers stored in RAM for the following memory-planning schemes:
  - Worst case: no memory planning (no intermediate buffers/tensors share the same memory)
  - Best case: optimal memory planning (e.g. found using an ILP-solver)

*Assumptions:*
- Neither branches nor nodes with multiple inputs/outputs exist in the trained model.
- Assume that the graph is processed in a linear way so that at most 2 buffers will be used at the same time.

**<font color='red'>Solution X:</font>**

Intermediate buffers:

1. `[1,49,40,1]`
2. `[1,25,20,8]`
3. `[1,6]`

Best case (largest combination of input and output buffer): 1. + 2. gives `1*49*40*1 + 1*25*20*8 = 5960` elements of 1 byte each (int8) -> ~6 kB
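
Both planning schemes can be evaluated with a short Python sketch. It assumes int8 tensors (1 byte per element) and an output buffer of shape `[1,6]`, derived from the six labels of the example model (FullyConnected weights `[6,4000]`); the variable names are chosen for illustration.

```python
from math import prod

# Intermediate buffers of the sequential example model, in execution order.
buffers = [(1, 49, 40, 1), (1, 25, 20, 8), (1, 6)]
sizes = [prod(shape) for shape in buffers]  # int8: 1 byte per element

# Worst case: every buffer gets its own memory region.
worst_case = sum(sizes)

# Best case for a linear graph: only one input/output pair is alive at a time.
best_case = max(a + b for a, b in zip(sizes, sizes[1:]))

print(f"Worst case: {worst_case} bytes")
print(f"Best case:  {best_case} bytes")
```

For this small model both schemes are nearly identical because the first two buffers dominate; the gap grows with the number of layers.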

### 10.3 Number of MAC Operations

In this section the compute demand of a given TFLite model should be estimated.

As a first simplification, we will only consider the operation with the biggest impact on the actual inference time: multiply-accumulate (MAC).

These operations are found in Dense (FullyConnected) and convolutional layers. All other operations (here: Reshape, Flatten and the activation functions) can therefore be neglected for the following task.

First, we need a formula that describes the number of MAC operations of the major layer types with respect to the given tensor dimensions and parameters.

**Example (Dense/FullyConnected):**

  Assume: `h_out=h_in`, `w_out=w_kernel`

  `num_mac = h_out * w_out * h_kernel`
  
  For the example keras model: `1 * 6 * 4000` -> ~24k MACs

**Task X:** Estimate the number of Multiply-Add operations used in the quantized model (see Figure 1) by deriving a formula for `num_mac` in a convolutional layer with respect to `h_kernel`, `w_kernel`, `c_in(=1)`, `c_out`, `h_out`, `w_out`.

**<font color='red'>Solution X:</font>**

Assumption: no dilation!

Formulas per layer:

1. **FullyConnected**

  See above!
  

2. **Conv2D**/**DepthwiseConv2D**

  Output size is given, thus padding/stride/dilation can be ignored.
  
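  For `c_in = 1`, as in this model, Conv2D and DepthwiseConv2D share the same formula as long as depth_multiplier is included in `c_out`. For `c_in > 1`, each output channel of a depthwise convolution reads only a single input channel, so the factor `c_in` drops out. (TODO: double check, e.g. in the TFLM codebase)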

  `num_mac = (h_kernel * w_kernel) * c_in * (h_out * w_out * c_out)`
  
  For **DepthwiseConv2D** example keras model: `(10 * 8) * 1 * (25 * 20 * 8) = 320000` -> ~320K
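
The formulas can be turned into small helper functions to check the numbers. This is a sketch under the stated assumptions (no dilation, output size given); the function names are hypothetical. The depthwise variant has no `c_in` factor because each output channel reads exactly one input channel, with `c_out = c_in * depth_multiplier`.

```python
def dense_macs(in_features, out_features, batch=1):
    """One MAC per weight and batch element."""
    return batch * in_features * out_features


def conv2d_macs(h_kernel, w_kernel, c_in, h_out, w_out, c_out):
    """Every output element sums over a full h_kernel x w_kernel x c_in window."""
    return (h_kernel * w_kernel) * c_in * (h_out * w_out * c_out)


def depthwise_conv2d_macs(h_kernel, w_kernel, h_out, w_out, c_out):
    """Every output element sums over an h_kernel x w_kernel window of one input channel."""
    return (h_kernel * w_kernel) * (h_out * w_out * c_out)


conv_macs = depthwise_conv2d_macs(10, 8, h_out=25, w_out=20, c_out=8)
fc_macs = dense_macs(in_features=4000, out_features=6)
total_macs = conv_macs + fc_macs

print(f"DepthwiseConv2D: {conv_macs}")
print(f"FullyConnected:  {fc_macs}")
print(f"Total:           {total_macs}")
```

For the example model the convolution dominates the compute demand, while the FullyConnected layer dominates the ROM usage.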

## 11. Final challenge

To pass the lab you have to design a model architecture for the keyword-spotting task which satisfies each of the following constraints:

1. Accuracy of the quantized TFLite model is at least 90%
2. Total memory requirement to store all constants/weights in ROM is at most 75kB (See Task X?) TODO: TFLITE file size instead?, update limit
3. Best case memory requirement for intermediate tensors in RAM is at most 100kB (See Task X?) TODO: update limit
4. Estimated number of MAC operations is at most 100000? (See Task X?) TODO: update limit


*Hints:*
- Training or model parameters may not be changed to complete this challenge.
- A subset of the following Keras layers should suffice: Dense, Conv2D, DepthwiseConv2D, Flatten, Reshape
- Try to keep the total number of (depthwise) convolutions and dense layers below 6.

## Further information

Summary of tasks in this lab:

- **Task 1:**
- **Task 2:**
- **Task 3:**
- ...
- **Task X:**

TODO: decide if the estimation tasks should be solved using the example model or the final one?

TODO: Should answers to theory questions be submitted or only be useful as "Exam"-Prep?

TODO: Submit only model `.tflite` file (accuracy could be tested via CI) or also keras code?

TODO: FLOPS vs MACS?

TODO: useful links:
https://stackoverflow.com/questions/56138754/formula-to-compute-the-number-of-macs-in-a-convolutional-neural-network

https://leimao.github.io/blog/Depthwise-Separable-Convolution/ (pytorch)

https://medium.com/@zurister/depth-wise-convolution-and-depth-wise-separable-convolution-37346565d4ec

https://cdmana.com/2021/04/20210413132349621b.html

https://towardsdatascience.com/understanding-and-calculating-the-number-of-parameters-in-convolution-neural-networks-cnns-fc88790d530d

(https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d)?

!!!https://machinethink.net/blog/how-fast-is-my-model/

TODO: Conv output size for manual?

**Conv2D**

  General: `h_out x w_out = (n_h - k_h + p_h + 1) x (n_w - k_w + p_w + 1)`
  
  With non-default stride: `h_out x w_out = floor((n_h - k_h + p_h + s_h)/s_h) x floor((n_w - k_w + p_w + s_w)/s_w)`
  
  For `padding="VALID"` (`p_h=p_w=0`): `h_out x w_out = (n_h - k_h + 1) x (n_w - k_w + 1)`
  
  For `padding="SAME"` (`p_h=k_h-1`, `p_w=k_w-1`):