### Student Information:
- Name: Yilun Wang
- BU email: yilun830@bu.edu
- Collaborators: Yipeng Guo / ypguo@bu.edu

# Major Assignment Part2

You have already built and trained a model capable of recognizing a single digit from a 1s recording. The next step for our automated phone payment system is to extend the model to recognize 16 digits in a row (I made a mistake earlier saying that there are 12 digits!). Here is what you need to do:

1.   Extend the single-digit voice recognition model to take a 16*16000 component waveform and output 16 digits. For simplicity you can assume that each second of the input contains the recording of a single digit.


2.   *Optional*: Provide a method to convert a recording of someone saying 16 digits in a row into a 16*16000-component vector, A, where `A[16000*j : 16000*(j+1)]` contains a recording of a single digit.
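
As a minimal sketch of this layout (a synthetic array stands in for real audio, and the names `SEGMENT`, `N_DIGITS`, and `digit_segment` are illustrative, not part of the assignment), digit `j` occupies samples `16000*j` up to but not including `16000*(j+1)`, which is the same as row `j` after reshaping to `(16, 16000)`:

```python
import numpy as np

SEGMENT = 16000   # samples per 1 s segment at 16 kHz
N_DIGITS = 16

# Synthetic stand-in for a recording: segment j is filled with the value j.
A = np.repeat(np.arange(N_DIGITS, dtype="float32"), SEGMENT)

def digit_segment(waveform, j):
    """Return the 1 s slice that holds the j-th digit."""
    return waveform[SEGMENT * j: SEGMENT * (j + 1)]

# Reshaping gives the same 16 slices as rows of a 2-D array.
segments = A.reshape(N_DIGITS, SEGMENT)
same = all(np.array_equal(digit_segment(A, j), segments[j])
           for j in range(N_DIGITS))
print(A.shape, segments.shape, same)
```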

Be sure to submit this notebook as well as the saved weights of your final model in the h5 format.

## Set up - DO NOT EDIT THIS SECTION

In [1]:
!pip install tensorflow_io

Collecting tensorflow_io
  Downloading tensorflow_io-0.25.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.4 MB)
Installing collected packages: tensorflow-io
Successfully installed tensorflow-io-0.25.0


In [2]:
import os

from IPython import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_io as tfio
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.utils import shuffle

In [3]:
dataset_links = {'train_data': 'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz',
                 'test_data': 'http://download.tensorflow.org/data/speech_commands_test_set_v0.02.tar.gz'}

In [4]:
for key in dataset_links:
    tf.keras.utils.get_file(key+'.tar.gz',
                            dataset_links[key],
                            cache_dir='./',
                            cache_subdir='datasets/'+key,
                            extract=True)

Downloading data from http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
Downloading data from http://download.tensorflow.org/data/speech_commands_test_set_v0.02.tar.gz


In [5]:
train_data_paths = []
for folder, labels, samples in os.walk('./datasets/train_data/'):
    for sample in samples:
        if sample[-3:] == 'wav':
            train_data_paths.append([folder+'/'+sample, folder[22:]])

df = pd.DataFrame(train_data_paths, columns=['paths', 'labels'])
df = df.drop(df[df['labels'] =='_background_noise_'].index)
categories = df['labels'].unique()
digits_dict = {'zero':0, 'one':1, 'two':2, 
               'three':3, 'four':4, 'five':5,
               'six':6, 'seven':7, 'eight':8,
               'nine':9}
digits_index = []
for digit in digits_dict.keys():
    digits_index = digits_index + list(df[df['labels']==digit].index)
df = df.loc[digits_index]
df = df.sample(frac=1)
df.reset_index(inplace=True)

In [6]:
# Audio import function with padding
def load_audio(filepath):
    """Load a wav file and return a float32 numpy array of shape (16000,).
    The file must have a sample rate of 16000 Hz. The expected duration
    is 1 s: shorter samples are zero-padded at the end, while longer
    samples are cropped to 1 s."""
    audio = tfio.audio.AudioIOTensor(filepath)
    audio_rate = int(audio.rate)
    assert audio_rate == 16000
    audio = audio.to_tensor().numpy().reshape((-1)) / 32767.0
    audio = audio.astype(dtype="float32")
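    # Use a name that does not shadow the built-in len().
    n_samples = audio.shape[0]
    # Pad short clips with trailing zeros; crop long clips to 1 s.
    if n_samples == 16000:
        return audio
    elif n_samples < 16000:
        return np.concatenate([audio,
                               np.zeros(shape=(16000-n_samples),
                                        dtype="float32")],
                              axis=0)
    else:
        return audio[0:16000]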


# The dataset class used to feed data to our model during training and evaluation.
class audio_gen(keras.utils.Sequence):
    def __init__(self, file_paths, labels,
                 batch_size=32, shape=(16*16000,),
                 shuffle_on_epoch_end=True):
        # Initialization
        super().__init__()
        self.shape = shape
        self.batch_size = batch_size
        self.labels = labels
        self.paths = file_paths
        self.n_channels = 1
        self.n_classes = 10
        self.shuffle = shuffle_on_epoch_end
        self.on_epoch_end()
    
    def __len__(self):
        return int(np.floor(len(self.paths) / self.batch_size))
    
    def __getitem__(self, idx):
        batch_paths = self.paths[self.batch_size * idx: 
                                 self.batch_size * (idx+1)]
        batch_labels = self.labels[self.batch_size * idx:
                                   self.batch_size * (idx+1)]
        batch_samples = np.zeros(shape=(0, self.shape[0]), 
                                 dtype='float32')
        for paths in batch_paths:
            sample = np.zeros(shape=(0), dtype='float32')
            for path in paths:
                sample = np.concatenate([sample, load_audio(path)], axis=0)
            batch_samples = np.concatenate([batch_samples, [sample]], axis=0)
        return batch_samples, np.array(batch_labels, dtype='int')

    def on_epoch_end(self):
        # Shuffle the dataset after each epoch
        if self.shuffle:
            self.paths, self.labels = shuffle(self.paths, self.labels)
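
The pad-or-crop rule that `load_audio` applies can be tried without any audio files. The sketch below is an illustration only: `pad_or_crop` is a hypothetical helper, not part of the notebook, and it uses `np.pad` instead of the explicit concatenation above.

```python
import numpy as np

def pad_or_crop(audio, target_len=16000):
    """Zero-pad at the end, or crop, so the result has target_len samples."""
    audio = np.asarray(audio, dtype="float32")
    n_samples = audio.shape[0]
    if n_samples >= target_len:
        return audio[:target_len]
    # np.pad defaults to constant padding with zeros.
    return np.pad(audio, (0, target_len - n_samples))

short_clip = pad_or_crop(np.ones(12000))   # 0.75 s clip, padded
long_clip = pad_or_crop(np.ones(20000))    # 1.25 s clip, cropped
print(short_clip.shape, long_clip.shape)
```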

In [7]:
# We will ignore the constraints on credit card numbers for now.

train_paths = np.array(df['paths'])[:32000].reshape((-1,16))
train_labels = np.array([digits_dict[x] for x in df['labels']])[0:32000].reshape((-1,16))

valid_paths = np.array(df['paths'])[32000:35200].reshape((-1,16))
valid_labels = np.array([digits_dict[x] for x in df['labels']])[32000:35200].reshape((-1,16))

test_paths = np.array(df['paths'])[35200:38896].reshape((-1,16))
test_labels = np.array([digits_dict[x] for x in df['labels']])[35200:38896].reshape((-1,16))

In [8]:
train_gen = audio_gen(train_paths, train_labels)
valid_gen = audio_gen(valid_paths, valid_labels)
test_gen = audio_gen(test_paths, test_labels)
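
Because `audio_gen.__len__` rounds down, each generator drops any sequences that do not fill a complete batch. The arithmetic below is a sketch that takes the slice boundaries from the cell above and the default `batch_size=32`; the helper `batch_stats` is hypothetical.

```python
SEQ_LEN = 16
BATCH_SIZE = 32  # default batch_size of audio_gen

# Number of files in each slice of the dataframe, from the split cell above.
files = {"train": 32000, "valid": 35200 - 32000, "test": 38896 - 35200}

def batch_stats(n_files):
    """Return (sequences, full batches, sequences actually served per epoch)."""
    n_sequences = n_files // SEQ_LEN
    n_batches = n_sequences // BATCH_SIZE   # mirrors the floor in __len__
    return n_sequences, n_batches, n_batches * BATCH_SIZE

stats = {name: batch_stats(n) for name, n in files.items()}
for name, (n_seq, n_batches, n_used) in stats.items():
    print(f"{name}: {n_seq} sequences, {n_batches} batches, {n_used} used")
```

Under these assumptions the test generator serves 224 of its 231 sequences, so evaluation metrics are computed over 224 sequences.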

In [9]:
def get_spectrogram(audio_tensor):
    return tfio.audio.spectrogram(audio_tensor,
                                  nfft=512,
                                  window=256,
                                  stride=128)

def mel_spectrogram(audio_tensor):
    return tfio.audio.melscale(get_spectrogram(audio_tensor),
                               rate=16000,
                               mels=128,
                               fmin=0,
                               fmax=8000)


def dbscale_spectrogram(audio_tensor):
    return tfio.audio.dbscale(mel_spectrogram(audio_tensor),
                              top_db=80)/60.0
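
The spectrogram shape that the models later reshape to `(125, 128, 1)` follows from the parameters above. The sketch below reproduces that arithmetic; the two formulas are assumptions about how the STFT pads and how many frequency bins it keeps, chosen because they match the 125 frames seen elsewhere in this notebook.

```python
import math

SAMPLES = 16000                     # 1 s at 16 kHz
NFFT, STRIDE, MELS = 512, 128, 128  # values used in the cell above

# Assumption: the STFT pads the end of the signal,
# so the number of frames is ceil(samples / stride).
frames = math.ceil(SAMPLES / STRIDE)

# Assumption: a real-input FFT keeps nfft // 2 + 1 frequency bins
# before they are mapped onto the mel scale.
freq_bins = NFFT // 2 + 1

print("linear spectrogram:", (frames, freq_bins))
print("mel spectrogram:", (frames, MELS))
```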

### Evaluation metric

In [10]:
# The model's prediction is accurate only if the model predicts all 16
# digits correctly. The custom metric below can be used to evaluate the
# performance of the model.

class seq_accuracy(keras.metrics.Metric):

    def __init__(self):
        super(seq_accuracy, self).__init__()
        self.total = self.add_weight(name='total', initializer='zeros')
        self.count = self.add_weight(name='count', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Takes the product of single-digit accuracies for each 16-digit sample.
        accuracies = tf.reduce_prod(
            tf.cast(tf.equal(y_true, tf.argmax(y_pred, axis=2)), 
                    tf.float32), axis=1)
        sum_a = tf.reduce_sum(accuracies)
        with tf.control_dependencies([sum_a]):
            update_t = self.total.assign_add(sum_a)
        num_a = tf.cast(tf.size(accuracies), self._dtype)
        with tf.control_dependencies([update_t]):
            return self.count.assign_add(num_a)
    
    def result(self):
        return tf.math.divide_no_nan(self.total, self.count)
    
    def reset_states(self):
        # Reset both accumulators so results do not carry over between epochs.
        self.total.assign(0.)
        self.count.assign(0.)
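
The same computation can be written in plain NumPy to make the definition concrete. This is a sketch for illustration, not a replacement for the Keras metric; `sequence_accuracy` and the toy data are hypothetical.

```python
import numpy as np

def sequence_accuracy(y_true, y_pred):
    """Fraction of sequences whose digits are all predicted correctly.

    y_true: (batch, 16) integer labels.
    y_pred: (batch, 16, 10) class scores.
    """
    digit_hits = (y_true == np.argmax(y_pred, axis=2)).astype("float32")
    sequence_hits = np.prod(digit_hits, axis=1)  # 1.0 only if all 16 match
    return float(np.mean(sequence_hits))

rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=(4, 16))
y_pred = np.eye(10, dtype="float32")[y_true]   # one-hot scores: all correct
y_pred[0, 5] = np.roll(y_pred[0, 5], 1)        # break one digit in sequence 0
print(sequence_accuracy(y_true, y_pred))       # 3 of 4 sequences stay correct
```

A single wrong digit makes the whole sequence count as wrong, which is why this metric is much harsher than per-digit accuracy.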

### Sample inputs and outputs

In [11]:
# Sample inputs and labels in our dataset:
for batch in train_gen:
    sample_inputs = batch[0]
    print(sample_inputs.shape)
    sample_labels = batch[1]
    print(sample_labels.shape)
    break

(32, 256000)
(32, 16)


In [12]:
# Sample outputs:
# predictions = model_16.predict(sample_inputs)
# predictions.shape

In [13]:
# Inference:
#np.argmax(predictions, axis=2)[0]

In [14]:
#sample_labels[0]

# Edit the cells below

## Import your single-digit voice recognition model

Import your pretrained single-digit classifier. The model should take 16000-component 'waveform' vectors as input and produce a 10-component 'softmax' vector.

In [15]:
# Single-digit classifier structure:
#
#model_1_inputs = keras.Input(shape=(16000,))
#x = layers.Dense(10, activation='softmax')(model_1_inputs)
#
#model_1 = keras.Model(inputs = model_1_inputs, outputs = x)
#model_1.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=['accuracy'])
#
#model_1.load_weights('./model_1.h5')


model_1_inputs = keras.Input(shape=(16000,))
x = layers.Lambda(lambda waveform: dbscale_spectrogram(waveform))(model_1_inputs)
x = layers.Reshape((125, 128, 1))(x)
x = layers.Conv2D(32, 4, 1, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 4, 1, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(128, 4, 1)(x)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(256, 4, 1, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(512, 2, 1, activation='relu')(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dense(10, activation='softmax')(x)

model_1 = keras.Model(inputs = model_1_inputs, outputs = x)

model_1.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=['accuracy'])

We will provide the pre-trained model weights in qtools when submitting our homework.

In [None]:
model_1.load_weights('/content/Conv2D_10_val975.h5')

## Extend to 16 digits

Extend your single-digit classifier to classify 16 digits in parallel. The expected input of the model is a 256000-component 'waveform' while the output should be a (16,10)-shaped tensor or sixteen 10-component vectors where each 10-component vector corresponds to the classification of a single digit.


---


Hint: Consider the toy-classifier below which classifies a single digit recording.

```
inputs = keras.Input(shape=(16000,))
x = layers.Dense(400, activation = 'relu')(inputs)
x = layers.Dense(10, activation = 'softmax')(x)
classifier1 = keras.Model(inputs = inputs, outputs = x)
```

To extend this to 16 parallel classifiers we can play a simple trick inspired by the object detection model we discussed in class:

```
inputs = keras.Input(shape=(16*16000,))
x = layers.Reshape((16,16000))(inputs)
x = layers.Dense(400, activation = 'relu')(x)
x = layers.Dense(10, activation = 'softmax')(x)
classifier16 = keras.Model(inputs = inputs, outputs = x)
```
This is equivalent to running 16 single-digit classifiers (with shared weights) in parallel and feeding each only a portion of the input (`input[16000*j : 16000*(j+1)]`).
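
The equivalence can be checked numerically with a single dense layer. The sketch below uses NumPy and deliberately tiny sizes (4 segments of 8 samples, 3 classes) so that it runs instantly; the weights `W`, `b` and the `softmax` helper are stand-ins, not the Keras layers from the hint.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes; the notebook uses 16 segments of 16000 samples and 10 classes.
n_segments, segment_len, n_classes = 4, 8, 3

W = rng.normal(size=(segment_len, n_classes))   # shared dense kernel
b = rng.normal(size=n_classes)                  # shared dense bias

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)       # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

waveform = rng.normal(size=n_segments * segment_len)

# "Parallel" version: reshape, then apply the layer along the last axis.
parallel = softmax(waveform.reshape(n_segments, segment_len) @ W + b)

# Explicit version: run the same classifier on each slice in turn.
looped = np.stack([
    softmax(waveform[segment_len * j: segment_len * (j + 1)] @ W + b)
    for j in range(n_segments)
])

print(np.allclose(parallel, looped))
```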

---


### Your extended model

In [18]:
model_16_inputs = keras.Input(shape=(16*16000,))
x = layers.Reshape((16,16000))(model_16_inputs)
x = layers.Lambda(lambda waveform: dbscale_spectrogram(waveform))(x)
x = layers.Reshape((2000, 128, 1))(x)
x = layers.Conv2D(32, 4, 1, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 4, 1, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(128, 4, 1)(x)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(256, 4, 1, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(512, 2, 1, activation='relu')(x)
#x = layers.GlobalAveragePooling2D()(x)
x = layers.Reshape((16,15488))(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dense(512, activation='relu')(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dense(10, activation='softmax')(x)

model_16 = keras.Model(inputs = model_16_inputs, outputs = x)
model_16.summary()


metric = seq_accuracy()
loss = keras.losses.sparse_categorical_crossentropy
model_16.compile(optimizer="adam", loss=loss, metrics = [metric])

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 256000)]          0         
                                                                 
 reshape_3 (Reshape)         (None, 16, 16000)         0         
                                                                 
 lambda_2 (Lambda)           (None, 16, 125, 128)      0         
                                                                 
 reshape_4 (Reshape)         (None, 2000, 128, 1)      0         
                                                                 
 conv2d_5 (Conv2D)           (None, 1997, 125, 32)     544       
                                                                 
 max_pooling2d_4 (MaxPooling  (None, 998, 62, 32)      0         
 2D)                                                             
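
The value 15488 in the final `Reshape((16,15488))` can be traced by following the feature-map size through the convolution and pooling layers. The sketch below assumes the Keras defaults used in the model (stride 1 and `'valid'` padding for `Conv2D`, non-overlapping 2x2 pooling); `conv_out`, `pool_out`, and `layers_spec` are hypothetical names.

```python
def conv_out(size, kernel):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output length of non-overlapping max pooling with 'valid' padding."""
    return size // pool

# 16 stacked spectrograms of 125 frames x 128 mel bins.
height, width = 2000, 128
# (kernel size, followed by 2x2 max pooling?) in the order used by model_16.
layers_spec = [(4, True), (4, True), (4, True), (4, True), (2, False)]

for kernel, pooled in layers_spec:
    height, width = conv_out(height, kernel), conv_out(width, kernel)
    if pooled:
        height, width = pool_out(height), pool_out(width)

channels = 512
total = height * width * channels
print((height, width, channels), total, total // 16, total % 16)
```

The total number of features divides evenly by 16, which is why the reshape succeeds. The 121 rows of the final feature map do not divide evenly by 16, however, so each 15488-element row of the reshaped tensor does not line up exactly with the frames of one digit.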
                                                           

In [19]:
callback = keras.callbacks.ModelCheckpoint("/content/model_16.h5",
                                           monitor='val_loss',
                                           save_weights_only=True,
                                           save_best_only=True)

In [20]:
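# The batch size is set by the generators (32 by default), so it is not
# passed to fit(); callbacks are passed as a list.
history = model_16.fit(train_gen, validation_data=valid_gen,
                       epochs=100, callbacks=[callback])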

Epoch 1/100
...
Epoch 78/100
[Per-epoch metrics were not captured; the log ends partway through training.]

In [None]:
# 16-digit classifier structure:
#
# model_16_inputs = keras.Input(shape=(16*16000,))
# x = layers.Reshape((16, 16000))(model_16_inputs)
# x = layers.Dense(10, activation='softmax')(x)
#
# model_16 = keras.Model(inputs = model_16_inputs, outputs = x)
# model_16.compile(optimizer="adam", loss=loss, metrics = [metric])
#

In [None]:
# Use the custom metric seq_accuracy for evaluating the performance of your model.
metric = seq_accuracy()
loss = keras.losses.sparse_categorical_crossentropy
model_16.compile(optimizer="adam", loss=loss, metrics=[metric])
model_16.evaluate(test_gen)



[0.245279923081398, 0.4866071343421936]

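In the end, we reached a sequence accuracy of 48.66% on the test set (test loss 0.245). A sequence counts as correct only when all 16 digits are predicted correctly, so this is a much stricter measure than single-digit accuracy.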
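
To put the sequence accuracy in perspective, it can be converted into an implied per-digit accuracy. This is a rough estimate only: it assumes that digit errors are independent and equally likely at every position, which the notebook does not verify.

```python
# Second entry of the evaluate() output above.
seq_accuracy_value = 0.4866071343421936
n_digits = 16

# Assumption: errors are independent, so
# sequence accuracy = (per-digit accuracy) ** 16.
per_digit = seq_accuracy_value ** (1 / n_digits)
print(f"implied per-digit accuracy: {per_digit:.3f}")
```

Under that assumption, the model gets roughly 95.6% of individual digits right.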

## Parsing (*Optional*)

So far we assumed that the waveform vector was parsed such that each 16000-long segment records a single digit. This may not be the case for a real recording, so the input needs to be preprocessed. If you feel extra motivated, you can try writing a function which implements the following:


1.   Take as input a waveform of arbitrary length.
2.   Locate the spoken digits in the waveform and check that there are 16 of them.
3.   Pad/crop the waveform such that the spoken digits are located in 1 s long segments.
4.   Return the resulting 256000-component vector.
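
One possible approach to the steps above is simple energy-based segmentation. The sketch below is a starting point under strong assumptions, not a tested solution for real speech: it assumes a quiet background and clear pauses between digits, and the function `parse_digits` with its parameters `frame`, `threshold`, and `min_gap_frames` is hypothetical. The demo input is a synthetic signal of 16 tone bursts, not recorded audio.

```python
import numpy as np

RATE = 16000
N_DIGITS = 16

def parse_digits(waveform, frame=400, threshold=0.1, min_gap_frames=8):
    """Return a (16 * 16000,) vector with one detected utterance per 1 s."""
    waveform = np.asarray(waveform, dtype="float32")
    n_frames = len(waveform) // frame
    # Short-time RMS energy over 25 ms frames.
    frames = waveform[: n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    active = rms > threshold * rms.max()

    # Collect [start, end) frame ranges of consecutive active frames.
    regions, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            regions.append([start, i])
            start = None
    if start is not None:
        regions.append([start, n_frames])

    # Merge regions separated by very short pauses (e.g. inside one word).
    merged = []
    for region in regions:
        if merged and region[0] - merged[-1][1] < min_gap_frames:
            merged[-1][1] = region[1]
        else:
            merged.append(region)

    if len(merged) != N_DIGITS:
        raise ValueError(f"expected {N_DIGITS} digits, found {len(merged)}")

    out = np.zeros(N_DIGITS * RATE, dtype="float32")
    for j, (first, last) in enumerate(merged):
        clip = waveform[first * frame: last * frame][:RATE]  # crop to 1 s
        offset = (RATE - len(clip)) // 2                     # center the clip
        out[j * RATE + offset: j * RATE + offset + len(clip)] = clip
    return out

# Synthetic test signal: 16 bursts of 0.4 s, each preceded by 0.3 s of silence.
t = np.arange(int(0.4 * RATE)) / RATE
burst = (0.5 * np.sin(2 * np.pi * 440 * t)).astype("float32")
silence = np.zeros(int(0.3 * RATE), dtype="float32")
recording = np.concatenate([np.concatenate([silence, burst])
                            for _ in range(N_DIGITS)])
parsed = parse_digits(recording)
print(recording.shape, parsed.shape)
```

Real recordings would likely need a more robust detector (for example, smoothing the energy curve or handling background noise), and the thresholds here would have to be tuned.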

