# INTRODUCTION


The project will use speech data and their transcriptions to train a speech-to-text model. The goal is to develop a model that can collect data through speech. Five deep learning models will be compared, and the best one will be used to predict text from speech input.

# PREPROCESSING


The first step is to preprocess the data and generate spectrograms from it. The spectrogram of one of the audio clips is displayed below. The spectrograms are then converted to Mel Frequency Cepstral Coefficients (MFCCs). The final models accept the dimensions of either a spectrogram or an MFCC as input (161 spectrogram bins versus 13 MFCC coefficients in the models below). MFCCs improve on the spectrogram in two ways: they account for the fact that humans perceive speech frequencies on a roughly logarithmic scale, and they compress the features into a small set of coefficients that capture the most important frequency information.
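
To make the spectrogram-to-MFCC step concrete, here is a minimal NumPy sketch. It is not the project's `data_generator` code: the function names, the 25 ms window with a 10 ms hop, the 26 mel filters, and the synthetic 440 Hz tone are all assumptions chosen for illustration. Only the 13 output coefficients are taken from the notebook (`input_dim=13`).

```python
import numpy as np

def hz_to_mel(hz):
    # Common mel-scale formula: close to linear at low frequencies, logarithmic higher up.
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrogram(signal, frame_len, hop, n_fft):
    # Slice the signal into overlapping Hann-windowed frames and take |FFT|^2.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft)) ** 2  # shape: (n_frames, n_fft // 2 + 1)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose centres are evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=16000, n_mfcc=13, n_filters=26, n_fft=512):
    frame_len, hop = int(0.025 * sample_rate), int(0.010 * sample_rate)
    spec = power_spectrogram(signal, frame_len, hop, n_fft)
    log_mel = np.log(spec @ mel_filterbank(n_filters, n_fft, sample_rate).T + 1e-10)
    # A DCT-II compresses the log mel energies; keep only the first n_mfcc coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi / n_filters * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return log_mel @ dct.T

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)
features = mfcc(tone)
print(features.shape)  # (time frames, 13 coefficients)
```

Each row of `features` describes one short time frame with 13 numbers, which is why the MFCC models below are much narrower at the input than the spectrogram models.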

In [1]:
from data_generator import vis_train_features


In [None]:

# extract label and audio features for a single training example
vis_text, vis_raw_audio, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path = vis_train_features()

In [None]:
from IPython.display import Markdown, display
from data_generator import vis_train_features, plot_raw_audio
from IPython.display import Audio
%matplotlib inline

# plot audio signal
plot_raw_audio(vis_raw_audio)
# print length of audio signal
display(Markdown('**Shape of Audio Signal** : ' + str(vis_raw_audio.shape)))
# print transcript corresponding to audio clip
display(Markdown('**Transcript** : ' + str(vis_text)))
# play the audio file
Audio(vis_audio_path)

In [None]:
from data_generator import plot_spectrogram_feature

# plot normalized spectrogram
plot_spectrogram_feature(vis_spectrogram_feature)
# print shape of spectrogram
display(Markdown('**Shape of Spectrogram** : ' + str(vis_spectrogram_feature.shape)))

In [None]:
from data_generator import plot_mfcc_feature

# plot normalized MFCC
plot_mfcc_feature(vis_mfcc_feature)
# print shape of MFCC
display(Markdown('**Shape of MFCC** : ' + str(vis_mfcc_feature.shape)))

# TRAINING


The feature dimensions extracted from the spectrograms and MFCCs set the input size for the next step, training with deep learning. We begin with a simple RNN model and add layers to it. Each model outputs, for every time step, a vector of probabilities over the characters that may have been spoken. The models are trained with the CTC (connectionist temporal classification) loss, which aligns input frames with output characters automatically, so no frame-level labels are needed.
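
The alignment that CTC relies on can be illustrated by its decoding rule: per-frame labels are collapsed by merging consecutive repeats and then dropping a special blank symbol. The sketch below shows greedy (best-path) decoding only. The alphabet, the `_` blank symbol, and the probabilities are made up for illustration, and the helper names are hypothetical; the notebook itself decodes with `K.ctc_decode` in the prediction section.

```python
BLANK = "_"  # assumed blank symbol for this illustration

def ctc_collapse(frame_labels):
    """Merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

def greedy_decode(prob_rows, alphabet):
    """Pick the most probable symbol in each frame, then collapse."""
    best = [alphabet[max(range(len(row)), key=row.__getitem__)] for row in prob_rows]
    return ctc_collapse(best)

alphabet = [BLANK, "a", "c", "t"]
# One row of (made-up) character probabilities per audio frame.
probs = [
    [0.1, 0.1, 0.7, 0.1],  # c
    [0.1, 0.1, 0.7, 0.1],  # c (repeat, merged)
    [0.6, 0.2, 0.1, 0.1],  # blank
    [0.1, 0.7, 0.1, 0.1],  # a
    [0.1, 0.6, 0.2, 0.1],  # a (repeat, merged)
    [0.1, 0.1, 0.1, 0.7],  # t
]
print(greedy_decode(probs, alphabet))
print(ctc_collapse("hh_e_ll_lo"))
```

The blank is what lets a genuine double letter survive: in `hh_e_ll_lo` the two `l` runs are separated by a blank, so both are kept.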

In [None]:
# # allocate 50% of GPU memory (if you like, feel free to change this)
# from tensorflow.keras.backend.tensorflow_backend import set_session
# import tensorflow as tf 
# config = tf.ConfigProto()
# config.gpu_options.per_process_gpu_memory_fraction = 0.5
# set_session(tf.Session(config=config))

# # watch for any changes in the sample_models module, and reload it automatically
# %load_ext autoreload
# %autoreload 2
# import NN architectures for speech recognition
from sample_models import *
# import function for training acoustic model
from train_utils import train_model

# SIMPLE RNN


A recurrent neural network (RNN) works by looking at past values as well as the information gained from them. These models suit sequential data such as speech and text. In the simple RNN, the output of the previous time step is fed into the next time step.
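
The recurrence can be written out in a few lines of NumPy. This is a sketch of the textbook update `h_t = tanh(W_x x_t + W_h h_{t-1} + b)`, not the implementation behind `simple_rnn_model`; the weight names, the 4 hidden units, and the random inputs are assumptions for illustration, with 13 input features to mirror the MFCC dimension.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Run h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b) over a sequence."""
    h = np.zeros(W_h.shape[0])  # initial state h_0 = 0
    states = []
    for x_t in inputs:
        # The previous state h feeds into the computation of the next state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, units, timesteps = 13, 4, 5  # 13 mirrors the MFCC dimension; the rest is arbitrary
W_x = rng.normal(scale=0.1, size=(units, input_dim))
W_h = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)
x = rng.normal(size=(timesteps, input_dim))

states = simple_rnn_forward(x, W_x, W_h, b)
print(states.shape)  # one hidden state per time step
```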

In [None]:
model_0 = simple_rnn_model(input_dim=13)


In [None]:
train_model(input_to_softmax=model_0, 
            pickle_path='model_0.pickle', 
            save_model_path='model_0.h5',
            spectrogram=False)

In [None]:
model_1 = rnn_model(input_dim=13, 
                    units=200,
                    activation='relu')

In [None]:
train_model(input_to_softmax=model_1, 
            pickle_path='model_1.pickle', 
            save_model_path='model_1.h5',
            spectrogram=False)

In [None]:
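# NOTE: model_2 is never defined in this notebook; build it with one of the
# constructors from sample_models before running this cell.
train_model(input_to_softmax=model_2, 
            pickle_path='model_2.pickle', 
            save_model_path='model_2.h5', 
            spectrogram=False)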

In [None]:
model_3 = deep_rnn_model(input_dim=161,
                         units=200,
                         recur_layers=2)

In [None]:
train_model(input_to_softmax=model_3, 
            pickle_path='model_3.pickle', 
            save_model_path='model_3.h5', 
            spectrogram=True)

In [None]:
model_4 = bidirectional_rnn_model(input_dim=161, 
                                  units=200)

In [None]:
train_model(input_to_softmax=model_4, 
            pickle_path='model_4.pickle', 
            save_model_path='model_4.h5', 
            spectrogram=True)


In [None]:
import os
import mlflow
from getpass import getpass

os.environ['MLFLOW_TRACKING_USERNAME'] = input('Enter your DAGsHub username: ')
os.environ['MLFLOW_TRACKING_PASSWORD'] = getpass('Enter your DAGsHub access token: ')
os.environ['MLFLOW_TRACKING_PROJECTNAME'] = input('Enter your DAGsHub project name: ')

mlflow.set_tracking_uri(f"https://dagshub.com/{os.environ['MLFLOW_TRACKING_USERNAME']}"
                        f"/{os.environ['MLFLOW_TRACKING_PROJECTNAME']}.mlflow")


#   mlflow.log_metric("m1", 2.0)
#   mlflow.log_param("p1", "mlflow-colab")

The loss curves show how each model learns; the curves for the five models above are plotted below. Model 4, with a bidirectional layer, appears to have a good learning rate. As explained at https://cs231n.github.io/neural-networks-3/#loss, the shape of the loss curve depends on the learning rate: with low learning rates the improvement is roughly linear, while higher learning rates decay the loss faster but can get stuck at worse loss values.
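
The effect of the learning rate on the shape of a loss curve can be reproduced on a toy problem. The sketch below runs plain gradient descent on `loss(w) = w**2`, which is an assumption made purely for illustration and has nothing to do with the speech models; the function name and the learning rates are hypothetical choices.

```python
def loss_curve(learning_rate, steps=20, w0=5.0):
    """Gradient descent on loss(w) = w**2; returns the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        w -= learning_rate * 2.0 * w  # the gradient of w**2 is 2w
        losses.append(w * w)
    return losses

for lr in (0.01, 0.1, 0.4, 1.1):
    curve = loss_curve(lr)
    print(f"lr={lr}: first loss={curve[0]:.3f}, last loss={curve[-1]:.3g}")
```

On this toy loss the smallest rate barely moves, the middle rates decay quickly, and the largest rate overshoots the minimum on every step so the loss grows instead of shrinking.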

In [None]:
from glob import glob
import numpy as np
import _pickle as pickle
import seaborn as sns
import mlflow
import mlflow.tensorflow
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style(style='white')

with mlflow.start_run(run_name="speech-recognition",nested=True):

    # obtain the paths for the saved model history
    all_pickles = sorted(glob("results/*.pickle"))
    # extract the name of each model
    model_names = [item[8:-7] for item in all_pickles]
    # extract the loss history for each model
    # load each history once, closing the file afterwards
    histories = []
    for path in all_pickles:
        with open(path, "rb") as f:
            histories.append(pickle.load(f))
    valid_loss = [h['val_loss'] for h in histories]
    train_loss = [h['loss'] for h in histories]
    # save the number of epochs used to train each model
    num_epochs = [len(valid_loss[i]) for i in range(len(valid_loss))]

    fig = plt.figure(figsize=(16,5))

    # plot the training loss vs. epoch for each model
    ax1 = fig.add_subplot(121)
    for i in range(len(all_pickles)):
        ax1.plot(np.linspace(1, num_epochs[i], num_epochs[i]), 
                train_loss[i], label=model_names[i])
    # clean up the plot
    ax1.legend()  
    ax1.set_xlim([1, max(num_epochs)])
    plt.xlabel('Epoch')
    plt.ylabel('Training Loss')

    # plot the validation loss vs. epoch for each model
    ax2 = fig.add_subplot(122)
    for i in range(len(all_pickles)):
        ax2.plot(np.linspace(1, num_epochs[i], num_epochs[i]), 
                valid_loss[i], label=model_names[i])
    # clean up the plot
    ax2.legend()  
    ax2.set_xlim([1, max(num_epochs)])
    plt.xlabel('Epoch')
    plt.ylabel('Validation Loss')
    #     plt.show()
    plt.savefig("loss.png")
    mlflow.log_artifact("loss.png")
    plt.show()

In [None]:
import IPython
display(IPython.display.IFrame("https://dagshub.com/"+ os.environ['MLFLOW_TRACKING_USERNAME'] 
                        + '/' + os.environ['MLFLOW_TRACKING_PROJECTNAME'] + "/experiments/#/",'100%',600))

The model chosen for the speech-to-text system is a bidirectional model with two layers, because future context is available for speech data and can be exploited to improve predictions.
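
What "future context" means can be seen with a toy recurrence run in both directions. This is only a sketch: the decaying-sum recurrence stands in for a real RNN cell, and the function names and the input sequence are assumptions for illustration, not the layers built by `final_model`.

```python
def recur(seq, decay=0.5):
    """Toy recurrence: state_t = decay * state_{t-1} + x_t."""
    state, out = 0.0, []
    for x in seq:
        state = decay * state + x
        out.append(state)
    return out

def bidirectional(seq, decay=0.5):
    """Pair a left-to-right state with a right-to-left state at every step."""
    forward = recur(seq, decay)               # summarises x[0..t]
    backward = recur(seq[::-1], decay)[::-1]  # summarises x[t..end]
    return list(zip(forward, backward))

seq = [1.0, 0.0, 0.0, 2.0]
for t, (fwd, bwd) in enumerate(bidirectional(seq)):
    print(t, fwd, bwd)
```

At the first step the forward state has only seen the first input, while the backward state already reflects the large value at the end of the sequence. A bidirectional layer gives the model both views at every time step.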

In [None]:
model_end = final_model(input_dim=13, units=200)

In [None]:
import mlflow
mlflow.tensorflow.autolog()
mlflow.log_param("task",2)

In [None]:
train_model(input_to_softmax=model_end, 
            pickle_path='model_end.pickle', 
            save_model_path='model_end.h5', 
            spectrogram=False) # MFCC features; set to True (with input_dim=161) to use spectrogram features

# PREDICTION

In [2]:
import pickle
from data_generator import AudioGenerator
from IPython.display import Audio
from IPython.display import Markdown, display
import numpy as np
from utils import calc_feat_dim, spectrogram_from_file, text_to_int_sequence,int_sequence_to_text


In [None]:
# Code adapted from https://martin-thoma.com/word-error-rate-calculation/
def wer(r, h):
    """
    Calculation of WER with Levenshtein distance.

    Works only for iterables up to 254 elements (uint8).
    O(nm) time and space complexity.

    Parameters
    ----------
    r : list
    h : list

    Returns
    -------
    int

    Examples
    --------
    >>> wer("who is there".split(), "is there".split())
    1
    >>> wer("who is there".split(), "".split())
    3
    >>> wer("".split(), "who is there".split())
    3
    """
    # initialisation
    import numpy
    d = numpy.zeros((len(r)+1)*(len(h)+1), dtype=numpy.uint8)
    d = d.reshape((len(r)+1, len(h)+1))
    for i in range(len(r)+1):
        for j in range(len(h)+1):
            if i == 0:
                d[0][j] = j
            elif j == 0:
                d[i][0] = i

    # computation
    for i in range(1, len(r)+1):
        for j in range(1, len(h)+1):
            if r[i-1] == h[j-1]:
                d[i][j] = d[i-1][j-1]
            else:
                substitution = d[i-1][j-1] + 1
                insertion    = d[i][j-1] + 1
                deletion     = d[i-1][j] + 1
                d[i][j] = min(substitution, insertion, deletion)

    return d[len(r)][len(h)]
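
The function above returns an edit count, whereas a word error rate is usually reported as that count divided by the number of words in the reference. The following is a minimal, standalone sketch of that normalisation. The names `edit_distance` and `word_error_rate` are hypothetical, and plain Python integers are used so the 254-element `uint8` limit noted in the docstring does not apply.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(word_error_rate("who is there", "is there"))
```

Splitting on whitespace first matters: passing raw strings would compare characters rather than words.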


In [None]:
from data_generator import AudioGenerator
RNG_SEED = 123
model = model_end
model.load_weights('results/model_end.h5')

def make_audio_gen(train_json,
                   valid_json,
                   minibatch_size=20,
                   spectrogram=True,
                   mfcc_dim=13,
                   sort_by_duration=False,
                   max_duration=10.0):
    return AudioGenerator(minibatch_size=minibatch_size, 
        spectrogram=spectrogram, mfcc_dim=mfcc_dim, max_duration=max_duration,
        sort_by_duration=sort_by_duration)

In [None]:
TRAIN_CORPUS = "train_corpus.json"
VALID_CORPUS = "valid_corpus.json"

MFCC_DIM = 13
SPECTROGRAM = False  # False: MFCC features, True: spectrogram features
EPOCHS = 5
MODEL_NAME = "RNN_model"

################ Reminder MINI_BATCH_SIZE=250 
MINI_BATCH_SIZE = 20

SORT_BY_DURATION=False
MAX_DURATION = 10

audio_gen = make_audio_gen(TRAIN_CORPUS, VALID_CORPUS, spectrogram=SPECTROGRAM, mfcc_dim=MFCC_DIM,
                           minibatch_size=MINI_BATCH_SIZE, sort_by_duration=SORT_BY_DURATION,
                           max_duration=MAX_DURATION)
# add the training data to the generator
audio_gen.load_train_data()
audio_gen.load_validation_data()

In [None]:
def predict_raw(data_gen = audio_gen, index = 14, partition = 'train', model = model):
    """ Get a model's decoded predictions
    Params:
        data_gen: Data to run prediction on
        index (int): Example to visualize
        partition (str): Either 'train' or 'validation'
        model (Model): The acoustic model
    """

    if partition == 'validation':
        transcr = data_gen.valid_texts[index]
        audio_path = data_gen.valid_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    elif partition == 'train':
        transcr = data_gen.train_texts[index]
        audio_path = data_gen.train_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    else:
        raise Exception('Invalid partition!  Must be "train" or "validation"')
        
    prediction = model.predict(np.expand_dims(data_point, axis=0))
    return (audio_path,data_point,transcr,prediction)

In [None]:
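# K is used below but was never imported; this assumes the tf.keras backend
from tensorflow.keras import backend as K

def predict(data_gen=audio_gen, index=14, partition='train', model=model, verbose=True):
    """ Print a model's decoded predictions and return the word-level edit distance
    Params:
        data_gen: Data to run prediction on
        index (int): Example to visualize
        partition (str): Either 'train' or 'validation'
        model (Model): The acoustic model
        verbose (bool): If True, play the audio and print the transcripts
    """
    audio_path, data_point, transcr, prediction = predict_raw(data_gen, index, partition, model)
    output_length = [model.output_length(data_point.shape[0])]
    # beam-search CTC decoding (greedy=False)
    pred_ints = (K.eval(K.ctc_decode(
                prediction, output_length, greedy=False)[0][0])+1).flatten().tolist()
    predicted = ''.join(int_sequence_to_text(pred_ints)).replace("<SPACE>", " ")
    # compare word lists rather than raw strings, so the distance counts word errors
    ref_words = transcr.split()
    wer_val = wer(ref_words, predicted.split())
    # express the edit count as a percentage of the reference length
    wer_pct = 100.0 * wer_val / max(len(ref_words), 1)
    if verbose:
        display(Audio(audio_path, embed=True))
        print('Truth: ' + transcr)
        print('Predicted: ' + predicted)
        print("word errors: %d (%.1f%%)" % (wer_val, wer_pct))
    # write results to a file
    with open("metrics.txt", 'w') as outfile:
        outfile.write("Text input: %s\n" % transcr)
        outfile.write("Predicted Text: %s\n" % predicted)
        outfile.write("Word Error Rate: %2.1f%%\n" % wer_pct)
    # save a copy of the audio clip
    with open('test.wav', 'wb') as f:
        f.write(Audio(audio_path).data)
    return wer_val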


In [None]:
a = predict()