## Automatic Speech Recognition

Character level speech recognition can be broken into two parts;
The acoustic model, that describes the distribution over acoustic observations, O, given the character sequence, C. 
The language model based solely on the character sequence which assigns a probability to every possible character sequence. 
This sequence to sequence model combines both the acoustic and language models into one neural network, though pretrained acoustic models.

### Problem Statement

My goal was to build a character-level ASR system using a recurrent neural network in TensorFlow ith a word error rate of < 20.

In [2]:
# Common, File Based, and Math Imports
import pandas as pd
import numpy as np
import collections
import os
from os.path import isdir, join
from pathlib import Path
from subprocess import check_output
import sys
import math
import pickle
from glob import glob
import random
from random import sample
import json
from mpl_toolkits.axes_grid1 import make_axes_locatable
from numpy.lib.stride_tricks import as_strided
from tqdm import tqdm

# Audio processing
from scipy import signal
from scipy.fftpack import dct
import soundfile
import json
from python_speech_features import mfcc
import scipy.io.wavfile as wav
from scipy.fftpack import fft

# Neural Network
import keras
from keras.utils.generic_utils import get_custom_objects
from keras import backend as K
from keras import regularizers, callbacks
from keras.constraints import max_norm
from keras.models import Model, Sequential, load_model
from keras.layers import Input, Lambda, Dense, Dropout, Flatten, Embedding, merge, Activation, GRUCell, LSTMCell,SimpleRNNCell
from keras.layers import Convolution2D, MaxPooling2D, Convolution1D, Conv1D, SimpleRNN, GRU, LSTM, CuDNNLSTM, CuDNNGRU, Conv2D
from keras.layers.advanced_activations import LeakyReLU, PReLU, ThresholdedReLU, ELU
from keras.layers import LeakyReLU, PReLU, ThresholdedReLU, ELU
from keras.layers import BatchNormalization, TimeDistributed, Bidirectional
from keras.layers import activations, Wrapper
from keras.regularizers import l2
from keras.optimizers import Adam, SGD, RMSprop, Adagrad, Adadelta, Adamax, Nadam
from keras.callbacks import ModelCheckpoint 
from keras.utils import np_utils
from keras import constraints, initializers, regularizers
from keras.engine.topology import Layer
import keras.losses
from keras.backend.tensorflow_backend import set_session
from keras.engine import InputSpec
import tensorflow as tf 
from tensorflow.python.framework import graph_io
from tensorflow.python.tools import freeze_graph
from tensorflow.core.protobuf import saver_pb2
from tensorflow.python.training import saver as saver_lib

# Model metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

py.init_notebook_mode(connected=True)
color = sns.color_palette()
sns.set_style('darkgrid')
py.init_notebook_mode(connected=True)
%matplotlib inline

# Setting Random Seeds
np.random.seed(95)
RNG_SEED = 95

# Suppressing some of Tensorflow's warnings
tf.logging.set_verbosity(tf.logging.ERROR)

<a id='data'></a>
## Importing The Dataset

The dataset used is the [LibriSpeech ASR corpus](http://www.openslr.org/12/) which includes 1000 hours of recorded speech. The dataset consists of 16kHz audio files between 2-15 seconds long.
Audio files were converted to single channel (mono) WAV/WAVE files (.wav extension) with a 64k bit rate, and a 16kHz sample rate. They were encoded in PCM format, and then cut/padded to an equal length of 10 seconds.

In [3]:
train_corpus = pd.read_json('train_corpus.json', lines=True)
valid_corpus = pd.read_json('valid_corpus.json', lines=True)
test_corpus = pd.read_json('test_corpus.json', lines=True)
train_duration_mean = train_corpus.duration.mean()
valid_duration_mean = valid_corpus.duration.mean()
test_duration_mean = test_corpus.duration.mean()
print('Train Set Duration Mean:', train_duration_mean)
print('Valid Set Duration Mean:', valid_duration_mean)
print('Test Set Duration Mean:', test_duration_mean)

Train Set Duration Mean: 12.301810444600761
Valid Set Duration Mean: 6.795830092509418
Test Set Duration Mean: 6.958454892966357


### Defining some initial functions for preparing the dataset

In [5]:
# Function for sorting data by duration
def sort_dataset(audio_paths, durations, texts):
    p = np.argsort(durations).tolist()
    audio_paths = [audio_paths[i] for i in p]
    durations = [durations[i] for i in p] 
    texts = [texts[i] for i in p]
    return audio_paths, durations, texts

# Mapping each character that could be spoken at each time step
char_map_str = """
' 0
<SPACE> 1
a 2
b 3
c 4
d 5
e 6
f 7
g 8
h 9
i 10
j 11
k 12
l 13
m 14
n 15
o 16
p 17
q 18
r 19
s 20
t 21
u 22
v 23
w 24
x 25
y 26
z 27
"""
# This leaves "blank" character mapped to number 28

char_map = {}
index_map = {}
for line in char_map_str.strip().split('\n'):
    ch, index = line.split()
    char_map[ch] = int(index)
    index_map[int(index)+1] = ch
index_map[2] = ' '

# Function for calculating feature dimensions.
def calc_feat_dim(window, max_freq):
    return int(0.001 * window * max_freq) + 1

# Function for converting text to an integer sequence
def text_to_int_seq(text):
    int_sequence = []
    for c in text:
        if c == ' ':
            ch = char_map['<SPACE>']
        else:
            ch = char_map[c]
        int_sequence.append(ch)
    return int_sequence

# Function for converting an integer sequence to text
def int_seq_to_text(int_sequence):
    text = []
    for c in int_sequence:
        ch = index_map[c]
        text.append(ch)
    return text


### Defining the primary class for preparing the dataset for visualization and modeling.

This class provides options for training models on both MFCC's and Spectrograms of the data but is set to use spectrograms by default.

In [8]:
class AudioGenerator():
    def __init__(self, step=10, window=20, max_freq=8000, mfcc_dim=13,
        minibatch_size=20, desc_file=None, spectrogram=True, max_duration=10.0, 
        sort_by_duration=False):
        # Initializing variables
        self.feat_dim = calc_feat_dim(window, max_freq)
        self.mfcc_dim = mfcc_dim
        self.feats_mean = np.zeros((self.feat_dim,))
        self.feats_std = np.ones((self.feat_dim,))
        self.rng = random.Random(RNG_SEED)
        if desc_file is not None:
            self.load_metadata_from_desc_file(desc_file)
        self.step = step
        self.window = window
        self.max_freq = max_freq
        self.cur_train_index = 0
        self.cur_valid_index = 0
        self.cur_test_index = 0
        self.max_duration=max_duration
        self.minibatch_size = minibatch_size
        self.spectrogram = spectrogram
        self.sort_by_duration = sort_by_duration

    def get_batch(self, partition):
    # Obtain a batch of audio files
        if partition == 'train':
            audio_paths = self.train_audio_paths
            cur_index = self.cur_train_index
            texts = self.train_texts
        elif partition == 'valid':
            audio_paths = self.valid_audio_paths
            cur_index = self.cur_valid_index
            texts = self.valid_texts
        elif partition == 'test':
            audio_paths = self.test_audio_paths
            cur_index = self.test_valid_index
            texts = self.test_texts
        else:
            raise Exception("Invalid partition. Must be train/validation/test")

        features = [self.normalize(self.featurize(a)) for a in 
            audio_paths[cur_index:cur_index+self.minibatch_size]]

        # Calculate size
        max_length = max([features[i].shape[0] 
            for i in range(0, self.minibatch_size)])
        max_string_length = max([len(texts[cur_index+i]) 
            for i in range(0, self.minibatch_size)])
        
        # Initialize arrays
        X_data = np.zeros([self.minibatch_size, max_length, 
            self.feat_dim*self.spectrogram + self.mfcc_dim*(not self.spectrogram)])
        labels = np.ones([self.minibatch_size, max_string_length]) * 28
        input_length = np.zeros([self.minibatch_size, 1])
        label_length = np.zeros([self.minibatch_size, 1])
        
        for i in range(0, self.minibatch_size):
            # Calculate input_length
            feat = features[i]
            input_length[i] = feat.shape[0]
            X_data[i, :feat.shape[0], :] = feat

            # Calculate label_length
            label = np.array(text_to_int_seq(texts[cur_index+i])) 
            labels[i, :len(label)] = label
            label_length[i] = len(label)

        # Output arrays
        outputs = {'ctc': np.zeros([self.minibatch_size])}
        inputs = {'the_input': X_data, 
                  'the_labels': labels, 
                  'input_length': input_length, 
                  'label_length': label_length 
                 }
        return (inputs, outputs)

    def sort_dataset_by_duration(self, partition):
    # Extra shuffling
        if partition == 'train':
            self.train_audio_paths, self.train_durations, self.train_texts = sort_dataset(
                self.train_audio_paths, self.train_durations, self.train_texts)
        elif partition == 'valid':
            self.valid_audio_paths, self.valid_durations, self.valid_texts = sort_dataset(
                self.valid_audio_paths, self.valid_durations, self.valid_texts)
        else:
            raise Exception("Invalid partition. "
                "Must be train/val")

    def next_train(self):
    # Get a batch of training data
        while True:
            ret = self.get_batch('train')
            self.cur_train_index += self.minibatch_size
            if self.cur_train_index >= len(self.train_texts) - self.minibatch_size:
                self.cur_train_index = 0
            yield ret    

    def next_valid(self):
    # Get a batch of validation data
        while True:
            ret = self.get_batch('valid')
            self.cur_valid_index += self.minibatch_size
            if self.cur_valid_index >= len(self.valid_texts) - self.minibatch_size:
                self.cur_valid_index = 0
            yield ret

    def next_test(self):
    # Get a batch of testing data
        while True:
            ret = self.get_batch('test')
            self.cur_test_index += self.minibatch_size
            if self.cur_test_index >= len(self.test_texts) - self.minibatch_size:
                self.cur_test_index = 0
            yield ret
            
    # Load datasets
    def load_train_data(self, desc_file='train_corpus.json'):
        self.load_metadata_from_desc_file(desc_file, 'train')
        self.fit_train()
        if self.sort_by_duration:
            self.sort_dataset_by_duration('train')
                

    def load_validation_data(self, desc_file='valid_corpus.json'):
        self.load_metadata_from_desc_file(desc_file, 'validation')
        if self.sort_by_duration:
            self.sort_dataset_by_duration('valid')

    def load_test_data(self, desc_file='test_corpus.json'):
        self.load_metadata_from_desc_file(desc_file, 'test')
        if self.sort_by_duration:
            self.sort_dataset_by_duration('test')
            
    def load_metadata_from_desc_file(self, desc_file, partition):
    # Get metadata from json corpus
        audio_paths, durations, texts = [], [], []
        with open(desc_file) as json_line_file:
            for line_num, json_line in enumerate(json_line_file):
                try:
                    spec = json.loads(json_line)
                    if float(spec['duration']) > self.max_duration:
                        continue
                    audio_paths.append(spec['key'])
                    durations.append(float(spec['duration']))
                    texts.append(spec['text'])
                except Exception as e:
                    print('Error reading line #{}: {}'
                                .format(line_num, json_line))
        if partition == 'train':
            self.train_audio_paths = audio_paths
            self.train_durations = durations
            self.train_texts = texts
        elif partition == 'validation':
            self.valid_audio_paths = audio_paths
            self.valid_durations = durations
            self.valid_texts = texts
        elif partition == 'test':
            self.test_audio_paths = audio_paths
            self.test_durations = durations
            self.test_texts = texts
        else:
            raise Exception("Invalid partition. "
             "Must be train/validation/test")
            
    def fit_train(self, k_samples=100):
    # Estimate descriptive stats for training set based on sample of 100 instances
        k_samples = min(k_samples, len(self.train_audio_paths))
        samples = self.rng.sample(self.train_audio_paths, k_samples)
        feats = [self.featurize(s) for s in samples]
        feats = np.vstack(feats)
        self.feats_mean = np.mean(feats, axis=0)
        self.feats_std = np.std(feats, axis=0)
        
    def featurize(self, audio_clip):
    # Create features from data, either spectrogram or mfcc
        if self.spectrogram:
            return spectrogram_from_file(
                audio_clip, step=self.step, window=self.window,
                max_freq=self.max_freq)
        else:
            (rate, sig) = wav.read(audio_clip)
            return mfcc(sig, rate, numcep=self.mfcc_dim)

    def normalize(self, feature, eps=1e-14):
    # Scale the data to improve neural network performance and reduce the size of the gradients
        return (feature - self.feats_mean) / (self.feats_std + eps)

<a id='features'></a>
## Acoustic Feature Extraction/Engineering for Speech Recognition

There are 3 primary methods for extracting features for speech recognition. This includes using raw audio forms, spectrograms, and mfcc's.

In [12]:
# Defining 3 different ways of converting audio files to spectrograms

def spectrogram(samples, fft_length=256, sample_rate=2, hop_length=128):
# Create a spectrogram from audio signals
    assert not np.iscomplexobj(samples), "You shall not pass in complex numbers"
    window = np.hanning(fft_length)[:, None]
    window_norm = np.sum(window**2)  
    scale = window_norm * sample_rate
    trunc = (len(samples) - fft_length) % hop_length
    x = samples[:len(samples) - trunc]
    # Reshape to include the overlap
    nshape = (fft_length, (len(x) - fft_length) // hop_length + 1)
    nstrides = (x.strides[0], x.strides[0] * hop_length)
    x = as_strided(x, shape=nshape, strides=nstrides)
    # Window stride sanity check
    assert np.all(x[:, 1] == samples[hop_length:(hop_length + fft_length)])
    # Broadcast window, and then compute fft over columns and square mod
    x = np.fft.rfft(x * window, axis=0)
    x = np.absolute(x)**2
    # Scale 2.0 for everything except dc and fft_length/2
    x[1:-1, :] *= (2.0 / scale)
    x[(0, -1), :] /= scale
    freqs = float(sample_rate) / fft_length * np.arange(x.shape[0])
    return x, freqs

def spectrogram_from_file(filename, step=10, window=20, max_freq=None, eps=1e-14):
# Calculate log(linear spectrogram) from FFT energy
    with soundfile.SoundFile(filename) as sound_file:
        audio = sound_file.read(dtype='float32')
        sample_rate = sound_file.samplerate
        if audio.ndim >= 2:
            audio = np.mean(audio, 1)
        if max_freq is None:
            max_freq = sample_rate / 2
        if max_freq > sample_rate / 2:
            raise ValueError("max_freq can not be > than 0.5 of "
                             " sample rate")
        if step > window:
            raise ValueError("step size can not be > than window size")
        hop_length = int(0.001 * step * sample_rate)
        fft_length = int(0.001 * window * sample_rate)
        pxx, freqs = spectrogram(
            audio, fft_length=fft_length, sample_rate=sample_rate,
            hop_length=hop_length)
        ind = np.where(freqs <= max_freq)[0][-1] + 1
    return np.transpose(np.log(pxx[:ind, :] + eps))


<a id='plotting'></a>
## Visualizing The Data


- [Raw Audio](#raw)
- [Spectrograms](#spectograms)
- [Mel-Frequency Cepstral Coefficients](#mfcc)

In [14]:
def vis_train_features(index):
# Function for visualizing a single audio file based on index chosen
    # Get spectrogram
    audio_gen = AudioGenerator(spectrogram=True)
    audio_gen.load_train_data()
    vis_audio_path = audio_gen.train_audio_paths[index]
    vis_spectrogram_feature = audio_gen.normalize(audio_gen.featurize(vis_audio_path))
    # Get mfcc
    audio_gen = AudioGenerator(spectrogram=False)
    audio_gen.load_train_data()
    vis_mfcc_feature = audio_gen.normalize(audio_gen.featurize(vis_audio_path))
    # Obtain text label
    vis_text = audio_gen.train_texts[index]
    # Obtain raw audio
    sample_rate, samples = wav.read(vis_audio_path)
    # Print total number of training examples
    print('There are %d total training examples.' % len(audio_gen.train_audio_paths))
    # Return labels for plotting
    return vis_text, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path, sample_rate, samples

In [18]:
# Creating visualisations for audio file at index number 2012
vis_text, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path, sample_rate, samples, = vis_train_features(index=2012)

There are 64220 total training examples.


<a id='deeplearning'></a>
## Deep Neural Networks for Acoustic Modeling

The RNN is comprised of a combined acoustic model and language model. The acoustic model scores sequences of acoustic model labels over a time frame. 
The language model scores sequences of characters. 
A decoding graph then maps valid acoustic label sequences to the corresponding character sequences. 
Speech recognition is the process of finding the character sequence that maximizes both the language and acoustic model scores.

In [24]:
# Custom CTC loss function (discussed below)
def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

def add_ctc_loss(input_to_softmax):
    the_labels = Input(name='the_labels', shape=(None,), dtype='float32')
    input_lengths = Input(name='input_length', shape=(1,), dtype='int64')
    label_lengths = Input(name='label_length', shape=(1,), dtype='int64')
    output_lengths = Lambda(input_to_softmax.output_length)(input_lengths)
    # CTC loss is implemented in a lambda layer
    loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
        [input_to_softmax.output, the_labels, output_lengths, label_lengths])
    model = Model(
        inputs=[input_to_softmax.input, the_labels, input_lengths, label_lengths], 
        outputs=loss_out)
    return model

# Function for modifying CNN layers for sequence problems 
def cnn_output_length(input_length, filter_size, border_mode, stride,
                       dilation=1):
# Compute the length of cnn output seq after 1D convolution across time
    if input_length is None:
        return None
    assert border_mode in {'same', 'valid', 'causal'}
    dilated_filter_size = filter_size + (filter_size - 1) * (dilation - 1)
    if border_mode == 'same':
        output_length = input_length
    elif border_mode == 'valid':
        output_length = input_length - dilated_filter_size + 1
    elif border_mode == 'causal':
        output_length = input_length
    return (output_length + stride - 1) // stride

### Connectionist Temporal Classification

The loss function I am using is a Connectionist Temporal Classification (CTC), which is a special case of sequential objective functions that addresses some of the modeling burden in cross-entropy that forces the model to link every frame of input data to a label.

In [27]:
def train_model(input_to_softmax, 
                pickle_path,
                save_model_path,
                train_json='train_corpus.json',
                valid_json='valid_corpus.json',
                minibatch_size=16, # You will want to change this depending on the GPU you are training on
                spectrogram=True,
                mfcc_dim=13,
                optimizer=Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False, clipnorm=1, clipvalue=.5),
                epochs=30, # You will want to change this depending on the model you are training and data you are using
                verbose=1,
                sort_by_duration=False,
                max_duration=10.0):
    
    # Obtain batches of data
    audio_gen = AudioGenerator(minibatch_size=minibatch_size, 
        spectrogram=spectrogram, mfcc_dim=mfcc_dim, max_duration=max_duration,
        sort_by_duration=sort_by_duration)
    # Load the datasets
    audio_gen.load_train_data(train_json)
    audio_gen.load_validation_data(valid_json)  
    # Calculate steps per epoch
    num_train_examples=len(audio_gen.train_audio_paths)
    steps_per_epoch = num_train_examples//minibatch_size
    # Calculate validation steps
    num_valid_samples = len(audio_gen.valid_audio_paths) 
    validation_steps = num_valid_samples//minibatch_size    
    # Add custom CTC loss function to the nn
    model = add_ctc_loss(input_to_softmax)
    # Dummy lambda function for loss since CTC loss is implemented above
    model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=optimizer)
    # Make  initial results/ directory for saving model pickles
    if not os.path.exists('results'):
        os.makedirs('results')
    # Add callbacks
    checkpointer = ModelCheckpoint(filepath='results/'+save_model_path, verbose=0)
    terminator = callbacks.TerminateOnNaN()
    time_machiner = callbacks.History()
    logger = callbacks.CSVLogger('training.log')
    stopper = callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
    reducer = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, verbose=0, mode='auto', min_delta=0.0001, cooldown=0, min_lr=0)
    tensor_boarder = callbacks.TensorBoard(log_dir='./logs', batch_size=16,
                                          write_graph=True, write_grads=True, write_images=True,)
    # Fit/train model
    hist = model.fit_generator(generator=audio_gen.next_train(), steps_per_epoch=steps_per_epoch,
        epochs=epochs, validation_data=audio_gen.next_valid(), validation_steps=validation_steps,
        callbacks=[checkpointer, terminator, logger, time_machiner, tensor_boarder, stopper, reducer], verbose=verbose)
    # Save model loss
    with open('results/'+pickle_path, 'wb') as f:
        pickle.dump(hist.history, f)

### Adam Optimizer
The Adam optimizer was chosen as it has momentum and has been shown to work well in speech recognition. 


In [28]:
# Creating a TensorFlow session
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 1.0
set_session(tf.Session(config=config))

<a id='aggregate'></a>
## ASR Model

The ASR Keras model is a fine tuned implementation of model_5 (CNN + Deep BRNN + TDD). The final production model will consist of 1 convolutional layer, 2 GRU layers, and 1 Time Distributed Dense layer. The convolutional layer conducts feature/pattern extraction, while the RNN layers develop predictions on those features. This model won't make use of dropout or dilated convolutions as they both led to gradient explosions in tests. We have also increased the number of neurons in each layer.

Inspiration for the aggregate architecture came from Baidu's [Deep Speech 2](resources/deepspeech2.pdf) engine.

#### Training with spectrograms

In [37]:
def keras_model(input_dim, filters, activation, kernel_size, conv_stride,
    conv_border_mode, recur_layers, units, output_dim=29):
    # Input
    input_data = Input(name='the_input', shape=(None, input_dim))
    # Convolutional layer
    conv_1d = Conv1D(filters, kernel_size, 
                     strides=conv_stride, 
                     padding=conv_border_mode,
                     activation=activation,
                     name='conv1d')(input_data)
    # Batch normalization
    bn_cnn = BatchNormalization()(conv_1d)
    # Bidirectional recurrent layer
    brnn = Bidirectional(GRU(units, activation=activation, 
        return_sequences=True, name='brnn'))(bn_cnn)
    # Batch normalization 
    bn_rnn = BatchNormalization()(brnn)
    # Loop for additional layers
    for i in range(recur_layers - 1):
        name = 'brnn_' + str(i + 1)
        brnn = Bidirectional(GRU(units, activation=activation, 
        return_sequences=True, implementation=2, name=name))(bn_rnn)
        bn_rnn = BatchNormalization()(brnn)
    # TimeDistributed Dense layer
    time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
    # Softmax activation layer
    y_pred = Activation('softmax', name='softmax')(time_dense)
    # Specifying the model
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: cnn_output_length(
        x, kernel_size, conv_border_mode, conv_stride)
    print(model.summary())
    return model

In [38]:
model_8 = keras_model(input_dim=161, # 161 for Spectrogram/13 for MFCC
                      filters=256,
                      activation='relu',
                      kernel_size=11, 
                      conv_stride=2,
                      conv_border_mode='valid',
                      recur_layers=2,
                      units=256)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
conv1d (Conv1D)              (None, None, 256)         453632    
_________________________________________________________________
batch_normalization_15 (Batc (None, None, 256)         1024      
_________________________________________________________________
bidirectional_8 (Bidirection (None, None, 512)         787968    
_________________________________________________________________
batch_normalization_16 (Batc (None, None, 512)         2048      
_________________________________________________________________
bidirectional_9 (Bidirection (None, None, 512)         1181184   
_________________________________________________________________
batch_normalization_17 (Batc (None, None, 512)         2048      
__________

In [44]:
train_model(input_to_softmax=model_8, 
            pickle_path='model_8.pickle', 
            save_model_path='model_8.h5', 
            spectrogram=True) # True for Spectrogram/False for MFCC

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


#### Training with MFCC's
Let's train this model using MFCC's just to see if there is a difference in performance:

In [49]:
model_9 = keras_model(input_dim=13, # 161 for Spectrogram/13 for MFCC
                      filters=256,
                      activation='relu',
                      kernel_size=11, 
                      conv_stride=2,
                      conv_border_mode='valid',
                      recur_layers=2,
                      units=256)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
the_input (InputLayer)       (None, None, 13)          0         
_________________________________________________________________
conv1d (Conv1D)              (None, None, 256)         36864     
_________________________________________________________________
batch_normalization_18 (Batc (None, None, 256)         1024      
_________________________________________________________________
bidirectional_10 (Bidirectio (None, None, 512)         787968    
_________________________________________________________________
batch_normalization_19 (Batc (None, None, 512)         2048      
_________________________________________________________________
bidirectional_11 (Bidirectio (None, None, 512)         1181184   
_________________________________________________________________
batch_normalization_20 (Batc (None, None, 512)         2048      
__________

In [50]:
train_model(input_to_softmax=model_9, 
            pickle_path='model_9.pickle', 
            save_model_path='model_9.h5', 
            spectrogram=False) # True for Spectrogram/False for MFCC

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<a id='architecture'></a>
## Visualizing The Final Model Architecture
<a id='selection'></a>
- [Model Performance](#test)
- [Word Error Rate](#error_rate)

In [53]:
def get_predictions(index, partition, input_to_softmax, model_path):
    # Load the train and test data
    data_gen = AudioGenerator(spectrogram = spectrogram)
    data_gen.load_train_data()
    data_gen.load_validation_data()
    data_gen.load_test_data()
    # Obtain ground truth transcriptions and audio features 
    if partition == 'validation':
        transcription = data_gen.valid_texts[index]
        audio_path = data_gen.valid_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    elif partition == 'train':
        transcription = data_gen.train_texts[index]
        audio_path = data_gen.train_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    elif partition == 'test':
        transcription = data_gen.test_texts[index]
        audio_path = data_gen.test_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    else:
        raise Exception('Invalid partition!  Must be "train", "test", or "validation"')     
    # Obtain predictions
    input_to_softmax.load_weights(model_path)
    prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
    output_length = [input_to_softmax.output_length(data_point.shape[0])] 
    pred_ints = (K.eval(K.ctc_decode(
                prediction, output_length)[0][0])+1).flatten().tolist()
    # Display ground truth transcription and predicted transcripted.
    print('True transcription:\n' + '\n' + transcription)
    print('Predicted transcription:\n' + '\n' + ''.join(int_seq_to_text(pred_ints)))

#### Now, let's check the Spectrogram model trained on 460 hours of audio:

In [54]:
%time get_predictions(index=95, partition='test', input_to_softmax=model_8, model_path='./results/model_8.h5')

True transcription:

in the absence of a hypodermic syringe the remedy may be given by the rectum
Predicted transcription:

inse absens of the hapademec shaenge sevemety may be gave in vye of recttim
CPU times: user 2.29 s, sys: 89.4 ms, total: 2.37 s
Wall time: 2.48 s


In [55]:
%output.txt << time get_predictions(index=95, partition='test', input_to_softmax=model_8, model_path='./results/model_8.h5')

CPU times: user 20.29 s, sys: 89.4 ms, total: 20.37 s
Wall time: 20.48 s


<a id='error_rate'></a>
#### Word Error Rate

Word error rate is defined as (substitutions + deletions + insertions) / # of words in the ground truth transcription. 

In [57]:
def wer_calc(ref, pred):
    # Calcualte word error rate
    d = np.zeros((len(ref) + 1) * (len(pred) + 1), dtype=np.uint16)
    d = d.reshape((len(ref) + 1, len(pred) + 1))
    for i in range(len(ref) + 1):
        for j in range(len(pred) + 1):
            if i == 0:
                d[0][j] = j
            elif j == 0:
                d[i][0] = i

    for i in range(1, len(ref) + 1):
        for j in range(1, len(pred) + 1):
            if ref[i - 1] == pred[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                substitution = d[i - 1][j - 1] + 1
                insertion = d[i][j - 1] + 1
                deletion = d[i - 1][j] + 1
                d[i][j] = min(substitution, insertion, deletion)
    result = float(d[len(ref)][len(pred)]) / len(ref) * 100
    return result
    
# Function for extracting the predicted transcriptions from the audio files and calculating word error rate on them
def get_wer(partition, input_to_softmax, model_path):
    wer_list = []
    data_gen = AudioGenerator(spectrogram = spectrogram)
    data_gen.load_test_data()
    data_gen.load_validation_data()
    data_gen.load_train_data()
    if partition == 'train':
        for i in range(0, 61956):
            transcription = data_gen.train_texts[i]
            audio_path = data_gen.train_audio_paths[i]
            data_point = data_gen.normalize(data_gen.featurize(audio_path))
            input_to_softmax.load_weights(model_path)
            prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
            output_length = [input_to_softmax.output_length(data_point.shape[0])] 
            pred_ints = (K.eval(K.ctc_decode(
                         prediction, output_length)[0][0])+1).flatten().tolist()
            pred_trans = ''.join(int_seq_to_text(pred_ints))
            error_rate = wer_calc(transcription, pred_trans)
            wer_list.append(error_rate)
            if i%2000 == 0: print('Processed {}'.format(i))
            
    elif partition == 'validation':
        for i in range(0, 4277):
            transcription = data_gen.valid_texts[i]
            audio_path = data_gen.valid_audio_paths[i]
            data_point = data_gen.normalize(data_gen.featurize(audio_path))
            input_to_softmax.load_weights(model_path)
            prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
            output_length = [input_to_softmax.output_length(data_point.shape[0])] 
            pred_ints = (K.eval(K.ctc_decode(
                         prediction, output_length)[0][0])+1).flatten().tolist()
            pred_trans = ''.join(int_seq_to_text(pred_ints))
            error_rate = wer_calc(transcription, pred_trans)
            wer_list.append(error_rate)
            if i%200 == 0: print('Processed {}'.format(i))
            
    elif partition == 'test':
        for i in range(0, 4176):
            transcription = data_gen.test_texts[i]
            audio_path = data_gen.test_audio_paths[i]
            data_point = data_gen.normalize(data_gen.featurize(audio_path))
            input_to_softmax.load_weights(model_path)
            prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
            output_length = [input_to_softmax.output_length(data_point.shape[0])] 
            pred_ints = (K.eval(K.ctc_decode(
                         prediction, output_length)[0][0])+1).flatten().tolist()
            pred_trans = ''.join(int_seq_to_text(pred_ints))
            error_rate = wer_calc(transcription, pred_trans)
            wer_list.append(error_rate)
            if i%200 == 0: print('Processed {}'.format(i))

    wer_array = np.asarray(wer_list)
    return wer_array

In [59]:
# Extracting the validation word error rates
valid_wer = get_wer(partition='validation', 
                    input_to_softmax=model_9, model_path='./results/model_9.h5')
valid_wer

Processed 0
Processed 200
Processed 400
Processed 600
Processed 800
Processed 1000
Processed 1200
Processed 1400
Processed 1600
Processed 1800
Processed 2000
Processed 2200
Processed 2400
Processed 2600
Processed 2800
Processed 3000
Processed 3200
Processed 3400
Processed 3600
Processed 3800
Processed 4000
Processed 4200


array([ 9.09090909,  5.55555556,  1.63934426, ...,  4.6875    ,
       10.57692308, 12.59259259])

In [61]:
# Calculating the word error rate in the validation set
valid_wer.mean()

15.85614561251273

In [66]:
# Extracting the test word error rates
test_wer = get_wer(partition='test', 
                   input_to_softmax=model_9, model_path='./results/model_9.h5')
test_wer

Processed 0
Processed 200
Processed 400
Processed 600
Processed 800
Processed 1000
Processed 1200
Processed 1400
Processed 1600
Processed 1800
Processed 2000
Processed 2200
Processed 2400
Processed 2600
Processed 2800
Processed 3000
Processed 3200
Processed 3400
Processed 3600
Processed 3800
Processed 4000


array([30.        , 28.33333333, 18.51851852, ...,  5.50458716,
       12.96296296,  5.97014925])

In [68]:
# Calculating the word error rate in the test set
test_wer.mean()

17.5566642890054

<a id='conclusion'></a>
## Conclusion

This concludes the model construction demo. You have now trained a strong performing recurrent neural network for speech recognition, from scratch, with a word error rate of <20%. 
- Reduce the word error rate to [<10%]
