
Deep Neural Network Speech Recognition

In this project we built a deep neural network that functions as part of an end-to-end automatic speech recognition (ASR) pipeline. The full pipeline is summarized in the figure below.


DNN Architecture


Description

The pipeline accepts raw audio as input and applies a pre-processing step that converts the raw audio into one of two feature representations commonly used for ASR: spectrograms or MFCCs. In this project we used a convolutional layer to extract features. These features are then fed into an acoustic model, which accepts audio features as input and returns a probability distribution over all potential transcriptions. Finally, the pipeline takes the output of the acoustic model and returns a predicted transcription.
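The pre-processing step above can be sketched as follows. This is an illustrative example (not the project's own code) that builds a log-spectrogram from a synthetic audio clip using `scipy`; the sample rate and window parameters are assumptions for demonstration:

```python
import numpy as np
from scipy import signal

# Synthetic 1-second audio clip at 16 kHz, standing in for raw input audio
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)  # a 440 Hz tone

# Spectrogram: one of the two common ASR feature representations
freqs, times, spec = signal.spectrogram(audio, fs=sr, nperseg=256, noverlap=128)

# Log-compress the magnitudes, as is typical before feeding an acoustic model
log_spec = np.log(spec + 1e-10)
print(log_spec.shape)  # (frequency bins, time frames)
```

The resulting 2D array (frequency bins by time frames) is the kind of input the acoustic model consumes, one feature vector per time step.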


What To Improve

We should be able to get better performance on both the training and validation sets.

Methods to decrease the error:
  • Train on a larger dataset.
  • Add a language model after the acoustic model.
  • Train for more epochs (>20).
  • Use a deeper neural network or a pre-trained network.
  • Use another type of RNN, such as LSTM or GRU.
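As a sketch of the last suggestion, here is a minimal (hypothetical) acoustic model that uses bidirectional GRU layers in place of SimpleRNN. The layer sizes, the 161-dimensional spectrogram input, and the 29-character output alphabet are assumptions for illustration, not the project's actual configuration:

```python
from keras.models import Sequential
from keras.layers import Bidirectional, GRU, TimeDistributed, Dense, Activation

# Illustrative model: a bidirectional GRU over spectrogram frames,
# followed by a per-time-step softmax over the character alphabet
model = Sequential([
    Bidirectional(GRU(200, return_sequences=True),
                  input_shape=(None, 161)),  # (time steps, spectrogram features)
    TimeDistributed(Dense(29)),              # e.g. 26 letters + space + apostrophe + blank
    Activation('softmax'),
])
model.summary()
```

GRUs typically train more slowly per step than SimpleRNNs but capture longer-range dependencies, which tends to lower transcription error.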

Prerequisites

This project uses the Keras framework. Follow the commands below to install it:

Install Keras using pip:

    pip install keras

Install Keras using conda:

    conda install -c conda-forge keras

Network Architecture

We used a 1D convolutional layer to extract features, added a BatchNormalization layer after each layer to speed up the learning process, and added dropout layers to prevent the model from overfitting. We then used a combination of Bidirectional + SimpleRNN layers; we chose SimpleRNNs because they train much faster than GRU and LSTM.
The output of the acoustic model is connected to a softmax function to predict the probability of each transcription.
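To make the softmax step concrete, here is a minimal sketch with made-up per-frame character scores; the values are illustrative only:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical raw scores for three candidate characters at one time step
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs.sum())  # probabilities sum to 1
```

Applied at every time step, this turns the network's raw outputs into a probability distribution over characters, which the CTC loss and decoder then consume.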

Feel free to take a look at the final model in sample_models.py.


Optimizer and Loss Function

We trained the acoustic model with the CTC loss and the SGD optimizer with a learning rate of 0.02.

from keras.models import Model
from keras.layers import Input, Lambda

def add_ctc_loss(input_to_softmax):

    # Placeholder inputs for the ground-truth labels and for the
    # (unpadded) lengths of each input sequence and label sequence
    the_labels    = Input(name='the_labels',
                          shape=(None,), dtype='float32')
    input_lengths = Input(name='input_length',
                          shape=(1,), dtype='int64')
    label_lengths = Input(name='label_length',
                          shape=(1,), dtype='int64')

    # Map input lengths to output lengths, since the convolutional
    # layer changes the number of time steps
    output_lengths = Lambda(input_to_softmax.output_length)(input_lengths)

    # CTC loss is implemented in a lambda layer
    # (ctc_lambda_func is defined elsewhere in the project)
    loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
        [input_to_softmax.output, the_labels, output_lengths, label_lengths])

    # The wrapped model takes the features, labels, and lengths as
    # inputs, and outputs the CTC loss itself
    model = Model(
        inputs=[input_to_softmax.input, the_labels,
                input_lengths, label_lengths],
        outputs=loss_out)

    return model
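At inference time, the acoustic model's per-frame softmax output still has to be turned into a transcription. A common baseline is greedy CTC decoding: take the most likely symbol at each frame, collapse consecutive repeats, and drop the blank. The sketch below is illustrative (the alphabet, blank index, and probability values are assumptions, not the project's code):

```python
import itertools
import numpy as np

# Assumed toy alphabet; the last index is reserved for the CTC blank
alphabet = ['a', 'b', 'c']
blank = 3

# Fake per-frame probabilities (time steps x symbols), as a softmax would emit
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],  # 'a'
    [0.7, 0.1, 0.1, 0.1],  # 'a' (repeat, will be collapsed)
    [0.1, 0.1, 0.1, 0.7],  # blank
    [0.1, 0.7, 0.1, 0.1],  # 'b'
])

best = probs.argmax(axis=1)                          # most likely symbol per frame
collapsed = [k for k, _ in itertools.groupby(best)]  # merge consecutive repeats
decoded = ''.join(alphabet[k] for k in collapsed if k != blank)
print(decoded)  # 'ab'
```

Beam-search decoding, optionally combined with a language model (one of the improvements suggested above), usually yields better transcriptions than this greedy approach.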

Authors

  • Ahmed Abd-Elbakey Ghonem - Github

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.