In this project we built a deep neural network that functions as part of an end-to-end automatic speech recognition (ASR) pipeline. The full pipeline is summarized in the figure below.
The pipeline accepts raw audio as input and applies a pre-processing step that converts it to one of two feature representations commonly used for ASR (spectrograms or MFCCs); in this project we also used a convolutional layer to extract features. These features are fed into an acoustic model, which returns a probability distribution over all potential transcriptions. Finally, the pipeline takes the output of the acoustic model and returns a predicted transcription.
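As a rough illustration of the pre-processing step, a spectrogram can be computed from raw audio with `scipy.signal.spectrogram`. This is a minimal sketch on a synthetic one-second tone; the window sizes and the feature extraction actually used in the pipeline may differ:

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic one-second audio clip: a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)

# Short-time spectrogram: 20 ms windows (320 samples) with 10 ms hop.
freqs, times, spec = spectrogram(audio, fs=sample_rate,
                                 nperseg=320, noverlap=160)

# Each column of `spec` is the feature vector for one time step.
print(spec.shape)  # (frequency bins, time steps)
```

The acoustic model then consumes one such feature vector per time step.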
We should be able to get better performance on both the training and validation sets by trying the following:
- Try a larger dataset.
- Try adding a language model after the acoustic model.
- Try training for more epochs (>20).
- Try a deeper neural network or a pre-trained network.
- Try another type of RNN, such as LSTM or GRU.
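As a toy illustration of the language-model idea above, candidate transcriptions from the acoustic model can be rescored by combining their acoustic scores with word-bigram log-probabilities. All sentences and probabilities here are made up for the sketch:

```python
import math

# Hypothetical n-best list from the acoustic model:
# (transcription, acoustic log-probability).
candidates = [
    ("their going home", -4.0),
    ("they're going home", -4.2),
]

# Toy bigram language model: log P(word | previous word).
bigram_logprob = {
    ("<s>", "their"): math.log(0.10),
    ("<s>", "they're"): math.log(0.05),
    ("their", "going"): math.log(0.01),
    ("they're", "going"): math.log(0.30),
    ("going", "home"): math.log(0.20),
}

def lm_score(sentence, unk_logprob=math.log(1e-4)):
    """Sum bigram log-probabilities over the sentence."""
    words = ["<s>"] + sentence.split()
    return sum(bigram_logprob.get(pair, unk_logprob)
               for pair in zip(words, words[1:]))

def rescore(candidates, lm_weight=1.0):
    """Pick the candidate with the best combined acoustic + LM score."""
    return max(candidates,
               key=lambda c: c[1] + lm_weight * lm_score(c[0]))

best, _ = rescore(candidates)
print(best)  # "they're going home"
```

Here the language model overrules the acoustic model's slight preference for the homophone "their", which is exactly the kind of error an LM stage helps with.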
This project uses the Keras framework. Follow one of the commands below to install it:

```
pip install keras
```

or, with conda:

```
conda install -c conda-forge keras
```
We used a 1D convolutional layer to extract features, with a BatchNormalization layer after each layer to speed up training and dropout layers to prevent overfitting, followed by a combination of Bidirectional + SimpleRNN layers. We chose SimpleRNN because it trains much faster than GRU or LSTM. The output of the acoustic model is passed through a softmax function to produce a probability distribution over characters at each time step, from which transcriptions are predicted.
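To illustrate how the per-time-step softmax outputs become a transcription, here is a minimal numpy sketch of greedy CTC decoding (take the best symbol at each time step, collapse repeats, drop blanks). The alphabet and the logits are made up for the example:

```python
import numpy as np

# Hypothetical alphabet; index 0 is the CTC blank symbol.
alphabet = ['-', 'c', 'a', 't']

# Made-up acoustic-model logits: one row per time step.
logits = np.array([
    [0.1, 2.0, 0.1, 0.1],   # 'c'
    [0.1, 2.0, 0.1, 0.1],   # 'c' again (repeat, collapsed)
    [2.0, 0.1, 0.1, 0.1],   # blank
    [0.1, 0.1, 2.0, 0.1],   # 'a'
    [0.1, 0.1, 0.1, 2.0],   # 't'
])

# Softmax over the alphabet at each time step.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Greedy decoding: best symbol per step, collapse repeats, remove blanks.
best_path = probs.argmax(axis=1)
decoded = []
prev = None
for idx in best_path:
    if idx != prev and idx != 0:
        decoded.append(alphabet[idx])
    prev = idx
print(''.join(decoded))  # 'cat'
```

A beam-search decoder (optionally combined with a language model) would generally give better transcriptions than this greedy pass, at higher cost.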
Feel free to take a look at the final model in `sample_models.py`.
We trained the acoustic model with the CTC loss and the SGD optimizer with a learning rate of 0.02.
```python
from keras.layers import Input, Lambda
from keras.models import Model

def add_ctc_loss(input_to_softmax):
    the_labels = Input(name='the_labels', shape=(None,), dtype='float32')
    input_lengths = Input(name='input_length', shape=(1,), dtype='int64')
    label_lengths = Input(name='label_length', shape=(1,), dtype='int64')
    output_lengths = Lambda(input_to_softmax.output_length)(input_lengths)
    # CTC loss is implemented in a lambda layer
    loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
        [input_to_softmax.output, the_labels, output_lengths, label_lengths])
    model = Model(
        inputs=[input_to_softmax.input, the_labels,
                input_lengths, label_lengths],
        outputs=loss_out)
    return model
```
- Ahmed Abd-Elbakey Ghonem - Github
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.