# The automatic speech recognition (ASR) task: from speech to text

- Recognize the __words__ from an __acoustic signal__
- __Output__ is a __written transcript__ of the spoken content -> frequently called speech-to-text (conversion)
- The transcripts produced are not necessarily segmented into sentences and need not contain proper punctuation or capitalization.

ASR is a __supervised task__: models are trained on transcribed speech corpora.

##  Challenges

Some of the most important challenges stem from the __differences between speech and writing__ and the __context-dependent nature__ of language:

- __Segmentation__: word boundaries in writing are frequently not indicated by the acoustic segmentation of speech by silences, and, vice versa, speech silences are not necessarily indicative of word boundaries;
- __Ambiguity__: differently written texts can be pronounced the same way, e.g., in English "bare" and "bear" have the same pronunciation.
- The phenomenon of __coarticulation__: neighboring speech sounds can interact and influence each other's pronunciation, e.g., the "v" in "I have to" is pronounced as "f" (in fast speech) because of the following voiceless "t".
- The so-called __Lombard effect__: one cannot augment data sets simply by adding noise, because people change the way they speak in noisy environments (and it is not just speaking louder...).
- Speech, in contrast to typical written language, can contain __ungrammatical constructs, incomplete sentences or words, corrections, word/syllable repetitions and interruptions__.
- __Speaker adaptation__: there are huge differences between how people of different gender, age, cultural background etc. pronounce words.

A related challenge: human speech understanding relies on a large amount of __contextual background information on admissible interpretations__ -- we actively "perceive"/"hear" speech using contextual clues. A dramatic example is provided by:
http://drive.google.com/uc?export=view&id=1ICNa4Hj-lU_4POjdSCk_-Zyly93SUTNK

(Audio from the [HiPhi nation podcast, S3 E9: The Illusionist](https://hiphination.org/season-3-episodes/s3-episode-9-the-illusionist-jun-8-2019/))

## Task variants

- __Continuous vs isolated speech recognition__: In the isolated case the input either consists of or can easily be segmented into single words (because there are separating bits of silence). In the continuous case words can follow each other without any silence between them, as in normal speech. Continuous speech recognition is significantly harder.
- __Joint recognition (possibly with diarization)__: Basic speech recognition assumes a single speaker. A more complex variant involves several speakers (e.g., in a dialogue) and, optionally, the transcript has to indicate who says what. Overlapping speech can be an especially difficult problem in this setting.

## Evaluation

The most important and common evaluation metric is the so-called __"word error rate" (WER)__, which is based on a word-based __edit distance__ between the speech recognizer's output and the correct transcript.

### Edit distance

Given

- a set of sequence __editing operations__ (e.g., removing an element from or inserting an element into the sequence) and
- a __weight function__ that assigns a weight to each operation,

the edit distance between two sequences, a source and a target, is the minimum total weight needed to transform the source into the target sequence.

One of the most important variants is the so-called __Levenshtein distance__, where the operations are
- __Deletion__ of a sequence element
- __Insertion__ of a sequence element
- __Substitution__ of a sequence element

and the weight of all operations is 1:

<a href="https://devopedia.org/images/article/213/5510.1567535069.svg"><img src="https://drive.google.com/uc?export=view&id=1DXFkUHeeSLsPxJ64z28UvzvyNU72yOu-" width="300"></a>

In mathematics, edit distance can be seen as a __metric in a metric space__. In other words, the problem can be interpreted __geometrically__: the dissimilarity between two words can be seen as the __geometric distance between two points in the metric space__. Such a metric obeys the triangle inequality: for a distance $d$, $d(x,y) + d(y,z) \geq d(x,z)$. Transforming $x$ into $z$ via an intermediate $y$ can never be cheaper than the direct edit.

<a href="https://devopedia.org/images/article/213/8480.1567535130.png"><img src="https://drive.google.com/uc?export=view&id=1ectonZzf_AYG9t_VuI_eh476lL5lcXEZ" width="200"></a>

(Images from: [Devopedia](https://devopedia.org/levenshtein-distance))

Edit distances can be calculated recursively, using a "dynamic programming" method. For details see, for instance, the [Devopedia entry on Levenshtein distance](https://devopedia.org/levenshtein-distance), and the discussion in [Chapter 2 of Jurafsky & Martin](https://web.stanford.edu/~jurafsky/slp3/2.pdf).
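The dynamic programming idea can be sketched in a few lines of Python. This is a minimal illustration with unit weights for all three operations, not an optimized implementation; the function name `levenshtein` is chosen here for illustration. It keeps only the previous row of the usual distance table, where each cell holds the distance between a prefix of the source and a prefix of the target.

```python
def levenshtein(source, target):
    """Levenshtein distance between two sequences (strings or lists)."""
    # previous[j]: distance between the source prefix processed so far
    # and the first j elements of the target.
    previous = list(range(len(target) + 1))
    for i, source_item in enumerate(source, start=1):
        current = [i]  # turning i source elements into an empty target takes i deletions
        for j, target_item in enumerate(target, start=1):
            substitution_cost = 0 if source_item == target_item else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Because the function only compares elements for equality, it works on lists of words as well as on strings of characters.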

#### Metric Space

A metric space is an ordered pair $(M, d)$ where $M$ is a set and $d$ is a metric on $M,$ i.e., a distance function
$$
d: M \times M \rightarrow \mathbb{R}
$$
such that for any $x, y, z \in M,$ the following hold:
1. $d(x, y)=0 \Longleftrightarrow x=y \quad$ identity of indiscernibles
2. $d(x, y)=d(y, x) \quad$ symmetry
3. $d(x, z) \leq d(x, y)+d(y, z)$ subadditivity or triangle inequality

The function d is also called distance function or simply distance. Metrics are important in the study of convergence (of series, functions) and for the solution of questions concerning approximation.

The __triangle inequality__ is based on the geometry of a triangle, where the direct path between two points is never longer than an indirect path through a third point:


<a href="https://www.onlinemathlearning.com/image-files/triangle-inequality.png"><img src="https://drive.google.com/uc?export=view&id=1gc0vN7b2FkDJzqbFsFKUi2SCQ_NPU-Hv" width="400"></a>


#### Why is Levenshtein distance a metric?
1. It is always $\geq 0$.

2. It satisfies identity: the distance is 0 exactly when no edits are required, i.e., when the two strings are equal (this holds in both directions, A->B and B->A).

3. It is symmetric (run the edits backwards to get B->A instead of A->B: insertions become deletions and vice versa).

4. It satisfies the triangle inequality. The proof is left as an exercise, but the intuition is that the edit distance is always the "shortest path" of edits between two strings A and B, so a detour through an intermediate string C (A->C->B) cannot need fewer edits than the direct path A->B.



### Word Error Rate
Using the concept of Levenshtein distance, the word error rate of a recognized word sequence $\hat W$ with respect to the correct transcript word sequence $W$ is simply defined as

$$
\frac{\mathrm{Levenshtein}(\hat W, W)}{\mathrm{length}(W)}
$$
i.e., the number of necessary (word based) editing operations to get the correct transcript, divided by the number of words in the correct transcript.
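As a small sketch of this definition, the following Python function computes the WER of a hypothesis against a reference transcript. It assumes that words are simply separated by whitespace and that the reference is not empty; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("the reference transcript must contain at least one word")
    # Row-by-row dynamic programming over words instead of characters.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            current.append(min(
                previous[j] + 1,                          # delete a hypothesis word
                current[j - 1] + 1,                       # insert a reference word
                previous[j - 1] + (hyp_word != ref_word)  # substitute (or match)
            ))
        previous = current
    return previous[-1] / len(ref)


# One substitution ("sit" -> "sat") and one missing word ("the"): 2 edits / 6 words.
print(round(word_error_rate("the cat sit on mat", "the cat sat on the mat"), 3))  # 0.333
```

Note that the WER is not bounded by 1: a hypothesis that is much longer than the reference can require more edits than the reference has words.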

## Training corpora

Speech recognition training sets consist of __recorded speech audio__ and __time-aligned written transcripts__. In the early ASR days, their preparation was very painstaking because

+ __transcripts__ were __phonetic__
+ time __alignment__ was at the __phone level__ as well: annotators tried to determine the phone boundaries by listening and looking at spectrograms:

<a href="http://www.phon.ox.ac.uk/mining_speech/mining1a-orthography.jpg"><img src="https://drive.google.com/uc?export=view&id=1jJBnnMuY_Tj1Wv6YH9Yja5P-hn_SBsvD" width="700"></a>

With the improvement of training methods, neither of these is necessary any more: __modern__ ASR __data sets__ contain transcripts in normal orthography, which have to be __time-aligned__ only at the __sentence level__.

Despite these improvements, it is still a huge amount of work to create good ASR data sets, since __usable corpus size starts at 20 hours of speech from several speakers__, both male and female. Because of the associated costs, the number of freely available corpora is low even for the most widely spoken languages, and for many languages no free data set exists at all.

### LDC data sets

For English, for a long time most public data sets were those made available by the LDC, the [Linguistic Data Consortium](https://catalog.ldc.upenn.edu). These include the

+ Wall Street Journal audio corpus (read newspaper articles, 80h, 1993)
+ Fisher corpus (telephone speech, 1600h, 2004/2005)
+ Switchboard corpus (telephone speech, 300h, 1993/1997/2000)
+ TIMIT corpus (read example sentences, limited grammatical/vocab. variability, 1986)

More recently, data sets in other languages have been added to the LDC catalog; it now contains, among others, Spanish, Mandarin and Arabic corpora.

Unfortunately, these data sets are typically __not free__: either Linguistic Data Consortium membership or payment is required for accessing most of them.

### Open data set initiatives

Most recently, two important initiatives started to work on creating and curating freely available data sets:

+ The [Open Speech and Language Resources](http://www.openslr.org/) site lists several free data sets for various languages, among them the important [LibriSpeech](http://www.openslr.org/12/) corpus, which contains ~1000h of speech from audio books.
+ [Common Voice](https://voice.mozilla.org/) is a Mozilla-initiated project to collect ASR data sets for as many languages as possible. At the time of writing, it has collected and validated almost 1000h of transcribed English speech, and other languages are progressing as well: German is at 412h and French at 282h.

# Input: processing the acoustic speech signal

When speech (or any other type of sound) is captured by a microphone, the __air pressure changes__ __move__ the microphone's __diaphragm__, and these movements are converted to changes in electric current. As a consequence, the speech is represented as a __continuous signal__ that changes over time:

<a href="http://www.seaandsailor.com/images/timedomain.png"><img src="https://drive.google.com/uc?export=view&id=16nnrMYqymjhWk4PSQCXc4s2qtYGrnSm6"></a>

This is a continuous, __analog__ signal, which can be digitized by sampling at a certain rate:

<a href="http://www.seaandsailor.com/images/timedomain_zoom.png"><img src="https://drive.google.com/uc?export=view&id=1r99XyTcWYMrWZWGw1i69YvAoL4N7RO3s"></a>

(Images source: [Speech signal representations](http://www.seaandsailor.com/initial_representation.html))

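In speech processing, the sampling rate is at least __8 kHz__: a signal sampled at rate $f_s$ can only represent frequencies up to $f_s/2$ (the Nyquist frequency), so 8 kHz is the minimum needed to cover the __100 Hz–4 kHz__ range where most of the information distinguishing phonemes is typically found.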

The digital signal is in turn typically converted to the "frequency domain" by Fourier transformation: we take short, overlapping time windows (20-40 ms is a typical length, and 10 ms is a typical step size),

<a href="https://i.stack.imgur.com/Jg5EG.png"><img src="https://drive.google.com/uc?export=view&id=1G_8aBKAZ-meHaEEq1-KbYkAoxkcOM8rG" width="500"></a>

(Image source: [Signal Processing Stack Exchange](https://dsp.stackexchange.com/questions/36509/why-is-each-window-frame-overlapping))

and calculate the discrete Fourier transform to obtain a __spectrum__ for each window:

<a href="https://upload.wikimedia.org/wikipedia/commons/7/72/Fourier_transform_time_and_frequency_domains_%28small%29.gif"><img src="https://drive.google.com/uc?export=view&id=1PbQ0vdExalgkRRRvQDsNhFdNP8LAkl8x"></a>

(Image source: [Wikipedia](https://en.wikipedia.org/wiki/Fourier_transform))

The end result is a (relatively sparse) representation of the signal with spectrum data in ~10 ms steps.
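The framing and transformation steps can be sketched in pure Python as follows. This is only an illustration under stated assumptions: a 25 ms window with a 10 ms step, a Hann window applied to each frame (a common choice that the text does not prescribe), and a direct, slow implementation of the discrete Fourier transform instead of an FFT library. The function names and the synthetic 1 kHz test tone are made up for the example.

```python
import cmath
import math


def frame_signal(samples, sample_rate, window_ms=25, step_ms=10):
    """Cut the signal into overlapping frames of window_ms, one every step_ms."""
    window = int(sample_rate * window_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, step)]


def magnitude_spectrum(frame):
    """Magnitudes of the DFT bins from 0 Hz up to half the sampling rate."""
    n = len(frame)
    # Hann window: tapers the frame edges to reduce spectral leakage.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


sample_rate = 8000
# 0.1 s of a synthetic 1 kHz sine tone as a stand-in for speech.
samples = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]

frames = frame_signal(samples, sample_rate)
spectrum = magnitude_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sample_rate / len(frames[0])
print(len(frames), peak_hz)  # 8 1000.0
```

Each frame yields one spectrum, so the sequence of spectra has one entry per 10 ms step, which is the representation described above.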

## MFCC - Mel-Frequency Cepstral Coefficients

Although certain deep learning-based speech processing approaches work simply with this representation, traditionally further transformations were done (filter bank, using a logarithmic scale etc.)  to provide a compressed representation in terms of features that are close to how humans perceive and process speech. The historically most important representation has been __MFCC__ (Mel-frequency cepstral coefficients) -- see, e.g., [Cepstrum and MFCC](https://wiki.aalto.fi/display/ITSP/Cepstrum+and+MFCC) and [Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between](https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html) for details.


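__Cepstrum__: discrete cosine transform (DCT) of the log-spectrum

__Mel-frequency__: frequencies are measured on the mel scale, a perceptual scale that is roughly linear at low frequencies and logarithmic at high ones, approximating the frequency resolution of the human ear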
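To make the mel scale more concrete, here is a short sketch using one widely used conversion formula, $m = 2595 \log_{10}(1 + f/700)$. Other variants of the mel scale exist, so the constants should be read as one common convention rather than the definition. The helper `mel_filter_centers` is a hypothetical name; it shows how filter bank center frequencies that are equally spaced in mel become increasingly spread out in Hz.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mel (one common convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(low_hz, high_hz, n_filters):
    """Center frequencies (in Hz) of filters equally spaced on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]


centers = mel_filter_centers(100, 4000, 10)
print([round(c) for c in centers])
```

The printed centers are dense at low frequencies and sparse at high ones, mirroring the finer frequency resolution of hearing in the low range.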


# The classic approach: HMM-based speech recognition

In contrast to other areas of NLP, the stochastic nature of many aspects of speech led to the application of __probabilistic methods__ from early on: from a probabilistic perspective, the speech recognition task is to __find the most probable word sequence that could underlie the speech signal__. Formally, we try to find the most probable word sequence $W=\langle w_1,\dots,w_n\rangle$ given our speech signal $S$, i.e. 

$$
\underset{W}{\operatorname{argmax}} P(W \vert S).
$$

Using Bayes' rule, $P(W \vert S) = P(S \vert W) \cdot P(W) / P(S)$, and since $P(S)$ does not depend on $W$, this can be reformulated as

$$
\underset{W}{\operatorname{argmax}} P(S \vert W) \cdot P(W).
$$

The two probabilities are typically modeled by two separate models:
- an __acoustic model__ models the _conditional_ probabilities $P(S \vert W)$ of sound signals __given__ a sequence of words, and
- a __language model__ models the probabilities $P(W)$ of word sequences for the given language.
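The division of labor between the two models can be illustrated with a toy example. The scores below are invented for illustration only and do not come from any real system. In practice the product $P(S \vert W) \cdot P(W)$ is computed as a sum of log-probabilities to avoid numerical underflow, which is what this sketch does.

```python
def best_transcript(hypotheses):
    """Pick the hypothesis maximizing log P(S|W) + log P(W)."""
    best = max(hypotheses, key=lambda h: h["log_acoustic"] + h["log_lm"])
    return best["words"]


# Hypothetical candidate transcripts with made-up log-probabilities:
# the second one fits the audio slightly better, but is a much less likely sentence.
hypotheses = [
    {"words": "recognize speech", "log_acoustic": -12.1, "log_lm": -4.0},
    {"words": "wreck a nice beach", "log_acoustic": -11.8, "log_lm": -9.5},
]
print(best_transcript(hypotheses))  # recognize speech
```

The language model score overrides the small acoustic advantage of the implausible word sequence.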

Starting from the late 1970s and until the introduction of deep learning-based methods, both of these modeling tasks were predominantly solved by Markov models. As language modeling will be discussed later, our discussion here will focus on acoustic modeling.

## Markov models
__Markov chain:__ stochastic model describing a sequence of possible events in which the __probability of each event depends only on the state attained in the previous event(s)__.

__Markov process:__ stochastic process that satisfies the __Markov property__ (sometimes characterized as "memorylessness"). Roughly speaking, a process satisfies the Markov property if one can make __predictions for the future of the process based solely on its present state__ just as well as one could knowing the process's full history, hence independently from such history (Markov Independence), that is, conditional on the present state of the system, its future and past states are independent.

Formally, this independence assumption (the so-called first-order __Markov assumption__) amounts to the hypothesis that the probability of the next state depends only on the current state and not on the earlier history:

First order: 
$P[s_{t+1}|s_t,\dots,s_2,s_1]=P[s_{t+1}|s_t]$

A Markov chain can be described by specifying
- the __set of states__
- a __transition probability matrix__ specifying the probability of a transition from $s_1$ to $s_2$ for every $\langle s_1, s_2 \rangle$ state pair
- an __initial probability distribution__ over the states which specifies for each state the probability that the chain/sequence starts with this state

Markov chains are frequently visualized as directed graphs with states represented as nodes and non-zero probability state transitions as directed edges labeled with their probability:

<a href="http://drive.google.com/uc?export=view&id=1m5fZlHGcCyaAMBl2lbfo9JlmbiDJ9GHg"><img src="https://drive.google.com/uc?export=view&id=1_Hp0SUFHV5C5tHBBPJTE8AxHbZNf13ZZ" width="300px"></a>


A state diagram for a simple example is shown in the figure above, using a directed graph to picture the state transitions. The states represent whether a hypothetical stock market is exhibiting a bull market or bear market trend during a given week. According to the figure, a bull week is followed by another bull week 90% of the time, and by a bear week 10% of the time. A bear week is followed by a bull week 50% of the time and by another bear week 50% of the time. Labelling the state space {A = bull, B = bear}, the transition matrix for this example is:

$$P=\begin{bmatrix}
0.9 & 0.1 \\
0.5 & 0.5 
\end{bmatrix}
\quad$$

The distribution over states can be written as a stochastic row vector $x$ with the relation $x^{(n+1)} = x^{(n)}P$. So if at time $n$ the distribution is $x^{(n)}$, then three time periods later, at time $n + 3$, the distribution is:

$ x^{(n+3)}=x^{(n+2)}P=\big( x^{(n+1)}P \big) P$

$=x^{(n+1)}P^2=\big(x^{(n)}P \big)P^2$

$=x^{(n)}P^3$

In particular, if at time $n$ the system is in state 2 (bear), then at time $n + 3$ the distribution is

$$x^{(n+3)}= \begin{bmatrix}
0 & 1  
\end{bmatrix}
 \begin{bmatrix}
0.9 & 0.1  \\
0.5 & 0.5
\end{bmatrix}^3
\quad$$

$$= \begin{bmatrix}
0 & 1 
\end{bmatrix}
 \begin{bmatrix}
0.844 & 0.156 \\
0.78 & 0.22
\end{bmatrix}
\quad$$

$$=\begin{bmatrix}
0.78 & 0.22
\end{bmatrix}$$
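The same three-step calculation can be checked numerically. The sketch below uses plain Python lists and the row-vector convention $x^{(n+1)} = x^{(n)}P$ with the transition matrix $P$ of the bull/bear example; the function name `step` is chosen for illustration.

```python
def step(distribution, transition):
    """One step of x(n+1) = x(n) P for a row vector x and a row-stochastic P."""
    n_states = len(transition)
    return [sum(distribution[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)]


P = [[0.9, 0.1],   # from bull: to bull, to bear
     [0.5, 0.5]]   # from bear: to bull, to bear

x = [0.0, 1.0]     # the system is in the bear state at time n
for _ in range(3):
    x = step(x, P)
print([round(p, 3) for p in x])  # [0.78, 0.22]
```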






### Steady state of a Markov model

To find the steady (stationary) state of a Markov model we need to find a probability vector $\hat{X}$ that is left unchanged by a transition step. Written as a column vector, this means

$$P^{T}\hat{X}=\lambda\hat{X}$$

with eigenvalue $\lambda = 1$: $\hat{X}$ is an eigenvector of the transposed transition matrix belonging to the eigenvalue 1, scaled so that its entries add up to one, since it has to be a probability distribution.

Let

$$\hat{X}=\begin{bmatrix}
A \\
B
\end{bmatrix}$$

so

$$\begin{bmatrix}
0.9 & 0.5  \\
0.1 & 0.5
\end{bmatrix}
\quad
\begin{bmatrix}
A \\
B
\end{bmatrix}=\begin{bmatrix}
A \\
B
\end{bmatrix}$$

which is a system of linear equations:

$$0.9A+0.5B=A$$
and
$$0.1A+0.5B=B$$

The first equation can be converted into $0.5B=0.1A$, so $A=5B$ (the second equation gives the same relation).

We also know that $A+B=1$, since the probabilities need to add up to 1, so:

$$5B+B=1$$

$$B=\frac{1}{6}$$

Hence,

$$A=\frac{5}{6}$$
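The stationary distribution of the bull/bear chain can also be found numerically by applying the transition step repeatedly until the distribution stops changing (power iteration). This is a minimal sketch that assumes the chain converges to a unique stationary distribution, which is the case for this example but not for every Markov chain; the function name and tolerance are illustrative choices.

```python
def stationary_distribution(transition, tolerance=1e-12, max_steps=10_000):
    """Iterate x <- x P from the uniform distribution until x stops changing."""
    n_states = len(transition)
    x = [1.0 / n_states] * n_states
    for _ in range(max_steps):
        following = [sum(x[i] * transition[i][j] for i in range(n_states))
                     for j in range(n_states)]
        if max(abs(a - b) for a, b in zip(x, following)) < tolerance:
            return following
        x = following
    return x


P = [[0.9, 0.1],   # from bull: to bull, to bear
     [0.5, 0.5]]   # from bear: to bull, to bear

print([round(p, 4) for p in stationary_distribution(P)])  # [0.8333, 0.1667]
```

In the long run the hypothetical market spends five out of six weeks in the bull state, i.e., the stationary distribution is $(5/6, 1/6)$.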

__Higher-order Markov models__

The first-order Markov assumption can be relaxed by assuming that state transitions depend on the __last $n$ states__ (with $n > 1$). Markov models built on this type of assumption are called higher-order Markov models. Although more flexible, these models have the disadvantage of having a much larger number of parameters, since transition probabilities have to be specified not for state pairs but for $(n+1)$-tuples of states.

First order: 
$P[s_{t+1}|s_t,\dots,s_2,s_1]=P[s_{t+1}|s_t]$

Second order:
$P[s_{t+1}|s_t,\dots,s_2,s_1]=P[s_{t+1}|s_t,s_{t-1}]$


$n$th order:
$P[s_{t+1}|s_t,\dots,s_2,s_1]=P[s_{t+1}|s_t,s_{t-1},\dots,s_{t-(n-1)}]$

### Hidden Markov models (HMMs)

The sequence models used in speech recognition are not simply Markov chains, which would assume that the sequence of _observations_ (e.g., frame sound spectra) satisfies the Markov assumption, but __hidden Markov models (HMMs)__, which add hidden or __latent variables__ to the observable ones. According to this model, the __observable__ sequence $\langle y_1,\dots, y_N \rangle$ is not directly a Markov state sequence, but is rather __generated__ or __emitted__ at every time step of a Markov process whose states $\langle x_1,\dots,x_N \rangle$ are hidden. Crucially, each generated / emitted observable state depends only on the current hidden state of the Markov model at the time step in question. Consequently, an HMM can be fully specified by specifying

- the underlying Markov model with hidden states 
- the set of possible observable states
- emission probability distributions over the observable states for each hidden state.

HMMs with a finite number of observable states can also be visualized as directed graphs:

<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/HiddenMarkovModel.svg/1280px-HiddenMarkovModel.svg.png"><img src="https://drive.google.com/uc?export=view&id=1UW5CSk3v-kl7plOvlQ-yFj1hkmHU7Lp8" width="400"></a>

(Image source: [Wikipedia](https://en.wikipedia.org/wiki/Hidden_Markov_model))

It is also common to use HMMs with continuous emission distributions -- the most frequent choices here are Gaussians or mixtures of Gaussians.

### Finding the most probable hidden state sequence (aka decoding)

The most important HMM-related inference task, at least for NLP purposes, is finding the __most probable hidden state sequence given an observation sequence__. Due to the Markov assumption, for mid-size models this can be done efficiently using the dynamic-programming based __Viterbi algorithm__: see, e.g., the discussion in [Chapter 8 of Jurafsky & Martin](https://web.stanford.edu/~jurafsky/slp3/8.pdf) for details.
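A compact sketch of the Viterbi algorithm for an HMM with a finite set of observable states is given below. The weather/activity model and all of its probabilities are toy values chosen for illustration. The sketch multiplies plain probabilities and assumes a non-empty observation sequence; practical implementations work with log-probabilities to avoid underflow on long sequences.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence and its joint probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Backtrack from the best final state using the stored predecessors.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, probability


# Toy model: hidden weather states emit observable activities.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

path, probability = viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
print(path)  # ['Sunny', 'Rainy', 'Rainy']
```

For each time step the algorithm only stores, per state, the best path ending there, which is what keeps the running time proportional to the sequence length times the square of the number of states.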

### Learning HMM parameters

If training material is available in which the hidden states are given along with the observable states, then we can give maximum likelihood estimates (MLE) of

+ the transition and initial probabilities of the underlying Markov chain, and
+ the emission probability distributions.
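When the hidden states are annotated and the observable states are discrete, these estimates reduce to counting and normalizing. The sketch below assumes the training data is a list of non-empty sequences of `(hidden_state, observation)` pairs; the function name and the tiny data set are invented for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sequences):
    """Relative-frequency estimates of initial, transition and emission probabilities."""
    initial = Counter()
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    for sequence in tagged_sequences:
        hidden = [state for state, _ in sequence]
        initial[hidden[0]] += 1
        for previous, following in zip(hidden, hidden[1:]):
            transitions[previous][following] += 1
        for state, observation in sequence:
            emissions[state][observation] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (normalize(initial),
            {state: normalize(counts) for state, counts in transitions.items()},
            {state: normalize(counts) for state, counts in emissions.items()})


# Made-up training data: hidden states A/B, observations x/y.
data = [[("A", "x"), ("A", "y"), ("B", "y")],
        [("B", "y"), ("A", "x")]]
initial, transitions, emissions = estimate_hmm(data)
print(initial)      # {'A': 0.5, 'B': 0.5}
print(transitions)  # {'A': {'A': 0.5, 'B': 0.5}, 'B': {'A': 1.0}}
```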

Unfortunately, in many settings, information about the hidden states is missing or only partially available. In these cases the [__Baum-Welch estimation algorithm__](https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm) can be used, which is an expectation maximization algorithm variant for HMMs. 

## Speech recognition with HMMs

### Acoustic model

In the HMM-based approach, the modeling of the $P(S \mid W)$ conditional probabilities for the sound realization of word sequences is based on the __decomposition__ of __speech into words__ and, further, __words into phones__ (basic units of sound).

For example, the word “bat” is composed of three phones /b/ /ae/ /t/. About 40 such phones are required for English.


The acoustic model contains HMM models for the phones of the target language. A typical choice is to model phones with 3-state HMMs with self-loops (the __3 states__ are for __phone onset, middle and ending__):

<a href="http://drive.google.com/uc?export=view&id=1AHaULWIjkhnGD1KZSJzyStRLpZE4EdQe"><img src="https://drive.google.com/uc?export=view&id=1Cngc8m5htTIkbJvszgltOnl6XoPdBoNY" width="300px"></a>

In modern (but pre-DL) systems the emission distributions are modeled by Gaussian Mixture Models (GMMs).

#### Context dependent phone models

A complication is that phones in natural languages are in fact context dependent, in the sense that the physical realization of a phone depends on the __preceding and succeeding phone__. Because of this, more sophisticated speech recognizers work with __context dependent, or "triphone" models__. Because of the large number of combinations, some __states__ are __shared__ or "tied" between similar phone models: in order to decrease the number of parameters, these states are assumed to have the same transition probabilities and/or emission distribution.

<a href="http://drive.google.com/uc?export=view&id=1ML4dOD0HBf-R4TgkiVxeN92qRlizE5Hc"><img src="https://drive.google.com/uc?export=view&id=1TSNjJWF_dPjS-yBxyk4SkJLRNfsSS6cc" width="650px"></a>

(Image from [Gales and Young: The Application of Hidden Markov Models in Speech recognition](https://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf))

#### Acoustic model training

Modern HMM-based acoustic models are trained using

- __transcribed speech samples__, in which the transcripts are very __precisely time-aligned__, and
- a __phonetic lexicon__ describing the __pronunciation(s)__ of all __words__ occurring in the transcripts.

In the early days, transcripts were phonetic and time-aligned phone by phone, but modern systems work with transcripts that are aligned only at the sentence level and whose phonetic transcription is automatically generated using the phonetic lexicon.

#### Word and word-sequence models

Since HMMs can be composed (in this case concatenated), the phone models can be combined using the phonetic lexicon to build word models, and word models in turn provide word sequence models:

<a href="http://drive.google.com/uc?export=view&id=1e2qbmDDKMUO2wNdON4jDUDifLWsFWdzL"><img src="https://drive.google.com/uc?export=view&id=1J3hr4VJ7urTl975lDyfquUWBjw7iu4Js" width="600px"></a>

(Images are from [Laureys et al: Assessing_segmentations](https://www.researchgate.net/publication/228954892_Assessing_segmentations_Two_methods_for_confidence_scoring_automatic_HMM-based_word_segmentations))

As a consequence, the HMM-based methodology can indeed provide the $P(S \mid W)$ models for word sequences (assuming that the words are in the phonetic lexicon).
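The concatenation step can be illustrated with a small sketch that builds the state graph of a word model from 3-state, left-to-right phone models with self-loops. Everything here is a simplification made for illustration: the lexicon has a single entry, the self-loop probability of 0.5 is a placeholder rather than a trained value, emission distributions are omitted, and each state name carries the phone's position so that repeated phones stay distinct.

```python
PHONE_PARTS = ("onset", "middle", "end")


def word_hmm(word, lexicon, self_loop=0.5):
    """Concatenate 3-state phone models into a left-to-right word model."""
    states = [f"{position}:{phone}/{part}"
              for position, phone in enumerate(lexicon[word])
              for part in PHONE_PARTS]
    transitions = {}
    for index, state in enumerate(states):
        # Each state either loops on itself or moves on to the next state;
        # the final state of the last phone leaves the word model.
        following = states[index + 1] if index + 1 < len(states) else "<exit>"
        transitions[state] = {state: self_loop, following: 1.0 - self_loop}
    return states, transitions


lexicon = {"bat": ["b", "ae", "t"]}  # hypothetical one-word pronunciation lexicon
states, transitions = word_hmm("bat", lexicon)
print(states[:4])  # ['0:b/onset', '0:b/middle', '0:b/end', '1:ae/onset']
```

Word sequence models can be built in the same way by connecting the exit of one word model to the first state of the next.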

### Language model and putting it all together

As traditional language models are also based on the Markov assumption (more on this later), an __HMM acoustic model__ can be __combined with a probabilistic language model__ to form a (huge) joint probabilistic model (in effect, a huge HMM), which can be used to find the most probable word sequence for speech signals, i.e., to solve the speech recognition problem.

As the combined models for continuous, large vocabulary speech recognition are really large, even standard Viterbi decoding would be too slow (the Viterbi time complexity is $O(L\cdot |S|^2)$ where $|S|$ is the number of states and $L$ is the length of the sequence to be decoded). Accordingly, special pruning and decoding methods are used, see, e.g., [ASR Decoding](https://medium.com/@jonathan_hui/speech-recognition-asr-decoding-f152aebed779) for details.

## Further reading

+ A very good, almost classic introduction to HMM-based speech recognition can be found in the [HTK book](http://www.dsic.upv.es/docs/posgrado/20/RES/materialesDocentes/alejandroViewgraphs/htkbook.pdf) accompanying the [HTK speech recognition toolkit](http://htk.eng.cam.ac.uk/).
+ A very detailed but still accessible blog series on finite state automaton-based speech recognition has (very) recently been published on Medium by Jonathan Hui -- all parts are well worth reading:
    - [Phonetics](https://medium.com/@jonathan_hui/speech-recognition-phonetics-d761ea1710c0)
    - [Feature extraction](https://medium.com/@jonathan_hui/speech-recognition-feature-extraction-mfcc-plp-5455f5a69dd9)
    - [GMM, HMM](https://medium.com/@jonathan_hui/speech-recognition-gmm-hmm-8bb5eff8b196)
    - [Acoustic, Lexicon and Language Model](https://medium.com/@jonathan_hui/speech-recognition-acoustic-lexicon-language-model-aacac0462639)
    - [ASR decoding](https://medium.com/@jonathan_hui/speech-recognition-asr-decoding-f152aebed779)
    - [ASR model training](https://medium.com/@jonathan_hui/speech-recognition-asr-model-training-90ed50d93615)

## HMM-based ASR systems

The most important HMM-based ASR systems/frameworks that are available under open source licenses are
- [CMUSphinx](https://cmusphinx.github.io/) -- a very influential, relatively user/developer friendly framework whose development has slowed down in the last few years.
- [Kaldi](http://kaldi-asr.org), a currently leading, actively developed HMM-based ASR toolkit which can produce state-of-the-art results (as it includes DNN tools as well), but is geared towards experienced ASR professionals.

# Deep learning-based speech recognition

## HMM-DNN hybrids

The first important step in the application of deep learning-based methods in ASR was the development of __HMM-DNN hybrid__ methods starting from the middle of the 1990s. In these architectures, a single but __crucial component__ of traditional HMM acoustic models is __replaced by neural networks__: the __probabilistic model connecting hidden states with observable outputs__. 

In this setting, __first__ a __traditional HMM/GMM acoustic model__ is __trained__ and then, using the resulting hidden state--frame alignments, a __DNN classifier is trained which classifies observable frames__ (or frame sequences) in terms of internal HMM states which could produce them. As usual, the classifier is probabilistic, i.e., it returns a probability distribution over the HMM hidden states:

<a href="https://www.researchgate.net/profile/Alexey_Karpov2/publication/306064220/figure/fig1/AS:585601954889729@1516629794160/Architecture-of-the-DNN-HMM-hybrid-system-1_W640.jpg"><img src="https://drive.google.com/uc?export=view&id=1HbGwZdmvbA8879ZlS9ZsEcSKGg6eJ_ot" width="350px"></a>

(Image source: [Kipyatkova, Karpov: DNN-Based_Acoustic_Modeling_for_Russian_Speech_Recognition_Using_Kaldi](https://www.researchgate.net/publication/306064220_DNN-Based_Acoustic_Modeling_for_Russian_Speech_Recognition_Using_Kaldi))

After training, the neural classifier is used for calculating the emission likelihoods instead of the original GMMs. The neural network architectures used in these hybrid systems range from simple MLPs to RNNs. From the point of view of performance they are still very competitive: in fact, hybrids currently achieve the best results on a number of data sets (e.g., on the WSJ corpus).
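One detail worth spelling out is that the classifier outputs posteriors $P(s \vert x)$ over HMM states, while the HMM needs emission likelihoods $P(x \vert s)$. A common way to bridge this in hybrid systems, stated here as an assumption since the text above does not go into it, is to divide the posterior by the state prior $P(s)$, which by Bayes' rule gives the likelihood up to a factor that is the same for all states. The state names and numbers below are made up.

```python
import math


def scaled_log_likelihoods(log_posteriors, log_priors):
    """log P(x|s) up to a state-independent constant: log P(s|x) - log P(s)."""
    return {state: log_posteriors[state] - log_priors[state]
            for state in log_posteriors}


# Made-up classifier output for one frame and made-up state priors
# (e.g., relative state frequencies in the training alignment).
posteriors = {"b_onset": 0.6, "ae_middle": 0.4}
priors = {"b_onset": 0.9, "ae_middle": 0.1}

scores = scaled_log_likelihoods(
    {state: math.log(p) for state, p in posteriors.items()},
    {state: math.log(p) for state, p in priors.items()},
)
print(max(scores, key=scores.get))  # ae_middle
```

Although `b_onset` has the higher posterior, it is so frequent a priori that the frame is relatively better evidence for `ae_middle`.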

### Further reading

See the presentation [Neural Networks for Acoustic Modelling 2:Hybrid HMM/DNN system](http://www.inf.ed.ac.uk/teaching/courses/asr/2018-19/asr08-hybrid_hmm_nn.pdf) for details and references on hybrid ASR architectures.

## End-to-end DL-based ASR systems

### Deep Speech (Baidu, 2014)

[__Deep Speech__](https://arxiv.org/pdf/1412.5567.pdf), the first significant, exclusively DL-based end-to-end ASR architecture came out in 2014. It introduced a number of radical changes compared to HMM-based ASR solutions:

- The acoustic model was an __end-to-end trained DNN__, __without__ any __finite state automaton__ components.
- __Phone-level representations__ were totally __eliminated__: the system was __trained__ simply __on written transcripts__, without requiring any sort of phonetic information (pronunciation lexicon or rules).
- Instead of hand-engineered, ASR-specific features like MFCC, the DNN's __input__ consisted simply of the __spectrum representation of the audio signal__.
- Instead of words, the __system's output__ was a __character-level transcription__ of the input.
- In order to bridge the gap between the different input and output lengths (acoustic frames vs transcript) the network was trained with [CTC (Connectionist Temporal Classification) loss](https://en.wikipedia.org/wiki/Connectionist_temporal_classification) and used CTC decoding for producing the final output.
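The way CTC maps a long sequence of per-frame labels to a shorter transcript can be shown with the simplest decoding variant, greedy (best-path) decoding: take the most probable label for every frame, merge consecutive repeats, then drop the special blank label. This is only a sketch of that collapsing rule, with `_` as an assumed blank symbol; the Deep Speech system itself used a more elaborate search that also involved a language model.

```python
from itertools import groupby

BLANK = "_"  # assumed symbol for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# One label per acoustic frame; the blank between the two "l" runs
# keeps the double letter from being merged into one.
print(ctc_collapse("hh_e_ll_lloo"))  # hello
```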

The neural architecture was surprisingly simple: it contained only 5 hidden layers, only one of which was a recurrent layer (a simple bidirectional RNN, not an LSTM variant). At each time step, the input of the network was the spectrum of the current frame together with a window of context frames around it.

<a href="https://miro.medium.com/max/1312/1*BN2rY_mP_uThoJk1i8h0uA.png"><img src="https://drive.google.com/uc?export=view&id=1hFNmV7w9ng9FyLgHq0shiTKYJdQKZ8rg" width="400"></a>

(Image from the [Deep Speech paper](https://arxiv.org/pdf/1412.5567.pdf))

Although the network achieved acceptable performance without a language model, the full system used a language model to produce the final (most probable) transcript.

__Data sets and performance__

Although Deep Speech improved on the state of the art at the time (18.4%->16% WER on the Switchboard corpus), and did this with a surprisingly simple and clean architecture, it was trained on data sets that were significantly larger than those typically used for HMM-based training: while a 200 hours long data set was considered to be more than adequate for HMM-based systems, Deep Speech's largest model was trained on "5000 hours of read speech from 9600 speakers."  

## More recent architectures

In the last few years, following the breakthrough of Deep Speech, a steady stream of end-to-end DL-based ASR systems has been developed with constantly improving results. The network architectures used broadly follow the general trends in NLP, but with strong influences from image processing. A few significant examples from the best performing systems:

__[Deep Speech 2](https://arxiv.org/pdf/1512.02595.pdf) (2015, Baidu)__

An interesting aspect is the use of both 1d and 2d convolutions (time-only and time-frequency directions). All convolutions are "same", keeping dimensionality.

<a href="https://1.bp.blogspot.com/-ZPfnZQnLKVg/VmjckBIPrkI/AAAAAAAAEZ4/mm9V1i-Xhfo/s1600/Screenshot%2Bfrom%2B2015-12-10%2B12%253A59%253A21.png"><img src="https://drive.google.com/uc?export=view&id=1r_ZYhFQnzljZkv1NTm1NeG6MZGs6OALZ" width="350px"></a>

(Figure from the linked paper)

[__Capio 2017__](https://arxiv.org/pdf/1801.00059.pdf)

Inspired by the DenseNet architecture for image processing, the Capio 2017 ASR system used blocks of densely connected LSTM layers:

<a href="https://d3i71xaburhd42.cloudfront.net/bdaacde65e68f6f617381377ba0b968013dbb1c6/3-Figure3-1.png"><img src="https://drive.google.com/uc?export=view&id=1C6aQCEH2SL8MdLOiOA5-0f4wYeCVhUEh" width="400"></a>

(Figure from the linked paper)

__[Multistream self-attention](https://arxiv.org/pdf/1910.00716.pdf) (2019)__

Somewhat predictably, following the latest trends, self-attention based architectures also appeared and perform very competitively. (We will cover self-attention in detail later.) In addition to self-attention, parallel processing pipelines with differently parametrized (e.g., dilation) convolutional layers are also interesting here:

<a href="https://storage.googleapis.com/groundai-web-prod/media%2Fusers%2Fuser_292860%2Fproject_392762%2Fimages%2Ffig1.png"><img src="https://drive.google.com/uc?export=view&id=1-vi4hwQk1oLfMVnkLn2n2yPzQ47sCeUZ" width="600"></a>

(Figure from the linked paper)

## Trade-offs and challenges

Currently the best __HMM-DNN hybrids__ and __end-to-end__, DL-based ASR systems have very __similar performance__, but there are still important trade-offs between them:
+ __End-to-end__ DL systems tend to require __larger data sets__ to reach the same performance.
+ __HMM-based hybrids__, on the other hand, are __more complex architecturally__ and require considerably more complicated training regimes.

As a consequence of the first point, the development of useful data augmentation methods is highly important; similarly, the development of transfer learning methods will be crucial.

An important area of research is the architectural integration of neural acoustic and language models (currently these are typically two independent modules), see [Cold Fusion: Training Seq2Seq Models Together with Language Models](https://arxiv.org/pdf/1708.06426.pdf) for an attempt.

## Important frameworks/implementations

In addition to the standard DNN frameworks (TensorFlow, PyTorch etc.), the [Kaldi ASR framework](http://kaldi-asr.org) is often used since it provides both neural and acoustic tools. There are some attempts to make the C++-based Kaldi more accessible to the Python DL world, see, e.g., the [PyKaldi](https://github.com/pykaldi/pykaldi) library.

Among accessible open source implementations, by far the most important is Mozilla's [project](https://github.com/mozilla/DeepSpeech) to implement (a version of) the Deep Speech architecture and provide models for it. 

# See also

A highly useful resource to follow the latest ASR research results is the web page [Wer are we?](https://github.com/syhw/wer_are_we) which tries to track the best performing ASR systems on various public corpora.