# Evolution of Speech Recognition (ASR)

---

## 1. Acoustic → Text Conversion (Signal → Symbols)

**Input:** Raw audio waveform  
**Goal:** Map continuous acoustic signals → discrete textual units (phonemes, subwords, words)

### Techniques

- **Signal processing (early era):** MFCCs, spectrograms  
- **HMM/GMM era:** Statistical mapping of frames to phonemes  
  - Hidden Markov Models:  
    $$
    P(O,S) = \pi_{s_1} \prod_{t=2}^{T} a_{s_{t-1}, s_t} \prod_{t=1}^{T} b_{s_t}(o_t)
    $$
  - Gaussian Mixture Models:  
    $$
    P(x) = \sum_{m=1}^{M} c_m \, \mathcal{N}(x \mid \mu_m, \Sigma_m)
    $$

- **Neural acoustic models:** CNNs, RNNs, GRUs, LSTMs replacing GMMs  
- **End-to-end models:**
  - **CTC:** Connectionist Temporal Classification  
    $$
    P(y \mid x) = \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^T P(\pi_t \mid x_t)
    $$
  - **Attention-based Seq2Seq:** Aligns acoustic frames with output tokens  
  - **Transformers (Conformer, Whisper):** Direct mapping from features → text without hand-crafted phoneme models  

---

## 2. Text → Understanding (NLP on Recognized Text)

**Input:** Transcribed text from ASR  
**Goal:** Extract meaning, context, and semantic structure  

### Techniques

- **Statistical Language Models (LMs):**  
  $$
  P(W) \approx \prod_{t=1}^T P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
  $$

- **Neural sequence models:** RNN, LSTM, GRU  
- **Attention & Transformers:**
  - **BERT:** Deep bidirectional contextual embeddings  
  - **GPT-family:** Autoregressive large language models  
  - **Whisper:** Encoder-decoder ASR model whose text decoder doubles as an implicit language model  

---

## Summary

- **ASR = Audio → Text**
  - Old way: HMM/GMM  
  - Modern way: Neural acoustic models + CTC / Attention / Transformers  

- **NLP = Text → Meaning**
  - Old way: n-gram statistical models  
  - Modern way: RNNs, LSTMs, Transformers (BERT, GPT, etc.)  

---

## Integration in Modern Systems

In classic pipelines, ASR and NLP were separate:  
1. **ASR:** Convert audio → text  
2. **NLP:** Process text for meaning  

In **modern end-to-end architectures** (DeepSpeech, Conformer, Whisper), acoustic and language modeling are **trained jointly** within a single neural system, blurring the line between ASR and NLP.


# Core Mathematical Foundations in Speech Recognition

---

## 1. Acoustic Modeling with Hidden Markov Models (HMMs)

Speech is modeled as a sequence of hidden states (phonemes) generating observable acoustic signals.

**Markov assumption:**

$$
P(s_t \mid s_1, s_2, \dots, s_{t-1}) \approx P(s_t \mid s_{t-1})
$$

**Observation likelihood:**

$$
P(O \mid S) = \prod_{t=1}^{T} P(o_t \mid s_t)
$$

**Total probability (joint):**

$$
P(O, S) = \pi_{s_1} \prod_{t=2}^{T} a_{s_{t-1}, s_t} \prod_{t=1}^{T} b_{s_t}(o_t)
$$

where:  
- \( \pi_{s_1} \) = initial probability of state \( s_1 \)  
- \( a_{ij} \) = transition probability from state \( i \) to state \( j \)  
- \( b_j(o_t) \) = emission probability of observation \( o_t \) in state \( j \)  

**Key algorithms:**  
- Forward algorithm: Computes the likelihood \( P(O) \) by summing over all state sequences in \( O(N^2 T) \) time for \( N \) states.  
- Viterbi algorithm (1967): Finds the most likely state sequence.  
- Baum–Welch (EM algorithm): Learns transition and emission probabilities.  
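
The forward and Viterbi recursions listed above can be sketched in a few lines. This is a minimal illustration for a discrete-observation HMM, not a production decoder: the names `pi`, `A`, `B`, `obs` and the toy numbers are assumptions made up for the example, and real systems work in log space to avoid underflow.

```python
def forward_likelihood(pi, A, B, obs):
    """Return P(O) by summing over all state paths with the forward recursion."""
    n_states = len(pi)
    # alpha[j] = P(o_1..o_t, s_t = j)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][o]
            for j in range(n_states)
        ]
    return sum(alpha)


def viterbi(pi, A, B, obs):
    """Return (best_path, P(O, best_path)): same recursion with max instead of sum."""
    n_states = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(n_states)]
    backpointers = []
    for o in obs[1:]:
        step_ptr, new_delta = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] * A[i][j])
            step_ptr.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        delta = new_delta
        backpointers.append(step_ptr)
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda j: delta[j])
    path = [last]
    for step_ptr in reversed(backpointers):
        path.append(step_ptr[path[-1]])
    path.reverse()
    return path, delta[last]


# Toy two-state model with two discrete observation symbols (made-up numbers).
pi = [0.6, 0.4]                      # initial state probabilities
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = a_ij
B = [[0.9, 0.1], [0.2, 0.8]]         # B[j][o] = b_j(o)
obs = [0, 1, 1]

likelihood = forward_likelihood(pi, A, B, obs)
best_path, best_joint = viterbi(pi, A, B, obs)
print(likelihood, best_path, best_joint)
```

Both functions cost \( O(N^2 T) \); the only difference is whether contributions from predecessor states are summed (forward) or maximized (Viterbi).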

---

## 2. Probabilistic Acoustic Features

Speech features = continuous signals → modeled by Gaussian Mixture Models (GMMs).  

**Likelihood of acoustic vector \(x\):**

$$
P(x) = \sum_{m=1}^{M} c_m \, \mathcal{N}(x \mid \mu_m, \Sigma_m)
$$

where \( \mathcal{N} \) is a Gaussian density with mean \( \mu_m \) and covariance \( \Sigma_m \).  

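Later, GMMs were replaced by neural networks that estimate state posteriors \( P(s \mid x) \); in hybrid systems these are divided by the state priors \( P(s) \) to obtain scaled likelihoods for HMM decoding.  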
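
As a rough illustration of the mixture likelihood above, the sketch below evaluates \( \log P(x) \) for a small GMM. It assumes diagonal covariances (each \( \Sigma_m \) is passed as a list of per-dimension variances), which is a simplification chosen for the example; the function names and toy parameters are likewise hypothetical.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (per-dimension variances)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_loglik(x, weights, means, variances):
    """log P(x) = log sum_m c_m N(x | mu_m, Sigma_m), computed with log-sum-exp."""
    terms = [
        math.log(c) + diag_gaussian_logpdf(x, mu, var)
        for c, mu, var in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Two equally weighted 2-D components (made-up parameters).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

loglik = gmm_loglik([0.0, 0.0], weights, means, variances)
print(loglik)
```

Working in the log domain keeps the computation stable when individual component densities are very small, which is typical for high-dimensional acoustic vectors.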

---

## 3. Bayesian Decoding (Speech as Noisy Channel)

**Central ASR decoding formula:**

$$
\hat{W} = \arg\max_W P(W \mid O) = \arg\max_W P(O \mid W) P(W)
$$

where:  
- \( W \) = word sequence  
- \( O \) = observed acoustic sequence  

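This separates the acoustic model \( P(O \mid W) \) from the language model \( P(W) \). The denominator \( P(O) \) from Bayes' rule is dropped because it does not depend on \( W \).  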
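
The decoding rule can be illustrated by rescoring a handful of candidate transcripts. This is only a sketch: the hypothesis list, the probabilities, and the `lm_weight` scaling factor are assumptions introduced for the example and are not part of the formula above (setting `lm_weight=1.0` recovers it exactly).

```python
import math


def decode(hypotheses, lm_weight=1.0):
    """Pick the hypothesis maximizing log P(O|W) + lm_weight * log P(W)."""
    return max(
        hypotheses,
        key=lambda h: h["acoustic_logp"] + lm_weight * h["lm_logp"],
    )


# Made-up scores for two acoustically similar word sequences.
hypotheses = [
    {
        "words": "recognize speech",
        "acoustic_logp": math.log(0.40),
        "lm_logp": math.log(0.010),
    },
    {
        "words": "wreck a nice beach",
        "acoustic_logp": math.log(0.45),
        "lm_logp": math.log(0.0001),
    },
]

best = decode(hypotheses)
print(best["words"])
```

With the language model switched off (`lm_weight=0.0`) the acoustically stronger but implausible hypothesis wins, which is the situation the prior \( P(W) \) is meant to correct.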

---

## 4. Language Models (LMs)

**n-gram models:**

$$
P(W) \approx \prod_{t=1}^T P(w_t \mid w_{t-1}, \dots, w_{t-n+1})
$$

Estimated from n-gram counts by Maximum Likelihood Estimation (MLE), with smoothing (e.g. add-k or Kneser–Ney) so that unseen n-grams receive nonzero probability.  
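
A bigram model (\( n = 2 \)) makes the counting and smoothing steps concrete. The sketch below uses add-k smoothing purely because it is the simplest option; the toy corpus, the `<s>` / `</s>` boundary tokens, and all function names are assumptions for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Collect history counts, bigram counts, and the predicted-word vocabulary."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])                  # counts of each history word
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # counts of (prev, word) pairs
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return unigrams, bigrams, vocab


def bigram_prob(prev, word, unigrams, bigrams, vocab, k=1.0):
    """Add-k estimate: (count(prev, word) + k) / (count(prev) + k * |V|)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * len(vocab))


def sentence_prob(sentence, unigrams, bigrams, vocab, k=1.0):
    """P(W) as a product of smoothed bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p *= bigram_prob(prev, word, unigrams, bigrams, vocab, k)
    return p


corpus = ["the cat sat", "the dog sat", "the cat ran"]
unigrams, bigrams, vocab = train_bigram(corpus)

p_seen = bigram_prob("the", "cat", unigrams, bigrams, vocab)    # observed twice
p_unseen = bigram_prob("the", "ran", unigrams, bigrams, vocab)  # never observed
print(p_seen, p_unseen, sentence_prob("the dog ran", unigrams, bigrams, vocab))
```

Setting `k=0.0` gives the unsmoothed MLE, under which any sentence containing an unseen bigram would receive probability zero.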

**Neural LMs:** (feedforward, RNNs, Transformers) approximate the same probability via embeddings + parameterized networks.  

---

## 5. Neural Network Acoustic Models

**Connectionist models (late 1980s, 1990s):**  
Feedforward NN outputs posterior over states.  

**Cross-entropy training:**

$$
L = - \sum_{t=1}^{T} \sum_i y_{t,i} \log \hat{y}_{t,i}
$$
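
For frame-level training with hard state labels, \( y_{t,i} \) is one-hot, so the double sum reduces to the negative log-probability of the target state at each frame. The sketch below assumes exactly that one-hot case and a softmax output layer; the function names and inputs are illustrative only.

```python
import math


def softmax(logits):
    """Convert raw network outputs into a probability distribution over states."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_cross_entropy(logits_per_frame, target_states):
    """L = -sum_t log y_hat[t][target_t], i.e. cross-entropy with one-hot targets."""
    loss = 0.0
    for logits, target in zip(logits_per_frame, target_states):
        loss -= math.log(softmax(logits)[target])
    return loss


# Two frames, three HMM states (made-up logits and alignment).
logits_per_frame = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
target_states = [0, 2]
print(frame_cross_entropy(logits_per_frame, target_states))
```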

**Recurrent neural networks (basic recurrence; LSTM and GRU add gating on top of it):**

$$
h_t = f(W_{xh} x_t + W_{hh} h_{t-1})
$$
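
The recurrence can be unrolled over a sequence as follows. This sketch takes \( f = \tanh \) and omits bias terms to match the equation above; both choices, along with the toy weights, are assumptions for illustration.

```python
import math


def rnn_step(x_t, h_prev, W_xh, W_hh):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1}), one row of each matrix per hidden unit."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(W_xh[i], x_t))
            + sum(w * h for w, h in zip(W_hh[i], h_prev))
        )
        for i in range(len(h_prev))
    ]


def rnn_forward(xs, W_xh, W_hh):
    """Run the recurrence over all frames, starting from a zero hidden state."""
    h = [0.0] * len(W_hh)
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh)
        states.append(h)
    return states


# One input feature, two hidden units (made-up weights).
W_xh = [[1.0], [0.0]]            # only unit 0 sees the input directly
W_hh = [[0.0, 0.0], [1.0, 0.0]]  # unit 1 reads unit 0's previous value
states = rnn_forward([[1.0], [0.0]], W_xh, W_hh)
print(states)
```

In this toy setup the second hidden unit is nonzero at the second frame only because of the first frame's input, which is the sense in which \( h_t \) carries context forward.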

**CTC (Connectionist Temporal Classification, 2006):**  

Training criterion for unaligned sequences. The loss is \( L = -\log P(y \mid x) \), with

$$
P(y \mid x) = \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x_t)
$$

where:  
- \( \pi \) = alignment paths  
- \( B \) = collapsing function (merges repeated labels, then removes blanks)  
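
To make the sum over alignments concrete, the sketch below implements the collapsing function \( B \), evaluates \( P(y \mid x) \) by brute-force enumeration of all paths, and then computes the same value with the standard forward recursion over the blank-extended label sequence. The blank index `0`, the function names, and the toy per-frame probabilities are assumptions for the example; practical implementations work in log space.

```python
import itertools
import math

BLANK = 0  # assumed index of the blank symbol


def collapse(path):
    """B: merge consecutive repeats, then drop blanks."""
    merged = [k for k, _ in itertools.groupby(path)]
    return [k for k in merged if k != BLANK]


def ctc_prob_bruteforce(probs, target):
    """Sum prod_t probs[t][pi_t] over every path pi with collapse(pi) == target."""
    n_frames, n_symbols = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_symbols), repeat=n_frames):
        if collapse(path) == list(target):
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total


def ctc_prob_forward(probs, target):
    """Same quantity via dynamic programming over the blank-extended target."""
    ext = [BLANK]
    for k in target:
        ext += [k, BLANK]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        prev = alpha
        alpha = [0.0] * len(ext)
        for s, k in enumerate(ext):
            a = prev[s]
            if s >= 1:
                a += prev[s - 1]
            # The blank between two different labels may be skipped.
            if s >= 2 and k != BLANK and k != ext[s - 2]:
                a += prev[s - 2]
            alpha[s] = a * probs[t][k]
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


# Three frames, symbols: 0 = blank, 1 = "a", 2 = "b" (made-up probabilities).
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
target = [1, 2]

p_brute = ctc_prob_bruteforce(probs, target)
p_dp = ctc_prob_forward(probs, target)
loss = -math.log(p_dp)
print(p_brute, p_dp, loss)
```

The brute-force version is exponential in the number of frames and is included only to check the recursion; the dynamic-programming version is what makes CTC training tractable.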

---

## 6. Attention and Transformers in ASR

**Seq2Seq with Attention:**

$$
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}, \quad
e_{t,i} = v^T \tanh(W_1 h_i + W_2 s_{t-1})
$$

**Context vector:**

$$
c_t = \sum_i \alpha_{t,i} h_i
$$
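
One decoder step of this additive attention can be sketched as below: scores \( e_{t,i} \) are computed for each encoder state \( h_i \), normalized with a softmax, and used to form the context vector \( c_t \). The helper `matvec`, the function name, and the toy weights are assumptions for illustration.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def additive_attention(encoder_states, s_prev, W1, W2, v):
    """Return (alpha_t, c_t) for one decoder step."""
    proj_s = matvec(W2, s_prev)
    scores = []
    for h_i in encoder_states:
        proj_h = matvec(W1, h_i)
        # e_{t,i} = v^T tanh(W1 h_i + W2 s_{t-1})
        scores.append(
            sum(v_k * math.tanh(a + b) for v_k, a, b in zip(v, proj_h, proj_s))
        )
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [
        sum(a * h_i[d] for a, h_i in zip(alphas, encoder_states))
        for d in range(dim)
    ]
    return alphas, context


# Three encoder frames with 2-D states (made-up values and weights).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

alphas, context = additive_attention(encoder_states, s_prev, W1, W2, v)
print(alphas, context)
```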

**Transformer (Vaswani 2017) self-attention:**

$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$

Used in Conformer and Whisper models for end-to-end ASR.  
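
The scaled dot-product formula can be written out directly for small matrices. This is a single-head sketch with no masking, projections, or batching, all of which are omitted here as a simplifying assumption; the function names and inputs are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(z - top) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))]
        )
    return output


# Two keys/values, two queries (made-up numbers).
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
Q = [[0.0, 0.0], [100.0, 0.0]]

out = scaled_dot_product_attention(Q, K, V)
print(out)
```

A zero query attends to all positions equally and returns the mean of the value rows, while a query strongly aligned with one key returns almost exactly that key's value.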

---

## 7. Statistical and Physics-Inspired Links

- **Ising / Hopfield nets:** Associative memory → robust recall with noisy input.  
- **Boltzmann Machines / RBMs:** Early generative pretraining (2000s, e.g. DBNs for phoneme recognition).  
- **Spin glass mathematics (Edwards–Anderson and Sherrington–Kirkpatrick models):** Framework for rugged probability landscapes, analogous to speech state decoding with HMMs.  

---

## Summary of Key Equations in Speech Recognition

**Markov chain probability:**  

$$
P(s_t \mid s_{t-1})
$$

**Joint HMM probability:**  

$$
P(O, S) = \pi_{s_1} \prod_{t=2}^{T} a_{s_{t-1}, s_t} \prod_{t=1}^{T} b_{s_t}(o_t)
$$

**GMM acoustic model:**  

$$
P(x) = \sum c_m \, \mathcal{N}(x \mid \mu_m, \Sigma_m)
$$

**Bayesian decoding (noisy channel):**  

$$
\hat{W} = \arg\max_W P(O \mid W) P(W)
$$

**n-gram LM:**  

$$
P(W) \approx \prod P(w_t \mid w_{t-n+1}^{t-1})
$$

**NN cross-entropy loss:**  

$$
L = - \sum y \log \hat{y}
$$

**CTC loss:**  

$$
L_{\text{CTC}} = -\log P(y \mid x) = -\log \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x_t)
$$

**Attention weights:**  

$$
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}
$$

**Transformer self-attention:**  

$$
\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$

---

**In other words:**  
- Early ASR = Markov + Gaussian statistics (HMM/GMM).  
- Middle era = Neural + statistical physics (RBM, DBN).  
- Modern era = Deep end-to-end models (CTC, Attention, Transformer).  


# Landmark Papers in Speech Recognition (ASR)

---

## 1. Foundations (1950s–1970s)

- **Baum, L. E. & Eagon, J. A. (1967).**  
  *An inequality with applications to statistical estimation for probabilistic functions of Markov processes.*  
  → Mathematical basis for Hidden Markov Models (HMMs).

- **Forney, G. D. (1973).**  
  *The Viterbi Algorithm.*  
  → Tutorial exposition of the Viterbi algorithm (introduced by Viterbi in 1967), crucial for decoding and alignment in HMM-based ASR.

- **Baker, J. K. (1975).**  
  *The DRAGON system — An overview.*  
  → One of the first speech recognition systems built on HMMs.

---

## 2. Statistical HMM/GMM Era (1980s–1990s)

- **Bahl, L. R., Jelinek, F., & Mercer, R. L. (1983).**  
  *A maximum likelihood approach to continuous speech recognition.*  
  → Pioneering statistical ASR with n-gram language models.

- **Jelinek, F. (1985).**  
  *Markov source modeling of text generation.*  
  → Advanced statistical language modeling for ASR.

- **Rabiner, L. R. (1989).**  
  *A tutorial on Hidden Markov Models and selected applications in speech recognition.*  
  → The canonical tutorial on HMMs in ASR.

- **Young, S., et al. (1993–1995).**  
  *The HTK Book.*  
  → Toolkit that standardized HMM-based ASR research and practice.

---

## 3. Neural Network Era (1990s–2000s)

- **Bourlard, H., & Morgan, N. (1994).**  
  *Connectionist Speech Recognition: A Hybrid Approach.*  
  → Early integration of neural networks with HMMs.

- **Bengio, Y., et al. (1994).**  
  *Learning long-term dependencies with gradient descent is difficult.*  
  → Identified vanishing gradients, motivating LSTM development.

- **Hochreiter, S., & Schmidhuber, J. (1997).**  
  *Long Short-Term Memory.*  
  → Introduced LSTM, later a cornerstone in ASR sequence modeling.

---

## 4. Deep Learning Revolution (2006–2015)

- **Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006).**  
  *Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.*  
  → Enabled end-to-end sequence training without explicit alignment.

- **Mohamed, A., Dahl, G. E., & Hinton, G. (2012).**  
  *Acoustic modeling using deep belief networks.*  
  → Introduced deep neural networks (DNNs) to ASR.

- **Graves, A., Mohamed, A., & Hinton, G. (2013).**  
  *Speech recognition with deep recurrent neural networks.*  
  → Deep bidirectional LSTMs set a then-best phoneme error rate on TIMIT, establishing RNNs as competitive acoustic models.

- **Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015).**  
  *Attention-based models for speech recognition.*  
  → Brought attention mechanisms into ASR.

---

## 5. End-to-End & Transformer Era (2016–Present)

- **Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2016).**  
  *Listen, Attend and Spell (LAS).*  
  → End-to-end attention-based ASR.

- **Amodei, D., et al. (2016).**  
  *Deep Speech 2: End-to-End Speech Recognition in English and Mandarin.*  
  → Large-scale RNN-based end-to-end ASR from Baidu.

- **Vaswani, A., et al. (2017).**  
  *Attention Is All You Need.*  
  → Introduced the Transformer; later adopted in ASR.

- **Gulati, A., et al. (2020).**  
  *Conformer: Convolution-augmented Transformer for Speech Recognition.*  
  → State-of-the-art acoustic modeling with Transformer + CNN synergy.

- **Radford, A., et al. (2023).**  
  *Robust Speech Recognition via Large-Scale Weak Supervision* (the Whisper paper).  
  → OpenAI’s multilingual, multitask Transformer-based ASR model.

---

## Conclusion

- **HMM/GMM statistical era:** Baum, Forney, Rabiner → foundations of probabilistic ASR.  
- **Neural hybrid era:** Bourlard, Morgan, Hochreiter → neural nets + HMMs, LSTM breakthroughs.  
- **Deep learning revolution:** Hinton, Graves, Mohamed → DNNs, RNNs, CTC.  
- **End-to-end era:** LAS, DeepSpeech, Conformer, Whisper → attention & Transformer architectures dominating modern ASR.


# Important Real-World Projects in Speech Recognition

---

## 1. Early Pioneering Systems (1950s–1970s)

- **Audrey (Bell Labs, 1952)**  
  Recognized digits 0–9 from a single speaker.  
  ➝ First working ASR prototype.

- **Harpy (CMU, 1976)**  
  Recognized ~1,000 words using a finite-state network.  
  ➝ Introduced **beam search decoding** → still core in ASR today.

---

## 2. Statistical HMM-Based Systems (1980s–1990s)

- **DRAGON (CMU, 1975) → Dragon Systems (founded 1982)**  
  Early HMM-based recognizer whose authors went on to found Dragon Systems.  
  ➝ Led to **DragonDictate (1990)** and **Dragon NaturallySpeaking (1997)** → commercial success.

- **IBM Tangora (1980s)**  
  Large-vocabulary (20,000-word) isolated-word dictation system.  
  ➝ Based on **Hidden Markov Models (HMMs).**

- **AT&T Voice Recognition Call Routing (1990s)**  
  Deployed in call centers for IVR.  
  ➝ Among the first **commercial speech-enabled customer service** deployments.

---

## 3. Consumer Applications Begin (2000s)

- **Nuance Dragon NaturallySpeaking (1997 → 2000s)**  
  Dictation software, widely used in law, medicine, transcription.

- **Google Voice Search (2008)**  
  Among the first **cloud-based ASR** services for consumers.  
  ➝ Demonstrated power of internet-scale statistical models.

- **Microsoft Speech API (2000s) / Cortana (2014)**  
  Integrated ASR into **Windows, Xbox, enterprise tools.**

- **Apple Siri (2011)**  
  First **mainstream voice assistant**, combining ASR + NLP.

---

## 4. Deep Learning Deployment (2010s)

- **Google Voice Search / Google Now (2012)**  
  Replaced GMMs with **deep neural networks (DNNs).**  
  ➝ Achieved ≈30% error reduction.

- **Baidu DeepSpeech (2014, 2016)**  
  End-to-end **RNN-based ASR.**  
  ➝ Deployed in Mandarin and English at scale.

- **Amazon Alexa (2014)**  
  Smart speaker ecosystem.  
  ➝ Used **LSTM acoustic models + large-scale LMs.**

- **Microsoft Skype Translator (2014)**  
  Real-time **speech-to-speech translation** with ASR + MT + TTS.

---

## 5. Transformer and End-to-End Era (2020s)

- **Google Assistant (2016 → now)**  
  Progressed from **RNN-T** to **Transformer/Conformer models.**  
  ➝ Real-time, multilingual ASR.

- **Apple Siri (updated, 2021)**  
  On-device ASR with **Transformer models.**  
  ➝ Privacy-preserving, low-latency inference.

- **Meta AI – wav2vec 2.0 (2020)**  
  Self-supervised speech representation learning, fine-tuned for ASR with little labeled data.  
  ➝ Applied to **low-resource languages.**

- **OpenAI Whisper (2022)**  
  Multilingual, multitask **Transformer-based ASR** trained on 680k hours.  
  ➝ Open-sourced, widely used for **transcription & subtitling.**

- **Amazon, Microsoft, Google Cloud Speech-to-Text APIs (2020s)**  
  Industrial-grade ASR deployed as **cloud services.**

---

## Conclusion

- **Prototypes (1950s–70s):** Audrey, Harpy.  
- **HMM Era (1980s–90s):** DRAGON, IBM Tangora, AT&T IVR.  
- **Consumer Cloud Era (2000s):** Dragon, Google Voice Search, Siri, Cortana.  
- **Deep Learning Breakthrough (2010s):** Google DNNs, Baidu DeepSpeech, Alexa.  
- **Transformer Era (2020s):** Google Assistant (Conformer), Siri (on-device Transformer), OpenAI Whisper (multilingual end-to-end).
