
### 1. **Foundations of Speech Processing**

* Basics of sound and speech signals
* Digital audio representation: sampling rate, bit depth, spectrograms
* Feature extraction techniques:

  * **MFCCs (Mel-Frequency Cepstral Coefficients)**
  * **Mel Spectrograms**
  * **Log-Mel features**
  * Chroma features

---

### 2. **Classical Speech Recognition Approaches**

* **Acoustic Models** (mapping audio features to phonemes)
* **Language Models** (predicting word sequences)
* **Hidden Markov Models (HMMs)**
* **Gaussian Mixture Models (GMMs)**

---

### 3. **Deep Learning in Speech Recognition**

* **RNNs (Recurrent Neural Networks)** for sequential data
* **LSTMs/GRUs** for long context handling
* **CTC (Connectionist Temporal Classification)** for aligning speech and text
* **Seq2Seq Models with Attention**
* **Transformers in Speech Recognition**

---

### 4. **Modern Architectures**

* **DeepSpeech (by Baidu)**
* **Wav2Vec / Wav2Vec2.0 (Facebook/Meta)**
* **Conformer models (Google)**
* **Whisper (OpenAI)**

---

### 5. **Pre-trained Models & Frameworks**

* Hugging Face Transformers for ASR (Wav2Vec2, Whisper, etc.)
* OpenAI Whisper for multilingual speech-to-text
* SpeechBrain, ESPnet, Kaldi, Fairseq

---

### 6. **Handling Real-World Challenges**

* Noise reduction and speech enhancement
* Speaker diarization (who spoke when)
* Multilingual and code-switching speech
* Domain adaptation (medical, legal, etc.)
* Low-resource languages

---

### 7. **Practical Implementation**

* Using **libraries**:

  * `speech_recognition` (Python)
  * Hugging Face `transformers` for Wav2Vec2, Whisper
  * `torchaudio`
  * Google Speech-to-Text API, Azure, AWS Transcribe
* Streaming speech recognition (real-time STT)
* Building an **end-to-end pipeline**:

  * Audio input → Feature extraction → Model inference → Text output

---

### 8. **Evaluation & Metrics**

* Word Error Rate (WER)
* Character Error Rate (CER)
* Real-time Factor (RTF)
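
The three metrics above can be sketched in plain Python. This is a minimal illustration, not a reference scorer: the function names (`edit_distance`, `wer`, `cer`, `rtf`) are hypothetical, the text is assumed to be already normalized (casing, punctuation), and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def rtf(processing_seconds, audio_seconds):
    """Real-time Factor: values below 1.0 mean faster than real time."""
    return processing_seconds / audio_seconds


print(wer("the cat sat on the mat", "the cat sat on mat"))
print(rtf(processing_seconds=2.0, audio_seconds=10.0))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, because the denominator is the reference length only.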



### 1. **Foundations of Speech Processing**


### 1. What is Sound?

* **Sound** is a vibration that propagates as a wave through a medium (like air).
* It is a **mechanical wave**, not electromagnetic, meaning it needs a medium.
* Represented as a **continuous analog signal** (waveform).

Key properties of sound:

* **Amplitude** → Loudness (higher amplitude = louder sound).
* **Frequency** (Hz) → Pitch (higher frequency = higher pitch).
* **Phase** → Position of the wave at a given time.

---

### 2. What is Speech?

* **Speech** is a special type of sound signal produced by humans to convey language.
* It is composed of **phonemes** (smallest units of sound in a language, e.g., /p/, /b/, /a/).
* Speech has a structured pattern:

  * **Prosody** (intonation, rhythm, stress).
  * **Phonetics** (sounds).
  * **Linguistics** (words, grammar, meaning).

---

### 3. Speech Signal Characteristics

* **Non-stationary**: Speech varies over time (not constant like a pure sine wave).
* **Quasi-periodic**: Contains repeating structures (like vowels), but not perfectly periodic.
* **Time-domain representation**: The raw waveform as a function of time.
* **Frequency-domain representation**: Breaks the signal into frequency components (via Fourier Transform).

Example:

* A **vowel** sound (like "a") is voiced and quasi-periodic, with clear frequency harmonics.
* A **consonant** sound (like "s") is noisy and aperiodic.
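
As a rough illustration of the two representations, the sketch below builds a synthetic "vowel-like" signal (two harmonics) and a "consonant-like" signal (white noise), then inspects both in the frequency domain with NumPy's FFT. The signals, the 200 Hz fundamental, and the helper names are assumptions made for the example, not real speech.

```python
import numpy as np


def dominant_frequencies(signal, sample_rate, k=2):
    """Return the k frequencies (Hz) with the largest spectral magnitude."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    top = np.argsort(spectrum)[-k:]
    return sorted(float(f) for f in freqs[top])


def peakiness(signal):
    """Ratio of the largest spectral magnitude to the mean magnitude."""
    spectrum = np.abs(np.fft.rfft(signal))
    return float(spectrum.max() / spectrum.mean())


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate  # one second of time stamps

# Vowel-like: a 200 Hz fundamental plus its first harmonic at 400 Hz.
vowel_like = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)

# Consonant-like: white noise with no harmonic structure.
noise_like = np.random.default_rng(0).normal(size=sample_rate)

print(dominant_frequencies(vowel_like, sample_rate))
print(peakiness(vowel_like), peakiness(noise_like))
```

The harmonic signal concentrates its energy in a few frequency bins, while the noise spreads energy across the whole spectrum, which is why its peak-to-mean ratio is far smaller.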

---

### 4. Digital Representation of Speech

Since computers can’t process continuous analog signals directly, speech must be digitized.
Steps:

1. **Sampling** → Converting continuous sound into discrete points (e.g., 16kHz = 16,000 samples/sec).
2. **Quantization** → Mapping amplitudes to numeric values (bit depth, e.g., 16-bit PCM).
3. **Feature Extraction** → Reducing raw data into meaningful features for models (like MFCCs, spectrograms).
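
The first two steps can be sketched with NumPy. The "analog" signal here is simulated by evaluating a sine function at the sample times, and the helper names `sample_sine` and `quantize` are hypothetical.

```python
import numpy as np


def sample_sine(freq_hz, duration_s, sample_rate=16000):
    """Sampling: evaluate a continuous sine at discrete, evenly spaced times."""
    n = int(round(duration_s * sample_rate))
    t = np.arange(n) / sample_rate
    return np.sin(2 * np.pi * freq_hz * t)  # amplitudes in [-1, 1]


def quantize(analog, bit_depth=16):
    """Quantization: map amplitudes in [-1, 1] to signed integers."""
    max_int = 2 ** (bit_depth - 1) - 1  # 32767 for 16-bit PCM
    # int32 is used for simplicity; 16-bit values would also fit in int16.
    return np.round(np.asarray(analog) * max_int).astype(np.int32)


analog = sample_sine(440, duration_s=0.01)  # 10 ms at 16 kHz -> 160 samples
pcm = quantize(analog, bit_depth=16)
print(len(pcm), pcm.min(), pcm.max())
```

Quantization is lossy: after rounding, each sample can differ from its original amplitude by up to half a quantization step.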

---

### 5. Why Is Speech Processing Hard?

* **Noise**: Background sounds interfere with speech.
* **Variability**: Different accents, speeds, and emotions.
* **Co-articulation**: Sounds overlap when speaking naturally.
* **Context**: Words depend on surrounding words.

---

👉 In **Speech-to-Text systems**, the first step is to capture the sound signal and convert it into a digital representation that captures the essential features of speech while removing irrelevant noise.



# Digital audio representation: sampling rate, bit depth, spectrograms



### 1. **Sampling Rate**

* **Definition**: The number of samples (measurements of amplitude) taken per second from a continuous audio signal to convert it into digital form.
* **Measured in**: Hertz (Hz), or samples per second.
* **Example**:

  * 8,000 Hz → Telephone quality (captures up to 4 kHz, sufficient for human speech intelligibility).
  * 16,000 Hz (16 kHz) → Common in speech recognition systems, balances quality and efficiency.
  * 44,100 Hz (44.1 kHz) → CD-quality audio, used for music.
* **Concept**: According to the **Nyquist theorem**, the sampling rate must be more than twice the highest frequency present to capture a signal without aliasing. Most of the information needed for speech recognition lies below \~8 kHz, so 16 kHz is usually enough.
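
A small experiment makes the Nyquist limit concrete. The sketch below samples a pure tone and reports the strongest frequency found in the sampled data; `apparent_frequency` is a hypothetical helper, and the tone frequencies are chosen only for illustration.

```python
import numpy as np


def apparent_frequency(tone_hz, sample_rate, duration_s=1.0):
    """Sample a sine tone and return the strongest frequency in the samples."""
    n = int(round(duration_s * sample_rate))
    t = np.arange(n) / sample_rate
    samples = np.sin(2 * np.pi * tone_hz * t)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])


# 5 kHz is above the 4 kHz Nyquist limit of an 8 kHz sampling rate,
# so it shows up at a different (aliased) frequency.
print(apparent_frequency(5000, sample_rate=8000))

# At 16 kHz the Nyquist limit is 8 kHz, so the same tone is captured correctly.
print(apparent_frequency(5000, sample_rate=16000))
```

When the tone exceeds half the sampling rate, it folds back to `sample_rate - tone_hz`, which is why real recording chains apply a low-pass (anti-aliasing) filter before sampling.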

---

### 2. **Bit Depth**

* **Definition**: The number of bits used to represent each audio sample (how precisely we record amplitude).
* **Determines**: The **dynamic range** (difference between the quietest and loudest sounds).
* **Examples**:

  * 8-bit: Very low quality, noisy.
  * 16-bit: CD quality, standard for most speech data.
  * 24-bit or 32-bit: High precision, used in studios.
* **Formula**: Dynamic range ≈ 6.02 × Bit Depth (in dB).

  * 16-bit → \~96 dB dynamic range.
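
The 6.02 dB-per-bit rule follows from the ratio between the largest and smallest representable amplitude steps, 20 · log10(2^bits). A minimal check, with `dynamic_range_db` as a hypothetical helper name:

```python
import math


def dynamic_range_db(bit_depth):
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bit_depth)."""
    return 20 * math.log10(2 ** bit_depth)


for bits in (8, 16, 24):
    print(bits, round(dynamic_range_db(bits), 1))
```

Each extra bit doubles the number of amplitude levels, which adds 20 · log10(2) ≈ 6.02 dB.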

---

### 3. **Spectrograms**

* **Definition**: A 2D visual representation of sound that shows how frequencies in a signal change over time.
* **Axes**:

  * X-axis → Time.
  * Y-axis → Frequency.
  * Color/Intensity → Amplitude (loudness of frequency components).
* **Types**:

  * **Linear Spectrogram** → Frequencies shown linearly.
  * **Log-Mel Spectrogram** → Uses the **Mel scale**, which better matches human perception of pitch.
* **Why important in speech-to-text**:
  Many speech recognition models (such as DeepSpeech and Whisper) don’t process raw waveforms directly. Instead, they work with **spectrograms or Mel-spectrograms**, which expose the time-frequency structure of speech explicitly. Some models, such as wav2vec 2.0, instead learn their features directly from the raw waveform.
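
A linear power spectrogram can be sketched with a short-time Fourier transform in NumPy. The 25 ms frame and 10 ms hop are common choices but are assumptions here, as is the function name `spectrogram`; libraries such as `torchaudio` provide tuned implementations.

```python
import numpy as np


def spectrogram(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Power spectrogram with shape (n_frames, n_frequency_bins)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    # Trailing samples that do not fill a whole frame are dropped.
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)  # one second of a 1 kHz tone

spec = spectrogram(tone, sample_rate)
print(spec.shape)  # rows are time frames, columns are frequency bins
```

Plots usually show the transpose of this array so that time runs along the x-axis and frequency along the y-axis.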

---

✅ **Summary**

* **Sampling Rate** defines how many slices per second of sound you capture.
* **Bit Depth** defines how detailed each slice is.
* **Spectrograms** transform raw audio into a time–frequency representation, making speech features easier for ML models to process.



### **MFCCs (Mel-Frequency Cepstral Coefficients)**



### 1. Why MFCCs are used

* Raw audio samples are high-dimensional and noisy, which makes them hard for classical speech recognition models to use directly.
* Human hearing is not linear — we are more sensitive to some frequencies than others.
* MFCCs extract the most relevant features of speech (timbre, tone, phonetic structure) while reducing noise and redundancy.

---

### 2. The process of extracting MFCCs

MFCC feature extraction involves multiple transformations:

#### a) **Pre-emphasis**

* Speech signal is passed through a filter to boost high frequencies.
* This balances the energy between low and high frequencies.

Equation:

$$
y(t) = x(t) - \alpha \cdot x(t-1)
$$

where $\alpha \approx 0.95$.
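
The filter above is a one-line operation in NumPy. In this sketch the first sample is passed through unchanged because it has no predecessor, which is one common convention; `pre_emphasis` is a hypothetical helper name.

```python
import numpy as np


def pre_emphasis(x, alpha=0.95):
    """Apply y[t] = x[t] - alpha * x[t-1]; the first sample is kept as-is."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])


# A constant (zero-frequency) signal is almost removed ...
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
# ... while a rapidly alternating (high-frequency) signal is amplified.
print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))
```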

---

#### b) **Framing**

* The audio signal is divided into short overlapping frames (e.g., 20–40 ms).
* Speech is non-stationary, but in small frames, it can be considered stationary.
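
Framing can be sketched with NumPy index arithmetic. The 25 ms frame length and 10 ms hop are assumed defaults (the hop controls the overlap), and `frame_signal` is a hypothetical name; the signal is assumed to be at least one frame long.

```python
import numpy as np


def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping frames, shape (n_frames, frame_len)."""
    signal = np.asarray(signal)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    # Trailing samples that do not fill a whole frame are dropped.
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds the sample indices i*hop_len ... i*hop_len + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]


frames = frame_signal(np.arange(16000), sample_rate=16000)
print(frames.shape)
```

With these settings each frame holds 400 samples and consecutive frames start 160 samples apart, so neighbouring frames share 240 samples.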

---

#### c) **Windowing**

* Each frame is multiplied by a window function (like **Hamming window**) to reduce discontinuities at edges.
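
The Hamming window is a raised cosine that tapers each frame toward its edges. The sketch below writes out the formula and applies it to a batch of frames; `hamming_window` and `apply_window` are hypothetical names, and NumPy's built-in `np.hamming` computes the same window.

```python
import numpy as np


def hamming_window(n):
    """w[k] = 0.54 - 0.46 * cos(2 * pi * k / (n - 1)) for k = 0 .. n-1."""
    k = np.arange(n)
    return 0.54 - 0.46 * np.cos(2 * np.pi * k / (n - 1))


def apply_window(frames):
    """Multiply every frame (one per row) by the Hamming window."""
    frames = np.asarray(frames, dtype=float)
    return frames * hamming_window(frames.shape[-1])


windowed = apply_window(np.ones((3, 400)))
print(windowed[0, 0], windowed[0, 200])  # edge value vs. near-centre value
```

The edges are scaled down to 0.08 rather than to zero, which reduces the spectral leakage caused by cutting the signal abruptly at frame boundaries.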

---

#### d) **Fast Fourier Transform (FFT)**

* Converts each windowed frame from the time domain into the frequency domain.
* Squaring the magnitude of the FFT output gives the **power spectrum** of the frame.
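
A minimal sketch of this step, assuming frames are stored one per row and zero-padded to a 512-point FFT. The scaling by `n_fft` is one common convention among several, and `power_spectrum` is a hypothetical name.

```python
import numpy as np


def power_spectrum(frames, n_fft=512):
    """Power spectrum per frame, shape (..., n_fft // 2 + 1)."""
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))
    return (magnitude ** 2) / n_fft


sample_rate = 16000
t = np.arange(400) / sample_rate                 # one 25 ms frame
frame = np.sin(2 * np.pi * 1000 * t) * np.hamming(400)

spec = power_spectrum(frame)
peak_hz = np.argmax(spec) * sample_rate / 512    # convert bin index to Hz
print(spec.shape, peak_hz)
```

Only the non-negative frequencies are kept (`rfft`), because the spectrum of a real-valued signal is symmetric.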

---

#### e) **Mel Filter Bank**

* The human ear perceives pitch on the **Mel scale**, which is roughly linear below 1 kHz and logarithmic above it.
* Apply triangular filters spaced along the Mel scale to emphasize frequencies important for speech.

Mel scale formula:

$$
M(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)
$$
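
One way to build the triangular filters is to space their edges evenly on the Mel scale, convert them back to Hz, and snap them to FFT bins. The sketch below assumes 26 filters, a 512-point FFT, and 16 kHz audio; these values and the function names are illustrative, and toolkits differ in details such as normalization.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filter_bank(n_filters=26, n_fft=512, sample_rate=16000):
    """Triangular filters, shape (n_filters, n_fft // 2 + 1)."""
    # n_filters + 2 edge points: each filter uses (left, center, right).
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):    # rising slope
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):   # falling slope, peak of 1 at center
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank


fbank = mel_filter_bank()
print(fbank.shape)
```

Given a power spectrum `power` of shape `(n_frames, 257)`, the filter bank energies would then be `power @ fbank.T`, with one value per filter and frame.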

---

#### f) **Logarithm**

* Logarithm is applied to the filter bank energies.
* Mimics human perception of loudness (which is roughly logarithmic).

---

#### g) **Discrete Cosine Transform (DCT)**

* DCT is applied to decorrelate features and compress information.
* Result: A set of coefficients (usually 12–13 coefficients per frame) = **MFCCs**.
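
Steps f) and g) can be sketched together: take the log of the filter bank energies, then apply a DCT-II and keep the first 13 coefficients. The DCT here is unnormalized (libraries use different scaling conventions), the small constant added before the log is an assumption to avoid `log(0)`, and the function names are hypothetical.

```python
import numpy as np


def dct_ii(x, n_coeffs=13):
    """Unnormalized DCT-II along the last axis, keeping n_coeffs outputs."""
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))  # (n_coeffs, n)
    return x @ basis.T


def mfcc_from_filterbank(energies, n_coeffs=13, eps=1e-10):
    """Log-compress filter bank energies, then decorrelate with the DCT."""
    log_energies = np.log(np.asarray(energies, dtype=float) + eps)
    return dct_ii(log_energies, n_coeffs)


# Placeholder energies: 98 frames x 26 filters of random positive values.
energies = np.random.default_rng(0).uniform(0.1, 1.0, size=(98, 26))
mfcc = mfcc_from_filterbank(energies)
print(mfcc.shape)
```

Coefficient 0 is proportional to the average log energy of the frame; some pipelines drop it or replace it with a separate energy term.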

---

### 3. MFCC Representation

* Each frame of speech is represented as a vector of MFCCs.
* Adding **delta (Δ)** and **delta-delta (ΔΔ)** coefficients (time derivatives) captures speech dynamics.
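
Delta coefficients are often computed with a regression over neighbouring frames, d[t] = Σ n · (c[t+n] − c[t−n]) / (2 Σ n²). The sketch below assumes a window of ±2 frames and repeats the edge frames as padding; both choices, and the name `delta`, are assumptions for illustration.

```python
import numpy as np


def delta(features, width=2):
    """Regression-based time derivative of features, shape (n_frames, n_dims)."""
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    # Repeat the first and last frames so every frame has 2*width neighbours.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        ahead = padded[width + n: width + n + n_frames]
        behind = padded[width - n: width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom


# Placeholder "MFCCs": 10 frames x 3 coefficients rising by 1 per frame.
mfcc = np.arange(10, dtype=float)[:, None] * np.ones((1, 3))
d1 = delta(mfcc)            # delta
d2 = delta(d1)              # delta-delta
full = np.hstack([mfcc, d1, d2])
print(full.shape)
```

With 13 MFCCs per frame, stacking the static, delta, and delta-delta coefficients this way gives the familiar 39-dimensional feature vector.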

---

### 4. Why MFCCs are powerful

* They reduce dimensionality while preserving phonetic information.
* Used in **speech recognition, speaker identification, music classification, emotion detection**.
* They mimic the **human auditory system** better than raw spectrum features.

---

👉 In short: MFCCs take speech → break into frames → apply Fourier + Mel scaling + log + DCT → produce compact numerical features that describe speech content effectively.



### **Mel Spectrograms**

### Mel Spectrograms in Speech Processing

A **mel spectrogram** is a way to represent audio signals, especially speech, in a format that aligns more closely with how humans perceive sound. It is widely used in **speech recognition, speaker identification, and speech synthesis**.

---

### 1. Spectrogram Recap

* A **spectrogram** shows how the frequency content of a signal changes over time.
* It is generated by applying the **Short-Time Fourier Transform (STFT)** to split audio into small time frames, then analyzing frequency components.
* Axes:

  * X-axis → time
  * Y-axis → frequency
  * Color intensity → amplitude (strength of frequency component)

---

### 2. The Mel Scale

* Human ears **do not perceive frequencies linearly**.

  * Below 1 kHz → humans are sensitive to small frequency changes.
  * Above 1 kHz → we perceive changes in a compressed, logarithmic way.
* To mimic this, the **mel scale** is used:

  * Formula:

    $$
    m = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)
    $$

    where:

    * $m$ = frequency in mels
    * $f$ = frequency in Hz

This transformation compresses higher frequencies and expands lower ones to match human perception.
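
The compression is easy to see numerically. The sketch below implements the formula and its inverse with the standard library and compares how many mels a 100 Hz step covers at low and high frequencies; the function names are hypothetical.

```python
import math


def hz_to_mel(f):
    """m = 2595 * log10(1 + f / 700)"""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel: f = 700 * (10 ** (m / 2595) - 1)"""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


low_step = hz_to_mel(200) - hz_to_mel(100)     # 100 Hz step at low frequency
high_step = hz_to_mel(4100) - hz_to_mel(4000)  # 100 Hz step at high frequency
print(round(low_step, 1), round(high_step, 1))
```

The same 100 Hz difference spans far fewer mels around 4 kHz than around 100 Hz, so Mel-spaced filters end up narrow at low frequencies and wide at high ones.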

---

### 3. Mel Spectrogram Construction

Steps to convert audio into a **mel spectrogram**:

1. **Preprocessing**: Normalize the audio amplitude.
2. **STFT**: Break audio into short overlapping frames, apply a window to each frame, and compute its Fourier transform.
3. **Power Spectrum**: Square the magnitudes to obtain the power spectrum.
4. **Apply Mel Filter Banks**:

   * Filter banks are triangular filters spaced on the mel scale.
   * They smooth out frequencies into perceptual bands.
5. **Log Transformation**: Take log to approximate human loudness perception.

Result → a **2D image**:

* Time (x-axis)
* Mel frequency bins (y-axis)
* Color intensity = log power
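
The five steps can be combined into one short NumPy function. This is a sketch under stated assumptions: 16 kHz audio, 25 ms frames with a 10 ms hop, a 512-point FFT, 40 mel bands, peak normalization, and a natural log. Libraries such as `torchaudio` offer tuned implementations with different defaults.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def log_mel_spectrogram(signal, sample_rate=16000, frame_len=400,
                        hop_len=160, n_fft=512, n_mels=40):
    """Log-mel spectrogram with shape (n_frames, n_mels)."""
    signal = np.asarray(signal, dtype=float)
    # 1. Preprocessing: peak-normalize the amplitude.
    signal = signal / (np.max(np.abs(signal)) + 1e-10)
    # 2. STFT: overlapping frames, Hamming window, FFT.
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len: i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    # 3. Power spectrum.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # 4. Triangular filters with edges evenly spaced on the mel scale.
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                                  n_mels + 2))
    rising = (fft_freqs[None, :] - edges[:-2, None]) / (edges[1:-1, None] - edges[:-2, None])
    falling = (edges[2:, None] - fft_freqs[None, :]) / (edges[2:, None] - edges[1:-1, None])
    fbank = np.maximum(0.0, np.minimum(rising, falling))
    mel_energies = power @ fbank.T
    # 5. Log transformation (small constant avoids log(0)).
    return np.log(mel_energies + 1e-10)


t = np.arange(16000) / 16000
low_tone = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
high_tone = log_mel_spectrogram(np.sin(2 * np.pi * 3000 * t))
print(low_tone.shape, low_tone[0].argmax(), high_tone[0].argmax())
```

The higher tone peaks in a higher mel band, and the result is the time-by-mel array that is typically transposed and rendered as the 2D image described above.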

---

### 4. Why Use Mel Spectrograms?

* Aligns with **human auditory perception** (important for speech tasks).
* Reduces dimensionality compared to raw spectrograms.
* Provides a **compact, robust representation** for machine learning models.
* Used as input for **deep learning models** like CNNs and RNNs.

---

### 5. Applications

* **Speech Recognition** (e.g., Google Speech-to-Text, Siri, Alexa).
* **Speaker Recognition** (voice authentication).
* **Emotion Detection from Speech**.
* **Music Analysis** (genre classification, instrument detection).
* **Text-to-Speech (TTS)** systems (mel spectrograms → vocoder → waveform).

