# AUDIO PREPROCESSING FOR SPEECH RECOGNITION

## GLOSSARY

- **Preprocessing**: Manipulating raw audio data before feeding it to a speech recognition system
- **Resampling**: Converting audio from one sample rate to another
- **Normalization**: Adjusting the volume of audio to a standard level
- **Noise Reduction**: Removing unwanted background noise from audio
- **VAD (Voice Activity Detection)**: Detecting which parts of an audio stream contain speech
- **SNR (Signal-to-Noise Ratio)**: The ratio of speech signal power to background noise power
- **Filtering**: Removing specific frequency components from audio
- **Gain**: The amount by which audio is amplified
- **dB (Decibel)**: A unit used to measure sound level and signal strength
- **FFT (Fast Fourier Transform)**: An algorithm that converts time-domain signals to frequency-domain representation
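
The SNR and dB entries above combine into a single formula: the ratio of signal power to noise power, expressed in decibels. The following is a minimal sketch using NumPy, assuming the speech and the noise are available as separate arrays (which is rarely true for real recordings). The `snr_db` helper and the synthetic tone are illustrative, not part of any library.

```python
import numpy as np

def snr_db(signal, noise):
    """Return the signal-to-noise ratio in decibels from two sample arrays."""
    signal_power = np.mean(np.square(signal))  # mean power of the wanted signal
    noise_power = np.mean(np.square(noise))    # mean power of the background noise
    return 10.0 * np.log10(signal_power / noise_power)

# Synthetic example: a 440 Hz tone standing in for speech, plus weaker white noise
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
rng = np.random.default_rng(0)
noise = 0.05 * rng.standard_normal(sr)

print(f"SNR: {snr_db(tone, noise):.1f} dB")
```

Power ratios use `10 * log10`, while amplitude ratios use `20 * log10`; both describe the same level in dB.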

## CONCEPT INTERACTIONS

- **Building on PyAudio**: We'll use PyAudio to capture raw audio, then preprocess it before passing to Vosk
- **Building on Vosk Basics**: The preprocessing techniques will improve the accuracy of Vosk's speech recognition
- **Looking Forward**: After learning preprocessing, we'll integrate these techniques with Vosk for better speech-to-text conversion

## MAIN CONTENT

### Why Preprocess Audio?

Speech recognition systems like Vosk perform best with clean, well-formatted audio. In real-world environments, however, audio recordings often contain:

1. **Background noise**: air conditioners, traffic, other people talking
2. **Volume issues**: too quiet or too loud
3. **Varying sample rates**: different recording devices use different rates
4. **Irrelevant audio**: silence or non-speech segments

Preprocessing addresses these issues, improving recognition accuracy by 20-40% in noisy environments.

### Key Preprocessing Techniques

In this module, we'll focus on five essential preprocessing techniques:

1. **Resampling**: Converting audio to a standard sample rate
2. **Normalization**: Adjusting volume to a consistent level
3. **Noise reduction**: Removing background noise
4. **Voice Activity Detection (VAD)**: Identifying speech segments
5. **Filtering**: Enhancing speech frequencies

Let's explore each technique in detail.

### 1. Resampling

**What it does**: Converts audio from one sample rate (e.g., 44.1kHz) to another (e.g., 16kHz).

**Why it matters**: Vosk and most speech recognition models expect a specific sample rate, usually 16kHz. Using a different rate can dramatically reduce accuracy.

Here's how to resample audio using librosa:

**Flowchart: Resampling Audio**

```
[Start]
   |
   v
[Load audio file and get original sample rate]
   |
   v
[Is original sample rate == target?] --Yes--> [Return audio as is]
   |
  No
   |
   v
[Resample audio to target sample rate]
   |
   v
[Return resampled audio]
```

**Key Functions Used in This Section**

- `librosa.load`: Loads an audio file as a floating point time series and returns both the audio data and its sample rate.
- `librosa.resample`: Changes the sample rate of audio data to a new target sample rate.
- `soundfile.write` (`sf.write`): Saves audio data to a file in various formats (e.g., WAV).
- `numpy.max`, `numpy.abs`: Used to find the maximum amplitude in the audio data.
- `matplotlib.pyplot`: Used for plotting audio waveforms (not directly used in this code block, but imported for visualization).

In [None]:
# Import necessary libraries for audio processing and visualization
import librosa              # For loading and resampling audio
import soundfile as sf      # For saving audio files
import numpy as np          # For numerical operations
import matplotlib.pyplot as plt  # For plotting waveforms

def resample_audio(audio_path, target_sr=16000):
    """
    Resample audio to target sample rate.
    Parameters:
        audio_path: Path to audio file
        target_sr: Target sample rate (default: 16kHz)
    Returns:
        resampled_audio: Audio data at new sample rate
        target_sr: Target sample rate
    """
    # Load audio file with its original sample rate (sr=None disables automatic resampling)
    audio, original_sr = librosa.load(audio_path, sr=None)
    print(f"Loaded audio with original sample rate: {original_sr} Hz")

    # If the original sample rate is different from the target, resample
    if original_sr != target_sr:
        # Resample audio to the target sample rate
        resampled_audio = librosa.resample(audio, orig_sr=original_sr, target_sr=target_sr)
        print(f"Resampled from {original_sr} Hz to {target_sr} Hz")
    else:
        # If already at target sample rate, no need to resample
        resampled_audio = audio
        print(f"Audio is already at {target_sr} Hz")

    return resampled_audio, target_sr

# Example usage (uncomment to use):
# audio, sr = resample_audio("path_to_audio.wav")
# sf.write("resampled_audio.wav", audio, sr)  # Save the resampled audio to a file

#### Breakdown: Resampling Audio

- The function checks if the audio needs resampling and only processes it if necessary.
- Using `librosa.load` with `sr=None` ensures you get the file's true sample rate.

**Amplitude, Sample Rate, and Their Role in Resampling**

- **Amplitude**: Represents the loudness of the audio at each sample point. Resampling keeps the overall amplitude of the waveform, but recomputes the sample values at the new, evenly spaced time points.
- **Sample Rate**: The number of samples per second (Hz). Higher sample rates capture higher frequencies (up to half the sample rate), but may not match what a model expects.
- **Why Resample?**: Speech recognition models are trained on a specific sample rate (often 16kHz). Mismatched rates can cause the model to misinterpret the timing and frequency content of speech.
- **Relation**: Resampling changes the number of data points per second, but preserves the overall shape and amplitude of the waveform as much as possible.
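
The relationship between sample count, duration, and sample rate can be checked with a little arithmetic. This is a sketch only: `resampled_length` and `nyquist_hz` are hypothetical helpers, and the exact output length of a real resampler may differ by a sample depending on its rounding.

```python
import math

def resampled_length(n_samples, orig_sr, target_sr):
    """Approximate number of samples after resampling (duration is unchanged)."""
    return math.ceil(n_samples * target_sr / orig_sr)

def nyquist_hz(sr):
    """Highest frequency that a given sample rate can represent."""
    return sr / 2

n_in = 3 * 44100                      # three seconds recorded at 44.1 kHz
n_out = resampled_length(n_in, 44100, 16000)

print(f"Input:  {n_in} samples = {n_in / 44100:.1f} s, Nyquist {nyquist_hz(44100):.0f} Hz")
print(f"Output: {n_out} samples = {n_out / 16000:.1f} s, Nyquist {nyquist_hz(16000):.0f} Hz")
```

The duration stays at three seconds, but the representable bandwidth drops from 22050 Hz to 8000 Hz, which is why downsampling discards high-frequency content.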

### 2. Normalization

**What it does**: Adjusts the volume of the audio to a standard level.

**Why it matters**: Audio that's too quiet may not be recognized, while audio that's too loud may be distorted.

There are several normalization techniques:

1. **Peak normalization**: Scales audio so the loudest peak reaches a target level
2. **RMS normalization**: Scales audio to reach a target average loudness
3. **Dynamic range compression**: Reduces the difference between loud and quiet parts

Let's implement peak normalization:

**Flowchart: Normalizing Audio**

```
[Start]
   |
   v
[Find maximum amplitude of audio]
   |
   v
[Calculate current dB level]
   |
   v
[Calculate gain needed to reach target dB]
   |
   v
[Apply gain to audio]
   |
   v
[Check if audio exceeds [-1, 1]]
   |
   v
[If so, scale down to fit range]
   |
   v
[Return normalized audio]
```

**Key Functions Used in This Section**

- `numpy.max`, `numpy.abs`: Used to find the maximum absolute amplitude in the audio array.
- `numpy.log10`: Used to convert amplitude to decibels (dB).
- Standard Python arithmetic for gain calculation and scaling.

In [None]:
def normalize_audio(audio, target_dB=-3):
    """
    Normalize audio using peak normalization.
    Parameters:
        audio: Audio data (numpy array, float samples in [-1, 1])
        target_dB: Target peak level in dB (0 = max possible level)
    Returns:
        normalized_audio: Normalized audio data
    """
    silence_dB = -80  # Level reported for an all-zero (silent) signal

    # Find the maximum absolute amplitude in the audio (peak value)
    max_amplitude = np.max(np.abs(audio))
    print(f"Maximum amplitude before normalization: {max_amplitude}")

    # Silent audio has no peak to scale, so return it unchanged
    if max_amplitude == 0:
        print(f"Audio is silent (reported as {silence_dB} dB); nothing to normalize")
        return audio

    # Convert the peak amplitude to decibels (dB)
    current_dB = 20 * np.log10(max_amplitude)
    print(f"Current peak dB: {current_dB:.2f}")

    # Calculate the gain needed to reach the target dB,
    # then convert that gain from dB to a linear factor
    gain_dB = target_dB - current_dB
    gain_amp = 10 ** (gain_dB / 20)
    print(f"Applying gain (linear): {gain_amp}")

    # Apply the gain to the audio
    normalized_audio = audio * gain_amp

    # Ensure the audio does not exceed the range [-1, 1] (to avoid clipping distortion)
    peak = np.max(np.abs(normalized_audio))
    if peak > 1.0:
        normalized_audio = normalized_audio / peak
        print("Audio would have clipped; scaled down to fit [-1, 1]")

    print(f"Normalized audio from {current_dB:.2f}dB to {target_dB:.2f}dB")
    return normalized_audio

# Example usage:
# normalized = normalize_audio(audio)
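
RMS normalization, the second technique listed earlier in this section, targets average level rather than the loudest peak. The following is a minimal sketch, assuming float audio in [-1, 1]; the `rms_normalize` name and the -20 dB default are illustrative choices, not values taken from Vosk or any library.

```python
import numpy as np

def rms_normalize(audio, target_dB=-20.0):
    """Scale audio so its RMS level matches target_dB relative to full scale."""
    rms = np.sqrt(np.mean(np.square(audio)))
    if rms == 0:
        return audio.copy()                  # pure silence: nothing to scale
    target_rms = 10 ** (target_dB / 20)      # dB -> linear amplitude
    scaled = audio * (target_rms / rms)
    # Matching the average level can push peaks past full scale, so limit them
    peak = np.max(np.abs(scaled))
    if peak > 1.0:
        scaled = scaled / peak
    return scaled

# A very quiet 220 Hz tone, brought up to an RMS level of -20 dB
sr = 16000
t = np.arange(sr) / sr
quiet = 0.01 * np.sin(2 * np.pi * 220 * t)
louder = rms_normalize(quiet)
print(f"RMS before: {np.sqrt(np.mean(quiet ** 2)):.4f}, after: {np.sqrt(np.mean(louder ** 2)):.4f}")
```

Compared with peak normalization, this is less sensitive to a single loud click, at the cost of needing the peak check at the end.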

#### Breakdown: Normalizing Audio

- This function scales the audio so that its loudest sample sits at the target peak level, expressed in dB.
- **Amplitude** is a measure of how "strong" or "loud" a signal is at any point in time. In digital audio, amplitude values are typically floating-point numbers between -1.0 and 1.0, where 0 is silence, positive values are above the center line, and negative values are below.
- **Absolute amplitude** means we ignore the sign (positive or negative) and just look at the size of the value. For example, both -0.8 and 0.8 have an absolute amplitude of 0.8. This is important because loudness is about how far the signal moves from zero, not the direction.
- **Maximum absolute amplitude** is the largest distance from zero in the entire audio signal. This is used for "peak normalization" because it tells us the loudest moment in the audio.
- **Decibels (dB)** are a logarithmic way to express amplitude. The formula `20 * log10(amplitude)` converts a linear amplitude (like 0.5) to dB. This is useful because our ears perceive loudness logarithmically, and dB makes it easier to compare very large and very small values.
- **Relation**: The higher the amplitude, the higher the dB value. If amplitude is 1.0, that's 0 dB (the maximum possible in digital audio). Lower amplitudes have negative dB values (e.g., 0.5 amplitude ≈ -6 dB).
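
The two conversions described above can be tried directly. This is a small sketch with hypothetical helper names (`amplitude_to_db`, `db_to_amplitude`) that works through the gain needed to raise a peak of 0.1 to -3 dB.

```python
import math

def amplitude_to_db(amplitude):
    """Convert a linear amplitude (0 < amplitude <= 1) to dB relative to full scale."""
    return 20 * math.log10(amplitude)

def db_to_amplitude(db):
    """Convert a level or gain in dB back to a linear factor."""
    return 10 ** (db / 20)

for amplitude in (1.0, 0.5, 0.1, 0.01):
    print(f"amplitude {amplitude:<5} -> {amplitude_to_db(amplitude):6.1f} dB")

# Gain needed to move a peak of 0.1 up to a target of -3 dB
current_db = amplitude_to_db(0.1)
gain_db = -3 - current_db
gain = db_to_amplitude(gain_db)
print(f"gain: {gain_db:.1f} dB = x{gain:.2f}, new peak: {0.1 * gain:.3f}")
```

Because dB values are logarithmic, a gain is found by subtracting levels and applied by multiplying amplitudes.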

### 3. Noise Reduction

**What it does**: Reduces background noise while preserving speech content.

**Why it matters**: Background noise can significantly interfere with speech recognition.

We'll use the `noisereduce` library, which employs spectral gating: a technique that estimates which frequency components are likely noise and attenuates them.

**What is a Noise Clip? What is a Noise Reference?**

- **Noise Clip**: A short segment of audio that contains only background noise, with no speech. This can be recorded before or after the main speech, or extracted from a silent part of the recording. It serves as an example of what the "noise" sounds like.
- **Noise Reference**: The actual data (from a noise clip) used by noise reduction algorithms to learn the characteristics of the noise. By providing a noise reference, the algorithm can more accurately remove similar noise from the rest of the audio.
- **Why Use Them?**: If you provide a noise clip as a reference, noise reduction works better because it knows exactly what to filter out. If you don't provide one, the algorithm tries to estimate noise from the audio itself, which may be less accurate.
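
One common way to obtain a noise clip is to take the first fraction of a second of a recording, before the speaker starts. The following is a minimal sketch; `leading_noise_clip` is a hypothetical helper, and the assumption that the opening half second contains no speech must be checked for each recording.

```python
import numpy as np

def leading_noise_clip(audio, sr, seconds=0.5):
    """Return the first `seconds` of audio, assumed to hold only background noise."""
    n = min(len(audio), int(sr * seconds))
    return audio[:n]

# Synthetic recording: one second of noise only, then one second of noise plus a tone
sr = 16000
rng = np.random.default_rng(1)
background = 0.02 * rng.standard_normal(2 * sr)
t = np.arange(sr) / sr
recording = background.copy()
recording[sr:] += 0.5 * np.sin(2 * np.pi * 300 * t)

noise_clip = leading_noise_clip(recording, sr)
print(f"Noise clip: {len(noise_clip)} samples, RMS {np.sqrt(np.mean(noise_clip ** 2)):.4f}")
```

The resulting array could be passed as the `noise_clip` argument of the `reduce_noise` function defined in this section.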

**Flowchart: Noise Reduction**

```
[Start]
   |
   v
[Is a noise clip provided?] --Yes--> [Use noise clip as reference]
   |                                   |
  No                                   v
   |                             [Reduce noise using reference]
   v                                   |
[Estimate noise from audio itself]     |
   |                                   v
   +------------------------------> [Return cleaned audio]
```

**Key Functions Used in This Section**

- `noisereduce.reduce_noise(y, sr, y_noise=None, stationary=False)`: Reduces noise in the signal `y` at sample rate `sr` (defaults shown as in noisereduce 2.x and later).  
  - `y`: The noisy audio (numpy array).  
  - `sr`: Sample rate (int).  
  - `y_noise`: Optional noise reference (numpy array), used by the stationary algorithm.  
  - `stationary`: If `True`, assumes noise is constant; if `False`, adapts to changing noise.
- `matplotlib.pyplot.plot(x, y)`: Plots `y` (amplitude) versus `x` (time or sample index).
  - `x`: X-axis data (e.g., time array).
  - `y`: Y-axis data (audio amplitude).
- `numpy.linspace(start, stop, num)`: Returns `num` evenly spaced values from `start` to `stop`.
  - `start`: Start value.
  - `stop`: End value.
  - `num`: Number of samples to generate.

In [None]:
import noisereduce as nr

def reduce_noise(numpaudio, sr, noise_clip=None, stationary=True):
    """
    Reduce noise in audio. 
    Parameters:
        numpaudio: Audio data (numpy array)
        sr: Sample rate
        noise_clip: Reference noise clip (if available)
        stationary: Whether noise is consistent (True) or changes over time (False)
    Returns:
        cleaned_audio: Audio with reduced noise
    """
    if noise_clip is not None:
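        # Use the provided noise clip as a reference for what is "noise".
        # In noisereduce 2.x and later the reference clip is only used by the
        # stationary algorithm, so stationary mode is requested explicitly.
        reduced = nr.reduce_noise(y=numpaudio, sr=sr, y_noise=noise_clip, stationary=True)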
        print("Noise reduction applied using reference noise sample")
    else:
        # Estimate noise from the audio itself (good for stationary noise)
        reduced = nr.reduce_noise(y=numpaudio, sr=sr, stationary=stationary)
        print("Noise reduction applied using estimated noise profile")
    return reduced

# Example usage:
# reduced_audio = reduce_noise(audio, sr)

def plot_waveforms(original, processed, sr, title="Noise Reduction"):
    """
    Plot original and processed audio waveforms for comparison.
    """
    plt.figure(figsize=(12, 6))
    # Plot original audio
    plt.subplot(2, 1, 1)
    plt.plot(np.linspace(0, len(original)/sr, len(original)), original)
    plt.title("Original Audio")
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    # Plot processed audio
    plt.subplot(2, 1, 2)
    plt.plot(np.linspace(0, len(processed)/sr, len(processed)), processed, color='orange')
    plt.title("Processed Audio")
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.tight_layout()
    plt.suptitle(title, fontsize=16)
    plt.subplots_adjust(top=0.9)
    plt.show()

# Example visualization:
# plot_waveforms(audio, reduced_audio, sr, "Before vs. After Noise Reduction")

#### Breakdown: Noise Reduction

- Spectral gating works by identifying frequency regions that are likely noise and reducing their volume.

**Amplitude, Noise, and Their Role in Noise Reduction**

- **Amplitude**: Represents both speech and noise in the audio signal. Noise reduction aims to lower the amplitude of unwanted noise without affecting speech.
- **Noise**: Unwanted, often random, background sounds that overlap with speech. Noise can be stationary (constant, like a fan) or non-stationary (changing, like traffic).
- **Spectral Gating**: Analyzes the frequency content of the audio and reduces the amplitude of frequencies identified as noise.
- **Relation**: By reducing the amplitude of noise frequencies, the speech signal stands out more clearly, improving recognition accuracy.
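
To make the idea of spectral gating concrete, here is a toy version written with NumPy only. It is a sketch for illustration and not how `noisereduce` is implemented: it uses non-overlapping frames, a hard on/off mask, and drops any trailing partial frame. The function name, the frame size of 512, and the threshold of 1.5 standard deviations are all assumptions.

```python
import numpy as np

def simple_spectral_gate(audio, noise_clip, frame_size=512, n_std=1.5):
    """Toy spectral gate: zero the frequency bins that stay below a noise threshold."""
    # Per-bin noise statistics, measured on the noise-only clip
    n_noise_frames = len(noise_clip) // frame_size
    noise_frames = noise_clip[:n_noise_frames * frame_size].reshape(n_noise_frames, frame_size)
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1))
    threshold = noise_mag.mean(axis=0) + n_std * noise_mag.std(axis=0)

    # Transform each complete frame of the input to the frequency domain
    n_frames = len(audio) // frame_size
    frames = audio[:n_frames * frame_size].reshape(n_frames, frame_size)
    spectrum = np.fft.rfft(frames, axis=1)

    # Keep only the bins that are louder than the noise threshold
    mask = np.abs(spectrum) > threshold
    gated = np.fft.irfft(spectrum * mask, n=frame_size, axis=1)
    return gated.reshape(-1)

# Synthetic test: a clean tone, the same tone with noise, and a separate noise clip
sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(0)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.05 * rng.standard_normal(sr)
noise_clip = 0.05 * rng.standard_normal(sr // 2)

cleaned = simple_spectral_gate(noisy, noise_clip)
n = len(cleaned)
err_before = np.mean((noisy[:n] - clean[:n]) ** 2)
err_after = np.mean((cleaned - clean[:n]) ** 2)
print(f"Mean squared error vs. clean tone: before {err_before:.5f}, after {err_after:.5f}")
```

Production implementations typically use overlapping windowed frames and smooth the mask over time and frequency, which avoids the audible artifacts that a hard gate like this one can introduce.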

### 4. Voice Activity Detection (VAD)

**What it does**: Identifies segments of audio that contain speech (vs. silence or noise).

**Why it matters**: Processing only speech segments improves recognition speed and accuracy.

We'll use the `webrtcvad` library, which implements Google's WebRTC Voice Activity Detector:

**Flowchart: Voice Activity Detection**

```
[Start]
   |
   v
[Check sample rate is valid]
   |
   v
[Normalize audio to [-1, 1]]
   |
   v
[Convert audio to int16]
   |
   v
[Split audio into frames]
   |
   v
[For each frame:]
   |
   v
[Check if frame contains speech]
   |
   v
[Collect speech frames as True/False]
   |
   v
[Return speech frames list]
```

**Key Functions Used in This Section**

- `webrtcvad.Vad`: Creates a Voice Activity Detector object.
- `Vad.is_speech`: Checks if a given audio frame contains speech.
- `struct.pack`: Converts a list of int16 samples into bytes for VAD processing.
- `numpy.max`, `numpy.abs`, `numpy.zeros_like`: Used for normalization and creating silent arrays.

In [None]:
import webrtcvad
import struct
import numpy as np

def detect_speech(audio, sr, frame_duration=30, aggressiveness=3):
    """
    Detect speech segments using WebRTC VAD.
    Parameters:
        audio: Audio data (mono, numpy array)
        sr: Sample rate (must be 8000, 16000, 32000, or 48000 Hz)
        frame_duration: Frame duration in ms (10, 20, or 30)
        aggressiveness: VAD aggressiveness (0-3, higher = more aggressive)
    Returns:
        speech_frames: List of booleans (True = speech detected)
    """
    vad = webrtcvad.Vad(aggressiveness)  # Create VAD object

    # Check if sample rate is valid for VAD
    valid_rates = (8000, 16000, 32000, 48000)
    if sr not in valid_rates:
        raise ValueError(f"Sample rate must be one of {valid_rates}")

    # Normalize audio to [-1, 1] if needed
    if np.max(np.abs(audio)) > 1.0:
        audio = audio / np.max(np.abs(audio))

    # Convert audio from float [-1, 1] to int16 (required by VAD)
    audio_int16 = (audio * 32767).astype(np.int16)

    # Calculate frame size in samples
    frame_size = int(sr * frame_duration / 1000)

    # Analyze each complete frame for speech (a trailing partial frame is skipped)
    speech_frames = []
    for i in range(0, len(audio_int16) - frame_size + 1, frame_size):
        frame = audio_int16[i:i + frame_size]
        frame_bytes = struct.pack("h" * len(frame), *frame)  # Convert to bytes
        is_speech = vad.is_speech(frame_bytes, sr)           # VAD decision
        speech_frames.append(is_speech)

    # Guard against audio that is shorter than one frame
    if not speech_frames:
        print("Audio is shorter than one frame; no speech detection performed")
        return speech_frames

    speech_percentage = sum(speech_frames) / len(speech_frames) * 100
    print(f"Detected speech in {speech_percentage:.1f}% of frames")
    return speech_frames

# Example usage:
# speech_frames = detect_speech(audio, sr)

def keep_speech_segments(audio, sr, speech_frames, frame_duration=30):
    """
    Keep only speech segments from audio.
    Parameters:
        audio: Audio data
        sr: Sample rate
        speech_frames: List of booleans from detect_speech()
        frame_duration: Frame duration in ms
    Returns:
        speech_only: Audio with non-speech segments replaced by silence
    """
    frame_size = int(sr * frame_duration / 1000)
    speech_only = np.zeros_like(audio)  # Start with silence

    for i, is_speech in enumerate(speech_frames):
        if is_speech:
            start = i * frame_size
            end = min(start + frame_size, len(audio))
            speech_only[start:end] = audio[start:end]  # Copy speech frames

    return speech_only

# Example usage:
# speech_only_audio = keep_speech_segments(audio, sr, speech_frames)
# plot_waveforms(audio, speech_only_audio, sr, "Original vs. Speech Only")
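
The function above keeps the original timing by replacing non-speech with silence. An alternative, sketched below, is to drop the non-speech frames entirely so the recognizer receives a shorter clip. `extract_speech_only` is a hypothetical helper; note that the output no longer lines up with timestamps in the original recording.

```python
import numpy as np

def extract_speech_only(audio, sr, speech_frames, frame_duration=30):
    """Concatenate the frames flagged as speech and discard everything else."""
    frame_size = int(sr * frame_duration / 1000)
    kept = [
        audio[i * frame_size:(i + 1) * frame_size]
        for i, is_speech in enumerate(speech_frames)
        if is_speech
    ]
    if not kept:
        return np.zeros(0, dtype=audio.dtype)  # no speech detected at all
    return np.concatenate(kept)

# Three 30 ms frames at 16 kHz (480 samples each); the middle one is non-speech
demo_audio = np.arange(1440, dtype=np.float32)
trimmed = extract_speech_only(demo_audio, 16000, [True, False, True])
print(f"Kept {len(trimmed)} of {len(demo_audio)} samples")
```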

#### Breakdown: Voice Activity Detection (VAD)

- Aggressiveness controls how strict the VAD is: higher values remove more non-speech but may cut off quiet speech.

**Amplitude, Speech Detection, and Their Role in VAD**

- **Amplitude**: Speech tends to have higher and more variable amplitude than silence or background noise.
- **VAD**: Uses both amplitude and frequency patterns to decide if a frame contains speech.
- **Frames**: Audio is split into short segments (frames) for analysis; each is checked for speech presence.
- **Relation**: VAD helps isolate the parts of the audio where speech is present, reducing the amount of irrelevant data passed to the recognizer and improving both speed and accuracy.
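
A list of per-frame booleans is easier to work with once consecutive speech frames are merged into time ranges. The following sketch uses a hypothetical `frames_to_segments` helper and assumes the same fixed frame duration that was used for detection.

```python
def frames_to_segments(speech_frames, frame_duration=30):
    """Merge consecutive speech frames into (start_seconds, end_seconds) tuples."""
    segments = []
    start = None  # index of the first frame in the current speech run
    for i, is_speech in enumerate(speech_frames):
        if is_speech and start is None:
            start = i                          # a speech run begins
        elif not is_speech and start is not None:
            segments.append((start * frame_duration / 1000, i * frame_duration / 1000))
            start = None                       # the speech run ended
    if start is not None:
        # The audio ended while speech was still active
        end = len(speech_frames) * frame_duration / 1000
        segments.append((start * frame_duration / 1000, end))
    return segments

demo_frames = [False, True, True, False, True]
for seg_start, seg_end in frames_to_segments(demo_frames):
    print(f"speech from {seg_start:.2f}s to {seg_end:.2f}s")
```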

### 5. Filtering

**What it does**: Enhances frequencies important for speech while reducing others.

**Why it matters**: Much of the information that makes speech intelligible lies roughly in the 300-3000Hz range (close to the traditional telephone band). Filtering can emphasize these frequencies.

Let's implement a band-pass filter to focus on speech frequencies:

**Flowchart: Bandpass Filtering**

```
[Start]
   |
   v
[Calculate Nyquist frequency]
   |
   v
[Normalize cutoff frequencies]
   |
   v
[Design bandpass filter]
   |
   v
[Apply filter to audio]
   |
   v
[Return filtered audio]
```

**Key Functions Used in This Section**

- `scipy.signal.butter`: Designs a Butterworth filter (returns filter coefficients).
- `scipy.signal.lfilter`: Applies the designed filter to the audio data.

In [None]:
from scipy.signal import butter, lfilter

def bandpass_filter(audio, sr, lowcut=300, highcut=3000, order=5):
    """
    Apply bandpass filter to audio.
    Parameters:
        audio: Audio data
        sr: Sample rate
        lowcut: Low frequency cutoff (Hz)
        highcut: High frequency cutoff (Hz)
        order: Filter order (higher = sharper cutoff)
    Returns:
        filtered_audio: Audio with frequencies outside range attenuated
    """
    nyquist = 0.5 * sr  # Nyquist frequency is half the sample rate
    low = lowcut / nyquist
    high = highcut / nyquist

    # Design a Butterworth bandpass filter
    b, a = butter(order, [low, high], btype='band')

    # Apply the filter to the audio
    filtered_audio = lfilter(b, a, audio)

    print(f"Applied bandpass filter: {lowcut}-{highcut}Hz")
    return filtered_audio

# Example usage:
# filtered_audio = bandpass_filter(audio, sr)
# plot_waveforms(audio, filtered_audio, sr, "Original vs. Filtered (300-3000Hz)")

#### Breakdown: Bandpass Filtering

- The Butterworth filter is chosen for its flat frequency response in the passband.

**Amplitude, Frequency, and Their Role in Filtering**

- **Amplitude**: Filtering changes the amplitude of different frequency components in the audio.
- **Frequency**: The energy most important for understanding speech is concentrated roughly between 300Hz and 3000Hz. Content outside this range is more likely to be noise, although it can still carry some speech detail.
- **Bandpass Filter**: Allows frequencies within a certain range to pass through, while attenuating (reducing) others.
- **Relation**: By filtering out frequencies not used in speech, we reduce noise and make the speech signal clearer for recognition.
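
One way to check what a filter should keep or remove is to measure how much of a signal's energy falls inside the speech band. The sketch below does this with NumPy's FFT; `band_energy_fraction` is a hypothetical helper and the two tones are synthetic stand-ins for hum and speech.

```python
import numpy as np

def band_energy_fraction(audio, sr, low=300.0, high=3000.0):
    """Return the fraction of spectral energy that lies inside [low, high] Hz."""
    spectrum = np.abs(np.fft.rfft(audio)) ** 2           # energy per frequency bin
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)      # bin centre frequencies in Hz
    in_band = (freqs >= low) & (freqs <= high)
    total = spectrum.sum()
    return float(spectrum[in_band].sum() / total) if total > 0 else 0.0

sr = 16000
t = np.arange(sr) / sr
hum = np.sin(2 * np.pi * 100 * t)     # low-frequency hum, outside the band
tone = np.sin(2 * np.pi * 1000 * t)   # inside the band

print(f"hum only:   {band_energy_fraction(hum, sr):.2f}")
print(f"tone only:  {band_energy_fraction(tone, sr):.2f}")
print(f"hum + tone: {band_energy_fraction(hum + tone, sr):.2f}")
```

Running the same measurement before and after `bandpass_filter` gives a quick numerical check that the filter is attenuating what it should.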

### Building a Complete Preprocessing Pipeline

Now, let's combine all these techniques into a complete audio preprocessing pipeline for speech recognition:

**Flowchart: Preprocessing Pipeline**

```
[Start]
   |
   v
[Resample audio]
   |
   v
[Normalize audio]
   |
   v
[If noise reduction enabled, reduce noise]
   |
   v
[If VAD enabled, keep only speech segments]
   |
   v
[If filtering enabled, apply bandpass filter]
   |
   v
[Return processed audio]
```

**Key Functions Used in This Section**

- `resample_audio`: Resamples audio to the target sample rate.
- `normalize_audio`: Normalizes audio volume.
- `reduce_noise`: Reduces background noise.
- `detect_speech`: Detects speech frames using VAD.
- `keep_speech_segments`: Keeps only speech segments in the audio.
- `bandpass_filter`: Applies a bandpass filter to the audio.
- All these are user-defined functions from previous sections.

In [None]:
def preprocess_audio(audio_path, target_sr=16000, normalize_target_dB=-3,
                     apply_noise_reduction=True, vad_enabled=True, apply_filter=True):
    """
    Complete audio preprocessing pipeline.
    Parameters:
        audio_path: Path to audio file
        target_sr: Target sample rate (default: 16kHz)
        normalize_target_dB: Target normalization level (dB)
        apply_noise_reduction: Whether to apply noise reduction
        vad_enabled: Whether to apply voice activity detection
        apply_filter: Whether to apply bandpass filtering
    Returns:
        processed_audio: Fully processed audio
        sr: Sample rate of processed audio
    """
    print(f"===== Processing {audio_path} =====")

    # Step 1: Resample to target sample rate
    audio, sr = resample_audio(audio_path, target_sr)

    # Step 2: Normalize audio volume
    audio = normalize_audio(audio, normalize_target_dB)

    # Step 3: Optionally reduce noise
    # (the flag is named apply_noise_reduction so that it does not shadow
    # the reduce_noise() function defined earlier)
    if apply_noise_reduction:
        audio = reduce_noise(audio, sr)

    # Step 4: Optionally apply VAD to keep only speech
    if vad_enabled:
        speech_frames = detect_speech(audio, sr)
        audio = keep_speech_segments(audio, sr, speech_frames)

    # Step 5: Optionally apply bandpass filter
    if apply_filter:
        audio = bandpass_filter(audio, sr)

    print("===== Processing complete =====")
    return audio, sr

# Example usage:
# processed, sr = preprocess_audio("path_to_audio.wav")
# sf.write("processed_audio.wav", processed, sr)

#### Breakdown: Complete Preprocessing Pipeline

- Each step is modular and can be enabled or disabled as needed for your application.

**How Each Step Contributes to Better Recognition**

- **Resampling**: Ensures compatibility with the speech recognition model.
- **Normalization**: Brings recordings to a consistent peak level, reducing the chance of missed or distorted words.
- **Noise Reduction**: Removes background noise, making speech clearer.
- **VAD**: Focuses processing on speech, ignoring silence and irrelevant sounds.
- **Filtering**: Emphasizes the frequency range of human speech, further reducing noise and improving clarity.
- **Combined Effect**: Each step addresses a specific problem in raw audio, and together they maximize the quality and recognizability of speech for the model.
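
The modular, switchable structure described above can also be expressed as data: a list of steps, each with a name, a function, and an on/off flag. This is a sketch of that design only; `run_pipeline` and the two toy steps are hypothetical and stand in for the real preprocessing functions.

```python
import numpy as np

def run_pipeline(audio, steps):
    """Apply each enabled (name, function, enabled) step in order."""
    applied = []
    for name, func, enabled in steps:
        if enabled:
            audio = func(audio)
            applied.append(name)   # record which steps actually ran
    return audio, applied

def remove_dc(x):
    """Toy step: subtract the mean so the waveform is centred on zero."""
    return x - np.mean(x)

def peak_normalize(x):
    """Toy step: scale so the largest absolute sample is 1.0."""
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

demo = np.array([0.1, 0.3, 0.5])
steps = [
    ("remove_dc", remove_dc, True),
    ("peak_normalize", peak_normalize, True),
]
result, applied = run_pipeline(demo, steps)
print(applied, result)
```

Keeping the order in a list makes it easy to experiment with which steps help for a given recording setup.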

### Testing Preprocessing with Vosk

To see the impact of preprocessing, we can test it with Vosk:

**Flowchart: Speech Recognition with Vosk**

```
[Start]
   |
   v
[Check if model exists]
   |
   v
[Save audio data to temporary WAV file]
   |
   v
[Open WAV file for reading]
   |
   v
[Load Vosk model and create recognizer]
   |
   v
[Read audio in chunks and recognize speech]
   |
   v
[Collect recognized text]
   |
   v
[Delete temporary file]
   |
   v
[Return recognized text]
```

**Key Functions Used in This Section**

- `vosk.Model`: Loads a Vosk speech recognition model.
- `vosk.KaldiRecognizer`: Creates a recognizer object for processing audio.
- `soundfile.write` (`sf.write`): Saves numpy audio data to a WAV file.
- `wave.open`: Opens a WAV file for reading.
- `KaldiRecognizer.AcceptWaveform`: Feeds a chunk of audio bytes to the recognizer and returns `True` when a complete utterance is ready to be read with `Result()`.
- `json.loads`: Parses JSON output from the recognizer.
- `os.remove`: Deletes the temporary WAV file after processing.

In [None]:
from vosk import Model, KaldiRecognizer
import wave
import json
import os

def recognize_speech(audio_data, sample_rate, model_path):
    """
    Recognize speech in audio data using Vosk.
    Parameters:
        audio_data: Numpy array of audio data
        sample_rate: Sample rate of the audio
        model_path: Path to Vosk model
    Returns:
        text: Recognized text
    """
    if not os.path.exists(model_path):
        print(f"Error: Model not found at {model_path}")
        return ""

    # Save audio to a temporary 16-bit PCM WAV file that the wave module can read
    temp_wav = "temp_audio.wav"
    sf.write(temp_wav, audio_data, sample_rate, subtype="PCM_16")

    text = ""
    try:
        # Open the WAV file for reading; the context manager closes it afterwards
        with wave.open(temp_wav, "rb") as wf:
            # Load the Vosk model and create a recognizer
            model = Model(model_path)
            recognizer = KaldiRecognizer(model, wf.getframerate())

            # Read and process audio in chunks
            while True:
                data = wf.readframes(4000)  # Read 4000 frames at a time
                if len(data) == 0:
                    break
                if recognizer.AcceptWaveform(data):
                    result = json.loads(recognizer.Result())
                    text += result.get("text", "") + " "

            # Get the final recognition result
            final_result = json.loads(recognizer.FinalResult())
            text += final_result.get("text", "")
    finally:
        # Clean up: remove the temporary file, even if recognition failed
        os.remove(temp_wav)

    return text.strip()

# Example usage:
# model_path = "/path/to/vosk-model"
# original_audio, sr = librosa.load("audio.wav", sr=16000)
# processed_audio, sr = preprocess_audio("audio.wav")
# print("Recognition without preprocessing:")
# print(recognize_speech(original_audio, sr, model_path))
# print("\nRecognition with preprocessing:")
# print(recognize_speech(processed_audio, sr, model_path))

#### Breakdown: Speech Recognition with Vosk

- Vosk's recognizer consumes raw 16-bit mono PCM bytes. Saving the numpy array to a temporary WAV file and reading it back with `wave` is a simple way to produce them.
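
In the function above, `wf.readframes` returns raw 16-bit PCM bytes, and that is the form passed to `AcceptWaveform`. The same kind of byte string can be produced in memory. The following is a minimal sketch, assuming mono float audio in [-1, 1]; `float_to_pcm16_bytes` is a hypothetical helper, not part of Vosk.

```python
import numpy as np

def float_to_pcm16_bytes(audio):
    """Convert float samples in [-1, 1] to little-endian 16-bit PCM bytes."""
    clipped = np.clip(audio, -1.0, 1.0)               # guard against integer overflow
    return (clipped * 32767).astype("<i2").tobytes()  # 2 bytes per sample

samples = np.array([0.0, 1.0, -1.0, 2.0])             # the last value is out of range
pcm = float_to_pcm16_bytes(samples)
print(len(pcm), np.frombuffer(pcm, dtype="<i2").tolist())
```

In principle, bytes produced this way could be fed to the recognizer in chunks instead of going through a temporary file; that usage is not tested here.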

**Why Preprocessing Improves Recognition Results**

- **Raw audio** may contain noise, silence, or irrelevant frequencies that confuse the recognizer.
- **Preprocessed audio** is cleaner, with only the important speech content at the right loudness and frequency range.
- **Recognition accuracy**: Preprocessing can significantly increase the percentage of correctly recognized words, especially in challenging environments.
- **Practical tip**: Always compare recognition results before and after preprocessing to see the improvement for your specific use case.
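
To compare results before and after preprocessing with a number rather than by eye, a common measure is word error rate (WER): the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a plain-Python version; `word_error_rate` is a hypothetical helper, it requires a transcript that you know to be correct, and it does no punctuation handling.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

reference = "turn on the kitchen light"
print(word_error_rate(reference, "turn on kitchen light"))      # one word dropped
print(word_error_rate(reference, "turn on the kitchen light"))  # exact match
```

A lower WER after preprocessing than before, on the same recording, is the improvement this section describes.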

### Common Issues and Solutions

1. **Out of memory errors**: When processing long audio files, try processing in chunks
2. **Distortion after preprocessing**: Ensure normalization doesn't clip audio and filter settings are appropriate
3. **VAD removing speech**: Try lower aggressiveness values (0-1 instead of 2-3)
4. **Noise reduction artifacts**: Use non-stationary mode for varying noise conditions
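
For the first issue in the list above, processing in chunks can be sketched as a generator that yields fixed-length slices. `iter_chunks` and `process_in_chunks` are hypothetical helpers. The approach suits steps whose working memory grows with input length; steps that carry state across samples, such as filters, may show small discontinuities at chunk boundaries.

```python
import numpy as np

def iter_chunks(audio, sr, chunk_seconds=10.0):
    """Yield consecutive slices of at most chunk_seconds each."""
    chunk_size = int(sr * chunk_seconds)
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]

def process_in_chunks(audio, sr, func, chunk_seconds=10.0):
    """Apply func to each chunk and join the results back together."""
    processed = [func(chunk) for chunk in iter_chunks(audio, sr, chunk_seconds)]
    return np.concatenate(processed) if processed else audio[:0]

# 25 seconds of audio at a toy sample rate of 100 Hz, halved chunk by chunk
demo_sr = 100
demo_audio = np.ones(25 * demo_sr)
chunk_lengths = [len(c) for c in iter_chunks(demo_audio, demo_sr)]
halved = process_in_chunks(demo_audio, demo_sr, lambda chunk: chunk * 0.5)
print(chunk_lengths, len(halved))
```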

### Performance Tips

1. **For real-time applications**: Skip intensive steps like noise reduction
2. **For batch processing**: Use all preprocessing steps for maximum accuracy
3. **For VAD**: `webrtcvad` accepts only 10, 20, or 30ms frames; the 30ms default used here works well for typical speech
4. **For noise reduction**: Providing a separate noise profile when possible improves results

## BRIDGE TO PRACTICE

In the practice guide, you'll build a complete preprocessing pipeline and test it with real audio. You'll compare recognition results before and after preprocessing to see how each technique improves accuracy. The final exercise will integrate everything into a reusable audio processing module that you can use in your voice assistant project.