# SPEECH-TO-TEXT WITH VOSK

## GLOSSARY

- **Speech-to-Text (STT)**: The process of converting spoken language into written text
- **Real-time Recognition**: Processing speech as it's being spoken, rather than after a recording is complete
- **Partial Results**: Preliminary recognition results that may change as more audio is processed
- **Final Result**: The settled recognition result for an utterance, emitted when the recognizer detects the end of speech or when all audio has been processed
- **Grammar**: A set of rules that constrains the words the recognizer will consider
- **Word List**: A limited set of words the recognizer should recognize
- **Confidence Score**: A measure of how confident the system is in its recognition result
- **Alternative Results**: Other plausible transcriptions of the same speech input (an n-best list), ranked by score
- **Speaker Identification**: The process of determining who is speaking in an audio sample
- **Word Timestamps**: Time markers indicating when each word was spoken

## CONCEPT INTERACTIONS

- **Building on Vosk Basics**: We'll expand on the Vosk concepts from Module 1, applying them to real-time scenarios
- **Building on Audio Preprocessing**: We'll integrate the preprocessing techniques from Module 2 for better recognition
- **Looking Forward**: The speech-to-text capabilities we develop here will form the foundation for understanding user intent in the next module

## MAIN CONTENT

### From Batch to Real-Time Recognition

In the first module, we learned how to use Vosk to recognize speech from a pre-recorded audio file. Batch recognition is useful, but a voice assistant needs to process speech in real time, while the user is still speaking. In this module, we'll explore how to:

1. Capture live audio using PyAudio
2. Process this audio in chunks with Vosk
3. Handle both partial and final results
4. Improve accuracy with preprocessing
5. Build a complete real-time speech recognition system

### Setting Up Real-Time Recognition

Real-time recognition follows a different pattern than batch processing:

1. Initialize audio capture stream
2. Initialize Vosk recognizer
3. Continuously:
   - Capture audio chunks
   - Preprocess the audio (optional)
   - Feed chunks to the recognizer
   - Handle partial results
   - Check for silence/end of speech
4. Get final result
5. Take action based on the recognized text

Let's implement each step:

In [None]:
# Required imports
import pyaudio
import numpy as np
import queue
import threading
import time
import json
from vosk import Model, KaldiRecognizer

# For preprocessing (optional)
from scipy.signal import butter, lfilter

# Initialize PyAudio
def init_audio(sample_rate=16000, chunk_size=1024):
    """
    Initialize PyAudio for real-time audio capture
    
    Parameters:
    - sample_rate: Audio sample rate (must match model requirements)
    - chunk_size: Size of audio chunks to process
    
    Returns:
    - p: PyAudio object
    - stream: PyAudio stream
    """
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16,
                    channels=1,
                    rate=sample_rate,
                    input=True,
                    frames_per_buffer=chunk_size)
    return p, stream

# Initialize Vosk recognizer
def init_recognizer(model_path, sample_rate=16000):
    """
    Initialize Vosk recognizer
    
    Parameters:
    - model_path: Path to Vosk model
    - sample_rate: Audio sample rate
    
    Returns:
    - recognizer: KaldiRecognizer object
    """
    model = Model(model_path)
    recognizer = KaldiRecognizer(model, sample_rate)
    return recognizer
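
The `chunk_size` passed to `init_audio` puts a floor on latency, because a chunk has to be captured in full before the recognizer can see it. Here is a small sketch of the arithmetic, assuming 16-bit mono audio as configured above. The helper names are illustrative and not part of PyAudio or Vosk.

```python
def chunk_duration_ms(chunk_size, sample_rate):
    """Milliseconds of audio contained in one chunk of frames."""
    return 1000.0 * chunk_size / sample_rate


def chunk_size_bytes(chunk_size, channels=1, bytes_per_sample=2):
    """Bytes delivered per read (paInt16 uses 2 bytes per sample)."""
    return chunk_size * channels * bytes_per_sample


for frames in (512, 1024, 4096):
    print(f"{frames} frames -> {chunk_duration_ms(frames, 16000):.0f} ms, "
          f"{chunk_size_bytes(frames)} bytes")
```

Smaller chunks lower the delay before each partial result but increase the number of recognizer calls per second, so the default of 1024 frames is a middle ground rather than a requirement.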

### Real-Time Audio Processing

For real-time recognition, we need to efficiently process audio chunks as they come in:

In [None]:
def process_audio_chunk(recognizer, audio_data, preprocess=False, sample_rate=16000):
    """
    Process a chunk of audio data
    
    Parameters:
    - recognizer: Vosk recognizer object
    - audio_data: Raw audio data (bytes, 16-bit mono PCM)
    - preprocess: Whether to apply preprocessing
    - sample_rate: Sample rate of audio_data (must match the recognizer)
    
    Returns:
    - is_final: Whether this is a final result
    - text: Recognized text
    """
    if preprocess:
        # Convert to numpy array for preprocessing
        audio = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32767.0
        
        # Simple preprocessing (bandpass filter for speech frequencies)
        audio = bandpass_filter(audio, sample_rate, 300, 3000)
        
        # Clip before converting back to bytes: filter overshoot beyond
        # [-1, 1] would otherwise wrap around when cast to int16
        audio_data = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
    
    # Process the audio chunk
    if recognizer.AcceptWaveform(audio_data):
        # We have a final result
        result = json.loads(recognizer.Result()).get("text", "")
        return True, result
    else:
        # We have a partial result
        partial = json.loads(recognizer.PartialResult()).get("partial", "")
        return False, partial


def bandpass_filter(audio, sr, lowcut=300, highcut=3000, order=5):
    """Simple bandpass filter for speech frequencies"""
    nyquist = 0.5 * sr
    low = lowcut / nyquist
    high = highcut / nyquist
    b, a = butter(order, [low, high], btype='band')
    return lfilter(b, a, audio)
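
One caveat when `bandpass_filter` is called on each chunk separately: `lfilter` starts from a zero internal state on every call, which introduces a small transient at each chunk boundary. The sketch below shows one way to carry the filter state across chunks using the `zi` argument of `scipy.signal.lfilter`. The `StreamingBandpass` class is a hypothetical helper introduced here for illustration, not something the module already defines.

```python
import numpy as np
from scipy.signal import butter, lfilter


class StreamingBandpass:
    """Bandpass filter that remembers its state between chunks (illustrative)."""

    def __init__(self, sr, lowcut=300, highcut=3000, order=5):
        nyquist = 0.5 * sr
        self.b, self.a = butter(order, [lowcut / nyquist, highcut / nyquist], btype='band')
        # Zero initial state: the filter starts from silence
        self.zi = np.zeros(max(len(self.a), len(self.b)) - 1)

    def process(self, chunk):
        # With zi supplied, lfilter also returns the final state for the next call
        filtered, self.zi = lfilter(self.b, self.a, chunk, zi=self.zi)
        return filtered


# Compare chunked filtering against filtering the whole signal at once
sr = 16000
t = np.arange(4096) / sr
signal = np.sin(2 * np.pi * 1000 * t)

b, a = butter(5, [300 / (0.5 * sr), 3000 / (0.5 * sr)], btype='band')
whole = lfilter(b, a, signal)

stateful = StreamingBandpass(sr)
chunked = np.concatenate([stateful.process(signal[i:i + 1024]) for i in range(0, 4096, 1024)])
stateless = np.concatenate([lfilter(b, a, signal[i:i + 1024]) for i in range(0, 4096, 1024)])

print("stateful matches whole-signal filter:", np.allclose(chunked, whole))
print("stateless matches whole-signal filter:", np.allclose(stateless, whole))
```

Whether the boundary transients matter for recognition accuracy depends on the chunk size and the model, so it is worth comparing both variants on your own recordings.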

### Continuous Recognition Loop

Now, let's create a continuous recognition loop that captures and processes audio in real-time:

In [None]:
def continuous_recognition(model_path, timeout=5, preprocess=False):
    """
    Perform continuous speech recognition
    
    Parameters:
    - model_path: Path to Vosk model
    - timeout: Seconds of silence before finalizing
    - preprocess: Whether to apply preprocessing
    
    Returns:
    - recognized_text: Final recognized text
    """
    p, stream = init_audio()
    recognizer = init_recognizer(model_path)
    
    print("Listening... (Speak now)")
    
    # Variables to track silence
    last_speech_time = time.time()
    speech_detected = False
    silence_threshold = 1000  # Adjust based on your microphone
    final_parts = []  # Text of each utterance Vosk has finalized so far
    
    try:
        while True:
            # Read audio chunk (matches the default chunk_size in init_audio)
            audio_chunk = stream.read(1024, exception_on_overflow=False)
            
            # Check audio level (simple silence detection)
            # Cast to int32 first: abs(-32768) overflows in int16
            samples = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.int32)
            audio_level = np.max(np.abs(samples))
            
            # Update speech detection state
            if audio_level > silence_threshold:
                if not speech_detected:
                    print("Speech detected...")
                speech_detected = True
                last_speech_time = time.time()
            
            # Process audio with recognizer
            is_final, text = process_audio_chunk(recognizer, audio_chunk, preprocess)
            
            if is_final and text:
                # Vosk finalized an utterance. Keep the text, because calling
                # Result() has already reset the recognizer for the next one
                final_parts.append(text)
                print(f"\nFinal: {text}")
            elif text:
                # Display partial results
                print(f"\rPartial: {text}", end="", flush=True)
            
            # Check for timeout (silence)
            if speech_detected and time.time() - last_speech_time > timeout:
                print("\nSilence detected, finalizing...")
                break
    
    except KeyboardInterrupt:
        print("\nStopped by user")
    finally:
        # Clean up
        stream.stop_stream()
        stream.close()
        p.terminate()
    
    # Flush whatever is still pending and combine it with earlier utterances
    final_result = json.loads(recognizer.FinalResult())
    last_text = final_result.get("text", "")
    if last_text:
        final_parts.append(last_text)
    final_text = " ".join(final_parts)
    print(f"\nFinal result: \"{final_text}\"")
    
    return final_text
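
The silence check in the loop above compares the peak sample of each chunk against a threshold, so a single click can count as speech. A possible refinement is to use the RMS level instead and to keep the timing logic in a small object that can be tested without a microphone. This is a sketch using only the standard library; `rms_level` and `SilenceTracker` are hypothetical helpers, and the threshold values are placeholders to tune for your setup.

```python
import array
import math


def rms_level(chunk_bytes):
    """Root-mean-square level of 16-bit mono PCM bytes (0 for an empty chunk)."""
    samples = array.array('h', chunk_bytes)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class SilenceTracker:
    """Decides when an utterance has ended (hypothetical helper, not a Vosk API)."""

    def __init__(self, threshold=500.0, timeout=1.5):
        self.threshold = threshold      # RMS level treated as speech
        self.timeout = timeout          # seconds of quiet that end the utterance
        self.speech_detected = False
        self.last_speech_time = None

    def update(self, chunk_bytes, now):
        """Feed one chunk and the current time; return True once speech has ended."""
        if rms_level(chunk_bytes) > self.threshold:
            self.speech_detected = True
            self.last_speech_time = now
            return False
        return self.speech_detected and (now - self.last_speech_time) > self.timeout


# Simulated stream: 64 ms chunks, loud for about half a second and then quiet
loud = array.array('h', [4000, -4000] * 512).tobytes()
quiet = array.array('h', [20, -20] * 512).tobytes()

tracker = SilenceTracker(threshold=500.0, timeout=1.5)
ended_at = None
for i in range(60):
    now = i * 0.064
    chunk = loud if i < 8 else quiet
    if tracker.update(chunk, now):
        ended_at = now
        break

print(f"utterance ended at {ended_at:.3f} s")
```

In a live loop the `now` argument would come from `time.monotonic()`; passing it in explicitly is what makes the simulated run above deterministic.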

### Non-Blocking Recognition

The above approach blocks the main thread. For a voice assistant, we often want recognition to happen in the background while other operations continue. Here's a non-blocking approach using threading:

In [None]:
class VoiceRecognizer:
    """Class for non-blocking voice recognition"""
    
    def __init__(self, model_path, sample_rate=16000, chunk_size=1024, preprocess=False):
        """Initialize the recognizer"""
        self.sample_rate = sample_rate
        self.chunk_size = chunk_size
        self.preprocess = preprocess
        self.p = None
        self.stream = None
        self.recognizer = None
        self.model_path = model_path
        self.running = False
        self.audio_queue = queue.Queue()
        self.text_queue = queue.Queue()
        self.last_partial = ""
    
    def start(self):
        """Start recognition in background thread"""
        if self.running:
            return
        
        self.running = True
        self.p = pyaudio.PyAudio()
        self.stream = self.p.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size,
            stream_callback=self._audio_callback
        )
        
        # Initialize Vosk
        self.recognizer = init_recognizer(self.model_path, self.sample_rate)
        
        # Start processing thread
        self.process_thread = threading.Thread(target=self._process_audio)
        self.process_thread.daemon = True
        self.process_thread.start()
        
        print("Voice recognizer started")
    
    def _audio_callback(self, in_data, frame_count, time_info, status):
        """Callback for audio stream"""
        self.audio_queue.put(in_data)
        # Input-only stream: there is no output data to hand back
        return (None, pyaudio.paContinue)
    
    def _process_audio(self):
        """Process audio chunks from queue"""
        while self.running:
            try:
                # Block briefly instead of polling, so the loop stays idle
                # between chunks but still notices when running turns False
                audio_data = self.audio_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            
            # Process with recognizer
            is_final, text = process_audio_chunk(
                self.recognizer, audio_data, self.preprocess
            )
            
            if is_final and text:
                # Put final result in queue
                self.text_queue.put(("final", text))
                self.last_partial = ""
            elif text and text != self.last_partial:
                # Only queue new partials
                self.text_queue.put(("partial", text))
                self.last_partial = text
    
    def get_result(self, block=False, timeout=None):
        """
        Get recognition result if available
        
        Returns tuple of (type, text) where type is 'final' or 'partial'
        Returns None if no result available
        """
        try:
            return self.text_queue.get(block=block, timeout=timeout)
        except queue.Empty:
            return None
    
    def stop(self):
        """Stop recognition"""
        if not self.running:
            return
        
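        self.running = False
        
        # Wait for the processing thread to exit so it is not using the
        # recognizer while we ask for the final result below
        self.process_thread.join(timeout=1.0)
        
        # Stop and close stream
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        
        # Terminate PyAudio
        if self.p:
            self.p.terminate()
        
        # Get final result
        final_result = json.loads(self.recognizer.FinalResult())
        final_text = final_result.get("text", "")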
        
        if final_text:
            self.text_queue.put(("final", final_text))
        
        print("Voice recognizer stopped")
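
The producer/consumer pattern behind `VoiceRecognizer` can be tried without a microphone or a model. The sketch below is a simplified variant, not the class above: it uses a blocking `get()` plus a `None` sentinel for shutdown, and `fake_handle_chunk` is a stand-in for the recognizer call. All names here are illustrative.

```python
import queue
import threading


def run_worker(audio_queue, text_queue, handle_chunk):
    """Consume chunks until a None sentinel arrives, then exit."""
    while True:
        chunk = audio_queue.get()       # blocks without spinning the CPU
        if chunk is None:               # sentinel: the producer asked us to stop
            break
        text_queue.put(handle_chunk(chunk))


# Stand-in for the recognizer call so the sketch runs without audio hardware
def fake_handle_chunk(chunk):
    return f"processed {len(chunk)} bytes"


audio_queue = queue.Queue()
text_queue = queue.Queue()
worker = threading.Thread(target=run_worker,
                          args=(audio_queue, text_queue, fake_handle_chunk))
worker.start()

for size in (2048, 2048, 1024):
    audio_queue.put(b"\x00" * size)

audio_queue.put(None)   # request shutdown
worker.join()           # after this, no other thread touches shared state

results = []
while not text_queue.empty():
    results.append(text_queue.get())
print(results)
```

Joining the worker before reading any final state is the important step: it guarantees that only one thread is using the recognizer at a time.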

### Example Usage of Non-Blocking Recognizer

In [None]:
def demo_non_blocking_recognition():
    """Demo of non-blocking recognition"""
    recognizer = VoiceRecognizer(
        model_path="models/vosk-model-small-en-us-0.15",
        preprocess=True
    )
    
    print("Starting recognizer...")
    recognizer.start()
    
    print("Speak now! (Press Ctrl+C to stop)")
    
    try:
        # Main loop - could do other things here
        while True:
            # Check for results
            result = recognizer.get_result(block=False)
            if result:
                result_type, text = result
                if result_type == "partial":
                    print(f"\rPartial: {text}", end="", flush=True)
                else:
                    print(f"\nFinal: {text}")
            
            # Do other things...
            time.sleep(0.1)
            
    except KeyboardInterrupt:
        print("\nStopping...")
    finally:
        recognizer.stop()

# Uncomment to run the demo
# demo_non_blocking_recognition()

### Advanced Features

Vosk provides several advanced features that can enhance your speech recognition capabilities:

#### 1. Grammars and Word Lists

You can constrain recognition to a specific set of words or phrases, which improves accuracy for specific domains. Vosk takes the grammar as a JSON array of phrases, and every word in it must already be in the model's vocabulary:

In [None]:
def create_grammar_recognizer(model_path, sample_rate=16000):
    """
    Create a recognizer with a grammar constraint
    """
    model = Model(model_path)
    
    # Define a simple grammar as a list of phrases
    # This example recognizes commands for a smart home
    grammar = [
        "turn on the lights",
        "turn off the lights",
        "turn on the tv",
        "turn off the tv",
        "play music",
        "stop music",
        "what time is it",
        "what's the weather",
        "[unk]"  # Lets the recognizer report out-of-grammar speech as [unk]
    ]
    
    # Create recognizer with grammar (passed as a JSON array string)
    rec = KaldiRecognizer(model, sample_rate, json.dumps(grammar))
    return rec

# Example usage:
# grammar_recognizer = create_grammar_recognizer("models/vosk-model-small-en-us-0.15")
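
Command phrases often come from configuration or user input, so it can help to normalize them before handing them to the recognizer. The sketch below builds the JSON array string that the grammar argument of `KaldiRecognizer` expects. `build_grammar` is a hypothetical helper, and lowercasing is an assumption that matches the lowercase vocabulary of the English models used in this course.

```python
import json


def build_grammar(phrases, allow_unknown=True):
    """Return a JSON array string of lowercase, de-duplicated phrases."""
    cleaned = []
    for phrase in phrases:
        normalized = " ".join(phrase.lower().split())  # collapse extra spaces
        if normalized and normalized not in cleaned:
            cleaned.append(normalized)
    if allow_unknown:
        cleaned.append("[unk]")   # catch-all token for out-of-grammar speech
    return json.dumps(cleaned)


grammar_json = build_grammar(["Turn on the lights", "turn  off the lights", "Play music"])
print(grammar_json)
# The string would then be passed as the third argument:
# rec = KaldiRecognizer(model, 16000, grammar_json)
```

Including `[unk]` is a design choice: without it, the recognizer has to map everything it hears onto one of the listed phrases, which tends to produce confident but wrong matches.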

#### 2. Getting Word Timestamps

Vosk can provide timestamps for each recognized word, which is useful for synchronizing with other events:

In [None]:
def process_with_timestamps(recognizer, audio_data):
    """
    Process a complete utterance and extract word timestamps
    
    The recognizer must have word output enabled via SetWords(True)
    """
    recognizer.AcceptWaveform(audio_data)
    # FinalResult() flushes all buffered audio, which suits a complete clip
    result = json.loads(recognizer.FinalResult())
    
    # Check if we have results with timestamps
    if "result" in result:
        words_with_times = []
        for word_data in result["result"]:
            words_with_times.append({
                "word": word_data["word"],
                "start": word_data["start"],
                "end": word_data["end"],
                "conf": word_data.get("conf", 1.0)
            })
        return words_with_times
    
    return []

# To enable timestamps, call SetWords on the recognizer after creating it
# (the third constructor argument is reserved for a grammar)
# rec = KaldiRecognizer(model, sample_rate)
# rec.SetWords(True)
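
Once word timestamps are available, they can drive simple analyses such as finding pauses or flagging uncertain words. The sketch below works on a hand-written sample shaped like the `result` list that `process_with_timestamps` reads. The numbers are made up for illustration, and both helper functions are hypothetical.

```python
import json

# Hand-written sample in the shape read by process_with_timestamps;
# the timings and confidences are invented for illustration
sample_result = json.loads("""
{
  "result": [
    {"conf": 1.0, "start": 0.50, "end": 0.80, "word": "turn"},
    {"conf": 0.97, "start": 0.80, "end": 1.00, "word": "on"},
    {"conf": 1.0, "start": 1.00, "end": 1.20, "word": "the"},
    {"conf": 0.92, "start": 1.90, "end": 2.40, "word": "lights"}
  ],
  "text": "turn on the lights"
}
""")


def find_pauses(words, min_gap=0.5):
    """Return (previous_word, next_word, gap_seconds) for gaps of at least min_gap."""
    pauses = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= min_gap:
            pauses.append((prev["word"], nxt["word"], round(gap, 3)))
    return pauses


def low_confidence_words(words, threshold=0.95):
    """Return the words whose confidence falls below threshold."""
    return [w["word"] for w in words if w.get("conf", 1.0) < threshold]


words = sample_result["result"]
print(find_pauses(words))
print(low_confidence_words(words))
```

A voice assistant could use the low-confidence list to decide when to ask the user to repeat a command instead of acting on a doubtful word.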

#### 3. Getting Alternative Results

For ambiguous speech, you can get alternative interpretations:

In [None]:
def get_alternatives(recognizer, audio_data, max_alternatives=5):
    """
    Get alternative recognition results for a complete utterance
    
    Note: this leaves the recognizer in n-best mode for later calls
    """
    # Ask the recognizer for an n-best list instead of a single result
    recognizer.SetMaxAlternatives(max_alternatives)
    
    # Process audio
    recognizer.AcceptWaveform(audio_data)
    result = json.loads(recognizer.FinalResult())
    
    # Each alternative is a dict with "text" and "confidence" keys
    return result.get("alternatives", [])
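
Alternatives become useful when the application knows which phrases it can act on: if the top hypothesis is not a known command, a lower-ranked one might be. The sketch below assumes the list returned by `get_alternatives` holds dicts with `text` and `confidence` keys. The sample values are invented, and the confidence numbers should be treated only as a relative ranking score. `first_known_command` and `KNOWN_COMMANDS` are hypothetical names.

```python
import json

# Hand-written sample in the n-best shape; texts and scores are invented
sample_result = json.loads("""
{
  "alternatives": [
    {"confidence": 312.4, "text": " turn on the light"},
    {"confidence": 309.8, "text": " turn on the lights"},
    {"confidence": 301.2, "text": " turn all the lights"}
  ]
}
""")

KNOWN_COMMANDS = {"turn on the lights", "turn off the lights", "play music"}


def first_known_command(result, known_commands):
    """Return the highest-ranked alternative that is a known command, or None."""
    ranked = sorted(result.get("alternatives", []),
                    key=lambda alt: alt.get("confidence", 0.0), reverse=True)
    for alt in ranked:
        text = alt.get("text", "").strip()   # tolerate leading/trailing spaces
        if text in known_commands:
            return text
    return None


print(first_known_command(sample_result, KNOWN_COMMANDS))
```

Here the top hypothesis misses the final "s", and the second-ranked alternative is the one that matches a command the assistant can actually execute.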

### Integration with Preprocessing

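To make recognition more robust, we can integrate the audio preprocessing techniques from Module 2 into our real-time recognition system. Note that normalization here is applied per chunk, so quiet chunks (including background noise) are amplified to the same peak level as speech; compare accuracy with and without it on your own setup. Here's how: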

In [None]:
def bandpass_filter(audio, sr, lowcut=300, highcut=3000, order=5):
    """
    Apply bandpass filter for speech frequencies
    """
    nyquist = 0.5 * sr
    low = lowcut / nyquist
    high = highcut / nyquist
    b, a = butter(order, [low, high], btype='band')
    return lfilter(b, a, audio)

def normalize_audio(audio, target_dB=-3):
    """
    Normalize audio level
    """
    # Find the maximum absolute amplitude
    max_amplitude = np.max(np.abs(audio))
    
    # Calculate current peak in dB
    current_dB = 20 * np.log10(max_amplitude) if max_amplitude > 0 else -80
    
    # Calculate the gain needed to reach target
    gain_dB = target_dB - current_dB
    gain_linear = 10 ** (gain_dB / 20)
    
    # Apply gain to normalize
    normalized_audio = audio * gain_linear
    
    # Ensure we don't exceed [-1, 1]
    if np.max(np.abs(normalized_audio)) > 1.0:
        normalized_audio = normalized_audio / np.max(np.abs(normalized_audio))
    
    return normalized_audio

class EnhancedVoiceRecognizer(VoiceRecognizer):
    """Voice recognizer with enhanced preprocessing"""
    
    def __init__(self, model_path, sample_rate=16000, chunk_size=1024):
        super().__init__(model_path, sample_rate, chunk_size, preprocess=False)
        self.enable_bandpass = True
        self.enable_normalization = True
    
    def _process_audio(self):
        """Process audio with enhanced preprocessing"""
        while self.running:
            if not self.audio_queue.empty():
                audio_data = self.audio_queue.get()
                
                # Convert to numpy for preprocessing
                audio = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32767.0
                
                # Apply preprocessing
                if self.enable_normalization:
                    audio = normalize_audio(audio)
                
                if self.enable_bandpass:
                    audio = bandpass_filter(audio, self.sample_rate)
                
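                # Convert back to bytes, clipping first so that filter
                # overshoot beyond [-1, 1] does not wrap around in int16
                processed_audio = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()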
                
                # Process with recognizer
                if self.recognizer.AcceptWaveform(processed_audio):
                    result = json.loads(self.recognizer.Result())
                    text = result.get("text", "")
                    if text:
                        self.text_queue.put(("final", text))
                        self.last_partial = ""
                else:
                    partial = json.loads(self.recognizer.PartialResult())
                    text = partial.get("partial", "")
                    if text and text != self.last_partial:
                        self.text_queue.put(("partial", text))
                        self.last_partial = text
            
            time.sleep(0.01)
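
To see what `normalize_audio` does to individual chunks, it helps to work through the gain arithmetic by hand. The sketch below repeats the same dB calculation with the standard library for two illustrative peak levels, one typical of speech and one typical of background noise. `peak_gain` is a hypothetical helper, and the peak values are assumptions.

```python
import math


def peak_gain(peak, target_db=-3.0):
    """Linear gain that moves a peak amplitude (0..1 scale) to target_db."""
    current_db = 20 * math.log10(peak)
    return 10 ** ((target_db - current_db) / 20)


for label, peak in (("speech chunk", 0.5), ("background noise chunk", 0.01)):
    gain = peak_gain(peak)
    print(f"{label}: peak {peak} -> gain x{gain:.1f}, new peak {peak * gain:.3f}")
```

Both chunks end up at the same peak, which means the noise-only chunk receives a far larger gain than the speech chunk. One possible mitigation is to skip normalization for chunks whose level is below the silence threshold.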

### Building a Complete Voice Recognition System

Now, let's build a complete system that brings everything together:

1. Real-time audio capture
2. Advanced preprocessing
3. Vosk recognition with alternatives
4. Result handling

In [None]:
class CompleteVoiceRecognizer:
    """Complete voice recognition system"""
    
    def __init__(self, model_path, use_grammar=False, grammar_list=None):
        self.model_path = model_path
        self.sample_rate = 16000
        self.chunk_size = 1024
        self.use_grammar = use_grammar
        self.grammar_list = grammar_list or []
        self.p = None
        self.stream = None
        self.recognizer = None
        self.running = False
        self.speech_detected = False
        self.last_speech_time = 0
        self.silence_threshold = 700  # Adjust based on your environment
        self.silence_timeout = 2.0  # Seconds of silence to consider speech ended
        
        # For audio processing
        self.audio_queue = queue.Queue()
        self.result_queue = queue.Queue()
        
        # Preprocessing settings
        self.enable_normalization = True
        self.enable_bandpass = True
        self.lowcut = 300
        self.highcut = 3000
    
    def initialize(self):
        """Initialize the recognizer and audio stream"""
        # Load the model
        model = Model(self.model_path)
        
        # Create the recognizer; the optional third argument is a grammar,
        # given as a JSON array of phrases
        if self.use_grammar and self.grammar_list:
            self.recognizer = KaldiRecognizer(
                model, self.sample_rate, json.dumps(self.grammar_list)
            )
        else:
            self.recognizer = KaldiRecognizer(model, self.sample_rate)
            # Request an n-best list; results then arrive under "alternatives"
            self.recognizer.SetMaxAlternatives(3)
        
        # Include per-word timestamps in results
        self.recognizer.SetWords(True)
        
        # Initialize PyAudio
        self.p = pyaudio.PyAudio()
        self.stream = self.p.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size,
            stream_callback=self._audio_callback
        )
    
    def _audio_callback(self, in_data, frame_count, time_info, status):
        """Callback for audio stream"""
        self.audio_queue.put(in_data)
        # Input-only stream: there is no output data to hand back
        return (None, pyaudio.paContinue)
    
    def _preprocess_audio(self, audio_data):
        """Preprocess audio data"""
        # Convert to numpy array
        audio = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32767.0
        
        # Check audio level for speech detection
        audio_level = np.max(np.abs(audio)) * 32767
        if audio_level > self.silence_threshold:
            self.speech_detected = True
            self.last_speech_time = time.time()
        elif self.speech_detected and time.time() - self.last_speech_time > self.silence_timeout:
            self.speech_detected = False
            self.result_queue.put(("speech_end", None))
        
        # Apply preprocessing if enabled
        if self.enable_normalization:
            audio = normalize_audio(audio)
        
        if self.enable_bandpass:
            audio = bandpass_filter(audio, self.sample_rate, self.lowcut, self.highcut)
        
        # Convert back to bytes, clipping first so that filter overshoot
        # beyond [-1, 1] does not wrap around in int16
        return (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
    
    def _process_thread(self):
        """Audio processing thread"""
        while self.running:
            if not self.audio_queue.empty():
                # Get audio data
                audio_data = self.audio_queue.get()
                
                # Preprocess
                processed_audio = self._preprocess_audio(audio_data)
                
                # Process with recognizer
                if self.recognizer.AcceptWaveform(processed_audio):
                    # We have a final result
                    result = json.loads(self.recognizer.Result())
                    
                    # In n-best mode the result holds only an "alternatives"
                    # list; treat its first entry as the top-ranked hypothesis
                    alternatives = result.get("alternatives", [])
                    best = alternatives[0] if alternatives else result
                    
                    # Extract text and word timestamps
                    text = best.get("text", "").strip()
                    words = best.get("result", [])
                    
                    if text:
                        self.result_queue.put(("final", {
                            "text": text,
                            "words": words,
                            "alternatives": alternatives
                        }))
                else:
                    # We have a partial result
                    partial = json.loads(self.recognizer.PartialResult())
                    partial_text = partial.get("partial", "")
                    if partial_text:
                        self.result_queue.put(("partial", {"text": partial_text}))
            
            time.sleep(0.01)
    
    def start(self):
        """Start recognition"""
        if self.running:
            return
        
        self.running = True
        self.initialize()
        
        # Start processing thread
        self.process_thread = threading.Thread(target=self._process_thread)
        self.process_thread.daemon = True
        self.process_thread.start()
        
        # Start audio stream
        self.stream.start_stream()
        print("Voice recognition started")
    
    def stop(self):
        """Stop recognition"""
        if not self.running:
            return
        
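        self.running = False
        
        # Wait for the processing thread to exit so it is not using the
        # recognizer while we ask for the final result below
        self.process_thread.join(timeout=1.0)
        
        # Stop and close the audio stream
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
            self.stream = None
        
        # Terminate PyAudio
        if self.p:
            self.p.terminate()
            self.p = None
        
        # Get any final result (a single result or an n-best list)
        final_result = json.loads(self.recognizer.FinalResult())
        alternatives = final_result.get("alternatives", [])
        best = alternatives[0] if alternatives else final_result
        text = best.get("text", "").strip()
        if text:
            self.result_queue.put(("final", {
                "text": text,
                "words": best.get("result", []),
                "alternatives": alternatives
            }))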
        
        print("Voice recognition stopped")
    
    def get_result(self, block=False, timeout=None):
        """
        Get recognition result
        
        Returns:
        - result_type: 'partial', 'final', or 'speech_end'
        - result_data: Result data (text, words, etc.)
        """
        try:
            return self.result_queue.get(block=block, timeout=timeout)
        except queue.Empty:
            return None

### Demo of the Complete System

In [None]:
def demo_complete_system(model_path="models/vosk-model-small-en-us-0.15", 
                         use_grammar=False, run_time=30):
    """Demo of the complete voice recognition system"""
    # Define grammar if used
    grammar_list = [
        "what time is it",
        "what's the weather",
        "play music",
        "stop music",
        "turn on the lights",
        "turn off the lights",
        "open the door",
        "close the door"
    ] if use_grammar else None
    
    # Create recognizer
    recognizer = CompleteVoiceRecognizer(
        model_path=model_path,
        use_grammar=use_grammar,
        grammar_list=grammar_list
    )
    
    print(f"Starting voice recognition{' with grammar' if use_grammar else ''}...")
    recognizer.start()
    
    print("Speak now! (Press Ctrl+C to stop)")
    
    try:
        start_time = time.time()
        while time.time() - start_time < run_time:  # Run for specified seconds
            # Get results
            result = recognizer.get_result(block=False)
            if result:
                result_type, result_data = result
                
                if result_type == "partial":
                    print(f"\rPartial: {result_data['text']}", end="", flush=True)
                elif result_type == "final":
                    print(f"\nFinal: {result_data['text']}")
                    
                    # Print word timestamps if available
                    if result_data.get("words"):
                        print("Word timestamps:")
                        for word in result_data["words"]:
                            print(f"  {word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
                    
                    # Print alternatives if available
                    if result_data.get("alternatives"):
                        print("Alternatives:")
                        for alt in result_data["alternatives"]:
                            print(f"  {alt['text']}")
                
                elif result_type == "speech_end":
                    print("\n--- End of speech detected ---")
            
            time.sleep(0.1)
    
    except KeyboardInterrupt:
        print("\nStopping...")
    finally:
        recognizer.stop()

# Uncomment to run demo
# demo_complete_system(run_time=30)

### Common Issues and Solutions

When implementing real-time speech recognition with Vosk, you might encounter some challenges:

1. **High CPU Usage**: Processing audio in real-time can be CPU-intensive. Solutions:
   - Capture at the sample rate the model expects (16 kHz in this module) rather than a higher rate that has to be resampled
   - Process audio in a separate thread
   - Skip some preprocessing steps for faster processing

2. **Recognition Latency**: There might be a delay between speaking and recognition. Solutions:
   - Use smaller audio chunks
   - Use a smaller model
   - Run on more powerful hardware

3. **False Detections in Silence**: The recognizer might detect words in silence. Solutions:
   - Implement proper silence detection
   - Use a higher silence threshold
   - Add a minimum audio level check before processing

4. **Accuracy Issues**: Recognition might not be accurate enough. Solutions:
   - Use a larger, more accurate model
   - Apply preprocessing techniques (normalization, filtering)
   - Constrain recognition with a domain-specific grammar

### Performance Optimizations

For voice assistants, recognition performance is crucial. Here are some optimizations:

1. **Buffer Management**: Only keep recent audio in memory
2. **Adaptive Processing**: Apply heavier preprocessing only when needed
3. **Model Selection**: Choose the right balance between size and accuracy
4. **Grammar Constraints**: Use grammar when the possible responses are limited
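5. **Hardware Acceleration**: Vosk also offers a GPU-backed batch recognizer aimed at server workloads; the streaming `KaldiRecognizer` used in this module runs on the CPU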
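
The buffer management point above can be sketched with a bounded `collections.deque`, which discards the oldest chunk automatically once it is full. `RecentAudioBuffer` is a hypothetical helper, and the two-second window is an arbitrary choice for illustration.

```python
from collections import deque


class RecentAudioBuffer:
    """Keeps only the most recent audio chunks (hypothetical helper)."""

    def __init__(self, seconds, sample_rate=16000, chunk_size=1024):
        chunks_needed = max(1, round(seconds * sample_rate / chunk_size))
        # A deque with maxlen drops the oldest item on every append once full
        self.chunks = deque(maxlen=chunks_needed)

    def add(self, chunk_bytes):
        self.chunks.append(chunk_bytes)

    def snapshot(self):
        """Return the buffered audio as one bytes object."""
        return b"".join(self.chunks)


buffer = RecentAudioBuffer(seconds=2)
for i in range(100):
    buffer.add(bytes([i]) * 2048)   # 1024 frames of 16-bit audio per chunk

print(len(buffer.chunks), "chunks kept,", len(buffer.snapshot()), "bytes")
```

A buffer like this keeps memory use constant no matter how long the assistant runs, and the snapshot can be replayed into a recognizer if a wake word or command needs to be re-examined.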

## BRIDGE TO PRACTICE

Now that you understand the concepts of real-time speech recognition with Vosk, the practice guide will walk you through building a complete speech recognition system. You'll implement real-time audio capture, preprocessing, and recognition, and test it in various scenarios. By the end of the practice, you'll have a robust speech-to-text system that forms the foundation of your voice assistant.