# VOSK INTRODUCTION PRACTICE GUIDE

## EXERCISE GOAL
In this practice session, you'll install Vosk, download a language model, and create a simple program that recognizes speech from a pre-recorded WAV file.

## STEP-BY-STEP INSTRUCTIONS

### Step 1: Install Vosk Package

In [None]:
# Install Vosk and SoundFile, plus PyAudio for the microphone recording in Step 3
!pip install vosk soundfile pyaudio

In [None]:
from vosk import Model
print('Vosk installed successfully!')
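
If the import fails, or you simply want to note which version you are working with, a small check like the one below can help. This is a sketch that uses only the standard library; `installed_version` is a hypothetical helper name and not part of Vosk.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a pip package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the packages installed in Step 1
for name in ("vosk", "soundfile"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```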

### Step 2: Download a Vosk Model
- Visit https://alphacephei.com/vosk/models
- Download the appropriate model for your language (we recommend starting with a small model)
- For English, you can use: vosk-model-small-en-us-0.15

You can download and extract the model automatically using the commands below:

In [None]:
import os

# Create models directory
!mkdir -p models

# Check if the model is already downloaded
model_dir = "models/vosk-model-small-en-us-0.15"
if not os.path.exists(model_dir):
    # Download model (US English small)
    !wget -P models https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
    
    # Unzip the model
    !unzip -q models/vosk-model-small-en-us-0.15.zip -d models
    
    print("Model downloaded and extracted successfully.")
else:
    print("Model already exists. Using existing model.")
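
The `wget` and `unzip` commands above may not be available on every system (Windows, for instance). As an alternative, here is a minimal sketch that does the same job with the Python standard library. The helper name `download_and_extract` is an assumption made for illustration; the URL in the usage comment is the one from the cell above.

```python
import os
import urllib.request
import zipfile


def download_and_extract(url, dest_dir="models"):
    """Download a zip archive into dest_dir and extract it there."""
    os.makedirs(dest_dir, exist_ok=True)
    zip_path = os.path.join(dest_dir, os.path.basename(url))

    # Fetch the archive only if it is not already on disk
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)

    # Extract everything next to the archive
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return zip_path


# Example usage (uncomment to download the small US English model):
# download_and_extract("https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip")
```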

### Step 3: Create a Test Audio File

We need an audio file for testing. If you already have one, you can use that. Otherwise, let's create a simple function to record audio using PyAudio:

In [4]:
import pyaudio
import wave

def record_audio(filename="test.wav", seconds=5, sample_rate=16000):
    """Record mono 16-bit audio from the microphone and save it to a WAV file.

    The 16 kHz default is a standard rate commonly used with Vosk models.
    """
    # Configure PyAudio
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16,
                    channels=1,
                    rate=sample_rate,
                    input=True,
                    frames_per_buffer=1024)
    
    print(f"Recording for {seconds} seconds...")
    
    frames = []
    
    # Record audio
    for i in range(0, int(sample_rate / 1024 * seconds)):
        data = stream.read(1024)
        frames.append(data)
        
    print("Recording finished.")
    
    # Stop and close the stream
    stream.stop_stream()
    stream.close()
    sample_width = p.get_sample_size(pyaudio.paInt16)  # 2 bytes for 16-bit audio
    p.terminate()

    # Save the recorded audio as a WAV file (closed automatically on exit)
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        wf.writeframes(b''.join(frames))
    
    print(f"Audio saved to {filename}")
    return filename

test_file = record_audio(seconds=5)

Recording for 5 seconds...
Recording finished.
Audio saved to test.wav
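
Whether you recorded the file or brought your own, it can help to confirm its properties before moving on, since the recognition code in Step 4 expects mono, 16-bit PCM audio. The following is a small sketch using the standard `wave` module; `describe_wav` is a hypothetical helper name.

```python
import wave


def describe_wav(path):
    """Return basic properties of a WAV file as a dictionary."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate": rate,
            "duration_seconds": frames / rate,
        }


# Example usage (assumes test.wav exists in the working directory):
# print(describe_wav("test.wav"))
```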

### Step 4: Build the Speech Recognition Function

Now, let's create a function to recognize speech from an audio file:

#### What is KaldiRecognizer?

`KaldiRecognizer` is a class provided by the Vosk library that performs speech recognition using a pre-trained model. It processes audio data and converts spoken words into text. Internally, it uses the Kaldi speech recognition toolkit, which is a powerful open-source toolkit for speech recognition research. In Vosk, `KaldiRecognizer` handles the decoding of audio streams and provides recognized text results as you feed it audio data.
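
The recognizer hands back its results as JSON strings. As a rough sketch of the parsing step, the snippet below works on hand-written sample strings in the shape Vosk produces: a `"text"` key for completed results from `Result()` and `FinalResult()`, and a `"partial"` key for in-progress hypotheses from `PartialResult()`. The sample strings and the `extract_text` helper are illustrative only, not output from a real run.

```python
import json


def extract_text(result_json):
    """Pull recognized text out of a Vosk-style JSON result string."""
    result = json.loads(result_json)
    # Completed results use the "text" key; in-progress ones use "partial"
    return result.get("text", result.get("partial", ""))


# Hand-written samples in the shape Vosk returns (not real recognizer output)
sample_final = '{"text": "hello world"}'
sample_partial = '{"partial": "hello wor"}'

print(extract_text(sample_final))
print(extract_text(sample_partial))
```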

In [None]:
from vosk import Model, KaldiRecognizer
import json
import os
import wave

def recognize_speech(audio_file="test.wav", model_path="models/vosk-model-small-en-us-0.15"):
    """Recognize speech from audio file using Vosk"""
    # Validate model path
    if not os.path.exists(model_path):
        print(f"Error: Model not found at {model_path}")
        return ""
    
    # Validate audio file
    if not os.path.exists(audio_file):
        print(f"Error: Audio file not found at {audio_file}")
        return ""
    
    
    
    # Load the model
    model = Model(model_path)
    
    # Open the audio file
    wf = wave.open(audio_file, "rb")
    
    # Check audio format
    if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
        print("Audio file must be WAV format mono PCM.")
        return ""
    
    # Create a recognizer
    recognizer = KaldiRecognizer(model, wf.getframerate())
    
    # Process the audio
    text = ""
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if recognizer.AcceptWaveform(data):
            result = json.loads(recognizer.Result())
            partial_text = result.get("text", "")
            text += partial_text + " "
            print(f"Partial: {partial_text}")
    
    # Get the final result
    final_result = json.loads(recognizer.FinalResult())
    final_text = final_result.get("text", "")
    text += final_text
    print(f"Final: {final_text}")
    
    # Release the file handle before returning
    wf.close()
    return text.strip()

# Example usage:
# recognized_text = recognize_speech("test.wav")

#### Line-by-line Explanation of the recognize_speech Function

- `from vosk import Model, KaldiRecognizer`  
  Imports the Vosk classes needed for loading the speech model and performing recognition.

- `import json`  
  Imports the JSON module to parse recognition results, which are returned as JSON strings.

- `import os`  
  Imports the OS module to check if files and directories exist.

- `def recognize_speech(audio_file="test.wav", model_path="models/vosk-model-small-en-us-0.15"):`  
  Defines a function that takes the path to an audio file and a model directory.

- `if not os.path.exists(model_path):`  
  Checks if the model directory exists.

- `print(f"Error: Model not found at {model_path}")`  
  Prints an error if the model directory is missing.

- `return ""`  
  Stops the function and returns an empty string if the model is missing.

- `if not os.path.exists(audio_file):`  
  Checks if the audio file exists.

- `print(f"Error: Audio file not found at {audio_file}")`  
  Prints an error if the audio file is missing.

- `return ""`  
  Stops the function and returns an empty string if the audio file is missing.

- `model = Model(model_path)`  
  Loads the Vosk model from the specified directory.

- `wf = wave.open(audio_file, "rb")`  
  Opens the audio file for reading in binary mode.

- `if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":`  
  Checks if the audio file is mono, 16-bit, and uncompressed PCM format.

- `print("Audio file must be WAV format mono PCM.")`  
  Prints an error if the audio file format is not supported.

- `return ""`  
  Stops the function and returns an empty string if the format is wrong.

- `recognizer = KaldiRecognizer(model, wf.getframerate())`  
  Creates a KaldiRecognizer object using the model and the audio file's sample rate.

- `text = ""`  
  Initializes an empty string to store recognized text.

- `while True:`  
  Starts a loop to process the audio in chunks.

- `data = wf.readframes(4000)`  
  Reads 4000 frames of audio data from the file.

- `if len(data) == 0:`  
  Checks if there is no more data to read.

- `break`  
  Exits the loop if all audio has been processed.

- `if recognizer.AcceptWaveform(data):`  
  Feeds the audio chunk to the recognizer and checks if a full utterance has been recognized.

- `result = json.loads(recognizer.Result())`  
  Gets the recognition result as a JSON string and parses it into a Python dictionary.

- `partial_text = result.get("text", "")`  
  Extracts the recognized text from the result.

- `text += partial_text + " "`  
  Adds the recognized text to the output string.

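- `print(f"Partial: {partial_text}")`  
  Prints the text of the utterance just completed, for debugging. Despite the "Partial" label, this is a finished segment, not an in-progress hypothesis.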

- `final_result = json.loads(recognizer.FinalResult())`  
  Gets the final recognition result after all audio is processed.

- `final_text = final_result.get("text", "")`  
  Extracts the final recognized text.

- `text += final_text`  
  Adds the final recognized text to the output string.

- `print(f"Final: {final_text}")`  
  Prints the final result.

- `return text.strip()`  
  Returns the complete recognized text, with leading/trailing spaces removed.
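
To see the chunked reading loop in isolation, here is a minimal sketch that walks a WAV file the same way, without involving Vosk. The names `iter_chunks` and `chunk_seconds` are hypothetical helpers introduced for illustration. With 16-bit mono audio each frame is 2 bytes, so a full chunk of 4000 frames is 8000 bytes.

```python
import wave


def iter_chunks(path, frames_per_chunk=4000):
    """Yield successive blocks of raw audio bytes from a WAV file."""
    with wave.open(path, "rb") as wf:
        while True:
            data = wf.readframes(frames_per_chunk)
            if len(data) == 0:
                break
            yield data


def chunk_seconds(frames_per_chunk, sample_rate):
    """Duration of audio covered by one chunk."""
    return frames_per_chunk / sample_rate


# 4000 frames at 16 kHz is a quarter of a second of audio per chunk
print(chunk_seconds(4000, 16000))  # 0.25
```

Smaller chunks give the recognizer more frequent opportunities to report a completed utterance, while larger chunks mean fewer calls per file.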

### Step 5: Test the Speech Recognition

Now let's bring everything together! You can record audio and then recognize the speech, or use an existing audio file:

In [6]:
# Option 1: Record new audio and recognize
def record_and_recognize(duration=5):
    audio_file = record_audio(seconds=duration)
    return recognize_speech(audio_file)

# Option 2: Recognize speech from an existing file
def recognize_from_file(file_path):
    return recognize_speech(file_path)

In [7]:
# Uncomment one of the following options:

# Option 1: Record and recognize
# text = record_and_recognize(5)
# print(f"\nRecognized text: {text}")

# Option 2: Use an existing file
# existing_file = "/path/to/your/audio.wav"  # Replace with your audio file path
# text = recognize_from_file(existing_file)
# print(f"\nRecognized text: {text}")
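
If you have several recordings, Option 2 can be extended to a whole folder. The sketch below is one possible approach: `transcribe_folder` is a hypothetical helper, and the `recordings` folder name in the usage comment is an assumption. It accepts any function that maps a file path to text, so you can pass in `recognize_speech` from Step 4.

```python
import os


def transcribe_folder(folder, recognize):
    """Run a recognizer function on every .wav file in a folder.

    `recognize` is any callable that takes a file path and returns text,
    for example the recognize_speech function defined in Step 4.
    """
    results = {}
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".wav"):
            results[name] = recognize(os.path.join(folder, name))
    return results


# Example usage (assumes a folder named "recordings" containing WAV files):
# transcripts = transcribe_folder("recordings", recognize_speech)
# for name, text in transcripts.items():
#     print(f"{name}: {text}")
```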

### Step 6: Experiment and Learn

Now that you have a working speech recognition system, try experimenting with it:

1. Record audio with different content and see how well it's recognized
2. Try modifying the code to handle longer recordings
3. Explore how different audio quality affects recognition accuracy

Below is a space for your experimentation:

In [8]:
# Your experimentation code here
# Ideas:
# - Measure recognition accuracy with different audio inputs
# - Process an audio file with background noise
# - Add timing measurements to check performance
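
As a starting point for the accuracy and timing ideas above, here is a small sketch. It computes word error rate (word-level edit distance divided by the number of reference words), a common way to score a transcript against what was actually said, and wraps a call with a timer. Both `word_error_rate` and `timed` are hypothetical helpers written for this exercise, not part of Vosk.

```python
import time


def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Convention chosen here: empty reference scores 0 only if hypothesis is empty too
        return 0.0 if not hyp else 1.0

    # Dynamic-programming edit distance, keeping one row at a time
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(ref)


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# One substitution in a four-word reference
print(word_error_rate("turn on the light", "turn on the lights"))  # 0.25
```

To time a recognition run, you could call `timed(recognize_speech, "test.wav")` and compare the elapsed seconds with the length of the recording.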

## Conclusion

Congratulations! You've successfully installed Vosk, set up a language model, and created a program that can recognize speech from audio files. This is a fundamental building block for your voice assistant project.

In the next modules, you'll learn about audio preprocessing techniques that can improve recognition accuracy, especially in noisy environments.