# VOSK INTRODUCTION

## GLOSSARY
- **Vosk**: An offline speech recognition toolkit that enables speech-to-text functionality without internet connectivity
- **Speech Recognition**: The ability of a computer program to identify words spoken aloud and convert them to text
- **ASR (Automatic Speech Recognition)**: Technology that converts spoken language into written text
- **Acoustic Model**: A statistical representation of sounds that make up each word in a language
- **Language Model**: A statistical model that determines the probability of a sequence of words
- **Offline Processing**: Processing data without requiring an internet connection
- **API (Application Programming Interface)**: A set of rules allowing different software applications to communicate with each other

## CONCEPT INTERACTIONS
- **Building on PyAudio**: Vosk works with audio data we capture using PyAudio concepts learned earlier
- **Looking Forward**: After understanding Vosk basics, we'll learn how to preprocess audio to improve recognition accuracy

## MAIN CONTENT

### Introduction to Vosk

Vosk is an offline speech recognition toolkit designed to provide fast and accurate speech-to-text conversion without requiring an internet connection. Unlike cloud-based services like Google Speech API or Amazon Transcribe, Vosk runs entirely on your device, making it ideal for applications requiring privacy, reliability regardless of internet connectivity, or deployment on edge devices.

### Key Features of Vosk

1. **Offline Operation**: Works without internet connectivity
2. **Multiple Language Support**: Provides models for many languages and dialects
3. **Lightweight**: Can run on resource-constrained devices like Raspberry Pi
4. **Open Source**: Free to use and modify
5. **Cross-Platform**: Works on Windows, macOS, Linux, Android, and iOS
6. **Python Integration**: Easy to use with Python applications

### Installing Vosk

Before we can use Vosk, we need to install both the Python package and a language model:

In [None]:
# Install the Vosk package
#!pip install vosk

# We'll also need SoundFile for handling audio files
#!pip install soundfile

After installation, we need to download a model for our target language. Vosk provides models of different sizes:
- Small models (50-100MB): Fast but less accurate
- Large models (1-2GB): More accurate but require more computational resources

Models can be downloaded from: https://alphacephei.com/vosk/models

### Basic Vosk Architecture

Vosk operates using the following components:

1. **Model**: Contains acoustic and language models for a specific language
2. **Recognizer**: Processes audio data and extracts text
3. **Result Handler**: Manages the recognition results

Here's a simple diagram of how these components work together:

Audio Input → Audio Preprocessing → Vosk Recognizer → Text Output

### Your First Vosk Program

Here's a basic example that shows how to use Vosk with pre-recorded audio:

In [None]:
from vosk import Model, KaldiRecognizer
import wave
import json

# Path to the model - adjust this to your actual model path
model_path = "/home/luar/AI/voice_assistant/vosk-model-small-en-us-0.15"

# Load the model
model = Model(model_path)

# Open the audio file - adjust this to an actual audio file in your system
# You can use a recording from the PyAudio exercises if you have one
audio_path = "path/to/audio.wav"  # Replace with your actual audio file path

wf = wave.open(audio_path, "rb")

# Create a recognizer
recognizer = KaldiRecognizer(model, wf.getframerate())

# Process the audio
text = ""
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if recognizer.AcceptWaveform(data):
        result = json.loads(recognizer.Result())
        text += result.get("text", "") + " "

# Get the final results
final_result = json.loads(recognizer.FinalResult())
text += final_result.get("text", "")

print("Recognized text:", text)

### Understanding the Code

Let's break down each element:

1. **Model Initialization**:
   ```python
   model = Model(model_path)
   ```
   This loads the language model from disk. The model contains all the information needed to recognize speech in a specific language.

2. **Recognizer Setup**:
   ```python
   recognizer = KaldiRecognizer(model, wf.getframerate())
   ```
   The KaldiRecognizer combines the model with the audio properties (specifically the sample rate) to create a recognition engine.

3. **Processing Audio Data**:
   ```python
   while 1==1:
       data = wf.readframes(4000)
       if len(data) == 0:
           break
       if recognizer.AcceptWaveform(data):
           result = json.loads(recognizer.Result())
           text += result.get("text", "") + " "
   ```
   This loop reads chunks of audio data and feeds them to the recognizer. When enough data is processed to recognize words, it returns a JSON result containing the recognized text.

4. **Final Result**:
   ```python
   final_result = json.loads(recognizer.FinalResult())
   ```
   After all audio is processed, FinalResult() returns any remaining recognized text.

#### Detailed Explanation: Processing Audio Data

Let's look at the audio processing loop line by line:

- `while True:`  
  This starts an infinite loop to process the audio file in chunks.

- `data = wf.readframes(4000)`  
  Reads 4000 audio frames from the WAV file. This gives a small chunk of audio data to process at a time.

- `if len(data) == 0:`  
  Checks if there is no more audio data left to read (end of file).

- `break`  
  If there is no more data, exit the loop.

- `if recognizer.AcceptWaveform(data):`  
  Feeds the chunk of audio data to the recognizer. If the recognizer has enough data to recognize a phrase or sentence, it returns `True`.

- `result = json.loads(recognizer.Result())`  
  Gets the recognition result as a JSON string and parses it into a Python dictionary.

- `text += result.get("text", "") + " "`  
  Extracts the recognized text from the result and adds it to the overall text string.

This loop continues until all audio data is processed. Each chunk may or may not produce recognized text, depending on whether the recognizer has enough information to form words or sentences.

### Try It Yourself!

Now you can experiment by modifying the code above. Here are some suggestions:

1. Use a different audio file
2. Print intermediate results as they are recognized
3. Measure the time it takes to process your audio

In [5]:
# 1. Import any required modules (e.g., time, vosk, wave, json)
# 2. Start timing the processing
# 3. Set up the model path and load the Vosk model
# 4. Open your audio file (WAV format)
# 5. Create a recognizer using the model and audio sample rate

import time
import wave
import json
import vosk
import os

vmod = vosk.Model
vrec = vosk.KaldiRecognizer
texr = ""
model = vmod('/home/luar/AI/voice_assistant/vosk-model-small-en-us-0.15') 
wavfile = wave.open('/home/luar/AI/voice_assistant/Branch/recordings/test.wav', 'rb')
recognizer = vrec(model, wavfile.getframerate())

start_time = time.time()

# 6. Process the audio in chunks and collect recognized text

while 1:
    binary = wavfile.readframes(4000)
    if len(binary) == 0:
        break
    if recognizer.AcceptWaveform(binary):
        result = json.loads(recognizer.Result())
        print(result)
        




# 7. Get the final recognition result and append to text

# 8. Print the recognized text

# 9. Stop timing and print the total processing time
end_time = time.time()
print(f"Processing time: {end_time - start_time:.2f} seconds")

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/luar/AI/voice_assistant/vosk-model-small-en-us-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /home/luar/AI/voice_assistant/vosk-model-small-en-us-0.15/graph/HCLr.fst /home/luar/AI/voice_assistant/vosk-model-small-en-us-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:308) Loading winfo /home/luar/AI/voice_assistant/vosk-model-small-en-us-0.15/graph/phone

Processing time: 1.84 seconds


### Common Issues and Solutions

1. **Model Not Found**: Ensure you've downloaded the model and specified the correct path
2. **Audio Format Issues**: Vosk works best with 16kHz, 16-bit mono audio
3. **No Text Recognized**: Check if your audio file is clear, at a good volume, and in a supported language

## Next Steps

Now that you understand the basics of Vosk, you're ready to set up your first Vosk project. You'll need to:
1. Install Vosk
2. Download a model
3. Create a simple script that recognizes speech from a WAV file

The practice guide will walk you through each step in detail, helping you build a foundation for your voice assistant project.