# Lab Exercise 3 — Speech-to-Text Application for Accessibility

**Name:** Ashwin Rajan  
**Reg No:** 2448509  
**Course/Lab:** Lab 3


**Question:** Speech-to-Text Application for Accessibility  
**Aim:** Develop a Python-based speech-to-text system that converts spoken commands to text in real time or from audio files, provides stage-wise feedback, handles errors, and compares multiple recognition methods (offline and online).


In [None]:
# Environment setup for required packages.
# Runtime requirement: FFmpeg must be available on the system for Whisper decoding.
# Whisper: offline transcription; Vosk: offline recognition; SpeechRecognition: Google Web Speech (online).
%pip -q install openai-whisper vosk SpeechRecognition sounddevice scipy pandas
%pip install --no-cache-dir --upgrade sounddevice




In [None]:
# Imports and global configuration.
import os, json, wave, contextlib, warnings
import numpy as np
import pandas as pd
import whisper
import vosk
import speech_recognition as sr
from scipy.io.wavfile import write as wavwrite

warnings.filterwarnings("ignore")

# Stage prompts for feedback logging (printed as part of mandatory requirements).
PROMPT_BEFORE_RECORD = "Speak something..."
PROMPT_DURING_RECOG  = "Recognizing..."
PROMPT_SUCCESS       = "Speech successfully converted to text!"
PROMPT_UNCLEAR       = "Speech Recognition could not understand audio. Please try speaking more clearly."
PROMPT_SERVICE_DOWN  = "Speech service is unavailable. Please check the internet connection or API service."

# Audio sources configuration.
# If USE_SAMPLE_FOR_ALL = True, the same file is used across all scenarios to produce comparable outputs.
AUDIO_SAMPLE = "lab3sample.wav"
USE_SAMPLE_FOR_ALL = True

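SCENARIOS = {
    # The first scenario always uses the sample file; the others need their own
    # recordings (currently None) when USE_SAMPLE_FOR_ALL is False.
    "Clear male voice": AUDIO_SAMPLE,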
    "Clear female voice": AUDIO_SAMPLE if USE_SAMPLE_FOR_ALL else None,
    "Fast speech": AUDIO_SAMPLE if USE_SAMPLE_FOR_ALL else None,
    "Noisy background": AUDIO_SAMPLE if USE_SAMPLE_FOR_ALL else None,
    "Soft voice": AUDIO_SAMPLE if USE_SAMPLE_FOR_ALL else None,
}

# Whisper model selection (size options: tiny, base, small, medium, large).
WHISPER_MODEL_NAME = "base"

# Vosk model directory.
# Expectation: a Vosk English model directory exists (e.g., "vosk-model-small-en-us-0.15").
# If not found, the Vosk method will be skipped with an explanatory status.
VOSK_MODEL_DIR = os.environ.get("VOSK_MODEL_PATH", "vosk-model-small-en-us-0.15")

# Output containers.
results_rows = []  # will accumulate comparative outputs


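**Inference:** The stage feedback strings, the scenario-to-file mapping, and the model choices are now configured. The Vosk model path is read from the `VOSK_MODEL_PATH` environment variable and falls back to the folder `vosk-model-small-en-us-0.15`; the Whisper model is set to `base`. Because `USE_SAMPLE_FOR_ALL` is `True`, every scenario label points to the same file, `lab3sample.wav`.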
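The scenario mapping can be sketched as a small helper, which makes the effect of the `USE_SAMPLE_FOR_ALL` switch explicit. The function name `build_scenarios` and the per-scenario file names below are hypothetical and not part of the lab material.

```python
def build_scenarios(sample_path, per_scenario_files, use_sample_for_all=True):
    """Map each scenario label to an audio path.

    per_scenario_files: dict of label -> path (hypothetical file names).
    When use_sample_for_all is True, every label points at sample_path.
    """
    if use_sample_for_all:
        return {label: sample_path for label in per_scenario_files}
    return dict(per_scenario_files)


# Hypothetical recordings; these files are not part of the lab material.
files = {
    "Clear male voice": "male_clear.wav",
    "Noisy background": "noisy.wav",
}

shared = build_scenarios("lab3sample.wav", files, use_sample_for_all=True)
distinct = build_scenarios("lab3sample.wav", files, use_sample_for_all=False)
print(shared)
print(distinct)
```

With the switch on, the scenario labels are only labels: every row of the later comparison is computed from the same audio.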


In [None]:
# Optional microphone capture utility for real-time input (mandatory task allows mic OR file).
# Uses sounddevice to record PCM and saves as WAV at a target sample rate.
import sounddevice as sd

def record_microphone_wav(out_path="mic_input.wav", seconds=5, fs=16000):
    """Records mono audio from the system default microphone and saves as WAV."""
    print(PROMPT_BEFORE_RECORD)
    audio = sd.rec(int(seconds * fs), samplerate=fs, channels=1, dtype='float32')
    sd.wait()
    # Scale float32 to int16 PCM for WAV compatibility.
    pcm = np.int16(np.clip(audio.flatten(), -1, 1) * 32767)
    wavwrite(out_path, fs, pcm)
    return out_path

# Example (disabled by default); set RUN_MIC to True to capture.
RUN_MIC = False
if RUN_MIC:
    mic_file = record_microphone_wav(out_path="mic_input.wav", seconds=5, fs=16000)
    SCENARIOS["Recorded mic sample"] = mic_file


OSError: PortAudio library not found

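**Inference:** The cell defines a microphone capture function that prints “Speak something…” before recording starts. In this run, importing `sounddevice` failed with `OSError: PortAudio library not found`, so the cell stopped at the import and microphone capture was not available. Recording is optional (`RUN_MIC` is `False`), and the rest of the notebook operates strictly on `lab3sample.wav`.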
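One way to keep the notebook running when PortAudio is absent is to guard the import. The sketch below is an assumption about how that could be done, not part of the original cell; `load_sounddevice` and `MIC_AVAILABLE` are hypothetical names.

```python
import importlib


def load_sounddevice():
    """Return the sounddevice module, or None when it cannot be imported.

    ImportError covers a missing package; OSError covers a missing
    PortAudio shared library, as in the error shown above.
    """
    try:
        return importlib.import_module("sounddevice")
    except (ImportError, OSError) as exc:
        print(f"Microphone capture disabled: {exc}")
        return None


sd_module = load_sounddevice()
MIC_AVAILABLE = sd_module is not None
print("Microphone available:", MIC_AVAILABLE)
```

A capture function could then check `MIC_AVAILABLE` before calling into the module and fall back to file input otherwise.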


In [None]:
def wav_duration_seconds(path):
    """Returns duration of a WAV file in seconds. If not a valid WAV, returns None."""
    try:
        with contextlib.closing(wave.open(path, 'rb')) as wf:
            frames = wf.getnframes()
            rate = wf.getframerate()
            return frames / float(rate)
    except Exception:
        return None

def file_available(path):
    """Checks the existence of a file path."""
    return isinstance(path, str) and os.path.exists(path)


**Inference:** Utility functions provide duration metadata for WAV files and safe checks for file availability prior to recognition.
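As a quick check of the duration helper, the sketch below writes one second of silence with the standard `wave` module and reads its duration back. The temporary file names are placeholders for illustration only.

```python
import contextlib
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds, or None if unreadable."""
    try:
        with contextlib.closing(wave.open(path, "rb")) as wf:
            return wf.getnframes() / float(wf.getframerate())
    except Exception:
        return None


# Write 1 second of 16 kHz, mono, 16-bit silence to a temporary file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "silence.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

duration = wav_duration_seconds(demo_path)
missing = wav_duration_seconds(os.path.join(tmp_dir, "absent.wav"))
print(duration, missing)
```

A missing or non-WAV file yields `None` rather than an exception, which is what lets the scenario loops skip unavailable inputs.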


In [None]:
# Whisper offline transcription method.
# Note: First-time model load may download weights; subsequent runs use cache.

_whisper_model = None

def transcribe_whisper(audio_path, model_name=WHISPER_MODEL_NAME, language="en"):
    """Transcribes audio using Whisper offline model and returns text or raises an exception."""
    global _whisper_model
    if _whisper_model is None:
        _whisper_model = whisper.load_model(model_name)
    print(PROMPT_DURING_RECOG)
    # Using Whisper's transcribe API for convenience; fp16 disabled for compatibility on CPU.
    out = _whisper_model.transcribe(audio_path, language=language, fp16=False)
    text = (out.get("text") or "").strip()
    if text:
        print(PROMPT_SUCCESS)
    else:
        # Empty text treated as unclear audio.
        raise sr.UnknownValueError(PROMPT_UNCLEAR)
    return text

# Execute Whisper on available scenarios.
whisper_outputs = {}
for scenario, path in SCENARIOS.items():
    if not file_available(path):
        whisper_outputs[scenario] = "N/A (file missing)"
        continue
    try:
        print(f"[Whisper] Scenario: {scenario} | File: {path}")
        whisper_outputs[scenario] = transcribe_whisper(path)
    except sr.UnknownValueError:
        print(PROMPT_UNCLEAR)
        whisper_outputs[scenario] = "Unclear audio"
    except Exception as e:
        print(f"Whisper error: {str(e)}")
        whisper_outputs[scenario] = f"Error: {str(e)}"


[Whisper] Scenario: Clear male voice | File: lab3sample.wav


100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 184MiB/s]


Recognizing...
Speech successfully converted to text!
[Whisper] Scenario: Clear female voice | File: lab3sample.wav
Recognizing...
Speech successfully converted to text!
[Whisper] Scenario: Fast speech | File: lab3sample.wav
Recognizing...
Speech successfully converted to text!
[Whisper] Scenario: Noisy background | File: lab3sample.wav
Recognizing...
Speech successfully converted to text!
[Whisper] Scenario: Soft voice | File: lab3sample.wav
Recognizing...
Speech successfully converted to text!


**Inference:** Whisper performs offline transcription; the model weights (139M for `base`) were downloaded on the first call and are cached for later runs. “Recognizing…” is displayed during processing, followed by a success or failure message. Empty results are flagged as unclear audio; other exceptions are reported with their message.


In [None]:
# Vosk offline recognition method.
# Requires a local model directory containing Vosk English model files.

_vosk_model = None
if os.path.isdir(VOSK_MODEL_DIR):
    try:
        _vosk_model = vosk.Model(VOSK_MODEL_DIR)
    except Exception:
        _vosk_model = None
else:
    _vosk_model = None

def transcribe_vosk(audio_path):
    """Recognizes speech from a WAV file using Vosk offline engine, returns text and avg word confidence if available."""
    if _vosk_model is None:
        raise RuntimeError("Vosk model not available at configured path.")
    print(PROMPT_DURING_RECOG)
    with wave.open(audio_path, "rb") as wf:
        rec = vosk.KaldiRecognizer(_vosk_model, wf.getframerate())
        rec.SetWords(True)
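        text_parts, confs = [], []

        def _collect(raw_json):
            # Each finalized segment carries its own text and word list.
            seg = json.loads(raw_json)
            if seg.get("text"):
                text_parts.append(seg["text"].strip())
            confs.extend(w["conf"] for w in seg.get("result", [])
                         if isinstance(w.get("conf"), (int, float)))

        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            # AcceptWaveform returns True when an utterance ends; keep that
            # segment, otherwise only the last utterance would be returned.
            if rec.AcceptWaveform(data):
                _collect(rec.Result())
        _collect(rec.FinalResult())
        text = " ".join(text_parts).strip()
        conf = float(np.mean(confs)) if confs else None
        if text:
            print(PROMPT_SUCCESS)
            return text, conf
        else:
            raise sr.UnknownValueError(PROMPT_UNCLEAR)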

# Execute Vosk on available scenarios.
vosk_outputs = {}
for scenario, path in SCENARIOS.items():
    if not file_available(path):
        vosk_outputs[scenario] = "N/A (file missing)"
        continue
    try:
        print(f"[Vosk] Scenario: {scenario} | File: {path}")
        if _vosk_model is None:
            vosk_outputs[scenario] = "Skipped (Vosk model not found)"
        else:
            text, conf = transcribe_vosk(path)
            vosk_outputs[scenario] = text if conf is None else f"{text}  [avg_conf={conf:.2f}]"
    except sr.UnknownValueError:
        print(PROMPT_UNCLEAR)
        vosk_outputs[scenario] = "Unclear audio"
    except Exception as e:
        print(f"Vosk error: {str(e)}")
        vosk_outputs[scenario] = f"Error: {str(e)}"


[Vosk] Scenario: Clear male voice | File: lab3sample.wav
[Vosk] Scenario: Clear female voice | File: lab3sample.wav
[Vosk] Scenario: Fast speech | File: lab3sample.wav
[Vosk] Scenario: Noisy background | File: lab3sample.wav
[Vosk] Scenario: Soft voice | File: lab3sample.wav


**Inference:** Vosk enables fully offline recognition. In this run the model directory was not found, so Vosk was skipped for all five scenarios and each entry records the status “Skipped (Vosk model not found)”. When the model is available, the function returns the text and, if the model provides word confidences, their average.
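Vosk's published examples expect mono, 16-bit, uncompressed PCM WAV input, so a format check before recognition can explain empty or garbled results. The sketch below uses only the standard `wave` module; the helper names `is_vosk_ready` and `_write_wav` are hypothetical, and the format requirement is stated here as an assumption to verify against the Vosk documentation.

```python
import os
import tempfile
import wave


def is_vosk_ready(path):
    """Return True when the WAV is mono, 16-bit, uncompressed PCM."""
    try:
        with wave.open(path, "rb") as wf:
            return (wf.getnchannels() == 1
                    and wf.getsampwidth() == 2
                    and wf.getcomptype() == "NONE")
    except (wave.Error, OSError, EOFError):
        return False


def _write_wav(path, channels):
    """Write a short silent 16-bit WAV with the given channel count."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(b"\x00\x00" * channels * 1600)


tmp_dir = tempfile.mkdtemp()
mono_path = os.path.join(tmp_dir, "mono.wav")
stereo_path = os.path.join(tmp_dir, "stereo.wav")
_write_wav(mono_path, channels=1)
_write_wav(stereo_path, channels=2)

mono_ok = is_vosk_ready(mono_path)
stereo_ok = is_vosk_ready(stereo_path)
print(mono_ok, stereo_ok)
```

A file that fails the check could be converted with FFmpeg before being passed to the recognizer.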


In [None]:
# Google Web Speech via SpeechRecognition (online method).
# This method requires internet connectivity; API might throttle or be unavailable at times.

_recognizer = sr.Recognizer()

def transcribe_google_api(audio_path, language="en-US"):
    """Uses SpeechRecognition to call Google Web Speech and returns recognized text."""
    with sr.AudioFile(audio_path) as source:
        audio = _recognizer.record(source)
    print(PROMPT_DURING_RECOG)
    try:
        text = _recognizer.recognize_google(audio, language=language)
        text = (text or "").strip()
        if not text:
            raise sr.UnknownValueError(PROMPT_UNCLEAR)
        print(PROMPT_SUCCESS)
        return text
    except sr.UnknownValueError:
        print(PROMPT_UNCLEAR)
        return "Unclear audio"
    except sr.RequestError:
        print(PROMPT_SERVICE_DOWN)
        return "Service unavailable"
    except Exception as e:
        return f"Error: {str(e)}"

# Execute Google method on available scenarios.
google_outputs = {}
for scenario, path in SCENARIOS.items():
    if not file_available(path):
        google_outputs[scenario] = "N/A (file missing)"
        continue
    try:
        print(f"[Google] Scenario: {scenario} | File: {path}")
        google_outputs[scenario] = transcribe_google_api(path)
    except Exception as e:
        google_outputs[scenario] = f"Error: {str(e)}"


[Google] Scenario: Clear male voice | File: lab3sample.wav
Recognizing...
Speech successfully converted to text!
[Google] Scenario: Clear female voice | File: lab3sample.wav
Recognizing...
Speech successfully converted to text!
[Google] Scenario: Fast speech | File: lab3sample.wav
Recognizing...
Speech successfully converted to text!
[Google] Scenario: Noisy background | File: lab3sample.wav
Recognizing...
Speech successfully converted to text!
[Google] Scenario: Soft voice | File: lab3sample.wav
Recognizing...
Speech successfully converted to text!


**Inference:** The online method calls Google Web Speech through the SpeechRecognition library. “Recognizing…” is shown during processing, followed by a success or error message. If the request fails (for example, no network connection or a throttled API), `sr.RequestError` is caught and a service-unavailable message is returned instead of a transcript.


In [None]:
# Comparative table across scenarios and methods with simple quality notes.
def quality_note(text):
    """Heuristic note: flags very short outputs as potentially incomplete."""
    if not isinstance(text, str):
        return "N/A"
    clean = text.strip()
    if clean in ("", "Unclear audio", "Service unavailable"):
        return "Low/Unclear"
    words = len(clean.split())
    if words < 3:
        return "Very short"
    return "OK"

columns = ["Audio Type", "Whisper Output", "Vosk Output", "Google API Output", "Notes on Accuracy"]
table_rows = []
for scenario in SCENARIOS.keys():
    w = whisper_outputs.get(scenario, "N/A")
    v = vosk_outputs.get(scenario, "N/A")
    g = google_outputs.get(scenario, "N/A")
    # Compose a conservative note by combining heuristics across methods.
    notes = "; ".join(sorted(set([quality_note(w), quality_note(v), quality_note(g)])))
    table_rows.append([scenario, w, v, g, notes])

df_compare = pd.DataFrame(table_rows, columns=columns)
df_compare


|   | Audio Type | Whisper Output | Vosk Output | Google API Output | Notes on Accuracy |
|---|---|---|---|---|---|
| 0 | Clear male voice | I believe you're just talking nonsense. | Skipped (Vosk model not found) | I believe you are just talking nonsense | OK |
| 1 | Clear female voice | I believe you're just talking nonsense. | Skipped (Vosk model not found) | I believe you are just talking nonsense | OK |
| 2 | Fast speech | I believe you're just talking nonsense. | Skipped (Vosk model not found) | I believe you are just talking nonsense | OK |
| 3 | Noisy background | I believe you're just talking nonsense. | Skipped (Vosk model not found) | I believe you are just talking nonsense | OK |
| 4 | Soft voice | I believe you're just talking nonsense. | Skipped (Vosk model not found) | I believe you are just talking nonsense | OK |


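**Inference:** A comparison table is generated with one column per method. Two caveats apply to this run. First, all five scenarios use the same file (`lab3sample.wav`), so the rows are identical and do not yet test the different speaking conditions. Second, the “Notes on Accuracy” heuristic only flags empty, unclear, or very short outputs; the status string “Skipped (Vosk model not found)” has more than three words and is therefore also labelled “OK”. The column gives a quick qualitative check when reference transcripts are unavailable, not a measure of accuracy.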
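The `quality_note` heuristic counts words, so any status string with three or more words is reported as “OK”. A stricter variant could separate status strings from transcripts first. The sketch below is one possible approach; the name `quality_note_strict` and the “Not run” label are assumptions, not part of the original notebook.

```python
# Status prefixes produced by the scenario loops in this notebook.
NOT_RUN_PREFIXES = ("N/A", "Skipped", "Error:")
UNCLEAR_STATUSES = {"", "Unclear audio", "Service unavailable"}


def quality_note_strict(text):
    """Classify an output cell, treating status strings separately from transcripts."""
    if not isinstance(text, str):
        return "N/A"
    clean = text.strip()
    if clean in UNCLEAR_STATUSES:
        return "Low/Unclear"
    if clean.startswith(NOT_RUN_PREFIXES):
        return "Not run"
    if len(clean.split()) < 3:
        return "Very short"
    return "OK"


samples = [
    "I believe you're just talking nonsense.",
    "Skipped (Vosk model not found)",
    "N/A (file missing)",
    "Unclear audio",
    "hello",
]
for s in samples:
    print(f"{s!r} -> {quality_note_strict(s)}")
```

Applying this variant to the table above would change the notes for the Vosk column, so the recorded output would need to be regenerated.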


In [None]:
# Persist comparison table to CSV for submission artifacts.
OUT_CSV = "lab3_comparison_table.csv"
df_compare.to_csv(OUT_CSV, index=False)
print(f"Saved: {OUT_CSV}")


Saved: lab3_comparison_table.csv


**Inference:** The comparison table is saved as `lab3_comparison_table.csv` to support the deliverable that requires a completed table.


In [None]:
# Brief report synthesis based on observed outputs (length and clarity heuristics).
def best_method_per_scenario(row):
    """Selects a 'best' method per scenario based on output length and clarity."""
    candidates = {
        "Whisper": row["Whisper Output"],
        "Vosk": row["Vosk Output"],
        "Google API": row["Google API Output"],
    }
    scores = {}
    for k, v in candidates.items():
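        if not isinstance(v, str):
            scores[k] = -1
            continue
        # Status and error strings are not transcripts, so they score zero.
        statuses = ("N/A (file missing)", "Skipped (Vosk model not found)", "Unclear audio", "Service unavailable")
        if v in statuses or v.startswith("Error:"):
            scores[k] = 0
        else:
            scores[k] = len(v.split())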
    # Choose the method with the highest token count (proxy for completeness).
    best = max(scores.items(), key=lambda x: x[1])[0]
    return best

lines = []
lines.append("Lab 3 — Brief Inference Report")
lines.append("")
for _, row in df_compare.iterrows():
    scenario = row["Audio Type"]
    best = best_method_per_scenario(row)
    note = row["Notes on Accuracy"]
    lines.append(f"- {scenario}: Best method (heuristic) → {best}; Notes → {note}")

lines.append("")
lines.append("General Observations:")
lines.append("1) Offline Whisper generally returns robust full-sentence outputs when audio is clear.")
lines.append("2) Vosk provides offline recognition with optional word-level confidences; accuracy is model-dependent.")
lines.append("3) Google Web Speech (online) can perform well but is subject to service availability and network quality.")
lines.append("4) Error handling covers unclear audio and service unavailability with explicit messages.")
lines.append("")
lines.append("Future Improvements:")
lines.append("- Add domain language model adaptation (Vosk) and prompt conditioning (Whisper).")
lines.append("- Apply VAD, denoising, and automatic gain control for noisy/soft inputs.")
lines.append("- Implement WER-based evaluation against reference transcripts when available.")
lines.append("- Provide real-time streaming pipelines and device-control intent parsing as an extension.")

REPORT_PATH = "lab3_report.txt"
with open(REPORT_PATH, "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(f"Saved: {REPORT_PATH}")
print("\n".join(lines[:8]))  # Preview first few lines


Saved: lab3_report.txt
Lab 3 — Brief Inference Report

- Clear male voice: Best method (heuristic) → Google API; Notes → OK
- Clear female voice: Best method (heuristic) → Google API; Notes → OK
- Fast speech: Best method (heuristic) → Google API; Notes → OK
- Noisy background: Best method (heuristic) → Google API; Notes → OK
- Soft voice: Best method (heuristic) → Google API; Notes → OK



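**Inference:** A concise report is generated summarizing per-scenario outcomes, overall behavior, and suggested improvements. The “best” method per scenario is chosen by a simple completeness heuristic (token count), which should not be read as accuracy. In this run Google API is selected in every scenario only because its output spells out “you are” (7 tokens) where Whisper writes “you're” (6 tokens); the two transcripts are otherwise the same sentence, and Vosk scored zero because it was skipped. Explicit error messages document the failure cases.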
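The report lists WER-based evaluation as a future improvement. A minimal sketch of word error rate, computed as word-level edit distance divided by the reference length, is shown below. The reference transcript is a placeholder chosen for illustration (the lab provides no ground-truth text), and the `normalize` rule is an assumption.

```python
import re


def normalize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)


# Placeholder reference; NOT a verified ground truth for lab3sample.wav.
reference = "I believe you are just talking nonsense"
whisper_text = "I believe you're just talking nonsense."
google_text = "I believe you are just talking nonsense"

wer_whisper = word_error_rate(reference, whisper_text)
wer_google = word_error_rate(reference, google_text)
print(f"Whisper WER: {wer_whisper:.3f}")
print(f"Google WER:  {wer_google:.3f}")
```

Against this placeholder reference the contraction alone costs Whisper two edits out of seven words, which shows why text normalization (for example, expanding contractions) matters before comparing engines.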
