# VOSK Acoustic Model Fine-tuning (Colab)

This notebook walks through adapting (fine-tuning) a VOSK (Kaldi-based) acoustic model
on a **small dataset**. It downloads a VOSK model, fetches a small subset of speech data (Mozilla Common Voice),
creates Kaldi-format data files, and shows the Kaldi commands for adaptation (forced alignment, fMLLR, SAT, simple fine-tuning).

**Files produced:** `/mnt/data/vosk_finetune_colab.ipynb` (this notebook).  
**Helpful links (in notebook):** VOSK models, Common Voice, LibriSpeech, TED-LIUM.  

**Notes:** Building Kaldi in Colab can take 10–30 minutes. Some dataset downloads are large; the notebook shows how to fetch a *small* subset for quick testing.

---


## 1) Colab setup — install prerequisites

This cell installs common packages, `vosk`, `sox`, `ffmpeg`, `git`, and the `datasets` library to fetch a small subset of Common Voice.
It also provides commands to clone Kaldi if you plan to build it (optional; building Kaldi takes time).

In [1]:
# Install basic tools (run in Colab)
!apt-get update -qq
!apt-get install -y -qq sox ffmpeg git build-essential automake autoconf libtool wget python3-dev
# Install Python packages
!pip install -q vosk==0.3.52 datasets soundfile tqdm

# (Optional) Clone Kaldi - building Kaldi is time-consuming. Uncomment to clone.
# !git clone --depth=1 https://github.com/kaldi-asr/kaldi.git /content/kaldi
# After cloning, follow Kaldi's INSTALL instructions. Many users build in Colab but it may take 10-30 min.
print('Basic tools installed. If you plan to build Kaldi, uncomment the clone command and follow Kaldi INSTALL instructions.')

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Selecting previously unselected package javascript-common.
(Reading database ... 121229 files and directories currently installed.)
Preparing to unpack .../00-javascript-common_11+nmu1_all.deb ...
Unpacking javascript-common (11+nmu1) ...
Selecting previously unselected package libjs-underscore.
Preparing to unpack .../01-libjs-underscore_1.13.2~dfsg-2_all.deb ...
Unpacking libjs-underscore (1.13.2~dfsg-2) ...
Selecting previously unselected package libjs-sphinxdoc.
Preparing to unpack .../02-libjs-sphinxdoc_4.3.2-1_all.deb ...
Unpacking libjs-sphinxdoc (4.3.2-1) ...
Selecting previously unselected package libopencore-amrnb0:amd64.
Preparing to unpack .../03-libopencore-amrnb0_0.1.5-1_amd64.deb ...
Unpacking libopencore-amrnb0:amd64 (0.1.5-1) ...
Selecting previously unselected package libopencore-am
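
Since the apt output is long and easy to misread, a quick check that the command-line tools actually landed on `PATH` can save time later. The following is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical, and the tool list is taken from the install line above.

```python
import shutil

def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Tool names come from the apt-get install line in this cell.
missing = missing_tools(["sox", "ffmpeg", "git", "wget"])
print("Missing tools:", missing or "none")
```

An empty list means every tool resolved; anything listed needs to be installed again before continuing.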

## 2) Download a VOSK model

Choose a VOSK model (small for faster experiments). The notebook downloads `vosk-model-small-en-us-0.15` as an example.
Change the URL to the model you prefer (e.g., `vosk-model-en-us-0.22`). See https://alphacephei.com/vosk/models

In [2]:
# Download and extract a VOSK model (example small English model)
# Change the URL if you want a different VOSK model version.
MODEL_URL="https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip"
MODEL_DIR="/content/models/vosk-model-small-en-us-0.15"  # the zip is extracted under /content/models below
!mkdir -p /content/models
!wget -q --show-progress -O /content/models/vosk_model.zip "$MODEL_URL"
!unzip -q /content/models/vosk_model.zip -d /content/models
!ls -lh /content/models
print('VOSK model downloaded to /content/models')

total 40M
drwxr-xr-x 6 root root 4.0K Dec  8  2020 vosk-model-small-en-us-0.15
-rw-r--r-- 1 root root  40M Dec  8  2020 vosk_model.zip
VOSK model downloaded to /content/models
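
After extraction it can help to confirm that the model directory contains the files the later steps rely on. The sketch below assumes the layout described in the VOSK model documentation (`am/final.mdl` and `conf/mfcc.conf`); the helper name `missing_model_files` and the exact file list are assumptions, so adjust them for other models.

```python
import os

# Assumed layout, based on the published VOSK model structure; adjust if your model differs.
EXPECTED_FILES = ("am/final.mdl", "conf/mfcc.conf")

def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under model_dir."""
    return [rel for rel in expected
            if not os.path.isfile(os.path.join(model_dir, *rel.split("/")))]

missing = missing_model_files("/content/models/vosk-model-small-en-us-0.15")
print("Model looks complete" if not missing else f"Missing: {missing}")
```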


## 3) Download a small dataset (Common Voice subset)

Use the `datasets` library to download a small subset (e.g., 100 short clips) from Mozilla Common Voice. This is quick and sufficient for a proof of concept.
If you prefer LibriSpeech or TED-LIUM, swap the dataset code — links are provided in the notebook.

In [6]:
# Colab cell: install torchcodec and supporting libs, then show free disk space
# (runs quickly; small install)
!pip install -q torchcodec datasets soundfile
# optionally install huggingface hub login helper if you plan to use private/gated datasets:
!pip install -q huggingface_hub

# Show disk space so you can confirm there's enough room
!df -h / | sed -n '1,2p'


Filesystem      Size  Used Avail Use% Mounted on
overlay         108G   97G   11G  90% /


In [8]:
# Colab cell: login to Hugging Face (secure)
from huggingface_hub import login
# This will prompt you for your HF token in an input box
login()



In [10]:
# Single robust Colab cell: install torchcodec, attempt HF datasets, fallback to LibriSpeech, else create TTS demo.
# Paste and run this cell in Colab (it may take ~1-2 minutes to install packages).

# 0) Install required packages (torchcodec required to decode audio columns)
!pip install -q torchcodec datasets soundfile huggingface_hub

# Optional useful libs for fallback TTS creation (small)
!pip install -q gTTS librosa

print("Installed packages. Now attempting to load datasets...")

# 1) Robust loader with fallback
from datasets import load_dataset
from huggingface_hub import login as hf_login
import soundfile as sf, os, traceback, sys
import numpy as np

OUT_BASE = '/content/data'
os.makedirs(OUT_BASE, exist_ok=True)

def save_dataset(ds, name, max_examples=100, max_duration=5.0):
    wav_dir = os.path.join(OUT_BASE, name, 'wav')
    os.makedirs(wav_dir, exist_ok=True)
    transcripts_path = os.path.join(OUT_BASE, name, 'transcripts.txt')
    saved = 0
    with open(transcripts_path, 'w') as tf:
        for ex in ds:
            if saved >= max_examples:
                break
            # robustly get audio field
            audio = None
            for k in ('audio','audio_file','speech','speech_array'):
                if isinstance(ex, dict) and k in ex:
                    audio = ex[k]
                    break
            if audio is None:
                # sometimes datasets return nested types (use .get if available)
                try:
                    audio = ex.get('audio') if hasattr(ex, 'get') else None
                except Exception:
                    audio = None
            if not audio:
                continue
            # audio may be dict {'array', 'sampling_rate'}
            if isinstance(audio, dict):
                arr = audio.get('array')
                sr = audio.get('sampling_rate', 16000)
            else:
                # fallback: audio might be path; try reading
                try:
                    arr, sr = sf.read(audio)
                except Exception:
                    continue
            if arr is None or len(arr)==0:
                continue
            duration = len(arr) / float(sr) if sr else 0
            if duration > max_duration or duration <= 0:
                continue
            path = os.path.join(wav_dir, f'utt{saved}.wav')
            # convert floats in [-1, 1] to int16, clipping first to avoid integer wrap-around
            if np.issubdtype(getattr(arr,'dtype',np.float32), np.floating):
                arr_to_write = (np.clip(arr, -1.0, 1.0) * 32767).astype('int16')
            else:
                arr_to_write = arr.astype('int16')
            try:
                sf.write(path, arr_to_write, sr)
            except Exception as e:
                # skip file if write fails
                print("Warning: failed to write", path, e)
                continue
            # transcripts: try common keys
            transcript = ''
            for key in ('sentence','text','transcript'):
                try:
                    transcript = ex.get(key,'') if isinstance(ex, dict) else ''
                    if transcript:
                        break
                except Exception:
                    transcript = ''
            transcript = (transcript or '').upper().strip()
            tf.write(f'utt{saved} {transcript}\n')
            saved += 1
    print(f"Saved {saved} examples to {wav_dir} and transcripts at {transcripts_path}")
    return saved

loaded = False

# Try to login (will prompt) — you can skip if you don't want to login
print("\nIf the dataset requires HF access (Common Voice), a login prompt will appear. You may interrupt the prompt to skip login and use the fallback.")
try:
    hf_login()  # interactive prompt: paste your HF token if required
except (Exception, KeyboardInterrupt) as e:  # KeyboardInterrupt is not an Exception subclass
    print("HF login skipped or failed (continuing with public fallbacks).", e)

# Try several Common Voice versions (some snapshots may be gated)
common_voice_versions = ['common_voice_13_0','common_voice_11_0','common_voice_10_0','common_voice_9_0']
for ver in common_voice_versions:
    repo = f"mozilla-foundation/{ver}"
    try:
        print(f"Trying to load {repo} (english) with a small slice...")
        ds = load_dataset(repo, 'en', split='train[:1%]')
        # quick filter if duration available
        try:
            short_ds = ds.filter(lambda x: x.get('duration', 0) < 5.0)
        except Exception:
            short_ds = ds
        short_ds = short_ds.select(range(min(200, len(short_ds))))
        saved = save_dataset(short_ds, 'commonvoice_sample', max_examples=80, max_duration=5.0)
        if saved > 0:
            loaded = True
            dataset_name = 'commonvoice_sample'
            break
    except Exception as e:
        print(f"Failed to load {repo}: {e}")
        traceback.print_exc()

# If not loaded, fall back to LibriSpeech (public)
if not loaded:
    try:
        print("Falling back to LibriSpeech (public). Downloading a small slice...")
        ds = load_dataset('librispeech_asr', 'clean', split='train.100')
        # filter short clips if possible
        try:
            short_ds = ds.filter(lambda x: x['audio']['array'].shape[0] / x['audio']['sampling_rate'] < 5.0)
        except Exception:
            short_ds = ds
        short_ds = short_ds.select(range(min(200, len(short_ds))))
        saved = save_dataset(short_ds, 'librispeech_sample', max_examples=80, max_duration=5.0)
        if saved > 0:
            loaded = True
            dataset_name = 'librispeech_sample'
    except Exception as e:
        print("LibriSpeech fallback failed:", e)
        traceback.print_exc()

# Final fallback: create a tiny TTS dataset using gTTS (guaranteed)
if not loaded:
    try:
        print("All external dataset attempts failed. Creating a small synthetic TTS demo dataset (quick).")
        # create TTS dataset (~80 files) - small and safe
        from gtts import gTTS
        import librosa
        base = os.path.join(OUT_BASE, 'tts_demo')
        wav_dir = os.path.join(base, 'wav')
        os.makedirs(wav_dir, exist_ok=True)
        sentences = [
            "Hello, this is a test audio.",
            "This dataset is for VOSK acoustic adaptation experiments.",
            "Please speak clearly and slowly.",
            "I am testing fine tuning with a small dataset.",
            "The quick brown fox jumps over the lazy dog."
        ]
        idx = 0
        texts = {}  # utterance ID -> spoken text, so transcripts match the audio
        for rep in range(16):  # 16*5 = 80
            for s in sentences:
                text = s if rep % 2 == 0 else f"{s} variation {rep}"
                fname = os.path.join(wav_dir, f'utt{idx}.wav')
                mp3path = fname + '.mp3'
                tts = gTTS(text=text, lang='en')
                tts.save(mp3path)
                # convert mp3 -> wav at 16kHz mono
                y, sr = librosa.load(mp3path, sr=16000, mono=True)
                sf.write(fname, y, 16000, subtype='PCM_16')
                os.remove(mp3path)
                texts[f'utt{idx}'] = text
                idx += 1
                if idx >= 80:
                    break
            if idx >= 80:
                break
        # write Kaldi-style files, with utterance IDs in sorted order as Kaldi expects
        utts = sorted(texts)
        with open(os.path.join(base,'text'), 'w') as t, \
             open(os.path.join(base,'wav.scp'), 'w') as w, \
             open(os.path.join(base,'utt2spk'), 'w') as u:
            for utt in utts:
                path = os.path.join(wav_dir, f'{utt}.wav')
                # uppercase the spoken text and drop punctuation; digits such as
                # "variation 3" are spoken as words and may need spelling out
                transcript = ''.join(c for c in texts[utt].upper() if c.isalnum() or c.isspace())
                t.write(f'{utt} {transcript}\n')
                w.write(f'{utt} {path}\n')
                u.write(f'{utt} spk1\n')
        with open(os.path.join(base,'spk2utt'),'w') as s:
            s.write('spk1 ' + ' '.join(utts) + '\n')
        dataset_name = 'tts_demo'
        loaded = True
        print("TTS demo dataset created at:", base)
    except Exception as e:
        print("TTS fallback also failed:", e)
        traceback.print_exc()

if loaded:
    print(f"Dataset ready: /content/data/{dataset_name}")
    print("Kaldi-style files: text, wav.scp, utt2spk, spk2utt located in that folder.")
else:
    raise RuntimeError("Unable to fetch or create a usable dataset automatically. Provide audio files or free up space and retry.")


Installed packages. Now attempting to load datasets...
\nIf dataset requires HF access (Common Voice), a login prompt will appear. You may press Ctrl+C to skip login and use fallback.



Trying to load mozilla-foundation/common_voice_13_0 (english) with a small slice...


Repo card metadata block was not found. Setting CardData to empty.


Failed to load mozilla-foundation/common_voice_13_0: The directory at hf://datasets/mozilla-foundation/common_voice_13_0@ff2bbb54dcdb597100fe534a1b911ff9103f9e22 doesn't contain any data files
Trying to load mozilla-foundation/common_voice_11_0 (english) with a small slice...


Traceback (most recent call last):
  File "/tmp/ipython-input-572255742.py", line 101, in <cell line: 0>
    ds = load_dataset(repo, 'en', split='train[:1%]')
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1392, in load_dataset
    verification_mode = VerificationMode(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1132, in load_dataset_builder
    if storage_options is not None:
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1025, in dataset_module_factory
    except Exception:
            ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1004, in dataset_module_factory
    data_dir=data_dir,
    ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 631, in get_module
    else:
          
  File "/usr/local/lib/python3

Failed to load mozilla-foundation/common_voice_11_0: Dataset 'mozilla-foundation/common_voice_11_0' doesn't exist on the Hub or cannot be accessed.
Trying to load mozilla-foundation/common_voice_10_0 (english) with a small slice...
Failed to load mozilla-foundation/common_voice_10_0: Dataset 'mozilla-foundation/common_voice_10_0' doesn't exist on the Hub or cannot be accessed.


Traceback (most recent call last):
  File "/tmp/ipython-input-572255742.py", line 101, in <cell line: 0>
    ds = load_dataset(repo, 'en', split='train[:1%]')
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1392, in load_dataset
    verification_mode = VerificationMode(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1132, in load_dataset_builder
    if storage_options is not None:
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1025, in dataset_module_factory
    except Exception:
            ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 980, in dataset_module_factory
    except RevisionNotFoundError as e:
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
datasets.exceptions.DatasetNotFoundError: Dataset 'mozilla-foundation/common_voice_10_0' doesn't exist o

Trying to load mozilla-foundation/common_voice_9_0 (english) with a small slice...
Failed to load mozilla-foundation/common_voice_9_0: Dataset 'mozilla-foundation/common_voice_9_0' doesn't exist on the Hub or cannot be accessed.


Traceback (most recent call last):
  File "/tmp/ipython-input-572255742.py", line 101, in <cell line: 0>
    ds = load_dataset(repo, 'en', split='train[:1%]')
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1392, in load_dataset
    verification_mode = VerificationMode(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1132, in load_dataset_builder
    if storage_options is not None:
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1025, in dataset_module_factory
    except Exception:
            ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 980, in dataset_module_factory
    except RevisionNotFoundError as e:
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
datasets.exceptions.DatasetNotFoundError: Dataset 'mozilla-foundation/common_voice_9_0' doesn't exist on

Falling back to LibriSpeech (public). Downloading a small slice...


Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Filter:   0%|          | 0/28539 [00:00<?, ? examples/s]

LibriSpeech fallback failed: To support decoding audio data, please install 'torchcodec'.
All external dataset attempts failed. Creating a small synthetic TTS demo dataset (quick).


Traceback (most recent call last):
  File "/tmp/ipython-input-572255742.py", line 128, in <cell line: 0>
    saved = save_dataset(short_ds, 'librispeech_sample', max_examples=80, max_duration=5.0)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ipython-input-572255742.py", line 27, in save_dataset
    for ex in ds:
              ^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/arrow_dataset.py", line 2466, in __iter__
    return self.num_rows
                         
  File "/usr/local/lib/python3.12/dist-packages/datasets/formatting/formatting.py", line 657, in format_table
    if format_columns is None:
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/formatting/formatting.py", line 410, in __call__
    if query_type == "row":
               ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/formatting/formatting.py", line 459, in format_row
    row = self.

TTS demo dataset created at: /content/data/tts_demo
Dataset ready: /content/data/tts_demo
Kaldi-style files: text, wav.scp, utt2spk, spk2utt located in that folder.
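
The cell above follows a fallback pattern: try each data source in order and stop at the first one that yields usable data. Stripped of the dataset details, the pattern can be sketched as below. The helper `first_successful` and the two toy loaders are hypothetical and stand in for the Common Voice, LibriSpeech, and TTS branches.

```python
def first_successful(loaders):
    """Try (name, callable) pairs in order; return (name, result) from the first
    callable that returns a truthy result without raising."""
    errors = []
    for name, load in loaders:
        try:
            result = load()
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if result:
            return name, result
        errors.append(f"{name}: returned no data")
    raise RuntimeError("All sources failed:\n" + "\n".join(errors))

# Toy loaders standing in for real dataset downloads.
def load_gated():
    raise PermissionError("dataset is gated")

def load_synthetic():
    return ["utt0", "utt1"]

name, data = first_successful([("gated", load_gated), ("synthetic", load_synthetic)])
print(name, len(data))  # synthetic 2
```

Collecting the error messages instead of printing full tracebacks keeps the output short while still showing why each source was skipped.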


In [9]:
# Robust dataset loader: try Common Voice (requires HF token/accepted terms), else fallback to LibriSpeech.
# Saves short clips (<=5s) as wav and writes transcripts in uppercase in /content/data/<name>.
from datasets import load_dataset, Dataset
import soundfile as sf, os, sys, traceback
import numpy as np

OUT_BASE = '/content/data'
os.makedirs(OUT_BASE, exist_ok=True)

def save_dataset(ds, name, max_examples=200, max_duration=5.0):
    wav_dir = os.path.join(OUT_BASE, name, 'wav')
    os.makedirs(wav_dir, exist_ok=True)
    transcripts_path = os.path.join(OUT_BASE, name, 'transcripts.txt')
    saved = 0
    with open(transcripts_path, 'w') as tf:
        for i, ex in enumerate(ds):
            if saved >= max_examples:
                break
            audio = ex.get('audio') or ex.get('audio_file') or ex.get('audio_path') or ex.get('speech')
            # datasets' audio column is a dict with 'array' and 'sampling_rate'
            if not audio:
                continue
            arr = audio.get('array') if isinstance(audio, dict) else audio
            sr = audio.get('sampling_rate') if isinstance(audio, dict) else 16000
            duration = len(arr) / sr if len(arr)>0 else 0
            if duration > max_duration or duration<=0:
                continue
            path = os.path.join(wav_dir, f'utt{saved}.wav')
            # ensure np.float32/int16... convert to int16 if necessary
            if np.issubdtype(arr.dtype, np.floating):
                # scale floats (-1..1) to int16
                arr_to_write = (arr * 32767).astype('int16')
            else:
                arr_to_write = arr.astype('int16')
            sf.write(path, arr_to_write, sr)
            # transcripts: Common Voice uses 'sentence', LibriSpeech uses 'text'
            transcript = (ex.get('sentence') or ex.get('text') or ex.get('transcript') or '').upper().strip()
            tf.write(f'utt{saved} {transcript}\n')
            saved += 1
    print(f"Saved {saved} examples to {wav_dir} and transcripts at {transcripts_path}")
    return saved

# Try Common Voice snapshots (may require authentication / accepting terms)
common_voice_versions = ['common_voice_13_0', 'common_voice_11_0', 'common_voice_10_0', 'common_voice_9_0']
loaded = False
for ver in common_voice_versions:
    repo = f"mozilla-foundation/{ver}"
    try:
        print(f"Trying to load {repo} (english)...")
        ds = load_dataset(repo, 'en', split='train[:1%]', token=True)  # `token` replaces the deprecated `use_auth_token`
        # filter short clips if audio and duration available; otherwise attempt to load and then filter
        try:
            short_ds = ds.filter(lambda x: x.get('duration', 0) < 5.0)
        except Exception as e:
            short_ds = ds
        short_ds = short_ds.select(range(min(200, len(short_ds))))
        saved = save_dataset(short_ds, 'commonvoice_sample', max_examples=100, max_duration=5.0)
        if saved>0:
            loaded = True
            break
    except Exception as e:
        print(f"Failed to load {repo}: {e}")
        traceback.print_exc()
        continue

# If Common Voice didn't load, fall back to LibriSpeech (public)
if not loaded:
    try:
        print('Falling back to LibriSpeech (public). Downloading a small slice...')
        ds = load_dataset('librispeech_asr', 'clean', split='train.100')  # train-clean-100
        # Keep short clips only and a small subset
        try:
            short_ds = ds.filter(lambda x: x['audio']['array'].shape[0] / x['audio']['sampling_rate'] < 5.0)
        except Exception:
            short_ds = ds
        short_ds = short_ds.select(range(min(200, len(short_ds))))
        saved = save_dataset(short_ds, 'librispeech_sample', max_examples=150, max_duration=5.0)
        if saved>0:
            loaded = True
    except Exception as e:
        print('LibriSpeech fallback failed:', e)
        traceback.print_exc()

if not loaded:
    raise RuntimeError('Unable to fetch a usable dataset automatically. Please provide audio files or accept Common Voice terms and retry.')

Trying to load mozilla-foundation/common_voice_13_0 (english)...


Repo card metadata block was not found. Setting CardData to empty.


Failed to load mozilla-foundation/common_voice_13_0: The directory at hf://datasets/mozilla-foundation/common_voice_13_0@ff2bbb54dcdb597100fe534a1b911ff9103f9e22 doesn't contain any data files
Trying to load mozilla-foundation/common_voice_11_0 (english)...
Failed to load mozilla-foundation/common_voice_11_0: Dataset 'mozilla-foundation/common_voice_11_0' doesn't exist on the Hub or cannot be accessed.
Trying to load mozilla-foundation/common_voice_10_0 (english)...


Traceback (most recent call last):
  File "/tmp/ipython-input-3630498348.py", line 50, in <cell line: 0>
    ds = load_dataset(repo, 'en', split='train[:1%]', use_auth_token=True)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1392, in load_dataset
    verification_mode = VerificationMode(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1132, in load_dataset_builder
    if storage_options is not None:
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1025, in dataset_module_factory
    except Exception:
            ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1004, in dataset_module_factory
    data_dir=data_dir,
    ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 631, in get_module
    else:

Failed to load mozilla-foundation/common_voice_10_0: Dataset 'mozilla-foundation/common_voice_10_0' doesn't exist on the Hub or cannot be accessed.
Trying to load mozilla-foundation/common_voice_9_0 (english)...


Traceback (most recent call last):
  File "/tmp/ipython-input-3630498348.py", line 50, in <cell line: 0>
    ds = load_dataset(repo, 'en', split='train[:1%]', use_auth_token=True)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1392, in load_dataset
    verification_mode = VerificationMode(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1132, in load_dataset_builder
    if storage_options is not None:
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1025, in dataset_module_factory
    except Exception:
            ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 980, in dataset_module_factory
    except RevisionNotFoundError as e:
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
datasets.exceptions.DatasetNotFoundError: Dataset 'mozilla-fou

Failed to load mozilla-foundation/common_voice_9_0: Dataset 'mozilla-foundation/common_voice_9_0' doesn't exist on the Hub or cannot be accessed.
Falling back to LibriSpeech (public). Downloading a small slice...


Traceback (most recent call last):
  File "/tmp/ipython-input-3630498348.py", line 50, in <cell line: 0>
    ds = load_dataset(repo, 'en', split='train[:1%]', use_auth_token=True)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1392, in load_dataset
    verification_mode = VerificationMode(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1132, in load_dataset_builder
    if storage_options is not None:
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1025, in dataset_module_factory
    except Exception:
            ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 980, in dataset_module_factory
    except RevisionNotFoundError as e:
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
datasets.exceptions.DatasetNotFoundError: Dataset 'mozilla-fou

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Filter:   0%|          | 0/28539 [00:00<?, ? examples/s]

LibriSpeech fallback failed: To support decoding audio data, please install 'torchcodec'.


Traceback (most recent call last):
  File "/tmp/ipython-input-3630498348.py", line 77, in <cell line: 0>
    saved = save_dataset(short_ds, 'librispeech_sample', max_examples=150, max_duration=5.0)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ipython-input-3630498348.py", line 16, in save_dataset
    for i, ex in enumerate(ds):
                 ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/arrow_dataset.py", line 2466, in __iter__
    return self.num_rows
                         
  File "/usr/local/lib/python3.12/dist-packages/datasets/formatting/formatting.py", line 657, in format_table
    if format_columns is None:
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/formatting/formatting.py", line 410, in __call__
    if query_type == "row":
               ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/formatting/formatting.py", line 459

RuntimeError: Unable to fetch a usable dataset automatically. Please provide audio files or accept Common Voice terms and retry.
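
If you supply your own recordings instead, each file should match what the rest of the notebook produces: mono, 16-bit PCM WAV, no longer than 5 seconds. The sketch below checks this with the standard `wave` module. The function names are hypothetical, and the 16 kHz default is an assumption taken from the TTS fallback above, so match it to your model's sample rate.

```python
import wave

def wav_info(path):
    """Return (sample_rate, channels, sample_width_bytes, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return rate, w.getnchannels(), w.getsampwidth(), w.getnframes() / rate

def is_usable(path, rate=16000, max_duration=5.0):
    """True if the file is mono 16-bit PCM at `rate` Hz and at most `max_duration` seconds long."""
    sr, channels, width, duration = wav_info(path)
    return sr == rate and channels == 1 and width == 2 and 0 < duration <= max_duration

# Example call (the path is a placeholder):
# print(is_usable("/content/data/my_audio/wav/utt0.wav"))
```

Files that fail the check can be converted with `sox` or `ffmpeg`, both installed in section 1.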

## 4) Prepare Kaldi-style data files (`wav.scp`, `text`, `utt2spk`, `spk2utt`)

This cell generates the basic Kaldi data files from the small dataset saved above. For small experiments, map every utterance to one placeholder speaker, or derive speaker IDs from the dataset metadata.

In [None]:
# Build a Kaldi data dir from the dataset saved in section 3
import os, glob

# dataset_name is set by the loader cell above
# (e.g. 'commonvoice_sample', 'librispeech_sample' or 'tts_demo')
src_dir = os.path.join('/content/data', dataset_name)
wav_dir = os.path.join(src_dir, 'wav')
out_dir = '/content/data/kaldi_mydata'
os.makedirs(out_dir, exist_ok=True)

# Read transcripts: the loader writes transcripts.txt, the TTS fallback writes a Kaldi 'text' file
transcripts = {}
for name in ('transcripts.txt', 'text'):
    path = os.path.join(src_dir, name)
    if os.path.exists(path):
        with open(path, 'r') as f:
            for line in f:
                parts = line.strip().split(' ', 1)
                if len(parts) == 2:
                    transcripts[parts[0]] = parts[1]
        break

# Take utterance IDs from the file names. Do not renumber with enumerate():
# sorted() puts utt10 before utt2, which would pair audio with the wrong transcript.
wavs = {os.path.splitext(os.path.basename(p))[0]: p
        for p in glob.glob(os.path.join(wav_dir, '*.wav'))}
utt_ids = sorted(u for u in wavs if transcripts.get(u))  # sorted IDs; skip empty transcripts

with open(os.path.join(out_dir, 'wav.scp'), 'w') as w, \
     open(os.path.join(out_dir, 'text'), 'w') as t, \
     open(os.path.join(out_dir, 'utt2spk'), 'w') as u:
    for utt in utt_ids:
        w.write(f'{utt} {wavs[utt]}\n')
        t.write(f'{utt} {transcripts[utt]}\n')
        u.write(f'{utt} spk1\n')  # single placeholder speaker

# write spk2utt
with open(os.path.join(out_dir, 'spk2utt'), 'w') as s:
    s.write('spk1 ' + ' '.join(utt_ids) + '\n')

print('Kaldi data files written to', out_dir)
print('Preview:')
!ls -lh /content/data/kaldi_mydata
!sed -n '1,10p' /content/data/kaldi_mydata/text
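
Before handing the directory to Kaldi, a quick consistency check can catch missing files, unsorted IDs, and mismatched utterance lists. The sketch below is a lightweight stand-in under those assumptions, not a replacement for Kaldi's own `utils/validate_data_dir.sh`; the function names are hypothetical.

```python
import os

def read_ids(path):
    """Return the first whitespace-separated field of every non-empty line."""
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]

def check_data_dir(data_dir):
    """Return a list of problems found in wav.scp, text and utt2spk (empty list means OK)."""
    problems = []
    ids = {}
    for name in ("wav.scp", "text", "utt2spk"):
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            problems.append(f"{name}: missing")
            continue
        ids[name] = read_ids(path)
        if ids[name] != sorted(ids[name]):
            problems.append(f"{name}: utterance IDs are not sorted")
        if len(set(ids[name])) != len(ids[name]):
            problems.append(f"{name}: duplicate utterance IDs")
    # All files that exist must list exactly the same IDs in the same order.
    if len({tuple(v) for v in ids.values()}) > 1:
        problems.append("wav.scp, text and utt2spk list different utterance IDs")
    return problems

print(check_data_dir("/content/data/kaldi_mydata") or "data dir looks consistent")
```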

## 5) Example Kaldi commands (run on a machine with Kaldi installed)

The following commands show feature extraction, alignment with an existing model, and an adaptation approach (fMLLR/SAT). Run them inside a Kaldi environment, not directly in this Python cell: paste them into a bash cell in Colab *after* you build Kaldi, or run them on a local machine where Kaldi is available.

Change `EXP_DIR` and `MODEL_DIR` to match your paths.

In [None]:
# Example shell commands: run them in bash from a Kaldi recipe directory (one with steps/ and utils/).
bash_commands = r'''
# Set paths
DATA_DIR=/content/data/kaldi_mydata
MODEL_DIR=/content/models/vosk-model-small-en-us-0.15  # adjust to your extracted VOSK model path
EXP_DIR=/content/exp_vosk_adapt

# The data dir built above has a single speaker (spk1). Kaldi splits jobs by speaker,
# so --nj must not exceed the number of speakers; hence --nj 1 below.

# 1) Make MFCCs and compute CMVN
steps/make_mfcc.sh --nj 1 --cmd "run.pl" $DATA_DIR exp/make_mfcc $DATA_DIR/mfcc
steps/compute_cmvn_stats.sh $DATA_DIR exp/make_mfcc $DATA_DIR/mfcc

# 2) Prepare lang (use model's dict and lang if available)
# You may need to copy model's 'fst' and 'graph' or prepare your own using utils/prepare_lang.sh

# 3) Align with existing model (requires a GMM model in Kaldi exp/ format;
#    a VOSK nnet3 model directory cannot be passed to align_si.sh as-is)
steps/align_si.sh --nj 1 --cmd "run.pl" $DATA_DIR data/lang $MODEL_DIR exp/align_mydata

# 4) Train SAT (speaker adaptive training) - useful for small data adaptation
steps/train_sat.sh --cmd "run.pl" 200 1000 $DATA_DIR data/lang exp/align_mydata $EXP_DIR/sat_model

# 5) Decode with fMLLR transforms (usage: <graph-dir> <data-dir> <decode-dir>;
#    the decode dir must be a subdirectory of the model dir)
steps/decode_fmllr.sh --nj 1 --cmd "run.pl" exp/graph $DATA_DIR $EXP_DIR/sat_model/decode_adapt
'''
print(bash_commands)
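
Kaldi's scripts split the data by speaker, so the `--nj` value cannot exceed the number of speakers listed in `spk2utt`. A small helper can pick a safe value before the commands are run. This is a sketch under that assumption; `count_speakers` and `safe_nj` are hypothetical names.

```python
def count_speakers(spk2utt_path):
    """Count non-empty lines in spk2utt (one line per speaker)."""
    with open(spk2utt_path) as f:
        return sum(1 for line in f if line.strip())

def safe_nj(spk2utt_path, requested=4):
    """Cap the requested job count at the number of speakers (never below 1)."""
    return max(1, min(requested, count_speakers(spk2utt_path)))

# Example call (path from section 4):
# print(safe_nj("/content/data/kaldi_mydata/spk2utt"))
```

With the single-speaker data directory from section 4 this yields 1, regardless of the requested value.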

## 6) Useful links & recommended datasets

- VOSK models: https://alphacephei.com/vosk/models  
- VOSK adaptation docs: https://alphacephei.com/vosk/adaptation  
- LibriSpeech (OpenSLR): https://www.openslr.org/12  
- TED-LIUM (OpenSLR): https://www.openslr.org/7  
- Mozilla Common Voice (Hugging Face dataset used in this notebook): https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0

## 7) Possible next steps

- (A) Run this notebook in Colab, one cell at a time.  
- (B) Extend the notebook to build Kaldi automatically in Colab (a long step).  
- (C) Write a ready-to-run shell script that adapts a specific VOSK model, given its name and version.  