<a href="https://colab.research.google.com/github/Stephanisk/notebook/blob/main/notebooks/automatic_model_training7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook demonstrates how to train custom openWakeWord models using pre-defined datasets and an automated process for dataset generation and training. While not guaranteed to always produce the best performing model, the methods shown in this notebook often produce baseline models with releatively strong performance.

Manual data preparation and model training (e.g., see the [training models](training_models.ipynb) notebook) remains an option for when full control over the model development process is needed.

At a high level, the automatic training process takes advantages of several techniques to try and produce a good model, including:

- Early-stopping and checkpoint averaging (similar to [stochastic weight averaging](https://arxiv.org/abs/1803.05407)) to search for the best models found during training, according to the validation data
- Variable learning rates with cosine decay and multiple cycles
- Adaptive batch construction to focus on only high-loss examples when the model begins to converge, combined with gradient accumulation to ensure that batch sizes are still large enough for stable training
- Cycical weight schedules for negative examples to help the model reduce false-positive rates

See the contents of the `train.py` file for more details.

# Environment Setup

To begin, we'll need to install the requirements for training custom models. In particular, a relatively recent version of Pytorch and custom fork of the [piper-sample-generator](https://github.com/dscripka/piper-sample-generator) library for generating synthetic examples for the custom model.

**Important Note!** Currently, automated model training is only supported on linux systems due to the requirements of the text to speech library used for synthetic sample generation (Piper). It may be possible to use Piper on Windows/Mac systems, but that has not (yet) been tested.

In [1]:
import sys
print(f"Python version: {sys.version}")
print(f"Python version info: {sys.version_info}")

# Check if it's 3.11 or earlier
if sys.version_info.major == 3 and sys.version_info.minor <= 11:
    print(f"✓ Python {sys.version_info.major}.{sys.version_info.minor} - Should work!")
else:
    print(f"✗ Python {sys.version_info.major}.{sys.version_info.minor} - Still too new")

Python version: 3.11.13 (main, Jun  4 2025, 08:57:29) [GCC 11.4.0]
Python version info: sys.version_info(major=3, minor=11, micro=13, releaselevel='final', serial=0)
✓ Python 3.11 - Should work!


In [2]:
## Environment setup

# install piper-sample-generator (currently only supports linux systems)
!git clone https://github.com/rhasspy/piper-sample-generator
!wget -O piper-sample-generator/models/en_US-libritts_r-medium.pt 'https://github.com/rhasspy/piper-sample-generator/releases/download/v2.0.0/en_US-libritts_r-medium.pt'
!pip install piper-phonemize
!pip install webrtcvad

# install openwakeword (full installation to support training)
!git clone https://github.com/dscripka/openwakeword
!pip install -e ./openwakeword
!cd openwakeword

# install other dependencies
!pip install mutagen==1.47.0
!pip install torchinfo==1.8.0
!pip install torchmetrics==1.2.0
!pip install speechbrain==0.5.14
!pip install audiomentations==0.33.0
!pip install torch-audiomentations==0.11.0
!pip install acoustics==0.2.6
!pip install tensorflow-cpu==2.8.1
!pip install tensorflow_probability==0.16.0
!pip install onnx_tf==1.10.0
!pip install pronouncing==0.2.0
!pip install datasets==2.14.6
!pip install deep-phonemizer==0.0.19

# Download required models (workaround for Colab)
import os
os.makedirs("./openwakeword/openwakeword/resources/models")
!wget https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/embedding_model.onnx -O ./openwakeword/openwakeword/resources/models/embedding_model.onnx
!wget https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/embedding_model.tflite -O ./openwakeword/openwakeword/resources/models/embedding_model.tflite
!wget https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/melspectrogram.onnx -O ./openwakeword/openwakeword/resources/models/melspectrogram.onnx
!wget https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/melspectrogram.tflite -O ./openwakeword/openwakeword/resources/models/melspectrogram.tflite


Cloning into 'piper-sample-generator'...
remote: Enumerating objects: 161, done.[K
remote: Counting objects: 100% (92/92), done.[K
remote: Compressing objects: 100% (42/42), done.[K
remote: Total 161 (delta 64), reused 62 (delta 50), pack-reused 69 (from 1)[K
Receiving objects: 100% (161/161), 1.04 MiB | 9.19 MiB/s, done.
Resolving deltas: 100% (74/74), done.
--2025-11-28 23:53:58--  https://github.com/rhasspy/piper-sample-generator/releases/download/v2.0.0/en_US-libritts_r-medium.pt
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://release-assets.githubusercontent.com/github-production-release-asset/642029941/73f4af3c-7cf8-4547-a7b9-3bd29e7f3c33?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-11-29T00%3A43%3A41Z&rscd=attachment%3B+filename%3Den_US-libritts_r-medium.pt&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398

Collecting mutagen==1.47.0
  Downloading mutagen-1.47.0-py3-none-any.whl.metadata (1.7 kB)
Downloading mutagen-1.47.0-py3-none-any.whl (194 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/194.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.4/194.4 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: mutagen
Successfully installed mutagen-1.47.0
Collecting torchinfo==1.8.0
  Downloading torchinfo-1.8.0-py3-none-any.whl.metadata (21 kB)
Downloading torchinfo-1.8.0-py3-none-any.whl (23 kB)
Installing collected packages: torchinfo
Successfully installed torchinfo-1.8.0
Collecting torchmetrics==1.2.0
  Downloading torchmetrics-1.2.0-py3-none-any.whl.metadata (21 kB)
Collecting lightning-utilities>=0.8.0 (from torchmetrics==1.2.0)
  Downloading lightning_utilities-0.15.2-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.8.1->torchmetrics

In [3]:
# FIX 1: Install piper-tts (missing from environment setup)
!pip install piper-tts --quiet

# FIX 2: Patch train.py to add model parameter to generate_samples calls
file_path = "openwakeword/openwakeword/train.py"
with open(file_path, 'r') as f:
    content = f.read()

# Add model path variable after import
if 'piper_model =' not in content:
    content = content.replace(
        'from generate_samples import generate_samples',
        'from generate_samples import generate_samples\n    piper_model = config.get("piper_sample_generator_model_path", "piper-sample-generator/models/en_US-libritts_r-medium.pt")'
    )

# Fix all generate_samples calls
content = content.replace(
    '            generate_samples(\n                text=config["target_phrase"], max_samples=',
    '            generate_samples(\n                text=config["target_phrase"], model=piper_model, max_samples='
)
content = content.replace(
    '            generate_samples(text=adversarial_texts, max_samples=',
    '            generate_samples(text=adversarial_texts, model=piper_model, max_samples='
)
content = content.replace(
    '            generate_samples(text=config["target_phrase"], max_samples=',
    '            generate_samples(text=config["target_phrase"], model=piper_model, max_samples='
)

with open(file_path, 'w') as f:
    f.write(content)

print("✓ Fixes applied!")

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/13.8 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.5/13.8 MB[0m [31m15.1 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.3/13.8 MB[0m [31m40.1 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m9.9/13.8 MB[0m [31m95.1 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m13.8/13.8 MB[0m [31m193.2 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m130.2 MB/s[0m eta [36m0:00:00[0m
[?25h✓ Fixes applied!


In [4]:
# DOWNLOAD MULTI-LANGUAGE MODELS (run after every restart)
import os

models_dir = "piper-sample-generator/models"
os.makedirs(models_dir, exist_ok=True)

# These are the working Piper voices (ONNX format)
models = {
    "de_DE": "https://huggingface.co/rhasspy/piper-voices/resolve/main/de/de_DE/thorsten/medium/de_DE-thorsten-medium.onnx",
    "de_DE_json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/de/de_DE/thorsten/medium/de_DE-thorsten-medium.onnx.json",

    "es_ES": "https://huggingface.co/rhasspy/piper-voices/resolve/main/es/es_ES/davefx/medium/es_ES-davefx-medium.onnx",
    "es_ES_json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/es/es_ES/davefx/medium/es_ES-davefx-medium.onnx.json",

    "fr_FR": "https://huggingface.co/rhasspy/piper-voices/resolve/main/fr/fr_FR/siwis/medium/fr_FR-siwis-medium.onnx",
    "fr_FR_json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/fr/fr_FR/siwis/medium/fr_FR-siwis-medium.onnx.json",

    "pt_BR": "https://huggingface.co/rhasspy/piper-voices/resolve/main/pt/pt_BR/faber/medium/pt_BR-faber-medium.onnx",
    "pt_BR_json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/pt/pt_BR/faber/medium/pt_BR-faber-medium.onnx.json",

    "ru_RU": "https://huggingface.co/rhasspy/piper-voices/resolve/main/ru/ru_RU/dmitri/medium/ru_RU-dmitri-medium.onnx",
    "ru_RU_json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/ru/ru_RU/dmitri/medium/ru_RU-dmitri-medium.onnx.json",

    "zh_CN": "https://huggingface.co/rhasspy/piper-voices/resolve/main/zh/zh_CN/huayan/medium/zh_CN-huayan-medium.onnx",
    "zh_CN_json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/zh/zh_CN/huayan/medium/zh_CN-huayan-medium.onnx.json",
}

print("Downloading multi-language Piper models...")
print("(This takes ~2 minutes)")

for name, url in models.items():
    filename = url.split("/")[-1]
    output_path = f"{models_dir}/{filename}"

    if os.path.exists(output_path):
        print(f"✓ {filename} (already exists)")
    else:
        print(f"⬇ Downloading {filename}...")
        !wget -q -O {output_path} {url}
        if os.path.exists(output_path) and os.path.getsize(output_path) > 1000:
            print(f"  ✓ Done ({os.path.getsize(output_path)/(1024*1024):.1f} MB)")
        else:
            print(f"  ❌ Failed!")

print("\n" + "="*60)
print("Model download complete! Verifying...")
!ls -lh {models_dir}/*.onnx 2>/dev/null | awk '{{print $9, $5}}'
print("="*60)

Downloading multi-language Piper models...
(This takes ~2 minutes)
⬇ Downloading de_DE-thorsten-medium.onnx...
  ✓ Done (60.3 MB)
⬇ Downloading de_DE-thorsten-medium.onnx.json...
  ✓ Done (0.0 MB)
⬇ Downloading es_ES-davefx-medium.onnx...
  ✓ Done (60.3 MB)
⬇ Downloading es_ES-davefx-medium.onnx.json...
  ✓ Done (0.0 MB)
⬇ Downloading fr_FR-siwis-medium.onnx...
  ✓ Done (60.3 MB)
⬇ Downloading fr_FR-siwis-medium.onnx.json...
  ✓ Done (0.0 MB)
⬇ Downloading pt_BR-faber-medium.onnx...
  ✓ Done (60.3 MB)
⬇ Downloading pt_BR-faber-medium.onnx.json...
  ✓ Done (0.0 MB)
⬇ Downloading ru_RU-dmitri-medium.onnx...
  ✓ Done (60.3 MB)
⬇ Downloading ru_RU-dmitri-medium.onnx.json...
  ✓ Done (0.0 MB)
⬇ Downloading zh_CN-huayan-medium.onnx...
  ✓ Done (60.3 MB)
⬇ Downloading zh_CN-huayan-medium.onnx.json...
  ✓ Done (0.0 MB)

Model download complete! Verifying...
9 5
9 5
9 5
9 5
9 5
9 5


In [6]:
# Imports

import os
import numpy as np
import torch
import sys
from pathlib import Path
import uuid
import yaml
import datasets
import scipy
from tqdm import tqdm


In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Download Data

When training new openWakeWord models using the automated procedure, four specific types of data are required:

1) Synthetic examples of the target word/phrase generated with text-to-speech models

2) Synthetic examples of adversarial words/phrases generated with text-to-speech models

3) Room impulse reponses and noise/background audio data to augment the synthetic examples and make them more realistic

4) Generic "negative" audio data that is very unlikely to contain examples of the target word/phrase in the context where the model should detect it. This data can be the original audio data, or precomputed openWakeWord features ready for model training.

5) Validation data to use for early-stopping when training the model.

For the purposes of this notebook, all five of these sources will either be generated manually or can be obtained from HuggingFace thanks to their excellent `datasets` library and extremely generous hosting policy. Also note that while only a portion of some datasets are downloaded, for the best possible performance it is recommended to download the entire dataset and keep a local copy for future training runs.

In [8]:
# Download room impulse responses collected by MIT
# https://mcdermottlab.mit.edu/Reverb/IR_Survey.html

output_dir = "./mit_rirs"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
rir_dataset = datasets.load_dataset("davidscripka/MIT_environmental_impulse_responses", split="train", streaming=True)

# Save clips to 16-bit PCM wav files
for row in tqdm(rir_dataset):
    name = row['audio']['path'].split('/')[-1]
    scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, (row['audio']['array']*32767).astype(np.int16))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/936 [00:00<?, ?B/s]

HTTP Error 429 thrown while requesting GET https://huggingface.co/api/datasets/davidscripka/MIT_environmental_impulse_responses/tree/b824a1ef2821f112fda0b9cb26e4278c62b425bb/16khz?expand=true&recursive=true&limit=50&cursor=ZXlKbWFXeGxYMjVoYldVaU9pSXhObXRvZWk5b01UQXdYME5zWVhOemNtOXZiVjh5ZEhoMGN5NTNZWFlpTENKMGNtVmxYMjlwWkNJNkltWmtOelV3WVdZME1qRmtOekUxTnpSak9XTTFOamcwTURSaE56VTBaalV4TkRVek5EVTBNVEFpZlE9PToxMDA%3D
Retrying in 1s [Retry 1/20].


Resolving data files:   0%|          | 0/270 [00:00<?, ?it/s]

270it [00:45,  5.90it/s]


In [9]:
import os
import shutil

# Check if we have the big files in Drive already
drive_backup = '/content/drive/MyDrive/openWakeWord_backup'

if os.path.exists(f'{drive_backup}/openwakeword_features_ACAV100M_2000_hrs_16bit.npy'):
    print("✓ Found backup files in Drive! Copying to workspace...")

    # Copy from Drive instead of downloading
    if not os.path.exists('openwakeword_features_ACAV100M_2000_hrs_16bit.npy'):
        shutil.copy(f'{drive_backup}/openwakeword_features_ACAV100M_2000_hrs_16bit.npy', '.')
    if not os.path.exists('validation_set_features.npy'):
        shutil.copy(f'{drive_backup}/validation_set_features.npy', '.')

    print("✓ Files restored from Drive!")
else:
    print("No backup found - will download and then backup to Drive")

✓ Found backup files in Drive! Copying to workspace...
✓ Files restored from Drive!


In [15]:
## Download noise and background audio

# Audioset Dataset (https://research.google.com/audioset/dataset/index.html)
# Download one part of the audioset .tar files, extract, and convert to 16khz
# For full-scale training, it's recommended to download the entire dataset from
# https://huggingface.co/datasets/agkphysics/AudioSet, and
# even potentially combine it with other background noise datasets (e.g., FSD50k, Freesound, etc.)

if not os.path.exists("audioset"):
    os.mkdir("audioset")

fname = "bal_train09.tar"
out_dir = f"audioset/{fname}"
link = "https://huggingface.co/datasets/agkphysics/AudioSet/resolve/main/data/" + fname
!wget -O {out_dir} {link}
!cd audioset && tar -xvf bal_train09.tar

output_dir = "./audioset_16k"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

# Convert audioset files to 16khz sample rate
audioset_dataset = datasets.Dataset.from_dict({"audio": [str(i) for i in Path("audioset/audio").glob("**/*.flac")]})
audioset_dataset = audioset_dataset.cast_column("audio", datasets.Audio(sampling_rate=16000))
for row in tqdm(audioset_dataset):
    name = row['audio']['path'].split('/')[-1].replace(".flac", ".wav")
    scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, (row['audio']['array']*32767).astype(np.int16))

# Free Music Archive dataset (https://github.com/mdeff/fma)
output_dir = "./fma"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
fma_dataset = datasets.load_dataset("rudraml/fma", name="small", split="train", streaming=True)
fma_dataset = iter(fma_dataset.cast_column("audio", datasets.Audio(sampling_rate=16000)))

n_hours = 1  # use only 1 hour of clips for this example notebook, recommend increasing for full-scale training
for i in tqdm(range(n_hours*3600//30)):  # this works because the FMA dataset is all 30 second clips
    row = next(fma_dataset)
    name = row['audio']['path'].split('/')[-1].replace(".mp3", ".wav")
    scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, (row['audio']['array']*32767).astype(np.int16))
    i += 1
    if i == n_hours*3600//30:
        break


--2025-11-29 00:13:19--  https://huggingface.co/datasets/agkphysics/AudioSet/resolve/main/data/bal_train09.tar
Resolving huggingface.co (huggingface.co)... 3.170.185.14, 3.170.185.35, 3.170.185.33, ...
Connecting to huggingface.co (huggingface.co)|3.170.185.14|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-11-29 00:13:19 ERROR 404: Not Found.

tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors


0it [00:00, ?it/s]
 99%|█████████▉| 119/120 [00:36<00:00,  3.29it/s]


In [None]:
# Download pre-computed openWakeWord features for training and validation

# training set (~2,000 hours from the ACAV100M Dataset)
# See https://huggingface.co/datasets/davidscripka/openwakeword_features for more information
!wget https://huggingface.co/datasets/davidscripka/openwakeword_features/resolve/main/openwakeword_features_ACAV100M_2000_hrs_16bit.npy

# validation set for false positive rate estimation (~11 hours)
!wget https://huggingface.co/datasets/davidscripka/openwakeword_features/resolve/main/validation_set_features.npy

--2025-11-27 01:30:30--  https://huggingface.co/datasets/davidscripka/openwakeword_features/resolve/main/openwakeword_features_ACAV100M_2000_hrs_16bit.npy
Resolving huggingface.co (huggingface.co)... 13.35.202.121, 13.35.202.40, 13.35.202.34, ...
Connecting to huggingface.co (huggingface.co)|13.35.202.121|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cas-bridge.xethub.hf.co/xet-bridge-us/64f3a0b6918ffcc15af6923c/7e1cade4c3fda6a5081158383c8d43c4a3e1e42555150b596b373efddf9b5194?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20251127%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251127T013031Z&X-Amz-Expires=3600&X-Amz-Signature=a61fbd21c6761a95cb32ca00543983b2a02687faa53653ba2b3ec4bfde7e0cb8&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27openwakeword_features_ACAV100M_2000_hrs_16bit.npy%3B+filename%3D%22openwakeword_features_ACAV100M_2000_hrs_

In [None]:
import os
import shutil

# Create backup directory in Drive
drive_backup = '/content/drive/MyDrive/openWakeWord_backup'
os.makedirs(drive_backup, exist_ok=True)

# Copy the big files to Drive (only if not already there)
files_to_backup = [
    'openwakeword_features_ACAV100M_2000_hrs_16bit.npy',
    'validation_set_features.npy'
]

for filename in files_to_backup:
    if os.path.exists(filename):
        drive_path = f'{drive_backup}/{filename}'
        if not os.path.exists(drive_path):
            print(f"Backing up {filename} to Drive... (takes 2-3 min)")
            shutil.copy(filename, drive_path)
            print(f"✓ {filename} backed up!")
        else:
            print(f"✓ {filename} already in Drive")

print("\n✓ All files backed up to Google Drive!")
print("Next time, run Cell 2 above to restore instead of re-downloading!")

Backing up openwakeword_features_ACAV100M_2000_hrs_16bit.npy to Drive... (takes 2-3 min)
✓ openwakeword_features_ACAV100M_2000_hrs_16bit.npy backed up!
Backing up validation_set_features.npy to Drive... (takes 2-3 min)
✓ validation_set_features.npy backed up!

✓ All files backed up to Google Drive!
Next time, run Cell 2 above to restore instead of re-downloading!


# Define Training Configuration

For automated model training openWakeWord uses a specially designed training script and a [YAML](https://yaml.org/) configuration file that defines all of the information required for training a new wake word/phrase detection model.

It is strongly recommended that you review [the example config file](../examples/custom_model.yml), as each value is fully documented there. For the purposes of this notebook, we'll read in the YAML file to modify certain configuration parameters before saving a new YAML file for training our example model. Specifically:

- We'll train a detection model for the phrase "hey sebastian"
- We'll only generate 5,000 positive and negative examples (to save on time for this example)
- We'll only generate 1,000 validation positive and negative examples for early stopping (again to save time)
- The model will only be trained for 10,000 steps (larger datasets will benefit from longer training)
- We'll reduce the target metrics to account for the small dataset size and limited training.

On the topic of target metrics, there are *not* specific guidelines about what these metrics should be in practice, and you will need to conduct testing in your target deployment environment to establish good thresholds. However, from very limited testing the default values in the config file (accuracy >= 0.7, recall >= 0.5, false-positive rate <= 0.2 per hour) seem to produce models with reasonable performance.


In [16]:
# Load default YAML config file for training
config = yaml.load(open("openwakeword/examples/custom_model.yml", 'r').read(), yaml.Loader)
config

{'model_name': 'my_model',
 'target_phrase': ['hey jarvis'],
 'custom_negative_phrases': [],
 'n_samples': 10000,
 'n_samples_val': 2000,
 'tts_batch_size': 50,
 'augmentation_batch_size': 16,
 'piper_sample_generator_path': './piper-sample-generator',
 'output_dir': './my_custom_model',
 'rir_paths': ['./mit_rirs'],
 'background_paths': ['./background_clips'],
 'background_paths_duplication_rate': [1],
 'false_positive_validation_data_path': './validation_set_features.npy',
 'augmentation_rounds': 1,
 'feature_data_files': {'ACAV100M_sample': './openwakeword_features_ACAV100M_2000_hrs_16bit.npy'},
 'batch_n_per_class': {'ACAV100M_sample': 1024,
  'adversarial_negative': 50,
  'positive': 50},
 'model_type': 'dnn',
 'layer_size': 32,
 'steps': 50000,
 'max_negative_weight': 1500,
 'target_false_positives_per_hour': 0.2}

In [17]:
# Modify values in the config and save a new version
config["target_phrase"] = ["Hello James"]
config["model_name"] = config["target_phrase"][0].replace(" ", "_")

# MULTI-LANGUAGE TRAINING - ENHANCED VERSION
config["n_samples"] = 4500              # Keep existing backed-up samples
config["n_samples_val"] = 4500          # Keep existing validation samples
config["augmentation_rounds"] = 3       # 3x data augmentation (13,500 effective samples)
config["layer_size"] = 64               # Larger model for multi-language capacity
config["steps"] = 150000                # Extended training for 10-hour session
config["target_accuracy"] = 0.75        # Higher quality threshold
config["target_recall"] = 0.5           # Better detection rate
config["max_negative_weight"] = 2000    # Allow stronger negative weighting

# Data paths (keep as-is)
config["background_paths"] = ['./fma']
config["false_positive_validation_data_path"] = "validation_set_features.npy"
config["feature_data_files"] = {"ACAV100M_sample": "openwakeword_features_ACAV100M_2000_hrs_16bit.npy"}

with open('my_model.yaml', 'w') as file:
    documents = yaml.dump(config, file)

print("✓ Config updated for EXTENDED multi-language training!")
print(f"  - {config['n_samples']} samples × 3 augmentation rounds = {config['n_samples']*3} effective samples")
print(f"  - Layer size: {config['layer_size']}")
print(f"  - Training steps: {config['steps']:,}")
print(f"  - Estimated training time: ~2 hours")

✓ Config updated for EXTENDED multi-language training!
  - 4500 samples × 3 augmentation rounds = 13500 effective samples
  - Layer size: 64
  - Training steps: 150,000
  - Estimated training time: ~2 hours


# Train the Model

With the data downloaded and training configuration set, we can now start training the model. We'll do this in parts to better illustrate the sequence, but you can also execute every step at once for a fully automated process.

In [18]:
# PRE-FLIGHT CHECK: Test piper and verify all models exist
import os
import subprocess

print("="*60)
print("🔍 PRE-FLIGHT CHECKS")
print("="*60)

# 1. Check piper binary
print("\n1️⃣ Testing piper binary...")
piper_path = "/usr/local/bin/piper"
if not os.path.exists(piper_path):
    print(f"❌ Piper binary not found at {piper_path}")
    print("Run FIX 1 cell to install piper!")
else:
    print(f"✓ Piper binary exists")

    # Test with a simple model
    test_model = "piper-sample-generator/models/en_US-libritts_r-medium.onnx"
    if os.path.exists(test_model):
        test_file = "/tmp/piper_test.wav"
        cmd = f'echo "Hello James" | {piper_path} --model {test_model} --output_file {test_file}'
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

        if result.returncode == 0 and os.path.exists(test_file) and os.path.getsize(test_file) > 100:
            size = os.path.getsize(test_file)
            print(f"✓ Piper works! Test file: {size} bytes")
            os.remove(test_file)  # cleanup
        else:
            print(f"❌ PIPER FAILED!")
            print(f"   Return code: {result.returncode}")
            print(f"   STDERR: {result.stderr[:300]}")
            print(f"   STDOUT: {result.stdout[:300]}")
            print("\n⚠️ Try installing onnxruntime:")
            print("   !pip install onnxruntime")
    else:
        print(f"⚠️ Test model not found: {test_model}")

# 2. Check all language models
print("\n2️⃣ Checking language models...")
languages = {
    "en_US": ("piper-sample-generator/models/en_US-libritts_r-medium.pt", "pt"),
    "de_DE": ("piper-sample-generator/models/de_DE-thorsten-medium.onnx", "onnx"),
    "es_ES": ("piper-sample-generator/models/es_ES-davefx-medium.onnx", "onnx"),
    "fr_FR": ("piper-sample-generator/models/fr_FR-siwis-medium.onnx", "onnx"),
    "pt_BR": ("piper-sample-generator/models/pt_BR-faber-medium.onnx", "onnx"),
    "ru_RU": ("piper-sample-generator/models/ru_RU-dmitri-medium.onnx", "onnx"),
    "zh_CN": ("piper-sample-generator/models/zh_CN-huayan-medium.onnx", "onnx")
}

found_models = []
missing_models = []

for lang, (model_path, model_type) in languages.items():
    if os.path.exists(model_path):
        size_mb = os.path.getsize(model_path) / (1024*1024)
        print(f"✓ {lang:8} ({model_type}): {size_mb:.1f} MB")
        found_models.append(lang)
    else:
        print(f"❌ {lang:8}: NOT FOUND at {model_path}")
        missing_models.append(lang)

# 3. Check what models ARE available
if missing_models:
    print(f"\n⚠️ Missing {len(missing_models)} models: {missing_models}")
    print("\nAvailable models in directory:")
    !ls -lh piper-sample-generator/models/*.{pt,onnx} 2>/dev/null | awk '{print $9, $5}'

# 4. Check Drive backup status
print("\n3️⃣ Checking Drive backup status...")
backup_base = "/content/drive/MyDrive/hello_james_samples"
if os.path.exists(backup_base):
    completed = [d for d in os.listdir(backup_base)
                 if os.path.isdir(f"{backup_base}/{d}")
                 and len(os.listdir(f"{backup_base}/{d}")) > 0]

    if completed:
        print(f"✓ Already completed: {completed}")
        for lang in completed:
            file_count = len(os.listdir(f"{backup_base}/{lang}"))
            print(f"   {lang}: {file_count} files")
    else:
        print("  No languages completed yet")
else:
    print("  Backup directory doesn't exist yet (will be created)")

# 5. Summary
print("\n" + "="*60)
print("📊 SUMMARY")
print("="*60)
print(f"✓ Models ready: {len(found_models)}/{len(languages)}")
print(f"  {found_models}")
if missing_models:
    print(f"❌ Models missing: {len(missing_models)}")
    print(f"  {missing_models}")
    print("\n⚠️ Generation will SKIP missing models")
else:
    print("✓ ALL MODELS READY!")

print("\nIf piper test failed, run: !pip install onnxruntime")
print("Then re-run this cell to verify.")
print("="*60)

🔍 PRE-FLIGHT CHECKS

1️⃣ Testing piper binary...
✓ Piper binary exists
⚠️ Test model not found: piper-sample-generator/models/en_US-libritts_r-medium.onnx

2️⃣ Checking language models...
✓ en_US    (pt): 194.6 MB
✓ de_DE    (onnx): 60.3 MB
✓ es_ES    (onnx): 60.3 MB
✓ fr_FR    (onnx): 60.3 MB
✓ pt_BR    (onnx): 60.3 MB
✓ ru_RU    (onnx): 60.3 MB
✓ zh_CN    (onnx): 60.3 MB

3️⃣ Checking Drive backup status...
✓ Already completed: ['en_US', 'de_DE', 'es_ES', 'fr_FR', 'pt_BR', 'ru_RU', 'zh_CN']
   en_US: 2850 files
   de_DE: 1610 files
   es_ES: 1300 files
   fr_FR: 1300 files
   pt_BR: 1300 files
   ru_RU: 1300 files
   zh_CN: 1300 files

📊 SUMMARY
✓ Models ready: 7/7
  ['en_US', 'de_DE', 'es_ES', 'fr_FR', 'pt_BR', 'ru_RU', 'zh_CN']
✓ ALL MODELS READY!

If piper test failed, run: !pip install onnxruntime
Then re-run this cell to verify.


In [19]:
# SAFE Multi-language generation with auto-backup after EACH language
import os, uuid, subprocess, shutil
from google.colab import drive

# Mount drive first
try:
    drive.mount('/content/drive')
except:
    print("Drive already mounted")

backup_base = "/content/drive/MyDrive/hello_james_samples"
os.makedirs(backup_base, exist_ok=True)

languages = {
    "en_US": ("piper-sample-generator/models/en_US-libritts_r-medium.pt", "pt"),
    "de_DE": ("piper-sample-generator/models/de_DE-thorsten-medium.onnx", "onnx"),
    "es_ES": ("piper-sample-generator/models/es_ES-davefx-medium.onnx", "onnx"),
    "fr_FR": ("piper-sample-generator/models/fr_FR-siwis-medium.onnx", "onnx"),
    "pt_BR": ("piper-sample-generator/models/pt_BR-faber-medium.onnx", "onnx"),
    "ru_RU": ("piper-sample-generator/models/ru_RU-dmitri-medium.onnx", "onnx"),
    "zh_CN": ("piper-sample-generator/models/zh_CN-huayan-medium.onnx", "onnx")
}

base_dir = "./my_custom_model/Hello_James"
positive_train = f"{base_dir}/positive_train"
positive_test = f"{base_dir}/positive_test"

for d in [positive_train, positive_test]:
    os.makedirs(d, exist_ok=True)

samples_per_lang = 650

import sys
sys.path.insert(0, "piper-sample-generator")
from generate_samples import generate_samples

def generate_with_piper_binary(text, model_path, output_file):
    cmd = f'echo "{text}" | /usr/local/bin/piper --model {model_path} --output_file {output_file} 2>/dev/null'
    result = subprocess.run(cmd, shell=True, capture_output=True)
    return result.returncode == 0

def backup_language(lang):
    """Backup one language to Drive immediately"""
    lang_backup = f"{backup_base}/{lang}"
    os.makedirs(lang_backup, exist_ok=True)

    # Copy this language's files
    import glob
    train_files = glob.glob(f"{positive_train}/{lang}_*.wav")
    test_files = glob.glob(f"{positive_test}/{lang}_*.wav")

    for f in train_files:
        shutil.copy(f, f"{lang_backup}/train_{os.path.basename(f)}")
    for f in test_files:
        shutil.copy(f, f"{lang_backup}/test_{os.path.basename(f)}")

    print(f"   💾 BACKED UP {len(train_files)+len(test_files)} files to Drive!")

# Check what's already done
completed = [d for d in os.listdir(backup_base) if os.path.isdir(f"{backup_base}/{d}")]
print(f"Already completed languages: {completed}")

# Generate each language with immediate backup
for lang, (model_path, model_type) in languages.items():
    if lang in completed:
        print(f"\n✓ {lang} already backed up, skipping...")
        continue

    print(f"\n{'='*60}")
    print(f"🎤 {lang}: Generating {samples_per_lang} samples...")
    print(f"{'='*60}")

    # TRAINING
    if model_type == "pt":
        try:
            generate_samples(
                text=["Hello James"], model=model_path, max_samples=samples_per_lang,
                batch_size=50, noise_scales=[0.98], noise_scale_ws=[0.98],
                length_scales=[0.75, 1.0, 1.25], output_dir=positive_train,
                auto_reduce_batch_size=True,
                file_names=[f"{lang}_{uuid.uuid4().hex}.wav" for _ in range(samples_per_lang)]
            )
        except Exception as e:
            print(f"   ❌ Error: {e}")
            continue
    else:
        success = 0
        for i in range(samples_per_lang):
            output = f"{positive_train}/{lang}_{uuid.uuid4().hex}.wav"
            if generate_with_piper_binary("Hello James", model_path, output):
                success += 1
            if (i+1) % 100 == 0:
                print(f"   Training: {i+1}/{samples_per_lang} ({success} successful)")

    # VALIDATION
    if model_type == "pt":
        try:
            generate_samples(
                text=["Hello James"], model=model_path, max_samples=samples_per_lang,
                batch_size=50, noise_scales=[0.98], noise_scale_ws=[0.98],
                length_scales=[0.75, 1.0, 1.25], output_dir=positive_test,
                auto_reduce_batch_size=True,
                file_names=[f"{lang}_{uuid.uuid4().hex}.wav" for _ in range(samples_per_lang)]
            )
        except Exception as e:
            print(f"   ❌ Error: {e}")
            continue
    else:
        success = 0
        for i in range(samples_per_lang):
            output = f"{positive_test}/{lang}_{uuid.uuid4().hex}.wav"
            if generate_with_piper_binary("Hello James", model_path, output):
                success += 1
            if (i+1) % 100 == 0:
                print(f"   Validation: {i+1}/{samples_per_lang}")

    # BACKUP IMMEDIATELY
    print(f"\n   💾 Backing up {lang} to Drive...")
    backup_language(lang)
    print(f"   ✓ {lang} COMPLETE AND SAVED!")

print("\n" + "="*60)
print("✓ ALL LANGUAGES COMPLETE!")
print("="*60)
print(f"Total samples: {len(os.listdir(positive_train))}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Already completed languages: ['en_US', 'de_DE', 'es_ES', '.ipynb_checkpoints', 'fr_FR', 'pt_BR', 'ru_RU', 'zh_CN']

✓ en_US already backed up, skipping...

✓ de_DE already backed up, skipping...

✓ es_ES already backed up, skipping...

✓ fr_FR already backed up, skipping...

✓ pt_BR already backed up, skipping...

✓ ru_RU already backed up, skipping...

✓ zh_CN already backed up, skipping...

✓ ALL LANGUAGES COMPLETE!
Total samples: 0


In [20]:
# RESTORE: Copy all backed-up samples from Drive to local directories
import os, shutil, glob

backup_base = "/content/drive/MyDrive/hello_james_samples"
base_dir = "./my_custom_model/Hello_James"
positive_train = f"{base_dir}/positive_train"
positive_test = f"{base_dir}/positive_test"

print("="*60)
print("📥 RESTORING ALL SAMPLES FROM DRIVE")
print("="*60)

# Get all backed up languages
backed_up_langs = [d for d in os.listdir(backup_base)
                   if os.path.isdir(f"{backup_base}/{d}")]

print(f"Found backups: {backed_up_langs}\n")

total_restored = 0
for lang in backed_up_langs:
    lang_backup = f"{backup_base}/{lang}"
    files = os.listdir(lang_backup)

    print(f"Restoring {lang}: {len(files)} files...")

    for filename in files:
        src = f"{lang_backup}/{filename}"

        # Determine if train or test
        if filename.startswith("train_"):
            dest = f"{positive_train}/{filename[6:]}"  # Remove "train_" prefix
        elif filename.startswith("test_"):
            dest = f"{positive_test}/{filename[5:]}"  # Remove "test_" prefix
        else:
            print(f"  ⚠️ Unknown file: {filename}")
            continue

        # Copy if not already there
        if not os.path.exists(dest):
            shutil.copy(src, dest)
            total_restored += 1

print("\n" + "="*60)
print(f"✓ RESTORED {total_restored} files from Drive!")
print("="*60)
print(f"Training samples: {len(os.listdir(positive_train))}")
print(f"Test samples: {len(os.listdir(positive_test))}")
print("\nReady for augmentation and training!")

📥 RESTORING ALL SAMPLES FROM DRIVE
Found backups: ['en_US', 'de_DE', 'es_ES', '.ipynb_checkpoints', 'fr_FR', 'pt_BR', 'ru_RU', 'zh_CN']

Restoring en_US: 2850 files...
Restoring de_DE: 1610 files...
Restoring es_ES: 1300 files...
Restoring .ipynb_checkpoints: 0 files...
Restoring fr_FR: 1300 files...
Restoring pt_BR: 1300 files...
Restoring ru_RU: 1300 files...
Restoring zh_CN: 1300 files...

✓ RESTORED 10960 files from Drive!
Training samples: 5960
Test samples: 5000

Ready for augmentation and training!


In [21]:
# 🔍 PRE-TRAINING VALIDATION: Check EVERYTHING before starting
import os
import wave
import glob

print("="*60)
print("🔍 PRE-TRAINING VALIDATION CHECK")
print("="*60)

all_good = True

# 1. Check sample directories and counts
print("\n1️⃣ Checking sample directories...")
base_dir = "./my_custom_model/Hello_James"
positive_train = f"{base_dir}/positive_train"
positive_test = f"{base_dir}/positive_test"
negative_train = f"{base_dir}/negative_train"
negative_test = f"{base_dir}/negative_test"

train_files = os.listdir(positive_train) if os.path.exists(positive_train) else []
test_files = os.listdir(positive_test) if os.path.exists(positive_test) else []

print(f"   Training samples: {len(train_files)}")
print(f"   Test samples: {len(test_files)}")

expected_per_lang = 650
expected_langs = 7
expected_total = expected_per_lang * expected_langs

if len(train_files) < expected_total * 0.9:  # Allow 10% tolerance
    print(f"   ⚠️ WARNING: Expected ~{expected_total} training samples, got {len(train_files)}")
    all_good = False
else:
    print(f"   ✓ Training sample count looks good!")

if len(test_files) < expected_total * 0.9:
    print(f"   ⚠️ WARNING: Expected ~{expected_total} test samples, got {len(test_files)}")
    all_good = False
else:
    print(f"   ✓ Test sample count looks good!")

# 2. Check language distribution
print("\n2️⃣ Checking language distribution...")
languages = ["en_US", "de_DE", "es_ES", "fr_FR", "pt_BR", "ru_RU", "zh_CN"]
lang_counts = {}

for lang in languages:
    train_count = len([f for f in train_files if f.startswith(lang)])
    test_count = len([f for f in test_files if f.startswith(lang)])
    lang_counts[lang] = (train_count, test_count)

    total = train_count + test_count
    if total > 0:
        print(f"   {lang}: {train_count} train + {test_count} test = {total} total")
    else:
        print(f"   ❌ {lang}: NO SAMPLES FOUND!")
        all_good = False

# 3. Check sample rates
print("\n3️⃣ Checking sample rates (should be 16000 Hz)...")
sample_rates = {}
for filename in train_files[:5]:  # Check first 5 files
    filepath = f"{positive_train}/{filename}"
    try:
        with wave.open(filepath, 'rb') as wav:
            rate = wav.getframerate()
            sample_rates[rate] = sample_rates.get(rate, 0) + 1
    except Exception as e:
        print(f"   ⚠️ Error reading {filename}: {e}")
        all_good = False

if sample_rates:
    for rate, count in sample_rates.items():
        if rate == 16000:
            print(f"   ✓ Sample rate: {rate} Hz (correct)")
        else:
            print(f"   ❌ Sample rate: {rate} Hz (WRONG! Should be 16000 Hz)")
            print(f"   → Run FIX 3 (resample) cell before training!")
            all_good = False

# 4. Check file sizes (detect empty files)
print("\n4️⃣ Checking for empty/corrupted files...")
empty_files = 0
small_files = 0
for filename in train_files[:100]:  # Sample 100 files
    filepath = f"{positive_train}/{filename}"
    size = os.path.getsize(filepath)
    if size == 0:
        empty_files += 1
    elif size < 1000:  # Less than 1KB is suspicious
        small_files += 1

if empty_files > 0:
    print(f"   ❌ Found {empty_files} empty files!")
    all_good = False
elif small_files > 5:
    print(f"   ⚠️ Found {small_files} suspiciously small files")
else:
    print(f"   ✓ No empty files detected")

# 5. Check background/noise data
print("\n5️⃣ Checking background audio datasets...")
required_dirs = {
    "./audioset_16k": "AudioSet background noise",
    "./fma": "FMA music dataset",
    "./mit_rirs": "MIT room impulse responses"
}

for dir_path, description in required_dirs.items():
    if os.path.exists(dir_path):
        file_count = len([f for f in os.listdir(dir_path) if f.endswith('.wav')])
        print(f"   ✓ {description}: {file_count} files")
    else:
        print(f"   ❌ {description}: NOT FOUND at {dir_path}")
        all_good = False

# 6. Check feature data file (16GB embeddings)
print("\n6️⃣ Checking feature embedding file...")
feature_file = "openwakeword_features_ACAV100M_2000_hrs_16bit.npy"
if os.path.exists(feature_file):
    size_gb = os.path.getsize(feature_file) / (1024**3)
    print(f"   ✓ Feature file exists: {size_gb:.1f} GB")
else:
    print(f"   ❌ Feature file NOT FOUND: {feature_file}")
    all_good = False

# 7. Check config file
print("\n7️⃣ Checking training config...")
if os.path.exists('my_model.yaml'):
    import yaml
    with open('my_model.yaml', 'r') as f:
        config = yaml.safe_load(f)

    print(f"   Target phrase: {config.get('target_phrase', 'NOT SET')}")
    print(f"   Model name: {config.get('model_name', 'NOT SET')}")
    print(f"   Training steps: {config.get('steps', 'NOT SET')}")
    print(f"   n_samples: {config.get('n_samples', 'NOT SET')}")

    if config.get('target_phrase') != ['Hello James']:
        print(f"   ⚠️ Target phrase mismatch!")
        all_good = False
    else:
        print(f"   ✓ Config looks good")
else:
    print(f"   ❌ Config file 'my_model.yaml' NOT FOUND!")
    all_good = False

# 8. Check Python environment
print("\n8️⃣ Checking Python packages...")
try:
    import torch
    import torchaudio
    import openwakeword
    print(f"   ✓ PyTorch: {torch.__version__}")
    print(f"   ✓ torchaudio: {torchaudio.__version__}")
    print(f"   ✓ openwakeword: installed")
except ImportError as e:
    print(f"   ❌ Missing package: {e}")
    all_good = False

# FINAL VERDICT
print("\n" + "="*60)
if all_good:
    print("✅ ALL CHECKS PASSED!")
    print("="*60)
    print("🚀 READY TO TRAIN!")
    print("\nNext steps:")
    print("1. Run Step 2: Data augmentation + feature extraction (~15 mins)")
    print("2. Run Step 3: Train model (~30-60 mins on A100)")
    print("3. Download your .tflite model!")
else:
    print("❌ VALIDATION FAILED!")
    print("="*60)
    print("⚠️ FIX THE ISSUES ABOVE BEFORE TRAINING!")
    print("\nCommon fixes:")
    print("- Run RESTORE cell to get all samples from Drive")
    print("- Run FIX 3 (resample) if sample rate is wrong")
    print("- Re-run download cells for missing datasets")
print("="*60)

🔍 PRE-TRAINING VALIDATION CHECK

1️⃣ Checking sample directories...
   Training samples: 5960
   Test samples: 5000
   ✓ Training sample count looks good!
   ✓ Test sample count looks good!

2️⃣ Checking language distribution...
   en_US: 1750 train + 1100 test = 2850 total
   de_DE: 960 train + 650 test = 1610 total
   es_ES: 650 train + 650 test = 1300 total
   fr_FR: 650 train + 650 test = 1300 total
   pt_BR: 650 train + 650 test = 1300 total
   ru_RU: 650 train + 650 test = 1300 total
   zh_CN: 650 train + 650 test = 1300 total

3️⃣ Checking sample rates (should be 16000 Hz)...
   ❌ Sample rate: 22050 Hz (WRONG! Should be 16000 Hz)
   → Run FIX 3 (resample) cell before training!

4️⃣ Checking for empty/corrupted files...
   ✓ No empty files detected

5️⃣ Checking background audio datasets...
   ✓ AudioSet background noise: 0 files
   ✓ FMA music dataset: 120 files
   ✓ MIT room impulse responses: 270 files

6️⃣ Checking feature embedding file...
   ✓ Feature file exists: 16.1 GB



In [22]:
# 🆘 BACKUP: Download alternative background audio if augmentation fails
# Run this ONLY if you get StopIteration during augmentation

import os
import datasets
import scipy.io.wavfile
from tqdm import tqdm

output_dir = "./audioset_16k"
os.makedirs(output_dir, exist_ok=True)

print("="*60)
print("DOWNLOADING BACKUP BACKGROUND AUDIO")
print("Source: FreeSound dataset (alternative to AudioSet)")
print("="*60)

# Option 1: Download more FMA music (fast, reliable)
print("\n📥 Downloading additional FMA tracks...")
fma_dataset = datasets.load_dataset("rudraml/fma", name="small", split="train", streaming=True)
fma_dataset = iter(fma_dataset.cast_column("audio", datasets.Audio(sampling_rate=16000)))

# Download 3 more hours (vs the 1 hour we have)
target_clips = 360  # 3 hours of 30-second clips
current_fma = len([f for f in os.listdir("./fma") if f.endswith('.wav')])

print(f"Current FMA clips: {current_fma}")
print(f"Downloading {target_clips} more clips (~3 hours)...")

for i in tqdm(range(target_clips)):
    try:
        row = next(fma_dataset)
        name = f"fma_extra_{i:05d}.wav"
        scipy.io.wavfile.write(
            os.path.join("./fma", name),
            16000,
            (row['audio']['array']*32767).astype(np.int16)
        )
    except StopIteration:
        print(f"Dataset exhausted at {i} clips")
        break

final_count = len([f for f in os.listdir("./fma") if f.endswith('.wav')])
print(f"\n✅ FMA now has {final_count} clips!")

# Option 2: Use speech data as background (adds variety)
print("\n📥 Downloading CommonVoice samples for speech background...")
try:
    cv_dataset = datasets.load_dataset(
        "mozilla-foundation/common_voice_11_0",
        "en",
        split="train",
        streaming=True,
        trust_remote_code=True
    )
    cv_dataset = iter(cv_dataset.cast_column("audio", datasets.Audio(sampling_rate=16000)))

    for i in tqdm(range(200)):  # Add 200 speech samples
        try:
            row = next(cv_dataset)
            name = f"speech_{i:05d}.wav"
            scipy.io.wavfile.write(
                os.path.join("./fma", name),
                16000,
                (row['audio']['array']*32767).astype(np.int16)
            )
        except:
            break

    print(f"✅ Added speech samples!")
except Exception as e:
    print(f"⚠️  CommonVoice download failed (optional): {e}")

print("\n" + "="*60)
print(f"✅ BACKUP COMPLETE!")
print(f"Total background files: {len(os.listdir('./fma'))}")
print("="*60)
print("\nNow re-run Step 2 (augmentation)")

DOWNLOADING BACKUP BACKGROUND AUDIO
Source: FreeSound dataset (alternative to AudioSet)

📥 Downloading additional FMA tracks...
Current FMA clips: 120
Downloading 360 more clips (~3 hours)...


100%|██████████| 360/360 [01:46<00:00,  3.37it/s]


✅ FMA now has 480 clips!

📥 Downloading CommonVoice samples for speech background...
⚠️  CommonVoice download failed (optional): Couldn't find a dataset script at /content/mozilla-foundation/common_voice_11_0/common_voice_11_0.py or any data file in the same directory. Couldn't find 'mozilla-foundation/common_voice_11_0' on the Hugging Face Hub either: FileNotFoundError: Dataset 'mozilla-foundation/common_voice_11_0' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with `huggingface-cli login`.

✅ BACKUP COMPLETE!
Total background files: 480

Now re-run Step 2 (augmentation)





In [23]:
# DELETE old features to force regeneration with new augmentation_rounds
import os
import shutil

base_dir = "./my_custom_model/Hello_James"

print("Deleting old augmented data...")

# Delete feature files
for feature_file in ["positive_features_train.npy", "positive_features_test.npy"]:
    path = f"{base_dir}/{feature_file}"
    if os.path.exists(path):
        os.remove(path)
        print(f"✓ Deleted {feature_file}")
    else:
        print(f"  {feature_file} not found (already deleted)")

# Delete negative directories (will be regenerated)
for neg_dir in ["negative_train", "negative_test"]:
    path = f"{base_dir}/{neg_dir}"
    if os.path.exists(path):
        shutil.rmtree(path)
        print(f"✓ Deleted {neg_dir}/")
    else:
        print(f"  {neg_dir}/ not found (already deleted)")

print("\n✓ Ready for fresh augmentation!")

Deleting old augmented data...
✓ Deleted positive_features_train.npy
✓ Deleted positive_features_test.npy
✓ Deleted negative_train/
✓ Deleted negative_test/

✓ Ready for fresh augmentation!


In [23]:
# GENERATE ADVERSARIAL NEGATIVE CLIPS
# This creates similar-sounding phrases like "hello jane", "yellow james", etc.
import sys

print("="*60)
print("Generating adversarial negative clips...")
print("This will take ~5-10 minutes")
print("="*60)

!{sys.executable} openwakeword/openwakeword/train.py --training_config my_model.yaml --generate_clips

print("\n" + "="*60)
print("✓ Adversarial clips generated!")
print("="*60)
print("\nVerifying negative samples were created:")
!ls -lh {base_dir}/negative_train/ | head -5
!ls -lh {base_dir}/negative_test/ | head -5

Generating adversarial negative clips...
This will take ~5-10 minutes
2025-11-29 00:22:04.492039: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-29 00:22:04.507740: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764375724.526481   21092 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764375724.532941   21092 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-11-29 00:22:04.554351: I tensorflow/core/platform/cpu_feature_guard.

In [None]:
# CHECK sample rate of adversarial clips
import scipy.io.wavfile as wavfile
import os
import glob

base_dir = "./my_custom_model/Hello_James"

print("Checking sample rates of adversarial clips...")

# Check a few negative samples
for subdir in ['negative_train', 'negative_test']:
    path = f"{base_dir}/{subdir}"
    files = glob.glob(f"{path}/*.wav")[:3]  # Check first 3 files

    print(f"\n{subdir}:")
    for f in files:
        sr, data = wavfile.read(f)
        print(f"  {os.path.basename(f)}: {sr} Hz")

In [None]:
# FIX 3 CORRECTED: Resample all clips from 22050 Hz to 16000 Hz (with error handling)
import os
import scipy.io.wavfile as wavfile
import scipy.signal
from tqdm import tqdm

base_dir = "./my_custom_model/Hello_James"

# First, delete the old features file created with wrong sample rate
features_file = f"{base_dir}/positive_features_train.npy"
if os.path.exists(features_file):
    os.remove(features_file)
    print(f"✓ Deleted old features file\n")

# Resample all audio files (with error handling for corrupted files)
for subdir in ['positive_train', 'positive_test', 'negative_train', 'negative_test']:
    path = f"{base_dir}/{subdir}"
    if not os.path.exists(path):
        print(f"⚠️  {subdir} doesn't exist, skipping...")
        continue

    files = [f for f in os.listdir(path) if f.endswith('.wav')]
    print(f"Resampling {len(files)} files in {subdir}...")

    corrupted = 0
    for filename in tqdm(files):
        filepath = os.path.join(path, filename)

        try:
            # Check file size first (catch empty files)
            if os.path.getsize(filepath) < 1000:  # Less than 1KB = corrupted
                os.remove(filepath)
                corrupted += 1
                continue

            sr, data = wavfile.read(filepath)

            if sr != 16000:
                # Resample to 16000 Hz
                number_of_samples = round(len(data) * 16000 / sr)
                resampled = scipy.signal.resample(data, number_of_samples)
                wavfile.write(filepath, 16000, resampled.astype(data.dtype))

        except Exception as e:
            # Delete corrupted file
            os.remove(filepath)
            corrupted += 1

    if corrupted > 0:
        print(f"  ⚠️  Deleted {corrupted} corrupted files")

print("\n✓ All clips resampled to 16000 Hz!")
print("Now continue with next step!")

Resampling 5960 files in positive_train...


100%|██████████| 5960/5960 [00:00<00:00, 22291.65it/s]


Resampling 5000 files in positive_test...


100%|██████████| 5000/5000 [00:05<00:00, 939.36it/s]


  ⚠️  Deleted 1 corrupted files
Resampling 4500 files in negative_train...


 75%|███████▍  | 3366/4500 [00:14<00:04, 229.50it/s]

In [32]:
# COMPLETE cleanup if augmentation fails to remove all partial files.
import os
import glob

base_dir = "./my_custom_model/Hello_James"

print("Cleaning up ALL partial augmentation files...")

# Delete ALL .npy feature files (positive AND negative)
npy_files = glob.glob(f"{base_dir}/*.npy")
if npy_files:
    for npy_file in npy_files:
        os.remove(npy_file)
        print(f"✓ Deleted {os.path.basename(npy_file)}")
else:
    print("  No .npy files found")

# Also check for any augmented wav files (shouldn't exist, but just in case)
for subdir in ['positive_train', 'positive_test', 'negative_train', 'negative_test']:
    aug_path = f"{base_dir}/{subdir}_augmented"
    if os.path.exists(aug_path):
        import shutil
        shutil.rmtree(aug_path)
        print(f"✓ Deleted {subdir}_augmented/")

print("\n✅ Complete cleanup done - safe to retry augmentation!")

Cleaning up partial augmentation files...
  positive_features_train.npy - not found (ok)
  positive_features_test.npy - not found (ok)

✓ Safe to re-run augmentation now!


In [None]:
# 🔍 PRE-AUGMENTATION VALIDATION: Check EVERYTHING before starting
import os
import wave
import glob

print("="*60)
print("🔍 PRE-AUGMENTATION VALIDATION")
print("="*60)

base_dir = "./my_custom_model/Hello_James"
all_good = True

# 1. Check all sample directories exist with correct counts
print("\n1️⃣ Checking sample directories...")
expected = {
    'positive_train': 10000,
    'positive_test': 4999,
    'negative_train': 4500,
    'negative_test': 4500
}

for subdir, expected_count in expected.items():
    path = f"{base_dir}/{subdir}"
    if os.path.exists(path):
        count = len([f for f in os.listdir(path) if f.endswith('.wav')])
        if count >= expected_count * 0.95:  # Allow 5% tolerance
            print(f"  ✓ {subdir}: {count} files")
        else:
            print(f"  ⚠️  {subdir}: {count} files (expected ~{expected_count})")
            all_good = False
    else:
        print(f"  ❌ {subdir}: MISSING")
        all_good = False

# 2. Check sample rates (random sampling)
print("\n2️⃣ Checking sample rates (16000 Hz required)...")
for subdir in ['positive_train', 'positive_test', 'negative_train', 'negative_test']:
    path = f"{base_dir}/{subdir}"
    if os.path.exists(path):
        files = glob.glob(f"{path}/*.wav")[:5]  # Check 5 random files
        bad_files = 0
        for f in files:
            try:
                with wave.open(f, 'rb') as wav:
                    sr = wav.getframerate()
                    if sr != 16000:
                        bad_files += 1
            except:
                bad_files += 1

        if bad_files == 0:
            print(f"  ✓ {subdir}: All checked samples at 16000 Hz")
        else:
            print(f"  ❌ {subdir}: Found {bad_files} files with wrong sample rate!")
            all_good = False

# 3. Check background audio exists
print("\n3️⃣ Checking background audio...")
for bg_path in ['./audioset_16k', './fma', './mit_rirs']:
    if os.path.exists(bg_path):
        count = len(glob.glob(f"{bg_path}/*.wav"))
        print(f"  ✓ {bg_path}: {count} files")
        if count == 0:
            print(f"    ⚠️  Directory exists but is EMPTY!")
            all_good = False
    else:
        print(f"  ❌ {bg_path}: MISSING")
        all_good = False

# 4. Check config file
print("\n4️⃣ Checking config file...")
if os.path.exists('my_model.yaml'):
    print(f"  ✓ my_model.yaml exists")
    !grep -E "(n_samples|augmentation_rounds|layer_size|steps):" my_model.yaml
else:
    print(f"  ❌ my_model.yaml MISSING")
    all_good = False

# 5. Check no feature files exist (should be clean slate)
print("\n5️⃣ Checking no old features exist...")
for feature in ['positive_features_train.npy', 'positive_features_test.npy']:
    path = f"{base_dir}/{feature}"
    if os.path.exists(path):
        print(f"  ⚠️  {feature} exists (will be overwritten)")
    else:
        print(f"  ✓ {feature} not found (clean slate)")

# Final verdict
print("\n" + "="*60)
if all_good:
    print("✅ ALL CHECKS PASSED - READY FOR AUGMENTATION!")
else:
    print("❌ ISSUES FOUND - FIX BEFORE AUGMENTATION!")
print("="*60)

In [None]:
# Step 2: Augment the generated clips

!{sys.executable} openwakeword/openwakeword/train.py --training_config my_model.yaml --augment_clips

2025-11-28 23:43:21.238533: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764373401.260339   57677 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764373401.266801   57677 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
  torchaudio.set_audio_backend("soundfile")
  >>> augment = PitchShift(..., output_type='dict')
  >>> augmented_samples = augment(samples).samples
  >>> augment = BandStopFilter(..., output_type='dict')
  >>> augmented_samples = augment(samples).samples
  >>> augment = AddColoredNoise(..., output_type='dict')
  >>> augmented_samples = augment(samples).samples
  >>> augment = AddBackgroundNoise(..., output_type='dict')
  >>> augmented

In [28]:
# Step 3: Train model

!{sys.executable} openwakeword/openwakeword/train.py --training_config my_model.yaml --train_model

Exception ignored in: <function _get_module_lock.<locals>.cb at 0x7d816bcedd00>
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 199, in cb
KeyboardInterrupt: 
Traceback (most recent call last):
  File "/content/openwakeword/openwakeword/train.py", line 4, in <module>
    import torchmetrics
  File "/usr/local/lib/python3.11/dist-packages/torchmetrics/__init__.py", line 14, in <module>
    from torchmetrics import functional  # noqa: E402
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torchmetrics/functional/__init__.py", line 120, in <module>
    from torchmetrics.functional.text._deprecated import _bleu_score as bleu_score
  File "/usr/local/lib/python3.11/dist-packages/torchmetrics/functional/text/__init__.py", line 50, in <module>
    from torchmetrics.functional.text.bert import bert_score  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages

In [21]:
# FIX: Downgrade onnx to compatible version
!pip install -q onnx==1.12.0

# Now convert to TFLite
import onnx
import logging
import tempfile
from onnx_tf.backend import prepare
import tensorflow as tf
import os

def convert_onnx_to_tflite(onnx_model_path, output_path):
    """Converts an ONNX version of an openwakeword model to the Tensorflow tflite format."""
    onnx_model = onnx.load(onnx_model_path)
    tf_rep = prepare(onnx_model, device="CPU")
    with tempfile.TemporaryDirectory() as tmp_dir:
        tf_rep.export_graph(os.path.join(tmp_dir, "tf_model"))
        converter = tf.lite.TFLiteConverter.from_saved_model(os.path.join(tmp_dir, "tf_model"))
        tflite_model = converter.convert()
        logging.info(f"####\nSaving tflite model to '{output_path}'")
        with open(output_path, 'wb') as f:
            f.write(tflite_model)
    return output_path

# Convert your model
model_name = "Hello_James"
onnx_path = f"my_custom_model/{model_name}.onnx"
tflite_path = f"my_custom_model/{model_name}.tflite"

print(f"Converting {onnx_path}...")
convert_onnx_to_tflite(onnx_path, tflite_path)

print(f"\n✓ Conversion complete!")
print(f"ONNX model: {onnx_path} ({os.path.getsize(onnx_path)/1024:.1f} KB)")
print(f"TFLite model: {tflite_path} ({os.path.getsize(tflite_path)/1024:.1f} KB)")

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.1/10.1 MB[0m [31m28.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m162.1/162.1 kB[0m [31m21.1 MB/s[0m eta [36m0:00:00[0m
[?25h  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpython setup.py bdist_wheel[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m See above for output.
  
  [1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for onnx (setup.py) ... [?25lerror
[31m  ERROR: Failed building wheel for onnx[0m[31m
[0m[31mERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (onnx)[0m[31m
[0m[?25h

ImportError: cannot import name 'mapping' from 'onnx' (/usr/local/lib/python3.11/dist-packages/onnx/__init__.py)

In [20]:
# Step 4 (Optional): On Google Colab, sometimes the .tflite model isn't saved correctly
# If so, run this cell to retry

# Manually save to tflite as this doesn't work right in colab
def convert_onnx_to_tflite(onnx_model_path, output_path):
    """Converts an ONNX version of an openwakeword model to the Tensorflow tflite format."""
    # imports
    import onnx
    import logging
    import tempfile
    from onnx_tf.backend import prepare
    import tensorflow as tf

    # Convert to tflite from onnx model
    onnx_model = onnx.load(onnx_model_path)
    tf_rep = prepare(onnx_model, device="CPU")
    with tempfile.TemporaryDirectory() as tmp_dir:
        tf_rep.export_graph(os.path.join(tmp_dir, "tf_model"))
        converter = tf.lite.TFLiteConverter.from_saved_model(os.path.join(tmp_dir, "tf_model"))
        tflite_model = converter.convert()

        logging.info(f"####\nSaving tflite mode to '{output_path}'")
        with open(output_path, 'wb') as f:
            f.write(tflite_model)

    return None

convert_onnx_to_tflite(f"my_custom_model/{config['model_name']}.onnx", f"my_custom_model/{config['model_name']}.tflite")


ImportError: cannot import name 'mapping' from 'onnx' (/usr/local/lib/python3.11/dist-packages/onnx/__init__.py)

After the model finishes training, the auto training script will automatically convert it to ONNX and tflite versions, saving them as `my_custom_model/<model_name>.onnx/tflite` in the present working directory, where `<model_name>` is defined in the YAML training config file. Either version can be used as normal with `openwakeword`. I recommend testing them with the [`detect_from_microphone.py`](https://github.com/dscripka/openWakeWord/blob/main/examples/detect_from_microphone.py) example script to see how the model performs!