# Easy microWakeWord Training

This notebook walks through training a custom wake word model with microWakeWord. It is aimed at users with little machine learning experience.

It uses the same direct training approach as the basic notebook, with added explanations and presets to help you understand and customize each step.

## What You'll Need

- Python 3.10 installed
- A GPU is recommended for faster training (but not required)
- Your desired wake word phrase (e.g., "hey computer")

## Setup

First, let's install the required packages:

In [None]:
# Install microWakeWord and dependencies
import platform

if platform.system() == "Darwin":
    # `pymicro-features` is installed from a fork to support building on macOS
    !pip install 'git+https://github.com/puddly/pymicro-features@puddly/minimum-cpp-version'

# `audio-metadata` is installed from a fork to unpin `attrs` from a version that breaks Jupyter
!pip install 'git+https://github.com/whatsnowplaying/audio-metadata@d4ebb238e6a401bb1a5aaaac60c9e2b3cb30929f'

# Install ipywidgets for interactive notebook elements
!pip install ipywidgets

!git clone https://github.com/BigPappy098/microWakeWord
!pip install -e ./microWakeWord

## Step 1: Choose Your Wake Word

Choose a wake word phrase that you want to use. Good wake words typically have:
- Multiple syllables (3-5 is ideal)
- Distinctive sounds that don't commonly appear in everyday speech
- Clear pronunciation

Examples: "hey computer", "jarvis", "alexa", "computer"

You can use phonetic spellings to improve recognition. For example, "computer" might be better as "kuhm-pyoo-ter".
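
As a quick sanity check against the syllable guideline above, a rough estimate can be computed by counting vowel groups. This is a minimal sketch: `estimate_syllables` is a hypothetical helper, and the heuristic (treating `y` as a vowel, ignoring silent letters) is an assumption that will miscount some words.

```python
import re


def estimate_syllables(phrase: str) -> int:
    """Roughly estimate syllables by counting groups of consecutive vowels."""
    # Heuristic only: 'y' is treated as a vowel and silent letters are not handled.
    return len(re.findall(r"[aeiouy]+", phrase.lower()))


for candidate in ["jarvis", "alexa", "hey computer", "kuhm-pyoo-ter"]:
    print(candidate, estimate_syllables(candidate))
```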

In [None]:
# Set your wake word here
wake_word = "hey_computer"  # Use underscores instead of spaces

# Listen to a sample of how it will sound
import os
import sys
from IPython.display import Audio

if not os.path.exists("./piper-sample-generator"):
    !git clone https://github.com/rhasspy/piper-sample-generator
    !wget -O piper-sample-generator/models/en_US-libritts_r-medium.pt 'https://github.com/rhasspy/piper-sample-generator/releases/download/v2.0.0/en_US-libritts_r-medium.pt'
    !pip install torch torchaudio piper-phonemize-cross==1.2.1

    if "piper-sample-generator/" not in sys.path:
        sys.path.append("piper-sample-generator/")

!mkdir -p sample_test
!python3 piper-sample-generator/generate_samples.py "{wake_word}" \
--max-samples 1 \
--batch-size 1 \
--output-dir sample_test

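# display() is required here because Jupyter only renders a bare expression
# automatically when it is the last statement in the cell
display(Audio("sample_test/0.wav", autoplay=True))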

# Create directory for generated samples
!mkdir -p generated_samples

## Step 2: Choose Training Parameters

Now, let's configure the training process based on your wake word and needs:

1. **Wake Word Length**: Choose a preset based on the length of your wake word
   - `short`: For 1-2 syllable wake words (e.g., "jarvis")
   - `medium`: For 3-4 syllable wake words (e.g., "hey computer")
   - `long`: For 5+ syllable wake words (e.g., "hey google assistant")

2. **Augmentation Level**: Choose how much to vary the training samples
   - `light`: Less variation, good for quiet environments
   - `medium`: Balanced variation, good for most home environments
   - `heavy`: High variation, good for noisy environments

3. **Sample Count**: How many synthetic samples to generate
   - 500-1000 is good for testing
   - 2000-5000 is recommended for production models
   
4. **Batch Size**: Size of batches during training
   - Larger values may train faster but require more memory
   - Smaller values use less memory but may train slower
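
The options above can be checked before any long-running step starts. The sketch below is one possible way to do that; `suggest_preset` and `check_settings` are hypothetical helpers, and the syllable boundaries simply mirror the preset descriptions in the list.

```python
VALID_PRESETS = ("short", "medium", "long")
VALID_AUGMENTATION_LEVELS = ("light", "medium", "heavy")


def suggest_preset(syllables: int) -> str:
    """Map a syllable count to the preset names described above."""
    if syllables <= 2:
        return "short"
    if syllables <= 4:
        return "medium"
    return "long"


def check_settings(preset, augmentation_level, samples_count, batch_size):
    """Raise ValueError if a setting is outside the options described above."""
    if preset not in VALID_PRESETS:
        raise ValueError(f"preset must be one of {VALID_PRESETS}, got {preset!r}")
    if augmentation_level not in VALID_AUGMENTATION_LEVELS:
        raise ValueError(
            f"augmentation_level must be one of {VALID_AUGMENTATION_LEVELS}, "
            f"got {augmentation_level!r}"
        )
    if samples_count < 1:
        raise ValueError("samples_count must be at least 1")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")


# Example: a four-syllable phrase with the notebook's default values.
check_settings(suggest_preset(4), "medium", samples_count=1000, batch_size=128)
print(suggest_preset(4))
```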

In [None]:
# Configure training parameters
preset = "medium"  # Choose from: "short", "medium", "long"
augmentation_level = "medium"  # Choose from: "light", "medium", "heavy"
samples_count = 1000  # Number of samples to generate
batch_size = 128  # Batch size for training (larger values may be faster but require more memory)

# Output directory
output_dir = f"trained_models/{wake_word}"

## Step 3: Download Negative Samples

To train a robust model, we need "negative" samples - audio that is NOT the wake word. These help the model learn what to ignore.
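
Once the download cell below has run, it can be worth confirming that every dataset folder referenced later in the training configuration is present. This is a minimal sketch: `missing_negative_sets` is a hypothetical helper, and it assumes each archive extracts to a folder named after the zip file.

```python
import os

# Folder names taken from the feature sources used in the training configuration.
EXPECTED_SETS = ["dinner_party", "dinner_party_eval", "no_speech", "speech"]


def missing_negative_sets(root="negative_datasets", expected=EXPECTED_SETS):
    """Return the expected dataset folders that do not exist under `root`."""
    return [name for name in expected if not os.path.isdir(os.path.join(root, name))]


missing = missing_negative_sets()
if missing:
    print("Missing negative datasets:", ", ".join(missing))
else:
    print("All negative datasets found.")
```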

In [None]:
# Download negative datasets
# A separate variable name keeps `output_dir` from Step 2 from being overwritten
negative_dir = "negative_datasets"
if not os.path.exists(negative_dir):
    os.makedirs(negative_dir)
    link_root = "https://huggingface.co/datasets/kahrendt/microwakeword/resolve/main/"
    filenames = ['dinner_party.zip', 'dinner_party_eval.zip', 'no_speech.zip', 'speech.zip']
    for fname in filenames:
        link = link_root + fname
        zip_path = os.path.join(negative_dir, fname)
        !wget -O {zip_path} {link}
        !unzip -q {zip_path} -d {negative_dir}

## Step 4: Generate Wake Word Samples

Now we'll generate a larger number of wake word samples for training. This step creates synthetic audio samples of your wake word with different voices and variations.

In [None]:
# Generate a larger number of wake word samples for training
!python3 piper-sample-generator/generate_samples.py "{wake_word}" \
--max-samples {samples_count} \
--batch-size {batch_size} \
--output-dir generated_samples

print(f"Generated {samples_count} samples with batch size {batch_size}")

## Step 5: Set Up Augmentation

Now we'll set up the audio augmentation pipeline. This adds variations to our samples like background noise, distortion, and room effects to make the model more robust in real-world environments.
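
The `min_snr` and `max_snr` values in the cell below are signal-to-noise ratios in decibels. To illustrate what a given SNR means, the sketch below scales a noise signal so that the mix reaches a target SNR. It uses plain Python lists and hypothetical helper names (`rms`, `noise_gain_for_snr`, `mix_at_snr`); it is not how microWakeWord's augmentation is implemented.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def noise_gain_for_snr(signal, noise, snr_db):
    """Gain to apply to `noise` so that signal RMS / noise RMS equals `snr_db` in dB."""
    return rms(signal) / (rms(noise) * 10 ** (snr_db / 20))


def mix_at_snr(signal, noise, snr_db):
    """Add scaled noise to the signal, sample by sample."""
    gain = noise_gain_for_snr(signal, noise, snr_db)
    return [s + gain * n for s, n in zip(signal, noise)]


# Stand-in signals: two sine tones at a 16 kHz sample rate.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [math.sin(2 * math.pi * 97 * t / 16000) for t in range(16000)]

for snr_db in (-3, 12):
    print(snr_db, round(noise_gain_for_snr(signal, noise, snr_db), 3))
```

A lower (or negative) SNR means the background is louder relative to the wake word, which is why the `heavy` level uses the lowest range.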

In [None]:
# Set up augmentation based on the selected level
from microwakeword.audio.augmentation import Augmentation
from microwakeword.audio.clips import Clips
from microwakeword.audio.spectrograms import SpectrogramGeneration

# Define augmentation parameters based on the selected level
augmentation_configs = {
    "light": {
        "probabilities": {
            "SevenBandParametricEQ": 0.05,
            "TanhDistortion": 0.05,
            "PitchShift": 0.05,
            "BandStopFilter": 0.05,
            "AddColorNoise": 0.05,
            "AddBackgroundNoise": 0.5,
            "Gain": 1.0,
            "RIR": 0.3,
        },
        "min_snr": 0,
        "max_snr": 15,
    },
    "medium": {
        "probabilities": {
            "SevenBandParametricEQ": 0.1,
            "TanhDistortion": 0.1,
            "PitchShift": 0.1,
            "BandStopFilter": 0.1,
            "AddColorNoise": 0.1,
            "AddBackgroundNoise": 0.75,
            "Gain": 1.0,
            "RIR": 0.5,
        },
        "min_snr": -3,
        "max_snr": 12,
    },
    "heavy": {
        "probabilities": {
            "SevenBandParametricEQ": 0.3,
            "TanhDistortion": 0.2,
            "PitchShift": 0.3,
            "BandStopFilter": 0.2,
            "AddColorNoise": 0.3,
            "AddBackgroundNoise": 0.9,
            "Gain": 1.0,
            "RIR": 0.7,
        },
        "min_snr": -5,
        "max_snr": 10,
    },
}

# Get the selected augmentation configuration
aug_config = augmentation_configs[augmentation_level]

# Set up the clips and augmentation
clips = Clips(
    input_directory='generated_samples',
    file_pattern='*.wav',
    max_clip_duration_s=None,
    remove_silence=False,
    random_split_seed=10,
    split_count=0.1,
)

augmenter = Augmentation(
    augmentation_duration_s=3.2,
    augmentation_probabilities=aug_config["probabilities"],
    impulse_paths=['mit_rirs'] if os.path.exists('mit_rirs') else [],
    background_paths=[p for p in ['fma_16k', 'audioset_16k'] if os.path.exists(p)],  # use only the sets that exist
    background_min_snr_db=aug_config["min_snr"],
    background_max_snr_db=aug_config["max_snr"],
    min_jitter_s=0.195,
    max_jitter_s=0.205,
)

print(f"Augmentation set up with {augmentation_level} level")
print(f"Background noise SNR range: {aug_config['min_snr']} to {aug_config['max_snr']} dB")

## Step 6: Generate Spectrograms

Now we'll generate spectrograms from our audio samples. A spectrogram is a time-frequency representation of the audio, and it is the input feature the neural network learns from.
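
To get a feel for how many frames a clip produces, the framing arithmetic can be sketched as follows. The 10 ms step matches `step_ms=10` in the cell below; the 30 ms window length and the `frame_count` helper are assumptions for illustration, and microWakeWord's own feature extraction may differ in detail.

```python
def frame_count(clip_ms: int, window_ms: int = 30, step_ms: int = 10) -> int:
    """Number of full analysis windows that fit in a clip of `clip_ms` milliseconds."""
    if clip_ms < window_ms:
        return 0
    return 1 + (clip_ms - window_ms) // step_ms


for clip_ms in (1000, 1500, 3200):
    print(clip_ms, "ms ->", frame_count(clip_ms), "frames")
```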

In [None]:
# Create directories for spectrograms
!mkdir -p generated_augmented_features

# Create spectrogram generator
spectrograms = SpectrogramGeneration(
    clips=clips,
    augmenter=augmenter,
    slide_frames=10,  # Uses the same spectrogram repeatedly, just shifted over by one frame
    step_ms=10,
)

# Generate spectrograms for training
print("Generating spectrograms for training... This may take a while.")
from microwakeword.audio.ragged_mmap import RaggedMmap

# microWakeWord's data loader looks for features under
# <features_dir>/{training,validation,testing}/<name>_mmap,
# so each split is written to its own subdirectory.
features_root = 'generated_augmented_features'
for split_dir in ('training', 'validation', 'testing'):
    os.makedirs(os.path.join(features_root, split_dir), exist_ok=True)

# Generate training spectrograms
RaggedMmap.from_generator(
    out_dir=os.path.join(features_root, 'training', 'wakeword_mmap'),
    sample_generator=spectrograms.spectrogram_generator(split="train", repeat=2),
    batch_size=batch_size,  # Using the same batch size as training
    verbose=True,
)

# Generate validation spectrograms
validation_spectrograms = SpectrogramGeneration(
    clips=clips,
    augmenter=augmenter,
    slide_frames=10,
    step_ms=10,
)

RaggedMmap.from_generator(
    out_dir=os.path.join(features_root, 'validation', 'wakeword_mmap'),
    sample_generator=validation_spectrograms.spectrogram_generator(split="validation", repeat=1),
    batch_size=batch_size,
    verbose=True,
)

# Generate testing spectrograms
testing_spectrograms = SpectrogramGeneration(
    clips=clips,
    augmenter=augmenter,
    slide_frames=1,  # The test set is evaluated with the streaming model, so no sliding is needed
    step_ms=10,
)

RaggedMmap.from_generator(
    out_dir=os.path.join(features_root, 'testing', 'wakeword_mmap'),
    sample_generator=testing_spectrograms.spectrogram_generator(split="test", repeat=1),
    batch_size=batch_size,
    verbose=True,
)

print("All spectrograms generated successfully.")

## Step 7: Create Training Configuration

Now we'll create a configuration file that controls the training process. This includes model architecture, training parameters, and data sources.
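
One rough way to read the `sampling_weight` values in the cell below is as relative sampling frequencies. The sketch assumes the weights are simply normalized into probabilities, which may not match microWakeWord's exact sampling logic; `sampling_fractions` is a hypothetical helper.

```python
def sampling_fractions(weights):
    """Normalize relative weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in weights.items()}


# Weights copied from the feature sources in this notebook.
weights = {
    "wake word": 2.0,
    "speech": 10.0,
    "dinner_party": 10.0,
    "no_speech": 5.0,
    "dinner_party_eval": 0.0,  # only used for validation and testing
}

for name, fraction in sampling_fractions(weights).items():
    print(f"{name}: {fraction:.1%}")
```

Under this assumption, raising the wake word's `sampling_weight` increases how often positive samples appear relative to the negative sets.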

In [None]:
import yaml

# Define model presets based on wake word length
model_presets = {
    "short": {  # For short wake words (1-2 syllables)
        "pointwise_filters": "48,48,48,48",
        "repeat_in_block": "1,1,1,1",
        "mixconv_kernel_sizes": "[5],[7,11],[9,15],[17]",
        "residual_connection": "0,0,0,0",
        "first_conv_filters": 32,
        "first_conv_kernel_size": 5,
        "stride": 3,
        "training_steps": 15000,
        "negative_class_weight": 15,
    },
    "medium": {  # For medium wake words (3-4 syllables)
        "pointwise_filters": "64,64,64,64",
        "repeat_in_block": "1,1,1,1",
        "mixconv_kernel_sizes": "[5],[7,11],[9,15],[23]",
        "residual_connection": "0,0,0,0",
        "first_conv_filters": 32,
        "first_conv_kernel_size": 5,
        "stride": 3,
        "training_steps": 20000,
        "negative_class_weight": 20,
    },
    "long": {  # For longer wake words (5+ syllables)
        "pointwise_filters": "64,64,64,64",
        "repeat_in_block": "1,1,1,1",
        "mixconv_kernel_sizes": "[5],[7,11],[9,15],[29]",
        "residual_connection": "0,0,0,0",
        "first_conv_filters": 32,
        "first_conv_kernel_size": 5,
        "stride": 3,
        "training_steps": 25000,
        "negative_class_weight": 25,
    },
}

# Get the selected model preset
selected_preset = model_presets[preset]

# Create output directory
os.makedirs(f"trained_models/{wake_word}/model", exist_ok=True)

# Create training configuration
config = {}
config["window_step_ms"] = 10
config["train_dir"] = f"trained_models/{wake_word}/model"
config["summaries_dir"] = os.path.join(config["train_dir"], "summaries")

# Define feature sources
config["features"] = [
    {
        "features_dir": "generated_augmented_features",
        "sampling_weight": 2.0,
        "penalty_weight": 1.0,
        "truth": True,
        "truncation_strategy": "truncate_start",
        "type": "mmap",
    },
    {
        "features_dir": "negative_datasets/speech",
        "sampling_weight": 10.0,
        "penalty_weight": 1.0,
        "truth": False,
        "truncation_strategy": "random",
        "type": "mmap",
    },
    {
        "features_dir": "negative_datasets/dinner_party",
        "sampling_weight": 10.0,
        "penalty_weight": 1.0,
        "truth": False,
        "truncation_strategy": "random",
        "type": "mmap",
    },
    {
        "features_dir": "negative_datasets/no_speech",
        "sampling_weight": 5.0,
        "penalty_weight": 1.0,
        "truth": False,
        "truncation_strategy": "random",
        "type": "mmap",
    },
    { # Only used for validation and testing
        "features_dir": "negative_datasets/dinner_party_eval",
        "sampling_weight": 0.0,
        "penalty_weight": 1.0,
        "truth": False,
        "truncation_strategy": "split",
        "type": "mmap",
    },
]

# Training parameters
config["training_steps"] = [selected_preset["training_steps"]]
config["positive_class_weight"] = [1]
config["negative_class_weight"] = [selected_preset["negative_class_weight"]]
config["learning_rates"] = [0.001]
config["batch_size"] = batch_size
config["time_mask_max_size"] = [5]
config["time_mask_count"] = [2]
config["freq_mask_max_size"] = [5]
config["freq_mask_count"] = [2]
config["eval_step_interval"] = 500
config["clip_duration_ms"] = 1500
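config["target_minimization"] = 0.9
config["minimization_metric"] = None  # Set explicitly, as in the upstream basic notebook; None disables it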
config["maximization_metric"] = "average_viable_recall"

# Get a sample spectrogram to determine dimensions
sample_spec = spectrograms.get_random_spectrogram()
config["spectrogram_length"] = sample_spec.shape[0]
config["feature_count"] = sample_spec.shape[1]
config["training_input_shape"] = [config["spectrogram_length"], config["feature_count"]]

print(f"Spectrogram dimensions: {config['spectrogram_length']} x {config['feature_count']}")

# Save configuration to file
config_path = "training_parameters.yaml"
with open(config_path, "w") as file:
    yaml.dump(config, file)

print(f"Training configuration saved to {config_path}")

## Step 8: Train the Model

Now we'll train the neural network model using the configuration we created. This process will:
1. Train the model on the spectrograms we generated
2. Convert the model to a streaming TFLite format for deployment
3. Test the model's performance
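
Outside a notebook, the `!python -m ...` line in the cell below can be expressed as an argument list. The sketch only rebuilds the flags already used in this notebook; `build_train_command` is a hypothetical helper, and nothing is executed here.

```python
import shlex
import sys

# Evaluation flags and values copied from the training cell in this notebook.
TEST_FLAGS = {
    "test_tf_nonstreaming": 0,
    "test_tflite_nonstreaming": 0,
    "test_tflite_nonstreaming_quantized": 0,
    "test_tflite_streaming": 0,
    "test_tflite_streaming_quantized": 1,
}


def build_train_command(preset, config_path="training_parameters.yaml"):
    """Build the training command as a list of arguments."""
    command = [
        sys.executable, "-m", "microwakeword.model_train_eval",
        f"--training_config={config_path}",
        "--train", "1",
        "--restore_checkpoint", "1",
    ]
    for flag, value in TEST_FLAGS.items():
        command += [f"--{flag}", str(value)]
    command += ["--use_weights", "best_weights", "mixednet"]
    for key in (
        "pointwise_filters", "repeat_in_block", "mixconv_kernel_sizes",
        "residual_connection", "first_conv_filters",
        "first_conv_kernel_size", "stride",
    ):
        command += [f"--{key}", str(preset[key])]
    return command


# The "medium" preset values from this notebook.
medium = {
    "pointwise_filters": "64,64,64,64",
    "repeat_in_block": "1,1,1,1",
    "mixconv_kernel_sizes": "[5],[7,11],[9,15],[23]",
    "residual_connection": "0,0,0,0",
    "first_conv_filters": 32,
    "first_conv_kernel_size": 5,
    "stride": 3,
}

command = build_train_command(medium)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would raise an error if training fails, which makes failures harder to miss than with a `!` shell line.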

In [None]:
# Train the model using the direct approach from the basic notebook
# This gives us more control and consistency in the training process

# Get model parameters from the selected preset
pointwise_filters = selected_preset["pointwise_filters"]
repeat_in_block = selected_preset["repeat_in_block"]
mixconv_kernel_sizes = selected_preset["mixconv_kernel_sizes"]
residual_connection = selected_preset["residual_connection"]
first_conv_filters = selected_preset["first_conv_filters"]
first_conv_kernel_size = selected_preset["first_conv_kernel_size"]
stride = selected_preset["stride"]

# Run the training command
!python -m microwakeword.model_train_eval \
--training_config='training_parameters.yaml' \
--train 1 \
--restore_checkpoint 1 \
--test_tf_nonstreaming 0 \
--test_tflite_nonstreaming 0 \
--test_tflite_nonstreaming_quantized 0 \
--test_tflite_streaming 0 \
--test_tflite_streaming_quantized 1 \
--use_weights "best_weights" \
mixednet \
--pointwise_filters "{pointwise_filters}" \
--repeat_in_block  "{repeat_in_block}" \
--mixconv_kernel_sizes '{mixconv_kernel_sizes}' \
--residual_connection "{residual_connection}" \
--first_conv_filters {first_conv_filters} \
--first_conv_kernel_size {first_conv_kernel_size} \
--stride {stride}

print(f"Training command finished. If it completed without errors, the model is in trained_models/{wake_word}/model/")

## Step 9: Download Your Model

Once training is complete, you can download your model for use with ESPHome or other compatible systems.
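
If you prefer a single download, the model and any accompanying files can be bundled into one zip archive. This is a minimal sketch using only the standard library; `bundle_files` is a hypothetical helper, and the demonstration uses stand-in files in a temporary directory rather than real training output.

```python
import os
import tempfile
import zipfile


def bundle_files(paths, archive_path):
    """Zip the files in `paths` that exist and return the names that were added."""
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            if os.path.isfile(path):
                name = os.path.basename(path)
                archive.write(path, arcname=name)
                added.append(name)
    return added


# Demonstration with a stand-in model file; the manifest is left out on purpose.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "stream_state_internal_quant.tflite")
    with open(model_path, "wb") as handle:
        handle.write(b"placeholder")
    manifest_path = os.path.join(tmp, "manifest.json")
    archive_path = os.path.join(tmp, "wake_word_bundle.zip")
    print(bundle_files([model_path, manifest_path], archive_path))
```

In this notebook you would pass the model path from the cell below, plus the manifest path once it has been created.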

In [None]:
from IPython.display import FileLink, display

# Path to the trained model
model_file = os.path.join(
    "trained_models", wake_word, "model",
    "tflite_stream_state_internal_quant", "stream_state_internal_quant.tflite",
)

if os.path.exists(model_file):
    print("Your model is ready! Click below to download:")
    display(FileLink(model_file))
else:
    print(f"Model file not found at {model_file}. Check for errors in the training process.")

## Step 10: Create a Model Manifest for ESPHome

To use your model with ESPHome, you need to create a model manifest JSON file. Here's a template:

In [None]:
import json

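# Create a model manifest for ESPHome.
# ESPHome's micro_wake_word component expects a version 2 manifest with the keys below.
# Values marked "assumption" are placeholders to adjust for your own model.
detection_threshold = 0.7  # written to the manifest as "probability_cutoff"

manifest = {
    "type": "micro",
    "wake_word": wake_word.replace("_", " "),
    "author": "Your Name",  # assumption: placeholder
    "website": "https://example.com",  # assumption: placeholder
    # Path to the model file, relative to this manifest
    "model": "tflite_stream_state_internal_quant/stream_state_internal_quant.tflite",
    "trained_languages": ["en"],
    "version": 2,
    "micro": {
        "probability_cutoff": detection_threshold,
        "sliding_window_size": 10,
        "feature_step_size": 10,  # matches window_step_ms in the training configuration
        "tensor_arena_size": 30000,  # assumption: placeholder; must be large enough for your model
        "minimum_esphome_version": "2024.7.0",
    },
}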

manifest_file = os.path.join(f"trained_models/{wake_word}/model/manifest.json")
with open(manifest_file, 'w') as f:
    json.dump(manifest, f, indent=2)

print(f"Model manifest created at {manifest_file}")
display(FileLink(manifest_file))

## Troubleshooting and Fine-Tuning

If your model doesn't perform as expected, here are some tips:

1. **False Positives** (activates too often):
   - Increase the `negative_class_weight` in the training configuration
   - Increase the `detection_threshold` in the manifest file
   - Try a different phonetic spelling of your wake word

2. **False Negatives** (doesn't activate when it should):
   - Decrease the `negative_class_weight` in the training configuration
   - Decrease the `detection_threshold` in the manifest file
   - Generate more training samples
   - Try a different phonetic spelling of your wake word

3. **Advanced Configuration**:
   - Edit the `training_parameters.yaml` file directly for more control
   - Increase `training_steps`, which may improve accuracy at the cost of longer training
   - Adjust the augmentation parameters to match your environment
   - Try different model architectures by changing the `mixednet` parameters
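
To see why the detection threshold trades false positives against false negatives, the sketch below averages recent model probabilities over a sliding window and reports a detection whenever the average exceeds the cutoff. It is an illustration only: `detections` is a hypothetical helper, the probabilities are made up, and ESPHome's actual detection logic may differ.

```python
def detections(probabilities, cutoff, window):
    """Return the indices where the mean of the last `window` values exceeds `cutoff`."""
    hits = []
    for end in range(window, len(probabilities) + 1):
        mean = sum(probabilities[end - window:end]) / window
        if mean > cutoff:
            hits.append(end - 1)
    return hits


# Made-up per-frame probabilities with a short burst of high confidence.
probs = [0.1, 0.2, 0.9, 0.95, 0.9, 0.3, 0.1]

for cutoff in (0.5, 0.7, 0.95):
    print(cutoff, detections(probs, cutoff, window=3))
```

A higher cutoff yields fewer detections for the same probabilities, so it suppresses false positives but can also miss genuine activations.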