Deep learning project comparing CNN, LSTM, and Transformer architectures for spoken command classification. Trained on Google Speech Commands V2 dataset. Best model (fine-tuned AST) achieves 97.23% test accuracy.
Live demo: huggingface.co/spaces/ByJH/speech-command-demo
Keyword spotting system that classifies 1-second audio clips into 12 classes:
- 10 core commands: yes, no, up, down, left, right, on, off, stop, go
- Unknown: other words not in the core set
- Silence: background noise / no speech
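For reference, the 12-way mapping can be written as a simple dict. This is an illustrative sketch only — the repo's actual `id2label` ordering and the `_unknown_`/`_silence_` label strings are assumptions here:

```python
# Illustrative 12-class label map (the repo's actual id2label ordering may differ).
CORE_COMMANDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

id2label = {i: w for i, w in enumerate(CORE_COMMANDS)}
id2label[10] = "_unknown_"  # words outside the core set
id2label[11] = "_silence_"  # background noise / no speech

label2id = {v: k for k, v in id2label.items()}
print(len(id2label))  # 12
```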
| Model | Params | Test Acc | Macro F1 |
|---|---|---|---|
| CNN Baseline | 1.21M | 95.29% | 95.39% |
| LSTM (BiLSTM) | 677K | 93.75% | 93.83% |
| CNN-LSTM Hybrid | 1.65M | 95.16% | 95.24% |
| AST (fine-tuned) | 86.2M | 97.23% | 94.68% |
Trained on Google Colab (Tesla T4 GPU) with mixed precision. AST uses pretrained AudioSet weights.
```
speech-command-recognition/
├── src/
│   ├── augmentations.py        # mel spectrogram + audio augmentation
│   ├── data_loader.py          # dataset loading & preprocessing
│   ├── models.py               # CNN, LSTM, CNN-LSTM architectures
│   ├── models_transformer.py   # AST, Efficient Transformer
│   ├── train.py                # training loop, optimizer, scheduler
│   ├── evaluate.py             # metrics, confusion matrix, plots
│   └── export.py               # ONNX export & benchmarking
├── configs/
│   └── config.yaml
├── notebooks/                  # Jupyter notebooks for exploration
├── models/                     # saved checkpoints (.pt)
└── results/                    # eval metrics, confusion matrices
```
```bash
git clone https://github.com/ByJH/speech-command-recognition.git
cd speech-command-recognition
pip install -r requirements.txt
```

```python
from torch.utils.data import DataLoader

from src.augmentations import MelSpectrogramTransform
from src.data_loader import SpeechCommandsDataset
from src.models import get_model
from src.train import Trainer, create_optimizer, create_scheduler

# load data
tf = MelSpectrogramTransform(n_mels=128, augment=False)
train_ds = SpeechCommandsDataset(split="train", transform=tf, kws_mode=True, cache=True)
val_ds = SpeechCommandsDataset(split="validation", transform=tf, kws_mode=True, cache=True)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False, num_workers=0)

# train
model = get_model("cnn", num_classes=12, dropout=0.3)
opt = create_optimizer(model, "adamw", lr=1e-3)
sched = create_scheduler(opt, "plateau")
trainer = Trainer(model, train_loader, val_loader, opt, sched,
                  device="cuda", experiment_name="cnn_baseline")
history = trainer.train(epochs=50, patience=10, save_best=True)
```

```python
from src.evaluate import Evaluator

# test_loader: DataLoader over the test split; id2label: class index -> name
trainer.load_checkpoint("cnn_baseline_best.pt")
evaluator = Evaluator(model, test_loader, id2label, device="cuda", save_dir="results")
metrics = evaluator.generate_report("cnn_baseline")
```

CNN Baseline. VGG-style conv net with batch norm and global average pooling. 95.3%.
LSTM. Bidirectional LSTM processing mel spectrogram frames as a time series. 93.8%.
CNN-LSTM. CNN extracts spatial features, LSTM models temporal sequence, with attention pooling. 95.2%.
AST (Audio Spectrogram Transformer). Pretrained on AudioSet, fine-tuned on the 12 commands. 1-second spectrograms are interpolated along the time axis to the ~10-second input length the pretrained model expects. 97.2%.
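The AST input-length trick can be sketched with plain `torch.nn.functional.interpolate`. The frame counts below are illustrative assumptions (roughly 63 frames for 1 s of audio at hop 256, and a 1024-frame target for the pretrained model), not the repo's exact numbers:

```python
import torch
import torch.nn.functional as F

# 1-second mel spectrogram: (batch, n_mels, frames). With n_fft=1024, hop=256
# at 16 kHz this is ~63 frames; the pretrained AST expects a much longer time
# axis (assumed here to be 1024 frames, i.e. roughly 10 s of audio).
mel_1s = torch.randn(1, 128, 63)

# Interpolate along the time axis only; the 128 mel bins are untouched.
mel_10s = F.interpolate(mel_1s, size=1024, mode="linear", align_corners=False)
print(mel_10s.shape)  # torch.Size([1, 128, 1024])
```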
- Load 1-second audio at 16kHz
- Convert to 128-bin mel spectrogram (n_fft=1024, hop=256)
- Optional augmentation: time shift, speed perturbation, Gaussian noise, SpecAugment
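Of the augmentations listed, SpecAugment is the easiest to sketch in plain PyTorch: zero out one random band of mel bins and one random span of frames. The mask sizes here are illustrative, not the repo's settings:

```python
import torch

def spec_augment(mel, max_freq_mask=16, max_time_mask=10):
    """SpecAugment-style masking: one random frequency band and one random
    time span are zeroed. mel has shape (n_mels, frames); returns a masked copy."""
    mel = mel.clone()
    n_mels, frames = mel.shape

    # frequency mask: zero f consecutive mel bins starting at f0
    f = int(torch.randint(0, max_freq_mask + 1, (1,)))
    f0 = int(torch.randint(0, n_mels - f + 1, (1,)))
    mel[f0:f0 + f, :] = 0.0

    # time mask: zero t consecutive frames starting at t0
    t = int(torch.randint(0, max_time_mask + 1, (1,)))
    t0 = int(torch.randint(0, frames - t + 1, (1,)))
    mel[:, t0:t0 + t] = 0.0
    return mel

mel = torch.randn(128, 63)
aug = spec_augment(mel)
print(aug.shape)  # same shape as the input: torch.Size([128, 63])
```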
- Mixed precision FP16 training
- AdamW optimizer with ReduceLROnPlateau scheduler
- Early stopping (patience=10 for CNN/LSTM, 7 for AST)
- Checkpoint saving (best validation + final)
- All training done on Google Colab, Tesla T4 GPU
- AST trained with batch_size=16 due to model size
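A minimal mixed-precision training step of the kind described above — a generic sketch, not this repo's Trainer; the stand-in model and dummy batch are illustrative, and AMP is enabled only when a GPU is present:

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(128, 12).to(device)        # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min", patience=3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 128, device=device)      # dummy batch of mel features
y = torch.randint(0, 12, (16,), device=device)

opt.zero_grad(set_to_none=True)
with torch.autocast(device_type=device, enabled=use_amp):
    loss = loss_fn(model(x), y)              # forward in FP16 on GPU
scaler.scale(loss).backward()                # scaling is a no-op when disabled
scaler.step(opt)
scaler.update()
sched.step(loss.item())                      # ReduceLROnPlateau steps on a loss
print(float(loss))
```

In practice the scheduler would step on the validation loss once per epoch, not the training loss per batch.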
The project is developed locally and trained in Colab:
- Code is synced to Google Drive
- Colab copies project, installs deps, and trains on GPU
- Models and results save back to Drive
Key Colab notes:
- Use `datasets==2.21.0`; newer versions break Speech Commands loading
- Use `num_workers=0`; Colab multiprocessing is unreliable
- Cache the dataset in memory for fast training (~2 min/epoch cached vs. ~5 min/batch uncached)
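The in-memory caching can be sketched as a thin `Dataset` wrapper — a generic illustration; the repo's `cache=True` flag may be implemented differently:

```python
import torch
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Wrap any map-style dataset; keep each item in RAM after first access."""

    def __init__(self, base):
        self.base = base
        self._cache = {}

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        if idx not in self._cache:
            self._cache[idx] = self.base[idx]  # pay the load cost only once
        return self._cache[idx]

class SlowDataset(Dataset):
    """Stand-in for a dataset with expensive I/O; counts raw loads."""
    loads = 0
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        SlowDataset.loads += 1
        return torch.zeros(1), idx

ds = CachedDataset(SlowDataset())
for _ in range(3):                 # three "epochs" over the same items
    for i in range(len(ds)):
        ds[i]
print(SlowDataset.loads)  # 4 — each item hit the slow base dataset only once
```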
```python
from src.export import export_to_onnx, verify_onnx, benchmark

export_to_onnx(model, "models/cnn.onnx")
verify_onnx(model, "models/cnn.onnx")
benchmark(model, "models/cnn.onnx")
```

CUDA out of memory. Reduce `batch_size`; AST needs `batch_size=16` on a T4.
Dataset scripts no longer supported. Pin `datasets==2.21.0`.
`torchaudio` undefined symbol. Reinstall torch and torchaudio together: `pip install torch torchaudio --upgrade`.
`ReduceLROnPlateau` verbose error. PyTorch 2.x removed the `verbose` parameter; don't pass it.
- Speech Commands Dataset, Warden 2018
- Audio Spectrogram Transformer, Gong et al. 2021
- SpecAugment, Park et al. 2019
MIT