Deep learning project comparing CNN, LSTM, and Transformer architectures for spoken command classification. Trained on Google Speech Commands V2 dataset. Best model (fine-tuned AST) achieves 97.23% test accuracy.
Live demo: huggingface.co/spaces/ByJH/speech-command-demo
Keyword spotting system that classifies 1-second audio clips into 12 classes:
- 10 core commands: yes, no, up, down, left, right, on, off, stop, go
- Unknown: other words not in the core set
- Silence: background noise / no speech
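For reference, the 12-way mapping can be written as a simple dict. This is an illustrative sketch only — the repo's actual `id2label` ordering and the `_unknown_`/`_silence_` label strings are assumptions here:

```python
# Illustrative 12-class label map (the repo's actual id2label ordering may differ).
CORE_COMMANDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

id2label = {i: w for i, w in enumerate(CORE_COMMANDS)}
id2label[10] = "_unknown_"  # words outside the core set
id2label[11] = "_silence_"  # background noise / no speech

label2id = {v: k for k, v in id2label.items()}
print(len(id2label))  # 12
```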
| Model | Params | Test Acc | Macro F1 |
|---|---|---|---|
| CNN Baseline | 1.21M | 95.29% | 95.39% |
| LSTM (BiLSTM) | 677K | 93.75% | 93.83% |
| CNN-LSTM Hybrid | 1.65M | 95.16% | 95.24% |
| AST (fine-tuned) | 86.2M | 97.23% | 94.68% |
Trained on Google Colab (Tesla T4 GPU) with mixed precision. AST uses pretrained AudioSet weights.
```
speech-command-recognition/
├── src/
│   ├── augmentations.py        # mel spectrogram + audio augmentation
│   ├── data_loader.py          # dataset loading & preprocessing
│   ├── models.py               # CNN, LSTM, CNN-LSTM architectures
│   ├── models_transformer.py   # AST, Efficient Transformer
│   ├── train.py                # training loop, optimizer, scheduler
│   ├── evaluate.py             # metrics, confusion matrix, plots
│   └── export.py               # ONNX export & benchmarking
├── configs/
│   └── config.yaml
├── notebooks/                  # Jupyter notebooks for exploration
├── models/                     # saved checkpoints (.pt)
└── results/                    # eval metrics, confusion matrices
```
```bash
git clone https://github.com/ByJH/speech-command-recognition.git
cd speech-command-recognition
pip install -r requirements.txt
```

```python
from torch.utils.data import DataLoader

from src.augmentations import MelSpectrogramTransform
from src.data_loader import SpeechCommandsDataset
from src.models import get_model
from src.train import Trainer, create_optimizer, create_scheduler

# load data
tf = MelSpectrogramTransform(n_mels=128, augment=False)
train_ds = SpeechCommandsDataset(split="train", transform=tf, kws_mode=True, cache=True)
val_ds = SpeechCommandsDataset(split="validation", transform=tf, kws_mode=True, cache=True)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False, num_workers=0)

# train
model = get_model("cnn", num_classes=12, dropout=0.3)
opt = create_optimizer(model, "adamw", lr=1e-3)
sched = create_scheduler(opt, "plateau")
trainer = Trainer(model, train_loader, val_loader, opt, sched,
                  device="cuda", experiment_name="cnn_baseline")
history = trainer.train(epochs=50, patience=10, save_best=True)
```

```python
from src.evaluate import Evaluator

# test_loader: DataLoader over the test split; id2label: class index -> name
trainer.load_checkpoint("cnn_baseline_best.pt")
evaluator = Evaluator(model, test_loader, id2label, device="cuda", save_dir="results")
metrics = evaluator.generate_report("cnn_baseline")
```

CNN Baseline. VGG-style conv net with batch norm and global average pooling. 95.3%.
LSTM. Bidirectional LSTM processing mel spectrogram frames as a time series. 93.8%.
CNN-LSTM. CNN extracts spatial features, LSTM models temporal sequence, with attention pooling. 95.2%.
AST (Audio Spectrogram Transformer). Pretrained on AudioSet, fine-tuned on the 12 commands. 1-second spectrograms are interpolated along the time axis to the ~10-second input length the pretrained model expects. 97.2%.
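The AST input-length trick can be sketched with plain `torch.nn.functional.interpolate`. The frame counts below are illustrative assumptions (roughly 63 frames for 1 s of audio at hop 256, and a 1024-frame target for the pretrained model), not the repo's exact numbers:

```python
import torch
import torch.nn.functional as F

# 1-second mel spectrogram: (batch, n_mels, frames). With n_fft=1024, hop=256
# at 16 kHz this is ~63 frames; the pretrained AST expects a much longer time
# axis (assumed here to be 1024 frames, i.e. roughly 10 s of audio).
mel_1s = torch.randn(1, 128, 63)

# Interpolate along the time axis only; the 128 mel bins are untouched.
mel_10s = F.interpolate(mel_1s, size=1024, mode="linear", align_corners=False)
print(mel_10s.shape)  # torch.Size([1, 128, 1024])
```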
- Load 1-second audio at 16kHz
- Convert to 128-bin mel spectrogram (n_fft=1024, hop=256)
- Optional augmentation: time shift, speed perturbation, Gaussian noise, SpecAugment
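Of the augmentations listed, SpecAugment is the easiest to sketch in plain PyTorch: zero out one random band of mel bins and one random span of frames. The mask sizes here are illustrative, not the repo's settings:

```python
import torch

def spec_augment(mel, max_freq_mask=16, max_time_mask=10):
    """SpecAugment-style masking: one random frequency band and one random
    time span are zeroed. mel has shape (n_mels, frames); returns a masked copy."""
    mel = mel.clone()
    n_mels, frames = mel.shape

    # frequency mask: zero f consecutive mel bins starting at f0
    f = int(torch.randint(0, max_freq_mask + 1, (1,)))
    f0 = int(torch.randint(0, n_mels - f + 1, (1,)))
    mel[f0:f0 + f, :] = 0.0

    # time mask: zero t consecutive frames starting at t0
    t = int(torch.randint(0, max_time_mask + 1, (1,)))
    t0 = int(torch.randint(0, frames - t + 1, (1,)))
    mel[:, t0:t0 + t] = 0.0
    return mel

mel = torch.randn(128, 63)
aug = spec_augment(mel)
print(aug.shape)  # same shape as the input: torch.Size([128, 63])
```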
- Mixed precision FP16 training
- AdamW optimizer with ReduceLROnPlateau scheduler
- Early stopping (patience=10 for CNN/LSTM, 7 for AST)
- Checkpoint saving (best validation + final)
- All training done on Google Colab, Tesla T4 GPU
- AST trained with batch_size=16 due to model size
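A minimal mixed-precision training step of the kind described above — a generic sketch, not this repo's Trainer; the stand-in model and dummy batch are illustrative, and AMP is enabled only when a GPU is present:

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(128, 12).to(device)        # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min", patience=3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 128, device=device)      # dummy batch of mel features
y = torch.randint(0, 12, (16,), device=device)

opt.zero_grad(set_to_none=True)
with torch.autocast(device_type=device, enabled=use_amp):
    loss = loss_fn(model(x), y)              # forward in FP16 on GPU
scaler.scale(loss).backward()                # scaling is a no-op when disabled
scaler.step(opt)
scaler.update()
sched.step(loss.item())                      # ReduceLROnPlateau steps on a loss
print(float(loss))
```

In practice the scheduler would step on the validation loss once per epoch, not the training loss per batch.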
The project is developed locally and trained in Colab:
- Code is synced to Google Drive
- Colab copies project, installs deps, and trains on GPU
- Models and results save back to Drive
Key Colab notes:
- Use `datasets==2.21.0`; newer versions break Speech Commands loading
- Use `num_workers=0`; Colab multiprocessing is unreliable
- Cache the dataset in memory for fast training (~2 min/epoch cached vs. ~5 min/batch uncached)
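The in-memory caching can be sketched as a thin `Dataset` wrapper — a generic illustration; the repo's `cache=True` flag may be implemented differently:

```python
import torch
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Wrap any map-style dataset; keep each item in RAM after first access."""

    def __init__(self, base):
        self.base = base
        self._cache = {}

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        if idx not in self._cache:
            self._cache[idx] = self.base[idx]  # pay the load cost only once
        return self._cache[idx]

class SlowDataset(Dataset):
    """Stand-in for a dataset with expensive I/O; counts raw loads."""
    loads = 0
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        SlowDataset.loads += 1
        return torch.zeros(1), idx

ds = CachedDataset(SlowDataset())
for _ in range(3):                 # three "epochs" over the same items
    for i in range(len(ds)):
        ds[i]
print(SlowDataset.loads)  # 4 — each item hit the slow base dataset only once
```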
```python
from src.export import export_to_onnx, verify_onnx, benchmark

export_to_onnx(model, "models/cnn.onnx")
verify_onnx(model, "models/cnn.onnx")
benchmark(model, "models/cnn.onnx")
```

CUDA out of memory. Reduce `batch_size`; AST needs `batch_size=16` on a T4.
Dataset scripts no longer supported. Pin `datasets==2.21.0`.
`torchaudio` undefined symbol. Reinstall torch and torchaudio together: `pip install torch torchaudio --upgrade`.
`ReduceLROnPlateau` verbose error. PyTorch 2.x removed the `verbose` parameter; don't pass it.
- Speech Commands Dataset, Warden 2018
- Audio Spectrogram Transformer, Gong et al. 2021
- SpecAugment, Park et al. 2019
MIT