# Speech Command Recognition

A deep learning project comparing CNN, LSTM, and Transformer architectures for spoken-command classification, trained on the Google Speech Commands V2 dataset. The best model, a fine-tuned Audio Spectrogram Transformer (AST), achieves 97.23% test accuracy.

Live demo: [huggingface.co/spaces/ByJH/speech-command-demo](https://huggingface.co/spaces/ByJH/speech-command-demo)

## Overview

A keyword spotting system that classifies 1-second audio clips into 12 classes:

- 10 core commands: yes, no, up, down, left, right, on, off, stop, go
- Unknown: other words not in the core set
- Silence: background noise / no speech
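The label set above can be written out directly. A minimal sketch; the index order and the `_unknown_`/`_silence_` spellings follow the common Speech Commands convention and are assumptions, not taken from the project code:

```python
# 12-class label set from the overview above. Index order and the
# "_unknown_"/"_silence_" names are illustrative assumptions.
CORE_COMMANDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]
LABELS = CORE_COMMANDS + ["_unknown_", "_silence_"]
id2label = dict(enumerate(LABELS))
print(len(id2label))  # 12
```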

## Results

| Model | Params | Test Acc | Macro F1 |
|------------------|--------|----------|----------|
| CNN Baseline | 1.21M | 95.29% | 95.39% |
| LSTM (BiLSTM) | 677K | 93.75% | 93.83% |
| CNN-LSTM Hybrid | 1.65M | 95.16% | 95.24% |
| AST (fine-tuned) | 86.2M | 97.23% | 94.68% |

Trained on Google Colab (Tesla T4 GPU) with mixed precision. AST uses pretrained AudioSet weights.

## Project Structure

```
speech-command-recognition/
├── src/
│   ├── augmentations.py         # mel spectrogram + audio augmentation
│   ├── data_loader.py           # dataset loading & preprocessing
│   ├── models.py                # CNN, LSTM, CNN-LSTM architectures
│   ├── models_transformer.py    # AST, Efficient Transformer
│   ├── train.py                 # training loop, optimizer, scheduler
│   ├── evaluate.py              # metrics, confusion matrix, plots
│   └── export.py                # ONNX export & benchmarking
├── configs/
│   └── config.yaml
├── notebooks/                   # Jupyter notebooks for exploration
├── models/                      # saved checkpoints (.pt)
└── results/                     # eval metrics, confusion matrices
```

## Setup

```bash
git clone https://github.com/ByJH/speech-command-recognition.git
cd speech-command-recognition
pip install -r requirements.txt
```

## Quick Start

```python
from src.data_loader import SpeechCommandsDataset
from src.augmentations import MelSpectrogramTransform
from src.models import get_model
from src.train import Trainer, create_optimizer, create_scheduler
from torch.utils.data import DataLoader

# load data
tf = MelSpectrogramTransform(n_mels=128, augment=False)
train_ds = SpeechCommandsDataset(split="train", transform=tf, kws_mode=True, cache=True)
val_ds = SpeechCommandsDataset(split="validation", transform=tf, kws_mode=True, cache=True)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False, num_workers=0)

# train
model = get_model("cnn", num_classes=12, dropout=0.3)
opt = create_optimizer(model, "adamw", lr=1e-3)
sched = create_scheduler(opt, "plateau")

trainer = Trainer(model, train_loader, val_loader, opt, sched,
                  device="cuda", experiment_name="cnn_baseline")
history = trainer.train(epochs=50, patience=10, save_best=True)
```

## Evaluation

```python
from src.evaluate import Evaluator

# id2label maps class indices to label names (e.g. built from the dataset)
trainer.load_checkpoint("cnn_baseline_best.pt")
evaluator = Evaluator(model, test_loader, id2label, device="cuda", save_dir="results")
metrics = evaluator.generate_report("cnn_baseline")
```

## Models

**CNN Baseline.** VGG-style convolutional network with batch norm and global average pooling. 95.3%.

**LSTM.** Bidirectional LSTM that processes mel-spectrogram frames as a time series. 93.8%.

**CNN-LSTM.** A CNN extracts spatial features, an LSTM models the temporal sequence, and attention pooling aggregates the output. 95.2%.

**AST (Audio Spectrogram Transformer).** Pretrained on AudioSet and fine-tuned on the 12 commands; 1-second spectrograms are interpolated to the 10-second input length the model expects. 97.2%.
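The AST interpolation step can be sketched as follows. The target frame count (1024) and the tensor shapes are assumptions for illustration, not values taken from `models_transformer.py`:

```python
import torch
import torch.nn.functional as F

# Stretch a 1 s mel spectrogram (128 mels x 63 frames) along the time axis
# to the longer frame count an AudioSet-pretrained AST expects for ~10 s input.
spec = torch.randn(1, 128, 63)   # (batch, n_mels, time_frames)
target_frames = 1024             # assumed AST input length
stretched = F.interpolate(spec, size=target_frames,
                          mode="linear", align_corners=False)
print(tuple(stretched.shape))  # (1, 128, 1024)
```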

## Audio Pipeline

1. Load 1-second audio at 16 kHz
2. Convert to a 128-bin mel spectrogram (n_fft=1024, hop=256)
3. Optional augmentation: time shift, speed perturbation, Gaussian noise, SpecAugment
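As a sanity check, the spectrogram dimensions follow directly from the parameters above; the frame count here assumes centered STFT framing (the torchaudio default):

```python
# Shape of the mel spectrogram produced by the pipeline above.
sample_rate = 16_000
n_fft, hop_length, n_mels = 1024, 256, 128
num_samples = sample_rate * 1               # 1-second clip -> 16000 samples
num_frames = num_samples // hop_length + 1  # centered framing
print((n_mels, num_frames))  # (128, 63)
```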

## Training Details

- Mixed-precision (FP16) training
- AdamW optimizer with a ReduceLROnPlateau scheduler
- Early stopping: patience=10 for CNN/LSTM, 7 for AST
- Checkpoints saved for both best-validation and final epochs
- All training done on Google Colab (Tesla T4 GPU)
- AST trained with batch_size=16 due to model size

## Colab Training

The project is developed locally and trained in Colab:

1. Code is synced to Google Drive
2. Colab copies the project, installs dependencies, and trains on the GPU
3. Models and results are saved back to Drive

Key Colab notes:

- Pin datasets==2.21.0; newer versions break Speech Commands loading
- Use num_workers=0; Colab multiprocessing is unreliable
- Cache the dataset in memory for fast training (~2 min/epoch with the cache vs. ~5 min/batch without)

## ONNX Export

```python
from src.export import export_to_onnx, verify_onnx, benchmark

export_to_onnx(model, "models/cnn.onnx")
verify_onnx(model, "models/cnn.onnx")
benchmark(model, "models/cnn.onnx")
```

## Troubleshooting

**CUDA out of memory.** Reduce batch_size; AST needs batch_size=16 on a T4.

**Dataset scripts no longer supported.** Pin datasets==2.21.0.

**torchaudio undefined symbol.** Reinstall torch and torchaudio together: `pip install torch torchaudio --upgrade`.

**ReduceLROnPlateau verbose error.** PyTorch 2.x removed the verbose parameter; don't pass it.
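For the last item, constructing the scheduler without `verbose` looks like this (the hyperparameters are illustrative, not the project's values):

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.AdamW(params, lr=1e-3)
# PyTorch 2.x: no `verbose=` argument here.
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min",
                                                   factor=0.5, patience=10)
sched.step(1.0)  # step with the monitored metric (e.g. validation loss)
```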

## License

MIT
