PyTorch-PCEN

Efficient PyTorch reimplementation of per-channel energy normalization with Mel spectrogram features.

Overview

Robustness to loudness differences in near- and far-field conditions is critical in high-quality speech recognition applications. Obviously, spectrogram energies differ significantly between, say, shouting at arms-length and whispering from a distance. This can worsen model quality, since the model itself would need to be robust across a wide range of input. The log-compression step in the popular log-Mel transform partially addresses this issue by reducing the dynamic range of audio; however, it ignores per-channel energy differences and is static by definition.

Per-channel energy normalization is one such solution to the aforementioned problems. It provides a per-channel, trainable front-end in place of the log compression, greatly improving model robustness in keyword spotting systems -- all the while being resource-efficient and easy to implement.

Installation and Usage

PyTorch and NumPy are required. LibROSA and matplotlib are required only for the example.
To install via pip, run pip install git+https://github.com/daemon/pytorch-pcen. Otherwise, clone this repository and run python setup.py install.
To run the example in the module, place a 16kHz WAV file named yes.wav in the current directory. Then, do python -m pcen.pcen.

The following is a self-contained example for using a streaming PCEN layer:

import pcen
import torch

# 40-dimensional features, 30-millisecond window, 10-millisecond shift; trainable is false by default
transform = pcen.StreamingPCENTransform(n_mels=40, n_fft=480, hop_length=160, trainable=True)
audio = torch.empty(1, 16000).normal_(0, 0.1) # Gaussian noise

# 1600 is an arbitrary chunk size; This step is unnecessary but demonstrates the streaming nature
streaming_chunks = audio.split(1600, 1)
pcen_chunks = [transform(chunk) for chunk in streaming_chunks] # Transform each chunk
transform.reset() # Reset the persistent streaming state
pcen_ = torch.cat(pcen_chunks, 1)

Citation

Wang, Yuxuan, Pascal Getreuer, Thad Hughes, Richard F. Lyon, and Rif A. Saurous. Trainable frontend for robust and far-field keyword spotting. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 5670-5674. IEEE, 2017.

@inproceedings{wang2017trainable,
  title={Trainable frontend for robust and far-field keyword spotting},
  author={Wang, Yuxuan and Getreuer, Pascal and Hughes, Thad and Lyon, Richard F and Saurous, Rif A},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on},
  pages={5670--5674},
  year={2017},
  organization={IEEE}
}

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
pcen		pcen
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pcen

pcen

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

setup.py

setup.py

Repository files navigation

PyTorch-PCEN

Overview

Installation and Usage

Citation

About

Releases

Packages

Languages

License

daemon/pytorch-pcen

Folders and files

Latest commit

History

Repository files navigation

PyTorch-PCEN

Overview

Installation and Usage

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Languages