[ASR] Fix for multi-channel signals in AudioSegment #4824

Merged: 1 commit, Sep 6, 2022

Collaborator

@anteju anteju commented Aug 27, 2022

Signed-off-by: Ante Jukić <ajukic@nvidia.com>

What does this PR do?

Currently, AudioSegment aims to do channel averaging for multi-channel audio signals:

if self._samples.ndim >= 2:
  self._samples = np.mean(self._samples, 1)

However, the samples passed to the constructor are arranged as (num_channels, num_samples), not the assumed (num_samples, num_channels). The mean is therefore taken over the time axis rather than the channel axis, so the samples of the created AudioSegment have an unexpected size, and ASR fails when transcribe is run on a multi-channel audio signal.
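The shape mismatch can be reproduced with plain NumPy. The following is a minimal sketch, not code from the PR: the array sizes and variable names are made up for illustration, and the only thing it takes from the description above is the channel-first layout (num_channels, num_samples).

```python
import numpy as np

# Illustrative sizes (assumption): 2 channels, 1 second at 16 kHz.
num_channels, num_samples = 2, 16000

# Channel-first layout, as described for the constructor input.
samples = np.random.default_rng(0).standard_normal((num_channels, num_samples))

# Averaging over axis 1 assumes (num_samples, num_channels). On channel-first
# data it collapses the time axis, leaving one value per channel.
wrong = np.mean(samples, 1)

# Transposing to (num_samples, num_channels) first makes axis 1 the channel
# axis, so the result is a single-channel signal of full length.
right = np.mean(samples.T, 1)

print(wrong.shape)  # (2,)
print(right.shape)  # (16000,)
```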

This can be tested by running test_audiosegment.py and test_asr_multi_channel_audio.py from the attached scripts in test_audiosegment_with_channel_selector.zip

Collection: ASR

Changelog

Necessary fix in AudioSegment

  • corrected the dimension ordering for multi-channel audio
  • added a new parameter, channel_selector, which can be used to select a single channel (or a subset of channels) from a multi-channel audio signal
  • added unit tests for both the channel selector and AudioSegment

The changes above are in segment.py and audio_utils.py; the tests are in test_preprocessing_segment.py and test_audio_utils.py.
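To make the selector semantics concrete, here is a rough sketch of how such a parameter could behave. It is not the implementation in audio_utils.py: the function name select_channels_sketch is hypothetical, and the sketch assumes the signal is already laid out as (num_samples, num_channels) and that the selector is None, an integer, a list of integers, or the string 'average'.

```python
from typing import Iterable, Optional, Union

import numpy as np

ChannelSelector = Optional[Union[int, Iterable[int], str]]


def select_channels_sketch(signal: np.ndarray, channel_selector: ChannelSelector = None) -> np.ndarray:
    """Hypothetical helper for illustration only, not the NeMo function.

    Assumes `signal` has shape (num_samples,) or (num_samples, num_channels).
    """
    if signal.ndim == 1 or channel_selector is None:
        # Single-channel input or no selector: return the signal unchanged.
        return signal
    if isinstance(channel_selector, str):
        if channel_selector != "average":
            raise ValueError(f"Unsupported channel_selector: {channel_selector!r}")
        # Average across the channel axis (the last axis in this layout).
        return np.mean(signal, axis=-1)
    if isinstance(channel_selector, (int, np.integer)):
        # Pick a single channel; the result is one-dimensional.
        return signal[:, channel_selector]
    # Otherwise treat the selector as a collection of channel indices.
    return signal[:, list(channel_selector)]


# Toy signal: 4 samples, 2 channels.
signal = np.arange(8, dtype=float).reshape(4, 2)
print(select_channels_sketch(signal, "average").shape)  # (4,)
print(select_channels_sketch(signal, 0).shape)          # (4,)
print(select_channels_sketch(signal, [0, 1]).shape)     # (4, 2)
```

Returning the input untouched when no selector is given keeps the default behavior of loading all channels, which matches the first usage example in this description.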

Nice to have: channel_selector for transcribe

  • Added a channel_selector option to transcribe and forwarded it through the models, the audio-to-text dataset and WaveformFeaturizer to AudioSegment
  • This is a convenience when running an existing model on a multi-channel audio file: we can select either one of the channels or the channel-wise average.

If preferred, this can be removed from the PR.

Usage

Example: AudioSegment

  • Check test_audiosegment.py
AudioSegment.from_file(audio_file) # loads the complete signal (including all channels)
AudioSegment.from_file(audio_file, channel_selector='average') # loads a single-channel signal (average across channels)
AudioSegment.from_file(audio_file, channel_selector=0) # loads only first channel

Example: transcribe

  • Check test_asr_with_channel_selector.py
asr_model.transcribe(mc_files, channel_selector='average') # apply channel averaging when loading audio

Tests

  • Existing unit tests + newly-added unit tests
  • Verified that transcriptions of single-channel signals match between the main and fix branches for wav, flac and mp3 input (test_asr_formats.py, attached)
  • Verified that the transcription of a channel selected from a multi-channel signal matches the transcription of the same channel provided as a single-channel input signal (test_asr_with_channel_selector.py, attached)
  • Tutorials ASR_with_NeMo.ipynb and ASR_with_Transducers.ipynb running fine with the fix branch

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • n/a

@anteju anteju force-pushed the fix/audiosegment-multi-channel branch 3 times, most recently from 82f8122 to 809c71a Compare August 29, 2022 22:10

lgtm-com bot commented Aug 29, 2022

This pull request introduces 1 alert when merging 809c71a into d969162 - view on LGTM.com

new alerts:

  • 1 for Unused import

@anteju anteju force-pushed the fix/audiosegment-multi-channel branch 3 times, most recently from 03689d1 to 33c5053 Compare August 31, 2022 04:21
@anteju anteju force-pushed the fix/audiosegment-multi-channel branch 2 times, most recently from 4cf3633 to 27e7993 Compare August 31, 2022 18:04
yzhang123 previously approved these changes Sep 1, 2022
Collaborator

@jbalam-nv jbalam-nv left a comment

Looks great

Review comment on nemo/collections/asr/parts/utils/audio_utils.py (outdated)
- Enable channel selector for AudioToText datasets for `transcribe`

Signed-off-by: Ante Jukić <ajukic@nvidia.com>
@anteju anteju force-pushed the fix/audiosegment-multi-channel branch from ad4cf71 to 71759a3 Compare September 2, 2022 20:55
@anteju anteju marked this pull request as ready for review September 2, 2022 21:01
@anteju anteju changed the title [Draft] Fix for multi-channel signals in AudioSegment [ASR] Fix for multi-channel signals in AudioSegment Sep 2, 2022
@anteju anteju mentioned this pull request Sep 2, 2022
6 tasks
@yzhang123 yzhang123 merged commit 760d0c8 into NVIDIA:main Sep 6, 2022
jubick1337 pushed a commit to jubick1337/NeMo that referenced this pull request Oct 3, 2022
- Enable channel selector for AudioToText datasets for `transcribe`

Signed-off-by: Ante Jukić <ajukic@nvidia.com>

Signed-off-by: Ante Jukić <ajukic@nvidia.com>
Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>
@anteju anteju mentioned this pull request Oct 25, 2022
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 29, 2022
- Enable channel selector for AudioToText datasets for `transcribe`

Signed-off-by: Ante Jukić <ajukic@nvidia.com>

Signed-off-by: Ante Jukić <ajukic@nvidia.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>