
Add SeekFrames to audio decoder. Redesign to allow deciding decoded data type at runtime. #2334

Merged

Conversation

@jantonguirao (Contributor) commented Oct 7, 2020

Signed-off-by: Joaquin Anton janton@nvidia.com

Why we need this PR?

  • It adds a new feature needed for partial decoding of audio files
  • Refactoring to improve the usability of the audio decoder

What happened in this PR?

  • What solution was applied:
    Made AudioDecoderBase instances type-agnostic; the Decode function is overloaded to decode to different types
    Added SeekFrames to allow partial decoding
    Added handling of offset and duration in audio_decoder_impl.cc
    Adjusted the usage of the audio decoder in the AudioDecoder operator and NemoAsrReader
  • Affected modules and functionalities:
    AudioDecoder operator and NemoAsrReader
  • Key points relevant for the review:
    Design changes
  • Validation and testing:
    Tests added
  • Documentation (including examples):
    Docstrings

JIRA TASK: [DALI-1634]
JIRA TASK: [DALI-1567]

…ata type at runtime.

Signed-off-by: Joaquin Anton <janton@nvidia.com>
auto meta = decoder->Open(make_cspan(bytes));
int64_t offset = meta.length / 2;
int64_t length = meta.length - offset;
std::vector<DataType> output(length * meta.channels);
Contributor

Maybe you can use a larger output buffer, fill the end with dummy data, and check that DecodeFrames doesn't overrun it.

template <typename T>
span<char> as_raw_span(T *buffer, ptrdiff_t length) {
  return make_span(reinterpret_cast<char*>(buffer), length * sizeof(T));
}

std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta, float offset_sec, float length_sec) {
@mzient (Contributor) Oct 7, 2020

Suggested change
std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta, float offset_sec, float length_sec) {
std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta, double offset_sec, double length_sec) {

With any decent sampling rate, float is not big enough to store the time.

Float cannot represent time down to a single sample after a mere 8 minutes at 16 kHz.

Contributor

Do we expect to have such accuracy?

Contributor

We can also use int64_t (nanoseconds). The capacity is >200 years - much more than enough for audio.

}

TensorShape<> DecodedAudioShape(const AudioMetadata &meta, float target_sample_rate,
bool downmix) {
bool downmix, float offset_sec, float duration_sec) {
Contributor

Suggested change
bool downmix, float offset_sec, float duration_sec) {
bool downmix, double offset_sec, double duration_sec) {

span<float> resample_scratch_mem,
float target_sample_rate, bool downmix,
const char *audio_filepath) { // audio_filepath for debug purposes
const char *audio_filepath, // audio_filepath for debug purposes
float offset_sec, float duration_sec) {
Contributor

Suggested change
float offset_sec, float duration_sec) {
double offset_sec, double duration_sec) {

Contributor

This doesn't seem like a good API: the offset and duration must already have been processed in order to allocate the audio tensor, so passing them here again, especially in seconds, can only lead to inconsistency.
Suggested alternatives:

  • seek in the decoder before this call and read audio.shape[0] samples
  • pass the offset in samples and read audio.shape[0] samples.

@@ -141,7 +141,7 @@ void NemoAsrLoader::ReadAudio(Tensor<CPUBackend> &audio,
if (should_resample || should_downmix || dtype_ != DALI_INT16)
Contributor

Suggested change
if (should_resample || should_downmix || dtype_ != DALI_INT16)
if (should_resample || should_downmix)

?

Contributor Author

good catch

Comment on lines 25 to 26
if (length_sec > 0.0f)
length = static_cast<int64_t>(length_sec * meta.sample_rate);
Contributor

I don't like 0 meaning "everything". I think that a negative value would be better.

Comment on lines 67 to 69
virtual ptrdiff_t Decode(span<float> output) = 0;
virtual ptrdiff_t Decode(span<int16_t> output) = 0;
virtual ptrdiff_t Decode(span<int32_t> output) = 0;
Member

I don't like it :(
What's the reason for removing TypedDecoder and having these 3 virtuals here? It feels like the opposite of how the interface should look...

Contributor Author

TypedDecoder introduced an artificial compile-time limitation. If I instantiate a float decoder because I think I will need resampling, but later get a recording whose sampling rate already matches the requested one, I can't decode directly to T because of the static type of the instance.

Contributor

Couldn't agree more with Joaquin. If an interface gets in the way of functionality, it's not a good one.

Contributor Author

One more note: defining the data type at instantiation introduces this problem, because it is only after reading the file metadata that I can decide the output data type.

Member

How about hiding the implementation detail, that there are different ways to decode different types? Something like we do for Open and Close:

public:
template<typename T>
ptrdiff_t Decode (span<T> output) {
  return DecodeImpl(output);
}

private:
virtual ptrdiff_t DecodeImpl(span<float> output) = 0;
virtual ptrdiff_t DecodeImpl(span<int16_t> output) = 0;
virtual ptrdiff_t DecodeImpl(span<int32_t> output) = 0;

I'm thinking of the case where we need to, e.g., perform some particular check before actually decoding. This way we would do it in a single place, not in every implementation:

ptrdiff_t Decode(span<T> output) {
  assert(output.length > 0);
  return DecodeImpl(output);
}

What bothers me about how this interface looks now is that it exposes that implementation detail...

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao jantonguirao force-pushed the audio_decoder_impl_offset_duration branch from dbb6919 to 9c00fa0 Compare October 7, 2020 14:06
length * meta.channels);
// Verifying that we didn't read more than we should
for (size_t i = length * meta.channels; i < output.size(); i++) {
ASSERT_EQ(0xBE, output[i]);
Contributor

👍

@@ -152,9 +152,11 @@ void NemoAsrLoader::ReadAudio(Tensor<CPUBackend> &audio,

resample_scratch.resize(resample_scratch_sz);

DecodeAudio<OutputType, DecoderOutputType>(
// TODO(janton): handle offset and duration
Contributor

What's stopping you from doing it now?

Contributor Author

I am going to do it in two separate PRs, one for AudioDecoder and one for NemoAsrReader, including proper Python tests.

* @param nframes Number of full frames (1 frame is equivalent to nchannel samples)
 * @param whence Like in lseek: SEEK_SET, SEEK_CUR, SEEK_END
*/
virtual int64_t SeekFrames(int64_t nframes, int whence = SEEK_CUR) = 0;
Member

IMHO, having a pure virtual function with a default parameter is bad practice. Consider this example:

class A {
 public:
  virtual void foo(int a = 2) = 0;
};

class B : public A {
 public:
  void foo(int a = 3) override {
    cout << a;
  }
};

A *a = new B();
a->foo();  // 2 or 3?

Comment on lines 54 to 58
/**
* @brief Seeks full frames, or multichannel samples, much like lseek in unistd.h
* @param nframes Number of full frames (1 frame is equivalent to nchannel samples)
 * @param whence Like in lseek: SEEK_SET, SEEK_CUR, SEEK_END
*/
Member

Please describe what's in the returned value

double offset_sec, double length_sec) {
int64_t offset = 0;
int64_t length = meta.length;
if (offset_sec >= 0.0f) {
Member

Suggested change
if (offset_sec >= 0.0f) {
if (offset_sec >= 0.0) {

Comparing double to double

offset = static_cast<int64_t>(offset_sec * meta.sample_rate);
}

if (length_sec >= 0.0f) {
Member

Suggested change
if (length_sec >= 0.0f) {
if (length_sec >= 0.0) {

Comment on lines 27 to 28
DLL_PUBLIC std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta,
double offset_sec, double length_sec);
Member

Could you add docs to this function, including what's in the returned pair?

Comment on lines 141 to 142
GenericAudioDecoder(GenericAudioDecoder&&) = default;
GenericAudioDecoder& operator=(GenericAudioDecoder&&) = default;
Member

Is this correct? I believe that at least sndfile_handle_ should be reset. I don't know how the remaining fields should be moved.

*/
virtual ptrdiff_t Decode(span<char> raw_output) = 0;
virtual bool CanSeekFrames() const = 0;
Member

Is there a decoder that wouldn't be able to seek frames? If not, maybe we could remove this function?
If we really wanted seeking to be optional, IMHO it should be covered by a decorator. But I'd prefer to assume that we can always seek frames.

@jantonguirao jantonguirao force-pushed the audio_decoder_impl_offset_duration branch from 78b152a to 2013f62 Compare October 8, 2020 09:21
Comment on lines +55 to +57
int64_t SeekFrames(int64_t nframes, int whence = SEEK_CUR) {
return SeekFramesImpl(nframes, whence);
}
Member

👍

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao jantonguirao force-pushed the audio_decoder_impl_offset_duration branch from 2013f62 to 3a4f3c0 Compare October 8, 2020 09:59
Signed-off-by: Joaquin Anton <janton@nvidia.com>
@mzient (Contributor) commented Oct 8, 2020

!build

@dali-automaton (Collaborator)
CI MESSAGE: [1685928]: BUILD STARTED

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao (Contributor Author)
!build

@dali-automaton (Collaborator)
CI MESSAGE: [1686373]: BUILD STARTED

@dali-automaton (Collaborator)
CI MESSAGE: [1686373]: BUILD FAILED

@dali-automaton (Collaborator)
CI MESSAGE: [1686373]: BUILD PASSED

@jantonguirao jantonguirao merged commit 83aa29f into NVIDIA:master Oct 9, 2020
@faroit commented Dec 3, 2020

@jantonguirao this looks great. Do you have a user example showing how to iterate over specific excerpts?

@jantonguirao (Contributor Author)

@faroit You can iterate over portions of audio files by using our NemoAsrReader. We do not have a comprehensive notebook for this operator, but you can refer to the operator documentation or check our operator tests: https://github.com/NVIDIA/DALI/blob/master/dali/test/python/test_operator_nemo_asr_reader.py
If you have any specific questions, you can post them here or in the issues tab.

@faroit commented Dec 3, 2020

Thanks a lot. This is a great addition!
