
Add SeekFrames to audio decoder. Redesign to allow deciding decoded data type at runtime. #2334

Merged

Conversation

@jantonguirao (Contributor) commented Oct 7, 2020

Signed-off-by: Joaquin Anton janton@nvidia.com

Why we need this PR?

  • It adds a new feature needed for partial decoding of audio files
  • Refactoring to improve the usability of the audio decoder

What happened in this PR?

  • What solution was applied:
    Made AudioDecoderBase instances type-agnostic; the Decode function is overloaded to decode to different types
    Added SeekFrames to allow partial decoding
    Added handling of offset and duration in audio_decoder_impl.cc
    Adjusted the usage of the audio decoder in the AudioDecoder operator and NemoAsrReader
  • Affected modules and functionalities:
    AudioDecoder operator and NemoAsrReader
  • Key points relevant for the review:
    Design changes
  • Validation and testing:
    Tests added
  • Documentation (including examples):
    Docstrings

JIRA TASK: [DALI-1634]
JIRA TASK: [DALI-1567]

…ata type at runtime.

Signed-off-by: Joaquin Anton <janton@nvidia.com>
auto meta = decoder->Open(make_cspan(bytes));
int64_t offset = meta.length / 2;
int64_t length = meta.length - offset;
std::vector<DataType> output(length * meta.channels);
Contributor

Maybe you can use a larger output buffer, fill the end with dummy data, and check that DecodeFrames doesn't overrun it.

template <typename T>
span<char> as_raw_span(T *buffer, ptrdiff_t length) {
  return make_span(reinterpret_cast<char*>(buffer), length * sizeof(T));
}

std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta, float offset_sec, float length_sec) {
@mzient (Contributor) Oct 7, 2020

Suggested change
std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta, float offset_sec, float length_sec) {
std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta, double offset_sec, double length_sec) {

With any decent sampling rate, float is not big enough to store the time.

Float cannot represent time down to a single sample after a mere 8 minutes at 16 kHz.

Contributor

Do we expect to have such accuracy?

Contributor

We can also use int64_t (nanoseconds). The capacity is >200 years - much more than enough for audio.

}

TensorShape<> DecodedAudioShape(const AudioMetadata &meta, float target_sample_rate,
bool downmix) {
bool downmix, float offset_sec, float duration_sec) {
Contributor

Suggested change
bool downmix, float offset_sec, float duration_sec) {
bool downmix, double offset_sec, double duration_sec) {

span<float> resample_scratch_mem,
float target_sample_rate, bool downmix,
const char *audio_filepath) { // audio_filepath for debug purposes
const char *audio_filepath, // audio_filepath for debug purposes
float offset_sec, float duration_sec) {
Contributor

Suggested change
float offset_sec, float duration_sec) {
double offset_sec, double duration_sec) {

Contributor

This doesn't seem like a good API: the offset and duration must already have been processed in order to allocate the audio tensor, so passing them here again, especially in seconds, can only lead to inconsistency.
Suggested alternatives:

  • seek in the decoder before this call and read audio.shape[0] samples
  • pass the offset in samples and read audio.shape[0] samples.

@@ -141,7 +141,7 @@ void NemoAsrLoader::ReadAudio(Tensor<CPUBackend> &audio,
if (should_resample || should_downmix || dtype_ != DALI_INT16)
Contributor

Suggested change
if (should_resample || should_downmix || dtype_ != DALI_INT16)
if (should_resample || should_downmix)

?

Contributor Author

good catch

Comment on lines 25 to 26
if (length_sec > 0.0f)
length = static_cast<int64_t>(length_sec * meta.sample_rate);
Contributor

I don't like 0 meaning "everything". I think that a negative value would be better.

Comment on lines 67 to 69
virtual ptrdiff_t Decode(span<float> output) = 0;
virtual ptrdiff_t Decode(span<int16_t> output) = 0;
virtual ptrdiff_t Decode(span<int32_t> output) = 0;
Member

I don't like it :(
What's the reason for removing TypedDecoder and having these 3 virtuals here? It feels like the opposite of how the interface should look...

Contributor Author

TypedDecoder introduced an artificial compile-time limitation. If I instantiate a float decoder because I think I will need resampling, but later get a recording whose sampling rate already matches the requested one, I can't decode directly to T because of the static type of the instance.

Contributor

Couldn't agree more with Joaquin. If an interface gets in the way of functionality, it's not a good one.

Contributor Author

One more note: defining the data type at instantiation introduces this problem, because it is only after reading the file metadata that I can decide the output data type.

Member

How about hiding the implementation detail, that there are different ways to decode different types? Something like we do for Open and Close:

public:
template<typename T>
ptrdiff_t Decode (span<T> output) {
  return DecodeImpl(output);
}

private:
virtual ptrdiff_t DecodeImpl(span<float> output) = 0;
virtual ptrdiff_t DecodeImpl(span<int16_t> output) = 0;
virtual ptrdiff_t DecodeImpl(span<int32_t> output) = 0;

I'm thinking of the case where we need to, e.g., perform some particular check before actually decoding. This way we would do it in a single place, not in every implementation:

ptrdiff_t Decode(span<T> output) {
  assert(output.length > 0);
  return DecodeImpl(output);
}

What bothers me about how this interface looks now is that it exposes that implementation detail...

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao jantonguirao force-pushed the audio_decoder_impl_offset_duration branch from dbb6919 to 9c00fa0 Compare October 7, 2020 14:06
length * meta.channels);
// Verifying that we didn't read more than we should
for (size_t i = length * meta.channels; i < output.size(); i++) {
ASSERT_EQ(0xBE, output[i]);
Contributor

👍

@@ -152,9 +152,11 @@ void NemoAsrLoader::ReadAudio(Tensor<CPUBackend> &audio,

resample_scratch.resize(resample_scratch_sz);

DecodeAudio<OutputType, DecoderOutputType>(
// TODO(janton): handle offset and duration
Contributor

What's stopping you from doing it now?

Contributor Author

I am going to do it in two separate PRs, one for AudioDecoder and one for NemoAsrReader, including proper Python tests.

* @param nframes Number of full frames (1 frame is equivalent to nchannel samples)
 * @param whence Like in lseek: SEEK_SET, SEEK_CUR, SEEK_END
*/
virtual int64_t SeekFrames(int64_t nframes, int whence = SEEK_CUR) = 0;
Member

IMHO, having a pure virtual function with a default parameter is bad practice. Consider this example:

class A {
 public:
  virtual void foo(int a = 2) = 0;
};

class B : public A {
 public:
  void foo(int a = 3) override {
    cout << a;
  }
};

A *a = new B();
a->foo();  // 2 or 3?

Comment on lines 54 to 58
/**
* @brief Seeks full frames, or multichannel samples, much like lseek in unistd.h
* @param nframes Number of full frames (1 frame is equivalent to nchannel samples)
 * @param whence Like in lseek: SEEK_SET, SEEK_CUR, SEEK_END
*/
Member

Please describe what's in the returned value

double offset_sec, double length_sec) {
int64_t offset = 0;
int64_t length = meta.length;
if (offset_sec >= 0.0f) {
Member

Suggested change
if (offset_sec >= 0.0f) {
if (offset_sec >= 0.0) {

Comparing double to double

offset = static_cast<int64_t>(offset_sec * meta.sample_rate);
}

if (length_sec >= 0.0f) {
Member

Suggested change
if (length_sec >= 0.0f) {
if (length_sec >= 0.0) {

Comment on lines 27 to 28
DLL_PUBLIC std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta,
double offset_sec, double length_sec);
Member

Could you add docs to this function, including what's in the returned pair?

Comment on lines 141 to 142
GenericAudioDecoder(GenericAudioDecoder&&) = default;
GenericAudioDecoder& operator=(GenericAudioDecoder&&) = default;
Member

Is this correct? I believe that at least sndfile_handle_ should be reset. I don't know how the remaining fields should be moved.

*/
virtual ptrdiff_t Decode(span<char> raw_output) = 0;
virtual bool CanSeekFrames() const = 0;
Member

Is there a decoder that wouldn't be able to seek frames? If not, maybe we could remove this function?
If we really wanted seeking to be optional, IMHO it should be covered by a decorator. But I'd prefer to assume that we can always seek frames.

@jantonguirao jantonguirao force-pushed the audio_decoder_impl_offset_duration branch from 78b152a to 2013f62 Compare October 8, 2020 09:21
Comment on lines +55 to +57
int64_t SeekFrames(int64_t nframes, int whence = SEEK_CUR) {
return SeekFramesImpl(nframes, whence);
}
Member

👍

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao jantonguirao force-pushed the audio_decoder_impl_offset_duration branch from 2013f62 to 3a4f3c0 Compare October 8, 2020 09:59
Signed-off-by: Joaquin Anton <janton@nvidia.com>
@mzient (Contributor) commented Oct 8, 2020

!build

@dali-automaton (Collaborator)
CI MESSAGE: [1685928]: BUILD STARTED

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao (Contributor Author)
!build

@dali-automaton (Collaborator)
CI MESSAGE: [1686373]: BUILD STARTED

@dali-automaton (Collaborator)
CI MESSAGE: [1686373]: BUILD FAILED

@dali-automaton (Collaborator)
CI MESSAGE: [1686373]: BUILD PASSED

@jantonguirao jantonguirao merged commit 83aa29f into NVIDIA:master Oct 9, 2020
@faroit commented Dec 3, 2020

@jantonguirao this looks great. Do you have a user example showing how to iterate over specific excerpts?

@jantonguirao (Contributor Author)

@faroit You can iterate over portions of audio files by using our NemoAsrReader. We do not have a comprehensive notebook for this operator, but you can refer to the operator documentation or check our operator tests: https://github.com/NVIDIA/DALI/blob/master/dali/test/python/test_operator_nemo_asr_reader.py
If you have any specific questions, you can post them here or in the issues tab.

@faroit commented Dec 3, 2020

Thanks a lot. This is a great addition!
