-
Notifications
You must be signed in to change notification settings - Fork 615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SeekFrames to audio decoder. Redesign to allow deciding decoded data type at runtime. #2334
Add SeekFrames to audio decoder. Redesign to allow deciding decoded data type at runtime. #2334
Conversation
…ata type at runtime. Signed-off-by: Joaquin Anton <janton@nvidia.com>
auto meta = decoder->Open(make_cspan(bytes)); | ||
int64_t offset = meta.length / 2; | ||
int64_t length = meta.length - offset; | ||
std::vector<DataType> output(length * meta.channels); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can use larger output
buffer, fill the end with the dummy data, and see if the DecodeFrames doesn't over run it.
template <typename T> | ||
span<char> as_raw_span(T *buffer, ptrdiff_t length) { | ||
return make_span(reinterpret_cast<char*>(buffer), length*sizeof(T)); | ||
std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta, float offset_sec, float length_sec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta, float offset_sec, float length_sec) { | |
std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta, double offset_sec, double length_sec) { |
With any decent sampling rate, float is not big enough to store the time.
Float will not be able to represent the time down to a sample at mere 8 minutes at 16kHz.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect to have such accuracy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also use int64_t (nanoseconds). The capacity is >200 years - much more than enough for audio.
} | ||
|
||
TensorShape<> DecodedAudioShape(const AudioMetadata &meta, float target_sample_rate, | ||
bool downmix) { | ||
bool downmix, float offset_sec, float duration_sec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool downmix, float offset_sec, float duration_sec) { | |
bool downmix, double offset_sec, double duration_sec) { |
span<float> resample_scratch_mem, | ||
float target_sample_rate, bool downmix, | ||
const char *audio_filepath) { // audio_filepath for debug purposes | ||
const char *audio_filepath, // audio_filepath for debug purposes | ||
float offset_sec, float duration_sec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
float offset_sec, float duration_sec) { | |
double offset_sec, double duration_sec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem like a good API - the offset and duration must have already been processed in order to allocate appropriate audio
tensor - passing them here, especially in seconds, can only lead to inconsistency.
Suggested alternatives:
seek
in the decoder before this call and readaudio.shape[0]
samples- pass the offset in samples and read
audio.shape[0]
samples.
@@ -141,7 +141,7 @@ void NemoAsrLoader::ReadAudio(Tensor<CPUBackend> &audio, | |||
if (should_resample || should_downmix || dtype_ != DALI_INT16) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (should_resample || should_downmix || dtype_ != DALI_INT16) | |
if (should_resample || should_downmix) |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
if (length_sec > 0.0f) | ||
length = static_cast<int64_t>(length_sec * meta.sample_rate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like 0 meaning "everything". I think that a negative value would be better.
virtual ptrdiff_t Decode(span<float> output) = 0; | ||
virtual ptrdiff_t Decode(span<int16_t> output) = 0; | ||
virtual ptrdiff_t Decode(span<int32_t> output) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No me gusta :(
What's the reason for removing TypedDecoder
and having these 3 virtuals here? It feels like the opposite of how the interface should look like...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TypedDecoder introduced a compile time limitation that was artificial. If instantiate a float decoder because I think that I will need to do resampling, but later I get a recording with the same sampling rate as the requested one, I couldn't decode directly to T because of the static type of the instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't agree more with Joaquin. If an interface gets in the way of functionality, it's not a good one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more note, defining the data type at instantiation introduces this problem, because it is only after I have read the file meta-data that I can decide the data type of the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about hiding the implementation detail, that there are different ways to decode different types? Something like we do for Open
and Close
:
public:
template<typename T>
ptrdiff_t Decode (span<T> output) {
return DecodeImpl(output);
}
private:
virtual ptrdiff_t DecodeImpl(span<float> output) = 0;
virtual ptrdiff_t DecodeImpl(span<int16_t> output) = 0;
virtual ptrdiff_t DecodeImpl(span<int32_t> output) = 0;
I'm thinking in the way that if we would need, e.g. perform some particular check before actually decoding. This way we would do it in a single point, not in every function with implementation:
ptrdiff_t Decode(span<T> output) {
assert(output.length > 0);
return DecodeImpl(output);
}
What bothers me in this interface how it looks now is that it exposes that implementation detail...
Signed-off-by: Joaquin Anton <janton@nvidia.com>
dbb6919
to
9c00fa0
Compare
length * meta.channels); | ||
// Verifying that we didn't read more than we should | ||
for (size_t i = length * meta.channels; i < output.size(); i++) { | ||
ASSERT_EQ(0xBE, output[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -152,9 +152,11 @@ void NemoAsrLoader::ReadAudio(Tensor<CPUBackend> &audio, | |||
|
|||
resample_scratch.resize(resample_scratch_sz); | |||
|
|||
DecodeAudio<OutputType, DecoderOutputType>( | |||
// TODO(janton): handle offset and duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's stopping you from doing it now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to do it in two separate PRs, one for AudioDecoder and one for NemoAsrReader, including proper python tests
* @param nframes Number of full frames (1 frame is equivalent to nchannel samples) | ||
* @param whence Like in lseek, SEEK_SET. SEEK_CUR, SEEK_END | ||
*/ | ||
virtual int64_t SeekFrames(int64_t nframes, int whence = SEEK_CUR) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, having pure virtual function with default parameter is bad practice. Considering this example:
class A {
virtual foo(int a = 2) = 0;
};
class B : A {
virtual foo(int a = 3) {
cout << a;
}
};
A *a = new B();
a->foo(); // 2 or 3?
/** | ||
* @brief Seeks full frames, or multichannel samples, much like lseek in unistd.h | ||
* @param nframes Number of full frames (1 frame is equivalent to nchannel samples) | ||
* @param whence Like in lseek, SEEK_SET. SEEK_CUR, SEEK_END | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please describe what's in the returned value
double offset_sec, double length_sec) { | ||
int64_t offset = 0; | ||
int64_t length = meta.length; | ||
if (offset_sec >= 0.0f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (offset_sec >= 0.0f) { | |
if (offset_sec >= 0.0) { |
Comparing double to double
offset = static_cast<int64_t>(offset_sec * meta.sample_rate); | ||
} | ||
|
||
if (length_sec >= 0.0f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (length_sec >= 0.0f) { | |
if (length_sec >= 0.0) { |
DLL_PUBLIC std::pair<int64_t, int64_t> ProcessOffsetAndLength(const AudioMetadata &meta, | ||
double offset_sec, double length_sec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add docs to this function? Including what's in the returned pair
GenericAudioDecoder(GenericAudioDecoder&&) = default; | ||
GenericAudioDecoder& operator=(GenericAudioDecoder&&) = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct? I believe that at least sndfile_handle_
should be reset. I don't know how remaining fields should be moved
*/ | ||
virtual ptrdiff_t Decode(span<char> raw_output) = 0; | ||
virtual bool CanSeekFrames() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a decoder, that wouldn't be able to seek frames? If not, maybe we could remove this function?
If we would really want to optionally add seeking, IMHO it should be covered by a decorator. But I'd prefer assuming, that we can always seek frames
78b152a
to
2013f62
Compare
int64_t SeekFrames(int64_t nframes, int whence = SEEK_CUR) { | ||
return SeekFramesImpl(nframes, whence); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Signed-off-by: Joaquin Anton <janton@nvidia.com>
2013f62
to
3a4f3c0
Compare
Signed-off-by: Joaquin Anton <janton@nvidia.com>
!build |
CI MESSAGE: [1685928]: BUILD STARTED |
!build |
CI MESSAGE: [1686373]: BUILD STARTED |
CI MESSAGE: [1686373]: BUILD FAILED |
CI MESSAGE: [1686373]: BUILD PASSED |
@jantonguirao this looks great. Do you have a user example how iterate over specific excerpts? |
@faroit You can iterate over portions of audio files by using our NemoAsrReader. We do not have a comprehensive notebook about this operator, but you can refer to the operator documentation or check our operator tests: https://github.com/NVIDIA/DALI/blob/master/dali/test/python/test_operator_nemo_asr_reader.py |
thanks a lot. This is a great addition! |
Signed-off-by: Joaquin Anton janton@nvidia.com
Why we need this PR?
What happened in this PR?
Make AudioDecoderBase instances type-agnostic. Decode function is overloaded to decode to different types
Added SeekFrames to allow for partial decoding
Added handling of offset and duration in audio_decoder_impl.cc
Adjusted usage of audio decoder in AudioDecoder operator and NemoAsrReader
AudioDecoder operator and NemoAsrReader
Design changes
Tests added
Docstr
JIRA TASK: [DALI-1634]
JIRA TASK: [DALI-1567]