Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resampling decoder #1582

Merged
merged 19 commits into from
Dec 19, 2019
Merged

Resampling decoder #1582

merged 19 commits into from
Dec 19, 2019

Conversation

mzient
Copy link
Contributor

@mzient mzient commented Dec 17, 2019

Why we need this PR?

  • It adds new feature: resampling and downmixing in audio decoder
  • Refactoring to improve code reuse
  • It fixes: removes dangerous global variables used in the operator

What happened in this PR?

  • What solution was applied:
    • Major rework of the operator!
  • Affected modules and functionalities:
    • Audio decoder operator + auxiliary functions + decoder primitives
  • Key points relevant for the review:
    • Look for various combinations of arguments - many ifs there...
  • Validation and testing:
    • unit test
    • end-to-end python test against ground truth
  • Documentation (including examples):
    • Operator doc-strings

THIS PR SUPERSEDES #1574

JIRA TASK: * don't remember

@mzient mzient force-pushed the resample_decoder branch 4 times, most recently from 3932c70 to efe0d7a Compare December 17, 2019 21:31
list.append(np.array(bytearray(f.read()), np.uint8))
self.feed_input(self.raw_file, list)

def test_decoded_vs_generated():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still compare with librosa to be aware how close we are just in case we need to debug some discrepancies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've checked. Librosa does a bad thing here - namely, they do this:

  1. calculate output length as ceil(in_length * out_rate / in_rate)
  2. resample at a slightly different effective rate: in_rate * out_length / in_length
    So, they're changing the sampling rate instead of doing what is asked of them. At least it looks like it - I calculated the "corrected" (or rather librosa-like-distorted) rate and generated the new reference signal. Librosa matches this one much better than the original "ground truth".
    I don't think it's our goal to reproduce this particular bug - because it is a bug; among other things, it precludes resampling in variable length chunks, because each output chunk will be resampled at a potentially different rate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add some tests with generous epsilon - or abandon the weird rate of 12999 Hz, but I don't think it does much good.
Anyway, since there's a random playback speed distortion as an augmentation (and a ton of others), I don't think anyone will look into 0.01% change in sampling rate vs some other library.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've checked. Librosa does a bad thing here - namely, they do this:
calculate output length as ceil(in_length * out_rate / in_rate)

Doesn't they have an option to turn off this called fix ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe we should have a switch to support this. I believe they just wanted to make sure that the calculated number of output samples is an integer number.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to leave the librosa comparing for just the typical conversions (22050 to 16000 and such)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what happens to the rate then, but the output length was different, so I couldn't compare the arrays directly.

@mzient mzient changed the title [WIP] Resample decoder Resampling decoder Dec 18, 2019
@mzient
Copy link
Contributor Author

mzient commented Dec 18, 2019

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1037901]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1037901]: BUILD PASSED

@mzient mzient mentioned this pull request Dec 18, 2019
Comment on lines 170 to 177
DALI_HOST_DEV DALI_FORCEINLINE
double sinc(double x) {
x *= M_PI;
if (std::abs(x) < 1e-10)
return 1 - x * x * 0.25; // approximate by a parabola near the pole
return std::sin(x) / x;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things here:

  1. There should be a documentation specifying, what is x (deg or rad)
  2. Out of curiosity, why this approximation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. What really deserves a comment is that the function calculates the so-called "normalized sinc". That definition already covers what x is.
  2. For x == 0 there's a removable singularity, so we need a check anyway. For very small values of x calculating sin(x)/x may lose more precision than using an approximation.
    By the way, the approximation should be 1 - x * x * (1.0f / 6) - and perhaps the limit 1e-10f set even closer to 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked: for double, around x = 1e-8 the approximation with 1-x^2/6 is more accurate than sin(x)/x. I used long double version as a reference.

Comment on lines 178 to 186
DALI_HOST_DEV DALI_FORCEINLINE
float sinc(float x) {
x *= M_PI;
if (std::abs(x) < 1e-10f)
return 1 - x * x * 0.25f; // approximate by a parabola near the pole
return std::sin(x) / x;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, please specify what's x

int length;
int sample_rate; /// [Hz]
/// @brief Length, in samples, of the recording
int64_t length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In DALI we tend to use "sample" to name a tensor in a batch. While here it's pretty obvious (from the context), that this is certainly not a tensor in a batch, I'd propose to name it "audio sample", to distinguish these two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It you're talking about the comment, I'll replace it with "(multi-channel) samples", because that's what it is. In libsnd they are using frames for multi-channel samples, but that's not that common and even their own documenation uses "multi-channel samples" in some places.

Comment on lines +43 to +52
template <int static_channels = -1, typename Out, typename In>
void DownmixChannels(
Out *out, const In *in, int64_t samples, int channels,
const float *weights, bool normalize_weights = false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of having channels statically specified in this context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It helps the compiler unroll the loop over channels, opening new optimization opportunities. For 2D resamping this amounted to 20-30% speedup, I assumed it wouldn't be much different here.

Comment on lines 40 to 42
for (size_t i = 0; i < ref.size(); i++) {
EXPECT_NEAR(out[i], ref[i], 1e-6);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using EXPECT_FLOAT_EQ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be OK, I guess. This one is a tad faster and still the dynamic range of values is small enough for a fixed epsilon.

Comment on lines +43 to +52
template <int static_channels = -1, typename Out, typename In>
void DownmixChannels(
Out *out, const In *in, int64_t samples, int channels,
const float *weights, bool normalize_weights = false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a point in the doc, that this can be done in-place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not if you want to vectorize the loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a restricted qualifier to in and out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a C++ qualifier, unfortunately. The __restrict keyword is in C and in CUDA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCC and Clang supports __restrict__, do you think we can use it or we want to stick to the standard?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not if you want to vectorize the loop.

Then is Span_DefaultWeights incorrect? It does downmixing in-place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not adding it here, but I will in resampling which is utterly impossible to do in-place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, after a thought I'll add the in-place info - obviously with the caveat that Out and In should be of the same time - otherwise the compiler assumes no aliasing.


namespace dali {

DALI_SCHEMA(AudioDecoder)
.DocStr(R"code(Decode audio data.
.DocStr(R"code(Decode audio data.
This operator is a generic way of handling encoded data in DALI.
It supports most of well-known audio formats (wav, flac, ogg).

This operator produces two outputs:

* output[0]: batch of decoded data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* output[0]: batch of decoded data
* output[0]: batch of decoded data<br/>

Comment on lines +37 to +39
"Resampling quality, 0 is lowest, 100 is highest.\n\n"
"0 corresponds to 3 lobes of the sinc filter;\n"
"50 gives 16 lobes and 100 gives 64 lobes.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective, this:

      0 corresponds to 3 lobes of the sinc filter
      50 gives 16 lobes and 100 gives 64 lobes.

Is an implementation detail. I know, that "quality" doesn't mean too much, but neither the latter. Especially, that it can change in future, e.g. we can add another constrain to the "quality". I'd remove this and leave only relative measure 0..100.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, librosa has some quality enum/string:
"kaiser_best" - 64-zero crossings
"kaiser_fast" - 16-zero crossings
I just followed that example.
Please note that I'm not specifying the envelope/window for the sinc, so we have some freedom here, still (e.g. we can swtich to Kaiser window when we find some implementation of Bessel functions in C++).

}

std::vector<float> target_sample_rates_;
kernels::signal::resampling::Resampler resampler_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the resampler_ ber per-sample?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not unless per-sample quality is needed - and I have quite strong opinion that it won't and I wouldn't waste memory and cache locality on that assumption unless we see at least potential use case (i.e. if we find it in actual network or at least offline augmentation for that network).

@szalpal
Copy link
Member

szalpal commented Dec 18, 2019

The doc right now doesn't look too pretty:
image
(look the paragraph-broken list)

@szalpal
Copy link
Member

szalpal commented Dec 18, 2019

Also in the "quality" param
image

* The function can seamlessly resample the input and produce the result in chunks.
* To reuse memory and still simulate chunk processing, adjust the in/out pointers.
*/
template <typename Out>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered the static definition of a number of channels? Would it help with the perf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would. I can do the VALUE_SWITCH trick here, no problem.

@mzient
Copy link
Contributor Author

mzient commented Dec 19, 2019

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1040143]: BUILD STARTED

@szalpal
Copy link
Member

szalpal commented Dec 19, 2019

Aside comment, to the tests. I recently found out, thet gtest discourages using underscore in test names:
https://github.com/google/googletest/blob/master/googletest/docs/faq.md#why-should-test-suite-names-and-test-names-not-contain-underscore
Could you change the names?

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1040143]: BUILD PASSED

// Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename this file to downmix.h

Out *out, const In *in, int64_t samples, int channels,
const float *weights, bool normalize_weights = false) {
SmallVector<float, 8> normalized_weights; // 8 channels should be enough for 7.1 audio
int actual_channels = static_channels < 0 ? channels : static_channels;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about the case when static_channels==0?

weights = normalized_weights.data(); // use this pointer now
}
for (int64_t o = 0, i = 0; o < samples; o++, i += channels) {
float sum = ConvertNorm<float>(in[i]) * weights[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

Suggested change
float sum = ConvertNorm<float>(in[i]) * weights[0];
float sum = 0;

and start the loot at c = 0 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That way you get one more addition. Given the simplicity of the inner loop, it may actually matter.

}

/**
* Downmix data to a single channel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Downmix data to a single channel.
* @brief Downmix data to a single channel.

// See the License for the specific language governing permissions and
// limitations under the License.

#ifndef DALI_KERNELS_SIGNAL_RESAMPLING_H_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not resample.h?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Header / library names are usually nouns. Moreover, what's inside is the Resampler class, not a free function - though resampler.h also doesn't sound well.

audio.data[ofs] = ConvertSatNorm<OutputType>(tmp_buf[ofs]);
}
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I feel there are too many if/else levels here. We could at least get rid of the outer-most one by having this "decode-only" case at the top with an early return

Copy link
Contributor Author

@mzient mzient Dec 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd find it a bit confusing. Early returns are great for checks like "do we have to do anything?" or checking for error conditions - this is neither.
I thought about decoding and then postprocessing - but it's impossible, because we're potentially decoding to different buffers.

double q = quality_;
DALI_ENFORCE(q >= 0 && q <= 100, "Resampling quality must be in [0..100] range");
// this should give 3 lobes for q = 0, 16 lobes for q = 50 and 64 lobes for q = 100
int lobes = std::round(0.007 * q * q - 0.09 * q + 3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a wikipedia reference here to know where the formula comes from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It comes from nowhere - it's a solution of a linear system:

a * 0 + b * 0 + c = 3
a * 50^2 + b * 50 + c = 16
a * 100^2 + b * 100 + c = 64

It does exactly, what's written - gives you 3 at 0, 16 at 50 and 64 at 100.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so that quality concept is invented by us?

@@ -53,13 +59,12 @@ class AudioDecoderBase {
template<typename SampleType>
class TypedAudioDecoderBase : public AudioDecoderBase {
public:
void Decode(span<char> raw_output) override {
ptrdiff_t Decode(span<char> raw_output) override {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just int64_t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the return type of span::size().

list.append(np.array(bytearray(f.read()), np.uint8))
self.feed_input(self.raw_file, list)

def test_decoded_vs_generated():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to leave the librosa comparing for just the typical conversions (22050 to 16000 and such)

@@ -167,6 +167,25 @@ constexpr double rad2deg(double rad) {
return rad * r2d;
}

/// @brief Calculates normalized sinc i.e. `sin(pi * x) / (pi * x)`
DALI_HOST_DEV DALI_FORCEINLINE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessarily duplicated code. You can templatize this and replace 1.0 and 1.0f by T(1)

Copy link
Contributor Author

@mzient mzient Dec 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The threshold for transition to approximation is different - and putting
std::is_same<T, float>::value ? T(1e-5) : T(1e-8)
doesn't sound like a simplification to me.
Going further, it would require having some template magic for integer arguments or long double.

@mzient
Copy link
Contributor Author

mzient commented Dec 19, 2019

Aside comment, to the tests. I recently found out, thet gtest discourages using underscore in test names:
https://github.com/google/googletest/blob/master/googletest/docs/faq.md#why-should-test-suite-names-and-test-names-not-contain-underscore
Could you change the names?

In principle, I could, but they would be harder to read. When I type "Span_DefaultWeights" , it's pretty clear that it's about some "Span" and "DefaultWeights". "SpanDefaultWeights" sounds like something spans default weights (which makes no sense in this context) or "Span, Default, Weights", whatever is "Default", etc. SpanAndDefaultWeights is also so-so at best.
So, I'm willing to take the risk that it won't compile some time in the distant future, with some yet-to-be-seen version of gtest.

Signed-off-by: Michał Szołucha <mszolucha@nvidia.com>
Signed-off-by: Michał Szołucha <mszolucha@nvidia.com>
Signed-off-by: Michał Szołucha <mszolucha@nvidia.com>
Signed-off-by: Michał Szołucha <mszolucha@nvidia.com>
Signed-off-by: Michał Szołucha <mszolucha@nvidia.com>
szalpal and others added 11 commits December 19, 2019 16:04
Signed-off-by: Michał Szołucha <mszolucha@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
* Add multi-channel resampling.

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
* Add output conversion to resampling.
* Essentially rewrite the operator:
    * Add per-sample sampling rate
    * Add explicit downmixing option
    * Add multi-channel resampling
    * Remove dangerous global buffers (BUG!)

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Fix window function generator.
Fix multi-channel resampling.

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Fix bugs.
Lint.

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
* Add Python test against librosa resampling
* Change sample center to sample zero

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@mzient
Copy link
Contributor Author

mzient commented Dec 19, 2019

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1040438]: BUILD STARTED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@mzient
Copy link
Contributor Author

mzient commented Dec 19, 2019

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1040494]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1040494]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1040494]: BUILD PASSED

@mzient mzient merged commit ae6e0ee into NVIDIA:master Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants