
GPU MFCC operator. #2423

Merged
merged 3 commits on Nov 2, 2020
Conversation

@banasraf (Collaborator) commented Nov 2, 2020

Why do we need this PR?

  • It adds MFCC operator for GPU.

What happened in this PR?


  • What solution was applied:
    DCT kernel was extended to support lifter coefficients. The operator is a simple wrapper.
  • Affected modules and functionalities:
    DCT GPU kernel, new MFCC GPU operator.
  • Key points relevant for the review:
    Changes in the kernel. Lifter coefficients calculation.
  • Validation and testing:
    I've extended DCT kernel tests to support lifter coefficients and added GPU to MFCC python tests.
  • Documentation (including examples):
    NA

JIRA TASK: DALI-1664
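The lifter coefficients calculation mentioned in the key points can be sketched on the host as follows. This is a minimal sketch using the same formula that appears later in the PR's CalcLifterKernel, w[i] = 1 + (lifter / 2) * sin(pi * (i + 1) / lifter); the function name and structure are illustrative, not DALI's actual API:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Host-side sketch (illustrative names, not DALI's API) of the liftering
// weights: w[i] = 1 + (lifter / 2) * sin(pi * (i + 1) / lifter).
std::vector<float> calc_lifter_coeffs(int ndct, float lifter) {
  const float pi = 3.14159265358979f;
  std::vector<float> coeffs(ndct);
  const float ampl_mult = lifter / 2.0f;
  const float phase_mult = pi / lifter;
  for (int i = 0; i < ndct; ++i)
    coeffs[i] = 1.0f + ampl_mult * std::sin(phase_mult * (i + 1));
  return coeffs;
}
```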

DALI_HOST_DEV
float operator()(float val) { return val * coeff_; }

const float coeff_;
Contributor:

Either:

Suggested change
const float coeff_;
const float coeff;

or

Suggested change
const float coeff_;
private:
const float coeff_;

Contributor:

I would go for the second as you don't want to access coeff_ directly anyway.

Collaborator Author:

I've removed LifterTable.

, in_shape_(batch_size_, dims_) {
if (lifter_) {
FillLifter();
const int max_ndct = 40;
Contributor:

Where does this 40 come from?
I see it repeated in L65, maybe extract it to a common place.

Collaborator Author:

done

]:
yield check_operator_mfcc_wrong_args, device, batch_size, shape, \
axis, dct_type, lifter, n_mfcc, norm
Contributor:

Add a newline.

Collaborator Author:

done

struct LiftersTable {};

template <>
struct LiftersTable<true> {
Contributor:

Suggested change
struct LiftersTable<true> {
struct LifterTable<true> {

Contributor:

I think this class is not necessary at all. Why do we need this abstraction over a simple pointer to an array of coefficients?

Collaborator Author:

@mzient It's a static optimization, to get rid of the if in the case of no liftering. But as Joaquin suggested, I can just use an if on the static parameter without all those classes.
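The "if on the static parameter" idea can be sketched like this; the function apply_dct_point is a hypothetical stand-in for the kernel's inner step, and HasLifter plays the role of the nonzero template argument:

```cpp
#include <cassert>

// Sketch of replacing the LiftersTable abstraction with a compile-time bool:
// when HasLifter is false the branch is dead code and the compiler removes it.
template <bool HasLifter>
float apply_dct_point(float out_val, const float *lifter_coeffs, int y) {
  if (HasLifter)  // compile-time constant condition
    return lifter_coeffs[y] * out_val;
  return out_val;
}
```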

Collaborator Author:

I've removed LifterTable.

// The kernel processes data with the shape reduced to 3D.
// Transform is applied over the middle axis.
template <typename OutputType, typename InputType>
template <typename OutputType, typename InputType, bool nonzero>
Contributor:

I'd do either:

1. Suggested change
template <typename OutputType, typename InputType, bool nonzero>
template <typename OutputType, typename InputType, typename LifterTable>

2. or remove the zero version and rely on the bool directly:
sample.output[output_idx] = HasLifter ? coeff * out_val : out_val;

where HasLifter is your nonzero template argument.

Collaborator Author:

I've removed LifterTable and used the HasLifter param.


auto lifter = 0.0f;
coeffs.Calculate(10, lifter);
ASSERT_TRUE(coeffs.empty());

lifter = 1.234f;
coeffs.Calculate(10, lifter);
check_lifter_coeffs(coeffs, lifter, 10);
check_lifter_coeffs(span<const float>(coeffs.data(), coeffs.size()), lifter, 10);
Contributor:

make_cspan(coeffs) should work

Contributor:

Even make_span(coeffs) would work; there's an implicit conversion to a span of const-qualified objects.
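A minimal stand-in for the span helpers discussed here; DALI's real span, make_span, and make_cspan live in its core headers, so the types below are simplified sketches that only illustrate the implicit span&lt;T&gt; to span&lt;const T&gt; conversion the reviewers mention:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Simplified stand-ins for DALI's span utilities, just to show why
// make_span(coeffs) also works where a span<const float> is expected.
template <typename T>
struct span {
  T *data = nullptr;
  std::ptrdiff_t size = 0;
  span() = default;
  span(T *d, std::ptrdiff_t s) : data(d), size(s) {}
  // implicit conversion span<U> -> span<T> when U* converts to T*
  // (e.g. span<float> -> span<const float>)
  template <typename U,
            typename = std::enable_if_t<
                std::is_convertible<U (*)[], T (*)[]>::value>>
  span(const span<U> &s) : data(s.data), size(s.size) {}
};

template <typename C>
span<typename C::value_type> make_span(C &c) {
  return {c.data(), static_cast<std::ptrdiff_t>(c.size())};
}

template <typename C>
span<const typename C::value_type> make_cspan(const C &c) {
  return {c.data(), static_cast<std::ptrdiff_t>(c.size())};
}
```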

Collaborator Author:

done

explicit Lifter(float coeff): coeff_(coeff) {}

DALI_HOST_DEV
float operator()(float val) { return val * coeff_; }
Contributor:

Suggested change
float operator()(float val) { return val * coeff_; }
constexpr float operator()(float val) const { return val * coeff_; }

Collaborator Author:

I've removed LifterTable.

const float coeff_;
};

struct IdLifter {
Contributor:

Why not just use identity from "core/util.h"?

Collaborator Author:

I've removed LifterTable.

template <>
struct LiftersTable<false> {
DALI_HOST_DEV
IdLifter lifter(int) {return IdLifter{}; }
Contributor @mzient commented Nov 2, 2020:

Suggested change
IdLifter lifter(int) {return IdLifter{}; }
static identity lifter(int) {return {}; }

Collaborator Author:

I've removed LifterTable.

__global__ void ApplyDct(const typename Dct1DGpu<OutputType, InputType>::SampleDesc *samples,
const BlockDesc<3> *blocks) {
const BlockDesc<3> *blocks, LiftersTable<nonzero> lifters) {
Contributor @mzient commented Nov 2, 2020:

Suggested change
const BlockDesc<3> *blocks, LiftersTable<nonzero> lifters) {
const BlockDesc<3> *blocks, const float *lifter_coeffs = nullptr) {

Collaborator Author:

done

@@ -51,7 +90,7 @@ __global__ void ApplyDct(const typename Dct1DGpu<OutputType, InputType>::SampleD
out_val += *input * cos_row[i];
input += in_stride[1];
}
sample.output[output_idx] = out_val;
sample.output[output_idx] = lifter(out_val);
Contributor:

Suggested change
sample.output[output_idx] = lifter(out_val);
if (lifter_coeffs)
out_val *= lifter_coeffs[y];
sample.output[output_idx] = out_val;

Collaborator Author:

I've used a static parameter.

span<const DctArgs> args,
int axis) {
span<const DctArgs> args, int axis,
span<const float>) {
Contributor:

We no longer pass unused arguments to setup. Please remove.

Collaborator Author:

done

@@ -120,7 +159,8 @@ template <typename OutputType, typename InputType>
DLL_PUBLIC void Dct1DGpu<OutputType, InputType>::Run(KernelContext &ctx,
const OutListGPU<OutputType> &out,
const InListGPU<InputType> &in,
span<const DctArgs>, int) {
span<const DctArgs>, int,
span<const float> lifter_coeffs) {
Contributor @mzient commented Nov 2, 2020:

Is that a device pointer? If so, it should be marked as such:

Suggested change
span<const float> lifter_coeffs) {
span<const float> lifter_coeffs_dev) {

or just use

Suggested change
span<const float> lifter_coeffs) {
InTensorGPU<float, 1> lifter_coeffs) {

Collaborator Author:

I've used InTensorGPU

@@ -77,12 +77,14 @@ class DLL_PUBLIC Dct1DGpu {

DLL_PUBLIC KernelRequirements Setup(KernelContext &context,
const InListGPU<InputType> &in,
span<const DctArgs> args, int axis);
span<const DctArgs> args, int axis,
span<const float> lifter_coeffs);
Contributor:

If it's not used in Setup, don't add it.

Collaborator Author:

done


DLL_PUBLIC void Run(KernelContext &context,
const OutListGPU<OutputType> &out,
const InListGPU<InputType> &in,
span<const DctArgs> args, int axis);
span<const DctArgs> args, int axis,
span<const float> lifter_coeffs);
Contributor:

Likewise, preferably use InTensorGPU.

Collaborator Author:

done

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
int added_length = target_length - start_idx;
coeffs_.resize(target_length, stream);
int threads = std::min(added_length, 256);
CalcLifterKernel<<<1, threads, 0, stream>>>(coeffs_.data(), start_idx, target_length, lifter);
Contributor:

This is a very small job - perhaps it'd be better to utilize more SMs and launch div_ceil(added_length, threads) blocks and remove the loop from the kernel.
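The suggested launch configuration can be sketched as follows. The div_ceil here is the usual rounding-up integer division assumed by the reviewer; the kernel indexing shown in the comments is illustrative, not the PR's final code:

```cpp
#include <cassert>

// Rounding-up integer division used to size the grid so each element gets
// exactly one thread; the in-kernel loop then reduces to a bounds check.
constexpr int div_ceil(int a, int b) { return (a + b - 1) / b; }

// Hypothetical launch shape (CUDA shown in comments only):
//   int threads = 256;
//   int blocks = div_ceil(added_length, threads);
//   CalcLifterKernel<<<blocks, threads, 0, stream>>>(...);
// and inside the kernel, one index per thread:
//   int64_t i = start_idx + blockIdx.x * blockDim.x + threadIdx.x;
//   if (i < target_length) coeffs[i] = ...;
```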

Contributor:

Taking another angle at it: since this is a very small job and it's done just once, I doubt there's any performance gain from calculating it on device - and maybe there's some value in calculating the coeffs on host and copying them to device, so they match exactly across backends.

Collaborator Author:

I've moved the coefficients calculation to the CPU.

__global__ void CalcLifterKernel(float *coeffs, int64_t start_idx, int64_t target_length,
float lifter) {
float ampl_mult = lifter / 2;
float phase_mult = M_PI / lifter;
Contributor:

Suggested change
float phase_mult = M_PI / lifter;
float phase_mult = static_cast<float>(M_PI) / lifter;

Collaborator Author:

done

float ampl_mult = lifter / 2;
float phase_mult = M_PI / lifter;
for (int64_t i = start_idx + threadIdx.x; i < target_length; i += blockDim.x)
coeffs[i] = 1.0 + ampl_mult * sinf(phase_mult * (i + 1));
Contributor:

Suggested change
coeffs[i] = 1.0 + ampl_mult * sinf(phase_mult * (i + 1));
coeffs[i] = 1.0f + ampl_mult * sinf(phase_mult * (i + 1));

Collaborator Author:

done

const workspace_t<GPUBackend> &ws) {
GetArguments(ws);
auto &input = ws.InputRef<GPUBackend>(0);
TYPE_SWITCH(input.type().id(), type2id, T, MFCC_SUPPORTED_TYPES, (
Contributor:

There's a lot going on inside. I'd extract it to SetupTyped - this would give superior compiler diagnostics and precise run-time error traceback.

Collaborator Author:

Added statically typed detail::SetupKernel

using Kernel = kernels::signal::dct::Dct1DGpu<T>;
auto in_view = view<const T>(input);
auto out_view = view<T>(ws.OutputRef<GPUBackend>(0));
span<const float> lifter_span(lifter_coeffs_.data(), lifter_coeffs_.size());
Contributor:

As mentioned before, for GPU data use a tensor view:

Suggested change
span<const float> lifter_span(lifter_coeffs_.data(), lifter_coeffs_.size());
auto lifter_coeffs = make_tensor_gpu<1>(lifter_coeffs_.data(), {lifter_coeffs_.size()});

Collaborator Author:

done

int64_t max_ndct = 0;
for (int i = 0; i < nsamples_; ++i) {
int64_t ndct = output_desc[0].shape[i][axis_];
if (ndct > max_ndct) max_ndct = ndct;
Contributor:

We normally break the line for if statements without braces.

Collaborator Author:

done

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@banasraf (Collaborator Author) commented Nov 2, 2020:

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1754530]: BUILD STARTED

__global__ void ApplyDct(const typename Dct1DGpu<OutputType, InputType>::SampleDesc *samples,
const BlockDesc<3> *blocks) {
const BlockDesc<3> *blocks, const float *lifter_coeffs) {
Contributor:

nitpick (linter might complain?)

Suggested change (whitespace/alignment only):
const BlockDesc<3> *blocks, const float *lifter_coeffs) {
const BlockDesc<3> *blocks, const float *lifter_coeffs) {

@dali-automaton (Collaborator):

CI MESSAGE: [1754530]: BUILD PASSED

@banasraf banasraf merged commit 359a6a5 into NVIDIA:master Nov 2, 2020
klecki pushed a commit that referenced this pull request Nov 3, 2020
Extend GPU DCT kernel to support liftering and add MFCC operator for GPU.

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>