
GPU MFCC operator. #2423

Merged
merged 3 commits on Nov 2, 2020
Conversation

@banasraf (Collaborator) commented Nov 2, 2020

Why do we need this PR?

  • It adds MFCC operator for GPU.

What happened in this PR?


  • What solution was applied:
    DCT kernel was extended to support lifter coefficients. The operator is a simple wrapper.
  • Affected modules and functionalities:
    DCT GPU kernel, new MFCC GPU operator.
  • Key points relevant for the review:
    Changes in the kernel. Lifter coefficients calculation.
  • Validation and testing:
    I've extended DCT kernel tests to support lifter coefficients and added GPU to MFCC python tests.
  • Documentation (including examples):
    NA

JIRA TASK: DALI-1664
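The lifter coefficients calculation mentioned in the key points can be sketched on the host as follows. This is a minimal sketch using the same formula that appears later in the PR's CalcLifterKernel, w[i] = 1 + (lifter / 2) * sin(pi * (i + 1) / lifter); the function name and structure are illustrative, not DALI's actual API:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Host-side sketch (illustrative names, not DALI's API) of the liftering
// weights: w[i] = 1 + (lifter / 2) * sin(pi * (i + 1) / lifter).
std::vector<float> calc_lifter_coeffs(int ndct, float lifter) {
  const float pi = 3.14159265358979f;
  std::vector<float> coeffs(ndct);
  const float ampl_mult = lifter / 2.0f;
  const float phase_mult = pi / lifter;
  for (int i = 0; i < ndct; ++i)
    coeffs[i] = 1.0f + ampl_mult * std::sin(phase_mult * (i + 1));
  return coeffs;
}
```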

DALI_HOST_DEV
float operator()(float val) { return val * coeff_; }

const float coeff_;
Contributor:

Either:

Suggested change
const float coeff_;
const float coeff;

or

Suggested change
const float coeff_;
private:
const float coeff_;

Contributor:

I would go for the second as you don't want to access coeff_ directly anyway.

Collaborator Author:

I've removed LifterTable.

, in_shape_(batch_size_, dims_) {
if (lifter_) {
FillLifter();
const int max_ndct = 40;
Contributor:

Where does this 40 come from?
I see it repeated in L65, maybe extract it to a common place.

Collaborator Author:

done

]:
yield check_operator_mfcc_wrong_args, device, batch_size, shape, \
axis, dct_type, lifter, n_mfcc, norm
Contributor:

Add a newline.

Collaborator Author:

done

struct LiftersTable {};

template <>
struct LiftersTable<true> {
Contributor:

Suggested change
struct LiftersTable<true> {
struct LifterTable<true> {

Contributor:

I think this class is not necessary at all. Why do we need this abstraction over a simple pointer to an array of coefficients?

Collaborator Author:

@mzient It's a static optimization, to get rid of the if in the case of no liftering. But as Joaquin suggested, I can just use an if on the static parameter without all those classes.
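The "if on the static parameter" idea can be sketched like this; the function apply_dct_point is a hypothetical stand-in for the kernel's inner step, and HasLifter plays the role of the nonzero template argument:

```cpp
#include <cassert>

// Sketch of replacing the LiftersTable abstraction with a compile-time bool:
// when HasLifter is false the branch is dead code and the compiler removes it.
template <bool HasLifter>
float apply_dct_point(float out_val, const float *lifter_coeffs, int y) {
  if (HasLifter)  // compile-time constant condition
    return lifter_coeffs[y] * out_val;
  return out_val;
}
```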

Collaborator Author:

I've removed LifterTable.

// The kernel processes data with the shape reduced to 3D.
// Transform is applied over the middle axis.
template <typename OutputType, typename InputType>
template <typename OutputType, typename InputType, bool nonzero>
Contributor:

I'd do either:

1. Suggested change
template <typename OutputType, typename InputType, bool nonzero>
template <typename OutputType, typename InputType, typename LifterTable>

2. or remove the zero version and rely on the bool directly:
sample.output[output_idx] = HasLifter ? coeff * out_val : out_val;

where HasLifter is your nonzero template argument.

Collaborator Author:

I've removed LifterTable and used the HasLifter param.


auto lifter = 0.0f;
coeffs.Calculate(10, lifter);
ASSERT_TRUE(coeffs.empty());

lifter = 1.234f;
coeffs.Calculate(10, lifter);
check_lifter_coeffs(coeffs, lifter, 10);
check_lifter_coeffs(span<const float>(coeffs.data(), coeffs.size()), lifter, 10);
Contributor:

make_cspan(coeffs) should work

Contributor:

Even make_span(coeffs) would work; there's an implicit conversion to a span of const-qualified objects.
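A minimal stand-in for the span helpers discussed here; DALI's real span, make_span, and make_cspan live in its core headers, so the types below are simplified sketches that only illustrate the implicit span&lt;T&gt; to span&lt;const T&gt; conversion the reviewers mention:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Simplified stand-ins for DALI's span utilities, just to show why
// make_span(coeffs) also works where a span<const float> is expected.
template <typename T>
struct span {
  T *data = nullptr;
  std::ptrdiff_t size = 0;
  span() = default;
  span(T *d, std::ptrdiff_t s) : data(d), size(s) {}
  // implicit conversion span<U> -> span<T> when U* converts to T*
  // (e.g. span<float> -> span<const float>)
  template <typename U,
            typename = std::enable_if_t<
                std::is_convertible<U (*)[], T (*)[]>::value>>
  span(const span<U> &s) : data(s.data), size(s.size) {}
};

template <typename C>
span<typename C::value_type> make_span(C &c) {
  return {c.data(), static_cast<std::ptrdiff_t>(c.size())};
}

template <typename C>
span<const typename C::value_type> make_cspan(const C &c) {
  return {c.data(), static_cast<std::ptrdiff_t>(c.size())};
}
```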

Collaborator Author:

done

explicit Lifter(float coeff): coeff_(coeff) {}

DALI_HOST_DEV
float operator()(float val) { return val * coeff_; }
Contributor:

Suggested change
float operator()(float val) { return val * coeff_; }
constexpr float operator()(float val) const { return val * coeff_; }

Collaborator Author:

I've removed LifterTable.

const float coeff_;
};

struct IdLifter {
Contributor:

Why not just use identity from "core/util.h"?

Collaborator Author:

I've removed LifterTable.

template <>
struct LiftersTable<false> {
DALI_HOST_DEV
IdLifter lifter(int) {return IdLifter{}; }
Contributor @mzient commented Nov 2, 2020:

Suggested change
IdLifter lifter(int) {return IdLifter{}; }
static identity lifter(int) {return {}; }

Collaborator Author:

I've removed LifterTable.

__global__ void ApplyDct(const typename Dct1DGpu<OutputType, InputType>::SampleDesc *samples,
const BlockDesc<3> *blocks) {
const BlockDesc<3> *blocks, LiftersTable<nonzero> lifters) {
Contributor @mzient commented Nov 2, 2020:

Suggested change
const BlockDesc<3> *blocks, LiftersTable<nonzero> lifters) {
const BlockDesc<3> *blocks, const float *lifter_coeffs = nullptr) {

Collaborator Author:

done

@@ -51,7 +90,7 @@ __global__ void ApplyDct(const typename Dct1DGpu<OutputType, InputType>::SampleD
out_val += *input * cos_row[i];
input += in_stride[1];
}
sample.output[output_idx] = out_val;
sample.output[output_idx] = lifter(out_val);
Contributor:

Suggested change
sample.output[output_idx] = lifter(out_val);
if (lifter_coeffs)
out_val *= lifter_coeffs[y];
sample.output[output_idx] = out_val;

Collaborator Author:

I've used a static parameter.

span<const DctArgs> args,
int axis) {
span<const DctArgs> args, int axis,
span<const float>) {
Contributor:

We no longer pass unused arguments to setup. Please remove.

Collaborator Author:

done

@@ -120,7 +159,8 @@ template <typename OutputType, typename InputType>
DLL_PUBLIC void Dct1DGpu<OutputType, InputType>::Run(KernelContext &ctx,
const OutListGPU<OutputType> &out,
const InListGPU<InputType> &in,
span<const DctArgs>, int) {
span<const DctArgs>, int,
span<const float> lifter_coeffs) {
Contributor @mzient commented Nov 2, 2020:

Is that a device pointer? If so, it should be marked as such:

Suggested change
span<const float> lifter_coeffs) {
span<const float> lifter_coeffs_dev) {

or just use

Suggested change
span<const float> lifter_coeffs) {
InTensorGPU<float, 1> lifter_coeffs) {

Collaborator Author:

I've used InTensorGPU

@@ -77,12 +77,14 @@ class DLL_PUBLIC Dct1DGpu {

DLL_PUBLIC KernelRequirements Setup(KernelContext &context,
const InListGPU<InputType> &in,
span<const DctArgs> args, int axis);
span<const DctArgs> args, int axis,
span<const float> lifter_coeffs);
Contributor:

If it's not used in Setup, don't add it.

Collaborator Author:

done


DLL_PUBLIC void Run(KernelContext &context,
const OutListGPU<OutputType> &out,
const InListGPU<InputType> &in,
span<const DctArgs> args, int axis);
span<const DctArgs> args, int axis,
span<const float> lifter_coeffs);
Contributor:

Likewise, preferably use InTensorGPU.

Collaborator Author:

done

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
int added_length = target_length - start_idx;
coeffs_.resize(target_length, stream);
int threads = std::min(added_length, 256);
CalcLifterKernel<<<1, threads, 0, stream>>>(coeffs_.data(), start_idx, target_length, lifter);
Contributor:

This is a very small job - perhaps it'd be better to utilize more SMs and launch div_ceil(added_length, threads) blocks and remove the loop from the kernel.
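The suggested launch configuration can be sketched as follows. The div_ceil here is the usual rounding-up integer division assumed by the reviewer; the kernel indexing shown in the comments is illustrative, not the PR's final code:

```cpp
#include <cassert>

// Rounding-up integer division used to size the grid so each element gets
// exactly one thread; the in-kernel loop then reduces to a bounds check.
constexpr int div_ceil(int a, int b) { return (a + b - 1) / b; }

// Hypothetical launch shape (CUDA shown in comments only):
//   int threads = 256;
//   int blocks = div_ceil(added_length, threads);
//   CalcLifterKernel<<<blocks, threads, 0, stream>>>(...);
// and inside the kernel, one index per thread:
//   int64_t i = start_idx + blockIdx.x * blockDim.x + threadIdx.x;
//   if (i < target_length) coeffs[i] = ...;
```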

Contributor:

Taking another angle at it: since this is a very small job and it's done just once, I doubt there's any performance gain from calculating it on device - and maybe there's some value in calculating the coeffs on host and copying them to device, so they match exactly across backends.

Collaborator Author:

I've moved the coefficients calculation to the CPU.

__global__ void CalcLifterKernel(float *coeffs, int64_t start_idx, int64_t target_length,
float lifter) {
float ampl_mult = lifter / 2;
float phase_mult = M_PI / lifter;
Contributor:

Suggested change
float phase_mult = M_PI / lifter;
float phase_mult = static_cast<float>(M_PI) / lifter;

Collaborator Author:

done

float ampl_mult = lifter / 2;
float phase_mult = M_PI / lifter;
for (int64_t i = start_idx + threadIdx.x; i < target_length; i += blockDim.x)
coeffs[i] = 1.0 + ampl_mult * sinf(phase_mult * (i + 1));
Contributor:

Suggested change
coeffs[i] = 1.0 + ampl_mult * sinf(phase_mult * (i + 1));
coeffs[i] = 1.0f + ampl_mult * sinf(phase_mult * (i + 1));

Collaborator Author:

done

const workspace_t<GPUBackend> &ws) {
GetArguments(ws);
auto &input = ws.InputRef<GPUBackend>(0);
TYPE_SWITCH(input.type().id(), type2id, T, MFCC_SUPPORTED_TYPES, (
Contributor:

There's a lot going on inside. I'd extract it to SetupTyped - this would give superior compiler diagnostics and precise run-time error traceback.

Collaborator Author:

Added statically typed detail::SetupKernel

using Kernel = kernels::signal::dct::Dct1DGpu<T>;
auto in_view = view<const T>(input);
auto out_view = view<T>(ws.OutputRef<GPUBackend>(0));
span<const float> lifter_span(lifter_coeffs_.data(), lifter_coeffs_.size());
Contributor:

As mentioned before, for GPU data use a tensor view:

Suggested change
span<const float> lifter_span(lifter_coeffs_.data(), lifter_coeffs_.size());
auto lifter_coeffs = make_tensor_gpu<1>(lifter_coeffs_.data(), {lifter_coeffs_.size()});

Collaborator Author:

done

int64_t max_ndct = 0;
for (int i = 0; i < nsamples_; ++i) {
int64_t ndct = output_desc[0].shape[i][axis_];
if (ndct > max_ndct) max_ndct = ndct;
Contributor:

We normally break the line for if statements without braces.

Collaborator Author:

done

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@banasraf (Collaborator Author) commented Nov 2, 2020:

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1754530]: BUILD STARTED

__global__ void ApplyDct(const typename Dct1DGpu<OutputType, InputType>::SampleDesc *samples,
const BlockDesc<3> *blocks) {
const BlockDesc<3> *blocks, const float *lifter_coeffs) {
Contributor:

nitpick (linter might complain?)

Suggested change (whitespace/alignment only):
const BlockDesc<3> *blocks, const float *lifter_coeffs) {
const BlockDesc<3> *blocks, const float *lifter_coeffs) {

@dali-automaton (Collaborator):

CI MESSAGE: [1754530]: BUILD PASSED

@banasraf banasraf merged commit 359a6a5 into NVIDIA:master Nov 2, 2020
klecki pushed a commit that referenced this pull request Nov 3, 2020
Extend GPU DCT kernel to support liftering and add MFCC operator for GPU.

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>