
Erase GPU operator #1971

Merged: 5 commits into NVIDIA:master, May 22, 2020

Conversation

@banasraf (Collaborator):

Why do we need this PR?


  • It adds a GPU erase operator needed for audio support.

What happened in this PR?


  • What solution was applied:
    The GPU erase kernel was modified to support a different number of erase regions for each sample. The operator implementation is mostly just instantiating the kernel (a hedged sketch of this pattern is included below, after the JIRA reference).
  • Affected modules and functionalities:
    The GPU erase kernel and a new file with the GPU operator.
  • Key points relevant for the review:
    Instantiating the kernel.
  • Validation and testing:
    The existing Python test was extended to cover the GPU backend. The kernel test was extended to cover a varying number of erase regions per sample.
  • Documentation (including examples):
    N/A

JIRA TASK: [DALI-1245]
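
The operator side is thin; for context, here is a minimal, hypothetical sketch of the usual "operator instantiates the kernel" pattern. It is not the code from this PR: the class name EraseImplGpu, the EraseGpu template parameters, and the exact KernelManager/workspace calls are assumptions based on the typical DALI operator/kernel layout.

```cpp
// Hypothetical sketch only; names and signatures are assumptions, not the PR code.
template <typename T, int Dims>
class EraseImplGpu : public OpImplBase<GPUBackend> {
 public:
  explicit EraseImplGpu(const OpSpec &spec) : spec_(spec) {
    kmgr_.Resize<Kernel>(1, 1);  // one thread, one kernel instance
  }

  bool SetupImpl(std::vector<OutputDesc> &output_desc, const DeviceWorkspace &ws) override {
    const auto &input = ws.InputRef<GPUBackend>(0);
    kernels::KernelContext ctx;
    auto &req = kmgr_.Setup<Kernel>(0, ctx, view<const T, Dims>(input), args_);
    output_desc = {{req.output_shapes[0], input.type()}};  // erase preserves the input shape
    return true;
  }

  void RunImpl(DeviceWorkspace &ws) override {
    const auto &input = ws.InputRef<GPUBackend>(0);
    auto &output = ws.OutputRef<GPUBackend>(0);
    kernels::KernelContext ctx;
    ctx.gpu.stream = ws.stream();
    // The operator does little beyond forwarding views and arguments to the kernel.
    kmgr_.Run<Kernel>(0, 0, ctx, view<T, Dims>(output), view<const T, Dims>(input), args_);
  }

 private:
  using Kernel = kernels::EraseGpu<T, Dims>;       // assumed kernel name and parameters
  const OpSpec &spec_;
  kernels::KernelManager kmgr_;
  std::vector<kernels::EraseArgs<T, Dims>> args_;  // per-sample erase regions
};
```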

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@klecki (Contributor) left a comment:

Rather minor things, otherwise looks ok.

}

private:
OpSpec spec_;
@klecki (Contributor):

That smells of bad design: the base Operator class already has this field. Maybe we should rework OpImplBase into something that doesn't require this kind of ugly trick.

@banasraf (Collaborator, Author):

changed to const OpSpec&
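
For clarity, a minimal sketch of that resolution, assuming the impl keeps a reference to the OpSpec owned by the enclosing operator (names are illustrative, not the PR code):

```cpp
// Illustrative only: hold a reference to the operator's OpSpec instead of a copy.
class EraseImpl : public OpImplBase<GPUBackend> {
 public:
  explicit EraseImpl(const OpSpec &spec) : spec_(spec) {}

 private:
  const OpSpec &spec_;  // refers to the spec stored in the base Operator, no duplicate
};
```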

@NVIDIA deleted a comment from banasraf on May 20, 2020
Rafal added 2 commits May 21, 2020 16:19
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@banasraf (Collaborator, Author):

!build

@@ -309,7 +310,7 @@ struct do_copy_or_erase {

template <int channel_dim = -1, typename T, int ndim = 2>
__global__ void erase_gpu_impl(erase_sample_desc<T, ndim> *samples, ivec<ndim> region_shape,
@klecki (Contributor):

Suggested change
__global__ void erase_gpu_impl(erase_sample_desc<T, ndim> *samples, ivec<ndim> region_shape,
__global__ void erase_gpu_impl(const erase_sample_desc<T, ndim> *samples, ivec<ndim> region_shape,

@banasraf (Collaborator, Author):

done

@dali-automaton (Collaborator):

CI MESSAGE: [1337975]: BUILD STARTED

Comment on lines 512 to 514
auto *sample_desc_gpu = ctx.scratchpad->ToGPU(stream, make_span(sample_desc_cpu, num_samples));
auto* fill_values_gpu =
ctx.scratchpad->ToGPU(stream, make_span(fill_values_cpu, num_fill_values));
@klecki (Contributor):

A better option would be to use ToContiguousGPU; it will issue just one cudaMemcpy.

It's not in the scope of this task, I guess; however, if we keep some sane limit on the number of channels, the fill values could be copied to a __constant__ buffer. That should improve performance, since the fill values won't have to be read from global memory and won't compete for cache with the input.
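
The __constant__ idea was left out of this PR; a rough CUDA illustration of what it describes might look like the following (kMaxChannels, the kernel, and the float element type are assumptions for the sake of the example):

```cpp
// Illustration only, not part of this PR: keep fill values in constant memory.
static constexpr int kMaxChannels = 16;         // assumed "sane limit" on channels
__constant__ float const_fill_values[kMaxChannels];

__global__ void fill_from_constant(float *out, int n, int channel) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = const_fill_values[channel];        // served from the constant cache, not global memory
}

// Host side, before the kernel launch:
//   cudaMemcpyToSymbol(const_fill_values, fill_values_cpu,
//                      num_fill_values * sizeof(float));
```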

@banasraf (Collaborator, Author):

Used ToContiguousGPU
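
A sketch of the resulting call, assuming Scratchpad::ToContiguousGPU accepts the same collections and returns a tuple of device pointers populated with a single cudaMemcpy (variable names follow the snippet above):

```cpp
// Assumed shape of the change, not the exact PR code.
auto gpu_ptrs = ctx.scratchpad->ToContiguousGPU(
    stream,
    make_span(sample_desc_cpu, num_samples),
    make_span(fill_values_cpu, num_fill_values));
auto *sample_desc_gpu = std::get<0>(gpu_ptrs);  // erase_sample_desc<T, ndim> *
auto *fill_values_gpu = std::get<1>(gpu_ptrs);  // T *
```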

Comment on lines 72 to 73
const T *in = nullptr;
T* out = nullptr;
@klecki (Contributor):

Suggested change
const T *in = nullptr;
T* out = nullptr;
const T *__restrict__ in = nullptr;
T *__restrict__ out = nullptr;

@banasraf (Collaborator, Author):

done

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
return {reinterpret_cast<ptrs_t>(tlv.data.data()), new_shape};
}

template <int ndim, typename Storage>
@klecki (Contributor):

I don't think this overload is necessary; a non-const TensorListView should convert implicitly to a const one.
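
A small example of the conversion in question, assuming TensorListView's usual converting constructor; the consume function is hypothetical and only illustrates the call site:

```cpp
// Hypothetical consumer taking a view over const elements.
void consume(const TensorListView<StorageGPU, const float, 2> &tlv);

void example(const TensorListView<StorageGPU, float, 2> &mutable_view) {
  consume(mutable_view);  // non-const element view converts implicitly; no extra overload needed
}
```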

@@ -126,8 +126,8 @@ std::tuple<std::remove_cv_t<element_t<Collections>>*...>
ToContiguousGPUMem(Scratchpad &scratchpad, cudaStream_t stream, const Collections &... c) {
const size_t N = sizeof...(Collections);
static_assert(
all_of<std::is_pod<std::remove_cv_t<element_t<Collections>>>::value...>::value,
"ToContiguousGPUMem must be used with collections of POD types");
all_of<std::is_trivially_copyable<std::remove_cv_t<element_t<Collections>>>::value...>::value,
@klecki (Contributor):

There's one more is_pod in this file - please change it, too.

regions_shape.set_tensor_shape(i, {n_regions, 2, Dims});
}
TensorList<CPUBackend> regions_cpu;
regions_cpu.set_type(TypeTable::GetTypeInfo(TypeTable::GetTypeID<int32_t>()));
@klecki (Contributor):

Why not:

Suggested change
regions_cpu.set_type(TypeTable::GetTypeInfo(TypeTable::GetTypeID<int32_t>()));
regions_cpu.set_type(TypeInfo::Create<ibox<Dims>>());

?

@banasraf (Collaborator, Author):

done

TensorList<CPUBackend> regions_cpu;
regions_cpu.set_type(TypeTable::GetTypeInfo(TypeTable::GetTypeID<int32_t>()));
regions_cpu.Resize(regions_shape);
auto regions_tlv = detail::as_boxes<Dims>(view<int32_t, 3>(regions_cpu));
@klecki (Contributor):

And you don't need this as_boxes thing at all, you can take a view<ibox<Dims>>(regions_cpu) and place the data in it directly. AFAIK it should work.

@banasraf (Collaborator, Author):

done
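
A rough sketch of the adopted approach, under the assumption that the regions TensorList is typed as ibox<Dims> with one box per region and that ibox can be brace-initialized from {lo, hi}:

```cpp
// Sketch only; the element type and shapes follow the suggestion above.
TensorList<CPUBackend> regions_cpu;
regions_cpu.set_type(TypeInfo::Create<ibox<Dims>>());
regions_cpu.Resize(regions_shape);                        // now one ibox<Dims> per region
auto regions_tlv = view<ibox<Dims>, 1>(regions_cpu);      // typed view, no as_boxes reinterpretation
regions_tlv[0].data[0] = {ivec<Dims>(0), ivec<Dims>(1)};  // hypothetical region {lo, hi}
```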

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@banasraf (Collaborator, Author):

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1340155]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [1340155]: BUILD PASSED

@banasraf banasraf merged commit 175e9ff into NVIDIA:master May 22, 2020