
Normalize GPU kernel #1974

Merged · 7 commits · May 27, 2020
Conversation

@mzient (Contributor) commented May 21, 2020

Why we need this PR?


  • It adds NormalizeImplGPU because we want a GPU-based normalization

What happened in this PR?


  • What solution was applied:
    • DropDims as a way to recalculate data/param coordinates
    • trivial slicing by data chunks
    • broadcasting using DropDims
    • full broadcasting (from host-side scalars) is a special case
  • NOTE: the pImpl front-end kernel is not part of this PR; it's in "Normalize GPU - pImpl + Bessel's corrections" #1981
  • Affected modules and functionalities:
    • Reductions - ReduceDims now uses fast_div
  • Key points relevant for the review:
    • The kernel?
  • Validation and testing:
    • Unit tests (GTest)
  • Documentation (including examples):
    • Doxygen
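To illustrate the DropDims idea mentioned above - recalculating parameter coordinates from data coordinates by dropping the reduced axes - here is a host-side sketch. This is hypothetical illustration code, not DALI's actual `DropDims` implementation (which uses `fast_div` on the GPU):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Maps a flat offset in the data tensor to the corresponding flat offset
// in a parameter tensor whose reduced axes have extent 1.
// Walks axes innermost-first, peeling coordinates off with div/mod and
// accumulating only the coordinates of non-reduced axes.
int64_t ParamOffset(int64_t data_offset,
                    const std::vector<int64_t> &data_shape,
                    const std::vector<bool> &reduced) {
  int64_t param_offset = 0, param_stride = 1;
  for (int a = data_shape.size() - 1; a >= 0; a--) {
    int64_t coord = data_offset % data_shape[a];
    data_offset /= data_shape[a];
    if (!reduced[a]) {
      param_offset += coord * param_stride;
      param_stride *= data_shape[a];
    }
  }
  return param_offset;
}
```

For example, in a data tensor of shape {2, 3, 4} with axis 1 reduced, the element at coordinates (1, 2, 3) (flat offset 23) maps to parameter coordinates (1, 3) in shape {2, 4} (flat offset 7).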

JIRA TASK: DALI-1267

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

Tests - begun.
@mzient mzient requested a review from a team May 21, 2020 17:21
@mzient (Contributor Author) commented May 21, 2020

!build

@dali-automaton (Collaborator):
CI MESSAGE: [1338269]: BUILD STARTED

@dali-automaton (Collaborator):
CI MESSAGE: [1338269]: BUILD FAILED

}

std::pair<dim3, dim3> GetLaunchParams(const TensorListShape<> &data_shape) const {
int64_t block = 1024;
Contributor:

Maybe we should query the GPU for its capabilities - or rather, the user should, and provide the necessary values to the kernel?
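For context, querying the device limit instead of hard-coding 1024 could look roughly like this (plain CUDA runtime API; `MaxBlockSize` is a hypothetical helper name, not DALI code):

```cuda
#include <cuda_runtime.h>

// Query the current device's hard limit on threads per block
// instead of assuming 1024.
int MaxBlockSize() {
  int dev = 0, max_threads = 0;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&max_threads, cudaDevAttrMaxThreadsPerBlock, dev);
  return max_threads;  // 1024 on all recent architectures, but not guaranteed
}
```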

Add performance tests.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@mzient (Contributor Author) commented May 22, 2020

!build

@dali-automaton (Collaborator):
CI MESSAGE: [1340822]: BUILD STARTED

@dali-automaton (Collaborator):
CI MESSAGE: [1340822]: BUILD PASSED

Launch(ctx);
cudaEventRecord(end, ctx.gpu.stream);
float time;
cudaDeviceSynchronize();
Contributor:

nitpick: isn't it enough to synchronize with ctx.gpu.stream?
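The suggested change could look roughly like this (a fragment, assuming the `start`/`end` events and `ctx` from the surrounding test code):

```cuda
// Synchronize only on the recorded event rather than the whole device;
// cudaEventSynchronize blocks until work preceding the event on
// ctx.gpu.stream has completed.
cudaEventRecord(start, ctx.gpu.stream);
Launch(ctx);
cudaEventRecord(end, ctx.gpu.stream);
cudaEventSynchronize(end);  // instead of cudaDeviceSynchronize()
float time_ms = 0;
cudaEventElapsedTime(&time_ms, start, end);
```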

int64_t base_size = scalar_base_ ? 0 : param_shape_.num_elements() * sizeof(float);
int64_t scale_size = scalar_scale_ ? 0 : param_shape_.num_elements() * sizeof(float);
int64_t data_size = out_size + in_size + base_size + scale_size;
std::cerr << "Throughput: " << data_size / time << " GB/s\n";
Contributor:

Those perf tests aren't really checking anything, just printing the throughput. I am wondering:

  1. How long do those tests take?
  2. Do we want to have those run as part of our unit tests?

If they don't take a lot of time, I wouldn't mind leaving them here anyway.

Contributor Author:

I don't check anything because generating reference data took far too long. The tests take under 5 s on a machine equipped with a Tesla P40; on an RTX 2080 Super I measured 3.6 s for all NormalizeGPU tests.
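As a side note on the printed figure: `cudaEventElapsedTime` reports milliseconds, so converting a byte count into GB/s might look like this (hypothetical helper, not the test's actual code):

```cpp
#include <cassert>
#include <cstdint>

// cudaEventElapsedTime yields milliseconds, so:
//   bytes / (ms * 1e6) = bytes / (s * 1e9) = GB/s  (1 GB = 1e9 bytes)
double ThroughputGBps(int64_t bytes, float elapsed_ms) {
  return bytes / (elapsed_ms * 1e6);
}
```

For example, moving 2e9 bytes in 1000 ms comes out to 2.0 GB/s.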

* be regularized and inversed.
*
* The output elements are calculated as:
* mul = 1 / sqrt(sqr(stddev[param_offset]) + epsilon)
Contributor:

Suggested change
* mul = 1 / sqrt(sqr(stddev[param_offset]) + epsilon)
* mul = 1 / sqrt(square(stddev[param_offset]) + epsilon)
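The `mul` formula in the doc comment, with the suggested `square` spelling, corresponds to something like this (illustrative host-side helper; the kernel's device code may differ, e.g. by using `rsqrtf`):

```cpp
#include <cassert>
#include <cmath>

// Regularized reciprocal of the standard deviation:
// epsilon keeps the result finite when stddev is (near) zero.
float InvStdDev(float stddev, float epsilon) {
  return 1.0f / std::sqrt(stddev * stddev + epsilon);
}
```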

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton (Collaborator):
CI MESSAGE: [1345384]: BUILD STARTED

@dali-automaton (Collaborator):
CI MESSAGE: [1345384]: BUILD PASSED


template <typename Desc, typename KernelFunc>
std::pair<dim3, dim3>
GetLaunchParams(const TensorListShape<> &data_shape, KernelFunc func) const {
Contributor:

I wonder how much we can generalize and reuse across all kernel launches?
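One possible direction for such generalization (an assumption on my part, not something this PR does) is the CUDA occupancy API, which picks a block size tuned to a given kernel's register and shared-memory usage:

```cuda
#include <cuda_runtime.h>

// Let CUDA suggest a block size for `func` instead of hard-coding one;
// min_grid_size is the minimum grid size for full occupancy.
int min_grid_size = 0, block_size = 0;
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, func);
```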

@banasraf (Collaborator) left a comment:

No critical comments, only formatting and a typo.

Comment on lines +178 to +179
__global__ void NormalizeKernel(const NormalizeParams *sample_params,
float scale, float shift) {
Collaborator:

Suggested change
__global__ void NormalizeKernel(const NormalizeParams *sample_params,
float scale, float shift) {
__global__ void NormalizeKernel(const NormalizeParams *sample_params,
                                float scale, float shift) {

Comment on lines +455 to +458
void RunInvStdDev(KernelContext &ctx,
const OutListGPU<Out> &out, const InListGPU<In> &in,
const BaseParam &base, const ScaleParam &scale,
float epsilon, float global_scale, float shift) {
Collaborator:

Suggested change
void RunInvStdDev(KernelContext &ctx,
const OutListGPU<Out> &out, const InListGPU<In> &in,
const BaseParam &base, const ScaleParam &scale,
float epsilon, float global_scale, float shift) {
void RunInvStdDev(KernelContext &ctx,
                  const OutListGPU<Out> &out, const InListGPU<In> &in,
                  const BaseParam &base, const ScaleParam &scale,
                  float epsilon, float global_scale, float shift) {

@@ -43,14 +45,58 @@ namespace reduce_impl {
*
* The reindexing is done by either dividing and multiplying by old/new strides or by takind modulo.
Collaborator:

Suggested change
* The reindexing is done by either dividing and multiplying by old/new strides or by takind modulo.
* The reindexing is done by either dividing and multiplying by old/new strides or by taking modulo.
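An illustrative version of the stride/modulo reindexing described in that comment (a hypothetical helper; DALI's actual code replaces the divisions with `fast_div`):

```cpp
#include <cassert>
#include <cstdint>

// Move one axis (old stride `old_stride`, extent `extent`) to a position
// with stride `new_stride`: peel the flat index into outer part, the
// axis coordinate (modulo path), and inner part, then rebuild.
int64_t Reindex(int64_t idx, int64_t old_stride, int64_t extent,
                int64_t new_stride) {
  int64_t outer = idx / (old_stride * extent);
  int64_t coord = (idx / old_stride) % extent;  // taking modulo
  int64_t inner = idx % old_stride;
  return outer * (new_stride * extent) + coord * new_stride + inner;
}
```

With `new_stride == old_stride` this is the identity, which makes a convenient sanity check.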

Contributor Author:

I'll fix it in the follow-up.

@mzient mzient merged commit d6df3de into NVIDIA:master May 27, 2020

5 participants