Normalize GPU kernel #1974
Conversation
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Tests - begun. Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
!build
CI MESSAGE: [1338269]: BUILD STARTED
CI MESSAGE: [1338269]: BUILD FAILED
std::pair<dim3, dim3> GetLaunchParams(const TensorListShape<> &data_shape) const {
  int64_t block = 1024;
Maybe we should query the GPU for its capabilities - or rather, the user should, and provide the necessary values to the kernel?
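For reference, one way the caller could supply hardware-derived limits is to query the device properties once and hand them to the kernel setup. A minimal sketch, assuming the caller owns the query; the struct and function names are hypothetical, not existing DALI API:

```cpp
#include <cuda_runtime.h>

// Hypothetical struct the caller fills once and passes to kernel setup,
// instead of the kernel hard-coding block = 1024.
struct LaunchLimits {
  int max_block_size;  // upper bound for threads per block on this device
  int sm_count;        // number of multiprocessors
};

LaunchLimits QueryLaunchLimits(int device) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  return { prop.maxThreadsPerBlock, prop.multiProcessorCount };
}
```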
Add performance tests. Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
!build
CI MESSAGE: [1340822]: BUILD STARTED
CI MESSAGE: [1340822]: BUILD PASSED
Launch(ctx);
cudaEventRecord(end, ctx.gpu.stream);
float time;
cudaDeviceSynchronize();
nitpick: isn't it enough to synchronize with ctx.gpu.stream?
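For the record, waiting on the recorded event (or just the launch stream) would avoid the device-wide stall of cudaDeviceSynchronize. A hedged sketch of the stream-scoped timing pattern; `start`, `end`, and `ctx.gpu.stream` are assumed from the quoted test code:

```cpp
// Record the end event on the same stream the kernel ran on.
cudaEventRecord(end, ctx.gpu.stream);

// Either of these suffices and avoids stalling unrelated streams:
cudaEventSynchronize(end);                 // wait for the event only
// cudaStreamSynchronize(ctx.gpu.stream);  // or wait for the whole stream

float time = 0;
cudaEventElapsedTime(&time, start, end);   // elapsed time in milliseconds
```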
int64_t base_size = scalar_base_ ? 0 : param_shape_.num_elements() * sizeof(float);
int64_t scale_size = scalar_scale_ ? 0 : param_shape_.num_elements() * sizeof(float);
int64_t data_size = out_size + in_size + base_size + scale_size;
std::cerr << "Throughput: " << data_size / time << " GB/s\n";
Those perf tests are not really checking anything, just printing the throughput. I am wondering:
- How long do those tests take?
- Do we want them to run as part of our unit tests?

If they don't take a lot of time, I wouldn't mind leaving them here anyway.
I don't check anything because generating reference data took far too long. The tests take <5s on a machine equipped with a Tesla P40, on RTX 2080 Super I measured 3.6s for all NormalizeGPU tests.
* be regularized and inversed.
*
* The output elements are calculated as:
* mul = 1 / sqrt(sqr(stddev[param_offset]) + epsilon)
Suggested change:
- * mul = 1 / sqrt(sqr(stddev[param_offset]) + epsilon)
+ * mul = 1 / sqrt(square(stddev[param_offset]) + epsilon)
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
CI MESSAGE: [1345384]: BUILD STARTED
CI MESSAGE: [1345384]: BUILD PASSED
template <typename Desc, typename KernelFunc>
std::pair<dim3, dim3>
GetLaunchParams(const TensorListShape<> &data_shape, KernelFunc func) const {
I wonder how much we can generalize and reuse across all kernel launches?
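One direction for such reuse, sketched under the assumption that the occupancy API fits these flat element-wise kernels; the helper name is hypothetical, not existing DALI code:

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <utility>

// Hypothetical generic launcher: derives grid/block from the kernel itself
// via the occupancy API instead of hard-coding a block size per kernel.
template <typename Kernel, typename... Args>
void LaunchFlat(Kernel kernel, int64_t num_elements, cudaStream_t stream,
                Args&&... args) {
  int min_grid = 0, block = 0;
  cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, kernel);
  int grid = static_cast<int>((num_elements + block - 1) / block);
  kernel<<<grid, block, 0, stream>>>(std::forward<Args>(args)...);
}
```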
No critical comments; only formatting and a typo.
__global__ void NormalizeKernel(const NormalizeParams *sample_params,
                                float scale, float shift) {
Suggested change (alignment only):
__global__ void NormalizeKernel(const NormalizeParams *sample_params,
                                float scale, float shift) {
void RunInvStdDev(KernelContext &ctx,
                  const OutListGPU<Out> &out, const InListGPU<In> &in,
                  const BaseParam &base, const ScaleParam &scale,
                  float epsilon, float global_scale, float shift) {
Suggested change (alignment only):
void RunInvStdDev(KernelContext &ctx,
                  const OutListGPU<Out> &out, const InListGPU<In> &in,
                  const BaseParam &base, const ScaleParam &scale,
                  float epsilon, float global_scale, float shift) {
@@ -43,14 +45,58 @@ namespace reduce_impl {
 *
 * The reindexing is done by either dividing and multiplying by old/new strides or by takind modulo.
Suggested change:
- * The reindexing is done by either dividing and multiplying by old/new strides or by takind modulo.
+ * The reindexing is done by either dividing and multiplying by old/new strides or by taking modulo.
I'll fix it in a follow-up.
JIRA TASK: DALI-1267