Use new ThreadPool API to post work with priority #2102
Conversation
@@ -74,17 +74,18 @@ void NormalDistributionCpu::AssignTensorToOutput(workspace_t<CPUBackend> &ws) {
  auto &tp = ws.GetThreadPool();
  TYPE_SWITCH(dtype_, type2id, DType, NORM_TYPES, (
    for (int sample_id = 0; sample_id < batch_size_; ++sample_id) {
      tp.DoWorkWithID(
        [&, sample_id](int thread_id) {
          auto out_size = volume(output[sample_id].shape());
Why can't you use tensor_size here?
      kernels::VarianceCPU<float, InputType> stddev;
      stddev.Setup(mutable_stddev[i], in_view[i], make_span(axes_), sample_mean);
      // Reset per-sample values, but don't postprocess
      stddev.Run(true, false);
    });
  }, volume(in_view[i].shape));
Maybe you can add a method that does the same as tensor_size to the shape here as well?
@@ -37,9 +36,9 @@ void ArithmeticGenericOp<CPUBackend>::RunImpl(HostWorkspace &ws) {
          {extent_idx, extent_idx + 1});
      }
    }
  });
  }, -task_idx);  // Descending numbers for FIFO execution
Why do we want FIFO here?
The work is already divided into similar sized chunks. Also @klecki is planning to change the work balancing so I'll leave it up to him to set the priorities here
Sure. Can you extend the comment there with this info?
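The FIFO trick discussed above can be illustrated with a minimal sketch. This is not the actual DALI ThreadPool, just a hypothetical max-priority work queue: posting tasks with descending priorities (0, -1, -2, ...) makes the highest-priority pending task the earliest-posted one, so workers pop tasks in submission order.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a priority-aware work queue (names are
// illustrative, not DALI's API): largest priority is served first.
struct WorkQueue {
  void AddWork(int id, int64_t priority) { q_.emplace(priority, id); }
  int PopNext() {  // pops the task id with the highest priority
    int id = q_.top().second;
    q_.pop();
    return id;
  }
  std::priority_queue<std::pair<int64_t, int>> q_;
};

// Post tasks 10, 11, 12 with priorities 0, -1, -2 and collect pop order.
std::vector<int> RunOrder() {
  WorkQueue wq;
  int64_t task_idx = 0;
  for (int id = 10; id <= 12; ++id)
    wq.AddWork(id, -task_idx++);  // descending priorities emulate FIFO
  std::vector<int> order;
  while (!wq.q_.empty())
    order.push_back(wq.PopNext());
  return order;
}
```

With ascending or equal priorities a max-queue gives no such guarantee, which is why the descending counter matters here.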
Signed-off-by: Joaquin Anton <janton@nvidia.com>
Force-pushed from a51b2c5 to b040748
      output_data,
      streams_[0],
      file_name);
  }, -i);  // -i for FIFO order
Maybe add a comment that the samples are already sorted and you want to preserve that order.
    [this, sample, &in, output_data, shape](int tid) {
      SampleWorker(sample->sample_idx, sample->file_name, in.size(), tid,
                   in.data<uint8_t>(), output_data, streams_[tid]);
      CacheStore(sample->file_name, output_data, shape, streams_[tid]);
    });
  }, GetTaskPrioritySeq());
I would add a comment about FIFO here too.
Signed-off-by: Joaquin Anton <janton@nvidia.com>
@@ -207,9 +207,9 @@ class NonsilenceOperatorCpu : public NonsilenceOperator<CPUBackend> {
  auto &output_begin = ws.OutputRef<CPUBackend>(0);
  auto &output_length = ws.OutputRef<CPUBackend>(1);
  auto &tp = ws.GetThreadPool();

  auto in_shape = input.shape();
Do you need a copy here?
Ah, it's returned by value. Still, I wonder if we want an explicit copy here, or const auto &?
    int d = 0;
    int64_t i = 0;
    for (; i < in_size; i++, d++) {
      if (d == ndim_) d = 0;
      auto in_val = in[i];
      out[i] = flip_dim[d] ? mirrored_origin[d] - in_val : in_val;
    }
  });
  }, in_size);
input.shape().tensor_size(sample_id)?
that'd construct the tensor list shape for every sample
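The point above can be sketched in a few lines. This is a hedged illustration, not DALI code: SampleShape and the free volume function are stand-ins for DALI's shape types. The shape is looked up once outside the loop, and each sample's volume (used as the work-size hint) is computed from that cached copy instead of rebuilding the whole shape list per sample.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Illustrative stand-in for a per-sample shape (assumed name):
using SampleShape = std::vector<int64_t>;

// Product of extents, analogous to DALI's volume() helper.
int64_t volume(const SampleShape &shape) {
  return std::accumulate(shape.begin(), shape.end(), int64_t{1},
                         std::multiplies<int64_t>());
}

// Hoist the shape lookup once, then derive one work-size hint per sample.
std::vector<int64_t> WorkSizes(const std::vector<SampleShape> &shapes) {
  std::vector<int64_t> sizes;
  for (const auto &s : shapes)  // no per-sample shape-list construction
    sizes.push_back(volume(s));
  return sizes;
}
```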
@@ -178,17 +178,17 @@ void EraseImplCpu<T, Dims>::RunImpl(HostWorkspace &ws) {
  auto &output = ws.OutputRef<CPUBackend>(0);
  int nsamples = input.size();
  auto &thread_pool = ws.GetThreadPool();

  auto out_shape = output.shape();
And why is it sometimes the input, but here the output?
could be input as well here
    [this, &input, &output, i](int thread_id) {
      kernels::KernelContext ctx;
      auto in_view = view<const T, Dims>(input[i]);
      auto out_view = view<T, Dims>(output[i]);
      kmgr_.Run<EraseKernel>(thread_id, i, ctx, out_view, in_view, args_[i]);
    });
  }, out_shape.tensor_size(i));
This one is a bit more tricky.
You first do a generic memcopy, and later you apply some unspecified amount of erase regions that could impact the performance. Maybe we can approximate additional time for that? But we would need to calculate the overlap between the actual image and all erase region and sum that. 🤔
as we discussed, this is doable but overly complicated, and just using the output size seems to be a good estimate of the work length
    [this, &input, &output, i](int thread_id) {
      kernels::KernelContext ctx;
      auto in_view = view<const T, Dims>(input[i]);
      auto out_view = view<T, Dims>(output[i]);
      auto &kernel_sample_args = any_cast<std::vector<Args>&>(kernel_sample_args_);
      kmgr_.Run<Kernel>(thread_id, i, ctx, out_view, in_view, kernel_sample_args[i]);
    });
  }, out_shape.tensor_size(i));
Agree with the output-dominated time of processing for pad and crop/slice.
@@ -248,7 +249,7 @@ void SpectrogramImplCpu::RunImpl(workspace_t<CPUBackend> &ws) {
        view<OutputType, WindowsDims>(output[i]),
        view<const InputType, WindowsDims>(win_out),
        fft_args_);
  });
  }, out_shape.tensor_size(i));
Not familiar with the op here, but can the stuff inside have different intermediate output sizes?
It'll be proportional to the size of the spectrogram
  /**
   * @brief Gets the next task priority to ensure FIFO execution in the thread pool (descending integers)
   */
  int64_t GetTaskPrioritySeq() {
A bit weird with all the wrappers for task_seq--, but maybe it's better that way.
removed
Signed-off-by: Joaquin Anton <janton@nvidia.com>
!build
CI MESSAGE: [1466599]: BUILD STARTED
CI MESSAGE: [1466599]: BUILD FAILED
!build
CI MESSAGE: [1466910]: BUILD STARTED
!build
CI MESSAGE: [1467189]: BUILD STARTED
CI MESSAGE: [1467189]: BUILD PASSED
Signed-off-by: Joaquin Anton <janton@nvidia.com>
Why we need this PR?
What happened in this PR?
Modified every use of thread_pool from the DoWorkWithID/WaitForWork API to the AddWork/RunAll pattern
Mostly all CPU operators using thread pool
All of it
Existing tests
N/A
JIRA TASK: [DALI-1473]
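The AddWork/RunAll pattern this PR migrates to can be sketched as follows. This is a hedged, single-threaded toy (ToyThreadPool, ProcessBatch, and the signatures are illustrative assumptions, not DALI's real multi-threaded ThreadPool): work items are queued with AddWork(work, priority) and only launched, highest priority first, when RunAll() is called.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Toy single-threaded stand-in for the AddWork/RunAll pattern
// (the real DALI ThreadPool dispatches work to multiple threads).
class ToyThreadPool {
 public:
  // Queue work; nothing runs until RunAll().
  void AddWork(std::function<void(int)> work, int64_t priority = 0) {
    tasks_.push_back({priority, std::move(work)});
  }
  // Run highest-priority tasks first, then clear the queue.
  void RunAll() {
    std::stable_sort(tasks_.begin(), tasks_.end(),
                     [](const Task &a, const Task &b) {
                       return a.priority > b.priority;
                     });
    for (auto &t : tasks_)
      t.work(/*thread_id=*/0);
    tasks_.clear();
  }

 private:
  struct Task {
    int64_t priority;
    std::function<void(int)> work;
  };
  std::vector<Task> tasks_;
};

// Mirror of the per-sample loop in the diffs above: -sample_id as the
// priority preserves submission (FIFO) order; records execution order.
std::vector<int> ProcessBatch(int batch_size) {
  ToyThreadPool tp;
  std::vector<int> order;
  for (int sample_id = 0; sample_id < batch_size; ++sample_id) {
    tp.AddWork([&, sample_id](int) { order.push_back(sample_id); },
               -sample_id);
  }
  tp.RunAll();
  return order;
}
```

Operators that instead pass the sample's volume as the priority get largest-first scheduling, which helps balance load across worker threads.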