
Use new ThreadPool API to post work with priority #2102

Merged
jantonguirao merged 5 commits into NVIDIA:master on Jul 14, 2020

Conversation

jantonguirao (Contributor)

Signed-off-by: Joaquin Anton <janton@nvidia.com>

Why do we need this PR?

  • Refactoring DALI operators to use the new way of posting work to the thread pool according to the size of the task (typically the volume of the sample is a good indicator of the task size)

What happened in this PR?


  • What solution was applied:
    Modified every use of the thread pool from the DoWorkWithID/WaitForWork API to the AddWork/RunAll pattern (see the sketch after this description)
  • Affected modules and functionalities:
    Most CPU operators that use the thread pool
  • Key points relevant for the review:
    All of it
  • Validation and testing:
    Existing tests
  • Documentation (including examples):
    N/A

JIRA TASK: [DALI-1473]
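
A minimal sketch of the before/after pattern, assuming the ThreadPool interface used in this PR (AddWork queues a job with an optional priority, RunAll dispatches the queued jobs and waits for them); ProcessSample is an illustrative placeholder, not DALI code:

// Before: work posted with DoWorkWithID, with no per-task size information;
// WaitForWork drains the pool.
auto &tp = ws.GetThreadPool();
for (int sample_id = 0; sample_id < batch_size_; ++sample_id) {
  tp.DoWorkWithID([&, sample_id](int thread_id) {
    ProcessSample(sample_id, thread_id);  // per-sample work (placeholder)
  });
}
tp.WaitForWork();

// After: jobs are queued with a priority proportional to the estimated task size
// (here the sample volume), then dispatched and waited for in a single call.
for (int sample_id = 0; sample_id < batch_size_; ++sample_id) {
  tp.AddWork([&, sample_id](int thread_id) {
    ProcessSample(sample_id, thread_id);
  }, volume(input[sample_id].shape()));  // larger samples are scheduled first
}
tp.RunAll();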

@@ -74,17 +74,18 @@ void NormalDistributionCpu::AssignTensorToOutput(workspace_t<CPUBackend> &ws) {
auto &tp = ws.GetThreadPool();
TYPE_SWITCH(dtype_, type2id, DType, NORM_TYPES, (
for (int sample_id = 0; sample_id < batch_size_; ++sample_id) {
tp.DoWorkWithID(
[&, sample_id](int thread_id) {
auto out_size = volume(output[sample_id].shape());
Contributor:

Why can't you use tensor_size here?

kernels::VarianceCPU<float, InputType> stddev;
stddev.Setup(mutable_stddev[i], in_view[i], make_span(axes_), sample_mean);
// Reset per-sample values, but don't postprocess
stddev.Run(true, false);
});
}, volume(in_view[i].shape));
Contributor:

Maybe you could add a method that does the same as tensor_size to this shape as well?

@@ -37,9 +36,9 @@ void ArithmeticGenericOp<CPUBackend>::RunImpl(HostWorkspace &ws) {
{extent_idx, extent_idx + 1});
}
}
});
}, -task_idx); // Descending numbers for FIFO execution
Contributor:

Why do we want FIFO here?

Contributor Author:

The work is already divided into similarly sized chunks. Also, @klecki is planning to change the work balancing, so I'll leave it up to him to set the priorities here.

Contributor:

Sure. Can you extend the comment there with this info?
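
For context, a minimal sketch of why descending numbers produce FIFO behavior, assuming the pool always runs the pending task with the highest priority value first (the loop and names here are illustrative):

// Posting tasks with priorities 0, -1, -2, ... makes earlier submissions win
// the priority comparison, so they run in submission (FIFO) order.
for (int task_idx = 0; task_idx < num_tasks; ++task_idx) {
  tp.AddWork(make_task(task_idx), -task_idx);  // 0 > -1 > -2 > ...
}
tp.RunAll();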

Signed-off-by: Joaquin Anton <janton@nvidia.com>
output_data,
streams_[0],
file_name);
}, -i); // -i for FIFO order
Contributor:

Maybe add a comment that the samples are already sorted and you want to preserve that order.

[this, sample, &in, output_data, shape](int tid) {
SampleWorker(sample->sample_idx, sample->file_name, in.size(), tid,
in.data<uint8_t>(), output_data, streams_[tid]);
CacheStore(sample->file_name, output_data, shape, streams_[tid]);
});
}, GetTaskPrioritySeq());
Contributor:

I would add a comment about FIFO here too.

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@@ -207,9 +207,9 @@ class NonsilenceOperatorCpu : public NonsilenceOperator<CPUBackend> {
auto &output_begin = ws.OutputRef<CPUBackend>(0);
auto &output_length = ws.OutputRef<CPUBackend>(1);
auto &tp = ws.GetThreadPool();

auto in_shape = input.shape();
Contributor:

Do you need a copy here?

Contributor:

Ah, it's returned by value.
Still, I wonder if we want an explicit copy or const auto&?
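
For reference, a small illustration of the two options being discussed (the variable names are just for this example); both are valid here, since binding the returned temporary to a const reference extends its lifetime:

auto in_shape = input.shape();             // explicit copy of the returned shape
const auto &in_shape_ref = input.shape();  // binds to the temporary; its lifetime
                                           // is extended to that of the reference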

int d = 0;
int64_t i = 0;
for (; i < in_size; i++, d++) {
if (d == ndim_) d = 0;
auto in_val = in[i];
out[i] = flip_dim[d] ? mirrored_origin[d] - in_val : in_val;
}
});
}, in_size);
Contributor:

input.shape().tensor_size(sample_id)?

Contributor Author:

That'd construct the tensor list shape for every sample.
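
In other words (a hedged sketch with names borrowed from the surrounding diff; work_for is a placeholder): calling input.shape() inside the loop would rebuild the whole TensorListShape once per sample, while hoisting it out builds it once:

// Per-sample call rebuilds the TensorListShape on each iteration:
//   thread_pool.AddWork(work_for(sample_id), input.shape().tensor_size(sample_id));
// Hoisting it keeps a single construction and a cheap per-sample lookup:
auto in_shape = input.shape();
for (int sample_id = 0; sample_id < nsamples; ++sample_id) {
  thread_pool.AddWork(work_for(sample_id), in_shape.tensor_size(sample_id));
}
thread_pool.RunAll();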

@@ -178,17 +178,17 @@ void EraseImplCpu<T, Dims>::RunImpl(HostWorkspace &ws) {
auto &output = ws.OutputRef<CPUBackend>(0);
int nsamples = input.size();
auto& thread_pool = ws.GetThreadPool();

auto out_shape = output.shape();
Contributor:

And why is it sometimes the input, and here the output?

Contributor Author:

It could be the input here as well.

[this, &input, &output, i](int thread_id) {
kernels::KernelContext ctx;
auto in_view = view<const T, Dims>(input[i]);
auto out_view = view<T, Dims>(output[i]);
kmgr_.Run<EraseKernel>(thread_id, i, ctx, out_view, in_view, args_[i]);
});
}, out_shape.tensor_size(i));
Contributor:

This one is a bit trickier.
You first do a generic memcopy, and later you apply an unspecified number of erase regions that could impact performance. Maybe we could approximate the additional time for that? But we would need to calculate the overlap between the actual image and all erase regions and sum it. 🤔

Contributor Author:

As we discussed, this is doable but overly complicated, and just using the output size seems to be a good estimate of the work length.

[this, &input, &output, i](int thread_id) {
kernels::KernelContext ctx;
auto in_view = view<const T, Dims>(input[i]);
auto out_view = view<T, Dims>(output[i]);
auto &kernel_sample_args = any_cast<std::vector<Args>&>(kernel_sample_args_);
kmgr_.Run<Kernel>(thread_id, i, ctx, out_view, in_view, kernel_sample_args[i]);
});
}, out_shape.tensor_size(i));
Contributor:

Agree with the output-dominated time of processing for pad and crop/slice.

@@ -248,7 +249,7 @@ void SpectrogramImplCpu::RunImpl(workspace_t<CPUBackend> &ws) {
view<OutputType, WindowsDims>(output[i]),
view<const InputType, WindowsDims>(win_out),
fft_args_);
});
}, out_shape.tensor_size(i));
Contributor:

Not familiar with the op here, but can the stuff inside have different intermediate output sizes?

Contributor Author:

It'll be proportional to the size of the spectrogram

/**
 * @brief Gets the next task priority to ensure FIFO execution in the thread pool (descending integers)
*/
int64_t GetTaskPrioritySeq() {
Contributor:

A bit weird with all the wrappers for task_seq--, but maybe it's better that way.

Contributor Author:

Removed.

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao (Contributor Author)

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1466599]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1466599]: BUILD FAILED

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao (Contributor Author)

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1466910]: BUILD STARTED

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao (Contributor Author)

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1467189]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1467189]: BUILD PASSED

@jantonguirao merged commit b518087 into NVIDIA:master on Jul 14, 2020