
Use new ThreadPool API to post work with priority #2102

Merged
jantonguirao merged 5 commits into NVIDIA:master on Jul 14, 2020

Conversation

jantonguirao (Contributor)

Signed-off-by: Joaquin Anton <janton@nvidia.com>

Why do we need this PR?

  • Refactoring DALI operators to use the new way of posting work to the thread pool according to the size of the task (typically the volume of the sample is a good indicator of the task size)

What happened in this PR?


  • What solution was applied:
    Modified every use of the thread pool from the DoWorkWithID/WaitForWork API to the AddWork/RunAll pattern (see the sketch after this description)
  • Affected modules and functionalities:
    Most CPU operators that use the thread pool
  • Key points relevant for the review:
    All of it
  • Validation and testing:
    Existing tests
  • Documentation (including examples):
    N/A

JIRA TASK: [DALI-1473]
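
A minimal sketch of the before/after pattern, assuming the ThreadPool interface used in this PR (AddWork queues a job with an optional priority, RunAll dispatches the queued jobs and waits for them); ProcessSample is an illustrative placeholder, not DALI code:

// Before: work posted with DoWorkWithID, with no per-task size information;
// WaitForWork drains the pool.
auto &tp = ws.GetThreadPool();
for (int sample_id = 0; sample_id < batch_size_; ++sample_id) {
  tp.DoWorkWithID([&, sample_id](int thread_id) {
    ProcessSample(sample_id, thread_id);  // per-sample work (placeholder)
  });
}
tp.WaitForWork();

// After: jobs are queued with a priority proportional to the estimated task size
// (here the sample volume), then dispatched and waited for in a single call.
for (int sample_id = 0; sample_id < batch_size_; ++sample_id) {
  tp.AddWork([&, sample_id](int thread_id) {
    ProcessSample(sample_id, thread_id);
  }, volume(input[sample_id].shape()));  // larger samples are scheduled first
}
tp.RunAll();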

@@ -74,17 +74,18 @@ void NormalDistributionCpu::AssignTensorToOutput(workspace_t<CPUBackend> &ws) {
auto &tp = ws.GetThreadPool();
TYPE_SWITCH(dtype_, type2id, DType, NORM_TYPES, (
for (int sample_id = 0; sample_id < batch_size_; ++sample_id) {
tp.DoWorkWithID(
[&, sample_id](int thread_id) {
auto out_size = volume(output[sample_id].shape());
Contributor:

Why can't you use tensor_size here?

kernels::VarianceCPU<float, InputType> stddev;
stddev.Setup(mutable_stddev[i], in_view[i], make_span(axes_), sample_mean);
// Reset per-sample values, but don't postprocess
stddev.Run(true, false);
});
}, volume(in_view[i].shape));
Contributor:

Maybe you could add a method that does the same as tensor_size to this shape as well?

@@ -37,9 +36,9 @@ void ArithmeticGenericOp<CPUBackend>::RunImpl(HostWorkspace &ws) {
{extent_idx, extent_idx + 1});
}
}
});
}, -task_idx); // Descending numbers for FIFO execution
Contributor:

Why do we want FIFO here?

Contributor Author:

The work is already divided into similarly sized chunks. Also, @klecki is planning to change the work balancing, so I'll leave it up to him to set the priorities here.

Contributor:

Sure. Can you extend the comment there with this info?
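
For context, a minimal sketch of why descending numbers produce FIFO behavior, assuming the pool always runs the pending task with the highest priority value first (the loop and names here are illustrative):

// Posting tasks with priorities 0, -1, -2, ... makes earlier submissions win
// the priority comparison, so they run in submission (FIFO) order.
for (int task_idx = 0; task_idx < num_tasks; ++task_idx) {
  tp.AddWork(make_task(task_idx), -task_idx);  // 0 > -1 > -2 > ...
}
tp.RunAll();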

Signed-off-by: Joaquin Anton <janton@nvidia.com>
output_data,
streams_[0],
file_name);
}, -i); // -i for FIFO order
Contributor:

Maybe add a comment that the samples are already sorted and you want to preserve that order.

[this, sample, &in, output_data, shape](int tid) {
SampleWorker(sample->sample_idx, sample->file_name, in.size(), tid,
in.data<uint8_t>(), output_data, streams_[tid]);
CacheStore(sample->file_name, output_data, shape, streams_[tid]);
});
}, GetTaskPrioritySeq());
Contributor:

I would add a comment about FIFO here too.

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@@ -207,9 +207,9 @@ class NonsilenceOperatorCpu : public NonsilenceOperator<CPUBackend> {
auto &output_begin = ws.OutputRef<CPUBackend>(0);
auto &output_length = ws.OutputRef<CPUBackend>(1);
auto &tp = ws.GetThreadPool();

auto in_shape = input.shape();
Contributor:

Do you need a copy here?

Contributor:

Ah, it's returned by value.
Still, I wonder if we want an explicit copy or const auto&?
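
For reference, a small illustration of the two options being discussed (the variable names are just for this example); both are valid here, since binding the returned temporary to a const reference extends its lifetime:

auto in_shape = input.shape();             // explicit copy of the returned shape
const auto &in_shape_ref = input.shape();  // binds to the temporary; its lifetime
                                           // is extended to that of the reference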

int d = 0;
int64_t i = 0;
for (; i < in_size; i++, d++) {
if (d == ndim_) d = 0;
auto in_val = in[i];
out[i] = flip_dim[d] ? mirrored_origin[d] - in_val : in_val;
}
});
}, in_size);
Contributor:

input.shape().tensor_size(sample_id)?

Contributor Author:

That'd construct the tensor list shape for every sample.
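
In other words (a hedged sketch with names borrowed from the surrounding diff; work_for is a placeholder): calling input.shape() inside the loop would rebuild the whole TensorListShape once per sample, while hoisting it out builds it once:

// Per-sample call rebuilds the TensorListShape on each iteration:
//   thread_pool.AddWork(work_for(sample_id), input.shape().tensor_size(sample_id));
// Hoisting it keeps a single construction and a cheap per-sample lookup:
auto in_shape = input.shape();
for (int sample_id = 0; sample_id < nsamples; ++sample_id) {
  thread_pool.AddWork(work_for(sample_id), in_shape.tensor_size(sample_id));
}
thread_pool.RunAll();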

@@ -178,17 +178,17 @@ void EraseImplCpu<T, Dims>::RunImpl(HostWorkspace &ws) {
auto &output = ws.OutputRef<CPUBackend>(0);
int nsamples = input.size();
auto& thread_pool = ws.GetThreadPool();

auto out_shape = output.shape();
Contributor:

And why is it sometimes the input, and here the output?

Contributor Author:

It could be the input here as well.

[this, &input, &output, i](int thread_id) {
kernels::KernelContext ctx;
auto in_view = view<const T, Dims>(input[i]);
auto out_view = view<T, Dims>(output[i]);
kmgr_.Run<EraseKernel>(thread_id, i, ctx, out_view, in_view, args_[i]);
});
}, out_shape.tensor_size(i));
Contributor:

This one is a bit trickier.
You first do a generic memcopy, and later you apply an unspecified number of erase regions that could impact performance. Maybe we could approximate the additional time for that? But we would need to calculate the overlap between the actual image and all erase regions and sum it. 🤔

Contributor Author:

As we discussed, this is doable but overly complicated, and just using the output size seems to be a good estimate of the work length.

[this, &input, &output, i](int thread_id) {
kernels::KernelContext ctx;
auto in_view = view<const T, Dims>(input[i]);
auto out_view = view<T, Dims>(output[i]);
auto &kernel_sample_args = any_cast<std::vector<Args>&>(kernel_sample_args_);
kmgr_.Run<Kernel>(thread_id, i, ctx, out_view, in_view, kernel_sample_args[i]);
});
}, out_shape.tensor_size(i));
Contributor:

Agree with the output-dominated time of processing for pad and crop/slice.

@@ -248,7 +249,7 @@ void SpectrogramImplCpu::RunImpl(workspace_t<CPUBackend> &ws) {
view<OutputType, WindowsDims>(output[i]),
view<const InputType, WindowsDims>(win_out),
fft_args_);
});
}, out_shape.tensor_size(i));
Contributor:

Not familiar with the op here, but can the stuff inside have different intermediate output sizes?

Contributor Author:

It'll be proportional to the size of the spectrogram

/**
 * @brief Gets the next task priority to ensure FIFO execution in the thread pool (descending integers)
*/
int64_t GetTaskPrioritySeq() {
Contributor:

A bit weird with all the wrappers for task_seq--, but maybe it's better that way.

Contributor Author:

Removed.

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao (Contributor Author)

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1466599]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1466599]: BUILD FAILED

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao (Contributor Author)

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1466910]: BUILD STARTED

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao (Contributor Author)

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1467189]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1467189]: BUILD PASSED

@jantonguirao merged commit b518087 into NVIDIA:master on Jul 14, 2020