Add NonsilentRegion GPU, implemented in terms of the CPU version #3874

Merged

jantonguirao merged 4 commits into NVIDIA:main from jantonguirao:false_gpu_impl_non_silent_region

May 10, 2022

Contributor

jantonguirao commented May 6, 2022 (edited)

Signed-off-by: Joaquin Anton <janton@nvidia.com>

Category:

New feature

Description:

Adds a NonsilentRegion GPU operator, implemented by wrapping the CPU implementation.
Effectively, it runs the CPU operator under the hood, copying the inputs from device to host and the outputs from host to device.
On its own this doesn't add any performance benefit, but it enables moving preceding operators in the pipeline (e.g. resampling) to the GPU.
Also adds a FalseGPUOperator template that can be used to apply the same pattern to other operators.
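
For illustration, the wrapping pattern described above can be sketched roughly as follows. This is a minimal sketch, not the code from this PR: `HostBuffer`, `DeviceBuffer`, `CopyToHost`, `CopyToDevice`, `CpuNonsilentRegion` and `WrappedGpuOp` are hypothetical stand-ins, the "device" buffer is ordinary host memory, and the silence detection is reduced to a fixed amplitude threshold.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins for host and device buffers. In a real pipeline the
// device buffer would live in GPU memory and the copies would be asynchronous.
struct HostBuffer { std::vector<float> data; };
struct DeviceBuffer { std::vector<float> data; };

inline HostBuffer CopyToHost(const DeviceBuffer &in) { return HostBuffer{in.data}; }
inline DeviceBuffer CopyToDevice(const HostBuffer &in) { return DeviceBuffer{in.data}; }

// Hypothetical, heavily simplified CPU operator: returns {begin, length} of the
// span between the first and last sample whose magnitude exceeds a threshold.
struct CpuNonsilentRegion {
  float threshold = 0.1f;

  HostBuffer Run(const HostBuffer &in) const {
    std::ptrdiff_t first = -1, last = -1;
    for (std::size_t i = 0; i < in.data.size(); i++) {
      if (std::fabs(in.data[i]) > threshold) {
        if (first < 0) first = static_cast<std::ptrdiff_t>(i);
        last = static_cast<std::ptrdiff_t>(i);
      }
    }
    float begin = first < 0 ? 0.0f : static_cast<float>(first);
    float length = first < 0 ? 0.0f : static_cast<float>(last - first + 1);
    return HostBuffer{{begin, length}};
  }
};

// The wrapper: device-to-host copy, run the CPU operator, host-to-device copy.
template <typename CpuOp>
class WrappedGpuOp {
 public:
  explicit WrappedGpuOp(CpuOp op) : cpu_impl_(std::move(op)) {}

  DeviceBuffer Run(const DeviceBuffer &in) const {
    HostBuffer host_in = CopyToHost(in);           // inputs: device -> host
    HostBuffer host_out = cpu_impl_.Run(host_in);  // actual work happens on the CPU
    return CopyToDevice(host_out);                 // outputs: host -> device
  }

 private:
  CpuOp cpu_impl_;
};
```

Calling `WrappedGpuOp<CpuNonsilentRegion>::Run` on a device buffer yields a two-element device buffer with the begin index and the length, so callers see a GPU-in, GPU-out operator even though no processing runs on the GPU.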

Additional information:

Affected modules and functionalities:

NonsilentRegion

Key points relevant for the review:

All

Checklist

Tests

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: NSILENT.02
In fact it doesn't really implement the requirement, as it doesn't use the GPU for processing, but it is closely related to that requirement.

JIRA TASK: DALI-2777


          Add NonsilentRegion GPU, implemented in terms of the CPU version

0d88573

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao assigned mzient and JanuszL

JanuszL reviewed

dali/operators/audio/nonsilence_op.cc Outdated

+                At this moment, the 'gpu' backend of this operator is implemented in terms of the 'cpu'
+                implementation. This results in a device-to-host copy of the inputs and a host-to-device copy of the
+                outputs. While using the 'gpu' implementation of this operator doesn't add any performance
+                benefit on its own, using it might make sense in order to enable moving previous operations to the GPU.

Contributor

JanuszL May 6, 2022

Suggested change

-              benefit on its own, using it might make sense in order to enable moving previous operations to the GPU.
+              benefit on its own, using it might make sense in order to enable moving preceding operations in the pipeline to the GPU.

Contributor Author

jantonguirao May 9, 2022

Done

JanuszL reviewed

dali/pipeline/operator/false_gpu_operator.h Outdated

+              template <typename CPUOperator>
+              class FalseGPUOperator : public Operator<GPUBackend> {
+               public:
+                FalseGPUOperator(const OpSpec &spec) : Operator<GPUBackend>(spec), cpu_impl_(spec), thread_pool_(3, 0, false) {

Contributor

JanuszL May 6, 2022

Suggested change

-              FalseGPUOperator(const OpSpec &spec) : Operator<GPUBackend>(spec), cpu_impl_(spec), thread_pool_(3, 0, false) {
+              FalseGPUOperator(const OpSpec &spec) : Operator<GPUBackend>(spec), cpu_impl_(spec), thread_pool_(PIPELINE_NUMBER_OF_THREADS, REAL_DEVICE_ID, true) {

Contributor Author

jantonguirao May 9, 2022

Any particular reason for set_affine=true?

Contributor

JanuszL May 9, 2022

This places the CPU threads on the cores that are directly connected to the GPU. See https://docs.nvidia.com/deploy/nvml-api/group__nvmlAffinity.html.

JanuszL reviewed

dali/pipeline/operator/false_gpu_operator.h

+                bool SetupImpl(std::vector<OutputDesc> &output_desc,
+                               const workspace_t<GPUBackend> &ws) override {
+                  if (cpu_ws_.NumInput() == 0 && cpu_ws_.NumOutput() == 0) {

Contributor

JanuszL May 6, 2022

I think you need to make the CPU input and output buffers pinned; otherwise we may stall the GPU execution flow even more.

Contributor Author

jantonguirao May 9, 2022

Done

JanuszL reviewed

dali/pipeline/operator/false_gpu_operator.h Outdated

+                    } else {
+                      // Some GPU operators might accept CPU inputs (e.g. Slice)
+                      auto& cpu_input = ws.Input<CPUBackend>(input_idx);
+                      cpu_inputs_[input_idx]->Copy(cpu_input);

Contributor

JanuszL May 6, 2022

Can we share instead of copy?

Contributor Author

jantonguirao May 9, 2022

Done
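
The difference between copying and sharing an input that already lives on the host can be sketched like this. It is a minimal sketch under stated assumptions: `HostTensor`, its `Copy` and `ShareData` members, and `MakeTensor` are hypothetical stand-ins built on `std::shared_ptr`, not the tensor API touched by this PR.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical host tensor whose storage is reference-counted.
struct HostTensor {
  std::shared_ptr<std::vector<float>> storage;

  // Deep copy: allocates new storage and duplicates every element.
  void Copy(const HostTensor &other) {
    storage = std::make_shared<std::vector<float>>(*other.storage);
  }

  // Share: points at the same storage; no allocation and no element copy.
  void ShareData(const HostTensor &other) { storage = other.storage; }
};

// Hypothetical helper that builds a tensor owning the given values.
inline HostTensor MakeTensor(std::vector<float> values) {
  return HostTensor{std::make_shared<std::vector<float>>(std::move(values))};
}
```

With sharing, the wrapper's CPU-side input aliases the buffer it was handed, so the cost does not grow with the input size; the trade-off is that both handles observe any later write to that storage.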

JanuszL reviewed

dali/pipeline/operator/false_gpu_operator.h Outdated

Comment on lines 70 to 72

+                    if (ws.InputIsType<GPUBackend>(0)) {
+                      auto& gpu_input = ws.Input<GPUBackend>(input_idx);
+                      cpu_inputs_[input_idx]->Copy(gpu_input, AccessOrder(ws.stream()));

Contributor

JanuszL May 6, 2022

Maybe we should copy inside RunImpl?

Contributor Author

jantonguirao May 9, 2022

Done. I had to move both Setup and Run into RunImpl.


          Code review fixes

246d287

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from a646457 to 246d287 Compare

May 9, 2022 09:08

mzient reviewed

dali/pipeline/operator/false_gpu_operator.h Resolved

mzient reviewed

dali/pipeline/operator/false_gpu_operator.h Outdated

Comment on lines 98 to 103

+                  auto ret = cpu_impl_.Setup(output_desc_, cpu_ws_);
+                  for (int output_idx = 0; output_idx < cpu_ws_.NumOutput(); output_idx++) {
+                    auto &desc = output_desc_[output_idx];
+                    cpu_ws_.template Output<CPUBackend>(output_idx).Resize(desc.shape, desc.type);
+                  }

Contributor

mzient May 9, 2022

Suggested change

-                auto ret = cpu_impl_.Setup(output_desc_, cpu_ws_);
-                for (int output_idx = 0; output_idx < cpu_ws_.NumOutput(); output_idx++) {
-                  auto &desc = output_desc_[output_idx];
-                  cpu_ws_.template Output<CPUBackend>(output_idx).Resize(desc.shape, desc.type);
-                }
+                if (cpu_impl_.Setup(output_desc_, cpu_ws_)) {
+                  assert(output_desc_.size() == cpu_ws_.NumOutput());
+                  for (int output_idx = 0; output_idx < cpu_ws_.NumOutput(); output_idx++) {
+                    auto &desc = output_desc_[output_idx];
+                    cpu_ws_.template Output<CPUBackend>(output_idx).Resize(desc.shape, desc.type);
+                  }
+                }
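
The point of the suggestion above can be sketched in isolation. This is a minimal sketch, assuming that the return value of Setup signals whether the output descriptors were filled in; `OutputDesc`, `Output`, `Op` and `SetupAndResize` are hypothetical, simplified stand-ins rather than the types from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified output descriptor and output buffer.
struct OutputDesc { std::size_t num_elements; };
struct Output { std::vector<float> data; };

// Hypothetical operator whose Setup may decline to infer the output shapes.
struct Op {
  bool can_infer;

  bool Setup(std::vector<OutputDesc> &descs) const {
    if (!can_infer) return false;  // descs is left untouched
    descs.clear();
    descs.push_back(OutputDesc{4});
    descs.push_back(OutputDesc{2});
    return true;
  }
};

// Resizes the outputs only when Setup reported that it filled `descs`.
inline bool SetupAndResize(const Op &op, std::vector<OutputDesc> &descs,
                           std::vector<Output> &outputs) {
  if (!op.Setup(descs)) return false;  // nothing inferred: do not index descs
  assert(descs.size() == outputs.size());
  for (std::size_t i = 0; i < outputs.size(); i++) {
    outputs[i].data.resize(descs[i].num_elements);
  }
  return true;
}
```

Without the guard, an operator that returns false would leave the descriptor vector empty or stale, and the resize loop would read descriptors that were never written.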

jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from 890427f to fe1fd3d Compare

May 9, 2022 13:08

mzient reviewed

dali/pipeline/operator/false_gpu_operator.h Outdated

+                  for (int output_idx = 0; output_idx < ws.NumOutput(); output_idx++) {
+                    const auto& cpu_output = cpu_ws_.Output<CPUBackend>(output_idx);
+                    ws.Output<GPUBackend>(output_idx).Copy(cpu_output, AccessOrder(ws.stream()));

Contributor

mzient May 9, 2022

Better not to construct AccessOrder inside a loop, unless you know the device id.
AccessOrder(stream, device) is cheap.
AccessOrder(stream) is much more expensive, because it needs to look up the device id from the stream.
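
The cost difference described above can be sketched with a counter in place of the real lookup. This is a minimal sketch: `Order`, `LookupDeviceForStream`, `LookupCount`, `Buffer`, `CopyWithOrder` and `CopyAll` are hypothetical stand-ins, streams are plain integers, and the "expensive" lookup only increments a counter.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times the simulated stream-to-device lookup runs.
inline int &LookupCount() {
  static int count = 0;
  return count;
}

// Hypothetical stand-in for a driver query that maps a stream to its device.
inline int LookupDeviceForStream(int /*stream*/) {
  ++LookupCount();
  return 0;
}

// Hypothetical stand-in for an ordering handle built from a stream.
struct Order {
  int stream, device;
  explicit Order(int s) : stream(s), device(LookupDeviceForStream(s)) {}  // performs a lookup
  Order(int s, int d) : stream(s), device(d) {}                           // no lookup needed
};

struct Buffer { std::vector<float> data; };

inline void CopyWithOrder(Buffer &dst, const Buffer &src, const Order &) {
  dst.data = src.data;
}

// Builds the order once, outside the loop, and reuses it for every output.
inline void CopyAll(std::vector<Buffer> &dst, const std::vector<Buffer> &src, int stream) {
  const Order order(stream);  // single lookup, regardless of the number of outputs
  for (std::size_t i = 0; i < src.size(); i++) {
    CopyWithOrder(dst[i], src[i], order);
  }
}
```

Constructing `Order(stream)` inside the loop instead would run the lookup once per output; hoisting it, or passing a known device id to the two-argument constructor, keeps the per-iteration cost at the copy itself.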


          Code review fixes

fba84c0

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from fe1fd3d to fba84c0 Compare

May 9, 2022 13:11

mzient approved these changes

JanuszL reviewed

dali/pipeline/operator/false_gpu_operator.h Outdated

+                explicit FalseGPUOperator(const OpSpec &spec)
+                    : Operator<GPUBackend>(spec),
+                      cpu_impl_(spec),
+                      thread_pool_(num_threads_, spec.GetArgument<int>("device_id"), true /** set_affine */ ) {

Contributor

JanuszL May 9, 2022

Nitpick: I'm not sure if we need this argument name.

Suggested change

-                    thread_pool_(num_threads_, spec.GetArgument<int>("device_id"), true /** set_affine */ ) {
+                    thread_pool_(num_threads_, spec.GetArgument<int>("device_id"), true) {

JanuszL reviewed

dali/pipeline/operator/false_gpu_operator.h

+               protected:
+                bool CanInferOutputs() const override {
+                  // To run Setup we need to first copy from device to host.

Contributor

JanuszL May 9, 2022

Do we need to copy all data or just CPU arguments?
I have mixed feelings regarding moving the whole setup to Run, but it is up to you.

Contributor Author

jantonguirao May 9, 2022

To run Setup we need to have a valid HostWorkspace

JanuszL approved these changes

jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from 9a67630 to de0edfc Compare

May 9, 2022 14:20

Contributor Author

jantonguirao commented May 9, 2022

!build

Collaborator

dali-automaton commented May 9, 2022

CI MESSAGE: [4788313]: BUILD STARTED

Collaborator

dali-automaton commented May 9, 2022

CI MESSAGE: [4788313]: BUILD FAILED

jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from de0edfc to afcfbb0 Compare

May 9, 2022 15:53

Contributor Author

jantonguirao commented May 9, 2022

!build

Collaborator

dali-automaton commented May 9, 2022

CI MESSAGE: [4788843]: BUILD STARTED

Collaborator

dali-automaton commented May 9, 2022

CI MESSAGE: [4788843]: BUILD FAILED


          Save testing time by calculating CPU and GPU in one go

5a36d12

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao force-pushed the false_gpu_impl_non_silent_region branch from afcfbb0 to 5a36d12 Compare

May 9, 2022 17:58

Contributor Author

jantonguirao commented May 9, 2022

!build

Collaborator

dali-automaton commented May 9, 2022

CI MESSAGE: [4790314]: BUILD STARTED

Collaborator

dali-automaton commented May 9, 2022

CI MESSAGE: [4790314]: BUILD PASSED

jantonguirao merged commit a14ab64 into NVIDIA:main

cyyever pushed a commit to cyyever/DALI that referenced this pull request


          Add NonsilentRegion GPU, implemented in terms of the CPU version (NVIDIA#3874)

a5fd5e5

Signed-off-by: Joaquin Anton <janton@nvidia.com>

cyyever pushed a commit to cyyever/DALI that referenced this pull request


          Add NonsilentRegion GPU, implemented in terms of the CPU version (NVIDIA#3874)

f166330

Signed-off-by: Joaquin Anton <janton@nvidia.com>

JanuszL mentioned this pull request

DALI 2022 roadmap #3774

Closed
