
Fix TensorFlow plugin operation without GPU #3719

Merged

merged 8 commits into NVIDIA:main from fix_cpu_only_tf on Mar 10, 2022

Conversation

@JanuszL (Contributor) commented on Mar 7, 2022

  • The TensorFlow DALI plugin doesn't work without a GPU because it forces
    synchronization after copying the output data to the TF tensor, which
    invokes a CUDA call. This PR removes that synchronization when we copy
    to a non-pinned CPU buffer and when the DALI TF operator is placed on
    the CPU.

Signed-off-by: Janusz Lisiecki jlisiecki@nvidia.com

Category:

Bug fix (non-breaking change which fixes an issue)

Description:

  • The TensorFlow DALI plugin doesn't work without a GPU because it forces
    synchronization after copying the output data to the TF tensor, which
    invokes a CUDA call. This PR removes that synchronization when we copy
    to a non-pinned CPU buffer and when the DALI TF operator is placed on
    the CPU (a minimal usage sketch follows below).
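
To make the affected scenario concrete, here is a minimal sketch of a CPU-only
DALI TF setup in the spirit of the tests added in this PR. It is illustrative
only: the get_dataset helper, the constant-producing operator, the output
shapes/dtypes, and the CPU device placement are assumptions, not a verbatim
copy of the new tests.

import tensorflow as tf
import nvidia.dali.types as types
import nvidia.dali.plugin.tf as dali_tf
from nvidia.dali import pipeline_def

@pipeline_def()
def get_dali_pipe(value):
    # Trivial CPU-only pipeline: one constant scalar per sample (assumed body;
    # the operator actually used by the new tests is not shown in this diff).
    data = types.Constant(value)
    return data

def get_dataset(batch_size, value):
    pipe = get_dali_pipe(batch_size=batch_size, device_id=types.CPU_ONLY_DEVICE_ID,
                         num_threads=1, value=value)
    # Placing the dataset on the CPU exercises the path fixed by this PR: outputs
    # are copied into non-pinned host memory and no CUDA synchronization is issued.
    with tf.device('/cpu:0'):
        return dali_tf.DALIDataset(pipeline=pipe, batch_size=batch_size,
                                   output_shapes=(batch_size,),
                                   output_dtypes=tf.int32)

for batch in get_dataset(batch_size=3, value=42).take(1):
    assert (batch.numpy() == 42).all()

With this fix, iterating such a dataset on a machine without a GPU should
succeed instead of failing on the CUDA call that the forced synchronization
used to make.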

Additional information:

Affected modules and functionalities:

  • DALI TF operator and dataset
  • CPU only tests

Key points relevant for the review:

  • tests

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

- The TensorFlow DALI plugin doesn't work without a GPU because it forces
  synchronization after copying the output data to the TF tensor, which
  invokes a CUDA call. This PR removes that synchronization when we copy
  to a non-pinned CPU buffer and when the DALI TF operator is placed on
  the CPU

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL added the important-fix label on Mar 7, 2022
return data

def get_data(batch_size, value):
pipe = get_datali_pipe(batch_size=batch_size, device_id=types.CPU_ONLY_DEVICE_ID, num_threads=1, value=value)
Contributor:

Suggested change
pipe = get_datali_pipe(batch_size=batch_size, device_id=types.CPU_ONLY_DEVICE_ID, num_threads=1, value=value)
pipe = get_dali_pipe(batch_size=batch_size, device_id=types.CPU_ONLY_DEVICE_ID, num_threads=1, value=value)

Contributor Author:

Done

def test_dali_tf_op_cpu_only():
try:
tf.compat.v1.disable_eager_execution()
except:
Contributor:

Suggested change
except:
except Exception:

LGTM was complaining about using "except:" as a bad practice

Contributor Author:

Done



@pipeline_def()
def get_datali_pipe(value):
Contributor:

Suggested change
def get_datali_pipe(value):
def get_dali_pipe(value):

Contributor Author:

Done

return data

def get_data(batch_size, value):
pipe = get_datali_pipe(batch_size=batch_size, device_id=types.CPU_ONLY_DEVICE_ID, num_threads=1, value=value)
Contributor:

Suggested change
pipe = get_datali_pipe(batch_size=batch_size, device_id=types.CPU_ONLY_DEVICE_ID, num_threads=1, value=value)
pipe = get_dali_pipe(batch_size=batch_size, device_id=types.CPU_ONLY_DEVICE_ID, num_threads=1, value=value)

Contributor Author:

Done

skip_for_incompatible_tf()
try:
tf.compat.v1.enable_eager_execution()
except:
Contributor:

Suggested change
except:
except Exception:

Contributor Author:

Done


batch_size = 3
value = random.randint(0, 1000)
pipe = get_datali_pipe(batch_size=batch_size, device_id=types.CPU_ONLY_DEVICE_ID, num_threads=1, value=value)
Contributor:

Suggested change
pipe = get_datali_pipe(batch_size=batch_size, device_id=types.CPU_ONLY_DEVICE_ID, num_threads=1, value=value)
pipe = get_dali_pipe(batch_size=batch_size, device_id=types.CPU_ONLY_DEVICE_ID, num_threads=1, value=value)

Contributor Author:

Done

unsigned int wait_flag = (out_id == num_outputs - 1) ? DALI_ext_force_sync : DALI_ext_default;
// if the OP runs on the CPU the output memory is not pinned and we don't need to sync
unsigned int wait_flag = (i == dali_num_out - 1) ?
(this->device_type_ == device_type_t::CPU ? 0 : DALI_ext_force_sync) :
Contributor:

Suggested change
(this->device_type_ == device_type_t::CPU ? 0 : DALI_ext_force_sync) :
(this->device_type_ == device_type_t::CPU ? DALI_ext_default : DALI_ext_force_sync) :

Contributor:

Or

unsigned int wait_flag =  this->device_type_ != device_type_t::CPU && (i == dali_num_out - 1) ? DALI_ext_force_sync : DALI_ext_default;

Contributor Author:

Done

// Synchronize with the dataset()->stream_ when doing the last copy, so the outputs
// are fully finished before we release the output buffers for reuse.
unsigned int wait_flag = (i == dali_num_out - 1) ? DALI_ext_force_sync : DALI_ext_default;
// if the OP runs on the CPU the output memory is not pinned and we don't need to sync
unsigned int wait_flag = (i == dali_num_out - 1) ?
Contributor:

same here

Contributor Author:

Done
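
To see the resolved logic in isolation, here is a small self-contained sketch
of the flag selection the reviewer proposed above. The enum values and the
helper function are stand-ins for illustration, not DALI's actual definitions;
only the condition itself is taken from the review.

#include <cstdio>

// Hypothetical stand-ins for the DALI C API flags and the plugin's device enum,
// just so the flag-selection logic from the review compiles in isolation.
enum : unsigned int { DALI_ext_default = 0, DALI_ext_force_sync = 1 };
enum class device_type_t { CPU, GPU };

unsigned int CopyWaitFlag(device_type_t device_type, int i, int dali_num_out) {
  // Force a sync only for the last output and only when the op runs on the GPU;
  // a CPU-placed op copies into non-pinned host memory, so no synchronization
  // (and hence no CUDA call) is needed.
  return (device_type != device_type_t::CPU && i == dali_num_out - 1)
             ? DALI_ext_force_sync
             : DALI_ext_default;
}

int main() {
  // Last output on GPU -> force sync; last output on CPU -> default (no sync).
  std::printf("%u %u\n", CopyWaitFlag(device_type_t::GPU, 2, 3),
              CopyWaitFlag(device_type_t::CPU, 2, 3));
}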

# CPU only test, remove CUDA from the search path just in case
export LD_LIBRARY_PATH=""
export PATH=${PATH/cuda/}
nosetests --verbose test_dali_cpu_only.py
Contributor:

Suggested change
nosetests --verbose test_dali_cpu_only.py
nosetests --verbose -m '(?:^|[\b_\./-])[Tt]est.*pytorch*' test_dali_cpu_only.py

?

Contributor:

Or should we use --attr 'pytorch' vs. --attr '!pytorch' ? Right now we are running all tests under test_pytorch.sh

Contributor Author:

Done

@jantonguirao self-assigned this on Mar 8, 2022
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@dali-automaton (Collaborator): CI MESSAGE: [4093192]: BUILD FAILED

@dali-automaton (Collaborator): CI MESSAGE: [4106587]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [4106908]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [4107712]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [4106908]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

@dali-automaton (Collaborator): CI MESSAGE: [4108281]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [4108281]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

@dali-automaton (Collaborator): CI MESSAGE: [4111696]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [4111696]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

@dali-automaton (Collaborator): CI MESSAGE: [4116022]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [4116022]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

@dali-automaton (Collaborator): CI MESSAGE: [4117029]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [4117029]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

@dali-automaton (Collaborator): CI MESSAGE: [4117747]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [4117747]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

@dali-automaton (Collaborator): CI MESSAGE: [4118800]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [4118800]: BUILD PASSED

@JanuszL merged commit ff497cc into NVIDIA:main on Mar 10, 2022
@JanuszL deleted the fix_cpu_only_tf branch on Mar 10, 2022 at 18:19
cyyever pushed a commit to cyyever/DALI that referenced this pull request on May 13, 2022

cyyever pushed a commit to cyyever/DALI that referenced this pull request on Jun 7, 2022
Labels
important-fix: Fixes an important issue in the software or development environment.