
Prevent clobbering of outputs before non-blocking copy_to_external finishes. #3953

Merged: 11 commits merged into NVIDIA:main on Jun 8, 2022

Conversation

@mzient (Contributor) commented Jun 2, 2022

Category:

Bug fix
Tests

Description:

This PR fixes an issue with copy_to_external / daliCopyOutput where not requiring host synchronization introduced a race condition: once ReleaseOutputs was called, an output buffer that was still being copied could be clobbered by the next iteration of the pipeline.
This PR adds a device-to-device synchronization between the stream associated with the tensor being copied (usually the GPU stage's stream) and the user-provided stream. With this change, any work submitted to the GPU stream after copy_to_external returns is scheduled after the copy.
Extensive tests with PyTorch CUDA streams and in the C API reproduce the issue; a test-driven approach was used to make the issue go away without altering the tests.
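To illustrate the resulting ordering guarantee, here is a minimal sketch using PyTorch CUDA events (not DALI's internal code; all stream and tensor names below are made up):

import torch

producer = torch.cuda.Stream()   # stands in for the GPU stage's stream
user = torch.cuda.Stream()       # stands in for the user-provided stream

src = torch.randn(1 << 20, device="cuda")
dst = torch.empty_like(src)

with torch.cuda.stream(user):
    dst.copy_(src, non_blocking=True)   # the asynchronous copy on the user stream

copied = torch.cuda.Event()
copied.record(user)              # mark the end of the copy on the user stream
producer.wait_event(copied)      # work submitted to the producer stream from now on
                                 # runs after the copy, so the source buffer can be
                                 # safely recycled on that stream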

Additional information:

The tests are not 100% reliable, as is always the case with race conditions; that is, there can be false negatives (the tests may still pass) if the issue reappears.

Affected modules and functionalities:

backend_impl - copy_to_external
C API
C API tests

Key points relevant for the review:

N/A

Tests:

test_copy_to_external_torch.py - all tests (it's a new file)
c_api_test.cu - likewise
existing tests - many of these should be checked for regressions; they include
c_api_test.cc - tests with daliCopyOutput, daliCopyOutputSamples in their name
framework tests - they use copy_to_external to populate framework tensors

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-2467

@dali-automaton (Collaborator): CI MESSAGE: [4994090]: BUILD STARTED

@mzient changed the title from "Async feed nd array" to "Make feed_ndarray non-blocking" on Jun 2, 2022
@dali-automaton (Collaborator): CI MESSAGE: [4994090]: BUILD PASSED

@dali-automaton (Collaborator): CI MESSAGE: [5004719]: BUILD STARTED

@mzient changed the title from "Make feed_ndarray non-blocking" to "Prevent clobbering of outputs before non-blocking copy_to_external finishes." on Jun 3, 2022
@dali-automaton (Collaborator): CI MESSAGE: [5004719]: BUILD PASSED

@dali-automaton (Collaborator): CI MESSAGE: [5032484]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [5032484]: BUILD PASSED

* The copy will be scheduled on the provided `cuda_stream` or, if left out, on an internal DALI
* stream.
* If a non-blocking copy is requested, the function will synchronize the source buffer's
* associated access order with the provided stream; otherwie, the function will wait until the
Contributor suggested a change:
* associated access order with the provided stream; otherwie, the function will wait until the
* associated access order with the provided stream; otherwise, the function will wait until the

* associated access order with the provided stream; otherwie, the function will wait until the
* copy completes.
*
* @tparam SourceObject a data store on GPUBackend (Tensor, TensorList, TensorVector)
Contributor suggested a change:
* @tparam SourceObject a data store on GPUBackend (Tensor, TensorList, TensorVector)
* @tparam SourceObject a data store on GPUBackend (Tensor, TensorList, TensorVector)

Comment on lines +34 to +43
to_torch_type = {
    types.DALIDataType.FLOAT   : torch.float32,
    types.DALIDataType.FLOAT64 : torch.float64,
    types.DALIDataType.FLOAT16 : torch.float16,
    types.DALIDataType.UINT8   : torch.uint8,
    types.DALIDataType.INT8    : torch.int8,
    types.DALIDataType.INT16   : torch.int16,
    types.DALIDataType.INT32   : torch.int32,
    types.DALIDataType.INT64   : torch.int64
}
Contributor commented:

Use from nvidia.dali.plugin.pytorch import to_torch_type?
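A sketch of what the suggestion amounts to (assuming the plugin exports a compatible mapping, as the comment implies):

from nvidia.dali.plugin.pytorch import to_torch_type  # reuse the plugin's mapping
# ...and drop the local DALIDataType -> torch.dtype dictionary above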

Comment on lines +45 to +80
def feed_ndarray(tensor_or_tl, arr, cuda_stream=None, non_blocking=False):
    """
    Copy contents of DALI tensor to PyTorch's Tensor.

    Parameters
    ----------
    `tensor_or_tl` : TensorGPU or TensorListGPU
    `arr` : torch.Tensor
        Destination of the copy
    `cuda_stream` : torch.cuda.Stream, cudaStream_t or any value that can be cast to cudaStream_t.
        CUDA stream to be used for the copy
        (if not provided, an internal user stream will be selected)
        In most cases, using PyTorch's current stream is expected (for example,
        if we are copying to a tensor allocated with torch.zeros(...))
    `non_blocking` : bool
        If True, the copy is only stream-ordered and no host synchronization is performed
    """
    dali_type = to_torch_type[tensor_or_tl.dtype]
    if isinstance(tensor_or_tl, TensorListGPU):
        dali_tensor = tensor_or_tl.as_tensor()
    else:
        dali_tensor = tensor_or_tl

    assert dali_type == arr.dtype, ("The element type of DALI Tensor/TensorList"
                                    " doesn't match the element type of the target PyTorch Tensor: "
                                    "{} vs {}".format(dali_type, arr.dtype))
    assert dali_tensor.shape() == list(arr.size()), \
        ("Shapes do not match: DALI tensor has size {0}, "
         "but PyTorch Tensor has size {1}".format(dali_tensor.shape(), list(arr.size())))

    cuda_stream = types._raw_cuda_stream(cuda_stream)
    # turn the raw int into a C void pointer
    c_type_pointer = ctypes.c_void_p(arr.data_ptr())
    stream = None if cuda_stream is None else ctypes.c_void_p(cuda_stream)
    tensor_or_tl.copy_to_external(c_type_pointer, stream, non_blocking)
    return arr
Contributor commented:
Ideally, we would remove this and extend feed_ndarray in the plugin to support the non_blocking argument. This can be handled as a separate task.
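For context, a hedged usage sketch of the non-blocking path (it assumes pipe is a built DALI pipeline with a single GPU output of type FLOAT, and uses the feed_ndarray helper defined above):

import torch

pipe.schedule_run()
(out,) = pipe.share_outputs()         # borrow DALI's output buffers; no copy yet
tensor = out.as_tensor()
arr = torch.empty(tensor.shape(), dtype=torch.float32, device="cuda")
stream = torch.cuda.current_stream()
feed_ndarray(tensor, arr, cuda_stream=stream, non_blocking=True)
pipe.release_outputs()                # with this PR, work on the pipeline's stream is
                                      # ordered after the copy, so this is now safe
stream.synchronize()                  # host sync only when the data is needed on the CPU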

pipe.release_outputs()
# if no appropriate synchronization is done, the array is likely
# clobbered with the results from the second iteration
assert check(arr, ref)
Contributor suggested a change:
assert check(arr, ref)
assert torch.equal(arr, ref)

would read better, IMHO
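Condensed, the race this test guards against looks like the following sketch (continuing the names from the usage sketch above; reference_for_iteration is a hypothetical helper standing in for the expected output):

feed_ndarray(tensor, arr, cuda_stream=stream, non_blocking=True)
ref = reference_for_iteration(0)   # hypothetical: expected contents of iteration 0
pipe.release_outputs()             # recycles the output buffer...
pipe.schedule_run()                # ...which the next iteration would overwrite
stream.synchronize()
assert torch.equal(arr, ref)       # fails if the buffer was clobbered mid-copy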


// This loop is tuned so that if the output buffer is recycled before the asynchronous copy
// finishes, the buffer is clobbered and an error is detected.
// (michalz) Verified on my desktop. The changes in c_api that came with this test
Contributor commented:
This makes sense in the context of this PR, but it will have no meaning some time in the future. Do we keep such a comment?

@mzient (Author) replied:
Well, this is a test, and a delicate one, too. It was very hard to get a repro, so this comment is a word of caution for whoever touches this code. I could remove the sentence about my desktop, but I'd add a repro describing how to break the code to trigger a failure.

}
wait_order.wait(copy_order);
Contributor commented:
You can extract

if (!host_sync)
    wait_order = src.order();

outside of the if/else

@mzient (Author) replied on Jun 8, 2022:
I cannot - src is a local variable initialized inside the if/else and has a different type in these branches.
I can, however, remove the duplicate assignment, which I've just noticed here.

mzient and others added 11 commits June 8, 2022 11:52
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator): CI MESSAGE: [5042627]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [5042627]: BUILD PASSED

@mzient merged commit 80fce13 into NVIDIA:main on Jun 8, 2022