
Fix handling of CUDA streams in Python frontend #2050

Merged: 12 commits into NVIDIA:master, Jun 25, 2020

Conversation

mzient (Contributor) commented Jun 23, 2020

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

Why do we need this PR?

  • It fixes a bug: GPU external source test failure under GPU load

What happened in this PR?

  • What solution was applied:
    • When cuda_stream is None in ExternalSource and the object is a CuPy array, issue the copy on CuPy's current stream.
  • Affected modules and functionalities:
    • pipeline.py
  • Key points relevant for the review:
    • N/A
  • Validation and testing:
    • Existing tests apply
  • Documentation (including examples):
    • N/A

JIRA TASK: DALI-1474
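
The gist of the fix, as a minimal sketch - the helper name and exact control flow here are illustrative, not the verbatim pipeline.py code:

def _infer_copy_stream(data, cuda_stream):
    # Pick the stream for the copy when the user did not provide one.
    if cuda_stream is not None:
        return cuda_stream
    try:
        import cupy
    except ImportError:
        return None  # no CuPy: fall back to DALI's internal stream
    if isinstance(data, cupy.ndarray):
        # Use CuPy's current stream, so the copy is ordered after any
        # kernels the producer has already enqueued on it.
        return cupy.cuda.get_current_stream().ptr
    return None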

* When cuda_stream is None in ExternalSource and the object is a CuPy array, issue the copy on CuPy's current stream.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
mzient (Contributor, Author) commented Jun 23, 2020

!build

dali-automaton (Collaborator): CI MESSAGE: [1417005]: BUILD STARTED

klecki (Contributor) commented Jun 23, 2020

My points:

  • The docs say that if you don't provide a stream, we will use an internal one for the copy. If users know their memory is ready, they may want to preserve this behaviour.
  • This fixes the CuPy test by adding a CuPy-specific workaround. There are other sources of CUDA memory we can get here that will still trigger the bug.
  • I think we should mention in the docs that if the user doesn't provide a stream, the memory should already be ready.

Maybe we can make this configurable: by default, extract the stream from whatever library we recognize and know how to handle, and otherwise allow the user to turn it off?

This problem can also be fixed in the test by using the CuPy API to synchronize before passing the memory to DALI.
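
For reference, a minimal sketch of that alternative, test-side fix, assuming a CuPy-producing source:

import cupy

def make_batch():
    arr = cupy.arange(6, dtype=cupy.float32).reshape(2, 3)
    # Make sure the data is actually ready before DALI copies it
    # on a different (internal) stream.
    cupy.cuda.get_current_stream().synchronize()
    return arr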

dali-automaton (Collaborator): CI MESSAGE: [1417005]: BUILD PASSED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
mzient requested a review from JanuszL, June 23, 2020 17:17
mzient changed the title from "Fix CuPy external source." to "Fix handling of CUDA streams in Python frontend", Jun 23, 2020
mzient (Contributor, Author) commented Jun 23, 2020

!build

dali-automaton (Collaborator): CI MESSAGE: [1417491]: BUILD STARTED

@@ -62,9 +62,15 @@ def feed_ndarray(dali_tensor, arr, cuda_stream = None):
# Get CTypes void pointer to the underlying memory held by arr
ptr = ctypes.c_void_p()
mx.base._LIB.MXNDArrayGetData(arr.handle, ctypes.byref(ptr))

if hasattr(cuda_stream, "cuda_stream"): # torch
JanuszL (Contributor) commented Jun 23, 2020:

Extract this to a common utility.

dali-automaton (Collaborator): CI MESSAGE: [1417491]: BUILD FAILED

@@ -57,10 +57,14 @@ def feed_ndarray(dali_tensor, arr, cuda_stream = None):
assert dali_tensor.shape() == list(arr.size()), \
("Shapes do not match: DALI tensor has size {0}"
", but PyTorch Tensor has size {1}".format(dali_tensor.shape(), list(arr.size())))
#turn raw int to a c void pointer
if cuda_stream is torch.cuda.Stream:
JanuszL (Contributor) commented Jun 23, 2020:

Suggested change (the original line compares identity with the class object itself, so it is never true for an actual stream instance):
if cuda_stream is torch.cuda.Stream:
if isinstance(cuda_stream, torch.cuda.streams.Stream):

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
JanuszL (Contributor) commented Jun 24, 2020

!build

dali-automaton (Collaborator): CI MESSAGE: [1419648]: BUILD STARTED

dali-automaton (Collaborator): CI MESSAGE: [1419648]: BUILD FAILED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
mzient (Contributor, Author) commented Jun 24, 2020

!build

dali-automaton (Collaborator): CI MESSAGE: [1420074]: BUILD STARTED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
klecki (Contributor) left a comment:

Mostly wording and the repetition of c_void_p usage.

and all work is properly queued). If no stream is provided feeding input blocks until the
provided memory is copied to the internal buffer
and all work is properly queued). If no stream is provided, DALI will use a default, with
best-effort approach at correctness (see ``cuda_stream`` argument documentation for details).
Contributor:
So, we're ditching the idea of blocking copy? I see it will still happen as you didn't change the internals.

mzient (Contributor, Author):
It's sort of orthogonal - we'll still block when using the internal stream, because the kind of bug it protects against is even harder to detect. Maybe we should make it explicitly configurable. Let's discuss it on dev.

Contributor:
I don't think we need that at the Python level; in C++ it is there. I would still mention here that if no stream is provided, it will block.

mzient (Contributor, Author):
It's mentioned in the documentation of cuda_stream parameter.

@@ -70,13 +70,16 @@ def feed_ndarray(dali_tensor, ptr, cuda_stream = None):
Tensor from which to copy
`ptr` : LoDTensor data pointer
Destination of the copy
`cuda_stream` : Any value that can be casted to cudaStream_t
`cuda_stream` : Any value that can be caste to cudaStream_t
Contributor:
Suggested change
`cuda_stream` : Any value that can be caste to cudaStream_t
`cuda_stream` : Any value that can be cast to cudaStream_t

But maybe we should call it representing cudaStream_t? Accessing some attributes is not exactly casting, right?
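
To make "cast or represents" concrete, a hypothetical sketch of the kind of extraction types._raw_cuda_stream performs (the actual implementation may differ):

def raw_cuda_stream(cuda_stream):
    if cuda_stream is None:
        return None
    if hasattr(cuda_stream, "cuda_stream"):  # torch.cuda.Stream exposes the raw handle here
        return cuda_stream.cuda_stream
    if hasattr(cuda_stream, "ptr"):          # cupy.cuda.Stream exposes it as .ptr
        return cuda_stream.ptr
    return int(cuda_stream)                  # already a raw handle or int-like value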

@@ -48,7 +48,7 @@ def feed_ndarray(dali_tensor, arr, cuda_stream = None):
Tensor from which to copy
`arr` : mxnet.nd.NDArray
Destination of the copy
`cuda_stream` : Any value that can be casted to cudaStream_t
`cuda_stream` : Any value that can be cast to cudaStream_t
Contributor:
As mentioned elsewhere, maybe:

Suggested change
`cuda_stream` : Any value that can be cast to cudaStream_t
`cuda_stream` : Any value that can be cast or represents cudaStream_t

Comment on lines +464 to +468
The array(s) may be one of:
* NumPy ndarray (CPU)
* MXNet ndarray (CPU)
* PyTorch tensor (CPU or GPU)
* CuPy array (GPU)
Contributor:
Can't we handle anything with [cuda] array interface?

mzient (Contributor, Author):
Good point.

Contributor:
And the Python Buffer Protocol (NumPy is just one example of it). I would add this info to the ExternalSource docs as well.
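
A minimal sketch of such generic detection through the standard interfaces, instead of special-casing particular libraries (helper names are illustrative):

def is_gpu_array(obj):
    # CuPy, Numba and others expose GPU memory through this interface.
    return hasattr(obj, "__cuda_array_interface__")

def data_ptr(obj):
    if is_gpu_array(obj):
        return obj.__cuda_array_interface__["data"][0]
    if hasattr(obj, "__array_interface__"):  # NumPy and friends
        return obj.__array_interface__["data"][0]
    raise TypeError("not an array-like object")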

Comment on lines +496 to +502
infer_stream = False
if cuda_stream is None:
    infer_stream = True
if cuda_stream == -1:
    cuda_stream = None
else:
    cuda_stream = types._raw_cuda_stream(cuda_stream)
Contributor:
Can we maybe use another name here? The cuda_stream passed as the argument to this function and the one we use later have a bit different meanings, given the None and -1 handling.

Contributor:
Maybe something like stream_ptr, and have _raw_cuda_stream already return a c_void_p value? You're packing it by hand at every invocation site.

Contributor:
Or maybe it would be better to have some boolean for SetExternalTensorInput indicating if the stream should be generated internally?

mzient (Contributor, Author) commented Jun 24, 2020:
Unpacking it to a raw pointer may cause hard errors in the following (or a similar) scenario:

stream = torch.cuda.Stream()
fn.external_source(src, cuda_stream = stream)
# stream reference is forgotten

It's a bug, I agree. But if we unwrap immediately, we lose the only reference to the stream and it will be destroyed - our stream pointer becomes dangling and can even be recycled by the driver upon the next stream creation, in which case we'd coincidentally have a different, but still wrong, stream. We'd convert a Python-level logic error into a potentially disastrous hard error in native code. I don't want our users to debug THAT kind of error.
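
A hypothetical illustration of the design defended here: store the Python stream object, so the reference keeps it alive, and unwrap the raw handle only at the point of use.

import ctypes

class ExternalInput:
    def __init__(self, cuda_stream=None):
        # Holding the Python object keeps the underlying CUDA stream alive.
        self._cuda_stream = cuda_stream

    def stream_arg(self):
        s = self._cuda_stream
        if s is None:
            return None
        # torch streams expose the raw handle as .cuda_stream, CuPy as .ptr.
        raw = getattr(s, "cuda_stream", getattr(s, "ptr", s))
        return ctypes.c_void_p(int(raw))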

Contributor:
> Or maybe it would be better to have some boolean for SetExternalTensorInput indicating if the stream should be generated internally?

I think None is fine.

pipe = Pipeline(1, 3, 0)

def gen_batch():
    nonlocal t0
Contributor:
Why do you need nonlocal for t0 and not increment?

mzient (Contributor, Author):
Because I need the same tensor to change in place - simply returning a new one in PyTorch resulted in synchronization, and the error could not be reproduced even when the streams were wrong.
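
For context, a minimal sketch of that test pattern (names are illustrative, not the verbatim test code):

import torch

def make_source():
    # One pre-allocated tensor, mutated in place across iterations; a fresh
    # allocation per batch would synchronize and hide the stream-ordering bug.
    t0 = torch.zeros(2, 3, device="cuda")

    def gen_batch():
        nonlocal t0  # the augmented assignment below binds t0, so nonlocal is required
        t0 += 1      # in-place update, enqueued on the current stream
        return [t0]

    return gen_batch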

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
klecki (Contributor) commented Jun 24, 2020

!build

dali-automaton (Collaborator): CI MESSAGE: [1420366]: BUILD STARTED

@JanuszL JanuszL self-requested a review June 24, 2020 14:34
provided GPU memory content only using provided stream (DALI schedules
a copy on it and all work is properly queued). If no stream is provided
feed_input blocks until the provided memory is copied to the internal buffer
"""Pass a mutlidimensional array (or a list thereof) to an output of ExternalSource.
Contributor:
I would copy/paste more info from ExternalSource docs.

c_type_pointer = ctypes.c_void_p(ptr)
if isinstance(dali_tensor, (TensorGPU, TensorListGPU)):
dali_tensor.copy_to_external(c_type_pointer, cuda_stream)
dali_tensor.copy_to_external(c_type_pointer, ctypes.c_void_p(cuda_stream))
Contributor:
Now you will get nullptr instead of None. This breaks the logic in:

    .def("copy_to_external",
        [](Tensor<GPUBackend> &t, py::object p, py::object cuda_stream, bool non_blocking) {
          void *ptr = ctypes_void_ptr(p);
          cudaStream_t stream = cuda_stream.is_none()
                ? UserStream::Get()->GetStream(t)
                : static_cast<cudaStream_t>(ctypes_void_ptr(cuda_stream));
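
One way to preserve the None case, as a hedged sketch (the actual fix applied in the PR may differ):

import ctypes

def to_stream_arg(cuda_stream):
    # Keep None as None, so the C++ binding can fall back to its internal
    # stream; otherwise wrap the raw handle for copy_to_external.
    return None if cuda_stream is None else ctypes.c_void_p(cuda_stream)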

c_type_pointer = ctypes.c_void_p(arr.data_ptr())
if isinstance(dali_tensor, (TensorGPU, TensorListGPU)):
dali_tensor.copy_to_external(c_type_pointer, cuda_stream)
dali_tensor.copy_to_external(c_type_pointer, ctypes.c_void_p(cuda_stream))
Contributor:
As above

dali-automaton (Collaborator): CI MESSAGE: [1420366]: BUILD PASSED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
mzient (Contributor, Author) commented Jun 24, 2020

!build

dali-automaton (Collaborator): CI MESSAGE: [1420580]: BUILD STARTED

dali-automaton (Collaborator): CI MESSAGE: [1420580]: BUILD PASSED

Reject masked tensors.
Fix documentation formatting issues.

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
dali-automaton (Collaborator): CI MESSAGE: [1421224]: BUILD STARTED

dali-automaton (Collaborator): CI MESSAGE: [1421224]: BUILD PASSED

dali-automaton (Collaborator): CI MESSAGE: [1421798]: BUILD STARTED

dali-automaton (Collaborator): CI MESSAGE: [1421798]: BUILD PASSED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
dali-automaton (Collaborator): CI MESSAGE: [1422380]: BUILD STARTED

Comment on lines +152 to +156
for (int i = strides.size() - 1; i >= 0; i--) {
  DALI_ENFORCE(strides[i] == stride_from_shape,
      make_string("Strided data not supported. Dimension ", i, " has stride ", strides[i],
                  " whereas densely packed data of this shape would have a stride ", stride_from_shape));
  stride_from_shape *= shape[i];
Contributor:
We use this check in a couple of places now. Can you extract it into a function?
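
To make the check concrete, the same density validation rendered in Python (illustrative, assuming byte strides as in the array interfaces):

def check_dense_strides(shape, strides, element_size):
    # Strides must match those of densely packed (C-contiguous) data.
    expected = element_size
    for i in reversed(range(len(shape))):
        if strides[i] != expected:
            raise ValueError(
                "Strided data not supported. Dimension {} has stride {}, "
                "whereas densely packed data of this shape would have a stride {}"
                .format(i, strides[i], expected))
        expected *= shape[i]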

dali-automaton (Collaborator): CI MESSAGE: [1422380]: BUILD PASSED

mzient merged commit d9a5f03 into NVIDIA:master on Jun 25, 2020