
Enable python ExternalSource operator for the GPU data #1997

Merged
merged 7 commits into NVIDIA:master on Jun 17, 2020

Conversation

JanuszL
Contributor

@JanuszL JanuszL commented Jun 4, 2020

  • adds Python-side support for feeding GPU data to the ExternalSource operator
  • extends the ExternalSource example

Signed-off-by: Janusz Lisiecki jlisiecki@nvidia.com

Why we need this PR?

  • It adds support for GPU input to the ExternalSource operator via the Python API

What happened in this PR?

  • What solution was applied:
    adds Python-side support for GPU data feed to the ExternalSource operator
    extends the ExternalSource example
  • Affected modules and functionalities:
    ExternalSource
    Python API
    backend_impl
  • Key points relevant for the review:
    NA
  • Validation and testing:
    new CI tests are added
  • Documentation (including examples):
    example is extended

JIRA TASK: [DALI-182]
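As a rough illustration of the feature this PR adds, a Python pipeline can feed GPU-resident (e.g. cupy) arrays to ExternalSource. The sketch below follows the DALI 0.x Pipeline API of that era (`ops.ExternalSource(device="gpu")`, `feed_input`); treat the exact shapes and names as illustrative, not as the PR's code. Imports are deferred inside the function so the snippet parses without DALI installed:

```python
def run_gpu_external_source(batch_size=4):
    """Hypothetical sketch: feed cupy (GPU) arrays into a DALI pipeline
    through ExternalSource. Requires nvidia.dali and cupy to actually run."""
    import cupy as cp
    from nvidia.dali.pipeline import Pipeline
    import nvidia.dali.ops as ops

    class ExternalSourcePipeline(Pipeline):
        def __init__(self):
            super().__init__(batch_size, num_threads=2, device_id=0)
            self.source = ops.ExternalSource(device="gpu")

        def define_graph(self):
            self.batch = self.source()
            return self.batch

        def iter_setup(self):
            # GPU data is fed directly; DALI copies it on an internal stream.
            data = [cp.random.rand(224, 224, 3).astype(cp.float32)
                    for _ in range(batch_size)]
            self.feed_input(self.batch, data)

    pipe = ExternalSourcePipeline()
    pipe.build()
    return pipe.run()
```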


- adds Python-side support for GPU data feed to the ExternalSource operator
- extends the ExternalSource example

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL
Contributor Author

JanuszL commented Jun 4, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1371416]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1371416]: BUILD FAILED

@JanuszL JanuszL force-pushed the external_python_gpu branch 2 times, most recently from f7893fb to 505712a Compare June 4, 2020 22:16
@JanuszL
Contributor Author

JanuszL commented Jun 4, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1371753]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1371753]: BUILD FAILED

@JanuszL
Contributor Author

JanuszL commented Jun 5, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1372990]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1372990]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL JanuszL force-pushed the external_python_gpu branch 2 times, most recently from 60e4ce0 to 90c80b5 Compare June 5, 2020 12:27
@JanuszL
Contributor Author

JanuszL commented Jun 5, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1373346]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1373346]: BUILD FAILED

@JanuszL JanuszL force-pushed the external_python_gpu branch 2 times, most recently from 0e0743a to 20113e6 Compare June 5, 2020 14:34
@JanuszL
Contributor Author

JanuszL commented Jun 5, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1373692]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1373692]: BUILD FAILED

@JanuszL
Contributor Author

JanuszL commented Jun 5, 2020

!builld

@JanuszL
Contributor Author

JanuszL commented Jun 5, 2020

!builld

@dali-automaton
Collaborator

CI MESSAGE: [1401034]: BUILD FAILED

@szalpal
Member

szalpal commented Jun 17, 2020

Same can probably happen for GPU data and CPU ExternalSource.
SetExternalInput is non-blocking; apparently the idea was that the user can reuse the memory by observing the stream he provided, but this is not documented (apparently @szalpal can say more about this). We probably need to highlight that in the docs.

Correct, when the user calls daliSetExternalInputAsync and synchronizes on the stream provided there, the memory he provided should no longer be needed by DALI. But that's rather obvious, isn't it? I mean, there are in fact two functions, sync and async, but that's only for convenience. Stream management is added here because the copy might be expensive, and the user might be able to use a few different streams to perform it and make it faster.

The user now has no idea when he can safely touch his memory again.

Again, I guess that's rather obvious: you can touch the memory after the sync function returns. If you feel it's not as obvious as I think, let's add this to the docs.

@JanuszL
Contributor Author

JanuszL commented Jun 17, 2020

I think there are potential ownership issues if we call SetExternalInput with anything that does an async copy, as we synchronize on the event only in Run (and we make the stream wait for it). So there is a problem if we decide to do this for a GPU->GPU copy and the memory is changed/deallocated before DALI's async copy to the ExternalSource queue is finished.

Same can probably happen for GPU data and CPU ExternalSource.
SetExternalInput is non-blocking; apparently the idea was that the user can reuse the memory by observing the stream he provided, but this is not documented (apparently @szalpal can say more about this). We probably need to highlight that in the docs.

Also, if the user doesn't provide the stream for setting the external input, we use some UserStream that is a DALI-internal thing. The user then has no idea when he can safely touch his memory again.

One solution would be to make SetExternalInput blocking in such a case (until the internal copy is finished), or to provide some other mechanism.

I added a sync parameter that is used when the user doesn't provide the stream.
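To make the semantics concrete, here is a pure-Python toy model (not DALI code; all names are made up) of what the sync flag buys the caller: the "copy" runs asynchronously on a toy stream, and sync=True blocks until it completes, after which the source buffer can be safely reused:

```python
import threading
import time

class ToyStream:
    """Toy model of an async copy on a CUDA stream (illustration only)."""
    def __init__(self):
        self._done = threading.Event()

    def launch_copy(self, src, dst):
        def work():
            time.sleep(0.01)   # pretend the DMA transfer takes a while
            dst[:] = src       # the actual copy
            self._done.set()
        self._done.clear()
        threading.Thread(target=work).start()

    def synchronize(self):
        self._done.wait()

def set_external_input(stream, src, dst, sync):
    """Mimics the new sync flag: when True, block until the internal copy
    finishes so the caller may immediately reuse src. With sync=False the
    caller must synchronize on the stream before touching src again."""
    stream.launch_copy(src, dst)
    if sync:
        stream.synchronize()

stream = ToyStream()
src, dst = [1, 2, 3], [0, 0, 0]
set_external_input(stream, src, dst, sync=True)
src[0] = 99        # safe: the copy has already completed
print(dst)         # [1, 2, 3]
```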

@dali-automaton
Collaborator

CI MESSAGE: [1402155]: BUILD STARTED

@@ -302,7 +311,7 @@ void ExposeTensor(py::module &m) {
layout : str
Layout of the data
device_id: int
Device of where this tensor resides
Device of where this tensor resides. If no is provided the current device is used.
Contributor

Suggested change
Device of where this tensor resides. If no is provided the current device is used.
Device of where this tensor resides. If not provided, the current device is used.

Contributor Author

Done

device_id: int
Device of where this lists of tensors resides
device_id : int
Device of where this tensor resides. If no is provided the current device is used.
Contributor

same here

Contributor Author

Done

@dali-automaton
Collaborator

CI MESSAGE: [1401034]: BUILD PASSED

*/
DLL_PUBLIC void
daliSetExternalInputAsync(daliPipelineHandle *pipe_handle, const char *name,
device_type_t device, const void *data_ptr,
dali_data_type_t data_type, const int64_t *shapes,
int sample_dim, const char *layout_str,
cudaStream_t stream);
cudaStream_t stream, int sync);
Contributor

There is a question whether we still need both async and sync variants of this function, if this can be handled by the parameter.

Contributor Author

The sync variant has its own stream; in this variant you still need to provide one.

Contributor

Ok, makes sense.

@dali-automaton
Collaborator

CI MESSAGE: [1402155]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Contributor

@klecki klecki left a comment

The sync for SetExternalInput seems to solve the problems. I'm not sure how I feel about the name and behaviour of the blocking ExternalSource parameter, and whether it might cause some confusion.

*/
DLL_PUBLIC void
daliSetExternalInputAsync(daliPipelineHandle *pipe_handle, const char *name,
device_type_t device, const void *data_ptr,
dali_data_type_t data_type, const int64_t *shapes,
int sample_dim, const char *layout_str,
cudaStream_t stream);
cudaStream_t stream, int sync);
Contributor

Ok, makes sense.

@@ -139,13 +139,14 @@ DLL_PUBLIC void daliDeserializeDefault(daliPipelineHandle *pipe_handle,
* Can be set to NULL.
* @param stream CUDA stream to use when copying the data onto GPU. Remember to synchronize on the
* provided stream.
* @param sync If block until data provided is copied to the internal DALI buffer
Contributor

Suggested change
* @param sync If block until data provided is copied to the internal DALI buffer
* @param sync Whether to block until the provided data is copied to the internal DALI buffer

Contributor Author

@JanuszL JanuszL Jun 17, 2020

Done

@@ -148,7 +144,7 @@ class ExternalSource : public Operator<Backend> {
copy_to_storage_event = copy_to_storage_events_.GetEmpty();
}

data.front()->Copy(tl, stream);
data.front()->Copy(t, stream);
if (std::is_same<SrcBackend, GPUBackend>::value) {
cudaEventRecord(*copy_to_storage_event.front(), stream);
Contributor

Cool. Thanks for adding the comment.

{
std::unique_lock<std::mutex> busy_lock(busy_m_);
cv_.wait(busy_lock, [&data = tl_data_]{return !data.IsEmpty();});
cv_.wait(busy_lock, [&data = tl_data_, &blocking = blocking_] {
Contributor

I think it's better to capture blocking_ by copy instead of by reference, as it won't change.

Contributor Author

@JanuszL JanuszL Jun 17, 2020

Done

{
std::unique_lock<std::mutex> busy_lock(busy_m_);
cv_.wait(busy_lock, [&data = tl_data_]{return !data.IsEmpty();});
cv_.wait(busy_lock, [&data = tl_data_, &blocking = blocking_] {
return !(data.IsEmpty() && blocking);
Contributor

We can also alternatively:

if (blocking) {
  cv_.wait(busy_lock, [&data = tl_data_]{return !data.IsEmpty();});
} else {
  // we have the lock, fail if there is no data.
}

Contributor Author

@JanuszL JanuszL Jun 17, 2020

Done
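The wait logic under discussion can be mimicked in pure Python with threading.Condition.wait_for. This toy model (names invented; only the predicate mirrors the C++ above) shows blocking mode waiting for data while non-blocking mode fails fast when nothing was fed:

```python
import threading
from collections import deque

class ToyExternalSource:
    """Toy analog of the blocking/non-blocking ExternalSource wait; not DALI code."""
    def __init__(self, blocking=True):
        self.blocking = blocking
        self.cv = threading.Condition()
        self.data = deque()

    def feed(self, batch):
        with self.cv:
            self.data.append(batch)
            self.cv.notify()

    def run(self):
        with self.cv:
            # Mirrors: cv_.wait(lock, [...]{ return !(data.IsEmpty() && blocking); });
            self.cv.wait_for(lambda: not (len(self.data) == 0 and self.blocking))
            if not self.blocking and not self.data:
                # we have the lock, fail if there is no data
                raise RuntimeError(
                    "No data was provided to the ExternalSource. "
                    "Make sure to feed it properly.")
            return self.data.popleft()

src = ToyExternalSource(blocking=True)
src.feed([1, 2, 3])
print(src.run())   # [1, 2, 3]
```

In non-blocking mode the predicate is immediately true even with an empty queue, so the explicit emptiness check after the wait is what raises the error, matching the suggested restructuring above.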

@@ -1075,7 +1089,7 @@ PYBIND11_MODULE(backend_impl, m) {
.def("SetExternalTLInput",
[](Pipeline *p, const string &name, const TensorList<CPUBackend> &tl,
py::object /*cuda_stream*/) {
p->SetExternalInput(name, tl, 0);
p->SetExternalInput(name, tl, 0, true);
Contributor

I assume it is fine to pass a false here as well (or rather it just doesn't matter)?

Contributor

But it's more aligned with the docs, which claim it to be blocking.

Contributor Author

It is CPU, so it syncs anyway; it is better to pass true to be aligned with what happens under the hood.

Comment on lines 54 to 59
cv_.wait(busy_lock, [&data = tl_data_, &blocking = blocking_] {
return !(data.IsEmpty() && blocking);
});
if (!blocking_ && tl_data_.IsEmpty()) {
DALI_FAIL("No data was provided to the ExternalSource. Make sure to feed it properly.");
}
Contributor

Same suggestion as for the CPU.

Contributor Author

@JanuszL JanuszL Jun 17, 2020

Done

@JanuszL
Contributor Author

JanuszL commented Jun 17, 2020

The sync for SetExternalInput seems to solve the problems. I'm not sure how I feel about the name and behaviour of the blocking ExternalSource parameter, and whether it might cause some confusion.

I'm open to suggestions regarding the naming.

@JanuszL JanuszL force-pushed the external_python_gpu branch 2 times, most recently from 47f976c to e1efecd Compare June 17, 2020 14:09
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL
Contributor Author

JanuszL commented Jun 17, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1402575]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1402575]: BUILD PASSED

@JanuszL JanuszL merged commit dd39fa9 into NVIDIA:master Jun 17, 2020
@JanuszL JanuszL deleted the external_python_gpu branch June 17, 2020 19:27
JanuszL added a commit to JanuszL/DALI that referenced this pull request Jun 24, 2020
- changes introduced by NVIDIA#1997
  were not applied to the conda-based tests
- the ExternalSource Jupyter example is extended with a GPU case and
  requires cupy and imageio to run; this PR fixes this

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
JanuszL added a commit that referenced this pull request Jun 24, 2020
- changes introduced by #1997 were not applied to the conda-based tests
- the ExternalSource Jupyter example is extended with a GPU case and requires cupy and imageio to run; this PR fixes this

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
7 participants