Run external source callback in parallel #2543
Conversation
Force-pushed from 06d4128 to 5d9b5f7
!build
CI MESSAGE: [1893889]: BUILD STARTED
This pull request introduces 7 alerts when merging 5d9b5f70e6e41c487b166ad3d09947177caab9d7 into 5d5844b - view on LGTM.com
CI MESSAGE: [1893889]: BUILD FAILED
Force-pushed from 4f815a1 to 93fdbb1
!build
CI MESSAGE: [1900108]: BUILD STARTED
CI MESSAGE: [1900108]: BUILD PASSED
!build
CI MESSAGE: [1900572]: BUILD STARTED
CI MESSAGE: [1900572]: BUILD PASSED
include/dali/core/device_guard.h (Outdated)

@@ -35,6 +35,9 @@ class DLL_PUBLIC DeviceGuard {
  // for device id < 0 it is no-op
  explicit DeviceGuard(int new_device);
  ~DeviceGuard();

  bool has_old_context();
I think this function pollutes a utility that has a well-defined purpose. Moreover, it will not work well if new_device is -1, so it somewhat breaks the contract here.
done
dali/python/backend_impl.cc (Outdated)

@@ -1117,6 +1118,22 @@ PYBIND11_MODULE(backend_impl, m) {

  m.def("GetCxx11AbiFlag", &GetCxx11AbiFlag);

  m.def("HasCudaContext", []{
    return DeviceGuard{}.has_old_context();
Don't abuse DeviceGuard for this.
Suggested change:
-  return DeviceGuard{}.has_old_context();
+  DALI_ENFORCE(cuInitChecked(),
+               "Failed to load libcuda.so. "
+               "Check your library paths and if the driver is installed correctly.");
+  CUcontext ctx;
+  CUDA_CALL(cuCtxGetCurrent(&ctx));
+  return ctx != nullptr;
I'd say that if we didn't load CUDA, that's still OK, and we can return that we don't have the context. There is a CPU-only mode that may not want to load CUDA (because it's not there), so cuInitChecked() will return 0.
done
!build
CI MESSAGE: [1914410]: BUILD STARTED
CI MESSAGE: [1914410]: BUILD FAILED
Force-pushed from 9e99ac6 to 6a826d6
!build
Leaving some comments and some questions.
I would be glad for some documentation; especially nice would be:
- the purpose of what is stored in those wrappers like MemChunk, SharedBatchSerialized, etc.
- what some of the indices mean: context_i, mem_chunk_id, chunk_version, etc.
- maybe it would be really nice to have an overview of what a worker does: what part of the batch it is expected to process and how the batch is formed back again. Alternatively, how the communication is handled -> task numbers going as requests for processing through some pipes, info about ready data in other pipes, and the stop signal as None (see the sketch below).
I will need to look a bit more at the communication; it looks mostly fine.
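To make that last bullet concrete, here is a hedged, self-contained sketch of the communication scheme it describes (all names are hypothetical, not the PR's actual modules): task numbers go to a worker as requests through one pipe, information about ready data comes back through another, and None serves as the stop signal.

import multiprocessing

def worker(task_conn, res_conn):
    while True:
        task = task_conn.recv()
        if task is None:  # stop signal
            break
        res_conn.send(("done", task))  # report the processed task back

if __name__ == "__main__":
    task_r, task_w = multiprocessing.Pipe(duplex=False)
    res_r, res_w = multiprocessing.Pipe(duplex=False)
    p = multiprocessing.Process(target=worker, args=(task_r, res_w))
    p.start()
    task_w.send(0)       # request processing of task 0
    print(res_r.recv())  # ("done", 0) once the worker finishes
    task_w.send(None)    # stop signal
    p.join()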
dali/util/shared_mem.cc (Outdated)

ShmFdWrapper::ShmFdWrapper() {
  static std::minstd_rand rand_suffix((unsigned)time(NULL) * getpid());
  std::string name;
  int sanity_counter = 0;
I'm not sure how to feel about this.
Yeah, I was in doubt whether to go there, but the alternatives - failing on the first conflict or retrying in a while-true loop - didn't seem very nice either. It's up to you, I can change it in either direction.
using mkstemp instead of the loop
dali/util/shared_mem.cc (Outdated)

if (fd < 0) {
  throw std::runtime_error("shm_open call failed");
}
shm_unlink(name.c_str());
I think the behaviour of unlinking the file but passing the fd around may not be obvious to a random reader; I would like to see a comment on why you unlink it here.
added a comment
dali/util/shared_mem.h (Outdated)

namespace dali {
namespace python {
This file needs some docstrings: what is meant to create handles, what allocates, and what just stores.
done
@@ -94,6 +98,23 @@ def reset_indices(self):
  self.current_iter = 0
  self.current_sample = 0

def schedule_batch(self, pipeline, pool, context_i, batch_size):
Hmm, I think most of the stuff in ExternalSource should be private interface. Not sure how much we can move back, but certainly it's not part of the public API other than __init__ and __call__.
The entire _ExternalSourceGroup is private.
OK, so I am leaving it as it is for now.
dali/python/nvidia/dali/pool.py (Outdated)

del self.batch_pool[batch.mem_chunk_id]
fd, shm_chunk = -1, None
try:
    [fd] = multiprocessing.reduction.recvfds(sock, 1)
Where did you find this thing :P
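For context, a minimal sketch of what recvfds does, paired with its counterpart sendfds from the same undocumented multiprocessing.reduction module (POSIX only; the fork start method is assumed so the child inherits the socket; everything else here is illustrative):

import multiprocessing
import multiprocessing.reduction
import os
import socket

def child(sock):
    # create a descriptor in the worker and ship it over the Unix socket
    r, w = os.pipe()
    os.write(w, b"hello from the worker")
    multiprocessing.reduction.sendfds(sock, [r])

if __name__ == "__main__":
    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    p = multiprocessing.Process(target=child, args=(child_sock,))
    p.start()
    # receive one fd; it arrives as a valid descriptor in *this* process
    [fd] = multiprocessing.reduction.recvfds(parent_sock, 1)
    print(os.read(fd, 64))
    os.close(fd)
    p.join()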
dali/python/nvidia/dali/pool.py (Outdated)

    os.close(fd)
    raise
chunk = MemChunk(shm_chunk, batch.chunk_version, batch.capacity)
self.batch_pool[batch.mem_chunk_id] = chunk
What are mem_chunk_id and chunk_version?
When a worker starts, it creates a pool (or rather a cyclic buffer) of shared memory chunks for each external source it supports. mem_chunk_id is a key identifying those chunks in the communication between the worker and the main process. This way the receiving process knows whether it already has a given chunk mmapped or needs to receive a new file descriptor through the socket, etc.
I got rid of chunk_version when applying the review remarks.
dali/python/nvidia/dali/pool.py (Outdated)

def add_mem_chunk(self, sock, batch):
    chunk = self.batch_pool.get(batch.mem_chunk_id)
    if chunk is not None:
        if chunk.version == batch.chunk_version:
Do we receive the same batch several times?
It doesn't relate to the batch but rather to the underlying shared memory chunk. Anyway, I got rid of this counter; I simply check whether the expected capacity of the chunk changed - if so, the receiving process knows it needs to adjust it.
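To illustrate the scheme this thread converged on, a simplified sketch (the names follow the snippets above; the details are assumptions): chunks are cached per process by mem_chunk_id, a file descriptor is received over the socket only for chunks not seen yet, and a changed capacity triggers a remap.

import mmap
import multiprocessing.reduction
import os

class ChunkCache:
    def __init__(self):
        self.batch_pool = {}  # mem_chunk_id -> (capacity, mmap object)

    def get_chunk(self, sock, mem_chunk_id, capacity):
        entry = self.batch_pool.get(mem_chunk_id)
        if entry is not None:
            old_capacity, buf = entry
            if old_capacity == capacity:
                return buf  # already mapped with the right size, nothing to receive
            buf.close()  # capacity changed: drop the stale mapping
        # unknown or resized chunk: the sender ships a (new) descriptor
        [fd] = multiprocessing.reduction.recvfds(sock, 1)
        buf = mmap.mmap(fd, capacity)
        os.close(fd)  # mmap duplicated the descriptor internally
        self.batch_pool[mem_chunk_id] = (capacity, buf)
        return buf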
dali/python/nvidia/dali/pool.py (Outdated)

self.rec_pipes = self.pool.res_pipes + [self.pool.from_tracker]

@classmethod
def from_groups(cls, groups, workers_no, init_method, keep_alive_queue_size, initial_chunk_size=1024 * 1024):
I guess at some point initial_chunk_size will be configurable, as I see it's not yet set in pipeline.py. Maybe let it stay how it is right now.
Sure, it should be straightforward; I just wasn't sure if we need it, whether it should be a per-external-source parameter or a single one per pipeline, and what to call it.
dali/python/nvidia/dali/pool.py (Outdated)

batch_i, tasks = context.scheduled.popitem(last=False)
awaiting_batch = context.partially_received[batch_i]
while len(awaiting_batch) < len(tasks) and batch_i not in context.iter_ended:
    self._receive_chunk()
Random question: can we somehow get starved/blocked by consuming batches after batch_i that come from other workers? I guess in the end the "lazy" worker will send us the last part of the batch, and when receiving the next batch we will already have most of it in that case, right?
Yes, that's exactly the idea. Even if we receive some other parts along the way, they will simply be stored, ready to be collected by the right external source group in the right order. The worker will finally send its part or will die; either way we should not wait forever.
CI MESSAGE: [1915700]: BUILD STARTED
CI MESSAGE: [1915700]: BUILD PASSED
dali/python/backend_impl.cc (Outdated)

DALI_ENFORCE(cuInitChecked(),
             "Failed to load libcuda.so. "
             "Check your library paths and if the driver is installed correctly.");
then:

Suggested change:
-  DALI_ENFORCE(cuInitChecked(),
-               "Failed to load libcuda.so. "
-               "Check your library paths and if the driver is installed correctly.");
+  if (!cuInitChecked())
+    return false;

?
done
return False
call = getattr(x, "__call__", None)
return _is_generator_function(call)

def _accepted_args_count(callable):
Grammar Nazi attack!
Suggested change:
-def _accepted_args_count(callable):
+def _accepted_arg_count(callable):
done
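As a side note, a hedged sketch of what such a helper might look like using inspect.signature; the actual implementation in this PR may differ.

import inspect

def _accepted_arg_count(callable_obj):
    # count the positional parameters the callable accepts
    sig = inspect.signature(callable_obj)
    positional = (inspect.Parameter.POSITIONAL_ONLY,
                  inspect.Parameter.POSITIONAL_OR_KEYWORD)
    return sum(1 for p in sig.parameters.values() if p.kind in positional)

print(_accepted_arg_count(lambda a, b: None))  # 2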
dali/python/backend_impl.cc (Outdated)

.def_property_readonly("fd", &SharedMem::fd)
.def("buf",
     [](SharedMem *shm) {
       auto ptr = shm->ptr();
How do you know that shm != nullptr? I would add a check.
added the check
dali/util/shared_mem.cc (Outdated)

int sanity_counter = 0;
do {
  std::stringstream ss;
  ss << "/nvidia_dali_" << rand_suffix();
Hmm, sounds like exactly what I need. On the other hand, though, the BUGS section of the man page says never to use it. :D Don't know how to feel about it.
Then maybe https://man7.org/linux/man-pages/man3/mkstemp.3.html ?
It seems to work! The only thing that bothers me a little with this approach is that I need to pass the full path to mkstemp, pointing at /dev/shm, which is the place where shm_open would create the file. I wonder if that's portable enough for us.
I think it should be fine.
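The pattern this thread converged on, sketched in Python for brevity (the actual PR code uses mkstemp(3) in C++; /dev/shm, the prefix, and the 1 MB size follow the discussion above): create a uniquely named file where shm_open would place it, unlink the name immediately, then size and map it through the still-open descriptor.

import mmap
import os
import tempfile

fd, path = tempfile.mkstemp(prefix="nvidia_dali_", dir="/dev/shm")
# the file stays fully usable through fd even though its name is gone,
# so nothing leaks in /dev/shm if the process dies unexpectedly
os.unlink(path)
os.ftruncate(fd, 1024 * 1024)  # size the chunk (1 MB, matching initial_chunk_size)
buf = mmap.mmap(fd, 1024 * 1024)
buf[:5] = b"hello"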
dali/util/shared_mem.h (Outdated)

};

class DLL_PUBLIC MapMemWrapper {
  uint64_t size_;
I think that usually the private members go at the end.
done
Force-pushed from 6deb921 to d036bd6
CI MESSAGE: [2125074]: BUILD STARTED
dali/python/nvidia/dali/worker.py (Outdated)

if processed_task is None:
    self.ready_queue.insert(0, None)
else:
    # assert len(ready_queue) < prefetch_queue_depths[scheduled.context_i], "Worker queue size exceeded."
?
Hmm, yeah, the assert was there before the refactor, and after the refactor the data got encapsulated away from me.
TBH, it doesn't make much sense, as the queue of the thread that sends back the data can be longer than the prefetch_queue_depth for a given callback (since it can have accumulated data from several callbacks).
So maybe just remove it?
Removed.
dali/python/nvidia/dali/worker.py (Outdated)

self.tasks_cv = threading.Condition()
self.tasks_queue = []

def get_task(self):
It seems that get_task and _insert_task match what dispatch and _wait_for_processed do. It could be extracted to a common class. Just an idea.
I think it's easier to just repeat it here at this point; otherwise we would add more code with a wrapper and rules for how to use it, I feel. It's not like we want this to look like CRTP.
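For reference, the shape of the pair under discussion, reduced to a self-contained sketch (details assumed): a condition variable guards a plain list, and get_task blocks until something has been inserted.

import threading

class TaskQueue:
    def __init__(self):
        self.tasks_cv = threading.Condition()
        self.tasks_queue = []

    def _insert_task(self, task):
        with self.tasks_cv:
            self.tasks_queue.append(task)
            self.tasks_cv.notify()  # wake one waiting consumer

    def get_task(self):
        with self.tasks_cv:
            # wait until a task is available, then hand it out in FIFO order
            while not self.tasks_queue:
                self.tasks_cv.wait()
            return self.tasks_queue.pop(0)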
# When initializing DALI, we do the following in order:
# * Discover the ops specified in Python, group the ExternalSources (_build_graph())
# * Start the Python workers pool (_start_py_workers())
# * Construct the C++ Pipeline backend and pass the graph to it (_init_pipeline_backend())
# * Build the pipeline (_pipe.Build())
self._py_graph_built = False
self._py_pool_started = False
self._backend_prepared = False
Maybe instead of having a plethora of variables and checking all of them every time, we should have an entity that would track the state when queried?
I tried to group them in one place. Do we want some kind of state machine for that? Maybe a good idea for a follow-up.
Yep, follow up it is.
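One possible shape for that follow-up, sketched here as a hypothetical illustration (this is not what the PR implements): the separate booleans collapse into a single ordered state that can be queried with a comparison.

from enum import IntEnum

class PipelineState(IntEnum):
    # states are ordered, so "at least X" is a single comparison
    CREATED = 0
    PY_GRAPH_BUILT = 1
    PY_POOL_STARTED = 2
    BACKEND_PREPARED = 3
    BUILT = 4

state = PipelineState.PY_POOL_STARTED
assert state >= PipelineState.PY_GRAPH_BUILT  # replaces checking several flags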
dali/python/nvidia/dali/pool.py (Outdated)

try:
    self._to_tracker.send(None)
except BrokenPipeError:
    """workers already exited, tracker_thread finished its task and exited and closed the pipe"""
Is the docstring equivalent to pass?
Probably it is. I will turn it into a comment (and check that this is what it does).
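For the record: a bare string expression in an except body is evaluated and discarded, so it is indeed equivalent to pass. The agreed fix, demonstrated as a self-contained sketch (the Pipe here just simulates the tracker having exited):

import multiprocessing

to_tracker, from_main = multiprocessing.Pipe()
from_main.close()  # simulate the tracker thread having closed its end
try:
    to_tracker.send(None)
except BrokenPipeError:
    # workers already exited, tracker_thread finished its task and closed the pipe
    pass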
CI MESSAGE: [2125074]: BUILD FAILED
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
This pull request introduces 2 alerts when merging 0cddf7e into 8736bf4 - view on LGTM.com
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
CI MESSAGE: [2128207]: BUILD STARTED
CI MESSAGE: [2128216]: BUILD STARTED
CI MESSAGE: [2128216]: BUILD FAILED
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Force-pushed from 02528c5 to faef7d0
!build
CI MESSAGE: [2128464]: BUILD STARTED
CI MESSAGE: [2128799]: BUILD STARTED
CI MESSAGE: [2128464]: BUILD PASSED
CI MESSAGE: [2128799]: BUILD PASSED
Why do we need this PR?
It adds an option to run per-sample external source callbacks in process-based Python workers.
What happened in this PR?
- Added process-based workers using the multiprocessing module, a custom wrapper around shared memory and mmap to avoid unnecessary copies when passing data between workers, use of the no_copy mode of external source, and prefetching of batches.
- Mostly Python wrappers around the pipeline and external source. Added the shared_mem.cc util.
- Prepared a benchmark test comparing the parallelized external source with the CPU FileReader and with the sequential external source, both in training and in a plain pipeline that just augments the data.
- Added descriptions of the relevant parameters to ExternalSource and Pipeline. Documented the shared_mem, shared_batch, worker, and pool modules.
DALI-1651
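A hedged usage sketch of the feature this PR introduces. The parameter names (parallel, py_num_workers, py_start_method) and the SampleInfo callback argument follow the public DALI API for parallel external source as it shipped; the exact spelling in this PR's revision may differ.

import numpy as np
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn

def my_callback(sample_info):
    # per-sample callback, executed in a worker process
    return np.full((2, 2), sample_info.idx_in_epoch, dtype=np.int32)

if __name__ == "__main__":
    pipe = Pipeline(batch_size=4, num_threads=2, device_id=0,
                    py_num_workers=2, py_start_method="spawn")
    with pipe:
        data = fn.external_source(source=my_callback, parallel=True, batch=False)
        pipe.set_outputs(data)
    pipe.build()
    out, = pipe.run()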