Run external source callback in parallel #2543
Conversation
Force-pushed from 06d4128 to 5d9b5f7
!build
CI MESSAGE: [1893889]: BUILD STARTED
This pull request introduces 7 alerts when merging 5d9b5f70e6e41c487b166ad3d09947177caab9d7 into 5d5844b - view on LGTM.com
CI MESSAGE: [1893889]: BUILD FAILED
Force-pushed from 4f815a1 to 93fdbb1
!build
CI MESSAGE: [1900108]: BUILD STARTED
CI MESSAGE: [1900108]: BUILD PASSED
!build
CI MESSAGE: [1900572]: BUILD STARTED
CI MESSAGE: [1900572]: BUILD PASSED
include/dali/core/device_guard.h (Outdated)

@@ -35,6 +35,9 @@ class DLL_PUBLIC DeviceGuard {
  // for device id < 0 it is no-op
  explicit DeviceGuard(int new_device);
  ~DeviceGuard();

  bool has_old_context();
I think this function pollutes a utility that has a well-defined purpose. Moreover, it will not work well if new_device is -1, so it somewhat breaks the contract here.
done
dali/python/backend_impl.cc (Outdated)

@@ -1117,6 +1118,22 @@ PYBIND11_MODULE(backend_impl, m) {

  m.def("GetCxx11AbiFlag", &GetCxx11AbiFlag);

  m.def("HasCudaContext", []{
    return DeviceGuard{}.has_old_context();
Don't abuse DeviceGuard for this.
Suggested change:
-  return DeviceGuard{}.has_old_context();
+  DALI_ENFORCE(cuInitChecked(),
+               "Failed to load libcuda.so. "
+               "Check your library paths and if the driver is installed correctly.");
+  CUcontext ctx;
+  CUDA_CALL(cuCtxGetCurrent(&ctx));
+  return ctx != nullptr;
I'd say that if we didn't load CUDA, that's still OK, and we can return that we don't have the context. There is a CPU-only mode that may not want to load CUDA (because it's not there), so cuInitChecked() will return 0.
done
!build
CI MESSAGE: [1914410]: BUILD STARTED
CI MESSAGE: [1914410]: BUILD FAILED
Force-pushed from 9e99ac6 to 6a826d6
!build
Leaving some comments and some questions.
I would be glad for some documentation; especially nice would be:
- the purpose of what is stored in those wrappers like MemChunk, SharedBatchSerialized, etc.
- what some of the indices mean: context_i, mem_chunk_id, chunk_version, etc.
- maybe it would be really nice to have an overview of what a worker does: what part of the batch it is expected to process and how the batch is formed back again. Alternatively, how the communication is handled -> task numbers going as requests for processing through some pipes, info about ready data in other pipes, and the stop signal as None (see the sketch below).
I will need to look a bit more at the communication; it looks mostly fine.
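To make that last bullet concrete, here is a hedged, self-contained sketch of the communication scheme it describes (all names are hypothetical, not the PR's actual modules): task numbers go to a worker as requests through one pipe, information about ready data comes back through another, and None serves as the stop signal.

import multiprocessing

def worker(task_conn, res_conn):
    while True:
        task = task_conn.recv()
        if task is None:  # stop signal
            break
        res_conn.send(("done", task))  # report the processed task back

if __name__ == "__main__":
    task_r, task_w = multiprocessing.Pipe(duplex=False)
    res_r, res_w = multiprocessing.Pipe(duplex=False)
    p = multiprocessing.Process(target=worker, args=(task_r, res_w))
    p.start()
    task_w.send(0)       # request processing of task 0
    print(res_r.recv())  # ("done", 0) once the worker finishes
    task_w.send(None)    # stop signal
    p.join()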
dali/util/shared_mem.cc (Outdated)

ShmFdWrapper::ShmFdWrapper() {
  static std::minstd_rand rand_suffix((unsigned)time(NULL) * getpid());
  std::string name;
  int sanity_counter = 0;
I'm not sure how to feel about this.
Yeah, I was in doubt whether to go there, but the alternatives - failing on the first conflict or retrying in a while-true loop - didn't seem very nice either. It's up to you, I can change it in either direction.
using mkstemp instead of the loop
dali/util/shared_mem.cc (Outdated)

if (fd < 0) {
  throw std::runtime_error("shm_open call failed");
}
shm_unlink(name.c_str());
I think the behaviour of unlinking the file but passing the fd around may not be obvious to a random reader; I would like to see a comment on why you unlink it here.
added a comment
dali/util/shared_mem.h (Outdated)

namespace dali {
namespace python {
This file needs some docstrings: what is meant to create handles, what allocates, and what just stores.
done
@@ -94,6 +98,23 @@ def reset_indices(self):
  self.current_iter = 0
  self.current_sample = 0

def schedule_batch(self, pipeline, pool, context_i, batch_size):
Hmm, I think most of the stuff in ExternalSource should be private interface. Not sure how much we can move back, but certainly it's not part of the public API other than __init__ and __call__.
The entire _ExternalSourceGroup is private.
OK, so I am leaving it as it is for now.
dali/python/nvidia/dali/pool.py (Outdated)

del self.batch_pool[batch.mem_chunk_id]
fd, shm_chunk = -1, None
try:
    [fd] = multiprocessing.reduction.recvfds(sock, 1)
Where did you find this thing :P
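For context, a minimal sketch of what recvfds does, paired with its counterpart sendfds from the same undocumented multiprocessing.reduction module (POSIX only; the fork start method is assumed so the child inherits the socket; everything else here is illustrative):

import multiprocessing
import multiprocessing.reduction
import os
import socket

def child(sock):
    # create a descriptor in the worker and ship it over the Unix socket
    r, w = os.pipe()
    os.write(w, b"hello from the worker")
    multiprocessing.reduction.sendfds(sock, [r])

if __name__ == "__main__":
    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    p = multiprocessing.Process(target=child, args=(child_sock,))
    p.start()
    # receive one fd; it arrives as a valid descriptor in *this* process
    [fd] = multiprocessing.reduction.recvfds(parent_sock, 1)
    print(os.read(fd, 64))
    os.close(fd)
    p.join()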
dali/python/nvidia/dali/pool.py (Outdated)

    os.close(fd)
    raise
chunk = MemChunk(shm_chunk, batch.chunk_version, batch.capacity)
self.batch_pool[batch.mem_chunk_id] = chunk
What are mem_chunk_id and chunk_version?
When a worker starts, it creates a pool (or rather a cyclic buffer) of shared memory chunks for each external source it supports. mem_chunk_id is a key identifying those chunks in the communication between the worker and the main process. This way the receiving process knows whether it already has a given chunk mmapped or needs to receive a new file descriptor through the socket, etc.
I got rid of chunk_version when applying the review remarks.
dali/python/nvidia/dali/pool.py (Outdated)

def add_mem_chunk(self, sock, batch):
    chunk = self.batch_pool.get(batch.mem_chunk_id)
    if chunk is not None:
        if chunk.version == batch.chunk_version:
Do we receive the same batch several times?
It doesn't relate to the batch but rather to the underlying shared memory chunk. Anyway, I got rid of this counter; I simply check whether the expected capacity of the chunk changed - if so, the receiving process knows it needs to adjust it.
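To illustrate the scheme this thread converged on, a simplified sketch (the names follow the snippets above; the details are assumptions): chunks are cached per process by mem_chunk_id, a file descriptor is received over the socket only for chunks not seen yet, and a changed capacity triggers a remap.

import mmap
import multiprocessing.reduction
import os

class ChunkCache:
    def __init__(self):
        self.batch_pool = {}  # mem_chunk_id -> (capacity, mmap object)

    def get_chunk(self, sock, mem_chunk_id, capacity):
        entry = self.batch_pool.get(mem_chunk_id)
        if entry is not None:
            old_capacity, buf = entry
            if old_capacity == capacity:
                return buf  # already mapped with the right size, nothing to receive
            buf.close()  # capacity changed: drop the stale mapping
        # unknown or resized chunk: the sender ships a (new) descriptor
        [fd] = multiprocessing.reduction.recvfds(sock, 1)
        buf = mmap.mmap(fd, capacity)
        os.close(fd)  # mmap duplicated the descriptor internally
        self.batch_pool[mem_chunk_id] = (capacity, buf)
        return buf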
dali/python/nvidia/dali/pool.py (Outdated)

self.rec_pipes = self.pool.res_pipes + [self.pool.from_tracker]

@classmethod
def from_groups(cls, groups, workers_no, init_method, keep_alive_queue_size, initial_chunk_size=1024 * 1024):
I guess at some point initial_chunk_size will be configurable, as I see it's not yet set in pipeline.py. Maybe let it stay how it is right now.
Sure, it should be straightforward; I just wasn't sure if we need it, whether it should be a per-external-source parameter or a single one per pipeline, and what to call it.
dali/python/nvidia/dali/pool.py (Outdated)

batch_i, tasks = context.scheduled.popitem(last=False)
awaiting_batch = context.partially_received[batch_i]
while len(awaiting_batch) < len(tasks) and batch_i not in context.iter_ended:
    self._receive_chunk()
Random question: can we somehow get starved/blocked by consuming batches after batch_i that come from other workers? I guess in the end the "lazy" worker will send us the last part of the batch, and when receiving the next batch we will already have most of it in that case, right?
Yes, that's exactly the idea. Even if we receive some other parts along the way, they will simply be stored, ready to be collected by the right external source group in the right order. The worker will finally send its part or will die; either way we should not wait forever.
CI MESSAGE: [1915700]: BUILD STARTED
CI MESSAGE: [1915700]: BUILD PASSED
dali/python/backend_impl.cc (Outdated)

DALI_ENFORCE(cuInitChecked(),
             "Failed to load libcuda.so. "
             "Check your library paths and if the driver is installed correctly.");
then:

Suggested change:
-  DALI_ENFORCE(cuInitChecked(),
-               "Failed to load libcuda.so. "
-               "Check your library paths and if the driver is installed correctly.");
+  if (!cuInitChecked())
+    return false;

?
done
return False
call = getattr(x, "__call__", None)
return _is_generator_function(call)

def _accepted_args_count(callable):
Grammar Nazi attack!
Suggested change:
-def _accepted_args_count(callable):
+def _accepted_arg_count(callable):
done
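As a side note, a hedged sketch of what such a helper might look like using inspect.signature; the actual implementation in this PR may differ.

import inspect

def _accepted_arg_count(callable_obj):
    # count the positional parameters the callable accepts
    sig = inspect.signature(callable_obj)
    positional = (inspect.Parameter.POSITIONAL_ONLY,
                  inspect.Parameter.POSITIONAL_OR_KEYWORD)
    return sum(1 for p in sig.parameters.values() if p.kind in positional)

print(_accepted_arg_count(lambda a, b: None))  # 2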
dali/python/backend_impl.cc (Outdated)

.def_property_readonly("fd", &SharedMem::fd)
.def("buf",
     [](SharedMem *shm) {
       auto ptr = shm->ptr();
How do you know that shm != nullptr? I would add a check.
added the check
dali/util/shared_mem.cc (Outdated)

int sanity_counter = 0;
do {
  std::stringstream ss;
  ss << "/nvidia_dali_" << rand_suffix();
Hmm, sounds like exactly what I need. On the other hand, though, the BUGS section of the man page says never to use it. :D Don't know how to feel about it.
Then maybe https://man7.org/linux/man-pages/man3/mkstemp.3.html ?
It seems to work! The only thing that bothers me a little with this approach is that I need to pass the full path to mkstemp, pointing at /dev/shm, which is the place where shm_open would create the file. I wonder if that's portable enough for us.
I think it should be fine.
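The pattern this thread converged on, sketched in Python for brevity (the actual PR code uses mkstemp(3) in C++; /dev/shm, the prefix, and the 1 MB size follow the discussion above): create a uniquely named file where shm_open would place it, unlink the name immediately, then size and map it through the still-open descriptor.

import mmap
import os
import tempfile

fd, path = tempfile.mkstemp(prefix="nvidia_dali_", dir="/dev/shm")
# the file stays fully usable through fd even though its name is gone,
# so nothing leaks in /dev/shm if the process dies unexpectedly
os.unlink(path)
os.ftruncate(fd, 1024 * 1024)  # size the chunk (1 MB, matching initial_chunk_size)
buf = mmap.mmap(fd, 1024 * 1024)
buf[:5] = b"hello"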
dali/util/shared_mem.h (Outdated)

};

class DLL_PUBLIC MapMemWrapper {
  uint64_t size_;
I think that usually the private members go at the end.
done
Force-pushed from 6deb921 to d036bd6
CI MESSAGE: [2125074]: BUILD STARTED
dali/python/nvidia/dali/worker.py (Outdated)

if processed_task is None:
    self.ready_queue.insert(0, None)
else:
    # assert len(ready_queue) < prefetch_queue_depths[scheduled.context_i], "Worker queue size exceeded."
?
Hmm, yeah, the assert was there before the refactor, and after the refactor the data got encapsulated away from me.
TBH, it doesn't make much sense, as the queue of the thread that sends back the data can be longer than the prefetch_queue_depth for a given callback (since it can have accumulated data from several callbacks).
So maybe just remove it?
Removed.
dali/python/nvidia/dali/worker.py (Outdated)

self.tasks_cv = threading.Condition()
self.tasks_queue = []

def get_task(self):
It seems that get_task and _insert_task match what dispatch and _wait_for_processed do. It could be extracted to a common class. Just an idea.
I think it's easier to just repeat it here at this point; otherwise we would add more code with a wrapper and rules for how to use it, I feel. It's not like we want this to look like CRTP.
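For reference, the shape of the pair under discussion, reduced to a self-contained sketch (details assumed): a condition variable guards a plain list, and get_task blocks until something has been inserted.

import threading

class TaskQueue:
    def __init__(self):
        self.tasks_cv = threading.Condition()
        self.tasks_queue = []

    def _insert_task(self, task):
        with self.tasks_cv:
            self.tasks_queue.append(task)
            self.tasks_cv.notify()  # wake one waiting consumer

    def get_task(self):
        with self.tasks_cv:
            # wait until a task is available, then hand it out in FIFO order
            while not self.tasks_queue:
                self.tasks_cv.wait()
            return self.tasks_queue.pop(0)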
# When initializing DALI, we do the following in order:
# * Discover the ops specified in Python, group the ExternalSources (_build_graph())
# * Start the Python workers pool (_start_py_workers())
# * Construct the C++ Pipeline backend and pass the graph to it (_init_pipeline_backend())
# * Build the pipeline (_pipe.Build())
self._py_graph_built = False
self._py_pool_started = False
self._backend_prepared = False
Maybe instead of having a plethora of variables and checking all of them every time, we should have an entity that would track the state when queried?
I tried to group them in one place. Do we want some kind of state machine for that? Maybe a good idea for a follow-up.
Yep, follow up it is.
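One possible shape for that follow-up, sketched here as a hypothetical illustration (this is not what the PR implements): the separate booleans collapse into a single ordered state that can be queried with a comparison.

from enum import IntEnum

class PipelineState(IntEnum):
    # states are ordered, so "at least X" is a single comparison
    CREATED = 0
    PY_GRAPH_BUILT = 1
    PY_POOL_STARTED = 2
    BACKEND_PREPARED = 3
    BUILT = 4

state = PipelineState.PY_POOL_STARTED
assert state >= PipelineState.PY_GRAPH_BUILT  # replaces checking several flags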
dali/python/nvidia/dali/pool.py (Outdated)

try:
    self._to_tracker.send(None)
except BrokenPipeError:
    """workers already exited, tracker_thread finished its task and exited and closed the pipe"""
Is the docstring equivalent to pass?
Probably it is. I will turn it into a comment (and check that this is what it does).
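For the record: a bare string expression in an except body is evaluated and discarded, so it is indeed equivalent to pass. The agreed fix, demonstrated as a self-contained sketch (the Pipe here just simulates the tracker having exited):

import multiprocessing

to_tracker, from_main = multiprocessing.Pipe()
from_main.close()  # simulate the tracker thread having closed its end
try:
    to_tracker.send(None)
except BrokenPipeError:
    # workers already exited, tracker_thread finished its task and closed the pipe
    pass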
CI MESSAGE: [2125074]: BUILD FAILED
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
This pull request introduces 2 alerts when merging 0cddf7e into 8736bf4 - view on LGTM.com
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
CI MESSAGE: [2128207]: BUILD STARTED
CI MESSAGE: [2128216]: BUILD STARTED
CI MESSAGE: [2128216]: BUILD FAILED
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Force-pushed from 02528c5 to faef7d0
!build
CI MESSAGE: [2128464]: BUILD STARTED
CI MESSAGE: [2128799]: BUILD STARTED
CI MESSAGE: [2128464]: BUILD PASSED
CI MESSAGE: [2128799]: BUILD PASSED
Why do we need this PR?
It adds an option to run per-sample external source callbacks in process-based Python workers.
What happened in this PR?
- Added process-based workers using the multiprocessing module, a custom wrapper around shared memory and mmap to avoid unnecessary copies when passing data between workers, use of the no_copy mode of external source, and prefetching of batches.
- Mostly Python wrappers around the pipeline and external source. Added the shared_mem.cc util.
- Prepared a benchmark test comparing the parallelized external source with the CPU FileReader and with the sequential external source, both in training and in a plain pipeline that just augments the data.
- Added descriptions of the relevant parameters to ExternalSource and Pipeline. Documented the shared_mem, shared_batch, worker, and pool modules.
DALI-1651
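A hedged usage sketch of the feature this PR introduces. The parameter names (parallel, py_num_workers, py_start_method) and the SampleInfo callback argument follow the public DALI API for parallel external source as it shipped; the exact spelling in this PR's revision may differ.

import numpy as np
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn

def my_callback(sample_info):
    # per-sample callback, executed in a worker process
    return np.full((2, 2), sample_info.idx_in_epoch, dtype=np.int32)

if __name__ == "__main__":
    pipe = Pipeline(batch_size=4, num_threads=2, device_id=0,
                    py_num_workers=2, py_start_method="spawn")
    with pipe:
        data = fn.external_source(source=my_callback, parallel=True, batch=False)
        pipe.set_outputs(data)
    pipe.build()
    out, = pipe.run()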