
Add support for variable batch size debug mode and NVTX ranges #3799

Merged: 4 commits merged into NVIDIA:main on Apr 12, 2022

Conversation

@ksztenderski (Contributor) commented Apr 6, 2022

Category:

New feature (non-breaking change which adds functionality)

Description:

Adds support for NVTX ranges in eager operators.
Adds support for setting the iteration batch size based on the outputs of external_source.
This is limited to cases where external_source is the first operator in the pipeline.
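
Below is a minimal sketch of the supported pattern (a hypothetical pipeline, not code from this PR; it assumes the experimental pipeline_def that accepts debug=True and an illustrative source callback):

```python
import numpy as np
from nvidia.dali import fn
from nvidia.dali.pipeline.experimental import pipeline_def


@pipeline_def(batch_size=8, num_threads=3, device_id=0, debug=True)
def variable_batch_pipeline():
    # external_source runs first, so the batch it returns defines the batch
    # size of this iteration (it must not exceed the max batch_size of 8).
    data = fn.external_source(
        source=lambda info: [np.full((2, 2), info.iteration, dtype=np.float32)
                             for _ in range(info.iteration % 8 + 1)],
        batch=True, batch_info=True)
    # Subsequent operators adopt that batch size for the iteration.
    return data * 2.0


pipe = variable_batch_pipeline()
pipe.build()
for _ in range(3):
    (out,) = pipe.run()  # each iteration may have a different batch size
```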

Why is it impossible or inconvenient to support an external_source that is defined later in the pipeline?

For this to work, we would need an execution flow similar to the one in standard mode (fetching data from external_source at the start of each run). That means we would first have to build the pipeline to collect all the external_source callbacks. The problem with this approach is that debug mode allows accessing and modifying the data inside the pipeline during the build, so we have to operate on the actual data; we cannot use placeholders as in the standard build.

Why can we not run the pipeline once to build the graph and collect external_source operators?

  • We cannot do this during the pipeline build, because the external_source may not have been fed with data yet.
  • In the first iteration we could run the pipeline twice: once to build the graph and collect the external_source operators, and then a regular run. That way we would introduce possible garbage outputs from the build phase, and we would have to account for every iteration-dependent state (someone may use plain Python code inside the pipeline that depends on the number of executions; see the sketch below) - it would probably be impossible to catch all of these cases.
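
For example (hypothetical code, not from this PR), a stateful source like the one below depends on how many times it has been called; an extra graph-collection run would silently advance its state before the first real iteration:

```python
import numpy as np


class CountingSource:
    """Hypothetical source whose output depends on its call count."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1  # an extra "build" run would bump this too
        return [np.array([self.calls], dtype=np.int32)] * 4
```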

Additional information:

Affected modules and functionalities:

debug mode pipeline

Key points relevant for the review:

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

External source can define the iteration batch size when it is the first operator in
the pipeline.

Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
@ksztenderski (Contributor Author)

!build

@lgtm-com (bot) commented Apr 6, 2022

This pull request introduces 1 alert when merging 717cd54 into 0fafc72 - view on LGTM.com

new alerts:

  • 1 for Unused import

Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
@dali-automaton (Collaborator)

CI MESSAGE: [4443163]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4443163]: BUILD PASSED

@@ -44,6 +44,7 @@ class DLL_PUBLIC PipelineDebug {

DLL_PUBLIC void AddOperator(OpSpec &spec, int logical_id) {
FillOpSpec(spec);
std::string op_name = "__debug__" + spec.name() + "_" + std::to_string(logical_id);
Contributor: Unused.

Contributor Author: Removed (an artifact of an earlier version).

Comment on lines 62 to 68
: max_batch_size_(spec.GetArgument<int>("max_batch_size")),
op_spec_(spec),
name_(spec.name()),
op_(InstantiateOperator(spec)) {
num_outputs_ = op_spec_.GetSchema().CalculateOutputs(op_spec_) +
op_spec_.GetSchema().CalculateAdditionalOutputs(op_spec_);
}
Contributor: I think you can delegate to the other constructor to avoid repetition.

Suggested change
: max_batch_size_(spec.GetArgument<int>("max_batch_size")),
op_spec_(spec),
name_(spec.name()),
op_(InstantiateOperator(spec)) {
num_outputs_ = op_spec_.GetSchema().CalculateOutputs(op_spec_) +
op_spec_.GetSchema().CalculateAdditionalOutputs(op_spec_);
}
: EagerOperator(spec, spec.name()) {}

@ksztenderski (Contributor Author) commented Apr 8, 2022: Of course, done. It was a quick check and I forgot to change that.

@@ -123,24 +140,26 @@ template <>
std::vector<std::shared_ptr<TensorList<CPUBackend>>> EagerOperator<CPUBackend>::Run(
const std::vector<std::shared_ptr<TensorList<CPUBackend>>> &inputs,
const std::unordered_map<std::string, std::shared_ptr<TensorList<CPUBackend>>> &kwargs,
ThreadPool *thread_pool) {
ThreadPool *thread_pool, int batch_size) {
DomainTimeRange tr("[DALI][CPU op] " + name_, DomainTimeRange::kBlue1);
Contributor: I won't insist on doing it in a separate PR, but it would be nice to note in the commit message (and PR description) that we added the NVTX ranges as well.

Contributor Author: Done.

Comment on lines 228 to 234
DALI_ENFORCE(cur_batch_size == batch_size,
make_string("Expected uniform batch size in a single operator. Expected: ",
batch_size, ", got: ", cur_batch_size));
DALI_ENFORCE(
cur_batch_size <= max_batch_size_,
make_string("Expected batch size lower or equal to max batch size. Expected at most: ",
max_batch_size_, ", got: ", batch_size));
Contributor: Maybe those errors could mention the index of the offending input? In the first one, we can write that the first input has batch size <batch_size> and this input has <batch_size>. In the second, we can say which input's batch is too big.

Also, when throwing errors for debug mode from the backend, would it be good to introduce something similar to HandleError from the executor? (I guess it won't apply here.) It can be some other mechanism, but it would be nice to let the user know where in their pipeline the error occurred. Or will it already show the name of the Python wrapper, making all of this unnecessary?

Contributor: Also, do you want to add similar checks post-run, for sanity?

Contributor Author: Done.

@@ -314,13 +315,19 @@ def is_primitive_type(x):
return False, 'cpu', data


class _BatchInfo:
Contributor: 👍

@@ -533,6 +549,7 @@ def run(self):

self._debug_on = True
self._cur_operator_id = -1
self._cur_batch_info = _BatchInfo(-1, None) # Used for variable batch sizes.
Contributor: General remark: interactions with _BatchInfo could be hidden behind some interface, maybe offering .reset() and .validate_batch_size(bs) functionality?

Contributor Author: Done.
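
For reference, a rough sketch of what such an interface could look like (the method bodies and messages below are illustrative assumptions; per the commit message, the merged code moves the checks into a class named _IterBatchInfo):

```python
class _IterBatchInfo:
    """Illustrative sketch: tracks the batch size set for the current iteration."""

    def __init__(self, size=-1, context=None):
        self._size = size        # -1 means no batch size set yet
        self._context = context  # e.g. which external_source set it

    def reset(self):
        self._size = -1
        self._context = None

    def validate_batch_size(self, batch_size, context=None):
        if self._size == -1:
            # The first external_source of the iteration defines the size.
            self._size = batch_size
            self._context = context
        elif batch_size != self._size:
            raise RuntimeError(
                f"Batch size must be uniform across an iteration: got {batch_size}, "
                f"but this iteration already uses batch size {self._size}.")
```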

else:
raise RuntimeError(
("Batch size must be uniform across an iteration. External Source operator returned batch with "
"size = {}, for this iteration used batch size = {}.\nIf you want to use batch size returned by "
@klecki (Contributor) commented Apr 6, 2022:

The "for this iteration used batch size ..." part reads a bit weird.

For the "If you want" part I would suggest:

"If you want to use a variable batch size - that is, a different batch size in each iteration - you must call all the external source operators at the beginning of your debug pipeline, before any other DALI operators. All the external source operators are expected to return the same batch size in a given iteration, but it can change between iterations. Other operators will use that batch size for processing."

and then the part about what was called first.

I don't know if it's not overkill.

Contributor Author: Done.
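
To illustrate the failure mode this message guards against, here is a hypothetical pipeline (same assumptions as the sketch in the description) in which two external_source operators disagree on the batch size within one iteration:

```python
import numpy as np
from nvidia.dali import fn
from nvidia.dali.pipeline.experimental import pipeline_def


@pipeline_def(batch_size=8, num_threads=3, device_id=0, debug=True)
def mismatched_batch_pipeline():
    # Both sources run first, but they disagree on the batch size (4 vs 5).
    a = fn.external_source(
        source=lambda: [np.zeros(2, dtype=np.float32)] * 4, batch=True)
    b = fn.external_source(
        source=lambda: [np.zeros(2, dtype=np.float32)] * 5, batch=True)
    return a, b


pipe = mismatched_batch_pipeline()
pipe.build()
pipe.run()  # raises RuntimeError: batch size must be uniform across an iteration
```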



@pipeline_def(batch_size=8, num_threads=3, device_id=0, seed=47, debug=True)
def variable_batch_size_from_external_source_pipeline(variable_batch_size):
Contributor: This just tests one iteration being smaller than the max_batch_size, but not actually changing it from iteration to iteration, right?

Contributor Author: Yes, added more iterations with a changing batch size.

@ksztenderski changed the title from "Add support for variable batch size debug mode" to "Add support for variable batch size debug mode and NVTX ranges" on Apr 8, 2022
* Move batch size checks to _IterBatchInfo

Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
@ksztenderski (Contributor Author)

!build

@dali-automaton (Collaborator)

CI MESSAGE: [4463485]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4463485]: BUILD PASSED

@ksztenderski merged commit 7836d2d into NVIDIA:main on Apr 12, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
…IDIA#3799)

* Add support for NVTX ranges in eager operators.
* Add support for iteration batch size based on outputs from external_source.
It is limited to cases when external_source is the first operator in the pipeline.

Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
…IDIA#3799)

* Add support for NVTX ranges in eager operators.
* Add support for iteration batch size based on outputs from external_source.
It is limited to cases when external_source is the first operator in the pipeline.

Signed-off-by: ksztenderski <ksztenderski@nvidia.com>