
Add support for variable batch size debug mode and NVTX ranges #3799

Merged: 4 commits merged into NVIDIA:main on Apr 12, 2022

Conversation

@ksztenderski (Contributor) commented Apr 6, 2022

Category:

New feature (non-breaking change which adds functionality)

Description:

Adds support for NVTX ranges in eager operators.
Adds support for setting the iteration batch size based on the outputs of external_source.
This is limited to cases where external_source is the first operator in the pipeline.
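
Below is a minimal sketch of the supported pattern (a hypothetical pipeline, not code from this PR; it assumes the experimental pipeline_def that accepts debug=True and an illustrative source callback):

```python
import numpy as np
from nvidia.dali import fn
from nvidia.dali.pipeline.experimental import pipeline_def


@pipeline_def(batch_size=8, num_threads=3, device_id=0, debug=True)
def variable_batch_pipeline():
    # external_source runs first, so the batch it returns defines the batch
    # size of this iteration (it must not exceed the max batch_size of 8).
    data = fn.external_source(
        source=lambda info: [np.full((2, 2), info.iteration, dtype=np.float32)
                             for _ in range(info.iteration % 8 + 1)],
        batch=True, batch_info=True)
    # Subsequent operators adopt that batch size for the iteration.
    return data * 2.0


pipe = variable_batch_pipeline()
pipe.build()
for _ in range(3):
    (out,) = pipe.run()  # each iteration may have a different batch size
```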

Why is it impossible or inconvenient to support an external_source that is defined later in the pipeline?

For this to work, we would need an execution flow similar to the one in standard mode (fetching data from external_source at the start of each run). That means we would first have to build the pipeline to collect all the external_source callbacks. The problem with this approach is that debug mode allows accessing and modifying the data inside the pipeline during the build, so we have to operate on the actual data; we cannot use placeholders as in the standard build.

Why can we not run the pipeline once to build the graph and collect external_source operators?

  • We cannot do this during the pipeline build, because the external_source may not have been fed with data yet.
  • In the first iteration we could run the pipeline twice: once to build the graph and collect the external_source operators, and then a regular run. That way we would introduce possible garbage outputs from the build phase, and we would have to account for every iteration-dependent state (someone may use plain Python code inside the pipeline that depends on the number of executions; see the sketch below) - it would probably be impossible to catch all of these cases.
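
For example (hypothetical code, not from this PR), a stateful source like the one below depends on how many times it has been called; an extra graph-collection run would silently advance its state before the first real iteration:

```python
import numpy as np


class CountingSource:
    """Hypothetical source whose output depends on its call count."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1  # an extra "build" run would bump this too
        return [np.array([self.calls], dtype=np.int32)] * 4
```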

Additional information:

Affected modules and functionalities:

debug mode pipeline

Key points relevant for the review:

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

External source can define the iteration batch size when it is the first operator in
the pipeline.

Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
@ksztenderski (Contributor Author)

!build

@lgtm-com (bot) commented Apr 6, 2022

This pull request introduces 1 alert when merging 717cd54 into 0fafc72 - view on LGTM.com

new alerts:

  • 1 for Unused import

Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
@dali-automaton (Collaborator)

CI MESSAGE: [4443163]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4443163]: BUILD PASSED

@@ -44,6 +44,7 @@ class DLL_PUBLIC PipelineDebug {

DLL_PUBLIC void AddOperator(OpSpec &spec, int logical_id) {
FillOpSpec(spec);
std::string op_name = "__debug__" + spec.name() + "_" + std::to_string(logical_id);
Contributor: Unused.

Contributor Author: Removed (an artifact of an earlier version).

Comment on lines 62 to 68
: max_batch_size_(spec.GetArgument<int>("max_batch_size")),
op_spec_(spec),
name_(spec.name()),
op_(InstantiateOperator(spec)) {
num_outputs_ = op_spec_.GetSchema().CalculateOutputs(op_spec_) +
op_spec_.GetSchema().CalculateAdditionalOutputs(op_spec_);
}
Contributor: I think you can delegate to the other constructor to avoid repetition.

Suggested change
: max_batch_size_(spec.GetArgument<int>("max_batch_size")),
op_spec_(spec),
name_(spec.name()),
op_(InstantiateOperator(spec)) {
num_outputs_ = op_spec_.GetSchema().CalculateOutputs(op_spec_) +
op_spec_.GetSchema().CalculateAdditionalOutputs(op_spec_);
}
: EagerOperator(spec, spec.name()) {}

@ksztenderski (Contributor Author) commented Apr 8, 2022: Of course, done. It was a quick check and I forgot to change that.

@@ -123,24 +140,26 @@ template <>
std::vector<std::shared_ptr<TensorList<CPUBackend>>> EagerOperator<CPUBackend>::Run(
const std::vector<std::shared_ptr<TensorList<CPUBackend>>> &inputs,
const std::unordered_map<std::string, std::shared_ptr<TensorList<CPUBackend>>> &kwargs,
ThreadPool *thread_pool) {
ThreadPool *thread_pool, int batch_size) {
DomainTimeRange tr("[DALI][CPU op] " + name_, DomainTimeRange::kBlue1);
Contributor: I won't insist on doing it in a separate PR, but it would be nice to note in the commit message (and PR description) that we added the NVTX ranges as well.

Contributor Author: Done.

Comment on lines 228 to 234
DALI_ENFORCE(cur_batch_size == batch_size,
make_string("Expected uniform batch size in a single operator. Expected: ",
batch_size, ", got: ", cur_batch_size));
DALI_ENFORCE(
cur_batch_size <= max_batch_size_,
make_string("Expected batch size lower or equal to max batch size. Expected at most: ",
max_batch_size_, ", got: ", batch_size));
Contributor: Maybe those errors could mention the index of the offending input? In the first one, we can write that the first input has batch size <batch_size> and this input has <batch_size>. In the second, we can say which input's batch is too big.

Also, when throwing errors for debug mode from the backend, would it be good to introduce something similar to HandleError from the executor? (I guess it won't apply here.) It can be some other mechanism, but it would be nice to let the user know where in their pipeline the error occurred. Or will it already show the name of the Python wrapper, making all of this unnecessary?

Contributor: Also, do you want to add similar checks post-run, for sanity?

Contributor Author: Done.

@@ -314,13 +315,19 @@ def is_primitive_type(x):
return False, 'cpu', data


class _BatchInfo:
Contributor: 👍

@@ -533,6 +549,7 @@ def run(self):

self._debug_on = True
self._cur_operator_id = -1
self._cur_batch_info = _BatchInfo(-1, None) # Used for variable batch sizes.
Contributor: General remark: interactions with _BatchInfo could be hidden behind some interface, maybe offering .reset() and .validate_batch_size(bs) functionality?

Contributor Author: Done.
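
For reference, a rough sketch of what such an interface could look like (the method bodies and messages below are illustrative assumptions; per the commit message, the merged code moves the checks into a class named _IterBatchInfo):

```python
class _IterBatchInfo:
    """Illustrative sketch: tracks the batch size set for the current iteration."""

    def __init__(self, size=-1, context=None):
        self._size = size        # -1 means no batch size set yet
        self._context = context  # e.g. which external_source set it

    def reset(self):
        self._size = -1
        self._context = None

    def validate_batch_size(self, batch_size, context=None):
        if self._size == -1:
            # The first external_source of the iteration defines the size.
            self._size = batch_size
            self._context = context
        elif batch_size != self._size:
            raise RuntimeError(
                f"Batch size must be uniform across an iteration: got {batch_size}, "
                f"but this iteration already uses batch size {self._size}.")
```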

else:
raise RuntimeError(
("Batch size must be uniform across an iteration. External Source operator returned batch with "
"size = {}, for this iteration used batch size = {}.\nIf you want to use batch size returned by "
@klecki (Contributor) commented Apr 6, 2022:

The "for this iteration used batch size ..." part reads a bit weird.

For the "If you want" part I would suggest:

"If you want to use a variable batch size - that is, a different batch size in each iteration - you must call all the external source operators at the beginning of your debug pipeline, before any other DALI operators. All the external source operators are expected to return the same batch size in a given iteration, but it can change between iterations. Other operators will use that batch size for processing."

and then the part about what was called first.

I don't know if it's not overkill.

Contributor Author: Done.
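
To illustrate the failure mode this message guards against, here is a hypothetical pipeline (same assumptions as the sketch in the description) in which two external_source operators disagree on the batch size within one iteration:

```python
import numpy as np
from nvidia.dali import fn
from nvidia.dali.pipeline.experimental import pipeline_def


@pipeline_def(batch_size=8, num_threads=3, device_id=0, debug=True)
def mismatched_batch_pipeline():
    # Both sources run first, but they disagree on the batch size (4 vs 5).
    a = fn.external_source(
        source=lambda: [np.zeros(2, dtype=np.float32)] * 4, batch=True)
    b = fn.external_source(
        source=lambda: [np.zeros(2, dtype=np.float32)] * 5, batch=True)
    return a, b


pipe = mismatched_batch_pipeline()
pipe.build()
pipe.run()  # raises RuntimeError: batch size must be uniform across an iteration
```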



@pipeline_def(batch_size=8, num_threads=3, device_id=0, seed=47, debug=True)
def variable_batch_size_from_external_source_pipeline(variable_batch_size):
Contributor: This just tests one iteration being smaller than the max_batch_size, but not actually changing it from iteration to iteration, right?

Contributor Author: Yes, added more iterations with a changing batch size.

@ksztenderski changed the title from "Add support for variable batch size debug mode" to "Add support for variable batch size debug mode and NVTX ranges" on Apr 8, 2022
* Move batch size checks to _IterBatchInfo

Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
@ksztenderski (Contributor Author)

!build

@dali-automaton (Collaborator)

CI MESSAGE: [4463485]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4463485]: BUILD PASSED

@ksztenderski merged commit 7836d2d into NVIDIA:main on Apr 12, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
…IDIA#3799)

* Add support for NVTX ranges in eager operators.
* Add support for iteration batch size based on outputs from external_source.
It is limited to cases when external_source is the first operator in the pipeline.

Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
…IDIA#3799)

* Add support for NVTX ranges in eager operators.
* Add support for iteration batch size based on outputs from external_source.
It is limited to cases when external_source is the first operator in the pipeline.

Signed-off-by: ksztenderski <ksztenderski@nvidia.com>