Add support for variable batch size debug mode and NVTX ranges #3799
Conversation
External source can define the iteration batch size when it is the first operator in the pipeline.
Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
!build
This pull request introduces 1 alert when merging 717cd54 into 0fafc72 - view on LGTM.com.
Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
CI MESSAGE: [4443163]: BUILD STARTED
CI MESSAGE: [4443163]: BUILD PASSED
dali/pipeline/pipeline_debug.h
Outdated
@@ -44,6 +44,7 @@ class DLL_PUBLIC PipelineDebug {

   DLL_PUBLIC void AddOperator(OpSpec &spec, int logical_id) {
     FillOpSpec(spec);
     std::string op_name = "__debug__" + spec.name() + "_" + std::to_string(logical_id);
Unused.
Removed (an artifact of an earlier version).
      : max_batch_size_(spec.GetArgument<int>("max_batch_size")),
        op_spec_(spec),
        name_(spec.name()),
        op_(InstantiateOperator(spec)) {
    num_outputs_ = op_spec_.GetSchema().CalculateOutputs(op_spec_) +
                   op_spec_.GetSchema().CalculateAdditionalOutputs(op_spec_);
  }
I think you can delegate to the other constructor to avoid repetition:
-      : max_batch_size_(spec.GetArgument<int>("max_batch_size")),
-        op_spec_(spec),
-        name_(spec.name()),
-        op_(InstantiateOperator(spec)) {
-    num_outputs_ = op_spec_.GetSchema().CalculateOutputs(op_spec_) +
-                   op_spec_.GetSchema().CalculateAdditionalOutputs(op_spec_);
-  }
+      : EagerOperator(spec, spec.name()) {}
Of course, done. It was a quick check and I forgot to change that.
@@ -123,24 +140,26 @@ template <>
 std::vector<std::shared_ptr<TensorList<CPUBackend>>> EagerOperator<CPUBackend>::Run(
     const std::vector<std::shared_ptr<TensorList<CPUBackend>>> &inputs,
     const std::unordered_map<std::string, std::shared_ptr<TensorList<CPUBackend>>> &kwargs,
-    ThreadPool *thread_pool) {
+    ThreadPool *thread_pool, int batch_size) {
   DomainTimeRange tr("[DALI][CPU op] " + name_, DomainTimeRange::kBlue1);
I won't insist on doing it in a separate PR, but it would be nice to note in the commit message (and the PR description) that we added the NVTX ranges as well.
done
    DALI_ENFORCE(cur_batch_size == batch_size,
                 make_string("Expected uniform batch size in a single operator. Expected: ",
                             batch_size, ", got: ", cur_batch_size));
    DALI_ENFORCE(
        cur_batch_size <= max_batch_size_,
        make_string("Expected batch size lower or equal to max batch size. Expected at most: ",
                    max_batch_size_, ", got: ", batch_size));
Maybe those errors could mention the index of the offending input?
In the first one we could say that the first input has one batch size while this input has another.
In the second one we could point out which input has the batch that is too big.
Also, when throwing errors for debug mode from the backend, it would be good to introduce something similar to HandleError from the executor (I guess it won't apply here). It can be some other mechanism, but it would be nice to let the user know where in the pipeline the error was encountered. Or will it already show the name of the Python wrapper, so all of this is unnecessary?
Also, do you want to add similar checks post-run, for sanity?
done
@@ -314,13 +315,19 @@ def is_primitive_type(x):
         return False, 'cpu', data


 class _BatchInfo:
👍
@@ -533,6 +549,7 @@ def run(self):

         self._debug_on = True
         self._cur_operator_id = -1
         self._cur_batch_info = _BatchInfo(-1, None)  # Used for variable batch sizes.
General remark:
Interactions with _BatchInfo could be hidden behind some interface, maybe offering a .reset() and a .validate_batch_size(bs) method?
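Something along these lines, as a rough sketch (the internals and the error text are only illustrative, not a proposal for the exact implementation):

class _BatchInfo:
    """Tracks the batch size established for the current iteration (illustrative sketch)."""

    def __init__(self, size=-1, source_context=None):
        self._size = size                      # -1 means "not set for this iteration yet"
        self._source_context = source_context  # e.g. name of the operator that set it

    def reset(self):
        """Forget the batch size at the start of a new iteration."""
        self._size = -1
        self._source_context = None

    def validate_batch_size(self, size, source_context=None):
        """Record the batch size on first use, then require it to stay uniform."""
        if self._size == -1:
            self._size = size
            self._source_context = source_context
        elif self._size != size:
            raise RuntimeError(
                f"Batch size must be uniform across an iteration. Got {size}, "
                f"but this iteration already uses batch size {self._size} "
                f"(set by {self._source_context}).")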
done
        else:
            raise RuntimeError(
                ("Batch size must be uniform across an iteration. External Source operator returned batch with "
                 "size = {}, for this iteration used batch size = {}.\nIf you want to use batch size returned by "
The "for this iteration used batch size ..." part reads a bit weird.
For the "If you want" part I would suggest:
"If you want to use a variable batch size - that is, a different batch size in each iteration - you must call all the external source operators at the beginning of your debug pipeline, before other DALI operators. All the external source operators are expected to return the same batch size in a given iteration, but it can change between iterations. Other operators will use that batch size for processing."
and then the part about what was called first. I don't know if it isn't overkill.
done
@pipeline_def(batch_size=8, num_threads=3, device_id=0, seed=47, debug=True)
def variable_batch_size_from_external_source_pipeline(variable_batch_size):
This just tests that one iteration is smaller than the max_batch_size, but not that the batch size actually changes from iteration to iteration, right?
Yes, added more iterations with a changing batch size.
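For reference, a rough sketch of what such a multi-iteration test could look like (the data shapes, batch sizes, and the final cast operator are assumptions, not the actual test code):

import numpy as np
from nvidia.dali import fn, pipeline_def, types

# One batch per iteration, each with a different size (all <= max batch size of 8).
batches = [[np.full((2, 2), i, dtype=np.int32) for _ in range(size)]
           for i, size in enumerate([8, 4, 6, 2])]

@pipeline_def(batch_size=8, num_threads=3, device_id=0, seed=47, debug=True)
def variable_batch_size_from_external_source_pipeline(src):
    # external_source comes first, so its output defines the iteration batch size.
    data = fn.external_source(source=src)
    return fn.cast(data, dtype=types.FLOAT)

pipe = variable_batch_size_from_external_source_pipeline(iter(batches))
pipe.build()
for expected in batches:
    (out,) = pipe.run()
    assert len(out) == len(expected)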
* Move batch size checks to _IterBatchInfo
Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
!build
CI MESSAGE: [4463485]: BUILD STARTED
CI MESSAGE: [4463485]: BUILD PASSED
…IDIA#3799)
* Add support for NVTX ranges in eager operators.
* Add support for iteration batch size based on outputs from external_source. It is limited to cases when external_source is the first operator in the pipeline.
Signed-off-by: ksztenderski <ksztenderski@nvidia.com>
Category:
New feature (non-breaking change which adds functionality)
Description:
Adds support for NVTX ranges in eager operators.
Adds support for iteration batch size based on outputs from external_source. The support is limited to cases when external_source is the first operator in the pipeline.
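A minimal usage sketch of the supported case (the callback and pipeline below are only illustrative, not taken from the code or tests):

import numpy as np
from nvidia.dali import fn, pipeline_def

def make_batch(iteration_idx):
    # A different batch size every iteration, never exceeding max_batch_size.
    size = 8 - (iteration_idx % 4) * 2
    return [np.array([iteration_idx], dtype=np.int32) for _ in range(size)]

@pipeline_def(batch_size=8, num_threads=2, device_id=0, debug=True)
def debug_pipeline():
    # external_source is the first operator, so the size of the batch it returns
    # becomes the batch size for the rest of this iteration.
    data = fn.external_source(source=make_batch, batch=True)
    return fn.copy(data)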
Why is it impossible/inconvenient to support external_source that is defined later?
For this to work we would need an execution flow similar to the one from standard mode (fetching data from external_source at the start of each run). That means we would have to build the pipeline first to save all the external_source callbacks. The problem with this approach is that debug mode allows accessing and modifying the data inside the pipeline during the build, so we have to operate on the actual data; we cannot use placeholders like in a standard build.

Why can we not run the pipeline once to build the graph and collect external_source operators?
external_source may not be fed with data.

Additional information:
Affected modules and functionalities:
debug mode pipeline
Key points relevant for the review:
Checklist
Tests
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A