Dump operator stats #2039
Conversation
- adds an option to the executor that makes it gather operator output buffer size statistics so the user can select the right value for `bytes_per_sample_hint`
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
!build
CI MESSAGE: [1409842]: BUILD STARTED
CI MESSAGE: [1409842]: BUILD FAILED
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Force-pushed from ee9a0ec to 416e5a7
!build
CI MESSAGE: [1409898]: BUILD STARTED
CI MESSAGE: [1409898]: BUILD PASSED
dali/c_api/c_api.cc
Outdated
@@ -109,7 +109,8 @@ void daliCreatePipeline(daliPipelineHandle *pipe_handle,
                        int separated_execution,
                        int prefetch_queue_depth,
                        int cpu_prefetch_queue_depth,
-                       int gpu_prefetch_queue_depth) {
+                       int gpu_prefetch_queue_depth,
+                       int get_memory_stats) {
nitpick: `get_memory_stats` gives me the feeling that it is a function name. Consider naming it something like `memory_stats_enabled`.
Done
dali/c_api/c_api.cc
Outdated
int i = 0;
for (const auto &stat : returned_meta) {
  auto op_name_size = stat.first.size();
  (*operator_meta)[i].operator_name = static_cast<char*>(malloc(sizeof(char) *
Suggested change:
-  (*operator_meta)[i].operator_name = static_cast<char*>(malloc(sizeof(char) *
+  auto &op_meta = (*operator_meta)[i];
+  op_meta.operator_name = static_cast<char*>(malloc(sizeof(char) *
Done
size_t out_num;     // number of the operator outputs
size_t *real_size;  // real size of the operator output; the user needs to free this memory
size_t *reserved;   // reserved size of the operator output; the user needs to free this memory
} daliExecutorMetadata;
I suggest you create a freeDaliExecutorMetadata or similar, that invokes free where necessary. No need for the user to do this explicitly.
Done
include/dali/c_api.h
Outdated
 */
DLL_PUBLIC void daliGetReaderMetadata(daliPipelineHandle* pipe_handle, const char *reader_name,
                                      daliReaderMetadata* meta);
/**
 * @brief Returns obtains the executor statistics
Suggested change:
- * @brief Returns obtains the executor statistics
+ * @brief Obtains the executor statistics
Done
include/dali/c_api.h
Outdated
/**
 * @brief Returns obtains the executor statistics
 * @param operator_meta Pointer to the memory allocated by the function with operator_meta_num
 *                      number of metadata entries. The user need to free that memory, as well
Suggested change:
- * number of metadata entries. The user need to free that memory, as well
+ * number of metadata entries. The user need to free that memory by invoking `freeDaliExecutorMetadata`
if you follow my suggestion above
Done
dali/test/python/test_pipeline.py
Outdated
for k in meta.keys():
    if "CropMirrorNormalize" in k:
        crop_meta = meta[k]
        assert(crop_meta["real_memory_size"] == crop_meta["reserver_memory_size"])
Suggested change:
- assert(crop_meta["real_memory_size"] == crop_meta["reserver_memory_size"])
+ assert(crop_meta["real_memory_size"] == crop_meta["reserved_memory_size"])
Done
dali/python/nvidia/dali/pipeline.py
Outdated
@@ -812,6 +836,7 @@ def deserialize_and_build(self, serialized_pipeline):
                        self._default_cuda_stream_priority)
    self._pipe.SetExecutionTypes(self._exec_pipelined, self._exec_separated, self._exec_async)
    self._pipe.SetQueueSizes(self._cpu_queue_size, self._gpu_queue_size)
+   self._pipe.EnableOperatorOutputMemoryStatistics(self._get_memory_stats)
self._pipe.EnableOperatorOutputMemoryStatistics(self._get_memory_stats)
Consider shortening the name here: EnableMemoryStatistics, EnableOpMemoryStats, or something like that.
Done (somehow)
dali/python/nvidia/dali/pipeline.py
Outdated
@@ -177,6 +183,22 @@ def epoch_size(self, name = None):
        return self._pipe.reader_meta(name)["epoch_size_padded"]
        return {name : v["epoch_size_padded"] for k, v in self._pipe.reader_meta()}

    def executor_meta(self):
executor_statistics?
Done
dali/python/backend_impl.cc
Outdated
  reserver_memory_size.append(entry.reserved);
}
op_dict["real_memory_size"] = real_memory_size;
op_dict["reserver_memory_size"] = reserver_memory_size;
Suggested change:
- op_dict["reserver_memory_size"] = reserver_memory_size;
+ op_dict["reserved_memory_size"] = reserver_memory_size;
Done
dali/pipeline/pipeline.h
Outdated
@@ -492,6 +514,7 @@ class DLL_PUBLIC Pipeline {
  int next_logical_id_ = 0;
  int next_internal_logical_id_ = -1;
  QueueSizes prefetch_queue_depth_;
+ bool get_memory_stats_ = false;
Suggested change:
- bool get_memory_stats_ = false;
+ bool memory_stats_enabled_ = false;
Done
dali/c_api/c_api_test.cc
Outdated
for (size_t i = 0; i < N; ++i) {
  free(meta[i].operator_name);
  free(meta[i].real_size);
  free(meta[i].reserved);
}
free(meta);
Maybe it would be beneficial to add a function for freeing such metadata to the C API?
Done
dali/pipeline/data/tensor_vector.h
Outdated
size_t total_nbytes = 0;
for (const auto &t : tensors_) {
  total_nbytes += t->nbytes();
}
return total_nbytes;
}

size_t capacity() const noexcept {
  size_t total_capacity = 0;
  for (const auto &t : tensors_) {
    total_capacity += t->capacity();
  }
  return total_capacity;
Those functions need to be handled similarly to, for example, shape(), so:
if (state_ == State::contiguous) {
  return tl->nbytes();
}
size_t total_nbytes = 0;
for (const auto &t : tensors_) {
  total_nbytes += t->nbytes();
}
return total_nbytes;
as you can have them backed by a tensor list or a vector of tensors.
Done
Force-pushed from 2f15c08 to 89341cb
There is one downside to using this with presize hints.
For fragmented CPU buffers (a non-contiguous TensorVector, i.e. a vector of tensors), we will get the sum over all samples. In some cases it would probably be better to return max(used/reserved memory) * nsamples.
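The difference between the two reporting strategies mentioned here can be sketched in plain Python (the sample sizes are made up for illustration):

```python
# Hypothetical per-sample output sizes (in bytes) of one fragmented CPU
# buffer, i.e. a non-contiguous TensorVector backed by a vector of tensors.
sample_sizes = [512, 768, 640, 1024]

# What summing over all samples reports:
summed = sum(sample_sizes)                            # 2944

# The alternative proposed above; every per-sample buffer must be able to
# hold the largest sample, so this is the safer presize hint:
max_times_n = max(sample_sizes) * len(sample_sizes)   # 4096

# max * nsamples can never be smaller than the plain sum
assert max_times_n >= summed
```

The second value over-reserves for small samples but guarantees no reallocation when sample sizes fluctuate between batches.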
dali/pipeline/executor/executor.h
Outdated
@@ -118,6 +132,31 @@ class DLL_PUBLIC Executor : public ExecutorBase, public WorkspacePolicy, public
  DISABLE_COPY_MOVE_ASSIGN(Executor);

+ protected:
+  template <typename W>
+  inline void FillStats(ExecutorMetaMap &memory_stats, W ws, std::string op_name,
+                        std::mutex &write_mutex) {
Nitpick: please add a space here.
Done
dali/pipeline/executor/executor.h
Outdated
@@ -220,6 +259,12 @@ class DLL_PUBLIC Executor : public ExecutorBase, public WorkspacePolicy, public
  // in some edge cases where there are no operators
  std::vector<cudaEvent_t> mixed_callback_events_;

+ std::atomic<bool> get_memory_stats_ = ATOMIC_VAR_INIT(false);;
Double `;`. Do we really need the init, isn't the constructor enough? Maybe with the C API we need it.
Done
dali/pipeline/executor/executor.h
Outdated
template<typename map>
void AppendToMap(map &ret, map &in_stats, std::mutex &mutex) {
Suggested change:
- template<typename map>
- void AppendToMap(map &ret, map &in_stats, std::mutex &mutex) {
+ void AppendToMap(ExecutorMetaMap &ret, const ExecutorMetaMap &in_stats, std::mutex &mutex) {
Done
dali/pipeline/executor/executor.h
Outdated
for (auto const& stats : in_stats) {
  ret.emplace(stats);
}
Wouldn't
Suggested change:
- for (auto const& stats : in_stats) {
-   ret.emplace(stats);
- }
+ ret.insert(in_stats.begin(), in_stats.end());
also work?
Done
@@ -347,6 +409,7 @@ void Executor<WorkspacePolicy, QueuePolicy>::RunCPU() {

  try {
    RunHelper(op_node, ws);
+   FillStats(cpu_memory_stats_, ws, "CPU_" + op_node.instance_name, cpu_memory_stats_mutex_);
Instance names should be unique, so what's the rationale for the CPU_ prefixes etc.?
If you have an operator for CPU and GPU then it is hard to tell which instance is placed where. The name is unique but a bit mangled and not always self-explanatory to the user.
dali/python/backend_impl.cc
Outdated
.def("executor_meta",
     [](Pipeline *p) {
       auto ret = p->GetExecutorMeta();
       return ExecutorMetaToDict(ret);
     })
I was wondering if having one entry point for all statistics would be beneficial, but it probably depends on what kind of statistics we provide.
If everything can be merged into a dictionary { "op_name" : { "stats_name" : value } }, then it's fine; if not, then it doesn't make much sense.
But maybe it should be something like executor_meta("memory") or executor_meta(["memory", "threading", ...])?
I would keep it as it is, as we don't have a good plan for other stats now. Having five levels of nesting just to read one int is not good.
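For reference, the flat two-level dictionary shape discussed in this thread can be sketched as follows (the operator name and values are hypothetical):

```python
# Merged statistics dictionary of the form { "op_name": { "stats_name": value } }.
stats = {
    "CPU_CropMirrorNormalize": {
        "real_memory_size": [4096],
        "reserved_memory_size": [8192],
    },
}

# Reading a single value stays a simple two-level lookup, no deep nesting needed.
reserved = stats["CPU_CropMirrorNormalize"]["reserved_memory_size"][0]
```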
dali/python/nvidia/dali/pipeline.py
Outdated
``real_memory_size``: list of memory sizes that is used by each output of the operator;
index in the list corresponds to the output index

``reserver_memory_size``: list of memory sizes that is reserved for each of the operator outputs
Suggested change:
- ``reserver_memory_size``: list of memory sizes that is reserved for each of the operator outputs
+ ``reserved_memory_size``: list of memory sizes that is reserved for each of the operator outputs
Maybe mention here what option should be set to get these stats.
Done
dali/python/nvidia/dali/pipeline.py
Outdated
@@ -326,9 +348,10 @@ def _prepare_graph(self, define_graph = None):
                    self._bytes_per_sample,
                    self._set_affinity,
                    self._max_streams,
-                   self._default_cuda_stream_priority)
+                   self._default_cuda_stream_priority,)
Suggested change:
- self._default_cuda_stream_priority,)
+ self._default_cuda_stream_priority)
Done
docs/advanced_topics.rst
Outdated
@@ -48,6 +48,7 @@ The purpose of this functionality is to enable the user to fine-tune the process
DALI uses intermediate buffers to pass data between operators in the processing graph. With DALI, the memory is never freed but just enlarged when present buffers are not sufficient to hold the data. However, in some cases, even this limited number of allocations still could affect DALI performance. Hence, if the user knows how much memory each operator buffer needs, then it is possible to provide a hint to presize buffers before the first run.
Two parameters are available: First, the ``bytes_per_sample`` pipeline argument, which accepts one value that is used globally across all operators for all buffers.
The second parameter is the ``bytes_per_sample_hint`` per operator argument, which accepts one value or a list of values. When one value is provided it is used for all output buffers for a given operator. When a list is provided then each buffer is presized to the corresponding size.
To learn how much memory outputs of each operator need, the user may create the pipeline with ``get_memory_stats`` set to ``True`` and then query the pipeline for the operator output memory occupation by calling ``executor_meta`` method on the pipeline.
Can you give an example here of how to use that information?
The obtained dictionary will contain the memory currently used for the latest batch and the total reserved memory, which indicates the maximal memory usage (unless the thresholds for memory reallocation were adjusted (link to the Memory consumption section)).
Then you can use the reserved memory as presize hints (link to section).
Done
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Force-pushed from 89341cb to 22fd3bc
dali/python/nvidia/dali/pipeline.py
Outdated
@@ -92,11 +92,16 @@ class Pipeline(object):
    unrestricted number of streams is assumed).
`default_cuda_stream_priority` : int, optional, default = 0
    CUDA stream priority used by DALI. See `cudaStreamCreateWithPriority` in CUDA documentation
`get_memory_stats`: bool, optional, default = False
`get_memory_stats`: bool, optional, default = False
Suggested name: enable_memory_stats
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Force-pushed from 9b7f298 to 7f75031
CI MESSAGE: [1413697]: BUILD STARTED
!build
CI MESSAGE: [1413748]: BUILD STARTED
CI MESSAGE: [1413748]: BUILD PASSED
CI MESSAGE: [1413697]: BUILD PASSED
``max_real_memory_size``: list of maximum tensor size that is used by each output of the operator;
index in the list corresponds to the output index
Only a small nitpick: this is the max for non-contiguous buffers and the average for contiguous ones. In the case of real_memory we could theoretically calculate the volume of the samples, but I don't think it's worth it.
                  size_t &max_reserved_size) {
  for (size_t j = 0; j < in.ntensor(); ++j) {
    max_out_size = std::max(in[j].nbytes(), max_out_size);
    max_reserved_size = std::max(in[j].capacity(), max_reserved_size);
I have some reservations regarding this. In the case of a contiguous TensorVector, the Tensors that you're accessing with operator[] here are kind of views into the backing TensorList. So the capacity will probably match exactly the nbytes here, but the total capacity of the TensorVector might be bigger than nsamples * max(number of bytes) (I suspect).
This can probably be checked with a test that sets the TensorVector to contiguous mode, resizes it to some big shapes, and then sets all the shapes to be, for example, half the initial size.
Can you check this? I think the best solution would be to take the max of this reserved size and the average reserved size you calculate for the TensorList in the function above.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
!build
CI MESSAGE: [1414632]: BUILD STARTED
CI MESSAGE: [1414632]: BUILD PASSED
docs/advanced_topics.rst
Outdated
@@ -48,6 +48,7 @@ The purpose of this functionality is to enable the user to fine-tune the process
DALI uses intermediate buffers to pass data between operators in the processing graph. With DALI, the memory is never freed but just enlarged when present buffers are not sufficient to hold the data. However, in some cases, even this limited number of allocations still could affect DALI performance. Hence, if the user knows how much memory each operator buffer needs, then it is possible to provide a hint to presize buffers before the first run.
Two parameters are available: First, the ``bytes_per_sample`` pipeline argument, which accepts one value that is used globally across all operators for all buffers.
The second parameter is the ``bytes_per_sample_hint`` per operator argument, which accepts one value or a list of values. When one value is provided it is used for all output buffers for a given operator. When a list is provided then each buffer is presized to the corresponding size.
To learn how much memory outputs of each operator need, the user may create the pipeline with ``enable_memory_stats`` set to ``True`` and then query the pipeline for the operator's output memory statistics by calling ``executor_meta`` method on the pipeline. The ``max_real_memory_size`` value tells what is the biggest tensor in the batch for the outputs that allocate memory per sample, not for the whole batch at the time, or average tensor size when the allocation is continuous. This value is the one that should be provided to ``bytes_per_sample_hint``.
Shouldn't that be the max_reserved_memory_size, as this is the bigger value which includes some "padding"? Or, as we calculate the max, both will be similar here (and we don't need to actually return all the values?). When will they differ? Only for small data with the padding?
Not really. If the user uses SetBufferGrowthFactor with a big value then it is better to stick to the used memory, not the reserved one.
OK, it makes more sense to use the real size, as no padding will be used if there is enough memory -> no reallocation.
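The conclusion of this thread, preferring the real sizes over the reserved ones when picking presize hints, can be sketched like this (the dictionary layout follows the fields discussed above; the operator name and numbers are hypothetical):

```python
def pick_hints(op_stats):
    # Prefer max_real_memory_size over max_reserved_memory_size: the reserved
    # value may include growth-factor padding (e.g. after a large
    # SetBufferGrowthFactor), while the real size is what the data needs.
    return list(op_stats["max_real_memory_size"])

# Hypothetical stats entry for one operator with two outputs:
stats = {
    "CPU_ImageDecoder": {
        "max_real_memory_size": [400000, 3000],
        "max_reserved_memory_size": [800000, 6000],
    },
}

# One hint per output; these could then be passed as bytes_per_sample_hint.
hints = pick_hints(stats["CPU_ImageDecoder"])
```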
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Force-pushed from db3e16b to 41bd1b1
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

Why we need this PR?
Adds an option to the executor that makes it gather operator output buffer size statistics so the user can select the right value for `bytes_per_sample_hint`.

What happened in this PR?
- Executor, pipeline API across the stack
- How it is done in the executor
- CI, but no test targeting the prints
- API description updated

JIRA TASK: [DALI-1024]