
Dump operator stats #2039

Merged: 6 commits merged into NVIDIA:master from dump_operator_stats on Jun 23, 2020

Conversation

@JanuszL (Contributor) commented Jun 19, 2020

  • adds an option to the executor that makes it gather operator output buffer size statistics, so the user can select the right value for bytes_per_sample_hint

Signed-off-by: Janusz Lisiecki jlisiecki@nvidia.com

Why we need this PR?


  • It adds the ability to dump info about operator output buffer sizes

What happened in this PR?


  • What solution was applied:
    adds an option to the executor that makes it gather operator output buffer size statistics, so the user can select the right value for bytes_per_sample_hint (see the usage sketch below)
  • Affected modules and functionalities:
    Executor, pipeline API across the stack
  • Key points relevant for the review:
    How it is done in the executor
  • Validation and testing:
    CI, but no test targeting the prints
  • Documentation (including examples):
    API description updated

JIRA TASK: [DALI-1024]
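For context, a minimal usage sketch of the feature this PR adds. The parameter and method names enable_memory_stats and executor_statistics() are taken from the review discussion below; the data path is a placeholder, so treat this as an illustrative assumption rather than the PR's exact example:

    from nvidia.dali.pipeline import Pipeline
    import nvidia.dali.ops as ops

    class MyPipeline(Pipeline):
        def __init__(self, batch_size, num_threads, device_id):
            # enable_memory_stats asks the executor to record output buffer sizes
            super().__init__(batch_size, num_threads, device_id,
                             enable_memory_stats=True)
            self.reader = ops.FileReader(file_root="data")  # placeholder path
            self.decode = ops.ImageDecoder(device="mixed")

        def define_graph(self):
            jpegs, labels = self.reader()
            return self.decode(jpegs), labels

    pipe = MyPipeline(batch_size=32, num_threads=4, device_id=0)
    pipe.build()
    pipe.run()
    # per-operator output buffer statistics gathered by the executor
    print(pipe.executor_statistics())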

- adds an option to the executor that makes it gather operator output
  buffer size statistics, so the user can select the right value
  for `bytes_per_sample_hint`

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL (Contributor Author) commented Jun 19, 2020

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1409842]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [1409842]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL (Contributor Author) commented Jun 19, 2020

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1409898]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [1409898]: BUILD PASSED

@@ -109,7 +109,8 @@ void daliCreatePipeline(daliPipelineHandle *pipe_handle,
 int separated_execution,
 int prefetch_queue_depth,
 int cpu_prefetch_queue_depth,
-int gpu_prefetch_queue_depth) {
+int gpu_prefetch_queue_depth,
+int get_memory_stats) {
Contributor:

nitpick: get_memory_stats gives me the feeling that it is a function name. Consider naming it something like "memory_stats_enabled".

Contributor Author:

Done

int i = 0;
for (const auto &stat : returned_meta) {
auto op_name_size = stat.first.size();
(*operator_meta)[i].operator_name = static_cast<char*>(malloc(sizeof(char) *
Contributor:

Suggested change
-(*operator_meta)[i].operator_name = static_cast<char*>(malloc(sizeof(char) *
+auto &op_meta = (*operator_meta)[i];
+op_meta.operator_name = static_cast<char*>(malloc(sizeof(char) *

Contributor Author:

Done

size_t out_num; // number of the operator outputs
size_t *real_size; // real size of the operator output, user need to free the memory
size_t *reserved; // reserved size of the operator output, user need to free the memory
} daliExecutorMetadata;
Contributor:

I suggest you create a freeDaliExecutorMetadata or similar, that invokes free where necessary. No need for the user to do this explicitly.

Contributor Author:

Done
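For reference, a minimal sketch of what such a freeing helper could look like, based on the struct fields shown above and the freeing pattern that appears later in this review. The name freeDaliExecutorMetadata is the reviewer's suggestion; the exact name and signature added by the PR are not shown here, so treat this as an assumption:

    #include <stdlib.h>

    /* Frees an array of `num` daliExecutorMetadata entries previously
     * returned by the executor statistics query. */
    void freeDaliExecutorMetadata(daliExecutorMetadata *meta, size_t num) {
      for (size_t i = 0; i < num; ++i) {
        free(meta[i].operator_name);  /* name buffer allocated with malloc */
        free(meta[i].real_size);      /* per-output real sizes */
        free(meta[i].reserved);       /* per-output reserved sizes */
      }
      free(meta);                     /* the array itself */
    }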

*/
DLL_PUBLIC void daliGetReaderMetadata(daliPipelineHandle* pipe_handle, const char *reader_name,
daliReaderMetadata* meta);
/**
* @brief Returns obtains the executor statistics
Contributor:

Suggested change
- * @brief Returns obtains the executor statistics
+ * @brief Obtains the executor statistics

Contributor Author:

Done

/**
* @brief Returns obtains the executor statistics
* @param operator_meta Pointer to the memory allocated by the function with operator_meta_num
* number of metadata entries. The user need to free that memory, as well
Contributor:

Suggested change
- * number of metadata entries. The user need to free that memory, as well
+ * number of metadata entries. The user need to free that memory by invoking `freeDaliExecutorMetadata`

if you follow my suggestion above

Contributor Author:

Done

for k in meta.keys():
if "CropMirrorNormalize" in k:
crop_meta = meta[k]
assert(crop_meta["real_memory_size"] == crop_meta["reserver_memory_size"])
Contributor:

Suggested change
-assert(crop_meta["real_memory_size"] == crop_meta["reserver_memory_size"])
+assert(crop_meta["real_memory_size"] == crop_meta["reserved_memory_size"])

Contributor Author:

Done

@@ -812,6 +836,7 @@ def deserialize_and_build(self, serialized_pipeline):
self._default_cuda_stream_priority)
self._pipe.SetExecutionTypes(self._exec_pipelined, self._exec_separated, self._exec_async)
self._pipe.SetQueueSizes(self._cpu_queue_size, self._gpu_queue_size)
self._pipe.EnableOperatorOutputMemoryStatistics(self._get_memory_stats)
Contributor:

Suggested change (identical text; see the naming comment below):
self._pipe.EnableOperatorOutputMemoryStatistics(self._get_memory_stats)

Consider shortening name here:
EnableMemoryStatistics
EnableOpMemoryStats
or something like that

Contributor Author:

Done (somehow)

@@ -177,6 +183,22 @@ def epoch_size(self, name = None):
return self._pipe.reader_meta(name)["epoch_size_padded"]
return {name : v["epoch_size_padded"] for k, v in self._pipe.reader_meta()}

def executor_meta(self):
Contributor:

executor_statistics?

Contributor Author:

Done

reserver_memory_size.append(entry.reserved);
}
op_dict["real_memory_size"] = real_memory_size;
op_dict["reserver_memory_size"] = reserver_memory_size;
Contributor:

Suggested change
-op_dict["reserver_memory_size"] = reserver_memory_size;
+op_dict["reserved_memory_size"] = reserver_memory_size;

Contributor Author:

Done

@@ -492,6 +514,7 @@ class DLL_PUBLIC Pipeline {
int next_logical_id_ = 0;
int next_internal_logical_id_ = -1;
QueueSizes prefetch_queue_depth_;
bool get_memory_stats_ = false;
Contributor:

Suggested change
-bool get_memory_stats_ = false;
+bool memory_stats_enabled_ = false;

Contributor Author:

Done

Comment on lines 465 to 470
for (size_t i = 0; i < N; ++i) {
free(meta[i].operator_name);
free(meta[i].real_size);
free(meta[i].reserved);
}
free(meta);
Contributor:

Maybe it would be beneficial to add a function for freeing such metadata to the C API?

Contributor Author:

Done

Comment on lines 117 to 129
size_t total_nbytes = 0;
for (const auto &t : tensors_) {
total_nbytes += t->nbytes();
}
return total_nbytes;
}

size_t capacity() const noexcept {
size_t total_capacity = 0;
for (const auto &t : tensors_) {
total_capacity += t->capacity();
}
return total_capacity;
Contributor:

Those functions need to be handled similarly to, for example, shape(), so

if (state_ == State::contiguous) {
  return tl->nbytes();
} 
size_t total_nbytes = 0;
for (const auto &t : tensors_) {
  total_nbytes += t->nbytes();
}
return total_nbytes;

as you can have them backed by a tensor list or a vector of tensors.

Contributor Author:

Done
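For reference, capacity() handled with the same pattern might look roughly like this; the member names state_, tl_, and tensors_ follow the reviewer's snippet above and are assumptions about the actual class layout:

    size_t capacity() const noexcept {
      if (state_ == State::contiguous) {
        // contiguous case: the backing TensorList owns the whole allocation
        return tl_->capacity();
      }
      // non-contiguous case: sum the capacities of the individual tensors
      size_t total_capacity = 0;
      for (const auto &t : tensors_) {
        total_capacity += t->capacity();
      }
      return total_capacity;
    }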

@JanuszL force-pushed the dump_operator_stats branch 3 times, most recently from 2f15c08 to 89341cb, on June 22, 2020 09:59
@klecki (Contributor) left a comment:

There is one downside to using this with presize hints.

For fragmented CPU buffers (a non-contiguous TensorVector, i.e. a vector of tensors), we will get the sums over all samples. In some cases it would probably be better to return max(used/reserved memory) * nsamples.

@@ -118,6 +132,31 @@ class DLL_PUBLIC Executor : public ExecutorBase, public WorkspacePolicy, public
DISABLE_COPY_MOVE_ASSIGN(Executor);

protected:
template <typename W>
inline void FillStats(ExecutorMetaMap &memory_stats, W ws, std::string op_name,
std::mutex &write_mutex) {
Contributor:

Nitpick: please add a space here.

Contributor Author:

Done

@@ -220,6 +259,12 @@ class DLL_PUBLIC Executor : public ExecutorBase, public WorkspacePolicy, public
// in some edge cases where there are no operators
std::vector<cudaEvent_t> mixed_callback_events_;

std::atomic<bool> get_memory_stats_ = ATOMIC_VAR_INIT(false);;
Contributor:

Double ;.
Do we really need the init? Isn't the constructor enough? Maybe with the C API we do.

Contributor Author:

Done

Comment on lines 314 to 315
template<typename map>
void AppendToMap(map &ret, map &in_stats, std::mutex &mutex) {
Contributor:

Suggested change
-template<typename map>
-void AppendToMap(map &ret, map &in_stats, std::mutex &mutex) {
+void AppendToMap(ExecutorMetaMap &ret, const ExecutorMetaMap &in_stats, std::mutex &mutex) {

Contributor Author:

Done

Comment on lines 317 to 319
for (auto const& stats : in_stats) {
ret.emplace(stats);
}
Contributor:

Wouldn't

Suggested change
-for (auto const& stats : in_stats) {
-  ret.emplace(stats);
-}
+ret.insert(in_stats.begin(), in_stats.end());

also work?

Contributor Author:

Done

@@ -347,6 +409,7 @@ void Executor<WorkspacePolicy, QueuePolicy>::RunCPU() {

try {
RunHelper(op_node, ws);
FillStats(cpu_memory_stats_, ws, "CPU_" + op_node.instance_name, cpu_memory_stats_mutex_);
Contributor:

Instance names should be unique; what's the rationale for the CPU_ prefixes etc.?

Contributor Author:

If you have an operator for CPU and GPU then it is hard to tell which instance is placed where. The name is unique but a bit mangled and not always self-explanatory to the user.

dali/pipeline/pipeline.h: resolved review thread (not shown)
Comment on lines 1064 to 1068
.def("executor_meta",
[](Pipeline *p) {
auto ret = p->GetExecutorMeta();
return ExecutorMetaToDict(ret);
})
Contributor:

I was wondering if having one entry point for all statistics would be beneficial, but it probably depends on what kind of statistics we provide.
If everything can be merged into a dictionary { "op_name" : { "stats_name" : value } }, then it's fine; if not, then it doesn't make much sense.

Contributor:

But maybe it should be something like executor_meta("memory") or executor_meta(["memory", "threading", ...])?

Contributor Author:

I would keep it as it is, as we don't have a good plan for other stats now. Having 5 levels of nesting just to read one int is not good.

``real_memory_size``: list of memory sizes that is used by each output of the operator;
index in the list corresponds to the output index

``reserver_memory_size``: list of memory sizes that is reserved for each of the operator outputs
Contributor:

Suggested change
-``reserver_memory_size``: list of memory sizes that is reserved for each of the operator outputs
+``reserved_memory_size``: list of memory sizes that is reserved for each of the operator outputs

Contributor:

Maybe mention here what option should be set to get the stats.

Contributor Author:

Done

@@ -326,9 +348,10 @@ def _prepare_graph(self, define_graph = None):
 self._bytes_per_sample,
 self._set_affinity,
 self._max_streams,
-self._default_cuda_stream_priority)
+self._default_cuda_stream_priority,)
Contributor:

Suggested change
-self._default_cuda_stream_priority,)
+self._default_cuda_stream_priority)

Contributor Author:

Done

@@ -48,6 +48,7 @@ The purpose of this functionality is to enable the user to fine-tune the process
DALI uses intermediate buffers to pass data between operators in the processing graph. With DALI, the memory is never freed but just enlarged when present buffers are not sufficient to hold the data. However, in some cases, even this limited number of allocations still could affect DALI performance. Hence, if the user knows how much memory each operator buffer needs, then it is possible to provide a hint to presize buffers before the first run.
Two parameters are available: First, the ``bytes_per_sample`` pipeline argument, which accepts one value that is used globally across all operators for all buffers.
The second parameter is the ``bytes_per_sample_hint`` per operator argument, which accepts one value or a list of values. When one value is provided it is used for all output buffers for a given operator. When a list is provided then each buffer is presized to the corresponding size.
To learn how much memory outputs of each operator need, the user may create the pipeline with ``get_memory_stats`` set to ``True`` and then query the pipeline for the operator output memory occupation by calling ``executor_meta`` method on the pipeline.
Contributor:

Can you give an example here of how to use that information?

The obtained dictionary will contain the memory currently used for the latest batch and the total reserved memory, which indicates the maximal memory usage (unless the thresholds for memory reallocation were adjusted; link to the Memory consumption section).

Then you can use the reserved memory as presize hints (link to section).

Contributor Author:

Done
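For reference, a sketch of the kind of example the reviewer asks for. The names enable_memory_stats, executor_statistics(), and the max_real_memory_size key come from this review; the exact layout of the returned dictionary is an assumption:

    # Assumes `pipe` was created with enable_memory_stats=True, built, and run at least once.
    stats = pipe.executor_statistics()

    # Use the largest per-sample output size of each operator as its bytes_per_sample_hint.
    hints = {}
    for op_name, op_stats in stats.items():
        # one entry per operator output; take the biggest output as a conservative hint
        hints[op_name] = max(op_stats["max_real_memory_size"])

    print(hints)
    # The values can then be passed to the corresponding operators, e.g.
    # ops.ImageDecoder(device="mixed", bytes_per_sample_hint=<value from hints>)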

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@@ -92,11 +92,16 @@ class Pipeline(object):
unrestricted number of streams is assumed).
`default_cuda_stream_priority` : int, optional, default = 0
CUDA stream priority used by DALI. See `cudaStreamCreateWithPriority` in CUDA documentation
`get_memory_stats`: bool, optional, default = False
Contributor:

Suggested change
`get_memory_stats`: bool, optional, default = False
`get_memory_stats`: bool, optional, default = False

enable_memory_stats

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [1413697]: BUILD STARTED

@JanuszL (Contributor Author) commented Jun 22, 2020

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1413748]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [1413748]: BUILD PASSED

@dali-automaton (Collaborator):

CI MESSAGE: [1413697]: BUILD PASSED

Comment on lines +195 to +197
``max_real_memory_size``: list of maximum tensor size that is used by each output of the operator;
index in the list corresponds to the output index

Contributor:

Only a small nitpick: this is the max for non-contiguous buffers and the average for contiguous ones. In the case of real_memory we could theoretically calculate the volume of the samples, but I don't think it's worth it.

size_t &max_reserved_size) {
for (size_t j = 0; j < in.ntensor(); ++j) {
max_out_size = std::max(in[j].nbytes(), max_out_size);
max_reserved_size = std::max(in[j].capacity(), max_reserved_size);
Contributor:

I have some reservations regarding this. In the case of a contiguous TensorVector, the Tensors that you're accessing with operator[] here are kind of views into the backing TensorList. So the capacity will probably match the nbytes exactly here, but the total capacity of the TensorVector might be bigger than nsamples * max(number of bytes) (I suspect).

This can probably be checked with a test that sets the TensorVector to contiguous mode, resizes it to some big shapes, and then sets all the shapes to be, for example, half the initial size.

Can you check this? I think the best solution would be to take the max of this reserved size and the average reserved size you calculate for the TensorList in the function above.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL (Contributor Author) commented Jun 22, 2020

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1414632]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [1414632]: BUILD PASSED

@@ -48,6 +48,7 @@ The purpose of this functionality is to enable the user to fine-tune the process
DALI uses intermediate buffers to pass data between operators in the processing graph. With DALI, the memory is never freed but just enlarged when present buffers are not sufficient to hold the data. However, in some cases, even this limited number of allocations still could affect DALI performance. Hence, if the user knows how much memory each operator buffer needs, then it is possible to provide a hint to presize buffers before the first run.
Two parameters are available: First, the ``bytes_per_sample`` pipeline argument, which accepts one value that is used globally across all operators for all buffers.
The second parameter is the ``bytes_per_sample_hint`` per operator argument, which accepts one value or a list of values. When one value is provided it is used for all output buffers for a given operator. When a list is provided then each buffer is presized to the corresponding size.
To learn how much memory outputs of each operator need, the user may create the pipeline with ``enable_memory_stats`` set to ``True`` and then query the pipeline for the operator's output memory statistics by calling the ``executor_meta`` method on the pipeline. The ``max_real_memory_size`` value tells the size of the biggest tensor in the batch for outputs that allocate memory per sample, not for the whole batch at a time, or the average tensor size when the allocation is contiguous. This value is the one that should be provided to ``bytes_per_sample_hint``.
Contributor:

Shouldn't that be max_reserved_memory_size, as this is the bigger value which includes some "padding"? Or, as we calculate the max, both will be similar here (and we don't actually need to return all the values?). When will they be different? Only for small data with the padding?

Contributor Author:

Not really. If the user uses SetBufferGrowthFactor with a big value, then it is better to stick to the used memory, not the reserved one.

Contributor:

OK, it makes more sense to use the real size, as no padding will be used if there is enough memory -> no reallocation.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL merged commit f4f88e7 into NVIDIA:master on Jun 23, 2020
@JanuszL deleted the dump_operator_stats branch on June 23, 2020 11:39