
Dump operator stats #2039

Merged: 6 commits merged into NVIDIA:master from dump_operator_stats on Jun 23, 2020

Conversation

@JanuszL (Contributor) commented Jun 19, 2020

  • adds an option to the executor that makes it gather operator output buffer size statistics, so the user can select the right value for bytes_per_sample_hint

Signed-off-by: Janusz Lisiecki jlisiecki@nvidia.com

Why we need this PR?


  • It adds the ability to dump info about operator output buffer sizes

What happened in this PR?


  • What solution was applied:
    adds an option to the executor that makes it gather operator output buffer size statistics, so the user can select the right value for bytes_per_sample_hint (see the usage sketch below)
  • Affected modules and functionalities:
    Executor, pipeline API across the stack
  • Key points relevant for the review:
    How it is done in the executor
  • Validation and testing:
    CI, but no test targeting the prints
  • Documentation (including examples):
    API description updated

JIRA TASK: [DALI-1024]
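For context, a minimal usage sketch of the feature this PR adds. The parameter and method names enable_memory_stats and executor_statistics() are taken from the review discussion below; the data path is a placeholder, so treat this as an illustrative assumption rather than the PR's exact example:

    from nvidia.dali.pipeline import Pipeline
    import nvidia.dali.ops as ops

    class MyPipeline(Pipeline):
        def __init__(self, batch_size, num_threads, device_id):
            # enable_memory_stats asks the executor to record output buffer sizes
            super().__init__(batch_size, num_threads, device_id,
                             enable_memory_stats=True)
            self.reader = ops.FileReader(file_root="data")  # placeholder path
            self.decode = ops.ImageDecoder(device="mixed")

        def define_graph(self):
            jpegs, labels = self.reader()
            return self.decode(jpegs), labels

    pipe = MyPipeline(batch_size=32, num_threads=4, device_id=0)
    pipe.build()
    pipe.run()
    # per-operator output buffer statistics gathered by the executor
    print(pipe.executor_statistics())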

- adds an option to the executor that makes it gather operator output
  buffer size statistics, so the user can select the right value
  for `bytes_per_sample_hint`

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL (Contributor Author) commented Jun 19, 2020

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1409842]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [1409842]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL (Contributor Author) commented Jun 19, 2020

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1409898]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [1409898]: BUILD PASSED

@@ -109,7 +109,8 @@ void daliCreatePipeline(daliPipelineHandle *pipe_handle,
 int separated_execution,
 int prefetch_queue_depth,
 int cpu_prefetch_queue_depth,
-int gpu_prefetch_queue_depth) {
+int gpu_prefetch_queue_depth,
+int get_memory_stats) {
Contributor:

nitpick: get_memory_stats gives me the feeling that it is a function name. Consider naming it something like "memory_stats_enabled".

Contributor Author:

Done

int i = 0;
for (const auto &stat : returned_meta) {
auto op_name_size = stat.first.size();
(*operator_meta)[i].operator_name = static_cast<char*>(malloc(sizeof(char) *
Contributor:

Suggested change
-(*operator_meta)[i].operator_name = static_cast<char*>(malloc(sizeof(char) *
+auto &op_meta = (*operator_meta)[i];
+op_meta.operator_name = static_cast<char*>(malloc(sizeof(char) *

Contributor Author:

Done

size_t out_num; // number of the operator outputs
size_t *real_size; // real size of the operator output, user need to free the memory
size_t *reserved; // reserved size of the operator output, user need to free the memory
} daliExecutorMetadata;
Contributor:

I suggest you create a freeDaliExecutorMetadata or similar, that invokes free where necessary. No need for the user to do this explicitly.

Contributor Author:

Done
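For reference, a minimal sketch of what such a freeing helper could look like, based on the struct fields shown above and the freeing pattern that appears later in this review. The name freeDaliExecutorMetadata is the reviewer's suggestion; the exact name and signature added by the PR are not shown here, so treat this as an assumption:

    #include <stdlib.h>

    /* Frees an array of `num` daliExecutorMetadata entries previously
     * returned by the executor statistics query. */
    void freeDaliExecutorMetadata(daliExecutorMetadata *meta, size_t num) {
      for (size_t i = 0; i < num; ++i) {
        free(meta[i].operator_name);  /* name buffer allocated with malloc */
        free(meta[i].real_size);      /* per-output real sizes */
        free(meta[i].reserved);       /* per-output reserved sizes */
      }
      free(meta);                     /* the array itself */
    }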

*/
DLL_PUBLIC void daliGetReaderMetadata(daliPipelineHandle* pipe_handle, const char *reader_name,
daliReaderMetadata* meta);
/**
* @brief Returns obtains the executor statistics
Contributor:

Suggested change
- * @brief Returns obtains the executor statistics
+ * @brief Obtains the executor statistics

Contributor Author:

Done

/**
* @brief Returns obtains the executor statistics
* @param operator_meta Pointer to the memory allocated by the function with operator_meta_num
* number of metadata entries. The user need to free that memory, as well
Contributor:

Suggested change
- * number of metadata entries. The user need to free that memory, as well
+ * number of metadata entries. The user need to free that memory by invoking `freeDaliExecutorMetadata`

if you follow my suggestion above

Contributor Author:

Done

for k in meta.keys():
if "CropMirrorNormalize" in k:
crop_meta = meta[k]
assert(crop_meta["real_memory_size"] == crop_meta["reserver_memory_size"])
Contributor:

Suggested change
-assert(crop_meta["real_memory_size"] == crop_meta["reserver_memory_size"])
+assert(crop_meta["real_memory_size"] == crop_meta["reserved_memory_size"])

Contributor Author:

Done

@@ -812,6 +836,7 @@ def deserialize_and_build(self, serialized_pipeline):
self._default_cuda_stream_priority)
self._pipe.SetExecutionTypes(self._exec_pipelined, self._exec_separated, self._exec_async)
self._pipe.SetQueueSizes(self._cpu_queue_size, self._gpu_queue_size)
self._pipe.EnableOperatorOutputMemoryStatistics(self._get_memory_stats)
Contributor:

Suggested change (identical text; see the naming comment below):
self._pipe.EnableOperatorOutputMemoryStatistics(self._get_memory_stats)

Consider shortening name here:
EnableMemoryStatistics
EnableOpMemoryStats
or something like that

Contributor Author:

Done (somehow)

@@ -177,6 +183,22 @@ def epoch_size(self, name = None):
return self._pipe.reader_meta(name)["epoch_size_padded"]
return {name : v["epoch_size_padded"] for k, v in self._pipe.reader_meta()}

def executor_meta(self):
Contributor:

executor_statistics?

Contributor Author:

Done

reserver_memory_size.append(entry.reserved);
}
op_dict["real_memory_size"] = real_memory_size;
op_dict["reserver_memory_size"] = reserver_memory_size;
Contributor:

Suggested change
-op_dict["reserver_memory_size"] = reserver_memory_size;
+op_dict["reserved_memory_size"] = reserver_memory_size;

Contributor Author:

Done

@@ -492,6 +514,7 @@ class DLL_PUBLIC Pipeline {
int next_logical_id_ = 0;
int next_internal_logical_id_ = -1;
QueueSizes prefetch_queue_depth_;
bool get_memory_stats_ = false;
Contributor:

Suggested change
-bool get_memory_stats_ = false;
+bool memory_stats_enabled_ = false;

Contributor Author:

Done

Comment on lines 465 to 470
for (size_t i = 0; i < N; ++i) {
free(meta[i].operator_name);
free(meta[i].real_size);
free(meta[i].reserved);
}
free(meta);
Contributor:

Maybe it would be beneficial to add a function for freeing such metadata to the C API?

Contributor Author:

Done

Comment on lines 117 to 129
size_t total_nbytes = 0;
for (const auto &t : tensors_) {
total_nbytes += t->nbytes();
}
return total_nbytes;
}

size_t capacity() const noexcept {
size_t total_capacity = 0;
for (const auto &t : tensors_) {
total_capacity += t->capacity();
}
return total_capacity;
Contributor:

Those functions need to be handled similarly to, for example, shape(), so

if (state_ == State::contiguous) {
  return tl->nbytes();
} 
size_t total_nbytes = 0;
for (const auto &t : tensors_) {
  total_nbytes += t->nbytes();
}
return total_nbytes;

as you can have them backed by a tensor list or a vector of tensors.

Contributor Author:

Done
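For reference, capacity() handled with the same pattern might look roughly like this; the member names state_, tl_, and tensors_ follow the reviewer's snippet above and are assumptions about the actual class layout:

    size_t capacity() const noexcept {
      if (state_ == State::contiguous) {
        // contiguous case: the backing TensorList owns the whole allocation
        return tl_->capacity();
      }
      // non-contiguous case: sum the capacities of the individual tensors
      size_t total_capacity = 0;
      for (const auto &t : tensors_) {
        total_capacity += t->capacity();
      }
      return total_capacity;
    }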

@JanuszL force-pushed the dump_operator_stats branch 3 times, most recently from 2f15c08 to 89341cb, on June 22, 2020 09:59
@klecki (Contributor) left a comment:

There is one downside to using this with presize hints.

For fragmented CPU buffers (a non-contiguous TensorVector, i.e. a vector of tensors), we will get the sums over all samples. In some cases it would probably be better to return max(used/reserved memory) * nsamples.

@@ -118,6 +132,31 @@ class DLL_PUBLIC Executor : public ExecutorBase, public WorkspacePolicy, public
DISABLE_COPY_MOVE_ASSIGN(Executor);

protected:
template <typename W>
inline void FillStats(ExecutorMetaMap &memory_stats, W ws, std::string op_name,
std::mutex &write_mutex) {
Contributor:

Nitpick: please add a space here.

Contributor Author:

Done

@@ -220,6 +259,12 @@ class DLL_PUBLIC Executor : public ExecutorBase, public WorkspacePolicy, public
// in some edge cases where there are no operators
std::vector<cudaEvent_t> mixed_callback_events_;

std::atomic<bool> get_memory_stats_ = ATOMIC_VAR_INIT(false);;
Contributor:

Double ;.
Do we really need the init? Isn't the constructor enough? Maybe with the C API we do.

Contributor Author:

Done

Comment on lines 314 to 315
template<typename map>
void AppendToMap(map &ret, map &in_stats, std::mutex &mutex) {
Contributor:

Suggested change
-template<typename map>
-void AppendToMap(map &ret, map &in_stats, std::mutex &mutex) {
+void AppendToMap(ExecutorMetaMap &ret, const ExecutorMetaMap &in_stats, std::mutex &mutex) {

Contributor Author:

Done

Comment on lines 317 to 319
for (auto const& stats : in_stats) {
ret.emplace(stats);
}
Contributor:

Wouldn't

Suggested change
-for (auto const& stats : in_stats) {
-  ret.emplace(stats);
-}
+ret.insert(in_stats.begin(), in_stats.end());

also work?

Contributor Author:

Done

@@ -347,6 +409,7 @@ void Executor<WorkspacePolicy, QueuePolicy>::RunCPU() {

try {
RunHelper(op_node, ws);
FillStats(cpu_memory_stats_, ws, "CPU_" + op_node.instance_name, cpu_memory_stats_mutex_);
Contributor:

Instance names should be unique; what's the rationale for the CPU_ prefixes etc.?

Contributor Author:

If you have an operator for CPU and GPU then it is hard to tell which instance is placed where. The name is unique but a bit mangled and not always self-explanatory to the user.

dali/pipeline/pipeline.h: resolved review thread (not shown)
Comment on lines 1064 to 1068
.def("executor_meta",
[](Pipeline *p) {
auto ret = p->GetExecutorMeta();
return ExecutorMetaToDict(ret);
})
Contributor:

I was wondering if having one entry point for all statistics would be beneficial, but it probably depends on what kind of statistics we provide.
If everything can be merged into a dictionary { "op_name" : { "stats_name" : value } }, then it's fine; if not, then it doesn't make much sense.

Contributor:

But maybe it should be something like executor_meta("memory") or executor_meta(["memory", "threading", ...])?

Contributor Author:

I would keep it as it is, as we don't have a good plan for other stats now. Having 5 levels of nesting just to read one int is not good.

``real_memory_size``: list of memory sizes that is used by each output of the operator;
index in the list corresponds to the output index

``reserver_memory_size``: list of memory sizes that is reserved for each of the operator outputs
Contributor:

Suggested change
-``reserver_memory_size``: list of memory sizes that is reserved for each of the operator outputs
+``reserved_memory_size``: list of memory sizes that is reserved for each of the operator outputs

Contributor:

Maybe mention here what option should be set to get the stats.

Contributor Author:

Done

@@ -326,9 +348,10 @@ def _prepare_graph(self, define_graph = None):
 self._bytes_per_sample,
 self._set_affinity,
 self._max_streams,
-self._default_cuda_stream_priority)
+self._default_cuda_stream_priority,)
Contributor:

Suggested change
-self._default_cuda_stream_priority,)
+self._default_cuda_stream_priority)

Contributor Author:

Done

@@ -48,6 +48,7 @@ The purpose of this functionality is to enable the user to fine-tune the process
DALI uses intermediate buffers to pass data between operators in the processing graph. With DALI, the memory is never freed but just enlarged when present buffers are not sufficient to hold the data. However, in some cases, even this limited number of allocations still could affect DALI performance. Hence, if the user knows how much memory each operator buffer needs, then it is possible to provide a hint to presize buffers before the first run.
Two parameters are available: First, the ``bytes_per_sample`` pipeline argument, which accepts one value that is used globally across all operators for all buffers.
The second parameter is the ``bytes_per_sample_hint`` per operator argument, which accepts one value or a list of values. When one value is provided it is used for all output buffers for a given operator. When a list is provided then each buffer is presized to the corresponding size.
To learn how much memory outputs of each operator need, the user may create the pipeline with ``get_memory_stats`` set to ``True`` and then query the pipeline for the operator output memory occupation by calling ``executor_meta`` method on the pipeline.
Contributor:

Can you give an example here of how to use that information?

The obtained dictionary will contain the memory currently used for the latest batch and the total reserved memory, which indicates the maximal memory usage (unless the thresholds for memory reallocation were adjusted; link to the Memory consumption section).

Then you can use the reserved memory as presize hints (link to section).

Contributor Author:

Done
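For reference, a sketch of the kind of example the reviewer asks for. The names enable_memory_stats, executor_statistics(), and the max_real_memory_size key come from this review; the exact layout of the returned dictionary is an assumption:

    # Assumes `pipe` was created with enable_memory_stats=True, built, and run at least once.
    stats = pipe.executor_statistics()

    # Use the largest per-sample output size of each operator as its bytes_per_sample_hint.
    hints = {}
    for op_name, op_stats in stats.items():
        # one entry per operator output; take the biggest output as a conservative hint
        hints[op_name] = max(op_stats["max_real_memory_size"])

    print(hints)
    # The values can then be passed to the corresponding operators, e.g.
    # ops.ImageDecoder(device="mixed", bytes_per_sample_hint=<value from hints>)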

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@@ -92,11 +92,16 @@ class Pipeline(object):
unrestricted number of streams is assumed).
`default_cuda_stream_priority` : int, optional, default = 0
CUDA stream priority used by DALI. See `cudaStreamCreateWithPriority` in CUDA documentation
`get_memory_stats`: bool, optional, default = False
Contributor:

Suggested change
`get_memory_stats`: bool, optional, default = False
`get_memory_stats`: bool, optional, default = False

enable_memory_stats

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [1413697]: BUILD STARTED

@JanuszL (Contributor Author) commented Jun 22, 2020

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1413748]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [1413748]: BUILD PASSED

@dali-automaton (Collaborator):

CI MESSAGE: [1413697]: BUILD PASSED

Comment on lines +195 to +197
``max_real_memory_size``: list of maximum tensor size that is used by each output of the operator;
index in the list corresponds to the output index

Contributor:

Only a small nitpick: this is the max for non-contiguous buffers and the average for contiguous ones. In the case of real_memory we could theoretically calculate the volume of the samples, but I don't think it's worth it.

size_t &max_reserved_size) {
for (size_t j = 0; j < in.ntensor(); ++j) {
max_out_size = std::max(in[j].nbytes(), max_out_size);
max_reserved_size = std::max(in[j].capacity(), max_reserved_size);
Contributor:

I have some reservations regarding this. In the case of a contiguous TensorVector, the Tensors that you're accessing with operator[] here are kind of views into the backing TensorList. So the capacity will probably match the nbytes exactly here, but the total capacity of the TensorVector might be bigger than nsamples * max(number of bytes) (I suspect).

This can probably be checked with a test that sets the TensorVector to contiguous mode, resizes it to some big shapes, and then sets all the shapes to be, for example, half the initial size.

Can you check this? I think the best solution would be to take the max of this reserved size and the average reserved size you calculate for the TensorList in the function above.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL (Contributor Author) commented Jun 22, 2020

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1414632]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [1414632]: BUILD PASSED

@@ -48,6 +48,7 @@ The purpose of this functionality is to enable the user to fine-tune the process
DALI uses intermediate buffers to pass data between operators in the processing graph. With DALI, the memory is never freed but just enlarged when present buffers are not sufficient to hold the data. However, in some cases, even this limited number of allocations still could affect DALI performance. Hence, if the user knows how much memory each operator buffer needs, then it is possible to provide a hint to presize buffers before the first run.
Two parameters are available: First, the ``bytes_per_sample`` pipeline argument, which accepts one value that is used globally across all operators for all buffers.
The second parameter is the ``bytes_per_sample_hint`` per operator argument, which accepts one value or a list of values. When one value is provided it is used for all output buffers for a given operator. When a list is provided then each buffer is presized to the corresponding size.
To learn how much memory outputs of each operator need, the user may create the pipeline with ``enable_memory_stats`` set to ``True`` and then query the pipeline for the operator's output memory statistics by calling the ``executor_meta`` method on the pipeline. The ``max_real_memory_size`` value tells the size of the biggest tensor in the batch for outputs that allocate memory per sample, not for the whole batch at a time, or the average tensor size when the allocation is contiguous. This value is the one that should be provided to ``bytes_per_sample_hint``.
Contributor:

Shouldn't that be max_reserved_memory_size, as this is the bigger value which includes some "padding"? Or, as we calculate the max, both will be similar here (and we don't actually need to return all the values?). When will they be different? Only for small data with the padding?

Contributor Author:

Not really. If the user uses SetBufferGrowthFactor with a big value, then it is better to stick to the used memory, not the reserved one.

Contributor:

OK, it makes more sense to use the real size, as no padding will be used if there is enough memory -> no reallocation.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL merged commit f4f88e7 into NVIDIA:master on Jun 23, 2020
@JanuszL deleted the dump_operator_stats branch on June 23, 2020 11:39