Commit

Docs update
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
JanuszL committed Jun 23, 2020
1 parent ce82218 commit 41bd1b1
Showing 1 changed file with 5 additions and 2 deletions.
docs/advanced_topics.rst: 7 changes (5 additions & 2 deletions)
@@ -39,16 +39,19 @@ For example, if your batch consists of nine 480p images and one 4K image in random order
On the other hand, a CPU batch kept as separate per-sample buffers will, after several iterations, end up holding a 4K-sized allocation for every sample.
In effect, a GPU buffer allocated to house transformation results is as large as the largest possible batch, while the CPU buffers can together be as large as the batch size multiplied by the size of the largest sample. Note that even though the CPU processes one sample at a time per thread, a whole batch of samples needs to reside in memory.
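
For a rough sense of scale, here is a back-of-the-envelope sketch of the example above (assuming 8-bit RGB samples, with 480p taken as 854x480 and 4K as 3840x2160; the exact resolutions are illustrative assumptions):

.. code-block:: python

    MB = 1024 * 1024
    small = 854 * 480 * 3 / MB     # ~1.2 MB per 480p sample
    large = 3840 * 2160 * 3 / MB   # ~23.7 MB per 4K sample

    # One contiguous GPU buffer only needs to fit the largest batch seen:
    gpu_contiguous = 9 * small + 1 * large   # ~34 MB
    # Separate CPU buffers each eventually grow to the largest sample:
    cpu_per_sample = 10 * large              # ~237 MB
    print(f"GPU: ~{gpu_contiguous:.0f} MB, CPU: ~{cpu_per_sample:.0f} MB")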

Moreover, both host and GPU buffers have a configurable growth factor - if it is above 1 and the requested new size exceeds the buffer capacity, the buffer is allocated with an extra margin so that subsequent reallocations can potentially be avoided. This functionality is disabled by default (the growth factor is set to 1). The factors can be controlled with the environment variables ``DALI_HOST_BUFFER_GROWTH_FACTOR`` and ``DALI_DEVICE_BUFFER_GROWTH_FACTOR``, respectively, as well as with the Python API functions ``nvidia.dali.backend.SetHostBufferGrowthFactor`` and ``nvidia.dali.backend.SetDeviceBufferGrowthFactor``. For convenience, the variable ``DALI_BUFFER_GROWTH_FACTOR`` and the corresponding Python function ``nvidia.dali.backend.SetBufferGrowthFactor`` set the same growth factor for both host and GPU buffers.
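
A minimal sketch of both ways to set these factors (the values below are purely illustrative):

.. code-block:: python

    import os

    # Environment variables, typically set in the shell before the process starts:
    os.environ["DALI_HOST_BUFFER_GROWTH_FACTOR"] = "2"
    os.environ["DALI_DEVICE_BUFFER_GROWTH_FACTOR"] = "1.5"

    # Equivalent Python API calls:
    import nvidia.dali.backend as backend
    backend.SetHostBufferGrowthFactor(2.0)
    backend.SetDeviceBufferGrowthFactor(1.5)

    # Convenience setter for both host and GPU buffers at once:
    backend.SetBufferGrowthFactor(1.5)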

Operator buffer presizing
-------------------------

This functionality enables the user to fine-tune the processing pipeline in situations where the memory consumption of a DALI run can be forecast precisely, which saves the overhead of some reallocations.

DALI uses intermediate buffers to pass data between operators in the processing graph. DALI never frees this memory: buffers are only enlarged when their current capacity is insufficient to hold the data. However, in some cases even this limited number of allocations can still affect DALI performance. Hence, if the user knows how much memory each operator buffer needs, it is possible to provide a hint to presize the buffers before the first run.

Two parameters are available. The first is the ``bytes_per_sample`` pipeline argument, which accepts one value that is used globally across all operators and all buffers.
The second is the ``bytes_per_sample_hint`` per-operator argument, which accepts one value or a list of values. When one value is provided, it is used for all output buffers of the given operator. When a list is provided, each output buffer is presized to the corresponding size.
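
A minimal sketch of both hints (the reader, decoder, file path, and sizes are illustrative assumptions, not recommendations):

.. code-block:: python

    from nvidia.dali.pipeline import Pipeline
    import nvidia.dali.ops as ops

    class PresizedPipeline(Pipeline):
        def __init__(self, batch_size, num_threads, device_id, **kwargs):
            # Global hint: presize every operator buffer to 1 MB per sample
            super().__init__(batch_size, num_threads, device_id,
                             bytes_per_sample=1024 * 1024, **kwargs)
            self.input = ops.FileReader(file_root="/data/images")  # hypothetical path
            # Per-operator hint: presize the decoder outputs to 25 MB per sample
            self.decode = ops.ImageDecoder(device="mixed",
                                           bytes_per_sample_hint=25 * 1024 * 1024)

        def define_graph(self):
            jpegs, labels = self.input()
            images = self.decode(jpegs)
            return images, labels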
To learn how much memory the outputs of each operator need, the user may create the pipeline with ``enable_memory_stats`` set to ``True`` and then query the pipeline for the operators' output memory statistics by calling the ``executor_meta`` method on the pipeline. The ``max_real_memory_size`` value reports the biggest tensor in the batch for outputs that allocate memory per sample (rather than for the whole batch at a time), or the average tensor size when the allocation is contiguous. Usually this is the value that should be provided to ``bytes_per_sample_hint``.

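Continuing the sketch above, the statistics could be queried as follows (the exact layout of the returned metadata is an assumption based on the description here):

.. code-block:: python

    pipe = PresizedPipeline(batch_size=32, num_threads=4, device_id=0,
                            enable_memory_stats=True)
    pipe.build()
    pipe.run()

    # executor_meta maps each operator instance to its memory statistics
    for op_name, stats in pipe.executor_meta().items():
        print(op_name, stats["max_real_memory_size"])
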
Prefetching queue depth
-----------------------
