
Add support for GPU based numpy reader #2477

Merged — 12 commits merged into NVIDIA:master on Nov 26, 2020

Conversation

@JanuszL (Contributor) commented Nov 18, 2020

  • The GPU based numpy reader uses GPU Direct Storage via the cufile library.

Authored-by: Thorsten Kurth <tkurth@nvidia.com>
Co-authored-by: Michał Zientkiewicz <mzient@gmail.com>
Co-authored-by: Janusz Lisiecki <jlisiecki@nvidia.com>

Why do we need this PR?

  • It adds support for a GPU based numpy reader

What happened in this PR?


  • What solution was applied:
    The GPU based numpy reader uses GPU Direct Storage via the cufile library (a minimal sketch of the underlying cufile calls follows below).
  • Affected modules and functionalities:
    GPU numpy reader
  • Key points relevant for the review:
    NA
  • Validation and testing:
    a new set of tests is added
  • Documentation (including examples):
    updated

JIRA TASK: DALI-1610
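
For context, a minimal sketch of the cufile calls that GPU Direct Storage reads are built on. This is illustrative only, not DALI's code; the file name, payload size, and the omitted error handling are assumptions:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main() {
  // open the file with O_DIRECT so the DMA path can be used
  int fd = open("sample.npy", O_RDONLY | O_DIRECT);

  CUfileDescr_t descr = {};
  descr.handle.fd = fd;
  descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
  CUfileHandle_t fh;
  cuFileHandleRegister(&fh, &descr);

  size_t n_bytes = 1 << 20;                // placeholder payload size
  void *dev_buf = nullptr;
  cudaMalloc(&dev_buf, n_bytes);
  cuFileBufRegister(dev_buf, n_bytes, 0);  // pin the target for DMA

  // read straight from storage into GPU memory, skipping the host bounce buffer
  ssize_t n_read = cuFileRead(fh, dev_buf, n_bytes, /*file_offset=*/0,
                              /*devPtr_offset=*/0);

  cuFileBufDeregister(dev_buf);
  cuFileHandleDeregister(fh);
  cudaFree(dev_buf);
  close(fd);
  return n_read < 0;
}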

@JanuszL force-pushed the rebased_gds branch 2 times, most recently from b8ac13b to 15d28e0, on November 18, 2020 10:13
@dali-automaton: CI MESSAGE: [1809719]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1809719]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1809968]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1810224]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1810224]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1810664]: BUILD STARTED

@JanuszL force-pushed the rebased_gds branch 2 times, most recently from 5e2970b to fa876ae, on November 18, 2020 22:00
@dali-automaton: CI MESSAGE: [1812137]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1812137]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1814348]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1814348]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1814521]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1814521]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1815161]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1816421]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1816421]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1817930]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1817930]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1821460]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1821460]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1828401]: BUILD PASSED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@dali-automaton: CI MESSAGE: [1832413]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1832413]: BUILD FAILED

@klecki (Contributor) left a comment:

I have a few questions; I didn't find many big problems. I'm wondering whether we can deduplicate the code between FileLoader and CuFileLoader. Tying them together probably isn't the best option, but duplicating the sharding etc. isn't ideal either.
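
Purely as an illustration of the deduplication idea above, a hypothetical sketch (class and method names are invented, not DALI's actual classes): the sharding arithmetic lives in a base templated on the stream type, and the two loaders differ only in how they open files:

#include <cstddef>
#include <memory>
#include <string>

template <typename StreamType>
class IndexedFileLoaderBase {
 protected:
  // shared sharding logic: which slice of the file list this shard reads
  size_t ShardStart(size_t shard_id, size_t num_shards, size_t n_files) const {
    return n_files * shard_id / num_shards;
  }
  size_t ShardEnd(size_t shard_id, size_t num_shards, size_t n_files) const {
    return n_files * (shard_id + 1) / num_shards;
  }
  // the only part that differs between the CPU and GDS loaders
  virtual std::unique_ptr<StreamType> Open(const std::string &path) = 0;
};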

DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/../../tools/stub_generator/stub_codegen.py
"${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}/cufile.h"
"${CMAKE_CURRENT_SOURCE_DIR}/../../tools/stub_generator/cufile.json"
COMMENT "Running cuda.h stub generator"
Contributor:

Nitpick: cufile? (the COMMENT line still says "cuda.h stub generator")

@JanuszL (Author):

Done

PrepareEmptyTensor(T&) {
constexpr bool T_is_Tensor = std::is_same<T, Tensor<CPUBackend>>::value;
constexpr bool T_is_Tensor = (std::is_same<T, Tensor<CPUBackend>>::value ||
Contributor:

Hmm, it's a runtime enforce, so we could just DALI_ERROR here; up to you.

return file;
void NumpyHeaderCache::UpdateCache(const string &file_name, const NumpyParseTarget &value) {
if (cache_headers_) {
std::unique_lock<std::mutex> cache_lock(cache_mutex_);
Contributor:

I assume we're fine with the cache growing; even with millions of files it's probably still only several megabytes.

@JanuszL (Author):

We are fine with that.
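
For illustration, a hypothetical, simplified form of such a cache: an unbounded, mutex-guarded map whose values hold only header metadata, which is why even millions of entries stay small. The type and member names here are invented stand-ins, not DALI's:

#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// illustrative stand-in for the parsed numpy header
struct ParsedHeader {
  std::vector<int64_t> shape;
  bool fortran_order = false;
  size_t type_size = 0;
};

class HeaderCache {
 public:
  bool Lookup(const std::string &file_name, ParsedHeader *out) {
    std::unique_lock<std::mutex> lock(mutex_);
    auto it = cache_.find(file_name);
    if (it == cache_.end()) return false;
    *out = it->second;
    return true;
  }
  void Update(const std::string &file_name, const ParsedHeader &value) {
    std::unique_lock<std::mutex> lock(mutex_);
    cache_.emplace(file_name, value);  // grows without bound, by design
  }
 private:
  std::mutex mutex_;
  std::unordered_map<std::string, ParsedHeader> cache_;
};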

Comment on lines 33 to 34
virtual size_t Read(uint8_t * buffer, size_t n_bytes, size_t offset = 0) = 0;
virtual size_t ReadCPU(uint8_t * buffer, size_t n_bytes) = 0;
Contributor:

I wonder why we didn't go the other way round: let CUFileStream be a regular FileStream with respect to Read() and add a ReadGPU(). But that would probably break on Seek, etc. Maybe instead it would make sense to be able to get a regular FileStream from the CUFileStream? I'm asking because the call through the pointer-to-member in ParseHeader was a bit surprising to me.

@JanuszL (Author):

As I understand it, the goal was to be able to seek the file for both GPU and CPU, while reading the content on the CPU/GPU separately.

"get a regular FileStream from the CUFileStream"

It would require sharing state between both of them, which could be misused, I guess. I'm not sure this is the way we want to do it.
@azrael417 ?

Contributor:

Yeah, using any of the reads advances the internal pos_, so it's not like opening another stream; it's reading on the CPU or GPU using the same reader "head".

Still, doing it the other way round would be easier with respect to the pointer-to-member, and the CUFileStream could be a regular FileStream at the same time, couldn't it?

@JanuszL (Author):

Done, I hope.
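
A hypothetical sketch of the resolved shape (simplified signatures, not DALI's exact interface): CUFileStream is a regular FileStream whose Read() fills host memory, extended with a GPU read, and both reads advance the same position, i.e. one reader "head":

#include <cstddef>
#include <cstdint>

class FileStream {
 public:
  virtual ~FileStream() = default;
  // CPU read; advances pos_
  virtual size_t Read(uint8_t *host_buffer, size_t n_bytes) = 0;
  virtual void SeekRead(int64_t pos) { pos_ = pos; }
 protected:
  int64_t pos_ = 0;  // shared by both read variants
};

class CUFileStream : public FileStream {
 public:
  // GPU read via cufile; advances the same pos_ as Read()
  virtual size_t ReadGPU(uint8_t *device_buffer, size_t n_bytes,
                         size_t buffer_offset = 0) = 0;
};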

docker/Dockerfile.cuda100.x86_64.deps — resolved

for (int data_idx = 0; data_idx < batch_size_; ++data_idx) {
const auto& imfile = GetSample(data_idx);
if (imfile.meta == "transpose:false") {
Contributor:

Really? What was wrong with a boolean?

@JanuszL (Author):

Done


// use copy kernel for plain samples
if (!copy_sizes.empty()) {
ref_type.template Copy<GPUBackend, GPUBackend>(copy_to.data(), copy_from.data(),
Contributor:

I guess detecting whether no sample requires a transpose and reading directly into the target batch (whose size we already know, as we parsed it) would be a potential optimization. Or maybe at least doing it as one big D2D copy?

@JanuszL (Author):

"I guess detecting if all samples don't require transpose and reading directly to a target batch (that we already know size of, as we parsed it), would be a potential optimization."

It is not possible: in the prefetch we don't know the operator output buffer. RunImpl knows it, but by then the prefetched batch already exists. It is not contiguous, so we need that copy (we cannot simply swap buffers).

As you can see, only one copy kernel is invoked here. I don't think we can copy from a non-contiguous tensor vector into a tensor list any faster.

@JanuszL (Author):

Correction - I forgot we have prefetched_batch_tensors_, which is contiguous. We can swap it in if no transposition is needed at all.
Still, regarding point 1: it is created asynchronously in the prefetch thread, so nothing beyond that can be done.
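
A hypothetical sketch of that swap-or-copy logic. The function and parameter names mirror the snippet above but are otherwise invented, and the loop of cudaMemcpyAsync calls stands in for DALI's single batched scatter-gather copy kernel:

#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

void AssembleOutput(bool any_sample_transposed,
                    const std::vector<void *> &copy_to,
                    const std::vector<const void *> &copy_from,
                    const std::vector<size_t> &copy_sizes,
                    cudaStream_t stream) {
  if (!any_sample_transposed) {
    // the contiguous prefetched buffer can be swapped into the output
    // wholesale; no copy at all
    return;
  }
  // otherwise gather the plain (non-transposed) samples; in DALI this is
  // one batched copy kernel rather than a loop of per-sample memcpys
  for (size_t i = 0; i < copy_sizes.size(); ++i) {
    cudaMemcpyAsync(copy_to[i], copy_from[i], copy_sizes[i],
                    cudaMemcpyDeviceToDevice, stream);
  }
}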

void NumpyLoaderGPU::ReadSampleHelper(CUFileStream *file, ImageFileWrapperGPU& imfile,
void *buffer, Index offset, size_t total_size) {
// register the buffer (if needed)
RegisterTensor(buffer, total_size);
Contributor:

It's not exactly a Tensor; it registers the whole tensor list, right? I guess that's why we use the offset into the buffer instead of buffer + offset as the Read target.

@JanuszL (Author):

Renamed
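
A hypothetical sketch of the pattern being discussed (helper names are invented): register the whole contiguous batch buffer with cufile once, then target each sample through cuFileRead's devPtr_offset argument instead of registering buffer + offset separately per sample:

#include <cstddef>
#include <sys/types.h>
#include <cufile.h>

// register once per (re)allocated batch buffer, not per sample
void RegisterBatchBuffer(void *batch_base, size_t batch_bytes) {
  cuFileBufRegister(batch_base, batch_bytes, 0);
}

// read one sample into the registered region at the given offset
ssize_t ReadSampleAt(CUfileHandle_t fh, void *batch_base,
                     size_t sample_bytes, off_t file_offset,
                     off_t offset_in_buffer) {
  // devPtr_offset selects the destination inside the registered buffer
  return cuFileRead(fh, batch_base, sample_bytes, file_offset,
                    offset_in_buffer);
}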


imfile.type_info = target.type_info;
imfile.shape = target.shape;
imfile.meta = (target.fortran_order ? "transpose:true" : "transpose:false");
Contributor:

What's wrong with a boolean? Can we get rid of the string here?

@JanuszL (Author):

Done
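
For illustration, the shape of the fix (with a hypothetical simplified type standing in for ImageFileWrapperGPU): carry fortran_order as a bool on the sample and branch on it directly, instead of encoding it into a string and comparing on the consumer side:

// before: imfile.meta = target.fortran_order ? "transpose:true" : "transpose:false";
//         ... if (imfile.meta == "transpose:false") { ... }

struct SampleInfo {             // illustrative stand-in for ImageFileWrapperGPU
  bool fortran_order = false;   // set while parsing the numpy header
};

void Consume(const SampleInfo &imfile) {
  if (!imfile.fortran_order) {
    // plain copy path, no transposition needed
  } else {
    // transpose kernel path
  }
}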

// set metadata
imfile.image.SetMeta(meta);

imfile.read_meta_f = [this, image_file, &imfile] () {
Contributor:

What is stopping us from executing this part right here? Is it slower than running it in the thread pool and waiting for the work to be done? If the prefetch were ready before we start calling read_sample_f in the reader, that would probably be better?

@JanuszL (Author):

I haven't compared the difference, but as we have a header cache, it is not the fastest operation possible.

"If the prefetch is ready before we start calling read_sample_f in the reader it would be probably better"

The thread pool still runs inside the prefetch thread, just in a different place.
The usual flow is:

  • Prefetch thread -> Prefetch() -> ReadOne() -> ReadSample()

The flow here is:

  • Prefetch thread -> first: Prefetch() -> ReadOne() -> ReadSample()
    -> then: the thread pool runs the functions obtained above

So it is still asynchronous to the executor thread (a sketch of this two-phase flow follows below).
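
A hypothetical sketch of that two-phase flow (types are invented): ReadSample() only records the work as closures, and the thread pool owned by the prefetch path then executes them, so both phases stay asynchronous to the executor thread:

#include <functional>
#include <vector>

struct Sample {
  std::function<void()> read_meta_f;    // parse/cache the numpy header
  std::function<void()> read_sample_f;  // the actual (GPU) read
};

void Prefetch(std::vector<Sample> &batch) {
  // phase 1: ReadOne()/ReadSample() has already filled the closures;
  // phase 2: hand them to the thread pool (shown inline here for brevity)
  for (auto &sample : batch) {
    sample.read_meta_f();
    sample.read_sample_f();
  }
}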

@dali-automaton: CI MESSAGE: [1832413]: BUILD PASSED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@dali-automaton: CI MESSAGE: [1833957]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1833957]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1833957]: BUILD PASSED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL force-pushed the rebased_gds branch 4 times, most recently from 9dae415 to 2ebd763, on November 26, 2020 14:05
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@dali-automaton: CI MESSAGE: [1836444]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1836535]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1836535]: BUILD PASSED
@dali-automaton: CI MESSAGE: [1836444]: BUILD FAILED

* @brief Shared param setup
* @brief Shared param setup. Legacy implementation for per-sample approach
*
* Usage of this API is deprecated. For CPU Ops `void SetupSharedSampleParams(HostWorkspace &ws)`
Contributor:

Suggested change:
- * Usage of this API is deprecated. For CPU Ops `void SetupSharedSampleParams(HostWorkspace &ws)`
+ * Usage of this API is deprecated. For CPU Ops `void SetupImpl(HostWorkspace &ws)`

Do you mean this? Do we want this?

@JanuszL (Author):

Copy-paste. Fixed.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL merged commit add3a20 into NVIDIA:master on Nov 26, 2020
@JanuszL deleted the rebased_gds branch on November 26, 2020 20:02