Fits reader gpu #4752
Conversation
fits::FITS_CALL(fits_read_img(current_file, header.datatype_code, 1, nelem, &nulval,
                              static_cast<uint8_t*>(buffer.raw_mutable_data()), &anynul,
                              &status));

cudaMemcpy(target.data[output_idx].raw_mutable_data(), buffer.raw_mutable_data(),
           buffer.nbytes(), cudaMemcpyHostToDevice);
It is an OK starting point, but we can't put the decoding kernel here, as we have access to just one sample. We need to delay the usage of the decoding kernel till the RunImpl of the FitsReaderGpu, so we can decode the whole batch at once.
I assume we need to extend the FitsFileWrapperGPU with the shape and dtype information, and just put raw, encoded bytes in each Tensor.
Then we can decode them in RunImpl.
The question would be whether we should be doing the cudaMemcpy here or in the RunImpl of the reader - we should probably see the impact on the performance - again, in the RunImpl we can coalesce the copies into one.
namespace dali {

struct FitsFileWrapperGPU {
  std::vector<Tensor<GPUBackend>> data;
- std::vector<Tensor<GPUBackend>> data;
+ std::vector<Tensor<CPUBackend>> data;
+ TensorShape shape;
+ DALIDataType dtype;
+ bool encoded;
auto &sample = GetSample(sample_id);

cudaMemcpy(output.raw_mutable_tensor(sample_id), sample.data[output_idx].raw_data(),
           sample.data[output_idx].nbytes(), cudaMemcpyDeviceToDevice);
- Collect all samples
- cudaMemcpy them to the GPU (use the ToContiguousGpu for example).
- call the kernel.
Force-pushed from 4815da4 to 1462a02
cudaMemcpyAsync(output.raw_mutable_tensor(sample_id), sample.data[output_idx].raw_data(),
                sample.data[output_idx].nbytes(), cudaMemcpyHostToDevice);
You should use a proper operator stream.
                tile_size_cuda, sample.header[output_idx].bytepix, sample.header[output_idx].blocksize,
                tiles, maxtilelen);

cudaFree(undecoded_data_cuda);
For tmp memory please use a scratchpad - https://github.com/NVIDIA/DALI/blob/main/dali/operators/image/remap/remap.cu#L69 (declaration), https://github.com/NVIDIA/DALI/blob/main/dali/operators/image/remap/remap.cuh#L74 (allocates memory on the GPU and copies data from the CPU to it using the given stream).
The scratchpad should live until the last call touching the given memory. So:
dali::kernels::DynamicScratchpad ds;
auto stream = ws.stream();
auto tile_offset_cuda = std::get<0>(ds.ToContiguousGPU(stream, sample.tile_offset[output_idx]));
auto tile_size_cuda = std::get<0>(ds.ToContiguousGPU(stream, sample.tile_size[output_idx]));
auto undecoded_data_cuda = std::get<0>(ds.ToContiguousGPU(stream, sample.data[output_idx]));
if (zbitpix == 8) {
  cudaMalloc(&decoded_data_cuda, tiles * maxtilelen * sizeof(char));
} else if (zbitpix == 16) {
  cudaMalloc(&decoded_data_cuda, tiles * maxtilelen * sizeof(short));
} else {
  cudaMalloc(&decoded_data_cuda, tiles * maxtilelen * sizeof(int));
}
Is it used anywhere?
Force-pushed from 5559d56 to 4befa0b
Signed-off-by: aderylo <a.m.derylo@gmail.com>
… [floats not supported] Signed-off-by: aderylo <a.m.derylo@gmail.com>
Signed-off-by: mskwr <michal.skwarek@protonmail.ch>
A few minor comments, otherwise looks OK.
if (compressed) {
  TensorList<GPUBackend> sample_list_gpu;
@mzient - do you think it will work, or could we get into trouble as soon as sample_list_gpu goes out of scope?
Or setting the order is sufficient?
As I understand, it's a temporary buffer. Setting the order should be enough from the correctness perspective. Performance-wise, using a full-blown TensorList as temporary storage for raw data may seem excessive, but at least the code, as it is, has the benefit of simplicity.
There's a problem with the source (host) TensorList, however. If it's pinned and it uses host order, then its contents may be clobbered before this H2D copy finishes.
Maybe we can:
auto out = s.ToContiguousGPU(make_span(sample_list_cpu));
TensorListView<StorageGPU, uint8_t> sample_list_gpu(out, sample_list_cpu.shape());
?
Does ToContiguousGPU accept a span as an argument? I don't see any overload that would work like that.
The simplest fix here would be to add yet another copy; this is what ToContiguousGPU does internally, so:
TensorList<CPUBackend> samples_tmp;
samples_tmp.SetContiguity(BatchContiguity::Contiguous);
samples_tmp.set_order(ws.stream());
samples_tmp.Copy(sample_list_cpu);
sample_list_gpu.Copy(samples_tmp);
Honestly, I don't know why we don't have any better API for that, unless I'm missing something, but using a scratchpad or TLV would require you to use a manual copy.
Well, we can always add an API to the Scratchpad interface. Still, we could do this:
auto tlv_pinned = s.AllocTensorList<mm::memory_kind::pinned, uint8_t>(sample_list_cpu.shape());
auto tlv_gpu = s.AllocTensorList<mm::memory_kind::device, uint8_t>(sample_list_cpu.shape());
kernels::copy(tlv_pinned, view<uint8_t>(sample_list_cpu), AccessOrder::host());
kernels::copy(tlv_gpu, tlv_pinned, ws.stream());
!build
CI MESSAGE: [8266742]: BUILD STARTED
CI MESSAGE: [8266742]: BUILD PASSED
@@ -25,6 +25,7 @@ list(APPEND DALI_OPERATOR_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/numpy_reader_op.cc")

if(BUILD_CFITSIO)
  list(APPEND DALI_OPERATOR_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/fits_reader_op.cc")
  list(APPEND DALI_OPERATOR_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/fits_reader_gpu_op.cu")
This makes me wonder: would DALI compile properly with BUILD_CFITSIO=OFF? I'm asking because there are header files outside of this if. Could you verify that the build works when BUILD_CFITSIO=OFF and the cfitsio lib is unavailable in the system?
Also, I'd name this fits_reader_gpu.cu, since the fact that it resides in operators already suggests it's an op. But that's nitpicking, up to you :)
With regards to naming, we followed the same convention as numpy_reader. Although I agree that the op suffix seems redundant.
dali/util/fits.h
Outdated
DLL_PUBLIC void ParseHeader(HeaderData &parsed_header, fitsfile *src);

/** @brief Read raw data of rice coded image HDU. */
DLL_PUBLIC int extract_undecoded_data(fitsfile *fptr, std::vector<uint8_t> &data,
Looks like every other function in this file follows PascalCase. How about making this one follow it too?
If I understand correctly, there are some functions in this file that should not be visible outside of this compilation unit. How about wrapping those in an anonymous namespace?
dali/util/fits.cc
Outdated
int32_t status = 0;

for (int32_t i = 0; i < n_dims; i++) {
  std::string keyword = "ZTILE" + std::to_string(i + 1);
- std::string keyword = "ZTILE" + std::to_string(i + 1);
+ std::string keyword = make_string("ZTILE", i + 1);
The make_string we've implemented before uses stringstream, so I guess it would be a tiny bit better than concatenation.
dali/util/fits.cc
Outdated
for (int32_t i = 0; i < n_dims; i++) {
  std::string keyword = "ZTILE" + std::to_string(i + 1);
  FITS_CALL(fits_read_key(fptr, TLONG, keyword.c_str(), &tileSizes[i], NULL, &status));
  DALI_ENFORCE(tileSizes[i] > 0, "All ZTILE{i} values must be greater than 0!");
Would you consider adding more info to this error message? If a user runs into this, more context might be helpful. I have something like this in mind:
DALI_ENFORCE(tileSizes[i] > 0, make_string("All ZTILE{i} values must be greater than 0! Actual: ", tileSizes[i], " at index i=", i));
  fits::ParseHeader(header, current_file);
  target.header[output_idx] = header;
} catch (const std::runtime_error& e) {
  DALI_FAIL(e.what() + ". File: " + filename);
How about using make_string here?
done
template <typename T>
__global__ void rice_decompress(unsigned char *compressed_data, T *uncompressed_data,
                                const int64 *tile_offset, const int64 *tile_size, int blocksize,
                                int64 tiles, int64 maxtilelen, double bscale, double bzero) {
Would it be possible to create some unit tests for this? It's 100+ lines of intricate algorithmic code; I believe it would be nice to test it separately, not only with the umbrella Python test for the operator.
I think with the test that checks the extraction, it will be enough to test the decoding on the operator level.
@@ -0,0 +1,43 @@
// Copyright (c) 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- // Copyright (c) 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ // Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
done
@@ -0,0 +1,216 @@
// Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- // Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ // Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
done
dali/util/fits.cc
Outdated
int extract_undecoded_data(fitsfile* fptr, std::vector<uint8_t>& data,
                           std::vector<int64_t>& tile_offset, std::vector<int64_t>& tile_size,
                           int64 rows, int* status) {
Are these two functions (extract_undecoded_data and extract_data) tested? If not, could they be?
Sure! There might be a bit of a problem with getting ground truths for the tests, since as far as I can tell, cfitsio doesn't support reading the raw data without doing decompression and scaling first.
Signed-off-by: aderylo <a.m.derylo@gmail.com>
!build
CI MESSAGE: [8331595]: BUILD FAILED
!build
CI MESSAGE: [8331958]: BUILD STARTED
CI MESSAGE: [8331958]: BUILD PASSED
Signed-off-by: aderylo <a.m.derylo@gmail.com>
dali/util/fits_test.cc
Outdated
vector<T> data;
data.resize(src->Size() / sizeof(T));
auto ret = src->Read(reinterpret_cast<uint8_t *>(data.data()), src->Size());
// static_cast<uint8_t *>(data.data());
Do we need this line? If so, could you describe why?
No, I've cleaned it up in the latest commit.
Signed-off-by: aderylo <a.m.derylo@gmail.com>
CI MESSAGE: [8369923]: BUILD STARTED
CI MESSAGE: [8369923]: BUILD PASSED
…f buffer which is pinned and uses host order. Signed-off-by: aderylo <a.m.derylo@gmail.com>
CI MESSAGE: [8372926]: BUILD STARTED
CI MESSAGE: [8372926]: BUILD PASSED
!build
CI MESSAGE: [8382856]: BUILD STARTED
CI MESSAGE: [8382856]: BUILD PASSED
Generalize the FITS loader for CPU and GPU backends. The GPU FITS loader can extract undecoded data from the file for accelerated decoding. Add a CUDA kernel implementing GPU-accelerated RICE decoding. Signed-off-by: Adam Deryło <a.m.derylo@gmail.com> Signed-off-by: Michał Skwarek <michal.skwarek@protonmail.ch> Co-authored-by: Michał Skwarek <michal.skwarek@protonmail.ch>
Category:
New feature (non-breaking change which adds functionality)
Description:
Adds a simple implementation of a FITS reader that reads to the GPU backend.
Additional information:
Set to draft, since there are segfaults when running tests.
Affected modules and functionalities:
Key points relevant for the review:
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A