
Fits reader gpu #4752

Merged: 34 commits merged on May 23, 2023
Conversation

@aderylo (Contributor) commented Mar 29, 2023:

Category:

New feature (non-breaking change which adds functionality)

Description:

Adds a simple implementation of a FITS reader that reads into the GPU backend.

Additional information:

Set to draft, since there are segfaults when running tests.

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@klecki klecki self-assigned this Mar 29, 2023
Comment on lines 76 to 81
fits::FITS_CALL(fits_read_img(current_file, header.datatype_code, 1, nelem, &nulval,
static_cast<uint8_t*>(buffer.raw_mutable_data()), &anynul,
&status));

cudaMemcpy(target.data[output_idx].raw_mutable_data(), buffer.raw_mutable_data(),
buffer.nbytes(), cudaMemcpyHostToDevice);
Contributor:

It is an OK starting point, but we can't put the decoding kernel here, as we have access to just one sample. We need to delay the use of the decoding kernel until the RunImpl of the FitsReaderGpu, so we can decode the whole batch at once.

I assume we need to extend FitsFileWrapperGPU with the shape and dtype information, and just put the raw, encoded bytes in each Tensor.

Then we can decode them in RunImpl.
The question is whether we should do the CUDA memcpy here or in the RunImpl of the reader; we should probably check the performance impact. Again, in RunImpl we can coalesce the copies into one.

namespace dali {

struct FitsFileWrapperGPU {
std::vector<Tensor<GPUBackend>> data;
@klecki (Contributor), Mar 31, 2023:

Suggested change:
- std::vector<Tensor<GPUBackend>> data;
+ std::vector<Tensor<CPUBackend>> data;
+ TensorShape shape;
+ DALIDataType dtype;
+ bool encoded;

Comment on lines 28 to 31
auto &sample = GetSample(sample_id);

cudaMemcpy(output.raw_mutable_tensor(sample_id), sample.data[output_idx].raw_data(),
sample.data[output_idx].nbytes(), cudaMemcpyDeviceToDevice);
Contributor:

  1. Collect all samples
  2. cudaMemcpy them to the GPU (use the ToContiguousGpu for example).
  3. call the kernel.

Comment on lines 375 to 376
cudaMemcpyAsync(output.raw_mutable_tensor(sample_id), sample.data[output_idx].raw_data(),
sample.data[output_idx].nbytes(), cudaMemcpyHostToDevice);
Contributor:

You should use a proper operator stream.

tile_size_cuda, sample.header[output_idx].bytepix, sample.header[output_idx].blocksize,
tiles, maxtilelen);

cudaFree(undecoded_data_cuda);
Contributor:

For tmp memory please use scratchpad - https://github.com/NVIDIA/DALI/blob/main/dali/operators/image/remap/remap.cu#L69 (declaration), https://github.com/NVIDIA/DALI/blob/main/dali/operators/image/remap/remap.cuh#L74 (allocates memory on the GPU and copies data from the CPU to it using given stream).
The scratchpad should live until the last call touching the given memory. So:

dali::kernels::DynamicScratchpad ds;
auto stream = ws.stream();
auto tile_offset_cuda = std::get<0>(ds.ToContiguousGPU(stream, sample.tile_offset[output_idx]));
auto tile_size_cuda = std::get<0>(ds.ToContiguousGPU(stream, sample.tile_size[output_idx]));
auto undecoded_data_cuda = std::get<0>(ds.ToContiguousGPU(stream, sample.data[output_idx]));

Comment on lines 339 to 407
if (zbitpix == 8) {
cudaMalloc(&decoded_data_cuda, tiles * maxtilelen * sizeof(char));
} else if (zbitpix == 16) {
cudaMalloc(&decoded_data_cuda, tiles * maxtilelen * sizeof(short));
} else {
cudaMalloc(&decoded_data_cuda, tiles * maxtilelen * sizeof(int));
}
Contributor:

Is it used anywhere?

dali/util/fits.cc Outdated Show resolved Hide resolved
Signed-off-by: aderylo <a.m.derylo@gmail.com>
… [floats not supported]

Signed-off-by: aderylo <a.m.derylo@gmail.com>
Signed-off-by: mskwr <michal.skwarek@protonmail.ch>
@klecki (Contributor) left a comment:

A few minor comments; otherwise looks OK.

dali/operators/reader/fits_reader_op.h Outdated Show resolved Hide resolved
dali/util/fits.cc Show resolved Hide resolved
dali/util/fits.cc Outdated Show resolved Hide resolved
dali/util/fits.cc Show resolved Hide resolved
dali/util/fits.cc Outdated Show resolved Hide resolved
dali/operators/reader/loader/fits_loader.h Show resolved Hide resolved
dali/operators/reader/fits_reader_gpu_op.cu Show resolved Hide resolved


if (compressed) {
TensorList<GPUBackend> sample_list_gpu;
Contributor:

@mzient, do you think it will work, or can we get into trouble as soon as sample_list_gpu goes out of scope?

Contributor:

Or setting the order is sufficient?

Contributor:

As I understand, it's a temporary buffer. Setting the order should be enough from the correctness perspective. Performance-wise, using a full-blown TensorList as temporary storage for raw data may seem excessive, but at least the code, as it is, has the benefit of simplicity.

Contributor:

There's a problem with the source (host) TensorList, however. If it's pinned and it uses host order, then its contents may be clobbered before this H2D copy finishes.

@JanuszL (Contributor), May 12, 2023:

Maybe we can:

auto out = s.ToContiguousGPU(make_span(sample_list_cpu));
TensorListView<StorageGPU, uint8_t> sample_list_gpu(out, sample_list_cpu.shape());

?

@klecki (Contributor), May 12, 2023:

Does ToContiguousGPU accept span as an argument? I don't see any overload that would work like that.

The simplest fix here would be to add yet another copy; this is what ToContiguousGPU does internally, so:

TensorList<CPU> samples_tmp;
samples_tmp.SetContiguity(BatchContiguity::Contiguous);
samples_tmp.set_order(ws.stream());
samples_tmp.Copy(sample_list_cpu);
sample_list_gpu.Copy(samples_tmp);

Honestly, I don't know why we don't have any better API for that, unless I'm missing something, but using scratchpad or TLV would require you to use manual copy.

@mzient (Contributor), May 12, 2023:

Well, we can always add an API to the Scratchpad interface. Still, we could do this:

auto tlv_pinned = s.AllocTensorList<mm::memory_kind::pinned, uint8_t>(sample_list_cpu.shape());
auto tlv_gpu    = s.AllocTensorList<mm::memory_kind::device, uint8_t>(sample_list_cpu.shape());
kernels::copy(tlv_pinned, view<uint8_t>(sample_list_cpu), AccessOrder::host());
kernels::copy(tlv_gpu, tlv_pinned, ws.stream());

mskwr added 2 commits May 12, 2023 00:26
Signed-off-by: mskwr <michal.skwarek@protonmail.ch>
Signed-off-by: mskwr <michal.skwarek@protonmail.ch>
@klecki (Contributor) commented May 12, 2023:

!build

@dali-automaton (Collaborator):

CI MESSAGE: [8266742]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [8266742]: BUILD PASSED

@@ -25,6 +25,7 @@ list(APPEND DALI_OPERATOR_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/numpy_reader_op.cc")

if(BUILD_CFITSIO)
list(APPEND DALI_OPERATOR_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/fits_reader_op.cc")
list(APPEND DALI_OPERATOR_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/fits_reader_gpu_op.cu")
Member:

This makes me wonder: would DALI compile properly with BUILD_CFITSIO=OFF? I'm asking because there are header files outside of this if. Could you verify that the build works when BUILD_CFITSIO=OFF and the cfitsio lib is unavailable in the system?

Also, I'd name this fits_reader_gpu.cu, since the fact that it resides in operators already suggests it's an op. But that's nitpicking; up to you :)

Contributor (author):

Regarding naming, we followed the same convention as numpy_reader, although I agree that the op suffix seems redundant.

dali/util/fits.h Outdated
DLL_PUBLIC void ParseHeader(HeaderData &parsed_header, fitsfile *src);

/** @brief Read raw data of rice coded image HDU. */
DLL_PUBLIC int extract_undecoded_data(fitsfile *fptr, std::vector<uint8_t> &data,
Member:

Looks like every other function in this file follows PascalCase. How about making this one follow it too?

Member:

If I understand correctly, there are some functions in this file that should not be visible outside of this compilation unit. How about wrapping those in an anonymous namespace?

int32_t status = 0;

for (int32_t i = 0; i < n_dims; i++) {
std::string keyword = "ZTILE" + std::to_string(i + 1);
Member:

Suggested change:
- std::string keyword = "ZTILE" + std::to_string(i + 1);
+ std::string keyword = make_string("ZTILE", i + 1);

The make_string we've implemented before uses stringstream, so I guess it would be a tiny bit better than concatenation.

for (int32_t i = 0; i < n_dims; i++) {
std::string keyword = "ZTILE" + std::to_string(i + 1);
FITS_CALL(fits_read_key(fptr, TLONG, keyword.c_str(), &tileSizes[i], NULL, &status));
DALI_ENFORCE(tileSizes[i] > 0, "All ZTILE{i} values must be greater than 0!");
Member:

Would you consider adding more info to this error message? If a user runs into this, extra context would be helpful. I have something like this in mind:

DALI_ENFORCE(tileSizes[i] > 0, make_string("All ZTILE{i} values must be greater than 0! Actual: ", tileSizes[i], " at index i=", i));

fits::ParseHeader(header, current_file);
target.header[output_idx] = header;
} catch (const std::runtime_error& e) {
DALI_FAIL(e.what() + ". File: " + filename);
Member:

How about using make_string here?

Contributor (author):

done

template <typename T>
__global__ void rice_decompress(unsigned char *compressed_data, T *uncompressed_data,
const int64 *tile_offset, const int64 *tile_size, int blocksize,
int64 tiles, int64 maxtilelen, double bscale, double bzero) {
Member:

Would it be possible to create some unit tests for this? It's 100+ lines of involved algorithmic code; I believe it would be nice to test it separately, not only with the umbrella Python test for the operator.

Contributor:

I think that, together with the test that checks the extraction, it will be enough to test the decoding at the operator level.

@@ -0,0 +1,43 @@
// Copyright (c) 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Member:

Suggested change:
- // Copyright (c) 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ // Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Contributor (author):

done

@@ -0,0 +1,216 @@
// Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Member:

Suggested change:
- // Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ // Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Contributor (author):

done


int extract_undecoded_data(fitsfile* fptr, std::vector<uint8_t>& data,
std::vector<int64_t>& tile_offset, std::vector<int64_t>& tile_size,
int64 rows, int* status) {
Member:

Are these two functions (extract_undecoded_data and extract_data) tested? If not, could they be?

Contributor (author):

Sure! There might be a bit of a problem with getting ground truth for the tests, though, since, as far as I can tell, cfitsio doesn't support reading the raw data without applying decompression and scaling first.

Signed-off-by: aderylo <a.m.derylo@gmail.com>
@klecki (Contributor) commented May 18, 2023:

!build

@dali-automaton (Collaborator):

CI MESSAGE: [8331595]: BUILD FAILED

@klecki (Contributor) commented May 18, 2023:

!build

@dali-automaton (Collaborator):

CI MESSAGE: [8331958]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [8331958]: BUILD PASSED

Signed-off-by: aderylo <a.m.derylo@gmail.com>
vector<T> data;
data.resize(src->Size() / sizeof(T));
auto ret = src->Read(reinterpret_cast<uint8_t *>(data.data()), src->Size());
// static_cast<uint8_t *>(data.data());
Member:

Do we need this line? If so, could you describe why?

Contributor (author):

No, I've cleaned it up in the latest commit.

Signed-off-by: aderylo <a.m.derylo@gmail.com>
@dali-automaton (Collaborator):

CI MESSAGE: [8369923]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [8369923]: BUILD PASSED

…f buffer which is pinned and uses host order.

Signed-off-by: aderylo <a.m.derylo@gmail.com>
@dali-automaton (Collaborator):

CI MESSAGE: [8372926]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [8372926]: BUILD PASSED

@klecki (Contributor) commented May 23, 2023:

!build

@dali-automaton (Collaborator):

CI MESSAGE: [8382856]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [8382856]: BUILD PASSED

@klecki klecki merged commit f9a3ee4 into NVIDIA:main May 23, 2023
JanuszL pushed a commit to JanuszL/DALI that referenced this pull request Oct 13, 2023
Generalize the FITS loader for CPU and GPU backends.
The GPU FITS loader can extract undecoded data from
the file for accelerated decoding.

Add a CUDA kernel implementing GPU-accelerated
RICE decoding.

Signed-off-by: Adam Deryło <a.m.derylo@gmail.com>
Signed-off-by: Michał Skwarek <michal.skwarek@protonmail.ch>
Co-authored-by: Michał Skwarek <michal.skwarek@protonmail.ch>