Fits reader gpu #4752
Conversation
fits::FITS_CALL(fits_read_img(current_file, header.datatype_code, 1, nelem, &nulval,
                              static_cast<uint8_t*>(buffer.raw_mutable_data()), &anynul,
                              &status));

cudaMemcpy(target.data[output_idx].raw_mutable_data(), buffer.raw_mutable_data(),
           buffer.nbytes(), cudaMemcpyHostToDevice);
It is an OK starting point, but we can't put the decoding kernel here, as we have access to just one sample. We need to delay the usage of the decoding kernel till the RunImpl of the FitsReaderGpu, so we can decode the whole batch at once.
I assume we need to extend the FitsFileWrapperGPU with the shape and dtype information, and just put raw, encoded bytes in each Tensor.
Then we can decode them in RunImpl.
The question would be whether we should be doing the cudaMemcpy here or in the RunImpl of the reader - we should probably see the impact on the performance - again, in the RunImpl we can coalesce the copies into one.
namespace dali {

struct FitsFileWrapperGPU {
  std::vector<Tensor<GPUBackend>> data;
- std::vector<Tensor<GPUBackend>> data;
+ std::vector<Tensor<CPUBackend>> data;
+ TensorShape shape;
+ DALIDataType dtype;
+ bool encoded;
auto &sample = GetSample(sample_id);

cudaMemcpy(output.raw_mutable_tensor(sample_id), sample.data[output_idx].raw_data(),
           sample.data[output_idx].nbytes(), cudaMemcpyDeviceToDevice);
- Collect all samples
- cudaMemcpy them to the GPU (use the ToContiguousGpu for example).
- call the kernel.
Force-pushed from 4815da4 to 1462a02
cudaMemcpyAsync(output.raw_mutable_tensor(sample_id), sample.data[output_idx].raw_data(),
                sample.data[output_idx].nbytes(), cudaMemcpyHostToDevice);
You should use a proper operator stream.
                tile_size_cuda, sample.header[output_idx].bytepix, sample.header[output_idx].blocksize,
                tiles, maxtilelen);

cudaFree(undecoded_data_cuda);
For tmp memory please use a scratchpad - https://github.com/NVIDIA/DALI/blob/main/dali/operators/image/remap/remap.cu#L69 (declaration), https://github.com/NVIDIA/DALI/blob/main/dali/operators/image/remap/remap.cuh#L74 (allocates memory on the GPU and copies data from the CPU to it using the given stream).
The scratchpad should live until the last call touching the given memory. So:
dali::kernels::DynamicScratchpad ds;
auto stream = ws.stream();
auto tile_offset_cuda = std::get<0>(ds.ToContiguousGPU(stream, sample.tile_offset[output_idx]));
auto tile_size_cuda = std::get<0>(ds.ToContiguousGPU(stream, sample.tile_size[output_idx]));
auto undecoded_data_cuda = std::get<0>(ds.ToContiguousGPU(stream, sample.data[output_idx]));
if (zbitpix == 8) {
  cudaMalloc(&decoded_data_cuda, tiles * maxtilelen * sizeof(char));
} else if (zbitpix == 16) {
  cudaMalloc(&decoded_data_cuda, tiles * maxtilelen * sizeof(short));
} else {
  cudaMalloc(&decoded_data_cuda, tiles * maxtilelen * sizeof(int));
}
Is it used anywhere?
Force-pushed from 5559d56 to 4befa0b
Signed-off-by: aderylo <a.m.derylo@gmail.com>
… [floats not supported] Signed-off-by: aderylo <a.m.derylo@gmail.com>
Signed-off-by: mskwr <michal.skwarek@protonmail.ch>
A few minor comments, otherwise looks OK.
if (compressed) {
  TensorList<GPUBackend> sample_list_gpu;
@mzient - do you think it will work, or could we get into trouble as soon as sample_list_gpu goes out of scope?
Or setting the order is sufficient?
As I understand, it's a temporary buffer. Setting the order should be enough from the correctness perspective. Performance-wise, using a full-blown TensorList as temporary storage for raw data may seem excessive, but at least the code, as it is, has the benefit of simplicity.
There's a problem with the source (host) TensorList, however. If it's pinned and it uses host order, then its contents may be clobbered before this H2D copy finishes.
Maybe we can:
auto out = s.ToContiguousGPU(make_span(sample_list_cpu));
TensorListView<StorageGPU, uint8_t> sample_list_gpu(out, sample_list_cpu.shape());
?
Does ToContiguousGPU accept a span as an argument? I don't see any overload that would work like that.
The simplest fix here would be to add yet another copy; this is what ToContiguousGPU does internally, so:
TensorList<CPUBackend> samples_tmp;
samples_tmp.SetContiguity(BatchContiguity::Contiguous);
samples_tmp.set_order(ws.stream());
samples_tmp.Copy(sample_list_cpu);
sample_list_gpu.Copy(samples_tmp);
Honestly, I don't know why we don't have any better API for that, unless I'm missing something, but using a scratchpad or TLV would require you to use a manual copy.
Well, we can always add an API to the Scratchpad interface. Still, we could do this:
auto tlv_pinned = s.AllocTensorList<mm::memory_kind::pinned, uint8_t>(sample_list_cpu.shape());
auto tlv_gpu = s.AllocTensorList<mm::memory_kind::device, uint8_t>(sample_list_cpu.shape());
kernels::copy(tlv_pinned, view<uint8_t>(sample_list_cpu), AccessOrder::host());
kernels::copy(tlv_gpu, tlv_pinned, ws.stream());
!build
CI MESSAGE: [8266742]: BUILD STARTED
CI MESSAGE: [8266742]: BUILD PASSED
@@ -25,6 +25,7 @@ list(APPEND DALI_OPERATOR_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/numpy_reader_op.cc")

if(BUILD_CFITSIO)
  list(APPEND DALI_OPERATOR_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/fits_reader_op.cc")
  list(APPEND DALI_OPERATOR_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/fits_reader_gpu_op.cu")
This makes me wonder: would DALI compile properly with BUILD_CFITSIO=OFF? I'm asking because there are header files outside of this if. Could you verify that the build works when BUILD_CFITSIO=OFF and the cfitsio lib is unavailable in the system?
Also, I'd name this fits_reader_gpu.cu, since the fact that it resides in operators already suggests it's an op. But that's nitpicking, up to you :)
With regards to naming, we followed the same convention as numpy_reader. Although I agree that the op suffix seems redundant.
dali/util/fits.h
Outdated
DLL_PUBLIC void ParseHeader(HeaderData &parsed_header, fitsfile *src);

/** @brief Read raw data of rice coded image HDU. */
DLL_PUBLIC int extract_undecoded_data(fitsfile *fptr, std::vector<uint8_t> &data,
Looks like every other function in this file follows PascalCase. How about making this one follow it too?
If I understand correctly, there are some functions in this file that should not be visible outside of this compilation unit. How about wrapping those in an anonymous namespace?
dali/util/fits.cc
Outdated
int32_t status = 0;

for (int32_t i = 0; i < n_dims; i++) {
  std::string keyword = "ZTILE" + std::to_string(i + 1);
- std::string keyword = "ZTILE" + std::to_string(i + 1);
+ std::string keyword = make_string("ZTILE", i + 1);
The make_string we've implemented before uses stringstream, so I guess it would be a tiny bit better than concatenation.
dali/util/fits.cc
Outdated
for (int32_t i = 0; i < n_dims; i++) {
  std::string keyword = "ZTILE" + std::to_string(i + 1);
  FITS_CALL(fits_read_key(fptr, TLONG, keyword.c_str(), &tileSizes[i], NULL, &status));
  DALI_ENFORCE(tileSizes[i] > 0, "All ZTILE{i} values must be greater than 0!");
Would you consider adding more info to this error message? If a user runs into this, more context might be helpful. I have something like this in mind:
DALI_ENFORCE(tileSizes[i] > 0, make_string("All ZTILE{i} values must be greater than 0! Actual: ", tileSizes[i], " at index i=", i));
  fits::ParseHeader(header, current_file);
  target.header[output_idx] = header;
} catch (const std::runtime_error& e) {
  DALI_FAIL(e.what() + ". File: " + filename);
How about using make_string here?
done
template <typename T>
__global__ void rice_decompress(unsigned char *compressed_data, T *uncompressed_data,
                                const int64 *tile_offset, const int64 *tile_size, int blocksize,
                                int64 tiles, int64 maxtilelen, double bscale, double bzero) {
Would it be possible to create some unit tests for this? It's 100+ lines of intricate algorithmic code; I believe it would be nice to test it separately, not only with the umbrella Python test for the operator.
I think with the test that checks the extraction, it will be enough to test the decoding on the operator level.
@@ -0,0 +1,43 @@
// Copyright (c) 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- // Copyright (c) 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ // Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
done
@@ -0,0 +1,216 @@
// Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- // Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ // Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
done
dali/util/fits.cc
Outdated
int extract_undecoded_data(fitsfile* fptr, std::vector<uint8_t>& data,
                           std::vector<int64_t>& tile_offset, std::vector<int64_t>& tile_size,
                           int64 rows, int* status) {
Are these two functions (extract_undecoded_data and extract_data) tested? If not, could they be?
Sure! There might be a bit of a problem with getting ground truths for the tests, since as far as I can tell, cfitsio doesn't support reading the raw data without doing decompression and scaling first.
Signed-off-by: aderylo <a.m.derylo@gmail.com>
!build
CI MESSAGE: [8331595]: BUILD FAILED
!build
CI MESSAGE: [8331958]: BUILD STARTED
CI MESSAGE: [8331958]: BUILD PASSED
Signed-off-by: aderylo <a.m.derylo@gmail.com>
dali/util/fits_test.cc
Outdated
vector<T> data;
data.resize(src->Size() / sizeof(T));
auto ret = src->Read(reinterpret_cast<uint8_t *>(data.data()), src->Size());
// static_cast<uint8_t *>(data.data());
Do we need this line? If so, could you describe why?
No, I've cleaned it up in the latest commit.
Signed-off-by: aderylo <a.m.derylo@gmail.com>
CI MESSAGE: [8369923]: BUILD STARTED
CI MESSAGE: [8369923]: BUILD PASSED
…f buffer which is pinned and uses host order. Signed-off-by: aderylo <a.m.derylo@gmail.com>
CI MESSAGE: [8372926]: BUILD STARTED
CI MESSAGE: [8372926]: BUILD PASSED
!build
CI MESSAGE: [8382856]: BUILD STARTED
CI MESSAGE: [8382856]: BUILD PASSED
Generalize the FITS loader for CPU and GPU backends. The GPU FITS loader can extract undecoded data from the file for accelerated decoding. Add a CUDA kernel implementing GPU-accelerated RICE decoding. Signed-off-by: Adam Deryło <a.m.derylo@gmail.com> Signed-off-by: Michał Skwarek <michal.skwarek@protonmail.ch> Co-authored-by: Michał Skwarek <michal.skwarek@protonmail.ch>
Category:
New feature (non-breaking change which adds functionality)
Description:
Adds a simple implementation of a FITS reader that reads to the GPU backend.
Additional information:
Set to draft, since there are segfaults when running tests.
Affected modules and functionalities:
Key points relevant for the review:
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A