Add support for GPU based numpy reader #2477
Conversation
---
I have a few questions, but didn't find many big problems. I'm wondering if we can deduplicate code between FileLoader and CuFileLoader. Tying them together probably isn't the best option, but duplicating the sharding etc. isn't ideal either.
dali/core/CMakeLists.txt (outdated)
DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/../../tools/stub_generator/stub_codegen.py
  "${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}/cufile.h"
  "${CMAKE_CURRENT_SOURCE_DIR}/../../tools/stub_generator/cufile.json"
COMMENT "Running cuda.h stub generator"
---
Nitpick: cufile? (the COMMENT above still says "cuda.h")
---
Done
PrepareEmptyTensor(T&) {
-  constexpr bool T_is_Tensor = std::is_same<T, Tensor<CPUBackend>>::value;
+  constexpr bool T_is_Tensor = (std::is_same<T, Tensor<CPUBackend>>::value ||
---
Hmm, it's a runtime enforce, so we could just DALI_ERROR here; up to you.
return file;
void NumpyHeaderCache::UpdateCache(const string &file_name, const NumpyParseTarget &value) {
  if (cache_headers_) {
    std::unique_lock<std::mutex> cache_lock(cache_mutex_);
---
I assume we're fine with the cache growing; even with millions of files it's probably still only several megabytes.
---
We are fine with that.
dali/util/cufile.h (outdated)
virtual size_t Read(uint8_t *buffer, size_t n_bytes, size_t offset = 0) = 0;
virtual size_t ReadCPU(uint8_t *buffer, size_t n_bytes) = 0;
---
I wonder why we didn't go the other way round: let CUFileStream be a regular FileStream with respect to Read() and add a ReadGPU(). But that would probably break on Seek, etc. Maybe instead it would make sense to be able to get a regular FileStream from the CUFileStream? I'm asking because the call through the pointer-to-member in ParseHeader was a bit surprising to me.
---
As I understand it, the goal was to seek the file for both GPU and CPU, while reading the content on the CPU/GPU separately.

> get a regular FileStream from the CUFileStream

It would require sharing state between both of them, which could be misused, I guess. No idea if this is the way we want to do it.
@azrael417 ?
---
Yeah, using any of the reads advances the internal pos_, so it's not like opening another stream; it's reading on CPU or GPU using the same reader "head".
Still, doing it the other way round would be easier with respect to the pointer-to-member, and CUFileStream could be a regular FileStream at the same time, couldn't it?
---
Done, I hope.
for (int data_idx = 0; data_idx < batch_size_; ++data_idx) {
  const auto& imfile = GetSample(data_idx);
  if (imfile.meta == "transpose:false") {
---
Really? What was wrong with the boolean?
---
Done
// use copy kernel for plain samples
if (!copy_sizes.empty()) {
  ref_type.template Copy<GPUBackend, GPUBackend>(copy_to.data(), copy_from.data(),
---
I guess detecting when no sample requires a transpose and reading directly into the target batch (whose size we already know, as we parsed it) would be a potential optimization. Or maybe at least doing it as one big D2D copy?
---
> I guess detecting if all samples don't require transpose and reading directly to a target batch would be a potential optimization.

It is not possible: in the prefetch we don't know the operator output buffer. RunImpl knows it, but by then the prefetched batch is already there. It is not contiguous, so we need that copy (we cannot simply swap buffers).
As you can see, only one copy kernel is invoked here. I don't think we can do anything faster to copy from a non-contiguous tensor vector into a tensor list.
---
Correction: I forgot we have prefetched_batch_tensors_, which is contiguous. We can swap it if no transposition is needed at all.
Still, regarding the first point: it is created asynchronously in the prefetch thread, so nothing beyond that can be done.
void NumpyLoaderGPU::ReadSampleHelper(CUFileStream *file, ImageFileWrapperGPU& imfile,
                                      void *buffer, Index offset, size_t total_size) {
  // register the buffer (if needed)
  RegisterTensor(buffer, total_size);
---
It's not exactly a Tensor; it registers the whole tensor list, right? I guess that's why we use the offset into the buffer instead of buffer + offset as the Read target.
---
Renamed
imfile.type_info = target.type_info;
imfile.shape = target.shape;
imfile.meta = (target.fortran_order ? "transpose:true" : "transpose:false");
---
What's wrong with a boolean? Can we get rid of the string here?
---
Done
// set metadata
imfile.image.SetMeta(meta);
imfile.read_meta_f = [this, image_file, &imfile] () { |
---
What is stopping us from executing this part right here? Is it slower than running it in the thread pool and waiting for the work to be done? If the prefetch is ready before we start calling read_sample_f in the reader, that would probably be better?
---
I haven't compared the difference, but since we have a header cache, it is not the fastest operation possible.

> If the prefetch is ready before we start calling read_sample_f in the reader it would be probably better

The thread pool still runs in the prefetch thread, just in a different place.
The usual flow is:
- Prefetch thread -> Prefetch() -> ReadOne() -> ReadSample()

The flow here is:
- Prefetch thread -> first: Prefetch() -> ReadOne() -> ReadSample()
  -> then: the thread pool runs the functions obtained

So it is still asynchronous to the executor thread.
dali/pipeline/operator/operator.h (outdated)
- * @brief Shared param setup
+ * @brief Shared param setup. Legacy implementation for per-sample approach
+ *
+ * Usage of this API is deprecated. For CPU Ops `void SetupSharedSampleParams(HostWorkspace &ws)`
---
Suggested change:
- * Usage of this API is deprecated. For CPU Ops `void SetupSharedSampleParams(HostWorkspace &ws)`
+ * Usage of this API is deprecated. For CPU Ops `void SetupImpl(HostWorkspace &ws)`

Do you mean this? Do we want this?
---
Copy paste. Fixed.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Authored-by: Thorsten Kurth <tkurth@nvidia.com>
Co-authored-by: Michał Zientkiewicz <mzient@gmail.com>
Co-authored-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Why we need this PR?
It adds a new feature: a GPU based numpy reader.

What happened in this PR?
- Solution: the GPU based numpy reader uses GPU Direct Storage via the cufile library.
- Affected modules and functionalities: GPU numpy reader.
- Key points relevant for the review: NA.
- Tests: a new set of tests is added.
- Documentation: updated.

JIRA TASK: DALI-1610