
Add support for GPU based numpy reader #2477

Merged — 12 commits merged into NVIDIA:master on Nov 26, 2020

Conversation

@JanuszL (Contributor) commented Nov 18, 2020

  • The GPU based numpy reader uses GPU Direct Storage via the cufile library.

Authored-by: Thorsten Kurth <tkurth@nvidia.com>
Co-authored-by: Michał Zientkiewicz <mzient@gmail.com>
Co-authored-by: Janusz Lisiecki <jlisiecki@nvidia.com>

Why do we need this PR?

  • It adds support for a GPU based numpy reader

What happened in this PR?


  • What solution was applied:
    The GPU based numpy reader uses GPU Direct Storage via the cufile library (a minimal sketch of the underlying cufile calls follows below).
  • Affected modules and functionalities:
    GPU numpy reader
  • Key points relevant for the review:
    NA
  • Validation and testing:
    a new set of tests is added
  • Documentation (including examples):
    updated

JIRA TASK: DALI-1610
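
For context, a minimal sketch of the cufile calls that GPU Direct Storage reads are built on. This is illustrative only, not DALI's code; the file name, payload size, and the omitted error handling are assumptions:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main() {
  // open the file with O_DIRECT so the DMA path can be used
  int fd = open("sample.npy", O_RDONLY | O_DIRECT);

  CUfileDescr_t descr = {};
  descr.handle.fd = fd;
  descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
  CUfileHandle_t fh;
  cuFileHandleRegister(&fh, &descr);

  size_t n_bytes = 1 << 20;                // placeholder payload size
  void *dev_buf = nullptr;
  cudaMalloc(&dev_buf, n_bytes);
  cuFileBufRegister(dev_buf, n_bytes, 0);  // pin the target for DMA

  // read straight from storage into GPU memory, skipping the host bounce buffer
  ssize_t n_read = cuFileRead(fh, dev_buf, n_bytes, /*file_offset=*/0,
                              /*devPtr_offset=*/0);

  cuFileBufDeregister(dev_buf);
  cuFileHandleDeregister(fh);
  cudaFree(dev_buf);
  close(fd);
  return n_read < 0;
}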

@JanuszL force-pushed the rebased_gds branch 2 times, most recently from b8ac13b to 15d28e0, on November 18, 2020 10:13
@dali-automaton: CI MESSAGE: [1809719]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1809719]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1809968]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1810224]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1810224]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1810664]: BUILD STARTED

@JanuszL force-pushed the rebased_gds branch 2 times, most recently from 5e2970b to fa876ae, on November 18, 2020 22:00
@dali-automaton: CI MESSAGE: [1812137]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1812137]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1814348]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1814348]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1814521]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1814521]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1815161]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1816421]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1816421]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1817930]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1817930]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1821460]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1821460]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1828401]: BUILD PASSED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@dali-automaton: CI MESSAGE: [1832413]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1832413]: BUILD FAILED

@klecki (Contributor) left a comment:

I have a few questions; I didn't find many big problems. I'm wondering whether we can deduplicate the code between FileLoader and CuFileLoader. Tying them together probably isn't the best option, but duplicating the sharding etc. isn't ideal either.
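
Purely as an illustration of the deduplication idea above, a hypothetical sketch (class and method names are invented, not DALI's actual classes): the sharding arithmetic lives in a base templated on the stream type, and the two loaders differ only in how they open files:

#include <cstddef>
#include <memory>
#include <string>

template <typename StreamType>
class IndexedFileLoaderBase {
 protected:
  // shared sharding logic: which slice of the file list this shard reads
  size_t ShardStart(size_t shard_id, size_t num_shards, size_t n_files) const {
    return n_files * shard_id / num_shards;
  }
  size_t ShardEnd(size_t shard_id, size_t num_shards, size_t n_files) const {
    return n_files * (shard_id + 1) / num_shards;
  }
  // the only part that differs between the CPU and GDS loaders
  virtual std::unique_ptr<StreamType> Open(const std::string &path) = 0;
};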

DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/../../tools/stub_generator/stub_codegen.py
"${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}/cufile.h"
"${CMAKE_CURRENT_SOURCE_DIR}/../../tools/stub_generator/cufile.json"
COMMENT "Running cuda.h stub generator"
Contributor:

Nitpick: cufile? (the COMMENT line still says "cuda.h stub generator")

@JanuszL (Author):

Done

PrepareEmptyTensor(T&) {
constexpr bool T_is_Tensor = std::is_same<T, Tensor<CPUBackend>>::value;
constexpr bool T_is_Tensor = (std::is_same<T, Tensor<CPUBackend>>::value ||
Contributor:

Hmm, it's a runtime enforce, so we could just DALI_ERROR here; up to you.

return file;
void NumpyHeaderCache::UpdateCache(const string &file_name, const NumpyParseTarget &value) {
if (cache_headers_) {
std::unique_lock<std::mutex> cache_lock(cache_mutex_);
Contributor:

I assume we're fine with the cache growing; even with millions of files it's probably still only several megabytes.

@JanuszL (Author):

We are fine with that.
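
For illustration, a hypothetical, simplified form of such a cache: an unbounded, mutex-guarded map whose values hold only header metadata, which is why even millions of entries stay small. The type and member names here are invented stand-ins, not DALI's:

#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// illustrative stand-in for the parsed numpy header
struct ParsedHeader {
  std::vector<int64_t> shape;
  bool fortran_order = false;
  size_t type_size = 0;
};

class HeaderCache {
 public:
  bool Lookup(const std::string &file_name, ParsedHeader *out) {
    std::unique_lock<std::mutex> lock(mutex_);
    auto it = cache_.find(file_name);
    if (it == cache_.end()) return false;
    *out = it->second;
    return true;
  }
  void Update(const std::string &file_name, const ParsedHeader &value) {
    std::unique_lock<std::mutex> lock(mutex_);
    cache_.emplace(file_name, value);  // grows without bound, by design
  }
 private:
  std::mutex mutex_;
  std::unordered_map<std::string, ParsedHeader> cache_;
};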

Comment on lines 33 to 34
virtual size_t Read(uint8_t * buffer, size_t n_bytes, size_t offset = 0) = 0;
virtual size_t ReadCPU(uint8_t * buffer, size_t n_bytes) = 0;
Contributor:

I wonder why we didn't go the other way round: let CUFileStream be a regular FileStream with respect to Read() and add a ReadGPU(). But that would probably break on Seek, etc. Maybe instead it would make sense to be able to get a regular FileStream from the CUFileStream? I'm asking because the call through the pointer-to-member in ParseHeader was a bit surprising to me.

@JanuszL (Author):

As I understand it, the goal was to be able to seek the file for both GPU and CPU, while reading the content on the CPU/GPU separately.

"get a regular FileStream from the CUFileStream"

It would require sharing state between both of them, which could be misused, I guess. I'm not sure this is the way we want to do it.
@azrael417 ?

Contributor:

Yeah, using any of the reads advances the internal pos_, so it's not like opening another stream; it's reading on the CPU or GPU using the same reader "head".

Still, doing it the other way round would be easier with respect to the pointer-to-member, and the CUFileStream could be a regular FileStream at the same time, couldn't it?

@JanuszL (Author):

Done, I hope.
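
A hypothetical sketch of the resolved shape (simplified signatures, not DALI's exact interface): CUFileStream is a regular FileStream whose Read() fills host memory, extended with a GPU read, and both reads advance the same position, i.e. one reader "head":

#include <cstddef>
#include <cstdint>

class FileStream {
 public:
  virtual ~FileStream() = default;
  // CPU read; advances pos_
  virtual size_t Read(uint8_t *host_buffer, size_t n_bytes) = 0;
  virtual void SeekRead(int64_t pos) { pos_ = pos; }
 protected:
  int64_t pos_ = 0;  // shared by both read variants
};

class CUFileStream : public FileStream {
 public:
  // GPU read via cufile; advances the same pos_ as Read()
  virtual size_t ReadGPU(uint8_t *device_buffer, size_t n_bytes,
                         size_t buffer_offset = 0) = 0;
};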

docker/Dockerfile.cuda100.x86_64.deps — resolved

for (int data_idx = 0; data_idx < batch_size_; ++data_idx) {
const auto& imfile = GetSample(data_idx);
if (imfile.meta == "transpose:false") {
Contributor:

Really? What was wrong with a boolean?

@JanuszL (Author):

Done


// use copy kernel for plain samples
if (!copy_sizes.empty()) {
ref_type.template Copy<GPUBackend, GPUBackend>(copy_to.data(), copy_from.data(),
Contributor:

I guess detecting whether no sample requires a transpose and reading directly into the target batch (whose size we already know, as we parsed it) would be a potential optimization. Or maybe at least doing it as one big D2D copy?

@JanuszL (Author):

"I guess detecting if all samples don't require transpose and reading directly to a target batch (that we already know size of, as we parsed it), would be a potential optimization."

It is not possible: in the prefetch we don't know the operator output buffer. RunImpl knows it, but by then the prefetched batch already exists. It is not contiguous, so we need that copy (we cannot simply swap buffers).

As you can see, only one copy kernel is invoked here. I don't think we can copy from a non-contiguous tensor vector into a tensor list any faster.

@JanuszL (Author):

Correction - I forgot we have prefetched_batch_tensors_, which is contiguous. We can swap it in if no transposition is needed at all.
Still, regarding point 1: it is created asynchronously in the prefetch thread, so nothing beyond that can be done.
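
A hypothetical sketch of that swap-or-copy logic. The function and parameter names mirror the snippet above but are otherwise invented, and the loop of cudaMemcpyAsync calls stands in for DALI's single batched scatter-gather copy kernel:

#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

void AssembleOutput(bool any_sample_transposed,
                    const std::vector<void *> &copy_to,
                    const std::vector<const void *> &copy_from,
                    const std::vector<size_t> &copy_sizes,
                    cudaStream_t stream) {
  if (!any_sample_transposed) {
    // the contiguous prefetched buffer can be swapped into the output
    // wholesale; no copy at all
    return;
  }
  // otherwise gather the plain (non-transposed) samples; in DALI this is
  // one batched copy kernel rather than a loop of per-sample memcpys
  for (size_t i = 0; i < copy_sizes.size(); ++i) {
    cudaMemcpyAsync(copy_to[i], copy_from[i], copy_sizes[i],
                    cudaMemcpyDeviceToDevice, stream);
  }
}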

void NumpyLoaderGPU::ReadSampleHelper(CUFileStream *file, ImageFileWrapperGPU& imfile,
void *buffer, Index offset, size_t total_size) {
// register the buffer (if needed)
RegisterTensor(buffer, total_size);
Contributor:

It's not exactly a Tensor; it registers the whole tensor list, right? I guess that's why we use the offset into the buffer instead of buffer + offset as the Read target.

@JanuszL (Author):

Renamed
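
A hypothetical sketch of the pattern being discussed (helper names are invented): register the whole contiguous batch buffer with cufile once, then target each sample through cuFileRead's devPtr_offset argument instead of registering buffer + offset separately per sample:

#include <cstddef>
#include <sys/types.h>
#include <cufile.h>

// register once per (re)allocated batch buffer, not per sample
void RegisterBatchBuffer(void *batch_base, size_t batch_bytes) {
  cuFileBufRegister(batch_base, batch_bytes, 0);
}

// read one sample into the registered region at the given offset
ssize_t ReadSampleAt(CUfileHandle_t fh, void *batch_base,
                     size_t sample_bytes, off_t file_offset,
                     off_t offset_in_buffer) {
  // devPtr_offset selects the destination inside the registered buffer
  return cuFileRead(fh, batch_base, sample_bytes, file_offset,
                    offset_in_buffer);
}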


imfile.type_info = target.type_info;
imfile.shape = target.shape;
imfile.meta = (target.fortran_order ? "transpose:true" : "transpose:false");
Contributor:

What's wrong with a boolean? Can we get rid of the string here?

@JanuszL (Author):

Done
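
For illustration, the shape of the fix (with a hypothetical simplified type standing in for ImageFileWrapperGPU): carry fortran_order as a bool on the sample and branch on it directly, instead of encoding it into a string and comparing on the consumer side:

// before: imfile.meta = target.fortran_order ? "transpose:true" : "transpose:false";
//         ... if (imfile.meta == "transpose:false") { ... }

struct SampleInfo {             // illustrative stand-in for ImageFileWrapperGPU
  bool fortran_order = false;   // set while parsing the numpy header
};

void Consume(const SampleInfo &imfile) {
  if (!imfile.fortran_order) {
    // plain copy path, no transposition needed
  } else {
    // transpose kernel path
  }
}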

// set metadata
imfile.image.SetMeta(meta);

imfile.read_meta_f = [this, image_file, &imfile] () {
Contributor:

What is stopping us from executing this part right here? Is it slower than running it in the thread pool and waiting for the work to be done? If the prefetch were ready before we start calling read_sample_f in the reader, that would probably be better?

@JanuszL (Author):

I haven't compared the difference, but as we have a header cache, it is not the fastest operation possible.

"If the prefetch is ready before we start calling read_sample_f in the reader it would be probably better"

The thread pool still runs inside the prefetch thread, just in a different place.
The usual flow is:

  • Prefetch thread -> Prefetch() -> ReadOne() -> ReadSample()

The flow here is:

  • Prefetch thread -> first: Prefetch() -> ReadOne() -> ReadSample()
    -> then: the thread pool runs the functions obtained above

So it is still asynchronous to the executor thread (a sketch of this two-phase flow follows below).
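
A hypothetical sketch of that two-phase flow (types are invented): ReadSample() only records the work as closures, and the thread pool owned by the prefetch path then executes them, so both phases stay asynchronous to the executor thread:

#include <functional>
#include <vector>

struct Sample {
  std::function<void()> read_meta_f;    // parse/cache the numpy header
  std::function<void()> read_sample_f;  // the actual (GPU) read
};

void Prefetch(std::vector<Sample> &batch) {
  // phase 1: ReadOne()/ReadSample() has already filled the closures;
  // phase 2: hand them to the thread pool (shown inline here for brevity)
  for (auto &sample : batch) {
    sample.read_meta_f();
    sample.read_sample_f();
  }
}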

@dali-automaton: CI MESSAGE: [1832413]: BUILD PASSED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@dali-automaton: CI MESSAGE: [1833957]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1833957]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1833957]: BUILD PASSED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL force-pushed the rebased_gds branch 4 times, most recently from 9dae415 to 2ebd763, on November 26, 2020 14:05
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@dali-automaton: CI MESSAGE: [1836444]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1836535]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1836535]: BUILD PASSED
@dali-automaton: CI MESSAGE: [1836444]: BUILD FAILED

* @brief Shared param setup
* @brief Shared param setup. Legacy implementation for per-sample approach
*
* Usage of this API is deprecated. For CPU Ops `void SetupSharedSampleParams(HostWorkspace &ws)`
Contributor:

Suggested change:
- * Usage of this API is deprecated. For CPU Ops `void SetupSharedSampleParams(HostWorkspace &ws)`
+ * Usage of this API is deprecated. For CPU Ops `void SetupImpl(HostWorkspace &ws)`

Do you mean this? Do we want this?

@JanuszL (Author):

Copy-paste. Fixed.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL merged commit add3a20 into NVIDIA:master on Nov 26, 2020
@JanuszL deleted the rebased_gds branch on November 26, 2020 20:02