
Add GDS-compatible allocator with 4k alignment. #3754

Merged 1 commit into NVIDIA:main on Mar 24, 2022

Conversation

@mzient (Contributor) commented Mar 23, 2022

Signed-off-by: Michał Zientkiewicz mzient@gmail.com

Category:

New feature (non-breaking change which adds functionality)

Description:

GDS is more efficient when mapped memory is aligned to a 4k boundary, and it does not work with memory allocated with cuMemCreate.
This PR adds a dedicated GDS memory pool, which is kept separate and used only for that purpose.
As a preparatory step, pool_resource was extended so that it can be informed about the maximum alignment supported by the upstream resource and work around that limitation when allocating upstream blocks.
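The workaround for a limited upstream alignment can be sketched in isolation: when the requested alignment exceeds what the upstream can guarantee, over-allocate and align the pointer inside the larger block. This is a hypothetical illustration, not DALI's actual pool_resource code; the function name and the upstream callback are assumptions, and a real pool would also record the original pointer so the block can be deallocated later.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Sketch: satisfy an overaligned request from an upstream that only
// guarantees max_upstream_alignment (both alignments assumed powers of two).
void *alloc_overaligned(size_t bytes, size_t alignment,
                        size_t max_upstream_alignment,
                        void *(*upstream_alloc)(size_t)) {
  if (alignment <= max_upstream_alignment)
    return upstream_alloc(bytes);  // upstream alignment already suffices
  // Pad the request so that some address inside the block is aligned
  // to `alignment`, given the upstream's weaker guarantee.
  size_t padded = bytes + alignment - max_upstream_alignment;
  char *raw = static_cast<char *>(upstream_alloc(padded));
  uintptr_t p = reinterpret_cast<uintptr_t>(raw);
  uintptr_t aligned = (p + alignment - 1) & ~(uintptr_t(alignment) - 1);
  // A real pool resource would remember `raw` for deallocation.
  return reinterpret_cast<void *>(aligned);
}
```

The padding term `alignment - max_upstream_alignment` is the worst-case distance between the upstream pointer and the next suitably aligned address, which is why the pool needs to know the upstream's alignment limit explicitly.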

Additional information:

Affected modules and functionalities:

  • Pool resource
  • NumPy reader (GPU)

Key points relevant for the review:

N/A

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-2667

@dali-automaton (Collaborator): CI MESSAGE: [4214574]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [4214574]: BUILD PASSED

@@ -32,6 +32,7 @@ void TestPoolResource(int num_iter) {
test_host_resource upstream;
{
auto opt = default_host_pool_opts();
opt.max_upstream_alignment = 32; // force the use of overaligned upstream allocations
Contributor:

Why do we need that change?

@mzient (Author):

The distribution of alignments in this test is 1-256. We need something smaller to test the code path with overalignment. I want this to be explicit, in case the default value (which is 256) is changed.

Contributor:

OK, this is just for the test; never mind.

GDSAllocator::GDSAllocator() {
// Currently, GPUDirect Storage can work only with memory allocated with cudaMalloc and
// cuMemAlloc. Since DALI is transitioning to CUDA Virtual Memory Management for memory
// allocation, we need a special allocator that's compatible with GDS.
Contributor:

How is that achieved? Is it sufficient to just use coalescing_free_tree (as I understand, it still uses CUDA Virtual Memory Management)?

@mzient (Author):

See the next line:

static auto upstream = std::make_shared<mm::cuda_malloc_memory_resource>();

cuda_malloc_resource uses plain cudaMalloc.

@jantonguirao jantonguirao assigned jantonguirao and unassigned klecki Mar 24, 2022
char *block_end = block_start + blk_size;
assert(tail <= block_end);

if (blk_size != bytes) {
Contributor:
Suggested change
if (blk_size != bytes) {
if (blk_size > bytes) {

According to what the comment says?

@mzient (Author), Mar 24, 2022:

Well, it can't be less :)
If anything, there could be an assert(blk_size > bytes); inside, but wouldn't that be overkill?

@mzient (Author):

Actually, it's already indirectly tested by assert(tail <= block_end);

lock_guard guard(lock_);
free_list_.put(static_cast<char *>(new_block) + bytes, blk_size - bytes);
return new_block;
if (ret != block_start)
Contributor:
Suggested change
if (ret != block_start)
if (ret > block_start)

if (ret != block_start)
free_list_.put(block_start, ret - block_start);

if (tail != block_end)
Contributor:
Suggested change
if (tail != block_end)
if (tail < block_end)
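The suggestions above all touch the same block-splitting path: an aligned sub-range is carved out of an upstream block, and any leading and trailing remainders go back to the free list, which makes strict inequalities the natural form of the checks. A self-contained sketch of that bookkeeping, with a hypothetical FreeList standing in for DALI's actual free_list_ and a made-up carve() helper:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical free list: just records (pointer, size) ranges.
struct FreeList {
  std::vector<std::pair<char *, size_t>> ranges;
  void put(char *ptr, size_t size) { ranges.emplace_back(ptr, size); }
};

// Carve a `bytes`-long, `alignment`-aligned range out of
// [block_start, block_start + blk_size) and return the unused leading
// and trailing parts to the free list (alignment assumed a power of two).
char *carve(char *block_start, size_t blk_size, size_t bytes,
            size_t alignment, FreeList &fl) {
  uintptr_t p = reinterpret_cast<uintptr_t>(block_start);
  uintptr_t a = (p + alignment - 1) & ~(uintptr_t(alignment) - 1);
  char *ret = reinterpret_cast<char *>(a);
  char *tail = ret + bytes;
  char *block_end = block_start + blk_size;
  assert(tail <= block_end);       // the block must fit the aligned range
  if (ret != block_start)          // leading slack (ret can only be >=)
    fl.put(block_start, ret - block_start);
  if (tail != block_end)           // trailing slack (tail can only be <=)
    fl.put(tail, block_end - tail);
  return ret;
}
```

Because ret >= block_start and tail <= block_end always hold here, `!=` and the suggested `>` / `<` forms are equivalent; the discussion is purely about which reads more clearly.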

@mzient mzient merged commit 1ac7e7d into NVIDIA:main Mar 24, 2022
@JanuszL JanuszL mentioned this pull request Mar 30, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022