Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add host memory retries for GeneratedInternalRowToCudfRowIterator #9929

Merged
merged 2 commits into from
Dec 6, 2023

Conversation

jbrennan333
Copy link
Collaborator

This PR adds host memory allocation retries for GeneratedInternalRowToCudfRowIterator.
This initial version only handles retries, it does not handle splitting the input data into batches.
It is the first of several PRs that will be used to address #8887.

I have run the existing unit and integration tests, and I verified no performance impact for nds on spark2a, although I do not think nds is exercising this code.

I am putting this up as a draft because it includes a change that is also in #9908, and to do additional testing.

…CudfRowIterator

Signed-off-by: Jim Brennan <jimb@nvidia.com>
@jbrennan333
Copy link
Collaborator Author

build

1 similar comment
@jbrennan333
Copy link
Collaborator Author

build

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

HostMemoryBuffer ob =
RmmRapidsRetryIterator.withRetryNoSplit( () -> {
return HostAlloc$.MODULE$.alloc(
((long) numRowsEstimate + 1) * BYTES_PER_OFFSET, true);
Copy link
Collaborator

@abellina abellina Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit,

Suggested change
((long) numRowsEstimate + 1) * BYTES_PER_OFFSET, true);
((long) numRowsEstimate + 1) * BYTES_PER_OFFSET, /*preferPinned*/ true);

batchWithColumnToConvert = makeSpillableBatch(devColumn);
HostMemoryBuffer db =
RmmRapidsRetryIterator.withRetryNoSplit( () -> {
return HostAlloc$.MODULE$.alloc(dataLength, true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit,

Suggested change
return HostAlloc$.MODULE$.alloc(dataLength, true);
return HostAlloc$.MODULE$.alloc(dataLength, /*preferPinned*/ true);

Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jbrennan333
Copy link
Collaborator Author

build

@sameerz sameerz added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Dec 5, 2023
@jbrennan333
Copy link
Collaborator Author

I have been testing this locally, running nds query75 with the ShuffleExchangeExec operator disabled. This causes it to add GpuRowToColumnar targetsize(1073741824) to the plan. I am running with 16 executor cores, and limiting off heap memory. With these changes, I hit CpuSplitAndRetryOOM after reducing the offheap memory to about 24G.

As a comparison, when I run without these changes, but I change the HostMemoryBuffer.allocate() calls for dataBuffer and offsetBuffer to use HostAlloc.alloc(), I hit CpuRetryOOM with offheap memory at around 24G as well.

So it doesn't seem like the withRetryNoSplit calls are buying us much in this case, beyond the retries in HostAlloc.alloc(). @revans2 is does this make sense?

@revans2
Copy link
Collaborator

revans2 commented Dec 6, 2023

This indicates that we are still using a lot of non-spillable host memory. So we need to track down who is using the memory and not making it spillable. I'll see what I can come up with to be able to help debug this. In the short term I am fine with merging this in as it is not worse than what we have today.

@jbrennan333 jbrennan333 marked this pull request as ready for review December 6, 2023 17:42
@jbrennan333
Copy link
Collaborator Author

Thanks @revans2. I will put this in, but will continue to test it as I work on splitting the inputs.

@jbrennan333 jbrennan333 merged commit 93c9960 into NVIDIA:branch-24.02 Dec 6, 2023
38 checks passed
@jbrennan333 jbrennan333 linked an issue Dec 14, 2023 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add Host Memory Retry for Row to Columnar Conversion
5 participants