Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cpu oom retry split handling to InternalRowToColumnarBatchIterator #10011

Merged
merged 4 commits into from
Dec 12, 2023

Conversation

jbrennan333
Copy link
Collaborator

This is a followup to #9929 and is part of addressing issue #8887.

This adds handing for CpuSplitAndRetryOOM in InternalRowToColumnarBatchIterator.
For splits, we divide the target number of rows in half via splitTargetSizeInHalfCpu, and recalculate the buffer sizes from that. It will try that until we hit 1 row.

I also included a workaround for #10004, by combining the withHostBufferWriteLock with the withHostBufferReadOnly blocks. In this case we will hold the write lock during the full operation, but this should allow us to proceed without introducing data corruption until we have a fix for #10004.

One other fix is in HostAlloc, where it was passing the wrong argument to RapidsBufferCatalog.synchronousSpill - it was passing allocSize instead of targetSize, causing us to spill a lot more than needed in some cases.

@jbrennan333 jbrennan333 added feature request New feature or request reliability Features to improve reliability or bugs that severly impact the reliability of the plugin labels Dec 11, 2023
@jbrennan333 jbrennan333 self-assigned this Dec 11, 2023
@jbrennan333
Copy link
Collaborator Author

build

@@ -144,7 +144,7 @@ private class HostAlloc(nonPinnedLimit: Long) extends HostMemoryAllocator with L
logDebug(s"Targeting host store size of $targetSize bytes")
// We could not make it work so try and spill enough to make it work
val maybeAmountSpilled =
RapidsBufferCatalog.synchronousSpill(RapidsBufferCatalog.getHostStorage, allocSize)
RapidsBufferCatalog.synchronousSpill(RapidsBufferCatalog.getHostStorage, targetSize)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch

SpillPriorities$.MODULE$.ACTIVE_ON_DECK_PRIORITY(),
RapidsBufferCatalog$.MODULE$.singleton());
hBufs[0] = null; // Was closed by spillable
hBufs[1] = HostAlloc$.MODULE$.alloc(offsetBytes, true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to wrap this in a try/catch so we can close sBufs[0] if alloc throws?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nm, I think allocBuffersWithRestore takes care of it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there may be a problem here actually. If we throw with something other than a retry exception, the withRestoreOnRetry won't handle it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that would be an issue

abellina
abellina previously approved these changes Dec 11, 2023
Copy link
Collaborator

@abellina abellina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 95 to 96
return (int) Math.max(1,
Math.min(Integer.MAX_VALUE - 1, targetBytes / sizePerRowEstimate));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: casting is forced by targetBytes / sizePerRowEstimate? I'd prefer pushing it down to the origin:

Suggested change
return (int) Math.max(1,
Math.min(Integer.MAX_VALUE - 1, targetBytes / sizePerRowEstimate));
return Math.max(1,
Math.min(Integer.MAX_VALUE - 1, (int) (targetBytes / sizePerRowEstimate)));

@jbrennan333
Copy link
Collaborator Author

Thanks for the review comments @abellina and @gerashegalov! I believe I have addressed them in this latest commit.

Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jbrennan333
Copy link
Collaborator Author

build

@jbrennan333 jbrennan333 merged commit 61e8e6c into NVIDIA:branch-24.02 Dec 12, 2023
38 checks passed
@jbrennan333 jbrennan333 linked an issue Dec 14, 2023 that may be closed by this pull request
@sameerz sameerz removed the feature request New feature or request label Dec 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add Host Memory Retry for Row to Columnar Conversion
4 participants