
Pinned async resource #2858

Merged: 4 commits merged into NVIDIA:master on Apr 13, 2021

Conversation

@mzient (Contributor) commented Apr 12, 2021

Why we need this PR?

Pick one, remove the rest

  • It adds a new feature needed for stream-ordered allocation of staging buffers.

What happened in this PR?

Fill relevant points, put NA otherwise. Replace anything inside []

  • What solution was applied:
    • Specialized async_pool for pinned memory kind
    • Tested
    • Fixed RMM & bumped RMM in third_party
  • Affected modules and functionalities:
    • RMM
    • memory management module
  • Key points relevant for the review:
    • N/A
  • Validation and testing:
    • GTest
  • Documentation (including examples):
    • N/A

JIRA TASK: DALI-1902

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@mzient (Contributor, Author) commented Apr 12, 2021

!build

@dali-automaton (Collaborator)

CI MESSAGE: [2259724]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [2259724]: BUILD PASSED

@JanuszL self-assigned this Apr 12, 2021
Comment on lines 50 to 51
CUDA_CALL(cudaStreamSynchronize(stream));
pool.deallocate_async(mem1, N, sv);
Contributor:

Would it make any sense to swap deallocate_async and cudaStreamSynchronize?

Contributor Author:

It makes no difference, since the pool will never truly deallocate the memory, so it's going to remain available anyway.
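
A minimal sketch of that point, reusing the names from the snippet above (pool, stream, sv, mem1 and N are assumed to be set up as in the test). The two orderings are shown side by side as alternatives; only one of them would appear in the test:

  // Order as written in the test: synchronize first, then return the block to the pool.
  CUDA_CALL(cudaStreamSynchronize(stream));  // all work touching mem1 has completed
  pool.deallocate_async(mem1, N, sv);        // block goes back to the pool, reusable right away

  // Swapped order from the question: the block still becomes reusable only once the
  // stream's pending work is done, and it is never released back to the driver,
  // so the test observes the same behavior either way.
  pool.deallocate_async(mem1, N, sv);
  CUDA_CALL(cudaStreamSynchronize(stream));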

pool.deallocate_async(mem1, N, sv1);
void *mem2 = pool.allocate_async(N, sv2);
auto e = cudaStreamQuery(s1);
EXPECT_NE(e, cudaErrorNotReady) << "Syncrhonization should have occurred";
Contributor:

Suggested change
-  EXPECT_NE(e, cudaErrorNotReady) << "Syncrhonization should have occurred";
+  EXPECT_NE(e, cudaErrorNotReady) << "Synchronization should have occurred";

void *mem1 = pool.allocate_async(N, sv1);
CUDA_CALL(cudaMemsetAsync(mem1, 0, N, s1));
pool.deallocate_async(mem1, N, sv1);
void *mem2 = pool.allocate_async(N, sv2);
Contributor:

Why would this allocate cause synchronization?
Because of the size? If so, I would add a comment.

Contributor Author:

Partially. It's because the resource is created with the avoid-upstream option and the size is large; thus, it will first try to wait for the pending deallocations before resorting to an upstream allocation.
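
A rough sketch of the policy being described; the signature and helper names (try_allocate_from_free_list, synchronize_pending_frees, avoid_upstream_, upstream_) are illustrative placeholders, not the actual async_pool internals:

  void *allocate_async_sketch(size_t size, stream_view sv) {
    void *ptr = try_allocate_from_free_list(size, sv);
    if (ptr)
      return ptr;                      // fast path: reuse a block already in the pool
    if (avoid_upstream_) {
      synchronize_pending_frees();     // wait for deallocations pending on other streams
      ptr = try_allocate_from_free_list(size, sv);
      if (ptr)
        return ptr;                    // a block freed on another stream can now be reused
    }
    return upstream_->allocate(size);  // last resort: a real (slow) upstream allocation
  }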

Contributor Author:

I'll add the comment in the next PR if there are no more serious issues.

Contributor:

I wonder how many times this can surprise the user: their allocation won't happen immediately, but can synchronize on another stream (one that is effectively random from the caller's point of view).

Contributor Author (@mzient, Apr 13, 2021):

Well, if the alternative is to implicitly synchronize the device (or all of them, as would be the case for pinned memory), then I'd say the user wouldn't notice any negative impact. Also, it's similar to what happens in plain malloc: either you allocate from the process-local heap (fast) or issue a syscall to expand the heap (slower).

Contributor:

But in this case you provide a soft promise to allocate from the pool without any unnecessary delay.

@jantonguirao self-assigned this Apr 13, 2021
Comment on lines 33 to 34
CUDAStream stream;
stream = CUDAStream::Create(true);
Contributor:

Suggested change
-  CUDAStream stream;
-  stream = CUDAStream::Create(true);
+  CUDAStream stream = CUDAStream::Create(true);

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Add missing CUDA_CALL.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@mzient (Contributor, Author) commented Apr 13, 2021

!build

@dali-automaton (Collaborator)

CI MESSAGE: [2263532]: BUILD STARTED

Comment on lines +93 to +97
DeviceGuard dg(0);
s1 = CUDAStream::Create(true);
cudaSetDevice(1);
s2 = CUDAStream::Create(true);
cudaSetDevice(0);
Contributor:

Suggested change
-  DeviceGuard dg(0);
-  s1 = CUDAStream::Create(true);
-  cudaSetDevice(1);
-  s2 = CUDAStream::Create(true);
-  cudaSetDevice(0);
+  DeviceGuard dg(0);
+  s1 = CUDAStream::Create(true);
+  {
+    DeviceGuard dg2(1);
+    s2 = CUDAStream::Create(true);
+  }

Comment on lines +129 to +133
DeviceGuard dg(0);
s1 = CUDAStream::Create(true);
cudaSetDevice(1);
s2 = CUDAStream::Create(true);
cudaSetDevice(0);
Contributor:

Suggested change
-  DeviceGuard dg(0);
-  s1 = CUDAStream::Create(true);
-  cudaSetDevice(1);
-  s2 = CUDAStream::Create(true);
-  cudaSetDevice(0);
+  DeviceGuard dg(0);
+  s1 = CUDAStream::Create(true);
+  {
+    DeviceGuard dg2(1);
+    s2 = CUDAStream::Create(true);
+  }

Comment on lines +143 to +146
cudaSetDevice(1);
void *mem2 = pool.allocate_async(N, sv2);
EXPECT_EQ(mem1, mem2) << "Memory should have been moved to stream2 on another device.";
pool.deallocate_async(mem2, N, sv2);
Contributor:

Suggested change
-  cudaSetDevice(1);
-  void *mem2 = pool.allocate_async(N, sv2);
-  EXPECT_EQ(mem1, mem2) << "Memory should have been moved to stream2 on another device.";
-  pool.deallocate_async(mem2, N, sv2);
+  {
+    DeviceGuard dg2(1);
+    void *mem2 = pool.allocate_async(N, sv2);
+    EXPECT_EQ(mem1, mem2) << "Memory should have been moved to stream2 on another device.";
+    pool.deallocate_async(mem2, N, sv2);
+  }

Contributor Author:

I don't think it matters - the only reason I use DeviceGuard at all is to restore the default device at the end of the test, even if it fails.
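
A tiny sketch of that rationale (SomeCheckThatMayFail is a placeholder): however the scope is left, the guard's destructor puts the original device back.

  {
    DeviceGuard dg(0);           // remembers the current device and switches to device 0
    cudaSetDevice(1);            // the test body may switch devices freely afterwards
    SomeCheckThatMayFail();      // even if this throws or exits the scope early...
  }                              // ...~DeviceGuard() restores the original device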

Comment on lines +491 to +492
* Unlike DeviceGuard, which focuses on restoring the old context upon destruction,
* this object is optimized to reduce the number of API calls and doesn't restore
Contributor:

DeviceGuard also operates on a device ID and stores the context, so that it remains compatible with other libraries that may not use DeviceGuard, such as PyCUDA. ContextScope operates directly on the context.
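
A rough usage contrast based on the two comments above; the ContextScope constructor signature (taking a context handle) is an assumption, not copied from the header:

  // DeviceGuard: keyed by a device ID; it saves the current context so it can restore
  // it on destruction, which keeps code interoperable with libraries (e.g. PyCUDA)
  // that manage contexts themselves.
  {
    DeviceGuard dg(1);
    // ... work on device 1 ...
  }  // previous device/context restored here

  // ContextScope (per the quoted doc comment): takes a context directly and minimizes
  // API calls, skipping at least part of the restore work that DeviceGuard performs
  // on destruction (the quoted doc comment is truncated here).
  {
    ContextScope scope(ctx);  // 'ctx' is a context handle obtained elsewhere (assumption)
    // ... work within ctx ...
  }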

@@ -465,7 +472,44 @@ class async_pool_base : public stream_aware_memory_resource<kind> {
using FreeDescAlloc = detail::object_pool_allocator<pending_free>;

LockType lock_;
CUDAStream sync_stream_;
vector<CUDAStream> sync_streams_;
Contributor:

maybe move those member variables to the end, with the rest of them? Up to you

@dali-automaton (Collaborator)

CI MESSAGE: [2263532]: BUILD PASSED

@mzient merged commit ed24c32 into NVIDIA:master on Apr 13, 2021
@JanuszL mentioned this pull request on May 19, 2021