Pinned async resource #2858
Conversation
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
!build
CI MESSAGE: [2259724]: BUILD STARTED
CI MESSAGE: [2259724]: BUILD PASSED
dali/core/mm/pinned_pool_test.cc
Outdated
CUDA_CALL(cudaStreamSynchronize(stream));
pool.deallocate_async(mem1, N, sv);
Would it make any sense to swap deallocate_async and cudaStreamSynchronize?
It makes no difference, since the pool will never truly deallocate the memory, so it's going to remain available anyway.
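The retention behavior described in this reply can be illustrated with a minimal, self-contained sketch. The names here (`ToyPool`, `allocate`, `deallocate`) are hypothetical stand-ins, not DALI's actual classes: the point is only that deallocation returns the block to a per-size free list instead of releasing it upstream, so the address stays valid inside the pool and can be handed out again.

```cpp
#include <cstddef>
#include <map>
#include <new>
#include <vector>

// Minimal sketch (hypothetical names, not DALI's real implementation):
// a pool whose deallocate never returns memory upstream - it only puts
// the block back on a free list, so the pointer remains reusable.
class ToyPool {
 public:
  void *allocate(std::size_t size) {
    auto it = free_blocks_.find(size);
    if (it != free_blocks_.end() && !it->second.empty()) {
      void *ptr = it->second.back();   // reuse a retained block
      it->second.pop_back();
      return ptr;
    }
    return ::operator new(size);       // upstream allocation
  }
  void deallocate(void *ptr, std::size_t size) {
    free_blocks_[size].push_back(ptr); // retained, never freed upstream
  }
 private:
  std::map<std::size_t, std::vector<void *>> free_blocks_;
};
```

With this scheme, allocating the same size right after a deallocation hands back the very same pointer, which is why the order of `deallocate_async` and the stream sync does not matter for availability.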
dali/core/mm/pinned_pool_test.cc
Outdated
pool.deallocate_async(mem1, N, sv1);
void *mem2 = pool.allocate_async(N, sv2);
auto e = cudaStreamQuery(s1);
EXPECT_NE(e, cudaErrorNotReady) << "Syncrhonization should have occurred";
Suggested change:
- EXPECT_NE(e, cudaErrorNotReady) << "Syncrhonization should have occurred";
+ EXPECT_NE(e, cudaErrorNotReady) << "Synchronization should have occurred";
void *mem1 = pool.allocate_async(N, sv1);
CUDA_CALL(cudaMemsetAsync(mem1, 0, N, s1));
pool.deallocate_async(mem1, N, sv1);
void *mem2 = pool.allocate_async(N, sv2);
Why would this allocate cause synchronization?
Because of the size? If so, I would add a comment.
Partially. It's because the resource is created with avoid upstream
and the size is large - thus, it will first try to wait for the pending deallocations before resorting to upstream allocation.
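The "avoid upstream" policy described here can be sketched in plain C++ (hypothetical names; in the real resource, draining pending deallocations involves waiting on CUDA stream events, which is where the synchronization comes from - the `drain_pending` call below is only a stand-in for that wait):

```cpp
#include <cstddef>
#include <deque>
#include <new>
#include <vector>

// Hypothetical sketch of an "avoid upstream" allocation policy:
// before asking the upstream allocator for a block, the pool first
// reclaims pending (not-yet-ready) deallocations.
struct PendingFree { void *ptr; std::size_t size; };

class AvoidUpstreamPool {
 public:
  void *allocate(std::size_t size) {
    if (void *p = take_free(size)) return p;  // fast path: block is ready
    drain_pending();                          // stand-in for stream-event wait
    if (void *p = take_free(size)) return p;  // retry after reclaiming
    return ::operator new(size);              // last resort: upstream
  }
  void deallocate_async(void *ptr, std::size_t size) {
    pending_.push_back({ptr, size});          // reusable only after drain
  }
  bool drained = false;                       // exposed for the usage example
 private:
  void *take_free(std::size_t size) {
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->size >= size) {
        void *p = it->ptr;
        free_.erase(it);
        return p;
      }
    }
    return nullptr;
  }
  void drain_pending() {
    drained = true;                           // real code waits for stream work here
    for (auto &f : pending_) free_.push_back(f);
    pending_.clear();
  }
  std::deque<PendingFree> pending_;
  std::vector<PendingFree> free_;
};
```

Under this policy the second `allocate` in the test reclaims the pending block instead of going upstream, which is what forces the cross-stream synchronization the test checks for.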
I'll add the comment in the next PR if there are no more serious issues.
I wonder how many times this can surprise the user - that the allocation won't happen immediately, but can synchronize on another stream (one that is effectively random from the caller's point of view).
Well, if the alternative is to implicitly synchronize the device (or all of them, as would be the case for pinned memory), then I'd say the user wouldn't notice any negative impact. Also, it's similar to what happens in plain malloc - either you allocate from the process-local heap (fast) or issue a syscall to expand the heap (slower).
But in this case you provide a soft promise to allocate from the pool without any unnecessary delay.
dali/core/mm/pinned_pool_test.cc
Outdated
CUDAStream stream;
stream = CUDAStream::Create(true);
Suggested change:
- CUDAStream stream;
- stream = CUDAStream::Create(true);
+ CUDAStream stream = CUDAStream::Create(true);
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Add missing CUDA_CALL. Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
!build
CI MESSAGE: [2263532]: BUILD STARTED |
DeviceGuard dg(0);
s1 = CUDAStream::Create(true);
cudaSetDevice(1);
s2 = CUDAStream::Create(true);
cudaSetDevice(0);
Suggested change:
- DeviceGuard dg(0);
- s1 = CUDAStream::Create(true);
- cudaSetDevice(1);
- s2 = CUDAStream::Create(true);
- cudaSetDevice(0);
+ DeviceGuard dg(0);
+ s1 = CUDAStream::Create(true);
+ {
+   DeviceGuard dg2(1);
+   s2 = CUDAStream::Create(true);
+ }
cudaSetDevice(1);
void *mem2 = pool.allocate_async(N, sv2);
EXPECT_EQ(mem1, mem2) << "Memory should have been moved to stream2 on another device.";
pool.deallocate_async(mem2, N, sv2);
Suggested change:
- cudaSetDevice(1);
- void *mem2 = pool.allocate_async(N, sv2);
- EXPECT_EQ(mem1, mem2) << "Memory should have been moved to stream2 on another device.";
- pool.deallocate_async(mem2, N, sv2);
+ {
+   DeviceGuard dg2(1);
+   void *mem2 = pool.allocate_async(N, sv2);
+   EXPECT_EQ(mem1, mem2) << "Memory should have been moved to stream2 on another device.";
+   pool.deallocate_async(mem2, N, sv2);
+ }
I don't think it matters - the only reason I use a device guard at all is to restore the default device at the end of the test, even if the test fails.
* Unlike DeviceGuard, which focuses on restoring the old context upon destruction,
* this object is optimized to reduce the number of API calls and doesn't restore
DeviceGuard also operates on the device ID and stores the context, to stay compatible with other libraries that may not use DeviceGuard (like PyCUDA). ContextScope operates directly on the context.
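The DeviceGuard pattern discussed throughout this review can be shown as a simplified RAII sketch. Everything here is a stand-in: `set_device` mocks `cudaSetDevice` with a plain integer, and this is not DALI's actual DeviceGuard (which, as noted above, also saves and restores the context).

```cpp
// Simplified RAII sketch of a device guard (hypothetical, not DALI's class):
// the constructor records the current device and switches to a new one;
// the destructor restores the old device even on early return or exception.
namespace mock { inline int current_device = 0; }
inline void set_device(int dev) { mock::current_device = dev; }  // mocks cudaSetDevice

class DeviceGuard {
 public:
  explicit DeviceGuard(int new_dev) : old_dev_(mock::current_device) {
    set_device(new_dev);
  }
  ~DeviceGuard() { set_device(old_dev_); }  // restore on scope exit
  DeviceGuard(const DeviceGuard &) = delete;
  DeviceGuard &operator=(const DeviceGuard &) = delete;
 private:
  int old_dev_;
};
```

This is why the nested-scope suggestions above (`{ DeviceGuard dg2(1); ... }`) are more robust than paired `cudaSetDevice` calls: the restore happens automatically even if an EXPECT failure or exception unwinds the scope.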
@@ -465,7 +472,44 @@ class async_pool_base : public stream_aware_memory_resource<kind> {
  using FreeDescAlloc = detail::object_pool_allocator<pending_free>;

  LockType lock_;
  CUDAStream sync_stream_;
  vector<CUDAStream> sync_streams_;
maybe move those member variables to the end, with the rest of them? Up to you
CI MESSAGE: [2263532]: BUILD PASSED
JIRA TASK: DALI-1902