Pinned async resource #2858
Conversation
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
!build
CI MESSAGE: [2259724]: BUILD STARTED
CI MESSAGE: [2259724]: BUILD PASSED
dali/core/mm/pinned_pool_test.cc
Outdated
CUDA_CALL(cudaStreamSynchronize(stream));
pool.deallocate_async(mem1, N, sv);
Would it make any sense to swap deallocate_async and cudaStreamSynchronize?
It makes no difference, since the pool will never truly deallocate the memory, so it's going to remain available anyway.
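The retention behavior described in this reply can be illustrated with a minimal, self-contained sketch. The names here (`ToyPool`, `allocate`, `deallocate`) are hypothetical stand-ins, not DALI's actual classes: the point is only that deallocation returns the block to a per-size free list instead of releasing it upstream, so the address stays valid inside the pool and can be handed out again.

```cpp
#include <cstddef>
#include <map>
#include <new>
#include <vector>

// Minimal sketch (hypothetical names, not DALI's real implementation):
// a pool whose deallocate never returns memory upstream - it only puts
// the block back on a free list, so the pointer remains reusable.
class ToyPool {
 public:
  void *allocate(std::size_t size) {
    auto it = free_blocks_.find(size);
    if (it != free_blocks_.end() && !it->second.empty()) {
      void *ptr = it->second.back();   // reuse a retained block
      it->second.pop_back();
      return ptr;
    }
    return ::operator new(size);       // upstream allocation
  }
  void deallocate(void *ptr, std::size_t size) {
    free_blocks_[size].push_back(ptr); // retained, never freed upstream
  }
 private:
  std::map<std::size_t, std::vector<void *>> free_blocks_;
};
```

With this scheme, allocating the same size right after a deallocation hands back the very same pointer, which is why the order of `deallocate_async` and the stream sync does not matter for availability.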
dali/core/mm/pinned_pool_test.cc
Outdated
pool.deallocate_async(mem1, N, sv1);
void *mem2 = pool.allocate_async(N, sv2);
auto e = cudaStreamQuery(s1);
EXPECT_NE(e, cudaErrorNotReady) << "Syncrhonization should have occurred";
Suggested change:
- EXPECT_NE(e, cudaErrorNotReady) << "Syncrhonization should have occurred";
+ EXPECT_NE(e, cudaErrorNotReady) << "Synchronization should have occurred";
void *mem1 = pool.allocate_async(N, sv1);
CUDA_CALL(cudaMemsetAsync(mem1, 0, N, s1));
pool.deallocate_async(mem1, N, sv1);
void *mem2 = pool.allocate_async(N, sv2);
Why would this allocate cause synchronization?
Because of the size? If so, I would add a comment.
Partially. It's because the resource is created with avoid upstream
and the size is large - thus, it will first try to wait for the pending deallocations before resorting to upstream allocation.
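The "avoid upstream" policy described here can be sketched in plain C++ (hypothetical names; in the real resource, draining pending deallocations involves waiting on CUDA stream events, which is where the synchronization comes from - the `drain_pending` call below is only a stand-in for that wait):

```cpp
#include <cstddef>
#include <deque>
#include <new>
#include <vector>

// Hypothetical sketch of an "avoid upstream" allocation policy:
// before asking the upstream allocator for a block, the pool first
// reclaims pending (not-yet-ready) deallocations.
struct PendingFree { void *ptr; std::size_t size; };

class AvoidUpstreamPool {
 public:
  void *allocate(std::size_t size) {
    if (void *p = take_free(size)) return p;  // fast path: block is ready
    drain_pending();                          // stand-in for stream-event wait
    if (void *p = take_free(size)) return p;  // retry after reclaiming
    return ::operator new(size);              // last resort: upstream
  }
  void deallocate_async(void *ptr, std::size_t size) {
    pending_.push_back({ptr, size});          // reusable only after drain
  }
  bool drained = false;                       // exposed for the usage example
 private:
  void *take_free(std::size_t size) {
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->size >= size) {
        void *p = it->ptr;
        free_.erase(it);
        return p;
      }
    }
    return nullptr;
  }
  void drain_pending() {
    drained = true;                           // real code waits for stream work here
    for (auto &f : pending_) free_.push_back(f);
    pending_.clear();
  }
  std::deque<PendingFree> pending_;
  std::vector<PendingFree> free_;
};
```

Under this policy the second `allocate` in the test reclaims the pending block instead of going upstream, which is what forces the cross-stream synchronization the test checks for.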
I'll add the comment in the next PR if there are no more serious issues.
I wonder how many times this can surprise the user - that the allocation won't happen immediately, but can synchronize on another stream (one that is effectively random from the caller's point of view).
Well, if the alternative is to implicitly synchronize the device (or all of them, as would be the case for pinned memory), then I'd say the user wouldn't notice any negative impact. Also, it's similar to what happens in plain malloc - either you allocate from the process-local heap (fast) or issue a syscall to expand the heap (slower).
But in this case you provide a soft promise to allocate from the pool without any unnecessary delay.
dali/core/mm/pinned_pool_test.cc
Outdated
CUDAStream stream;
stream = CUDAStream::Create(true);
Suggested change:
- CUDAStream stream;
- stream = CUDAStream::Create(true);
+ CUDAStream stream = CUDAStream::Create(true);
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Add missing CUDA_CALL. Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
!build
CI MESSAGE: [2263532]: BUILD STARTED |
DeviceGuard dg(0);
s1 = CUDAStream::Create(true);
cudaSetDevice(1);
s2 = CUDAStream::Create(true);
cudaSetDevice(0);
Suggested change:
- DeviceGuard dg(0);
- s1 = CUDAStream::Create(true);
- cudaSetDevice(1);
- s2 = CUDAStream::Create(true);
- cudaSetDevice(0);
+ DeviceGuard dg(0);
+ s1 = CUDAStream::Create(true);
+ {
+   DeviceGuard dg2(1);
+   s2 = CUDAStream::Create(true);
+ }
cudaSetDevice(1);
void *mem2 = pool.allocate_async(N, sv2);
EXPECT_EQ(mem1, mem2) << "Memory should have been moved to stream2 on another device.";
pool.deallocate_async(mem2, N, sv2);
Suggested change:
- cudaSetDevice(1);
- void *mem2 = pool.allocate_async(N, sv2);
- EXPECT_EQ(mem1, mem2) << "Memory should have been moved to stream2 on another device.";
- pool.deallocate_async(mem2, N, sv2);
+ {
+   DeviceGuard dg2(1);
+   void *mem2 = pool.allocate_async(N, sv2);
+   EXPECT_EQ(mem1, mem2) << "Memory should have been moved to stream2 on another device.";
+   pool.deallocate_async(mem2, N, sv2);
+ }
I don't think it matters - the only reason I use a device guard at all is to restore the default device at the end of the test, even if the test fails.
* Unlike DeviceGuard, which focuses on restoring the old context upon destruction,
* this object is optimized to reduce the number of API calls and doesn't restore
DeviceGuard also operates on the device ID and stores the context, to stay compatible with other libraries that may not use DeviceGuard (like PyCUDA). ContextScope operates directly on the context.
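The DeviceGuard pattern discussed throughout this review can be shown as a simplified RAII sketch. Everything here is a stand-in: `set_device` mocks `cudaSetDevice` with a plain integer, and this is not DALI's actual DeviceGuard (which, as noted above, also saves and restores the context).

```cpp
// Simplified RAII sketch of a device guard (hypothetical, not DALI's class):
// the constructor records the current device and switches to a new one;
// the destructor restores the old device even on early return or exception.
namespace mock { inline int current_device = 0; }
inline void set_device(int dev) { mock::current_device = dev; }  // mocks cudaSetDevice

class DeviceGuard {
 public:
  explicit DeviceGuard(int new_dev) : old_dev_(mock::current_device) {
    set_device(new_dev);
  }
  ~DeviceGuard() { set_device(old_dev_); }  // restore on scope exit
  DeviceGuard(const DeviceGuard &) = delete;
  DeviceGuard &operator=(const DeviceGuard &) = delete;
 private:
  int old_dev_;
};
```

This is why the nested-scope suggestions above (`{ DeviceGuard dg2(1); ... }`) are more robust than paired `cudaSetDevice` calls: the restore happens automatically even if an EXPECT failure or exception unwinds the scope.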
@@ -465,7 +472,44 @@ class async_pool_base : public stream_aware_memory_resource<kind> {
  using FreeDescAlloc = detail::object_pool_allocator<pending_free>;

  LockType lock_;
  CUDAStream sync_stream_;
  vector<CUDAStream> sync_streams_;
maybe move those member variables to the end, with the rest of them? Up to you
CI MESSAGE: [2263532]: BUILD PASSED
JIRA TASK: DALI-1902