Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Optimize compilation time for the common case #400

Conversation

gevtushenko
Copy link
Collaborator

This PR contains compilation time improvement for the common case when agents fit into default shared memory size (48 KB).

@gevtushenko gevtushenko added the testing: gpuCI in progress Started gpuCI testing. label Oct 29, 2021
@gevtushenko
Copy link
Collaborator Author

gpuCI: NVIDIA/thrust#1557
DVS: 30594457

@gevtushenko gevtushenko added the testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). label Oct 29, 2021
Copy link
Collaborator

@alliepiper alliepiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @senior-zero, this is a huge improvement -- I'm seeing compilation time improve on both nvcc and nvc++. It isn't quite back to before, but it's a significant reduction!

Compile times for the thrust::sort test program:

Compiler Old merge sort Current merge sort This PR
nvcc 18.79s 29.10s (+55%) 22.44s (+19%)
nvc++ 61.81s 75.61s (+22%) 65.80s (+6%)

LGTM -- Let's get this merged once the tests are passing and see how this impacts the total build time.

cub/device/dispatch/dispatch_merge_sort.cuh Show resolved Hide resolved
@alliepiper alliepiper added this to the 1.16.0 milestone Oct 29, 2021
@alliepiper alliepiper added helps: nvc++ Helps or needed by NVC++. nvbug Has an associated internal NVIDIA NVBug. P0: must have Absolutely necessary. Critical issue, major blocker, etc. labels Oct 29, 2021
@alliepiper
Copy link
Collaborator

Related to NVBugs 3418930 and 3419768.

@gevtushenko gevtushenko added testing: gpuCI passed Passed gpuCI testing. and removed testing: gpuCI in progress Started gpuCI testing. labels Oct 29, 2021
@dongxiao92
Copy link

Could you help to explain why these changes can reduce compilation time?
From the code changes, I understand that 1) instantiation for DeviceMergeSortBlockSortKernel with use_vshmem=true may not be needed and 2) compilation for code in this if branch may not be needed.
Does the compilation time reduction come from these two changes?

@gevtushenko
Copy link
Collaborator Author

Could you help to explain why these changes can reduce compilation time? From the code changes, I understand that 1) instantiation for DeviceMergeSortBlockSortKernel with use_vshmem=true may not be needed and 2) compilation for code in this if branch may not be needed. Does the compilation time reduction come from these two changes?

Hello, @dongxiao92!

Two specializations of merge sort kernels exist:

kernel use vshmem=false use vshmem=true
DeviceMergeSortBlockSortKernel + +
DeviceMergeSortMergeKernel + +

Since the check for available shared memory is performed at runtime, we had to compile for both cases (in generic case). This patch relies on the thrust approach which consists of comparing kernel shared memory size requirements with the default available shared memory size (48KB). This check can be done at compile time. If we know that virtual shared memory is not required at compile-time, there's no need to compile merge sort kernels twice.

@gevtushenko gevtushenko added testing: internal ci passed Passed internal NVIDIA CI (DVS). and removed testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Oct 30, 2021
@gevtushenko gevtushenko merged commit 8c32c79 into NVIDIA:main Oct 30, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
helps: nvc++ Helps or needed by NVC++. nvbug Has an associated internal NVIDIA NVBug. P0: must have Absolutely necessary. Critical issue, major blocker, etc. testing: gpuCI passed Passed gpuCI testing. testing: internal ci passed Passed internal NVIDIA CI (DVS).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants