Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Fixes superfluous kernel template instantiations in the prefix scan #312

Merged
merged 1 commit into from
Jun 8, 2021

Conversation

elstehle
Copy link
Collaborator

This PR fixes superfluous kernel template instantiations in the prefix scan.

Currently, the prefix scan kernel template is getting instantiated for each tuning policy in the policy chain and for each GPU architecture that we're compiling for. This results in compiling 6 times the kernels that would actually be required.

With this fix, kernel templates are getting instantiated only once for each GPU architecture and, for each GPU architecture, we are compiling with the correct tuning policy in place.

Kernel template instantiations before and after the fix

With fix

nvcc --generate-code arch=compute_70,code=sm_70 --generate-code arch=compute_75,code=sm_75 main.cu
cuobjdump ./a.out -res-usage |c++filt

Resource usage:
 Common:
  GLOBAL:11
 Function void cub::DeviceScanKernel<cub::DeviceScanPolicy<int>::Policy600, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:48 STACK:0 SHARED:7696 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanInitKernel<cub::ScanTileState<int, true> >(cub::ScanTileState<int, true>, int):
  REG:10 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:364 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::EmptyKernel<void>():
  REG:4 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:352 TEXTURE:0 SURFACE:0 SAMPLER:0

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

Resource usage:
 Common:
  GLOBAL:11
 Function void cub::DeviceScanKernel<cub::DeviceScanPolicy<int>::Policy600, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:58 STACK:0 SHARED:7696 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanInitKernel<cub::ScanTileState<int, true> >(cub::ScanTileState<int, true>, int):
  REG:10 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:364 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::EmptyKernel<void>():
  REG:4 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:352 TEXTURE:0 SURFACE:0 SAMPLER:0

Without fix

nvcc --generate-code arch=compute_70,code=sm_70 --generate-code arch=compute_75,code=sm_75 main.cu
cuobjdump ./a.out -res-usage |c++filt

Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

Resource usage:
 Common:
  GLOBAL:11
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<128, 15, int, (cub::BlockLoadAlgorithm)2, (cub::CacheLoadModifier)0, (cub::BlockStoreAlgorithm)2, (cub::BlockScanAlgorithm)2, cub::MemBoundScaling<128, 15, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:48 STACK:0 SHARED:7696 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<128, 12, int, (cub::BlockLoadAlgorithm)0, (cub::CacheLoadModifier)5, (cub::BlockStoreAlgorithm)3, (cub::BlockScanAlgorithm)2, cub::MemBoundScaling<128, 12, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:40 STACK:0 SHARED:6160 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<128, 12, int, (cub::BlockLoadAlgorithm)0, (cub::CacheLoadModifier)5, (cub::BlockStoreAlgorithm)4, (cub::BlockScanAlgorithm)0, cub::MemBoundScaling<128, 12, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:32 STACK:0 SHARED:1552 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<256, 9, int, (cub::BlockLoadAlgorithm)3, (cub::CacheLoadModifier)0, (cub::BlockStoreAlgorithm)3, (cub::BlockScanAlgorithm)2, cub::MemBoundScaling<256, 9, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:47 STACK:0 SHARED:9232 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<128, 12, int, (cub::BlockLoadAlgorithm)3, (cub::CacheLoadModifier)0, (cub::BlockStoreAlgorithm)3, (cub::BlockScanAlgorithm)2, cub::MemBoundScaling<128, 12, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:45 STACK:0 SHARED:6160 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<96, 21, int, (cub::BlockLoadAlgorithm)3, (cub::CacheLoadModifier)0, (cub::BlockStoreAlgorithm)3, (cub::BlockScanAlgorithm)1, cub::MemBoundScaling<96, 21, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:64 STACK:0 SHARED:8080 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<64, 9, int, (cub::BlockLoadAlgorithm)3, (cub::CacheLoadModifier)0, (cub::BlockStoreAlgorithm)3, (cub::BlockScanAlgorithm)2, cub::MemBoundScaling<64, 9, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:47 STACK:0 SHARED:2320 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanInitKernel<cub::ScanTileState<int, true> >(cub::ScanTileState<int, true>, int):
  REG:10 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:364 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::EmptyKernel<void>():
  REG:4 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:352 TEXTURE:0 SURFACE:0 SAMPLER:0

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

Resource usage:
 Common:
  GLOBAL:11
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<128, 15, int, (cub::BlockLoadAlgorithm)2, (cub::CacheLoadModifier)0, (cub::BlockStoreAlgorithm)2, (cub::BlockScanAlgorithm)2, cub::MemBoundScaling<128, 15, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:50 STACK:0 SHARED:7696 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<128, 12, int, (cub::BlockLoadAlgorithm)0, (cub::CacheLoadModifier)5, (cub::BlockStoreAlgorithm)3, (cub::BlockScanAlgorithm)2, cub::MemBoundScaling<128, 12, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:37 STACK:0 SHARED:6160 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<128, 12, int, (cub::BlockLoadAlgorithm)0, (cub::CacheLoadModifier)5, (cub::BlockStoreAlgorithm)4, (cub::BlockScanAlgorithm)0, cub::MemBoundScaling<128, 12, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:30 STACK:0 SHARED:1552 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<256, 9, int, (cub::BlockLoadAlgorithm)3, (cub::CacheLoadModifier)0, (cub::BlockStoreAlgorithm)3, (cub::BlockScanAlgorithm)2, cub::MemBoundScaling<256, 9, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:45 STACK:0 SHARED:9232 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<128, 12, int, (cub::BlockLoadAlgorithm)3, (cub::CacheLoadModifier)0, (cub::BlockStoreAlgorithm)3, (cub::BlockScanAlgorithm)2, cub::MemBoundScaling<128, 12, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:47 STACK:0 SHARED:6160 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<96, 21, int, (cub::BlockLoadAlgorithm)3, (cub::CacheLoadModifier)0, (cub::BlockStoreAlgorithm)3, (cub::BlockScanAlgorithm)1, cub::MemBoundScaling<96, 21, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:70 STACK:0 SHARED:8080 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanKernel<cub::AgentScanPolicy<64, 9, int, (cub::BlockLoadAlgorithm)3, (cub::CacheLoadModifier)0, (cub::BlockStoreAlgorithm)3, (cub::BlockScanAlgorithm)2, cub::MemBoundScaling<64, 9, int> >, cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, cub::Sum, int, int>(cub::TransformInputIterator<int, TypeConversionOp<int>, signed char*, long>, int*, cub::ScanTileState<int, true>, int, cub::Sum, int, int):
  REG:45 STACK:0 SHARED:2320 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::DeviceScanInitKernel<cub::ScanTileState<int, true> >(cub::ScanTileState<int, true>, int):
  REG:10 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:364 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void cub::EmptyKernel<void>():
  REG:4 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:352 TEXTURE:0 SURFACE:0 SAMPLER:0
</details>

@@ -387,12 +389,12 @@ struct DispatchScan:
CUB_RUNTIME_FUNCTION __host__ __forceinline__
cudaError_t Invoke()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Host-side Invoke function template is getting instantiated for all tuning policies in the chain. Hence, the explicit kernel template instantiation for DeviceScanKernel<Policy ... below would be instantiated for all Policy.

Instead, the idea is to instantiate the DeviceScanKernel kernel template always with the top of the tuning policy chain and then, in the device code, identify and apply the tuning policy for the GPU architecture of the current compilation pass.

@elstehle elstehle changed the title Fixes instantiating kernel templates for all tuning policies Fixes superfluous kernel template instantiations in the prefix scan May 27, 2021
alliepiper added a commit to alliepiper/thrust that referenced this pull request Jun 1, 2021
@alliepiper
Copy link
Collaborator

alliepiper commented Jun 1, 2021

Thanks Elias -- this will be useful while we work out the remaining issues with the new if target dispatcher.

DVS CL: 30027193
gpuCI: NVIDIA/thrust#1450

@alliepiper alliepiper self-assigned this Jun 1, 2021
@alliepiper alliepiper added testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI in progress Started gpuCI testing. labels Jun 1, 2021
@alliepiper alliepiper added this to Inbox in PR Tracking via automation Jun 1, 2021
@alliepiper alliepiper added this to the 1.13.0 milestone Jun 1, 2021
@alliepiper alliepiper moved this from Inbox to Tests Pending in PR Tracking Jun 1, 2021
@alliepiper alliepiper added testing: internal ci passed Passed internal NVIDIA CI (DVS). testing: gpuCI passed Passed gpuCI testing. and removed testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI in progress Started gpuCI testing. labels Jun 8, 2021
@alliepiper alliepiper moved this from Tests Pending to Integration in PR Tracking Jun 8, 2021
@alliepiper alliepiper merged commit d684a99 into NVIDIA:main Jun 8, 2021
PR Tracking automation moved this from Integration to Done Jun 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
testing: gpuCI passed Passed gpuCI testing. testing: internal ci passed Passed internal NVIDIA CI (DVS).
Projects
Development

Successfully merging this pull request may close these issues.

None yet

2 participants