query warp size for host code, do not use C10_WARP_SIZE #857

jeffdaily · 2021-09-27T23:35:36Z

ROCm supports gfx targets with 32 and 64 warp size. Device compilation
correctly handles the C10_WARP_SIZE (aka warpSize) constant. Host
compilation cannot rely on a single hard-coded value, but instead needs
to query device properties at runtime.
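
As a rough illustration of why host code cannot bake in one value, the sketch below derives a warp count and a per-warp scratch size from a warp size supplied at runtime. The names `warps_per_block` and `per_warp_scratch_bytes` are hypothetical and not part of this PR. The `warp_size` argument stands in for the runtime device query, so the snippet compiles without a GPU toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: number of warps in a block when the warp size is
// only known at runtime. In real host code, `warp_size` would come from a
// device-property query rather than a compile-time macro.
inline int warps_per_block(int num_threads, int warp_size) {
  assert(warp_size > 0);
  if (num_threads % warp_size != 0) {
    throw std::invalid_argument("num_threads must be a multiple of warp_size");
  }
  return num_threads / warp_size;
}

// Bytes of scratch space needed if each warp stores one float partial result.
inline std::size_t per_warp_scratch_bytes(int num_threads, int warp_size) {
  return static_cast<std::size_t>(warps_per_block(num_threads, warp_size)) *
         sizeof(float);
}
```

With 256 threads this gives 8 warps at warp size 32 but only 4 at warp size 64, so a host-side constant fixed at either value would size the buffer incorrectly on the other target.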

jithunnair-amd

@jeffdaily Looks good, just one snippet looks suspect and a clarification is needed.

jithunnair-amd · 2021-09-28T19:06:43Z

aten/src/ATen/native/cuda/Embedding.cu

-    static_assert(num_threads % C10_WARP_SIZE == 0 &&
-                  num_threads <= cuda_utils::kCUDABlockReduceMaxThreads,
+    TORCH_INTERNAL_ASSERT(num_threads % warp_size == 0);
+    static_assert(num_threads <= cuda_utils::kCUDABlockReduceMaxThreads,


@jeffdaily Sorry, I am not clear on the difference between TORCH_INTERNAL_ASSERT and static_assert. Would you mind explaining it here?

static_assert is a C++ feature that can assert at compile-time, but it requires all variables to be known at compile-time. Since warp_size here is now a runtime query, it cannot be used in the static_assert. I was treating the TORCH_INTERNAL_ASSERT as the nearest at-runtime equivalent.
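
The split described above can be sketched in plain C++. This is a minimal illustration, not the code from Embedding.cu: `kNumThreads`, `kMaxReduceThreads` (with an assumed value of 1024), and `check_block_size` are hypothetical names, and a thrown exception stands in for TORCH_INTERNAL_ASSERT so the snippet builds without PyTorch.

```cpp
#include <cassert>
#include <stdexcept>

// Compile-time constants: both operands are known to the compiler.
constexpr int kNumThreads = 128;         // hypothetical block size
constexpr int kMaxReduceThreads = 1024;  // assumed stand-in for kCUDABlockReduceMaxThreads

// This half of the original check can remain a static_assert.
static_assert(kNumThreads <= kMaxReduceThreads,
              "block too large for block reduce");

// The divisibility half now depends on a runtime value, so it has to be
// checked when the program runs. An exception stands in for
// TORCH_INTERNAL_ASSERT in this sketch.
inline void check_block_size(int warp_size) {
  if (warp_size <= 0 || kNumThreads % warp_size != 0) {
    throw std::logic_error("num_threads must be a multiple of the warp size");
  }
}
```

Calling `check_block_size(32)` or `check_block_size(64)` passes silently, while a value that does not divide 128 fails at runtime rather than at compile time.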

jithunnair-amd · 2021-09-29T00:14:03Z

aten/src/ATen/native/cuda/TensorModeKernel.cu

-    }
+    case 2:
+      handle_fused_mode<128, scalar_t>(
+          grid, self, ti_values, ti_indices, slice_size, slices);


Is it correct to reduce the if-else to just the if logic? Shouldn't we just replace C10_WARP_SIZE with at::cuda::warp_size()?

I thought about this one for a while. I could have easily used warp_size here, the runtime query. The handle_fused_mode function takes a size template arg and the static_assert ensures the size is at least 2 * C10_WARP_SIZE.

In the switch(ceilPowerOf2) statement, the previously handled case was 256, and the next largest case is 128. For the statement if (ceilPowerOf2 > 2*C10_WARP_SIZE), we know the warp size is going to be 32 or 64, so it becomes if (128 > 64) for warp size 32 and if (128 > 128) for warp size 64. Not terribly useful IMHO. When C10_WARP_SIZE is 64, it will always call handle_fused_mode<128> for cases 128 through 2. My change here merely makes that clear. The earlier code wasn't much of an optimization to begin with.

Perhaps the if statement was wrong in the first place. Perhaps it should have been if (ceilPowerOf2 >= 2 * C10_WARP_SIZE), with >= instead of just >.
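
The arithmetic in this comment can be checked with a small sketch. The two predicate names are hypothetical; they only restate the quoted condition and the suggested `>=` variant, and are not taken from TensorModeKernel.cu.

```cpp
#include <cassert>

// Condition as quoted in the discussion: strictly greater than two warps.
constexpr bool exceeds_two_warps(int ceil_pow2, int warp_size) {
  return ceil_pow2 > 2 * warp_size;
}

// Alternative floated in the discussion: at least two warps.
constexpr bool at_least_two_warps(int ceil_pow2, int warp_size) {
  return ceil_pow2 >= 2 * warp_size;
}

// 128 is the largest switch case still to be handled at this point.
static_assert(exceeds_two_warps(128, 32), "128 > 64 holds for warp size 32");
static_assert(!exceeds_two_warps(128, 64), "128 > 128 fails for warp size 64");
static_assert(at_least_two_warps(128, 64), "128 >= 128 holds for warp size 64");
```

For warp size 64 the strict comparison is false for every remaining case (128 down to 2), which matches the observation that the branch never selected anything other than handle_fused_mode<128> on that target.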

jithunnair-amd · 2021-09-29T00:18:31Z

aten/src/ATen/native/cuda/group_norm_kernel.cu

@@ -574,8 +574,9 @@ void GroupNormKernelImplInternal(
  T* rstd_data = rstd.data_ptr<T>();

  cudaStream_t cuda_stream = at::cuda::getCurrentCUDAStream();
+


nit: extra newline

jeffdaily requested a review from jithunnair-amd September 27, 2021 23:36

jithunnair-amd added module: rocm and removed module: rocm labels Sep 28, 2021

jithunnair-amd reviewed Sep 29, 2021

jithunnair-amd and others added 5 commits October 6, 2021 04:06

Revert "[ROCm] enable kernel asserts (pytorch#49624)"

e092a9c

This reverts commit 24e27af.

Use --no-check-certificate flag for valgrind install in CentOS7 only …

dc1c008

…to workaround certificate expiry issue

Don't use --no-certificate-check flag for valgrind.org website

aab8f1d

fix launch kernel for rocmloops num_threads

3ec52d0

jithunnair-amd force-pushed the rocm4.5_internal_testing_warpsize branch from 8ba50d8 to 3ec52d0 Compare October 7, 2021 23:07

Merge branch 'rocm4.5_internal_testing_warpsize' of https://github.co…

632308a

…m/ROCmSoftwarePlatform/pytorch into rocm4.5_internal_testing_warpsize

jithunnair-amd force-pushed the rocm4.5_internal_testing branch from aab8f1d to b81dc22 Compare October 8, 2021 21:21

micmelesse and others added 2 commits October 11, 2021 17:34

fix launch_unrolled_kernel_for_multi_outputs

5af04a0

Merge pull request #861 from ROCmSoftwarePlatform/rocm4.5_internal_te…

afc5a4e

…sting_warpsize_mmelesse_pr_2 fix launch_unrolled_kernel_for_multi_outputs

micmelesse mentioned this pull request Oct 20, 2021

Rocm5.0 internal testing warpsize #870

Merged
