Initial commit to pass scale as Tensor for multi_tensor_scale op #2594
ksivaman merged 11 commits into NVIDIA:main
Conversation
Force-pushed from b2a5ae5 to 4081afc
Force-pushed from 4081afc to cfd4370
Greptile Summary: This PR allows passing the scale as a tensor for the multi_tensor_scale op. Key changes: a new nvte_multi_tensor_scale_tensor_cuda API, a ScalePtrFunctor that dereferences the scale pointer in device code, and the noop_flag parameter renamed to is_infinite.
Outstanding concerns (carried from prior review rounds): missing dtype/numel/device validation for the scale tensor, and no finiteness check on the scale value itself.
Confidence Score: 2/5
Sequence Diagram
sequenceDiagram
participant PY as Python (optimizers)
participant CPP as scale.cpp (PyTorch ext)
participant API as nvte API (scale.cu)
participant KERN as CUDA Kernel
PY->>CPP: multi_tensor_scale_tensor(chunk_size, is_infinite, tensor_lists, scale_tensor)
CPP->>CPP: makeTransformerEngineTensor(is_infinite)
CPP->>CPP: makeTransformerEngineTensor(scale)
CPP->>API: nvte_multi_tensor_scale_tensor_cuda(...)
API->>API: convertNVTETensorCheck(scale)
API->>API: reinterpret_cast<float*>(scale_tensor->data.dptr)
API->>KERN: multi_tensor_apply<2>(..., ScalePtrFunctor, scale_ptr)
loop Per block / chunk
KERN->>KERN: ScalePtrFunctor::operator() → scale_chunk(scale_ptr)
KERN->>KERN: get_scale_value(scale_ptr) — dereferences GPU ptr
KERN->>KERN: multiply inputs, track isfinite(r_in[ii])
KERN-->>API: write is_infinite_gmem if !finite
end
API-->>CPP: return
CPP-->>PY: return (is_infinite tensor updated in-place)
Last reviewed commit: e27f134
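A consolidated sketch of the scale.cu entry point the diagram traces, assembled from the hunks quoted in this thread; the exact parameter list is an assumption reconstructed from the call sites below, not verbatim source.

void nvte_multi_tensor_scale_tensor_cuda(int chunk_size, NVTETensor noop_flag,
                                         NVTETensor **tensor_lists, size_t num_tensor_lists,
                                         size_t num_tensors_per_list, NVTETensor scale,
                                         cudaStream_t stream) {
  NVTE_API_CALL(nvte_multi_tensor_scale_tensor_cuda);
  using namespace transformer_engine;
  // Unwrap the scale tensor and pass its raw device pointer to the kernel launcher.
  Tensor *scale_tensor = convertNVTETensorCheck(scale);
  multi_tensor_scale::multi_tensor_scale_tensor_cuda(
      chunk_size, *convertNVTETensorCheck(noop_flag),
      convert_tensor_array(tensor_lists, num_tensor_lists, num_tensors_per_list),
      reinterpret_cast<float *>(scale_tensor->data.dptr), stream);
}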
Tensor *scale_tensor = convertNVTETensorCheck(scale);
multi_tensor_scale::multi_tensor_scale_tensor_cuda(
    chunk_size, *convertNVTETensorCheck(noop_flag),
    convert_tensor_array(tensor_lists, num_tensor_lists, num_tensors_per_list),
    reinterpret_cast<float *>(scale_tensor->data.dptr), stream);
validate scale tensor dtype and size before casting. should check that:
- dtype is kFloat32
- numel() == 1 (single-element tensor)
Tensor *scale_tensor = convertNVTETensorCheck(scale);
NVTE_CHECK(scale_tensor->dtype() == DType::kFloat32,
"scale tensor must be float32, got ", to_string(scale_tensor->dtype()));
NVTE_CHECK(scale_tensor->numel() == 1,
"scale tensor must have exactly 1 element, got ", scale_tensor->numel());
multi_tensor_scale::multi_tensor_scale_tensor_cuda(
// I'd like this kernel to propagate infs/nans.
// if(*noop_gmem == 1)
//   return;
float scale = *scale_ptr;
all threads in a block read from the same device memory location without synchronization; this could cause redundant memory traffic but is functionally correct. consider caching the value in shared memory or using __ldg() for read-only cache optimization, as sketched below
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
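A minimal sketch of the __ldg() option, using a hypothetical load_scale helper that is not part of the PR; since all threads read the same word, the read-only cache serves the whole warp from one transaction.

// Hypothetical helper: route the scale read through the read-only data cache
// instead of a plain global-memory dereference.
__device__ __forceinline__ float load_scale(const float *scale_ptr) {
  return __ldg(scale_ptr);
}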
NVTE_API_CALL(nvte_multi_tensor_scale_tensor_cuda);
using namespace transformer_engine;

Tensor *scale_tensor = convertNVTETensorCheck(scale);
validate scale tensor before casting - should check dtype is kFloat32 and numel() == 1
Suggested change:
Tensor *scale_tensor = convertNVTETensorCheck(scale);
NVTE_CHECK(scale_tensor->dtype() == DType::kFloat32,
           "scale tensor must be float32, got ", to_string(scale_tensor->dtype()));
NVTE_CHECK(scale_tensor->numel() == 1,
           "scale tensor must have exactly 1 element, got ", scale_tensor->numel());
multi_tensor_scale::multi_tensor_scale_tensor_cuda(
Additional Comments (1)
 * \warning This API is **experimental** and subject to change.
 *
 * \param[in] chunk_size Number of tensor elements processed by a CUDA block.
 * \param[in] noop_flag If this single element tensor has non-zero value, kernel will exit immediately.
This documentation is incorrect. We should change this variable name to something more accurate and update everywhere.
Suggested change:
- * \param[in] noop_flag If this single element tensor has non-zero value, kernel will exit immediately.
+ * \param[out] isfinite Whether the kernel detected a non-finite input value.
Changed noop_flag to is_infinite for nvte_multi_tensor_scale_tensor_cuda and nvte_multi_tensor_scale_cuda ops
int i = i_start + threadIdx.x + ii * blockDim.x;
if (i < n && i < chunk_size) r_in[ii] = in[i];
}
// note for clarification to future michael:
Nit: There is no Michael taking responsibility for this. This comment is originally from NVIDIA/apex@6763a8b and it's made its way here through multiple levels of blind copy-pastes.
Removed the comment.
void multi_tensor_scale_tensor_cuda(int chunk_size, at::Tensor is_infinite,
                                    std::vector<std::vector<at::Tensor>> tensor_lists,
                                    at::Tensor scale) {
  auto is_infinite_cu = makeTransformerEngineTensor(is_infinite);
  auto scale_cu = makeTransformerEngineTensor(scale);
  auto [_, __, tensor_lists_ptr, num_lists, num_tensors] =
      makeTransformerEngineTensorList(tensor_lists);
  nvte_multi_tensor_scale_tensor_cuda(chunk_size, is_infinite_cu.data(), tensor_lists_ptr.data(),
                                      num_lists, num_tensors, scale_cu.data(),
                                      at::cuda::getCurrentCUDAStream());
}
Missing CUDA device validation for scale tensor
The scale parameter is an at::Tensor that could reside on CPU (host) memory. Inside ScalePtrFunctor, the kernel dereferences scale_ptr directly in device code:

float scale = *scale_ptr;

If scale is a CPU tensor, every GPU thread will attempt to read a host memory address from device code, resulting in an illegal memory access at runtime (typically surfacing as a silent hang or a "CUDA error: an illegal memory access was encountered" crash).
A device check should be added before wrapping the tensor, similar to how other PyTorch CUDA extensions guard their inputs:
void multi_tensor_scale_tensor_cuda(int chunk_size, at::Tensor is_infinite,
                                    std::vector<std::vector<at::Tensor>> tensor_lists,
                                    at::Tensor scale) {
  TORCH_CHECK(scale.is_cuda(), "scale tensor must be on CUDA device");
  TORCH_CHECK(scale.numel() == 1, "scale tensor must contain exactly 1 element, got ", scale.numel());
  TORCH_CHECK(scale.scalar_type() == at::kFloat, "scale tensor must be float32");
  auto is_infinite_cu = makeTransformerEngineTensor(is_infinite);
  auto scale_cu = makeTransformerEngineTensor(scale);

Without this guard, passing a CPU-side torch.tensor([1.0]) (very easy to do accidentally) silently corrupts the kernel launch.
template <typename in_t, typename out_t>
struct ScalePtrFunctor {
  __device__ __forceinline__ void operator()(int chunk_size, volatile int *is_infinite_gmem,
                                             TensorListMetadata<2> &tl,  // NOLINT(*)
                                             float *scale_ptr) {
    // I'd like this kernel to propagate infs/nans.
    // if(*noop_gmem == 1)
    //   return;
    float scale = *scale_ptr;
    int tensor_loc = tl.block_to_tensor[blockIdx.x];
    int chunk_idx = tl.block_to_chunk[blockIdx.x];
    int n = tl.sizes[tensor_loc];

    in_t *in = reinterpret_cast<in_t *>(tl.addresses[0][tensor_loc]);
    in += chunk_idx * chunk_size;

    out_t *out = reinterpret_cast<out_t *>(tl.addresses[1][tensor_loc]);
    out += chunk_idx * chunk_size;

    n -= chunk_idx * chunk_size;

    bool finite = true;
    in_t r_in[ILP];
    out_t r_out[ILP];

    // to make things simple, we put aligned case in a different code path
    if (n % ILP == 0 && chunk_size % ILP == 0 && is_aligned(in) && is_aligned(out)) {
      for (int i_start = threadIdx.x; i_start * ILP < n && i_start * ILP < chunk_size;
           i_start += blockDim.x) {
        // load
        load_store(r_in, in, 0, i_start);
#pragma unroll
        for (int ii = 0; ii < ILP; ii++) {
          r_out[ii] = static_cast<float>(r_in[ii]) * scale;
          finite = finite && isfinite(static_cast<float>(r_in[ii]));
        }
        // store
        load_store(out, r_out, i_start, 0);
      }
    } else {
      // Non-divergent exit condition for __syncthreads, not necessary here
      for (int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * ILP) {
#pragma unroll
        for (int ii = 0; ii < ILP; ii++) {
          r_in[ii] = 0.f;
          int i = i_start + threadIdx.x + ii * blockDim.x;
          if (i < n && i < chunk_size) r_in[ii] = in[i];
        }
        // From a pure memory dependency perspective, there's likely no point unrolling
        // the write loop, since writes just fire off once their LDGs arrive.
        // Put another way, the STGs are dependent on the LDGs, but not on each other.
        // There is still compute ILP benefit from unrolling the loop though.
#pragma unroll
        for (int ii = 0; ii < ILP; ii++) {
          r_out[ii] = static_cast<float>(r_in[ii]) * scale;
          finite = finite && isfinite(static_cast<float>(r_in[ii]));
        }
#pragma unroll
        for (int ii = 0; ii < ILP; ii++) {
          int i = i_start + threadIdx.x + ii * blockDim.x;
          if (i < n && i < chunk_size) out[i] = r_out[ii];
        }
      }
    }
    if (!finite)
      *is_infinite_gmem = 1;  // Blindly fire off a write. These will race but that's ok.
  }
};
ScalePtrFunctor duplicates ScaleFunctor almost entirely
ScalePtrFunctor is identical to ScaleFunctor except for the one-line dereference float scale = *scale_ptr; at the top. The remaining ~60 lines of aligned/unaligned loop logic, ILP unrolling, finite tracking, and the is_infinite_gmem write are copy-pasted verbatim.
This duplication means any future bugfix or optimization (e.g. adding __ldg() for the scale pointer, fixing the finite flag propagation, or changing the loop bounds logic) must be applied in both functors, which is a maintenance hazard. Consider templating the scale acquisition so a single functor covers both cases, for example via a ScaleGetter policy type or by simply passing a pointer and dereferencing it in the existing ScaleFunctor when USE_PTR is set.
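A minimal sketch of the policy idea, under the assumption that the shared loop body can be factored into a single functor; ScaleByValue, ScaleByPtr, and UnifiedScaleFunctor are hypothetical names, not code from this PR.

// Hypothetical policy types: only scale acquisition differs between the two variants.
struct ScaleByValue {
  float scale;
  __device__ __forceinline__ float get() const { return scale; }
};

struct ScaleByPtr {
  const float *scale_ptr;
  __device__ __forceinline__ float get() const { return *scale_ptr; }
};

template <typename in_t, typename out_t, typename ScaleGetter>
struct UnifiedScaleFunctor {
  __device__ __forceinline__ void operator()(int chunk_size, volatile int *is_infinite_gmem,
                                             TensorListMetadata<2> &tl,  // NOLINT(*)
                                             ScaleGetter getter) {
    float scale = getter.get();  // the one line that differed between the two functors
    // ... the shared aligned/unaligned loop body from ScaleFunctor goes here ...
  }
};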
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Force-pushed from 207a20d to aa4f177
float scale = *scale_ptr;
int tensor_loc = tl.block_to_tensor[blockIdx.x];
int chunk_idx = tl.block_to_chunk[blockIdx.x];
int n = tl.sizes[tensor_loc];

in_t *in = reinterpret_cast<in_t *>(tl.addresses[0][tensor_loc]);
in += chunk_idx * chunk_size;

out_t *out = reinterpret_cast<out_t *>(tl.addresses[1][tensor_loc]);
out += chunk_idx * chunk_size;

n -= chunk_idx * chunk_size;

bool finite = true;
in_t r_in[ILP];
out_t r_out[ILP];

// to make things simple, we put aligned case in a different code path
if (n % ILP == 0 && chunk_size % ILP == 0 && is_aligned(in) && is_aligned(out)) {
  for (int i_start = threadIdx.x; i_start * ILP < n && i_start * ILP < chunk_size;
       i_start += blockDim.x) {
    // load
    load_store(r_in, in, 0, i_start);
#pragma unroll
    for (int ii = 0; ii < ILP; ii++) {
      r_out[ii] = static_cast<float>(r_in[ii]) * scale;
      finite = finite && isfinite(static_cast<float>(r_in[ii]));
    }
    // store
    load_store(out, r_out, i_start, 0);
  }
} else {
  // Non-divergent exit condition for __syncthreads, not necessary here
  for (int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * ILP) {
#pragma unroll
    for (int ii = 0; ii < ILP; ii++) {
      r_in[ii] = 0.f;
      int i = i_start + threadIdx.x + ii * blockDim.x;
      if (i < n && i < chunk_size) r_in[ii] = in[i];
    }
    // From a pure memory dependency perspective, there's likely no point unrolling
    // the write loop, since writes just fire off once their LDGs arrive.
    // Put another way, the STGs are dependent on the LDGs, but not on each other.
    // There is still compute ILP benefit from unrolling the loop though.
#pragma unroll
    for (int ii = 0; ii < ILP; ii++) {
      r_out[ii] = static_cast<float>(r_in[ii]) * scale;
      finite = finite && isfinite(static_cast<float>(r_in[ii]));
    }
#pragma unroll
    for (int ii = 0; ii < ILP; ii++) {
      int i = i_start + threadIdx.x + ii * blockDim.x;
      if (i < n && i < chunk_size) out[i] = r_out[ii];
    }
  }
}
if (!finite)
  *is_infinite_gmem = 1;  // Blindly fire off a write. These will race but that's ok.
is_infinite_gmem not set when scale itself is non-finite
ScalePtrFunctor reads the scale from GPU memory at line 113 (float scale = *scale_ptr;) but never checks whether that value is finite. The finite flag only tracks isfinite(r_in[ii]), so if *scale_ptr is NaN or Inf:
- A finite input r_in[ii] multiplied by Inf or NaN produces a non-finite output.
- Yet finite = finite && isfinite(r_in[ii]) remains true.
- is_infinite_gmem is never set, and the caller has no way to detect the overflow.
With the scalar-float variant (ScaleFunctor), callers always hold the scale value on the host and can validate it before the kernel launch. With the new tensor variant, the scale lives in GPU memory and there is no equivalent host-side check. This creates a silent correctness hole where a non-finite loss-scale tensor can corrupt the parameter tensors without triggering the overflow flag.
Consider including the scale in the finiteness check:

float scale = *scale_ptr;
bool finite = isfinite(scale);  // seed finite with scale validity

or alternatively check the output value:

finite = finite && isfinite(static_cast<float>(r_out[ii]));
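A standalone illustration of the first option, using a hypothetical scale_elements function rather than the PR's functor; it shows how seeding the flag makes a NaN/Inf scale trip the overflow flag even when all inputs are clean.

// Hypothetical, simplified device function (not PR code).
__device__ void scale_elements(const float *scale_ptr, const float *in, float *out,
                               int n, volatile int *is_infinite_gmem) {
  float scale = *scale_ptr;
  bool finite = isfinite(scale);  // non-finite scale flags overflow immediately
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    out[i] = in[i] * scale;
    finite = finite && isfinite(in[i]);
  }
  if (!finite) *is_infinite_gmem = 1;  // racy but idempotent, as in the PR's kernel
}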
/te-ci
// I'd like this kernel to propagate infs/nans.
// if(*noop_gmem == 1)
//   return;
Stale variable name in commented-out guard
The commented-out guard references *noop_gmem, which was the old parameter name when this comment lived inside ScaleFunctor::operator(). After the refactoring that moved this logic into scale_chunk, the actual parameter is is_infinite_gmem. Any developer who later decides to enable this guard will use the wrong identifier and get a compile error.
Suggested change:
  // I'd like this kernel to propagate infs/nans.
- // if(*noop_gmem == 1)
+ // if(*is_infinite_gmem == 1)
  //   return;
…DIA#2594)

* Initial commit to pass scale as Tensor for multi_tensor_scale op
  Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
  Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Enable capturable mode for optimizer if store_param_remainders is passed but not actually enabled
  Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Revert "Enable capturable mode for optimizer if store_param_remainders is passed but not actually enabled"
  This reverts commit 74a9bcc.
  Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Apply suggestion from @greptile-apps[bot]
  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
  Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Change noop_flag to is_infinite
  Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
  Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Update transformer_engine/pytorch/csrc/extensions.h
  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove duplication
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add test for scale tensor cuda
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>