
[WIP] Feature request: Implement GroupNormalization-21 #25060


Draft PR: wants to merge 4 commits into main

Conversation

Copilot AI (Contributor) commented Jun 14, 2025

Plan to implement ONNX GroupNormalization-21 support

After analyzing the codebase, I found that:

  1. GroupNormalization is currently only implemented as a contrib op (com.microsoft.GroupNorm)
  2. The standard ONNX GroupNormalization operator is NOT implemented in the CPU provider
  3. It's supported in CoreML provider, indicating it's a valid ONNX operation
  4. The existing test uses opset 18 but only tests CoreML provider

Implementation Checklist:

  • Research ONNX GroupNormalization specification details
  • Create CPU GroupNormalization kernel implementation
  • Register GroupNormalization operation in CPU provider for opsets 18-21
  • Create or adapt GroupNormalization implementation logic
  • Update tests to cover the CPU provider and opset 21 (see the test sketch after the Approach list below)
  • Test and validate the implementation

Approach:

  • Follow existing LayerNorm/InstanceNorm patterns for CPU provider implementation
  • Leverage existing normalization helper code where possible
  • Implement minimal changes to add standard ONNX support alongside existing contrib op
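For the test item in the checklist above, a minimal opset-21 CPU test might look like the sketch below. It assumes ORT's OpTester helper from the provider test utilities; the shapes and expected values are illustrative only.

```cpp
#include "gtest/gtest.h"
#include "test/providers/provider_test_utils.h"

namespace onnxruntime {
namespace test {

TEST(GroupNormalizationOpTest, BasicFloatOpset21) {
  OpTester test("GroupNormalization", 21);
  test.AddAttribute<int64_t>("num_groups", 2);
  test.AddAttribute("epsilon", 1e-5f);

  // N=1, C=4, H=1, W=2: two groups of two channels, four values per group.
  test.AddInput<float>("X", {1, 4, 1, 2},
                       {1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f, 8.f});
  test.AddInput<float>("scale", {4}, {1.f, 1.f, 1.f, 1.f});
  test.AddInput<float>("bias", {4}, {0.f, 0.f, 0.f, 0.f});

  // Each group has mean 2.5 (resp. 6.5) and variance 1.25, so the normalized
  // values are approximately +/-0.44721 and +/-1.34164.
  test.AddOutput<float>("Y", {1, 4, 1, 2},
                        {-1.34164f, -0.44721f, 0.44721f, 1.34164f,
                         -1.34164f, -0.44721f, 0.44721f, 1.34164f});
  test.Run();
}

}  // namespace test
}  // namespace onnxruntime
```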

Fixes #24538.



@justinchuby (Contributor)

This is the op spec for your reference:

GroupNormalization - 21
Version
name: [GroupNormalization (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#GroupNormalization)

domain: main

since_version: 21

function: True

support_level: SupportType.COMMON

shape inference: False

This version of the operator has been available since version 21.

Summary
A GroupNormalization function. Carries out group normalization as described in the paper https://arxiv.org/abs/1803.08494

This operator transforms input according to

y = scale * (x - mean) / sqrt(variance + epsilon) + bias,
where the mean and variance are computed per instance per group of channels, and scale and bias should be specified for each channel. The number of channels C should be divisible by num_groups so that there are an equal number of channels per group.

The overall computation has two stages: the first stage normalizes the elements to have zero mean and unit variance for each instance in each group, and the second stage scales and shifts the results of the first stage. The floating-point precision used in the first stage is determined by the stash_type attribute. For example, if stash_type is 1, the operator casts all input variables to 32-bit float, performs the computation, and finally casts the normalized results back to the original type of X. The second stage does not depend on stash_type.

When the number of groups is the same as the number of channels, this operator is equivalent to InstanceNormalization. When there is only one group, this operator is equivalent to LayerNormalization.
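To make the two stages concrete, here is a minimal single-threaded reference sketch of the computation in plain C++ (NCHW flattened to (N, C, spatial); independent of any ORT kernel, with the statistics accumulated in double for numerical robustness):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Reference sketch of GroupNormalization: X has shape (N, C, spatial) in
// row-major order, scale/bias have shape (C), and the statistics are computed
// per (instance, group). Assumes C % num_groups == 0, per the spec.
std::vector<float> GroupNormRef(const std::vector<float>& X,
                                int64_t N, int64_t C, int64_t spatial,
                                int64_t num_groups,
                                const std::vector<float>& scale,
                                const std::vector<float>& bias,
                                float epsilon = 1e-5f) {
  const int64_t cpg = C / num_groups;        // channels per group
  const int64_t group_size = cpg * spatial;  // elements behind each statistic
  std::vector<float> Y(X.size());
  for (int64_t n = 0; n < N; ++n) {
    for (int64_t g = 0; g < num_groups; ++g) {
      const int64_t base = (n * C + g * cpg) * spatial;
      // Stage 1: mean and variance over this instance's group.
      double sum = 0.0, sq_sum = 0.0;
      for (int64_t i = 0; i < group_size; ++i) {
        const double v = X[base + i];
        sum += v;
        sq_sum += v * v;
      }
      const double mean = sum / group_size;
      const double inv_std = 1.0 / std::sqrt(sq_sum / group_size - mean * mean + epsilon);
      // Stage 2: y = scale * (x - mean) / sqrt(variance + epsilon) + bias,
      // with scale and bias indexed by the global channel.
      for (int64_t c = 0; c < cpg; ++c) {
        const int64_t ch = g * cpg + c;
        for (int64_t s = 0; s < spatial; ++s) {
          const int64_t idx = base + c * spatial + s;
          Y[idx] = scale[ch] * static_cast<float>((X[idx] - mean) * inv_std) + bias[ch];
        }
      }
    }
  }
  return Y;
}
```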

Attributes
epsilon - FLOAT (default is '1e-05'):

The epsilon value to use to avoid division by zero.

num_groups - INT (required):

The number of groups of channels. It should be a divisor of the number of channels C.

stash_type - INT (default is '1'):

The floating-point precision used in stage one of the computation.

Inputs
X (heterogeneous) - T:

Input data tensor. Dimensions for image cases are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and width of the data. Statistics are computed for every group of channels over C, H, and W. For non-image cases, the dimensions are in the form of (N x C x D1 x D2 ... Dn).

scale (heterogeneous) - T:

Scale tensor of shape (C).

bias (heterogeneous) - T:

Bias tensor of shape (C).

Outputs
Y (heterogeneous) - T:

The output tensor of the same shape as X.

Type Constraints
T in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ):

Constrain input and output types to float tensors.

[GroupNormalization - 18 vs 21](https://onnx.ai/onnx/operators/text_diff_GroupNormalization_18_21.html)

Copilot AI (Contributor Author) commented Jun 14, 2025

@justinchuby Unfortunately I hit an unexpected error while processing your comment. I've automatically reported this to GitHub.

You can ask me to try again later by mentioning me in a new comment.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 5e4b9df7-12c0-4e39-8582-7eec84f30bb3

Sorry for the inconvenience!

Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
@justinchuby (Contributor)

@copilot try again

Copilot AI (Contributor Author) commented Jun 14, 2025

> @copilot try again

I've implemented the GroupNormalization-21 specification based on the details you provided. Key changes include:

  1. Proper stash_type implementation: When stash_type=1 (default), the first stage (mean/variance calculation) now uses float32 precision as specified, then casts back to original type for the second stage (scale/bias application).

  2. Added BFloat16 support: Extended type support to include tensor(bfloat16) as required by the specification.

  3. Fixed computation logic: Updated the implementation to exactly match the ONNX spec formula: y = scale * (x - mean) / sqrt(variance + epsilon) + bias

The implementation now properly handles the two-stage computation as described in the specification, where stage one precision is controlled by stash_type and stage two uses original type precision.
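Roughly, that two-stage split looks like the following sketch for one (instance, group) slice (a hypothetical helper, not the PR's actual kernel code): with stash_type=1 the statistics are computed in float regardless of T, and the result is cast back to T.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical sketch for one (instance, group) slice: x and y point at
// cpg * spatial contiguous elements, scale/bias at this group's cpg channels.
// T is any type convertible to/from float (float, double, a half type, ...).
template <typename T>
void GroupNormSliceStash32(const T* x, T* y, int64_t cpg, int64_t spatial,
                           const T* scale, const T* bias, float epsilon) {
  const int64_t group_size = cpg * spatial;
  // Stage 1 (stash_type = 1): accumulate the statistics in float, whatever T is.
  float sum = 0.f, sq_sum = 0.f;
  for (int64_t i = 0; i < group_size; ++i) {
    const float v = static_cast<float>(x[i]);
    sum += v;
    sq_sum += v * v;
  }
  const float mean = sum / group_size;
  const float inv_std = 1.f / std::sqrt(sq_sum / group_size - mean * mean + epsilon);
  // Stage 2: apply the per-channel scale and bias, casting back to T.
  for (int64_t c = 0; c < cpg; ++c) {
    for (int64_t s = 0; s < spatial; ++s) {
      const int64_t i = c * spatial + s;
      const float v = static_cast<float>(x[i]);
      y[i] = static_cast<T>(static_cast<float>(scale[c]) * (v - mean) * inv_std +
                            static_cast<float>(bias[c]));
    }
  }
}
```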

@justinchuby (Contributor)

@copilot do not try to build the project when you are iterating, because the build is going to take too long

@justinchuby (Contributor)

@copilot Also try to register the cuda kernels if you can

…t16 support

Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Copilot finished work on behalf of justinchuby June 14, 2025 22:17
Copilot AI (Contributor Author) commented Jun 14, 2025

> @copilot do not try to build the project when you are iterating, because the build is going to take too long

Understood, I'll avoid building during iteration to save time. I'll focus on the code implementation and skip build validation during development.

> @copilot Also try to register the cuda kernels if you can

I've added CUDA kernel support for GroupNormalization. Created:

  • CUDA GroupNorm implementation (group_norm.h/cc)
  • CUDA kernel implementation (group_norm_impl.h/cu)
  • Registered kernels in CUDA execution provider for opsets 18-21
  • Added support for float, double, MLFloat16, and BFloat16 types

The implementation follows the same two-stage computation pattern as the CPU version with proper stash_type support.
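For illustration only (this is a sketch, not the PR's group_norm_impl.cu), a naive CUDA kernel following the same two-stage pattern could assign one thread block per (instance, group) slice and reduce the float statistics in shared memory:

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Naive sketch: one block per (instance, group) slice, launched as
//   GroupNormKernel<<<N * num_groups, 256, 0, stream>>>(...);
// T must be convertible to/from float (float, __half, ...).
template <typename T>
__global__ void GroupNormKernel(const T* x, T* y, const T* scale, const T* bias,
                                int num_groups, int64_t channels_per_group,
                                int64_t spatial, float epsilon) {
  const int64_t group_size = channels_per_group * spatial;
  const int g = blockIdx.x % num_groups;  // group index within this instance
  const T* xg = x + static_cast<int64_t>(blockIdx.x) * group_size;
  T* yg = y + static_cast<int64_t>(blockIdx.x) * group_size;

  // Stage 1: block-wide float reduction of sum and sum-of-squares (stash_type = 1).
  __shared__ float s_sum[256];  // assumes blockDim.x == 256 (a power of two)
  __shared__ float s_sq[256];
  float sum = 0.f, sq = 0.f;
  for (int64_t i = threadIdx.x; i < group_size; i += blockDim.x) {
    const float v = static_cast<float>(xg[i]);
    sum += v;
    sq += v * v;
  }
  s_sum[threadIdx.x] = sum;
  s_sq[threadIdx.x] = sq;
  __syncthreads();
  for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) {
      s_sum[threadIdx.x] += s_sum[threadIdx.x + stride];
      s_sq[threadIdx.x] += s_sq[threadIdx.x + stride];
    }
    __syncthreads();
  }
  const float mean = s_sum[0] / group_size;
  const float inv_std = rsqrtf(s_sq[0] / group_size - mean * mean + epsilon);

  // Stage 2: normalize and apply the per-channel scale and bias.
  for (int64_t i = threadIdx.x; i < group_size; i += blockDim.x) {
    const int64_t ch = g * channels_per_group + i / spatial;  // global channel
    const float v = static_cast<float>(xg[i]);
    yg[i] = static_cast<T>(static_cast<float>(scale[ch]) * (v - mean) * inv_std +
                           static_cast<float>(bias[ch]));
  }
}
```

A production kernel would typically vectorize the loads and use warp shuffles for the reduction, but the overall structure would be the same.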

Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Copilot finished work on behalf of justinchuby June 14, 2025 22:24
@github-actions bot left a comment


You can commit the suggested changes from lintrunner.

Comment on lines +12 to +16

// Opset 18-20 registrations (without stash_type)
#define REGISTER_ONNX_KERNEL_TYPED_VERSIONED(T) \
ONNX_CPU_OPERATOR_VERSIONED_TYPED_KERNEL(GroupNormalization, 18, 20, T, \
KernelDefBuilder() \
.TypeConstraint("T", DataTypeImpl::GetTensorType<T>()), \

Suggested change from lintrunner: a clang-format whitespace fix; the macro is unchanged apart from alignment.
Comment on lines +19 to +24

// Opset 21+ registrations (with stash_type)
#define REGISTER_ONNX_KERNEL_TYPED_21(T) \
ONNX_CPU_OPERATOR_TYPED_KERNEL(GroupNormalization, 21, T, \
KernelDefBuilder() \
.TypeConstraint("T", DataTypeImpl::GetTensorType<T>()), \
GroupNorm);

Suggested change from lintrunner: a clang-format whitespace fix; the macro is unchanged apart from alignment.
Comment on lines +38 to +40

ORT_ENFORCE(op_kernel_info.GetAttr("num_groups", &num_groups_).IsOK());

// stash_type is optional in opset 21, default to 1 (float32)

Suggested change from lintrunner: drop the blank line:

ORT_ENFORCE(op_kernel_info.GetAttr("num_groups", &num_groups_).IsOK());
// stash_type is optional in opset 21, default to 1 (float32)
Comment on lines +54 to +56

template<typename T>
Status GroupNorm::ComputeImpl(OpKernelContext* context, const Tensor* X, const Tensor* scale, const Tensor* bias) const {

Suggested change from lintrunner: add a space after template:

template <typename T>
Status GroupNorm::ComputeImpl(OpKernelContext* context, const Tensor* X, const Tensor* scale, const Tensor* bias) const {
Comment on lines +59 to +66

const int64_t C = x_shape[1]; // channels

// Validate that channels are divisible by num_groups
ORT_RETURN_IF_NOT(C % num_groups_ == 0, "Number of channels must be divisible by num_groups");

const int64_t channels_per_group = C / num_groups_;

// Calculate spatial dimensions (H*W*... for everything after batch and channel dims)

Suggested change from lintrunner: drop the blank lines:

const int64_t C = x_shape[1]; // channels
// Validate that channels are divisible by num_groups
ORT_RETURN_IF_NOT(C % num_groups_ == 0, "Number of channels must be divisible by num_groups");
const int64_t channels_per_group = C / num_groups_;
// Calculate spatial dimensions (H*W*... for everything after batch and channel dims)
Comment on lines +19 to +24

// Opset 21+ registrations (with stash_type)
#define REGISTER_CUDA_KERNEL_TYPED_21(T) \
ONNX_OPERATOR_TYPED_KERNEL_EX(GroupNormalization, kOnnxDomain, 21, T, kCudaExecutionProvider, \
(*KernelDefBuilder::Create()) \
.TypeConstraint("T", DataTypeImpl::GetTensorType<T>()), \
GroupNorm<T, float>);

Suggested change from lintrunner: a clang-format whitespace fix; the macro is unchanged apart from alignment.
Comment on lines +40 to +44

epsilon_ = tmp_epsilon;

ORT_ENFORCE(op_kernel_info.GetAttr("num_groups", &num_groups_).IsOK());

// stash_type is optional in opset 21, default to 1 (float32)

Suggested change from lintrunner: drop the blank lines:

epsilon_ = tmp_epsilon;
ORT_ENFORCE(op_kernel_info.GetAttr("num_groups", &num_groups_).IsOK());
// stash_type is optional in opset 21, default to 1 (float32)
Comment on lines +52 to +54

typedef typename ToCudaType<T>::MappedType CudaT;

// Inputs

Suggested change from lintrunner: drop the blank line:

typedef typename ToCudaType<T>::MappedType CudaT;
// Inputs
Comment on lines +65 to +70

const int64_t C = x_shape[1]; // channels

// Validate that channels are divisible by num_groups
ORT_RETURN_IF_NOT(C % num_groups_ == 0, "Number of channels must be divisible by num_groups");

// Calculate spatial dimensions (H*W*... for everything after batch and channel dims)

Suggested change from lintrunner: drop the blank lines:

const int64_t C = x_shape[1]; // channels
// Validate that channels are divisible by num_groups
ORT_RETURN_IF_NOT(C % num_groups_ == 0, "Number of channels must be divisible by num_groups");
// Calculate spatial dimensions (H*W*... for everything after batch and channel dims)
Comment on lines +74 to +76

}

Tensor* Y = ctx->Output(0, x_shape);

Suggested change from lintrunner: drop the blank line:

}
Tensor* Y = ctx->Output(0, x_shape);
@@ -0,0 +1,154 @@
// Copyright (c) Microsoft Corporation. All rights reserved.

Check warning (Code scanning / lintrunner): CLANGFORMAT/format. See https://clang.llvm.org/docs/ClangFormat.html. Run lintrunner -a to apply this patch.

@@ -0,0 +1,28 @@
// Copyright (c) Microsoft Corporation. All rights reserved.

Check warning (Code scanning / lintrunner): CLANGFORMAT/format. See https://clang.llvm.org/docs/ClangFormat.html. Run lintrunner -a to apply this patch.

@@ -0,0 +1,94 @@
// Copyright (c) Microsoft Corporation. All rights reserved.

Check warning (Code scanning / lintrunner): CLANGFORMAT/format. See https://clang.llvm.org/docs/ClangFormat.html. Run lintrunner -a to apply this patch.
Successfully merging this pull request may close these issues:

Feature request: Implement GroupNormalization-21