
Simplify the operand layout support of conv2d and pooling 2d operations #324

Description

huningxin (Contributor)

In the existing WebNN spec, conv2d supports two input operand layouts defined by MLInputOperandLayout and four filter operand layouts defined by MLConv2dFilterOperandLayout.

enum MLInputOperandLayout {
  "nchw",
  "nhwc"
};

enum MLConv2dFilterOperandLayout {
  "oihw",
  "hwio",
  "ohwi",
  "ihwo"
};

This may make the implementation more complicated, especially if a native ML framework or OS API doesn't support some of these layouts. If a layout is unsupported, the implementation may need to insert transpose operations into the graph around the conv2d operation to convert the unsupported layout to a supported one. This can easily lead to an inefficient graph representation with redundant transpose operations. Alternatively, the implementation may need to optimize the graph with techniques such as "transpose sink", which requires a more complex implementation. This issue was raised in a Chromium CL review.

To simplify the implementation, the proposal is to reduce the supported operand layouts, for example to keep just the default one. Because WebNN supports the transpose operation, layout adaptation and graph-level optimization can be handled by ML frameworks, which usually already support such functionality (see the sketch below).

Thanks @wacky6 for this idea.
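
For illustration, here is a minimal sketch of that framework-side adaptation. It assumes the MLGraphBuilder transpose() and conv2d() methods as specified today, and assumes WebNN kept only the current defaults ("nchw" input, "oihw" filter); the function and operand names are made up for this example.

// Sketch only: adapt NHWC/HWIO operands to a WebNN that accepts just NCHW/OIHW.
// nhwcInput has shape [N, H, W, C]; hwioFilter has shape [H, W, I, O].
function conv2dFromNhwc(builder, nhwcInput, hwioFilter, options = {}) {
  const nchwInput = builder.transpose(nhwcInput, { permutation: [0, 3, 1, 2] });   // NHWC -> NCHW
  const oihwFilter = builder.transpose(hwioFilter, { permutation: [3, 2, 0, 1] }); // HWIO -> OIHW
  const nchwOutput = builder.conv2d(nchwInput, oihwFilter, options);
  return builder.transpose(nchwOutput, { permutation: [0, 2, 3, 1] });             // NCHW -> NHWC
}

Whether these extra transposes stay in the compiled graph or get folded away would then be up to the framework and the implementation, which is the trade-off discussed in the comments below.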

Activity

anssiko (Member) commented on Mar 23, 2023

This issue was discussed at the WebML WG Teleconference – 16 March 2023. Summary: Awaits further implementation feedback.

fdwr (Collaborator) commented on Mar 23, 2023

Picking just one preferred layout in WebNN could make life easier for the calling framework and the underlying backend implementation, or it could make it harder for both:

  • If WebNN supports both layouts, then the calling framework can trivially pass on the data as-is. The backend can then bracket affected operators with transposes on entry/exit if needed (note some backends support both layouts, and so there is no additional work there), but since adjacent transposes cancel out anyway (similar to how bracketed quantize/dequantize operators wrapping another operator cancel out), I wager that most internal transposes will disappear, leaving potentially just transposes on the graph edges in/out (a sketch of the cancellation check follows this list). The backend has full graph visibility by build() time and should be able to see and collapse any such adjacent transposes, and only the backend has enough information to select the right approach performantly.
  • If WebNN removes support and only accepts one preferred layout, then calling frameworks with a different existing convention will need to insert the transposes before calling WebNN; and lower-level backends (if their convention differs from WebNN's preferred one) may also need either to strip out any unnecessary transposes or, conversely, to add transposes for performance.
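
To make the "adjacent transposes cancel out" point concrete, here is a small plain-JavaScript sketch of the check a backend could run when collapsing back-to-back transposes; composePermutations and isIdentity are illustrative helpers, not part of any API.

// Applying permutation permA and then permB is equivalent to one transpose whose
// permutation is composed[i] = permA[permB[i]].
function composePermutations(permA, permB) {
  return permB.map(i => permA[i]);
}

function isIdentity(perm) {
  return perm.every((value, index) => value === index);
}

// An NHWC->NCHW transpose followed by an NCHW->NHWC transpose collapses to a no-op:
const toNchw = [0, 3, 1, 2];
const toNhwc = [0, 2, 3, 1];
console.log(isIdentity(composePermutations(toNchw, toNhwc))); // true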

I prefer accepting both (keeping the current spec), but it would be informative to see a holistic table of each major framework's preferred layout and each backend's preferred layout.

[updated...] Table added (✅ == default):

API | NCHW | NHWC | Notes
CoreML | NCHW ✅ | - | -
ONNX | NCHW ✅ | - | -
PyTorch | NCHW ✅ | NHWC | NCHW is the default; NHWC is supported more recently via torch.memory_format.channels_last
TensorFlow & TensorFlow.js | NCHW | NHWC ✅ | data_format defaults to NHWC (channelsLast in TensorFlow.js)
TFLite | - | NHWC ✅ | -
Intel oneDNN (dnnl_format_tag_t) | NCHW ✅ | NHWC | "NCHW is the recommended data layout"
cuDNN (cudnnTensorFormat_t) | NCHW ✅ | NHWC | NCHW is the default order for cudnnSetTensor4dDescriptor
DirectML | NCHW ✅ | NHWC | NCHW is the default; NHWC is supported via explicit strides
XNNPACK | - | NHWC ✅ | -
NVIDIA tensor cores | NCHW | NHWC ✅ | "Tensor Cores are fastest when input tensors are laid out in NHWC ... NCHW layouts can still be operated on by Tensor Cores, but include some overhead due to automatic transpose operations"
anssiko (Member) commented on Mar 24, 2023

@fdwr thanks for sharing your preference and the supporting details.

As an aside, I encourage incorporating considerations such as this into the specification informatively, alongside the normative prose. It helps explain the specification to people who look at it without the full context that active WG participants have.

wacky6 commented on Apr 11, 2023

Layout support comes up in the MLOperand implementation that allows data shape broadcasting. https://chromium-review.googlesource.com/c/chromium/src/+/4396686/comment/f02acaeb_3c2795f2/

Supporting both channel-first and channel-last layouts will complicate the spec steps and the implementation, because the current NumPy-style broadcast rule implies right-most-first broadcasting.

Example: caller wants to apply a per-channel multiplication.

  1. lhs is nchw{1,3,4,4}; caller provides rhs {3}. This will fail under the current broadcast rule, so the caller will need to broadcast rhs itself (see the sketch after this list).
  2. lhs is nhwc{1,4,4,3}; caller provides rhs {3}. Works as intended.
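
For case 1, the caller would have to reshape rhs into a broadcast-compatible shape before the element-wise multiply. A minimal sketch, assuming MLGraphBuilder's reshape() and mul() and pre-existing lhs/rhs operands:

// Sketch only: per-channel scale with an NCHW lhs of shape [1, 3, 4, 4] and rhs of shape [3].
// Under right-most-first (NumPy-style) broadcasting, rhs must become [1, 3, 1, 1] first.
const rhsNchw = builder.reshape(rhs, [1, 3, 1, 1]);
const scaled = builder.mul(lhs, rhsNchw); // broadcasts over N, H and W

// With an NHWC lhs of shape [1, 4, 4, 3], the same rhs of shape [3] broadcasts as-is:
// const scaledNhwc = builder.mul(lhs, rhs);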

How best to support case 1 (without requiring the caller to reshape) isn't clear. Some questions might help the decision:

  • How do different backends implement broadcasting? Do they silently create a broadcast array and store it in memory before computation, or do they store the original array and broadcast it on the fly during computation?
  • What's the cost of doing smart broadcasting based on the operand's layout, or of an option attribute that specifies broadcast order (channel-first, channel-last) where broadcast operations are applicable?
  • What's the cost (overhead) vs. benefit (simple spec and implementation) of transposing?

I have a slight preference for supporting only one layout (NHWC to be precise).

  • The obvious benefit: the spec and implementation will be much simpler. This is beneficial for browser vendor adoption (i.e. implementing the API, making sure it conforms to the spec, and being less likely to produce bugs).
  • API user benefit: a consistent data layout usage. One less thing to check for debugging ("layout + data_shape" becomes "data_shape only").
  • Interop with other Web APIs: existing Web APIs that deal with images use channel-last ordering (like ImageData), so using NHWC for WebNN provides better interop (pass the data as-is; no conversion needed for developers; see the sketch after this list).
  • I agree that if we choose only one layout the caller would need to convert. I think most of the overhead is incurred at build() time (a one-off cost), and a very small overhead is incurred at compute() (converting to the right layout before passing data to the backend, and converting the result back to what's defined in the spec; probably negligible?).
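
As a rough illustration of the ImageData interop point above (plain JavaScript; toNhwcFloat32 and toNchwFloat32 are made-up helper names): ImageData.data is already channel-last, so producing an NHWC buffer is a straight copy, while an NCHW buffer needs a strided re-pack.

// imageData is an ImageData; its .data is RGBA bytes in channel-last (HWC) order.
function toNhwcFloat32(imageData) {
  // Shape [1, H, W, 4]: same element order as ImageData.data, so a plain copy suffices.
  return Float32Array.from(imageData.data, v => v / 255);
}

function toNchwFloat32(imageData) {
  // Shape [1, 4, H, W]: needs re-packing into one plane per channel.
  const { width, height, data } = imageData;
  const out = new Float32Array(4 * height * width);
  for (let c = 0; c < 4; ++c) {
    for (let y = 0; y < height; ++y) {
      for (let x = 0; x < width; ++x) {
        out[c * height * width + y * width + x] = data[(y * width + x) * 4 + c] / 255;
      }
    }
  }
  return out;
}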
added a commit that references this issue on May 9, 2023
added a commit that references this issue on May 16, 2023
wacky6 commented on Jul 20, 2023

I want to share a data point.

I was playing with Real-ESRGAN today and found that, with torch.compile, the channels_last layout is faster than the channels_first layout on my NVIDIA A4000.

I'm not sure how well this transfers to other models (ESRGAN is heavily based on CNNs + residual connections), though.

I wonder if we should benchmark channel ordering on different hardware (i.e. vendors other than NVIDIA could optimize for channels_first).

Or maybe this won't matter if the graph builder (or rather the optimizer) is "clever" enough.

huningxin (Contributor, Author) commented on Aug 15, 2023

There is a security perspective from @quidity (thanks, Alex!) in the Chromium CL-4653303 ("WebNN: Define conv2d operator in mojo") review.

Alex mentioned:

enum Conv2dFilterOperandLayout { kOihw, kHwio, kOhwi, kIhwo };
this feels very error prone - is there a better way to represent the layout of the data at this stage or restrict the ways that is presented to the privileged process?

wacky6 commented on Aug 15, 2023

FWIW, another way to tackle layout is to tell the implementation which layout should be used, like: https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html

This could be a hint to GraphBuilder.build() (right before producing a graph that can be passed to compute()).
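
A sketch of what such a hint could look like; this is purely hypothetical (neither the options argument nor preferredLayout exists in the spec) and is shown only to illustrate the idea:

// Hypothetical: a layout hint passed at build() time; 'output' is an MLOperand built earlier.
const graph = await builder.build(
  { output },
  { preferredLayout: 'nhwc' } // backend may transpose internally if its native layout differs
);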

--

Taking a step back, I still strongly prefer a single unified layout (i.e. NHWC) that's applied throughout the MLGraphBuilder methods, letting the backend (e.g. DMLImpl) change the layout (if necessary) before sending work to the hardware.

18 remaining items
