Merge/upstream main 2026 05 14#17

Merged
pacowong merged 2452 commits into
mainfrom
merge/upstream-main-2026-05-14
May 15, 2026


@pedrovgs pedrovgs commented May 14, 2026

Description

This PR merges 2,297 commits from the official Microsoft onnxruntime repository (up to May 13, 2026) into our Goodnotes fork, which was last synced in February 2025. Four merge conflicts were resolved: Paco's debug logging in cmake/CMakeLists.txt and cmake/onnxruntime.cmake was dropped in favor of upstream's cleaner code, the Mac Catalyst compile flags in cmake/onnxruntime_mlas.cmake were preserved as they're needed for our macabi builds, and build_apple_framework.py was merged to adopt upstream's pathlib refactor while keeping our macabi sysroot special-case handling. The update spans 4,754 files with ~507K insertions and ~156K deletions.

If you want to check the difference between our fork and the current state of onnx, you can do it here.

Motivation and Context

Sync this fork with the official repository so that we get a version of ONNX Runtime that is compatible with the operators used by the scribble-to-erase model and that also runs on Mac Catalyst.

chilo-ms and others added 30 commits April 8, 2026 16:43
### Description
To support the model package design, one of the goals for ORT is to
automatically select the most suitable compiled EPContext binary from a
collection of precompiled variants based on the EP, provider options,
metadata, and available devices.

This PR adds first-phase model package support to ORT. Follow-up PRs
may come later.

A model package is a collection of models, binaries, and metadata files
organized in a hierarchically structured directory.
The directory structure is not yet finalized, so the following is just a
simple example of a model package directory:

````
<model>.ortpackage/
├── manifest.json
├── pipeline.json
├── configs/
│   ├── genai_config.json
│   └── chat_template.jinja
└── models/
    └── model_name/
        ├── metadata.json
        │     Contains general information on the component model,
        │     and specific information about each model variant
        │     such as data types, quantization algo, EP, etc. that
        │     is updated on add/remove of model variant
        ├── shared_weights/ (shared weights from all variants)
        │   ├── <checksum of weights file A>/
        │   │   └── model.data
        │   ├── <checksum of weights file B>/
        │   │   └── model.data
        │   └── ...
        ├── base model/
        │   └── model.onnx
        ├── variant A /
        │   ├── optimized model.onnx (contains EPContext nodes)
        │   └── [Compilation artifacts]
        └── variant B /
            ├── optimized model.onnx (contains EPContext nodes)
            └── [Compilation artifacts]
````
#### Spec and Format:
See
[here](https://github.com/microsoft/onnxruntime/blob/07e55627e75da24099c582331a0f786090e6382a/onnxruntime/core/session/model_package/README.md)
#### Definitions:

- Model Package 
   - A model package defines the overall logical ‘model’ 
   - A model package contains one or more ‘component models’ 

- Component Model 
  - A component model comprises one or more ‘model variants’ 

- Model Variant 
  - A ‘model variant’ is a single ONNX or ORT format model

#### manifest.json and metadata.json

A manifest.json may look like:

````
{ 
    "model_name":  <logical_model_name>,
    "component_models": [
        <component_model_name_1>,
        <component_model_name_2>
    ]
}
````

A metadata.json for a component model may look like:
````
{ 
    "component_model_name":  <component_model_name_1>,
    "model_variants": {
         <variant_name_1>:  {
             "file": <ep_context_model_1 onnx file>,
             "constraints": {
                 "ep": <ep_name>,
                 "device": <device_type>,
                 "architecture": <hardware_architecture>
             }
         },
         <variant_name_2>:  {
             "file": <ep_context_model_2 onnx file>,
             "constraints": {
                 "ep": <ep_name>,
                 "device": <device_type>,
                 "architecture": <hardware_architecture>
             }
         }   
    }
}
````
#### Model Selection

The selection logic is implemented in `MatchesVariant()`, which
evaluates the following constraints:
(Note: A constraint refers to a value under the "constraints" field in
either manifest.json or metadata.json.)

- Check the `ep` constraint.
- Check the `device` constraint.
  - Some provider-bridge EPs may not implement
    `OrtEpFactory::GetSupportedDevices`, so ORT won't have the supported
    device information for them. In that case, ORT skips the device
    constraint validation for those EPs.
  - If a provider option contains a key related to device type, its value
    must match the device constraint, if any.
- Check the `ep_compatibility_info` constraint.
  - ORT does not directly evaluate the architecture constraint. Instead,
    it relies on the `ep_compatibility_info` constraint, which may encode
    architecture information if needed.
  - The `ep_compatibility_info` value is expected to match the EP
    compatibility string stored in the EPContext model metadata. (See
    `OrtEp::GetCompiledModelCompatibilityInfo()` for how this string is
    generated.)
  - The EP implementation of
    `EpFactory::ValidateCompiledModelCompatibilityInfo()` is responsible for
    validating the compatibility string against the target device (i.e.
    `OrtHardwareDevice`) and returning the compatibility result.
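
The constraint checks listed above can be sketched in a few lines of Python. This is only an illustration of the matching order, not ORT's `MatchesVariant()` implementation: the metadata content, the EP names, the function names, and the `is_compatible` callback (standing in for the EP's compatibility-string validation) are all hypothetical, and treating a missing constraint as a match is an assumption.

```python
import json

# Hypothetical metadata.json content; names and values are illustrative only.
METADATA = json.loads("""
{
  "component_model_name": "example_component",
  "model_variants": {
    "variant_a": {
      "file": "variant_a/model.onnx",
      "constraints": {"ep": "ExampleNpuEp", "device": "NPU"}
    },
    "variant_b": {
      "file": "variant_b/model.onnx",
      "constraints": {"ep": "ExampleGpuEp", "device": "GPU"}
    }
  }
}
""")


def matches_variant(constraints, ep_name, device_type=None, is_compatible=None):
    """Return True if the constraints accept the given EP and device.

    device_type=None models an EP whose supported devices are unknown,
    in which case the device check is skipped.
    """
    # 1. ep constraint (assumption: a missing constraint matches anything).
    if constraints.get("ep", ep_name) != ep_name:
        return False
    # 2. device constraint, skipped when the device is unknown.
    if device_type is not None and constraints.get("device", device_type) != device_type:
        return False
    # 3. ep_compatibility_info, delegated to a hypothetical EP callback.
    info = constraints.get("ep_compatibility_info")
    if info is not None and is_compatible is not None:
        return bool(is_compatible(info))
    return True


def select_variant(metadata, ep_name, device_type=None, is_compatible=None):
    """Return (variant_name, file) of the first matching variant, or None."""
    for name, variant in metadata["model_variants"].items():
        if matches_variant(variant.get("constraints", {}), ep_name, device_type, is_compatible):
            return name, variant["file"]
    return None
```

Note that the `architecture` constraint is deliberately not inspected here, mirroring the statement that ORT relies on `ep_compatibility_info` instead.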

#### Note

Check the unit test
[here](https://github.com/microsoft/onnxruntime/pull/27786/changes#diff-bfa4122a85543ae2d80bf4cf6d9f85248e51c2276a5956af32f9bd8c8983d23a)
to better understand how to use a model package.

#### Code Change

This pull request introduces significant enhancements to the execution
provider (EP) selection and management infrastructure in ONNX Runtime.
The main focus is on supporting more sophisticated device selection and
manifest-based model packaging, as well as refactoring provider
selection logic for modularity and future extensibility.

Key changes include:

- Introduction of model package context and manifest parsing to support
selecting model components based on device and EP constraints.
- Refactoring of the execution provider interface and related classes to
support multiple devices per provider.
- Modularization of EP/device selection, creation, and registration
logic in the provider policy context.

The most important changes are:

**Model Package Context and Manifest Support**
- Added new files `model_package_context.h` and
`model_package_context.cc` to implement manifest parsing, device/EP
constraint matching, and component selection logic for model packages.
This enables ONNX Runtime to select the most appropriate model variant
based on available hardware and EP configuration.
[[1]](diffhunk://#diff-006078879d52b421c973e2880c65db474aad6b21ad81ba69d387df8661bafeb2R1-R78)
[[2]](diffhunk://#diff-45c29f481077e424c8969dc2198a8b40ab5908cf3b0bbf25dbeaca3ec51935d5R1-R279)

**Execution Provider Interface Enhancements**
- Updated the `IExecutionProvider` class to support construction with a
list of `OrtEpDevice` pointers, and added a `GetEpDevices()` method to
retrieve the supported devices. This allows plugin and bridge EPs to
expose multiple devices.
[[1]](diffhunk://#diff-e15769e35b807986b812aae3ff7192269e171c5846b2ff4d8ec571ec8ed57aa4R87-R104)
[[2]](diffhunk://#diff-e15769e35b807986b812aae3ff7192269e171c5846b2ff4d8ec571ec8ed57aa4R203-R207)
- Updated plugin EP construction to pass the list of supported devices
to the base class.

**Provider Policy Context Refactoring**
- Refactored provider policy context logic to modularize device
ordering, device selection, telemetry logging, EP creation, and
registration. This includes splitting the monolithic
`SelectEpsForSession` into smaller methods: `OrderDevices`,
`SelectEpDevices`, `LogTelemetry`, `CreateExecutionProviders`,
`RegisterExecutionProviders`, and a new flow for model package-based EP
selection.
[[1]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0R53-R58)
[[2]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0L118-L156)
[[3]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0L225-R199)
[[4]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0R254-R365)

These changes collectively lay the groundwork for more flexible, robust,
and extensible device and EP selection in ONNX Runtime, especially in
scenarios involving packaged models with multiple variants and complex
hardware environments.

### Description
The test was originally added to test model selection based on the
"device type" provider option.
However, the provider-option check was later removed from the selection
logic, and the related test was not removed along with it.
### Description
Fix a possible out-of-bounds target of class ids in TreeEnsemble.

### Motivation and Context
Security issue.
…microsoft#27674)

This patch makes the `is_channels_last` parameter of `ComputeMatMul`
default to true and ignores it in `UseSplitK` when there is no `bias`.
### Description

Improve DeformConv op performance


### Motivation and Context

This PR consolidates a series of optimizations targeting the
`DeformConv` (Deformable Convolution) operator across both CPU and CUDA
execution providers.
* **For CPU:** The previous implementation suffered from bottlenecks due
to redundant computations, lack of vectorization in bilinear sampling,
and sub-optimal thread pool utilization. This overhaul redesigns the
memory layout and execution pipeline to maximize SIMD opportunities and
harden memory safety.
* **For GPU:** The batched GEMM operation previously relied on an
intermediate buffer and a custom scatter kernel to format the output,
which consumed extra memory and kernel launch overhead. This update
introduces a zero-copy approach.

---

#### 1. CPU Optimizations & Refactoring

The CPU execution path has been heavily refactored to minimize branching
in hot paths, maximize vectorization, and safely handle edge cases.

| Feature / Optimization | Description | Key Benefit |
| :--- | :--- | :--- |
| **AoSoA Bilinear Sampling Plan** | Replaced on-the-fly interpolation with a precomputed sampling plan using an 8-lane Array-of-Structures-of-Arrays (AoSoA) layout (`kPlanAoSoALanes`). | Perfectly aligns with 256-bit AVX2 vectors, enabling highly efficient SIMD unrolling during the `im2col` gathering phase. |
| **Kernel Metadata Caching** | Introduced `DeformConvKernelMetaCacheData` to cache static convolution geometry (e.g., `kH`, `kW`, `padding`, `dilation`). | Eliminates the O(kernel_size) overhead of reallocating and recomputing base offsets on every single `Compute()` step. |
| **Fast Math & Branchless Logic** | Implemented a custom `DeformConvFastFloor` and utilized an inverted bounds check with bitwise operations to evaluate all four corners simultaneously. | Removes expensive `std::floor` calls and unpredictable branches from the operator's hottest path. |
| **Enhanced Parallelization** | Flattened the bilinear sampling plan build tasks across spatial pixels. | Allows `concurrency::ThreadPool::TryParallelFor` to split fine-grained work effectively, drastically improving thread pool scaling. |
| **Hardened Bounds Checking** | Introduced compute-time bounds checks using `CheckedMulSizeT` and `CheckedBatchSpan`. | Ensures batch indexing and stride calculations stay within the addressable `size_t` range, preventing integer overflow vulnerabilities. |
| **Bias Addition Refactoring** | Refactored bias addition to avoid expensive `div`/`mod` operations, applying `ORT_CPU_RESTRICT` and force-inlining. | Maximizes memory throughput and instruction pipelining during the final bias addition phase. |
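
To make the "fast floor" and four-corner bounds check in the table above more concrete, here is a minimal Python sketch. It is not the `DeformConvFastFloor` code from the PR: the function names are hypothetical, the zero contribution for out-of-range corners is an assumption, and Python cannot demonstrate branchless execution or SIMD, only the arithmetic idea.

```python
def fast_floor(x: float) -> int:
    """Floor via truncation plus a correction for negative non-integers."""
    i = int(x)            # truncates toward zero
    return i - (x < i)    # subtracts 1 only when truncation rounded up


def bilinear_sample(img, h, w):
    """Sample a 2D list at fractional (h, w); out-of-range corners add 0."""
    height, width = len(img), len(img[0])
    h0, w0 = fast_floor(h), fast_floor(w)
    dh, dw = h - h0, w - w0
    corners = (
        (h0, w0, (1 - dh) * (1 - dw)),
        (h0, w0 + 1, (1 - dh) * dw),
        (h0 + 1, w0, dh * (1 - dw)),
        (h0 + 1, w0 + 1, dh * dw),
    )
    total = 0.0
    for hi, wi, weight in corners:
        # Both range tests are combined with `&` instead of short-circuit `and`.
        inside = (0 <= hi < height) & (0 <= wi < width)
        total += weight * (img[hi][wi] if inside else 0.0)
    return total
```

For example, sampling the 2×2 image `[[1, 2], [3, 4]]` at its center `(0.5, 0.5)` averages the four pixels, while sampling half a pixel above the first row only picks up the in-range corners.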

---

#### 2. GPU (CUDA) Optimizations

The CUDA implementation was optimized to reduce memory footprint and
eliminate unnecessary kernel launches.

* **Zero-Copy GEMM Output:** Removed the temporary `gemm_output_buffer`
allocation entirely. By carefully configuring the `stride_c` parameter
(`stride_c_y = M * output_image_size`), the
`cublasGemmStridedBatchedHelper` now writes the computed output directly
into the correct NCHW memory layout of the final `Y` tensor.
* **Kernel Elimination:** Completely removed the
`DeformConvCopyGemmOutputRowMajorToNCHW` custom kernel and its
associated dispatch logic. This reduces kernel launch overhead, lowers
GPU memory bandwidth pressure, and simplifies the overall CUDA execution
pipeline.
* **Reduced Memory Footprint:** Updated the `bytes_per_image`
calculation for workspace memory to reflect the removal of the GEMM
output buffer. This allows the operator to potentially process more
images in parallel under the same memory constraints.
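
The zero-copy idea can be illustrated without CUDA. The sketch below is a plain-Python stand-in, not the cuBLAS call from the PR: it assumes a single group and row-major buffers, and the function names are hypothetical. It shows that when each per-image GEMM result (`M × output_image_size`) is written at offset `n * M * output_image_size`, the flat output buffer is already in NCHW order, so no scatter kernel is needed.

```python
def gemm_into(y, offset, weight, col, M, K, P):
    """Write weight (M x K) @ col (K x P), row-major, into y at `offset`."""
    for m in range(M):
        for p in range(P):
            acc = 0.0
            for k in range(K):
                acc += weight[m * K + k] * col[k * P + p]
            y[offset + m * P + p] = acc


def batched_gemm_nchw(weight, cols, N, M, K, P):
    """Run one GEMM per image, writing straight into the final output."""
    y = [0.0] * (N * M * P)   # final Y tensor, flattened NCHW (P = H_out * W_out)
    stride_c = M * P          # per-image stride, as in stride_c_y above
    for n in range(N):
        gemm_into(y, n * stride_c, weight, cols[n], M, K, P)
    return y
```

Element `(n, m, p)` lands at flat index `n * M * P + m * P + p`, which is exactly the NCHW offset of output channel `m` and pixel `p` in image `n`.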

---

#### 3. Changed

- **Batch chunking:** Chunk size `k` is chosen so that the number of
outer rounds is minimized under the temp-memory cap; **`k` does not have
to divide `N`**. The host loop uses `cur_parallel = min(k, N - b)`, so
the last chunk may be smaller. This is the intended default behavior for
this EP (not yet in a formal release).
- **Kernel-size templates:** Im2col is specialized for **1×1, 3×3, and
7×7**; other sizes (including **5×5**) use the **dynamic** `kH`/`kW`
path. Rationale: 5×5 is less common in current stacks (often replaced by
stacked 3×3); specializing 7×7 targets common large-kernel cases. Older
DCN/detection models that still use **5×5** deformable conv will take
the dynamic path—correctness is unchanged; only compile-time unrolling
differs.
- **Add aliasing flags:** Updated DeformConv aliasing comments to make
the stronger guarantee explicit: if output `Y` overlaps any input
buffer, results can be incorrect regardless of `restrict`, because
output writes may clobber source elements before they are fully
consumed. `restrict` further tightens this by introducing undefined
behavior when aliasing assumptions are violated.
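
The batch-chunking rule from the first bullet above (`cur_parallel = min(k, N - b)`) can be sketched as a small host-side loop. The function name is hypothetical, and how `k` itself is derived from the temp-memory cap is not modeled here.

```python
def chunk_batches(n, k):
    """Yield (start, size) chunks covering n images with chunk size k.

    k does not have to divide n; the last chunk is simply smaller.
    """
    b = 0
    while b < n:
        cur_parallel = min(k, n - b)
        yield b, cur_parallel
        b += cur_parallel
```

With `n = 10` and `k = 4`, the loop produces chunks of 4, 4, and 2 images, so three outer rounds are enough even though 4 does not divide 10.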

---

### Summary

In the current implementation, CPU performance is 33x (main branch is
15x) that of TorchVision. If we were to implement AVX2/AVX512
optimizations from scratch, we could achieve a 36x performance boost.
However, I haven’t found any similar reference code in the ONNX Runtime
repository.

This PR also significantly improves parallelism:

<img width="540" height="332" alt="image"
src="https://github.com/user-attachments/assets/d4f670bd-dde3-43f1-b597-4471bfde005b"
/>

_Both ort and tv are configured with 16 threads_

### Open Question for Reviewers

**Regarding CUDA Temporary Memory Allocation:**
Currently, the effective maximum temporary memory for CUDA is calculated
using a heuristic (`total_global_mem * 0.1` or similar logic in
`GetDeformConvEffectiveMaxTempBytes`). While the removal of
`gemm_output_buffer` has reduced the memory footprint per image, I am
not entirely certain if this 10% threshold is still the most appropriate
value for balancing parallel image processing (`n_parallel_imgs`)
against overall VRAM consumption in large models.

I would appreciate any feedback or suggestions on whether we should tune
this threshold, or if there's a more robust way to dynamically determine
the optimal temporary workspace size for `DeformConv` in ORT.
…7582)

### Description
In `Dequantize4BitsKernelReOrder` (CPU and CUDA EP), values from the
`g_idx` tensor are used directly as array indices into the `scales` and
`zero_points` buffers without bounds checking. This PR adds value-range
validation and tests for the `g_idx` input tensor in the `MatMulNBits`
operator.
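
A minimal sketch of the kind of value-range validation described above, assuming that valid `g_idx` entries are group indices in `[0, num_groups)`, where `num_groups` is the number of groups available in `scales` and `zero_points`. The function name and the exact bounds are assumptions for illustration, not the checks added in the PR.

```python
def g_idx_is_valid(g_idx, num_groups):
    """Return True if every group index can safely index scales/zero_points."""
    return all(0 <= group < num_groups for group in g_idx)
```

A model whose `g_idx` contains a negative value or a value of `num_groups` or larger would be rejected before any buffer is indexed.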



---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
)

### Description
Add input validation to the LinearClassifier operator to prevent an
out-of-bounds heap read in GEMM when a crafted model provides mismatched
coefficients/intercepts sizes.

Fixes
https://portal.microsofticm.com/imp/v5/incidents/details/31000000559851/summary

### Changes
- **Constructor**: Validate `class_count_ > 0` and `coefficients_.size()
% class_count_ == 0`
- **Compute()**: Validate `coefficients_.size() == class_count *
num_features` before GEMM call
- **Tests**: Two regression tests for invalid coefficient sizes
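
The two validation steps listed above can be sketched as follows. The function names are hypothetical and the real operator reports errors through ORT status codes rather than booleans; only the conditions themselves come from the description.

```python
def constructor_check(coefficients, class_count):
    """Constructor-time check: positive class count, evenly divisible size."""
    return class_count > 0 and len(coefficients) % class_count == 0


def compute_check(coefficients, class_count, num_features):
    """Compute-time check: coefficients must form [class_count, num_features]."""
    return len(coefficients) == class_count * num_features
```

For instance, 6 coefficients with `class_count = 2` pass the constructor check, but they only pass the compute check for inputs with exactly 3 features; with 4 features GEMM would read past the end of the coefficient buffer.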

### Motivation and Context
MSRC case 109185 (VULN-176698): OOB read via GEMM from crafted model in
LinearClassifier operator. Root cause is missing validation that the
coefficients vector size matches `[class_count, num_features]` before
passing raw pointers to GEMM.
…28013)

### Description

Add a pre-commit [git
hook](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) that
runs lintrunner on staged files, catching lint and formatting issues
before they reach CI.

The hook runs lintrunner in check-only mode (no auto-fix) to avoid
issues with partial staging. If lint issues are found, the commit is
blocked and the developer is prompted to run `lintrunner -a` to fix.

The hook is opt-in. Contributors enable it with: `git config
core.hooksPath .githooks`

### Motivation and Context

Follow-up from microsoft#27856.
Catching lint issues at commit time saves CI cycles and review time.
Adds WebGPU support for Qwen3.5 by adding the LinearAttention and
CausalConvWithState ops, based on the proposal in onnx/onnx#7767.

The model can be created with the model builder from
https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/builder.py.

For example, for the text-only flavor:
```
python builder.py -m Qwen/Qwen3.5-0.8B  -o Qwen3.5-0.8B -e webgpu -p int4 --extra_options int4_accuracy_level=4 exclude_embeds=False
```
…rosoft#27878)

### Description

 Add Arm64 BF16 fast-math convolution support in MLAS:
  - direct NCHW conv
  - depthwise 3x3 NCHWc conv
  - pointwise 1x1 NCHWc conv

This change adds new AArch64 BF16 asm kernels, wires them into MLAS
platform dispatch, keeps accumulated pointwise batches on the custom
BF16 path instead of falling back to generic SBGEMM, and adds the
required BF16 build flags.

The new paths are only used when Arm64 BF16 fast-math is enabled via the
existing session option. Baseline FP32 behavior is unchanged.

### Performance

Per-convolution speedups measured on a `c8g` AWS instance. In the column
headers, "base" is FP32 execution, "fast-math" is fast-math enabled
without this PR, and "PR w/ fast-math" is fast-math enabled with this
change:

| Type | Shape | fast-math vs base | PR w/ fast-math vs base | PR w/ fast-math vs fast-math |
|---|---|---:|---:|---:|
| depthwise | N1 IC32 OC32 H112xW112->112x112 K3x3 S1x1 D1x1 P1/1/1/1 G32 | 0.991x | 1.047x | 1.057x |
| depthwise | N1 IC96 OC96 H112xW112->56x56 K3x3 S2x2 D1x1 P1/1/1/1 G96 | 1.015x | 1.015x | 1.000x |
| depthwise | N1 IC144 OC144 H56xW56->28x28 K3x3 S2x2 D1x1 P1/1/1/1 G144 | 1.020x | 1.004x | 0.984x |
| depthwise | N1 IC144 OC144 H56xW56->56x56 K3x3 S1x1 D1x1 P1/1/1/1 G144 | 1.034x | 1.138x | 1.101x |
| depthwise | N1 IC192 OC192 H28xW28->28x28 K3x3 S1x1 D1x1 P1/1/1/1 G192 | 0.997x | 1.033x | 1.037x |
| depthwise | N1 IC384 OC384 H28xW28->14x14 K3x3 S2x2 D1x1 P1/1/1/1 G384 | 1.016x | 1.021x | 1.005x |
| depthwise | N1 IC384 OC384 H28xW28->28x28 K3x3 S1x1 D1x1 P1/1/1/1 G384 | 1.011x | 1.090x | 1.077x |
| depthwise | N1 IC576 OC576 H14xW14->7x7 K3x3 S2x2 D1x1 P1/1/1/1 G576 | 1.029x | 0.995x | 0.967x |
| depthwise | N1 IC576 OC576 H14xW14->14x14 K3x3 S1x1 D1x1 P1/1/1/1 G576 | 1.025x | 1.006x | 0.982x |
| depthwise | N1 IC960 OC960 H7xW7->7x7 K3x3 S1x1 D1x1 P1/1/1/1 G960 | 1.002x | 0.941x | 0.939x |
| nchw | N1 IC3 OC32 H224xW224->112x112 K3x3 S2x2 D1x1 P1/1/1/1 G1 | 1.001x | 1.058x | 1.058x |
| pointwise | N1 IC16 OC96 H112xW112->112x112 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.213x | 1.328x | 1.095x |
| pointwise | N1 IC32 OC16 H112xW112->112x112 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.020x | 1.019x | 0.998x |
| pointwise | N1 IC32 OC32 H112xW112->112x112 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.118x | 1.196x | 1.069x |
| pointwise | N1 IC32 OC144 H56xW56->56x56 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.220x | 1.528x | 1.252x |
| pointwise | N1 IC32 OC192 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.199x | 1.418x | 1.183x |
| pointwise | N1 IC64 OC384 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.294x | 1.938x | 1.497x |
| pointwise | N1 IC96 OC32 H56xW56->56x56 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.080x | 1.426x | 1.320x |
| pointwise | N1 IC96 OC576 H14xW14->14x14 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.280x | 1.961x | 1.532x |
| pointwise | N1 IC144 OC32 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.132x | 1.351x | 1.193x |
| pointwise | N1 IC144 OC32 H56xW56->56x56 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.073x | 1.374x | 1.281x |
| pointwise | N1 IC160 OC960 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.133x | 1.744x | 1.539x |
| pointwise | N1 IC192 OC32 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.166x | 1.411x | 1.210x |
| pointwise | N1 IC192 OC64 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.212x | 1.763x | 1.454x |
| pointwise | N1 IC320 OC1280 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.136x | 2.059x | 1.812x |
| pointwise | N1 IC384 OC64 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.256x | 1.904x | 1.516x |
| pointwise | N1 IC384 OC96 H14xW14->14x14 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.206x | 1.929x | 1.600x |
| pointwise | N1 IC576 OC96 H14xW14->14x14 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.250x | 2.055x | 1.644x |
| pointwise | N1 IC576 OC160 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 0.902x | 1.423x | 1.577x |
| pointwise | N1 IC960 OC160 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 0.915x | 1.527x | 1.668x |
| pointwise | N1 IC960 OC320 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.020x | 1.756x | 1.723x |
| pointwise | N1 IC1280 OC1008 H1xW1->1x1 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 0.747x | 1.149x | 1.538x |
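
As a reading aid for the table above: assuming all three columns derive from the same timing measurements, the last column should equal the ratio of the other two, up to rounding of the published figures. The helper name below is hypothetical.

```python
def pr_vs_fast_math(pr_vs_base, fast_math_vs_base):
    """Speedup of the PR over existing fast-math, from the two base-relative columns."""
    return pr_vs_base / fast_math_vs_base


# (fast-math vs base, PR vs base, PR vs fast-math) copied from three table rows.
ROWS = [
    (0.991, 1.047, 1.057),
    (1.294, 1.938, 1.497),
    (1.136, 2.059, 1.812),
]
```

Each listed row agrees with this ratio to within about 0.001, which is consistent with the columns having been rounded independently.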

For full models, the speedups on `c8g` (AWS Graviton 4) and
`Standard_D32plds_v6` (Azure Cobalt-100) when running [MobileNet
v2.7](https://github.com/onnx/models/blob/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx)
with 8 threads are:

| Instance | PR w/ fast-math vs base | PR w/ fast-math vs fast-math |
|---|---|---|
| `c8g` | 1.892x | 1.647x |
| `Standard_D32plds_v6` | 2.884x | 1.692x |

(cc: @Rohanjames1997 @snadampal)

---------

Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
### Description
Fix ICM issue:
https://portal.microsofticm.com/imp/v5/incidents/details/31000000567822/summary

The ICM mainly concerns two issues in `validate_package.py`, which were
fixed by microsoft#27840.
It also references another issue in `whisper_jump_times.py`, which is
what this PR fixes.

### Motivation and Context
ICM fixes
## Description

Ports graph capture/replay APIs (e.g., CUDA Graph) to the Plugin EP
(`OrtEp`) C API so that plugin-based execution providers can participate
in ORT-managed graph capture and replay.

### What changed

**New Plugin EP C API functions** (`onnxruntime_ep_c_api.h`):
- `OrtEp::IsGraphCaptureEnabled` — indicates whether the EP has graph
capture enabled.
- `OrtEp::IsGraphCaptured` — indicates whether a graph has been captured
for a given annotation ID.
- `OrtEp::ReplayGraph` — replays a previously captured graph.
- `OrtEp::GetGraphCaptureNodeAssignmentPolicy` — returns the node
assignment validation policy for graph capture.

All four are optional (NULL defaults to safe behavior) and version-gated
(`ort_version_supported >= 26`).
If `IsGraphCaptureEnabled` returns true, `IsGraphCaptured` and
`ReplayGraph` must also be implemented; otherwise,
`PluginExecutionProvider` logs a warning and disables graph capture for
that EP.
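
The optional-function rules above can be sketched as a small decision function. This is a Python stand-in for the C function-pointer table, not the `PluginExecutionProvider` code: the `FakeOrtEp` type and field names are hypothetical, and treating "safe behavior" as "graph capture disabled" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FakeOrtEp:
    """Hypothetical stand-in for an OrtEp struct; None models a NULL pointer."""
    ort_version_supported: int
    is_graph_capture_enabled: Optional[Callable[[], bool]] = None
    is_graph_captured: Optional[Callable[[int], bool]] = None
    replay_graph: Optional[Callable[[int], None]] = None


def effective_graph_capture_enabled(ep: FakeOrtEp) -> bool:
    """Decide whether the framework should treat graph capture as enabled."""
    if ep.ort_version_supported < 26:
        return False  # older plugins: the new functions are not consulted
    if ep.is_graph_capture_enabled is None:
        return False  # NULL defaults to the safe (disabled) behavior
    if not ep.is_graph_capture_enabled():
        return False
    if ep.is_graph_captured is None or ep.replay_graph is None:
        # The real bridge logs a warning here and disables graph capture.
        return False
    return True
```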

**New `OrtGraphCaptureNodeAssignmentPolicy` enum**
(`onnxruntime_ep_c_api.h`):
Replaces the hardcoded EP-name checks in
`InferenceSession::Initialize()` with a policy-based approach:
- `ALL_NODES_ON_EP` — all nodes must be on the target EP (e.g.,
TensorRT).
- `ALLOW_CPU_FOR_SHAPES` — CPU nodes allowed for shape computation if no
memcpy nodes exist (e.g., CUDA, WebGPU, DML).

**Refactored `InferenceSession` graph capture selection**
(`inference_session.cc`):
- Removed the hardcoded `graph_support_ep_list` and per-EP `strcmp`
checks.
- Now iterates over all registered EPs and uses
`IsGraphCaptureEnabled()` + `GetGraphCaptureNodeAssignmentPolicy()` to
select and validate the graph-capturing EP.
- `AreAllComputeNodesAssignedToCudaOrJsOrDmlEpWebGpuEp()` → generalized
to `AreAllComputeNodesAssignedToEpOrCpu()`, which also requires at least
one node on the target EP.
- `IExecutionProvider::GetGraphCaptureNodeAssignmentPolicy()` added to
the base class (defaults to `ALL_NODES_ON_EP`).
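
A simplified sketch of the policy-based node assignment check described above. The enum values follow the policy names in this description, but everything else is hypothetical: nodes are reduced to the name of the EP they are assigned to, and the sketch does not verify that CPU nodes are actually used for shape computation.

```python
from enum import Enum


class NodeAssignmentPolicy(Enum):
    ALL_NODES_ON_EP = 0
    ALLOW_CPU_FOR_SHAPES = 1


def nodes_ok_for_graph_capture(node_eps, target_ep, policy, has_memcpy_nodes=False):
    """Check whether a node-to-EP assignment is acceptable for graph capture."""
    on_target = sum(1 for ep in node_eps if ep == target_ep)
    if on_target == 0:
        return False  # at least one node must run on the target EP
    if policy is NodeAssignmentPolicy.ALL_NODES_ON_EP:
        return on_target == len(node_eps)
    # ALLOW_CPU_FOR_SHAPES: CPU nodes are tolerated only without memcpy nodes.
    if has_memcpy_nodes:
        return False
    return all(ep in (target_ep, "CPU") for ep in node_eps)
```

Under `ALLOW_CPU_FOR_SHAPES`, a graph with nodes on `["WebGPU", "CPU", "WebGPU"]` passes, while the same assignment fails under `ALL_NODES_ON_EP`.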

**Bounded graph capture recursion** (`inference_session.cc/h`):
- `Run()` now delegates to `RunImpl()` with a `graph_capture_depth`
parameter.
- Caps internal run attempts at `kMaxGraphCaptureRunAttempts = 8`,
returning a clear error if the EP never reports `IsGraphCaptured() ==
true`.
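
The bounded recursion can be sketched as follows. Only the cap of 8 and the `graph_capture_depth` parameter come from the description; `FakeEp`, its methods, and the exact control flow are hypothetical and much simpler than `InferenceSession::RunImpl()`.

```python
K_MAX_GRAPH_CAPTURE_RUN_ATTEMPTS = 8  # value taken from the description above


class FakeEp:
    """Hypothetical EP that reports a captured graph after `runs_needed` runs."""

    def __init__(self, runs_needed):
        self.runs_needed = runs_needed
        self.runs = 0
        self.replays = 0

    def is_graph_captured(self):
        return self.runs >= self.runs_needed

    def run_once(self):
        self.runs += 1  # stands in for a warm-up or capture run

    def replay_graph(self):
        self.replays += 1


def run(ep, graph_capture_depth=0):
    """Run until the EP reports a captured graph, then replay it once."""
    if ep.is_graph_captured():
        ep.replay_graph()
        return
    if graph_capture_depth >= K_MAX_GRAPH_CAPTURE_RUN_ATTEMPTS:
        raise RuntimeError("EP never reported IsGraphCaptured() == true")
    ep.run_once()
    run(ep, graph_capture_depth + 1)
```

An EP that needs three warm-up runs ends with three normal runs and one replay, while an EP that never captures fails with a clear error after eight attempts instead of recursing indefinitely.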

**EP implementations**:
- **WebGPU plugin EP**: Fully implements all four graph capture APIs by
forwarding to the underlying `IExecutionProvider`.
- **CUDA plugin EP**: Stubs with TODOs (returns
disabled/not-implemented).
- **NvTensorRTRTX EP**: `IsGraphCaptureEnabled()` now returns `false`
since this EP manages graph capture internally (not via ORT).

**C++ wrapper** (`onnxruntime_cxx_api.h` / `onnxruntime_cxx_inline.h`):
- Added `Ort::Env::CopyTensor()` convenience overload for copying a
single tensor (wraps `CopyTensors` with `num_tensors=1`).

### Tests
- **`ep_plugin_provider_test.cc`**: Unit tests for each new
`PluginExecutionProvider` graph capture method, including NULL function
pointer defaults, backward compatibility with versions < 26, and
validation that `IsGraphCaptureEnabled()` returns false when
`IsGraphCaptured` or `ReplayGraph` is NULL.
- **`test_graph_capture.cc`**: End-to-end test for WebGPU plugin EP
graph capture/replay using IO binding (warm-up + capture run, then
replay with different inputs).

### Motivation and Context

Previously, graph capture support was limited to a hardcoded list of EPs
(`kCudaExecutionProvider`, `kTensorrtExecutionProvider`,
`kJsExecutionProvider`, `kWebGpuExecutionProvider`,
`kDmlExecutionProvider`) with EP-specific validation logic in
`InferenceSession`. This made it impossible for plugin EPs to
participate in ORT-managed graph capture/replay without modifying the
core session code.

This PR makes graph capture/replay extensible to any EP, including
out-of-tree plugin EPs, by exposing it through the `OrtEp` C API.
### Description
- Update the `WhereDummyDq` QDQ transformer to be more selective before
  inserting a dummy `DequantizeLinear` around `Where`.
  - `SatisfyCondition` now requires the `Where` output to have exactly one
    consumer, and that consumer must be `QuantizeLinear` (Q). Otherwise, the
    transform is skipped.
  - `InsertDummyDQ` additionally checks element type consistency between
    the upstream DQ input tensor type and the downstream Q output tensor
    type; if they differ, the transform returns without modifying the graph.
- Update the implementation of `WhereDummyDq` to avoid a negative or zero
  `scale` value. The change maps the float value to the **boundary** of
  the integer domain to ensure the `scale` value is positive.
  - If `WhereOp` gets a float scalar `xf` and a `DequantizeLinear` as its
    two inputs, `WhereDummyDq` inserts a DQ to ensure `xf = DQ(xq, scale, zp)`.
  - `xq`, `scale`, and `zp` are determined by the following table, where
    `scale = xf / (xq - zp)` if `xq != zp`, else `1`.

| | uint8 | uint16 | int8 | int16 |
|-----------------|--------------|---------------|-------------|---------------|
| xf > 0 | | | | |
| xq | 255 | 65535 | 127 | 32767 |
| zp | 127 | 32767 | 0 | 0 |
| xf < 0 | | | | |
| xq | 0 | 0 | -128 | -32768 |
| zp | 127 | 32767 | 0 | 0 |
| xf = 0 | | | | |
| xq | 127 | 32767 | 0 | 0 |
| zp | 127 | 32767 | 0 | 0 |
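
The table and the `scale` formula above can be checked with a short Python sketch. The function and dictionary names are hypothetical; the numeric values are taken from the table, and `dequantize` uses the standard `(xq - zp) * scale` definition of `DequantizeLinear`.

```python
# (qmin, qmax, zero point) per quantized type, as listed in the table.
QUANT_PARAMS = {
    "uint8": (0, 255, 127),
    "uint16": (0, 65535, 32767),
    "int8": (-128, 127, 0),
    "int16": (-32768, 32767, 0),
}


def dummy_dq_params(xf, dtype):
    """Pick xq, zp and a positive scale so that DQ(xq, scale, zp) == xf."""
    qmin, qmax, zp = QUANT_PARAMS[dtype]
    if xf > 0:
        xq = qmax   # positive values map to the upper boundary
    elif xf < 0:
        xq = qmin   # negative values map to the lower boundary
    else:
        xq = zp     # zero maps to the zero point
    scale = xf / (xq - zp) if xq != zp else 1.0
    return xq, zp, scale


def dequantize(xq, scale, zp):
    return (xq - zp) * scale
```

For `xf = 2.0` and `uint8`, this yields `xq = 255`, `zp = 127`, and `scale = 2 / 128`. For a negative `xf`, both `xf` and `xq - zp` are negative, so the scale stays positive.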


### Motivation and Context
- Negative or zero scale values are not well supported by various EPs
and backends, such as QNN-EP.
- Inserting an additional DQ is only useful when it forms a valid QDQ
“node unit” pattern. If the `Where` output is not followed by a single
`QuantizeLinear` (e.g., multiple consumers or a non-Q consumer), adding
a dummy DQ cannot create the intended pattern and may lead to
non-fusible/undesired graph structures.
## Description

This PR brings CUDA graph capture/replay to the CUDA plugin execution
provider so plugin-based CUDA deployments can get the same reduced CPU
launch overhead that the in-tree CUDA EP already supports. It also adds
the ORT framework and plugin-C-API plumbing needed to let graph-enabled
plugin EPs participate safely in warmup, capture, and replay, while
preserving compatibility with older plugins through version-gated
fallbacks.

## Summary of Changes

### CUDA plugin EP runtime and allocator integration

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc` | Implements plugin-side graph capture lifecycle callbacks, per-thread graph context management, graph replay, and stream selection for graph-enabled runs. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.h` | Adds CUDA graph configuration/state to the plugin EP, including per-thread graph context ownership. |
| `onnxruntime/core/providers/cuda/plugin/cuda_graph_plugin.cc` | Adds `CudaGraphSet`/`CudaGraphManager` to own captured graphs and coordinate warmup, capture, and replay by annotation ID. |
| `onnxruntime/core/providers/cuda/plugin/cuda_graph_plugin.h` | Declares the new graph manager types and graph-related constants. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc` | Adds external-stream wrapping so graph-enabled runs can reuse the thread’s graph stream without taking ownership of it. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h` | Declares the external-stream initialization path and stream ownership tracking. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc` | Parses `enable_cuda_graph` and `min_num_runs_before_cuda_graph_capture` provider/session options for the plugin EP. |
| `onnxruntime/core/providers/cuda/plugin/cuda_mempool_allocator_plugin.cc` | Updates allocator behavior needed for CUDA native mempool compatibility during graph capture/replay. |
| `onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h` | Adjusts plugin kernel/device helpers used by the graph-enabled execution path. |
| `onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h` | Adds supporting helpers used by the plugin CUDA graph flow. |
### ORT framework and plugin API support for graph replay

| File | Change |
|------|--------|
| `include/onnxruntime/core/session/onnxruntime_ep_c_api.h` | Documents and extends the plugin EP contract for graph-enabled runs, including replay behavior relative to `OnRunStart`/`OnRunEnd`. |
| `include/onnxruntime/core/framework/execution_provider.h` | Adds graph-capture node-assignment policy support to the execution provider interface. |
| `onnxruntime/core/session/inference_session.cc` | Generalizes the session replay path and warmup/capture retry loop so ORT can drive graph replay for graph-capable EPs. |
| `onnxruntime/core/session/inference_session.h` | Updates replay-related messaging and supporting declarations for the new run flow. |
| `onnxruntime/core/framework/session_state.cc` | Makes device-stream collection reuse thread-affine so warmup/capture/replay reuse stays on the owning thread. |
| `onnxruntime/core/framework/session_state.h` | Adds supporting state for the thread-affine stream collection pool. |
| `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc` | Bridges the new graph callbacks, hardens validation of plugin graph support, and exposes effective plugin provider options gathered from session config. |
| `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h` | Stores provider options and declares the new accessor/graph bridge behavior. |
| `onnxruntime/core/providers/webgpu/webgpu_execution_provider.h` | Aligns graph-capture policy support with the new execution-provider interface. |
| `onnxruntime/core/providers/js/js_execution_provider.h` | Aligns graph-capture policy support with the new execution-provider interface. |

### Tests and validation coverage

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/test_cuda_plugin_ep.py` | Adds end-to-end CUDA graph tests for warmup/capture/replay, replay after input updates, CUDA mempool mode, multiple graph annotation IDs, multi-GPU/device-id coverage, and a simple Add model. |

### Documentation

| File | Change |
|------|--------|
| `docs/cuda_plugin_ep/cuda_graph_for_cuda_plugin.md` | Adds a dedicated design/implementation document covering architecture, lifecycle, allocator interaction, concurrency, and verification guidance. |
| `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | Updates the broader plugin EP design doc to reflect that CUDA graph support is implemented and documents the framework-level changes. |
| `docs/cuda_plugin_ep/QUICK_START.md` | Updates quick-start/testing guidance and removes the outdated “no CUDA Graph support” limitation. |

## Testing

- Build ONNX Runtime with `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`,
install the generated wheel, and deploy the CUDA plugin shared library
as described in `docs/cuda_plugin_ep/QUICK_START.md`.
- Run `python
onnxruntime/test/python/transformers/test_cuda_plugin_ep.py`.
- Pay particular attention to the new CUDA graph scenarios in that
suite: warmup/capture/replay, replay after in-place input updates, CUDA
mempool mode, multiple `gpu_graph_id` captures, and the second-device
path when multiple GPUs are available.
- Verify backward compatibility by confirming older plugins still load
safely through the version-gated graph callback bridge, and that
graph-disabled runs continue through the normal execution path.
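
The warmup/capture/replay lifecycle these scenarios exercise can be pictured with a small pure-Python model. This is only a sketch: `GraphRunTracker` and its method are hypothetical names that do not mirror any ORT class, and it assumes one independent warmup counter per graph annotation ID. The only name taken from the text above is `min_num_runs_before_cuda_graph_capture`.

```python
from collections import defaultdict


class GraphRunTracker:
    """Toy model of per-graph-ID warmup -> capture -> replay (hypothetical)."""

    def __init__(self, min_num_runs_before_cuda_graph_capture: int = 2):
        self.min_runs = min_num_runs_before_cuda_graph_capture
        self.run_counts = defaultdict(int)  # warmup runs seen per graph ID
        self.captured = set()               # graph IDs that already have a capture

    def next_phase(self, graph_id: int) -> str:
        """Return the phase the next run with this graph ID would go through."""
        if graph_id in self.captured:
            return "replay"
        if self.run_counts[graph_id] < self.min_runs:
            self.run_counts[graph_id] += 1
            return "warmup"
        self.captured.add(graph_id)
        return "capture"


tracker = GraphRunTracker(min_num_runs_before_cuda_graph_capture=2)
phases = [tracker.next_phase(graph_id=0) for _ in range(5)]
second_graph_first_phase = tracker.next_phase(graph_id=1)
print(phases, second_graph_first_phase)
```

In this model a second graph ID starts its own warmup even after the first one is replaying, which is the kind of behavior the multiple `gpu_graph_id` test scenario is meant to check.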

## Motivation and Context

The CUDA plugin EP exists to decouple CUDA EP delivery from core ONNX
Runtime releases, but that model only works well if important runtime
optimizations are also available through the plugin path. CUDA graph
replay is one of the highest-value CUDA execution optimizations because
it eliminates repeated kernel-launch overhead after capture, especially
for steady-state inference workloads.

Supporting that in the plugin EP required more than adding plugin-local
capture code. ORT also needed a framework-level replay flow that works
for plugin EPs, a plugin C API contract for graph support and
node-assignment policy, and thread-affine stream reuse so captured graph
resources and stream wrappers are not reused across unrelated threads.
This PR packages those pieces together and documents the resulting
behavior for future plugin EP work. It also depends on earlier plugin
allocator work so warmup can stabilize allocations before capture
begins.
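
As a rough illustration of what "thread-affine reuse" means here, the following sketch keeps idle items in per-thread buckets so an item released by one thread is never handed to another. `ThreadAffinePool` is a hypothetical name and this is not the ORT stream-collection pool; it only shows the ownership rule.

```python
import threading


class ThreadAffinePool:
    """Hypothetical pool that returns an item only to the thread that released it."""

    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._free = {}  # thread id -> list of idle items owned by that thread

    def acquire(self):
        tid = threading.get_ident()
        with self._lock:
            idle = self._free.get(tid)
            if idle:
                return idle.pop()
        # Nothing idle for this thread: create a fresh item instead of borrowing.
        return self._factory()

    def release(self, item):
        tid = threading.get_ident()
        with self._lock:
            self._free.setdefault(tid, []).append(item)


pool = ThreadAffinePool(factory=object)
first = pool.acquire()
pool.release(first)
same_thread = pool.acquire()  # the releasing thread gets its own item back
pool.release(same_thread)

other_thread_result = []
worker = threading.Thread(target=lambda: other_thread_result.append(pool.acquire()))
worker.start()
worker.join()
```

The worker thread receives a newly created item even though the main thread has an idle one, which is the property that keeps captured graph resources from crossing threads.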

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)
## Description

This fixes a flaky failure in the plugin EP profiling tests on macOS,
where reconstructed plugin event timestamps could land a few
microseconds outside the correlated ORT parent event interval.

The current example plugin profiler reconstructs EP-relative timestamps
by combining ORT's profiling-start offset with elapsed time from the EP
clock. That reconstruction is close but not exact across clocks, and on
macOS the skew was enough to fail the strict containment checks in
`KernelPluginEp_SessionProfiling` with cases like `ep_start <
parent_start` by a small margin.

Instead of weakening the test, this change keeps the strict contract and
fixes the profiler output so child EP events are always emitted within
the correlated ORT parent event interval.

## Key Changes

| File | Change |
|------|--------|
| `onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.h` | Stores the correlated ORT parent event start timestamp and duration on each collected EP event, and adds the helper signature updates needed to propagate that metadata. |
| `onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.cc` | Captures parent event timing from `Ort::ConstProfilingEvent`, attaches it to EP events during `StopEventImpl`, and clamps the reconstructed EP start/end interval to the parent ORT interval before emitting the final profiling event. |

## Why This Change Is Needed

- The plugin EP profiling tests intentionally require strict nesting: EP
child events must stay within the ORT parent event interval.
- The existing implementation reconstructs EP timestamps from two
different clocks, which can drift by a few microseconds depending on
platform timing behavior.
- macOS exposed that drift often enough to make
`KernelPluginEp_SessionProfiling` flaky even though the logical event
ordering was correct.
- Clamping the emitted child interval to the already-correlated parent
interval preserves the expected semantics and removes the
platform-specific skew from the final profiling output.
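
The clamping step can be sketched in a few lines. This is an illustrative Python version, not the C++ code in `ep_profiling.cc`; the function name and the microsecond units are assumptions.

```python
def clamp_child_interval(child_start_us, child_dur_us, parent_start_us, parent_dur_us):
    """Clamp a child (start, duration) so it lies inside the parent interval."""
    parent_end = parent_start_us + parent_dur_us
    child_end = child_start_us + child_dur_us
    # Pull the start into [parent_start, parent_end].
    start = min(max(child_start_us, parent_start_us), parent_end)
    # Pull the end into [start, parent_end] so the duration is never negative.
    end = min(max(child_end, start), parent_end)
    return start, end - start


# Child reconstructed 3 us too early relative to the parent.
early = clamp_child_interval(997, 20, 1000, 50)
# Child that overruns the parent's end by 10 us.
late = clamp_child_interval(1040, 20, 1000, 50)
# Child already nested: left unchanged.
nested = clamp_child_interval(1010, 5, 1000, 50)
print(early, late, nested)
```

A child that is already nested passes through untouched, so the clamp only affects the few-microsecond skew cases described above.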

## Testing

- `ninja -C build/cuda/Debug onnxruntime_autoep_test`
- `cd build/cuda/Debug && ./onnxruntime_autoep_test
--gtest_filter=OrtEpLibrary.KernelPluginEp_SessionProfiling`
- `cd build/cuda/Debug && ./onnxruntime_autoep_test
--gtest_filter=OrtEpLibrary.KernelPluginEp_RunProfiling`

## Notes For Reviewers

- This is intentionally scoped to the example plugin EP profiling path
used by the AutoEP tests.
- The change avoids relaxing any assertions in `test_execution.cc`; it
fixes the emitted profiling data instead.
### Description
Set the pointer to nullptr immediately after `UnloadDynamicLibrary`.



### Motivation and Context
After unloading the library, set the function pointer to nullptr to
avoid a dangling pointer. Otherwise, the following scenario may cause
errors:
```
RegisterExecutionProviderLibrary()
SessionOptions::AppendExecutionProvider_VitisAI()
```
In this scenario, the OrtVitisAIEpAPI will call `initialize_vitisai_ep`
once but call `deinitialize_vitisai_ep` twice. During deinitialization,
the function `deinitialize_onnxruntime_vitisai_ep` is invoked, which
leads to errors.
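
The pattern can be shown by analogy in Python, which has no dangling pointers but can model the "clear the reference after unload" guard. `EpLibrary` and its members are hypothetical stand-ins, not the VitisAI EP code.

```python
class EpLibrary:
    """Hypothetical stand-in for a dynamically loaded EP library."""

    def __init__(self):
        self.deinit_calls = 0
        # Stands in for a function pointer resolved from the loaded library.
        self.deinitialize_fn = self._deinitialize

    def _deinitialize(self):
        self.deinit_calls += 1

    def unload(self):
        """Deinitialize once, then clear the reference so later calls are no-ops."""
        if self.deinitialize_fn is None:
            return  # already unloaded; nothing left to call
        self.deinitialize_fn()
        self.deinitialize_fn = None  # analogous to resetting the pointer to nullptr


lib = EpLibrary()
lib.unload()
lib.unload()  # second teardown path; safe because the reference was cleared
print(lib.deinit_calls)
```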
### Description

Centralise feed authentication & setup for build systems on ADO build
pipelines.

### Motivation and Context

SDL requires official build pipelines use a single controlled feed for
external resources.

---------

Co-authored-by: Sanaa Hamel <sanaahamel@microsoft.com>
…icrosoft#27998)

### Description:

### Summary
Fuse the QMoE 1-token decode path to reduce GPU dispatches from 17 (1 +
k×4) to 5 (gate + fc1 + swiglu + fc2 + mix), improving token generation
throughput by ~21% on Meteor Lake for the gpt-oss-20b MoE model (19 → 23
tps).

### Motivation
The QMoE operator processes Mixture-of-Experts layers by selecting top-k
experts (k=4) per token. In the original 1-token decode path, each
expert is processed serially with 4 dispatches (gather + fc1 + swiglu +
fc2 + mix), totaling 17 GPU dispatches per QMoE call. Since each
dispatch has M=1, the GPU is underutilized and CPU dispatch overhead
dominates.
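
The dispatch counts can be checked with simple arithmetic. The sketch below takes the `1 + k×4` figure and the five fused stages from the summary; the function and constant names are made up for illustration.

```python
def serial_dispatches(k: int, per_expert: int = 4, gate: int = 1) -> int:
    """Dispatch count when each of the k experts is processed serially."""
    return gate + k * per_expert


# Stage names of the fused 1-token path, as listed in the summary.
FUSED_STAGES = ("gate", "fc1", "swiglu", "fc2", "mix")

before = serial_dispatches(k=4)
after = len(FUSED_STAGES)

# Reported throughput change on Meteor Lake: 19 -> 23 tokens per second.
relative_gain = (23 - 19) / 19
print(before, after, round(relative_gain * 100))
```

Note that the fused count does not depend on k, while the serial count grows linearly with it.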

### Approach
For the 1-token path (num_rows == 1):

1. **Gate1Token**: Select top-k experts and output an `indirect_experts`
   buffer mapping row index → expert index.
2. **Batched fc1 MatMulNBits**: Run a single M=k matmul in
   `per_row_weight_indirect` mode, where each row selects a different
   expert's weights via the indirect buffer.
3. **SwiGLU**: Apply activation on all k rows at once.
4. **Batched fc2 MatMulNBits**: Same per-row expert selection for the
   down projection.
5. **FusedFinalMix**: Accumulate all k weighted expert results into the
   output.
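
The per-row indirection idea can be sketched with plain Python lists. This is a simplified model under stated assumptions: weights are unquantized floats rather than N-bit blocks, SwiGLU is omitted, and the helper names (`matvec`, `batched_indirect_matmul`, `fused_final_mix`) are hypothetical, not the WebGPU shader entry points.

```python
def matvec(weights, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def batched_indirect_matmul(rows, expert_weights, indirect_experts):
    """Row i is multiplied by the weights of expert indirect_experts[i]."""
    return [matvec(expert_weights[e], x) for x, e in zip(rows, indirect_experts)]


def fused_final_mix(expert_outputs, gate_weights):
    """Accumulate the gate-weighted expert outputs into one output row."""
    width = len(expert_outputs[0])
    return [
        sum(g * out[j] for g, out in zip(gate_weights, expert_outputs))
        for j in range(width)
    ]


# Three toy experts with 2x2 weights; the gate picked experts 2 and 0 (k=2).
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scale by 2
    [[0.0, 1.0], [1.0, 0.0]],  # expert 2: swap
]
token = [1.0, 2.0]
indirect_experts = [2, 0]      # row index -> expert index
gate_weights = [0.5, 0.5]

rows = [token] * len(indirect_experts)  # the single token, once per selected expert
per_expert = batched_indirect_matmul(rows, expert_weights, indirect_experts)
output = fused_final_mix(per_expert, gate_weights)
print(per_expert, output)
```

One batched call handles all k rows, which is what lets the fused path replace the per-expert loop with a single dispatch per stage.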

### Follow-ups
- Fuse batched fc1 MatMulNBits + SwiGLU.
- Fuse batched fc2 MatMulNBits + FusedFinalMix.

After both fusions, only three shaders are needed: Gate1Token, fused
batched fc1 MatMulNBits, and fused batched fc2 MatMulNBits.
Bumps [lodash](https://github.com/lodash/lodash) from 4.17.23 to 4.18.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/lodash/lodash/releases">lodash's
releases</a>.</em></p>
<blockquote>
<h2>4.18.1</h2>
<h2>Bugs</h2>
<p>Fixes a <code>ReferenceError</code> issue in <code>lodash</code>
<code>lodash-es</code> <code>lodash-amd</code> and
<code>lodash.template</code> when using the <code>template</code> and
<code>fromPairs</code> functions from the modular builds. See <a
href="https://redirect.github.com/lodash/lodash/issues/6167#issuecomment-4165269769">lodash/lodash#6167</a></p>
<p>These defects were related to how lodash distributions are built from
the main branch using <a
href="https://github.com/lodash-archive/lodash-cli">https://github.com/lodash-archive/lodash-cli</a>.
When internal dependencies change inside lodash functions, equivalent
updates need to be made to a mapping in the lodash-cli. (hey, it was
ahead of its time once upon a time!). We know this, but we missed it in
the last release. It's the kind of thing that passes in CI, but fails bc
the build is not the same thing you tested.</p>
<p>There is no diff on main for this, but you can see the diffs for each
of the npm packages on their respective branches:</p>
<ul>
<li><code>lodash</code>: <a
href="https://github.com/lodash/lodash/compare/4.18.0-npm...4.18.1-npm">https://github.com/lodash/lodash/compare/4.18.0-npm...4.18.1-npm</a></li>
<li><code>lodash-es</code>: <a
href="https://github.com/lodash/lodash/compare/4.18.0-es...4.18.1-es">https://github.com/lodash/lodash/compare/4.18.0-es...4.18.1-es</a></li>
<li><code>lodash-amd</code>: <a
href="https://github.com/lodash/lodash/compare/4.18.0-amd...4.18.1-amd">https://github.com/lodash/lodash/compare/4.18.0-amd...4.18.1-amd</a></li>
<li><code>lodash.template</code>: <a
href="https://github.com/lodash/lodash/compare/4.18.0-npm-packages...4.18.1-npm-packages">https://github.com/lodash/lodash/compare/4.18.0-npm-packages...4.18.1-npm-packages</a></li>
</ul>
<h2>4.18.0</h2>
<h2>v4.18.0</h2>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/lodash/lodash/compare/4.17.23...4.18.0">https://github.com/lodash/lodash/compare/4.17.23...4.18.0</a></p>
<h3>Security</h3>
<p><strong><code>_.unset</code> / <code>_.omit</code></strong>: Fixed
prototype pollution via <code>constructor</code>/<code>prototype</code>
path traversal (<a
href="https://github.com/lodash/lodash/security/advisories/GHSA-f23m-r3pf-42rh">GHSA-f23m-r3pf-42rh</a>,
<a
href="https://github.com/lodash/lodash/commit/fe8d32eda854377349a4f922ab7655c8e5df9a0b">fe8d32e</a>).
Previously, array-wrapped path segments and primitive roots could bypass
the existing guards, allowing deletion of properties from built-in
prototypes. Now <code>constructor</code> and <code>prototype</code> are
blocked unconditionally as non-terminal path keys, matching
<code>baseSet</code>. Calls that previously returned <code>true</code>
and deleted the property now return <code>false</code> and leave the
target untouched.</p>
<p><strong><code>_.template</code></strong>: Fixed code injection via
<code>imports</code> keys (<a
href="https://github.com/lodash/lodash/security/advisories/GHSA-r5fr-rjxr-66jc">GHSA-r5fr-rjxr-66jc</a>,
CVE-2026-4800, <a
href="https://github.com/lodash/lodash/commit/879aaa93132d78c2f8d20c60279da9f8b21576d6">879aaa9</a>).
Fixes an incomplete patch for CVE-2021-23337. The <code>variable</code>
option was validated against <code>reForbiddenIdentifierChars</code> but
<code>importsKeys</code> was left unguarded, allowing code injection via
the same <code>Function()</code> constructor sink. <code>imports</code>
keys containing forbidden identifier characters now throw
<code>&quot;Invalid imports option passed into
_.template&quot;</code>.</p>
<h3>Docs</h3>
<ul>
<li>Add security notice for <code>_.template</code> in threat model and
API docs (<a
href="https://redirect.github.com/lodash/lodash/pull/6099">#6099</a>)</li>
<li>Document <code>lower &gt; upper</code> behavior in
<code>_.random</code> (<a
href="https://redirect.github.com/lodash/lodash/pull/6115">#6115</a>)</li>
<li>Fix quotes in <code>_.compact</code> jsdoc (<a
href="https://redirect.github.com/lodash/lodash/pull/6090">#6090</a>)</li>
</ul>
<h3><code>lodash.*</code> modular packages</h3>
<p><a
href="https://redirect.github.com/lodash/lodash/pull/6157">Diff</a></p>
<p>We have also regenerated and published a select number of the
<code>lodash.*</code> modular packages.</p>
<p>These modular packages had fallen out of sync significantly from the
minor/patch updates to lodash. Specifically, we have brought the
following packages up to parity w/ the latest lodash release because
they have had CVEs on them in the past:</p>
<ul>
<li><a
href="https://www.npmjs.com/package/lodash.orderby">lodash.orderby</a></li>
<li><a
href="https://www.npmjs.com/package/lodash.tonumber">lodash.tonumber</a></li>
<li><a
href="https://www.npmjs.com/package/lodash.trim">lodash.trim</a></li>
<li><a
href="https://www.npmjs.com/package/lodash.trimend">lodash.trimend</a></li>
<li><a
href="https://www.npmjs.com/package/lodash.sortedindexby">lodash.sortedindexby</a></li>
<li><a
href="https://www.npmjs.com/package/lodash.zipobjectdeep">lodash.zipobjectdeep</a></li>
<li><a
href="https://www.npmjs.com/package/lodash.unset">lodash.unset</a></li>
<li><a
href="https://www.npmjs.com/package/lodash.omit">lodash.omit</a></li>
<li><a
href="https://www.npmjs.com/package/lodash.template">lodash.template</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/lodash/lodash/commit/cb0b9b9212521c08e3eafe7c8cb0af1b42b6649e"><code>cb0b9b9</code></a>
release(patch): bump main to 4.18.1 (<a
href="https://redirect.github.com/lodash/lodash/issues/6177">#6177</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/75535f57883b7225adb96de1cfc1cd4169cfcb51"><code>75535f5</code></a>
chore: prune stale advisory refs (<a
href="https://redirect.github.com/lodash/lodash/issues/6170">#6170</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/62e91bc6a39c98d85b9ada8c44d40593deaf82a4"><code>62e91bc</code></a>
docs: remove n_ Node.js &lt; 6 REPL note from README (<a
href="https://redirect.github.com/lodash/lodash/issues/6165">#6165</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/59be2de61f8aa9461c7856533b51d31b7d8babc4"><code>59be2de</code></a>
release(minor): bump to 4.18.0 (<a
href="https://redirect.github.com/lodash/lodash/issues/6161">#6161</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/af634573030f979194871da7c68f79420992f53d"><code>af63457</code></a>
fix: broken tests for _.template 879aaa9</li>
<li><a
href="https://github.com/lodash/lodash/commit/1073a7693e1727e0cf3641e5f71f75ddcf8de7c0"><code>1073a76</code></a>
fix: linting issues</li>
<li><a
href="https://github.com/lodash/lodash/commit/879aaa93132d78c2f8d20c60279da9f8b21576d6"><code>879aaa9</code></a>
fix: validate imports keys in _.template</li>
<li><a
href="https://github.com/lodash/lodash/commit/fe8d32eda854377349a4f922ab7655c8e5df9a0b"><code>fe8d32e</code></a>
fix: block prototype pollution in baseUnset via constructor/prototype
traversal</li>
<li><a
href="https://github.com/lodash/lodash/commit/18ba0a32f42fd02117f096b032f89c984173462d"><code>18ba0a3</code></a>
refactor(fromPairs): use baseAssignValue for consistent assignment (<a
href="https://redirect.github.com/lodash/lodash/issues/6153">#6153</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/b8190803d48d60b8c80ad45d39125f32fa618cb2"><code>b819080</code></a>
ci: add dist sync validation workflow (<a
href="https://redirect.github.com/lodash/lodash/issues/6137">#6137</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/lodash/lodash/compare/4.17.23...4.18.1">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=lodash&package-manager=npm_and_yarn&previous-version=4.17.23&new-version=4.18.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Fix a security issue in which onnxruntime could access values outside
the valid bounds.



### Motivation and Context
security
### Description
Porting over additional fixes to file mapping from the QNN EP ABI repo:
 - Only use file mapping feature if context bin version is >= 3.3.3
 - Disable file mapping on a per-model basis for edge use cases



### Motivation and Context
When testing based on the QNN EP ABI repo, QNN context creation from an
EP context failed because the EP context binary was too old, and with
file mapping enabled this failure prevented the QNN API from freeing all
resources. The failure occurred because the context binary version was
older than 3.3.3, so there is now a check that disables file mapping for
any EP context binary that is too old.

Prior to these changes, if file mapping was enabled and QNN context
creation failed for any reason, the feature was disabled for all other
graphs. This did not account for use cases where (1) a model contains
multiple EP context nodes and some of them are incompatible with the
file mapping feature; or (2) multiple sessions share the same EP context
and one or more of the models used are incompatible with the file
mapping feature. The code has been updated to handle these use cases.
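
The version gate and per-context decision can be sketched as follows. This is an illustration only: it assumes the context binary version is available as a dotted string of integers, and every name here (`MIN_FILE_MAPPING_VERSION`, `file_mapping_allowed`, `plan_file_mapping`) is hypothetical rather than taken from the QNN EP.

```python
MIN_FILE_MAPPING_VERSION = (3, 3, 3)


def parse_version(text):
    """Turn '3.10.0' into (3, 10, 0) so comparison is numeric, not lexical."""
    return tuple(int(part) for part in text.split("."))


def file_mapping_allowed(context_bin_version, user_enabled=True):
    """File mapping is used only if enabled and the binary is new enough."""
    return user_enabled and parse_version(context_bin_version) >= MIN_FILE_MAPPING_VERSION


def plan_file_mapping(context_versions):
    """Decide per EP context node, so one old binary does not disable the rest."""
    return {name: file_mapping_allowed(v) for name, v in context_versions.items()}


plan = plan_file_mapping({"ctx_a": "3.3.3", "ctx_b": "3.2.9", "ctx_c": "3.10.0"})
print(plan)
```

Comparing integer tuples matters for a version such as `3.10.0`, which a plain string comparison would rank below `3.3.3`.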

---------

Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
…rosoft#28017)

Bumps
[fast-xml-parser](https://github.com/NaturalIntelligence/fast-xml-parser)
from 4.5.3 to 4.5.6.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/NaturalIntelligence/fast-xml-parser/releases">fast-xml-parser's
releases</a>.</em></p>
<blockquote>
<h2>Summary update on all the previous releases from v4.2.4</h2>
<ul>
<li>Multiple minor fixes provided in the validator and parser</li>
<li>v6 is added for experimental use.</li>
<li>ignoreAttributes support function, and array of string or regex</li>
<li>Add support for parsing HTML numeric entities</li>
<li>v5 of the application is ESM module now. However, JS is also
supported</li>
</ul>
<p><strong>Note</strong>: Release section in not updated frequently.
Please check <a
href="https://github.com/NaturalIntelligence/fast-xml-parser/blob/master/CHANGELOG.md">CHANGELOG</a>
or <a
href="https://github.com/NaturalIntelligence/fast-xml-parser/tags">Tags</a>
for latest release information.</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/NaturalIntelligence/fast-xml-parser/commit/42fbb0bc95e753e03fe52cb0805a8774bba4bf28"><code>42fbb0b</code></a>
update release info</li>
<li><a
href="https://github.com/NaturalIntelligence/fast-xml-parser/commit/805671cb6c19108b171b876cf3e8865f18cdb8fd"><code>805671c</code></a>
increase expansion limit as many system need it</li>
<li><a
href="https://github.com/NaturalIntelligence/fast-xml-parser/commit/9a2cf097c2961d4ad878f618e39fb0a9f5a0e9e5"><code>9a2cf09</code></a>
update version</li>
<li><a
href="https://github.com/NaturalIntelligence/fast-xml-parser/commit/88d0936a23dabe51bfbf42255e2ce912dfee2221"><code>88d0936</code></a>
apply all fixes from v5</li>
<li><a
href="https://github.com/NaturalIntelligence/fast-xml-parser/commit/d4eb6b4713a8d11e6730943392419040898ecbc0"><code>d4eb6b4</code></a>
update release version</li>
<li>See full diff in <a
href="https://github.com/NaturalIntelligence/fast-xml-parser/compare/v4.5.3...v4.5.6">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=fast-xml-parser&package-manager=npm_and_yarn&previous-version=4.5.3&new-version=4.5.6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…/nextjs-default (microsoft#28036)

Bumps [next](https://github.com/vercel/next.js) from 16.1.5 to 16.2.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/vercel/next.js/releases">next's
releases</a>.</em></p>
<blockquote>
<h2>v16.2.3</h2>
<blockquote>
<p>[!NOTE]
This release is backporting security and bug fixes. For more information
about the fixed security vulnerability, please see <a
href="https://vercel.com/changelog/summary-of-cve-2026-23869">https://vercel.com/changelog/summary-of-cve-2026-23869</a>.
The release does <strong>not</strong> include all pending
features/changes on canary.</p>
</blockquote>
<h3>Core Changes</h3>
<ul>
<li>Ensure app-page reports stale ISR revalidation errors via
onRequestError (<a
href="https://redirect.github.com/vercel/next.js/issues/92282">#92282</a>)</li>
<li>Fix [Bug]: manifest.ts breaks HMR in Next.js 16.2 (<a
href="https://redirect.github.com/vercel/next.js/issues/91981">#91981</a>
through <a
href="https://redirect.github.com/vercel/next.js/issues/92273">#92273</a>)</li>
<li>Deduplicate output assets and detect content conflicts on emit (<a
href="https://redirect.github.com/vercel/next.js/issues/92292">#92292</a>)</li>
<li>Fix styled-jsx race condition: styles lost due to concurrent
rendering (<a
href="https://redirect.github.com/vercel/next.js/issues/92459">#92459</a>)</li>
<li>turbo-tasks-backend: stability fixes for task cancellation and error
handling (<a
href="https://redirect.github.com/vercel/next.js/issues/92254">#92254</a>)</li>
</ul>
<h3>Credits</h3>
<p>Huge thanks to <a
href="https://github.com/icyJoseph"><code>@​icyJoseph</code></a>, <a
href="https://github.com/sokra"><code>@​sokra</code></a>, <a
href="https://github.com/wbinnssmith"><code>@​wbinnssmith</code></a>, <a
href="https://github.com/eps1lon"><code>@​eps1lon</code></a> and <a
href="https://github.com/ztanner"><code>@​ztanner</code></a> for
helping!</p>
<h2>v16.2.2</h2>
<blockquote>
<p>[!NOTE]
This release is backporting bug fixes. It does <strong>not</strong>
include all pending features/changes on canary.</p>
</blockquote>
<h3>Core Changes</h3>
<ul>
<li>backport: Move expanded adapters docs to API reference (<a
href="https://redirect.github.com/vercel/next.js/issues/92115">#92115</a>)
(<a
href="https://redirect.github.com/vercel/next.js/issues/92129">#92129</a>)</li>
<li>Backport: TypeScript v6 deprecations for baseUrl and
moduleResolution (<a
href="https://redirect.github.com/vercel/next.js/issues/92130">#92130</a>)</li>
<li>[create-next-app] Skip interactive prompts when CLI flags are
provided (<a
href="https://redirect.github.com/vercel/next.js/issues/91840">#91840</a>)</li>
<li>next.config.js: Accept an option for serverFastRefresh (<a
href="https://redirect.github.com/vercel/next.js/issues/91968">#91968</a>)</li>
<li>Turbopack: enable server HMR for app route handlers (<a
href="https://redirect.github.com/vercel/next.js/issues/91466">#91466</a>)</li>
<li>Turbopack: exclude metadata routes from server HMR (<a
href="https://redirect.github.com/vercel/next.js/issues/92034">#92034</a>)</li>
<li>Fix CI for glibc linux builds</li>
<li>Backport: disable bmi2 in qfilter <a
href="https://redirect.github.com/vercel/next.js/issues/92177">#92177</a></li>
<li>[backport] Fix CSS HMR on Safari (<a
href="https://redirect.github.com/vercel/next.js/issues/92174">#92174</a>)</li>
</ul>
<h3>Credits</h3>
<p>Huge thanks to <a
href="https://github.com/nextjs-bot"><code>@​nextjs-bot</code></a>, <a
href="https://github.com/icyJoseph"><code>@​icyJoseph</code></a>, <a
href="https://github.com/ijjk"><code>@​ijjk</code></a>, <a
href="https://github.com/gaojude"><code>@​gaojude</code></a>, <a
href="https://github.com/wbinnssmith"><code>@​wbinnssmith</code></a>, <a
href="https://github.com/lukesandberg"><code>@​lukesandberg</code></a>,
and <a href="https://github.com/bgw"><code>@​bgw</code></a> for
helping!</p>
<h2>v16.2.1</h2>
<blockquote>
<p>[!NOTE]
This release is backporting bug fixes. It does <strong>not</strong>
include all pending features/changes on canary.</p>
</blockquote>
<h3>Core Changes</h3>
<ul>
<li>docs: post release amends (<a
href="https://redirect.github.com/vercel/next.js/issues/91715">#91715</a>)</li>
<li>docs: fix broken Activity Patterns demo link in preserving UI state
guide (<a
href="https://redirect.github.com/vercel/next.js/issues/91698">#91698</a>)</li>
<li>Fix adapter outputs for dynamic metadata routes (<a
href="https://redirect.github.com/vercel/next.js/issues/91680">#91680</a>)</li>
<li>Turbopack: fix webpack loader runner layer (<a
href="https://redirect.github.com/vercel/next.js/issues/91727">#91727</a>)</li>
<li>Fix server actions in standalone mode with
<code>cacheComponents</code> (<a
href="https://redirect.github.com/vercel/next.js/issues/91711">#91711</a>)</li>
<li>turbo-persistence: remove Unmergeable mmap advice (<a
href="https://redirect.github.com/vercel/next.js/issues/91713">#91713</a>)</li>
<li>Fix layout segment optimization: move app-page imports to
server-utility transition (<a
href="https://redirect.github.com/vercel/next.js/issues/91701">#91701</a>)</li>
<li>Turbopack: lazy require metadata and handle TLA (<a
href="https://redirect.github.com/vercel/next.js/issues/91705">#91705</a>)</li>
<li>[turbopack] Respect <code>{eval:true}</code> in worker_threads
constructors (<a
href="https://redirect.github.com/vercel/next.js/issues/91666">#91666</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/vercel/next.js/commit/d5f649b2f4affdad1009cb178c1e3b37f4f1ad3f"><code>d5f649b</code></a>
v16.2.3</li>
<li><a
href="https://github.com/vercel/next.js/commit/28739286a88a83ab2d4e1899bdb4eb4ee7bee9a9"><code>2873928</code></a>
[16.x] Avoid consuming cyclic models multiple times (<a
href="https://redirect.github.com/vercel/next.js/issues/75">#75</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/d7c77653602ae2009595cc71eb10f1b8828cc789"><code>d7c7765</code></a>
[backport]: Ensure app-page reports stale ISR revalidation errors via
onReque...</li>
<li><a
href="https://github.com/vercel/next.js/commit/c573e8c4f3208711f52bf3b64f5db238c9164762"><code>c573e8c</code></a>
fix(server-hmr): metadata routes overwrite page runtime HMR handler (<a
href="https://redirect.github.com/vercel/next.js/issues/92273">#92273</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/57b8f659060e1d0f202273a9ed9e56d40f1d1a9c"><code>57b8f65</code></a>
next-core: deduplicate output assets and detect content conflicts on
emit (<a
href="https://redirect.github.com/vercel/next.js/issues/9">#9</a>...</li>
<li><a
href="https://github.com/vercel/next.js/commit/f158df18bd926d0c2165ad309bbb561d7e73e74a"><code>f158df1</code></a>
Fix styled-jsx race condition: styles lost due to concurrent rendering
(<a
href="https://redirect.github.com/vercel/next.js/issues/92459">#92459</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/356d605b5831ffbe12ce9c9641e5e2e55d203523"><code>356d605</code></a>
turbo-tasks-backend: stability fixes for task cancellation and error
handling...</li>
<li><a
href="https://github.com/vercel/next.js/commit/3b77a6e2670ce81d686111b8e466eec612fa1867"><code>3b77a6e</code></a>
Fix DashMap read-write self-deadlock in task_cache causing hangs (<a
href="https://redirect.github.com/vercel/next.js/issues/92210">#92210</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/b2f208ae98645d119a7e3388ab8a407005619dd8"><code>b2f208a</code></a>
Backport: new view-transitions guide, update and fixes (<a
href="https://redirect.github.com/vercel/next.js/issues/92264">#92264</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/52faae3d94641584e13691238df5be158d0f00fb"><code>52faae3</code></a>
v16.2.2</li>
<li>Additional commits viewable in <a
href="https://github.com/vercel/next.js/compare/v16.1.5...v16.2.3">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=next&package-manager=npm_and_yarn&previous-version=16.1.5&new-version=16.2.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [yaml](https://github.com/eemeli/yaml) from 2.7.0 to 2.8.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/eemeli/yaml/releases">yaml's
releases</a>.</em></p>
<blockquote>
<h2>v2.8.3</h2>
<ul>
<li>Add <code>trailingComma</code> ToString option for multiline flow
formatting (<a
href="https://redirect.github.com/eemeli/yaml/issues/670">#670</a>)</li>
<li>Catch stack overflow during node composition (1e84ebb)</li>
</ul>
<h2>v2.8.2</h2>
<ul>
<li>Serialize -0 as -0 (<a
href="https://redirect.github.com/eemeli/yaml/issues/638">#638</a>)</li>
<li>Do not double newlines for empty map values (<a
href="https://redirect.github.com/eemeli/yaml/issues/642">#642</a>)</li>
</ul>
<h2>v2.8.1</h2>
<ul>
<li>Preserve empty block literals (<a
href="https://redirect.github.com/eemeli/yaml/issues/634">#634</a>)</li>
</ul>
<h2>v2.8.0</h2>
<ul>
<li>Add node cache for faster alias resolution (<a
href="https://redirect.github.com/eemeli/yaml/issues/612">#612</a>)</li>
<li>Re-introduce compatibility with Node.js 14.6 (<a
href="https://redirect.github.com/eemeli/yaml/issues/614">#614</a>)</li>
<li>Add <code>--merge</code> option to CLI tool (<a
href="https://redirect.github.com/eemeli/yaml/issues/611">#611</a>)</li>
<li>Improve error for tag resolution error on null value (<a
href="https://redirect.github.com/eemeli/yaml/issues/616">#616</a>)</li>
<li>Allow empty string as plain scalar representation, for failsafe
schema (<a
href="https://redirect.github.com/eemeli/yaml/issues/616">#616</a>)</li>
<li>docs: include cli example (<a
href="https://redirect.github.com/eemeli/yaml/issues/617">#617</a>)</li>
</ul>
<h2>v2.7.1</h2>
<ul>
<li>Do not allow seq with single-line collection value on same line with
map key (<a
href="https://redirect.github.com/eemeli/yaml/issues/603">#603</a>)</li>
<li>Improve warning &amp; avoid TypeError on bad YAML 1.1 nodes (<a
href="https://redirect.github.com/eemeli/yaml/issues/610">#610</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/eemeli/yaml/commit/ce14587484822bffb0f7d31aefedcaf2dc0d0387"><code>ce14587</code></a>
2.8.3</li>
<li><a
href="https://github.com/eemeli/yaml/commit/1e84ebbea7ec35011a4c61bbb820a529ee4f359b"><code>1e84ebb</code></a>
fix: Catch stack overflow during node composition</li>
<li><a
href="https://github.com/eemeli/yaml/commit/6b24090280eaaab5040112bba41ccef57f39c2d5"><code>6b24090</code></a>
ci: Include Prettier check in lint action</li>
<li><a
href="https://github.com/eemeli/yaml/commit/9424dee38c85163fad53ac27533c7c4bdaf7495d"><code>9424dee</code></a>
chore: Refresh lockfile</li>
<li><a
href="https://github.com/eemeli/yaml/commit/d1aca82bc15a4c261bdc58561d32189a5d3a45ef"><code>d1aca82</code></a>
Add trailingComma ToString option for multiline flow formatting (<a
href="https://redirect.github.com/eemeli/yaml/issues/670">#670</a>)</li>
<li><a
href="https://github.com/eemeli/yaml/commit/43215099f7fcdac422d778c15e70d83c691b0e41"><code>4321509</code></a>
ci: Drop the branch filter from GitHub PR actions</li>
<li><a
href="https://github.com/eemeli/yaml/commit/47207d0fc7d4f863cd5fbdcff1378637bd93e847"><code>47207d0</code></a>
chore: Update docs-slate</li>
<li><a
href="https://github.com/eemeli/yaml/commit/5212faeed5936d1fa291d2f28672e4a96e2c2c5d"><code>5212fae</code></a>
chore: Update docs-slate</li>
<li><a
href="https://github.com/eemeli/yaml/commit/086fa6b5bae325da18734750cddee231ce578930"><code>086fa6b</code></a>
2.8.2</li>
<li><a
href="https://github.com/eemeli/yaml/commit/95f01e98032ddf199b42bb3ba0737303b35ef752"><code>95f01e9</code></a>
chore: Add funding to package.json</li>
<li>Additional commits viewable in <a
href="https://github.com/eemeli/yaml/compare/v2.7.0...v2.8.3">compare
view</a></li>
</ul>
</details>
<br />

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…icrosoft#27620)

### Description

Fix crash in ConstantFolding when nodes have missing optional outputs.

ConstantFolding previously iterated over `node->OutputDefs()` and
attempted to resolve an OrtValue index for every output. However, some
operators (e.g. `Unique`) have optional outputs that may not exist in
the graph (`NodeArg::Exists() == false`).

`OptimizerExecutionFrame` only registers OrtValues for outputs that
actually exist when building the name→index map. When ConstantFolding
requested an index for a missing optional output, `GetMLValueIndex()`
returned `-1`. This invalid index was inserted into `fetch_mlvalue_idxs`
and later caused an assertion in `ExecutionFrame::GetMLValue()` during
session initialization.

This PR fixes the issue by:

* Skipping outputs where `NodeArg::Exists() == false`
* Preventing invalid indices from entering `fetch_mlvalue_idxs`
* Skipping constant folding for the node if an output index cannot be
resolved
* Maintaining correct mapping between fetch indices and the original
output indices
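
The skip-and-remap logic listed above can be sketched as follows. This is an illustration only: `OutputDef`, `FetchPlan`, and `CollectFetches` are hypothetical names, and the `exists` flag stands in for `NodeArg::Exists()`; none of this is the actual ConstantFolding code.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a node output; `exists` mirrors NodeArg::Exists().
struct OutputDef {
  std::string name;
  bool exists;
};

// Hypothetical result type: the indices to fetch plus, for each fetch,
// the position of the output it came from.
struct FetchPlan {
  std::vector<int> fetch_idxs;
  std::vector<std::size_t> output_slots;
};

// Returns std::nullopt when an existing output has no valid index,
// which signals that the node should not be constant-folded.
std::optional<FetchPlan> CollectFetches(
    const std::vector<OutputDef>& outputs,
    const std::unordered_map<std::string, int>& name_to_idx) {
  FetchPlan plan;
  for (std::size_t i = 0; i < outputs.size(); ++i) {
    if (!outputs[i].exists) continue;  // missing optional output: skip it
    const auto it = name_to_idx.find(outputs[i].name);
    if (it == name_to_idx.end() || it->second < 0) return std::nullopt;
    plan.fetch_idxs.push_back(it->second);
    plan.output_slots.push_back(i);  // remember the original output position
  }
  return plan;
}
```

Keeping `output_slots` alongside `fetch_idxs` is what lets a caller write fetched value `k` back to output `output_slots[k]` even when earlier optional outputs were skipped.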

---

### Motivation and Context

The failure is reproducible with the model attached in microsoft#26505.

Before this fix:

* session initialization fails with `ORT_ENABLE_BASIC`
* disabling `ConstantFolding` allows the model to load

After this fix:

* the model loads successfully with `ORT_ENABLE_BASIC`
* invalid indices are no longer inserted into `fetch_mlvalue_idxs`

Fixes microsoft#26505
### Description
Fixes 3 ICM issues:


https://portal.microsofticm.com/imp/v5/incidents/details/31000000575344/summary

https://portal.microsofticm.com/imp/v5/incidents/details/31000000575473/summary

https://portal.microsofticm.com/imp/v5/incidents/details/31000000574999/summary




### Motivation and Context
Fix ICMs

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

Fix `linux-wasm-ci.yml` setup-feed having a spurious `templates/` path
prefix.

### Motivation and Context
### Description

Fixes two overflow/underflow bugs in the CPU RNN kernel (`rnn.cc`):

- **`SafeInt` for GEMM M-dimension**: `seq_length * batch_size` was
computed as a raw `int64_t` multiply before `narrow<int>()`, meaning an
overflow would be UB before the check could fire. Replaced with
`SafeInt<int64_t>(seq_length) * batch_size` for a checked multiply.

- **`seq_length == 0` guard in `Assign_Y_h`**: For the forward
direction, `last_time_step = seq_length - 1` underflows to `-1` when
`seq_length == 0`, producing a negative `y_offset` and out-of-bounds
read. Added an early-exit that zero-fills Y_h for the direction and
returns. Also handles `sequence_lens[batch] == 0` (same underflow path),
zeroing the affected batch slot and skipping via `continue`.
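
Both guards can be approximated in portable C++ without `SafeInt`. The sketch below is an assumption-laden illustration, not the code in `rnn.cc`: `CheckedGemmM` and `LastTimeStep` are hypothetical helper names, and a division-based check replaces `SafeInt`.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Checked multiply of two non-negative int64 values followed by a checked
// narrowing to int. Returns std::nullopt instead of overflowing.
std::optional<int> CheckedGemmM(std::int64_t seq_length, std::int64_t batch_size) {
  if (seq_length < 0 || batch_size < 0) return std::nullopt;
  if (batch_size != 0 &&
      seq_length > std::numeric_limits<std::int64_t>::max() / batch_size) {
    return std::nullopt;  // the int64 product would overflow
  }
  const std::int64_t m = seq_length * batch_size;
  if (m > std::numeric_limits<int>::max()) return std::nullopt;  // does not fit in int
  return static_cast<int>(m);
}

// Index of the last valid time step, or std::nullopt for an empty sequence.
// A caller would zero-fill the corresponding Y_h slot instead of reading.
std::optional<std::int64_t> LastTimeStep(std::int64_t seq_length) {
  if (seq_length <= 0) return std::nullopt;
  return seq_length - 1;
}
```

The point of checking before multiplying is that the overflow never happens, so the later narrowing check always sees a well-defined value.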

### Motivation and Context

Silent UB from integer overflow/underflow in shape-derived index
arithmetic can corrupt memory or produce incorrect results without any
diagnostic signal. These cases are legal per the ONNX spec (empty
sequences, per-batch zero-length sequences) and must be handled
explicitly.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
This pull request introduces several enhancements and refactorings to
the resource accounting and execution provider (EP) infrastructure, with
a focus on better support for plugin-based CUDA execution providers. The
most significant changes include the addition of type-erased arithmetic
for resource accounting, improved handling of resource budgets for
plugin EPs, and more robust device matching logic. These updates
increase maintainability, enforce stricter type safety, and ensure
correct resource tracking across both in-tree and plugin-based EPs.

**Resource accounting improvements:**

* Added type-erased arithmetic functions (`AddResourceCounts`,
`ResourceCountExceeds`, `FormatResourceCount`) for `ResourceCount` to
enforce exhaustive handling of variant types and improve type safety.
[[1]](diffhunk://#diff-7b1c9ef14536f9a66ed370cb729b6609d12c5907b460d8f145a7ad5a401e0fb6R29-R40)
[[2]](diffhunk://#diff-03c846683a6d76ded189d6ef24dc545da89ca418d0bce5cf1243d33cf1e2ac06R320-R351)
* Refactored the `IResourceAccountant` interface: replaced
`ResetPendingWeights` with `ResetForNewPass`, which resets both the stop
flag and pending weights, and introduced a protected
`ResetPendingWeightsImpl` for subclass-specific cleanup.
[[1]](diffhunk://#diff-7b1c9ef14536f9a66ed370cb729b6609d12c5907b460d8f145a7ad5a401e0fb6L64-R83)
[[2]](diffhunk://#diff-7b1c9ef14536f9a66ed370cb729b6609d12c5907b460d8f145a7ad5a401e0fb6R92-R96)
[[3]](diffhunk://#diff-03c846683a6d76ded189d6ef24dc545da89ca418d0bce5cf1243d33cf1e2ac06L123-R123)
[[4]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL280-R280)
[[5]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL351-R351)

**Plugin CUDA EP and resource budget enforcement:**

* Added `kCudaPluginExecutionProvider` constant and updated logic to
ensure plugin EPs correctly map to their in-tree accountant counterparts
and are included in device matching and partitioning.
[[1]](diffhunk://#diff-442c270eea3703252c48e97a7573960e14bf27a45a4443348840ed565330bf70R34)
[[2]](diffhunk://#diff-b20f416b9fe3b85423eea6707c38753351a3f1b8ef7a319858b27794507e0686L102)
[[3]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5L186-R187)
[[4]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5L206-R207)
[[5]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5L228-R229)
[[6]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL1192-R1200)
* Updated plugin EP infrastructure to pass and utilize resource
accountant pointers, enabling host-side resource budget enforcement for
plugin EPs and ensuring correct node assignment.
[[1]](diffhunk://#diff-fb00c9a234d8cc889927a22de94acfcfd893b56505e8ed613961b1bf13c0e435R19)
[[2]](diffhunk://#diff-fb00c9a234d8cc889927a22de94acfcfd893b56505e8ed613961b1bf13c0e435R54-R57)
[[3]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5R16-R17)
[[4]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5L239-R259)
[[5]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5R273-R281)
[[6]](diffhunk://#diff-0890d267a71ca02f4173c2ab226e6c5707fcbbf6bbb5f602fa5d92aa82f42a80R14-R22)
[[7]](diffhunk://#diff-0890d267a71ca02f4173c2ab226e6c5707fcbbf6bbb5f602fa5d92aa82f42a80R233-R241)

**Device matching and partitioning:**

* Improved device matching heuristics to consider both in-tree and
plugin CUDA EPs, and updated logic to prefer runtime device ordinals for
more reliable device selection.

Other minor changes include code style cleanups and additional includes
for completeness.
### Description

Modify ADO pipeline feed setup template to be idempotent.


### Motivation and Context

ADO pipelines are currently stateful, and feed setup requires modifying
files outside of the agent work directories.
Previous jobs may have already modified these files in incompatible
ways, so always override.
Copilot AI and others added 22 commits May 12, 2026 13:58
)

- [x] Cap existing opset 7 Sin/Cos kernels to versioned 7-21
- [x] Add new opset 22 Sin/Cos kernels with BFloat16 support (HFDX)
- [x] Update forward declarations and BuildKernelCreateInfo entries
- [x] Add opset 22 tests for Sin and Cos
- [x] Rebase onto latest main, resolve conflicts
- [x] Merge commit with latest main to resolve conflicts

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description
Optimize the `LinearAttention` op with `subgroupAdd()`.

- Detect the adapter's `subgroupMinSize` to allocate workgroup shared
memory
- Drastically reduces workgroup shared memory usage (workgroup_size_x *
TILE_V → MAX_SG * TILE_V)
- Eliminates most `workgroupBarrier()` calls in the subgroup reduction

The optimization is gated behind `subgroup_min_size`, which is enabled
when the device supports `wgpu::FeatureName::Subgroups`. 
The original is preserved as fallback.

## Qwen3.5-4B Performance Benchmarks

| Metric | Prefill Speed (TPS) | Decode Speed (TPS) | Prefill Improvement |
| :--- | :--- | :--- | :--- |
| **Default** | 719.6 | 29.6 | - |
| **Optimized** | 929.8 | 29.7 | 1.29x |

**Test Environment:**
* **Hardware:** Intel Panther Lake
* **Configuration:** Prefill: 1024, Decode: 128


### Motivation and Context
See above.
This pull request improves the robustness and correctness of the
upsampling code in ONNX Runtime, especially for anti-aliased linear and
trilinear upsampling on the CPU. The changes focus on safer handling of
large tensor dimensions, improved memory safety, and code clarity for
interpolation and weight calculation. The most important changes are
grouped below.

**Dimension and Overflow Handling:**

* Added overflow checks for multiplication of large tensor dimensions to
prevent integer overflows during output size calculations, using a new
`checked_mul_int64` lambda.
[[1]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3R1106-R1118)
[[2]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3R1292-R1309)
* Ensured all tensor dimensions are validated to fit within the
`int32_t` range before narrowing, improving safety for large tensors.
[[1]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3R1133-R1143)
[[2]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3L1131-R1234)

**Anti-Alias Upsampling Refactor:**

* Refactored the anti-alias upsampling filter setup to use a new
`InterpolationBound` struct for per-pixel coordinate ranges, replacing
the previous flat vector approach. This improves code clarity and
reduces indexing errors.
[[1]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38R25-R35)
[[2]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38L126-R159)
[[3]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38L155-L208)
* Updated all interpolation and extrapolation routines to use the new
`bounds` structure, improving readability and maintainability.
[[1]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38L272-R290)
[[2]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38L341-R353)
[[3]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38L391-R400)
* Implements CUDA NHWC cubic antialias support.

**Memory and Type Safety:**

* Improved buffer management and type safety in filter weight
calculation, including more robust normalization and quantization for
int8/uint8 types.
* Fixed a minor logic bug in extrapolation handling by ensuring the loop
is only entered if there are out-of-bound indices.

**General Code Improvements:**

* Added missing `<limits>` include and replaced some magic numbers with
named variables for clarity.
[[1]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3R6)
[[2]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3L1198-R1257)
[[3]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3L1221-R1272)

These changes together make the upsampling code more robust, especially
for large or edge-case tensors, and improve maintainability for future
development.
…ernels for the NCHWc blocked format in MLAS (microsoft#28411)

### Description

New kernel files:
- riscv64/sconv_depthwise_kernel_rvv.cpp — RVV-optimized 3x3 stride-1
depthwise convolution (NCHW format), replacing the MLAS_FLOAT32X4
generic vectorized version
- riscv64/sconv_nchwc_kernel_rvv.cpp — 7 NCHWc kernels using
vfloat32m4_t (LMUL=4, BlockSize=16):
    - Direct NCHW conv (MlasConvNchwFloatKernelRvv)
    - Direct NCHWc conv (MlasConvNchwcFloatKernelRvv)
    - Depthwise NCHWc conv (MlasConvDepthwiseFloatKernelRvv)
    - Pointwise NCHWc conv (MlasConvPointwiseFloatKernelRvv)
    - Max/AvgExcludePad/AvgIncludePad pooling


### Motivation and Context

Following microsoft#28261, optimize more MLAS kernels using RISC-V Vector (RVV)
extensions.

Please Note:

- On the K3 (SpacemiT X60), VLEN=256. With LMUL=4 and e32, the hardware
can hold (256/32) * 4 = 32 floats per vector register group — but we
only request 16. So we're using half the available vector width.

- The reason is that BlockSize=16 is baked into the NCHWc data layout
across the whole framework (matching ARM64 NEON). Changing it to 32
would require a different NCHWc format and is not a localized change.
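
The lane arithmetic in the note above can be checked at compile time. The constants are the figures quoted in the note; `LanesPerGroup` is a hypothetical helper used only for this illustration.

```cpp
#include <cstddef>

// Number of elements in one vector register group: (VLEN / SEW) * LMUL.
constexpr std::size_t LanesPerGroup(std::size_t vlen_bits, std::size_t sew_bits,
                                    std::size_t lmul) {
  return (vlen_bits / sew_bits) * lmul;
}

constexpr std::size_t kVlen = 256;      // VLEN quoted in the note
constexpr std::size_t kSew = 32;        // e32: 32-bit float elements
constexpr std::size_t kLmul = 4;        // vfloat32m4_t
constexpr std::size_t kBlockSize = 16;  // NCHWc block size quoted in the note

static_assert(LanesPerGroup(kVlen, kSew, kLmul) == 32,
              "one m4 register group holds 32 floats at VLEN=256");
static_assert(kBlockSize * 2 == LanesPerGroup(kVlen, kSew, kLmul),
              "a 16-element block uses half of the available lanes");
```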

### Benchmark (SpacemiT K3, VLEN=256, 8-core)
All tests pass with zero numerical error.

| Kernel | Speedup (RVV vs scalar) |
| :--- | :--- |
| Direct NCHW Conv | 1.27–1.29x |
| Direct NCHWc Conv | 1.93–1.95x |
| Depthwise NCHWc Conv | 10.8–12.5x |
| Pointwise NCHWc Conv | 29.4–30.4x |
| Max Pooling | 12.5–20.0x |
| Avg Pooling (exclude pad) | 4.0–4.3x |
| Avg Pooling (include pad) | 5.5–5.8x |
### Description

This adds a fast path `TryInitializeMapFromRawData` to `LabelEncoder_4`
that reads numeric key/value tensor attributes directly from their
`raw_data` buffer, avoiding the intermediate `std::vector` allocation by
the existing `GetAttribute` helper.
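
The idea of filling the map straight from raw bytes can be sketched as below. This is not the `TryInitializeMapFromRawData` implementation: `BuildMapFromRaw` is a hypothetical function, the key/value types are fixed to `int64_t`/`float` for brevity, and the buffers are assumed to be in host byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>

// Builds a key->value map directly from two raw byte buffers, without first
// materializing std::vector copies of the keys and values.
// Returns false if a buffer is not a whole number of elements or if the
// element counts differ. Assumes the bytes are in host byte order.
bool BuildMapFromRaw(const std::string& raw_keys, const std::string& raw_values,
                     std::unordered_map<std::int64_t, float>& out) {
  if (raw_keys.size() % sizeof(std::int64_t) != 0) return false;
  if (raw_values.size() % sizeof(float) != 0) return false;
  const std::size_t n = raw_keys.size() / sizeof(std::int64_t);
  if (raw_values.size() / sizeof(float) != n) return false;

  out.clear();
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    std::int64_t key;
    float value;
    // memcpy avoids unaligned or type-punned reads from the byte buffers.
    std::memcpy(&key, raw_keys.data() + i * sizeof(key), sizeof(key));
    std::memcpy(&value, raw_values.data() + i * sizeof(value), sizeof(value));
    out.emplace(key, value);
  }
  return true;
}
```

The only per-element temporaries are one key and one value, so peak memory is the raw buffers plus the map itself.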

### Motivation and Context

For ONNX artifacts with LabelEncoders with millions of numeric keys and
values, session creation leads to unnecessarily large memory peaks
because temporary `std::vector`s with all keys and values are allocated.
This is a real pain point in production use.

This change was verified with an added test, and allocations were
checked with large models via Python with
[memray](https://github.com/bloomberg/memray) on macOS and inside an
Ubuntu devcontainer.

Thank you for taking a look!
### Description

Follow-up to PR microsoft#28097. Applies the same `_torch_load_weights_only()`
wrapper to the two remaining `torch.load()` call sites.

`torch.load` can deserialize arbitrary Python pickle payloads. Using
`weights_only=True` restricts loading to tensor/checkpoint data on
supported PyTorch versions and is the safer default. The wrapper
gracefully falls back to the default `torch.load` behavior on older
PyTorch versions that do not support the `weights_only` parameter.

### Summary of Changes

| File | Change |
|------|--------|
| `onnxruntime/test/testdata/test_data_generation/lr_scheduler/lr_scheduler_test_data_generator.py` | Adds `_torch_load_weights_only()` helper and uses it when loading scheduler/optimizer state dicts. |
| `orttraining/orttraining/test/python/orttraining_test_ortmodule_pytorch_ddp.py` | Adds `_torch_load_weights_only()` helper and uses it when loading DDP model checkpoint. |

### Motivation and Context

These were the last two `torch.load()` calls in the repository without
`weights_only=True`. While both are in test/tooling code with low direct
risk, this change ensures consistency with the pattern established in PR
microsoft#28097 and eliminates all unsafe deserialization call sites.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#28004)

### Description
Fix out-of-bounds reads in `SoftmaxCrossEntropyLoss` and
`SoftmaxCrossEntropyLossGrad` (CPU EP) when label values are outside
`[0, C)`. Same class of bug fixed in microsoft#27568 for
`SparseSoftmaxCrossEntropy`.

The CUDA kernels currently only validate labels via
`CUDA_KERNEL_ASSERT`, which is a no-op in release builds. CUDA hardening
is not part of this PR.

### Changes
- Forward: bounds check folded into the three per-sample loops (after
`ignore_index` skip, before any `weight_data[label]` /
`log_prob_data[i*C + label]` access).
- Backward: single upfront serial bounds check (parallel-for lambdas
cannot return Status); comment explains why.
- Validate `weight_shape[0] == C`.
- Move `weight_data[label]` access after `ignore_index` check in grad
weighted paths.
- `N_D * C` wrapped in `SafeInt`; `gsl::narrow<int>` for `N_D` and `C`.
Overflow / truncation returns `INVALID_ARGUMENT`.
- `Eigen::Index` size guard: `ORT_ENFORCE` -> `ORT_RETURN_IF`.
- `IsScalar(ignore_index)` check: `ORT_ENFORCE` -> `ORT_RETURN_IF_NOT`
in both forward and backward.
- Pre-existing wrong-sized `memset` in backward (`sizeof(T1) * N_D`)
corrected to `sizeof(T1) * probability_shape.Size()`. The previous code
was effectively redundant (subsequent parallel-for paths overwrite all
`N_D * C` entries) so this is cleanup, not an active OOB.
- Renamed `weight_smaple` -> `weight_sample`.
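
The label validation amounts to a range check that runs after the `ignore_index` skip and before any indexed access. A minimal sketch, assuming labels are already widened to `int64_t`; `FirstInvalidLabel` is a hypothetical helper, not a function from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the position of the first label outside [0, num_classes),
// ignoring entries equal to ignore_index. Returns std::nullopt if every
// label is valid.
std::optional<std::size_t> FirstInvalidLabel(const std::vector<std::int64_t>& labels,
                                             std::int64_t num_classes,
                                             std::int64_t ignore_index) {
  for (std::size_t i = 0; i < labels.size(); ++i) {
    const std::int64_t label = labels[i];
    if (label == ignore_index) continue;  // skipped before any indexed access
    if (label < 0 || label >= num_classes) return i;
  }
  return std::nullopt;
}
```

A serial pass like this can report a failure to the caller directly, which suits a backward path whose parallel loops have no way to return a status.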

### Tests
11 regression tests in `cross_entropy_test.cc`:
- Label too large (forward + grad, int64 + int32)
- Negative label
- Label too large with weights (MEAN and SUM reductions)
- Higher-dim logit `[2,4,2,3]` with label `[2,2,3]`
- `SoftmaxCrossEntropyLossInternal` and
`SoftmaxCrossEntropyLossInternalGrad` with `ignore_index` as a runtime
tensor input

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description
The per-axis SafeInt multiplication added in microsoft#27566 detects overflow
when computing an individual output dimension, but combinations of
per-axis repeats can still request an int64-representable total that is
unreasonably large. This PR adds a 4 GiB upper bound on the total tiled
byte count in the CPU, CUDA, and WebGPU Tile kernels and extends
validation/tests for the new behavior.

### Changes
- `onnxruntime/core/providers/cpu/tensor/tile.cc`: compute the output
shape with division-based checks that reject negative repeats, int64
overflow, and total tiled byte counts above the supported maximum before
allocation. The maximum is clamped to `size_t::max()` for 32-bit builds,
and the bound applies to `std::string` tensors as well because their
output buffers still allocate per-element backing storage.
- `onnxruntime/core/providers/cuda/tensor/tile.cc`: same output-size
bound applied to keep CPU and CUDA behavior consistent.
- `onnxruntime/core/providers/webgpu/tensor/tile.cc`: same output-size
bound applied to keep WebGPU behavior consistent, plus repeats
rank/length validation matching CPU/CUDA.
- `onnxruntime/test/providers/cpu/tensor/tile_op_test.cc`: tests cover
malformed repeats rank/length, 1-D, multi-axis, double (8-byte element),
and string cases that exceed the bound, plus a positive test confirming
a moderate (4 MB) output is still accepted.
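
A division-based bound of this kind can be sketched as follows. `TiledShape` and `kMaxTileBytes` are hypothetical names, the limit is taken from the 4 GiB figure in the description, and the 32-bit `size_t` clamp mentioned above is left out for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

constexpr std::uint64_t kMaxTileBytes = 4ULL * 1024 * 1024 * 1024;  // 4 GiB

// Computes the tiled shape, or std::nullopt if a repeat is negative, a
// per-axis product overflows int64, or the total byte count exceeds
// kMaxTileBytes. Only divisions are used for the checks, so no intermediate
// product can overflow.
std::optional<std::vector<std::int64_t>> TiledShape(
    const std::vector<std::int64_t>& dims,
    const std::vector<std::int64_t>& repeats,
    std::uint64_t element_size) {
  if (dims.size() != repeats.size() || element_size == 0) return std::nullopt;

  std::vector<std::int64_t> out;
  out.reserve(dims.size());
  bool empty = false;
  for (std::size_t i = 0; i < dims.size(); ++i) {
    if (dims[i] < 0 || repeats[i] < 0) return std::nullopt;
    if (dims[i] != 0 &&
        repeats[i] > std::numeric_limits<std::int64_t>::max() / dims[i]) {
      return std::nullopt;  // per-axis int64 overflow
    }
    out.push_back(dims[i] * repeats[i]);
    empty = empty || out.back() == 0;
  }
  if (empty) return out;  // a zero-sized output allocates nothing

  // Divide the element budget by each axis; reaching zero means the product
  // of the axes is larger than the budget.
  std::uint64_t remaining = kMaxTileBytes / element_size;
  for (const std::int64_t axis : out) {
    remaining /= static_cast<std::uint64_t>(axis);
    if (remaining == 0) return std::nullopt;
  }
  return out;
}
```

Because nested floor divisions equal a single floor division by the product, the final check is exact: an output of exactly 4 GiB passes and one element more is rejected.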

### Motivation and Context
Follow-up to microsoft#27566, which fixed per-axis overflow but did not bound
total allocation size.

---------

Co-authored-by: Gopalakrishnan Nallasamy <gopalakrishnan.nallasamy@microsoft.com>
Co-authored-by: Gopalakrishnan Nallasamy <gnallasamy@microsoft.com>
…soft#28354)

## Description

Adds support for float/float16 zero points in the 2-bit MatMulNBits LUT
GEMM path, enabling AMD QAD/Quark 2-bit quantization which requires a
fractional zero point of 1.5.

Addresses microsoft#28162

### Problem

QAD 2-bit quantization uses non-uniform levels `[-1, -1/3, 1/3, 1]`,
expressed via `dequant = (q - 1.5) * scale`. The zero point 1.5 cannot
be represented as a packed uint8 value. The existing LUT GEMM packing
API only accepted `uint8_t*` zero points, and the fallback dequant path
crashed with `ORT_ENFORCE(nbits_ == 4)` when encountering 2-bit + float
ZP.
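
To make the formula concrete, here is a scalar sketch of `dequant = (q - zp) * scale` for the four 2-bit codes in one byte. `Dequant2Bit` is a hypothetical helper, and the lowest-bits-first packing order is an assumption for illustration; the real packed-B layout may differ.

```cpp
#include <array>
#include <cstdint>

// Dequantizes the four 2-bit codes packed in one byte using a float zero point.
// Assumption: codes are packed lowest bits first.
std::array<float, 4> Dequant2Bit(std::uint8_t packed, float scale, float zero_point) {
  std::array<float, 4> out{};
  for (int i = 0; i < 4; ++i) {
    const int q = (packed >> (2 * i)) & 0x3;  // code in {0, 1, 2, 3}
    out[i] = (static_cast<float>(q) - zero_point) * scale;
  }
  return out;
}
```

With `zero_point = 1.5` the codes 0..3 map to `{-1.5, -0.5, 0.5, 1.5} * scale`, so a scale of 2/3 gives the QAD levels `[-1, -1/3, 1/3, 1]`. No integer zero point produces that symmetric set, which is why a float ZP is needed.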

### Changes

**MLAS layer** — Widened `MlasLutGemmPack()` to accept `const void*
QuantBZeroPoint` + `bool IsFloatZeroPoint`, following the existing
`MlasQNBitGemmPackQuantBData` convention. The AVX2 packer reads float ZP
values directly per quantization group when `IsFloatZeroPoint` is set,
computing the same `(zp - midpoint) * scale` correction stored in the
packed buffer. The compute kernel (`TMACComputeGemm_avx2`) is unchanged
— it already consumes ZP as a float correction during accumulation.

**MatMulNBits CPU kernel** — Relaxed the PrePack early-exit guard to
allow float ZP into the LUT GEMM path (not non-LUT paths). Added
fp16→fp32 conversion for ZP tensors, matching how scales are already
handled. Fixed the Compute() path to null out prepacked zero_points to
avoid a null dereference in CheckInputs. Fixed the 2-bit fallback
dequant path: relaxed the `nbits_==4` enforce, added inline 2-bit scalar
dequant for float and MLFloat16 ZP with correct packed-B indexing for
padded K shapes.

**Tests** — Added MLAS-level float ZP tests across block lengths
32/64/128 with ZP values {0, 1.5, 2, 3}. Added provider-level directed
QAD tests (`zp=1.5`) verifying end-to-end correctness through the LUT
GEMM path.

### Testing

- 72 MLAS LUT GEMM tests pass (including 36 new float ZP tests)
- 13 provider-level 2-bit tests pass (including new QAD float ZP tests)
- No regressions in existing uint8 ZP tests
- lintrunner clean

### Files changed

| File | Change |
|------|--------|
| `core/mlas/inc/mlas_qnbit.h` | API: `void*` ZP + `IsFloatZeroPoint` flag |
| `core/mlas/lib/qlutgemm.h` | Dispatch typedef update |
| `core/mlas/lib/qlutgemm.cpp` | Pass-through plumbing |
| `core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp` | Float ZP packing branch |
| `contrib_ops/cpu/quantization/matmul_nbits.cc` | PrePack guard, fallback fix, ZP validation |
| `test/mlas/unittest/test_sqlutgemm.cpp` | Float ZP MLAS tests |
| `test/mlas/bench/bench_lutgemm.cpp` | Updated call signature |
| `test/contrib_ops/matmul_2bits_test.cc` | Float ZP provider tests |

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description
Replace `path::string()` / bare `std::filesystem::path(string)` with
`PathToUTF8String` / `ToPathString` in two places that handle
user-supplied paths.

### Motivation and Context
On Windows, `path::string()` and `std::filesystem::path(std::string)`
use the system ANSI code page (CP_ACP). When a model or EPContext output
file sits in a directory with non-ASCII Unicode characters, this corrupts
the path and causes:

- `ModelMetadefIdGenerator::GenerateId`: throws during session
  initialization ("No mapping for the Unicode character exists in the
  target multi-byte code page")
- `ModelGenOptions` (EPContext options): constructs a garbled path,
  failing EPContext file creation with `ENOENT`
…domUniformLike CUDA ops with BFloat16 support (microsoft#27759)

### Description

Fills the opset gap for RandomNormal, RandomNormalLike, RandomUniform,
and RandomUniformLike operators in the CUDA execution provider,
extending coverage from opset 1 to opset 22 with full BFloat16 support
for the new opset 22 registrations.

#### Changes

- **`onnxruntime/core/providers/cuda/generator/random.cc`**: Changed
each `ONNX_OPERATOR_KERNEL_EX` to `ONNX_OPERATOR_VERSIONED_KERNEL_EX`
with version range 1–21. Added new `ONNX_OPERATOR_KERNEL_EX`
registrations at opset 22 with type constraints including BFloat16 via
`BuildKernelDefConstraints<float, MLFloat16, double, BFloat16>()`.
Updated `MLTypeCallDispatcher` in `ComputeNormal` and `ComputeUniform`
to include BFloat16. Updated Like variant type inference checks to
accept BFloat16 inputs.
- **`onnxruntime/core/providers/cuda/generator/random_impl.cu`**: Added
`SPECIALIZED_RANDOM_KERNELS(BFloat16)` template specialization for both
RandomNormal and RandomUniform kernels.
- **`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`**:
Updated forward declarations and `BuildKernelCreateInfo` entries to use
versioned macros (1, 21) and added new opset 22 entries.
- **`onnxruntime/test/providers/cpu/generator/random_test.cc`**:
Extended GPU test helpers with an `opset_version` parameter to handle
BFloat16 data. Added BFloat16 test cases for both vectorized and
non-vectorized paths, including direct and Like variants, all using
opset 22 to match the v22+ kernel registration.
- **`docs/OperatorKernels.md`**: Updated version ranges and type lists
for all four CUDA random operators to show `[1, 21]` and `22+` ranges
with BFloat16 included in opset 22.

### Motivation and Context

These operators were registered only at opset 1 using
`ONNX_OPERATOR_KERNEL_EX` (non-versioned), which per the kernel matching
logic in `kernel_registry.cc` only matches nodes with `SinceVersion ==
1` (exact match). Models exported with newer opset versions (e.g., opset
22) would fail to find matching CUDA kernels for these operators.
Additionally, opset 22 is the schema version that adds
`tensor(bfloat16)` to these random ops, so the new registrations include
BFloat16 support to fully close the opset-22 gap. The BF16 test cases
explicitly use opset 22 to ensure the correct schema validation and
kernel version coverage in CI.
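
The matching rule described above can be modeled in a few lines. This is a simplified model of the rule as this description states it, not the code in `kernel_registry.cc`; `KernelVersionRange` and `MatchesSinceVersion` are hypothetical names, and an open-ended registration is represented by `end == INT_MAX`.

```cpp
#include <limits>

// Simplified model of a kernel registration's opset range.
// A non-versioned (open-ended) registration keeps the default end value.
struct KernelVersionRange {
  int start;
  int end = std::numeric_limits<int>::max();
};

// True if a kernel registered with range `k` is eligible for a node whose
// operator schema has the given SinceVersion.
bool MatchesSinceVersion(const KernelVersionRange& k, int node_since_version) {
  if (k.start == node_since_version) return true;  // exact match
  const bool bounded = k.end != std::numeric_limits<int>::max();
  return bounded && k.start < node_since_version && node_since_version <= k.end;
}
```

Under this model a registration at opset 1 with no end matches only nodes whose SinceVersion is 1, so an opset 22 node needs its own registration starting at 22, while capping the old kernel at `[1, 21]` keeps earlier models working.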

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
This pull request refines how tensor attributes are unpacked in the CUDA
LabelEncoder implementation. The main improvement is ensuring that the
raw data from tensor protos is explicitly passed to the `UnpackTensor`
utility, enhancing correctness and compatibility with various tensor
data formats.

**Tensor attribute unpacking improvements:**

* In both `TryGetScalarTensorAttribute` and `GetAttrOrTensor` functions
in `label_encoder.cc`, the code now checks if the tensor proto contains
raw data and, if so, passes the correct raw data pointer and length to
`utils::UnpackTensor`. This replaces the previous approach of always
passing `nullptr` and `0` for these parameters.
[[1]](diffhunk://#diff-2fc4106da1ae063defd383893e035de8f260618ffd1dad0864b615361b4d2e2bL65-R67)
[[2]](diffhunk://#diff-2fc4106da1ae063defd383893e035de8f260618ffd1dad0864b615361b4d2e2bL116-R120)
MultiHeadAttention
Before: 58.3s
After: 2.89s
Speedup: 20x

### Description



### Motivation and Context


Tested with vision_encoder.onnx for
https://huggingface.co/onnx-community/LightOnOCR-2-1B-ONNX
This pull request improves support for string tensor attributes in the
Common Subexpression Elimination (CSE) optimizer: nodes with string
tensor attributes are now compared and hashed correctly, and a
regression test covers the new behavior. The most important changes are:

**Bug Fixes and Feature Support:**

* Updated `AreScalarTensorAttributeEqual` in
`common_subexpression_elimination.cc` to correctly handle and compare
scalar string tensor attributes, removing the restriction that
previously excluded string tensors.
* Modified `GetTensorAttributeHash` in
`common_subexpression_elimination.cc` to support hashing of string
tensor attributes, removing the enforcement that string tensors are not
expected and ensuring string data is included in the hash.
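
The two changes above can be sketched as follows. This is an illustration under stated assumptions, not the code in `common_subexpression_elimination.cc`: `FakeTensorAttr`, `tensor_attr_hash`, and `scalar_tensor_attrs_equal` are hypothetical names, and only string and int64 element types are modeled.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical element-type tags used only in this sketch.
STRING = "string"
INT64 = "int64"


@dataclass(frozen=True)
class FakeTensorAttr:
    """Hypothetical stand-in for a tensor-valued node attribute."""
    elem_type: str
    dims: Tuple[int, ...] = ()
    string_data: Tuple[bytes, ...] = ()
    int64_data: Tuple[int, ...] = ()


def tensor_attr_hash(t: FakeTensorAttr) -> int:
    """Fold the string payload into the hash instead of rejecting string tensors."""
    payload = t.string_data if t.elem_type == STRING else t.int64_data
    return hash((t.elem_type, t.dims, payload))


def scalar_tensor_attrs_equal(a: FakeTensorAttr, b: FakeTensorAttr) -> bool:
    """Compare scalar tensor attributes, including string-typed ones."""
    if a.elem_type != b.elem_type or a.dims != b.dims:
        return False
    if a.elem_type == STRING:
        return a.string_data == b.string_data
    return a.int64_data == b.int64_data


cat_a = FakeTensorAttr(STRING, (), (b"cat",))
cat_b = FakeTensorAttr(STRING, (), (b"cat",))
dog = FakeTensorAttr(STRING, (), (b"dog",))
print(scalar_tensor_attrs_equal(cat_a, cat_b))  # same string payload
print(scalar_tensor_attrs_equal(cat_a, dog))    # different string payload
```

The design point is that equal attributes must hash equally, so the string data has to participate in both the hash and the comparison. Otherwise two nodes that differ only in a string attribute could be treated as the same subexpression.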

**Testing and Regression Prevention:**

* Added a regression test `StringTensorAttr` in `cse_test.cc` to verify
that CSE supports nodes with string tensor attributes, specifically
testing with `LabelEncoder` nodes that retain their string tensor
attributes.

---------

Co-authored-by: Rajat Monga <rajatmonga_microsoft@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: rajatmonga <15679194+rajatmonga@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The psimd dependency declares an outdated `cmake_minimum_required`
version that is incompatible with CMake 4.3.2 on CI runners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pedrovgs pedrovgs requested a review from pacowong May 14, 2026 11:19
@pedrovgs pedrovgs marked this pull request as ready for review May 14, 2026 11:19
Collaborator

@pacowong pacowong left a comment

Thanks. Should we keep or upgrade `.circleci/config.yml` when we merge main back to develop? I saw that it was removed in c2bb76e. This matters because Goodnotes relies on it to build the library in a stable environment.

@pacowong pacowong changed the base branch from main to develop May 15, 2026 01:34
@pacowong pacowong changed the base branch from develop to main May 15, 2026 01:36
@pacowong pacowong merged commit 805e690 into main May 15, 2026
20 of 72 checks passed
@pacowong pacowong deleted the merge/upstream-main-2026-05-14 branch May 15, 2026 01:37
@pedrovgs
Member Author

I removed `.circleci/config.yml` because it was already covered by a GitHub workflow @pacowong
