Merge/upstream main 2026 05 14 by pedrovgs · Pull Request #17 · GoodNotes/onnxruntime

pedrovgs · 2026-05-14T10:02:36Z

Description

This PR merges 2,297 commits from the official Microsoft onnxruntime repository (up to May 13, 2026) into our Goodnotes fork, which was last synced in February 2025. Four merge conflicts were resolved: Paco's debug logging in cmake/CMakeLists.txt and cmake/onnxruntime.cmake was dropped in favor of upstream's cleaner code, the Mac Catalyst compile flags in cmake/onnxruntime_mlas.cmake were preserved as they're needed for our macabi builds, and build_apple_framework.py was merged to adopt upstream's pathlib refactor while keeping our macabi sysroot special-case handling. The update spans 4,754 files with ~507K insertions and ~156K deletions.

If you want to check the difference between our fork and the current state of onnx, you can do it here.

Motivation and Context

Sync this repository with the official one so that we get a version of ONNX Runtime that is compatible with the operators used by the scribble-to-erase model and that also runs on Catalyst.

### Description

To support the model package design, one of the goals for ORT is to automatically select the most suitable compiled EPContext binary from a collection of precompiled variants based on the EP, provider options, metadata, and available devices.

This PR adds first-phase model package support to ORT. Follow-up PRs may come later.

A model package is a collection of models, binaries, and metadata files organized in a hierarchically structured directory. The directory structure is not yet finalized, so the following is just a simple example of a model package directory:

````
<model>.ortpackage/
├── manifest.json
├── pipeline.json
├── configs/
|   ├── genai_config.json
|   └── chat_template.jinja
└── models/
    └── model_name/
        ├── metadata.json
        |   └── Contains general information on the component model,
        |       and specific information about each model variant
        |       such as data types, quantization algo, EP, etc. that
        |       is updated on add/remove of model variant
        ├── shared_weights/ (shared weights from all variants)
        |   ├── <checksum of weights file A>/
        |   |   └── model.data
        |   ├── <checksum of weights file B>/
        |   |   └── model.data
        |   └── ...
        ├── base model/
        |   └── model.onnx
        ├── variant A /
        |   ├── optimized model.onnx (contains EPContext nodes)
        |   └── [Compilation artifacts]
        └── variant B /
            ├── optimized model.onnx (contains EPContext nodes)
            └── [Compilation artifacts]
````

#### Spec and Format:

See [here](https://github.com/microsoft/onnxruntime/blob/07e55627e75da24099c582331a0f786090e6382a/onnxruntime/core/session/model_package/README.md)

#### Definitions:

- Model Package
  - A model package defines the overall logical ‘model’
  - A model package contains one or more ‘component models’
- Component Model
  - A component model comprises one or more ‘model variants’
- Model Variant
  - A ‘model variant’ is a single ONNX or ORT format model

#### manifest.json and metadata.json

A manifest.json may look like:

````
{
  "model_name": <logical_model_name>,
  "component_models": [
    <component_model_name_1>,
    <component_model_name_2>
  ]
}
````

A metadata.json for a component model may look like:

````
{
  "component_model_name": <component_model_name_1>,
  "model_variants": {
    <variant_name_1>: {
      "file": <ep_context_model_1 onnx file>,
      "constraints": {
        "ep": <ep_name>,
        "device": <device_type>,
        "architecture": <hardware_architecture>
      }
    },
    <variant_name_2>: {
      "file": <ep_context_model_2 onnx file>,
      "constraints": {
        "ep": <ep_name>,
        "device": <device_type>,
        "architecture": <hardware_architecture>
      }
    }
  }
}
````

#### Model Selection

The selection logic is implemented in `MatchesVariant()`, which evaluates the following constraints:

(Note: A constraint refers to a value under the "constraints" field in either manifest.json or metadata.json.)

- Check ep constraint
- Check device constraint
  - Some provider-bridge EPs may not implement `OrtEpFactory::GetSupportedDevices`, so ORT won't have the supported device information for those EPs. In that case, ORT skips the device constraint validation for those EPs.
- If provider option contains key related to device type, then the value must match the device constraint if any. - Check ep_compatibility_info constraint - ORT does not directly evaluate the architecture constraint. Instead, it relies on the ep_compatibility_info constraint, which may encode architecture information if needed. - The ep_compatibility_info value is expected to match the EP compatibility string stored in the EPContext model metadata. (See OrtEp::GetCompiledModelCompatibilityInfo() for how this string is generated.) - The EP implementation of EpFactory::ValidateCompiledModelCompatibilityInfo() is responsible for validating the compatibility string against the target device (i.e. OrtHardwareDevice) and returning the compatibility result. #### Note Check the unit test [here](https://github.com/microsoft/onnxruntime/pull/27786/changes#diff-bfa4122a85543ae2d80bf4cf6d9f85248e51c2276a5956af32f9bd8c8983d23a) to better understand how to use model package. #### Code Change This pull request introduces significant enhancements to the execution provider (EP) selection and management infrastructure in ONNX Runtime. The main focus is on supporting more sophisticated device selection and manifest-based model packaging, as well as refactoring provider selection logic for modularity and future extensibility. Key changes include: - Introduction of model package context and manifest parsing to support selecting model components based on device and EP constraints. - Refactoring of the execution provider interface and related classes to support multiple devices per provider. - Modularization of EP/device selection, creation, and registration logic in the provider policy context. The most important changes are: **Model Package Context and Manifest Support** - Added new files `model_package_context.h` and `model_package_context.cc` to implement manifest parsing, device/EP constraint matching, and component selection logic for model packages. 
This enables ONNX Runtime to select the most appropriate model variant based on available hardware and EP configuration. [[1]](diffhunk://#diff-006078879d52b421c973e2880c65db474aad6b21ad81ba69d387df8661bafeb2R1-R78) [[2]](diffhunk://#diff-45c29f481077e424c8969dc2198a8b40ab5908cf3b0bbf25dbeaca3ec51935d5R1-R279) **Execution Provider Interface Enhancements** - Updated the `IExecutionProvider` class to support construction with a list of `OrtEpDevice` pointers, and added a `GetEpDevices()` method to retrieve the supported devices. This allows plugin and bridge EPs to expose multiple devices. [[1]](diffhunk://#diff-e15769e35b807986b812aae3ff7192269e171c5846b2ff4d8ec571ec8ed57aa4R87-R104) [[2]](diffhunk://#diff-e15769e35b807986b812aae3ff7192269e171c5846b2ff4d8ec571ec8ed57aa4R203-R207) - Updated plugin EP construction to pass the list of supported devices to the base class. **Provider Policy Context Refactoring** - Refactored provider policy context logic to modularize device ordering, device selection, telemetry logging, EP creation, and registration. This includes splitting the monolithic `SelectEpsForSession` into smaller methods: `OrderDevices`, `SelectEpDevices`, `LogTelemetry`, `CreateExecutionProviders`, `RegisterExecutionProviders`, and a new flow for model package-based EP selection. [[1]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0R53-R58) [[2]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0L118-L156) [[3]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0L225-R199) [[4]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0R254-R365) These changes collectively lay the groundwork for more flexible, robust, and extensible device and EP selection in ONNX Runtime, especially in scenarios involving packaged models with multiple variants and complex hardware environments. ### Motivation and Context
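
The variant selection described under "Model Selection" can be sketched in Python as follows. This is a minimal illustration, not the actual `MatchesVariant()` code: the function names, the `supported_devices` argument, and the sample metadata are hypothetical, and only the `ep` and `device` constraint keys come from the metadata.json example. The `ep_compatibility_info` check is omitted, and returning the first match is a simplification of "most suitable".

```python
from typing import Optional


def matches_variant(constraints: dict, ep_name: str,
                    supported_devices: Optional[set]) -> bool:
    """Return True if a variant's constraints fit the given EP.

    supported_devices is None when the EP does not report its devices
    (the provider-bridge case), so the device check is skipped.
    """
    ep = constraints.get("ep")
    if ep is not None and ep != ep_name:
        return False
    device = constraints.get("device")
    if device is not None and supported_devices is not None:
        if device not in supported_devices:
            return False
    return True


def select_variant(metadata: dict, ep_name: str,
                   supported_devices: Optional[set]) -> Optional[str]:
    """Return the name of the first matching variant, or None."""
    for name, variant in metadata["model_variants"].items():
        constraints = variant.get("constraints", {})
        if matches_variant(constraints, ep_name, supported_devices):
            return name
    return None


# Hypothetical metadata following the metadata.json shape above.
metadata = {
    "component_model_name": "encoder",
    "model_variants": {
        "variant_a": {"file": "a.onnx",
                      "constraints": {"ep": "ExampleEpA", "device": "npu"}},
        "variant_b": {"file": "b.onnx",
                      "constraints": {"ep": "ExampleEpB", "device": "gpu"}},
    },
}
```

Passing `None` for `supported_devices` models an EP that does not report devices, in which case only the ep constraint decides the match.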

### Description The test was originally added to cover model selection based on the "device type" provider option. The provider option check was later removed from the selection logic, but the related test was left behind.

### Description Fix possible out-of-bounds target class ids in TreeEnsemble. ### Motivation and Context Security issue.

…microsoft#27674) This patch sets `is_channels_last` to true by default in the parameter of `ComputeMatMul` and ignores it in `UseSplitK` when there is no `bias`.

### Description Improve DeformConv op performance ### Motivation and Context This PR consolidates a series of optimizations targeting the `DeformConv` (Deformable Convolution) operator across both CPU and CUDA execution providers. * **For CPU:** The previous implementation suffered from bottlenecks due to redundant computations, lack of vectorization in bilinear sampling, and sub-optimal thread pool utilization. This overhaul redesigns the memory layout and execution pipeline to maximize SIMD opportunities and harden memory safety. * **For GPU:** The batched GEMM operation previously relied on an intermediate buffer and a custom scatter kernel to format the output, which consumed extra memory and kernel launch overhead. This update introduces a zero-copy approach. --- #### 1. CPU Optimizations & Refactoring The CPU execution path has been heavily refactored to minimize branching in hot paths, maximize vectorization, and safely handle edge cases. | Feature / Optimization | Description | Key Benefit | | :--- | :--- | :--- | | **AoSoA Bilinear Sampling Plan** | Replaced on-the-fly interpolation with a precomputed sampling plan using an 8-lane Array-of-Structures-of-Arrays (AoSoA) layout (`kPlanAoSoALanes`). | Perfectly aligns with 256-bit AVX2 vectors, enabling highly efficient SIMD unrolling during the `im2col` gathering phase. | | **Kernel Metadata Caching** | Introduced `DeformConvKernelMetaCacheData` to cache static convolution geometry (e.g., `kH`, `kW`, `padding`, `dilation`). | Eliminates the O(kernel_size) overhead of reallocating and recomputing base offsets on every single `Compute()` step. | | **Fast Math & Branchless Logic** | Implemented a custom `DeformConvFastFloor` and utilized an inverted bounds check with bitwise operations to evaluate all four corners simultaneously. | Removes expensive `std::floor` calls and unpredictable branches from the operator's hottest path. 
| | **Enhanced Parallelization** | Flattened the bilinear sampling plan build tasks across spatial pixels. | Allows `concurrency::ThreadPool::TryParallelFor` to split fine-grained work effectively, drastically improving thread pool scaling. | | **Hardened Bounds Checking** | Introduced compute-time bounds checks using `CheckedMulSizeT` and `CheckedBatchSpan`. | Ensures batch indexing and stride calculations stay within the addressable `size_t` range, preventing integer overflow vulnerabilities. | | **Bias Addition Refactoring** | Refactored bias addition to avoid expensive `div`/`mod` operations, applying `ORT_CPU_RESTRICT` and force-inlining. | Maximizes memory throughput and instruction pipelining during the final bias addition phase. | --- #### 2. GPU (CUDA) Optimizations The CUDA implementation was optimized to reduce memory footprint and eliminate unnecessary kernel launches. * **Zero-Copy GEMM Output:** Removed the temporary `gemm_output_buffer` allocation entirely. By carefully configuring the `stride_c` parameter (`stride_c_y = M * output_image_size`), the `cublasGemmStridedBatchedHelper` now writes the computed output directly into the correct NCHW memory layout of the final `Y` tensor. * **Kernel Elimination:** Completely removed the `DeformConvCopyGemmOutputRowMajorToNCHW` custom kernel and its associated dispatch logic. This reduces kernel launch overhead, lowers GPU memory bandwidth pressure, and simplifies the overall CUDA execution pipeline. * **Reduced Memory Footprint:** Updated the `bytes_per_image` calculation for workspace memory to reflect the removal of the GEMM output buffer. This allows the operator to potentially process more images in parallel under the same memory constraints. --- #### 3. Changed - **Batch chunking:** Chunk size `k` is chosen so that the number of outer rounds is minimized under the temp-memory cap; **`k` does not have to divide `N`**. The host loop uses `cur_parallel = min(k, N - b)`, so the last chunk may be smaller. 
This is the intended default behavior for this EP (not yet in a formal release). - **Kernel-size templates:** Im2col is specialized for **1×1, 3×3, and 7×7**; other sizes (including **5×5**) use the **dynamic** `kH`/`kW` path. Rationale: 5×5 is less common in current stacks (often replaced by stacked 3×3); specializing 7×7 targets common large-kernel cases. Older DCN/detection models that still use **5×5** deformable conv will take the dynamic path—correctness is unchanged; only compile-time unrolling differs. - **Add aliasing flags:** Updated DeformConv aliasing comments to make the stronger guarantee explicit: if output `Y` overlaps any input buffer, results can be incorrect regardless of `restrict`, because output writes may clobber source elements before they are fully consumed. `restrict` further tightens this by introducing undefined behavior when aliasing assumptions are violated. --- ### Summary In the current implementation, CPU performance is 33x (main branch is 15x) that of TorchVision. If we were to implement AVX2/AVX512 optimizations from scratch, we could achieve a 36x performance boost. However, I haven’t found any similar reference code in the ONNX Runtime repository. This PR also significantly improves parallelism: <img width="540" height="332" alt="image" src="https://github.com/user-attachments/assets/d4f670bd-dde3-43f1-b597-4471bfde005b" /> _Both ort and tv are configured with 16 threads_ ### Open Question for Reviewers **Regarding CUDA Temporary Memory Allocation:** Currently, the effective maximum temporary memory for CUDA is calculated using a heuristic (`total_global_mem * 0.1` or similar logic in `GetDeformConvEffectiveMaxTempBytes`). While the removal of `gemm_output_buffer` has reduced the memory footprint per image, I am not entirely certain if this 10% threshold is still the most appropriate value for balancing parallel image processing (`n_parallel_imgs`) against overall VRAM consumption in large models. 
I would appreciate any feedback or suggestions on whether we should tune this threshold, or if there's a more robust way to dynamically determine the optimal temporary workspace size for `DeformConv` in ORT.
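
The batch chunking rule from the "Changed" section can be sketched as follows. This is a minimal Python sketch under stated assumptions: `choose_chunk_size` and `chunk_batches` are hypothetical names, and taking the largest `k` that fits under the cap is one way to minimize the number of outer rounds, not necessarily how the operator picks it. Only the `cur_parallel = min(k, N - b)` loop comes from the description.

```python
def choose_chunk_size(n: int, bytes_per_image: int, max_temp_bytes: int) -> int:
    """Largest chunk size in [1, n] whose workspace fits under the cap."""
    k = max_temp_bytes // bytes_per_image
    return max(1, min(n, k))


def chunk_batches(n: int, k: int) -> list:
    """Split a batch of n images into (start, count) chunks of at most k."""
    chunks = []
    b = 0
    while b < n:
        # k does not have to divide n, so the last chunk may be smaller.
        cur_parallel = min(k, n - b)
        chunks.append((b, cur_parallel))
        b += cur_parallel
    return chunks
```

With `n = 10` and `k = 4`, the loop produces two full chunks and a final chunk of two images.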

…7582) ### Description In `Dequantize4BitsKernelReOrder` (CPU and CUDA EP), values from the `g_idx` tensor are used directly as array indices into the `scales` and `zero_points` buffers without bounds checking. This PR adds value-range validation and tests for the `g_idx` input tensor in the `MatMulNBits` operator. ### Motivation and Context  --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

) ### Description Add input validation to the LinearClassifier operator to prevent an out-of-bounds heap read in GEMM when a crafted model provides mismatched coefficients/intercepts sizes. Fixes https://portal.microsofticm.com/imp/v5/incidents/details/31000000559851/summary ### Changes - **Constructor**: Validate `class_count_ > 0` and `coefficients_.size() % class_count_ == 0` - **Compute()**: Validate `coefficients_.size() == class_count * num_features` before GEMM call - **Tests**: Two regression tests for invalid coefficient sizes ### Motivation and Context MSRC case 109185 (VULN-176698): OOB read via GEMM from crafted model in LinearClassifier operator. Root cause is missing validation that the coefficients vector size matches `[class_count, num_features]` before passing raw pointers to GEMM.
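
The two validation steps listed under "Changes" can be illustrated with a short Python sketch. The function names are hypothetical and the real checks live in the C++ operator; the sketch only mirrors the conditions stated above.

```python
def validate_attributes(coefficients: list, class_count: int) -> None:
    """Constructor-time checks: positive class count, divisible size."""
    if class_count <= 0:
        raise ValueError("class_count must be positive")
    if len(coefficients) % class_count != 0:
        raise ValueError("coefficients size must be a multiple of class_count")


def validate_before_gemm(coefficients: list, class_count: int,
                         num_features: int) -> None:
    """Compute-time check: coefficients must be [class_count, num_features]."""
    expected = class_count * num_features
    if len(coefficients) != expected:
        raise ValueError(
            f"expected {expected} coefficients, got {len(coefficients)}")
```

The second check is the one that prevents the matrix multiply from reading past the end of the coefficients buffer, because it ties the buffer size to the actual input feature count.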

…28013) ### Description Add a pre-commit [git hook](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) that runs lintrunner on staged files, catching lint and formatting issues before they reach CI. The hook runs lintrunner in check-only mode (no auto-fix) to avoid issues with partial staging. If lint issues are found, the commit is blocked and the developer is prompted to run `lintrunner -a` to fix. The hook is opt-in. Contributors enable it with: `git config core.hooksPath .githooks` ### Motivation and Context Follow-up from microsoft#27856. Catching lint issues at commit time saves CI cycles and review time.

WebGPU support for Qwen3.5, adding the LinearAttention and CausalConvWithState ops based on the proposal in onnx/onnx#7767. The model can be created with the model builder from https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/builder.py. For example, for the text-only flavor:

```
python builder.py -m Qwen/Qwen3.5-0.8B -o Qwen3.5-0.8B -e webgpu -p int4 --extra_options int4_accuracy_level=4 exclude_embeds=False
```

@Rohanjames1997

…rosoft#27878) ### Description Add Arm64 BF16 fast-math convolution support in MLAS: - direct NCHW conv - depthwise 3x3 NCHWc conv - pointwise 1x1 NCHWc conv This change adds new AArch64 BF16 asm kernels, wires them into MLAS platform dispatch, keeps accumulated pointwise batches on the custom BF16 path instead of falling back to generic SBGEMM, and adds the required BF16 build flags. The new paths are only used when Arm64 BF16 fast-math is enabled via the existing session option. Baseline FP32 behavior is unchanged. ### Performance Individual convolution improvements when running on `c8g` AWS instance where in columns base is FP32 execution, fast-math when enabled without this PR and PR is fast-math with this change: | Type | Shape | fast-math vs base | PR w/ fast-math vs base | PR w/ fast-math vs fast-math | |---|---|---:|---:|---:| | depthwise | N1 IC32 OC32 H112xW112->112x112 K3x3 S1x1 D1x1 P1/1/1/1 G32 | 0.991x | 1.047x | 1.057x | | depthwise | N1 IC96 OC96 H112xW112->56x56 K3x3 S2x2 D1x1 P1/1/1/1 G96 | 1.015x | 1.015x | 1.000x | | depthwise | N1 IC144 OC144 H56xW56->28x28 K3x3 S2x2 D1x1 P1/1/1/1 G144 | 1.020x | 1.004x | 0.984x | | depthwise | N1 IC144 OC144 H56xW56->56x56 K3x3 S1x1 D1x1 P1/1/1/1 G144 | 1.034x | 1.138x | 1.101x | | depthwise | N1 IC192 OC192 H28xW28->28x28 K3x3 S1x1 D1x1 P1/1/1/1 G192 | 0.997x | 1.033x | 1.037x | | depthwise | N1 IC384 OC384 H28xW28->14x14 K3x3 S2x2 D1x1 P1/1/1/1 G384 | 1.016x | 1.021x | 1.005x | | depthwise | N1 IC384 OC384 H28xW28->28x28 K3x3 S1x1 D1x1 P1/1/1/1 G384 | 1.011x | 1.090x | 1.077x | | depthwise | N1 IC576 OC576 H14xW14->7x7 K3x3 S2x2 D1x1 P1/1/1/1 G576 | 1.029x | 0.995x | 0.967x | | depthwise | N1 IC576 OC576 H14xW14->14x14 K3x3 S1x1 D1x1 P1/1/1/1 G576 | 1.025x | 1.006x | 0.982x | | depthwise | N1 IC960 OC960 H7xW7->7x7 K3x3 S1x1 D1x1 P1/1/1/1 G960 | 1.002x | 0.941x | 0.939x | | nchw | N1 IC3 OC32 H224xW224->112x112 K3x3 S2x2 D1x1 P1/1/1/1 G1 | 1.001x | 1.058x | 1.058x | | pointwise | N1 IC16 OC96 
H112xW112->112x112 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.213x | 1.328x | 1.095x | | pointwise | N1 IC32 OC16 H112xW112->112x112 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.020x | 1.019x | 0.998x | | pointwise | N1 IC32 OC32 H112xW112->112x112 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.118x | 1.196x | 1.069x | | pointwise | N1 IC32 OC144 H56xW56->56x56 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.220x | 1.528x | 1.252x | | pointwise | N1 IC32 OC192 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.199x | 1.418x | 1.183x | | pointwise | N1 IC64 OC384 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.294x | 1.938x | 1.497x | | pointwise | N1 IC96 OC32 H56xW56->56x56 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.080x | 1.426x | 1.320x | | pointwise | N1 IC96 OC576 H14xW14->14x14 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.280x | 1.961x | 1.532x | | pointwise | N1 IC144 OC32 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.132x | 1.351x | 1.193x | | pointwise | N1 IC144 OC32 H56xW56->56x56 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.073x | 1.374x | 1.281x | | pointwise | N1 IC160 OC960 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.133x | 1.744x | 1.539x | | pointwise | N1 IC192 OC32 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.166x | 1.411x | 1.210x | | pointwise | N1 IC192 OC64 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.212x | 1.763x | 1.454x | | pointwise | N1 IC320 OC1280 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.136x | 2.059x | 1.812x | | pointwise | N1 IC384 OC64 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.256x | 1.904x | 1.516x | | pointwise | N1 IC384 OC96 H14xW14->14x14 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.206x | 1.929x | 1.600x | | pointwise | N1 IC576 OC96 H14xW14->14x14 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.250x | 2.055x | 1.644x | | pointwise | N1 IC576 OC160 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 0.902x | 1.423x | 1.577x | | pointwise | N1 IC960 OC160 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 0.915x | 1.527x | 1.668x | | pointwise | N1 IC960 OC320 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.020x | 1.756x | 1.723x | | pointwise | N1 IC1280 OC1008 H1xW1->1x1 K1x1 S1x1 
D1x1 P0/0/0/0 G1 | 0.747x | 1.149x | 1.538x |

For the full model, the performance improvements on `c8g` (AWS Graviton 4) and `Standard_D32plds_v6` (Azure Cobalt-100) when running [MobileNet v2.7](https://github.com/onnx/models/blob/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx) with 8 threads are:

| Instance | PR w/ fast-math vs base | PR w/ fast-math vs fast-math |
|---|---|---|
| `c8g` | 1.892x | 1.647x |
| `Standard_D32plds_v6` | 2.884x | 1.692x |

(cc: @Rohanjames1997 @snadampal) --------- Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>

### Description Fix ICM issue: https://portal.microsofticm.com/imp/v5/incidents/details/31000000567822/summary The ICM is mainly about two issues in `validate_package.py`, which were fixed by microsoft#27840. The ICM also references another issue in `whisper_jump_times.py`, which is what this PR fixes. ### Motivation and Context ICM fixes

## Description

Ports graph capture/replay APIs (e.g., CUDA Graph) to the Plugin EP (`OrtEp`) C API so that plugin-based execution providers can participate in ORT-managed graph capture and replay.

### What changed

**New Plugin EP C API functions** (`onnxruntime_ep_c_api.h`):

- `OrtEp::IsGraphCaptureEnabled`: indicates whether the EP has graph capture enabled.
- `OrtEp::IsGraphCaptured`: indicates whether a graph has been captured for a given annotation ID.
- `OrtEp::ReplayGraph`: replays a previously captured graph.
- `OrtEp::GetGraphCaptureNodeAssignmentPolicy`: returns the node assignment validation policy for graph capture.

All four are optional (NULL defaults to safe behavior) and version-gated (`ort_version_supported >= 26`). If `IsGraphCaptureEnabled` returns true, `IsGraphCaptured` and `ReplayGraph` must also be implemented; otherwise, `PluginExecutionProvider` logs a warning and disables graph capture for that EP.

**New `OrtGraphCaptureNodeAssignmentPolicy` enum** (`onnxruntime_ep_c_api.h`):

Replaces the hardcoded EP-name checks in `InferenceSession::Initialize()` with a policy-based approach:

- `ALL_NODES_ON_EP`: all nodes must be on the target EP (e.g., TensorRT).
- `ALLOW_CPU_FOR_SHAPES`: CPU nodes allowed for shape computation if no memcpy nodes exist (e.g., CUDA, WebGPU, DML).

**Refactored `InferenceSession` graph capture selection** (`inference_session.cc`):

- Removed the hardcoded `graph_support_ep_list` and per-EP `strcmp` checks.
- Now iterates over all registered EPs and uses `IsGraphCaptureEnabled()` + `GetGraphCaptureNodeAssignmentPolicy()` to select and validate the graph-capturing EP.
- `AreAllComputeNodesAssignedToCudaOrJsOrDmlEpWebGpuEp()` → generalized to `AreAllComputeNodesAssignedToEpOrCpu()`, which also requires at least one node on the target EP.
- `IExecutionProvider::GetGraphCaptureNodeAssignmentPolicy()` added to the base class (defaults to `ALL_NODES_ON_EP`).

**Bounded graph capture recursion** (`inference_session.cc/h`):

- `Run()` now delegates to `RunImpl()` with a `graph_capture_depth` parameter.
- Caps internal run attempts at `kMaxGraphCaptureRunAttempts = 8`, returning a clear error if the EP never reports `IsGraphCaptured() == true`.

**EP implementations**:

- **WebGPU plugin EP**: Fully implements all four graph capture APIs by forwarding to the underlying `IExecutionProvider`.
- **CUDA plugin EP**: Stubs with TODOs (returns disabled/not-implemented).
- **NvTensorRTRTX EP**: `IsGraphCaptureEnabled()` now returns `false` since this EP manages graph capture internally (not via ORT).

**C++ wrapper** (`onnxruntime_cxx_api.h` / `onnxruntime_cxx_inline.h`):

- Added `Ort::Env::CopyTensor()` convenience overload for copying a single tensor (wraps `CopyTensors` with `num_tensors=1`).

### Tests

- **`ep_plugin_provider_test.cc`**: Unit tests for each new `PluginExecutionProvider` graph capture method, including NULL function pointer defaults, backward compatibility for version < 26, and validation that `IsGraphCaptureEnabled()` returns false when `IsGraphCaptured` or `ReplayGraph` are NULL.
- **`test_graph_capture.cc`**: End-to-end test for WebGPU plugin EP graph capture/replay using IO binding (warm-up + capture run, then replay with different inputs).

### Motivation and Context

Previously, graph capture support was limited to a hardcoded list of EPs (`kCudaExecutionProvider`, `kTensorrtExecutionProvider`, `kJsExecutionProvider`, `kWebGpuExecutionProvider`, `kDmlExecutionProvider`) with EP-specific validation logic in `InferenceSession`. This made it impossible for plugin EPs to participate in ORT-managed graph capture/replay without modifying the core session code. This PR makes graph capture/replay extensible to any EP, including out-of-tree plugin EPs, by exposing it through the `OrtEp` C API.
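
The bounded retry described under "Bounded graph capture recursion" can be sketched in Python as follows. This is an illustration only: `FakeEp` and `run_with_graph_capture` are hypothetical, and how the real `RunImpl()` counts attempts may differ. The cap of 8 mirrors `kMaxGraphCaptureRunAttempts`.

```python
MAX_GRAPH_CAPTURE_RUN_ATTEMPTS = 8  # mirrors kMaxGraphCaptureRunAttempts


class FakeEp:
    """Hypothetical EP that reports a captured graph after N runs."""

    def __init__(self, runs_until_captured: int):
        self.runs = 0
        self.runs_until_captured = runs_until_captured

    def run(self) -> None:
        self.runs += 1

    def is_graph_captured(self) -> bool:
        return self.runs >= self.runs_until_captured


def run_with_graph_capture(ep: FakeEp, depth: int = 0) -> int:
    """Run until the EP reports capture; return the number of runs used."""
    if depth >= MAX_GRAPH_CAPTURE_RUN_ATTEMPTS:
        raise RuntimeError("EP never reported a captured graph")
    ep.run()
    if not ep.is_graph_captured():
        # Warm-up or capture is still in progress, so run again.
        return run_with_graph_capture(ep, depth + 1)
    return depth + 1
```

The depth parameter turns what would be an unbounded loop for a misbehaving EP into an explicit error.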

### Description

- Update `WhereDummyDq` QDQ transformer to be more selective before inserting a dummy `DequantizeLinear` around `Where`.
  - `SatisfyCondition` now requires the `Where` output to have exactly one consumer, and that consumer must be `QuantizeLinear` (Q). Otherwise, the transform is skipped.
  - `InsertDummyDQ` additionally checks element type consistency between the upstream DQ input tensor type and the downstream Q output tensor type; if they differ, the transform returns without modifying the graph.
- Update the implementation of `WhereDummyDq` to avoid a negative or zero `scale` value. The change maps the float value to the **boundary** of the integer domain to ensure the `scale` value is positive.
  - If `WhereOp` gets a float scalar `xf` and a `DequantizeLinear` as its two inputs, `WhereDummyDq` inserts a DQ to ensure `xf = DQ(xq, scale, zp)`.
  - The `xq`, `scale` and `zp` are determined with the following table.

|        | uint8 | uint16 | int8 | int16  |
|--------|-------|--------|------|--------|
| xf > 0 |       |        |      |        |
| xq     | 255   | 65535  | 127  | 32767  |
| zp     | 127   | 32767  | 0    | 0      |
| xf < 0 |       |        |      |        |
| xq     | 0     | 0      | -128 | -32768 |
| zp     | 127   | 32767  | 0    | 0      |
| xf = 0 |       |        |      |        |
| xq     | 127   | 32767  | 0    | 0      |
| zp     | 127   | 32767  | 0    | 0      |

  - `scale = xf / (xq - zp)` if `xq != zp` else `1`

### Motivation and Context

- Negative or zero scale values are not friendly to various EPs and backends such as QNN-EP.
- Inserting an additional DQ is only useful when it forms a valid QDQ “node unit” pattern. If the `Where` output is not followed by a single `QuantizeLinear` (e.g., multiple consumers or a non-Q consumer), adding a dummy DQ cannot create the intended pattern and may lead to non-fusible/undesired graph structures.
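
The parameter table and the `scale` formula above can be checked with a short Python sketch. The function names and the string type keys are hypothetical; the numeric values are taken from the table, and `dequantize` uses the standard `(xq - zp) * scale` form.

```python
# (xq for xf > 0, xq for xf < 0, zp) per quantized type, from the table.
_PARAMS = {
    "uint8": (255, 0, 127),
    "uint16": (65535, 0, 32767),
    "int8": (127, -128, 0),
    "int16": (32767, -32768, 0),
}


def dummy_dq_params(xf: float, qtype: str):
    """Return (xq, scale, zp) such that DQ(xq, scale, zp) == xf, scale > 0."""
    xq_pos, xq_neg, zp = _PARAMS[qtype]
    if xf > 0:
        xq = xq_pos
    elif xf < 0:
        xq = xq_neg
    else:
        xq = zp  # xf == 0 maps onto the zero point
    scale = xf / (xq - zp) if xq != zp else 1.0
    return xq, scale, zp


def dequantize(xq: int, scale: float, zp: int) -> float:
    """Standard linear dequantization."""
    return (xq - zp) * scale
```

For a negative `xf`, the quantized value sits below the zero point, so both numerator and denominator of the `scale` formula are negative and the resulting scale stays positive.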

## Description This PR brings CUDA graph capture/replay to the CUDA plugin execution provider so plugin-based CUDA deployments can get the same reduced CPU launch overhead that the in-tree CUDA EP already supports. It also adds the ORT framework and plugin-C-API plumbing needed to let graph-enabled plugin EPs participate safely in warmup, capture, and replay, while preserving compatibility with older plugins through version-gated fallbacks. ## Summary of Changes ### CUDA plugin EP runtime and allocator integration | File | Change | |------|--------| | `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc` | Implements plugin-side graph capture lifecycle callbacks, per-thread graph context management, graph replay, and stream selection for graph-enabled runs. | | `onnxruntime/core/providers/cuda/plugin/cuda_ep.h` | Adds CUDA graph configuration/state to the plugin EP, including per-thread graph context ownership. | | `onnxruntime/core/providers/cuda/plugin/cuda_graph_plugin.cc` | Adds `CudaGraphSet`/`CudaGraphManager` to own captured graphs and coordinate warmup, capture, and replay by annotation ID. | | `onnxruntime/core/providers/cuda/plugin/cuda_graph_plugin.h` | Declares the new graph manager types and graph-related constants. | | `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc` | Adds external-stream wrapping so graph-enabled runs can reuse the thread’s graph stream without taking ownership of it. | | `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h` | Declares the external-stream initialization path and stream ownership tracking. | | `onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc` | Parses `enable_cuda_graph` and `min_num_runs_before_cuda_graph_capture` provider/session options for the plugin EP. | | `onnxruntime/core/providers/cuda/plugin/cuda_mempool_allocator_plugin.cc` | Updates allocator behavior needed for CUDA native mempool compatibility during graph capture/replay. 
| | `onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h` | Adjusts plugin kernel/device helpers used by the graph-enabled execution path. | | `onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h` | Adds supporting helpers used by the plugin CUDA graph flow. | ### ORT framework and plugin API support for graph replay | File | Change | |------|--------| | `include/onnxruntime/core/session/onnxruntime_ep_c_api.h` | Documents and extends the plugin EP contract for graph-enabled runs, including replay behavior relative to `OnRunStart`/`OnRunEnd`. | | `include/onnxruntime/core/framework/execution_provider.h` | Adds graph-capture node-assignment policy support to the execution provider interface. | | `onnxruntime/core/session/inference_session.cc` | Generalizes the session replay path and warmup/capture retry loop so ORT can drive graph replay for graph-capable EPs. | | `onnxruntime/core/session/inference_session.h` | Updates replay-related messaging and supporting declarations for the new run flow. | | `onnxruntime/core/framework/session_state.cc` | Makes device-stream collection reuse thread-affine so warmup/capture/replay reuse stays on the owning thread. | | `onnxruntime/core/framework/session_state.h` | Adds supporting state for the thread-affine stream collection pool. | | `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc` | Bridges the new graph callbacks, hardens validation of plugin graph support, and exposes effective plugin provider options gathered from session config. | | `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h` | Stores provider options and declares the new accessor/graph bridge behavior. | | `onnxruntime/core/providers/webgpu/webgpu_execution_provider.h` | Aligns graph-capture policy support with the new execution-provider interface. | | `onnxruntime/core/providers/js/js_execution_provider.h` | Aligns graph-capture policy support with the new execution-provider interface. 
| ### Tests and validation coverage | File | Change | |------|--------| | `onnxruntime/test/python/transformers/test_cuda_plugin_ep.py` | Adds end-to-end CUDA graph tests for warmup/capture/replay, replay after input updates, CUDA mempool mode, multiple graph annotation IDs, multi-GPU/device-id coverage, and a simple Add model. | ### Documentation | File | Change | |------|--------| | `docs/cuda_plugin_ep/cuda_graph_for_cuda_plugin.md` | Adds a dedicated design/implementation document covering architecture, lifecycle, allocator interaction, concurrency, and verification guidance. | | `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | Updates the broader plugin EP design doc to reflect that CUDA graph support is implemented and documents the framework-level changes. | | `docs/cuda_plugin_ep/QUICK_START.md` | Updates quick-start/testing guidance and removes the outdated “no CUDA Graph support” limitation. | ## Testing - Build ONNX Runtime with `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`, install the generated wheel, and deploy the CUDA plugin shared library as described in `docs/cuda_plugin_ep/QUICK_START.md`. - Run `python onnxruntime/test/python/transformers/test_cuda_plugin_ep.py`. - Pay particular attention to the new CUDA graph scenarios in that suite: warmup/capture/replay, replay after in-place input updates, CUDA mempool mode, multiple `gpu_graph_id` captures, and the second-device path when multiple GPUs are available. - Verify backward compatibility by confirming older plugins still load safely through the version-gated graph callback bridge, and that graph-disabled runs continue through the normal execution path. ## Motivation and Context The CUDA plugin EP exists to decouple CUDA EP delivery from core ONNX Runtime releases, but that model only works well if important runtime optimizations are also available through the plugin path. 
CUDA graph replay is one of the highest-value CUDA execution optimizations because it eliminates repeated kernel-launch overhead after capture, especially for steady-state inference workloads. Supporting that in the plugin EP required more than adding plugin-local capture code. ORT also needed a framework-level replay flow that works for plugin EPs, a plugin C API contract for graph support and node-assignment policy, and thread-affine stream reuse so captured graph resources and stream wrappers are not reused across unrelated threads. This PR packages those pieces together and documents the resulting behavior for future plugin EP work. It also depends on earlier plugin allocator work so warmup can stabilize allocations before capture begins. ## Checklist - [x] Tests added/updated - [x] Documentation updated (if applicable) - [x] No breaking changes (or documented in description)
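The thread-affine stream reuse mentioned above can be pictured with a small pool that returns a released object only to the thread that released it. This is a minimal sketch under stated assumptions, not the code in `session_state.cc`: `StreamCollection` and `ThreadAffinePool` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-run collection of device streams.
struct StreamCollection {
  int id;
};

// Pool that hands a released collection back only to the releasing thread.
class ThreadAffinePool {
 public:
  std::unique_ptr<StreamCollection> Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    auto& bucket = free_[std::this_thread::get_id()];
    if (bucket.empty()) {
      // Nothing cached for this thread yet: create a fresh collection.
      return std::make_unique<StreamCollection>(StreamCollection{next_id_++});
    }
    auto item = std::move(bucket.back());
    bucket.pop_back();
    return item;
  }

  void Release(std::unique_ptr<StreamCollection> item) {
    assert(item != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    // Cache under the calling thread's id so other threads never see it.
    free_[std::this_thread::get_id()].push_back(std::move(item));
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::thread::id,
                     std::vector<std::unique_ptr<StreamCollection>>> free_;
  int next_id_ = 0;
};
```

Keying the free list by thread id means a warmup, capture, and replay sequence on one thread keeps seeing the same cached object, while a different thread starts from a fresh one.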

## Description

This fixes a flaky failure in the plugin EP profiling tests on macOS, where reconstructed plugin event timestamps could land a few microseconds outside the correlated ORT parent event interval.

The current example plugin profiler reconstructs EP-relative timestamps by combining ORT's profiling-start offset with elapsed time from the EP clock. That reconstruction is close but not exact across clocks, and on macOS the skew was enough to fail the strict containment checks in `KernelPluginEp_SessionProfiling` with cases like `ep_start < parent_start` by a small margin.

Instead of weakening the test, this change keeps the strict contract and fixes the profiler output so child EP events are always emitted within the correlated ORT parent event interval.

## Key Changes

| File | Change |
|------|--------|
| `onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.h` | Stores the correlated ORT parent event start timestamp and duration on each collected EP event, and adds the helper signature updates needed to propagate that metadata. |
| `onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.cc` | Captures parent event timing from `Ort::ConstProfilingEvent`, attaches it to EP events during `StopEventImpl`, and clamps the reconstructed EP start/end interval to the parent ORT interval before emitting the final profiling event. |

## Why This Change Is Needed

- The plugin EP profiling tests intentionally require strict nesting: EP child events must stay within the ORT parent event interval.
- The existing implementation reconstructs EP timestamps from two different clocks, which can drift by a few microseconds depending on platform timing behavior.
- macOS exposed that drift often enough to make `KernelPluginEp_SessionProfiling` flaky even though the logical event ordering was correct.
- Clamping the emitted child interval to the already-correlated parent interval preserves the expected semantics and removes the platform-specific skew from the final profiling output.

## Testing

- `ninja -C build/cuda/Debug onnxruntime_autoep_test`
- `cd build/cuda/Debug && ./onnxruntime_autoep_test --gtest_filter=OrtEpLibrary.KernelPluginEp_SessionProfiling`
- `cd build/cuda/Debug && ./onnxruntime_autoep_test --gtest_filter=OrtEpLibrary.KernelPluginEp_RunProfiling`

## Notes For Reviewers

- This is intentionally scoped to the example plugin EP profiling path used by the AutoEP tests.
- The change avoids relaxing any assertions in `test_execution.cc`; it fixes the emitted profiling data instead.
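The clamping step can be sketched as follows. This is an illustrative sketch only, not the code in `ep_profiling.cc`: the `Interval` struct and `ClampToParent` helper are hypothetical names, and timestamps are assumed to be microseconds stored as `int64_t`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical event record; field names are illustrative only.
struct Interval {
  int64_t start_us;
  int64_t duration_us;
};

// Clamp a reconstructed child interval so it lies inside the parent interval.
inline Interval ClampToParent(Interval child, Interval parent) {
  assert(child.duration_us >= 0 && parent.duration_us >= 0);
  const int64_t parent_end = parent.start_us + parent.duration_us;
  const int64_t child_end = child.start_us + child.duration_us;
  // Move the start into [parent.start, parent_end] first ...
  const int64_t start = std::clamp(child.start_us, parent.start_us, parent_end);
  // ... then keep the end between the clamped start and the parent end.
  const int64_t end = std::clamp(child_end, start, parent_end);
  return Interval{start, end - start};
}
```

For example, a child reconstructed as starting 2 µs before its parent is shifted to the parent's start and loses those 2 µs of duration, which keeps the strict containment check valid.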

### Description

Set the function pointer to `nullptr` immediately after `UnloadDynamicLibrary`.

### Motivation and Context

After unloading the library, set the function pointer to `nullptr` to avoid a dangling pointer. Otherwise, the following scenario may cause errors:

```
RegisterExecutionProviderLibrary()
SessionOptions::AppendExecutionProvider_VitisAI()
```

In this scenario, `OrtVitisAIEpAPI` calls `initialize_vitisai_ep` once but calls `deinitialize_vitisai_ep` twice. During deinitialization, `deinitialize_onnxruntime_vitisai_ep` is invoked, which leads to errors.
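A minimal sketch of the pattern, assuming a library wrapper that holds an opaque handle and one resolved entry point. Every name here (`EpLibrary`, `FakeDeinitialize`, `FakeUnloadDynamicLibrary`, `Deinitialize`) is hypothetical and stands in for the real VitisAI EP code, which is not shown in this description.

```cpp
#include <cassert>

// Hypothetical stand-ins for a dynamically loaded EP library.
using DeinitFn = void (*)();

struct EpLibrary {
  void* handle = nullptr;           // opaque library handle
  DeinitFn deinitialize = nullptr;  // entry point resolved from the library
};

static int g_deinit_calls = 0;
static void FakeDeinitialize() { ++g_deinit_calls; }

// Stand-in for the real unload call; it does nothing in this sketch.
static void FakeUnloadDynamicLibrary(void* /*handle*/) {}

// Safe to call more than once: the second call sees null pointers and returns.
static void Deinitialize(EpLibrary& lib) {
  if (lib.deinitialize == nullptr) return;  // already torn down
  assert(lib.handle != nullptr);            // entry point implies a loaded library
  lib.deinitialize();
  FakeUnloadDynamicLibrary(lib.handle);
  // Clear both immediately so no caller can reach the unloaded code.
  lib.handle = nullptr;
  lib.deinitialize = nullptr;
}
```

With the pointers cleared, an extra deinitialize call becomes a no-op instead of a jump into unloaded code.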

### Description Centralise feed authentication & setup for build systems on ADO build pipelines. ### Motivation and Context SDL requires official build pipelines use a single controlled feed for external resources. --------- Co-authored-by: Sanaa Hamel <sanaahamel@microsoft.com>

…icrosoft#27998)

### Description:

### Summary

Fuse the QMoE 1-token decode path to reduce GPU dispatches from 17 (1 + k×4) to 5 (gate + fc1 + swiglu + fc2 + mix), improving token generation throughput by ~21% on Meteor Lake for the gpt-oss-20b MoE model (19 → 23 tps).

### Motivation

The QMoE operator processes Mixture-of-Experts layers by selecting top-k experts (k=4) per token. In the original 1-token decode path, each expert is processed serially with 4 dispatches (gather + fc1 + swiglu + fc2 + mix), totaling 17 GPU dispatches per QMoE call. Since each dispatch has M=1, the GPU is underutilized and CPU dispatch overhead dominates.

### Approach

For the 1-token path (num_rows == 1):

- **Gate1Token**: Select top-k experts and output an `indirect_experts` buffer mapping row index → expert index
- **Batched fc1 MatMulNBits**: Run a single M=k matmul with `per_row_weight_indirect` mode, where each row selects a different expert's weights via the indirect buffer
- **SwiGLU**: Apply activation on all k rows at once
- **Batched fc2 MatMulNBits**: Same per-row expert selection for the down projection
- **FusedFinalMix**: Accumulate all k weighted expert results into the output

### Follow-ups

- Fuse Batched fc1 MatMulNBits + SwiGLU
- Fuse Batched fc2 MatMulNBits + FusedFinalMix

Finally, we only need three shaders: Gate1Token, fused Batched fc1 MatMulNBits, fused batched fc2 MatMulNBits.
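The per-row expert selection can be illustrated with a plain CPU reference. This is a sketch under stated assumptions, not the WebGPU shader: `BatchedIndirectMatMul` and the dense float weight layout are hypothetical, and quantization (the "NBits" part) is left out entirely. The dispatch counts are taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dispatch counts exactly as stated in the description above.
constexpr int SerialDispatches(int k) { return 1 + k * 4; }
constexpr int kFusedDispatches = 5;  // gate + fc1 + swiglu + fc2 + mix

using Row = std::vector<float>;
using Matrix = std::vector<Row>;  // weights stored as [out][in]

// Hypothetical CPU reference: output row r is computed with the weights of
// expert indirect_experts[r], so k experts are covered by one M=k call.
Matrix BatchedIndirectMatMul(const Matrix& rows,
                             const std::vector<Matrix>& expert_weights,
                             const std::vector<std::size_t>& indirect_experts) {
  assert(rows.size() == indirect_experts.size());
  Matrix out(rows.size());
  for (std::size_t r = 0; r < rows.size(); ++r) {
    const Matrix& w = expert_weights.at(indirect_experts[r]);
    out[r].assign(w.size(), 0.0f);
    for (std::size_t o = 0; o < w.size(); ++o) {
      for (std::size_t i = 0; i < rows[r].size(); ++i) {
        out[r][o] += rows[r][i] * w[o].at(i);
      }
    }
  }
  return out;
}
```

With k=4, `SerialDispatches(4)` gives the 17 dispatches of the original path, against the 5 of the fused path.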

Bumps [lodash](https://github.com/lodash/lodash) from 4.17.23 to 4.18.1. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/lodash/lodash/releases">lodash's releases</a>.</em></p> <blockquote> <h2>4.18.1</h2> <h2>Bugs</h2> <p>Fixes a <code>ReferenceError</code> issue in <code>lodash</code> <code>lodash-es</code> <code>lodash-amd</code> and <code>lodash.template</code> when using the <code>template</code> and <code>fromPairs</code> functions from the modular builds. See <a href="https://redirect.github.com/lodash/lodash/issues/6167#issuecomment-4165269769">lodash/lodash#6167</a></p> <p>These defects were related to how lodash distributions are built from the main branch using <a href="https://github.com/lodash-archive/lodash-cli">https://github.com/lodash-archive/lodash-cli</a>. When internal dependencies change inside lodash functions, equivalent updates need to be made to a mapping in the lodash-cli. (hey, it was ahead of its time once upon a time!). We know this, but we missed it in the last release. 
It's the kind of thing that passes in CI, but fails bc the build is not the same thing you tested.</p> <p>There is no diff on main for this, but you can see the diffs for each of the npm packages on their respective branches:</p> <ul> <li><code>lodash</code>: <a href="https://github.com/lodash/lodash/compare/4.18.0-npm...4.18.1-npm">https://github.com/lodash/lodash/compare/4.18.0-npm...4.18.1-npm</a></li> <li><code>lodash-es</code>: <a href="https://github.com/lodash/lodash/compare/4.18.0-es...4.18.1-es">https://github.com/lodash/lodash/compare/4.18.0-es...4.18.1-es</a></li> <li><code>lodash-amd</code>: <a href="https://github.com/lodash/lodash/compare/4.18.0-amd...4.18.1-amd">https://github.com/lodash/lodash/compare/4.18.0-amd...4.18.1-amd</a></li> <li><code>lodash.template</code><a href="https://github.com/lodash/lodash/compare/4.18.0-npm-packages...4.18.1-npm-packages">https://github.com/lodash/lodash/compare/4.18.0-npm-packages...4.18.1-npm-packages</a></li> </ul> <h2>4.18.0</h2> <h2>v4.18.0</h2> <p><strong>Full Changelog</strong>: <a href="https://github.com/lodash/lodash/compare/4.17.23...4.18.0">https://github.com/lodash/lodash/compare/4.17.23...4.18.0</a></p> <h3>Security</h3> <p><strong><code>_.unset</code> / <code>_.omit</code></strong>: Fixed prototype pollution via <code>constructor</code>/<code>prototype</code> path traversal (<a href="https://github.com/lodash/lodash/security/advisories/GHSA-f23m-r3pf-42rh">GHSA-f23m-r3pf-42rh</a>, <a href="https://github.com/lodash/lodash/commit/fe8d32eda854377349a4f922ab7655c8e5df9a0b">fe8d32e</a>). Previously, array-wrapped path segments and primitive roots could bypass the existing guards, allowing deletion of properties from built-in prototypes. Now <code>constructor</code> and <code>prototype</code> are blocked unconditionally as non-terminal path keys, matching <code>baseSet</code>. 
Calls that previously returned <code>true</code> and deleted the property now return <code>false</code> and leave the target untouched.</p> <p><strong><code>_.template</code></strong>: Fixed code injection via <code>imports</code> keys (<a href="https://github.com/lodash/lodash/security/advisories/GHSA-r5fr-rjxr-66jc">GHSA-r5fr-rjxr-66jc</a>, CVE-2026-4800, <a href="https://github.com/lodash/lodash/commit/879aaa93132d78c2f8d20c60279da9f8b21576d6">879aaa9</a>). Fixes an incomplete patch for CVE-2021-23337. The <code>variable</code> option was validated against <code>reForbiddenIdentifierChars</code> but <code>importsKeys</code> was left unguarded, allowing code injection via the same <code>Function()</code> constructor sink. <code>imports</code> keys containing forbidden identifier characters now throw <code>"Invalid imports option passed into _.template"</code>.</p> <h3>Docs</h3> <ul> <li>Add security notice for <code>_.template</code> in threat model and API docs (<a href="https://redirect.github.com/lodash/lodash/pull/6099">#6099</a>)</li> <li>Document <code>lower > upper</code> behavior in <code>_.random</code> (<a href="https://redirect.github.com/lodash/lodash/pull/6115">#6115</a>)</li> <li>Fix quotes in <code>_.compact</code> jsdoc (<a href="https://redirect.github.com/lodash/lodash/pull/6090">#6090</a>)</li> </ul> <h3><code>lodash.*</code> modular packages</h3> <p><a href="https://redirect.github.com/lodash/lodash/pull/6157">Diff</a></p> <p>We have also regenerated and published a select number of the <code>lodash.*</code> modular packages.</p> <p>These modular packages had fallen out of sync significantly from the minor/patch updates to lodash. 
Specifically, we have brought the following packages up to parity w/ the latest lodash release because they have had CVEs on them in the past:</p> <ul> <li><a href="https://www.npmjs.com/package/lodash.orderby">lodash.orderby</a></li> <li><a href="https://www.npmjs.com/package/lodash.tonumber">lodash.tonumber</a></li> <li><a href="https://www.npmjs.com/package/lodash.trim">lodash.trim</a></li> <li><a href="https://www.npmjs.com/package/lodash.trimend">lodash.trimend</a></li> <li><a href="https://www.npmjs.com/package/lodash.sortedindexby">lodash.sortedindexby</a></li> <li><a href="https://www.npmjs.com/package/lodash.zipobjectdeep">lodash.zipobjectdeep</a></li> <li><a href="https://www.npmjs.com/package/lodash.unset">lodash.unset</a></li> <li><a href="https://www.npmjs.com/package/lodash.omit">lodash.omit</a></li> <li><a href="https://www.npmjs.com/package/lodash.template">lodash.template</a></li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/lodash/lodash/commit/cb0b9b9212521c08e3eafe7c8cb0af1b42b6649e"><code>cb0b9b9</code></a> release(patch): bump main to 4.18.1 (<a href="https://redirect.github.com/lodash/lodash/issues/6177">#6177</a>)</li> <li><a href="https://github.com/lodash/lodash/commit/75535f57883b7225adb96de1cfc1cd4169cfcb51"><code>75535f5</code></a> chore: prune stale advisory refs (<a href="https://redirect.github.com/lodash/lodash/issues/6170">#6170</a>)</li> <li><a href="https://github.com/lodash/lodash/commit/62e91bc6a39c98d85b9ada8c44d40593deaf82a4"><code>62e91bc</code></a> docs: remove n_ Node.js < 6 REPL note from README (<a href="https://redirect.github.com/lodash/lodash/issues/6165">#6165</a>)</li> <li><a href="https://github.com/lodash/lodash/commit/59be2de61f8aa9461c7856533b51d31b7d8babc4"><code>59be2de</code></a> release(minor): bump to 4.18.0 (<a href="https://redirect.github.com/lodash/lodash/issues/6161">#6161</a>)</li> <li><a 
href="https://github.com/lodash/lodash/commit/af634573030f979194871da7c68f79420992f53d"><code>af63457</code></a> fix: broken tests for _.template 879aaa9</li> <li><a href="https://github.com/lodash/lodash/commit/1073a7693e1727e0cf3641e5f71f75ddcf8de7c0"><code>1073a76</code></a> fix: linting issues</li> <li><a href="https://github.com/lodash/lodash/commit/879aaa93132d78c2f8d20c60279da9f8b21576d6"><code>879aaa9</code></a> fix: validate imports keys in _.template</li> <li><a href="https://github.com/lodash/lodash/commit/fe8d32eda854377349a4f922ab7655c8e5df9a0b"><code>fe8d32e</code></a> fix: block prototype pollution in baseUnset via constructor/prototype traversal</li> <li><a href="https://github.com/lodash/lodash/commit/18ba0a32f42fd02117f096b032f89c984173462d"><code>18ba0a3</code></a> refactor(fromPairs): use baseAssignValue for consistent assignment (<a href="https://redirect.github.com/lodash/lodash/issues/6153">#6153</a>)</li> <li><a href="https://github.com/lodash/lodash/commit/b8190803d48d60b8c80ad45d39125f32fa618cb2"><code>b819080</code></a> ci: add dist sync validation workflow (<a href="https://redirect.github.com/lodash/lodash/issues/6137">#6137</a>)</li> <li>Additional commits viewable in <a href="https://github.com/lodash/lodash/compare/4.17.23...4.18.1">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=lodash&package-manager=npm_and_yarn&previous-version=4.17.23&new-version=4.18.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

### Description

Fix a security issue where onnxruntime could access values outside the valid bounds.

### Motivation and Context

Security.

### Description

Porting over additional fixes to file mapping from the QNN EP ABI repo:

- Only use the file mapping feature if the context bin version is >= 3.3.3
- Disable file mapping on a per-model basis for edge use cases

### Motivation and Context

When testing with the QNN EP ABI repo, QNN context creation from an EP context failed because the EP context binary was too old, and that failure prevented the QNN API from freeing all resources when file mapping was enabled. The context creation failure was due to the context binary version being older than 3.3.3, so there is now a check that disables file mapping for any EP context binary that is too old.

Prior to these changes, if file mapping was enabled and QNN context creation failed for any reason, the feature was disabled for all other graphs. This did not account for use cases where (1) a model contains multiple EP context nodes and some of them are incompatible with the file mapping feature; and (2) multiple sessions share the same EP context and one or more of the models used are incompatible with the file mapping feature. The code has been updated to handle these use cases.

---------

Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
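The version gate and the per-context opt-out can be sketched as below. This is an assumption-laden illustration, not the QNN EP code: `ContextBinVersion`, `ContextState`, and both helper functions are hypothetical names, and only the 3.3.3 threshold comes from the description.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical version triple for an EP context binary.
struct ContextBinVersion {
  int major_num;
  int minor_num;
  int patch_num;
};

// File mapping is only attempted for binaries at or above 3.3.3.
inline bool VersionAllowsFileMapping(const ContextBinVersion& v) {
  assert(v.major_num >= 0 && v.minor_num >= 0 && v.patch_num >= 0);
  // Tuple comparison is lexicographic: major, then minor, then patch.
  return std::make_tuple(v.major_num, v.minor_num, v.patch_num) >=
         std::make_tuple(3, 3, 3);
}

// Hypothetical per-context state: a failure disables file mapping for this
// context only, instead of flipping a global switch for every graph.
struct ContextState {
  ContextBinVersion version;
  bool file_mapping_disabled;
};

inline bool UseFileMapping(const ContextState& ctx) {
  return !ctx.file_mapping_disabled && VersionAllowsFileMapping(ctx.version);
}
```

Keeping the flag on each context lets one incompatible EP context node fall back to the non-mapped path while the others keep using file mapping.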

…rosoft#28017) Bumps [fast-xml-parser](https://github.com/NaturalIntelligence/fast-xml-parser) from 4.5.3 to 4.5.6. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/NaturalIntelligence/fast-xml-parser/releases">fast-xml-parser's releases</a>.</em></p> <blockquote> <h2>Summary update on all the previous releases from v4.2.4</h2> <ul> <li>Multiple minor fixes provided in the validator and parser</li> <li>v6 is added for experimental use.</li> <li>ignoreAttributes support function, and array of string or regex</li> <li>Add support for parsing HTML numeric entities</li> <li>v5 of the application is ESM module now. However, JS is also supported</li> </ul> <p><strong>Note</strong>: Release section in not updated frequently. Please check <a href="https://github.com/NaturalIntelligence/fast-xml-parser/blob/master/CHANGELOG.md">CHANGELOG</a> or <a href="https://github.com/NaturalIntelligence/fast-xml-parser/tags">Tags</a> for latest release information.</p> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/NaturalIntelligence/fast-xml-parser/commit/42fbb0bc95e753e03fe52cb0805a8774bba4bf28"><code>42fbb0b</code></a> update release info</li> <li><a href="https://github.com/NaturalIntelligence/fast-xml-parser/commit/805671cb6c19108b171b876cf3e8865f18cdb8fd"><code>805671c</code></a> increase expansion limit as many system need it</li> <li><a href="https://github.com/NaturalIntelligence/fast-xml-parser/commit/9a2cf097c2961d4ad878f618e39fb0a9f5a0e9e5"><code>9a2cf09</code></a> update version</li> <li><a href="https://github.com/NaturalIntelligence/fast-xml-parser/commit/88d0936a23dabe51bfbf42255e2ce912dfee2221"><code>88d0936</code></a> apply all fixes from v5</li> <li><a href="https://github.com/NaturalIntelligence/fast-xml-parser/commit/d4eb6b4713a8d11e6730943392419040898ecbc0"><code>d4eb6b4</code></a> update release version</li> <li>See full diff in <a 
href="https://github.com/NaturalIntelligence/fast-xml-parser/compare/v4.5.3...v4.5.6">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=fast-xml-parser&package-manager=npm_and_yarn&previous-version=4.5.3&new-version=4.5.6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…/nextjs-default (microsoft#28036) Bumps [next](https://github.com/vercel/next.js) from 16.1.5 to 16.2.3. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/vercel/next.js/releases">next's releases</a>.</em></p> <blockquote> <h2>v16.2.3</h2> <blockquote> <p>[!NOTE] This release is backporting security and bug fixes. For more information about the fixed security vulnerability, please see <a href="https://vercel.com/changelog/summary-of-cve-2026-23869">https://vercel.com/changelog/summary-of-cve-2026-23869</a>. The release does <strong>not</strong> include all pending features/changes on canary.</p> </blockquote> <h3>Core Changes</h3> <ul> <li>Ensure app-page reports stale ISR revalidation errors via onRequestError (<a href="https://redirect.github.com/vercel/next.js/issues/92282">#92282</a>)</li> <li>Fix [Bug]: manifest.ts breaks HMR in Next.js 16.2 (<a href="https://redirect.github.com/vercel/next.js/issues/91981">#91981</a> through <a href="https://redirect.github.com/vercel/next.js/issues/92273">#92273</a>)</li> <li>Deduplicate output assets and detect content conflicts on emit (<a href="https://redirect.github.com/vercel/next.js/issues/92292">#92292</a>)</li> <li>Fix styled-jsx race condition: styles lost due to concurrent rendering (<a href="https://redirect.github.com/vercel/next.js/issues/92459">#92459</a>)</li> <li>turbo-tasks-backend: stability fixes for task cancellation and error handling (<a href="https://redirect.github.com/vercel/next.js/issues/92254">#92254</a>)</li> </ul> <h3>Credits</h3> <p>Huge thanks to <a href="https://github.com/icyJoseph"><code>@icyJoseph</code></a>, <a href="https://github.com/sokra"><code>@sokra</code></a>, <a href="https://github.com/wbinnssmith"><code>@wbinnssmith</code></a>, <a href="https://github.com/eps1lon"><code>@eps1lon</code></a> and <a href="https://github.com/ztanner"><code>@ztanner</code></a> for helping!</p> <h2>v16.2.2</h2> <blockquote> <p>[!NOTE] This release is 
backporting bug fixes. It does <strong>not</strong> include all pending features/changes on canary.</p> </blockquote> <h3>Core Changes</h3> <ul> <li>backport: Move expanded adapters docs to API reference (<a href="https://redirect.github.com/vercel/next.js/issues/92115">#92115</a>) (<a href="https://redirect.github.com/vercel/next.js/issues/92129">#92129</a>)</li> <li>Backport: TypeScript v6 deprecations for baseUrl and moduleResolution (<a href="https://redirect.github.com/vercel/next.js/issues/92130">#92130</a>)</li> <li>[create-next-app] Skip interactive prompts when CLI flags are provided (<a href="https://redirect.github.com/vercel/next.js/issues/91840">#91840</a>)</li> <li>next.config.js: Accept an option for serverFastRefresh (<a href="https://redirect.github.com/vercel/next.js/issues/91968">#91968</a>)</li> <li>Turbopack: enable server HMR for app route handlers (<a href="https://redirect.github.com/vercel/next.js/issues/91466">#91466</a>)</li> <li>Turbopack: exclude metadata routes from server HMR (<a href="https://redirect.github.com/vercel/next.js/issues/92034">#92034</a>)</li> <li>Fix CI for glibc linux builds</li> <li>Backport: disable bmi2 in qfilter <a href="https://redirect.github.com/vercel/next.js/issues/92177">#92177</a></li> <li>[backport] Fix CSS HMR on Safari (<a href="https://redirect.github.com/vercel/next.js/issues/92174">#92174</a>)</li> </ul> <h3>Credits</h3> <p>Huge thanks to <a href="https://github.com/nextjs-bot"><code>@nextjs-bot</code></a>, <a href="https://github.com/icyJoseph"><code>@icyJoseph</code></a>, <a href="https://github.com/ijjk"><code>@ijjk</code></a>, <a href="https://github.com/gaojude"><code>@gaojude</code></a>, <a href="https://github.com/wbinnssmith"><code>@wbinnssmith</code></a>, <a href="https://github.com/lukesandberg"><code>@lukesandberg</code></a>, and <a href="https://github.com/bgw"><code>@bgw</code></a> for helping!</p> <h2>v16.2.1</h2> <blockquote> <p>[!NOTE] This release is backporting bug fixes. 
It does <strong>not</strong> include all pending features/changes on canary.</p> </blockquote> <h3>Core Changes</h3> <ul> <li>docs: post release amends (<a href="https://redirect.github.com/vercel/next.js/issues/91715">#91715</a>)</li> <li>docs: fix broken Activity Patterns demo link in preserving UI state guide (<a href="https://redirect.github.com/vercel/next.js/issues/91698">#91698</a>)</li> <li>Fix adapter outputs for dynamic metadata routes (<a href="https://redirect.github.com/vercel/next.js/issues/91680">#91680</a>)</li> <li>Turbopack: fix webpack loader runner layer (<a href="https://redirect.github.com/vercel/next.js/issues/91727">#91727</a>)</li> <li>Fix server actions in standalone mode with <code>cacheComponents</code> (<a href="https://redirect.github.com/vercel/next.js/issues/91711">#91711</a>)</li> <li>turbo-persistence: remove Unmergeable mmap advice (<a href="https://redirect.github.com/vercel/next.js/issues/91713">#91713</a>)</li> <li>Fix layout segment optimization: move app-page imports to server-utility transition (<a href="https://redirect.github.com/vercel/next.js/issues/91701">#91701</a>)</li> <li>Turbopack: lazy require metadata and handle TLA (<a href="https://redirect.github.com/vercel/next.js/issues/91705">#91705</a>)</li> <li>[turbopack] Respect <code>{eval:true}</code> in worker_threads constructors (<a href="https://redirect.github.com/vercel/next.js/issues/91666">#91666</a>)</li> </ul>  </blockquote> <p>... 
(truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/vercel/next.js/commit/d5f649b2f4affdad1009cb178c1e3b37f4f1ad3f"><code>d5f649b</code></a> v16.2.3</li> <li><a href="https://github.com/vercel/next.js/commit/28739286a88a83ab2d4e1899bdb4eb4ee7bee9a9"><code>2873928</code></a> [16.x] Avoid consuming cyclic models multiple times (<a href="https://redirect.github.com/vercel/next.js/issues/75">#75</a>)</li> <li><a href="https://github.com/vercel/next.js/commit/d7c77653602ae2009595cc71eb10f1b8828cc789"><code>d7c7765</code></a> [backport]: Ensure app-page reports stale ISR revalidation errors via onReque...</li> <li><a href="https://github.com/vercel/next.js/commit/c573e8c4f3208711f52bf3b64f5db238c9164762"><code>c573e8c</code></a> fix(server-hmr): metadata routes overwrite page runtime HMR handler (<a href="https://redirect.github.com/vercel/next.js/issues/92273">#92273</a>)</li> <li><a href="https://github.com/vercel/next.js/commit/57b8f659060e1d0f202273a9ed9e56d40f1d1a9c"><code>57b8f65</code></a> next-core: deduplicate output assets and detect content conflicts on emit (<a href="https://redirect.github.com/vercel/next.js/issues/9">#9</a>...</li> <li><a href="https://github.com/vercel/next.js/commit/f158df18bd926d0c2165ad309bbb561d7e73e74a"><code>f158df1</code></a> Fix styled-jsx race condition: styles lost due to concurrent rendering (<a href="https://redirect.github.com/vercel/next.js/issues/92459">#92459</a>)</li> <li><a href="https://github.com/vercel/next.js/commit/356d605b5831ffbe12ce9c9641e5e2e55d203523"><code>356d605</code></a> turbo-tasks-backend: stability fixes for task cancellation and error handling...</li> <li><a href="https://github.com/vercel/next.js/commit/3b77a6e2670ce81d686111b8e466eec612fa1867"><code>3b77a6e</code></a> Fix DashMap read-write self-deadlock in task_cache causing hangs (<a href="https://redirect.github.com/vercel/next.js/issues/92210">#92210</a>)</li> <li><a 
href="https://github.com/vercel/next.js/commit/b2f208ae98645d119a7e3388ab8a407005619dd8"><code>b2f208a</code></a> Backport: new view-transitions guide, update and fixes (<a href="https://redirect.github.com/vercel/next.js/issues/92264">#92264</a>)</li> <li><a href="https://github.com/vercel/next.js/commit/52faae3d94641584e13691238df5be158d0f00fb"><code>52faae3</code></a> v16.2.2</li> <li>Additional commits viewable in <a href="https://github.com/vercel/next.js/compare/v16.1.5...v16.2.3">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=next&package-manager=npm_and_yarn&previous-version=16.1.5&new-version=16.2.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…icrosoft#27620)

### Description

Fix crash in ConstantFolding when nodes have missing optional outputs.

ConstantFolding previously iterated over `node->OutputDefs()` and attempted to resolve an OrtValue index for every output. However, some operators (e.g. `Unique`) have optional outputs that may not exist in the graph (`NodeArg::Exists() == false`).

`OptimizerExecutionFrame` only registers OrtValues for outputs that actually exist when building the name→index map. When ConstantFolding requested an index for a missing optional output, `GetMLValueIndex()` returned `-1`. This invalid index was inserted into `fetch_mlvalue_idxs` and later caused an assertion in `ExecutionFrame::GetMLValue()` during session initialization.

This PR fixes the issue by:

* Skipping outputs where `NodeArg::Exists() == false`
* Preventing invalid indices from entering `fetch_mlvalue_idxs`
* Skipping constant folding for the node if an output index cannot be resolved
* Maintaining correct mapping between fetch indices and the original output indices

---

### Motivation and Context

The failure is reproducible with the model attached in microsoft#26505.

Before this fix:

* session initialization fails with `ORT_ENABLE_BASIC`
* disabling `ConstantFolding` allows the model to load

After this fix:

* the model loads successfully with `ORT_ENABLE_BASIC`
* invalid indices are no longer inserted into `fetch_mlvalue_idxs`

Fixes microsoft#26505
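The four bullet points of the fix can be sketched in isolation. This is a simplified model, not the optimizer code: `OutputArg`, `FetchPlan`, and `BuildFetchPlan` are hypothetical names, and a plain `unordered_map` stands in for the name→index map that `OptimizerExecutionFrame` builds.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a node output (NodeArg) in this sketch.
struct OutputArg {
  std::string name;
  bool exists;  // false for an omitted optional output
};

struct FetchPlan {
  std::vector<int> fetch_mlvalue_idxs;        // indices handed to the frame
  std::vector<std::size_t> output_positions;  // original output slot per fetch
};

// Returns std::nullopt when an existing output cannot be resolved, which
// signals the caller to skip constant folding for this node.
std::optional<FetchPlan> BuildFetchPlan(
    const std::vector<OutputArg>& outputs,
    const std::unordered_map<std::string, int>& name_to_index) {
  FetchPlan plan;
  for (std::size_t i = 0; i < outputs.size(); ++i) {
    if (!outputs[i].exists) continue;  // skip missing optional outputs
    const auto it = name_to_index.find(outputs[i].name);
    if (it == name_to_index.end() || it->second < 0) return std::nullopt;
    plan.fetch_mlvalue_idxs.push_back(it->second);
    plan.output_positions.push_back(i);  // remember the original slot
  }
  assert(plan.fetch_mlvalue_idxs.size() == plan.output_positions.size());
  return plan;
}
```

Recording `output_positions` alongside the fetch indices is what lets the folded results be written back to the right output slots once some outputs have been skipped.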

### Description Fixes 3 ICM issues: https://portal.microsofticm.com/imp/v5/incidents/details/31000000575344/summary https://portal.microsofticm.com/imp/v5/incidents/details/31000000575473/summary https://portal.microsofticm.com/imp/v5/incidents/details/31000000574999/summary ### Motivation and Context Fix ICMs --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

### Description Fix `linux-wasm-ci.yml` setup-feed having a spurious `templates/` path prefix. ### Motivation and Context

### Description Fixes two overflow/underflow bugs in the CPU RNN kernel (`rnn.cc`): - **`SafeInt` for GEMM M-dimension**: `seq_length * batch_size` was computed as a raw `int64_t` multiply before `narrow<int>()`, meaning an overflow would be UB before the check could fire. Replaced with `SafeInt<int64_t>(seq_length) * batch_size` for a checked multiply. - **`seq_length == 0` guard in `Assign_Y_h`**: For the forward direction, `last_time_step = seq_length - 1` underflows to `-1` when `seq_length == 0`, producing a negative `y_offset` and out-of-bounds read. Added an early-exit that zero-fills Y_h for the direction and returns. Also handles `sequence_lens[batch] == 0` (same underflow path), zeroing the affected batch slot and skipping via `continue`. ### Motivation and Context Silent UB from integer overflow/underflow in shape-derived index arithmetic can corrupt memory or produce incorrect results without any diagnostic signal. These cases are legal per the ONNX spec (empty sequences, per-batch zero-length sequences) and must be handled explicitly. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
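
A minimal sketch of the two guards, assuming plain standard C++ in place of `SafeInt` and `narrow<int>()`. `CheckedGemmM` and `LastTimeStep` are hypothetical helper names, not functions from `rnn.cc`.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>

// Checked multiply followed by a checked narrowing to int. The overflow test
// runs before the multiplication, so no undefined behavior can occur first.
inline int CheckedGemmM(std::int64_t seq_length, std::int64_t batch_size) {
  if (seq_length < 0 || batch_size < 0) {
    throw std::invalid_argument("negative dimension");
  }
  if (batch_size != 0 &&
      seq_length > std::numeric_limits<std::int64_t>::max() / batch_size) {
    throw std::overflow_error("seq_length * batch_size overflows int64");
  }
  const std::int64_t m = seq_length * batch_size;
  if (m > std::numeric_limits<int>::max()) {
    throw std::overflow_error("GEMM M dimension does not fit in int");
  }
  return static_cast<int>(m);
}

// Returns the last valid time step, or std::nullopt for an empty sequence.
// A caller would zero-fill the output slot instead of indexing with -1.
inline std::optional<std::int64_t> LastTimeStep(std::int64_t seq_length) {
  if (seq_length <= 0) return std::nullopt;
  return seq_length - 1;
}
```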

This pull request introduces several enhancements and refactorings to the resource accounting and execution provider (EP) infrastructure, with a focus on better support for plugin-based CUDA execution providers. The most significant changes include the addition of type-erased arithmetic for resource accounting, improved handling of resource budgets for plugin EPs, and more robust device matching logic. These updates increase maintainability, enforce stricter type safety, and ensure correct resource tracking across both in-tree and plugin-based EPs. **Resource accounting improvements:** * Added type-erased arithmetic functions (`AddResourceCounts`, `ResourceCountExceeds`, `FormatResourceCount`) for `ResourceCount` to enforce exhaustive handling of variant types and improve type safety. [[1]](diffhunk://#diff-7b1c9ef14536f9a66ed370cb729b6609d12c5907b460d8f145a7ad5a401e0fb6R29-R40) [[2]](diffhunk://#diff-03c846683a6d76ded189d6ef24dc545da89ca418d0bce5cf1243d33cf1e2ac06R320-R351) * Refactored the `IResourceAccountant` interface: replaced `ResetPendingWeights` with `ResetForNewPass`, which resets both the stop flag and pending weights, and introduced a protected `ResetPendingWeightsImpl` for subclass-specific cleanup. [[1]](diffhunk://#diff-7b1c9ef14536f9a66ed370cb729b6609d12c5907b460d8f145a7ad5a401e0fb6L64-R83) [[2]](diffhunk://#diff-7b1c9ef14536f9a66ed370cb729b6609d12c5907b460d8f145a7ad5a401e0fb6R92-R96) [[3]](diffhunk://#diff-03c846683a6d76ded189d6ef24dc545da89ca418d0bce5cf1243d33cf1e2ac06L123-R123) [[4]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL280-R280) [[5]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL351-R351) **Plugin CUDA EP and resource budget enforcement:** * Added `kCudaPluginExecutionProvider` constant and updated logic to ensure plugin EPs correctly map to their in-tree accountant counterparts and are included in device matching and partitioning. 
[[1]](diffhunk://#diff-442c270eea3703252c48e97a7573960e14bf27a45a4443348840ed565330bf70R34) [[2]](diffhunk://#diff-b20f416b9fe3b85423eea6707c38753351a3f1b8ef7a319858b27794507e0686L102) [[3]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5L186-R187) [[4]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5L206-R207) [[5]](diffhunk://#diff-a8f614056d63b5b3325eea1d855afc96550c977c16d8fdba641012a79194b7b5L228-R229) [[6]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL1192-R1200) * Updated plugin EP infrastructure to pass and utilize resource accountant pointers, enabling host-side resource budget enforcement for plugin EPs and ensuring correct node assignment. [[1]](diffhunk://#diff-fb00c9a234d8cc889927a22de94acfcfd893b56505e8ed613961b1bf13c0e435R19) [[2]](diffhunk://#diff-fb00c9a234d8cc889927a22de94acfcfd893b56505e8ed613961b1bf13c0e435R54-R57) [[3]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5R16-R17) [[4]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5L239-R259) [[5]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5R273-R281) [[6]](diffhunk://#diff-0890d267a71ca02f4173c2ab226e6c5707fcbbf6bbb5f602fa5d92aa82f42a80R14-R22) [[7]](diffhunk://#diff-0890d267a71ca02f4173c2ab226e6c5707fcbbf6bbb5f602fa5d92aa82f42a80R233-R241) **Device matching and partitioning:** * Improved device matching heuristics to consider both in-tree and plugin CUDA EPs, and updated logic to prefer runtime device ordinals for more reliable device selection. Other minor changes include code style cleanups and additional includes for completeness.
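
One way to get the "exhaustive handling of variant types" mentioned above is `std::visit` with a compile-time failure for unhandled alternatives. The sketch below assumes `ResourceCount` is a `std::variant` with a single `size_t` alternative; the real definition and the real signatures of `AddResourceCounts` and `ResourceCountExceeds` may differ.

```cpp
#include <cstddef>
#include <type_traits>
#include <variant>

// Assumed shape of the variant, for illustration only.
using ResourceCount = std::variant<std::size_t>;

template <class T>
inline constexpr bool kAlwaysFalse = false;

// Adds two counts. A new variant alternative that is not handled here
// fails to compile instead of being silently ignored at runtime.
inline ResourceCount AddResourceCounts(const ResourceCount& a, const ResourceCount& b) {
  return std::visit(
      [](const auto& x, const auto& y) -> ResourceCount {
        using X = std::decay_t<decltype(x)>;
        using Y = std::decay_t<decltype(y)>;
        if constexpr (std::is_same_v<X, std::size_t> && std::is_same_v<Y, std::size_t>) {
          return x + y;
        } else {
          static_assert(kAlwaysFalse<X>, "unhandled ResourceCount alternative");
        }
      },
      a, b);
}

// True when value is strictly greater than limit.
inline bool ResourceCountExceeds(const ResourceCount& value, const ResourceCount& limit) {
  return std::visit(
      [](const auto& v, const auto& l) -> bool {
        using V = std::decay_t<decltype(v)>;
        using L = std::decay_t<decltype(l)>;
        if constexpr (std::is_same_v<V, std::size_t> && std::is_same_v<L, std::size_t>) {
          return v > l;
        } else {
          static_assert(kAlwaysFalse<V>, "unhandled ResourceCount alternative");
        }
      },
      value, limit);
}
```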

### Description Modify ADO pipeline feed setup template to be idempotent. ### Motivation and Context ADO pipelines are currently stateful, and feed setup requires modifying files outside of the agent work directories. Previous jobs may have already modified these files in incompatible ways, so always override.

)

- [x] Cap existing opset 7 Sin/Cos kernels to versioned 7-21
- [x] Add new opset 22 Sin/Cos kernels with BFloat16 support (HFDX)
- [x] Update forward declarations and BuildKernelCreateInfo entries
- [x] Add opset 22 tests for Sin and Cos
- [x] Rebase onto latest main, resolve conflicts
- [x] Merge commit with latest main to resolve conflicts

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>

### Description

Optimize the `LinearAttention` Op with subgroupAdd().

- Detect the adapter's `subgroupMinSize` to allocate workgroup shared memory
- Drastically reduces workgroup shared memory usage (workgroup_size_x * TILE_V → MAX_SG * TILE_V)
- Eliminates most `workgroupBarrier()` calls in the subgroup reduction

The optimization is gated behind `subgroup_min_size`, which is enabled when the device supports `wgpu::FeatureName::Subgroups`. The original is preserved as fallback.

## Qwen3.5-4B Performance Benchmarks

| Metric | Prefill Speed (TPS) | Decode Speed (TPS) | Prefill Improvement |
| :--- | :--- | :--- | :--- |
| **Default** | 719.6 | 29.6 | - |
| **Optimized** | 929.8 | 29.7 | 1.29x |

**Test Environment:**

* **Hardware:** Intel Panther Lake
* **Configuration:** Prefill: 1024, Decode: 128

### Motivation and Context

See above.

This pull request improves the robustness and correctness of the upsampling code in ONNX Runtime, especially for anti-aliased linear and trilinear upsampling on the CPU. The changes focus on safer handling of large tensor dimensions, improved memory safety, and code clarity for interpolation and weight calculation. The most important changes are grouped below. **Dimension and Overflow Handling:** * Added overflow checks for multiplication of large tensor dimensions to prevent integer overflows during output size calculations, using a new `checked_mul_int64` lambda. [[1]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3R1106-R1118) [[2]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3R1292-R1309) * Ensured all tensor dimensions are validated to fit within the `int32_t` range before narrowing, improving safety for large tensors. [[1]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3R1133-R1143) [[2]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3L1131-R1234) **Anti-Alias Upsampling Refactor:** * Refactored the anti-alias upsampling filter setup to use a new `InterpolationBound` struct for per-pixel coordinate ranges, replacing the previous flat vector approach. This improves code clarity and reduces indexing errors. [[1]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38R25-R35) [[2]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38L126-R159) [[3]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38L155-L208) * Updated all interpolation and extrapolation routines to use the new `bounds` structure, improving readability and maintainability. 
[[1]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38L272-R290) [[2]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38L341-R353) [[3]](diffhunk://#diff-051136817a71a65f4763b9f5c6e02c15f9a6aa39189d952717f4f36c6490ee38L391-R400) * Implements CUDA NHWC cubic antialias support. **Memory and Type Safety:** * Improved buffer management and type safety in filter weight calculation, including more robust normalization and quantization for int8/uint8 types. * Fixed a minor logic bug in extrapolation handling by ensuring the loop is only entered if there are out-of-bound indices. **General Code Improvements:** * Added missing `<limits>` include and replaced some magic numbers with named variables for clarity. [[1]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3R6) [[2]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3L1198-R1257) [[3]](diffhunk://#diff-13eb8371a91e6fab62e63ecc46583049f97e8acc244af6ce8cc1c981d1d72dd3L1221-R1272) These changes together make the upsampling code more robust, especially for large or edge-case tensors, and improve maintainability for future development.
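
To illustrate the idea of a per-pixel bounds struct replacing a flat vector of indices, here is a rough sketch. Only the name `InterpolationBound` comes from the description above; its fields, the `ComputeBounds` helper, and the triangle-filter support formula (a common convention for anti-aliased linear resizing) are assumptions and may not match the onnxruntime implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Assumed layout: half-open range [min, max) of input pixels that
// contribute to one output pixel.
struct InterpolationBound {
  std::int64_t min;
  std::int64_t max;
};

// Computes one bound per output pixel for a triangle (linear) filter.
// When downsampling, the filter support is widened by 1/scale.
inline std::vector<InterpolationBound> ComputeBounds(std::int64_t in_size,
                                                     std::int64_t out_size) {
  assert(in_size > 0 && out_size > 0);
  const double scale = static_cast<double>(out_size) / static_cast<double>(in_size);
  const double support = scale < 1.0 ? 1.0 / scale : 1.0;
  std::vector<InterpolationBound> bounds(static_cast<std::size_t>(out_size));
  for (std::int64_t i = 0; i < out_size; ++i) {
    const double center = (static_cast<double>(i) + 0.5) / scale;
    const auto lo = static_cast<std::int64_t>(std::floor(center - support + 0.5));
    const auto hi = static_cast<std::int64_t>(std::floor(center + support + 0.5));
    // Clamp to the valid input range so later indexing cannot go out of bounds.
    bounds[static_cast<std::size_t>(i)] = {std::max<std::int64_t>(0, lo),
                                           std::min<std::int64_t>(in_size, hi)};
  }
  return bounds;
}
```

Addressing `bounds[i].min` and `bounds[i].max` by name, rather than computing offsets such as `2 * i` and `2 * i + 1` into a flat vector, removes one source of indexing mistakes.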

…ernels for the NCHWc blocked format in MLAS (microsoft#28411)

### Description

New kernel files:

- riscv64/sconv_depthwise_kernel_rvv.cpp: RVV-optimized 3x3 stride-1 depthwise convolution (NCHW format), replacing the MLAS_FLOAT32X4 generic vectorized version
- riscv64/sconv_nchwc_kernel_rvv.cpp: 7 NCHWc kernels using vfloat32m4_t (LMUL=4, BlockSize=16):
  - Direct NCHW conv (MlasConvNchwFloatKernelRvv)
  - Direct NCHWc conv (MlasConvNchwcFloatKernelRvv)
  - Depthwise NCHWc conv (MlasConvDepthwiseFloatKernelRvv)
  - Pointwise NCHWc conv (MlasConvPointwiseFloatKernelRvv)
  - Max/AvgExcludePad/AvgIncludePad pooling

### Motivation and Context

Following microsoft#28261, optimize more MLAS kernels using RISC-V Vector (RVV) extensions.

Please note:

- On the K3 (SpacemiT X60), VLEN=256. With LMUL=4 and e32, the hardware can hold (256/32) * 4 = 32 floats per vector register group, but we only request 16. So we're using half the available vector width.
- The reason is that BlockSize=16 is baked into the NCHWc data layout across the whole framework (matching ARM64 NEON). Changing it to 32 would require a different NCHWc format and is not a localized change.

### Benchmark (SpacemiT K3, VLEN=256, 8-core)

All tests pass with zero numerical error.

| Kernel | Speedup (RVV vs scalar) |
| -- | -- |
| Direct NCHW Conv | 1.27–1.29x |
| Direct NCHWc Conv | 1.93–1.95x |
| Depthwise NCHWc Conv | 10.8–12.5x |
| Pointwise NCHWc Conv | 29.4–30.4x |
| Max Pooling | 12.5–20.0x |
| Avg Pooling (exclude pad) | 4.0–4.3x |
| Avg Pooling (include pad) | 5.5–5.8x |
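
The register-width arithmetic in the note above can be written out as a small compile-time check. `ElementsPerRegisterGroup` and `kNchwcBlockSize` are hypothetical names used only for this illustration.

```cpp
// Number of elements one RVV register group can hold:
// (VLEN / SEW) elements per register, times LMUL registers per group.
constexpr int ElementsPerRegisterGroup(int vlen_bits, int sew_bits, int lmul) {
  return (vlen_bits / sew_bits) * lmul;
}

// Block size fixed by the NCHWc data layout, per the description above.
constexpr int kNchwcBlockSize = 16;

// VLEN=256, e32, LMUL=4 gives 32 floats, so a 16-float block uses half of it.
static_assert(ElementsPerRegisterGroup(256, 32, 4) == 32, "capacity of one register group");
static_assert(2 * kNchwcBlockSize == ElementsPerRegisterGroup(256, 32, 4),
              "the NCHWc block fills half of the register group");
```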

### Description This adds a fast path `TryInitializeMapFromRawData` to `LabelEncoder_4` that reads numeric key/value tensor attributes directly from their `raw_data` buffer, avoiding the intermediate `std::vector` allocation by the existing `GetAttribute` helper. ### Motivation and Context For ONNX artifacts with LabelEncoders with millions of numeric keys and values, session creation leads to unnecessarily large memory peaks because temporary `std::vector`s with all keys and values are allocated. This is a real pain point in production use. This change was verified with an added test, and allocations were checked with large models via Python with [memray](https://github.com/bloomberg/memray) on macOS and inside an Ubuntu devcontainer. Thank you for taking a look!
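
The raw-buffer idea can be sketched roughly as below. `ReadRaw` and `TryBuildMapFromRaw` are hypothetical helpers, not the `TryInitializeMapFromRawData` code itself, and the sketch assumes the raw bytes are already in host byte order.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_map>

// Reads element i of type T from a raw byte buffer. memcpy is used so that
// unaligned buffers are handled safely.
template <typename T>
T ReadRaw(const std::string& raw, std::size_t i) {
  T value;
  std::memcpy(&value, raw.data() + i * sizeof(T), sizeof(T));
  return value;
}

// Fills `out` directly from the two raw buffers, without materializing
// temporary vectors of all keys and all values first.
// Returns false when the buffer sizes are inconsistent.
template <typename K, typename V>
bool TryBuildMapFromRaw(const std::string& keys_raw, const std::string& values_raw,
                        std::unordered_map<K, V>& out) {
  if (keys_raw.size() % sizeof(K) != 0 || values_raw.size() % sizeof(V) != 0) return false;
  const std::size_t n = keys_raw.size() / sizeof(K);
  if (n != values_raw.size() / sizeof(V)) return false;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    out.emplace(ReadRaw<K>(keys_raw, i), ReadRaw<V>(values_raw, i));
  }
  return true;
}
```

With millions of entries, the saving comes from the map being the only large allocation: each key/value pair is decoded from the attribute bytes and inserted immediately.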

### Description

Follow-up to PR microsoft#28097. Applies the same `_torch_load_weights_only()` wrapper to the two remaining `torch.load()` call sites.

`torch.load` can deserialize arbitrary Python pickle payloads. Using `weights_only=True` restricts loading to tensor/checkpoint data on supported PyTorch versions and is the safer default. The wrapper gracefully falls back to the default `torch.load` behavior on older PyTorch versions that do not support the `weights_only` parameter.

### Summary of Changes

| File | Change |
|------|--------|
| `onnxruntime/test/testdata/test_data_generation/lr_scheduler/lr_scheduler_test_data_generator.py` | Adds `_torch_load_weights_only()` helper and uses it when loading scheduler/optimizer state dicts. |
| `orttraining/orttraining/test/python/orttraining_test_ortmodule_pytorch_ddp.py` | Adds `_torch_load_weights_only()` helper and uses it when loading DDP model checkpoint. |

### Motivation and Context

These were the last two `torch.load()` calls in the repository without `weights_only=True`. While both are in test/tooling code with low direct risk, this change ensures consistency with the pattern established in PR microsoft#28097 and eliminates all unsafe deserialization call sites.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…icrosoft#28004) ### Description Fix out-of-bounds reads in `SoftmaxCrossEntropyLoss` and `SoftmaxCrossEntropyLossGrad` (CPU EP) when label values are outside `[0, C)`. Same class of bug fixed in microsoft#27568 for `SparseSoftmaxCrossEntropy`. The CUDA kernels currently only validate labels via `CUDA_KERNEL_ASSERT`, which is a no-op in release builds. CUDA hardening is not part of this PR. ### Changes - Forward: bounds check folded into the three per-sample loops (after `ignore_index` skip, before any `weight_data[label]` / `log_prob_data[i*C + label]` access). - Backward: single upfront serial bounds check (parallel-for lambdas cannot return Status); comment explains why. - Validate `weight_shape[0] == C`. - Move `weight_data[label]` access after `ignore_index` check in grad weighted paths. - `N_D * C` wrapped in `SafeInt`; `gsl::narrow<int>` for `N_D` and `C`. Overflow / truncation returns `INVALID_ARGUMENT`. - `Eigen::Index` size guard: `ORT_ENFORCE` -> `ORT_RETURN_IF`. - `IsScalar(ignore_index)` check: `ORT_ENFORCE` -> `ORT_RETURN_IF_NOT` in both forward and backward. - Pre-existing wrong-sized `memset` in backward (`sizeof(T1) * N_D`) corrected to `sizeof(T1) * probability_shape.Size()`. The previous code was effectively redundant (subsequent parallel-for paths overwrite all `N_D * C` entries) so this is cleanup, not an active OOB. - Renamed `weight_smaple` -> `weight_sample`. ### Tests 11 regression tests in `cross_entropy_test.cc`: - Label too large (forward + grad, int64 + int32) - Negative label - Label too large with weights (MEAN and SUM reductions) - Higher-dim logit `[2,4,2,3]` with label `[2,2,3]` - `SoftmaxCrossEntropyLossInternal` and `SoftmaxCrossEntropyLossInternalGrad` with `ignore_index` as a runtime tensor input --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
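
The label check described under "Changes" amounts to the following, shown here as a standalone sketch. `FirstInvalidLabel` is a hypothetical helper; the real kernels return a `Status` rather than an index.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the position of the first label outside [0, num_classes),
// ignoring entries equal to ignore_index, or std::nullopt if all are valid.
// The ignore_index test comes first, so an ignored label is never range-checked
// and would never be used to index weight or log-probability data.
inline std::optional<std::size_t> FirstInvalidLabel(const std::vector<std::int64_t>& labels,
                                                    std::int64_t num_classes,
                                                    std::int64_t ignore_index) {
  for (std::size_t i = 0; i < labels.size(); ++i) {
    const std::int64_t label = labels[i];
    if (label == ignore_index) continue;
    if (label < 0 || label >= num_classes) return i;
  }
  return std::nullopt;
}
```

Running a serial pass like this before any parallel loop matches the backward-pass approach above, where the parallel-for lambdas cannot report an error themselves.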

### Description The per-axis SafeInt multiplication added in microsoft#27566 detects overflow when computing an individual output dimension, but combinations of per-axis repeats can still request an int64-representable total that is unreasonably large. This PR adds a 4 GiB upper bound on the total tiled byte count in the CPU, CUDA, and WebGPU Tile kernels and extends validation/tests for the new behavior. ### Changes - `onnxruntime/core/providers/cpu/tensor/tile.cc`: compute the output shape with division-based checks that reject negative repeats, int64 overflow, and total tiled byte counts above the supported maximum before allocation. The maximum is clamped to `size_t::max()` for 32-bit builds, and the bound applies to `std::string` tensors as well because their output buffers still allocate per-element backing storage. - `onnxruntime/core/providers/cuda/tensor/tile.cc`: same output-size bound applied to keep CPU and CUDA behavior consistent. - `onnxruntime/core/providers/webgpu/tensor/tile.cc`: same output-size bound applied to keep WebGPU behavior consistent, plus repeats rank/length validation matching CPU/CUDA. - `onnxruntime/test/providers/cpu/tensor/tile_op_test.cc`: tests cover malformed repeats rank/length, 1-D, multi-axis, double (8-byte element), and string cases that exceed the bound, plus a positive test confirming a moderate (4 MB) output is still accepted. ### Motivation and Context Follow-up to microsoft#27566, which fixed per-axis overflow but did not bound total allocation size. --------- Co-authored-by: Gopalakrishnan Nallasamy <gopalakrishnan.nallasamy@microsoft.com> Co-authored-by: Gopalakrishnan Nallasamy <gnallasamy@microsoft.com>
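
A division-based check of this kind might look like the sketch below. The 4 GiB constant follows the description above; `TiledShapeWithinBound` and its signature are hypothetical and do not reproduce the code in `tile.cc`.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

constexpr std::uint64_t kMaxTiledBytes = 4ULL << 30;  // 4 GiB upper bound

// Computes the tiled output shape. Returns false for negative repeats,
// int64 overflow on any axis, or a total byte count above kMaxTiledBytes.
// All comparisons use division, so no product is formed before it is known to fit.
inline bool TiledShapeWithinBound(const std::vector<std::int64_t>& dims,
                                  const std::vector<std::int64_t>& repeats,
                                  std::uint64_t element_bytes,
                                  std::vector<std::int64_t>& out_dims) {
  if (dims.size() != repeats.size() || element_bytes == 0) return false;
  const std::uint64_t max_elements = kMaxTiledBytes / element_bytes;
  std::uint64_t total = 1;
  out_dims.clear();
  for (std::size_t i = 0; i < dims.size(); ++i) {
    const std::int64_t d = dims[i];
    const std::int64_t r = repeats[i];
    if (d < 0 || r < 0) return false;
    if (d != 0 && r > std::numeric_limits<std::int64_t>::max() / d) return false;
    const std::uint64_t out_dim = static_cast<std::uint64_t>(d) * static_cast<std::uint64_t>(r);
    if (out_dim != 0 && total > max_elements / out_dim) return false;
    total *= out_dim;
    out_dims.push_back(static_cast<std::int64_t>(out_dim));
  }
  return true;
}
```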

…soft#28354) ## Description Adds support for float/float16 zero points in the 2-bit MatMulNBits LUT GEMM path, enabling AMD QAD/Quark 2-bit quantization which requires a fractional zero point of 1.5. Addresses microsoft#28162 ### Problem QAD 2-bit quantization uses non-uniform levels `[-1, -1/3, 1/3, 1]`, expressed via `dequant = (q - 1.5) * scale`. The zero point 1.5 cannot be represented as a packed uint8 value. The existing LUT GEMM packing API only accepted `uint8_t*` zero points, and the fallback dequant path crashed with `ORT_ENFORCE(nbits_ == 4)` when encountering 2-bit + float ZP. ### Changes **MLAS layer** — Widened `MlasLutGemmPack()` to accept `const void* QuantBZeroPoint` + `bool IsFloatZeroPoint`, following the existing `MlasQNBitGemmPackQuantBData` convention. The AVX2 packer reads float ZP values directly per quantization group when `IsFloatZeroPoint` is set, computing the same `(zp - midpoint) * scale` correction stored in the packed buffer. The compute kernel (`TMACComputeGemm_avx2`) is unchanged — it already consumes ZP as a float correction during accumulation. **MatMulNBits CPU kernel** — Relaxed the PrePack early-exit guard to allow float ZP into the LUT GEMM path (not non-LUT paths). Added fp16→fp32 conversion for ZP tensors, matching how scales are already handled. Fixed the Compute() path to null out prepacked zero_points to avoid a null dereference in CheckInputs. Fixed the 2-bit fallback dequant path: relaxed the `nbits_==4` enforce, added inline 2-bit scalar dequant for float and MLFloat16 ZP with correct packed-B indexing for padded K shapes. **Tests** — Added MLAS-level float ZP tests across block lengths 32/64/128 with ZP values {0, 1.5, 2, 3}. Added provider-level directed QAD tests (`zp=1.5`) verifying end-to-end correctness through the LUT GEMM path. 
### Testing

- 72 MLAS LUT GEMM tests pass (including 36 new float ZP tests)
- 13 provider-level 2-bit tests pass (including new QAD float ZP tests)
- No regressions in existing uint8 ZP tests
- lintrunner clean

### Files changed

| File | Change |
|------|--------|
| `core/mlas/inc/mlas_qnbit.h` | API: `void*` ZP + `IsFloatZeroPoint` flag |
| `core/mlas/lib/qlutgemm.h` | Dispatch typedef update |
| `core/mlas/lib/qlutgemm.cpp` | Pass-through plumbing |
| `core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp` | Float ZP packing branch |
| `contrib_ops/cpu/quantization/matmul_nbits.cc` | PrePack guard, fallback fix, ZP validation |
| `test/mlas/unittest/test_sqlutgemm.cpp` | Float ZP MLAS tests |
| `test/mlas/bench/bench_lutgemm.cpp` | Updated call signature |
| `test/contrib_ops/matmul_2bits_test.cc` | Float ZP provider tests |

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
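
The formula `dequant = (q - 1.5) * scale` from the problem statement can be illustrated with a scalar sketch. `Dequant2Bit` is a hypothetical helper, and the packing order (four 2-bit values per byte, lowest bits first) is an assumption; the actual packed-B layout in MLAS may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar 2-bit dequantization with a floating-point zero point.
// Assumes four values per byte, with value i stored in bits [2*(i%4), 2*(i%4)+1].
inline std::vector<float> Dequant2Bit(const std::vector<std::uint8_t>& packed,
                                      std::size_t count, float scale, float zero_point) {
  assert(count <= packed.size() * 4);
  std::vector<float> out;
  out.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    const unsigned shift = 2u * static_cast<unsigned>(i % 4);
    const std::uint8_t q = static_cast<std::uint8_t>((packed[i / 4] >> shift) & 0x3u);
    out.push_back((static_cast<float>(q) - zero_point) * scale);
  }
  return out;
}
```

With `zero_point = 1.5` and `scale = 2/3`, the four codes 0..3 map to the non-uniform levels -1, -1/3, 1/3, 1 quoted above (up to floating-point rounding), which is why a packed uint8 zero point cannot express this scheme.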

### Description Replace `path::string()` / bare `std::filesystem::path(string)` with `PathToUTF8String` / `ToPathString` in two places that handle user-supplied paths. ### Motivation and Context On Windows, `path::string()` and `std::filesystem::path(std::string)` use the system ANSI code page (CP_ACP). When a model or EPContext output file sits in a directory with non-ASCII Unicode characters, this corrupts the path and causes: - `ModelMetadefIdGenerator::GenerateId` — throws during session initialization ("No mapping for the Unicode character exists in the target multi-byte code page") - `ModelGenOptions` (EPContext options) — constructs a garbled path, failing EPContext file creation with `ENOENT`

…domUniformLike CUDA ops with BFloat16 support (microsoft#27759) ### Description Fills the opset gap for RandomNormal, RandomNormalLike, RandomUniform, and RandomUniformLike operators in the CUDA execution provider, extending coverage from opset 1 to opset 22 with full BFloat16 support for the new opset 22 registrations. #### Changes - **`onnxruntime/core/providers/cuda/generator/random.cc`**: Changed each `ONNX_OPERATOR_KERNEL_EX` to `ONNX_OPERATOR_VERSIONED_KERNEL_EX` with version range 1–21. Added new `ONNX_OPERATOR_KERNEL_EX` registrations at opset 22 with type constraints including BFloat16 via `BuildKernelDefConstraints<float, MLFloat16, double, BFloat16>()`. Updated `MLTypeCallDispatcher` in `ComputeNormal` and `ComputeUniform` to include BFloat16. Updated Like variant type inference checks to accept BFloat16 inputs. - **`onnxruntime/core/providers/cuda/generator/random_impl.cu`**: Added `SPECIALIZED_RANDOM_KERNELS(BFloat16)` template specialization for both RandomNormal and RandomUniform kernels. - **`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`**: Updated forward declarations and `BuildKernelCreateInfo` entries to use versioned macros (1, 21) and added new opset 22 entries. - **`onnxruntime/test/providers/cpu/generator/random_test.cc`**: Extended GPU test helpers with an `opset_version` parameter to handle BFloat16 data. Added BFloat16 test cases for both vectorized and non-vectorized paths, including direct and Like variants, all using opset 22 to match the v22+ kernel registration. - **`docs/OperatorKernels.md`**: Updated version ranges and type lists for all four CUDA random operators to show `[1, 21]` and `22+` ranges with BFloat16 included in opset 22. ### Motivation and Context These operators were registered only at opset 1 using `ONNX_OPERATOR_KERNEL_EX` (non-versioned), which per the kernel matching logic in `kernel_registry.cc` only matches nodes with `SinceVersion == 1` (exact match). 
Models exported with newer opset versions (e.g., opset 22) would fail to find matching CUDA kernels for these operators. Additionally, opset 22 is the schema version that adds `tensor(bfloat16)` to these random ops, so the new registrations include BFloat16 support to fully close the opset-22 gap. The BF16 test cases explicitly use opset 22 to ensure the correct schema validation and kernel version coverage in CI. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com> Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
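
The matching rule described above (exact match for a non-versioned registration, range match for a versioned one) can be modelled with a simplified sketch. `KernelVersionRange`, `kOpenEnded`, and `VersionMatches` are hypothetical names; this is a model of the described behavior, not the code in `kernel_registry.cc`.

```cpp
#include <limits>

// Sentinel for a registration made without an explicit end version.
constexpr int kOpenEnded = std::numeric_limits<int>::max();

// Simplified model of a kernel registration's opset range.
struct KernelVersionRange {
  int start;
  int end;  // kOpenEnded for a non-versioned registration
};

// A non-versioned kernel matches only a node whose since-version equals its
// start version; a versioned kernel matches any since-version in [start, end].
inline bool VersionMatches(const KernelVersionRange& kernel, int node_since_version) {
  if (kernel.end == kOpenEnded) return node_since_version == kernel.start;
  return kernel.start <= node_since_version && node_since_version <= kernel.end;
}
```

Under this model, the old opset 1 registration does not match a node at since-version 22, while the pair of registrations `[1, 21]` and `22+` covers both older models and opset 22 models.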

This pull request refines how tensor attributes are unpacked in the CUDA LabelEncoder implementation. The main improvement is ensuring that the raw data from tensor protos is explicitly passed to the `UnpackTensor` utility, enhancing correctness and compatibility with various tensor data formats. **Tensor attribute unpacking improvements:** * In both `TryGetScalarTensorAttribute` and `GetAttrOrTensor` functions in `label_encoder.cc`, the code now checks if the tensor proto contains raw data and, if so, passes the correct raw data pointer and length to `utils::UnpackTensor`. This replaces the previous approach of always passing `nullptr` and `0` for these parameters. [[1]](diffhunk://#diff-2fc4106da1ae063defd383893e035de8f260618ffd1dad0864b615361b4d2e2bL65-R67) [[2]](diffhunk://#diff-2fc4106da1ae063defd383893e035de8f260618ffd1dad0864b615361b4d2e2bL116-R120)

MultiHeadAttention Before: 58.3s After: 2.89s Speedup: 20x

### Description

### Motivation and Context

Tested with vision_encoder.onnx for https://huggingface.co/onnx-community/LightOnOCR-2-1B-ONNX

This pull request improves support for string tensor attributes in the Common Subexpression Elimination (CSE) optimizer, ensuring correct handling and hashing of nodes with string tensor attributes and adding a regression test to prevent regressions. The most important changes are: **Bug Fixes and Feature Support:** * Updated `AreScalarTensorAttributeEqual` in `common_subexpression_elimination.cc` to correctly handle and compare scalar string tensor attributes, removing the restriction that previously excluded string tensors. * Modified `GetTensorAttributeHash` in `common_subexpression_elimination.cc` to support hashing of string tensor attributes, removing the enforcement that string tensors are not expected and ensuring string data is included in the hash. **Testing and Regression Prevention:** * Added a regression test `StringTensorAttr` in `cse_test.cc` to verify that CSE does support nodes with string tensor attributes, specifically testing with `LabelEncoder` nodes that retain their string tensor attributes. --------- Co-authored-by: Rajat Monga <rajatmonga_microsoft@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: rajatmonga <15679194+rajatmonga@users.noreply.github.com>
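
As a rough illustration of including string data in an attribute hash, the sketch below folds every string element into one value. `HashStringTensor` and `AreScalarStringTensorsEqual` are hypothetical helpers operating on plain vectors; the real functions in `common_subexpression_elimination.cc` work on tensor protos and may combine hashes differently.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Folds every string element into one hash value, so two attributes with
// different string contents are unlikely to collide.
inline std::size_t HashStringTensor(const std::vector<std::string>& string_data) {
  std::size_t h = string_data.size();
  for (const std::string& s : string_data) {
    // Conventional hash-combine step (as popularized by Boost).
    h ^= std::hash<std::string>{}(s) + 0x9e3779b9u + (h << 6) + (h >> 2);
  }
  return h;
}

// Two scalar string tensors are equal when each holds exactly one element
// and those elements compare equal.
inline bool AreScalarStringTensorsEqual(const std::vector<std::string>& a,
                                        const std::vector<std::string>& b) {
  return a.size() == 1 && b.size() == 1 && a[0] == b[0];
}
```

Equal attributes must hash equally for CSE to merge the nodes, and the equality check is what prevents two nodes with colliding hashes but different strings from being merged.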

pacowong

Thanks. Do we keep or upgrade .circleci/config.yml when we merge main back to develop as I saw that it was removed in c2bb76e? It is important as Goodnotes relies on it to build the library with a stable environment.

pedrovgs · 2026-05-15T09:32:04Z

I removed .circleci/config.yml because it is already covered by a GitHub workflow, @pacowong

chilo-ms and others added 30 commits April 8, 2026 16:43

Remove unnecessary model package test (microsoft#28015)

c159603

### Description The test was originally added to test model selection based on the "device type" provider option. However, the check for that provider option was later removed from the selection logic, and the related test was not removed along with it.

fix target_ids out of boundary in TreeEnsemble* (microsoft#27951)

c6ff87f

### Description Fix possible out-of-bounds target or class ids in TreeEnsemble. ### Motivation and Context Security issue.

[webgpu] Set is_channels_last to true by default in ComputeMatMul (…

9e3614b

…microsoft#27674) This patch sets `is_channels_last` to true by default in the parameter of `ComputeMatMul` and ignores it in `UseSplitK` when there is no `bias`.

fix a security issue in SVM* (microsoft#27950)

ce92643

### Description Fix a security issue where onnxruntime could access values out of bounds. ### Motivation and Context Security.

Copilot AI and others added 22 commits May 12, 2026 13:58

Merge main with latest official onnxruntime from 2026 May 14

fb2b467

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update CircleCI macOS resource class to M4 Pro

0a2d3bd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Disable shellcheck in lint workflow for our fork

3ce0d38

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Revert "Disable shellcheck in lint workflow for our fork"

ab6ea29

This reverts commit 3ce0d38.

Update CircleCI Xcode image to 16.3.0 for M4 Pro compatibility

e6ccf73

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix CircleCI resource class name to m4pro.medium

fb4af06

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add CMAKE_POLICY_VERSION_MINIMUM=3.5 for CMake 4.x compat

ae97eb8

psimd dependency has outdated cmake_minimum_required incompatible with CMake 4.3.2 on CI runners. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove CircleCI config, GH Actions handles xcframework build

c2bb76e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

pedrovgs requested a review from pacowong May 14, 2026 11:19

pedrovgs marked this pull request as ready for review May 14, 2026 11:19

pacowong reviewed May 14, 2026

pacowong changed the base branch from main to develop May 15, 2026 01:34

pacowong changed the base branch from develop to main May 15, 2026 01:36

pacowong merged commit 805e690 into main May 15, 2026
20 of 72 checks passed

pacowong deleted the merge/upstream-main-2026-05-14 branch May 15, 2026 01:37

pacowong merged 2452 commits into main from merge/upstream-main-2026-05-14

20 participants
