
Releases: uxlfoundation/oneDNN

v3.8.1

27 May 22:56

This is a patch release containing the following changes to v3.8:

  • Fixed correctness issue in reorder primitive with non-trivial strides on Intel CPUs (a762d32)
  • Fixed runtime error in convolution weight gradient on Xe2 architecture-based Intel GPUs (a8fac73, c409ef9)
  • Fixed performance regression in bf16 convolution on Intel Datacenter GPU Max Series (98170d0, c6bae4a, c5edd53, bb1a591)
  • Improved performance of fp16 matmul with fp8 compressed weights on Intel GPUs (58f3ec1, abff176, ffd7dd3, 3b1e855, 2e140de, 3429f79)
  • Fixed runtime error in fp16 pooling primitive on Xe2 architecture-based Intel GPUs (c0f6b6d)
  • Improved performance of fp16 matmul with int4 weights and 32 < m <= 64 on Intel GPUs (2fa7072)
  • Fixed correctness issues in bf16 matmul with tensors of 3 or more dimensions on processors with Intel AMX support (dd20965, ea1b4a1)
  • Fixed performance regression in fp16 or bf16 matmul with transposed source and weight tensors on Intel Datacenter GPU Max Series (e45e1aa)
  • Improved performance of bf16 matmul with int4 weights on Intel GPUs (7a15c23)
  • Fixed runtime error in fp16 SDPA subgraph with head size 512 on integrated GPUs of Intel Core Ultra (Series 2) processors (bde6985)

v3.8

10 May 00:29

Performance Optimizations

Intel Architecture Processors

  • Improved performance of matmul and inner product primitives on processors with Intel AMX instruction set support.
  • Improved performance of convolution and inner product primitives on processors with Intel AVX2 instruction set support.
  • Improved performance of int8 convolution with zero points.
  • Improved fp32 convolution performance with fp16 and bf16 compressed weights on processors with Intel AVX2 or Intel AVX-512 instruction set support.
  • Improved fp16/bf16 depthwise convolution performance with fp32 bias, sum post-ops, or dilation.
  • Improved bf16 pooling backpropagation performance.
  • Improved binary post-ops performance with per_w broadcast.

Intel Graphics Products

  • Improved performance on Intel Arc graphics for future Intel Core Ultra processors (code name Panther Lake).
  • Improved convolution performance on:
    • Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
    • Intel Arc B-series discrete graphics (formerly Battlemage).
  • Improved int8 matmul performance with zero-points support for source and weight tensors.
  • Improved f4_e2m1 and f4_e3m0 matmul and reorder performance.
  • Improved performance of the following subgraphs with Graph API:

AArch64-based Processors

  • Improved fp16 reorder performance.
  • Improved int8 matmul performance.
  • Improved bf16 inner product forward propagation performance with Arm Compute Library (ACL).
  • Improved bf16 eltwise performance.
  • Improved convolution performance with ACL on processors with SVE support.

Functionality

Common

  • Extended Graph API Softmax operation to support inf_as_zero mode. This functionality enables SDPA subgraphs compliant with PyTorch Safe Softmax semantics.
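
    The sketch below illustrates how this mode could be requested through the Graph API C++ interface. It is a minimal, unverified example: the attribute name (op::attr::mode) and the string value "inf_as_zero" are assumptions taken from the description above, and the surrounding graph setup is omitted.

        // Hypothetical sketch: SoftMax op with inf_as_zero mode (oneDNN Graph API).
        #include <oneapi/dnnl/dnnl_graph.hpp>

        using namespace dnnl::graph;

        void add_safe_softmax(graph &g, const logical_tensor &src,
                const logical_tensor &dst) {
            op softmax_op(/*id=*/0, op::kind::SoftMax, "safe_softmax");
            softmax_op.set_attr<int64_t>(op::attr::axis, -1);
            // Attribute name and value assumed from the release note above.
            softmax_op.set_attr<std::string>(op::attr::mode, "inf_as_zero");
            softmax_op.add_input(src);
            softmax_op.add_output(dst);
            g.add_op(softmax_op);
        }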

Intel Architecture Processors

  • Introduced support for f32 convolution with fp16 compressed weights.
  • Enabled int8/int4 compressed weights support in matmul primitive.

Intel Graphics Products

  • Introduced select algorithm support in binary primitive.
  • Introduced support for f4_e2m1 and f4_e3m0 data types in convolution primitive.
  • Introduced support for the GenIndex operation in Graph API.

Generic GPU Vendor

  • Introduced support for:
    • Vanilla RNN forward propagation.
    • Inner product backpropagation.
    • Group normalization.
  • Improved accuracy of inner product primitive with sum post-ops for large shapes.

NVIDIA GPUs

  • Introduced Graph API support.

Usability

  • Added support for the group normalization primitive with the ONEDNN_ENABLE_PRIMITIVE build option.
  • Enabled support for ROCm 6 on AMD GPUs.
  • Improved CMake integration for oneDNN installation with the NVIDIA backend enabled.
  • Reduced memory footprint for matmul primitive when using ACL.

Validation

  • Added benchdnn option --execution-mode to test oneDNN functionality with SYCL Graph record/execute mode.
  • Extended benchdnn option --cold-cache with support for cold TLB mode.
  • Added benchdnn option --bia-dt to control bias data type for matmul, inner product, convolution, and deconvolution primitives.
  • Extended syntax of benchdnn --dt option in Graph API driver to manage data types of individual tensors in a pattern.

Deprecated Functionality

  • The BLAS-like API, including the dnnl::sgemm, dnnl::gemm_u8s8s32, and dnnl::gemm_s8s8s32 functions, is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
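
    For users migrating away from the BLAS-like API, here is a minimal sketch of how a dnnl::sgemm-style call (C = A x B, row-major fp32, no transposes) might be expressed with the matmul primitive; the engine kind, layouts, and lack of attributes are simplifying assumptions, not a drop-in replacement.

        // Sketch: expressing an sgemm-like call via the matmul primitive.
        #include <oneapi/dnnl/dnnl.hpp>

        using namespace dnnl;

        void gemm_via_matmul(int64_t M, int64_t N, int64_t K,
                float *A, float *B, float *C) {
            engine eng(engine::kind::cpu, 0);
            stream strm(eng);

            // Row-major ("ab") fp32 matrices.
            memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
            memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
            memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

            matmul::primitive_desc pd(eng, a_md, b_md, c_md);
            matmul(pd).execute(strm,
                    {{DNNL_ARG_SRC, memory(a_md, eng, A)},
                     {DNNL_ARG_WEIGHTS, memory(b_md, eng, B)},
                     {DNNL_ARG_DST, memory(c_md, eng, C)}});
            strm.wait();
        }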

Breaking Changes

Thanks to our Contributors

This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, Denis @redradist, Dmitriy Ovchinnikov @inteldimitrius, Eliezer Weissmann @eliezerweissmann, Hubert Maciak @hmaciak, Ilya Lavrenov @ilya-lavrenov, James McGregor @Jmc18134, @jstachowintel, Marek Michalowski @michalowski-arm, Maria Zhukova @mzhukova, Orel Yehuda @yehudaorel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, @Shreyas-fuj, Shu Chen @shu1chen, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, and @zhangfeiv0.

v3.7.3

21 Apr 16:24

This is a patch release containing the following changes to v3.7.2:

  • Fixed correctness issue in matmul with non-trivial strides for the first tensor on processors with Intel AMX instruction set support (e18c622)
  • Removed spurious warning messages for SDPA subgraph on Intel GPUs (05541bb, 9e9a3a6)
  • Fixed segfault in fp32 matmul with bf16 math mode on processors with Intel AVX2 instruction set support (7d495ae)
  • Fixed performance regression in bf16 3D convolution backpropagation on processors with Intel AVX-512 and Intel DL Boost instruction set support (c38e02c, 67afc74)
  • Worked around GCC 12.3 bug causing accuracy issues in fp8 functionality on Intel GPUs (69b38d7)
  • Removed -fcf-protection build option for GCC 7 and earlier versions (813725d)

v3.8-rc

18 Apr 22:46
Pre-release

Performance Optimizations

Intel Architecture Processors

  • Improved performance of matmul and inner product primitives on processors with Intel AMX instruction set support.
  • Improved performance of convolution and inner product primitives on processors with Intel AVX2 instruction set support.
  • Improved performance of int8 convolution with zero points.
  • Improved fp32 convolution performance with fp16 and bf16 compressed weights on processors with Intel AVX2 or Intel AVX-512 instruction set support.
  • Improved fp16/bf16 depthwise convolution performance with fp32 bias, sum post-ops, or dilation.
  • Improved bf16 pooling backpropagation performance.
  • Improved binary post-ops performance with per_w broadcast.

Intel Graphics Products

  • Improved performance on Intel GPUs based on Xe3 architecture.
  • Improved convolution performance on:
    • Intel Arc Graphics for Intel Core Ultra (Series 2, formerly Lunar Lake).
    • Intel Arc B-series discrete graphics (formerly Battlemage).
  • Improved int8 matmul performance with zero-points support for source and weight tensors.
  • Improved f4_e2m1 and f4_e3m0 matmul and reorder performance.
  • Improved performance of the following subgraphs with Graph API:

AArch64-based Processors

  • Improved fp16 reorder performance.
  • Improved int8 matmul performance.
  • Improved bf16 inner product forward propagation performance with Arm Compute Library (ACL).
  • Improved convolution performance with ACL on processors with SVE support.

Functionality

Common

  • Extended Graph API Softmax operation to support inf_as_zero mode. This functionality enables SDPA subgraphs compliant with PyTorch Safe Softmax semantics.

Intel Architecture Processors

  • Introduced support for f32 convolution with fp16 compressed weights.
  • Enabled int8/int4 compressed weights support in matmul primitive.

Intel Graphics Products

  • Introduced select algorithm support in binary primitive.
  • Introduced support for f4_e2m1 and f4_e3m0 data types in convolution.
  • Introduced support for the GenIndex operation in Graph API.

Generic GPU Vendor

  • Introduced support for:
    • Vanilla RNN forward propagation
    • Inner product backpropagation
    • Group normalization
  • Improved accuracy of inner product primitive with sum post-ops for large shapes.

NVIDIA GPUs

  • Introduced Graph API support.

Usability

  • Added support for the group normalization primitive with the ONEDNN_ENABLE_PRIMITIVE build option.
  • Enabled support for ROCm 6 on AMD GPUs.
  • Improved CMake integration for oneDNN installation with the NVIDIA backend enabled.
  • Reduced memory footprint for matmul primitive when using ACL.

Validation

  • Added benchdnn option --execution-mode to test oneDNN functionality with SYCL Graph record/execute mode.
  • Extended benchdnn option --cold-cache with support for cold TLB mode.
  • Added benchdnn option --bia-dt to control bias data type for matmul, inner product, convolution, and deconvolution.
  • Extended syntax of benchdnn --dt option in Graph API driver to manage data types of individual tensors in a pattern.

Breaking Changes

Thanks to our Contributors

This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, Denis @redradist, Dmitriy Ovchinnikov @inteldimitrius, Eliezer Weissmann @eliezerweissmann, Hubert Maciak @hmaciak, Ilya Lavrenov @ilya-lavrenov, James McGregor @Jmc18134, @jstachowintel, Marek Michalowski @michalowski-arm, Maria Zhukova @mzhukova, Orel Yehuda @yehudaorel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, @Shreyas-fuj, Shu Chen @shu1chen, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, and @zhangfeiv0.

v3.7.2

18 Mar 23:47

This is a patch release containing the following changes to v3.7.1:

  • Fixed hang in matmul with odd shapes on Intel Arc GPUs (46e7499)
  • Fixed out-of-registers error in matmul on Intel Arc GPUs (599c839)
  • Fixed incorrect results in SDPA pattern on Intel GPUs (6343c73)
  • Fixed integer overflow in convolution with large shapes on x64 CPUs (c541100)
  • Fixed access violation issue in experimental Graph Compiler (8b0e626)
  • Fixed access violation in pooling on Intel GPUs (cd2cd5d)
  • Improved performance of int8 matmul with int4 weights on Intel GPUs (d6c98ec)

v3.7.1

01 Mar 16:07

This is a patch release containing the following changes to v3.7:

  • Fixed correctness issue in int8 matmul primitive with int4 weights on Intel Arc graphics (b16184d)
  • Fixed matmul performance regression on Intel Arc graphics (41e406b)
  • Fixed potential integer overflow in bf16 convolution for processors with Intel AVX-512 instruction set support (f882861)
  • Fixed functional issue in matmul with dropout attribute on generic GPUs (8303330)
  • Fixed functional issues in matmul with scales on NVIDIA GPUs (e8d8594)
  • Fixed integer overflows for large shapes in convolution for x64 processors (fc3f17a, 31b079f)
  • Worked around an MSVC 19.29.30158.0 bug that results in a crash at binary primitive creation on x64 processors (50dd6cc)

v3.7

19 Feb 01:05

Performance Optimizations

Intel Architecture Processors

  • Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
  • Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
  • Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
  • Improved performance of int8 RNN primitive on processors with Intel AVX2 and Intel AVX-512 instruction set support.
  • Improved performance of int8 depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX-512 instruction set support.
  • Improved fp16 and bf16 softmax performance with relaxed accumulation mode.
  • Improved performance of int8 matmul primitive with fp16 output data type.
  • Improved performance of the following subgraphs with Graph API:

Intel Graphics Products

  • Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
  • Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
  • Improved performance of convolution with source zero points by pre-packing compensation.
  • Improved performance of backward-by-data convolution with strides for large filters.
  • Improved performance of the following subgraphs with Graph API:

AArch64-based Processors

  • Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
  • Improved bf16 to fp32 reorder performance.
  • Improved bf16 reorder performance.
  • Improved bf16 convolution performance with ACL.

NVIDIA GPUs

  • Improved matmul performance using cuBLASLt-based implementation.

Functionality

Common

  • Introduced support for select algorithm in binary primitive. The functionality is optimized for Intel CPUs.
  • Extended quantization support in matmul and reorder with grouped scales and zero-points for weights. This functionality is optimized for Intel CPUs and GPUs (a usage sketch follows this list).
  • Introduced initial support for 4-bit floating-point data types f4_e2m1 and f4_e3m0 in matmul and reorder, as well as e8m0 scales data type in matmul and reorder. This functionality is available on Intel CPUs and GPUs.
  • Introduced GenIndex and GreaterEqual operations in Graph API.
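
    As an illustration of the grouped weight quantization mentioned above, here is a hedged sketch of attaching per-group weight scales to a matmul via primitive attributes. The group size (128 along K), the data types, and the exact set_scales overload are assumptions based on this note, not a verified recipe.

        // Sketch: fp32 matmul with int8 weights and grouped scales along K.
        #include <oneapi/dnnl/dnnl.hpp>

        using namespace dnnl;

        matmul::primitive_desc make_grouped_scale_matmul(const engine &eng,
                memory::dim M, memory::dim N, memory::dim K) {
            memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
            memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::ab);
            memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

            primitive_attr attr;
            // Scales applied per group of 128 values along K and per column along N;
            // mask covers both weight dimensions (signature assumed).
            attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1),
                    /*groups=*/{128, 1}, memory::data_type::f32);

            return matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
        }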

Intel Architecture Processors

  • Introduced support for fp32 matmul with fp16 and bf16 weights.

Intel Graphics Products

  • Introduced stochastic rounding support for convolution, matmul, and reorder based on the Philox counter-based random number generator.
  • Introduced support for strided memory formats in convolution.

Generic GPU Vendor

  • Introduced support for reduction primitive.
  • Introduced support for inner product primitive forward propagation.

Usability

Common

  • With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines (a minimal sketch follows this list).
  • Added Graph API examples for Gated MLP and int4 Gated MLP patterns.
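
    In practice this means a memory object can safely go out of scope right after an asynchronous execute call. The sketch below only illustrates that lifetime rule, using an in-place ReLU as a stand-in primitive; it assumes a SYCL-enabled build and an already-created engine and stream.

        // Sketch: 'src' is destroyed before strm.wait(); with the SYCL runtime the
        // library now keeps the underlying memory object alive until execution ends.
        #include <oneapi/dnnl/dnnl.hpp>

        using namespace dnnl;

        void run_relu(engine &eng, stream &strm, float *data, memory::dim n) {
            memory::desc md({n}, memory::data_type::f32, memory::format_tag::a);
            eltwise_forward::primitive_desc pd(eng, prop_kind::forward_inference,
                    algorithm::eltwise_relu, md, md, /*alpha=*/0.f);
            eltwise_forward relu(pd);
            {
                memory src(md, eng, data); // goes out of scope before wait()
                relu.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, src}});
            }
            strm.wait(); // safe: no need to keep 'src' alive explicitly
        }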

Intel Architecture Processors

  • Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for Intel CPU and Intel GPU implementations.
  • Enabled frame pointers support on Intel64 platforms to improve integration with profilers.

Intel Graphics Products

  • Improved verbose diagnostics for Intel GPU driver compatibility issues.
  • Improved support of large size tensors in convolution, matmul and reduction primitives on Intel GPUs.
  • Reduced scratchpad usage for NCHW convolution on Intel GPUs.

AArch64-based Processors

  • Added support for the Arm Compute Library (ACL) thread_local scheduler via ThreadpoolScheduler.
  • Improved memory efficiency in ACL matmuls by fixing a bug where scratchpad memory was not being used.
  • Made the ACL matmul primitive thread-safe, allowing concurrent execution.

Validation

  • Extended benchdnn with support and validation for fp8 matmul patterns.
  • Extended benchdnn with support for tensor tags in RNN primitive validation.
  • Extended benchdnn with support for rewriting data types in the test JSON files in the graph driver.
  • Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.

Deprecated Functionality

  • Experimental Graph Compiler is deprecated and will be removed in future releases.

Breaking Changes

  • Updated minimal supported CMake version to 3.13 (was 2.8.12).
  • Updated minimal supported GCC version to 8.0 (was 4.8).
  • Updated minimal supported Clang version to 11.0 (was 3.0).
  • Updated minimal supported ACL version to 24.11.1 (was 24.09).
  • Removed support for SYCL standards preceding SYCL 2020.
  • Enforced fp32 accumulation mode in fp16 matmul and inner product primitives on Intel Graphics products without Intel XMX cores. The previous behavior can be enabled with the relaxed accumulation mode.
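
    For code that relied on the previous behavior, a relaxed accumulation mode can be requested through primitive attributes. A minimal sketch, assuming an fp16 matmul whose memory descriptors are created elsewhere:

        // Sketch: requesting relaxed accumulation to restore the earlier behavior.
        #include <oneapi/dnnl/dnnl.hpp>

        using namespace dnnl;

        matmul::primitive_desc make_fp16_matmul(const engine &eng,
                const memory::desc &src, const memory::desc &wei,
                const memory::desc &dst) {
            primitive_attr attr;
            attr.set_accumulation_mode(accumulation_mode::relaxed);
            return matmul::primitive_desc(eng, src, wei, dst, attr);
        }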

Thanks to our Contributors

This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Karasev @karasjoh000, John Osorio @kala855, Keola Wierschem @kwiersch, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nicolò Scipione @s-Nick, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Tadej Ciglarič @t4c1, Varad Ahirwadkar @varad-ahirwadkar, Viktoriia Gvozdeva @vgvozdeva, @vishwascm, @yair-obodovsky, and Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.

v3.7-rc

27 Jan 09:25
Pre-release

Performance Optimizations

Intel Architecture Processors

  • Improved fp16/bf16 softmax performance with relaxed accumulation mode.
  • Improved performance of int8 RNN primitive on processors with Intel AVX2 and Intel AVX-512 instruction set support.
  • Improved performance of convolution and matmul primitives on processors with Intel AMX support.
  • Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on processors with Intel AMX instruction set support.
  • Improved performance of int8 matmul primitive with fp16 output data type.
  • Improved performance of int8 depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX-512 instruction set support.

Intel Graphics Products

  • Introduced initial optimizations for GPUs based on Xe3 architecture.
  • Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
  • Improved performance of the following subgraphs with Graph API

Functionality

  • Introduced support for select algorithm in binary primitive. The functionality is optimized for Intel CPUs.
  • Enabled support for matmul primitive with grouped quantization on weights along the N dimension.
  • Graph API: added new Select, GenIndex, and GreaterEqual operations.
  • Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
  • Introduced support for grouped scales and zero points in reorder primitive.
  • Enabled support for 4D weight scales in matmul primitive.
  • Graph API: added support for quantized and non-quantized Gated MLP patterns.
  • Introduced preliminary support for 4-bit floating-point data types f4_e2m1 and f4_e3m0 in matmul and reorder, as well as e8m0 scales data type in matmul and reorder.

Usability

  • With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive by the user for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines.
  • Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for CPU and GPU (OpenCL-based) primitive implementations.
  • Improved verbose diagnostics to simplify debugging of nGEN fallbacks.
  • Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
  • Added examples for Gated MLP and int4 Gated MLP.

Validation

  • Extended benchdnn with support and validation for fp8 matmul patterns.
  • Extended benchdnn with support for tensor tags in RNN primitive validation.
  • Extended benchdnn with support for rewriting data types in the test JSON files in the graph driver.
  • Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.

Breaking Changes

  • Updated minimal supported CMake version to 3.13 (was 2.8.12).
  • Updated minimal supported GCC version to 8.0 (was 4.8).
  • Updated minimal supported Clang version to 11.0 (was 3.0).
  • Removed support for SYCL standards preceding SYCL 2020.

Thanks to these Contributors

This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Osorio @kala855, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Varad Ahirwadkar @varad-ahirwadkar, @vishwascm, and Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.

v3.6.2

06 Dec 05:18

This is a patch release containing the following changes to v3.6.1:

  • Fixed segmentation fault issue in convolution primitive on processors with Intel AVX2 instruction set support (2eb3dd1)
  • Added a workaround for build issue with GCC 8.2 and GNU binutils 2.27 (19ef223, 262fb02, e3782e8)
  • Fixed a thread safety issue in matmul primitive for builds relying on Arm Compute Library (ACL) and bumped minimal supported ACL version to 24.11.1 (4d962e7)
  • Suppressed spurious warnings for GCC (7d3164d, c805a50, e526172, dc780cb)
  • Fixed segfaults in BRGEMM-based matmul, convolution, and deconvolution implementations on AArch64-based processors (a873a1c, 9a1dc92)
  • Fixed performance regression in bf16 convolution with ACL on AArch64-based processors (4793296)
  • Fixed an issue with convolution primitive creation with PREFER_YMM CPU ISA hint on AArch64-based processors (e34d992)
  • Improved bf16 matmul performance with fp32 destination with ACL on AArch64-based processors (548d5d6)
  • Improved bf16 to fp32 reorder performance on AArch64-based processors (917dd13)
  • Fixed issue in matmul primitive with 4D tensors on AArch64-based processors (d13c966)
  • Suppressed spurious GCC warnings in deconvolution primitive on AArch64-based processors (f90f60e)
  • Fixed warnings in BRGEMM implementation on AArch64-based processors (866b196)
  • Fixed correctness issue in reorder primitive with zero points for 4D shapes on AArch64-based processors (836ea10)
  • Improved bf16 reorder performance on AArch64-based processors (12bafbe)
  • Fixed performance regression for backward convolution primitive descriptor creation time on Intel processors (2b3389f)
  • Improved performance of fp16 matmul with int4 weights on Intel GPUs based on Xe2 architecture (4c8fb2c, 3dd4f43, 280bd28)
  • Fixed performance regression for int8 convolution with large spatial sizes on processors with Intel AMX support (05d68df)
  • Restricted check for microkernel fusion support to cases when fusion functionality is actually used on Intel GPUs (48f6bd9)

v3.6.1

06 Nov 00:05

This is a patch release containing the following changes to v3.6:

  • Fixed convolution correctness issue in some scenarios involving persistent cache on Intel GPUs (e595e59)
  • Fixed potential page faults in reduction primitive implementation for Intel GPUs (7740c75, a4fcef9, 32d8660)
  • Implemented a workaround for GCC 13 bug that resulted in matmul hangs on some Intel Arc graphics SKUs (a30d526)
  • Updated execution units (EU) number detection logic for Intel GPUs based on Xe2 architecture to accommodate behavior changes in the Linux driver (04e7eac, 97b04bd)
  • Fixed build issue for static library with ONEDNN_VERBOSE=OFF (7f476cb)
  • Fixed correctness issue in SYCL deconvolution implementation with post-ops (8f600a3)
  • Fixed memory formats checks in SYCL softmax implementation (6ae73e4)
  • Fixed correctness issue in SYCL resampling implementation with post-ops (9845057)
  • Aligned accessor types in SYCL kernels with SYCL specification (0d9b3bd)
  • Improved scales argument checks in generic SYCL kernels (9f73bf1, 7d85c75)
  • Fixed correctness issue in int8 convolution with sum post-op on NVIDIA GPUs (7486ed8)
  • Relaxed accuracy test threshold for bf16 softmax on NVIDIA GPUs (e9d0fdb)
  • Added support for bf16 and fp16 bias for fp8 matmul on Intel CPUs (188ae7f)
  • Fixed a bug that prevented dispatching Intel AVX-512 with Intel DL Boost implementation in int8 RNN primitive (bf58e72)
  • Fixed a runtime failure with CL_OUT_OF_RESOURCES error in fp16 convolution on Intel Arc graphics (39a5f67, 7e1663f)