From 496af0d8bb3a5f0ae2ed2de67594508cf1d8bc73 Mon Sep 17 00:00:00 2001
From: Martin Kroeker
Date: Sun, 22 Mar 2026 22:34:27 +0100
Subject: [PATCH] add gemm_batch, gemm_batch_strided, bgemm/bgemv and fp16 extensions

---
 docs/extensions.md | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/docs/extensions.md b/docs/extensions.md
index bc015910d3..ea6eff5a23 100644
--- a/docs/extensions.md
+++ b/docs/extensions.md
@@ -13,7 +13,9 @@ This page documents those non-standard APIs.
 | ?omatcopy | s,d,c,z | out-of-place transposition/copying |
 | ?geadd | s,d,c,z | ATLAS-like matrix add `B = α*A+β*B` |
 | ?gemmt | s,d,c,z | `gemm` but only a triangular part updated |
-
+| cblas_?gemm_batch | s,d,c,z,b | `gemm` with several groups of input data
+|
+| cblas_?gemm_batch_strided | s,d,c,z,b | `gemm` with groups of data stored at fixed offsets in the input arrays
 
 ## bfloat16 functionality
 
@@ -26,6 +28,15 @@ BLAS-like and conversion functions for `bfloat16` (available when OpenBLAS was c
 * `float cblas_sbdot` computes the dot product of two bfloat16 arrays
 * `void cblas_sbgemv` performs the matrix-vector operations of GEMV with the input matrix and X vector as bfloat16
 * `void cblas_sbgemm` performs the matrix-matrix operations of GEMM with both input arrays containing bfloat16
+* `void cblas_bgemv` performs the matrix-vector operations of GEMV with the input matrix, X vector and result as bfloat16
+* `void cblas_bgemm` performs the matrix-matrix operations of GEMM with both input arrays containing bfloat16 and the output being bfloat16 as well
+
+## half-precision float or fp16 functionality
+
+BLAS-like and conversion functions for `hfloat16` (available when OpenBLAS was compiled with `BUILD_HFLOAT16=1`):
+
+* `void cblas_shgemm` performs the matrix-matrix operations of GEMM with both input arrays containing hfloat16
+
 
 ## Utility functions
 
@@ -36,4 +47,5 @@ BLAS-like and conversion functions for `bfloat16` (available when OpenBLAS was c
 * `char * openblas_get_config()` returns the options OpenBLAS was built with, something like `NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell`
 * `int openblas_set_affinity(int thread_index, size_t cpusetsize, cpu_set_t *cpuset)` sets the CPU affinity mask of the given thread to the provided cpuset. Only available on Linux, with semantics identical to `pthread_setaffinity_np`.
+* `openblas_set_thread_callback_function` overrides the default multithreading backend with the provided argument
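
Reviewer note: the new table row describes `cblas_?gemm_batch_strided` only as "`gemm` with groups of data stored at fixed offsets in the input arrays". The plain-C sketch below illustrates what that phrase presumably means. It is a naive reference loop, not OpenBLAS code, and the patch does not show the real signature: the helper name `ref_sgemm_batch_strided`, its parameter list, the row-major layout and the absence of transposes are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference helper, NOT an OpenBLAS function.
 * For each i in [0, batch) it computes
 *     C_i = alpha * A_i * B_i + beta * C_i
 * where A_i is m x k, B_i is k x n, C_i is m x n (all row-major, no
 * transposes), and matrix i starts at the fixed offset i * stride_x
 * inside its array. */
void ref_sgemm_batch_strided(size_t m, size_t n, size_t k, float alpha,
                             const float *a, size_t stride_a,
                             const float *b, size_t stride_b,
                             float beta, float *c, size_t stride_c,
                             size_t batch)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < batch; i++) {
        /* Locate the i-th problem via the fixed strides. */
        const float *ai = a + i * stride_a;
        const float *bi = b + i * stride_b;
        float *ci = c + i * stride_c;

        for (size_t row = 0; row < m; row++) {
            for (size_t col = 0; col < n; col++) {
                float acc = 0.0f;
                for (size_t p = 0; p < k; p++)
                    acc += ai[row * k + p] * bi[p * n + col];
                ci[row * n + col] = alpha * acc + beta * ci[row * n + col];
            }
        }
    }
}
```

With two 2x2 problems stored back to back (stride 4), `a = {1,2,3,4, 1,0,0,1}`, `b = {5,6,7,8, 9,10,11,12}`, `alpha = 1` and `beta = 0`, this sketch fills `c` with `{19,22,43,50, 9,10,11,12}`. A sentence or short example along these lines in `docs/extensions.md` might make the "fixed offsets" wording easier to follow.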