Summary
When OpenBLAS is built with USE_OPENMP, calling openblas_set_num_threads(1) before a BLAS operation has no lasting effect.
The thread count is unconditionally overridden back to omp_get_max_threads() on every BLAS call, making the API a silent no-op.
Users who call openblas_set_num_threads(1) to get deterministic BLAS results will find that it does not work: the results remain nondeterministic, because the thread count is reset on the next BLAS call.
The workaround is to run with the environment variable OMP_NUM_THREADS=1 set. But it seems wrong that this cannot be done from code, and that a function explicitly named to set the number of threads does not actually do so.
Call chain demonstrating the problem
- User calls openblas_set_num_threads(1), which calls goto_set_num_threads(1), setting blas_cpu_number = 1.
- User then calls cblas_sgemm(); the code is in void CNAME() in gemm.c.
- Inside, args.nthreads = get_gemm_optimal_nthreads(MNK) calls num_cpu_avail(3).
  - num_cpu_avail() calls omp_get_max_threads(), which returns the OpenMP default (all CPUs, since OMP_NUM_THREADS is unset).
  - if (blas_cpu_number != openmp_nthreads) is true (1 != N).
  - goto_set_num_threads(openmp_nthreads) overrides blas_cpu_number back to N.
  - num_cpu_avail() returns blas_cpu_number (now N), which flows to GEMM_THREAD(..., args.nthreads), then to exec_blas(num_cpu, queue), then to #pragma omp parallel for.
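The chain above can be condensed into a small stand-alone model. This is only a sketch of the logic as described in this report, not the actual OpenBLAS source: every model_* name is hypothetical, the OpenMP runtime is replaced by a plain variable, and the machine is assumed to have 8 CPUs.

```c
/* Stand-in for the OpenMP runtime's maximum thread count.
   Assumption: 8 CPUs and OMP_NUM_THREADS unset. This is NOT the real
   omp_get_max_threads(), just a model of what it would return. */
static int model_omp_max_threads = 8;
static int model_omp_get_max_threads(void) { return model_omp_max_threads; }

/* Stand-in for OpenBLAS's internal blas_cpu_number. */
static int model_blas_cpu_number = 8;

/* Models goto_set_num_threads() as described in the report:
   only the BLAS-side value changes, the OpenMP side is untouched. */
static void model_goto_set_num_threads(int n) { model_blas_cpu_number = n; }

/* Models the re-sync in num_cpu_avail(): any mismatch with the OpenMP
   runtime is resolved in favour of the OpenMP value. */
static int model_num_cpu_avail(void) {
    int openmp_nthreads = model_omp_get_max_threads();
    if (model_blas_cpu_number != openmp_nthreads)
        model_goto_set_num_threads(openmp_nthreads);
    return model_blas_cpu_number;
}

/* Thread count a BLAS call would end up with after the user
   requested `requested` threads. */
static int model_threads_used_after_request(int requested) {
    model_goto_set_num_threads(requested); /* openblas_set_num_threads() */
    return model_num_cpu_avail();          /* first step inside cblas_sgemm() */
}
```

In this model, model_threads_used_after_request(1) yields 8 rather than 1, which mirrors the silent override described above.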
Root cause
goto_set_num_threads() does NOT call omp_set_num_threads() -- confirmed by searching the entire driver/ directory (zero results). So the OpenMP runtime's idea of the thread count is never updated, and num_cpu_avail() always re-syncs blas_cpu_number from omp_get_max_threads().
Expected behavior
openblas_set_num_threads(1) should durably limit OpenBLAS to 1 thread, even when built with USE_OPENMP. The only current workaround is setting OMP_NUM_THREADS=1 in the environment, which is a process-global side effect affecting all OpenMP consumers.
Suggested fix
When USE_OPENMP is defined, goto_set_num_threads() should also call omp_set_num_threads(num_threads) so that subsequent calls to omp_get_max_threads() in num_cpu_avail() return the value the user requested.
I have not tested this yet, but it seems the most sensible approach to me.
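A minimal sketch of the idea, under the same caveat that it is untested against OpenBLAS itself. The sketch_* names are hypothetical stand-ins for the internals named above. omp_set_num_threads() and omp_get_max_threads() are the standard OpenMP calls when compiled with OpenMP enabled; otherwise the sketch falls back to stubs that assume 8 CPUs.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stubs so the sketch compiles without OpenMP support.
   Assumption: 8 CPUs by default. */
static int stub_max_threads = 8;
static void omp_set_num_threads(int n) { stub_max_threads = n; }
static int omp_get_max_threads(void) { return stub_max_threads; }
#endif

/* Hypothetical stand-in for blas_cpu_number. */
static int sketch_blas_cpu_number = 8;

/* Proposed behaviour: update the OpenMP runtime together with the
   BLAS-side value, so the two can no longer disagree. */
static void sketch_goto_set_num_threads(int num_threads) {
    sketch_blas_cpu_number = num_threads;
    omp_set_num_threads(num_threads); /* the suggested addition */
}

/* Same re-sync logic as before; with the addition above, the mismatch
   branch is not taken after a user request. */
static int sketch_num_cpu_avail(void) {
    int openmp_nthreads = omp_get_max_threads();
    if (sketch_blas_cpu_number != openmp_nthreads)
        sketch_goto_set_num_threads(openmp_nthreads);
    return sketch_blas_cpu_number;
}
```

After sketch_goto_set_num_threads(1), sketch_num_cpu_avail() keeps returning 1. One design consideration: omp_set_num_threads() also changes the default team size for other OpenMP parallel regions started from the calling thread, so maintainers may want to weigh that side effect.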
Practical impact
This breaks deterministic computation for any library that links OpenBLAS and tries to force single-threaded BLAS via the documented API. Multi-threaded floating-point reductions in exec_blas() produce nondeterministic results due to different summation orders across runs.
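To illustrate why summation order matters, here is a small example that is independent of OpenBLAS. Both functions add the same three floats, but group them differently, much as different thread schedules might combine partial sums in a different order. The function names are made up for this illustration, and volatile is used only to force each intermediate to be rounded to float.

```c
/* Computes (a + b) + c with each intermediate rounded to float. */
static float sum_left_to_right(float a, float b, float c) {
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* Computes a + (b + c) with each intermediate rounded to float. */
static float sum_right_to_left(float a, float b, float c) {
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}
```

With a = 1e8f, b = -1e8f, c = 1.0f, the first grouping gives 1.0f while the second gives 0.0f, because 1.0f is lost when added to -1e8f in single precision.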
I found this while trying to make llama-cpp deterministic with its --threads 1 option, which did not work.
Version
Tested on commit 8cecf899e (v0.3.32).