The 2D thread partitioning in GEMM (PR#4655) requires nthreads_m % 2 == 0. When M << N, this can rule out the best nthreads_m × nthreads_n split on architectures such as A64FX (48 cores) or Grace (144 cores). Their core counts have odd factors as well as factors of 2 (48 = 3 × 16, 144 = 9 × 16), so splits with an odd nthreads_m, such as 3 × 16 on A64FX or 3 × 48 on Grace, are excluded. A fix is planned.
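To illustrate how the even-nthreads_m rule can exclude a better split, here is a minimal sketch. It enumerates every (nthreads_m, nthreads_n) factorization of the thread count and picks the one that minimizes a hypothetical cost proxy: the rows of A plus the columns of B that each thread touches per unit of K, M/nthreads_m + N/nthreads_n. This proxy and the helper names (`factor_pairs`, `per_thread_traffic`, `best_split`) are assumptions made for illustration. They are not the heuristic that OpenBLAS or PR#4655 actually uses.

```python
from fractions import Fraction


def factor_pairs(nthreads, require_even_m=False):
    """Return every (nthreads_m, nthreads_n) with nthreads_m * nthreads_n == nthreads.

    If require_even_m is True, keep only pairs with an even nthreads_m,
    mimicking the nthreads_m % 2 == 0 restriction.
    """
    pairs = []
    for m in range(1, nthreads + 1):
        if nthreads % m != 0:
            continue
        if require_even_m and m % 2 != 0:
            continue
        pairs.append((m, nthreads // m))
    return pairs


def per_thread_traffic(M, N, m, n):
    """Hypothetical cost proxy: rows of A plus columns of B per thread (per unit of K)."""
    # Fractions keep the comparison exact (no floating-point ties or rounding).
    return Fraction(M, m) + Fraction(N, n)


def best_split(M, N, nthreads, require_even_m=False):
    """Pick the factorization that minimizes the proxy cost (hypothetical heuristic)."""
    pairs = factor_pairs(nthreads, require_even_m)
    if not pairs:
        # For example, an odd thread count has no factorization with an even nthreads_m.
        raise ValueError("no valid (nthreads_m, nthreads_n) factorization")
    return min(pairs, key=lambda p: per_thread_traffic(M, N, p[0], p[1]))


if __name__ == "__main__":
    # Grace-like case: 144 threads and a short, wide problem (M << N).
    print(best_split(256, 4096, 144))                       # unconstrained
    print(best_split(256, 4096, 144, require_even_m=True))  # with the even-m rule
```

With M = 256 and N = 4096 on 144 threads, the proxy prefers a 3 × 48 split. The even-nthreads_m rule excludes that split, so the best remaining choice is 4 × 36. Under the same proxy, M = 768 and N = 4096 on 48 threads (A64FX) behave the same way: the proxy prefers 3 × 16, but the even-m rule falls back to 4 × 12.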