I've been analyzing the performance of [SD]GER and noticed that its computation path goes through kernel/generic/ger.c and driver/level2/ger_thread.c, eventually relying on the [SD]AXPY kernel.
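For Level-2 BLAS routines like GER, we can often achieve better performance by unrolling the loops in both directions, over the rows and the columns of the matrix, so that each loaded element of x is reused across several columns instead of being reloaded for every AXPY call.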
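To illustrate the idea, here is a minimal scalar sketch in plain C. It is not the OpenBLAS kernel interface: the function names `ger_axpy` and `ger_unrolled`, the unit-stride assumption for x and y, and the 4-column by 2-row unroll factors are all hypothetical choices for illustration. A real kernel for these targets would presumably use SVE and tuned unroll factors.

```c
#include <stddef.h>

/* Baseline shape: one AXPY-style update per column of a column-major A,
 * i.e. A(:,j) += (alpha * y[j]) * x. Assumes unit stride for x and y. */
static void ger_axpy(size_t m, size_t n, double alpha,
                     const double *x, const double *y,
                     double *a, size_t lda)
{
    for (size_t j = 0; j < n; j++) {
        double t = alpha * y[j];
        for (size_t i = 0; i < m; i++)
            a[i + j * lda] += t * x[i];
    }
}

/* Sketch: unroll by 4 columns and 2 rows. Each x value loaded in the
 * inner loop is reused for 4 columns. Unroll factors are illustrative. */
static void ger_unrolled(size_t m, size_t n, double alpha,
                         const double *x, const double *y,
                         double *a, size_t lda)
{
    size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        double t0 = alpha * y[j],     t1 = alpha * y[j + 1];
        double t2 = alpha * y[j + 2], t3 = alpha * y[j + 3];
        double *a0 = a + j * lda, *a1 = a0 + lda;
        double *a2 = a1 + lda,    *a3 = a2 + lda;
        size_t i = 0;
        for (; i + 2 <= m; i += 2) {
            double x0 = x[i], x1 = x[i + 1];
            a0[i] += t0 * x0; a0[i + 1] += t0 * x1;
            a1[i] += t1 * x0; a1[i + 1] += t1 * x1;
            a2[i] += t2 * x0; a2[i + 1] += t2 * x1;
            a3[i] += t3 * x0; a3[i + 1] += t3 * x1;
        }
        for (; i < m; i++) {          /* leftover row when m is odd */
            double x0 = x[i];
            a0[i] += t0 * x0; a1[i] += t1 * x0;
            a2[i] += t2 * x0; a3[i] += t3 * x0;
        }
    }
    for (; j < n; j++) {              /* leftover columns, one at a time */
        double t = alpha * y[j];
        for (size_t i = 0; i < m; i++)
            a[i + j * lda] += t * x[i];
    }
}
```

Both functions compute the same update A += alpha * x * y^T; the second only changes the traversal order so that loads of x are shared between columns.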
I would like to implement this optimization for the [SD]GER kernel on the A64FX and Neoverse V1.
I will be working on this.