Improve [SD]GER performance on A64FX and Neoverse V1

I've been analyzing the performance of [SD]GER and noticed that its computation path goes through kernel/generic/ger.c and driver/level2/ger_thread.c, eventually relying on the [SD]AXPY kernel.
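For Level-2 BLAS routines like GER, we can often achieve better performance by unrolling the loops in both directions (over rows and over columns), so that values loaded from x and y are reused across several updates of A instead of being reloaded on every AXPY call.
I would like to implement this optimization for the [SD]GER kernel on the A64FX and Neoverse V1.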
I will be working on this.
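
To make the idea concrete, here is a minimal plain-C sketch of the two approaches for a column-major matrix with unit-stride vectors. It is only an illustration, not the OpenBLAS code: the function names (`axpy_unit`, `ger_axpy`, `ger_unrolled`) and the 4x2 unroll factors are hypothetical, and a real kernel for these targets would presumably use SVE and tuned block sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y[0..n) += alpha * x[0..n), unit stride. */
static void axpy_unit(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* A += alpha * x * y^T, column-major: one AXPY per column of A.
 * This mirrors the AXPY-based structure described above. */
static void ger_axpy(size_t m, size_t n, double alpha,
                     const double *x, const double *y,
                     double *a, size_t lda)
{
    assert(lda >= m);
    for (size_t j = 0; j < n; j++)
        axpy_unit(m, alpha * y[j], x, a + j * lda);
}

/* Same update, unrolled 4 rows x 2 columns (illustrative factors).
 * Each loaded x[i] is reused for two columns of A. */
static void ger_unrolled(size_t m, size_t n, double alpha,
                         const double *x, const double *y,
                         double *a, size_t lda)
{
    assert(lda >= m);
    size_t j = 0;
    for (; j + 2 <= n; j += 2) {
        const double ay0 = alpha * y[j];
        const double ay1 = alpha * y[j + 1];
        double *a0 = a + j * lda;
        double *a1 = a0 + lda;
        size_t i = 0;
        for (; i + 4 <= m; i += 4) {
            const double x0 = x[i],     x1 = x[i + 1];
            const double x2 = x[i + 2], x3 = x[i + 3];
            a0[i]     += ay0 * x0;  a1[i]     += ay1 * x0;
            a0[i + 1] += ay0 * x1;  a1[i + 1] += ay1 * x1;
            a0[i + 2] += ay0 * x2;  a1[i + 2] += ay1 * x2;
            a0[i + 3] += ay0 * x3;  a1[i + 3] += ay1 * x3;
        }
        /* Row remainder for this pair of columns. */
        for (; i < m; i++) {
            a0[i] += ay0 * x[i];
            a1[i] += ay1 * x[i];
        }
    }
    /* Column remainder: fall back to a single AXPY. */
    for (; j < n; j++)
        axpy_unit(m, alpha * y[j], x, a + j * lda);
}
```

Both functions perform the same multiply-add per element of A, so they should agree on the result; the unrolled version only changes how often x and y are loaded.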

Improve [SD]GER performance on A64FX and Neoverse V1 #5514