Hi there,
I noticed that the current FP16 GEMM implementation on ARM is a generic C implementation that does not use any ISA-specific optimizations (e.g., NEON or SVE intrinsics).
Are there any plans to optimize this kernel in the future? Given that some ARM cores lack dedicated FP16 dot-product instructions or MMLA variants, I suspect this might be the primary reason for the current generic implementation. Is this assumption correct?
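To make the question concrete, here is a rough sketch of the kind of NEON path I have in mind. It is not taken from the project: the names `gemm_f16_sketch`, `half_t`, and `HAVE_NEON_FP16` are made up for illustration, the matrices are assumed to be row-major, and accumulation stays in FP16, which a real kernel would probably want to avoid for accuracy. The vector path is only compiled when the ACLE macro `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` is defined (e.g. with `-march=armv8.2-a+fp16`); elsewhere it falls back to a scalar loop with `float` as a stand-in type.

```c
#include <stddef.h>

/* Hypothetical names for illustration only; not the project's API. */
#if defined(__ARM_NEON) && defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
#include <arm_neon.h>
typedef float16_t half_t;
#define HAVE_NEON_FP16 1
#else
typedef float half_t; /* stand-in so the sketch still compiles on other targets */
#define HAVE_NEON_FP16 0
#endif

/* C[M x N] += A[M x K] * B[K x N], all row-major (an assumption). */
void gemm_f16_sketch(size_t M, size_t N, size_t K,
                     const half_t *A, const half_t *B, half_t *C)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t k = 0; k < K; ++k) {
            const half_t a = A[i * K + k];
            size_t j = 0;
#if HAVE_NEON_FP16
            /* Vector path: c += b * a across 8 FP16 lanes per iteration. */
            for (; j + 8 <= N; j += 8) {
                float16x8_t c = vld1q_f16(&C[i * N + j]);
                float16x8_t b = vld1q_f16(&B[k * N + j]);
                c = vfmaq_n_f16(c, b, a);
                vst1q_f16(&C[i * N + j], c);
            }
#endif
            /* Scalar tail (and full fallback when the FP16 extension is absent). */
            for (; j < N; ++j) {
                C[i * N + j] = (half_t)((float)C[i * N + j] +
                                        (float)a * (float)B[k * N + j]);
            }
        }
    }
}
```

This leaves out blocking, packing, and FP32 accumulation, so it is only meant to show the dispatch idea: use the FP16 vector instructions where the core reports them and keep the generic loop as the fallback.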
Thanks for your time and for maintaining this project!