v3.0.0
Major: Improve GEMM performance and add Level 3 BLAS/mixed-precision support
- Improve the performance of the existing GEMM implementation: `gemmul8::gemm`, `gemmul8::gemmLt`
- Add support for the following Level 3 BLAS-like matrix operations:
  - SYMM (`gemmul8::symm`, `gemmul8::symmLt`)
  - SYRK (`gemmul8::syrk`, `gemmul8::syrkLt`)
  - SYR2K (`gemmul8::syr2k`, `gemmul8::syr2kLt`)
  - SYRKX (`gemmul8::syrkx`, `gemmul8::syrkxLt`)
  - HERK (`gemmul8::herk`, `gemmul8::herkLt`)
  - HER2K (`gemmul8::her2k`, `gemmul8::her2kLt`)
  - HERKX (`gemmul8::herkx`, `gemmul8::herkxLt`)
  - TRMM (`gemmul8::trmm`, `gemmul8::trmmLt`)
  - TRSM (`gemmul8::trsm`, `gemmul8::trsmLt`)
  - TRTRMM (`gemmul8::trtrmm`, `gemmul8::trtrmmLt`): triangular-by-triangular matrix multiplication
- Add support for mixed-precision execution
- Add workspace-query support by calling GEMMul8 routines with `work == nullptr`
- Extend `gemmul8::workSize` to support the routines listed above except TRSM
- Add `gemmul8::workSizeTrsm` for TRSM workspace-size calculation
- Add TRSM block-size control APIs for the internal blocked algorithm:
  - `gemmul8::set_block_size_trsm(int nB)`
  - `gemmul8::get_block_size_trsm()`
- Add overload (Hook Mode) support for the routines listed above
- Add overload (Hook Mode) support for `_64`, `3m`, and `3m_64` variants where applicable
- Change the GEMMul8 routine argument type from `unsigned num_moduli` to `int num_moduli`
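
The workspace-query entry above follows a common two-call idiom: call once with `work == nullptr` to learn the required size, allocate, then call again with a real buffer. The sketch below illustrates only that idiom. `scale_with_workspace` and `scaled_copy` are hypothetical host-only stand-ins invented for this example; they are not GEMMul8 functions, and the actual GEMMul8 signatures, return types, and memory spaces may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical routine (NOT part of GEMMul8) that follows the
// "work == nullptr means query" convention.
// Returns the number of workspace bytes it needs. With work == nullptr it
// only reports that size and performs no computation.
std::size_t scale_with_workspace(std::size_t n, const double* x, double* y,
                                 double alpha, void* work) {
    const std::size_t needed = n * sizeof(double);  // scratch for n doubles
    if (work == nullptr) {
        return needed;  // query mode: inputs and outputs are not touched
    }
    double* tmp = static_cast<double*>(work);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = alpha * x[i];  // stage in scratch
    for (std::size_t i = 0; i < n; ++i) y[i] = tmp[i];          // copy to output
    return needed;
}

// Two-call pattern: query the size, allocate, then run for real.
std::vector<double> scaled_copy(const std::vector<double>& x, double alpha) {
    const std::size_t n = x.size();
    std::vector<double> y(n, 0.0);

    // First call: work == nullptr, so only the required size comes back.
    const std::size_t bytes =
        scale_with_workspace(n, x.data(), y.data(), alpha, nullptr);

    // Allocate at least `bytes` bytes, aligned for double.
    std::vector<double> work((bytes + sizeof(double) - 1) / sizeof(double));

    // Second call: real buffer, real computation. Skipped for n == 0,
    // where an empty vector's data() may itself be nullptr.
    if (n > 0) {
        const std::size_t used =
            scale_with_workspace(n, x.data(), y.data(), alpha, work.data());
        assert(used == bytes);
    }
    return y;
}
```

The benefit of this convention is that the caller owns the allocation and can reuse one buffer across many calls, sizing it from the largest query result.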