Clarifying naming of compute kernels prior to submission of PR

I'm currently implementing the FP16 kernels as a follow-up to #3754 and #2767.

I want to be 100% clear on OpenBLAS nomenclature prior to submitting my PR.

### Baseline assumptions 

Where name is `XYgemm_kernel_NxM_UARCH.c`

and

`X` defines the return data type
`Y` defines the input data type
`N` defines the column of the matrix, each column being the columnar access offset of a given register/vector
`M` defines the in-register vector width (so for FP16 in a 512-bit vector, 32 entries), while also being the row of the matrix

`X` and `Y` can be any of the following
s=single precision
d=double
b=bf16
h=fp16
c=complex single
z=complex double
`null` or no prefix

Also, for the sake of reference:
`ge` is general
`m` is matrix
`v` is vector
e.g. `sdgemv` would be an operation that takes double precision floats in a matrix and a vector and returns the result in single precision format


`UARCH` is the *first* `UARCH` for which the compute kernel is targeted. If a kernel already exists but a newer `UARCH` has come out with extensions whose additional instructions can further optimize the process, the new kernel uses the name of the newer `UARCH` instead. In some cases the lake/bridge suffix is dropped (sandy instead of sandy bridge). Sometimes the consumer/workstation name is used instead of the architecture name (skylakex instead of skylake).

This is assumed from the following:

The project has 6 implementations of an x86_64 single precision (fp32) 16x4 kernel.

3 in C with inline asm, 3 in pure asm.

Looking only at the ASM versions we have:
`sgemm...sandy.S` which is where AVX(1) instructions were introduced. 
`sgemm...haswell.S` which is where AVX2 instructions were introduced.
`sgemm...skylakex.S` which is where AVX512 instructions were introduced.

### Next:

If `X` is the same as `Y`, then `Y` is dropped and the name is shortened to `Xgemm...`
If `N` is the same as `M`, both are kept for the sake of clarity.
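
As a minimal sketch of the naming scheme as assumed so far (this is not OpenBLAS tooling, and the scheme itself is still unconfirmed), a small parser makes the rules concrete, including the "single prefix letter means `X` equals `Y`" rule. The names `TYPE_PREFIXES`, `KernelName` and `parse_kernel_name` are hypothetical, and the example file name simply follows the template proposed in this issue.

```python
import re
from typing import NamedTuple

# Type-prefix letters as listed in this issue (assumed, not confirmed).
TYPE_PREFIXES = {
    "s": "single precision",
    "d": "double precision",
    "b": "bf16",
    "h": "fp16",
    "c": "complex single",
    "z": "complex double",
}


class KernelName(NamedTuple):
    return_type: str  # X
    input_type: str   # Y (equal to X when only one letter is present)
    n: int
    m: int
    uarch: str


_LETTERS = "".join(TYPE_PREFIXES)
# Hypothetical pattern for XYgemm_kernel_NxM_UARCH.c (or .S for pure asm).
_PATTERN = re.compile(
    rf"^(?P<x>[{_LETTERS}])(?P<y>[{_LETTERS}])?gemm_kernel_"
    r"(?P<n>\d+)x(?P<m>\d+)_(?P<uarch>[a-z0-9]+)\.(?:c|S)$"
)


def parse_kernel_name(filename: str) -> KernelName:
    """Split a kernel file name into the fields assumed in this issue."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a recognised gemm kernel name: {filename!r}")
    x = match["x"]
    y = match["y"] or x  # a dropped Y means X == Y
    return KernelName(x, y, int(match["n"]), int(match["m"]), match["uarch"])


parsed = parse_kernel_name("hgemm_kernel_32x4_goldencove.c")
print(parsed.return_type, parsed.input_type, parsed.n, parsed.m, parsed.uarch)
```

The optional second letter is what keeps `sgemm` and a mixed-precision name such as `sdgemm` distinguishable under this reading.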

Something I'm unclear about is if N and M are expected to set the upper bound of a kernel, or if they're expected to always be a given size. 

For example, is the 16x4 also expected to deal with 8x2? Or is each GEMM expected to be its own file implementation?

From looking at the /kernel/x86_64/ folder I believe it's the latter, but I wanted to confirm.

### Next:

Specific to my PR

Following this nomenclature, I believe my first 2 kernels will be a 32x4 fp16->fp16 kernel in AVX512 using the new ISA, as well as a kernel leveraging AVX(1) and the F16C ISA extension to provide legacy support.

### Kernel 1: legacy

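F16C was first introduced on the Ivy Bridge micro-architecture (AVX itself arrived one generation earlier, on Sandy Bridge). F16C only converts between fp16 and fp32, so the computation is done on 32-bit lanes of the vector registers and the result is then converted back to fp16. AVX1 allows for YMM registers if we stick to floating-point operations, which we're doing, so all good on that front.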

As such we can use YMM registers, but need to treat them in 32-bit increments (256 / 32 = 8 lanes). This would mean a maximum of an Nx8 kernel, with a naming convention of:
`hgemm_kernel_NxM_ivy(bridge?).c`


Where M could be 1, 2, 4 or 8, but probably only 8, and N probably only 4, 8 and 16.
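
The lane arithmetic behind these sizes can be sketched as follows. The helper `lanes` and the constants are hypothetical names used only for illustration. The register widths are the standard 256-bit YMM and 512-bit ZMM sizes.

```python
YMM_BITS = 256  # AVX register width
ZMM_BITS = 512  # AVX512 register width


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one vector register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


print(lanes(YMM_BITS, 32))  # 8: fp32 lanes per YMM register on the F16C path
print(lanes(ZMM_BITS, 16))  # 32: native fp16 lanes per ZMM register
```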

A potential issue is that the F16C spec requires the values to be converted to fp32 before computation, computed in fp32, and then converted back to fp16. It would therefore only be applicable for testing/development purposes before sending to modern systems capable of utilizing the AVX512 implementation.
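
The widen, compute, narrow sequence described above can be emulated in plain Python as a rough sketch. It uses the `struct` module's `e` (IEEE binary16) and `f` (binary32) formats as a stand-in for the hardware conversions (`VCVTPH2PS` / `VCVTPS2PH`). It is not kernel code, and `multiply_add_via_fp32` is a hypothetical name. It assumes a separate multiply and add, with the inputs already being fp16 values.

```python
import struct


def round_to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def round_to_fp32(value: float) -> float:
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def multiply_add_via_fp32(a, b, c):
    """Element-wise a*b + c: widen to fp32, compute, narrow back to fp16."""
    result = []
    for x, y, z in zip(a, b, c):
        # Widen: every fp16 value is exactly representable in fp32.
        wx, wy, wz = round_to_fp32(x), round_to_fp32(y), round_to_fp32(z)
        # Compute in fp32: separate multiply and add, each rounded to fp32.
        product = round_to_fp32(wx * wy)
        total = round_to_fp32(product + wz)
        # Narrow: round the fp32 result back to fp16 for storage.
        result.append(round_to_fp16(total))
    return result


print(multiply_add_via_fp32([1.0, 2.0], [3.0, 4.0], [0.5, 0.5]))  # [3.5, 8.5]
```

The two conversions per element are the overhead that a native fp16 path avoids.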

### Kernel 2: The modern implementation (read: the fast one)

In the case of the AVX512 implementation things are a little different. The first implementation in market was unofficial/unsanctioned, via Alder Lake (see #3490), whose performance cores (named Golden Cove) were in market over a year prior to Sapphire Rapids. It supports proper FP16 arithmetic, and therefore 32 values in each 512-bit ZMM register.

The compromise I've come to is to name the AVX512 versions:

`hgemm_kernel_NxM_goldencove.c`


Where the implementations will be 
M=4,8,16 and N = 16, 32. 
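
For reference, the file names this would produce can be enumerated with a short sketch. It applies the `hgemm_kernel_NxM_goldencove.c` template literally to the sizes listed above. The helper `planned_kernel_names` is hypothetical, and the resulting names are proposals rather than existing files.

```python
from itertools import product


def planned_kernel_names(n_values, m_values, uarch="goldencove", prefix="h"):
    """Expand the proposed NxM template into one file name per size pair."""
    return [
        f"{prefix}gemm_kernel_{n}x{m}_{uarch}.c"
        for n, m in product(n_values, m_values)
    ]


names = planned_kernel_names(n_values=(16, 32), m_values=(4, 8, 16))
for name in names:
    print(name)
```

If each size does need its own file, this comes to six files for the AVX512 path alone.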

Once all of these are done it may be viable/workable to create SH/HS as well as the FP16 complex kernels but that's for later. 


