
FlopsHaswell

Thomas Gruber edited this page May 22, 2019 · 3 revisions

Measuring FLOP/s on Intel Haswell platforms

In the HPC world, FLOP/s is a commonly used metric for judging code quality. While many users calculate the FLOP/s from their high-level code, the result may differ from the actual FLOP rate due to compiler optimizations or floating-point calculations hidden in library functions. It is therefore helpful to measure floating-point operations at the hardware level. Unfortunately, Intel reduced the floating-point related hardware performance events to almost zero.

Although the Intel SandyBridge and IvyBridge architectures provide such events, the measured values are commonly too high. The PAPI group has a webpage discussing the events on Intel SandyBridge and IvyBridge. It also has a section about the floating-point related events on Intel Haswell platforms. Compressed to a single sentence, that section says: there are no floating-point events for Intel Haswell. But is this true?

AVX_INSTS.ALL event

Some time after the release of the Intel Haswell chips, Intel updated its online performance monitoring events database (Copy at GitHub), which now lists an event AVX_INSTS.ALL with event code 0xC6 and umask 0x7. Intel describes the event as follows:

Approximate counts of AVX & AVX2 256-bit instructions, including non-arithmetic instructions, loads, and stores. May count non-AVX instructions that employ 256-bit operations, including (but not necessarily limited to) rep string instructions that use 256-bit loads and stores for optimized performance, XSAVE* and XRSTOR*, and operations that transition the x87 FPU data registers between x87 and MMX.

and

Note that a whole rep string only counts AVX_INST.ALL once.

For anybody who might not know it, the umask is a bitmask that restricts an event to a subevent. For me, seeing the umask 0x7 means that three bits are set, so I tested what happens when only one of the three bits is set. With pure AVX assembly benchmarks, I was able to split the event into its subevents for loads, stores and calculations. I measured the accuracy of the calculation subevent, and for pure AVX code it was pretty accurate. So I created a performance group for LIKWID for the AVX FLOP/s (FLOPS_AVX) on Intel Haswell.

For a presentation I needed the FLOP/s rate of an Intel Haswell system and was happy to have the FLOPS_AVX group. But the results were way too high for the code and system, so I investigated further.
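As a rough illustration of how a FLOP rate could be derived from the calculation subevent, here is a minimal sketch in C. It is not the definition of the FLOPS_AVX group. The function name `avx_flops_per_second` and its parameters are hypothetical, and the sketch assumes that every counted instruction performs exactly one floating-point operation per lane of a 256-bit register (4 lanes for double precision, 8 for single precision). Fused multiply-add instructions perform two operations per lane, and any non-arithmetic instruction counted by the event inflates the result.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate FLOP/s from a count of 256-bit AVX calculation instructions.
 * Assumption (not from the event documentation): one floating-point
 * operation per lane for every counted instruction.
 *   lanes = 4 for double precision, 8 for single precision.
 */
static inline double avx_flops_per_second(uint64_t calc_insts,
                                          unsigned lanes,
                                          double seconds)
{
    assert(seconds > 0.0); /* a rate needs a positive runtime */
    return (double)calc_insts * (double)lanes / seconds;
}
```

For example, one million counted double-precision instructions in half a second would be reported as 8 MFLOP/s under these assumptions.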

Problems with the AVX_INSTS.* events

By default I use GCC, and for GCC-generated code the measurement results were pretty accurate. But to get the highest performance on Intel systems, I commonly try the Intel C compiler as well, and with the Intel compiler the results were way too high. I tried to isolate the instruction in the assembly that causes the overcounting and found the instruction insertf128, which is counted as a calculation instead of a load. The Intel compiler uses this instruction to avoid split cache line AVX loads by loading the lower 128 bits (insertf128) and the higher 128 bits (movupd) separately. The two instructions are executed simultaneously, so there is no performance degradation, but it skews the measurement results.

There are probably other non-calculation instructions that are included in the event, but I haven't found any during my investigation. So if you find such an instruction, please send me an email and I will add it to this page.

J. D. McCalpin stated in issue #64 that the term AVX is ambiguous in this context, as AVX refers to instructions in the 3-operand AVX instruction format and not only to SIMD instructions working on 256-bit registers (ymm). He ran some tests with the AVX_INSTS.* events and found that they count only AVX instructions working on 256-bit registers (ymm), not AVX instructions working on 128-bit registers (xmm, GCC flag -mprefer-avx128). He assumes that AVX instructions on scalars do not increment the event counters either.

Listing of all subevents

AVX_INSTS.LOAD: Event 0xC6, Umask 0x01

AVX_INSTS.STORE: Event 0xC6, Umask 0x02

AVX_INSTS.CALC: Event 0xC6, Umask 0x04

AVX_INSTS.ALL: Event 0xC6, Umask 0x07
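The subevent umasks are single bits, and AVX_INSTS.ALL is their bitwise OR. The raw codes listed for PAPI and perf_event on this page follow the usual x86 layout with the event code in bits 0-7 and the umask in bits 8-15. A minimal sketch in C of that packing follows; the macro and function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Event code and umask bits as listed above (names are illustrative). */
#define AVX_INSTS_EVENT       0xC6u
#define AVX_INSTS_UMASK_LOAD  0x01u
#define AVX_INSTS_UMASK_STORE 0x02u
#define AVX_INSTS_UMASK_CALC  0x04u

/* Pack an 8-bit event code (bits 0-7) and an 8-bit umask (bits 8-15). */
static inline uint64_t avx_raw_code(uint8_t event, uint8_t umask)
{
    return ((uint64_t)umask << 8) | event;
}

/* Build the code for AVX_INSTS.ALL by OR-ing the three subevent bits. */
static inline uint64_t avx_insts_all_code(void)
{
    uint8_t all = AVX_INSTS_UMASK_LOAD | AVX_INSTS_UMASK_STORE
                | AVX_INSTS_UMASK_CALC;
    assert(all == 0x07); /* matches the umask Intel documents for .ALL */
    return avx_raw_code(AVX_INSTS_EVENT, all);
}
```

Calling `avx_raw_code(AVX_INSTS_EVENT, AVX_INSTS_UMASK_CALC)` yields 0x04C6, the value used for AVX_INSTS.CALC.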

Using the events with LIKWID:

For Haswell the events AVX_INSTS_LOAD, AVX_INSTS_STORE, AVX_INSTS_CALC, AVX_INSTS_ALL are available for the PMC counters.

Using the events with PAPI:

AVX_INSTS.LOAD: r01C6

AVX_INSTS.STORE: r02C6

AVX_INSTS.CALC: r04C6

AVX_INSTS.ALL: r07C6

Using the events with perf_event:

Set type in struct perf_event_attr to PERF_TYPE_RAW.

In the config field set:

AVX_INSTS.LOAD: 0x01C6

AVX_INSTS.STORE: 0x02C6

AVX_INSTS.CALC: 0x04C6

AVX_INSTS.ALL: 0x07C6
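The perf_event settings above could be wired up roughly as in the following C sketch for Linux. The helper names `avx_raw_attr` and `avx_open_counter` are hypothetical. Counting only user-space code and starting the counter disabled are choices made for this example, not requirements of the events. The sketch also accepts only plain event-plus-umask codes that fit in 16 bits.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for the syscall() declaration */
#endif
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill a perf_event_attr for a raw event such as 0x04C6 (AVX_INSTS.CALC). */
static inline struct perf_event_attr avx_raw_attr(uint64_t config)
{
    struct perf_event_attr attr;

    assert(config <= 0xFFFF);   /* this sketch handles event + umask only */
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;  /* raw hardware event */
    attr.config = config;       /* (umask << 8) | event code */
    attr.disabled = 1;          /* start stopped; enable via ioctl later */
    attr.exclude_kernel = 1;    /* example choice: count user space only */
    attr.exclude_hv = 1;
    return attr;
}

/*
 * Open the counter for the calling thread on any CPU.
 * Returns a file descriptor, or -1 with errno set on failure.
 */
static inline int avx_open_counter(uint64_t config)
{
    struct perf_event_attr attr = avx_raw_attr(config);
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}
```

After a successful open, the counter can be started with `ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)`, stopped with `PERF_EVENT_IOC_DISABLE`, and read as a 64-bit value with `read(fd, &count, sizeof(count))`. Opening may fail on systems where `perf_event_paranoid` restricts access.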
