PatternsHaswellEP

Bottlenecks related performance patterns

Pattern	Desired events	Available events
ALU saturation	Amount of UOPs executed per port, Amount of load/store UOPs, Amount of calculation UOPs	UOPS_EXECUTED_PORT.PORT_(0-7), INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, MEM_UOPS_RETIRED.ALL_LOADS, MEM_UOPS_RETIRED.ALL_STORES, AVX_INSTS.CALC
Bandwidth saturation	Amount of cache lines transferred between L1 and L2, L2 and L3, L3 and memory including prefetches, snoops, ..., Amount of scalar/packed/vector loads/stores	L1D.REPLACEMENT, L2_TRANS.L1D_WB, L2_LINES_IN.ALL, L2_TRANS.L2_WB, UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, UNC_H_IMC_READS.NORMAL, UNC_H_BYPASS_IMC.TAKEN, UNC_H_IMC_WRITES.ALL, AVX_INSTS.LOADS, AVX_INSTS.STORES
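
To check for bandwidth saturation, the raw cache line counts have to be converted into a data rate and compared with the achievable bandwidth of the machine. The following is a minimal sketch of that conversion, assuming a 64-byte cache line, counts that are already summed over all memory channels, and a separately measured runtime. The function names and the `counts` dictionary layout are hypothetical and not part of LIKWID.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size on Haswell-EP


def bandwidth_mbytes_per_s(cache_lines, runtime_s):
    """Convert a number of transferred cache lines into MByte/s."""
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    return cache_lines * CACHE_LINE_BYTES / runtime_s / 1.0e6


def memory_bandwidth(counts, runtime_s):
    """Derive read, write and total memory bandwidth from CAS counts.

    `counts` is a hypothetical dict mapping event names to raw counts,
    already summed over all memory channels of the socket.
    """
    reads = counts["UNC_M_CAS_COUNT.RD"]
    writes = counts["UNC_M_CAS_COUNT.WR"]
    return {
        "read": bandwidth_mbytes_per_s(reads, runtime_s),
        "write": bandwidth_mbytes_per_s(writes, runtime_s),
        "total": bandwidth_mbytes_per_s(reads + writes, runtime_s),
    }
```

The same helper can be applied to `L1D.REPLACEMENT` or `L2_LINES_IN.ALL` to estimate the traffic between cache levels.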

Hazards related performance patterns

Pattern	Desired events	Available events
Inefficient data access due to excess data volume	Amount of cache lines transferred between cache levels (in and out), Amount of cache hits, Amount of cache misses	L1D.REPLACEMENT, L2_TRANS.L1D_WB, L2_LINES_IN.ALL, L2_TRANS.L2_WB, UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_HIT, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_MISS
Inefficient data access due to latency-bound accesses	Latency in cycles for loads and stores, Amount of cache lines transferred between cache levels (in and out), Amount of cache hits, Amount of cache misses	Latency measurements are only available in kernel space with PEBS, L1D.REPLACEMENT, L2_TRANS.L1D_WB, L2_LINES_IN.ALL, L2_TRANS.L2_WB, UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_HIT, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_MISS
Limited instruction throughput	Stall and used cycles at decoder, reservation station, all execution ports, reorder buffer and store buffer	UOPS_ISSUED.THREAD, UOPS_EXECUTED.THREAD, UOPS_RETIRED.THREAD, RESOURCE_STALLS.(RS, SB, ROB), UOPS_ISSUED.THREAD:CMASK=0x1:INV=1, UOPS_EXECUTED.THREAD:CMASK=0x1:INV=1, UOPS_RETIRED.THREAD:CMASK=0x1:INV=1, RESOURCE_STALLS.(RS, SB, ROB):CMASK=0x1:INV=1
Micro-architectural anomalies	Amount of memory aliasing stalls, Amount of conflict misses, Amount of unaligned loads and stores, Amount of requeues of UOPs, All amounts of performance degrading hardware behavior	RESOURCE_STALLS.(RS, SB, ROB), MISALIGN_MEM_REF.ANY, LD_BLOCKS_PARTIAL.ADDRESS_ALIAS, LOCK_CYCLES.CACHE_LOCK_DURATION
False sharing of cache lines	Amount of modified cache lines transferred from a CPU's private cache to another CPU's cache, Amount of modified cache lines transferred between CPU sockets	MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM, OFFCORE_RESPONSE:LLC_HIT:HITM_OTHER_CORE, OFFCORE_RESPONSE:LLC_MISS:REMOTE_HITM
Bad ccNUMA page placement	Amount of cache lines transferred from local memory to a CPU core, Amount of cache lines transferred from remote memory to a CPU core (best with filtering for source memory domain), Amount of data transferred over socket interconnect	UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, RXL_FLITS_G0.DATA, TXL_FLITS_G0.DATA, OFFCORE_RESPONSE:L3_MISS:LOCAL_DRAM, OFFCORE_RESPONSE:L3_MISS:REMOTE_DRAM
Control flow issues	Amount of all branches, Amount of all mispredicted branches, Amount of retired instructions	BR_INST_RETIRED.ALL_BRANCHES, BR_MISP_RETIRED.ALL_BRANCHES, INST_RETIRED.ANY
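
For the control flow pattern, the three listed events are usually combined into ratios rather than read as absolute counts. The sketch below shows one possible set of derived metrics. The function name, the metric names and the `counts` dictionary are assumptions made for illustration.

```python
def branch_metrics(counts):
    """Derive branch ratios from raw event counts.

    `counts` is a hypothetical dict mapping event names to raw counts
    for a single hardware thread.
    """
    branches = counts["BR_INST_RETIRED.ALL_BRANCHES"]
    mispredicted = counts["BR_MISP_RETIRED.ALL_BRANCHES"]
    instructions = counts["INST_RETIRED.ANY"]
    return {
        # Fraction of retired instructions that are branches.
        "branch_rate": branches / instructions if instructions else 0.0,
        # Mispredicted branches per retired instruction.
        "misprediction_rate": mispredicted / instructions if instructions else 0.0,
        # Fraction of branches that were mispredicted.
        "misprediction_ratio": mispredicted / branches if branches else 0.0,
    }
```

A high misprediction ratio combined with a high branch rate hints at the control flow pattern. Which values count as "high" depends on the code and is not defined here.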

Work related performance patterns

Pattern	Desired events	Available events
Load imbalance / serial fraction	Amount of "work" instructions e.g. floating point operations or bit shifts, Amount of cache lines transferred between L1 and CPU core	AVX_INSTS.CALC, L1D.REPLACEMENT, L2_TRANS.L1D_WB
Synchronization overhead	Amount of "work" instructions, e.g. floating point operations or bit shifts, Amount of halted cycles, Amount of unhalted cycles, Amount of retired instructions	AVX_INSTS.CALC, INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, CPU_CLK_UNHALTED.THREAD_P:CMASK=0x1:INV=1
Instruction overhead	Amount of "long-latency" instructions, Amount of issued/executed/retired instructions, Amount of floating-point instructions	INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, AVX_INSTS.CALC
Bad code composition due to expensive instructions	Amount of expensive UOPs like divide, sqrt, rand, ..., Amount of retired instructions, Amount of retired UOPs	ARITH.DIVIDER_UOPS, INST_RETIRED.ANY, UOPS_RETIRED.ANY
Bad code composition due to ineffective instructions	Amount of not work-related instructions, Amount of retired instructions, floating-point instructions separated by scalar, packed and vectorized	INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, AVX_INSTS.CALC, AVX_INSTS.LOADS, AVX_INSTS.STORES
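
Load imbalance only becomes visible when the "work" counts are compared across threads. As a rough sketch, assuming one `AVX_INSTS.CALC` count per thread has already been collected, the imbalance can be expressed as the relative excess of the busiest thread over the mean. The function name and this particular metric definition are illustrative assumptions, not a LIKWID metric.

```python
def load_imbalance(work_per_thread):
    """Return max/mean - 1 for a list of per-thread work counts.

    0.0 means perfectly balanced; 1.0 means the busiest thread did
    twice the average amount of work.
    """
    if not work_per_thread:
        raise ValueError("work_per_thread must not be empty")
    mean = sum(work_per_thread) / len(work_per_thread)
    if mean == 0:
        return 0.0  # no work measured at all, nothing to compare
    return max(work_per_thread) / mean - 1.0
```

For example, `load_imbalance([200, 100, 100, 0])` returns `1.0`, while four threads with equal counts give `0.0`.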
