
PatternsHaswellEP

Thomas Roehl edited this page Sep 17, 2015 · 2 revisions

Bottleneck-related performance patterns

| Pattern | Desired events | Available events |
| --- | --- | --- |
| ALU saturation | Amount of UOPs executed per port, Amount of load/store UOPs, Amount of calculation UOPs | UOPS_EXECUTED_PORT.PORT_(0-7), INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, MEM_UOPS_RETIRED.ALL_LOADS, MEM_UOPS_RETIRED.ALL_STORES, AVX_INSTS.CALC |
| Bandwidth saturation | Amount of transferred cache lines between L1 and L2, L2 and L3, L3 and memory including prefetches, snoops, ..., Amount of scalar/packed/vector loads/stores | L1D.REPLACEMENT, L2_TRANS.L1D_WB, L2_LINES_IN.ALL, L2_TRANS.L2_WB, UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, UNC_H_IMC_READS.NORMAL, UNC_H_BYPASS_IMC.TAKEN, UNC_H_IMC_WRITES.ALL, AVX_INSTS.LOADS, AVX_INSTS.STORES |
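
As a rough illustration of how the memory controller events above can be turned into a bandwidth figure, the following sketch converts UNC_M_CAS_COUNT.RD and UNC_M_CAS_COUNT.WR counts into bytes per second. It assumes that each CAS command transfers one 64-byte cache line and that the counts have already been summed over all memory channels of a socket. The function name and the sample values are hypothetical and not part of any tool.

```python
CACHE_LINE_BYTES = 64  # assumption: one CAS command moves one 64-byte cache line


def memory_bandwidth(cas_rd, cas_wr, runtime_s):
    """Return (read, write, total) memory bandwidth in bytes per second.

    cas_rd / cas_wr: UNC_M_CAS_COUNT.RD / UNC_M_CAS_COUNT.WR counts,
    assumed to be summed over all memory channels of one socket.
    runtime_s: measurement time in seconds.
    """
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    read_bw = cas_rd * CACHE_LINE_BYTES / runtime_s
    write_bw = cas_wr * CACHE_LINE_BYTES / runtime_s
    return read_bw, write_bw, read_bw + write_bw


# Hypothetical counter values, for illustration only
read_bw, write_bw, total_bw = memory_bandwidth(1_000_000, 500_000, 2.0)
print(f"read {read_bw:.0f} B/s, write {write_bw:.0f} B/s, total {total_bw:.0f} B/s")
```

Comparing the total against the achievable bandwidth of the machine (measured with a streaming benchmark, for example) gives a first hint whether the bandwidth saturation pattern applies.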

Hazard-related performance patterns

| Pattern | Desired events | Available events |
| --- | --- | --- |
| Inefficient data access due to excess data volume | Amount of cache lines transferred between cache levels (in and out), Amount of cache hits, Amount of cache misses | L1D.REPLACEMENT, L2_TRANS.L1D_WB, L2_LINES_IN.ALL, L2_TRANS.L2_WB, UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_HIT, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_MISS |
| Inefficient data access due to latency-bound accesses | Latency in cycles for loads and stores, Amount of cache lines transferred between cache levels (in and out), Amount of cache hits, Amount of cache misses | Latency measurements only available in kernel space with PEBS, L1D.REPLACEMENT, L2_TRANS.L1D_WB, L2_LINES_IN.ALL, L2_TRANS.L2_WB, UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_HIT, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_MISS |
| Limited instruction throughput | Stall and used cycles at decoder, reservation station, all execution ports, reorder buffer and store buffer | UOPS_ISSUED.THREAD, UOPS_EXECUTED.THREAD, UOPS_RETIRED.THREAD, RESOURCE_STALLS.(RS, SB, ROB), UOPS_ISSUED.THREAD:CMASK=0x1:INV=1, UOPS_EXECUTED.THREAD:CMASK=0x1:INV=1, UOPS_RETIRED.THREAD:CMASK=0x1:INV=1, RESOURCE_STALLS.(RS, SB, ROB):CMASK=0x1:INV=1 |
| Micro-architectural anomalies | Amount of memory aliasing stalls, Amount of conflict misses, Amount of unaligned loads and stores, Amount of requeues of UOPs, All amounts of performance-degrading hardware behavior | RESOURCE_STALLS.(RS, SB, ROB), MISALIGN_MEM_REF.ANY, LD_BLOCKS_PARTIAL.ADDRESS_ALIAS, LOCK_CYCLES.CACHE_LOCK_DURATION |
| False sharing of cache lines | Amount of modified cache lines transferred from a CPU's private cache to another CPU's cache, Amount of modified cache lines transferred between CPU sockets | MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM, OFFCORE_RESPONSE:LLC_HIT:HITM_OTHER_CORE, OFFCORE_RESPONSE:LLC_MISS:REMOTE_HITM |
| Bad ccNUMA page placement | Amount of cache lines transferred from local memory to a CPU core, Amount of cache lines transferred from remote memory to a CPU core (best with filtering for source memory domain), Amount of data transferred over socket interconnect | UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, RXL_FLITS_G0.DATA, TXL_FLITS_G0.DATA, OFFCORE_RESPONSE:L3_MISS:LOCAL_DRAM, OFFCORE_RESPONSE:L3_MISS:REMOTE_DRAM |
| Control flow issues | Amount of all branches, Amount of all mispredicted branches, Amount of retired instructions | BR_INST_RETIRED.ALL_BRANCHES, BR_MISP_RETIRED.ALL_BRANCHES, INST_RETIRED.ANY |
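
To make the control flow row more concrete, here is a minimal sketch of two ratios commonly derived from BR_INST_RETIRED.ALL_BRANCHES, BR_MISP_RETIRED.ALL_BRANCHES and INST_RETIRED.ANY. The function name, the returned keys and the sample counts are assumptions made for illustration; where to draw the line for a "bad" ratio depends on the code and is not defined here.

```python
def branch_metrics(branches, mispredicted, instructions):
    """Derive simple control-flow ratios from raw counter values.

    branches:     BR_INST_RETIRED.ALL_BRANCHES
    mispredicted: BR_MISP_RETIRED.ALL_BRANCHES
    instructions: INST_RETIRED.ANY
    """
    # Guard against division by zero for empty measurements
    branch_rate = branches / instructions if instructions else 0.0
    misprediction_ratio = mispredicted / branches if branches else 0.0
    return {
        "branch_rate": branch_rate,                  # branches per retired instruction
        "misprediction_ratio": misprediction_ratio,  # mispredicted per retired branch
    }


# Hypothetical counter values, for illustration only
metrics = branch_metrics(branches=200, mispredicted=10, instructions=1000)
print(metrics)
```

A high branch rate combined with a high misprediction ratio suggests that the control flow pattern is worth investigating further.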

Work-related performance patterns

| Pattern | Desired events | Available events |
| --- | --- | --- |
| Load imbalance / serial fraction | Amount of "work" instructions e.g. floating point operations or bit shifts, Amount of cache lines transferred between L1 and CPU core | AVX_INSTS.CALC, L1D.REPLACEMENT, L2_TRANS.L1D_WB |
| Synchronization overhead | Amount of "work" instructions, e.g. floating point operations or bit shifts, Amount of halted cycles, Amount of unhalted cycles, Amount of retired instructions | AVX_INSTS.CALC, INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, CPU_CLK_UNHALTED.THREAD_P:CMASK=0x1:INV=1 |
| Instruction overhead | Amount of "long-latency" instructions, Amount of issued/executed/retired instructions, Amount of floating-point instructions | INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, AVX_INSTS.CALC |
| Bad code composition due to expensive instructions | Amount of expensive UOPs like divide, sqrt, rand, ..., Amount of retired instructions, Amount of retired UOPs | ARITH.DIVIDER_UOPS, INST_RETIRED.ANY, UOPS_RETIRED.ANY |
| Bad code composition due to ineffective instructions | Amount of not work-related instructions, Amount of retired instructions, floating-point instructions separated by scalar, packed and vectorized | INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, AVX_INSTS.CALC, AVX_INSTS.LOADS, AVX_INSTS.STORES |
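
One possible way to check for the load imbalance pattern is to compare a "work" counter such as AVX_INSTS.CALC across threads. The sketch below uses the ratio of the maximum to the mean per-thread count as an imbalance factor. This particular metric, the function name and the sample values are assumptions chosen for illustration; other definitions of imbalance are equally valid.

```python
def load_imbalance(work_per_thread):
    """Return max/mean of per-thread work counts.

    work_per_thread: one "work" counter value per thread,
    e.g. AVX_INSTS.CALC measured on each hardware thread.
    A result of 1.0 means perfectly balanced work; a result equal to
    the number of threads means one thread did all the work.
    """
    if not work_per_thread:
        raise ValueError("need at least one per-thread value")
    mean = sum(work_per_thread) / len(work_per_thread)
    if mean == 0:
        return 0.0  # no work measured at all
    return max(work_per_thread) / mean


# Hypothetical per-thread counts, for illustration only
print(load_imbalance([100, 100, 100, 100]))  # balanced
print(load_imbalance([400, 0, 0, 0]))        # fully serial
```

The same comparison can be repeated with L1D.REPLACEMENT to see whether the data traffic is distributed as unevenly as the computational work.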