X86 CPUs & Performance



LAST UPDATE DATE: 20 OCT 2022
FOR LATEST VERSION: www.github.com/akhin/microarchitecture-cheatsheet
AUTHOR: AKIN OCAL akin ocal@hotmail.com

INSIDE **INDIVIDUAL** CORE

PIPELINE PARALLELISM & PERFORMANCE

Pipeline diagrams: The diagrams in the following topics are outputs from an online microarchitecture analysis tool, <u>UICA</u>, and they represent parallel execution through cycles. Rows are multiple instructions being executed at the same time. Columns display how the state of each instruction changes through cycles.

IPC: Pipeline performance is typically measured with IPC, which stands for "instructions per cycle". A higher IPC value usually means better throughput. You can measure IPC with perf: <a href="https://perf.wiki.kernel.org/index.php/Tutorial">https://perf.wiki.kernel.org/index.php/Tutorial</a>

[Figure: instruction lifecycle states in UICA diagrams]

Rate of retired instructions: Apart from IPC, the number of retired instructions should be checked. Retired instructions are the ones whose results were committed/finalised. Instructions that were executed but never retired were wrongly speculated and their results were discarded. Therefore a low ratio of retired to executed instructions indicates a poor branch prediction rate.

CONTENTION FOR EXECUTION PORTS IN THE PIPELINE

[Figure: UICA pipeline diagram]

In the example above, all instructions work on different registers, but the SHR, ADD and DEC instructions compete for ports 0 and 6. SHR and DEC are executed after the ADD instruction. Also notice the longer time between the E (executed) and R (retired) states of the ADD instruction, as retirement has to be done in order whereas execution is out of order.

Reference: Denis Bakhvalov's article

INSTRUCTION STALLS DUE TO DATA DEPENDENCY

[Figure: UICA pipeline diagram]

In the example above, there are 2 dependency chains, each marked with a different colour. In the first, red coloured one, 2 instructions compete for the RAX register, and notice that the second instruction gets executed after the first one.

Reference: Denis Bakhvalov's article
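A dependency chain like the ones described above can also be illustrated at the C++ level. The following is a minimal sketch, not taken from the referenced article: the function names `sum_single_chain` and `sum_four_chains` are hypothetical. The first function feeds every addition with the result of the previous one, while the second splits the work into four independent chains that an out-of-order core can, in principle, advance in parallel.

```cpp
#include <cstddef>
#include <cstdint>

// One accumulator: every add needs the result of the previous add,
// so all additions form a single dependency chain.
constexpr std::int64_t sum_single_chain(const std::int64_t* data, std::size_t n) {
    std::int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Four accumulators: four independent dependency chains.
// Integer addition is associative, so the result is the same.
constexpr std::int64_t sum_four_chains(const std::int64_t* data, std::size_t n) {
    std::int64_t t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        t0 += data[i];
        t1 += data[i + 1];
        t2 += data[i + 2];
        t3 += data[i + 3];
    }
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        t0 += data[i];
    }
    return (t0 + t1) + (t2 + t3);
}
```

Whether the second version is actually faster depends on the compiler and the CPU. An optimising compiler may already unroll or vectorise the first loop, so both versions should be benchmarked rather than assumed.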

RDTSCP INSTRUCTION FOR MEASUREMENTS

The RDTSCP instruction waits until all previous instructions have executed and then reads the TSC value of the CPU, so the instructions prior to the measurement do not leak into it. TSC: timestamp counter. You can use a CPUID and RDTSC combination on older systems that don't support RDTSCP.

**ESTIMATING INSTRUCTION LATENCIES**

Based on Agner Fog's <u>Instruction tables</u>, RDTSCP reciprocal throughput (clock cycles per instruction) is 32 on Skylake microarchitecture:

-> 1 cycle @4.5GHz is 0.22 nanoseconds

-> 32\*0.22=7.04 nanoseconds

So its resolution estimate is about 7 nanoseconds on a 4.5 GHz Skylake microarchitecture. You have to recalculate it for different microarchitectures and clock speeds.

HYPERTHREADING / SIMULTANEOUS MULTITHREADING

Based on Intel Software Developer's Manual Volume 3, it is implemented by 2 virtual cores that share resources including cache memory, branch prediction resources and execution ports. AMD seems to use the resources in the same way, based on Agner Fog's microarchitecture book. For example, if your app is data-intensive, halved caches won't help. It can be disabled via BIOS settings. In general, it moves the control of resources from software to hardware, and that is usually not desired for performance critical applications.

Note: Its generic name is simultaneous multithreading. The Hyperthreading name is used only by Intel.

DYNAMIC CLOCK SPEEDS

Modern CPUs employ dynamic frequency scaling, which means there is a min and max frequency per CPU core. ACPI also defines multiple power states, and modern CPUs implement those. P-states are for performance and C-states are for energy saving. In order to switch to P-states, the C-state has to be brought to C0 level.

[Figure: power states. C0 - Normal execution, with P-states ranging from the max level to Pn; C1 - Idle]

You can use Intel's <u>Turboboost</u> or AMD's <u>Turbocore</u> to maximise the CPU usage.

Note that wide SIMD usage (AVX-512 in particular) may also introduce downclocking, therefore it should be used carefully: Daniel Lemire's article
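Because the clock speed can change, the RDTSCP resolution estimate given earlier has to be redone for each frequency. The arithmetic can be sketched as below. The function name `cycles_to_ns` and the constant `rdtscp_resolution_ns` are hypothetical; the 32 cycle and 4.5 GHz figures are the ones quoted in the text.

```cpp
// Converts a cycle count to nanoseconds for a given clock frequency.
// A clock running at f GHz completes f cycles per nanosecond,
// so elapsed time in nanoseconds = cycles / f.
constexpr double cycles_to_ns(double cycles, double clock_ghz) {
    return cycles / clock_ghz;
}

// Figures quoted in the text: 32 cycles for RDTSCP on Skylake at 4.5 GHz.
constexpr double rdtscp_resolution_ns = cycles_to_ns(32.0, 4.5);
```

With these inputs the estimate comes out at roughly 7.1 nanoseconds. The text rounds the cycle time to 0.22 nanoseconds first, which gives 7.04. At 3.0 GHz the same 32 cycles would take about 10.7 nanoseconds.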

**LOAD STORE REALM**

STORE-TO-LOAD FORWARDING & LHS & PERFORMANCE

Based on Intel Optimization Manual 3.6.4, store-to-load forwarding may improve the combined latency of those 2 operations. The reason is not specified, however it is potentially the LHS (Load-Hit-Store) problem, in which the penalty is a round trip to the cache memory.

Reference: https://en.wikipedia.org/wiki/Load-Hit-Store

The conditions for a successful forwarding and the latency penalties in case of no forwarding can be found in Agner Fog's microarchitecture book.

Previous game consoles PlayStation3 and Xbox360 had PowerPC based processors which did in-order execution rather than out-of-order execution. Therefore developers had to handle LHS separately by using the restrict keyword and other methods: Elan Ruskin's article

**LOAD & STORE BUFFERS**

Load and store buffers allow the CPU to do out-of-order execution on loads and stores by decoupling speculative execution and committing the results to the cache memory.

Reference: https://en.wikipedia.org/wiki/Memory_disambiguation

STORE-TO-LOAD FORWARDING

Using buffers for stores and loads to support out-of-order execution leads to a data synchronisation issue. That issue is described in en.wikipedia.org/wiki/Memory_disambiguation#Store_to_load_forwarding

As a solution, the CPU can forward a memory store operation to a following load, if they both operate on the same address.

[Figure: STORE and LOAD passing through the STORE BUFFER, the LOAD BUFFER and the L1 CACHE]

There are several conditions for the forwarding to happen. In case of a successful forwarding, steps 2 and 3 (a round trip to the cache) will be bypassed.

An example store and load sequence:

    mov [eax],ecx ; STORE, write the value of the ECX register to the memory
                  ; address which is stored in the EAX register
    mov ecx,[eax] ; LOAD, read the value from that memory address
                  ; (which was just used) and write it to the ECX register
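A store followed by a load of the same address can also arise from ordinary C++ code. The sketch below is an illustration under assumptions, not the method from the referenced articles, and the function names `sum_through_memory` and `sum_through_local` are hypothetical. In the first function `total` might alias an element of `values`, so the compiler generally has to store to `*total` and load it back in every iteration. The second function keeps the running value in a local variable and stores only once.

```cpp
#include <cstddef>

// Accumulates directly in memory: each iteration stores to *total and the
// next iteration loads from that same address again.
void sum_through_memory(const int* values, std::size_t n, int* total) {
    *total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        *total += values[i];
    }
}

// Accumulates in a local variable, which the compiler can keep in a
// register, and stores the result to memory once at the end.
void sum_through_local(const int* values, std::size_t n, int* total) {
    int local = 0;
    for (std::size_t i = 0; i < n; ++i) {
        local += values[i];
    }
    *total = local;
}
```

Both functions produce the same result as long as `total` does not point into `values`. On an out-of-order x86 core, store-to-load forwarding may hide much of the cost of the first version, so the difference is worth measuring rather than assuming.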

**ARITHMETIC** REALM

**FLOATING POINTS**

X86 uses the IEEE 754 standard for floating points. A 32 bit floating point consists of 3 parts in the memory layout: sign (1 bit), exponent (8 bits) and mantissa (23 bits).

[Figure: all bits of the floating point number 1234.5678, visualised with <u>bartaz.github.io/ieee754-visualization</u>]

A floating point's value is calculated as: ±mantissa × 2^exponent

IEEE754 also defines denormal numbers. They are very small / near zero numbers. As floating points are approximations, denormal numbers are needed to avoid an undesired case of: a!=b but a-b=0. Without denormals the code below would invoke a divide-by-zero exception.

    float GetInverseOfDiff(float a, float b)
    {
        return 1.0f / (a - b);
    }

Reference: Bruce Dawson's article

Based on Agner Fog's microarchitecture book, Intel CPUs have a penalty for denormal numbers, for example 129 clock cycles on Skylake. They can also be turned off on Intel CPUs. As for the AMD side, the recent Zen architecture CPUs seemingly don't have the same performance degradation.

ARITHMETIC INSTRUCTION LATENCIES

You can see a set of arithmetic operations from fast to slow below. The clock cycles are based on Agner Fog's Instruction tables & Skylake architecture on 64 bit registers.

Bitwise operations, integer add/sub: 0.25 to 1 clock cycle

Floating point add: 3 clock cycles

X86 EXTENSIONS

x86 extensions are specialised instructions. They have various categories, from cryptography to neural network operations. Intel Intrinsics Guide is a good page to explore those extensions.

SSE (Streaming SIMD Extensions) is one of the most important ones. SIMD stands for "single instruction multiple data". SIMD instructions use wider registers to execute more work in a single go:

[Figure: SIMD addition of two arrays of 4 integers]

In the example above, an array of 4 integers (i1 to i4) is added to another array of integers (j1 to j4). The result is also an array of sums (s1 to s4). In this example, 4 add operations are executed by a single instruction.

Some typical application areas are 3D graphics and quantitative finance. Apart from arithmetic operations, they can be utilised for string operations as well. A SIMD based JSON parser: <a href="https://github.com/simdjson/simdjson">https://github.com/simdjson/simdjson</a>

X86 EXTENSIONS: SIMD DETAILS

The SIMD instruction sets and their corresponding registers are:

SSE: 128 bits, XMM registers

AVX and AVX2: 256 bits, YMM registers

AVX512: 512 bits, ZMM registers

As for programming, there are also wider data types. The data types below are for 128 bit registers:

\_\_m128, 4 x 32 bit floating points

\_\_m128d, 2 x 64 bit doubles

\_\_m128i, 4 x 32 bit ints or 2 x 64 bit long longs

Note that wide SIMD instructions (AVX-512 in particular) require more power, therefore their usage may also introduce downclocking. They should be benchmarked: Daniel Lemire's article
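The SIMD addition of integers i1 to i4 and j1 to j4 described in the X86 extensions text can be sketched with SSE2 intrinsics as follows. This is a minimal sketch under assumptions: the function name `add_arrays` and the macro `HAS_SSE2` are hypothetical, the array length is assumed to be a multiple of 4, and a scalar fallback is included so that the code also compiles on non-x86 targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics: __m128i, _mm_add_epi32, ...
#define HAS_SSE2 1
#else
#define HAS_SSE2 0
#endif

// Adds two int32 arrays element-wise, four elements per step.
// Assumption of this sketch: n is a multiple of 4.
void add_arrays(const std::int32_t* i, const std::int32_t* j,
                std::int32_t* s, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t k = 0; k < n; k += 4) {
#if HAS_SSE2
        // Unaligned loads of four 32 bit integers into 128 bit registers.
        const __m128i vi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(i + k));
        const __m128i vj = _mm_loadu_si128(reinterpret_cast<const __m128i*>(j + k));
        // Four 32 bit additions performed by a single instruction.
        const __m128i vs = _mm_add_epi32(vi, vj);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(s + k), vs);
#else
        // Scalar fallback for targets without SSE2.
        for (std::size_t m = 0; m < 4; ++m) {
            s[k + m] = i[k + m] + j[k + m];
        }
#endif
    }
}
```

Calling `add_arrays` with the inputs {1, 2, 3, 4} and {10, 20, 30, 40} fills the output with {11, 22, 33, 44}. Compilers can often auto-vectorise a plain loop like the fallback, so the explicit intrinsics are mainly useful when that does not happen.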





a round trip to the system memory and total latency becomes 3 digit nanoseconds.



N-WAY SET ASSOCIATIVITY





