Die photo of a quadcore CPU , Copyright of Intel

LAST UPDATE DATE: 14 NOV 2022

FOR LATEST VERSION: www.github.com/akhin/microarchitecture-cheatsheet

AUTHOR: AKIN OCAL akin ocal@hotmail.com



**INSIDE** INDIVIDUAL CORE



**PIPELINE PARALLELISM & PERFORMANCE**

Pipeline diagrams: The diagrams in the following topics are outputs from uiCA, an online microarchitecture analysis tool, and they represent parallel execution through cycles. Rows are multiple instructions being executed at the same time. Columns display how instruction state changes through cycles.

Instruction lifecycle states (diagram legend): P = predecoded, Q = added to IDQ, I = issued, then ready for dispatch, E = executed, R = retired.

IPC: As for pipeline performance, typically IPC is used. It stands for "instructions per cycle". A higher IPC value usually means a better throughput. You can measure IPC with perf: <a href="https://perf.wiki.kernel.org/index.php/Tutorial">https://perf.wiki.kernel.org/index.php/Tutorial</a>

Rate of retired instructions: Apart from IPC, the number of retired instructions should be checked. Executed instructions are not necessarily committed/finalised, as they may have been wrongly speculated. Retired instructions, on the other hand, are the ones which were finalised. Therefore a high rate of retired instructions relative to executed ones indicates a low branch misprediction rate.

**CONTENTION FOR EXECUTION PORTS IN THE PIPELINE**

(Diagram: uiCA trace listing the possible ports and the actual port per instruction.)

In the example diagram, all instructions are working on different registers, but the SHR, ADD and DEC instructions are competing for ports 0 and 6. SHR and DEC are getting executed after the ADD instruction. Also notice that there is a longer time between the E (executed) and R (retired) states of the ADD instruction, as retirement has to be done in-order whereas execution is out-of-order.

Reference: Denis Bakhvalov's article

**INSTRUCTION STALLS DUE TO DATA DEPENDENCY**

Within each dependency chain in the example diagram, notice that the second instruction gets executed after the first one. The same applies to the 2nd purple pair.

Reference: Denis Bakhvalov's article
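The effect of dependency chains can also be sketched at source level. The C++ below is a minimal illustration, assuming a simple summation loop; the function names `sum_single_chain` and `sum_two_chains` are hypothetical. In the first function every addition depends on the previous one, while the second keeps two independent chains that an out-of-order core may execute in parallel. Whether it is actually faster depends on the compiler and the CPU, so it should be benchmarked.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One dependency chain: each add needs the result of the previous add.
std::uint64_t sum_single_chain(const std::vector<std::uint64_t>& v) {
    std::uint64_t acc = 0;
    for (std::uint64_t x : v) acc += x;
    return acc;
}

// Two independent chains: acc0 and acc1 never depend on each other
// inside the loop, so their adds do not have to wait for one another.
std::uint64_t sum_two_chains(const std::vector<std::uint64_t>& v) {
    std::uint64_t acc0 = 0, acc1 = 0;
    std::size_t i = 0;
    for (; i + 1 < v.size(); i += 2) {
        acc0 += v[i];
        acc1 += v[i + 1];
    }
    if (i < v.size()) acc0 += v[i];  // leftover element for odd sizes
    return acc0 + acc1;
}
```

Both functions return the same sum; only the shape of the dependency graph differs.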

**ARITHMETIC REALM**

You can see a set of arithmetic operations from fast to slow below.

The clock cycles are based on Agner Fog's <u>Instruction tables</u> for the Skylake architecture.

**CACHE MEMORY REALM**

**RDTSCP INSTRUCTION FOR MEASUREMENTS**

TSC (time stamp counter) is a special register that counts CPU cycles. RDTSCP can be used to read the TSC value, which can then be used for measurements. It can also avoid out-of-order execution effects to a degree: it does wait until all previous instructions have executed and all previous loads are globally visible (from Intel Software Developer's Manual Volume2 4.3, April 2022). Intel's <u>How to benchmark code execution times</u> whitepaper has details of using the RDTSCP instruction. AMD Programmers Manual Vol3 states: RDTSCP forces all older instructions to retire before reading the timestamp counter.

**ESTIMATING INSTRUCTION LATENCIES**

You can use Agner Fog's Instruction tables to find out instructions' reciprocal throughputs (clock cycles per instruction). As an example, the reciprocal throughput of the RDTSCP instruction is 32 on Skylake microarchitecture:

-> 1 cycle @4.5GHZ (highest frequency on Skylake) is 0.22 nanoseconds

-> 32\*0.22=7.04 nanoseconds

So its resolution estimation is about 7 nanoseconds on a 4.5 GHz Skylake CPU. You have to recalculate it for different microarchitectures and clock speeds.

**HYPERTHREADING / SIMULTANEOUS MULTITHREADING**

The name Hyperthreading is used by Intel and it is called "Simultaneous multithreading" by AMD. In both, resources including caches and execution units are shared. Agner Fog's microarchitecture book has "multithreading" sections for each of the Intel and AMD microarchitectures.

Regarding using it: if your app is data-intensive, halved caches won't help. Therefore you can consider disabling it. In general, it moves the control of resources from software to hardware and that is usually not desired for performance critical applications.

**DYNAMIC FREQUENCIES**

Modern CPUs employ dynamic frequency scaling, which means there is a min and a max frequency per CPU.

ACPI: ACPI defines multiple power states and modern CPUs implement those. P-states are for performance and C-states are for energy efficiency. In order to switch to P-states, the C-state has to be brought to C0 level.

(Figure: C-states with C0 = normal execution and C1 = idle; P-states from the max level down to Pn.)

Intel has various tunability options and the most well known is TurboBoost. On AMD side there is <u>Turbocore</u>.

Number of active cores & SIMD AVX2/512 on Intel CPUs: Intel's power management policies are complex. See the arithmetic and the multicore realms, as the number of active cores and some of the AVX2/512 extensions also may affect the frequency while in TurboBoost.
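The RDTSCP measurement and the cycles-to-nanoseconds estimate from this section can be sketched as follows. This is a minimal illustration, not a benchmarking harness: `cycles_to_ns`, `read_timestamp` and `measure_ticks` are hypothetical names, the `__rdtscp` intrinsic is only used when compiling with GCC or Clang for x86, and other targets fall back to `std::chrono::steady_clock` as a stand-in. The exact quotient for 32 cycles at 4.5 GHz is about 7.1 ns; the 7.04 figure in the text comes from rounding the cycle time to 0.22 ns. Note also that on CPUs with an invariant TSC, TSC ticks are not necessarily core clock cycles.

```cpp
#include <chrono>
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>
#define HAS_RDTSCP 1
#endif

// Converts a cycle count into nanoseconds for an assumed core clock in GHz.
double cycles_to_ns(double cycles, double clock_ghz) {
    return cycles / clock_ghz;  // 1 GHz == 1 cycle per nanosecond
}

// Reads a raw timestamp: the TSC through RDTSCP where available,
// otherwise a steady_clock tick count as a portable stand-in.
std::uint64_t read_timestamp() {
#ifdef HAS_RDTSCP
    unsigned int aux = 0;  // receives the IA32_TSC_AUX value
    return __rdtscp(&aux);
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Returns the elapsed timestamp ticks around a callable.
template <typename F>
std::uint64_t measure_ticks(F&& f) {
    const std::uint64_t start = read_timestamp();
    f();
    const std::uint64_t end = read_timestamp();
    return end - start;
}
```

For example, `cycles_to_ns(32, 4.5)` reproduces the RDTSCP resolution estimate for a 4.5 GHz Skylake CPU.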

\_\_m128i , 2 x 64 bit long longs

**N-WAY SET ASSOCIATIVITY**

Why: Cache capacities are much smaller than the system memory. Moreover, software can use various regions of their address space. So if there was one to one mapping of a fully sequential memory, that would lead to cache misses most of the time. Therefore there is a need for efficient mapping between the cache memory and the system memory.

How: In N-way set associativity, caches are divided into groups of sets, and each set has N cache blocks. The mapping information is stored in bits of addresses, which have 3 parts:

TAG: used as a unique identifier per cache block

SET: used to determine the set in a cache

OFFSET: used to determine the actual bytes in the target cache block

The level of associativity (the number of ways) is a trade-off between the search time and the amount of system memory we can map.

The pseudocode below shows steps for searching a single byte in the cache memory:

```
Get tag, set and offset from the address
For each block in the current set ( which we have just found out )
    if tag of the current block equals to tag ( which we just have found out )
        read and return data using offset , it is a cache hit
If there was no matching tag, it is a cache miss
```
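The lookup steps of N-way set associativity can be sketched in C++ as below. This is a toy model under stated assumptions: the geometry (64-byte lines, 64 sets, 8 ways) is hypothetical, `split_address`, `read_byte` and `install_byte` are illustrative names, and there is no replacement policy or backing memory.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical geometry: 64-byte lines, 64 sets, 8 ways (a 32 KB cache).
constexpr std::uint64_t kLineSize = 64;
constexpr std::uint64_t kNumSets = 64;
constexpr std::size_t kWays = 8;

struct CacheBlock {
    bool valid = false;
    std::uint64_t tag = 0;
    std::array<std::uint8_t, kLineSize> data{};
};

struct AddressParts {
    std::uint64_t tag;
    std::uint64_t set;
    std::uint64_t offset;
};

// Get tag, set and offset from the address.
AddressParts split_address(std::uint64_t address) {
    return {address / (kLineSize * kNumSets),
            (address / kLineSize) % kNumSets,
            address % kLineSize};
}

struct Cache {
    std::array<std::array<CacheBlock, kWays>, kNumSets> sets{};

    // Searches the N blocks of the selected set; empty result means a miss.
    std::optional<std::uint8_t> read_byte(std::uint64_t address) const {
        const AddressParts p = split_address(address);
        for (const CacheBlock& block : sets[p.set]) {
            if (block.valid && block.tag == p.tag) {
                return block.data[p.offset];  // cache hit
            }
        }
        return std::nullopt;  // cache miss
    }

    // Illustrative helper: places a byte into a matching block, else into the
    // first free way, else overwrites way 0 (no real replacement policy).
    void install_byte(std::uint64_t address, std::uint8_t value) {
        const AddressParts p = split_address(address);
        auto& set = sets[p.set];
        CacheBlock* target = nullptr;
        for (CacheBlock& block : set) {
            if (block.valid && block.tag == p.tag) { target = &block; break; }
        }
        if (target == nullptr) {
            target = &set[0];
            for (CacheBlock& block : set) {
                if (!block.valid) { target = &block; break; }
            }
            *target = CacheBlock{};
            target->valid = true;
            target->tag = p.tag;
        }
        target->data[p.offset] = value;
    }
};
```

With this geometry, two addresses that are 4096 bytes apart map to the same set with different tags, so they compete for the 8 ways of that set.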



STORE-TO-LOAD FORWARDING & LHS & PERFORMANCE L1 CACHE

**FLOATING POINTS**

X86 uses the IEEE 754 standard for floating points. A 32 bit floating point consists of 3 parts in the memory layout: sign (1 bit), exponent (8 bits) and mantissa (23 bits).

(Figure: all bits of the floating point number 1234.5678, visualised with <u>bartaz.github.io/ieee754-visualization</u>.)

A floating point's value is calculated as: ±mantissa × 2<sup>exponent</sup>

**ARITHMETIC INSTRUCTION LATENCIES**

DRAM is used in system memories.

Caches are organised in multiple levels. As you go upper in that hierarchy, the capacity increases. The term **LLC** is used to indicate the last level of cache. 3 level caches are currently the most common ones. Intel Broadwell architecture had 4 level caches in the past. Also upcoming AMD CPUs may come with 4 levels of caches.

Cache line size is the unit of data transfer between the cache and the system memory. It is typically 64 bytes. And the caches are organised according to the cache line size.

Lemire's article

In the example above, there are 2 dependency chains, each marked with a different colour. In the first red coloured one, 2 instructions are competing for RAX register

IEEE754 also defines denormal numbers. They are very small / near zero numbers. As floating points are approximations, denormal numbers are needed to avoid an undesired case of: a!=b but a-b=0. Without denormals the code below would invoke a divide-by-zero exception.

```cpp
float GetInverseOfDiff(float a, float b)
{
    if (a != b)
        return 1.0f / (a - b);
    return 0.0f;
}
```

Reference: Bruce Dawson's article
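The 3-part layout and the denormal case can be checked with a few lines of C++. This is a minimal sketch assuming a 32 bit IEEE 754 `float` and default floating point settings (no flush-to-zero mode, no fast-math flags); `decompose` and `is_denormal` are hypothetical helper names.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t mantissa;  // 23 bits
};

// Splits a float into its sign, exponent and mantissa bit fields.
FloatFields decompose(float value) {
    std::uint32_t bits = 0;
    static_assert(sizeof(bits) == sizeof(value), "float is expected to be 32 bits");
    std::memcpy(&bits, &value, sizeof(bits));
    return {bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// True when the value is a denormal (subnormal) number.
bool is_denormal(float value) {
    return std::fpclassify(value) == FP_SUBNORMAL;
}
```

As a usage example, with `b = std::numeric_limits<float>::min()` (the smallest normal float) and `a = 1.5f * b`, the two values differ and `a - b` is a nonzero denormal, which is the case that keeps `GetInverseOfDiff` away from a division by zero.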

Based on Agner Fog's microarchitecture book, Intel CPUs have a penalty for denormal numbers, for example 129 clock cycles on Skylake. Denormals can also be turned off on Intel CPUs. On the AMD side, the recent Zen architecture CPUs seemingly don't have the same performance degradation.

**X86 EXTENSIONS**

x86 extensions are specialised instructions. They have various categories from <u>cryptography</u> to <u>neural network operations</u>. Intel Intrinsics Guide is a good page to explore those extensions. You can use those to maximise the CPU usage.

SSE (Streaming SIMD Extensions) is one of the most important ones. SIMD stands for "single instruction multiple data". SIMD instructions use wider registers to execute more work in a single go:

(Figure: i1 + j1 = s1, i2 + j2 = s2, i3 + j3 = s3, i4 + j4 = s4)

In the example above, an array of 4 integers (i1 to i4) is added to another array of 4 integers (j1 to j4). The result is also an array of sums (s1 to s4). In this example, 4 add operations are executed by a single instruction.

They play a key role in compilers' vectorisation optimisations: GCC auto vectorisation

Apart from arithmetic operations, they can be utilised for string operations as well. A SIMD based JSON parser: https://github.com/simdjson/simdjson

**X86 EXTENSIONS: SIMD DETAILS**

The most recent SIMD instruction sets for Intel CPUs are:

AVX: Up to 256 bits

AVX2: Up to 256 bits

<u>AVX512</u>: Up to 512 bits

Recent AMD CPUs support AVX & AVX2. Only the latest Zen4 architecture supports AVX512.

As for programming, there are also wider data types. The data types below are for 128 bit operations:

\_\_m128: 4 x 32 bit floating points (float)

\_\_m128d: 2 x 64 bit doubles (double)

\_\_m128i: 4 x 32 bit ints (int), or 2 x 64 bit long longs (long long)

Note that as SIMD instructions require more power, usage of some AVX2/512 extensions may introduce downclocking and may affect the frequency while in TurboBoost. They should be benchmarked. For details: <u>Daniel Lemire's article</u>
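The 4-integer addition example from this section can be written with SSE2 intrinsics as sketched below. `Int4` and `add4` are hypothetical names; the intrinsics path is only compiled when the compiler defines `__SSE2__`, and a plain scalar loop is used otherwise so the sketch stays portable.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 integer intrinsics
#endif

using Int4 = std::array<std::int32_t, 4>;

// Adds i1..i4 to j1..j4 and returns s1..s4.
Int4 add4(const Int4& i, const Int4& j) {
    Int4 s{};
#if defined(__SSE2__)
    // Load 4 x 32 bit ints into each 128 bit register (unaligned loads).
    const __m128i vi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(i.data()));
    const __m128i vj = _mm_loadu_si128(reinterpret_cast<const __m128i*>(j.data()));
    // One instruction performs all 4 additions.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(s.data()), _mm_add_epi32(vi, vj));
#else
    for (std::size_t k = 0; k < s.size(); ++k) s[k] = i[k] + j[k];
#endif
    return s;
}
```

In practice an optimising compiler may auto-vectorise the scalar loop into similar instructions, so hand-written intrinsics should be compared against the compiler's output.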

OFFSET: used to determine the actual bytes



**LOAD & STORE BUFFERS**

Load and store buffers allow the CPU to do out-of-order execution on loads and stores by decoupling speculative execution and committing the results to the cache memory.

Reference: <a href="https://en.wikipedia.org/wiki/Memory\_disambiguation">https://en.wikipedia.org/wiki/Memory\_disambiguation</a>

**STORE-TO-LOAD FORWARDING**

Using buffers for stores and loads to support out-of-order execution leads to a data synchronisation issue. That issue is described in en.wikipedia.org/wiki/Memory\_disambiguation#Store\_to\_load\_forwarding

As a solution, the CPU can forward a memory store operation to a following load, if they are both operating on the same address.

An example store and load sequence:

```asm
mov [eax],ecx ; STORE, write the value of ECX register to the memory address which is stored in EAX register
mov ecx,[eax] ; LOAD, read the value from that memory address (which was just used) and write it to ECX register
```

(Figure: numbered steps between the STORE BUFFER, the LOAD BUFFER and the cache memory.)

There are several conditions for the forwarding to happen. In case of a successful forwarding, the steps 2 and 3 (a roundtrip to the cache) will be bypassed. The conditions for a successful forwarding and latency penalties in case of no-forwarding can be found in Agner Fog's microarchitecture book.

Based on Intel Optimization Manual 3.6.4, store-to-load forwarding may improve the combined latency of those 2 operations. The reason is not specified, however it is potentially the LHS (Load-Hit-Store) problem in which the penalty is a round trip to the cache memory: https://en.wikipedia.org/wiki/Load-Hit-Store

What would happen without forwarding?: In the past, game consoles PlayStation3 and Xbox360 had PowerPC based processors which used in-order execution rather than out-of-order execution. Therefore developers had to separately handle LHS by using the restrict keyword and other methods: Elan Ruskin's article

How: There are auxiliary hardware buffers. Pattern history tables track the history of results (whether it was taken or not) per branch. Branch target buffer stores target addresses (instruction pointers) of branches. AMD uses multiple levels of BTBs: L1 BTB, L2 BTB etc.

| | 1 | 2 | 3 | 4 | ... |
|---|---|---|---|---|---|
| branch 1 | T | T | NT | T | ... |
| branch ... | NT | NT | NT | T | |
| branch n | T | NT | NT | NT | ... |

A hypothetical pattern history table. T: taken, NT: not taken

Clock cycles on the Skylake architecture, on 64 bit registers:

• Bitwise operations, integer add/sub: 0.25 to 1 clock cycle

• Floating point add: 3 clock cycles

• Integer division: 24-90 clock cycles

# **CACHE MEMORY VS SYSTEM MEMORY**



| | Cache memory | System memory |
|---|---|---|
| Access time | Under 1 nanosecond | 50-150 nanoseconds, due to capacitor charge/discharge times and other steps |
| Cost | Expensive, due to 6 transistors | Cheaper, as it has fewer components |

Reference: Ulrich Drepper's What every programmer should know about memory

**CACHE ORGANISATION**



All the caches mentioned until now were data caches. But there is also the instruction cache (iCache), which stores program instructions rather than data to improve the throughput of the CPU frontend.

In case of a cache hit, the latency is typically single digit nanoseconds. In case of a cache miss, we need a round trip to the system memory and the total latency becomes 3 digit nanoseconds.

**HARDWARE AND SOFTWARE PREFETCHING**

Intel Optimisation Manual 3.7 describes prefetching. Hardware prefetchers prefetch data and instructions to cache lines automatically. Developers can also use the \_mm\_prefetch intrinsic to prefetch data explicitly. That is called software prefetching. However, performance improvement by using software prefetching is controversial: Daniel Lemire's article
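Software prefetching with `_mm_prefetch` can be sketched as below. The scenario is an assumption chosen for illustration: an indirect (gather style) access pattern, which can be harder for hardware prefetchers than a sequential scan. `prefetch_hint`, `gather_sum` and the prefetch distance of 8 are hypothetical, and the hint compiles to a no-op when `__SSE__` is not defined. As the text notes, the benefit is not guaranteed and has to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__SSE__)
#include <xmmintrin.h>  // _mm_prefetch
#endif

// Hints that the cache line holding `p` will be needed soon.
inline void prefetch_hint(const void* p) {
#if defined(__SSE__)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
#else
    (void)p;  // no-op on targets without the intrinsic
#endif
}

// Sums values[indices[k]] while prefetching the element that will be
// needed `distance` iterations later. All indices must be valid.
std::uint64_t gather_sum(const std::vector<std::uint64_t>& values,
                         const std::vector<std::size_t>& indices,
                         std::size_t distance = 8) {
    std::uint64_t sum = 0;
    for (std::size_t k = 0; k < indices.size(); ++k) {
        if (k + distance < indices.size()) {
            prefetch_hint(&values[indices[k + distance]]);
        }
        sum += values[indices[k]];
    }
    return sum;
}
```

The prefetch only changes timing, never the result, so the function can be compared directly against a version without the hint.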

Modern NICs come with a DMA (Direct Memory Access) engine and can transfer data directly to drivers' ring buffers which reside on the system memory. The DMA mechanism doesn't require CPU involvement for the transfer itself. However the mechanism is initiated by the CPU, therefore CPU support is needed.

DCA (Direct Cache Access) bypasses the system memory and can transfer directly to the LLC of CPUs that support this feature. Intel refers to their technology as DDIO (Data Direct I/O). Reference: Intel documentation

#### **BRANCH PREDICTION REALM**

**BRANCH PREDICTION BASICS**

Why: CPUs proactively fetch instructions of potentially upcoming branches to utilise the pipeline as much as possible.

Gain if predicted correctly: If the right branch was predicted, that will increase the throughput as the CPU has completed fetching a set of instructions in advance.

Penalty in case of misprediction: If the prediction was wrong, that prefetch will be a waste and the cost will be flushing the pipeline.

What are branch instructions?: Unconditional ones (jmp), conditional ones (eg: jne), call/ret

**CONDITIONAL MOVE INSTRUCTION**

A CMOV (conditional move) instruction computes its condition as part of its own execution, at the cost of some additional time. Therefore CMOV instructions don't introduce extra load to the branch prediction mechanism. They can be used to eliminate branches.

Reference: Intel Optimisation Manual 3.4.1.1

**BP METHODS: 2-LEVEL ADAPTIVE BRANCH PREDICTION**

Saturating counter as a building block: A 2-bit saturating counter can store 4 strength states (strongly not taken, not taken, taken, strongly taken). Whenever a branch is taken it goes stronger. And whenever a branch is not taken it goes weaker.

2 level adaptive predictor: In this method, the pattern history table keeps 2<sup>n</sup> rows and each row has a saturating counter. A branch history register, which has the history of the last n occurrences, is used to choose which row will be used from the pattern history table.

Reference: Agner Fog's microarchitecture book 3.1.
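The 2-level adaptive scheme can be modelled in a few lines of C++. This is a simplified sketch for a single branch, not a description of any real CPU's predictor: the class names, the initial counter state and the single shared history register are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 2-bit saturating counter: 0 = strongly not taken ... 3 = strongly taken.
class SaturatingCounter {
public:
    bool predict_taken() const { return state_ >= 2; }
    void update(bool taken) {
        if (taken && state_ < 3) ++state_;
        if (!taken && state_ > 0) --state_;
    }
private:
    std::uint8_t state_ = 1;  // arbitrary start: (weakly) not taken
};

// Two-level adaptive predictor with n bits of branch history.
class TwoLevelPredictor {
public:
    explicit TwoLevelPredictor(unsigned history_bits)
        : mask_((1u << history_bits) - 1u),
          table_(std::size_t{1} << history_bits) {}

    // The history register selects the row of the pattern history table.
    bool predict() const { return table_[history_].predict_taken(); }

    void update(bool taken) {
        table_[history_].update(taken);  // train the selected counter
        history_ = ((history_ << 1) | (taken ? 1u : 0u)) & mask_;
    }
private:
    unsigned mask_;
    unsigned history_ = 0;                  // branch history register
    std::vector<SaturatingCounter> table_;  // pattern history table, 2^n rows
};
```

With 2 history bits, a repeating taken, taken, not taken pattern becomes fully predictable after a short warm-up, because each history value is always followed by the same outcome.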

# **BP METHODS: AMD PERCEPTRONS**

A <u>perceptron</u> is basically the simplest form of machine learning. It can be considered as a linear array of weights. The output Y (in this case whether a branch is taken or not) is calculated by the dot product of the weight vector and the input vector.

Agner Fog mentions in his microarchitecture book 3.12 that they are good at predicting very long branch patterns compared to 2-level adaptive branch prediction. For details of perceptron based branch prediction: Dynamic Branch Prediction with Perceptrons by Daniel Jimenez and Calvin Lin
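A perceptron predictor in the style of the Jimenez and Lin paper can be sketched as follows. This is an illustrative model, not AMD's implementation: the class name, the history length `N`, the training threshold and the unbounded `int` weights are assumptions. Inputs are +1 for a taken branch and -1 for a not taken one, and the sign of the dot product (plus a bias weight) gives the prediction.

```cpp
#include <array>
#include <cstddef>
#include <cstdlib>

template <std::size_t N>
class PerceptronPredictor {
    static_assert(N > 0, "history length must be positive");
public:
    PerceptronPredictor() { history_.fill(-1); }  // start as "not taken" history

    // y = bias + dot(weights, history)
    int output() const {
        int y = bias_;
        for (std::size_t i = 0; i < N; ++i) y += weights_[i] * history_[i];
        return y;
    }

    bool predict() const { return output() >= 0; }

    void update(bool taken) {
        const int t = taken ? 1 : -1;
        const int y = output();
        // Train on a misprediction or when the output is not confident enough.
        if ((y >= 0) != taken || std::abs(y) <= kThreshold) {
            bias_ += t;
            for (std::size_t i = 0; i < N; ++i) weights_[i] += t * history_[i];
        }
        // Shift the newest outcome into the history.
        for (std::size_t i = N - 1; i > 0; --i) history_[i] = history_[i - 1];
        history_[0] = t;
    }
private:
    static constexpr int kThreshold = 2 * static_cast<int>(N);  // hypothetical value
    int bias_ = 0;
    std::array<int, N> weights_{};
    std::array<int, N> history_{};
};
```

For a strictly alternating branch, the weight of the most recent outcome becomes negative after a few updates, so the predictor learns to predict the opposite of the previous outcome.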

**INTEL LSD ( LOOP STREAM DETECTOR )**

Intel LSD will detect a loop and stop fetching instructions to improve the frontend bandwidth. Several conditions mentioned in <u>Intel Optimisation Manual</u> 3.4.2.4:

• Loop body size up to 60 μops, with up to 15 taken branches, and up to 15 64-byte fetch lines.

• No CALL or RET.

• No mismatched stack operations (e.g., more PUSH than POP).

• More than ~20 iterations.

Note that LSD is disabled on Skylake Server CPUs. Reference: https://en.wikichip.org/wiki/intel/microarchitectures/skylake\_(server)#Front-end

**DISABLING SPECULATIVE EXECUTION PATCHES**

You can consider disabling system patches for speculative execution related vulnerabilities such as Meltdown and Spectre for performance, if that is doable in your system.

Kernel.org documentation: https://www.kernel.org/doc/html/latest/admin-guide/kernel-

Red Hat Enterprise documentation: <a href="https://access.redhat.com/articles/3311301">https://access.redhat.com/articles/3311301</a>

Meltdown paper: https://meltdownattack.com/meltdown.pdf

Spectre paper: https://spectreattack.com/spectre.pdf

**ESTIMATED LIMITS: HOW MANY IFS ARE TOO MANY?**

As for the max number of entries in BTBs, there are estimations made by stress testing the BTB with sequences of branch instructions:

Intel Xeon Gold 6262 -> roughly 4K

AMD EPYC 7713 -> roughly 3K

Reference: Marek Majkowski's article on Cloudflare blog

(Figure: the simplest MESI cases for a variable x, which is 0 in the system memory.)

Case 1: Only cache1 of core1 holds the data. Therefore it is in E (exclusive) state.

Case 2, Core 2 reads data: Both cache blocks on 2 cores hold the same data, and both are in S (shared) state.

Case 3, Core 1 modifies the data: cache1 holds var x = 1 while the other copies still hold var x = 0.

# VIRTUAL MEMORY REALM



**TLB PRESSURE & HUGE PAGES**

TLB miss: If a translation is not found in the TLB, we need to go to the page table on the system memory (managed by the OS), which is slower.

TLB pressure: If each page is 4K, that increases the load on the TLB.

CPU support for larger pages: x86-64 CPUs support huge pages from 2MB to 1GB to reduce the pressure on the TLB. The Linux implementation refers to them as huge pages and Windows calls them large pages. You should check your OS and CPU in combination to find out the supported sizes.

(Figure: mapping 1 GB with regular pages versus huge pages.)

**PAGE TABLE WALKING**

Even with pages which group addresses, having all pages in a single flat page table would still need too much storage on 64 bit systems. Therefore page tables are implemented hierarchically. Memory is divided into address spaces, and there is a tree data structure for each address space in the page table. The CPU has to "walk the page table" level by level in the hierarchy to find out the actual address.

(Figure: address bits 47-39, 38-30, 29-21 and 20-12 index the four page table levels; bits 11-0 are the offset within the page.)

The 4 level page table is the most common one. In the diagram, the lowest 48 bits of a 64 bit address are used for page table walking. All of the 48 bits have to be used in order to find out the final actual address. (For all details: Intel Software Developer's Manual Volume3 4.5)

Intel CPUs started to support 5 level tables with Ice Lake. The advantage of another level is that you can address even more space. The disadvantage is that the time needed to walk the page table increases due to a new level of indirection.
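How a 48 bit virtual address is split for a 4 level walk with 4 KB pages can be sketched as below. The struct and function names are hypothetical; the bit ranges follow the diagram labels (47-39, 38-30, 29-21, 20-12, then the 12 offset bits).

```cpp
#include <array>
#include <cstdint>

struct PageWalkIndices {
    // Top level first: bits 47-39, 38-30, 29-21, 20-12 (9 bits each).
    std::array<std::uint16_t, 4> level;
    // Bits 11-0: byte offset inside the 4 KB page.
    std::uint16_t offset;
};

PageWalkIndices split_virtual_address(std::uint64_t va) {
    PageWalkIndices r{};
    r.level[0] = static_cast<std::uint16_t>((va >> 39) & 0x1FF);
    r.level[1] = static_cast<std::uint16_t>((va >> 30) & 0x1FF);
    r.level[2] = static_cast<std::uint16_t>((va >> 21) & 0x1FF);
    r.level[3] = static_cast<std::uint16_t>((va >> 12) & 0x1FF);
    r.offset = static_cast<std::uint16_t>(va & 0xFFF);
    return r;
}
```

Each 9 bit index selects one of 512 entries in the table of that level, and the entry points to the table of the next level. A 5 level scheme adds one more 9 bit index above bit 47.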

**SYSTEM MEMORY REALM**

DDR RAMs are the most common commodity hardware as system memory. They are found in forms of DIMMs (Dual inline memory module) / RAM sticks.

(Figure: a DIMM.)

System memory / RAM is organised as a collection of ranks. Each rank has banks, which are collections of DRAM cells (one cell per bit).

(Figure: RANK 1 to RANK N, each containing BANK 1 to BANK N.)

**DRAM refreshes**

DRAM circuits use capacitors which lose their charge over time. (See the cache memory realm.) So RAMs have to refresh their DRAM cells periodically. As for DDR4, refreshing is rank-level, which means the other banks in the same rank become inaccessible. DDR5 comes with a same-bank-refresh feature which allows a more fine-grained bank-level refresh. Therefore it can offer a higher throughput.

(Figure: DDR4 refresh granularity versus DDR5 refresh granularity.)

SMP (Symmetric multiprocessing): All CPUs use a single bus to access the same system memory. CPUs may slow down each other as there may be contention for access to the banks.

# MULTICORE REALM

#### **TOPOLOGIES**

**TOPOLOGICAL OVERVIEW - INTEL CPUS** 



The diagram above aims to show resources per core and shared resources. Note that uncore is an Intel-only term to refer to CPU functionality which is not per core.

**Exception of E-cores :** An exception to the above diagram is Intel's recent E-cores. E-cores are meant for power efficiency and are paired with fewer resources. For example, Alder Lake CPUs' E-cores also share L2 cache. Reference: https://www.anandtech.com/show/16959/intel-innovation-alder-lake-november-4th

**TOPOLOGICAL OVERVIEW - AMD CPUS** 

Most of the AMD topology is similar to Intel. However, starting from the Zen microarchitecture, one key difference is CCXs.

AMD CPUs are designed as groups of 4 cores, each called a CCX (Core complex), and there is one LLC per CCX/quad core. Practically the maximum number of cores competing for the LLC (without simultaneous multithreading) is 4 in recent AMD CPUs.

An example 8 core CPU with 2 CCXs:

(Figure: a CPU with CORE 1 to CORE 8 split across 2 CCXs.)

Reference: https://en.wikichip.org/wiki/amd/microarchitectures/zen#CPU\_Complex\_.28CCX.29

#### COHERENCY

**CACHE COHERENCY: PROTOCOLS**

Cache coherency protocols are needed to avoid data hazards. Intel CPUs use MESIF and AMD CPUs use MOESI, however both heavily depend on the MESI protocol. There are 4 states for a CPU cache line in the MESI protocol, which are M for modified, E for exclusive, S for shared and I for invalid. The 3 diagrams for this topic illustrate the simplest cases for all 4 states.

Intel's MESIF on Wikipedia

AMD's MOESI on Wikipedia

A state transition can trigger the cache coherency protocol across multiple cores. Variables can be cached to avoid cache coherency traffic wherever applicable. Erik Rigtorp's article: Optimising a ring buffer for throughput
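The three simplest MESI cases can be reproduced with a toy state model. This is a deliberately simplified sketch, assuming one cache line, no real bus messages and no write-back modelling; `Mesi` and `CoherentLine` are hypothetical names.

```cpp
#include <array>
#include <cstddef>

enum class Mesi { Modified, Exclusive, Shared, Invalid };

// Tracks the MESI state of one cache line across N cores.
template <std::size_t N>
class CoherentLine {
public:
    CoherentLine() { state_.fill(Mesi::Invalid); }

    void read(std::size_t core) {
        if (state_[core] != Mesi::Invalid) return;  // hit: state unchanged
        bool others_hold = false;
        for (std::size_t c = 0; c < N; ++c) {
            if (c != core && state_[c] != Mesi::Invalid) {
                state_[c] = Mesi::Shared;  // M/E holders are downgraded
                others_hold = true;
            }
        }
        state_[core] = others_hold ? Mesi::Shared : Mesi::Exclusive;
    }

    void write(std::size_t core) {
        for (std::size_t c = 0; c < N; ++c) {
            if (c != core) state_[c] = Mesi::Invalid;  // invalidate other copies
        }
        state_[core] = Mesi::Modified;
    }

    Mesi state(std::size_t core) const { return state_[core]; }
private:
    std::array<Mesi, N> state_{};
};
```

Calling `read(0)`, then `read(1)`, then `write(0)` on a `CoherentLine<2>` walks through E, then S on both cores, then M on the writer and I on the other core.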

Only core1 holds the latest data so cache1 is in M (modified) state and cache2 is in I (invalid) state.

**CACHE COHERENCY: FALSE SHARING AND CACHE PING-PONGING**

(Figure: Core 1 and Core 2 with a shared system memory cache line holding var1 for Core1 and var2 for Core2.)

In the diagram, if Core1 changes its var1, that change will need to be propagated to all other cores by the cache coherency protocol. That will lead to invalidation of cache areas associated with the shared cache line across all cores, even though var1 is used by only one core. That situation is called false sharing.

If those invalidations happen at high rates and cache lines are transferred between cores rapidly, that situation is called cache ping-pong.

**VIRTUAL MEMORY PAGE TABLE COHERENCY: TLB SHOOTDOWNS**

Whenever a page table entry is modified by any of the cores, that particular TLB entry is invalidated in all cores via IPIs. This one is not done by hardware but initiated by the operating system. IPI: Interprocessor interrupt, you can take "processor" as core in this context.
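A common way to avoid the false sharing described in this section is to place per-core variables on separate cache lines. The sketch below assumes a 64 byte cache line (the typical size mentioned in the cache organisation notes); the struct names and the `same_cache_line` helper are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 64;  // assumed cache line size

// var1 and var2 are adjacent, so they can end up on the same cache line.
struct SharedLineCounters {
    std::atomic<std::uint64_t> var1{0};  // written by core 1
    std::atomic<std::uint64_t> var2{0};  // written by core 2
};

// Each variable is aligned to its own cache line.
struct PaddedCounters {
    alignas(kCacheLineSize) std::atomic<std::uint64_t> var1{0};
    alignas(kCacheLineSize) std::atomic<std::uint64_t> var2{0};
};

// True if both addresses fall into the same (assumed) 64 byte line.
inline bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLineSize ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLineSize;
}
```

The padded layout costs memory (128 bytes instead of 16 here), so it is usually reserved for variables that are written frequently by different cores.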



### MEMORY REORDERINGS & SYNCHRONISATION

**MEMORY REORDERINGS**

The term memory ordering refers to the order in which the processor issues reads (loads) and writes (stores). Based on Intel Software Developer's Manual Volume3 8.2.3.4, there is only one kind of memory reordering that can happen: loads can be reordered with earlier stores if they use different memory locations. That reordering will not happen if they use the same address:

```asm
; x and y initially 0

; CORE1
mov [x], 1          ; STORE TO X
mov [result1], y    ; LOAD FROM Y

; CORE2
mov [y], 1          ; STORE TO Y
mov [result2], x    ; LOAD FROM X
```

In case of reordering, result1 and result2 above can both end up as zero. Note that, apart from CPUs, compilers can also do memory reordering: Jeff Preshing's article: Memory Ordering at Compile Time

**INSTRUCTIONS TO AVOID REORDERINGS**

Reorderings can be avoided by using memory ordering instructions such as SFENCE, LFENCE, and MFENCE. Intel Software Developer's Manual Volume3 8.3 describes them alongside the serialising instructions, which it defines as: These instructions force the processor to complete all modifications to flags, registers, and memory by previous instructions and to drain all buffered writes to memory before the next instruction is fetched and executed.

There is also the bus locking "LOCK" prefix (Intel Software Developer's Manual Volume3 8.1.2) which can be used as well to avoid reorderings.
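A high level counterpart of the two-core example can be sketched with C++11 atomics. The names mirror the assembly listing (`x`, `y`, `result1`, `result2`); `core1` and `core2` are hypothetical functions meant to run on two different threads. With a sequentially consistent fence between the store and the load on both sides, the C++ memory model rules out the outcome where both results are zero. On x86, compilers typically implement such a fence with MFENCE or a LOCK prefixed instruction.

```cpp
#include <atomic>

std::atomic<int> x{0};
std::atomic<int> y{0};
int result1 = -1;
int result2 = -1;

// Intended to run on one thread/core.
void core1() {
    x.store(1, std::memory_order_relaxed);                // STORE TO X
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
    result1 = y.load(std::memory_order_relaxed);          // LOAD FROM Y
}

// Intended to run on another thread/core.
void core2() {
    y.store(1, std::memory_order_relaxed);                // STORE TO Y
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
    result2 = x.load(std::memory_order_relaxed);          // LOAD FROM X
}
```

Removing the two fences allows the store and the following load to be reordered, which is the case where both `result1` and `result2` can be observed as zero.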

High level languages expose memory fence APIs which compile down to one of those methods.

(Figure: compiler output taken from the online tool **Compiler Explorer**.)

**ATOMIC OPERATIONS**

An atomic operation means that no other operation can observe or modify the same data while it executes. From the point of execution, an atomic operation is indivisible and nothing can affect its execution. The most common type of atomic operations are RMW (read-modify-write) operations.

**ATOMIC RMW OPERATIONS: COMPARE-AND-SWAP**

The CAS instruction (CMPXCHG) compares the value in a memory location with an expected value and, only if they are equal, writes a new value to that location. All the operations are atomic / uninterruptible. It can be used to implement lock-free data structures.

**ATOMIC RMW OPERATIONS: TEST-AND-SET**

Test-and-set is an atomic operation which writes to a target memory and returns its old value. It is typically used to implement spin locks.

**TRANSACTIONAL MEMORY**

Transactional memory areas are programmer specified critical sections. Reads and writes in those areas are done atomically. (<u>Intel Optimization Manual</u> section 16) However due to another hardware security issue, Intel disabled them from Skylake to Coffee Lake CPUs: <a href="https://www.theregister.com/2021/06/29/intel\_tsx\_disabled/">https://www.theregister.com/2021/06/29/intel\_tsx\_disabled/</a>

The AMD equivalent is called "Advanced Synchronization Facility". According to the Wikipedia article, there are no AMD processors using it yet.
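The two atomic RMW operations described in this section, compare-and-swap and test-and-set, map onto the C++ standard library as sketched below. `atomic_fetch_max` and `SpinLock` are hypothetical names used for illustration; on x86, compilers typically lower `compare_exchange_weak` to a LOCK CMPXCHG instruction.

```cpp
#include <atomic>
#include <cstdint>

// CAS loop: atomically raises `target` to at least `candidate` and
// returns the value that was observed before the update.
std::uint64_t atomic_fetch_max(std::atomic<std::uint64_t>& target,
                               std::uint64_t candidate) {
    std::uint64_t observed = target.load(std::memory_order_relaxed);
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate)) {
        // On failure `observed` is refreshed with the current value; retry.
    }
    return observed;
}

// Spin lock built on test-and-set.
class SpinLock {
public:
    void lock() {
        // test_and_set returns the old value: true means someone holds the lock.
        while (flag_.test_and_set(std::memory_order_acquire)) {
        }
    }
    bool try_lock() { return !flag_.test_and_set(std::memory_order_acquire); }
    void unlock() { flag_.clear(std::memory_order_release); }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

`SpinLock` satisfies the standard Lockable requirements, so it can be used with `std::lock_guard<SpinLock>`.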

#### LIMITING CONTENTION BETWEEN CORES

**DISABLING UNUSED CORES TO MAXIMISE FREQUENCY (INTEL)**

The number of active cores may introduce downclocking: Wikichip article. Therefore disabling unused cores may improve frequency for perf-critical cores, depending on your CPU. You should refer to your CPU's frequency table. An example frequency table: Wikichip XeonGold5120 article

**ALLOCATING A PARTITION OF LLC (SERVER CLASS CPUS)**

You can allocate a partition of the shared CPU last level cache for your performance sensitive application to avoid evictions on Intel CPUs that support the CAT feature. CAT: Cache Allocation Technology, reference: Intel CAT page. **CDP** (Code and data prioritisation) allows developers to allocate LLC on code basis: Intel's CDP page on supported CPUs.

(Figure: CORE 1 and CORE 2, with LLC cache lines shared by non performance critical cores and LLC cache lines dedicated to only one, critical, core.)

On AMD side, QOS Extensions were introduced starting from Zen2. Corresponding technologies are called "Cache allocation enforcement" and "Code and data prioritisation": https://developer.amd.com/wp-content/resources/56375 1.00.pdf

**MEMORY BANDWIDTH THROTTLING (SERVER CLASS CPUS)**

You can throttle memory bandwidth per CPU core on Intel CPUs that support MBA. Each core can be throttled with its request rate controller unit. MBA: Memory bandwidth allocation, reference: Intel MBA page. For the AMD equivalent, QOS Extensions were introduced starting from Zen2: https://developer.amd.com/wp-content/resources/56375 1.00.pdf



**INTERLEAVING FOR REDUCING CONTENTION ON SYSTEM MEMORY**

Read and write requests are done at bank level. (See the system memory realm for its organisation.) Therefore if multiple cores try to access the same bank, there will be contention. Interleaving bank address spaces is one method to reduce that contention.







**MULTICPU REALM** (SERVER CLASS CPUS)

(Figure: CPU 1 to CPU N connected to the system memory (RAM) over a single bus.)

**ACROSS REALMS**

# INTEL'S TOPDOWN MICROARCHITECTURE ANALYSIS METHOD

Intel's Top Down analysis is a hierarchical organisation of microarchitecture events. It is documented in Intel Optimisation Manual Appendix B1. There are 2 main categories for stalls:

1. Frontend: A typical example is large code sizes leading to instruction cache misses.

2. Backend: Usually either memory bound or compute bound.

Branch mispredictions are categorised under "bad speculation". You can use either Intel's Vtune or Andi Kleen's <u>Toplev</u> tool to make a top down analysis. Both utilise Intel CPUs' performance monitoring counters.





