Cheat sheet by Kenny Luu, page 1 of 2

# **CPU Power Consumption**

Chip Power Consumtion = C \* $Vdd^2 * F C =$ capacitance, Vdd =voltage, F = frequency

Dennard scaling: Transistors get smaller but their power density stays the same. The supply voltage of a chip can be reduced by 0.7x every generation

 $Power = C * Vdd^2 * F_{0 \to 1} + Vdd *$ Ileakage Leakage gets worse with samller devices and lower *Vdd*, it also gets worse with higher temps Amdahl's Law: We want to make the common case efficient, given an optimization x that accerlera-

tes fraction  $f_x$  of the program by

 $(1-f_x)+\frac{f_x}{S_x}$ 

## Performance

Latency: how long it takes to do a

Throughput: total work done per unit time ExeTime

Instrs Clockcycles Program Instr ClockCycle Relative Performance: define performance as = 1/ExecutionTime "X is n times faster than Y" means Perormance<sub>x</sub>/Performance<sub>v</sub>

## 2 Instruction Set Architechture





Constant == immediate == literal

There are 32 registers in RISC-V **Id dst, offset(base)** the base is the starting address of the array the offset is the index

When using switch statement we can use a jump table, a jump table holds addresses in memory of where the code for the jump tar-

**Stack:** The stack is a allocated length

in frames, it stores the state of a procedure for a limited time, the callee returns before the caller does. The things which can be saved on the stack are: local arrays, return addresses, saved registers, and nested call arguments



- Callee saved registers (preserved for caller · Save register values on stack prior to use Restore registers before return
- · Caller saved registers (not preserved for caller) . Do what you please and expect callees to do likewise Should be saved by the caller if needed after procedure call
- Procedure Call Steps

# a factor $S_x$ , the overall speedup is:

# 1. Place parameters in a place

cess them 2. Transfer control to the pro-

where the procedure can ac-

- 3. Allocate the memory re-
- sources needed for the pro-
- 4. Perform the desired task
- 5. Place the result value in a place where the calling program can access it
- 6. Free the memory allocated
- 7. Return control to the point of origin

# 3 Pipelining

In pipelinging we overlap instructions in defferent stages



There are hazards, these include structural hazards where a required resource is busy, data harzards where we must wait previous instructions to produce/consume data, and control ha**zards** where next PC depends on previous instruction. **Structural Hazards** 

two instructions are trying to use the same hardware within the same cycle, to solve this we can make all the instructions the same We want to avoid in-order stalls

## **Data Dependencies**

Dependencies for instruction *j* following instruction i

- Read after Write (RAW or true dependence) Instruction *j* tries to read
- before instruction *i* tries to Write after Write (WAW or
- output dependence) Instruction *j* tries to write an operand before *i* writes
- its value Write after Read (WAR or
- (anti dependence)) Instruction *j* tries to write a destination before it is read

Solutions for RAW Hazards: We

can delay the reading of an instruction until data is available, to do this we can insert pipeline bubbles, can also write to the register file in the first half of a cycle and then read in the second half. **Forwarding:** Another solution is

forwarding or pushing the data to an appropriate unit. We can also reorder instructions to deal with RAW hazards



- if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (FX/MFM RegisterRd == ID/FX RegisterRt)) and (MEM/WB.RegisterRd = ID/EX.RegisterRt))

#### Control Hazards

A control hazards is like a data hazard on the PC, we cannot fetch the next instruction if we don't know the PC Some solutions for control ha-

zards are stalling on branches, predicting taken or not taken. We need to flush the pipeline if we predict wrong, in a 5-sage pipeline we only need to flush 1 instruc-

# Out of Order Execution and

so we use out of order execution

to re-order instructions based on dependencies



A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. I.e we can launch multiple instructions every cycle. There are some issues with multi-

ple instructions executing at onc,e we need to double the amount of hardware, we introduce hazards, branch delay, & load delay We can rename (map) architectural registers to physical registers in decode stage to get rid of false dependencies Superscalar + Dynamic scheduling + register renaming

add  $$t0_A$ , \$t1, \$t2 sub  $$t0_B$ , \$t1, \$t2or \$t3, \$t0, \$t2 and \$t5, \$t0, \$t2 There are some limits to **ILP** and pipelining:

- · Limited ILP in real pro-
- Pipeline overhead
- Branch and load delays exacerbated
- Clock cycle timing limits
- Limited branch prediction accuracy (85%-98%)
- · Even a few percent really hurts with long/wide pipes
- · Memory inefficiency
- Load delays + # of loads/cycle

# Caches



Principle of locality Programs work on a relatively small portion of data at any time, so we can predict data accessed in near future by looking at recent accesses There are two types of locality: spatial and temporal, spatial locality is if an item has been accessed recently, nearby items will

tend to be referenced soon, and When we encounter a miss the temporal is if an item has been pipeline stalls for an instruction referenced recently, it will proba- or data miss bly be accessed again soon

How is data stored in a cache? Cache Blocks There are three ways: • Direct mapped (single loca-

- Fully associative (anywhe-
- Set associative (anywhere
- in a set)

# **Direct Mapped Cache**

Location in cache determined by (main) memory address, we use the lowest order bits to determine this address.

To determine which particular block is stored in a cache location, we look at tag and valid bits, the tag bits are the highorder bits of the memory address, the **valid** bit represents 1 = present, 0 = not present



- Miss Data not found in the cache Miss rate - Percent of misses (1 - Hit rate)
- Miss penalty Overhead in getting data from a higher numbered level Miss penalty = higher level access time + Time to deliver to lower level + Cache replacement / forward to processor time Miss penalty is usually much larger than the hit time
  - . This is in addition to the hit time These apply to each level of a multi-level cache
- **Average Memory Access Times:**

# $Accesstime = hittime + missrate \times$ misspenalty

# 3 C's of Cache Misses:

- Compulsory this is the first time you referenced this item
- Capacity not enough room in the cache to hold this miss would disappear if the cache were big enough
- Conflict item was replaced because of a conflict in this miss would disappear

with more associativity

*CPI penalty* = missrate × misspenalty

Assuming a  $2^n$  byte direct mapped cache with  $2^m$  byte blocks

- Byte select The lower m
  - Cache index The lower (nm) bits of the memory ad-
- Cache tag The upper (32n) bits of the memory ad-Direct Mapped Problems: Thra-

shing If accesses alternate, one block will replace the other before



however it also incurs larger miss penalty since it takes longer to transfer the block into the cache.

**Fully Associative Cache** 

Opposite extreme in that it has no cache index to hash, it uses any available entry to store memory elements, there are no conflict misses, only capacity misses, and we must compare cache tags of all entries to find the desired one

# **N-way Set Associative Cache**

Compromise between directmapped and fully associative, each memory block can go to one of N entries in cache, and each set can store N blocks; a cache contains some number of sets



#### Tag & Index with Set-Associative Caches

Given a  $2^n$  byte cache with  $2^m$  byte blocks that is  $2^a$  setassociative, the cache contains  $2^{n-m}$  blocks, and each cache way contains  $2^{n-m-a}$  blocks, and the

Cheat sheet by Kenny Luu, page 2 of 2

cache index is n - m - a bits after the byte select

| Organization                      | # of sets | # blocks / set | 12 bit Address          |  |  |
|-----------------------------------|-----------|----------------|-------------------------|--|--|
| Direct mapped                     | 16        | 1              | tag index blk off 4 4 4 |  |  |
| 2-way<br>set associative          | 8         | 2              | tag index blk off 5 3 4 |  |  |
| 4-way<br>set associative          | 4         | 4              | tag ind blk off 6 2 4   |  |  |
| 8-way<br>set associative          | 2         | 8              | tag i bik off           |  |  |
| 16-way (fully)<br>set associative | 1         | 16             | tag blk off             |  |  |

Some cons with associative caches are that there is an an area overhead and more latency

The **replacement policy** for a Nway set associative cacheis choosing the least recently used thing

#### Cache Write Policies

- Write-through (write data go to cache and memory) Main memory is updated on each cache write
- Replacing a cache entry is simple (just overwrite new block)
- Memory write causes significant delay if pipeline must stal
- Write-back (write data only goes to the cache)
- Only the cache entry is updated on each cache write so main memory
- Add "dirty" bit to the cache entry to indicate whether the data in the
- Replacing a cache entry requires writing the data back to memory before replacing the entry if it is "dirty"



#### Use Write Buffer between cache and memory

- Processor writes data into the cache and the write buffer
- Memory controller slowly "drains" buffer to memory

#### Write Buffer: a first-in-first-out buffer (FIFO)

- Can absorb small bursts as long as the long-term rate of writing to

### Write Miss Options:

|       |                     |                     | -                            |                     | _                     |                       |
|-------|---------------------|---------------------|------------------------------|---------------------|-----------------------|-----------------------|
|       |                     | Write th            | Write back<br>Write allocate |                     |                       |                       |
|       | Write allocate      |                     |                              |                     | No write allocate     |                       |
| Steps | fetch on<br>miss    | no fetch on<br>miss | write around                 | write<br>invalidate | fetch on<br>miss      | no fetch on<br>miss   |
| 1     | pick<br>replacement | pick<br>replacement |                              |                     | pick re-<br>placement | pick re-<br>placement |
| 2     |                     |                     |                              | invalidate<br>tag   | [write back]          | [write back]          |
| 3     | fetch block         |                     |                              |                     | fetch block           |                       |
| 4     | write cache         | write partial cache |                              |                     | write cache           | write partial cache   |
| 5     | write<br>memory     | write<br>memory     | write<br>memory              | write               |                       |                       |

Splitting Caches Most chips have separate caches for instructions and data, some advantages are that we have extra bandwidth and low hit time, but miss rate will be higher.

Multilevel Caches: Different levels of caches L1, L2, and L3. L1 is focused on hit time L2 is focused on hit rate, L3 is extra.

Cache Coherence: State and sequence of actions needed to ensure caches are coherence in multicore systems, there are protocols that rely on monitoring other ca-

#### MSI Coherence Protocol

# Bus Read Exclusive

# Memory-Access Protocol

## 5 basic commands

- ACTIVATE (open a row)
- READ (read a column)
- WRITE
- PRECHARGE (close row)
- REFRESH

# To reduce pin count, row and column share same address

- RAS = Row Address Strobe
- CAS = Column Address Strobe

Modern DRAM chips consist of multiple banks, they operate independently, but share command/address/data pins, each bank can have a different row active so we can overlabp ACTIVA-TE and PRECHARGE latencies, in a coarser granularity i.e 4Kb e.g. READ to bank 0 while ACTI-VÄTING bank 1

#### **DRAM High Level**



Ranks A 64-bit DIMM with 16 chips with x8 interfaces has 2 ranks, each 64-bit group of chips is called a rank, all chips in a rank respond to the same command **DRAM Channels** All DIMMs get the same command, one of the ranks replies Multi-Channel: Multiple lock-

step channels, single channel with wider interface which gives faster cache line refill.

7 Virtual Memory

We expose a virtual memory space to applications so that they can assume that they own the whole space, then the OS handles the mapping of virtual to physical memory space.

Page Table: A data structure which contains virtual to physical address mappings. If we haven't used a page in a while we can map it to disk memory instead of DRAM. Page tables are mapping



One entry per virtual page number