



## **Computer Systems**

Advanced Processor Pipelines 2

Daniel Mueller-Gritschneder

08.04.2025





This book covers the basics of how to design a simple in-order scalar processor pipeline in detail in hardware.

- Literature: "Digital Design and Computer Architecture: RISC-V Edition", by Sarah L. Harris and David Harris
  - <a href="https://shop.elsevier.com/books/digital-design-and-computer-architecture-risc-v-edition/harris/978-0-12-820064-3">https://shop.elsevier.com/books/digital-design-and-computer-architecture-risc-v-edition/harris/978-0-12-820064-3</a>
  - <a href="https://pages.hmc.edu/harris/ddca/ddcarv.html">https://pages.hmc.edu/harris/ddca/ddcarv.html</a> (Includes resources for students!)
  - They also provide slideshows the basis for ours! You can investigate extended version at their website.
- Available at TU's library: <a href="https://catalogplus.tuwien.at/permalink/f/qknpf/UTW">https://catalogplus.tuwien.at/permalink/f/qknpf/UTW</a> alma21139903990003336



So-called application processors have many additional features: Branch prediction, Out of order execute, Scoreboard, Superpipelining, Multi-issue, Superscalar, VLIW, Multi-threading, ...

**Disclaimer**: The book provides advanced concepts from real complex processor designs. We only study the concepts at a high level. For simplicity, the used pipeline models in this lecture are reduced strongly in complexity.

But: We will have a look at some current RISC-V processor designs

Literature: "Computer Architecture A Quantitative Approach" 5th Edition - September 16, 2011 Authors: John L. Hennessy, David A. Patterson eBook ISBN: 9780123838735

- https://shop.elsevier.com/books/computer-architecture/hennessy/978-0-12-383872-8
- Available at TU's library: https://catalogplus.tuwien.at/permalink/f/8agg25/TN cdi askewsholts vlebooks 9780123838735

## RECAP: Five-Stage In-Order Scalar Pipeline



- Five Stage
- In-order pipeline

Each stage takes one cycle to complete

- Scalar pipeline
- Single access cycle to instruction and data memory: Works for small and slow micro-controller-type processors with on-chip embedded SRAM memories
- ➤ Single cycle operations, works for simple instructions (ADD, Compare,...)



Scalar processor: Can execute at maximum 1 instruction per cycle (IPC <=1)</li>

#### Content

- Multi-cycle Functional Units (FUs)
- Load and Store Optimizations
- Instruction Dependencies (RAW, WAW, WAR)
- Dynamic Scheduling with Scoreboard (Out of Order OoO)
- Register Renaming
- Superscalar
- A look at a real RISC-V processor: CVA6
- Pipeline Support for Precise Traps

Optional, not relevant for exam

# **Multi-Cycle Operations**

## **Integer Multiplication Instructions**

- Signed-signed Multiplication
  - Multiplying two 32bit values can result in a value of up to 64 bit
  - MUL a3,a1,a2
    - Behavior: a3 ← a1\*a2 // only the lower 32bit
  - MULH a4, a1, a2
    - Behavior: a4 ← a1\*a2 // only the higher 32bit
  - Example:
    - MULH a4,a1,a2
    - MUL a3, a1, a2 Behavior: [a4 a3] = a1\*a2 // full 64 bit
- Unsigned-unsigned multiplication MULHU
- Signed-Unsigned multiplication MULHSU

## **Integer Division Instructions**

- Signed-signed Division
  - DIV a3,a1,a2
    - Behavior: a3 ← a1 / a2
  - REM a4,a1,a2
    - Behavior: a4 ← a1 modulo a2 // remainder
- Unsigned-unsigned division DIVU, REMU

## Pipelined Functional Units (FUs)

- Complex computations require deep circuit logic
- Critical path in deep logic limits the design's frequency
- Similar to processor design, break FU into stages and integrate registers to build a pipeline
- > Latency (in cycles) equals to number of pipeline stages
- > Initialization Interval: Delay (in cycles) between start of two computations

• Example: 2-stage Multiplier



08.04.2025

MUL a0,a0,t0

MUL a1, a1, t1

MUL a2, a2, t2

## Serial Functional Units (FUs)

- Often complex operations such as divisions can be computed by iterative algorithms
- The number of iterations (required clock cycles) often depends on the input values
- These iterations can be implemented on a serial FU, which is busy as long as it computes
- > Latency equals to number of cycles required for computation
- > Initialization Interval equals to number of cycles required for computation



### Example: RISC-V CVA6 Processor

#### "Multiplier

The multiplier contains a division and multiplication unit. Multiplication is performed in two cycles and is fully pipelined (re-timing needed). The division is a simple serial divider which needs 64 cycles in the worst case."\*

\*https://docs.openhwgroup.org/projects/cva6-user-manual/03\_cva6\_design/ex\_stage.html

### Integration of Multi-cycle Functional Units

- Multi-cycle Functional Units are integrated into the EX stage
- Example only for Multiplier



# Simplified Illustration Style for Multiplexing



## Scalar Five-Stage Pipeline with Multi-cycle FUs and Forwarding

Multi-cycle Functional Units are integrated into the EX stage

Simplified diagram



## Scalar Five-Stage Pipeline with Multi-cycle FUs and Forwarding

- Multi-cycle Functional Units are integrated into the EX stage
- Further simplified diagram (PC Generation, Extend, PC+rd address not shown, but of course still needed!)



08.04.2025 Computer Systems 14

## Scalar Four-Stage Pipeline with Multi-cycle FUs with Forwarding

- The DIV and MUL do not need to make memory accesses
- Move the memory stage (MS) after the ALU (which is required for the address computation for load/store)
- Merges MS and EX stage (four stages)
- Single forwarding path required in four-stage pipeline
- Such changes need additional control in control path



## Scalar Four-Stage Pipeline with Multi-cycle FUs and Load Store Unit (LSU)

 We can add a second address computation adder (AC) to form a simple so-called load/store unit (LSU)



08.04.2025

## Execution Scheme: Four-Stage In-Order Scalar Pipeline

- The EX stage has an execution scheme defined by the processor control path
- Version 1: Static In-order Scheduling
  - ➤ Allow only one single instruction in the EX stage
  - > Data hazards: Operands are forwarded by previous instruction



## Execution Scheme: Scalar Four-Stage Pipeline with Pipelined FUs

- <u>Version 2</u>: Static In-order Scheduling exploiting Pipelined FUs
- ➤ Allow only one single instruction in EX stage
- Except for: Pipelined MUL can use Initialization Interval for two consecutive MUL (still need to check for RAW dependency between the MUL)

|              | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 | Cycle 10 |
|--------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|----------|
| ADD a1,t1,t2 | IF      | ID      | ALU     | WB      |         |         |         |         |         |          |
| MUL a2,a0,a2 |         | IF      | ID      | MUL(s1) | MUL(s2) | WB      |         |         |         |          |
| MUL a4,a1,a4 |         |         | IF      | ID      | MUL(s1) | MUL(s2) | WB      |         |         |          |
| LW t1,0(a3)  |         |         |         | IF      | ID      | stall   | AC      | DMEM    | WB      |          |
| ADDI t1,t1,4 |         |         |         |         | IF      | stall   | ID      | stall   | ALU     | WB       |

# **Load / Store Optimizations**

## **Memory System**

 The memory for more complex processors usually uses caches to allow for fast accesses

 Memory latency depends whether the data is found in the cache (cache hit/miss)

 Also instructions are loaded from caches, so also instruction fetch may require several cycles on an instruction cache miss.



#### **Instruction Cache Misses**

- Instruction cache miss causes several cycles of delay for instruction fetch (IF), depending on speed to catch fresh instruction block from memory system
- Instructions are usually reloaded to cache in blocks (cache line size) so that usually there
  are several cache hits after a cache miss (depending on jumps/branches in program)



• Advanced caches pre-fetch the next block before the cache miss happens to hide cache refill latencies.

21

#### **Load Cache Miss**

- Data cache misses lead to extra cycles for loads as the data needs to get fetched from another memory (level 2 cache, main memory)
- Example (function vec\_add, see first session): We load from two different addresses a0
  and a1 (worst case both loads lead to a data cache miss)



ID here because stall on previous instruction finished

## Example vec\_add: Loads from two different addresses (a0,a1)

Example C-Code 3

```
// vector addition of 4-element integer vectors
void vec_add(int[4] a, int[4] b, int[4] c) {
   unsigned int i;
   for (i=0;i<4;i++) {
     c[i] = a[i] + b[i];
   }
}</pre>
```

```
RISC-V Code
# base address of a: a0,
# base address of b: a1,
# base address of c: a2,
# i: t0, constant 4: t3
vec add:
 LI t0,0
               # i=0
 LI t3,4 # t3=4
vec add for:
 LW t1,0(a0) # t1 = a[i]
 LW t2,0(a1) \# t2 = b[i]
 ADD t1, t1, t2  # t1 = a[i] + b[i]
                \# c[i] = t1
  SW t1,0(a2)
 ADDI a0,a0,4 #next element is base address + 4
 ADDI a1,a1,4 #next element is base address + 4
 ADDI a2,a2,4 #next element is base address + 4
 ADDI t0,t0,1 # i++
 BLTU t0, t3, vec add for \# for (i < 4)
       # void return
 RET
```

## Nonblocking Loads (1/2)

- Load accesses are for longer times in flight due to cache misses
- Most interconnects/caches allow to overlap multiple memory accesses
- Allows to execute multiple load accesses in overlapping fashion
- Example (function vec\_Add): Cache observes both addresses for load accesses and may need to reload cache lines for both accesses when both miss.

| Data Cache Misses |         |         |         |         |         |         |         |         |         |          |  |  |
|-------------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|----------|--|--|
|                   | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 | Cycle 10 |  |  |
| LW t1,0(a0)       | IF      | ID      | AC      | DMEM    | DMEM    | DMEM    | DMEM    | WB      |         |          |  |  |
| LW t2,0(a1)       |         | IF      | ID      | AC      | DMEM    | DMEM    | DMEM    | DMEM    | WB      |          |  |  |
| ADD t1,t1,t2      |         |         | IF      | ID      | stall   | stall   | stall   | stall   | ALU     | WB       |  |  |

24

## Nonblocking Loads (2/2)

- Cache usually returns values in-order (some caches/interconnects support to return data out-of-order)
- Example (function 3): When only the first load misses, the second load still needs to wait in the LSU when the LSU returns results in-order.

|              | Data Cache Misses |         |         |         |         |         |         |         |         |          |  |  |  |  |
|--------------|-------------------|---------|---------|---------|---------|---------|---------|---------|---------|----------|--|--|--|--|
|              | Cycle 1           | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 | Cycle 10 |  |  |  |  |
| LW t1,0(a0)  | IF                | ID      | AC      | DMEM    | DMEM    | DMEM    | DMEM    | WB      |         |          |  |  |  |  |
| LW t2,0(a1)  |                   | IF      | ID      | AC      | DMEM    | DMEM    | DMEM    | DMEM    | WB      |          |  |  |  |  |
| ADD t1,t1,t2 |                   |         | IF      | ID      | stall   | stall   | stall   | stall   | ALU     | WB       |  |  |  |  |

No data cache miss, but we need to wait for first cache access to finish.

#### Store Cache Miss

- Depending on Store Policy: Write-back data cache:
  - Additional latencies for stores possible when a dirty cache line needs to be replaced.
  - Dirty cache line needs first to be written to memory before it can be replaced
- Write through data cache:
  - Long store latency because the data is written not only to cache but also to main memory.

#### Example: We store to two different addresses a0 and a1 (first store misses)

|             | Cycle 1 | Cycle 2 | Cycle 3 | Data Cac<br>Cycle 4 | che Misses<br>Cycle 5 | Cycle 6 | Cycle 9 | Cycle 10 | Cycle 11 |     |    |
|-------------|---------|---------|---------|---------------------|-----------------------|---------|---------|----------|----------|-----|----|
| SW t1,0(a0) | IF      | ID      | AC      | DMEM                | DMEM                  | DMEM    | DMEM    | WB       |          |     |    |
| SW t2,0(a1) |         | IF      | ID      | stall               | stall                 | stall   | stall   | AC       | DMEM     | WB  |    |
| LI t2,4     |         |         | IF      | stall               | stall                 | stall   | stall   | ID       | stall    | ALU | WB |

26

#### Buffers

- A buffer can store several values
- FIFO (First-in-first-out) buffer: Values can be read only from the buffer in the same order they are written to the buffer

Reorder buffer: We can look up and read any value in the buffer



#### Store Buffer

- It is not really necessary to wait until a store write completes
- Store Unit (SU) with Store Buffer:
  - > Put store address and data to store buffer (sometimes called "Posted stores")
  - > Store buffer performs memory store access (MSA) independently from pipeline

Computer Systems

> Only stall pipeline for stores when store buffer is full

- Load Unit (LU): Load more complex:
  - > need to first look whether address is in store buffer then in cache
  - riangleright or need to wait until SB is empty.

## Nonblocking Stores with Store Buffer

- Store accesses are for longer times in flight due to cache misses
- Store Buffer store accesses and pipeline continues execution
- Store Buffer writes data to memory via Memory Store Access (MSA).
- Only stall pipeline for stores when store buffer is full
- Example:

|             | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 | Cycle 10 |
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|----------|
| SW t1,0(a0) | IF      | ID      | AC      | SB      | SB      | SB      | MSA     |         |         |          |
| SW t2,0(a1) |         | IF      | ID      | AC      | SB      | SB      | SB      | MSA     |         |          |
| LI t2,4     |         |         | IF      | ID      | ALU     | WB      |         |         |         |          |

# Execution Scheme: Scalar Four-Stage Pipeline with Pipelined FUs and Load Store Optimization

- <u>Version 3</u>: Static Scheduling with pipelined FUs and Load Store Optimization
- ➤ Allow only one single instruction in EX stage
- > Except for:
  - > Pipelined MUL can use Initialization Interval for two consecutive MUL
  - > Certain number of nonblocking Loads can be in EX stage (then EX stalls)
  - ➤ Certain number of stores can be posted in the SB depending on SB size (EX stalls when SB full). When Store is posted in SB, it does not count as instruction in EX stage.

|              | Cycle<br>1 | Cycle<br>2 | Cycle<br>3 | Cycle<br>4 | Cycle<br>5 | Cycle<br>6 | Cycle<br>7 | Cycle<br>8 | Cycle<br>9 | Cycle<br>10 | Cycle<br>11 | Cycle<br>12 |
|--------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|-------------|-------------|-------------|
| ADD a2,t1,t2 | IF         | ID         | ALU        | WB         |            |            |            |            |            |             |             |             |
| MUL a2,a0,a2 |            | IF         | ID         | MUL(s1)    | MUL(s2)    | WB         |            |            |            |             |             |             |
| MUL a4,a1,a4 |            |            | IF         | ID         | MUL(s1)    | MUL(s2)    | WB         |            |            |             |             |             |
| SW a2,0(a3)  |            |            |            | IF         | ID         | stall      | AC         | SB         | SB         | MSA         |             |             |
| ADDI a3,a3,4 |            |            |            |            | IF         | stall      | ID         | ALU        | WB         |             |             |             |
| SW a2,0(a3)  |            |            |            |            |            | stall      | IF         | ID         | AC         | SB          | SB          | MSA         |

# Performance of Scalar Four-Stage Pipeline with Pipelined FUs and Load Store Optimization

- We still only allow one instruction to execute in EX stage except for some instruction types (MUL, Store, Load) in Version 3
- Multi-cycle operations cause many stalls (stiff scalar execution scheme)



- Can we interleave instructions to make better use of parallel units, maybe even just start them when they are ready, possibly out-of-order (OoO)?
- We want to exploit so-called Instruction Level Parallelism

## Challenges for Exploiting Instruction Level Parallelism

## Challenges for Exploiting Instruction Level Parallelism: Structural Hazards

- Start instructions in EX stage when FUs are available?
- Challenge: Structural Hazards, e.g. in WB Stage

|              | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 | Cycle 10         |
|--------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|------------------|
| ADD a2,t1,t2 | IF      | ID      | ALU     | WB      |         |         |         |         |         |                  |
| MUL a2,a0,a2 |         | IF      | ID      | MUL(s1) | MUL(s2) | WB      |         |         |         |                  |
| MUL a4,a1,a4 |         |         | IF      | ID      | MUL(s1) | MUL(s2) | WB      |         |         |                  |
| LW t1,0(a3)  |         |         |         | IF      | ID      | AC      | DMEM    | WB      | Two WI  | 3 in same cycle! |
| ADDI a3,a3,4 |         |         |         |         | IF      | ID      | ALU     | WB      | WB coll | ision!           |
|              |         |         |         |         |         |         |         |         | Structu | ral Hazard!      |

## Challenges for Exploiting Instruction Level Parallelism: Instruction Dependencies

- Start instructions in EX stage when FUs are available?
- Instructions can overtake each other due to different FU latencies.
- Challenge: The assembly program defines a program order for the instructions.
- Requires consideration of instruction dependencies during pipelined execution to preserve program order.

|              | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8           | Cycle 9   | Cycle 10                          |
|--------------|---------|---------|---------|---------|---------|---------|---------|-------------------|-----------|-----------------------------------|
| ADD a2,t1,t2 | IF      | ID      | ALU     | WB      |         |         | DAVA    | / al a .a a .a al |           | !                                 |
| MUL a2,a0,a2 |         | IF      | ID      | MUL     | MUL     | WB      | RAV     | v depend          | lency was | ignored (data hazard!)            |
| DIV a4,a1,a4 |         |         | IF      | ID      | DIV     | DIV     | DIV     | DIV               | WB        |                                   |
| SW a4,0(a3)  |         |         |         | IF      | ID      | AC      | MSA     |                   |           |                                   |
| ADDI a4,a3,4 |         |         |         |         | IF      | ID      | ALU     | WB                | DIV mus   | t write back result first         |
|              |         |         |         |         |         |         |         |                   |           | l Write-after-Write<br>dependency |

## **Instruction Dependencies**

A closer look at RAW, WAR and WAW!

## Types of Instruction Dependencies

- Read-after-Write (RAW): Also "True dependency"
  - Result of one instruction (write) is needed as input for another instruction (read)
  - May cause data hazards (we seen this one already)
- Write-after-Read (WAR): Also "anti-dependency"
  - A value is used (read) and then updated (write)
  - The update (write) is not allowed to overtake the use (read)
- Write-after-Write (WAW): Also "output dependency"
  - A value us updated (write) and then updated again (write)
  - The second update may not overtake the first update
  - Often created when registers are reused for different variables

Example for RAW: XOR a1,a2,a4 RAW ADD a3,a1,t1

Example for WAR: SW a1,0(a2) WAR ADDI a2,a3,4

Example for WAW:

LW **a1**,0(a2) WAW LI **a1**,a3,4

## Dep. For Example Program (vec\_add)

Example C-Code 3

```
// vector addition of 4-element integer vectors
void vec_add(int[4] a, int[4] b, int[4] c) {
  unsigned int i;
  for (i=0;i<4;i++) {
    c[i] = a[i] + b[i];
  }
}</pre>
```

```
# base address of a: a0.
# base address of b: a1,
# base address of c: a2,
# i: t0, constant 4: t3
vec add:
 LI t0,0 # i=0
 LI t3,4 # t3=4
vec add for:
 LW t1,0(a0) # t1 = a[i]
 LW t2,0(a1) \# t2 = b[i]
 ADD t1,t1,t2 \# t1 = a[i] + b[i]
 SW t1,0(a2) # c[i] = t1
 ADDI a0,a0,4 #next element is base address + 4
 ADDI a1,a1,4 #next element is base address + 4
 ADDI a2,a2,4 #next element is base address + 4
 ADDI t0,t0,1 # i++
 BLTU t0,t3,vec add for # for (i < 4)
 RET # void return
```

## Dep. For Example Program (vec\_add) (RAW)

 Mark all RAW dependencies for the following code block:

```
LI t0,0
LI t3,4
vec add for:
 LW t1,0(a0)
 LW t2,0(a1)
 ADD t1,t1,t2
SW t1,0(a2)
 ADDI a0,a0,4
ADDI a1,a1,4
 ADDI a2,a2,4
 ADDI t0,t0,1
 BLTU t0,t3,vec add for
 RET
```



## Dep. For Example Program (vec\_add) (WAR)

 Mark all WAR dependencies for the following code block:

```
LI t0,0
LI t3,4
vec add for:
 LW t1,0(a0)
 LW t2,0(a1)
 ADD t1,t1,t2
 SW t1,0(a2)
 ADDI a0,a0,4
 ADDI a1,a1,4
 ADDI a2,a2,4
 ADDI t0,t0,1
 BLTU t0,t3,vec add for
 RET
```



## Dep. For Example Program (vec\_add) (WAW)

 Mark all WAW dependencies for the following code block:

```
LI t0,0
LI t3,4
vec add for:
 LW t1,0(a0)
 LW t2,0(a1)
 ADD t1,t1,t2
SW t1,0(a2)
 ADDI a0,a0,4
ADDI a1,a1,4
 ADDI a2,a2,4
 ADDI t0,t0,1
 BLTU t0,t3,vec add for
 RET
```



## Dep. For Example Program (vec\_add) (ALL)

 Mark all dependencies for the following code block:

LI t0,0 LI t3,4 vec add for: LW t1,0(a0) LW t2,0(a1) ADD t1,t1,t2 SW t1,0(a2) ADDI a0,a0,4 ADDI a1,a1,4 ADDI a2,a2,4 ADDI t0,t0,1 BLTU t0,t3,vec add for RET



## Challenges with Interleaving Instruction Execution in EX Stage

1. We have to consider RAW, WAR and WAW dependencies.

2. Structural hazards must be avoided, e.g., FU is already busy.

Some instructions can cause so-called exceptions (e.g. memory fault on load/store)
 (See optional content for what is required for precise exceptions).

42

## **Dynamic Scheduling With Scoreboard**

Out-of-Order (OoO, O3) Pipeline

**Computer Architecture A Quantitative Approach – Section C7** 

## The CDC 6600 Project ['1964]

- First implementation of Scoreboard (Out-of-Order)
- 16 separate non-pipelined functional units (7 int, 4 Floating Point (FP), 5 memory)

 Out-of-order (OoO) execution is also called dynamic instruction scheduling



Steve Jurvetson CC BY 2.0

## The CDC 6600 Project ['1964]

#### CDC 6600 Scoreboard

- Three main components
- **►**Instruction status
- > Functional unit status
- ➤ Register result status
- For an example of use of Scoreboard in CDC 6600 see:
- Computer Architecture
   A Quantitative Approach Section C7

|         |           | Instruction status |               |                    |              |  |  |  |  |
|---------|-----------|--------------------|---------------|--------------------|--------------|--|--|--|--|
| Instruc | tion      | Issue              | Read operands | Execution complete | Write result |  |  |  |  |
| L.D     | F6,34(R2) | √                  | √             | √                  | V            |  |  |  |  |
| L.D     | F2,45(R3) | V                  | <b>√</b>      | √                  | V            |  |  |  |  |
| MUL.D   | F0,F2,F4  | V                  | <b>√</b>      | <b>√</b>           |              |  |  |  |  |
| SUB.D   | F8,F6,F2  | V                  | <b>√</b>      | <b>√</b>           | V            |  |  |  |  |
| DIV.D   | F10,F0,F6 | V                  |               |                    |              |  |  |  |  |
| ADD.D   | F6,F8,F2  | √                  | <b>V</b>      | <b>√</b>           |              |  |  |  |  |

|         | Functional unit status |      |     |    |    |       |    |    |     |  |  |
|---------|------------------------|------|-----|----|----|-------|----|----|-----|--|--|
| Name    | Busy                   | Op   | Fi  | Fj | Fk | Qj    | Qk | Rj | Rk  |  |  |
| Integer | No                     |      |     |    |    |       |    |    |     |  |  |
| Mult1   | Yes                    | Mult | F0  | F2 | F4 |       |    | No | No  |  |  |
| Mult2   | No                     |      |     |    |    |       |    |    |     |  |  |
| Add     | Yes                    | Add  | F6  | F8 | F2 |       |    | No | No  |  |  |
| Divide  | Yes                    | Div  | F10 | F0 | F6 | Mult1 |    | No | Yes |  |  |

|    | Register result status |    |    |     |    |        |     |  |     |  |
|----|------------------------|----|----|-----|----|--------|-----|--|-----|--|
|    | FO                     | F2 | F4 | F6  | F8 | F10    | F12 |  | F30 |  |
| FU | Mult 1                 |    |    | Add |    | Divide |     |  |     |  |

## Split of ID Stage

"To implement out-of-order execution, we must split the ID pipe stage into two stages:

- 1. *Issue*—Decode instructions, check for structural hazards.
- 2. Read operands—Wait until no data hazards, then read operands."

• "In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order"

-- Computer Architecture A Quantitative Approach – 5<sup>th</sup> Ed. Section C7

## Steps in Out-of-Order Execution (Scheme 1\*)

#### • 1. *Issue*

- > Functional unit is free
- ➤ No other active instruction has the same destination register (guarantee that **WAW hazards** cannot be present)
- ➤ If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.

#### • 2. Read operands

- > When source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution.
- > The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

#### • 3. Execution

> The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.

#### • 4. Write result

> Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for **WAR hazards** and stalls the completing instruction, if necessary.

-- \*Computer Architecture A Quantitative Approach - 5<sup>th</sup> Ed. Section C7

## Steps in Out-of-Order Execution (Simpler Scheme 2\*\*)



- Issue Buffer (IB) holds multiple instructions waiting to issue.
- Instruction Decode (ID) adds next instruction to IB if
  - there is space in IB and
  - the instruction does not have a WAR or WAW dependency with any instruction in IB.
- Instruction Issue (IS) can issue any instruction in IB whose
  - RAW hazards are satisfied to all previous instructions in IB
  - FU is available.
- Note: With writeback (WB) we delete the instruction from the IB, this may enable more instructions to issue as RAW dependencies are resolved.
- -- \*\*Inspired by MIT course, Daniel Sanchez http://csg.csail.mit.edu/6.823S20/Lectures/L09.pdf

## Example OoO Processor: Simple Scoreboard Data Structure

- Simplified CDC-style Scoreboard Data Structure to track execution
- For Scheme 2, One Issue Buffer
- Logical, not HW implementation

| Issue | <b>Buffer</b> ( | (IB)     |
|-------|-----------------|----------|
|       | ,               | <i>-</i> |

| Inst | ruction | rd | rs1 | rs2 | Imm | RO | Complete |
|------|---------|----|-----|-----|-----|----|----------|
|      |         |    |     |     |     |    |          |
|      |         |    |     |     |     |    |          |
|      |         |    |     |     |     |    |          |
|      |         |    |     |     |     |    |          |

Scoreboard (ScB)

**FU Status (Ready?)** 

| DIV | MUL | ALU | ADD | SU | LU |
|-----|-----|-----|-----|----|----|
|     |     |     |     |    |    |

RO: Instruction read operands (started the computation)

Complete: Instruction finished computation (in last EX stage)



## For simplicity all FUs have fixed latency:

| FU  | Latency | Initialization Interval |                        |
|-----|---------|-------------------------|------------------------|
| ALU | 1       | 1                       |                        |
| ADD | 1       | 1                       |                        |
| MUL | 2       | 1                       | Pipelined              |
| DIV | 4       | 4                       | Serial (fixed latency) |
| LSU |         |                         |                        |
| LU  | 2       | 1                       | Nonblocking            |
| SU  | 1       | 1                       | Store buffered         |

- Instruction can only be issued when FU is available.
- SU and LU share same port, cannot be issued together
- We assume instruction cannot be issued to EX same cycle it was added to IB by ID

2 Cycle 1 LW x12,8(x9) IF IS LW x13,0(x7)IF DIV x17,x13,x12 ADDI x18,x12,28 MUL x19,x12,x18 MUL x10,x17,x14 ADD x10,x10,x13  $SW \times 10,0(\times 11)$ LW x10,4(x8)

ADDI X13,x10,4

#### Issue Buffer (IB)

| Instruction | rd  | rs1 | rs2 | lmm | RO | Complete |
|-------------|-----|-----|-----|-----|----|----------|
| LW          | x12 | x9  |     | 8   |    |          |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |

#### **FU Status (Ready?)**

9

8

10

11

12

13

14

**15** 

16

18

**17** 

19

| DIV | MUL | ALU | ADD | SU | LU |
|-----|-----|-----|-----|----|----|
|     |     |     |     |    |    |

9 10 11 12 13 14 **15** Cycle 1 2 3 8 LW x12,8(x9) IS LU IF  $\Rightarrow$  LW  $\times 13,0(\times 7)$ IS DIV x17,x13,x12 IF ADDI x18,x12,28 MUL x19,x12,x18 MUL x10,x17,x14 ADD x10,x10,x13  $SW \times 10,0(\times 11)$ LW x10,4(x8)

#### Issue Buffer (IB)

|   | Instruction | rd  | rs1 | rs2 | Imm | RO | Complete |
|---|-------------|-----|-----|-----|-----|----|----------|
|   | LW          | x12 | x9  |     | 8   | Х  |          |
| 1 | LW          | x13 | x7  |     | 0   |    |          |
|   |             |     |     |     |     |    |          |
|   |             |     |     |     |     |    |          |

ADDI X13,x10,4

#### **FU Status (Ready?)**

| DIV | MUL | ALU | ADD | SU | LU     |
|-----|-----|-----|-----|----|--------|
|     |     |     |     |    | Busy 1 |

16

18

**17** 

19



#### Issue Buffer (IB)

| Instruction | rd  | rs1 | rs2 | lmm | RO | Complete |
|-------------|-----|-----|-----|-----|----|----------|
| LW          | x12 | x9  |     | 8   | X  | Х        |
| LW          | x13 | x7  |     | 0   | Х  |          |
| DIV         | x17 | x13 | x12 |     |    |          |
|             |     |     |     |     |    |          |

#### **FU Status (Ready?)**

| DIV | MUL | ALU | ADD | SU | LU     |
|-----|-----|-----|-----|----|--------|
|     |     |     |     |    | Busy 2 |

19

54



#### Issue Buffer (IB)

|   | Instruction | rd         | rs1        | rs2 | Imm | RO | Complete |
|---|-------------|------------|------------|-----|-----|----|----------|
|   | LW          | <b>x13</b> | x7         |     | 0   | Х  | х        |
|   | DIV         | x17        | <b>x13</b> | x12 |     |    |          |
| 1 | ADDI        | x18        | X12        |     | 28  |    |          |
|   |             |            |            |     |     |    |          |

ADDI X13,x10,4

#### **FU Status (Ready?)**

| DIV | MUL | ALU | ADD | SU | LU     |
|-----|-----|-----|-----|----|--------|
|     |     |     |     |    | Busy 1 |



#### Issue Buffer (IB)

| Instruction | rd  | rs1 | rs2 | Imm | RO | Complete |
|-------------|-----|-----|-----|-----|----|----------|
| DIV         | x17 | x13 | x12 |     | Х  |          |
| ADDI        | x18 | x12 |     | 28  | Х  | Х        |
| MUL         | x19 | x12 | x18 |     |    |          |
|             |     |     |     |     |    |          |

ADDI X13,x10,4

#### **FU Status (Ready?)**

| DIV  | MUL | ALU  | ADD | SU | LU |
|------|-----|------|-----|----|----|
| Busy |     | Busy |     |    |    |



#### Issue Buffer (IB)

| Instruction | rd  | rs1 | rs2 | lmm | RO | Complete |
|-------------|-----|-----|-----|-----|----|----------|
| DIV         | x17 | x13 | x12 |     | Х  |          |
| MUL         | x19 | x12 | x18 |     | Х  |          |
| MUL         | x10 | x17 | x14 |     |    |          |
|             |     |     |     |     |    |          |

#### **FU Status (Ready?)**

| DIV  | MUL      | ALU | ADD | SU | LU |
|------|----------|-----|-----|----|----|
| Busy | Busy(s1) |     |     |    |    |



#### Issue Buffer (IB)

| Instruction | rd         | rs1        | rs2 | Imm | RO | Complete |
|-------------|------------|------------|-----|-----|----|----------|
| DIV         | <b>x17</b> | x13        | x12 |     | Х  |          |
| MUL         | x19        | x12        | x18 |     | Х  | х        |
| MUL         | x10        | <b>x17</b> | x14 |     |    |          |
|             |            |            |     |     |    |          |

#### **FU Status (Ready?)**

| DIV  | MUL      | ALU | ADD | SU | LU |
|------|----------|-----|-----|----|----|
| Busy | Busy(s2) |     |     |    |    |



#### Issue Buffer (IB)

| Instruction | rd         | rs1        | rs2 | Imm | RO | Complete |
|-------------|------------|------------|-----|-----|----|----------|
| DIV         | <b>x17</b> | x13        | x12 |     | Х  | Х        |
| MUL         | x10        | <b>x17</b> | x14 |     |    |          |
|             |            |            |     |     |    |          |
|             |            |            |     |     |    |          |

#### **FU Status (Ready?)**

| DIV  | MUL | ALU | ADD | SU | LU |
|------|-----|-----|-----|----|----|
| Busy |     |     |     |    |    |



#### Issue Buffer (IB)

| Instruction | rd  | rs1 | rs2 | Imm | RO | Complete |
|-------------|-----|-----|-----|-----|----|----------|
| MUL         | x10 | x17 | x14 |     | х  |          |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |

#### **FU Status (Ready?)**

| DIV | MUL      | ALU | ADD | SU | LU |
|-----|----------|-----|-----|----|----|
|     | Busy(s1) |     |     |    |    |



#### Issue Buffer (IB)

| Instruction | rd  | rs1 | rs2 | Imm | RO | Complete |
|-------------|-----|-----|-----|-----|----|----------|
| MUL         | x10 | x17 | x14 |     | х  | x        |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |

#### **FU Status (Ready?)**

| DIV | MUL      | ALU | ADD | SU | LU |
|-----|----------|-----|-----|----|----|
|     | Busy(s2) |     |     |    |    |



#### Issue Buffer (IB)

| Instruction | rd  | rs1 | rs2 | Imm | RO | Complete |
|-------------|-----|-----|-----|-----|----|----------|
| ADD         | x10 | x10 | x13 |     |    |          |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |

#### **FU Status (Ready?)**

| DIV | MUL | ALU | ADD | SU | LU |
|-----|-----|-----|-----|----|----|
|     |     |     |     |    |    |



#### Issue Buffer (IB)

| Instruction | rd  | rs1 | rs2 | lmm | RO | Complete |
|-------------|-----|-----|-----|-----|----|----------|
| ADD         | x10 | x10 | x13 |     | X  | Х        |
| SW          |     | x11 | x10 | 0   |    |          |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |

#### **FU Status (Ready?)**

| DIV | MUL | ALU  | ADD | SU | LU |
|-----|-----|------|-----|----|----|
|     |     | busy |     |    |    |



| Instruction | rd | rs1 | rs2 | Imm | RO | Complete |
|-------------|----|-----|-----|-----|----|----------|
| SW          |    | x11 | x10 | 0   | х  | x        |
|             |    |     |     |     |    |          |
|             |    |     |     |     |    |          |
|             |    |     |     |     |    |          |

#### **FU Status (Ready?)**

| DIV | MUL | ALU | ADD | SU     | LU |
|-----|-----|-----|-----|--------|----|
|     |     |     |     | Busy 1 |    |



|    | Instruction | rd  | rs1 | rs2 | lmm | RO | Complete |
|----|-------------|-----|-----|-----|-----|----|----------|
| 11 | LW          | x10 | x18 |     | 4   |    |          |
|    |             |     |     |     |     |    |          |
|    |             |     |     |     |     |    |          |

#### FU Status (Ready?)

| DIV | MUL | ALU | ADD | SU | LU |
|-----|-----|-----|-----|----|----|
|     |     |     |     |    |    |



|    | Instruction | rd  | rs1 | rs2 | Imm | RO | Complete |
|----|-------------|-----|-----|-----|-----|----|----------|
|    | LW          | x10 | x18 |     | 4   | Х  |          |
| 11 | ADDI        | x13 | x10 |     | 4   |    |          |
|    |             |     |     |     |     |    |          |
|    |             |     |     |     |     |    |          |

#### **FU Status (Ready?)**

| DIV | MUL | ALU | ADD | SU | LU     |
|-----|-----|-----|-----|----|--------|
|     |     |     |     |    | Busy 1 |



| Instruction | rd         | rs1        | rs2 | Imm | RO | Complete |
|-------------|------------|------------|-----|-----|----|----------|
| LW          | <b>x10</b> | x18        |     | 4   | X  | x        |
| ADDI        | x13        | <b>x10</b> |     | 4   |    |          |
|             |            |            |     |     |    |          |
|             |            |            |     |     |    |          |

#### **FU Status (Ready?)**

| DIV | MUL | ALU | ADD | SU | LU     |
|-----|-----|-----|-----|----|--------|
|     |     |     |     |    | Busy 1 |



| Instruction | rd  | rs1 | rs2 | lmm | RO | Complete |
|-------------|-----|-----|-----|-----|----|----------|
| ADDI        | x13 | x10 |     | 4   | X  | х        |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |
|             |     |     |     |     |    |          |

#### **FU Status (Ready?)**

| DIV | MUL | ALU  | ADD | SU | LU |
|-----|-----|------|-----|----|----|
|     |     | busy |     |    |    |



| Instruction | rd | rs1 | rs2 | Imm | RO | Complete |
|-------------|----|-----|-----|-----|----|----------|
|             |    |     |     |     |    |          |
|             |    |     |     |     |    |          |
|             |    |     |     |     |    |          |
|             |    |     |     |     |    |          |

#### **FU Status (Ready?)**

| DIV | MUL | ALU | ADD | SU | LU |
|-----|-----|-----|-----|----|----|
|     |     |     |     |    |    |

## **Terminology**

- Processors:
- ➤ Scalar (CPI >= 1)
- Some stages can be multiissue, e.g. four WB ports
- In-order/OoO can be different for every stage.
- ➤ But: OoO usually means instructions are scheduled OoO in EX stage.



# Register Renaming

### **Out-of-Order Limitations**

- WAW and WAR limit further reordering
  - Not real dependencies
  - Artificially added: limitation of registers
- Problem with limited registers
  - Number of registers limited by ISA
  - Compiler optimizations limited
  - Especially with different execution paths
- Approach: CPU solves problem by register renaming

### Register Renaming

- Approach: Rename to microarchitecture register names
  - More microarchitecture registers than logical ISA registers
  - Entirely eliminates WAR and WAW hazards
  - Not visible to the outside world



- Introduced by Robert Tomasulo (1967)
  - Reservation stations (FU-specific IBs) before FUs store instructions and reg. names
  - Tomasulo Algorithm: Computer Architecture A Quantitative Approach 5<sup>th</sup> Ed. Chapter 3

73

#### Example: Register Renaming removes WAW, RAW stalls



We <u>do not</u> have to stall IF and IS on WAW and WAR, but RAW still makes instruction wait in IB for operands. In this example the ADD caused 4 stall cycles that are gone now but the RAW still requires it to wait.

BUT: Instructions behind ADD can execute earlier in OoO fashion.

## Simple Superscalar (Scoreboard) – Dual Fetch and Decode



## Simple Superscalar (Scoreboard) – Dual Instruction Fetch and Decode – Example



Fetching more instructions assures the issue buffer is always filled.

**BUT: Instruction Level Parallelism can limit instructions executing in parallel** 

#### Simple Superscalar (Scoreboard) – Dual Fetch, Decode and Issue



## Simple Superscalar (Scoreboard) – Dual Instruction Fetch and Decode – Example



We still observe no structural hazards. ILP limits instruction issue below 2.

# Reorder Buffer (ROB)

### **Reorder Buffer (ROB)**

- Reorder buffer: Orders the WBs and commits them in-order
- Also assures stores are committed in order with WBs (needed for precise exceptions)



Optional, not relevant for exam

## A Look at a Real Processor

CVA6

## CVA6 Pipeline Diagram: https://github.com/openhwgroup/cva6



## CVA6 Pipeline Diagram: https://github.com/openhwgroup/cva6



83

Optional, not relevant for exam

## **Pipeline Support for Precise Traps**

### Challenge with OoO Pipelines and Exceptions

- Some instructions can cause exceptions
  - Memory fault on load/store
  - Before entering exception handling all previous instructions should have committed (done their write back)
  - No instruction after the one that caused the exception should have committed (done their write back)

|             | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 | Cycle 9 | Cycle 10 |
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|----------|
| SW t1,0(a0) | IF      | IS      | AC      | SB      | SB      | SB      | MSA     |         |         |          |
| SW t2,0(a1) |         | IF      | IS      | AC      | SB      | SB      | SB      | FAULT   |         |          |
| LI t2,4     |         |         | IF      | IS      | ALU     | WB      |         |         |         |          |
|             |         |         |         |         |         |         |         |         |         |          |

LI would have committed before we observe the memory store fault exception (imprecise exception)

85

### Implementing Precise Exceptions in OoO Pipelines

#### ➤ For Precise Exception:

- > Before entering exception handling all previous instructions should have committed
- > All previous stores should have written to memory or SB should continue to write them to memory
- ➤ No instruction after the instruction that caused the exception should have committed, instead they should be deleted (killed)
- ➤ No store after the instruction that caused the exception should have written to memory from the SB, instead they should be deleted (killed) from the SB
- ➤ Scoreboard approach did not support precise exceptions
- ➤ Different approaches to implement precise exceptions: e.g. Reorder-Buffer (ROB) sorts all WB commits and makes sure store buffer only sends committed stores to memory

## **Summary**

#### Where we are

- Four-Stage Superscalar Out-of-order Processor Pipeline
  - Exploit Instruction Level Parallelism to hide extra cycles of multi-cycle FUs.
  - Scoreboard to track instruction dependencies



88

- Four Stage
- Out-of-order (OoO) pipeline
- Superscalar pipeline (Multi-Issue)

Upcoming Lecture: More on Multi-Issue Processors (targeting IPC > 1)

# Thank you for your attention!