# Ch3-ILP & its exploration

Ch3-1

- Instruction Level Parallelism
- Basic Compiler Techniques for Exposing ILP
- Dynamic scheduling--Scoreboard









## What is Instruction-Level Parallelism?

- □Instruction-level parallelism
  - > The potential overlap among instructions

#### □ Basic Block ILP is quite small

- ➤ Basic Block: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
- > average dynamic branch frequency 15% to 25% => 4 to 7 instructions execute between a pair of branches
- >Plus instructions in BB likely to depend on each other





# Recall from Pipelining Review

- ■When exploiting instruction-level parallelism, goal is to maximize CPI
- □Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
  - ➤ Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
  - >Structural hazards: HW cannot support this combination of instructions
  - <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline
  - Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)



#### How to exploit ILP?

- ☐ there are two main approaches:
  - > Hardware-based dynamic approaches
    - OUsed in server and desktop processors
    - ONot used as extensively in PMP processors
  - Compiler-based static approaches
    - ONot as successful outside of scientific applications





#### Chapter 3

- □ILP: Concepts and Challenge
- Basic compiler Techniques for exposing ILP
- Overcoming Data Hazards with Dynamic Scheduling
- Reducing Branch Costs with Dynamic Branch Prediction
- Hardware-based Speculation
- Exploiting ILP with Multiple Issue & Static Scheduling
- Exploiting ILP with Dynamic scheduling, Multiple issue, & speculation
- Advanced Techniques for instruction Delivery and speculation
- Multithreading: exploiting TLP improve uniprocessor throughput



ChC

Ch3

#### Ideas to Reduce Stalls

| Technique                               | Reduces                      | Section  |
|-----------------------------------------|------------------------------|----------|
| Forwarding and bypassing                | Potential data hazard stalls | C.2,C.3  |
| Simple branch scheduling and prediction | Control hazard stalls        | C.3      |
| Basic compiler pipeline scheduling      | Data hazard stalls           | C.2, 3.2 |
| Loop unrolling                          | Control hazard stalls        | 3.2      |
| Dynamic branch prediction               | Control stalls               | 3.3      |
| Dynamic scheduling (scoreboard)         | Data hazard stalls           | C.7      |
| Dynamic scheduling (Tomasulo)           | DH stalls from Anti and      | 3.4, 3.5 |
|                                         | output dependences           | 7        |
| Hardware-based speculation              | Control stalls               | 3.6      |
| Issuing multi-instructions per cycle    | Ideal CPI                    | 3.7      |
| Dynamic scheduling + multiple issue     | Data and control stalls      | 3.8      |
| +Speculation                            |                              |          |
| Multi-threading                         | Data parallelism             | 3.11     |
| Compiler dependence analysis, software  | Ideal CPI and data hazard    | H.2, H.3 |
| pipelining, trace schedule              | stalls                       |          |
| Hardware support for compiler           | Ideal CPI and data hazard    | H.4, H.5 |
| speculation                             | stalls, branch hazard stalls |          |



# Instruction-Level Parallelism (ILP)

- ☐ To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
- Simplest: <u>loop-level parallelism</u> to exploit parallelism among iterations of a loop
  - > Vector & GPU is one way
  - If not vector, then either dynamic via branch prediction or static via loop unrolling by compiler





#### Data Dependence & hazard

- □ Dependencies are a property of programs, presence of dependence indicates potential for a hazard,
- □Pipeline organization determines if dependence is detected and if it causes a stall, actual hazard and length of any stall is a property of the pipeline
- ■Data dependence conveys:
  - Possibility of a hazard (register & memory location)
  - Order in which results must be calculated
  - Upper bound on exploitable instruction level parallelism
- ■Dependencies that flow through memory locations are difficult to detect





### Recall: Types of data hazards

Consider two instructions, A and B. A occurs before B.



- □RAW(Read after write) true dependence
  - > Instruction A writes Rx, instruction B reads Rx
- □WAW(Write after write) output dependence
  - > Instruction A writes Rx, instruction B writes Rx
- □WAR(Write after read) anti-denpendence
  - > Instruction A reads Rx, instruction B writes Rx
- Hazards are named according to the ordering that MUST be preserved by the pipeline





#### True Data Dependence and Hazards

#### ☐ True Data Dependence:

➤Instr<sub>J</sub> is data dependent on Instr<sub>I</sub>
Instr<sub>J</sub> tries to read operand before Instr<sub>I</sub> writes it

```
I: add r1,r2,r3
J: sub r4,r1,r3
```

- ▶or Instr<sub>J</sub> is data dependent on Instr<sub>K</sub> which is dependent on Instr<sub>I</sub>
- □ Caused by a "True Dependence" (compiler term)
- □If true dependence caused a hazard in the pipeline, called a Read After Write (RAW) hazard





#### Name Dependence 1: Anti-dependence

- □Name dependence: when 2 instructions use same register or memory location, called a name, but no flow of data between the instructions associated with that name:
- □Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> reads it

I: sub r4,r1,r3 J: add r1,r2,r3

K: mul r6,r1,r7

called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"

□ If anti-dependence caused a hazard in the pipeline, called a Write After Read (WAR) hazard





#### Name Dependence 2: Output dependence

□Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> writes it.

I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7

□ Called an "output dependence" by compiler writers

This also results from the reuse of name "r1"

If anti-dependence caused a hazard in the pipeline, called a Write After Write (WAW) hazard





#### Name Dependence

- □Two instructions use the same name but no flow of information
  - Not a true data dependence, but is a problem when reordering instructions
  - Antidependence: instruction j writes a register or memory location that instruction i reads
    - Olnitial ordering (i before j) must be preserved. ADD x5, x8,x9 •MUL x8, x6, x10
  - Output dependence: instruction i and instruction j write the same register or memory location
    - Ordering must be preserved

·MUL x5, x6, x9





#### **ILP and Data Hazards**

- □HW/SW must preserve program order: order instructions would execute in if executed sequentially 1 at a time as determined by original source program
- □HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
- □Instructions involved in a name dependence can execute simultaneously if name used in instructions is changed so instructions do not conflict
  - > Register renaming resolves name dependence for registers
  - ➤ Either by compiler or by HW





#### **Control Dependencies**

Description is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

```
if p1 {
    S1;
};
if p2 {
    S2;
}
```

□S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.





#### Control Dependence Ignored

- □ Control dependence need not be preserved
  - >willing to execute instructions that should not have been executed, thereby violating the control dependences, if can do so without affecting correctness of the program
- □Instead, 2 properties critical to program correctness are exception behavior and data flow





#### Examples

- Example 1:
- add x1,x2,x3
- beq x4,x0,L
- sub x1,x1,x6
- · L: ...
- or x7,x1,x8

or instruction dependent on add and sub

- Example 2:
- add x1,x2,x3
- beq x12,x0,skip
- sub x4,x5,x6
- add x5,x4,x9
- skip:
- or x7,x8,x9

- Assume x4 isn't used after skip
  - Possible to move sub before the branch





#### **Exception Behavior**

- Preserving exception behavior
  - => any changes in instruction execution order must not change how exceptions are raised in program

(=> no new exceptions)

>Example:

DADDU R2,R3,R4 BEQZ R2,L1 LW R1,O(R2)

L1: ......

Problem with moving LW before BEQZ?





#### A short summary

- **UILP** 
  - > The potential overlap among instructions
- □ Reduce stalls from
  - >Structural hazards
  - > Data hazards
  - > Control hazards
- ☐ To keep the program correctness, we should
  - >Preserving Data flow
  - > Preserving exception behavior





#### Lecture for ILP: Software approaches

- □ Basic Compiler Technique for Exposing ILP
  - >Loop unrolling
- ☐ Static Branch Prediction
- ☐Static multiple Issue: VLIW
- Advanced Compilor Support for Exposing and Exploiting ILP
  - > Software pipelining
  - > Global Code scheduling
- □ Hardware Support for Exposing More Parallelism at compile time
  - > Conditional or Predicated instructions
  - > Compiler speculation with hardware support





## FP Loop: Where are the Hazards?

Loop: LD F0,0(R1) ;F0=vector element
ADDD F4,F0,F2 ;add scalar from F2
SD 0(R1),F4 ;store result
SUBI R1,R1,8 ;decrement pointer 8B (DW)
BNEZ R1,Loop ;branch R1!=zero
NOP ;delayed branch slot

| Instruction producing result | Instruction using result | Execution in cycles | Latency in cycles |
|------------------------------|--------------------------|---------------------|-------------------|
| FP ALU op                    | Another FP ALU op        | 4                   | 3                 |
| FP ALU op                    | Store double             | 3                   | 2                 |
| Load double                  | FP ALU op                | 1                   | 1                 |
| Load double                  | Store double             | 1                   | 0                 |
| Integer op                   | Integer op               | 1                   | 0                 |

Where are the stalls?





#### Specification for the latency

```
□ALU F1, -,-: IF ID FD FD FD WB
```

□ALU -, F1,-: IF ID s s s FD FD FD WB

□ALU: IF ID FD FD FD WB

□SW: IF ID s s EX DM

□LW F1, -: IF ID EX DM WB

□SW: F1, 8(R1): IF ID EX DM WB

MEM/WB.LDMR --→DM input port





# Reducing stalls from scheduling in BB and delayed branch

```
Loop: LD F0, O(R1)
     ADDD F4, F0, F2
     SD 0(R1), F4
     SUBI R1, R1, #8
     BNEZ R1, Loop
FDXMW
 F D \stackrel{\bullet}{S} A_1 A_2 A_3 A_4 W
    FsDsSXMW
        F s s D X M W
                   s DXM
  W
                           •23
```

```
Loop: LD F0, O(R1)
    SUBI R1, R1,#8
    ADDD F4, F0, F2
    BNEZ R1, Loop
    SD +8(R1), F4
FDXMW
 F D X M W
   FDA_1A_2A_3A_4W
    FDXMW
       FDsXMW
 6 CC F s_D_XMW
```



# Unroll Loop Four Times (straightforward way)

```
1 cycle stall
                F0,0(R1)
  Loop:LD
                                                    Rewrite loop to
                                 2 cycles stall
                F4,F0,F2
2
        ADDD
                                                      minimize stalls?
                0(R1),F4
                                 drop SUBI & BNEZ
        SD
                F6, -8(R1)
        \mathbf{L}\mathbf{D}
5
                F8, F6, F2
        ADDD
                -8(R1), F8
        SD
                                 drop SUBI & BNEZ
7
                F10, -16(R1)
        LD
8
        ADDD
                F12,F10,F2
                -16 (R1), F12
        SD
                                   ■1 cycle stall
10
                F14, -24(R1)
        LD
                                      1 cycle stall(waiting for F16
11
                F16,F14,F2
        ADDD
                                 alter to 4*8
12
                R1,R1,#32
        SUBI
13
        SD
                +8 (R1), F16
                               ■1 cycle control stall
14
                R1,LOOP
        BNEZ
15
        NOP
```

 $14 + 3 \times (1+2) + 1 + 1 + 1 = 26$  clock cycles, or 6.5 per iteration Assumes R1 is multiple of 4



## Unrolled Loop That Minimizes Stalls

| <ol> <li>Loop</li> <li>3</li> </ol> | : LD<br>LD<br>LD | F0,0(R1)<br>F6,-8(R1)<br>F10,-16(R1) | □What assumptions made when moved code? |
|-------------------------------------|------------------|--------------------------------------|-----------------------------------------|
| 4                                   | LD               | F14,-24(R1)                          | OK to move store past                   |
| 5                                   | ADDD             | F4,F0,F2                             | SUBI even though                        |
| 6                                   | ADDD             | F8,F6,F2                             | changes register                        |
| 7                                   | ADDD             | F12,F10,F2                           | > OK to move loads before               |
| 8                                   | ADDD             | F16,F14,F2                           | stores: get right data?                 |
| 9                                   | SD               | 0(R1),F4                             | > When is it safe for                   |
| 10                                  | SD               | -8(R1),F8                            |                                         |
| 11                                  | SUBI             | R1,R1,#32                            | compiler to do such                     |
| 12                                  | SD               | +16(R1),F12                          | changes?                                |
| 13                                  | BNEZ             | R1,LOOP                              |                                         |
| 14                                  | SD               | 8(R1),F16                            | ; 8-32 = -24                            |
|                                     |                  |                                      |                                         |

14 clock cycles, or 3.5 per iteration





#### Ideas to Reduce Stalls

|     | Technique                               | Reduces                      | Section  |
|-----|-----------------------------------------|------------------------------|----------|
| 4   | Forwarding and bypassing                | Potential data hazard stalls | C.2,C.3  |
| ChC | Simple branch scheduling and prediction | Control hazard stalls        | C.3      |
|     | Basic compiler pipeline scheduling      | Data hazard stalls           | C.2, 3.2 |
|     | Loop unrolling                          | Control hazard stalls        | 3.2      |
| Ch3 | Dynamic branch prediction               | Control stalls               | 3.3      |
|     | Dynamic scheduling (scoreboard)         | Data hazard stalls           | C.7      |
|     | Dynamic scheduling (Tomasulo)           | DH stalls from Anti and      | 3.4, 3.5 |
|     |                                         | output dependences           | 7 10     |
|     | Hardware-based speculation              | Control stalls               | 3.6      |
|     | Issuing multi-instructions per cycle    | Ideal CPI                    | 3.7      |
|     | Dynamic scheduling + multiple issue     | Data and control stalls      | 3.8      |
| ChH | +Speculation                            |                              |          |
|     | Multi-threading                         | Data parallelism             | 3.11     |
|     | Compiler dependence analysis, software  | Ideal CPI and data hazard    | H.2, H.3 |
|     | pipelining, trace schedule              | stalls                       | 100      |
|     | Hardware support for compiler           | Ideal CPI and data hazard    | H.4, H.5 |
|     | speculation                             | stalls, branch hazard stalls |          |



#### Why Dynamic Scheduling?

■Example1 : Data Hazard

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

Example2: Structure Hazard

DIVD F2,F2,F4

ADDD F10,F0,F8 ; FP ADDer unpipelined

ADDD F12, F0,F4

MULD F16, F14, F4

 Problem: instruction (SUBD, MULD) stalled due to irrelevent forward instructions.





#### HW Schemes: Dynamic scheduling

- □ Key idea: Allow instructions behind stall to proceed. Rearrange order of instructions to reduce stalls while maintaining data flow
- □ Enables out-of-order execution and allows out-of-order completion
- □Will distinguish when an instruction begins execution and when it completes execution; between 2 times, the instruction is in execution
- □In a dynamically scheduled pipeline, all instructions pass through issue stage in order (in-order issue)





# Adv. Of Dynamic Scheduling

- ☐ Handles cases when dependences unknown at compile time
  - > (e.g., because they may involve a memory reference)
- □ It simplifies the compiler, Compiler doesn't need to have knowledge of microarchitecture
- □Allows code that compiled for one pipeline to run efficiently on a different pipeline
- □ Hardware speculation, a technique with significant performance advantages, that builds on dynamic scheduling



# Dynamic Scheduling Step 1

- Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
- □Split the ID pipe stage of simple 5-stage pipeline into 2 stages:
- □Issue—Decode instructions, check for structural hazards
- □Read operands—Wait until no data hazards, then read operands





#### Dynamic Scheduling with a Scoreboard

#### □ Scoreboarding

- >Named after CDC6600 scoreboard
- Allowing instructions to execute out of order when there are <u>sufficient resources</u> and no data dependences.
- >In-order issue
- >Out-of order completion
- > Executing an instruction as early as possible





# Basic structure of a pipelined processor with a scoreboard







CDC6600 -First Supercomputer



ลูงโลเลสเปรเจนา





#### The pipeline stages with scoreboard

- ☐ The Five stages: IF, ID, EX, MEM, WB
  - >IF: the same for all instructions
  - ➤ID: split into two stages: issue and read operands
  - >EX: no change
  - MEM: omitted for only concentrating on the FP operations
  - >WB: no change
- □So, the stages are: IF, IS, RO, EX, WB.
  - ·Another way to lookat missing MEM?





# Pipeline supports multiple outstanding FP operations







### Scoreboard Pipeline stage description

#### □ **Issue**: a instruction is issued when

- > The functional unit is available and
- > No other active instruction has the same destination register.
- > Avoid strutural hazard and WAW hazard

#### □Read Operands (RO)

- > The read operation is delayed until both the operands are available.
- > This means that no previously issued but ncompleted instruction has the operand as its destination.
- > This resolves RAW hazards dynamically

#### □Execution (EX)

> Notify the scoreboard when completed so the functional unit can be reused.

#### ■Write result (WB)

> The scoreboard checks for WAR hazards and stalls the completing instruction if necessary.





### The scoreboard algorithm

- □ Scoreboard-takes full responsibility for instruction issue and execution
  - > Create the dependence records
  - > Decide when to fetch the operand
  - > Decide when to enter execution
  - > Decide when the result can be written into the register file
- ☐ Three data structure
  - >Instruction status:
    - Owhich of the four steps the instruction is in
  - Functional unit status: buzy,op,Fi, Fj,Fk,Qj,Qk,Rj,Rk
  - Register result status:
    - Owhich functional unit will write that register





### Example: Instruction status

LD F6, 34(R2)
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

|             | Ins | truct | ion st | tatus |
|-------------|-----|-------|--------|-------|
| Instruction | IS  | RO    | EX     | WB    |
| LD          | 1   | 1     | 1      | 1     |
| LD          | 1   | 1     | √      | 7     |
| MULTD       | 1   |       |        | -     |
| SUBD        | 1   |       |        |       |
| DIVD        | 1   |       | ÀL-    |       |
| ADDD        |     |       |        |       |
|             |     |       |        |       |



### Scoreboard Example



#### Register result status:

Clock F0 F2 F4 F6 F8 F10 F12 ... F30







#### Register result status:

Divide No





#### Register result status:





#### Register result status:





#### Register result status:





#### Register result status:





#### Register result status:



| Instructio   | n sta     | tus:          |           |            | Read  | Exec   | Write     |     |    |    |     |
|--------------|-----------|---------------|-----------|------------|-------|--------|-----------|-----|----|----|-----|
| Instructi    | on        | j             | k         | Issue      | Opran | d Comp | Result    |     |    |    |     |
| LD           | <b>F6</b> | 34+           | R2        |            |       |        | d( )      |     |    |    |     |
| LD           | <b>F2</b> | 45+           | <b>R3</b> |            | √     |        |           |     |    |    |     |
| <b>MULTD</b> | <b>F0</b> | <b>F2</b>     | <b>F4</b> | ✓          |       |        |           |     |    |    |     |
| <b>SUBD</b>  | <b>F8</b> | <b>F6</b>     | <b>F2</b> |            |       |        | 9         |     |    |    |     |
| DIVD         | F10       | $\mathbf{F0}$ | <b>F6</b> |            |       |        | T 11110   |     |    |    |     |
| ADDD         | <b>F6</b> | <b>F8</b>     | F2        |            |       |        |           |     |    |    |     |
| Function     | Unit      | Stato         | us:       |            | des   | S1     | S2        | RS  | RS |    |     |
|              | Time      | Name          | Busy      | <b>O</b> p | Fi    | Fj     | Fk        | Qj  | Qk | Rj | Rk  |
|              |           | Intege        | Yes       | load       | F2    | R3     |           | 160 |    | No |     |
|              |           | Mult1         | Yes       | Mul        | F0    | F2     | <b>F4</b> | Int |    | No | Yes |
|              |           | Mult2         | No        |            |       |        |           |     |    |    |     |
|              |           | Add           | No        |            |       |        |           |     |    |    |     |
|              |           | Divide        | No        |            |       |        |           |     |    |    |     |

#### Register result status:

Clock F0 F2 F4 F6 F8 F10 F12 ... F FU Mult1 Int M[R2+34]





#### Register result status:

Clock
0
FU Mult1 Int-47
M[R2+3-Add ZHEJIANG UNIVERSITY]



Sub

**Div** 

Yes

| nstructio       | n sta     | itus:          |                  |          | Read  | Exec     | Write     |        |        |            |      |
|-----------------|-----------|----------------|------------------|----------|-------|----------|-----------|--------|--------|------------|------|
| Instructi       | on        | $oldsymbol{j}$ | $\boldsymbol{k}$ | Issue    | Opran | d Comp   | Result    |        |        |            |      |
| LD              | <b>F6</b> | 34+            | R2               |          |       |          |           |        |        |            |      |
| LD              | <b>F2</b> | 45+            | R3               |          |       | <b>√</b> |           | Access | Data C | ache       |      |
| MULTD           | <b>F0</b> | <b>F2</b>      | <b>F4</b>        | √        |       |          |           |        |        |            |      |
| <b>SUBD</b>     | <b>F8</b> | <b>F6</b>      | <b>F2</b>        | √        |       |          |           |        |        |            |      |
| DIVD            | F10       | F0             | <b>F6</b>        | ✓        |       |          |           | 100    |        |            |      |
| ADDD            | F6        | F8             | F2               |          |       | 1111     |           |        |        |            |      |
| <b>Function</b> | Unit      | Stato          | us:              |          | des   | S1       | <i>S2</i> | RS     | RS     |            |      |
|                 | Time      | Name           | Busy             | Ор       | Fi    | Fj       | Fk        | Qj     | Qk     | Rj         | Rk   |
|                 |           | Intege         | Yes              | load     | F2    | R3       | 1011      | 1111   | 12 11  | No         |      |
| Clock cycl      | e         | Mult1          | Yes              | Mul      | F0    | F2       | F4        | Int    |        | No         | Yes  |
| counter         |           | Mult2          | No               |          |       |          |           |        |        |            |      |
|                 |           | A 1 1          | ₩ 7              | $\alpha$ | EO    | T.C      | Ea        |        | T 4    | <b>T</b> 7 | TA T |

**F8** 

F10

#### Register result status:

Add

Divide Yes

Clock
0 F0 F2 F4 F6 F8 F10 F12 ... F30
FU Mult1 Int M[R2+34 Add Div

**F6** 

F0

F2

**F6** 

Mult1

Int

Yes

No

No

Yes



# Example: Function unit status and Register status

| Name    |      | Functional unit status |     |    |    |         |         |     |     |  |  |  |  |
|---------|------|------------------------|-----|----|----|---------|---------|-----|-----|--|--|--|--|
|         | Busy | Ор                     | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk  |  |  |  |  |
| Integer | Yes  | Load                   | F2  | F3 |    |         |         | No  |     |  |  |  |  |
| Mult1   | Yes  | Mult                   | F0  | F2 | F4 | Integer |         | No  | Yes |  |  |  |  |
| Mult2   | No   |                        |     |    |    | 1-11    |         |     |     |  |  |  |  |
| Add     | Yes  | Sub                    | F8  | F6 | F2 | 1 11111 | Integer | Yes | No  |  |  |  |  |
| Divide  | Yes  | Div                    | F10 | F0 | F6 | Mult1   |         | No  | Yes |  |  |  |  |

|    | Register result status |         |    |    |     |        |         |     |     |  |  |  |
|----|------------------------|---------|----|----|-----|--------|---------|-----|-----|--|--|--|
|    | FO                     | F2      | F4 | F6 | F8  | F10    | F12     | ••• | F30 |  |  |  |
| FU | Mult1                  | Integer |    |    | Add | Divide | Water . | ••• |     |  |  |  |





| nstructio       | n sta     | itus:          |           |       | Read      | Exec       | Write     |      |    |     |     |
|-----------------|-----------|----------------|-----------|-------|-----------|------------|-----------|------|----|-----|-----|
| Instructi       | on        | $oldsymbol{j}$ | k         | Issue | Opran     | d Comp     | Result    |      |    |     |     |
| LD              | <b>F6</b> | 34+            | <b>R2</b> |       |           |            |           |      |    |     |     |
| LD              | <b>F2</b> | 45+            | <b>R3</b> |       |           |            | <b>√</b>  |      |    |     |     |
| MULTD           | F0        | <b>F2</b>      | <b>F4</b> | √     |           |            | 1         |      |    |     |     |
| <b>SUBD</b>     | <b>F8</b> | <b>F6</b>      | <b>F2</b> | √     |           |            | 37/       |      |    |     |     |
| DIVD            | F10       | <b>F0</b>      | <b>F6</b> | √     |           |            | 7 1       |      |    |     |     |
| ADDD            | <b>F6</b> | <b>F8</b>      | <b>F2</b> |       |           |            |           |      |    |     |     |
| <b>Function</b> | Unit      | t Stato        | us:       |       | des       | <b>S</b> 1 | <i>S2</i> | RS   | RS |     |     |
|                 | Time      | Name           | Busy      | Ор    | Fi        | Fj         | Fk        | Qj   | Qk | Rj  | Rk  |
|                 |           | Intege         | No        | load  | <b>F2</b> | R3         |           |      |    | No  |     |
|                 |           | M.,.141        | Voc       | М1    | ΕO        | E          | E4        | Trat |    | Voc | Voc |

Clock cycle **Yes** Yes Mult2 No counter Add Yes Sub **F8 F6** F2 Int Yes Yes Mult1 Divide Yes Div F10  $\mathbf{F0}$ **F6** No Yes

#### Register result status:

Clock
0 F0 F2 F4 F6 F8 F10 F12 ... F30
FU Mult1 1[R3+45] M[R2+34 Add Div



| n sta     | tus:                              |                                                    |                                                        | Read  | Exec   | Write |     |    |
|-----------|-----------------------------------|----------------------------------------------------|--------------------------------------------------------|-------|--------|-------|-----|----|
| on        | j                                 | $\boldsymbol{k}$                                   | Issue                                                  | Opran | d Comp | Resul | t   |    |
| <b>F6</b> | 34+                               | R2                                                 |                                                        |       |        |       |     |    |
| <b>F2</b> | 45+                               | <b>R3</b>                                          |                                                        |       |        |       |     |    |
| <b>F0</b> | <b>F2</b>                         | <b>F4</b>                                          |                                                        | √     |        |       |     |    |
| <b>F8</b> | <b>F6</b>                         | <b>F2</b>                                          |                                                        | √     |        |       |     |    |
| F10       | <b>F0</b>                         | <b>F6</b>                                          | √                                                      |       |        |       | 100 |    |
| <b>F6</b> | <b>F8</b>                         | <b>F2</b>                                          |                                                        |       |        |       |     |    |
|           |                                   |                                                    |                                                        | des   | S1     | S2    | RS  | RS |
|           | F6<br>F2<br>F0<br>F8<br>F10<br>F6 | F6 34+ F2 45+ F0 F2 F8 F6 F10 F0 F6 F8  Unit State | on j k F6 34+ R2 F2 45+ R3 F0 F2 F4 F8 F6 F2 F10 F0 F6 | on    | on     | on    | on  | on |

| Function Un | it Statous:  |      | des       | S1 | <i>S2</i> | RS    | RS |    |     |
|-------------|--------------|------|-----------|----|-----------|-------|----|----|-----|
| Tin         | ne Name Busy | Op   | Fi        | Fj | Fk        | Qj    | Qk | Rj | Rk  |
|             | Intege No    | load | F2        | R3 | 11 11 11  |       |    | No |     |
| Clock cycle | Mult1 Yes    | Mul  | F0        | F2 | F4        |       |    | No | No  |
| counter     | Mult2 No     |      |           |    |           |       |    |    |     |
| Courte      | Add Yes      | Sub  | <b>F8</b> | F6 | F2        |       |    | No | No  |
|             | Divide Yes   | Div  | F10       | F0 | F6        | Mult1 |    | No | Yes |

#### Register result status:

 Clock
 F0
 F2
 F4
 F6
 F8
 F10
 F12
 ...
 F30

 0
 FU
 Multi 4[R3+45]
 M[R2+3]
 Add
 Div
 Image: Control of the co



| <i>nstructio</i> | n sta     | tus:      |                  |       | Read  | Exec     | Write      |         |        |           |      |
|------------------|-----------|-----------|------------------|-------|-------|----------|------------|---------|--------|-----------|------|
| Instructi        | on        | j         | $\boldsymbol{k}$ | Issue | Opran | d Comp   | Resul      | t       |        |           |      |
| LD               | <b>F6</b> | 34+       | R2               |       |       |          |            |         |        |           |      |
| LD               | <b>F2</b> | 45+       | <b>R3</b>        |       |       |          |            | 1       |        |           |      |
| MULTD            | <b>F0</b> | <b>F2</b> | <b>F4</b>        |       |       | <b>√</b> |            | Assume  | Mul ta | ikes 7 cy | cles |
| <b>SUBD</b>      | <b>F8</b> | <b>F6</b> | <b>F2</b>        |       |       | <b>√</b> |            | Assume  | Sub ta | kes 3 cyc | eles |
| DIVD             | F10       | <b>F0</b> | <b>F6</b>        | √     |       |          |            |         |        |           |      |
| ADDD             | <b>F6</b> | <b>F8</b> | F2               |       |       |          | 1141       |         |        |           |      |
| <b>Function</b>  | Unit      | State     | ous:             |       | des   | S1       | <i>S</i> 2 | RS      | RS     |           |      |
|                  | Time      | Name      | Busy             | Ор    | Fi    | Fj       | Fk         | Qj      | Qk     | Rj        | Rk   |
|                  |           | Intege    | No               | load  | F2    | R3       | 7 1111     | III III |        | No        | -    |
|                  |           | Mult1     | Yes              | Mul   | F0    | F2       | F4         |         |        | No        | No   |
|                  |           | Mult2     | No               |       |       |          |            |         |        |           |      |
|                  |           | Add       | Yes              | Sub   | F8    | F6       | F2         |         |        | No        | No   |
|                  |           | Divide    | Yes              | Div   | F10   | F0       | F6         | Mult1   |        | No        | Yes  |

#### Register result status:

Clock
0 F0 F2 F4 F6 F8 F10 F12 ... F30
FU Mult1 4[R3+45] M[R2+34 Add Div ZHEJIANG UNIVERSITY



| <i>nstructio</i> | n sta     | tus:             |                  |       | Read  | Exec     | Write  |        |        |           |      |
|------------------|-----------|------------------|------------------|-------|-------|----------|--------|--------|--------|-----------|------|
| Instructi        | on        | $\boldsymbol{j}$ | $\boldsymbol{k}$ | Issue | Opran | d Comp   | Result | 4      |        |           |      |
| LD               | <b>F6</b> | 34+              | R2               |       |       |          |        |        |        |           |      |
| LD               | <b>F2</b> | 45+              | <b>R3</b>        |       |       |          |        |        |        |           |      |
| MULTD            | <b>F0</b> | <b>F2</b>        | <b>F4</b>        |       |       | <b>√</b> |        | Assume | Mul ta | ikes 7 cy | cles |
| <b>SUBD</b>      | <b>F8</b> | <b>F6</b>        | <b>F2</b>        |       |       |          | ✓      | Assume | Sub ta | kes 3 cyc | cles |
| DIVD             | F10       | F0               | <b>F6</b>        | √     |       |          |        | 100    |        |           |      |
| ADDD             | <b>F6</b> | F8               | <b>F2</b>        |       |       |          |        |        |        |           |      |
| <b>Tunction</b>  | Unit      | State            | ous:             |       | des   | S1       | S2     | RS     | RS     |           |      |
|                  | Time      | Name             | Busy             | Op    | Fi    | Fj       | Fk     | Qj     | Qk     | Rj        | Rk   |
|                  |           | Intege           | No               | load  | F2    | R3       | 1111   | 37     |        | No        |      |
|                  |           | Mult1            | Yes              | Mul   | F0    | F2       | F4     |        |        | No        | No   |
|                  |           | Mult2            | No               |       |       |          |        |        |        |           |      |
|                  |           | Add              | No               | Sub   | F8    | F6       | F2     |        |        | No        | No   |
|                  |           | Divide           | Yes              | Div   | F10   | F0       | F6     | Mult1  |        | No        | Yes  |

#### Register result status:

Clock
0 F0 F2 F4 F6 F8 F10 F12 ... F30
FU Mult1 4[R3+45] M[R2+34 V- Div



| <i>nstructio</i> | n sta     | tus:             |                  |          | Read      | Exec      | Write     |           |        |            |      |
|------------------|-----------|------------------|------------------|----------|-----------|-----------|-----------|-----------|--------|------------|------|
| Instruction      | on        | $\boldsymbol{j}$ | $\boldsymbol{k}$ | Issue    | Opran     | d Comp    | Resul     | <u>'t</u> |        |            |      |
| LD               | <b>F6</b> | 34+              | R2               |          |           |           |           |           |        |            |      |
| LD               | <b>F2</b> | 45+              | <b>R3</b>        |          |           |           |           |           |        |            |      |
| MULTD            | F0        | <b>F2</b>        | <b>F4</b>        |          |           | <b>√</b>  |           | Assume    | Mul ta | akes 7 cyc | cles |
| <b>SUBD</b>      | <b>F8</b> | <b>F6</b>        | <b>F2</b>        |          |           |           |           |           |        |            |      |
| DIVD             | F10       | FO               | <b>F6</b>        | √        |           |           |           | 350       |        |            |      |
| ADDD             | F6        | <b>F8</b>        | <b>F2</b>        | <b>√</b> |           |           |           |           |        |            |      |
| Function         | Unit      | State            | ous:             |          | des       | S1        | S2        | RS        | RS     |            |      |
|                  | Time      | Name             | Busy             | Op       | Fi        | Fj        | Fk        | Qj        | Qk     | Rj         | Rk   |
|                  |           | Intege           | No               | load     | F2        | R3        | 1776      |           |        | No         |      |
|                  |           | Mult1            | Yes              | Mul      | F0        | F2        | F4        |           |        | No         | No   |
|                  |           | Mult2            | No               |          |           |           |           |           |        |            |      |
|                  |           | Add              | Yes              | ADD      | <b>F6</b> | <b>F8</b> | <b>F2</b> |           |        | Yes        | Yes  |
|                  |           | Divide           | Yes              | Div      | F10       | F0        | F6        | Mult1     |        | No         | Yes  |

#### Register result status:

 Clock
 F0
 F2
 F4
 F6
 F8
 F10
 F12
 ...
 F30

 0
 FU
 Mult1 4[R3+45]
 Add
 V Div
 ...
 F30



| Instructio  | n sta     | tus:             |                  |       | Read  | Exec        | Write  |        |        |           |      |
|-------------|-----------|------------------|------------------|-------|-------|-------------|--------|--------|--------|-----------|------|
| Instructi   | on        | $\boldsymbol{j}$ | $\boldsymbol{k}$ | Issue | Opran | d Comp      | Result | t      |        |           |      |
| LD          | F6        | 34+              | R2               |       |       | J. Promoter |        |        |        |           |      |
| LD          | <b>F2</b> | 45+              | <b>R3</b>        |       |       |             |        |        |        |           |      |
| MULTD       | <b>F0</b> | <b>F2</b>        | <b>F4</b>        |       |       | √           |        | Assume | Mul ta | ikes 7 cy | cles |
| <b>SUBD</b> | <b>F8</b> | <b>F6</b>        | <b>F2</b>        |       |       |             |        |        |        |           |      |
| DIVD        | F10       | F0               | <b>F6</b>        | √     |       |             |        |        |        |           |      |
| ADDD        | F6        | <b>F8</b>        | <b>F2</b>        |       | √     | 1111        |        |        |        |           |      |
| Function    | Unit      | State            | ous:             |       | des   | S1          | S2     | RS     | RS     |           |      |
|             | Time      | Name             | Busy             | Op    | Fi    | Fj          | Fk     | Qj     | Qk     | Rj        | Rk   |
|             |           | Intege           | No               | load  | F2    | R3          | 11111  |        | 12 11  | No        |      |
|             |           | Mult1            | Yes              | Mul   | F0    | F2          | F4     |        |        | No        | No   |
|             |           | Mult2            | No               |       |       |             |        |        |        |           |      |
|             |           | Add              | Yes              | ADD   | F6    | <b>F8</b>   | F2     |        |        | No        | No   |
|             |           | Divide           | Yes              | Div   | F10   | F0          | F6     | Mult1  |        | No        | Yes  |

#### Register result status:

 Clock
 F0
 F2
 F4
 F6
 F8
 F10
 F12
 ...
 F30

 0
 FU
 Mult1 4[R3+45]
 Add
 V Div
 ...
 F30



| <i>nstructio</i> | n sta     | tus:             |                  |       | Read  | Exec     | Write |           |          |            |     |
|------------------|-----------|------------------|------------------|-------|-------|----------|-------|-----------|----------|------------|-----|
| Instructi        | on        | $\boldsymbol{j}$ | $\boldsymbol{k}$ | Issue | Opran | d Comp   | Resul | <u>lt</u> |          |            |     |
| LD               | <b>F6</b> | 34+              | R2               |       |       |          |       |           |          |            |     |
| LD               | <b>F2</b> | 45+              | <b>R3</b>        |       |       |          |       |           |          |            |     |
| MULTD            | <b>F0</b> | <b>F2</b>        | <b>F4</b>        |       |       | <b>√</b> |       | Last cy   | cle of M | <b>[ul</b> |     |
| <b>SUBD</b>      | <b>F8</b> | <b>F6</b>        | <b>F2</b>        |       |       |          |       |           |          |            |     |
| DIVD             | F10       | FO               | <b>F6</b>        | √     |       |          |       |           |          |            |     |
| ADDD             | F6        | <b>F8</b>        | F2               |       |       | <b>√</b> |       |           |          |            |     |
| Function         | Unit      | State            | ous:             |       | des   | S1       | S2    | RS        | RS       |            |     |
|                  | Time      | Name             | Busy             | Op    | Fi    | Fj       | Fk    | Qj        | Qk       | Rj         | Rk  |
|                  |           | Intege           | No               | load  | F2    | R3       |       |           |          | No         |     |
|                  |           | Mult1            | Yes              | Mul   | F0    | F2       | F4    |           |          | No         | No  |
|                  |           | Mult2            | No               |       |       |          |       |           |          |            |     |
|                  |           | Add              | Yes              | ADD   | F6    | F8       | F2    |           |          | No         | No  |
|                  |           | Divide           | Yes              | Div   | F10   | F0       | F6    | Mult1     |          | No         | Yes |

#### Register result status:

 Clock
 F0
 F2
 F4
 F6
 F8
 F10
 F12
 ...
 F30

 0
 FU
 Mult1 4[R3+45]
 Add
 V Div
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...</



| nstructio    | n sta     | tus:      |           |       | Read  | Exec   | Write     |    |   |
|--------------|-----------|-----------|-----------|-------|-------|--------|-----------|----|---|
| Instruction  | on        | j         | k         | Issue | Opran | d Comp | Result    |    |   |
| LD           | <b>F6</b> | 34+       | R2        |       |       |        |           |    |   |
| LD           | <b>F2</b> | 45+       | <b>R3</b> |       |       |        |           |    |   |
| <b>MULTD</b> | <b>F0</b> | <b>F2</b> | <b>F4</b> |       |       |        | ✓         |    |   |
| <b>SUBD</b>  | <b>F8</b> | <b>F6</b> | <b>F2</b> |       |       |        |           |    |   |
| DIVD         | F10       | F0        | <b>F6</b> | √     |       |        | 7 1       |    |   |
| ADDD         | <b>F6</b> | <b>F8</b> | F2        |       |       | √      |           |    |   |
| unction      | Unit      | Stato     | us:       |       | des   | S1     | <i>S2</i> | RS | R |

| Function Unit Statous: |      | des       | S1 | <i>S2</i> | RS    | RS |     |     |
|------------------------|------|-----------|----|-----------|-------|----|-----|-----|
| Time Name Busy         | Op   | Fi        | Fj | Fk        | Qj    | Qk | Rj  | Rk  |
| Intege No              | load | F2        | R3 |           |       |    | No  |     |
| Mult1 No               | Mul  | F0        | F2 | F4        |       |    | No  | No  |
| Mult2 No               |      |           |    |           |       |    |     |     |
| Add Yes                | ADD  | <b>F6</b> | F8 | F2        |       |    | No  | No  |
| Divide Yes             | Div  | F10       | F0 | F6        | Mult1 |    | Yes | Yes |

#### Register result status:

| Clock |    | F0 | <b>F2</b> | <b>F4</b> | <i>F6</i> | F8 | F10 | <i>F12</i> | ••• | F30 |
|-------|----|----|-----------|-----------|-----------|----|-----|------------|-----|-----|
| 0     | FU | V* | 4[R3+45   | 5]        | Add       | V- | Div |            |     | 1 1 |



### **Examples: Dynamic Scheduling**

# Example 1: RAW fdiv.d f0,f2,f4 fadd.d f10,f0,f8 fsub.d f12,f8,f14

fsub.d is not dependent, issue before fadd.d

#### □Example 2:

fdiv.d f0,f2,f4 fmul.d f6,<u>f0</u>,f8 fadd.d <u>f0</u>,f10,f14

fadd.d is not dependent, but the antidependence makes it impossible to issue earlier without register renaming

```
□Example 1: structural hazard fdiv.d f2,f2,f4 Fadd.d f10,f0, f8 ;unpipelined ;Adder
```

Fadd.d f12, f0,f4 fMul.d f16, f14, f4

fMul.d is not dependent, issue before fadd.d

#### □Example 2:

fdiv.d f2,f2,f4 fdiv.d f6,<u>f0</u>,f8 fadd.d <u>f0</u>,f10,f14

fadd.d is not dependent, but the antidependence makes it impossible to issue earlier without register renaming



### Limitations of Scoreboard-1

#### **UILP**

If we can't find independent instructions to execute, scoreboard (or any dynamic scheduling scheme for that matter) helps very little.

### □ Size of the "issued" queue

- This determines how far ahead the CPU can look for instructions to execute in parallel.
- It's called the window.
- For now, we assume that a window can not span a branch.
- In other words, the window includes instructions only within basic blocks.





### Limitations of Scoreboard-2

- □Number, types, and speed of the functional units
  - This determines how often a structural hazard results in stall.
- ☐ The presence of anti-dependences and output dependences
  - WAR and WAW hazards limit the scoreboard more than RAW hazards, lead to WAR and WAW stalls.
  - > RAW hazards are problems for any technique.
  - > But WAR and WAW hazards can be solved in ways other than scoreboards.





### Register Renaming

### ----solve WAW, WAR

☐ Example

fdiv.d f0,f2,f4

fadd.d f10,f0,f8

fsd f10,0(x1)

fsub.d f12,f10,f14

fmul.d f14, f2, f16 ;f14 WAR

After Reg Rename
fdiv.d f0,f2,f4
fmul.d S,f0,f8
fsd S,0(x1)
fsub.d T,S,f14
fmul.d U, f2, f16

□ Though multiplier can finish execution far earlier before sub, but it can write result into f14 due to WAR dependence on sub.

Now only RAW hazards remain, which can be strictly ordered. WAR disappear.

