# Computer Systems II

Li Lu

Room 605, CaoGuangbiao Building li.lu@zju.edu.cn

https://person.zju.edu.cn/lynnluli







## **Pipeline Performance**

- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

| Inst   | Inst fetch | Register read | ALU op | Memory access | Register write | Total time |
|--------|------------|---------------|--------|---------------|----------------|------------|
| lw     | 200ps      | 100 ps        | 200ps  | 200ps         | 100 ps         | 800ps      |
| sw     | 200ps      | 100 ps        | 200ps  | 200ps         |                | 700ps      |
| R-type | 200ps      | 100 ps        | 200ps  |               | 100 ps         | 600ps      |
| beq    | 200ps      | 100 ps        | 200ps  |               |                | 500ps      |



## **Pipeline Performance**

Single-cycle (T<sub>c</sub>= 800ps)

Pipelined (T<sub>c</sub>= 200ps)



## **How Pipelining Improves Performance?**

Decreasing the execution time of an individual instruction ×

Increasing instruction throughput **V** 









## **Pipeline Hazards**

• Structural Hazard A required resource is busy

Data Hazards

- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write

Control Hazards

Flow of execution depends on previous instruction

## **Structural Hazard**

A required resource is busy

Example: Consider the situation while the pipeline only has a single memory



Question: Can the four instructions execute correctly?

## **Structural Hazard**

A required resource is busy

**Solution:** 

Use Instruction and data memory simultaneously

Example: Consider the situation while the pipeline only has a single memory



## **How to Deal with Structural Hazard?**

Problem: Two or more instructions in the pipeline compete for access to a single physical resource

 Solution 1: Instructions take it in turns to use resource, some instructions have to stall

Solution 2: Add more hardware to machine

Can always solve a structural hazard by adding more hardware



## How to Stall?

#### **NOP** instruction

#### **ADDI x0, x0, 0**

| 31        | 20 19 | 15 14 | 12 11 | 7 6 |        | 0 |
|-----------|-------|-------|-------|-----|--------|---|
| imm[11:0] | rs1   | fı    | ınct3 | rd  | opcode |   |
| 12        | 5     | •     | 3     | 5   | 7      |   |
| 0         | 0     | A     | ADDI  | 0   | OP-IMM |   |



- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write

#### Problem: Instruction depends on result from previous





- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write

#### **Problem: Instruction depends on result from previous**





- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write

#### **Problem: Instruction depends on result from previous**



- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write

# Solution "forwarding": Adding extra hardware to retrieve the missing item early from the internal resources





- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write

#### Solution "forwarding": Could not avoid all pipeline stalls





- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write

#### **Example: Reordering code to avoid pipeline stalls**

#### Consider the following code segment in C:

```
a = b + e;

c = b + f;
```

#### The generated RISC-V code:

Id x1, 0(x31) // Load b
Id x2, 8(x31) // Load e
add x3, x1, x2 // b + e
sd x3, 24(x31) // Store a
Id x4, 16(x31) // Load f
add x5, x1, x4 // b + f
sd x5, 32(x31) // Store c

- Assuming all variables are in memory
- and are addressable as offsets from x31

Question: Find the data hazard and reorder RISC-V code

- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write

#### **Example: Reordering code to avoid pipeline stalls**

#### Consider the following code segment in C:

```
a = b + e;

c = b + f;
```

#### The generated RISC-V code:

```
Id x1, 0(x31) // Load b
Id x2, 8(x31) // Load e
add x3, x1, x2 // b + e
sd x3, 24(x31) // Store a
Id x4, 16(x31) // Load f
add x5, x1, x4 // b + f
sd x5, 32(x31) // Store c
```

- Assuming all variables are in memory
- and are addressable as offsets from x31



- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write

#### **Example: Reordering code to avoid pipeline stalls**

#### Consider the following code segment in C:

```
a = b + e;
c = b + f;
```

- Assuming all variables are in memory
- and are addressable as offsets from x31

#### The generated RISC-V code:

```
1d x1, 0(x31)
1d x1, 0(x31) // Load b
                          eliminates
                                         1d x2, 8(x31)
ld x2, 8(x31) // Load e
                          both hazards
                                         ld x4, 16(x31)
add x3, x1, x2 // b + e
                                         add x3, x1, x2
sd x3, 24(x31) // Store a
                                         x3, 24(x31)
ld x4, 16(x31) // Load f
                                         add x5, x1, x4
add x5, x1, x4 // b + f
                                         x5, 32(x31)
sd x5, 32(x31) // Store c
```



Flow of execution depends on previous instruction

#### **Problem: The conditional branch instruction**





Flow of execution depends on previous instruction

#### **Problem: The conditional branch instruction**





Flow of execution depends on previous instruction

#### **Solution 1: Stall**



- calculate the branch address
- update the PC



Flow of execution depends on previous instruction

#### **Solution 2: Prediction (simple version)**

Predicts always that conditional branches will be untaken

branch is not taken

branch is taken





Flow of execution depends on previous instruction

**Solution 2: Prediction (sophisticated version)** 

Predicts that some conditional branches will be untaken

#### **Example: While the Conditional Branches at the bottom of loops**



they are likely to be taken to the top of the loops



Flow of execution depends on previous instruction

#### **Solution 2: Prediction (dynamic prediction)**

Predicts that some conditional branches will be untaken





Flow of execution depends on previous instruction

#### **Solution 3: Delayed Decision**

Places an instruction immediately after the delayed branch instruction that is not affected by the branch





Flow of execution depends on previous instruction

#### **Solution 3: Delayed Decision**

Places an instruction immediately after the delayed branch instruction that is not affected by the branch





## **Summary of Pipelining Design**

- Pipeline is a technique that exploits parallelism between the instructions in a sequential instruction stream
- Pipeline increases the number of simultaneously executing instructions and the rate at which instructions are started and completed
- Pipeline designers need to cope with structural, control, and data hazards
- Pipeline is fundamentally invisible to the programmer



## **Pipelining Design and Implementation**

RISC-V ISA designed for pipelining





#### §3.2.0 An Implementation of Pipelining

| Stage | Any instruction                                                                                                                                                |                                                                                                            |                                                |  |  |  |  |
|-------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|------------------------------------------------|--|--|--|--|
| ĪF    | <pre>IF/ID.IR←Mem[PC] IF/ID.NPC,PC←(if((EX/MEM.opcode == branch) &amp; EX/MEM.cond){EX/MEM. ALUOutput} else {PC+4});</pre>                                     |                                                                                                            |                                                |  |  |  |  |
| ID    | <pre>ID/EX.A←Regs[IF/ID.IR[rs1]]; ID/EX.B←Regs[IF/ID.IR[rs2]]; ID/EX.NPC←IF/ID.NPC; ID/EX.IR←IF/ID.IR; ID/EX.Imm←sign-extend(IF/ID.IR[immediate field]);</pre> |                                                                                                            |                                                |  |  |  |  |
|       | ALU instruction                                                                                                                                                | Load instruction                                                                                           | Branch instruction                             |  |  |  |  |
| EX    | EX/MEM.IR ← ID/EX.IR;<br>EX/MEM.ALUOutput ←<br>ID/EX.A func ID/EX.B;<br>or<br>EX/MEM.ALUOutput ←<br>ID/EX.A op ID/EX.Imm;                                      | EX/MEM.IR to ID/EX.IR<br>EX/MEM.ALUOutput←<br>ID/EX.A+ID/EX.Imm;                                           | EX/MEM.ALUOutput← ID/EX.NPC + (ID/EX.Imm<< 2); |  |  |  |  |
|       | ,                                                                                                                                                              | EX/MEM.B←ID/EX.B;                                                                                          | <pre>EX/MEM.cond ← (ID/EX.A == ID/EX.B);</pre> |  |  |  |  |
| MEM   | MEM/WB.IR←EX/MEM.IR;<br>MEM/WB.ALUOutput←<br>EX/MEM.ALUOutput;                                                                                                 | MEM/WB.IR←EX/MEM.IR;<br>MEM/WB.LMD←<br>Mem[EX/MEM.ALUOutput];<br>or<br>Mem[EX/MEM.ALUOutput]←<br>EX/MEM.B; |                                                |  |  |  |  |
| WB    | Regs[MEM/WB.IR[rd]] ← MEM/WB.ALUOutput;                                                                                                                        | For load only:<br>Regs[MEM/WB.IR[rd]] ←<br>MEM/WB.LMD;                                                     |                                                |  |  |  |  |



# Pipelined Datapath



# Let's find two exceptions to this left-to-right flow of instructions



# Let's find two exceptions to this left-to-right flow of instructions



# Question: The impacts of the two exceptions





# Question: The impacts of the two exceptions



# Single-Cycle Datapath to the Pipelined Version











Three instructions need three datapaths



Let's add registers to share single datapaths during instruction execution



#### **Pipelined Version of the Datapath**



## **Understand How Pipelined Datapath Work**

Through Load Instruction and Store Instruction



## **Load Instruction**



### IF Stage of Load Instruction



### **ID Stage of Load Instruction**



### **EX Stage of Load Instruction**



### **MEM Stage of Load Instruction**



### WB Stage of Load Instruction

