# CENG 3420 Computer Organization & Design

# Lecture 11: Pipeline – Basis

Bei Yu CSE Department, CUHK byu@cse.cuhk.edu.hk

(Textbook: Chapters 4.5 & 4.6)

Spring 2022

# Single Cycle Disadvantages



- Single cycle: the whole datapath is finished in one clock cycle
- It is simple and easy to understand
- Uses the clock cycle inefficiently the clock cycle must be timed to accommodate the slowest instr
- Problematic for more complex instructions like floating point multiply
- May be wasteful of area since some functional units (e.g., adders) must be duplicated since they can not be shared during a clock cycle



# Single Cycle Disadvantages



- Though simple, the single cycle approach is not used because it is very slow
- Clock cycle must have the same length for every instruction
- What is the longest path (slowest instruction)? Load instruction!
- It is too long for the store instruction so the last part of the cycle here is wasted.



### **EX**: Instruction Critical Paths

Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires) except:

- Instruction memory (IM) and data memory (DM) (4 ns)
- ALU and adders (ALU) (2 ns)
- Register File access (Reg) (1 ns)

| Instr.   | IM | Reg | ALU | DM | Reg | Total |
|----------|----|-----|-----|----|-----|-------|
| R/I-type |    |     |     |    |     |       |
| lw       |    |     |     |    |     |       |
| SW       |    |     |     |    |     |       |
| beq      |    |     |     |    |     |       |
| <br>jal  |    |     |     |    |     |       |
| jalr     |    |     |     |    |     |       |



### **EX**: Instruction Critical Paths

Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires) except:

- Instruction memory (IM) and data memory (DM) (4 ns)
- ALU and adders (ALU) (2 ns)
- Register File access (Reg) (1 ns)

| Instr.      | IM | Reg | ALU | DM | Reg | Total |
|-------------|----|-----|-----|----|-----|-------|
| R/I-type    | 4  | 1   | 2   |    | 1   | 8     |
| lw          | 4  | 1   | 2   | 4  | 1   | 12    |
| SW          | 4  | 1   | 2   | 4  |     | 11    |
| beq         | 4  | 1   | 2   |    |     | 7     |
| jal<br>jalr |    |     |     |    |     |       |



# Solution:

| Instr.   | IM | Reg | ALU | DM | Reg | Total |
|----------|----|-----|-----|----|-----|-------|
| R/I-type | 4  | 1   | 2   |    | 1   | 8     |
| lw       | 4  | 1   | 2   | 4  | 1   | 12    |
| SW       | 4  | 1   | 2   | 4  |     | 11    |
| beq      | 4  | 1   | 2   |    |     | 7     |
| jal      | 4  |     | 2   |    | 1   | 7     |
| jalr     | 4  | 1   | 2   |    | 1   | 8     |

### How Can We Make It Faster?



### $CPU time = CPI \times CC \times IC$

- Start fetching and executing the next instruction before the current one has completed
  - Pipelining (all?) modern processors are pipelined for performance
  - Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
  - A five stage pipeline is nearly five times faster because the CC is "nearly" five times faster
- Fetch (and execute) more than one instruction at a time
  - Superscalar processing stay tuned



### Note:

- CC: clock cycle;
- IC: instruction count
- In reality the time per instruction in a pipeline processor is longer than the minimum possible because
  - the pipeline stages may not be perfectly balanced
  - 2 pipelining involves some overhead (like pipeline stage isolation registers).

## The Five Stages of Load Instruction





- IF: Instruction Fetch and Update PC
- ID: Registers Fetch and Instruction Decode
- EXE: Execute R-type; calculate memory address
- MEM: Read/write the data from/to the Data Memory
- WB: Write the result data into the register file

# **Equivalent Definitions**



In this course, we treat the following definitions equivalent:



### A Pipelined RISC-V Processor



Start the next instruction before the current one has completed

- improves throughput total amount of work done in a given time
- instruction latency (execution time, delay time, response time time from the start of an instruction to its completion) is not reduced



- 1 clock cycle (pipeline stage time) is limited by the slowest stage
- ② for some stages don't need the whole clock cycle (e.g., WB)
- 6 for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)





# Single Cycle versus Pipeline





- To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
- How long does each take to complete 1,000,000 adds?



### Example:

If IF=100ps, ID=100ps, EXE=200ps, MEM=200ps, WB=100ps

- In single cycle setting, cycle-length=700ps
- In pipeline setting, each cycle length=200ps, so finish one instr will take 5 stages (ie.  $5 \times 200$ ps).

### Pipelining the RISC-V ISA



### What makes it easy

- all instructions are the same length (32 bits)
  - can fetch in the 1st stage and decode in the 2nd stage
- few instruction formats (three) with symmetry across formats
  - can begin reading register file in 2nd stage
- memory operations occur only in loads and stores
  - can use the execute stage to calculate memory addresses
- each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
- operands must be aligned in memory so a single data transfer takes only one data memory access

### RISC-V Pipeline Datapath Additions/Mods



State registers between each pipeline stage to isolate them



# RISC-V Pipeline Control Path Modifications



### All control signals can be determined during Decode

• and held in the state registers between pipeline stages



# Pipeline Control



- IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
- ID Stage: no optional control signals to set

|     | EX Stage   |            |            |            | М    | EM Sta      | ige          | WB Stage     |              |
|-----|------------|------------|------------|------------|------|-------------|--------------|--------------|--------------|
|     | Reg<br>Dst | ALU<br>Op1 | ALU<br>Op0 | ALU<br>Src | Brch | Mem<br>Read | Mem<br>Write | Reg<br>Write | Mem<br>toReg |
| R   | 1          | 1          | 0          | 0          | 0    | 0           | 0            | 1            | 0            |
| lw  | 0          | 0          | 0          | 1          | 0    | 1           | 0            | 1            | 1            |
| SW  | Х          | 0          | 0          | 1          | 0    | 0           | 1            | 0            | Х            |
| beq | Х          | 0          | 1          | 0          | 1    | 0           | 0            | 0            | Х            |

# Graphically Representing RISC-V Pipeline





### Can help with answering questions like:

- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?

# Other Pipeline Structures Are Possible



What about the (slow) multiply operation?

- Make the clock twice as slow or ...
- let it take two cycles (since it doesn't use the DM stage)



What if the data memory access is twice as slow as the instruction memory?

- make the clock twice as slow or ...
- let data memory access take two cycles (and keep the same clock rate)



# Other Sample Pipeline Alternatives



• ARM7:



• XScale:



## Why Pipeline? For Performance!





# Can Pipelining Get Us Into Trouble?



### Yes! Pipeline Hazards

- structural hazards: a required resource is busy
- data hazards: attempt to use data before it is ready
- control hazards: deciding on control action depends on previous instruction

### Can usually resolve hazards by waiting

- pipeline control must detect the hazard
- and take action to **resolve** hazards

### Structure Hazards



- Conflict for use of a resource
- In RISC-V pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch requires instruction access
- Hence, pipeline datapaths require separate instruction/data memories
  - Or separate instruction/data caches
- Since Register File

### Resolve Structural Hazard 1





Fix with separate instr and data memories (I\$ and D\$)



