# 1 Pipelining Registers

In order to pipeline, we add registers between the five datapath stages. Label each of the five stages (IF, ID, EX, MEM, and WB) on the diagram below.



### 1.1 What is the purpose of the new registers?

When we pipeline the datapath, the values from each stage need to be passed on to the next stage at each clock cycle. Each stage in the pipeline operates on only a small set of values, but those values must be correct with respect to the instruction currently in that stage. Take load word (lw) as an example: when it is in the EX stage, the EX stage should look like a snapshot of the single-cycle datapath. The rs1, rs2, immediate, and PC values should be exactly what they would be if lw were the only instruction in the datapath. The same holds for the control logic: the instruction is passed along to each stage, the appropriate control signals are generated for that stage, and the stage can then execute properly.

1.2 Why do we add +4 to the PC again in the memory stage?

We add 4 to the PC again in the memory stage so that we don't need to pass both PC and PC+4 along the whole pipeline.

1.3 Why do we need to save the instruction in a register multiple times?

We need to save the instruction in a register multiple times because each pipeline stage needs to receive the right control signals for the instruction currently in that stage.
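As a rough illustration of how pipeline registers carry each instruction from stage to stage, here is a minimal Python sketch. The names `pipeline_diagram` and `latches` are hypothetical, and the model ignores hazards entirely: every instruction simply advances one stage per clock edge.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(program):
    """Return one dict per cycle mapping stage name -> instruction text (or None)."""
    latches = [None] * len(STAGES)  # latches[i] holds the instruction in STAGES[i]
    pending = list(program)
    cycles = []
    while True:
        # On each clock edge, fetch the next instruction (if any) and
        # shift every in-flight instruction one stage forward.
        fetched = pending.pop(0) if pending else None
        latches = [fetched] + latches[:-1]
        if not any(latches):
            break  # pipeline has drained
        cycles.append(dict(zip(STAGES, latches)))
    return cycles

program = ["addi t0, a0, -1", "and s2, t0, a0", "sltiu a0, t0, 5"]
for number, cycle in enumerate(pipeline_diagram(program), start=1):
    print(f"C{number}", cycle)
```

Because each stage sees only the instruction stored in its own latch, the control signals for that stage can be derived from that saved copy alone.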

# 2 Performance Analysis

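| Element           | Delay  |
|-------------------|--------|
| Register clk-to-q | 30 ps  |
| Register setup    | 20 ps  |
| Mux               | 25 ps  |
| Branch comp.      | 75 ps  |
| ALU               | 200 ps |
| Memory read       | 250 ps |
| Memory write      | 200 ps |
| RegFile read      | 150 ps |
| RegFile setup     | 20 ps  |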

2.1 With the delays provided above for each of the datapath components, what would be the fastest possible clock time for a single cycle datapath?

$$t_{\rm clk} \ge t_{\rm PC~clk-to-q} + t_{\rm IMEM~read} + t_{\rm RF~read} + t_{\rm mux} + t_{\rm ALU} + t_{\rm DMEM~read} + t_{\rm mux} + t_{\rm RF~setup}$$
 
$$\ge 30 + 250 + 150 + 25 + 200 + 250 + 25 + 20$$
 
$$\ge 950~{\rm ps}$$

$$\frac{1}{950~\mathrm{ps}} = 1.05~\mathrm{GHz}$$
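The sum above can be double-checked with a short Python sketch. The dictionary keys and function names are made up for illustration; the delay values are the ones given in this section, and the path is the lw path used in the equation.

```python
# Component delays in picoseconds, copied from the delays given above.
DELAY_PS = {
    "clk_to_q": 30,
    "reg_setup": 20,
    "mux": 25,
    "branch_comp": 75,
    "alu": 200,
    "mem_read": 250,
    "mem_write": 200,
    "rf_read": 150,
    "rf_setup": 20,
}

# Components on the lw critical path, in order (the mux appears twice).
LW_PATH = ["clk_to_q", "mem_read", "rf_read", "mux",
           "alu", "mem_read", "mux", "rf_setup"]

def path_delay_ps(path):
    """Sum the delays of the named components along one path."""
    return sum(DELAY_PS[name] for name in path)

def max_frequency_ghz(period_ps):
    """Convert a clock period in ps to a frequency in GHz (1/ps = 1000 GHz)."""
    return 1000.0 / period_ps

single_cycle_ps = path_delay_ps(LW_PATH)
print(single_cycle_ps, round(max_frequency_ghz(single_cycle_ps), 2))  # 950 1.05
```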

2.2 What is the fastest possible clock time for a pipelined datapath?

The clock period is set by the slowest stage, so take the maximum over each stage's path:

**IF**:  $t_{\text{PC clk-to-q}} + t_{\text{IMEM read}} + t_{\text{Reg setup}} = 30 + 250 + 20 = 300 \text{ ps}$

**ID**:  $t_{\text{Reg clk-to-q}} + t_{\text{RF read}} + t_{\text{Reg setup}} = 30 + 150 + 20 = 200 \text{ ps}$

**EX**:  $t_{\text{Reg clk-to-q}} + t_{\text{mux}} + t_{\text{ALU}} + t_{\text{Reg setup}} = 30 + 25 + 200 + 20 = 275 \text{ ps}$

**MEM**:  $t_{\text{Reg clk-to-q}} + t_{\text{DMEM read}} + t_{\text{mux}} + t_{\text{Reg setup}} = 30 + 250 + 25 + 20 = 325 \text{ ps}$

**WB**:  $t_{\text{Reg clk-to-q}} + t_{\text{RF setup}} = 30 + 20 = 50 \text{ ps}$

$$\max(\text{IF}, \text{ID}, \text{EX}, \text{MEM}, \text{WB}) = 325 \text{ ps}$$

NOTE: For the **EX** stage, the branch comparator time is overshadowed by the ALU computation:

Branch comparator:  $t_{\text{Reg clk-to-q}} + t_{\text{Branch comp.}} = 30 + 75 = 105 \text{ ps}$

ALU computation:  $t_{\text{Reg clk-to-q}} + t_{\text{mux}} + t_{\text{ALU}} + t_{\text{Reg setup}} = 30 + 25 + 200 + 20 = 275 \text{ ps}$
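The per-stage sums can be checked with a small Python sketch. The constant and dictionary names are illustrative only; the values are the delays given at the start of this section.

```python
# Delays in picoseconds, taken from the delays given for this problem.
CLK_TO_Q, REG_SETUP, MUX, ALU = 30, 20, 25, 200
MEM_READ, RF_READ, RF_SETUP = 250, 150, 20

# Each stage starts at a register's clk-to-q and ends at the next setup.
STAGE_PS = {
    "IF": CLK_TO_Q + MEM_READ + REG_SETUP,
    "ID": CLK_TO_Q + RF_READ + REG_SETUP,
    "EX": CLK_TO_Q + MUX + ALU + REG_SETUP,
    "MEM": CLK_TO_Q + MEM_READ + MUX + REG_SETUP,
    "WB": CLK_TO_Q + RF_SETUP,
}

# The clock must accommodate the slowest stage.
slowest = max(STAGE_PS, key=STAGE_PS.get)
print(STAGE_PS)
print(slowest, STAGE_PS[slowest])  # MEM 325
```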

2.3 What is the speedup from the single cycle datapath to the pipelined datapath? Why is the speedup less than 5?

 $\frac{950 \text{ ps}}{325 \text{ ps}} \approx 2.9$, or about a 2.9 times speedup. The speedup is less than 5 because of (1) the necessity of adding pipeline registers, which have clk-to-q and setup times, and (2) the need to set the clock to the maximum of the five stages, which take different amounts of time.

Note: because of hazards, which require additional logic to resolve, the actual speedup would likely be even less than 2.9 times.
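To see how the two factors eat into the ideal 5x, the following Python sketch compares the actual speedup with two hypothetical cases that are not part of the problem: five perfectly balanced 190 ps stages with free registers, and the same balanced stages each paying 30 ps clk-to-q plus 20 ps setup.

```python
def speedup(single_cycle_ps, pipelined_ps):
    """Ratio of clock periods, ignoring hazards and pipeline fill time."""
    return single_cycle_ps / pipelined_ps

SINGLE_CYCLE_PS = 950
REGISTER_OVERHEAD_PS = 30 + 20  # clk-to-q + setup added per stage

actual = speedup(SINGLE_CYCLE_PS, 325)
# Hypothetical: perfectly balanced stages, zero-cost pipeline registers.
ideal = speedup(SINGLE_CYCLE_PS, SINGLE_CYCLE_PS / 5)
# Hypothetical: balanced stages, but each pays the register overhead.
balanced = speedup(SINGLE_CYCLE_PS, SINGLE_CYCLE_PS / 5 + REGISTER_OVERHEAD_PS)

print(round(actual, 2), round(balanced, 2), round(ideal, 2))  # 2.92 3.96 5.0
```

The gap between 5.0 and 3.96 comes from register overhead alone, and the further drop to 2.92 comes from the unbalanced stages.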

# 3 Hazards

One of the costs of pipelining is that it introduces three types of pipeline hazards: structural hazards, data hazards, and control hazards.

### Structural Hazards

Structural hazards occur when more than one instruction needs to use the same datapath resource at the same time. There are two main causes of structural hazards:

**Register File**: The register file is accessed both during ID, when it is read, and during WB, when it is written to. We can solve this by having separate read and write ports. To account for reads and writes to the same register, processors usually write to the register during the first half of the clock cycle and read from it during the second half. This is also known as double pumping.

**Memory**: Memory is accessed for both instructions and data. Having a separate instruction memory (abbreviated IMEM) and data memory (abbreviated DMEM) solves this hazard.

Something to remember about structural hazards is that they can always be resolved by adding more hardware.
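The write-then-read ordering of double pumping can be mimicked in a toy Python model. The `RegFile` class and its `cycle` method are hypothetical names for this sketch; it models only the ordering within one clock cycle, not real timing.

```python
class RegFile:
    """Toy 32-entry register file; register 0 is hard-wired to zero."""

    def __init__(self):
        self.regs = [0] * 32

    def cycle(self, write=None, reads=()):
        """Model one clock cycle: write in the first half, read in the second.

        write is a (rd, value) pair or None; reads is a tuple of register numbers.
        """
        if write is not None:
            rd, value = write
            if rd != 0:  # writes to x0 are discarded
                self.regs[rd] = value
        return tuple(self.regs[r] for r in reads)

rf = RegFile()
# WB of an older instruction writes x5 while ID of a younger one reads x5.
print(rf.cycle(write=(5, 42), reads=(5, 0)))  # (42, 0)
```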

### Data Hazards

Data hazards are caused by data dependencies between instructions. In CS 61C, where we always assume that instructions go through the processor in order, we see data hazards when an instruction **reads** a register before a previous instruction has finished **writing** to that register.

#### **Forwarding**

Most data hazards can be resolved by forwarding, which is when the result of the EX or MEM stage is sent to the EX stage for a following instruction to use.


3.1 Look for data hazards in the code below, and figure out how forwarding could be used to solve them.

| Instruction                | C1 | C2 | C3 | C4  | C5  | C6  | C7 |
|----------------------------|----|----|----|-----|-----|-----|----|
| 1. addi t0, a0, -1         | IF | ID | EX | MEM | WB  |     |    |
| 2. and s2, t0, a0          |    | IF | ID | EX  | MEM | WB  |    |
| 3. sltiu a0, t0, 5         |    |    | IF | ID  | EX  | MEM | WB |


There are two data hazards, both from t0: one between instructions 1 and 2, and one between instructions 1 and 3. The first could be resolved by forwarding the result of the EX stage in C3 to the beginning of the EX stage in C4, and the second could be resolved by forwarding the result of the EX stage in C3 to the beginning of the EX stage in C5.
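Finding these read-after-write dependencies can be automated. The Python sketch below is one possible approach, with instructions hand-encoded as (text, destination, sources) tuples; the `raw_hazards` function and its `window` parameter are hypothetical, and the default of 3 is an assumption about how far a hazard can reach in a five-stage pipeline.

```python
# Each instruction is (text, destination register, source registers).
PROGRAM = [
    ("addi t0, a0, -1", "t0", ("a0",)),
    ("and s2, t0, a0", "s2", ("t0", "a0")),
    ("sltiu a0, t0, 5", "a0", ("t0",)),
]

def raw_hazards(program, window=3):
    """Return (producer, consumer, register) triples, numbered from 1,
    where the consumer reads a register that a producer at most
    `window` instructions earlier writes."""
    found = []
    for i, (_, rd, _) in enumerate(program):
        if rd is None or rd == "x0":
            continue  # nothing is written, so nothing can depend on it
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if rd in program[j][2]:
                found.append((i + 1, j + 1, rd))
            if program[j][1] == rd:
                break  # a later write to rd supersedes this producer
    return found

print(raw_hazards(PROGRAM))  # [(1, 2, 't0'), (1, 3, 't0')]
```

Note that instruction 3 writing a0 after instructions 1 and 2 read it is not reported: with in-order execution, only reads that follow a write are hazards.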

3.2 Imagine you are a hardware designer working on a CPU's forwarding control logic. How many instructions after the first addi instruction above could be affected by a potential data hazard created by this addi instruction?

Three instructions. For example, with the addi instruction, any instruction that uses t0 and has its ID stage in C3, C4, or C5 will not see the updated value, because addi does not write t0 back until C5. (Side note: how is this implemented in hardware? We add 2 wires: one from the beginning of the MEM stage for the output of the ALU and one from the beginning of the WB stage. Both of these wires will connect to the A mux in the EX stage.)

3.3 You have the signals rs1, rs2, RegWEn, and rd for two instructions, instruction n and instruction n+1. Write a condition you can check to see if there is a data hazard between the two instructions, in terms of these signals.

```
if ((rs1(n + 1) == rd(n) || rs2(n + 1) == rd(n)) && RegWEn(n) == 1) { forward ALU output of instruction n }
```
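The same check can be written as a small Python predicate, which makes the grouping explicit: the register match is evaluated first, then gated by RegWEn. The function name is hypothetical, and the extra `rd_prev != 0` test is an addition that is not part of the condition above; it reflects the fact that x0 is hard-wired to zero in RISC-V, so a write to it never needs forwarding.

```python
def needs_forwarding(rs1_next, rs2_next, rd_prev, reg_wen_prev):
    """True if instruction n+1 reads a register that instruction n writes.

    Arguments are register numbers (0 to 31) and the RegWEn bit of instruction n.
    """
    reads_rd = rd_prev in (rs1_next, rs2_next)
    # Assumption beyond the written condition: writes to x0 are ignored.
    return bool(reg_wen_prev) and rd_prev != 0 and reads_rd

# addi t0, a0, -1 followed by and s2, t0, a0 (t0 = x5, a0 = x10)
print(needs_forwarding(5, 10, 5, 1))  # True
# Same register fields, but instruction n does not write the RegFile
print(needs_forwarding(5, 10, 5, 0))  # False
```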

### Stalls

3.4 Look for data hazards in the code below. One of them cannot be solved with forwarding—why? What can we do to solve this hazard?

| Instruction       | C1 | C2 | C3 | C4  | C5  | C6  | C7  | C8 |
|-------------------|----|----|----|-----|-----|-----|-----|----|
| 1. addi s0, s0, 1 | IF | ID | EX | MEM | WB  |     |     |    |
| 2. addi t0, t0, 4 |    | IF | ID | EX  | MEM | WB  |     |    |
| 3. lw t1, 0(t0)   |    |    | IF | ID  | EX  | MEM | WB  |    |
| 4. add t2, t1, x0 |    |    |    | IF  | ID  | EX  | MEM | WB |

There are two data hazards in the code. The first hazard is between instructions 2 and 3, from t0, and the second is between instructions 3 and 4, from t1. The hazard between instructions 2 and 3 can be resolved with forwarding, but the hazard between instructions 3 and 4 cannot. This is because even with forwarding, instruction 4 needs the result of instruction 3 at the beginning of C6 (its EX stage), and the loaded value won't be ready until the end of C6 (the MEM stage of the lw).

We can fix this by inserting a nop (no-operation) between instructions 3 and 4.
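A load-use check of this kind could be sketched in Python as follows. The tuple encoding, the `LOADS` set, and the `insert_load_use_nops` function are hypothetical; the sketch assumes forwarding handles every other data hazard, so only a load followed immediately by a consumer needs a nop.

```python
LOADS = {"lb", "lh", "lw", "lbu", "lhu"}  # RV32I load mnemonics
NOP = ("nop", None, ())

def insert_load_use_nops(program):
    """Insert a nop after any load whose result the very next instruction reads.

    Each instruction is (text, destination register, source registers).
    """
    out = []
    for instr in program:
        if out:
            prev_text, prev_rd, _ = out[-1]
            is_load = prev_text.split()[0] in LOADS
            if is_load and prev_rd in instr[2]:
                out.append(NOP)
        out.append(instr)
    return out

PROGRAM = [
    ("addi s0, s0, 1", "s0", ("s0",)),
    ("addi t0, t0, 4", "t0", ("t0",)),
    ("lw t1, 0(t0)", "t1", ("t0",)),
    ("add t2, t1, x0", "t2", ("t1", "x0")),
]
for text, _, _ in insert_load_use_nops(PROGRAM):
    print(text)
```

For the code in this question, the only nop lands between the lw and the add.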

3.5 Say you are the compiler and can re-order instructions to minimize data hazards while guaranteeing the same output. How can you fix the code above?

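Reorder the instructions as 2-3-1-4. Instruction 1 has no dependencies on the other instructions, so placing it between the lw and the add gives the load an extra cycle to complete and removes the need for a nop.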
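One way to gain confidence that a reordering preserves the program's output is to run both orders on a tiny interpreter. The Python sketch below supports only the three operations used here, and the initial register and memory values are arbitrary choices for illustration; it is a spot check on one starting state, not a proof.

```python
def run(program, regs, mem):
    """Execute (op, rd, a, b) tuples on copies of regs and mem; return final regs."""
    regs, mem = dict(regs), dict(mem)
    for op, rd, a, b in program:
        if op == "addi":    # rd = regs[a] + immediate b
            regs[rd] = regs[a] + b
        elif op == "add":   # rd = regs[a] + regs[b]
            regs[rd] = regs[a] + regs[b]
        elif op == "lw":    # rd = mem[regs[a] + offset b]
            regs[rd] = mem[regs[a] + b]
        else:
            raise ValueError(f"unsupported op: {op}")
    return regs

I1 = ("addi", "s0", "s0", 1)
I2 = ("addi", "t0", "t0", 4)
I3 = ("lw", "t1", "t0", 0)
I4 = ("add", "t2", "t1", "x0")

# Arbitrary starting state chosen for this example.
regs = {"x0": 0, "s0": 7, "t0": 96, "t1": 0, "t2": 0}
mem = {100: 1234}

original = run([I1, I2, I3, I4], regs, mem)
reordered = run([I2, I3, I1, I4], regs, mem)
print(original == reordered)  # True
```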

### Control Hazards

Control hazards are caused by **jump and branch instructions**, because for all jumps and some branches, the next PC is not PC + 4, but the result of the computation completed in the EX stage. We could stall the pipeline for control hazards, but this decreases performance.

3.6 Besides stalling, what can we do to resolve control hazards?

We can predict which way branches will go, and when this prediction is incorrect, "flush" the pipeline and continue with the correct instruction. (The most naive prediction method is to simply predict that branches are always not taken.)
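A back-of-the-envelope comparison of the two strategies can be sketched in Python. The function names are hypothetical, and the penalty of 2 cycles is an assumption: it supposes the branch outcome is known at the end of EX, so the two instructions fetched behind the branch are the ones discarded or delayed.

```python
def flush_cost(outcomes, flush_penalty=2):
    """Cycles lost with always-not-taken prediction.

    outcomes holds one boolean per executed branch, True if it was taken.
    The prediction is wrong exactly when the branch is taken.
    """
    return flush_penalty * sum(1 for taken in outcomes if taken)

def stall_cost(outcomes, stall_cycles=2):
    """Cycles lost if every branch stalls the pipeline until it resolves."""
    return stall_cycles * len(outcomes)

# Made-up branch history for illustration: 2 taken out of 5.
outcomes = [True, False, False, True, False]
print(flush_cost(outcomes), stall_cost(outcomes))  # 4 10
```

Prediction only pays the penalty on mispredicted branches, whereas stalling pays it on every branch.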

# Extra for Experience

3.7 Given the RISC-V code below and a pipelined CPU with no forwarding, how many hazards would there be? What type is each hazard? Consider all possible hazards from all pairs of instructions.

How many stalls would there need to be in order to fix the data hazard(s)? What about the control hazard(s)?

| Instruction       | C1 | C2 | C3 | C4  | C5     | C6   | C7  | C8  | C9 |
|-------------------|----|----|----|-----|--------|------|-----|-----|----|
| 1. sub t1, s0, s1 | IF | ID | EX | MEM | WB     |      |     |     |    |
| 2. or s0, t0, t1  |    | IF | ID | EX  | MEM    | WB   |     |     |    |
| 3. sw s1, 100(s0) |    |    | IF | ID  | EX     | MEM  | WB  |     |    |
| 4. bgeu s0, s2, 1 |    |    |    | IF  | ID     | EX   | MEM | WB  |    |
| 5. add t2, x0, x0 |    |    |    |     | IF     | ID   | EX  | MEM | WB |

There are four hazards: between instructions 1 and 2 (data hazard from t1), instructions 2 and 3 (data hazard from s0), instructions 2 and 4 (data hazard from s0), and instructions 4 and 5 (a control hazard).

Assuming that we can read and write to the RegFile on the same cycle, two stalls are needed between instructions 1 and 2, and two stalls are needed between instructions 2 and 3. No stalls are needed for the control hazard, because it can be handled with branch prediction/flushing the pipeline.
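The stall counts can be reproduced with a small Python sketch. It assumes no forwarding, in-order issue, and a register file that can be written and read in the same cycle, so a consumer may decode in the same cycle its producer is in WB. The tuple encoding and the `count_stalls` function are hypothetical, and control hazards are not modeled.

```python
# Each instruction is (text, destination register or None, source registers).
PROGRAM = [
    ("sub t1, s0, s1", "t1", ("s0", "s1")),
    ("or s0, t0, t1", "s0", ("t0", "t1")),
    ("sw s1, 100(s0)", None, ("s1", "s0")),
    ("bgeu s0, s2, 1", None, ("s0", "s2")),
    ("add t2, x0, x0", "t2", ("x0", "x0")),
]

def count_stalls(program):
    """Return the number of stall cycles inserted before each instruction."""
    wb_cycle = {}  # register -> cycle in which its latest writer is in WB
    prev_id = 1    # chosen so that the first instruction decodes in cycle 2
    stalls = []
    for _, rd, srcs in program:
        earliest = prev_id + 1  # ID cycle if nothing stalls
        ready = max([wb_cycle.get(r, 0) for r in srcs] + [earliest])
        stalls.append(ready - earliest)
        prev_id = ready
        if rd is not None and rd != "x0":
            wb_cycle[rd] = ready + 3  # ID, then EX, MEM, WB
    return stalls

print(count_stalls(PROGRAM))  # [0, 2, 2, 0, 0]
```

The hazard between instructions 2 and 4 needs no stalls of its own, because the stalls inserted for instruction 3 already delay instruction 4 long enough.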