### Pipelined CPU Design

Now, we will optimize a single cycle CPU using pipelining. Pipelining is a powerful logic design method to reduce the clock time and improve the throughput, even though it increases the latency of an individual task and adds additional logic. In a pipelined CPU, multiple instructions are overlapped in execution. This is a good example of parallelism, which is one of the great ideas in computer architecture. To obtain a pipelined CPU, we will take the following steps.

# **Step1: Pipeline Registers**

Pipelining starts from adding pipelining registers by dividing a large combinational logic. We have already chopped a single cycle CPU into five stages, and thus, will add pipeline registers between two stages. We kindly added pipeline registers for the datapath. Note that the write address for the register file should be passed to the write back stage through the pipeline registers so that the instruction writes the result to the correct register. However, we miss the pipeline registers for the control signals.

Q1. Add the pipeline registers for the control signals and connect them to the datapath. Purple objects and lines in the diagram

### **Step2: Performance Analysis**

A great advantage of pipelining is the performance improvement with a shorter clock time. We will use the same timing parameters as those in the previous discussion.

| Element   | Register<br>clk-to-q | Register<br>Setup | MUX       | ALU          | Mem<br>Read      | Mem<br>Write      | RegFile<br>Read | RegFile<br>Setup |
|-----------|----------------------|-------------------|-----------|--------------|------------------|-------------------|-----------------|------------------|
| Parameter | $t_{\rm clk-to-q}$   | $t_{ m setup}$    | $t_{mux}$ | $t_{ m ALU}$ | $t_{ m MEMread}$ | $t_{ m MEMwrite}$ | $t_{RFread}$    | $T_{RFsetup}$    |
| Delay(ps) | 30                   | 20                | 25        | 200          | 250              | 200               | 150             | 20               |

# Q1. What was the clock time and frequency of a single cycle CPU?

$$\begin{split} t_{clk,single} > &= t_{PC,\ clk-to-q} + t_{IMEMread} + t_{RFread} + t_{ALU} + t_{DMEMread} + t_{mux} + t_{RFsetup} \\ &= 30 + 250 + 150 + 200 + 250 + 25 + 20 = 925\ ps \\ f_{clk,single} = 1/t_{clk,pipe} <= 1/\ (925\ ps) = 1.08\ GHz \end{split}$$

# Q2. What is the clock time and frequency of a pipelined CPU?

$$t_{\text{clk-to-q}} + t_{\text{MEMread}} + t_{\text{setup}} \text{ (Fetch)}$$

$$t_{\text{clk-to-q}} + t_{\text{RFread}} + t_{\text{setup}} \text{ (Decode)}$$

$$t_{\text{clk-to-q}} + t_{\text{ALU}} + t_{\text{mux}} + t_{\text{setup}} \text{ (Execute)}$$

$$t_{\text{clk-to-q}} + t_{\text{MEMread}} + t_{\text{setup}} \text{ (Memory)}$$

$$t_{\text{clk-to-q}} + t_{\text{mux}} + t_{\text{RFsetup}} \text{ (Writeback)}$$

 $f_{clk,pipe} = 1/t_{clk,pipe} \le 1/(300 \text{ ps}) = 3.33 \text{ GHz}$ 

#### Q3. What is the speed-up? Why is it less than five?

Speed-up =  $t_{clk,pipe} / t_{clk,single} = f_{clk,pipe} / f_{clk,single} = 3.08$ .

This is because pipeline stages are not balanced evenly and there is overhead from pipeline registers (t<sub>clk-to-q</sub>, t<sub>setup</sub>). Moreover, this does not include the delays from the additional logic for hazard resolution.

## Step3: Pipeline Hazard

The performance improvement comes at a cost. Pipelining introduces pipeline hazards we have to overcome.

#### Structural Hazard

Structural hazards occur when more than one instruction use the same resource at the same time.

- **Register File:** One instruction reads from the register file while another writes to it. We can solve this by having separate read and write ports and writing to the register file at the falling edge of the clock.
- **Memory:** The memory is accessed not only for the instruction but also for the data. Separate caches for instructions and data solve this hazard.

# **Data Hazard and Forwarding**

Data hazards occur due to data dependencies among instructions. Forwarding can solve many data hazards.

Q1. Spot the data dependencies in the code below and figure out how forwarding can resolve data hazards.

| <b>Clock Cycles</b>               | C0 | <b>C1</b> | <b>C2</b> | С3  | <b>C4</b>   | <b>C5</b> | <b>C6</b> |
|-----------------------------------|----|-----------|-----------|-----|-------------|-----------|-----------|
| addi <mark>\$t0</mark> , \$s0, -1 | IF | ID        | EX        | MEM | WB          |           |           |
| and \$s2, \$t0, \$a0              |    | IF        | ID        |     | MEM         | WB        |           |
| sw \$s0, 100( <mark>\$t0</mark> ) |    |           | IF        | ID  | <b>V</b> EX | MEM       | WB        |

The and and sw instructions need the values of \$t0 for their execution stages. Fortunately, these values are ready before their execution stages. Thus, we can forward it from the pipeline registers before writing it to the register file.

Q2. Provide the inputs to the forwarding unit.

rsE, rtE, WriteAddrM, WriteAddrW, RegWrM, RegWrW

Q3. Write the condition for each forwarding signal.

FwdAM (Forwarding ALUOutM to A)

```
= WriteWrM & (rsE != 0) & (rsE == WriteAddrM)
```

FwdBM (Forwarding ALUTOutM to B)

```
= WriteWrM & (rtE != 0) & (rtE == WriteAddrM)
```

FwdAW (Forwarding WriteDataW to A)

= WriteWrW & (rsE != 0) & (rsE == WriteAddrW)

FwdBW (Forwarding WriteDataW to B)

= WriteWrW & (rtE != 0) & (rtE == WriteAddrW)

Q4. Implement the forwarding logic.

Green objects and lines in the diagram.

#### **Data Hazard and Stalls**

Forwarding cannot solve all data hazards. We need to stall the pipeline in some cases.

Q1. Spot the data dependencies in the code below and figure out why forwarding cannot resolve this hazard.

| Clock Cycles         | CO | <b>C1</b> | C2 | <b>C3</b> | <b>C4</b> | <b>C5</b> |
|----------------------|----|-----------|----|-----------|-----------|-----------|
| lw \$t0, 20(\$s0)    | IF | ID        | EX | MEM       | WB        |           |
| add \$t1, \$t0, \$t0 |    | IF        | ID | EX        | MEM       | WB        |

The add instruction needs the value of \$t0 in the beginning of C3, but it is ready at the end of C3.

Q2. Now we stall the pipeline one cycle and insert nop after the lw instruction. Figure out how this can resolve the hazard.

| Clock Cycles         | CO | <b>C1</b> | <b>C2</b> | <b>C3</b> | <b>C4</b> | <b>C</b> 5 | C6 |
|----------------------|----|-----------|-----------|-----------|-----------|------------|----|
| lw \$t0, 20(\$s0)    | IF | ID        | EX        | MEM       | WB        |            |    |
| nop                  |    | IF        | ID        | Bub       |           | Bub        |    |
| add \$t1, \$t0, \$t0 |    |           | IF        | ID        | EX        | MEM        | WB |

By stalling one cycle, the add instruction can start its execution stage after the \$t0 value is ready.

#### Q3. Provide the inputs to the hazard detection logic.

First, we assume that MemToReg is 1 only if the instruction is 1w. (No don't care terms for MemToReg) Then, provide rs, rt, rtE, MemToRegE to the logic.

### Q4. Write the condition of the stall signal.

# Q5. Implement the stalling logic.

Red objects and lines in the diagram. Note that we add a nop by zeroing control signals

#### **Control Hazard and Prediction**

Control hazards occur due to jumps and branches. We may solve them by stalling the pipeline. However, it is painful since the branch condition is calculated after the execution stage and the pipeline is stalled for three cycles. Instead, we predict branches are not taken and flush the pipeline if they are actually taken.

Q1. Assume that the branch is actually taken for the beq instruction. Figure out how the pipeline flush can resolve the control hazard.

| PC     | Clock Cycles          | <b>C1</b> | C2 | С3    | <b>C4</b> | <b>C</b> 5 | C6  | С7  | C8  | С9 |
|--------|-----------------------|-----------|----|-------|-----------|------------|-----|-----|-----|----|
| 0x2000 | beq \$t0, \$s0, 0x10  | IF        | ID | EX    | MEM       | WB         |     |     |     |    |
| 0x2004 | add \$s2, \$t0, \$a0  |           | IF | ID    | EX 3      | Bub        | Bub |     |     |    |
| 0x2008 | sw \$s0, 100(\$t0)    |           |    | IF fl | ushiD     | Bub        | Bub | Bub |     |    |
| 0x2010 | xor \$s2, \$0, \$0    |           |    |       | IF 🌂      | Bub        | Bub | Bub | Bub |    |
| 0x2040 | add, \$s2, \$t0, \$a1 |           |    |       |           | IF         | ID  | EX  | MEM | WB |

When a taken branch is detected, the incorrect stream of instructions is nullified by inserting bubbles to pipeline registers and fetch the correct instruction.

### Q2. Implement the pipeline flush.

Blue objects and lines in the diagram. We use the PCSrc signal to flush the pipeline.

