# CS:APP Chapter 4 Computer Architecture Pipelined Implementation Part I

Randal E. Bryant`

# Carnegie Mellon University

http://csapp.cs.cmu.edu

-2-

CS:APP

# **Real-World Pipelines: Car Washes**

Sequential



**Pipelined** 



**Parallel** 

CS:APP



#### Idea

- Divide process into independent stages
- Move objects through stages in sequence
- At any given times, multiple objects being processed

### **Overview**

#### **General Principles of Pipelining**

- Goal
- Difficulties

#### **Creating a Pipelined Y86 Processor**

- Rearranging SEQ
- Inserting pipeline registers
- Problems with data and control hazards

# **Computational Example**



#### **System**

- Computation requires total of 300 picoseconds
- Additional 20 picoseconds to save result in register
- Can must have clock cycle of at least 320 ps

-3- CS:APP -4- CS:APP

# **3-Way Pipelined Version**



#### **System**

- Divide combinational logic into 3 blocks of 100 ps each
- Can begin new operation as soon as previous one passes through stage A.
  - Begin new operation every 120 ps
- Overall latency increases
  - 360 ps from start to finish

# **Pipeline Diagrams**

#### Unpipelined



■ Cannot start new operation until previous one completes

#### 3-Way Pipelined



■ Up to 3 operations in process simultaneously

-5- CS:APP -6- CS:APP

# **Operating a Pipeline**





# **Limitations: Nonuniform Delays**



- Throughput limited by slowest stage
- Other stages sit idle for much of the time
- Challenging to partition system into balanced stages

# **Limitations: Register Overhead**



- As try to deepen pipeline, overhead of loading registers becomes more significant
- Percentage of clock cycle spent loading register:

1-stage pipeline: 6.25%
3-stage pipeline: 16.67%
6-stage pipeline: 28.57%

 High speeds of modern processor designs obtained through very deep pipelining Combinational logic g

Clock

OP1
OP2
OP3

**Data Dependencies** 

Time

#### **System**

CS:APP

■ Each operation depends on result from preceding one

- 10 - CS:APP

### **Data Hazards**

**-9-**



- Result does not feed back around in time for next operation
- Pipelining has changed behavior of system

# **Data Dependencies in Processors**



- Result from one instruction used as operand for another
  - Read-after-write (RAW) dependency
- Very common in actual programs
- Must make sure our pipeline handles these properly
  - Get correct results
  - Minimize performance impact

-11 - CS:APP -12 - CS:APP

# **SEQ Hardware**

- Stages occur in sequence
- One operation in process at a time



**SEQ+ Hardware** Memory

Still sequential implementation

Reorder PC stage to put at beginning

#### **PC Stage**

- Task is to select PC for current instruction
- Based on results computed by previous instruction

#### **Processor State**

- PC is no longer stored in register
- But, can determine PC based on other stored information



- 14 - CS:APP

#### - 13 -

# **Adding Pipeline Registers**



# **Pipeline Stages**

#### **Fetch**

- Select current PC
- Read instruction
- **Compute incremented PC**

#### **Decode**

■ Read program registers

#### Execute

■ Operate ALU

#### Memory

Read or write data memory

#### **Write Back**

Update register file

Petch PC PC



CS:APP

#### **PIPE- Hardware**

 Pipeline registers hold intermediate values from instruction execution

#### Forward (Upward) Paths

- Values passed from one stage to next
- Cannot jump past stages
  - e.g., valC passes through decode



#### **Feedback Paths**

#### **Predicted PC**

■ Guess value of next PC

#### **Branch information**

- Jump taken/not-taken
- Fall-through or target address

#### **Return point**

Read from memory

#### Register updates

■ To register file write ports



- 17 -

**– 18 –** 

# Predicting the | M. Both | M. Solde | M. M.

#### Start fetch of new instruction after current one has completed fetch stage

- Not enough time to reliably determine next instruction
- Guess which instruction will follow
  - Recover if prediction was incorrect

# **Our Prediction Strategy**

#### Instructions that Don't Transfer Control

- Predict next PC to be valP
- Always reliable

#### **Call and Unconditional Jumps**

- Predict next PC to be valC (destination)
- Always reliable

#### **Conditional Jumps**

- Predict next PC to be valC (destination)
- Only correct if branch is taken
  - Typically right 60% of time

#### **Return Instruction**

■ Don't try to predict

# Recovering from PC Misprediction



- Mispredicted Jump
  - Will see branch flag once instruction reaches memory stage
  - Can get fall-through PC from valA
- Return Instruction
  - Will get return PC when ret reaches write-back stage

# **Pipeline Demonstration**



File: demo-basic.ys

-22 -



-21 - CS:APP

# **Data Dependencies: 3 Nop's**



# **Data Dependencies: 2 Nop's**



-23-

# **Data Dependencies: 1 Nop**



# **Data Dependencies: No Nop**



CS:APP

# **Branch Misprediction Example**

```
0x000:
          xorl %eax,%eax
0x002:
          jne t
                              # Not taken
                              # Fall through
0x007:
          irmovl $1, %eax
0x00d:
          nop
0x00e:
          nop
0x00f:
          nop
0x010:
          halt
0x011: t: irmov1 $3, %edx
                              # Target (Should not execute)
0x017:
          irmovl $4, %ecx
                              # Should not execute
0x01d:
          irmovl $5, %edx
                              # Should not execute
```

Should only execute first 8 instructions

# **Branch Misprediction Trace**



**- 27 -**

- 25 -

demo-j.ys

-26-

# **Return Example**

demo-ret.ys

```
0x000:
           irmovl Stack, %esp
                                # Intialize stack pointer
0x006:
                                # Avoid hazard on %esp
0 \times 007:
           nop
:800x0
           nop
0x009:
                                # Procedure call
           call p
           irmovl $5,%esi
0x00e:
                               # Return point
0 \times 014:
           halt
0x020: .pos 0x20
                                  # procedure
0x020: p: nop
0 \times 021:
          nop
0 \times 022:
          nop
0x023:
          ret
0 \times 024:
          irmovl $1,%eax
                                  # Should not be executed
0x02a:
          irmovl $2,%ecx
                                  # Should not be executed
0x030:
          irmovl $3,%edx
                                  # Should not be executed
0x036:
                                  # Should not be executed
          irmovl $4,%ebx
0x100: .pos 0x100
0x100: Stack:
                                # Stack: Stack pointer
```

Require lots of nops to avoid data hazards

- 29 - CS:APP -

# **Pipeline Summary**

#### Concept

- Break instruction execution into 5 stages
- Run instructions through in pipelined mode

#### Limitations

- Can't handle dependencies between instructions when instructions follow too closely
- Data dependencies
  - One instruction writes register, later one reads it
- Control dependency
  - Instruction sets PC in way that pipeline did not predict correctly

CS:APP

Mispredicted branch and return

#### **Fixing the Pipeline**

We'll do that next time

## **Incorrect Return Example**





■ Incorrectly execute 3 instructions following ret



- 30 -

-31 -