# Lecture Notes of Computer Architecture

# Zhou Fan ACM Class, Shanghai Jiao Tong University

# Contents

| 1 | Pip                             | elining |                                          | 2 |
|---|---------------------------------|---------|------------------------------------------|---|
|   | 1.1                             | Introd  | uction of Pipelining                     | 2 |
|   |                                 | 1.1.1   | Laundry Example                          | 2 |
|   |                                 | 1.1.2   | Speedup from Pipelining                  | 2 |
|   | asics of a RISC Instruction Set | 2       |                                          |   |
|   |                                 | 1.2.1   | Introduction of RISC                     | 2 |
|   |                                 | 1.2.2   | Implementation of a RISC Instruction Set | 5 |
|   |                                 | 1.2.3   | Pipeline Hazards                         | 4 |

# 1 Pipelining

## 1.1 Introduction of Pipelining

*Pipelining* is an implementation technique whereby multiple instructions are overlapped in execution; it takes advantage of parallelism that exists among the actions needed to execute an instruction.<sup>[1]</sup>

#### 1.1.1 Laundry Example

Suppose we have many loads of clothes to wash, dry and fold.

- Each step (washing, drying and folding) is called a *pipe stage* or a *pipe segment*.
- Latency is the total time spent on a single task, which is not improved by pipelining. Unbalanced lengths of pipe stages reduces speedup.
- *Throughput* is defined as the number of loads of clothes per minute. It shows how often a load of clothes exits the pipeline.
- The time required between moving an instruction one step down the pipeline is a processor cycle. In a computer, this processor cycle is usually 1 clock cycle. [1]

#### 1.1.2 Speedup from Pipelining

Throughput is what matters. Pipelining helps throughput of whole workload, while it doesn't help latency of single task.

- Unbalanced lengths of pipe stages reduces speedup.
- Handover time between pipe stages reduces speedup.

To improve the efficiency of a pipeline, one should balance the length of each pipeline stage. If the stages are perfectly balanced, then the time per instruction on the pipeline processor is equal to (under ideal conditions)

Time per instruction on unpipelined machine

Number of pipe stages

and the throughput of the pipeline is equal to

Number of pipe stages × Throughput on unpipelined machine

#### 1.2 The Basics of a RISC Instruction Set

#### 1.2.1 Introduction of RISC

A RISC<sup>1</sup> is a computer whose instruction set architecture has lower cycles per instruction (CPI) than a  $CISC^2$ .

MIPS is a RISC instruction set architecture.

<sup>&</sup>lt;sup>1</sup>reduced instruction set computer

<sup>&</sup>lt;sup>2</sup>complex instruction set computer

## Key Properties of RISC Architectures<sup>[1]</sup>

- All operations on data apply to data in registers and typically change the entire register (32 or 64 bits per register).
- Only load and store operations can affect memory. Load and store operations that load or store less than a full register are often available.
- The instruction formats are few in number, with all instructions typically being one size

These properties make the implementation of pipelining simple. Most RISC architectures like MIPS have three classes of instructions:

- 1. ALU<sup>3</sup> instructions
- 2. Load and store instructions
- 3. Branches and jumps

#### 1.2.2 Implementation of a RISC Instruction Set

The implementation here will focus only on a pipeline for an integer subset of a RISC architecture that consists of load-store word, branch, and integer ALU operations.

**Implementation Without Pipelining**<sup>[1]</sup> Every instruction in the RISC subset can be implemented in at most 5 clock cycles as follows:

1. Instruction fetch cycle (IF):

Send the program counter (PC) to memory and fetch the current instruction from memory. Update the PC to the next sequential PC by adding 4 (since each instruction is 4 bytes) to the PC<sup>4</sup>.

2. Instruction decode / register fetch cycle (ID):

Decode the instruction and read the registers corresponding to register source specifiers from the register file. Do the equality test on the registers as they are read, for a possible branch. Sign-extend (introduced later) the offset field of the instruction in case it is needed. Compute the possible branch target address by adding the sign-extended offset to the incremented PC.

Decoding is done in parallel with reading registers, which is possible because the register specifiers are at a fixed location in a RISC instruction. This technique is known as *fixed-field decoding*. It may read a register that we don't use, and it doesn't help but also doesn't hurt performance.

3. Execution / effective address cycle (EX):

The ALU operates on the operands prepared in the prior cycle, performing one of three functions depending on the instruction type.

<sup>&</sup>lt;sup>3</sup>Arithmetic logic unit

<sup>&</sup>lt;sup>4</sup>program counter, indicates where a computer is in its program sequence.

- Memory reference The ALU adds the base register and the offset to form the effective address.
- Register-Register ALU instruction The ALU performs the operation specified by the ALU opcode on the values read from the register file.
- Register-Immediate ALU instruction The ALU performs the operation specified by the ALU opcode on the first value read from the register file and the sign-extended immediate.

In a load-store architecture the effective address and execution cycles can be combined into a single clock cycle, since no instruction needs to do both of them.

#### 4. Memory Access (MEM):

The memory does a read using the effective address computed in the previous cycle or writes the data read from the register file using the effective address, if the instruction is a load or a store.

#### 5. Write-back cycle (WB):

For a Register-Register ALU instruction or a load instruction, write the result into the register file.

**Sign Extension** In computer arithmetic, sigh extension is the operation of increasing the number of bits of a binary number while preserving the number's sign and value. This is done by appending digits (same as the sign bit) to the most significant side of the number, following a procedure dependent on the particular signed number representation used.

For example:

- "00 1010" (decimal positive 10)  $\rightarrow$  "0000 0000 0000 1010"
- "11 1111 0001" (decimal negative 15)  $\rightarrow$  "1111 1111 1111 0001"

**Pipelined Implementation**<sup>[1]</sup> Under ideal conditions, we can easily get a pipelined implementation from the execution described above, doing one of the five steps for five instructions in parallel in each pipeline stage. This procedure is showed in the following table.

|                    | Clock Number |    |    |     |     |     |     |     |    |
|--------------------|--------------|----|----|-----|-----|-----|-----|-----|----|
| Instruction Number | 1            | 2  | 3  | 4   | 5   | 6   | 7   | 8   | 9  |
| Instruction i      | IF           | ID | EX | MEM | WB  |     |     |     |    |
| Instruction $i+1$  |              | IF | ID | EX  | MEM | WB  |     |     |    |
| Instruction $i+2$  |              |    | IF | ID  | EX  | MEM | WB  |     |    |
| Instruction $i+3$  |              |    |    | IF  | ID  | EX  | MEM | WB  |    |
| Instruction $i+4$  |              |    |    |     | IF  | ID  | EX  | MEM | WB |

#### 1.2.3 Pipeline Hazards

Hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle. There are three classes of hazards<sup>[1]</sup>:

- 1. Structural hazards arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.
- 2. Data hazards arise when an instruction depends on the results of a previous instrucion still in the pipeline.
- 3. Control hazards is caused by delay between the fetching of instruction and decisions about changes in control flow (branches and jumps).

These hazards can make it necessary to stall the pipeline.

There are three generic data hazards:

• Read After Write (RAW):

 $Instr_{J}$  tries to read oprand before  $Instr_{J}$  writes it.

- I: add **r1**, r2, r3
- J: sub r4, **r1**, r3
- Write After Read (WAR):

 $Instr_J$  writes oprand before  $Instr_I$  reads it.

- I: sub r4, **r1**, r3
- J: add **r1**, r2, r3
- K: mul r6, r1, r7

This cannot happen in MIPS 5 stage pipeline.

• Write After Write (WAW):

 $Instr_J$  writes oprand before  $Instr_I$  writes it.

- I: sub **r1**, r4, r3
- J: add **r1**, r2, r3
- K: mul r6, r1, r7

This cannot happen in MIPS 5 stage pipeline.

## References

[1] John L. Hennessy, David A. Patterson, et al. Computer Architecture: A Quantitative Approach, Fifth Edition, 2012.