



## Single Cycle Processor Design

#### **Kai Huang**



### Rise of the robots

## The Economist



Mar 29th 2014



http://www.economist.com/printedition/2014-03-29



### **Outline**

- Designing a Processor: Step-by-Step
- Datapath Components and Clocking
- Assembling an Adequate Datapath

4/4/2014

- Controlling the Execution of Instructions
- The Main Controller and ALU Controller
- Drawback of the single-cycle processor design





## **The Performance Perspective**

- Recall, performance is determined by:
  - Instruction count
  - Clock cycles per instruction (CPI)
  - Clock cycle time (CT)



- Clock cycles per instruction
- Clock cycle time
- Single cycle datapath and control design:
  - Advantage: One clock cycle per instruction
  - Disadvantage: long cycle time







### Designing a Processor: Step-by-Step

- Analyze instruction set => datapath requirements
  - The meaning of each instruction is given by the register transfers
  - Datapath must include storage elements for ISA registers
  - Datapath must support each register transfer
- Select datapath components and clocking methodology
- Assemble datapath meeting the requirements
- Analyze implementation of each instruction
  - Determine the setting of control signals for register transfer
- Assemble the control logic





### **MIPS Instruction Formats**

- All instructions are 32-bit wide
- Three instruction formats: R-type, I-type, and J-type

| Op <sup>6</sup> | Rs <sup>5</sup>         | Rt <sup>5</sup> | Rd <sup>5</sup> sa <sup>5</sup> funct <sup>6</sup> |  |  |  |  |
|-----------------|-------------------------|-----------------|----------------------------------------------------|--|--|--|--|
| Op <sup>6</sup> | Rs <sup>5</sup>         | Rt <sup>5</sup> | immediate <sup>16</sup>                            |  |  |  |  |
| Op <sup>6</sup> | immediate <sup>26</sup> |                 |                                                    |  |  |  |  |

- Op<sup>6</sup>: 6-bit opcode of the instruction
- Rs<sup>5</sup>, Rt<sup>5</sup>, Rd<sup>5</sup>: 5-bit source and destination register numbers
- o sa<sup>5</sup>: 5-bit shift amount used by shift instructions
- o funct<sup>6</sup>: 6-bit function field for R-type instructions
- o immediate<sup>16</sup>: 16-bit immediate value or address offset
- o immediate<sup>26</sup>: 26-bit target address of the jump instruction





### **MIPS Subset of Instructions**

- Only a subset of the MIPS instructions are considered
  - ALU instructions (R-type): add, sub, and, or, xor, slt
  - Immediate instructions (I-type): addi, slti, andi, ori, xori
  - Load and Store (I-type): lw, sw
  - Branch (I-type): beq, bne
  - Jump (J-type): j
- This subset does not include all the integer instructions
- But sufficient to illustrate design of datapath and control
- Concepts used to implement the MIPS subset are used to construct a broad spectrum of computers





### **Details of the MIPS Subset**

| Instr | uction                    | Meaning          | Format     |                 |                 |                   |   |      |
|-------|---------------------------|------------------|------------|-----------------|-----------------|-------------------|---|------|
| add   | rd, rs, rt                | addition         | $op^6 = 0$ | rs <sup>5</sup> | rt <sup>5</sup> | rd <sup>5</sup>   | 0 | 0x20 |
| sub   | rd, rs, rt                | subtraction      | $op^6 = 0$ | rs <sup>5</sup> | rt <sup>5</sup> | rd <sup>5</sup>   | 0 | 0x22 |
| and   | rd, rs, rt                | bitwise and      | $op^6 = 0$ | rs <sup>5</sup> | rt <sup>5</sup> | rd <sup>5</sup>   | 0 | 0x24 |
| or    | rd, rs, rt                | bitwise or       | $op^6 = 0$ | rs <sup>5</sup> | rt <sup>5</sup> | rd <sup>5</sup> 0 |   | 0x25 |
| xor   | rd, rs, rt                | exclusive or     | $op^6 = 0$ | rs <sup>5</sup> | rt <sup>5</sup> | rd <sup>5</sup>   | 0 | 0x26 |
| slt   | rd, rs, rt                | set on less than | $op^6 = 0$ | rs <sup>5</sup> | rt <sup>5</sup> | rd <sup>5</sup>   | 0 | 0x2a |
| addi  | rt, rs, im <sup>16</sup>  | add immediate    | 80x0       | rs <sup>5</sup> | rt <sup>5</sup> | im <sup>16</sup>  |   |      |
| slti  | rt, rs, im <sup>16</sup>  | slt immediate    | 0x0a       | rs <sup>5</sup> | rt <sup>5</sup> | im <sup>16</sup>  |   |      |
| andi  | rt, rs, im <sup>16</sup>  | and immediate    | 0x0c       | rs <sup>5</sup> | rt <sup>5</sup> | im <sup>16</sup>  |   |      |
| ori   | rt, rs, im <sup>16</sup>  | or immediate     | 0x0d       | rs <sup>5</sup> | rt <sup>5</sup> | im <sup>16</sup>  |   |      |
| xori  | rt, im <sup>16</sup>      | xor immediate    | 0x0e       | rs <sup>5</sup> | rt <sup>5</sup> | im <sup>16</sup>  |   |      |
| lw    | rt, im <sup>16</sup> (rs) | load word        | 0x23       | rs <sup>5</sup> | rt <sup>5</sup> | im <sup>16</sup>  |   |      |
| SW    | rt, im <sup>16</sup> (rs) | store word       | 0x2b       | rs <sup>5</sup> | rt <sup>5</sup> | im <sup>16</sup>  |   |      |
| beq   | rs, rt, im <sup>16</sup>  | branch if equal  | 0x04       | rs <sup>5</sup> | rt <sup>5</sup> | im <sup>16</sup>  |   |      |
| bne   | rs, rt, im <sup>16</sup>  | branch not equal | 0x05       | rs <sup>5</sup> | rt <sup>5</sup> | im <sup>16</sup>  |   |      |
| j     | im <sup>26</sup>          | jump             | 0x02       |                 |                 | im <sup>26</sup>  |   |      |





## Register Transfer Level (RTL)

- RTL is a description of data flow between registers
- RTL gives a meaning to the instructions
- All instructions are fetched from memory at address PC

#### Instruction RTL Description

```
ADD
                        Reg(Rd) \leftarrow Reg(Rs) + Reg(Rt);
                                                                                                              PC \leftarrow PC + 4
SUB
                        Reg(Rd) \leftarrow Reg(Rs) - Reg(Rt);
                                                                                                              PC \leftarrow PC + 4
                        Reg(Rt) \leftarrow Reg(Rs) \mid zero ext(Im16);
                                                                                                              PC \leftarrow PC + 4
ORI
                        Reg(Rt) \leftarrow MEM[Reg(Rs) + sign_ext(Im16)];
                                                                                                              PC \leftarrow PC + 4
LW
                                                                                                              PC \leftarrow PC + 4
SW
                        MEM[Reg(Rs) + sign ext(Im16)] \leftarrow Reg(Rt);
BEQ
                        if (Reg(Rs) == Reg(Rt))
                                 PC \leftarrow PC + 4 + 4 \times sign extend(Im16)
                        else PC \leftarrow PC + 4
```



### **Instructions are Executed in Steps**

R-type Fetch instruction: Instruction ← MEM[PC]

Fetch operands:  $data1 \leftarrow Reg(Rs)$ ,  $data2 \leftarrow Reg(Rt)$ 

Execute operation: ALU\_result ← func(data1, data2)

Write ALU result:  $Reg(Rd) \leftarrow ALU result$ 

Next PC address:  $PC \leftarrow PC + 4$ 

I-type Fetch instruction: Instruction ← MEM[PC]

Fetch operands:  $data1 \leftarrow Reg(Rs)$ ,  $data2 \leftarrow Extend(imm16)$ 

Execute operation: ALU\_result  $\leftarrow$  op(data1, data2)

Write ALU result:  $Reg(Rt) \leftarrow ALU_result$ 

Next PC address:  $PC \leftarrow PC + 4$ 

■ BEQ Fetch instruction: Instruction ← MEM[PC]

Fetch operands:  $data1 \leftarrow Reg(Rs)$ ,  $data2 \leftarrow Reg(Rt)$ 

Equality:  $zero \leftarrow subtract(data1, data2)$ 

Branch: if (zero)  $PC \leftarrow PC + 4 + 4 \times sign\_ext(imm16)$ 

else  $PC \leftarrow PC + 4$ 



### Instruction Execution - cont'd

■ LW Fetch instruction: Instruction ← MEM[PC]

Fetch base register: base  $\leftarrow \text{Reg}(Rs)$ 

Calculate address:  $address \leftarrow base + sign_extend(imm16)$ 

Read memory:  $data \leftarrow MEM[address]$ 

Write register Rt:  $Reg(Rt) \leftarrow data$ 

Next PC address:  $PC \leftarrow PC + 4$ 

SW Fetch instruction: Instruction ← MEM[PC]

Fetch registers: base  $\leftarrow$  Reg(Rs), data  $\leftarrow$  Reg(Rt)

Calculate address:  $address \leftarrow base + sign_extend(imm16)$ 

Write memory:  $MEM[address] \leftarrow data$ 

Next PC address:  $PC \leftarrow PC + 4$ 

■ Jump Fetch instruction: Instruction ← MEM[PC]

Target PC address:  $target \leftarrow PC[31:28]$ , lmm26, '00'

Jump:  $PC \leftarrow target$ 

concatenation

### Requirements of the Instruction Set

- Memory
  - Instruction memory where instructions are stored
  - Data memory where data is stored
- Registers
  - $\circ$  32  $\times$  32-bit general purpose registers, R0 is always zero
  - Read source register Rs
  - Read source register Rt
  - Write destination register Rt or Rd
- Program counter PC register and Adder to increment PC
- Sign and Zero extender for immediate constant
- ALU for executing instructions



#### Next . . .

- Designing a Processor: Step-by-Step
- Datapath Components and Clocking
- Assembling an Adequate Datapath
- Controlling the Execution of Instructions
- The Main Controller and ALU Controller
- Drawback of the single-cycle processor design



## **Components of the Datapath**

- Combinational Elements
  - o ALU, Adder
  - Immediate extender
  - Multiplexers
- Storage Elements
  - Instruction memory
  - Data memory
  - PC register
  - Register file
- Clocking methodology
  - Timing of reads and writes











## Register Element

- Register
  - Similar to the D-type Flip-Flop
- n-bit input and output
- Write Enable:
  - Enable / disable writing of register
  - Negated (0): Data\_Out will not change
  - Asserted (1): Data\_Out will become Data\_In after clock edge
- Edge triggered Clocking
  - Register output is modified at clock edge



### **MIPS** Register File

- Register File consists of 32  $\times$  32-bit registers
  - BusA and BusB: 32-bit output busses for reading 2 registers
  - BusW: 32-bit input bus for writing a register when RegWrite is 1
  - Two registers read and one written in a cycle
- Registers are selected by:
  - RA selects register to be read on BusA
  - RB selects register to be read on BusB
  - RW selects the register to be written
- Clock input
  - The clock input is used ONLY during write operation
  - During read, register file behaves as a combinational logic block
    - RA or RB valid => BusA or BusB valid after access time



### **Tri-State Buffers**

- Allow multiple sources to drive a single bus
- Two Inputs:
  - Data signal (data\_in)
  - Output enable
- One Output (data\_out):
  - If (Enable) Data\_out = Data\_in
     else Data\_out = High Impedance state (output is disconnected)

Tri-state buffers can be used to build multiplexors



Enable





Data\_out

Data\_in

## **Details of the Register File**







### **Building a Multifunction ALU**







### **Instruction and Data Memories**

- Instruction memory needs only provide read access
  - Because datapath does not write instructions
  - Behaves as combinational logic for read
  - Address selects Instruction after access time
- Data Memory is used for load and store
  - MemRead: enables output on Data\_out
    - Address selects the word to put on Data\_out
  - MemWrite: enables writing of Data\_in
    - Address selects the memory word to be written
    - The Clock synchronizes the write operation
- Separate instruction and data memories
  - Later, we will replace them with caches





## **Clocking Methodology**

- Clocks are needed in a sequential logic to decide when a state element (register) should be updated
- To ensure correctness, a clocking methodology defines when data can be written and read



- We assume edgetriggered clocking
- All state changes occur on the same clock edge
- Data must be valid and stable before arrival of clock edge
- Edge-triggered clocking allows a register to be read and written during same clock cycle





## **Determining the Clock Cycle**

 With edge-triggered clocking, the clock cycle must be long enough to accommodate the path from one register through the combinational logic to another register



- T<sub>clk-q</sub>: clock to output delay through register
- Tmax\_comb : longest delay through combinational logic
- T<sub>s</sub>: setup time that input to a register must be stable before arrival of clock edge
- Th: hold time that input to a register must hold after arrival of clock edge
- Hold time (Th) is normally satisfied since T<sub>clk-q</sub> > Th





### **Clock Skew**

- Clock skew arises because the clock signal uses different paths with slightly different delays to reach state elements
- Clock skew is the difference in absolute time between when two storage elements see a clock edge
- With a clock skew, the clock cycle time is increased

$$T_{cycle} \ge T_{clk-q} + T_{max\_combinational} + T_{setup} + T_{skew}$$

Clock skew is reduced by balancing the clock delays



#### Next...

- Designing a Processor: Step-by-Step
- Datapath Components and Clocking
- Assembling an Adequate Datapath
- Controlling the Execution of Instructions
- The Main Controller and ALU Controller
- Drawback of the single-cycle processor design



## **Instruction Fetching Datapath**

- We can now assemble the datapath from its components
- For instruction fetching, we need ...
  - Program Counter (PC) register
  - Instruction Memory
  - Adder for incrementing PC

The least significant 2 bits of the PC are '00' since PC is a multiple of 4

Instruction Address
Instruction Memory

The least significant 2 bits of the PC are '00' since PC is a multiple of 4

Datapath does not handle branch or jump instructions

Improved datapath increments upper 30 bits of PC by 1







### **Datapath for R-type Instructions**



- Control signals
  - ALUCtrl is derived from the funct field because Op = 0 for R-type
  - RegWrite is used to enable the writing of the ALU result





## Datapath for I-type ALU Instructions



- ALUCtrl is derived from the Op field
- RegWrite is used to enable the writing of the ALU result
- ExtOp is used to control the extension of the 16-bit immediate





## **Combining R-type & I-type Datapaths**



Another mux selects 2<sup>nd</sup> ALU input as either source register Rt data on BusB or the extended immediate

- Control signals
  - ALUCtrl is derived from either the Op or the funct field
  - RegWrite enables the writing of the ALU result
  - ExtOp controls the extension of the 16-bit immediate
  - RegDst selects the register destination as either Rt or Rd
  - ALUSrc selects the 2<sup>nd</sup> ALU source as BusB or extended immediate





## **Controlling ALU Instructions**



For R-type ALU instructions, RegDst is '1' to select Rd on RW and ALUSrc is '0' to select BusB as second ALU input. The active part of datapath is shown in green



For I-type ALU
instructions, RegDst is
'0' to select Rt on RW
and ALUSrc is '1' to
select Extended
immediate as second
ALU input. The active
part of datapath is
shown in green





### **Details of the Extender**

- Two types of extensions
  - Zero-extension for unsigned constants
  - Sign-extension for signed constants
- Control signal ExtOp indicates type of extension
- Extender Implementation: wiring and one AND gate







## **Adding Data Memory to Datapath**

A data memory is added for load and store instructions



ALU calculates data memory address

- Additional Control signals
  - MemRead for load instructions
  - MemWrite for store instructions

A 3<sup>rd</sup> mux selects data on BusW as either ALU result or memory data\_out

BusB is connected to Data\_in of Data Memory for store instructions



MemtoReg selects data on BusW as ALU result or Memory Data out

## **Controlling the Execution of Load**



as destination register

MemRead = '1' to read data memory

ALUSrc = '1' selects extended immediate as second ALU input

MemtoReg = '1' places the data read from memory on BusW

ALUCtrl = 'ADD' to calculate data memory address as Reg(Rs) + sign-extend(Imm16) RegWrite = '1' to write the memory data on BusW to register Rt





## **Controlling the Execution of Store**



ALUSrc = '1' to select the extended immediate as second ALU input

ALUCtrl = 'ADD' to calculate data memory address as Reg(Rs) + sign-extend(Imm16)

MemtoReg = 'x' because we don't care what data is placed on BusW

RegWrite = '0' because no register is written by the store instruction





# **Adding Jump and Branch to Datapath**



- Additional Control Signals
  - J, Beq, Bne for jump and branch instructions
  - Zero condition of the ALU is examined
  - PCSrc = 1 for Jump & taken Branch

Next PC computes jump or branch target instruction address

For Branch, ALU does a subtraction



#### **Details of Next PC**



Jump target address: upper 4 bits of PC are concatenated with Imm26





## **Controlling the Execution of Jump**







## **Controlling the Execution of Branch**



Next PC outputs branch target address

ALUSrc = '0' (2<sup>nd</sup> ALU input is BusB) ALUCtrl = 'SUB' produces zero flag

MemRead = MemWrite = RegWrite = 0

Next PC logic determines PCSrc according to zero flag

RegDst = ExtOp = MemtoReg = x



#### Next . . .

- Designing a Processor: Step-by-Step
- Datapath Components and Clocking
- Assembling an Adequate Datapath
- Controlling the Execution of Instructions
- The Main Controller and ALU Controller
- Drawback of the single-cycle processor design



### **Main Control and ALU Control**



## Single-Cycle Datapath + Control



# **Main Control Signals**

| Signal   | Effect when '0'                                                               | Effect when '1'                                             |  |  |  |  |  |
|----------|-------------------------------------------------------------------------------|-------------------------------------------------------------|--|--|--|--|--|
| RegDst   | Destination register = Rt                                                     | Destination register = Rd                                   |  |  |  |  |  |
| RegWrite | None                                                                          | Destination register is written with the data value on BusW |  |  |  |  |  |
| ExtOp    | 16-bit immediate is zero-extended                                             | 16-bit immediate is sign-extended                           |  |  |  |  |  |
| ALUSrc   | Second ALU operand comes from the second register file output (BusB)          | Second ALU operand comes from the extended 16-bit immediate |  |  |  |  |  |
| MemRead  | None                                                                          | Data memory is read Data_out ← Memory[address]              |  |  |  |  |  |
| MemWrite | None                                                                          | Data memory is written<br>Memory[address] ← Data_in         |  |  |  |  |  |
| MemtoReg | BusW = ALU result                                                             | BusW = Data_out from Memory                                 |  |  |  |  |  |
| Beq, Bne | PC ← PC + 4                                                                   | PC ← Branch target address If branch is taken               |  |  |  |  |  |
| J        | PC ← PC + 4                                                                   | PC ← Jump target address                                    |  |  |  |  |  |
| ALUOp    | This multi-bit signal specifies the ALU operation as a function of the opcode |                                                             |  |  |  |  |  |





## **Main Control Signal Values**

| Ор     | Reg<br>Dst | Reg<br>Write | Ext<br>Op | ALU<br>Src | ALU<br>Op | Beq | Bne | J | Mem<br>Read | Mem<br>Write | Mem<br>toReg |
|--------|------------|--------------|-----------|------------|-----------|-----|-----|---|-------------|--------------|--------------|
| R-type | 1 = Rd     | 1            | х         | 0=BusB     | R-type    | 0   | 0   | 0 | 0           | 0            | 0            |
| addi   | 0 = Rt     | 1            | 1=sign    | 1=lmm      | ADD       | 0   | 0   | 0 | 0           | 0            | 0            |
| slti   | 0 = Rt     | 1            | 1=sign    | 1=lmm      | SLT       | 0   | 0   | 0 | 0           | 0            | 0            |
| andi   | 0 = Rt     | 1            | 0=zero    | 1=lmm      | AND       | 0   | 0   | 0 | 0           | 0            | 0            |
| ori    | 0 = Rt     | 1            | 0=zero    | 1=lmm      | OR        | 0   | 0   | 0 | 0           | 0            | 0            |
| xori   | 0 = Rt     | 1            | 0=zero    | 1=lmm      | XOR       | 0   | 0   | 0 | 0           | 0            | 0            |
| lw     | 0 = Rt     | 1            | 1=sign    | 1=lmm      | ADD       | 0   | 0   | 0 | 1           | 0            | 1            |
| SW     | Х          | 0            | 1=sign    | 1=lmm      | ADD       | 0   | 0   | 0 | 0           | 1            | Х            |
| beq    | Х          | 0            | Х         | 0=BusB     | SUB       | 1   | 0   | 0 | 0           | 0            | Х            |
| bne    | Х          | 0            | Х         | 0=BusB     | SUB       | 0   | 1   | 0 | 0           | 0            | Х            |
| j      | Х          | 0            | Х         | Х          | Х         | 0   | 0   | 1 | 0           | 0            | х            |

■ X is a don't care (can be 0 or 1), used to minimize logic





## **Logic Equations for Control Signals**

RegDst <= R-type

RegWrite  $\leq$  (sw + beq + bne + j)

ExtOp <= (andi + ori + xori)

ALUSrc <= (R-type + beq + bne)

MemRead <= lw

MemWrite <= sw

MemtoReg <= |w



### **ALU Control Truth Table**

| On 6            | А      | 4-bit              |         |          |  |
|-----------------|--------|--------------------|---------|----------|--|
| Op <sup>6</sup> | ALUOp  | funct <sup>6</sup> | ALUCtrl | Encoding |  |
| R-type          | R-type | add                | ADD     | 0000     |  |
| R-type          | R-type | sub                | SUB     | 0010     |  |
| R-type          | R-type | and                | AND     | 0100     |  |
| R-type          | R-type | or                 | OR      | 0101     |  |
| R-type          | R-type | xor                | XOR     | 0110     |  |
| R-type          | R-type | R-type slt         |         | 1010     |  |
| addi            | ADD    | X                  | ADD     | 0000     |  |
| slti            | SLT    | X                  | SLT     | 1010     |  |
| andi            | AND    | Х                  | AND     | 0100     |  |
| ori             | OR     | Х                  | OR      | 0101     |  |
| xori            | XOR    | Х                  | XOR     | 0110     |  |
| lw              | ADD    | Х                  | ADD     | 0000     |  |
| SW              | ADD    | Х                  | ADD     | 0000     |  |
| beq             | SUB    | Х                  | SUB     | 0010     |  |
| bne             | SUB    | Х                  | SUB     | 0010     |  |
| j               | Х      | Х                  | Х       | Х        |  |

The 4-bit encoding for ALUctrl is chosen here to be equal to the last 4 bits of the function field

Other binary
encodings are also
possible. The idea is
to choose a binary
encoding that will
minimize the logic for
ALU Control

#### Next...

- Designing a Processor: Step-by-Step
- Datapath Components and Clocking
- Assembling an Adequate Datapath
- Controlling the Execution of Instructions
- The Main Controller and ALU Controller
- Drawback of the single-cycle processor design



# **Drawbacks of Single Cycle Processor**

- Long cycle time
  - All instructions take as much time as the slowest



- Alternative Solution: Multicycle implementation
  - Break down instruction execution into multiple cycles



