# Computer Architecture Final Assessment Report

5-stage Pipelined Processor Team #4

Mohamed Shawky SEC:2, BN:16

Remonda Talaat SEC:1, BN:20 Evram Youssef SEC:1, BN:9

Mahmoud Adas SEC:2, BN:21

June 1, 2020

# Contents

| Ι  | Int | trodu  | ction                | 1  |
|----|-----|--------|----------------------|----|
|    | 0.1 | Abstra | ct                   | 2  |
| II | O   | veral  | System Design        | 3  |
|    | 0.2 | Overal | System Design Schema | 4  |
|    | 0.3 | Memo   | y Specs              | 4  |
|    | 0.4 |        | ntrol Unit           | 5  |
|    |     | 0.4.1  | Inputs               | 5  |
|    |     | 0.4.2  | Outputs              | 5  |
|    |     | 0.4.3  | Logic                | 6  |
|    | 0.5 | Dynan  | ic Branch Prediction | 6  |
|    |     | 0.5.1  | Inputs               | 7  |
|    |     | 0.5.2  | Outputs              | 7  |
|    |     | 0.5.3  | Logic                | 7  |
|    | 0.6 | Brancl | Address Unit         | 7  |
|    |     | 0.6.1  | Inputs               | 7  |
|    |     | 0.6.2  | Outputs              | 8  |
|    |     | 0.6.3  | Logic                | 8  |
|    | 0.7 | Regist | er File              | 9  |
|    |     | 0.7.1  | Registers            | 9  |
|    |     | 0.7.2  | Inputs               | 9  |
|    |     | 0.7.3  | Outputs              | 10 |
|    |     | 0.7.4  | Logic                | 10 |
|    | 0.8 | ALU    |                      | 11 |
|    |     | 0.8.1  | Inputs               | 11 |
|    |     | 0.8.2  | Outputs              | 11 |
|    |     | 0.8.3  | ALU Operations       | 11 |

|              |                  | 0.8.4 Logic                          | 12                              |
|--------------|------------------|--------------------------------------|---------------------------------|
| (            | 0.9              | PC Navigator                         | 12                              |
|              |                  | 0.9.1 Inputs                         | 12                              |
|              |                  |                                      | 13                              |
|              |                  | 0.9.3 Logic                          | 13                              |
|              | _                |                                      |                                 |
| Ш            |                  |                                      | 14                              |
|              |                  | 1 1                                  | 15                              |
|              |                  | 1 1                                  | 15                              |
|              |                  |                                      | 15                              |
|              |                  | V I                                  | 16                              |
| (            | 0.14             | Branch and Change Control Operations | 17                              |
| IV           | (                | Control Unit (Signals)               | 18                              |
|              |                  | · = /                                | 19                              |
|              |                  |                                      | 20                              |
| (            | J.10             | <u> </u>                             | 20                              |
|              |                  |                                      | 21                              |
|              |                  |                                      | 22                              |
|              |                  | 1                                    | 23                              |
| (            | 17               |                                      | <ul><li>23</li><li>24</li></ul> |
|              |                  |                                      | 2 <del>4</del> 25               |
|              |                  | 1                                    | <ul><li>25</li><li>26</li></ul> |
| (            | J.19             | Special Instructions                 | 20                              |
| $\mathbf{V}$ | $\mathbf{P}^{i}$ | ipeline Stages                       | 28                              |
| (            | 0.20             | Overview                             | 29                              |
|              |                  | 0.20.1 Fetch Stage                   | 29                              |
|              |                  |                                      | 29                              |
|              |                  | 0.20.3 Execute Stage                 | 29                              |
|              |                  |                                      | 30                              |
|              |                  |                                      | 30                              |
| (            | 0.21             | <u> </u>                             | 30                              |
|              |                  |                                      | 30                              |
|              |                  | ,                                    | 31                              |
|              |                  | ,                                    | 31                              |

|      | 0.21.4 M/WB Buffer               | 32        |
|------|----------------------------------|-----------|
| 0.22 | Special Workflows                |           |
|      | 0.22.1 CALL Workflow             |           |
|      | 0.22.2 RET Workflow              | 32        |
|      |                                  | 33        |
|      |                                  | 33        |
|      |                                  |           |
|      | -r                               | 34        |
| 0.23 |                                  | 35        |
|      | 0.23.1 Detection                 | 35        |
|      | 0.23.2 Handling                  | 35        |
| 0.24 | Data Hazards                     | 35        |
|      | 0.24.1 Detection                 | 36        |
|      | 0.24.2 Handling                  | 36        |
| 0.25 | Control Hazards                  | 37        |
|      | 0.25.1 Detection                 | 37        |
|      | 0.25.2 Handling                  | 37        |
| 0.26 | Software Solutions               | 37        |
|      |                                  |           |
| VII  |                                  | 39        |
|      | 0.26.1 Intuition and Assumptions |           |
|      | 0.26.2 Caches Size               |           |
|      | 0.26.3 Workflow                  | 41        |
| VIII | Implementation Overview          | 45        |
|      | Completeness                     |           |
|      | Functionality                    |           |
|      | General Notes                    |           |
| 0.29 | General Notes                    | 40        |
| IX I | mplementation Analysis 4         | <b>47</b> |
|      | Provided Test Cases              | 48        |
|      |                                  | 48        |

# List of Figures

| 1 | Overall System Design          |
|---|--------------------------------|
| 2 | Branch Prediction Unit Diagram |
| 3 | Branch Address Unit Diagram    |
| 4 | Register File Diagram          |
| 5 | PC Navigator Diagram           |
| 6 | Hazard Detection Unit Diagram  |
| 7 | Cache Design                   |
| 8 | Data Cache R/W                 |

# List of Tables

| 1  | One Operand Instruction Mapping                       | 15 |
|----|-------------------------------------------------------|----|
| 2  | Two Operand Instruction Mapping                       | 16 |
| 3  | Memory Instruction Mapping                            | 17 |
| 4  | One Operand Instruction Mapping                       | 17 |
| 5  | One Operand Instruction Control Signals Part I        | 21 |
| 6  | One Operand Instruction Control Signals Part II       | 21 |
| 7  | Two Operands Instruction Control Signals Part I       | 22 |
| 8  | Two Operands Instruction Control Signals Part II      | 22 |
| 9  | Immediate Operand Instruction Control Signals Part I  | 23 |
| 10 | Immediate Operand Instruction Control Signals Part II | 23 |
| 11 | Data Instruction Control Signals Part I               | 24 |
| 12 | Data Instruction Control Signals Part II              | 24 |
| 13 | Stack Instruction Control Signals Part I              | 25 |
| 14 | Stacks Instruction Control Signals Part II            | 25 |
| 15 | Jumpers Instruction Control Signals Part I            | 26 |
| 16 | Jumpers Instruction Control Signals Part II           | 26 |
| 17 | Specials Instruction Control Signals Part I           | 26 |
| 18 | Specials Instruction Control Signals Part II          | 26 |

# Part I Introduction

## 0.1 Abstract

This report describes our design and implementation of the 5-stage pipelined processor, that follows Harvard architecture.

This report contains:

- Overall system blocks and connections.
- Functionalities of the different blocks.
- Hazard solutions.
- Memory cache system design.
- Implementation and analysis of the full processor.

# Part II Overall System Design



Figure 1: Overall System Design

# 0.2 Overall System Design Schema

Figure 1 shows the overall system design in detail. Each unit is described in details in its section.

## 0.3 Memory Specs

The CPU follows Harvard architecture and thus uses the following 2 separate memory units:

- Instructions Memory: Read-only, stores instructions.
  - Word Width: 16 bits.
  - Address Bus Width: 32 bits.
  - Data Bus Width: 16 bits.
  - Total Number of Words: 2<sup>32</sup> words.
- Data Memory: Read-Write, stores data and the stack.

- Word Width: 16 bits.
- Address Bus Width: 32 bits.
- Data Bus Width: 32 bits.

The higher bits (31 downto 16) are data at address A.

The lower bits (15 down to 0) are data at address A+1 (A-1 for stack) .

where  $A \mod 2 = 0$ .

On data read, data-memory loads data bus with data from A and A+1 .

On data write, data-memory stores data from data bus to both A and A+1 addresses.

On stack read, data-memory loads data bus with data from A and A-1 .

On stack write, data-memory stores data from data bus to both A and A-1 addresses.

- Total Number of Words: 2<sup>32</sup> words.

#### 0.4 PC Control Unit

#### **0.4.1** Inputs

- IF Flush (1 bit)
- Stall Signal (1 bit)
- RESET Signal (1 bit)
- Interrupt Signal (1 bit)
- Current OPCode (7 bits)
- Parallel Load PC Selector (1 bit)

#### 0.4.2 Outputs

• PC Mux Selectors (3 bits)



Figure 2: Branch Prediction Unit Diagram

#### 0.4.3 Logic

- If IF Flush == 1, Output = 001
- If PL PC Selector == 1, Output = 010
- If RESET == 1, Output = 011
- If OPCode == Call, Output = 100
- If Stall == 1, Output = 101
- Else, Output = 000

# 0.5 Dynamic Branch Prediction

Figure 2 shows the branch prediction unit.

#### 0.5.1 Inputs

- Previous Hashed Address (4 bits)
- Current Hashed Address (4 bits)
- Update Bit (1 bit): Taken or Not to update FSM
- Enable Bit (1 bit): to enable FSM update
- OPcode (4 bits)

#### 0.5.2 Outputs

• Taken (1 bit): predict whether the branch taken or not

#### 0.5.3 Logic

- Updates the FSM corresponding to the hashed address.
- Checks whether the OPCode is of a conditional branch instruction.
- Outputs the prediction bit (Taken or Not) accordingly.

#### 0.6 Branch Address Unit

Figure 3 shows the branch address unit.

#### **0.6.1** Inputs

- Next PC Address (32 bits)
- Instruction Address (32 bits)
- Incremented PC Address (32 bits)
- Hashed Address (4 bits)
- OpCode (4 bits)
- CCR (3 bits)



Figure 3: Branch Address Unit Diagram

#### 0.6.2 Outputs

- IF Flush (1 bit)
- Branch Address (32 bits)
- Feedback Hashed Address (4 bits)

#### 0.6.3 Logic

- Check if OpCode is of a conditional branch instruction, if true:
  - Check whether PC Next Address is equal to Instruction Address
  - If true:
    - \* IF Flush = 0, Branch Address = Instruction Address
  - If false:
    - \* IF Flush = 1, Branch Address = Instruction Address
- Else:
  - IF Flush = 0, Branch Address = PC Next Address



Figure 4: Register File Diagram

## 0.7 Register File

Figure 4 shows the register file.

#### 0.7.1 Registers

All internal registers are 32bit in width. Each one has a 4bit address to select it, either for reading or writing.

- 8 general purpose registers, addresses (0 to 7).
- Stack pointer (SP) register with address = 8.

#### 0.7.2 Inputs

- Dest Regs: 2×4 bits (for destination selection)
- SRC Regs: 2×4 bits (for source selection)
- Fetch Reg: 4 bits (for fetch branch register selection)
- WB values: 2×32 bits (for write back values)

- RESET 1 bit: async, clears all registers.
- BR\_IO\_ENBL: 2 bits (to determine whether the operation is OUT or branch)
  - 00 = Do Nothing
  - 10 = Out
  - 11 = Branch
- CLK: input operations are in rising edge, while output operations are in falling edge.

#### 0.7.3 Outputs

- OP1: 32 bits (value of first operand (src0))
- OP2: 32 bits (value of second operand (src1))
- Fetch Value: 32 bits (value of branch address required by fetch)
- Instruction Address: 32 bits (value of branch address)
- OUT Port: 32 bits (IO output port)

#### 0.7.4 Logic

The register selector acts like a decoder to select the required operation and the register on which the operation performed. It operates in the following manner:

- if RESET == 1, then:
  - All registers are set to 0.
- Else
  - Do normal read/write operations based on Src and Dest Regs, as well as WB values.
  - if BR\_IO\_ENBL == 10, then:
    - \* Output from first Src Reg to OUT Port.
  - Else if  $BR_IO_ENBL == 11$ , then:
    - \* Output from first Src Reg to Instruction Address.

#### 0.8 ALU

#### **0.8.1** Inputs

- ALUop: 4 bits (refer to ALU Operations below)
- Operands:  $2 \times 32$  bits (2 input operands)

#### 0.8.2 Outputs

- ALUout: 32 bits (operation result)
- CCR: 3 bits

#### 0.8.3 ALU Operations

- 0000 NOP (no operation)
- 0001 INC (first operand + 1)
- 0010 DEC (first operand 1)
- 0011 ADD (first operand + second operand)
- 0100 SUB (first operand second operand)
- 0101 AND (first operand && second operand)
- 0110 OR (first operand || second operand)
- 0111 NOT (!first operand)
- 1000 SHL (shift first operand to the left with the value of second operand)
- 1001 SHR (shift first operand to the right with the value of second operand)
- 1010 INC2 (first operand + 2)
- 1011 DEC2 (first operand 2)
- 1100 SWAP (first operand = second operand & second operand = first operand)



Figure 5: PC Navigator Diagram

#### 0.8.4 Logic

- ALU performs the operation and changes the CCR accordingly.
- The input operands of the ALU are multiplexed between forwarded data and register data, with selectors from data forwarding unit.

## 0.9 PC Navigator

Figure 5 shows the PC Navigator.

#### 0.9.1 Inputs

- Interrupt Bit (1 bit)
- OPCode Signals (2 bits): to check whether the operation is RET or RTI
- Previous SP (32 bits): to increment or decrement it correspondingly to access Data Memory
- MEM OUT (32 bits): loaded PC from memory

#### 0.9.2 Outputs

- Stall Signal (1 bit)
- New SP (32 bits)
- PC Selector (1 bit): to enable PC parallel load from Data Memory
- Loaded PC Value (32 bits)

#### 0.9.3 Logic

- The Operation Checker checks the OPCode Signals and Interrupt Bit to check whether the operation is RET, RTI or Interrupt and produces its signals accordingly:
  - PC Return Enable is set.
  - Counter is set to 0 for RET, 1 for RTI and Interrupt.
  - Operation Selector is issued to the Adder/Subtractor to change the value of SP.
- Stall Counter enables stall for memory stage and all previous stages. It's set to 0 for RET, 1 for RTI and Interrupt.
- Adder/Subtractor is used to update the SP every stall cycle, to get the right data from the memory.
- PC Return issues a signal to the PC Control Unit to enable parallel load from data memory and passes the loaded PC value.
- RET doesn't stall any cycles, as it only loads PC value from data memory.
- RTI stalls only one cycle, as it loads PC value from data memory and then loads CCR.
- Interrupt, also, stalls only one cycle, as it pushes CCR into stack and then pushes PC into stack.

# Part III Instruction Format

# 0.10 One Operand Operations

- 4 bits (1111) for one operand instructions.
- 3 bits to define instruction.
- 3 bits for destination register.
- 1 bit to define the memory slots occupied by the instruction.
- Total of 11 bits, padded with 5 0's to fit 16 bits.

| Table 1. One Operand Instruction Mapping |         |             |       |                 |  |  |  |  |  |
|------------------------------------------|---------|-------------|-------|-----------------|--|--|--|--|--|
| Operation                                | OpCode  | Destination | 16 32 | Conditions      |  |  |  |  |  |
| IN                                       | 1111000 | 000:111     | 0     |                 |  |  |  |  |  |
| NOT                                      | 1111001 | 000:111     | 0     | if !Rdst=0,Z=1  |  |  |  |  |  |
|                                          |         |             |       | if !Rdst<0,N=1  |  |  |  |  |  |
| INC                                      | 1111010 | 000:111     | 0     | if Rdst+1=0,Z=1 |  |  |  |  |  |
|                                          |         |             |       | if Rdst+1<0,N=1 |  |  |  |  |  |
| DEC                                      | 1111011 | 000:111     | 0     | if Rdst-1=0,Z=1 |  |  |  |  |  |
|                                          |         |             |       | if Rdst-1<0,N=1 |  |  |  |  |  |
| OUT                                      | 1111100 | 000:111     | 0     |                 |  |  |  |  |  |

Table 1: One Operand Instruction Mapping

## 0.11 Special Operations

- NOP (0000000000000000).
- END (1110000000000000).

# 0.12 Two Operand Operations

- 4 bits to define instruction.
- 3 bits for each of Rsrc1, Rsrc2 and Rdst.
- 1 bit to define the memory slots occupied by the instruction.
- 16 bits for immediate values.

16 bits

16 bits

• Total of 14 bits in most cases with some exceptions mentioned below.

OpCode Rsrc2 Rdst 16|32Conditions Operation Rsrc1 immSWAP 0001 000:111 000:111 0 ADD 0010 000:111 000:111 000:111 0 if Result=0,Z=1 if Result<0,N=1 SUB 0011 000:111 000:111 000:111 if Result=0,Z=1 0 if Result<0,N=1AND 0100 000:111 000:111 000:111 0 if Result=0,Z=1 if Result<0,N=1OR 0101 000:111 000:111 000:111 if Result=0,Z=1 0 if Result<0,N=1 SHL 000:111 16 bits 1 0110 update carry

000:111

Table 2: Two Operand Instruction Mapping

# 0.13 Memory Operations

000:111

000:111

• 4 bits to define instruction.

0111

1000

SHR

IADD

- 3 bits for destination register.
- 1 bit to define the memory slots occupied by the instruction.
- 16 bits for immediate values.
- 20 bits for effective addresses.
- Total of 8 bits with no immediate values or effective addresses.
- Total of 24 bits with immediate values.
- Total of 28 bits with effective addresses.

flag

flag

update

if Result=0,Z=1

if Result<0,N=1

carry

1

1

Rdst Conditions Operation OpCode EA16|32imm PUSH 1001 000:1110 POP 1010 000:111 0 LDM 1011 000:111 1 16 bits LDD 1100 000:111 20 bits 1 STD 20 bits 1 1101 000:111

Table 3: Memory Instruction Mapping

## 0.14 Branch and Change Control Operations

- 4 bits (0000) for branching instructions.
- 3 bits to define instruction.
- 3 bits for destination register.
- 1 bit to define the memory slots occupied by the instruction.
- Total of 11 bits, padded with 5 0's to fit 16 bits.

Table 4: One Operand Instruction Mapping

| Operation | OpCode  | Destination | 16 32 | Conditions |
|-----------|---------|-------------|-------|------------|
| JZ        | 0000001 | 000:111     | 0     |            |
| JMP       | 0000010 | 000:111     | 0     |            |
| CALL      | 0000011 | 000:111     | 0     |            |
| RET       | 0000100 |             | 0     |            |
| RTI       | 0000101 |             | 0     |            |

# Part IV Control Unit (Signals)

## 0.15 Overview

Control unit is responsible for generating the control signals that are used to activate several operations throughout the pipeline. Also, it's responsible for the extraction of specific information from instruction bits.

- It communicates with:
  - IF/ID buffer: for reading the instruction bits.
  - ID/EX buffer: for writing the appropriate registers, ALUop and signals.
  - Register file: for selecting the registers needed to be read (Rsrc1 and Rsrc2).
  - Hazards units (HDU and Branch Address Unit): for sending enables and needed signals.
- Unit Interface:
  - Inputs:
    - \* Instruction Bits (32 bits)
    - \* Interrupt bit (1 bit)
  - Outputs:
    - \* Rsrc2\_val (32 bits) for immediate values or effective addresses
    - \* Rsrc1\_sel (4 bits)
    - \* Rsrc2\_sel (4 bits)
    - \* Rdst1\_sel (4 bits)
    - \* Rdst2\_sel (4 bits) used only in case of swap
    - \* Branch/IO Enable (2 bits)
    - \* OP2\_sel (1 bit)
    - \* SP Enable (1 bit)
    - \* OpCode (7 bits)
    - \* Branch Enable (1 bit)
    - \* ALUop (4 bits)
    - \* R/W Control Signal (2 bits)
- Interpretation:

- Rsrc2\_val (32 bits): occupies a single place in the ID/EX buffer. However, it's used in many different ways. It can be used as a register value extracted form register file. it can be used as an immediate value extracted from IF/ID buffer. Also, it can hold the stack pointer address, as well as the effective address (EA) sent to the memory for reading or writing.
- Rdst2 (4 bits): only used when dealing with a SWAP instruction, thus we need Op1 and Op2 and their new selectors.
- OP2\_sel (1 bit): determines the value of Rsrc2 register in ID/EX buffer, whether it's immediate or register value.
- Branch/IO Enable (2 bits): informs the register file what operation of these are we executing (No/Out/Branch), however *Branch Enable* (1 bit) interacts with the Branch Address Unit, informing it what type of OpCode are we dealing with (branching or not).

## 0.16 Control Signals

In this section, instructions are divided into seven types based on the signals produced:

- One Operand (not,inc,dec,out,in).
- Two Operands (add,sub,and,or).
- Immediate Operand (iadd,shl,shr,ldm).
- Data (ldd,std).
- Stack (push,pop,call,ret,rti).
- Jump (jz, jmp).
- Special (nop,swap,reset,int).

#### 0.16.1 One Operand Instructions

• IB[31:0] are the instruction bits.

- Inserting (1111) to Rsrc/Rdst selectors informs the register file not to output any register values.
- 'x' indicates don't care.
- 0000 at the ALUop indicates no operation.
- Rsrc1\_sel is the same as Rdst1\_sel.

Table 5: One Operand Instruction Control Signals Part I

| Instruction | OPCode    | ALUop | Rsrc1     | Rsrc2    | Rdst1     | Rsrc2 |
|-------------|-----------|-------|-----------|----------|-----------|-------|
|             |           |       | selector  | selector | selector  | value |
| NOT         | IB[31:25] | 0111  | 0 and     | 1111     | 0 and     | X     |
|             |           |       | IB[24:22] |          | IB[24:22] |       |
| INC         | IB[31:25] | 0001  | 0 and     | 1111     | 0 and     | X     |
|             |           |       | IB[24:22] |          | IB[24:22] |       |
| DEC         | IB[31:25] | 0010  | 0 and     | 1111     | 0 and     | X     |
|             |           |       | IB[24:22] |          | IB[24:22] |       |
| OUT         | IB[31:25] | 0000  | 0 and     | 1111     | 0 and     | X     |
|             |           |       | IB[24:22] |          | IB[24:22] |       |
| IN          | IB[31:25] | 0000  | 0 and     | 1111     | 0 and     | X     |
|             |           |       | IB[24:22] |          | IB[24:22] |       |

Table 6: One Operand Instruction Control Signals Part II

| Instruction | OP2 se- | Rdst2  | Branch | SP En- | Branch | R/W     |
|-------------|---------|--------|--------|--------|--------|---------|
|             | lector  | (swap) | /IO    | able   | Enable | Control |
|             |         |        | Enable |        |        | Signal  |
| NOT         | X       | 1111   | 00     | 0      | 0      | 00      |
| INC         | X       | 1111   | 00     | 0      | 0      | 00      |
| DEC         | X       | 1111   | 00     | 0      | 0      | 00      |
| OUT         | X       | 1111   | 01     | 0      | 0      | 00      |
| IN          | X       | 1111   | 10     | 0      | 0      | 00      |

#### 0.16.2 Two Operand Instructions

• OP2\_sel: 0 the register value and 1 the imm/ea value.

OPCode ALUop Rdst1 Instruction Rsrc1 Rsrc2 Rsrc2 selector selector selector value ADD IB[31:25] 0011 0 0 0 and and and Х IB[27:25] IB[24:22]IB[21:19] SUB IB[31:25] 0100 and 0 and and IB[27:25] IB[24:22]IB[21:19] AND IB[31:25] 0101 0 and and 0 and X IB[27:25]IB[21:19] IB[24:22] OR IB[31:25] 0110 0 and 0 and 0 and  $\mathbf{X}$ IB[27:25] IB[24:22]IB[21:19]

Table 7: Two Operands Instruction Control Signals Part I

Table 8: Two Operands Instruction Control Signals Part II

| Instruction | OP2 se- | Rdst2  | Branch | SP En- | Branch | R/W     |
|-------------|---------|--------|--------|--------|--------|---------|
|             | lector  | (swap) | /IO    | able   | Enable | Control |
|             |         |        | Enable |        |        | Signal  |
| ADD         | 0       | 1111   | 00     | 0      | 0      | 00      |
| SUB         | 0       | 1111   | 00     | 0      | 0      | 00      |
| AND         | 0       | 1111   | 00     | 0      | 0      | 00      |
| OR          | 0       | 1111   | 00     | 0      | 0      | 00      |

#### 0.16.3 Immediate Operand Instructions

- Rsrc1\_sel is the same as Rdst1\_sel, in SHL and SHR cases. However, in IADD case, it's a different register and in LDM case, there's no need for Rsrc, it's just a destination.
- In IADD case, Rsrc! = Rdst.
- In LDM case, there's no Rsrc, it's Rdst.
- Rsrc2\_val is the immediate value extracted from the IF/ID buffer.
- R/W memory (11) is write and (10) is read.

- Sign extend unit is used to adjust the (16 bits) immediate value to (32 bits).
- SE: sign extend enable (0/1).

Table 9: Immediate Operand Instruction Control Signals Part I

| Instruction | OPCode    | ALUop | Rsrc1<br>selector | Rsrc2<br>selector | Rdst1<br>selector | Rsrc2<br>value          |
|-------------|-----------|-------|-------------------|-------------------|-------------------|-------------------------|
| IADD        | IB[31:25] | 0011  | 0 and IB[27:25]   | 1111              | 0 and IB[24:22]   | 0XSE<br>and<br>IB[15:0] |
| SHL         | IB[31:25] | 1000  | 0 and IB[27:25]   | 1111              | 0 and IB[27:25]   | 0XSE<br>and<br>IB[15:0] |
| SHR         | IB[31:25] | 1001  | 0 and IB[27:25]   | 1111              | 0 and IB[27:25]   | 0XSE<br>and<br>IB[15:0] |
| LDM         | IB[31:25] | 0000  | 1111              | 1111              | 0 and IB[27:25]   | 0XSE<br>and<br>IB[15:0] |

Table 10: Immediate Operand Instruction Control Signals Part II

| Instruction | OP2 se- | Rdst2  | Branch | SP En- | Branch | R/W     |
|-------------|---------|--------|--------|--------|--------|---------|
|             | lector  | (swap) | /IO    | able   | Enable | Control |
|             |         |        | Enable |        |        | Signal  |
| IADD        | 1       | 1111   | 00     | 0      | 0      | 00      |
| SHL         | 1       | 1111   | 00     | 0      | 0      | 00      |
| SHR         | 1       | 1111   | 00     | 0      | 0      | 00      |
| LDM         | 1       | 1111   | 00     | 0      | 0      | 11      |

#### 0.16.4 Data Instructions

Note that:

• Effective address does not need a sign extend, that's why it's always zero extended with only 12 bits.

- OP2\_sel is 1 to pass the EA.
- R/W memory (11) is write and (10) is read.

Table 11: Data Instruction Control Signals Part I

| Instruction | OPCode    | ALUop | Rsrc1     | Rsrc2    | Rdst1     | Rsrc2    |
|-------------|-----------|-------|-----------|----------|-----------|----------|
|             |           |       | selector  | selector | selector  | val      |
| LDD         | IB[31:25] | 0000  | 0 and     | 1111     | 1111      | 0x000    |
|             |           |       | IB[27:25] |          |           | and      |
|             |           |       |           |          |           | IB[19:0] |
| STD         | IB[31:25] | 0000  | 1111      | 1111     | 0 and     | 0x000    |
|             |           |       |           |          | IB[27:25] | and      |
|             |           |       |           |          |           | IB[19:0] |

Table 12: Data Instruction Control Signals Part II

| Instruction | OP2 selector | Rdst2 (swap) | Branch<br>/IO<br>Enable | SP En-<br>able | Branch<br>Enable | R/W<br>Control<br>Signal |
|-------------|--------------|--------------|-------------------------|----------------|------------------|--------------------------|
| LDD         | 1            | 1111         | 00                      | 0              | 0                | 10                       |
| STD         | 1            | 1111         | 00                      | 0              | 0                | 11                       |

#### 0.17 Stack Instructions

- Rsrc2\_val is the stack pointer, as it's the address of the operation.
- ALUop's Inc2 and Dec2 are used to manipulate the stack pointer, thus the output of the ALU will be the new stack pointer.
- In case of Call, Rsrc1\_sel is none, as no register is used. It is the PC pushed at the memory.
- In case of Call, Rdst1\_sel, is the register holding the new address.
- In case of Ret and Rti, no registers are affected, as the PC is updated at the fetch stage.

• R/W memory (11) is write and (10) is read.

Table 13: Stack Instruction Control Signals Part I

| Instruction | OPCode    | ALUop | Rsrc1     | Rsrc2    | Rdst1     | Rsrc2 |
|-------------|-----------|-------|-----------|----------|-----------|-------|
|             |           |       | selector  | selector | selector  | val   |
| PUSH        | IB[31:25] | 1011  | 0 and     | 1111     | 1111      | SP(32 |
|             |           |       | IB[27:25] |          |           | bits) |
| POP         | IB[31:25] | 1010  | 1111      | 1111     | 0 and     | SP(32 |
|             |           |       |           |          | IB[27:25] | bits) |
| CALL        | IB[31:25] | 1011  | 1111      | 1111     | 0 and     | SP(32 |
|             |           |       |           |          | IB[27:25] | bits) |
| RET         | IB[31:25] | 1010  | 1111      | 1111     | 1111      | SP(32 |
|             |           |       |           |          |           | bits) |
| RTI         | IB[31:25] | 1100  | 1111      | 1111     | 1111      | SP(32 |
|             |           |       |           |          |           | bits) |

Table 14: Stacks Instruction Control Signals Part II

| Instruction | OP2 se- | Rdst2  | Branch | SP En- | Branch | R/W     |
|-------------|---------|--------|--------|--------|--------|---------|
|             | lector  | (swap) | /IO    | able   | Enable | Control |
|             |         |        | Enable |        | (JZ)   | Signal  |
| PUSH        | 1       | 1111   | 00     | 1      | 0      | 11      |
| POP         | 1       | 1111   | 00     | 0      | 0      | 10      |
| CALL        | 1       | 1111   | 00     | 0      | 0      | 11      |
| RET         | 1       | 1111   | 00     | 0      | 0      | 10      |
| RTI         | 1       | 1111   | 00     | 0      | 0      | 10      |

# 0.18 Jump Instructions

- Rsrc1\_sel is the address we are jumping to, that's why we need to verify that our prediction at the JZ case is correct.
- Branch/IO Enable is (11) as it is a branching instruction.
- Branch enable (1) to detect if the JZ operated correctly.

ALUop OPCode Instruction Rsrc1Rsrc2Rdst1 Rsrc2selector selector selector val IB[31:25] 0000 JMP 1111 1111 1111  $\mathbf{X}$  $\overline{\mathrm{JZ}}$ IB[31:25] 0000 and 1111 1111 X IB[27:25]

Table 15: Jumpers Instruction Control Signals Part I

Table 16: Jumpers Instruction Control Signals Part II

| Instruction | OP2 se- | Rdst2  | Branch | SP En- | Branch | R/W     |
|-------------|---------|--------|--------|--------|--------|---------|
|             | lector  | (swap) | /IO    | able   | Enable | Control |
|             |         |        | Enable |        | (JZ)   | Signal  |
| JMP         | X       | 1111   | 11     | 0      | 0      | 00      |
| JZ          | X       | 1111   | 11     | 0      | 1      | 00      |

# 0.19 Special Instructions

There's no interrupt instruction, but there's a bit called Interrupt, sent to the Control Unit as an input to indicate an interrupt signal was triggered.

Table 17: Specials Instruction Control Signals Part I

| Instruction | OPCode    | ALUop | Rsrc1     | Rsrc2     | Rdst1     | Rsrc2 |
|-------------|-----------|-------|-----------|-----------|-----------|-------|
|             |           |       | selector  | selector  | selector  | val   |
| NOP         | IB[31:25] |       | 1111      | 1111      | 1111      | X     |
| SWAP        | IB[31:25] | 0000  | 0 and     | 0 and     | 0 and     | X     |
|             |           |       | IB[27:25] | IB[24:22] | IB[24:22] |       |
| Reset       | IB[31:25] | 0000  | 1111      | 1111      | 1111      | X     |
| Int         | IB[31:25] | 0000  | 1111      | 1111      | 1111      | X     |

Table 18: Specials Instruction Control Signals Part II

| Instruction | OP2 se- | Rdst2     | Branch | SP En- | Branch | R/W     |
|-------------|---------|-----------|--------|--------|--------|---------|
|             | lector  | (swap)    | /IO    | able   | Enable | Control |
|             |         |           | Enable |        | (JZ)   | Signals |
| NOP         | X       | 1111      | 00     | 0      | 0      | 00      |
| SWAP        | 0       | 0 and     | 00     | 0      | 0      | 11      |
|             |         | IB[27:25] |        |        |        |         |
| Reset       | X       | 1111      | 00     | 0      | 0      | 00      |
| Int         | X       | 1111      | 00     | 0      | 0      | 00      |

# Part V Pipeline Stages

#### 0.20 Overview

This section discusses the 5 stages of our system and their functionalities.

#### 0.20.1 Fetch Stage

- Responsible for fetching the next instruction.
- Can take two cycles in case of 32-bit instructions.
- Contains a branch prediction unit to determine the next address to be fetched in case of branching.
- Outputs the instruction bits into IF/ID Buffer.
- Reads from register file in the second half of cycle.

#### 0.20.2 Decode Stage

- Responsible for decoding the instruction bits into control signals.
- Outputs the corresponding signals to ID/EX Buffer.
- Contains register file to output operand values and register-related operations.
- Determines the correct branch address in case of branching instructions by using Branch Address Unit.
- Reads from register file in the second half of cycle.
- Reads IN port, in case of IN operation and propagates it to be written in Write-Back stage.

#### 0.20.3 Execute Stage

- Responsible for ALU operations.
- $\bullet$  Determines the correct ALU output and pass it with other signals to EX/M Buffer.
- The ALU operations and CCR update are done in the first half of cycle (to avoid JZ branch hazards).

#### 0.20.4 Memory Stage

- Responsible for Data Memory IO.
- Returns stored PC, in case of RET and RTI.
- Stores PC and CCR, in case of interrupt.

#### 0.20.5 Write-Back Stage

- Responsible for passing correct output values to the destination registers.
- Write back is done in the first half of cycle.

#### 0.21 Intermediate Buffers

Each buffer has internal latches. The buffer updates its latches when any input changes, regardless of clock. The buffer outputs the value of the internal latches at the **rising edge** of the clock.

Flush signal takes precedence over stall signal. That's it, if the buffer received both flush and stall, it must flush its internal buffers.

#### 0.21.1 IF/ID Buffer

#### Registers

- Instruction Register (32 bits)
- Next Address Register (32 bits)
- Incremented PC Register (32 bits)
- Hashed Address Register (4 bits)
- Interrupt Register (1 bit)

#### Control Signals

- Flush: clear buffer (1 bit)
- Stall: freeze buffer (1 bit)

# 0.21.2 ID/EX Buffer

### Registers

- Operand Registers (2X32 bits)
- Destination Register (2X4 bits)
- Destination Register Value (32 bits)
- OpCode Register (7 bits)
- ALU Operation (4 bits)
- R/W Register (2 bits)
- Interrupt Register (1 bit)

# **Control Signals**

- Stall (IN): freeze buffer (1 bit)
- Destination Register (OUT) (4 bits)

# 0.21.3 EX/M Buffer

### Registers

- ALUout Register (32 bits)
- MEM IN Register (32 bits)
- Memory Address (32 bits)
- Opcode Register (7 bits)
- Destination Register (2X4 bits)
- Destination Register Value (2X32 bits)
- R/W Register (2 bits)
- Interrupt Register (1 bit)

## **Control Signals**

• Destination Register (OUT) (4 bits)

# 0.21.4 M/WB Buffer

# Registers

- ALUout (32 bits)
- MEM OUT (32 bits)
- OpCode (7 bits)
- Destination Register (2X4 bits)
- Destination Register Value (2X32 bits)

# 0.22 Special Workflows

### 0.22.1 CALL Workflow

- Rdest value is loaded in fetch stage (like branches) and stored in PC.
- The current value of PC is propagated through the pipe, until it reaches the memory stage, where it's stored in data memory.

#### 0.22.2 RET Workflow

- Compiler inserts 3 NOPs after each RET instruction to avoid any hazards.
- Once the RET operation reaches the memory stage it loads the PC value from stack (like a normal POP) and uses PC Navigator to write it to the PC.
- **NOTE:** Data hazards related to SP are handled normally through hazard detection unit.

# 0.22.3 Interrupt Workflow

- Interrupt signal is passed to the PC Control Unit and IF/ID Buffer, the fetch stage is stalled for two cycles to fetch the interrupt address and the Interrupt Bit propagates through the whole pipe, until it reaches the memory stage.
- In the memory stage, the interrupt stalls the pipe one cycles to be able to push both PC and CCR into stack.

### 0.22.4 RTI Workflow

- Compiler inserts 3 NOPs after each RTI instruction to avoid any hazards.
- Once the RTI operation reaches the memory stage it loads the PC value from stack (like a normal POP) and uses PC Navigator to write it to the PC.
- However, RTI stalls the pipe for one cycle to be able to load CCR, too.
- **NOTE:** Data hazards related to SP are handled normally through hazard detection unit.

# 

# 0.23 Structural Hazards

### 0.23.1 Detection

The structural hazard occurs in data memory and register file.

# 0.23.2 Handling

The structural hazard in data memory is solved by using 2 memory units, one for instructions and one for data.

However, structural hazard in register file is handled by forcing the write back to happen in the first half of the clock cycle and register reading from decode and fetch to happen in the second half.

# 0.24 Data Hazards

Figure 6 shows the hazard detection unit.



Figure 6: Hazard Detection Unit Diagram

#### 0.24.1 Detection

## Hazard Detection Unit (HDU)

HDU consists of 3 parts:

- OPCode Checker: checks the opcode of the current instruction to check whether it will cause data hazard or not. Also, it checks for load-use case, in order to activate the stall signal.
- Register Comparator: compares the decode source registers with the destination registers of the execute and memory stages. Also, it compares the execute source registers with the destination registers of the memory stage.
- Output Unit: outputs stall signal in case of load and pop instructions (considering the branch special case). Also, it outputs ALU and decode operands selectors.

# **0.24.2** Handling

#### Stall

Occurs only at Fetch and Decode stage, due to load(pop) use case.

- Fetch same instruction (don't increment the program counter).
- Latch IF/ID buffer with the same values.
- Freeze Decode stage.
- Clear ID/EX buffer.

#### **Data Forwarding**

- EX/MEM buffer -> Execute / Decode.
- ID/EX buffer > Decode.

# 0.25 Control Hazards

### 0.25.1 Detection

The branch address calculation occurs in the Decode stage. So, the hazard might affect only the Fetch stage, which will be flushed in case of wrong address prediction.

# 0.25.2 Handling

- At Fetch stage, always check the branch predictor and calculate the next address accordingly.
- At Decode stage, we have a *Branch Address Unit* that checks whether the OPCode is of a branch operation. If so, it passes the address to the program counter and compares the correct address with the address of the counter to decide whether to flush the Fetch stage or not.

#### Flush

Occurs only at Fetch Stage, due to wrong branch prediction at Decode stage.

- Load new address in the program counter.
- Remove fetched instructions from IF/ID buffer.

#### **Dynamic Branch Prediction**

We use 2-bit branch predictor, which is a hash table of *Finite State Machines* (FSMs) to predict whether the branch will be taken (1) or not (0) at each individual branch address.

# 0.26 Software Solutions

There are some specialized software solutions done by the compiler. It can be summarized in:

• Insertion of 3 NOPs after each RTI or RET operation to avoid unnecessary instruction fetch.

 $\bullet$  Insertion of 1 NOP before each JZ operation to avoid data hazards in CCR.

# Part VII Memory Cache Design

# 0.26.1 Intuition and Assumptions

Since data bus is 16 - bits in width i.e. word, so the following sizes are in terms of words.

Since we are dealing with ONE main memory that contains all the Data and Instructions, and two caches one for Data and the other is for Instructions.

This Design proposes to divide the Main memory into two parts, one for Data and one for Instructions, the upper most part is saved for Data from address 00000000000 to 011111111111, while the Instructions address starts from 10000000000 to 111111111111.

Considering this, the LSB of the address will indicate whether that address is corresponding to an instruction (1) or data (0).

These assumptions allow us to:

- Limit collisions (conflicts) since for each Cache the tag size is reduced to only two bits since the actual address is 10 bits.
- Create an internal pipeline between the instruction cache and data cache, since the instruction cache is used only for reading, and filled periodically, more on that on section Workflow.
- No dirty bit array is used for instruction cache.

### 0.26.2 Caches Size

What is Given:

• Main memory:  $4KB = 2^{12}B = 2^{11}W$ 

• Cache Size: 512Bytes = 256Words

• Block Size: 16B = 8W

• Number of Rows/Slots: 32

• Number of Caches: 2 one for Data and one for Instructions.

Sizes:

• Each Cache is 256Words

- Tags are 2 bits in width.
- Validity is only 1 bit.
- Dirty bit is used only in case of Data cache.
- Extra bits required in total: 32 \* (2 + 1) + 16 \* (1) = 112bits

#### 0.26.3 Workflow

#### General Cache Design

As proposed in the Intuition and Assumptions section, two caches exist in the design.

In order to increase efficiency; we propose this fig.7:

- Initially the instruction cache is filled with the first *Block*.
- The program will start working as usually...
- If No memory instruction is produced, this means the Memory is Ideal..
- The Cache Controller requests more *Blocks* of Instructions from main memory.
- Meanwhile if Data is required from memory to be read or written, it will wait until the current block is retrieved then the required *Data* will be provided.
- Then the *Data* request will be full-filled.
- This means that Data Memory has higher priority unless the instruction cache is empty.

The following algorithm may make things more clear...

What makes this algorithm efficient is these two information, One: the Instruction cache can be written periodically, Two: the Rate of Memory Instruction is very low and the Number of memory instruction (such as SW and LD) is less than other Instructions.

## Algorithm 1: Cache Controller

#### Data Cache Design

In the following flowchart fig.8 we illustrate how read and write operations are handled in our system.

#### Notes:

- These operations are related to *Data* cache.
- When Valid bit equals 0, this means it's not valid, and vv.
- When Dirty bit equals 0. this means the block is not modified, and vv.
- Circle A is responsible for reading from main memory.
- Circle B is responsible for writing into main memory (Replacement).
- Replacement occurs only when read is demanded let's see why:
  - when dirty bit is set, and that exact same block is needed for read or write operation
  - If it's needed for read operation then the old block will be written first, then the ordered block will be read, this is ordinary Replacement.



Figure 7: Cache Design

- and if it's needed for write operation, then if it's not valid or missed ordinary Replacement is requested,
- but if it was valid and tag is matched then we don't need to check the dirty bit data can be overwritten without updating the main memory and the dirty bit is set anyway.



Figure 8: Data Cache R/W

# Part VIII Implementation Overview

# 0.27 Completeness

The processor is completely implemented except for **memory cache system**, which is already designed. However, we ran out of time trying to integrate it to our current implementation.

So, we reverted back to the two memories, *Harvard* architecture. However, we included the cache system design in this report to share our thinking.

# 0.28 Functionality

The implemented processor, with *Harvard* architecture, completely works with correct functionalities and hazard handling.

All instructions are executed successfully, except for:

• **Interrupt**: which suffers from some implementation issues, as PC and CCR values aren't written correctly in data memory.

# 0.29 General Notes

- Fetch stage outputs a single *NOP* instruction, in case of *Interrupt*, *CALL*, *JZ* and *JMP*, due to the delay of reading data (PC values or branch address) from register file.
- Flushing happens only to the fetch stage, in case of incorrect predicted branch address for JZ instruction.
- We used ghdl as a VHDL compiler and GTKWave as a simulator throughout the development process for fast development and debugging. We included our development setup, along with the main deliverables of the project.

# Part IX Implementation Analysis

- 0.30 Provided Test Cases
- 0.31 Complete Hazard Analysis