# MIPS 32-bit CPU & Testbench Coursework

Team 20: Alexandra Neagu, Chenghong Ren, Tanglitong Zhang, TszHang Wong, Zack Reeves

## CPU Design and Architecture

#### Overview

The overall architecture follows a synthesisable MIPS-compatible CPU design, which interfaces with the real-world using a single Avalon compatible memory-mapped bus, that gives it access to memory and other peripherals. So, the CPU issues read and write transactions of instructions and data over the same memory interface.

The focus of our design is to implement a functional Big-Endian CPU with the highest area efficiency possible.

#### **Design Decisions**

The data path shown in  $Figure\ 1$  can easily offer an idea of the CPU's major computational blocks implemented and their interconnection of I/O ports. Scalability, maintenance, regularity, and area efficiency are the main factors we considered during the design process. As "Simplicity favours regularity" [1], our design is constructed to be open to future developments and can be easily debugged and maintained. The following section presents the primary design decisions taken to achieve those factors.

Firstly, the implemented MIPS I instructions [2] can be classified into four different types, each performing distinct procedures according to the CPU's state. Considering this, we decided to create a singular control block that sets the behavioural signals of each specific instruction type to the block components. This highly structured and centralized design allows for easy future upgrading of implementation onto MIPS II. A more in-depth discussion of this decision can be found in a lower section.

Secondly, our design approach pursues a methodology of 5-CPI (cycles per instruction). Alongside implementing additional intermediate registers, this approach presents the opportunity for the CPU to be adapted into a pipelined architecture in the future. Similarly, the waitrequest input port enables the CPU to support all kinds of memory-mapped devices. Therefore, those two characteristics illustrate the high degree of flexibility our CPU design has for real-world integration.



Figure 1: Top Level Datapath of the CPU

Thirdly, we assumed that the compiler always optimises the time efficiency, so all arithmetic operations are implemented only once in the ALU module. Hence the ALU would be used multiple times to operate an instruction. For instance, a branch-and-link instruction uses the ALU for calculating the target address, return address, and its condition.

Additionally, taking the branch delay slot into account, two special registers are added to the PC block in contrast to the standard MIPS CPU. The *branch* register indicates if the previous instruction is a Jump/Branch instruction and the *PC\_Branch\_Address* register holds the branch target address.

#### Instruction classification

As mentioned in the first point, we categorize the instructions into four types according to their repeated usages of some CPU blocks: Arithmetic, Jump, Conditional Branch, and Memory Referencing instructions.

Their behaviour is distributed over the five cycles implemented: FETCH, DECODE, EXECUTION, MEMORY\_ACCESS, WRITE\_BACK. For the first two states, all instructions behave as follows. During FETCH, instructions are inputted in the CPU, stored into the instruction register (IR), and the program counter (PC) gets incremented by four. During DECODE, values from two registers from the register file are prepared, and the return address used by link instructions is computed. The CPU always has the next address (PC + 4) or return address (PC + 8) prepared early in the FSM period, as these values might be used later, which gives rise to a general speed up, and the CPU doesn't receive any delay penalty.

For the remaining three states, operations vary depending on the type of instructions.

#### Arithmetic-type Instructions

These instructions are associated with values stored in registers or immediate values included in the instruction words. During EXECUTION, the corresponding operation is applied to the received inputs within the ALU. The result is then written into the destination register during MEMORY ACCESS.

#### Jump-type Instructions

Unlike other instructions which adapt the value of PC+4, these instructions allow the PC to be rewritten by a specific value. During EXECUTION, the target address is computed or obtained from the immediate operand. This would then be passed to  $PC\_Branch\_Address$  register in MEMORY ACCESS to achieve the jump after the execution of the delay branch instruction.

#### Conditional Branch Instructions

This type of instruction is similar to the Jump-type, but it only rewrites PC if its specific condition is met. Take instruction BGEZ as an example (*Figure 2*). During EXECUTION, the target address is computed. The branch condition is then checked during MEMORY ACCESS to determine whether the computed target address should write into the *PC\_Branch\_Address* register. Thus, deciding to complete the branch or not.

If Jump or Conditional Branch Instructions are followed with a link, then during EXECUTE, the CPU would store the precomputed return address (from DECODE stage) into register \$31 to return after exiting a subroutine.

#### Memory Reference Instructions

These instructions involve the references and modifications of data in the memory. During EXECUTION, the target address is computed. In the following states, the data in that address would either be read out or rewritten by the given data depending on whether it is a store or load instruction. For load instructions, the value read from the memory unit is then loaded into the corresponding register during WRITE BACK.



Figure 1 Step by step visualisation of the 5 cycles, BGEZ

## **CPU Performance**

Our CPU has been tested with various sized test cases. Thus, it allowed us to notice the advantages and issues of our design. Firstly, the CPU has a fixed 5-cycle FSM as it is a general-purpose CPU, and it is expected to execute a mixture of memory-reference (5 cycles) and arithmetic (4 cycles) instructions. If most instructions were arithmetic, then a volatile FSM would be more efficient because it would skip unnecessary cycles for specific short instructions. However, in our case, the cost of frequently changing the total number of stages according to the instruction may overwhelm the benefit it brings.

Additionally, we have simulated our CPU using Intel's Quartus Prime software which allowed us to evaluate a timing and area analysis. We have chosen the FPGA device with the most similar specification, Cyclone IV E, to simulate the closest performance to real-world implementation of our CPU. The results of the achieved timing analysis show that the simulated device had a conservative lower-bound state machine frequency of 3.92MHz and clock frequency of 103.3MHz, that was accomplished under worse-case conditions of 1200mV and  $85^{\circ}C$ . This result could highly be affected by the critical path which requires 127ns to complete. This longest delay path pointed out by Quartus can also be seen in  $Figure\ I$  as staring at the output of memory and ending at input  $ALU\_MULTorDIV$  of registers HI, LO. So, it signifies the path of the div(u) or mult(u) instructions computed combinatorically. This leaves time-related decisions to the compiler; hence, we don't have control over its time efficiency. A possible improvement could be dividing the 32-bit words into smaller variables and applying the operation sequentially. Thus, allowing us a degree of control over the computation speed.

This simulation proves that our overall focus was the area efficiency, as the CPU utilises 9,180 logic elements out of 15,408 on the FPGA. We have tried performing the simulation onto other modules and sized devices, and we noticed that we are achieving the best result for this model of FPGA. This can be a result of the synthesis, pin, area, and timing constraints of the FPGA, which determine the best performance due to the high compatibility between the CPU's and the device's specifications. [3]

## **CPU Testing and Testbench**

To assert the correctness and clear out all the potential hazards of our CPU, we created two testing methods.

#### 1. Testbench

To ensure each instruction performs correctly, we have created a testbench that reviews the output of different test programmes. The tests cover both general cases for the instructions and edge cases that have a higher chance of getting the CPU to output the wrong value.

Our testbench design expects that all test cases must have the last executed value stored in register \$v0 (register of index 2) for an accurate check of the instructions called. Also, the programs must include a jump instruction which will set the PC to address 0x0, hence signalling the ending of the program and setting the CPU's active signal to low. Therefore, the checking process is completed by comparing the value of register \$v0 to an expected reference point for the specific test case.

Besides overviewing the transfers between the CPU and the Memory, the testbench also sets the clock, reset flag and time duration restrictions. Hence, the testbench observes whether the CPU fails to complete a test program in the time required (10,000 cycles) and halts the process resulting in aborting the test case and will send out an error code.

So, as shown in Figure~3, following the check of a successful compilation, the comparison between the given and expected output can lead to one of the following situations:

- The test program fails due to the CPU not finishing the program in the expected amount of time
- The test program fails due to a different CPU output result than the expected value
- The test program successfully passes



Figure 2 Testing procedure

Another architectural choice made is our approach for building a memory of size  $2^{18}$  bytes that signifies  $2^{16}$  word addresses. The testbench uses two separate files for initialising the data and instructions' machine code. So, to try avoiding address clashes during the creation of the virtual memory, the initial memory is divided into two halves. The first half between addresses 0x00000004 and 0xBFBFFFFF is assigned to data, and the second half from address 0xBFC00000 onwards is designated for instructions.

We also decided to implement the memory interface to be human-readable. Having four bytes per line in the data files denote the expected big-endian formatting of the instructions as used by the CPU (after endianness conversion). By implementing this, the memory interface converts the reader-friendly file to a memory respecting the small endianness requirements. While not affecting the 32-bit inputs and outputs of the CPU, these implementations offer the opportunity of easily enlarging the testbench and debugging errors of the CPU.

Additionally, the implementation of randomising the value of an active high waitrequest signal creates a more real world-like behaviour of the memory interface. This further checks for the versatility of the tested CPU.

Finally, the  $test\_mips\_cpu\_bus.sh$  script merges the testbench files by calling all the necessary commands for compilation, execution, and comparison of the output files. The script is run with the below formats:

| test_mips_cpu_bus.sh | directory_path_of_cpu | $specific\_instruction\_to\_test$ | Tests all testcases available for a specific instruction |
|----------------------|-----------------------|-----------------------------------|----------------------------------------------------------|
| test_mips_cpu_bus.sh | directory_path_of_cpu |                                   | Test all testcases for all instructions                  |
| test_mips_cpu_bus.sh | help                  |                                   | Prints the available testcases                           |

### 2. Complex test cases

To ensure that the CPU could perform correctly even in a complex scenario, we crafted a carefully designed test case that consists of a mixture of all instruction types, and more importantly, complex branching logics (including a subroutine and consecutive jump).

Throughout building our CPU, we have created multiple similar test cases. They tested a group of similar instructions, such as a sequential shifting test case, an edge-case test for multiplication and division instructions, a loop conditional branching program, etc.

These test cases are not in the general testbench file as they could not be classified under one specific instruction.

# References

- [1] F. K. Gürkaynak and M. Püschel, "MIPS Assembly," 2013. [Online]. Available: https://syssec.ethz.ch/content/dam/ethz/special-interest/infk/inst-infsec/system-security-group-dam/education/Digitaltechnik\_14/15\_MIPS\_Assembly.pdf. [Accessed 16 Dec 2021].
- [2] C. Price, "MIPS IV Instruction Set," Sept 1995. [Online]. Available: https://www.cs.cmu.edu/afs/cs/academic/class/15740-f97/public/doc/mips-isa.pdf. [Accessed 20 Nov 2021].
- [3] R. C. Cofer and B. F. Harding, "Design Constraints and Optimization," in *Rapid System Prototyping with FPGAs*, Newnes, 2011.
- [4] Whimsical Flowcharts, "Diagrams and Flowcharts," Whimsical, 2021. [Online]. Available: https://whimsical.com/. [Accessed 13 Dec 2021].
- [5] D. A. Patterson and J. L. Hennessy, Computer Organisation and Design, Elsevier Inc., 2014.