

## Microelectronic Systems

# DLX Microprocessor: Design & Development Final Project Report

Master degree in Electronics Engineering Master degree in Computer Engineering

Referents: Prof. Mariagrazia Graziano, Giovanna Turvani

Authors: group 9 Nicole Dai Prà, Leonardo Izzi

October 19, 2020

# Contents

| 1 | Intr | roduction                               | 1         |
|---|------|-----------------------------------------|-----------|
|   | 1.1  | General architecture                    | 1         |
|   | 1.2  | Supported ISA                           | 2         |
|   | 1.3  | Design objectives                       | 3         |
| 2 | IF s | stage                                   | 4         |
|   | 2.1  | Branch Target Buffer                    | 5         |
|   |      | 2.1.1 Fully associative cache           | 6         |
|   | 2.2  | Program counter selection               | 7         |
| 3 | ID : | $\operatorname{stage}$                  | 8         |
|   | 3.1  | Register file                           | 8         |
|   | 3.2  | Sign extender                           | 9         |
|   | 3.3  | Selection muxes                         | 9         |
| 4 | EX   | E stage                                 | 11        |
| 5 | ME   | M stage                                 | 12        |
| • | 5.1  | 8                                       | 12        |
|   |      |                                         | 13        |
|   | 5.2  | - · · · · · · · · · · · · · · · · · · · | 14        |
| 6 | WB   | f S stage                               | 15        |
| 7 | Cor  | ntrol Unit                              | 16        |
| • | 7.1  |                                         | 16        |
|   | 7.2  |                                         | 17        |
|   | 7.3  |                                         | 17        |
|   | ,    | <u> </u>                                | - ·<br>17 |
|   |      |                                         | 19        |
|   |      |                                         | 19        |
|   |      |                                         | 19        |
|   | 7.4  |                                         | 19        |
|   |      | ě                                       | 20        |
|   |      |                                         | 20        |
|   |      |                                         | 20        |
|   | 7.5  | v                                       | 21        |
|   | 7.6  |                                         | 21        |

| _ | Microelectronic Systems      | ii |
|---|------------------------------|----|
| 8 | Synthesis and Implementation | 22 |
| 9 | Conclusions                  | 23 |

## Introduction

#### 1.1 General architecture

The proposed DLX is a 5-stage, MIPS-based, scalar processor. Its implementation respects the MIPS ABI and O32 calling convention, it supports all the basic instructions, most of the ones proposed as PRO and a few 32-bits ones coming from the MIPS64 architecture [2]. The DLX also features a 32 entries BTB, a 64 entries four-way associative data cache and a forwarding unit. To support the ABI and the extended instruction set we had to modify the file dlxasm.pl. By going a little bit more in details, here there is a summary of what each stage does:

- 1. **IF** stage: here is where the *program counter*, also referred to as PC, is stored. This stage contains the logic to update the value of the next PC and to choose the right PC at each clock cycle.
- 2. **ID stage**: in this stage the instruction fetched with the PC calculated the cycle before is decoded and the register file is accessed for reading. In case an instruction has an immediate field it is extracted and properly extended.
- 3. **EXE stage**: based on the decoding happened the cycle before, one of the ALU's functional units is activated to perform the requested calculation. It is worth noting that the ALU is capable of executing an instruction like jalr in a single clock cycle instead of the 2 specified in the MIPS architecture manual [2]. Another exception is the mult instruction which works with the *integer register file* (also referred to as RF) and it is a multi-cycle operation.
- 4. **MEM stage**: if the operation in this stage is a load or a store the cache is accessed, and in case of a load shorter than 32 bits sign extension is performed (if required). On the other hand, if an instruction does not fall in the two categories outlined before, its result is simply propagated to the next stage.
- 5. **WB stage**: in this stage all the operations that produce a result to be stored in the RF can do so. The result to be written is chosen among the one coming from the cache and the one coming from the ALU.

The datapath is managed by the *control unit*, which has been implemented as a mix of two approaches: the hardwired and the FSM. The reasoning behind this choice is explained in chapter 7. Within the control unit there is also a stall unit, whose job is to detect any possible hazard, and to either enable data forwarding or to force a bubble in the pipeline for stalling the processor.

Outside the control unit and the datapath there is the *memory controller*, which handles the communication between the RAM and the cache as well as cache misses.

### 1.2 Supported ISA

In table 1.1 are shown all the supported instructions:

| Name  | Opcode | Func | Type | Name  | Opcode | Func | Type |
|-------|--------|------|------|-------|--------|------|------|
| nop   | 0x00   | 0x00 | R    | sll   | 0x00   | 0x00 | R    |
| srl   | 0x00   | 0x06 | R    | sra   | 0x00   | 0x07 | R    |
| jr    | 0x00   | 0x08 | R    | jalr  | 0x00   | 0x09 | R    |
| mult  | 0x00   | 0x0E | R    | mfhi  | 0x00   | 0x10 | R    |
| mflo  | 0x00   | 0x12 | R    | add   | 0x00   | 0x20 | R    |
| addu  | 0x00   | 0x21 | R    | sub   | 0x00   | 0x22 | R    |
| subu  | 0x00   | 0x23 | R    | and   | 0x00   | 0x24 | R    |
| or    | 0x00   | 0x25 | R    | xor   | 0x00   | 0x26 | R    |
| seq   | 0x00   | 0x28 | R    | sne   | 0x00   | 0x29 | R    |
| slt   | 0x00   | 0x2A | R    | sgt   | 0x00   | 0x2B | R    |
| sle   | 0x00   | 0x2C | R    | sge   | 0x00   | 0x2D | R    |
| sltu  | 0x00   | 0x3A | R    | sgtu  | 0x00   | 0x3B | R    |
| sleu  | 0x00   | 0x3C | R    | sgeu  | 0x00   | 0x3D | R    |
| bgez  | 0x01   | 0x00 | I    | bltz  | 0x01   | 0x00 | I    |
| j     | 0x02   | 0x00 | J    | jal   | 0x03   | 0x00 | J    |
| beq   | 0x04   | 0x00 | I    | bne   | 0x05   | 0x00 | I    |
| blez  | 0x06   | 0x00 | I    | bgtz  | 0x07   | 0x00 | I    |
| addi  | 0x08   | 0x00 | I    | addui | 0x09   | 0x00 | I    |
| subi  | 0x0A   | 0x00 | I    | subui | 0x0B   | 0x00 | I    |
| andi  | 0x0C   | 0x00 | I    | ori   | 0x0D   | 0x00 | I    |
| xori  | 0x0E   | 0x00 | I    | beqz  | 0x10   | 0x00 | I    |
| bnez  | 0x11   | 0x00 | I    | slli  | 0x14   | 0x00 | I    |
| srli  | 0x16   | 0x00 | I    | srai  | 0x17   | 0x00 | I    |
| seqi  | 0x18   | 0x00 | I    | snei  | 0x19   | 0x00 | I    |
| slti  | 0x1A   | 0x00 | I    | sgti  | 0x1B   | 0x00 | I    |
| slei  | 0x1C   | 0x00 | I    | sgei  | 0x1D   | 0x00 | I    |
| lb    | 0x20   | 0x00 | I    | 1h    | 0x21   | 0x00 | I    |
| lw    | 0x23   | 0x00 | I    | lbu   | 0x24   | 0x00 | I    |
| lhu   | 0x25   | 0x00 | I    | sb    | 0x28   | 0x00 | I    |
| sh    | 0x29   | 0x00 | I    | sw    | 0x2B   | 0x00 | I    |
| sltui | 0x3A   | 0x00 | I    | sgtui | 0x3B   | 0x00 | I    |
| sleui | 0x3C   | 0x00 | I    | sgeui | 0x3D   | 0x00 | I    |

Table 1.1: Supported instructions

It is worth pointing out that nop and sll have exactly the same opcode and func fields, because the nop actually corresponds to sll \$zero, \$zero, \$zero ([2]), which is an instruction that achieves nothing. Also bltz and bgez share the same opcode (as stated in [2] for release 1 up to 5), as their difference lies in the instruction's bits 20-16: they are set to 0b00000 for the former one and to 0b00001 for the latter one. Finally, jr and jalr are considered R instructions as specified in the manual [2].

### 1.3 Design objectives

Our goal during the design phase was to deliver a fast processor and to cut latencies of the most used general purpose instructions. Our frequency target was to reach 1 GHz, and to be able to execute jal and jalr in a single clock cycle instead of the 2 specified in [2]. We also felt that was important to provide to a potential user a complete instruction set, therefore we have introduced also mult, which is normally executed by the FPU. As discussed in chapter 4, achieving high clock frequencies with a multiplication unit is not trivial and some compromises had to be done in term of latencies for this particular operation. We decided to trade the presence of the operation with latency, because as stated in [1] multiplications accounts only for the 0.02% of the utilized instructions.

# IF stage

In this chapter we will show how the IF stage is structured and how each block works. In figure 2.1 the block diagram of the stage is shown.



Figure 2.1: Block diagram of the IF stage

#### There are:

- 1. a register storing the value of the current PC.
- 2. A mux that selects among its 4 inputs which will be the next PC value.
- 3. An adder that calculates  $current\ pc+4$
- 4. A Branch Target Buffer (BTB) able to recognize and predict the behavior of known branches and jumps
- $5.\ A$  32-entries,  $32\ \mathrm{bits}$ , fully associative cache that it is used by the BTB to recognize known branches and jumps

6. A subtractor calculating  $btb\ pc\ exe-4$ , that is the PC of the instruction in the EXE stage. It is used when updating the BTB content.

In this stage the PC is used on 30 bits instead of 32 because instructions are aligned on 32 bits boundaries, therefore the 2 LSBs are always 0 and it was useless to add more HW to handle them.

#### 2.1 Branch Target Buffer

The branch target buffer is a hardware structure used to known in the IF stage if the instruction being fetched is a known branch or jump. It does so by accessing in read mode its cache using as address the current pc (this connection is not shown in figure 2.1). For the differences between read mode and write mode of the cache refer to subsection 2.1.1. Our BTB operates over 3 values, as shown in figure 2.2.



Figure 2.2: Branch target buffer values

- 1. The *instruction address* is the memory address where the instruction resides. In our implementation corresponds to the TAG value in the cache.
- 2. The *target address* is where the PC branch or the jump would go in case the jump condition is verified.
- 3. The *prediction bits* are used by the FSM inside the BTB to predict whether a branch will be taken or not. The state machine of the prediction bits is shown in figure 2.3.

The target address, concatenated with the prediction bits, form a 32 bit value which corresponds to the data stored inside the cache.

Now let's take a closer look to the FSM implementation.



Figure 2.3: Branch target buffer FSM representation

There are two entry points in the FSM: one that enters the state 01 and one that enters the state 10. The former is used to add a new branch as non-taken, while the latter is used for the opposite. In the delivered implementation only the *taken* entry point is used, but since at the beginning of the

development we did not know if we would have added also non-taken branch to the BTB, we felt that this degree of flexibility would have been nice to have.

Besides this particular, the FSM is straightforward: the 2 bits are used to store the state of a branch, and their values have the following meanings:

- 1. 00: strongly not taken.
- 2. 01: weakly not taken.
- 3. 10: weakly taken.
- 4. 11: strongly taken.

The value of btb\_taken is 0 for the first two cases and 1 for the last two, provided that the address that is being considered is known to the cache (that is, we have hit = 1 in read mode).

The BTB of course needs to be updated every time a branch is executed or discovered. This is achieved by writing in btb\_update 01 to update only the history bits or 10 to add a new branch, by concatenating the target address to the history bits and by setting the cache's write address to the instruction's PC.

#### 2.1.1 Fully associative cache

A key component of the BTB is the cache, where as already said all the known branches and jumps are stored. We decided to use a 32-entries fully associative cache to have the highest possible hit rate, since every miss could hurt the pipeline's performance. Its general, simplified structure is shown in figure 2.4.



Figure 2.4: Cache general representation

This cache support two read operations and a write operation per clock cycle and implements a FIFO replacement logic. Each cache line has a hit detector, which is used to detects a hit in case of a read, and to issue a line update (only the data is updated, not the tag) if requested by the BTB. The hits signals then enter two 32-entries encoders which in turn drive two 32-entries muxes, used to output the data from the cache. To illustrate exactly how a cache line and a hit detector work we'll refer to picture 2.5.

A cache line is composed of a 30 bits tag field, that stores the PC of the branch/jump, a validity bit used to discriminate invalid data after a reset and a 32 bits data field that keeps the value of the target address along with the history bits. The tag enter two equality comparators to check if a match exists either with the read address, the write address, or both. To determine if a hit has occurred this is not sufficient though, because the data stored in the tag may be invalid. To solve this issue an AND is executed between the validity bit and each comparator's output. The write part has an additional AND between its hit value and the update\_data signal to determine if this line have to update its data field.



Figure 2.5: Cache line and hit detectors

## 2.2 Program counter selection

There are 4 possible next program counter values to choose at each clock cycle:

- 1. PC + 4
- 2.  $PC_{BTB}$
- 3.  $PC_{main\ adder}$
- 4.  $PC_{secondary\ adder}$

PC+4 is the default choice when the BTB doesn't recognize the address and the EXE have not executed a branch or a jump.  $PC_{BTB}$  is taken whenever the BTB recognizes the address and the EXE have not executed any branch or jump.  $PC_{main\ adder}$  is used when the EXE have executed a jr or jalr. Finally,  $PC_{secondary\ adder}$  is the value used when any branch or jump except jr and jalr is executed in the EXE stage.

# ID stage

In the ID stage instructions are decoded and the register file is accessed. Its block diagram is shown in picture 3.1



Figure 3.1: Block diagram of the ID stage

## 3.1 Register file

The register file is one of the two main components in this stage. It has 2 read ports and 3 write ports, although a write operation never use all of them at the same time. It contains 32 general purpose registers (GPR), as stated by the ABI, and 2 additional registers called 10 and hi, accessible only by special instructions and used to store multiplications' results. Among the GPR there is a special register, the register number 0, that cannot be written and always contains the value 0x00000000. The full list of the ABI-compliant registers is shown in table 3.1:

In the ID stage the RF is accessed in read, since it is the WB stage to access it in write. To select

| Name      | Number | Use                                                   | Directly accessible |
|-----------|--------|-------------------------------------------------------|---------------------|
| \$zero    | 0      | constant 0                                            | Yes                 |
| \$at      | 1      | assembler temporary                                   | Yes                 |
| \$v0-\$v1 | 2-3    | values for function returns and expression evaluation | Yes                 |
| \$a0-\$a3 | 4-7    | function arguments                                    | Yes                 |
| \$t0-\$t7 | 8-15   | temporaries                                           | Yes                 |
| \$s0-\$s7 | 16-23  | saved temporaries                                     | Yes                 |
| \$t8-\$t9 | 24-25  | temporaries                                           | Yes                 |
| \$k0-\$k1 | 26-27  | reserved for OS kernel                                | Yes                 |
| \$gp      | 28     | global pointer                                        | Yes                 |
| \$sp      | 29     | stack pointer                                         | Yes                 |
| fp        | 30     | frame pointer                                         | Yes                 |
| \$ra      | 31     | return address                                        | Yes                 |
| \$10      | 32     | stores the lower 32 bits of a multiplication          | No                  |
| \$hi      | 33     | stores the higher 32 bits of a multiplication         | No                  |

Table 3.1: DLX registers. Taken from [3]

the registers the bits 25-21 and 20-11 are used, because they represent the rs and rt fields in an R instruction. These values are not the only ones used by the RF to decide which values its read ports should take: as illustrated in table 3.1, there are two non-directly accessible registers, lo and hi. To perform the final output selection two control signals are used, rp1\_out\_sel and rp2\_out\_sel, that drive 2 internal muxes that choose the value of the read ports as follows:

1. 00: output the selected GP register.

2. 01: output the lo register.

3. 10: output the hi register.

4. 11: output 0x00000000.

## 3.2 Sign extender

This component is used by I and J instructions to extend their immediate fields on 32 bits. In fact, I instructions have their immediate stored in the lower 16 bits of the instruction, while the J have it stored on the lower 26 bits. Moreover, these bits could represent a signed or unsigned value, like in the case of a addi and a addui, therefore this has to be taken in account during the extension. As shown in figure 3.1, two controls signals enter the sign extender: is\_signed and sign\_ext\_sel. The former is used to extend either with 0s or taking into account the sign, while the latter is used to decide if the extension should consider only 16 bits or 26.

#### 3.3 Selection muxes

In the rightmost part of figure 3.1 3 muxes 2x1 can be seen. In a standard MIPS pipeline these would reside in the EXE stage, but since it is the slowest stage of the pipeline we decided to shorten its critical path by bringing them in the ID stage. The first mux is used to select between the read port 1 of the RF and the program counter coming from the stage before, while the second mux chooses among the RF's read port 2 and the immediate value coming from the sign extender. The output of these two muxes form respectively the a and b inputs of the EXE stage. The third mux is used to

select the output of the RF's read port 2, which is the data that would go inside the memory when executing a store, and the npc, that is used to execute in a single cycle a jalr instruction.

# EXE stage

## MEM stage

In this stage memory is read during a load and it is written during a store. Its block scheme is shown in figure 5.1.



Figure 5.1: Block diagram of the MEM stage

The lowest mux is used when in the MEM stage when an operation different from a load and a store is executed. It selects alu\_out\_low\_in when in the RF the data coming from the ALU must be stored, otherwise it selects op\_b\_in, used by jalr and jal instruction to save in the RF the value of the program counter.

#### 5.1 Data cache

The main datapath component of this stage is the data cache. It is a 64-entries four-way associative cache, able to perform two reads, two writes, or one read and one write per clock cycle. This is due to the fact that the cache has two interfaces, one connected to the processor and another one connected to the memory controller. In figure 5.2 it is shown the block diagram for the processor's side.

As it can be seen, the cache's lines are divided among four blocks. Each line in a block corresponds to a different set, therefore a set is formed by considering all the lines having the same number. The



Figure 5.2: Dcache's processor side

structure of the line is the same as the one shown in figure 2.5, but with a difference: if the data being stored comes from the CPU it is saved in big endian. In fact, the DLX is expected to work in big endian, therefore every time data comes out the CPU it must do so in this format. To reuse some modules we already had, however, we preferred to internally work in little endian. We've identified the cache as the best place to perform DLX-RAM data conversion transparently.

Moving on, each block has its own logic to detect if the line corresponding to the chosen set is generating a hit. The detection logic works in the same way as in the fully associative cache explained in section 2.1.1 and will not be discussed again. The four hits signal generated by the blocks then enter a priority encoder, which generate a suitable encoding for the data mux. Moreover, these signals converge into a 4-inputs OR gate that outputs the final hit signal.

#### 5.1.1 Replacement logic

On top of the cache's blocks there is the replacement logic, shown in figure 5.3.

This unit receives in feedback the hits signal generated by the four blocks and the encoding value for the data mux, moreover receives either from the memory controller or the control unit an update signals. These information are combined together to decide how and if to update or replace the data contained in the cache. When an update is requested for data not present yet in the cache and there are free blocks, a FIFO scheme is used and both replacement and dirty\_bit are low. On the other hand, when an update is requested for data already known to the cache, ram\_out\_enc forwards the value of the input mux\_enc and it keeps replacement low. However, if new data must be written and there's no room in the blocks, an eviction must be performed. In this case the signal replacement is



Figure 5.3: Replacement logic

raised to 1. Also the dirty\_bit is set to 1 since we are considering the case in which all the blocks are used. Hence, the AND goes to 1 and the both the control unit and the memory controller receive it. For the explanation of how cache eviction and cache misses are handled please refer to section 7.4.

#### 5.2 Sign extender

The DLX, as seen in table 1.1, supports many types of load operation: with or without sign extension, and on different sizes of data. The job of the sign extender is therefore the one of taking in input the data read from the cache and to extend it properly (if necessary). It is driven by the control unit through the ld\_sign anf ld\_type signals. The former is set to 0 when load is unsigned, to 1 when it is signed. The latter is a 2-bit signals that specifies from where 0-extension or sign-extension should take place. The possible values are:

- 1. 00: use 32 bits. In this case no operation is performed on the incoming data.
- 2. 01:use 16 bits.
- 3. 10: use 8 bits.
- 4. 11: reserved. In this particular implementation it works in the same way as 00.

# WB stage

This is the last stage of the pipeline. It is conceptually part of the ID stage since it drives the write signals of the register file. It is composed only by a 2x1 mux, which decides the value of the wp port of the register file shown in figure 3.1. The choice is made among the data read from the cache in the MEM stage or the one coming from a calculation performed in the EXE stage. It also drives:

- 1. wp\_en, set to 1 when an operation has to store data in a GP register.
- 2. hilo\_wr\_en, set to 1 when the result of a multiplication has to be stored in lo and hi.
- 3. wp\_addr, which is the value of the destination register.

## Control Unit

The DLX's control unit is mainly based on the hardwired approach, although some operations are handled by in-stage FSMs. The former approach has been chosen because most of the control signals of an operation never changes during its execution. For example, for an add instruction the value of rp1\_out\_sel is always 00 and will never change, no matter what happens in the pipeline. Because of this, reading the values from a control register and writing them directly in the component is the fastest and cheapest way to control the datapath. However, not all the signals can be known a-priori, or they need to change over time to handle complex operations: the former case is represented by the forwarding signals for example, while the latter from the multiplication. Due to this we had to resort to in-stage FSM able to correctly manage these situations.

The control unit is by far the most complex component in the DLX, therefore to ease the explanation the rest of the chapter is divided in sections, each of which explains what happens in the related stage.

## 7.1 IF stage

The PC register in the datapath is used to access the *instruction ROM*, or IROM, where all the instructions to be executed are stored. The IROM's output is then sent to both the control unit, to analyze the instruction's *opcode* and *func* fields, and to the IF/ID registers for decoding the rest of the fields in the ID stage. The control units hosts two ROMs: one contains all the control words (also referred to as *CW*) related the I and J instructions, while the other one stores all the CW belonging to the R instructions. The former is accessed using the opcode field, the latter using the func field. This distinction had to be done because all R instructions have the same opcode, which is equal to 0x00.



Figure 7.1: bltz and bgez instruction format. Taken from [2]

The control unit recognize this pattern and it chooses the right CW for each instruction. Nonetheless this is not the only particular case. As already shown in table 1.1, bgez and bltz shares the same opcode. Their difference reside in the bits 20-16, which are set to 00001 for the former, to 00000 for the latter (figure 7.1). This information cannot be used to differentiate the branches directly in the CW's ROM, hence their final CW is generated dynamically.

This part of the control unit also decide from which source must come the next value of the PC register. Four cases have to be distinguished:

- 1. The EXE stage have not executed a branch and the BTB predicts as not taken the current fetched instruction: in this case PC + 4 is chosen.
- 2. The EXE stage have not executed a branch and the BTB predicts as taken the current fetched instruction: in this case  $PC_{BTB}$  is chosen.
- 3. The EXE has executed a jr or a jalr: the prediction of the BTB is discarded and  $PC_{main\ adder}$  is chosen.
- 4. the EXE has executed a branch or a jump different from jr or jalr: the prediction of the BTB is discarded and  $PC_{secondary\ adder}$  is chosen.

### 7.2 ID stage

The bits used to drive the ID stage are all static, therefore they are directly read from the control unit's IF/ID register and are sent to the datapath. Here is it worth commenting how this stage interacts with the EXE stage. When a multiplication is in progress in the EXE stage it is the EXE stage's FSM to drive the ID/EXE registers, therefore all the ID's signals going to the ID/EXE registers are set in high impedance to avoid conflicts.

The ID stage is also in charge of detecting a mult instruction, and when it does so it may need to stall the pipeline. Unfortunately, when the multiplication's operand are sent to the ID/EXE stage, they are automatically extended on 64 bits by the registers themselves. This means that a mult cannot enters the EXE stage if one of its source operands is still being processed in the pipeline. Due to this limitation, when an hazard is detected for a multiplication is detected, the control unit's ID stage in conjunction with the *stall unit* (explained in section 7.6) stalls the IF and the ID stage, until all the data hazards have been resolved.

### 7.3 EXE stage

This is the biggest stage in the control unit's pipeline. It contains an FSM to handle the multiplication, which is the only instruction that requires more than a single clock cycle to execute. Its state diagram is shown in figure 7.2.

#### 7.3.1 NORMAL\_OP\_EXE

This is the reset state and also the one where all the operations but mul are executed. Here few checks are executed, therefore to ease the reader's understanding we have provided a flow diagram, shown in figure 7.3.

The first check performed is whether the operation currently in the EXE stage is a multiplication or not. In the former case the EXE stage switch state as shown in figure 7.3 and stalls the pipeline, while in the latter case the state remains unchanged. If the operation is not a multiplication, the other notable case is to check if the instruction is a branch or jump. To do so, it is controlled if



Figure 7.2: EXE stage state diagram



Figure 7.3: NORMAL\_OP\_EXE flow diagram

the BTB knew the instruction's address in the IF stage or not. If the address was known the only remaining check to be performed is whether the prediction was correct or not. In the first case the BTB prediction is updated, otherwise besides the BTB's update the pipeline must be flushed and the PC changed. If, instead, the address was not known, the FSM has to figure out if the instruction is a new branch or jump or not. In the latter case, besides writing as always the static control bits,

nothing is done. On the other hand, in the former case the address is added to the BTB, the pipeline is flushed and the PC is updated.

#### 7.3.2 A\_NEG\_SAMPLE

This is the first state where the multiplication's execution starts. The EXE stage takes control over the ID/EXE registers, and uses the main adder to negate the value of the operand a on 64 bits. The next state of the FSM is automatically set to MUL\_IN\_PROG.

#### 7.3.3 MUL\_IN\_PROG

The FSM iterates over this state 15 times, using the multiplier's adder to calculate each time the partial result. As soon as the 15th iteration is executed, the next state is changed to MUL\_END. In this state the EXE still controls the ID/EXE registers.

#### 7.3.4 MUL\_END

This state is entered when the final result of the multiplication must be sampled by the EXE/MEM registers. After the result has been sampled, the EXE remains in this state (keeping therefore the pipeline locked) until the multiplication result has reached the WB stage. When this happens, the next state of the FSM is set to NORMAL\_OP\_EXE and the pipeline is unlocked. Again, this shows a limitation of the mult: its result cannot be forwarded to mflo and mfhi because it is on 64 bits and with out forwarding logic only 32 bits can be forwarded. Hence, to maintain data flow correctness the multiplication's result must be sampled in the RF before the next instruction can enters the EXE stage.

### 7.4 MEM stage

This stage, as all the previous ones, sends the static control bits to the datapath. When operations different from loads and stores pass through this stage, nothing more is done. Special attention must be given to stores and, more importantly, to loads. In fact, the formers may generate cache evictions, while the latter ones may generate cache misses. Thanks to the design of the data cache and of the memory controller cache eviction do not cause stalls, but cache misses do. Therefore, also in the MEM stage a FSM exists, of which the state diagram is shown in figure 7.4.



Figure 7.4: MEM stage state diagram

#### 7.4.1 NORMAL\_OP\_MEM

The FSM starts in this state and remains always here for all operations but loads. When a load is executed, the content indexed by the address may have been swapped out in cache, so the cache's hit signal would stay at 0. When this situation is detected, the MEM stage along with the stall unit stalls the entire processor and then sets the next state to CACHE\_MISS.

It is important to outline why mem\_condition in figure 7.4 has that equation. Technically to detect the cache miss it would have been enough to look for dcache\_update = 0 AND hit\_mem = 0, since dcache\_update signals the processor's will to perform a read (by not updating anything in the memory) and hit\_mem signals if a hit or a miss have occurred. This is not sufficient however, because the cache is accessed at every cycle using as address the output of the ALU, no matter if the instruction was actually a load or not. For this reason in the CW there is a field called cpu\_is\_reading, which is set to 1 when the processor wants to actually access the cache.

#### 7.4.2 CACHE\_MISS

In this state the control unit hands over the control of the cache to the memory controller, which has recognized as well the cache miss in the previous cycle, by putting in high impedance dcache\_update and update\_type\_mem. The CU remains in this state as long as the memory controller does not raise the cu\_resume signal. When this happens, the pipeline is unlocked and the MEM stage FSM enters again the NORMAL\_OP\_MEM state.

#### 7.4.2.1 Memory controller

The memory controller resides outside the control unit. It acts as a communication layer for the cache and the RAM, moreover as said before is capable of resolving cache misses. It has its own FSM to perform its job, shown in figure 7.5.



Figure 7.5: Memory controller state diagram

- 1. IDLE: this is the reset state. It stays here as long as no cache miss is detected and it keeps its cache's control signal in high impedance.
- 2. CACHEMISS: in this state it asks the cache to update its value. Two possible outcomes are possible here: if the cache is not signalling the need of an eviction the next state will be IDLE, otherwise it will be EVICTION\_READ. In this case the enable signal of an internal register is enabled to sample the evicted data coming from the cache. In any case, after this state the control unit regains the cache's control.
- 3. EVICTION\_READ: in this state the RAM is updated with the evicted data, then the next state is set to IDLE.

In the IDLE state, the memory controller may detect the occurrence of a data eviction during a write operation. This does not require a stall since it is able to directly forward the evicted data to the RAM.

## 7.5 WB stage

This is the final stage of the pipeline. Its job is fairly easy, since it has to forward the control signals to the datapath without any special exception. Besides this, the only other thing it does is to unlock the pipeline when a multiplication has reached this stage.

#### 7.6 Stall unit

The stall unit has the duty of detecting data hazards and of handling all the signals required to stall the processor when needed. Hazards are detected by looking at rs and rt stored in the ID/EXE registers, and by confronting them with the rd registers stored in the EXE/MEM and MEM/WB registers. The rd registers are accompanied in the pipeline by a bit that specifies whether they are valid or not. This is necessary because to reduce the switching activity the registers not needed by an operation are disabled, therefore an invalid hazard could be detected. Data hazards are actually checked also from the ID stage, because as said in section 7.3 if a multiplication detects an hazard it has to stall.

This unit, before attempting to forward data, checks if it has to stall because the control unit has requested so. The sources of a stall can be:

- 1. A cache miss has occurred. This has the highest priority in the stall unit to guarantee data correctness. For example, a load could trigger a cache miss while a multiplication enters the EXE stage: if they have been served in the reversed order, the load would store in the register file the wrong data.
- 2. A multiplication is in the EXE stage. In this case nop operations are scheduled from the MEM stage onwards in order to allow to the operations in the MEM and WB stage to complete their execution.
- 3. A multiplication is in the ID stage and a data hazard has been detected.

If none of the above conditions are verified, the stall unit then controls what data can forwards to the EXE stage. Not all operations supports forwarding, as in the case of a j, or they could support the forwarding only of the rs register, as in the case of most —I— instructions. This information is embedded in the CW of the instruction in the EXE stage and it is encoded in two bits. The possible values are:

- 1. 00: forwarding is not supported.
- 2. 01: only rs can be forwarded.
- 3. 10: both rs and rt can be forwarded (rt goes inside the adder's and shifter's operands).
- 4. 11: both rs and rt can be forwarded (rt is forwarded in op\_b, the signal where the data to be saved in the cache during a store is kept).

When the forwarding is supported by the instruction, it could be possible to generate a stall. In fact, if the instruction that should forward the data is a load, and said load is the MEM stage, it has not fetched yet the content of the memory. In this situation a bubble must be inserted in the pipeline.

# Synthesis and Implementation

## **Conclusions**

All in all, we have mixed feelings for this project. For a part we are satisfied by the amount of things we have been able to achieve, since we managed to add a large instruction set, a BTB, two caches, the forwarding unit and the ABI support.

However, we regret the way we implemented the multiplier. Actually, we regret to have added it at all.

In the earliest phase of development we have decided to rearrange the Booth's algorithm in such a way that it could be pipelined and would not take a lot of area. Unfortunately, we did not have enough time to implement it properly and this forced us to do compromise in order to keep it, compromises we would not have accepted by going back. It has hugely increased the complexity of the control unit and the amount of signals needed to handle all the possible stalls and corner cases. It also took a lot of time from us for testing and it had many problems, time that we could have dedicated to improving the rest of the system. This is all without even considering the fact that it is the only component in the EXE stage that, synthesized by itself with compile\_ultra, is not able to reach the target frequency of 1 GHz.

The final frequency of our DLX is, indeed, "only" of  $525 \ MHz$ , a decent value but far from our target. Sadly due to all the delays we had we did not have time to remove it and test everything again so in the end we kept it.

# Bibliography

- [1] Peter Kankowski. x86 Machine Code Statistics. URL: https://www.strchr.com/x86\_machine\_code\_statistics. (accessed: 16.10.20).
- [2] Inc. MIPS Technologies. MIPS64 Architecture For Programmers Volume II: The MIPS64 Instruction Set. 2005.
- [3] Wikipedia. MIPS architecture. URL: https://en.wikipedia.org/wiki/MIPS\_architecture. (accessed: 17.10.20).