

### Microelectronic Systems

# DLX Microprocessor: Design & Development Final Project Report

Master degree in Computer Engineering Master degree in Electronics Engineering

Referents: Prof. Mariagrazia Graziano, Giovanna Turvani

Authors: group\_16

Battilana Matteo, La Greca Salvatore Gabriele, Pollo Giovanni

## Feature

- Frequency - Slack - Area - Ecc

 $Grandes\ nacelles:$ 

- Nacelle A318 PW
- $\bullet$  Inverseur A320 CFM
- Inverseur A340 CFM
- Nacelle A340 TRENT
- Inverseur A330 TRENT
- Nacelles A380 TRENT900
- Nacelles A380 GP7200

Petites nacelles:

- Nacelle SAAB2000
- Inverseur DC8
- Inverseur CF34-8
- Inverseur BR710
- Nacelle F7X

# Contents

| Fe | eatur | re                                              | i  |
|----|-------|-------------------------------------------------|----|
| 1  | Inti  | roduction                                       | 1  |
|    | 1.1   | Abstract                                        | 1  |
|    | 1.2   | Workflow                                        | 1  |
| 2  | Har   | rdware Architecture                             | 2  |
|    | 2.1   | Overview                                        | 2  |
|    | 2.2   | Pipeline Stages                                 | 3  |
|    | 2.3   | Control Unit                                    | 3  |
|    |       | 2.3.1 Internal organization of the Control Unit | 4  |
|    | 2.4   | Memory Interface                                | 5  |
|    |       | 2.4.1 Signals and Timing                        | 5  |
|    |       | 2.4.2 Memory Addressing                         | 6  |
|    |       | 2.4.3 Memory Data Size: the MAS[1:0] signal     | 6  |
|    | 2.5   | Instruction Set                                 | 8  |
|    |       | 2.5.1 R-Type Instructions                       | 8  |
|    |       | 2.5.2 Pseudo R-Type Instructions                | 9  |
|    |       | 2.5.3 J-Type Instructions                       | 9  |
|    |       | 2.5.4 I-Type Instructions                       | 10 |
| 3  | Fete  | ch Stage                                        | 12 |
|    | 3.1   | Instruction Register                            | 12 |
|    | 3.2   | Program Counter                                 | 12 |
|    | 3.3   | Jump and Branch Management                      | 13 |
| 4  | Doc   | code Stage                                      | 14 |
| 4  | 4.1   | Instruction Decode                              | 14 |
|    | 4.1   | Register File and Windowing                     | 14 |
|    | 4.2   | 4.2.1 Decoder                                   | 15 |
|    |       | 4.2.1 Decoder                                   | 16 |
|    |       | 4.2.2 Connection Matrix                         | 18 |
|    |       | 4.2.4 Select Block                              | 19 |
|    |       | 4.2.4 Select Block                              | 19 |
|    |       | 4.2.6 Next Window Calculator                    | 20 |
|    |       |                                                 | 20 |
|    | 4.9   |                                                 |    |
|    | 4.3   | Hazard Control                                  | 21 |
|    | 4.4   | Comparator                                      | 21 |
|    | 4.5   | Jump and Branch decision                        | 24 |

|    | 4.6 Next Program Counter computation | 24 |
|----|--------------------------------------|----|
| 5  | Execute Stage                        | 25 |
|    | 5.1 ALU: Arithmetic Logic Unit       | 26 |
|    | 5.1.1 Adder                          | 27 |
|    | 5.1.2 Multiplier                     | 31 |
|    | 5.1.3 Logic Operands                 | 32 |
|    | 5.1.4 Shifter                        | 34 |
|    | 5.2 Set-Like Operations Unit         | 36 |
| 6  | Memory Stage                         | 38 |
|    | 6.1 Address Mask Unit                | 39 |
|    | 6.2 Load-Store Unit                  | 39 |
| 7  | Write Back Stage                     | 41 |
| 8  | Testing and Verification             | 43 |
|    | 8.1 Testbenches                      | 43 |
|    | 8.2 Simulation                       | 44 |
|    | 8.3 Post Synthesis Simulation        | 44 |
| 9  | Physical Design                      | 45 |
|    | 9.1 Synthesis                        | 45 |
|    | 9.2 Place and Route                  | 45 |
| 10 | Conclusions                          | 16 |

# Listings

| 4.1 | VHDL code for the encodig                             | 22 |
|-----|-------------------------------------------------------|----|
| 6.1 | VHDL code for address alignment                       | 39 |
| 7.1 | VHDL code gating for the WS signal coming from the CU | 42 |

### CHAPTER 1

### Introduction

### 1.1 Abstract

The goal of this project is to build from scratch a working implementation of a DLX. In order to achieve the goal, some known blocks, created during laboratories, were used.

The second step was the design of the datapath, done in the best possible way to obtain a high optimization and performance level. Some optimization examples that will be explained more in depth in the document are the use of the P4 adder and the Booth multiplier inside the ALU, the comparator and many others.

The third step was the design of the control unit. The choice fell on the microprogrammed version that guaranteed the pipeline implementation. In order to simplify the maintainability of the control unit, some struct-like constructs were used in the VHDL code.

In the fourth step, a very exhaustive testing has been executed. All the proposed asm codes were verified, but some well-known algorithm (bubble sort, Fibonacci and factorial) were written and tested.

Last but not least, the DLX has been synthetized using synopsis, and a post-synthesiss simulation has been executed.

Thanks to the datapath optimization and the synthesis optimization, the microprocessor proposed in this paper reached a peak speed of 400MHz.

### 1.2 Workflow

As many tools we used to automate the working process, all of them are explained in this section.

The first and most important tool was versioning control. The choice fell on GitHub. Thanks to this, the team managed all versions of the code and stepped back if any problem occurs. In addition to that, team communication and issue management were very straightforward, thanks to the possibilities offered by the tool. Another handy feature was the milestone, which allows the team to be on time and respect deadlines.

The used programming technique was the pair programming, that allows to write code and checking its correctness at the same time. This technique was particularly useful in difficult part of the projects. To exploit pair programming the extension used was Live Share for Visual Studio Code (https://github.com/MicrosoftDocs/live-share). Indeed, for the easier steps, the use of the branches and pull request gave the possibility of parallel working and drastically reduced the presence of conflicts.

The last thing to point out was the intensive use of scripting for compiling VHDL code, simulating it, adding wave and then synthetizing the full project. All this scripts are reported in the appendix.

### CHAPTER 2

### Hardware Architecture

### 2.1 Overview

This DLX is a 32-bit RISC processor with a five-stage pipeline. The external interface is made mainly for memories connection (IRAM, DRAM and a DRAM for the Register File), and for the Clock and Reset signals. Inside we find the following blocks:

- Control Unit: it receives the fetched instruction from the IR register and starts to output the correct control signals towards all the pipeline stages. Moreover, it receives status signals from all other units about their working status like the comparator result (for branch decision), the status about possible hazards in the pipeline, Register File's Push & Pop operations under execution, and all the memories readiness. It's in charge of controling the entire pipeline and stop it in case of hazards or other situations that requires a stall.
- Decode Unit: part of the decode stage, it is in charge of keeping the status about all registers under use (for further hazard controls), computation of the new Program Counter (given a Jump or not), data comparison (for branches) and, the most important thing, the operation decode with the dispatch of all the operands towards the right ports of the DataPath.
- DataPath: the computational core of the processor. Made of 4 pipeline stages (Instruction Decode, Execution, Memory, Write Back) contains all the units capable of doing computation. In particular, we have the Register File (that manages all the registers of the core), the Arithmetic Logic Unit, the Load-Store Unit for data memory management, and other units useful for the correct operation of everything.
- IR and PC: two registers the compose the Instruction Fetch stage of the pipeline, they are in charge of keeping in memory the current instruction under execution and the address for the next instruction to execute, respectively.



Figure 2.1: Schematic of the DLX

### 2.2 Pipeline Stages

### 2.3 Control Unit

The Control Unit is one of the most important part of the DLX processor. Its role is orchestrate all the jobs through the pipeline of the processor and to manage dangerous situations.

Its interface is made of input signals (status signals from other components) and of control signals (towards other components), as described in Table 2.1 and Table 2.2.

| Signal Name   | Description                                                        |  |  |
|---------------|--------------------------------------------------------------------|--|--|
| IR_IN         | Fetched Instruction, output from the IR Register                   |  |  |
| $HAZARD\_SIG$ | The Decode Stage is signaling a Hazard situation                   |  |  |
| BUSY_WINDOW   | The Current RF Window has some registers in use                    |  |  |
| SPILL         | The Register File has started a push operations towards the memory |  |  |
| FILL          | The Register File has started a pop operations from the memory     |  |  |
| $IRAM\_READY$ | The Instruction Memory has a data ready as output                  |  |  |
| LGET          | Comparison status from the comparator inside the Decode Unit       |  |  |
| DRAM_READY    | Indicates if the DRAM is ready or not                              |  |  |

Table 2.1: Input signals towards the Control Unit

| Signal Name                    | Description                                                            |
|--------------------------------|------------------------------------------------------------------------|
| PIPLIN_IF_EN                   | IF Stage enable: enable of IR register                                 |
| $IF\_STALL$                    | IF Stage stall: a NOP is insterted in the IR register                  |
| PC_EN                          | Enable of the PC register                                              |
| $JUMP\_EN$                     | A JUMP must occur, the Decode Stage will compute the new PC            |
| CALL                           | The Register File must start the context switch in a new window        |
| RET                            | The Register File must restore the previous window                     |
| $SEL_CMPB$                     | Register or Immediate field as comparator input in the Decode Stage    |
| UNSIGNED_ID                    | Indicates that all arithmetic units must consider operands as unsigned |
| $NPC\_SEL$                     | Selects as new PC between the value of a REG NPC Adder's output        |
| $HAZARD\_TABLE\_WR1$           | Enables the logging of the current instruction in the hazard control   |
| RF_RD1_EN                      | Enables the Read Port 1 in the RF                                      |
| RF_RD2_EN                      | Enables the Read Port 2 in the RF                                      |
| PIPLIN_ID_EN                   | ID Stage enable: enables all the ID pipeline REGs                      |
| PIPLIN_EX_EN                   | EX Stage enable: enables all the EX pipeline REGs                      |
| $SEL\_LGET$                    | Used by the SET Comparator unit for selection of the SET operand       |
| $\mathrm{DRAM}_{-}\mathrm{WE}$ | Enables the Write Operation in the DRAM                                |
| $DRAM_RE$                      | Enables the Read Operation in the DRAM                                 |
| $\mathrm{DRAM}_{-}\mathrm{ME}$ | Enable signal for the DRAM                                             |
| DATA_SIZE                      | Indicates the data of transfer towards/from the DRAM                   |
| $UNSIG\_SIGN\_N$               | Same as UNSIGNED_ID but 2 clock cycles delayed                         |
| PIPLIN_MEM_EN                  | MEM Stage enable: enables all the MEM pipeline REGs                    |
| $WB\_MUX\_SEL$                 | Selects between MEM output and ALU out the content to write back       |
| PIPLIN_WB_EN                   | WB Stage enable: enables the Write signal in the RF                    |

Table 2.2: Output signals towards the Control Unit

### 2.3.1 Internal organization of the Control Unit

The Control Unit internally is organized mainly with a stage that converts the actual Instruction into a Control Word and then this control word is propagated in order to support all the pipeline stages, as shown in Figure 2.2.



Figure 2.2: Schematic of the DLX Control Unit

It is made mainly by the following components:

- IR-Control Word Generation & Translation Unit: its role is to take the Instruction given as input and to output a valid Control Word related to that instruction.
- Pipeline Execution Unit: it's in charge of controlling the Pipeline Flow and stop it in case of needs and/or masking the actual Control Word in order to stop its propagation through the chain in case of Hazards.
- Control Word Mask Unit: when the *Pipeline Execution Unit* signals to mask the control word, this unit will do it by disabling all the bits in the CW that referees to the stage enabling signals.
- Jump & Branch Control unit: given the Control Word and other internal control signals, its job is to manage (by enabling or disabling them) signals like CALL, RET, JUMP\_EN and IF\_STALL.

### 2.4 Memory Interface

The DLX processor has a Harvard architecture, with two 32-bit data bus carrying instructions and data respectively. Only load and store instructions can access data from Data Memory. The data are stored in memory in a Big-Endian format.



The third RAM required, the one for the register file, is not under the direct access of the User. It's managed as a Stack memory by the Register File for automatic sub routines management and it has a dedicated interface. It can be merged with the Data Memory with an external logic.

All the subsequent paragraph are referred in the same way to all the three memory types.

#### 2.4.1 Signals and Timing

The signals in the DLX processor bus interface can be grouped into tree categories:

- Address class signals
- Memory Request signals
- Data class signals

The Address class signals are:

- A[31:0]
- DATA\_SIZE[1:0] or MAS[1:0]

The Memory Request signals are:

• Enable

- $\bullet$   $R\overline{W}$
- Ready

Moreover, all the memories connected must agree on a certain protocol both for writing and reading operations. The most important thing to take under consideration is the *Ready* signal: it must be high only when the operation is really completed. For example after a data read, the *Ready* stays at 1 only when the data is valid. If meanwhile the address changes, the *Ready* signal must go off.



If the memory in use can't accomplish to this timing, an external *Memory Control Unit* must be placed between the CPU and the Memory.

### 2.4.2 Memory Addressing

A[31:0] is the 32-bit address bus that specifies the address for the transfer. All addresses are byte addresses, so a burst of word accesses results in the address bus incrementing by four for each cycle.

The address bus provides 4GB of linear addressing space, and this can be used externally in different manners like in a SoC with memory mapped peripherals that shares the same address space.

When a word access is signaled the memory system ignores the bottom two bits, A[1:0], and when a halfword access is signaled the memory system ignores the bottom bit, A[0]. However, the core already masks the two LSBs when needed.

### 2.4.3 Memory Data Size: the MAS[1:0] signal

The MAS[1:0] bus encodes the size of the transfer. The DLX processor can transfer word, halfword, and byte quantities and the processor indicates the size of the transfer through this signal.

When a halfword or byte read is performed, a 32-bit memory system can return the complete 32-bit word, and the processor extracts the valid halfword or byte field from it as shown in Table 2.3. For 8 and 16 bit memories, the data must be placed on the right byte lanes in the data bus.

| DATA_SIZE[1:0] | A[1:0] | D[31:0]    | DLX Register |
|----------------|--------|------------|--------------|
| 00 WORD        | 00     | OxAABBCCDD | OxAABBCCDD   |
| 01 HALF        | 00     | OxAABB     | 0x0000AABB   |
| 01 HALF        | 10     | 0xCCDD     | 0x0000CCDD   |
| 10 BYTE        | 00     | OxAA       | 0x000000AA   |
| 10 BYTE        | 01     | 0xBB       | 0x000000BB   |
| 0 BYTE         | 10     | 0xCC       | 0x00000CC    |
| 10 BYTE        | 11     | 0xDD       | 0x00000DD    |

Table 2.3: How data are read by the CPU in different MAS configurations

Instead, when the DLX processor performs a byte or halfword write, the data being written is replicated across the data bus, as shown in Figure 2.3. The memory system can use the most convenient copy of the data.

A writable memory system must be capable of performing a write to any single byte in the memory system. This is required for the correct working of the DLX.



Figure 2.3: Data Write Replication

### 2.5 Instruction Set

The Instruction Set supported by the DLX is made of different instructions, in particular regarding Integer operations. Instructions are on 32 bit and are grouped in 3 different types:



Figure 2.4: R-Type



Figure 2.5: I-Type



Figure 2.6: J-Type

In particular, all of them have in common the OP CODE field, used to identify the instruction.

### 2.5.1 R-Type Instructions

R-Type Instructions are called in this way because all the operands are between registers. They all have in common the  $OP\ CODE$  equal to 0b00000 and each instruction is differentiated from another one thanks to the  $FUNC\ CODE$  field, as shown in Table 2.4 and Table 2.5.

| MNEMONIC | FUNC CODE | EXAMPLE         | OPERATION                              | DESCRIPTION            |
|----------|-----------|-----------------|----------------------------------------|------------------------|
| SLL      | 0x04      | SLL R3, R2, R1  | $R3 \leftarrow R2 \ll R1$              | Logical shift left     |
| SRL      | 0x06      | SRL R3, R2, R1  | $R3 \leftarrow R2 \gg R1$              | Logical shift right    |
| SRA      | 0x07      | SRA R3, R2, R1  | $R3 \leftarrow R2 \gg R1$              | Arithmetic shift right |
| ROR      | 80x0      | ROR R3, R2, R1  | $R3 \leftarrow R2 \circlearrowleft R1$ | Right rotation         |
| ROL      | 0x09      | ROL R3, R2, R1  | $R3 \leftarrow R2 \circlearrowleft R1$ | Left rotation          |
| MUL      | 0x0E      | MUL R3, R2, R1  | $R3 \leftarrow R2 * R1$                | Integer multiplication |
| ADD      | 0x20      | ADD R3, R2, R1  | $R3 \leftarrow R2 + R1$                | Integer Add            |
| ADDU     | 0x21      | ADDU R3, R2, R1 | $R3 \leftarrow R2 + R1$                | Integer Add            |
| SUB      | 0x22      | SUB R3, R2, R1  | $R3 \leftarrow R2 - R1$                | Integer subtraction    |
| SUBU     | 0x23      | SUBU R3, R2, R1 | $R3 \leftarrow R2 - R1$                | Integer subtraction    |
| AND      | 0x24      | AND R3, R2, R1  | $R3 \leftarrow R2$ & $R1$              | Bitwise AND            |
| OR       | 0x25      | OR R3, R2, R1   | $R3 \leftarrow R2 \mid R1$             | Bitwise OR             |
| XOR      | 0x26      | XOR R3, R2, R1  | $R3 \leftarrow R2 \oplus R1$           | Bitwise XOR            |

Table 2.4: R-TYPE Instructions: logical and arithmetic

| MNEMONIC | FUNC CODE | EXAMPLE         | OPERATION                  | DESCRIPTION               |
|----------|-----------|-----------------|----------------------------|---------------------------|
| SEQ      | 0x28      | SEQ R3, R2, R1  | $R3 \leftarrow R2 == R1$   | 1 if R2 equal R1          |
| SNE      | 0x29      | SNE R3, R2, R1  | $R3 \leftarrow R2! = R1$   | 1 if R2 not equal R1      |
| SLT      | 0x30      | SLT R3, R2, R1  | $R3 \leftarrow R2 < R1$    | 1 if R2 less R1           |
| SGT      | 0x31      | SGT R3, R2, R1  | $R3 \leftarrow R2 > R1$    | 1 if R2 great R1          |
| SLE      | 0x32      | SLE R3, R2, R1  | $R3 \leftarrow R2 \leq R1$ | 1 if R2 less or equal R1  |
| SGE      | 0x33      | SGE R3, R2, R1  | $R3 \leftarrow R2 \geq R1$ | 1 if R2 great or equal R1 |
| SLTU     | 0x3A      | SLTU R3, R2, R1 | $R3 \leftarrow R2 < R1$    | 1 if R2 unsigned less R1  |
| SGTU     | 0x3B      | SGTU R3, R2, R1 | $R3 \leftarrow R2 > R1$    | 1 if R2 uns great R1      |
| SLEU     | 0x3C      | SLEU R3, R2, R1 | $R3 \leftarrow R2 \leq R1$ | 1 if R2 uns less/eq R1    |
| SGEU     | 0x3D      | SGEU R3, R2, R1 | $R3 \leftarrow R2 \geq R1$ | 1 if R2 uns great/eq R1   |

Table 2.5: R-TYPE Instructions: test instructions

### 2.5.2 Pseudo R-Type Instructions

Here we find R-Type like operations: they works only with regiters but have their own OP CODE.

| MNEMONIC | OP CODE | EXAMPLE     | OPERATION                  | DESCRIPTION           |
|----------|---------|-------------|----------------------------|-----------------------|
| JR       | 0x12    | JR R1       | $PC \leftarrow PC + R1$    | PC += R31             |
| JALR     | 0x13    | JALR R1     | $R31 \leftarrow PC$        | R31 = actual PC       |
|          |         |             | $PC \leftarrow PC + R1$    | PC += R1              |
| RET      | 0x1F    | RET         | $PC \leftarrow R31$        | PC = R31              |
|          |         |             |                            | Context Switch begins |
| NOP      | 0x15    | NOP         |                            | Does exactly nothing  |
| TICKTMR  | 0x34    | TICKTMR R16 | $R16 \leftarrow TICKTIMER$ | Get Tick Timer value  |

Table 2.6: Pseudo R-TYPE Instructions: branch, jump and mix

### 2.5.3 J-Type Instructions

J-Type Instructions are groups of instructions made for code jump. Thanks to the structure of these instructions, long jumps can be executed (i.e. the relative address can be very big).

| MNEMONIC | OP CODE | EXAMPLE    | OPERATION                  | DESCRIPTION           |
|----------|---------|------------|----------------------------|-----------------------|
| J        | 0x02    | J LABEL    | $PC \leftarrow PC + LABEL$ | PC += rel label       |
| JAL      | 0x03    | JAL LABEL  | $R31 \leftarrow PC$        | R31 = actual PC       |
|          |         |            | $PC \leftarrow PC + LABEL$ | PC += rel label       |
| CALL     | 0x1E    | CALL LABEL | $R31 \leftarrow PC$        | R31 = actual PC       |
|          |         |            | $PC \leftarrow PC + LABEL$ | PC += rel label       |
|          |         |            |                            | Context Switch begins |

Table 2.7: J-TYPE Instructions

### 2.5.4 I-Type Instructions

I-Type instructions support immediate operands, as shown in Table 2.8.

| MNEMONIC | OP CODE | EXAMPLE          | OPERATION                           | DESCRIPTION            |
|----------|---------|------------------|-------------------------------------|------------------------|
| ADDI     | 80x0    | ADDI R3, R0, #2  | $R3 \leftarrow R0 + 2$              | Integer Add            |
| ADDUI    | 0x09    | ADDUI R3, R2, #2 | $R3 \leftarrow R2 + 2$              | Integer Add            |
| SUBI     | OxOA    | SUBI R3, R2, R1  | $R3 \leftarrow R2 - R1$             | Integer subtraction    |
| SUBUI    | 0x0B    | SUBUI R3, R2, R1 | $R3 \leftarrow R2 - R1$             | Integer subtraction    |
| ANDI     | 0x0C    | ANDI R3, R2, R1  | $R3 \leftarrow R2$ & $R1$           | Bitwise AND            |
| ORI      | 0x0D    | ORI R3, R2, R1   | $R3 \leftarrow R2 \mid R1$          | Bitwise OR             |
| XORI     | 0x0E    | XORI R3, R2, R1  | $R3 \leftarrow R2 \oplus R1$        | Bitwise XOR            |
| LHI      | 0x0F    | LHI R3, #0xABCD  | $R3 \leftarrow \texttt{OxABCD0000}$ | Load Upper Half        |
| SLLI     | 0x14    | SLLI R3, R2, #1  | $R3 \leftarrow R2 \ll 1$            | Logical shift left     |
| SRLI     | 0x16    | SRLI R3, R2, #1  | $R3 \leftarrow R2 \gg 1$            | Logical shift right    |
| SRAI     | 0x17    | SRAI R3, R2, #1  | $R3 \leftarrow R2 \gg 1$            | Arithmetic shift right |

Table 2.8: I-TYPE Instructions: logical and arithmetic

| MNEMONIC | OP CODE | EXAMPLE          | OPERATION                 | DESCRIPTION              |
|----------|---------|------------------|---------------------------|--------------------------|
| SEQI     | 0x18    | SEQI R3, R2, #1  | $R3 \leftarrow R2 == 1$   | 1 if R2 equal 1          |
| SNEI     | 0x19    | SNEI R3, R2, #1  | $R3 \leftarrow R2! = 1$   | 1 if R2 not equal 1      |
| SLTI     | Ox1A    | SLTI R3, R2, #1  | $R3 \leftarrow R2 < 1$    | 1 if R2 less 1           |
| SGTI     | 0x1B    | SGTI R3, R2, #1  | $R3 \leftarrow R2 > 1$    | 1 if R2 great 1          |
| SLEI     | 0x1C    | SLEI R3, R2, #1  | $R3 \leftarrow R2 \leq 1$ | 1 if R2 less or equal 1  |
| SGEI     | 0x1D    | SGEI R3, R2, #1  | $R3 \leftarrow R2 \ge 1$  | 1 if R2 great or equal 1 |
| SLTUI    | 0x3A    | SLTUI R3, R2, #1 | $R3 \leftarrow R2 < 1$    | 1 if R2 unsigned less 1  |
| SGTUI    | 0x3B    | SGTUI R3, R2, #1 | $R3 \leftarrow R2 > 1$    | 1 if R2 uns great 1      |
| SLEUI    | 0x3C    | SLEUI R3, R2, #1 | $R3 \leftarrow R2 \leq 1$ | 1 if R2 uns less/eq 1    |
| SGEUI    | 0x3D    | SGEUI R3, R2, #1 | $R3 \leftarrow R2 \ge 1$  | 1 if R2 uns great/eq 1   |

Table 2.9: I-TYPE Instructions: test instructions

| MNEMONIC | OP CODE | EXAMPLE        | OPERATION                    | DESCRIPTION          |  |
|----------|---------|----------------|------------------------------|----------------------|--|
| LB       | 0x20    | LB R1, 40(R2)  | $R1 \leftarrow MEM[R2 + 40]$ | Load Byte / Sign Ext |  |
| LH       | 0x21    | LH R1, 40(R2)  | $R1 \leftarrow MEM[R2 + 40]$ | Load Half / Sign Ext |  |
| LW       | 0x23    | LW R1, 40(R2)  | $R1 \leftarrow MEM[R2 + 40]$ | Load Word / Sign Ext |  |
| LBU      | 0x24    | LBU R1, 40(R2) | $R1 \leftarrow MEM[R2 + 40]$ | Load Byte            |  |
| LHU      | 0x25    | LHU R1, 40(R2) | $R1 \leftarrow MEM[R2 + 40]$ | Load Half            |  |
| SB       | 0x28    | SB 40(R2), R1  | $MEM[R2+40] \leftarrow R1$   | Store Byte           |  |
| SH       | 0x29    | SH 40(R2), R1  | $MEM[R2+40] \leftarrow R1$   | Store Half           |  |
| SW       | 0x2B    | SW 40(R2), R1  | $MEM[R2+40] \leftarrow R1$   | Store Word           |  |

Table 2.10: I-TYPE Instructions: Load & Store

| MNEMONIC | OP CODE | EXAMPLE           | OPERATION                  | DESCRIPTION    |
|----------|---------|-------------------|----------------------------|----------------|
| BEQZ     | 0x04    | BEQZ R1, LABEL    | $PC \leftarrow PC + LABEL$ | R1 = 0 ?       |
|          |         |                   |                            | PC += label    |
| BNEZ     | 0x05    | BNEZ R1, LABEL    | $PC \leftarrow PC + LABEL$ | $R1 \neq 0$ ?  |
|          |         |                   |                            | PC += label    |
| BGT      | 0x30    | BGT R1, R2, LABEL | $PC \leftarrow PC + LABEL$ | R1 > R2 ?      |
|          |         |                   |                            | PC += label    |
| BGE      | 0x31    | BGE R1, R2, LABEL | $PC \leftarrow PC + LABEL$ | $R1 \ge R2$ ?  |
|          |         |                   |                            | PC += label    |
| BLT      | 0x32    | BLT R1, R2, LABEL | $PC \leftarrow PC + LABEL$ | R1 < R2?       |
|          |         |                   |                            | PC += label    |
| BLE      | 0x33    | BLE R1, R2, LABEL | $PC \leftarrow PC + LABEL$ | $R1 \leq R2$ ? |
|          |         |                   |                            | PC += label    |

Table 2.11: I-TYPE Instructions: Conditional Branch Instructions

### **CHAPTER 3**

### Fetch Stage

The first stage of the DLX pipeline is the Fetch Stage, that has been included directly inside the DLX. It is used to manage the current Program Counter PC and send it out to the memory in order to fetch the instruction from it and save the instruction into the Instruction Register IR. At each clock cycle, the new PC is computed by summing 4, since the instructions are on 32 bits and the memory is organized in words of 1 byte. This can be summarized in the following way:

$$IR \leftarrow MEM[PC]$$

$$NPC \leftarrow PC + 4$$

As will be explained in the Jump and Branch Management section (refer to 3.3), in order to manage jumps and conditional branches, some additional hardware is necessary in order to compute the new address.

### 3.1 Instruction Register

The Instruction Register (IR) is, in this case, a 32 bits register that is used to store the instruction (that is in fact on 32 bits) that comes from the Instruction RAM.

During the normal operation, this register is updated with the new instruction coming from the IRAM, but this is not always true. We have two cases in which the register is not updated:

- 1. When i\_IR\_LATCH\_EN is at 0, in fact this signal is used to avoid to update the value inside the IR. During the instruction execution, it could be that the DRAM is not ready, a signal i\_DRAM\_NOTREADY is checked inside the Control Unit and in case of '1' the i\_IR\_LATCH\_EN is put at 0. In this way, everything is stalled so the CW in the pipeline and the Instruction Register will remain the same until the DRAM will become ready again.
- 2. When the reset signal is asserted or when the <code>i\_IR\_STALL</code> is at one, a "bubble" is added to the pipeline. So, the Instruction Register will contain the <code>x"54000000"</code> value. This is used in order to manage the jump and branch instructions. A more detailed description is available at section 3.3.

### 3.2 Program Counter

As anticipated before, the Program Counter (PC) is the address used to fetch the instruction from the IRAM and it is on 32 bits, thus it is able to address up to 4GB.

At each clock cycle, the current Program Counter is incremented by 4, in order to point to the next instruction. Only in case of reset, it is set to 0 otherwise it will set to the value of the new Program Counter, computed in the Decode stage. In a specular way, from what done for the Instruction Register, the PC is update only when the i\_PC\_LATCH\_EN is '1'.

### 3.3 Jump and Branch Management

Jumps and branches are crucial instructions in order to be able to build a program, but their management from the point of view of the pipeline is not straightforward. In fact, the Fetch Stage must be able to manage three additional signals:

- 1. i\_PC\_LATCH\_EN: this signal is used to avoid to update the PC with the new one coming from the Decode Stage. The update is inhibited when the instruction is a jump or when i\_IR\_LATCH\_EN is '0'.
- 2. i\_IR\_LATCH\_EN: it is used to avoid to update the Instruction Register. This happens when:
  - A Data Hazard occurs
  - A CAL instruction is executed and we have to wait for a spill operation to be completed
  - A RET instruction is executed and we have to wait for a fill operation to be completed

It's important to say that, when the IR is not updated, also the PC is not update too.

3. i\_IR\_STALL: a NOP is inserted in the pipeline in order to manage the jump and the branch instructions. This is due to the control hazards; when a branch is executed, it may or may not change the PC to something other than its current value plus 4. As it will be described more precisely in the Decode Stage section, the computation of the new PC, both for branch and jump, is done there. For this reason, it is updated only at the end of the Decode Stage and, if not blocked, a wrong instruction is fetch.

The solution is this DLX implements is to add a NOP instruction, in order to not execute a wrong instruction.

### CHAPTER 4

### Decode Stage

### 4.1 Instruction Decode

### 4.2 Register File and Windowing

The general structure of a register file is based on a decoder that takes the selection input (so the address of the desired register) and enables it (using also the enable signal). At this point, an input signal will contain the value to be written. On the other hand, a read signal is used to select among all the registers.

The DLX presented in this document has been enhanced in order to be able to manage subroutine in a transparent manner from the point of view of the user. For this reason, the DLX must be able to handle subroutines, and so the context switching, that consists in saving the registers content in order to be restored once the procedure has been completed. The straightforward solution is to save into the memory all registers but this is not feasible in terms of delay, since for 32 registers we will need 32 clock cycles; if you image this in a pipeline, this corresponds to a long stall each time a procedure is called.

A windowed register allows to reduce the overhead due to the context switch; the basic idea is to split the available registers in the physical register file into blocks, called *windows*. We have limited amount of physical registers in the register file, for this reason a finite number of windows are defined. Each window is assigned to a subroutine, so that the procedure can write only on those register. This is transparent from the point of view of the CPU, that sees all registers available. Thus, the physical register file has a wrapper around it with a logic and a Register Management Logic (MML) that allows to perform the translation between the CPU requests to corresponding window for the running procedure.

What if the number of called procedure is larger than the number of available windows? The main memory is involved only when there are no free windows in the register file. In this case, the oldest allocated window is swapped into the main memory, so that the new one can be allocated. Obviously, once all the recursion chain has been unrolled, the swapped window in the memory must be restored into the register file. All windows, so each procedure, has 4 blocks of 8 registers each one:

- 1. IN: the first block is dedicated to the data inherited from the parent routine (OUT);
- 2. **LOCALS**: contains the registers that are dedicated to the procedure;
- 3. OUT: is dedicated to the variables to be passed to the child routine, that is the IN of the next

sub-procedure

4. GLOBAL: the last block is common to every windows.

When a procedure is called, the first LOCALS and OUT blocks are allocated from the physical file register and assigned to it (because IN is the OUT of the previous one).

By calling many nested procedures, at some point there will be no free windows; for this reason the oldest is de-allocated from the physical FR and swapped to the main memory, the operation is called **SPILL**. This it accomplished by using a support pointer, called **Saved Window Pointer SWP** that stores the point of the spilled data, exactly the end of the LOCALS block (only IN and LOCAL are spilled, the OUT block is not spilled because is the IN of the next sub-procedure). In practice it define the position of the last free cell. Notice that this operation cannot be executed in one clock cycle: each register is spilled once at a clock cycle.

On the other hand, when the last procedure in the chain is finished, the other are unrolled; if some of them have been spilled, a **FILL** must be executed before the actual execution. This can be achieved by, firstly decrement CWP by 16 and check if now CWP >> SWP.

It's important to notice that the implementation of the entire register file has been implemented in Structural. Is is composed by several components:

- **Decoder**: it is used to generate a single enable signal from a signal on **NBIT\_ADD** bits; in this way, a register is selected in order to perform a write. The register will check also if a write is requested;
- Connection matrix: this block allows to "highlight" the active windows, the block IN, LOCAL and OUT will be the default destination when writing and reading;
- Register file: this block corresponds to the physical registers, composed by rows of Flip-Flops;
- **Select block**: this block is used for the reading, is connected to all the registers and selects, using the read address, the single register to be read;
- Address generator: this block is used only when perform a FILL or a SPILL, it generates the address for the registers to be moved from/to the memory. The memory, in this case works exactly like a stack;
- WRF Control Unit: A FSM used to manage the address generator when performing a SPILL or a FILL.

Additional, but less complex components, have been used in order to manage the CWP and SWP.

#### 4.2.1 Decoder

This block receives as input the *write address* on **NBIT\_ADD** bits and outputs  $2^{\text{NBIT\_ADD}} - 1$  bits. It has the utility of converting the address of the register at which we need to write into its enable signal.

The output is divided (in the schematic) in order to represent the group of bits. In particular we have that:

 $\bullet$  M - 1 DOWNTO 0: bits associated to the GLOBAL register

- $\bullet$  M + N 1 DOWNTO M: bits associated to the *IN* register
- M + 2N 1 DOWNTO M + N: bits associated to the LOCAL register
- M + 3N 1 DOWNTO M + 2N: bits associated to the OUT register

On the top of the schematic (Figure 4.1) we can see an AND logic port between ENABLE and WR signals. If both ENABLE and WR are 1, it means that our register need to work. In fact, the output of the dedocer is anded with 1 and so we maintain the value. Otherwise, if one signal between ENABLE and WR is 0, the output will be 0 and so the AND with the output of the decoder will return all 0.

This signal goes into the *connection matrix*, which is the next block described.



Figure 4.1: Schematic of the Decoder

#### 4.2.2 Connection Matrix

With the previous block, we generated all our enable signals. The problem is that we have more windows. So how do we decide which window needs to be activated? Here comes the connection matrix. This block receives as inputs the signal coming from the decoder, the current window, the saved window and the address for the pop (fill) operation. The output is a signal that contains the enable signals ready for all the registers of all windows.

We have a specific structure for each block:

• GLOBAL: the global is the simplest, because it is connected directly to the output

- IN: for this block we AND the IN bits coming from the decoder with the bit (that is extended) of the related window. For example if we are evaluating the IN of the first window, we will AND the IN bits with the bit 0 of the current window.
- OUT: for this block we AND the OUT bits coming from the decoder with the bit (that is extended) of the previous related window. For example if we are evaluating the OUT of the first window, we will AND the OUT bits with the bit 4 of the current window (supposing our window has 5 bits).
- LOCAL: for this block we AND the LOCAL bits coming from the decoder with the bit (that is extended) of the related window. For example if we are evaluating the LOCAL of the first window, we will AND the LOCAL bits with the bit 0 of the current window.

For the IN and OUT we then an OR between the two outputs (the logic can be seen in the schematic), while for the LOCAL we don't have anything.

In addition to that, the connection matrix also manages the saved window, used for the pop (fill) operation. First we need to invert the addr\_pop, because when we execute the pop operation, we restore data starting from the last one (we are using a STACK). The addr\_pop\_inverted is composed like this:

- 2N 1 DOWNTO 0: we have the IN bits
- N 1 DOWNTO 0: we have the LOCAL bits

The signal is splitted into two wires and is anded with the saved related saved window pointer. In the end, we definitely OR the output of the previously described OR with the output of this AND. This is visible in Figure 4.2



Figure 4.2: Connection matrix

### 4.2.3 Register File

The next block is the Register File, that is a sequence of registers. The important thing to notice in our design is how we managed the data that goes into the registers. We have two choices, data\_in and from\_mem. In order to choose we decided to use multiplexers. We have a multiplexer for each window. The signal used to drive the multiplexer is the saved window pointer, rotated right by 1 position and anded with the pop signal. In fact, we select from\_mem only when the pop signal is 1, otherwise we need to select data\_in. We use the saved window pointer shifted right by 1 because when the saved window pointer is, for example 00010 we need to restore the window 00001.

Indeed, for dataout there are no multiplexers, because there is no choice.

### 4.2.4 Select Block

This block is very simple and straightforward. It receives as input the current window, and the output of the register file (of all windows). The it selects the bit of the IN, LOCAL, OUT of the current window. The interface is shown in Figure 4.3.



Figure 4.3: Interface of the select block

### 4.2.5 Output Selection

This is the stage that decide the two output Data1\_Out and Data2\_Out. The design is shown in Figure 4.4.



Figure 4.4: Design of the output selection

This stage receives the IN, LOCAL, OUT of the current window, thanks to the select block and the GLOBAL. The two addresses, ADD\_RD1 and ADD\_RD2, select the output of the multiplexer which goes into the register, used to respect the timing. The Enable of each register is the and of the ENABLE and the RD signal. The decision was made in order to stop reading when the circuit is not enabled, and so to have a granural and precise control of the circuit.

### 4.2.6 Next Window Calculator

This block is used to compute the next window, both for the current window and the saved window. The schematic is shown in Figure 4.5.



Figure 4.5: Design of the next window calculator

Inside this block there is a logic able to rotate right or left and a multiplexer, that allows to select the correct output based on what the circuit needs.

### 4.2.7 WRF Control Unit

An additional element is necessary in order to be able to manage the addresses for the SPILL and FILL procedure, to and from the memory. It has been implemented using a Moore FSM and basically consists in different phases; a starting one that set the memory address to 0 and when the FILL or SPILL input is '1' and the ram is ready, the address is decreased or increased by 4, accordingly to the operation. Refer to figure 4.6. It's important to say that, if the RAM is not ready the address is keep untouched, so that, when the RAM returns in a ready state, the procedure can continue.



Figure 4.6: Moore FSM implementation for WRF CU

### 4.3 Hazard Control

### 4.4 Comparator

The straightforward way to implement a comparator, allows only to check if two operands, A and B, are equals. The solution is sketched at figure 4.7, that is based on N XNOR, where N is the number of bits of the operands and an AND gate with N inputs.

Even if this solution is extremely compact, it allows to perform only the equality comparison; since this DLX implementation has the ability to perform complex conditional branch instructions (refer to the Instruction section 2.5) and conditional set instructions (refer to the Set-Like Operations Unit  $\ref{thm:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perform:perfo$ 



Figure 4.7: Design of the basic comparator

Figure 4.8: Design of the advanced comparator

The advance comparator exploits the comparison by performing a subtraction between  $\mathbb{A}$  and  $\mathbb{B}$  and then checking the result. This DLX implementation is based on a P4 adder that is able to perform subtraction and then a set of checks, the same that are in the 4.8 are performed in order to generate the comparison outputs. We can perform the comparisons using this boolean equations, where C is

the carry-out and Z is the zero check (all bits of the results are zeros):

$$A > B \to C \cdot \overline{Z}$$

$$A \ge B \to C$$

$$A < B \to \overline{C}$$

$$A \le B \to \overline{C} + Z$$

$$A = B \to Z$$

$$A \ne B \to \overline{Z}$$

In order to avoid to propagate six different signal, the outcomes of the comparisons are encoded into a signal LGET on two bits. The encoded value are the ones in the 4.1 table.

| LC | ET | Case       | 1 LGET $\leq 01$ when $(a_l_b = 1)$                               | else |
|----|----|------------|-------------------------------------------------------------------|------|
|    | )1 | A < B      | 2 "00" when (a_le_b = '1') else<br>3 "11" when (a_g_b = '1') else |      |
| (  | 00 | $A \leq B$ | 4 "10" when (a_g_b = '1') else                                    |      |
| ]  | 11 | A > B      | 5 "00";                                                           |      |
| 1  | 10 | $A \ge B$  | U                                                                 |      |

Table 4.1: LEQ encoding

Listing 4.1: VHDL code for the encodig

The ordering of the comparison in the when statement is not casual nor follows the normal patterns but, the strictly lower comparison is done before the lower equals one, because, if the latter one is true it means that is also A lower than B but not vice-versa. Using this encoding, we can simply check the second bit in order to understand if  $A \leq B$  or  $A \geq B$ ; instead, if we want to check only < or > comparisons we have to check also the first bit.

A further improvement has been done to the advanced comparator in order to manage comparison between both signed and unsigned numbers. The carry value works like this:

- Carry = 1: if A > B in unsigned
- Carry = 0: if  $A \leq B$  in unsigned

The advanced comparator works with unsigned numbers only. So it simply needs to be adapted for cases in which signed comparison and unsigned comparison are different. They are shown in the Table 4.2 highlighted in red. It is easy to notice that in the red lines A and B always have different sign. The logic must work when the UNSIG\_SIGN\_N bit is 0, that means the circuit is dealing with a signed number. In this case the carry bit must be complemented. Knowing this things, it's easy to derive the following logic:

```
i_cout_masked <= Cout xor (not(UNSIG_SIGN_N) and (A(A'length-1) xor B(B'length-1)));</pre>
```

| ${f A}$ | В  | Carry out | Signed comparison | Unsigned comparison |
|---------|----|-----------|-------------------|---------------------|
| 2       | 3  | 0         | Less              | Less                |
| 4       | 3  | 1         | Greater           | Greater             |
| 3       | 3  | 1         | Equal             | Equal               |
| -3      | 3  | 1         | Less              | Greater             |
| -2      | 3  | 1         | Less              | Greater             |
| -5      | 3  | 1         | Greater           | Greater             |
| 3       | 2  | 1         | Greater           | Greater             |
| 3       | 4  | 0         | Less              | Less                |
| 3       | 3  | 1         | Equal             | Equal               |
| 3       | -3 | 0         | Greater           | Less                |
| 3       | -2 | 0         | Greater           | Less                |
| 3       | -5 | 0         | Greater           | Less                |
| 2       | -3 | 0         | Greater           | Less                |
| 4       | -3 | 1         | Greater           | Less                |
| 3       | -3 | 1         | Greater           | Less                |
| -3      | -3 | 1         | Equal             | Equal               |
| -2      | -3 | 1         | Greater           | Greater             |
| -5      | -3 | 1         | Less              | Less                |
| -3      | 2  | 1         | Less              | Greater             |
| -3      | 4  | 0         | Less              | Greater             |
| -3      | 3  | 1         | Less              | Greater             |
| -3      | -3 | 0         | Equal             | Equal               |
| -3      | -2 | 0         | Less              | Less                |
| -3      | -5 | 0         | Greater           | Greater             |

Table 4.2: All cases of possible comparison

So, we the final unit corresponds to the one at figure 4.8 with an additional encoder before the output, that encode the six conditions into a signal on two bits. The resulting schema is the one in figure 4.9.



Figure 4.9: Final implementation of the comparator

- 4.5 Jump and Branch decision
- 4.6 Next Program Counter computation

### **CHAPTER 5**

## Execute Stage



Figure 5.1: Execute stage

The execute stage is the second stage of the Datapath and it is used to perform computation over data, like sum, multiplication, shift and logical operation; the Control Word coming from the Control Unit, allows to correctly configure the internal signal to perform the desired operation with the correct operands. In fact, in this stage we have multiple signals that can be configured:

• EN1: this signal used in order to enable/disable the registers sampling in the DECODE/EXE junction;

- EN2: this signal used in order to enable/disable the registers sampling in the EXE/MEM junction;
- S1: this signal is used to select between the output of REG\_A (from register file) and REG\_IN1 (immediate). If 0, REG\_A is selected;
- S2: this signal is used to select between the output of REG\_B (from register file) and REG\_IN2 (immediate). If 0, REG\_B is selected;
- SEL\_ALU\_SETCMP: it is used to select the input for REG\_ALU\_OUT between the SET-CMP and ALU outputs;
- SEL\_LGET: this is a 3 bits signal, that is used to select among the SEQ, SNE, SLE, SLT, SGE, SGT operation for the SET-CMP unit;
- ALU\_OP: this is a 5 bit signal, that is used to select the correct operation among the sub-unit that composes the ALU. Refer to the table 5.1 for a complete description.

Beside the main path, used to compute the output result, two addition paths (the ones in the dotted rectangles) are present in order to correctly manage the execution of all the operations in the instruction set:

- A path for the propagation of the register value to be saved into the memory in case of a load. It was not possible to pass thought the addition with B and O because it's already needed in order to compute the address. So, the output from REG\_B is directly connected to the input of REG\_ME;
- The second path is the one used to manage the Write Back, at each step, until the MEM one, the reference to the register to perform the WB must be propagated through three stages (decode, execute and memory). So, the output of the WRB1 register goes into the input of the WRB2 register.

### 5.1 ALU: Arithmetic Logic Unit



Figure 5.2: ALU unit

The Arithmetic Logic Unit can be seen as a block, that given a selection signal and the inputs is able to perform computation over the operands. The ALU implementation described in this document is based on the following block:

- Adder
- Multiplier
- Logic unit
- Shifter

Each one of these blocks will be explained in the following sections.

The base concept is that, internally, the 4 units are selected through a multiplexer that takes two out five bits from a selection signal called OP. Having 5 bits to describe the type of operation, the possible combinations and their relative operations are:

| OP         | $\mathbf{Unit}$ | Operation         |
|------------|-----------------|-------------------|
| 000 00     | ADD             | ADD               |
| $001 \ 00$ | ADD             | SUB               |
| 000 01     | MUL             | MUL               |
| 000 10     |                 | AND               |
| 001 10     |                 | NAND              |
| 010 10     | I OOIO          | OR                |
| 011 10     | LOGIC           | NOR               |
| 100 10     |                 | XOR               |
| 101 10     |                 | XNOR              |
| 000 11     |                 | SHIFT RIGHT       |
| 001 11     | SHIFT           | SHIFT LEFT        |
| 010 110    |                 | ARITH SHIFT RIGHT |
| 011 11     |                 | ARITH SHIFT LEFT  |
| 100 11     |                 | ROTATE RIGHT      |
| 101 11     |                 | ROTATE LEFT       |

Table 5.1: ALU operations encoding

The two LSBs are the ones used as selection input for the multiplexer that select from which ALU unit takes the result. In fact, the univocally define the unit to be used. The remaining three MSBs are used as input for the units that compose the ALU in order to select the correct operation.

#### 5.1.1 Adder

The straightforward way to implement an adder is to use the Ripple Carry Adder structure, that is composed of N-1 Full Adder and one Half Adder (the first), where N is the number of bits of the two operands. This solution is not optimal from a timing point of view due to the time needed to propagate the carry, that defines the critical path, that is the bottleneck.

Since the sum and the subtraction is one of the most common operation, the DLX includes an adder that is based on a CLA - Carry Look Ahead (Sparse Tree) and a Carry Select Like Adder. The complete structure can be seen at figure 5.3.

As said before, the adder is composed of two blocks:

• Carry Select Like Adder: The main point of the Carry Select Adder is that it doubles the complexity of the adder itself in order to obtain better performances. It is composed by two RCA, in order to perform two sums in parallel.

The idea is to compute both the results, on 4 bits in this case, for both when the carry-in is equal to '0' or '1'. In this way, the results is computed in parallel for all the stages, even if the



Figure 5.3: Booth's multiplier on 32 bits

carry-in is '0' or '1'; then the carry-in is used to mux against the two results (on 4 bits) and the two carry-outs. The carry-out will be used as result selection signal for the next Carry Select unit.

We are paying complexity in order to reduce the addition computation time, in fact by having a carry out that is used as carry in for the next state, there is still propagation but is lower.

The DLX implementation, instead of using a straightforward implementation of the Carry Select Adder, it uses a modified version of it. It has been implemented using CLA - Sparse Tree Carry Generator and a Carry Select Like Adder. The base idea is to use the CLA in order to compute a carry every n bits, then these carries are fed into the sum generator that uses them to compute the results in parallel.



Figure 5.4: Carry Select Like Adder block for a 32 bits implementation

• Carry Look Ahead - Sparse Tree: this block is used to compute the carry out every 4 bits. The idea behind the CLA is to compute several carries simultaneously and to avoid waiting until the correct carry propagates from the stage of the adder in which it has been generated. This is done thanks to the *propagate* (P) (that is 1 if the carry-in is equal to the carry-out) and generate (G) (that is one if carry-in is 0 and carry-out is 1).

| a | b | $c_{in}$ | out | $c_{out}$ | p | g |
|---|---|----------|-----|-----------|---|---|
| 0 | 0 | 0        | 0   | 0         | 0 | 0 |
| 0 | 0 | 1        | 1   | 0         | 0 | 0 |
| 0 | 1 | 0        | 1   | 0         | 1 | 0 |
| 0 | 1 | 1        | 0   | 1         | 1 | 0 |
| 1 | 0 | 0        | 1   | 0         | 1 | 0 |
| 1 | 0 | 1        | 0   | 1         | 1 | 0 |
| 1 | 1 | 0        | 0   | 1         | 0 | 1 |
| 1 | 1 | 1        | 1   | 1         | 0 | 1 |

Table 5.2: Computation of propagate and generate bits

$$g = a \oplus b \qquad p = a \cdot b \tag{5.1}$$

The base idea is to write any  $s_i$ , that is the i-esim bit of the sum and  $c_i$ , the carry-out at i index in function of p and g. We now use p and g to express the same result:

$$s_1 = a \oplus b \oplus c_{in} = p_1 \oplus c_0$$
  

$$c_1 = a \cdot b + a \cdot c + b \cdot c = a_1 \cdot b_1 + (a_1 + b_1) \cdot c_0 = g_1 + p_1 \cdot c_0$$

The crucial point is that it's possible to compute the carry at i position only using the initial carry-in  $c_{in}$  and p and q generate in the current and previous blocks. Tree are a family of Carry Look Ahead that differ for the carry-logic. They are based always on *propagate* and *generate*. We have that

$$carry = g + p \cdot c_{in}$$

$$G_{i:j} = G_{i:k} + P_{i:k} \cdot G_{k-1:j}$$

$$P_{i:j} = P_{i:k} \cdot P_{k-1:j}$$

where

$$-i \ge k > j$$

$$-G_{x:x} = g_x \text{ and } P_{x:x} = p_x$$

$$-g_0 = C_{in}$$

The white and gray blocks in the Sparse Tree block at 5.6, that are used in the carry\_generator block, are PG and G blocks. Normally two blocks are used, the first G generates only  $G_{i:j}$  and the other PG both  $G_{i:j}$  and  $P_{i:j}$ . The base idea is to combine their outputs and take only the



Figure 5.5: PG and P block

G one as carries.

So, Carry Look Ahead - Sparse Tree needs a starting block that generates all the p and g for all the couples of bit using the 5.1 equation. The spars tree and the PG network structure is shown in the figure 5.6, in the code this block is called **prop\_gen\_generic** and is made of **prop\_gen** block in order to compute p and g.



Figure 5.6: Carry Look Ahead - Carry Generator Block

In order to be able to perform both subtraction and sum, the first block must be modified and must include the logic to manage also the carry-in, that in case of a subtraction it is '1'. This is not enough to perform subtraction, in fact, an additional signal called SUB\_SUMN is needed. So, the same structure can be used to implement the subtraction, by adding an XOR on each B input with the SUB\_SUMN control signal.

We can say that a subtraction in 2's complement can be implemented as  $A + \overline{B} + 1$ ; in order to implement this in the circuit we need to set the carry-in to '1' (so SUB\_SUMN = '1') in order to add 1 and invert the B input by using the XOR. In fact:

| $\boldsymbol{x}$ | y | $y \oplus q$ |
|------------------|---|--------------|
| 0                | 0 | 0            |
| 0                | 1 | 1            |
| 1                | 0 | 1            |
| 1                | 1 | 0            |

This solution is good because the PG and G block has the same delay, that is driven by G and since both include it, they are the same. Many paths have the same delay and the load on components is good since an output of a block is connected at maximum to 2 other block. These two factors bring a good equilibrium to the entire structure.

### 5.1.2 Multiplier

In order to overcome the limitation of the array multiplier, this DLX implementation includes a modified version of the Booth's multiplier, since the multiplex for the partial values to be added is only on two bits instead of three. The Booth's algorithm copes with 3 bits at a time, so the number of stages is N/2 (this corresponds to the number of the encoders) and this allows to speed up the result computation. The Booth's algorithm is the following:

```
 \begin{split} \mathbf{i} &= 0 \\ \mathbf{P} &= 0 \\ \text{while } \mathbf{i} &\leq \mathbf{M} - 2 \text{ loop} \\ &\quad \mathbf{P} &= \mathbf{P} + \text{Vp}(\ B_{i+1}, B_i, B_{i-1}\ ) \\ &\quad \mathbf{A} &= \mathbf{A} \ * \ 4 \\ &\quad \mathbf{i} &= \mathbf{i} \ + \ 2 \\ \end{split}  end loop
```

Where P is the final value of the product and during the algorithm execution it will contain the partial result; M represents the number of bit of the multiplicand, in this case B. The algorithm takes as convention that  $B_{i-1} = 0$ . The Vp is a lookup table (see 5.3), that return the value to add to P, according to the 3 bits selected. The value of A is multiplied by 4.

| $B_{i+1}$ | $B_i$ | $B_{i-1}$ |     |
|-----------|-------|-----------|-----|
| 0         | 0     | 0         | 0   |
| 0         | 0     | 1         | +A  |
| 0         | 1     | 0         | +A  |
| 0         | 1     | 1         | +2A |
| 1         | 0     | 0         | -2A |
| 1         | 0     | 1         | -A  |
| 1         | 1     | 0         | -A  |
| 1         | 1     | 1         | 0   |

Table 5.3: Booth's LUT

The Booth's multiplier work with 3 main components, supposing A is the multiplicand and B the multiplier:

- 1. N/2 encoders in order to take 3 bits from the operand B; the two LSB are used as selection signal for the multiplexers and the last one, the MSB, for the adder. In fact, as it's possible to notice from the table 5.3, when the MSB is 1 the value to be added to the partial result is positive, negative otherwise. For this reason, the inputs to the multiplexers are only  $\{0, A, 2A\}$ . Since we need also to generate negative values, like -A, -2A, the MSB of the three bits is used as input for the adder. This signal, called SUB\_SUMn is used to define if the operation is a sum or a subtraction; if it is 1, a subtraction is performed;
- 2. N/2 multiplexers that select only among  $\{0, A, 2A\}$ , since at each stage A must be multiplied by 4, a shift by two is done starting from A of the previous multiplexer;
- 3. N/2 ripple carry adders, that allow to preform the partial sums. Since the final results will be on  $NBIT \cdot 2 + 1$  bits, the adders in each level have been optimized in order to work only with the minimum bits needed. In fact, the adder at i level, will generate the result on  $NBIT + 2 \cdot i$  bits. As said before, an addition signal called SUB\_SUMn has been added in order to be able to perform the subtraction. The Ripple Carry Adder has been select for its simplicity and since

the multiplication is a less common instruction, it was not worth to use a more sophisticated adder. This allowed to reduce the total area of the multiplier itself.

The multiplexer implements the LUT and at the same time the  $A=A\cdot 4$ . The two values from the two multiplexer are summed together via an adder, this implements the partial sum. The overall structure can be observed at 5.7.



Figure 5.7: Booth's multiplier on 32 bits

### 5.1.3 Logic Operands

The basic and most simple implementation of a logic unit is based on single logic gates on N bits whose outputs are muxed, in order to generate the correct output. The problem with this solution is that the number of input signals to the multiplexer is extremely high; this implementation does not only suffer from the point of view of the delay but, since each logic function is implemented with a specific gate, the total area is huge.

In order to overcome the problems highlighted before, a more compact implementation has been chosen: the T2 logic unit.

This logic unit allows to perform AND, NAND, OR, NOR, XOR and XNOR using only 5 NAND gates, on two levels, and 4 selection signals. The schematic is the one in figure 5.8.

In order to compute one of the logical instructions, the select signals are properly activated as follow:



Figure 5.8: Logic unit

| $ S_0 $ | $S_1$ | $S_2$ | $S_3$ | Operation |
|---------|-------|-------|-------|-----------|
| 0       | 0     | 0     | 1     | AND       |
| 1       | 1     | 1     | 0     | NAND      |
| 0       | 1     | 1     | 1     | OR        |
| 1       | 0     | 0     | 0     | NOR       |
| 0       | 1     | 1     | 0     | XOR       |
| 1       | 0     | 0     | 1     | NXOR      |

Figure 5.9: Logic input signals with the relative operation

For example, in order to generate the AND logical operation, we have to select  $S_3=1$ , so that  $out=R_1\cdot R_2$ ; on the other hand, if we need NAND  $S_0=S_1=S_2=1$  and  $S_3=0$ , so that  $out=\overline{R_1}\cdot \overline{R_2}+\overline{R_1}\cdot R_2+R_1\cdot \overline{R_2}=\overline{R_1}\cdot \overline{R_2}$  that using the De Morgan law  $out=\overline{R_1}\cdot \overline{R_2}$ . This allows to obtain the best performances also because all paths work in parallel, compacting the area and the delay.

Since only the 3 bits are used to select among the logical operations (S signal), a direct correspondence is needed to generate the signal show in table 5.9. The following table, shows the conversion:

| S   | Decoded signal |
|-----|----------------|
| 000 | 0001           |
| 001 | 1110           |
| 010 | 0111           |
| 011 | 1000           |
| 100 | 0110           |
| 101 | 1001           |
| 110 | 0000           |
|     | •              |

Table 5.4: Conversion table, from  ${\tt S}$  input signal on 3 bits into 4 bits

#### 5.1.4 Shifter

The implemented shifter allows to perform shift right, logical/arithmetical shift left and left/right rotate using the full operand A on 32 bits and 6 bits from the second one B and three *control signals*. Differently from the T2 version, it uses and addition signal in order to be able to manage also the rotate instruction. Our implementation takes three inputs:

- A: the operand to be shifted/rotated;
- B: only the 5 LSB [4,3,2,1,0] are used to select first the mask to be used and then the starting point from that mask;
- SEL: it encodes the operation type; the second bit is used to select among arithmetic and logic, the third bit is used to select the direction of the shift/rotate (left/right) and the first one is used only if the operation is a rotate. This is the encoding:

| SEL | Operation         |  |  |
|-----|-------------------|--|--|
| 000 | Shift logic right |  |  |
| 001 | Shift logic left  |  |  |
| 010 | Shift arith right |  |  |
| 011 | Shift arith left  |  |  |
| 100 | Rotate right      |  |  |
| 101 | Shift right       |  |  |



Figure 5.10: Blocks of the Shifter/Rotate Unit

The unit perform the requested operation in three stages, sketched in figure 5.10:

- 1. The first consist in preparing 4 possible "masks", each already shifted of 0, 8, 16, 32 left or right depending on the configuration. This allows to shift for all 32 bits. Basically it copies the input A into the 4 masks that will be used by the next stage. Being in 32 bits, the generated masks are in 32+8 = 40 bits. The only different between this implementation and the T2 one, is that, in case of rotate, the additional 8 bits of the masks are filled with the corresponding 8 bits that are going "out" during the rotation.
- 2. The second level perform a coarse grain shift, that is basically consist on selecting one mask among the 4 possible masks generated in the previous stage. This selection is done by using the bits 4, 3 of B.

3. The third level, using the bits 2, 1, 0 of B and the selected mask, preform a fine grain refinement. The 3 bits allows to select the starting index from the mask, in fact it allows to select among 8 positions.



Figure 5.11: Masks for left shift



Figure 5.12: Masks for right shift



Figure 5.13: Masks for left rotate



Figure 5.14: Masks for right rotate

Figure 5.15: Masks for shift unit on 32 bits

#### Examples

For example, if we need to perform a left left of 9 bits A, where A=18, the corresponding B value will be 1001; this means that the second masks will be taken and the output result will from the bit at position 40 - 1 = 39 to the one at 39 - 32 = 7 included.

### 

On the other hand, if we need to perform a right shift the masks are generated in the opposite way, so the zeros are put in the MSB of the mask, shifted by 0, 8 ... positions. In this case we need also to distinguish between the an arithmetic and a logic shift; in the first case, instead of filling the "empty" bits with zero, the operand sign is used. For example, if we want to shift A=-18 of B=3 bits, the first mask is used:

In the last case, let's suppose to rotate right A=1255 (=10011100111) by 5 position:

As you can see, in case of MASK 1 for the right rotation, the 8 LSB of A are copied into the 8 MSB of the mask.

### 5.2 Set-Like Operations Unit



Figure 5.16: Execute stage with focus on the SET-CMP unit

The DLX execution stage includes also the possibility to set the value of a register accordingly to the outcome of the comparison of two operands; the operands can came from two sources:

- 1. Both from the register file, trough REG\_B and REG\_A (refer to figure 5.1);
- 2. One from the register file (REG\_A by setting S1 = 0) and the immediate from REG\_IN2 (refer to figure 5.1).

The unit designed to perform this set operation is called  $\mathtt{set\_comparator}$ , that using a behavioural process, is able to generate the corresponding '1' or '0' to be set to the register, accordingly to the comparison result. The output on  $NBIT_DATA$  bits will be muxed with the one coming from the ALU unit, using the CW that is configured in the Control Unit.

In order to decrease the area and since the comparison was already generated by the comparator unit in the decode stage (refer to section 4.4), the set\_comparator unit takes the LGET signal that came from the decode unit and perform the following checks:

| Operation | VHDL implementation            |  |  |  |
|-----------|--------------------------------|--|--|--|
| SEQ       | not LGET(0)                    |  |  |  |
| SNE       | LGET(0)                        |  |  |  |
| SLE       | LGET(1) = '0' or LGET(0) = '0' |  |  |  |
| SLT       | LGET = "01"                    |  |  |  |
| SGE       | LGET(1) = '1' or LGET(0) = '0' |  |  |  |
| SGT       | LGET = "11"                    |  |  |  |

Table 5.5: Performed checks in order to generate '1' or '0' accordingly to the comparison outcome. Refer to table 4.1.

## Memory Stage

The memory stage allows to perform operations with the memory, the DRAM in this case, in order to load or store a value from/to the memory respectively. The figure 6.1 shows the Memory Stage of the DLX pipeline.



Figure 6.1: Memory stage

If the instruction that reaches the memory stage is a load, the value stored into the REG\_ALU\_OUT register is fed to a block that performs the masking of the address itself. Then the masked address is given as input to the external memory in order to retrieve the data. A multiplexer with two inputs allows to select the data coming from the memory and redirects it to another unit, called datamem\_ldstr that uses the DATA\_SIZE and UNSIG\_SIGN\_N inputs to select the correct bits for word, half word and byte.

In case of a store, the address in which the data must be saved into the memory is masked too via

the Address Mask Unit and its output is given to the DRAM. The data stored in the REG\_ME register is selected, by correctly configuring the multiplex to select it, and put as input to the datamem\_ldstr unit. The output of this unit is put as input data to the DRAM, that perform the store.

### 6.1 Address Mask Unit

The basic idea behind the Address Mask Unit is to modify the address, given as input, depending on the data size we want to store or load. All the possibility and masking procedures are summed up in the table 6.1.

| DATA_SIZE | Dimension | Masking                                   |
|-----------|-----------|-------------------------------------------|
| 00        | word      | ADDR_IN(ADDR_IN'length-1 downto 2) & "00" |
| 01        | half word | ADDR_IN(ADDR_IN'length-1 downto 1) & "0"  |
| 10        | byte      | ADDR_IN                                   |

Table 6.1: Address masking for all the three possible cases

Give an address, it must be correctly aligned depending on the dimension of the data we want to write/read to/from the memory. Refer to the 6.1 VHDL snippet. The problem with the alignment arise when data to be accessed are larger that the addressable unit (word vs byte). This allows to read always 32 bits from the memory from the correct position, thanks to the masking process. Since the data is on 32 bits, we need to select 8, 16 or 32 bits from it accordingly to the DATA\_SIZE two bits signal. This is done in the Load-Store Unit.

```
with DATA_SIZE select
ADDR_OUT <= ADDR_IN(ADDR_IN'length-1 downto 2) & "00" when "00",
ADDR_IN(ADDR_IN'length-1 downto 1) & "0" when "01",
ADDR_IN when others;</pre>
```

Listing 6.1: VHDL code for address alignment

### 6.2 Load-Store Unit

The Load-Store Unit is used to perform two operations:

- Extract the correct bits from the data coming from the memory, using the address generated by the Address Mask Unit. Depending of the DATA\_SIZE and the ALOW signals, the data is select. When the address has been correctly masked in order to fetch a word (32 bits), the data doesn't need to be extracted, neither the sign extension is needed. In the other two cases, when we are managing half word or byte, the correct portion of bits must be taken.
- In case we are managing half word or byte, the bits can be simply taken and set as output of the Load-Store Unit only if we are working with unsigned. In this case, all the others bits are se to 0. On the other hand, if the byte or the half word granularity is used and the UNSIG\_SIGN\_N is set to '0', a sign extension must be performed. In this case, the correct portion of bits is taken, as defined in table 6.2, and the first bit of it used used extends the sign.

| DATA_SIZE | ALOW | Type     | Selected bits |
|-----------|------|----------|---------------|
| 00        | *    | word     | [31-0]        |
| 01        | 0*   | half A   | [31-16]       |
| 01        | 1*   | half A+2 | [15-0]        |
| 10        | 11   | byte A+3 | [7-0]         |
| 10        | 10   | byte A+2 | [15-8]        |
| 10        | 01   | byte A+1 | [23-16]       |
| 10        | 00   | byte A   | [31-24]       |

Table 6.2: Bits selection using  ${\tt DATA\_SIZE}$  and  ${\tt ALOW}$ 

A more detailed explanation of the memory interface is available at section 2.4.3.

### Write Back Stage

The Write Back Stage is the last stage in the DLX pipeline and it is used to write the results, that can come from the memory (due to a load) or from the computation of ALU, into the File Register. Since this DXL is a pipelined processor, the destination register is available when the instruction is decoded and must be propagated until the write back register is needed. For this purpose, inside the entire Datapath, there is a skewing network that is used to have the register index exactly when necessary.

So, we need to extract from the instruction itself, depending on its type, the destination address. The position on a 32 bit instruction of the destination address is the following (refer to 2.5):

R-Type: [25...21]I-Type: [20...16]



Figure 7.1: Execute stage

The figure 7.1 shows the simplified diagram used to delaying the destination register (WS1) that comes from the Decode Stage after the extraction from the instruction, through the three registers in the Datapath. When the instruction reaches the Write Back stage in the pipeline, the register index on 5 bits, is feed back to the write input of the Register File and an additional signal, called WF is used to enable the write.

An additional signal, called S3, is used to select which data take as input fro the register file in order to perform the Write Back. Data can come from two different sources:

- It can take directly the output of the Load and Store unit, via the REG\_MEM\_LDSR\_OUT in case the selection signal S3 = 1
- It can take the output of the ALU (that includes set-cmp unit), via the REG\_MEM\_ALU\_OUT in case the selection signal S3 = 0

The RO register must be protected against the Write Back, in fact it has to be always 0. For this reason, a masking procedure has been implemented in order to solve this problem. The WF signal is gated with a non-zero check over the i\_RF\_WS signal; using few words, the write enable signal is set to 0 if a write on RO is requested.

```
i_WF <= WF when (TO_INTEGER(unsigned(i_RF_WS)) /= 0) else '0';
Listing 7.1: VHDL code gating for the WS signal coming from the CU
```

### Testing and Verification

Testing a verification is used to verify the correctness of parts or the whole DLX. Testing can be defined as the process of observing and verifying the results of a component under specific conditions in order to find possible inconsistency between the actual and the required behaviour.

The flow used to verify a component follows multiple steps:

- Creation of the Testbench
- Simulation
- Post Synthesis simulation (after the Synthesis and Optimization, refer to section 9.1)

#### 8.1 Testbenches

All Testbenches developed for the DLX verification have been implemented using VHDL. The most critical components have been tested with their own testbenches, like:

- Booth's multiplier
- P4 adder
- Comparator
- ALU
- Shifter
- Windowing Register File

While the others smaller components have been indirectly intensively tested while verifying the correct behaviour of the DLX itself.

The DLX Testbench is implicitly based on two processes, the first one manages and creates only the clock signal with a period of 1 ns, while the second one asserts the reset signal for the first clock period and the negates it in order to start the correct test execution.

In order to create the most flexible test possible, multiple components have been instantiated; these are providing all the external interfaces the DLX needs. So, besides the DLX component, the Testbench contains also the following component:

• Instruction RAM (IRAM): this is a read only memory that takes an external file, in this case a .mem file, and uses it as source for fetching the instruction. When the reset is performed, all the instructions in the file are written in an internal array of 32 bits. At each clock cycle, the array is indexed with an index that comes from the outside, that corresponds to the DLX PC.

• Data RAM (DRAM): in a similar way with the respect to the IRAM, the DRAM is loaded at the startup using the same .mem explained before. In fact, this compiled file does not contains only the code instruction but, after a blank line, also all the memory content the program should have once it starts. Since a DRAM must be also writable, the memory is implemented as a read/write one.

In order to simplify the verification of the memory content, each time a value is written and external file, that is the exact copy of the DRAM content, is updated too. In order to correctly access the data that are stored into the memory, the address comes from outside and corresponds to the one used in the Memory Stage (refer to 6).

To replicate a more realistic environment and increase the test reliability, the DRAM has been enhanced with a ready signal that is '1' when it is ready. This has been implemented with a counter, that only after data\_delay clock cycles make the memory ready and allows to generate a meaningful value.

Last but not least, the DRAM must be able to correctly manage all the different data sizes both during the write and the read operation. So, an additional signal on two bits, called MAS, has been included. A more accurate description of the memory data size management is available at section 2.4.3.

### 8.2 Simulation

### 8.3 Post Synthesis Simulation

# Physical Design

- 9.1 Synthesis
- 9.2 Place and Route

# Conclusions