

# EDRICO - Educational DHBW RISC-V Core

## Semester Paper

from the Course of Studies Electrical Engineering at the Cooperative State University Baden-Württemberg Ravensburg

by

Levi-Pascal Bohnacker, Noah Wölki

June 2021

Time of Project Student ID, Course Reviewer 31 Weeks6818486, 5040009, TEN18Prof. Dr. Ralf Gessler

#### Author's declaration

Hereby we solemnly declare:

- 1. that this Semester Paper , titled EDRICO Educational DHBW RISC-V Core is entirely the product of our own scholarly work, unless otherwise indicated in the text or references, or acknowledged below;
- 2. we have indicated the thoughts adopted directly or indirectly from other sources at the appropriate places within the document;
- 3. this Semester Paper has not been submitted either in whole or part, for a degree at this or any other university or institution;
- 4. we have not published this Semester Paper in the past;
- 5. the printed version is equivalent to the submitted electronic one.

We are aware that a dishonest declaration will entail legal consequences.

Friedrichshafen, June 2021

Levi-Pascal Bohnacker, Noah Wölki

# **Contents**

| Ac  | crony  | ms                                  | IV  |
|-----|--------|-------------------------------------|-----|
| Lis | st of  | Figures                             | V   |
| Lis | st of  | Tables                              | VI  |
| Lis | stings | S ·                                 | VII |
| 1   | Intro  | oduction                            | 1   |
| 2   | Mot    | tivation                            | 2   |
| 3   | Proj   | ject Planning                       | 3   |
| 4   | Stat   | te of the Art                       | 4   |
|     | 4.1    | Basics of Processing Units          | 4   |
|     | 4.2    | RISC vs CISC                        | 4   |
|     | 4.3    | RISC-V                              | 4   |
|     | 4.4    | Benchmarks                          | 5   |
|     |        | 4.4.1 Coremark                      | 5   |
|     |        | 4.4.2 SPECint                       | 5   |
|     | 4.5    | Memory Management                   | 5   |
|     |        | 4.5.1 Memory Hierarchy              | 5   |
|     |        | 4.5.2 Communication Interfaces      | 5   |
|     | 4.6    | FPGA                                | 5   |
|     | 4.7    | Hardware Description Languages      | 5   |
| 5   | EDF    | RICO (Educational DHBW RISC-V Core) | 7   |
|     | 5.1    | Control Unit                        | 8   |
|     |        | 5.1.1 Architecture & Design         | 8   |
|     |        | 5.1.2 Implementation                | 12  |
|     | 5.2    | Arithmetic Logical Unit             | 15  |
|     |        | 5.2.1 Architecture & Design         | 16  |
|     |        | 5.2.2 Implementation                | 16  |
|     | 5.3    | Register File (RF)                  | 18  |
|     |        | 5.3.1 Architecture & Design         | 19  |
|     |        | 5.3.2 Implementation                | 24  |
|     | 5.4    | PMP & PMA Checker                   | 24  |
|     |        | 5.4.1 Architecture & Design         | 24  |
|     |        | 5.4.2 Implementation                | 25  |

|    | 5.5 Exception Control |                 |                            |      |  |  |  |  |
|----|-----------------------|-----------------|----------------------------|------|--|--|--|--|
|    |                       | 5.5.1           | Architecture & Design      | . 26 |  |  |  |  |
|    |                       | 5.5.2           | Implementation             | . 27 |  |  |  |  |
|    | 5.6                   | AXI4-           | Lite Master                | . 27 |  |  |  |  |
|    |                       | 5.6.1           | Architecture & Design      | . 28 |  |  |  |  |
|    |                       | 5.6.2           | Implementation             | . 29 |  |  |  |  |
|    | 5.7                   | AXI4-           | Lite Slave                 | . 29 |  |  |  |  |
|    |                       | 5.7.1           | Architecture & Design      | . 29 |  |  |  |  |
|    |                       | 5.7.2           | Implementation             | . 31 |  |  |  |  |
| 6  | Test                  | : & <b>V</b> ei | rification                 | 32   |  |  |  |  |
|    | 6.1                   | Unit &          | k Integration Verification | . 32 |  |  |  |  |
|    | 6.2                   | System          | n Verification             | . 32 |  |  |  |  |
|    | 6.3                   | Accep           | tance Verification         | . 32 |  |  |  |  |
|    | 6.4                   | Bench           | marks                      | . 32 |  |  |  |  |
| 7  | Futu                  | ıre Wo          | rk                         | 33   |  |  |  |  |
| 8  | Con                   | clusion         |                            | 34   |  |  |  |  |
| Bi | bliogr                | raphy           |                            | 35   |  |  |  |  |
| Αŗ | pend                  | lix             |                            | 36   |  |  |  |  |

# **Acronyms**

ALU Arithmetic Logical Unit
CSR Control and Status Registers
CU Control Unit
FPGA Field Programmable Gate Array
FSM Finite State Machine
IP Intellectual Property
ISA Instruction Set Architecture
PMP Physical Memory Protection
PMA Physical Memory Attributes
RF Register File
RISC Reduced Instruction Set Computer
SISD Single Instruction Single Data
VHDL Very High Speed Integrated Circuit Hardware Description Language

# **List of Figures**

| 3.1  | V-Model                                             | 3  |
|------|-----------------------------------------------------|----|
| 4.1  | Xilinx FPGA [Xil17]                                 | 6  |
| 5.1  | EDRICO Overview                                     | 7  |
| 5.2  | Control Unit Architecture                           | 8  |
| 5.3  | Control Unit FSM overview                           | 9  |
| 5.4  | RISC-V Instruction formats [RIS17]                  | 10 |
| 5.5  | Decoding Structure to determine instruction cluster | 13 |
| 5.6  | Information extraction from 32-bit instruction word | 14 |
| 5.7  | ALU operations                                      | 15 |
| 5.8  | ALU operations                                      | 16 |
| 5.9  | Register Files architecture                         | 19 |
| 5.10 | General Purpose Registers                           | 20 |
| 5.11 | CSR Register File Architecture                      | 23 |
| 5.12 | PMP & PMA Checker Architecture                      | 24 |
| 5.13 | Exception Control Architecture                      | 26 |
| 5.14 | AXI4-Lite Master Architecture                       | 28 |
| 5.15 | AXI4-Lite Slave Architecture                        | 30 |

# **List of Tables**

| 5.1 | Timing of FSM                                               | 9  |
|-----|-------------------------------------------------------------|----|
| 5.2 | Program Counter control: Instructions and resulting actions | 11 |
| 5.3 | Decoding instruction clusters                               | 12 |
| 5.4 | Input code and respective operation                         | 17 |
| 5.5 | List of implemented CSRs                                    | 22 |
| 5.6 | Memory Mapped CSRs                                          | 29 |
| 5.7 | Memory Mapped CSRs reset values                             | 31 |

# Listings

| 5.1 ALU VHDL $\operatorname{code}$ | 5.1 | ALU VHDL code |  |  |  |  | 18 |
|------------------------------------|-----|---------------|--|--|--|--|----|
|------------------------------------|-----|---------------|--|--|--|--|----|

## 1 Introduction

These days one of the key benchmarks for technology is processing speed and calculation power. To realize mathematical operations and execute programs, different platforms can be utilized. The most commonly used unit is the standard processor consisting of transistors realized on silicium and other materials. Another crucial technology that is gaining more attention is the so-called Field Programmable Gate Array (FPGA). The FPGA consists of logical units that can be wired and configured individually for the required use-case. The advantage of FPGA is that the speed of applications can be drastically increased since the hardware will be very optimized for the specific application. This project aims to develop a Intellectual Property (IP)-core based on the Open Source Instruction Set RISC-V. The goal is to build a reusable unit of logic that can interpret compiled C-Code. The IP core is realized in the Very High Speed Integrated Circuit Hardware Description Language (VHDL) language and will be deployed on a FPGA. IP Cores are used in every computer, phone and electronic device that requires to execute some computational function. The developers of these IP Cores are big companies like Intel, ARM or AMD. These IP Cores and Instruction Sets are strictly licensed and not available for everyone. For the development of an own IP Core the Instruction Set is the main source of information and therefore the RISC-V open-source Instruction Set is used for this project.

## 2 Motivation

RISC-V was first proposed at Berkeley University in 2010. The architecture is therefore relatively new in comparison to others like x86, ARM or SPARC. Even though its young age is already very promising, every year new breakthroughs are achieved in the field of RISC-V based cores. MicroMagic for example announced in 2020 a chip with a total CoreMark score of 13000 and an incredible 110000 Coremark/Watt. This poses a significant development and is approximately 10 times better than any CISC, RISC or MIPS implementation in terms of Performance per Watt. Many other companies like Alibaba, Nvidia and SiFive are currently increasing research on RISC-V based cores. The Motivation behind this project was to gain experience in processor and FPGA design and verification. Furthermore it poses an interesting opportunity for students to work on a new and upcoming processor architecture.

# 3 Project Planning

In order to control the flow of the project, the V-Model approach was taken. The project is therefore divided into Requirements, System Design, Architecture Design, Module Design and Implementation. After Implementation the corresponding verification phases are ready to be executed, starting from the lowest level (Unit Verification) to Integration Verification, System Verification and last but not least Acceptance Verification.



Figure 3.1: V-Model

At the beginning of every project phase, workloads were defined e.g. the definition of the Control Unit Entity. The target of Requirements Engineering was to define everything that is expected from the core and gather information about RISC-V. The data path as well as the entities of the Control Unit, Arithmetic Unit, Register Files, Exception Control, PMP & PMA Checker and AXI4-Lite Interfaces as well as a short summary of their function were defined during System Design. The next step, Architecture Design, aimed to further specify the entities mentioned above and sub-divide them into several entities. Module Design will be executed to define every single architecture, after that implementation and testing may start.

## 4 State of the Art

## 4.1 Basics of Processing Units

#### 4.2 RISC vs CISC

#### 4.3 RISC-V

RISC-V is an open standard Instruction Set Architecture (ISA) developed by the University of California, Berkely. The ISA is based on reduced instruction set computer (RISC) principles. The ISA supports 32, 64 and 128 bit architectures and includes different extensions like Multiplication, Atomic, Floating Point and more. The ISA is open source and therefore can be used by everyone without licensing issues and high fee requirements. Due to the open source nature of the RISC-V project, many companies like Alibaba and NVIDIA have started to develop hardware based on this ISA. RISC-V opens the opportunity to optimize and configure computer hardware to a level that would not be realizable with licensed ISA like ARM or x86. As a result of this possibility there are many projects and companies working on hardware and software that are beating common CPU in terms of performance and power usage by a lot.

#### 4.4 Benchmarks

#### 4.4.1 Coremark

#### 4.4.2 SPECint

## 4.5 Memory Management

#### 4.5.1 Memory Hierarchy

#### 4.5.2 Communication Interfaces

#### 4.6 FPGA

To verify a digital circuit software simulations as well as implementing the design on a prototype are common practice. For prototyping and even implementing a finished product, FPGA are widely used. FPGAs are special fine granularity Programmable Logic Devices. The digital logic can be described using hardware description languages such as Verilog or VHDL. These designs are then synthesized, placed and routed in order to generate a hardware configuration file, also called bitstream. The bitstream can then be loaded onto the FPGA via a programming interface e.g. JTAG. Many different vendors produce FPGAs, the most famous ones are Xilinx, Altera/Intel and Microchip. Some smaller vendors like NanoXplore produce FPGAs targeting rare use cases like space applications. Despite the many differences in design of an FPGA, the basic architecture always remains the same. An array of logic cells and building blocks of different features 9 like BRAM and DSP slices are connected to each other through configurable routing channels. Figure 4.1 shows the basic architecture of a Xilinx FPGA:

## 4.7 Hardware Description Languages

The CLBs in this architecture are comprised of LUTs and Flip-Flops, in order to implement boolean functions and allow the design of synchronous circuits. FPGAs produced by Xilinx are mostly SRAM based, other approaches are flash or anti-fuse based architectures.



Figure 4.1: Xilinx FPGA [Xil17]

# 5 EDRICO (Educational DHBW RISC-V Core)

The Proposed Processor design named EDRICO implements a basic RV32I instruction set architecture. Besides the mandatory "Zicsr" extension no other instruction set extensions are implemented. To keep the implementation simple and straight-forward only one privilege mode (Machine-mode) is implemented. This mode allows full access to the processor and peripherals. Future Versions could be extended to implement S-Mode and U-Mode.

The core is a simple Single Instruction Single Data (SISD) processor without any pipeline or even cache. The basic instruction cycle of fetch, decode, execute, store is performed for every instruction one at a time.

Figure 5.1 shows the full overview of the processor design:



Figure 5.1: EDRICO Overview

Its main components are the Exception Control, Control Unit, Arithmetic Unit, Register Files, PMP & PMA checker and the AXI4 Interfaces. Each one of the components will be described in more detail in the following section.

#### 5.1 Control Unit

The Control Unit (CU) is the heart of the processor and controls the other parts of the processor depending on the input instruction. The CU is responsible for fetching instructions from the instruction memory, decode the bitstream and set the respective control signals for the other processor components. Due to the complexity of the CU, there are several sub-modules which together form the overall CU.

#### 5.1.1 Architecture & Design

A general overview of the CU architecture is displayed in Figure 5.2.



Figure 5.2: Control Unit Architecture

To describe the functionality of the CU in more detail, every sub-module will be described closely.

Since the Control Unit is responsible for the whole processor, it is important to have a persistent and stable procedure for every instruction that shall be executed. The Control Unit Finite State Machine (FSM) is responsible for the correct clock timings which is important due to memory operations and the execution time of the other processor parts. The states and conditions of the FSM are displayed in Figure 5.3.



Figure 5.3: Control Unit FSM overview

Table 5.1 shows a more detailed overview of the clock cycles and the corresponding actions and states:

| ClockCycle | Edge    | Action                                                                                            | Signal                                        |
|------------|---------|---------------------------------------------------------------------------------------------------|-----------------------------------------------|
| 1          | rising  | pass the PC and enable PMP & PMA checker with respective information                              |                                               |
|            | falling | N/A                                                                                               |                                               |
| 4          | rising  | data is ready in instruction register - switch to execute state                                   | $memOP finished \& store\_systemData$ is high |
| 5          | rising  | execution is started - if memory operation wait for another $memOP finished$ flag, otherwise wait | $execute\_enable$                             |
| x          | rising  | during memory operation: data loaded to buffer \store transfer finished $\rightarrow$ wait state  | $memOP finished \& store\_systemData$ is high |
|            | falling | if load: store data form buffer to specified location                                             |                                               |
| 6 / x+1    | rising  | go to fetch_state                                                                                 |                                               |

Table 5.1: Timing of FSM

During an execution cycle, the FSM controls the rest of the CU consisting of memory, decoding unit, PC control and the different multiplexers. To understand what the purpose of the different signals are, the other components of the Control Unit are described in the following sections.

After loading an instruction from the memory to the instruction register, the decoding process can begin. The responsible part for this process is the decoding unit which is described below. (Also visible in figure 5.2)

In this project the RISC-V RS32I instruction set is used which consists of 32-bit instruction words. The instruction words have a pre-defined structure and are divided into six instruction formats. The instruction formats are shown in Figure 5.4.

| 31 30 25                  | 24 21   | 20     | 19  | 15 14  | 4 12   | 11 8     | 7       | 6 0    |        |
|---------------------------|---------|--------|-----|--------|--------|----------|---------|--------|--------|
| funct7                    | rs2     |        | rs1 | 1      | funct3 | ro       | l       | opcode | R-type |
|                           |         |        |     |        |        |          |         |        |        |
| imm[1                     | 1:0]    |        | rs1 | f      | funct3 | ro       | l       | opcode | I-type |
|                           |         |        |     |        |        |          |         |        | _      |
| imm[11:5]                 | rs2     |        | rs1 | f      | funct3 | imm      | [4:0]   | opcode | S-type |
|                           |         |        |     |        |        |          |         |        |        |
| $[imm[12] \mid imm[10:5]$ | rs2     |        | rs1 | f      | funct3 | imm[4:1] | imm[11] | opcode | B-type |
|                           |         |        |     |        |        |          |         |        |        |
|                           | imm[31] | 12]    |     |        |        | ro       | l       | opcode | U-type |
|                           |         |        |     |        |        |          |         |        |        |
| [imm[20]] $[imm[1]$       | 0:1] i  | mm[11] | imn | ո[19:1 | 12]    | ro       | l       | opcode | J-type |

Figure 5.4: RISC-V Instruction formats [RIS17]

The different instruction formats are useful for the decoding process since e.g. all LOAD instructions have the same structure and therefore, the effort to decode the 32-bit word can be reduced. Since the control signals are unique for every instruction and depending on the content of the 32-bit word, the decoder has to identify the encoded instruction, extract the information and respectively set the control signals, calculate immediates and control the multiplexers. A more detailed description of the decoding process can be found in section 5.1.2.

After the instruction is decoded, all output control signals are stable and ready to be fed through. Before leaving the CU, the *Execute Buffer* (figure 5.2) buffers the control signals. Once the FSM sets the *execute\_enable* flag, the control signals are fed through. This buffer prevents the processor to confuse timing and clock cycles, or use signals which are not yet set correctly.

During an instruction execution, the program counter has to be incremented for the processor to know what instruction will follow. But since there are several instructions that modify the program counter, a so called PC control is designed. The PC control receives information from the decoder which consists of a 4 bit signal. The different instructions and the respective action as well as the respective control signal are shown in following table 5.1.1:

#### 5 EDRICO (Educational DHBW RISC-V Core)

| Instruction | Action                                                                                        | Control Signal |
|-------------|-----------------------------------------------------------------------------------------------|----------------|
| Default     | No action required                                                                            | 0000           |
| Branch      | Depending on the result of branch operation, PC will be incremented respectively              | 0010           |
| JAL         | Target address obtained by adding current PC and immediate, rejump address stored in register | 0100           |
| JALR        | Target address obtained by adding input register to immediate                                 | 1000           |

Table 5.2: Program Counter control: Instructions and resulting actions

For instructions which do not influence the program counter, the standard operation performs the PC + 4 operation.

The instruction register displayed in figure 5.2 manages the instruction string coming from the memory. All of these parts together form the Control Unit and are responsible for the correct execution of the instructions. The implementation of the sub-units in VHDL are described in the following section 5.1.2.

## 5.1.2 Implementation

The implementation of the Control Unit is split up into multiple sub-implementations. As shown in figure 5.2 those sub-modules are the FSM, decoder, execute\_buffer, PC control and instruction register. Since the implementation of the FSM is very similar to other FSM implementations in this project, the detailed description of a FSM in VHDL is found in the next chapters.

In this section the implementation of the decoder will be described more closely. As already described in section 5.1.1 the instructions can be separated in different instruction formats. To distinguish the different instructions, so-called *instruction clusters* are created. These clusters sum up instructions which are encoded in the same instruction format or in general are similar. The following table shows the different clusters and the corresponding instructions:

| Cluster                                                     | Instructions                                                     |  |  |  |  |  |
|-------------------------------------------------------------|------------------------------------------------------------------|--|--|--|--|--|
| LOAD                                                        | Load - Byte \Halfword \Word                                      |  |  |  |  |  |
| STORE                                                       | Store - Byte \Halfword \Word                                     |  |  |  |  |  |
| BRANCH                                                      | BRANCH Different Branch Instructions (e.b. Branch if equal)      |  |  |  |  |  |
| JALR only JALR, since it has a unique instruction structure |                                                                  |  |  |  |  |  |
| JAL only JAL, since it has a unique instruction structure   |                                                                  |  |  |  |  |  |
| OP                                                          | All arithmetic instructions like ADD, SUB, shift and comparisons |  |  |  |  |  |
| OP-IMM                                                      | All arithmetic instructions performed with immediate             |  |  |  |  |  |
| AUIPC                                                       | only AUIPC, since it has a unique instruction structure          |  |  |  |  |  |
| LUI                                                         | only LUI, since it has a unique instruction structure            |  |  |  |  |  |

Table 5.3: Decoding instruction clusters

To determine the cluster for each instruction, a decoding procedure is implemented in VHDL based on structure visualized in figure 5.5:



Figure 5.5: Decoding Structure to determine instruction cluster

After determining the cluster, the VHDL code assigns all the outputs visible in figure 5.2 with the respective information. For a better understanding of the information extraction, figure 5.6 shows how the 32-bit instruction word is split up (in this case for the OP and OP-IMM instructions.)



Figure 5.6: Information extraction from 32-bit instruction word

# 5.2 Arithmetic Logical Unit

The Arithmetic Logical Unit (ALU) is the part of the processor that performs the arithmetic and logical operations. Figure 5.7 gives an overview of what type of operations are performed.



Figure 5.7: ALU operations

#### 5.2.1 Architecture & Design

To implement the ALU, it is required to have the data inputs as well as clock input and a control signal consisting of 4 bits to specify the required operation to be performed. Since there are instructions that require a branch response to know whether the next instructions shall be skipped or not, the ALU needs an additional output called *branch\_re* other than the result output of the arithmetic/logical operation. The architecture of the ALU is shown in figure 5.8:



Figure 5.8: ALU operations

Not only the instructions e.g. ADD, SUB, XOR... require an arithmetic operation but also LOAD, STORE.. require an ALU action. While ADD, SUB, XOR... require an operation between the two input values (either register-register or register-immediate) to get a mathematical or logical result, the LOAD, STORE... instructions require the ALU to build the target addresses for the memory access.

#### 5.2.2 Implementation

The implementation of the ALU is based on figure 5.7 and performs a switch-case on all the different input values of  $ALU\_OP$ . The 4-bit input variable specifies the operation based on following declarations:

| ALU_OP | Operation                    |
|--------|------------------------------|
| 0000   | ADD                          |
| 0001   | SUB                          |
| 0010   | AND                          |
| 0011   | OR                           |
| 0100   | XOR                          |
| 0101   | EQUAL                        |
| 0110   | NEQUAL                       |
| 0111   | shift_left                   |
| 1000   | $\operatorname{shift}$ right |
| 1001   | shift_right (arithmetic)     |
| 1010   | <                            |
| 1011   | < (unsigned)                 |
| 1100   | ≥                            |
| 1101   | $\geq$ (unsigned)            |

Table 5.4: Input code and respective operation

To visualize the implementation, a part of the VHDL code is displayed in the following. The case statement is based on the input  $alu\_op$ . The ALU then performs the corresponding operation with the two inputs  $in\_a$  and  $in\_b$ .

```
begin
   process(in_a, in_b, alu_op)
2
   begin
    --default output is 0
4
   branch re <= '0';
   6
     case alu_op is
7
       when "0000" =>--"ADD"
          alu_result <= in_b + in_a;
9
       when "0001" =>--"SUB"
10
          alu_result <= in_b -in_a;
11
       when "0010" => -- "AND"
12
          alu_result <= in_b AND in_a;
       when "0011" =>--"OR"
14
          alu_result <= in_b OR in_a;
15
       when "0100" =>--"XOR"
16
          alu result <= in b XOR in a;
17
       when "0101" =>--"EQUAL"
         if(in_b = in_a) then
19
         branch re <= '1';
20
         else
21
         branch re <= '0';
22
         end if;
24
```

Listing 5.1: ALU VHDL code

In case the operation determined by  $alu\_op$  might be originating of a branch instruction, the  $branch\_re$  flag has to be set respectively (line 19). Since the branch instructions only include some of the arithmetic and logical operations of the ALU, the default value for the  $branch\_re$  is set as  $\theta$ .

## 5.3 Register File (RF)

The Register can be used to store data and configure the core. General Purpose registers are mostly used to temporarily store data like operands and results of calculations. The CSR Register File is used to configure the core and retrieve information about it. More detailed information about these two RFs can be found in the following chapters.

#### 5.3.1 Architecture & Design

The architecture of the register files is shown in following figure 5.9



Figure 5.9: Register Files architecture

In order to write data to a register, the value needs to be put on the *data\_in* bus and the corresponding control signals need to be configured. Data can be read either via the *data\_busA* or *data\_busB*, the CSR registers can only be accessed on *data\_busB* via a MUX controlled by the *CSR\_read* signal. If the bit is set, the CSR specified by *CSR\_address* will be visible on *data\_busB* at the next rising clock edge.

In order to store data to a CSR register, it has to be put on the *data\_in* bus. If the *CSR\_write* bit is set, the data is saved to the register specified by *CSR\_address* on the next falling clock edge.

If the  $CSR\_save$  signal is applied, the CSR Register File output is connected to the data input bus of the general purpose registers. This allows to save CSR data directly to the RV32I registers without using the system data bus. Some of the CSR registers allow direct memory accesses, e.g. the  $msip\_dra \& mtip\_dra$  signals are used to set the software and timer interrupt pending bits respectively. To increase the performance of memory accesses, the PMP CSRs can be read via a dedicated bus.

While the CSR RF is accessed by 12 Bit addresses, RV32I registers are accessed without addressing in order to decrease decoding time. Therefore, each register requires three

Signals resulting in a total of 96 signals for read and write. The RFs are clocked with the system clock and reset on system reset. As per usual, data is stored on the falling clock and displayed on the rising clock.

#### **General Purpose Registers**



Figure 5.10: General Purpose Registers

Every register may be modified via  $data\_in$  and read via  $data\_busA$  and  $data\_busB$ . The only limitation is  $x\theta$ , per definition  $x\theta$  must be hardwired to zero. Writing to the register is allowed but has no effect, reading will return  $\theta x\theta\theta\theta\theta\theta\theta\theta\theta$ . According to the RV32I programmers model some of the other registers should be used for dedicated purposes, like the stack pointer. There are no hardware checks implemented in order to enforce those rules.

Writes are executed on a falling edge if the corresponding  $register\_write$  signal is high. Reads are performed on a rising edge if the corresponding  $register\_read\_A/B$  is high. There are no hardware checks implemented to prevent multiple registers from writing to the same data bus at the same time. This must be prevented by the controlling element e.g. the control unit.

All registers are cleared to 0x00000000 on reset.

#### Control and Status Registers (CSR)

CSRs are Control and Status Registers that are introduced to the design by the RISC-V privileged specification. Some of the values stored inside the CSRs need to be provided to the Exception Unit and the PMP & PMA checker to decrease complexity, some of that accesses can be performed through direct memory access. Some of the CSRs are memory mapped. Memory mapped means they need to be accessible via the system bus by any AXI4-Lite master. In order to implement this, an AXI4-Lite salve module implementing all memory mapped CSRs is added to the design.

Further Information about this module can be found in the chapter 5.7. The CSRs that provide direct access are:

- msip and mtip bits in the mpi register (write)
- pmpcfg (read)
- pmpaddress (read)

Some registers are specified as WARL registers, meaning anything can be written to them, but the value returned on read must be a legal value. Table 5.3.1 displays every non memory mapped CSR, the corresponding address, type, access possibility, width and a short description:

## $5\ EDRICO\ (Educational\ DHBW\ RISC-V\ Core)$

| register          | address     | type | access | width  | description                                                       |
|-------------------|-------------|------|--------|--------|-------------------------------------------------------------------|
| misa              | 0x301       | WARL | R      | 32-bit | describes supported ISAs                                          |
| mvendorid         | 0xF11       | N/A  | R      | 32-bit | describes vendor id                                               |
| marchid           | 0xF12       | N/A  | R      | 32-bit | describes architecture ID                                         |
| mimpid            | 0xF13       | N/A  | R      | 32-bit | describes implementation ID                                       |
| mhartid           | 0xF14       | N/A  | R      | 32-bit | describes hart id                                                 |
| mstatus           | 0x300       | N/A  | R/W    | 32-bit | reflects & controls a hart's current operating state              |
| mtvec             | 0x305       | N/A  | R/W    | 32-bit | holds trap vector configuration                                   |
| mie               | 0x304       | N/A  | R/W    | 32-bit | reflects interrupt enable state                                   |
| mip               | 0x344       | N/A  | R      | 32-bit | holds interrupt pending bits                                      |
| mcycle            | 0xB00       | N/A  | R/W    | 64-bit | holds count of clock cycles                                       |
| minstret          | 0xB02       | N/A  | R/W    | 64-bit | holds count of executed instructions                              |
| mhpcounter(3-31)  | 0xB03-0xB1F | N/A  | R/W    | 32-bit | holds count of events (lower 32 bit)                              |
| mhpcounterh(3-31) | 0xB83-0xB9F | N/A  |        | 32-bit | holds counter of events (upper 32 bit)                            |
| mhpevent(3-31)    | 0x323-0x33F | N/A  | R/W    | 32-bit | specifies events on which to increment corresponding mhpmcounter  |
| mcountinhibit     | 0x320       | WARL | R/W    | 32-bit | controls which hardware performance monitoring counters increment |
|                   |             |      |        |        | (if set, no increment)                                            |
| mscratch          | 0x340       | N/A  | R/W    | 32-bit | Dedicated for use by machine-mode                                 |
| mepc              | 0x341       | WARL | R/W    | 32-bit | holds jump-back address during interrupt                          |
| mcause            | 0x343       | N/A  | R/W    | 32-bit | specifies cause of exception/interrupt                            |
| mtval             |             | N/A  | R/W    | 32-bit | holds information about trap                                      |
| pmpcfg0-3         | 0x3A0-0x3A3 | N/A  | R/W    | 32-bit | Physical memory protection configuration                          |
| pmpaddr0-15       | 0x3B0-0x3BF | N/A  | R/W    | 32-bit | Physical memory protection address register                       |

Table 5.5: List of implemented CSRs

data bus B hardwired to zero regs MSIP\_dra MTIP\_dra misa mvendorid marchid mscratch mimpid mip mhartid mhpmcounterH3-31 mhpmcounter3-31 Hardware Performance Monitor mcvcleH mcvcle mhpmevent3-31 minstretH minstret mcountinhibit mstatus MSIE/MTIE (mie) interrupt\_ge MSIP/MTIP (mip) nerator mtvec mepc CSR address CSR\_read\_sel CSR mcause Controller pmp regs

The architecture of the CSR Register File is shown in 5.11.

Figure 5.11: CSR Register File Architecture

The CSR Register File is comprised of the different CSR Registers, CSR\_Controller and interrupt\_generator.

Some of the implemented CSR registers are hardwired to zero, reads to those addresses return 0x00000000 and writes have no effect if the register is defined as read only the write will result in an  $illegal\_instruction\_exception$ . Other CSR registers may be defined as read-only. The  $CSR\_Controller$  checks if a CSR access is allowed and produces access signals for each register from the given address. If an address is not writable but yet a write is requested, the  $CSR\_Controller$  raises an  $illegal\_instruction\_exception$ .

The *interrupt\_generator* checks if an interrupt is pending, enabled and if interrupts are globally enabled. In that case the corresponding interrupt is raised. The interrupts remain pending, as long as the corresponding direct register access signals it. To prevent the machine from being stuck in the interrupt, the programmer of the ISR must clear the pending interrupts, e.g. the timer interrupt by writing to the memory mapped CSRs.

The Hardware Performance Monitor registers are incremented at special events (clock cycle and instruction finish) therefore two 64-Bit increment units have to be implemented. The increments are only performed, if the corresponding bit inside the *mcounterinhibit* register is set.

The PMP CSRs are composed of the 16 *pmpaddr* registers and four *pmpcfg* registers. They can be read and written, additional direct memory access is implemented in order to decrease memory access latency

#### 5.3.2 Implementation

#### 5.4 PMP & PMA Checker

Physical memory protection and physical memory attribute checking must be done in order to ensure that only allowed/defined memory regions are accessed to provide a minimum of security and prevent the core from accessing not defined regions, which may result in the core getting stuck.

#### 5.4.1 Architecture & Design

An overview of the PMP & PMA Checker is presented by 5.12:



Figure 5.12: PMP & PMA Checker Architecture

The PMP Unit is used to define memory regions and enforce several rules onto those, for example if instructions may be fetched from a region or if writes are allowed. In order to apply the rules the corresponding region must be locked by setting the L-bit inside the corresponding *pmpcfg* CSR. Once locked, regions may only be unlocked by a system reset. After a restart every region is unlocked and reset, e.g. a boot loader could enforce several rules for memory accesses in M-Mode before control hand-over to the main software running on the core. If a PMP entry is not locked, every memory access that matches this address space may succeed.

In order to specify, if reads, writes or instruction fetches are possible to a certain address, it needs to be specified inside the *pmpaddr* and corresponding *pmpcfg* CSRs. The PMP unit has direct access to those and enforces the rules. If an address is not specified every access is allowed.

Right now there is a single PMA check implemented, its job is to check if the memory access is aligned. A word is 32-bit, halfword 16-bit and byte 8-bit. The smallest addressable data unit is one byte long. Therefore a word access is aligned, if the memory address modulo 4 is 0 (two LSBs are zero), for a halfword access the memory address modulo 2 must be zero (LSB is zero) and byte accesses are allowed on every address. For both units (PMP and PMA) information about the access like the access size and type must be provided, this is done by the control unit. PMP and PMA checks are applied on the data present as soon as the enable signal is applied. Both checks are simple logical functions and must be performed under a clock cycle. The address register is updated with the corresponding address after every PMP, independent of its result. This must be done since the Exception Control needs access to the faulty address at the rising clock edge if an exception is risen.

The logical and is used, in order to ensure that both the *PMA\_checker* and *PMP\_checker* rules are enforced.

#### 5.4.2 Implementation

## 5.5 Exception Control

The Exception Control Unit is used to guard exception entries and exits one of its tasks is e.g. to modify the PC accordingly and save information about the exception. In addition, two interrupt entries are guarded by the unit. A list of all supported exceptions and interrupts is listed below:

Exceptions can be caused by the following modules:

- Control Unit
- CSR register file
- PMP & PMA checker

The Control Unit may cause the following exceptions:

- illegal instruction exception
- breakpoint exception

• environment-call-from-M-mode exception

The PMP & PMA checker may cause the following exceptions:

- load access fault exception
- store/AMO access fault exception
- instruction access fault exception
- load address misaligned exception
- store/AMO address misaligned exception
- instruction address misaligned exception

The CSR register file may cause the following exceptions/interrupts:

- timer interrupt
- software interrupt
- illegal instruction exception

#### 5.5.1 Architecture & Design

Figure 5.13 shows the architecture of the Exception Control:



Figure 5.13: Exception Control Architecture

The Exception Control module is comprised of a FSM, an arbiter to decide if an interrupt or exception is taking place and which one shall be handled if multiple are raised at a time. To modify registers a  $DRA\_control$  as well as a  $CSR\_access\_unit$  and a buffer register are added to the design.

The arbiter receives the different exception and interrupt signals, it's purpose is to decide which interrupt will be executed and generate the according exception code as well as the signal for the FSM to start the routine.

The *DRA\_control* module is used to load the Instruction Register, PMP address register and Program Counter. A load to the PC may also be performed, in order to do so the *PC\_rw* signals must be asserted one clock cycle earlier by the FSM.

The buffer\_register is implemented to allow data shares between the DRA\_control and CSR\_access\_unit. It implements another functionality that allows to set either the MIE register to 1 (exit) or the MPIE register to 0 (entry) and switch the two bits (MIE and MPIE) on the data\_out line. This feature is used during the phase of modifying the mstatus register and allows the modify to happen in at least two clock cycles.

The  $CSR\_access\_unit$  is used to perform register accesses to the CSR register file. During Exception entry or return the data bus B must be connected to the  $data\_in\_EC$  bus and the  $data\_in$  bus to the  $data\_out\_EC$  bus. This is done by implementing a multiplexer at the corresponding buses controlled by the  $halt\_core$  signal.

If an Exception or Interrupt is raised, the Exception Control unit must write the current PC+4 to the mpec register, modify the mcause register to reflect the cause of the exception/interrupt and update the mtval CSR to provide additional information on the taken trap. If multiple exceptions occur at once, only the highest priority exception is taken.

If an Exception is raised, while the processor is handling another exception, *mpec*, *mcause* & *mtval* are overwritten. The saving of the return address and other information stored in those registers is left to the trap handler, in order to avoid a large hardware register stack. If an exception entry is performed the *exceptionORinterrupt\_occured* signal on the FSM is high, if the return signal is high it indicates that an exit shall be performed, however if both signals are high (should not happen in normal operation) the entry will be performed.

#### 5.5.2 Implementation

#### 5.6 AXI4-Lite Master

To connect the processor to peripherals the AXI4-Lite protocol is used. Due to its popularity many IPs such as BRAM can be connected to each other using an Interconnect. EDRICO contains a Master and a Slave interface. Both of which are explained in more detail in the following sections:

#### 5.6.1 Architecture & Design

The AXI4-Lite Master is implemented in order to allow memory accesses e.g. to a BRAM or UART IP. The following figure depicts the architecture of the master:



Figure 5.14: AXI4-Lite Master Architecture

It includes two registers, the data register and the address register, these are loaded on the rising edge of the system clock if the corresponding load signal is applied. Both register outputs are constantly applied to the corresponding AXI signals (AWADDR, ARADDR and WDATA).

The AXI4-Lite Master Control Unit is in charge of controlling the AXI Transfer. It controls the ready and valid signals as well as the register reads and writes. It is clocked with *clk*. The *rst* signal will reset it. The Master also provides the clock and reset for every slave connected to the AXI interconnect. To start a transfer, the enable signal must be high on a rising edge of the clock, in that case the address and data register are loaded on the next falling edge of system clock.

If data is ready to be read from the system bus, it is routed to an output of the Master and the  $store\_systemSystemData$  is set to high, in order to ensure a correct read, this process shall not be clocked. The surrounding system must then process these signals to store the data in the correct register.

At the end of a data transfer, the *memOp\_finished* signal is set to high and remains high until a new transfer is initiated.

If an error occurred during an access, the instruction\_afe\_P, storeAMO\_afe\_P or

load\_afe\_P exceptions are raised. And remain high until the halt\_core signal is applied to the AXI4-Lite Master Control Unit.

#### 5.6.2 Implementation

#### 5.7 AXI4-Lite Slave

Some of the RISC-V CSRs must be memory mapped, meaning the must be accessible by other devices via the memory space. The addresses can be specified by setting a generic in the VHDL code. A basic address is predefined for each CSR, corresponding to the CLINT module by sifive since this is the closest thing to industry standard.

| register  | address     | access | width  | description                            |
|-----------|-------------|--------|--------|----------------------------------------|
| msip      | 0x0200_0000 | R/W    | 32-bit | hold software interrupt pending bit    |
| mtimecmp  | 0x0200_4000 | R/W    | 32-bit | hold time compare value (lower 32 bit) |
| mtimecmph | 0x0200_0004 | R/W    | 32-bit | hold time compare value (upper 32 bit) |
| mtime     | 0x0200_BFF8 | R/W    | 32-bit | hold time (lower 32 bit)               |
| mtimeh    | 0x0200_BFFB | R/W    | 32-bit | hold time (upper 32 bit)               |

Table 5.6: Memory Mapped CSRs

#### 5.7.1 Architecture & Design

The AXI4-Lite slave is in charge of accepting data transfers and reading/writing the CSRs. Figure 5.15 shows the architecture, including the memory mapped CSRs:



Figure 5.15: AXI4-Lite Slave Architecture

As an AXI4-Lite slave an IP core may be used. In fact the Vivado AXI Interface generator may be used to generate the AXI4-Lite Slave as well as the registers, this IP could then be modified with the comparator and the *msip* and *mtip* output The AXI Slave allows reads and writes to the 32 bit registers *mtimeh*, *mtime*, *mtimecmp* and *mtimecmph* and the one byte *msip* register.

The comparator is used to trigger a timer interrupt. It compares *mtime* and *mtimecmp*. If the value of *mtime* is equal or greater than the one in *mtimecmp*, the *mtip* signal is asserted.

The interrupt remains posted, until it is cleared by writing to the *mtimecmp* register. The *msip* register is a one byte register, the remaining 31 bits are hardwired to zero. Its reset value is zero. If it is set to one, the *msip* bit in *mip* is set and therefore, a software interrupts may be risen.

Table 5.7.1 shows the reset values for each CSR:

| register  | value      |
|-----------|------------|
| msip      | '0'        |
| mtimeh    | 0x00000000 |
| mtime     | 0x00000000 |
| mtimecmph | 0xFFFFFFFF |
| mtimecmp  | 0xFFFFFFFF |

Table 5.7: Memory Mapped CSRs reset values

## 5.7.2 Implementation

# 6 Test & Verification

- 6.1 Unit & Integration Verification
- 6.2 System Verification
- **6.3 Acceptance Verification**
- 6.4 Benchmarks

# 7 Future Work

# 8 Conclusion

# **Bibliography**

- [RIS17] RISC-V. The RISC-V Instruction Set Manual. 2017. URL: https://riscv.org/technical/specifications.
- [Xil17] Xilinx. Understanding FPGA Architecture. 2017. URL: https://www.xilinx.com/html\_docs/xilinx2017\_2/sdaccel\_doc/topics/devices/con-fpga-architecture.html.

# **Appendix**