# University of California at Berkeley College of Engineering Department of Electrical Engineering and Computer Science

EECS151/251A - LB, Fall 2019

# Project Specification: RISCV151

# Version 4.3

# Contents

- 1 Introduction
  - 1.1 Tentative Deadlines
  - 1.2 General Project Tips
- 2 Checkpoints 1 & 2 - 3-stage Pipelined RISC-V CPU
  - 2.1 Setting up your Code Repository
  - 2.2 Integrate Designs from Labs
  - 2.3 Project Skeleton Overview
  - 2.4 RISC-V 151 ISA
    - 2.4.1 CSR Instructions
  - 2.5 Pipelining
  - 2.6 Hazards
  - 2.7 Register File
  - 2.8 RAMs
    - 2.8.1 Initialization
    - 2.8.2 Endianness + Addressing
    - 2.8.3 Reading from RAMs
    - 2.8.4 Writing to RAMs
  - 2.9 Memory Architecture
    - 2.9.1 Summary of Memory Access Patterns
    - 2.9.2 Unaligned Memory Accesses
    - 2.9.3 Address Space Partitioning
    - 2.9.4 Memory Mapped I/O
  - 2.10 Testing
    - 2.10.1 Integration Testing
  - 2.11 Software Toolchain - Writing RISC-V Programs
  - 2.12 Assembly Tests
  - 2.13 BIOS and Programming your CPU
  - 2.14 Target Clock Frequency
  - 2.15 Matrix Multiply
  - 2.16 How to Survive This Checkpoint
    - 2.16.1 How To Get Started
  - 2.17
    - 2.17.1 Checkpoint 1: Block Diagram
    - 2.17.2 Non-Checkpoint Weeks
    - 2.17.3 Checkpoint 2: Base RISCV151 System
    - 2.17.4 Checkpoints 1 & 2 Deliverables Summary
- 3 Checkpoint 3 - I/O Integration, PWM Controller, Subtractive Synthesizer
  - 3.1 I/O Integration
    - 3.1.1 Hookup User I/O
    - 3.1.2 User I/O Test Program
  - 3.2 PWM Controller
    - 3.2.1 Piano Program
  - 3.3 Subtractive Synth
  - 3.4 Checkpoint 3 Deliverables Summary
- 4
  - 4.1 Checkpoint 4 Deliverables Summary
- 5 Final Checkpoint - Optimization
  - 5.1 Clock Generation Info + Changing Clock Frequency
  - 5.2 Critical Path Identification
    - 5.2.1 Finding Actual Critical Paths
  - 5.3 Optimization Tips
- 6 Optimizations, Extra Credit, and Grading
  - 6.1 Grading on Optimization
  - 6.2 Checkpoints
  - 6.3 Style: Organization, Design
  - 6.4 Final Project Report
    - 6.4.1 Report Details
  - 6.5 Extra Credit
  - 6.6 Project Grading
- 7 Project Timeline
- A Local Development
- B
  - B.3 Tone Generator and I2S Extra Credit Deliverables
- C BIOS

# 1 Introduction

The goal of this project is to familiarize EECS151/251A students with the methods and tools of digital design. In teams of 2, you will design and implement a 3-stage pipelined RISC-V CPU with a UART for tethering. Afterwards, you will attach the I/O circuits you built in the lab to the CPU and design a subtractive audio synthesizer. Then, you will implement a simple dynamic frequency scaling algorithm to bound the power consumption and temperature of your FPGA design while achieving a performance target. Finally, you will optimize your CPU for performance (minimizing execution time as given by the Iron Law) and cost (FPGA resource utilization).

You will use Verilog to implement this system, targeting the Xilinx Pynq platform (a Pynq-Z1 development board with a Zynq 7000-series FPGA). The project will give you experience designing with RTL descriptions, resolving hazards in a simple pipeline, building interfaces, and approaching system-level optimization.

In tackling these challenges, your first step will be to map the high level specification to a design which can be translated into a hardware implementation. After that, you will produce and debug that implementation. These first steps can take significant time if you have not thought out your design prior to trying implementation.

As in previous semesters, your EECS151/251A project is probably the largest project you have faced so far here at Berkeley. Good time management and good design organization are critical to your success.

## 1.1 Tentative Deadlines

The following is a brief description of each checkpoint and approximately how many weeks will be allotted to each one. This schedule may change as the semester progresses. The current schedule is summarised at the end of the document in Section 7.

- October 25: Checkpoint 1 (1 week). Draw a schematic of your processor's datapath and pipeline stages.
- November 15: Checkpoint 2 (3 weeks). Implement your RISC-V processor core in Verilog and write tests to verify your implementation.
- November 22: Checkpoint 3 (1 week). Attach I/O components from lab to your processor (FIFOs, buttons, switches), and build a general PWM controller and a basic subtractive synthesizer.
- December 6: Checkpoint 4 (2 weeks). Implement a dedicated power management unit to monitor FPGA temperature and power consumption and dynamically switch the operating frequency of the main CPU.
- December 13 (by appointment): Final Checkoff. Final processor optimization and checkoff.
- December 15: Project Report. Final report due.

## 1.2 General Project Tips

Document your project as you go. You should comment your Verilog and keep your diagrams up to date. You will need to turn in a final report documenting your project, and up-to-date design documents also help with debugging.

Finish the required features first. Attempt extra features after everything works well. If your submitted project does not work by the final deadline, you will not get any credit for any extra credit features you have implemented.

As in past semesters, this project is divided into checkpoints. The following sections specify the objectives for each checkpoint.

# 2 Checkpoints 1 & 2 - 3-stage Pipelined RISC-V CPU

The first checkpoint in this project is designed to guide the development of a three-stage pipelined RISC-V CPU that will be used as a base system in subsequent checkpoints.

Figure 1: High-level overview of the full system

The green (RISC-V core) and yellow (UART/counters) blocks on the diagram are the focus of the first and second checkpoints. The third checkpoint will add audio and I/O components in blue. Finally, the fourth checkpoint will implement the power management unit in red.

## 2.1 Setting up your Code Repository

The project skeleton files are available on GitHub. The suggested way to initialize your repository with the skeleton files is as follows:

```
git clone git@github.com:EECS150/project_skeleton_fa19.git
cd project_skeleton_fa19
git remote add my-repo git@github.com:EECS150/fa19_teamXX.git
git push my-repo master
```

Then reclone your repo and add the skeleton repo as a remote:

```
cd ..
rm -rf project_skeleton_fa19
git clone git@github.com:EECS150/fa19_teamXX.git
cd fa19_teamXX
git remote add staff git@github.com:EECS150/project_skeleton_fa19.git
```

To pull project updates from the skeleton repo, run git pull staff master.

To get a team repo, fill in one line of the Google spreadsheet posted on Piazza with your team information (names, GitHub logins, and enrolled lab session).

You should check frequently for updates to the skeleton files. Update announcements will be posted to Piazza.

## 2.2 Integrate Designs from Labs

You should copy some modules you designed from the labs. We suggest you keep these with the provided source files in hardware/src (overwriting any provided skeletons).

```
# Run from the directory that contains both fpga_labs_fa19 and fa19_teamXX
cp fpga_labs_fa19/lab6/debouncer.v fa19_teamXX/hardware/src/io_circuits/.
```

Copy these files from the labs:

```
lab6/debouncer.v
lab6/synchronizer.v
lab6/edge_detector.v
lab6/fifo.v
lab6/uart_transmitter.v
```

## 2.3 Project Skeleton Overview

- hardware
  - src
    - z1top.v: Top level module. The RISC-V CPU is instantiated here.
    - PYNQ-Z1.xdc: Constraints file. You can modify this to change pin assignments for peripherals when connecting I/O.
    - riscv\_core/Riscv151.v: All of your CPU datapath and control should be contained in this file.
    - riscv\_core/reg\_file.v: Your register file implementation.
    - memories/{imem, dmem, bios\_mem}.v: Synthesizable RAMs for the instruction, data, and BIOS memories.
    - io\_circuits/uart.v, uart\_transmitter.v, uart\_receiver.v: Your working UART from Labs 5 and 6.
  - sim
    - assembly\_testbench.v: Starting point for testing your CPU. Works with the software in assembly\_tests.
    - echo\_testbench.v: Runs the software in echo on your CPU. The software implements the echo FSM from Lab 5, and the testbench controls an off-chip UART to test it.
- software
  - bios151v3: The BIOS program, which allows us to interact with our CPU via the UART. You need to compile it before creating a bitstream or running a simulation.
  - echo: The echo program, which emulates the FSM from Lab 5 in software.
  - assembly\_tests: Use this as a template to write assembly tests for your processor designed to run in simulation.
  - c\_example: Use this as an example to write C programs.
  - mmult: This is a program to be run on the FPGA for Checkpoint 2. It generates 2 matrices and multiplies them. Then it returns a checksum to verify the correct result.

To compile software, go into a program directory and run make. To build a bitstream, run make impl in hardware.

## 2.4 RISC-V 151 ISA

Table 1 contains all of the instructions your processor is responsible for supporting. It contains most of the instructions specified in the RV32I Base Instruction set, and allows us to maintain a relatively simple design while still being able to have a C compiler and write interesting programs to run on the processor. For the specific details of each instruction, refer to sections 2.2 through 2.6 in the RISC-V Instruction Set Manual.

You may find a RISC-V green card helpful.

### 2.4.1 CSR Instructions

You will have to implement 2 CSR instructions to support running the standard RISC-V ISA test suite. A CSR (control and status register) is state that is stored independently of the register file and the memory. While there are 2<sup>12</sup> possible CSR addresses, you will only use one of them (tohost = 0x51E). The tohost register is monitored by the RISC-V ISA testbench, and simulation ends when a value is written to this register. A value of 1 indicates success; a value greater than 1 gives clues as to the location of the failure.

There are 2 CSR related instructions that you will need to implement:

1. csrw tohost,t2 (short for csrrw x0,csr,rs1 where csr = 0x51E)
2. csrwi tohost,1 (short for csrrwi x0,csr,uimm where csr = 0x51E)

csrw writes the value held in register rs1 to the addressed CSR. csrwi writes the 5-bit immediate uimm (encoded in the rs1 field of the instruction and zero-extended) to the addressed CSR. Note that you do not need to write to rd (writing to x0 does nothing).
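The two encodings can be modelled in a few lines of Python. This is only a behavioural sketch, not the Verilog you will write: the function names `csr_write_value` and `encode_csr` and the `regs` list are made up for illustration, and the field positions assume the standard Zicsr layout listed in Table 1 (csr in bits [31:20], rs1 or uimm in bits [19:15], funct3 in bits [14:12]).

```python
TOHOST = 0x51E             # the only CSR address this project uses
OPCODE_SYSTEM = 0b1110011  # opcode shared by CSRRW and CSRRWI


def encode_csr(csr, field, funct3, rd=0):
    """Hypothetical helper: assemble a CSRRW/CSRRWI instruction word."""
    return (csr << 20) | (field << 15) | (funct3 << 12) | (rd << 7) | OPCODE_SYSTEM


def csr_write_value(instr, regs):
    """Return (csr_addr, value) for a csrrw/csrrwi, or None for anything else.

    instr: 32-bit instruction word
    regs:  list of 32 register values (a stand-in for the register file)
    """
    opcode = instr & 0x7F
    funct3 = (instr >> 12) & 0x7
    field = (instr >> 15) & 0x1F   # rs1 index (csrrw) or uimm (csrrwi)
    csr = (instr >> 20) & 0xFFF
    if opcode != OPCODE_SYSTEM:
        return None
    if funct3 == 0b001:            # csrrw: value comes from register rs1
        return csr, regs[field] & 0xFFFFFFFF
    if funct3 == 0b101:            # csrrwi: value is the zero-extended uimm
        return csr, field
    return None
```

For example, with t2 (x7) holding some value, `csrw tohost,t2` writes that value to CSR 0x51E, while `csrwi tohost,1` writes the constant 1.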

## 2.5 Pipelining

Your CPU must implement this instruction set using a 3-stage pipeline. The division of the datapath into three stages is left unspecified as it is an important design decision with significant performance implications. We recommend that you begin the design process by considering which elements of the datapath are synchronous and in what order they need to be placed. After determining the design blocks that require a clock edge, consider where to place asynchronous (combinational) blocks to minimise the critical path. The RAMs we are using for the data, instruction, and BIOS memories have synchronous reads **and** synchronous writes.

## 2.6 Hazards

As you have learned in lecture, pipelines create hazards. Your design will have to resolve both control and data hazards. You must resolve data hazards by implementing forwarding whenever possible. This means that you must forward data from your data memory instead of stalling your pipeline or injecting NOPs. All data hazards can be resolved by forwarding in a three-stage pipeline.

You'll have to deal with the following types of hazards:

1. **Read-after-write data hazards**: Consider carefully how to handle instructions that depend on a preceding load instruction, as well as those that depend on a previous arithmetic instruction.
2. **Control hazards**: What do you do when you encounter a branch instruction, a jal (jump and link), or jalr (jump from register and link)? You will have to choose whether to predict branches as taken or not taken by default and kill instructions that weren't supposed to execute if needed. You can begin by resolving branches by stalling the pipeline, and when your processor is functional, move to naive branch prediction.
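As a rough illustration of the forwarding decision for one source operand, here is a behavioural sketch in Python. It assumes a single older instruction still in flight whose result is already available; the names `wb_rd`, `wb_reg_write`, and `wb_value` are hypothetical, since the stage partition (and therefore where results become available) is your design decision.

```python
def forward_operand(rs, regfile_value, wb_rd, wb_reg_write, wb_value):
    """Pick the value to use for source register index `rs`.

    If the older in-flight instruction writes register `rs` (and `rs` is
    not x0), use its result instead of the stale register-file value.
    """
    if wb_reg_write and wb_rd != 0 and wb_rd == rs:
        return wb_value
    return regfile_value
```

The check against x0 matters: an instruction whose rd is x0 produces no architectural write, so its result must never be forwarded.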

Table 1: RISC-V ISA

| Type   | Bits [31:25]  | Bits [24:20]  | Bits [19:15] | Bits [14:12] | Bits [11:7]   | Bits [6:0] |
|--------|---------------|---------------|--------------|--------------|---------------|------------|
| R-type | funct7        | rs2           | rs1          | funct3       | rd            | opcode     |
| I-type | imm[11:5]     | imm[4:0]      | rs1          | funct3       | rd            | opcode     |
| S-type | imm[11:5]     | rs2           | rs1          | funct3       | imm[4:0]      | opcode     |
| B-type | imm[12\|10:5] | rs2           | rs1          | funct3       | imm[4:1\|11]  | opcode     |
| U-type | imm[31:25]    | imm[24:20]    | imm[19:15]   | imm[14:12]   | rd            | opcode     |
| J-type | imm[20\|10:5] | imm[4:1\|11]  | imm[19:15]   | imm[14:12]   | rd            | opcode     |

Immediate fields that span more than one column (I-, U-, and J-type) are shown split at the column boundaries.

**RV32I Base Instruction Set**

| Bits [31:25]  | Bits [24:20] | Bits [19:15] | Bits [14:12] | Bits [11:7]  | Bits [6:0] | Instruction |
|---------------|--------------|--------------|--------------|--------------|------------|-------------|
| imm[31:25]    | imm[24:20]   | imm[19:15]   | imm[14:12]   | rd           | 0110111    | LUI         |
| imm[31:25]    | imm[24:20]   | imm[19:15]   | imm[14:12]   | rd           | 0010111    | AUIPC       |
| imm[20\|10:5] | imm[4:1\|11] | imm[19:15]   | imm[14:12]   | rd           | 1101111    | JAL         |
| imm[11:5]     | imm[4:0]     | rs1          | 000          | rd           | 1100111    | JALR        |
| imm[12\|10:5] | rs2          | rs1          | 000          | imm[4:1\|11] | 1100011    | BEQ         |
| imm[12\|10:5] | rs2          | rs1          | 001          | imm[4:1\|11] | 1100011    | BNE         |
| imm[12\|10:5] | rs2          | rs1          | 100          | imm[4:1\|11] | 1100011    | BLT         |
| imm[12\|10:5] | rs2          | rs1          | 101          | imm[4:1\|11] | 1100011    | BGE         |
| imm[12\|10:5] | rs2          | rs1          | 110          | imm[4:1\|11] | 1100011    | BLTU        |
| imm[12\|10:5] | rs2          | rs1          | 111          | imm[4:1\|11] | 1100011    | BGEU        |
| imm[11:5]     | imm[4:0]     | rs1          | 000          | rd           | 0000011    | LB          |
| imm[11:5]     | imm[4:0]     | rs1          | 001          | rd           | 0000011    | LH          |
| imm[11:5]     | imm[4:0]     | rs1          | 010          | rd           | 0000011    | LW          |
| imm[11:5]     | imm[4:0]     | rs1          | 100          | rd           | 0000011    | LBU         |
| imm[11:5]     | imm[4:0]     | rs1          | 101          | rd           | 0000011    | LHU         |
| imm[11:5]     | rs2          | rs1          | 000          | imm[4:0]     | 0100011    | SB          |
| imm[11:5]     | rs2          | rs1          | 001          | imm[4:0]     | 0100011    | SH          |
| imm[11:5]     | rs2          | rs1          | 010          | imm[4:0]     | 0100011    | SW          |
| imm[11:5]     | imm[4:0]     | rs1          | 000          | rd           | 0010011    | ADDI        |
| imm[11:5]     | imm[4:0]     | rs1          | 010          | rd           | 0010011    | SLTI        |
| imm[11:5]     | imm[4:0]     | rs1          | 011          | rd           | 0010011    | SLTIU       |
| imm[11:5]     | imm[4:0]     | rs1          | 100          | rd           | 0010011    | XORI        |
| imm[11:5]     | imm[4:0]     | rs1          | 110          | rd           | 0010011    | ORI         |
| imm[11:5]     | imm[4:0]     | rs1          | 111          | rd           | 0010011    | ANDI        |
| 0000000       | shamt        | rs1          | 001          | rd           | 0010011    | SLLI        |
| 0000000       | shamt        | rs1          | 101          | rd           | 0010011    | SRLI        |
| 0100000       | shamt        | rs1          | 101          | rd           | 0010011    | SRAI        |
| 0000000       | rs2          | rs1          | 000          | rd           | 0110011    | ADD         |
| 0100000       | rs2          | rs1          | 000          | rd           | 0110011    | SUB         |
| 0000000       | rs2          | rs1          | 001          | rd           | 0110011    | SLL         |
| 0000000       | rs2          | rs1          | 010          | rd           | 0110011    | SLT         |
| 0000000       | rs2          | rs1          | 011          | rd           | 0110011    | SLTU        |
| 0000000       | rs2          | rs1          | 100          | rd           | 0110011    | XOR         |
| 0000000       | rs2          | rs1          | 101          | rd           | 0110011    | SRL         |
| 0100000       | rs2          | rs1          | 101          | rd           | 0110011    | SRA         |
| 0000000       | rs2          | rs1          | 110          | rd           | 0110011    | OR          |
| 0000000       | rs2          | rs1          | 111          | rd           | 0110011    | AND         |

Immediates that span more than one column are shown split at the column boundaries.

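**RV32/RV64 Zicsr Standard Extension**

| Bits [31:20] | Bits [19:15] | Bits [14:12] | Bits [11:7] | Bits [6:0] | Instruction |
|--------------|--------------|--------------|-------------|------------|-------------|
| csr          | rs1          | 001          | rd          | 1110011    | CSRRW       |
| csr          | uimm         | 101          | rd          | 1110011    | CSRRWI      |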

## 2.7 Register File

Your register file should have two asynchronous-read ports and one synchronous-write port (positive edge).

To test your register file, you should write a testbench to verify the following:

- Register 0 is not writable, i.e. reading from register 0 always returns 0
- Registers are updated on the same cycle that a write occurs (i.e. the value read on the cycle following the rising edge of the write should be the value written).
- The write enable signal to the register file controls whether a write occurs (we is active high, meaning you only write when we is high)
- Reads should be asynchronous (the value at the output one simulation timestep (#1) after feeding in an input address should be the value stored in that register)
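The properties in this checklist can be captured in a small reference model, which is also handy for generating expected values for a testbench. The sketch below is a Python behavioural model under stated assumptions, not the Verilog implementation; the class name `RegFile` and the method `posedge` are invented for illustration.

```python
class RegFile:
    """Behavioural model: 32 x 32-bit registers, x0 hard-wired to zero."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, addr):
        # Asynchronous read: returns the current contents immediately.
        return self.regs[addr & 0x1F]

    def posedge(self, we, waddr, wdata):
        # Synchronous write: call once per rising clock edge.
        # Writes to register 0 are ignored, and nothing is written unless
        # the active-high write enable is set.
        if we and (waddr & 0x1F) != 0:
            self.regs[waddr & 0x1F] = wdata & 0xFFFFFFFF
```

A testbench would exercise the same cases: a write to register 0, a write followed by a read on the next cycle, and a write attempt with we low.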

After you build your design, look for warnings in the messages and logs windows about the register file.

## 2.8 RAMs

In this project, we will be using inferred block RAMs to implement memories for the processor.

### 2.8.1 Initialization

The Verilog \$readmemb(filename, path to 2D reg, start addr, end addr) and \$readmemh() system tasks can be used to initialize a 2D reg from a text file containing the desired contents of the memory (in binary or hex, respectively). These system tasks are placed inside an initial block and point to a particular 2D reg instance to initialize. If a 2D reg isn't initialized, it is filled with Xs in simulation.

For synthesis, all the memories are initialized with the contents of the BIOS program (see src/memories/{imem, dmem, bios\_mem}.v).

For simulation, the testbench initializes the memories with a program specified by the testbench (see sim/assembly\_testbench.v).
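To show what such a text file looks like, the following Python sketch formats a list of 32-bit words as whitespace-separated hex values, one word per line, which is a layout \$readmemh accepts. The function name `words_to_hex_lines` is hypothetical, and this only illustrates the file format; it is not part of the project skeleton.

```python
def words_to_hex_lines(words):
    """Format 32-bit words as one 8-digit hex value per line."""
    return "\n".join(f"{w & 0xFFFFFFFF:08x}" for w in words) + "\n"
```

Each line of the result fills one row of the 2D reg, starting at the start address given to the system task.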

### 2.8.2 Endianness + Addressing

The instruction and data RAMs have 16384 32-bit rows, so they accept 14-bit addresses. The RAMs are **word-addressed**; this means that every unique 14-bit address refers to one 32-bit row (word) of memory.

However, the memory addressing scheme of RISC-V is **byte-addressed**. This means that every unique 32-bit address the processor computes (in the ALU) points to one 8-bit byte of memory.

For us, the bottom 16 bits of the addresses computed by the CPU are relevant for RAM access. The top 14 of these 16 bits (bits [15:2]) are the word address (for indexing into one row of the block RAM), and the bottom two (bits [1:0]) are the byte offset (for indexing to a particular byte in a 32-bit row).
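The split can be written out as a short Python sketch (a behavioural illustration only; the function name `split_address` is made up here):

```python
def split_address(addr):
    """Split a 32-bit byte address into (word_address, byte_offset)."""
    word_addr = (addr >> 2) & 0x3FFF   # bits [15:2], 14 bits wide
    byte_offset = addr & 0x3           # bits [1:0]
    return word_addr, byte_offset
```

For instance, byte address 0x10000002 maps to row 0 with byte offset 2, and 0x10000004 maps to row 1 with byte offset 0.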

Figure 2: Block RAM organization. The labels for row address should read 14'h0 and 14'h1.

Figure 2 illustrates the 14-bit word addresses and the two bit byte offsets. Observe that the RAM organization is **little-endian**, i.e. the most significant byte is at the most significant memory address (offset '11').

### 2.8.3 Reading from RAMs

Since the RAMs have 32-bit rows, you can only read data out of the RAM 32 bits at a time. This is an issue when executing an lh or lb instruction, as there is no way to indicate which 8 or 16 of the 32 bits you want to read out.

Therefore, you will have to shift and mask the output of the RAM to select the appropriate portion of the 32 bits you read out. For example, if you want to execute an lbu on an address ending in 2'b10, you will only want bits [23:16] of the 32 bits that you read out of the RAM (thus storing {24'b0, output[23:16]} to a register).
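The shift-and-mask step, including the sign extension that lb and lh require, can be sketched in Python as follows. This is a behavioural model under the assumption of aligned accesses; the function name `load_result` is hypothetical, and the funct3 values are the ones listed in Table 1.

```python
def load_result(word, byte_offset, funct3):
    """Extract the value a load writes to rd from a 32-bit RAM row.

    funct3 follows Table 1: 000 LB, 001 LH, 010 LW, 100 LBU, 101 LHU.
    Assumes an aligned access (halfword offsets are 0 or 2).
    """
    word &= 0xFFFFFFFF
    size = funct3 & 0b011
    is_unsigned = bool(funct3 & 0b100)
    if size == 0b10:                         # LW: the whole row
        return word
    bits = 8 if size == 0b00 else 16
    value = (word >> (8 * byte_offset)) & ((1 << bits) - 1)
    if not is_unsigned and (value >> (bits - 1)) & 1:
        value |= 0xFFFFFFFF & ~((1 << bits) - 1)   # sign-extend to 32 bits
    return value
```

With a row value of 0x11A42233 and byte offset 2, lbu yields 0x000000A4 while lb yields 0xFFFFFFA4, because bit 7 of the selected byte is set.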

### 2.8.4 Writing to RAMs

To take care of sb and sh, note that the we input to the instruction and data memories is 4 bits wide. These 4 bits are a byte mask telling the RAM which of the 4 bytes to actually write to. If we={4'b1111}, then all 32 bits passed into the RAM would be written to the address given.

Here's an example of storing a single byte:

- Write the byte 0xa4 to address 0x10000002 (byte offset = 2)
- Set we = {4'b0100}
- Set dina = {32'hxx\_a4\_xx\_xx} (x means don't care)
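The same byte-mask idea extends to sh and sw. Below is a Python sketch of how the write enable and write data could be derived; it assumes aligned accesses, the function name `store_signals` is invented for illustration, and unused byte lanes are driven with zeros here even though they are don't-cares in hardware.

```python
def store_signals(addr, rs2_value, funct3):
    """Return (we, dina) for SB (funct3 000), SH (001), or SW (010).

    Assumes an aligned access. Byte lanes that are not written are
    zero in this model; in hardware they are don't-cares.
    """
    offset = addr & 0x3
    if funct3 == 0b000:      # SB: one byte lane
        we = 0b0001 << offset
        data = (rs2_value & 0xFF) << (8 * offset)
    elif funct3 == 0b001:    # SH: two adjacent byte lanes
        we = 0b0011 << offset
        data = (rs2_value & 0xFFFF) << (8 * offset)
    else:                    # SW: all four byte lanes
        we = 0b1111
        data = rs2_value
    return we & 0xF, data & 0xFFFFFFFF
```

Storing the byte 0xa4 to address 0x10000002 gives we = 4'b0100 and places 0xa4 in bits [23:16] of dina, matching the example above.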

## 2.9 Memory Architecture

The standard RISC pipeline is usually depicted with separate instruction and data memories. Although this is an intuitive representation, it does not let us modify instruction memory to run new programs. Your CPU, by the end of this checkpoint, will be able to receive compiled RISC-V binaries through the UART, store them into instruction memory, then jump to the downloaded program. To facilitate this, we will adopt a modified memory architecture shown in Figure 3:

Figure 3: Memory Architecture

### 2.9.1 Summary of Memory Access Patterns

Your memory architecture will consist of three RAMs. The RAMs are memory resources contained within the FPGA chip, and no external (off-chip, DRAM) memory will be used for this project. There are RAMs for the instruction, data, and the BIOS memory.

Your processor will begin execution from the BIOS memory, which will be initialized with the BIOS program (in software/bios151v3). The BIOS program will be able to read from the BIOS memory (to fetch static data and instructions), and to read from and write to instruction and data memory. This allows the BIOS program to receive user programs over the UART from your computer and load them into instruction memory. You can then instruct the BIOS program to jump to an instruction memory address, which begins execution of the program that you loaded.

At any time, you can press the reset button on the board to return your processor to the BIOS program.

### 2.9.2 Unaligned Memory Accesses

In the official RISC-V specification, unaligned loads and stores are supported. However, in your project, you can ignore instructions that request an unaligned access. The compiler will never generate unaligned accesses.

### 2.9.3 Address Space Partitioning

Your CPU will need to be able to access multiple sources for data as well as control the destination of store instructions. In order to do this, we will partition the 32-bit address space into four regions: data memory reads and writes, instruction memory writes, BIOS memory reads, and memory-mapped I/O. This will be encoded in the top nibble (4 bits) of the memory address generated in load and store operations, as shown in Table 2. In other words, the target memory/device of a load or store instruction is dependent on the address. The reset signal should reset the PC to the value defined by the parameter RESET\_PC, which is by default the start of BIOS memory (0x40000000).

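| Address[31:28] | Address Type | Device             | Access     | Notes                  |
|----------------|--------------|--------------------|------------|------------------------|
| 4'b00x1        | Data         | Data Memory        | Read/Write |                        |
| 4'b0001        | PC           | Instruction Memory | Read-only  |                        |
| 4'b001x        | Data         | Instruction Memory | Write-Only | Only if PC[30] == 1'b1 |
| 4'b0100        | PC           | BIOS Memory        | Read-only  |                        |
| 4'b0100        | Data         | BIOS Memory        | Read-only  |                        |
| 4'b1000        | Data         | I/O                | Read/Write |                        |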

Table 2: Memory Address Partitions

Each partition specified in Table 2 should be enabled only based on its associated bit in the address encoding. This allows operations to be applied to multiple devices simultaneously, which will be used to maintain memory consistency between the data and instruction memory.

For example, a store to an address beginning with 0x3 will write to both the instruction memory and data memory, while storing to addresses beginning with 0x2 or 0x1 will write to only the instruction or data memory, respectively. For details about the BIOS and how to run programs on your CPU, see Section 2.13.

Please note that a given address may refer to a different memory depending on which address type it is. For example, the address 0x10000000 refers to the data memory when it is a data address, while a program counter value of 0x10000000 refers to the instruction memory.

The note in the table above (referencing PC[30]) specifies that you can only write to instruction memory if you are currently executing in BIOS memory. This prevents programs from being self-modifying, which would drastically complicate your processor.
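The data-address rows of Table 2 can be summarised in a small decoder. The Python sketch below follows the table's bit patterns literally (a hardware design would more likely test the individual address bits); the function name `data_access_targets` and the device labels are made up for illustration.

```python
def data_access_targets(addr, pc, is_store):
    """Return the set of devices a load/store at `addr` touches (Table 2)."""
    top = (addr >> 28) & 0xF
    targets = set()
    if (top & 0b1101) == 0b0001:                       # 4'b00x1: data memory
        targets.add("dmem")
    if is_store and (top & 0b1110) == 0b0010 and ((pc >> 30) & 1):
        targets.add("imem")                            # 4'b001x, write-only,
                                                       # only if PC[30] == 1
    if not is_store and top == 0b0100:                 # BIOS is read-only
        targets.add("bios")
    if top == 0b1000:                                  # memory-mapped I/O
        targets.add("io")
    return targets
```

While executing from BIOS memory (PC = 0x4...), a store to 0x30000000 reaches both the data and instruction memories; the same store issued from a user program (PC = 0x1...) reaches only the data memory.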

### 2.9.4 Memory Mapped I/O

At this stage in the project the only way to interact with your CPU is through the UART. The UART from Lab 5 accomplishes the low-level task of sending and receiving bits from the serial lines, but you will need a way for your CPU to send and receive bytes to and from the UART. To accomplish this, we will use memory-mapped I/O, a technique in which registers of I/O devices are assigned memory addresses. This enables load and store instructions to access the I/O devices as if they were memory.

The I/O memory map also includes cycle and instruction counters, which are used to determine the CPI (cycles per instruction) of a given program.

Table 3 shows the memory map for this stage of the project.

| Address      | Function              | Access | Data Encoding                                |
|--------------|-----------------------|--------|----------------------------------------------|
| 32'h80000000 | UART control          | Read   | {30'b0, data\_out\_valid, data\_in\_ready}   |
| 32'h80000004 | UART receiver data    | Read   | {24'b0, data\_out}                           |
| 32'h80000008 | UART transmitter data | Write  | {24'b0, data\_in}                            |
| 32'h80000010 | Cycle counter         | Read   | Clock cycles elapsed                         |
| 32'h80000014 | Instruction counter   | Read   | Number of instructions executed              |
| 32'h80000018 | Reset counters to 0   | Write  | N/A                                          |

Table 3: I/O Memory Map

You will need to determine how to translate the memory map into the proper ready-valid handshake signals for the UART. Your UART should respond to sw, sh, and sb for the transmitter data address, and should also respond to lw, lh, lb, lhu, and lbu for the receiver data and control addresses.

You should treat I/O such as the UART just as you would treat the data memory. This means that you should assert the equivalent write enable (i.e. valid) and data signals at the end of the execute stage, and read in data during the memory stage. The CPU itself should not check the data\_out\_valid and data\_in\_ready signals; this check is handled in software. The CPU needs to drive data\_out\_ready and data\_in\_valid correctly.

The cycle counter should be incremented every cycle, and the instruction counter should be incremented for every instruction that is committed (you should not count bubbles injected into the pipeline or instructions run during a branch mispredict). From these counts, the CPI of the processor can be determined for a given benchmark program.
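The counter behaviour and the CPI calculation can be sketched as a Python behavioural model. The addresses come from Table 3; the class name `Counters`, its methods, and the `cpi` helper are hypothetical names chosen for this illustration.

```python
CYCLE_ADDR = 0x80000010   # cycle counter (read)
INSTR_ADDR = 0x80000014   # instruction counter (read)
RESET_ADDR = 0x80000018   # reset counters to 0 (write)


class Counters:
    def __init__(self):
        self.cycles = 0
        self.instructions = 0

    def tick(self, committed):
        """Advance one clock cycle; `committed` is True if an instruction
        was committed this cycle (bubbles and killed instructions are not)."""
        self.cycles += 1
        if committed:
            self.instructions += 1

    def load(self, addr):
        if addr == CYCLE_ADDR:
            return self.cycles & 0xFFFFFFFF
        if addr == INSTR_ADDR:
            return self.instructions & 0xFFFFFFFF
        return 0

    def store(self, addr):
        # Any store to the reset address clears both counters.
        if addr == RESET_ADDR:
            self.cycles = 0
            self.instructions = 0


def cpi(counters):
    """Cycles per instruction, as software would compute it from two loads."""
    return counters.load(CYCLE_ADDR) / counters.load(INSTR_ADDR)
```

A benchmark would typically store to the reset address, run its workload, then load both counters and divide; four cycles with three committed instructions gives a CPI of 4/3.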

## 2.10 Testing

The design specified for this project is a complex system and debugging can be very difficult without tests that increase visibility of certain areas of the design. In assigning partial credit at the end for incomplete projects, we will look at testing as an indicator of progress. A reasonable order in which to complete your testing is as follows:

- 1. Test that your modules work in isolation via Verilog testbenches
- 2. Test the entire CPU one instruction at a time with hand-written assembly (see assembly\_testbench.v)
- 3. Run the riscv-tests ISA test suite
- 4. Test the CPU's memory mapped I/O (see echo\_testbench.v)

## 2.10.1 Integration Testing

Once you are confident that the individual components of your processor are working in isolation, you will want to test the entire processor as a whole. The easiest way to do this is to write an assembly program that tests all of the instructions in your ISA. A skeleton is provided for you in software/assembly\_tests. See Section 2.12 for details.

Once you have verified that all the instructions in the ISA are working correctly, you may also want to verify that the memory mapped I/O and instruction/data memory reading/writing work with a similar assembly program.

## 2.11 Software Toolchain - Writing RISC-V Programs

A GCC RISC-V toolchain has been built and installed in the eecs151 home directory; these binaries will run on any of the c125m machines in the 125 Cory lab. The most relevant pieces of the toolchain are given below:

- riscv64-unknown-elf-gcc: gcc for RISC-V, compiles C code to RISC-V binaries.
- riscv64-unknown-elf-as: RISC-V assembler, compiles assembly code to RISC-V binaries.
- riscv64-unknown-elf-objdump: Dumps RISC-V binaries as readable assembly code.

Look at the software/c\_example folder for an example of a C program.

There are several files:

- start.s: This is an assembly file that contains the start of the program. It initialises the stack pointer then jumps to the main label. Edit this file to move the top of the stack. Typically your stack pointer is set to the top of the data memory address space, so that the stack has enough room to grow downwards.
- c\_example.ld: This linker script sets the base address of the program. For checkpoint 2, this address should be in the format 0x1000xxxx. The .text segment offset is typically set to the base of the instruction memory address space.
- c\_example.elf: Binary produced after running make. Use riscv64-unknown-elf-objdump -Mnumeric -D c\_example.elf to view the assembly code.
- c\_example.dump: Assembly dump of the binary.

# 2.12 Assembly Tests

Hand written assembly tests are in software/assembly\_tests/start.s and the corresponding testbench is in hardware/sim/assembly\_testbench.v.

start.s contains assembly that's compiled and loaded into the BIOS RAM by the testbench.

```
_start:

# Test ADD
li x10, 100         # Load argument 1 (rs1)
li x11, 200         # Load argument 2 (rs2)
add x1, x10, x11    # Execute the instruction being tested
li x20, 1           # Set the flag register to stop execution and inspect the result register
                    # Now we check that x1 contains 300 in the testbench

Done: j Done
```

The assembly\_testbench toggles the clock one cycle at a time and waits for register x20 to be written with a particular value (in the above example: 1). Once x20 contains 1, the testbench inspects the value in x1 and checks that it is 300, which indicates your processor correctly executed the add instruction.

If the testbench timed out it means x20 never became 1, so the processor got stuck somewhere or x20 was written with another value.
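The wait-then-check pattern used by the testbench can be sketched in a few lines of Python. This is only a behavioural illustration: the real testbench is written in Verilog, and the helper wait\_for\_flag, the 1000-cycle timeout, and the toy one-instruction-per-cycle "processor" below are all assumptions made for the example.

```python
def wait_for_flag(step, read_reg, flag_reg=20, flag_value=1, timeout_cycles=1000):
    """Advance one cycle at a time until register x<flag_reg> equals flag_value.

    step() advances the design by one clock cycle; read_reg(n) returns xn.
    Returns True if the flag was seen, False if the wait timed out.
    """
    for _ in range(timeout_cycles):
        if read_reg(flag_reg) == flag_value:
            return True
        step()
    return read_reg(flag_reg) == flag_value


# Toy stand-in for the CPU: runs the ADD test one instruction per cycle.
# (No pipeline is modelled; this only exercises the wait loop.)
regs = [0] * 32
state = {"pc": 0}
program = [
    (10, lambda r: 100),           # li x10, 100
    (11, lambda r: 200),           # li x11, 200
    (1, lambda r: r[10] + r[11]),  # add x1, x10, x11
    (20, lambda r: 1),             # li x20, 1
]


def step():
    if state["pc"] < len(program):
        rd, compute = program[state["pc"]]
        regs[rd] = compute(regs)
        state["pc"] += 1


flag_seen = wait_for_flag(step, lambda n: regs[n])
test_passed = flag_seen and regs[1] == 300
```

A timeout corresponds to wait\_for\_flag returning False: the flag register never took the expected value within the cycle budget.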

# 2.13 BIOS and Programming your CPU

We have provided a BIOS program in software/bios151v3 that allows you to interact with your CPU and download other programs over UART. The BIOS is basically just an infinite loop that reads from the UART, checks if the input string matches a known control sequence, and then performs the action. For detailed information on the BIOS, see Appendix C. TODO: attach BIOS document in the appendix of this spec

To run the BIOS:

- 1. Verify that the stack pointer and .text segment offset are set properly in start.s and bios151v3.ld
- 2. Compile the program with make in the software/bios151v3 directory
- 3. Verify the {imem, dmem, bios\_mem.v} modules are initialized with the BIOS hex file
- 4. Build a bitstream and program the FPGA
- 5. Use screen to access the serial port: screen \$SERIALTTY 115200
- 6. Press the reset button to make the CPU PC go to the start of BIOS memory

Close screen using Ctrl-a Shift-k, or other students won't be able to use the serial port! If you can't access the serial port you can run killscreen to kill all screen sessions.

If all goes well, you should see a 151 > prompt after pressing return. The following commands are available:

- jal <address>: Jump to address (hex).
- sw, sb, sh <data> <address>: Store data (hex) to address (hex).
- lw, lbu, lhu <address>: Prints the data at the address (hex).

As an example, running sw cafef00d 10000000 should write to the data memory and running lw 10000000 should print the output 10000000: cafef00d. Note that writes to the instruction memory (sw ffffffff 20000000) must not write to the data memory, i.e. lw 10000000 should still yield cafef00d.

In addition to the command interface, the BIOS allows you to load programs to the CPU. With screen closed, run:

```
coe_to_serial <coe_file> <address>
```

This stores the .coe file at the specified hex address. In order to write into both the data and instruction memories, remember to set the top nibble to 0x3 (i.e. coe\_to\_serial echo.coe 30000000, assuming the .ld file sets the base address to 0x10000000). You also need to ensure that the stack and base address are set properly (See Section 2.11).

For example, before making the mmult program you should have set the base address to 0x10006000 (see 2.15). Therefore, when loading the mmult program to the FPGA you should place it in memory so that it starts at the base address: coe\_to\_serial mmult.coe 30006000. Then, you can start it in your screen session by using jal 10006000.
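The relationship between the linker base address, the coe\_to\_serial load address, and the jal target can be captured in a short Python helper. This is a sketch under assumptions: it treats bit 28 of the address as "data memory" and bit 29 as "instruction memory", which is what the 0x1, 0x2, and 0x3 top-nibble examples above imply, and it ignores any further conditions the address space partitioning section places on stores. The function names are hypothetical.

```python
DMEM_BIT = 1 << 28  # set in 0x1000_0000: data memory
IMEM_BIT = 1 << 29  # set in 0x2000_0000: instruction memory


def store_targets(addr):
    """Memories written by a store to addr, judged by the top nibble only."""
    targets = set()
    if addr & DMEM_BIT:
        targets.add("dmem")
    if addr & IMEM_BIT:
        targets.add("imem")
    return targets


def load_address(base_addr):
    """Address to give coe_to_serial so the image lands in both memories."""
    return (base_addr & 0x0FFFFFFF) | 0x30000000


def load_and_run_commands(coe_file, base_addr):
    """Return the coe_to_serial command line and the BIOS jal command."""
    return (
        f"coe_to_serial {coe_file} {load_address(base_addr):08x}",
        f"jal {base_addr:08x}",
    )


commands = load_and_run_commands("mmult.coe", 0x10006000)
```

With a base address of 0x10006000 this reproduces the two commands quoted above for mmult.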

#### 2.14 Target Clock Frequency

By default, the CPU clock frequency is set to 50 MHz (a 20 ns clock period). It should be easy to meet timing at 50 MHz. Look at the reports in hardware/build/synth/post\_synth\_timing\_summary.rpt and impl/post\_route\_timing\_summary.rpt to see if timing is met. If timing is not met, the timing reports specify the critical path so you can attempt to optimize it.

For this checkpoint, we will allow you to demonstrate the CPU working at 50 MHz, but for the final checkoff at the end of the semester, you will need to optimize for a higher clock speed ( $\geq 100 \mathrm{MHz}$ ) for full credit. Details on how to build your FPGA design with a different clock frequency will come later.

# 2.15 Matrix Multiply

To check the correctness and performance of your processor we have provided a benchmark in software/mmult/ which performs matrix multiplication. You should be able to load it into your processor in the same way as loading the echo program. This program computes S = AB, where A and B are  $64 \times 64$  matrices. The program will print a checksum and the counters discussed in Section 2.9.4. The correct checksum is 0001f800. If you do not get this, there is likely a problem in your CPU with one of the instructions that is used by mmult but not by the BIOS.

The matrix multiply program requires that the stack pointer and the offset of the .text segment be set properly, otherwise the program will not execute properly.

The stack pointer (set in start.s) needs to accommodate three 64×64 matrices as well as additional space for temporary results. It should be set to 0x10006000 and grows downwards.

The .text segment offset (set in mmult.ld) needs to accommodate the full set of instructions and static data in the mmult binary. It should be set to 0x10006000.

The program will also output the values of your instruction and cycle counters (in hex). These can be used to calculate the CPI for this program. Your target CPI should be under 1.2, and ideally should be under 1.15. If your CPI exceeds this value, you will need to modify your datapath and pipeline to reduce the number of bubbles inserted for resolving control hazards (since they are the only source of extra latency in our processor). This might involve performing naive branch prediction or moving the jalr address calculation to an earlier stage.
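Since the counters are printed in hex, a small script is a convenient way to turn them into a CPI figure. The Python sketch below assumes you copy the two hex strings from the serial output by hand; the function names and the sample counter values are invented for illustration and are not real mmult results.

```python
def cpi_from_counters(cycle_hex, instr_hex):
    """Compute CPI from the hex strings printed by the benchmark."""
    cycles = int(cycle_hex, 16)
    instructions = int(instr_hex, 16)
    if instructions == 0:
        raise ValueError("instruction counter is zero")
    return cycles / instructions


def meets_target(cpi, target=1.2):
    """True when the CPI is under the target from this section."""
    return cpi < target


# Made-up counter values, chosen only to exercise the arithmetic.
example_cpi = cpi_from_counters("00120000", "00100000")
example_ok = meets_target(example_cpi)
```

For the made-up values above, 0x120000 cycles over 0x100000 instructions gives a CPI of 1.125, which would be under both the 1.2 and 1.15 thresholds.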

# 2.16 How to Survive This Checkpoint

Start early and work on your design incrementally. Draw up a very detailed and organised block diagram and keep it up to date as you begin writing Verilog. Unit test independent modules such as the control unit, ALU, and regfile. Write thorough and complex assembly tests by hand, and don't solely rely on the RISC-V ISA test suite. The final BIOS program is several thousand lines of assembly and will be nearly impossible to debug by just looking at the waveform.

The most valuable asset for this checkpoint will not be your GSIs but will be your fellow peers who you can compare notes with and discuss design aspects with in detail. However, do NOT under any circumstances share source code.

Once you're tired, go home and *sleep*. When you come back you will know how to solve your problem.

#### 2.16.1 How To Get Started

It might seem overwhelming to implement all the functionality that your processor must support. The best way to implement your processor is in small increments, checking the correctness of your processor at each step along the way. Here is a guide that should help you plan out Checkpoint 1 and 2:

- 1. Design. You should start with a comprehensive and detailed design/schematic. We suggest that you think carefully about all the functionality and instructions your processor needs to support and enumerate all the control signals that you will need. Be especially careful when designing the memory fetch stage of your pipeline as all the memories we use (BIOS, inst, data, IO) are synchronous.
- 2. First steps. You should get started by implementing some modules that are straightforward to write and test. We suggest you get started by writing reg\_file.v, for which there has been a template provided in the project skeleton. Once you finish writing the regfile, test it comprehensively by writing a Verilog testbench. Look at the Register File section for details on what the test should verify.
- 3. Control Unit + other small modules. Next try implementing your control unit, the ALU, and any other small independent modules that you identified in your design. Make sure you unit test these aggressively, so that you verify their correctness and get used to writing Verilog testbenches.
- 4. Memory. Create your memory controller and other auxiliary structures. Only add the BIOS memory in the instruction fetch stage and only add the data memory block RAM in the memory stage of your pipeline. This will keep things simple in order to test the base functionality of your processor.
- 5. Connect stages and pipeline. Now you should have all of the modules ready to connect them together and pipeline them by inserting registers between the stages. At this point, you should be able to run integration tests using assembly tests for most R and I type instructions.
- 6. Implement handling of control hazards. Now insert bubbles into your pipeline to resolve control hazards associated with JAL, JALR, and branch instructions. Don't worry about data hazard handling for now. Test that your control instructions work properly with assembly tests. You can insert explicit NOP instructions in your tests to get around data dependencies.
- 7. Implement data forwarding for data hazards. Add forwarding muxes to the proper place in your datapath and forward the outputs of the ALU and memory stage. Implement a hazard unit that can detect data dependencies and set the control signals for the forwarding muxes accordingly. Remember that you might have to forward to ALU input A, ALU input B, and data to write to memory. Test forwarding aggressively; most of your bugs will come from incomplete or faulty forwarding logic. Make sure you test forwarding from memory and from the ALU, and with control instructions.
- 8. Add BIOS memory reads. Add the BIOS memory block RAM to the memory stage to be able to load data from the BIOS memory. Write assembly tests that contain some static data stored in the BIOS memory and verify that you can read that data.
- 9. Add Inst memory writes and reads. Add the instruction memory block RAM to the memory stage to be able to write data to it when executing inside the BIOS memory. Also add the instruction memory block RAM to the instruction fetch stage to be able to read instructions from the inst memory. It is crucial to write tests to stress this portion of the processor; we suggest writing tests that first write instructions to the instruction memory, and then jump (using jalr) to instruction memory to see the right instructions are executed.

- 10. Add cycle counters. Begin to add the memory mapped IO components, by first adding the cycle and instruction counters. These are just two 32-bit registers that your CPU should update on every cycle and every instruction respectively. Write tests to verify that your counters can be reset with a SW instruction, and can be read from using a LW instruction.
- 11. Integrate UART. Add the UART to the memory stage, in parallel with the data, instruction, and BIOS memories. Detect when an instruction is accessing the UART and route the data to the UART accordingly. Make sure that you are setting the UART ready/valid control signals properly as you are feeding or retrieving data from it. This part can be tricky; ask a TA for a full explanation of how a program would communicate with the UART. We have provided you with the echo\_testbench which performs a test of the UART. You should extend this testbench with more comprehensive tests, as many bugs can be traced to a faulty UART integration.
- 12. Run the BIOS. If everything so far has gone well, you can try building the CPU with the BIOS memory initialized with the BIOS program. Program the board with your CPU and verify that the BIOS performs as expected. As a precursor to this step, you might try building the CPU with the BIOS memory initialized with the echo program, since it is a smaller program that is easier to analyze.
- 13. Run matrix multiply. As a final step to check your implementation, you should be able to load the mmult program with the coe\_to\_serial utility, and run mmult on the FPGA. Verify that it returns the correct checksum.
- 14. Check CPI. Now that your processor is complete as far as functionality goes, compute the CPI when running the mmult program. If you achieve a CPI below 1.2, that is acceptable, but if your CPI is larger than that, you should think of ways to reduce it.

#### 2.17 Checkoff

The checkoff is divided into two stages: block diagram/design and implementation. The second part will require significantly more time and effort than the first one. As such, completing the block diagram in time for the design review is crucial to your success in this project.

#### 2.17.1 Checkpoint 1: Block Diagram

The first checkpoint requires a detailed block diagram of your datapath. The diagram should have a greater level of detail than a high level RISC datapath diagram. You may complete this electronically or by hand.

If working by hand, we recommend working in pencil and combining several sheets of paper for a larger workspace. If doing it electronically, you can use Inkscape, Google Drawings, draw.io or any program you want.

You should be able to describe in detail any smaller sub-blocks in your diagram. Though the textbook diagrams are a decent starting place, remember that they often use asynchronous-read RAMs for the instruction and data memories, and we will be using synchronous-read block RAMs. Additionally, at this point we recommend that you have completely functional UART, ALU, instruction decoder, and register file modules (see 2.7), though we will not be checking this.

Checkpoint 1 is due in lab no later than October 25. You are required to go over your design with a GSI during lab. Be prepared to talk generally about how you came up with your design and defend your design decisions.

## 2.17.2 Non-Checkpoint Weeks

GSIs will be in lab during the regular times to help you. Come to lab every week even if there is no checkoff deadline.

#### 2.17.3 Checkpoint 2: Base RISCV151 System

This checkpoint requires a fully functioning three stage RISC-V CPU as described in this specification. Checkoff will consist of a demonstration of the BIOS functionality, storing a program (echo and mmult) over the UART, and successfully jumping to and executing the program.

Checkpoint 2 materials should be committed to your project repository by November 15.

# 2.17.4 Checkpoints 1 & 2 Deliverables Summary

| Deliverable   | Due Date                                  | Description |
|---------------|-------------------------------------------|-------------|
| Block Diagram | October 25                                | Sit down with a GSI and go over your design in detail. |
| RISC-V CPU    | November 15<br>Check in code<br>to Github | Demonstrate that the BIOS works, you can use coe_to_serial to load the echo program, jal to it from the BIOS, and have that program successfully execute. Load the mmult program with coe_to_serial, jal to it, and have it execute successfully and return the benchmarking results and correct checksum. Your CPI should be under 1.2. |

# 3 Checkpoint 3 - I/O Integration, PWM Controller, Subtractive Synthesizer

In checkpoint 3 of this project you will implement a memory mapped I/O interface to user inputs and outputs (buttons, LEDs, and switches). To buffer user inputs to your processor you will integrate the FIFO built in lab.

You will design a generic PWM controller to drive the audio output that can be configured over MMIO. Finally, you will implement a simple polyphonic subtractive synthesizer with amplitude (ADSR) envelopes.

# 3.1 I/O Integration

In lab, you built a synchroniser, debouncer and an edge detector that were used to take in various user inputs. Now, we want our processor to have access to these inputs (and the switches) and also to be able to drive outputs such as the LEDs. We will extend our memory map to give programs access to these I/Os.

When a user pushes a button on the Pynq-Z1 board, the button's signal travels through the synchroniser  $\rightarrow$  debouncer  $\rightarrow$  edge detector chain. The result is a single clock cycle wide pulse coming out of the edge detector that represents a single button press. If we just extended our memory map to directly include the outputs from the edge detector, the processor would have to read from those locations on every clock cycle to be sure it didn't miss any user inputs.

To fix this, we will buffer user inputs with a FIFO and let the processor consume them when it has time to do so.

#### 3.1.1 Hookup User I/O

We want to give the processor access to these I/Os:

- Switches
- GPIO LEDs (the ones on the Pynq-Z1 board)
- Push-buttons

The I/O extension to the memory map is in Table 4.

On any given clock cycle, when any of the button signals pulse high, the FIFO should be written to with the status of all the button signals. The CPU should be able to read the empty signal of the FIFO, and it should be able to read out data from the FIFO with the FIFO's rd\_en signal controlled by your memory logic.

Modify z1top.v and Riscv151.v by instantiating your FIFO, hooking up its ports to the user I/O signals, and connecting your FIFO's read interface to the RISC-V core.
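The buffering behaviour described above can be prototyped in Python before writing any Verilog. The model below is a sketch only: the class name ButtonFifo, the depth of 8, dropping writes when full, and returning 0 on a read of an empty FIFO are all assumptions, not requirements of this specification. The data encodings follow Table 4.

```python
from collections import deque


class ButtonFifo:
    """Behavioural model of the button FIFO as seen through the memory map."""

    def __init__(self, depth=8):          # depth is a hypothetical choice
        self.depth = depth
        self.entries = deque()

    def clock(self, button_pulses):
        """Call once per cycle with the 4-bit edge-detector outputs."""
        button_pulses &= 0xF
        # Write the status of all buttons whenever any of them pulses high.
        # Assumption: writes are dropped when the FIFO is full.
        if button_pulses and len(self.entries) < self.depth:
            self.entries.append(button_pulses)

    def mmio_read_empty(self):
        """Load from the FIFO empty address: {31'b0, empty}."""
        return int(len(self.entries) == 0)

    def mmio_read_data(self):
        """Load from the FIFO read data address: {28'b0, buttons[3:0]}.

        Corresponds to the memory logic pulsing rd_en for one entry.
        Assumption: reading an empty FIFO returns 0.
        """
        if not self.entries:
            return 0
        return self.entries.popleft()


fifo = ButtonFifo()
for pulses in (0b0000, 0b0101, 0b0000, 0b1000):  # four consecutive cycles
    fifo.clock(pulses)
first = fifo.mmio_read_data()
second = fifo.mmio_read_data()
```

The point of the model is that the two presses are not lost even though the "processor" only reads after four cycles have passed.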

Table 4: Memory Map for I/O Integration

| Address      | Function            | Access | Data Encoding          |
|--------------|---------------------|--------|------------------------|
|              | GPIO FIFO Empty     | Read   | {31'b0, empty}         |
|              | GPIO FIFO Read Data | Read   | {28'b0, buttons[3:0]}  |
| 32'h80000028 | Switches            | Read   | {30'b0, SWITCHES[1:0]} |
| 32'h80000030 | GPIO LEDs           | Write  | {26'b0, LEDS[5:0]}     |

# 3.1.2 User I/O Test Program

The software/user\_io\_test program tests the FIFO and user I/O integration. After programming the FPGA, run make in the user\_io\_test folder, and run coe\_to\_serial user\_io\_test.coe 30006000. Then open screen and run jal 10006000 from the BIOS to jump into the user I/O test program.

This program has several commands to help you debug and verify functionality:

- read\_buttons CPU reads from the GPIO FIFO until it is empty, decodes the button press data, and prints it out.
- read\_switches CPU reads the slide switches' address and prints out the state of the switches.
- led <data> Writes the <data> (32-bits in hex) that you specify to the GPIO LEDs address. We only have 6 LEDs on the board so you can write values up to 0x3F.
- exit Jump back into BIOS.

#### 3.2 PWM Controller

# 3.2.1 Piano Program

# 3.3 Subtractive Synth

# 3.4 Checkpoint 3 Deliverables Summary

| Deliverable                         | Due Date    | Description                                                                                                                          |
|-------------------------------------|-------------|--------------------------------------------------------------------------------------------------------------------------------------|
| User I/Os +<br>PWM Piano +<br>Synth | November 22 | Demonstrate the working user IO test,<br>and PWM piano programs. Describe the<br>design of your synth and produce various<br>sounds. |

# 4 Checkpoint 4: Power Management Unit with Dynamic Frequency Scaling

# 4.1 Checkpoint 4 Deliverables Summary

| Deliverable                                 | Due Date   | Description                                                                                                   |
|---------------------------------------------|------------|---------------------------------------------------------------------------------------------------------------|
| PMU Core +<br>power manage-<br>ment program | December 6 | Demonstrate keeping the FPGA temperature below X°C while executing an intensive mmult in less than X seconds. |

# 5 Final Checkpoint - Optimization

This optimization checkpoint is lumped with the final checkpoint and the checkoff will occur at the same time. This part of the project is designed to give students freedom to implement the optimizations of their choosing to improve the performance of their processor.

The general optimization goal for this project is to achieve maximal performance on the mmult program, as defined by the 'Iron Law' of Processor Performance.

$$\frac{\mathrm{Time}}{\mathrm{Program}} = \frac{\mathrm{Instructions}}{\mathrm{Program}} \times \frac{\mathrm{Cycles}}{\mathrm{Instruction}} \times \frac{\mathrm{Time}}{\mathrm{Cycle}}$$

Your goal is to minimize the execution time of mmult. The number of instructions is fixed, but you have freedom to change the CPI and the CPU clock frequency. Often you will find that you will have to sacrifice CPI to achieve a higher clock frequency, but there also will exist opportunities to improve one or both of the variables without compromises.
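To see how the CPI versus clock frequency trade-off plays out, the Iron Law can be evaluated directly. The Python sketch below uses invented numbers (one million instructions, CPI 1.2 at 50 MHz versus CPI 1.5 at 100 MHz) purely to illustrate the arithmetic; they are not measurements of any real design.

```python
def execution_time_s(instructions, cpi, clock_hz):
    """Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle."""
    return instructions * cpi / clock_hz


def speedup(instructions, cpi_a, hz_a, cpi_b, hz_b):
    """How many times faster design B is than design A on the same program."""
    time_a = execution_time_s(instructions, cpi_a, hz_a)
    time_b = execution_time_s(instructions, cpi_b, hz_b)
    return time_a / time_b


# Hypothetical design points for the same (fixed) instruction count.
N = 1_000_000
baseline = execution_time_s(N, cpi=1.2, clock_hz=50e6)    # about 24 ms
deeper = execution_time_s(N, cpi=1.5, clock_hz=100e6)     # about 15 ms
gain = speedup(N, 1.2, 50e6, 1.5, 100e6)                  # about 1.6x
```

In this made-up case, giving up 25% in CPI to double the clock still wins overall; with a smaller frequency gain the same CPI loss could easily make things worse.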

## 5.1 Clock Generation Info + Changing Clock Frequency

Open up z1top.v. You will notice a top level input called CLK\_125MHZ\_FPGA. It is a 125 MHz clock signal, which we will use to derive our CPU clock.

Scrolling down a little further, you will see an instantiation of PLLE2\_ADV, which is a PLL (phase locked loop) primitive on the FPGA. This is a circuit that lets us create a new clock from an existing clock with a user-specified multiply-divide ratio.

The CLKIN input clock of the PLL is driven by the 125 MHz user\_clk\_g (buffered USER\_CLK). The PLL divides this frequency by the DIVCLK\_DIVIDE parameter, which is set to 5. Thus, internally, the PLL creates a 25 MHz clock. Then, this divided clock is multiplied by the CLKFBOUT\_MULT parameter and divided by the CLKOUT0\_DIVIDE parameter. In our case, this yields 125\_000\_000 / 5 \* 34 / 17 = 50\_000\_000, our desired CPU clock. Finally, the multiplied and divided clock shows up at the CLKOUT0 output clock of the PLL, which is connected to cpu\_clk. The cpu\_clk is buffered and cpu\_clk\_g is used in our CPU and other modules.

Play around with the multipliers and divisors in the PLL to generate a faster (or slower) clock. You may have to consult Xilinx's documentation on the PLLE2\_ADV primitive. (You can also use the Clocking Wizard IP generator in Vivado to generate this instantiation.) The parameters can't be set arbitrarily and there are a few caveats. The multipliers and divisors must be integers and you must fall within the device's operating frequency range; Vivado will complain if you don't. A few frequencies to try are: 60 MHz, 75 MHz, and 100 MHz.
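Before editing the PLL parameters it can help to check the arithmetic offline. The Python sketch below evaluates the multiply-divide ratio described above and searches for integer settings that hit a target frequency. The function names are hypothetical, and the VCO frequency bounds and the search limits on the multiplier and divider are placeholders supplied by the caller: take the real limits from Xilinx's documentation rather than from this example.

```python
from fractions import Fraction


def pll_output_hz(clkin_hz, divclk_divide, clkfbout_mult, clkout0_divide):
    """f_out = f_in / DIVCLK_DIVIDE * CLKFBOUT_MULT / CLKOUT0_DIVIDE."""
    return Fraction(clkin_hz, divclk_divide) * clkfbout_mult / clkout0_divide


def find_settings(clkin_hz, target_hz, divclk_divide, vco_min_hz, vco_max_hz,
                  max_mult=64, max_div=128):
    """List (CLKFBOUT_MULT, CLKOUT0_DIVIDE) pairs that give exactly target_hz.

    vco_min_hz / vco_max_hz bound the internal multiplied clock; the values
    passed in are the caller's assumption about the device limits.
    """
    results = []
    for mult in range(1, max_mult + 1):
        vco = Fraction(clkin_hz, divclk_divide) * mult
        if not (vco_min_hz <= vco <= vco_max_hz):
            continue
        for div in range(1, max_div + 1):
            if vco / div == target_hz:
                results.append((mult, div))
    return results


default_clock = pll_output_hz(125_000_000, 5, 34, 17)
# Placeholder VCO window of 800 MHz to 1600 MHz, assumed for illustration.
options_100mhz = find_settings(125_000_000, 100_000_000, 5,
                               800_000_000, 1_600_000_000)
```

Using exact fractions avoids rounding surprises when checking whether a combination lands exactly on the target; the default parameters reproduce the 50 MHz clock worked out above.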

#### 5.2 Critical Path Identification

Begin by pulling the latest skeleton files from the staff repository: git pull staff master. After running synthesis and implementation, your FPGA design will be placed and routed, and timing analysis will be performed to determine the critical path(s) of your design. The timing tools will automatically figure out the CPU clock timing constraint based on the multiply-divide ratio you used in your PLL.

You can find the critical path in the timing reports from your implementation step (in the Flow Navigator). Expand  $Implemented\ Design \rightarrow Select\ Report\ Timing\ Summary$ . You are interested in the timing paths for cpu\_clk\_g which is the clock used by your CPU and the rest of your design.

What is your critical path?

For each timing path look for the attribute called "slack". Slack describes how much extra time the combinational delay of the path has before the rising edge of the receiving clock. It is a setup time attribute. Positive slack means that this timing path resolves and settles before the rising edge of the clock, and negative slack indicates a setup time violation.
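A simplified view of slack is "clock period minus the delay of the path". The Python sketch below uses that simplification to relate a reported slack to the clock frequency a path could support. It deliberately ignores clock skew and uncertainty, which the real timing tools do account for, and the function names and the 12.5 ns example path are invented for illustration.

```python
def setup_slack_ns(clock_mhz, path_delay_ns):
    """Simplified setup slack: clock period minus total path delay.

    path_delay_ns is assumed to already include the setup time of the
    receiving register; clock skew and uncertainty are ignored.
    """
    period_ns = 1000.0 / clock_mhz
    return period_ns - path_delay_ns


def max_clock_mhz(clock_mhz, slack_ns):
    """Fastest clock the path would meet, given its slack at clock_mhz."""
    period_ns = 1000.0 / clock_mhz
    return 1000.0 / (period_ns - slack_ns)


# Hypothetical critical path of 12.5 ns.
slack_at_50 = setup_slack_ns(50, 12.5)     # positive: timing met
slack_at_100 = setup_slack_ns(100, 12.5)   # negative: setup violation
limit = max_clock_mhz(50, slack_at_50)
```

In this example the path has 7.5 ns of slack at 50 MHz, fails by 2.5 ns at 100 MHz, and would top out at about 80 MHz if nothing else changed.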

You will then see the source and destination of the path, which you can usually map to a net in your design. You can also see (and even visualise) the actual logic path that starts at the source and follows some logic in your design until it gets to the destination.

There are 3 common delay types that you will encounter during optimization. Most of the Trc\* delays are RAM delays that represent either Clk-to-q delays or setup time constraints. Tilo delays are combinational delays through LUTs. net delays are routing delays. If you want details on a specific delay type, check the Virtex 5 Datasheet starting from page 40.

net delays include a fanout attribute. You will likely want to minimize fanout of a given net along a timing path in order to reduce routing delay. You will notice that as a percentage of total delay, routing dominates over combinational logic delay. As you continue optimization, you can reach the point where the routing delay percentage of total delay will be roughly one-half.

#### 5.2.1 Finding Actual Critical Paths

When you first check the timing report with a 50 MHz clock, you might not see your 'actual' critical path. 50 MHz is an easy timing constraint for the tools to meet for most CPU designs and thus, the tools will only attempt to optimize routing until timing is met, and will then stop. The critical paths you see in the report may not be the 'actual' critical paths since the tools haven't been pushed to the limit.

We recommend that you begin optimization by increasing the clock frequency slowly and re-running synthesis and implementation until the routing tool fails to meet timing. At this point, you know that the tools tried as hard as they could and just missed timing, so then the critical paths you see in the report are the 'actual' ones you need to work on.

As an aside, don't try to increase the clock speed all the way to 100 MHz initially, since that will cause the routing tool to give up before it has tried anything. Thus, you will get 'false' critical paths that aren't necessarily where you should spend your time when optimizing.

# 5.3 Optimization Tips

As you work on achieving a higher clock speed, you will likely notice that the routing tool (PAR) is quite temperamental. You may find that your design might meet timing for a given clock speed, but after making a small, insignificant design change, the tool fails to meet timing. This is because PAR uses a random seed as a starting point in its algorithm. Sometimes it is a 'good' seed and yields an optimal result, but a small design change may cause the same seed to become 'bad' for that design and it yields a sub-optimal result.

As you optimize your design, you will want to try running mmult on your newly optimized designs as you go along. You don't want to make a lot of changes to your processor, get a better clock speed, and then find out you broke something along the way.

You will find that sacrificing CPI for a better clock speed is a good bet to make in some cases, but will worsen performance in others. You should keep a record of all the different optimizations you tried and the effect they had on CPI and minimum clock period; this will be useful for the final report when you have to justify your optimization and architecture decisions.

There is no limit to what you can do in this section. The only restriction is that you have to run the original, unmodified mmult program so that the number of instructions remain fixed. You can add as many pipeline stages as you want, stall as much or as little as desired, add a branch predictor, or perform any other optimizations. If you decide to do a more advanced optimization (like a 5 stage pipeline), ask the staff to see if you can use it as extra credit in addition to the optimization.

You will be graded based on the best mmult performance you were able to achieve, as well as your documentation/reasoning for your architecture modifications in the process of optimization. You need to also take into consideration area usage when optimizing, so be sure to keep records as you optimize.

# 6 Optimizations, Extra Credit, and Grading

All groups must complete the final checkoff by December 13 (by appointment). Use the week prior to your final checkoff for code cleanup, optimizations, late checkpoints, and optional extra credit projects.

## 6.1 Grading on Optimization

To receive full credit, you must demonstrate a working CPU at an optimized clock frequency (above 50MHz) that has a working BIOS, can load and execute programs (both echo and mmult), can receive, process, and send to user I/O, and has a working audio synthesizer. Additionally, you will be graded on total FPGA resource utilization, with the best designs using as few resources as possible. If you are unable to make the deadline for any of the checkpoints, it is still in your best interest to complete the design late, as you can still receive most of the credit if you get a working design by the final checkoff.

Credit for your area optimizations will be calculated using a cost function. At a high level, the cost function will look like:

$$\text{Cost} = C_{LUT} \times (\#\text{ of LUTs}) + C_{RAMB} \times (\#\text{ of RAMBs}) + C_{REG} \times (\#\text{ of Slice Registers})$$

where $C_{LUT}$, $C_{RAMB}$, and $C_{REG}$ are constant weights that the staff will choose to reflect how much each resource should cost. As part of your final grade we will evaluate the cost of your design based on this metric. Keep in mind that cost is only one very small component of your project grade. Correct functionality is far more important.
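As a minimal sketch of how this metric can be used to compare two candidate designs, the snippet below evaluates the cost function with placeholder weights. The weight values and resource counts are assumptions for illustration only; the actual weights are decided by the staff.

```python
def design_cost(luts, rambs, slice_regs, c_lut, c_ramb, c_reg):
    """Weighted sum of FPGA resources, matching the cost function above."""
    return c_lut * luts + c_ramb * rambs + c_reg * slice_regs


# Hypothetical weights: the real values are chosen by the course staff.
WEIGHTS = {"c_lut": 1.0, "c_ramb": 100.0, "c_reg": 0.5}

# Hypothetical resource counts, e.g. as read from a utilization report.
design_a = design_cost(luts=3000, rambs=34, slice_regs=1500, **WEIGHTS)
design_b = design_cost(luts=2500, rambs=34, slice_regs=2000, **WEIGHTS)

print(f"design A cost: {design_a}")
print(f"design B cost: {design_b}")
```

Under these assumed weights, trading 500 LUTs for 500 extra slice registers lowers the cost, because a register is weighted at half a LUT. With different weights the comparison could go the other way.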

## 6.2 Checkpoints

We have divided the project up into checkpoints so that you (and the staff) can pace your progress. The due dates are indicated at the end of each checkpoint section, as well as in the **Project Timeline** (Section 7) at the end of this document. During the week each checkpoint is due, you will be required to get your implementation checked off by the GSI in the lab section you are enrolled in.

## 6.3 Style: Organization, Design

Your code should be modular, well documented, and consistently styled. Projects with incomprehensible code will upset the graders.

## 6.4 Final Project Report

Upon completing the project, you will be required to submit a report detailing the progress of your EECS151/251A project. The report should document your final circuit at a high level and describe the design process that led you to your implementation. We expect you to document and justify any tradeoffs you have made throughout the semester, as well as any pitfalls and lessons learned (rather than excuses for why something didn't work). Additionally, you will document any optimizations made to your system, the system's performance in terms of area (resource use), clock period, and CPI, and other information that sets your project apart from other submissions.

The staff emphasizes the importance of the project report because it is the product you are able to take with you after completing the course. All of your hard work should be reflected in the project report. Employers may ask (and have asked) to examine your EECS151/251A project report during interviews. Put effort into this document and be proud of the results. You may consider the report to be your medal for surviving EECS151/251A.

### 6.4.1 Report Details

You will turn in your project report on Gradescope by the final checkoff date. The report should be around 8 pages total with around 5 pages of text and 3 pages of figures ( $\pm$  a few pages on each). Ideally you should mix the text and figures together.

Here is a suggested outline and page breakdown for your report. You do not need to follow this outline strictly; it is here just to give you an idea of what we will be looking for.

- **Project Functional Description and Design Requirements**. Describe the design objectives of your project. You don't need to go into details about the RISC-V ISA, but you need to describe the high-level design parameters (pipeline structure, memory hierarchy, etc.) for this version of the RISC-V. ($\approx 0.5$ page)
- **High-level Organization**. How is your project broken down into pieces? Give a block-diagram-level description. We are most interested in how you broke the CPU datapath and control down into submodules, since the code for the later checkpoints will be pretty consistent across all groups. Please include an updated block diagram. ($\approx 1$ page)
- **Detailed Description of Sub-pieces**. Describe how your circuits work. Concentrate here on novel or non-standard circuits. Also, focus your attention on the parts of the design that were not supplied to you by the teaching staff. For instance, describe the details of your FIFOs, audio synthesizer, and any extra credit work. ($\approx 2$ pages)
- **Status and Results**. What is working and what is not? At what frequency (50 MHz or greater) does your design run? Do certain checkpoints work at a higher clock speed while others only run at 50 MHz? Please also provide the number of LUTs and SLICE registers used by your design, which can be found by running make report. Also include the CPI and minimum clock period of running mmult for the various optimizations you made to your processor. This section is particularly important for non-working designs (to help us assign partial credit). ($\approx 1$-$2$ pages)
- **Conclusions**. What have you learned from this experience? How would you do it differently next time? ($\approx 0.5$ page)
- **Division of Labor**. This section is mandatory. Each team member will turn in a separate document for this part only. The submission for this document will also be on Gradescope. How did you organize yourselves as a team? Exactly who did what? Did both partners contribute equally? Please note your team number next to your name at the top. ($\approx 0.5$ page)

When we grade your report, we will grade for clarity, organization, and grammar. Make sure to proofread and correct mistakes before turning it in. Submit your report to the Gradescope assignment. Only one partner needs to submit the shared report, while each individual will need to submit the division of labor report to a separate Gradescope assignment.

## 6.5 Extra Credit

Teams that have completed the base set of requirements are eligible to receive extra credit worth up to 10% of the project grade by adding extra functionality and demonstrating it at the time of the final checkoff.

The following are suggested projects that may or may not be feasible in one week.

- Integrating your Tone Generator and I2S Controller from the Lab Assignments (see Section B for details)
- Branch Predictor: Implement a two-bit (or more complicated) branch predictor with a branch history table (BHT) to replace the naive 'always taken' predictor used in the project
- 5-Stage Pipeline: Add more pipeline stages and push the clock frequency past 100MHz
- Audio Recording: Capturing mic input from the Pynq's microphone
- RISC-V M Extension: Extend the processor with a hardware multiplier and divider
- 3 (or more) bit color: Increase the size of the framebuffer to have control of the RGB content of each pixel
- Dynamic Resolution: Allow the processor to control the output resolution of the DVI controller at runtime

When the time is right, if you are interested in implementing any of these, see the staff for more details.
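To illustrate the branch predictor suggestion, here is a minimal behavioral sketch of a branch history table built from two-bit saturating counters. It is a software model for reasoning about prediction accuracy, not RTL. The table size, the PC indexing scheme, the initial counter value, and the example branch address are all assumptions.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters, indexed by PC bits."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
        # Starting at 1 (weakly not taken) is an arbitrary choice.
        self.table = [1] * (1 << index_bits)

    def _index(self, pc):
        # Drop the two byte-offset bits of a word-aligned PC (assumed scheme).
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Replay a sequence of branch outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop-closing branch: taken 7 times, then not taken once, executed twice.
outcomes = ([True] * 7 + [False]) * 2
misses = count_mispredictions(TwoBitPredictor(), 0x10000040, outcomes)
print(f"mispredictions: {misses} of {len(outcomes)}")
```

The point of the two-bit counter is that a single loop exit does not flip the prediction, so the second pass through the loop starts out predicted correctly.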

## 6.6 Project Grading

- 80% Functionality at project due date. Your design will be subjected to a comprehensive test suite and your score will reflect how many of the tests your implementation passes.
- 5% Optimization at project due date. This grade is a function of the resources used by your implementation. This score is contingent on implementing all the required functionality. An incomplete project will receive a zero in this category.
- 5% Checkpoint functionality. You are graded on functionality for each completed checkpoint. The total of these scores makes up 5% of your project grade. The weight of each checkpoint's score may vary.
- 10% Final report and style demonstrated throughout the project.

The tabulation above does not include extra credit, which is awarded as follows:

- Up to 10% Additional functionality. Credit for additional functionality will be decided on a case-by-case basis. Students interested in expanding the functionality of their project must meet with a GSI well ahead of time to qualify for extra credit. The point value will be decided by the course staff and will depend on the complexity of your proposal, the creativity of your idea, and its relevance to the material taught.

# 7 Project Timeline

| Checkpoint                                      | Deliverable                               | Due Date                     |
|-------------------------------------------------|-------------------------------------------|------------------------------|
| 1 & 2: RISCV151 Processor                       | Design Review<br>In-Lab Checkoff          | October 25<br>November 15    |
| 3: IO, FIFOs, Video/Graphics                    | In-Lab Checkoff<br>Project Interview      | November 22                  |
| Final Checkoff, Extra Credit, and Optimizations | In-Lab Checkoff<br>GitHub code submission | December 13 (by appointment) |
| Final Report                                    | Gradescope submission                     | December 15                  |

Table 5: EECS151 Fall 2019 Project Timeline

# A Local Development

You can build the project on your laptop, but there are a few dependencies to install. In addition to Vivado and Icarus Verilog, you need a RISC-V GCC cross compiler and an elf2hex utility.

## A.1 Linux

On Ubuntu, install the gcc-riscv64-linux-gnu system package to get the RISC-V GCC toolchain (sudo apt install gcc-riscv64-linux-gnu). Packages exist for other distros too.

To install elf2hex:

```
git clone git@github.com:sifive/elf2hex.git
cd elf2hex
autoreconf -i
./configure --target=riscv64-linux-gnu
make
vim elf2hex # Edit line 7 to remove 'unknown'
sudo make install
```

## A.2 OSX, Windows

Download SiFive's GNU Embedded Toolchain from SiFive's website. See the 'Prebuilt RISC-V GCC Toolchain and Emulator' section.

# B Tone Generator & I2S Extra Credit

This section details an extra credit opportunity for you to integrate your tone generator and I2S controller from the lab assignments with your RISC-V processor.

## B.1 Summary

You will add the tone\_generator created in lab to the design as a peripheral and allow programs to access its tone\_switch\_period and output\_enable inputs over memory mapped I/O. You will then be able to use your processor to emulate the music\_streamer FSM built in lab using software only.

Once you have the tone generator working, you will be able to load a program onto your processor which allows you to use your keyboard as a piano. The program synthesizes sine waves of various frequencies based on what key you are pressing and will transmit the wave to the codec so you can hear the music you play through your headphones.

Get started by pulling the latest skeleton files from the staff repository: git pull staff master.

## B.2 Tone Generator Hookup

Copy over your tone\_generator.v from Lab 5 to the hardware/src/audio/ directory. Recall that your tone\_generator takes a tone\_switch\_period which describes how many clock cycles the tone\_generator takes to invert its square\_wave\_out output. There is also an output\_enable input into the tone\_generator which gates the square\_wave\_out output low.

We want to give our RISC-V core the ability to set the tone\_switch\_period and the output\_enable of the tone generator. Here is the addition to the memory map:

Table 6: Tone Generator Memory Map Additions

| Address      | Function                          | Access | Data Encoding                    |
|--------------|-----------------------------------|--------|----------------------------------|
| 32'h80000034 | Tone Generator Output Enable      | Write  | {31'b0, output_enable}           |
| 32'h80000038 | Tone Generator Tone Switch Period | Write  | {8'b0, tone_switch_period[23:0]} |

Modify z1top.v and Riscv151.v. Instantiate the tone\_generator at the top level and connect square\_wave\_out to the AUDIO\_PWM output. The output\_enable input should be connected to the AND of BUTTONS[0] and the register that's written by the CPU. The tone\_switch\_period input should be connected to another register that's written by the CPU via memory mapped I/O.

Modify your CPU to take in and output any signals it needs for this tone generator hookup.
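The register behavior described above can be modeled in a few lines. The following is a behavioral sketch only, not the Verilog you will write: the class and method names are hypothetical, and masking off the unused upper data bits is an assumption based on the data encodings in Table 6.

```python
TONE_OUTPUT_ENABLE_ADDR = 0x80000034  # from Table 6
TONE_SWITCH_PERIOD_ADDR = 0x80000038  # from Table 6


class ToneGeneratorRegs:
    """Behavioral model of the two CPU-writable tone generator registers."""

    def __init__(self):
        self.output_enable = 0
        self.tone_switch_period = 0

    def write(self, addr, data):
        """Model a CPU store to the memory mapped I/O address space."""
        if addr == TONE_OUTPUT_ENABLE_ADDR:
            self.output_enable = data & 0x1              # {31'b0, output_enable}
        elif addr == TONE_SWITCH_PERIOD_ADDR:
            self.tone_switch_period = data & 0xFFFFFF    # {8'b0, period[23:0]}
        # Stores to other addresses do not affect these registers.

    def gated_enable(self, button0):
        """output_enable seen by the tone generator: register AND BUTTONS[0]."""
        return self.output_enable & (button0 & 0x1)


regs = ToneGeneratorRegs()
regs.write(TONE_SWITCH_PERIOD_ADDR, 0xFF00DDF2)  # upper 8 bits are dropped
regs.write(TONE_OUTPUT_ENABLE_ADDR, 1)
print(hex(regs.tone_switch_period), regs.gated_enable(1), regs.gated_enable(0))
```

Keeping the button gating outside the register means software cannot override the button, which matches the AND described above.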

### B.2.1 Testing the Tone Generator

We have provided a program to test your tone generator and its memory map. It can be found in software/tone\_gen\_test/. Compile and run the program just as you did for the user\_io\_test. Make sure the first slide switch is on. jal to the program from the BIOS, and you can play with these commands:

- on: Sets the output enable register high
- off: Sets the output enable register low
- tone <tone\_switch\_period>: Writes the user-specified tone\_switch\_period (32 bits, in hex) to the tone switch period address
- exit: Jumps back to the BIOS

Calculate the tone\_switch\_period for a 440 Hz tone with a 50 MHz clock and try sending the command for that through the test program. Verify that the piezo speaker is buzzing at 440 Hz by comparing it to a 440 Hz square wave from a reference tone generator.
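As a sketch of the calculation: the square wave inverts once every tone\_switch\_period clock cycles, so one full period of the tone spans two switch periods. This assumes your tone\_generator toggles exactly every tone\_switch\_period cycles; if your counter compares differently, the value may be off by one. The helper name below is hypothetical.

```python
def tone_switch_period(clock_hz, tone_hz):
    """Clock cycles between square wave inversions for a given tone frequency.

    One full square wave period contains two inversions, hence the factor of 2.
    """
    return round(clock_hz / (2 * tone_hz))


period = tone_switch_period(50_000_000, 440)
actual_hz = 50_000_000 / (2 * period)

print(f"tone_switch_period = {period} = {hex(period)}")
print(f"resulting tone = {actual_hz:.4f} Hz")
```

The result has to be rounded to an integer number of cycles, so the generated tone is very slightly off 440 Hz, which is not audible.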

### B.2.2 Music Streamer in Software

Now that we have access to user I/Os and access to the tone generator, we can fully implement the music streamer and sequencer FSM from Lab 4 entirely in software!

Kind of... We still don't have enough buttons, so we are limited in the functionality we can implement. The workaround is to not include the sequencer from Lab 4 (which was an optional part anyway). Since button 3 is now our reset, we can use buttons 0-2 to implement the rest of the functionality. The software version of the music streamer maps button 0 to play/pause, button 1 to reverse, and button 2 to tempo reset. As you'll have noticed, there is no way to change the tempo, so if you would like to add that functionality, feel free to edit the .c file. You can use the switches as we did in the lab, but keep in mind that they behave differently from the buttons. How you do this part is up to you, but it is enough to show a version with only play/pause, reverse, and tempo reset. Ask a GSI if you need help.

The music streamer program can be found in **software/music\_streamer**. To use this program, use the same scripts from Lab 3 to generate a music data file from a MusicXML file, and then convert that data file to a static array declaration that can be used in a C program.

```
python scripts/musicxml_parser.py musicxml/Row_Row_Row_Your_Boat.mxl music.txt
```

You will now have a music.txt file in the /software/music\_streamer directory with the music data. Now we use the c\_array\_generator.py script to create a static array declaration using this file.

```
python scripts/c_array_generator.py music.h music.txt
```

Now, we have a file called music.h that has a static array declaration filled with the music data. This serves exactly the same purpose as the ROM that was generated in Lab 3.

To build the music\_streamer program and place it on your processor, execute:

```
make
coe_to_serial music_streamer.coe 3000a000
screen $SERIALTTY 115200
151> jal 1000a000
```

The music\_streamer functions exactly like it does in Lab 3, but the state machine that was implemented directly in hardware is now implemented in software. Here is the state machine diagram for reference:



The program will print out information as you transition the state machine, edit notes in the sequencer, and modify the tempo.
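For orientation, the button mapping described above (button 0 play/pause, button 1 reverse, button 2 tempo reset) can be modeled as a small state machine. This is a sketch of one plausible reading, not the provided program: the state names, the resume behavior after a pause, the wrap-around at the ends of the song, and DEFAULT\_TEMPO are all assumptions. The .c file in software/music\_streamer is authoritative.

```python
from enum import Enum


class State(Enum):
    PAUSED = 0
    PLAY = 1
    REVERSE = 2


DEFAULT_TEMPO = 1  # hypothetical placeholder value


class MusicStreamer:
    """Behavioral model of a software music streamer FSM."""

    def __init__(self, notes):
        self.notes = notes              # e.g. tone_switch_period values
        self.index = 0
        self.state = State.PAUSED
        self.resume_state = State.PLAY  # direction to resume after a pause
        self.tempo = DEFAULT_TEMPO

    def press(self, button):
        if button == 0:                 # play/pause
            if self.state is State.PAUSED:
                self.state = self.resume_state
            else:
                self.resume_state = self.state
                self.state = State.PAUSED
        elif button == 1:               # reverse (ignored while paused)
            if self.state is State.PLAY:
                self.state = State.REVERSE
            elif self.state is State.REVERSE:
                self.state = State.PLAY
        elif button == 2:               # tempo reset
            self.tempo = DEFAULT_TEMPO

    def step(self):
        """Return the note to play now (None when paused) and advance."""
        if self.state is State.PAUSED:
            return None
        note = self.notes[self.index]
        delta = 1 if self.state is State.PLAY else -1
        self.index = (self.index + delta) % len(self.notes)
        return note


streamer = MusicStreamer([10, 20, 30])
played = [streamer.step()]              # paused: nothing plays
streamer.press(0)                       # start playing forward
played += [streamer.step(), streamer.step()]
streamer.press(1)                       # switch to reverse
played += [streamer.step(), streamer.step()]
print(played)
```

On the processor, each returned note would be written to the tone switch period address, with button presses read through the user I/O memory map.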

## B.3 Tone Generator and I2S Extra Credit Deliverables

- Demonstrate the music streamer program.
- Demonstrate the piano program.

# C BIOS

This section was written by Vincent Lee, Ian Juch, and Albert Magyar.

## C.1 Background

For the first checkpoint we have provided you a BIOS written in C that your processor is instantiated with. BIOS stands for Basic Input/Output System and forms the bare bones of the CPU system on initial boot up. The primary function of the BIOS is to locate and initialize the system and peripheral devices essential to PC operation, such as memories, hard drives, and the CPU cores.

Once these systems are online, the BIOS locates a boot loader that initializes the operating system loading process and passes control to it. For our project, we do not have to worry about loading the BIOS since the FPGA eliminates that problem for us. Furthermore, we will not deal too much with boot loaders, peripheral initialization, and device drivers as that is beyond the scope of this class. The BIOS for our project will simply allow you to get a taste of how the software and hardware layers come together.

We instantiate the memory with the BIOS to avoid the problem of bootstrapping the memory, which is required on most computer systems today. Throughout the next few checkpoints we will be adding new memory mapped hardware that our BIOS will interface with. This document is intended to explain the BIOS for checkpoint 1 and how it interfaces with the hardware. In addition, this document will give you pointers if you wish to modify the BIOS at any point in the project.