# Multicore CPU using a NoC based on packet switching

Talibov Serhan May 2025

# User manual

# Contents

| 1 | Tha                | anks to                    | 3             |
|---|--------------------|----------------------------|---------------|
| 2 | Intr<br>2.1<br>2.2 | Network on a chip          | <b>3</b><br>3 |
| 3 | Rep                | pository directories       | 3             |
| 4 | Ger                | neral description          | 5             |
| 5 | Det                | ailed description          | 6             |
|   | 5.1                | CPU structure              | 6             |
|   | 5.2                | Convolution code           | 7             |
|   | 5.3                | Memory controller          | 9             |
|   | 5.4                | · ·                        | 10            |
|   |                    |                            | 10            |
|   |                    |                            | 11            |
|   |                    |                            | 11            |
|   | 5.5                |                            | 11            |
|   | 5.6                |                            | 12            |
| 6 | Usa                | age and necessary software | 13            |
|   | 6.1                | · ·                        | 13            |
|   | 6.2                |                            | 13            |
|   | 6.3                |                            | 15            |
|   |                    |                            | 15            |
|   |                    |                            | 17            |

### 1 Thanks to

- Lezhnev E.V., Romanov A.Yu. for the developing the idea and making this project possible;
- Grushevskii N.I. for being a codeveloper of the NoC connection subsystem taking an equal part of this task;
- Nigmatullin N.R. for the development of NoC converters;
- Zhelnio S. for the development of schoolRISCV soft core.

#### 2 Introduction

## 2.1 Network on a chip

Network on a chip (NoC) is a network-based communication between IP cores on an integrated circuit (IC), that uses packet switching to transmit data packets. This type of interconnect gives a great tradeoff between the chip area used and the interconnect bandwidth compared to other technologies such as point-to-point, bus, or ring-type connections when the number of connection points increases.

## 2.2 Goals of this project

The goal of this project is to create an HDL description of an example of a multicore CPU based on NoC technology. This CPU consists of several parts:

- CPU core a schoolRISCV single-cycle soft core;
- NoC interconnect a mesh 4x4 NoC for a 16-core CPU;
- packet converters to convert CPU output into transmittable data packets (flits).

# 3 Repository directories

| Directiry                        | Description                                                                        |  |  |
|----------------------------------|------------------------------------------------------------------------------------|--|--|
| doc                              |                                                                                    |  |  |
| <sup>L</sup> UserManual.pdf      | user manual for this project                                                       |  |  |
| boards                           | HDL files and scripts for programming FPGA boards                                  |  |  |
| - *board_name*                   | a directory containing files for generating a Quartus project for a specific board |  |  |
| - toplevel.sv                    | a common toplevel module to be used in board-specific modules for hardware-on-loop |  |  |
| <sup>L</sup> toplevel_onboard.sv | a common toplevel module to be used in board-specific modules for self-contained   |  |  |
| cores                            | HDL files for schoolRISCV soft core and supporting modules                         |  |  |
| - converters                     | HDL files for converters between memory controller (MC) packets and NoC packets    |  |  |
| - packet_collector.sv            | converter from NoC to MC                                                           |  |  |
| <sup>L</sup> splitter.sv         | converter from MC to NoC                                                           |  |  |
| L src                            | HDL files for the schoolRISCV soft core                                            |  |  |

| - cpu_with_ram.sv               | module that connects CPU and RAM to the MC                                       |  |  |  |
|---------------------------------|----------------------------------------------------------------------------------|--|--|--|
| ram.sv                          | a two-port RAM                                                                   |  |  |  |
| - sm_register.v                 | a DFF for an instruction counter                                                 |  |  |  |
| - sm_rom.v                      | preloaded instructions                                                           |  |  |  |
| - sr_cpu.v                      | a CPU module with a counter, decoder, register file, ALU, AGU and a control unit |  |  |  |
| - sr_cpu.vh                     | 'define macros for RISCV opcodes and ALU/AGU oper codes                          |  |  |  |
| - sr_mem_ctrl.sv                | a memory controller (MC) connecting CPU to RAM through the NoC                   |  |  |  |
| L sr_mem_ctrl.svh               | 'define macros for MC instructions                                               |  |  |  |
| cpu                             | HDL files for the 16-core CPU on a NoC                                           |  |  |  |
| - noc_with_cores.sv             | connects 16 CPU cores to the mesh 4x4 NoC                                        |  |  |  |
| L uart.sv                       | hooks up the 16-core CPU to the UART to monitor RAM data at a given address      |  |  |  |
| $\mathrm{mesh}_{-}4\mathrm{x}4$ | HDL files for the 4x4 mesh NoC                                                   |  |  |  |
| - inc                           | 'define macros for NoC configuration                                             |  |  |  |
| - noc.svh                       | macros for general NoC parameters                                                |  |  |  |
| - noc_XY.svh                    | macros for topology-specific (mesh) parameters                                   |  |  |  |
| - queue.svh                     | macros for queue parameters                                                      |  |  |  |
| L router.svh                    | macros for router parameters                                                     |  |  |  |
| - noc                           |                                                                                  |  |  |  |
| L noc.sv                        | module that connects 16 routers into a NoC                                       |  |  |  |
| L src                           | HDL files for router components                                                  |  |  |  |
| - algorithm.sv                  | an XY algorithm for packet switching                                             |  |  |  |
| - arbiter.sv                    | a module that chooses a packet to be switched                                    |  |  |  |
| - queue.sv                      | FIFOs for collecting incoming packets                                            |  |  |  |
| L router.sv                     | a module that creates a router from its components                               |  |  |  |
| modelsim                        |                                                                                  |  |  |  |
| ram_image_05.hex                | RAM images that contain a picture                                                |  |  |  |
| - instr_node_015.hex            | RAM images that contain RISCV codes for each core                                |  |  |  |
| - modelsim_run.bat              | a batch file that launches ModelSim using modelsim_script.tcl script             |  |  |  |
| L modelsim_script.tcl           | a script, according to which the simulation is ran                               |  |  |  |
| tb                              | HDL files for testbenches                                                        |  |  |  |
| L tb.sv                         | a testbench files that tests the CPU, dumping RAM contents at the end            |  |  |  |
| -                               |                                                                                  |  |  |  |

# 4 General description

This project contains a complete HDL description of a 4x4 mesh NoC, that uses an XY routing algorithm for packet switching. Each NoC router has a number of bidirectional connections for its neighbors and a single one for a CPU core assigned to the router as shown in Figure 1:



Figure 1: Simplified illustration of the NoC

If simulated or programmed into an FPGA without any modifications, this CPU will compute a simple convolution of a 51x90 image with a 3x3 kernel without any padding, so with any given 51x90 image the result will be a 49x88 image, where each of the pixels are computed with a formula (1), where in is an initial image, k is the convolution kernel and out is the result:

$$out(x-1,y-1) = \sum_{i=1}^{3} \sum_{j=1}^{3} k(i,j) * in(x-i,y-j)$$
 (1)

Given that  $k = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 1 \end{pmatrix}$ , the resulting image will look like a blurred version of the original when divided by a factor of  $\sum_{i=1}^{3} \sum_{j=1}^{3} k(i,j) = 10$  as shown in Figure 2:



Figure 2: Initial image (left) and the result of convolution after being divided by 10 (right)

# 5 Detailed description

#### 5.1 CPU structure

In this project a schoolRISCV single-cycle soft core [1] was used as a CPU. It is a tiny core that supports only a tiny subset of RV32I integer instructions: add, or, srl, sltu, sub, addi, lui, beq, bne. The structure of this core is presented in Figure 3:



Figure 3: Original single-cycle schoolRISCV CPU core

This instruction set is not enough to compute integer convolution because it uses multiplication and addition, hence the CPU was modified: new opcodes and alu operations were added inside the sr\_cpu.vh file, and ALU itself was modified to support integer multiplication.

Another modification was to add sw and lw instructions, because the idea is to read the image from one location in the RAM and write the result to another. To support these instructions, an AGU (address generation unit) was developed, which decodes lw rd, imm(rs1) / sw rs2, imm(rs1) instruction format and outputs data, address and instruction, which are read by the memory controller (MC). AGU is able to stop the instruction counter inside the CPU to pause it when it encounters an lw/sw instruction until further notice by the MC: after lw AGU waits until the MC signals that the data were written into the register file (RF), and after sw AGU waits until the MC signals that it took the instruction. A modified CPU structure is presented in Figure 4:



Figure 4: Modified single-cycle schoolRISCV CPU core

#### 5.2 Convolution code

A 16-core CPU has a 1024-word RAM chunk assigned to each core totalling 16384 words for the entire CPU. The processor sees it as a continuous memory with logical addresses spanning from 0x0 to 0x3FFF (this memory is word-addressed rather than being byte-addressed. Byte-addressing is the standard for a RISC-V CPU, so the addressing issue in the assembly code and in the memory controller should be addressed in the future). The 51x90 image is stored continuously throughout the 0th to 4th memory blocks, then the 3x3 kernel is located in the block 5. The resulting image is written to the next 5 blocks. The following listing contains a full RISCV assembly code that is preloaded into one of the cores, that computes 6 lines of the resulting image:

```
li a2, 1
li t0, 50
sub a3, t0, a2
li a4, 38
li t2, 2048
addi t2, t2, 1024
addi t2, t2, 1024
addi t2, t2, 1024
addi t3, t2, 1024
addi t3, t3, 1488
li a0, 1
li a1, 32
lw s2, 0x0(t2)
lw s3, 0x1(t2)
lw s4, 0x2(t2)
lw s5, 0x3(t2)
lw s6, 0x4(t2)
```

```
lw s7, 0x5(t2)
lw s8, 0x6(t2)
lw s9, 0x7(t2)
lw s10, 0x8(t2)
convolveY:
    convolveX:
    li t6, 0
        mul gp, t0, a1
        add gp, gp, a0
        add t4, gp, zero
         sub t4, t4, t0
        sub t4, t4, a2
        lw t5, 0x0(t4)
        mul t5, t5, s2
        add t6, t6, t5
         addi t4, t4, 1
        lw t5, 0x0(t4)
        mul t5, t5, s3
        add t6, t6, t5
        addi t4, t4, 1
        lw t5, 0x0(t4)
        mul t5, t5, s4
         add t6, t6, t5
        add t4, gp, zero
        sub t4, t4, a2
        lw t5, 0x0(t4)
        mul t5, t5, s5 add t6, t6, t5
         addi t4, t4, 1
        lw t5, 0x0(t4)
        mul t5, t5, s6
        add t6, t6, t5
        addi t4, t4, 1
        lw t5, 0x0(t4)
        mul t5, t5, s7
```

add t6, t6, t5

```
add t4, gp, zero
        add t4, t4, t0
        sub t4, t4, a2
        lw t5, 0x0(t4)
        mul t5, t5, s8
        add t6, t6, t5
        addi t4, t4, 1
        lw t5, 0x0(t4)
        mul t5, t5, s9
        add t6, t6, t5
        addi t4, t4, 1
        lw t5, 0x0(t4)
        mul t5, t5, s10
        add t6, t6, t5
        sw t6, 0x0(t3)
        addi t3, t3, 1
    addi a0, a0, 1
    bne a0, a3, convolveX
    addi a0, zero, 1
addi a1, a1, 1
bne a1, a4, convolveY
addi a1, zero, 1
idle:
    beq zero, zero, idle
```

Each core computes lines that are contained in the [a1 - 1, a4 - 2] range. The t3 register value is the address of the (a1 - 1)th line. Each core is set up in a way that helps avoid any overlaps or skips in the resulting image.

#### 5.3 Memory controller

Each CPU core sees the RAM as a continuous unit in a logical address space, meanwhile physically it is subdivided into 16 pieces of 1024 words each addressed from 0x0. This means that an MC, that is able to translate from logical address to physical, send and receive messages from the NoC and interact with the CPU core, is needed. The memory controller was designed to perform two types of actions:

- Take the data and instruction from the CPU and send them out to the NoC, stop the instruction flow;
- Take the full packet from the NoC, analyse it and perform necessary actions.

For a full packet coming from the NoC there are 4 types of actions:

• Load requested: return the packet with data to the requester;

- Load satisfied: receive data from the packet and write it into the register file (RF), resume the instruction flow;
- Store requested: write received data into the received RAM address;
- Store satisfied: resume the instruction flow;

A simple schematic for the developed MC is presented in Figure 5:



Figure 5: A simple schematic for the memory controller

#### 5.4 Converters

An MC sends a 64-bit packet with the ID of a destination node and receives a 64-bit packet with the ID of a sender node. A NoC may transfer the data in a completely different format, so there was a need to design MC-to-NoC converters, more specifically a splitter, that divides an MC packet into transferrable data packets (flits), and a packet collector, that assembles NoC flits into a full packet, that is readable by the MC.

#### 5.4.1 Splitter

A schematic of a splitter is presented in Figure 6:



Figure 6: A diagram of a splitter

It consists of four FIFOs, that collect data packets, an ID of a destination node, an instruction code and a packet ID, that is generated by the MC. Each clock it creates a 35-bit flit, that contains a validity bit, a 16-bit payload and other information, that is used to route the flit and then assemble them into the full packet at the destination. A full data packet is split into 4 flits, and after the splitter finishes sending out pieces of a full packet it shifts all of the FIFOs.

#### 5.4.2 Packet collector

A schematic of a packet collector is presented in Figure 7:



Figure 7: A diagram of a packet collector

It consists of a buffer, where the packets are assembled, a decoder, which writes into the buffer, depending on contents of the incoming packet, and an analyzer, which reads from a buffer each clock in search of a complete packet.

#### 5.4.3 TTL

Both memory controller and packet collector have a TTL system in place that ensures full completion of all load/store instructions while avoiding any deadlocks caused by the overflow of a packet collector. Each clock the collector purges all of the buffers that were left without new flits for a long enough period of time. The memory controller has its own TTL system, which is used to resend the instruction if there was no response in a certain timeframe.

#### 5.5 Network-on-chip

A Network-on-chip (NoC) allows to transfer data packets between completely different pairs of nodes simultaneously using a network of routers, that use packet switching to route traffic. This particular NoC is a 4x4 mesh network that uses an XY-algorithm to switch data packets. The developed router consists FIFOs for each input port, that collect incoming packets, a round-robin arbiter, that chooses a FIFO, from which the packet is taken out to be analyzed, and an algorithm, that analyzes the chosen packet and outputs it to the corresponding output. A packet contains XY-coordinates of a destination, which in conjunction with knowing its own coordinates is enough for the algorithm to determine the correct output. A diagram of a router is presented in Figure 8:



Figure 8: A diagram of a NoC router

Ports, which are named from north to west are used to connect routers together, the core input is connected to a splitter of the core assigned to the router and the core output is connected to a packet collector, that belongs to the same core. With all of the components assembled together, the whole network looks like the diagram, presented in Figure 9:



Figure 9: A diagram of a complete CPU

#### **5.6** UART

UART is used to monitor RAM data at all times. Each of the sixteen RAM chunks is a 2-port RAM, where the first port is used in the CPU and the second one is read-only and connected to the UART logic. Separate uartTx and uartRx, that handle the RX and TX data streams, and the logic, that manages these modules, were developed. A diagram is presented in Figure 10:



Figure 10: A diagram of UART logic

This particular UART controller is set to 115200 bauds per second, though it is easily configurable from the toplevel.sv file. The first step is to receive from the RX port 4 bytes, that encode a word address starting with the LSB first, then receive a single byte that encodes the ID of a node. Then the CPU returns RAM contents from that particular location, and since the word is 32 bit long, it is transferred through TX as 4 bytes starting with the LSB first too.

# 6 Usage and necessary software

#### 6.1 Simulation structure

The design was simulated in ModelSim - Intel FPGA Starter Edition 10.5b software. The testbench file **tb.sv** is contained in a **tb** directory of repository, the simulation is contained in the **modelsim** directory. This directory consists of hex files containing instructions for each of the CPU cores and initial RAM contents of each of the memory blocks, a .tcl script and a .bat file for running ModelSim. To start the simulation you just need to run the .bat file.

#### 6.2 Simulation result

After launching the **modelsim\_run.bat** file the ModelSim GUI is launched and runs the simulation. The main window of the ModelSim is the Wave viewer (Figure 11), which shows **clk**, **rst\_n**, **rx**, **tx**, **clkRx**, **clkTx** signals. **rx** and **tx** signals represent received and transmitted UART signals, where **rx** is driven in the **tb.sv** as a UART device requesting data from a specific place in the RAM, and **tx** is driven by the design sending the contents of RAM.



Figure 11: A ModelSim GUI

After the simulation is finished, the simulation folder gets filled up with new files named **out-put\_image\_chunk\_0..4.hex** and **latency\_log\_0..15.csv**. Image chunk files store contents of the RAM blocks that contain the resulting image written in a hexadecimal format. Latency log files contain a headerless csv table that contain a number of clocks required to complete every **lw** and **sw** instruction. Column 0 contains a timestamp of completion, column 1 contains latency as a number of clock cycles. Both files can be parsed and demonstrated in a visual manner (Figure 12, Figure 13) in a Python script, for example (scripts may end up in a repository).



Figure 12: Original image (left), desired result (center), image extracted from the CPU (right)



Figure 13: lw and sw instruction latency charts for each core

#### 6.3 Unit on hardware

There are 2 variants of testing said unit on hardware: self-contained, where you can check the design using just the development board, and hardware-in-loop, where the data is transmitted to the PC using UART and TTL adapter to see the RAM contents or to see the image being drawn in real time (coming soon).

#### 6.3.1 Self-contained

The board used to see the working CPU in hardware was DE10-Standard with a Cyclone V 5CSXFC6D6F31C6N SoC. To create a Quartus project for the design compilation and further programming you need to use a **quartus\_onboard.tcl** script in the **boards/DE10-Standard** directory using a following batch command in Windows:

#### $quartus\_sh$ -t $quartus\_onboard.tcl$

After some time the design is compiled and you can open a Quartus project, from where you can program the design on a board. Then you can explore the RAM contents according to the Figure 14



Figure 14: Usage of a self-contained design on a board



Figure 15: Example photo of the unit programmed to the FPGA

Total resource usage: Total ALMs: 31761 Total registers: 37891

Total block memory bits: 524288

Resource usage by entity:

| Entity           | Average ALUTs | Average registers | Average DSP blocks |
|------------------|---------------|-------------------|--------------------|
| schoolRISCV core | 1372.9        | 1032.0            | 2                  |

| memory controller | 116.0 | 211.0 | 0 |
|-------------------|-------|-------|---|
| splitter          | 27.3  | 112.0 | 0 |
| collector         | 659.5 | 416.0 | 0 |
| central router    | 516.3 | 710.0 | 0 |
| edge router       | 431.1 | 560.0 | 0 |
| corner router     | 307.0 | 408.5 | 0 |

# 6.3.2 Hardware-in-loop

// coming soon