# ZAP : An ARM v4T Compatible Soft Processor

Revanth Kamaraj (revanth91kamaraj@gmail.com)

October 21, 2016

#### **MIT License**

Copyright (c) 2016 Revanth Kamaraj (Email: revanth91kamaraj@gmail.com)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


# **Contents**

- 1 Introduction
  - 1.1 Features
  - 1.2 About This Manual
- 2 Configuring the core and testbench
- 3 Compiling Code and Running Simulation
  - 3.0.1 Generating a binary using GNU tools
  - 3.0.2 Generating a Verilog memory map
  - 3.0.3 Invoking the simulator
  - 3.1 Run Sample Code quickly
- 4 IO Ports of the Core
- 5 Basic Description

# **List of Figures**

- 1 High level view of the ZAP pipeline

# **List of Tables**

- 1 Configuring the core and testbench
- 2 Variables in Perl script
- 3 IO Ports
- 4 Pipeline Stage Description


# 1 Introduction

ZAP is an ARM® v4T compatible soft processor core. The code is fully open source and is released under the MIT license.

## 1.1 Features

- **Fully ARM v4T compatible.**
  - Executes the 32-bit wide ARM v4 instruction set.
  - Executes the 16-bit wide compressed (Thumb) instruction set.
- **Deeper pipeline for better clock speed.**
  - The processor is built around a 9 stage pipeline to achieve a high operating frequency.
  - Data forwarding allows back-to-back instructions to execute without stalls.
  - Non-trivial shifts require their operands a cycle early.
  - Loads have a 3 cycle latency, and the pipeline stalls if an instruction attempts to access the loaded register within the latency period.
- **Supports interrupt and abort signaling.**
  - Features dedicated active-high, level-sensitive IRQ, FIQ and memory abort pins.
- **Can be interfaced with caches/MMU.**
  - The CPSR of the processor is exposed as a port, allowing a virtual memory system to be implemented.
  - Memory stalls may be indicated to the core via dedicated ports, allowing caches to be connected.
- **Coprocessor interface provided.**
  - The coprocessor interface simply exposes internal signals of the core. It is up to the coprocessor to interpret and process instructions correctly.
- **Supports M-variant multiplication instructions.**
  - These instructions are supported: `MUL`, `MLA`, `SMULL`, `UMULL`, `SMLAL`, `UMLAL`.
- **The core is configurable.**
  - The processor may be synthesized without compressed instruction support and/or coprocessor interface support to save area and improve speed.
- **Designed for FPGA synthesis.**
  - Most memory structures of the processor map efficiently onto FPGA block RAMs. The register file is clocked at twice the core clock rate to provide 2 write ports.
  - The branch predictor memory also maps efficiently onto FPGA block RAM.
  - No device specific instantiations are made, which keeps the design portable across FPGA vendors.
- **Faster performance of memory instructions.**
  - Memory instructions with writeback can be issued as a single instruction since the register file has 2 write ports. This may improve performance.
- The processor core is written entirely in synthesizable Verilog-2001.
- A branch predictor is installed to compensate for the longer pipeline.
  - Branches within a 2KB block of memory can be mapped into the predictor without conflict.
  - The predictor uses a bimodal prediction algorithm (a 2-bit saturating counter per branch entry).
- **Uses a base restored abort model.**
  - This makes exception handlers easier to write. If a fault occurs partway through a multiple memory transfer, the processor rolls the base pointer register back to the value it held before the operation took place.
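The base restored abort model can be illustrated with a small behavioural sketch in Python. This is not the core's RTL: the `DataAbort` exception, the `load_multiple` function and the dictionary-based register and memory models are all hypothetical names chosen for the example. It models a load-multiple with writeback in which the third transfer faults, and shows that the handler would see the original base value.

```python
class DataAbort(Exception):
    """Raised by the (hypothetical) memory model when an access faults."""


def load_multiple(regs, mem, base, reg_list):
    """Behavioural model of a load-multiple with base writeback.

    regs     : dict mapping register number -> value
    mem      : dict mapping word address -> value; a missing address faults
    base     : register number holding the start address (not in reg_list)
    reg_list : registers to load, in transfer order
    """
    saved_base = regs[base]          # remember the base before any transfer
    addr = saved_base
    try:
        for r in reg_list:
            if addr not in mem:
                raise DataAbort(hex(addr))
            regs[r] = mem[addr]
            addr += 4
            regs[base] = addr        # writeback advances with the transfer
    except DataAbort:
        regs[base] = saved_base      # base restored before the handler runs
        raise


regs = {0: 0x100, 1: 0, 2: 0, 3: 0}
mem = {0x100: 11, 0x104: 22}         # 0x108 is unmapped, so the third load faults
try:
    load_multiple(regs, mem, base=0, reg_list=[1, 2, 3])
except DataAbort:
    pass
print(hex(regs[0]))                  # prints 0x100
```

Because the base register is unchanged after the fault, a handler can fix the cause of the abort and simply re-execute the instruction.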

## 1.2 About This Manual

The purpose of this manual is to document the processor core's design. This document is very incomplete. I will try my best to update it.

# 2 Configuring the core and testbench

Throughout, it is assumed that \$ZAP\_HOME points to the working directory of the project. The core and the testbench are configured using defines, which are located in \$ZAP\_HOME/includes/config.vh.

See Table 1.

| Define           | Purpose                                                                | Required for | Comments                                                                                                                                 |
|------------------|------------------------------------------------------------------------|--------------|------------------------------------------------------------------------------------------------------------------------------------------|
| **Core configuration** |                                                                  |              |                                                                                                                                          |
| THUMB_EN         | Enables compressed instruction support.                                | Core Setup   | Enabling this increases core area and reduces performance.                                                                               |
| COPROC_IF_EN     | Enables coprocessor support. Extra ports get added.                    | Core Setup   | Enabling this increases core area and reduces performance.                                                                               |
| **Testbench configuration** |                                                             |              |                                                                                                                                          |
| IRQ_EN           | Generates periodic IRQ pulses.                                         | Testbench    | -                                                                                                                                        |
| SIM              | Generates extra messages.                                              | Testbench    | Must be UNDEFINED for correct synthesis of the core in Xilinx since some debugging structures in RTL are removed if this is not defined. |
| VCD_FILE_PATH    | Sets the path to the VCD data dump.                                    | Testbench    | -                                                                                                                                        |
| MEMORY_IMAGE     | Path to the memory image Verilog file.                                 | Testbench    | -                                                                                                                                        |
| MAX_CLOCK_CYCLES | Sets the number of cycles the simulation should run before terminating. | Testbench   | -                                                                                                                                        |
| SEED             | Sets the testbench seed. The seed influences randomness.               | Testbench    | -                                                                                                                                        |

Table 1: Configuring the core and testbench

# 3 Compiling Code and Running Simulation

NOTE: If you want to quickly test the processor with the sw/asm/prog.s and sw/asm/prog.c sample programs, see Section 3.1.

### 3.0.1 Generating a binary using GNU tools

You can use the existing GNU toolchain to generate code for the processor. This section will briefly explain the procedure. For the purposes of this discussion, let us assume these are the source files...

```
main.c
fact.c
startup.s
misc.s
linker.ld This is the linker script.
```

Assemble and compile the sources into object files.

```
arm-none-eabi-as -mcpu=arm7tdmi -g startup.s -o startup.o
arm-none-eabi-as -mcpu=arm7tdmi -g misc.s -o misc.o
arm-none-eabi-gcc -c -mcpu=arm7tdmi -g main.c -o main.o
arm-none-eabi-gcc -c -mcpu=arm7tdmi -g fact.c -o fact.o
```

Link them up using a linker script...

```
arm-none-eabi-ld -T linker.ld startup.o misc.o main.o fact.o -o prog.elf
```

Finally generate a flat binary...

```
arm-none-eabi-objcopy -O binary prog.elf prog.bin
```

The .bin file generated is the flat binary.
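For repeated builds, the steps above can be collected into a small driver. The following Python sketch is one possible way to do that, under the assumption that the GNU ARM tools are on the `PATH`; the function names `build_commands` and `run_build` are hypothetical and are not part of the ZAP project. It only assembles the same command lines used in this section, so it is easy to check what would be run before running it.

```python
import subprocess

CPU_FLAG = "-mcpu=arm7tdmi"


def build_commands(asm_files, c_files, linker_script,
                   elf="prog.elf", binary="prog.bin"):
    """Return the toolchain commands, in order, as argument lists."""
    cmds, objects = [], []
    for src in asm_files:
        obj = src.rsplit(".", 1)[0] + ".o"
        cmds.append(["arm-none-eabi-as", CPU_FLAG, "-g", src, "-o", obj])
        objects.append(obj)
    for src in c_files:
        obj = src.rsplit(".", 1)[0] + ".o"
        cmds.append(["arm-none-eabi-gcc", "-c", CPU_FLAG, "-g", src, "-o", obj])
        objects.append(obj)
    # Link all objects, then strip the ELF down to a flat binary.
    cmds.append(["arm-none-eabi-ld", "-T", linker_script, *objects, "-o", elf])
    cmds.append(["arm-none-eabi-objcopy", "-O", "binary", elf, binary])
    return cmds


def run_build(cmds):
    """Run each command in turn, stopping at the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)


commands = build_commands(["startup.s", "misc.s"], ["main.c", "fact.c"], "linker.ld")
for cmd in commands:
    print(" ".join(cmd))             # inspect the commands before running them
```

Calling `run_build(commands)` would then execute the six commands in order.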

### 3.0.2 Generating a Verilog memory map

Convert the flat binary into a Verilog memory map using the bin2mem.pl script:

```
perl $ZAP_HOME/scripts/bin2mem.pl prog.bin prog.v
```

The prog.v file looks like this...

```
mem[0] = 8'b00;
mem[1] = 8'b01;
```
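To make the conversion concrete, here is a minimal Python sketch of the same idea: one assignment per byte of the flat binary. It is not the `bin2mem.pl` script, and its exact output format may differ from what that script produces; the function names and the hexadecimal literal style are assumptions made for the example.

```python
def bin_to_mem_lines(data, array_name="mem"):
    """Return one Verilog assignment per byte of `data`."""
    return [f"{array_name}[{i}] = 8'h{byte:02x};" for i, byte in enumerate(data)]


def convert(bin_path, v_path):
    """Read a flat binary and write the memory map next to it."""
    with open(bin_path, "rb") as f:
        data = f.read()
    with open(v_path, "w") as f:
        f.write("\n".join(bin_to_mem_lines(data)) + "\n")


lines = bin_to_mem_lines(bytes([0x00, 0x01, 0xEA]))
print("\n".join(lines))
```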

### 3.0.3 Invoking the simulator

Ensure config.vh is set up correctly.

Your command must look like this (it is a single command; the trailing backslash continues it onto the next line)...

```
iverilog $ZAP_HOME/rtl/*.v $ZAP_HOME/rtl/*/*.v $ZAP_HOME/testbench/*.v \
$ZAP_HOME/models/ram/ram.v -I$ZAP_HOME/includes -DSEED=22
```

The rtl/\*.v and rtl/\*/\*.v patterns collect all of the synthesizable Verilog-2001 files, and testbench/\*.v collects all of the testbench files (in this case, the ram.v file is a part of the testbench).

Provide some seed value (22 is used in the example). Before running the simulation, edit the config.vh file so that it correctly points to the memory map, the VCD output path and so on for the simulator to pick up.

## 3.1 Run Sample Code Quickly

Sample .s and .c files are present in \$ZAP\_HOME/sw/s and \$ZAP\_HOME/sw/c respectively. To translate them to a binary and then to a Verilog memory map, run the Perl script (see the NOTE below):

```
perl $ZAP_HOME/debug/run_sim.pl
```

**NOTE:** Ensure you set all the variables in the Perl script as per Table 2.

| Variable        | Purpose                                                                                          |
|-----------------|--------------------------------------------------------------------------------------------------|
| ZAP_HOME        | Set this to the project working directory.                                                       |
| LOG_FILE_PATH   | Set this to the place where you want the log file to be created.                                 |
| ASM_PATH        | Set this to the location of your startup assembly file.                                          |
| C_PATH          | Set this to the location of your C file.                                                         |
| LINKER_PATH     | Set this to the location of the linker script.                                                   |
| TARGET_BIN_PATH | Set this to the target bin file location.                                                        |
| VCD_PATH        | Set this to the location where the VCD is to be created. This must match what is in config.vh.   |
| MEMORY_IMAGE    | Set this to the location where the memory image must be created. This must match what is in config.vh. |

Table 2: Variables in Perl script

# 4 IO Ports of the Core

Table 3 lists the ports of the processor core.

Table 3: IO Ports

| Port                       | Direction | Description                                                                                                                                                          | Comments                                                                                                                                                                                  |
|----------------------------|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| i_clk                      | I         | Core clock                                                                                                                                                           | -                                                                                                                                                                                         |
| i_clk_2x                   | I         | Register file clock                                                                                                                                                  | Must run synchronously with the core clock at twice the frequency.                                                                                                                        |
| i_reset                    | I         | Active high reset                                                                                                                                                    | This runs through a reset filter (two D flip-flops in series) that then drives a global reset throughout the design. TODO: Need to remove reset from flip-flops that do not need it.      |
| i_instruction[31:0]        | I         | 32-bit instruction from the instruction cache.                                                                                                                       | -                                                                                                                                                                                         |
| i_valid                    | I         | Indicates that the instruction on the 32-bit bus is valid and ready to be clocked into the instruction register (Fetch stage). If this is 0, the PC does not change. | -                                                                                                                                                                                         |
| i_instr_abort              | I         | Indicates that the instruction cache experienced an instruction abort.                                                                                               | When abort is asserted, the i_valid port is treated as a DON'T CARE.                                                                                                                      |
| o_read_en                  | O         | Current memory access is a read access.                                                                                                                              | -                                                                                                                                                                                         |
| o_write_en                 | O         | Current memory access is a write access.                                                                                                                             | -                                                                                                                                                                                         |
| o_address[31:0]            | O         | Address of the current memory access.                                                                                                                                | The address is always 32-bit aligned. Specific bytes are addressed using byte enables.                                                                                                    |
| o_mem_translate            | O         | Current memory access must adopt a user view of memory.                                                                                                              | -                                                                                                                                                                                         |
| i_data_stall               | I         | Indicates that the data memory is stalled. The entire pipeline freezes when this happens (PC does not change).                                                       | -                                                                                                                                                                                         |
| i_data_abort               | I         | Indicates that the data memory access aborted. When this is asserted, the i_data_stall port must be deasserted.                                                      | -                                                                                                                                                                                         |
| i_rd_data[31:0]            | I         | Data received from data memory is clocked in from this port if there is no data memory stall.                                                                        | -                                                                                                                                                                                         |
| o_wr_data[31:0]            | O         | Data to write out to memory.                                                                                                                                         | -                                                                                                                                                                                         |
| o_ben[3:0]                 | O         | Byte enables. [0] deals with [7:0] of the data and [3] with [31:24].                                                                                                 | -                                                                                                                                                                                         |
| i_fiq                      | I         | FIQ level sensitive signal (active high).                                                                                                                            | -                                                                                                                                                                                         |
| i_irq                      | I         | IRQ level sensitive signal (active high).                                                                                                                            | -                                                                                                                                                                                         |
| o_pc[31:0]                 | O         | Program Counter                                                                                                                                                      | Lower bit is ALWAYS 0.                                                                                                                                                                    |
| o_cpsr[31:0]               | O         | Current PSR                                                                                                                                                          | -                                                                                                                                                                                         |
| i_copro_done               | I         | Coprocessor done indication.                                                                                                                                         | -                                                                                                                                                                                         |
| o_copro_dav                | O         | Data on the coprocessor output ports of the core are valid.                                                                                                          | -                                                                                                                                                                                         |
| o_copro_word[31:0]         | O         | The entire 32-bit coprocessor instruction is presented on the port.                                                                                                  | -                                                                                                                                                                                         |
| i_copro_reg_en             | I         | Coprocessor wishes to take charge of the register file.                                                                                                              | -                                                                                                                                                                                         |
| i_copro_reg_wr_index[5:0]  | I         | Coprocessor write register index.                                                                                                                                    | -                                                                                                                                                                                         |
| i_copro_reg_rd_index[5:0]  | I         | Coprocessor read register index.                                                                                                                                     | -                                                                                                                                                                                         |
| i_copro_reg_wr_data[31:0]  | I         | Coprocessor data to write to the register mentioned in i_copro_reg_wr_index.                                                                                         | -                                                                                                                                                                                         |
| o_copro_reg_rd_data[31:0]  | O         | Data read from the register file to be read into the coprocessor. This corresponds to the register specified on i_copro_reg_rd_index, although it is delayed by 1 clock cycle. | -                                                                                                                                                                               |
NOTE: If COPROC\_IF\_EN is not defined, the coprocessor ports are not available!

NOTE: The coprocessor must generate translated register numbers. For example, if it wishes to write to R13\_FIQ, it must generate a write to R25.
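Since `o_address` is always 32-bit aligned and individual bytes are selected through `o_ben`, a memory model attached to the core has to combine the two. The Python sketch below shows one way the byte enables could relate to a byte address and access size. It assumes little-endian byte lanes (byte 0 of a word on data bits [7:0], which matches the `o_ben` description in Table 3) and naturally aligned accesses; the function name `byte_enables` is hypothetical, and the core's exact behaviour should be confirmed against the RTL.

```python
def byte_enables(address, size):
    """Return (aligned_address, ben) for an access of `size` bytes (1, 2 or 4).

    Assumes little-endian byte lanes: bit 0 of `ben` covers data[7:0]
    and bit 3 covers data[31:24].
    """
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4")
    offset = address & 0x3
    if offset % size:
        raise ValueError("unaligned access")
    ben = ((1 << size) - 1) << offset    # one enable bit per byte lane
    return address & ~0x3, ben


for addr, size in [(0x1003, 1), (0x1002, 2), (0x1000, 4)]:
    aligned, ben = byte_enables(addr, size)
    print(f"addr={addr:#06x} size={size} -> o_address={aligned:#06x} o_ben={ben:04b}")
```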

# 5 Basic Description

Figure 1 gives a basic overview of the pipeline structure, which is an 8 stage pipeline. Table 4 gives an overall description of the pipeline stages.

Figure 1: High level view of the ZAP pipeline




| Stage                            | Description                                                                                                                                                                                                                                                              |
|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| FETCH                            | Clocks in data from the instruction cache. Also computes PC+8 in case the instruction refers to R15 (TODO: PC+8 adder can be eliminated by getting PC from the third stage).                                                                                             |
| BRANCH PREDICTOR RAM             | This consists of block RAM dedicated to hold branch state for up to 2KB of instructions.                                                                                                                                                                                 |
| PREDECODE                        | Decompresses 16-bit instructions, handles LDM/STM instructions and SWP/SWPB. Also handles the coprocessor interface. Basically consists of a series of state machines.                                                                                                   |
| DECODE                           | Decodes 32-bit instructions into a format that can be understood by the downstream logic.                                                                                                                                                                                |
| ISSUE                            | Performs preliminary register read by sniffing out register values from the bypass network.                                                                                                                                                                              |
| VALUE RESTORE, SHIFTER, MAC-UNIT | The value restore unit restores correct values by getting them from the ALU. This is useful for executing back-to-back instructions that typically get read incorrectly in issue. The shifter handles all shift operations and the MAC unit performs multiply-accumulate. |
| EXECUTE                          | The execute unit consists of a 32-bit ALU. This unit also generates memory addresses and other memory control signals.                                                                                                                                                   |
| MEMORY ACCESS                    | This stage clocks in data from the data cache.                                                                                                                                                                                                                           |
| REGISTER FILE                    | This stage is the register file and is basically 4 block RAMs to allow for 4 independent read ports (TODO: Can be reduced to 3 since one port is redundant).                                                                                                             |

Table 4: Pipeline Stage Description
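The branch predictor RAM holds one 2-bit saturating counter per branch entry (a bimodal scheme). The following Python sketch models that scheme behaviourally. The class name `BimodalPredictor`, the table size of 512 entries (2KB divided by 4-byte instructions), the indexing by low PC bits, and the initial counter value are all assumptions made for illustration and may not match the RTL.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, entries=512):
        self.entries = entries
        self.counters = [1] * entries    # start weakly not-taken (assumed)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # one entry per 32-bit instruction slot

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # states 2 and 3 predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)   # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)   # saturate at 0


bp = BimodalPredictor()
pc = 0x40
predictions = []
for taken in [True, True, True, False, True]:   # a loop branch with one exit
    predictions.append(bp.predict(pc))
    bp.update(pc, taken)
print(predictions)   # prints [False, True, True, True, True]
```

Under these assumptions, two branches whose addresses differ by a multiple of 2KB share a counter, which is why only branches within one 2KB block are free of conflicts. A single not-taken outcome does not flip a strongly taken counter, so the loop branch above is still predicted taken on re-entry.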