# A High-radix Online Arithmetic Verification System Final Year Project 1800478: Interim Report

# Zifan Wang, 01077639 Imperial College London

# CONTENTS

| Section | Subsection | Title                          |
|---------|------------|--------------------------------|
| I       |            | Introduction                   |
| II      |            | Project Specification          |
|         | II-A       | Project Organisation           |
|         | II-B       | Deliverables                   |
|         | II-C       | Hardware Choice                |
|         | II-D       | Software Choice                |
| III     |            | Background Research            |
|         | III-A      | Online Arithmetic              |
|         | III-B      | High-radix Arithmetic          |
|         | III-C      | High-radix Online Arithmetic   |
| IV      |            | Engineering Background         |
|         | IV-A       | Testbench Architecture         |
|         | IV-B       | Target                         |
|         | IV-C       | Data Transfer Rate             |
|         | IV-D       | Runtime Test Data Generation   |
|         | IV-E       | Clock Domains                  |
|         | IV-F       | Result Analysis                |
| V       |            | Implementation Plan            |
|         | V-A        | Milestones                     |
|         | V-B        | Timeline                       |
|         | V-C        | Work to Date                   |
| VI      |            | Evaluation Plan                |
|         | VI-A       | Metrics                        |
| VII     |            | Ethical, Legal and Safety Plan |
|         | VII-A      | Ethical Considerations         |
|         | VII-B      | Legal Considerations           |
|         | VII-C      | Safety Considerations          |
|         |            | References                     |

#### I. Introduction

With the right number representation system, it is possible to perform arithmetic operations most-significant digit (MSD) first. Consequently, these online arithmetic operators are attractive for hardware implementation in both serial and parallel forms. When computing digits serially, they can be chained such that subsequent operations begin before the preceding ones complete. Parallel implementations tend to be most sensitive to failure in their least-significant digits (LSDs), making them more friendly to overclocking than their LSD-first counterparts, for which the opposite is true.

In the past, online operators have typically been implemented in binary. Although radix-2 modules are the simplest to design and have the shortest cycle time per digit, they have the highest online delay and require the largest number of cycles to complete calculations [14]. As such, the choice of binary is not absolute. In this project, I will explore high-radix online operators, investigating their suitability for FPGA implementation and examining the resultant tradeoffs between performance, area and power.

#### II. PROJECT SPECIFICATION

# A. Project Organisation

This project is a part of a larger project investigating the effect of using high-radix number representation with online arithmetic operators. The overarching aim involves implementing such a system on an FPGA and quantifying its performance improvements. This is achieved through two individual projects, vertically split from the enveloping project. One shall design and verify the arithmetic operator modules, while the other shall design a system from the top-level to test and evaluate these operators. This project deals with the system-level issues.

As this project progresses in parallel with the designing of the operator modules, it is necessary to decouple the two projects so that, being individual projects, they can be evaluated individually. The success of one project should not be restricted by the status of the other. To this end, the goal of the system-level design is more focussed on its functionalities and robustness. This relationship and its effect on the evaluation will be examined further in the evaluation chapter of this report.

To ensure the two products will work together once they are both complete, a common interface needs to be agreed upon. The interface will be implemented using Qsys. The reasoning follows directly from the choice of hardware; as such, it will be explained in the corresponding section.

# B. Deliverables

At the end of the project, the system should be able to perform the following:

- 1) Connect to the arithmetic modules as its input;
- 2) Generate and run tests on these modules;
- 3) Vary the frequency and voltage of the FPGA;
- 4) Evaluate its performance.

# C. Hardware Choice

The system itself will be built on a Cyclone V SX SoC Development Board from Intel [26].

The 5CSXFC6D6F31C6N SoC has an Arm Cortex-A9 MPCore accompanied by Intel's 28nm FPGA fabric [19]. The FPGA is necessary for implementing the hardware design and obtaining empirical results for the project.



Fig. 1. Structure of the System-on-Chip

While an FPGA without an embedded CPU will be enough for this project to work, having a Hard Processor System (HPS) on the same chip is useful as the test software can run on it. The HPS is a separate piece of hardware that distinguishes itself from a soft processor, such as the Nios II, which is a processor programmed onto the FPGA itself. With this additional capacity, a better user interface can be constructed with more detailed, on-the-fly control of the FPGA. This means setting up the testbench will only require programming the design into the FPGA, followed by running the test script on the HPS. The product will thus be self-contained and more accessible, as no additional setup is required from the user.

It should be noted that Xilinx offers similar boards as well. Its Zynq SoC family has a very comparable structure as they too integrate the software programmability of an Arm processor with the hardware possibility of an FPGA. For example, similar to the Cyclone V SX, Zynq-7000S features an Arm Cortex-A9 coupled with a Xilinx 28nm FPGA [30]. As such, a board like the ZedBoard [31] could be just as viable for this project.

As there are very few significant functional differences between the two brands, I shall initially explore with the Intel board, simply for its availability and my familiarity with its development tools. Due to the architectural differences between the logic elements of Xilinx and Altera FPGAs [13], the performance on the two boards is not necessarily identical. Once the project has progressed to a point where the system design is mature and tested, the Xilinx alternative can be explored as an extension.

# D. Software Choice

The software choice follows closely with the hardware choice in this project. To develop for Intel FPGAs, Quartus has to be used. The version picked is arbitrary as there are not many functional differences between the versions that will be critical to the project. As Quartus Prime 16.0 is the version installed in the computers in the department, I will use the same version simply for convenience. This naturally means the hardware system will be built with the system integration tool that comes with Quartus – Qsys.

The Qsys software is designed to be used for integrating different hardware modules into a system. As such, it will be used as the interface for the two parallel projects.

While an HLS language could be used for the design, in this design it suffers from a few problems and does not offer enough benefits to justify its use. Usually HLS is preferred for developing algorithms, because it is often easier to write them down in C first before converting them to RTL. However, this project has a lot of control path work and direct manipulation on the data bits. The interfaces also require detailed control of the actual hardware implementation only offered by HDLs. It is therefore not worth it to go through the conversion and as such, this design will be written in Verilog.

Other than the hardware design tools, there is some freedom of choice on the HPS side of the project. The test will be built with Python, running on an Ubuntu system installed on the HPS. This choice is made because there are previous unrelated projects on the same development board, which means a lot of time can be saved on tedious setup work such as getting an operating system to boot.

# III. BACKGROUND RESEARCH

#### A. Online Arithmetic

Traditional arithmetic operators have two common characteristics. One, their order of operation may be different depending on the operation itself. A traditional adder, parallel or serial, generates its answers from the LSD to the MSD. A traditional divider design on the other hand, generates its answer from the MSD to the LSD [1] [10].

Due to this inconsistency, arithmetic operators may be forced to compute word-by-word, waiting for all digits to finish in the previous operator before the next can start [17]. Therefore, if a divider follows an adder of the same width, the divider has to wait until the adder has completed its computation before it can begin its own.

The other commonality of traditional designs is that their precisions are specified at design time. Once built, a 32-bit adder always adds 32 bits together, as the hardware cannot change at runtime. A possible way of making it less inefficient would be using SIMD instructions [3], combining smaller operations into a larger one that fits the hardware. This, however, has the trade-off of more control circuits in the hardware, or a more complex compiler.

Online arithmetic does not suffer from the first issue as it performs all arithmetic operations MSD-first [5] [6]. Furthermore, pipelining can be used with online serial arithmetic operators. Thus the output digit of an earlier operation can be fed into the next operator before the earlier operation has fully completed.



Fig. 2. Computing  $y = \sqrt{(a+b)cd/(e-f)}$  with serial online operators [5]

As illustrated in figure 2, while each individual operation may take longer than its conventional counterpart, online arithmetic could provide a speed-up if the operators are chained serially. In addition to the temporal trade-off, individual online arithmetic operators also sacrifice memory space. To perform all computations from the MSD to the LSD, the use of a redundant number system is compulsory. However, this redundancy also has its advantage in making the operators scalable: the time required per digit can be made independent of the length of the operands [15].
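The role of redundancy can be illustrated with a small sketch. It assumes a radix-2 signed-digit representation with digit set {-1, 0, 1}, which is a common choice for online arithmetic but not necessarily the one adopted in this project; the function name `sd_value` is hypothetical.

```python
from fractions import Fraction


def sd_value(digits, radix=2):
    """Value of a fractional signed-digit number, given MSD first.

    digits[i] carries the weight radix ** -(i + 1).
    """
    return sum(Fraction(d, radix ** (i + 1)) for i, d in enumerate(digits))


# Redundancy: two different digit strings encode the same value, 1/4.
a = [0, 1, 0]    # 0/2 + 1/4 + 0/8
b = [1, -1, 0]   # 1/2 - 1/4 + 0/8
assert sd_value(a) == sd_value(b) == Fraction(1, 4)

# MSD-first refinement: the value after each additional digit of b.
prefix_values = [sd_value(b[:k]) for k in range(1, len(b) + 1)]
print(prefix_values)
```

Because an early digit can later be corrected by a negative digit, an operator may commit to its leading output digits before it has seen all of its input digits.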

A recently proposed architecture allows the precision of online arithmetic to be controlled at run-time [17]. Traditionally, this run-time control was restricted due to the parallel adders present in the multipliers and dividers. This architecture reuses a fixed-precision adder and stores residues in on-chip RAM. As such, a single piece of hardware can be used to calculate to any precision, limited only by the size of the on-chip RAM.

The way online arithmetic alleviates the second problem of fixed precision falls out directly from its MSD-first nature. Suppose the output of a conventional ripple adder is sampled before it has completed its operation. In this case, the lower digits would have been completed, but the carry would not have reached the higher ones. This means the error on the result would be significant, as the top bits were still undetermined [11].

However, if the output of a parallel online adder is sampled before its completion, the lower bits would be the undetermined ones. This means the error of the operation would be small. With overclocking, online arithmetic would fail gracefully, losing its precision gradually from the lowest bits first. Thus, it allows for a run-time trade-off between precision and frequency [12].
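The contrast between the two failure modes can be sketched numerically. This is a deliberately simplified model, assuming an unsigned binary result in which a given number of bits have not settled when the output is sampled; it ignores the signed-digit encoding and real timing behaviour, and `worst_case_error` is a hypothetical helper.

```python
def worst_case_error(width, undetermined, msd_first):
    """Largest absolute error when `undetermined` bits of a `width`-bit
    unsigned result have not settled at sampling time."""
    if msd_first:
        # MSD-first: the unsettled bits are the least significant ones.
        return (1 << undetermined) - 1
    # LSD-first: the unsettled bits are the most significant ones.
    return ((1 << undetermined) - 1) << (width - undetermined)


for k in (1, 4, 8):
    print(k, worst_case_error(32, k, True), worst_case_error(32, k, False))
```

Under this model the MSD-first error grows from the bottom of the word, while the LSD-first error is dominated by the top bits even when only one bit is unsettled.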

# B. High-radix Arithmetic

Conventional designs of arithmetic operators use binary representations. This was chosen four decades ago to maximise numerical accuracy per bit of data. However, using a high-radix representation system could still yield better numerical accuracy while reducing area cost for FPGAs. For example, a hexadecimal floating-point adder was shown to have a 30% smaller area-time product than its binary counterpart, while still delivering equal worst-case and better average-case numerical accuracy [2].

However, the savings are not without trade-offs. This trade-off can become unfavourable if the specification requires much I/O and little computation [16]. This is because the overhead of radix conversion would be significant. It is also unwise to use high-radix representations when the numbers are unusually small, thus making the savings offered by the high-radix negligible [2].
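One side of this trade-off can be made concrete by counting digits: a higher radix packs more bits into each digit, so a digit-serial operator needs fewer steps for the same precision. The sketch below assumes the radix is a power of two and says nothing about the cost of each step, which generally grows with the radix; `digits_needed` is a hypothetical helper.

```python
import math


def digits_needed(bits, radix):
    """Digits required to cover `bits` bits of precision in `radix`
    (assumed to be a power of two)."""
    return math.ceil(bits / math.log2(radix))


for radix in (2, 4, 16):
    print(radix, digits_needed(32, radix))
```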

# C. High-radix Online Arithmetic

Using high-radix number representations for online arithmetic is a relatively novel concept. While there has been some research with similar premises [8] [9], this project takes a more direct approach by implementing custom operators made for high-radix online arithmetic on an FPGA. This would allow empirical results to be obtained, and hopefully reveal practical insights into the method while doing so.

Furthermore, benchmarking this exotic arithmetic system for popular FPGA accelerations such as neural networks would be interesting, as there is not much precedent for it.

# IV. ENGINEERING BACKGROUND

## A. Testbench Architecture

While there have been many similar performance analyses done on hybrid SoCs before, each of them uses its own, usually ad hoc, testbench design [11]. This project will use a structure that is inspired by that of an agent in UVM [18].



Fig. 3. Proposed testbench block diagram

Configuration is first done from software running on the HPS, which sends the test specifications to the randomiser. The randomiser provides a stream of random data, which is converted to meaningful test inputs by the driver. The test output is watched by the monitor, which reports any interesting event to the scoreboard, which keeps track of them. The scoreboard feeds the information back to the software on demand. The interface is used to decouple the control logic from the DUT, allowing the frequency of the DUT to be finely controlled.

# B. Target

The design of the verification system is the major engineering challenge of this project. In order to stress the DUT, the verification system must perform at a much higher frequency than the expected frequency of the DUT. Assuming the DUT is to run at 300 MHz, the testbench must be able to run at double that frequency or higher to fully explore the effect of overclocking. The target frequency is therefore set to 800 MHz, above the 600 MHz minimum. Assuming a data width of 32 bits, the target data transfer rate is then estimated to be 25.6 Gbps.
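The target rate follows directly from these figures, under the assumption that one 32-bit word is transferred per testbench clock cycle. The symbols $R_{\text{target}}$, $f_{\text{tb}}$ and $w$ are introduced here only for illustration:

```latex
\[
R_{\text{target}} = f_{\text{tb}} \times w
  = 800\,\text{MHz} \times 32\,\text{bit}
  = 25.6\,\text{Gbps}
\]
```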

# C. Data Transfer Rate

As the test is to be controlled on the HPS, the HPS-FPGA bridge will be the immediate bottleneck if the test data is to flow from HPS to FPGA. While the HPS can easily generate test data with a piece of software, there is a large amount of overhead as data crosses from one architecture to another. This overhead exists in the form of both a decreased bandwidth and an increased delay. Thus, it would not be sensible for the HPS to send out data during runtime.

1) Off-chip DDR SDRAM: Another thought may be to first populate the off-chip DDR SDRAM on FPGA side, then feed that data to the DUT during test. This is already much faster than passing the data directly from HPS. The 1GB, 32-bits wide DDR3 on FPGA side is rated at 400MHz. With double rate transfer, this gives a maximum transfer rate of 25.6Gbps.
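The comparison between the DDR3 peak rate and the testbench target can be checked with a few lines of arithmetic. This is a sketch of the peak figures only; it ignores refresh cycles and protocol overhead, and `transfer_rate_gbps` is a hypothetical helper.

```python
def transfer_rate_gbps(clock_mhz, width_bits, transfers_per_cycle=1):
    """Peak transfer rate in Gbps, ignoring refresh and protocol overhead."""
    return clock_mhz * 1e6 * width_bits * transfers_per_cycle / 1e9


target = transfer_rate_gbps(800, 32)        # testbench target from the Target section
ddr3_peak = transfer_rate_gbps(400, 32, 2)  # DDR3: two transfers per clock cycle
print(target, ddr3_peak)
```

Both figures come to 25.6 Gbps, so the DDR3 peak rate only just matches the target and leaves no margin for overhead.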

While using the off-chip RAM may theoretically achieve the targets, it still has its disadvantages. First, the process of filling up the memory and then using its contents for the tests takes time. This means the test would be broken up into bursts, with time in between for checking results and filling in new data. The complexity of the SDRAM interface also requires an SDRAM controller to manage SDRAM refresh cycles, address multiplexing and interface timing. These all add up to a significant access latency. While it could be overcome with burst accesses and pipelined accesses, doing so would further complicate the SDRAM controller. A controller is provided by Altera [21], but it consumes a non-negligible amount of the limited FPGA resources, while adding unnecessary complexity to the design. Customising or building a new SDRAM controller to fit this project is possible, but needlessly time-consuming.

2) On-chip Memory: The on-chip memory is much faster and simpler to handle. In comparison, this memory is implemented on the FPGA itself, and thus needs no external connections for accesses. It has the highest possible throughput, with the lowest possible latency in an FPGA-based system. The memory transactions can also be pipelined, giving one transaction per clock cycle. With an on-chip FIFO accessed in dual-port mode, the write operations at one end and the read operations at the other end can happen simultaneously.

This effective doubling of the bandwidth is useful as tests are prepared and fed into the DUT, or when test results are collected and fed to the monitor.

On-chip memory is not without its drawbacks. It is volatile and very limited in capacity. While the off-chip memory can reach 1 GB of storage, the on-chip memory can only reach a few MB [20]. Volatility is not a concern in this project, but the small capacity means not much test data can be held before more needs to be fed in.

## D. Runtime Test Data Generation

To exploit the benefits of on-chip memory, a way of generating test data at runtime needs to be designed. As arithmetic operators have a vast set of valid inputs, it is necessary to have cost effective test generation.

A good choice here is to use random testing. With relatively little effort, random testing can provide significant coverage and discover relatively subtle errors [4]. The main drawback of random testing is the possible lack of coverage of extreme cases, and the usual solution is to provide handwritten tests to complement random testing. However, as the main goal of this testbench is gauging the performance of the module, and not necessarily verifying its correctness, this could be ignored during stress testing. If logic correctness testing is later required, these special tests could be written and run separately with a relaxed timing restriction.

LFSRs are a reliable way of generating pseudo random numbers fast with minimum cost [7]. They will thus form the starting point of data generation. While it is possible for data generated to be invalid as inputs to the DUT, this should not be the case for most benchmarks in this project. Even if so, they can be dealt with at the monitor.

With this, the software would only need to configure the randomiser, and test data no longer needs to pass through the HPS-FPGA bridge.
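As a rough software model of the idea, the sketch below implements a 16-bit Fibonacci LFSR. The width, the seed and the feedback polynomial $x^{16} + x^{14} + x^{13} + x^{11} + 1$ are assumptions chosen for illustration (it is a well-known maximal-length polynomial); the randomiser in this project would be written in Verilog and may use a different configuration.

```python
def lfsr16(seed=0xACE1):
    """Yield successive 16-bit states of a Fibonacci LFSR.

    Taps at bits 16, 14, 13 and 11 (x^16 + x^14 + x^13 + x^11 + 1).
    The seed must be non-zero, otherwise the register stays at zero.
    """
    state = seed
    while True:
        # XOR the tapped bits to form the bit shifted in at the top.
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state


gen = lfsr16()
first = [next(gen) for _ in range(4)]
print([hex(s) for s in first])
```

Each new state costs one shift and a handful of XOR operations, which is why the same structure maps onto very little FPGA logic.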

# E. Clock Domains

Another concern in the system design is the different clock domains that must exist on the FPGA. At a minimum, there need to be two clock domains: one surrounding the DUT and another supporting the rest of the control logic around the DUT. These clock frequencies can be generated and distributed with PLLs, which are provided as IP Cores in the Quartus software [22]. Data crossing clock domains will be fed through FIFOs to prevent loss.

The proposed structure will have the main bulk of the control logic running on a separate clock domain from the DUT. Only an interface with FIFOs and minimal logic will run synchronously with the DUT. Therefore, the test controls can run at a slower frequency without bottlenecking the system, allowing the DUT to be stressed further.

# F. Result Analysis

If the monitor detects an interesting event such as an error, it will send a message to the scoreboard. The scoreboard has counters tracking these events, and reports them back to the software periodically.

The software can run statistics to provide further insights to

#### V. IMPLEMENTATION PLAN

# A. Milestones

The project is divided into three main phases. The initial one is research and learning. This lays the foundation for the project, and is represented by tasks 2 and 3 in the timeline. The second phase is building the core product of the project. This includes tasks 4, 5 and 6 in the timeline. The final phase is enhancing the product, represented by tasks 7, 8 and 9.

## B. Timeline

In order to track the progress and success of the project, the difficulties of the deliverables need to be analysed first. Figure 4 provides a visualisation of the project timeline. Some slack is included into the plan to reduce risk and give potential for extension.

1) **Term Time and Examinations**: Time available for the project varies greatly throughout the year. The greatest factors are the term time and the examinations. The time needed for revision has been marked in the chart. Minimal progress on the project is expected during these periods.

During term time, there are coursework deadlines, which also reduce the time that can be allocated to this project. One coursework module was selected for Autumn and three were picked for Spring. As such, progress is expected to be somewhat slower during Autumn but significantly slower during Spring. To make up for the time lost, part of the Christmas break was used and a few weeks from Easter will also be committed to the project.

The list of tasks was then laid out onto the timeline. Each task is placed according to its expected difficulty and the expected availability to work on it.

- 2) Background Research: To fill in the background knowledge required to work on this project, the first months were spent on reading textbooks and papers. The research provides context and motivation for the project. It also provided an overview of the field of research and some understanding of the current state of the art. Most importantly, it offered the knowledge needed for this project. The result of this research is summarised in chapter III.
- 3) Learning the Tools: Due to the lack of experience in programming hybrid SoCs and the lack of knowledge of current state-of-the-art digital arithmetic designs, a significant portion of the effort was spent on researching and learning the skills necessary to carry out the project. This involved building a small testing system on the board. Details regarding this testing system can be found in the Work to Date section.
- 4) Testbench Structure: Once comfortable with the tools, the main design of the testbench can start. This task is the foundation of this project, as it provides a basis for all following features. A skeletal testbench should be complete and functional at the end of this task. This means an operator module matching the correct specifications can be loaded into the testbench and run some non-trivial tests.



Fig. 4. Project Timeline

5) Variable Frequency: As the project seeks to quantify the performance across a range of frequencies, the most important feature of the system would be the ability to vary this parameter. This would be done by controlling the PLLs with the HPS [22]. Once implemented, tests will be run to explore and confirm the maximum frequency at which the testbench remains reliable. If it does not meet the planned target, the testbench may need to be redesigned, and this project would be at some risk.

While most of the later sections of the project can be selectively added or removed from the scope relatively easily, this initial setup of the testbench structure will always remain critical to any further improvements. It is thus vital that the bare minimum system gets done early. To ensure this happens, this task and the testbench structure will be given the highest priority until they are complete, and any blocking issue will be discussed with the supervisor if it cannot be resolved after reasonable effort.

6) Benchmarks: With a promising base, more intricate tests and benchmarks can be designed. These would aim to reflect the system's performance running meaningful compute tasks. The systems enveloping the arithmetic modules could be stressed with popular algorithms to evaluate real-life obtainable speed up. In addition to better tests, this task also aims to obtain better data from these tests. The minimum result required here would be numerical information on power consumption, FPGA resources required, and the data throughput.

The last three tasks form the core of this project. In other words, all three tasks must be completed for a minimum functional product. The following tasks would be mostly considered as useful extensions. While not as vital as the core tasks, the following tasks greatly improve the quality of life and usability, and thus are equally critical to the success of the final product.

- 7) Configurable Modules: So far, only modules of a specific I/O width and numerical representation could be tested. It might be interesting to explore arithmetic modules with other configurations. To allow the testbench to be used for further experiments or future projects, it is helpful to have a configurable testbench. Qsys components can be configured with the Hardware Component Definition File [23]. The plan is to build the testbench as a Qsys component, then use Qsys Component Editor as an interface for configuration.
- 8) Handling Failures: Another improvement to the testbench is related to how it handles failures in the module. It would be much more useful for the user if a more insightful failure message is provided in addition to a simple failure rate. This could include examples of failed output against expected output, or statistical data describing the pattern of failures. This additional logic during run-time may degrade performance of the testbench, so it would be useful for the verbosity of this information to be configurable by the user.
- 9) Interactive UI: If time allows for even greater user experience enhancements, a real-time interactive graphical user interface could be constructed for the final demonstration. This would visualise the reduction of the module's precision as the user increases the clock rate. However, this would take significant time and effort, and this task will be re-evaluated when the project progresses to that stage. A time-saving yet functional alternative would be a command-line interface with a well-documented user guide.

10) Reports and Presentation: The reports and the presentation are the most visible of all the deliverables in this process. As such, while not directly contributing to the progress of the project, they are still vital to its success. The reports will be written alongside the engineering process. At the end, around a week will be spent solely on completing and polishing the final report. This should allow ample time for a well-organised submission.

The week after the final report will be used for the presentation. This would involve preparing a slide deck, a demo, and a script.

## C. Work to Date

Before any engineering work was done towards the final product, a small module was built to learn the environment. This module should be simple yet cover enough ground to provide as much learning as possible without taking up too much development time from the product itself. As the greatest unfamiliarity is with the interaction across the HPS-FPGA bridge, a simple hardware-accelerated adder was made for this training.

1) FPGA Side: Programming the FPGA to communicate with the HPS is no trivial task. Luckily, there exists a golden system reference design (GSRD) [29] for the board in use for this project. Unfortunately, support for certain versions of Quartus is missing from the GSRD download database, including the one used for this project, 16.0. While the design could be opened with a different version of the software, doing so causes a series of conflicts, usually related to IP Cores that have changed over the iterations. To circumvent this issue cleanly, GSRD version 14.1 was downloaded and compiled on a separate install of Quartus II 14.1. This allowed the reference design to be studied in detail, and the sections required for this project to be rebuilt with Quartus Prime 16.0.

From the perspective of the FPGA, the HPS exposes three bridges for connections [24]. As this is a relatively simple task, the lightweight bridge is used. Module altera\_hps exposes the master of this bridge as h2f\_lw\_axi\_master. Next, the actual hardware adder needs to be built and integrated as a hardware module in Qsys with a matching interface. A simple adder can produce its result after one clock cycle. This greatly simplifies the logic required for the Avalon slave interface. The logic for the control and data signals is then written according to the interface specifications [28]. Following the naming conventions for the signals allows Qsys Component Editor to automatically detect the Avalon slave from this module at analysis. This saves the trouble of editing the \_hw.tcl file. To experiment with module configuration, the adder is designed with variable width.

The adder is then instantiated and connected to the rest of the system with two clicks in Qsys. From there, Qsys can generate the HDL for the entire system, which is then compiled to a bitstream file. With the bitstream ready, the work now shifts to the HPS.

2) HPS Side: The HPS runs Ubuntu, and a bash script has been written to load the bitstream onto the FPGA. Next, a program is written in Python to test the hardware design from the HPS. The interfaces are mapped onto the physical memory, thus they can be accessed by opening /dev/mem. Checking against the specifications [24], the lightweight master is at 0xFF20\_0000. Qsys allocates the memory spaces of modules relatively, so when it reports that the adder has been placed at 0x0010\_0000, it is physically at 0xFF30\_0000. The adder is designed to have its two inputs at 0x0000 and 0x1000 and its output at 0x20000, which have been assigned by Qsys relative to 0xFF30\_0000.

With the memory mapping understood, the script can be designed to closely mirror this relative relationship between the modules using classes. For example, this allows the adder to still define its output at 0x20 in the adder class, and then be initialised with an AXI module that brings it to the correct physical address. This parallel between software and hardware should be helpful as the product gets more complex.
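A minimal sketch of this class structure is shown below. It is not the project's actual script: the class names, the operand offsets (0x00 and 0x10) and the use of a plain `bytearray` in place of a memory-mapped view of /dev/mem are all assumptions made so that the address arithmetic can be run on any machine. Only the bridge base, the Qsys offset and the 0x20 output offset are taken from the text.

```python
import struct


class Bridge:
    """An address window exposed by the HPS, e.g. the lightweight bridge."""

    def __init__(self, base):
        self.base = base


class Adder:
    """Software mirror of the adder; register offsets are relative."""

    OPERAND_A = 0x00  # assumed offset, for illustration only
    OPERAND_B = 0x10  # assumed offset, for illustration only
    RESULT = 0x20     # output offset quoted in the text

    def __init__(self, parent, offset, mem):
        self.base = parent.base + offset  # absolute physical address
        self.offset = offset              # position inside the window
        self.mem = mem                    # writable buffer covering the window

    def _write(self, reg, value):
        struct.pack_into("<I", self.mem, self.offset + reg, value)

    def _read(self, reg):
        return struct.unpack_from("<I", self.mem, self.offset + reg)[0]

    def add(self, a, b):
        """Write both operands, then read back the hardware result."""
        self._write(self.OPERAND_A, a)
        self._write(self.OPERAND_B, b)
        return self._read(self.RESULT)


lw_bridge = Bridge(0xFF20_0000)
window = bytearray(0x20_0000)  # stand-in for a mapping of /dev/mem
adder = Adder(lw_bridge, 0x0010_0000, window)
print(hex(adder.base))
```

On the board, `window` would be replaced by a mapping of the bridge's physical address range; with the stand-in buffer there is no hardware behind the registers, so `add` simply reads back whatever is stored at the result offset.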

For testing, 1000 add operations are executed separately with and without the hardware acceleration of the FPGA. Although it is called hardware acceleration, the FPGA is not expected to outperform the HPS in this test case. The CPU is reasonably efficient at calculating additions, while calculating on the FPGA incurs a large overhead as the data transfers back and forth across the bridge.

## VI. EVALUATION PLAN

# A. Metrics

One natural way of measuring the success of the project is to look at the actual progress and comparing it to the plan given in the implementation plan. It should be noted that no plan is perfect, so some deviation is allowed. However, if there is significant delay from the implementation plan, there must be justifications given.

The next few measures look at the performance of the final product. First, the maximum stress that the testbench can provide without failing can be used as a success measure. A testbench with a higher maximum frequency can reveal a wider picture of the performance of the DUT. This would hopefully allow more insights to be gained regarding the DUT, or it could mean that the testbench can be used for future designs that may be faster than the current one. As the main quantitative metric, this would be a vital indicator of the project's success.

The robustness of the testbench is also vital to the product's performance. The testbench should be free from errors within a reasonable operating range. If the testbench becomes unreliable after minor changes to the system, the data that can be obtained would be very limited: a failure of the DUT could no longer be confirmed, as the error might lie in the testbench instead of the DUT.

The ease of use of the testbench could be another evaluation point. On the hardware side, the verification system can be packaged into a Qsys module. Given that the DUT is also a module with an agreed interface, the two could be easily connected in Qsys for testing. For example, the DUT may be written in VHDL for its deterministic nature while the testbench may be written in Verilog for its simplicity, but as both can be synthesised into a module, they would still be compatible in Qsys. On the software side, a user-friendly interface could be built. A usable command line interface may be good enough, but a simple graphical interface could make the tests much more visual and interesting.

The interface could also provide information on failures in the DUT. A better testbench would provide more insightful details when the DUT fails, which would make debugging or evaluating the design much simpler. Along with the GUI, this project has many optional extensions, which are discussed further in the corresponding section. Once the main goal of the project has been met, the number of optional functions implemented would become a good measure of the progress of the project.

Since this project concerns the verification system rather than the designs under test, the results from the benchmarks should not be used to evaluate this project.

A noteworthy point in evaluation concerns the progress of the sister project. The purpose of the testbench is to verify and stress the arithmetic designs. If these designs are not available near the end of this project, it would be difficult to empirically prove the capabilities of the testbench and its surrounding system. It is not impossible, as there are still substitutes for them. For functional purposes, standard off-the-shelf arithmetic modules could be used in lieu. For other purposes, it is possible to have a model done before the actual design starts in the paired project. While this would allow this project to progress more easily, it would be extra work for the other project, so the decision ultimately rests with the other student. In all, it would be preferable to have a solid, completed arithmetic module to run in this testbench, but without one, the system can still be built and completed, albeit generating less useful data towards the overall aim of the project.

# VII. ETHICAL, LEGAL AND SAFETY PLAN

#### A. Ethical Considerations

Checking against the ethical issue list provided by Imperial College Research Ethics Committee [25], this project does not

- damage participants' mental or physical health;
- jeopardise the safety and liberty of the researchers;
- use any private information;
- involve sensitive subject matter or methods;
- risk any conflict of interest between the researchers and the College.

This project is thus free from significant ethical concerns.

# B. Legal Considerations

Intel Quartus Prime software offers a variety of IP cores, which are encrypted module designs [27]. This project would integrate some of these IP cores into the verification system so that basic circuit building blocks would not need to be redesigned.

While it would be possible to complete this project with a free license, Imperial holds academic licenses for the software, which allow for faster compilation.

# C. Safety Considerations

As the project is done mainly on a computer with minimal physical aspects, there is no major safety concern. For the minor concerns associated with the project, the physical development board will be handled with care, and desk work will be interleaved with breaks.

#### REFERENCES

- [1] R.P. Brent, "A Regular Layout for Parallel Adders", IEEE Trans. Comput., vol. C-31, pp. 260-264, 1982.
- [2] B. Catanzaro, and B. Nelson, "Higher Radix Floating-Point Representations for FPGA-Based Arithmetic", Proceedings of the 51st Annual Design Automation Conference, 2005.
- [3] R. Duncan, "A Survey of Parallel Computer Architectures", Computer, vol. 23, pp. 5-16, 1990.
- [4] J.W. Duran, "An Evaluation of Random Testing", IEEE Trans. on Software Engineering, vol. SE-10, no. 4, pp. 438-444, 1984.
- [5] M.D. Ercegovac, "On-line Arithmetic: An Overview", 28th Annual Technical Symposium, pp. 86-93, International Society for Optics and Photonics, 1984.
- [6] M.D. Ercegovac, and T. Lang, "Digital Arithmetic", Morgan Kaufmann, 2003.
- [7] S. Hazwani, et al, "Randomness Analysis of Pseudo Random Noise Generator Using 24-bits LFSR", Fifth International Conference on Intelligent Systems, Modelling and Simulation, 2014.
- [8] T. Lynch, and M.J. Schulte, "A High Radix On-line Arithmetic for Credible and Accurate Computing", Journal of Universal Computer Science, vol. 1, no. 7, pp. 439-453, 1995.
- [9] T. Lynch, and M.J. Schulte, "Software for High Radix On-line Arithmetic", Reliable Computing, vol. 2, no. 2, pp. 133-138, 1996.
- [10] H.R. Srinivas, and K.K. Parhi, "High-Speed VLSI Arithmetic Processor Architectures Using Hybrid Number Representation", J. of VLSI Sign. Process., vol. 4. pp. 177-198, 1992.
- [11] K. Shi, D. Boland, and G.A. Constantinides, "Accuracy-Performance Tradeoffs on an FPGA through Overclocking", Proc. Int. Symp. Field-Programmable Custom Computing Machines, pp. 29-36, 2013.
- [12] K. Shi, D. Boland, E. Stott, S. Bayliss, and G.A. Constantinides, "Datapath Synthesis for Overclocking: Online Arithmetic for Latency-Accuracy Trade-offs", Proceedings of the 13th Symposium on Field-Programmable Custom Computing Machines, pp. 1-6, ACM, 2014.
- [13] O. Šćekić, "FPGA Comparative Analysis", University of Belgrade, 2005.
- [14] A.F. Tenca, and M.D. Ercegovac, "Design of high-radix digit-slices for on-line computations", 2007.
- [15] K.S. Trivedi, and M.D. Ercegovac, "On-line Algorithms for Division and Multiplication", IEEE Trans. Comput., vol. C-26, no. 7, pp. 667-680, 1977.
- [16] P. Whyte, "Design and Implementation of High-radix Arithmetic Systems Based on the SDNR/RNS Data Representation", Edith Cowan University, 1907.
- [17] Y. Zhao, J. Wickerson, and G.A. Constantinides, "An Efficient Implementation of Online Arithmetic", Int. Conf. on Field-Programmable Technology, 2016.
- [18] Accellera Systems Initiative, "Universal Verification Methodology 1.2 User's Guide", 2015.
- [19] Altera Corporation, "Cyclone V SoC Development Board Reference Manual", 2015.
- [20] Altera Corporation, "Memory System Design", Embedded Design Handbook, 2010.
- [21] Altera Corporation, "Introduction to Altmemphy IP", External Memory Interface Handbook: Reference Material, vol. 3, 2012.
- [22] Altera Corporation, "Phase-Locked Loop Basics, PLL,".
- [23] Altera Corporation, "Creating Qsys Components", 2018.
- [24] Altera Corporation, "Cyclone V Hard Processor System Technical Reference Manual", 2018.
- [25] Imperial College, "An Ethics Code", Imperial College Research Ethics Committee, 2013.
- [26] Intel Corporation, "Cyclone V SoC Development Kit and Intel SoC FPGA Embedded Development Suite".
- [27] Intel Corporation, "Introduction to Intel FPGA IP Cores", 2018.
- [28] Intel Corporation, "Avalon Interface Specifications", 2018.
- [29] RocketBoards.org, "GSRD 14.1 User manual", 2015.
- [30] Xilinx, Inc, "Zynq-7000 All Programmable SoC", 2018.
- [31] Xilinx, Inc, "ZedBoard (Zynq Evaluation and Development) Hardware User's Guide", 2012.