## FPGA Implementation of CNN Handwritten Character Recognition

TEAM POOR HANDWRITING

Qikun Liu Shichen (Justin) Qiao Haining Qiu Lingkai (Harry) Zhao

**ADVISOR** 

Eric Hoffman

ECE 554 - MAY, 2023



## Summary

Machine learning (ML) has grown rapidly within computer science in recent years. As computer hardware engineers, we are enthusiastic about implementing popular software ML architectures in hardware to optimize their performance, reliability, and resource usage.

Our project involved designing a real-time device for recognizing handwritten letters and digits using an Altera DE1 FPGA Kit. We implemented and validated three different ML architectures: linear classification, a 784-64-10 fully connected neural network (NN), and a LeNet-5 CNN with ReLU activation layers and 36 classes. The training processes were done in Python scripts, and the resulting kernels and weights were stored in hex files and loaded into the FPGA's SRAM units. We wrote assembly code for our custom 32-bit floating-point instruction set architecture (ISA) to perform classification, and used a 5-stage MIPS processor that we designed in SystemVerilog to manage image processing, matrix multiplications, and user interfaces. We followed various engineering standards, including IEEE-754 32-bit Floating Point Standard, Video Graphics Array (VGA) display protocol, Universal Asynchronous Receiver-Transmitter (UART) protocol, and Inter-Integrated Circuit (I2C) protocols to achieve our project goals.

This report documents the high-level design block diagrams, the interfaces between the SystemVerilog modules, the implementation details of our software and firmware components, and the potential impacts of our project on society. It also describes our final demonstration and each team member's individual contributions to this senior capstone project.

# Project Final Report

## **Table of Contents**

- 1. Repeat of ISA Table (updated from your project proposal doc)
- 2. Hardware Block Diagrams
  - 2.1. Top Level
    - 2.1.1. Top Level Memory Mapped Registers
    - 2.1.2. Top Level Memory Blocks
    - 2.1.3. Top Level Compress Signal Control
  - 2.2. Camera Interface
    - 2.2.1. D5M Camera ports
    - 2.2.2. SDRAM\_Control
    - 2.2.3. I2C\_CCD\_Config
    - 2.2.4. Raw2Gray
    - 2.2.5. CCD\_Capture
  - 2.3. CPU
  - 2.4. Extended ALU
    - 2.4.1. Floating Point Adder Interface
      - 2.4.1.1. Left Shifter Interface
      - 2.4.1.2. Right Shifter Interface
    - 2.4.2. Floating-point Multiplier Interface
    - 2.4.3. Float-to-integer Unit Interface
    - 2.4.4. Integer-to-float Unit Interface
    - 2.4.5. 16-by-16 Integer Multiplier Interface
  - 2.5. Stack
  - 2.6. Image Processing and Storage
    - 2.6.1. Image Compressor Interface
    - 2.6.2. Image Compressor X Interface - only for CNN Model
    - 2.6.3. Image Compressor Registers
    - 2.6.4. Image Memory Interface
    - 2.6.5. Image Memory Registers
- 3. Software
  - 3.1. Machine Learning Model - Linear Classification
  - 3.2. Machine Learning Model - Neural Network
  - 3.3. Machine Learning Model - Convolutional Neural Network
  - 3.4. Assembly Firmware
    - 3.4.1. Main Function (CNN-Supercharged)
    - 3.4.2. Pre Process Function
    - 3.4.3. Convolution Layer
    - 3.4.4. Average Pooling Layer
    - 3.4.5. Neural Network (Matrix Multiplication) Layer
    - 3.4.6. Output Layer
  - 3.5. Self-checking Assembly Tests and Python Auto-Tester
    - 3.5.1. Python Auto-Tester
    - 3.5.2. ASM Test Coverage
- 4. Engineering Standards Employed in your Design
  - 4.1. IEEE 1800-2009 SystemVerilog
  - 4.2. IEEE-754 32-bit Floating Point Standard
  - 4.3. VGA
  - 4.4. UART (SPART)
  - 4.5. I2C
- 5. Potential Societal Impacts of Our Design
  - 5.1. Computing Efficiency in both Time and Cost
  - 5.2. Smart Monitoring and Internet of Things (IoT)
  - 5.3. Trend of Edge Computing Using ASIC Devices
- 6. Validation
- 7. Final Application Demonstration
- 8. Contributions of Individuals

## 1. Repeat of ISA Table (updated from your project proposal doc)

#### **Full ISA:**

https://docs.google.com/spreadsheets/d/1PT7VjIhUPUwOg7ZNtqeGGRNTjavUGF0D/edit#gid=12035849

Changes from proposal: Added ADDI and SUBI instructions, removed MOVC(LWI) instruction

#### **Responses to Eric's comments:**

We extended and validated the assembler according to our 32-bit ISA. Please check out Project/asm\_tests/asmbl\_32.pl for the implementation and Project/asm\_tests/translate\_test.asm for a compilation sanity check.

We kept the HLT instruction so that our Python auto tester can take advantage of it and run assembly validation tests of our processor automatically.

We kept R0 hardwired to 0 so that floating-point values can be tested efficiently. For instance, suppose R1 holds an FP value and we want to branch according to its sign. We cannot use ADD or ADDI for this because the ALU does not set the flags properly for FP values. With R0 hardwired to 0, we can instead execute ADDF R1, R1, R0 and then branch safely and efficiently on the instruction immediately after it.

We kept LLB and LHB as they were, since we feel that names such as LLW and LHW would be easily confused with the LW instruction.

## Copy of ISA table (please go to the link at the top of this section for a better formatted table):

## ECE 554 32-bit ISA

General Format

3 register instruction: aaaa\_axxx\_xxxd\_dddd\_xxxs\_ssss\_xxxt\_tttt

2 register instruction: aaaa\_axxx\_xxxd\_dddd\_xxxs\_ssss\_iiii\_iiii

1 register instruction: aaaa\_axxx\_xxxd\_dddd\_oooo\_oooo\_oooo

a=opcode, c=sub\_opcode, x=don't\_care, d=destination, s=source, t=second\_source, i=immediate, o=offset

Floating Point Format: IEEE 754: https://en.wikipedia.org/wiki/IEEE_754

- 1. Flag registers are Z-zero, V-overflow, N-negative/sign
- 2. The overflow flag denotes positive overflow as well as negative underflow
- 3. Register R0 is hard-wired to 32'h00000000, can't be written to
- 4. Jal instruction always stores the return address in register R31. Do not write R31 inside function calls if you wish to return.

| Instruction | Format | Sample Instruction | Opcode | Sample Explanation | Other Comments |
|-------------|--------|--------------------|--------|--------------------|----------------|
| ADD | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt | ADD R1, R2, R3 | 5'b00000 | R1 <= R2 + R3 | Saturating arithmetic. Updates the Z, V and N flag registers |
| ADDZ | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt | ADDZ R1, R2, R3 | | R1 <= R2 + R3 only if Z=1 | |
| SUB | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt | SUB R1, R2, R3 | 5'b00010 | R1 <= R2 - R3 | |
| AND | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt | AND R1, R2, R3 | 5'b00011 | R1 <= R2 & R3 | Updates the Z flag register |
| NOR | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt | NOR R1, R2, R3 | 5'b00100 | R1 <= ~(R2 \| R3) | Updates the Z flag register |
| SLL | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxi_iiii | SLL R1, R2, C | 5'b00101 | R1 <= R2 << C | C is 5-bit unsigned immediate value. Updates the Z flag register |
| SRL | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxi_iiii | SRL R1, R2, C | 5'b00110 | R1 <= R2 >> C | C is 5-bit unsigned immediate value. Updates the Z flag register |
| SRA | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxi_iiii | SRA R1, R2, C | 5'b00111 | R1 <= R2 >>> C | C is 5-bit unsigned immediate value. Updates the Z flag register |
| LW | aaaa_axxx_xxxd_dddd_xxxs_ssss_oooo_oooo | LW R1, R2, O | 5'b01000 | R1 <= DataMem[R2 + O] | O is 8-bit signed immediate value |
| SW | aaaa_axxx_xxxd_dddd_xxxs_ssss_oooo_oooo | SW R1, R2, O | 5'b01001 | DataMem[R2 + O] <= R1 | O is 8-bit signed immediate value |
| LHB | aaaa_axxx_xxxd_dddd_iiii_iiii_iiii_iiii | LHB R1, C | 5'b01010 | R1 <= {C, R1[15:0]} | C is 16-bit signed immediate value |
| LLB | aaaa_axxx_xxxd_dddd_iiii_iiii_iiii_iiii | LLB R1, C | 5'b01011 | R1 <= sign-extend{C} | C is 16-bit signed immediate value |
| B NEQ | ...xx_xxxx_oooo_oooo_oooo | B NEQ, label | 5'b01100 000 | Branch if Z=0 | O is signed 12-bit offset in two's complement. Branch target address = (Address of branch instruction + 1) + offset. PC holds word addresses and each instruction is 1 word, so the offset is specified as the number of instructions with respect to the instruction following the branch instruction. |
| B EQ | ...xx_xxxx_oooo_oooo_oooo | B EQ, label | 5'b01100 001 | Branch if Z=1 | |
| B GT | ...xx_xxxx_oooo_oooo_oooo | B GT, label | | | |
| B LT | ...xx_xxxx_oooo_oooo_oooo | B LT, label | 5'b01100 011 | Branch if N=1 | |
| B GTE | ...xx_xxxx_oooo_oooo_oooo | B GTE, label | 5'b01100 100 | Branch if N=0 | |
| B LTE | ...xx_xxxx_oooo_oooo_oooo | B LTE, label | 5'b01100 101 | | |
| B OVFL | ...xx_xxxx_oooo_oooo_oooo | B OVFL, label | 5'b01100 110 | Branch if V=1 | |
| B UNCOND | ...xx_xxxx_oooo_oooo_oooo | B UNCOND, label | 5'b01100 111 | Branch unconditionally | |
| JAL | aaaa_axxx_xxxx_xxxx_xxxx_oooo_oooo_oooo | JAL label | 5'b01101 | R31 <= address of jal instruction + 1 | O is signed 12-bit offset in two's complement. Jump target address = (Address of jal instruction + 1) + offset |
| JR | aaaa_axxx_xxxx_xxxx_xxxt_tttt_xxxx_xxxx | JR R31 | 5'b01110 | Jump to the address in R31 | Can be used to return from function calls (jal) |
| PUSH | aaaa_axxx_xxxx_xxxx_xxxs_ssss_xxxx_xxxx | PUSH R1 | 5'b10010 | DataMem[SP] <= R1; Decrement SP | Stores value in R1 into data memory pointed by the stack pointer; decrements stack pointer |
| POP | aaaa_axxx_xxxd_dddd_xxxx_xxxx_xxxx_xxxx | POP R1 | 5'b10011 | | Loads value in data memory pointed by the stack pointer into R1; increments stack pointer |
| ADDI | aaaa_axxx_xxxd_dddd_xxxs_ssss_iiii_iiii | ADDI R1, R2, I | 5'b10100 | R1 <= R2 + I | I is 8-bit signed immediate value. Updates the Z, V and N flag registers |
| SUBI | aaaa_axxx_xxxd_dddd_xxxs_ssss_iiii_iiii | SUBI R1, R2, I | 5'b10101 | R1 <= R2 - I | I is 8-bit signed immediate value. Updates the Z, V and N flag registers |
| MUL | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt | MUL R1, R2, R3 | 5'b11000 | R1 <= (signed) R2[15:0] * (signed) R3[15:0] | Only support 16 by 16 multiplications |
| UMUL | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt | UMUL R1, R2, R3 | | R1 <= (unsigned) R2[15:0] * (unsigned) R3[15:0] | Only support 16 by 16 multiplications |
| ADDF | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt | ADDF R1, R2, R3 | 5'b11010 | R1 <= R2 + R3 (floating-point) | Floating point calculation. 1-bit sign, 8-bit exponent, 23-bit mantissa |
| SUBF | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt | SUBF R1, R2, R3 | 5'b11011 | R1 <= R2 - R3 (floating-point) | Floating point calculation. 1-bit sign, 8-bit exponent, 23-bit mantissa |
| MULF | aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt | MULF R1, R2, R3 | | R1 <= R2 * R3 (floating-point) | Floating point calculation. 1-bit sign, 8-bit exponent, 23-bit mantissa |
| ITF | aaaa_axxx_xxxd_d...x_xxxx | ITF R1, R2 | | R1 <= R2 (integer to floating-point) | |
| FTI | aaaa_axxx_xxxd_d...x_xxxx | FTI R1, R2 | 5'b11110 | R1 <= R2 (floating-point to integer) | |
| HLT | 1111_1xxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx | HLT | 5'b11111 | Processor Halt | |
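As an illustration of the general formats above, the following minimal Python sketch packs a 3-register and a 2-register instruction into 32-bit words. The field positions are read directly off the bit patterns (opcode in bits 31:27, destination in 20:16, source in 12:8, second source in 4:0, immediate in 7:0), and the opcode values are the ones listed for ADD, SUB, and ADDI in the ISA table. The function names are hypothetical and this is not the team's assembler (asmbl\_32.pl).

```python
# Opcode values as listed in the ISA table (only a few are included here).
OPCODES = {"ADD": 0b00000, "SUB": 0b00010, "ADDI": 0b10100}


def _check_reg(reg):
    """Registers are 5-bit indices R0..R31."""
    if not 0 <= reg < 32:
        raise ValueError(f"register index out of range: {reg}")
    return reg


def encode_3reg(op, rd, rs, rt):
    """Pack aaaa_axxx_xxxd_dddd_xxxs_ssss_xxxt_tttt."""
    return (
        (OPCODES[op] << 27)
        | (_check_reg(rd) << 16)
        | (_check_reg(rs) << 8)
        | _check_reg(rt)
    )


def encode_2reg_imm(op, rd, rs, imm):
    """Pack aaaa_axxx_xxxd_dddd_xxxs_ssss_iiii_iiii with an 8-bit signed immediate."""
    if not -128 <= imm <= 127:
        raise ValueError(f"immediate does not fit in 8 signed bits: {imm}")
    return (
        (OPCODES[op] << 27)
        | (_check_reg(rd) << 16)
        | (_check_reg(rs) << 8)
        | (imm & 0xFF)  # two's complement truncation to 8 bits
    )


if __name__ == "__main__":
    print(f"ADD  R1, R2, R3 -> 0x{encode_3reg('ADD', 1, 2, 3):08X}")
    print(f"ADDI R1, R2, -1 -> 0x{encode_2reg_imm('ADDI', 1, 2, -1):08X}")
```

The don't-care bits are emitted as 0 here; any value would decode to the same instruction under the formats above.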

## 2. Hardware Block Diagrams

## 2.1. Top Level

The top level includes instantiations of the following modules.

One set of modules comes from the camera exploration tutorial: Sdram\_Control, RAW2GRAY, CCD\_Capture, I2C\_CCD\_Config, Reset\_Delay, and VGA\_Controller.

A second set of modules was implemented entirely by us to provide the required functions: image\_mem, weight\_rom, SPART, PLL, and rst\_sync. The details are explained in the following sections.

[Figure: top-level block diagram of ImageRecog.sv, showing the VGA\_Controller, Sdram\_Control, image\_mem, weight\_rom, compress\_control, Reset\_Delay, rst\_synch, I2C\_CCD\_Config, and PLL blocks and the rst signal.]

The modifications to the CPU are explained in Section 2.3.

## 2.1.1. Top Level Memory Mapped Registers


| Register Address: | Description:                                                          |
|-------------------|-----------------------------------------------------------------------|
| 0x0000C000        | Write to this address will write to LEDR[9:0] of board                |
| 0x0000C001        | Read from this address will return state of SW[9:0] of board          |
| 0x0000C004        | Transmit Buffer (IOR/W = 0); Receive Buffer (IOR/W = 1)               |
| 0x0000C005        | Status Register (IOR/W = 1)                                           |
| 0x0000C006        | DB(Low) Division Buffer                                               |
| 0x0000C007        | DB(High) Division Buffer                                              |
| 0x0000C008        | Set to 1 to request compressing the image; it is pulled down to 0 once the compression is finished. |

#### 2.1.2. Top Level Memory Blocks

Two external memory modules are instantiated at the top level to support the implementation of CNN in hardware.

The image\_mem is a dual-ported memory module that stores one compressed 32x32 image. It is written by image\_compressor\_x and read by the CPU through external memory accesses. When an image is captured and stored in image\_mem, image\_compressor\_x automatically adds 2 pixels of padding to each side (from 28x28 to 32x32), as required by the LeNet-5 CNN architecture.
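The padding step can be sketched in Python as follows. This is only a software model of the 28x28 to 32x32 expansion, not the image\_compressor\_x hardware; the function name and the fill value of 0 for the border are assumptions.

```python
def pad_to_32(img28, fill=0):
    """Return a 32x32 copy of a 28x28 image with a 2-pixel border on every side.

    `fill` is the border value; 0 is an assumption, not taken from the report.
    """
    if len(img28) != 28 or any(len(row) != 28 for row in img28):
        raise ValueError("expected a 28x28 image")
    border_row = [fill] * 32
    padded = [list(border_row), list(border_row)]            # 2 rows on top
    for row in img28:
        padded.append([fill, fill] + list(row) + [fill, fill])  # 2 columns each side
    padded += [list(border_row), list(border_row)]           # 2 rows at the bottom
    return padded


if __name__ == "__main__":
    image = [[r * 28 + c + 1 for c in range(28)] for r in range(28)]
    padded = pad_to_32(image)
    print(len(padded), len(padded[0]))
```

Pixel (r, c) of the original image ends up at (r + 2, c + 2) in the padded image.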

The weight\_rom is a single-ported memory module that holds all the weights necessary for the CNN. A total of 63,654 weights are needed to implement the CNN. The CPU can read any weight in the weight\_rom, which starts at address 0x20000. These weights are trained, tested, and validated in software, and are then loaded into the ROM from a weight.hex file generated by software. We decided to use external memory for weight\_rom for two reasons: 1. We do not want to expand the data\_mem of the CPU because we want to keep the CPU as a general-purpose processor. 2. We want to ensure that the weights are stored as ROM and cannot be overwritten by software.

- The first 6x25 weights are for the first convolution layer
- The second 6x16x25 weights are for the second convolution layer
- The third 400x120 weights are for the first fully-connected neural network layer
- The fourth 120x84 weights are for the second fully-connected neural network layer
- The last 84x36 weights are for the third fully-connected neural network layer

For more information on the neural network architectures, please see the software section of the report.

| Name:      | Start Addr | End Addr   | Description:                                                                               |
|------------|------------|------------|--------------------------------------------------------------------------------------------|
| image_mem  | 0x00010000 | 0x0001030F | Image memory RAM, take inputs from image compressor and output values to VGA and processor |
| weight_rom | 0x00020000 | 0x00021E9F | Weight memory ROM, values load from ML software, output to processor                       |
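The weight layout listed above can be turned into per-layer address ranges with a short Python sketch. It assumes one weight per word address, packed back to back in the order listed, starting at 0x20000; the layer names are hypothetical labels. Note that the weight\_rom range in the table (0x20000 to 0x21E9F) spans 7,840 locations, fewer than the 63,654 CNN weights, so the end address computed here is the sketch's own result rather than a figure from the table.

```python
WEIGHT_ROM_BASE = 0x20000  # starting address given in the text

# (label, number of weights) in the order the report lists them.
LAYERS = [
    ("conv1", 6 * 25),
    ("conv2", 6 * 16 * 25),
    ("fc1", 400 * 120),
    ("fc2", 120 * 84),
    ("fc3", 84 * 36),
]


def layer_ranges(base=WEIGHT_ROM_BASE, layers=LAYERS):
    """Map each layer label to its inclusive (first, last) word address."""
    ranges = {}
    addr = base
    for name, count in layers:
        ranges[name] = (addr, addr + count - 1)
        addr += count
    return ranges


def total_weights(layers=LAYERS):
    return sum(count for _, count in layers)


if __name__ == "__main__":
    for name, (first, last) in layer_ranges().items():
        print(f"{name}: 0x{first:05X} .. 0x{last:05X}")
    print("total weights:", total_weights())
```

The per-layer counts sum to 63,654, which matches the total quoted in the text.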

## 2.1.3. Top Level Compress Signal Control

To implement real-time processing, we must capture a new image only after the previous image has been processed and its result output through UART. The compress control logic therefore lets the CPU request a new image compression. Before processing each image, the assembly code must request a snapshot by storing 1 to 0xC008 in data memory (SW 1, 0xC008). The compress control waits until the SDRAM access is synchronized with the first pixel of the snapshot and then enables the compressor to process the image. The assembly code needs to check the data at 0xC008 periodically; when it becomes 0, the compressed image in image\_mem is ready for use.

| Signal:                | Dir: | Description:                                                                                                                                                                              |
|------------------------|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| uncompress_addr_x[7:0] | in   | X axis of the uncompressed image address. It is used to synchronize with the first pixel of the uncompressed image                                                                        |
| uncompress_addr_y[7:0] | in   | Y axis of the uncompressed image address. It is used to synchronize with the first pixel of the uncompressed image                                                                        |
| we                     | in   | Write-enable signal for requesting a snapshot. The CPU will access this when writing to 0xC008.                                                                                           |
| compress_wdata         | in   | The request for a snapshot. The CPU can set this to 1 to indicate a request for snapshot, 0 to indicate no need to take a snapshot                                                        |
| pause                  | in   | Input from the button. This supports the function of freezing video input by pressing a button (KEY[2]). The compress signal control will not start until the key is released.            |
| compress_req           | out  | The status register of the compress control. If it is 0, there is no compression going on. If it is 1, a compression is in process. CPU has responsibility to access 0xC008 to check this status. |
| compress_start         | out  | The control signal to start a new compression.                                                                                                                                            |
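The request-and-poll handshake on 0xC008 can be illustrated with a small Python model. This is a behavioral sketch only: the class, its `latency` parameter, and the `tick` method are hypothetical stand-ins for the compress control hardware, and the real firmware is written in assembly.

```python
COMPRESS_REG = 0xC008  # memory-mapped compress request/status register


class CompressControlModel:
    """Toy model: the request bit stays 1 until `latency` ticks have elapsed."""

    def __init__(self, latency=3):
        self.latency = latency
        self.compress_req = 0
        self._remaining = 0

    def write(self, addr, value):
        # A write of 1 to 0xC008 requests a new snapshot.
        if addr == COMPRESS_REG and value & 1:
            self.compress_req = 1
            self._remaining = self.latency

    def read(self, addr):
        return self.compress_req if addr == COMPRESS_REG else 0

    def tick(self):
        # Stand-in for the time the hardware needs to sync and compress.
        if self.compress_req:
            self._remaining -= 1
            if self._remaining <= 0:
                self.compress_req = 0  # pulled down when compression is done


def request_snapshot(dev, max_polls=1000):
    """Request a compression, then poll 0xC008 until it reads 0."""
    dev.write(COMPRESS_REG, 1)
    polls = 0
    while dev.read(COMPRESS_REG) != 0:
        if polls >= max_polls:
            raise TimeoutError("compression did not finish")
        dev.tick()
        polls += 1
    return polls


if __name__ == "__main__":
    device = CompressControlModel(latency=3)
    print("polls until image ready:", request_snapshot(device))
```

The `max_polls` guard is a design choice for the sketch so that a stalled model cannot hang the loop; the report does not say whether the firmware uses a timeout.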

## 2.2. Camera Interface

#### 2.2.1. D5M Camera ports

The camera interface requires the following ports:

input: D5M\_D[11:0], D5M\_FVAL, D5M\_LVAL, D5M\_PIXCLK, D5M\_STROBE

output: D5M\_RESET\_N, D5M\_SCLK, D5M\_TRIGGER, D5M\_XCLKIN

inout: D5M\_SDATA

These ports are implemented using GPIO\_0 ports on the FPGA board. We map the ports on GPIO to corresponding pins on the D5M camera. The D5M camera captures the images and the data is stored into the external SDRAM on the FPGA developer board. Because we do not have sufficient FPGA embedded SRAM memory, we must use the external SDRAM.

#### 2.2.2. SDRAM\_Control

To manage the data transferred into the external SDRAM, we use the Sdram\_Control module to control the timing and the data sent to the SDRAM. Each received pixel is divided into two parts because the SDRAM is 8 bits wide and a received pixel is 12 bits wide. The Sdram\_Control module takes in the captured image and stores the upper and lower bytes sequentially. It also has a FIFO buffer to avoid loss of data.

#### 2.2.3. I2C\_CCD\_Config

To manage the configuration of the camera, we use the I2C\_CCD\_Config module to adjust the exposure, zoom, and brightness of the D5M camera through the I2C protocol. To adjust a camera setting, we use a combination of a switch and a button. For example, if SW[0] is on and KEY[1] is pressed, the exposure increases. If SW[0] is off and KEY[1] is pressed, the exposure decreases. I2C\_CCD\_Config monitors the switches and keys and updates the settings through the I2C protocol.

#### 2.2.4. Raw2Gray

To convert the captured color image into a grayscale image for CNN prediction, we use the Raw2Gray module. It takes the pixels captured from the camera by the CCD\_Capture module as input and outputs grayscale values.

#### 2.2.5. CCD\_Capture

To capture the image and manage the communication protocol with the camera, we use the CCD\_Capture module to control the clock and the data received. This module also keeps track of the frame count, x\_location, and y\_location of the received pixel.

#### 2.3. CPU

This is a high level block diagram of our modified processor. The original 16-bit pipeline logic was mostly preserved. We only expanded the data paths to 32 bits and added new modules. For detailed interface specifications, please refer to the following sections.



## 2.4. Extended ALU



The extended ALU sits in the same pipeline stage as the original ALU in the processor and adds hardware support for floating-point operations and integer multiplication. It contains five submodules for floating-point addition and multiplication, conversions between float and integer, and integer multiplication. Its output value and flags are selected by the func signal.

| Signal:         | Dir: | Description:                                                               |
|-----------------|------|----------------------------------------------------------------------------|
| clk             | in   | 50MHz system clock                                                         |
| src1[31:0]      | in   | 32-bit source 1 into ALU                                                   |
| src0[31:0]      | in   | 32-bit source 0 into ALU                                                   |
| func[2:0]       | in   | 3-bit OP Code:                                                             |
|                 |      | 000 ==> MUL                                                                |
|                 |      | 001 ==> UMUL                                                               |
|                 |      | 010 ==> ADDF                                                               |
|                 |      | 011 ==> SUBF                                                               |
|                 |      | 100 ==> MULF                                                               |
|                 |      | 101 ==> ITF                                                                |
|                 |      | 110 ==> FTI                                                                |
|                 |      | 111 ==> undefined                                                          |
| dst_EX_DM[31:0] | out  | 32-bit ALU output                                                          |
| ov              | out  | Overflow flag. Always 0; kept here for the branch operations that follow   |
| zr              | out  | Zero flag - high when output is 0 (int zero or FP zeroes)                  |
| neg             | out  | Negative flag - high when output is negative                               |
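The flag behavior described in the table can be modeled with a few lines of Python. This is a sketch under stated assumptions, not the SystemVerilog implementation: the function names are hypothetical, neg is assumed to be bit 31 of the output for both integer and floating-point results, and the report does not specify how neg behaves for -0.0.

```python
import struct


def float_to_bits(value):
    """Bit pattern of a Python float rounded to IEEE-754 single precision."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def ext_alu_flags(out, is_float):
    """Return the ov/zr/neg flags for a 32-bit extended-ALU output.

    zr covers integer zero and, for floating-point results, both +0.0 and -0.0.
    neg is assumed to be the sign bit (bit 31) of the output.
    """
    out &= 0xFFFFFFFF
    if is_float:
        zr = (out & 0x7FFFFFFF) == 0  # ignore the sign bit: +0.0 and -0.0
    else:
        zr = out == 0
    return {"ov": False, "zr": zr, "neg": bool(out >> 31)}


if __name__ == "__main__":
    print(ext_alu_flags(float_to_bits(-1.5), is_float=True))
    print(ext_alu_flags(float_to_bits(0.0), is_float=True))
```

Under this model, an ADDF of an FP value with R0 leaves the value unchanged, so the zr and neg flags reflect whether that value is zero or negative, which is what a following branch would test.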

#### 2.4.1. Floating Point Adder Interface

The addition of two IEEE-754 floating-point numbers is a complex process because the exponents may differ and the mantissas are stored as unsigned values. Here are the steps involved in the FP Adder process:

- 1. Compare the exponents and determine the smaller one. Calculate the absolute difference between the exponents; the larger exponent becomes the common exponent.
- 2. Prepend the implicit leading bit to both mantissas, making them both 24 bits in length.
- 3. Shift the mantissa with the smaller exponent to the right, using the lower 5 bits of the exponent difference as the shift amount. The maximum shift amount should be 22 bits.
- 4. Convert both prepended and shifted mantissas to 2's complement format. This makes both numbers 25 bits in length.
- 5. Add the two 25-bit numbers, and if the result overflows positively or negatively, increment the common exponent. Note that this overflow is an internal overflow, not an external value overflow.
- 6. Convert the 25-bit 2's complement result back to a 25-bit sign-magnitude number, where the MSB is the final sign and the remaining 24 bits are the unsigned value. If this 24-bit prepended mantissa starts with 0 but no internal overflow occurs, denormalize the common exponent to 0.
- 7. If an overflow occurs, shift the lower 24 bits to the right and insert 1 in the MSB, or shift the lower 24 bits to the left if they have leading zeros. The common exponent, if not 0, is adjusted accordingly.
- 8. The resulting mantissa is the lower 23 bits of the final 24-bit result, and the exponent is the final common exponent. The sign is the MSB of the final result.
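The steps above can be sketched in Python for the common case. This is a simplified software model, not the hardware: it uses Python integers in sign-magnitude form instead of 25-bit 2's complement, truncates the bits shifted out during alignment, and assumes the inputs are zero or normalized and that the result neither overflows nor underflows (no denormal, infinity, or NaN handling). All names are hypothetical.

```python
import struct


def float_to_bits(value):
    """Bit pattern of a Python float rounded to IEEE-754 single precision."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def fp_add_bits(a, b):
    """Add two single-precision bit patterns (zero or normalized inputs only)."""
    sa, ea, ma = a >> 31, (a >> 23) & 0xFF, a & 0x7FFFFF
    sb, eb, mb = b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF

    # Prepend the implicit leading 1 (exponent 0 is treated as zero).
    if ea:
        ma |= 1 << 23
    if eb:
        mb |= 1 << 23

    # Align to the larger (common) exponent by shifting the smaller operand right.
    if ea >= eb:
        exp = ea
        mb >>= min(ea - eb, 24)
    else:
        exp = eb
        ma >>= min(eb - ea, 24)

    # Signed addition; Python ints stand in for the 25-bit 2's complement adder.
    total = (-ma if sa else ma) + (-mb if sb else mb)
    sign = 1 if total < 0 else 0
    mag = abs(total)
    if mag == 0:
        return 0  # exact cancellation gives +0.0

    # Normalize: one right shift on carry-out, left shifts for leading zeros.
    if mag >> 24:
        mag >>= 1
        exp += 1
    while not (mag >> 23):
        mag <<= 1
        exp -= 1

    # Drop the hidden bit and reassemble sign, exponent, and mantissa.
    return (sign << 31) | (exp << 23) | (mag & 0x7FFFFF)


if __name__ == "__main__":
    result = fp_add_bits(float_to_bits(1.5), float_to_bits(2.25))
    print(f"0x{result:08X}")
```

Because the sketch truncates instead of rounding, it matches a reference adder only for sums that are exactly representable after alignment.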

| Signal:    | Dir: | Description:                                                  |
|------------|------|---------------------------------------------------------------|
| A[31:0]    | in   | 32-bit input interpreted as an IEEE-754 floating-point number |
| B[31:0]    | in   | 32-bit input interpreted as an IEEE-754 floating-point number |
| Out[31:0]  | out  | 32-bit output as an IEEE-754 floating-point number            |

#### 2.4.1.1. Left Shifter Interface

| Signal:         | Dir: | Description:                                           |
|-----------------|------|--------------------------------------------------------|
| In[23:0]        | in   | 24-bit mantissa input to be logically left-shifted     |
| ShAmt[4:0]      | in   | 5-bit left shift amount                                |
| Out[23:0]       | out  | 24-bit mantissa output normalized into IEEE-754 format |

#### 2.4.1.2. Right Shifter Interface

| Signal:    | Dir: | Description:                                           |
|------------|------|--------------------------------------------------------|
| In[23:0]   | in   | 24-bit mantissa input to be logically right-shifted    |
| ShAmt[4:0] | in   | 5-bit right shift amount                               |
| Out[23:0]  | out  | 24-bit mantissa output normalized to a common exponent |

## 2.4.2. Floating-point Multiplier Interface

The process for multiplying two 32-bit values in IEEE-754 format involves breaking the inputs into {S1, E1, M1} and {S2, E2, M2}. The signs are XORed to determine the sign of the product, and the exponents are added together with the 127 offset accounted for. The mantissas are then prepended with an implicit 1 (or 0 for denormalized values) and multiplied together. Special values like -INF, -0, +0, and +INF require special combinational logic to adjust the outputs. The resulting values are then concatenated and normalized to conform to the IEEE FP standard and output as a single 32-bit value.

| Signal:   | Dir: | Description:                                                  |
|-----------|------|---------------------------------------------------------------|
| A[31:0]   | in   | 32-bit input interpreted as an IEEE-754 floating-point number |
| B[31:0]   | in   | 32-bit input interpreted as an IEEE-754 floating-point number |
| OUT[31:0] | out  | 32-bit output as an IEEE-754 floating-point number            |
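The multiplication flow can be illustrated with a short Python model. This is a sketch under simplifying assumptions, not the hardware implementation: `fp_mul` and the helper names are hypothetical, a zero exponent field is treated as a signed zero, the product is truncated instead of rounded, and INF, NaN, and exponent overflow or underflow are not handled.

```python
import struct

def to_bits(x):
    """Pack a Python float into its IEEE-754 single-precision bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def to_float(bits):
    """Unpack a 32-bit pattern back into a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp_mul(a, b):
    """Multiply two binary32 bit patterns (zero and normal values only)."""
    sign = (a >> 31) ^ (b >> 31)            # XOR of the two sign bits
    ea, eb = (a >> 23) & 0xFF, (b >> 23) & 0xFF
    if ea == 0 or eb == 0:
        return sign << 31                    # simplification: result is +0 or -0
    ma = (a & 0x7FFFFF) | (1 << 23)          # prepend the implicit 1
    mb = (b & 0x7FFFFF) | (1 << 23)
    exp = ea + eb - 127                      # add exponents, remove one offset
    prod = ma * mb                           # 48-bit product of 24-bit mantissas
    if prod >> 47:                           # product in [2, 4): renormalize
        prod >>= 24
        exp += 1
    else:                                    # product in [1, 2)
        prod >>= 23
    return (sign << 31) | (exp << 23) | (prod & 0x7FFFFF)
```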

## 2.4.3. Float-to-integer Unit Interface

| Signal:              | Dir: | Description:                                                  |
|----------------------|------|---------------------------------------------------------------|
| FP_val[31:0]         | in   | 32-bit input interpreted as an IEEE-754 floating-point number |
| signed_int_val[31:0] | out  | 32-bit output converted into a signed integer                 |

#### 2.4.4. Integer-to-float Unit Interface

| Signal:              | Dir: | Description:                                                   |
|----------------------|------|----------------------------------------------------------------|
| signed_int_val[31:0] | in   | 32-bit input interpreted as a signed integer                   |
| FP_val[31:0]         | out  | 32-bit output converted into an IEEE-754 floating-point number |

## 2.4.5. 16-by-16 Integer Multiplier Interface

| Signal:   | Dir: | Description:                                               |
|-----------|------|------------------------------------------------------------|
| A[31:0]   | in   | 32-bit input interpreted as an integer                     |
| B[31:0]   | in   | 32-bit input interpreted as an integer                     |
| sign      | in   | 1 for signed multiplication; 0 for unsigned multiplication |
| OUT[31:0] | out  | 32-bit output as an integer                                |

#### **2.5. Stack**

The stack is implemented as an individual memory module. It is placed in the Execute stage for maximum efficiency. We do not implement it as a subset of the data memory because it takes one additional cycle to reach the data memory stage.

This stack is implemented in the same fashion as the FIFO buffer, except that it pushes to and pops from the top of the stack, which makes it a First-In-Last-Out (FILO) buffer. A list of registers is instantiated, and the access address is updated each cycle based on the command (push or pop) issued to the stack. Only one command is allowed at a time (a push and a pop cannot occur in the same cycle) because each CPU instruction performs only one action.

| Signal:           | Dir: | Description:                    |
|-------------------|------|---------------------------------|
| clk               | in   | 50MHz system clock              |
| rst_n             | in   | active low reset                |
| push              | in   | push wdata onto stack           |
| pop               | in   | pop top of stack to stack_EX_DM |
| wdata[31:0]       | in   | 32-bit data to be pushed        |
| stack_EX_DM[31:0] | out  | 32-bit data being popped        |
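The behavior of the stack can be modeled in a few lines of Python. This is a behavioral sketch only: the class name `Stack`, the `step` method, and the depth of 32 entries are hypothetical, and overflow or underflow checks are omitted.

```python
class Stack:
    """Cycle-level model of a register-based FILO stack."""

    def __init__(self, depth=32):  # depth is an assumed value
        self.regs = [0] * depth
        self.sp = 0                # index of the next free entry

    def step(self, push=False, pop=False, wdata=0):
        """Model one clock cycle; returns the popped word, or None."""
        # Only one command per cycle, as each instruction does one action.
        assert not (push and pop), "push and pop cannot share a cycle"
        if push:
            self.regs[self.sp] = wdata & 0xFFFFFFFF  # 32-bit data
            self.sp += 1
        elif pop:
            self.sp -= 1
            return self.regs[self.sp]
        return None
```

Pushing 1, 2, 3 and then popping three times returns 3, 2, 1, which is the FILO order described above.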

## 2.6. Image Processing and Storage

The image taken from the camera is stored in SDRAM and then fed sequentially into the image compressor. The image compressor takes a 224\*224 8-bit image from SDRAM and compresses it into a 28\*28 8-bit image by averaging the color over each 8\*8 block. In any one cycle, only one pixel is read in and only one pixel is written out.

The image compressor x is designed specifically for the CNN model, which requires a 32\*32 image: the original 28\*28 image surrounded by zero-padding of width 2. The image compressor x therefore has a different output signal called compress\_addrx, which ranges from 0 to 1023 (instead of 0 to 783). The zero-padding is achieved by never writing to the padding addresses in the image memory, so these addresses always contain 0. The compress\_addrx signal is derived from the original compress\_addr output through combinational logic. Note that the SRAM block of the image memory is also extended to 1024 locations.
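One way to express the address translation is shown below. This is an illustrative Python sketch, assuming row-major addressing in both images; the function name is hypothetical and the actual combinational logic in the module may be structured differently.

```python
def compress_addrx(compress_addr):
    """Map a 28*28 address (0 to 783) into the padded 32*32 image (0 to 1023)."""
    row, col = divmod(compress_addr, 28)   # position in the 28*28 image
    return (row + 2) * 32 + (col + 2)      # skip 2 padding rows and 2 padding columns
```

Under this mapping the first pixel lands at address 66 and the last at 957, so the border addresses are never written.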

## 2.6.1. Image Compressor Interface

| Signal:            | Dir: | Description:                                                      |
|--------------------|------|-------------------------------------------------------------------|
| clk                | in   | 25MHz clock signal from VGA display                               |
| rst_n              | in   | System reset signal                                               |
| start              | in   | Signals a valid pixel color input starting from 0                 |
| pix_color_in[7:0]  | in   | 8-bit pixel color value from VGA DRAM (0 to 255 grayscale)        |
| pix_haddr[7:0]     | in   | 8-bit pixel horizontal address (0 to 223)                         |
| pix_vaddr[7:0]     | in   | 8-bit pixel vertical address (0 to 223)                           |
| sram_wr            | out  | Write enable signal to the SRAM memory storing compressed image   |
| pix_color_out[7:0] | out  | 8-bit compressed pixel color by taking average value among an 8*8 block |
| compress_addr[9:0] | out  | 10-bit compressed image pixel address (0 to 783 for a 28*28 image) |

## 2.6.2. Image Compressor X Interface - only for CNN Model

| Signal:             | Dir: | Description:                                                        |
|---------------------|------|---------------------------------------------------------------------|
| clk                 | in   | 25MHz clock signal from VGA display                                 |
| rst_n               | in   | System reset signal                                                 |
| start               | in   | Signals a valid pixel color input starting from 0                   |
| pix_color_in[7:0]   | in   | 8-bit pixel color value from VGA DRAM (0 to 255 grayscale)          |
| pix_haddr[7:0]      | in   | 8-bit pixel horizontal address (0 to 223)                           |
| pix_vaddr[7:0]      | in   | 8-bit pixel vertical address (0 to 223)                             |
| sram_wr             | out  | Write enable signal to the SRAM memory storing compressed image     |
| pix_color_out[7:0]  | out  | 8-bit compressed pixel color by taking average value among an 8*8 block |
| compress_addrx[9:0] | out  | 10-bit compressed image pixel address (0 to 1023 for a 32*32 image) |

## 2.6.3. Image Compressor Registers

| Register:          | Description: |
|--------------------|--------------|
| compress_addr[9:0] | Described in table above. Reset to 10'd784; zero-set when start asserted; incremented when sram_wr asserted. Namely, it increments to the next available SRAM address after an image memory write and stalls itself when the entire image memory gets written until the next asserted start signal. |
| block[13:0][0:27]  | 28 14-bit wide SRAM blocks to store the accumulated sum of every pixel value inside 28 8*8 blocks. We need 28 of them since at least one row of 8*8 blocks should be saved for averaging, and there are 28 blocks per row (224/8 = 28). Its address is determined by the upper 5 bits of pix_haddr, named b_haddr. 14-bit is needed since it stores the sum of 64 8-bit wide pixel color values. The average value is taken from its upper 8 bits to produce a compressed pixel color value. |
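The accumulate-and-average scheme can be sketched in Python as follows. This is a software model under assumptions, not the RTL: the function name `compress` is hypothetical, pixels are assumed to arrive in raster order, and the accumulator is cleared right after each compressed pixel is emitted.

```python
def compress(image):
    """Compress a 224*224 image (rows of 0..255 values) into 784 pixels."""
    block = [0] * 28                 # one 14-bit accumulator per block column
    out = []
    for v in range(224):             # pix_vaddr
        for h in range(224):         # pix_haddr
            b_haddr = h >> 3         # upper 5 bits of the 8-bit address
            block[b_haddr] += image[v][h]
            # The last pixel of an 8*8 block completes its sum of 64 values.
            if v % 8 == 7 and h % 8 == 7:
                out.append(block[b_haddr] >> 6)  # upper 8 of 14 bits = sum / 64
                block[b_haddr] = 0
    return out
```

The largest possible sum is 64 \* 255 = 16320, which fits in 14 bits, and shifting right by 6 divides by 64 without a divider.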

## 2.6.4. Image Memory Interface

| Signal:    | Dir: | Description:                                                           |
|------------|------|------------------------------------------------------------------------|
| clk        | in   | 50MHz system clock. Note that this is 2 times faster than the clock of the image compressor module, but this is safe since every correct data will get written twice at the same address. |
| we         | in   | Write enable signal of image SRAM memory from image compressor         |
| waddr[9:0] | in   | 10-bit write address of SRAM from image compressor                     |
| wdata[7:0] | in   | 8-bit compressed pixel color value from image compressor               |
| raddr[9:0] | in   | 10-bit read address of SRAM (0 to 1023)                                |
| rdata[7:0] | out  | 8-bit read port of SRAM for a compressed pixel color value             |

## 2.6.5. Image Memory Registers

| Register:          | Description: |
|--------------------|--------------|
| rdata[9:0][0:1023] | Described in table above. There are 1024 SRAM blocks, just enough for a compressed 32*32 image with zero-padding of width 2. This SRAM is written by the image compressor. |

## 3. Software

## 3.1. Machine Learning Model - Linear Classification

| Input:                      | Output:                                            |
|-----------------------------|----------------------------------------------------|
| keras.datasets.mnist        | weight.hex                                         |
| - 60,000 images with labels | - contains 7840 8-digit hex numbers, one per line  |

One high-level software component is used for this project. Its purpose is to provide the weight matrix that is used to make predictions by performing matrix multiplication on our input image data. This software is written in Python on Google Colaboratory. It includes a Python notebook and a Python script file. The script file contains helper functions, and the notebook reads the input, calls the functions, and generates the output.

The software does its work in three steps: it reads the training data, trains on the data, and produces the weight matrix. It currently uses the Keras dataset that contains 60,000 different 28\*28-pixel images of handwritten digits, each with a corresponding label indicating its value. The first 40,000 images are used for training, the next 10,000 images are used for validation during training, and the last 10,000 images are used for testing the accuracy of the trained model.

The model used for this software is a softmax-loss linear classifier, which uses the difference between the ideal logistic probabilities and the current logistic probabilities as the loss function and employs gradient descent to minimize it. This training procedure is repeated at least 200 times. The training process uses the PyTorch library to boost performance and reduce the training time. With an appropriate learning rate and regularization factor, the training accuracy can be over 90%, and the testing accuracy is around 88.6% for recognizing individual digits.

After finding the best-performing model, we extract the transpose of the weight matrix. The transposed weight matrix is flattened into a list of floating-point numbers (784 = 28\*28 per class, 7840 in total, matching the line count of weight.hex). Each number is converted into its hex representation and written to a .hex file, where each line is in the format of {LINE\_NUMBER} {HEX\_NUMBER}. This file can be directly loaded into the FPGA board as a ROM, which is later read for the matrix multiplication.
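The float-to-hex conversion can be sketched with the Python standard library. This is only an illustration: the function names are hypothetical, and the decimal line numbers starting at 0 and the lowercase hex digits are assumptions about the file format rather than details taken from our scripts.

```python
import struct

def float_to_hex(value):
    """Return the IEEE-754 single-precision pattern of value as 8 hex digits."""
    return struct.pack(">f", value).hex()

def hex_lines(weights):
    """Format flattened weights as '{LINE_NUMBER} {HEX_NUMBER}' lines."""
    # Assumption: line numbers are decimal and start at 0.
    return [f"{i} {float_to_hex(w)}" for i, w in enumerate(weights)]
```

For instance, `float_to_hex(1.0)` gives `3f800000`, the single-precision encoding of 1.0.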

## 3.2. Machine Learning Model - Neural Network

| Input:                       | Output:                                            |
|------------------------------|----------------------------------------------------|
| keras.datasets.emnist        | weight_nn.hex                                      |
| - 425,600 images with labels | - contains 50816 8-digit hex numbers, one per line |

Since Linear Classification did not perform well, with only 89% accuracy for digits only and 67% accuracy for digits and letters, we chose to change our model to a Neural Network. A Neural Network (NN) model typically performs better than a Linear Classification (LC) model.

For this NN model, we used the publicly available PyTorch library, which has built-in NN support. Specifically, we used a LeNet NN model that has two layers of neurons, 784\*64 and 64\*10, with a ReLU hidden layer between them. We set these configurations for the torch.nn library and let it train and test for us. To predict each image, we first perform a matrix multiplication between the image data and the first layer (1\*784 dot 784\*64), which results in a 1\*64 matrix. Then, we change all the negative numbers in this 1\*64 matrix to 0, which simulates the ReLU layer in our NN model. Next, we perform another matrix multiplication between the 1\*64 matrix and the 64\*10 layer. This results in 1\*10 numbers, which represent the scores for each of the classes. For both digits and letters, the dimension here would be 36 (10 for digits and 26 for letters, with upper-case and lower-case letters classified into the same category). The final result is around 99% accuracy for digits and 90% accuracy for digits and letters. However, the accuracy we observed once the model was loaded onto the FPGA does not appear to be this high.
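The prediction steps described above (matrix multiplication, ReLU, matrix multiplication, pick the highest score) can be written out in plain Python. This is a minimal sketch with hypothetical function names; it assumes each weight matrix is stored as a list of rows and ignores any bias terms.

```python
def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def relu(v):
    """Replace every negative entry with 0."""
    return [max(0.0, e) for e in v]

def predict(image, w1, w2):
    """Return (best class index, scores) for a flattened image."""
    hidden = relu(matvec(image, w1))   # 1*784 dot 784*64, then ReLU
    scores = matvec(hidden, w2)        # 1*64 dot 64*10 (or 64*36)
    return scores.index(max(scores)), scores
```

The same function works for any layer sizes, so a tiny 2-3-2 network is enough to check the arithmetic by hand.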

To get the weights, we only need to export the two neuron layers. model.lin1 and model.lin2 were printed to the weight\_nn.hex file, one number per line. model.lin1 has a dimension of 784\*64 = 50176, and model.lin2 has a dimension of 64\*10 = 640. This totals 50176+640 = 50816. For both digits and letters, the dimension would be 784\*64 = 50176 for layer one and 64\*36 = 2304 for layer two, for a total of 52480 instead. This hex file is later read into weight\_nn\_rom.sv. Each number is first converted to a float, then cast into hex, and finally re-structured into an 8-digit hex number, which is 32 bits wide. The format of each line is the same as for LC, so refer to the end of section 3.1.

## 3.3. Machine Learning Model - Convolutional Neural Network

| Input:                       | Output:                                            |
|------------------------------|----------------------------------------------------|
| keras.datasets.emnist        | weight_cnn.hex                                     |
| - 425,600 images with labels | - contains 61470 8-digit hex numbers, one per line |

The NN model was a lot better than the LC model, but we weren't satisfied with its performance on the FPGA. Having only two layers limits the model's accuracy. But if we were to add more layers to the NN model, our FPGA would run out of memory, as we need to store the weight information somewhere on the board. We therefore introduced another model, the Convolutional Neural Network (CNN), which extracts features and shrinks the input size before applying the NN. This helps us keep the total number of weights manageable, even if we use more layers.

For the CNN model, we kept the structure of the NN model but added convolutional layers before the linear neuron layers. Two convolutional layers are added, with 6 and 16 kernels of size 5\*5 respectively, which work like two filters for the input data. The input data is still 784 (28\*28\*1). The first convolutional layer adds 2 pixels of padding on each of the 4 sides of the picture, which results in a 32\*32\*1 matrix. All the added padding is 0. This 32\*32\*1 matrix is sent to the model.conv1 layer, which uses 6\*1\*5\*5 kernels to extract information. This operation gives us a 28\*28\*6 matrix as the first intermediate data. We then change all negative numbers to 0, serving as the ReLU layer. This processed 28\*28\*6 data is then average pooled with 2\*2 kernels and stride = 2, which means each time we take a 2\*2 matrix, compute its average, and move by 2. This gives us a 14\*14\*6 matrix as the input to the second convolutional layer. model.conv2 has 16\*6\*5\*5 kernels, and operating on the 14\*14\*6 matrix produces a 10\*10\*16 matrix as the second intermediate data. Again, we replace all negative values with 0. Then we do average pooling again with 2\*2 kernels and a stride of 2. This finally gives us a 5\*5\*16 matrix as our data. The data has a dimension of 400 after being flattened, which is far less than the original 784. The total number of weights is the sum of the two convolutional layers: model.conv1 has 6\*1\*5\*5 = 150, and model.conv2 has 16\*6\*5\*5 = 2400. In total, the two convolutional layers contain 150+2400 = 2550 numbers.

This 1\*400 matrix is then sent to the NN model, which has three layers: 400\*120, 120\*84, and 84\*10 (or 84\*36 if letters are included). This is similar to the two-layer NN model that we used before, so refer to section 3.2. The total number of weights for the NN portion is the sum of these three layers. model.lin1 has 400\*120 = 48000, model.lin2 has 120\*84 = 10080, and model.lin3 has 84\*10 = 840. In total, the dimension of the three linear layers is 58920. The total dimension used by weight\_cnn\_rom.sv is then 58920 + 2550 = 61470. Including letters, this becomes 48000+10080+3024+2550 = 63654.
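The shape and weight arithmetic of this section can be double-checked with a few lines of Python. The helper below is a hypothetical sketch that follows the layer sizes stated above and, like the counts in the text, does not include any bias terms.

```python
def lenet_sizes(num_classes=10):
    """Return (flattened feature size, total weight count) for the CNN."""
    side, channels, total = 32, 1, 0          # padded 32*32*1 input
    for out_channels in (6, 16):              # two 5*5 convolution layers
        total += out_channels * channels * 5 * 5
        side = (side - 4) // 2                # 5*5 convolution, then 2*2 pooling
        channels = out_channels
    flattened = side * side * channels        # 5*5*16
    width_in = flattened
    for width_out in (120, 84, num_classes):  # three linear layers
        total += width_in * width_out
        width_in = width_out
    return flattened, total
```

Calling `lenet_sizes()` reproduces the 400 flattened inputs and the 61470 weights for digits, and `lenet_sizes(36)` reproduces the 63654 weights for digits and letters.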

## 3.4. Assembly Firmware

From the firmware's perspective, the input image (integers between 0 and 255) is stored in image\_mem starting at 0x00010000, and the pre-trained kernels or weights (in 32-bit Floating Point format) are stored in weight\_rom (or weight\_nn\_rom or weight\_cnn\_rom) starting at 0x00020000. Using the same machine learning model as the training processes described above, the firmware would accomplish the classification processes through programs written in our customized assembly language.

### 3.4.1. Main Function (CNN-Supercharged)

We chose to explain only the main function in CNN\_supercharged.asm in this report, as it is the most complicated of the main functions in all the firmware we developed.

The main function is essentially an infinite loop, tying the different layers together through data memory (DM) and triggering CNN classifications indefinitely. Before entering a CNN, MAIN sends a snapshot request to the image processing unit and waits in SNAPSHOT\_WAIT until an image is captured from the camera, compressed, padded, and stored in image\_mem. Then MAIN calls PRE\_PROCESS to convert the integers in image\_mem to 32-bit floating-point format and store them into DM.

After pre-processing, MAIN sets up parameters such as the pointer to the weight ROM, input matrix size, channel lengths, and DM pointer to the output matrix for the different layers of the CNN, and calls CONV, AVG\_POOL, MATRIX\_MUL, and OUTPUT\_LAYER in the order defined in section 3.3 above to do the work.

For DM and RF usage, please refer to the following tables:

| Starting Addr | Ending Addr | Usage in this Program                                      |
|---------------|-------------|------------------------------------------------------------|
| 0             | 1023        | Preprocessed input image after HW padding (2 on each side) |
| 1024          | 5727        | Output of first convolution layer                          |
| 5728          | 6903        | Output of first pooling layer                              |
| (reuse start) |             |                                                            |
| 0             | 1599        | Output of second convolution layer                         |
| 1600          | 1999        | Output of second pooling layer                             |
| 2000          | 2119        | Output of first full NN layer                              |
| 2120          | 2203        | Output of second full NN layer                             |
| (reuse end)   |             |                                                            |
| 7000          | 7399        | Workzone for matrix multiplications                        |
| 8000          | 8009        | Final Scores of the 10 classes                             |

| Register Name | Usage in this Function                       |
|---------------|----------------------------------------------|
| R0            | hard-wired 0                                 |
| R1            | snapshot trigger                             |
| R2            | pointer to weight ROM                        |
| R3            | pointer of image MEM                         |
| R4            | input matrix size                            |
| R5            | output matrix size                           |
| R6            | convolution input channel length             |
| R7            | convolution output channel length            |
| R27           | Snapshot status                              |
| R28           | reserved for result of matrix multiplication |
| R29           | DM pointer to output matrix                  |
| R30           | 0x0000C000 base address of peripherals       |
| R31           | reserved for JAL/JR                          |

#### 3.4.2. Pre Process Function

The PRE\_PROCESS function is fairly simple. It takes a starting address and a matrix size as parameters, reads out each entry, converts each integer to FP format, and stores the results back to DM starting at address 0.

| Params:          |                                        |
|------------------|----------------------------------------|
| R3               | pointer of image matrix                |
| R4               | matrix size                            |
| Local Variables: |                                        |
| R0               | hard-wired 0                           |
| R5               | DM pointer                             |
| R6               | temp reg holding value being converted |
| R31              | reserved for JAL/JR                    |

## 3.4.3. Convolution Layer

The convolution layer is an essential piece of the Convolutional Neural Network. Research has shown that it improves the accuracy of the classification significantly. It also reduces the space required for weights. For fully connected neural networks, we observed lower accuracy and less efficient use of memory than for a Convolutional Neural Network. The convolution layer extracts useful features from the original image, using kernels that act as filters such as edge detection or Gaussian blur, and reduces the size of the input for the next layers. The image below is an example of how the convolution is performed with a 3x3 kernel: a 3x3 subset of the original image is multiplied element-wise by the corresponding elements in the kernel, and the sum of the 3x3 products is output as a pixel in the output image.



(Figure: example of a convolution with a 3x3 kernel. Image from towardsai.net.)

In our architecture, we use a 5x5 convolution kernel in each convolution layer. To perform the convolution with a 5x5 kernel, we need 5x5x#input\_channels multiplications to get one pixel of the output. The total number of output pixels depends on the size of the output image and #output\_channels. For example, if the input is 6 32x32 images (6 channels) and we want to output 16 images, then the output is 16 28x28 images, and we need to perform 16x28x28x5x5x6 multiplications and 16x28x28x5x5x6 additions to finish this convolution layer.
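To make the operation count concrete, the following Python sketch performs a multi-channel 5x5 convolution with nested loops and counts the multiplications. It is a reference model with hypothetical names and data layout (`images[c][y][x]` and `kernels[o][c][ky][kx]`), not a translation of the assembly firmware.

```python
def conv_layer(images, kernels):
    """Valid 5x5 convolution; returns (output planes, multiplication count)."""
    side_out = len(images[0]) - 4              # side_length_input - 4
    mults = 0
    output = []
    for kernel in kernels:                     # one plane per output channel
        plane = [[0.0] * side_out for _ in range(side_out)]
        for y in range(side_out):
            for x in range(side_out):
                pix_sum = 0.0                  # sum for one output pixel
                for c, image in enumerate(images):
                    for ky in range(5):
                        for kx in range(5):
                            pix_sum += image[y + ky][x + kx] * kernel[c][ky][kx]
                            mults += 1
                plane[y][x] = pix_sum
        output.append(plane)
    return output, mults
```

For the example in the text, the count 16 \* 28 \* 28 \* 5 \* 5 \* 6 evaluates to 1,881,600 multiplications, which is why the firmware reuses a small set of weight and pixel registers in a tight loop.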

To use our assembly firmware, the function needs to know the address of the input kernel, address of the image, size of the input image, input channel length, output channel length, and address of the output. We used 25 registers to implement the convolution layer. The registers and use of the registers are listed below.

| Params:          |                                                                                                             |
|------------------|-------------------------------------------------------------------------------------------------------------|
| R2               | addr_kernel (start address of kernel)                                                                       |
| R3               | addr_image (start address of image)                                                                         |
| R4               | side_length_input (input side_length)                                                                       |
| R6               | in_channel_length (repeat the convolution calculation for multiple images with the same kernel)             |
| R7               | out_channel_length (repeat the convolution calculation for same image with different kernels)               |
| R29              | addr_output (start address of output, increases by one when a pixel is calculated)                          |
| Local Variables: |                                                                                                             |
| R4               | side_length_output, same reg as input (will set to side_length_input - 4 to reflect the output side_length) |
| R5               | Set to R4 - 1 for branch purposes                                                                           |
| R8               | x_result, x location of output images (starts as 0, increases by one once a pixel is calculated, set to 0 when it reaches side_length_output) |
| R9               | y_result, y location of output images (starts as 0, increases by one when x_result reaches side_length_output) |
| R10              | pix_sum (sum of result at a result_pix)                                                                     |
| R11~R15          | 5 weight registers (shared between all 25 weights)                                                          |
| R16~R20          | 5 pix registers (shared between 25 pixels, it also stores the mult result)                                  |
| R21              | image_length (use this jump distance to switch between input channels)                                      |
| R22              | base (a temp base location for pixel load)                                                                  |
| R23              | temp (intermediate for base address calculation)                                                            |
| R24              | channel_id (a down counter to keep track of the channel id in process)                                      |
| R25              | side_length_output, same reg as input (will set to side_length_input - 4 to reflect the output side_length) |
| R26              | static copy of param out_channel_length                                                                     |

## 3.4.4. Average Pooling Layer

The average pooling layer comes after every convolution layer in our CNN. This layer has a similar function to the image compressor, but it is performed by firmware. It takes a 2\*2 pixel block from an image specified by the layer starting address, adds up the 4 pixel values, calculates the average value, and stores it back to data memory as a compressed pixel of the new image. It performs these steps for the image in every channel and generates one new image per channel. The aim of this layer is to reduce the amount of computation required while preserving the features of the image. By compressing the image, it also helps to prevent overfitting and improves the network's generalization ability.

| Params:          |                          |
|------------------|--------------------------|
| R3               | layer starting address   |
| R4               | image width              |
| R6               | number of image channels |
| R29              | output starting address  |
| Local Variables: |                          |
| R0               | hard-wired 0             |
| R2               | 0.25F                    |
| R7~R10           | pooled pixel             |
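The pooling of a single channel can be sketched in Python as shown below. The function name is hypothetical; the multiplication by 0.25 mirrors the constant held in R2, which replaces a division by 4.

```python
def avg_pool(image):
    """2*2 average pooling with stride 2 on one square image (list of rows)."""
    side = len(image) // 2
    return [
        [
            0.25 * (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
                    + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1])
            for x in range(side)
        ]
        for y in range(side)
    ]
```

Applying `avg_pool` to each of the 6 (or 16) channels in turn gives the 14\*14\*6 (or 5\*5\*16) output described in section 3.3.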

#### 3.4.5. Neural Network (Matrix Multiplication) Layer

Fully connected NN can be abstracted to loops of 1D matrix multiplications. Thus, we designed this MATRIX\_MUL function which takes two starting addresses, one pointing to the weight and the other pointing to the image (or input data of a middle layer), multiplies the two one-dimensional (flattened) matrices together, and returns the accumulated value. The caller would record and manage the real input and output dimensions and reconstruct the output matrix.

| Params:          |                                        |
|------------------|----------------------------------------|
| R2               | pointer to weight matrix               |
| R3               | pointer of image matrix                |
| R4               | matrix size                            |
| Return Value:    |                                        |
| R28              | result of matrix multiplication        |
| Local Variables: |                                        |
| R0               | hard-wired 0                           |
| R5               | intermediate mult result store address |
| R6               | image pixel value                      |
| R7               | weight value                           |
| R8               | multiplication result                  |
| R31              | reserved for JAL/JR                    |
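The division of work between MATRIX\_MUL and its caller can be illustrated in Python. This is a sketch under assumptions: the function names and the `full_layer` caller are hypothetical, memories are modeled as flat lists, and the weights of one output neuron are assumed to be stored contiguously in the weight ROM.

```python
def matrix_mul(weights, data, w_ptr, in_ptr, size):
    """Multiply two flattened vectors element-wise and return the accumulated sum."""
    acc = 0.0
    for i in range(size):
        acc += weights[w_ptr + i] * data[in_ptr + i]
    return acc

def full_layer(weights, data, w_ptr, in_ptr, in_size, out_size, out_ptr):
    """Caller side: one matrix_mul call per output neuron, results written to data."""
    for j in range(out_size):
        # Assumed layout: neuron j's weights start at w_ptr + j * in_size.
        data[out_ptr + j] = matrix_mul(weights, data, w_ptr + j * in_size, in_ptr, in_size)
```

Keeping the inner routine one-dimensional means the same function serves every fully connected layer, with only the pointers and sizes changing between calls.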

## 3.4.6. Output Layer

The output layer is responsible for making a prediction based on the input data. In our case, there are 36 possible outputs, and the output layer determines the maximum result among all the possibilities. To achieve this, the layer loops through all 36 results and compares them with the current maximum value, replacing the current maximum when necessary. Once the maximum result is found, the output layer transmits the final prediction to the SPART module, which is connected to the processor. The prediction is in the form of an ASCII value that represents the predicted character.

| Params:          |                                                    |  |  |  |
|------------------|----------------------------------------------------|--|--|--|
| R29              | base pointer to the output layer                   |  |  |  |
| Local Variables: |                                                    |  |  |  |
| R0               | hard-wired 0                                       |  |  |  |
| R2               | loop index i                                       |  |  |  |
| R3               | loop terminate condition = 35                      |  |  |  |
| R4               | current number                                     |  |  |  |
| R5               | current number - current max                       |  |  |  |
| R6               | 35 - i                                             |  |  |  |
| R7               | current max                                        |  |  |  |
| R8               | current max index                                  |  |  |  |
| R9               | 0x00000030 ASCII number offset                     |  |  |  |
| R10              | temp reg to distinguish between digits and letters |  |  |  |
| R30              | 0x0000C000 base address of peripherals             |  |  |  |
| R31              | reserved for JAL/JR                                |  |  |  |
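
The maximum search and ASCII conversion described above can be sketched in Python as follows. The class ordering (indices 0 to 9 for digits, 10 to 35 for letters) and the use of uppercase letters are assumptions made for this example; only the 0x30 digit offset comes from the register table.

```python
ASCII_DIGIT_OFFSET = 0x30   # '0', matches R9 in the table above
ASCII_LETTER_BASE = 0x41    # 'A'; uppercase output is an assumption
NUM_CLASSES = 36


def predict_char(scores):
    """Return the ASCII character for the highest of the 36 class scores."""
    if len(scores) != NUM_CLASSES:
        raise ValueError("expected exactly 36 scores")

    max_index = 0
    current_max = scores[0]
    for i in range(1, NUM_CLASSES):
        # Mirrors the firmware: the sign of (current number - current max)
        # decides whether the maximum is replaced.
        if scores[i] - current_max > 0:
            current_max = scores[i]
            max_index = i

    # Assumed class layout: 0-9 are digits, 10-35 are letters.
    if max_index < 10:
        return chr(ASCII_DIGIT_OFFSET + max_index)
    return chr(ASCII_LETTER_BASE + max_index - 10)
```

In the real design the resulting byte is written to the SPART module rather than returned.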

## 3.5. Self-checking Assembly Tests and Python Auto-Tester

#### 3.5.1. Python Auto-Tester

The Python auto-tester works together with the cpu\_tb.sv file and the asm tests. By convention, an asm test ends in an infinite loop at 0x00AD if it passes and at 0x00DD if it fails. cpu\_tb.sv waits for 300 cycles (more than most tests need), then checks whether the PC is near 0x00AD or 0x00DD and prints "Test pass" or "Test fail" accordingly.

The Python auto-tester has four modes: help, run one file, run all files, and clean. "python test.py help" displays a help message on how to use the tester. "python test.py <filename>.asm" looks for that .asm file, then assembles, compiles, and simulates it. "python test.py" does the same for every .asm file it finds. "python test.py clean" deletes all .hex files generated by test.py.

To run a test, the Python script has to modify the instr\_mem.sv file. Before making any change to this file, test.py first saves its current contents. It then runs the assembler on each .asm file under test to generate a .hex file, and substitutes the name of that .hex file into instr\_mem.sv. The script recompiles all .sv and .v files so that the file name change takes effect. Next, it simulates the project and stores the output of cpu\_tb.sv in an output file. If the file contains "pass", the script removes the output file and proceeds to the next asm file. If the file contains "fail", the script records which test failed in the output file and stops testing. After all testing is finished, the script restores instr\_mem.sv to its original content and displays the results.
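
The save, patch, simulate, and restore sequence described above might look roughly like the sketch below. It is not the actual test.py: `assemble` and `simulate` are hypothetical callables standing in for the assembler and the compile-and-simulate step, and the regular expression assumes that instr\_mem.sv refers to its program as a quoted `.hex` file name.

```python
import re
from pathlib import Path

# Assumption: instr_mem.sv names its program as a quoted "<something>.hex" string.
HEX_NAME = re.compile(r'"[^"]*\.hex"')


def run_tests(asm_files, instr_mem, assemble, simulate):
    """Run each asm test in turn and stop at the first failure.

    asm_files -- iterable of .asm file names
    instr_mem -- pathlib.Path of instr_mem.sv
    assemble  -- hypothetical callable: asm name -> generated .hex file name
    simulate  -- hypothetical callable: recompiles, simulates, and returns
                 the text printed by cpu_tb.sv
    """
    original = instr_mem.read_text()  # remember the file before touching it
    results = {}
    try:
        for asm in asm_files:
            hex_name = assemble(asm)
            patched = HEX_NAME.sub(lambda m: f'"{hex_name}"', original)
            instr_mem.write_text(patched)

            output = simulate()
            passed = "pass" in output.lower()
            results[asm] = passed
            if not passed:
                break  # stop testing after the first failing test
    finally:
        # Always restore instr_mem.sv, even if a step above raised an error.
        instr_mem.write_text(original)
    return results
```

Wrapping the loop in `try`/`finally` is one way to guarantee that instr\_mem.sv is restored even when the assembler or simulator fails unexpectedly.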

#### 3.5.2. ASM Test Coverage

Our assembly tests are designed to comprehensively validate the processor architecture, ensuring that all firmware executes correctly. To achieve this, we created unit tests for every original instruction as well as each newly introduced instruction. Every assembly test contains two infinite loops, each consisting of a branch to itself. The passing loop is located at 0x00AD, while the failing loop is at 0x00DD. Each instruction test contains several test cases, all of which branch to the failing loop when an error occurs. Only the last test case branches to the passing loop upon success. As a result, a passing unit test eventually has its Program Counter (PC) at 0x00AD, while a failing test has its PC at 0x00DD.

For instructions related to our extended ALU, our tests focus on validating the correct flags and data bypassing, rather than the actual calculations. This is because the extended ALU primarily handles floating-point arithmetic, which has already undergone extensive testing as a Verilog module itself. The floating-point module's design team has conducted over 10 million randomized tests and 256 tests with special values, demonstrating the module's correctness and accuracy.

## 4. Engineering Standards Employed in your Design

## 4.1. IEEE 1800-2009 SystemVerilog

SystemVerilog is a hardware description and verification language used to design digital systems, with the aim of improving productivity in the verification of hardware designs. It is an extension of Verilog, which includes additional features like assertions, constrained random testing, and coverage measurement. For example, we used the casex feature of SystemVerilog to implement the floating point operations. We also used SystemVerilog to validate our design.

## 4.2. IEEE-754 32-bit Floating Point Standard

IEEE-754 is a standard for floating-point arithmetic that was first published by the Institute of Electrical and Electronics Engineers (IEEE) in 1985. The standard defines formats for representing and manipulating floating-point numbers, which are used to approximate real numbers in computers. This project employs the single-precision 32-bit format to perform addition, subtraction, multiplication, and integer conversion on floating-point numbers. Compared with integer calculations, hardware support for IEEE-754 32-bit floating-point arithmetic provides a wider range of representable values and finer resolution for fractional values.
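
To make the single-precision format concrete, the following sketch uses Python's standard `struct` module to convert a value into its 32-bit word and split that word into the sign (1 bit), biased exponent (8 bits), and fraction (23 bits) fields. The function names are our own and exist only for this illustration.

```python
import struct


def float_to_word(value):
    """Return the 32-bit IEEE-754 single-precision encoding of value as an int."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def word_to_float(word):
    """Inverse of float_to_word: interpret a 32-bit word as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", word))[0]


def split_fields(word):
    """Split a 32-bit word into (sign, biased exponent, fraction)."""
    sign = word >> 31               # bit 31
    exponent = (word >> 23) & 0xFF  # bits 30..23, bias of 127
    fraction = word & 0x7FFFFF      # bits 22..0, implicit leading 1 for normal numbers
    return sign, exponent, fraction


if __name__ == "__main__":
    for value in (1.0, -2.5, 0.15625):
        word = float_to_word(value)
        print(f"{value:>8} -> 0x{word:08X} {split_fields(word)}")
```

For example, 1.0 encodes as 0x3F800000: sign 0, biased exponent 127, and fraction 0.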

## 4.3. VGA

The Video Graphics Array (VGA) protocol is an industry-standard protocol for displaying images on supported monitors. To meet the protocol's requirement for a stable pixel clock of about 25.2 MHz on a 640x480 display, we used the embedded PLL to generate a 25 MHz clock, achieving a frame rate of approximately 59 frames per second (FPS). We used a VGA Controller to manage the data and timing required by the VGA protocol. We feed our pixels directly from the SDRAM into the VGA Controller and reset the timing when the board is initialized. The VGA logic runs at half the processor clock speed, with the data to be displayed being flopped in the VGA Controller.
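
The relationship between pixel clock and frame rate can be checked with a short calculation. The sketch below assumes the commonly used 640x480 timing of 800 pixel clocks per line and 525 lines per frame (visible area plus front porch, sync pulse, and back porch); these blanking values are the usual industry figures and are not read out of our VGA Controller.

```python
# Commonly used 640x480 VGA timing (assumed, not taken from our controller).
H_VISIBLE, H_FRONT, H_SYNC, H_BACK = 640, 16, 96, 48  # in pixel clocks
V_VISIBLE, V_FRONT, V_SYNC, V_BACK = 480, 10, 2, 33   # in lines

H_TOTAL = H_VISIBLE + H_FRONT + H_SYNC + H_BACK  # 800 clocks per line
V_TOTAL = V_VISIBLE + V_FRONT + V_SYNC + V_BACK  # 525 lines per frame


def refresh_rate_hz(pixel_clock_hz):
    """Frames per second produced by a given pixel clock."""
    return pixel_clock_hz / (H_TOTAL * V_TOTAL)


if __name__ == "__main__":
    for clock in (25_175_000, 25_000_000):
        print(f"{clock / 1e6:.3f} MHz -> {refresh_rate_hz(clock):.2f} Hz")
```

With these totals, a 25 MHz clock yields about 59.5 frames per second, slightly below the roughly 59.94 Hz obtained from the nominal 25.175 MHz clock.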

## 4.4. UART (SPART)

The Universal Asynchronous Receiver-Transmitter (UART) is a widely used industry-standard protocol for transmitting data between two devices. In our project, we used the UART protocol to transmit predicted results to the monitor via a USB cable. To buffer the transmitted and received data, we added two FIFO buffers on either side of the UART signal. Each FIFO buffer can store up to eight bytes. We named the upgraded module SPART, and it is used to send predicted characters to the display.
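
A behavioral model of one eight-byte FIFO is sketched below. The class and method names are hypothetical, and the choice to reject a write when the buffer is full (rather than overwrite the oldest byte) is an assumption; the hardware may behave differently.

```python
from collections import deque


class ByteFifo:
    """Simple model of a fixed-depth byte FIFO (depth 8 as in SPART)."""

    def __init__(self, depth=8):
        self.depth = depth
        self._data = deque()

    def is_full(self):
        return len(self._data) >= self.depth

    def is_empty(self):
        return not self._data

    def push(self, byte):
        """Queue one byte; return False if the FIFO is full (assumed policy)."""
        if self.is_full():
            return False
        self._data.append(byte & 0xFF)  # keep only the low eight bits
        return True

    def pop(self):
        """Remove and return the oldest byte, or None if the FIFO is empty."""
        if self.is_empty():
            return None
        return self._data.popleft()
```

In this model the processor would `push` a predicted character into the transmit FIFO, and the UART side would `pop` bytes as the serial line becomes free.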

## 4.5. I2C

The Inter-Integrated Circuit (I2C) is a widely-used synchronous, multi-master/multi-slave (controller/target), packet-switched, single-ended, serial communication bus. In our project, we utilized the I2C protocol to adjust the exposure, pause, zoom, and brightness settings of the camera. Specifically, we used I2C\_CCD\_Config to change the camera settings. Once the camera captures an image, it is processed (using image2Gray) and stored directly into the SDRAM.

## 5. Potential Societal Impacts of Our Design

## 5.1. Computing Efficiency in both Time and Cost

This hardware implementation of a Convolutional Neural Network (CNN) can recognize a single character within 90 ms and costs less than 500 USD in total. Unlike software that relies on expensive GPUs, the FPGA is specialized and dedicated solely to this computing task. While its non-recurring engineering (NRE) cost is higher than that of software development, its deployment at scale is much less expensive, and it consumes less power than GPU-based systems. Additionally, the pipelined processor architecture and configurable assembly functions, such as matrix multiplication, convolution, and average pooling, allow the design to be extended to other image recognition tasks by introducing new algorithms.

## 5.2. Smart Monitoring and Internet of Things (IoT)

Our system offers a versatile solution for continuously monitoring readings from industrial equipment and sensors that lack external data ports. Take, for example, a multimeter in a grid substation that cannot transmit its voltage/current readings to the data center. Rather than redesigning the multimeter or hiring additional laborers, our design offers an immediate and effective solution that connects the multimeter to the IoT and makes its data available to others. Our system can also pre-process readings to generate more useful data for users, improving reliability while reducing the need for human intervention. While our solution may displace some workers, it ultimately provides greater reliability and non-stop monitoring capabilities.

## 5.3. Trend of Edge Computing Using ASIC Devices

The core paradigm behind this design is Edge Computing, which aims to bring computation and data storage closer to the location where the data is generated. By processing data locally and transmitting only the necessary data, edge devices reduce latency and bandwidth requirements, sometimes eliminating the need for data transmission. This approach improves reliability and scalability by distributing computing resources efficiently without relying on a centralized cloud infrastructure. It also enhances security by keeping critical data within a hardware system that is closer to its source. The applications of edge computing are rapidly emerging in various industries, including smart grid systems, autonomous vehicles, and industrial manufacturing.

## 6. Validation

In addition to module-level SystemVerilog testbenches and self-checking assembly tests, we also validated our top-level design with ModelSim simulations. The intermediate matrices and final scores reported by the simulations were compared to their counterparts generated by our ML software using the same fixed image hex files. Note that with fixed hex images, our design takes about 3.7 million simulation cycles to classify one character. Since our software uses 64-bit floating-point operations but our design uses the 32-bit format, small errors in the least significant digits (when represented in decimal format) are allowed. Some example Python code we used to dump weight matrices, fixed images, and intermediate matrices is shown below:

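```
# comb_test and sample_idx are defined earlier in the training script;
# comb_test[sample_idx][0] is the image (the * 255 scaling suggests values in [0, 1]).
pixels = comb_test[sample_idx][0].reshape(-1)  # flatten the image to 1D

# One line per pixel: "@<4-digit hex address> <2-digit hex value>"
with open('test_fixed_image_cnn.hex', 'w') as write_file:
    for i, pixel in enumerate(pixels):
        write_file.write(f"@{i:04x} {int(pixel * 255):02x}\n")
```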
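
The tolerance-based comparison between the 64-bit software results and the 32-bit simulation results can be sketched as follows. The function names and the tolerance values are assumptions chosen for illustration; they are not the thresholds we used during validation.

```python
import math
import struct


def to_f32(x):
    """Round a 64-bit Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mismatches(reference, hardware, rel_tol=1e-5, abs_tol=1e-6):
    """Return the indices where hardware values differ from the reference.

    reference -- scores computed in software with 64-bit floats
    hardware  -- scores read back from simulation (32-bit precision)
    rel_tol, abs_tol -- assumed tolerances; tune them for the layer being checked
    """
    if len(reference) != len(hardware):
        raise ValueError("score lists must have the same length")
    return [
        i
        for i, (ref, hw) in enumerate(zip(reference, hardware))
        if not math.isclose(ref, hw, rel_tol=rel_tol, abs_tol=abs_tol)
    ]
```

An empty result means every value agrees within tolerance. A value such as 0.1, which cannot be represented exactly in either format, still passes because the single-precision rounding error is far below the relative tolerance.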

We compared selected sets of the intermediate layers and the final 36 scores of the classes, and validated that our hardware design precisely executed all of our different ML architectures. Selected screenshots of our validation processes are shared below:

## 7. Final Application Demonstration

A demonstration video is posted on YouTube at https://youtu.be/7T7qlo2lxYQ



The images above show our FPGA-CNN model's predictions for several handwritten characters. The bottom-left screen shows the prediction result received through UART, while the bottom-right corner shows our FPGA board equipped with a camera. On the right-hand side of the image, a hand can be seen swiping between sample handwritten images on an iPad. Meanwhile, the monitor in the background displays the video feed captured by the camera. The red frame marks the region used for letter prediction. In the top-left corner of the monitor, a compressed 32x32 image of the target letter is echoed back.

## 8. Contributions of Individuals

| Justin Qiao            | Haining Qiu         | Harry Zhao             | Qikun Liu            |
|------------------------|---------------------|------------------------|----------------------|
| Team Management        | FP Adder Module     | Top Level Integration  | ML Software          |
| Assembly Firmware      | Image Processing HW | CPU Pipeline Expansion | Weight ROM Interface |
| Extended ALU Modules   | Pooling Layer ASM   | Camera Interface       | Python Auto-Tester   |
| FP Verilog Testbenches | New Inst Validation | Convolution ASM        | Old Inst Validation  |