# Hardware acceleration of video processing

**Pieter BERTELOOT** 

Supervisor: Sammy Verslype

Academic year: 2017-2018

# **Abstract**

This project is a study on accelerating software processes using hardware resources. Hardware is used to perform functions faster and more efficiently than is possible in software running on a processor. The focus in this project is placed on video processing.

A PYNQ board is used to perform the hardware acceleration. This board is equipped with a Xilinx Zynq® All Programmable System on Chip (APSoC), which combines a field-programmable gate array (FPGA) with an ARM Cortex-A9 processor and other peripherals. It can be used for parallel hardware execution, real-time signal processing, high frame-rate video processing and more.

Overlays are used to make this possible. These are programmable and configurable FPGA designs. PYNQ provides a Python interface that allows overlays to be controlled from the processing system. A custom video overlay is created that reads an incoming HDMI signal, processes it and sends it to the HDMI output.

The processing of the signal is done in a custom Intellectual Property (IP) block, which is made using Vivado High-Level Synthesis (HLS). This tool automates the design process by implementing C code in hardware logic. HLS provides extensive libraries for data types, video processing, DSP and much more, which can be used to accelerate and optimize projects. Directives can be given for user-specified optimizations such as pipelining and loop optimizations.

In this project, a wide variety of IPs are made, starting with basic functions to get to know HLS and moving on to more complex functions such as edge detection. Edge detection identifies points in a frame where discontinuities occur. To perform this edge detection, the Sobel filter is used. This filter is based on convolving a frame with a specific kernel. The kernel defines the direction in which edges are detected (X or Y). Edge detection is used for image segmentation (dividing an image into multiple parts) and data extraction for computer/machine vision.

To speed up the IP creation and optimization, a test bench is created that makes C simulation possible. This simulation loads a local image, creates an AXI stream with the image data and feeds this to the IP. The IP outputs an image which can be used to check that the functionality is correct.

Finally, a comparison is made between a software and a hardware implementation of a video filter. The conclusion is that it is possible to accelerate software processes using the PYNQ board. The video signal can be processed while keeping the frame rate of 60 fps, with only a low latency between the input and the output signal.

All created IPs, accompanied by input and output images, are described in this report and can be used as a beginner's guide for creating PYNQ overlays.


# **Table of Contents**

- Abstract
- 1 Introduction
  - 1.1 Objective
- 2 Overlays
  - 2.1 Base overlay
  - 2.2 Rebuilding the base overlay
  - 2.3 Creating our first IP
- 3 Video processing
  - 3.1 Video signal
    - 3.1.1 Frontend
    - 3.1.2 VDMA
    - 3.1.3 HDMI out
    - 3.1.4 Video pipeline
  - 3.2 Processing the signal
    - 3.2.1 Our first video processing: Screen splitter
    - 3.2.2 C simulation and test bench
    - 3.2.3 Screen splitter 2
- 4 Video filters
  - 4.1 Edge detection
  - 4.2 Sobel X or Y
  - 4.3 Sobel X + Y
  - 4.4 A system using these filters
  - 4.5 Sobel negative values
    - 4.5.1 Visualizing the positive and negative values
- 5 Comparison software vs hardware
- 6 Conclusion
- 7 Future work

# List of Figures

- Figure 2-1: Vivado Address Editor
- Figure 2-2: Video overlay
- Figure 2-3: Position add IP
- Figure 2-4: Assign IP address
- Figure 3-1: Xilinx video protocol signals
- Figure 3-2: Connection video filter
- Figure 3-3: placement IP
- Figure 3-4: Result Screensplitter
- Figure 3-5: Result C simulation
- Figure 3-6: Result Screensplitter 2
- Figure 4-1: Sobel kernel X and Y
- Figure 4-2: Result Sobel X or Y
- Figure 4-3: Input information loss
- Figure 4-4: Result information loss
- Figure 4-5: block diagram sobel X+Y
- Figure 4-6: HLS dataflow
- Figure 4-7: Result Sobel X + Y
- Figure 4-8: Block diagram system
- Figure 4-9: Result system using filters
- Figure 4-10: image types
- Figure 4-11: Result Sobel with negative values
- Figure 4-12: Resources 10 bit
- Figure 4-13: Resources 16 bit
- Figure 4-14: Resources 1 channel, 10 bit

# 1 Introduction

# 1.1 Objective

The objective of this project is to study the use of FPGA hardware resources to accelerate software processes, with a focus on video processing. All elements of how the project is realized are discussed in this report, which can be used as a beginner's guide for new PYNQ users. All HLS, bit, tcl and Python files are available on GitHub:

https://github.com/Pieter-Berteloot/PYNQ\_Projects

This project makes use of the PYNQ-Z1 board, the hardware platform for the PYNQ open-source framework. The board includes ARM Cortex-A9 CPUs that run the following software:

- Linux
- Python
- Jupyter Notebook
- Hardware libraries and API for the FPGA

These are used to create a user-friendly and customizable video processing system.

Hardware libraries are programmable logic circuits and are called overlays. They are used like software libraries: the programmer can select the one that best matches the application. The advantage of using overlays is that once an overlay is built, it can be reused in other applications.

# 2 OVERLAYS

Overlays are programmable and configurable FPGA designs. These are used to accelerate software applications. PYNQ provides a Python interface that allows overlays to be controlled from the processing system.

An overlay includes:

- Bitstream file: contains the programming information for the FPGA.
- TCL file: determines the available IPs.
- Python API: handles the configuration of and communication with the IPs.

The default base overlay is loaded at boot time on the PYNQ board. This overlay can be replaced with other overlays while the system is running.

# 2.1 Base overlay

The base overlay allows PYNQ to use the peripherals (video, audio, GPIOs, ...) that are on the board. It connects the IP blocks to the Zynq processing system. These peripherals can then be used from the Python environment. Let's now look at what's inside the base overlay. To do this, rebuild the overlay following these steps:

- First clone/download the board files and overlays from the PYNQ github page: https://github.com/Xilinx/PYNQ.
- Open Vivado Design Suite (for this project Vivado 2016.2 is used) and run the following code in the TCL console:

```
cd <PYNQ repository>/boards/Pynq-Z1/base
vivado -mode batch -source build_base_ip.tcl
vivado -mode batch -source base.tcl
```

- Wait until both scripts have finished (this will take some time). When this is done, the base overlay can be found in:

```
<PYNQ repository>/boards/Pynq-Z1/base/base
```

This base overlay will be used as starting point for our project because it already defines all the configurations needed for the processing system interface and the peripherals. An important part in the block design of the overlay is the processing system AXI peripherals. This is a General-Purpose AXI-Lite interface (GP0) that controls and configures IP blocks in the design and runs on a 100MHz clock.

# 2.2 Rebuilding the base overlay

Every peripheral can be found in the base overlay. Routing all these blocks costs build time and hardware resources, so only the components that are needed for video streaming/processing are kept. Our edited base overlay is shown in Figure 2-2.

The following blocks are needed for video streaming:

- AXI interface
- ZYNQ processing system
- System interrupts
- Reset processing system fclk0
- Reset processing system fclk1
- Video

The rest of the IP blocks have been removed from the block design. To prevent errors, the deleted input and output signals are also removed from the top.v file, which can be found in the Project Manager.

The **Address Editor** (Figure 2-1) is also a very important subject in the IP Integrator. The offset Address and range of each IP is displayed here. This address will later be used for **Memory-mapped I/O** (MMIO). When a new IP that has AXI-Lite communication is added, the user has to map the IP to give it an Address.



Figure 2-1: Vivado Address Editor



Figure 2-2: Video overlay

# 2.3 Creating our first IP

When it's known how to create, edit and communicate with the overlay, it's possible to create our own IP. Let's start with a simple adder. The objective is to make an IP that has 2 integer values as input and 1 integer value as output. The user provides these 2 integers and the IP calculates the sum.

For constructing this IP, Vivado High-Level Synthesis (HLS) is used. It speeds up IP creation by transforming C, C++ and SystemC code into RTL code (VHDL or Verilog).

When creating a new project using the PYNQ-Z1 board, select the xc7z020clg400-1 board part.

Let's analyze the following code:

```
#include <ap_fixed.h>
#include <ap_int.h>

// TOP function: every argument becomes an interface of the IP
void add_function(int a, int b, int *c) {
// Map the block-level control signals and all arguments to one AXI-Lite bundle
#pragma HLS INTERFACE s_axilite port=return bundle=control
#pragma HLS INTERFACE s_axilite port=a bundle=control
#pragma HLS INTERFACE s_axilite port=b bundle=control
#pragma HLS INTERFACE s_axilite port=c bundle=control

    // Functionality of the IP
    *c = a + b;
}
```

First, start with importing C++ libraries so the Fixed-Point Data Types and Integer Data Types can be used.

After this comes the TOP function. This function is very important because its arguments are the interfaces of the IP. These will become ports in the RTL design, and directives can be specified to select the I/O protocol of each port. The AXI-Lite protocol is used for communication with the IP. Pragmas are used to tell the tool that the AXI-Lite protocol needs to be used.

Finally, the functionality of the IP is programmed. When this is done, C synthesis and RTL export can be performed.

Now import the IP into our overlay. To do this, add the IP repository to the IP catalog (Project Manager -> IP Catalog). Once this is done, open the block design and add the IP as shown in Figure 2-3. Connect the AXI control input to an open AXI connection on the PS AXI periph.



Figure 2-3: Position add IP

Assign an address to our IP in the Address Editor. Figure 2-4 shows how this is done.



Figure 2-4: Assign IP address

In my case the **offset address** is 0x43C8\_0000. Now let's check what's inside this address range. In the HLS project, open **add\_function\_control\_s\_axi.vhd** (solution1 -> syn -> vhdl). Scroll down until you see the Address Information. Table 2-1 shows the registers that are important for us.

**Table 2-1 Address information**

| Address | Name             | Function                                                            |
|---------|------------------|---------------------------------------------------------------------|
| 0x00    | Control signals  | Controls the IP: bit 0 starts the IP, bit 1 is high when it is done |
| 0x10    | Data signal of a | Stores integer a                                                    |
| 0x18    | Data signal of b | Stores integer b                                                    |
| 0x20    | Data signal of c | Stores integer c                                                    |
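The register map in the address table above can be tried out in software before a bitstream exists. The following is a minimal sketch in plain Python; the class name `FakeAdderMMIO` is hypothetical (it is not part of PYNQ), and the assumption that the sum is available immediately after the start bit is written is a simplification of the real IP.

```python
# Software stand-in for the adder IP register map (offsets from the address table).
# FakeAdderMMIO is a hypothetical helper for illustration, not a PYNQ class.
CTRL, REG_A, REG_B, REG_C = 0x00, 0x10, 0x18, 0x20
AP_START, AP_DONE = 0x1, 0x2  # bit 0 starts the IP, bit 1 signals done


class FakeAdderMMIO:
    def __init__(self):
        self.regs = {CTRL: 0, REG_A: 0, REG_B: 0, REG_C: 0}

    def write(self, offset, value):
        if offset == CTRL:
            if value & AP_START:
                # Assumption: the addition finishes at once and wraps to 32 bits
                total = self.regs[REG_A] + self.regs[REG_B]
                self.regs[REG_C] = total & 0xFFFFFFFF
                self.regs[CTRL] = AP_DONE
        else:
            self.regs[offset] = value & 0xFFFFFFFF

    def read(self, offset):
        return self.regs[offset]


adder = FakeAdderMMIO()
adder.write(REG_A, 3)         # integer a
adder.write(REG_B, 5)         # integer b
adder.write(CTRL, AP_START)   # start the IP
result = adder.read(REG_C)    # integer c
print("Integer c =", result)
```

The `write` and `read` methods deliberately take an offset and a value in the same order as the notebook code further on, so the same sequence of calls can be rehearsed without the board.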

Now it's possible to generate our BIT and TCL files. Generate the bitstream and run the following command in the Tcl console:

`write_bd_tcl top.tcl`

Note: generating the BIT file can take some time.

Once this is done, copy the BIT and TCL files to the following location on the PYNQ board:

`\\192.168.2.99\xilinx\pynq\overlays\base`

And run the following code in a notebook:

```
from pynq import Overlay
from pynq import MMIO
from pynq.lib.video import *

# Import overlays: load the custom bitstream onto the FPGA
base = Overlay("/home/xilinx/pynq/overlays/base/top.bit")
base.download()

# Adder MMIO: map the adder IP (offset address from the Address Editor)
add_example = MMIO(0x43C80000, 0x10000)

# Fill int a and b
add_example.write(0x10, 3)
print("Integer a:", add_example.read(0x10))
add_example.write(0x18, 5)
print("Integer b:", add_example.read(0x18))
# Output:
# Integer a: 3
# Integer b: 5

# Start AP: set bit 0 of the control register
add_example.write(0x00, 1)

# Read output
print("Integer c=", add_example.read(0x20))
# Output:
# Integer c= 8
```

After successfully creating and using the overlay, it's possible to use this approach to create more complex systems. The next chapter will focus on creating the Sobel edge detection filter in a video stream.

# 3 VIDEO PROCESSING

# 3.1 Video signal

Before the video can be processed, a closer look is taken on how the video signal is transmitted in the base overlay. Here the video processing can be split into different parts:

- HDMI in
  - Frontend
    - Dvi2RGB decoder
    - Video in to axi4-stream
  - Color\_convert
  - Pixel\_pack
- VDMA
- HDMI out
  - Pixel\_unpack
  - Color\_convert
  - Frontend

### 3.1.1 Frontend

The HDMI signal is transmitted using transition-minimized differential signaling (TMDS). The DVI to RGB video decoder decodes this signal and transforms it to an RGB signal. This IP outputs a 24-bit RGB signal with V- and H-sync signals. The video in to axi4-stream IP converts this signal to the **Xilinx video protocol**.

| Function       | Width               | Direction | AXI4-Stream Signal Name | Video Specific Name |
|----------------|---------------------|-----------|-------------------------|---------------------|
| Video Data     | Any number of bytes                            | Out | m_axis_video_tdata  | DATA  |
| Valid          | 1                                              | Out | m_axis_video_tvalid | VALID |
| Ready          | 1                                              | In  | m_axis_video_tready | READY |
| Start Of Frame | 1                                              | Out | m_axis_video_tuser  | SOF   |
| End Of Line    | 1                                              | Out | m_axis_video_tlast  | EOL   |

Figure 3-1: Xilinx video protocol signals

The following signals are important for us:

- Video data: contains the video data, which is 24 bits (8 bits for each color).
- Start of Frame: indicates that the first pixel of a new frame is transmitted.
- End of Line: indicates that the last pixel of a line is transmitted.
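To make the role of the Start of Frame and End of Line flags concrete, the sketch below walks over a tiny frame in raster order and attaches the `user` (SOF) and `last` (EOL) flag to every pixel. The function name `frame_to_stream` and the 2x3 test frame are made up for illustration; the handshake signals (valid/ready) are left out.

```python
# Generate (data, user, last) tuples for a small frame in raster order.
# user = Start Of Frame (first pixel of the frame), last = End Of Line.
def frame_to_stream(frame):
    stream = []
    for row_idx, row in enumerate(frame):
        for col_idx, pixel in enumerate(row):
            sof = 1 if (row_idx == 0 and col_idx == 0) else 0
            eol = 1 if col_idx == len(row) - 1 else 0
            stream.append((pixel, sof, eol))
    return stream


# A hypothetical 2x3 frame; each pixel is a 24-bit value (8 bits per color)
frame = [[0xFF0000, 0x00FF00, 0x0000FF],
         [0x000000, 0x808080, 0xFFFFFF]]

beats = frame_to_stream(frame)
for data, user, last in beats:
    print(f"data=0x{data:06X} user={user} last={last}")
```

Only the very first pixel carries `user=1`, while `last=1` appears once per line, which is what lets a downstream IP keep track of rows and columns without knowing the resolution in advance.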

More information about this protocol can be found at: https://www.xilinx.com/support/documentation/ip\_documentation/axi\_videoip/v1\_0/ug934\_axi\_videoIP.pdf

### 3.1.2 VDMA

The video direct memory access (VDMA) IP is designed to allow efficient high-bandwidth access between the AXI4-Stream video interface and the AXI4 interface. This IP reads frames from and writes frames to memory.

### 3.1.3 HDMI out

HDMI out mirrors HDMI in: it transforms the Xilinx video protocol back to an HDMI signal.

### 3.1.4 Video pipeline

For more information about the video pipeline and how to use it, see the hdmi\_video\_pipeline example notebook.

# 3.2 Processing the signal

The video processing will take place in the HDMI-in package, as shown in Figure 3-2.



Figure 3-2: Connection video filter

### 3.2.1 Our first video processing: Screen splitter

Let's start with a simple video processing system. In this project an IP is created that:

- Splits the screen in 2 parts
  - First part: Full original image passes
  - Second part: Only the red component passes
- The column on which the split occurs can be set in real time

All the code can be found on GitHub:

https://github.com/Pieter-Berteloot/PYNQ\_Projects/tree/master/Video%20Processing/Split

### Implementation:

Let's start making our IP in High Level Synthesis. First, define some types.

Now define what input and output signals are used. Note that the signals are the same as defined in 3.1.1.

```
struct video_stream {
    struct {
        pixel_type p1;
        pixel_type p2;
        pixel_type p3;
    } data;
    ap_uint<1> user;
    ap_uint<1> last;
};
```

After this, create the TOP function. This project uses a video signal input and an integer. The output is also a video stream.

```
void split_ip(video_stream* in_data, video_stream* out_data, int a) {
```

The program doesn't know what interfaces these input and output signals must use. Pragmas are used to define this. An axis interface is used for the video stream and an axilite interface for the integer.

```
#pragma HLS INTERFACE axis port=in_data
#pragma HLS INTERFACE axis port=out_data
#pragma HLS INTERFACE s_axilite port=a
#pragma HLS INTERFACE ap_ctrl_none port=return
```

The pixels are sequentially streamed. Every time a new pixel is received, the EOL and SOF signal are put directly to the output. The data line is stored in temp variables.

```
comp_type in1, in2, in3, out1, out2, out3;
out_data->user = in_data->user;
out_data->last = in_data->last;

in1.range() = in_data->data.p1;
in2.range() = in_data->data.p2;
in3.range() = in_data->data.p3;
```

When the data is read, it is checked in which column the pixel is located. The column counter is reset when the end of line signal is high.

```
if(col <= a) {
    out1 = in1;
    out2 = in2;
    out3 = in3;
} else {
    out1 = in1;
    out2 = 0;
    out3 = 0;
}
if(in_data->last)
    col = 0;
```

After this, the out variable can be assigned to the output stream.

```
out_data->data.p1 = out1.range();
out_data->data.p2 = out2.range();
out_data->data.p3 = out3.range();
```

Synthesize and export the IP, then add it to the Vivado block design as shown in Figure 3-3:



Figure 3-3: placement IP

**Note**: Don't forget to connect the axilite interface to the axi\_interconnect IP and map the new IP in the address editor.

Generate the tcl and bit files and import the overlay in PYNQ, using the same method as for the adder IP. Write a value to 0x10 to define where the split should be. This gives the following result:



Figure 3-4: Result Screensplitter
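The column logic of the screen splitter can also be checked in software. The sketch below is a plain-Python model of one video line, not the HLS code itself: pixels are `(p1, p2, p3)` tuples, and the function name `split_line` and the test values are chosen for illustration only. The per-line restart of the column counter (done on End of Line in the IP) is modelled by enumerating each line from zero.

```python
# Software model of the screen splitter for a single video line.
# Columns 0..a pass unchanged; after that only the first component is kept.
def split_line(line, a):
    out = []
    for col, (p1, p2, p3) in enumerate(line):
        if col <= a:
            out.append((p1, p2, p3))   # full original pixel
        else:
            out.append((p1, 0, 0))     # only the first component passes
    return out


line = [(10, 20, 30)] * 4          # hypothetical 4-pixel line
result = split_line(line, 1)       # split after column 1
print(result)
```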

### 3.2.2 C simulation and test bench

Building the overlays takes a lot of time. Let's change the code and make a test bench in HLS so it is possible to simulate instead of creating the overlay. The hls\_video and hls\_opencv libraries are used to make this simulation possible. The following changes are made:

- **Input and output type**

  For input and output the AXI\_STREAM type is used. This makes use of hls::stream. An hls::stream object can be used to store data samples in the same manner as an array, but the data in an hls::stream can only be accessed sequentially. In the C code, the hls::stream behaves like a FIFO of infinite depth.

  Multiple reads of the same data from an hls::stream are impossible. Once the data has been read from an hls::stream, it no longer exists in the stream.

- **Use of hls::Mat**

  hls::Mat represents an image in the HLS Video Library. For the programmer it can be seen as a frame. In hardware it is implemented the same way as a stream (with a FIFO).

- **A separate split function that can be called in the TOP function**

  Implement the functionality of 3.2.1 in a function.

- **The use of AXIvideo2Mat and Mat2AXIvideo**

  These functions convert the video input to a Mat object and convert the Mat object to the output stream. They handle all the EOL and SOF signals.

- **A test bench and H file are created**

  The test bench loads an image, converts it to an AXI stream, sends it through our IP and converts the stream back to an image.
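The read-once behaviour of a stream described above can be illustrated with an ordinary FIFO. The sketch below uses Python's `collections.deque` as a stand-in; it only mimics the semantics and is not how the HLS library is implemented.

```python
from collections import deque

# A deque used as a FIFO mimics the read-once behaviour of a stream.
stream = deque()
for sample in (1, 2, 3):
    stream.append(sample)      # write side, comparable to "stream << sample"

first = stream.popleft()       # read side, comparable to "stream >> value"
second = stream.popleft()
remaining = len(stream)        # samples that have not been read yet

print(first, second, remaining)
```

After the two reads, the values 1 and 2 are gone from the FIFO. A function that needs the same pixel twice therefore has to keep its own copy, which is why a separate copy step is needed whenever one image feeds two filters.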

As always: all code can be found on github:

https://github.com/Pieter-Berteloot/PYNQ\_Projects/tree/master/Video%20Processing/C%20simulation

Let's first look at the split function:

```
void split(
        RGB_IMAGE& img_in,
        RGB_IMAGE& img_out,
        int index) {
    RGB_PIX pin;
    RGB_PIX pout;
    L_row: for(int row = 0; row < 1080; row++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1080
        L_col: for(int col = 0; col < 1920; col++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1920
#pragma HLS loop_flatten off
#pragma HLS PIPELINE II=1
            img_in >> pin;
            if(col <= index) {
                // Left part: pass the full pixel
                pout.val[0] = pin.val[0];
                pout.val[1] = pin.val[1];
                pout.val[2] = pin.val[2];
            } else {
                // Right part: keep only the first channel
                pout.val[0] = pin.val[0];
                pout.val[1] = 0;
                pout.val[2] = 0;
            }
            img_out << pout;
        }
    }
}
```

The functionality of this function is the same as in 3.2.1. The only difference is that it uses RGB\_IMAGE objects as input and output. These are hls::Mat objects and can be seen as frames. In the function, 2 nested loops iterate over all the pixels in the frame.

There are also 2 important pragmas: loop\_flatten and PIPELINE. Loop flattening allows nested loops to be collapsed into a single loop with improved latency; in this function it is switched off for the inner loop with loop\_flatten off. Pipelining reduces the initiation interval of a function or loop by allowing the concurrent execution of operations.

More information about pragmas can be found here:

Pipeline: https://www.xilinx.com/html\_docs/xilinx2018\_1/sdsoc\_doc/oyc1517254361139.html

Loop flatten: https://www.xilinx.com/html\_docs/xilinx2018\_1/sdsoc\_doc/hid1517254361170.html
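The effect of the initiation interval (II) on the time needed per frame can be estimated with a simple rule of thumb: a pipelined loop takes roughly `depth + II * (trip_count - 1)` clock cycles. The sketch below applies this to one video line and then to a full 1080p frame. The iteration depth of 4 cycles is an assumed value for illustration, and loop entry/exit overhead is ignored; the real figures come from the HLS synthesis report.

```python
# Rough cycle estimate for a pipelined loop: depth + II * (trip_count - 1).
# 'depth' (latency of one iteration) is an assumed value for illustration.
def pipelined_cycles(trip_count, ii=1, depth=4):
    return depth + ii * (trip_count - 1)


cols, rows = 1920, 1080

# The inner (column) loop is pipelined and runs once per row
line_cycles = pipelined_cycles(cols, ii=1)
frame_cycles = rows * line_cycles

# The same frame when a new pixel can only start every 2 cycles
frame_cycles_ii2 = rows * pipelined_cycles(cols, ii=2)

print(line_cycles, frame_cycles, frame_cycles_ii2)
```

Doubling the initiation interval roughly doubles the cycles per frame, which is why II=1 is requested for every pixel loop in this report.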

Once the split function is made, it can easily be called in our top function:

```
// Convert AXI4 Stream data to hls::mat format
hls::AXIvideo2Mat(in_data, img_0);

//call the split function
split(img_0, img_1, a);

//Convert the mat to Axi video stream
hls::Mat2AXIvideo(img_1, out_data);
```

The H file is made so the project can be included into a test bench. The H file is self-explanatory and can be found on github. It is important to define the input and output image here.

```
#define INPUT_IMAGE "test_1080p.bmp"
#define OUTPUT_IMAGE "test_output_1080p.bmp"
```

Also place the input image in the HLS project directory. The test bench code is:

The OpenCV functions are used to load, convert and save images. When this is all done, run the C simulation and an image should appear in the following directory:

`<hls project>\solution1\csim\build`

### The result:



Figure 3-5: Result C simulation

Notice that the output is now blue instead of red. This is because OpenCV uses BGR channel order by default, while RGB is used in the hardware examples.

**Note**: for C simulations, streams have an **infinite** depth while in hardware they have a depth of **1** by default. This can result in a difference in results when latency is important.

### 3.2.3 Screen splitter 2

The next step is performing operations on the pixel data. The objective is to change the original screen splitter so that one part of the screen is RGB and the other part is gray. To convert RGB to gray, the following formula is used:

```
gray = 0.2989 * R + 0.587 * G + 0.114 * B
```

All 3 colors are set to this gray value so it can be displayed on a monitor. First, a type to represent the coefficients is needed. This is done with ap\_fixed:

```
typedef ap_fixed<10,2, AP_RND, AP_SAT> coeff_type;
```

This defines a 10-bit variable with 2 bits representing the numbers above the decimal point and 8 bits representing the value below the decimal point. The variable is signed, rounding to plus infinity and uses saturation for overflows.

This is used for:

```
coeff_type const1 = 0.114;
coeff_type const2 = 0.587;
coeff_type const3 = 0.2989;
```

The result of the calculation also needs to be stored.

```
char gray;
```

Now the gray value can be calculated. Change the code from the original split to the following:

```
gray = const1 * pin.val[0] + const2 * pin.val[1] + const3 * pin.val[2];
pout.val[0] = gray;
pout.val[1] = gray;
pout.val[2] = gray;
```
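The effect of storing the coefficients with only 8 fractional bits can be checked with integer arithmetic. The sketch below is a software approximation under two assumptions: the coefficients are rounded to the nearest step of 1/256 (ties toward plus infinity), and the final conversion to an integer gray value simply drops the fractional bits. It is not a bit-accurate model of the HLS data types.

```python
import math

FRAC_BITS = 8  # 10 bits in total, 2 integer bits, 8 fractional bits


def quantize(value):
    # Round to the nearest multiple of 2^-8, ties toward plus infinity
    return math.floor(value * (1 << FRAC_BITS) + 0.5)


# Raw integer representation of the three coefficients
c1, c2, c3 = quantize(0.114), quantize(0.587), quantize(0.2989)


def gray(v0, v1, v2):
    # Integer multiply-accumulate, then drop the 8 fractional bits
    # (assumption: the conversion to an integer truncates)
    return (c1 * v0 + c2 * v1 + c3 * v2) >> FRAC_BITS


print(c1, c2, c3, c1 + c2 + c3)
print(gray(255, 255, 255))
print(gray(0, 0, 0))
```

With this rounding, the three quantized coefficients add up to exactly 256/256, so a fully white input pixel still maps to the maximum gray value of 255.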

Because the interfaces have changed, the inputs and outputs of the IP have changed too. Make sure that every input and output is properly connected.

### The result:



Figure 3-6: Result Screensplitter 2

# 4 VIDEO FILTERS

It's now possible to read an incoming signal, process it and write it back to the HDMI output. Now it's time to implement filters on the video signal. Let's start with a basic Sobel filter in the x or y direction. hls::Sobel makes use of the hls::Filter2D function.

# 4.1 Edge detection

Edge detection is used to identify and locate discontinuities in an image. These can be detected by checking the change in pixel intensity (high pass filter). The most common way to detect these is by convolving the image with a kernel. Where there is no drastic change, this returns a low or zero value. The type of edge detected depends on the kernel used. In this project the Sobel kernels are used:



Figure 4-1: Sobel kernel X and Y

These kernels are used to detect edges vertically and horizontally. They can also be combined so it can detect vertical and horizontal edges. The magnitude is calculated with this formula:

$$XY = \sqrt{X^2 + Y^2}$$

Because this is difficult to implement in hardware, a simplified formula is used:

$$XY = |X| + |Y|$$
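The difference between the two formulas can be seen with a few numbers. The sketch below compares the exact magnitude with the simplified sum of absolute values; the function names and the sample gradient values are chosen for illustration.

```python
import math


def magnitude_exact(x, y):
    # Needs two multiplications and a square root
    return math.sqrt(x * x + y * y)


def magnitude_approx(x, y):
    # Needs only absolute values and one addition
    return abs(x) + abs(y)


for x, y in [(1020, 0), (0, -1020), (300, 400)]:
    print(x, y, magnitude_exact(x, y), magnitude_approx(x, y))
```

For a purely horizontal or vertical edge both formulas give the same value. For diagonal edges the simplified formula overestimates, but never by more than a factor of the square root of 2, which is acceptable for visualizing edges.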

# 4.2 Sobel X or Y

First, the split function in 3.2.3 is modified to a RGB2Gray function.

```
void RGB2Gray(
        RGB_IMAGE& img_in,
        RGB_IMAGE& img_out) {
    RGB_PIX pin;
    RGB_PIX pout;
    char gray;
    L_row: for(int row = 0; row < 1080; row++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1080
        L_col: for(int col = 0; col < 1920; col++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1920
#pragma HLS loop_flatten off
#pragma HLS PIPELINE II=1
            img_in >> pin;
            // Weighted sum of the three channels
            gray = const1 * pin.val[0] +
                   const2 * pin.val[1] +
                   const3 * pin.val[2];
            // Write the gray value to all three channels
            pout.val[0] = gray;
            pout.val[1] = gray;
            pout.val[2] = gray;
            img_out << pout;
        }
    }
}
```

This function takes an RGB image as input and converts it to a gray image. Another function is needed to calculate the Sobel filter in the X or Y direction. It is built on hls::Sobel, which has the following template:

```
template<int XORDER, int YORDER, int SIZE, int ROWS, int COLS, int SRC_T,
int DROWS, int DCOLS, int DST_T>
```

When XORDER=1 and YORDER=0 it computes the horizontal derivative and the other way around for the vertical derivative. SIZE is the kernel size.

After implementing this, the result should be:



Input



Figure 4-2: Result Sobel X or Y

**Note**: Our RGB image type is HLS\_8UC3:

`typedef hls::Mat<1080,1920, HLS_8UC3> RGB_IMAGE;`

This is an 8-bit unsigned char with 3 channels (R, G, B). This means that **negative values** cannot be represented in our image, which leads to information loss. To demonstrate this, take the following picture as input:



Figure 4-3: Input information loss

In this input, going from left to right, there is a transition from perfect black <0,0,0> to perfect white <255,255,255>, followed by a transition from perfect white back to perfect black.

When this transition happens, the following calculation is made:

$$pixel\ left\ transition = \begin{bmatrix} 0 & 255 & 255 \\ 0 & 255 & 255 \\ 0 & 255 & 255 \end{bmatrix} * \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} = 1 * 255 + 2 * 255 + 1 * 255 = 1020$$

$$pixel\ right\ transition = \begin{bmatrix} 255 & 255 & 0 \\ 255 & 255 & 0 \\ 255 & 255 & 0 \end{bmatrix} * \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} = -1 * 255 - 2 * 255 - 1 * 255 = -1020$$

A value of 1020 is saturated to 255 because an unsigned char ranges from 0 to 255. The value -1020 cannot be represented with an unsigned char, so it becomes 0 and no line is drawn.



Figure 4-4: Result information loss
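The two calculations above can be reproduced in a few lines. The sketch below multiplies a 3x3 pixel window with the Sobel X kernel element by element, sums the products, and then clamps the result to the range of an unsigned char. The helper names are made up for illustration, and clamping to 0..255 is assumed as the conversion behaviour.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def apply_kernel(window, kernel):
    # Multiply element by element and sum: the value of one output pixel
    return sum(window[r][c] * kernel[r][c]
               for r in range(3) for c in range(3))


def saturate_u8(value):
    # Assumed conversion to unsigned char: clamp to 0..255
    return max(0, min(255, value))


left = [[0, 255, 255]] * 3     # black-to-white transition
right = [[255, 255, 0]] * 3    # white-to-black transition

left_raw = apply_kernel(left, SOBEL_X)
right_raw = apply_kernel(right, SOBEL_X)

print(left_raw, saturate_u8(left_raw))
print(right_raw, saturate_u8(right_raw))
```

Both edges have the same strength, but only the black-to-white transition survives the conversion to an unsigned value; the white-to-black transition ends up as 0 and disappears from the output image.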

# 4.3 Sobel X + Y

The following block diagram is used to calculate the sobel function in x and y direction:



Figure 4-5: block diagram sobel X+Y

To display the horizontal and vertical edges, the sum must be taken from Sobel X and Sobel Y. The following formula should be used:

$$XY = |X| + |Y|$$

In our example, the following formula is used because there are no negative values in the unsigned char data type.

$$XY = X + Y$$

First an add function is made so this formula can be used on an image.

```
void add(RGB_IMAGE& img_in0, RGB_IMAGE& img_in1, RGB_IMAGE& img_out) {
    RGB_PIX pin0, pin1;
    RGB_PIX pout;
    L_row: for(int row = 0; row < 1080; row++) {
#pragma HLS LOOP_TRIPCOUNT min=720 max=1080
        L_col: for(int col = 0; col < 1920; col++) {
#pragma HLS LOOP_TRIPCOUNT min=1280 max=1920
#pragma HLS loop_flatten off
#pragma HLS PIPELINE II=1
            // Read one pixel from each input image
            img_in0 >> pin0;
            img_in1 >> pin1;
            // Add the two pixels channel by channel
            pout = (pin0 + pin1);
            img_out << pout;
        }
    }
}
```

This function adds 2 images together. It can be used to add the Sobel X and Y results, which are obtained using the Sobel function from 4.2. As seen in the block diagram in Figure 4-5, the input image must be duplicated so that both Sobel functions can be used on the same image.

The duplication is done by a copy2 function, which is similar to the add function. The only difference is that it has 1 input and 2 outputs. Now these functions can be used in our TOP function.
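The behaviour of the duplicate and add steps can be sketched as a software model. The following is plain Python with `collections.deque` standing in for the pixel streams. It is not the HLS implementation. Single integers stand in for pixels, and clamping the sum to 255 is an assumption.

```python
from collections import deque


def copy2(stream_in, stream_out0, stream_out1):
    """Read every pixel once and write it to both output streams."""
    while stream_in:
        pixel = stream_in.popleft()
        stream_out0.append(pixel)
        stream_out1.append(pixel)


def add(stream_in0, stream_in1, stream_out):
    """Add two streams pixel by pixel, clamping to 255 (assumed saturation)."""
    while stream_in0 and stream_in1:
        total = stream_in0.popleft() + stream_in1.popleft()
        stream_out.append(min(255, total))


source = deque([10, 200, 30])
a, b, result = deque(), deque(), deque()
copy2(source, a, b)
add(a, b, result)
print(list(result))  # [20, 255, 60]
```

Each stream is consumed exactly once. This is why the input has to be duplicated before two functions can read the same image.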

```
// Convert AXI4 Stream data to hls::mat format
hls::AXIvideo2Mat(in_data, img_0);

//Convert to gray image
RGB2Gray(img_0, img_1);

//copy the input image
copy2(img_1, img_2, img_3);

//sobel functions
sobel(img_2, img_4, 0);
sobel(img_3, img_5, 1);

//add sobel x and y
add2(img_4, img_5, img_6);

//Convert the mat to Axi video stream
hls::Mat2AXIvideo(img_6, out_data);
```

Note: Don't forget to declare the additional RGB images (img\_2 to img\_6) used here.

It's always good to have a look in the Analysis tab. The schedule and dataflow can be seen here. This is useful to check your system before building it.



Figure 4-6: HLS dataflow



Figure 4-7: Result Sobel X + Y

## 4.4 A system using these filters

Now it's time to make a system with these filters. Let's try to make a system that is the sum of

- The original frame
- Sobel X
- Sobel Y

The brightness of all the separate frames can be lowered and also the color mode can be chosen.



Figure 4-8: Block diagram system

Let's have a look at the colorMode function. This function has 4 modes:

- RGB passthrough
- RGB2Gray
- Screen split in half: left RGB, right Gray
- Full black

```
void colorMode(RGB_IMAGE& img_in, RGB_IMAGE& img_out, char mode) {
    RGB_PIX pin;
    RGB_PIX pout;
    char gray;
    L_row: for(int row = 0; row < 1080; row++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1080
        L_col: for(int col = 0; col < 1920; col++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1920
#pragma HLS pipeline rewind
            img_in >> pin;
            if (mode == 0) {                 // RGB passthrough
                pout.val[0] = pin.val[0];
                pout.val[1] = pin.val[1];
                pout.val[2] = pin.val[2];
            } else if (mode == 1) {          // gray
                // const1..const3 are the gray weights, defined elsewhere
                gray = const1 * pin.val[0] +
                       const2 * pin.val[1] +
                       const3 * pin.val[2];
                pout.val[0] = gray;
                pout.val[1] = gray;
                pout.val[2] = gray;
            } else if (mode == 2) {          // split: left RGB, right gray
                if (col <= 960) {
                    pout.val[0] = pin.val[0];
                    pout.val[1] = pin.val[1];
                    pout.val[2] = pin.val[2];
                } else {
                    gray = const1 * pin.val[0] +
                           const2 * pin.val[1] +
                           const3 * pin.val[2];
                    pout.val[0] = gray;
                    pout.val[1] = gray;
                    pout.val[2] = gray;
                }
            } else if (mode == 3) {          // full black
                pout.val[0] = 0;
                pout.val[1] = 0;
                pout.val[2] = 0;
            } else {                         // any other mode: passthrough
                pout.val[0] = pin.val[0];
                pout.val[1] = pin.val[1];
                pout.val[2] = pin.val[2];
            }
            img_out << pout;
        }
    }
}
```
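The mode selection can be checked on single pixels with a small software model. The sketch below is plain Python and only mirrors the per-pixel decision logic. The gray weights are an assumption (common luma coefficients, with channel 0 taken as red), because the listing only names them const1, const2 and const3. The names `to_gray` and `color_mode` are illustrative.

```python
# Gray weights are an assumption (common luma coefficients); the source only
# names them const1, const2 and const3.
CONST1, CONST2, CONST3 = 0.299, 0.587, 0.114
HALF_WIDTH = 960  # split position used by mode 2


def to_gray(pixel):
    """Weighted sum of the three channels, truncated to an integer."""
    return int(CONST1 * pixel[0] + CONST2 * pixel[1] + CONST3 * pixel[2])


def color_mode(pixel, col, mode):
    """Return the output pixel for one input pixel at column `col`."""
    if mode == 1 or (mode == 2 and col > HALF_WIDTH):
        gray = to_gray(pixel)
        return (gray, gray, gray)
    if mode == 3:
        return (0, 0, 0)
    return tuple(pixel)  # mode 0, left half of mode 2, and any other value


pixel = (200, 100, 50)
for mode in range(4):
    print(mode, color_mode(pixel, 100, mode), color_mode(pixel, 1500, mode))
```

In mode 2 the pixel at column 100 stays in color, while the pixel at column 1500 is converted to gray.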

The next step is to turn the copy2 function into a copy3 function. When this is done, the Sobel and colorMode functions can be reused. Notice that the block diagram is not symmetric: the output of the original image arrives before the outputs of the filters. This causes the pixels to be out of sync, so they cannot be added in the add3 function.

To prevent this from happening, a buffer is placed between colorMode and the add3 function. First the minimum depth of this buffer needs to be known. The Sobel function, with a kernel size of 3, needs at least 3 lines of data before it can begin calculating. So the latency of a Sobel function is 3 \* 1920 = 5760 pixels.

At least 5760 pixels are stored in this buffer so that the 3 paths are in sync again. hls::Mat is basically the same as a stream (a Mat is implemented as an hls::stream), so the stream depth of the image can be set at the point where the buffer is needed.

```
#pragma HLS stream depth=19200 variable=img_1.data_stream
```
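The buffer sizing can be illustrated with a simplified software model. The sketch below is plain Python. It assumes that the direct path writes one pixel per cycle and that the adder starts reading only once the filter path delivers its first pixel, after a latency of 3 lines. The names `min_buffer_depth` and `simulate` are illustrative. The depth of 19200 in the pragma equals ten lines of 1920 pixels and lies above this minimum.

```python
from collections import deque

WIDTH = 1920          # pixels per line
KERNEL_SIZE = 3       # the Sobel kernel is 3x3


def min_buffer_depth(width, kernel_size):
    """Pixels the filter path lags behind: kernel_size full lines."""
    return kernel_size * width


def simulate(depth, latency, n_pixels):
    """Return True if the direct path never overflows a FIFO of `depth`.

    The direct path writes one pixel per cycle; the adder reads one pixel
    per cycle, but only from cycle `latency` onwards.
    """
    fifo = deque()
    for cycle in range(n_pixels):
        if cycle >= latency and fifo:
            fifo.popleft()           # adder consumes one buffered pixel
        if len(fifo) >= depth:
            return False             # FIFO full: the producer would stall
        fifo.append(cycle)
    return True


depth = min_buffer_depth(WIDTH, KERNEL_SIZE)
print(depth)                                    # 5760
print(simulate(depth, depth, 10 * WIDTH))       # True
print(simulate(depth - 1, depth, 10 * WIDTH))   # False
```

With one element less than the minimum, the direct path fills its buffer before the filter path produces its first pixel, so the stage feeding both paths cannot continue.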

Now the only thing that needs to be done is making the add3 function. This function adds 3 images. The intensity of each image can also be changed in this function.

```
void add3(RGB_IMAGE& img_in0, int a, RGB_IMAGE& img_in1, int b,
          RGB_IMAGE& img_in2, int c, RGB_IMAGE& img_out) {
    RGB_PIX pin0, pin1, pin2;
    RGB_PIX pout;
    L_row: for(int row = 0; row < 1080; row++) {
#pragma HLS LOOP_TRIPCOUNT min=720 max=1080
        L_col: for(int col = 0; col < 1920; col++) {
#pragma HLS LOOP_TRIPCOUNT min=1280 max=1920
#pragma HLS loop_flatten off
#pragma HLS PIPELINE II=1
            img_in0 >> pin0;
            img_in1 >> pin1;
            img_in2 >> pin2;
            // a, b and c divide the intensity of each image before the sum
            pout = (pin0/a + pin1/b + pin2/c);
            img_out << pout;
        }
    }
}
```
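The effect of the three divisors can be tried out on single pixels. The sketch below is a plain Python model of the per-pixel arithmetic, not the HLS function. Clamping the sum to 255 is an assumption, the divisors must be non-zero, and the pixel values are hypothetical.

```python
def add3(pix0, a, pix1, b, pix2, c):
    """Weighted sum of three pixels: pix0/a + pix1/b + pix2/c per channel.

    Integer division mirrors the C code for non-negative values;
    clamping to 255 is an assumption.
    """
    return tuple(
        min(255, p0 // a + p1 // b + p2 // c)
        for p0, p1, p2 in zip(pix0, pix1, pix2)
    )


original = (120, 80, 40)
sobel_x = (255, 255, 255)   # pixel on a vertical edge
sobel_y = (0, 0, 0)

# Original dimmed to half, edges at full strength
print(add3(original, 2, sobel_x, 1, sobel_y, 1))  # (255, 255, 255)
# Original at full strength, edges dimmed to a quarter
print(add3(original, 1, sobel_x, 4, sobel_y, 4))  # (183, 143, 103)
```

A larger divisor lowers the brightness of the corresponding image in the sum.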

The result:



Figure 4-9: Result system using filters

## 4.5 Sobel negative values

Let's take a closer look at how to solve the problem discussed in 4.2. The negative values of the Sobel filter are needed for video processing. These negative values contain information that was not used in our previous projects.

To show these negative values, it's possible to multiply the X and Y kernels by -1. This will only output the "negative" values. Summing the original and the negated results requires two filters per direction, which doubles the needed resources.

A better solution is to change the datatype so negative values can be represented.

The objective in this project is to construct the following system:



Figure 4-10: image types

These functions must be made:

- convertToSigned
- convertToUnsigned

First, a new type of image is needed to store the 16-bit data.

```
typedef hls::Mat<1080,1920, HLS_16SC3> RGB16_IMAGE;
```

Now there must be a data type to fill this image up:

```
typedef hls::Scalar<3, short> RGB16_PIX;
```

The convertToSigned function takes an RGB\_IMAGE and converts it to an RGB16\_IMAGE.

```
void convertToSigned(RGB_IMAGE& img_in0, RGB16_IMAGE& img_out) {
    RGB_PIX pin;     // the input pixel (unsigned 8 bit)
    RGB16_PIX pout;  // the output pixel (signed 16 bit)
    L_row: for(int row = 0; row < 1080; row++) {
#pragma HLS LOOP_TRIPCOUNT min=720 max=1080
        L_col: for(int col = 0; col < 1920; col++) {
#pragma HLS LOOP_TRIPCOUNT min=1280 max=1920
#pragma HLS loop_flatten off
#pragma HLS PIPELINE II=1
            img_in0 >> pin;
            pout.val[0] = pin.val[0];
            pout.val[1] = pin.val[1];
            pout.val[2] = pin.val[2];
            img_out << pout;
        }
    }
}
```

The convertToUnsigned function takes an RGB16\_IMAGE and converts it back to an RGB\_IMAGE. It is similar to the convertToSigned function, but now the absolute value must be taken to convert the data to unsigned.
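The per-pixel arithmetic of both conversions can be sketched as follows. This is a plain Python model with illustrative names, not the HLS code. Clamping the absolute value to 255 is an assumption, since the text only states that the absolute value is taken.

```python
def convert_to_signed(pixel):
    """Widen an unsigned 8-bit pixel; values 0..255 fit unchanged in 16 bits."""
    return tuple(int(v) for v in pixel)


def convert_to_unsigned(pixel):
    """Take the absolute value per channel and clamp to 255 (assumed)."""
    return tuple(min(255, abs(v)) for v in pixel)


print(convert_to_signed((0, 128, 255)))         # (0, 128, 255)
print(convert_to_unsigned((-1020, -100, 100)))  # (255, 100, 100)
```

A negative Sobel response such as -1020 now produces a bright pixel instead of being lost.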

**Note**: Don't forget to change the image type in the Sobel function to RGB16\_IMAGE.

The result:



Figure 4-11: Result Sobel with negative values

The data is kept as a short which is 16 bits. This has a range from -32,768 to 32,767. Most values will never be used and will take unnecessary resources. Let's change it to 10 signed bits because the HLS\_10SC3 data type already exists in hls\_video\_types.h. This is a 3 channel 10-bit signed data type. This ranges from -512 to 511.
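The ranges quoted above follow from the two's-complement formula and can be checked with a few lines of Python. The helper names are illustrative. The sketch also shows that the extreme Sobel result of ±1020 from 4.2 would need 11 signed bits. With 10 bits such values fall outside the range, and how the design treats them is not stated in the text.

```python
def signed_range(bits):
    """Smallest and largest value of a two's-complement integer of `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def bits_needed(lowest, highest):
    """Smallest signed width whose range covers lowest..highest."""
    bits = 1
    while True:
        lo, hi = signed_range(bits)
        if lo <= lowest and highest <= hi:
            return bits
        bits += 1


print(signed_range(16))          # (-32768, 32767)
print(signed_range(10))          # (-512, 511)
print(bits_needed(-1020, 1020))  # 11
```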

Image type:

```
typedef hls::Mat<1080,1920, HLS_10SC3> RGB10_IMAGE;
```

Pixel type:

```
typedef hls::Scalar<3, ap_int<10> > RGB10_PIX;
```

| Name            | BRAM_18K | DSP48E | FF     | LUT   |
|-----------------|----------|--------|--------|-------|
| DSP             | -        | -      | -      | -     |
| Expression      | -        | -      | -      | -     |
| FIFO            | 0        | -      | 75     | 348   |
| Instance        | 18       | 3      | 1390   | 1815  |
| Memory          | -        | -      | -      | -     |
| Multiplexer     | -        | -      | -      | -     |
| Register        | -        | -      | 5      | -     |
| Total           | 18       | 3      | 1470   | 2163  |
| Available       | 280      | 220    | 106400 | 53200 |
| Utilization (%) | 6        | 1      | 1      | 4     |

Figure 4-13: Resources 16 bit

| Name            | BRAM_18K | DSP48E | FF     | LUT   |
|-----------------|----------|--------|--------|-------|
| DSP             | -        | -      | -      | -     |
| Expression      | -        | -      | -      | -     |
| FIFO            | 0        | -      | 75     | 312   |
| Instance        | 18       | 3      | 1102   | 1473  |
| Memory          | -        | -      | -      | -     |
| Multiplexer     | -        | -      | -      | -     |
| Register        | -        | -      | 5      | -     |
| Total           | 18       | 3      | 1182   | 1785  |
| Available       | 280      | 220    | 106400 | 53200 |
| Utilization (%) | 6        | 1      | 1      | 3     |

Figure 4-12: Resources 10 bit

Some FF's and LUT's are saved in the 10-bit version. The BRAM's and DSP's stay the same because the 10-bit integer is stored in a 16-bit value.

The resources can be lowered even more by using only 1 channel for the gray images. Using 3 channels for gray images is useless because the values in the channels are all the same. This can easily be done by adding 2 more image and pixel types:

```
typedef hls::Mat<1080,1920, HLS_8UC1> GRAY_IMAGE;
typedef hls::Mat<1080,1920, HLS_10SC1> GRAY10_IMAGE;
typedef hls::Scalar<1, unsigned char> GRAY_PIX;
typedef hls::Scalar<1, ap_int<10> > GRAY10_PIX;
```

The easiest way to change from RGB to gray is to use the hls::CvtColor converter. This function converts an RGB image to a 1-channel gray image.

```
hls::CvtColor<HLS_RGB2GRAY>(img_0, img_gray1);
```

After some simple changes the following synthesis is created:

| Name            | BRAM_18K | DSP48E | FF     | LUT   |
|-----------------|----------|--------|--------|-------|
| DSP             | -        | -      | -      | -     |
| Expression      | -        | -      | -      | -     |
| FIFO            | 0        | -      | 70     | 300   |
| Instance        | 6        | 3      | 895    | 1065  |
| Memory          | -        | -      | -      | -     |
| Multiplexer     | -        | -      | -      | -     |
| Register        | -        | -      | 6      | -     |
| Total           | 6        | 3      | 971    | 1365  |
| Available       | 280      | 220    | 106400 | 53200 |
| Utilization (%) | 2        | 1      | ~0     | 2     |

Figure 4-14: Resources 1 channel, 10 bit

The amount of BRAM is divided by 3 because it uses 3 times less data. The FF's and LUT's have also decreased.

### 4.5.1 Visualizing the positive and negative values

A common way to visualize the positive and negative values is by using an offset. This is done with the following code:

```
pout.val[0]=(pin.val[0] / 2) + 127;
```

The input is divided by 2 and a value of 127 is added. Black (value 0) will become a value of 127. Negative values will be darker and positive values will be brighter (negative values less than 127 and positive greater than 127).
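The mapping can be tried out on a few sample values. The sketch below is a plain Python model of the single line of code above. The division truncates toward zero as C integer division does, and the final clamp to 0..255 is an assumption for values outside ±254. The function names are illustrative.

```python
def c_div2(value):
    """Divide by 2 with truncation toward zero, as C integer division does."""
    return int(value / 2)


def visualize(value):
    """Map a signed Sobel value to 0..255: halve it, add 127, clamp (assumed)."""
    return max(0, min(255, c_div2(value) + 127))


for value in (-254, -100, 0, 100, 254):
    print(value, visualize(value))
```

The values -254, 0 and 254 map to 0, 127 and 254, so a flat region appears mid-gray while the two edge polarities appear darker and brighter.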

The following result is a visualization of a Sobel Y filter:





Output



# 5 COMPARISON SOFTWARE VS HARDWARE

The video processing IP block is placed in the video stream. This block uses a 142 MHz clock, which is the same as the pixel clock output by the HDMI-in frontend. This means that 60 fps can be maintained when using the hardware overlay.
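A rough check of this claim is possible with simple arithmetic. The sketch below assumes that the pipelined IP processes one pixel per clock cycle (an initiation interval of 1, as set in the pragmas of the earlier listings) and counts only the active 1920 x 1080 pixels, ignoring blanking. The function name is illustrative.

```python
WIDTH, HEIGHT, FPS = 1920, 1080, 60
CLOCK_HZ = 142_000_000  # clock of the IP block, from the text


def active_pixel_rate(width, height, fps):
    """Active pixels that must be processed per second."""
    return width * height * fps


rate = active_pixel_rate(WIDTH, HEIGHT, FPS)
print(rate)                       # 124416000
print(rate <= CLOCK_HZ)           # True
print(round(rate / CLOCK_HZ, 3))  # 0.876
```

Under these assumptions about 88% of the clock cycles are needed for active pixels, so the block can keep up with the incoming stream.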

To test what frame rate the software reaches, the following code is run in Python:

```
import time

import cv2
import numpy as np

# hdmi_in and hdmi_out are the HDMI objects of the video overlay, set up beforehand
numframes = 10
grayscale = np.ndarray(shape=(hdmi_in.mode.height, hdmi_in.mode.width),
                       dtype=np.uint8)
result = np.ndarray(shape=(hdmi_in.mode.height, hdmi_in.mode.width),
                    dtype=np.uint8)
start = time.time()
for _ in range(numframes):
    inframe = hdmi_in.readframe()
    cv2.cvtColor(inframe, cv2.COLOR_BGR2GRAY, dst=grayscale)
    inframe.freebuffer()
    cv2.Laplacian(grayscale, cv2.CV_8U, dst=result)
    outframe = hdmi_out.newframe()
    cv2.cvtColor(result, cv2.COLOR_GRAY2BGR, dst=outframe)
    hdmi_out.writeframe(outframe)
end = time.time()
print("Frames per second: " + str(numframes / (end - start)))
```

The OpenCV library is used to process the images. In this example, a new image frame is read. The frame is converted to gray and passed to a Laplacian filter (which uses a different 3x3 kernel). This frame is then sent to the output. This process results in a frame rate of 1.3769 frames per second, which is far lower than the 60 fps that is achieved in hardware.
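The two frame rates can be put side by side with a short calculation. The sketch below uses only the two figures reported above, and the helper names are illustrative.

```python
HW_FPS = 60.0
SW_FPS = 1.3769  # measured value reported in the text


def speedup(hw_fps, sw_fps):
    """How many times faster the hardware path is than the software path."""
    return hw_fps / sw_fps


def frame_time_ms(fps):
    """Time available or needed per frame, in milliseconds."""
    return 1000.0 / fps


print(round(speedup(HW_FPS, SW_FPS), 1))  # 43.6
print(round(frame_time_ms(SW_FPS)))       # 726
print(round(frame_time_ms(HW_FPS), 1))    # 16.7
```

The software path needs roughly 726 ms per frame, against 16.7 ms per frame at 60 fps. This corresponds to a factor of about 44.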

# 6 CONCLUSION

A wide variety of IP's and overlays are made. This project started with building a custom video overlay, an overlay that can only be used for video processing. After this, a simple adder IP block was created to test whether it is possible to create our own IP's and use them in our Python environment. The AXI4-Lite communication between the processing system and the programmable logic was also tested here. The next chapter covered video processing: how the signal is transmitted, and some simple IP's that read and process the signal.

When it is known how to read the data, perform calculations and write the data back, it is possible to use this knowledge and create video filters. In this project a high pass filter is made using the Sobel operators for X and Y direction. The functionality of these IP's is simulated in a C test bench and when this looked fine, it was made into an overlay. A few changes were made to the filters so the negative values are kept, which is important for future work.

Finally, the system is compared to a software video filter. In the software system, the frame rate is between 1 and 2 frames per second. Compared to our 60 frames per second with hardware logic, we can conclude that it is possible to use the PYNQ board to accelerate software processes using hardware logic.

# 7 FUTURE WORK

The goal of this project was to learn the PYNQ environment and use it to accelerate software processes in hardware logic. Basic filters are made here. These filters can be used in more complex systems, described in the book Digital Image Processing, such as:

- Canny edge detector [page 729]
- Image smoothing using low pass filters [page 272]
- Image Sharpening using high pass filters [page 284]
- Image thresholding [page 742]
- Image Eroding and Dilating [page 638]
- And more if possible