
Example: Ethernet


This is a breakdown of an Arty board based example that receives Ethernet frames, does some work on the payload, and sends a response frame back.

This example is from a series of examples designed for the Arty board. See that page for basic instructions on using the Arty board.


Source

The source code for this project is primarily found in work_app.c. The project currently uses the Xilinx TEMAC core for easy access to the off-chip Ethernet PHY. work_app.c describes three main functions / clock domains: rx_main(), which is connected to the receive side of the TEMAC core; tx_main(), which is connected to the transmit side of the TEMAC core; and work_pipeline(), which contains the work() function to be done.

In both the RX and TX functions two things occur: 1) The 8b AXIS stream to/from the TEMAC is converted to a 32b AXIS stream. This is done using raw VHDL code so as to easily make use of the existing AXIS data width converter IP freely available from Xilinx. 2) The 32b AXIS stream from the TEMAC is parsed to pull out Ethernet header info, e.g. MAC addresses; see eth_32.c.
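
For reference, the header information recovered in step 2 looks roughly like the sketch below. The struct and field names here are illustrative assumptions, not the actual eth_32.c definitions.

// Illustrative only: roughly the fields recovered from the 32b AXIS
// stream (names are assumptions, not the real eth_32.c types).
typedef struct axis32_sketch_t
{
  uint32_t data; // One 32b word of frame data
  uint4_t keep;  // Which bytes of 'data' are valid
  uint1_t last;  // Marks the final word of the frame
  uint1_t valid; // Word is present this cycle
} axis32_sketch_t;

typedef struct eth_header_sketch_t
{
  uint48_t dst_mac;   // Destination MAC address (6 bytes)
  uint48_t src_mac;   // Source MAC address (6 bytes)
  uint16_t ethertype; // EtherType field (2 bytes)
} eth_header_sketch_t;

// A received frame as seen by rx_main(): parsed header plus the
// remaining payload as a stream of 32b words.
typedef struct eth32_frame_sketch_t
{
  eth_header_sketch_t header;
  axis32_sketch_t payload;
} eth32_frame_sketch_t;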

Bytes from Ethernet are (de)serialized, converted to user struct types, buffered in the clock-crossing FIFOs inputs_fifo and outputs_fifo, and sent to/from the work_pipeline() function.

// This pipeline does the following:
//    Reads work inputs from rx fifo
//    Does work on the work inputs to form the work outputs
//    Writes outputs into tx fifo
#pragma MAIN_MHZ work_pipeline 150.0   // Actually running at 100MHz but need margin since near max utilization
void work_pipeline()
{
  // Read incoming work inputs from rx_main
  inputs_fifo_read_t input_read = inputs_fifo_READ_1(1); 
  work_inputs_t inputs = input_read.data[0]; 
  
  // Do work on inputs, get outputs
  work_outputs_t outputs = work(inputs);

  // Write outgoing work outputs into tx_main
  work_outputs_t output_wr_data[1];
  output_wr_data[0] = outputs;
  outputs_fifo_write_t output_write = outputs_fifo_WRITE_1(output_wr_data, input_read.valid);
  // TODO overflow wire+separate state
}
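
For context, a rough sketch of the other side of that handshake follows: how the receive clock domain might push a completed work_inputs_t into the clock-crossing inputs_fifo. This is an illustrative outline, not the actual rx_main() from work_app.c; the AXIS conversion, header parsing, and byte-to-struct deserialization are elided.

// Illustrative sketch only, not the real rx_main() from work_app.c.
// Shows the write side of the inputs_fifo clock crossing that
// work_pipeline() above reads from.
void rx_side_sketch(work_inputs_t inputs, uint1_t inputs_valid)
{
  // 'inputs' and 'inputs_valid' would come from the 8b->32b AXIS
  // conversion, header parsing (eth_32.c), and byte-to-struct
  // deserialization that are elided in this sketch.

  // Push the completed struct into the clock-crossing FIFO
  work_inputs_t wr_data[1];
  wr_data[0] = inputs;
  inputs_fifo_write_t fifo_write = inputs_fifo_WRITE_1(wr_data, inputs_valid);
  // The returned struct reports whether the FIFO accepted the write
  // (see the generated FIFO interface for the exact field names).
}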

PipelineC Tool

Throughput Sweep

In this example the tool needs to spend time automatically pipelining the work() function to meet timing. Below you can see the tool try 1, 2, and 3 clocks of latency before finally meeting timing.

================== Beginning Throughput Sweep ================================
Function: xil_temac_rx_module Target MHz: 25.0
Function: xil_temac_tx_module Target MHz: 25.0
Function: rx_main Target MHz: 25.0
Function: tx_main Target MHz: 25.0
Function: work_pipeline Target MHz: 150.0
WARNING: headers_fifo async fifo depth increased to minimum allowed = 16
Starting with blank sweep state...
...determining slicing information for each main function...
xil_temac_rx_module : 0 clocks latency, sliced coarsely...
xil_temac_tx_module : 0 clocks latency, sliced coarsely...
rx_main : 0 clocks latency, sliced coarsely...
tx_main : 0 clocks latency, sliced coarsely...
work_pipeline : 1 clocks latency, sliced coarsely...
Running: /media/1TB/Programs/Linux/Xilinx/Vivado/2019.2/bin/vivado -journal /home/julian/pipelinec_syn_output/top/vivado.jou -log /home/julian/pipelinec_syn_output/top/vivado_e1ff.log -mode batch -source "/home/julian/pipelinec_syn_output/top/top_e1ff.tcl"
 Clock Goal (MHz): 150.0 , Current MHz: 95.4380606986066 ( 10.478 ns)
xil_temac_rx Clock Goal (MHz): 25.0 , Current MHz: 134.15615776764153 ( 7.454000000000001 ns)
xil_temac_tx Clock Goal (MHz): 25.0 , Current MHz: 171.05713308244964 ( 5.8459999999999965 ns)
Making coarse adjustment and trying again...
xil_temac_rx_module : 0 clocks latency, sliced coarsely...
xil_temac_tx_module : 0 clocks latency, sliced coarsely...
rx_main : 0 clocks latency, sliced coarsely...
tx_main : 0 clocks latency, sliced coarsely...
work_pipeline : 2 clocks latency, sliced coarsely...
Running: /media/1TB/Programs/Linux/Xilinx/Vivado/2019.2/bin/vivado -journal /home/julian/pipelinec_syn_output/top/vivado.jou -log /home/julian/pipelinec_syn_output/top/vivado_d177.log -mode batch -source "/home/julian/pipelinec_syn_output/top/top_d177.tcl"
 Clock Goal (MHz): 150.0 , Current MHz: 124.14649286157666 ( 8.055 ns)
xil_temac_rx Clock Goal (MHz): 25.0 , Current MHz: 134.15615776764153 ( 7.454000000000001 ns)
xil_temac_tx Clock Goal (MHz): 25.0 , Current MHz: 171.05713308244964 ( 5.8459999999999965 ns)
Making coarse adjustment and trying again...
xil_temac_rx_module : 0 clocks latency, sliced coarsely...
xil_temac_tx_module : 0 clocks latency, sliced coarsely...
rx_main : 0 clocks latency, sliced coarsely...
tx_main : 0 clocks latency, sliced coarsely...
work_pipeline : 3 clocks latency, sliced coarsely...
Running: /media/1TB/Programs/Linux/Xilinx/Vivado/2019.2/bin/vivado -journal /home/julian/pipelinec_syn_output/top/vivado.jou -log /home/julian/pipelinec_syn_output/top/vivado_5510.log -mode batch -source "/home/julian/pipelinec_syn_output/top/top_5510.tcl"
 Clock Goal (MHz): 150.0 , Current MHz: 161.70763260025873 ( 6.184 ns)
xil_temac_rx Clock Goal (MHz): 25.0 , Current MHz: 134.15615776764153 ( 7.454000000000001 ns)
xil_temac_tx Clock Goal (MHz): 25.0 , Current MHz: 171.05713308244964 ( 5.8459999999999965 ns)
Found maximum pipeline latencies...
================== Writing Results of Throughput Sweep ================================
Done.

Vivado Results

Resource usage: Inferring DSPs is still experimental, so the multiplication in this example is implemented entirely in fabric.

(Image: Vivado resource utilization for this example)
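
Purely for illustration, the multiply in question has the same general shape as the hypothetical function below (this is not the project's actual work(), which is documented separately); without DSP inference, a multiply like this is built from general fabric logic, and it is what the throughput sweep above pipelined to 3 clocks.

// Hypothetical example only: NOT the actual work() from this project.
// Just shows the kind of multiply that lands in LUT fabric when DSPs
// are not inferred.
typedef struct example_inputs_t
{
  float a;
  float b;
} example_inputs_t;

typedef struct example_outputs_t
{
  float product;
} example_outputs_t;

example_outputs_t example_work(example_inputs_t inputs)
{
  example_outputs_t outputs;
  // A floating point multiply implemented in fabric; the tool's
  // autopipelining inserts the registers needed to meet the clock goal.
  outputs.product = inputs.a * inputs.b;
  return outputs;
}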

C Driver/Test Code

Code for getting Ethernet frames to/from the FPGA is included in eth_sw.c. Make sure to set the DEFAULT_IF value to match the network interface name you want to use.
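
Since the test program below needs sudo, eth_sw.c presumably exchanges frames through a Linux raw socket bound to that interface; a generic sketch of that pattern (an assumption, not eth_sw.c itself) looks like this:

// Generic Linux raw-socket sketch (an assumption, not eth_sw.c itself).
#include <string.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

#define DEFAULT_IF "eth0" // set to the interface connected to the FPGA

int open_raw_socket(void)
{
  // Raw socket that sees every EtherType (requires root, hence sudo)
  int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
  if (fd < 0) return -1;

  // Look up the interface index for DEFAULT_IF
  struct ifreq ifr;
  memset(&ifr, 0, sizeof(ifr));
  strncpy(ifr.ifr_name, DEFAULT_IF, IFNAMSIZ - 1);
  if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) return -1;

  // Bind the socket to that interface so frames only use it
  struct sockaddr_ll addr;
  memset(&addr, 0, sizeof(addr));
  addr.sll_family = AF_PACKET;
  addr.sll_protocol = htons(ETH_P_ALL);
  addr.sll_ifindex = ifr.ifr_ifindex;
  if (bind(fd, (struct sockaddr*)&addr, sizeof(addr)) < 0) return -1;

  return fd;
}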

eth_work.h uses autogenerated to/from-bytes functions, wrapping the eth_sw.c functions, specifically for getting work inputs and outputs transferred over Ethernet. The paths to these code-generated files need to be updated once you have run the compiler and produced those files.
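
Conceptually those wrappers just flatten a struct into frame payload bytes and back again; a hand-written equivalent would look something like the sketch below (hypothetical names and fields, since the real functions, types, and field layouts are autogenerated).

// Hypothetical hand-rolled equivalent of the generated to/from-bytes
// helpers (the real functions and field layouts are autogenerated).
#include <stdint.h>
#include <string.h>

typedef struct example_work_inputs_t { float values[4]; } example_work_inputs_t;
typedef struct example_work_outputs_t { float result; } example_work_outputs_t;

// Pack work inputs into the outgoing Ethernet frame payload
static void inputs_to_payload(const example_work_inputs_t* in, uint8_t* payload)
{
  // The generated code serializes field by field; a flat copy is shown
  // here only to illustrate the idea.
  memcpy(payload, in, sizeof(*in));
}

// Unpack work outputs from a received frame payload
static void payload_to_outputs(const uint8_t* payload, example_work_outputs_t* out)
{
  memcpy(out, payload, sizeof(*out));
}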

This design does some work() on the FPGA. The details of what this work is, how to verify its correctness, and a driver C program can be found here.

This design is tested by sending Ethernet frames containing work_inputs_t structs to the FPGA and then waiting for work_outputs_t structs in response.

Compile and run the code like so, from the work example directory:

gcc -I ../../../../ -Wall -pthread work_test.c -o work_test
sudo ./work_test 8192  # 8192 `works` done (per CPU thread)

The test first does the work() on the CPU. Next, the test starts two threads: one for writing to the FPGA and one for reading from it. These threads write work_inputs_t structs and read work_outputs_t structs over Ethernet as fast as possible, using the FPGA to accomplish the same work the CPU did.
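
A rough outline of that two-thread pattern is sketched below. The fpga_* helpers are hypothetical stand-ins for the frame send/receive wrappers (eth_sw.c / eth_work.h); the real test code is work_test.c.

// Rough sketch of the two-thread test pattern (not the real work_test.c).
// work_inputs_t / work_outputs_t come from the project's headers; the
// fpga_* helpers are hypothetical stand-ins for the Ethernet wrappers.
#include <pthread.h>
#include <stddef.h>

void fpga_write_inputs(const work_inputs_t* inputs);  // hypothetical
void fpga_read_outputs(work_outputs_t* outputs);      // hypothetical

static int n_works;                   // e.g. 8192 from the command line
static work_inputs_t* test_inputs;    // generated before the test runs
static work_outputs_t* fpga_outputs;  // compared to CPU results afterwards

// Thread 1: stream every work input to the FPGA as fast as possible
static void* tx_thread(void* arg)
{
  for (int i = 0; i < n_works; i++)
    fpga_write_inputs(&test_inputs[i]);
  return NULL;
}

// Thread 2: collect each work output as it comes back
static void* rx_thread(void* arg)
{
  for (int i = 0; i < n_works; i++)
    fpga_read_outputs(&fpga_outputs[i]);
  return NULL;
}

// In main(): start both threads, join them, then compare fpga_outputs
// against the CPU reference results computed earlier, e.g.:
//   pthread_t tx, rx;
//   pthread_create(&tx, NULL, tx_thread, NULL);
//   pthread_create(&rx, NULL, rx_thread, NULL);
//   pthread_join(tx, NULL);
//   pthread_join(rx, NULL);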

$ sudo ./work_test 8192
CPU threads: 2
n 'work()'s: 16384 
Total tx bytes: 1048576 
Total rx bytes: 524288 
CPU took 0.010933 seconds to execute 
CPU iteration time: 0.000001 seconds
CPU bytes per sec: 143861782.642912 B/s
FPGA took 0.148723 seconds to execute 
FPGA iteration time: 0.000009 seconds
FPGA bytes per sec: 10575786.349021 B/s
Speedup: 0.073514

No errors were printed; the work done by the FPGA matched the work done by the CPU. A successful, but slow, test. Ethernet frames sent and received can be monitored using something like Wireshark.

Ethernet Loopback Example Breakdown

(Image: Ethernet loopback example block diagram)