Shared Resource Bus

This page describes 'shared resource buses', which are very similar to system buses like AXI: managers make requests to and receive responses from subordinates over five separate valid+ready handshake channels, allowing simultaneous reads and writes.

These buses are used in a graphics demo to have multiple 'host threads' share frame buffers to draw the Mandelbrot set.

Requests and Responses

PipelineC's shared_resource_bus.h generic shared bus header is used to create hosts that make read+write requests to and receive read+write responses from devices.

For example, if the device is a simple byte-addressed memory mapped RAM, requests and responses can be configured like so:

  • write_req_data_t: Write request data
    • ex. RAM needs address to request write
      typedef struct write_req_data_t
      {
        uint32_t addr;
        /*// AXI:
        // Address
        //  "Write this stream to a location in memory"
        id_t awid;
        addr_t awaddr; 
        uint8_t awlen; // Number of transfer cycles minus one
        uint3_t awsize; // 2^size = Transfer width in bytes
        uint2_t awburst;*/
      }write_req_data_t;
  • write_data_word_t: Write data word
    • ex. RAM writes some data element, some number of bytes, to the addressed location
      typedef struct write_data_word_t
      {
        uint8_t data;
        /*// AXI:
        // Data stream to be written to memory
        uint8_t wdata[4]; // 4 bytes, 32b
        uint1_t wstrb[4];*/
      }write_data_word_t;
  • write_resp_data_t: Write response data
    • ex. RAM write returns dummy single valid/done/complete bit
      typedef struct write_resp_data_t
      {
        uint1_t dummy;
        /*// AXI:
        // Write response
        id_t bid;
        uint2_t bresp; // Error code*/
      } write_resp_data_t;
  • read_req_data_t: Read request data
    • ex. RAM needs address to request read
      typedef struct read_req_data_t
      {
        uint32_t addr;
        /*// AXI:
        // Address
        //   "Give me a stream from a place in memory"
        id_t arid;
        addr_t araddr;
        uint8_t arlen; // Number of transfer cycles minus one
        uint3_t arsize; // 2^size = Transfer width in bytes
        uint2_t arburst;*/
      } read_req_data_t;
  • read_data_resp_word_t: Read data and response word
    • ex. RAM read returns some data element
      typedef struct read_data_resp_word_t
      {
        uint8_t data;
        /*// AXI:
        // Read response
        id_t rid;
        uint2_t rresp;
        // Data stream from memory
        uint8_t rdata[4]; // 4 bytes, 32b;*/
      } read_data_resp_word_t;

Valid + Ready Handshakes

Shared resource buses use valid+ready handshaking just like AXI. Each of the five channels (write request, write data, write response, read request, and read data) has its own handshaking signals.
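Conceptually, each channel pairs a payload and a valid signal flowing in one direction with a ready signal flowing back the other way. A minimal sketch (not the exact generated struct layout) for the write request channel:

// Simplified sketch of one channel (ex. the write request channel)
typedef struct write_req_channel_t{
  write_req_data_t data; // Payload, ex. RAM address
  uint1_t valid;         // Sender: payload is valid this cycle
}write_req_channel_t;
// ...and a uint1_t 'ready' signal travels in the opposite direction.
// A transfer occurs only on cycles where both valid and ready are 1.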

Bursts and Pipelining

Again, like AXI, these buses have burst (packet last boundary) and pipelining (multiple IDs for transactions in flight) signals.
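For example, the read data channel's payload word travels with burst and ID information roughly like below (a simplified sketch; the real generated types nest these fields differently):

// Simplified sketch: burst 'last' + transaction 'id' alongside the read data word
typedef struct read_data_channel_t{
  read_data_resp_word_t data_resp; // Payload word, ex. RAM read data
  uint1_t last;                    // Marks the final word of a burst/packet
  uint8_t id;                      // Transaction ID for multiple requests in flight
  uint1_t valid;
}read_data_channel_t;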

The Shared Bus Declaration

A kind/type of 'shared bus' is declared using the SHARED_BUS_TYPE_DEF macro. Instances of the shared bus are declared using the "shared_resource_bus_decl.h" header-as-macro. The macros declare types, helper functions, and a pair of global wires. One of the global wires carries device-to-host data, while the other carries data in the opposite direction, host to device. PipelineC's #pragma INST_ARRAY shared global variables are used to resolve the multiple simultaneous drivers of these wire pairs into shared resource bus arbitration.

SHARED_BUS_TYPE_DEF(
  ram_bus, // Bus 'type' name
  uint32_t, // Write request type (ex. RAM address)
  uint8_t, // Write data type (ex. RAM data)
  uint1_t, // Write response type (ex. dummy value for RAM)
  uint32_t, // Read request type (ex. RAM address)
  uint8_t // Read data type (ex. RAM data)
)

#define SHARED_RESOURCE_BUS_NAME          the_bus_name // Instance name
#define SHARED_RESOURCE_BUS_TYPE_NAME     ram_bus // Bus 'type' name    
#define SHARED_RESOURCE_BUS_WR_REQ_TYPE   uint32_t // Write request type (ex. RAM address)
#define SHARED_RESOURCE_BUS_WR_DATA_TYPE  uint8_t // Write data type (ex. RAM data)
#define SHARED_RESOURCE_BUS_WR_RESP_TYPE  uint1_t // Write response type (ex. dummy value for RAM)
#define SHARED_RESOURCE_BUS_RD_REQ_TYPE   uint32_t // Read request type (ex. RAM address)
#define SHARED_RESOURCE_BUS_RD_DATA_TYPE  uint8_t // Read data type (ex. RAM data)  
#define SHARED_RESOURCE_BUS_HOST_PORTS    NUM_HOST_PORTS
#define SHARED_RESOURCE_BUS_HOST_CLK_MHZ  HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_DEV_PORTS     NUM_DEV_PORTS
#define SHARED_RESOURCE_BUS_DEV_CLK_MHZ   DEV_CLK_MHZ
#include "shared_resource_bus_decl.h"

Connecting the Device to the Shared Bus

The SHARED_BUS_TYPE_DEF declares types like <bus_type>_dev_to_host_t and <bus_type>_host_to_dev_t.

Arbitrary devices are connected to the bus via controller modules that ~convert the to/from-host bus signals into device-specific signals.

Again for example, a simple RAM device might have a controller module like:

// Controller Outputs:
typedef struct ram_ctrl_t{
  // ex. RAM inputs
  uint32_t addr;
  uint32_t wr_data;
  uint32_t wr_enable;
  // Bus signals driven to host
  ram_bus_dev_to_host_t to_host;
}ram_ctrl_t;
ram_ctrl_t ram_ctrl(
  // Controller Inputs:
  // Ex. RAM outputs
  uint32_t rd_data,
  // Bus signals from the host
  ram_bus_host_to_dev_t from_host
);

Inside that ram_ctrl module, RAM-specific signals are connected to the five valid+ready handshakes going to_host (ex. out from the RAM) and coming from_host (ex. into the RAM).

A full example of a controller can be found in the shared frame buffer example code discussed in later sections.
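As a rough preview, the read path inside such a controller might be wired like below. This is a minimal sketch: the handshake field names and the simple one-request-at-a-time latency handling are illustrative, not the exact generated code.

// Minimal sketch of the read path only (illustrative field names, write path omitted)
ram_ctrl_t ram_ctrl(uint32_t rd_data, ram_bus_host_to_dev_t from_host)
{
  static uint1_t rd_data_valid; // Tracks one in-flight read (1 cycle RAM latency assumed)
  ram_ctrl_t o;
  // Read request channel: drive the RAM address, accept a request when not busy
  o.addr = from_host.read.req.data;
  o.to_host.read.req_ready = !rd_data_valid;
  // Read response channel: return the RAM output word to the host while flagged valid
  o.to_host.read.data.burst.data_resp = rd_data;
  o.to_host.read.data.valid = rd_data_valid;
  // Clear the in-flight flag when the host accepts the read data
  if(rd_data_valid & from_host.read.data_ready){
    rd_data_valid = 0;
  }
  // Set the in-flight flag when a new read request handshake completes
  if(from_host.read.req.valid & o.to_host.read.req_ready){
    rd_data_valid = 1;
  }
  // Write request/data/response channels omitted here (handled similarly)
  o.wr_enable = 0;
  return o;
}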

Multiple Hosts and Instances of Devices

In the above sections a controller function, ex. ram_ctrl, describes how a device connects to a shared bus. Using the SHARED_BUS_ARB macro, the below code instantiates the arbitration that connects the multiple hosts and devices together through <instance_name>_from_host and <instance_name>_to_host wires for each device.

MAIN_MHZ(ram_arb_connect, DEV_CLK_MHZ)
void ram_arb_connect()
{
  // Arbitrate M hosts to N devs
  // Macro declares the_bus_name_from_host and the_bus_name_to_host
  SHARED_BUS_ARB(ram_bus, the_bus_name, NUM_DEV_PORTS)

  // Connect devs to arbiter ports
  uint32_t i;
  for (i = 0; i < NUM_DEV_PORTS; i+=1)
  {
    ram_ctrl_t port_ctrl
      = ram_ctrl(<RAM outputs>, the_bus_name_from_host[i]);
    <RAM inputs> = port_ctrl....;
    the_bus_name_to_host[i] = port_ctrl.to_host;
  }
}

Using the Device from Host Threads

The "shared_resource_bus_decl.h" header-as-macro declares derived finite state machine helper functions for reading and writing the shared resource bus. These functions are to be used from NUM_HOST_PORTS simultaneous host FSM 'threads'.

For example, below are the generated signatures for reading and writing the example shared bus RAM:

uint8_t the_bus_name_read(uint32_t addr);
uint1_t the_bus_name_write(uint32_t addr, uint8_t data); // Dummy return value
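For example, a host FSM 'thread' (the host_thread function below is hypothetical) can call them like ordinary blocking functions:

// Hypothetical host 'thread' incrementing each byte in the shared RAM
void host_thread()
{
  uint32_t addr;
  for(addr = 0; addr < 256; addr += 1)
  {
    uint8_t value = the_bus_name_read(addr); // Blocks until the read response arrives
    the_bus_name_write(addr, value + 1);     // Blocks until the write response arrives
  }
}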

Graphics Demo

Old graphics demo diagram

The below graphics demo differs from the old one shown in the diagram above (but the old one is also worth reading for more details).

Instead of requiring on-chip block RAM, this new demo can use off-chip DDR memory for large color frame buffers. Additionally, this demo focuses on a more complex rendering computation that can benefit from PipelineC's auto-pipelining.

Dual Frame Buffer

The graphics_demo.c file is an example exercising a dual frame buffer as a shared bus resource from dual_frame_buffer.c. The demo slowly cycles through R,G,B color ranges, requiring for each pixel: a read from frame buffer RAM, minimal computation to update pixel color, and a write back to frame buffer RAM for display.

The frame buffer is configured to use a Xilinx AXI DDR controller starting inside ddr_dual_frame_buffer.c. The basic shared resource bus setup for connecting to the Xilinx DDR memory controller AXI bus can be found in axi_xil_mem.c. In that file an instance of an axi_shared_bus_t shared resource bus (defined in axi_shared_bus.h) called axi_xil_mem is declared using the shared_resource_bus_decl.h file include-as-macro helper.
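That declaration follows the same header-as-macro pattern shown earlier. Roughly sketched below; the AXI request/data/response type names, port count, and host thread count are placeholders here, the real values live in axi_shared_bus.h and axi_xil_mem.c:

// Rough sketch of the axi_xil_mem declaration pattern (placeholder type names)
#define SHARED_RESOURCE_BUS_NAME          axi_xil_mem
#define SHARED_RESOURCE_BUS_TYPE_NAME     axi_shared_bus_t
#define SHARED_RESOURCE_BUS_WR_REQ_TYPE   axi_write_req_t  // placeholder name
#define SHARED_RESOURCE_BUS_WR_DATA_TYPE  axi_write_data_t // placeholder name
#define SHARED_RESOURCE_BUS_WR_RESP_TYPE  axi_write_resp_t // placeholder name
#define SHARED_RESOURCE_BUS_RD_REQ_TYPE   axi_read_req_t   // placeholder name
#define SHARED_RESOURCE_BUS_RD_DATA_TYPE  axi_read_data_t  // placeholder name
#define SHARED_RESOURCE_BUS_HOST_PORTS    NUM_USER_THREADS // placeholder count
#define SHARED_RESOURCE_BUS_HOST_CLK_MHZ  HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_DEV_PORTS     1 // ex. the single Xilinx DDR controller
#define SHARED_RESOURCE_BUS_DEV_CLK_MHZ   XIL_MEM_MHZ
#include "shared_resource_bus_decl.h"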

Displaying Frame Buffer Pixels

In addition to serving 'user' rendering threads, the shared frame buffer memory needs to stream out pixels at a rate that meets the VGA pixel clock timing required to drive a display.

Unlike the old demo, in this demo ddr_dual_frame_buffer.c uses a separate 'read-only priority port' wire, axi_xil_rd_pri_port_mem_host_to_dev_wire, to simply connect a VGA position counter to a dedicated read request side of the shared resource bus. Responses from the bus are the pixels, which are written directly into the vga_pmod_async_pixels_fifo.c display stream.

MAIN_MHZ(host_vga_reader, XIL_MEM_MHZ)
void host_vga_reader()
{
  static uint1_t frame_buffer_read_port_sel_reg;

  // READ REQUEST SIDE
  // Increment VGA counters and do read for each position
  static vga_pos_t vga_pos;
  // Read and increment pos if room in fifos (cant be greedy since will 100% hog priority port)
  uint1_t fifo_ready;
  #pragma FEEDBACK fifo_ready
  // Read from the current read frame buffer addr
  uint32_t addr = pos_to_addr(vga_pos.x, vga_pos.y);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.araddr = dual_ram_to_addr(frame_buffer_read_port_sel_reg, addr);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arlen = 1-1; // size=1 minus 1: 1 transfer cycle (non-burst)
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arsize = 2; // 2^2=4 bytes per transfer
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arburst = BURST_FIXED; // Not a burst, single fixed address per transfer
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.valid = fifo_ready;
  uint1_t do_increment = fifo_ready & axi_xil_rd_pri_port_mem_dev_to_host_wire.read.req_ready;
  vga_pos = vga_frame_pos_increment(vga_pos, do_increment);

  // READ RESPONSE SIDE
  // Get read data from the AXI RAM bus
  uint8_t data[4];
  uint1_t data_valid = 0;
  data = axi_xil_rd_pri_port_mem_dev_to_host_wire.read.data.burst.data_resp.user.rdata;
  data_valid = axi_xil_rd_pri_port_mem_dev_to_host_wire.read.data.valid;
  // Write pixel data into fifo
  pixel_t pixel;
  pixel.a = data[0];
  pixel.r = data[1];
  pixel.g = data[2];
  pixel.b = data[3];
  pixel_t pixels[1];
  pixels[0] = pixel;
  fifo_ready = pmod_async_fifo_write_logic(pixels, data_valid);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.data_ready = fifo_ready;

  frame_buffer_read_port_sel_reg = frame_buffer_read_port_sel;
}

Threads + Kernel

Computation Kernel

In graphics_demo.c the pixel_kernel function implements incrementing RGB channel values as a simple test pattern.

The pixels_kernel_seq_range function iterates over a range of the frame area, executing pixel_kernel for each pixel. The frame area is defined by start and end x and y positions.

// Single 'thread' state machine running pixel_kernel "sequentially" across an x,y range
void pixels_kernel_seq_range(
  kernel_args_t args,
  uint16_t x_start, uint16_t x_end, 
  uint16_t y_start, uint16_t y_end)
{
  uint16_t x;
  uint16_t y;
  for(y=y_start; y<=y_end; y+=TILE_FACTOR)
  {
    for(x=x_start; x<=x_end; x+=TILE_FACTOR)
    {
      if(args.do_clear){
        pixel_t pixel = {0};
        frame_buf_write(x, y, pixel);
      }else{
        // Read the pixel from the 'read' frame buffer
        pixel_t pixel = frame_buf_read(x, y);
        pixel = pixel_kernel(args, pixel, x, y);
        // Write pixel back to the 'write' frame buffer
        frame_buf_write(x, y, pixel);
      }
    }
  }
}

Multiple Threads

Multiple host threads can read and write the frame buffers, each executing its own sequential run of pixels_kernel_seq_range. This is accomplished by manually instantiating multiple derived FSM thread pixels_kernel_seq_range_FSM modules inside a function called render_demo_kernel. The NUM_TOTAL_THREADS = (NUM_X_THREADS*NUM_Y_THREADS) copies of pixels_kernel_seq_range all run in parallel, splitting FRAME_WIDTH across NUM_X_THREADS threads and FRAME_HEIGHT across NUM_Y_THREADS threads.

// Module that runs pixel_kernel for every pixel
// by instantiating multiple simultaneous 'threads' of pixel_kernel_seq_range
void render_demo_kernel(
  kernel_args_t args,
  uint16_t x, uint16_t width,
  uint16_t y, uint16_t height
){
  // Wire up N parallel pixel_kernel_seq_range_FSM instances
  uint1_t thread_done[NUM_X_THREADS][NUM_Y_THREADS];
  uint32_t i,j;
  uint1_t all_threads_done;
  while(!all_threads_done)
  {
    pixels_kernel_seq_range_INPUT_t fsm_in[NUM_X_THREADS][NUM_Y_THREADS];
    pixels_kernel_seq_range_OUTPUT_t fsm_out[NUM_X_THREADS][NUM_Y_THREADS];
    all_threads_done = 1;
    
    uint16_t thread_x_size = width >> NUM_X_THREADS_LOG2;
    uint16_t thread_y_size = height >> NUM_Y_THREADS_LOG2;
    for (i = 0; i < NUM_X_THREADS; i+=1)
    {
      for (j = 0; j < NUM_Y_THREADS; j+=1)
      {
        if(!thread_done[i][j])
        {
          fsm_in[i][j].input_valid = 1;
          fsm_in[i][j].output_ready = 1;
          fsm_in[i][j].args = args;
          fsm_in[i][j].x_start = (thread_x_size*i) + x;
          fsm_in[i][j].x_end = fsm_in[i][j].x_start + thread_x_size - 1;
          fsm_in[i][j].y_start = (thread_y_size*j) + y;
          fsm_in[i][j].y_end = fsm_in[i][j].y_start + thread_y_size - 1;
          fsm_out[i][j] = pixels_kernel_seq_range_FSM(fsm_in[i][j]);
          thread_done[i][j] = fsm_out[i][j].output_valid;
        }
        all_threads_done &= thread_done[i][j];
      }
    }
    __clk();
  }
}

render_demo_kernel can then simply run in a loop, trying for the fastest frames per second possible.

void main()
{
  kernel_args_t args;
  ...
  while(1)
  {
    // Render entire frame
    render_demo_kernel(args, 0, FRAME_WIDTH, 0, FRAME_HEIGHT);
  }
}

The actual graphics_demo.c main() does some extra DDR initialization, is slowed down so the test pattern renders visibly, and toggles the dual frame buffer's 'which is the read buffer' select signal after each render_demo_kernel iteration: frame_buffer_read_port_sel = !frame_buffer_read_port_sel;.
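A trimmed sketch of that structure (DDR initialization and the slow-down details omitted):

void main()
{
  kernel_args_t args;
  // ... DDR init, test pattern slow-down, etc. omitted ...
  while(1)
  {
    // Render the entire frame
    render_demo_kernel(args, 0, FRAME_WIDTH, 0, FRAME_HEIGHT);
    // Swap which of the two frame buffers is the 'read' (displayed) buffer
    frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
  }
}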

Pipelines as Shared Resource

The above graphics demo uses an AXI RAM frame buffer as the resource shared on a bus.

Another common use case is having an automatically pipelined function as the shared resource. shared_resource_bus_pipeline.h is a header-as-macro helper for declaring a pipeline instance connected to multiple host state machines via a shared resource bus.

// Example declaration using helper header-as-macro
#define SHARED_RESOURCE_BUS_PIPELINE_NAME         name
#define SHARED_RESOURCE_BUS_PIPELINE_OUT_TYPE     output_t
#define SHARED_RESOURCE_BUS_PIPELINE_FUNC         the_func_to_pipeline
#define SHARED_RESOURCE_BUS_PIPELINE_IN_TYPE      input_t
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_THREADS NUM_THREADS
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_PIPELINE_DEV_CLK_MHZ  DEV_CLK_MHZ
#include "shared_resource_bus_pipeline.h"

In the above example, a function output_t the_func_to_pipeline(input_t) is made into a pipeline instance used like output_t name(input_t) from NUM_THREADS derived FSM host threads (running at HOST_CLK_MHZ). The function is automatically pipelined to meet the target DEV_CLK_MHZ operating frequency.
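From the host side the result is just a blocking function call; a hypothetical host thread might use it like so:

// Hypothetical derived FSM host thread using the shared pipeline instance
void host_thread()
{
  input_t i;
  // ... fill in i ...
  output_t o = name(i); // Blocks this thread until its result returns over the bus
  // ... use o ...
}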

Mandelbrot Demo

Mandelbrot demo diagram

As a demonstration of shared_resource_bus_pipeline.h the mandelbrot_demo.c file instantiates several pipeline devices for computation inside shared_mandelbrot_dev.c.

For example the device for computing Mandelbrot iterations is declared as such:

// Do N Mandelbrot iterations per call to mandelbrot_iter_func
#define ITER_CHUNK_SIZE 6
#define MAX_ITER 32
typedef struct mandelbrot_iter_t{
  complex_t c;
  complex_t z;
  complex_t z_squared;
  uint1_t escaped;
  uint32_t n;
}mandelbrot_iter_t;
#define ESCAPE 2.0
mandelbrot_iter_t mandelbrot_iter_func(mandelbrot_iter_t inputs)
{
  mandelbrot_iter_t rv = inputs;
  uint32_t i;
  for(i=0;i<ITER_CHUNK_SIZE;i+=1)
  {
    // Mimic while loop
    if(!rv.escaped & (rv.n < MAX_ITER))
    {
      // float_lshift is division by subtraction on exponent only
      rv.z.im = float_lshift((rv.z.re*rv.z.im), 1) + rv.c.im;
      rv.z.re = rv.z_squared.re - rv.z_squared.im + rv.c.re;
      rv.z_squared.re = rv.z.re * rv.z.re;
      rv.z_squared.im = rv.z.im * rv.z.im;
      rv.n = rv.n + 1;
      rv.escaped = (rv.z_squared.re+rv.z_squared.im) > (ESCAPE*ESCAPE);
    }
  }
  return rv;
}
#define SHARED_RESOURCE_BUS_PIPELINE_NAME         mandelbrot_iter
#define SHARED_RESOURCE_BUS_PIPELINE_OUT_TYPE     mandelbrot_iter_t
#define SHARED_RESOURCE_BUS_PIPELINE_FUNC         mandelbrot_iter_func
#define SHARED_RESOURCE_BUS_PIPELINE_IN_TYPE      mandelbrot_iter_t
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_THREADS NUM_USER_THREADS
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_PIPELINE_DEV_CLK_MHZ  MANDELBROT_DEV_CLK_MHZ
#include "shared_resource_bus_pipeline.h"

Note the ITER_CHUNK_SIZE scaling constant: it allows the single pipeline to compute multiple iterations per call (as opposed to using a single-iteration pipeline more times sequentially). In this design, on a medium sized FPGA, the value is scaled (typically to fill the space not used by the derived FSM threads) to ~1-8 iterations in the pipeline.

Other devices include screen_to_complex, which converts a screen position into a complex plane position, as well as iter_to_color, which takes an integer number of iterations and returns an RGB color.
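Those devices are declared with the same shared_resource_bus_pipeline.h pattern. For example, iter_to_color might look roughly like below; the wrapped function name iter_to_color_func and the reuse of MANDELBROT_DEV_CLK_MHZ are illustrative assumptions:

// Sketch: iter_to_color as a shared pipeline device (illustrative names)
#define SHARED_RESOURCE_BUS_PIPELINE_NAME         iter_to_color
#define SHARED_RESOURCE_BUS_PIPELINE_OUT_TYPE     pixel_t  // RGB color
#define SHARED_RESOURCE_BUS_PIPELINE_FUNC         iter_to_color_func
#define SHARED_RESOURCE_BUS_PIPELINE_IN_TYPE      uint32_t // iteration count n
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_THREADS NUM_USER_THREADS
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_PIPELINE_DEV_CLK_MHZ  MANDELBROT_DEV_CLK_MHZ
#include "shared_resource_bus_pipeline.h"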

Simulation

Using the setup from inside the PipelineC-Graphics repo, the following commands compile and run the demo with a 480p display and 8x tiling down to an 80x60 pixel frame buffer.

rm -Rf ./build
../PipelineC/src/pipelinec mandelbrot_pipelinec_app.c --out_dir ./build --comb --sim --verilator --run -1
verilator -Mdir ./obj_dir -Wno-UNOPTFLAT -Wno-WIDTH -Wno-CASEOVERLAP --top-module top -cc ./build/top/top.v -O3 --exe main.cpp -I./build/verilator -CFLAGS -DUSE_VERILATOR -CFLAGS -DFRAME_WIDTH=640 -CFLAGS -DFRAME_HEIGHT=480 -LDFLAGS $(shell sdl2-config --libs)
cp ./main.cpp ./obj_dir
make CXXFLAGS="-DUSE_VERILATOR -I../../PipelineC/ -I../../PipelineC/pipelinec/include -I../build/verilator -I.." -C ./obj_dir -f Vtop.mk
./obj_dir/Vtop

Alternatively, cloning the mandelbrot branch will allow you to run make mandelbrot_verilator instead.

Performance Tuning

Faster Clocks

In keeping with 'the law', it's typical to scale clock frequency as the easiest first option. This design has two relevant clocks: DEV_CLK, the device clock, and HOST_CLK, the host FSM 'threads' clock.

Derived FSMs currently have many optimizations yet to be done, so typically you won't be able to scale thread clocks reliably, or very far. In this design the HOST_CLK is set to ~40MHz.

Device clock scaling is where PipelineC's automatic pipelining is critical. Essentially the device clock can be set ~arbitrarily high. However, the latency penalty of asynchronous clock domain crossings and the latency of ever more deeply pipelined devices need to be weighed; the best solution is not always just to run at the maximum clock rate. In this original single threaded 'CPU style' design it was typical for the optimal device clock to equal the slow host clock, because the extra pipelining latency plus the clock domain crossing into a faster device domain was not an overall performance benefit.
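Concretely, these are just the clock macros fed into the declarations above; the values below are illustrative (only the ~40MHz host clock figure comes from this page):

// Illustrative clock settings
#define HOST_CLK_MHZ           40.0  // Derived FSM 'threads' clock (~40MHz per above)
#define MANDELBROT_DEV_CLK_MHZ 150.0 // Example device clock target for auto-pipelining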

More Threads

Having a few simple sequentially iterating single threads is very... CPU like. In that way, it is possible to scale 'by adding more cores/threads', but this comes with heavy resource use, since the entire state machine must be duplicated. In this example a medium size FPGA comfortably fits 4-8 derived state machine 'threads'.

Design Changes

In addition to tuning 'CPU-like' scaling parameters as above, it's possible to instead make ~'algorithmic'/architectural changes to be even more specific to the task, where the source code looks further and further from the original simple Mandelbrot implementation.

Data Parallelism / Floating Point Units

As opposed to having pipeline devices that each compute dedicated things, for ex. screen_to_complex for screen position and mandelbrot_iter for the Mandelbrot iterations, it is possible to use fewer resources and, again CPU-like, compute with smaller single-operation floating point units. In shared_mandelbrot_dev_fp_ops.c there is a version of the Mandelbrot demo that has two Mandelbrot devices: a floating point adder and a floating point multiplier. These pipelines are used iteratively for all floating point computations.
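Declaring those devices uses the same shared_resource_bus_pipeline.h helper. A hedged sketch for the multiplier is below; the fp_mult_pipeline name, the fp_mult_in_t input struct, the wrapped fp_mult_func, and the FP_DEV_CLK_MHZ macro are all illustrative, the real definitions live in shared_mandelbrot_dev_fp_ops.c:

// Sketch: a single float multiply as a shared pipeline device (illustrative names)
typedef struct fp_mult_in_t{
  float x;
  float y;
}fp_mult_in_t;
float fp_mult_func(fp_mult_in_t i)
{
  return i.x * i.y; // Auto-pipelined to meet the device clock
}
#define SHARED_RESOURCE_BUS_PIPELINE_NAME         fp_mult_pipeline
#define SHARED_RESOURCE_BUS_PIPELINE_OUT_TYPE     float
#define SHARED_RESOURCE_BUS_PIPELINE_FUNC         fp_mult_func
#define SHARED_RESOURCE_BUS_PIPELINE_IN_TYPE      fp_mult_in_t
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_THREADS NUM_USER_THREADS
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_PIPELINE_DEV_CLK_MHZ  FP_DEV_CLK_MHZ // illustrative macro
#include "shared_resource_bus_pipeline.h"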

Synchronous/Blocking Function Calls

The original + and * floating point operators must manually be replaced with function calls (pending work on automatic operator overloading). Functions wrapping the shared resource bus pipelines of the form float fp_add(float x, float y) and float fp_mult(float x, float y) are made available.
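For example, one line of the iteration, rv.z.re = rv.z_squared.re - rv.z_squared.im + rv.c.re;, becomes nested blocking calls (assuming a corresponding fp_sub wrapper, which the non-blocking code below uses as fp_sub_start/fp_sub_finish):

// Blocking-call version of: rv.z.re = rv.z_squared.re - rv.z_squared.im + rv.c.re;
rv.z.re = fp_add(fp_sub(rv.z_squared.re, rv.z_squared.im), rv.c.re);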

Asynchronous/Non-blocking Function Calls

Replacing all floating point operators with generic fp_add/fp_mult pipeline devices, instead of using specialized Mandelbrot-specific pipeline devices, will obviously result in a performance decrease. However, this structure opens up new ways of using the pipelines.

To start, automatically pipelining smaller single floating point units is easier than pipelining more complex datapaths consisting of multiple operators, so it becomes very easy to always run the floating point units near the maximum possible clock rate. However, best performance requires saturating those highly pipelined operators with multiple operations in flight at once (ex. multiple adds/mults at once). An easy way to do that from a single thread is to find parallelism in the data:

The Mandelbrot iteration consists of four add/sub operations and three multiply operations. However, not all of the operations depend on each other, which allows some to be completed at the same time as others.

The below C code completes two floating point operations at the same time. This is accomplished by starting the two ops, then waiting for both to finish.

// Code written in columns - a single op looks like:
//       _start(args);
//    out = _finish();

// Mult and sub at same time
/*float re_mult_im = */ fp_mult_start(rv.z.re, rv.z.im); /*float re_minus_im = */ fp_sub_start(rv.z_squared.re, rv.z_squared.im);
float re_mult_im = fp_mult_finish(/*rv.z.re, rv.z.im*/); float re_minus_im = fp_sub_finish(/*rv.z_squared.re, rv.z_squared.im*/);
// Two adds at same time
/*rv.z.im = */ fp_add_start(float_lshift(re_mult_im, 1), rv.c.im); /*rv.z.re = */ fp_add_start(re_minus_im, rv.c.re);
rv.z.im = fp_add_finish(/*float_lshift(re_mult_im, 1), rv.c.im*/); rv.z.re = fp_add_finish(/*re_minus_im, rv.c.re*/);
// Two mult at same time
/*rv.z_squared.re = */ fp_mult_start(rv.z.re, rv.z.re); /*rv.z_squared.im = */ fp_mult_start(rv.z.im, rv.z.im); 
rv.z_squared.re = fp_mult_finish(/*rv.z.re, rv.z.re*/); rv.z_squared.im = fp_mult_finish(/*rv.z.im, rv.z.im*/); 
// Final single adder alone
/*float re_plus_im = */ fp_add_start(rv.z_squared.re, rv.z_squared.im); 
float re_plus_im = fp_add_finish(/*rv.z_squared.re, rv.z_squared.im*/); 

'Software threads' / Time Multiplexing / 'Coroutines'

As has been noted, derived FSMs have many optimizations yet to be done, so simply adding more 'threads/cores' to the design doesn't scale very far. Additionally, the data parallelism found in the Mandelbrot iteration operations is not as great a level of parallelism as deeply pipelined devices can handle. Ex. you might have 2, 4, or 8 threads/cores of FSM execution and 1-2 operations in flight from each thread - meaning a maximum pipeline depth of ~8*~2=16 stages is the most that pipelining/frequency scaling can benefit the design. What is needed is a way to access pixel level parallelism (ex. how the many, many pixels can all be computed independently of each other by threads splitting parts of the screen to render) but without the cost of having many hardware threads.

Instead it's possible to emulate having multiple physical 'cores/threads' of FSM execution by using a single physical thread to time multiplex across multiple ~'instruction streams'/state machines. This is very similar to how single core CPUs run multiple "simultaneous" threads by having processes time multiplexed onto the core by the operating system. In this case there is no operating system - the scheduling is a simple fixed cycling through the discrete states of ~coroutine style functions.

Coroutines require execution to be suspended and then resumed at a later time. In this case that is accomplished by writing a state machine in C/PipelineC that itself is 'running' on the derived state machine 'core/thread' (yes, it's confusing). This allows a single copy of the hardware (the physical FSM) to implement 'in software' N "simultaneous" (quickly time multiplexed) copies of a small 'coroutine' state machine. In the mandelbrot_demo_w_fake_time_multiplex_threads.c version of the demo, physical derived FSM 'cores/threads' are called 'hardware threads' and the multiple time multiplexed instances of coroutines are called 'software threads'.

The below code from shared_mandelbrot_dev_w_fake_time_multiplex_threads.c describes the high level mandelbrot_kernel_fsm 'coroutine' state machine. It cycles through computing the screen position, looping over the Mandelbrot iterations, and then computing the final pixel color. It is written using async _start and _finish function calls to the original Mandelbrot-specific shared pipeline devices. Each device has a state for starting an operation and a state for finishing it.

typedef enum mandelbrot_kernel_fsm_state_t{
  screen_to_complex_START,
  screen_to_complex_FINISH,
  mandelbrot_iter_START,
  mandelbrot_iter_FINISH,
  iter_to_color_START,
  iter_to_color_FINISH
}mandelbrot_kernel_fsm_state_t;
typedef struct mandelbrot_kernel_state_t{
  // FSM state
  mandelbrot_kernel_fsm_state_t fsm_state;
  // Func inputs
  pixel_t pixel;
  screen_state_t screen_state;
  uint16_t x;
  uint16_t y;
  // Func local vars
  mandelbrot_iter_t iter;
  screen_to_complex_in_t stc;
  pixel_t p;
  // Done signal
  uint1_t done;
}mandelbrot_kernel_state_t;
mandelbrot_kernel_state_t mandelbrot_kernel_fsm(mandelbrot_kernel_state_t state)
{
  // The state machine starting and finishing async operations
  state.done = 0;
  if(state.fsm_state==screen_to_complex_START){
    // Convert pixel coordinate to complex number 
    state.stc.state = state.screen_state;
    state.stc.x = state.x;
    state.stc.y = state.y;
    /*state.iter.c = */screen_to_complex_start(state.stc);
    state.fsm_state = screen_to_complex_FINISH;
  }else if(state.fsm_state==screen_to_complex_FINISH){
    state.iter.c = screen_to_complex_finish(/*state.stc*/);
    // Do the mandelbrot iters
    state.iter.z.re = 0.0;
    state.iter.z.im = 0.0;
    state.iter.z_squared.re = 0.0;
    state.iter.z_squared.im = 0.0;
    state.iter.n = 0;
    state.iter.escaped = 0;
    state.fsm_state = mandelbrot_iter_START;
  }else if(state.fsm_state==mandelbrot_iter_START){
    if(!state.iter.escaped & (state.iter.n < MAX_ITER)){
      /*state.iter = */mandelbrot_iter_start(state.iter);
      state.fsm_state = mandelbrot_iter_FINISH;
    }else{
      state.fsm_state = iter_to_color_START;
    }
  }else if(state.fsm_state==mandelbrot_iter_FINISH){
    state.iter = mandelbrot_iter_finish(/*state.iter*/);
    state.fsm_state = mandelbrot_iter_START;
  }else if(state.fsm_state==iter_to_color_START){
    // The color depends on the number of iterations
    /*state.p = */iter_to_color_start(state.iter.n);
    state.fsm_state = iter_to_color_FINISH;
  }else if(state.fsm_state==iter_to_color_FINISH){
    state.p = iter_to_color_finish(/*state.iter.n*/);
    state.done = 1;
    state.fsm_state = screen_to_complex_START; // Probably not needed
  }
  return state;
}

Then a single hardware 'thread' of the below C code is used to execute NUM_SW_THREADS time multiplexed "simultaneous" instances of the above mandelbrot_kernel_fsm coroutine 'software thread' state machine:

n_pixels_t n_time_multiplexed_mandelbrot_kernel_fsm(
  screen_state_t screen_state,
  uint16_t x[NUM_SW_THREADS], 
  uint16_t y[NUM_SW_THREADS]
){
  n_pixels_t rv;
  mandelbrot_kernel_state_t kernel_state[NUM_SW_THREADS];
  // INIT
  uint32_t i;
  for(i = 0; i < NUM_SW_THREADS; i+=1)
  {
    kernel_state[i].fsm_state = screen_to_complex_START; // Probably not needed
    kernel_state[i].screen_state = screen_state;
    kernel_state[i].x = x[i];
    kernel_state[i].y = y[i];
  }
  // LOOP doing N 'coroutine'/fsms until done
  uint1_t thread_done[NUM_SW_THREADS];
  uint1_t all_threads_done = 0;
  do
  {
    all_threads_done = 1;
    for(i = 0; i < NUM_SW_THREADS; i+=1)
    {
      // operate on front of shift reg [0] (as opposed to random[i])
      if(!thread_done[0]){
        kernel_state[0] = mandelbrot_kernel_fsm(kernel_state[0]);
        rv.data[0] = kernel_state[0].p;
        thread_done[0] = kernel_state[0].done;
        all_threads_done &= thread_done[0];
      }
      // And then shift the reg to prepare next at [0]
      ARRAY_1ROT_DOWN(mandelbrot_kernel_state_t, kernel_state, NUM_SW_THREADS)
      ARRAY_1ROT_DOWN(pixel_t, rv.data, NUM_SW_THREADS)
      ARRAY_1ROT_DOWN(uint1_t, thread_done, NUM_SW_THREADS)
    }
  }while(!all_threads_done);
  return rv;
}

As seen above, the code loops across all NUM_SW_THREADS instances of the coroutine state machine, where each iteration calls mandelbrot_kernel_fsm and checks to see if the state machine has completed yet (and supplied a return value). In this way, cycling one state transition at a time across all of the 'soft' instances makes for an easy way to have NUM_SW_THREADS operations in flight at once to the pipelined devices from just a single physical hardware derived FSM instance thread.

Conclusion

Using these shared resource buses, it's possible to picture even more complex architectures of host threads and computation devices.

Generally the functionality in shared_resource_bus.h will continue to be improved and made easier to adapt to more design situations.

Please reach out if interested in giving anything a try or making improvements, happy to help! -Julian
