FSM Style
This page describes the 'finite state machine style' of writing PipelineC code. Inspired by Silice's clock step operator, PipelineC introduced a `__clk()` function that allows the generation of complex derived finite state machines.
Consider this example code:
```c
// No clock cycles, combinatorial logic
uint32_t test0(uint32_t x)
{
  uint32_t rv = x + 1;
  return rv;
}
// One clock cycle before add, an FSM
uint32_t test1(uint32_t x)
{
  __clk();
  uint32_t rv = x + 1;
  return rv;
}
// One clock cycle after add, an FSM
uint32_t test2(uint32_t x)
{
  uint32_t rv = x + 1;
  __clk();
  return rv;
}
```
and how they behave over time. These test cases drive the input `x` with 2, thus expecting 3 as the returned output.
`test0` is combinatorial logic: the result is available at the output immediately, in the same cycle.

`test1` introduces a clock cycle of delay via `__clk()`. Notice how the return value register `registers_r.state_regs.rv` for `test1` compares with `test2`: there is a one clock cycle delay in `test1` before the `rv` state register takes its value. However, note the return value output of the function, `return_output.return_output`, and its associated handshake valid signal, `return_output.output_valid`: in both `test1` and `test2` the returned value is 3 during the second cycle. This is because both functions describe two clock cycles of computation, with only slightly different combinatorial logic data paths.
In `test1` the `x` input value is registered and saved for use after the `__clk()`. This means the data path for `test1` runs from an `x` register, through an adder, to the output return value port (the `rv` register is actually unused and optimizes away in synthesis).

In `test2` the `x` input value is used before the `__clk()`. This means the data path for `test2` runs from the input port `x` (no register), through an adder as combinatorial logic, into an `rv` register. In the second cycle the `rv` register drives the output return value port.
`__clk()` is essentially used as a dividing marker for how the combinatorial logic is split across clock cycles, i.e. where registers will be inserted.
For example, using the two tests as subroutines:
```c
// Two subroutine FSMs
uint32_t main(uint32_t a)
{
  uint32_t b = test1(a);
  uint32_t c = test2(b);
  return c;
}
```
It is possible to write a state machine in 'regular' PipelineC without `__clk()` operators: each FSM state and its transition logic is explicitly written as PipelineC code (similar to regular HDL). This is verbose but works. The FSM style discussed here is essentially a layer of code generation that creates that verbose code for the user, with states derived from `__clk()` use.
Calling other FSM style functions from within an FSM style function looks exactly like the expected C function call syntax. However, the syntax for declaring an FSM instance outside of another FSM function (in 'regular' PipelineC code) looks like so:
```c
uint32_t test1(uint32_t x)
{
  __clk();
  uint32_t rv = x + 1;
  return rv;
}
#include "test1_FSM.h"
#pragma MAIN test1_wrapper
uint32_t test1_wrapper()
{
  // The instance
  test1_INPUT_t i;
  i.x = 2;
  i.input_valid = 1;
  i.output_ready = 1;
  test1_OUTPUT_t o = test1_FSM(i);
  return o.return_output;
}
```
The `#include "test1_FSM.h"` signals that the `test1` function is a finite state machine, and is also the name of the generated file where the PipelineC code describing the derived FSM will go. These FSM style functions include input and output handshaking signals (`_valid`/`_ready`) for function entry and return. As such, inside the generated `_FSM.h` are structs that define input and output types like `test1_INPUT_t` and `test1_OUTPUT_t`. These types are used with the generated `test1_FSM` function, which is used when instantiating the module in the top level `MAIN` of the design.
Using the generated `_FSM()` function and handshake signal structs from generated code is only needed when the function is at the top level of the design or connecting with other non-FSM style PipelineC. I.e. calling other FSM functions from within an FSM looks as expected: regular C function call syntax.
Consider the below function that first waits for 5 clocks, then 4, and so on down to 1. It does this via a `wait` function that presumably does `__clk()` the number of times specified by the input argument. This code compiles fine:
```c
uint32_t wait_test()
{
  uint32_t clocks_to_wait = 5;
  while(clocks_to_wait > 0)
  {
    wait(clocks_to_wait);
    clocks_to_wait -= 1;
  }
  return clocks_to_wait;
}
```
This implementation of the `wait()` function will not compile:
```c
uint32_t wait(uint32_t clocks)
{
  while(clocks > 0)
  {
    clocks -= 1;
  }
  return clocks;
}
```
By default all loops are unrolled, i.e. the loop could decrement the `clocks` value any number of times in one clock cycle by chaining subtractors in combinatorial logic. The tool does not know how to spread the loop across multiple clock cycles unless instructed how to do so. Adding a `__clk()` to the loop body makes the code compile fine:
```c
uint32_t wait(uint32_t clocks)
{
  while(clocks > 0)
  {
    __clk();
    clocks -= 1;
  }
  return clocks;
}
```
The `wait_test()` function does not need a `__clk()` in its loop body, since its use of the FSM style `wait` function now includes the clock delay markers.
When multiple simultaneous instances of a derived state machine 'thread' are running, there are a few ways to deal with sharing resources. First are the lower level global variable instance arrays for resolving multiple drivers, etc. On top of that is the more general ~AXI-like shared resource buses header, which can generate arbitration for any arbitrary module. If specifically looking to arbitrate access to another derived state machine, the single instance FSMs / shared regions below are another option.
By default in PipelineC there is one instance of function hardware per `MAIN` top level FSM ~thread. Multiple calls to the same function from different threads do not use/refer to the same underlying hardware. This can be overridden by signalling that calls to a certain function should not be duplicated, so that only a single instance is shared among the many calling locations in code:
```c
uint8_t atomic_add(uint8_t val, uint8_t tid)
{
  static uint8_t the_reg;
  uint8_t new_reg_val = the_reg + val;
  printf("Thread %d incremented value from %d -> %d.\n",
         tid, the_reg, new_reg_val);
  the_reg = new_reg_val;
  return the_reg;
}
// Signal to tool to derive a single instance
// FSM region, ex. only one 8b the_reg in the design
// shared among N calling instances
#include "atomic_add_SINGLE_INST.h"
```
The above function FSM only exists once in the design; arbitration logic/muxing/etc. is added to share the valid/ready FSM entry/return handshaking signals. A setup of N 'thread' instances of an `incrementer_thread` FSM (a function that repeatedly calls `atomic_add`) is seen below:
```c
// Top level N parallel instances of incrementer_thread
#define N_THREADS 10
// Connect FSM outputs to a top level output
// so they don't synthesize away
typedef struct n_thread_outputs_t
{
  incrementer_thread_OUTPUT_t data[N_THREADS];
}n_thread_outputs_t;
#pragma MAIN_MHZ main 100.0
n_thread_outputs_t main()
{
  // Register keeping time count, knowing when done
  static uint32_t clk_cycle_count;
  // N parallel instances of incrementer_thread
  uint8_t tid;
  incrementer_thread_INPUT_t inputs[N_THREADS];
  n_thread_outputs_t outputs;
  for(tid=0; tid<N_THREADS; tid+=1)
  {
    inputs[tid].tid = tid;
    inputs[tid].input_valid = 1;
    inputs[tid].output_ready = 1;
    outputs.data[tid] = incrementer_thread_FSM(inputs[tid]);
    if(outputs.data[tid].input_ready)
    {
      printf("Clock# %d: Thread %d starting.\n",
             clk_cycle_count, tid);
    }
    if(outputs.data[tid].output_valid)
    {
      printf("Clock# %d: Thread %d output value = %d.\n",
             clk_cycle_count, tid, outputs.data[tid].return_output);
    }
  }
  // Func occurring each clk cycle
  clk_cycle_count += 1;
  // Wire up outputs
  return outputs;
}
```
The above code has the following behavior in simulation: as expected, only a single calling FSM 'thread' at a time is able to have access and ~'atomically increment' the register value inside the `atomic_add` function. The rest of the threads block, waiting their turn to access the FSM in a later cycle.
Sharing a single FSM among N calling threads can at worst run at 1/N the speed of having a non-shared duplicate copy of the FSM nearby to invoke whenever a calling thread needs it. Additionally, there is a cost to arbitrating between the N calling threads: 1-to-N and N-to-1 muxes for both the input arguments and the output return value of the function (ex. if the function takes a lot of inputs or outputs then sharing becomes costly quickly). Finally, if trying to round-robin/pipeline/time-share single instance FSMs among N calling threads, it is best to have N or more other single instance resources/'steps of the computation' in sequence before returning to the same shared resource again. Ex. a thread that just repeatedly invokes a shared resource in a tight loop is not efficient. Instead, invoke a single instance function and move on to do some work elsewhere, giving other threads the chance to access the resource with less competition.
In addition to the fundamental `__clk()` function/operator, the FSM style comes with three more functions:

```c
void __in();
void __out(<return type> return_value);
void __io(<return type> return_value);
```
When `__in()` is invoked the function can receive new inputs: the input valid+ready handshake is repeated. This call takes one or more cycles to return (it waits to redo the handshake); once complete, the input argument variables have been updated with new input values (overriding user values) and execution continues.

Similarly, `__out()` takes an argument of the same type as the `return` type of the function. When invoked, the output valid+ready handshake is repeated: a value is 'returned' except function execution continues instead of ending. Again, this takes one or more cycles to complete, depending on when the output side of the handshake is ready.

Finally, `__io()` does both the `__in()` and `__out()` functionality in one or more clock cycles (calling `__in()` then `__out()` would take two or more clock cycles).
These functions explicitly invoke the built in handshaking signals for function calls in ways that are not captured by regular C function call syntax (ex. C has no concept of yielding values from functions). As such, it only makes sense to use these IO functions when instantiating the generated `_FSM()` wrapper function, i.e. when manually wiring/controlling handshaking signals for multiple inputs and outputs over time. For example, blinking an LED can be done with a while loop outputting the LED value with `__out()` each clock cycle:
```c
// Output 4b wired to LEDs
// Derived FSM style
uint4_t main()
{
  // a 28 bit unsigned integer register
  uint28_t counter = 0;
  while (1) {
    // LEDs updated every clock
    // with the 4 most significant bits
    uint4_t led = uint28_27_24(counter);
    counter = counter + 1;
    __out(led);
  }
}
```
```c
// Including this header auto generates a derived
// finite state machine (in PipelineC) from main
#include "main_FSM.h"
// The top level 'main' wrapper instantiation
// IO port types from generated code
#pragma MAIN_MHZ main_wrapper 100.0
uint4_t main_wrapper()
{
  main_INPUT_t i;
  i.input_valid = 1;
  i.output_ready = 1;
  main_OUTPUT_t o = main_FSM(i);
  return o.return_output;
}
```
UART makes for an easy state machine; check out this half duplex UART loopback example. See the recent Basic Neural Network FSM demo that creates a state machine with just a few states, which are used iteratively many times via nested loops. For a dive into more complex state machines, check out the advanced POSIX-like function calls to `open`, `read`, `write` files on a host OS to play a guessing game with the user using stdin and stdout (see a tweet about an old version of the code). Feel free to reach out for help!