FSM Style
This page describes the 'finite state machine style' of writing PipelineC code. Inspired by Silice's clock step operator, PipelineC introduced a `__clk()` function that allows the generation of complex derived finite state machines.
Consider this example code:
```c
// No clock cycles, combinatorial logic
uint32_t test0(uint32_t x)
{
  uint32_t rv = x + 1;
  return rv;
}
// One clock cycle before add, an FSM
uint32_t test1(uint32_t x)
{
  __clk();
  uint32_t rv = x + 1;
  return rv;
}
// One clock cycle after add, an FSM
uint32_t test2(uint32_t x)
{
  uint32_t rv = x + 1;
  __clk();
  return rv;
}
```
and how they behave over time. These test cases drive the input `x` with 2, thus expecting 3 as the returned output.
`test0` is combinatorial logic: the result is available at the output immediately, in the same cycle.

`test1` introduces a clock cycle of delay via `__clk()`. Notice how the return value register `registers_r.state_regs.rv` for `test1` compares with `test2`: there is a one clock cycle delay in `test1` before the `rv` state register takes its value. However, note the return value output of the function, `return_output.return_output`, and its associated handshake valid signal, `return_output.output_valid`: in both `test1` and `test2` the returned value is 3 during the second cycle. This is because both functions describe two clock cycles of computation, with only slightly different combinatorial logic data paths.
In `test1` the `x` input value is registered and saved for use after the `__clk()`. This means the data path for `test1` runs from an `x` register, through an adder, to the output return value port (the `rv` register is actually unused and optimizes away in synthesis).

In `test2` the `x` input value is used before the `__clk()`. This means the data path for `test2` runs from the input port `x` (no register), through an adder as combinatorial logic, into an `rv` register. In the second cycle the `rv` register drives the output return value port.
`__clk()` is essentially used as a dividing marker for how the combinatorial logic is split across clock cycles, i.e. where registers will be inserted.
For example, using the two tests as subroutines:
```c
// Two subroutine FSMs
uint32_t main(uint32_t a)
{
  uint32_t b = test1(a);
  uint32_t c = test2(b);
  return c;
}
```
It is possible to write a state machine in 'regular' PipelineC without `__clk()` operators: each FSM state and its transition logic is explicitly written as PipelineC code (similar to regular HDL). This is verbose but works. The FSM style discussed here is essentially a layer of code generation that creates that verbose code for the user, with states derived from `__clk()` use.
Calling other FSM style functions from within an FSM style function looks exactly like the expected C function call syntax. However, the syntax for declaring an FSM instance outside of another FSM function (in 'regular' PipelineC code) looks like so:
```c
uint32_t test1(uint32_t x)
{
  __clk();
  uint32_t rv = x + 1;
  return rv;
}
#include "test1_FSM.h"
#pragma MAIN test1_wrapper
uint32_t test1_wrapper()
{
  // The instance
  test1_INPUT_t i;
  i.x = 2;
  i.input_valid = 1;
  i.output_ready = 1;
  test1_OUTPUT_t o = test1_FSM(i);
  return o.return_output;
}
```
The `#include "test1_FSM.h"` signals that the `test1` function is a finite state machine, and is also the name of the generated file where the PipelineC code describing the derived FSM will go. These FSM style functions include input and output handshaking signals (`_valid`/`_ready`) for function entry and return. As such, inside the generated `_FSM.h` are structs that define input and output types like `test1_INPUT_t` and `test1_OUTPUT_t`. These types are used with the generated `test1_FSM` function, which is used when instantiating the module in the top level `MAIN` of the design.
Using the generated `_FSM()` function and handshake signal structs from generated code is only needed when the function is at the top level of the design or connecting with other non-FSM style PipelineC. I.e. calling other FSM functions from within an FSM looks as expected: regular C function call syntax.
Consider the below function that first waits for 5 clocks, then 4, and so on down to 1. It does this via a `wait` function that presumably does `__clk()` the number of times specified by the input argument. This code compiles fine:
```c
uint32_t wait_test()
{
  uint32_t clocks_to_wait = 5;
  while(clocks_to_wait > 0)
  {
    wait(clocks_to_wait);
    clocks_to_wait -= 1;
  }
  return clocks_to_wait;
}
```
This implementation of the `wait()` function will not compile:
```c
uint32_t wait(uint32_t clocks)
{
  while(clocks > 0)
  {
    clocks -= 1;
  }
  return clocks;
}
```
By default all loops are unrolled, i.e. the loop could decrement the `clocks` value any number of times in one clock cycle by chaining subtractors in combinatorial logic. The tool does not know how to spread the loop across multiple clock cycles unless instructed how to do so. Adding a `__clk()` to the loop body makes the code compile fine:
```c
uint32_t wait(uint32_t clocks)
{
  while(clocks > 0)
  {
    __clk();
    clocks -= 1;
  }
  return clocks;
}
```
The `wait_test()` function does not need a `__clk()` in its loop body, since its use of the FSM style `wait` function now includes the clock delay markers.
When multiple simultaneous instances of a derived state machine 'thread' are running, there are a few ways to deal with sharing resources. First are the lower level global variable instance arrays for resolving multiple drivers, etc. On top of that is the more general ~AXI-like shared resource buses header, which can generate arbitration for any arbitrary module. If specifically looking to arbitrate access to another derived state machine, the single instance FSMs / shared regions below are another option.
By default in PipelineC there is one instance of function hardware per `MAIN` top level FSM ~thread. Multiple calls to the same function from different threads do not use/refer to the same underlying hardware. This can be overridden by signalling that calls to a certain function should not be duplicated, so that only a single instance is shared among the many calling locations in code:
```c
uint8_t atomic_add(uint8_t val, uint8_t tid)
{
  static uint8_t the_reg;
  uint8_t new_reg_val = the_reg + val;
  printf("Thread %d incremented value from %d -> %d.\n",
         tid, the_reg, new_reg_val);
  the_reg = new_reg_val;
  return the_reg;
}
// Signal to tool to derive a single instance
// FSM region, ex. only one 8b the_reg in the design
// shared among N calling instances
#include "atomic_add_SINGLE_INST.h"
```
The above function FSM only exists once in the design; arbitration logic/muxing/etc. is added to share the valid/ready FSM entry/return handshaking signals. A setup of N 'thread' instances of an `incrementer_thread` FSM (a function that repeatedly calls `atomic_add`) is seen below:
```c
// Top level N parallel instances of incrementer_thread
#define N_THREADS 10
// Connect FSM outputs to a top level output
// so they don't synthesize away
typedef struct n_thread_outputs_t
{
  incrementer_thread_OUTPUT_t data[N_THREADS];
}n_thread_outputs_t;
#pragma MAIN_MHZ main 100.0
n_thread_outputs_t main()
{
  // Register keeping time count, knowing when done
  static uint32_t clk_cycle_count;
  // N parallel instances of incrementer_thread
  uint8_t tid;
  incrementer_thread_INPUT_t inputs[N_THREADS];
  n_thread_outputs_t outputs;
  for(tid=0; tid<N_THREADS; tid+=1)
  {
    inputs[tid].tid = tid;
    inputs[tid].input_valid = 1;
    inputs[tid].output_ready = 1;
    outputs.data[tid] = incrementer_thread_FSM(inputs[tid]);
    if(outputs.data[tid].input_ready)
    {
      printf("Clock# %d: Thread %d starting.\n",
             clk_cycle_count, tid);
    }
    if(outputs.data[tid].output_valid)
    {
      printf("Clock# %d: Thread %d output value = %d.\n",
             clk_cycle_count, tid, outputs.data[tid].return_output);
    }
  }
  // Func occurring each clk cycle
  clk_cycle_count += 1;
  // Wire up outputs
  return outputs;
}
```
The above code has the following behavior in simulation: as expected, only a single calling FSM 'thread' at a time is able to have access and ~'atomically increment' the register value inside the `atomic_add` function. The rest of the threads block, waiting their turn to access the FSM in a later cycle.
Sharing a single FSM among N calling threads can at worst run at 1/N the speed of having a non-shared duplicate copy of the FSM nearby to invoke whenever a calling thread needs it. Additionally, there is a cost to arbitrating between the N calling threads: 1-to-N and N-to-1 muxes for both the input arguments and the output return value of the function (ex. if the function takes a lot of inputs or outputs then sharing becomes costly quickly). Finally, if trying to round-robin/pipeline/time-share single instance FSMs among N calling threads, it is best to have N or more other single instance resources/'steps of the computation' in sequence before returning to the same shared resource again. Ex. a thread that just repeatedly invokes a shared resource in a tight loop is not efficient. Instead, invoke a single instance function and move on to do some work elsewhere, giving other threads the chance to access the resource with less competition.
In addition to the fundamental `__clk()` function/operator, the FSM style comes with three more functions:

```c
void __in();
void __out(<return type> return_value);
void __io(<return type> return_value);
```
When `__in()` is invoked the function can receive new inputs: the input valid+ready handshake is repeated. This call takes one or more cycles to return (it waits to redo the handshake); once complete, the input argument variables have been updated with new input values (overriding user values) and execution continues.

Similarly, `__out()` takes an argument of the same type as the `return` type of the function. When invoked, the output valid+ready handshake is repeated: a value is 'returned' except function execution continues instead of ending. Again, this takes one or more cycles to complete, depending on when the output side of the handshake is ready.

Finally, `__io()` does both the `__in()` and `__out()` functionality in one or more clock cycles (calling `__in()` then `__out()` would take two or more clock cycles).
These functions explicitly invoke the built in handshaking signals for function calls in ways that are not captured by regular C function call syntax (ex. C has no concept of yielding values from functions). As such, it only makes sense to use these IO functions when instantiating the generated `_FSM()` wrapper function, i.e. when manually wiring/controlling handshaking signals for multiple inputs and outputs over time. For example, blinking an LED can be done with a while loop outputting the LED value with `__out()` each clock cycle:
```c
// Output 4b wired to LEDs
// Derived FSM style
uint4_t main()
{
  // a 28 bit unsigned integer register
  uint28_t counter = 0;
  while (1) {
    // LEDs updated every clock
    // with the 4 most significant bits
    uint4_t led = uint28_27_24(counter);
    counter = counter + 1;
    __out(led);
  }
}
```
```c
// Including this header auto generates a derived
// finite state machine (in PipelineC) from main
#include "main_FSM.h"
// The top level 'main' wrapper instantiation
// IO port types from generated code
#pragma MAIN_MHZ main_wrapper 100.0
uint4_t main_wrapper()
{
  main_INPUT_t i;
  i.input_valid = 1;
  i.output_ready = 1;
  main_OUTPUT_t o = main_FSM(i);
  return o.return_output;
}
```
UART makes for an easy state machine; check out this half duplex UART loopback example. See the recent Basic Neural Network FSM demo that creates a state machine with just a few states, which are used iteratively many times via nested loops. For a dive into more complex state machines, check out the advanced POSIX-like function calls to `open`, `read`, `write` files on a host OS to play a guessing game with the user using stdin and stdout (see a tweet about an old version of the code). Feel free to reach out for help!