Skip to content

Meanwhale/ByteAutomata

Repository files navigation

ByteAutomata

Language recognition for parsing. Works in Java, C++, and C#.

ByteAutomata provides a method to implement a hard-coded finite-state machine for text code tokenizing. This project includes an example program that turns input code to a token tree. It can be used as a template for your own parser. Here's the C++ code that defines state machine for the example syntax:

            alt text

It's developed as a part of Meanscript, a work-in-progress multi-platform scripting and bytecode language:

https://github.com/Meanwhale/MeanscriptCLI

If you have any questions, comments, ideas, etc. contact me:

https://twitter.com/TheMeanwhale
meanwhale@gmail.com

Compile and run

Compile the example program in Java, C++, and/or C#. You can give a code (see the syntax below) to the example program and see a resulting token tree printed, if there isn't any errors.

Execute the example program to see guide for command line arguments.

  • Java

    You need JDK 8 to compile and run. Go to folder java. Compile classes and run the main class ByteAutomataJava:

        javac ByteAutomataJava.java
        java ByteAutomataJava
    
  • C++

    You need GCC (https://gcc.gnu.org/) to compile. Go to folder cpp. Call make to complile or use command

        g++ -std=c++14 main.cpp src/code.cpp -o byteautomata
    

    and then run byteautomata (.exe):

        ./byteautomata
    
  • C#

    Compile

        csc MeanCS.cs ByteAutomata.cs src\code.cs
    

    and run

        ByteAutomata.exe
    

Input Code Syntax

The example program recognizes simple C-like code containing

  • text characters a-z and A-Z (no numbers or underscore)
  • number : characters 0-9
  • code blocks starting with ([{ and ending with )]} respectively
  • expression breaks: comma and semicolon
  • white space: space, linebreak, and tab

The syntax is defined in class MicroLexer in source code. For example input

abc 123; foo (456, bar[i])

results this token tree:

[<ROOT>]             // expression, token tree root
  [abc]              // text token
  [123]              // number token
[<EXPR>]             // expression
  [foo]
  [<BLOCK>]          // code block root
    [<EXPR>]
      [456]
    [<EXPR>]
      [bar]
      [<BLOCK>]
        [<EXPR>]
          [i]

Project content

Project code is generated from base code, written in C-like language and macros (not included to this repository for now). Base code is run thru GCC preprocessor for target languages, which are currently C++, C#, and Java. The most essential classes are ByteAutomata, a general use state machine, and MicroLexer, an example implementation.

Folder structure

folder content
/ root folder: README, LICENCE, and an example script file.
/cpp/ main() source file, header, and utils
/cpp/src/ generated source (code.cpp) code and header files
/csharp/ Main() source file and utils
/csharp/src/ generated source code (code.cs)
/java/ main() class source
/java/net/meanscript/ generated code: classes for public interface
/java/net/meanscript/core/ generated code: internal classes
/java/net/meanscript/java/ Java-specific code

How It Works

ByteAutomata has a (logically) two-dimensional byte array, which has a column for each input byte (256) and a row for each state. For example here's a simple state machine (https://en.wikipedia.org/wiki/Finite-state_machine), that recognizes text and numbers separated with white space, eg. abc 123 ef:

alt text

Here's how it's done in ByteAutomata's byte array, where array items are indexes of transition callbacks:

                             0 ... {space} ... ' 0' '1' '2' ... '9' .... 'a' ... 'z'
			     
    state 1: white space              0          a   a   a  ...  a        b ...  _b_
    state 2: number                   c          0   0   0       0       [X] ... [X]
    state 3: text                     d         [X] [X] [X]     [X]       0  ...  0
    
    callback a: nextState(number)
    callback b: nextState(text)
    callback c: addNumberToken(), nextState(white space)
    callback d: addTextToken(), nextState(white space)

For example, if we're in 'white space' state, and input byte is 'z', then callback 'b' is called (the array item for this is surrounded with underscores). [X] is an error code 0xff (255), and 0 is for staying on the same state without a callback call. All bytes are [X] by default.

State transitions are created for ByteAutomata by calling

    transition(state, inputBytes, callback);

For example, to define a transition from 'white space' to 'number' state when input is a number character, call (in pseudo-Java)

    transition(whiteSpaceState, "0123456789", () -> { nextState(numberState); } );

To create a token tree, add nodes to the tree on transitions. For example, in 'number' state, when we get a space byte as an input, we can add a number token:

    transition(numberState, " ", () -> {addNumberToken(startIndex, currentIndex); nextState(whiteSpaceState);});

So, if input is "abc 123 de", status at the end of "123" is like

             startIndex   currentIndex
                      v   v
    index     0 1 2 3 4 5 6 7 8 9
    input     a b c   1 2 3   d e

Then addToken(4, 6) is called. It copies the number characters [4], [5], and [6] from input buffer, save them to a new node, and add the node to the token tree.

Token tree is made of MNodes whose member are references/pointers to the next sibling and a child node, and token data. You can iterate thru the node tree nodes using class NodeIterator or directly class MNode. For example printing all nodes with a recursive function, in pseudo-Java:

void printTree (MNode node)
{
    print(node.data);
    if (node.child != null) printTree(node.child);
    if (node.next != null) printTree(node.next);
}

// ...

    printTree(rootNode);

You'll find the full version of the function in class MNode.


Copyright (c) 2020, Meanwhale