ByteAutomata

Language recognition for parsing. Works in Java, C++, and C#.

ByteAutomata provides a method to implement a hard-coded finite-state machine for text code tokenizing. This project includes an example program that turns input code to a token tree. It can be used as a template for your own parser. Here's the C++ code that defines state machine for the example syntax:

It's developed as a part of Meanscript, a work-in-progress multi-platform scripting and bytecode language:

https://github.com/Meanwhale/MeanscriptCLI

If you have any questions, comments, ideas, etc. contact me:

https://twitter.com/TheMeanwhale
meanwhale@gmail.com

Compile and run

Compile the example program in Java, C++, and/or C#. You can give a code (see the syntax below) to the example program and see a resulting token tree printed, if there isn't any errors.

Execute the example program to see guide for command line arguments.

Java
You need JDK 8 to compile and run. Go to folder java. Compile classes and run the main class ByteAutomataJava:
```
    javac ByteAutomataJava.java
    java ByteAutomataJava
```
C++
You need GCC (https://gcc.gnu.org/) to compile. Go to folder cpp. Call make to complile or use command
```
    g++ -std=c++14 main.cpp src/code.cpp -o byteautomata
```
and then run byteautomata (.exe):
```
    ./byteautomata
```

C#

Compile

    csc MeanCS.cs ByteAutomata.cs src\code.cs

and run

    ByteAutomata.exe

Input Code Syntax

The example program recognizes simple C-like code containing

text characters a-z and A-Z (no numbers or underscore)
number : characters 0-9
code blocks starting with ([{ and ending with )]} respectively
expression breaks: comma and semicolon
white space: space, linebreak, and tab

The syntax is defined in class MicroLexer in source code. For example input

abc 123; foo (456, bar[i])

results this token tree:

[<ROOT>]             // expression, token tree root
  [abc]              // text token
  [123]              // number token
[<EXPR>]             // expression
  [foo]
  [<BLOCK>]          // code block root
    [<EXPR>]
      [456]
    [<EXPR>]
      [bar]
      [<BLOCK>]
        [<EXPR>]
          [i]

Project content

Project code is generated from base code, written in C-like language and macros (not included to this repository for now). Base code is run thru GCC preprocessor for target languages, which are currently C++, C#, and Java. The most essential classes are ByteAutomata, a general use state machine, and MicroLexer, an example implementation.

Folder structure

folder	content
/	root folder: README, LICENCE, and an example script file.
/cpp/	main() source file, header, and utils
/cpp/src/	generated source (`code.cpp`) code and header files
/csharp/	Main() source file and utils
/csharp/src/	generated source code (`code.cs`)
/java/	main() class source
/java/net/meanscript/	generated code: classes for public interface
/java/net/meanscript/core/	generated code: internal classes
/java/net/meanscript/java/	Java-specific code

How It Works

ByteAutomata has a (logically) two-dimensional byte array, which has a column for each input byte (256) and a row for each state. For example here's a simple state machine (https://en.wikipedia.org/wiki/Finite-state_machine), that recognizes text and numbers separated with white space, eg. abc 123 ef:

Here's how it's done in ByteAutomata's byte array, where array items are indexes of transition callbacks:

                             0 ... {space} ... ' 0' '1' '2' ... '9' .... 'a' ... 'z'
			     
    state 1: white space              0          a   a   a  ...  a        b ...  _b_
    state 2: number                   c          0   0   0       0       [X] ... [X]
    state 3: text                     d         [X] [X] [X]     [X]       0  ...  0
    
    callback a: nextState(number)
    callback b: nextState(text)
    callback c: addNumberToken(), nextState(white space)
    callback d: addTextToken(), nextState(white space)

For example, if we're in 'white space' state, and input byte is 'z', then callback 'b' is called (the array item for this is surrounded with underscores). [X] is an error code 0xff (255), and 0 is for staying on the same state without a callback call. All bytes are [X] by default.

State transitions are created for ByteAutomata by calling

    transition(state, inputBytes, callback);

For example, to define a transition from 'white space' to 'number' state when input is a number character, call (in pseudo-Java)

    transition(whiteSpaceState, "0123456789", () -> { nextState(numberState); } );

To create a token tree, add nodes to the tree on transitions. For example, in 'number' state, when we get a space byte as an input, we can add a number token:

    transition(numberState, " ", () -> {addNumberToken(startIndex, currentIndex); nextState(whiteSpaceState);});

So, if input is "abc 123 de", status at the end of "123" is like

             startIndex   currentIndex
                      v   v
    index     0 1 2 3 4 5 6 7 8 9
    input     a b c   1 2 3   d e

Then addToken(4, 6) is called. It copies the number characters [4], [5], and [6] from input buffer, save them to a new node, and add the node to the token tree.

Token tree is made of MNodes whose member are references/pointers to the next sibling and a child node, and token data. You can iterate thru the node tree nodes using class NodeIterator or directly class MNode. For example printing all nodes with a recursive function, in pseudo-Java:

void printTree (MNode node)
{
    print(node.data);
    if (node.child != null) printTree(node.child);
    if (node.next != null) printTree(node.next);
}

// ...

    printTree(rootNode);

You'll find the full version of the function in class MNode.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
cpp		cpp
csharp		csharp
java		java
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
simple_automata.png		simple_automata.png
test_script.txt		test_script.txt
transition_definitions.png		transition_definitions.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ByteAutomata

Compile and run

Input Code Syntax

Project content

Folder structure

How It Works

About

Releases

Packages

Contributors 2

Languages

License

Meanwhale/ByteAutomata

Folders and files

Latest commit

History

Repository files navigation

ByteAutomata

Compile and run

Input Code Syntax

Project content

Folder structure

How It Works

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages