Language recognition for parsing. Works in Java, C++, and C#.
ByteAutomata provides a method to implement a hard-coded finite-state machine for text code tokenizing. This project includes an example program that turns input code to a token tree. It can be used as a template for your own parser. Here's the C++ code that defines state machine for the example syntax:
It's developed as a part of Meanscript, a work-in-progress multi-platform scripting and bytecode language:
https://github.com/Meanwhale/MeanscriptCLI
If you have any questions, comments, ideas, etc. contact me:
https://twitter.com/TheMeanwhale
meanwhale@gmail.com
Compile the example program in Java, C++, and/or C#. You can give a code (see the syntax below) to the example program and see a resulting token tree printed, if there isn't any errors.
Execute the example program to see guide for command line arguments.
- Java
You need JDK 8 to compile and run. Go to folder java. Compile classes and run the main class ByteAutomataJava:
javac ByteAutomataJava.java java ByteAutomataJava
- C++
You need GCC (https://gcc.gnu.org/) to compile. Go to folder cpp. Call
make
to complile or use commandg++ -std=c++14 main.cpp src/code.cpp -o byteautomata
and then run byteautomata (.exe):
./byteautomata
- C#
Compile
csc MeanCS.cs ByteAutomata.cs src\code.cs
and run
ByteAutomata.exe
The example program recognizes simple C-like code containing
- text characters a-z and A-Z (no numbers or underscore)
- number : characters 0-9
- code blocks starting with ([{ and ending with )]} respectively
- expression breaks: comma and semicolon
- white space: space, linebreak, and tab
The syntax is defined in class MicroLexer in source code. For example input
abc 123; foo (456, bar[i])
results this token tree:
[<ROOT>] // expression, token tree root
[abc] // text token
[123] // number token
[<EXPR>] // expression
[foo]
[<BLOCK>] // code block root
[<EXPR>]
[456]
[<EXPR>]
[bar]
[<BLOCK>]
[<EXPR>]
[i]
Project code is generated from base code, written in C-like language and macros (not included to this repository for now). Base code is run thru GCC preprocessor for target languages, which are currently C++, C#, and Java. The most essential classes are ByteAutomata, a general use state machine, and MicroLexer, an example implementation.
folder | content |
---|---|
/ | root folder: README, LICENCE, and an example script file. |
/cpp/ | main() source file, header, and utils |
/cpp/src/ | generated source (code.cpp ) code and header files |
/csharp/ | Main() source file and utils |
/csharp/src/ | generated source code (code.cs ) |
/java/ | main() class source |
/java/net/meanscript/ | generated code: classes for public interface |
/java/net/meanscript/core/ | generated code: internal classes |
/java/net/meanscript/java/ | Java-specific code |
ByteAutomata has a (logically) two-dimensional byte array, which has a column for each input byte (256) and a row for each state.
For example here's a simple state machine (https://en.wikipedia.org/wiki/Finite-state_machine), that recognizes text and numbers separated with white space, eg. abc 123 ef
:
Here's how it's done in ByteAutomata's byte array, where array items are indexes of transition callbacks:
0 ... {space} ... ' 0' '1' '2' ... '9' .... 'a' ... 'z'
state 1: white space 0 a a a ... a b ... _b_
state 2: number c 0 0 0 0 [X] ... [X]
state 3: text d [X] [X] [X] [X] 0 ... 0
callback a: nextState(number)
callback b: nextState(text)
callback c: addNumberToken(), nextState(white space)
callback d: addTextToken(), nextState(white space)
For example, if we're in 'white space' state, and input byte is 'z', then callback 'b' is called (the array item for this is surrounded with underscores).
[X]
is an error code 0xff
(255), and 0
is for staying on the same state without a callback call. All bytes are [X]
by default.
State transitions are created for ByteAutomata by calling
transition(state, inputBytes, callback);
For example, to define a transition from 'white space' to 'number' state when input is a number character, call (in pseudo-Java)
transition(whiteSpaceState, "0123456789", () -> { nextState(numberState); } );
To create a token tree, add nodes to the tree on transitions. For example, in 'number' state, when we get a space byte as an input, we can add a number token:
transition(numberState, " ", () -> {addNumberToken(startIndex, currentIndex); nextState(whiteSpaceState);});
So, if input is "abc 123 de", status at the end of "123" is like
startIndex currentIndex
v v
index 0 1 2 3 4 5 6 7 8 9
input a b c 1 2 3 d e
Then addToken(4, 6)
is called. It copies the number characters [4], [5], and [6] from input buffer, save them to a new node, and add the node to the token tree.
Token tree is made of MNode
s whose member are references/pointers to the next sibling and a child node, and token data.
You can iterate thru the node tree nodes using class NodeIterator
or directly class MNode
.
For example printing all nodes with a recursive function, in pseudo-Java:
void printTree (MNode node)
{
print(node.data);
if (node.child != null) printTree(node.child);
if (node.next != null) printTree(node.next);
}
// ...
printTree(rootNode);
You'll find the full version of the function in class MNode
.
Copyright (c) 2020, Meanwhale