GitHub - R32/lex: Build lexer and simple parser(SimpleLR) in macro, It also provides lexer and simpleLR tool for c language

Lex

Build lexer and simple parser(SimpleLR) in macro.

LIMITS:

In haxe, You can't use it in macro-in-macro
Suitable for ASCII only.

Samples

SLR tool for c langauge

haxelib run lex --slr TPL.slr TPL.lex

basic demo : test.slr ==> test_slr.c

json parser demo : rjson_parser.slr ==> rjson_parser_slr.c
Lexer tool for c langauge

run haxelib run lex TPL.lex basic demo : test.lex ==> test_lex.c
hello world
hscript parser

Status

Lexer

static function main()  {
    var str = lms.ByteData.ofString('exec(123 + 456)');
    var lex = new Lexer(str);
    var t = lex.token();
    while (t != Eof) {
        trace(s_token(t));
        t = lex.token();
    }
}

enum Op {
    Add;
    Sub;
    Mul;
    Div;
}

enum Token {
    Eof;
    LParen;
    RParen;
    Op( op : Op );
    CIdent( id : String);
    CInt( i : Int);
    CString( s : String);
}

/*
 * The following meta characters are supported in string:
 * `*`: zero or more
 * `+`: one or more
 * `?`: zero or one
 * `[`: begin char range
 * `]`: end char range
 * `\`: escape next char
 * `|`: or  (Not recommended, e.g: "abc|xyz" should be replaced by: "abc" | "xyz" )
 *
 * The newly added syntax:
 *
 * "a" | "b"    => /a|b/
 * "a" + "b"    => /ab/  "+" has a higher priority than "|"
 * Opt("abc")   => /(abc)?/
 * Star("abc")  => /(abc)*
 * Plus("abc")  => /(abc)+/
 *
 * For example:
 * var integer    = "0" | "[1-9][0-9]*";
 * var floatpoint = ".[0-9]+" | "[0-9]+.[0-9]*";
 * var exp        = "[eE][+-]?[0-9]+";
 * var float      = integer + Opt(exp) | floatpoint + Opt(exp);
 *
 * // "float" is equal to  [(0 | [1-9][0-9]*) + exp] | [( [0-9]+ | [0-9]+.[0-9]* ) + exp]
 */
@:rule(Eof) class Lexer implements lm.Lexer<Token> {

    // You could add "@:skip" to avoid being parsed into a pattern

    var r_int = "0" | "-?[1-9][0-9]*";

    var r_ident = "[a-zA-Z_][a-zA-Z_0-9]+";

    // Matching 1, And the first one will be automatically renamed as "token"
    var tok =  [
        "[ \t]+" => this.token(),   // Recursive Matching 1
        "+" => Op(Add),
        "-" => Op(Sub),
        "*" => Op(Mul),
        "/" => Op(Div),
        "(" => LParen,
        ")" => RParen,
        r_int =>
            CInt(Std.parseInt(this.current)),
        r_ident =>
            CIdent(this.current),
        '"' => {
            var i = this.pmax;      // save start position
            var t = this.string();  // goto Matching 2
            if (t == Eof)
                throw "UnClosed \"";
            this.pmin = i;          // restore pmin
            CString(this.current);
        },
        "//[^\n]*" =>
            this.token(),           // Recursive Matching 1
        _ => {
            // error handing, when error occurs, this.pmin will be >= this.pmax
            throw "UnExpected : " + this.getString(pmax, pmin - pmax);
            // NOTE: Unlike SLR parser, For lexer, only "Matching 1"
            // has the ability to handle errors, the others can only do empty(epsilon) matching
        }
    ];

    // Matching 2
    var string = [
        '"'     => CString(""),
        '[^"]+' => this.string()    // Recursive Matching 2
    ];

    // Matching N...
}

Parser: Only SimpleLR is available.

Unlike normal LR parser, there is no action-table, all are jump-table.

Some conflicts may be resolved in normal LALR/LR1, But here the conflicts error will be thrown directly.

Position: Inside the actions, you could use T1~TN to access the position, which is the instance of lm.Stream.Tok
```
T1.pmax - T1.pmin;
```
And inside the actions, you could use the vairalbe stream
```
var tok = stream.peek(0);
if (tok.term == SomeToken)
    stream.junk(1);
```

Combine multiple Tokens:

// 1. uses "[]", NOTE: if you put tokens with **different precedence**, a conflict error will be thrown.
switch(s) {
case [e1=expr, op = [OpAdd, OpSub], e2=expr]: op == OpAdd ? e1 + e2 : e1 - e2;
}

// 2. uses production(switch), NOTE: This will ignore all token **precedence** from predefine.
// But you can use "@:prec(XXX)" to enforce the **precedence** for it
case [e1=expr, op = op, e2=expr]: trace(op == OpAdd) ... //

var op : Token = switch(s) {
case [OpAdd] : OpAdd;
case [OpSub] : OpSub;
case [OpDiv] : OpDiv;
case [OpMul] : OpMul;
}

You can use string literals instead of simple terminators in stream match.

switch(s) {
case [e1=expr, op = ["+", "-"], e2=expr]: op == OpPlus ? e1 + e2 : e1 - e2;
case ["(", e = expr, ")"]: e;
}

Operator Precedence:

// the operator precedence definitions:
@:rule({
    left: ["+", "-"],         // The parser could auto reflect(str) => Token
    left: [OpTimes, OpDiv],   // The lower have higher Priority.
    nonassoc: [UMINUS],       // All characters of the placeholder must be UPPERCASE
}) class MyParser implements lm.SLR<MyLexer> {


    details...

    ```
    // UPPERCASE == "non-terml", LOWERCASE == "terml"
    [..., op, E]: if defined(op)     then case.left.lval = E.value
    [..., op, E]: if not defined(op) then case.left = null
    [..., T, E] | [E] | [..., t]:    then case.left = null
    [E, op, ...]: if defined(op)     then case.right.own = E.value
    [E, op, ...]: if not defined(op) then case.right.prio = -1
    [E, T, ...]                      then case.right.prio = -1
    [t, ...]                         then case.right = null
    ```

CHANGES

1.0.0
0.13.0:
- tool : Added simple LR tool for c langauge
- slr : Refactored LR0 code to SLR
- lexer : Added new syntax Opt(), Star(), Plus() for group.

Defines

-D lex_slrtable: for debug. it will generate a SimpleLR table save as slr-table.txt. for example:

You may have to modify the mmap field in debug.Print

Production:
  (R0)  MAIN --> EXPR $
  (R1)  EXPR --> EXPR [+ -] EXPR
  (R2)       --> EXPR * EXPR
  (R3)       --> EXPR / EXPR
  (R4)       --> ( EXPR )
  (R5)       --> - EXPR
  (R6)       --> int
-----------------------------------------------------------------------------------------
|  (S)  | (EP)  |   $   |  int  |   +   |   -   |   *   |   /   |   (   |   )   | EXPR  |
----------------------------------------------------------------------------------------- MAIN
|   0   | NULL  |       |  14   |       |   1   |       |       |   2   |       |   8   |
-----------------------------------------------------------------------------------------
|   1   | NULL  |       |  14   |       |   1   |       |       |   2   |       |  10   |
-----------------------------------------------------------------------------------------
|   2   | NULL  |       |  14   |       |   1   |       |       |   2   |       |   3   |
-----------------------------------------------------------------------------------------
|   3   | NULL  |       |       |   4   |   4   |   6   |   7   |       |  11   |       |
-----------------------------------------------------------------------------------------
|   4   | NULL  |       |  14   |       |   1   |       |       |   2   |       |   5   |
-----------------------------------------------------------------------------------------
|   5   |  R1   |       |       |       |       |   6   |   7   |       |       |       |
-----------------------------------------------------------------------------------------
|   6   | NULL  |       |  14   |       |   1   |       |       |   2   |       |  13   |
-----------------------------------------------------------------------------------------
|   7   | NULL  |       |  14   |       |   1   |       |       |   2   |       |  12   |
-----------------------------------------------------------------------------------------
|   8   | NULL  |   9   |       |   4   |   4   |   6   |   7   |       |       |       |
-----------------------------------------------------------------------------------------
-----------------
|   9   |  R0   |
-----------------
|  10   |  R5   |
-----------------
|  11   |  R4   |
-----------------
|  12   |  R3   |
-----------------
|  13   |  R2   |
-----------------
|  14   |  R6   |
-----------------

Usage

copy from test/subs/Demo.hx

@:analyzer(no_optimize)
class Demo {
    static function main() {
        var str = '1 - 2 * (3 + 4) + 5 * Unexpected 6';
        var lex = new Lexer(lms.ByteData.ofString(str));
        var par = new Parser(lex);
        trace(par.main() == (1 - 2 * (3 + 4) + 5 * 6));
    }
}

// The lm.LR0 Parser only works with "enum abstract (Int) to Int"
enum abstract Token(Int) to Int {
    var Eof = 0;
    var CInt;
    var OpPlus;
    var OpMinus;
    var OpTimes;
    var OpDiv;
    var LParen;
    var RParen;
    var CIdent;
}

/**
* @:rule(EOF, cmax = 255) See the example below:
*   Eof is a custom terminator which is defined in "<Token>" (required)
*   127 is the custom maximum char value. (optional, default is 255)
*/
@:rule(Eof, 127) class Lexer implements lm.Lexer<Token> {
    var inter = "0|[1-9][0-9]*";         // 0 or ...
    var r_zero = "0";                    // string variable will be treated as pattern if there is no `@:skip`
    var r_int = "[1-9][0-9]*";
    var tok =  [                         // a rule set definition, the first one will become .token()
        "[ \t]+" => this.token(),
        r_zero | r_int => CInt,          // zero or int
     // inter + Opt("[eE][+-]?[0-9]+")   // exponent
        "+" => OpPlus,
        "-" => OpMinus,
        "*" => OpTimes,
        "/" => OpDiv,
        "(" => LParen,
        ")" => RParen,
        "[a-zA-Z_]+" => CIdent,
    ];
}

@:rule({
    start: [main],            // Specify start, like the "%start" in ocamlyacc, If not specified, the first "switch" will be selected
    left: ["+", "-"],         // The parser could auto reflect(str) => Token
    left: [OpTimes, OpDiv],   // The lower have higher priority.
    nonassoc: [UMINUS],       // The placeholder must be uppercase
}) class Parser implements lm.SLR<Lexer> {

    var main = switch(s) {
        case [e = expr, Eof]:
            e;
        default:              // place handling error code here
            var t = stream.peek(0);
            switch(t.term) {
            case Eof:
                return 0;
            case CIdent:                // Show recovery from errors, Note: this ability is very weak
                stream.junk(1);         // Discard current token
                slrloop( -1, MAIN_EXP); // main => MAIN_EXP, NOTE: Only the entry switch-case (Specified in "start") has an EXP value
            default:
                throw "Unexpected: " + stream.str(t);
            }
    }

    var expr : Int = switch(s) {        // Specify Type explicitly
        case [e1 = expr, op = [OpPlus,OpMinus], e2 = expr]: op == OpPlus ? e1 + e2 : e1 - e2;
        case [e1 = expr, OpTimes, e2 = expr]: e1 * e2;
        case [e1 = expr, OpDiv, e2 = expr]: Std.int(e1 / e2);
        case [LParen, e = expr, RParen]: e;
        case [@:prec(UMINUS) OpMinus, e = expr]: -e;   // %prec UMINUS
        case [CInt(n)]: n;
    }

    // Define custom extract function for CInt(n)
    @:rule(CInt) inline function int_of_string( s : String ) : Int return Std.parseInt(s);
    // OR @:rule(CInt) inline function int_of_string( input : lms.ByteData, t : lm.Stream.Tok ) : Int {
    //    return Std.parseInt( input.readString(t.pmin, t.pmax - t.pmin) );
    //}
}

compile:

haxe -dce full -D analyzer-optimize -lib lex -main Demo -js demo.js

Generated JS:

// Generated by Haxe 4.3.0-rc.1
(function ($global) { "use strict";
var Demo = function() { };
Demo.main = function() {
    var str = "1 - 2 * (3 + 4) + 5 * 6";
    var lex = new Lexer(str);
    var par = new Parser(lex);
    console.log("Demo.hx:9:",par._entry(0,8) == 17);
};
var Lexer = function(s) {
    this.input = s;
    this.pmin = 0;
    this.pmax = 0;
};
Lexer.prototype = {
    getString: function(p,len) {
        return this.input.substr(p,len);
    }
    ,_token: function(init,right) {
        if(this.pmax >= right) {
            return 0;
        }
        var raw = Lexer.raw;
        var i = this.pmax;
        var state = init;
        var prev = init;
        var c;
        while(i < right) {
            c = this.input.charCodeAt(i++);
            state = raw.charCodeAt(128 * state + c);
            if(state >= 3) {
                break;
            }
            prev = state;
        }
        this.pmin = i;
        if(state == 255) {
            state = prev;
            --i;
        }
        var q = raw.charCodeAt(399 - state);
        if(i > this.pmax && q < 8) {
            this.pmin = this.pmax;
            this.pmax = i;
        } else {
            q = raw.charCodeAt(399 - init);
        }
        return this.cases(q);
    }
    ,token: function() {
        return this._token(0,this.input.length);
    }
    ,cases: function(s) {
        switch(s) {
        case 0:
            return this._token(0,this.input.length);
        case 1:
            return 1;
        case 2:
            return 2;
        case 3:
            return 3;
        case 4:
            return 4;
        case 5:
            return 5;
        case 6:
            return 6;
        case 7:
            return 7;
        default:
            throw new Error("UnMatached: '" + this.input.substr(this.pmax,this.pmin - this.pmax) + "'");
        }
    }
};
var Parser = function(lex) {
    this.stream = new lm_Stream(lex);
};
Parser.prototype = {
    _entry: function(state,exp) {
        var t = this.stream.newTok(0,0,0);
        t.state = state;
        var _this = this.stream;
        var i = _this.right;
        while(--i >= _this.pos) _this.cached[i + 1] = _this.cached[i];
        _this.cached[_this.pos] = t;
        ++_this.pos;
        ++_this.right;
        var raw = Parser.raw;
        while(true) {
            while(true) {
                t = this.stream.next();
                state = raw.charCodeAt(16 * state + t.term);
                if(state >= 9) {
                    break;
                }
                t.state = state;
            }
            if(state == 255) {
                this.stream.pos -= 1;
                var _this = this.stream;
                state = _this.cached[_this.pos + (-1)].state;
            }
            while(true) {
                var q = raw.charCodeAt(159 - state);
                var value = this.cases(q);
                if(q >= 7) {
                    return value;
                }
                t = this.stream.reduce(Parser.lvs[q]);
                if(t.term == exp) {
                    this.stream.pos -= 2;
                    this.stream.junk(2);
                    return value;
                }
                t.val = value;
                var _this1 = this.stream;
                state = raw.charCodeAt(16 * _this1.cached[_this1.pos + (-2)].state + t.term);
                t.state = state;
                if(state < 9) {
                    break;
                }
            }
        }
    }
    ,cases: function(q) {
        var __s = this.stream;
        switch(q) {
        case 0:
            return __s.cached[__s.pos + (-2)].val;
        case 1:
            var e1 = __s.cached[__s.pos + (-3)].val;
            var e2 = __s.cached[__s.pos + (-1)].val;
            if(__s.cached[__s.pos + (-2)].term == 2) {
                return e1 + e2;
            } else {
                return e1 - e2;
            }
            break;
        case 2:
            return __s.cached[__s.pos + (-3)].val * __s.cached[__s.pos + (-1)].val;
        case 3:
            return __s.cached[__s.pos + (-3)].val / __s.cached[__s.pos + (-1)].val | 0;
        case 4:
            return __s.cached[__s.pos + (-2)].val;
        case 5:
            return -__s.cached[__s.pos + (-1)].val;
        case 6:
            return Std.parseInt(__s.stri(-1));
        default:
            var t = this.stream.peek(0);
            throw new Error("Unexpected \"" + (t.term != 0 ? this.stream.lex.getString(t.pmin,t.pmax - t.pmin) : "Eof") + "\"");
        }
    }
};
var Std = function() { };
Std.parseInt = function(x) {
    var v = parseInt(x);
    if(isNaN(v)) {
        return null;
    }
    return v;
};
var lm_Tok = function(t,min,max) {
    this.term = t;
    this.pmin = min;
    this.pmax = max;
};
var lm_Stream = function(l) {
    this.lex = l;
    var this1 = new Array(128);
    this.cached = this1;
    this.right = 0;
    this.pos = 0;
};
lm_Stream.prototype = {
    reclaim: function(tok) {
        tok.nxt = this.h;
        this.h = tok;
    }
    ,newTok: function(term,min,max) {
        if(this.h == null) {
            return new lm_Tok(term,min,max);
        } else {
            var t = this.h;
            this.h = this.h.nxt;
            t.term = term;
            t.pmin = min;
            t.pmax = max;
            return t;
        }
    }
    ,peek: function(i) {
        while(this.right - this.pos <= i) {
            var t = this.lex.token();
            this.cached[this.right++] = this.newTok(t,this.lex.pmin,this.lex.pmax);
        }
        return this.cached[this.pos + i];
    }
    ,junk: function(n) {
        if(n <= 0) {
            return;
        }
        if(this.right - this.pos >= n) {
            var i = n;
            while(i-- > 0) this.reclaim(this.cached[this.pos + i]);
            i = this.pos;
            this.right -= n;
            while(i < this.right) {
                this.cached[i] = this.cached[i + n];
                ++i;
            }
        } else {
            n -= this.right - this.pos;
            while(n-- > 0) this.lex.token();
            while(this.right > this.pos) this.reclaim(this.cached[--this.right]);
        }
    }
    ,stri: function(dx) {
        var t = this.cached[this.pos + dx];
        return this.lex.getString(t.pmin,t.pmax - t.pmin);
    }
    ,next: function() {
        if(this.right == this.pos) {
            var t = this.lex.token();
            this.cached[this.right++] = this.newTok(t,this.lex.pmin,this.lex.pmax);
        }
        return this.cached[this.pos++];
    }
    ,reduce: function(lvw) {
        var w = lvw & 255;
        if(w == 0) {
            return this.reduceEP(lvw >>> 8);
        }
        var pmax = this.cached[this.pos + (-1)].pmax;
        --w;
        this.pos -= w;
        this.right -= w;
        var t = this.cached[this.pos + (-1)];
        t.term = lvw >>> 8;
        t.pmax = pmax;
        if(w == 0) {
            return t;
        }
        var i = w;
        while(i-- > 0) this.reclaim(this.cached[this.pos + i]);
        i = this.pos;
        while(i < this.right) {
            this.cached[i] = this.cached[i + w];
            ++i;
        }
        return t;
    }
    ,reduceEP: function(lv) {
        var prev = this.cached[this.pos - 1];
        var t = this.newTok(lv,prev.pmax,prev.pmax);
        var i = this.right;
        while(--i >= this.pos) this.cached[i + 1] = this.cached[i];
        this.cached[this.pos] = t;
        ++this.pos;
        ++this.right;
        return t;
    }
};
Lexer.raw = "ÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿ\t\x08\x07\x06ÿ\x05ÿ\x04\x03\x02\x02\x02\x02\x02\x02\x02\x02\x02ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x06\x07\x04\x02\x03\x05\x01\x01\x00ÿ";
Parser.raw = "ÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x08ÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\nÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x03ÿÿÿÿÿÿÿÿ\x04\x04\x06\x07ÿ\x0Bÿÿÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x05ÿÿÿÿÿÿÿÿÿÿ\x06\x07ÿÿÿÿÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\rÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x0Cÿÿÿÿÿÿ\tÿ\x04\x04\x06\x07ÿÿÿÿÿÿÿÿÿÿÿ\x06\x02\x03\x04\x05\x00ÿÿÿ\x01ÿÿÿÿÿ";
Parser.lvs = [2050,2307,2307,2307,2307,2306,2305];
Demo.main();
})({});

Name		Name	Last commit message	Last commit date
Latest commit History 316 Commits
demo		demo
include		include
src		src
test		test
tools		tools
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
haxelib.json		haxelib.json
release.sh		release.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lex

Samples

Status

CHANGES

Defines

Usage

About

Releases

Packages

Languages

R32/lex

Folders and files

Latest commit

History

Repository files navigation

Lex

Samples

Status

CHANGES

Defines

Usage

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages