Skip to content
/ lex Public

Build lexer and simple parser(SimpleLR) in macro, It also provides lexer and simpleLR tool for c language

Notifications You must be signed in to change notification settings

R32/lex

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Lex

Build lexer and simple parser(SimpleLR) in macro.

LIMITS:

  1. In haxe, You can't use it in macro-in-macro

  2. Suitable for ASCII only.

Samples

Status

  • Lexer

    static function main()  {
        var str = lms.ByteData.ofString('exec(123 + 456)');
        var lex = new Lexer(str);
        var t = lex.token();
        while (t != Eof) {
            trace(s_token(t));
            t = lex.token();
        }
    }
    
    enum Op {
        Add;
        Sub;
        Mul;
        Div;
    }
    
    enum Token {
        Eof;
        LParen;
        RParen;
        Op( op : Op );
        CIdent( id : String);
        CInt( i : Int);
        CString( s : String);
    }
    
    /*
     * The following meta characters are supported in string:
     * `*`: zero or more
     * `+`: one or more
     * `?`: zero or one
     * `[`: begin char range
     * `]`: end char range
     * `\`: escape next char
     * `|`: or  (Not recommended, e.g: "abc|xyz" should be replaced by: "abc" | "xyz" )
     *
     * The newly added syntax:
     *
     * "a" | "b"    => /a|b/
     * "a" + "b"    => /ab/  "+" has a higher priority than "|"
     * Opt("abc")   => /(abc)?/
     * Star("abc")  => /(abc)*
     * Plus("abc")  => /(abc)+/
     *
     * For example:
     * var integer    = "0" | "[1-9][0-9]*";
     * var floatpoint = ".[0-9]+" | "[0-9]+.[0-9]*";
     * var exp        = "[eE][+-]?[0-9]+";
     * var float      = integer + Opt(exp) | floatpoint + Opt(exp);
     *
     * // "float" is equal to  [(0 | [1-9][0-9]*) + exp] | [( [0-9]+ | [0-9]+.[0-9]* ) + exp]
     */
    @:rule(Eof) class Lexer implements lm.Lexer<Token> {
    
        // You could add "@:skip" to avoid being parsed into a pattern
    
        var r_int = "0" | "-?[1-9][0-9]*";
    
        var r_ident = "[a-zA-Z_][a-zA-Z_0-9]+";
    
        // Matching 1, And the first one will be automatically renamed as "token"
        var tok =  [
            "[ \t]+" => this.token(),   // Recursive Matching 1
            "+" => Op(Add),
            "-" => Op(Sub),
            "*" => Op(Mul),
            "/" => Op(Div),
            "(" => LParen,
            ")" => RParen,
            r_int =>
                CInt(Std.parseInt(this.current)),
            r_ident =>
                CIdent(this.current),
            '"' => {
                var i = this.pmax;      // save start position
                var t = this.string();  // goto Matching 2
                if (t == Eof)
                    throw "UnClosed \"";
                this.pmin = i;          // restore pmin
                CString(this.current);
            },
            "//[^\n]*" =>
                this.token(),           // Recursive Matching 1
            _ => {
                // error handing, when error occurs, this.pmin will be >= this.pmax
                throw "UnExpected : " + this.getString(pmax, pmin - pmax);
                // NOTE: Unlike SLR parser, For lexer, only "Matching 1"
                // has the ability to handle errors, the others can only do empty(epsilon) matching
            }
        ];
    
        // Matching 2
        var string = [
            '"'     => CString(""),
            '[^"]+' => this.string()    // Recursive Matching 2
        ];
    
        // Matching N...
    }
  • Parser: Only SimpleLR is available.

    Unlike normal LR parser, there is no action-table, all are jump-table.

    Some conflicts may be resolved in normal LALR/LR1, But here the conflicts error will be thrown directly.

    • Position: Inside the actions, you could use T1~TN to access the position, which is the instance of lm.Stream.Tok

      T1.pmax - T1.pmin;

      And inside the actions, you could use the vairalbe stream

      var tok = stream.peek(0);
      if (tok.term == SomeToken)
          stream.junk(1);
    • Combine multiple Tokens:

      // 1. uses "[]", NOTE: if you put tokens with **different precedence**, a conflict error will be thrown.
      switch(s) {
      case [e1=expr, op = [OpAdd, OpSub], e2=expr]: op == OpAdd ? e1 + e2 : e1 - e2;
      }
      
      // 2. uses production(switch), NOTE: This will ignore all token **precedence** from predefine.
      // But you can use "@:prec(XXX)" to enforce the **precedence** for it
      case [e1=expr, op = op, e2=expr]: trace(op == OpAdd) ... //
      
      var op : Token = switch(s) {
      case [OpAdd] : OpAdd;
      case [OpSub] : OpSub;
      case [OpDiv] : OpDiv;
      case [OpMul] : OpMul;
      }
    • You can use string literals instead of simple terminators in stream match.

      switch(s) {
      case [e1=expr, op = ["+", "-"], e2=expr]: op == OpPlus ? e1 + e2 : e1 - e2;
      case ["(", e = expr, ")"]: e;
      }
    • Operator Precedence:

      // the operator precedence definitions:
      @:rule({
          left: ["+", "-"],         // The parser could auto reflect(str) => Token
          left: [OpTimes, OpDiv],   // The lower have higher Priority.
          nonassoc: [UMINUS],       // All characters of the placeholder must be UPPERCASE
      }) class MyParser implements lm.SLR<MyLexer> {
details...

``` // UPPERCASE == "non-terml", LOWERCASE == "terml" [..., op, E]: if defined(op) then case.left.lval = E.value [..., op, E]: if not defined(op) then case.left = null [..., T, E] | [E] | [..., t]: then case.left = null [E, op, ...]: if defined(op) then case.right.own = E.value [E, op, ...]: if not defined(op) then case.right.prio = -1 [E, T, ...] then case.right.prio = -1 [t, ...] then case.right = null ```

CHANGES

  • 1.0.0

  • 0.13.0:

    • tool : Added simple LR tool for c langauge

    • slr : Refactored LR0 code to SLR

    • lexer : Added new syntax Opt(), Star(), Plus() for group.

Defines

  • -D lex_slrtable: for debug. it will generate a SimpleLR table save as slr-table.txt. for example:

    You may have to modify the mmap field in debug.Print

    Production:
      (R0)  MAIN --> EXPR $
      (R1)  EXPR --> EXPR [+ -] EXPR
      (R2)       --> EXPR * EXPR
      (R3)       --> EXPR / EXPR
      (R4)       --> ( EXPR )
      (R5)       --> - EXPR
      (R6)       --> int
    -----------------------------------------------------------------------------------------
    |  (S)  | (EP)  |   $   |  int  |   +   |   -   |   *   |   /   |   (   |   )   | EXPR  |
    ----------------------------------------------------------------------------------------- MAIN
    |   0   | NULL  |       |  14   |       |   1   |       |       |   2   |       |   8   |
    -----------------------------------------------------------------------------------------
    |   1   | NULL  |       |  14   |       |   1   |       |       |   2   |       |  10   |
    -----------------------------------------------------------------------------------------
    |   2   | NULL  |       |  14   |       |   1   |       |       |   2   |       |   3   |
    -----------------------------------------------------------------------------------------
    |   3   | NULL  |       |       |   4   |   4   |   6   |   7   |       |  11   |       |
    -----------------------------------------------------------------------------------------
    |   4   | NULL  |       |  14   |       |   1   |       |       |   2   |       |   5   |
    -----------------------------------------------------------------------------------------
    |   5   |  R1   |       |       |       |       |   6   |   7   |       |       |       |
    -----------------------------------------------------------------------------------------
    |   6   | NULL  |       |  14   |       |   1   |       |       |   2   |       |  13   |
    -----------------------------------------------------------------------------------------
    |   7   | NULL  |       |  14   |       |   1   |       |       |   2   |       |  12   |
    -----------------------------------------------------------------------------------------
    |   8   | NULL  |   9   |       |   4   |   4   |   6   |   7   |       |       |       |
    -----------------------------------------------------------------------------------------
    -----------------
    |   9   |  R0   |
    -----------------
    |  10   |  R5   |
    -----------------
    |  11   |  R4   |
    -----------------
    |  12   |  R3   |
    -----------------
    |  13   |  R2   |
    -----------------
    |  14   |  R6   |
    -----------------
    

Usage

copy from test/subs/Demo.hx

@:analyzer(no_optimize)
class Demo {
    static function main() {
        var str = '1 - 2 * (3 + 4) + 5 * Unexpected 6';
        var lex = new Lexer(lms.ByteData.ofString(str));
        var par = new Parser(lex);
        trace(par.main() == (1 - 2 * (3 + 4) + 5 * 6));
    }
}

// The lm.LR0 Parser only works with "enum abstract (Int) to Int"
enum abstract Token(Int) to Int {
    var Eof = 0;
    var CInt;
    var OpPlus;
    var OpMinus;
    var OpTimes;
    var OpDiv;
    var LParen;
    var RParen;
    var CIdent;
}

/**
* @:rule(EOF, cmax = 255) See the example below:
*   Eof is a custom terminator which is defined in "<Token>" (required)
*   127 is the custom maximum char value. (optional, default is 255)
*/
@:rule(Eof, 127) class Lexer implements lm.Lexer<Token> {
    var inter = "0|[1-9][0-9]*";         // 0 or ...
    var r_zero = "0";                    // string variable will be treated as pattern if there is no `@:skip`
    var r_int = "[1-9][0-9]*";
    var tok =  [                         // a rule set definition, the first one will become .token()
        "[ \t]+" => this.token(),
        r_zero | r_int => CInt,          // zero or int
     // inter + Opt("[eE][+-]?[0-9]+")   // exponent
        "+" => OpPlus,
        "-" => OpMinus,
        "*" => OpTimes,
        "/" => OpDiv,
        "(" => LParen,
        ")" => RParen,
        "[a-zA-Z_]+" => CIdent,
    ];
}

@:rule({
    start: [main],            // Specify start, like the "%start" in ocamlyacc, If not specified, the first "switch" will be selected
    left: ["+", "-"],         // The parser could auto reflect(str) => Token
    left: [OpTimes, OpDiv],   // The lower have higher priority.
    nonassoc: [UMINUS],       // The placeholder must be uppercase
}) class Parser implements lm.SLR<Lexer> {

    var main = switch(s) {
        case [e = expr, Eof]:
            e;
        default:              // place handling error code here
            var t = stream.peek(0);
            switch(t.term) {
            case Eof:
                return 0;
            case CIdent:                // Show recovery from errors, Note: this ability is very weak
                stream.junk(1);         // Discard current token
                slrloop( -1, MAIN_EXP); // main => MAIN_EXP, NOTE: Only the entry switch-case (Specified in "start") has an EXP value
            default:
                throw "Unexpected: " + stream.str(t);
            }
    }

    var expr : Int = switch(s) {        // Specify Type explicitly
        case [e1 = expr, op = [OpPlus,OpMinus], e2 = expr]: op == OpPlus ? e1 + e2 : e1 - e2;
        case [e1 = expr, OpTimes, e2 = expr]: e1 * e2;
        case [e1 = expr, OpDiv, e2 = expr]: Std.int(e1 / e2);
        case [LParen, e = expr, RParen]: e;
        case [@:prec(UMINUS) OpMinus, e = expr]: -e;   // %prec UMINUS
        case [CInt(n)]: n;
    }

    // Define custom extract function for CInt(n)
    @:rule(CInt) inline function int_of_string( s : String ) : Int return Std.parseInt(s);
    // OR @:rule(CInt) inline function int_of_string( input : lms.ByteData, t : lm.Stream.Tok ) : Int {
    //    return Std.parseInt( input.readString(t.pmin, t.pmax - t.pmin) );
    //}
}

compile:

haxe -dce full -D analyzer-optimize -lib lex -main Demo -js demo.js

Generated JS:

// Generated by Haxe 4.3.0-rc.1
(function ($global) { "use strict";
var Demo = function() { };
Demo.main = function() {
    var str = "1 - 2 * (3 + 4) + 5 * 6";
    var lex = new Lexer(str);
    var par = new Parser(lex);
    console.log("Demo.hx:9:",par._entry(0,8) == 17);
};
var Lexer = function(s) {
    this.input = s;
    this.pmin = 0;
    this.pmax = 0;
};
Lexer.prototype = {
    getString: function(p,len) {
        return this.input.substr(p,len);
    }
    ,_token: function(init,right) {
        if(this.pmax >= right) {
            return 0;
        }
        var raw = Lexer.raw;
        var i = this.pmax;
        var state = init;
        var prev = init;
        var c;
        while(i < right) {
            c = this.input.charCodeAt(i++);
            state = raw.charCodeAt(128 * state + c);
            if(state >= 3) {
                break;
            }
            prev = state;
        }
        this.pmin = i;
        if(state == 255) {
            state = prev;
            --i;
        }
        var q = raw.charCodeAt(399 - state);
        if(i > this.pmax && q < 8) {
            this.pmin = this.pmax;
            this.pmax = i;
        } else {
            q = raw.charCodeAt(399 - init);
        }
        return this.cases(q);
    }
    ,token: function() {
        return this._token(0,this.input.length);
    }
    ,cases: function(s) {
        switch(s) {
        case 0:
            return this._token(0,this.input.length);
        case 1:
            return 1;
        case 2:
            return 2;
        case 3:
            return 3;
        case 4:
            return 4;
        case 5:
            return 5;
        case 6:
            return 6;
        case 7:
            return 7;
        default:
            throw new Error("UnMatached: '" + this.input.substr(this.pmax,this.pmin - this.pmax) + "'");
        }
    }
};
var Parser = function(lex) {
    this.stream = new lm_Stream(lex);
};
Parser.prototype = {
    _entry: function(state,exp) {
        var t = this.stream.newTok(0,0,0);
        t.state = state;
        var _this = this.stream;
        var i = _this.right;
        while(--i >= _this.pos) _this.cached[i + 1] = _this.cached[i];
        _this.cached[_this.pos] = t;
        ++_this.pos;
        ++_this.right;
        var raw = Parser.raw;
        while(true) {
            while(true) {
                t = this.stream.next();
                state = raw.charCodeAt(16 * state + t.term);
                if(state >= 9) {
                    break;
                }
                t.state = state;
            }
            if(state == 255) {
                this.stream.pos -= 1;
                var _this = this.stream;
                state = _this.cached[_this.pos + (-1)].state;
            }
            while(true) {
                var q = raw.charCodeAt(159 - state);
                var value = this.cases(q);
                if(q >= 7) {
                    return value;
                }
                t = this.stream.reduce(Parser.lvs[q]);
                if(t.term == exp) {
                    this.stream.pos -= 2;
                    this.stream.junk(2);
                    return value;
                }
                t.val = value;
                var _this1 = this.stream;
                state = raw.charCodeAt(16 * _this1.cached[_this1.pos + (-2)].state + t.term);
                t.state = state;
                if(state < 9) {
                    break;
                }
            }
        }
    }
    ,cases: function(q) {
        var __s = this.stream;
        switch(q) {
        case 0:
            return __s.cached[__s.pos + (-2)].val;
        case 1:
            var e1 = __s.cached[__s.pos + (-3)].val;
            var e2 = __s.cached[__s.pos + (-1)].val;
            if(__s.cached[__s.pos + (-2)].term == 2) {
                return e1 + e2;
            } else {
                return e1 - e2;
            }
            break;
        case 2:
            return __s.cached[__s.pos + (-3)].val * __s.cached[__s.pos + (-1)].val;
        case 3:
            return __s.cached[__s.pos + (-3)].val / __s.cached[__s.pos + (-1)].val | 0;
        case 4:
            return __s.cached[__s.pos + (-2)].val;
        case 5:
            return -__s.cached[__s.pos + (-1)].val;
        case 6:
            return Std.parseInt(__s.stri(-1));
        default:
            var t = this.stream.peek(0);
            throw new Error("Unexpected \"" + (t.term != 0 ? this.stream.lex.getString(t.pmin,t.pmax - t.pmin) : "Eof") + "\"");
        }
    }
};
var Std = function() { };
Std.parseInt = function(x) {
    var v = parseInt(x);
    if(isNaN(v)) {
        return null;
    }
    return v;
};
var lm_Tok = function(t,min,max) {
    this.term = t;
    this.pmin = min;
    this.pmax = max;
};
var lm_Stream = function(l) {
    this.lex = l;
    var this1 = new Array(128);
    this.cached = this1;
    this.right = 0;
    this.pos = 0;
};
lm_Stream.prototype = {
    reclaim: function(tok) {
        tok.nxt = this.h;
        this.h = tok;
    }
    ,newTok: function(term,min,max) {
        if(this.h == null) {
            return new lm_Tok(term,min,max);
        } else {
            var t = this.h;
            this.h = this.h.nxt;
            t.term = term;
            t.pmin = min;
            t.pmax = max;
            return t;
        }
    }
    ,peek: function(i) {
        while(this.right - this.pos <= i) {
            var t = this.lex.token();
            this.cached[this.right++] = this.newTok(t,this.lex.pmin,this.lex.pmax);
        }
        return this.cached[this.pos + i];
    }
    ,junk: function(n) {
        if(n <= 0) {
            return;
        }
        if(this.right - this.pos >= n) {
            var i = n;
            while(i-- > 0) this.reclaim(this.cached[this.pos + i]);
            i = this.pos;
            this.right -= n;
            while(i < this.right) {
                this.cached[i] = this.cached[i + n];
                ++i;
            }
        } else {
            n -= this.right - this.pos;
            while(n-- > 0) this.lex.token();
            while(this.right > this.pos) this.reclaim(this.cached[--this.right]);
        }
    }
    ,stri: function(dx) {
        var t = this.cached[this.pos + dx];
        return this.lex.getString(t.pmin,t.pmax - t.pmin);
    }
    ,next: function() {
        if(this.right == this.pos) {
            var t = this.lex.token();
            this.cached[this.right++] = this.newTok(t,this.lex.pmin,this.lex.pmax);
        }
        return this.cached[this.pos++];
    }
    ,reduce: function(lvw) {
        var w = lvw & 255;
        if(w == 0) {
            return this.reduceEP(lvw >>> 8);
        }
        var pmax = this.cached[this.pos + (-1)].pmax;
        --w;
        this.pos -= w;
        this.right -= w;
        var t = this.cached[this.pos + (-1)];
        t.term = lvw >>> 8;
        t.pmax = pmax;
        if(w == 0) {
            return t;
        }
        var i = w;
        while(i-- > 0) this.reclaim(this.cached[this.pos + i]);
        i = this.pos;
        while(i < this.right) {
            this.cached[i] = this.cached[i + w];
            ++i;
        }
        return t;
    }
    ,reduceEP: function(lv) {
        var prev = this.cached[this.pos - 1];
        var t = this.newTok(lv,prev.pmax,prev.pmax);
        var i = this.right;
        while(--i >= this.pos) this.cached[i + 1] = this.cached[i];
        this.cached[this.pos] = t;
        ++this.pos;
        ++this.right;
        return t;
    }
};
Lexer.raw = "ÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿ\t\x08\x07\x06ÿ\x05ÿ\x04\x03\x02\x02\x02\x02\x02\x02\x02\x02\x02ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x06\x07\x04\x02\x03\x05\x01\x01\x00ÿ";
Parser.raw = "ÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x08ÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\nÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x03ÿÿÿÿÿÿÿÿ\x04\x04\x06\x07ÿ\x0Bÿÿÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x05ÿÿÿÿÿÿÿÿÿÿ\x06\x07ÿÿÿÿÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\rÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x0Cÿÿÿÿÿÿ\tÿ\x04\x04\x06\x07ÿÿÿÿÿÿÿÿÿÿÿ\x06\x02\x03\x04\x05\x00ÿÿÿ\x01ÿÿÿÿÿ";
Parser.lvs = [2050,2307,2307,2307,2307,2306,2305];
Demo.main();
})({});

About

Build lexer and simple parser(SimpleLR) in macro, It also provides lexer and simpleLR tool for c language

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages