Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Case insensitive literals #4

Open
mingodad opened this issue May 9, 2022 · 94 comments
Open

Case insensitive literals #4

mingodad opened this issue May 9, 2022 · 94 comments

Comments

@mingodad
Copy link
Contributor

mingodad commented May 9, 2022

Related to #1 (comment) I did implemented it here https://github.com/mingodad/peg doing a stack top replacement, maybe something like this can be done in chpeg too.

In https://github.com/mingodad/peg/src/peg.peg:

Primary		<- 
		 ...
		 / AT? (
                        LiteralSQ		{ push(makeString(yytext, '\'')); }
                        / LiteralDQ		{ push(makeString(yytext, '"')); }
                        / LiteralBQ		{ push(makeString(yytext, '`')); }
                        )
                    ('i' !IdentStart  { setTopStrCharCaseInsensitive(); })? Spacing
		...
@ChrisHixon
Copy link
Owner

Some hints about adding a VM op code like this:

// Literal
            case LIT: // arg = str_id; match literal string; skip next instruction on match
	    case LIT_NC:
                {
                    int len = self->str_len[arg];
                    if ((offset < (size - (len - 1))) && !(
			    (op == LIT) ? memcmp(&input[offset], self->strings[arg], len)
					: strncasecmp(&input[offset], self->strings[arg], len))) {
                        offset += len;
                        ++pc;
                        break;
                    }
                }

There are many places you'll have to add this instruction/op code for this to work. I think the easiest way to deal with adding a new instruction is in the bootstrap code. (You could hack the generated files and go that way, but peg_byte_code has to be regenerated too.) The only way to do that now is with the ruby code in bootstrap/peg_cbyte.rb. Make your own class and instruction, perhaps something like this, modified from Literal:

            class LiteralNC
                def post_reduce
                    @str_id = @root.add_string(value)
                end
                def parse_setup
                    @parse_state = @root.alloc_state(self)
                    @goto_state = @root.alloc_state(self)
                end
                def append_instructions(inst)
                    add_inst(inst, @parse_state, :LIT_NC, @str_id)
                    add_inst(inst, @goto_state, :GOTO, @parent.fail_state(self) - 1)
                end
            end

Do a search in the bootstrap dir for LIT and Literal and add your variation, mirroring what is done for those.

I created the bootstrap process in Ruby before there was a C version of the byte code compiler. All it really does is create the hardcoded C version of the parser byte code and tack stuff on the C 'templates'. Do a diff on the templates vs. generated C files and you'll see what I mean.

The C version of the compiler (in the src dir) came later. That will also need to be modified to handle your new instruction.

Since there is a C version of the compiler, it wouldn't be that hard to create something to output hardcoded bytecode (as C code) and then perhaps ditch the Ruby bootstrap. This is another potential feature that could be used to create a parser with your bytecode already compiled in.

@mingodad
Copy link
Contributor Author

mingodad commented May 9, 2022

I just finished this C function that probably allow remove the dependence on ruby:

void ByteCode_write_cbytecode(FILE *fp, const ByteCode *bc)
{
    fprintf(fp, "int peg_num_defs = %d;\n", bc->num_defs);
    fprintf(fp, "char *peg_def_names[] = {");
    for(int i=0; i < bc->num_defs; ++i) {
        fprintf(fp, "%s\"%s\"", i ? "," : "", bc->def_names[i]);
    }
    fprintf(fp, "};\n");
    fprintf(fp, "int peg_def_flags[] = {");
    for(int i=0; i < bc->num_defs; ++i) {
        const char *str;
        switch(bc->def_flags[i]) {
            case STOP: str = "STOP"; break;
            case IGNORE: str = "IGNORE"; break;
            case LEAF: str = "LEAF"; break;
            default:
                str = "0";
        }
        fprintf(fp, "%s%s", i ? "," : "", str);
    }
    fprintf(fp, "};\n");
    fprintf(fp, "int peg_def_addrs[] = {");
    for(int i=0; i < bc->num_defs; ++i) {
        fprintf(fp, "%s%d", i ? "," : "", bc->def_addrs[i]);
    }
    fprintf(fp, "}; // presubtracted by 1\n");
    fprintf(fp, "int peg_num_strings = %d;\n", bc->num_strings);
    fprintf(fp, "char *peg_strings[] = {");
    for(int i=0; i < bc->num_strings; ++i) {
        char *esc_str = esc_string(bc->strings[i], bc->str_len[i], 0);
        fprintf(fp, "%s\"%s\"", i ? "," : "", esc_str);
        free(esc_str);
    }
    fprintf(fp, "};\n");
    fprintf(fp, "int peg_str_len[] = {");
    for(int i=0; i < bc->num_strings; ++i) {
        fprintf(fp, "%s%d", i ? "," : "", bc->str_len[i]);
    }
    fprintf(fp, "};\n");
    fprintf(fp, "int peg_num_instructions = %d;\n", bc->num_instructions);
    fprintf(fp, "int peg_instructions[] = {\n");
    for(int i=0; i < bc->num_instructions; ++i) {
        int op = bc->instructions[i] & 0xff;
        int arg = bc->instructions[i] >> 8;
        fprintf(fp, "\t/*%d*/ INST(%s, %d),\n", i, OpNames[op], arg);
    }
    fprintf(fp, "};\n");    
}

@ChrisHixon
Copy link
Owner

Nice! Yeah, with a few more steps the Ruby bootstrap could be eliminated.

@mingodad
Copy link
Contributor Author

Here is a css grammar that was converted from pegjs (https://github.com/pegjs/pegjs/blob/master/examples/css.pegjs) using [aA] to get case insensitive in some places:

 start <-
	 stylesheet comment*

 stylesheet <-
	 ( CHARSET_SYM STRING ";" )? ( S / CDO / CDC )* ( import ( CDO S* / CDC S* )* )* ( ( ruleset / media / page ) ( CDO S* / CDC S* )* )*

 import <-
	 IMPORT_SYM S* ( STRING / URI ) S* media_list? ";" S*

 media <-
	 MEDIA_SYM S* media_list "{" S* ruleset* "}" S*

 media_list <-
	 medium ( "," S* medium )*

 medium <-
	 IDENT S*

 page <-
	 PAGE_SYM S* pseudo_page? "{" S* declaration? ( ";" S* declaration? )* "}" S*

 pseudo_page <-
	 ":" IDENT S*

 operator <-
	 "/" S* / "," S*

 combinator <-
	 "+" S* / ">" S*

 property <-
	 IDENT S*

 ruleset <-
	 selector ( "," S* selector )* "{" S* declaration? ( ";" S* declaration? )* "}" S*

 selector <-
	 simple_selector S* combinator selector / simple_selector S+ selector / simple_selector S*

 simple_selector <-
	 element_name ( id / class / attrib / pseudo )* / ( id / class / attrib / pseudo )+

 id <-
	 HASH

 class <-
	 "." IDENT

 element_name <-
	 IDENT / "*"

 attrib <-
	 "[" S* IDENT S* ( ( "=" / INCLUDES / DASHMATCH ) S* ( IDENT / STRING ) S* )? "]"

 pseudo <-
	 ":" ( FUNCTION S* ( IDENT S* )? ")" / IDENT )

 declaration <-
	 property ':' S* expr prio?

 prio <-
	 IMPORTANT_SYM S*

 expr <-
	 term ( operator? term )*

 term {L} <-
	 ( PERCENTAGE / LENGTH / EMS / EXS / ANGLE / TIME / FREQ / NUMBER ) S* / STRING S* / URI S* / function / hexcolor / IDENT S*

 function <-
	 FUNCTION S* expr ")" S*

 hexcolor <-
	 HASH S*

 h <-
	 [0-9a-fA-F]

 nonascii <-
	 #[\x80-\uFFFF]
	 [\200-\377]

 unicode <-
	 "\\" ( h h? h? h? h? h? ) ( "\r\n" / [ \t\r\n\f] )?

 escape <-
	 unicode / "\\" [^\r\n\f0-9a-fA-F]

 nmstart <-
	 [_a-zA-Z] / nonascii / escape

 nmchar <-
	 [_a-zA-Z0-9-] / nonascii / escape

 string1 <-
	 '"' ( [^\n\r\f\\"] / "\\" nl / escape )* '"'

 string2 <-
	 "'" ( [^\n\r\f\\'] / "\\" nl / escape )* "'"

 comment <-
	 "/*" [^*]* "*"+ ( [^/*] [^*]* "*"+ )* "/"

 ident {L} <-
	 "-"? nmstart nmchar*

 name {L} <-
	 nmchar+

 num {L} <-
	 [+-]? ( [0-9]* "." [0-9]+ / [0-9]+ ) ( "e" [+-]? [0-9]+ )?

 string {L} <-
	 string1 / string2

 url <-
	 ( [!#$%&*-\[\]-~] / nonascii / escape )*

 s {I} <-
	 [ \t\r\n\f]+

 w <-
	 s?

 nl <-
	 "\n" / "\r\n" / "\r" / "\f"

 A <-
	 [aA] / "\\" "0"? "0"? "0"? "0"? [\101\141] ( "\r\n" / [ \t\r\n\f] )?

 C <-
	 [cC] / "\\" "0"? "0"? "0"? "0"? [\103\143] ( "\r\n" / [ \t\r\n\f] )?

 D <-
	 [dD] / "\\" "0"? "0"? "0"? "0"? [\104\144] ( "\r\n" / [ \t\r\n\f] )?

 E <-
	 [eE] / "\\" "0"? "0"? "0"? "0"? [\105\145] ( "\r\n" / [ \t\r\n\f] )?

 G <-
	 [gG] / "\\" "0"? "0"? "0"? "0"? [\107\147] ( "\r\n" / [ \t\r\n\f] )? / "\\" [gG]

 H <-
	 [hH] / "\\" "0"? "0"? "0"? "0"? [\110\150] ( "\r\n" / [ \t\r\n\f] )? / "\\" [hH]

 I <-
	 [iI] / "\\" "0"? "0"? "0"? "0"? [\111\151] ( "\r\n" / [ \t\r\n\f] )? / "\\" [iI]

 K <-
	 [kK] / "\\" "0"? "0"? "0"? "0"? [\113\153] ( "\r\n" / [ \t\r\n\f] )? / "\\" [kK]

 L <-
	 [lL] / "\\" "0"? "0"? "0"? "0"? [\114\154] ( "\r\n" / [ \t\r\n\f] )? / "\\" [lL]

 M <-
	 [mM] / "\\" "0"? "0"? "0"? "0"? [\115\155] ( "\r\n" / [ \t\r\n\f] )? / "\\" [mM]

 N <-
	 [nN] / "\\" "0"? "0"? "0"? "0"? [\116\156] ( "\r\n" / [ \t\r\n\f] )? / "\\" [nN]

 O <-
	 [oO] / "\\" "0"? "0"? "0"? "0"? [\117\157] ( "\r\n" / [ \t\r\n\f] )? / "\\" [oO]

 P <-
	 [pP] / "\\" "0"? "0"? "0"? "0"? [\120\160] ( "\r\n" / [ \t\r\n\f] )? / "\\" [pP]

 R <-
	 [rR] / "\\" "0"? "0"? "0"? "0"? [\122\162] ( "\r\n" / [ \t\r\n\f] )? / "\\" [rR]

 S_ <-
	 [sS] / "\\" "0"? "0"? "0"? "0"? [\123\163] ( "\r\n" / [ \t\r\n\f] )? / "\\" [sS]

 T <-
	 [tT] / "\\" "0"? "0"? "0"? "0"? [\124\164] ( "\r\n" / [ \t\r\n\f] )? / "\\" [tT]

 U <-
	 [uU] / "\\" "0"? "0"? "0"? "0"? [\125\165] ( "\r\n" / [ \t\r\n\f] )? / "\\" [uU]

 X <-
	 [xX] / "\\" "0"? "0"? "0"? "0"? [\130\170] ( "\r\n" / [ \t\r\n\f] )? / "\\" [xX]

 Z <-
	 [zZ] / "\\" "0"? "0"? "0"? "0"? [\132\172] ( "\r\n" / [ \t\r\n\f] )? / "\\" [zZ]

 S {I} <-
	 comment* s

 CDO <-
	 comment* "<!--"

 CDC <-
	 comment* "-->"

 INCLUDES <-
	 comment* "~="

 DASHMATCH <-
	 comment* "|="

 STRING <-
	 comment* string

 IDENT <-
	 comment* ident

 HASH <-
	 comment* "#" name

 IMPORT_SYM <-
	 comment* "@" I M P O R T

 PAGE_SYM <-
	 comment* "@" P A G E

 MEDIA_SYM <-
	 comment* "@" M E D I A

 CHARSET_SYM <-
	 comment* "@charset "

 IMPORTANT_SYM <-
	 comment* "!" ( s / comment )* I M P O R T A N T

 EMS <-
	 comment* num E M

 EXS <-
	 comment* num E X

 LENGTH <-
	 comment* num P X / comment* num C M / comment* num M M / comment* num I N / comment* num P T / comment* num P C

 ANGLE <-
	 comment* num D E G / comment* num R A D / comment* num G R A D

 TIME <-
	 comment* num M S_ / comment* num S_

 FREQ <-
	 comment* num H Z / comment* num K H Z

 PERCENTAGE <-
	 comment* num "%"

 NUMBER <-
	 comment* num

 URI <-
	 comment* U R L "(" w string w ")" / comment* U R L "(" w url w ")"

 FUNCTION <-
	 comment* ident "("

@mingodad
Copy link
Contributor Author

Looking through the code now that I'm a bit more familiar with it I think I found a simple way to add case insensitive literals by mainly adding a new option at Compiler_add_instructions assuming a new definition called PEG_NOCASE and a new VM insruction LIT_NC and a change in the VM pointed here #4 (comment) :

void Compiler_add_instructions(CompilationUnit *cu, GNode *gp)
{
    switch (gp->type) {
        case PEG_GRAMMAR:
...
        case PEG_NOCASE:
            {
                int lit_inst = cu->bc->num_instructions-2; //assuming PEG_LITERAL is the previous isntruction
                int op = cu->bc->instructions[lit_inst] & 0xff;
                int arg = cu->bc->instructions[lit_inst] >> 8;
                ASSERT(op == LIT); // ensure it's what we expect
                cu->bc->instructions[lit_inst] = INST(LIT_NC, arg); // overwrite it and do not generate any new instruction
            }
            break;

@ChrisHixon
Copy link
Owner

ChrisHixon commented May 11, 2022

Yes you would need to add something there (Compiler_add_instructions), but I'm thinking there should not be a normal literal involved as the previous instruction. (See more below re: compiler)

I'm thinking the new chpeg grammar would be adjusted so the AST we end up with captures things like this:

"string" and 'string' would be captured as Literal
"string"i and 'string'i would be captured as LiteralNC

It's a matter of adjusting the grammar to capture things that way. The grammar is specifically tuned to the way the compiler wants things presented in the AST.

I don't know why it would be any different than PEG_LITERAL in the compiler:

void Compiler_add_instructions(CompilationUnit *cu, GNode *gp)
{
    switch (gp->type) {
        case PEG_GRAMMAR:
...
        case PEG_LITERAL:
            Compiler_add_inst(cu, INST(LIT, gp->val.ival));
            Compiler_add_inst(cu, INST(GOTO, gp->parent_fail_state - 1));
            break;
        case PEG_LITERAL_NOCASE:
            Compiler_add_inst(cu, INST(LIT_NC, gp->val.ival));
            Compiler_add_inst(cu, INST(GOTO, gp->parent_fail_state - 1));
            break;
    }
}

@mingodad
Copy link
Contributor Author

mingodad commented May 11, 2022

For example my usage of https://github.com/mingodad/peg from Piumarta showed me the need to know the differentiate literals as Single or Double quoted for further processing because depending on which language we'll use the result we'll need to apply different escaping filters and also https://github.com/edubart/lpegrex uses back quoted literals to automatically add spacing after then (an interesting approach and cpp-peglib has %whitespace for that).

As long as we have it available and is easy to use/implement I'm Ok with it.

For example with minimal change for what is already implemented and my minimal implementation this grammar would do it:

# Extended PEG grammar for chpeg bytecode parser/compiler
#
# The effective grammar is basically the same as the original PEG grammar from
# Bryan Ford's paper, but with Options added in front of <- to guide parse tree building.

# Hierarchical syntax
Grammar     {S} <- S Definition+ !.
Definition  { } <- Identifier S ('{' S Options S '}' S)? '<-' S Choice
Choice      { } <- Sequence ('/' S Sequence)*
Sequence    { } <- Predicate+
Predicate   { } <- (PredOp S)? Repeat
Repeat      { } <- Primary (RepOp S)?
Primary     { } <- Identifier S ![{<]
                 / '(' S Choice ')' S
                 / Literal S / CharClass S / Dot S

# Lexical syntax
Options     { } <- [ILS]*
Identifier  {L} <- [a-zA-Z_][a-zA-Z_0-9]*
Literal     {S} <- (['] (!['] Char)* [']
                 / ["] (!["] Char)* ["]) NoCase?
CharClass   {S} <- '[' (!']' CharRange)+ ']'
CharRange   { } <- Char '-' !']' Char / Char
Char        { } <- EscChar / OctChar / PlainChar 
EscChar     {L} <- '\\' [nrt'"\[\]\\]
OctChar     {L} <- '\\' [0-3][0-7][0-7]
                 / '\\' [0-7][0-7]?
PlainChar   {L} <- !'\\' .
PredOp      {L} <- [&!]
RepOp       {L} <- [*+?]
Dot         {L} <- '.'
NoCase      { } <- 'i'
S           {I} <- ([ \t\r\n]+ / '#' (![\r\n] .)* [\r\n] )*

@ChrisHixon
Copy link
Owner

Perhaps something like this would be easy to deal with for the compiler (untested)

Literal     {S} <- ( LiteralSQuote / LiteralDQuote ) LiteralOptions
LiteralSQuote <- ['] (!['] Char)* [']
LiteralDQuote <- ["] (!["] Char)* ["]
LiteralOptions {L} <- 'i'?

You want to see Literal in the compiler and then you check the first child which is either LiteralSQoute or LiteralDQuote, then take a look at the LiteralOptions to make it either normal LIT instruction or new LIT_NC instruction.

@mingodad
Copy link
Contributor Author

It seems fine to me, everything starts with Literal so it gives the correct idea that all of then are related instead of mine NoCase and it leave the door open to extend the LiteralOptions if the need arise.

@ChrisHixon
Copy link
Owner

This idea seems to work, I just tried this in the playground at https://yhirose.github.io/cpp-peglib/

Start <- (Literal S)+
Literal <- ( LiteralSQuote / LiteralDQuote ) LiteralOptions
LiteralSQuote <- ['] (!['] Char)* [']
LiteralDQuote <- ["] (!["] Char)* ["]
LiteralOptions <- 'i'?

Char        <- EscChar / OctChar / PlainChar 
EscChar     <- '\\' [nrt'"\[\]\\]
OctChar     <- '\\' [0-3][0-7][0-7]
                 / '\\' [0-7][0-7]?
PlainChar   <- !'\\' .
S           <- ([ \t\r\n]+ / '#' (![\r\n] .)* [\r\n] )*

@mingodad
Copy link
Contributor Author

After writing my previous answer I looked at it again and still think that having an extra definition only to give a meaningful name for the i LiteralOption would self document it (I'm using NoCase but another name that documents better the meaning of it is fine):

Literal     {S} <- ( LiteralSQuote / LiteralDQuote ) LiteralOptions
LiteralSQuote <- ['] (!['] Char)* [']
LiteralDQuote <- ["] (!["] Char)* ["]
LiteralOptions {L} <- NoCase?
NoCase { } <- 'i'

@mingodad
Copy link
Contributor Author

The playground is really useful and we should also try one for chpeg one possible seed for it is already here #1 (comment) .

@ChrisHixon
Copy link
Owner

ChrisHixon commented May 11, 2022

Yeah I like the idea of a playground, I will check your version out. [edited]

@mingodad
Copy link
Contributor Author

And probably it's better to have it like:

Literal     {S} <- ( LiteralSQuote / LiteralDQuote ) LiteralOptions?
LiteralSQuote <- ['] (!['] Char)* [']
LiteralDQuote <- ["] (!["] Char)* ["]
LiteralOptions {L} <- NoCase
NoCase { } <- 'i'

I still use an LL(1) parser generator https://github.com/mingodad/CocoR-CPP and family where a deletable rule is an warning.

@ChrisHixon
Copy link
Owner

I don't mind adding stuff in the grammar for documentation/clarity, if it doesn't affect the performance.

@ChrisHixon
Copy link
Owner

I have an LL1 project that's experimental not posted or really usable, that would optimize that NoCase out (no warning though). chpeg isn't doing any optimization of grammar or bytecode.

@ChrisHixon
Copy link
Owner

And for that reason chpeg's grammar that it uses itself should probably be optimized manually. You can always publish a separate grammar for documentation purposes.

@ChrisHixon
Copy link
Owner

You want it to be as fast as possible so that parsing the user's grammars is fast :) So it should end up with really minimal bytecode and minimize the tree building.

@mingodad
Copy link
Contributor Author

And here is a test grammar converted from http://ingmarschlecht.de/gamsToLatex/ to test a possible implementation of case insensitive literals, it's working with cpp-peglib (that's having a bit of trouble with it in terms of time to compile/parse), peg/leg, pegjs/peggy and sooner chpeg (also still there is the sql grammar from the first issue).

/usr/bin/time ./gamsToLatex gamsToLatex2.mod 
yy 1
0.09user 0.00system 0:00.09elapsed 100%CPU (0avgtext+0avgdata 1712maxresident)k
0inputs+0outputs (0major+71minor)pagefaults 0swaps

/usr/bin/time peglint gamsToLatex2.peglib gamsToLatex2.mod 
gamsToLatex2.peglib:205:2: 'endOfLineComment' is not referenced.
gamsToLatex2.peglib:190:2: 'sqrt' is not referenced.
gamsToLatex2.peglib:160:2: 'someString' is not referenced.
0.66user 0.00system 0:00.66elapsed 100%CPU (0avgtext+0avgdata 4388maxresident)k
0inputs+0outputs (0major+219minor)pagefaults 0swaps

gamsToLatex2.zip

@mingodad
Copy link
Contributor Author

By the way your LL1 project that's experimental can also interpret the grammar at runtime ?
Do you mind to share it ?

@ChrisHixon
Copy link
Owner

ChrisHixon commented May 11, 2022

The LL(1) project can't interpret grammar at run time or read any grammar from files. It's written in Python, and the grammar is defined in Python code. I was using it to iron out the concepts and add things like *, ?, +, {min, max} that aren't part of basic LL(1), and some work on rule optimization, which as I recall is important because you have to convert more complicated concepts to the simple LL(1) rules for the LL(1) algo to work, and when you do that in a straightforward/easy way the result often is sub-optimal. It can't even built parse trees but it does allow calling Python functions during parsing. It looks like the last thing I did was implement a basic calculator as a test using standard math ops, +-*/. I'm not sure I'm ready to share it but I'll revisit it and think about it.

@mingodad
Copy link
Contributor Author

Thank you !
As you describe it I'm not interested but nice to know that you are on this space looking for ways to make it easier/useful.

@mingodad
Copy link
Contributor Author

Hello @ChrisHixon I hope you're well !
I just found a way to make string case insensitive to work (without LiteralSQ/LiteralDQ), I was expecting the NoCase bytecode as standalone but it seems the the way chpeg works right now it's appended to the string literal, here is my code to find/use it:

static void ChpegCU_alloc_instructions(ChpegCU *cu, ChpegGNode *gp)
{
    switch (gp->type) {
...
        case CHPEG_BC(LITERAL):
            {
                int literal_op = CHPEG_OP_LIT;
#ifdef CHPEG_HAS_NOCASE
        //case CHPEG_BC(LITERALSQ):
        //case CHPEG_BC(LITERALDQ):
                for (ChpegGNode *p = gp->head; p; p = p->next) {
                    if(p->node->def == CHPEG_BC(NOCASE))
                        literal_op = CHPEG_OP_LIT_NC;
                }
#endif /*CHPEG_HAS_NOCASE*/
                ChpegCU_add_inst(cu, CHPEG_INST(literal_op, gp->val.ival));
                ChpegCU_add_inst(cu, CHPEG_INST(CHPEG_OP_GOTO, gp->parent_fail_state - 1));
            }
            break;
    }
}

@mingodad
Copy link
Contributor Author

Now I'm interested on have back reference implemented any idea ?

@ChrisHixon
Copy link
Owner

Back references could be a fun challenge. I'm thinking in the parser VM, you would first want to match the same thing the referenced item matches, and then compare the values by looking back into the parse tree.

Take this example (using new chpeg util):

./chpeg -G 'S <- N [0-9]* N    N <- [a-z]' -P a99a

If we wanted to convert this to back reference so the first N matches the second N, the grammar might be something like

S <- N [0-9]* $1
N <- [a-z]

Lets call the positions in S $1, $2, $3; so $1 is N, $2 is [0-9]*, $3 is the $1 back reference.

In the VM, when working on matching $3, first match N (what $1 refers to), then if that matches, execute a new BACKREF op that compares the values of $3 with $1 looking back into the parse tree.

@ChrisHixon
Copy link
Owner

Nice work on the case insensitive matching! I have not tried this yet... Do you have an isolated patch for this? It'd be great if it was on top of the current refactor branch without all of your other changes/experiments mixed in.

@mingodad
Copy link
Contributor Author

While testing chpeg_nocase with the grammar converted from http://ingmarschlecht.de/gamsToLatex/ I noticed that it's a bit slow (it takes a bit of time to show the result) even with cpp-peglib.

I compiled it with profiling and here is the partial output in hope it can help improve chpeg and/or end users experience:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 90.92      0.50     0.50        2   250.04   272.55  ChpegParser_parse
  6.36      0.54     0.04 12842352     0.00     0.00  ChpegParser_expected
  1.82      0.55     0.01   568553     0.00     0.00  ChpegNode_free
  0.91      0.55     0.01        1     5.00     5.00  ChpegParser_print_tree
  0.00      0.55     0.00   697389     0.00     0.00  chpeg_free
  0.00      0.55     0.00   696013     0.00     0.00  chpeg_malloc
  0.00      0.55     0.00   695546     0.00     0.00  ChpegNode_new
  0.00      0.55     0.00   695544     0.00     0.00  ChpegNode_push_child
  0.00      0.55     0.00   566392     0.00     0.00  ChpegNode_pop_child
  0.00      0.55     0.00     1488     0.00     0.00  ChpegCU_add_inst
  0.00      0.55     0.00     1488     0.00     0.00  ChpegCU_alloc_inst
  0.00      0.55     0.00     1376     0.00     0.00  chpeg_calloc
  0.00      0.55     0.00     1374     0.00     0.00  ChpegGNode_new
  0.00      0.55     0.00     1373     0.00     0.00  ChpegGNode_push_child
  0.00      0.55     0.00      324     0.00     0.00  ChpegCU_find_def
  0.00      0.55     0.00      152     0.00     0.00  ChpegCU_alloc_string
  0.00      0.55     0.00      138     0.00     0.00  ChpegByteCode_def_name
  0.00      0.55     0.00      138     0.00     0.00  chpeg_esc_bytes
  0.00      0.55     0.00        6     0.00     0.00  chpeg_realloc
  0.00      0.55     0.00        2     0.00     0.02  ChpegNode_unwrap
  0.00      0.55     0.00        2     0.00     0.00  ChpegParser_free
  0.00      0.55     0.00        2     0.00     0.00  ChpegParser_new
  0.00      0.55     0.00        2     0.00     0.00  chpeg_read_file
  0.00      0.55     0.00        1     0.00     0.00  ChpegByteCode_free
  0.00      0.55     0.00        1     0.00     0.00  ChpegByteCode_new
  0.00      0.55     0.00        1     0.00     0.00  ChpegCU_add_instructions
  0.00      0.55     0.00        1     0.00     0.00  ChpegCU_alloc_instructions
  0.00      0.55     0.00        1     0.00     0.00  ChpegCU_alloc_strings
  0.00      0.55     0.00        1     0.00     0.00  ChpegCU_build_tree
  0.00      0.55     0.00        1     0.00     0.00  ChpegCU_setup_def_addrs
  0.00      0.55     0.00        1     0.00     0.00  ChpegCU_setup_defs
  0.00      0.55     0.00        1     0.00     0.00  ChpegGNode_free
  0.00      0.55     0.00        1     0.00     0.00  ChpegGNode_reverse
  0.00      0.55     0.00        1     0.00     0.00  ChpegNode_print
  0.00      0.55     0.00        1     0.00   272.55  chpeg_compile
  0.00      0.55     0.00        1     0.00     0.00  chpeg_default_bytecode

gamsToLatex.zip content:

unzip -l gamsToLatex.zip 
Archive:  gamsToLatex.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    23129  2022-05-30 08:47   chpeg_nocase_prof.txt << the output of gprof
     1237  2022-05-11 12:02   gamsToLatex2.mod << the input file
     5636  2022-05-29 14:25   gamsToLatex3.chpeg << the grammar for chpeg_nocase
     5573  2022-05-29 20:38   gamsToLatex4.chpeg << the grammar for chpeg
     5389  2022-05-11 13:53   gamsToLatex2.peglib << the grammar for cpp-peglib
    12231  2022-05-10 19:11   gamsToLatex2.pegjs << the original pegjs grammar
     5377  2022-05-11 13:49   gamsToLatex2.peg << the naked original grammar converted to peg
---------                     -------
    58572                     7 files

gamsToLatex.zip

@mingodad
Copy link
Contributor Author

Execution time (build without optimization):

/usr/bin/time /home/mingo/dev/c/A_grammars/chpeg-dad/examples/chpeg_nocase gamsToLatex3.chpeg gamsToLatex2.mod
...
1233      1     73  3 --L       literalString "Z"
0.61user 0.00system 0:00.61elapsed 99%CPU (0avgtext+0avgdata 2036maxresident)k
0inputs+0outputs (0major+133minor)pagefaults 0swaps
/usr/bin/time /home/mingo/dev/c/A_grammars/chpeg-dad/examples/chpeg_nocase gamsToLatex3-perf.chpeg gamsToLatex2.mod
...
  1233      1     73  3 --L       literalString "Z"
0.00user 0.00system 0:00.00elapsed 87%CPU (0avgtext+0avgdata 1940maxresident)k
0inputs+0outputs (0major+131minor)pagefaults 0swaps

@mingodad
Copy link
Contributor Author

Would be nice if we could detect bad definitions and emmit warnings/suggestions.

@mingodad
Copy link
Contributor Author

Another interesting thing found is that I inlined the comment definition and commented it out, but if uncomment it (but still not referencing it) the VM loops increase more than 70% (using the gamsToLatex3-perf.chpeg grammar):
With comment commented out:

73          76   2.35          64          12  literalString {L} <- [A-Za-z_] ( [A-Za-z0-9_] )* 

            3229               1778        1451  Total counters

                              55.06       44.94  % success/fail

Total VM loops 29890

With comment uncommented (but not referenced):

74               0.00                          comment <- [\n] "*" ( ! [\n] . )* 

            3229               1778        1451  Total counters

                              55.06       44.94  % success/fail

Total VM loops 51871

@mingodad
Copy link
Contributor Author

mingodad commented May 31, 2022

Doing experiments I found that if we have a comment definition like the one bellow (without referencing it) the VM count doesn't grow:

comment <-
	![\n]

But if we add a dot (again without referencing it) then the VM count grow 70%:

comment <-
	![\n].

@ChrisHixon
Copy link
Owner

Doing experiments I found that if we have a comment definition like the one bellow (without referencing it) the VM count doesn't grow:

comment <-
	![\n]

But if we add a dot (again without referencing it) then the VM count grow 70%:

comment <-
	![\n].

Can you isolate this to a small set of rules and then compare the bytecode and trace (from chpeg-trace -t or otherwise enabling VM_TRACE)?

@mingodad
Copy link
Contributor Author

mingodad commented Jun 1, 2022

I'm using the grammar shown bellow and it seems that the VM loop counter I'm using is accumulating the grammar compilation, because when I delete the top comment on the grammar the VM loop count also change.

 #../../chpeg-dad/examples/chpeg_chpeg gamsToLatex2.chpeg gamsToLatex2.mod
 start <-
	 (_ integer)+ !.

 integer <-
	 [0-9]+

 _ {I} <-
	 #comment / [ \n\t\r]
         #[\n] ( "*" (![\n] .)* / [ \t\r]) #TODO fix error message with this
         [\n] "*" (![\n] .)* / [ \t\n\r]+ #inline comment


comment <-
	 ( [\n] "*" ) (![\n] .)*

#comment <-
#	![\n] .

This input:

 23 49
 66 11

Output with grammar with top comment:

chpeg_nocase gamsToLatex3-vmbug.chpeg gamsToLatex3-vmbug.txt 
chpeg_compile: Parse successful.
Grammar file compiled successfully. Parser returned: 0
Parse successful.
  id       total      %     success        fail  definition
               1  10.00           1              start <- ( _ integer )+ ! . 
   1           4  40.00           4              integer <- ( [0-9] )+ 
   2           5  50.00           4           1  _ {I} <- ( [\n] "*" ( ! [\n] . )* / ( [ \t\n\r] )+ ) 
   3               0.00                          comment <- [\n] "*" ( ! [\n] . )* 

              10                  9           1  Total counters

                              90.00       10.00  % success/fail

Total VM loops 7885

 offset   len     id dp flg ident "data"
     0     13      0  0 --- start " 23 49\n 66 11"
     1      2      1  1 ---   integer "23"
     4      2      1  1 ---   integer "49"
     8      2      1  1 ---   integer "66"
    11      2      1  1 ---   integer "11"

Output without grammar with top comment:

chpeg_nocase gamsToLatex3-vmbug.chpeg gamsToLatex3-vmbug.txt 
chpeg_compile: Parse successful.
Grammar file compiled successfully. Parser returned: 0
Parse successful.
  id       total      %     success        fail  definition
               1  10.00           1              start <- ( _ integer )+ ! . 
   1           4  40.00           4              integer <- ( [0-9] )+ 
   2           5  50.00           4           1  _ {I} <- ( [\n] "*" ( ! [\n] . )* / ( [ \t\n\r] )+ ) 
   3               0.00                          comment <- [\n] "*" ( ! [\n] . )* 

              10                  9           1  Total counters

                              90.00       10.00  % success/fail

Total VM loops 6316

 offset   len     id dp flg ident "data"
     0     13      0  0 --- start " 23 49\n 66 11"
     1      2      1  1 ---   integer "23"
     4      2      1  1 ---   integer "49"
     8      2      1  1 ---   integer "66"
    11      2      1  1 ---   integer "11"

@ChrisHixon
Copy link
Owner

Do you mean like the following?

comment1.chpeg:

#../../chpeg-dad/examples/chpeg_chpeg gamsToLatex2.chpeg gamsToLatex2.mod
 start <-
	 (_ integer)+ !.

 integer <-
	 [0-9]+

 _ {I} <-
	 #comment / [ \n\t\r]
         #[\n] ( "*" (![\n] .)* / [ \t\r]) #TODO fix error message with this
         [\n] "*" (![\n] .)* / [ \t\n\r]+ #inline comment


comment <-
	 ( [\n] "*" ) (![\n] .)*

#comment <-
#	![\n] .

comment2.chpeg:

#../../chpeg-dad/examples/chpeg_chpeg gamsToLatex2.chpeg gamsToLatex2.mod
 start <-
	 (_ integer)+ !.

 integer <-
	 [0-9]+

 _ {I} <-
	 #comment / [ \n\t\r]
         #[\n] ( "*" (![\n] .)* / [ \t\r]) #TODO fix error message with this
         [\n] "*" (![\n] .)* / [ \t\n\r]+ #inline comment


#comment <-
#	 ( [\n] "*" ) (![\n] .)*

comment <-
	![\n] .

I get the exact output from both:

% printf ' 23 49\n 66 11' | ../util/chpeg-trace --profile -g comment1.chpeg -p - --ast > comment1-out.txt
% printf ' 23 49\n 66 11' | ../util/chpeg-trace --profile -g comment2.chpeg -p - --ast > comment2-out.txt
% diff comment1-out.txt comment2-out.txt # no difference

comment1-out.txt:

Instructions executed:
   OP-   opcode       count      %
   OP      GOTO          15  12.50
   OP     IDENT          10   8.33
   OP     ISUCC           9   7.50
   OP     IFAIL           1   0.83
   OP     RSBEG           0   0.00
   OP     RQBEG           0   0.00
   OP    CHOICE           5   4.17
   OP    CISUCC           4   3.33
   OP     CFAIL           1   0.83
   OP    CIFAIL           6   5.00
   OP     RPBEG          10   8.33
   OP     RPMAT          17  14.17
   OP    RPDONE          10   8.33
   OP     RSMAT           0   0.00
   OP    RSDONE           0   0.00
   OP    RQDONE           0   0.00
   OP     RQMAT           0   0.00
   OP     PREDA           0   0.00
   OP     PREDN           1   0.83
   OP   PMATCHF           0   0.00
   OP   PNOMATF           0   0.00
   OP   PMATCHS           0   0.00
   OP   PNOMATS           1   0.83
   OP    CHRCLS          27  22.50
   OP       LIT           1   0.83
   OP       DOT           1   0.83
   OP      SUCC           1   0.83
   OP      FAIL           0   0.00
   OP      TRIM           0   0.00
   OP     TRIMS           0   0.00
   OP     TRIMF           0   0.00
   OP=    Total         120 100.00

Definition identifier calls:
  DEF-   id       IDENT      %       ISUCC       IFAIL  name
  DEF     0           1  10.00           1           0  start
  DEF     1           4  40.00           4           0  integer
  DEF     2           5  50.00           4           1  _
  DEF     3           0   0.00           0           0  comment
  DEF=   --          10 100.00           9           1  --

Choice calls per definition:
  CHOICE-  def      CHOICE      %      CISUCC      CIFAIL  def_name
  CHOICE     0           0   0.00           0           0  start
  CHOICE     1           0   0.00           0           0  integer
  CHOICE     2           5 100.00           4           6  _
  CHOICE     3           0   0.00           0           0  comment
  CHOICE=   --           5 100.00           4           6  --
 offset   len     id dp flg ident "data"
     0     13      0  0 --- start " 23 49\n 66 11"
     1      2      1  1 ---   integer "23"
     4      2      1  1 ---   integer "49"
     8      2      1  1 ---   integer "66"
    11      2      1  1 ---   integer "11"

120 instructions executed in both cases. I'm not sure what your 'VM loops' is measuring.

@mingodad
Copy link
Contributor Author

mingodad commented Jun 1, 2022

As I said before the VM loop counter was not zeroed before each call to ChpegParser_parse and was giving the sum of parsing the grammar too.
And even parsing the grammar with an extra not referenced rule seems a bit high a 70% increase in VM loop count.

VM loop count (parsing the grammar + parsint the input) with unused comment rule:

Simple grammar Total VM loops 7885
gamsToLatex3.chpeg Total VM loops 51871

VM loop count (parsing the grammar + parsint the input) without unused comment rule:

Simple grammar Total VM loops 6316
gamsToLatex3.chpeg Total VM loops 29890

@ChrisHixon
Copy link
Owner

As I said before the VM loop counter was not zeroed before each call to ChpegParser_parse and was giving the sum of parsing the grammar too. And even parsing the grammar with an extra not referenced rule seems a bit high a 70% increase in VM loop count.

VM loop count (parsing the grammar + parsint the input) with unused comment rule:

Simple grammar Total VM loops 7885
gamsToLatex3.chpeg Total VM loops 51871

VM loop count (parsing the grammar + parsint the input) without unused comment rule:

Simple grammar Total VM loops 6316
gamsToLatex3.chpeg Total VM loops 29890

Can you send the grammar files involved in this test?

@mingodad
Copy link
Contributor Author

mingodad commented Jun 1, 2022

Here they are (without the case insensitive strings):

gamsToLatex2.mod <<-- the input file
gamsToLatex4.chpeg <<-- the original grammar
gamsToLatex4-perf.chpeg <<-- the cleanup grammar

Output (VM loop count = parsing grammar + parsing input):
Without unused comment definition: (# comment <-)

chpeg_hex gamsToLatex4-perf.chpeg gamsToLatex2.mod 
chpeg_compile: Parse successful.
Grammar file compiled successfully. Parser returned: 0
Parse successful.
  id       total      %     success        fail  definition
               1   0.03           1              start <- ( gamsCode )+ ! . 
   1          15   0.46          14           1  gamsCode <- _ gamsCodePart _ 
...
  71               0.00                          comment <- [\n] "*" ( ! [\n] . )* 

            3257               1792        1465  Total counters

                              55.02       44.98  % success/fail

Total VM loops 30100

With unused comment definition: ( comment <-)

chpeg_hex gamsToLatex4-perf.chpeg gamsToLatex2.mod 
chpeg_compile: Parse successful.
Grammar file compiled successfully. Parser returned: 0
Parse successful.
  id       total      %     success        fail  definition
               1   0.03           1              start <- ( gamsCode )+ ! . 
   1          15   0.46          14           1  gamsCode <- _ gamsCodePart _ 
...
  71               0.00                          comment <- [\n] "*" ( ! [\n] . )* 

            3257               1792        1465  Total counters

                              55.02       44.98  % success/fail

Total VM loops 52079

gamsToLatex4-perf.zip

@ChrisHixon
Copy link
Owner

My results profiling the parsing of the grammars

% ../../../chpeg-profile/util/chpeg-trace --profile -p gamsToLatex4-perf-comment-test.chpeg | grep Total
   OP=    Total       76594 100.00
% ../../../chpeg-profile/util/chpeg-trace --profile -p gamsToLatex4-perf.chpeg | grep Total 
   OP=    Total       77228 100.00

Not much difference from each other, but different from your numbers. This Total is total instruction executed, and it checks out vs. the complete trace with instruction count.

The only difference in the file is in gamsToLatex4-perf-comment-test.chpeg has this commented out:

# comment <-
#	 ( [\n] "*" ) (![\n] .)*

This util used is from chpeg profile branch.

@ChrisHixon
Copy link
Owner

Parsing input using both grammars, exactly the same:

% ../../../chpeg-profile/util/chpeg-trace --profile -g gamsToLatex4-perf-comment-test.chpeg -p gamsToLatex2.mod | grep Total
   OP=    Total       30100 100.00
% ../../../chpeg-profile/util/chpeg-trace --profile -g gamsToLatex4-perf.chpeg -p gamsToLatex2.mod | grep Total 
   OP=    Total       30100 100.00

@mingodad
Copy link
Contributor Author

mingodad commented Jun 1, 2022

Sorry by the noise, it was my mistake, I did changed malloc by calloc in some places and somehow I've got confused thinking that in ChpegParser_new I was using calloc and hence no need to initialize the new ChpegParser structure member vm_count, after seeing your results and running your code I was getting crazy then I run my code under valgrind and it told me the use of unitialized variable.

I didn't suspected of it because the numbers were allays the same and only manifested when uncommenting the comment.

Again thank you for all your help and dedication !

Fixed here mingodad@5ec5d0d

@ChrisHixon
Copy link
Owner

No prob. Check out some inital packrat experiments:

% /usr/bin/time ../../util/chpeg -v -g c99-mouse.peg -packrat -p c99-mouse.peg.pp.c > out-packrat.txt 2>&1
% /usr/bin/time ../../util/chpeg -v -g c99-mouse.peg -p c99-mouse.peg.pp.c > out-normal.txt 2>&1

% diff out-*                                                                                    
62295,62296c62295,62296
< 0.58user 0.12system 0:00.71elapsed 100%CPU (0avgtext+0avgdata 454252maxresident)k
< 0inputs+7448outputs (0major+113281minor)pagefaults 0swaps
---
> 0.12user 0.10system 0:00.23elapsed 99%CPU (0avgtext+0avgdata 262404maxresident)k
> 0inputs+7448outputs (0major+65488minor)pagefaults 0swaps

% tail out-*                                                                                    
==> out-normal.txt <==
187580      6    180  6 ---             SEMI ";\n    "
187586      4    153  5 ---           RWING "}\n  "
187590     14     48  3 ---       JumpStatement "return yyctx;\n"
187590      7     96  4 ---         RETURN "return "
187597      5    117  4 --L         Identifier "yyctx"
187602      2    180  4 ---         SEMI ";\n"
187604      2    153  3 ---       RWING "}\n"
187606      0    194  1 ---   EOT ""
0.58user 0.12system 0:00.71elapsed 100%CPU (0avgtext+0avgdata 454252maxresident)k
0inputs+7448outputs (0major+113281minor)pagefaults 0swaps

==> out-packrat.txt <==
187580      6    180  6 ---             SEMI ";\n    "
187586      4    153  5 ---           RWING "}\n  "
187590     14     48  3 ---       JumpStatement "return yyctx;\n"
187590      7     96  4 ---         RETURN "return "
187597      5    117  4 --L         Identifier "yyctx"
187602      2    180  4 ---         SEMI ";\n"
187604      2    153  3 ---       RWING "}\n"
187606      0    194  1 ---   EOT ""
0.12user 0.10system 0:00.23elapsed 99%CPU (0avgtext+0avgdata 262404maxresident)k
0inputs+7448outputs (0major+65488minor)pagefaults 0swaps

Memory use is crazy because I disabled node freeing until I figure out a new memory management strategy. I also plan on trying the packrat window idea.

@mingodad
Copy link
Contributor Author

mingodad commented Jun 1, 2022

After seeing what a difference a grammar fine tuning can make on parser performance I'm having a mixed feeling about packrat because it'll allow people to write crazy grammars and the packrat cache will hide the mess .

Like with this experiment with gamsToLatex at first I was also thinking in how to improve the parser and or memory management but after the cleanup the performance sky-rocked without need to touch the parser.

I'm more interested now on back reference, better error message (line:col info on error messages) and figure out the usage of the AST for real usage (dog food).

@mingodad
Copy link
Contributor Author

mingodad commented Jun 1, 2022

Here is a naive first implementation to show line/col info on error messages mingodad@4530bd4

@mingodad
Copy link
Contributor Author

mingodad commented Jun 3, 2022

Using this grammar #20 (comment) for bison and a similar one for cpp-peglib when parsing postgresql-13.3/src/backend/parser/gram.y (450KB) I'm getting this results:

chpeg compiled without optimizations:

/usr/bin/time ../../chpeg-dad/examples/chpeg_nocase bison.chpeg ../../../A_databases/postgresql-13.3/src/backend/parser/gram.y
chpeg_compile: Parse successful.
Grammar file compiled successfully. Parser returned: 0
Parse successful.
  id       total      %     success        fail  definition
               1   0.00           1              input <- sp ( prologue_declaration )* "%%" sp grammar ( epilogue )? YYEOF 
   1         244   0.05         243           1  prologue_declaration <- ( grammar_declaration / PROLOGUE / ( "%<flag>" / "%locations" ) sp / "%define" sp variable ( value )? / "%header" sp ( string_opt )? / "%error-verbose" sp / "%expect" [-_] "rr" sp INT_LITERAL / "%expect" sp INT_LITERAL / "%file" [-_] "prefix" sp eqopt STRING / "%glr" [-_] "parser" sp / "%initial" [-_] "action" sp braces / "%language" sp STRING / "%name" [-_] "prefix" sp ( "=" sp )? STRING / "%nondeterministic-parser" / "%no" [-_] "lines" sp / "%output" sp STRING / ( "%param" / "%parse" [-_] "param" / "%lex" [-_] "param" ) sp params / "%pure" [-_] "parser" sp / "%require" sp STRING / "%skeleton" sp STRING / "%token" [-_] "table" sp / "%verbose" sp / "%yacc" sp / error sp SEMICOLON / SEMICOLON ) 
   2           2   0.00           2              params <- ( braces )+ 
...
55        6657   1.25        6657              char <- ( "\\" [-abefnrtv'\"[]\\] / "\\" "x" [0-9A-Fa-f] [0-9A-Fa-f] / "\\" "x" [0-9A-Fa-f] / "\\" [0-3] [0-7] [0-7] / "\\" [0-7] ( [0-7] )? / ! "\\" . ) 
  56           2   0.00                       2  error <- "error" sp 

          533866             427549      106317  Total counters

                              80.09       19.91  % success/fail

Total VM loops 6408863

 0.10user 0.00system 0:00.11elapsed 94%CPU (0avgtext+0avgdata 4592maxresident)k
0inputs+0outputs (0major+882minor)pagefaults 0swaps

chpeg compiled with -O2 optimizations:

/usr/bin/time ../../chpeg-dad/examples/chpeg_nocase bison.chpeg ../../../A_databases/postgresql-13.3/src/backend/parser/gram.y
chpeg_compile: Parse successful.
Grammar file compiled successfully. Parser returned: 0
Parse successful.
  id       total      %     success        fail  definition
               1   0.00           1              input <- sp ( prologue_declaration )* "%%" sp grammar ( epilogue )? YYEOF 
   1         244   0.05         243           1  prologue_declaration <- ( grammar_declaration / PROLOGUE / ( "%<flag>" / "%locations" ) sp / "%define" sp variable ( value )? / "%header" sp ( string_opt )? / "%error-verbose" sp / "%expect" [-_] "rr" sp INT_LITERAL / "%expect" sp INT_LITERAL / "%file" [-_] "prefix" sp eqopt STRING / "%glr" [-_] "parser" sp / "%initial" [-_] "action" sp braces / "%language" sp STRING / "%name" [-_] "prefix" sp ( "=" sp )? STRING / "%nondeterministic-parser" / "%no" [-_] "lines" sp / "%output" sp STRING / ( "%param" / "%parse" [-_] "param" / "%lex" [-_] "param" ) sp params / "%pure" [-_] "parser" sp / "%require" sp STRING / "%skeleton" sp STRING / "%token" [-_] "table" sp / "%verbose" sp / "%yacc" sp / error sp SEMICOLON / SEMICOLON ) 
   2           2   0.00           2              params <- ( braces )+ 
...
55        6657   1.25        6657              char <- ( "\\" [-abefnrtv'\"[]\\] / "\\" "x" [0-9A-Fa-f] [0-9A-Fa-f] / "\\" "x" [0-9A-Fa-f] / "\\" [0-3] [0-7] [0-7] / "\\" [0-7] ( [0-7] )? / ! "\\" . ) 
  56           2   0.00                       2  error <- "error" sp 

          533866             427549      106317  Total counters

                              80.09       19.91  % success/fail

Total VM loops 6408863

 0.07user 0.00system 0:00.07elapsed 93%CPU (0avgtext+0avgdata 4644maxresident)k
0inputs+0outputs (0major+884minor)pagefaults 0swaps

cpp-peglib without optimizations/AST:

/usr/bin/time ../../cpp-peglib0/build/lint/peglint  --profile bison.peglib ../../../A_databases/postgresql-13.3/src/backend/parser/gram.y
bison.peglib:191:1: 'splice' is not referenced.
  id       total      %     success        fail  definition
   0           1   0.00           1           0  input
   1       20971   3.78       20971           0  sp
...
 47           1   0.00           1           0  epilogue
  48           1   0.00           1           0  YYEOF

          554837             427549      127288  Total counters
                              77.06       22.94  % success/fail
2.45user 0.00system 0:02.46elapsed 99%CPU (0avgtext+0avgdata 5392maxresident)k
0inputs+0outputs (0major+398minor)pagefaults 0swaps

cpp-peglib without optimizations/AST but --packrat:

/usr/bin/time ../../cpp-peglib0/build/lint/peglint --packrat --profile bison.peglib ../../../A_databases/postgresql-13.3/src/backend/parser/gram.y
bison.peglib:191:1: 'splice' is not referenced.
  id       total      %     success        fail  definition
   0           1   0.00           1           0  input
   1       20971   3.79       20971           0  sp
...
47           1   0.00           1           0  epilogue
  48           1   0.00           1           0  YYEOF

          553176             427549      125627  Total counters
                              77.29       22.71  % success/fail
2.96user 0.02system 0:03.00elapsed 99%CPU (0avgtext+0avgdata 44664maxresident)k
0inputs+0outputs (0major+10290minor)pagefaults 0swaps

cpp-peglib without AST but -O2:

/usr/bin/time ../../cpp-peglib0/build/lint/peglint  --profile bison.peglib ../../../A_databases/postgresql-13.3/src/backend/parser/gram.y
bison.peglib:191:1: 'splice' is not referenced.
  id       total      %     success        fail  definition
   0           1   0.00           1           0  input
   1       20971   3.78       20971           0  sp
...
47           1   0.00           1           0  epilogue
  48           1   0.00           1           0  YYEOF

          554837             427549      127288  Total counters
                              77.06       22.94  % success/fail
0.49user 0.00system 0:00.49elapsed 100%CPU (0avgtext+0avgdata 5024maxresident)k
0inputs+0outputs (0major+389minor)pagefaults 0swaps

cpp-peglib without AST but -O2 --packrat:

/usr/bin/time ../../cpp-peglib0/build/lint/peglint --packrat --profile bison.peglib ../../../A_databases/postgresql-13.3/src/backend/parser/gram.y
bison.peglib:191:1: 'splice' is not referenced.
  id       total      %     success        fail  definition
   0           1   0.00           1           0  input
   1       20971   3.79       20971           0  sp
...
 47           1   0.00           1           0  epilogue
  48           1   0.00           1           0  YYEOF

          553176             427549      125627  Total counters
                              77.29       22.71  % success/fail
0.54user 0.02system 0:00.57elapsed 99%CPU (0avgtext+0avgdata 44316maxresident)k
0inputs+0outputs (0major+10280minor)pagefaults 0swaps

@mingodad
Copy link
Contributor Author

mingodad commented Jun 3, 2022

And for comparison bison itself generating the parser:

/usr/bin/time bison-nb ../../../A_databases/postgresql-13.3/src/backend/parser/gram.y

../../../A_databases/postgresql-13.3/src/backend/parser/gram.y:202.1-12: warning: deprecated directive: ‘%pure-parser’, use ‘%define api.pure’ [-Wdeprecated]
  202 | %pure-parser
      | ^~~~~~~~~~~~
      | %define api.pure
../../../A_databases/postgresql-13.3/src/backend/parser/gram.y:204.1-22: warning: deprecated directive: ‘%name-prefix="base_yy"’, use ‘%define api.prefix {base_yy}’ [-Wdeprecated]
  204 | %name-prefix="base_yy"
      | ^~~~~~~~~~~~~~~~~~~~~~
      | %define api.prefix {base_yy}
../../../A_databases/postgresql-13.3/src/backend/parser/gram.y: warning: fix-its can be applied.  Rerun with option '--update'. [-Wother]
2.18user 0.03system 0:02.17elapsed 101%CPU (0avgtext+0avgdata 16284maxresident)k
0inputs+5256outputs (0major+9452minor)pagefaults 0swaps

@ChrisHixon
Copy link
Owner

(previously posted on wrong topic)

I've been considering adding the "caseinsensitive"i syntax (I know you have created an implementation of this); however I'm also thinking of a few other case insensitive things.

It would make sense to allow back-references to compare case insensitively. So start tag <a> could have an ending </A> if desired. I'm thinking of using :i postfix. So capture/mark: $tag<TAGNAME> reference: $tag:i.

Another thing I'm pondering is dictionaries like cpp-peglib. I'm thinking users might want to match those case intensively too.

Or perhaps even match entire rules or parenthesized groups case insensitively.

So the :i postfix could apply to many things. It would make sense to change the literal syntax to use :i as well. Another thought is to add a rule Option {i}, not to be confused with {I}. Also :postfix might make sense for other options other than i later.

@mingodad
Copy link
Contributor Author

mingodad commented Jun 20, 2022

My implementation only do it on literals and I saw in other implementaions (peggy/pegjs) that they also apply it to character class like:

hex <- [0-9a-z]i

In your example if TAGNAME is marked as case insensitive we don't need mark again $tag:i:

tagged <- '<' $tag<TAGNAME> '>' !'<' . '</' $tag '>'
TAGNAME <- [a-z]i+

@mingodad
Copy link
Contributor Author

The dictionaries I think that's useful because in several grammars there is constructions like the one shown bellow that consumes a lot of VM loops.

hardKeyword <-
	 AS
	/ BREAK
	/ CLASS
	/ ...

simpleIdentifier <-
	 ! hardKeyword Identifier

@mingodad
Copy link
Contributor Author

I would recommend to read https://github.com/edubart/lpegrex/blob/main/README.md for some interesting implementation ideas.

@ChrisHixon
Copy link
Owner

My implementation only do it on literals and I saw in other implementaions (peggy/pegjs) that they also apply it to character class like:

hex <- [0-9a-z]i

In your example if TAGNAME is marked as case insensitive we don't need mark again $tag:i:

tagged <- '<' $tag<TAGNAME> '>' !'<' . '</' $tag '>'
TAGNAME <- [a-z]i+

As references are implemented, you would need to mark the reference comparison as case-insensitive; references do a simple memcmp comparison to the captured value. Experiments seem to indicate cpp-peglib works that way too.

I'm getting some weird errors on cpp-peglib playground with this:

ROOT          <- CONTENT !.
CONTENT       <- (ELEMENT / TEXT)*
ELEMENT       <- $(STAG CONTENT ETAG)
STAG          <- '<'  < $tag<TAG_NAME> > '>'
ETAG          <- '</' < $tag > '>'
TAG_NAME      <- 'a'i / 'b'i
TEXT          <- (![<] .)+

Input:
<a>foo</A>

Error:
1:9 syntax error, unexpected 'A', expecting 's) i'.

@mingodad
Copy link
Contributor Author

I'm also getting an error with this:

ROOT          <- CONTENT !.
CONTENT       <- (ELEMENT / TEXT)*
ELEMENT       <- '<' <$tag<TAG_NAME>> '>' CONTENT '</' <$tag> '>'
TAG_NAME      <- 'a'i / 'b'i
TEXT          <- (![<] .)+

Input:

<a>foo</A>

Error:

1:9 syntax error, unexpected 'A', expecting 'a'.

@ChrisHixon
Copy link
Owner

That's the proper error. I get that consistenly on the command line (lint) version but the playground is giving me occasional corrupted bits like the above.

@mingodad
Copy link
Contributor Author

I just opened an issue on cpp-peglib yhirose/cpp-peglib#216

@mingodad
Copy link
Contributor Author

I just implemented character class case insensitive on my fork of peg/leg here https://github.com/mingodad/peg also added several leg features to peg see the https://github.com/mingodad/peg/blob/dad/examples/basic.leg and https://github.com/mingodad/peg/blob/dad/examples/basic.peg to see that they accept the same features (embedded C code, variables, ...).
While I was doing it I notice some similarity between chpeg code to detect infinite loop/left recursion and peg/leg code https://github.com/mingodad/peg/blob/85e23480e09a124e2a1c1b4772e2229f08bb2213/src/compile.c#L880 (obviously as they are working on a similar problem/application).

@mingodad
Copy link
Contributor Author

The LL(1) project can't interpret grammar at run time or read any grammar from files

I just got an initial port of CocoR-CSharp to Typescript/Javascript would you be interested in test/help improve the Typescript port ?

See also here Lercher/CocoR#2 (comment) .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants