Case insensitive literals #4
Some hints about adding a VM op code like this:
There are many places you'll have to add this instruction/op code for this to work. I think the easiest way to deal with adding a new instruction is in the bootstrap code. (You could hack the generated files and go that way, but
Do a search in the bootstrap dir for LIT and Literal and add your variation, mirroring what is done for those. I created the bootstrap process in Ruby before there was a C version of the byte code compiler. All it really does is create the hardcoded C version of the parser byte code and tack stuff on the C 'templates'. Do a diff on the templates vs. generated C files and you'll see what I mean. The C version of the compiler (in the src dir) came later. That will also need to be modified to handle your new instruction. Since there is a C version of the compiler, it wouldn't be that hard to create something to output hardcoded bytecode (as C code) and then perhaps ditch the Ruby bootstrap. This is another potential feature that could be used to create a parser with your bytecode already compiled in. |
I just finished this
|
Nice! Yeah, with a few more steps the Ruby bootstrap could be eliminated. |
Here is a
|
Looking through the code now that I'm a bit more familiar with it I think I found a simple way to add case insensitive literals by mainly adding a new option at
|
Yes you would need to add something there ( I'm thinking the new chpeg grammar would be adjusted so the AST we end up with captures things like this: "string" and 'string' would be captured as Literal It's a matter of adjusting the grammar to capture things that way. The grammar is specifically tuned to the way the compiler wants things presented in the AST. I don't know why it would be any different than PEG_LITERAL in the compiler:
|
For example, my usage of https://github.com/mingodad/peg from Piumarta showed me the need to differentiate literals as single- or double-quoted for further processing, because depending on the language that will consume the result we need to apply different escaping filters. Also, https://github.com/edubart/lpegrex uses back-quoted literals to automatically add As long as we have it available and it is easy to use/implement, I'm OK with it. For example, with minimal changes to what is already implemented and my minimal implementation, this grammar would do it:
|
Perhaps something like this would be easy to deal with for the compiler (untested)
You want to see Literal in the compiler and then check the first child, which is either LiteralSQuote or LiteralDQuote, then look at the LiteralOptions to emit either the normal LIT instruction or the new LIT_NC instruction. |
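A minimal sketch of that selection step and of the matching the new instruction would do, not the actual chpeg code. The names `OP_LIT`, `OP_LIT_NC`, `select_lit_op` and `match_literal` are hypothetical, and the sketch assumes the text captured by LiteralOptions contains an `i` for case-insensitive literals (the suffix peggy/pegjs use):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical op codes; OP_LIT_NC stands for the proposed LIT_NC. */
enum lit_op { OP_LIT, OP_LIT_NC };

/* Pick the instruction from the text captured by LiteralOptions.
 * Assumption: an 'i' in the options marks a case-insensitive literal. */
enum lit_op select_lit_op(const char *options)
{
    return (options != NULL && strchr(options, 'i') != NULL) ? OP_LIT_NC
                                                             : OP_LIT;
}

/* Try to match `lit` (`len` bytes) at `input`, which has `avail` bytes left.
 * Returns the number of bytes consumed, or -1 if the literal does not match. */
int match_literal(enum lit_op op, const char *input, size_t avail,
                  const char *lit, size_t len)
{
    assert(input != NULL && lit != NULL);
    if (len > avail)
        return -1;
    if (op == OP_LIT)
        return memcmp(input, lit, len) == 0 ? (int)len : -1;
    for (size_t i = 0; i < len; i++) {
        /* Cast to unsigned char: tolower() is undefined for negative values. */
        if (tolower((unsigned char)input[i]) != tolower((unsigned char)lit[i]))
            return -1;
    }
    return (int)len;
}
```

The case folding here is ASCII-oriented and depends on the C locale; the compiler could also lowercase the literal once at compile time so the VM only folds the input side.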
It seems fine to me, everything starts with |
This idea seems to work, I just tried this in the playground at https://yhirose.github.io/cpp-peglib/
|
After writing my previous answer I looked at it again and still think that having an extra definition only to give a meaningful name for the
|
The playground is really useful and we should also try one for |
Yeah I like the idea of a playground, I will check your version out. [edited] |
And probably it's better to have it like:
I still use an LL(1) parser generator https://github.com/mingodad/CocoR-CPP and family where a |
I don't mind adding stuff in the grammar for documentation/clarity, if it doesn't affect the performance. |
I have an experimental LL(1) project, not posted or really usable, that would optimize that NoCase out (no warning though). chpeg isn't doing any optimization of the grammar or bytecode. |
And for that reason chpeg's grammar that it uses itself should probably be optimized manually. You can always publish a separate grammar for documentation purposes. |
You want it to be as fast as possible so that parsing the user's grammars is fast :) So it should end up with really minimal bytecode and minimize the tree building. |
And here is a test grammar converted from http://ingmarschlecht.de/gamsToLatex/ to test a possible implementation of case insensitive literals, it's working with
|
By the way your |
The LL(1) project can't interpret grammar at run time or read any grammar from files. It's written in Python, and the grammar is defined in Python code. I was using it to iron out the concepts and add things like |
Thank you ! |
Hello @ChrisHixon I hope you're well !
|
Now I'm interested on have |
Back references could be a fun challenge. I'm thinking in the parser VM, you would first want to match the same thing the referenced item matches, and then compare the values by looking back into the parse tree. Take this example (using new chpeg util):
If we wanted to convert this to back reference so the first N matches the second N, the grammar might be something like
Let's call the positions in In the VM, when working on matching $3, first match N (what $1 refers to), then if that matches, execute a new BACKREF op that compares the value of $3 with that of $1 by looking back into the parse tree. |
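A rough sketch of the comparison such a BACKREF op could perform, assuming each parse tree node records the offset and length of its match in the input. `struct capture` and `backref_equal` are hypothetical names, not part of chpeg:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capture record: where a node's match starts in the input
 * and how many bytes it covers. */
struct capture {
    size_t offset;
    size_t length;
};

/* BACKREF check: the text just matched (`cur`, e.g. $3) must be byte-for-byte
 * equal to the text captured earlier (`ref`, e.g. $1).
 * Returns 1 when they are equal, 0 otherwise. */
int backref_equal(const char *input, struct capture ref, struct capture cur)
{
    assert(input != NULL);
    if (ref.length != cur.length)
        return 0;
    return memcmp(input + ref.offset, input + cur.offset, ref.length) == 0;
}
```

With the input `<a>x</a>`, the captures `{1, 1}` and `{6, 1}` both cover `a`, so the check succeeds; with `<a>x</b>` it fails and the VM would backtrack as for any other failed match.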
Nice work on the case insensitive matching! I have not tried this yet... Do you have an isolated patch for this? It'd be great if it was on top of the current refactor branch without all of your other changes/experiments mixed in. |
While testing I compiled it with profiling and here is the partial output in hope it can help improve
gamsToLatex.zip content:
|
Execution time (build without optimization):
|
It would be nice if we could detect bad definitions and emit warnings/suggestions. |
Another interesting thing found is that I inlined the
With
|
Doing experiments I found that if we have a
But if we add a
|
Can you isolate this to a small set of rules and then compare the bytecode and trace (from chpeg-trace -t or otherwise enabling VM_TRACE)? |
I'm using the grammar shown below, and it seems that the VM loop counter I'm using also accumulates the grammar compilation, because when I delete the top comment in the grammar the VM loop count also changes.
This input:
Output with grammar with top comment:
Output with grammar without top comment:
|
Do you mean like the following? comment1.chpeg:
comment2.chpeg:
I get the exact output from both:
comment1-out.txt:
120 instructions executed in both cases. I'm not sure what your 'VM loops' is measuring. |
As I said before, the VM loop counter was not zeroed before each call to VM loop count (parsing the grammar + parsing the input) with unused
VM loop count (parsing the grammar + parsing the input) without unused
|
Can you send the grammar files involved in this test? |
Here they are (without the case insensitive strings):
Output (VM loop count = parsing grammar + parsing input):
With unused
|
My results profiling the parsing of the grammars
Not much difference between them, but different from your numbers. The Total is the total number of instructions executed, and it checks out against the complete trace with instruction count. The only difference between the files is that gamsToLatex4-perf-comment-test.chpeg has this commented out:
This util used is from chpeg |
Parsing input using both grammars, exactly the same:
|
Sorry for the noise, it was my mistake. I did change I didn't suspect it because the numbers were always the same and it only manifested when uncommenting the Again, thank you for all your help and dedication! Fixed here mingodad@5ec5d0d |
No prob. Check out some initial packrat experiments:
Memory use is crazy because I disabled node freeing until I figure out a new memory management strategy. I also plan on trying the packrat window idea. |
After seeing what a difference fine-tuning a grammar can make to parser performance, I have mixed feelings about Like with this experiment with I'm more interested now in |
Here is a naive first implementation to show line/col info on error messages mingodad@4530bd4 |
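For readers following along, the usual way to derive line/col from a byte offset looks roughly like this. It is a generic sketch, not the code in the commit above, and `struct position` and `offset_to_position` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* 1-based line and column of a position in the input. */
struct position {
    size_t line;
    size_t col;
};

/* Walk the input up to `offset`, counting newlines to find the line and
 * counting bytes since the last newline to find the column. */
struct position offset_to_position(const char *text, size_t length,
                                   size_t offset)
{
    assert(text != NULL && offset <= length);
    struct position pos = { 1, 1 };
    for (size_t i = 0; i < offset; i++) {
        if (text[i] == '\n') {
            pos.line++;
            pos.col = 1;
        } else {
            pos.col++;
        }
    }
    return pos;
}
```

Columns are counted in bytes here, so multi-byte UTF-8 characters advance the column by more than one. The scan is linear in the offset, which is fine when it only runs once to report an error.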
Using this grammar #20 (comment) for chpeg compiled without optimizations:
chpeg compiled with
cpp-peglib without optimizations/AST:
cpp-peglib without optimizations/AST but
cpp-peglib without AST but
cpp-peglib without AST but
|
And for comparison
|
(previously posted on wrong topic) I've been considering adding the It would make sense to allow back-references to compare case insensitively. So start tag Another thing I'm pondering is dictionaries like cpp-peglib. I'm thinking users might want to match those case insensitively too. Or perhaps even match entire rules or parenthesized groups case insensitively. So the |
My implementation only does it on literals, and I saw that other implementations (peggy/pegjs) also apply it to character classes, like:
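A small sketch of what a case-insensitive back-reference comparison could look like, assuming the two captured values are available as pointer/length pairs. `span_equal_nocase` is a hypothetical helper, not an existing chpeg or cpp-peglib function:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Compare two captured spans ignoring ASCII case.
 * Returns 1 when they are equal, 0 otherwise. */
int span_equal_nocase(const char *a, size_t a_len,
                      const char *b, size_t b_len)
{
    assert(a != NULL && b != NULL);
    if (a_len != b_len)
        return 0;
    for (size_t i = 0; i < a_len; i++) {
        /* Cast to unsigned char: tolower() is undefined for negative values. */
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}
```

With this, a start tag `DIV` would accept an end tag `div`. The folding follows the C locale and does not handle non-ASCII case pairs.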
In your example if
|
I think the dictionaries are useful because several grammars have constructions like the one shown below that consume a lot of VM loops.
|
I would recommend to read https://github.com/edubart/lpegrex/blob/main/README.md for some interesting implementation ideas. |
As references are implemented, you would need to mark the reference comparison as case-insensitive; references do a simple memcmp comparison to the captured value. Experiments seem to indicate cpp-peglib works that way too. I'm getting some weird errors on cpp-peglib playground with this:
|
I'm also getting an error with this:
Input:
Error:
|
That's the proper error. I get that consistently on the command line (lint) version, but the playground is giving me occasional corrupted bits like the above. |
I just opened an issue on |
I just implemented character class case insensitive on my fork of peg/leg here https://github.com/mingodad/peg also added several |
I just got an initial port of CocoR-CSharp to Typescript/Javascript working. Would you be interested in testing it or helping improve the Typescript port? See also here Lercher/CocoR#2 (comment) . |
Related to #1 (comment): I implemented it here https://github.com/mingodad/peg by doing a stack top replacement; maybe something like this can be done in chpeg too. In https://github.com/mingodad/peg/src/peg.peg: