Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement Ideas #1

Open
mingodad opened this issue May 7, 2022 · 68 comments
Open

Improvement Ideas #1

mingodad opened this issue May 7, 2022 · 68 comments
Assignees

Comments

@mingodad
Copy link
Contributor

mingodad commented May 7, 2022

Are you still around ?
Have you done some improvements to this project ?
I just found it and I'm impressed by it's simplicity and performance !
Thank you for your great work !

@ChrisHixon
Copy link
Owner

I haven't done any further work on this project. Do you have any suggestions for improvements you're interested in? Are you using this project somewhere? I know I had some ideas for it, but I believe most things I was thinking of were pretty radical changes that would require redoing the byte compiler - essentially a new project entirely. It's not very fresh in my memory and I haven't dug into relevant notes in a while.

@mingodad
Copy link
Contributor Author

mingodad commented May 8, 2022

Hello Chris !
Nice to hear that you're still around and thank you for reply !
I'm currently doing several changes to peg/leg from Piumarta here https://github.com/mingodad/peg, have done some changes to bison/yacc/lemon here https://github.com/mingodad/lalr-parser-test and also to CocoR here https://github.com/mingodad/CocoR-CPP.

But latelly I've been looking at PEG parsers due to it's easy of use/creation/evaluation (although we can make big mess as well like non typed scripting languages that is prevented by parsers generators that do several sanity checks) and found this one https://github.com/yhirose/cpp-peglib and it's online playground https://yhirose.github.io/cpp-peglib that has good performance and having the playground make even easier to develop/check a grammar.

Other parser generators with online playgrounds:

And I was looking before at LPeg from Lua to embed in other projects but it's too specific for Lua.

Your project is small/simple and has good performance low resource usage produces an AST/Parser tree and also can evaluate at runtime, I think that having a online playground like the cpp-peglib that is done compiling with emscripten https://emscripten.org/ would be nice.

I'm trying to extend it a bit to manage the grammar shown bellow and managed to add the easy part (expand escape sequences) but the case insensitive literals is taking a bit of time to figure out (see the change to the VM bellow).

I would like to know what are your new ideas and maybe I'm willing to cooperate/help implement then.

Another intriguing peg parser from a Japanese group is here https://github.com/nez-peg/nez .

The cpp-peglib is nice but is heavy C++ and takes a while to build, but so far a very good balance of features.

# Extended PEG grammar for chpeg bytecode parser/compiler
#
# The effective grammar is basically the same as the original PEG grammar from
# Bryan Ford's paper, but with Options added in front of <- to guide parse tree building.

# Hierarchical syntax
Grammar     {S} <- S Definition+ !.
Definition  { } <- Identifier S ('{' S Options S '}' S)? '<-' S Choice
Choice      { } <- Sequence ('/' S Sequence)*
Sequence    { } <- Predicate+
Predicate   { } <- (PredOp S)? Repeat
Repeat      { } <- Primary (RepOp S)?
Primary     { } <- Identifier S ![{<]
                 / '(' S Choice ')' S
                 / Literal S / CharClass S / Dot S

# Lexical syntax
Options     { } <- [ILS]*
Identifier  {L} <- [a-zA-Z_][a-zA-Z_0-9]*
Literal     {S} <- ( LiteralSQ / LiteralDQ ) NoCase?
LiteralSQ   {S} <- ['] (!['] Char)* [']
LiteralDQ   {S} <- ["] (!["] Char)* ["]
CharClass   {S} <- '[' (!']' CharRange)+ ']'
CharRange   { } <- Char '-' !']' Char / Char
Char        { } <- EscChar / OctChar / PlainChar
EscChar     {L} <- '\\' [abefnrtv'"\[\]\\]
OctChar     {L} <- '\\' [0-3][0-7][0-7]
                 / '\\' [0-7][0-7]?
PlainChar   {L} <- !'\\' .
PredOp      {L} <- [&!]
RepOp       {L} <- [*+?]
Dot         {L} <- '.'
NoCase      {L} <- 'i' !Identifier
EndLine     {L} <- '\r\n' / '\n\r' / '\n' / '\r'
Comment     {L} <- '#' (!EndLine .)* EndLine
Space       {L} <- [ \t\r\n\f\v]
S           {I} <- (Space+ / Comment)*

VM changes to add LIT_NC:

// Literal
            case LIT: // arg = str_id; match literal string; skip next instruction on match
	    case LIT_NC:
                {
                    int len = self->str_len[arg];
                    if ((offset < (size - (len - 1))) && !(
			    (op == LIT) ? memcmp(&input[offset], self->strings[arg], len)
					: strncasecmp(&input[offset], self->strings[arg], len))) {
                        offset += len;
                        ++pc;
                        break;
                    }
                }

@mingodad
Copy link
Contributor Author

mingodad commented May 8, 2022

Also you parser/vm seem to manage left recursion judging by one test https://github.com/ChrisHixon/chpeg/blob/master/test/grammars/left_recursion.peg but I didn't tried it with a real grammar so far.

@mingodad
Copy link
Contributor Author

mingodad commented May 8, 2022

I did one test that made me seriously consider your project with a SQL grammar converted from pegjs and I was impressed by the speed and low resource usage (see attached) and that's why I was trying to add case insensitive literals like "select"i.
sql-parser.zip

@ChrisHixon
Copy link
Owner

That all looks very interesting. I will check this out after I refresh myself on the project a bit. I'm a little lost trying to figure out how all the pieces fit together at the moment. The example in this project does not compile...

% make
mkdir obj
mkdir chpeg
gcc -c -Wall -std=c99 ../src/peg_cbyte_parser.c -o chpeg/peg_cbyte_parser.o
gcc -c -Wall -std=c99 ../src/peg_cbyte_compiler.c -o chpeg/peg_cbyte_compiler.o
gcc -c -Wall -std=c99 -I../src peg_parse_file.c -o obj/peg_parse_file.o
gcc  obj/peg_parse_file.o chpeg/peg_cbyte_parser.o chpeg/peg_cbyte_compiler.o -o peg_parse_file
/usr/bin/ld: chpeg/peg_cbyte_parser.o:(.data.rel.local+0x0): multiple definition of `peg_byte_code'; obj/peg_parse_file.o:(.bss+0x0): first defined here
/usr/bin/ld: chpeg/peg_cbyte_compiler.o:(.bss+0x0): multiple definition of `peg_byte_code'; obj/peg_parse_file.o:(.bss+0x0): first defined here
collect2: error: ld returned 1 exit status
make: *** [Makefile:30: peg_parse_file] Error 1

I didn't really look far into this. Any ideas - since you've been using the project? Do you have a fork somewhere with your changes/experiments?

@ChrisHixon
Copy link
Owner

This project does not support left recursion. It's meant to be simple and fast, leaving it up to the user to avoid left recursion. I believe it is designed to fail and return an error on left recursion, rather than crash.

@ChrisHixon ChrisHixon self-assigned this May 8, 2022
@mingodad
Copy link
Contributor Author

mingodad commented May 8, 2022

I just cloned this repository and moved to examples folder and executed make and I've got an executable:

git clone https://github.com/ChrisHixon/chpeg.git
cd chpeg/examples
make
./peg_parse_file ../grammars/peg.peg ../grammars/peg.peg
Grammar file compiled successfully. Parser returned: 1246
parse ok.
parse ok.
---------------------------------------------------------------------------------
 Begin    Len  DefID  Flags  Def. Name / Data
---------------------------------------------------------------------------------
     0   1246      0 |     | Grammar "# Hierarchical syntax\nGrammar     <-..."
    21      1     27 |     |   EndOfLine "\n"
    22     45      1 |     |   Definition "Grammar     <- Spacing Definition+ En..."
    22     12      7 |     |     Identifier "Grammar     "
    22      1      8 |     |       IdentStart "G"
    23      1      8 |     |       IdentStart "r"
    24      1      8 |     |       IdentStart "a"
    25      1      8 |     |       IdentStart "m"
...

@mingodad
Copy link
Contributor Author

mingodad commented May 8, 2022

I also merged your topdown project on bootstrap/peg_cbyte.rb to make easy to run without extra dependencies other than ruby see attached (with some small changes to try implement the case insensitive literals).
peg_cbyte.rb.zip

@ChrisHixon
Copy link
Owner

Thanks, I'll check it out.

I'm getting a similar error using clang. I'm thinking this is a bug that needs to be fixed in this code; perhaps some (older?) toolchains are ok with it. What versions of cc and ld are you using? I'm on Arch Linux (rolling):

% gcc --version
gcc (GCC) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

% ld --version  
GNU ld (GNU Binutils) 2.38
Copyright (C) 2022 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.

% clang --version
clang version 13.0.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

@mingodad
Copy link
Contributor Author

mingodad commented May 8, 2022

Ok I just found the problem you mention it's the declaration of ByteCode peg_byte_code; in src/peg_cbyte_parser.h it needs an extern in front of it (although it does build for me with gcc-9.4 on Ubuntu 18.04 64 bits but with clang I'm getting the same error as you).

git diff
diff --git a/src/peg_cbyte_parser.h b/src/peg_cbyte_parser.h
index efd3ba0..9ab2291 100644
--- a/src/peg_cbyte_parser.h
+++ b/src/peg_cbyte_parser.h
@@ -162,4 +162,4 @@ void Parser_print_error(Parser *self, const unsigned char *input);
 #define PEG_REPOP 17
 #define PEG_DOT 18
 #define PEG_S 19
-ByteCode peg_byte_code;
+extern ByteCode peg_byte_code;

@ChrisHixon
Copy link
Owner

Ah, yeah, that does it, thanks.

@ChrisHixon
Copy link
Owner

ChrisHixon commented May 8, 2022

I checked out the SQL PEG grammar along with the test SQL file. It's nice to see this code working with a large grammar for an established complex language. I have not tested much besides the grammars you see in the repo.

Did you have to modify the grammar to get it to work with chpeg? I notice a few {I} but nothing else chpeg specific. I imagine there could be issues with 'foreign' grammars (not meant for the way chpeg operates) when trying to build the AST you want. The biggest issue, I think, is getting the spacing out of the text nodes you want to capture. This spacing issue is kind of an annoyance with PEG in general. The SQL example looks pretty good in this area at quick glance.

One idea is to add a separate lex stage that generates tokens, and then the PEG stage deals with tokens rather than characters. White space could be removed at the lex stage. Of course this is no longer really PEG, though perhaps it could be done in a way that is PEG compatible.

Another idea is to cache (or add a way to load/save) the byte code so the grammar doesn't have to be compiled every time a program instance is run.

I'll have to give the case insensitive issue some thought and check out your code changes for that. Have you encountered other PEG implementations that have tacked on case insensitive matching? If so, how did they handle the PEG syntax extension? Is "string"i a standard? I imagine you could just use something like [Ss][Ee][Ll][Ee][Cc][Tt] but that's kind of ugly. I also wonder what the performance would be like on that.

@mingodad
Copy link
Contributor Author

mingodad commented May 8, 2022

Yes cpp-peglib, pegjs/peggy and pegged uses "string"i, pest uses ^"string" others use the way you've mention [Ss][Ee][Ll][Ee][Cc][Tt].
It's more of an exercise/excuse to dive in your code trying to understanding it.

The one that has interesting ideas is nez that has a symbol table, scope manager usable by the end user in a declarative way.
Another issue is error reporting that most of then struggle with.

@mingodad
Copy link
Contributor Author

mingodad commented May 8, 2022

The question about the grammar I'm trying to look for another parser generators that have usage and opensource projects and write a translator that reads the grammar and convert it to another format, like I did for bison/byacc/lemon and for peg using my fork of Piumarta here https://github.com/mingodad/peg I wrote a pegjs2peg.leg and use it as base.

# ../leg -o pegjs2peg.peg.c pegjs2peg.leg; gcc -g -o pegjs2peg pegjs2peg.peg.c; ./pegjs2peg arithmetics.pegjs
# ./pegjs2peg /home/mingo/dev/c/A_grammars/pegjs/src/parser.pegjs
# ./pegjs2peg /home/mingo/dev/c/A_grammars/pegjs/examples/css.pegjs
# ./pegjs2peg /home/mingo/dev/c/A_grammars/pegjs/examples/javascript.pegjs
# ./pegjs2peg /home/mingo/dev/c/A_grammars/pegjs/examples/json.pegjs
# ./pegjs2peg /home/mingo/dev/c/A_grammars/pegjs/examples/arithmetics.pegjs
# ./pegjs2peg arithmetics.pegjs
# ./pegjs2peg /home/mingo/dev/c/A_databases/dbml/packages/dbml-core/src/parse/dbml/parser.pegjs
# ./pegjs2peg /home/mingo/dev/c/A_programming-languages/JSCPP/pegjs/ast.pegjs
# ./pegjs2peg /home/mingo/dev/c/A_programming-languages/JSCPP/pegjs/prepast.pegjs
# ./pegjs2peg /home/mingo/dev/c/A_databases/sqlite-parser/src/grammar.pegjs
# ./pegjs2peg sqlite3-parser.pegjs
# ./pegjs2peg /home/mingo/dev/c/A_databases/sql-parser/src/grammar.pegjs
%{
#include <stdio.h>
#include <stdlib.h>

typedef struct _yycontext yycontext;
void printName(yycontext *ctx, const char *s);
void printDefined(yycontext *ctx, const char *s);
void printElmNSP(yycontext *ctx, const char *s);
void printElm(yycontext *ctx, const char *s);
void printAltSym(yycontext *ctx);
int igetchar(yycontext *ctx);
void synerr(yycontext *ctx, const char *errmsg);

#define getchar() igetchar(yy)
//#define YY_DEBUG
#define YY_CTX_LOCAL

#define YY_CTX_MEMBERS \
	int doOutput; \
	int insideParen; \
	FILE *ifp; \
	const char *ifname;

#define MYRULE_SEP	"<-"
#define MYRULE_END "\n\n"
#define MYALT_SEP	"/"

%}

Grammar =
	Spacing Action* Definition+ EndOfFile ~{synerr(yy, "unexpected syntax");}
	;
Spacing =
	( Space |  Comment )*
	;
Definition =
	Identifier ({yy->doOutput=0;} Literal {yy->doOutput=1;})? EQUAL {printf(" " MYRULE_SEP "\n\t");} Expression  SEMICOLON? {printf(MYRULE_END);}
	;
EndOfFile =
	!.
	;
Identifier =
	<IdentStart IdentCont*> Spacing  {printName(yy, yytext);}
	;
EQUAL =
	"=" Spacing
	;
Expression =
	Sequence ( SLASH {printAltSym(yy);} Sequence )*
	;
Sequence =
	error error*
	;
SLASH =
	"/" Spacing
	;
error =
	Prefix ( TILDE Action )?
	;
Prefix =
	Action | (AND | NOT) Action | (AND | NOT | AT | MATCH_TXT) Suffix |  Suffix
	;
TILDE =
	"~" Spacing
	;
Action =
	"{" Braces* "}" Spacing
	;
AT =
	"@" Spacing
	;
AND =
	"&" Spacing
	;
Suffix =
	Primary ( QUESTION | STAR | PLUS )?
	;
NOT =
	"!" {printf(" !");} Spacing
	;
Primary =
	{yy->doOutput=0;} Identifier COLON {yy->doOutput=1;} Prefix ! (Literal? EQUAL)
	| Identifier  ! (Literal? EQUAL)
	| OPEN Expression CLOSE
	| Literal( "i" {printf("i");} ! IdentStart Spacing )?
	| Class
	| DOT
	| Action
	| AstDirective
	;
QUESTION =
	"?" {printf("?");} Spacing
	;
STAR =
	"*" {printf("*");} Spacing
	;
PLUS =
	"+" {printf("+");} Spacing
	;
COLON =
	":"   {printElm(yy, ":"); ++yy->insideParen;} Spacing
	;
OPEN =
	"("  {printf(" ("); ++yy->insideParen;} Spacing
	;
CLOSE =
	")"  {printf(" )"); --yy->insideParen;} Spacing
	;
Literal =
	<['](! ['] Char)* [']> {printName(yy, yytext);} Spacing
	| <["](! ["] Char)* ["]> {printName(yy, yytext);} Spacing
	| <[`](! [`] Char)* [`]> {printName(yy, yytext);} Spacing
	;
IdentStart =
	 [a-zA-Z_]
	;
Class =
	< '[' (! "]" Range)* "]" 'i'?> {printName(yy, yytext);} Spacing
	;
DOT =
	"." {printf(" .");} Spacing
	;
SEMICOLON =
	";" Spacing
	;
BEGIN =
	"<" Spacing
	;
END =
	">" Spacing
	;
MATCH_TXT =
	"$" Spacing
	;
IdentCont =
	IdentStart |  Digit
	;
Char =
	"\\" [abefnrtv'"`\[\]\\]
	| "\\" [0-3] OctalDigit OctalDigit
	| "\\" OctalDigit OctalDigit?
	| "\\x" HexDigit HexDigit
	| "\\u" HexDigit HexDigit HexDigit HexDigit
	| "\\" "-"
	| ! "\\" .
	;
Digit =
	[0-9]
	;
OctalDigit =
	[0-7]
	;
HexDigit =
	[0-9a-fA-F]
	;
Range =
	Char "-" ! ']' Char
	|  Char
	;
Space =
	" " |  "\t" |  EndOfLine
	;
Comment =
	"//"(! EndOfLine.)* EndOfLine
	| "/*" (! '*/' .)* "*/"
	;
EndOfLine =
	"\r\n" |  "\n\r" |  "\n" |  "\r"
	;

AstDirective =
	BEGIN (! END .)* END
	;
Braces =
	"{" Braces* "}"
	|  LiteralBraces
	| ! "}" ( EndOfLine | .)
	;
LiteralBraces =
	 ( ['] ( ! ['] Char )* ['] )
	| ( ["] ( ! ["] Char )* ["] )
	| ( [`] ( LiteralEmbedded | ! [`] Char )* [`] )
	| ('/' (! '/' .)* '/')

LiteralEmbedded =
    '${' (LiteralEmbedded | ! '}' .)* '}'

%%

int igetchar(yycontext *ctx)
{
	return fgetc(ctx->ifp);
}

void printName(yycontext *ctx, const char *s) {
	if(ctx->doOutput) {
		printf(" %s", s);
		/*
		size_t len = strlen(s);
		fputc(' ', stdout);
		for(size_t i=0; i < len; ++i) fputc(s[i] == '-' ? '_' : s[i], stdout);
		fputc(' ', stdout);
		*/
	}
}
void printDefined(yycontext *ctx, const char *s) {
	//printf("===printDefined: %d : %s\n", ctx->doOutput, s);
	#define MYSTRNCMP(ps, ls) (strncmp(ps, ls, sizeof(ls)-1))
	if(ctx->doOutput) {
		if(MYSTRNCMP(s, "nl") == 0) printf(" '\\n' ");
		else if(MYSTRNCMP(s, "cn") == 0) printf(" '\\n' ");
		else if(MYSTRNCMP(s, "cr") == 0) printf(" '\\r' ");
		else if(MYSTRNCMP(s, "s") == 0)
			printf(" (' ' " MYALT_SEP " '\\t' " MYALT_SEP " '\\r' " MYALT_SEP " '\\n') ");
	}
}
void printElmNSP(yycontext *ctx, const char *s) {
	if(ctx->doOutput) printf("%s", s);
}
void printElm(yycontext *ctx, const char *s) {
	if(ctx->doOutput) printf(" %s ", s);
}
void printAltSym(yycontext *ctx) {
	if(ctx->doOutput) {
		printf("%s" MYALT_SEP, ctx->insideParen ? " " : "\n\t");
	}
}

void synerr(yycontext *ctx, const char *errmsg)
{
	fprintf(stderr, "%s:%d:%d %s\n", ctx->ifname, ctx->__lineno, ctx->__inputpos-ctx->__linenopos, errmsg);
}

int main(int argc, char *argv[])
{
  int rc;
  /* create a local parser context in automatic storage */
  yycontext yy;
  /* the context *must* be initialised to zero before first use*/
  memset(&yy, 0, sizeof(yy));

  if(argc < 2) {
	printf("usage: %s mouse_filename\n", argv[0]);
	return -1;
  }
  yy.ifp = fopen(argv[1], "r");
  if(yy.ifp == NULL) {
     printf("failed to open: %s\n", argv[1]);
     return -2;
  }
  yy.ifname = argv[1];
  yy.__lineno = 1;
  yy.doOutput = 1;
  /*while ( (*/rc = yyparse(&yy); /*) )*/  printf("yy %d\n", rc);

  /* release all resources associated with the context */
  yyrelease(&yy);

  fclose(yy.ifp);

  return 0;
}

@mingodad
Copy link
Contributor Author

mingodad commented May 8, 2022

Here is an example of a C99 parser that is not working properly on chpeg but does work on cpp-peglib playgorund https://yhirose.github.io/cpp-peglib.

On cpp-peglib playground copy and paste c99-mouse-cpp-peglib.peg in the Grammar and c99-mouse.peg.pp.c on Source Code.

 c99-mouse-cpp-peglib.peg # is the same as c99-mouse.peg with some `~` Ignore flags
 c99-mouse.peg
 c99-mouse.peg.pp.c

With chpeg:

chpeg c99-mouse.peg c99-mouse.peg.pp.c 
parse ok.
Grammar file compiled successfully. Parser returned: 19382
parse failed with result: -1
Expected SEMI in Declaration at offset 3092

c99-mouse.zip

@ChrisHixon
Copy link
Owner

I opened new issues for a couple things discussed here that need to be fixed (or investigated as potential bugs). Feel free to open more as necessary, and we can talk about misc. improvements, changes, etc. on this issue. Also, we can open more issues for distinct improvement ideas, like say, case insensitivity.

@mingodad
Copy link
Contributor Author

Here is a first crude attempt to create a playground for chpeg reusing the one from https://github.com/yhirose/cpp-peglib/tree/master/docs, create a new folder and decompress it there, I needed to add include guards in the header files from chpeg, there is a nginx config file to launch it (or any local webserver), no error messages other than Valid/Invalid test with chpeg/grammars/peg.peg in both Grammar and Source Code.

playground.zip

@mingodad
Copy link
Contributor Author

Probably in the future we'll need to add a prefix to all public symbols to avoid name clash with other code/libraries because some enums have short common names and Node, Parser, ByteCode seems to be a common name used elsewhere.

@ChrisHixon
Copy link
Owner

Moving this here to the improvements issue.

I think that these could have clearer names or explanations:

// rule/node flags AKA options
enum Flags {
    STOP = 1<<0,    // used - stop automatic unwrapping, forcing this node to be a container
    IGNORE = 1<<1,  // used - deletes nodes matching this identifier
    LEAF = 1<<2,    // used - collects this node and anything underneath as a final leaf (text) node
};

STOP is really more like "CAPTURE", it forces that node to capture (contain what is under it) and not be eliminate by unwrap.
IGNORE is ok I suppose
LEAF is more like "TERMINAL" or "TEXT"? No nodes will exist under it and it really only "contains" the text - really the match offset and length.

Unwrap is probably unclear also, but it seems similar to what https://yhirose.github.io/cpp-peglib/ does as 'optimize' AST. I'm not sure of the correct or most often used terminology for this concept. Unwrap eliminates intermediate nodes that only contain one child node. A->B->C->D reduced to D.
Any thoughts on this since you've been experimenting with so many PEG libs?

@ChrisHixon ChrisHixon changed the title Any new improvement for this project ? Improvement Ideas May 11, 2022
@mingodad
Copy link
Contributor Author

mingodad commented May 11, 2022

I'm do not have too much work done with AST creation and at first your way was simpler to digest (only 3 options, although in topdown there was some more [ILUSE]).

For example lpeg/lpegrex (https://github.com/edubart/lpegrex) have so many options that I've got lost, in cpp-peglib I just have used ~ ignore that's like {I} and <> that's like {L}, but there is also https://github.com/imasahiro/nez that adds SymbolTable and ScopeManagement to help with context sensitive grammars.

For myself I think that the best way to find out how or if something is useful is to dogfood it like we did here with the sql and the c99 grammar, also that's why I mentioned the next challenge would be the https://github.com/edubart/lpegrex/blob/main/parsers/c11.lua because it's a context sensitive grammar and we need to manage scope and symbol table, dogfooding it will bring up interesting things.

In resuming a good working collection of non trivial grammars in my opinion is the best showcase for a parser generator.

And there is a large pool of grammars done for https://tree-sitter.github.io/tree-sitter/ with a guacamole javascript description to generate a json that the parser generator uses to create the parser and I think that a more structured declarative way ebnf like would be a big improvement, I already did some work trying to tap on that pool https://github.com/mingodad/plgh/blob/main/json2ebnf.js and tree-sitter/tree-sitter#1013 and probably will try again with chpeg with a custom grammar.

@ChrisHixon
Copy link
Owner

I'm aiming for a sort of minimal approach to everything with chpeg. I forget what those [ILUSE] options are in top_down, but I probably eliminated some useless/redundant stuff. For chpeg I came up with what seemed to be the bare minimum necessary to generate the AST you are interested in. Once you start using the AST output for some purpose you start to figure out the AST you want to make your life easier or make things clearer/simpler - what you want to capture at what level for your use case.

It seems like the extensions get out of hand with a lot of the parsers. I'd like to come up with a simple way to do the kinds of things you're talking about.

@mingodad
Copy link
Contributor Author

Another improved try to generate cbytecode from C if still misses the inner comments in the VM instructions:

void ByteCode_write_cbytecode(FILE *fp, const ByteCode *bc)
{
    const int max_line_len = 80;
    fprintf(fp, "int peg_num_defs = %d;\n", bc->num_defs);
    fprintf(fp, "char *peg_def_names[] = {\n\t");
    size_t line_len = 0;
    for(int i=0; i < bc->num_defs; ++i) {
        fprintf(fp, "%s\"%s\"", i ? "," : "", bc->def_names[i]);
        line_len += strlen(bc->def_names[i])+3;
        if(line_len > max_line_len) {
            line_len = 0;
            fprintf(fp, "\n\t");
        }
    }
    fprintf(fp, "\n};\n");
    fprintf(fp, "int peg_def_flags[] = {\n\t");
    line_len = 0;
    for(int i=0; i < bc->num_defs; ++i) {
        const char *str;
        switch(bc->def_flags[i]) {
            case STOP: str = "STOP"; break;
            case IGNORE: str = "IGNORE"; break;
            case LEAF: str = "LEAF"; break;
            default:
                str = "0";
        }
        fprintf(fp, "%s%s", i ? "," : "", str);
        line_len += strlen(str)+2;
        if(line_len > max_line_len) {
            line_len = 0;
            fprintf(fp, "\n\t");
        }
    }
    fprintf(fp, "\n};\n");
    fprintf(fp, "int peg_def_addrs[] = { // presubtracted by 1\n\t");
    line_len = 0;
    for(int i=0; i < bc->num_defs; ++i) {
        fprintf(fp, "%s%d", i ? "," : "", bc->def_addrs[i]);
        if(++line_len > 10) {
            line_len = 0;
            fprintf(fp, "\n\t");
        }
    }
    fprintf(fp, "\n};\n");
    fprintf(fp, "int peg_num_strings = %d;\n", bc->num_strings);
    fprintf(fp, "char *peg_strings[] = {\n\t");
    line_len = 0;
    for(int i=0; i < bc->num_strings; ++i) {
        char *esc_str = esc_string(bc->strings[i], bc->str_len[i], 0);
        fprintf(fp, "%s\"%s\"", i ? "," : "", esc_str);
        free(esc_str);
        line_len += bc->str_len[i]+3;
        if(line_len > max_line_len) {
            line_len = 0;
            fprintf(fp, "\n\t");
        }
    }
    fprintf(fp, "\n};\n");
    fprintf(fp, "int peg_str_len[] = {\n\t");
    line_len = 0;
    for(int i=0; i < bc->num_strings; ++i) {
        fprintf(fp, "%s%d", i ? "," : "", bc->str_len[i]);
        if(++line_len > 20) {
            line_len = 0;
            fprintf(fp, "\n\t");
        }
    }
    fprintf(fp, "\n};\n");
    fprintf(fp, "int peg_num_instructions = %d;\n", bc->num_instructions);
    fprintf(fp, "int peg_instructions[] = {\n");
    for(int i=0; i < bc->num_instructions; ++i) {
        int op = bc->instructions[i] & 0xff;
        int arg = bc->instructions[i] >> 8;
        fprintf(fp, "\t/*%d*/ INST(%s, %d),\n", i, OpNames[op], arg);
    }
    fprintf(fp, "};\n\n");
    for(int i=0; i < bc->num_defs; ++i) {
        const char *def_name = bc->def_names[i];
        fprintf(fp, "#define PEG_");
        for(int j=0, jmax=strlen(def_name); j < jmax; ++j)
            fputc(toupper(def_name[j]), fp);
        fprintf(fp, " %d\n", i);
    }
    fprintf(fp, "\n");
}

Output:

int peg_num_defs = 20;
char *peg_def_names[] = {
	"Grammar","Definition","Choice","Sequence","Predicate","Repeat","Primary","Options"
	,"Identifier","Literal","CharClass","CharRange","Char","EscChar","OctChar","PlainChar"
	,"PredOp","RepOp","Dot","S"
};
int peg_def_flags[] = {
	STOP,0,0,0,0,0,0,0,LEAF,STOP,STOP,0,0,LEAF,LEAF,LEAF,LEAF,LEAF
	,LEAF,IGNORE
};
int peg_def_addrs[] = { // presubtracted by 1
	2,15,42,55,62,73,84,129,136,145,181
	,197,218,234,240,265,273,277,281,285
};
int peg_num_strings = 25;
char *peg_strings[] = {
	"{","}","<-","/","{<","(",")","ILS","a-zA-Z_","a-zA-Z_0-9","'","\"","[","]","-","\\"
	,"nrt'\"[]\\","0-3","0-7","&!","*+?","."," \t\r\n","#","\r\n"
};
int peg_str_len[] = {
	1,1,2,1,2,1,1,3,7,10,1,1,1,1,1,1,8,3,3,2,3
	,1,4,1,2
};
int peg_num_instructions = 315;
int peg_instructions[] = {
	/*0*/ INST(IDENT, 0),
	/*1*/ INST(FAIL, 0),
...
	/*313*/ INST(ISUCC, 19),
	/*314*/ INST(IFAIL, 0),
};

#define PEG_GRAMMAR 0
#define PEG_DEFINITION 1
#define PEG_CHOICE 2
#define PEG_SEQUENCE 3
#define PEG_PREDICATE 4
#define PEG_REPEAT 5
#define PEG_PRIMARY 6
#define PEG_OPTIONS 7
#define PEG_IDENTIFIER 8
#define PEG_LITERAL 9
#define PEG_CHARCLASS 10
#define PEG_CHARRANGE 11
#define PEG_CHAR 12
#define PEG_ESCCHAR 13
#define PEG_OCTCHAR 14
#define PEG_PLAINCHAR 15
#define PEG_PREDOP 16
#define PEG_REPOP 17
#define PEG_DOT 18
#define PEG_S 19

@ChrisHixon
Copy link
Owner

I created a new issue related to eliminating the Ruby bootstrap and going with separate bytecode .h files. Your code will be a good start.

@mingodad
Copy link
Contributor Author

Would be nice to add hexadecimal escape sequences and in future unicode escape sequences:

Char        { } <- EscChar / HexChar / OctChar / PlainChar
EscChar     {L} <- '\\' [abefnrtv'"\[\]\\]
OctChar     {L} <- '\\' [0-3][0-7][0-7]
                 / '\\' [0-7][0-7]?
HexChar     {L} <- '\\x' [0-9a-fA-F][0-9a-fA-F]?
PlainChar   {L} <- !'\\' .

@ChrisHixon
Copy link
Owner

I'm still investigating but it seems that as it is now there is no way to have this character class [^].

Yes I thought of that, but why make a special case for that when user can use '^' or "^"?

@mingodad
Copy link
Contributor Author

Only to avoid confuse the user !

@ChrisHixon
Copy link
Owner

ChrisHixon commented Jun 25, 2022

I'm still investigating but it seems that as it is now there is no way to have this character class [^].

Yes I thought of that, but why make a special case for that when user can use '^' or "^"?

Unfortunately the negated char class extension breaks PEG compatibility, meaning valid PEG can fail or be interpreted differently. Note that cpp-peglib flags [^] as syntax error.

@mingodad
Copy link
Contributor Author

This tool is awesome to simplify grammars https://www.bottlecaps.de/rr/ui , when possible I always add an option to output equivalent EBNF understood by https://www.bottlecaps.de/rr/ui to help develop/debug grammars see https://github.com/mingodad/lalr-parser-test , https://github.com/mingodad/CocoR-CPP, https://github.com/mingodad/CocoR-CSharp, https://github.com/mingodad/CocoR-Java and https://github.com/mingodad/peg/blob/cf6630fb15ca3ff430d36b28e72a244bed2b9f4e/src/tree.c#L344 .
Copy and paste these grammars on two different tabs at https://www.bottlecaps.de/rr/ui:

MinMax  ::= MinMaxVal S Comma S MinMaxVal
           | MinMaxVal S Comma S
           | Comma S MinMaxVal
           | MinMaxVal

And the simplified one:

MinMax  ::= MinMaxVal (S Comma S MinMaxVal?)?
           | Comma S MinMaxVal

@mingodad
Copy link
Contributor Author

With my alternative grammar shown here #1 (comment) this grammar seems to parse properly (and indeed cpp-peglib doesn't accept it, bug?):

start <- name1 / name2 
name1 <- [^][a-z]
name2 <- [^2][a-z]

AST:

OFFSET   LEN     ID NC FLG IDENT "DATA"
     0     60      0  3 R-S Grammar "start <- name1 / name2 \nname1 <- [^]..."
     0     24      1  2 ---   Definition "start <- name1 / name2 \n"
     0      5     14  0 --L     Identifier "start"
     9     15      2  2 ---     Choice "name1 / name2 \n"
     9      5     14  0 --L       Identifier "name1"
    17      5     14  0 --L       Identifier "name2"
    24     18      1  2 ---   Definition "name1 <- [^][a-z]\n"
    24      5     14  0 --L     Identifier "name1"
    33      9      3  2 ---     Sequence "[^][a-z]\n"
    33      3     17  1 --S       CharClass "[^]"
    34      1     23  0 --L         PlainChar "^"
    36      5     17  1 --S       CharClass "[a-z]"
    37      3     18  2 ---         CharRange "a-z"
    37      1     23  0 --L           PlainChar "a"
    39      1     23  0 --L           PlainChar "z"
    42     18      1  2 ---   Definition "name2 <- [^2][a-z]"
    42      5     14  0 --L     Identifier "name2"
    51      9      3  2 ---     Sequence "[^2][a-z]"
    51      4     16  1 --S       NCharClass "[^2]"
    53      1     23  0 --L         PlainChar "2"
    55      5     17  1 --S       CharClass "[a-z]"
    56      3     18  2 ---         CharRange "a-z"
    56      1     23  0 --L           PlainChar "a"
    58      1     23  0 --L           PlainChar "z"

@ChrisHixon
Copy link
Owner

Nice... If you want to give that a shot with chpeg (the utility), you could add an action to output EBNF (and/or actions for other formats); see util/src, util/src/actions. This would be a matter of grabbing the parse from -p/-P and then walking the AST.

@ChrisHixon
Copy link
Owner

With my alternative grammar shown here #1 (comment) this grammar seems to parse properly (and indeed cpp-peglib doesn't accept it, bug?):

start <- name1 / name2 
name1 <- [^][a-z]
name2 <- [^2][a-z]

Yes, it looks like you removed the !'^' in CharClass. Which I think was only necessary when CharClass is listed first in the Primary choices. I'll consider it but it needs a little more testing, does it work as expected in all cases or have any side effects?

@mingodad
Copy link
Contributor Author

As I said before it's almost done here https://github.com/mingodad/chpeg/blob/696339f5802834a49e56371a255e3aeae487b7af/src/bytecode.c#L503 with an extra parameter to decide when to use <- and / or ::= and | and probably ommit/convert other minor deviations.

@ChrisHixon
Copy link
Owner

As I said before it's almost done here https://github.com/mingodad/chpeg/blob/696339f5802834a49e56371a255e3aeae487b7af/src/bytecode.c#L503 with an extra parameter to decide when to use <- and / or ::= and | and probably ommit/convert other minor deviations.

Right, but that's not the approach I'd take or suggest, the AST from a parse is much easier to deal with than reversing the bytecode, which I think could be problematic, especially if changes are made later to bytecode/parser. It might map pretty well to the AST now but that could change later if it's ever reorganized or optimizations added. It maps pretty well now because it's a pretty basic/naive translation to bytecode.

@mingodad
Copy link
Contributor Author

I agree with you, but honestly I'm having trouble to work with the AST, first because it's not easy to get my desired AST shape and because the same problem can happen when the way AST is constructed change and any existing working code need to be reviewd (AST simplification 0/1/2/?).

But I would suggest to do it and then you'll be dog fooding chpeg and will have something to ponder/test when making changes that break it.

@mingodad
Copy link
Contributor Author

mingodad commented Jun 25, 2022

Right now chpeg, cpp-peglib, peggy/pegjs/, pest are nice to develop/test a grammar but when it comes to do something with the grammars old school parsers with embedded actions are easier to work in several scenarios.

@mingodad
Copy link
Contributor Author

For example here is an lepegrex to peg conversion program in leg easily convertible to peg to:

# ../leg -o lpegrex2peg.leg.c lpegrex2peg.leg; gcc -g -o lpegrex2peg lpegrex2peg.leg.c; ./lpegrex2peg c11-lpegrex.lpeg
%{
#include <stdio.h>
#include <stdlib.h>

typedef struct _yycontext yycontext;
void printName(yycontext *ctx, const char *s);
void printDefined(yycontext *ctx, const char *s);
void printElmNSP(yycontext *ctx, const char *s);
void printElm(yycontext *ctx, const char *s);
void printAltSym(yycontext *ctx);
int igetchar(yycontext *ctx);

#define getchar() igetchar(yy)
//#define YY_DEBUG
#define YY_CTX_LOCAL

#define YY_CTX_MEMBERS \
	int doOutput; \
	int insideParen; \
	FILE *ifp; \
	const char *ifname;

#define MYRULE_SEP	"<-"
#define MYRULE_END "\t"
#define MYALT_SEP	"/"

%}

pattern =
	S exp ! . { printf("\n\n"); }
	;
S =
	( Space | Comment )*
	;
exp =
	grammar
	| alternative
	;
grammar =
	definition+
	;
alternative =
	seq ( "/" S { printAltSym(yy); } seq )*
	;
seq =
	prefix*
	;
prefix =
	< "&" > { printElm(yy, yytext); } S prefix
	|  < "!" > { printElm(yy, yytext); } S prefix
	| suffix
	;
suffix =
	primary S (
		(
			< [+*?] > { printf(" %s ", yytext); } S
			| "^" [+\-]? num
			| "->" { yy->doOutput=0; } S ( string | ( "{}" S ) | name | num ) { yy->doOutput=1; }
			| "=>" { yy->doOutput=0; } S name
			) S
		)* { yy->doOutput=1; }
	;
primary =
	"(" { printf(" ( "); ++yy->insideParen; } S exp ")" { printf(" ) "); --yy->insideParen; } S
	| ( string | keyword )
	| class
	| defined
	| "{:" ( name ":" S )? exp ":}" S
	| "=" name
	| "@" exp
	| "{}" S
	| "{~" S exp "~}" S
	| "{|" S exp "|}" S
	| "{" S exp "}" S
	| "~?" { printf(" ? "); } S
	| "~>" S ( "foldleft" | "foldright" | "rfoldleft" | "rfoldright" ) S
	| "$" ( string | name | num )
	| < "." > S {  printElm(yy, yytext); }
	| name S ! ( asttag | arrow )
	| "<" S name ">" S
	| { yy->doOutput = 0; } "^" name { yy->doOutput = 1; }
	| ( "%{" S name "}" S )
	;
num =
	[0-9]+ S
	;
string =
	< '"' [^"]* '"' > { printElm(yy, yytext); }
	| < "'" [^']* "'" > S { printElm(yy, yytext); }
	;
name =
	< [A-Za-z_] ( [A-Za-z0-9_] | ( "-" ! ">" ) )* > S {  printName(yy, yytext); }
	;
keyword =
	"`" < [^`]+ > "`" S {  printf(" ( '%s' SKIP ) ", yytext); }
	;
class =
	< "[" > { printElmNSP(yy, yytext); } item ( ! "]" item )* < "]" > {  printElmNSP(yy, yytext); } S
	;
defined =
	{ yy->doOutput = 0; } < "%" name > { yy->doOutput = 1;printDefined(yy, yytext); }
	;
asttag =
	{ yy->doOutput = 0; } ":" S name { yy->doOutput = 1; }
	;
arrow =
	( "<--" | "<==" | "<-|" | "<-" ) S
	;
definition =
	{ printf("\n\n"); } name { printf(" " MYRULE_SEP "\n\t"); } asttag? arrow exp { printf(MYRULE_END); }
	;
item =
	defined
	| range
	| ( < . > {  printElmNSP(yy, yytext); } )
	;
range =
	< . "-" [^\]] > { printElmNSP(yy, yytext); }
	;
Space =
	" " | "\t" | EndOfLine
	;
Comment =
	"--" ( ! EndOfLine . )* EndOfLine
	;
EndOfLine =
	"\r\n" | "\n" | "\r"
	;

%%

int igetchar(yycontext *ctx)
{
	return fgetc(ctx->ifp);
}

void printName(yycontext *ctx, const char *s) {
	if(ctx->doOutput) {
		size_t len = strlen(s);
		fputc(' ', stdout);
		for(size_t i=0; i < len; ++i) fputc(s[i] == '-' ? '_' : s[i], stdout);
		fputc(' ', stdout);
	}
}
void printDefined(yycontext *ctx, const char *s) {
	//printf("===printDefined: %d : %s\n", ctx->doOutput, s);
	#define MYSTRNCMP(ps, ls) (strncmp(ps, ls, sizeof(ls)-1))
	if(ctx->doOutput) {
		if(MYSTRNCMP(s, "nl") == 0) printf(" '\\n' ");
		else if(MYSTRNCMP(s, "cn") == 0) printf(" '\\n' ");
		else if(MYSTRNCMP(s, "cr") == 0) printf(" '\\r' ");
		else if(MYSTRNCMP(s, "s") == 0)
			printf(" (' ' " MYALT_SEP " '\\t' " MYALT_SEP " '\\r' " MYALT_SEP " '\\n') ");
	}
}
void printElmNSP(yycontext *ctx, const char *s) {
	if(ctx->doOutput) printf("%s", s);
}
void printElm(yycontext *ctx, const char *s) {
	if(ctx->doOutput) printf(" %s ", s);
}
void printAltSym(yycontext *ctx) {
	if(ctx->doOutput) {
		printf("%s" MYALT_SEP " ", ctx->insideParen ? " " : "\n\t");
	}
}


int main(int argc, char *argv[])
{
  int rc;
  /* create a local parser context in automatic storage */
  yycontext yy;
  /* the context *must* be initialised to zero before first use*/
  memset(&yy, 0, sizeof(yy));

  if(argc < 2) {
	printf("usage: %s lpegrex_filename\n", argv[0]);
	return -1;
  }
  yy.ifp = fopen(argv[1], "r");
  if(yy.ifp == NULL) {
     printf("failed to open: %s\n", argv[1]);
     return -2;
  }
  yy.ifname = argv[1];
  yy.__lineno = 1;
  yy.doOutput = 1;
  /*while ( (*/rc = yyparse(&yy); /*) )*/  printf("yy %d\n", rc);

  #ifdef YY_RULES_PROFILE
    yyShowRulesProfile(&yy, stdout);
  #endif

  /* release all resources associated with the context */
  yyrelease(&yy);

  fclose(yy.ifp);

  return 0;
}

@mingodad
Copy link
Contributor Author

And here is a pest to peg in leg:

# ../leg -o pest2peg.peg.c pest2peg.leg; gcc -g -o pest2peg pest2peg.peg.c; ./pest2peg pest.pest
%{
#include <stdio.h>
#include <stdlib.h>

typedef struct _yycontext yycontext;
void printName(yycontext *ctx, const char *s);
void printDefined(yycontext *ctx, const char *s);
void printElmNSP(yycontext *ctx, const char *s);
void printElm(yycontext *ctx, const char *s);
void printAltSym(yycontext *ctx);
int igetchar(yycontext *ctx);
void synerr(yycontext *ctx, const char *errmsg);

#define getchar() igetchar(yy)
//#define YY_DEBUG
#define YY_CTX_LOCAL
//#define YYSTYPE char*

#define YY_CTX_MEMBERS \
	int doOutput; \
	int insideParen; \
	FILE *ifp; \
	const char *ifname;

//#define TO_EBNF
#ifdef TO_EBNF
#define MYRULE_SEP	"::="
#define MYRULE_END "\n\n"
#define MYALT_SEP	"|"
#define MYNOT	"_NOT_"
#define MYAND	"_AND_"
#else
#define MYRULE_SEP	"<-"
#define MYRULE_END "\n\n"
#define MYALT_SEP	"/"
#define MYNOT	"!"
#define MYAND	"&"
#endif

%}

Grammar =
	Spacing Definition+ EndOfFile ~{synerr(yy, "unexpected syntax");} {
#ifdef TO_EBNF
	printf("\n//Added tokens for railroad generation"
		"\nSOI ::= (WHITESPACE | COMMENT)*"
		"\nEOI ::= _NOT_ ."
		"\n_NOT_ ::= '!'"
		"\n_AND_ ::= '&'\n");
#endif
	}
	;
Spacing =
	( Space |  Comment )*
	;
Definition =
	Identifier  EQUAL {printf(" " MYRULE_SEP "\n\t");} modifier? opening_brace Expression closing_brace {printf(MYRULE_END);}
	;
EndOfFile =
	!.
	;
Identifier =
	!PUSH <IdentStart IdentCont*> Spacing  {printName(yy, yytext);}
	;
EQUAL =
	"=" Spacing
	;
Expression =
	Sequence ( SLASH {printAltSym(yy);} Sequence )*
	;
Sequence =
	Prefix (TILDE Prefix)*
	;
SLASH =
	"|" Spacing
	;
Prefix =
	(AND | NOT)? Suffix
	;
TILDE =
	"~" Spacing
	;
AND =
	"&" {printf(MYAND);} Spacing
	;
Suffix =
	Primary (
	    QUESTION
	    | STAR
	    | PLUS
	    | repeat_exact
	    | repeat_min
	    | repeat_max
	    | repeat_min_max
	    )?
	;
NOT =
	"!" {printf(MYNOT);} Spacing
	;
Primary =
	{yy->doOutput=0;} Identifier COLON {yy->doOutput=1;} Prefix ! (Literal? EQUAL)
	| DOT
	| Identifier  ! (Literal? EQUAL)
	| PUSH? OPEN Expression CLOSE
	| Range
	| insensitive_operator? Literal
	| peek_slice
	;
QUESTION =
	"?" {printf("?");} Spacing
	;
STAR =
	"*" {printf("*");} Spacing
	;
PLUS =
	"+" {printf("+");} Spacing
	;
COLON =
	":"   {printElm(yy, ":"); ++yy->insideParen;} Spacing
	;
OPEN =
	"("  {printf(" ("); ++yy->insideParen;} Spacing
	;
CLOSE =
	")"  {printf(" )"); --yy->insideParen;} Spacing
	;
Literal =
	<['](! ['] Char)* [']> {printName(yy, yytext);} Spacing
	| <["](! ["] Char)* ["]> {printName(yy, yytext);} Spacing
	| <[`](! [`] Char)* [`]> {printName(yy, yytext);} Spacing
	;
LiteralUnquotes =
	[']<(! ['] Char)*> ['] {printElmNSP(yy, yytext);} Spacing
	| ["]<(! ["] Char)*> ["] {printElmNSP(yy, yytext);} Spacing
	| [`]<(! [`] Char)*> [`] {printElmNSP(yy, yytext);} Spacing
	;
IdentStart =
	 [a-zA-Z_]
	;
DOT =
	"ANY" {printf(" .");} Spacing
	;
IdentCont =
	IdentStart |  Digit
	;
Char =
	"\\" [abefnrtv'"`\[\]\\]
	| "\\" [0-3] OctalDigit OctalDigit
	| "\\" OctalDigit OctalDigit?
	| "\\x" HexDigit HexDigit
	| "\\u" opening_brace HexDigit HexDigit HexDigit HexDigit closing_brace
	| "\\" "-"
	| ! "\\" .
	;
Digit =
	[0-9]
	;
OctalDigit =
	[0-7]
	;
HexDigit =
	[0-9a-fA-F]
	;
range_operator =
	".." Spacing
	;
Range =
	{printf(" [");} LiteralUnquotes range_operator {printf("-");} LiteralUnquotes {printf("]");}
	;
Space =
	" " |  "\t" |  EndOfLine
	;
Comment =
	"//"(! EndOfLine.)* EndOfLine
	| "/*" (! '*/' .)* "*/"
	;
EndOfLine =
	"\r\n" |  "\n\r" |  "\n" |  "\r"
	;
opening_brace =
	'{' Spacing
	;
closing_brace =
	'}' Spacing
	;
insensitive_operator =
	"^"
	;
modifier =
	silent_modifier
	| atomic_modifier
	| compound_atomic_modifier
	| non_atomic_modifier
	;

silent_modifier  =
	"_" Spacing
	;
atomic_modifier =
	"@" Spacing
	;
compound_atomic_modifier =
	"$" Spacing
	;
non_atomic_modifier =
	"!" Spacing {printf(MYNOT);}
	;

comma =
    ',' Spacing
    ;
number =
    Digit+
    ;

repeat_exact =
    opening_brace number closing_brace
    ;

repeat_min =
    opening_brace number comma closing_brace
    ;

repeat_max =
    opening_brace comma number closing_brace
    ;

repeat_min_max =
    opening_brace number comma number closing_brace
    ;

PUSH =
	"PUSH" Spacing
	;
PEEK =
	"PEEK" Spacing
	;
opening_brack =
	'[' Spacing
	;
closing_brack =
	']' Spacing
	;
integer =
	number | "-"  "0"*  '1'..'9'  number?
	;
peek_slice =
	PEEK opening_brack  integer? range_operator integer? closing_brack
	;

%%

int igetchar(yycontext *ctx)
{
	return fgetc(ctx->ifp);
}

void printName(yycontext *ctx, const char *s) {
	if(ctx->doOutput) {
		printf(" %s", s);
		/*
		size_t len = strlen(s);
		fputc(' ', stdout);
		for(size_t i=0; i < len; ++i) fputc(s[i] == '-' ? '_' : s[i], stdout);
		fputc(' ', stdout);
		*/
	}
}
void printDefined(yycontext *ctx, const char *s) {
	//printf("===printDefined: %d : %s\n", ctx->doOutput, s);
	#define MYSTRNCMP(ps, ls) (strncmp(ps, ls, sizeof(ls)-1))
	if(ctx->doOutput) {
		if(MYSTRNCMP(s, "nl") == 0) printf(" '\\n' ");
		else if(MYSTRNCMP(s, "cn") == 0) printf(" '\\n' ");
		else if(MYSTRNCMP(s, "cr") == 0) printf(" '\\r' ");
		else if(MYSTRNCMP(s, "s") == 0)
			printf(" (' ' " MYALT_SEP " '\\t' " MYALT_SEP " '\\r' " MYALT_SEP " '\\n') ");
	}
}
void printElmNSP(yycontext *ctx, const char *s) {
	if(ctx->doOutput) printf("%s", s);
}
void printElm(yycontext *ctx, const char *s) {
	if(ctx->doOutput) printf(" %s ", s);
}
void printAltSym(yycontext *ctx) {
	if(ctx->doOutput) {
		printf("%s" MYALT_SEP, ctx->insideParen ? " " : "\n\t");
	}
}

void synerr(yycontext *ctx, const char *errmsg)
{
	fprintf(stderr, "%s:%d:%d %s\n", ctx->ifname, ctx->__lineno, ctx->__inputpos-ctx->__linenopos, errmsg);
}

int main(int argc, char *argv[])
{
  int rc;
  /* create a local parser context in automatic storage */
  yycontext yy;
  /* the context *must* be initialised to zero before first use*/
  memset(&yy, 0, sizeof(yy));

  if(argc < 2) {
	printf("usage: %s pest_filename\n", argv[0]);
	return -1;
  }
  yy.ifp = fopen(argv[1], "r");
  if(yy.ifp == NULL) {
     printf("failed to open: %s\n", argv[1]);
     return -2;
  }
  yy.ifname = argv[1];
  yy.__lineno = 1;
  yy.doOutput = 1;
  /*while ( (*/rc = yyparse(&yy); /*) )*/  printf("yy %d\n", rc);

  /* release all resources associated with the context */
  yyrelease(&yy);

  fclose(yy.ifp);

  return 0;
}

@mingodad
Copy link
Contributor Author

Could you as an exercise build the equivalent of pest2peg and lpegrex2peg with chpeg to show it's usage ?
I can provide working grammar !
pest -> #20 (comment)
lpegrex -> #20 (comment)

Again another way to dogfood chpeg and have working examples to test/ponder when making changes.

@ChrisHixon
Copy link
Owner

The compiler itself is where I'm using the AST the most. Since there are no callbacks/actions, it seems easiest to wrap the AST with your own tree where you add whatever your app requires - this is what the compiler does. The compiler is stuck on -s1 for now because the grammars haven't been ported to -s2. As you're aware the compiler requires a specific AST. I would suggest ditching -s1 and consider -s2 the only option. I don't foresee any further AST simplification related changes. For a simple punctuation change like EBNF I'm thinking it might be a variation of what something like ChpegNode_print() does.

@mingodad
Copy link
Contributor Author

This is the answer from cpp-peglib about the [^] class yhirose/cpp-peglib#218 (comment) , I'm no in doubt of my view as a valid one concerning confuse users.

@ChrisHixon
Copy link
Owner

Could you as an exercise build the equivalent of pest2peg and lpegrex2peg with chpeg to show it's usage ? I can provide working grammar ! pest -> #20 (comment) lpegrex -> #20 (comment)

Again another way to dogfood chpeg and have working examples to test/ponder when making changes.

I'll consider doing that and/or other examples. It's certainly a lacking area of chpeg as a project.

@mingodad
Copy link
Contributor Author

See this issue yhirose/cpp-peglib#226 for a request/wish for a better user experience when developing/debugging grammars on the playground.

The error message from chpeg is:

input:39:29: Expected: declarator_list in normal_item_declaration

@ChrisHixon
Copy link
Owner

ChrisHixon commented Jun 26, 2022

As a start, I created a calculator example here:
da1cff7
Minor change: 5bb13fb

A few things of note:

This example incorporates a C bytecode file, generated in the Makefile. Without doing this (say, generating bytecode at runtime), it's going to be difficult to hook up the definition IDs in user code in order to process the AST nodes.

I added a function chpeg_token to make it easier to process tokens:

// Return a zero-terminated token from `input` at offset `offset` with length `length`.
// Result must be freed with CHPEG_FREE().
CHPEG_DEF char *chpeg_token(const unsigned char *input, size_t offset, size_t length)
{
    char *token = (char *)CHPEG_MALLOC(length + 1);
    memcpy(token, input + offset, length);
    token[length] = '\0';
    return token;
}

calc takes input via stdin or file. The result printed by calc should match the same input sent to the unix bc command.

@mingodad
Copy link
Contributor Author

Here is another interesting resource for peg parsing rwxrob/pegn-spec#15

@mingodad
Copy link
Contributor Author

I've just got this parser https://github.com/edubart/lpegrex/blob/main/parsers/c11.lua converted to peg running with https://github.com/mingodad/peg, it needs a hashtable/dicitionary available to be usable by the end users (I'm using khash, included).
Nez (#20 (comment) ) has this:

Extension
   <- LT addExtension S* GT

addExtension
   <- IF       FlagName #If
	/ ON        FlagName Expression #On
	/ BLOCK     Expression #Block
	/ SYMBOL    Expression #Symbol
	/ DEF       Expression #Symbol
	/ MATCH     Expression #Match
	/ IS        TableName #Is
	/ ISA       TableName #Isa
	/ EXISTS    TableName #Exists
	/ LOCAL     TableName Expression #Local
	/ Identifier Expression (COMMA Expression)* #Expand
	/ (!('>') .)+ #Undefined

PEG/LEG (https://github.com/mingodad/peg) has this:

A special form of the '&' predicate is provided:

       &{ expression }
              In  this  predicate  the  simple C expression (not statement) is
              evaluated immediately when the parser reaches the predicate.  If
              the  expression  yields non-zero (true) the 'match' succeeds and
              the parser continues with the next element in the  pattern.   If
              the  expression  yields  zero  (false) the 'match' fails and the
              parser backs up to look for an alternative parse of the input.

My implementation (see attached) has this:

typedef_name  <-
	 &  (  identifier  )  identifier  &{is_typedef(yy, yytext)}

typedef_identifier  <-
	 &  (  identifier  )  identifier &{set_typedef(yy, yytext)}

Lpegrex (https://github.com/edubart/lpegrex) has this:

typedef-name <==
  &(identifier => is_typedef) identifier

typedef-identifier <==
  &(identifier => set_typedef) identifier

Can we have something like this in chpeg ?

c11-lpegrex.zip

@decalek
Copy link

decalek commented Jun 29, 2022

treesitter and lezer (codemirror 6) brought the incremental parsing to a practical usage scale. But PEG grammars are easier for authoring (than LR) and being scannerless they are also more convenient for nested languages/grammars.

I am wondering would it be possible a VM based PEG parser to be made incremental? I.e. given a previous parse tree and a diff, is it possible a valid VM state at near point before the diff to be artificially constructed as a function of the tree and VM to be resumed in that state?

@ChrisHixon
Copy link
Owner

@decalek

treesitter and lezer (codemirror 6) brought the incremental parsing to a practical usage scale. But PEG grammars are easier for authoring (than LR) and being scannerless they are also more convenient for nested languages/grammars.

I am wondering would it be possible a VM based PEG parser to be made incremental? I.e. given a previous parse tree and a diff, is it possible a valid VM state at near point before the diff to be artificially constructed as a function of the tree and VM to be resumed in that state?

See this project: https://github.com/zyedidia/gpeg and the papers listed under publications: https://github.com/zyedidia/gpeg#publications

I have not explored incremental parsing myself.

@ChrisHixon
Copy link
Owner

I've just got this parser https://github.com/edubart/lpegrex/blob/main/parsers/c11.lua converted to peg running with https://github.com/mingodad/peg, it needs a hashtable/dicitionary available to be usable by the end users (I'm using khash, included). Nez (#20 (comment) ) has this:

Extension
   <- LT addExtension S* GT

addExtension
   <- IF       FlagName #If
	/ ON        FlagName Expression #On
	/ BLOCK     Expression #Block
	/ SYMBOL    Expression #Symbol
	/ DEF       Expression #Symbol
	/ MATCH     Expression #Match
	/ IS        TableName #Is
	/ ISA       TableName #Isa
	/ EXISTS    TableName #Exists
	/ LOCAL     TableName Expression #Local
	/ Identifier Expression (COMMA Expression)* #Expand
	/ (!('>') .)+ #Undefined

PEG/LEG (https://github.com/mingodad/peg) has this:

A special form of the '&' predicate is provided:

       &{ expression }
              In  this  predicate  the  simple C expression (not statement) is
              evaluated immediately when the parser reaches the predicate.  If
              the  expression  yields non-zero (true) the 'match' succeeds and
              the parser continues with the next element in the  pattern.   If
              the  expression  yields  zero  (false) the 'match' fails and the
              parser backs up to look for an alternative parse of the input.

My implementation (see attached) has this:

typedef_name  <-
	 &  (  identifier  )  identifier  &{is_typedef(yy, yytext)}

typedef_identifier  <-
	 &  (  identifier  )  identifier &{set_typedef(yy, yytext)}

Lpegrex (https://github.com/edubart/lpegrex) has this:

typedef-name <==
  &(identifier => is_typedef) identifier

typedef-identifier <==
  &(identifier => set_typedef) identifier

Can we have something like this in chpeg ?

c11-lpegrex.zip

I'm looking into these things and thinking about how I might do symbol tables in chpeg.

@mingodad
Copy link
Contributor Author

mingodad commented Jul 3, 2022

That would be a useful feature and that will allow parse at runtime more grammars (for instance C).

@mingodad
Copy link
Contributor Author

mingodad commented Jul 3, 2022

While you are at it would be nice to have also scopes, this project has a good description of desired features in a parser https://github.com/Lercher/CocoR that it somehow implements (worth short reading).

@mingodad
Copy link
Contributor Author

Where is the link to the playground on the README ?
How other people will know the playground exists ?

@mingodad
Copy link
Contributor Author

I just add an online playground for my CocoR Typescript/Javascript port here https://mingodad.github.io/CocoR-Typescript/playground

@mingodad
Copy link
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants