Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Packrat algorithm #21

Open
ChrisHixon opened this issue Jun 3, 2022 · 25 comments
Open

Packrat algorithm #21

ChrisHixon opened this issue Jun 3, 2022 · 25 comments
Assignees

Comments

@ChrisHixon
Copy link
Owner

I've implemented the Packrat algorithm here: https://github.com/ChrisHixon/chpeg/tree/packrat

% /usr/bin/time ../../util/chpeg -v -g c99-mouse.peg -packrat -p c99-mouse.peg.pp.c > /dev/null
Compiling grammar file: 'c99-mouse.peg'
Grammar file 'c99-mouse.peg' compiled successfully. Parser returned: 0
Parsing file: 'c99-mouse.peg.pp.c'
Parse successful.
0.19user 0.08system 0:00.28elapsed 99%CPU (0avgtext+0avgdata 197828maxresident)k
0inputs+0outputs (0major+120616minor)pagefaults 0swaps

% /usr/bin/time ../../util/chpeg -v -g c99-mouse.peg -p c99-mouse.peg.pp.c > /dev/null
Compiling grammar file: 'c99-mouse.peg'
Grammar file 'c99-mouse.peg' compiled successfully. Parser returned: 0
Parsing file: 'c99-mouse.peg.pp.c'
Parse successful.
0.56user 0.00system 0:00.56elapsed 99%CPU (0avgtext+0avgdata 13236maxresident)k
0inputs+0outputs (0major+2988minor)pagefaults 0swaps

TODO: windowing to reduce memory usage.

@ChrisHixon ChrisHixon self-assigned this Jun 3, 2022
@mingodad
Copy link
Contributor

mingodad commented Jun 4, 2022

Using the same example shown here #4 (comment) I'm getting the same behavior as cpp-peglib where it takes more time with -packrat:

/usr/bin/time ../../chpeg-packrat/util/chpeg -g bison.chpeg -p ../../../A_databases/postgresql-13.3/src/backend/parser/gram.y > /dev/null

0.05user 0.00system 0:00.05elapsed 96%CPU (0avgtext+0avgdata 4464maxresident)k
0inputs+0outputs (0major+879minor)pagefaults 0swaps

/usr/bin/time ../../chpeg-packrat/util/chpeg -packrat -g bison.chpeg -p ../../../A_databases/postgresql-13.3/src/backend/parser/gram.y > /dev/null

0.11user 0.04system 0:00.15elapsed 99%CPU (0avgtext+0avgdata 63372maxresident)k
0inputs+0outputs (0major+64968minor)pagefaults 0swaps

@ChrisHixon
Copy link
Owner Author

Indeed. I tested the same thing after reading your test results for bison.chpeg / postgresql-13.3/src/backend/parser/gram.y.
There are certainly cases where the overhead of packrat makes things worse. This bison grammar must be pretty optimal already.

@mingodad
Copy link
Contributor

mingodad commented Jun 4, 2022

That's why I said that I have mixed feelings about packrat because it can hide bad grammars (not saying that that bison grammar is optimal).

@ChrisHixon
Copy link
Owner Author

Yeah I am having (and have had) mixed feelings about packrat, and it can certainly hide bad grammars, but I still think it's useful to have available.

Perhaps optimal was not the best word choice - I haven't really analyzed it, but I'm thinking if packrat makes something worse it might already be in fairly good shape. The ISUCC/IFAIL ratio looks good and it seems pretty fast for the input size. 5742685 inst / 451294 bytes = 12.72 instructions per input byte. Maybe a good stat to add.

@mingodad
Copy link
Contributor

mingodad commented Jun 4, 2022

Definitely a good one !

@ChrisHixon
Copy link
Owner Author

I adding windowing to the packrat branch. It does do the job of reducing memory use; however, I have not yet found an example that doesn't trade window size for speed.

# no window (window is actually input_length + 1)
% /usr/bin/time ../../util/chpeg -v -g c99-mouse.peg -packrat -p c99-mouse.peg.pp.c > /dev/null
Compiling grammar file: 'c99-mouse.peg'
Grammar file 'c99-mouse.peg' compiled successfully. Parser returned: 0
Parsing file: 'c99-mouse.peg.pp.c'
Parse successful.
0.19user 0.08system 0:00.28elapsed 99%CPU (0avgtext+0avgdata 197812maxresident)k
0inputs+0outputs (0major+120612minor)pagefaults 0swaps

# window=64
% /usr/bin/time ../../util/chpeg -v -g c99-mouse.peg -packrat -window=64 -p c99-mouse.peg.pp.c > /dev/null
Compiling grammar file: 'c99-mouse.peg'
Grammar file 'c99-mouse.peg' compiled successfully. Parser returned: 0
Parsing file: 'c99-mouse.peg.pp.c'
Parse successful.
0.68user 0.00system 0:00.68elapsed 100%CPU (0avgtext+0avgdata 11312maxresident)k
0inputs+0outputs (0major+2496minor)pagefaults 0swaps

# window=4096
% /usr/bin/time ../../util/chpeg -v -g c99-mouse.peg -packrat -window=4096 -p c99-mouse.peg.pp.c > /dev/null
Compiling grammar file: 'c99-mouse.peg'
Grammar file 'c99-mouse.peg' compiled successfully. Parser returned: 0
Parsing file: 'c99-mouse.peg.pp.c'
Parse successful.
0.57user 0.00system 0:00.57elapsed 100%CPU (0avgtext+0avgdata 17340maxresident)k
0inputs+0outputs (0major+5658minor)pagefaults 0swaps

# window=65536
% /usr/bin/time ../../util/chpeg -v -g c99-mouse.peg -packrat -window=65536 -p c99-mouse.peg.pp.c > /dev/null
Compiling grammar file: 'c99-mouse.peg'
Grammar file 'c99-mouse.peg' compiled successfully. Parser returned: 0
Parsing file: 'c99-mouse.peg.pp.c'
Parse successful.
0.27user 0.05system 0:00.32elapsed 99%CPU (0avgtext+0avgdata 118652maxresident)k
0inputs+0outputs (0major+54364minor)pagefaults 0swaps

% ../../util/chpeg-trace --profile -v -g c99-mouse.peg -packrat -window=65536 -p c99-mouse.peg.pp.c
Compiling grammar file: 'c99-mouse.peg'
Grammar file 'c99-mouse.peg' compiled successfully. Parser returned: 0
Parsing file: 'c99-mouse.peg.pp.c'
Parse successful.
Instructions executed:
   OP-   opcode       count      %
   OP      GOTO     3691965  24.97
   OP     IDENT     2190767  14.82
   OP     ISUCC      443906   3.00
   OP     IFAIL     1296870   8.77
   OP     RSBEG      207555   1.40
   OP     RQBEG       91117   0.62
   OP    CHOICE      658365   4.45
   OP    CISUCC      237379   1.61
   OP     CFAIL      420986   2.85
   OP    CIFAIL     2505124  16.95
...
   OP=    Total    14783641 100.00

Definition identifier calls:
  DEF-   id       IDENT      %       ISUCC       IFAIL  name
...
  DEF=   --     2190767 100.00      443906     1296870  --

Choice calls per definition:
  CHOICE-  def      CHOICE      %      CISUCC      CIFAIL  def_name
...
  CHOICE=   --      658365 100.00      237379     2505124  --

Farthest backtrack: 27233

% python # instructions per input byte
Python 3.10.4 (main, May 14 2022, 05:21:19) [GCC 12.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 14783641 / 187606
78.80153619820261

Note the farthest backtrack stat. Speed doesn't improve much until the window covers that amount.

@mingodad
Copy link
Contributor

mingodad commented Jun 4, 2022

After you've mentioned the instructions per input byte metric I've add it to my fork and I'm looking again on c99-mouse.peg grammar and it seems that a good amount of time is due to this two rules:

Keyword
   <- ( 'asm'
      / 'auto'
      / 'break'
      / 'case'
      / 'char'
      / 'const'
      / 'continue'
      / 'default'
      / 'double'
      / 'do'
      / 'else'
      / 'enum'
      / 'extern'
      / 'float'
      / 'for'
      / 'goto'
      / 'if'
      / 'int'
      / 'inline'
      / 'long'
      / 'register'
      / 'restrict'
      / 'return'
      / 'short'
      / 'signed'
      / 'sizeof'
      / 'static'
      / 'struct'
      / 'switch'
      / 'typedef'
      / 'union'
      / 'unsigned'
      / 'void'
      / 'volatile'
      / 'while'
      / '_Bool'
      / '_Complex'
      / '_Imaginary'
      / '_stdcall'
      / '__declspec'
      / '__extension__'
      / '__attribute__'
      )
    !IdChar
Identifier {L} <- !Keyword IdNondigit IdChar* Spacing 

Then I've tried to test on cpp-peglib using it's Dicitionary https://github.com/yhirose/cpp-peglib#dictionary functionality and I think that I found a bug there yhirose/cpp-peglib#209 like when I tried to use the backreference.

Some rules have a multiplier/cascade effect and would be nice to have a way to detect/show then.

Another possibility that I was thinking is to have a way to specify where (rule) we want the packrat/cache to be used because in some grammars only a few rules would need/benefit from it.

@mingodad
Copy link
Contributor

mingodad commented Jun 4, 2022

I'm trying several tweaks in hope I'll find the big multiplier:
Initial c99-mouse.peg:

./../chpeg-dad/examples/chpeg_nocase c99-mouse.chpeg c99-mouse.peg.pp.c
chpeg_compile: Parse successful.
Grammar file compiled successfully. Parser returned: 0
Parse successful.
  id       total      %     success        fail  definition
               1   0.00           1              TranslationUnit <- Spacing ( ( ExternalDeclaration / SEMI ) )* EOT 
   1         841   0.01         840           1  ExternalDeclaration <- ( FunctionDefinition / Declaration ) 
...
193       82445   0.77       23591       58854  COMMA <- "," Spacing 
 194           1   0.00           1              EOT <- ! . 

        10687487            2527875     8159612  Total counters

                              23.65       76.35  % success/fail

Total VM loops 92589946,  instructions per input byte 493.53

After some tweaks:

../../chpeg-dad/examples/chpeg_nocase c99-mouse-perf.chpeg c99-mouse.peg.pp.c
chpeg_compile: Parse successful.
Grammar file compiled successfully. Parser returned: 0
Parse successful.
  id       total      %     success        fail  definition
               1   0.00           1              TranslationUnit <- Spacing ( ( ExternalDeclaration / SEMI ) )* EOT 
   1         841   0.01         840           1  ExternalDeclaration <- ( FunctionDefinition / Declaration ) 
...
 193       82445   0.90       23591       58854  COMMA <- "," Spacing 
 194           1   0.00           1              EOT <- ! . 

         9211500            2665126     6546374  Total counters

                              28.93       71.07  % success/fail

Total VM loops 73947310,  instructions per input byte 394.16

@ChrisHixon
Copy link
Owner Author

I added the option {P} to enable a node to be cached in packrat. If no such rules exist, all nodes will be cached as before.

43bb8ef

@ChrisHixon
Copy link
Owner Author

Just add P into the options of a rule like:

% diff c99-mouse.peg c99-mouse-packrat.peg 
537c537
< Identifier {L} <- !Keyword IdNondigit IdChar* Spacing #{}
---
> Identifier {LP} <- !Keyword IdNondigit IdChar* Spacing #{}

@ChrisHixon
Copy link
Owner Author

Do your tweaks improve:

Farthest backtrack: 27233

I think if you can get the backtracking to fit into the sliding packrat window (which always sits at the furthest offset reached), then performance and memory use should both be good. Of course it's also possible that with such grammar changes, it'll be fine without packrat.

@ChrisHixon
Copy link
Owner Author

I'm also pondering more memory efficient packrat storage. The current implementation uses a table much like the "Packrat Parsing with Elastic Sliding Window" paper you linked at https://www.jstage.jst.go.jp/article/ipsjjip/23/4/23_505/_pdf , but instead of recording length, it links to PNode wrapped parse sub-trees. There is a compile-time node limit to prevent it from caching huge sub-trees - perhaps I should expose this as runtime option.

I'm not sure it's worth the trouble of making the lookup table more memory efficient or not, at the cost of more overhead. It'd have to be something like a linked list or b-tree, rb-tree, etc., for each window position.

@ChrisHixon
Copy link
Owner Author

I guess I could also try the elastic sliding window - still trying to wrap my head around that one. The current implementation is the basic sliding window mentioned in the paper.

@mingodad
Copy link
Contributor

mingodad commented Jun 5, 2022

Here is the tweaked c99-mouse grammar:

c99-mouse-perf.chpeg.zip

@ChrisHixon ChrisHixon mentioned this issue Jun 18, 2022
@ChrisHixon
Copy link
Owner Author

From the other discussion:

I tried raising the packrat node limit using -DCHPEG_PACKRAT_NODE_LIMIT=32 (default is 8)
This seems to make performance similar to cpp-peglib. Also, -s2 seems to improve things a tiny bit.

% /usr/bin/time ../../../chpeg/util/chpeg -g kotlin.chpeg -s2 -packrat -p kotlin.kt > /dev/null
0.22user 0.08system 0:00.30elapsed 100%CPU (0avgtext+0avgdata 172404maxresident)k
0inputs+0outputs (0major+73735minor)pagefaults 0swaps

Perhaps the default CHPEG_PACKRAT_NODE_LIMIT should be changed. Also this could be made a runtime settable option. This is the total count of nodes allowed (recursive, entire sub-tree) in a single packrat cache entry. Either simplification (-s1, -s2) helps performance because it helps reduce the node count and node copy overhead (when packrat cache is used the result has to be copied to the parse tree).

@mingodad
Copy link
Contributor

Here is the kotlin grammars I'm using, also see this comment yhirose/cpp-peglib#213 (comment) to see if you can reproduce it.

kotlin-peglib.zip

@ChrisHixon
Copy link
Owner Author

Results: (this is with -DCHPEG_PACKRAT_NODE_LIMIT=64)

% /usr/bin/time ../../../chpeg/util/chpeg -g kotlin.chpeg -packrat -s2 -p kotlin.kt > /dev/null           
0.20user 0.05system 0:00.25elapsed 100%CPU (0avgtext+0avgdata 192804maxresident)k
0inputs+0outputs (0major+78820minor)pagefaults 0swaps

% /usr/bin/time ../../../chpeg/util/chpeg -g kotlin-perf.chpeg -packrat -s2 -p kotlin.kt > /dev/null
0.10user 0.03system 0:00.14elapsed 99%CPU (0avgtext+0avgdata 76804maxresident)k
0inputs+0outputs (0major+49852minor)pagefaults 0swaps

% /usr/bin/time ../../../cpp-peglib/lint/build/peglint --ast --packrat kotlin.peglib kotlin.kt > /dev/null
0.26user 0.01system 0:00.27elapsed 100%CPU (0avgtext+0avgdata 60708maxresident)k
0inputs+0outputs (0major+17575minor)pagefaults 0swaps

% /usr/bin/time ../../../cpp-peglib/lint/build/peglint --ast --packrat kotlin-perf.peglib kotlin.kt > /dev/null
0.23user 0.00system 0:00.24elapsed 99%CPU (0avgtext+0avgdata 52324maxresident)k
0inputs+0outputs (0major+14176minor)pagefaults 0swaps

@mingodad
Copy link
Contributor

It seems that somehow cpp-peglib removes or creates less nodes by default because it's memory usage is lower before refining the grammars.
I'm interested in this grammar to see if we can make it also work without packrat by using the profile metrics we've developed so far in a kind of tutorial/howto to be useful by anyone (mainly ourselves) as guidelines (even better if some of the procedures could be automated).

@ChrisHixon
Copy link
Owner Author

I made -DCHPEG_PACKRAT_NODE_LIMIT=64 (and -s2) the default in the current dev branch: https://github.com/ChrisHixon/chpeg/tree/dev . At some point I plan on addressing packrat and other memory usage. If you're really concerned about it for now, try smaller window sizes (needs adjustment depending on grammar/input, and there is a trade off between memory and performance)...

% /usr/bin/time ../../../chpeg/util/chpeg -g kotlin.chpeg -packrat -window=1024 -p kotlin.kt > /dev/null
1.69user 0.00system 0:01.69elapsed 99%CPU (0avgtext+0avgdata 11884maxresident)k
0inputs+0outputs (0major+3491minor)pagefaults 0swaps

% /usr/bin/time ../../../chpeg/util/chpeg -g kotlin.chpeg -packrat -window=2048 -p kotlin.kt > /dev/null
0.33user 0.00system 0:00.33elapsed 99%CPU (0avgtext+0avgdata 16320maxresident)k
0inputs+0outputs (0major+5339minor)pagefaults 0swaps

% /usr/bin/time ../../../chpeg/util/chpeg -g kotlin.chpeg -packrat -window=4096 -p kotlin.kt > /dev/null
0.30user 0.01system 0:00.32elapsed 99%CPU (0avgtext+0avgdata 25284maxresident)k
0inputs+0outputs (0major+9020minor)pagefaults 0swaps

% /usr/bin/time ../../../chpeg/util/chpeg -g kotlin.chpeg -packrat -window=8192 -p kotlin.kt > /dev/null
0.30user 0.00system 0:00.31elapsed 99%CPU (0avgtext+0avgdata 43264maxresident)k
0inputs+0outputs (0major+16369minor)pagefaults 0swaps

% /usr/bin/time ../../../chpeg/util/chpeg -g kotlin.chpeg -packrat -window=16384 -p kotlin.kt > /dev/null
0.25user 0.00system 0:00.26elapsed 98%CPU (0avgtext+0avgdata 76760maxresident)k
0inputs+0outputs (0major+30610minor)pagefaults 0swaps

% /usr/bin/time ../../../chpeg/util/chpeg -g kotlin.chpeg -packrat -window=12000 -p kotlin.kt > /dev/null
0.25user 0.03system 0:00.30elapsed 99%CPU (0avgtext+0avgdata 58444maxresident)k

@ChrisHixon
Copy link
Owner Author

If you haven't been following commits, I also added another binary, chpeg-profile, that enables profiling but not VM_TRACE/VM_PRINT_TREE and -g like chpeg-trace. It's almost as fast as the plain chpeg util.

@ChrisHixon
Copy link
Owner Author

It seems that somehow cpp-peglib removes or creates less nodes by default because it's memory usage is lower before refining the grammars. I'm interested in this grammar to see if we can make it also work without packrat by using the profile metrics we've developed so far in a kind of tutorial/howto to be useful by anyone (mainly ourselves) as guidelines (even better if some of the procedures could be automated).

I'd definitely be interested in this tutorial. I haven't really attempted such things myself, and I wouldn't know where to begin automating any part of it.

@mingodad
Copy link
Contributor

mingodad commented Jun 19, 2022

For example simple metrics applied to text like word count, ideally in a grammar an identifier would appear once (except named punctuation):

identifier
1	file
2	ADD_ASSIGNMENT
2	AS_SAFE
2	AT_POST_WS
2	BREAK_AT
2	BinDigitOrSeparator
2	CONTINUE_AT
2	CharacterLiteral
2	DIV
2	DIV_ASSIGNMENT
2	DecDigitNoZero
2	EOF
2	EQEQ
2	EQEQEQ
2	EXCL_EQ
2	EXCL_EQEQ
2	EXCL_WS
2	EscapeSeq
2	FloatLiteral
2	GE
2	HexDigitOrSeparator
2	IdentifierOrSoftKey
2	LE
2	LineStrEscapedChar
2	LineStrExprStart
2	LineStrRef
2	LineStrText
2	LongLiteral
2	MOD
2	MOD_ASSIGNMENT
2	MULT_ASSIGNMENT
2	MultiLineStrExprStart
2	MultiLineStrRef
2	MultiLineStrText
2	NOT_IN
2	NOT_IS
2	QUEST_WS
2	QUOTE_CLOSE
2	QUOTE_OPEN
2	RETURN_AT
2	RealLiteral
2	SUB_ASSIGNMENT
2	SUPER_AT
2	ShebangLine
2	THIS_AT
2	TRIPLE_QUOTE_CLOSE
2	TRIPLE_QUOTE_OPEN
2	TYPEOF
2	UnsignedLiteral
2	annotatedLambda
2	anonymousFunction
2	anonymousInitializer
2	assignableExpression
2	assignment
2	assignmentAndOperator
2	callableReference
2	catchBlock
2	classDeclaration
2	classMemberDeclaration
2	classModifier
2	classParameters
2	collectionLiteral
2	companionObject
2	constructorDelegationCall
2	delegationSpecifier
2	directlyAssignableExpression
2	disjunction
2	doWhileStatement
2	enumClassBody
2	enumEntries
2	explicitDelegation
2	fileAnnotation
2	filePart
2	forStatement
2	functionDeclaration
2	functionLiteral
2	functionModifier
2	functionTypeParameters
2	hardKeyword
2	ifExpression
2	importAlias
2	importHeader
2	importList
2	infixOperation
2	inheritanceModifier
2	inside_assignableExpression
2	inside_directlyAssignableExpression
2	inside_disjunction
2	inside_infixOperation
2	inside_postfixUnarySuffix
2	inside_unaryPrefix
2	jumpExpression
2	lambdaParameters
2	lineStringContent
2	lineStringExpression
2	lineStringLiteral
2	literalConstant
2	loopStatement
2	memberAccessOperator
2	memberModifier
2	modifier
2	multiAnnotation
2	multiLineStringContent
2	multiLineStringExpression
2	multiLineStringLiteral
2	objectDeclaration
2	objectLiteral
2	packageHeader
2	parametersWithOptionalType
2	platformModifier
2	postfixUnarySuffix
2	primaryConstructor
2	propertyDeclaration
2	propertyDelegate
2	propertyModifier
2	quest
2	rangeTest
2	reificationModifier
2	safeNav
2	secondaryConstructor
2	shebangLine
2	singleAnnotation
2	stringLiteral
2	superExpression
2	thisExpression
2	topLevelObject
2	tryExpression
2	typeAlias
2	typeModifier
2	typeParameterModifier
2	typeParameterModifiers
2	typeProjectionModifier
2	typeProjectionModifiers
2	typeTest
2	unaryPrefix
2	unparsable
2	visibilityModifier
2	whenEntry
2	whenExpression
2	whenSubject
2	whileStatement
3	ADD
3	BREAK
3	BooleanLiteral
3	COLONCOLON
3	CONJ
3	CONTINUE
3	DECR
3	DISJ
3	DO
3	DecDigitOrSeparator
3	DoubleExponent
3	DoubleLiteral
3	EXCL_NO_WS
3	EscapedIdentifier
3	FOR
3	FieldIdentifier
3	INCR
3	INTERFACE
3	IS
3	MultiLineStringQuote
3	NullLiteral
3	PACKAGE
3	RANGE
3	RETURN
3	SUB
3	THROW
3	TRY
3	TYPE_ALIAS
3	UniCharacterLiteral
3	WHEN
3	WS
3	additiveExpression
3	additiveOperator
3	annotatedDelegationSpecifier
3	annotationUseSiteTarget
3	asExpression
3	asOperator
3	assignableSuffix
3	classMemberDeclarations
3	classParameter
3	comparison
3	comparisonOperator
3	conjunction
3	constructorInvocation
3	elvisExpression
3	enumEntry
3	equality
3	equalityOperator
3	excl
3	finallyBlock
3	functionValueParameter
3	functionValueParameters
3	genericCallLikeComparison
3	getter
3	identifier
3	infixFunctionCall
3	inside_additiveExpression
3	inside_asExpression
3	inside_comparison
3	inside_conjunction
3	inside_elvisExpression
3	inside_equality
3	inside_genericCallLikeComparison
3	inside_infixFunctionCall
3	inside_multiplicativeExpression
3	inside_prefixUnaryExpression
3	inside_rangeExpression
3	inside_valueArgument
3	lambdaLiteral
3	lambdaParameter
3	multiplicativeExpression
3	multiplicativeOperator
3	parameterModifier
3	parameterModifiers
3	parenthesizedAssignableExpression
3	parenthesizedDirectlyAssignableExpression
3	parenthesizedExpression
3	postfixUnaryOperator
3	prefixUnaryExpression
3	prefixUnaryOperator
3	rangeExpression
3	receiverType
3	receiverTypeAndDot
3	setter
3	typeConstraint
3	typeParameter
3	typeProjection
3	varianceModifier
3	whenCondition
4	ABSTRACT
4	ACTUAL
4	ANNOTATION
4	AS
4	BinDigit
4	BinLiteral
4	CATCH
4	COMPANION
4	CONST
4	CROSSINLINE
4	DATA
4	DELEGATE
4	DYNAMIC
4	ELSE
4	ENUM
4	EXPECT
4	EXTERNAL
4	FIELD
4	FILE
4	FINAL
4	FINALLY
4	HexLiteral
4	IF
4	IMPORT
4	INFIX
4	INIT
4	INLINE
4	INNER
4	INTERNAL
4	IntegerLiteral
4	LATEINIT
4	LineComment
4	NOINLINE
4	OPEN
4	OPERATOR
4	OUT
4	OVERRIDE
4	PARAM
4	PRIVATE
4	PROPERTY
4	PROTECTED
4	PUBLIC
4	QUEST_NO_WS
4	RECEIVER
4	REIFIED
4	SEALED
4	SETPARAM
4	SUPER
4	TAILREC
4	THIS
4	VAR
4	VARARG
4	WHERE
4	declaration
4	elvis
4	functionType
4	inOperator
4	indexingSuffix
4	inside_postfixUnaryExpression
4	isOperator
4	multiVariableDeclaration
4	navigationSuffix
4	nullableType
4	parameter
4	parameterWithOptionalType
4	postfixUnaryExpression
4	simpleUserType
4	statements
4	typeModifiers
4	typeReference
5	ARROW
5	AT_PRE_WS
5	BY
5	CLASS
5	CONSTRUCTOR
5	DecDigit
5	DelimitedComment
5	FUN
5	Hidden
5	IN
5	LANGLE
5	MULT
5	RANGLE
5	SUSPEND
5	VAL
5	WHILE
5	callSuffix
5	delegationSpecifiers
5	functionBody
5	label
5	parenthesizedType
5	primaryExpression
5	semis
5	statement
5	typeConstraints
5	typeParameters
6	DecDigits
6	GET
6	OBJECT
6	SET
6	userType
7	AT_NO_WS
7	LCURL
7	LSQUARE
7	RSQUARE
7	classBody
7	typeArguments
7	unescapedAnnotation
7	valueArguments
7	variableDeclaration
8	HexDigit
8	Identifier
8	SEMICOLON
8	block
9	ASSIGNMENT
9	RCURL
9	controlStructureBody
9	inside_expression
10	semi
11	DOT
17	annotation
17	modifiers
21	COLON
21	expression
22	LPAREN
22	RPAREN
26	type
28	simpleIdentifier
30	NL
31	COMMA
77	UnicodeDigit
78	Letter
115	_
302	__
literals
1	'!='
1	'!=='
1	'!in'
1	'!is'
1	'"""""'
1	'""""'
1	'#!'
1	'%'
1	'%='
1	'&&'
1	'('
1	')'
1	'*'
1	'*='
1	'+'
1	'++'
1	'+='
1	','
1	'-'
1	'--'
1	'-='
1	'->'
1	'..'
1	'/'
1	'/*'
1	'//'
1	'/='
1	':'
1	'::'
1	';'
1	'<'
1	'<='
1	'=='
1	'==='
1	'>'
1	'>='
1	'['
1	'\r'
1	']'
1	'abstract'
1	'actual'
1	'annotation'
1	'as'
1	'as?'
1	'b'
1	'break'
1	'break@'
1	'by'
1	'catch'
1	'class'
1	'companion'
1	'const'
1	'constructor'
1	'continue'
1	'continue@'
1	'crossinline'
1	'data'
1	'delegate'
1	'do'
1	'dynamic'
1	'else'
1	'enum'
1	'expect'
1	'external'
1	'false'
1	'field'
1	'file'
1	'final'
1	'finally'
1	'for'
1	'fun'
1	'get'
1	'if'
1	'import'
1	'in'
1	'infix'
1	'init'
1	'inline'
1	'inner'
1	'interface'
1	'internal'
1	'is'
1	'lateinit'
1	'n'
1	'noinline'
1	'null'
1	'object'
1	'open'
1	'operator'
1	'out'
1	'override'
1	'package'
1	'param'
1	'private'
1	'property'
1	'protected'
1	'public'
1	'r'
1	'receiver'
1	'reified'
1	'return'
1	'return@'
1	'sealed'
1	'set'
1	'setparam'
1	'super'
1	'super@'
1	'suspend'
1	't'
1	'tailrec'
1	'this'
1	'this@'
1	'throw'
1	'true'
1	'try'
1	'typealias'
1	'typeof'
1	'u'
1	'val'
1	'var'
1	'vararg'
1	'when'
1	'where'
1	'while'
1	'{'
1	'||'
1	'}'
2	'!'
2	'"""'
2	'${'
2	'*/'
2	'.'
2	'='
2	'?'
2	'\n'
2	'`'
3	'""'
3	'@'
3	'\''
3	'\\'
4	'$'
4	'0'
5	'"'
6	'_'
chclass
1	[ \t\r\n]
1	[ \t]
1	["$]
1	[-+]
1	[0-9a-fA-F]
1	[01]
1	[1-9]
1	[\\"$]
1	[\n\r'\\]
1	[\n]
1	[`\r\n]
1	[a-zA-Z]
1	[eE]
1	[uU]
2	[0-9]
2	[;\r\n]
2	[\r\n]
2	[bB]
2	[fF]
2	[lL]
2	[xX]

Then we can see that the identifier __ is used 302 times and using grep we can see that only 2 times it's used without * following it and it's definition is __ {I} <- ( [ \t\r\n] / DelimitedComment / LineComment )+, remember when I asked and added to my fork a warning for repetition over repetition ?
This is in my view a silly grammar rule definition/usage, of course that we need to start with profiling to work first on the most time consuming parts first, but when cpp-peglib added an option to show the parsing profile based on the one we/you did for chpeg he also added the duration time and while I was trying to improve some grammars based on the profile info I noticed that sometime I could reduce the total success/fail rate but the total time/duration increased.

grep -o '__.' kotlin-perf.chpeg
__*
__*
__*
__*
__ 
__*
__*
...

@mingodad
Copy link
Contributor

I did a mistake when count the usage of __ not followed by * it's only one time when it's defined.

@mingodad
Copy link
Contributor

For example removing the + from the rule improved the success/fail ration but the instructions per input byte got bigger:

Duration: 397.000000ms, Total VM loops 11251995,  instructions per input byte 229.50

  id       total      %     success        fail  definition

         1079905             484135      595770  Total counters
                              44.83       55.17  % success/fail

               1   0.00           1              file <- shebangLine ? NL * fileAnnotation * _ * packageHeader * _ * importList * _ * ( ( filePart / _ / unparsable ) )* EOF 
  35        9067   0.84        2627        6440  __ {I} <- ( ( [ \t\f\r\n] / DelimitedComment / LineComment ) )+ 

Now removing the + from the rule:

Duration: 426.000000ms, Total VM loops 13330331,  instructions per input byte 271.89

  id       total      %     success        fail  definition

         1088802             493032      595770  Total counters
                              45.28       54.72  % success/fail

               1   0.00           1              file <- shebangLine ? NL * fileAnnotation * _ * packageHeader * _ * importList * _ * ( ( filePart / _ / unparsable ) )* EOF 
  35       18143   1.67       11703        6440  __ {I} <- ( [ \t\f\r\n] / DelimitedComment / LineComment ) 

Now replacing all __* by __ and adding * to the rule definition:

Duration: 359.000000ms, Total VM loops 10169649,  instructions per input byte 207.42

  id       total      %     success        fail  definition

         1078309             488979      589330  Total counters
                              45.35       54.65  % success/fail

               1   0.00           1              file <- shebangLine ? NL * fileAnnotation * _ * packageHeader * _ * importList * _ * ( ( filePart / _ / unparsable ) )* EOF 
  35        7471   0.69        7471              __ {I} <- ( ( [ \t\f\r\n] / DelimitedComment / LineComment ) )* 

@mingodad
Copy link
Contributor

Continuing looking at this kotlin grammar I can see that it was blindly done, originally it has a parser and lexer so on that context things like this make sense:

simpleIdentifier
    : Identifier
    //soft keywords:
    | ABSTRACT
    | ANNOTATION
    | ...

Because on some contexts the soft keywords can be accepted as Identifier so this rules join the return from the lexer into a common symbol simpleIdentifier but for a peg grammar it doesn't make sense because the Identifier will always match first any of the soft keywords.
I have created a minimal pest grammar to see if it would flag this nonsense but nop so it's "expressions that cannot fail" and things that cannot be reached can not detect this:

start = { simpleIdentifier }

simpleIdentifier = {
	// ( ! ( hardKeyword ~ ! ( Letter | "_" | UnicodeDigit ) ) ~ Identifier )
	 ( ! hardKeyword ~ Identifier )
	| ABSTRACT
	}
hardKeyword = {
	 AS
	| BREAK
	}

Letter = { 'a'..'z' | 'A'..'Z' }

UnicodeDigit = { '0' .. '9' }

Identifier  = {
	 ( "`" ~ ( ! ("`" | "\r" | "\n" ) ~ ANY )+ ~ "`" )
	| ( ( Letter | "_" ) ~ ( Letter | "_" | UnicodeDigit )* )
	}
ABSTRACT = {
	 "abstract" ~ ! ( Letter | UnicodeDigit )
	 }
BREAK = {
	 "break" ~ ! ( Letter | UnicodeDigit )
	 }
AS = {
	 "as" ~ ! ( Letter | UnicodeDigit )
     }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants