Optimize Scanner Regexes #2135

tsholmes · 2017-10-18T05:36:00Z

Includes changes in #2134

https://github.com/KSP-KOS/KOS/blob/develop/src/kOS.Safe/Compilation/KS/Scanner.cs#L458
In the generated Scanner, when looking for a token it runs the regex for every token that is allowed to be here, keeping the longest match that matches starting at the beginning of the string. This means that it will attempt to find a match over the entire rest of the file if it can't find one at the position its looking for and then discard it if it isn't at the start, which caused the issue fixed by #2134 to happen even in comments, since it was matching anywhere it could find it. This PR changes all of the regexes to start with ^ ensuring it will only attempt to find a match at the start of the string. This leads to some pretty considerable speedups.

All the tests scripts in kerboscript_tests used to take ~1.8 seconds to parse into a ParseTree and now take ~200 milliseconds.

All the files in dunbaratu/kerboscripts used to take ~12 seconds to parse into a ParseTree and now take ~800 milliseconds.

Dunbaratu · 2017-10-18T07:11:40Z

I never really looked at the generated code in Scanner.cs and just assumed it was using the regexes in a context where "^" would mean start of line, and of course it would only find matches that start with the current character because any other match would be meaningless in the context of a token scanner. But apparently it wasn't set up that way by default. Thanks for catching this.

(In other words, I presumed the "only matches that start at current pos" would be the implied default since finding a "valid" hit by simply skipping over a run of characters that don't match any of the regex patterns is wrong behavior for a scanner to do.)

We do have access to the TinyPG source code. Would it make more sense to insert the implied caret in the TinyPG source code where it builds the Scanner.cs file instead of requiring it in the grammar file? (Is there any instance where it would make sense to leave the caret off, basically, and therefore we can't do it in TinyPG?)

I don't mind us editing TinyPG itself beause it's is open source abandonware the original author hasn't touched for years. We've already made a few tiny alterations to it to support things like case-insensitivity through lowercasing the whole string buffer instead of using regex's own case-insensitivity (which seems to be exceptionally slow in Mono for some reason).

Dunbaratu · 2017-10-18T17:05:25Z

To point you to the code I mean, this is the kOS project's altered copy of the TinyPG parser generator:

https://github.com/KSP-KOS/TinyPG

tsholmes · 2017-10-18T17:37:00Z

Ah cool. Yeah we should be able to just wrap everything in ^(<regex>) to get this for free. I'll update it there this evening

hvacengi · 2017-10-18T21:28:42Z

First off: thank you for pulling this together. Improving compiler performance is one of those things that always makes a big difference to the end users. I'm looking forward to testing out this addition with my own very large script library.

Second, regarding incorporating this change directly into TinyPG: There is a slight difference between the two optimizations. The case sensitivity optimization is done literally in the C# code by converting the string to lower case and not by modifying the regular expression (it used to append the case insensitive option, but that was removed as part of the previous optimization). In this case we would be modifying the regular expression itself. I would personally prefer that the regular expression string stored in Scanner.cs matches the one from KRISC.tpg from a consistency point of view. My vote would be that we document the standard of including ^ as a comment in the tpg.

I'm curious: because we've obviously never really run through any parsing tests, is it possible that the white space token was frequently getting ignored previously due to longer strings matching the white space? As I understand the explaination, the current implementation would match

    1.1234

against the regex of

(\d[_\d]*)*\.\d[_\d]*

while it would not match against

^(\d+(?:_\d*)*)?\.\d+(?:_\d*)*

But the match would report that the match starts at index 4. Meanwhile the new system matches the 4 white space characters as a white space token, then proceeds to parse the remainder of the line. If that's the case, we might need to pay attention to the white space token regex to make sure there aren't funny regressions with unconventional use of white space.

Dunbaratu · 2017-10-18T21:33:48Z

The whitespace token is defined in the [Skip] section. That is supposed to mean that it gets stripped out by the Scanner when it gets discovered, before being sent to the Parser. So for something like INT WHITESPACE PLUS WHITESPACE INT, what the parser sees is just INT PLUS INT, with the WHITESPACE content entirely missing.

At least, that's what the [Skip] section is supposed to mean. I admit I never actually examined the code that TinyPg generates that closely.

tsholmes · 2017-10-18T21:36:37Z

it only checks the length to find the longest if the match starts at index 0, so in that case it would end up getting a whitespace token first, and then matching on the number the next time around

Dunbaratu · 2017-10-18T21:39:36Z

@hvacengi ,
As to including the "^" in the regex, I don't entirely agree. I see that as part of the context outside the regex itself - a description of what the scanner is plugging the regex into rather than the regex itself (the same effect can be had with a flag to the regex call, I believe). The presumption should be that any characters the scanner cannot match should be an error, not merely skipped over. I honestly think that if the scanner is capable of calling a file a "match" when it had to skip over some bogus characters in order to find something that matched, that this is broken behaviour for a scanner.

hvacengi · 2017-10-18T21:43:39Z

I took a quick peak at the regex attribute flags and I didn't think that there was one labeled that made me think it had the same effect as the ^ anchor (or whatever the technical term is, I don't have any of my regex references in front of me at the moment). Maybe there is an attribute that would work, in which case I would prefer we append that attribute rather than have a discrepancy between the two sets of regex strings. I just really like the idea of Scanner.cs containing the same string you read in KRISC.tpg, so any method that doesn't modify the string as stored in Scanner.cs is a win in my book.

Dunbaratu · 2017-10-18T21:46:25Z

The other reason I dislike encoding the "^" in the kRISC.tpg file is that "^" has another meaning depending on flags, and it's the more commonly used meaning, of "this only is a match if it's at the start of a line of text". Claiming that the rule is "this is only an integer if it starts a line" is what that looks like when you include the caret. Having to include the caret is an artifact of how the Scanner is applying the rule, not an attribute of the rule itself.

tsholmes · 2017-10-18T21:59:20Z

I sort of agree with both of you on this. I think the anchor does make since in that it means start of token, but putting it in the parser generator itself also makes sense since semantically it should always be there.

Dunbaratu · 2017-10-18T22:11:06Z

To make my case another way: Consider this program:

set x to ©¶¶30 + 5£º.
print x.

The characters "©¶¶" and "£º" are not used in any token in the scanner. They are not legal, and they cause the scanner to complain right now even without the carets in the regexes.

In other words, it's ALREADY true that the carets are effectively implied without being said. It's just that the scanner is being extremely inefficient about it, searching for hits that the scanner then throws away as invalid because they don't start at index zero.

Leaving off the caret does nothing other than slow the scanner down. Leaving it off is not a thing the TPG file can meaningfully make use of anyway, therefore it belongs in the scanner because semantically it's an optimization issue, not a real change to what the rule effectively is. I would agree to having it in the TPG file instead of the scanner if and only if leaving it off caused the syntax to be different in some way. But it doesn't.

Dunbaratu · 2017-10-21T03:02:28Z

[ this comment deleted because I decided to start an issue over in our TinyPG repo instead, here: https://github.com/KSP-KOS/TinyPG/issues/3 ]

Dunbaratu · 2017-10-21T04:52:48Z

NOTE TO SELF: Remember to close this PR as wontmerge, if we instead solve the issue in TinyPG here:
KSP-KOS/TinyPG#3

hvacengi · 2017-10-24T02:41:59Z

@Dunbaratu won't KSP-KOS/TinyPG#4 result in a modified Scanner.cs still? In which case we will either need a 2nd pull request to correct Scanner.cs or this pull request will need to be modified to include that updated file.

Dunbaratu · 2017-10-24T06:21:58Z

@hvacengi Good point. I'll have to amend this PR tomorrow. I just don't think of Scanner.cs, Parser.cs, and ParseTree.cs as being "source code" because of how they're generated, so I forget to include them in the PR.

hvacengi · 2017-10-27T01:46:00Z

Close via #2145

tsholmes added 4 commits October 17, 2017 01:49

Fix scanner hang on very large numbers

8f7bf00

Allow unprefixed decimals .1234

16acaad

Fix indent

134af67

Optimize Scanner Regexes

bebecc4

tsholmes mentioned this pull request Oct 18, 2017

Fix scanner hang on very large numbers #2134

Merged

Dunbaratu mentioned this pull request Oct 21, 2017

[performance] Regex matches inefficiently find irrelevant hits that get culled out right away. KSP-KOS/TinyPG#3

Closed

Dunbaratu added the Not Ready Author is warning Reveiwer not to merge this PR yet. More edits are expected. label Oct 24, 2017

Dunbaratu mentioned this pull request Oct 24, 2017

Scanner change from Tiny PG PR #4. #2145

Merged

hvacengi closed this Oct 27, 2017

Dunbaratu added this to the v1.1.4.0 milestone Dec 23, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize Scanner Regexes #2135

Optimize Scanner Regexes #2135

tsholmes commented Oct 18, 2017

Dunbaratu commented Oct 18, 2017 •

edited

Loading

Dunbaratu commented Oct 18, 2017

tsholmes commented Oct 18, 2017

hvacengi commented Oct 18, 2017

Dunbaratu commented Oct 18, 2017

tsholmes commented Oct 18, 2017

Dunbaratu commented Oct 18, 2017

hvacengi commented Oct 18, 2017

Dunbaratu commented Oct 18, 2017

tsholmes commented Oct 18, 2017

Dunbaratu commented Oct 18, 2017 •

edited

Loading

Dunbaratu commented Oct 21, 2017 •

edited

Loading

Dunbaratu commented Oct 21, 2017

hvacengi commented Oct 24, 2017

Dunbaratu commented Oct 24, 2017

hvacengi commented Oct 27, 2017

Optimize Scanner Regexes #2135

Optimize Scanner Regexes #2135

Conversation

tsholmes commented Oct 18, 2017

Dunbaratu commented Oct 18, 2017 • edited Loading

Dunbaratu commented Oct 18, 2017

tsholmes commented Oct 18, 2017

hvacengi commented Oct 18, 2017

Dunbaratu commented Oct 18, 2017

tsholmes commented Oct 18, 2017

Dunbaratu commented Oct 18, 2017

hvacengi commented Oct 18, 2017

Dunbaratu commented Oct 18, 2017

tsholmes commented Oct 18, 2017

Dunbaratu commented Oct 18, 2017 • edited Loading

Dunbaratu commented Oct 21, 2017 • edited Loading

Dunbaratu commented Oct 21, 2017

hvacengi commented Oct 24, 2017

Dunbaratu commented Oct 24, 2017

hvacengi commented Oct 27, 2017

Dunbaratu commented Oct 18, 2017 •

edited

Loading

Dunbaratu commented Oct 18, 2017 •

edited

Loading

Dunbaratu commented Oct 21, 2017 •

edited

Loading