Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Scanner Regexes #2135

Closed
wants to merge 4 commits into from

Conversation

tsholmes
Copy link
Member

Includes changes in #2134

https://github.com/KSP-KOS/KOS/blob/develop/src/kOS.Safe/Compilation/KS/Scanner.cs#L458
In the generated Scanner, when looking for a token it runs the regex for every token that is allowed to be here, keeping the longest match that matches starting at the beginning of the string. This means that it will attempt to find a match over the entire rest of the file if it can't find one at the position its looking for and then discard it if it isn't at the start, which caused the issue fixed by #2134 to happen even in comments, since it was matching anywhere it could find it. This PR changes all of the regexes to start with ^ ensuring it will only attempt to find a match at the start of the string. This leads to some pretty considerable speedups.

All the tests scripts in kerboscript_tests used to take ~1.8 seconds to parse into a ParseTree and now take ~200 milliseconds.

All the files in dunbaratu/kerboscripts used to take ~12 seconds to parse into a ParseTree and now take ~800 milliseconds.

@Dunbaratu
Copy link
Member

Dunbaratu commented Oct 18, 2017

I never really looked at the generated code in Scanner.cs and just assumed it was using the regexes in a context where "^" would mean start of line, and of course it would only find matches that start with the current character because any other match would be meaningless in the context of a token scanner. But apparently it wasn't set up that way by default. Thanks for catching this.

(In other words, I presumed the "only matches that start at current pos" would be the implied default since finding a "valid" hit by simply skipping over a run of characters that don't match any of the regex patterns is wrong behavior for a scanner to do.)

We do have access to the TinyPG source code. Would it make more sense to insert the implied caret in the TinyPG source code where it builds the Scanner.cs file instead of requiring it in the grammar file? (Is there any instance where it would make sense to leave the caret off, basically, and therefore we can't do it in TinyPG?)

I don't mind us editing TinyPG itself beause it's is open source abandonware the original author hasn't touched for years. We've already made a few tiny alterations to it to support things like case-insensitivity through lowercasing the whole string buffer instead of using regex's own case-insensitivity (which seems to be exceptionally slow in Mono for some reason).

@Dunbaratu
Copy link
Member

To point you to the code I mean, this is the kOS project's altered copy of the TinyPG parser generator:

https://github.com/KSP-KOS/TinyPG

@tsholmes
Copy link
Member Author

Ah cool. Yeah we should be able to just wrap everything in ^(<regex>) to get this for free. I'll update it there this evening

@hvacengi
Copy link
Member

First off: thank you for pulling this together. Improving compiler performance is one of those things that always makes a big difference to the end users. I'm looking forward to testing out this addition with my own very large script library.

Second, regarding incorporating this change directly into TinyPG: There is a slight difference between the two optimizations. The case sensitivity optimization is done literally in the C# code by converting the string to lower case and not by modifying the regular expression (it used to append the case insensitive option, but that was removed as part of the previous optimization). In this case we would be modifying the regular expression itself. I would personally prefer that the regular expression string stored in Scanner.cs matches the one from KRISC.tpg from a consistency point of view. My vote would be that we document the standard of including ^ as a comment in the tpg.

I'm curious: because we've obviously never really run through any parsing tests, is it possible that the white space token was frequently getting ignored previously due to longer strings matching the white space? As I understand the explaination, the current implementation would match

    1.1234

against the regex of

(\d[_\d]*)*\.\d[_\d]*

while it would not match against

^(\d+(?:_\d*)*)?\.\d+(?:_\d*)*

But the match would report that the match starts at index 4. Meanwhile the new system matches the 4 white space characters as a white space token, then proceeds to parse the remainder of the line. If that's the case, we might need to pay attention to the white space token regex to make sure there aren't funny regressions with unconventional use of white space.

@Dunbaratu
Copy link
Member

The whitespace token is defined in the [Skip] section. That is supposed to mean that it gets stripped out by the Scanner when it gets discovered, before being sent to the Parser. So for something like INT WHITESPACE PLUS WHITESPACE INT, what the parser sees is just INT PLUS INT, with the WHITESPACE content entirely missing.

At least, that's what the [Skip] section is supposed to mean. I admit I never actually examined the code that TinyPg generates that closely.

@tsholmes
Copy link
Member Author

it only checks the length to find the longest if the match starts at index 0, so in that case it would end up getting a whitespace token first, and then matching on the number the next time around

@Dunbaratu
Copy link
Member

@hvacengi ,
As to including the "^" in the regex, I don't entirely agree. I see that as part of the context outside the regex itself - a description of what the scanner is plugging the regex into rather than the regex itself (the same effect can be had with a flag to the regex call, I believe). The presumption should be that any characters the scanner cannot match should be an error, not merely skipped over. I honestly think that if the scanner is capable of calling a file a "match" when it had to skip over some bogus characters in order to find something that matched, that this is broken behaviour for a scanner.

@hvacengi
Copy link
Member

I took a quick peak at the regex attribute flags and I didn't think that there was one labeled that made me think it had the same effect as the ^ anchor (or whatever the technical term is, I don't have any of my regex references in front of me at the moment). Maybe there is an attribute that would work, in which case I would prefer we append that attribute rather than have a discrepancy between the two sets of regex strings. I just really like the idea of Scanner.cs containing the same string you read in KRISC.tpg, so any method that doesn't modify the string as stored in Scanner.cs is a win in my book.

@Dunbaratu
Copy link
Member

The other reason I dislike encoding the "^" in the kRISC.tpg file is that "^" has another meaning depending on flags, and it's the more commonly used meaning, of "this only is a match if it's at the start of a line of text". Claiming that the rule is "this is only an integer if it starts a line" is what that looks like when you include the caret. Having to include the caret is an artifact of how the Scanner is applying the rule, not an attribute of the rule itself.

@tsholmes
Copy link
Member Author

I sort of agree with both of you on this. I think the anchor does make since in that it means start of token, but putting it in the parser generator itself also makes sense since semantically it should always be there.

@Dunbaratu
Copy link
Member

Dunbaratu commented Oct 18, 2017

To make my case another way: Consider this program:

set x to ©¶¶30 + 5£º.
print x.

The characters "©¶¶" and "£º" are not used in any token in the scanner. They are not legal, and they cause the scanner to complain right now even without the carets in the regexes.

In other words, it's ALREADY true that the carets are effectively implied without being said. It's just that the scanner is being extremely inefficient about it, searching for hits that the scanner then throws away as invalid because they don't start at index zero.

Leaving off the caret does nothing other than slow the scanner down. Leaving it off is not a thing the TPG file can meaningfully make use of anyway, therefore it belongs in the scanner because semantically it's an optimization issue, not a real change to what the rule effectively is. I would agree to having it in the TPG file instead of the scanner if and only if leaving it off caused the syntax to be different in some way. But it doesn't.

@Dunbaratu
Copy link
Member

Dunbaratu commented Oct 21, 2017

[ this comment deleted because I decided to start an issue over in our TinyPG repo instead, here: https://github.com/KSP-KOS/TinyPG/issues/3 ]

@Dunbaratu
Copy link
Member

NOTE TO SELF: Remember to close this PR as wontmerge, if we instead solve the issue in TinyPG here:
KSP-KOS/TinyPG#3

@hvacengi
Copy link
Member

@Dunbaratu won't KSP-KOS/TinyPG#4 result in a modified Scanner.cs still? In which case we will either need a 2nd pull request to correct Scanner.cs or this pull request will need to be modified to include that updated file.

@Dunbaratu
Copy link
Member

@hvacengi Good point. I'll have to amend this PR tomorrow. I just don't think of Scanner.cs, Parser.cs, and ParseTree.cs as being "source code" because of how they're generated, so I forget to include them in the PR.

@Dunbaratu Dunbaratu added the Not Ready Author is warning Reveiwer not to merge this PR yet. More edits are expected. label Oct 24, 2017
@hvacengi
Copy link
Member

Close via #2145

@hvacengi hvacengi closed this Oct 27, 2017
@Dunbaratu Dunbaratu added this to the v1.1.4.0 milestone Dec 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Not Ready Author is warning Reveiwer not to merge this PR yet. More edits are expected.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants