Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Netkan localization parser performance #2816

Merged
merged 3 commits into from
Jun 28, 2019

Conversation

HebaruSan
Copy link
Member

@HebaruSan HebaruSan commented Jun 28, 2019

Problem

This netkan hangs for a long time, maybe indefinitely:

mono netkan.exe NetKAN/PolishTranslation.netkan

This is inhibiting the functioning of the bot.

Cause

I think we're experiencing catastrophic backtracking. The regex for parsing a Localization block has several instances of the same character potentially being matchable by multiple consecutive pieces of the regex.

A { could match the first or second part of this:

.*?{

A letter could match the first or second part of this:

.*?([-a-zA-Z]+

And so on. This results in a lot of retries when it reaches the end of the string without a match. In the case of PolishTranslation, I think the issue is simply that its Localization blocks are very, very long, so the inherent performance problems of this regex are multiplied.

Changes

Now the regex for Localization is rewritten to be more specific in terms of what's expected. It must start with Localization followed by optional whitespace and an open curly brace. Then some non-brace characters followed by one or more sequences of an open curly brace followed by non-brace characters followed by a close curly brace and any number of non-brace characters. Finally the final close brace. This regex no longer cares about the locale names; it leaves that to the localeRegex. This makes it much simpler to determine whether a cfg file's contents match what we're looking for, thus preventing backtracking.

Netkan is now able to inflate PolishTranslation quickly, and other netkans that were the subject of previous fixes are still working.

Fixes #2814.

@HebaruSan HebaruSan added Bug Something is not working as intended Pull request Netkan Issues affecting the netkan data labels Jun 28, 2019
@DasSkelett
Copy link
Member

DasSkelett commented Jun 28, 2019

I'm trying to take that regex apart:

It must start with Localization followed by optional whitespace and an open curly brace.


Then one or more sequences of non-brace characters

This is the locale name (e.g. de-de, en-us, pl...) part with potential white space before and after, right?
Shouldn't that * be a +, because there have to be non-curly-brace chars, i.e. a locale name?


followed by an open curly brace


followed by non-brace characters

That's the content of the localization, right? Is there no chance that there are curly braces in the translation text?
Also, shouldn't that be at least one too?
(#2805, we test it in the locale regex again, but we could do this here too to potentially shortcut if there's only one locale)


followed by a close curly brace.


Then any number of non-brace characters

This is not white-space only for potential comments?


followed by the final close brace.

@HebaruSan
Copy link
Member Author

Shouldn't that * be a +, because there have to be non-curly-brace chars, i.e. a locale name?

Yes, I think that makes sense. Will push that change in a moment.

That's the content of the localization, right? Is there no chance that there are curly braces in the translation text?

I don't know whether it's allowed in KSP cfg files syntactically. I'll try to find that out, but I'm pessimistic about matching such a format with a regex, and I'm trying to avoid implementing a full parser for this.

Also, shouldn't that be at least one too?

I don't think so; if someone does this, we don't want the overall Localization match to fail:

en-us
{}
es-es
{
   a = b
}

This is not white-space only for potential comments?

Just trying to make it permissive to avoid future problems with the syntax, but yes comments are a good example.

Thinking about the backtracking some more, I think we should repeat the {(non-braces)}(non-braces) part instead after a non-repeating (non-braces) block. The way I have it now (repeating the (non-braces){(non-braces)} part) the next non-braces character after a close brace could either be the end section or the start of a new block. I'll include that in the next push as well.

@DasSkelett
Copy link
Member

DasSkelett commented Jun 28, 2019

don't know whether it's allowed in KSP cfg files syntactically. I'll try to find that out, but I'm pessimistic about matching such a format with a regex, and I'm trying to avoid implementing a full parser for this.

FYI, this here exists: https://github.com/WuphonsReach/KSP-ConfigParser

I don't think so; if someone does this, we don't want the overall Localization match to fail:

Yeah, I see that now.

Just trying to make it permissive to avoid future problems with the syntax, but yes comments are a good example.

👍

Thinking about the backtracking some more, I think we should repeat the {(non-braces)}(non-braces) part instead after a non-repeating (non-braces) block. The way I have it now (repeating the (non-braces){(non-braces)} part) the next non-braces character after a close brace could either be the end section or the start of a new block. I'll include that in the next push as well.

Sounds reasonable.

@HebaruSan
Copy link
Member Author

FYI, this here exists: https://github.com/WuphonsReach/KSP-ConfigParser

Interesting. I think that project treats { and } as not allowed in values; if it finds one anywhere on a line, it splits the line at that point:

https://github.com/WuphonsReach/KSP-ConfigParser/blob/f42e2ab5cac5cc8f51f9e5eea2fe33b9be0c16e5/src/parse/Extensions/InputLineExtensions.cs#L9-L42

If this is how KSP does it as well, then this PR should be OK.

@DasSkelett
Copy link
Member

Did you find anything (official)?
I went through the config node API documentation, but there's nothing written about illegal chars, parsing, brackets, formatting or anything. Neither can I find something in the KSP wiki page about config files.
Nor can I win anything out of ModuleManager's code, I think they use KSP's parser, and only intercept for their own command keywords.

So in my feeling curly braces are not allowed in value strings,
it's probably quite hard to rewrite the regex to allow curly braces,
the chance that someone puts braces in there are relatively low, and even IF someday we encounter that case, theoretically nothing should break, just the localizations wouldn't be recognized.

So I would give this PR a go.

@HebaruSan
Copy link
Member Author

That's exactly what I found as well; nobody documents this aspect of the format, and MM doesn't do its own parsing.

@HebaruSan HebaruSan merged commit facee43 into KSP-CKAN:master Jun 28, 2019
@HebaruSan HebaruSan deleted the fix/loc-perf branch June 28, 2019 20:58
@DasSkelett
Copy link
Member

Is it safe to unfreeze PolishTranslation already?
When does the bot pull the updated netkan.exe? At the beginning of the next run?

@HebaruSan
Copy link
Member Author

At the beginning of each run, the bot gets a fresh copy of netkan.exe and a fresh update of the NetKAN repo, so unfreezing is probably safe anytime. I was going to wait to see whether this change causes any new issues with other modules though.

@HebaruSan
Copy link
Member Author

HebaruSan commented Jun 29, 2019

Sigh, there's always one...

https://github.com/PapaJoesSoup/BDArmory/blob/master/BDArmory/Distribution/GameData/BDArmory/Localization/BDArmory.cfg

Though that's essentially using the localization resources incorrectly. Those should be <<1>> instead of {0} and use Localizer.Format instead of string.Format. I might cook up a PR for that if I run out of self-caused Netkan bugs to fix.

EDIT: Those resources are not being used at all. They were added along with some C# changes in the middle of a PR with dozens of commits, and then later those C# changes were lost in the shuffle of a merge from another branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something is not working as intended Netkan Issues affecting the netkan data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] NetKAN Hanging on Localizations
2 participants