Fix Netkan localization parser performance #2816

HebaruSan · 2019-06-28T17:18:20Z

Problem

This netkan hangs for a long time, maybe indefinitely:

mono netkan.exe NetKAN/PolishTranslation.netkan

This is inhibiting the functioning of the bot.

Cause

I think we're experiencing catastrophic backtracking. The regex for parsing a Localization block has several instances of the same character potentially being matchable by multiple consecutive pieces of the regex.

A { could match the first or second part of this:

.*?{

A letter could match the first or second part of this:

.*?([-a-zA-Z]+

And so on. This results in a lot of retries when it reaches the end of the string without a match. In the case of PolishTranslation, I think the issue is simply that its Localization blocks are very, very long, so the inherent performance problems of this regex are multiplied.

Changes

Now the regex for Localization is rewritten to be more specific in terms of what's expected. It must start with Localization followed by optional whitespace and an open curly brace. Then some non-brace characters followed by one or more sequences of an open curly brace followed by non-brace characters followed by a close curly brace and any number of non-brace characters. Finally the final close brace. This regex no longer cares about the locale names; it leaves that to the localeRegex. This makes it much simpler to determine whether a cfg file's contents match what we're looking for, thus preventing backtracking.

Netkan is now able to inflate PolishTranslation quickly, and other netkans that were the subject of previous fixes are still working.

Fixes #2814.

DasSkelett · 2019-06-28T17:53:24Z

I'm trying to take that regex apart:

It must start with Localization followed by optional whitespace and an open curly brace.

Then one or more sequences of non-brace characters

This is the locale name (e.g. de-de, en-us, pl...) part with potential white space before and after, right?
Shouldn't that * be a +, because there have to be non-curly-brace chars, i.e. a locale name?

followed by an open curly brace

followed by non-brace characters

That's the content of the localization, right? Is there no chance that there are curly braces in the translation text?
Also, shouldn't that be at least one too?
(#2805, we test it in the locale regex again, but we could do this here too to potentially shortcut if there's only one locale)

followed by a close curly brace.

Then any number of non-brace characters

This is not white-space only for potential comments?

followed by the final close brace.

HebaruSan · 2019-06-28T18:35:39Z

Shouldn't that * be a +, because there have to be non-curly-brace chars, i.e. a locale name?

Yes, I think that makes sense. Will push that change in a moment.

That's the content of the localization, right? Is there no chance that there are curly braces in the translation text?

I don't know whether it's allowed in KSP cfg files syntactically. I'll try to find that out, but I'm pessimistic about matching such a format with a regex, and I'm trying to avoid implementing a full parser for this.

Also, shouldn't that be at least one too?

I don't think so; if someone does this, we don't want the overall Localization match to fail:

en-us
{}
es-es
{
   a = b
}

This is not white-space only for potential comments?

Just trying to make it permissive to avoid future problems with the syntax, but yes comments are a good example.

Thinking about the backtracking some more, I think we should repeat the {(non-braces)}(non-braces) part instead after a non-repeating (non-braces) block. The way I have it now (repeating the (non-braces){(non-braces)} part) the next non-braces character after a close brace could either be the end section or the start of a new block. I'll include that in the next push as well.

DasSkelett · 2019-06-28T18:48:47Z

don't know whether it's allowed in KSP cfg files syntactically. I'll try to find that out, but I'm pessimistic about matching such a format with a regex, and I'm trying to avoid implementing a full parser for this.

FYI, this here exists: https://github.com/WuphonsReach/KSP-ConfigParser

I don't think so; if someone does this, we don't want the overall Localization match to fail:

Yeah, I see that now.

Just trying to make it permissive to avoid future problems with the syntax, but yes comments are a good example.

👍

Thinking about the backtracking some more, I think we should repeat the {(non-braces)}(non-braces) part instead after a non-repeating (non-braces) block. The way I have it now (repeating the (non-braces){(non-braces)} part) the next non-braces character after a close brace could either be the end section or the start of a new block. I'll include that in the next push as well.

Sounds reasonable.

HebaruSan · 2019-06-28T19:09:34Z

FYI, this here exists: https://github.com/WuphonsReach/KSP-ConfigParser

Interesting. I think that project treats { and } as not allowed in values; if it finds one anywhere on a line, it splits the line at that point:

https://github.com/WuphonsReach/KSP-ConfigParser/blob/f42e2ab5cac5cc8f51f9e5eea2fe33b9be0c16e5/src/parse/Extensions/InputLineExtensions.cs#L9-L42

If this is how KSP does it as well, then this PR should be OK.

…king

DasSkelett · 2019-06-28T20:46:33Z

Did you find anything (official)?
I went through the config node API documentation, but there's nothing written about illegal chars, parsing, brackets, formatting or anything. Neither can I find something in the KSP wiki page about config files.
Nor can I win anything out of ModuleManager's code, I think they use KSP's parser, and only intercept for their own command keywords.

So in my feeling curly braces are not allowed in value strings,
it's probably quite hard to rewrite the regex to allow curly braces,
the chance that someone puts braces in there are relatively low, and even IF someday we encounter that case, theoretically nothing should break, just the localizations wouldn't be recognized.

So I would give this PR a go.

HebaruSan · 2019-06-28T20:49:00Z

That's exactly what I found as well; nobody documents this aspect of the format, and MM doesn't do its own parsing.

DasSkelett · 2019-06-28T21:10:52Z

Is it safe to unfreeze PolishTranslation already?
When does the bot pull the updated netkan.exe? At the beginning of the next run?

HebaruSan · 2019-06-28T21:15:43Z

At the beginning of each run, the bot gets a fresh copy of netkan.exe and a fresh update of the NetKAN repo, so unfreezing is probably safe anytime. I was going to wait to see whether this change causes any new issues with other modules though.

HebaruSan · 2019-06-29T02:20:52Z

Sigh, there's always one...

https://github.com/PapaJoesSoup/BDArmory/blob/master/BDArmory/Distribution/GameData/BDArmory/Localization/BDArmory.cfg

Though that's essentially using the localization resources incorrectly. Those should be <<1>> instead of {0} and use Localizer.Format instead of string.Format. I might cook up a PR for that if I run out of self-caused Netkan bugs to fix.

EDIT: Those resources are not being used at all. They were added along with some C# changes in the middle of a PR with dozens of commits, and then later those C# changes were lost in the shuffle of a merge from another branch.

Fix Netkan localization parser performance

1a3b972

HebaruSan added Bug Something is not working as intended Pull request Netkan Issues affecting the netkan data labels Jun 28, 2019

Require a character in name, repeat tail instead of head

8dabf43

Fix compile warning about 'e', force transforms to finish before chec…

facee43

…king

DasSkelett approved these changes Jun 28, 2019

View reviewed changes

HebaruSan merged commit facee43 into KSP-CKAN:master Jun 28, 2019

HebaruSan added a commit that referenced this pull request Jun 28, 2019

Merge #2816 Fix Netkan localization parser performance

f69b737

HebaruSan deleted the fix/loc-perf branch June 28, 2019 20:58

This was referenced Jun 28, 2019

The bot is running out of CPU credits KSP-CKAN/NetKAN#7261

Closed

Unfreeze Polish translation KSP-CKAN/NetKAN#7272

Merged

HebaruSan mentioned this pull request Jun 29, 2019

Localization fixes PapaJoesSoup/BDArmory#716

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Netkan localization parser performance #2816

Fix Netkan localization parser performance #2816

HebaruSan commented Jun 28, 2019 •

edited

Loading

DasSkelett commented Jun 28, 2019 •

edited

Loading

HebaruSan commented Jun 28, 2019

DasSkelett commented Jun 28, 2019 •

edited

Loading

HebaruSan commented Jun 28, 2019

DasSkelett commented Jun 28, 2019

HebaruSan commented Jun 28, 2019

DasSkelett commented Jun 28, 2019

HebaruSan commented Jun 28, 2019

HebaruSan commented Jun 29, 2019 •

edited

Loading

Fix Netkan localization parser performance #2816

Fix Netkan localization parser performance #2816

Conversation

HebaruSan commented Jun 28, 2019 • edited Loading

Problem

Cause

Changes

DasSkelett commented Jun 28, 2019 • edited Loading

HebaruSan commented Jun 28, 2019

DasSkelett commented Jun 28, 2019 • edited Loading

HebaruSan commented Jun 28, 2019

DasSkelett commented Jun 28, 2019

HebaruSan commented Jun 28, 2019

DasSkelett commented Jun 28, 2019

HebaruSan commented Jun 28, 2019

HebaruSan commented Jun 29, 2019 • edited Loading

HebaruSan commented Jun 28, 2019 •

edited

Loading

DasSkelett commented Jun 28, 2019 •

edited

Loading

DasSkelett commented Jun 28, 2019 •

edited

Loading

HebaruSan commented Jun 29, 2019 •

edited

Loading