
Added test for polynomial backtracking #2597

Merged
merged 28 commits on Dec 28, 2020

Conversation

@RunDevelopment (Member) commented Oct 17, 2020

This was branched from #2590. Blocked by #2590.


Very similar to #2590, I added a test that checks for polynomial backtracking in Prism's regexes. It only detects one kind of polynomial backtracking but even that yields hundreds of potential vulnerabilities. The problem this time around is that I don't know how to fix them all.

The problem

Let's look at one example from JavaScript to illustrate the problem: /[_$a-z\xA0-\uFFFF][$\w\xA0-\uFFFF]*(?=\s*=>)/i
Do you see it? Here is the output of the test:

Prism.languages.javascript.parameter[1].pattern: Polynomial backtracking. By repeating any character that matches /[\xa0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]/i, an attack string can be created.

    [$\w\xA0-\uFFFF]*(?=\s*
    ^~~~~~~~~~~~~~~~~   ^~~

Full pattern:
/[_$a-z\xA0-\uFFFF][$\w\xA0-\uFFFF]*(?=\s*=>)/i
                   ^~~~~~~~~~~~~~~~~   ^~~

The problem is that [$\w\xA0-\uFFFF] and \s are not disjoint, so we can just take any character that the two have in common and create an attack string. The test tells us that the available characters are [\xa0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff].

Using this knowledge, we can create the attack string "\xA0".repeat(n) for any n. The above regex will take O(n^3) time to reject strings of that form. On my laptop, it takes over 1 second for /[_$a-z\xA0-\uFFFF][$\w\xA0-\uFFFF]*(?=\s*=>)/i.test("\xa0".repeat(2000)) to return false.
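The attack can be reproduced directly. This sketch keeps n small so it terminates quickly; the time to reject grows roughly cubically with n:

```javascript
// The vulnerable JavaScript pattern from above.
const vulnerable = /[_$a-z\xA0-\uFFFF][$\w\xA0-\uFFFF]*(?=\s*=>)/i;

// "\xA0" is matched by both [$\w\xA0-\uFFFF] and \s, so the engine has to
// backtrack through every split of the run before it can give up.
const attack = n => "\xA0".repeat(n);

vulnerable.test(attack(500)); // false, but rejection takes ~O(n^3) time
vulnerable.test("a\xA0=>");   // true: a normal, non-adversarial match
```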

Now that you understand the example, how do we fix this?
Well, we "just" have to make [$\w\xA0-\uFFFF] and \s disjoint. One way to do it is to subtract \s from [$\w\xA0-\uFFFF], but how do we do that? The other is to subtract [$\w\xA0-\uFFFF] from \s.

This question goes out to my fellow maintainers. How do we fix this?
There are a lot of other examples like this too. Just take a look at the TravisCI logs. There are literally hundreds of these cases and most are non-trivial to fix.

@Golmote @mAAdhaTTah @LeaVerou

How does the analysis work?

It's not necessary to understand the analysis in order to know that all the problems it finds are vulnerabilities that should be fixed. If you don't care about the implementation details, skip this section.

I go through all quantifiers that have an unbounded maximum (e.g. +, {2,}). A quantifier's element is then converted into a character set that is guaranteed to be accepted by the element (e.g. a+ -> [a], (a|bb|c)* -> [ac], (\w\s*|a\b|foo)* -> []). The basic idea is that we then know all characters that the quantifier will trivially accept.
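For the simple alternation case, this conversion can be sketched as follows. This is a hypothetical helper (triviallyAccepted is not Prism code, and real alternatives are AST nodes, not strings); it keeps only the alternatives that consist of exactly one character:

```javascript
// Hypothetical sketch of the character-set conversion described above.
// Each alternative is given as a plain string; only single-character
// alternatives are trivially accepted by one iteration of the quantifier.
function triviallyAccepted(alternatives) {
  const chars = new Set();
  for (const alt of alternatives) {
    if (alt.length === 1) chars.add(alt);
  }
  return [...chars].sort();
}

triviallyAccepted(["a", "bb", "c"]); // (a|bb|c)* -> ["a", "c"]
triviallyAccepted(["a"]);            // a+        -> ["a"]
```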

To get polynomial backtracking, we just have to find a compatible quantifier and a path to get there. This is done by examining the regex AST nodes that can be reached from the initial quantifier and then recursively continuing the search until we either cannot proceed or have found a compatible quantifier.

Examples:

  1. The easiest example is a*a*. The initial quantifier is the left a* with the character set [a]. We then find the right a* reachable from the left a*. The right a* is compatible with [a], so we just found a match.
    The attack string is "a"*n+$. ($ is some suffix that causes the full regex to reject all inputs.)
  2. [bh\d]+[a-z][a-fA-F]+: The initial quantifier is [bh\d]+ with the character set [0-9bh]. We then find [a-z] reachable from [bh\d]+. To proceed, we have to narrow down the available character set from [0-9bh] to [bh]. From [a-z] reachable is [a-fA-F]+. We narrow down the available character set from [bh] to [b] to match the character set of [a-fA-F]+. Since we still have available characters, [bh\d]+ and [a-fA-F]+ are compatible.
    The attack string is "b"*n+$.
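Example 1 can be verified directly. In this sketch the suffix "b" plays the role of $, a character the attack string never contains, so the regex must reject:

```javascript
// a*a* with an impossible suffix: the engine tries every way to split
// the run of "a"s between the two quantifiers, O(n^2) steps in total.
const quadratic = /^a*a*b$/;

quadratic.test("ab");             // true: a normal match
quadratic.test("a".repeat(2000)); // false, rejected in quadratic time
```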

This is a very limited analysis but it's still quite good.

@LeaVerou (Member) commented Oct 19, 2020

For single character matches, you can subtract character classes by using a negative lookahead before the character class. E.g. for a word character that's not also a number, you can do /(?!\d)[\w]/.
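The suggested subtraction can be checked directly, using the example from this comment:

```javascript
// (?!\d)\w: a word character that is not also a digit.
const wordNotDigit = /^(?!\d)\w$/;

wordNotDigit.test("a"); // true: letter
wordNotDigit.test("_"); // true: underscore is \w but not \d
wordNotDigit.test("5"); // false: digits are excluded by the lookahead
```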

However, I am a bit wary of introducing a performance overhead on every single Prism usage, by default, to prevent this, which is only a concern when highlighting unsafe content.

@RunDevelopment (Member, Author) commented Oct 19, 2020

I usually try to avoid negative lookaheads to fix backtracking issues for 3 reasons:

  1. The performance overhead.
    Most browsers are unable to optimize away the lookahead (?!a)b (for any character sets a and b) even though it's a trivial optimization.
  2. Tooling.
    Lookarounds are not easy to handle for tooling. The checks for polynomial and exponential backtracking both have trouble with them and are usually forced to ignore the branches that contain them. If you try to fix the backtracking with lookaheads and make a mistake, you'll never find out.
  3. Simplicity.
Having a bunch of seemingly random lookaheads doesn't make a regex easier to understand.

While I could do something about points 1 and 2 (tooling can be improved and we could transform our regexes to do these trivial optimizations as part of the build process), point 3 still remains.

which is only a concern when highlighting unsafe content.

I'd rather have Prism be safe by default or at least as safe as we can manage.

@LeaVerou (Member) commented Oct 19, 2020

The alternative I suppose is to manually compute the subtraction result as a character class, but is that really better simplicity-wise?

I'd rather have Prism be safe by default or at least as safe as we can manage.

At any cost?

@RunDevelopment (Member, Author) commented Oct 22, 2020

The alternative I suppose is to manually compute the subtraction result as a character class, but is that really better simplicity-wise?

In some cases, it is but here... no.

Should I write a little Babel plugin that applies the lookaheads, so we don't have to worry as much about the performance overhead? At least the minified files would be fast. (See side note below.)

I'd rather have Prism be safe by default or at least as safe as we can manage.

At any cost?

No. If I wanted to guarantee O(n) runtime for Prism on all input, I'd have to reimplement everything using something like an LR parser. Yeah... let's not for the time being.

Side note

Even the fixed version of the above example will take O(n^2) time for some strings:

/(?!\s)[_$a-z\xA0-\uFFFF](?:(?!\s)[$\w\xA0-\uFFFF])*(?=\s*=>)/i
    .test("a".repeat(5_000)) // takes about 35 milliseconds on my computer

To make it O(n) for all input, we'd have to add a lookbehind as well:

/(?<!(?!\s)[_$a-z\d\xA0-\uFFFF])(?!\s)[_$a-z\xA0-\uFFFF](?:(?!\s)[$\w\xA0-\uFFFF])*(?=\s*=>)/i
    .test("a".repeat(5_000 * 1000)) // takes about 70 milliseconds on my computer

After applying all the (?!\s) lookaheads, the regex will run about 4 times faster and looks like this:

/(?<![^\s\x00-\x08\x0e-\x1f!"#\x25-\x2f\x3a-\x40[\\\]`\x7b-\x9f^])[^\s\d\x00-\x08\x0e-\x1f!"#\x25-\x2f\x3a-\x5e`\x7b-\x9f][^\s\x00-\x08\x0e-\x1f!"#\x25-\x2f\x3a-\x40[\\\]`\x7b-\x9f^]*(?=\s*=>)/i
    .test("a".repeat(5_000 * 1000)) // takes about 16 milliseconds on my computer

/Side note

Side side note

While playing around, I found that the lookahead versions also have another problem: The stack.

For long strings (>1MB), both Chrome and Firefox throw an error:

/(?<!(?!\s)[_$a-z\d\xA0-\uFFFF])(?!\s)[_$a-z\xA0-\uFFFF](?:(?!\s)[$\w\xA0-\uFFFF])*(?=\s*=>)/i.test("a".repeat(10e6))
// Chrome 86
> Uncaught RangeError: Maximum call stack size exceeded
// FireFox 80
> Uncaught InternalError: too much recursion

The version with applied lookaheads doesn't have this problem. The regex will happily test strings >100MB. Well, it will until we hit the hard limits of the browsers (about 500MB in Chrome and about 1GB in Firefox).

/Side side note

While it would be really nice to have guaranteed O(n) runtime for all of our regexes, it is impossible in general (e.g. /(.+)\1/) and not easy even where it is possible.
However, I do think that we should aim for guaranteed O(n^2). This will ensure that Prism is able to highlight malicious inputs <10kB in at most a few seconds. From what I've seen in the wild, this will probably cover >95% of our users. The added tests will point out (to the best of their ability) possible ways to go higher than O(n^2). If those issues are fixed, we'll be reasonably close to guaranteeing O(n^2) runtime for all regexes.

The cost? Fixing 200-400 of Prism's regexes.
Truth be told, I already fixed more than half of them (many systematic issues that are easy to fix). The problems are with regexes like the above where it's not easy to subtract the character sets.

@LeaVerou You're right. I'll use lookaheads when there's no other way and be done with it. I was hoping that there was some better way that I had overlooked, but unfortunately it seems like there isn't. Talking about it helped me clear this up. Thank you.

Side side side note

If we ever wanted to try to get O(n) for all regexes (where it's possible), we can use the current test for polynomial backtracking.

Basically, it's possible to exploit the linear backtracking of regex R if the regex /[\s\S]*<R>/ is vulnerable to polynomial backtracking.
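A small illustration of this trick (the pattern /a+b/ is an assumed example, not taken from Prism): /a+b/ on its own only backtracks linearly, but prefixing [\s\S]* exposes that linear backtracking as polynomial, because [\s\S] and a are not disjoint:

```javascript
// R = /a+b/ backtracks only linearly on its own; /[\s\S]*<R>/ turns that
// into polynomial backtracking, since [\s\S]* and a+ overlap on "a".
const probe = /^[\s\S]*a+b/;

probe.test("xxab");             // true: a normal match
probe.test("a".repeat(2000));   // false, rejected in O(n^2) steps
```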

/Side side side note

@LeaVerou (Member) commented Oct 22, 2020

Thank you for the thoughtful analysis @RunDevelopment!
I'm not sure an ad hoc Babel plugin would be the way to go, not to mention it's more work for you.
Perhaps we could use a regexp library like XRegExp to do this client-side? It only needs doing once, so the benefits of precomputing it are small, whereas making the minified version functionally different is not ideal.

@RunDevelopment (Member, Author) commented Oct 22, 2020

making the minified version functionally different is not ideal.

True, but we already do that to some extent (#2296). I'm all for using tricks to optimize the minified files, but I'm also keenly aware that this might go wrong someday. We could run all language tests against the minified files to prevent that, though.

Perhaps we could use a regexp library like XRegExp to do this client-side?

How do we use XRegExp here? It's an amazing library but it doesn't optimize regexes.

@LeaVerou (Member) commented Oct 22, 2020

How do we use XRegExp here? It's an amazing library but it doesn't optimize regexes.

It allows building regexes from smaller regexes, which may make it easier to subtract groups, and once we use it, it may also help simplify our grammars.
I haven't personally used XRegExp, so I may be off about what it does; I just respect Steven Levithan's regexp expertise, so I kind of assume it does everything 😁 but I'm fine with any other library too. We just need a helper to subtract regexes, right? That must exist.

@RunDevelopment (Member, Author) commented Oct 22, 2020

We just need a helper to subtract regexes, right? That must exist.

Are you sure about that?
The closest thing I know is regexp-tree and its optimizer API and even that doesn't optimize the lookaheads away. (However, their transpiler API could allow us features like named capturing groups (not at runtime), so it might be worth looking into even though it seems like their optimizer has a few problems with correctness.)

There is surprisingly little tooling for regexes available in the JS ecosystem.

@RunDevelopment mentioned this pull request Oct 25, 2020
@joshgoebel commented Nov 18, 2020

which is only a concern when highlighting unsafe content.

Is this not a common use case for Prism? I know at least one very large organization that uses Prism for highlighting user input... and I know of several (and imagine many more) that use Highlight.js in exactly such a fashion (chat rooms, message boards, etc.)... all filled with content provided by users.

Obviously one can imagine many scenarios where all the content is "safe" but I guess if you'd asked me off the top of my head I'd have guessed much more unsafe content than safe.
