Core: Fixed greedy matching bug #2032

RunDevelopment · 2019-08-29T22:07:19Z

This fixes #1492 and all issues caused by it.

Instead of the old rematching algorithm, which only allowed for at most one match, the new one matches until a certain point is reached (called rematchReach).

The idea is that a greedy match, if it changes the token stream, won't change the entire token stream but only a small part of it. I call this the reach of the rematching. Everything within that small part of the token stream is then rematched.
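To make the notion of reach concrete, here is a minimal sketch of how the affected part of a token stream could be computed for a single greedy match. It is a conceptual model only, not Prism's implementation: the `rematchReach` and `entryLength` helpers and the `{ type, content }` entry shape are assumptions made for this illustration.

```javascript
// Hypothetical model: a stream entry is either a plain string (unmatched text)
// or an object { type, content } standing in for a finished token.
function entryLength(entry) {
	return typeof entry === 'string' ? entry.length : entry.content.length;
}

// Returns the text offset up to which a greedy match spanning [from, to)
// disturbs the stream: the end of the last entry the match overlaps.
function rematchReach(stream, from, to) {
	let pos = 0;
	let reach = to;
	for (const entry of stream) {
		const end = pos + entryLength(entry);
		if (pos < to && end > from) {
			// this entry overlaps the match, so it would be replaced
			reach = Math.max(reach, end);
		}
		pos = end;
	}
	return reach;
}

// Text: a = "x /* y */ z" + 1 /* c */ end
// Assume comments were tokenized first, so one comment sits inside the string.
const stream = [
	'a = "x ',
	{ type: 'comment', content: '/* y */' },
	' z" + 1 ',
	{ type: 'comment', content: '/* c */' },
	' end',
];

// The string literal "x /* y */ z" spans offsets [4, 17).
console.log(rematchReach(stream, 4, 17)); // 22
```

In this model, only the first three entries (offsets 0 to 22) are disturbed by the string match; the second comment and the trailing text are left alone.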

While correct, this can also be a lot slower than the old implementation because greedy tokens can expand the current rematching reach, which might lead to the whole text being matched again and again until it is matched correctly. So keep in mind that greedy matching now works correctly but has also become more expensive.

Notes

I refactored parts of the Core code.

This also fixes another bug greedy matching had. Greedy matching could (in certain edge cases) generate a token stream with adjacent strings. This is a problem because non-greedy matching won't work correctly then.
This was caused by !strarr[k - 1].greedy here. I removed it because it doesn't make any sense to me.
Because this was the only usage of the .greedy property of the Token class, I removed it entirely.
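Why adjacent strings are a problem can be shown with a small sketch. It assumes, as a simplification, that non-greedy matching tests each plain string of the stream on its own; `hasAdjacentStrings` and `nonGreedyFinds` are hypothetical helpers for this illustration, not part of Prism.

```javascript
// Hypothetical helper: detects two plain strings next to each other.
function hasAdjacentStrings(stream) {
	for (let i = 1; i < stream.length; i++) {
		if (typeof stream[i - 1] === 'string' && typeof stream[i] === 'string') {
			return true;
		}
	}
	return false;
}

// Simplified stand-in for non-greedy matching: every plain string is
// searched in isolation, so a match cannot cross an entry boundary.
function nonGreedyFinds(pattern, stream) {
	return stream.some(entry => typeof entry === 'string' && pattern.test(entry));
}

const merged = ['let answer = 42;'];
const split = ['let ans', 'wer = 42;']; // same text, but split in two

console.log(hasAdjacentStrings(split));             // true
console.log(nonGreedyFinds(/\banswer\b/, merged));  // true
console.log(nonGreedyFinds(/\banswer\b/, split));   // false
```

The word is present in both streams, but in the split one it straddles the boundary between two strings and is never seen by a per-string search.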

mAAdhaTTah · 2019-09-02T17:44:59Z

Yeah, I'd strongly prefer @Golmote take a look at this. It also may be worth getting actual measurements on how much this impacts performance.

tests/core/greedy.js

RunDevelopment · 2019-09-02T18:17:59Z

I was thinking about adding a benchmark which would compare the local Prism version against the one of Prism's master branch.

Any suggestions for which languages/code to benchmark?

mAAdhaTTah · 2019-09-03T00:52:11Z

DangerJS would work for this: https://danger.systems/js/

Ideally, I'd like to end up on GitHub actions and drop Travis, but it's still in beta.

RunDevelopment · 2019-09-03T14:44:55Z

That would be awesome but we still have to implement the benchmark ourselves.
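A dependency-free comparison could look roughly like the sketch below. This is not the benchmark that was later used in this PR; `measure`, `compare`, and the candidate labels are assumptions for illustration, and a serious benchmark would also need to handle JIT warm-up and variance more carefully.

```javascript
// Hypothetical harness: times `fn(input)` and returns the mean in milliseconds.
function measure(fn, input, samples) {
	fn(input); // one warm-up call before timing
	const start = process.hrtime.bigint();
	for (let i = 0; i < samples; i++) {
		fn(input);
	}
	const elapsedNs = process.hrtime.bigint() - start;
	return Number(elapsedNs) / 1e6 / samples;
}

// `candidates` maps a label to a tokenize-like function taking the source text.
function compare(candidates, input, samples = 100) {
	const results = Object.entries(candidates).map(([name, fn]) => ({
		name,
		meanMs: measure(fn, input, samples),
	}));
	const best = Math.min(...results.map(r => r.meanMs));
	// relative === 1 marks the fastest candidate
	return results.map(r => ({ ...r, relative: r.meanMs / best }));
}

// Stand-in workloads; with two Prism builds loaded, each candidate would
// instead call that build's Prism.tokenize(text, grammar).
const sample = 'lorem ipsum '.repeat(2000);
console.log(compare({
	words: s => s.split(' ').length,
	chars: s => s.split('').length,
}, sample, 20));
```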

mAAdhaTTah · 2019-09-03T15:22:36Z

Only one I know offhand is this: https://github.com/bestiejs/benchmark.js

Golmote · 2019-12-27T22:26:09Z

The code looks OK to me. Greedy matching has always been subject to a performance issue I guess. But I can't think of a way to alleviate that without rewriting everything from scratch haha.

A benchmark would be nice tho indeed, to get the proper measure of it. 👍

RunDevelopment · 2020-05-19T14:00:29Z

I re-implemented this after #1909. This actually increased the performance quite a bit. Pre #1909, it was on average about 1.3x slower (if I remember correctly) than the non-fixed version.

On average, it is now 1.06x slower, ranging from 1.12x faster to 1.54x slower.
I'm quite happy with this result and think this is ready to get merged.

@Golmote, could you please give this another look?

Benchmark

Found 9 cases with 27 files in total.
Test 2 candidates with Prism.tokenize
Estimated duration: 9m 0s

------------------------------------------------------------

c

  https://raw.githubusercontent.com/git/git/master/mergesort.c (1 kB)
  | RunDevelopment@greedy-fix     0.08ms ±  1%  176smp      
  | PrismJS@master                0.09ms ±  0%  178smp 1.01x
  https://raw.githubusercontent.com/git/git/master/mergesort.h (1 kB)
  | RunDevelopment@greedy-fix     0.02ms ±  1%  179smp 1.00x
  | PrismJS@master                0.02ms ±  0%  181smp      
  https://raw.githubusercontent.com/git/git/master/remote.c (58 kB)
  | RunDevelopment@greedy-fix     2.96ms ±  1%  164smp      
  | PrismJS@master                3.01ms ±  1%  154smp 1.02x
  https://raw.githubusercontent.com/git/git/master/remote.h (10 kB)
  | RunDevelopment@greedy-fix     0.38ms ±  1%  172smp 1.03x
  | PrismJS@master                0.37ms ±  1%  180smp      

------------------------------------------------------------

css

  ../../style.css (7 kB)
  | RunDevelopment@greedy-fix     0.94ms ±  1%  174smp      
  | PrismJS@master                0.96ms ±  1%  171smp 1.02x

------------------------------------------------------------

css!+css-extras (css)

  ../../style.css (7 kB)
  | RunDevelopment@greedy-fix     1.45ms ±  1%  173smp      
  | PrismJS@master                1.49ms ±  1%  170smp 1.03x

------------------------------------------------------------

javascript

  ../../components.json (25 kB)
  | RunDevelopment@greedy-fix     2.04ms ±  1%  174smp      
  | PrismJS@master                2.07ms ±  1%  170smp 1.01x
  ../../package-lock.json (190 kB)
  | RunDevelopment@greedy-fix    11.73ms ±  1%  135smp      
  | PrismJS@master               12.96ms ±  1%  124smp 1.10x
  ../../scripts/utopia.js (11 kB)
  | RunDevelopment@greedy-fix     1.30ms ±  1%  171smp 1.05x
  | PrismJS@master                1.24ms ±  1%  178smp      
  https://cdnjs.cloudflare.com/ajax/libs/prism/1.20.0/prism.js (29 kB)
  | RunDevelopment@greedy-fix     3.00ms ±  0%  169smp      
  | PrismJS@master                3.37ms ±  0%  169smp 1.12x
  https://cdnjs.cloudflare.com/ajax/libs/prism/1.20.0/prism.min.js (14 kB)
  | RunDevelopment@greedy-fix     2.90ms ±  0%  174smp 1.54x
  | PrismJS@master                1.88ms ±  0%  177smp      
  https://code.jquery.com/jquery-3.4.1.js (274 kB)
  | RunDevelopment@greedy-fix    29.31ms ±  0%  111smp 1.01x
  | PrismJS@master               28.89ms ±  0%  112smp      
  https://code.jquery.com/jquery-3.4.1.min.js (86 kB)
  | RunDevelopment@greedy-fix    16.59ms ±  0%  117smp 1.04x
  | PrismJS@master               15.93ms ±  0%  121smp      

------------------------------------------------------------

json

  ../../components.json (25 kB)
  | RunDevelopment@greedy-fix     1.63ms ±  1%  159smp 1.23x
  | PrismJS@master                1.33ms ±  0%  180smp      
  ../../package-lock.json (190 kB)
  | RunDevelopment@greedy-fix     7.81ms ±  1%  138smp 1.09x
  | PrismJS@master                7.15ms ±  1%  149smp      

------------------------------------------------------------

markup

  ../../download.html (4 kB)
  | RunDevelopment@greedy-fix     0.33ms ±  1%  177smp      
  | PrismJS@master                0.34ms ±  1%  174smp 1.02x
  ../../index.html (19 kB)
  | RunDevelopment@greedy-fix     1.79ms ±  1%  167smp 1.01x
  | PrismJS@master                1.78ms ±  1%  168smp      
  https://github.com/PrismJS/prism (154 kB)
  | RunDevelopment@greedy-fix    17.07ms ±  1%  114smp 1.07x
  | PrismJS@master               15.98ms ±  1%  122smp      

------------------------------------------------------------

markup!+css+javascript (markup)

  ../../download.html (4 kB)
  | RunDevelopment@greedy-fix     0.61ms ±  0%  172smp 1.01x
  | PrismJS@master                0.60ms ±  1%  174smp      
  ../../index.html (19 kB)
  | RunDevelopment@greedy-fix     2.43ms ±  1%  163smp 1.02x
  | PrismJS@master                2.39ms ±  1%  176smp      
  https://github.com/PrismJS/prism (154 kB)
  | RunDevelopment@greedy-fix    18.99ms ±  1%  129smp 1.02x
  | PrismJS@master               18.62ms ±  1%  131smp      

------------------------------------------------------------

ruby

  https://raw.githubusercontent.com/rails/rails/master/actionview/lib/action_view/base.rb (12 kB)
  | RunDevelopment@greedy-fix     0.52ms ±  1%  166smp 1.16x
  | PrismJS@master                0.45ms ±  0%  184smp      
  https://raw.githubusercontent.com/rails/rails/master/actionview/lib/action_view/layouts.rb (16 kB)
  | RunDevelopment@greedy-fix     0.64ms ±  1%  170smp 1.16x
  | PrismJS@master                0.56ms ±  0%  185smp      
  https://raw.githubusercontent.com/rails/rails/master/actionview/lib/action_view/template.rb (14 kB)
  | RunDevelopment@greedy-fix     0.82ms ±  1%  177smp 1.12x
  | PrismJS@master                0.73ms ±  0%  181smp      

------------------------------------------------------------

rust

  https://raw.githubusercontent.com/rust-lang/regex/master/src/compile.rs (40 kB)
  | RunDevelopment@greedy-fix     2.38ms ±  1%  163smp 1.15x
  | PrismJS@master                2.06ms ±  0%  176smp      
  https://raw.githubusercontent.com/rust-lang/regex/master/src/lib.rs (28 kB)
  | RunDevelopment@greedy-fix     0.38ms ±  1%  176smp      
  | PrismJS@master                0.40ms ±  0%  183smp 1.05x
  https://raw.githubusercontent.com/rust-lang/regex/master/src/utf8.rs (9 kB)
  | RunDevelopment@greedy-fix     0.62ms ±  1%  171smp 1.33x
  | PrismJS@master                0.47ms ±  0%  183smp      

------------------------------------------------------------

summary
                             best  worst  relative
  RunDevelopment@greedy-fix     9     18     1.06x
  PrismJS@master               18      9     1.00x
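The two extremes of the range can be reproduced from the means printed above. The sketch assumes the per-file factor is simply the slower mean divided by the faster one; the `relative` helper is hypothetical. Because the printed means are rounded, the factors for very small files (for example mergesort.c) cannot be recovered this way.

```javascript
// Hypothetical helper: ratio of the slower mean to the faster mean.
function relative(slowerMs, fasterMs) {
	return (slowerMs / fasterMs).toFixed(2) + 'x';
}

// prism.min.js: greedy-fix 2.90ms vs. master 1.88ms
const worstCase = relative(2.90, 1.88);
// prism.js: master 3.37ms vs. greedy-fix 3.00ms
const bestCase = relative(3.37, 3.00);

console.log(worstCase); // 1.54x (greedy-fix slower)
console.log(bestCase);  // 1.12x (greedy-fix faster)
```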

A small explanation on why we see this performance range (1.12x faster to 1.54x slower):

The basic idea behind this fix is that when a greedy token changes the token stream, it can only affect other tokens within a specific range. Within that range, we then have to rematch and that can extend the range further causing even more rematching (this can be thought of as the changes propagating through the token stream).
In the worst-worst-case, we might have to rematch the whole token stream but that's what greedy matching is. I can't do anything about this worst-case but it's very unlikely that this will ever happen anyway.

This explains both the speedup and the slowdown.
The fix can be faster because there might not be any changes to be made within the current range. In that case, we can quickly quit the rematching. The old version always has to find one match, the fixed version might get away with zero.
But because the reach can be extended any number of times, we have to do a lot more matching than the old version to be correct. You can see this with prism.js/prism.min.js. The minified file now takes a lot longer to tokenize because it's all one line, so comments, strings, templates, and regexes are much more likely to cause rematching there. In un-minified code, strings and regexes can only cause rematching within their line, so it's a lot more local and the rematching reach isn't extended as often.
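The propagation described above can be modelled with a toy loop. This is a deliberately simplified sketch, not the code in this PR: it assumes the candidate rematches are already known and sorted by start offset, and `propagate` is a hypothetical name.

```javascript
// Toy model: each candidate { start, end } is a rematch that would change the
// stream. A candidate is only processed if it starts before the current reach;
// processing it can push the reach further out.
function propagate(initialReach, candidates) {
	let reach = initialReach;
	let rematched = 0;
	for (const c of candidates) {
		if (c.start >= reach) {
			break; // nothing left inside the reach: rematching can stop early
		}
		rematched++;
		reach = Math.max(reach, c.end);
	}
	return { reach, rematched };
}

// Nothing to change within the reach: zero rematches (the fast case).
console.log(propagate(10, [])); // { reach: 10, rematched: 0 }

// Overlapping changes chain together and extend the reach (the slow case),
// while a change far beyond the reach is never looked at.
console.log(propagate(10, [
	{ start: 5, end: 20 },
	{ start: 18, end: 40 },
	{ start: 60, end: 70 },
])); // { reach: 40, rematched: 2 }
```

In this model, a single-line minified file corresponds to a long chain of overlapping candidates, while line-local strings and regexes correspond to short chains that end quickly.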

yairEO · 2020-07-06T09:53:59Z

It has been almost a year, what's going on? I need this fix...

RunDevelopment · 2020-07-06T10:26:58Z

@yairEO The main problem is that there's nobody to review it. This is a fundamental change to the core matching algorithm, so it had better be correct. I had to reimplement this after #1909, so I was reluctant to merge this even after @Golmote's approval (that was given before #1909).
But you're right. We should finally get this merged.

@mAAdhaTTah @Golmote
If it's ok, I'll merge this within the next couple of days, so it can be part of the next release. I re-reviewed it and I think it's ready as-is.

RunDevelopment added 3 commits August 29, 2019 23:19

Fixed the greedy rematching bug

f310f38

Resolved conflict

811232d

Added test

ae6da1c

RunDevelopment added the bug, needs review, and core labels Aug 29, 2019

RunDevelopment requested a review from Golmote August 29, 2019 22:07

Resolved conflict

4af2875

mAAdhaTTah reviewed Sep 2, 2019

tests/core/greedy.js (outdated)

Removed unnecessary comment

5988083

Merge branch 'master' into greedy-fix

e2f8113

RunDevelopment mentioned this pull request Jan 27, 2020

JavaScript parser is confused by a backtick in a string #2193

Closed

RunDevelopment added 4 commits May 19, 2020 13:39

Merge branch 'master' into greedy-fix

135ed95

Reimplemented fix

4184eef

Removed empty comment line

0712d2a

Changed some comments

fec4243

RunDevelopment added 2 commits May 19, 2020 16:48

Removed empty comment line

a9edd20

Merge branch 'master' into greedy-fix

aa12e65

Merge branch 'master' into greedy-fix

49cdc39

RunDevelopment merged commit 4028520 into PrismJS:master Jul 13, 2020

RunDevelopment mentioned this pull request Jul 13, 2020

php single quote issue #1431

Closed

This was referenced Jul 13, 2020

JS comments not highlighted under certain conditions #1900

Closed

JS syntax highlighting issue with a string containing /* followed by single line comments #2394

Closed

quentinvernot pushed a commit to TankerHQ/prismjs that referenced this pull request Sep 11, 2020

Core: Fixed greedy matching bug (PrismJS#2032)

bb70053

RunDevelopment mentioned this pull request Jan 8, 2021

Core: Fixed greedy rematching reach bug #2705

Merged

RunDevelopment deleted the greedy-fix branch December 3, 2021 12:26
