Core: matchGrammar now uses a linked list #1909
Conversation
Here's a current benchmark: Benchmark
My initial thoughts are here. @RunDevelopment responded here. I'd love to hear @Golmote's or @LeaVerou's thoughts on the trade-off between bundle size & performance and whether this trade-off is worth it in this case.
This also reduces the minified size
Regarding bundle size:
This is pretty smart, I like this idea. This doesn't even add a whole lot of code, really. Regarding BC, the function was never listed on https://prismjs.com/extending.html so I'd say it should be quite safe to break it. Or this PR can join the others for the 2.0 milestone.
Fair enough, let's do it.
Then I'll also make `matchGrammar` private.
The token streams in `matchGrammar` are now backed by a linked list instead of an array. This guarantees O(1) time for all operations. `matchGrammar` is now private.
This changes the implementation of `matchGrammar` to use a linked list instead of an array, replacing the array's O(n) (worst case) `splice` operation with an O(1) operation.

`Prism.tokenize` performance on the minified and dev source of Prism with all languages from the download page:

(Benchmark table: timings for the minified and dev sources, for Core + all languages, for Dev², and for runs using the JS and JSON grammars, measured over 1 to 10k samples per run.)

(Dev² is the code of dev concatenated with the code of dev)
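To make the idea concrete, here is a minimal sketch of a doubly-linked token list with constant-time insertion and removal around a known node. The `LinkedList`, `addAfter`, and `removeRange` shapes below are an assumption for illustration, not necessarily the exact code in this PR:

```js
// Minimal doubly-linked list sketch (illustrative, not the PR's exact code).
function LinkedList() {
  // Sentinel head/tail nodes avoid special-casing the ends of the list.
  const head = { value: null, prev: null, next: null };
  const tail = { value: null, prev: head, next: null };
  head.next = tail;
  this.head = head;
  this.tail = tail;
  this.length = 0;
}

// Insert `value` right after `node` in O(1) and return the new node.
function addAfter(list, node, value) {
  const next = node.next;
  const newNode = { value, prev: node, next };
  node.next = newNode;
  next.prev = newNode;
  list.length++;
  return newNode;
}

// Remove `count` nodes after `node`. This is O(count) regardless of how many
// nodes follow, unlike Array.prototype.splice, which may shift the whole tail.
function removeRange(list, node, count) {
  let next = node.next;
  let removed = 0;
  while (removed < count && next !== list.tail) {
    next = next.next;
    removed++;
  }
  node.next = next;
  next.prev = node;
  list.length -= removed;
}
```

Replacing a matched string with up to three values (text before the match, the new token, text after the match) then costs a constant number of pointer updates, no matter how many tokens come after the match.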
Why it isn't always hugely faster
Before we start: Please note that n is not the number of characters in the text but the current number of tokens in the token array/list. This number increases over time and depends on both the text being highlighted and the grammar used.
The gain in performance is significant but less than I expected for switching from an O(n) operation to an O(1) operation.
The main reason for this, I think, is that `splice` will quite often operate in O(1), which is the case when as many tokens are deleted as inserted. The easiest example of this is when a regex matches the whole string (not the whole text but a whole token string), in which case the entire string is replaced with the created token, making `splice` O(1). The more tokens there already are in the array, the more likely this is to happen, because the strings get shorter and shorter.

This means that `splice` is very likely to operate in O(n) when n is small (there are few tokens compared to the overall text length) and very likely to operate in O(1) when n is large.

This is most likely the reason why the array implementation is faster for the minified code than for the dev code: because all spaces are removed, the last regexes (which are punctuation and operators) will almost always match the entire string. This advantage is enough to offset the increase in greedy rematching, which can be seen in the extra milliseconds of the linked list implementation.
Greedy rematching is also the main reason why the gain isn't as big for the minified code.
That being said, for a text large enough, the linked list implementation will always be significantly faster.
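To illustrate the `splice` behaviour described above, here is an assumed, simplified version of the array-based update (the `tokenStream` and `replaceMatch` names are hypothetical; this is not Prism's actual `matchGrammar` code):

```js
// The token stream starts as one big string and is split up as regexes match.
const tokenStream = ['let y = foo(1) + bar(2);'];

function replaceMatch(arr, index, matchStart, matchText, tokenType) {
  const str = arr[index];
  const before = str.slice(0, matchStart);
  const after = str.slice(matchStart + matchText.length);

  const inserted = [];
  if (before) inserted.push(before);
  inserted.push({ type: tokenType, content: matchText });
  if (after) inserted.push(after);

  // One element is removed and 1-3 are inserted. If the regex matched the
  // whole string, one element replaces one element and nothing after `index`
  // has to shift; otherwise every later element shifts, which costs O(n) in
  // the number of elements after `index`.
  arr.splice(index, 1, ...inserted);
}

// Partial match: 1 removed, 3 inserted -> the tail of the array shifts.
replaceMatch(tokenStream, 0, 8, 'foo', 'function');
// tokenStream: ['let y = ', { type: 'function', content: 'foo' }, '(1) + bar(2);']

// Whole-string match: 1 removed, 1 inserted -> no shifting needed.
replaceMatch(tokenStream, 2, 0, '(1) + bar(2);', 'rest');
```

With the linked list, both cases become a constant number of pointer updates, regardless of how many tokens follow the match.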
So, what now?
The performance gain is only really noticeable for large amounts of text and comes with some extra bytes (~700 bytes minified). So this PR won't help the average user who just wants to highlight their ~200 lines of code. Half a millisecond faster or not just doesn't make a difference.
Another caveat is that this is technically a breaking change because I changed the parameters of `matchGrammar`. It's not even possible to use this function from the outside anymore because I don't expose the `LinkedList` class. Sure enough, the function isn't documented anywhere and could be seen as undefined behavior, but this PR might still break someone's code. (Why do we expose this function in the first place?)
The average user won't noticeably profit from this PR and it might even break some people's code (because `matchGrammar` is exposed (for some reason) and therefore technically public), but the performance gain for people who highlight large amounts of text is very real.

Please tell me what you think.