Core: matchGrammar now uses a linked list #1909
Conversation
Here's a current benchmark: Benchmark
My initial thoughts are here. @RunDevelopment responded here. I'd love to hear @Golmote's or @LeaVerou's thoughts on the trade-off between bundle size & performance and whether this trade-off is worth it in this case.
This also reduces the minified size
Regarding bundle size:
This is pretty smart, I like this idea. This doesn't even add a whole lot of code, really. Regarding BC, the function was never listed on https://prismjs.com/extending.html so I'd say it should be quite safe to break it. Or this PR can join the others for the 2.0 milestone.
Fair enough, let's do it.
Then I'll also make `matchGrammar` private.
The token streams in `matchGrammar` are now backed by a linked list instead of an array. This guarantees O(1) time for all operations. `matchGrammar` is now private.
This changes the implementation of `matchGrammar` to use a linked list instead of an array, replacing the array's O(n) (worst case) `splice` operation with an O(1) operation.

`Prism.tokenize` performance on the minified and dev source of Prism with all languages from the download page:

(Benchmark table: timings for the minified and dev sources, for Core + all languages, for Dev², and for runs using the JS and JSON grammars, measured over 1 to 10k samples per run.)

(Dev² is the code of dev concatenated with the code of dev)
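To make the idea concrete, here is a minimal sketch of a doubly-linked token list with constant-time insertion and removal around a known node. The `LinkedList`, `addAfter`, and `removeRange` shapes below are an assumption for illustration, not necessarily the exact code in this PR:

```js
// Minimal doubly-linked list sketch (illustrative, not the PR's exact code).
function LinkedList() {
  // Sentinel head/tail nodes avoid special-casing the ends of the list.
  const head = { value: null, prev: null, next: null };
  const tail = { value: null, prev: head, next: null };
  head.next = tail;
  this.head = head;
  this.tail = tail;
  this.length = 0;
}

// Insert `value` right after `node` in O(1) and return the new node.
function addAfter(list, node, value) {
  const next = node.next;
  const newNode = { value, prev: node, next };
  node.next = newNode;
  next.prev = newNode;
  list.length++;
  return newNode;
}

// Remove `count` nodes after `node`. This is O(count) regardless of how many
// nodes follow, unlike Array.prototype.splice, which may shift the whole tail.
function removeRange(list, node, count) {
  let next = node.next;
  let removed = 0;
  while (removed < count && next !== list.tail) {
    next = next.next;
    removed++;
  }
  node.next = next;
  next.prev = node;
  list.length -= removed;
}
```

Replacing a matched string with up to three values (text before the match, the new token, text after the match) then costs a constant number of pointer updates, no matter how many tokens come after the match.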
Why it isn't always hugely faster
Before we start: Please note that n is not the number of characters in the text but the current number of tokens in the token array/list. This number increases over time and depends on both the text being highlighted and the grammar used.
The gain in performance is significant but less than I expected for switching from an O(n) operation to an O(1) operation.
The main reason for this, I think, is that `splice` will quite often operate in O(1), which is the case when as many tokens are deleted as inserted. The easiest example of this is when a regex matches the whole string (not the whole text but a whole token string), in which case the entire string is replaced with the created token, making `splice` O(1). The more tokens there already are in the array, the more likely this is to happen, because the strings get shorter and shorter.

This means that `splice` is very likely to operate in O(n) when n is small (there are few tokens compared to the overall text length) and very likely to operate in O(1) when n is large.

This is most likely the reason why the array implementation is faster for the minified code than for the dev code: because all spaces are removed, the last regexes (which are punctuation and operators) will almost always match the entire string. This advantage is enough to offset the increase in greedy rematching, which can be seen in the extra milliseconds of the linked list implementation.
Greedy rematching is also the main reason why the gain isn't as big for the minified code.
That being said, for a text large enough, the linked list implementation will always be significantly faster.
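To illustrate the `splice` behaviour described above, here is an assumed, simplified version of the array-based update (the `tokenStream` and `replaceMatch` names are hypothetical; this is not Prism's actual `matchGrammar` code):

```js
// The token stream starts as one big string and is split up as regexes match.
const tokenStream = ['let y = foo(1) + bar(2);'];

function replaceMatch(arr, index, matchStart, matchText, tokenType) {
  const str = arr[index];
  const before = str.slice(0, matchStart);
  const after = str.slice(matchStart + matchText.length);

  const inserted = [];
  if (before) inserted.push(before);
  inserted.push({ type: tokenType, content: matchText });
  if (after) inserted.push(after);

  // One element is removed and 1-3 are inserted. If the regex matched the
  // whole string, one element replaces one element and nothing after `index`
  // has to shift; otherwise every later element shifts, which costs O(n) in
  // the number of elements after `index`.
  arr.splice(index, 1, ...inserted);
}

// Partial match: 1 removed, 3 inserted -> the tail of the array shifts.
replaceMatch(tokenStream, 0, 8, 'foo', 'function');
// tokenStream: ['let y = ', { type: 'function', content: 'foo' }, '(1) + bar(2);']

// Whole-string match: 1 removed, 1 inserted -> no shifting needed.
replaceMatch(tokenStream, 2, 0, '(1) + bar(2);', 'rest');
```

With the linked list, both cases become a constant number of pointer updates, regardless of how many tokens follow the match.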
So, what now?
The performance gain is only really noticeable for large amounts of text and comes with some extra bytes (~700 bytes minified). So this PR won't help the average user who just wants to highlight their ~200 lines of code. Half a millisecond faster or not just doesn't make a difference.
Another caveat is that this is technically a breaking change because I changed the parameters of `matchGrammar`. It's not even possible to use this function from the outside anymore because I don't expose the `LinkedList` class. Sure enough, the function isn't documented anywhere and could be seen as undefined behavior, but this PR might still break someone's code. (Why do we expose this function in the first place?)
The average user won't noticeably profit from this PR and it might even break some people's code (because `matchGrammar` is exposed (for some reason) and therefore technically public), but the performance gain for people who highlight large amounts of text is very real.

Please tell me what you think.