
Optimize tag name allocations #479

Merged

Conversation


@HellBrick HellBrick commented Dec 2, 2016

The biggest source of all evil parsing allocations is BaseTokenizer.FlushBuffer:

[Profiler screenshot: FlushBuffer memory pressure]

It's not easy to eliminate allocations here, especially when the public API is designed in a way that forces the tokenizer to allocate strings. However, there's some low-hanging fruit that can be taken care of. I've temporarily introduced a few intermediate methods to see what exactly this memory is allocated for:

[Profiler screenshot: FlushBuffer memory distribution]

This PR addresses the 18.4% caused by allocating tag names over and over again. Since HTML tag names almost never deviate from a well-known set, we can cache one instance of the string per known tag and reuse it. To make the cache check as fast as possible, the lookup code has been pre-generated to account for all the tags from the TagNames fields (the generator can be found here). As a result of this change, those 18.4% of FlushBuffer allocations are almost gone (some unique tags do occur after all, but they are quite rare):

[Profiler screenshot: FlushBuffer memory distribution after the optimisation]
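The pre-generated lookup described above can be sketched roughly like this (a minimal Java illustration with an invented tag subset and helper names, not AngleSharp's actual generated C# code): switch on the buffered characters and return a canonical, pre-allocated string for well-known tags, allocating only for unknown ones.

```java
// Sketch of the cached tag-name lookup idea: a cascade of switches over the
// first characters returns one shared string instance per known tag.
// Tag subset and names are illustrative, not the generated code.
class TagNameCache {
    static String lookup(CharSequence buffer) {
        switch (buffer.length()) {
            case 1:
                switch (buffer.charAt(0)) {
                    case 'a': return "a";
                    case 'b': return "b";
                    case 'p': return "p";
                }
                break;
            case 2:
                switch (buffer.charAt(0)) {
                    case 'b': if (buffer.charAt(1) == 'r') return "br"; break;
                    case 'h': if (buffer.charAt(1) == '1') return "h1"; break;
                    case 't': if (buffer.charAt(1) == 'd') return "td"; break;
                }
                break;
            case 3:
                switch (buffer.charAt(0)) {
                    case 'd': if (charsEqual(buffer, "div")) return "div"; break;
                    case 'i': if (charsEqual(buffer, "img")) return "img"; break;
                }
                break;
        }
        return buffer.toString(); // rare unknown tag: allocate as before
    }

    private static boolean charsEqual(CharSequence buffer, String tag) {
        for (int i = 1; i < tag.length(); i++)  // char 0 was already matched
            if (buffer.charAt(i) != tag.charAt(i)) return false;
        return true;
    }
}
```

Because every hit returns the same cached instance, repeated lookups of the same tag allocate nothing.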

Optimising away ~18% of a method that's responsible for ~24% of allocations is probably not that impressive in the grand scheme of things, but it's better than nothing, right? =) And as an additional bonus, the cache lookup is actually faster than creating all those strings:

Before:
[Benchmark screenshot: FlushTagNameBuffer - before]

After:
[Benchmark screenshot: FlushTagNameBuffer - after]

Technically speaking, adding an optional parameter to a public method, as I did to FlushBuffer, is considered a breaking change by some people: it preserves source compatibility, but not binary compatibility (the AngleSharp dll can't be swapped for a new version without recompiling the calling assembly). I don't know what your policy on this kind of thing is, but if it's a problem, it can be solved by adding an overload instead of an optional parameter. (My additional idea of adding an optional parameter to FlushBuffer was stupid, see the comment below.)

@HellBrick (Contributor, Author)

After some additional thinking, I came to the conclusion that it's better to make FlushBuffer( Func<StringBuilder, String> ) an internal overload rather than a public method. StringBuilder is an implementation detail, and it doesn't make sense to expose it publicly. It might be a good idea to eventually expose a similar extension point somewhere to allow users to provide their own string caching logic for things like attribute values, but even then StringBuilder should be hidden behind an interface to allow switching the internal implementation if and when needed.
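The public/internal split described above could look roughly like this (a Java sketch with hypothetical names; `Function` stands in for C#'s `Func<StringBuilder, String>`):

```java
import java.util.function.Function;

// Sketch of the overload split: the public method keeps its old signature,
// while a package-private overload accepts a projection so callers inside
// the library can plug in a cached-string lookup. Names are hypothetical.
class Tokenizer {
    private final StringBuilder buffer = new StringBuilder();

    Tokenizer append(CharSequence text) {
        buffer.append(text);
        return this;
    }

    // Public API: unchanged behaviour, always materializes a fresh string.
    public String flushBuffer() {
        return flushBuffer(StringBuilder::toString);
    }

    // Internal overload: StringBuilder stays an implementation detail.
    String flushBuffer(Function<StringBuilder, String> stringify) {
        String result = stringify.apply(buffer);
        buffer.setLength(0);
        return result;
    }
}
```

External callers never see the StringBuilder, so the internal representation can change later without breaking the public surface.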

@HellBrick HellBrick force-pushed the optimize-tag-name-allocations branch from 8f5f453 to de3e551 on December 2, 2016 18:48

FlorianRappl commented Dec 2, 2016

Looks great, just two thoughts regarding the actual lookup from my side:

  • We could have used a much-reduced "known" HTML tag set; only the ~90% most-used tags (thus finding a good compromise between the lookup depth / cost and the allocation cost)
  • Potentially using a hashset would be more efficient - we could compute the hash on the contained characters and obtain the item (if there is any) in O(1) time
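The hashset variant from the second bullet could be sketched like this (a Java illustration; the class and tag subset are invented): hash the buffered characters directly, so a cache hit allocates nothing at all.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the hash-based alternative: compute a hash over the buffered
// characters without building an intermediate string, then probe a prebuilt
// map of known tag names. Hypothetical helper, not part of AngleSharp.
class TagNameHashCache {
    private static final Map<Integer, String> KNOWN = new HashMap<>();
    static {
        for (String tag : new String[] { "a", "br", "div", "span", "table" })
            KNOWN.put(hash(tag), tag);
    }

    static int hash(CharSequence chars) {
        int h = 0;
        for (int i = 0; i < chars.length(); i++)
            h = 31 * h + chars.charAt(i); // same scheme as String.hashCode
        return h;
    }

    static String lookup(CharSequence buffer) {
        String cached = KNOWN.get(hash(buffer));
        // Verify on a hit: different character sequences can share a hash.
        if (cached != null && cached.contentEquals(buffer))
            return cached;
        return buffer.toString(); // unknown tag: allocate as before
    }
}
```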

@HellBrick (Contributor, Author)

> We could have used a much-reduced "known" HTML tag set; only the ~90% most-used tags (thus finding a good compromise between the lookup depth / cost and the allocation cost)

That's an interesting idea. If we use the current tree as a starting point, it's kind of difficult to get rid of the 3rd switch: it requires making some tough choices like table - label - param, embed - image or script - option. Even if throwing away unpopular tags restructured the tree so that the tags most people care about fit into two levels, I'm not sure this is the best optimisation course at this point. We spend 0.5% of the total parsing time in this method, so I'd say it's fast enough for now =)

An easy thing that can be done, though, is removing the CharsAreEqual() calls for 2-char tags. After the 3rd switch they aren't needed at all, since at that point we've already verified both tag chars. And if the tag requires only 2 switches, then we need to check just one char. But I really think it's better to spend the effort on some other part of the code at this point.

> Potentially using a hashset would be more efficient - we could compute the hash on the contained characters and obtain the item (if there is any) in O(1) time

Since we have a fixed (and I'd say fairly small) number of items, O(whatever) doesn't really matter; it's all about the constant. My gut tells me that a typical hashset computing the hash code of the full string can easily take more time than 3 switches, at least for the longer tags. If we figured out a specialised fast hash function that used only the first 3 chars and produced optimal hash codes without collisions, the resulting hashset would probably be faster. But even though it would be a fun thing to do, there's still a lot of allocations to hunt down, so maybe another time.
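A toy version of such a specialised collision-free hash, for a deliberately tiny tag set (the modulus was found by hand for this set only; a real version would be generated offline over the full tag list):

```java
// Sketch of a minimal perfect hash: (first char + length) % 11 happens to be
// collision-free for this tiny illustrative set, so lookup is one array probe
// plus a verification compare. Class and constants are invented for the sketch.
class PerfectTagHash {
    private static final String[] SLOTS = new String[11];
    static {
        for (String tag : new String[] { "a", "br", "div", "span", "table" })
            SLOTS[slot(tag)] = tag;
    }

    static int slot(CharSequence chars) {
        return (chars.charAt(0) + chars.length()) % 11;
    }

    static String lookup(CharSequence buffer) {
        String candidate = SLOTS[slot(buffer)];
        // Verify the hit: unknown tags can land on an occupied slot.
        return (candidate != null && candidate.contentEquals(buffer))
            ? candidate
            : buffer.toString();
    }
}
```

The verification compare is still needed on a hit, but a cached hit allocates nothing and touches only a couple of characters.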

@FlorianRappl FlorianRappl added this to the v0.10 milestone Dec 3, 2016
@FlorianRappl FlorianRappl merged commit d8f9167 into AngleSharp:devel Dec 4, 2016
@HellBrick HellBrick deleted the optimize-tag-name-allocations branch December 4, 2016 14:51