
Optimize tag name allocations #479

Merged

Conversation


@HellBrick HellBrick commented Dec 2, 2016

The biggest source of all evil parsing allocations is BaseTokenizer.FlushBuffer:

[Profiler screenshot: FlushBuffer memory pressure]

It's not easy to eliminate allocations here, especially when the public API is designed in a way that forces the tokenizer to allocate strings. However, there's some low-hanging fruit that can be taken care of. I've temporarily introduced a few intermediate methods to see what exactly this memory is allocated for:

[Profiler screenshot: FlushBuffer memory distribution]

This PR addresses the 18.4% caused by allocating tag names over and over again. Since HTML tag names almost never deviate from a well-known set, we can cache one instance of the string per known tag and reuse it. To make the cache check as fast as possible, the lookup code has been pre-generated to account for all the tags from the TagNames fields (the generator can be found here). As a result of this change, those 18.4% of FlushBuffer allocations are almost gone (some unique tags do occur after all, but they are quite rare):

[Profiler screenshot: FlushBuffer memory distribution after the optimisation]
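The pre-generated lookup described above can be sketched roughly like this (a minimal Java illustration with an invented tag subset and helper names, not AngleSharp's actual generated C# code): switch on the buffered characters and return a canonical, pre-allocated string for well-known tags, allocating only for unknown ones.

```java
// Sketch of the cached tag-name lookup idea: a cascade of switches over the
// first characters returns one shared string instance per known tag.
// Tag subset and names are illustrative, not the generated code.
class TagNameCache {
    static String lookup(CharSequence buffer) {
        switch (buffer.length()) {
            case 1:
                switch (buffer.charAt(0)) {
                    case 'a': return "a";
                    case 'b': return "b";
                    case 'p': return "p";
                }
                break;
            case 2:
                switch (buffer.charAt(0)) {
                    case 'b': if (buffer.charAt(1) == 'r') return "br"; break;
                    case 'h': if (buffer.charAt(1) == '1') return "h1"; break;
                    case 't': if (buffer.charAt(1) == 'd') return "td"; break;
                }
                break;
            case 3:
                switch (buffer.charAt(0)) {
                    case 'd': if (charsEqual(buffer, "div")) return "div"; break;
                    case 'i': if (charsEqual(buffer, "img")) return "img"; break;
                }
                break;
        }
        return buffer.toString(); // rare unknown tag: allocate as before
    }

    private static boolean charsEqual(CharSequence buffer, String tag) {
        for (int i = 1; i < tag.length(); i++)  // char 0 was already matched
            if (buffer.charAt(i) != tag.charAt(i)) return false;
        return true;
    }
}
```

Because every hit returns the same cached instance, repeated lookups of the same tag allocate nothing.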

Optimising away ~18% of a method that's responsible for ~24% of allocations is probably not that impressive in the grand scheme of things, but it's better than nothing, right? =) And as an additional bonus, the cache lookup is actually faster than creating all those strings:

Before:
[Benchmark screenshot: FlushTagNameBuffer - before]

After:
[Benchmark screenshot: FlushTagNameBuffer - after]

Technically speaking, adding an optional parameter to a public method, as I did to FlushBuffer, is considered a breaking change by some people: it preserves source compatibility, but not binary compatibility (the AngleSharp dll can't be swapped for a new version without recompiling the calling assembly). I don't know what your policy on this kind of thing is, but if it's a problem, it can be solved by adding an overload instead of an optional parameter. (My additional idea of adding an optional parameter to FlushBuffer was stupid, see the comment below.)

@HellBrick (Contributor, Author)

After some additional thinking, I came to the conclusion that it's better to make FlushBuffer( Func<StringBuilder, String> ) an internal overload rather than a public method. StringBuilder is an implementation detail, and it doesn't make sense to expose it publicly. It might be a good idea to eventually expose a similar extension point somewhere to allow users to provide their own string caching logic for things like attribute values, but even then StringBuilder should be hidden behind an interface to allow switching the internal implementation if and when needed.
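The public/internal split described above could look roughly like this (a Java sketch with hypothetical names; `Function` stands in for C#'s `Func<StringBuilder, String>`):

```java
import java.util.function.Function;

// Sketch of the overload split: the public method keeps its old signature,
// while a package-private overload accepts a projection so callers inside
// the library can plug in a cached-string lookup. Names are hypothetical.
class Tokenizer {
    private final StringBuilder buffer = new StringBuilder();

    Tokenizer append(CharSequence text) {
        buffer.append(text);
        return this;
    }

    // Public API: unchanged behaviour, always materializes a fresh string.
    public String flushBuffer() {
        return flushBuffer(StringBuilder::toString);
    }

    // Internal overload: StringBuilder stays an implementation detail.
    String flushBuffer(Function<StringBuilder, String> stringify) {
        String result = stringify.apply(buffer);
        buffer.setLength(0);
        return result;
    }
}
```

External callers never see the StringBuilder, so the internal representation can change later without breaking the public surface.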

@HellBrick HellBrick force-pushed the optimize-tag-name-allocations branch from 8f5f453 to de3e551 on December 2, 2016 18:48

FlorianRappl commented Dec 2, 2016

Looks great, just two thoughts regarding the actual lookup from my side:

  • We could have used a much-reduced "known" HTML tag set; only the ~90% most-used tags (thus finding a good compromise between the lookup depth / cost and the allocation cost)
  • Potentially using a hashset would be more efficient - we could compute the hash on the contained characters and obtain the item (if there is any) in O(1) time
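The hashset variant from the second bullet could be sketched like this (a Java illustration; the class and tag subset are invented): hash the buffered characters directly, so a cache hit allocates nothing at all.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the hash-based alternative: compute a hash over the buffered
// characters without building an intermediate string, then probe a prebuilt
// map of known tag names. Hypothetical helper, not part of AngleSharp.
class TagNameHashCache {
    private static final Map<Integer, String> KNOWN = new HashMap<>();
    static {
        for (String tag : new String[] { "a", "br", "div", "span", "table" })
            KNOWN.put(hash(tag), tag);
    }

    static int hash(CharSequence chars) {
        int h = 0;
        for (int i = 0; i < chars.length(); i++)
            h = 31 * h + chars.charAt(i); // same scheme as String.hashCode
        return h;
    }

    static String lookup(CharSequence buffer) {
        String cached = KNOWN.get(hash(buffer));
        // Verify on a hit: different character sequences can share a hash.
        if (cached != null && cached.contentEquals(buffer))
            return cached;
        return buffer.toString(); // unknown tag: allocate as before
    }
}
```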

@HellBrick (Contributor, Author)

> We could have used a much-reduced "known" HTML tag set; only the ~90% most-used tags (thus finding a good compromise between the lookup depth / cost and the allocation cost)

That's an interesting idea. If we use the current tree as a starting point, it's kind of difficult to get rid of the 3rd switch: it requires making some tough choices like table - label - param, embed - image or script - option. Even if throwing away unpopular tags restructured the tree so that the tags most people care about fit into two levels, I'm not sure this is the best optimisation course at this point. We spend 0.5% of the total parsing time in this method, so I'd say it's fast enough for now =)

An easy thing that can be done, though, is removing the CharsAreEqual() calls for 2-char tags. After the 3rd switch they aren't needed at all, since at that point we've already verified both tag chars. And if the tag requires only 2 switches, then we need to check just one char. But I really think it's better to spend the effort on some other part of the code at this point.

> Potentially using a hashset would be more efficient - we could compute the hash on the contained characters and obtain the item (if there is any) in O(1) time

Since we have a fixed (and I'd say fairly small) number of items, O(whatever) doesn't really matter; it's all about the constant. My gut tells me that a typical hashset computing the hash code of the full string can easily take more time than 3 switches, at least for the longer tags. If we figured out a specialised fast hash function that used only the first 3 chars and produced optimal hash codes without collisions, the resulting hashset would probably be faster. But even though it would be a fun thing to do, there's still a lot of allocations to hunt down, so maybe another time.
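A toy version of such a specialised collision-free hash, for a deliberately tiny tag set (the modulus was found by hand for this set only; a real version would be generated offline over the full tag list):

```java
// Sketch of a minimal perfect hash: (first char + length) % 11 happens to be
// collision-free for this tiny illustrative set, so lookup is one array probe
// plus a verification compare. Class and constants are invented for the sketch.
class PerfectTagHash {
    private static final String[] SLOTS = new String[11];
    static {
        for (String tag : new String[] { "a", "br", "div", "span", "table" })
            SLOTS[slot(tag)] = tag;
    }

    static int slot(CharSequence chars) {
        return (chars.charAt(0) + chars.length()) % 11;
    }

    static String lookup(CharSequence buffer) {
        String candidate = SLOTS[slot(buffer)];
        // Verify the hit: unknown tags can land on an occupied slot.
        return (candidate != null && candidate.contentEquals(buffer))
            ? candidate
            : buffer.toString();
    }
}
```

The verification compare is still needed on a hit, but a cached hit allocates nothing and touches only a couple of characters.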

@FlorianRappl FlorianRappl added this to the v0.10 milestone Dec 3, 2016
@FlorianRappl FlorianRappl merged commit d8f9167 into AngleSharp:devel Dec 4, 2016
@HellBrick HellBrick deleted the optimize-tag-name-allocations branch December 4, 2016 14:51