Defer building HtmlPage id/name lookup index until first read#1116
Merged
rbri merged 2 commits intoHtmlUnit:masterfrom Apr 27, 2026
Merged
Defer building HtmlPage id/name lookup index until first read#1116rbri merged 2 commits intoHtmlUnit:masterfrom
rbri merged 2 commits intoHtmlUnit:masterfrom
Conversation
Every element added during parsing eagerly invoked addMappedElement,
which issued two getAttribute("id"/"name") lookups, two
ConcurrentHashMap.put operations, and (when applicable) allocated a
MappedElementIndexEntry plus its inner ArrayList. For pages parsed
and walked via querySelector / iteration without any getElementById
call, the entire index is built and never read.
Add a mappedElementsBuilt_ flag on HtmlPage; addMappedElement and
removeMappedElement early-return until it flips. A new private
ensureMappedElementsBuilt() walks the document once on first read of
getElementById / getElementsById / getElementByName / getElementsByName
/ getElementsByIdAndOrName and populates both idMap_ and nameMap_;
subsequent mutations through addMappedElement/removeMappedElement
stay incremental as before. clone() resets the flag so cloned pages
re-lazy-build.
JMH HtmlUnitBenchmark.JMH on a ~30 KB HTML page (5 forks x 3 warmup
x 5 measurement = 25 samples, paired sequentially):
baseline: 633.205 +- 8.793 us/op
with this: 593.611 +- 13.110 us/op
delta: -39.6 us, -6.3%
Targeted tests across HtmlPageTest, HtmlPage2Test, HtmlPage3Test,
HtmlInlineFrameTest, HtmlFrameTest, HtmlFrameSetTest, HTMLDocumentTest,
Html*ParserTest: 1176 tests, 0 failures.
rbri
reviewed
Apr 27, 2026
| if (mappedElementsBuilt_) { | ||
| return; | ||
| } | ||
| mappedElementsBuilt_ = true; |
Member
There was a problem hiding this comment.
i guess we have to set this after the maps populated?
Contributor
Author
There was a problem hiding this comment.
Good catch — fixed in 342862f. Moved mappedElementsBuilt_ = true to after the addElement calls so a partial failure mid-walk leaves built_=false and the next read retries instead of silently returning a half-populated index.
Per review feedback: the flag was being flipped before addElement() ran, so a partial failure mid-walk would leave the page with built_=true and a half-populated index, and subsequent reads would silently return incomplete results. Move the assignment after the populate so a thrown exception leaves built_=false and the next read retries.
|
Member
|
Cool idea, many thanks |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Every element added during parsing eagerly invoked addMappedElement, which issued two getAttribute("id"/"name") lookups, two ConcurrentHashMap.put operations, and (when applicable) allocated a MappedElementIndexEntry plus its inner ArrayList. For pages parsed and walked via querySelector / iteration without any getElementById call, the entire index is built and never read.
Add a mappedElementsBuilt_ flag on HtmlPage; addMappedElement and removeMappedElement early-return until it flips. A new private ensureMappedElementsBuilt() walks the document once on first read of getElementById / getElementsById / getElementByName / getElementsByName / getElementsByIdAndOrName and populates both idMap_ and nameMap_; subsequent mutations through addMappedElement/removeMappedElement stay incremental as before. clone() resets the flag so cloned pages re-lazy-build.
JMH HtmlUnitBenchmark.JMH on a ~30 KB HTML page (5 forks x 3 warmup x 5 measurement = 25 samples, paired sequentially):
baseline: 633.205 +- 8.793 us/op
with this: 593.611 +- 13.110 us/op
delta: -39.6 us, -6.3%
Targeted tests across HtmlPageTest, HtmlPage2Test, HtmlPage3Test, HtmlInlineFrameTest, HtmlFrameTest, HtmlFrameSetTest, HTMLDocumentTest, Html*ParserTest: 1176 tests, 0 failures.