HtmlUnitNekoDOMBuilder is not thread-safe #148

@rschwietzke

HtmlUnitNekoDOMBuilder causes odd errors in a highly concurrent environment, such as:

Cannot invoke "Object.hashCode()" because "k" is null
java.lang.NullPointerException: Cannot invoke "Object.hashCode()" because "k" is null
	at org.htmlunit.cyberneko.util.FastHashMap.get(FastHashMap.java:92)
	at org.htmlunit.cyberneko.HTMLElements.getElement(HTMLElements.java:644)
	at org.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2134)
	at org.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:914)
	at org.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:336)
	at org.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:294)
	at org.htmlunit.cyberneko.xerces.parsers.AbstractXMLDocumentParser.parse(AbstractXMLDocumentParser.java:79)
	at org.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:757)
	at org.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:196)
	at org.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:300)
	at org.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:219)
	at org.htmlunit.WebClient.loadWebResponseInto(WebClient.java:682)
	at org.htmlunit.WebClient.loadWebResponseInto(WebClient.java:576)
	at com.xceptance.xlt.engine.XltWebClient.loadWebResponseInto(XltWebClient.java:1007)
	at org.htmlunit.WebClient.getPage(WebClient.java:494)
	at org.htmlunit.WebClient.getPage(WebClient.java:403)

Judging from the code, this cannot happen unless something is going on concurrently. HTMLElements is not read-only once created; it has state that changes (see getElement(final String ename, final Element element)). It updates its local caches to reduce lookup cost, and that state is not synchronized because synchronization would negate the speed-up.
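
To illustrate the pattern described above, here is a minimal sketch of a lookup that looks like a read but writes to a cache on a miss. CachingLookup and its fields are hypothetical stand-ins, not the actual HTMLElements code, and it uses a plain HashMap rather than FastHashMap; the assumption is only that the real class follows the same "fill the cache during get" idea.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a lookup table with a lazily filled cache. */
class CachingLookup {
    // Canonical entries keyed by lower-case name; filled once in the constructor.
    private final Map<String, Integer> byLowerCaseName = new HashMap<>();
    // Cache keyed by the exact casing callers used; mutated inside get().
    private final Map<String, Integer> byExactName = new HashMap<>();

    CachingLookup() {
        byLowerCaseName.put("div", 1);
        byLowerCaseName.put("span", 2);
    }

    /** Looks like a pure read, but stores new casings in the cache on a miss. */
    Integer get(String name) {
        Integer code = byExactName.get(name);
        if (code == null) {
            code = byLowerCaseName.get(name.toLowerCase(Locale.ROOT));
            if (code != null) {
                // Unsynchronized write: safe on one thread, a data race when
                // the same instance is shared between threads.
                byExactName.put(name, code);
            }
        }
        return code;
    }

    /** Number of casings cached so far; shows that get() changed state. */
    int cachedNames() {
        return byExactName.size();
    }
}
```

On a single thread this is harmless: calling get("DIV") and get("Div") simply adds two cache entries. With a shared instance, two threads can write to the cache map at the same time, and a non-thread-safe map gives no guarantees about what readers see in that case.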

So, we should not share instances of HTMLElements across threads, but HtmlUnitNekoDOMBuilder does exactly that by declaring

private static final HTMLElements HTMLELEMENTS;
private static final HTMLElements HTMLELEMENTS_WITH_CMD;

This is also done for efficiency reasons, but it causes thread-safety issues in return.

I don't have a solution yet, and a test case is hard (though possible) to write, but the review alone was enough to find this. The more dynamic the HTML is, with elements whose tag names use different casing, the more often this is likely to happen.
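
One direction that might be worth evaluating, purely as a sketch and not a proposed patch: give each thread its own instance of the mutable helper instead of sharing a static one. PerThread below is a hypothetical helper built on the JDK's ThreadLocal; whether the extra memory and per-thread setup cost is acceptable for HTMLElements is an open question.

```java
import java.util.function.Supplier;

/** Hypothetical holder that hands each thread its own lazily created instance. */
final class PerThread<T> {
    private final ThreadLocal<T> local;

    PerThread(Supplier<? extends T> factory) {
        // The factory runs at most once per thread, on that thread's first get().
        this.local = ThreadLocal.withInitial(factory);
    }

    /** Returns the calling thread's private instance. */
    T get() {
        return local.get();
    }
}
```

A static field of type PerThread would then replace the shared static instance, with callers fetching their copy via get(). Each thread keeps the benefit of an unsynchronized cache, at the price of one instance (and one warm-up) per thread.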
