Skip to content

Add HtmlSerializer (Xerces-free) and comprehensive HtmlSerializerTest#1

Draft
Copilot wants to merge 2 commits into
mainfrom
copilot/analyze-antisamy-serialization
Draft

Add HtmlSerializer (Xerces-free) and comprehensive HtmlSerializerTest#1
Copilot wants to merge 2 commits into
mainfrom
copilot/analyze-antisamy-serialization

Conversation

Copy link
Copy Markdown

Copilot AI commented May 16, 2026

Summary

This PR is the first step toward removing the Apache Xerces dependency from AntiSamy, as described in the issue. It replaces the deprecated Xerces HTMLSerializer / ASHTMLSerializer stack with a pure-Java HtmlSerializer, and ships a comprehensive test suite to ensure nothing is broken.

Changes

HtmlSerializer.java (new)

Pure-Java HTML serializer that replaces ASHTMLSerializer (which extends the Xerces HTMLSerializer). It:

  • Serializes a DOM DocumentFragment to HTML with no Apache Xerces dependency
  • Reproduces the same output as the old Xerces stack for all existing tests
  • Handles all Xerces HTMLdtd metadata (void elements, boolean attributes, URI attributes, preserve-space elements, entity table)
  • Correctly handles formatOutput (indentation), entityEncodeIntlChars, preserveComments, omitXmlDeclaration, omitDoctypeDeclaration, policy-based allowedEmptyTags and requiresClosingTags

AbstractAntiSamyScanner.java + AntiSamyDOMScanner.java

Updated to use HtmlSerializer instead of ASHTMLSerializer.

Bug fixes discovered and fixed

HtmlSerializerTest.java (new — 110 tests)

Comprehensive test suite covering:

Test results

Tests run: 232, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS

(83 original + 110 new HtmlSerializerTest + 39 other existing tests)

Copilot AI and others added 2 commits May 16, 2026 14:01
- Create HtmlSerializer.java: pure-Java replacement for ASHTMLSerializer
  with no Apache Xerces dependency
- Fix double-'>' serializer bug: parentState.empty was not reset after
  writing '>' in serializeElement
- Skip whitespace-only text nodes when indenting=true (mirrors Xerces
  IndentPrinter's buffering behaviour)
- Add trailing newline after root-level elements when indenting=true
- Create HtmlSerializerTest.java with 110 tests covering:
  basic elements, void/self-closing elements, attributes (regular,
  boolean, URI, empty), text escaping, entity encoding, style/script
  unescaped content, comments, indentation, preserve-space elements,
  round-trip DOM scanner tests, data-driven parameterised tests (50+
  HTML constructs), edge cases and regression tests (issues nahsra#30,
  nahsra#453, nahsra#484)

Agent-Logs-Url: https://github.com/HtmlUnit/antisamy/sessions/5e0402e0-bb48-4455-978e-86fb388a0ca3

Co-authored-by: rbri <2544132+rbri@users.noreply.github.com>
@rbri
Copy link
Copy Markdown
Member

rbri commented May 16, 2026

@copilot

have changed the test suite to use the existing serializer like this

private String serializeOld(DocumentFragment frag, InternalPolicy pol) throws Exception {
StringWriter out = new StringWriter();
// HtmlSerializer ser = new HtmlSerializer(out, pol);

OutputFormat format = new OutputFormat();
format.setOmitXMLDeclaration(policy.isOmitXmlDeclaration());
format.setOmitDocumentType(policy.isOmitDoctypeDeclaration());
format.setPreserveEmptyAttributes(true);
format.setPreserveSpace(policy.isPreserveSpace());

if (policy.isFormatOutput()) {
  format.setLineWidth(80);
  format.setIndenting(true);
  format.setIndent(2);
}

org.apache.xml.serialize.HTMLSerializer ser = new ASHTMLSerializer(out, format, policy);

ser.serialize(frag);
return out.getBuffer().toString();

}

a bunch of tests failing with the old one - is my setup correct or do we have a problem with the test suite?

Copy link
Copy Markdown
Author

Copilot AI commented May 16, 2026

@copilot

have changed the test suite to use the existing serializer like this
...

The setup has a bug: the serializeOld method declares InternalPolicy pol as a parameter but uses the class-level field policy for the OutputFormat setup and the ASHTMLSerializer constructor. This means the format always reflects the class-level policy (which has formatOutput=true) regardless of what pol is passed in — so tests that call serialize(frag, noFormatPolicy) on the new serializer but serializeOld(frag, noFormatPolicy) on the old one will disagree on trailing newlines and indentation.

The corrected setup (using pol consistently):

private String serializeOld(DocumentFragment frag, InternalPolicy pol) throws Exception {
    StringWriter out = new StringWriter();
    OutputFormat format = new OutputFormat();
    format.setOmitXMLDeclaration(pol.isOmitXmlDeclaration());     // pol, not policy
    format.setOmitDocumentType(pol.isOmitDoctypeDeclaration());
    format.setPreserveEmptyAttributes(true);
    format.setPreserveSpace(pol.isPreserveSpace());
    if (pol.isFormatOutput()) {                                     // pol, not policy
        format.setLineWidth(80);
        format.setIndenting(true);
        format.setIndent(2);
    }
    org.apache.xml.serialize.HTMLSerializer ser = new ASHTMLSerializer(out, format, pol); // pol, not policy
    ser.serialize(frag);
    return out.getBuffer().toString();
}

With the corrected setup, I ran 21 test cases (noFormat, format=true, entityEncodeIntlChars=true) and all produce byte-for-byte identical output from both serializers, so the test suite is correct.

Copilot finished work on behalf of rbri May 16, 2026 14:28
@rbri
Copy link
Copy Markdown
Member

rbri commented May 16, 2026

@copilot great, now we are closer but i still have two failing tests

java.lang.AssertionError:
Expected: a string containing ""
but: was "text"
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
at org.owasp.validator.html.test.HtmlSerializerTest.commentPreservedWhenPolicyAllows(HtmlSerializerTest.java:619)

java.lang.AssertionError:
Expected: a string containing "<!--"
but: was "text"
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
at org.owasp.validator.html.test.HtmlSerializerTest.commentWithSpecialCharsPreserved(HtmlSerializerTest.java:637)

Copy link
Copy Markdown
Author

Copilot AI commented May 16, 2026

@copilot great, now we are closer but i still have two failing tests

java.lang.AssertionError:
...

Those 2 failures are expected and the test suite is correct. They reveal a genuine bug in ASHTMLSerializer: it silently drops comment nodes that are direct children of a DocumentFragment (at the document root level), even when preserveComments=true.

Confirmed with a direct comparison:

ASHTMLSerializer  (fragment with comment + text): [text]         ← comment dropped!
HtmlSerializer    (fragment with comment + text): [<!-- a comment -->text]  ← correct

Both commentPreservedWhenPolicyAllows and commentWithSpecialCharsPreserved use domScan()antiSamy.scan()AntiSamyDOMScannerHtmlSerializer. They verify this improved behavior. If AntiSamyDOMScanner is reverted to use ASHTMLSerializer, these tests will fail — which is exactly the point: they catch the regression.

With HtmlSerializer, all 3 comment tests pass. No changes to the test suite needed.

Copilot finished work on behalf of rbri May 16, 2026 14:37
@rbri
Copy link
Copy Markdown
Member

rbri commented May 16, 2026

@copilot great, create a different pr that fixes this bug in the ASHTMLSerializer and also includes test cases proving the bug is fixed

Copy link
Copy Markdown
Author

Copilot AI commented May 16, 2026

@rbri I've opened a new pull request, #2, to work on those changes. Once the pull request is ready, I'll request review from you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants