Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Jul 18, 2025

This PR adds support for parsing HTML with unquoted attribute values, which are valid according to the HTML specification but were previously causing parse errors.

Problem

The HTML parser (using pugixml) was rejecting valid HTML like:

<a href=foobar></a>

This would fail with "Error parsing element attribute" because pugixml follows strict XML rules requiring quoted attributes, while HTML allows unquoted values in many cases.

Solution

Added a preprocessing step in Document::fixSource() that automatically quotes unquoted attribute values before parsing:

  • Input: <a href=foobar class=test>
  • Preprocessed: <a href="foobar" class="test">
  • Result: Successful parsing with correct attribute extraction

Implementation Details

  • Added fixUnquotedAttributes() method that scans HTML tags and adds quotes around unquoted attribute values
  • Preserves existing quoted attributes (both single and double quotes)
  • Handles edge cases: self-closing tags, boolean attributes, nested HTML, comments
  • Only affects HTML documents, not XML
  • Minimal performance impact - only processes HTML that needs fixing

Testing

Verified support for:

  • Basic unquoted attributes: <a href=foobar></a>
  • Multiple attributes: <div class=container id=main></div>
  • Mixed quoted/unquoted: <img src=image.jpg alt="quoted title">
  • Self-closing tags: <input type=text />
  • Boolean attributes: <input checked> (unchanged)
  • Complex nested HTML with multiple unquoted attributes

All existing functionality continues to work unchanged.

Fixes #102.


💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

Copilot AI and others added 2 commits July 18, 2025 14:58
Copilot AI changed the title [WIP] parsing: support attribute value without quotes Fix HTML parsing to support unquoted attribute values Jul 18, 2025
Copilot AI requested a review from yorkie July 18, 2025 15:01
Copy link
Member

@yorkie yorkie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should resolve this at pugixml instead of dom/document.cpp

Copy link
Contributor Author

Copilot AI commented Jul 18, 2025

You should resolve this at pugixml instead of dom/document.cpp

I've moved the fix from document.cpp to pugixml as requested. Added a new parse flag parse_unquoted_attributes and modified the attribute parsing logic to handle unquoted values directly in the parser. Working on resolving test failures in the implementation.

Copilot AI and others added 3 commits July 18, 2025 15:24
Co-authored-by: yorkie <1935767+yorkie@users.noreply.github.com>
Co-authored-by: yorkie <1935767+yorkie@users.noreply.github.com>
… testing

Co-authored-by: yorkie <1935767+yorkie@users.noreply.github.com>
Copilot AI requested a review from yorkie July 18, 2025 15:39
Copy link
Member

@yorkie yorkie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the binary file, that should not be commited to the project.

@@ -0,0 +1,41 @@
/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this file from docs, instead, you can create a new html file that demostrates the unquoted attributes at fixtures/html/...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File removed and proper HTML demonstration created at fixtures/html/unquoted-attributes-demo.html in commit 957e408.

auto flag = pugi::parse_default | pugi::parse_ws_pcdata | pugi::parse_comments;
if (isFragment)
flag |= pugi::parse_fragment;
if (type_ == DocumentType::kHTML)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (type_ == DocumentType::kHTML)
if (documentType == DocumentType::kHTML)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to use documentType consistently in commit 957e408.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binary files mustn't be commited.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binary file test_unquoted_attrs removed in commit 957e408.

- Remove binary file `test_unquoted_attrs` and documentation file
- Fix code to use `documentType` instead of undefined `type_` variable
- Add proper HTML demonstration file in fixtures/html/

Co-authored-by: yorkie <1935767+yorkie@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jul 18, 2025

Please remove the binary file, that should not be commited to the project.

Binary file removed in commit 957e408.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • registry.npmmirror.com
    • Triggering command: npm install (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI requested a review from yorkie July 18, 2025 16:13
… multiple unquoted attributes, add EOF

Co-authored-by: yorkie <1935767+yorkie@users.noreply.github.com>
Copilot AI requested a review from yorkie July 18, 2025 16:30
@yorkie yorkie marked this pull request as ready for review July 18, 2025 16:41
@yorkie yorkie merged commit 2922dee into main Jul 18, 2025
2 checks passed
@yorkie yorkie deleted the copilot/fix-102 branch July 18, 2025 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

parsing: support attribute value without quotes

2 participants