Fix HTML parsing to support unquoted attribute values #103
Co-authored-by: yorkie <1935767+yorkie@users.noreply.github.com>
yorkie
left a comment
This should be resolved in pugixml rather than in dom/document.cpp.
I've moved the fix from document.cpp to pugixml as requested. Added a new parse flag |
… testing Co-authored-by: yorkie <1935767+yorkie@users.noreply.github.com>
yorkie
left a comment
Please remove the binary file; it should not be committed to the project.
@@ -0,0 +1,41 @@
/**
Please remove this file from docs. Instead, create a new HTML file that demonstrates unquoted attributes at fixtures/html/...
File removed and proper HTML demonstration created at fixtures/html/unquoted-attributes-demo.html in commit 957e408.
src/client/dom/document.cpp
Outdated
auto flag = pugi::parse_default | pugi::parse_ws_pcdata | pugi::parse_comments;
if (isFragment)
  flag |= pugi::parse_fragment;
if (type_ == DocumentType::kHTML)
Suggested change:
- if (type_ == DocumentType::kHTML)
+ if (documentType == DocumentType::kHTML)
Fixed to use documentType consistently in commit 957e408.
test_unquoted_attrs
Outdated
Binary files must not be committed.
Binary file test_unquoted_attrs removed in commit 957e408.
- Remove binary file `test_unquoted_attrs` and documentation file
- Fix code to use `documentType` instead of undefined `type_` variable
- Add proper HTML demonstration file in fixtures/html/

Co-authored-by: yorkie <1935767+yorkie@users.noreply.github.com>
Binary file removed in commit 957e408.
… multiple unquoted attributes, add EOF Co-authored-by: yorkie <1935767+yorkie@users.noreply.github.com>
This PR adds support for parsing HTML with unquoted attribute values, which are valid according to the HTML specification but were previously causing parse errors.
Problem
The HTML parser (using pugixml) was rejecting valid HTML such as `<a href=foobar></a>`. Parsing failed with "Error parsing element attribute" because pugixml follows strict XML rules, which require quoted attribute values, whereas HTML allows unquoted values in many cases.
Solution
Added a preprocessing step in `Document::fixSource()` that automatically quotes unquoted attribute values before parsing, so that `<a href=foobar class=test>` becomes `<a href="foobar" class="test">`.
Implementation Details
- `fixUnquotedAttributes()` method that scans HTML tags and adds quotes around unquoted attribute values
Testing
Verified support for:
- `<a href=foobar></a>`
- `<div class=container id=main></div>`
- `<img src=image.jpg alt="quoted title">`
- `<input type=text />`
- `<input checked>` (unchanged)

All existing functionality continues to work unchanged.
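The quoting step described under Solution could be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: the function name `quoteUnquotedAttributeValues` is hypothetical, and the real `fixUnquotedAttributes()` may differ in signature and behavior. The sketch assumes a start tag begins with `<` followed by a letter, and that an unquoted value runs until whitespace or `>`. It does not special-case comments, `<script>`, or `<style>` content.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: wraps unquoted attribute values in double quotes.
// Quoted values and valueless attributes (e.g. "checked") are left untouched.
std::string quoteUnquotedAttributeValues(const std::string &html)
{
  auto isSpace = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
  auto isAlpha = [](char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; };

  std::string out;
  out.reserve(html.size() + 16);
  const std::size_t n = html.size();
  std::size_t i = 0;
  bool inTag = false;

  while (i < n)
  {
    const char c = html[i];
    if (!inTag)
    {
      // Assumption: only "<" followed by a letter opens a start tag.
      if (c == '<' && i + 1 < n && isAlpha(html[i + 1]))
        inTag = true;
      out += c;
      ++i;
      continue;
    }
    if (c == '>')
      inTag = false;
    out += c;
    ++i;
    if (c != '=')
      continue;

    // Just copied '='; copy any whitespace that precedes the value.
    while (i < n && isSpace(html[i]))
      out += html[i++];
    if (i >= n || html[i] == '>')
      continue; // no value present; leave the input as it is

    if (html[i] == '"' || html[i] == '\'')
    {
      // Already quoted: copy through the matching closing quote.
      const char quote = html[i];
      out += html[i++];
      while (i < n && html[i] != quote)
        out += html[i++];
      if (i < n)
        out += html[i++];
      continue;
    }

    // Unquoted value: runs until whitespace or '>'.
    out += '"';
    while (i < n && !isSpace(html[i]) && html[i] != '>')
    {
      if (html[i] == '"')
        out += "&quot;"; // keep the generated quoting well-formed
      else
        out += html[i];
      ++i;
    }
    out += '"';
  }
  return out;
}
```

With this sketch, `quoteUnquotedAttributeValues("<a href=foobar class=test>")` returns `<a href="foobar" class="test">`, while `<input checked>` passes through unchanged. Because the value scan stops only at whitespace or `>`, a trailing slash written directly after a value (as in `type=text/>`) is treated as part of the value, so a space before `/>` is needed for self-closing tags.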
Fixes #102.