Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore attributes with illegal characters in name #29

Merged
merged 2 commits into from
Feb 21, 2014
Merged

Ignore attributes with illegal characters in name #29

merged 2 commits into from
Feb 21, 2014

Conversation

miso-belica
Copy link
Contributor

Hi,
I would like to fix the issue #23 ASAP, because I parse a lot of pages that throw DOMException related to it. I read @technosophos's opinion about the issue but I think there's nothing to think about because of current DOMElement's implementation in PHP. Reality is that the method DOMElement::setAttribute throws an exception so element with the attributes like ?, ;, ... can't be created. That's why I implemented the other possible way - parser ignores them.

All I want is HTML parser that doesn't throw exceptions and believe we discuss to the proper solution :)

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling fc6aa4d on miso-belica:fix-wrong-attr-names into 4b3da09 on Masterminds:master.

@mattfarina
Copy link
Member

For reference, the attribute naming part of the spec is at http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#attributes-0.

I'm digging in to this.

$this->parseError("Unexpected characters in attribute name: %s", $name);
$isValidAttribute = FALSE;
}
else if (preg_match("/^[0-9.-]/u", $name)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are numbers not allowed in html5? In XML the starting character is different. This would apply to xhtml as well. In html5 I don't see that attribute name limitation.

It appears the DOM (provided by libxml) might have a limitation over html5.

I did some testing in chrome and firefox, to see some other implementation. They don't allow numbers as the first character in practice.

All of this is to say I did some legwork around numbers and I wanted to document it for the day we come back to this same issue. I'm ok with this for now. Though, I reserve the right to root @technosophos on as he writes a custom DOM some day in the far far future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the number regex can you provide a comment before it pointing to the specs and why it's filtering out the number as the first character. Knowing why this is here will cause some future questions, even if from me.

@mattfarina
Copy link
Member

This mostly looks good. It seems there might be a separation between the DOM (from libxml) and the html5 spec. I'd just like some more documentation around the parser error handling code so that when we revisit this in the future we have a clue what's going on and why.

This is neccesary because method "DOMElement::setAttribute"
throws exception for wrong names so DOM elements
can't contain these attributes.
@miso-belica
Copy link
Contributor Author

Thanks for the review. I updated last commit by adding some doc. Hope it's OK now.

@technosophos
Copy link
Member

The documentation is great.

Thanks, @miso-belica, for not only finding the issue, but for putting in the effort to come up with a sensible workaround. I really appreciate your work.

technosophos added a commit that referenced this pull request Feb 21, 2014
Ignore attributes with illegal characters in name
@technosophos technosophos merged commit c3ac1b3 into Masterminds:master Feb 21, 2014
@miso-belica miso-belica deleted the fix-wrong-attr-names branch February 21, 2014 16:55
@miso-belica
Copy link
Contributor Author

I'm glad I could help and thank you guys for this library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants