Exception is thrown for wrong tag names #30
Comments
Thanks for sharing this. For example, let's look at http://greylink.4fan.cz and the markup: <id="top_featured"> <class="first">
This is clearly broken markup that can't be parsed. @technosophos For situations like this, should we skip the tags? In this example the class is paired with a closing div and the id is paired with a closing p.
Oh. Hmmm... that is a great question. As usual @miso-belica is right that we shouldn't throw an exception... but should we skip the tag or create an element and a element? Arguments for skipping:
Arguments for including:
My initial inclination is that we should skip them. One other alternative could be to convert them to span tags. But that is making a brazen assumption about the intent of the author.
I'm inclined to ignore them and note the errors. We can't know what someone intended. The one thing I'm wondering is how hard it would be to extrapolate the tag from a closing tag, if one exists. The example I shared has that. It may not be there often, but if it's not hard to do it might be useful.
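To make the "ignore them and note the errors" option concrete, here is a minimal sketch in Python (not the library's PHP). The name pattern is a simplified assumption, not the exact rule that DOMDocument::createElement applies, and `filter_start_tags` is a hypothetical helper that does not exist in the project.

```python
import re

# Assumption: a simplified tag-name rule (ASCII letter first, then letters,
# digits, ':', '.', '_' or '-'). The real DOM name check is broader.
TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9:._-]*\Z")


def filter_start_tags(names):
    """Split raw tag names into (accepted, errors) instead of raising."""
    accepted, errors = [], []
    for name in names:
        if TAG_NAME.match(name):
            accepted.append(name)
        else:
            # Skip the tag but keep a record of why it was dropped.
            errors.append(f"Illegal tag name: {name!r}")
    return accepted, errors


# Names taken from the markup quoted in this thread.
accepted, errors = filter_start_tags(
    ["p", 'id="top_featured"', 'class="first"', "div"]
)
```

With this policy the two malformed names are dropped and reported, while `p` and `div` pass through untouched.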
I think converting invalid tags into span is an interesting idea, but it can cause a WTF effect. Ignoring invalid tags is more intuitive and reasonable. And IMHO the best solution would be extrapolating the tag from the closing tag, as @mattfarina suggested, but I think it's not trivial to implement. For group of tags looking like this
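As a rough illustration of the closing-tag idea, the sketch below pairs each start tag with the next unmatched end tag using a stack, and borrows the end tag's name when the start tag's own name is invalid. It is written in Python, not the library's PHP. `infer_names`, the token format and the name rule are all assumptions made for this example. It also deliberately ignores void elements, unclosed tags and mis-nested markup, which is where the "not trivial" part comes in.

```python
import re

# Assumption: simplified tag-name rule, not the DOM's actual check.
TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9:._-]*\Z")


def infer_names(tokens):
    """tokens: list of ("start", name) or ("end", name) pairs.

    Returns the start-tag names in document order; an invalid name is
    replaced by the name of the end tag that closes it, or None if no
    end tag is left to pair it with.
    """
    resolved = []  # one entry per start tag
    stack = []     # indexes into `resolved` for still-open tags
    for kind, name in tokens:
        if kind == "start":
            resolved.append(name if TAG_NAME.match(name) else None)
            stack.append(len(resolved) - 1)
        elif stack:
            idx = stack.pop()
            if resolved[idx] is None:
                # Borrow the name from the matching closing tag.
                resolved[idx] = name
    return resolved


# The case described earlier in this thread: the id tag is closed by </p>,
# the class tag by </div>.
names = infer_names([
    ("start", 'id="top_featured"'),
    ("start", 'class="first"'),
    ("end", "div"),
    ("end", "p"),
])
```

For the greylink example this resolves the `id` tag to `p` and the `class` tag to `div`, matching the pairing described above. Any stray or missing end tag would shift the pairing, so a real implementation would need extra checks.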
@miso-belica did you want to try to roll a PR or did you want one of us to tackle this one?
I'm quite busy these days, but I can try to prepare a PR over the weekend if you aren't faster :)
@miso-belica I won't get to it before the weekend.
With the pull request for #31, I believe this is complete. Re-open if there is still a problem. Thanks @miso-belica
Hi,
I noticed some issues with pages that contain wrong tag names. I really don't know how to deal with the issue, so maybe you can find a solution. Below is the list of the pages with the names of the tags that are invalid. The exception DOMException#5: Invalid Character Error is always thrown at DOMTreeBuilder.php:227 by the method DOMDocument::createElement. Any solution other than throwing the exception is fine for me :) I can make the PR if you tell me what the proper fix is for you.
a href="http: (which is weird, because it's a valid form in HTML)
id="top_featured"
color="white"
class='neaktivni_stranka'
src=<a
bgcolor="white"
class="nom" (this page also has the tag <p class="nom">, which is valid, but also an invalid one: <p class="f-right"><class="nom">)
br...<a
span<
noscript<img
br<br
p<
wordpress<
center<a
li"
p"
a�href="http:
b
static*all
h*0720