Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception is thrown for wrong tag names #30

Closed
miso-belica opened this issue Mar 29, 2014 · 8 comments
Closed

Exception is thrown for wrong tag names #30

miso-belica opened this issue Mar 29, 2014 · 8 comments

Comments

@miso-belica
Copy link
Contributor

Hi,
I noticed some issues with pages that contain wrong tag names. I really don't know how to deal with the issue so maybe you find the solution. Below is the list of the pages with names of tag that are invalid. Exception DOMException#5: Invalid Character Error is always thrown at DOMTreeBuilder.php:227 by method DOMDocument::createElement. Every solution, except throwing the exception, is fine for me :) I can make the PR if you tell me what is proper fix for you.




@mattfarina
Copy link
Member

Thanks for sharing this. For example, let's look at http://greylink.4fan.cz and the markup...

<id="top_featured">
                <class="first">

This is clearly broken markup that can't be parsed. @technosophos For situations like this should we skip the tags? In this example the class is paired with a closing div and the id is paired with a closing p.

@technosophos
Copy link
Member

Oh. Hmmm... that is a great question. As usual @miso-belica is right that we shouldn't throw an exception... but should we skip the tag or create an element and a element?

Arguments for skipping:

  • Clearly, these are not legal tag names
  • Using them as tag names will result in illegal generated HTML

Arguments for including:

  • Clearly, some sort of structure was intended here, and we should preserve that structure.

My initial inclination is that we should skip them. One other alternative could be to convert them to a span tags. But that is making a brazen assumption about the intent of the author.

@mattfarina
Copy link
Member

I'm inclined to ignore them and note the errors. We can't know what someone intended.

The one thing I'm wondering is, how hard would it be to extrapolate it from a closing tag if one exists? The example I shared has that. It may not be there often but if it's not hard to do it might be useful.

@miso-belica
Copy link
Contributor Author

I think converting invalid tags into span is interesting idea, but it can cause WTF effect. Ignoring invalid tags is quite more intuitive and reasonable. And IMHO best solution would be extrapolating tag from the closing tag as @mattfarina suggested, but I think it's not trivial to implement.

For group of tags looking like this validtag<anothervalidtag could be fine recognize validtag and continue with parsing from <anothervalidtag. So 2 nodes would be created (<validtag/> and <anothervalidtag arg1="val" arg2="val" ...>).

@mattfarina
Copy link
Member

@miso-belica did you want to try to roll a PR or did you want one of us to tackle this one?

@miso-belica
Copy link
Contributor Author

I'm quite busy these days, but I can try to prepare PR during weekend if you won't faster :)

@mattfarina
Copy link
Member

@miso-belica I won't get to it before the weekend.

@mattfarina
Copy link
Member

With the pull request for #31 , I believe this is complete. Re-open if there is still a problem. Thanks @miso-belica

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants