Use the HTML5 parser from html5lib when it's available #12

Closed: @kennyluck wants to merge 2 commits

@kennyluck

I am not particularly happy about the namespace kludge myself, but I'll post this patch here in case Simon wants to take it. It should also be useful for anyone else who wants to test WeasyPrint's standards compliance without falling into an HTML parsing trap.

The kludge can be removed whenever one of the following applies:

  1. lxml supports XPath 2.0's namespace wildcard *:html and cssselect outputs it.
  2. cssselect outputs XPath of the form *[local-name() = "html"].

I am not prepared to go into hacking lxml/cssselect, so I'll stop here.
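For readers who have not run into the namespace problem, here is a minimal sketch of it. It is not part of the patch. It assumes only the standard library's xml.etree.ElementTree, which uses the same '{namespaceURI}localname' tag convention as lxml, so it runs without lxml or html5lib installed. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# An HTML5 parser places HTML elements in the XHTML namespace.
# This tiny document (an illustrative assumption) mimics that output.
root = ET.fromstring(
    '<html xmlns="%s"><body><p>Hello</p></body></html>' % XHTML_NS
)

# The tag carries the namespace in '{namespaceURI}localname' form.
print(root.tag)

# A lookup by bare local name finds nothing, because "body" is not
# the same name as "{http://www.w3.org/1999/xhtml}body".
unqualified = root.find("body")

# A fully qualified lookup finds the element.
qualified = root.find("{%s}body" % XHTML_NS)

print(unqualified, qualified.tag)
```

Either option in the list above would let a selector written as plain `html` match the namespaced element without renaming any tags.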

SimonSapin Sep 4, 2012

Member

The namespacing stuff is a separate issue in cssselect. For the HTML parser, I'd rather not switch silently, but only use html5parser when explicitly requested by the user (and then fail if it can not be imported). I'm open to ideas for the specific API: maybe a subclass or an additional parameter.
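One possible shape for the "additional parameter" idea is sketched below. This is only an illustration, not WeasyPrint's API: the function name `load_parser_module`, the `PARSER_MODULES` registry, and the parser names are all hypothetical. It shows the behaviour described above: the parser is chosen explicitly, and an unavailable parser raises an error instead of silently falling back.

```python
import importlib

# Hypothetical registry: user-facing parser name -> module expected to
# provide a parse() function. The names are assumptions for this sketch.
PARSER_MODULES = {
    "lxml": "lxml.html",
    "html5lib": "lxml.html.html5parser",
}


def load_parser_module(name="lxml", registry=PARSER_MODULES):
    """Return the module for the explicitly requested parser.

    Raises ValueError for an unknown name. If the requested parser is
    not installed, the ImportError propagates: there is no fallback.
    """
    try:
        module_path = registry[name]
    except KeyError:
        raise ValueError("unknown HTML parser: %r" % (name,))
    return importlib.import_module(module_path)
```

A caller would write something like `load_parser_module("html5lib").parse(source)`, and would get an ImportError right away if html5lib is missing.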


SimonSapin Sep 9, 2012

Member

Could you provide an HTML test case where html5parser parses differently from lxml.html?


SimonSapin Sep 11, 2012

Member

Oh, I had only seen half of the issue. We can easily add support for a different HTML parser in WeasyPrint, but html5parser is useless because of the namespace issues in cssselect. This is fixable in cssselect, just low priority for me right now.

In the meantime, this is a better way to do the same, without patching WeasyPrint:

import lxml.html.html5parser
import weasyprint

tree = lxml.html.html5parser.parse('http://example.net')  # Or whatever
for element in tree.iter():
    # Comments and other non-element stuff have a non-string "tag"
    if hasattr(element.tag, 'split'):
        # lxml uses the '{namespaceURI}localname' syntax
        element.tag = element.tag.split('}')[-1]
html = weasyprint.HTML(tree=tree, base_url='http://example.net')
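To check the renaming loop without network access, lxml, or html5lib, the same logic can be run on the standard library's ElementTree, which shares the '{namespaceURI}localname' convention. This is a sketch under that assumption; the helper name `strip_namespaces` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Rewrite '{namespaceURI}localname' tags to 'localname', in place."""
    for element in root.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(element.tag, str):
            element.tag = element.tag.split("}")[-1]
    return root


# Sample namespaced document, as an HTML5 parser would produce it.
root = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hi</p></body></html>'
)
# Add a comment node to exercise the non-string-tag branch.
root.append(ET.Comment("not an element"))

strip_namespaces(root)
tags = [el.tag for el in root.iter() if isinstance(el.tag, str)]
print(tags)
```

After the call, lookups by bare local name such as `root.find("body")` succeed, which is the kind of lookup namespace-unaware selectors rely on.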

SimonSapin Sep 11, 2012

Member

Leaving this open: I still want to support various parsers eventually.


SimonSapin added a commit that referenced this pull request Jul 22, 2013

liZe added a commit that referenced this pull request Oct 17, 2013

Merge pull request #112 from Kozea/html5lib
Switch to html5lib to parse HTML. Fix #12.

innovimax pushed a commit to innovimax/WeasyPrint that referenced this pull request Apr 18, 2015

Do not require HtmlElement.
* Do not use element.base_url which only exists in lxml.html.HtmlElement
* Use lxml.etree.HTMLParser instead of lxml.html

This is one step toward using the html5lib parser, but see
Kozea#12