Use the HTML5 parser from html5lib when it's available #12

wants to merge 2 commits into from


I am not particularly happy about the namespace kludge myself, but I'll post this patch here just in case Simon wants to take it. It should also be useful for anyone else who wants to test WeasyPrint's standards-compliance without falling into some HTML parsing trap.

The kludge can be removed whenever one of the following applies:

  1. lxml supports XPath 2.0's namespace wildcard *:html and cssselect outputs it.
  2. cssselect outputs XPath of the form *[local-name() = "html"].

I am not prepared to go into hacking lxml/cssselect, so I'll stop here.

The namespacing stuff is a separate issue in cssselect. For the HTML parser, I'd rather not switch silently, but only use html5parser when explicitly requested by the user (and then fail if it cannot be imported). I'm open to ideas for the specific API: maybe a subclass or an additional parameter.
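
To make the "explicit request, no silent switch" idea concrete, here is a minimal sketch of the additional-parameter variant. The names `PARSER_MODULES` and `load_parser_module` are hypothetical and not part of WeasyPrint. The sketch assumes that importing `lxml.html.html5parser` raises `ImportError` when html5lib is missing.

```python
import importlib

# Hypothetical registry: user-facing parser name -> module providing parse().
PARSER_MODULES = {
    "lxml": "lxml.html",
    "html5lib": "lxml.html.html5parser",
}


def load_parser_module(name="lxml", registry=PARSER_MODULES):
    """Return the module for the parser the user explicitly asked for.

    There is deliberately no fallback: an unknown name raises ValueError,
    and a parser that cannot be imported raises ImportError.
    """
    try:
        module_name = registry[name]
    except KeyError:
        raise ValueError("unknown HTML parser: %r" % (name,))
    # import_module raises ImportError if the module (or one of its own
    # imports) is unavailable; the error is left to propagate.
    return importlib.import_module(module_name)
```

A caller would then write something like `load_parser_module("html5lib").parse(source)`, and a missing html5lib shows up as an error instead of a quiet change in parsing behaviour.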

Could you provide an HTML test case where html5parser parses differently from lxml.html?

Oh, I had only seen half of the issue. We can easily add support for a different HTML parser in WeasyPrint, but html5parser is useless because of the namespace issues in cssselect. This is fixable in cssselect, just low priority for me right now.

In the meantime, this is a better way to do the same, without patching WeasyPrint:

import lxml.html.html5parser
import weasyprint

tree = lxml.html.html5parser.parse('')  # Or whatever: a filename, URL or file object
for element in tree.iter():
    # Comments and other non-element stuff have a non-string "tag"
    if hasattr(element.tag, 'split'):
        # lxml uses the '{namespaceURI}localname' syntax
        element.tag = element.tag.split('}')[-1]
html = weasyprint.HTML(tree=tree, base_url='')
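
The tag rewriting above can be tried without lxml or html5lib installed. The following is only an illustrative sketch using the standard library's `xml.etree.ElementTree`, which uses the same '{namespaceURI}localname' tag notation. The sample document and the helper name `strip_namespaces` are made up for this example.

```python
import xml.etree.ElementTree as ET

# A tiny XHTML document; every element ends up in the XHTML namespace.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>Hello</p></body></html>'
)


def strip_namespaces(root):
    """Rewrite '{namespaceURI}localname' tags to plain 'localname' in place."""
    for element in root.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(element.tag, str):
            element.tag = element.tag.split('}')[-1]
    return root


root = ET.fromstring(XHTML)
before = [element.tag for element in root.iter()]
strip_namespaces(root)
after = [element.tag for element in root.iter()]

print(before)  # namespaced tags such as '{http://www.w3.org/1999/xhtml}html'
print(after)   # plain tags: ['html', 'body', 'p']
```

Once the namespace prefix is gone, selectors that match on the bare element name (such as `html` or `p`) can find the elements again, which is what the workaround relies on.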

Leaving this open: I still want to support various parsers eventually.

SimonSapin added a commit that referenced this pull request Jul 22, 2013
liZe added a commit that referenced this pull request Oct 17, 2013
Switch to html5lib to parse HTML. Fix #12.
innovimax pushed a commit to innovimax/WeasyPrint that referenced this pull request Apr 18, 2015
* Do not use element.base_url which only exists in lxml.html.HtmlElement
* Use lxml.etree.HtmlParser instead of lxml.html

This is one step toward using the html5lib parser, but see
Labels: feature (New feature that should be supported)

2 participants