parse_html adds unwanted tags like <html><head>...<body></html> #583

Open
qknight opened this issue Mar 13, 2025 · 10 comments

@qknight

qknight commented Mar 13, 2025

I want to use parse_document to create DOM/VDOM patches, but parse_document(...) keeps adding <html> and <body>. Is there an option to fine-tune the error-correction level? I do like that it adds a </title> in the example below.

But for creating a virtual-DOM patch on a <div id="here">, having to filter the html tags out afterwards is awkward. My current workaround extracts the children of <body>:

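use std::{cell::Cell, rc::Rc};

use html5ever::{parse_document, tendril::TendrilSink};
use markup5ever_rcdom::{Handle, NodeData, RcDom};

// Node, ParseError and process_handle come from the surrounding crate.
/// Parse non-escaped HTML strings such as "Hello world!" into a node tree (see also raw_html(...))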
pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
    let dom: RcDom = parse_document(RcDom::default(), Default::default()).one(html);
    if let Some(body) = find_body(&dom.document) {
        let new_document = Rc::new(markup5ever_rcdom::Node {
            data: NodeData::Document,
            parent: Cell::new(None),
            children: body.children.clone(),
        });
        process_handle(&new_document)
    } else {
        Err(ParseError::NoBodyInParsedHtml)
    }
}

// Recursively find the <body> element
fn find_body(handle: &Handle) -> Option<Handle> {
    match &handle.data {
        NodeData::Element { name, .. } if name.local.as_ref() == "body" => Some(handle.clone()),
        _ => {
            for child in handle.children.borrow().iter() {
                if let Some(body) = find_body(child) {
                    return Some(body);
                }
            }
            None
        }
    }
}

However, my problem is that I also want to parse HTML that contains its own <html>...</html> tag, and with this workaround that tag gets removed.
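One way to keep a caller-supplied <html> element while still unwrapping plain snippets would be to decide up front which kind of input is being parsed. The sketch below is only a crude, std-only heuristic: `looks_like_full_document` and `starts_with_tag` are hypothetical helper names, not part of html5ever, and inputs that begin with a comment or a BOM would be misclassified.

```rust
/// Returns true if `s` begins with `<tag` followed by `>`, `/`, or whitespace.
/// `s` and `tag` are expected to be lowercase already.
fn starts_with_tag(s: &str, tag: &str) -> bool {
    let prefix = format!("<{}", tag);
    match s.strip_prefix(prefix.as_str()) {
        Some(rest) => matches!(
            rest.chars().next(),
            Some(c) if c == '>' || c == '/' || c.is_ascii_whitespace()
        ),
        None => false,
    }
}

/// Hypothetical heuristic: does the input look like a complete document
/// (starting with a doctype or an <html> tag) rather than a snippet?
fn looks_like_full_document(html: &str) -> bool {
    // Lowercasing the whole input is wasteful for large strings; a real
    // implementation would probably inspect only the first few bytes.
    let head = html.trim_start().to_ascii_lowercase();
    starts_with_tag(&head, "!doctype") || starts_with_tag(&head, "html")
}
```

A caller could keep the parsed document intact when this returns true and only strip the wrapper elements otherwise.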

html-driver.rs test

#[test]
fn from_utf8() {
    let dom = driver::parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .one("<title>Test".as_bytes());
    let mut serialized = Vec::new();
    let document: SerializableHandle = dom.document.clone().into();
    serialize::serialize(&mut serialized, &document, Default::default()).unwrap();
    assert_eq!(
        String::from_utf8(serialized).unwrap().replace(' ', ""),
        "<html><head><title>Test</title></head><body></body></html>"
    );
}

Update:

parse_fragment is also adding unwanted html.

@nicoburns
Contributor

You likely want parse_fragment

@qknight
Author

qknight commented Mar 14, 2025

@nicoburns thanks for the heads-up!

Even with parse_fragment it still adds an <html> tag. Bummer ;-)

I've implemented it like this:

pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
    // Parse as a fragment in the context of a <div> element.
    let dom: RcDom = parse_fragment(
        RcDom::default(),
        Default::default(),
        QualName::new(None, ns!(html), local_name!("div")),
        vec![],
    )
    .one(html);
    process_handle(&dom.document)
}

and my test shows now:

#[test]
fn test_pre_code3() {
    let html = r#"<div><p> test </p><pre><code>
0
  1
  2
3
</code></pre>
</div>"#;
    let expected = r#"<div><p> test </p><pre><code>
0
  1
  2
3
</code></pre><!--separator-->
</div>"#;

    let node: Node<()> = parse_html(html).ok().flatten().expect("must parse");
    //println!("node: {:#?}", node);
    println!("html: {}", html);
    println!("render: {}", node.render_to_string());
    assert_eq!(expected, node.render_to_string());
}

output (from the related test test_pre_code3_paragraphs_mix):

---- test_pre_code3_paragraphs_mix stdout ----
--- <code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code> ---
html: <div><p> test </p><pre><code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code></pre>
</div>
render: <html><div><p> test </p><pre><code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code></pre><!--separator-->
</div></html>
thread 'test_pre_code3_paragraphs_mix' panicked at tests/html_parser_test.rs:130:5:
assertion `left == right` failed
  left: "<div><p> test </p><pre><code>\n  0\n  <p>1</p>\n  2\n<p>3</p>\n  4\n</code></pre><!--separator-->\n</div>"
 right: "<html><div><p> test </p><pre><code>\n  0\n  <p>1</p>\n  2\n<p>3</p>\n  4\n</code></pre><!--separator-->\n</div></html>"

@qknight
Author

qknight commented Mar 14, 2025

Updates:

Found the function which adds the <html>...</html> tags. I wonder if this can be made optional, maybe with a new TreeBuilderOpts argument.

html5ever/mod.rs

    pub fn new_for_fragment(
        sink: Sink,
        context_elem: Handle,
        form_elem: Option<Handle>,
        opts: TreeBuilderOpts,
    ) -> TreeBuilder<Handle, Sink> {
        println!("new_for_fragment");
...
        // https://html.spec.whatwg.org/multipage/#parsing-html-fragments
        // 5. Let root be a new html element with no attributes.
        // 6. Append the element root to the Document node created above.
        // 7. Set up the parser's stack of open elements so that it contains just the single element root.
        tb.create_root(vec![]);
        println!("new_for_fragment 3");

html5ever/tree_builder.rs

    //§ creating-and-inserting-nodes
    fn create_root(&self, attrs: Vec<Attribute>) {
        let elem = create_element(
            &self.sink,
            QualName::new(None, ns!(html), local_name!("html")),
            attrs,
        );
        self.push(&elem);
        self.sink.append(&self.doc_handle, AppendNode(elem));
        // FIXME: application cache selection algorithm
    }

@qknight qknight changed the title parse_html adds <html><head>...<body> tags but I want to parse_html adds unwanted tags like <html><head>...<body></html> Mar 15, 2025
@jdm
Member

jdm commented Apr 4, 2025

I'm hesitant to make this a configurable behaviour, since that takes us away from the HTML parsing specification that this crate implements. This seems like a post-processing step that would be better implemented by crates that use this library, using manual DOM operations on the parsed tree.
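The kind of post-processing step described here could look roughly like the following. This is a minimal sketch on a toy tree, not on markup5ever_rcdom: the `Node` enum, `strip_synthetic_root`, and `render` are hypothetical stand-ins chosen so the example stays std-only. It assumes the document has exactly one child, the <html> root, which is what the fragment-parsing steps quoted earlier in the thread produce.

```rust
/// Toy stand-in for a parsed DOM node (not the markup5ever_rcdom type).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// If the document's only child is an <html> element, return that element's
/// children; otherwise return the document's children unchanged.
fn strip_synthetic_root(mut document_children: Vec<Node>) -> Vec<Node> {
    let wrapped = matches!(
        document_children.as_slice(),
        [Node::Element { name, .. }] if name.as_str() == "html"
    );
    if wrapped {
        if let Some(Node::Element { children, .. }) = document_children.pop() {
            return children;
        }
    }
    document_children
}

/// Serializes the toy nodes (no attributes, no escaping) for inspection.
fn render(nodes: &[Node]) -> String {
    let mut out = String::new();
    for node in nodes {
        match node {
            Node::Text(text) => out.push_str(text),
            Node::Element { name, children } => {
                out.push_str(&format!("<{}>", name));
                out.push_str(&render(children));
                out.push_str(&format!("</{}>", name));
            }
        }
    }
    out
}
```

Returning a list rather than a single node means a snippet with several top-level elements needs no special case. Note that this sketch cannot tell a synthetic <html> root from one the input supplied itself.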

@qknight
Author

qknight commented Apr 8, 2025

@jdm I couldn't find a formal definition in the html5ever source code of what an HTML fragment actually is.

My intuition is that parse_fragment does two things at once: (1) parse and repair a section of HTML such as <p>foo</p> or even <html>...</html>, and (2) wrap it into a standalone document.

The 'collective' intuition, as represented by ChatGPT (https://chatgpt.com/share/67f525ed-c53c-800c-8f49-85b6064226e7), is, TL;DR: a fragment means a snippet of HTML meant to live inside an existing HTML element, not a standalone HTML document.

So I kindly ask: could you please add a 'formal' comment to the parse_fragment implementation describing what it actually does, so that other developers like me don't get confused too?

If you don't want to alter the implementation in any way, since you like it as it is, then just close this ticket. I can go with the hack you proposed. Thanks!

@jdm
Member

jdm commented Apr 8, 2025

@nicoburns
Contributor

@jdm

Regarding Step 16:

Return root's children, in tree order.

Doesn't "return the root's children" imply that the root itself (the <html> element) is not returned, only its children?

@jdm
Member

jdm commented Apr 8, 2025

That makes me wonder if it's possible to parse a fragment that has multiple elements at the fragment root. What do you return in that case, if not the artificial root node?

@jdm
Member

jdm commented Apr 8, 2025

Mmm, I see that users of that algorithm expect a list of direct children that they can iterate and append to some other tree. Interesting.

@nicoburns
Contributor

I mean, one presumes this API is for parsing a DocumentFragment (https://developer.mozilla.org/en-US/docs/Web/API/DocumentFragment), which is its own entity: it provides a container for a list of children but doesn't have the same enforced structure (<html><head><body>) as an actual Document. When a DocumentFragment is appended to a node in a document, the DocumentFragment disappears and its children are appended instead.
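The append behaviour described above can be modelled in a few lines. This is a sketch on a toy tree, assuming a hypothetical `Fragment` container and `append_fragment` helper; neither is an html5ever or DOM API. It only illustrates that the children move into the target and the container itself never shows up in the tree.

```rust
/// Toy stand-in for a DOM node.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Toy stand-in for a DocumentFragment: just a list of children,
/// with no required <html>/<head>/<body> structure.
#[derive(Debug, Default)]
struct Fragment {
    children: Vec<Node>,
}

/// Moves the fragment's children to the end of `parent`'s child list.
/// The fragment itself is not inserted and is left empty afterwards.
/// Returns false if `parent` is a text node, which cannot hold children.
fn append_fragment(parent: &mut Node, fragment: &mut Fragment) -> bool {
    match parent {
        Node::Element { children, .. } => {
            // Vec::append moves every element out of the source vector.
            children.append(&mut fragment.children);
            true
        }
        Node::Text(_) => false,
    }
}
```

For the vdom-patch use case from the original post, the target would be the <div id="here"> element and the fragment would hold whatever top-level nodes the snippet parsed into.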
