parse_html adds unwanted tags like <html><head>...<body></html> #583

qknight · 2025-03-13T03:32:53Z

I want to use parse_document to create DOM/VDOM patches, but parse_document(...) keeps adding <html> and <body>. Is there an option to fine-tune the error-correction level? I do like that it adds a </title> in the example below.

But for creating a virtual-DOM patch on a <div id="here">, having to filter the html tags out afterwards is awkward. This is what I currently do:

/// Parse non-escaped HTML strings such as "Hello world!" into a node tree (see also raw_html(...))
pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
    let dom: RcDom = parse_document(RcDom::default(), Default::default()).one(html);
    if let Some(body) = find_body(&dom.document) {
        let new_document = Rc::new(markup5ever_rcdom::Node {
            data: NodeData::Document,
            parent: Cell::new(None),
            children: body.children.clone(),
        });
        process_handle(&new_document)
    } else {
        Err(ParseError::NoBodyInParsedHtml)
    }
}

// Recursively find the <body> element
fn find_body(handle: &Handle) -> Option<Handle> {
    match &handle.data {
        NodeData::Element { name, .. } if name.local.as_ref() == "body" => Some(handle.clone()),
        _ => {
            for child in handle.children.borrow().iter() {
                if let Some(body) = find_body(child) {
                    return Some(body);
                }
            }
            None
        }
    }
}

However, my problem is that I also want to parse HTML that contains an <html>...</html> tag, and with this workaround that tag gets removed.

html-driver.rs test

#[test]
fn from_utf8() {
    let dom = driver::parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .one("<title>Test".as_bytes());
    let mut serialized = Vec::new();
    let document: SerializableHandle = dom.document.clone().into();
    serialize::serialize(&mut serialized, &document, Default::default()).unwrap();
    assert_eq!(
        String::from_utf8(serialized).unwrap().replace(' ', ""),
        "<html><head><title>Test</title></head><body></body></html>"
    );
}

Update:

parse_fragment is also adding unwanted html.


nicoburns · 2025-03-13T03:34:30Z

You likely want parse_fragment

qknight · 2025-03-14T02:24:43Z

@nicoburns thanks for the heads-up!

With parse_fragment it still adds an <html> tag. Bummer ;-)

I've implemented it like this:

pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
    // Parse the input as a fragment in the context of a <div> element.
    let dom: RcDom = parse_fragment(
        RcDom::default(),
        Default::default(),
        QualName::new(None, ns!(html), local_name!("div")),
        vec![],
    )
    .one(html);
    process_handle(&dom.document)
}

and my test now looks like this:

#[test]
fn test_pre_code3() {
    let html = r#"<div><p> test </p><pre><code>
0
  1
  2
3
</code></pre>
</div>"#;
    let expected = r#"<div><p> test </p><pre><code>
0
  1
  2
3
</code></pre><!--separator-->
</div>"#;

    let node: Node<()> = parse_html(html).ok().flatten().expect("must parse");
    //println!("node: {:#?}", node);
    println!("html: {}", html);
    println!("render: {}", node.render_to_string());
    assert_eq!(expected, node.render_to_string());
}

output:

---- test_pre_code3_paragraphs_mix stdout ----
--- <code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code> ---
html: <div><p> test </p><pre><code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code></pre>
</div>
render: <html><div><p> test </p><pre><code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code></pre><!--separator-->
</div></html>
thread 'test_pre_code3_paragraphs_mix' panicked at tests/html_parser_test.rs:130:5:
assertion `left == right` failed
  left: "<div><p> test </p><pre><code>\n  0\n  <p>1</p>\n  2\n<p>3</p>\n  4\n</code></pre><!--separator-->\n</div>"
 right: "<html><div><p> test </p><pre><code>\n  0\n  <p>1</p>\n  2\n<p>3</p>\n  4\n</code></pre><!--separator-->\n</div></html>"

qknight · 2025-03-14T14:41:46Z

Updates:

Found the function which adds the <html>...</html> tags. I wonder if this can be made optional, maybe with a new TreeBuilderOpts argument.

html5ever/mod.rs

    pub fn new_for_fragment(
        sink: Sink,
        context_elem: Handle,
        form_elem: Option<Handle>,
        opts: TreeBuilderOpts,
    ) -> TreeBuilder<Handle, Sink> {
        println!("new_for_fragment");
...
        // https://html.spec.whatwg.org/multipage/#parsing-html-fragments
        // 5. Let root be a new html element with no attributes.
        // 6. Append the element root to the Document node created above.
        // 7. Set up the parser's stack of open elements so that it contains just the single element root.
        tb.create_root(vec![]);
        println!("new_for_fragment 3");

html5ever/tree_builder.rs

    //§ creating-and-inserting-nodes
    fn create_root(&self, attrs: Vec<Attribute>) {
        let elem = create_element(
            &self.sink,
            QualName::new(None, ns!(html), local_name!("html")),
            attrs,
        );
        self.push(&elem);
        self.sink.append(&self.doc_handle, AppendNode(elem));
        // FIXME: application cache selection algorithm
    }

jdm · 2025-04-04T20:00:55Z

I'm hesitant to make this a configurable behaviour, since that takes us away from the HTML parsing specification that this crate implements. This seems like a post-processing step that would be better implemented by crates that use this library, using manual DOM operations on the parsed tree.
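
As a rough illustration of what such a post-processing step could look like, here is a minimal sketch on a toy tree type. It does not use html5ever's RcDom; `Node`, `unwrap_fragment_root` and `render` are made-up names for this example. The idea is that the caller drops the synthetic <html> root and keeps only its children:

```rust
// Toy DOM model for illustration only; this is NOT html5ever's RcDom.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Document(Vec<Node>),
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Strip the synthetic <html> root and return its children.
/// Returns None if the tree is not a document holding exactly one <html> element.
fn unwrap_fragment_root(doc: Node) -> Option<Vec<Node>> {
    match doc {
        Node::Document(mut kids) if kids.len() == 1 => match kids.pop()? {
            Node::Element { name, children } if name == "html" => Some(children),
            _ => None,
        },
        _ => None,
    }
}

/// Serialize a node back to markup (no attributes or escaping in this toy model).
fn render(node: &Node) -> String {
    match node {
        Node::Document(children) => children.iter().map(render).collect(),
        Node::Element { name, children } => {
            let inner: String = children.iter().map(render).collect();
            format!("<{0}>{1}</{0}>", name, inner)
        }
        Node::Text(text) => text.clone(),
    }
}
```

With a real RcDom the same shape would presumably apply: find the single <html> child of the document handle and work with its children instead of the document itself.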

qknight · 2025-04-08T13:42:57Z

@jdm I couldn't find a formal definition in the html5ever source code of what an HTML fragment actually is.

My intuition is that parse_fragment does two things at once: (1) it parses and fixes up a section of HTML such as <p>foo</p> or even <html>....</html>, and (2) it turns the result into a standalone document.

This is the 'collective' intuition as represented by ChatGPT: https://chatgpt.com/share/67f525ed-c53c-800c-8f49-85b6064226e7. TL;DR: a fragment is a snippet of HTML meant to live inside an existing HTML element, not a standalone HTML document.

So I kindly ask: could you please add a 'formal' comment to the parse_fragment implementation describing what it actually does, so that other developers like me don't get confused too?

If you don't want to alter the implementation in any way, since you like it as it is, then just close this ticket. I can go with the hack you proposed. Thanks!

jdm · 2025-04-08T15:15:39Z

https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments

In particular steps 7, 8 and 16.

nicoburns · 2025-04-08T18:11:50Z

@jdm

Regarding Step 16:

Return root's children, in tree order.

Doesn't "return the root's children" imply that the root itself (the <html> element) is not returned, only its children?

jdm · 2025-04-08T18:16:14Z

That makes me wonder if it's possible to parse a fragment that has multiple elements at the fragment root. What do you return in that case, if not the artificial root node?

jdm · 2025-04-08T18:19:19Z

Mmm, I see that users of that algorithm expect a list of direct children that they can iterate and append to some other tree. Interesting.

nicoburns · 2025-04-08T19:14:03Z

I mean, one presumes this API is for parsing a https://developer.mozilla.org/en-US/docs/Web/API/DocumentFragment, which is its own entity: it provides a container for a list of children, but doesn't have the same enforced structure (<html><head><body>) as an actual Document. When appending a DocumentFragment to a node in a document, the DocumentFragment disappears and its list of children is appended instead.
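
To make the "fragment disappears on append" behaviour concrete, here is a small sketch using a toy model. It is neither the real DOM API nor html5ever; `Node` and `append` are hypothetical names chosen for this illustration:

```rust
// Toy model of DOM insertion semantics; names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Element { name: String, children: Vec<Node> },
    Fragment(Vec<Node>),
    Text(String),
}

/// Append `child` to a parent's child list.
/// A Fragment is never inserted itself: its children are moved
/// into the parent, in order, and the container is discarded.
fn append(parent_children: &mut Vec<Node>, child: Node) {
    match child {
        Node::Fragment(kids) => parent_children.extend(kids),
        other => parent_children.push(other),
    }
}
```

Under this model a fragment with several top-level nodes is not a problem: the container just carries them until they are spliced into the target.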

qknight mentioned this issue Mar 14, 2025

parse_html ignoring white-spaces and newlines for <pre><code> ... </pre></code> html ivanceras/sauron#107

Open

qknight changed the title ~~parse_html adds <html><head>...<body> tags but I want to~~ parse_html adds unwanted tags like <html><head>...<body></html> Mar 15, 2025

This was referenced Mar 16, 2025

rust rewrite nixcloud/pankat#7

Open

parse_html ignoring white-spaces and newlines for <pre><code> ... </pre></code> html fefit/rphtml#4

Open

qknight mentioned this issue Mar 31, 2025

roadmap nixcloud/pankat-rs#2

Open

82 tasks
