parse_html adds unwanted tags like <html><head>...<body></html> #583
Comments
You likely want parse_fragment.
@nicoburns thanks for the heads-up! Using parse_fragment, it still adds an <html> wrapper, as the test output below shows. I've implemented it like this:
pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
let dom: RcDom = parse_fragment(RcDom::default(), Default::default(),
QualName::new(None, ns!(html), local_name!("div")),
vec![],
).one(html);
process_handle(&dom.document)
}
and my test now shows:
#[test]
fn test_pre_code3() {
let html = r#"<div><p> test </p><pre><code>
0
1
2
3
</code></pre>
</div>"#;
let expected = r#"<div><p> test </p><pre><code>
0
1
2
3
</code></pre><!--separator-->
</div>"#;
let node: Node<()> = parse_html(html).ok().flatten().expect("must parse");
//println!("node: {:#?}", node);
println!("html: {}", html);
println!("render: {}", node.render_to_string());
assert_eq!(expected, node.render_to_string());
}
Output:
---- test_pre_code3_paragraphs_mix stdout ----
--- <code>
0
<p>1</p>
2
<p>3</p>
4
</code> ---
html: <div><p> test </p><pre><code>
0
<p>1</p>
2
<p>3</p>
4
</code></pre>
</div>
render: <html><div><p> test </p><pre><code>
0
<p>1</p>
2
<p>3</p>
4
</code></pre><!--separator-->
</div></html>
thread 'test_pre_code3_paragraphs_mix' panicked at tests/html_parser_test.rs:130:5:
assertion `left == right` failed
left: "<div><p> test </p><pre><code>\n 0\n <p>1</p>\n 2\n<p>3</p>\n 4\n</code></pre><!--separator-->\n</div>"
right: "<html><div><p> test </p><pre><code>\n 0\n <p>1</p>\n 2\n<p>3</p>\n 4\n</code></pre><!--separator-->\n</div></html>"
Update: I found the function that adds the <html> root element.
html5ever/mod.rs
pub fn new_for_fragment(
sink: Sink,
context_elem: Handle,
form_elem: Option<Handle>,
opts: TreeBuilderOpts,
) -> TreeBuilder<Handle, Sink> {
println!("new_for_fragment");
...
// https://html.spec.whatwg.org/multipage/#parsing-html-fragments
// 5. Let root be a new html element with no attributes.
// 6. Append the element root to the Document node created above.
// 7. Set up the parser's stack of open elements so that it contains just the single element root.
tb.create_root(vec![]);
println!("new_for_fragment 3");
html5ever/tree_builder.rs
//§ creating-and-inserting-nodes
fn create_root(&self, attrs: Vec<Attribute>) {
let elem = create_element(
&self.sink,
QualName::new(None, ns!(html), local_name!("html")),
attrs,
);
self.push(&elem);
self.sink.append(&self.doc_handle, AppendNode(elem));
// FIXME: application cache selection algorithm
}
I'm hesitant to make this a configurable behaviour, since that takes us away from the HTML parsing specification that this crate implements. This seems like a post-processing step that would be better implemented by crates that use this library, using manual DOM operations on the parsed tree.
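The post-processing step suggested here can be sketched independently of html5ever. The block below is a toy model only: `Node`, `strip_fragment_root`, and `render` are hypothetical names, not html5ever or rcdom types. It assumes the parsed tree has the shape document -> synthetic `html` root -> fragment nodes, and that `render` does no escaping.

```rust
/// Toy DOM node; a stand-in for a real parser's node type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Document(Vec<Node>),
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Return the children of the synthetic `html` root, dropping the root itself.
/// Assumption: a fragment parse yields a document whose only child is `html`.
/// Any other shape is returned unchanged as a list of top-level nodes.
fn strip_fragment_root(doc: Node) -> Vec<Node> {
    let mut top = match doc {
        Node::Document(children) => children,
        other => return vec![other],
    };
    let is_html_root = top.len() == 1
        && matches!(&top[0], Node::Element { name, .. } if name == "html");
    if is_html_root {
        match top.pop() {
            Some(Node::Element { children, .. }) => children,
            _ => unreachable!("shape was checked above"),
        }
    } else {
        top
    }
}

/// Serialize a node to a string (no attribute support, no escaping).
fn render(node: &Node) -> String {
    match node {
        Node::Document(children) => children.iter().map(render).collect(),
        Node::Element { name, children } => {
            let inner: String = children.iter().map(render).collect();
            format!("<{name}>{inner}</{name}>")
        }
        Node::Text(text) => text.clone(),
    }
}
```

Rendering each node returned by `strip_fragment_root` and concatenating the results gives the fragment markup without the `<html>` wrapper. With a real parser, the same idea would apply to its own handle type rather than this enum.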
@jdm I couldn't find a formal definition in the html5ever source code what a
For my intuition the function
This is the 'collective' intuition represented in chatgpt: https://chatgpt.com/share/67f525ed-c53c-800c-8f49-85b6064226e7 and
TL;DR A fragment means a snippet of HTML meant to live inside an existing HTML element, not a standalone HTML document.
So I kindly ask you this: Could you please add a 'formal' comment on the
If you don't want to alter the implementation in any way, since you like it as it is, then just close this ticket. I can go with the hack you proposed. Thanks!
https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments In particular steps 7, 8 and 16.
Regarding Step 16:
Doesn't "return the root's children" imply that the root itself (
That makes me wonder if it's possible to parse a fragment that has multiple elements at the fragment root. What do you return in that case, if not the artificial root node?
Mmm, I see that users of that algorithm expect a list of direct children that they can iterate and append to some other tree. Interesting.
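How a caller might consume such a list of direct children can be shown with a small sketch. It is only a toy: `Element`, `append_fragment`, and `child_names` are hypothetical names, not part of html5ever. It assumes the fragment arrives as a plain vector of sibling nodes with no wrapper element.

```rust
/// Toy element type (hypothetical; not an html5ever or rcdom type).
#[derive(Debug, Clone, PartialEq)]
struct Element {
    name: String,
    children: Vec<Element>,
}

impl Element {
    /// Build an element from a tag name and its child elements.
    fn new(name: &str, children: Vec<Element>) -> Self {
        Element { name: name.to_string(), children }
    }
}

/// Append every top-level fragment node to `parent`, preserving order.
/// The fragment may hold zero, one, or many siblings; no wrapper is added.
fn append_fragment(parent: &mut Element, fragment: Vec<Element>) {
    parent.children.extend(fragment);
}

/// Names of the direct children of `parent`, for inspection.
fn child_names(parent: &Element) -> Vec<&str> {
    parent.children.iter().map(|c| c.name.as_str()).collect()
}
```

Because the fragment is just a list, a fragment with several root-level elements needs no artificial root: each sibling becomes a direct child of the target.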
I mean, one presumes this API is for parsing a DocumentFragment (https://developer.mozilla.org/en-US/docs/Web/API/DocumentFragment), which is its own entity that provides a container for a list of children, but doesn't have the same enforced structure (
I want to use parse_document to create dom/vdom patches, but parse_document(...) keeps adding <html> and <body>. I wonder, is there an option to fine-tune the error correction level? I like that it does add a </title> in the example below. But for creating a virtual-dom patch on a <div id="here">, it is bad to have to filter the html tags out afterwards. However, my problem is that I also want to parse HTML with an <html>...</html> tag in it, and then it gets removed.
html-driver.rs test
Update:
parse_fragment is also adding unwanted html.