Issues with unlinking nodes #60

Closed
jangernert opened this issue Nov 11, 2019 · 10 comments

jangernert (Contributor) commented Nov 11, 2019

With the release of Rust 1.39 I hopped on the async train for the first time. However, I'm having some difficulties with libxml-related code, and I'm not sure if it's a problem with my code or with libxml itself. I'll try to describe my problems here; hopefully someone can help.

I'm running my unit tests with #[tokio::test], which states "Uses current_thread runtime." So there should be no multi-threading shenanigans.

The first test straight up crashes with this message:

malloc_consolidate(): invalid chunk size

I nailed it down to the node.unlink() line in this function:

fn strip_id_or_class(context: &Context, id_or_class: &String) -> Result<(), ScraperError> {
   let xpath = &format!("//*[contains(@class, '{}') or contains(@id, '{}')]", id_or_class, id_or_class);
   evaluate_xpath!(context, xpath, node_vec);
   for mut node in node_vec {
      node.unlink();
   }
   Ok(())
}

A very similar function does not trigger this error, so maybe it's a very specific case that is being hit here.

The second problem I'm having is that async functions don't support recursion right now, so I refactored my code into a loop that looks like this:

let mut xpath_ctx = xpath_ctx;

loop {
    if let Some(url) = self.check_for_next_page(&xpath_ctx, config) {
        let mut html = ArticleScraper::download(&url, &self.client).await?;
        parse_html!(html, config, new_xpath_ctx);
        ArticleScraper::strip_junk(&new_xpath_ctx, config, &url);
        ArticleScraper::extract_body(&new_xpath_ctx, root, config)?;
        xpath_ctx = new_xpath_ctx;
    } else {
        break;
    }
}

However, in the second iteration of the loop, the check_for_next_page function gets stuck evaluating an XPath expression (for which I wrote a macro):

let res = $context.evaluate($xpath).map_err(|()| {
    error!("Evaluation of xpath {} yielded no results", $xpath);
    ScraperErrorKind::Xml
})?;

let $node_vec = res.get_nodes_as_vec();

The complete code is over here, so if anyone has any ideas I'd be happy to hear them.

If the code base of my crate is too big, I can try to whip up some minimal samples.
Thank you!

jangernert (Contributor, Author) commented Nov 19, 2019

Update: the problem with the loop was that I kept the XPath context alive but not the document, as described in #61. So that's solved, now that I know that (again).
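
For reference, here is a minimal sketch of the lifetime rule from #61, using the libxml crate's Parser and Context as in the other snippets in this thread (the next_context helper name is made up for illustration): whatever builds the Context has to hand the Document back as well, so the caller keeps the document alive.

use libxml::parser::Parser;
use libxml::tree::Document;
use libxml::xpath::Context;

// Sketch only: parse a page and build an XPath context for it, returning the
// Document alongside the Context so the caller keeps the document alive for
// as long as the context is used (see #61).
fn next_context(html: &str) -> Result<(Document, Context), ()> {
    let doc = Parser::default_html().parse_string(html).map_err(|_| ())?;
    let ctx = Context::new(&doc)?;
    Ok((doc, ctx))
}

In the loop above, both the returned Document and the Context would then be rebound together on every iteration, instead of only xpath_ctx.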

Regarding the hard crash I observed: it's not due to unlinking one specific node; rather, it always crashes at some point. Reducing the amount of stuff I strip, step by step, eventually makes it run. I'll try to reproduce this behavior in a minimal example.

dginev (Member) commented Nov 19, 2019

Thanks for troubleshooting in this detail and letting us know, @jangernert! There is clearly a need for a more transparent (and ideally earlier) failure when the DocumentWeak refs are no longer around. It's also reassuring that the root cause of both issues was the same flaw; definitely something to address for the next crate release.

jangernert changed the title from "Issues with async code" to "Issues with unlinking nodes" on Nov 19, 2019
jangernert (Contributor, Author) commented Nov 19, 2019

Update: good but unexpected news. The issue isn't related to async at all. The master branch has the exact same issue now, so the only thing that changed to trigger this is the "Golem.de" website itself. I tested the example with an older Rust release (1.36.0) just to be sure.
I have a minimal sample over here. It just crashes with malloc_consolidate(): invalid chunk size for me. The libxml2 version is 2.9.10.

Taking out some of the things that get stripped from the document fixes this, but I couldn't pin it down to one thing specifically. It seems to be either a threshold or a certain combination.

I hope this is concrete enough information to help me find the cause of this.

jangernert (Contributor, Author) commented Dec 16, 2019

Since I remember this working at some point, I tested with libxml2 version 2.9.9, sadly without success.
I also added a working C version of the minimal Rust sample to the repository. Hopefully this will help track down the bug.
Maybe you can take a look at it, @dginev.

dginev (Member) commented Dec 19, 2019

Hi @jangernert,

I think you won't like my answer, but I don't think this is a bug in the crate; rather, it's an issue with your aggressive use of unlink + XPath. If you trim down your example to:

// ... snip ...
let context = Context::new(&doc).unwrap();
let xpath = "//*[contains(@class, 'tags') or contains(@id, 'tags')]";
let res = context.evaluate(xpath).unwrap();
let node_vec = res.get_nodes_as_vec();
for node in node_vec {
  println!(" Node: {:?}", doc.node_to_string(&node));
}
// ... snip ... 

You see the nodes:

 Node: "<div class=\"tags\" id=\"breadcrumbs\">\n    <h3 class=\"tags__title\">\n        <a href=\"/specials/\">Themenseiten:</a>\n    </h3>\n    <ul class=\"tags__list\">\n<li><a href=\"/specials/ietf/\" title=\"IETF - Internet Engineering Task Force\" class=\"lm\">IETF</a></li>\n<li><a href=\"/specials/internet-der-dinge/\" title=\"Internet der Dinge\" class=\"lm\">IoT</a></li>\n<li><a href=\"/specials/node.js/\" title=\"Node.js\" class=\"lm\">Node.js</a></li>\n<li><a href=\"/specials/internet/\" title=\"Internet\" class=\"lm\">Internet</a></li>\n    </ul>\n</div>"
 Node: "<h3 class=\"tags__title\">\n        <a href=\"/specials/\">Themenseiten:</a>\n    </h3>"
 Node: "<ul class=\"tags__list\">\n<li><a href=\"/specials/ietf/\" title=\"IETF - Internet Engineering Task Force\" class=\"lm\">IETF</a></li>\n<li><a href=\"/specials/internet-der-dinge/\" title=\"Internet der Dinge\" class=\"lm\">IoT</a></li>\n<li><a href=\"/specials/node.js/\" title=\"Node.js\" class=\"lm\">Node.js</a></li>\n<li><a href=\"/specials/internet/\" title=\"Internet\" class=\"lm\">Internet</a></li>\n    </ul>"
 Node: "<div class=\"tags\"><span class=\"tags__nada\"><a href=\"#comments\" onclick=\"javascript:_gcpx.push([\'ev\',\'d\',\'commentscroll\']);return true;\">Zu den Kommentaren springen</a></span></div>"
 Node: "<span class=\"tags__nada\"><a href=\"#comments\" onclick=\"javascript:_gcpx.push([\'ev\',\'d\',\'commentscroll\']);return true;\">Zu den Kommentaren springen</a></span>"

In particular, the 5th printed node (the span with the #comments link) is also a child of the 4th printed node, just as the 2nd and 3rd are children of the 1st. So when you unlink the parent and then unlink the child, you end up with a double-free for the #comments node.

The likely "intended" fix here, in the spirit of libxml2, would be for you to try and design your xpath in a restrictive enough way where you don't get multiple nodes from the same subtree in the result set.

A "higher level" solution, which would cost runtime performance, is to design something akin to a cautious_unlink, which does some advanced bookkeeping to throw a user-space error when a double-unlink takes place, rather than the low-level segfaults you end up with in libxml2. But that's an enhancement of its own right, not really a bug with wrapping over libxml2 itself.

dginev (Member) commented Dec 19, 2019

Taking a step back, I am always in favor of adding more safety when using the wrapper, as long as there is no obvious performance cost. For cases where there is a noticeable cost, I lean towards giving it a shot as an experiment, but as a separately namespaced set of methods. Maybe we need a convention where we add slower equivalents of dangerous methods with a cautious_* prefix, or some such, and then add the ones folks report as troublesome.

Alternatively, one could make a context.cautious_evaluate which does a follow-up runtime check of all returned nodes, to ensure that none are related by ancestry and that all can be worked on in parallel. That way you would get a warning even if you're not interested in unlinking. However, it's still slow to check when a large number of nodes comes back; it may end up even more expensive than the XPath evaluation itself... Hence my hesitation about how to proceed.
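
To make that cost concrete, here is a rough sketch of the kind of runtime check that would sit behind such a method (my code, not the crate's; it assumes Node exposes get_parent() and compares equal when wrapping the same underlying xmlNode). It walks each node's ancestor chain and drops any node whose ancestor is also in the result set, which is what makes it O(nodes × depth):

use libxml::tree::Node;

// Sketch only: keep just the nodes that have no ancestor inside the set,
// so no subtree would be unlinked (and later freed) twice.
fn without_nested(nodes: Vec<Node>) -> Vec<Node> {
    nodes
        .iter()
        .filter(|node| {
            let mut parent = node.get_parent();
            while let Some(p) = parent {
                if nodes.iter().any(|other| *other == p) {
                    return false; // an ancestor is also in the set: skip this node
                }
                parent = p.get_parent();
            }
            true
        })
        .cloned()
        .collect()
}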

jangernert (Contributor, Author):

@dginev No, that seems like a fair answer. But what baffles me is that the error appeared at some point in a unit test, so I'm quite confident in saying that it worked at the time of writing the test. I'm also not sure why the same code works in C.

dginev (Member) commented Dec 19, 2019

Indeed, I just validated that main.c also gets the same nested nodes, and xmlUnlinkNode does not lead to a segfault. So the segfault may be avoidable even in these cases; I will dig some more...

dginev (Member) commented Dec 19, 2019

Ah right, this is actually something I realized way back when and then forgot, but luckily I left a comment:

impl Drop for _Node {
  /// Free node if it isn't bound in some document
  /// Warning: xmlFreeNode is RECURSIVE into the node's children, so this may lead to segfaults if used carelessly

The Rust wrapper needs to not leak memory, which means that whenever it drops an unlinked node it should free it explicitly; unlinked nodes will not be deallocated internally by libxml2. That is what the wrapper implementation does, but it does not spend the time and effort to check whether multiple nodes come from the same subtree, and that leads to freeing twice in your example. So indeed your C file does not segfault, but here's something else it does instead:

$ valgrind ./example
==15567== HEAP SUMMARY:
==15567==     in use at exit: 29,406 bytes in 155 blocks
==15567==   total heap usage: 31,562 allocs, 31,407 frees, 3,425,600 bytes allocated
==15567== 
==15567== LEAK SUMMARY:
==15567==    definitely lost: 232 bytes in 6 blocks
==15567==    indirectly lost: 29,174 bytes in 149 blocks
==15567==      possibly lost: 0 bytes in 0 blocks
==15567==    still reachable: 0 bytes in 0 blocks
==15567==         suppressed: 0 bytes in 0 blocks
==15567== Rerun with --leak-check=full to see details of leaked memory

It leaks memory!
Meanwhile, when I run the Rust code you provided (with a couple of the extra nodes kept linked, so that it succeeds), I see:

$ valgrind ./target/debug/deps/xpath_torture-3f982b99394cf43f
...
==19985== 
==19985== HEAP SUMMARY:
==19985==     in use at exit: 1,134 bytes in 22 blocks
==19985==   total heap usage: 32,209 allocs, 32,187 frees, 3,436,069 bytes allocated
==19985== 
==19985== LEAK SUMMARY:
==19985==    definitely lost: 0 bytes in 0 blocks
==19985==    indirectly lost: 0 bytes in 0 blocks
==19985==      possibly lost: 0 bytes in 0 blocks
==19985==    still reachable: 1,134 bytes in 22 blocks
==19985==         suppressed: 0 bytes in 0 blocks
==19985== Rerun with --leak-check=full to see details of leaked memory

No leaks!

It took us a few iterations to arrive at a very safe memory deallocation scheme here, which I will try to keep; it's extremely useful for large batch jobs not to have your RAM leaking out. Your main.c file, on the other hand, will run into RAM usage trouble unless it frees the nodes it gets from XPath.

jangernert (Contributor, Author) commented Dec 19, 2019

Yes, I had a similar thought about an hour ago while riding my bike (it's surprising how often you realize these things during physical exercise): in the C example I don't free every unlinked node.
The only thing I can think of is that the site I pull the HTML from during the test changed its structure, and I had absolutely never hit a scenario like that before. Since the code is a straight-up port of a moderately popular Vala application, that is quite surprising.
Also, working with Rust kinda makes you forget about these memory problems :D

I'll work around the issue and close this report. Thank you for the support!
