
Memory leak in .paragraph_iter #20

Closed
dginev opened this issue Sep 22, 2018 · 5 comments

@dginev
Member

dginev commented Sep 22, 2018

More precisely, the corpus iterators leak. I believe this is a recent regression introduced with the new Node implementation in rust-libxml 0.2.3, which is in itself correct.

My current theory is that the excessive use of Rc pointers in the data structures creates dependencies that are impossible to deallocate, leading entire DocumentRef objects to remain allocated long after the document itself has been used and gone out of scope.
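
For illustration only, a minimal sketch of the kind of Rc cycle suspected here; DocumentRef and NodeRef are stand-ins, not the actual rust-libxml types:

use std::cell::RefCell;
use std::rc::Rc;

// Stand-ins for the suspected layout: a document owning its nodes,
// and every node holding a strong reference back to its document.
struct DocumentRef {
    nodes: RefCell<Vec<Rc<NodeRef>>>,
}

struct NodeRef {
    owner: RefCell<Option<Rc<DocumentRef>>>,
}

fn main() {
    let doc = Rc::new(DocumentRef { nodes: RefCell::new(Vec::new()) });
    let node = Rc::new(NodeRef { owner: RefCell::new(Some(Rc::clone(&doc))) });
    doc.nodes.borrow_mut().push(Rc::clone(&node));

    // Both strong counts are now 2; once `doc` and `node` fall out of scope,
    // each still keeps the other alive, so neither is ever deallocated.
    println!("doc: {}, node: {}", Rc::strong_count(&doc), Rc::strong_count(&node));
}

In a layout like this, downgrading one direction of the relationship to std::rc::Weak is what would let the allocations be reclaimed.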

This is high priority to fix if any of the corpus workflows are to be feasible on an arXiv-sized corpus.

@dginev dginev added the bug label Sep 22, 2018
@dginev
Member Author

dginev commented Sep 22, 2018

Misdiagnosed; this may be XPath-related after all, in particular .get_nodes_as_vec.
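
For reference, a minimal sketch of the call path under suspicion, assuming the rust-libxml xpath API of that era (the file path is a placeholder):

extern crate libxml;
use libxml::parser::Parser;
use libxml::xpath::Context;

fn main() {
    let parser = Parser::default_html();
    // Placeholder path; any corpus document would do here.
    let doc = parser.parse_file("tests/resources/sample.html").unwrap();
    let context = Context::new(&doc).unwrap();
    // Each evaluate + get_nodes_as_vec allocates Node wrappers; the question
    // is whether they are actually freed once the vector goes out of scope.
    let result = context.evaluate("//p").unwrap();
    let paragraphs = result.get_nodes_as_vec();
    println!("found {} paragraphs", paragraphs.len());
}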

@dginev dginev changed the title DNMs leak memory for multi-document workflows Memory leak in .paragraph_iter Sep 22, 2018
@dginev
Member Author

dginev commented Sep 23, 2018

The bigger part of the memory leak was handled in KWARC/rust-libxml#42; it was due to Rc<> pointers being too interwoven for Rust to deallocate them as expected.

There also seems to be a much slower leak present, losing about 1 MB per 100 documents. For comparison, the one fixed in the PR leaked 1 MB per 10 documents, i.e. 10x faster.

It's particularly annoying that I cannot use valgrind on the corpus_token_model executable, as it would have helped significantly in pinpointing where the leaks are... If my math is accurate, my desktop machine will have just enough RAM to tokenize arXiv with the smaller leak, but I would love to find a way to diagnose the remaining issue as well, as a general concern...
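
In the meantime, the leak rate can be estimated without valgrind by sampling resident memory every N documents. A Linux-only sketch; process_document is a placeholder for one iteration of the corpus loop:

use std::fs;

// Sample resident memory (VmRSS, in KB) from /proc/self/status.
fn resident_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))?
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

fn main() {
    let baseline = resident_kb().unwrap_or(0);
    for processed in 1..=1_000u64 {
        // process_document(processed); // placeholder for one corpus document
        if processed % 100 == 0 {
            let now = resident_kb().unwrap_or(0);
            println!("{} docs: +{} KB resident", processed, now.saturating_sub(baseline));
        }
    }
}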

@dginev
Member Author

dginev commented Sep 25, 2018

The larger leak is officially patched in rust-libxml 0.2.4. The smaller leak is still observable; I can also add that the corpus_ams_para example, which is largely identical in its corpus iteration, exhibits the leak as well.

Processing a million documents allocates 3.8 GB of RAM, to be exact.

@dginev
Member Author

dginev commented Sep 26, 2018

A great breakthrough in the debugging process: I can now run valgrind on the examples again! The key was not using jemalloc allocations, as suggested at:

rust-lang/rust#49183 (comment)

Adding this to the example preamble did the trick:

// Opt out of jemalloc and fall back to the system allocator, so valgrind
// can see and track the allocations (nightly-only features at the time).
#![feature(alloc_system, allocator_api)]
extern crate alloc_system;
use alloc_system::System;

#[global_allocator]
static A: System = System;
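
With the system allocator in place, a typical invocation looks like the following; the paths assume a default cargo layout and the valgrind flags are the standard memcheck leak options:

cargo build --example corpus_token_model
valgrind --leak-check=full --show-leak-kinds=definite \
  target/debug/examples/corpus_token_model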

@dginev
Member Author

dginev commented Sep 26, 2018

With valgrind's help, the last culprit has been identified and patched. Once libxml advances to merge and release Node::null, I can ship the DNM::default patch and close this issue.
