Eventfd not closed after executor finish #448
Thanks @thirstycrow. I reproduced this, and I will hunt down where this is coming from. Leave this to me. I suggest you manually raise the file descriptor limit to a very high number so you can test your PR, and I'll fix this later.
Ok, I know why this happens. We keep a clone of the sleep notifier inside each task, and there is a problem we have been aware of for a long time now, though it has been a minor bother: tasks that are not runnable do not have their destructors run when the executor drops. So that reference count never drops.
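The mechanism described above can be illustrated with a minimal sketch in plain Rust (no glommio). The `notifier` Rc here is a stand-in for the executor's sleep notifier, which owns the eventfd, and `Box::leak` simulates a non-runnable task whose destructor never runs:

```rust
use std::rc::Rc;

fn main() {
    // Stand-in for the executor's sleep notifier (the real one wraps an eventfd).
    let notifier = Rc::new("sleep notifier");
    assert_eq!(Rc::strong_count(&notifier), 1);

    // A task that never becomes runnable holds a clone of the notifier.
    // If the task's destructor never runs (simulated here with Box::leak),
    // the clone is never dropped, the strong count never returns to 1,
    // and the underlying eventfd would therefore never be closed.
    let task_clone = Box::new(Rc::clone(&notifier));
    Box::leak(task_clone);
    assert_eq!(Rc::strong_count(&notifier), 2);
    println!("strong_count = {}", Rc::strong_count(&notifier));
}
```

In the real bug the leaked clone lives inside the task structures, so dropping the executor is not enough to bring the count down.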
I raised the limit of open files, and the test lasted until round 2723, then panicked with
As a status update, I spent some time trying to fix this, but it is really hard because tasks often get destroyed under our noses. This brought me back to the refcount hell in the task structures. I'll keep looking at it.
This turned out to be pretty hard to fix, so I didn't get back to it =( I will note, though, that creating executors is a fairly expensive operation, and this bug only manifests when executors are created and destroyed very frequently. Outside of test cases like the one described in this bug, I'd recommend working with long-lived executors when possible, which would not manifest the issue.
So I'm new to Rust, and maybe there is an easy way around this. My service is basically a cronjob that runs periodically and does the following:
What I'm trying to use glommio for is reading files using Direct I/O (to bypass the page cache, since I need to read basically all the files on disk, so caching won't help and would just use more resources). So for each file, I'm basically doing:
I tried to work around this by moving the executor.run call up the chain to contain all the logic, so that I run it once when the service starts, but then I ran into other problems: mainly, any tokio-based function stopped working, such as timers, HTTP requests (using hyper), etc. I found a glommio version of timers but didn't get to fixing the HTTP request part, and wanted to get some feedback to see if there is a better way. Maybe I can do
at the start of the service and call
per file? Not sure if that would work around the eventfd leak.
My recommendation would be to try to move the timer logic inside the executor as well. Everything that does I/O (including timers) has to use glommio primitives. The one thing that works from Tokio is channels: so the way to do it is to add a channel for communication, and then the timers and files go inside the executor. Hyper is indeed a bit harder, but if you can, read the requests in tokio and pass just what you need through channels. Forget for a moment that these are Rust executors: the recommended architecture would still be something like a thread pool. Long-lived executors are essentially a thread pool, so you just have to pass data to them when you need it.
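The channel-based architecture recommended above can be sketched with plain `std` primitives (the `Job` enum and `spawn_worker` names are illustrative, not glommio API). In the real setup the worker thread's loop body would live inside a single glommio `LocalExecutor` created once for the thread's lifetime, so only one eventfd is ever opened:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical job type: in the real service this would carry the file path
// to read with Direct I/O inside the glommio executor.
enum Job {
    Process(String),
    Shutdown,
}

// Spawn one long-lived worker thread. In the real setup, the glommio executor
// would be created once here and `run` the async work for every job received.
fn spawn_worker() -> (mpsc::Sender<Job>, mpsc::Receiver<String>, thread::JoinHandle<()>) {
    let (job_tx, job_rx) = mpsc::channel::<Job>();
    let (res_tx, res_rx) = mpsc::channel::<String>();
    let handle = thread::spawn(move || {
        for job in job_rx {
            match job {
                Job::Process(path) => {
                    // Stand-in for the async Direct I/O read of `path`.
                    let _ = res_tx.send(format!("processed {path}"));
                }
                Job::Shutdown => break,
            }
        }
        // res_tx drops here, which terminates the receiver's iterator.
    });
    (job_tx, res_rx, handle)
}

fn main() {
    let (tx, rx, handle) = spawn_worker();
    for name in ["a.dat", "b.dat"] {
        tx.send(Job::Process(name.to_string())).unwrap();
    }
    tx.send(Job::Shutdown).unwrap();
    let results: Vec<String> = rx.iter().collect();
    handle.join().unwrap();
    assert_eq!(results, vec!["processed a.dat", "processed b.dat"]);
}
```

The tokio side (timers, hyper) stays on the main runtime and only file paths and results cross the channel boundary, so the two runtimes never share I/O primitives.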
Got it, thank you!
Came across this issue trying to write a property test via proptest with a lot of generated test cases (since each test invocation would create a LocalExecutor and leak). Here's the workaround I came up with for my tests.
Blocking versions can't be implemented because they would require spawning a new thread to avoid jamming up the background thread used for other syscalls (the jamming up can generate a deadlock, which is what testing found). The blocking version would require spawning a new thread with a standalone executor, but that's horribly expensive AND runs into DataDog#448.
As the following benchmark runs, there's an increasing number of eventfds listed by lsof. With 2 new executors created in each round of the test, there are 12 more open eventfds after the executors finish.
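A minimal way to observe the growth between benchmark rounds, without lsof, is to count eventfd descriptors through `/proc` (Linux only; `count_eventfds` is a hypothetical helper, not part of the benchmark):

```shell
# Count anon-inode eventfd descriptors held by a process (default: this shell).
# Running this between benchmark rounds would show the count growing as
# executors are created and dropped without closing their eventfds.
count_eventfds() {
  pid=${1:-$$}
  count=0
  for fd in /proc/"$pid"/fd/*; do
    case "$(readlink "$fd" 2>/dev/null)" in
      *eventfd*) count=$((count + 1)) ;;
    esac
  done
  echo "$count"
}

count_eventfds $$
```

`readlink` on an fd entry prints `anon_inode:[eventfd]` for eventfds, which is what the pattern matches; `lsof -p <pid>` shows the same entries with more detail.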