Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eventfd not closed after executor finish #448

Open
thirstycrow opened this issue Oct 20, 2021 · 9 comments
Open

Eventfd not closed after executor finish #448

thirstycrow opened this issue Oct 20, 2021 · 9 comments
Labels
bug Something isn't working

Comments

@thirstycrow
Copy link
Contributor

thirstycrow commented Oct 20, 2021

As the following benchmark runs, there's an increasing number of eventfds listed by lsof. With 2 new executors created in each round of test, there's 12 more open eventfds exists after the executors finish.

use glommio::channels::shared_channel;
use glommio::prelude::*;
use std::sync::mpsc::sync_channel;
use std::time::{Duration, Instant};

fn test_spsc(capacity: usize) {
    let runs: u32 = 10_000_000;
    let (sender, receiver) = shared_channel::new_bounded(capacity);

    let sender = LocalExecutorBuilder::new()
        .pin_to_cpu(0)
        .spawn(move || async move {
            let sender = sender.connect().await;
            for _ in 0..runs {
                sender.send(1).await.unwrap();
            }
            drop(sender);
        })
        .unwrap();

    let receiver = LocalExecutorBuilder::new()
        .pin_to_cpu(1)
        .spawn(move || async move {
            let receiver = receiver.connect().await;
            for _ in 0..runs {
                receiver.recv().await.unwrap();
            }
        })
        .unwrap();

    sender.join().unwrap();
    receiver.join().unwrap();
}

fn main() {
    for i in 0..10000 {
        println!("==========");
        println!("Round {}", i);
        //test_spsc(10);
        test_spsc(100);
        test_spsc(1000);
        test_spsc(10000);
    }
}
@glommer
Copy link
Collaborator

glommer commented Oct 20, 2021

Thanks @thirstycrow . I reproduced this, and I will hunt where this is coming from. Leave this to me.
To set expectations, I am about to enter paternity leave so I'll be off for some days.

I suggest we manually raise the limit of file descriptors to a very high number so you can test your PR, and I'll fix this later.

@glommer
Copy link
Collaborator

glommer commented Oct 20, 2021

Ok, I know why this happens. We keep a clone of the sleep notifier inside task, and there is a problem that we are aware of for a long time now, but has been a minor bother: tasks that are not runnable do not have their destructors run when the executor drops. So that reference count never drops.

@thirstycrow
Copy link
Contributor Author

I raised the limit of open files, and the test lasts until round 2723, panicked with Cannot allocate memory (os error 12). I inspected the process status just before the panic. The VSZ and RSS from the ps output are 36.8G and 1.765G. I have 32G memory on my laptop.

@glommer
Copy link
Collaborator

glommer commented Nov 15, 2021

As a status update, I spent some time trying to fix this, but it is really hard because tasks often get destroyed under our nose. This brought me back to the refcount hell in the task structures. I'll keep looking at it.

@HippoBaro HippoBaro added the bug Something isn't working label Feb 28, 2022
@glommer
Copy link
Collaborator

glommer commented Mar 23, 2023

This turned out to be pretty hard to fix, so I didn't get back to it =(
There's currently no workaround, unfortunately.

I will notice, though, that creating executors is a fairly expensive operation, and this bug manifests with executors being created / destructed very frequently. Outside of test cases - like described in this bug, I'd recommend you work with long-lived executors when possible, which would not manifest the issue.

@zaidoon1
Copy link

zaidoon1 commented Mar 23, 2023

I'd recommend you work with long-lived executors when possible, which would not manifest the issue.

so i'm new to rust and maybe there is an easy way around this. my service is basically a cronjob that runs periodically and does the following:

  1. iterate over millions of files on disk
  2. parse the files, extract some data and send http requests based on some condition
  3. sleep for x hours
  4. repeat again

What i'm trying to use glommio for is reading files using Direct I/O (to bypass page cache since i need to read all the files on disk basically and caching wont' help but use more resources).

So for each file, i'm basically doing:

let executor = LocalExecutor::default();

executor.run(async {
            let file = match DmaFile::open(file_path).await {
                Ok(file) => file,
                Err(err) => bail!("failed to open xxxx: {}", err),
            };.....

I tried to work around this by moving the executor.run call up the chain to contain all the logic so that i run it once when the service starts but then ran into other problems, mainly any tokio based function stopped working such as timers, http requests using (hyper), etc.. I found a glommio version of timers but didn't get to fixin the http request part and wanted to get some feedback to see if there is a better way?

maybe i can do

let executor = LocalExecutor::default();

at the start of the service and call

executor.run(async {
            let file = match DmaFile::open(file_path).await {
                Ok(file) => file,
                Err(err) => bail!("failed to open xxxx: {}", err),
            };.....

per file? not sure if that would work around the eventfd leak

@glommer
Copy link
Collaborator

glommer commented Mar 23, 2023

My recommendation would be to try to move the timer logic inside the executor as well.
The executor is essentially a thread, so you can spawn it early, and move stuff inside.

For everything that does I/O (including timers), they have to be glommio primitives. The one thing that works from Tokio is channels: so the way to do it is to add a channel for communication, then timers and files would go inside the executor.

Hyper indeed it's a bit harder, but if you can, you can read the requests in tokio, and pass just what you need through channels.

Forget for a moment this would be Rust executors: the recommended architecture would still be something like a thread pool. Long lived executors are essentially a thread pool, so you just have to pass data to them when you need it.

@zaidoon1
Copy link

got it, thank you!

@vlovich
Copy link
Contributor

vlovich commented Sep 21, 2023

Came across this issue trying to write a property test via proptest with a lot of test cases generated (since each test invocation would create a LocalExecutor and leak).

Here's the workaround I came up with for my tests.

mod test_support {
    use glommio::LocalExecutor;

    // Workaround for https://github.com/DataDog/glommio/issues/448
    thread_local! {
        static EXECUTOR: OnceCell<LocalExecutor> = OnceCell::new();
    }

    pub fn run_async<T>(future: impl std::future::Future<Output = T>) -> T {
        EXECUTOR.with(|cell| {
            let local_ex = cell.get_or_init(|| {
                glommio::LocalExecutorBuilder::new(glommio::Placement::Fixed(0))
                    .record_io_latencies(true)
                    .make()
                    .unwrap()
            });
            local_ex.run(async move { future.await })
        })
    }
}

vlovich added a commit to vlovich/glommio that referenced this issue May 11, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 11, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 11, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 11, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 11, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 11, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 11, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 11, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 11, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 13, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 13, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 31, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
vlovich added a commit to vlovich/glommio that referenced this issue May 31, 2024
Blocking versions can't be implemented because they would require
spawning a new thread to avoid jamming up the background thread used for
other syscalls (the jamming up can generate a deadlock which is what
testing found). The blocking version would require spawning a new thread
with a standalone executor but that's horibbly expensive AND runs into
DataDog#448.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants