Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Program hangs after successful shutdown #50

Closed
DASPRiD opened this issue Jan 3, 2023 · 33 comments
Closed

Program hangs after successful shutdown #50

DASPRiD opened this issue Jan 3, 2023 · 33 comments

Comments

@DASPRiD
Copy link

DASPRiD commented Jan 3, 2023

Hi there,

During some recent testing I ran into an "interesting" issue with SIGINT and similar shutdowns. The exact situation I identified is the following:

Deep inside one of my subsystems, I do an async lookup_host(), which has a timeout of apparently 20 seconds, which is await there. When hitting CTRL+C, all subsystems shut down and I get tokio_graceful_shutdown::toplevel] Shutdown successful..

At this point the program does not terminate though – it only exits after the lookup_host() future resolves/rejects. Do you happen to have an idea on how to force tokio to not wait for such things before shutting down itself?

@Finomnis
Copy link
Owner

Finomnis commented Jan 3, 2023

Theoretically, after the timeout of handle_shutdown_requests is over, it should cancel all the pending futures ... Would you mind providing a reproducible example? This behaviour sounds like a bug and requires fixing :)

Did you spawn it through a tokio::spawn? Because that slips away from the reach of the shutdown handler, you would have to use SubsystemHandle::start instead. tokio::spawn is like a detached thread, the system looses all knowledge about its existence.

I could theoretically wait for it then through guard objects, but I wouldn't have any possibility of cancelling it. The shutdown manager can only cancel futures it knows, for obvious reasons :)

But none of this matters if you didn't use tokio::spawn :) So I will be able to help you further once you provide a reproducible example.

@Finomnis
Copy link
Owner

Finomnis commented Jan 3, 2023

Hmm. It seems that lookup_host() internally uses spawn_blocking(). Sadly, it's almost impossible to handle spawn_blocking gracefully during shutdown. Let me construct a minimal example for further discussion.

@DASPRiD
Copy link
Author

DASPRiD commented Jan 3, 2023

I was just looking into lookup_host() as well for reproduction, and figured the same thing. I really wonder how this could be resolved nicely.

I do use vanilla threads in two other places in my program, but they all react to oneshot receivers being dropped and exit immediately.

@Finomnis
Copy link
Owner

Finomnis commented Jan 3, 2023

The only thing I found so far would be shutdown_timeout() or shutdown_background(), but I'm unsure how to do this with externally provided runtimes. I suspect you can't shut down the runtime you are currently running on.

So spawning another runtime inside of Toplevel would probably work, but that would override the user's decision what runtime to spawn. I wonder if there is a config option of tokio::main that kills background tasks on exit, that would probably be the best option.

@Finomnis
Copy link
Owner

Finomnis commented Jan 3, 2023

This seems to exhibit the behaviour you described:

use std::{thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let mut hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });

    let joinhandle_ref = &mut hanging_task;

    select! {
        e = joinhandle_ref => e.unwrap(),
        _ = subsys.on_shutdown_requested() => {
            hanging_task.abort()
        },
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    Toplevel::new()
        .catch_signals()
        .start("Hanging Task", hanging_task)
        .handle_shutdown_requests(Duration::from_millis(2000))
        .await
        .map_err(Into::into)
}

@DASPRiD
Copy link
Author

DASPRiD commented Jan 3, 2023

Yeah, that seems to be pretty much it. I'm screening the tokio docs and code, but haven't found anything for it yet.

@Finomnis
Copy link
Owner

Finomnis commented Jan 3, 2023

I could add something along those lines in handle_shutdown_requests:

use std::{thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });

    select! {
        e = hanging_task => e.unwrap(),
        _ = subsys.on_shutdown_requested() => {
            // Spawn thread that kills leftover tasks if necessary
            thread::spawn(|| {
                thread::sleep(Duration::from_millis(1000));
                println!("Shutdown seems to hang. Killing program ...");
                std::process::exit(1);
            });
        },
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    Toplevel::new()
        .catch_signals()
        .start("Hanging Task", hanging_task)
        .handle_shutdown_requests(Duration::from_millis(200))
        .await
        .map_err(Into::into)
}

What do you think?

The idea is that we spawn a thread. If main() exits successfully, the thread is killed. But if main() hangs, the thread kills the program after a grace period.

@DASPRiD
Copy link
Author

DASPRiD commented Jan 3, 2023

I think that's a sane way to handle it, as long as the kill timeout is configurable. Though technically we already have timeout defined in handle_shutdown_requests.

@Finomnis
Copy link
Owner

Finomnis commented Jan 3, 2023

The only problem is that it isn't guaranteed that Toplevel is used like this, it could be used in a bunch of ways, including somewhere nested in the program. And in that case we probably don't want to provoke a exit(1) if the shutdown hangs.

We could add a .exit_program_on_hang() option that can be appended to .handle_shutdown_requests().

Might have to sleep over it, though. I'm not perfectly happy with this yet, it seems too hacky.

@Finomnis Finomnis changed the title Question about dangling awaits during shutdown Program hangs after successful shutdown Jan 3, 2023
@DASPRiD
Copy link
Author

DASPRiD commented Jan 3, 2023

Sure thing, take your time (not too much please, I definitely like this resolved ;)). I agree that it should be an opt-in feature though.

@Finomnis
Copy link
Owner

Finomnis commented Jan 3, 2023

@DASPRiD Until I/we come up with a solution, you can work around it like this:

use std::{thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });

    select! {
        e = hanging_task => e.unwrap(),
        _ = subsys.on_shutdown_requested() => (),
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    Toplevel::new()
        .catch_signals()
        .start("Hanging Task", hanging_task)
        .handle_shutdown_requests(Duration::from_millis(200))
        .await?;

    thread::spawn(|| {
        thread::sleep(Duration::from_millis(1000));
        log::error!("Shutdown seems to hang. Killing program ...");
        std::process::exit(1);
    });

    Ok(())
}

@Finomnis
Copy link
Owner

Finomnis commented Jan 3, 2023

Feel free to post further thoughts or ideas, I'm open for everything on this problem. I'm struggling to think of something graceful right now, I'll also do some further research.

@DASPRiD
Copy link
Author

DASPRiD commented Jan 3, 2023

I agree that this is definitely not the nicest solution. Considering that such a blocking thread should usually just be ignored and an exit should happen immediately, this feels like an issue on the tokio side.

Granted though, I was only running into this issue because I had to restart my program because of network issues, and that's exactly when I noticed this problem of course. As long as the network just works, the shutdown is normal. So this is definitely an edge-case.

@Finomnis
Copy link
Owner

Finomnis commented Jan 3, 2023

The best solution I can come up with right now is to put above chunk of code in a function, like

fn kill_on_hang(timeout: Duration) {
    thread::spawn(|| {
        thread::sleep(timeout);
        log::error!("Shutdown seems to hang. Killing program ...");
        std::process::exit(1);
    });
}

So that people can use it like this:

use std::{thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });

    select! {
        e = hanging_task => e.unwrap(),
        _ = subsys.on_shutdown_requested() => (),
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    Toplevel::new()
        .catch_signals()
        .start("Hanging Task", hanging_task)
        .handle_shutdown_requests(Duration::from_millis(200))
        .await?;

    kill_on_hang(Duration::from_secs(1));

    Ok(())
}

But it doesn't feel right. I agree that I think the solution should be on tokio side somewhere. And it is, if you use tokio's runtime directly and can call Runtime::shutdown_background(). It just sucks that tokio::Runtime's default drop behaviour is to wait indefinitely.

@Finomnis
Copy link
Owner

Finomnis commented Jan 3, 2023

@DASPRiD In case you want to report it to tokio, this is the minimal reproducible example:

use std::{thread, time::Duration};

#[tokio::main]
async fn main() {
    tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });
}

In my opinion it should exit immediately, but it doesn't.

My reasoning is that this also exits immediately:

use std::{thread, time::Duration};

fn main() {
    thread::spawn(|| {
        thread::sleep(Duration::from_secs(10));
    });
}

Further, this also exits immediately:

use std::time::Duration;

#[tokio::main]
async fn main() {
    tokio::spawn(async {
        tokio::time::sleep(Duration::from_secs(10)).await;
    });
}

@DASPRiD
Copy link
Author

DASPRiD commented Jan 3, 2023

I should do that, thanks a lot! I'll reference this issue over there.

@DASPRiD
Copy link
Author

DASPRiD commented Jan 3, 2023

I've created an issue on the tokio repository. Maybe you can chime in there as well if you have anything to add ❤️

@Finomnis
Copy link
Owner

Finomnis commented Jan 4, 2023

@DASPRiD I prototyped a solution in tokio:

Cargo.toml:

[patch.crates-io]
tokio = { version = "1.23.1", git = "https://github.com/Finomnis/tokio", branch = "main_arg_ignore_hanging_threads" }

[dependencies]
tokio-graceful-shutdown = "0.12.1"
tokio = { version = "1.23.1", features = ["full"] }
env_logger = "0.10.0"
miette = { version = "5.5.0", features = ["fancy"] }
use std::{thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let mut hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(100));
    });

    let joinhandle_ref = &mut hanging_task;

    select! {
        e = joinhandle_ref => e.unwrap(),
        _ = subsys.on_shutdown_requested() => (),
    }

    Ok(())
}

#[tokio::main(ignore_hanging_threads = true)]
async fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    Toplevel::new()
        .catch_signals()
        .start("Hanging Task", hanging_task)
        .handle_shutdown_requests(Duration::from_millis(2000))
        .await
        .map_err(Into::into)
}

What are your thoughts?

@DASPRiD
Copy link
Author

DASPRiD commented Jan 4, 2023

That looks quite good to me. The parameter name is a little ambiguous though maybe, as this is quite specific to tokio spawned threads. Apart from that, it looks like a really clean solution.

@Finomnis
Copy link
Owner

Finomnis commented Jan 4, 2023

Yah, it's still open for debate. Now I just need two things:

  • tests
  • agreement with the tokio devs

The second one could take a while, tokio is notoriously understaffed with many open pull requests. They are doing an excellent job, just to clarify, but there's just too many people wanting things done :D

So be patient I guess

@Finomnis
Copy link
Owner

Finomnis commented Jan 4, 2023

Maybe ignore_pending_threads instead of ignore_hanging_threads? Or detach_threads_on_exit? I'm not sure, I hate naming things :D

@DASPRiD
Copy link
Author

DASPRiD commented Jan 4, 2023

I feel you on that, coming up with short but still descriptive names is… hard. Maybe something like abandon_pending_threads?

@Finomnis
Copy link
Owner

Finomnis commented Jan 4, 2023

Sounds good, might use it. Gotta leave for today though, can't promise when I'll get to it. Greetz

@DASPRiD
Copy link
Author

DASPRiD commented Jan 4, 2023

Thanks a lot for your effort! :)

@Finomnis
Copy link
Owner

Finomnis commented Jan 6, 2023

@DASPRiD I opened a PR for the first part of the fix, fyi: tokio-rs/tokio#5360

@DASPRiD
Copy link
Author

DASPRiD commented Jan 6, 2023

Thanks for keeping me in the loop. I assume a second PR would add that to the tokio-main macro?

@Finomnis
Copy link
Owner

Finomnis commented Jan 6, 2023

Exactly :) it's currently completely impossible to fix that reliably, with our without #[tokio::main]. The first PR adds that functionality to the Runtime.

@Finomnis
Copy link
Owner

Finomnis commented Jan 9, 2023

@DASPRiD It might be unlikely that we get the parameter for #[tokio::main] merged. I'm currently fighting for the config option for Runtime, that one might happen. You might have to use the Runtime object directly then, but that shouldn't be a problem, #[tokio::main] is really just a thin wrapper around the Runtime object. Wish me luck ;)

@DASPRiD
Copy link
Author

DASPRiD commented Jan 20, 2023

I see that there was no movement on the PR yet, I really hope they get around to it eventually :)

@Finomnis
Copy link
Owner

I'm not sure, they seem to be quite busy ... I'll ask again in their discord soon.

@Finomnis
Copy link
Owner

Finomnis commented Feb 6, 2023

@DASPRiD After a long discussion they rejected the pull request and instead proposed a solution on our side.

Their proposal was to wrap the runtime in a newtype that calls shutdown_background() in its Drop implementation.

Here is the documentation on what #[tokio::main] actually expands to: https://docs.rs/tokio/latest/tokio/attr.main.html#using-the-multi-thread-runtime

With that, this is how such a newtype thing would look like:

use std::{future::Future, thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::{runtime::Runtime, select};
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

struct RuntimeWithInstantShutdown(Option<Runtime>);
impl RuntimeWithInstantShutdown {
    pub fn new() -> Self {
        Self(Some(
            tokio::runtime::Builder::new_multi_thread()
                .enable_all()
                .build()
                .unwrap(),
        ))
    }
    pub fn block_on<F: Future>(&self, future: F) -> F::Output {
        self.0.as_ref().unwrap().block_on(future)
    }
}
impl Drop for RuntimeWithInstantShutdown {
    fn drop(&mut self) {
        self.0.take().unwrap().shutdown_background()
    }
}

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });

    select! {
        e = hanging_task => e.unwrap(),
        _ = subsys.on_shutdown_requested() => (),
    }

    Ok(())
}

fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    RuntimeWithInstantShutdown::new().block_on(async {
        Toplevel::new()
            .catch_signals()
            .start("Hanging Task", hanging_task)
            .handle_shutdown_requests(Duration::from_millis(200))
            .await
            .map_err(Into::into)
    })
}

@DASPRiD
Copy link
Author

DASPRiD commented Feb 6, 2023

@Finomnis Interesting, thanks a lot for your effort! Do you have any plan to have a macro for that within this library to make it easier for users?

@Finomnis
Copy link
Owner

Finomnis commented Feb 6, 2023

Not really, to be honest. It's different enough that it shouldn't be part of this crate, in my opinion.

It's just a few lines of code, so I don't think it's even worth abstracting it much. Rather the opposite: because runtime configuration is so highly context dependent, I think this is already the best and most generic API. Writing a macro would only make my crate more opinionated, and I'm trying to keep it as generic and unopinionated as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants