Graceful shutdown #38

Open
imbolc opened this issue Jul 8, 2022 · 11 comments

Comments

imbolc (Contributor) commented Jul 8, 2022

After changes to the job code, previously added payloads may not be valid anymore. A solution could be to stop spawning new jobs and wait until all the remaining jobs are done before restarting the runner. The only way to do this I could find for now is to query the db directly: .. from mq_msgs where attempts > 0. Should we add a method for this so users won't rely on the implementation details?
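
(A rough sketch of that workaround, for illustration only: it polls the mq_msgs table directly, which is an internal detail of sqlxmq's schema rather than a supported API, and the function name is made up.)

use std::time::Duration;

use sqlx::PgPool;

// Wait until no message in the queue has remaining attempts.
// mq_msgs and its attempts column are sqlxmq implementation details.
async fn wait_until_queue_empty(db: &PgPool) -> sqlx::Result<()> {
    loop {
        let (pending,): (i64,) =
            sqlx::query_as("SELECT count(*) FROM mq_msgs WHERE attempts > 0")
                .fetch_one(db)
                .await?;
        if pending == 0 {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}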

Diggsey (Owner) commented Jul 8, 2022

This doesn't work because messages may be scheduled arbitrarily far into the future, or be "dead" (attempts = 0). Applications which require zero downtime upgrades are expected to only make backwards-compatible changes to the message payloads.

With serde, this is generally pretty easy to achieve. In cases where it is not possible, you can add a new job instead of modifying the old one, and then run that version of the code at least until all legacy jobs are gone from the database. (This is better than shutdown logic, because there is no time limit on how long you can run this version of the application)
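
For illustration, one way such a backwards-compatible change can look with serde (the payload struct here is hypothetical, not part of sqlxmq): a field added with #[serde(default)] lets messages written by the old version still deserialize under the new one.

use serde::{Deserialize, Serialize};

// Hypothetical job payload; the struct and field names are made up.
#[derive(Serialize, Deserialize)]
struct SendEmail {
    to: String,
    subject: String,
    // Added in the new version: old payloads without this field still
    // deserialize, because serde falls back to the default value.
    #[serde(default)]
    priority: u8,
}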

Diggsey closed this as completed Jul 8, 2022

imbolc (Contributor, Author) commented Jul 8, 2022

> This is better than shutdown logic

Though it's significantly more complicated.

> there is no time limit on how long you can run this version of the application

Not in general, but the developer of each particular application could settle on a viable strategy. Most apps can tolerate some downtime.

Diggsey (Owner) commented Jul 10, 2022

sqlxmq is designed to be run with multiple replicas talking to the same database, and it's not clear how graceful shutdown could work in that case.

That said, I agree there is a subsection of applications which:

  1. Require messages to be reliable despite new deployments.
  2. Do not require zero-downtime deployments.
  3. Do not schedule messages into the future.
  4. Need the message format to change over time.
  5. Do not run with multiple replicas.

For which the effort of maintaining backwards compatibility of messages is unnecessary.

The question is whether that subsection is large enough, and the extra effort burdensome enough, to warrant this feature. I think for me this falls into the gap of: not important enough (and with enough implementation challenges) that I don't want to implement it myself, but I might accept a PR with some conditions:

  1. It should be clearly documented what the limitations of this approach are.
  2. It should be named according to its function (e.g. "wait until queue empty" or something) rather than just "graceful shutdown".
  3. It should have a working example which shows correct use (including how to prevent new messages from being spawned during shutdown).
  4. Some thought should be given to "dead" messages.
  5. It should not complicate implementation of the core message queue mechanism.

Diggsey reopened this Jul 10, 2022

imbolc (Contributor, Author) commented Jul 11, 2022

Could you explain how to achieve zero downtime while adding a new job? Shouldn't we restart all runners before spawning the job? Otherwise it won't be in their registries. So shouldn't there be a way to at least wait for the current tasks to gracefully finish before restarting the runners?

Diggsey (Owner) commented Jul 11, 2022

> Could you explain how to achieve zero-downtime while adding a new job?

You would do a rolling restart of your replicas with the new version.

> So shouldn't there be a way to at least wait for the current tasks to gracefully finish before restarting the runners?

Oh, you just mean waiting for already-picked-up jobs to complete. This is already mostly possible: when you drop (or call stop() on) the OwnedHandle returned by starting the job runner, it will stop accepting new tasks, and then you just need to wait for existing tasks to complete.

This second part could be made a little easier.

imbolc (Contributor, Author) commented Jul 11, 2022

> it will stop accepting new tasks, and then you just need to wait for existing tasks to complete

It seems to also cancel all the existing tasks. Or maybe I got something wrong. Would you look at the code: https://github.com/imbolc/sqlxmq/blob/signal-shutdown/examples/signal-shutdown.rs

Diggsey (Owner) commented Jul 11, 2022

It's only cancelling the existing tasks because you're returning from main(), and tokio will stop all tasks when the tokio runtime is stopped. I didn't run the example you gave because it uses signals and I'm on Windows, but you can replace the code with this and see that the tasks are not killed:

use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    dotenv::dotenv().ok();
    let db = sqlx::PgPool::connect(&std::env::var("DATABASE_URL").unwrap()).await?;

    // Enqueue a job that sleeps for 10 seconds.
    sleep.builder().set_json(&10u64)?.spawn(&db).await?;

    // Start the job runner; the returned handle controls its lifetime.
    let handle = sqlxmq::JobRegistry::new(&[sleep]).runner(&db).run().await?;

    println!("Waiting 2 seconds...");
    tokio::time::sleep(Duration::from_secs(2)).await;
    println!("Stopping...");

    // Stop picking up new jobs; the already-running task is not cancelled.
    handle.stop().await;

    // Keep the tokio runtime alive long enough for the in-flight job to finish.
    tokio::time::sleep(Duration::from_secs(10)).await;
    Ok(())
}

#[sqlxmq::job]
pub async fn sleep(mut job: sqlxmq::CurrentJob) -> sqlx::Result<()> {
    let second = std::time::Duration::from_secs(1);
    // The payload is the number of seconds to sleep.
    let mut to_sleep: u64 = job.json().unwrap().unwrap();
    while to_sleep > 0 {
        println!("job#{} {to_sleep} more seconds to sleep ...", job.id());
        tokio::time::sleep(second).await;
        to_sleep -= 1;
    }
    job.complete().await
}

The way sqlxmq works, the job runner doesn't really know about tasks; it's the registry that manages task lifetimes. The job runner just invokes a callback whenever a job is picked up, and it's up to the callback to decide how to spawn a task.

You can see that the implementation provided by the JobRegistry is not even async:

pub fn spawn_internal<E: Into<Box<dyn Error + Send + Sync + 'static>>>(

Ideally tokio would have some way to "wait until all tasks are done", but it does not.

imbolc (Contributor, Author) commented Jul 12, 2022

Right, I see now, thank you :) Maybe we could add an AtomicUsize to spawn_internal to track the number of alive jobs?
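
For illustration, a minimal standalone sketch of that counter idea (not sqlxmq's actual implementation): each spawned job holds a guard that decrements the counter when it finishes, and shutdown code waits until the counter reaches zero.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;

// Guard held by a running job; decrements the shared counter on drop.
struct JobGuard(Arc<AtomicUsize>);

impl Drop for JobGuard {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
}

fn start_job(running: &Arc<AtomicUsize>) -> JobGuard {
    running.fetch_add(1, Ordering::SeqCst);
    JobGuard(running.clone())
}

#[tokio::main]
async fn main() {
    let running = Arc::new(AtomicUsize::new(0));
    for i in 1..=3u64 {
        let guard = start_job(&running);
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_secs(i)).await;
            drop(guard); // job finished
        });
    }
    // Graceful shutdown: wait until no jobs are alive any more.
    while running.load(Ordering::SeqCst) > 0 {
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
    println!("all jobs finished");
}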

Diggsey (Owner) commented Jul 12, 2022

Something like that, yeah.

imbolc (Contributor, Author) commented Jul 12, 2022

It seems like it's already there: JobRunner::running_jobs. But there's no access to the JobRunner from outside; it's just instantly passed to a tokio task after creation. Should I add an Arc<JobRunner> field to OwnedHandle?

imbolc (Contributor, Author) commented Jul 12, 2022

I found that OwnedHandle is also used in another place, so I added a wrapper for the runner handle. I also updated the example based on your code, so it's not Linux-specific anymore.
