Graceful shutdown #38

Open
imbolc opened this issue Jul 8, 2022 · 11 comments

Comments

imbolc (Contributor) commented Jul 8, 2022

After changes to the job code, previously added payloads may not be valid anymore. A solution could be to stop spawning new jobs and wait until all the remaining jobs are done before restarting the runner. The only way to do this I could find for now is to query the db directly: .. from mq_msgs where attempts > 0. Should we add a method for this so users won't rely on the implementation details?
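
(A rough sketch of that workaround, for illustration only: it polls the mq_msgs table directly, which is an internal detail of sqlxmq's schema rather than a supported API, and the function name is made up.)

use std::time::Duration;

use sqlx::PgPool;

// Wait until no message in the queue has remaining attempts.
// mq_msgs and its attempts column are sqlxmq implementation details.
async fn wait_until_queue_empty(db: &PgPool) -> sqlx::Result<()> {
    loop {
        let (pending,): (i64,) =
            sqlx::query_as("SELECT count(*) FROM mq_msgs WHERE attempts > 0")
                .fetch_one(db)
                .await?;
        if pending == 0 {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}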

Diggsey (Owner) commented Jul 8, 2022

This doesn't work because messages may be scheduled arbitrarily far into the future, or be "dead" (attempts = 0). Applications which require zero downtime upgrades are expected to only make backwards-compatible changes to the message payloads.

With serde, this is generally pretty easy to achieve. In cases where it is not possible, you can add a new job instead of modifying the old one, and then run that version of the code at least until all legacy jobs are gone from the database. (This is better than shutdown logic, because there is no time limit on how long you can run this version of the application)
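
For illustration, one way such a backwards-compatible change can look with serde (the payload struct here is hypothetical, not part of sqlxmq): a field added with #[serde(default)] lets messages written by the old version still deserialize under the new one.

use serde::{Deserialize, Serialize};

// Hypothetical job payload; the struct and field names are made up.
#[derive(Serialize, Deserialize)]
struct SendEmail {
    to: String,
    subject: String,
    // Added in the new version: old payloads without this field still
    // deserialize, because serde falls back to the default value.
    #[serde(default)]
    priority: u8,
}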

Diggsey closed this as completed Jul 8, 2022

imbolc (Contributor, Author) commented Jul 8, 2022

> This is better than shutdown logic

Though it's significantly more complicated.

> there is no time limit on how long you can run this version of the application

Not in general, but the developer of each particular application could settle on a viable strategy. Most apps can tolerate some downtime.

Diggsey (Owner) commented Jul 10, 2022

sqlxmq is designed to be run with multiple replicas talking to the same database, and it's not clear how graceful shutdown could work in that case.

That said, I agree there is a subsection of applications which:

  1. Require messages to be reliable despite new deployments.
  2. Do not require zero-downtime deployments.
  3. Do not schedule messages into the future.
  4. Need the message format to change over time.
  5. Do not run with multiple replicas.

For which the effort of maintaining backwards compatibility of messages is unnecessary.

The question is whether that subsection is large enough, and the extra effort burdensome enough, to warrant this feature. I think for me this falls into the gap of: not important enough (and with enough implementation challenges) that I don't want to implement it myself, but I might accept a PR with some conditions:

  1. It should be clearly documented what the limitations of this approach are.
  2. It should be named according to its function (e.g. "wait until queue empty" or something) rather than just "graceful shutdown".
  3. It should have a working example which shows correct use (including how to prevent new messages from being spawned during shutdown).
  4. Some thought should be given to "dead" messages.
  5. It should not complicate implementation of the core message queue mechanism.

Diggsey reopened this Jul 10, 2022

imbolc (Contributor, Author) commented Jul 11, 2022

Could you explain how to achieve zero downtime while adding a new job? Shouldn't we restart all runners before spawning the job? Otherwise it won't be in their registries. So shouldn't there be a way to at least wait for the current tasks to gracefully finish before restarting the runners?

Diggsey (Owner) commented Jul 11, 2022

> Could you explain how to achieve zero-downtime while adding a new job?

You would do a rolling restart of your replicas with the new version.

> So shouldn't there be a way to at least wait for the current tasks to gracefully finish before restarting the runners?

Oh, you just mean waiting for already-picked-up jobs to complete. This is already mostly possible: when you drop (or call stop() on) the OwnedHandle returned by starting the job runner, it will stop accepting new tasks, and then you just need to wait for existing tasks to complete.

This second part could be made a little easier.

imbolc (Contributor, Author) commented Jul 11, 2022

> it will stop accepting new tasks, and then you just need to wait for existing tasks to complete

It seems to also cancel all the existing tasks. Or maybe I got something wrong. Would you look at the code: https://github.com/imbolc/sqlxmq/blob/signal-shutdown/examples/signal-shutdown.rs

Diggsey (Owner) commented Jul 11, 2022

It's only cancelling the existing tasks because you're returning from main(), and tokio will stop all tasks when the tokio runtime is stopped. I didn't run the example you gave because it uses signals and I'm on Windows, but you can replace the code with this and see that the tasks are not killed:

use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    dotenv::dotenv().ok();
    let db = sqlx::PgPool::connect(&std::env::var("DATABASE_URL").unwrap()).await?;

    // Enqueue a job that sleeps for 10 seconds.
    sleep.builder().set_json(&10u64)?.spawn(&db).await?;

    // Start the job runner; the returned handle controls its lifetime.
    let handle = sqlxmq::JobRegistry::new(&[sleep]).runner(&db).run().await?;

    println!("Waiting 2 seconds...");
    tokio::time::sleep(Duration::from_secs(2)).await;
    println!("Stopping...");

    // Stop picking up new jobs; the already-running task is not cancelled.
    handle.stop().await;

    // Keep the tokio runtime alive long enough for the in-flight job to finish.
    tokio::time::sleep(Duration::from_secs(10)).await;
    Ok(())
}

#[sqlxmq::job]
pub async fn sleep(mut job: sqlxmq::CurrentJob) -> sqlx::Result<()> {
    let second = std::time::Duration::from_secs(1);
    // The payload is the number of seconds to sleep.
    let mut to_sleep: u64 = job.json().unwrap().unwrap();
    while to_sleep > 0 {
        println!("job#{} {to_sleep} more seconds to sleep ...", job.id());
        tokio::time::sleep(second).await;
        to_sleep -= 1;
    }
    job.complete().await
}

The way sqlxmq works, the job runner doesn't really know about tasks; it's the registry that manages task lifetimes. The job runner just invokes a callback whenever a job is picked up, and it's up to the callback to decide how to spawn a task.

You can see that the implementation provided by the JobRegistry is not even async:

pub fn spawn_internal<E: Into<Box<dyn Error + Send + Sync + 'static>>>(

Ideally tokio would have some way to "wait until all tasks are done", but it does not.

imbolc (Contributor, Author) commented Jul 12, 2022

Right, I see now, thank you :) Maybe we could add an AtomicUsize to spawn_internal to track the number of alive jobs?
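
For illustration, a minimal standalone sketch of that counter idea (not sqlxmq's actual implementation): each spawned job holds a guard that decrements the counter when it finishes, and shutdown code waits until the counter reaches zero.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;

// Guard held by a running job; decrements the shared counter on drop.
struct JobGuard(Arc<AtomicUsize>);

impl Drop for JobGuard {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
}

fn start_job(running: &Arc<AtomicUsize>) -> JobGuard {
    running.fetch_add(1, Ordering::SeqCst);
    JobGuard(running.clone())
}

#[tokio::main]
async fn main() {
    let running = Arc::new(AtomicUsize::new(0));
    for i in 1..=3u64 {
        let guard = start_job(&running);
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_secs(i)).await;
            drop(guard); // job finished
        });
    }
    // Graceful shutdown: wait until no jobs are alive any more.
    while running.load(Ordering::SeqCst) > 0 {
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
    println!("all jobs finished");
}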

Diggsey (Owner) commented Jul 12, 2022

Something like that, yeah.

imbolc (Contributor, Author) commented Jul 12, 2022

It seems like it's already there: JobRunner::running_jobs. But there's no access to the JobRunner from outside; it's just instantly passed to a tokio task after creation. Should I add an Arc<JobRunner> field to OwnedHandle?

imbolc (Contributor, Author) commented Jul 12, 2022

I found that OwnedHandle is also used in another place, so I added a wrapper for the runner handle. I also updated the example based on your code, so it's not Linux-specific anymore.
