
Race condition when running multiple processes blocked by after_commit hook #1275

Open

padde opened this issue Mar 6, 2024 · 2 comments

padde (Contributor) commented Mar 6, 2024

We have a job that generates a large file and then uploads it to ActiveStorage. ActiveStorage uses an after_commit hook to process the upload after the transaction has been committed. Our code looks something like this:

class Parent < ActiveRecord::Base
end

class Model < ActiveRecord::Base
  include AASM

  has_one :parent
  has_one_attached :large_file, dependent: :purge_later

  aasm do
    state :new, initial: true
    state :processing

    event :start_processing do
      transitions from: :new, to: :processing
    end
  end
end

class GenerateLargeFileJob < ActiveJob::Base
  def perform(model)
    return if model.processing?
    
    ActiveRecord::Base.transaction do
      model.start_processing!
      model.parent.update!(some: 'values')
      io = generate_large_file_content # takes up to 1.5 hours
      model.large_file.attach(io:, filename: 'large-file.zip', content_type: 'application/zip')
    end
    # upload happens "here" in after_commit hook, can take a few minutes.
  end
end
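The timing is the crux here: the attachment's upload runs in an after_commit callback, i.e. after the transaction has committed but while the job's perform is still on the stack. A simplified sketch of that shape (not ActiveStorage's actual implementation, just an illustration of the callback timing):

class Model < ActiveRecord::Base
  # Hypothetical callback standing in for ActiveStorage's deferred upload.
  after_commit :upload_pending_attachment, on: %i[create update]

  private

  def upload_pending_attachment
    # Runs outside the transaction, right after COMMIT. For a very large file
    # this can take minutes, during which the surrounding job's #perform has
    # not yet returned.
  end
end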

If we run this with multiple processes, we find that the advisory lock has apparently already been released once the transaction is committed, even though the job is not yet marked as finished. The upload happens right after the transaction commits and can take a few minutes to complete. During that time, we saw another good_job process pick up the job and attempt to process it.

In our specific use case, this second job exits immediately with success thanks to the sanity check for processing?. This is a bit oversimplified, but that is essentially what happens. Of course one could ask whether we could do better error handling here, but that's not the point. The point is that good_job is trying to work on a job that is still in progress on another processor!

The first job will eventually finish too and then raise an error:

WARNING:  you don't own a lock of type ExclusiveLock

Is there any way around this? I think we did not see this problem with Sidekiq because the locking was not tied directly to the Postgres transaction, and so the lock would only have been released after the upload finished.

For now, we will disable our autoscaling rules to avoid such race conditions, but we can't keep it that way, as it will severely degrade performance during peak times, when we usually want to start up to 5 background workers to clear the queues quickly.

bensheldon (Owner) commented Mar 6, 2024

That's not good!

Everything you've described sounds like the database connection that holds the advisory lock is being disconnected/closed (nothing to do with transactions or after_commit).

GoodJob uses session/connection-based advisory locks (not transaction-based advisory locks; a short sketch of the difference follows the list below). So I don't think the problem is the transaction, but rather that somewhere the database connection that holds the advisory lock is being disconnected/closed/broken. The signs of that are:

  • The job is being executed again by a different thread, which is a sign the connection has specifically been closed (rather than returned to the connection pool).
  • The "WARNING: you don't own a lock of type ExclusiveLock" which is what happens when a new/different connection is checked-out and the code tries to release an advisory lock it didn't acquire.

Are you seeing this happen on every one of these long jobs, or just sometimes?

I'm wondering if this is like:

  • Occasional, which would be a sign that some specific event (like the database hiccuping or a network appliance restarting) is causing it. Or...
  • Every job, which is a sign that configuration/something somewhere is intentionally closing long-lived database connections.
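If it turns out to be the second case, some of the usual suspects are server- and client-side settings that close idle or long-lived connections. A hedged diagnostic sketch, assuming Rails console access (the settings listed are just common candidates, not a diagnosis):

conn = ActiveRecord::Base.connection

# Server-side timeouts that can terminate sessions (0 means disabled):
conn.select_value("SHOW idle_in_transaction_session_timeout")
conn.select_value("SHOW idle_session_timeout")   # PostgreSQL 14+
conn.select_value("SHOW statement_timeout")

# Client-side: ActiveRecord's pool reaper can remove connections it considers
# idle or dead; see the idle_timeout / reaping_frequency keys in database.yml.
ActiveRecord::Base.connection_pool.db_config.configuration_hash.slice(
  :idle_timeout, :reaping_frequency, :checkout_timeout
)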

padde (Contributor, Author) commented Mar 8, 2024

Thanks for your detailed answer. We don't know yet whether this happens regularly. This process does not run very often, and further testing will use up a lot of resources at a third-party provider, so we will need to monitor it closely over a longer period of time. For now we won't get new data, as we disabled autoscaling to be safe. In the meantime, I will experiment further and try to come up with a minimal reproducible example.
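One thing that might help while reproducing, assuming console access to the same database during a run of the long job: a generic query against pg_locks / pg_stat_activity (nothing GoodJob-specific) to see which sessions currently hold advisory locks and whether they are still alive:

rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT l.pid, l.classid, l.objid, l.granted,
         a.state, a.backend_start, a.query_start
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE l.locktype = 'advisory'
SQL
rows.each { |row| puts row.inspect }

If the job's lock disappears from that list (or its pid changes) partway through the run, that would point at the connection being dropped rather than at the transaction commit.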
