Redis memory growth #6205

Open
vvayngart1 opened this issue Feb 26, 2024 · 9 comments
@vvayngart1

ruby 2.7.6
rails 6.1.7.6
sidekiq 6.5.12
redis 4.0.9
sidekiq-pro 5.5.5

Good afternoon

I'm investigating a slow memory bloat/leak on the Redis server being used by Sidekiq. I'd appreciate any information/suggestions about the following:

  • Is there a way to configure Sidekiq to automatically set a TTL on its Redis keys? I'm open to any other suggestions for determining which keys are stale or never expire, since there are currently no timestamps I can examine for patterns.
  • My original theory was that Sidekiq batches don't expire and that, per an article I read, setting Sidekiq::Batch::LINGER = 0 would solve the issue. But the release notes say LINGER now defaults to 24 hours, which wouldn't explain the memory leak.

Any other ideas or suggestions are welcome.
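One way to audit which keys never expire is to SCAN the keyspace and check each key's TTL. This is a sketch: the #scan and #ttl calls mirror the redis-rb gem's API, and the key pattern is a hypothetical namespace.

```ruby
# Sketch of a TTL audit. `client` can be any object responding to
# #scan and #ttl with redis-rb's signatures (e.g. Redis.new).
def persistent_keys(client, pattern: "*", batch: 1000)
  keys = []
  cursor = "0"
  loop do
    cursor, page = client.scan(cursor, match: pattern, count: batch)
    # TTL of -1 means the key exists but has no expiry set
    page.each { |key| keys << key if client.ttl(key) == -1 }
    break if cursor == "0"
  end
  keys
end

# Usage (hypothetical): persistent_keys(Redis.new, pattern: "app_name:*")
```

Keys reported by this audit that are not Sidekiq's global structures (queues, retry/scheduled/dead sets) would be candidates for closer inspection.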

@mperham
Collaborator

mperham commented Feb 26, 2024

Is this Redis used for Sidekiq only?

All transient Sidekiq data sets a TTL (e.g. batch data, rate limiters, unique locks, process heartbeats). Global data structures (e.g. the queues and the retry, scheduled, and dead sets) do not set a TTL because they should always be present, but they should trend toward empty unless your system is actively pushing data into them. No one has identified an actual Sidekiq data leak in many years.

The Using Redis wiki page has some links to useful tools but I haven't scanned the Redis community for any new, useful, maintained tools in years. If you find anything, please let me know so we can add it.

@vvayngart1
Author

Good afternoon - thank you for the response

Based on the above and some additional investigation, I have a few more questions:

  • I found the following TTLs being set in sidekiq and sidekiq-pro code:
    • TTLs set in sidekiq:
    • TTLs set in sidekiq-pro:
      • I use .gem/ruby/2.7.6/gems/sidekiq-pro-5.5.5/lib/sidekiq/batch.rb in my application; the code below is copied from that file:
      ONE_DAY = 60 * 60 * 24
      EXPIRY = ONE_DAY * 30
      
      # Controls how long a batch record "lingers" in Redis before expiring.
      # This allows APIs like Status#poll to check batch status even after
      # the batch succeeds and is no longer needed.  You can lower this
      # constant if you create lots of batches, want to reclaim the memory
      # and don't use polling.
      LINGER = ONE_DAY
      
      def initialize(bid = nil)
        @expiry = EXPIRY
        ...
      end
      
      def expiry
        @expiry || EXPIRY
      end
      
      def increment_batch_jobs_to_redis(conn, jids)
        ...
        conn.expire(jids_key, expiry)
      end
      
  • I performed the following Redis queries:
    [redis_host]:6379> ttl [app_name]:b-EsV6KTKaTLcqyA
    (integer) 13289740
    [redis_host]:6379> hkeys [app_name]:b-EsV6KTKaTLcqyA
    1) "created_at"
    2) "callbacks"
    3) "description"
    4) "pending"
    5) "total"
    6) "kids"
    7) "cbdeath"
    [redis_host]:6379> hvals [app_name]:b-EsV6KTKaTLcqyA
    1) "1706801595.5151515"
    2) "{}"
    3) "[Name]Worker run at 2024-02-01 09:33:15 -0600"
    4) "51"
    5) "51"
    6) "0"
    7) "1"
    
    Adding the time elapsed since created_at to the current TTL works out to an initial TTL of 180 days.
    • Q: I can't find anywhere in my application where that value is set. I assume Sidekiq::Batch::EXPIRY = ONE_DAY * 180 could be set, but it isn't. Any suggestions for other places where Sidekiq TTLs might be customized are greatly appreciated.
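The 180-day arithmetic can be sanity-checked in plain Ruby. Values are copied from the redis-cli output above; this is just a back-of-envelope sketch:

```ruby
created_at       = 1706801595.5151515  # "created_at" field from hvals
ttl_remaining    = 13_289_740          # from the ttl command
assumed_original = 180 * 24 * 60 * 60  # 15_552_000 seconds if the TTL was 180 days

# Seconds elapsed between batch creation and the TTL check:
elapsed = assumed_original - ttl_remaining
# created_at + elapsed should land on the date the TTL was inspected
# (late February 2024), which is consistent with a 180-day initial TTL.
puts Time.at(created_at + elapsed).utc
```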

@mperham
Collaborator

mperham commented Feb 28, 2024

Well, for instance the heartbeat is expired here:

> transaction.expire(work_key, 60)

Sidekiq Pro 4.x held dead batch data for 180 days, the same as the default dead timeout. This was fixed in #4217 in v5.0.0. My guess is that you upgraded recently.

@vvayngart1
Author

vvayngart1 commented Feb 28, 2024

👍

Thanks for pointing that out - I also found HISTOGRAM_TTL.

I also haven't upgraded recently (definitely not this year), so I'm not sure why it shows 180 days for a batch created 2024-02-01 :(

#4127 mentions that if the batch never succeeds, it will expire after six months. Is it still the case that a batch that never succeeds expires only after 180 days?

I see the following code:

Sidekiq.death_handlers << ->(job, ex) do
  ...
  q, cb, _ = Sidekiq::Batch.redis(bid) { |conn|
    conn.pipelined do |m|
      m.hget("b-#{bid}", "cbq")
      m.hsetnx("b-#{bid}", "cbdeath", 1)
      m.sadd("batch-died", [bid])
      m.sadd("b-#{bid}-died", [job["jid"]])
      # extend expiry so the batch data will linger along with any dead jobs
      m.expire("b-#{bid}-died", Sidekiq[:dead_timeout_in_seconds])
      m.expire("b-#{bid}", Sidekiq[:dead_timeout_in_seconds])
    end
  }
end

which uses dead_timeout_in_seconds, set here to 180 days. So by default, with our versions of the gems, it would still be 180 days?
Is there a way to reconfigure that?

Thanks

@mperham
Collaborator

mperham commented Feb 28, 2024

Yep, that’s where the 180 days is coming from. You are welcome to lower that number but I would not recommend lowering it below 30 days.
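For reference, lowering the dead timeout could look like the initializer below. This is a sketch assuming Sidekiq 6.x, where options are exposed as a mutable hash; verify the exact setter against your Sidekiq version.

```ruby
# config/initializers/sidekiq.rb -- a sketch, assuming Sidekiq 6.x.
# 30 days is the floor suggested above: going lower risks expiring
# batch data that a resurrected dead job would still need.
Sidekiq.configure_server do |config|
  config.options[:dead_timeout_in_seconds] = 30 * 24 * 60 * 60
end
```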

@vvayngart1
Author

vvayngart1 commented Feb 28, 2024

👍

I'm a bit new to the Sidekiq/Redis ecosystem, so apologies for the somewhat naive questions.

  • What is the state transition logic for a batch to become a dead batch?
  • Is it that a job in the batch fails, a failure callback is called, and a dead batch is created?
  • Also, can you explain why you recommend a dead-batch TTL of at least 30 days?

Any information or links to documents about batch state transitions are greatly appreciated.

  • If a batch is processed successfully, its data still lingers in Redis for LINGER (24 hours by default) after successful processing, correct?

That might be my slow resource leak. I run an ever-increasing number of batches per day as transaction volume grows. There are periods of almost no activity, but batches are not released because LINGER is 24 hours, so overlapping lingering batches would consume more and more memory over time.
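The lingering-batch theory can be put into rough numbers. The creation rate and per-batch size below are hypothetical placeholders; `MEMORY USAGE <key>` in redis-cli gives the real per-batch footprint.

```ruby
batches_per_hour = 500       # hypothetical creation rate
bytes_per_batch  = 2_048     # hypothetical size; measure with MEMORY USAGE
linger_hours     = 24        # Sidekiq::Batch::LINGER default

# At a steady rate, roughly linger_hours' worth of finished batches
# are resident in Redis at any moment:
resident_bytes = batches_per_hour * linger_hours * bytes_per_batch
printf("~%.1f MB held by lingering batches\n", resident_bytes / 1024.0 / 1024.0)
```

Note that this is a steady-state bound, not a leak: memory should plateau once the oldest lingering batches begin expiring, unless the batch rate itself keeps growing.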

@mperham
Collaborator

mperham commented Feb 28, 2024

  1. If a batch job dies, the batch dies. A job dies when it runs out of retries.
  2. Because you can "resurrect" a batch by manually running a dead batch job from the Dead tab. If the job succeeds, the batch can succeed.
  3. 24hr LINGER is pretty conservative, really to handle a rare edge case. It's totally ok to reduce it to one hour.
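Lowering LINGER per the advice above could look like this in an initializer. This is a sketch: LINGER is a plain Ruby constant in Sidekiq Pro's batch.rb, so it is replaced here by removing and re-setting the constant; check your Pro version for a supported configuration hook first.

```ruby
# config/initializers/sidekiq.rb -- hypothetical override, assuming
# Sidekiq Pro resolves Sidekiq::Batch::LINGER at call time.
Sidekiq::Batch.send(:remove_const, :LINGER)
Sidekiq::Batch.const_set(:LINGER, 60 * 60)  # keep finished batch data for 1 hour
```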

@vvayngart1
Author

Q1: If a batch consists of N jobs and one job dies after exhausting its retries, is a new "b-#{bid}-died" key created containing the failed job's JID?
Q2: If M < N jobs die after exhausting their retries, are M "b-#{bid}-died" keys created, or just one for all the failed jobs?
Q3: Once a batch succeeds, all of its batch data is deleted from Redis after the LINGER TTL, correct?

Thank you again for the great information and support.

@mperham
Collaborator

mperham commented Feb 28, 2024

  1. Yes.
  2. 1
  3. Yes.

These are specific implementation details, subject to change (though unlikely) and to my imperfect memory. The code is the ultimate source of truth, of course.
