Redis memory growth #6205

Open
vvayngart1 opened this issue Feb 26, 2024 · 9 comments
@vvayngart1

ruby 2.7.6
rails 6.1.7.6
sidekiq 6.5.12
redis 4.0.9
sidekiq-pro 5.5.5

Good afternoon

I'm investigating a slow memory bloat/leak on the Redis server being used by Sidekiq. I'd appreciate any information/suggestions about the following:

  • Is there a way to configure Sidekiq to automatically set a TTL on its Redis keys? I'm open to any other suggestions for determining which keys are stale or never expire, since there are currently no timestamps I can examine for patterns.
  • My original theory was that Sidekiq batches don't expire and that, per an article I read, setting Sidekiq::Batch::LINGER = 0 would solve the issue. But the release notes say LINGER now defaults to 24 hours, which wouldn't explain the memory leak.

Any other ideas or suggestions are welcome.
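One way to audit which keys never expire is to SCAN the keyspace and check each key's TTL. This is a sketch: the #scan and #ttl calls mirror the redis-rb gem's API, and the key pattern is a hypothetical namespace.

```ruby
# Sketch of a TTL audit. `client` can be any object responding to
# #scan and #ttl with redis-rb's signatures (e.g. Redis.new).
def persistent_keys(client, pattern: "*", batch: 1000)
  keys = []
  cursor = "0"
  loop do
    cursor, page = client.scan(cursor, match: pattern, count: batch)
    # TTL of -1 means the key exists but has no expiry set
    page.each { |key| keys << key if client.ttl(key) == -1 }
    break if cursor == "0"
  end
  keys
end

# Usage (hypothetical): persistent_keys(Redis.new, pattern: "app_name:*")
```

Keys reported by this audit that are not Sidekiq's global structures (queues, retry/scheduled/dead sets) would be candidates for closer inspection.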

@mperham
Collaborator

mperham commented Feb 26, 2024

Is this Redis used for Sidekiq only?

All transient Sidekiq data sets a TTL (e.g. batch data, rate limiters, unique locks, process heartbeats). Global data structures (e.g. the queues and the retry, scheduled, and dead sets) do not set a TTL because they should always be present, but they should trend toward empty unless your system is actively pushing data into them. No one has identified an actual Sidekiq data leak in many years.

The Using Redis wiki page has some links to useful tools but I haven't scanned the Redis community for any new, useful, maintained tools in years. If you find anything, please let me know so we can add it.

@vvayngart1
Author

Good afternoon - thank you for the response

Based on the above and some additional investigation, I have a few more questions:

  • I found the following TTLs being set in sidekiq and sidekiq-pro code:
    • TTLs set in sidekiq:
    • TTLs set in sidekiq-pro:
      • I use .gem/ruby/2.7.6/gems/sidekiq-pro-5.5.5/lib/sidekiq/batch.rb in my application; the code below is copied from that file:
      ONE_DAY = 60 * 60 * 24
      EXPIRY = ONE_DAY * 30
      
      # Controls how long a batch record "lingers" in Redis before expiring.
      # This allows APIs like Status#poll to check batch status even after
      # the batch succeeds and is no longer needed.  You can lower this
      # constant if you create lots of batches, want to reclaim the memory
      # and don't use polling.
      LINGER = ONE_DAY
      
      def initialize(bid = nil)
        @expiry = EXPIRY
        ...
      end
      
      def expiry
        @expiry || EXPIRY
      end
      
      def increment_batch_jobs_to_redis(conn, jids)
        ...
        conn.expire(jids_key, expiry)
      end
      
  • I performed the following Redis queries:
    [redis_host]:6379> ttl [app_name]:b-EsV6KTKaTLcqyA
    (integer) 13289740
    [redis_host]:6379> hkeys [app_name]:b-EsV6KTKaTLcqyA
    1) "created_at"
    2) "callbacks"
    3) "description"
    4) "pending"
    5) "total"
    6) "kids"
    7) "cbdeath"
    [redis_host]:6379> hvals [app_name]:b-EsV6KTKaTLcqyA
    1) "1706801595.5151515"
    2) "{}"
    3) "[Name]Worker run at 2024-02-01 09:33:15 -0600"
    4) "51"
    5) "51"
    6) "0"
    7) "1"
    
    Adding the time elapsed since created_at to the current TTL works out to an initial TTL of 180 days.
    • Q: I can't find anywhere in my application where that value is set. I assume Sidekiq::Batch::EXPIRY = ONE_DAY * 180 could be set, but it isn't. Any suggestions for other places where Sidekiq TTLs might be customized are greatly appreciated.
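The 180-day arithmetic can be sanity-checked in plain Ruby. Values are copied from the redis-cli output above; this is just a back-of-envelope sketch:

```ruby
created_at       = 1706801595.5151515  # "created_at" field from hvals
ttl_remaining    = 13_289_740          # from the ttl command
assumed_original = 180 * 24 * 60 * 60  # 15_552_000 seconds if the TTL was 180 days

# Seconds elapsed between batch creation and the TTL check:
elapsed = assumed_original - ttl_remaining
# created_at + elapsed should land on the date the TTL was inspected
# (late February 2024), which is consistent with a 180-day initial TTL.
puts Time.at(created_at + elapsed).utc
```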

@mperham
Collaborator

mperham commented Feb 28, 2024

Well, for instance the heartbeat is expired here:

> transaction.expire(work_key, 60)

Sidekiq Pro 4.x held dead batch data for 180 days, the same as the default dead timeout. This was fixed in #4217 in v5.0.0. My guess is that you upgraded recently.

@vvayngart1
Author

vvayngart1 commented Feb 28, 2024

👍

Thanks for pointing that out - I also found HISTOGRAM_TTL.

I also haven't upgraded recently (definitely not this year), so I'm not sure why it shows 180 days for a batch created 2024-02-01 :(

#4127 mentions that if the batch never succeeds, it will expire after six months. Is it still the case that a batch that never succeeds expires only after 180 days?

I see the following code:

Sidekiq.death_handlers << ->(job, ex) do
  ...
  q, cb, _ = Sidekiq::Batch.redis(bid) { |conn|
    conn.pipelined do |m|
      m.hget("b-#{bid}", "cbq")
      m.hsetnx("b-#{bid}", "cbdeath", 1)
      m.sadd("batch-died", [bid])
      m.sadd("b-#{bid}-died", [job["jid"]])
      # extend expiry so the batch data will linger along with any dead jobs
      m.expire("b-#{bid}-died", Sidekiq[:dead_timeout_in_seconds])
      m.expire("b-#{bid}", Sidekiq[:dead_timeout_in_seconds])
    end
  }
end

which uses dead_timeout_in_seconds, set here to 180 days. So by default, with our versions of the gems, it would still be 180 days?
Is there a way to reconfigure that?

Thanks

@mperham
Collaborator

mperham commented Feb 28, 2024

Yep, that’s where the 180 days is coming from. You are welcome to lower that number but I would not recommend lowering it below 30 days.
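For reference, lowering the dead timeout could look like the initializer below. This is a sketch assuming Sidekiq 6.x, where options are exposed as a mutable hash; verify the exact setter against your Sidekiq version.

```ruby
# config/initializers/sidekiq.rb -- a sketch, assuming Sidekiq 6.x.
# 30 days is the floor suggested above: going lower risks expiring
# batch data that a resurrected dead job would still need.
Sidekiq.configure_server do |config|
  config.options[:dead_timeout_in_seconds] = 30 * 24 * 60 * 60
end
```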

@vvayngart1
Author

vvayngart1 commented Feb 28, 2024

👍

I'm a bit new to the Sidekiq/Redis ecosystem, so apologies for the somewhat naive questions.

  • What is the state transition logic for a batch to become a dead batch?
  • Is it that a job in the batch fails, a failure callback is called, and a dead batch is created?
  • Also, can you explain why you recommend a dead-batch TTL of at least 30 days?

Any information or links to documents about batch state transitions are greatly appreciated.

  • If a batch is processed successfully, its data still lingers in Redis for LINGER (24 hours by default) after successful processing, correct?

That might be my slow resource leak. I run an ever-increasing number of batches per day as transaction volume grows. There are periods of almost no activity, but batches are not released because LINGER is 24 hours, so overlapping lingering batches would consume more and more memory over time.
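The lingering-batch theory can be put into rough numbers. The creation rate and per-batch size below are hypothetical placeholders; `MEMORY USAGE <key>` in redis-cli gives the real per-batch footprint.

```ruby
batches_per_hour = 500       # hypothetical creation rate
bytes_per_batch  = 2_048     # hypothetical size; measure with MEMORY USAGE
linger_hours     = 24        # Sidekiq::Batch::LINGER default

# At a steady rate, roughly linger_hours' worth of finished batches
# are resident in Redis at any moment:
resident_bytes = batches_per_hour * linger_hours * bytes_per_batch
printf("~%.1f MB held by lingering batches\n", resident_bytes / 1024.0 / 1024.0)
```

Note that this is a steady-state bound, not a leak: memory should plateau once the oldest lingering batches begin expiring, unless the batch rate itself keeps growing.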

@mperham
Collaborator

mperham commented Feb 28, 2024

  1. If a batch job dies, the batch dies. A job dies when it runs out of retries.
  2. Because you can "resurrect" a batch by manually running a dead batch job from the Dead tab. If the job succeeds, the batch can succeed.
  3. 24hr LINGER is pretty conservative, really to handle a rare edge case. It's totally ok to reduce it to one hour.
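Lowering LINGER per the advice above could look like this in an initializer. This is a sketch: LINGER is a plain Ruby constant in Sidekiq Pro's batch.rb, so it is replaced here by removing and re-setting the constant; check your Pro version for a supported configuration hook first.

```ruby
# config/initializers/sidekiq.rb -- hypothetical override, assuming
# Sidekiq Pro resolves Sidekiq::Batch::LINGER at call time.
Sidekiq::Batch.send(:remove_const, :LINGER)
Sidekiq::Batch.const_set(:LINGER, 60 * 60)  # keep finished batch data for 1 hour
```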

@vvayngart1
Author

Q1: If a batch consists of N jobs and one job dies after exhausting its retries, is a new "b-#{bid}-died" key created containing the failed job's JID?
Q2: If M < N jobs die after exhausting their retries, are M "b-#{bid}-died" keys created, or just one for all the failed jobs?
Q3: Once a batch succeeds, all of its batch data is deleted from Redis after the LINGER TTL, correct?

Thank you again for the great information and support.

@mperham
Collaborator

mperham commented Feb 28, 2024

  1. Yes.
  2. 1
  3. Yes.

These are specific implementation details, subject to change (though unlikely) and to my imperfect memory. The code is the ultimate source of truth, of course.
