Conversation
jlledom left a comment
I didn't review the tests but left some comments on the rest.
Basically the idea here is to differentiate between errors and warnings. I think it's fine to do so, but it could be done in a simpler way:
- Payment transaction, the place to create the Rate Limit exception
- Everything in between: just let the exception bubble
- Billing service: this is the place to check the exception and choose a different path depending on whether it's an error or a warning.
- If warning: log, report, release lock
- If error, same path as now
- About the lock: just add a `release` method to the already existing logic: `Synchronization::NowaitLockService`.
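A minimal sketch of that flow, with all class and method names assumed for illustration (they are not the real classes; only the error/warning split is the point):

```ruby
module Finance
  module Payment
    RateLimitError = Class.new(StandardError) # raised by PaymentTransaction
  end
end

class BillingService
  def call!
    charge_invoices
    :ok
  rescue Finance::Payment::RateLimitError => e
    # Warning path: log, report, release the lock, let the exception bubble to Sidekiq.
    log_warning(e)
    release_lock
    raise
  rescue StandardError => e
    # Error path: same handling as today.
    handle_billing_error(e)
  end
end
```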
```ruby
id = billing_strategy.id
buyer_ids = options[:buyer_ids]

# Note: We don't know which specific buyer hit the rate limit, only which buyers were being processed
```
Why not add the buyer or invoice id as exception attributes? That way we can track who failed.
That would be helpful. Ideally we would also keep the original exception's error message.
Makes sense, although in practice we always supply a single buyer id when calling this :)
app/models/payment_transaction.rb
Outdated
```ruby
# Check for rate limit in error message (common patterns)
message = response.message.to_s.downcase
message.include?('rate limit') || message.include?('too many requests') || message.include?('429')
```
Isn't returning 429 enough to detect a rate limit error? Are there scenarios where a rate limit error doesn't return 429?
Looks like Claude decided it makes sense. Might be worth checking the Stripe docs for any errors that would mandate a retry and how to detect them.
According to this: https://docs.stripe.com/rate-limits
This: https://stripe.com/blog/rate-limiters
And this: https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff574d
It seems 429 is good enough. But OK.
As I explained in this comment: while Stripe does return a 429 status code, and it should be enough, the status code is lost when ActiveMerchant transforms the response, so we need to rely on the error code `rate_limit`.
```ruby
acquire_lock
call
rescue Finance::Payment::RateLimitError => error
  # Rate limit errors should retry immediately via Sidekiq
  # Release the lock so retries can proceed without waiting 1 hour
  release_lock
  report_error(error)
  raise error
```
So the `with_lock` method is dead code now, right? Also `Synchronization::NowaitLockService`.
```ruby
def lock_key
  @lock_key ||= "lock:billing:#{account_id}"
end

def lock_manager
  @lock_manager ||= Redlock::Client.new([System::RedisClientPool.default], { retry_count: 0, redis_timeout: 1 })
end

def acquire_lock
  # Acquire lock for 1 hour
  # Normally we don't release it, but for rate limits we do (see rescue block)
  @lock_info = lock_manager.lock(lock_key, 1.hour.in_milliseconds)
  raise LockBillingError, "Concurrent billing job already running for account #{account_id}" unless @lock_info
end

def release_lock
  # Only called on rate limit errors to allow immediate retry
  lock_manager.unlock(@lock_info) if @lock_info
  @lock_info = nil
rescue => e
  Rails.logger.warn("Failed to release billing lock for account #{account_id}: #{e.message}")
end
```
Instead of this, wouldn't it be easier to add a `release` method to `Synchronization::NowaitLockService`?
The reason to keep a 1-hour lock that is never manually released (the timeout is always waited out) was that we were hitting an issue where we observed double scheduling of the same providers. So it's kind of a second line of defense in case such an issue is somehow reintroduced.
I would prefer to keep that and instead reschedule jobs that we want to retry soon with a random delay of 1 hour to 1 hour and a half range.
But it is also acceptable to add the release to the with_lock method.
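That rescheduling idea, sketched; the `BillingWorker.perform_in` call is an assumption about the worker API and is left commented:

```ruby
# Pick a random delay between 1 hour and 1.5 hours, so the retry fires
# after the 1-hour lock timeout instead of releasing the lock early.
def billing_retry_delay
  rand(3600..5400) # seconds: 1h .. 1h30m
end

# BillingWorker.perform_in(billing_retry_delay, provider_id, buyer_id)
```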
Yeah I know. What I mean is that Claude added a new parallel locking backend here, which makes both `with_lock` and `Synchronization::NowaitLockService` dead code. And I don't see any advantage over the previous service other than having a `release` method. If that's the point, it's simpler to add a `release` method to `Synchronization::NowaitLockService` so we can reuse what we have.
Instead of this, wouldn't it be easier to add a `release` method to `Synchronization::NowaitLockService`?
Yeah, the reason I left it like that is that `NowaitLockService` uses the Service pattern, and the idea is that it only exposes a single public method, `call`. I agree that we need a service with acquire and release methods (or something like that). I'd also probably have a specific BillingLock service, so that the `lock:billing:` prefix is owned by it.
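A rough sketch of such a service; the names and the injected lock manager are assumptions (in production the manager would be the existing Redlock client):

```ruby
module Synchronization
  # Hypothetical lock service owning the "lock:billing:" prefix,
  # with explicit acquire/release instead of a single #call.
  class BillingLock
    TTL_MS = 60 * 60 * 1000 # 1 hour, as today

    def initialize(account_id, lock_manager:)
      @account_id = account_id
      @lock_manager = lock_manager # Redlock::Client in production
    end

    def lock_key
      "lock:billing:#{@account_id}"
    end

    def acquire
      @lock_info = @lock_manager.lock(lock_key, TTL_MS)
      raise "Concurrent billing job already running for account #{@account_id}" unless @lock_info
      @lock_info
    end

    def release
      @lock_manager.unlock(@lock_info) if @lock_info
      @lock_info = nil
    end
  end
end
```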
```ruby
# Rate limit errors should retry immediately via Sidekiq
# Release the lock so retries can proceed without waiting 1 hour
release_lock
report_error(error)
```
It's already reported from the strategy `#call!` method, right?
I removed (or I think I did) all reports and all log printing except here in the BillingService.
```ruby
System::ErrorReporting.report_error(e, :error_message => message,
                                    :error_class => 'RateLimitError',
                                    :parameters => { billing_strategy_id: id, buyer_ids: buyer_ids })
```
I think it would be easier to just let it raise all the way up the stack until it reaches the service; no need for all the logic here.
This makes sense. I didn't track the whole chain carefully but my intuition was similar, that there are too many levels we are handling the exception at.
IMO ideally we would introduce a separate worker just for payment gateway processing. That may have its own locking and throttling rules. Basically, schedule invoice charging there and finish the billing job. Then all retry and throttling can be more easily reasoned about.
Yeah, that would make sense. As I mentioned in the PR description:
In general, I think the whole billing process could be refactored to make it significantly simpler and more predictable. We could do it also with Stripe rate limits in mind, to make the implementation more straightforward, and maybe even implement some kind of client-side rate limit (to avoid the error in the first place), rather than reacting to the error and re-try.
It's just that currently the approach is to group all billing jobs by provider, using batches, and there are some callbacks executed at the end of processing for each provider. If we have sidekiq jobs at invoice level, we might lose that ability, and it might become more complicated to track whether the provider's billing as a whole was successful or not.
Having said that, I am not a fan of this batch library either (I think it's quite flaky and doesn't really bring much value, IMO), and probably refactoring would be good. But I am not sure I would like to tackle it at this point in time 😬
Totally agree on getting rid of sidekiq-batch if possible.
I would bet getting rid of sidekiq-batch is possible, because I did it for the background deletion part.
The main point is to see what the batches are used for, if anything. Like any hooks.
On the other hand, the reported issues with sidekiq-batch are mostly related to usage with ActiveJob-type jobs. If used with native Sidekiq jobs/workers, they should work properly... except for the non-expiring Redis keys we have to clean regularly 😬
app/models/invoice.rb
Outdated
```ruby
# Rate limit errors should bubble up to Sidekiq for immediate retry with exponential backoff
# Don't treat these as payment failures - they're temporary gateway issues
logger.warn("Rate limit error for invoice #{id} (buyer #{buyer_account_id}) - will retry via Sidekiq: #{e.message}")
```
I don't think we need to log here, just logging and reporting once in the billing service would be enough.
We can also use this location to reschedule billing for this provider and this buyer after a random interval. @mayorova, just another low-key approach.
Without using sidekiq's mechanism?
Don't know... I think scheduling a billing job should not be inside Invoice model 🫠
In general, doing the charging from within the model does not seem right. It should be done from a service. We could have a convenience method `Invoice#charge!` that ensures the invoice is chargeable, but it should offload the charging to a service that would handle errors and retries.
So unless we are refactoring how things are done, I'm more in favor of adding rescheduling wherever it is easiest right now, than handling this specific error (rate limit) on 5 levels which will NOT make it easier to refactor later or make the code more readable.
Basically I'm for having a convenient small hack somewhere OR some refactoring that would bring us at least a step in the right direction.
But wait, invoice.rb, we don't need this rescue block at all. It performs needless logging and then raises the same exception. So my suggestion applies to one layer up, or the next layer up, wherever it seems most appropriate. Something like @jlledom already suggested I believe :)
Sounds reasonable.
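For illustration only, the shape that was suggested: `Invoice#charge!` as a thin guard that delegates to a hypothetical charging service, which would own error classification and rescheduling (all names here are assumptions):

```ruby
module Finance
  # Hypothetical service: gateway calls, error classification and retries live here.
  class ChargingService
    def self.call(invoice)
      # ... talk to the payment gateway, classify errors, reschedule ...
      :charged
    end
  end
end

class Invoice
  attr_reader :id

  def initialize(id, chargeable: true)
    @id = id
    @chargeable = chargeable
  end

  def chargeable?
    @chargeable
  end

  # Convenience method only: ensure the invoice is chargeable, then offload.
  def charge!
    raise "invoice #{id} is not chargeable" unless chargeable?
    Finance::ChargingService.call(self)
  end
end
```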
app/models/payment_transaction.rb
Outdated
```ruby
# Check for rate limit errors and raise immediately for Sidekiq retry
if rate_limit_error?(response)
  logger.warn("Rate limit detected (429) for PaymentTransaction - will retry with backoff")
```
app/models/payment_transaction.rb
Outdated
```ruby
# Check for rate limit errors and raise immediately for Sidekiq retry
if rate_limit_error?(response)
  logger.warn("Rate limit detected (429) for PaymentTransaction - will retry with backoff")
  raise Finance::Payment::RateLimitError.new(response)
```
Do we have the invoice info here? It would be useful to add it to the exception.
```ruby
rescue Finance::Payment::RateLimitError => error
  # Rate limit errors should retry immediately via Sidekiq
  # Release the lock so retries can proceed without waiting 1 hour
  release_lock
```
If we don't use the block mode of `with_lock`, the release should happen in an `ensure` block. Although we leave it to the timeout value for a reason: to avoid spurious attempts.
So here you want to release the lock in case of a rate limit only?
My main question is, why previously normal retry didn't take place? Or was it taking place?
From this change I make the conclusion that the normal retry was too fast for the 1 hour timeout. Maybe we should just adjust the retry to be after the standard timeout? Just questions, maybe this approach makes sense.
So, the story is that "normal retry" (handled by Sidekiq) was never actually happening.
The reason is that the exception was being swallowed in the rescue block and never re-raised (it is re-raised only in tests):
porta/app/models/finance/billing_strategy.rb
Line 368 in 7f1fb55
So, in case of rate limit errors, the invoice was just marked as "failed", and the next attempt to charge it would happen only in 3 days, during the daily billing:
Lines 88 to 96 in 7f1fb55
```ruby
# Don't treat these as payment failures - they're temporary gateway issues
logger.warn("Rate limit error for invoice #{id} (buyer #{buyer_account_id}) - will retry via Sidekiq: #{e.message}")
raise e
rescue Finance::Payment::CreditCardError, ActiveMerchant::ActiveMerchantError
```
What if we add the exception here, in this block? Then it will be subject to retries?
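Once the exception is no longer swallowed there, Sidekiq's standard retry with backoff applies; a hypothetical custom schedule is sketched below (the commented worker wiring assumes Sidekiq's `sidekiq_retry_in` hook):

```ruby
# class BillingWorker
#   include Sidekiq::Worker
#   sidekiq_options retry: 5
#   sidekiq_retry_in { |count| backoff_seconds(count) }
# end

# Exponential backoff capped at 1 hour, so late retries don't collide
# with the 1-hour billing lock timeout.
def backoff_seconds(retry_count)
  [60 * (2**retry_count), 3600].min
end
```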
```ruby
def rate_limit_error?(response)
  return false if response.success?

  response.params.dig("error", "code") == 'rate_limit'
```
I repeated the tests: while the Stripe API does return a 429 status code, that status code is "swallowed" by ActiveMerchant, so we don't have the HTTP status information at this point. See https://github.com/activemerchant/active_merchant/blob/v1.137.0/lib/active_merchant/billing/gateways/stripe.rb#L704-L710
So, the `error.code` seems to be the way to detect this.
As this is, of course, specific to the gateway (Stripe in this case), I moved the detection here, and I also decided to make the exception gateway-specific too: `Finance::Payment::StripeRateLimitError`.
Fine for me, just a nitpick: maybe the `rate_limit` literal could be a constant inside `Finance::Payment::StripeRateLimitError`.
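The nitpick, applied as a sketch; the superclass here is simplified to `StandardError` (the real error inherits from `ActiveMerchant::ActiveMerchantError`):

```ruby
module Finance
  module Payment
    class StripeRateLimitError < StandardError
      # Stripe's error code for throttled requests, owned by the error class.
      ERROR_CODE = 'rate_limit'
    end
  end
end

def rate_limit_error?(response)
  return false if response.success?

  response.params.dig("error", "code") == Finance::Payment::StripeRateLimitError::ERROR_CODE
end
```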
I refactored the initial code, and now the exception is raised in the
Right, but we still need to rescue and re-throw, because otherwise the exception will get processed as another type, and we need to prevent this, at each stage of processing.
Not sure what you mean - "error" or "warning". Currently the exception is bubbled up to the
I have added
I added it. I just need to find a good place to use/print it.
```ruby
rescue Finance::Payment::StripeRateLimitError => e
  # Rate limit errors should bubble up to Sidekiq for immediate retry with exponential backoff
  # Don't treat these as payment failures - they are temporary gateway issues
  raise e
```
What if, instead of having to reference `StripeRateLimitError` on several levels, you make `StripeRateLimitError` inherit from something new like `Finance::Payment::TemporaryError` and rescue that? This way we can reuse the structure in the future if some other temporary error happens, also in other gateways. As long as the error inherits from `TemporaryError`, billing will be retried.
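A sketch of that hierarchy; `TemporaryError` is the new, assumed name and the superclasses are simplified:

```ruby
module Finance
  module Payment
    # Base class for gateway errors that should trigger a retry,
    # not a payment failure.
    TemporaryError = Class.new(StandardError)

    StripeRateLimitError = Class.new(TemporaryError)
    # Future gateways could add e.g. BraintreeThrottleError < TemporaryError.
  end
end

# Upper layers then rescue the base class once:
# rescue Finance::Payment::TemporaryError => e
#   release_lock_and_reschedule(e)
```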
```ruby
CreditCardPurchaseFailed = Class.new(GatewayError)

# Rate limit error - should be retried immediately, not treated as payment failure
class StripeRateLimitError < ActiveMerchant::ActiveMerchantError
```
Even if this only happens in Stripe, I think it's better not to make it Stripe-specific, because the concept of rate limiting is more general. Also, I don't see anything in this error definition that would force limiting it to Stripe.
```
@@ -0,0 +1,31 @@
# frozen_string_literal: true

class Synchronization::BillingLockService < Synchronization::NowaitLockService
```
Why split the logic into two classes? Couldn't it be just one?
```ruby
  gateway_options: @gateway_options
}
raise Finance::Payment::StripeRateLimitError.new(response, payment_metadata)
end
```
I don't like the idea that we generate one exception here and others in another place.
If service has to generate exceptions, it has to do it for any failure. And if it is not supposed to generate exceptions, then it should not generate any.
In this case it feels as if the "contract" is not to generate exceptions.
In other languages like Java, whether a method raises (throws) or not is part of the definition of the method.
Looking at the whole charging implementation though, it is rather convoluted. I think we should either:
- implement all gateways to return/raise based on a clear classification of temporary, non-temporary and success conditions
- just implement classification of all gateways in `BillingService#call!` and then reschedule as desired
- be lazy, keep everything as is but enable Stripe SDK retry, which takes care of back-off time as well as idempotency of the requests: `Stripe.max_network_retries = 2`, see https://rubydoc.info/github/stripe/stripe-ruby . The mere use of idempotency keys is a huge win; otherwise we should make sure to use idempotency keys anyway to avoid double charging of the same payment.
My personal take would be to enable Stripe SDK native retries in the first place and do some refactor of how we classify errors in the future if needed.
My second preference would be both (which doesn't preclude enabling native SDK retries):
- implement idempotency keys (this could probably be simply the hashed invoice id, so any payment attempts for that invoice would use the same idempotency key, preventing double charging for the same invoice reliably)
- classify the errors within `BillingService#call!` and decide whether to schedule an immediate retry or leave it for the future billing cycles

Last preference of mine is to classify the errors in the upper layers, as now, but with a more prominent structure that accounts for Braintree and possible future gateways. But this will still require the final decision to be taken within `BillingService#call!` so I'm not sure how it will reduce complexity.
P.S. I understand that it is very complicated to figure out the least shitty approach here short of heavily refactoring everything. That's why I prefer the fewest changes, or at least to be as compact as possible. I'm not trying to complain about everything; all approaches have pros and cons. The above is just what I find most sensible, but I'm very open to changing my mind.
What this PR does / why we need it:
This implementation is a draft which was implemented by Claude (accurately guided by @mayorova). It is intended to serve as a starting point in discussions about how we can solve the issue (https://issues.redhat.com/browse/THREESCALE-4086) and also potentially open a discussion about Billing refactoring.
As it can be clearly seen, our billing process is quite complicated:
- `Invoice#charge!`.
- THREESCALE_8124.
- The `Finance::BillingStrategy.daily` method, which looks like it is intended to go over a list of billing strategies (i.e. provider accounts), and inside each strategy execute the daily process for a list of buyer accounts. However, in practice, this method is typically called by the `BillingWorker` job that runs for a single provider and a single buyer.

In general, I think the whole billing process could be refactored to make it significantly simpler and more predictable. We could do it also with Stripe rate limits in mind, to make the implementation more straightforward, and maybe even implement some kind of client-side rate limit (to avoid the error in the first place), rather than reacting to the error and re-trying.
However, of course this is a very sensitive piece of logic, and it might be dangerous to modify it significantly (as we know it has been working quite reliably for ages).
So, let's talk 😉
Which issue(s) this PR fixes
https://issues.redhat.com/browse/THREESCALE-4086
Verification steps
Special notes for your reviewer:
- `STRIPE_RATE_LIMIT_HANDLING.md` that Claude created explaining the implementation in detail. docs/.
- `activemerchant` gem. Before this line I place

So, the invoices with total value > 100 will trigger this error, while the "cheaper" ones should pass successfully. Beware though about the order of invoices: if the first invoice fails, the process will not continue.
I also use these steps to prepare my Rails console for actual test: