Conversation
jlledom left a comment
I didn't review the tests but left some comments on the rest.
Basically the idea here is to differentiate between errors and warnings. I think it's fine to do so, but it could be done in a simpler way:
- Payment transaction, the place to create the Rate Limit exception
- Everything in between: just let the exception bubble
- Billing service: this is the place to check the exception and choose a different path depending on whether it's an error or a warning.
- If warning: log, report, release lock
- If error, same path as now
- About the lock: just add a `release` method to the already existing logic: `Synchronization::NowaitLockService`.
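A minimal sketch of that flow, with all class and method names assumed for illustration (they are not the real classes; only the error/warning split is the point):

```ruby
module Finance
  module Payment
    RateLimitError = Class.new(StandardError) # raised by PaymentTransaction
  end
end

class BillingService
  def call!
    charge_invoices
    :ok
  rescue Finance::Payment::RateLimitError => e
    # Warning path: log, report, release the lock, let the exception bubble to Sidekiq.
    log_warning(e)
    release_lock
    raise
  rescue StandardError => e
    # Error path: same handling as today.
    handle_billing_error(e)
  end
end
```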
```ruby
id = billing_strategy.id
buyer_ids = options[:buyer_ids]

# Note: We don't know which specific buyer hit the rate limit, only which buyers were being processed
```
Why not add the buyer or invoice id as exception attributes? That way we can track who failed.
That would be helpful. Ideally we would also keep the original exception's error message.
Makes sense, although in practice we always supply a single buyer id when calling this :)
app/models/payment_transaction.rb
Outdated
```ruby
# Check for rate limit in error message (common patterns)
message = response.message.to_s.downcase
message.include?('rate limit') || message.include?('too many requests') || message.include?('429')
```
Isn't returning 429 enough to detect a rate limit error? Are there scenarios where a rate limit error doesn't return 429?
Looks like Claude decided it makes sense. Might be worth checking the Stripe docs for any errors that would mandate a retry and how to detect them.
According to this: https://docs.stripe.com/rate-limits
This: https://stripe.com/blog/rate-limiters
And this: https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff574d
It seems 429 is good enough. But OK.
As I explained in this comment: while Stripe does return a 429 status code, and it should be enough, the status code is lost when ActiveMerchant transforms the response, so we need to rely on the error code `rate_limit`.
```ruby
acquire_lock
call
rescue Finance::Payment::RateLimitError => error
  # Rate limit errors should retry immediately via Sidekiq
  # Release the lock so retries can proceed without waiting 1 hour
  release_lock
  report_error(error)
  raise error
```
So the `with_lock` method is dead code now, right? Also `Synchronization::NowaitLockService`.
```ruby
def lock_key
  @lock_key ||= "lock:billing:#{account_id}"
end

def lock_manager
  @lock_manager ||= Redlock::Client.new([System::RedisClientPool.default], { retry_count: 0, redis_timeout: 1 })
end

def acquire_lock
  # Acquire lock for 1 hour
  # Normally we don't release it, but for rate limits we do (see rescue block)
  @lock_info = lock_manager.lock(lock_key, 1.hour.in_milliseconds)
  raise LockBillingError, "Concurrent billing job already running for account #{account_id}" unless @lock_info
end

def release_lock
  # Only called on rate limit errors to allow immediate retry
  lock_manager.unlock(@lock_info) if @lock_info
  @lock_info = nil
rescue => e
  Rails.logger.warn("Failed to release billing lock for account #{account_id}: #{e.message}")
end
```
Instead of this, wouldn't it be easier to add a `release` method to `Synchronization::NowaitLockService`?
The reason to keep a 1-hour lock that is never manually released (the timeout is always waited out) was that we were hitting an issue where we observed double scheduling of the same providers. So it's kind of a second line of defense in case such an issue is somehow reintroduced.
I would prefer to keep that and instead reschedule jobs that we want to retry soon with a random delay of 1 hour to 1 hour and a half range.
But it is also acceptable to add the release to the with_lock method.
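That rescheduling idea, sketched; the `BillingWorker.perform_in` call is an assumption about the worker API and is left commented:

```ruby
# Pick a random delay between 1 hour and 1.5 hours, so the retry fires
# after the 1-hour lock timeout instead of releasing the lock early.
def billing_retry_delay
  rand(3600..5400) # seconds: 1h .. 1h30m
end

# BillingWorker.perform_in(billing_retry_delay, provider_id, buyer_id)
```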
Yeah I know. What I mean is that Claude added a new parallel locking backend here, which makes both `with_lock` and `Synchronization::NowaitLockService` dead code. And I don't see any advantage over the previous service other than having a `release` method. If that's the point, it's simpler to add a `release` method to `Synchronization::NowaitLockService` so we can reuse what we have.
Instead of this, wouldn't it be easier to add a `release` method to `Synchronization::NowaitLockService`?
Yeah, the reason I left it like that is that `NowaitLockService` uses the Service pattern, and the idea is that it only exposes a single public method, `call`. I agree that we need a service with acquire and release methods (or something like that). I'd also probably have a specific BillingLock service, so that the `lock:billing:` prefix is owned by it.
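A rough sketch of such a service; the names and the injected lock manager are assumptions (in production the manager would be the existing Redlock client):

```ruby
module Synchronization
  # Hypothetical lock service owning the "lock:billing:" prefix,
  # with explicit acquire/release instead of a single #call.
  class BillingLock
    TTL_MS = 60 * 60 * 1000 # 1 hour, as today

    def initialize(account_id, lock_manager:)
      @account_id = account_id
      @lock_manager = lock_manager # Redlock::Client in production
    end

    def lock_key
      "lock:billing:#{@account_id}"
    end

    def acquire
      @lock_info = @lock_manager.lock(lock_key, TTL_MS)
      raise "Concurrent billing job already running for account #{@account_id}" unless @lock_info
      @lock_info
    end

    def release
      @lock_manager.unlock(@lock_info) if @lock_info
      @lock_info = nil
    end
  end
end
```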
```ruby
# Rate limit errors should retry immediately via Sidekiq
# Release the lock so retries can proceed without waiting 1 hour
release_lock
report_error(error)
```
It's already reported from the strategy `#call!` method, right?
I removed (or I think I did) all reports and all log printing except here in the BillingService.
```ruby
System::ErrorReporting.report_error(e, :error_message => message,
                                    :error_class => 'RateLimitError',
                                    :parameters => { billing_strategy_id: id, buyer_ids: buyer_ids })
```
I think it would be easier to just let it raise all the way up the stack until it reaches the service; no need for all the logic here.
This makes sense. I didn't track the whole chain carefully but my intuition was similar, that there are too many levels we are handling the exception at.
IMO ideally we would introduce a separate worker just for payment gateway processing. That may have its own locking and throttling rules. Basically, schedule invoice charging there and finish the billing job. Then all retry and throttling can be more easily reasoned about.
Yeah, that would make sense. As I mentioned in the PR description:
In general, I think the whole billing process could be refactored to make it significantly simpler and more predictable. We could do it also with Stripe rate limits in mind, to make the implementation more straightforward, and maybe even implement some kind of client-side rate limit (to avoid the error in the first place), rather than reacting to the error and re-try.
It's just that currently the approach is to group all billing jobs by provider, using batches, and there are some callbacks executed at the end of processing for each provider. If we have sidekiq jobs at invoice level, we might lose that ability, and it might become more complicated to track whether the provider's billing as a whole was successful or not.
Having said that, I am not a fan of this batch library either (I think it's quite flaky and doesn't really bring much value, IMO), and probably refactoring would be good. But I am not sure I would like to tackle it at this point in time 😬
Totally agree on getting rid of sidekiq-batch if possible.
I would bet getting rid of sidekiq-batch is possible, because I did it for the background deletion part.
The main point is to see what the batches are used for, if anything. Like any hooks.
On the other hand, the reported issues with sidekiq-batch are mostly related to usage with ActiveJob-type jobs. If used with native Sidekiq jobs/workers, they should work properly... except for the non-expiring Redis keys we have to clean regularly 😬
app/models/invoice.rb
Outdated
```ruby
# Rate limit errors should bubble up to Sidekiq for immediate retry with exponential backoff
# Don't treat these as payment failures - they're temporary gateway issues
logger.warn("Rate limit error for invoice #{id} (buyer #{buyer_account_id}) - will retry via Sidekiq: #{e.message}")
```
I don't think we need to log here, just logging and reporting once in the billing service would be enough.
We can also use this location to reschedule billing for this provider and this buyer after a random interval. @mayorova, just another low-key approach.
Without using sidekiq's mechanism?
Don't know... I think scheduling a billing job should not be inside Invoice model 🫠
In general, doing the charging from within the model does not seem right. It should be done from a service. We could have a convenience method `Invoice#charge!` that ensures the invoice is chargeable, but it should offload the charging to a service that would handle errors and retries.
So unless we are refactoring how things are done, I'm more in favor of adding rescheduling wherever it is easiest right now, than handling this specific error (rate limit) on 5 levels which will NOT make it easier to refactor later or make the code more readable.
Basically I'm for having a convenient small hack somewhere OR some refactoring that would bring us at least a step in the right direction.
But wait, invoice.rb, we don't need this rescue block at all. It performs needless logging and then raises the same exception. So my suggestion applies to one layer up, or the next layer up, wherever it seems most appropriate. Something like @jlledom already suggested I believe :)
Sounds reasonable.
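For illustration only, the shape that was suggested: `Invoice#charge!` as a thin guard that delegates to a hypothetical charging service, which would own error classification and rescheduling (all names here are assumptions):

```ruby
module Finance
  # Hypothetical service: gateway calls, error classification and retries live here.
  class ChargingService
    def self.call(invoice)
      # ... talk to the payment gateway, classify errors, reschedule ...
      :charged
    end
  end
end

class Invoice
  attr_reader :id

  def initialize(id, chargeable: true)
    @id = id
    @chargeable = chargeable
  end

  def chargeable?
    @chargeable
  end

  # Convenience method only: ensure the invoice is chargeable, then offload.
  def charge!
    raise "invoice #{id} is not chargeable" unless chargeable?
    Finance::ChargingService.call(self)
  end
end
```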
app/models/payment_transaction.rb
Outdated
```ruby
# Check for rate limit errors and raise immediately for Sidekiq retry
if rate_limit_error?(response)
  logger.warn("Rate limit detected (429) for PaymentTransaction - will retry with backoff")
```
app/models/payment_transaction.rb
Outdated
```ruby
# Check for rate limit errors and raise immediately for Sidekiq retry
if rate_limit_error?(response)
  logger.warn("Rate limit detected (429) for PaymentTransaction - will retry with backoff")
  raise Finance::Payment::RateLimitError.new(response)
```
Do we have the invoice info here? It would be useful to add it to the exception.
```ruby
rescue Finance::Payment::RateLimitError => error
  # Rate limit errors should retry immediately via Sidekiq
  # Release the lock so retries can proceed without waiting 1 hour
  release_lock
```
If we don't use the block mode of `with_lock`, the release should happen in an `ensure` block. Although we leave it to the timeout value for a reason: to avoid spurious attempts.
So here you want to release the lock in case of a rate limit only?
My main question is, why previously normal retry didn't take place? Or was it taking place?
From this change I make the conclusion that the normal retry was too fast for the 1 hour timeout. Maybe we should just adjust the retry to be after the standard timeout? Just questions, maybe this approach makes sense.
So, the story is that "normal retry" (handled by Sidekiq) was never actually happening.
The reason is that the exception was being swallowed in the rescue block and never re-raised (it is re-raised only in tests):
porta/app/models/finance/billing_strategy.rb
Line 368 in 7f1fb55
So, in case of rate limit errors, the invoice was just marked as "failed", and the next attempt to charge it would happen only in 3 days, during the daily billing:
Lines 88 to 96 in 7f1fb55
```ruby
# Don't treat these as payment failures - they're temporary gateway issues
logger.warn("Rate limit error for invoice #{id} (buyer #{buyer_account_id}) - will retry via Sidekiq: #{e.message}")
raise e
rescue Finance::Payment::CreditCardError, ActiveMerchant::ActiveMerchantError
```
What if we add the exception here, in this block? Then it will be subject to retries?
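Once the exception is no longer swallowed there, Sidekiq's standard retry with backoff applies; a hypothetical custom schedule is sketched below (the commented worker wiring assumes Sidekiq's `sidekiq_retry_in` hook):

```ruby
# class BillingWorker
#   include Sidekiq::Worker
#   sidekiq_options retry: 5
#   sidekiq_retry_in { |count| backoff_seconds(count) }
# end

# Exponential backoff capped at 1 hour, so late retries don't collide
# with the 1-hour billing lock timeout.
def backoff_seconds(retry_count)
  [60 * (2**retry_count), 3600].min
end
```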
```ruby
def rate_limit_error?(response)
  return false if response.success?

  response.params.dig("error", "code") == 'rate_limit'
```
I repeated the tests: while the Stripe API does return a 429 status code, that status code is "swallowed" by ActiveMerchant, so we don't have the HTTP status information at this point. See https://github.com/activemerchant/active_merchant/blob/v1.137.0/lib/active_merchant/billing/gateways/stripe.rb#L704-L710
So, the `error.code` seems to be the way to detect this.
As this is, of course, specific to the gateway (Stripe in this case), I moved the detection here, and I also decided to make the exception gateway-specific too: `Finance::Payment::StripeRateLimitError`.
Fine for me, just a nitpick: maybe the `rate_limit` literal could be a constant inside `Finance::Payment::StripeRateLimitError`.
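The nitpick, applied as a sketch; the superclass here is simplified to `StandardError` (the real error inherits from `ActiveMerchant::ActiveMerchantError`):

```ruby
module Finance
  module Payment
    class StripeRateLimitError < StandardError
      # Stripe's error code for throttled requests, owned by the error class.
      ERROR_CODE = 'rate_limit'
    end
  end
end

def rate_limit_error?(response)
  return false if response.success?

  response.params.dig("error", "code") == Finance::Payment::StripeRateLimitError::ERROR_CODE
end
```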
I refactored the initial code, and now the exception is raised in the
Right, but we still need to rescue and re-throw, because otherwise the exception will get processed as another type, and we need to prevent this, at each stage of processing.
Not sure what you mean - "error" or "warning". Currently the exception is bubbled up to the
I have added
I added it. I just need to find a good place to use/print it.
```ruby
rescue Finance::Payment::StripeRateLimitError => e
  # Rate limit errors should bubble up to Sidekiq for immediate retry with exponential backoff
  # Don't treat these as payment failures - they are temporary gateway issues
  raise e
```
What if, instead of having to reference `StripeRateLimitError` on several levels, you make `StripeRateLimitError` inherit from something new like `Finance::Payment::TemporaryError` and rescue that? This way we can reuse the structure in the future if some other temporary error happens, also in other gateways. As long as the error inherits from `TemporaryError`, billing will be retried.
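A sketch of that hierarchy; `TemporaryError` is the new, assumed name and the superclasses are simplified:

```ruby
module Finance
  module Payment
    # Base class for gateway errors that should trigger a retry,
    # not a payment failure.
    TemporaryError = Class.new(StandardError)

    StripeRateLimitError = Class.new(TemporaryError)
    # Future gateways could add e.g. BraintreeThrottleError < TemporaryError.
  end
end

# Upper layers then rescue the base class once:
# rescue Finance::Payment::TemporaryError => e
#   release_lock_and_reschedule(e)
```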
```ruby
CreditCardPurchaseFailed = Class.new(GatewayError)

# Rate limit error - should be retried immediately, not treated as payment failure
class StripeRateLimitError < ActiveMerchant::ActiveMerchantError
```
Even if this only happens in Stripe, I think it's better not to make it Stripe-specific, because the concept of rate limiting is more general. Also, I don't see anything in this error definition that would force limiting it to Stripe.
```
@@ -0,0 +1,31 @@
# frozen_string_literal: true

class Synchronization::BillingLockService < Synchronization::NowaitLockService
```
Why split the logic into two classes? Couldn't it be just one?
```ruby
  gateway_options: @gateway_options
}
raise Finance::Payment::StripeRateLimitError.new(response, payment_metadata)
end
```
I don't like the idea that we generate one exception here and others in another place.
If service has to generate exceptions, it has to do it for any failure. And if it is not supposed to generate exceptions, then it should not generate any.
In this case it feels as if the "contract" is not to generate exceptions.
In other languages like Java, whether a method raises (throws) or not is part of the definition of the method.
Looking at the whole charging implementation though, it is rather convoluted. I think we should either:
- implement all gateways to return/raise based on a clear classification of temporary, non-temporary and success conditions
- just implement classification of all gateways in `BillingService#call!` and then reschedule as desired
- be lazy, keep everything as is but enable Stripe SDK retry, which takes care of back-off time as well as idempotency of the requests: `Stripe.max_network_retries = 2`, see https://rubydoc.info/github/stripe/stripe-ruby . The mere use of idempotency keys is a huge win; otherwise we should make sure to use idempotency keys anyway to avoid double charging of the same payment.
My personal take would be to enable Stripe SDK native retries in the first place and do some refactor of how we classify errors in the future if needed.
My second preference would be both (which doesn't preclude enabling native SDK retries):
- implement idempotency keys (this could probably be simply the hashed invoice id, so any payment attempts for that invoice would use the same idempotency key, preventing double charging for the same invoice reliably)
- classify the errors within `BillingService#call!` and decide whether to schedule an immediate retry or leave it for the future billing cycles

Last preference of mine is to classify the errors in the upper layers, as now, but with a more prominent structure that accounts for Braintree and possible future gateways. But this will still require the final decision to be taken within `BillingService#call!` so I'm not sure how it will reduce complexity.
P.S. I understand that it is very complicated to figure out the least shitty approach here short of heavily refactoring everything. That's why I prefer the fewest changes, or at least to be as compact as possible. I'm not trying to complain about everything; all approaches have pros and cons. The above is just what I find most sensible, but I'm very open to changing my mind.
What this PR does / why we need it:
This implementation is a draft which was implemented by Claude (accurately guided by @mayorova). It is intended to serve as a starting point in discussions about how we can solve the issue (https://issues.redhat.com/browse/THREESCALE-4086) and also potentially open a discussion about Billing refactoring.
As it can be clearly seen, our billing process is quite complicated:
- `Invoice#charge!`.
- THREESCALE_8124.
- The `Finance::BillingStrategy.daily` method, which looks like it is intended to go over a list of billing strategies (i.e. provider accounts), and inside each strategy execute the daily process for a list of buyer accounts. However, in practice, this method is typically called by the `BillingWorker` job that runs for a single provider and a single buyer.

In general, I think the whole billing process could be refactored to make it significantly simpler and more predictable. We could do it also with Stripe rate limits in mind, to make the implementation more straightforward, and maybe even implement some kind of client-side rate limit (to avoid the error in the first place), rather than reacting to the error and re-trying.
However, of course this is a very sensitive piece of logic, and it might be dangerous to modify it significantly (as we know it has been working quite reliably for ages).
So, let's talk 😉
Which issue(s) this PR fixes
https://issues.redhat.com/browse/THREESCALE-4086
Verification steps
Special notes for your reviewer:
- `STRIPE_RATE_LIMIT_HANDLING.md` that Claude created explaining the implementation in detail. docs/.
- `activemerchant` gem. Before this line I place

So, the invoices with total value > 100 will trigger this error, while the "cheaper" ones should pass successfully. Beware though about the order of invoices: if the first invoice fails, the process will not continue.
I also use these steps to prepare my Rails console for actual test: