Add backoff and retry option #199

Closed
corbinu opened this Issue Jul 15, 2018 · 16 comments

@corbinu

corbinu commented Jul 15, 2018

Hello,

The autoLimit option works great when doing a bunch of operations in a row. However, I have noticed in the last few days that Shopify must have sped something up in their system, as I am getting products/update webhooks called against my API faster than their own API limit. Each product update hook makes a single call for that product's metafields, so the autoLimit doesn't help. Thoughts?

@lpinca

Collaborator

lpinca commented Jul 16, 2018

I'm not sure I understand the issue: are webhooks triggered too frequently? In that case the only thing to do is reduce the request rate, and the autoLimit option already allows you to do that in a configurable way.

@corbinu

corbinu commented Jul 16, 2018

@lpinca Yes, the webhooks have been triggering faster than 2 per second. The autoLimit isn't helping because there is no way for it to cross process boundaries. It would work fine if we had one process on one server, but we do not.

So Shopify seems to be calling us more like 6 per second, which is getting load-balanced across our servers. We then reply to each at the same rate, even with autoLimit set to only 1 request per second (per Node process).

I will try to find time to open a PR tomorrow with exponential backoff.
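
Roughly what I have in mind (an untested sketch; it assumes the error exposes the HTTP status as `err.statusCode`, and `shopify` / `productId` come from the surrounding webhook handler):

```js
// Retry a call with exponential backoff whenever Shopify answers 429.
async function withBackoff(makeCall, { retries = 5, baseDelay = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await makeCall();
    } catch (err) {
      // Only retry rate-limit errors, and give up after `retries` attempts.
      if (err.statusCode !== 429 || attempt >= retries) throw err;
      const delay = baseDelay * 2 ** attempt; // 1s, 2s, 4s, 8s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// e.g. in the products/update webhook handler:
// const metafields = await withBackoff(() =>
//   shopify.metafield.list({
//     metafield: { owner_resource: 'product', owner_id: productId }
//   })
// );
```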

@corbinu

corbinu commented Jul 16, 2018

Just as a note, we have known this was an issue and were planning to move all our webhook handlers to Faktory; however, the business version with rate limiting keeps getting delayed.

@lpinca

Collaborator

lpinca commented Jul 16, 2018

The autoLimit isn't helping because there is no way for it to cross process boundaries.

Yes, the state is not shared between the processes/services. That case can be addressed by allowing each process to send only a fraction of the total limit, no? For example, if you have 4 processes, let each process send up to limit/4.
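
Something like this (a sketch with made-up numbers, assuming the autoLimit option's `{ calls, interval, bucketSize }` form and the standard 2 calls/sec limit):

```js
const Shopify = require('shopify-api-node');

const PROCESS_COUNT = 4; // hypothetical: 4 identical processes behind the LB

const shopify = new Shopify({
  shopName: 'my-shop-name',       // placeholder
  accessToken: 'my-access-token', // placeholder
  autoLimit: {
    calls: 2,                       // the shop-wide allowance...
    interval: 1000 * PROCESS_COUNT, // ...stretched so each process only
                                    // consumes 1/4 of it (0.5 calls/sec)
    bucketSize: Math.floor(35 / PROCESS_COUNT)
  }
});
```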

I can't see how a retry + exponential backoff strategy would help with this.

@corbinu

corbinu commented Jul 16, 2018

@lpinca Yes, but that would both be less performant most of the time and assumes that the number of processes is static. I don't really want to have to change the value constantly around Black Friday, for example.

Sorry, I should be more clear: this is happening in bursts. For example, we have multiple regular and Shopify Plus clients that do large batch product updates from their ERP every few hours or every day. So the ERP will push something like 40K product updates at once, right at the API limit.

This used to be rare, but in the last couple of days, every time that happens I get emails from Shopify about our webhook failing. However, I checked the logs, and it is failing because we are getting these webhooks much faster than Shopify's API limits allow. The key is that this is a burst that then dissipates. I believe the sudden change has been caused by Shopify making a performance update on their end, causing the webhooks to fire off faster. Eventually it settles down as Shopify keeps retrying and all the hooks eventually clear.

So backing off, for example, both on a 429 failure and when the most recent request within the autoLimit reported an X-Shopify-Shop-Api-Call-Limit header with less than, let's say, half the limit remaining (all of this would be configurable) would spread out the hits to Shopify's API, and we would stop having to return a failure.
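
Something like this for the header check (an untested sketch; the header comes back as e.g. `32/40`, calls used over bucket size, and the threshold would be configurable):

```js
// Decide whether to back off based on the last response's call-limit header.
function shouldBackOff(res, threshold = 0.5) {
  const header = res.headers['x-shopify-shop-api-call-limit'];
  if (!header) return false;
  const [used, max] = header.split('/').map(Number);
  // Back off once less than `threshold` of the bucket is left.
  return (max - used) / max < threshold;
}
```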

@lpinca

Collaborator

lpinca commented Jul 16, 2018

In my opinion that is not the right way to handle rate limiting. Retrying and backing off is not good, as you should never get a 429 in the first place. If you get a 429, it means that you are doing something wrong. If you can't stay below the limit, for example due to bursts, a queue + dispatch strategy should ensure that the limit is never reached.
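
For example, a minimal per-process sketch (names are made up): queue every request and drain the queue at a fixed rate, so bursts are absorbed instead of hitting the limit.

```js
const queue = [];

// Callers enqueue a thunk and get a promise back.
function enqueue(makeCall) {
  return new Promise((resolve, reject) => {
    queue.push({ makeCall, resolve, reject });
  });
}

// Dispatch at most one request every 500 ms (2 requests/sec).
setInterval(() => {
  const job = queue.shift();
  if (job) job.makeCall().then(job.resolve, job.reject);
}, 500);

// Usage: await enqueue(() => shopify.product.get(productId));
```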

@corbinu

corbinu commented Jul 16, 2018

@lpinca I am sorry, but it is Shopify who is doing something wrong. Ideally they would never call their webhooks faster than their own API limits, but that's not what is happening in reality.

@corbinu

corbinu commented Jul 16, 2018

I get that this is a non-ideal solution to a non-ideal problem, but the problem does not appear to have any ideal solutions. If Faktory were at 1.0 and had throttling support, we would be using that, but it still has not been released.

@lpinca

Collaborator

lpinca commented Jul 16, 2018

Ideally they would never call their webhooks faster than their own API limits, but that's not what is happening in reality.

Hmm, how can an event that is fired as a result of an API call be triggered faster than the API call itself?

Also, assume that webhooks were completely decoupled from the API calls, so that you would not know how often webhooks were triggered. How would you handle that? What I'd do is make sure that my "makeAPICall" function, invoked by the webhook, does not create more than x/timeslice requests, regardless of how frequently the webhook calls it.
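
In code, something like this (a sketch with hypothetical names; x = 2 requests per 1-second timeslice):

```js
const X = 2;            // max requests...
const TIMESLICE = 1000; // ...per timeslice, in ms
let tokens = X;

// Refill the allowance every timeslice.
setInterval(() => { tokens = X; }, TIMESLICE).unref();

async function makeAPICall(makeCall) {
  // Wait for an available token instead of firing immediately.
  while (tokens <= 0) {
    await new Promise((resolve) => setTimeout(resolve, 50));
  }
  tokens--;
  return makeCall();
}
```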

@corbinu

corbinu commented Jul 16, 2018

Because multiple kinds of events can cause the webhooks to fire: orders are still causing product quantities to change, which triggers their own product update events.

Again, that doesn't work with multiple servers.

@corbinu

corbinu commented Jul 16, 2018

I guess I will just create my own fork.

@corbinu corbinu closed this Jul 16, 2018

@lpinca

Collaborator

lpinca commented Jul 16, 2018

Because multiple kinds of events can cause the webhooks to fire: orders are still causing product quantities to change, which triggers their own product update events.

Ok, now it makes sense, thank you.

Using a job queue seems like a better solution to me; Faktory is not the only one, see for example https://github.com/OptimalBits/bull.
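
Bull, for instance, has a built-in rate limiter (a sketch based on its documented `limiter` option; it needs a Redis instance, `shopify` is assumed to be a configured client, and the queue/handler names here are made up):

```js
const Queue = require('bull');

// At most 2 jobs per second, enforced across all workers of this queue.
const shopifyQueue = new Queue('shopify-calls', {
  limiter: { max: 2, duration: 1000 }
});

shopifyQueue.process(async (job) => {
  // e.g. fetch the metafields for the product in job.data
  return shopify.metafield.list({
    metafield: { owner_resource: 'product', owner_id: job.data.productId }
  });
});

// The webhook handler then just enqueues:
// shopifyQueue.add({ productId });
```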

Again, in my opinion, the problem should not be "fixed"; it should not be created in the first place. A 429 is by definition a client error.

@richardscarrott

richardscarrott commented Sep 10, 2018

@corbinu the Shopify docs seem to recommend @lpinca's approach -- i.e. having your code manage the call rate in an attempt to mimic Shopify's internal rate-limiting state -- however, we too have many independent services (and many instances of those services) which don't share memory, so it's hard to know how one would reliably divvy up the API limit.

Did you implement a retry mechanism, and was it successful? I was thinking it'd make sense to be able to configure a cron job to back off further than, for example, a request from a user where response time is more critical -- basically some form of priority flag?

@corbinu

corbinu commented Sep 10, 2018

@richardscarrott Yes, they do, and if Shopify respected their API limits when calling out to webhooks, that would be fine. This should never be an issue when you're calling them; it is only an issue when they call you and that then triggers something that calls them (for example, our products/update hook goes and pulls that product's metafields).

Yes, I simply turned on retries with an exponential backoff, and that is working for now. I have been making a lot of changes to my version since, however. Do you need it ASAP, or can you wait a week?

@richardscarrott

richardscarrott commented Sep 10, 2018

@corbinu I'm not in production yet so can wait -- would be great to see how you've solved it 👍

I still think it'd become a problem for us calling Shopify (without listening to webhooks), e.g. we have one service which runs a cron job to fetch all products and another separate service (which doesn't share memory or state) fetching all orders -- isn't there a risk that they'll hit the API limit because neither is aware the other is consuming it?

@corbinu

corbinu commented Sep 10, 2018

@richardscarrott Sounds good

Well, that's something I am adding: the service would use the last known state to figure out if the API limit is getting close. That way, even if two services were running at once, they would see the limit was draining too fast. This could be improved further by sharing state between the services (for example, if both services can talk to a Redis DB that keeps track of the last known API level), or even by creating another service that they both can post to, which just echoes back the last API limit level it was given (Consul actually has features for that built in).
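
A rough sketch of that shared-state idea (untested; assumes ioredis and a 2 calls/sec budget shared by every service):

```js
const Redis = require('ioredis');
const redis = new Redis(); // the one instance every service talks to

const LIMIT = 2; // shared calls allowed per 1-second window

// Atomically claim a slot in the current window; returns false when the
// shared budget is exhausted and the caller should wait for the next window.
async function acquireSlot(shop) {
  const key = `shopify:${shop}:${Math.floor(Date.now() / 1000)}`;
  const used = await redis.incr(key); // atomic across processes/servers
  if (used === 1) await redis.expire(key, 2); // let old windows expire
  return used <= LIMIT;
}
```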

However, the best solution would probably be a job queue, which was my original way to solve the problem. Unfortunately, finding one with decent rate limiting is next to impossible. So my solution was to go with the fastest fix, then keep building smarter versions so that hopefully the bad quick fix never gets depended on (however, I see no reason not to leave it there as a fail-safe).
