Retry 20 times instead of 10 #58
Sorry I'm late to the party, but I want to change something else, probably more radically: https://github.com/Shopify/lhm/blob/master/lib/lhm/invoker.rb#L18

I propose changing that value to a positive one: 10 seconds instead of -2. With -2 there, the lock wait timeout for the LHM session is two seconds below everybody else's. So when a queue of clients is waiting to grab the lock, the LHM client is the one that gives up first.

We originally added this delta after an incident in 2014, when we were still using the default lock wait timeout for the MySQL version we ran at the time (5.5): one year.

Looking at it from another perspective, if it's a choice between the LHM client getting the lock and a regular web/job client getting it, I'd rather have LHM succeed: a single web or job worker timing out is no big deal.
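As a rough illustration of what the delta does, here is a minimal Ruby sketch. It assumes the session timeout is computed as the server's global `lock_wait_timeout` plus the delta. The method names and the 10-second global value are hypothetical and do not come from lhm's actual code. The clamp reflects MySQL's accepted range for `lock_wait_timeout`: 1 second up to 31536000 seconds, i.e. one year, which is also the 5.5 default mentioned above.

```ruby
# Minimal sketch (hypothetical names, not lhm's API): the LHM session's
# lock_wait_timeout is the global value shifted by a delta.
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60 # 31_536_000: max (and MySQL 5.5 default) lock_wait_timeout

def session_lock_wait_timeout(global_timeout, delta)
  # Keep the result inside MySQL's accepted range for lock_wait_timeout.
  (global_timeout + delta).clamp(1, ONE_YEAR_SECONDS)
end

def set_session_sql(global_timeout, delta)
  "SET SESSION lock_wait_timeout = #{session_lock_wait_timeout(global_timeout, delta)}"
end

global_timeout = 10 # hypothetical global lock_wait_timeout, in seconds

puts set_session_sql(global_timeout, -2) # current delta: LHM gives up 2 s before other clients
puts set_session_sql(global_timeout, 10) # proposed delta: LHM waits 10 s longer than other clients
```

With a 10-second global timeout, the current delta gives the LHM session a timeout of 8 seconds and the proposed one gives it 20. The clamp keeps, for example, a +10 delta on top of the one-year default from exceeding what MySQL accepts.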
Per #58 (comment): with -2 there, the lock wait timeout for the LHM session is two seconds below everybody else's. So when a queue of clients is waiting to grab the lock, the LHM client is the one that gives up first. This is why we had so much trouble completing the LHM/pt-osc run on collects (and it will likely happen again if we ever have to run another one on that table): Shopify/datastores#2803

We originally added this delta after an incident in 2014, when we were still using the default lock wait timeout for the MySQL version we ran at the time (5.5): one year.

Looking at it from another perspective, if it's a choice between the LHM client getting the lock and a regular web/job client getting it, I'd rather have LHM succeed: a single web or job worker timing out is no big deal. It's not as if we risk locking down all of Shopify by increasing this value. That was the case in 2013/2014, when clients waited forever to acquire the lock, rather than for ten seconds as they do now.
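To make the queueing argument concrete, here is a toy Ruby model. It assumes every client starts waiting at the same moment for a lock that some blocker holds for `release_at` seconds, and that a client gives up if its timeout runs out before then. It deliberately leaves out how MySQL actually orders lock requests. The client names, the 10-second global timeout, and the release times are all hypothetical.

```ruby
# Toy model (an assumption, not MySQL's lock scheduler): all clients start
# waiting at t = 0; the blocker releases the lock at t = release_at seconds.
Waiter = Struct.new(:name, :timeout)

# Clients whose timeout runs out before the lock is released give up.
def who_gives_up(waiters, release_at)
  waiters.select { |w| w.timeout < release_at }.map(&:name)
end

global_timeout = 10 # hypothetical global lock_wait_timeout, in seconds

[9, 12].each do |release_at|
  [-2, 10].each do |delta|
    waiters = [
      Waiter.new("web-1", global_timeout),
      Waiter.new("job-1", global_timeout),
      Waiter.new("lhm", global_timeout + delta) # LHM session uses global + delta
    ]
    puts "release at #{release_at}s, delta #{delta}: gives up #{who_gives_up(waiters, release_at).inspect}"
  end
end
```

Consider a blocker that holds the lock for 9 seconds. The -2 delta makes LHM the only client to give up, even though every other client gets the lock, while +10 lets everyone through. With a 12-second hold, -2 makes every client give up. With +10, the web and job clients time out but LHM succeeds, which is the trade-off argued for above.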