Caffeine cache for request api #2213

rosalind210 · 2021-06-18T17:13:24Z

Singularity has been slowing down because one endpoint is getting hammered, so we want the ability to cache the response for at least a second to debounce repeated requests. To do this, we are using a Caffeine cache with an expireAfterWrite of one second, which can be reconfigured to another value if necessary.

Open question: should the cache be in the resource file or somewhere nested in the RequestManager and RequestHelper?
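A minimal sketch of the setup described above, assuming the Caffeine library (com.github.benmanes.caffeine) is on the classpath. The class name, the String key and the List<String> value are hypothetical stand-ins, not the code merged in this PR; the injectable Ticker exists only so expiry can be exercised without sleeping.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Ticker;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical wrapper around a Caffeine cache; not Singularity's actual class.
class DebouncedRequestsCache {
  private final Cache<String, List<String>> cache;

  // ttlMillis would come from configuration (e.g. 1000 ms for the default one-second window).
  DebouncedRequestsCache(long ttlMillis, Ticker ticker) {
    this.cache = Caffeine.newBuilder()
        .expireAfterWrite(ttlMillis, TimeUnit.MILLISECONDS)
        .maximumSize(1_000) // bound memory in case many distinct keys show up
        .ticker(ticker)
        .build();
  }

  DebouncedRequestsCache(long ttlMillis) {
    this(ttlMillis, Ticker.systemTicker());
  }

  // Returns the cached value if it was written less than ttlMillis ago;
  // otherwise runs the expensive loader and caches its result.
  List<String> get(String key, Supplier<List<String>> loader) {
    return cache.get(key, k -> loader.get());
  }
}
```

Making the TTL a constructor argument is what allows the one-second window to be reconfigured without touching the call sites.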

pschoenfelder · 2021-06-22T18:58:24Z

🚢

pschoenfelder · 2021-06-23T17:52:30Z

Realized a bit after the fact that we should probably have @jschlather and/or @tpetr take a look to make sure this aligns with what we talked about in the critsit post — would you guys mind taking a peek?

jschlather · 2021-06-23T17:56:05Z

I think this would be cleaner if the cache was in RequestManager.

rosalind210 · 2021-06-23T18:00:16Z

I didn't put the cache in RequestManager because caching the requests in the manager alone doesn't avoid the second call to ZK in RequestHelper's fillDataForRequestsAndFilter.

rosalind210 · 2021-06-23T18:07:40Z

We also have several layers of cache in that call that didn't end up helping us with the 504s: https://github.com/HubSpot/Singularity/blob/master/SingularityService/src/main/java/com/hubspot/singularity/data/RequestManager.java#L628-L634.

jschlather · 2021-06-23T18:08:57Z

Do you know what this value is going to be for a Java web app?

https://github.com/HubSpot/Singularity/pull/2213/files#diff-05dc8661deba9fc538b4bf1e7cc39260843cf5a0ca98070ef7cc5319b898000cR1476

rosalind210 · 2021-06-23T18:12:03Z

For the web app, the user ID belongs to whoever is logged in to the web app. You log in through SSO; can you confirm, @pschoenfelder?

jschlather · 2021-06-23T18:13:39Z

So it should be the janus username and stable across different instances of the same deployable?

rosalind210 · 2021-06-23T18:22:28Z

Yes, just checked the logging around the user ID to confirm.

jschlather · 2021-06-23T18:30:57Z

Okay, cool. It would be nice to debounce across callers, but this should at least prevent one service from taking us down. The other option here would be to put a 1s cache on the getRequests call and also add a 5s cache to the getRequestsWithHistory calls.

I don't have the heap/thread dumps handy, but I'm pretty sure neither the leader cache nor the web cache was active on the instances I looked at.
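The two-TTL alternative suggested above could look roughly like this. It is a sketch assuming Caffeine; the class name, the single shared key and the String payloads are hypothetical, and the injectable Ticker is only there to make expiry testable.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Ticker;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: two short-TTL caches with different lifetimes.
class TieredRequestCaches {
  private static final String KEY = "all"; // one caller-independent entry per cache
  private final Cache<String, List<String>> requests;            // 1s TTL
  private final Cache<String, List<String>> requestsWithHistory; // 5s TTL

  TieredRequestCaches(Ticker ticker) {
    this.requests = Caffeine.newBuilder()
        .expireAfterWrite(1, TimeUnit.SECONDS)
        .ticker(ticker)
        .build();
    this.requestsWithHistory = Caffeine.newBuilder()
        .expireAfterWrite(5, TimeUnit.SECONDS)
        .ticker(ticker)
        .build();
  }

  List<String> getRequests(Supplier<List<String>> zkFetch) {
    return requests.get(KEY, k -> zkFetch.get());
  }

  // Longer TTL as suggested above; the trade-off is up to 5s of staleness.
  List<String> getRequestsWithHistory(Supplier<List<String>> zkFetch) {
    return requestsWithHistory.get(KEY, k -> zkFetch.get());
  }
}
```

After two seconds the 1s cache reloads while the 5s cache still serves its original entry.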

rosalind210 · 2021-06-23T18:46:29Z

I included the user in the key because users can have different levels of authorization, and since the last few rounds of 504s were caused by the same IP, I think that level of granularity should be fine. I'll discuss the two layers of Caffeine caches with the team.

The web cache wouldn't have been active because that cache is only used for the web app, but I'll look more into the LeaderCache.

Edit: LeaderCache is only active on a single instance (the scheduler instance).
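A sketch of a per-caller key that includes the user, as described above. Only the user ID and includeFullRequestData come from this thread; the otherParams field is a hypothetical placeholder for the remaining query parameters, serialized in some canonical order.

```java
import java.util.Objects;

// Hypothetical cache key: two calls share an entry only if the same user asked with
// the same parameters, so a result filtered for one authorization level is never
// served to a user with a different level.
final class RequestsCacheKey {
  private final String userId;                  // e.g. the SSO username of the caller
  private final boolean includeFullRequestData; // one of the getRequests query parameters
  private final String otherParams;             // hypothetical: remaining params, canonically serialized

  RequestsCacheKey(String userId, boolean includeFullRequestData, String otherParams) {
    this.userId = Objects.requireNonNull(userId);
    this.includeFullRequestData = includeFullRequestData;
    this.otherParams = Objects.requireNonNull(otherParams);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof RequestsCacheKey)) {
      return false;
    }
    RequestsCacheKey that = (RequestsCacheKey) o;
    return includeFullRequestData == that.includeFullRequestData
        && userId.equals(that.userId)
        && otherParams.equals(that.otherParams);
  }

  // Must agree with equals so the key works in hash-based caches.
  @Override
  public int hashCode() {
    return Objects.hash(userId, includeFullRequestData, otherParams);
  }

  @Override
  public String toString() {
    return "RequestsCacheKey{" + userId + ", " + includeFullRequestData + ", " + otherParams + "}";
  }
}
```

The cost of this granularity is that debouncing only happens within a single user's calls, which is exactly the limitation discussed later in the thread.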

rosalind210 · 2021-06-23T19:52:14Z

@jschlather During the latest slow down, I found three ZK calls that kept timing out with the one getRequests call:

ZK call in DeployManager (from RequestHelper)

! at app//com.hubspot.singularity.data.CuratorAsyncManager.getAsync(CuratorAsyncManager.java:353)
! at app//com.hubspot.singularity.data.DeployManager.fetchDeployStatesByRequestIds(DeployManager.java:140)
! at app//com.hubspot.singularity.data.DeployManager.getRequestDeployStatesByRequestIds(DeployManager.java:127)
! at app//com.hubspot.singularity.helpers.RequestHelper.fillDataForRequestsAndFilter(RequestHelper.java:304)
! at app//com.hubspot.singularity.resources.RequestResource.getRequests(RequestResource.java:1492)

ZK call in RequestManager

! at app//com.hubspot.singularity.data.CuratorAsyncManager.getAsyncChildren(CuratorAsyncManager.java:383)
! at app//com.hubspot.singularity.data.RequestManager.fetchRequests(RequestManager.java:644)
! at app//com.hubspot.singularity.data.RequestManager.getRequests(RequestManager.java:635)
! at app//com.hubspot.singularity.resources.RequestResource.getRequests(RequestResource.java:1494)

ZK call in UserManager (from RequestHelper)

! at app//com.hubspot.singularity.data.CuratorManager.getData(CuratorManager.java:366)
! at app//com.hubspot.singularity.data.UserManager.getUserSettings(UserManager.java:75)
! at app//com.hubspot.singularity.helpers.RequestHelper.fillDataForRequestsAndFilter(RequestHelper.java:323)
! at app//com.hubspot.singularity.resources.RequestResource.getRequests(RequestResource.java:1492)

jschlather · 2021-06-23T19:57:14Z

And this was with the caffeine cache?

rosalind210 · 2021-06-23T20:01:42Z

Yes. My theory is that the first result gets cached for a second, but when we're hit with 100 calls a second immediately after it expires, the ZK calls time out and we can't repopulate the cache. We're looking at the heap dump now and there was one item in the cache, so some entries are getting in.

jschlather · 2021-06-23T20:05:10Z

Right, my original intention was for the cache to work across callers, so that we only ever have one of these requests in progress. Since the cache is per caller, it seems like we still end up with too many concurrent requests to ZK.

Caffeine should also debounce the call to ZK: if two requests arrive for the same cache key, the first one misses and calls ZK, and the second one waits for that result.

Maybe the answer here is more short TTL caches.
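A small demonstration of the per-key debouncing described above, assuming Caffeine: Cache.get(key, mappingFunction) applies the function at most once per key, so concurrent misses on the same key block on the in-flight load instead of issuing their own read. The 200 ms sleep is a hypothetical stand-in for a slow ZK call.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DebounceDemo {
  // Stand-in for a slow ZooKeeper read: counts invocations and sleeps to simulate latency.
  private static String slowZkRead(AtomicInteger zkReads) {
    zkReads.incrementAndGet();
    try {
      Thread.sleep(200);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return "requests-snapshot";
  }

  // Fires `callers` simultaneous lookups of one key and returns how many ZK reads ran.
  static int concurrentLookups(int callers) {
    Cache<String, String> cache = Caffeine.newBuilder()
        .expireAfterWrite(10, TimeUnit.SECONDS) // long enough that nothing expires mid-demo
        .build();
    AtomicInteger zkReads = new AtomicInteger();
    CountDownLatch startGun = new CountDownLatch(1);
    List<Thread> threads = new ArrayList<>();
    for (int i = 0; i < callers; i++) {
      Thread t = new Thread(() -> {
        try {
          startGun.await(); // line all callers up so their misses overlap
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        // Only one thread runs slowZkRead; concurrent callers block on it,
        // and any later caller hits the cached value.
        cache.get("getRequests", k -> slowZkRead(zkReads));
      });
      threads.add(t);
      t.start();
    }
    startGun.countDown();
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return zkReads.get();
  }
}
```

Because the deduplication is per key, a per-user key still allows one in-flight ZK read per user, which is why a cache that works across callers debounces more effectively.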

rosalind210 · 2021-06-23T20:57:07Z

I discussed this with the team, and it's not going to be possible to get the cache to work across all callers because of user settings and admin/non-admin privileges. Instead, we're going to move Caffeine caches into RequestManager's getRequests and DeployManager's fetchDeployStatesByRequestIds, because neither of those is user-specific.

We aren't too worried about the ZK call to get request history because it is only made if includeFullRequestData is true, and that hasn't been the case for the last few 504s.
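The plan above, caching only the caller-independent data and applying user-specific filtering afterwards, might be sketched as follows. This assumes Caffeine; the class, the single ALL_REQUESTS key, the String payload and the Predicate used for authorization are hypothetical simplifications, not the RequestManager code.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical manager-level cache: the raw request list does not depend on the caller,
// so one entry is shared by everyone, and per-user authorization is applied afterwards.
class SharedRequestsCache {
  private static final String ALL_REQUESTS = "ALL_REQUESTS"; // single, caller-independent key
  private final Cache<String, List<String>> cache;
  private final Supplier<List<String>> zkFetch;

  SharedRequestsCache(long ttlMillis, Supplier<List<String>> zkFetch) {
    this.cache = Caffeine.newBuilder()
        .expireAfterWrite(ttlMillis, TimeUnit.MILLISECONDS)
        .build();
    this.zkFetch = zkFetch;
  }

  // Shared across all callers, so many users within one TTL window cost one ZK read.
  List<String> getRequests() {
    return cache.get(ALL_REQUESTS, k -> zkFetch.get());
  }

  // User-specific filtering stays outside the cache and runs on every call.
  List<String> getRequestsVisibleTo(Predicate<String> canSee) {
    return getRequests().stream().filter(canSee).collect(Collectors.toList());
  }
}
```

Keeping authorization out of the cached value is what lets the key drop the user ID without leaking one user's view to another.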

jschlather · 2021-06-23T21:03:11Z

Sounds good.

Rosie Ellis added 9 commits June 18, 2021 13:01

Initial caffeine cache stab

e0d34b0

Use caffeine cache config in implementation

8b7801e

logs and caffeine version change

bfd0982

Move conditional logic

4f71cb9

Move cache instantiation

8c40a6a

Get config properly

15bc427

Add cache to test

8cdf0b5

Convert logs to trace level

47038b0

Better cache key

5cc4702

rosalind210 merged commit 0634f3d into master Jun 22, 2021

rosalind210 deleted the caffeine_cache_for_request_api branch June 22, 2021 19:02

ssalinas added this to the 1.5.0 milestone May 4, 2022
