Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caffeine cache for request api #2213

Merged
merged 9 commits into from
Jun 22, 2021
Merged

Conversation

rosalind210
Copy link
Contributor

@rosalind210 rosalind210 commented Jun 18, 2021

In response to Singularity experiencing slowness due to one endpoint getting hammered, we want the ability to cache a request value for at least a second to have some de-bouncing. To do this, we are using CaffeineCache with an expireAfterWrite of one second that can be re-configured to another value if necessary.

Open question: should the cache be in the resource file or somewhere nested in the RequestManager and RequestHelper?

@pschoenfelder
Copy link
Contributor

🚢

@rosalind210 rosalind210 merged commit 0634f3d into master Jun 22, 2021
@rosalind210 rosalind210 deleted the caffeine_cache_for_request_api branch June 22, 2021 19:02
@pschoenfelder
Copy link
Contributor

Realized a bit after the fact that we should probably have @jschlather and/or @tpetr take a look to make sure this aligns with what we talked about in the critsit post — would you guys mind taking a peek?

@jschlather
Copy link

I think this would be cleaner if the cache was in RequestsManager.

@rosalind210
Copy link
Contributor Author

The reason that I didn't put the cache in the RequestManager is because just having the requests cached in the manager doesn't avoid the second call to ZK in the RequestHelper's fillDataForRequestsAndFilter.

@rosalind210
Copy link
Contributor Author

We also have several layers of cache in that call that didn't end up helping us with the 504s: https://github.com/HubSpot/Singularity/blob/master/SingularityService/src/main/java/com/hubspot/singularity/data/RequestManager.java#L628-L634.

@jschlather
Copy link

@rosalind210
Copy link
Contributor Author

rosalind210 commented Jun 23, 2021

For the web app, the user ID is the user who's logged in accessing the web app. You're logged in through SSO, can you confirm @pschoenfelder?

@jschlather
Copy link

So it should be the janus username and stable across difference instances of the same deployable?

@rosalind210
Copy link
Contributor Author

Yes, just checked the logging around the user ID to confirm.

@jschlather
Copy link

Okay cool, it would be nice to debounce across callers. But this should prevent one service from taking us down. The other option here would be to put a 1s cache on the getRequests call and then also add a 5s cache to the getRequestsWithHistory calls.

I don't have the heap/thread dumps handy, but I'm pretty sure neither the leader cache or the web cache were active on the instances I looked at.

@rosalind210
Copy link
Contributor Author

rosalind210 commented Jun 23, 2021

I included the user in the key because users can have different levels of authorization, and since the past few times 504s have been caused by the same IP I think we should be in the clear with that level of granularity. I'll discuss the two layers of CaffeineCache with the team.

The web cache wouldn't have been active because that cache is only used for the web app, but I'll look more into the LeaderCache.

Edit: LeaderCache is only active on a single instance (the scheduler instance).

@rosalind210
Copy link
Contributor Author

@jschlather During the latest slow down, I found three ZK calls that kept timing out with the one getRequests call:

ZK call in DeployManager (from RequestHelper)

! at app//com.hubspot.singularity.data.CuratorAsyncManager.getAsync(CuratorAsyncManager.java:353)
! at app//com.hubspot.singularity.data.DeployManager.fetchDeployStatesByRequestIds(DeployManager.java:140)
! at app//com.hubspot.singularity.data.DeployManager.getRequestDeployStatesByRequestIds(DeployManager.java:127)
! at app//com.hubspot.singularity.helpers.RequestHelper.fillDataForRequestsAndFilter(RequestHelper.java:304)
! at app//com.hubspot.singularity.resources.RequestResource.getRequests(RequestResource.java:1492)

ZK call in RequestManager

! at app//com.hubspot.singularity.data.CuratorAsyncManager.getAsyncChildren(CuratorAsyncManager.java:383)
! at app//com.hubspot.singularity.data.RequestManager.fetchRequests(RequestManager.java:644)
! at app//com.hubspot.singularity.data.RequestManager.getRequests(RequestManager.java:635)
! at app//com.hubspot.singularity.resources.RequestResource.getRequests(RequestResource.java:1494)

ZK call in UserManager (from RequestHelper)

! at app//com.hubspot.singularity.data.CuratorManager.getData(CuratorManager.java:366)
! at app//com.hubspot.singularity.data.UserManager.getUserSettings(UserManager.java:75)
! at app//com.hubspot.singularity.helpers.RequestHelper.fillDataForRequestsAndFilter(RequestHelper.java:323)
! at app//com.hubspot.singularity.resources.RequestResource.getRequests(RequestResource.java:1492)

@jschlather
Copy link

And this was with the caffeine cache?

@rosalind210
Copy link
Contributor Author

Yes. I was thinking that it caches that first time for a second but then when we are hit with 100 calls a second immediately after expiry and we end up not being able to cache again because of ZK timeouts. We're looking at the heapdump now and there was one item in the cache, so some things are entering.

@jschlather
Copy link

Right, my original intention was for the cache to work across callers. So that way we only ever end up with one of these requests in progress. Since the cache is per caller, seems like we still end up with too many concurrent requests to ZK.

Caffeine should also debounce the call to ZK, so if if you have two requests for the same cache key then the first one that misses calls to ZK and the second one waits for that.

Maybe the answer here is more short TTL caches.

@rosalind210
Copy link
Contributor Author

I discussed with the team and it's not going to be possible to get the cache to work across all callers because of user settings and admin/non-admin privileges, but we're going to move CaffeineCaches into the RequestManager's getRequests and DeployManager's fetchDeployStatesByRequestIds because neither of those are user specific.

We aren't too worried about the ZK call to get request history because it is only called if includeFullRequestData is true and that hasn't been the case of the last few 504s.

@jschlather
Copy link

Sounds good.

@ssalinas ssalinas added this to the 1.5.0 milestone May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants