Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

public-keys endpoint causing high db load #1145

Closed
nulian opened this issue Mar 16, 2021 · 13 comments
Closed

public-keys endpoint causing high db load #1145

nulian opened this issue Mar 16, 2021 · 13 comments
Assignees
Labels
enhancement New feature or request scale
Milestone

Comments

@nulian
Copy link

nulian commented Mar 16, 2021

We have a fusionauth setup with a lot of applications and with even more keys that makes the /api/jwt/public-keys?applicationId=11111111-1111-1111-1111-111111111111 endpoint very slow.

Was monitoring the sql queries that fusionauth does and it queries applications with active = true which isn't an indexed column.

After that it seems to query all keys ordered by insert_instant which also isn't indexed.
And then probably filter client side the correct key which probably could be optimized by searching for the just the keys needed.

We can probably partly fix the performance on our side by adding some indexes.
But would be nice for more optimized query on fetching the keys also and that we don't have to do custom alteration on the fusionauth database.

@mooreds
Copy link
Collaborator

mooreds commented Mar 16, 2021

@nulian what version of FusionAuth are you running?

Would love to hear if you apply indices which make this faster. If you do, please share them.

Also, another option would be to put a caching proxy in front of that API; I'm not sure how often your public keys change, but a proxy that caches them even for 10 minutes would help.

Related issue: #374

@robotdan
Copy link
Member

robotdan commented Mar 16, 2021

We are doing a single query on the applications table by a primary key - I don't think indexing active is necessary. If you find otherwise please share your results.

As a side note, we don't support modifications to our schema, so if you are not using our support, then go for it! :-) But we can't support you once you have modified the schema.

We can look at this one, but a better choice would be to use the JSON Web Key Set endpoint (JWKS).
https://fusionauth.io/docs/v1/tech/oauth/endpoints/#json-web-key-set-jwks

This will contain the public keys as well, but in an industry standard format that can be consumed by third party libraries, etc. We wrote the Public Key endpoint long before we supported JWKS.

How many applications do you have, and how many keys do you have managed in Key Master?

@robotdan robotdan added enhancement New feature or request and removed needs more info labels Mar 16, 2021
@Johpie
Copy link

Johpie commented Mar 18, 2021

Hi @mooreds and @robotdan,

I'm a colleague of @nulian and can provide some more details. We discussed the problem internally and a different team is going to look into the JSON Web Key Set (JWKS) to see if that's a better solution for us (but probably gonna take some time to implement).

In the meanwhile I can provide some more details of our setup:

Current fusionauth version: 1.20.0
Current elasticsearch version: 7.6.2

We are running fusionauth on AKS, which currently runs with 6 nodes. We are running 6 fusionauth pods distributed over those six nodes which are limited on 10 gb memory per pod. For the database we use "Azure Database for MySQL server" (version 5.7.32) configured with 8 vCores and limited on 1500 iops. (Before we ran into problems with this query we ran on 2 vCores).

Current amount of applications: ~11000 (This is going to grow to around 45000 soon) (49 mb data+index size)
Current amount of keys: ~33500 (198 mb data+index size)

The troubling query:

SELECT
     k.id                  AS k_id,
     k.algorithm           AS k_algorithm,
     k.certificate         AS k_certificate,
     k.expiration_instant  AS k_expiration_instant,
     k.insert_instant      AS k_insert_instant,
     k.issuer              AS k_issuer,
     k.kid                 AS k_kid,
     k.last_update_instant AS k_last_update_instant,
     k.name                AS k_name,
     k.private_key         AS k_private_key,
     k.public_key          AS k_public_key,
     k.secret              AS k_secret,
     k.type                AS k_type
     FROM `keys` as k
    ORDER BY k.insert_instant DESC;

When I run a EXPLAIN on this query, it is using filesort, meaning that it can't use an index to perform the sorting. When profiling this query it currently takes around 28~30 seconds to complete. And in our case it can happen that this same query is running multiple times at once.

We do have a caching mechanism implemented which releases the stress a bit on the DB server, but we still think adding indexes on the right column could help (we still have to test this in our test environment). Will come back with those results when we have them.

@mooreds
Copy link
Collaborator

mooreds commented Mar 18, 2021

Thanks for the details @Johpie and @nulian , looking forward to the results of your testing. My guess is that this will always do a filesort because the app is doing a tablescan. I'm not sure why we need to sort by insert_instant, tbh.

Are these 11000 applications all in one tenant or are they distributed across tenants?

As an aside, would your team/company be interested in being featured on the FusionAuth blog? It sure sounds like you are doing interesting things with FusionAuth. We do emailed interviews and publish posts like this one: https://fusionauth.io/blog/2020/11/18/reconinfosec-fusionauth/ If so, please email me at dan at fusionauth dot io and I'll send you over the questions.

@robotdan
Copy link
Member

Linked to #374 - that issue is going to have to be on a short list - shortly. :-)

@nulian
Copy link
Author

nulian commented Mar 18, 2021

All the apps are in 1 tenant because we have 1 shared user database.

I think for the public-keys it would already make a big different if it uses the keys it looks up from the application table and only query for those in the keys table. Currently it still fetches the entire table while it could have just done a limited keys search in the keys table.

@Johpie
Copy link

Johpie commented Mar 22, 2021

After some investigation in our test environment, we concluded adding a index on the keys table is not going to help. Also removing the order by clause didn't help. We think it's just the sheer size of the table being transferred every time the query is called is killing our performance right now.

Not sure if we should close this issue, but I'll leave that up to you guys :-)

@mooreds
Copy link
Collaborator

mooreds commented Mar 22, 2021

Did you test the JWKS endpoint as well?

@nulian
Copy link
Author

nulian commented Mar 22, 2021

Had a look it's a bit faster 850ms vs 3.5s.
Biggest downside to that is everytime a customer on our platform creates a new app we will have to redownload that entire jwks key registry again.
It's 19.6MB at the moment and will probably keep growing.
So a filter would help a lot like that we could get the jwks of a certain kid or applicationId.

@robotdan
Copy link
Member

JWKS is a spec, as far as I know it does not define a way to request a key by kid.

We can keep this issue open, I’d like to make the public key api a bit faster at this scale.

However, neither of these endpoints are really designed to hit on a request scope.

I would suggest identifying a way to get what you need from the response, build your JWT verifier or whatever you need and then don’t ask for the public keys again until you know you have added another client or you get a cache miss on your JWT verifier by kid.

Because even if we cache this, the HTTP response payload is going to be large enough that the transfer over the wire is going to add noticeable latency.

@nulian
Copy link
Author

nulian commented Mar 22, 2021

That is mostly our current implementation with the public-keys endpoint.
We request it the first time and then cache it till it no longer matches to reduce the load on fusionauth.
Though we do it per application instead of requesting the entire list.
And current caching implementation is per pod to reduce the complexity.

@robotdan
Copy link
Member

Ok, great, thanks for the update @nulian . We can use this ticket to review this endpoint and see what we can do to make it perform much better under this type of scale. Thanks for using FusionAuth!

@robotdan robotdan added the scale label Mar 22, 2021
@robotdan
Copy link
Member

This should be fixed in upcoming 1.30.2 release. We made some modifications to the way this endpoint is performing application and tenant lookups.

@robotdan robotdan self-assigned this Sep 24, 2021
@robotdan robotdan added this to the 1.30.2 milestone Sep 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request scale
Projects
None yet
Development

No branches or pull requests

4 participants