Support concurrent refresh of refresh tokens #38382
Conversation
Pinging @elastic/es-security
@elasticmachine please run elasticsearch-ci/packaging-sample
This reverts commit 2b54388.
@elasticmachine please run elasticsearch-ci/2
Overall looking good to me; I have a few questions and suggestions. Thank you.
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java (resolved)
```java
listener.onResponse(null);
// the token exists and the value is at least as long as we'd expect
final Version version = Version.readVersion(in);
if (version.onOrAfter(Version.V_7_0_0)) {
```
I guess we need to change this to v7.1.0 here and in other places.
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java (outdated, resolved)
```java
final Long refreshedEpochMilli = (Long) refreshTokenSrc.get("refresh_time");
final Instant refreshTime = refreshedEpochMilli == null ? null : Instant.ofEpochMilli(refreshedEpochMilli);
final String supersededBy = (String) refreshTokenSrc.get("superseded_by");
return authVersion.onOrAfter(Version.V_7_0_0) && supersededBy != null && refreshTime != null
```
```suggestion
if (authVersion.onOrAfter(Version.V_7_0_0)) {
    final Instant refreshTime = refreshedEpochMilli == null ? null : Instant.ofEpochMilli(refreshedEpochMilli);
    final String supersededBy = (String) refreshTokenSrc.get("superseded_by");
    if (refreshTime != null && supersededBy != null) {
        return clock.instant().isAfter(refreshTime.plus(4L, ChronoUnit.SECONDS));
    }
}
return false;
```
A suggestion to use this instead of the conditions on a single line; I don't know, is it too verbose?
I've split them into multiple lines. I think it's easier to read with all the conditions together rather than split across two different if statements, but I'd be happy to have my opinion changed - it's not a particularly strong one. Let's wait and see what Jay thinks (or anyone else who cares to comment) and go with what makes sense to most of us. Sounds good?
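For illustration, the condition under discussion can be pulled out into a small standalone predicate. This is only a sketch: the class and method names are made up, the `Clock` is injected so "now" can be pinned in tests, and the 4-second window comes from the snippet in the suggestion above.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Sketch (hypothetical names, not actual TokenService code) of the reviewed
// condition: the token document carries both refresh markers and the refresh
// happened more than `windowSeconds` before the clock's "now".
class RefreshWindowCheck {

    static boolean refreshedBeyondWindow(Clock clock, Instant refreshTime, String supersededBy, long windowSeconds) {
        return supersededBy != null
            && refreshTime != null
            && clock.instant().isAfter(refreshTime.plus(windowSeconds, ChronoUnit.SECONDS));
    }

    // Demo helper: evaluate the predicate `secondsLater` seconds after the refresh
    static boolean checkAt(String refreshIso, long secondsLater, String supersededBy, long windowSeconds) {
        Instant refreshTime = Instant.parse(refreshIso);
        Clock clock = Clock.fixed(refreshTime.plusSeconds(secondsLater), ZoneOffset.UTC);
        return refreshedBeyondWindow(clock, refreshTime, supersededBy, windowSeconds);
    }

    public static void main(String[] args) {
        // 10 seconds after the refresh is beyond a 4-second window
        System.out.println(checkAt("2019-02-01T00:00:00Z", 10, "supersedingDocId", 4L)); // prints true
    }
}
```

Injecting the `Clock` is what makes either shape of the condition (single expression or nested ifs) easy to unit test deterministically.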
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java (outdated, resolved)
```java
logger.debug("Token document [{}] was recently refreshed, attempting to reuse [{}] for returning an " +
    "access token and refresh token", tokenDocId, supersedingTokenDocId);
GetRequest supersedingTokenGetRequest =
    client.prepareGet(SecurityIndexManager.SECURITY_INDEX_NAME, TYPE, supersedingTokenDocId).request();
```
Reading https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#realtime it seems the Get API will issue a refresh by default. Is that of any concern for us here performance-wise, since we set `RefreshPolicy.WAIT_UNTIL` when updating the document?
The GET will issue a refresh if the document has been updated but not refreshed, IIUC. Can you elaborate on what the performance impact might be and how it relates to setting `WAIT_UNTIL` on the update request?
I guess the performance comment was more along the lines of me trying to understand the impact of the refresh; I'm not sure how big that impact would be in the case of the security index. I am assuming not too much indexing happens on the security index at a time, so the performance impact should not be big, but I wanted to confirm with you or Jay in case I am missing something.
As I was reading further, it was pointed out to be careful with the `refresh` option as it can cause heavy load: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#get-refresh
So when updating a document (token) in the security index, that change will not be available for searching immediately but on the next refresh cycle. Since we might be getting parallel requests from Kibana, it might happen that when we use the Get API it will trigger a refresh for the document.
Thank you.
> As I was reading further, it was pointed out to be careful with the `refresh` option as it can cause heavy load

I guess that the equivalent of `refresh=true` is if we would set `IMMEDIATE`.

> So when updating a document (token) in the security index, that change will not be available for searching immediately but on the next refresh cycle. Since we might be getting parallel requests from Kibana, it might happen that when we use the Get API it will trigger a refresh for the document.

This can happen, I think. But we need this to happen, because we want the subsequent request(s) to get the updated view of the document state.
True, my thought was around retrying instead of a `refresh` via Get, to avoid any performance impact if there is any.
Looking at the frequency of updates to the security index, I think we are okay with this `refresh`.
I think we can skip my comment unless others feel that this is something that can have an adverse impact on performance. Thank you.
By refreshing, we are writing lots of small Lucene segments that need to be merged in the background. However, I do not feel there is an issue with this use of Get.
LGTM once the review comments are addressed and you also have a ship it from Jay. Thank you for the iterations.
```java
        onFailure.accept(invalidGrantException("could not refresh the requested token"));
    }
}, e -> {
    logger.info("could not find token document [{}] for refresh", supersedingTokenGetRequest);
```
```suggestion
logger.error("could not find token document [{}] for refresh", supersedingTokenGetRequest);
```
We log these at INFO level throughout the TokenService now; what do you think, @jaymode?
```java
    final String supersedingRefreshTokenValue = (String) supersedingRefreshTokenSrc.get("token");
    reIssueTokens(supersedingUserTokenSource, supersedingRefreshTokenValue, listener);
} else {
    logger.info("could not find token document [{}] for refresh", supersedingTokenGetRequest);
```
```suggestion
logger.error("could not find token document [{}] for refresh", supersedingTokenGetRequest);
```
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java (outdated, resolved)
```java
Instant refreshed = Instant.now();
Instant aWhileAgo = refreshed.minus(10L, ChronoUnit.SECONDS);
assertTrue(Instant.now().isAfter(aWhileAgo));
client.prepareUpdate(SecurityIndexManager.SECURITY_INDEX_NAME, "doc", docId.get())
```
Let's verify the update response here, as that might help with debugging in case of a test failure.
Good point!
```java
    .get();
assertNotNull(createTokenResponse.getRefreshToken());

CreateTokenResponse refreshResponse = securityClient.prepareRefreshToken(createTokenResponse.getRefreshToken()).get();
```
I think we should fire a small number of requests in parallel and then check that the responses have the same token string.
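As an aside, the property such a test would assert can be modeled without an Elasticsearch cluster. This toy class (hypothetical, not TokenService code) captures the desired semantics: any number of concurrent refreshes of the same refresh token within the reuse window must all observe the same superseding token.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the concurrency contract under test (not actual TokenService code)
class ConcurrentRefreshModel {

    private final ConcurrentHashMap<String, String> supersededBy = new ConcurrentHashMap<>();
    private final AtomicLong counter = new AtomicLong();

    // computeIfAbsent is atomic per key: the first caller wins and records the
    // superseding token; every other caller observes the winner's value.
    String refresh(String refreshToken) {
        return supersededBy.computeIfAbsent(refreshToken, t -> "new-token-" + counter.incrementAndGet());
    }

    public static void main(String[] args) throws Exception {
        ConcurrentRefreshModel model = new ConcurrentRefreshModel();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            results.add(pool.submit(() -> model.refresh("refresh-token-1")));
        }
        Set<String> distinct = new HashSet<>();
        for (Future<String> f : results) {
            distinct.add(f.get());
        }
        pool.shutdown();
        System.out.println(distinct.size()); // prints 1: every request saw the same token
    }
}
```

The integration test would make the same assertion against real refresh API responses instead of this in-memory map.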
Testing this thoroughly (thanks @bizybot!) proved the initial approach might be a little too naive. The current approach when refreshing the token is to […]

In multiple concurrent requests for refreshing the same token, the approach above proves insufficient as […]

My initial idea was to change the refresh policy of both […]. Even with the refresh policy set to […], I'll keep thinking about this, but I wanted to lay out the status and maybe solicit helpful ideas.
Since we still need to test the code dealing with decrypting token ids, this change introduces a way to generate encrypted ids to be used as access tokens in TokenServiceTests. This also fixes an old bug in testPassphraseWorks, which was supposed to test a wrong passphrase being used but actually succeeded only because the alternative TokenService was initialized with empty settings and was thus disabled.
elasticsearch-ci/2 failed because of #30101
@elasticmachine run elasticsearch-ci/2 please
This is pretty close I think. I left a few comments and I think @bizybot should re-review since it has changed since the last review and I was involved with the changes since then.
.../main/java/org/elasticsearch/xpack/core/security/authc/support/TokensInvalidationResult.java (resolved)
.../main/java/org/elasticsearch/xpack/core/security/authc/support/TokensInvalidationResult.java (resolved)
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java (resolved)
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java (outdated, resolved)
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java (outdated, resolved)
...lugin/security/src/test/java/org/elasticsearch/xpack/security/authc/TokenAuthIntegTests.java (resolved)
I left a couple of minor comments, but otherwise LGTM
```java
@@ -1833,6 +1798,13 @@ void clearActiveKeyCache() {
        this.keyCache.activeKeyCache.keyCache.invalidateAll();
    }

    /**
     * For testing
```
Can you say "package private for testing"?
...lugin/security/src/test/java/org/elasticsearch/xpack/security/authc/TokenAuthIntegTests.java (resolved)
```java
    }), client::get);
};
getTokenDocAsync(supersedingTokenDocId, getSupersedingListener);
} else {
```
Here, it is not "eligible for multiple refresh". So, we assume it is not refreshed at all. I believe it might as well be that it was refreshed a long time ago?
> we assume it is not refreshed at all

Not exactly. We first check if it is eligible for refresh in `checkTokenDocForRefresh`, so we know it was issued in the last 24 hrs - then we check if it is eligible for multi-refresh, i.e. issued in the last few seconds.
You're right, I got lost in it.
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java (resolved)
```java
    reIssueTokens(supersedingUserTokenSource, supersedingRefreshTokenValue, listener);
} else if (backoff.hasNext()) {
    // We retry this since the creation of the superseding token document might already be in flight but not
    // yet completed, triggered by a refresh request that came a few milliseconds ago
```
I would've created the new refreshed user token first and then updated the pointer.
This way we avoid the back-off. If we get a concurrent modification exception then we can retry and assert that we have a valid refresh token concurrently generated (any extraneous user tokens can be removed or garbage collected).
No need to change the code for this, can be done as a follow-up.
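For reference, the retry-with-backoff pattern the PR uses here can be sketched generically. This is a hypothetical helper, not the TokenService implementation: the supplier stands in for the attempt to Get the superseding document, and the iterator supplies the backoff delays in milliseconds.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Generic sketch (hypothetical names) of retry-with-backoff: try to read the
// superseding document; if it is not visible yet because the indexing request
// is still in flight, wait out the next backoff interval and try again until
// the schedule is exhausted.
class BackoffRetry {

    static <T> Optional<T> retryWithBackoff(Supplier<Optional<T>> attempt, Iterator<Long> backoffMillis) {
        Optional<T> result = attempt.get();
        while (!result.isPresent() && backoffMillis.hasNext()) {
            try {
                Thread.sleep(backoffMillis.next());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            result = attempt.get();
        }
        return result;
    }

    // Demo helper: how many attempts are made when the value only becomes
    // visible on attempt `succeedOn`, with `retries` zero-delay retries.
    static int attemptsMade(int succeedOn, int retries) {
        int[] calls = {0};
        List<Long> delays = new ArrayList<>();
        for (int i = 0; i < retries; i++) {
            delays.add(0L);
        }
        Supplier<Optional<String>> attempt = () -> {
            calls[0]++;
            return calls[0] >= succeedOn ? Optional.of("supersedingDoc") : Optional.<String>empty();
        };
        retryWithBackoff(attempt, delays.iterator());
        return calls[0];
    }

    public static void main(String[] args) {
        // The document "appears" on the second attempt; one retry is enough
        System.out.println(attemptsMade(2, 3)); // prints 2
    }
}
```

The alternative discussed in this thread (create the new token first, then update the pointer) would replace this polling loop with version-conflict handling on the update.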
The idea behind doing the update first is that I focused on being able to handle the subsequent requests (after a few millis), so it made sense to do the update so that the next ones can see it immediately. If I get this right, your idea would mean, roughly:

- X: 1st request received. Start the process to index the superseding doc
- X+y ms: 2nd request received. The previous index request is in flight, so we start the process to index the (new and different) superseding doc
- X+z ms: 1st index request completes; we update the original document with the pointer to the superseding doc and reply to the API request with the newly created refresh token
- X+ω ms: 2nd index request completes; we try to update the original doc but we get a version conflict exception, so we do a Get request to get the pointer to the superseding doc, then Get that one to get its source and reply with the same refresh token

The disadvantage of that is the extra refresh and access token we create. In addition to space (which is a short-term concern since, as you said, we clean up the tokens), what worries me a little is that we raise the possibility of a successful brute-force attack by introducing unnecessarily more valid tokens. The time validity of the access tokens and the key space for our access tokens and refresh tokens make this effect insignificant, but if we don't gain much in terms of performance/complexity minimization then I'm not sure it's worth following. I'm definitely up for continuing the discussion on it though!
Yes, you're right, this trades a bit of space for complexity.
I think we can alleviate this by having the id of the refreshed user token be a hash of the user token it is refreshing (superseding) - a hash of the id and creation time. This way concurrent refreshes would be trying to create the same doc_id and fail; only one will succeed and the others will retry, picking up the only refreshed token.
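To make the proposal concrete, such a deterministic id could be sketched as below. This is only an illustration of the idea floated here, not what the PR implements, and (as noted further down in the thread) it makes token document ids predictable, which is exactly the trade-off being debated.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Sketch of the deterministic-id idea (hypothetical, not the PR's code):
// derive the superseding document's id from the superseded token's id and
// creation time, so concurrent refreshes all target the same doc id and only
// one create operation can succeed.
class SupersedingDocId {

    static String deterministicId(String supersededTokenId, long creationTimeEpochMillis) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(supersededTokenId.getBytes(StandardCharsets.UTF_8));
            digest.update(Long.toString(creationTimeEpochMillis).getBytes(StandardCharsets.UTF_8));
            // URL-safe, unpadded base64 keeps the hash usable as a document id
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 must be available", e);
        }
    }

    public static void main(String[] args) {
        // Two concurrent refreshes of the same token compute the same target id
        System.out.println(deterministicId("someTokenId", 1549094400000L)
            .equals(deterministicId("someTokenId", 1549094400000L))); // prints true
    }
}
```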
> Yes, you're right, this trades a bit of space for complexity.

And, slightly, robustness against brute-force attacks.

> I think we can alleviate this by having the id of the refreshed user token be a hash of the user token they are refreshing (superseding) - a hash of the id and creation time.

I don't know how I feel about making the generation of token document ids (== access tokens) predictable. I'll think about this a bit more.
LGTM
@albertzaharovits Thanks for taking a look at this and for your feedback - appreciated as always. I took care of the rest of the suggestions, so I will merge this now - we can continue the discussion about the order of update/get when multi-refreshing in another venue (discuss mail?) and in the subsequent PR.
This change adds support for the concurrent refresh of access tokens as described in elastic#36872. In short, it allows subsequent client requests to refresh the same token that come within a predefined window of 60 seconds to be handled as duplicates of the original one and thus receive the same response with the same newly issued access token and refresh token.

In order to support that, two new fields are added to the token document. One contains the instant (in epochMillis) when a given refresh token was refreshed, and one contains a pointer to the token document that stores the new refresh token and access token created by the original refresh.

A side effect of this change, which was however also an intended enhancement for the token service, is that we needed to stop encrypting the string representation of the UserToken while serializing. (It was necessary as we correctly used a new IV every time we encrypted a token in serialization, so subsequent serializations of the same exact UserToken would produce different access token strings.)

This change also handles the serialization/deserialization BWC logic:

- In mixed clusters we keep creating tokens in the old format and consume only old format tokens
- In upgraded clusters, we start creating tokens in the new format but still remain able to consume old format tokens (that could have been created during the rolling upgrade and are still valid)

Resolves elastic#36872

Co-authored-by: Jay Modi jaymode@users.noreply.github.com
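For orientation, the review snippets in this PR read a `token`, a `refresh_time`, and a `superseded_by` field out of the refresh token part of the token document. A document carrying both new markers might look roughly like this (the overall layout is illustrative; only those three field names come from the snippets above):

```json
{
  "doc_type": "token",
  "refresh_token": {
    "token": "<refresh token value>",
    "refresh_time": 1549094460000,
    "superseded_by": "<id of the superseding token document>"
  }
}
```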
This is a backport of #38382. (Same description as the original PR above.)
Thanks a ton @tlrx. I'll take a look once I'm back at a desk and make sure the change is properly tested