
Support concurrent refresh of refresh tokens #38382

Merged: jkakavas merged 21 commits into elastic:master on Mar 1, 2019

Conversation

@jkakavas (Member) commented Feb 4, 2019

This change adds support for the concurrent refresh of access
tokens as described in #36872.
In short, it allows subsequent client requests to refresh the same token that
come within a predefined window of 60 seconds to be handled as duplicates
of the original one, and thus receive the same response with the same newly
issued access token and refresh token.
In order to support that, two new fields are added to the token document. One
contains the instant (in epochMillis) when a given refresh token was refreshed,
and one contains a pointer to the token document that stores the new
refresh token and access token created by the original refresh.
A side effect of this change, which was however also an intended enhancement
for the token service, is that we needed to stop encrypting the string
representation of the UserToken while serializing. (This was necessary because we
correctly used a new IV every time we encrypted a token during serialization, so
subsequent serializations of the same exact UserToken would produce
different access token strings.)
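To illustrate the IV point, here is a minimal standalone JDK sketch (not the TokenService code; all names in it are illustrative): encrypting the identical plaintext twice with AES/GCM and a fresh random IV produces two different ciphertext strings.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;

public class IvDemo {
    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        String userToken = "same-user-token"; // stands in for the serialized UserToken
        System.out.println(encrypt(userToken, key));
        System.out.println(encrypt(userToken, key)); // a different string for the same token
    }

    static String encrypt(String plaintext, SecretKey key) throws Exception {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv); // a fresh IV per encryption, as is correct
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(ciphertext);
    }
}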

This change also handles the serialization/deserialization BWC logic:

  • In mixed clusters we keep creating tokens in the old format and
    consume only old format tokens
  • In upgraded clusters, we start creating tokens in the new format but
    still remain able to consume old format tokens (that could have been
    created during the rolling upgrade and are still valid)
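As a hedged sketch, the read side of that BWC gate looks roughly like the snippet quoted in the review below; the cutoff version and the old-format helper are illustrative, not the merged code:

final Version version = Version.readVersion(in);
final UserToken userToken;
if (version.onOrAfter(Version.V_7_1_0)) { // illustrative cutoff; the review suggested 7.1.0 over 7.0.0
    userToken = new UserToken(in); // new-format deserialization (assumed constructor)
} else {
    userToken = readOldFormatToken(in); // hypothetical helper: old-format tokens created
                                        // during the rolling upgrade must still be readable
}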

Resolves #36872

Co-authored-by: Jay Modi jaymode@users.noreply.github.com

@jkakavas added the >enhancement, v7.0.0, :Security/Security (Security issues without another label), and v6.7.0 labels on Feb 4, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-security

@jkakavas (Member, Author) commented Feb 5, 2019

12:32:03 Caused: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from static.88-198-16-90.clients.your-server.de/88.198.16.90:47858 failed. The channel is closing down or has closed down
12:32:03 	at hudson.remoting.Channel.call(Channel.java:950)
12:32:03 	at hudson.FilePath.act(FilePath.java:1067)
12:32:03 	at hudson.FilePath.act(FilePath.java:1056)
12:32:03 	at hudson.FilePath.delete(FilePath.java:1537)
12:32:03 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:123)
12:32:03 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
12:32:03 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
12:32:03 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:744)
12:32:03 	at hudson.model.Build$BuildExecution.build(Build.java:206)
12:32:03 	at hudson.model.Build$BuildExecution.doRun(Build.java:163)
12:32:03 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
12:32:03 	at hudson.model.Run.execute(Run.java:1810)
12:32:03 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
12:32:03 	at hudson.model.ResourceController.execute(ResourceController.java:97)
12:32:03 	at hudson.model.Executor.run(Executor.java:429)
12:32:03 Build step 'Execute shell' marked build as failure

@elasticmachine please run elasticsearch-ci/packaging-sample

@jkakavas (Member, Author) commented Feb 5, 2019

17:22:00 > Task :x-pack:qa:rolling-upgrade:with-system-key:v6.7.0#oldClusterTestRunner
17:22:00 Tests with failures:
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/30_ml_jobs_crud/Put job with empty strings in the configuration}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/40_ml_datafeed_crud/Put job and datafeed in old cluster}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/30_ml_jobs_crud/Test function shortcut expansion}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/30_ml_jobs_crud/Put job on the old cluster with the default model memory limit and post some data}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/30_ml_jobs_crud/Test job with pre 6.4 rules - dummy job 6.4 onwards}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/50_token_auth/Create a token and reuse it across the upgrade}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/30_ml_jobs_crud/Test job with pre 6.4 rules}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/70_ilm/Test Basic Policy Creation}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/60_watcher/CRUD watch APIs}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/60_watcher/Test watcher stats output}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/20_security/Verify native store security actions}
17:22:00   - org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/30_ml_jobs_crud/Put job on the old cluster and post some data}
17:22:00 

This is #38412, hopefully fixed by #38427.

@jkakavas (Member, Author) commented Feb 5, 2019

@elasticmachine please run elasticsearch-ci/2

@bizybot (Contributor) left a comment

Overall this is looking good to me; I have a few questions and suggestions. Thank you.

listener.onResponse(null);
// the token exists and the value is at least as long as we'd expect
final Version version = Version.readVersion(in);
if (version.onOrAfter(Version.V_7_0_0)) {
Contributor:

I guess we need to change this to v7.1.0 here and in other places.

final Long refreshedEpochMilli = (Long) refreshTokenSrc.get("refresh_time");
final Instant refreshTime = refreshedEpochMilli == null ? null : Instant.ofEpochMilli(refreshedEpochMilli);
final String supersededBy = (String) refreshTokenSrc.get("superseded_by");
return authVersion.onOrAfter(Version.V_7_0_0) && supersededBy != null && refreshTime != null
Contributor:

Suggested change
return authVersion.onOrAfter(Version.V_7_0_0) && supersededBy != null && refreshTime != null
if (authVersion.onOrAfter(Version.V_7_0_0)) {
final Instant refreshTime = refreshedEpochMilli == null ? null : Instant.ofEpochMilli(refreshedEpochMilli);
final String supersededBy = (String) refreshTokenSrc.get("superseded_by");
if (refreshTime != null && supersededBy != null) {
return clock.instant().isAfter(refreshTime.plus(4L, ChronoUnit.SECONDS));
}
}
return false;

This is a suggestion instead of having all the conditions on a single line; I don't know, is it too verbose?

Member Author:

I've split them onto multiple lines. I think it's easier to read with all the conditions together rather than split into two different if statements, but I'd be happy to have my opinion changed; it's not a particularly strong one. Let's wait and see what Jay thinks (or anyone else who cares to comment) and go with what makes sense to most of us. Sounds good?

logger.debug("Token document [{}] was recently refreshed, attempting to reuse [{}] for returning an " +
"access token and refresh token", tokenDocId, supersedingTokenDocId);
GetRequest supersedingTokenGetRequest =
client.prepareGet(SecurityIndexManager.SECURITY_INDEX_NAME, TYPE, supersedingTokenDocId).request();
Contributor:

Reading https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#realtime,
it seems like the Get API will issue a refresh by default. Is that of any concern for us here performance-wise, given that we set RefreshPolicy.WAIT_UNTIL when updating the document?
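For reference, a hedged sketch of the two knobs under discussion, in the transport-client style of the quoted code (the doc field is illustrative; client, TYPE, and SecurityIndexManager.SECURITY_INDEX_NAME are as used above):

// Update with wait_until: the call returns only once the change is visible to search.
client.prepareUpdate(SecurityIndexManager.SECURITY_INDEX_NAME, TYPE, tokenDocId)
        .setDoc("refreshed", true) // illustrative field
        .setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL)
        .get();

// Realtime GET (the default): can see the latest version of the document even before
// a search refresh has happened, which is the behavior the linked docs describe.
GetResponse response = client.prepareGet(SecurityIndexManager.SECURITY_INDEX_NAME, TYPE, tokenDocId)
        .setRealtime(true) // the default; false would read only what search currently sees
        .get();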

Member Author:

The GET will issue a refresh if the document has been updated but not refreshed, IIUC. Can you elaborate on what the performance impact might be and how it relates to setting WAIT_UNTIL on the update request?

Contributor:

I guess the performance comment was more along the lines of me trying to understand the impact of the refresh; I am not sure how big that impact would be in the case of the security index. I am assuming not much indexing happens on the security index at a time, so the performance impact should not be big. But I wanted to confirm with you or Jay, in case I am missing something.

As I was reading further, the docs point out that one should be careful with the refresh option, as it can cause heavy load: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#get-refresh

So when updating a document (token) in the security index, that change will not be available for searching immediately, but only on the next refresh cycle. Since we might be getting parallel requests from Kibana, it might happen that when we use the Get API it triggers a refresh for the document.
Thank you.

Member Author:

> As I was reading further, the docs point out that one should be careful with the refresh option, as it can cause heavy load

I guess the equivalent of refresh=true would be if we set IMMEDIATE.

> So when updating a document (token) in the security index, that change will not be available for searching immediately, but only on the next refresh cycle. Since we might be getting parallel requests from Kibana, it might happen that when we use the Get API it triggers a refresh for the document.

This can happen, I think. But we need it to happen, because we want the subsequent request(s) to get the updated view of the document state.

Contributor:

True, my thought was around retrying instead of refreshing via Get, to avoid any performance impact there might be.
Looking at the frequency of updates to the security index, I think we are okay with this refresh.

I think we can skip my comment unless others feel that this is something that could have an adverse impact on performance. Thank you.

Member:

By refreshing, we are writing lots of small Lucene segments that need to be merged in the background. However, I do not feel like there is an issue with this use of Get.

@bizybot previously approved these changes Feb 12, 2019

@bizybot (Contributor) left a comment

LGTM once the review comments are addressed and you also have a ship-it from Jay. Thank you for the iterations.

onFailure.accept(invalidGrantException("could not refresh the requested token"));
}
}, e -> {
logger.info("could not find token document [{}] for refresh", supersedingTokenGetRequest);
Contributor:

Suggested change
logger.info("could not find token document [{}] for refresh", supersedingTokenGetRequest);
logger.error("could not find token document [{}] for refresh", supersedingTokenGetRequest);

Member Author:

We log these at INFO level throughout the TokenService now; what do you think @jaymode?

final String supersedingRefreshTokenValue = (String) supersedingRefreshTokenSrc.get("token");
reIssueTokens(supersedingUserTokenSource, supersedingRefreshTokenValue, listener);
} else {
logger.info("could not find token document [{}] for refresh", supersedingTokenGetRequest);
Contributor:

Suggested change
logger.info("could not find token document [{}] for refresh", supersedingTokenGetRequest);
logger.error("could not find token document [{}] for refresh", supersedingTokenGetRequest);

Instant refreshed = Instant.now();
Instant aWhileAgo = refreshed.minus(10L, ChronoUnit.SECONDS);
assertTrue(Instant.now().isAfter(aWhileAgo));
client.prepareUpdate(SecurityIndexManager.SECURITY_INDEX_NAME, "doc", docId.get())
Contributor:

Let's verify the update response here, as that might help with debugging in case of a test failure.

Member Author:

Good point!
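A hedged sketch of that verification, against the test code quoted above (the updated field and the asserted result are illustrative, assuming the update modifies an existing document):

UpdateResponse updateResponse = client.prepareUpdate(SecurityIndexManager.SECURITY_INDEX_NAME, "doc", docId.get())
        .setDoc("refresh_time", aWhileAgo.toEpochMilli()) // illustrative doc update
        .setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE)
        .get();
// Verify the update applied, so a failure here pinpoints the problem in CI logs:
assertEquals(DocWriteResponse.Result.UPDATED, updateResponse.getResult());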

.get();
assertNotNull(createTokenResponse.getRefreshToken());

CreateTokenResponse refreshResponse = securityClient.prepareRefreshToken(createTokenResponse.getRefreshToken()).get();
Contributor:

I think we should fire a small number of requests in parallel and then check that the responses have the same token string.
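Something along these lines, perhaps (a sketch only; it assumes the java.util.concurrent imports and that getTokenString() is the accessor on CreateTokenResponse):

// Fire N concurrent refreshes of the same refresh token and collect the results.
int numRequests = 5;
ExecutorService executor = Executors.newFixedThreadPool(numRequests);
String refreshToken = createTokenResponse.getRefreshToken();
List<Future<CreateTokenResponse>> futures = new ArrayList<>();
for (int i = 0; i < numRequests; i++) {
    futures.add(executor.submit(() -> securityClient.prepareRefreshToken(refreshToken).get()));
}
Set<String> accessTokens = new HashSet<>();
for (Future<CreateTokenResponse> future : futures) {
    accessTokens.add(future.get().getTokenString());
}
executor.shutdown();
// All refreshes within the window should have received the identical token:
assertEquals(1, accessTokens.size());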

@jkakavas (Member, Author):

Testing this thoroughly (thanks, @bizybot!) proved the initial approach might be a little too naive.

The current approach when refreshing the token is to:

  1. Check whether the same refresh token was already refreshed, by doing a Search Request (SR1).
  2. If not, we
    a. pre-generate a document ID (docId) and do an Update Request (UR1) to mark the current document with refreshed: true, the time it was refreshed, and the docId.
    b. create a new access token and refresh token and store them in a new token document, with the docId as its ID, via an Index Request (IR1).
  3. If yes, we
    a. check whether it was refreshed within the last 4 seconds and whether the originally refreshed document has the necessary fields (refreshed_time and superseded_by).
    b. get the document that superseded this one with a Get Request (GR1), since we already know its document ID (it was stored in superseded_by).

Both UR1 and IR1 have their refresh_policy set to wait_for.

With multiple concurrent requests refreshing the same token, the approach above proves insufficient:

  • The first request to come in triggers the first flow (UR1 and IR1) in order to refresh the token.
  • Subsequent requests might or might not follow the second flow, depending on whether their SR1 returns before UR1 succeeds and the document is refreshed. Even if they do follow the second flow, GR1 might or might not be able to get the superseding document, depending on whether the original IR1 has completed and the document has been refreshed.

My initial idea was to change the refresh policy of both UR1 and IR1 to immediate, so that we refresh immediately after each index/update. This would have an effect on performance, and I'd like to discuss further whether or not this is something that would be tolerable.

Even with the refresh policy set to immediate, testing reveals that there are cases where subsequent requests reach GR1 before the original IR1 has completed (there is an observed window of ~60ms where this happens), and that leads to those subsequent requests not being able to find the superseding document and thus failing authentication.
One idea for the above (regardless of whether we change the refresh policy) would be to return a 429 with a Retry-After set to a few ms in the future, and ask clients to honor it.

I'll keep thinking about this, but I wanted to lay out the status and maybe solicit helpful ideas.
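For the 429 idea, a hedged sketch of the server side (ElasticsearchStatusException and its response headers exist in the codebase; their use here is purely illustrative, not what the PR ended up doing):

// Illustrative only: ask the client to retry shortly instead of failing authentication.
ElasticsearchStatusException e = new ElasticsearchStatusException(
        "token refresh already in flight, retry shortly", RestStatus.TOO_MANY_REQUESTS);
e.addHeader("Retry-After", "1"); // HTTP Retry-After is in whole seconds; a ms-level
                                 // hint would need a custom convention with clients
listener.onFailure(e);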

Since we still need to test the code that deals with decrypting
token ids, this change introduces a way to generate encrypted
ids to be used as access tokens in TokenServiceTests.
This also fixes an old bug in testPassphraseWorks, which was
supposed to test that a wrong passphrase fails, but actually succeeded
only because the alternative TokenService was initialized with empty
settings and was thus disabled.
@jkakavas (Member, Author):

elasticsearch-ci/2 failed because of #30101

@elasticmachine run elasticsearch-ci/2 please

@jaymode (Member) left a comment

This is pretty close, I think. I left a few comments, and I think @bizybot should re-review, since the PR has changed since his last review and I was involved with the changes since then.

@jaymode (Member) left a comment

I left a couple of minor comments, but otherwise LGTM.

@@ -1833,6 +1798,13 @@ void clearActiveKeyCache() {
this.keyCache.activeKeyCache.keyCache.invalidateAll();
}

/**
* For testing
Member:

Can you say "package private for testing"?

}), client::get);
};
getTokenDocAsync(supersedingTokenDocId, getSupersedingListener);
} else {
Contributor:

Here, it is not "eligible for multiple refresh", so we assume it was not refreshed at all. But I believe it might as well be that it was refreshed a long time ago?

Member Author:

> we assume it was not refreshed at all

Not exactly. We first check whether it is eligible for refresh at all in checkTokenDocForRefresh, so we know it was issued in the last 24 hrs; then we check whether it is eligible for multi-refresh, i.e. whether it was refreshed in the last few seconds.
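A hedged sketch of that two-stage check (field names, the 24h expiry, and the reuse window are illustrative, based on the discussion above; the early draft used 4 seconds, the merged change a 60-second window):

final Instant now = clock.instant();
// 1) Eligible for refresh at all: the refresh token was issued within the last 24 hours.
final Instant creationTime = Instant.ofEpochMilli((Long) refreshTokenSrc.get("creation_time")); // assumed field
final boolean eligibleForRefresh = now.isBefore(creationTime.plus(24L, ChronoUnit.HOURS));
// 2) Eligible for multi-refresh: it was already refreshed, and within the reuse window.
final Long refreshedEpochMilli = (Long) refreshTokenSrc.get("refresh_time");
final boolean eligibleForMultiRefresh = refreshedEpochMilli != null
        && now.isBefore(Instant.ofEpochMilli(refreshedEpochMilli).plus(60L, ChronoUnit.SECONDS));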

Contributor:

You're right, I got lost in it.

reIssueTokens(supersedingUserTokenSource, supersedingRefreshTokenValue, listener);
} else if (backoff.hasNext()) {
// We retry this since the creation of the superseding token document might already be in flight but not
// yet completed, triggered by a refresh request that came a few milliseconds ago
Contributor:

I would have created the new refreshed user token first and then updated the pointer.
This way we avoid the back-off. If we get a concurrent modification exception, we can retry and assert that a valid refresh token was concurrently generated (any extraneous user tokens can be removed or garbage collected).

No need to change the code for this; it can be done as a follow-up.

Member Author:

The idea behind doing the update first is that I focused on being able to handle the subsequent requests (arriving a few millis later), so it made sense to do the update first so that the next requests can see it immediately. If I get this right, your idea would roughly mean:

  • X: 1st request received. Start the process to index the superseding doc.
  • X+y ms: 2nd request received. The previous index request is in flight, so we start the process to index a (new and different) superseding doc.
  • X+z ms: 1st index request completes; we update the original document with the pointer to the superseding doc and reply to the API request with the newly created refresh token.
  • X+ω ms: 2nd index request completes; we try to update the original doc but get a version conflict exception, so we do a Get request to fetch the pointer to the superseding doc, then Get that one to fetch its source, and reply with the same refresh token.

The disadvantage of that is the extra refresh and access tokens we create. Beyond space (which is a short-term concern, since as you said we clean up the tokens), what worries me a little is that we raise the probability of a successful brute-force attack by unnecessarily introducing more valid tokens. The time validity of the access tokens and the key space of our access tokens and refresh tokens make this effect insignificant, but if we don't gain much in terms of performance or complexity reduction then I'm not sure it's worth pursuing. I'm definitely up for continuing the discussion on it, though!

Contributor:

Yes, you're right; this trades a bit of space for complexity.
I think we can alleviate this by having the id of the refreshed user token be a hash of the user token it is refreshing (superseding), i.e. a hash of the id and creation time. This way, concurrent refreshes would be trying to create the same doc_id and fail; only one would succeed, and the others would retry, picking up the single refreshed token.
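A hedged sketch of that deterministic-ID idea (purely illustrative; this is not what the PR implements):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Derive the superseding document's ID from the token being refreshed, so that
// concurrent refreshes all compute the same ID and only one index request can win.
static String supersedingDocId(String tokenId, long creationTimeMillis) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    digest.update(tokenId.getBytes(StandardCharsets.UTF_8));
    digest.update(Long.toString(creationTimeMillis).getBytes(StandardCharsets.UTF_8));
    return Base64.getUrlEncoder().withoutPadding().encodeToString(digest.digest());
}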

Member Author:

> Yes, you're right; this trades a bit of space for complexity.

And, slightly, robustness against brute-force attacks.

> I think we can alleviate this by having the id of the refreshed user token be a hash of the user token it is refreshing (superseding), i.e. a hash of the id and creation time.

I don't know how I feel about making the generation of token document IDs (== access tokens) predictable. I'll think about this a bit more.

@albertzaharovits (Contributor) left a comment

LGTM

@jkakavas (Member, Author) commented Mar 1, 2019

@albertzaharovits Thanks for taking a look at this and for your feedback; appreciated as always. I took care of the rest of the suggestions, so I will merge this now. We can continue the discussion about the order of update/get when multi-refreshing in another venue (discuss mail?) and in the subsequent PR.

@jkakavas jkakavas merged commit 21703fe into elastic:master Mar 1, 2019
jkakavas added a commit to jkakavas/elasticsearch that referenced this pull request Mar 1, 2019
jkakavas added a commit that referenced this pull request Mar 1, 2019
This is a backport of #38382

tlrx added a commit that referenced this pull request Mar 1, 2019
@tlrx (Member) commented Mar 1, 2019

Sorry @jkakavas, I had to revert this change in b54a95e. As we discussed via another channel, I think there's something to adapt in the bwc serialization of TokensInvalidationResult. In order to re-enable bwc tests in master quickly, I just reverted the change.

@jkakavas (Member, Author) commented Mar 1, 2019

Thanks a ton @tlrx. I'll take a look once I'm back at a desk and make sure the change is properly tested.
