
Refactor Token Service #39808

Merged

Conversation

albertzaharovits
Contributor

@albertzaharovits albertzaharovits commented Mar 7, 2019

Be warned, this is one of those PRs that you have to pull down and inspect.

I have started to refactor the Token Service with the general goal of reducing complexity and, more specifically, of minimizing .security index access.
This is in the context of the work related to moving security tokens to a new index. In that regard, the Token Service has to work with token documents stored in either of the two indices, albeit only as a transient situation. I reckoned the added complexity would be unmanageable, hence this refactoring.

This is incomplete, as it fails to address the goal of minimizing .security accesses, but I have stopped because otherwise it would've become a full-blown rewrite (if not already). I will follow up with more targeted PRs.

I understand this is hard to review, so I have tried to keep individual commits standalone and reviewable on their own. Nevertheless, this might be a pain! One round of beers for each of the two reviewers (GAH is close)!

@albertzaharovits albertzaharovits added >non-issue :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) v8.0.0 v7.2.0 labels Mar 7, 2019
@albertzaharovits albertzaharovits self-assigned this Mar 7, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-security

@albertzaharovits
Contributor Author

org.gradle.launcher.daemon.client.DaemonDisappearedException: Gradle build daemon disappeared unexpectedly

@elasticmachine test this please

@albertzaharovits
Contributor Author

Don't use icons in PR description!
@elasticmachine test this please

@albertzaharovits
Contributor Author

The failures were utterly alien and I think they were due to #39512 missing from the PR. I have merged master in.

@albertzaharovits
Contributor Author

albertzaharovits commented Mar 11, 2019

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/9732/

10:30:31 > Task :x-pack:qa:kerberos-tests:krb5kdcFixture FAILED
10:30:31 Timed out while waiting for the machine to boot. This means that
10:30:31 Vagrant was unable to communicate with the guest machine within
10:30:31 the configured ("config.vm.boot_timeout" value) time period.
10:30:31 :x-pack:qa:kerberos-tests:krb5kdcFixture (Thread[Execution worker for ':' Thread 3,5,main]) completed. Took 6 mins 16.257 secs.

@elasticmachine run elasticsearch-ci/2

Member

@jaymode jaymode left a comment

Overall the changes look good and aren't that bad to review on the UI in split mode :)

};
final String refreshedNewTokenId;
try {
refreshedNewTokenId = checkTokenDocForRefresh(source, clientAuth);
Member

I much prefer the previous implementation with the optional. There is no need to throw and then catch to pass it on to a listener if we have a clean way of doing so.

Contributor Author

There's more than this going on behind the original implementation; I'll try to briefly explain.

Firstly, note that returning non-null, null, and throwing an exception are three different return results! The proposed change uses all of them.

The parsed token document is a map of maps. We don't parse it into a fully fledged entity.
The original implementation did some integrity checks (verifying that entries existed and had the expected type) on the map(s) entries and then, subsequently, at completely different places in the code, values were extracted out of the maps, under the assumption that the null check had been performed before. I don't like that. I prefer to keep the null checks as I navigate the maps and extract the values. I feel like I'm walking in a minefield when navigating maps knowing that null checks have been performed somewhere else. So, in short, the proposed implementation keeps value extraction paired with the null checks, without introducing a new entity.

Now, if we agree on this, the value extraction should be done once and grab everything in one pass; otherwise we'd have to duplicate the null checks on each value extraction, by the same philosophy. So, the proposed implementation grabs everything in one pass, and returns null if the access token has not been refreshed before, returns a non-null value representing the document id in case the access token has been refreshed recently, and throws an exception in case the token document is malformed or the refresh token is invalidated or expired. These are the three return results that the proposed implementation uses, again without introducing another class/entity.

What do you say? Can we keep it in the proposed form? 😁
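
For illustration, the three-outcome contract described above could look like the sketch below. The field names, the helper class and the exception choices are all hypothetical, not the actual TokenService code.

import java.util.Map;

final class TokenDocChecks {

    /**
     * Returns null if the access token has not been refreshed before, the id of the
     * superseding token document if it has been refreshed recently, and throws if the
     * token document is malformed or the refresh token has been invalidated.
     */
    @SuppressWarnings("unchecked")
    static String checkTokenDocForRefresh(Map<String, Object> source) {
        final Object refreshTokenSrc = source.get("refresh_token"); // field name assumed
        if (refreshTokenSrc instanceof Map == false) {
            // malformed token document -> exceptional result
            throw new IllegalStateException("token document is missing the refresh_token object");
        }
        final Map<String, Object> refreshToken = (Map<String, Object>) refreshTokenSrc;
        final Object invalidated = refreshToken.get("invalidated"); // field name assumed
        if (invalidated instanceof Boolean == false) {
            throw new IllegalStateException("token document is missing the invalidated field");
        }
        if ((Boolean) invalidated) {
            // invalidated refresh token -> exceptional result
            throw new IllegalArgumentException("refresh token has been invalidated");
        }
        // null means "not refreshed before"; non-null is the superseding document id
        return (String) refreshToken.get("superseding_doc_id"); // field name assumed
    }
}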

Member

So, in short, the proposed implementation keeps value extracting paired with the null checks, without introducing a new entity.

What's the reasoning for not introducing a new entity? I've been thinking about this and, from a code readability/understandability perspective, an entity can convey the meaning of the returned result much better and enables someone to more easily dive into this code. In the past, I have definitely used this type of behavior, but then I think back on some of those times and we've actually gone back and changed the code to have a result entity; examples would be AuthenticationResult and RoleRetrievalResult.

Contributor Author

What's the reasoning for not introducing a new entity?

The "read the document once" was closer to the existing code, as the code at this abstraction level deals with maps, and only a single field is required from all the entity parsing anyway, so the parsing seemed like a stretch. Furthermore the field is not even an entity field, it's a document id, which should probably be abstracted away by the entity (as a token id).

Do you wish to go down the entity parsing path in this instance?

Member

I was thinking about something like:

class RefreshStatus {

    /** has the access token already been refreshed (recently)? */
    boolean isRefreshed() { }

    /** can the token still be refreshed? */
    boolean isRefreshable() { }

    /** is the token document well formed and the refresh token neither invalidated nor expired? */
    boolean isValid() { }

    /** the failure cause, when not valid */
    Exception getCause() { }

    /** id of the superseding token document, when already refreshed */
    String getRefreshedTokenId() { }
}
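
For what it's worth, a hypothetical caller of such an entity could branch like this (ActionListener is the real Elasticsearch interface; the two helper methods are made up):

import org.elasticsearch.action.ActionListener;

final class RefreshStatusUsage {

    // Hypothetical sketch: branch on the proposed entity instead of on null/throw.
    static void onRefreshStatus(RefreshStatus status, ActionListener<String> listener) {
        if (status.isValid() == false) {
            listener.onFailure(status.getCause()); // malformed doc, invalidated or expired token
        } else if (status.isRefreshed()) {
            reuseSupersedingTokenDoc(status.getRefreshedTokenId(), listener); // made-up helper
        } else {
            createSupersedingTokenDoc(listener); // made-up helper
        }
    }

    static void reuseSupersedingTokenDoc(String tokenDocId, ActionListener<String> listener) { }

    static void createSupersedingTokenDoc(ActionListener<String> listener) { }
}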

Member

I don't have a very strong opinion on this. I slightly prefer the new entity approach from a readability perspective, but I could be happy with slightly longer javadoc and the methods kept as is.

Also, why don't we like having all integrity and null checks in one place? I find it preferable, compared to checking before use in many places.

Contributor Author

I have pushed a7139a8 to assuage your concerns. The previous integrity checks would have returned 400 for actual 500 error codes. Given the previous feedback, I have changed them in this last "refactor".

@jaymode this is not exactly your suggestion, because isRefreshable and isValid cannot coexist with getCause: getCause would be calling them internally, so the caller of getCause would have no reason to call them on its own. Moreover, I believe getCause() is not how things are currently done in the code base; when we encounter validation errors we throw exceptions if the function is sync, or call onFailure for async functions. Furthermore, what would be the value of isRefreshed if getCause is not null? We can come up with some rules about which flag is true/false/null in relation to the others, but I doubt that this would be an improvement over the null check on refreshTokenDocumentId that this thread is arguing against.

@jkakavas

why don't we like all integrity and null checks in one place?

I don't like having any checks far from the code relying on those checks, because when the code changes in one place it has to change in the other, and I could more easily forget to change it in two places rather than one. Moreover, it is not clear to me what span a "previous" check covers. The guarantees of a previous check fade away as a mutable object is passed around between functions that could change it. For example, we make some guarantees in the document write function a few lines above, then read the doc here and do the checks. Why bother with the checks, then, if they were asserted "before", in the write method?

I find it preferable, compared to checking before use in many places

I disagree. In a world with null "values" and no immutable objects, I would be checking for null at every single step. This world is real, though: it's the world in which all objects are Maps of Maps. Fortunately, we can break free of that world by parsing the map into an entity object that has no null "values". That world is indeed not real, because the Java gods have blessed us with the null "value", so we decided to use nulls in very "specific" circumstances, where "specific" is not defined (can be null).
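
To make the two "worlds" concrete, here is a minimal, purely illustrative sketch; the field names are assumed and do not necessarily match the actual token document schema:

import java.util.Map;

final class TwoWorlds {

    // World one: maps of maps. Every navigation step carries its own null/type check,
    // kept right next to the value extraction that relies on it.
    @SuppressWarnings("unchecked")
    static String supersedingDocIdFromMaps(Map<String, Object> source) {
        final Object refreshToken = source.get("refresh_token"); // field name assumed
        if (refreshToken instanceof Map == false) {
            throw new IllegalStateException("token document is missing the refresh_token object");
        }
        final Object supersededBy = ((Map<String, Object>) refreshToken).get("superseded_by");
        if (supersededBy != null && supersededBy instanceof String == false) {
            throw new IllegalStateException("superseded_by is not a string");
        }
        return (String) supersededBy; // may legitimately be null: "not refreshed before"
    }

    // World two: parse once into an entity. After construction the checks never need
    // repeating, but they now live far from the code that uses the values.
    static final class ParsedRefreshToken {
        final boolean refreshed;
        final String supersededBy; // null iff not refreshed before

        ParsedRefreshToken(Map<String, Object> source) {
            this.supersededBy = supersedingDocIdFromMaps(source); // all checks happen here, once
            this.refreshed = this.supersededBy != null;
        }
    }
}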

jaymode and others added 3 commits March 13, 2019 15:32
…ecurity/authc/TokenService.java

Co-Authored-By: albertzaharovits <albert.zaharovits@gmail.com>
…ecurity/authc/TokenService.java

Co-Authored-By: albertzaharovits <albert.zaharovits@gmail.com>
…ecurity/authc/TokenService.java

Co-Authored-By: albertzaharovits <albert.zaharovits@gmail.com>
@jaymode jaymode requested a review from jkakavas March 19, 2019 14:30
@albertzaharovits
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample

Contributor

@bizybot bizybot left a comment

LGTM, Thank you.

* and return an access token and refresh token based on that. Otherwise this token document gets its refresh_token marked as refreshed,
* while also storing the Instant when it was refreshed along with a pointer to the new token document that holds the refresh_token that
* supersedes this one. The new document that contains the new access token and refresh token is created and finally the new access
* token and refresh token are returned to the listener.
*/
private void innerRefresh(String tokenDocId, Map<String, Object> source, long seqNo, long primaryTerm, Authentication clientAuth,
Member

I think we need to ping @kobelb here and maybe help Kibana get a list of the potential messages for the InvalidGrantExceptions we throw (if these have changed), as this affects the way that Kibana treats auth errors with respect to refreshing tokens.

In a follow-up, we can probably sit down and define a list of short error codes that we can use instead of descriptive messages that can change in the future.
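
Such a list could be as simple as the sketch below; this is purely illustrative, no such enum exists in the code base at this point:

// Illustrative sketch of short, machine-readable refresh error codes that could
// replace the descriptive messages clients currently have to match on.
enum RefreshErrorCode {
    ALREADY_REFRESHED,         // "token has already been refreshed"
    REFRESH_TOKEN_EXPIRED,     // "refresh token is expired"
    REFRESH_TOKEN_INVALIDATED,
    MALFORMED_TOKEN_DOCUMENT
}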

Contributor

Please! At this moment, we're looking at the error_description of the HTTP response body for the following two strings to determine if the error represents an invalid refresh token:

  • token has already been refreshed
  • refresh token is expired

/cc @azasypkin
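
For context, these error_description strings are set server-side; the TokenService helper that raises them looks roughly like the sketch below, though the exact shape in the merged code may differ:

import org.elasticsearch.ElasticsearchSecurityException;
import org.elasticsearch.rest.RestStatus;

final class InvalidGrant {

    // Roughly the shape of TokenService's invalid_grant helper: a 400 response whose
    // error_description header carries the human-readable detail Kibana matches on.
    static ElasticsearchSecurityException invalidGrantException(String detail) {
        ElasticsearchSecurityException e =
                new ElasticsearchSecurityException("invalid_grant", RestStatus.BAD_REQUEST);
        e.addHeader("error_description", detail);
        return e;
    }
}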

Contributor Author

I think it is safe to assume that HTTP error code 400 means an invalid refresh token. If you find otherwise, I would suggest we treat it as a bug. Would this be satisfactory for you @kobelb?

There are many more messages to account for (which were not introduced/changed by this PR) if you insist on parsing the error_description in the Exception object. Not all failure paths will even return an ElasticsearchException, which is what carries the error_description field.
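
A client following this advice would branch on the HTTP status instead of the message. A minimal sketch with the low-level Java REST client (the endpoint and request body are assumptions):

import java.io.IOException;

import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

final class RefreshCheck {

    // Treat any 400 from the refresh call as "refresh token no longer usable",
    // without inspecting error_description.
    static boolean isInvalidRefreshToken(RestClient client, String refreshToken) throws IOException {
        Request request = new Request("POST", "/_security/oauth2/token"); // endpoint assumed
        request.setJsonEntity(
                "{\"grant_type\":\"refresh_token\",\"refresh_token\":\"" + refreshToken + "\"}");
        try {
            client.performRequest(request);
            return false; // refresh succeeded
        } catch (ResponseException e) {
            return e.getResponse().getStatusLine().getStatusCode() == 400;
        }
    }
}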

Member

I think it is safe to assume HTTP error code 400 is an invalid refresh token. If you find it otherwise, I would suggest we treat it as a bug.

Currently Kibana automatically re-initiates the SAML handshake only if refreshing fails with one of the errors mentioned above; otherwise we just return the error. @albertzaharovits do you think we should re-initiate the SAML handshake in Kibana as long as we get a 400, irrespective of the underlying reason? If a brand new SAML session is the only recovery mechanism for all 400 errors we get while trying to refresh, then it seems there is no need to inspect error_description on our side.

Contributor Author

@azasypkin I think so, yes. We should strive to make 400 errors true client errors, i.e. not due to ES instability, to the effect that a retry with the same contents will never be accepted. Whether Kibana decides to re-initiate the SAML flow, or to present the error to the user and abort, is a different story IMO. That being said, we have not taken great care that 400 errors are honored as described; I will take on the responsibility to look over the code again and see that 400s and 500s are consistent.

Member

The one place where this is not happening is
https://github.com/albertzaharovits/elasticsearch/blob/0d4872665adefb8c4db2d4fedcd36dc446131d02/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java#L383

where we return a 500 when we can't find the token document based on the Bearer token that was provided. This made sense when the Bearer token was the encrypted serialized version of the UserToken, but not anymore - I have this tracked in #38866 and plan to address it soon, now that this PR is merged.

I totally agree with Albert here: if you get a 400 while refreshing, then it means that the refresh token can't be used, no matter what the reason is. I think we should re-initiate the authentication flow from Kibana when this happens, as I can't think of a case where a failure to refresh should cause a user to see an error in Kibana.

Member

Okay, sounds good, filed elastic/kibana#33646

@albertzaharovits
Contributor Author

@elasticmachine elasticsearch-ci/default-distro

@albertzaharovits
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

@albertzaharovits
Contributor Author

@elasticmachine run elasticsearch-ci/default-distro

@albertzaharovits
Contributor Author

@jkakavas this is ready for another review round.

@albertzaharovits
Contributor Author

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request+default-distro/8931/consoleFull

This is probably a failed backport:

17:20:41 1> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
17:20:41 1> at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556)
17:20:41 1> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510)
17:20:41 1> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
17:20:41 1> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
17:20:41 1> at java.lang.Thread.run(Thread.java:748)
17:20:41 1> Caused by: java.lang.IllegalArgumentException: Unknown NamedWriteable [org.elasticsearch.index.query.IntervalsSourceProvider][]
17:20:41 1> at org.elasticsearch.common.io.stream.NamedWriteableRegistry.getReader(NamedWriteableRegistry.java:112)
17:20:41 1> at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput

As well as:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-bwc/2242/consoleFull

17:06:40 1> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510)
17:06:40 1> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
17:06:40 1> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
17:06:40 1> at java.lang.Thread.run(Thread.java:748)
17:06:40 1> "
17:06:40 1> }
17:06:40 1> }
17:06:40 1> ],
17:06:40 1> "caused_by" : {
17:06:40 1> "type" : "illegal_state_exception",
17:06:40 1> "reason" : "unexpected byte [0xff]",
17:06:40 1> "caused_by" : {
17:06:40 1> "type" : "illegal_state_exception",
17:06:40 1> "reason" : "unexpected byte [0xff]",
17:06:40 1> "stack_trace" : "java.lang.IllegalStateException: unexpected byte [0xff]
17:06:40 1> at org.elasticsearch.common.io.stream.StreamInput.readBoolean(StreamInput.java:449)
17:06:40 1> at org.elasticsearch.common.io.stream.StreamInput.readBoolean(StreamInput.java:439)
17:06:40 1> at org.elasticsearch.search.builder.SearchSourceBuilder.<init>(SearchSourceBuilder.java:224)
17:06:40 1> at org.elasticsearch.common.io.stream.StreamInput.readOptionalWriteable(StreamInput.java:803)
17:06:40 1> at org.elasticsearch.search.internal.ShardSearchLocalRequest.<init>(ShardSearchLocalRequest.java:114)
17:06:40 1> at org.elasticsearch.search.internal.ShardSearchTransportRequest.<init>(ShardSearchTransportRequest.java:64)
17:06:40 1> at org.elasticsearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:56)
17:06:40 1> at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1026)
17:06:40 1> at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:916)
17:06:40 1> at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:751)
17:06:40 1> at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
17:06:40 1> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
17:06:40 1> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
17:06:40 1> at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
17:06:40 1> at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
17:06:40 1> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
17:06:40 1> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
17:06:40 1> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
17:06:40 1> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
17:06:40 1> at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556)
17:06:40 1> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510)
17:06:40 1> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
17:06:40 1> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
17:06:40 1> at java.lang.Thread.run(Thread.java:748)
17:06:40 1> "
17:06:40 1> },
17:06:40 1> "stack_trace" : "[unexpected byte [0xff]]; nested: IllegalStateException[unexpected byte [0xff]];
17:06:40 1> at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:639)
17:06:40 1> at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:137)
17:06:40 1> at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:259)
17:06:40 1> at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:105)
17:06:40 1> at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:50)
17:06:40 1> at org.elasticsearch.action.search.InitialSearchPhase$2.onFailure(InitialSearchPhase.java:273)
17:06:40 1> at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73)
17:06:40 1> at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59)
17:06:40 1> at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:441)
17:06:40 1> at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118)
17:06:40 1> at org.elasticsearch.transport.TcpTransport.lambda$handleException$24(TcpTransport.java:995)
17:06:40 1> at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192)
17:06:40 1> at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:993)

Member

@jkakavas jkakavas left a comment

LGTM, thanks for the iterations Albert!

@albertzaharovits
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

@albertzaharovits
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample

@albertzaharovits albertzaharovits merged commit 357aac8 into elastic:master Mar 21, 2019
@albertzaharovits albertzaharovits deleted the refactor_token_service_3 branch March 21, 2019 08:55
albertzaharovits added a commit to albertzaharovits/elasticsearch that referenced this pull request Mar 21, 2019
This refactoring is in the context of the work related to moving security
tokens to a new index. In that regard, the Token Service has to work with
token documents stored in either of the two indices, albeit only as a transient
situation. I reckoned the added complexity would be unmanageable,
hence this refactoring.

This is incomplete, as it fails to address the goal of minimizing .security accesses,
but I have stopped because otherwise it would've become a full-blown rewrite
(if not already). I will follow up with more targeted PRs.

Besides the pure refactoring, some 400 errors were moved to 500. Furthermore,
more stringent validation of various return results has been implemented, notably
that of the token document creation.
albertzaharovits added a commit that referenced this pull request Mar 21, 2019
This refactoring is in the context of the work related to moving security
tokens to a new index. In that regard, the Token Service has to work with
token documents stored in either of the two indices, albeit only as a transient
situation. I reckoned the added complexity would be unmanageable,
hence this refactoring.

This is incomplete, as it fails to address the goal of minimizing .security accesses,
but I have stopped because otherwise it would've become a full-blown rewrite
(if not already). I will follow up with more targeted PRs.

Besides the pure refactoring, some 400 errors were moved to 500. Furthermore,
more stringent validation of various return results has been implemented, notably
that of the token document creation.
pgomulka pushed a commit to pgomulka/elasticsearch that referenced this pull request Mar 25, 2019
This refactoring is in the context of the work related to moving security
tokens to a new index. In that regard, the Token Service has to work with
token documents stored in either of the two indices, albeit only as a transient
situation. I reckoned the added complexity would be unmanageable,
hence this refactoring.

This is incomplete, as it fails to address the goal of minimizing .security accesses,
but I have stopped because otherwise it would've become a full-blown rewrite
(if not already). I will follow up with more targeted PRs.

Besides the pure refactoring, some 400 errors were moved to 500. Furthermore,
more stringent validation of various return results has been implemented, notably
that of the token document creation.