
Kibana throws internal server error when auth fails due to missing saml token #22905

Closed
tsullivan opened this issue Sep 10, 2018 · 29 comments
Labels: bug, Team:Security

tsullivan (Member) commented Sep 10, 2018

Kibana version: 6.3

Elasticsearch version: 6.3

A user authenticates to Kibana via an Active Directory service and can use Kibana for the day. After a long period of inactivity (usually when returning the next day), the next time they open Kibana they see an error on the screen:

{ 
  statusCode: 500, 
  error: "Internal Server Error", 
  message: "An internal server error occurred" 
} 

Elasticsearch logs will show

[illegal_state_exception] token document is missing and must be present 

One can work around the problem by clearing browser cookies and refreshing the page.

Expected result:
If the token used for authentication is no longer valid, the user should be asked to log in again.

tsullivan added the bug label Sep 10, 2018
tsullivan (Member, Author) commented

@kobelb in https://github.com/elastic/kibana/blob/master/x-pack/plugins/security/server/lib/authentication/providers/saml.js#L43, Kibana handles invalid refresh tokens with some hard-coded response body strings. Is this a case of ES not returning the proper message body for an error?
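
For reference, a minimal sketch of what that hard-coded matching looks like (a simplified illustration with assumed names, not the actual saml.js source):

// Simplified sketch (names assumed): the SAML provider decides whether a
// failed token refresh should trigger a fresh login by matching hard-coded
// error descriptions from the Elasticsearch response body.
const KNOWN_REFRESH_FAILURES = [
  'token has already been refreshed',
  'refresh token is expired',
];

function isKnownRefreshFailure(message: string): boolean {
  // "token document is missing and must be present" is NOT in the list above,
  // so it falls through and surfaces as a 500 instead of a new login prompt.
  return KNOWN_REFRESH_FAILURES.some((known) => message.includes(known));
}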

kobelb (Contributor) commented Sep 11, 2018

@elastic/es-security Kibana currently has the two error descriptions hard-coded to determine whether the SAML refresh token is invalid: token has already been refreshed and refresh token is expired.

We're seeing an additional error description, token document is missing and must be present. Should we hard-code this error message as well, or is there a better way to determine whether the error thrown by Elasticsearch is indeed because the refresh token has expired?

kobelb added the Team:Security label Sep 11, 2018
albertzaharovits (Contributor) commented Sep 13, 2018

We're seeing an additional error description, token document is missing and must be present. Should we hard-code this error message as well, or is there a better way to determine whether the error thrown by Elasticsearch is indeed because the refresh token has expired?

No, this might be a bug. Did the user log out (calling ES /_xpack/security/saml/invalidate) but somehow still have the cookie?

We would be grateful for:

  • a token (as submitted in the header) that manifests this behavior, i.e. it worked at some point but not anymore (you could turn up debug logging for org.elasticsearch.xpack.security.authc)
curl -u elastic http://localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger": {
      "org.elasticsearch.xpack.security.authc": "DEBUG"
    }
  }
}
'
  • the elasticsearch.log with the stacktrace (the token should be in the logs)
  • ideally, the .security index contents
curl -u elastic -X GET "localhost:9200/.security/doc/_search"

Maybe someone else from @elastic/es-security has a better idea that doesn't require work from the reporter.

kobelb (Contributor) commented Feb 12, 2019

@dmlittle we've had trouble reproducing the issue you're seeing; would you mind elaborating on your SAML configuration? Which identity provider are you using? Do you use single logout?

jkakavas (Member) commented Feb 13, 2019

I had missed this originally :/ The behavior described here can happen because we remove tokens periodically. For this to manifest, all of the following must happen:

  • The user logs in via the saml authProvider in Kibana at timestamp X and leaves it be
  • The user then attempts to open Kibana again at Y, with Y > X + 24h
  • The user's Kibana authCookie has a lifetime of at least 24 hours
  • The user never explicitly logs out
  • Another user logs out or someone uses the Invalidate Token API at Z, with X + 24h < Z < Y (we potentially trigger the removal of old tokens on any token invalidation, while also honoring xpack.security.authc.token.delete.interval)

A user logging out or someone using the invalidation API can trigger the deletion of the now-expired access and refresh tokens that the user got at X. When Kibana sends a request with this now-deleted access token at Y, ES replies with the aforementioned exception, as it can't find the associated token document anymore.
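
For illustration, the kind of invalidation call at Z that can trigger that pruning looks like this (a sketch using the 6.x-era endpoint; the token value is a placeholder):

curl -u elastic -X DELETE "localhost:9200/_xpack/security/oauth2/token" -H 'Content-Type: application/json' -d'
{
  "token": "<access token to invalidate>"
}
'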

Arguably, ES should respond with a 401 instead of a 500 here.

kobelb (Contributor) commented Feb 19, 2019

@jkakavas that makes sense, we can update the SAML auth provider in Kibana to take this into consideration, thanks!

jkakavas (Member) commented

@kobelb I'm treating this as an ES bug, so hopefully this will be resolved for the next minor. I'll update this issue with the solution once it's merged

kobelb (Contributor) commented Feb 19, 2019

@kobelb I'm treating this as an ES bug, so hopefully this will be resolved for the next minor. I'll update this issue with the solution once it's merged

Gotcha, this is in regard to the 500 vs. 401, right? There's the potential that even with this change, we might have to augment our auth providers to handle this specific error situation differently.

jkakavas (Member) commented

If I understand the manifestation above correctly, Kibana gets the 500 (which will be made into a 401) when the access token is used in a normal API request. I thought that Kibana expects a 401 in this case and will then attempt to use the refresh token.

It was originally thought that the 500 was a response to the refresh token API request, but I don't believe this is the case.

azasypkin (Member) commented

I'll try to reproduce locally and see if there is anything we'll need to do in Kibana once we start getting 401 instead of 500 from Elasticsearch.

azasypkin self-assigned this Mar 15, 2019
kobelb (Contributor) commented Mar 15, 2019

Thanks Aleh! To manually cause this to occur, I was running a query to explicitly delete the tokens from Elasticsearch. Otherwise, we'd have to wait a certain period after the refresh token has expired for ES to delete them itself.
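
A sketch of such a query, for anyone else trying to reproduce this (it assumes token documents live in the .security index marked with doc_type: token, which varies by version; verify first and only run this against a disposable cluster):

curl -u elastic -X POST "localhost:9200/.security/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": { "term": { "doc_type": "token" } }
}
'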

azasypkin (Member) commented

Otherwise, we'd have to wait a certain period after the refresh token has expired for ES to delete them itself.

Thanks for the suggestion! This time I just tweaked Elasticsearch source code a bit :)

So yeah, if ES starts responding with a 401 when the access token document is missing, we'll clear the session cookie, drop the refresh token, and re-initiate login [*]. We try to use the refresh token only if we get 401 token expired.

Refresh token errors are different, though; we get a 400 error in most (all?) cases:

  • 400 token has already been refreshed
  • 400 refresh token is expired
  • 400 could not refresh the requested token ---> e.g. when the document with the refresh token has been deleted

For the first two we'll start a new SAML handshake, but for the last one the user will get a 400 Bad Request and be stuck with an invalid cookie.

Hey @jkakavas,

  • is there any case other than 401 token expired when the access token is rejected and we should try to use the refresh token?
  • is it possible to have an expired access token without a document for the refresh token (401 token expired for the access token + 400 could not refresh the requested token when we try to refresh it)? I'm just trying to figure out whether Kibana should implement special handling for refresh token errors other than 400 token has already been refreshed and 400 refresh token is expired.

[*] If the user hits this error during a full page reload they will see a blank page with a JSON error, so they will have to refresh the page once again. We can improve that if we want to, e.g. start a SAML handshake if access token validation fails with a 401, or something like that.
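
Putting the above into a sketch (assumed names and shapes, not our actual provider code):

// Sketch of the current decision tree: access token errors and refresh token
// errors map to different outcomes, partly based on reason strings.
type EsError = { statusCode: number; message: string };

function handleAuthError(stage: 'access' | 'refresh', err: EsError): string {
  if (stage === 'access') {
    // Only 401 "token expired" triggers a refresh attempt today.
    return err.statusCode === 401 && err.message.includes('token expired')
      ? 'try-refresh-token'
      : 'fail';
  }
  // Refresh failures come back as 400 with one of several reason strings.
  if (
    err.message.includes('token has already been refreshed') ||
    err.message.includes('refresh token is expired')
  ) {
    return 'start-saml-handshake';
  }
  // 400 "could not refresh the requested token" currently leaves the user
  // stuck with an invalid cookie.
  return 'fail';
}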

jkakavas (Member) commented

is there any case other than 401 token expired when the access token is rejected and we should try to use the refresh token?

No. For both explicitly invalidated and expired access tokens, we return:

HTTP/1.1 401 Unauthorized
Content-Length: 540
Content-Type: application/json; charset=UTF-8
Www-Authenticate: Bearer realm="security", error="invalid_token", error_description="The access token expired"
Date: Thu, 21 Mar 2019 10:30:14 GMT

{
  "error" : {
    "root_cause" : [
      {
        "type" : "security_exception",
        "reason" : "token expired",
        "header" : {
          "WWW-Authenticate" : "Bearer realm=\"security\", error=\"invalid_token\", error_description=\"The access token expired\""
        }
      }
    ],
    "type" : "security_exception",
    "reason" : "token expired",
    "header" : {
      "WWW-Authenticate" : "Bearer realm=\"security\", error=\"invalid_token\", error_description=\"The access token expired\""
    }
  },
  "status" : 401
}

There is the question of what happens when you use a wrong/malformed access token (this now returns a 500 but will be fixed to return a 401), what should be returned as the error description in that 401, and how you should handle it in Kibana. I think that we should still try to refresh, as there is no way of knowing whether the access token is wrong (as in, we can't find the relevant document) because it is actually a wrong attempt, or a bit flipped, or the access token was expired and then deleted but its associated refresh token is still good to use. So in summary, I think we can say that if you try to auth with an access token and you get a 401, try to use the refresh token you have.

is it possible to have an expired access token without a document for the refresh token (401 token expired for the access token + 400 could not refresh the requested token when we try to refresh it)?

<implementation_detail> it is the same document for the access token and the refresh token </implementation_detail>. Yes, this can happen (once we fix the behavior of getting a 401 for invalid access tokens). Even now, there is no guarantee that you won't get a 400 when refreshing, even if you got 401 token expired milliseconds before.

I'm just trying to figure out whether Kibana should implement special handling for refresh token errors other than 400 token has already been refreshed and 400 refresh token is expired

As we're discussing in elastic/elasticsearch#39808, we should be able to handle all 400s in Kibana as "the refresh token can't be used; re-initiate the auth flow".
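
As a sketch, that target policy collapses to status codes alone (assumed shape, not actual Kibana code):

// Target policy: no reason-string matching, status codes only.
function nextStep(stage: 'access' | 'refresh', statusCode: number): string {
  if (stage === 'access' && statusCode === 401) return 'try-refresh-token';
  if (stage === 'refresh' && statusCode === 400) return 'start-saml-handshake';
  return 'propagate-error';
}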

azasypkin (Member) commented

So in summary, I think we can say that if you try to auth with an access token and you get a 401, try to use the refresh token you have.

Nice, these error_description string-based checks really bothered me; it's great that we can fully rely on status codes now.

azasypkin (Member) commented

Fixed via #33777. The fix should be available starting from 7.1.

jkakavas (Member) commented

@azasypkin we will still return a 500 for a token that is deleted (elastic/elasticsearch#38866), does #33777 handle this? (We do plan to fix it for 7.1, but I think we'd better leave this issue open for tracking until then.)

azasypkin (Member) commented

does #33777 handle this?

Yes, we added a workaround for this specific error that we'll remove once elastic/elasticsearch#38866 is merged.

mindbat commented Feb 3, 2020

I know this issue has been closed for a while, but we seem to be encountering it again on a 7.5 cluster: https://github.com/elastic/infra/issues/17590

Should we re-open this issue, or create a new one?

jkakavas (Member) commented Feb 3, 2020

@mindbat The original problem has been addressed. This sounds similar and might well be related (the error message is the same but the part of the code that throws it is different), but I think we should handle this in a separate issue. If @elastic/kibana-security feels otherwise, I will open an ES issue to track this.

Since I can see in the infra ticket that this is reproducible, it would be immensely helpful if we could get the ES logs when that happens (with logger.org.elasticsearch.xpack.security.authc.TokenService set to TRACE).
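
Following the same pattern as the debug-logging example earlier in this thread, that would be something like:

curl -u elastic http://localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger": {
      "org.elasticsearch.xpack.security.authc.TokenService": "TRACE"
    }
  }
}
'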

mindbat commented Feb 3, 2020

I think we should handle this in a separate issue

Gotcha.

If @elastic/kibana-security feels otherwise, I will open an ES issue to track this.

Makes sense to me, and thanks!

it would be immensely helpful if we can get the ES logs

Agreed! Since this is an ESS cluster, we don't have direct access to the logs...I'll see what we can do, though.

azasypkin (Member) commented

If @elastic/kibana-security feels otherwise, I will open an ES issue to track this.

No objections from our side; that looks weird indeed. In the worst case I'd expect to see [esaggs] > Request to Elasticsearch failed: {"statusCode":400,"error":"Bad Request","message":"Both access and refresh tokens are expired."}, but not a 500.

jkakavas (Member) commented Feb 4, 2020

The only place we throw with that error message is https://github.com/elastic/elasticsearch/blob/7.5/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java#L1503, where we search for a document in the tokens index. However, to get there we have already called decodeToken, which read that document from the same index just before, so unless something explicitly deleted the document in the meantime, this makes no sense.

There is always the possibility of a bug (a race condition somewhere), but the logs will be very helpful in troubleshooting, so I'll defer further analysis until we get them.

tvernum (Contributor) commented Feb 5, 2020

@jkakavas What about https://github.com/elastic/elasticsearch/blob/7.5/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java#L444 ?

I think we need the ES server logs for this, and should open a separate issue to track it.

jkakavas (Member) commented Feb 5, 2020

Apologies everyone, I mixed up versions. This is a well-known issue that we fixed in 7.6 (and not in 7.5).

Kibana, on the other hand, should have dealt with the 500 in 7.5 too (and will until 7.7, where the handling is removed), right? This issue was closed when Kibana added that handling to compensate for our bug in 7.1.

azasypkin (Member) commented Feb 5, 2020

Kibana, on the other hand, should have dealt with the 500 in 7.5 too (and will until 7.7, where the handling is removed), right?

That's correct. And since 7.2, if this happens we restart the SAML handshake for non-AJAX requests or return 400 Bad Request for AJAX requests (users will have to manually refresh the browser page to trigger the SAML handshake in this case). But Kibana isn't supposed to return 500: 'token document is missing and must be present' to the user. I just checked this on the latest 7.5 (removed tokens manually) and it seems to work as we expect.

@mindbat can we get verbose Kibana logs (logging.verbose: true + logging.quiet: false + logging.silent: false) when this happens?
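
For reference, those settings go in kibana.yml (this is the legacy logging configuration used in these versions):

logging.verbose: true
logging.quiet: false
logging.silent: false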

mindbat commented Feb 5, 2020

can we get verbose Kibana logs (logging.verbose: true + logging.quiet: false + logging.silent: false) when this happens?

@azasypkin It's an ESS cluster, so we don't have direct access to logs.

I'll ping Cloud this week and see if we can get the logs from them.

mindbat commented Feb 5, 2020

OK, I don't have a copy of all the logs, but I did find an Internal Server Error being logged over and over again on the day we experienced this for the first time (I've stripped out some non-essential ESS-only info from the log below).

The query it's logging is one of the ones we use in the Kibana Dashboard that's throwing the token document is missing error for us:

Internal Server Error
{
  "_index": "cluster-kibana-filebeat-6.3.2-gcp-us-central1-2020.01.30",
  "_type": "doc",
  "_id": "***",
  "_version": 1,
  "_score": null,
  "_source": {
    "pid": 23,
    "error": {
      "stack": "Error: Internal Server Error\n    at HapiResponseAdapter.toError (/kibana/src/core/server/http/router/response_adapter.js:119:19)\n    at HapiResponseAdapter.toHapiResponse (/kibana/src/core/server/http/router/response_adapter.js:80:19)\n    at HapiResponseAdapter.handle (/kibana/src/core/server/http/router/response_adapter.js:75:17)\n    at Object.interceptAuth [as authenticate] (/kibana/src/core/server/http/lifecycle/auth.js:77:36)\n    at process._tickCallback (internal/process/next_tick.js:68:7)",
      "message": "Internal Server Error",
      "name": "Error"
    },
    "@version": "1",
    "type": "error",
    "input": {
      "type": "log"
    },
    "source": "kibana.log",
    "docker": {
      "container": {
        "image": "docker.elastic.co/cloud-assets/kibana:7.5.0-0",
        "labels": {
          "org": {
            "label-schema": {}
          },
          "co": {
            "elastic": {
              "cloud": {
                "allocator": {
                  "user_id": "***",
                  "type_version": "7.5.0",
                  "instance_id": "***",
                  "cluster_id": "***",
                  "zone": "us-central1-b",
                  "type": "kibana",
                  "kind": "kibana",
                  "id": "***"
                }
              }
            }
          }
        },
        "name": "***,
        "id": "***"
      }
    },
    "url": {
      "query": {
        "ignore_unavailable": "true",
        "ignore_throttled": "true",
        "preference": "1580399421715",
        "rest_total_hits_as_int": "true"
      },
      "pathname": "/elasticsearch/.ent-search-app-logs-*/_search",
      "href": "/elasticsearch/.ent-search-app-logs-*/_search?rest_total_hits_as_int=true&ignore_unavailable=true&ignore_throttled=true&preference=1580399421715",
      "search": "?rest_total_hits_as_int=true&ignore_unavailable=true&ignore_throttled=true&preference=1580399421715",
      "path": "/elasticsearch/.ent-search-app-logs-*/_search?rest_total_hits_as_int=true&ignore_unavailable=true&ignore_throttled=true&preference=1580399421715"
    },
    "host": {
      "name": "***"
    },
    "beat": {
      "hostname": "***",
      "name": "***",
      "version": "6.3.2"
    },
    "prospector": {
      "type": "log"
    },
    "level": "error",
    "message": "Internal Server Error",
    "tags": [
      "kibana",
      "logging"
    ],
    "@timestamp": "2020-01-30T21:02:42.000Z",
    "meta": {
    },
    "input_type": "kibana"
  },
  "fields": {
    "@timestamp": [
      "2020-01-30T21:02:42.000Z"
    ]
  },
  "sort": [
    ***
  ]
}

azasypkin (Member) commented

Unfortunately we can't get much value from this log; we really need full verbose Kibana logs covering the entire request authentication flow to know what's going on.

mindbat commented Feb 6, 2020

@azasypkin Gotcha. I'll see about getting verbose logging turned on, and then replicating the error.
