Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasional failure of REST API (timeouts) #2559

Closed
coffee-squirrel opened this issue Jul 27, 2016 · 19 comments
Closed

Occasional failure of REST API (timeouts) #2559

coffee-squirrel opened this issue Jul 27, 2016 · 19 comments
Assignees
Labels

Comments

@coffee-squirrel
Copy link

coffee-squirrel commented Jul 27, 2016

Problem description

We've experienced partial outages in 2 separate Graylog environments where requests to the REST API start timing out (some log extracts below). In all cases the Web UI (i.e. everything originating from the Web interface) loaded quickly and Graylog Server message consumption appeared to be unaffected, but all calls to the REST API (e.g. creating a new session or anything in an existing session) timed out. One exception was the REST API browser list ("/api-browser"), however none of the individual entries (e.g. "AlarmCallbacks" @ "/api-docs/streams/%7Bstreamid%7D/alarmcallbacks") appeared to load since I couldn't click to expand.

CPU and Graylog Server utilization was fairly low, and since the nodes are almost always between 1-2GB (max heap 3GB) it seems unlikely that was an issue.

Previously a restart of the affected Graylog Server VMs (or sometimes the graylog-server and/or mongod services) was done, however yesterday (2016-07-26) that did not seem to help in our Test environment. On a hunch, I connected to the environment's MongoDB instance and dropped the sessions collection as shown:

mongo myserver123
use graylog
db.sessions.remove({})
#WriteResult({ "nRemoved" : 96 })

After emptying the sessions collection, the REST API became responsive immediately. I'm wondering if the aforementioned restarts just acted as a trigger to do something similar with session cleanup, even though that didn't make a difference yesterday (perhaps some session lifetime situation?).

I don't currently have an environment with this issue, since there was some urgency to get things working again, but would be happy to provide any information that may help.

2016-07-12 01:29:33,614 WARN : org.graylog2.periodical.NodePingThread - Did not find meta info of this node. Re-registering.
2016-07-12 01:29:43,747 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call https://test.graylog.mysite.com:9443/system/metrics/multiple on node <b43e4ccb-1162-4685-98e0-74832bb44310>, caught exception: timeout (class java.net.SocketTimeoutException)
2016-07-12 01:29:43,751 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call https://test.graylog.mysite.com:9443/system/metrics/multiple on node <b43e4ccb-1162-4685-98e0-74832bb44310>, caught exception: timeout (class java.net.SocketTimeoutException)
2016-07-12 01:29:43,752 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call https://test.graylog.mysite.com:9443/system/metrics/multiple on node <b43e4ccb-1162-4685-98e0-74832bb44310>, caught exception: Read timed out (class java.net.SocketTimeoutException)

Note that in the following example the "Did not find meta info of this node" warning occurred a day before, and the environment was fine afterwards.

2016-07-25 00:36:03,987 WARN : org.graylog2.periodical.NodePingThread - Did not find meta info of this node. Re-registering.
...
2016-07-26 00:59:14,606 INFO : org.apache.shiro.session.mgt.AbstractValidatingSessionManager - Validating all active sessions...
2016-07-26 00:59:14,625 INFO : org.apache.shiro.session.mgt.AbstractValidatingSessionManager - Finished session validation. No sessions were stopped.
2016-07-26 01:07:40,985 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call https://test.graylog.mysite.com:9443/system/metrics/multiple on node <b43e4ccb-1162-4685-98e0-74832bb44310>, caught exception: timeout (class java.net.SocketTimeoutException)
...
2016-07-26 01:07:48,411 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call https://test.graylog.mysite.com:9443/system/metrics/multiple on node <b43e4ccb-1162-4685-98e0-74832bb44310>, caught exception: Read timed out (class java.net.SocketTimeoutException)
2016-07-26 14:28:41,289 INFO : org.apache.shiro.session.mgt.AbstractValidatingSessionManager - Enabling session validation scheduler...
2016-07-26 14:28:54,311 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call https://test.graylog.mysite.com:9443/system/metrics/multiple on node <b43e4ccb-1162-4685-98e0-74832bb44310>, caught exception: Read timed out (class java.net.SocketTimeoutException)
2016-07-26 14:28:54,314 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call https://test.graylog.mysite.com:9443/system/metrics/multiple on node <b43e4ccb-1162-4685-98e0-74832bb44310>, caught exception: timeout (class java.net.SocketTimeoutException)
2016-07-26 14:28:55,111 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call https://test.graylog.mysite.com:9443/system/metrics/multiple on node <b43e4ccb-1162-4685-98e0-74832bb44310>, caught exception: timeout (class java.net.SocketTimeoutException)

Steps to reproduce the problem

No known trigger

Environment

  • 2 servers in Test: server 1 with MongoDB and Graylog Server, server 2 with Elasticsearch (also occurred in Prod w/ 2x2)
  • Graylog Version: 2.0.3
    • GRAYLOG_SERVER_JAVA_OPTS="-Djava.net.preferIPv4Stack=true -Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow" (typically floats between 1-2GB heap utilization per /system/nodes)
  • Elasticsearch Version: 2.2.0
  • MongoDB Version: 2.6.12
  • Operating System: Red Hat Enterprise Linux Server release 6.7
  • Browser version: Firefox 47, Chrome 51.0.2704.106m
@joschi
Copy link
Contributor

joschi commented Jul 28, 2016

@coffee-squirrel Make sure that the URI in the rest_listen_uri/rest_transport_uri configuration settings are accessible for all Graylog nodes.
The Graylog nodes have to be able to communicate with other Graylog nodes and themselves (!) through that URI. It seems that this is not the case in your setup (using https://test.graylog.mysite.com:9443/).

@coffee-squirrel
Copy link
Author

@joschi In the Test environment "test.graylog.mysite.com" simply resolves to the 10.4.40.11 IP of the one Graylog Server node. This server has no firewall rules denying access to 10.4.40.11:9443 (and 10.4.40.11:8080 is allowed locally... see below). I'd like to mention again that the issue cleared up immediately, with no other changes, after I emptied the sessions collection in the environment's one MongoDB instance.

Here are the rest_listen_uri and rest_transport_uri entries:

rest_listen_uri = http://10.4.40.11:8080
rest_transport_uri = https://test.graylog.mysite.com:9443

Note that we currently have an Apache reverse proxy on the same server handling REST API TLS/9443 (with 1 balancer member http://10.4.40.11:8080) since we ran into a seemingly-random failure when Graylog Server was handling REST API TLS (worked fine the month before and the Web interface still works fine, only differing in port number). This issue has occurred in this environment before we added the reverse proxy, however, as well as in the multi-node Prod environment (which doesn't have the reverse proxy).

These environments have been, and are currently, communicating fine... we really only see timeouts when this particular issue occurs.

@joschi
Copy link
Contributor

joschi commented Jul 28, 2016

@coffee-squirrel Can test.graylog.mysite.com be resolved via DNS on the machine running Graylog?

Try not setting rest_transport_uri in your Graylog configuration and use web_endpoint_uri instead for configuring the URI which the web interface should connect to (probably https://test.graylog.mysite.com:9443 in your case).

@coffee-squirrel
Copy link
Author

Yes it can, and there is also a hosts file entry.

We do not have web_endpoint_uri defined in "server.conf", so my understanding is that we're currently defaulting to the value of rest_transport_uri (which is https://test.graylog.mysite.com:9443 ). If I'm reading BaseConfiguration.java correctly, flipping that around would result in the value of rest_listen_uri (http://10.4.40.11:8080) being used, which would result in mixed HTTPS (Web interface) and HTTP (REST API). Is that correct?

If we were to assume communication is not the issue, could something session-related cause a deadlock with the REST API (thus causing timeouts)?

@joschi
Copy link
Contributor

joschi commented Jul 28, 2016

flipping that around would result in the value of rest_listen_uri (http://10.4.40.11:8080) being used, which would result in mixed HTTPS (Web interface) and HTTP (REST API). Is that correct?

No, the web interface would still use the HTTPS variant of the Graylog REST API fronted by Apache httpd.

If we were to assume communication is not the issue, could something session-related cause a deadlock with the REST API (thus causing timeouts)?

Not exactly. The session is being used to query the other Graylog nodes. If there is no active session, no other Graylog nodes will be queried.

@coffee-squirrel
Copy link
Author

coffee-squirrel commented Jul 28, 2016

Oops, looks like I forgot web_endpoint_uri would be defined. So the result would be the Graylog Server would use rest_listen_uri (http://10.4.40.11:8080) but Web interface-based things would use web_endpoint_uri (https://test.graylog.mysite.com:9443)? I won't be able to make the change immediately (and we're not currently experiencing the issue), but can try this the next time the issue occurs.

In the aforementioned case where Graylog Server-based TLS for the REST API failed it did start working when I switched to HTTP (prior to inserting httpd to get back to HTTPS), so perhaps the change you suggested will help. At minimum I think we may be able to determine whether HTTP (Graylog Server calls) vs HTTPS (normal users) makes a difference.

Not exactly. The session is being used to query the other Graylog nodes. If there is no active session, no other Graylog nodes will be queried.

Could you please clarify this? Perhaps I'm misunderstanding the sessions collection, which I thought was for user sessions. We were seeing failed REST API requests for both authenticated and unauthenticated users (e.g. viewing the REST API Browser).

@coffee-squirrel
Copy link
Author

We ran into this issue in a non-Prod environment (1 Graylog Server node, 1 Elasticsearch node) this morning.

As suggested, I removed rest_transport_uri, added web_endpoint_uri (https://test.graylog.mysite.com:9443), and restarted the Graylog Server instance. Afterwards the following was seen repeatedly in "server.log" (note http://10.4.40.11:8080 is rest_listen_uri):

2016-08-02 07:46:30,752 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call http://10.4.40.11:8080/system/metrics/multiple on node <b43e4ccb-1162-4685-98e0-74832bb44310>, caught exception: timeout (class java.net.SocketTimeoutException)

As before, emptying the sessions collection in the graylog MongoDB DB (with no other action) immediately resulted in the REST API (and thus the Web interface) becoming responsive again.

@joschi
Copy link
Contributor

joschi commented Aug 2, 2016

@coffee-squirrel What happens if you run the following command on the machine hosting Graylog?

# replace "user" and "password" with valid Graylog credentials
curl -u user:password -H 'Accept: application/json' -X GET 'http://10.4.40.11:8080/system/?pretty=true'

@coffee-squirrel
Copy link
Author

coffee-squirrel commented Aug 2, 2016

Currently that command returns the expected JSON. I didn't run a command involving authentication earlier, but did run these (with no noticeable delay in response):

[me@myserver123 graylog-server]$ curl http://10.4.40.11:8080
{"type":"ApiError","message":"HTTP 404 Not Found"}

[me@myserver123 graylog-server]$ curl http://10.4.40.11:8080/api-browser
<!DOCTYPE html>
<html>
<head>
  <title>Graylog REST API browser</title>
...

This time there were 24 documents in the sessions collection. I dumped the graylog DB beforehand, if anything there might be of use.

@coffee-squirrel
Copy link
Author

Just as an update-- this issue is still occurring. Since it mostly happens overnight, when backups run, my guess is the issue is triggered by high storage latency. I've written a small script to automatically purge the sessions collection whenever the Unable to call count increases in "server.log", which has at least kept the environment available for users.

@joschi
Copy link
Contributor

joschi commented Aug 25, 2016

I'm unable to reproduce the issue. The only way I'm able to produce this error message is with a broken network setup (e. g. Graylog not being able to contact the URI given in rest_transport_uri because the routing is broken or a packet filter is dropping packets).

Since a workaround seems to exist, I'm closing this issue.

@joschi joschi closed this as completed Aug 25, 2016
@k3nny0ne
Copy link

k3nny0ne commented Sep 12, 2016

Issue is occurring for me too.
My env:

  • 2-node Mongo replica-set (3.0.6)
  • 1 graylog-server node (2.0.3):
    -Xms8g -Xmx8g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow
  • Elasticsearch 2.3.3
  • OS RHEL 6.6

rest_listen_uri = http://127.0.0.1:12900/
rest_enable_cors = true
rest_enable_gzip = false
web_listen_uri = http://127.0.0.1:9010
web_endpoint_uri = https://graylog.company.local/api

Also I run nginx in front of Graylog with configuration based on this article: http://docs.graylog.org/en/2.0/pages/configuration/web_interface.html#configuring-webif-nginx

@k3nny0ne
Copy link

k3nny0ne commented Oct 9, 2016

@joschi I think this issue should be re-opened

@joschi
Copy link
Contributor

joschi commented Oct 10, 2016

@k3nny0ne If you have more information which help to reproduce this issue, feel free to share them.

@wrsuarez
Copy link

wrsuarez commented Oct 18, 2016

I agree this should be re-opened. This happens constantly in our environment with 3 Graylog nodes and a 3 node MongoDB Replica set. All instances are in AWS. The Graylog nodes are behind a load balancer and there is clear communication between all the nodes. I consistently see this entry in the logs:

2016-10-18T19:00:54.521Z WARN [ProxiedResource] Unable to call http://REDACTED:9000/api/system on node
java.net.SocketTimeoutException: timeout
at okio.Okio$3.newTimeoutException(Okio.java:210) ~[graylog.jar:?]
at okio.AsyncTimeout.exit(AsyncTimeout.java:288) ~[graylog.jar:?]
at okio.AsyncTimeout$2.read(AsyncTimeout.java:242) ~[graylog.jar:?]
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:325) ~[graylog.jar:?]
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:314) ~[graylog.jar:?]
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:210) ~[graylog.jar:?]
at okhttp3.internal.http.Http1xStream.readResponse(Http1xStream.java:186) ~[graylog.jar:?]
at okhttp3.internal.http.Http1xStream.readResponseHeaders(Http1xStream.java:127) ~[graylog.jar:?]
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:53) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:109) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:124) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:59) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:170) ~[graylog.jar:?]
at okhttp3.RealCall.execute(RealCall.java:60) ~[graylog.jar:?]
at retrofit2.OkHttpCall.execute(OkHttpCall.java:174) ~[graylog.jar:?]
at org.graylog2.shared.rest.resources.ProxiedResource.lambda$null$0(ProxiedResource.java:76) ~[graylog.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_102]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_102]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_102]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:203) ~[?:1.8.0_102]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_102]
at okio.Okio$2.read(Okio.java:138) ~[graylog.jar:?]
at okio.AsyncTimeout$2.read(AsyncTimeout.java:238) ~[graylog.jar:?]
... 29 more

From today's Graylog server logs only:
grep "9000/api/system on node <" /var/log/graylog-server/server.log |wc -l
199
grep "9000/api/system on node <" /var/log/graylog-server/server.log |wc -l
564
grep "9000/api/system on node <" /var/log/graylog-server/server.log |wc -l
197

Our current config has the following:
rest_listen_uri = http://0.0.0.0:9000/api/
web_listen_uri = http://0.0.0.0:9000/

This is configured the same for all nodes.

The behavior is commonly exhibited when looking at the nodes view in the UI and you routinely see the journal metrics and heap size become "unavailable" and then start reporting again on their own. I also have a Cloudwatch metric set up polling the API directly on the master node every 2 minutes to report out the journal backlog to CloudWatch for trending and alerting (in case the journal gets too far behind on any one node). It sends an SNS notification if its unable to read any one nodes journal backlog via the API. I got the alert today that it was unable to read the backlog for one of the nodes indicating the REST API timed out during the call. Our CPU utilization on these nodes is likewise well within acceptable bounds. Adding the CPU Utilization and Journal Backlog for the last week.

screen shot 2016-10-18 at 12 11 17 pm
screen shot 2016-10-18 at 12 12 46 pm

@joschi
Copy link
Contributor

joschi commented Oct 19, 2016

@wrsuarez Which version of Graylog are you using?

@wrsuarez
Copy link

Graylog 2.1.1+01d50e5

@fadenb
Copy link

fadenb commented Nov 20, 2016

I am now experiencing the same issue.
WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call http://10.0.0.6:9000/api/system/jobs on node <c09b73a4-2eb1-4734-b72e-8025a10b2d81>

Curl to the URL in the error message from the affected server is working fine.

Graylog 2.1.2+50e449a

@yogeshrao
Copy link

me too and details are available in #3116

I noticed as the exceptions start appearing the heap utilisation goes up GrayLog2 (2.1.2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants