Occasional failure of REST API (timeouts) #2559
Comments
@coffee-squirrel Make sure that the URI in the
@joschi In the Test environment "test.graylog.mysite.com" simply resolves to the Here are the
Note that we currently have an Apache reverse proxy on the same server handling REST API TLS on port 9443 (with 1 balancer member). These environments have been, and currently are, communicating fine... we really only see timeouts when this particular issue occurs.
@coffee-squirrel Can

Try not setting
Yes it can, and there is also a hosts file entry. We do not have

If we were to assume communication is not the issue, could something session-related cause a deadlock with the REST API (thus causing timeouts)?
No, the web interface would still use the HTTPS variant of the Graylog REST API fronted by Apache httpd.
Not exactly. The session is being used to query the other Graylog nodes. If there is no active session, no other Graylog nodes will be queried.
Oops, looks like I forgot

In the aforementioned case, where Graylog Server-based TLS for the REST API failed, it did start working when I switched to HTTP (prior to inserting httpd to get back to HTTPS), so perhaps the change you suggested will help. At minimum, I think we may be able to determine whether HTTP (Graylog Server calls) vs. HTTPS (normal users) makes a difference.
Could you please clarify this? Perhaps I'm misunderstanding the
We ran into this issue in a non-Prod environment (1 Graylog Server node, 1 Elasticsearch node) this morning. As suggested, I removed
As before, emptying the `sessions` collection made the REST API responsive again.
@coffee-squirrel What happens if you run the following command on the machine hosting Graylog?

```shell
# replace "user" and "password" with valid Graylog credentials
curl -u user:password -H 'Accept: application/json' -X GET 'http://10.4.40.11:8080/system/?pretty=true'
```
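For repeated checks it can help to put an explicit timeout on the request, so a hung REST API is distinguishable from a merely slow one. Below is a minimal stdlib Python sketch of the same call; the base URL and credentials are placeholders, and the 5-second timeout is an arbitrary choice, not anything from this thread:

```python
import base64
import json
import urllib.request

def check_graylog_api(base_url, user, password, timeout=5.0):
    """GET /system from the Graylog REST API, failing fast instead of hanging."""
    req = urllib.request.Request(base_url.rstrip("/") + "/system")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    req.add_header("Accept", "application/json")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

A timeout exception here points at the API itself being stuck, while an immediate connection error points at the listener or the proxy in front of it.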
Currently that command returns the expected JSON. I didn't run a command involving authentication earlier, but did run these (with no noticeable delay in response):
This time there were 24 documents in the `sessions` collection.
Just as an update: this issue is still occurring. Since it mostly happens overnight, when backups run, my guess is that the issue is triggered by high storage latency. I've written a small script to automatically purge the `sessions` collection.
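The purge script itself was not shared in the thread; its core might look like the sketch below. The `last_access_time` field name and the 12-hour cutoff are assumptions, not confirmed anywhere above; inspect a real session document before deleting anything. Here `collection` is a pymongo-style handle, e.g. `client["graylog"]["sessions"]`:

```python
from datetime import datetime, timedelta, timezone

def purge_stale_sessions(collection, max_age_hours=12):
    """Delete session documents older than the cutoff; returns the count removed.

    The "last_access_time" field name is an assumption about the session
    schema; verify it against your own data before running this in prod.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    result = collection.delete_many({"last_access_time": {"$lt": cutoff}})
    return result.deleted_count
```

Run from cron, this keeps the collection small without logging every user out, unlike dropping the whole collection.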
I'm unable to reproduce the issue. The only way I'm able to produce this error message is with a broken network setup (e.g. Graylog not being able to contact its configured REST URI).

Since a workaround seems to exist, I'm closing this issue.
Issue is occurring for me too.
rest_listen_uri = http://127.0.0.1:12900/

Also, I run nginx in front of Graylog with a configuration based on this article: http://docs.graylog.org/en/2.0/pages/configuration/web_interface.html#configuring-webif-nginx
@joschi I think this issue should be re-opened.
@k3nny0ne If you have more information that helps to reproduce this issue, feel free to share it.
I agree this should be re-opened. This happens constantly in our environment with 3 Graylog nodes and a 3-node MongoDB replica set. All instances are in AWS. The Graylog nodes are behind a load balancer and there is clear communication between all the nodes. I consistently see this entry in the logs:

2016-10-18T19:00:54.521Z WARN [ProxiedResource] Unable to call http://REDACTED:9000/api/system on node

From today's Graylog server logs only:

Our current config has the following:

This is configured the same for all nodes. The behavior is commonly exhibited when looking at the nodes view in the UI, where you routinely see the journal metrics and heap size become "unavailable" and then start reporting again on their own.

I also have a CloudWatch metric set up polling the API directly on the master node every 2 minutes to report the journal backlog to CloudWatch for trending and alerting (in case the journal gets too far behind on any one node). It sends an SNS notification if it's unable to read any one node's journal backlog via the API. I got the alert today that it was unable to read the backlog for one of the nodes, indicating the REST API timed out during the call.

Our CPU utilization on these nodes is likewise well within acceptable bounds. Adding the CPU utilization and journal backlog for the last week.
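A polling check like the one described above can be reduced to a small function that records `None` for any node whose REST API does not answer in time; that `None` is the condition to alert on. The `/system/journal` path and the `uncommitted_journal_entries` field are my reading of the Graylog 2.x API, so treat them as assumptions; the node URLs and timeout are placeholders:

```python
import json
import socket
import urllib.request
import urllib.error

def journal_backlog(node_api_urls, timeout=5.0):
    """Return {url: uncommitted journal entries, or None if unreachable}."""
    backlog = {}
    for url in node_api_urls:
        req = urllib.request.Request(url.rstrip("/") + "/system/journal")
        req.add_header("Accept", "application/json")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                data = json.load(resp)
            backlog[url] = data.get("uncommitted_journal_entries")
        except (urllib.error.URLError, socket.timeout):
            backlog[url] = None  # timed out or unreachable; alert on this node
    return backlog
```

Publishing the resulting numbers to CloudWatch (or sending the SNS notification) would sit on top of this, outside the sketch.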
@wrsuarez Which version of Graylog are you using? |
Graylog 2.1.1+01d50e5 |
I am now experiencing the same issue. Curl to the URL in the error message from the affected server is working fine. Graylog 2.1.2+50e449a |
Me too, and details are available in #3116. I noticed that as the exceptions start appearing, the heap utilisation goes up. Graylog 2.1.2.
Problem description
We've experienced partial outages in 2 separate Graylog environments where requests to the REST API start timing out (some log extracts below). In all cases the Web UI (i.e. everything originating from the web interface) loaded quickly and Graylog Server message consumption appeared to be unaffected, but all calls to the REST API (e.g. creating a new session, or anything in an existing session) timed out. One exception was the REST API browser list ("/api-browser"); however, none of the individual entries (e.g. "AlarmCallbacks" @ "/api-docs/streams/%7Bstreamid%7D/alarmcallbacks") appeared to load, since I couldn't click to expand them.
CPU and Graylog Server utilization was fairly low, and since the nodes almost always sit between 1 and 2 GB of heap used (max heap 3 GB), memory pressure seems unlikely to be the issue.
Previously, a restart of the affected Graylog Server VMs (or sometimes the `graylog-server` and/or `mongod` services) was done; however, yesterday (2016-07-26) that did not seem to help in our Test environment. On a hunch, I connected to the environment's MongoDB instance and dropped the `sessions` collection.

After emptying the `sessions` collection, the REST API became responsive immediately. I'm wondering if the aforementioned restarts just acted as a trigger for something similar to session cleanup, even though a restart didn't make a difference yesterday (perhaps some session-lifetime situation?).

I don't currently have an environment with this issue, since there was some urgency to get things working again, but I would be happy to provide any information that may help.
Note that in the following example the "Did not find meta info of this node" warning occurred a day before, and the environment was fine afterwards.
Steps to reproduce the problem
No known trigger
Environment
GRAYLOG_SERVER_JAVA_OPTS="-Djava.net.preferIPv4Stack=true -Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow"
(typically floats between 1 and 2 GB heap utilization per node, as shown by `/system/nodes`)