Don't check ES cluster health when flushing messages #3927

joschi · 2017-06-21T12:36:20Z

Checking the ES cluster health before flushing messages was important while using the embedded Elasticsearch client node because it would block processing until the cluster is available and healthy (YELLOW or GREEN).

After migrating to an HTTP-based Elasticsearch client, this isn't necessary anymore. The client will simply fail to index the messages without blocking.

Additionally, this changeset marks only those messages as committed that were successfully indexed. Before this change, all messages of a batch were marked as committed and removed from the journal, whether they had been indexed or not.

bernd · 2017-06-21T13:41:13Z

@joschi The archive plugin needs to be adjusted for this.


bernd · 2017-06-21T18:37:19Z

I thought about and tested some failure scenarios with this PR.

Previously we checked if the write index alias exists before we wrote the messages into Elasticsearch. With this PR, we don't do that anymore because it's expensive. When I remove the graylog_deflector alias from the active write index, indexing new messages creates an index named graylog_deflector. Our periodical that checks the aliases then complains that the alias name is used by an actual index.

This breaks the system pretty badly. I am wondering if this is an issue in real life or if we can live with it. I'm not sure whether switching aliases in Elasticsearch is an atomic operation or whether this might happen during index rotation. We could extend our documentation and tell users to disable automatic index creation in Elasticsearch. Not sure if that breaks anything on our side, though.

Opinions?

I also thought about retrying bulk requests. Currently we retry a bulk request only if a SocketTimeoutException is thrown. That means Graylog doesn't retry the bulk request if Elasticsearch is down; it happily throws the request away. So I guess we have to adjust the retry logic.

Also we are not checking if the HTTP request succeeded via BulkResult#isSucceeded().

Looking forward to your feedback. /cc @jochen @dennisoelkers @kroepke

bernd · 2017-06-22T09:23:59Z

Regarding the retry, I think it would be nice to get some log messages when retrying the bulk request, like we do in the journalling message handler: https://github.com/Graylog2/graylog2-server/blob/master/graylog2-server/src/main/java/org/graylog2/shared/buffers/JournallingMessageHandler.java#L50-L56
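To make the retry discussion concrete, here is a minimal sketch of a loop that retries on any IOException or unsuccessful response and logs each failed attempt. It is an illustration only, not the code from this PR or from #3929: `BulkRetrySketch`, `BulkCall` and `executeWithRetry` are hypothetical names, the boolean result merely stands in for a check such as BulkResult#isSucceeded(), and a real implementation would presumably wait between attempts.

```java
import java.io.IOException;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names exist in Graylog. */
final class BulkRetrySketch {

    /** Stand-in for one bulk HTTP call; returns true if the response reports success. */
    interface BulkCall {
        boolean execute() throws IOException;
    }

    /**
     * Runs the call up to maxAttempts times. Every IOException and every
     * unsuccessful response is logged and triggers another attempt.
     * Returns true as soon as one attempt succeeds, false otherwise.
     */
    static boolean executeWithRetry(BulkCall call, int maxAttempts, Consumer<String> log) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (call.execute()) {
                    return true;
                }
                log.accept("Bulk request was not successful (attempt " + attempt
                        + " of " + maxAttempts + "), retrying");
            } catch (IOException e) {
                log.accept("Bulk request failed with " + e.getClass().getSimpleName()
                        + " (attempt " + attempt + " of " + maxAttempts + "): " + e.getMessage());
            }
            // Assumption: a real implementation would sleep or back off here.
        }
        return false;
    }
}
```

When the method returns false, the caller can leave the batch uncommitted in the journal instead of dropping it.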

dennisoelkers · 2017-06-22T10:22:58Z

graylog2-server/src/main/java/org/graylog2/outputs/ElasticSearchOutput.java

-            journal.markJournalOffsetCommitted(entry.getValue().getJournalOffset());
+            final Message message = entry.getValue();
+            if (!failedMessageIds.contains(message.getId())) {
+                journal.markJournalOffsetCommitted(message.getJournalOffset());


This is not working the way it is supposed to. It marks all messages up to and including the successful message with the highest journal offset as committed, including failed messages with lower journal offsets.

See this for an explanation of what markJournalOffsetCommitted does.
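One way to avoid this, assuming a commit really does cover everything up to and including the given offset, is to commit only the highest offset that lies below the lowest failed offset. The sketch below illustrates that idea with hypothetical stand-in types (`CommitOffsetSketch`, `Entry`, `highestSafeOffset`); it is not necessarily what the final commit in this PR does.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; these are not Graylog's actual classes. */
final class CommitOffsetSketch {

    /** Minimal stand-in for a journaled message: an ID plus its journal offset. */
    static final class Entry {
        final String id;
        final long journalOffset;

        Entry(String id, long journalOffset) {
            this.id = id;
            this.journalOffset = journalOffset;
        }
    }

    /** Returned when no offset can be committed without covering a failed message. */
    static final long NOTHING_TO_COMMIT = -1L;

    /**
     * Returns the highest offset in the batch that is lower than every failed
     * message's offset, so that committing it never covers a failed message.
     */
    static long highestSafeOffset(List<Entry> batch, Set<String> failedIds) {
        long lowestFailed = Long.MAX_VALUE;
        for (Entry e : batch) {
            if (failedIds.contains(e.id)) {
                lowestFailed = Math.min(lowestFailed, e.journalOffset);
            }
        }
        long safe = NOTHING_TO_COMMIT;
        for (Entry e : batch) {
            if (e.journalOffset < lowestFailed) {
                safe = Math.max(safe, e.journalOffset);
            }
        }
        return safe;
    }
}
```

For a batch with offsets 10, 11 and 12 where only the message at offset 11 failed, this yields 10, so the messages at offsets 11 and 12 stay in the journal. The successful message at offset 12 would be re-read later, which trades a possible duplicate for not losing the failed one.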

kroepke · 2017-06-22T13:29:03Z

@bernd We used to recommend disabling automatic index creation for that reason, but I'm not sure how long ago that was.
Switching aliases used to be atomic and I am not aware that this changed. It was one of the reasons we went with it as a write target solution in the first place, because it required no coordination.
OTOH I'm not overly happy with it because of the risks you mention.

Disabling automatic index creation and then not having the deflector alias will very likely lose messages (depending on our retry strategy).
The only other solution I can think of is to always create the next index immediately and only switch the alias during rotation. That would minimize the chance of not having an index. But this probably requires changes all over the place, because that index should probably not be used in retention and search (and archiving).

Given all that, my vote would be to take the risk for the beta now and to throw more testing against it to see how it behaves under load.

bernd

LGTM 👍

joschi added elasticsearch ready-for-review labels Jun 21, 2017

joschi added this to the 2.3.0 milestone Jun 21, 2017

joschi force-pushed the elasticsearch-reduce-http-roundtrips branch from 143404f to ea30563 Compare June 21, 2017 14:34

bernd self-assigned this Jun 21, 2017

dennisoelkers reviewed Jun 22, 2017


dennisoelkers mentioned this pull request Jun 22, 2017

Retrying bulk indexing in case of all IOExceptions and failed http response. #3929

Merged

dennisoelkers force-pushed the elasticsearch-reduce-http-roundtrips branch from 9389805 to ea30563 Compare June 22, 2017 13:45

Only committing highest journal offset.

cfb787f

bernd approved these changes Jun 23, 2017


bernd merged commit ac04fad into master Jun 23, 2017

bernd deleted the elasticsearch-reduce-http-roundtrips branch June 23, 2017 11:36

bernd removed the ready-for-review label Jun 23, 2017

jingene mentioned this pull request Jul 8, 2017

All message in journal lost if reroute ES api use on unassigned shard and graylog rotation #1581

Closed
