Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't check ES cluster health when flushing messages #3927

Merged
merged 2 commits into from Jun 23, 2017

Conversation

joschi
Copy link
Contributor

@joschi joschi commented Jun 21, 2017

Checking the ES cluster health before flushing messages was important while using the embedded Elasticsearch client node because it would block processing until the cluster is available and healthy (YELLOW or GREEN).

After migrating to an HTTP-based Elasticsearch client, this isn't necessary anymore. The client will simply fail to index the messages without blocking.

Additionally, this changeset only marks those messages as committed, which could be successfully indexed. Before this change, all messages of a batch were marked as committed and removed from the journal, whether they've been indexed or not.

@bernd
Copy link
Member

bernd commented Jun 21, 2017

@joschi The archive plugin needs to be adjusted for this.

Checking the ES cluster health before flushing messages was important
while using the embedded Elasticsearch client node because it would block
processing until the cluster is available and healthy (YELLOW or GREEN).

After migrating to an HTTP-based Elasticsearch client, this isn't necessary
anymore. The client will simply fail to index the messages.

Additionally, this changeset only marks those messages as committed, which
could be successfully indexed. Before this change, all messages of a batch
were marked as committed and removed from the journal, whether they've been
indexed or not.
@joschi joschi force-pushed the elasticsearch-reduce-http-roundtrips branch from 143404f to ea30563 Compare June 21, 2017 14:34
@bernd bernd self-assigned this Jun 21, 2017
@bernd
Copy link
Member

bernd commented Jun 21, 2017

I thought about and tested some failure scenarios with this PR.

Previously we checked if the write index alias exists before we wrote the messages into Elasticsearch. With this PR, we don't do that anymore because it's expensive. When I remove the graylog_deflector alias from the active write index, indexing new messages creates an index named graylog_deflector. Our periodical that checks the aliases then complains that the alias name is used by an actual index.

This breaks the system pretty bad. I am wondering if this is an issue in real life or if we can live with that. Not sure if switching aliases in Elasticsearch is an atomic operation of if this might happen during index rotation. We could extend our documentation and tell users to disable automatic index creation in Elasticsearch. Not sure if that breaks anything on our side, though.

Opinions?

I also thought about retrying bulk requests. Currently we are retrying a bulk request if a SocketTimeoutException is thrown. That doesn't make Graylog retry the bulk request if Elasticsearch is down and happily throws away the bulk request. So I guess we have to adjust the retry logic.

Also we are not checking if the HTTP request succeeded via BulkResult#isSucceeded().

Looking forward to your feedback. /cc @jochen @dennisoelkers @kroepke

@bernd
Copy link
Member

bernd commented Jun 22, 2017

Regarding the retry, I think it would be nice to get some log messages when retrying the build request. Like we do in the journalling message handler: https://github.com/Graylog2/graylog2-server/blob/master/graylog2-server/src/main/java/org/graylog2/shared/buffers/JournallingMessageHandler.java#L50-L56

journal.markJournalOffsetCommitted(entry.getValue().getJournalOffset());
final Message message = entry.getValue();
if (!failedMessageIds.contains(message.getId())) {
journal.markJournalOffsetCommitted(message.getJournalOffset());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not working the way it is supposed to. It just marks all messages up to and including the successfull message with the highest journal offset, including failed messages with lower journal offsets.

See this for the explanation what markJournalOffsetCommited does.

@kroepke
Copy link
Member

kroepke commented Jun 22, 2017

@bernd We used to recommand disabling automatic index creation for that reason, but I'm not sure how long ago that was.
Switching aliases used to be atomic and I am not aware that this changed. It was one of the reasons we went with it as a write target solution in the first place, because it required no coordination.
OTOH I'm not overly happy with it because of the risks you mention.

Disabling automatic index creation and failing to have the deflector alias will very likely lose messages (depending on our retry strategy).
The only other solution I can think of is to always create the next index immediately and only switching the alias during rotation. That would minimize the chance of not having an index. But this probably requires changes all over the place, because that index should probably not be used in retention and search (and archiving)

Given all that, my vote would be to take the risk for the beta now and to throw more testing against it to see how it behaves under load.

Copy link
Member

@bernd bernd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@bernd bernd merged commit ac04fad into master Jun 23, 2017
@bernd bernd deleted the elasticsearch-reduce-http-roundtrips branch June 23, 2017 11:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants