Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Don't check ES cluster health when flushing messages #3927
Checking the ES cluster health before flushing messages was important while using the embedded Elasticsearch client node because it would block processing until the cluster is available and healthy (YELLOW or GREEN).
After migrating to an HTTP-based Elasticsearch client, this isn't necessary anymore. The client will simply fail to index the messages without blocking.
Additionally, this changeset only marks those messages as committed, which could be successfully indexed. Before this change, all messages of a batch were marked as committed and removed from the journal, whether they've been indexed or not.
Jun 21, 2017
I thought about and tested some failure scenarios with this PR.
Previously we checked if the write index alias exists before we wrote the messages into Elasticsearch. With this PR, we don't do that anymore because it's expensive. When I remove the
This breaks the system pretty bad. I am wondering if this is an issue in real life or if we can live with that. Not sure if switching aliases in Elasticsearch is an atomic operation of if this might happen during index rotation. We could extend our documentation and tell users to disable automatic index creation in Elasticsearch. Not sure if that breaks anything on our side, though.
I also thought about retrying bulk requests. Currently we are retrying a bulk request if a
Also we are not checking if the HTTP request succeeded via
Regarding the retry, I think it would be nice to get some log messages when retrying the build request. Like we do in the journalling message handler: https://github.com/Graylog2/graylog2-server/blob/master/graylog2-server/src/main/java/org/graylog2/shared/buffers/JournallingMessageHandler.java#L50-L56
@bernd We used to recommand disabling automatic index creation for that reason, but I'm not sure how long ago that was.
Disabling automatic index creation and failing to have the deflector alias will very likely lose messages (depending on our retry strategy).
Given all that, my vote would be to take the risk for the beta now and to throw more testing against it to see how it behaves under load.