New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't check ES cluster health when flushing messages #3927
Conversation
@joschi The archive plugin needs to be adjusted for this. |
Checking the ES cluster health before flushing messages was important while using the embedded Elasticsearch client node because it would block processing until the cluster is available and healthy (YELLOW or GREEN). After migrating to an HTTP-based Elasticsearch client, this isn't necessary anymore. The client will simply fail to index the messages. Additionally, this changeset only marks those messages as committed, which could be successfully indexed. Before this change, all messages of a batch were marked as committed and removed from the journal, whether they've been indexed or not.
143404f
to
ea30563
Compare
I thought about and tested some failure scenarios with this PR. Previously we checked if the write index alias exists before we wrote the messages into Elasticsearch. With this PR, we don't do that anymore because it's expensive. When I remove the This breaks the system pretty bad. I am wondering if this is an issue in real life or if we can live with that. Not sure if switching aliases in Elasticsearch is an atomic operation of if this might happen during index rotation. We could extend our documentation and tell users to disable automatic index creation in Elasticsearch. Not sure if that breaks anything on our side, though. Opinions? I also thought about retrying bulk requests. Currently we are retrying a bulk request if a Also we are not checking if the HTTP request succeeded via Looking forward to your feedback. /cc @jochen @dennisoelkers @kroepke |
Regarding the retry, I think it would be nice to get some log messages when retrying the build request. Like we do in the journalling message handler: https://github.com/Graylog2/graylog2-server/blob/master/graylog2-server/src/main/java/org/graylog2/shared/buffers/JournallingMessageHandler.java#L50-L56 |
journal.markJournalOffsetCommitted(entry.getValue().getJournalOffset()); | ||
final Message message = entry.getValue(); | ||
if (!failedMessageIds.contains(message.getId())) { | ||
journal.markJournalOffsetCommitted(message.getJournalOffset()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not working the way it is supposed to. It just marks all messages up to and including the successfull message with the highest journal offset, including failed messages with lower journal offsets.
See this for the explanation what markJournalOffsetCommited
does.
@bernd We used to recommand disabling automatic index creation for that reason, but I'm not sure how long ago that was. Disabling automatic index creation and failing to have the deflector alias will very likely lose messages (depending on our retry strategy). Given all that, my vote would be to take the risk for the beta now and to throw more testing against it to see how it behaves under load. |
9389805
to
ea30563
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Checking the ES cluster health before flushing messages was important while using the embedded Elasticsearch client node because it would block processing until the cluster is available and healthy (YELLOW or GREEN).
After migrating to an HTTP-based Elasticsearch client, this isn't necessary anymore. The client will simply fail to index the messages without blocking.
Additionally, this changeset only marks those messages as committed, which could be successfully indexed. Before this change, all messages of a batch were marked as committed and removed from the journal, whether they've been indexed or not.