New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make messages with identical timestamps sortable by ULID #6711
Conversation
/rebase |
27d71fd
to
6d52c07
Compare
6d52c07
to
ae39533
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments:
graylog2-server/src/main/java/org/graylog2/shared/buffers/processors/MessageULIDGenerator.java
Outdated
Show resolved
Hide resolved
graylog2-server/src/main/java/org/graylog2/shared/buffers/processors/MessageULIDGenerator.java
Outdated
Show resolved
Hide resolved
graylog2-server/src/main/resources/org/graylog2/plugin/journal/raw_message.proto
Show resolved
Hide resolved
graylog2-server/src/main/resources/org/graylog2/plugin/journal/raw_message.proto
Show resolved
Hide resolved
graylog2-server/src/main/java/org/graylog2/shared/buffers/processors/MessageULIDGenerator.java
Outdated
Show resolved
Hide resolved
graylog2-server/src/main/java/org/graylog2/plugin/inputs/MessageInput.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. I've updated the PR
Checking for an existing "gl2_message_id" and using the message timestamp is done in #7290. |
7b3e148
to
751dec0
Compare
@mpfz0r Does this implementation still work when a non-local message journal implementation (e.g., Kafka) is used? This just crossed my mind. |
It does. I had the same thought at first, but only my first iteration depended on the journal 😅 |
graylog2-server/src/main/java/org/graylog2/shared/buffers/processors/MessageULIDGenerator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Tested with syslog input - sorting by gl2_message_id
works as expected
The ULID format is composed of a 48 bit timestamp followed by 80 bits of randomness. Use the first 16 bits of the random field to embed a sequence number for each message. If a batch of messages was received with identical timestamps (the same millisecond), the original receive order is kept by the encoded sequence number which directly follows the timestamp.
Introduce a seqence number on each input which gets incremented and embedded in every received message. This allows sorting to work without depending on a KafkaJournal. Regenerate the protobuf class JournalMessages so it can pass the sequence number from a RawMessage to a Message. Don't overwrite already existing GL2_MESSAGE_IDs. They might already be set if we are receiving messages from a Graylog forwarder.
Every message should have a gl2_message_id now
This enables us to sort by gl2_message_id, even if the field does not exist. This might be the case with older indices, restored from archives.
This test should not be using the default index template
Multiple forwarders can run the same input, that's why we need to differentiate them in the cache.
This allows us to sort on older indices that might not have that field yet. https://www.elastic.co/guide/en/elasticsearch/reference/7.17/sort-search-results.html#_ignoring_unmapped_fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice improvement! 👍
graylog2-server/src/main/java/org/graylog2/shared/buffers/processors/MessageULIDGenerator.java
Outdated
Show resolved
Hide resolved
...ensearch2/src/main/java/org/graylog/storage/opensearch2/views/searchtypes/OSMessageList.java
Outdated
Show resolved
Hide resolved
Also add some tests
This is simpler and probably leads to less log messages.
* Make messages with identical timestamps sortable by ULID The ULID format is composed of a 48 bit timestamp followed by 80 bits of randomness. Use the first 16 bits of the random field to embed a sequence number for each message. If a batch of messages was received with identical timestamps (the same millisecond), the original receive order is kept by the encoded sequence number which directly follows the timestamp. * Add license * Use a MessageInput sequenceNr instead of journalOffset Introduce a seqence number on each input which gets incremented and embedded in every received message. This allows sorting to work without depending on a KafkaJournal. Regenerate the protobuf class JournalMessages so it can pass the sequence number from a RawMessage to a Message. Don't overwrite already existing GL2_MESSAGE_IDs. They might already be set if we are receiving messages from a Graylog forwarder. * final some variables * Fix review comments * Use protoc 2.5.0 instead of 3.0.0 * Handle sequenceNr wrap and exceptions * update license * better log messages; bump offset gap * increase cache size * Cleanup and fix test * Fix some comments and add changelog * Bump OFFSET_GAP to 5000 and rename a few constants With an OFFSET_GAP of 5000, I couldn't reproduce negative sequence numbers anymore. It's a tradeoff, but ordering only 60535 messages for the same timestamp is reasonable. * improve java doc * Always add gl2_message_id as a second sort order, if sorting by timestamp is requested. * Don't assume that sort list is mutable * Fix BackendStartupIT test Every message should have a gl2_message_id now * Clarify when gl2_message_id might not be empty * Add a index mapping for gl2_message_id This enables us to sort by gl2_message_id, even if the field does not exist. This might be the case with older indices, restored from archives. * Fix IndexMappingTest * fix FieldTypePollerIT * Fix FieldAliasForEvents IT This test should not be using the default index template * Handle Messages with sequenceNr that are received from Forwarders * Include the nodeId into the sequenceNr lookup cache Multiple forwarders can run the same input, that's why we need to differentiate them in the cache. * Provide unmapped_type for gl2_message_id sort This allows us to sort on older indices that might not have that field yet. https://www.elastic.co/guide/en/elasticsearch/reference/7.17/sort-search-results.html#_ignoring_unmapped_fields * Only add gl2_message_id sort if not already present Also add some tests * Simply reset the sequenceNrCache in case we exceed the ULID limit This is simpler and probably leads to less log messages. * Improve log message
The ULID format in the
gl2_message_id
field is composed of a 48 bit timestamp followed by 80 bitsof randomness.
Use the first 16 bits of the random field to embed a sequence number
for each message.
If a batch of messages was received with identical timestamps
(the same millisecond), the original receive order is kept by the
encoded sequence number which directly follows the timestamp.
This allows us to sort messages by
gl2_message_id
which should have the correct originalorder in most cases.
CAVEATS
This is a best effort approach to a complicated problem. It's not a silver bullet.
Here are reasons why the
gl2_message_id
sort order might not always be correct:The sequence number is generated per node and input.
This means that sorting will not work if an input is load balanced over multiple nodes.
There is only space for
60535
messages with the same timestamp and input.Also, there is a small chance that, if too many batches of messages with the same timestamp and input get processed
in parallel, the sort order might be wrong.
Performance Impact
Running a benchmark, which ingests 8 million messages.
Four parallel curl loops send batches of
5000
messages with identical timestamps.Without this change, this takes
3m 19s
With this change:
3m 25s
Which means approx 40k msg/sec and no measurable performance impact.
Fixes #2741