New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle exceptions in the JournallingMessageHandler. #1286

Merged
merged 4 commits into from Jul 3, 2015

Conversation

Projects
None yet
3 participants
@bernd
Member

bernd commented Jul 3, 2015

Before this, an exception from the journal would leave invalid
RawMessageEvent objects in the batch List because the Converter sets
all fields to null after transforming the objects to Journal.Entry
objects. The batch List was not cleared in this case.
With the next endOfBatch the Converter tried to transform the broken
objects again and threw a NPE because the fields are null.

This caused exceptions for every new event coming in and eventually
filled up the memory because the unbounded batch List was never cleared.

bernd added some commits Jul 3, 2015

Handle exceptions in the JournallingMessageHandler.
Before this, an exception from the journal would leave invalid
RawMessageEvent objects in the batch List because the Converter sets
all fields to null after transforming the objects to Journal.Entry
objects. The batch List was not cleared in this case.
With the next endOfBatch the Coverter tried to transform the broken
objects again and threw a NPE because the fields are null.

This caused exceptions for every new event coming in and eventually
filled up the memory because the unbounded batch List was never cleared.
Revert "make RawMessageEvent's fields volatile to guard against cross…
…-cpu visibility issues"

This did not fix anything so remove it to avoid performance degradation.

The fix for the issue we tried to solve got committed in f8d9391.

This reverts commit 769a2c1.

@bernd bernd added this to the 1.1.5 milestone Jul 3, 2015

journalFilled.release();
done = true;
} catch (Exception e) {
log.error("Unable to write to journal - retrying after 250ms", e);

This comment has been minimized.

@joschi

joschi Jul 3, 2015

Contributor

There should be an upper limit on how often we retry to write to the journal. Otherwise users will see ~4 error messages per second forever if something is wrong with the journal which will clog the Graylog log output.

This comment has been minimized.

@joschi

joschi Jul 3, 2015

Contributor

Maybe use an exponential backoff (https://github.com/rholder/guava-retrying#exponential-backoff) with an upper limit?

This comment has been minimized.

@bernd

bernd Jul 3, 2015

Member

Exponential back-off makes sense. We talked about an upper limit, but then we would have to decide what to do next. Now it's just blocking and message processing stops.

This needs human intervention. We should also think about generating a system message.

@joschi joschi self-assigned this Jul 3, 2015

lastOffset);
journalFilled.release();
// Remove all null values returned from the converter (might happen if the Converter throws an exception)
final List<Journal.Entry> entries = Lists.newArrayList(filter(transform(batch, converter), notNull()));

This comment has been minimized.

@joschi

joschi Jul 3, 2015

Contributor

We could save the allocation of the outer-most ArrayList if Journal#write() took a Collection<Journal.Entry> instead of List<Journal.Entry>. Since this code is run very often (for each message event in fact), this could make sense and it's a very small change.

This comment has been minimized.

@kroepke

kroepke Jul 3, 2015

Member

This is dangerous as the filter/transform are computed lazily and if something iterates twice, the converter will be called twice.
I've been hit by that before, it performed badly and can also break if the converter frees something (which it does).

This comment has been minimized.

@joschi

joschi Jul 3, 2015

Contributor

Maybe using Collection#removeAll() could make sense here instead of using Collections2.filter() to prevent additional copies.

final List<Journal.Entry> entries = Lists.newArrayList(transform(batch, converter));
entries.removeAll(Collections.singleton(null));

This comment has been minimized.

@bernd

bernd Jul 3, 2015

Member

Good point. Done.

@@ -58,14 +62,30 @@ public void onEvent(RawMessageEvent event, long sequence, boolean endOfBatch) th
final Converter converter = new Converter();

This comment has been minimized.

@joschi

joschi Jul 3, 2015

Contributor

This could/should be a static final class member to prevent unnecessary allocation on each message event.

This comment has been minimized.

@joschi

joschi Jul 3, 2015

Contributor

Converter#getBytesWritten() wouldn't work then, though.

Use a retryer with an exponential back-off to avoid spamming the logs.
This also avoids hand crafting of the retry logic and offer more options
for configuration.

@bernd bernd force-pushed the fix-journalling-message-handler-exceptions branch from ad84fd6 to 6623e80 Jul 3, 2015

@bernd

This comment has been minimized.

Member

bernd commented Jul 3, 2015

Updated after comments. Thank you!

@joschi

This comment has been minimized.

Contributor

joschi commented Jul 3, 2015

LGTM. 👍

joschi added a commit that referenced this pull request Jul 3, 2015

Merge pull request #1286 from Graylog2/fix-journalling-message-handle…
…r-exceptions

Handle exceptions in the JournallingMessageHandler.

@joschi joschi merged commit f93e8fa into master Jul 3, 2015

2 checks passed

ci Jenkins build graylog2-server-integration-pr 24 has succeeded
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@joschi joschi deleted the fix-journalling-message-handler-exceptions branch Jul 3, 2015

@joschi joschi added processing and removed ready-for-review labels Jul 3, 2015

bernd added a commit that referenced this pull request Jul 3, 2015

Handle exceptions in the JournallingMessageHandler.
Before this, an exception from the journal would leave invalid
RawMessageEvent objects in the batch List because the Converter sets
all fields to null after transforming the objects to Journal.Entry
objects. The batch List was not cleared in this case.
With the next endOfBatch the Coverter tried to transform the broken
objects again and threw a NPE because the fields are null.

This caused exceptions for every new event coming in and eventually
filled up the memory because the unbounded batch List was never cleared.

(cherry picked from commits f8d9391 6623e80 c1e45df)

Refs #1286

bernd added a commit that referenced this pull request Jul 3, 2015

Revert "make RawMessageEvent's fields volatile to guard against cross…
…-cpu visibility issues"

This did not fix anything so remove it to avoid performance degradation.

The fix for the issue we tried to solve got committed in f8d9391.

This reverts commit 769a2c1.

(cherry picked from commit ee223dd)

Refs #1286
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment