
Possible memory leak in GelfChunkAggregator #1462

Closed
lightpriest opened this Issue Oct 3, 2015 · 13 comments

@lightpriest

lightpriest commented Oct 3, 2015

TL;DR: GelfChunkAggregator.ChunkEntry has an inconsistent compareTo override that can mislead the sortedEvictionSet (a ConcurrentSkipListSet). Under high volume, this causes a memory leak. See the patch at the bottom for a suggested fix.


Using 1.2.1 with GELF UDP, we saw the memory consumption of graylog-server grow until the process spent more time doing GC than actual work. Investigating a heap dump showed that the ..inputs.codecs.GelfChunkAggregator class was accumulating ChunkEntry objects.

I did some digging in GelfChunkAggregator, trying to find out why this happens. I noticed that at some point sortedEvictionSet and chunks go out of sync, so the ChunkEvictionTask gets stuck on the sortedEvictionSet.first() entry and tries to evict it over and over.

What I saw was this (a simplified sketch in code follows the numbered list):

  1. ChunkEvictionTask called.
  2. First entry in sortedEvictionSet is selected and the ID is passed to expireEntry.
  3. expireEntry calls getAndCleanupEntry with that ID.
  4. getAndCleanupEntry calls chunks.remove(ID), which returns null.
  5. sortedEvictionSet.remove(null) throws an NPE, leaving the sortedEvictionSet as it is.
  6. ChunkEvictionTask catches/ignores the exception, and the process begins all over again.
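
To make the control flow concrete, here is a simplified, self-contained sketch of that path. The names chunks, sortedEvictionSet, ChunkEvictionTask, expireEntry and getAndCleanupEntry mirror GelfChunkAggregator, but the bodies are paraphrased for illustration and are not the actual implementation:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ConcurrentSkipListSet;

    // Paraphrased sketch of the eviction path, not the real implementation.
    // ChunkEntry is reduced to the two fields that matter here.
    class EvictionPathSketch {
        static final class ChunkEntry implements Comparable<ChunkEntry> {
            final String id;
            final long firstTimestamp;

            ChunkEntry(String id, long firstTimestamp) {
                this.id = id;
                this.firstTimestamp = firstTimestamp;
            }

            @Override
            public int compareTo(ChunkEntry o) {
                if (equals(o)) {
                    return 0;
                }
                // The buggy ordering discussed below: distinct entries with
                // equal timestamps both claim to be "greater".
                return firstTimestamp < o.firstTimestamp ? -1 : 1;
            }
        }

        private final ConcurrentMap<String, ChunkEntry> chunks = new ConcurrentHashMap<>();
        private final ConcurrentSkipListSet<ChunkEntry> sortedEvictionSet = new ConcurrentSkipListSet<>();

        // Steps 1-2 and 6.
        void chunkEvictionTaskRun() {
            try {
                final ChunkEntry entry = sortedEvictionSet.first(); // step 2
                expireEntry(entry.id);
            } catch (Exception e) {
                // step 6: the NPE from step 5 lands here and is ignored,
                // so the next run picks the very same first() entry again
            }
        }

        private void expireEntry(String id) {
            final ChunkEntry entry = getAndCleanupEntry(id); // step 3
            // ... expired-message accounting omitted
        }

        private ChunkEntry getAndCleanupEntry(String id) {
            final ChunkEntry entry = chunks.remove(id); // step 4: null on retry
            sortedEvictionSet.remove(entry);            // step 5: NPE when null
            return entry;
        }
    }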

Investigating further revealed that, for the given ID, the ChunkEvictionTask does succeed at chunks.remove(ID) the first time, returning an entry, but the subsequent sortedEvictionSet.remove(entry) fails (it just returns false because no element was removed). This was interesting: the eviction starts from the first element of the sortedEvictionSet, so how can the set fail to find it?

Some info about our usage: our application servers push UDP packets at a rate of ~7K/s per graylog-server. At this rate, some entries are bound to have the same timestamp (in millis). These are sorted into the sortedEvictionSet (by ChunkEntry.compareTo).

What I found was that, sometimes, when two different entries share the same timestamp, the ConcurrentSkipListSet implementation misses one of them when remove is called. This is most likely due to how the skip list is traversed: the search narrows down the range of candidate elements and, guided by the inconsistent ordering, skips past the element it should have found.

For example, if the eviction set looks like this (id:timestamp): [ a:1 | b:1 | .. | c:5 | d:7 ], looking up a:1 can "lead" the algorithm into the range between b:1 and c:5, essentially making the call sortedEvictionSet.remove(sortedEvictionSet.first()) fail when it shouldn't. This is reproducible with a small data set as well, although not consistently.
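
Here is a standalone demo of that miss, using a minimal Entry stand-in for ChunkEntry with the same buggy compareTo. With many entries sharing one timestamp, the skip list almost certainly builds index levels that steer the search past the element being removed:

    import java.util.concurrent.ConcurrentSkipListSet;

    // Demo: with an inconsistent compareTo (distinct entries sharing a
    // timestamp each claim to be greater than the other), a
    // ConcurrentSkipListSet can fail to remove its own first() element.
    public class SkipListRemoveDemo {
        static final class Entry implements Comparable<Entry> {
            final String id;
            final long firstTimestamp;

            Entry(String id, long firstTimestamp) {
                this.id = id;
                this.firstTimestamp = firstTimestamp;
            }

            @Override
            public int compareTo(Entry o) {
                if (equals(o)) {
                    return 0;
                }
                // Bug: for distinct entries with equal timestamps this is
                // always 1, so a > b and b > a at the same time.
                return firstTimestamp < o.firstTimestamp ? -1 : 1;
            }
        }

        public static void main(String[] args) {
            final ConcurrentSkipListSet<Entry> set = new ConcurrentSkipListSet<>();
            for (int i = 0; i < 10_000; i++) {
                set.add(new Entry("id-" + i, 1L)); // all share timestamp 1
            }
            final Entry first = set.first();
            // Almost always prints "removed: false": the element is in the
            // set, but the comparison-guided search cannot locate it.
            System.out.println("removed: " + set.remove(first));
        }
    }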

I wasn't sure how to address this without changing or breaking other things, so I added another level of sorting to compareTo: if the timestamps are equal, it compares the message IDs. After this addition, the problem went away.

diff --git a/graylog2-inputs/src/main/java/org/graylog2/inputs/codecs/GelfChunkAggregator.java b/graylog2-inputs/src/main/java/org/graylog2/inputs/codecs/GelfChunkAggregator.java
index 085e4ff..0c9a29d 100644
--- a/graylog2-inputs/src/main/java/org/graylog2/inputs/codecs/GelfChunkAggregator.java
+++ b/graylog2-inputs/src/main/java/org/graylog2/inputs/codecs/GelfChunkAggregator.java
@@ -238,6 +238,9 @@ public class GelfChunkAggregator implements CodecAggregator {
             if (equals(o)) {
                 return 0;
             }
+            if (firstTimestamp == o.firstTimestamp) {
+                return id.compareTo(o.id);
+            }
             return firstTimestamp < o.firstTimestamp ? -1 : 1;
         }
     }
@onyxmaster

onyxmaster commented Oct 5, 2015

"Me too".
We installed Graylog, configured an ES cluster (on separate machines), and started shipping logs from our backends in chunked GELF format via UDP. We found that the Graylog server "hangs" after some time (18-26 hours), dropping two thirds of the messages and becoming completely unresponsive.
A restart helps, but the problem manifests itself again after about the same amount of time.
We did some research, and it appears there is a problem with the GELF UDP chunk aggregator that leads to a memory leak; the leak overloads the GC, producing collection pauses long enough to block the Graylog server process itself. To verify the hypothesis we moved from UDP to TCP and have had no such problem since (for a week now).

I have some memory dumps (.hprof) if you need them; the leak is pretty obvious.

BTW, ignoring NPEs in ChunkEvictionTask looks like questionable practice; this wouldn't have slipped through testing if the NPEs were logged.

@onyxmaster

onyxmaster commented Oct 5, 2015

One of the dumps:

GelfChunkAggregator@0x64001b868
chunkCounter: 434453491
chunks: 7323
waitingMessages: 7323
expiredChunks: 255
expiredMessages: 87
completeMessages: 154359656

So that's ~7K messages waiting, which is normal; unfortunately, due to the issue lightpriest described, they root about 7 GB of heap space.

P.S. We moved to TCP anyway, because adjusting datagram sizes to match MTUs is no fun on a diverse LAN, and sticking with "safe" 576-byte datagrams causes excessive overhead while also limiting the total payload size to very low values.

@joschi

Contributor

joschi commented Oct 5, 2015

@onyxmaster It would be great if you could post those heap dumps somewhere (maybe https://gist.github.com/) or send them to jochen@graylog.com.

@onyxmaster

onyxmaster commented Oct 5, 2015

JVM: Oracle 1.8u60 (was OpenJDK 1.7, no actual difference)
graylog-server: Ubuntu 14.04 w/3.19 kernel, dedicated physical machine with 1x 4-core 8-thread 3.3GHz CPU, 16 GB RAM, modern SSD (60K read/30K write 4K IOPS).
ES shards (3x): Ubuntu 14.04 w/3.19 kernel, physical machines with 2x 4-core 8-thread 2.4GHz CPU, 144 GB RAM, dedicated NL-SAS disks (150 read/150 write 4K IOPS).

We're shipping traces from multiple backends. They amount to up to 15K msg/sec, usually 7-8K msg/sec, with 3-5 second spikes of 45K msg/sec.

At first, when we encountered the problem, we ruled out the most common "wrong configuration" suspects:

  • Changed OpenJDK to Oracle Java
  • Ensured that ES isn't overloaded -- CPU is 60% idle aggregated, per-core LA is ~0.35, RAM is available (~100 GB used for cache), disk time for ES arrays is below 20%
  • Tried increasing the JVM heap limit to 12GB

Then we started digging. It was obvious that ES had nothing to do with this, because it was genuinely underloaded.
Also, the problem occurrences did not correlate with increased load; they occurred after the server had been running for some time, both in the evening hours (traffic at 250% of average) and at night (traffic at 20% of average).
While checking the logs we found the following messages, which were absent when the server was operating normally:

11:31:07.428+03:00 INFO  [jvm] [graylog2-server] [gc][old][150493][4613] duration [5.4s], collections [1]/[6.6s], total [5.4s]/[42.9m], memory [7.3gb]->[6.1gb]/[11.4gb], all_pools {[young] [1.1gb]->[164.8mb]/[4.8gb]}{[survivor] [215.4mb]->[0b]/[614.3mb]}{[old] [5.9gb]->[5.9gb]/[6gb]}
11:31:17.539+03:00 INFO  [jvm] [graylog2-server] [gc][old][150496][4614] duration [7.1s], collections [1]/[7.4s], total [7.1s]/[43m], memory [11.2gb]->[6.2gb]/[11.4gb], all_pools {[young] [4.8gb]->[234.4mb]/[4.8gb]}{[survivor] [502mb]->[0b]/[614.3mb]}{[old] [5.9gb]->[5.9gb]/[6gb]}
11:31:24.966+03:00 INFO  [jvm] [graylog2-server] [gc][old][150498][4615] duration [5.4s], collections [1]/[6.4s], total [5.4s]/[43.1m], memory [10.4gb]->[6.3gb]/[11.4gb], all_pools {[young] [4.4gb]->[329.5mb]/[4.8gb]}{[survivor] [0b]->[0b]/[614.3mb]}{[old] [5.9gb]->[5.9gb]/[6gb]}
11:31:33.675+03:00 INFO  [jvm] [graylog2-server] [gc][old][150500][4616] duration [6.7s], collections [1]/[7.7s], total [6.7s]/[43.2m], memory [10.2gb]->[6.2gb]/[11.4gb], all_pools {[young] [4.2gb]->[230.3mb]/[4.8gb]}{[survivor] [0b]->[0b]/[614.3mb]}{[old] [5.9gb]->[5.9gb]/[6gb]}
11:31:40.959+03:00 INFO  [jvm] [graylog2-server] [gc][old][150502][4617] duration [5.3s], collections [1]/[6.2s], total [5.3s]/[43.3m], memory [10.5gb]->[6.2gb]/[11.4gb], all_pools {[young] [4.5gb]->[259.8mb]/[4.8gb]}{[survivor] [0b]->[0b]/[614.3mb]}{[old] [5.9gb]->[5.9gb]/[6gb]}
11:31:49.639+03:00 INFO  [jvm] [graylog2-server] [gc][old][150505][4618] duration [6.5s], collections [1]/[6.6s], total [6.5s]/[43.4m], memory [11.3gb]->[6.1gb]/[11.4gb], all_pools {[young] [4.8gb]->[161.1mb]/[4.8gb]}{[survivor] [556.8mb]->[0b]/[614.3mb]}{[old] [5.9gb]->[6gb]/[6gb]}

This led us to the idea that there is some kind of memory leak, but lightpriest described it much better than I did (I only analyzed the heap and found the NPEs and the damaged chunk-set structure, while lightpriest found the root cause of the "missing" chunks).

@lightpriest

lightpriest commented Oct 5, 2015

Thanks @onyxmaster. To be a bit more specific than in my first post: the root of the problem is the compareTo method, which is inconsistent (it violates the antisymmetry requirement of the Comparable contract).

Say you have the following two entries:

    ChunkEntry entry1 = new ChunkEntry([ id: foobar, timestamp: 150, .. ])
    ChunkEntry entry2 = new ChunkEntry([ id: abcdef, timestamp: 150, .. ])

In this case, the results of the two comparisons would be inconsistent:

   entry1.compareTo(entry2); // = 1, meaning entry1 > entry2
   entry2.compareTo(entry1); // = 1, meaning entry2 > entry1

They cannot both be larger than the other. This would confuse just about any sorting algorithm (depending on the implementation, probably, but you get the point). Initially I thought that because our environment wasn't stable (dropping UDP packets here and there), we would have a large number of evictable chunks, which would increase the chance of the above having any effect. But now I realise that since the ordering itself is broken, we would have seen the problem anyway (assuming more than ~1000 messages per second, so that millisecond timestamps collide).
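
A minimal check makes the fix visible. Note that the two-argument constructor below is hypothetical, purely for illustration (the real ChunkEntry is built from a GELF chunk):

    import java.util.concurrent.ConcurrentSkipListSet;

    // Counterpart to the buggy behaviour: with the patched ordering
    // (timestamp first, then ID as tie-breaker) the comparisons are
    // antisymmetric and the skip list can always remove its first() element.
    public class FixedCompareToDemo {
        static final class FixedEntry implements Comparable<FixedEntry> {
            final String id;
            final long firstTimestamp;

            FixedEntry(String id, long firstTimestamp) { // hypothetical ctor
                this.id = id;
                this.firstTimestamp = firstTimestamp;
            }

            @Override
            public int compareTo(FixedEntry o) {
                if (equals(o)) {
                    return 0;
                }
                if (firstTimestamp == o.firstTimestamp) {
                    return id.compareTo(o.id); // the patch: break ties by ID
                }
                return firstTimestamp < o.firstTimestamp ? -1 : 1;
            }
        }

        public static void main(String[] args) {
            final FixedEntry entry1 = new FixedEntry("foobar", 150);
            final FixedEntry entry2 = new FixedEntry("abcdef", 150);
            System.out.println(entry1.compareTo(entry2)); // > 0: entry1 > entry2
            System.out.println(entry2.compareTo(entry1)); // < 0: consistent

            final ConcurrentSkipListSet<FixedEntry> set = new ConcurrentSkipListSet<>();
            for (int i = 0; i < 10_000; i++) {
                set.add(new FixedEntry("id-" + i, 1L));
            }
            System.out.println(set.remove(set.first())); // true
        }
    }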

As @onyxmaster pointed out, this class probably needs more refactoring than this fix. I just wanted to make as few changes as possible to make this problem go away.

@joschi joschi added this to the 1.3.0 milestone Oct 13, 2015

@joschi joschi added inputs bug labels Oct 13, 2015

@kroepke kroepke modified the milestones: 1.3.0, 1.2.2 Oct 13, 2015

@patricktjel

patricktjel commented Oct 14, 2015

+1.
We don't have a Graylog instance in production yet, but I am testing Graylog heavily right now. I started Graylog on a VM with 4 GB of RAM, 2 GB of which were free; after roughly 500,000 messages had been read, only about 100 MB of RAM was free. We then stopped sending anything to Graylog to give it some rest, but the free RAM never went back up to 1 GB, let alone 2 GB.

@bernd bernd self-assigned this Oct 15, 2015

@bernd

Member

bernd commented Oct 15, 2015

Thank you folks for the awesome reports! This will be in the upcoming 1.2.2.

bernd added a commit that referenced this issue Oct 15, 2015

Fix memory leak in GelfChunkAggregator
The memory leak showed up because of a problem in the
ChunkEntry#compareTo() method.

If two ChunkEntry objects had the same firstTimestamp value, the compareTo()
method returned a wrong result because only the timestamp has been taken
into account. This made sortedEvictionSet.remove(entry) not remove the
entry from the set in some cases.

The GelfChunkAggregator class maintains a chunks ConcurrentMap and a
sortedEvictionSet ConcurrentSkipListSet with ChunkEntry objects. The
ChunkEvictionTask gets the first entry from the sortedEvictionSet and
basically removes it from the chunks map and then tries to remove it
from the sortedEvictionSet as well.

The removal from the chunks map works, but the removal from the
sortedEvictionSet does not (due to the compareTo() bug).

During the next scheduled execution of ChunkEvictionTask#run(), the same
entry is returned from the sortedEvictionSet. This entry is not in the
chunks map anymore so chunks.remove(entry) returns null. The return
value is used to remove the entry from the sortedEvictionSet. This
throws a NullPointerException because the return value was null.
This NPE was ignored and not logged.

This change fixes the ChunkEntry#compareTo() method to compare the ID if
both timestamps are equal. We also now log any error that happens during a
ChunkEvictionTask#run() execution.

This issue was found, debugged and reported in great detail with a patch
to fix it by @lightpriest and @onyxmaster. Thank you!

See #1462 for details.

Fixes #1462
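
The error-logging part of that change amounts to a guard along these lines. This is only a sketch: the logger is assumed to be SLF4J, and evictExpiredChunks() is a hypothetical stand-in for the task's actual eviction work:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch of the described logging guard, not the actual commit.
    class ChunkEvictionTaskSketch implements Runnable {
        private static final Logger LOG =
                LoggerFactory.getLogger(ChunkEvictionTaskSketch.class);

        @Override
        public void run() {
            try {
                evictExpiredChunks(); // hypothetical helper
            } catch (Exception e) {
                // Previously this failure path was silent; logging it would
                // have surfaced the recurring NPE immediately.
                LOG.error("Error while expiring GELF chunk entries", e);
            }
        }

        private void evictExpiredChunks() {
            // hypothetical: remove expired entries from the chunks map and
            // the sortedEvictionSet, as GelfChunkAggregator does
        }
    }
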
@lightpriest

lightpriest commented Oct 15, 2015

👍

@onyxmaster

onyxmaster commented Oct 16, 2015

Great, thanks for fixing this!

@bernd

Member

bernd commented Oct 16, 2015

The PR is merged and will be in the upcoming 1.2.2 release.

@bernd bernd closed this Oct 16, 2015

bernd added a commit that referenced this issue Oct 16, 2015

Fix memory leak in GelfChunkAggregator

(cherry picked from commit e48f4a4)

bernd added a commit that referenced this issue Oct 16, 2015

Fix memory leak in GelfChunkAggregator

(cherry picked from commit e48f4a4)
@lennartkoopmann

Member

lennartkoopmann commented Oct 19, 2015

Do we have a planned release date for v1.2.2 yet?

@bernd

Member

bernd commented Oct 20, 2015

@lennartkoopmann We plan to release on Oct 26th or 27th.
