
Long-running instances go into GC death because too many threads are created #800

Closed
bjoernhaeuser opened this Issue Dec 6, 2014 · 1 comment

@bjoernhaeuser
Contributor

bjoernhaeuser commented Dec 6, 2014

Hey,

we are experiencing problems after we upgraded to 0.92.

Eventually the Java heap fills up and triggers full GCs, which make Graylog more or less unusable. I obtained a heap dump just before I had to restart the application.

The heap dump shows that most of the heap is occupied by the JMX MBeanServer and the MetricsRegistry.

[Screenshot: heap dump showing most memory held by the JMX MBeanServer and the MetricsRegistry, 2014-12-06 20:14:51]

After digging into the MetricsRegistry, it seems to me that thousands of threads are being created.
[Screenshot: MetricsRegistry contents showing thousands of per-thread entries, 2014-12-06 20:16:33]

We have a 1.2 GB heap and are using G1. This is the graylog2.conf: https://gist.github.com/bjoernhaeuser/90aad72b5dfcdece9016

Currently we have to restart Graylog every hour. Before that happens, Graylog sometimes also stops stream processing and we have to restart all the streams by hand.

Thanks
Björn

@kroepke kroepke added the bug label Dec 19, 2014

@kroepke kroepke added this to the 0.92 milestone Dec 19, 2014

@kroepke


Member

kroepke commented Dec 19, 2014

This problem is related to our use of InstrumentedThreadExecutor, specifically when used by the stream router and the REST API service.
Both places use cached thread pools, either for time-limited calls (in the stream router) or for serving REST API requests.
However, both use the default pool configuration of core size 0, an unbounded number of threads and a 60 second idle time, leading to many threads being created over the lifetime of these pools.
See https://gist.github.com/bjoernhaeuser/970b6715891bb7b6c753 for an example.

This in turn leads to individual metrics for each of those threads, which only consume memory and over time lead to OOM exceptions and long GC pauses.
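
To make the failure mode concrete, here is a minimal sketch (not Graylog's actual code; the class, pool parameters and metric names are illustrative) of how a core-0, unbounded, 60-second-idle pool combined with per-thread metric names grows the MetricRegistry without bound:

```java
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: a cached pool (core 0, unbounded max, 60s idle) spawns a
// fresh thread whenever all existing ones are busy or have timed out. If metrics are
// keyed by the thread name, every new thread adds new registry entries forever.
public class PerThreadMetricsLeak {
    public static void main(String[] args) throws Exception {
        MetricRegistry registry = new MetricRegistry();
        AtomicLong counter = new AtomicLong();

        ThreadFactory factory = runnable ->
                // Thread names are unique, so any metric derived from them is unique too.
                new Thread(runnable, "stream-router-" + counter.incrementAndGet());

        ExecutorService pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE,               // core 0, unbounded maximum
                60L, TimeUnit.SECONDS,              // threads die after 60s idle
                new SynchronousQueue<>(), factory); // hand-off queue, like newCachedThreadPool()

        for (int i = 0; i < 1_000; i++) {
            pool.submit(() ->
                // One meter per thread name -> the registry grows with every new thread.
                registry.meter("threads." + Thread.currentThread().getName() + ".runs").mark());
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("metrics registered: " + registry.getMetrics().size());
    }
}
```

Every burst of work spawns fresh threads with fresh names, and each name leaves a metric behind even after the thread has died.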

We need to a) limit the number of threads being created and b) prevent too many metrics from being created.
Do we really need to instrument each thread here? For 0.92 my vote would be to remove the instrumentation only; for master I would like to configure the pools properly so they don't create new threads all the time (preferably also fix the thread count and use a queue for requests). Then instrumentation will not be a problem because the thread names are stable.
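
A minimal sketch of that direction, assuming Guava's ThreadFactoryBuilder and metrics-core's InstrumentedExecutorService are available; the pool size, queue capacity and metric name are placeholders, not values Graylog actually uses:

```java
import com.codahale.metrics.InstrumentedExecutorService;
import com.codahale.metrics.MetricRegistry;
import com.google.common.util.concurrent.ThreadFactoryBuilder;
import java.util.concurrent.*;

// Illustrative sketch: a fixed-size pool with a bounded queue and a stable name format,
// instrumented once at the pool level instead of per thread.
public class StableStreamRouterPool {
    public static ExecutorService create(MetricRegistry registry) {
        ThreadFactory factory = new ThreadFactoryBuilder()
                .setNameFormat("stream-router-%d")    // stable, bounded set of thread names
                .setDaemon(true)
                .build();

        ExecutorService pool = new ThreadPoolExecutor(
                8, 8,                                 // fixed number of threads (placeholder)
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(1_000),     // requests queue up instead of spawning threads
                factory);

        // Pool-level metrics (submitted/running/completed), not one set per thread.
        return new InstrumentedExecutorService(pool, registry, "stream-router");
    }
}
```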

Please also review other usages of InstrumentedThreadExecutor; I suspect we have more problems elsewhere that are hard to spot from the thread dumps.

Last but not least: can we somehow check that we don't create an endless number of metrics, e.g. by regularly checking the count?
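
One possible shape for such a check, as a hedged sketch (the threshold and the reporting mechanism are arbitrary placeholders):

```java
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: periodically count the registered metrics and complain if the
// number keeps growing past an expected bound.
public class MetricCountWatchdog {
    private static final int THRESHOLD = 10_000; // placeholder value

    public static void start(MetricRegistry registry) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            int count = registry.getMetrics().size();
            if (count > THRESHOLD) {
                System.err.println("Suspiciously many metrics registered: " + count);
            }
        }, 1, 1, TimeUnit.MINUTES);
    }
}
```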

joschi referenced this issue Dec 23, 2014

Remove instrumentation from ExecutorService in StreamRouter
The ExecutorService is only used for the TimeLimiter in StreamRouter, and the acquired
metrics are basically useless (or derivatives of other metrics). Removing the instrumentation
from this particular ExecutorService prevents the instantiation of a large number of
InstrumentedRunnables, which in turn helps to take some pressure off the GC.

joschi referenced this issue Dec 23, 2014

Remove InstrumentedThreadFactory instances completely
The interesting metrics (submitted/running/completed jobs) are provided by
InstrumentedExecutorService anyway. Especially for short-lived threads, the instrumentation
doesn't add value but increases the memory consumption of our MetricRegistry.
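
Taken together, the two referenced changes amount to something like the following sketch (illustrative only, not the actual Graylog diff): instrumentation stays at the pool level, while the per-thread wrapper is dropped.

```java
import com.codahale.metrics.InstrumentedExecutorService;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch of the direction of the referenced commits: drop the per-thread
// InstrumentedThreadFactory wrapper, keep the pool-level InstrumentedExecutorService
// so the submitted/running/completed metrics remain available.
public class ExecutorInstrumentation {
    public static ExecutorService restApiExecutor(MetricRegistry registry) {
        // Before (roughly): a pool built with an InstrumentedThreadFactory per thread.
        // After: plain thread factory, instrumentation only at the ExecutorService level.
        ExecutorService pool = Executors.newCachedThreadPool();
        return new InstrumentedExecutorService(pool, registry, "rest-api-executor");
    }
}
```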

@joschi joschi closed this Dec 23, 2014
