Long-running instances suffer GC death because too many threads are created #800
We are experiencing problems after upgrading to 0.92.
Eventually the Java heap fills up and triggers full GCs, which make Graylog more or less unusable. I obtained a heap dump just before I had to restart the application.
The heap dump shows that most of the heap is consumed by the JMX MBeanServer and the MetricsRegistry.
We have a 1.2 GB heap and use G1. This is our graylog2.conf: https://gist.github.com/bjoernhaeuser/90aad72b5dfcdece9016
Currently we have to restart Graylog every hour. Before that, Graylog sometimes also stops processing streams and we have to restart all of them by hand.
This problem is related to our use of InstrumentedThreadExecutor, specifically when it is used by the stream router and the REST API service. Both create large numbers of short-lived threads.
This in turn leads to individual metrics for each of those threads; these metrics only consume memory, and over time they lead to OOM exceptions and long GC pauses.
We need to a) limit the number of threads being created and b) prevent too many metrics from being created.
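To illustrate the failure mode described above, here is a minimal sketch. It is not Graylog's actual code: the metric registry is stood in by a plain `ConcurrentHashMap`, and the metric naming scheme (one metric per thread name) is an assumption made for illustration. It contrasts an unbounded thread-per-task pattern, where every thread leaves behind a uniquely named metric that is never removed, with a fixed pool whose stable thread names cap the metric count:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

public class MetricLeakDemo {
    // Stand-in for a MetricRegistry: metric name -> counter.
    static final ConcurrentHashMap<String, LongAdder> registry = new ConcurrentHashMap<>();

    // Anti-pattern: registers a metric keyed by the thread's name, never removed.
    static Runnable instrumented(Runnable task) {
        return () -> {
            String metricName = "executor." + Thread.currentThread().getName() + ".runs";
            registry.computeIfAbsent(metricName, k -> new LongAdder()).increment();
            task.run();
        };
    }

    public static void main(String[] args) throws Exception {
        // Unbounded: a fresh thread (and thus a fresh metric) per task.
        for (int i = 0; i < 100; i++) {
            Thread t = new Thread(instrumented(() -> {}), "worker-" + i);
            t.start();
            t.join();
        }
        System.out.println("metrics after unbounded threads: " + registry.size());

        registry.clear();
        // Bounded pool: thread names repeat, so the metric count stays capped.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 100; i++) {
            pool.submit(instrumented(() -> {})).get();
        }
        pool.shutdown();
        System.out.println("metrics with fixed pool: " + registry.size());
    }
}
```

With the unbounded variant the registry ends up with one entry per task (100 here), while the fixed pool never exceeds its pool size (4), which is essentially what points a) and b) above ask for.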
Please also review other usages of InstrumentedThreadExecutor; I suspect we have more problems elsewhere that are hard to spot from the thread dumps.
Last but not least: can we somehow ensure that we don't create an endless number of metrics, e.g. by regularly checking the count?
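One way to do the "regularly checking the number" idea is a small watchdog that periodically compares the registry size against a threshold. This is a sketch, not existing Graylog code: the threshold of 10,000 is arbitrary, and in real code the count would come from `MetricRegistry.getMetrics().size()` rather than the stand-in map used here:

```java
import java.util.concurrent.*;

public class MetricCountWatchdog {
    // Returns false and logs a warning when the metric count exceeds the limit.
    public static boolean check(int metricCount, int limit) {
        if (metricCount > limit) {
            System.out.println("WARN: metric count " + metricCount + " exceeds limit " + limit);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the metric registry; real code would query MetricRegistry.
        ConcurrentHashMap<String, Object> registry = new ConcurrentHashMap<>();

        // Periodic check, e.g. once a minute.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> check(registry.size(), 10_000), 1, 1, TimeUnit.MINUTES);

        // Direct demonstration of the check itself:
        System.out.println(check(500, 10_000));
        System.out.println(check(20_000, 10_000));
        scheduler.shutdownNow();
    }
}
```

Tripping the threshold would at least turn the slow leak into a visible warning long before the heap fills up.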
The ExecutorService is only used for the TimeLimiter in StreamRouter, and the metrics it acquires are basically useless (or derivatives of other metrics). Removing the instrumentation from this particular ExecutorService prevents the instantiation of a large number of InstrumentedRunnables, which in turn takes some pressure off the GC.
The interesting metrics (submitted/running/completed jobs) are provided by InstrumentedExecutorService anyway. Especially for short-lived threads the per-runnable instrumentation adds no value but increases the memory consumption of our MetricRegistry.
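To make the distinction concrete, here is a simplified stand-in for that coarse, executor-level instrumentation: a handful of counters shared by all tasks (submitted/running/completed, mirroring what InstrumentedExecutorService reports), with nothing allocated or registered per runnable or per thread. The class and field names are illustrative, not Graylog's:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

// Executor-level metrics only: a fixed set of counters regardless of
// how many tasks or threads pass through the executor.
public class CoarseInstrumentedExecutor {
    final ExecutorService delegate;
    final LongAdder submitted = new LongAdder();
    final AtomicInteger running = new AtomicInteger();
    final LongAdder completed = new LongAdder();

    CoarseInstrumentedExecutor(ExecutorService delegate) {
        this.delegate = delegate;
    }

    Future<?> submit(Runnable task) {
        submitted.increment();
        return delegate.submit(() -> {
            running.incrementAndGet();
            try {
                task.run();
            } finally {
                running.decrementAndGet();
                completed.increment();
            }
        });
    }

    public static void main(String[] args) throws Exception {
        CoarseInstrumentedExecutor exec =
                new CoarseInstrumentedExecutor(Executors.newFixedThreadPool(2));
        for (int i = 0; i < 5; i++) {
            exec.submit(() -> {}).get();
        }
        exec.delegate.shutdown();
        System.out.println("submitted=" + exec.submitted.sum()
                + " completed=" + exec.completed.sum());
    }
}
```

However many short-lived tasks run, the memory cost of this instrumentation stays constant, which is the property the per-runnable approach lacked.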