ServoRegistry memory leak #264
My guess is that the distinct set of ids being tracked is growing over time. An example of such a problem:

```java
import com.netflix.spectator.api.Registry;
import com.netflix.spectator.servo.ServoRegistry;

Registry r = new ServoRegistry();
for (int i = 0; ; ++i) {
  // Every iteration registers a counter with a new id, so the set of
  // ids tracked by the registry grows without bound.
  r.counter("" + i).increment();
}
```

If that is the case, then you can protect against that by setting a system property, e.g.:
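The property itself was lost from this comment. Assuming the usual spectator convention of `spectator.api.`-prefixed settings, capping the number of meters would look something like the following; the exact property name is an assumption, so verify it against the spectator docs for your version:

```
java -Dspectator.api.maxNumberOfMeters=10000 -jar app.jar
```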
Once the limit is hit it will just return noop implementations for new registrations and prevent further leaking. To debug what is creating the explosion you could just dump all of the ids and do some analysis, e.g.:

```java
r.stream().forEach(m -> System.out.println(m.id()));
```

If you still see a leak with that setting in place, then it would help if you could create a minimal example to reproduce and I'll take a look.

The set of ids should be stable for the life of a process; that is, it is meant to track things over time, not record unique events. This is especially true if they are being reported to downstream time series databases that need to store and manage the data long term. There is support for expiration due to inactivity to help protect downstream backends, but right now everything registered is kept in the local registry, as there were some use-cases that expected them to be the total count. Ideally the usage is fixed so that is not needed, but spectator itself should also get better at mitigating some of this over time.

The local expiration behavior will probably get changed along with some changes to try and manage automatic rollups that drop dimensions that are abusive, rather than dropping everything. Automatic rollups cover the more common examples of metrics explosion we see internally, and we have some examples of it working for specific use-cases. It still remains to be seen how well we can generalize it, though.
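Expanding on the one-liner above, a short sketch of how the dump could be analyzed: group the registered ids by metric name and print the largest groups, which usually points straight at the name or tag pattern that is exploding. This is illustrative only and assumes nothing beyond the spectator `Registry` already shown:

```java
import java.util.Map;
import java.util.stream.Collectors;

import com.netflix.spectator.api.Registry;

public final class IdAnalysis {
  // Group registered meter ids by metric name and print the 20 largest
  // groups; a runaway name or tag pattern usually stands out immediately.
  public static void dumpTopNames(Registry r) {
    Map<String, Long> countsByName = r.stream()
        .map(m -> m.id().name())
        .collect(Collectors.groupingBy(n -> n, Collectors.counting()));
    countsByName.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(20)
        .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
  }
}
```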
Looking again at the heap dump I can see millions of instances of

As an aside, I thought that if these stats represent a backlog then enabling atlas would stop the queue building up, so I started an atlas instance on

I also added

The stats submitter seems to be using ribbon (makes sense) to find atlas to send the stats, but to me it seems that the eureka/ribbon service-id based goodness is at odds with setting a raw uri in
Spring has its own observer to get around some Java version issues. I'll have to look at how it is set up. Normally, if we want to use eureka-based lookup, we just set the uri to something like:
I left my test system running overnight and again zuul and all the services went OOM, this time with the new property set. Java 8 (Oracle JRE) is started like this:

The heap dump shows about 900K instances of the

build.gradle

ZuulApplication.java

This is the only source code. There are no other Java sources in the project.

bootstrap.yml

application.yml
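The original command line did not survive extraction. For anyone trying to reproduce, a typical invocation that captures a heap dump on OOM uses the standard HotSpot flags below; the jar name and heap size are placeholders, not the poster's actual settings:

```
java -Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -jar zuul-app.jar
```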
Thanks for the detailed write up. I'll see if I can reproduce.
@andysworkshop Concerning your question about

This class would need to be defined in a package that is not visible to

In this case,
@jkschneider Thanks a lot for the tip. I can now submit metrics to atlas and retrieve charts from it. Would you happen to know how to direct the atlas server logging to a file and control its level? I noticed from jconsole that it's using the Apache logging system and could see logger settings in the XML configuration such as
Re the logging: you should be able to override the logging config file with a system property:
Right now there aren't any sub-properties. The log4j configuration format allows for lookups based on system properties, so we could potentially support a simple set with the default log4j config in the standalone jar to allow smaller changes to the default config. If you think that would be helpful, then please file an issue on the atlas project. Also, if you have a pointer to a common set that is expected, that would help us stay consistent.
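The property referenced above was lost from the page. Assuming the standalone jar uses stock log4j2 (which the mention of the Apache logging system and an XML configuration suggests), the standard override is log4j2's `log4j.configurationFile` system property; the jar name and paths here are placeholders:

```
java -Dlog4j.configurationFile=file:///etc/atlas/log4j2.xml -jar atlas-standalone.jar
```

A minimal `log4j2.xml` that writes to a file and controls the level might look like:

```xml
<Configuration status="warn">
  <Appenders>
    <File name="log" fileName="/var/log/atlas/atlas.log">
      <PatternLayout pattern="%d{ISO8601} %-5level [%logger] %msg%n"/>
    </File>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="log"/>
    </Root>
  </Loggers>
</Configuration>
```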
I was able to reproduce the problem. The bug is actually in the SpectatorMetricServices class from spring-boot, not spectator itself. I'll send them a PR tomorrow.
That's great news, thank you very much.
The primary change is to fix a memory leak reported in Netflix/spectator#264. Each time a gauge was updated it was creating a new registration, and because the map holds a strong reference these would never get collected. Further, the aggregate value created by the multiple registrations was not correct. In addition, I added some test cases around the various inputs and checked that the results were reflected as expected in the registry.

I noticed the timer values had a unit of milliseconds, but it isn't immediately clear if the reported value can ever be less than 1.0. The conversion to long is now delayed until after converting to nanoseconds, so duration values less than 1.0 will now work instead of just recording 0. For the histogram I changed to just using a cast to `long` to avoid boxing to a `Double`. As an FYI for the future, there is a DoubleDistributionSummary we have experimented with in spectator-ext-sandbox that might be more appropriate for this use-case.
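The actual fix lives in the spring-boot PR; below is a minimal sketch of the caching pattern described above, not the PR diff. It assumes spectator's `Registry.gauge(Id, Number)` registration, which polls the supplied `Number` and holds a strong reference to it: cache one value holder per name and register it exactly once, instead of creating a fresh registration on every update.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

import com.netflix.spectator.api.Registry;

public final class GaugeCache {

  // Number implementation that stores a double; the registry polls
  // doubleValue(), so updates through set() are picked up in place.
  static final class DoubleHolder extends Number {
    private final AtomicLong bits = new AtomicLong(Double.doubleToLongBits(0.0));
    void set(double v) { bits.set(Double.doubleToLongBits(v)); }
    @Override public double doubleValue() { return Double.longBitsToDouble(bits.get()); }
    @Override public float floatValue() { return (float) doubleValue(); }
    @Override public int intValue() { return (int) doubleValue(); }
    @Override public long longValue() { return (long) doubleValue(); }
  }

  private final Registry registry;

  // One holder per metric name, registered with the registry exactly once.
  private final ConcurrentMap<String, DoubleHolder> gauges = new ConcurrentHashMap<>();

  public GaugeCache(Registry registry) {
    this.registry = registry;
  }

  public void submit(String name, double value) {
    // computeIfAbsent guarantees a single registration per name, so
    // repeated updates no longer pile up un-collectable registrations.
    gauges.computeIfAbsent(name, n ->
        registry.gauge(registry.createId(n), new DoubleHolder()))
        .set(value);
  }
}
```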
Pay attention when using log4j2.xml together with system properties or log4j.component.properties.
@eastwoodwang I'm not sure what you are trying to say. Since the specific issue here was closed a long time ago, I would suggest filing a new issue if you are having a problem with Spectator.
I was running a small test with 5 instances of a microservice and a python script that periodically hits a REST endpoint on one of them via zuul. That endpoint then uses feign to make calls to the /info endpoint on each of the other 4 instances. I use the hystrix dashboard to watch aggregated stats collected in turbine as the test runs. Over time, all the microservices and zuul build up heap memory usage and then fail.

I have heap dumps of the failed services. Eclipse MAT produces the following summary:
I am using Spring Cloud Brixton M4. I am not running atlas, though I would like to once I figure out how to make the spectator/atlas integration work. I am not doing anything with spectator/servo in code or annotations. My gradle dependencies look like this:
Any ideas why I'm running out of memory and what I can do to fix it?