
ServoRegistry memory leak #264

Closed
andysworkshop opened this issue Jan 25, 2016 · 12 comments

@andysworkshop

I was running a small test with 5 instances of a microservice and a Python script that periodically hits a REST endpoint on one of them via zuul. That endpoint then uses feign to call the /info endpoint on each of the other 4 instances. I use the hystrix dashboard to watch aggregated stats collected in turbine as the test runs. Over time, all of the microservices and zuul build up heap memory usage and then fail.

I have heapdumps of the failed services. Eclipse MAT produces the following summary:

One instance of "com.netflix.spectator.servo.ServoRegistry" loaded by "org.springframework.boot.loader.LaunchedURLClassLoader @ 0xe4e60d38" occupies 562,574,240 (92.78%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".

[Eclipse MAT screenshot: mat-1]

I am using Spring Cloud Brixton.M4. I am not running Atlas, though I would like to once I figure out how to make the spectator/atlas integration work. I am not doing anything with spectator/servo in code or annotations. My gradle dependencies look like this:

dependencies {
  compile 'org.springframework.cloud:spring-cloud-starter-eureka'
  compile 'org.springframework.cloud:spring-cloud-starter-hystrix'
  compile 'org.springframework.cloud:spring-cloud-starter-feign'
  compile 'org.springframework.cloud:spring-cloud-starter-spectator'
}

Any ideas why I'm running out of memory and what I can do to fix it?

@brharrington
Contributor

My guess is that the distinct set of ids being tracked is growing over time. An example of such a problem:

Registry r = new ServoRegistry();
// each iteration creates a counter with a new id, so the registry grows without bound
for (int i = 0; ; ++i) {
  r.counter("" + i).increment();
}

If that is the case, then you can protect against that by setting a system property, e.g.:

System.setProperty("spectator.api.maxNumberOfMeters", "10000");

Once the limit is hit it will just return noop implementations for new registrations and prevent further leaking. To debug what is creating the explosion you could just dump all of the ids and do some analysis, e.g.:

r.stream().forEach(m -> System.out.println(m.id()));
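For example (a minimal sketch using Java 8 streams and java.util.stream.Collectors; r is the registry from above), grouping the ids by name shows which names dominate:

Map<String, Long> byName = r.stream()
    .collect(Collectors.groupingBy(m -> m.id().name(), Collectors.counting()));
// names with the largest counts are the likely source of the explosion
byName.forEach((name, count) -> System.out.println(name + " -> " + count));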

If you still see a leak with that setting in place, then it would help if you could create a minimal example to reproduce and I'll take a look.

The set of ids should be stable for the life of a process; that is, it is meant to track things over time, not record unique events. This is especially true if they are being reported to downstream time series databases that need to store and manage the data long term. There is support for expiring ids due to inactivity to help protect downstream backends, but right now everything registered is kept in the local registry because some use-cases expected the registered meters to reflect the total count.
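As a rough contrast to the example above (a sketch, not from the project docs; the metric name and tag values are hypothetical), stable usage draws ids from a small, fixed set of names and tag values:

Registry r = new ServoRegistry();

// Good: the "status" tag only takes a handful of values, so the set of ids stays bounded.
r.counter(r.createId("server.requests").withTag("status", "2xx")).increment();

// Bad: embedding per-request data (a request id, full URI, user id, ...) in the id
// creates a new meter for every request and the registry grows without bound.
r.counter(r.createId("server.requests").withTag("requestId", "a1b2c3")).increment();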

Ideally the usage gets fixed so that the limit is not needed, but spectator itself should also get better at mitigating this over time. The local expiration behavior will probably change, along with work on automatic rollups that drop only the abusive dimensions rather than everything. Automatic rollups cover the more common cases of metrics explosion we see internally, and we have some examples of it working for specific use-cases; it remains to be seen how well we can generalize it though.

@andysworkshop
Author

Looking again at the heap dump I can see millions of instances of NumericGauge, all of which have an id member of type DefaultId with a value of response.root. I'll start the test again with the system property you suggested and see what happens.

As an aside, I thought that if these stats represent a backlog then enabling atlas would stop the queue from building up. So I started an atlas instance on localhost and added the following to the client service's application.yml:

netflix:
    atlas:
        uri: http://localhost:7101/api/v1/publish

I also added @EnableAtlas to the main application class. This clearly had an effect because in the logs I get this exception:

2016-01-25 16:52:31.338  INFO 5212 --- [task-scheduler-8] c.netflix.loadbalancer.BaseLoadBalancer  : Client:localhost instantiated a LoadBalancer:DynamicServerListLoadBalancer:{NFLoadBalancer:name=localhost,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:null
2016-01-25 16:52:31.350  INFO 5212 --- [task-scheduler-8] c.n.l.DynamicServerListLoadBalancer      : DynamicServerListLoadBalancer for client localhost initialized: DynamicServerListLoadBalancer:{NFLoadBalancer:name=localhost,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:org.springframework.cloud.netflix.ribbon.eureka.DomainExtractingServerList@2d9de86
2016-01-25 16:52:31.411 ERROR 5212 --- [task-scheduler-8] o.s.integration.handler.LoggingHandler   : java.lang.IllegalStateException: No instances available for localhost
        at org.springframework.cloud.netflix.ribbon.RibbonClientHttpRequestFactory.createRequest(RibbonClientHttpRequestFactory.java:56)
        at org.springframework.http.client.support.HttpAccessor.createRequest(HttpAccessor.java:77)
        at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:592)
        at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:557)
        at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:475)
        at org.springframework.cloud.netflix.metrics.atlas.AtlasMetricObserver.sendMetricsBatch(AtlasMetricObserver.java:148)
        at org.springframework.cloud.netflix.metrics.atlas.AtlasMetricObserver.update(AtlasMetricObserver.java:126)
        at org.springframework.cloud.netflix.metrics.atlas.AtlasExporter.export(AtlasExporter.java:35)
        at org.springframework.boot.actuate.metrics.export.MetricExporters$ExportRunner.run(MetricExporters.java:112)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The stats submitter seems to be using ribbon (makes sense) to find atlas to send the stats but to me it seems that the eureka/ribbon service-id based goodness is at odds with setting a raw uri in netflix.atlas.uri. Any help here would be great.

@brharrington
Contributor

The stats submitter seems to be using ribbon (makes sense) to find atlas to send the stats but to me it seems that the eureka/ribbon service-id based goodness is at odds with setting a raw uri in netflix.atlas.uri. Any help here would be great.

Spring has its own observer to get around some java version issues. I'll have to look at how it is set up. Normally, if we want to use eureka-based lookup, we just set the uri to something like:

niws://${client_name}/path

The niws scheme tells us to look up details using eureka, and the client name is used to choose the right configuration and vip. Not sure if this pattern works with the spring setup, but I'll look into it and get back to you.

@andysworkshop
Author

I left my test system running overnight and again zuul and all the services went OOM, this time with the new property set. Java 8 (Oracle JRE) is started like this:

java -server -Dlogging.file=/var/tmp/zuul.log -Dspectator.api.maxNumberOfMeters=10000 -Djava.security.egd=file:/dev/./urandom -Xmx300M -Xms300M -XX:+UseG1GC -jar zuul.jar

The heap dump shows about 900K instances of the NumericGauge object. Zuul is the simplest reproducible example. Here are the build files.

build.gradle
task wrapper(type: Wrapper) {
  gradleVersion = '2.10'
}

buildscript {
  repositories {
    maven { url "https://repo.spring.io/plugins-release/" }
    maven { url "https://repo.spring.io/libs-release/" }
    mavenCentral()
  }
  dependencies {
    classpath "org.springframework.boot:spring-boot-gradle-plugin:1.3.1.RELEASE"
    classpath "io.spring.gradle:dependency-management-plugin:0.5.3.RELEASE"
  }
}

apply plugin: "java"
apply plugin: "spring-boot"
apply plugin: "io.spring.dependency-management"
apply plugin: "maven"

repositories {
  mavenLocal()
  maven { url "https://repo.spring.io/release" }
  maven { url "https://repo.spring.io/milestone" }
  mavenCentral()
}

dependencyManagement {
  imports {
    mavenBom 'org.springframework.cloud:spring-cloud-starter-parent:Brixton.M4'
  }
}

dependencies {
  compile 'org.springframework.cloud:spring-cloud-starter-zuul'
  compile 'org.springframework.cloud:spring-cloud-starter-eureka'
  compile 'org.springframework.cloud:spring-cloud-starter-hystrix'
  compile 'org.springframework.cloud:spring-cloud-starter-stream-rabbit'
  compile 'org.springframework.cloud:spring-cloud-netflix-hystrix-stream'
  compile 'org.springframework.cloud:spring-cloud-starter-spectator'
  compile 'org.springframework.cloud:spring-cloud-starter-atlas'
}
ZuulApplication.java

This is the only source code; there are no other Java sources in the project.

package uk.me.andybrown;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.cloud.netflix.zuul.EnableZuulProxy;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Controller;
import org.springframework.cloud.netflix.metrics.atlas.EnableAtlas;

@Configuration
@EnableAutoConfiguration
@SpringBootApplication
@Controller
@EnableZuulProxy
@EnableAtlas
public class ZuulApplication {

    public static void main(String[] args) {
        new SpringApplicationBuilder(ZuulApplication.class).web(true).run(args);
    }
}
bootstrap.yml
server:
  port: 8765
application.yml
spring:
  application:
    name: zuul
  rabbitmq:
      addresses: localhost:5672
      username: [redacted]
      password: [redacted]

server:
  port: 8765

info:
  component: Zuul

eureka:
  instance:
    leaseRenewalIntervalInSeconds: 10
    metadataMap:
      cluster: LOCAL
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/

netflix:
    atlas:
        uri: http://localhost:7101/api/v1/publish

@brharrington
Contributor

Thanks for the detailed write up. I'll see if I can reproduce.

@jkschneider
Contributor

@andysworkshop Concerning your question about netflix.atlas.uri and its interaction with Ribbon: you can configure ribbon to use one or more static DNS names for a particular named client. For example:

import org.springframework.cloud.netflix.ribbon.StaticServerList;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import com.netflix.loadbalancer.Server;

@Configuration
public class AtlasClientConfiguration {
    @Bean
    StaticServerList<Server> ribbonServerList() {
        // fixed server list for the named client (replace "myatlasuri" with your Atlas host)
        return new StaticServerList<>(new Server("myatlasuri", 80));
    }
}

This class would need to be defined in a package that is not visible to @ComponentScan. Then, one of your @Configuration classes can name the client with:

@RibbonClient(name = "atlas", configuration = AtlasClientConfiguration.class)

In this case, netflix.atlas.uri would be http://atlas, which refers to the named client that you created with @RibbonClient.
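Putting the pieces together (a sketch under the assumptions above; the class name and the /api/v1/publish path are illustrative, the path taken from the earlier localhost config), the named client could be declared on any scanned configuration class:

import org.springframework.cloud.netflix.ribbon.RibbonClient;
import org.springframework.context.annotation.Configuration;

// Lives in a package that IS component-scanned; it names the "atlas" client and
// points it at AtlasClientConfiguration, which must stay outside component scan.
@Configuration
@RibbonClient(name = "atlas", configuration = AtlasClientConfiguration.class)
public class MetricsRibbonConfiguration {
}

With that in place, netflix.atlas.uri would be set to something like http://atlas/api/v1/publish so that the host part resolves via the named client.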

@andysworkshop
Author

@jkschneider Thanks a lot for the tip. I can now submit metrics to atlas and retrieve charts from it.

Would you happen to know how to direct the atlas server logging to a file and control its level? I noticed from jconsole that it's using the Apache logging system and could see logger settings in the XML configuration such as <Logger name="com.netflix.spectator" level="debug"/>. Are we expected to override the whole configuration file for log control, or are there system properties available like Spring Cloud has with logging.file etc.?

@brharrington
Contributor

Re the logging: you should be able to override the logging config file with a system property:

-Dlog4j.configurationFile=log4j.xml

Right now there aren't any sub-properties. The log4j configuration format allows for lookups based on system properties, so we could potentially support a small set of properties in the default log4j config shipped with the standalone jar to allow minor tweaks without overriding the whole file. If you think that would be helpful, please file an issue on the atlas project. Also, if you have a pointer to a common set of properties that users expect, that would help us keep things consistent.

@brharrington
Contributor

I was able to reproduce the problem. The bug is actually in the SpectatorMetricServices class from spring-boot, not spectator itself. I'll send them a PR tomorrow.

@andysworkshop
Author

That's great news, thank you very much.

brharrington added a commit to brharrington/spring-cloud-netflix that referenced this issue Jan 28, 2016
The primary change is to fix a memory leak reported in
Netflix/spectator#264. Each time a gauge was updated it
was creating a new registration and because the map holds
a strong reference these would never get collected.
Further, the aggregate value created by the multiple
registrations was not correct.

In addition I added some test cases around the various
inputs and checked that the results were reflected as
expected in the registry. I noticed the timer values
had a unit of milliseconds, but it isn't immediately
clear if the reported value can ever be less than 1.0. The
conversion to long is now delayed until after converting
to nanoseconds so duration values less than 1.0 will now
work instead of just recording 0.

For the histogram I changed to just using a cast to `long`
to avoid boxing to a `Double`. As an FYI for the future,
there is a DoubleDistributionSummary we have experimented
with in spectator-ext-sandbox that might be more appropriate
for this use-case.
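To illustrate the gauge problem described in the commit message (a rough sketch of the pattern only, not the actual SpectatorMetricServices code; the class and method names are hypothetical, and it assumes spectator's gauge(Id, obj, function) registration method), the difference is between creating a new registration on every update and registering once and then updating the value:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import com.netflix.spectator.api.Id;
import com.netflix.spectator.api.Registry;

class GaugeSubmitter {
  private final Registry registry;
  private final ConcurrentHashMap<Id, AtomicReference<Double>> gauges = new ConcurrentHashMap<>();

  GaugeSubmitter(Registry registry) {
    this.registry = registry;
  }

  // Leaky pattern: every update registers a fresh value holder for the same id, so
  // the registry accumulates registrations that (as described above) are never collected.
  void submitLeaky(String name, double value) {
    registry.gauge(registry.createId(name), new AtomicReference<>(value), ref -> ref.get());
  }

  // Fixed pattern: register a single holder per id once, then just update its value.
  void submitFixed(String name, double value) {
    gauges.computeIfAbsent(registry.createId(name),
        id -> registry.gauge(id, new AtomicReference<>(0.0), ref -> ref.get()))
      .set(value);
  }
}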
spencergibb pushed a commit to spring-cloud/spring-cloud-netflix that referenced this issue Feb 1, 2016
@eastwoodwang

Be careful to use log4j2.xml when configuring via system properties or log4j.component.properties.

@brharrington
Contributor

brharrington commented Sep 28, 2018

@eastwoodwang I'm not sure what you are trying to say. Since the specific issue here was closed a long time ago, I would suggest filing a new issue if you are having a problem with Spectator.
