
ServoRegistry memory leak #264

Closed
andysworkshop opened this issue Jan 25, 2016 · 12 comments

@andysworkshop

I was running a small test with 5 instances of a microservice and a Python script that periodically hits a REST endpoint on one of them via zuul. That endpoint then uses feign to call the /info endpoint on each of the other 4 instances. I use the hystrix dashboard to watch aggregated stats collected in turbine as the test runs. Over time, all of the microservices and zuul build up heap memory usage and then fail.

I have heapdumps of the failed services. Eclipse MAT produces the following summary:

One instance of "com.netflix.spectator.servo.ServoRegistry" loaded by "org.springframework.boot.loader.LaunchedURLClassLoader @ 0xe4e60d38" occupies 562,574,240 (92.78%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".

[Eclipse MAT screenshot: mat-1]

I am using Spring Cloud Brixton.M4. I am not running Atlas, though I would like to once I figure out how to make the spectator/atlas integration work. I am not doing anything with spectator/servo in code or annotations. My gradle dependencies look like this:

dependencies {
  compile 'org.springframework.cloud:spring-cloud-starter-eureka'
  compile 'org.springframework.cloud:spring-cloud-starter-hystrix'
  compile 'org.springframework.cloud:spring-cloud-starter-feign'
  compile 'org.springframework.cloud:spring-cloud-starter-spectator'
}

Any ideas why I'm running out of memory and what I can do to fix it?

@brharrington
Contributor

My guess is that the distinct set of ids being tracked is growing over time. An example of such a problem:

Registry r = new ServoRegistry();
// each iteration creates a counter with a new id, so the registry grows without bound
for (int i = 0; ; ++i) {
  r.counter("" + i).increment();
}

If that is the case, then you can protect against that by setting a system property, e.g.:

System.setProperty("spectator.api.maxNumberOfMeters", "10000");

Once the limit is hit it will just return noop implementations for new registrations and prevent further leaking. To debug what is creating the explosion you could just dump all of the ids and do some analysis, e.g.:

r.stream().forEach(m -> System.out.println(m.id()));
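For example (a minimal sketch using Java 8 streams and java.util.stream.Collectors; r is the registry from above), grouping the ids by name shows which names dominate:

Map<String, Long> byName = r.stream()
    .collect(Collectors.groupingBy(m -> m.id().name(), Collectors.counting()));
// names with the largest counts are the likely source of the explosion
byName.forEach((name, count) -> System.out.println(name + " -> " + count));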

If you still see a leak with that setting in place, then it would help if you could create a minimal example to reproduce and I'll take a look.

The set of ids should be stable for the life of a process; that is, it is meant to track things over time, not record unique events. This is especially true if they are being reported to downstream time series databases that need to store and manage the data long term. There is support for expiring ids due to inactivity to help protect downstream backends, but right now everything registered is kept in the local registry because some use-cases expected the registered meters to reflect the total count.
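As a rough contrast to the example above (a sketch, not from the project docs; the metric name and tag values are hypothetical), stable usage draws ids from a small, fixed set of names and tag values:

Registry r = new ServoRegistry();

// Good: the "status" tag only takes a handful of values, so the set of ids stays bounded.
r.counter(r.createId("server.requests").withTag("status", "2xx")).increment();

// Bad: embedding per-request data (a request id, full URI, user id, ...) in the id
// creates a new meter for every request and the registry grows without bound.
r.counter(r.createId("server.requests").withTag("requestId", "a1b2c3")).increment();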

Ideally the usage gets fixed so that the limit is not needed, but spectator itself should also get better at mitigating this over time. The local expiration behavior will probably change, along with work on automatic rollups that drop only the abusive dimensions rather than everything. Automatic rollups cover the more common cases of metrics explosion we see internally, and we have some examples of it working for specific use-cases; it remains to be seen how well we can generalize it though.

@andysworkshop
Author

Looking again at the heap dump I can see millions of instances of NumericGauge, all of which have an id member of type DefaultId with a value of response.root. I'll start the test again with the system property you suggested and see what happens.

As an aside, I thought that if these stats represent a backlog then enabling atlas would stop the queue from building up. So I started an atlas instance on localhost and added the following to the client service's application.yml:

netflix:
    atlas:
        uri: http://localhost:7101/api/v1/publish

I also added @EnableAtlas to the main application class. This clearly had an effect because in the logs I get this exception:

2016-01-25 16:52:31.338  INFO 5212 --- [task-scheduler-8] c.netflix.loadbalancer.BaseLoadBalancer  : Client:localhost instantiated a LoadBalancer:DynamicServerListLoadBalancer:{NFLoadBalancer:name=localhost,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:null
2016-01-25 16:52:31.350  INFO 5212 --- [task-scheduler-8] c.n.l.DynamicServerListLoadBalancer      : DynamicServerListLoadBalancer for client localhost initialized: DynamicServerListLoadBalancer:{NFLoadBalancer:name=localhost,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:org.springframework.cloud.netflix.ribbon.eureka.DomainExtractingServerList@2d9de86
2016-01-25 16:52:31.411 ERROR 5212 --- [task-scheduler-8] o.s.integration.handler.LoggingHandler   : java.lang.IllegalStateException: No instances available for localhost
        at org.springframework.cloud.netflix.ribbon.RibbonClientHttpRequestFactory.createRequest(RibbonClientHttpRequestFactory.java:56)
        at org.springframework.http.client.support.HttpAccessor.createRequest(HttpAccessor.java:77)
        at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:592)
        at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:557)
        at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:475)
        at org.springframework.cloud.netflix.metrics.atlas.AtlasMetricObserver.sendMetricsBatch(AtlasMetricObserver.java:148)
        at org.springframework.cloud.netflix.metrics.atlas.AtlasMetricObserver.update(AtlasMetricObserver.java:126)
        at org.springframework.cloud.netflix.metrics.atlas.AtlasExporter.export(AtlasExporter.java:35)
        at org.springframework.boot.actuate.metrics.export.MetricExporters$ExportRunner.run(MetricExporters.java:112)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The stats submitter seems to be using ribbon (makes sense) to find atlas to send the stats but to me it seems that the eureka/ribbon service-id based goodness is at odds with setting a raw uri in netflix.atlas.uri. Any help here would be great.

@brharrington
Contributor

The stats submitter seems to be using ribbon (makes sense) to find atlas to send the stats but to me it seems that the eureka/ribbon service-id based goodness is at odds with setting a raw uri in netflix.atlas.uri. Any help here would be great.

Spring has its own observer to get around some java version issues. I'll have to look at how it is set up. Normally, if we want to use eureka-based lookup, we just set the uri to something like:

niws://${client_name}/path

The niws scheme tells us to look up details using eureka, and the client name is used to choose the right configuration and vip. Not sure if this pattern works with the spring setup, but I'll look into it and get back to you.

@andysworkshop
Author

I left my test system running overnight and again zuul and all the services went OOM, this time with the new property set. Java 8 (Oracle JRE) is started like this:

java -server -Dlogging.file=/var/tmp/zuul.log -Dspectator.api.maxNumberOfMeters=10000 -Djava.security.egd=file:/dev/./urandom -Xmx300M -Xms300M -XX:+UseG1GC -jar zuul.jar

The heap dump shows about 900K instances of the NumericGauge object. Zuul is the simplest reproducible example. Here are the build files.

build.gradle
task wrapper(type: Wrapper) {
  gradleVersion = '2.10'
}

buildscript {
  repositories {
    maven { url "https://repo.spring.io/plugins-release/" }
    maven { url "https://repo.spring.io/libs-release/" }
    mavenCentral()
  }
  dependencies {
    classpath "org.springframework.boot:spring-boot-gradle-plugin:1.3.1.RELEASE"
    classpath "io.spring.gradle:dependency-management-plugin:0.5.3.RELEASE"
  }
}

apply plugin: "java"
apply plugin: "spring-boot"
apply plugin: "io.spring.dependency-management"
apply plugin: "maven"

repositories {
  mavenLocal()
  maven { url "https://repo.spring.io/release" }
  maven { url "https://repo.spring.io/milestone" }
  mavenCentral()
}

dependencyManagement {
  imports {
    mavenBom 'org.springframework.cloud:spring-cloud-starter-parent:Brixton.M4'
  }
}

dependencies {
  compile 'org.springframework.cloud:spring-cloud-starter-zuul'
  compile 'org.springframework.cloud:spring-cloud-starter-eureka'
  compile 'org.springframework.cloud:spring-cloud-starter-hystrix'
  compile 'org.springframework.cloud:spring-cloud-starter-stream-rabbit'
  compile 'org.springframework.cloud:spring-cloud-netflix-hystrix-stream'
  compile 'org.springframework.cloud:spring-cloud-starter-spectator'
  compile 'org.springframework.cloud:spring-cloud-starter-atlas'
}
ZuulApplication.java

This is the only source code; there are no other Java sources in the project.

package uk.me.andybrown;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.cloud.netflix.zuul.EnableZuulProxy;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Controller;
import org.springframework.cloud.netflix.metrics.atlas.EnableAtlas;

@Configuration
@EnableAutoConfiguration
@SpringBootApplication
@Controller
@EnableZuulProxy
@EnableAtlas
public class ZuulApplication {

    public static void main(String[] args) {
        new SpringApplicationBuilder(ZuulApplication.class).web(true).run(args);
    }
}
bootstrap.yml
server:
  port: 8765
application.yml
spring:
  application:
    name: zuul
  rabbitmq:
      addresses: localhost:5672
      username: [redacted]
      password: [redacted]

server:
  port: 8765

info:
  component: Zuul

eureka:
  instance:
    leaseRenewalIntervalInSeconds: 10
    metadataMap:
      cluster: LOCAL
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/

netflix:
    atlas:
        uri: http://localhost:7101/api/v1/publish

@brharrington
Contributor

Thanks for the detailed write up. I'll see if I can reproduce.

@jkschneider
Contributor

@andysworkshop Concerning your question about netflix.atlas.uri and its interaction with Ribbon: you can configure ribbon to use one or more static DNS names for a particular named client. For example:

import org.springframework.cloud.netflix.ribbon.StaticServerList;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import com.netflix.loadbalancer.Server;

@Configuration
public class AtlasClientConfiguration {
    @Bean
    StaticServerList<Server> ribbonServerList() {
        // fixed server list for the named client (replace "myatlasuri" with your Atlas host)
        return new StaticServerList<>(new Server("myatlasuri", 80));
    }
}

This class would need to be defined in a package that is not visible to @ComponentScan. Then, one of your @Configuration classes can name the client with:

@RibbonClient(name = "atlas", configuration = AtlasClientConfiguration.class)

In this case, netflix.atlas.uri would be http://atlas, which refers to the named client that you created with @RibbonClient.
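Putting the pieces together (a sketch under the assumptions above; the class name and the /api/v1/publish path are illustrative, the path taken from the earlier localhost config), the named client could be declared on any scanned configuration class:

import org.springframework.cloud.netflix.ribbon.RibbonClient;
import org.springframework.context.annotation.Configuration;

// Lives in a package that IS component-scanned; it names the "atlas" client and
// points it at AtlasClientConfiguration, which must stay outside component scan.
@Configuration
@RibbonClient(name = "atlas", configuration = AtlasClientConfiguration.class)
public class MetricsRibbonConfiguration {
}

With that in place, netflix.atlas.uri would be set to something like http://atlas/api/v1/publish so that the host part resolves via the named client.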

@andysworkshop
Author

@jkschneider Thanks a lot for the tip. I can now submit metrics to atlas and retrieve charts from it.

Would you happen to know how to direct the atlas server logging to a file and control its level? I noticed from jconsole that it's using the Apache logging system and could see logger settings in the XML configuration such as <Logger name="com.netflix.spectator" level="debug"/>. Are we expected to override the whole configuration file for log control, or are there system properties available like Spring Cloud has with logging.file etc.?

@brharrington
Contributor

Re the logging: you should be able to override the logging config file with a system property:

-Dlog4j.configurationFile=log4j.xml

Right now there aren't any sub-properties. The log4j configuration format allows for lookups based on system properties, so we could potentially support a small set of properties in the default log4j config shipped with the standalone jar to allow minor tweaks without overriding the whole file. If you think that would be helpful, please file an issue on the atlas project. Also, if you have a pointer to a common set of properties that users expect, that would help us keep things consistent.

@brharrington
Contributor

I was able to reproduce the problem. The bug is actually in the SpectatorMetricServices class from spring-boot, not spectator itself. I'll send them a PR tomorrow.

@andysworkshop
Author

That's great news, thank you very much.

brharrington added a commit to brharrington/spring-cloud-netflix that referenced this issue Jan 28, 2016
The primary change is to fix a memory leak reported in
Netflix/spectator#264. Each time a gauge was updated it
was creating a new registration and because the map holds
a strong reference these would never get collected.
Further, the aggregate value created by the multiple
registrations was not correct.

In addition I added some test cases around the various
inputs and checked that the results were reflected as
expected in the registry. I noticed the timer values
had a unit of milliseconds, but it isn't immediately
clear if the reported value can ever be less than 1.0. The
conversion to long is now delayed until after converting
to nanoseconds so duration values less than 1.0 will now
work instead of just recording 0.

For the histogram I changed to just using a cast to `long`
to avoid boxing to a `Double`. As an FYI for the future,
there is a DoubleDistributionSummary we have experimented
with in spectator-ext-sandbox that might be more appropriate
for this use-case.
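To illustrate the gauge problem described in the commit message (a rough sketch of the pattern only, not the actual SpectatorMetricServices code; the class and method names are hypothetical, and it assumes spectator's gauge(Id, obj, function) registration method), the difference is between creating a new registration on every update and registering once and then updating the value:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import com.netflix.spectator.api.Id;
import com.netflix.spectator.api.Registry;

class GaugeSubmitter {
  private final Registry registry;
  private final ConcurrentHashMap<Id, AtomicReference<Double>> gauges = new ConcurrentHashMap<>();

  GaugeSubmitter(Registry registry) {
    this.registry = registry;
  }

  // Leaky pattern: every update registers a fresh value holder for the same id, so
  // the registry accumulates registrations that (as described above) are never collected.
  void submitLeaky(String name, double value) {
    registry.gauge(registry.createId(name), new AtomicReference<>(value), ref -> ref.get());
  }

  // Fixed pattern: register a single holder per id once, then just update its value.
  void submitFixed(String name, double value) {
    gauges.computeIfAbsent(registry.createId(name),
        id -> registry.gauge(id, new AtomicReference<>(0.0), ref -> ref.get()))
      .set(value);
  }
}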
spencergibb pushed a commit to spring-cloud/spring-cloud-netflix that referenced this issue Feb 1, 2016
@eastwoodwang

Be careful to use log4j2.xml when configuring via system properties or log4j.component.properties.

@brharrington
Contributor

brharrington commented Sep 28, 2018

@eastwoodwang I'm not sure what you are trying to say. Since the specific issue here was closed a long time ago, I would suggest filing a new issue if you are having a problem with Spectator.
