Gauge type metric stops populating in AWS CloudWatch after some period of time #713

Open
peterlitvak opened this issue Sep 30, 2020 · 8 comments

@peterlitvak

peterlitvak commented Sep 30, 2020

I have a gauge type metric that is sent from multiple nodes to AWS CloudWatch. Only one node sends the actual value; the rest send 0. After some time, AWS CloudWatch stops updating the metric on the dashboard (0 is shown).
While troubleshooting the issue I can see in the logs that the correct metric values are reaching the statsd service while AWS CloudWatch is still showing 0.
Looking for any clues as to what the issue might be here.
Statsd version: 0.8.6
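
For context, the metric is a plain statsd gauge sent over UDP. A minimal sketch of what each node effectively emits (metric name, values, host and port are illustrative; in reality the value comes from our application via a statsd client library):

// Sketch only: what the nodes effectively send to statsd on each update.
const dgram = require('dgram');
const socket = dgram.createSocket('udp4');

// Standard statsd gauge line format: "<name>:<value>|g".
// One node sends the actual value...
socket.send(Buffer.from('staging.my_gauge:42|g'), 8125, '127.0.0.1');
// ...and every other node sends 0.
socket.send(Buffer.from('staging.my_gauge:0|g'), 8125, '127.0.0.1', () => socket.close());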

@BlueHatbRit
Member

Hi @peterlitvak, that sounds frustrating! Would you be able to post your full configuration and more details about your setup? Some questions it would be useful to know the answers to:

  • Are you using any plugins with statsd? For example, any custom "backends"? I imagine you are, given the metrics are going into AWS CloudWatch.
  • What does your config file look like?
  • Are you using a statsd cluster or separate nodes? More info about this part of the setup would be great.
  • By any chance do you have some reproduction steps?

@peterlitvak
Author

Thank you for the quick response. Here are the details, in order:

{
    backends: [ "statsd-cloudwatch-backend"],
    flushInterval: 60000,
    cloudwatch: {
        region: 'us-west-2',
        dimensions: {
            InstanceId: 'dynamic'
        },
        namespace: "Staging"
    }
}

I also added debug: true and dumpMessages: true while trying to understand what the problem could be, but due to the log file size growth it is usually turned off.
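
For reference, the config with those flags enabled looks roughly like this (the same config as above with statsd's two debug options added; the inline comments are mine):

{
    backends: [ "statsd-cloudwatch-backend"],
    flushInterval: 60000,
    debug: true,          // extra diagnostic logging
    dumpMessages: true,   // logs every incoming metric line, hence the log growth
    cloudwatch: {
        region: 'us-west-2',
        dimensions: {
            InstanceId: 'dynamic'
        },
        namespace: "Staging"
    }
}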

  • we use separate nodes, not a cluster

  • there are no reproduction steps, since the issue is pretty sporadic. At some point I thought it was related to restarts of the application that publishes the statsd metrics, but that turned out not to be the case: after an app restart I can still see the messages coming in the statsd log.

One more piece of information: when the issue occurs, if I restart the statsd service on the host that publishes the actual values, CloudWatch starts displaying data again.

@BlueHatbRit
Member

Thanks for the info, especially the config and plugin link.

So in theory, to duplicate your setup, I could set up an app which updates a gauge sporadically and pumps the metrics into statsd, and then into CloudWatch using that plugin? How sporadic is the issue: once every few days, or once every few hours?

One more piece of information: when the issue occurs, if I restart the statsd service on the host that publishes the actual values, CloudWatch starts displaying data again.

Right, so it sounds like it's probably not related to your application but to something that happens after you've published the metrics. I'm guessing you're using UDP and not TCP to send into statsd?

If I know the rough interval I can set up a local demo, leave it running overnight or something, and see if I can re-create the issue. I run a bunch of statsd systems with a heavy reliance on gauges, but none of them go into CloudWatch, so my gut tells me it's probably something to do with the plugin, though I have no evidence of that. I see you've opened an issue on the plugin as well; linking it here for future reference: dylanmei/statsd-cloudwatch-backend#5
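
For my own reference, the local repro I have in mind is roughly this kind of throwaway script pointed at a statsd instance configured like yours (metric name, host, port and interval are just assumptions):

// repro-gauge.js -- throwaway repro harness, not anyone's production reporter.
// Assumes a local statsd listening on UDP 8125 with flushInterval: 60000 and
// the cloudwatch backend configured as in the comment above.
const dgram = require('dgram');
const socket = dgram.createSocket('udp4');

function sendGauge(name, value) {
  // Plain statsd gauge line: "<name>:<value>|g"
  socket.send(Buffer.from(`${name}:${value}|g`), 8125, '127.0.0.1', (err) => {
    if (err) console.error('send failed:', err);
  });
}

// Update the gauge at an irregular 5-60 second interval so that some flush
// windows see an update and others don't.
(function loop() {
  sendGauge('repro.test_gauge', Math.floor(Math.random() * 100));
  setTimeout(loop, 5000 + Math.random() * 55000);
})();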

@peterlitvak
Author

It was happening every 2-3 days. As of now, metrics have been publishing normally for about 2.5 days, so if the pattern holds it should stop within the next day or so.
On the protocol side, we are using com.bealetech/metrics-statsd to publish metrics, and it looks like it uses UDP.
The statsd and the app that publishes metrics to it are on the same host.

@BlueHatbRit
Member

Okay, great, thanks for the info @peterlitvak. I'd be interested to see what the CloudWatch plugin maintainers say. With something like this I'd expect to see a lot of issues raised very quickly if we had the same problem with, say, our graphite backend, and I've personally been running a statsd instance with a graphite backend for a few months now, without any restarts, and haven't hit this problem.

Are you on the latest version of statsd? I could try to set up a proof, but it will mean spinning up some infrastructure on AWS. I'll give the cloudwatch-backend maintainers a bit of time to respond, since they've not made any updates since 2015; if they don't, it might be a case of debugging their module. It could still be statsd, but that would impact every backend, which doesn't seem to be the case at the moment.
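
If it does come to debugging on your side, one cheap check is to drop a tiny extra backend in alongside the cloudwatch one and log what statsd actually hands to backends at each flush. A rough sketch of such a backend (the file name and log format are just placeholders):

// debug-flush-backend.js -- minimal statsd backend that logs the gauge values
// handed to backends on every flush, so they can be compared with what the
// cloudwatch backend ends up publishing. Load it by adding its path to the
// backends array in the config.
var util = require('util');

exports.init = function (startupTime, config, events) {
  events.on('flush', function (timestamp, metrics) {
    // metrics.gauges is an object of gauge name -> current value
    console.log('flush @ ' + timestamp + ' gauges: ' + util.inspect(metrics.gauges));
  });
  return true;
};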

@BlueHatbRit BlueHatbRit self-assigned this Oct 2, 2020
@peterlitvak
Author

We are on v0.8.6 of statsd. I understand it could be a number of things and greatly appreciate you looking into this. It is especially hard to troubleshoot since it is pretty sporadic; for example, everything has been working fine for 4 days in a row now.

@BlueHatbRit
Member

No worries. I've had something running for the last 24h and haven't hit the issue; I'll keep it running for a bit longer, but since I've not seen this with non-AWS backends I'm inclined to say it's unlikely to be the statsd daemon right now. I'll keep things running and see what happens. Any logs you do manage to get would be fantastic, but I understand that's tough given the scenario. I'm wondering if we could make a logging change to support log file rotation; it's not something we've needed in the past, but it could be time for it if this issue persists.

@peterlitvak
Author

Appreciate your attention to the issue. I've changed our staging code to report the same value for the gauge from all of the nodes; we'll see if that positively affects stability.
