Gauge type metric stops populating in AWS CloudWatch after some period of time #713

Open
peterlitvak opened this issue Sep 30, 2020 · 8 comments

@peterlitvak

peterlitvak commented Sep 30, 2020

I have a gauge type metric that is sent from multiple nodes to AWS CloudWatch. Only one node sends the actual value; the rest send 0. After some time, AWS CloudWatch stops updating the metric on the dashboard (0 is shown).
While troubleshooting the issue I can see in the logs that the correct metric values are reaching the statsd service while AWS CloudWatch is still showing 0.
Looking for any clues as to what the issue might be here.
Statsd version: 0.8.6
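
For context, the metric is a plain statsd gauge sent over UDP. A minimal sketch of what each node effectively emits (metric name, values, host and port are illustrative; in reality the value comes from our application via a statsd client library):

// Sketch only: what the nodes effectively send to statsd on each update.
const dgram = require('dgram');
const socket = dgram.createSocket('udp4');

// Standard statsd gauge line format: "<name>:<value>|g".
// One node sends the actual value...
socket.send(Buffer.from('staging.my_gauge:42|g'), 8125, '127.0.0.1');
// ...and every other node sends 0.
socket.send(Buffer.from('staging.my_gauge:0|g'), 8125, '127.0.0.1', () => socket.close());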

@BlueHatbRit
Member

Hi @peterlitvak, that sounds frustrating! Would you be able to post your full configuration and more details about your setup? Some questions it would be useful to know the answers to:

  • Are you using any plugins with statsd? For example, any custom "backends"? I imagine you are, given the metrics are going into AWS CloudWatch.
  • What does your config file look like?
  • Are you using a statsd cluster or separate nodes? More info about this part of the setup would be great.
  • By any chance do you have some reproduction steps?

@peterlitvak
Author

Thank you for the quick response. Here are the details, in order:

{
    backends: [ "statsd-cloudwatch-backend"],
    flushInterval: 60000,
    cloudwatch: {
        region: 'us-west-2',
        dimensions: {
            InstanceId: 'dynamic'
        },
        namespace: "Staging"
    }
}

I also added debug: true and dumpMessages: true while trying to understand what the problem could be, but due to the log file size growth it is usually turned off.
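
For reference, the config with those flags enabled looks roughly like this (the same config as above with statsd's two debug options added; the inline comments are mine):

{
    backends: [ "statsd-cloudwatch-backend"],
    flushInterval: 60000,
    debug: true,          // extra diagnostic logging
    dumpMessages: true,   // logs every incoming metric line, hence the log growth
    cloudwatch: {
        region: 'us-west-2',
        dimensions: {
            InstanceId: 'dynamic'
        },
        namespace: "Staging"
    }
}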

  • we use separate nodes, not a cluster

  • there are no reproduction steps, since the issue is pretty sporadic. At some point I thought it was related to restarts of the application that publishes the statsd metrics, but that turned out not to be the case: after an app restart I can still see the messages coming in the statsd log.

One more piece of information: when the issue occurs, if I restart the statsd service on the host that publishes the actual values, CloudWatch starts displaying data again.

@BlueHatbRit
Member

Thanks for the info, especially the config and plugin link.

So in theory, to duplicate your setup, I could set up an app which updates a gauge sporadically and pumps the metrics into statsd, and then into CloudWatch using that plugin? How sporadic is the issue: once every few days, or once every few hours?

One more piece of information: when the issue occurs, if I restart the statsd service on the host that publishes the actual values, CloudWatch starts displaying data again.

Right, so it sounds like it's probably not related to your application but to something that happens after you've published the metrics. I'm guessing you're using UDP and not TCP to send into statsd?

If I know the rough interval I can set up a local demo, leave it running overnight or something, and see if I can re-create the issue. I run a bunch of statsd systems with a heavy reliance on gauges, but none of them go into CloudWatch, so my gut tells me it's probably something to do with the plugin, though I have no evidence of that. I see you've opened an issue on the plugin as well; linking it here for future reference: dylanmei/statsd-cloudwatch-backend#5
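
For my own reference, the local repro I have in mind is roughly this kind of throwaway script pointed at a statsd instance configured like yours (metric name, host, port and interval are just assumptions):

// repro-gauge.js -- throwaway repro harness, not anyone's production reporter.
// Assumes a local statsd listening on UDP 8125 with flushInterval: 60000 and
// the cloudwatch backend configured as in the comment above.
const dgram = require('dgram');
const socket = dgram.createSocket('udp4');

function sendGauge(name, value) {
  // Plain statsd gauge line: "<name>:<value>|g"
  socket.send(Buffer.from(`${name}:${value}|g`), 8125, '127.0.0.1', (err) => {
    if (err) console.error('send failed:', err);
  });
}

// Update the gauge at an irregular 5-60 second interval so that some flush
// windows see an update and others don't.
(function loop() {
  sendGauge('repro.test_gauge', Math.floor(Math.random() * 100));
  setTimeout(loop, 5000 + Math.random() * 55000);
})();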

@peterlitvak
Author

It was happening every 2-3 days. As of now, metrics have been publishing normally for about 2.5 days, so if the pattern holds it should stop within the next day or so.
On the protocol side, we are using com.bealetech/metrics-statsd to publish metrics, and it looks like it uses UDP.
The statsd and the app that publishes metrics to it are on the same host.

@BlueHatbRit
Member

Okay, great, thanks for the info @peterlitvak. I'd be interested to see what the CloudWatch plugin maintainers say. With something like this I'd expect to see a lot of issues raised very quickly if we had the same problem with, say, our graphite backend, and I've personally been running a statsd instance with a graphite backend for a few months now, without any restarts, and haven't hit this problem.

Are you on the latest version of statsd? I could try to set up a proof, but it will mean spinning up some infrastructure on AWS. I'll give the cloudwatch-backend maintainers a bit of time to respond, since they've not made any updates since 2015; if they don't, it might be a case of debugging their module. It could still be statsd, but that would impact every backend, which doesn't seem to be the case at the moment.
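
If it does come to debugging on your side, one cheap check is to drop a tiny extra backend in alongside the cloudwatch one and log what statsd actually hands to backends at each flush. A rough sketch of such a backend (the file name and log format are just placeholders):

// debug-flush-backend.js -- minimal statsd backend that logs the gauge values
// handed to backends on every flush, so they can be compared with what the
// cloudwatch backend ends up publishing. Load it by adding its path to the
// backends array in the config.
var util = require('util');

exports.init = function (startupTime, config, events) {
  events.on('flush', function (timestamp, metrics) {
    // metrics.gauges is an object of gauge name -> current value
    console.log('flush @ ' + timestamp + ' gauges: ' + util.inspect(metrics.gauges));
  });
  return true;
};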

@BlueHatbRit BlueHatbRit self-assigned this Oct 2, 2020
@peterlitvak
Author

We are on v0.8.6 of statsd. I understand it could be a number of things and greatly appreciate you looking into this. It is especially hard to troubleshoot since it is pretty sporadic; for example, everything has been working fine for 4 days in a row now.

@BlueHatbRit
Member

No worries. I've had something running for the last 24h and haven't hit the issue; I'll keep it running for a bit longer, but since I've not seen this with non-AWS backends I'm inclined to say it's unlikely to be the statsd daemon right now. I'll keep things running and see what happens. Any logs you do manage to get would be fantastic, but I understand that's tough given the scenario. I'm wondering if we could make a logging change to support log file rotation; it's not something we've needed in the past, but it could be time for it if this issue persists.

@peterlitvak
Author

Appreciate your attention to the issue. I've changed our staging code to report the same value for the gauge from all of the nodes; we'll see if that positively affects stability.
