This repository has been archived by the owner. It is now read-only.

Grafana graphs break when selecting "All" servers #9

Closed
alhardy opened this issue Jun 23, 2017 · 12 comments

@alhardy (Collaborator) commented Jun 23, 2017

@alhardy added the question label Jun 23, 2017

@markvincze commented Jun 23, 2017

Or, if it's not confidential, could you share the configuration of your dashboard? That would also be awesome.

Thanks,
Mark

@markvincze commented Jun 23, 2017

Btw, I managed to set up the aggregation nicely for all the graphs, but for none of the tables. I experimented quite a lot but couldn't get aggregation working there at all; I always end up seeing all the individual metric values in a long list.
Is it possible that the Table Panel works differently with Prometheus than with InfluxDB?

@Rurouni (Collaborator) commented Jun 24, 2017

You guys are ahead of me on this one. We have very simple proof-of-concept dashboards atm, and I am doing avg/sum across all tags with a filter only by service name, so I don't need All/multi-select.
So I am quite interested in what works best as well...
From reading the docs: http://docs.grafana.org/features/datasources/prometheus/

When the Multi-value or Include all value options are enabled, Grafana converts the labels from plain text to a regex compatible string. Which means you have to use =~ instead of =.

So =~ is the way to go, and I would probably just have a single dashboard with avg/sum and multi-select/all, because seeing pod metrics side by side is rarely needed and it doesn't scale to xxx number of pods....
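
To make the =~ point concrete, here is a rough sketch (not from the thread) of issuing the same kind of regex-matched query against the Prometheus HTTP API that Grafana generates when "All" is selected for a multi-value $server variable. The Prometheus address, the metric name, and the server label values are assumptions:

```csharp
// Sketch only: queries Prometheus the way Grafana does when "All" is selected,
// i.e. with a regex label matcher (=~) instead of an exact match (=).
// The Prometheus URL, metric name and server values below are assumptions.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class RegexMatcherQuerySketch
{
    static async Task Main()
    {
        // What Grafana substitutes for $server when "All" / multi-value is enabled.
        var serverRegex = "server-1|server-2|server-3";

        // Per-endpoint throughput summed across all selected servers.
        var promQl = "sum by (route) (rate(" +
                     $"application_httprequests_transactions{{server=~\"{serverRegex}\"}}[1m]))";

        using (var http = new HttpClient())
        {
            var url = "http://localhost:9090/api/v1/query?query=" + Uri.EscapeDataString(promQl);
            var json = await http.GetStringAsync(url);
            Console.WriteLine(json); // Raw query result, useful for checking what Grafana is plotting.
        }
    }
}
```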

@markvincze commented Jun 26, 2017

@Rurouni
I see, then I guess we are on the same page.
One minor change: I also had to change the unit of the response-time metric from milliseconds to seconds.

One thing that bothers me is that occasionally I get a huge spike in the throughput graph that I cannot explain. For example, my endpoints have a throughput of ~10-20 rpm, yet occasionally the graph shows a spike of ~500 rpm, like this:

[screenshot: throughput graph showing an unexplained spike]

And if I check the logs of my API, I cannot see a spike in the number of requests anywhere; it only shows up in Grafana.
Have you encountered anything like this?

@alhardy (Collaborator, Author) commented Jun 26, 2017

@markvincze, I didn't realise you were referring to having a per-server metric on the endpoint monitoring :) Could you PR your Grafana dashboard and I'll upload the changes to Grafana? Here's the dashboard in the repo.

I've checked a few of my production APIs for random spikes in per-endpoint throughput and I don't see anything like that. However, I am using the InfluxDB reporter, which just plots the 1-min rate calculated by App Metrics, which is slightly different from how the Prometheus reporter works.

[screenshot: per-endpoint throughput graph from the InfluxDB-backed dashboard]

Do you see the same thing on the overall throughput of your app?

@markvincze commented Jun 28, 2017

@alhardy Sure, I'll issue a PR. I removed all the Table Panels from my dashboard because they weren't working at all. Do you want me to include that in the PR, or leave those untouched?

Yep, the spikes are consistently there on the per-endpoint breakdown and on the overall throughput.

@alhardy (Collaborator, Author) commented Jun 28, 2017

@markvincze Cool, yeah leave the table panels untouched and I'll have a look.

Interesting re the spikes; I'll run some tests on my local setup to see if I can replicate it.

@Rurouni (Collaborator) commented Jun 28, 2017

@markvincze No, I haven't seen such spikes, but then I am not running it in any realistic environment yet.
From the look of it, it may be a GC pause that caused many requests to be delayed and processed at once, but who knows... I would first try to isolate whether it's a problem in the source data, in App Metrics, or in Prometheus itself. If you have logs of all requests in Elasticsearch, you can at least check the first one. Then write a simple polling C# program that reads the /metrics endpoint into a file to see whether the aggregated data from App Metrics is wrong. If all of this is correct, then it leaves you with either PromQL or Grafana :)
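
For reference, a minimal sketch of the polling program described above, assuming the API exposes its App Metrics text endpoint at http://localhost:5000/metrics and a 15-second polling interval (both are assumptions). Each scrape is appended with a timestamp so a spike in Grafana can be compared against the raw values App Metrics actually reported:

```csharp
// Minimal sketch of the "polling C# program" idea above.
// The metrics URL, output file name and polling interval are assumptions.
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class MetricsPollerSketch
{
    static async Task Main()
    {
        const string metricsUrl = "http://localhost:5000/metrics"; // assumed App Metrics endpoint
        const string outputFile = "metrics-dump.txt";

        using (var http = new HttpClient())
        {
            while (true)
            {
                // Scrape the raw text exposition and append it with a timestamp,
                // so it can later be compared against what Grafana shows for the same moment.
                var body = await http.GetStringAsync(metricsUrl);
                File.AppendAllText(outputFile,
                    $"--- {DateTime.UtcNow:O} ---{Environment.NewLine}{body}{Environment.NewLine}");

                await Task.Delay(TimeSpan.FromSeconds(15)); // roughly one scrape interval
            }
        }
    }
}
```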

@alhardy (Collaborator, Author) commented Aug 28, 2017

@markvincze Did you end up doing any more investigation here?

@markvincze commented Aug 28, 2017

Hi @alhardy, no, unfortunately I haven't had any time to look into this yet.

@alhardy (Collaborator, Author) commented Aug 28, 2017

OK, I'll leave the issue open for now.

@alhardy added the wontfix label and removed the question label Jun 29, 2018

@alhardy (Collaborator, Author) commented Jun 29, 2018

Can't replicate

@alhardy closed this Jun 29, 2018
