Collect Datastore Metrics? #1

Closed
exbane opened this issue Aug 4, 2016 · 14 comments

Comments

@exbane

exbane commented Aug 4, 2016

Good evening @oxalideops

Will collecting the datastore metrics from each host aggregate them, so that I could graph, say, VMFS-Volume-01 as a single graph with its throughput/IO/latency?

Also does your script collect the 20 second stats from each ESXi host or does it pull the stats through the API from the vCenter DB?

Thank you!

-exbane

@sofixa
Member

sofixa commented Aug 4, 2016

Hi @exbane .

With the example config file, when collecting the following metrics on the VirtualMachine level:


                { "Metric": "datastore.numberReadAveraged.average", "Instances": "*" },
                { "Metric": "datastore.numberWriteAveraged.average", "Instances": "*" },
                { "Metric": "datastore.read.average", "Instances": "*" },
                { "Metric": "datastore.write.average", "Instances": "*" },
                { "Metric": "datastore.totalReadLatency.average", "Instances": "*" },
                { "Metric": "datastore.totalWriteLatency.average", "Instances": "*" },

You get all IO/throughput/latency per VM, and each VM is tagged with its datastore. So, with a query that sums these metrics and groups by datastore, you can graph each datastore.

For example, this is the query I use in Grafana to get a table with all datastores and everything that happens on them:

SELECT (sum("datastore_numberwriteaveraged_average") + sum("datastore_numberreadaveraged_average")) / 60 as "IOPS avg", max("datastore_numberwriteaveraged_average") + max("datastore_numberreadaveraged_average") as "IOPS max", mean("datastore_totalreadlatency_average") + mean("datastore_totalwritelatency_average") as "latency avg", max("datastore_totalreadlatency_average") + max("datastore_totalwritelatency_average") as "latency max" FROM "virtualmachine" WHERE "host" =~ /^$host$/ AND $timeFilter GROUP BY "datastore"

Notes:

  • $host is a template variable, so I can filter by host
  • I am using (sum + sum) / 60 to get the number of IOPS: I take the sum of all reads plus the sum of all writes for all VMs over the collection period (60s in my case) and divide by 60 to get the average per-second value
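
The arithmetic in the second note can be sketched in Python. This is only an illustration with made-up numbers: the VM names, the second datastore and the read/write counts are hypothetical, and it assumes, as the note does, that each stored value is a count for one 60-second collection period.

```python
from collections import defaultdict

# Hypothetical per-VM values for a single 60-second collection period.
points = [
    {"name": "vm-a", "datastore": "VMFS-Volume-01", "reads": 1200, "writes": 600},
    {"name": "vm-b", "datastore": "VMFS-Volume-01", "reads": 300, "writes": 300},
    {"name": "vm-c", "datastore": "VMFS-Volume-02", "reads": 60, "writes": 0},
]


def iops_by_datastore(points, period_seconds=60):
    """Sum reads and writes of all VMs per datastore, then average per second."""
    totals = defaultdict(int)
    for point in points:
        totals[point["datastore"]] += point["reads"] + point["writes"]
    return {datastore: total / period_seconds for datastore, total in totals.items()}


result = iops_by_datastore(points)
print(result)  # {'VMFS-Volume-01': 40.0, 'VMFS-Volume-02': 1.0}
```

Grouping by the datastore tag is what turns per-VM values into one figure per datastore, which is the role GROUP BY "datastore" plays in the query.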

As for your other question, all stats are pulled from vCenter's statistics database via the VMware Performance Manager API (I think that's what it's called), where applicable (you can also put a host directly in the config file, but there are a lot fewer statistics available).

If you have any more questions, don't hesitate!

Cheers,
Adrian

@exbane
Author

exbane commented Aug 4, 2016

Good morning @oxalideops!

Thank you kindly! I will test it out this morning when I get to work and let you know the results :)

Appreciate it.

-exbane

@exbane
Author

exbane commented Aug 4, 2016

i'm a dummy @sofixa - completely didn't read the name of the person who commented :)

So I got my environment set up to send metrics to my InfluxDB server (which is v1.0). I'm currently just running it against my QA/DEV vCenter, which has 168 hosts and 1700 VMs. I can see that it inventories ALL of the hosts/clusters/datastores/VMs EXTREMELY FAST, but I'm not actually seeing any metrics come into the InfluxDB database. The script takes about 10 seconds to run. The account I'm using is an admin account (which I'll change to a read-only account after I get things situated). Maybe I'm not doing my Grafana graphs properly.

@sofixa
Member

sofixa commented Aug 4, 2016

@exbane have you configured your InfluxDB output properly? You should see from the output whether it sends the data to Influx (you should see "sent data to Influxdb" if all is fine). Then you can check your InfluxDB logs (/var/log/influxdb/influxd.log) to see if everything was properly received, and from the influx CLI you can use SHOW SERIES and SHOW MEASUREMENTS to explore the data you have.
If everything is there, the problem is with your Grafana queries/datasource configuration :)
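
The same exploration statements can also be sent to InfluxDB's HTTP /query endpoint instead of the influx CLI. Here is a minimal Python sketch that only builds the request URL; the server address and the database name "mydb" are placeholders, and the helper name is made up for illustration.

```python
from urllib.parse import urlencode


def influx_query_url(base_url, database, statement):
    """Build a URL for the InfluxDB 1.x HTTP /query endpoint."""
    return f"{base_url}/query?{urlencode({'db': database, 'q': statement})}"


# Hypothetical server address and database name.
url = influx_query_url("http://localhost:8086", "mydb", "SHOW MEASUREMENTS")
print(url)  # http://localhost:8086/query?db=mydb&q=SHOW+MEASUREMENTS
```

Fetching that URL (for example with urllib.request.urlopen) returns a JSON document; an empty result there would point at the collection side rather than at Grafana.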

Cheers,
Adrian

@exbane
Author

exbane commented Aug 5, 2016

Hello @sofixa ! So I did the following and it produced this output.

[root@blue201 bin]# vsphere-influxdb-go
2016/08/05 13:31:25 Starting : vsphere-influxdb-go
2016/08/05 13:31:25 connecting to vcenter: vcenter202.ops.global.ad
2016/08/05 13:31:25 Successfully connected to Influx

2016/08/05 13:31:25 Querying vcenter
2016/08/05 13:31:25 Setting up query inventory of vcenter: vcenter202.ops.global.ad
2016/08/05 13:31:25 connecting to vcenter: vcenter202.ops.global.ad
2016/08/05 13:31:43 sent data to Influxdb
[root@blue201 bin]#

Then i went to my influxdb server and looked at the influxdb log..
[root@influxdb201 419]# tail /var/log/influxdb/influxd.log -n 1000| grep pegasus
[httpd] 2016/08/05 13:31:43 10.4.168.127 - vmware [05/Aug/2016:13:31:41 -0400] POST /write?consistency=&db=pegasus&precision=s&rp= HTTP/1.1 204 0 - InfluxDBClient 790e9864-5b32-11e6-8ea1-000000000000 1.609145152s
[root@influxdb201 419]#

pegasus is the DB i used to store the metrics in InfluxDB... That's the only log entry for that DB whenever i run the script..

When i do a show series through the data explorer in InfluxDB I can see close to 32k entries for all the systems i have etc etc.

Here is an example of a single VM from each measurement from when i did a show series..

cpu,cluster=Cluster01,datastore=Datastore01,esx=hostblahblah,host=vcenter,instance=0,name=vmname010

disk,host=vcenter,instance=naa_514f0c50efc00002,name=hostblahblahblah

hostsystem,host=vcenter,name=hostblahblahblah

net,cluster=Cluster01,datastore=Datastore01,esx=hostblahblah,host=vcenter,instance=vmnic6,name=vmname010

virtualmachine,cluster=Cluster01,datastore=Datastore01,esx=hostblahblah,host=vcenter,name=vmname010

It looks like it only created 5 different types of measurements. I don't see anything for RAM or other types of disk resources like latency, throughput, etc.

What am i doing wrong! :)

Thank you again!

@exbane
Author

exbane commented Aug 5, 2016

BTW - i'm on vSphere6 Update2 for vCenter and all of my ESXi hosts..

@sofixa
Member

sofixa commented Aug 5, 2016

Nothing, that's the expected result :)

Measurements in InfluxDB are similar to tables in an SQL database: they are just an organizational unit, and the data itself (values in "columns") sits below that.

In this case, VM CPU usage is under virtualmachine, so a SELECT * FROM virtualmachine will show you an entry for each VM for each collection, with columns (fields in InfluxDB parlance) like cpu_usage_average=26182 (and whatever metrics you have configured in the .json config file).
And there are your datapoints :)
For exploration, you can use SELECT * FROM each measurement (with the occasional LIMIT 100 if you have too much data and querying it all at once would be heavy on the server) and Grafana's auto-suggestions for the fields.
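
To make the measurement/tag/field split concrete, here is a small Python sketch that takes one of the SHOW SERIES lines posted earlier in this thread and splits it into its measurement and tags. It is a simplification that assumes no escaped commas or equals signs in the key; the function name is made up for illustration.

```python
def parse_series_key(key):
    """Split 'measurement,tag1=v1,tag2=v2' into (measurement, tags)."""
    measurement, *pairs = key.split(",")
    tags = dict(pair.split("=", 1) for pair in pairs)
    return measurement, tags


key = (
    "virtualmachine,cluster=Cluster01,datastore=Datastore01,"
    "esx=hostblahblah,host=vcenter,name=vmname010"
)
measurement, tags = parse_series_key(key)
print(measurement)   # virtualmachine
print(sorted(tags))  # ['cluster', 'datastore', 'esx', 'host', 'name']
```

Note that no metric such as cpu_usage_average appears in the key: a series is identified only by its measurement and tags, and the fields show up once you SELECT from the measurement.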

Cheers and good luck,
Adrian

@exbane
Author

exbane commented Aug 8, 2016

That worked perfectly! it took me a little while to get used to how the data is presented in InfluxDB and how to find it properly in grafana.. I should have known that i had to pick the metrics through the Select "Field" option in grafana rather than the "From" area.. Duh :)

I do have one other problem for now.. maybe you can help.. i setup the crond service to run the script every minute and it launches the script but then comes back with this error.. any ideas?

2016-08-08T11:01:01.638978-04:00 blue201 crond[14772]: (vsphere-influxdb-go) ERROR (getpwnam() failed)

This is my crontab

  * * * * * vsphere-influxdb-go >/dev/null 2>&1

Have you ever thought about adding in Custom Attributes as Tags? We use the CAs to label everything from owner to application to business unit.. It would be awesome if i could filter specific VMs by Custom Attribute that is assigned to them.. Mainly for end user viewing the performance of their specific systems.. It's an idea. I'm not sure how much you are developing in this space anymore :)

Thank you again! you've been very helpful.

Adam

@exbane
Author

exbane commented Aug 8, 2016

I also noticed when doing this query
SHOW TAG VALUES FROM "virtualmachine" WITH KEY = "cluster"

Not all of my clusters show up: I'm missing 2 of my clusters in the list. When I looked in hostsystem I can see the ESX hosts for those particular clusters, but the clusters just don't show up under cluster when I run the query. Thoughts?

@sofixa
Member

sofixa commented Aug 8, 2016

Hi,

For your crontab error, i found this which seems similar.
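
One likely cause, offered as an assumption rather than a confirmed diagnosis: in system-wide crontab files (/etc/crontab and files under /etc/cron.d) the sixth field is the user to run the job as, so a line without a user makes cron treat the command name as a user name, and the getpwnam() lookup fails. The Python sketch below mimics that lookup; the helper names and the /usr/local/bin path are hypothetical.

```python
import pwd


def sixth_field(line):
    """Return the sixth whitespace-separated field of a crontab line, if any."""
    fields = line.split()
    return fields[5] if len(fields) >= 6 else None


def user_exists(name):
    """Look the name up the way getpwnam() does; KeyError means no such user."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


# The entry as posted, and a variant with an explicit user field (assumed path).
without_user = "* * * * * vsphere-influxdb-go >/dev/null 2>&1"
with_user = "* * * * * root /usr/local/bin/vsphere-influxdb-go >/dev/null 2>&1"

for line in (without_user, with_user):
    candidate = sixth_field(line)
    print(candidate, user_exists(candidate))
```

If the entry lives in a per-user crontab (edited with crontab -e), there is no user field and this explanation does not apply.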

As for the missing tags, i'm not sure. Can you search for a VM you know is on one of those clusters that aren't showing up? Like this:

SELECT * FROM virtualmachine WHERE name =~ /yourVMName/

I haven't thought about adding the custom fields as tags because InfluxDB is rather sensitive about tags (you shouldn't have a lot of them, and only ones you're going to filter or group by), and in our case we use VM names for organisation and filter based on them.

Cheers,

@exbane
Author

exbane commented Aug 8, 2016

Understandable on the custom attributes and tags..

I'll check out the link you sent me - appreciate it.

As for the cluster stuff - I ran this query on a VM that is on one of the missing clusters and data does show up..

SELECT mean("cpu_ready_summation") FROM "virtualmachine" WHERE "name" = 'yourVMName' AND $timeFilter GROUP BY time($interval) fill(null)

It's definitely odd - I collected against another one of my vCenters and it only collected 6 out of the 9 clusters I have. My QA/Dev vCenter has 12 clusters and it's only getting 10 of them. Is there any kind of logging I can set up to see step by step what's going on with the script when it runs against the vCenter? Let me know if I'm being too much of a bother :)

Thank you!

-Adam

@exbane
Author

exbane commented Aug 8, 2016

I am using InfluxDB 1.0 Beta1 right now.. maybe that has something to do with it.. lol.. I assumed it would be okay.. What is the latest version you've used? I have one of my guys updating the InfluxDB server to v1.0 beta3 now.. If I have to deploy v0.12 or v0.13 I can do that..

@sofixa
Member

sofixa commented Aug 8, 2016

The InfluxDB version shouldn't make a difference. Could you show me the data you have for the VMs whose cluster isn't showing up? And is there something in common between the clusters that are missing (for example, they all have a space in their name, or a hyphen, or something similar)?
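
A quick way to check for such a pattern is to compare the clusters you expect with the values returned by SHOW TAG VALUES and list any non-alphanumeric characters in the missing names. A rough Python sketch follows; all cluster names are invented for illustration.

```python
import re


def missing_clusters(expected, seen):
    """Clusters known to vCenter that never show up as a tag value."""
    return sorted(set(expected) - set(seen))


def unusual_characters(name):
    """Characters in the name that are not ASCII letters or digits."""
    return sorted(set(re.findall(r"[^A-Za-z0-9]", name)))


# Hypothetical cluster names: 'expected' from vCenter, 'seen' from InfluxDB.
expected = ["Cluster01", "Cluster02", "QA Cluster", "Dev-Cluster"]
seen = ["Cluster01", "Cluster02"]

report = {name: unusual_characters(name) for name in missing_clusters(expected, seen)}
print(report)  # {'Dev-Cluster': ['-'], 'QA Cluster': [' ']}
```

If every missing name shares a character that the visible ones lack, that character is a good first suspect.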

@exbane
Author

exbane commented Aug 8, 2016

I can definitely do that.. Can we do that over email though? Then i can send screenshots etc etc. my email is adam.savage@monster.com

@sofixa sofixa closed this as completed Sep 14, 2016
sofixa pushed a commit that referenced this issue May 24, 2017
syncing parent master