Collect Datastore Metrics? #1

exbane · 2016-08-04T01:03:34Z

Will collecting the datastore metrics from each host aggregate them, so I could graph, say, VMFS-Volume-01 as a single graph with its throughput/IO/latency?

Also, does your script collect the 20-second stats from each ESXi host, or does it pull the stats through the API from the vCenter DB?

Thank you!

-exbane

sofixa · 2016-08-04T08:30:40Z

Hi @exbane .

With the example config file, when collecting the following metrics on the VirtualMachine level:


                { "Metric": "datastore.numberReadAveraged.average", "Instances": "*" },
                { "Metric": "datastore.numberWriteAveraged.average", "Instances": "*" },
                { "Metric": "datastore.read.average", "Instances": "*" },
                { "Metric": "datastore.write.average", "Instances": "*" },
                { "Metric": "datastore.totalReadLatency.average", "Instances": "*" },
                { "Metric": "datastore.totalWriteLatency.average", "Instances": "*" },

You get all IO/throughput/latency per VM, and each VM is tagged with its datastore. So, with a query that sums these metrics and groups by datastore, you can graph your datastore.

For example, this is the query I use in Grafana to get a table with all datastores and everything that happens with them:

    SELECT (sum("datastore_numberwriteaveraged_average") + sum("datastore_numberreadaveraged_average")) / 60 as "IOPS avg",
           max("datastore_numberwriteaveraged_average") + max("datastore_numberreadaveraged_average") as "IOPS max",
           mean("datastore_totalreadlatency_average") + mean("datastore_totalwritelatency_average") as "latency avg",
           max("datastore_totalreadlatency_average") + max("datastore_totalwritelatency_average") as "latency max"
    FROM "virtualmachine"
    WHERE "host" =~ /^$host$/ AND $timeFilter
    GROUP BY "datastore"

Notes:

$host is a template variable, so I can filter by host.
I am using (sum + sum) / 60 to get the number of IOPS: I take the sum of all reads plus the sum of all writes for all VMs over the collection period (60s in my case) and divide by 60 to get the average per-second value.
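
A minimal sketch of the arithmetic described in the notes above: group the per-VM points by their datastore tag, add reads and writes, and divide by the collection period. The sample points, their values, and the helper name `iops_by_datastore` are made up for illustration; only the field names and the 60s period come from this thread.

```python
from collections import defaultdict

COLLECTION_PERIOD_S = 60  # collection period mentioned in the notes above

# Hypothetical per-VM points for a single collection run (values invented).
points = [
    {"name": "vm01", "datastore": "VMFS-Volume-01",
     "datastore_numberreadaveraged_average": 1200,
     "datastore_numberwriteaveraged_average": 600},
    {"name": "vm02", "datastore": "VMFS-Volume-01",
     "datastore_numberreadaveraged_average": 300,
     "datastore_numberwriteaveraged_average": 900},
    {"name": "vm03", "datastore": "VMFS-Volume-02",
     "datastore_numberreadaveraged_average": 60,
     "datastore_numberwriteaveraged_average": 0},
]


def iops_by_datastore(points, period_s=COLLECTION_PERIOD_S):
    """Sum reads + writes per datastore tag, then divide by the period."""
    totals = defaultdict(float)
    for p in points:
        totals[p["datastore"]] += (
            p["datastore_numberreadaveraged_average"]
            + p["datastore_numberwriteaveraged_average"]
        )
    return {ds: total / period_s for ds, total in totals.items()}


print(iops_by_datastore(points))
```

This mirrors what GROUP BY "datastore" does in the query: every VM on the same datastore contributes to one row.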

As for your other question, all stats are pulled from vCenter's statistics database via the VMware Performance Manager API (I think that's what it's called), where applicable. You can also put a host directly in the config file, but far fewer statistics are available that way.

If you have any more questions, don't hesitate!

Cheers,
Adrian

exbane · 2016-08-04T12:08:50Z

Good morning @oxalideops!

Thank you kindly! I will test it out this morning when I get to work and let you know the results :)

Appreciate it.

-exbane

exbane · 2016-08-04T16:07:19Z

I'm a dummy @sofixa - completely didn't read the name of the person who commented :)

So I got my environment set up to send metrics to my InfluxDB server (which is v1.0). I'm currently just running it against my QA/DEV vCenter, which has 168 hosts and 1700 VMs. I can see that it inventories ALL of the hosts/clusters/datastores/VMs EXTREMELY FAST, but I'm not actually seeing any metrics come into the InfluxDB database. The script takes about 10 seconds to run. The account I'm using is an admin account (which I'll change to a read-only account after I get things situated). Maybe I'm not doing my Grafana graphs properly.

sofixa · 2016-08-04T16:11:07Z

@exbane have you configured your InfluxDB output properly? The script's output should tell you whether it sends the data to Influx (you should see "sent data to Influxdb" if all is fine). Then you can check your InfluxDB logs (/var/log/influxdb/influxd.log) to see if everything was properly received, and from the influx CLI you can use SHOW SERIES and SHOW MEASUREMENTS to explore the data you have.
If everything is there, the problem is with your Grafana queries/datasource configuration :)
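
As an alternative to the influx CLI, the same exploration statements can be sent to the InfluxDB 1.x HTTP /query endpoint. The sketch below only builds the request URL and unpacks a response body; the base URL, the database name `mydb`, the sample response, and both helper names are assumptions for illustration.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, db, influxql):
    """Build a URL for the InfluxDB 1.x HTTP /query endpoint."""
    return f"{base_url}/query?" + urlencode({"db": db, "q": influxql})


def first_column(response_text):
    """Return the first column of every row in an InfluxDB 1.x JSON response."""
    body = json.loads(response_text)
    values = []
    for result in body.get("results", []):
        for series in result.get("series", []):
            values.extend(row[0] for row in series.get("values", []))
    return values


# Hypothetical response body for SHOW MEASUREMENTS (shape of the 1.x JSON API).
sample = (
    '{"results":[{"series":[{"name":"measurements","columns":["name"],'
    '"values":[["cpu"],["disk"],["virtualmachine"]]}]}]}'
)

print(build_query_url("http://localhost:8086", "mydb", "SHOW MEASUREMENTS"))
print(first_column(sample))
```

To run it against a live server, the URL could be fetched with `urllib.request.urlopen` and the decoded body passed to `first_column`.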

Cheers,
Adrian

exbane · 2016-08-05T17:42:47Z

Hello @sofixa ! So I did the following and it produced this output.

[root@blue201 bin]# vsphere-influxdb-go
2016/08/05 13:31:25 Starting : vsphere-influxdb-go
2016/08/05 13:31:25 connecting to vcenter: vcenter202.ops.global.ad
2016/08/05 13:31:25 Successfully connected to Influx

2016/08/05 13:31:25 Querying vcenter
2016/08/05 13:31:25 Setting up query inventory of vcenter: vcenter202.ops.global.ad
2016/08/05 13:31:25 connecting to vcenter: vcenter202.ops.global.ad
2016/08/05 13:31:43 sent data to Influxdb
[root@blue201 bin]#

Then I went to my InfluxDB server and looked at the InfluxDB log:
[root@influxdb201 419]# tail /var/log/influxdb/influxd.log -n 1000| grep pegasus
[httpd] 2016/08/05 13:31:43 10.4.168.127 - vmware [05/Aug/2016:13:31:41 -0400] POST /write?consistency=&db=pegasus&precision=s&rp= HTTP/1.1 204 0 - InfluxDBClient 790e9864-5b32-11e6-8ea1-000000000000 1.609145152s
[root@influxdb201 419]#

pegasus is the DB I used to store the metrics in InfluxDB. That's the only log entry for that DB whenever I run the script.

When I run SHOW SERIES through the data explorer in InfluxDB, I can see close to 32k entries for all the systems I have.

Here is an example of a single VM from each measurement from the SHOW SERIES output:

cpu,cluster=Cluster01,datastore=Datastore01,esx=hostblahblah,host=vcenter,instance=0,name=vmname010

disk,host=vcenter,instance=naa_514f0c50efc00002,name=hostblahblahblah

hostsystem,host=vcenter,name=hostblahblahblah

net,cluster=Cluster01,datastore=Datastore01,esx=hostblahblah,host=vcenter,instance=vmnic6,name=vmname010

virtualmachine,cluster=Cluster01,datastore=Datastore01,esx=hostblahblah,host=vcenter,name=vmname010

It looks like it only created 5 different types of measurements. I don't see anything for RAM or other types of disk resources like latency, throughput, etc.

What am I doing wrong! :)

Thank you again!

exbane · 2016-08-05T17:43:36Z

BTW - I'm on vSphere 6 Update 2 for vCenter and all of my ESXi hosts.

sofixa · 2016-08-05T18:42:01Z

Nothing, that's the expected result :)

Measurements in InfluxDB are similar to tables in an SQL database - they are just an organizational unit; the data itself (values in "columns") sits below that.

In this case, VM CPU usage is under virtualmachine, so a SELECT * FROM virtualmachine will show you an entry for each VM for each collection, with columns (fields in InfluxDB parlance) like cpu_usage_average=26182 (and whatever metrics you have configured in the .json config file).
And there are your datapoints :)
For exploration, you can use SELECT * FROM each measurement (with the occasional LIMIT 100 if you have too much data and querying it all at once would be heavy on the server) and Grafana's auto-suggestions for the field.
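
To make the table/column analogy concrete, here is a small sketch that splits one point in InfluxDB line-protocol form into its measurement, tags, fields, and timestamp. The tag set is the one from the SHOW SERIES output earlier in the thread; the second field, the timestamp, and the helper name `parse_point` are invented for illustration, and the parser deliberately ignores escaping and non-numeric field values.

```python
def parse_point(line):
    """Split one line-protocol point into measurement, tags, fields, timestamp.

    Simplified on purpose: assumes no escaped spaces or commas and
    plain numeric field values.
    """
    head, field_part, timestamp = line.split(" ")
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    fields = {
        key: float(value)
        for key, value in (pair.split("=", 1) for pair in field_part.split(","))
    }
    return measurement, tags, fields, int(timestamp)


# Hypothetical point: tags as shown by SHOW SERIES, field values made up.
line = (
    "virtualmachine,cluster=Cluster01,datastore=Datastore01,"
    "esx=hostblahblah,host=vcenter,name=vmname010 "
    "cpu_usage_average=26182,mem_usage_average=1024 1470418303"
)

measurement, tags, fields, timestamp = parse_point(line)
print(measurement)     # the "table"
print(tags)            # indexed metadata you can filter and group by
print(fields)          # the "columns" holding the actual metric values
```

SHOW SERIES only lists the measurement and tag part of each point, which is why the metric names do not appear there.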

Cheers and good luck,
Adrian

exbane · 2016-08-08T15:05:45Z

That worked perfectly! It took me a little while to get used to how the data is presented in InfluxDB and how to find it properly in Grafana. I should have known that I had to pick the metrics through the Select "Field" option in Grafana rather than the "From" area. Duh :)

I do have one other problem for now, maybe you can help. I set up the crond service to run the script every minute, and it launches the script but then comes back with this error. Any ideas?

2016-08-08T11:01:01.638978-04:00 blue201 crond[14772]: (vsphere-influxdb-go) ERROR (getpwnam() failed)

This is my crontab:

* * * * vsphere-influxdb-go >/dev/null 2>&1

Have you ever thought about adding Custom Attributes as tags? We use the CAs to label everything from owner to application to business unit. It would be awesome if I could filter specific VMs by the Custom Attribute assigned to them, mainly so end users can view the performance of their specific systems. It's an idea. I'm not sure how much you are developing in this space anymore :)

Thank you again! you've been very helpful.

Adam

exbane · 2016-08-08T17:18:09Z

I also noticed something when running this query:
SHOW TAG VALUES FROM "virtualmachine" WITH KEY = "cluster"

Not all of my clusters show up. I'm missing 2 of my clusters in the list. When I looked in hostsystem, I can see the ESX hosts for those particular clusters, but the clusters just don't show up when I run the query. Thoughts?

sofixa · 2016-08-08T17:45:13Z

Hi,

For your crontab error, I found this, which seems similar.

As for the missing tags, I'm not sure. Can you search for a VM you know is on one of those clusters that aren't showing up? Like this:
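
One plausible reading of that log line, stated as an assumption rather than a confirmed diagnosis: system crontabs (/etc/crontab and files under /etc/cron.d) expect a user name as the sixth field, so an entry with five schedule fields followed directly by the command makes cron look up the command name as a user. The sketch below splits such an entry and checks the user field; the helper names, the install path, and the choice of `root` are hypothetical.

```python
import pwd


def split_system_cron_line(line):
    """Split a system-crontab entry into (schedule, user, command).

    System crontabs carry a user name as the sixth field; per-user
    crontabs edited with `crontab -e` do not.
    """
    parts = line.split(None, 6)
    if len(parts) < 7:
        raise ValueError("expected 5 schedule fields, a user, and a command")
    return " ".join(parts[:5]), parts[5], parts[6]


def user_exists(name):
    """Return True if the local passwd database knows this user name."""
    try:
        pwd.getpwnam(name)
    except KeyError:
        return False
    return True


# Assumed form of the failing entry: the command sits in the user field.
broken = "* * * * * vsphere-influxdb-go >/dev/null 2>&1"
# Hypothetical fix: name a user explicitly (path and user are examples).
fixed = "* * * * * root /usr/local/bin/vsphere-influxdb-go >/dev/null 2>&1"

for entry in (broken, fixed):
    schedule, user, command = split_system_cron_line(entry)
    print(f"user field = {user!r}, command = {command!r}")
```

If the entry lives in a per-user crontab instead, no user field is expected and this explanation would not apply.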

SELECT * FROM virtualmachine WHERE name =~ /yourVMName/

I haven't thought about adding the custom fields as tags because InfluxDB is rather sensitive about tags (you shouldn't have a lot of them, and you should only add the ones you're going to filter or group by). In our case, we use VM names for organisation and we filter based on them.

Cheers,

exbane · 2016-08-08T18:15:36Z

Understandable on the custom attributes and tags.

I'll check out the link you sent me - appreciate it.

As for the cluster stuff - I ran this query on a VM that is on one of the missing clusters, and data does show up:

SELECT mean("cpu_ready_summation") FROM "virtualmachine" WHERE "name" = 'yourVMName' AND $timeFilter GROUP BY time($interval) fill(null)

It's definitely odd. I collected against another one of my vCenters and it only collected 6 out of the 9 clusters I have. My QA/Dev vCenter has 12 clusters and it's only getting 10 of them. Is there any kind of logging I can set up to see step by step what's going on with the script when it runs against the vCenter? Let me know if I'm being too much of a bother :)

Thank you!

-Adam

exbane · 2016-08-08T18:30:17Z

I am using InfluxDB 1.0 Beta1 right now, maybe that has something to do with it, lol. I assumed it would be okay. What is the latest version you've used? I have one of my guys updating the InfluxDB server to v1.0 beta3 now. If I have to deploy a v0.12 or v0.13, I can do that.

sofixa · 2016-08-08T20:19:52Z

The InfluxDB version shouldn't make a difference. Could you show me the data you have for the VMs whose cluster isn't showing up? And is there something in common between the various clusters that are missing (for example, they all have a space in their name, or a hyphen, or something similar)?
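
A quick way to look for such a common pattern is to list the non-alphanumeric characters in each cluster name. The sketch below does that and also shows how a tag value would be escaped for InfluxDB line protocol, where commas, equals signs, and spaces must be backslash-escaped. The cluster names and helper names are hypothetical, and nothing here confirms that escaping is the cause of the missing clusters.

```python
# Characters that line protocol requires escaping inside tag values.
SPECIAL = (",", "=", " ")


def flag_names(names):
    """Map each name to the sorted non-alphanumeric characters it contains."""
    return {n: sorted({c for c in n if not c.isalnum()}) for n in names}


def escape_tag_value(value):
    """Backslash-escape commas, equals signs, and spaces in a tag value."""
    for ch in SPECIAL:
        value = value.replace(ch, "\\" + ch)
    return value


# Hypothetical cluster names, not taken from the reporter's environment.
clusters = ["Cluster01", "QA Cluster 02", "Dev-Cluster-03"]

for name, chars in flag_names(clusters).items():
    print(f"{name!r}: special characters {chars}, escaped {escape_tag_value(name)!r}")
```

Comparing the flagged names against the clusters missing from SHOW TAG VALUES would show quickly whether spaces or hyphens line up with the gap.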

exbane · 2016-08-08T20:22:01Z

I can definitely do that.. Can we do that over email though? Then i can send screenshots etc etc. my email is adam.savage@monster.com

syncing parent master

sofixa closed this as completed Sep 14, 2016

jakape mentioned this issue May 17, 2017

Resourcepool enable #19

Merged

sofixa pushed a commit that referenced this issue May 24, 2017

Merge pull request #1 from Oxalide/master

296f9eb
