
Add per-task gearman metrics #2672

Merged
merged 2 commits into from Aug 3, 2016

Conversation

@nyanshak (Contributor) commented Jul 8, 2016

  • Add metrics to collect data on each individual task. This lets you see how many of each task are queued, to catch problems with any individual queue's processing.
  • Each new metric is tagged by task:<task_name>
  • New metrics: gearman.queued_by_task, gearman.running_by_task, gearman.workers_by_task
  • Here is an example in Metrics Explorer with this new data:
    [Screenshot: Metrics Explorer, 2016-07-08 at 11:11:54]
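As a rough sketch of the idea (the data shape and function below are hypothetical, for illustration only, not the check's actual parsing code), each per-task stat becomes three metrics sharing a `task:<task_name>` tag:

```python
def per_task_metrics(task_stats):
    """Turn per-task gearman stats into (metric_name, value, tags) tuples.

    task_stats: list of dicts such as
    {"task": "email", "queued": 3, "running": 1, "workers": 2}
    (a hypothetical shape chosen for this example).
    """
    metrics = []
    for stat in task_stats:
        # Every metric for this task carries the same task:<name> tag,
        # so each queue can be graphed and alerted on individually.
        tags = ["task:{}".format(stat["task"])]
        metrics.append(("gearman.queued_by_task", stat["queued"], tags))
        metrics.append(("gearman.running_by_task", stat["running"], tags))
        metrics.append(("gearman.workers_by_task", stat["workers"], tags))
    return metrics
```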

@nyanshak (Contributor, Author) commented Jul 8, 2016

Failing to install dependencies:

```text
[2016-07-08T16:23:23Z] >>>>>>>>>>>>>> INSTALL STAGE
pip install --upgrade pip setuptools
Requirement already up-to-date: pip in /home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages
Requirement already up-to-date: setuptools in /home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages
pip install           -r requirements.txt           --cache-dir /home/travis/.cache/pip           2>&1 >> /tmp/ci.log
pip install pycurl==7.19.5.1 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install pycurl==7.19.5.1' 2>&1 >> /tmp/ci.log
pip install psutil==3.3.0 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install psutil==3.3.0' 2>&1 >> /tmp/ci.log
pip install pysnmp-mibs==0.1.4 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install pysnmp-mibs==0.1.4' 2>&1 >> /tmp/ci.log
pip install pysnmp==4.2.5 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install pysnmp==4.2.5' 2>&1 >> /tmp/ci.log
pip install pymongo==3.2 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install pymongo==3.2' 2>&1 >> /tmp/ci.log
pip install kazoo==1.3.1 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install kazoo==1.3.1' 2>&1 >> /tmp/ci.log
pip install winrandom-ctypes --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install winrandom-ctypes' 2>&1 >> /tmp/ci.log
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-YwxK3s/winrandom-ctypes/
pip install paramiko==1.15.2 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install paramiko==1.15.2' 2>&1 >> /tmp/ci.log
pip install psycopg2==2.6 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install psycopg2==2.6' 2>&1 >> /tmp/ci.log
pip install wmi==1.4.9 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install wmi==1.4.9' 2>&1 >> /tmp/ci.log
pip install scandir==1.2 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log             || echo 'Unable to install scandir==1.2' 2>&1 >> /tmp/ci.log
pip install           --upgrade           -r requirements-test.txt           --cache-dir /home/travis/.cache/pip            2>&1 >> /tmp/ci.log
[2016-07-08T16:23:29Z] >>>>>>>>>>>>>> BEFORE_SCRIPT STAGE
```

* Adds gearman.{queued_by_task, running_by_task, workers_by_task}
  metrics to collect data on each individual task. This lets you see how
  many of each task are queued up, to catch problems with any individual
  queue not being processed.
* Each new metric is tagged by task:<task_name>
@olivielpeau (Member):
Hi @nyanshak, thanks for this addition!

Your PR looks good overall! One concern is that, depending on the number of different tasks in the gearman job server, the check may create a lot of metrics (i.e. the cardinality of the unique tag combinations on the *_by_task metrics could be very high).

In your experience how many different tasks would generally live on the job server?

I think the check can reasonably collect metrics on ~100 different tasks, but if this number can be higher in some environments I'd rather the check had a way of limiting the number of tasks the *_by_task metrics are collected on, and allow users to explicitly list the tasks to collect metrics on in the yaml configuration when it reaches 100 tasks (see the rabbitmq check configuration for an example of that).

Let me know what you think, thanks!

@olivielpeau olivielpeau added this to the Triage milestone Jul 27, 2016
@nyanshak (Contributor, Author):
I hadn't thought of that. In our environment we're generally looking at maybe 10 or so tasks. I can look into limiting the number of tasks.

@nyanshak (Contributor, Author):
I took a first shot at integrating your feedback. Let me know what you think.

```python
for stat in data:
    if len(specified_tasks) > MAX_NUM_TASKS:
        raise Exception(
            "The maximum number of tasks you can specify is %d.".format(MAX_NUM_TASKS))
```
@masci (Contributor) commented Jul 28, 2016:

You should use {} instead of %d with format; keep an eye on the default flavor in the test matrix on Travis: https://travis-ci.org/DataDog/dd-agent/builds/147788400
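To illustrate the mix-up this comment points out: `str.format` does not interpret printf-style placeholders, so the `%d` survives untouched and the argument is silently dropped:

```python
# str.format() only substitutes {}-style replacement fields; a printf-style
# %d is left in the string as-is, and extra positional args are ignored.
broken = "The maximum number of tasks you can specify is %d.".format(100)
# broken still ends with "%d." -- the 100 never made it into the string.

fixed = "The maximum number of tasks you can specify is {}.".format(100)
# fixed ends with "100."
```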

@nyanshak (Contributor, Author):
@masci updated that line

@masci (Contributor) commented Jul 29, 2016:

Thanks @nyanshak, all green! Waiting for a thumbs up from @olivielpeau.

```python
task_tags.append("task:{}".format(stat['task']))
self.gauge("gearman.running_by_task", running, tags=task_tags)
self.gauge("gearman.queued_by_task", queued, tags=task_tags)
self.gauge("gearman.workers_by_task", workers, tags=task_tags)
```
Member:

I'm not very familiar with gearmand's admin status command output, but do all the entries listed in the output have a different task name? (i.e. is each task field in the tasks list unique?)
If that's the case this works fine, but if not we need a different type of metric submission than gauge (probably increment). The reason is that with a gauge, multiple values submitted during the same run with the same metric name and the same tags overwrite one another, and only the last value is sent. With an increment, the values are summed instead.
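A toy model of the submission semantics described above (this is an illustration of the gauge-vs-increment distinction, not dd-agent's actual aggregation code):

```python
def submit_run(points, mode):
    """Simulate one check run. points is a list of
    (metric_name, tags, value); returns the value reported per
    (metric, tags) context at the end of the run."""
    reported = {}
    for name, tags, value in points:
        key = (name, tuple(sorted(tags)))
        if mode == "gauge":
            reported[key] = value  # last write wins: earlier values are lost
        elif mode == "increment":
            reported[key] = reported.get(key, 0) + value  # values are summed
    return reported

# Two submissions in the same run with identical name and tags:
points = [("gearman.queued_by_task", ["task:email"], 3),
          ("gearman.queued_by_task", ["task:email"], 5)]
# gauge keeps only the last value (5); increment reports 3 + 5 = 8
```

Since each task field in gearmand's status output turns out to be unique, each (metric, tags) context is written at most once per run, and gauge is safe here.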

@olivielpeau (Member):
Thanks @nyanshak for your changes!

I've added a few comments; let us know if you can work on addressing them.

@nyanshak (Contributor, Author) commented Jul 29, 2016

Split into two functions, one to collect the aggregate metrics and one to collect per-task metrics. I'm glad that I did that because I found a bug or two in the process.

> I'm not very familiar with gearmand's admin status command output, but do all the tasks listed in the output have a different task? (i.e. is each task field in the tasks list unique?)

Each task field is unique.

@olivielpeau @masci

@olivielpeau (Member):
Looks good, thanks @nyanshak!

Merging, we'll include this in the 5.9.0 release

@olivielpeau olivielpeau modified the milestones: 5.9.0, Triage Aug 3, 2016
@olivielpeau olivielpeau merged commit 382eba2 into DataDog:master Aug 3, 2016
@nyanshak nyanshak deleted the add-per-task-gearman-metrics branch August 3, 2016 20:49
scottgeary pushed a commit to vend/dd-agent that referenced this pull request Aug 9, 2016
* Adds gearman.{queued_by_task, running_by_task, workers_by_task}
  metrics to collect data on each individual task. This lets you see how
  many of each task are queued up, to catch problems with any individual
  queue not being processed.
* Each new metric is tagged by task:<task_name>
* Limits the maximum number of tasks on which per-task metrics are
  collected
truthbk pushed a commit that referenced this pull request Aug 10, 2016