TCollector terminating collectors after "inactivity", but outputs metrics on command line as same user (HadoopHttp class should flush stdout after each emit to avoid few/single metrics being held in buffer for subclassed programs) #398

Closed
HariSekhon opened this issue Jul 26, 2018 · 0 comments

Comments

@HariSekhon
Contributor

Hit a maddening issue: collectors I wrote that used to work stopped sending metrics to OpenTSDB. TCollector complained of no activity and killed them every 10 mins, even though running them on the command line as the same user showed them outputting metrics.

I had subclassed HadoopHttp to get the G1GC young gen + old gen duration metrics I need for HBase cluster tuning feedback (a quick workaround to #393). The collectors worked initially for a few days over the weekend, stopped working after a second rolling restart on Tuesday, and then worked intermittently but only on a subset of hosts, even though all the MD5s and everything else lined up (they were deployed from Git via Ansible, so all hosts had identical deployments).

It turns out the subclassed collectors emitted too few metrics, which stayed in the stdout buffer and never got flushed. They worked initially and then stopped once I applied a couple of improvements to the HBase cluster with rolling restarts, because the cluster was no longer having as many GCs. Old gen GC returned null, so the collector's output shrank to young gen only, which was not enough to fill the buffer and make it spill.
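To illustrate the buffering behaviour described above, here is a minimal sketch rather than TCollector code. `RecordingPipe` and `make_stdout_like` are hypothetical names, the metric line is made up, and the 8192-byte buffer size is an assumption standing in for whatever the real stdout buffer is when it is attached to a pipe instead of a terminal.

```python
import io


class RecordingPipe(io.RawIOBase):
    """Hypothetical stand-in for the pipe between a collector and its parent."""

    def __init__(self):
        super().__init__()
        self.received = b""

    def writable(self):
        return True

    def write(self, data):
        # Record whatever actually leaves the buffered layers above us.
        self.received += bytes(data)
        return len(data)


def make_stdout_like(pipe, buffer_size=8192):
    """Build a block-buffered text stream, similar to stdout writing to a pipe."""
    buffered = io.BufferedWriter(pipe, buffer_size=buffer_size)
    return io.TextIOWrapper(buffered, encoding="utf-8", line_buffering=False)


pipe = RecordingPipe()
out = make_stdout_like(pipe)

# A single short metric line is far smaller than the buffer, so it is held back.
out.write("example.gc.young.duration 1532600000 12 host=example\n")
held_back = pipe.received == b""

# An explicit flush pushes the pending line through to the reader.
out.flush()
delivered = pipe.received != b""

print("held back before flush:", held_back)
print("delivered after flush:", delivered)
```

On a terminal stdout is normally line-buffered, which would explain why the same collector appears to work when run by hand.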

The fix is to add

sys.stdout.flush()

after each emit. After I made this change to my collectors everything started working again.

This would be best done in the HadoopHttp library so that other subclassed programs don't get caught out.

It would probably also be worth adding a utility function emit() in collectors/lib/utils.py that implicitly flushes stdout, and encouraging all collectors to use it, in case a given collector doesn't emit enough metrics to make the buffer spill within the 10 mins before TCollector kills and restarts it.
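A rough sketch of what such a helper could look like follows. This is not the function that exists in collectors/lib/utils.py: the signature, the `tags` and `stream` parameters, and the example metric name are all assumptions made for illustration. It writes one line in the usual `metric timestamp value tag=value` form and flushes straight away.

```python
import sys
import time


def emit(metric, value, timestamp=None, tags=None, stream=None):
    """Write one metric line and flush it immediately (hypothetical helper)."""
    stream = sys.stdout if stream is None else stream
    if timestamp is None:
        timestamp = int(time.time())
    # Sort tags so the output is stable from run to run.
    tag_str = "".join(" %s=%s" % (k, v) for k, v in sorted((tags or {}).items()))
    line = "%s %d %s%s" % (metric, timestamp, value, tag_str)
    stream.write(line + "\n")
    # Flush on every emit so a low-volume collector never looks inactive.
    stream.flush()
    return line


# Example call with a made-up metric name and tag.
emit("example.gc.duration", 12, timestamp=1532600000, tags={"gen": "young"})
```

Flushing per line costs a little more I/O than letting the buffer fill, but collectors emit at a low enough rate that this should not matter in practice.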

@HariSekhon HariSekhon changed the title on Jul 26, 2018
from: TCollector complains no metrics, kills collector every 10 mins, but works on command line (HadoopHttp class should flush stdout after each emit to avoid few/single metrics being held in buffer for subclassed programs)
to: TCollector terminating collectors after "inactivity", but outputs metrics on command line as same user (HadoopHttp class should flush stdout after each emit to avoid few/single metrics being held in buffer for subclassed programs)
HariSekhon added a commit to HariSekhon/tcollector that referenced this issue Jul 26, 2018
@johann8384 johann8384 added the bug label Dec 4, 2018
@johann8384 johann8384 self-assigned this Dec 4, 2018
@johann8384 johann8384 added this to the 1.3.3 milestone Dec 5, 2018