This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

refactor job-exporter #1840

Merged
merged 3 commits into from
Dec 11, 2018

xudifsd (Member) commented Dec 6, 2018

Implements #1764, fixes #889, and can partially solve #1719.

I split the monolithic job-exporter into 4 threads:

  • docker collector: collects metrics about the docker daemon's health
  • gpu collector: collects metrics about NVIDIA GPU utilization, and also provides GPU info to the container collector
  • container collector: collects task/service-specific usage metrics; it consumes the GPU info provided by the gpu collector and provides docker stats info to the zombie collector
  • zombie collector: reports how many zombie tasks exist in the cluster

We use the timeout parameter of Python 3's subprocess.check_output, so hopefully no single external command call will hang indefinitely. I have already tuned the timeout of each command call, using the 99th-percentile latency of that command plus some buffer time.
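To illustrate the timeout behaviour described above, here is a minimal sketch. It is not the code from this PR: the helper name `run_with_timeout` and the choice to return None on any failure are assumptions made for illustration. Only `subprocess.check_output` and its `timeout` parameter come from the standard library.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its stdout as text, or None if it fails or times out.

    Hypothetical helper: when the timeout expires, check_output kills the
    child process, waits for it, and raises TimeoutExpired.
    """
    try:
        return subprocess.check_output(
            cmd, timeout=timeout_seconds, universal_newlines=True
        )
    except subprocess.TimeoutExpired:
        # The command exceeded its budget and has already been killed.
        return None
    except (subprocess.CalledProcessError, OSError):
        # Non-zero exit status, or the binary could not be started.
        return None
```

A caller would pass a per-command budget, for example `run_with_timeout(["nvidia-smi", "-q", "-x"], 3)`, where both the command line and the 3 second value are placeholders rather than the values tuned in this PR.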

Also, since we use a multi-threaded model, a hang in a single collector will not block the other collectors from emitting metrics.
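The isolation between collectors can be sketched roughly as follows. This is a simplified illustration, not the structure used in the PR: the `Collector` class, the shared `results` dict, and the `demo` function are all hypothetical names, and the hang is simulated with an event that is never set while the demo runs.

```python
import threading
import time


class Collector(threading.Thread):
    """Hypothetical collector: calls collect_fn in a loop on its own thread."""

    def __init__(self, name, collect_fn, results, lock, interval=0.05):
        super().__init__(name=name, daemon=True)
        self.collect_fn = collect_fn
        self.results = results
        self.lock = lock
        self.interval = interval
        self.stop_event = threading.Event()

    def run(self):
        while not self.stop_event.is_set():
            value = self.collect_fn()
            if value is not None:
                with self.lock:
                    self.results[self.name] = value
            # Sleep until the next round, waking early if asked to stop.
            self.stop_event.wait(self.interval)


def demo():
    results, lock = {}, threading.Lock()
    never = threading.Event()

    def hanging():
        never.wait()  # simulates an external call that does not return
        return None

    def healthy():
        return 1

    collectors = [
        Collector("gpu", hanging, results, lock),
        Collector("docker", healthy, results, lock),
    ]
    for c in collectors:
        c.start()
    time.sleep(0.3)  # the gpu thread is stuck; the docker thread keeps running
    for c in collectors:
        c.stop_event.set()
    never.set()  # release the simulated hang so the thread can exit
    with lock:
        return dict(results)
```

Calling `demo()` shows that the healthy collector still publishes its value while the other one is blocked, because each loop runs on its own thread.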

coveralls commented Dec 8, 2018

Coverage decreased (-0.01%) to 51.762% when pulling 698bc02 on dixu/refactor-job-exporter into af921fa on master.

YanjieGao (Contributor) commented Dec 11, 2018

If a cmd hangs, what does the exporter do with this PR: does it keep hanging and send a timeout alert, or does it kill the cmd and then retry?

xudifsd (Member, Author) commented Dec 11, 2018

If a hang exceeds the predefined timeout value, the cmd will be killed and the call returns None; other parts will ignore this None and continue working. The cmd will be retried next time.
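The "ignore None and retry next time" behaviour might look roughly like the loop below. This is only a sketch under assumptions: `run_rounds`, `run_cmd`, and `parse` are hypothetical names, and the real exporter runs on a schedule rather than for a fixed number of rounds.

```python
def run_rounds(run_cmd, parse, rounds):
    """Collect for a fixed number of rounds, skipping rounds with no output.

    run_cmd is assumed to return the command's output, or None when the
    command was killed after its timeout.
    """
    emitted = []
    for _ in range(rounds):
        output = run_cmd()
        if output is None:
            # Nothing to report this round; the next round is the retry.
            continue
        emitted.append(parse(output))
    return emitted
```

With a command that times out on the first and third rounds, only the second and fourth rounds produce metrics, and no special retry logic is needed beyond the regular collection loop.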

@xudifsd xudifsd merged commit 2c08713 into master Dec 11, 2018
@xudifsd xudifsd deleted the dixu/refactor-job-exporter branch December 11, 2018 05:00
Development

Successfully merging this pull request may close these issues.

[exporter] gpu exporter can not indicate gpu failure