Periodically check cluster job and update job info to database #33

Open
Fizzbb opened this issue Dec 31, 2021 · 1 comment

Comments

Collaborator

Fizzbb commented Dec 31, 2021

On a fixed interval (e.g. 30 seconds):
1) list all jobs, MPIJobs, and unified jobs;
2) find the pods associated with each job, then extract the job status (pending, running, done, failed, ...) and the pods' resource utilization (GPU, CPU, memory, network, ...);
3) save the info to MongoDB.
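
The polling loop described above could be sketched roughly as follows. This is only a minimal outline: `poll_jobs`, `collect`, and `save` are hypothetical names, not functions from this repository. `collect` stands in for steps 1 and 2 (listing jobs and reading pod status and utilization) and `save` for step 3 (writing to MongoDB).

```python
import time
from typing import Callable, Iterable, Optional


def poll_jobs(
    collect: Callable[[], Iterable[dict]],
    save: Callable[[dict], None],
    interval_s: float = 30.0,
    max_rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call collect() every interval_s seconds and pass each record to save().

    collect and save are hypothetical hooks (assumptions of this sketch):
    collect would list jobs and their pods, save would write one record
    to the database. Returns the number of records saved.
    """
    saved = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        started = time.monotonic()
        try:
            for record in collect():
                save(record)
                saved += 1
        except Exception as exc:
            # Keep the loop alive if a single round fails.
            print(f"poll round {rounds} failed: {exc}")
        rounds += 1
        if max_rounds is not None and rounds >= max_rounds:
            break
        # Sleep only for the remainder of the interval so rounds do not drift.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_s - elapsed))
    return saved
```

Passing `sleep` and `max_rounds` as parameters keeps the loop testable without waiting 30 seconds per round; in a real service both would stay at their defaults.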

Notes
1) Avoid duplicated info caused by custom jobs built on top of Kubernetes Jobs; e.g., an MPIJob will create a Job and then pods.
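
One possible way to avoid the duplication is to follow `metadata.ownerReferences`, the standard Kubernetes field that links a created object to its owner. The sketch below assumes the objects have already been fetched as plain dicts and that the operator sets owner references on the Jobs and pods it creates; the function names are hypothetical.

```python
from typing import Dict, List


def owner_uids(obj: dict) -> List[str]:
    """Return the UIDs listed in metadata.ownerReferences (empty if none)."""
    refs = obj.get("metadata", {}).get("ownerReferences") or []
    return [ref["uid"] for ref in refs]


def top_level_only(objects: List[dict]) -> List[dict]:
    """Drop every object that is owned by another object in the same list."""
    known = {obj["metadata"]["uid"] for obj in objects}
    return [o for o in objects if not any(uid in known for uid in owner_uids(o))]


def root_owner_uid(obj: dict, by_uid: Dict[str, dict]) -> str:
    """Follow ownerReferences upward until an object with no known owner."""
    current = obj
    seen = set()
    while True:
        uid = current["metadata"]["uid"]
        seen.add(uid)  # guard against accidental reference cycles
        parents = [u for u in owner_uids(current) if u in by_uid and u not in seen]
        if not parents:
            return uid
        current = by_uid[parents[0]]
```

With this, a Job created by an MPIJob is filtered out of the top-level list, and each pod can be attributed to exactly one root object via `root_owner_uid`.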

Fizzbb added this to the release130 milestone Jan 20, 2022
Collaborator Author

Fizzbb commented Jan 26, 2022

  1. GPU metrics are not merged into the Pod metrics. To get GPU utilization and GPU memory utilization, we need to map from pod name to GPU process ID; refer to the earlier functions in app.py. A pod can use multiple GPUs, even across nodes.
  2. Debug MongoDB records with duplicate keys.
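
For the first item, merging per-process GPU samples into per-pod metrics might look like the sketch below. It does not reproduce the functions in app.py; the input shapes (`pid_to_pod`, `gpu_samples`) and field names are assumptions made for illustration. The lookup is keyed by `(node, pid)` because process IDs are only unique within a node.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def merge_gpu_metrics(
    pid_to_pod: Dict[Tuple[str, int], str],
    gpu_samples: List[dict],
) -> Dict[str, dict]:
    """Group per-process GPU samples by pod name.

    pid_to_pod maps (node, pid) to a pod name; gpu_samples holds one dict
    per GPU process with keys node, pid, gpu_index, util_pct, mem_mib.
    Both shapes are hypothetical, chosen for this sketch.
    """
    per_pod: Dict[str, dict] = defaultdict(lambda: {"gpus": [], "mem_mib": 0})
    for sample in gpu_samples:
        pod = pid_to_pod.get((sample["node"], sample["pid"]))
        if pod is None:
            continue  # process does not belong to a tracked pod
        entry = per_pod[pod]
        # Keep one entry per GPU so multi-GPU pods stay distinguishable.
        entry["gpus"].append(
            {
                "node": sample["node"],
                "gpu_index": sample["gpu_index"],
                "util_pct": sample["util_pct"],
                "mem_mib": sample["mem_mib"],
            }
        )
        entry["mem_mib"] += sample["mem_mib"]
    return dict(per_pod)
```

The returned dict can then be attached to the matching pod record before it is saved.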

Projects
Status: In Progress