Periodically check cluster job and update job info to database #33

Fizzbb · 2021-12-31T02:53:19Z

At a fixed interval (e.g., 30 seconds): 1) list all jobs, MPIJobs, and unified jobs; 2) find the pods associated with each job, then extract the job status (pending, running, done, failed, ...) and the pods' resource utilization (GPU, CPU, memory, network, ...); 3) save the info to MongoDB.
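The three steps above could be driven by a loop like the one below. This is only a sketch: `poll_jobs`, `list_jobs`, and `save_record` are hypothetical names, and the real collector would call the Kubernetes API and MongoDB inside those callbacks.

```python
import time


def poll_jobs(list_jobs, save_record, interval_seconds=30.0,
              max_cycles=None, sleep=time.sleep):
    """Call list_jobs() once per interval and hand each record to save_record().

    list_jobs and save_record are placeholders (assumptions) for the code that
    queries the cluster and writes to the database. max_cycles and sleep exist
    only so the loop can be stopped and tested.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        started = time.monotonic()
        for record in list_jobs():
            save_record(record)
        cycles += 1
        if max_cycles is not None and cycles >= max_cycles:
            break
        # Subtract the time spent collecting so cycles stay close to the interval.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_seconds - elapsed))
    return cycles
```

Subtracting the collection time from the sleep keeps the period close to the requested interval even when listing jobs is slow.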

Notes
1) Avoid duplicated info caused by custom jobs built on Kubernetes Jobs; e.g., an MPIJob creates a Job, which then creates pods.
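One possible way to avoid the duplication is to skip any object whose `metadata.ownerReferences` points at a custom job kind, so only the top-level job is recorded. The sketch below works on plain dicts shaped like Kubernetes API objects; `top_level_jobs` is a hypothetical helper, and the kind name `UnifiedJob` is an assumption, not taken from the project.

```python
def top_level_jobs(objects, custom_kinds=("MPIJob", "UnifiedJob")):
    """Return only the objects that are not owned by a custom job kind.

    objects: dicts shaped like Kubernetes API objects (assumption).
    custom_kinds: owner kinds whose children should be skipped; "UnifiedJob"
    is a placeholder name.
    """
    kept = []
    for obj in objects:
        owners = obj.get("metadata", {}).get("ownerReferences") or []
        if any(owner.get("kind") in custom_kinds for owner in owners):
            continue  # already covered by its parent custom job
        kept.append(obj)
    return kept


# Example: a Job created by an MPIJob is dropped, a standalone Job is kept.
sample = [
    {"kind": "MPIJob", "metadata": {"name": "train"}},
    {"kind": "Job", "metadata": {"name": "train-launcher",
                                 "ownerReferences": [{"kind": "MPIJob", "name": "train"}]}},
    {"kind": "Job", "metadata": {"name": "standalone"}},
]
names = [obj["metadata"]["name"] for obj in top_level_jobs(sample)]
```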

Fizzbb · 2022-01-26T18:12:20Z

GPU metrics are not yet merged into the pod metrics. To get GPU utilization and GPU memory utilization, we need to map from pod name to GPU process ID; refer to the previous functions in app.py. A pod could use multiple GPUs, and a job's pods could even span nodes.
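Once GPU samples have been attributed to pods, merging them into the pod metrics could look like the sketch below. It assumes the pod-name-to-process mapping has already been done (the app.py functions are not reproduced here); `merge_gpu_metrics` and the field names `pod`, `util`, and `mem_used_mib` are hypothetical.

```python
from collections import defaultdict


def merge_gpu_metrics(pod_metrics, gpu_samples):
    """Attach aggregated GPU figures to each pod's metrics.

    pod_metrics: {pod_name: {...}} with CPU/memory figures (assumed shape).
    gpu_samples: one dict per GPU in use, already tagged with its pod name
    (assumed shape; the tagging step itself is not shown).
    """
    per_pod = defaultdict(list)
    for sample in gpu_samples:
        per_pod[sample["pod"]].append(sample)

    merged = {}
    for pod, metrics in pod_metrics.items():
        samples = per_pod.get(pod, [])
        entry = dict(metrics)  # copy so the input is left unchanged
        entry["gpu_count"] = len(samples)
        # Average utilization across the pod's GPUs; None when it uses no GPU.
        entry["gpu_util_avg"] = (
            sum(s["util"] for s in samples) / len(samples) if samples else None
        )
        entry["gpu_mem_used_mib"] = sum(s["mem_used_mib"] for s in samples)
        merged[pod] = entry
    return merged
```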
Debug the MongoDB records with duplicate keys.
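A common cause of duplicate records in a periodic collector is inserting a new document on every cycle. One option, sketched here under the assumption that a job is uniquely identified by namespace, kind, and name, is to upsert on that key instead. `build_upsert` is a hypothetical helper; its result is meant to be passed to pymongo's `Collection.update_one(filter, update, upsert=True)`.

```python
def build_upsert(record, key_fields=("namespace", "kind", "name")):
    """Return (filter, update) documents for an upsert keyed on key_fields.

    The choice of key_fields is an assumption; pick whatever uniquely
    identifies a job in the cluster.
    """
    key = {field: record[field] for field in key_fields}
    update = {"$set": dict(record)}
    return key, update


# Usage with pymongo (not executed here; `collection` is an assumed handle):
#   key, update = build_upsert(record)
#   collection.update_one(key, update, upsert=True)
```

With an upsert, each cycle overwrites the existing document for a job rather than adding another one.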

Fizzbb assigned zliu374 Dec 31, 2021

Fizzbb added this to the release130 milestone Jan 20, 2022
