GPU fairness usage #4266

scarlett2018 · 2020-03-09T08:04:20Z

Scenario
Low-utilization jobs can block other users' jobs from being submitted. We'd like a service plugin that can:

detect the utilization of all jobs
notify users whose jobs have had low utilization over the past few days (default: below 20% over 5 days, customizable)
if the user can justify the job's usage, an admin can extend the job's lifetime; otherwise the low-utilization job will be killed automatically in 1 day.
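
The three steps above could be sketched as a single per-job decision function. This is only an illustration of the proposed policy, not an existing plugin: `JobUsage`, `decide`, and the constant names are hypothetical, and the rule that jobs younger than the 5-day window are left alone is an assumption the proposal does not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# All names here are hypothetical; none of them come from an OpenPAI API.
UTIL_THRESHOLD = 20.0        # percent, default from the proposal
WINDOW = timedelta(days=5)   # look-back window, default from the proposal
GRACE = timedelta(days=1)    # time between notification and automatic kill


@dataclass
class JobUsage:
    name: str
    avg_util: float                            # average GPU utilization (%) over WINDOW
    running_for: timedelta                     # how long the job has been running
    notified_at: Optional[datetime] = None     # when the user was notified, if ever
    extended_until: Optional[datetime] = None  # set by an admin after a justification


def decide(job: JobUsage, now: datetime) -> str:
    """Return 'ok', 'notify', 'wait', or 'kill' for a single job."""
    # Assumption: jobs that have not yet run for a full window are not judged.
    if job.running_for < WINDOW or job.avg_util >= UTIL_THRESHOLD:
        return "ok"
    # An admin extension overrides the low-utilization rule until it expires.
    if job.extended_until is not None and now < job.extended_until:
        return "ok"
    if job.notified_at is None:
        return "notify"
    if now - job.notified_at >= GRACE:
        return "kill"
    return "wait"
```

Keeping the decision separate from the actions (sending the email, stopping the job) would let the thresholds be customized per cluster without touching the notification or kill code.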

An alternative implementation: provide an incentive model with bonus tokens, and let users decide how to spend their GPU hours.

Binyang2014 · 2020-03-17T07:59:12Z

Add job start time and GPU hours to the job utilization report. Currently, rest-server only returns the job submission time and completion time; it does not return the time the job started running. Refer to Job API not return correct appLaunchedTime #4295
Change user GPU utilization to a weighted average. Because the REST API currently reports job duration as completion time minus submission time, not completion time minus start-running time, the weighted average might not be correct. Refer to Job API not return correct appLaunchedTime #4295
Add the date to the email notification's title, e.g. from "pai cluster utilization" to "pai cluster utilization - 3.17"
Add a status column showing the job status at the moment the report is generated
Add a GPU count column showing the number of GPUs used by the job
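
A minimal sketch of how the report items above could fit together, assuming the weighted average is weighted by GPU hours (the thread does not state the weight explicitly). `JobRecord`, `user_utilization`, and `report_title` are hypothetical names, not rest-server fields, and the duration is measured from the start-running time, which is the value #4295 says is not yet returned correctly.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class JobRecord:
    # Hypothetical report row; field names are illustrative only.
    name: str
    status: str       # job status at the moment the report is generated
    gpu_count: int    # number of GPUs used by the job
    start: datetime   # start-running time, not submission time
    end: datetime     # completion time, or report time for running jobs
    avg_util: float   # average GPU utilization in percent

    @property
    def gpu_hours(self) -> float:
        hours = (self.end - self.start).total_seconds() / 3600.0
        return self.gpu_count * hours


def user_utilization(jobs: List[JobRecord]) -> float:
    """GPU-hour-weighted average utilization across a user's jobs."""
    total = sum(j.gpu_hours for j in jobs)
    if total == 0:
        return 0.0
    return sum(j.avg_util * j.gpu_hours for j in jobs) / total


def report_title(day: datetime) -> str:
    """Email title with the report date appended, e.g. '... - 3.17'."""
    return f"pai cluster utilization - {day.month}.{day.day}"
```

With this weighting, a 4-GPU job at 10% for 10 hours and a 1-GPU job at 60% for 10 hours average to 20%, not the unweighted 35%. If the duration were measured from submission time instead, queueing time would inflate the GPU hours of jobs that waited longest and skew the result.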

scarlett2018 · 2020-03-17T08:18:10Z

Enable debugging mode for the debug VC.
(debugging mode: users can SSH into the node and use it for debugging for 1~2 hours; the system automatically disconnects the node when time is up.)
Prototype for enabling cluster-level policy for job management
(the prototype: disable the SSH port. Jobs are killed automatically if their utilization stays below 20% for 1~2 hours)
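
The "continuously lower than 20%" rule in the prototype could be checked roughly as follows. This is a sketch under assumptions: `continuously_low` is a hypothetical helper, utilization is assumed to arrive as time-ordered `(timestamp, percent)` samples, and a job with less history than the window is never flagged.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]  # (timestamp, GPU utilization in percent)


def continuously_low(
    samples: List[Sample],
    now: datetime,
    window: timedelta = timedelta(hours=1),
    threshold: float = 20.0,
) -> bool:
    """True if every sample in the last `window` is below `threshold`.

    `samples` must be ordered by timestamp, oldest first.
    """
    # Assumption: do not flag a job until a full window of history exists.
    if not samples or now - samples[0][0] < window:
        return False
    recent = [util for ts, util in samples if now - ts <= window]
    return bool(recent) and all(util < threshold for util in recent)
```

A single sample at or above the threshold inside the window resets the decision, which is what distinguishes "continuously low" from a low average.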

scarlett2018 added the gpu-utilization label Mar 9, 2020

scarlett2018 changed the title ~~Service Plugin for GPU fairness usage~~ GPU fairness usage Mar 9, 2020

scarlett2018 assigned Binyang2014 Mar 17, 2020

scarlett2018 pinned this issue Mar 20, 2020

scarlett2018 added the pai-dev label Apr 17, 2020

scarlett2018 unpinned this issue Jun 22, 2020

scarlett2018 mentioned this issue Jun 22, 2020

2020 June~July Release #4575

