This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

GPU fairness usage #4266

Open
scarlett2018 opened this issue Mar 9, 2020 · 2 comments

@scarlett2018
Member

scarlett2018 commented Mar 9, 2020

Scenario
Low-utilization jobs can block other users' jobs from being submitted. We'd like a service plugin that can:

  • detect the utilization of all jobs
  • notify users whose jobs have had low utilization over the past few days (default: below 20% over 5 days, customizable)
  • if the user has a justification for the job's usage, an admin can extend the job's lifetime; otherwise the low-utilization job will be killed automatically in 1 day.
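
The flow in the list above can be sketched roughly as follows. This is only an illustration, not the plugin's actual design: the `JobUsage` record, the `decide_action` function, and the action names are hypothetical, and it assumes the 1-day countdown starts when the user is notified.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobUsage:
    """Hypothetical per-job record; field names are assumptions."""
    name: str
    user: str
    avg_gpu_util: float                    # percent, averaged over the look-back window (e.g. 5 days)
    notified_at: Optional[datetime] = None  # when the user was last warned, if ever
    extended_by_admin: bool = False         # admin accepted the user's justification


def decide_action(job: JobUsage, now: datetime,
                  threshold: float = 20.0,
                  grace: timedelta = timedelta(days=1)) -> str:
    """Return one of "keep", "notify", "wait", "kill" for a job."""
    # Jobs above the threshold, or explicitly extended by an admin, are left alone.
    if job.avg_gpu_util >= threshold or job.extended_by_admin:
        return "keep"
    # First time the job is seen below the threshold: warn the user.
    if job.notified_at is None:
        return "notify"
    # Already warned: kill once the grace period has elapsed (assumed to start at notification).
    if now - job.notified_at >= grace:
        return "kill"
    return "wait"
```

Both the threshold and the grace period are parameters here because the proposal calls for the defaults to be customizable.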

An alternative implementation: provide an incentive model with bonus tokens for users, and let each user decide how to spend their GPU hours.

@scarlett2018 changed the title from "Service Plugin for GPU fairness usage" to "GPU fairness usage" on Mar 9, 2020
@Binyang2014
Contributor

Binyang2014 commented Mar 17, 2020

  • Add job start time and GPU hours to the job utilization report. Currently, rest-server returns only the job submission time and the job completion time, not the time the job started running. Refer to "Job API not return correct appLaunchedTime" #4295
  • Change user GPU utilization to a weighted average. Because the REST API currently computes job duration as completion time minus submission time, rather than completion time minus start-running time, the weighted average might not be correct. Refer to "Job API not return correct appLaunchedTime" #4295
  • Add the date to the email notification's title, e.g. from "pai cluster utilization" to "pai cluster utilization - 3.17"
  • Add a status column showing the job's status at the time the report is generated
  • Add a GPU count column showing the number of GPUs used by the job
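
A minimal sketch of the weighted average and the dated email title from the list above. The comment does not say what the weights are; this sketch assumes each job is weighted by its GPU hours (GPU count times running duration, measured from the start-running time rather than the submission time). The function names and the tuple layout are hypothetical.

```python
from datetime import date, datetime
from typing import Iterable, Tuple

# (utilization percent, start-running time, completion time, GPU count)
JobRecord = Tuple[float, datetime, datetime, int]


def gpu_hours(start: datetime, end: datetime, gpu_count: int) -> float:
    """GPU hours consumed by a job, measured from when it started running."""
    return (end - start).total_seconds() / 3600.0 * gpu_count


def weighted_utilization(jobs: Iterable[JobRecord]) -> float:
    """Average utilization across jobs, weighted by each job's GPU hours (assumed weighting)."""
    total_hours = 0.0
    weighted_sum = 0.0
    for util, start, end, gpu_count in jobs:
        hours = gpu_hours(start, end, gpu_count)
        total_hours += hours
        weighted_sum += util * hours
    # Avoid dividing by zero when the user has no GPU hours in the period.
    return weighted_sum / total_hours if total_hours > 0 else 0.0


def report_title(day: date) -> str:
    """Email title with the date appended in the "month.day" form used above."""
    return f"pai cluster utilization - {day.month}.{day.day}"
```

If the duration were taken from the submission time instead, a job that waited in the queue for a long time would be given too much weight, which is the concern raised about #4295.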

@scarlett2018
Member Author

scarlett2018 commented Mar 17, 2020

  • Enable debugging mode for the debug VC.
    (Debugging mode: users can SSH into the node and use it for debugging for 1~2 hours; the system will automatically disconnect the node when time is up.)

  • Prototype for enabling a cluster-level policy for job management.
    (The prototype: disable the SSH port. Jobs will be automatically killed if their utilization stays below 20% for 1~2 hours.)
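
The "continuously lower than 20%" condition in the prototype could be checked along these lines. This is a sketch under assumptions: utilization is taken to arrive as timestamped samples sorted by time, and `continuously_low` is a hypothetical helper, not part of any existing service.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]  # (timestamp, GPU utilization percent)


def continuously_low(samples: List[Sample], now: datetime,
                     threshold: float = 20.0,
                     window: timedelta = timedelta(hours=1)) -> bool:
    """True if every sample in the last `window` is below `threshold`.

    `samples` must be sorted by timestamp in ascending order.
    """
    cutoff = now - window
    # Require history reaching back to the start of the window, so that a job
    # that has only just started is not treated as idle.
    if not samples or samples[0][0] > cutoff:
        return False
    recent = [util for ts, util in samples if ts >= cutoff]
    return bool(recent) and all(util < threshold for util in recent)
```

A single sample at or above the threshold inside the window resets the decision, which is what distinguishes "continuously low" from a low average.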

@scarlett2018 scarlett2018 pinned this issue Mar 20, 2020
@scarlett2018 scarlett2018 unpinned this issue Jun 22, 2020