Fix PodCache handling of multi node jobs #573

Merged
2 commits merged into master from multi_node_job_cancellation_fixes on May 19, 2021

@JamesMurkin commented May 19, 2021

The PodCache used the job id to identify a pod uniquely. This no longer works, because JobId is no longer unique to a single pod: a multi node job has several pods that share the same JobId.

So when you cancelled a multi node job, it would delete one of the pods and then leave the others until the cache entry expired.

  • We prevent repeated deletion calls by holding an empty value for that pod in the cache. Because that value was keyed by job id, it also blocked the deletion calls for the job's other pods.

As a result, we waited for the PodExpiry between the deletion of each pod of a multi node job.

Similarly, we have a cache of submitted pods (that may not have been reported back via the API yet). When submitting many pods as part of a multi node job, we would incorrectly report only 1 pod as submitted.
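
To illustrate the idea, here is a minimal Go sketch of a cache keyed per pod rather than per job. It is not the code from this PR: the names `podKey`, `PodNumber`, `podCache`, `AddIfNotExists` and `CountForJob` are hypothetical, and the composite key of job id plus pod index is an assumption about what makes a pod unique.

```go
package main

import (
	"sync"
	"time"
)

// podKey is a hypothetical composite key. A job id alone is ambiguous for
// multi node jobs, so the pod's index within the job is added (an assumption).
type podKey struct {
	JobID     string
	PodNumber int
}

// podCache is an illustrative stand-in for a cache with expiring entries.
type podCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	expires map[podKey]time.Time // expiry time of each held entry
}

func newPodCache(ttl time.Duration) *podCache {
	return &podCache{ttl: ttl, expires: make(map[podKey]time.Time)}
}

// AddIfNotExists records the pod and returns true, unless an unexpired entry
// for the same pod is already held. A caller could use the result to decide
// whether to issue a deletion call for that pod.
func (c *podCache) AddIfNotExists(key podKey) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	if expiry, ok := c.expires[key]; ok && now.Before(expiry) {
		return false
	}
	c.expires[key] = now.Add(c.ttl)
	return true
}

// CountForJob returns how many unexpired entries belong to the given job.
func (c *podCache) CountForJob(jobID string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	count := 0
	for key, expiry := range c.expires {
		if key.JobID == jobID && now.Before(expiry) {
			count++
		}
	}
	return count
}
```

With a per-pod key, adding pods 0 and 1 of the same job both succeed, so each pod can be deleted without waiting for the other's entry to expire, and a count over the job's entries reflects every submitted pod. Keyed by job id only, the second pod would collide with the first.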

@JamesMurkin merged commit 4a41caf into master May 19, 2021
@JamesMurkin deleted the multi_node_job_cancellation_fixes branch May 19, 2021 14:25

2 participants