Fix PodCache handling of multi node jobs #573
Merged
The PodCache used the job id to identify a pod uniquely. The issue is that a job id is no longer unique to a single pod, because a multi-node job has several pods that share the same job id.
So when you cancelled a multi-node job, it would delete one of the pods and then leave the others until the cache entry expired.
This meant we waited for the PodExpiry between each pod deletion of a multi-node job.
Similarly, we have a cache of submitted pods (pods that may not have been reported back via the API yet). It had the same problem: when many pods were submitted as part of a multi-node job, we would incorrectly report only one pod as submitted.
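To illustrate the idea behind the fix, here is a minimal sketch of an expiring cache keyed per pod rather than per job. It is not the code from this PR: the `Pod` type, its `PodNumber` field, the `podKey` format, and the `PodCache` methods shown here are all hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Pod is a hypothetical, pared-down stand-in for a real pod object.
type Pod struct {
	JobID     string
	PodNumber int // assumed per-pod index within a multi-node job
}

// podKey builds a key that is unique per pod. Keying on JobID alone
// would make every pod of a multi-node job collide on one entry.
func podKey(p Pod) string {
	return fmt.Sprintf("%s/%d", p.JobID, p.PodNumber)
}

type cacheEntry struct {
	pod    Pod
	expiry time.Time
}

// PodCache is a hypothetical expiring cache of pods.
type PodCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

func NewPodCache(ttl time.Duration) *PodCache {
	return &PodCache{ttl: ttl, entries: make(map[string]cacheEntry)}
}

// Add stores the pod under its per-pod key and resets its expiry.
func (c *PodCache) Add(p Pod) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[podKey(p)] = cacheEntry{pod: p, expiry: time.Now().Add(c.ttl)}
}

// Delete removes only the given pod, leaving its siblings in place.
func (c *PodCache) Delete(p Pod) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, podKey(p))
}

// GetAll returns every pod whose entry has not yet expired.
func (c *PodCache) GetAll() []Pod {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	pods := make([]Pod, 0, len(c.entries))
	for _, e := range c.entries {
		if now.Before(e.expiry) {
			pods = append(pods, e.pod)
		}
	}
	return pods
}
```

With a per-pod key, adding three pods of the same job yields three entries, so a cache of submitted pods can report all of them and a cache of deleted pods can track each deletion without waiting for an earlier entry to expire.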