Ray TPU Webhook Reliability Improvements #723

Merged: 61 commits merged into GoogleCloudPlatform:main on Jul 25, 2024


@ryanaoleary commented Jul 4, 2024

This PR depends on changes from #740, which should be merged first (c4cdf31 marks the start of the changes in this PR).

This PR improves the reliability of the webhook by making it stateless between calls, which fixes issues caused by the sliceToWorkers mapping being cleared when the webhook restarts. The changes add a Kubernetes PodInformer to the webhook that watches Pods in the GKE cluster with the ray.io/node-type=worker label. The webhook can then determine the next replicaIndex and TPU_WORKER_ID from the PodInformer cache, which removes the need to intercept Pod deletion requests.

This PR has been tested as follows:

  • Unit tests
  • Manual tests using single-host, multi-host, multi-slice, and an autoscaling RayCluster with a TPU worker group added; intercepting additional Pods after a webhook restart; and multiple webhook replicas with a KubeRay operator restart

@ryanaoleary self-assigned this on Jul 4, 2024
@ryanaoleary changed the title from "Ray TPU Webhook Auto-scaling Support and Reliability Improvements" to "Ray TPU Webhook Autoscaling Support and Reliability Improvements" on Jul 4, 2024
@ryanaoleary deleted the autoscaling-changes branch on July 15, 2024 22:25
@ryanaoleary restored the autoscaling-changes branch on July 15, 2024 23:22
@ryanaoleary reopened this on Jul 15, 2024
@ryanaoleary changed the title from "Ray TPU Webhook Autoscaling Support and Reliability Improvements" to "Ray TPU Webhook Reliability Improvements" on Jul 15, 2024
ray-on-gke/tpu/kuberay-tpu-webhook/main.go: review threads resolved
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
ryanaoleary and others added 2 commits July 25, 2024 00:56

@andrewsykim left a comment

LGTM!

@ryanaoleary merged commit 0ae82b1 into GoogleCloudPlatform:main on Jul 25, 2024
6 checks passed