feat(k8s): adds Kubernetes support to Runhouse via Codeflare #120
POC: Adds Kubernetes support to Runhouse.
Uses codeflare-sdk to spin up Ray clusters.
Example Usage:
Prerequisites:
Ensure you have helm, kubectl, make, gnu-sed, and go installed and set up.
You will also need an EKS cluster and its kubeconfig.
Make sure port 50052 on your local machine is free to use. Check by running:
lsof -i :50052
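If lsof is not available, the same check can be approximated from Python. This is a minimal sketch, not part of the PR: the helper name port_is_free is hypothetical, and it only tests whether a TCP socket can be bound on the loopback address, which is a slightly narrower question than what lsof reports.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now.

    Hypothetical helper for illustration; binding fails with OSError
    when another process already holds the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # 50052 is the local port named in the prerequisites above.
    print("port 50052 free:", port_is_free(50052))
```

The socket is closed again as soon as the check finishes, so the port stays available for the setup steps that follow.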
Tested on Python 3.8.13
Reference this PR if you need to set up a fresh EKS cluster:
Terraform script: https://github.com/run-house/runhouse/blob/rs/k8s-poc-skypilot/runhouse/scripts/kubernetes_cluster/main.tf
PR: #109
NOTE about kube_config:
The default location is ~/.kube/config. If you plan to use it, make sure your kube_config is up to date and that its current-context references the intended cluster.
If you have the AWS CLI and are using EKS, you can update it like so:
aws eks update-kubeconfig --region {REGION_NAME} --name {NAME_OF_EKS_CLUSTER}
Example:
aws eks update-kubeconfig --region us-east-1 --name my-eks-cluster
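To confirm that current-context points at the intended cluster after updating, something like the following sketch could be used. It assumes the kubernetes Python package is installed, and it assumes the context name is either the bare cluster name or an EKS ARN ending in /<cluster name> (the default naming used by aws eks update-kubeconfig). Both helper names are hypothetical.

```python
def context_targets_cluster(context_name, cluster_name):
    """Return True if a kubeconfig context name refers to cluster_name.

    Assumption: the context is named after the cluster itself or after
    an EKS ARN of the form arn:aws:eks:<region>:<account>:cluster/<name>.
    """
    return context_name == cluster_name or context_name.endswith("/" + cluster_name)


def active_context_name():
    """Read the current-context name from the default kubeconfig location."""
    # Imported lazily so the pure helper above works without the package.
    from kubernetes import config

    _contexts, active = config.list_kube_config_contexts()
    return active["name"]


# Example usage (requires a kubeconfig on this machine):
# name = active_context_name()
# print(name, context_targets_cluster(name, "my-eks-cluster"))
```

If the check comes back false, re-running the update-kubeconfig command above for the right region and cluster name should switch the context.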
Setup process:
mkdir tester_directory && cd tester_directory
git clone git@github.com:RohanSreerama5/codeflare-sdk.git
cd codeflare-sdk
pip install ray pydantic kubernetes executing rich
pip install -e .
NOTE: You may need to first run python -m pip install --upgrade pip to install the codeflare-sdk in editable mode.
Start a Python shell and make sure you can run the following without errors:
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
from codeflare_sdk.cluster.auth import KubeConfigFileAuthentication
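The same import check can be scripted so that every missing module is reported at once rather than stopping at the first failure. This is a rough sketch; the helper name missing_modules is hypothetical, and the module paths are the two from the imports above.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    required = ["codeflare_sdk.cluster.cluster", "codeflare_sdk.cluster.auth"]
    # An empty list means the editable install is importable.
    print("missing:", missing_modules(required))
```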
From the tester_directory level, run:
git clone git@github.com:run-house/runhouse.git
cd runhouse
Then, run:
git checkout rs/k8s-testing
conda install grpcio -y
pip install -e .
Run import runhouse as rh in a Python shell to make sure it works.
Now Runhouse and codeflare-sdk are set up.
Now open the Runhouse folder and go to CodeflareTest.ipynb.
Run each of the cells, noting that cluster.install_k8s_operators() should only be run one time. Even if you restart the kernel, you do not need to run it again.
This notebook should set up a cluster, run a function, report the status of the cluster, and tear it down.
If you need to clean up your EKS cluster, do the following:
Removing the K8s operators on the EKS cluster:
Uninstall KubeRay Operator:
helm uninstall kuberay-operator
Uninstall Codeflare Operator:
From inside the codeflare-operator repo, run:
make uninstall -e SED=/opt/homebrew/opt/gnu-sed/libexec/gnubin/sed
make undeploy -e SED=/opt/homebrew/opt/gnu-sed/libexec/gnubin/sed
(If one of them errors out, it just means it has already been removed from EKS.)
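One way to confirm the uninstall took effect is to look for leftovers with kubectl: a kuberay-operator deployment in the default namespace, or an openshift-operators namespace that still exists. The sketch below is an illustration under assumptions, not part of the PR: the function names are hypothetical, and matching the deployment by the substring kuberay-operator is a guess at how the Helm release names it.

```python
import subprocess


def names_from_kubectl(output):
    """Strip the 'kind/' prefix from `kubectl get ... -o name` output lines."""
    return [line.split("/", 1)[1] for line in output.splitlines() if "/" in line]


def leftovers(namespaces, default_deployments):
    """List operator resources that should be gone after cleanup."""
    found = []
    if "openshift-operators" in namespaces:
        found.append("namespace/openshift-operators")
    # Assumption: the KubeRay deployment name contains 'kuberay-operator'.
    found += ["deployment/" + d for d in default_deployments if "kuberay-operator" in d]
    return found


def kubectl_names(*args):
    """Run `kubectl get <args> -o name` against the current context."""
    result = subprocess.run(
        ["kubectl", "get", *args, "-o", "name"],
        check=True, capture_output=True, text=True,
    )
    return names_from_kubectl(result.stdout)


# Example usage (requires kubectl and access to the cluster):
# print(leftovers(kubectl_names("namespaces"),
#                 kubectl_names("deployments", "-n", "default")))
```

An empty list suggests both operators were removed; anything else names the resource that is still present.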
NOTE: The codeflare-operator repo will already be inside your runhouse directory, as it was automatically cloned during setup.
Check your EKS cluster and make sure kuberay-operator has been removed from the default namespace.
Also, make sure that the openshift-operators namespace is gone. (This namespace will have had codeflare-operator inside it.)
PENDING
Post-POC activities
Links:
Codeflare SDK: https://github.com/RohanSreerama5/codeflare-sdk
Codeflare Operator: https://github.com/RohanSreerama5/codeflare-operator