In this notebook, we will go through the basics of using the SDK to:
 - Spin up a Ray cluster with our desired resources
 - View the status and specs of our Ray cluster
 - Take down the Ray cluster when finished

In [10]:
%pip uninstall codeflare-sdk -y
%pip install ../../dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl

Found existing installation: codeflare-sdk 0.0.0.dev0
Uninstalling codeflare-sdk-0.0.0.dev0:
  Successfully uninstalled codeflare-sdk-0.0.0.dev0
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Processing /home/christianzaccaria/Documents/GitHub/codeflare-sdk/dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl
Installing collected packages: codeflare-sdk
Successfully installed codeflare-sdk-0.0.0.dev0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [1]:
# Import pieces from codeflare-sdk
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication, list_cluster_details

In [2]:
# Create authentication object for user permissions
# If unused, the SDK will automatically check for a default kubeconfig, then the in-cluster config
# KubeConfigFileAuthentication can also be used to specify kubeconfig path manually
auth = TokenAuthentication(
    token = "sha256~pu4rVc-UrGDzHXIOX22BrV0pFLvjEcSZkZ6ECYKndhY",
    server = "https://api.chris-aisrhods.xb4x.p3.openshiftapps.com:443",
    skip_tls=False
)
auth.login()

'Logged into https://api.chris-aisrhods.xb4x.p3.openshiftapps.com:443'
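
The cell above hard-codes the token in the notebook. As a minimal sketch of an alternative, the credentials could be read from environment variables and passed to `TokenAuthentication` as keyword arguments. The variable names `CODEFLARE_TOKEN` and `CODEFLARE_SERVER` and the helper `load_credentials` are hypothetical, chosen for this example only; the SDK does not define them.

```python
import os


def load_credentials(env=None):
    """Return TokenAuthentication keyword arguments read from the environment.

    CODEFLARE_TOKEN and CODEFLARE_SERVER are hypothetical variable names
    chosen for this sketch; the SDK does not define them.
    """
    env = os.environ if env is None else env
    required = ("CODEFLARE_TOKEN", "CODEFLARE_SERVER")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "token": env["CODEFLARE_TOKEN"],
        "server": env["CODEFLARE_SERVER"],
        "skip_tls": False,  # same setting as the cell above
    }
```

With both variables exported before the notebook starts, the login cell would become `auth = TokenAuthentication(**load_credentials())` followed by `auth.login()`.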

Here, we want to define our cluster by specifying the resources we require for our batch workload. Below, we define our cluster object (which generates a corresponding RayCluster).

NOTE: We must specify the `image` which will be used in our RayCluster. We recommend you bring your own image which suits your purposes.
The example here is a community image.

In [26]:
# Create and configure our cluster object
# The SDK will try to find the name of your default local queue based on the annotation "kueue.x-k8s.io/default-queue": "true" unless you specify the local queue manually below
cluster = Cluster(ClusterConfiguration(
    name='raytesta122',
    namespace="default",
    head_cpus='500m',
    head_memory=2,
    head_extended_resource_requests={'nvidia.com/gpu':2}, # For GPU enabled workloads set the head_extended_resource_requests and worker_extended_resource_requests
    worker_extended_resource_requests={'nvidia.com/gpu':0},
    num_workers=2,
    worker_cpu_requests='250m',
    worker_cpu_limits=1,
    worker_memory_requests=4,
    worker_memory_limits=4,
    # image="", # Optional Field 
    write_to_file=False, # When enabled, Ray Cluster YAML files are written to /HOME/.codeflare/resources
    # local_queue="local-queue-name" # Specify the local queue manually
))

Yaml resources loaded for raytesta122
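
The configuration above mixes Kubernetes CPU quantity strings (`'500m'`, `'250m'`, where the `m` suffix means millicores) with plain integers. The sketch below shows how those values add up across the head and the two workers. The helper `cpu_cores` is hypothetical and not part of the SDK, and the totals assume `head_cpus` acts as both the request and the limit for the head node.

```python
def cpu_cores(quantity):
    """Convert a Kubernetes CPU quantity (e.g. '500m' or 2) to cores as a float."""
    text = str(quantity).strip()
    if text.endswith("m"):
        return int(text[:-1]) / 1000  # millicores -> cores
    return float(text)


# Values copied from the ClusterConfiguration above.
head_cpus = "500m"
worker_cpu_requests = "250m"
worker_cpu_limits = 1
num_workers = 2

# Assumption: head_cpus counts toward both the requested and the limit totals.
total_requested = cpu_cores(head_cpus) + num_workers * cpu_cores(worker_cpu_requests)
total_limit = cpu_cores(head_cpus) + num_workers * cpu_cores(worker_cpu_limits)

print(total_requested)  # 1.0
print(total_limit)      # 2.5
```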


Next, we bring our cluster up by calling the `up()` function below, which submits our Ray Cluster to the queue and begins the process of obtaining our resource cluster.

In [27]:
# Bring up the cluster
cluster.up()

In [3]:
list_cluster_details("default")

VBox(children=(ToggleButtons(description='Select an existing cluster:', options=('raytesta1', 'raytesta10', 'r…

HBox(children=(Button(description='Delete Cluster', icon='trash', style=ButtonStyle()), Button(description='Vi…

Output()

Now, we want to check on the status of our resource cluster, and wait until it is finally ready for use.
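
Waiting until the cluster is ready amounts to polling its status with a timeout. The sketch below is a generic polling helper, not the SDK's own mechanism; `wait_until_ready` and its parameters are hypothetical names. It takes any zero-argument callable that reports readiness, so it can be tried without a live cluster.

```python
import time


def wait_until_ready(is_ready, timeout=300.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Returns True once is_ready() succeeds, False if the deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Hypothetical usage, assuming cluster.status() returns a (status, ready) pair:
# ready = wait_until_ready(lambda: cluster.status()[1], timeout=600)
```

If your SDK version ships a built-in wait helper on the cluster object, prefer it over hand-rolled polling.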

In [None]:
from codeflare_sdk import get_cluster
get_cluster("raytest21", "default")

In [14]:
cluster.details()

RayCluster(name='raytestmar2k', status=<CodeFlareClusterStatus.FAILED: 5>, head_cpus=2, head_mem='8G', workers=1, worker_mem_min='2G', worker_mem_max='2G', worker_cpu=1, namespace='default', dashboard='https://ray-dashboard-raytestmar2k-default.apps.rosa.chris-aisrhods.xb4x.p3.openshiftapps.com', worker_extended_resources={}, head_extended_resources={})

In [None]:
def format_status(status):
    if status == "Ready":
        return '<span style="color: green;">Ready ✓</span>'
    elif status == "Suspended":
        return '<span style="color: orange;">Suspended ~</span>'
    elif status == "Starting":
        return '<span style="color: purple;">Starting ⌛</span>'
    elif status == "Failed":
        return '<span style="color: red;">Failed ✗</span>'
    else:
        return status

import ipywidgets as widgets
import pandas as pd
from IPython.display import display, HTML
data = {
    "name": ["RayTest1", "RayTest2", "RayTest3", "RayTest4"],
    "namespace": ["default", "usernamespace", "usernamespace", "usernamespace"],
    "head_gpu": [0, 1, 2, 0],
    "worker_gpu": [2, 0, 1, 0],
    "min_memory": [2, 4, 4, 2],
    "max_memory": [2, 4, 8, 4],
    "min_cpu": [1, 2, 4, 2],
    "max_cpu": [1, 4, 8, 2],
    "status": ["Ready", "Starting", "Suspended", "Failed"],
    "pods": [
        [{"pod": "head", "name": "head-raytest1", "status": "Ready"}, {"pod": "worker", "name": "worker-raytest1-a", "status": "Ready"}, {"pod": "worker", "name": "worker-raytest1-b", "status": "Ready"}],
        [{"pod": "head", "name": "head-raytest2", "status": "Ready"}, {"pod": "worker", "name": "worker-raytest2a", "status": "Starting"}],
        [{"pod": "head", "name": "head-raytest3", "status": "Suspended"}, {"pod": "worker", "name": "worker-raytest3a", "status": "Suspended"}],
        [{"pod": "head", "name": "head-raytest4", "status": "Failed"}, {"pod": "worker", "name": "worker-raytest4a", "status": "Failed"}]
    ]
}
df = pd.DataFrame(data)

# format to add icons
df['status'] = df['status'].apply(format_status)

my_output = widgets.Output()
classification_widget = widgets.ToggleButtons(
    options=["RayTest1", "RayTest2", "RayTest3", "RayTest4"],
    description='Select an existing cluster:',
)

def on_click(change):
    new_value = change["new"]
    my_output.clear_output()
    with my_output:
        selected_data = df[df["name"] == new_value]
        main_table = selected_data[["name", "namespace", "head_gpu", "worker_gpu", "min_memory", "max_memory", "min_cpu", "max_cpu", "status"]].to_html(escape=False, index=False)
        pod_rows = ""
        for pod in selected_data["pods"].values[0]:
            pod_rows += f'<tr><td>{pod["pod"]}</td><td>{pod["name"]}</td><td>{format_status(pod["status"])}</td></tr>'
        pods_table = f'<div style="border:1px solid black; margin-top: 10px; margin-left: 10px; display: inline-block;"><table><tr><th>Pod</th><th>Name</th><th>Status</th></tr>{pod_rows}</table></div>'
        display(HTML(f'<div style="border:1px solid black; display: inline-block; padding-bottom: 10px;">{main_table}{pods_table}</div>'))

classification_widget.observe(on_click, names="value")
display(widgets.VBox([classification_widget, my_output]))


list_jobs_button = widgets.Button(
    description='View Jobs',
    icon='suitcase',
)
delete_button = widgets.Button(
    description='Delete Cluster',
    icon='trash',
)
ray_dashboard_button = widgets.Button(
    description='Open Ray Dashboard',
    icon='dashboard',
    layout=widgets.Layout(width='auto'),
)
view_yaml_button = widgets.Button(
    description='View YAML',
    icon='file',
)
display(widgets.HBox([delete_button, list_jobs_button, view_yaml_button, ray_dashboard_button]))

Let's quickly verify that the specs of the cluster are as expected.

In [None]:
with my_output:
    display(cluster.details())

In [None]:
from IPython.display import HTML, display
import ipywidgets as widgets

def on_click(change):
    new_value = change["new"]
    my_output.clear_output()
    with my_output:
        # Render the selected cluster's row as an HTML table inside a bordered box
        columns = ["name", "namespace", "head_gpu", "worker_gpu", "min_memory", "max_memory", "min_cpu", "max_cpu", "status"]
        table_html = df[df["name"] == new_value][columns].to_html(escape=False, index=False)
        display(HTML(f'<div style="border:1px solid black;">{table_html}</div>'))

classification_widget.observe(on_click, names="value")
display(widgets.VBox([classification_widget, my_output], layout=widgets.Layout(border='2px solid black')))

list_jobs_button = widgets.Button(description='View Jobs', icon='suitcase')
delete_button = widgets.Button(description='Delete Cluster', icon='trash')
ray_dashboard_button = widgets.Button(description='Open Ray Dashboard', icon='dashboard', layout=widgets.Layout(width='auto'))
view_yaml_button = widgets.Button(description='View YAML', icon='file')
buttons_container = widgets.HBox([delete_button, list_jobs_button, view_yaml_button, ray_dashboard_button], layout=widgets.Layout(border='2px solid black'))

display(buttons_container)



Finally, we bring our resource cluster down, releasing the associated resources and returning everything to the state it was in before the cluster was brought up.

In [None]:
cluster.down()
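
If a cell fails between `up()` and `down()`, the cluster keeps holding its resources. One way to guard against that, sketched here under the assumption that the object only needs `up()` and `down()` methods, is a small context manager. `managed_cluster` is a hypothetical helper, not part of the SDK.

```python
from contextlib import contextmanager


@contextmanager
def managed_cluster(cluster):
    """Bring the cluster up on entry and always bring it down on exit."""
    cluster.up()
    try:
        yield cluster
    finally:
        # Runs even if the body of the `with` block raises.
        cluster.down()


# Hypothetical usage with the cluster object defined earlier:
# with managed_cluster(cluster) as c:
#     c.details()
```

The `finally` clause is the design choice that matters: teardown happens whether the workload succeeds or raises.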

In [None]:
auth.logout()