Define the application in ray_vllm.py with Ray Serve, making sure the model is referenced correctly when the deployment is created, then run the cells below to deploy the model.

In [1]:
# Build the Ray Serve app and generate its config file.
# Usage: serve build <module_name>:<app_name> -o <config_file_name>.yaml
!serve build --app-dir "./" ray_vllm:deployment -o deployment_config.yaml
# A "Failed to import" warning here can be ignored.

2024-07-18 12:35:48,460	INFO scripts.py:848 -- The auto-generated application names default to `app1`, `app2`, ... etc. Rename as necessary.


## Attention!
The following cell is a workaround: serve deploy does not currently support --working-dir directly. See https://github.com/ray-project/ray/issues/29354

Suggested way to make files from the notebook available to the Ray cluster:

1. Submit a job through JobSubmissionClient with the working_dir option set in runtime_env.
2. JobSubmissionClient uploads working_dir to the GCS (Ray's own Global Control Store, not Google Cloud Storage) and logs the package URI.
3. Set that URI in the config file, as in this example:
  runtime_env:
    working_dir: "gcs://_ray_pkg_fef565b457f470d9.zip"
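The config edit in the last step can also be scripted. The following is a minimal sketch rather than part of the original workflow: `set_working_dir` is a hypothetical helper, and it assumes the generated config uses the multi-application layout with a top-level `applications` list. The example dictionary is illustrative, not the real contents of deployment_config.yaml.

```python
import copy


def set_working_dir(config: dict, uri: str) -> dict:
    """Return a copy of a parsed Serve config with runtime_env.working_dir set.

    Assumes the multi-application layout (a top-level `applications` list).
    """
    updated = copy.deepcopy(config)
    for app in updated.get("applications", []):
        # Create runtime_env if it is missing or null, then point it at the package.
        runtime_env = app.get("runtime_env") or {}
        runtime_env["working_dir"] = uri
        app["runtime_env"] = runtime_env
    return updated


# Illustrative input; a real config would be loaded from deployment_config.yaml.
example = {"applications": [{"name": "app1", "import_path": "ray_vllm:deployment"}]}
patched = set_working_dir(example, "gcs://_ray_pkg_fef565b457f470d9.zip")
print(patched["applications"][0]["runtime_env"])
```

If PyYAML is available, the file could be read with `yaml.safe_load` and written back with `yaml.safe_dump` around this helper. The helper returns a copy so the original dictionary stays unchanged if the deployment needs to be retried with a different URI.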

In [3]:
# Workaround: submitting a job uploads the working dir to the GCS.
# Once the URI is logged, set it in the config file before deploying.
import ray
from ray.job_submission import JobSubmissionClient

ray_head_ip = "kuberay-head-svc.kuberay.svc.cluster.local"
ray_head_port = 8265
ray_address = f"http://{ray_head_ip}:{ray_head_port}"
client = JobSubmissionClient(ray_address)

job_id = client.submit_job(
    entrypoint="python ray_vllm.py",
    runtime_env={
        "working_dir": "./",
    }
)

# The notebook does not need a Ray connection after the upload
ray.shutdown()

2024-07-18 12:37:28,749	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_4cf46398fdea3c78.zip.
2024-07-18 12:37:28,750	INFO packaging.py:530 -- Creating a file package for local directory './'.
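The package URI has to be copied from the log output above into the config file. As a rough sketch, it could be pulled out of captured log text with a regular expression. `find_package_uri` is a hypothetical helper, and the pattern is an assumption based only on the two URIs that appear in this notebook; other Ray versions may format the name differently.

```python
import re
from typing import Optional

# Pattern inferred from the URIs shown in this notebook (an assumption).
_PKG_URI = re.compile(r"gcs://_ray_pkg_[0-9a-f]+\.zip")


def find_package_uri(log_text: str) -> Optional[str]:
    """Return the first gcs:// package URI found in log_text, or None."""
    match = _PKG_URI.search(log_text)
    return match.group(0) if match else None


log_line = "INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_4cf46398fdea3c78.zip."
print(find_package_uri(log_line))
```

The trailing period of the log sentence is not part of the match, so the result can be used in the config as-is.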


In [4]:
!serve deploy --address "http://kuberay-head-svc.kuberay:8265" deployment_config.yaml

2024-07-18 12:37:51,990	INFO scripts.py:243 -- Deploying from config file: 'deployment_config.yaml'.
2024-07-18 12:37:55,580	SUCC scripts.py:350 --
Sent deploy request successfully.
 * Use `serve status` to check applications' statuses.
 * Use `serve config` to see the current application config(s).
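The deploy request is asynchronous, so the application may not be ready right after this cell returns. Besides `serve status`, readiness could be checked from Python. The sketch below is hedged: both function names are hypothetical, and it assumes the dashboard exposes the Serve REST endpoint at `/api/serve/applications/` and reports a per-application `status` field with the value `RUNNING` when ready. Verify both against the Ray version in use.

```python
import json
import urllib.request


def all_apps_running(payload: dict) -> bool:
    """True if the payload lists at least one app and every app is RUNNING."""
    apps = payload.get("applications", {})
    return bool(apps) and all(app.get("status") == "RUNNING" for app in apps.values())


def fetch_serve_status(dashboard_url: str) -> dict:
    """Fetch application details from the dashboard (endpoint path is an assumption)."""
    url = f"{dashboard_url}/api/serve/applications/"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Illustrative payload, not real cluster output.
print(all_apps_running({"applications": {"app1": {"status": "DEPLOYING"}}}))
```

In the notebook this might be used as `all_apps_running(fetch_serve_status("http://kuberay-head-svc.kuberay:8265"))`, polled with a short sleep until it returns True.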

In [17]:
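import requests


def send_sample_request():
    """Send a prompt to the deployed app and print tokens as they stream back."""
    prompt = "How do I cook rice?"
    sample_input = {"prompt": prompt, "stream": True}
    # stream=True makes requests yield lines as they arrive instead of buffering the body
    output = requests.post(
        "http://kuberay-head-svc.kuberay:8000/", json=sample_input, stream=True
    )
    output.raise_for_status()
    for line in output.iter_lines():
        if line:  # skip empty keep-alive lines
            print(line.decode("utf-8"))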

In [18]:
send_sample_request()

{"text": "\n"}
{"text": "What"}
{"text": " kind"}
{"text": " of"}
{"text": " rice"}
{"text": "?"}
{"text": " Green"}
{"text": " tea"}
{"text": ","}
{"text": " milk"}
{"text": " rice"}
{"text": "..."}
{"text": "\n"}
{"text": "K"}
{"text": "orean"}
{"text": " rice"}
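Each streamed line above is a small JSON object carrying one token in its `text` field. To read the response as plain text, the chunks can be decoded and concatenated. This is a minimal sketch; `join_stream` is a hypothetical helper, and it assumes every non-empty line is a JSON object with a `text` key, as in the output shown here.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate the "text" fields of newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines between chunks
        parts.append(json.loads(line)["text"])
    return "".join(parts)


# First few chunks from the sample output above.
sample = [
    '{"text": "\\n"}',
    '{"text": "What"}',
    '{"text": " kind"}',
    '{"text": " of"}',
    '{"text": " rice"}',
    '{"text": "?"}',
]
print(repr(join_stream(sample)))
```

Inside send_sample_request, the decoded lines from `iter_lines()` could be passed to this helper instead of being printed one by one.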


In [25]:
# Terminating the deployment
!serve shutdown --address "http://kuberay-head-svc.kuberay:8265" -y

2024-07-16 14:04:48,704	SUCC scripts.py:747 -- Sent shutdown request; applications will be deleted asynchronously.