### **Overview**

This notebook demonstrates how to deploy a model with **vLLM** (an open-source library for large language model inference and serving) using **Ray Serve**. Follow these steps to set up and deploy your vLLM model:

1. **Prepare the Application**: Ensure your Ray Serve application is correctly set up in the `ray_serve_vllm_example.py` file. This includes defining the VLLM model and making sure it is properly referenced during creation.

2. **Build the Ray Serve Application**: Use the `serve build` command to generate a configuration file for your deployment. This configuration file will specify how your VLLM application should be deployed.

3. **Configure Runtime Options**: Two actions need to be taken here:
    - Due to current limitations with Ray Serve's `--working-dir` option, we use a workaround that uploads the working directory to the Ray cluster's GCS (Ray's Global Control Service, which is what the `gcs://` package URIs refer to, not Google Cloud Storage) and then references it in the deployment configuration.
    - The `vllm` package must be listed under `pip: packages` in the runtime environment so that it is installed on the cluster. See the README file for details.

4. **Deploy the Model**: Deploy the VLLM model using the generated configuration file. This step will start the deployment process and make your model available for serving.

5. **Send Sample Requests**: After deployment, send sample requests to the deployed VLLM model to test its functionality and ensure it is working as expected.

6. **Shutdown the Deployment**: Once you are done with testing, terminate the deployment to free up resources.

**Note**: This notebook is tailored for deploying VLLM models and includes steps specific to such deployments using Ray Serve. Follow each step carefully and check the outputs for any warnings or errors.

In [1]:
# Building Ray Serve app
# !serve build <module_name>:<app_name> -o <config_file_name>.yaml
# This will generate config file
!serve build --app-dir "./" ray_serve_vllm_example:ray_serve_vllm_deployment -o ray_vllm_deployment_config.yaml
# Any "Failed to import" WARNING in the output can be ignored

2024-08-01 17:53:02,482	INFO scripts.py:848 -- The auto-generated application names default to `app1`, `app2`, ... etc. Rename as necessary.

## Attention!
The following cell is a workaround. Currently, `serve deploy` does not support `--working-dir` directly. Please see https://github.com/ray-project/ray/issues/29354

The suggested way to provide files from the notebook to the Ray cluster is as follows:

1. Create a connection with `JobSubmissionClient` using the `working_dir` option but without an entrypoint.
2. `JobSubmissionClient` uploads `working_dir` to GCS and prints the URI.
3. Specify that URI in the config file, as in the example below:

    runtime_env:
        working_dir: "gcs://_ray_pkg_fef565b457f470d9.zip"
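The config change described above can also be applied programmatically. The following is a minimal sketch, assuming the config file has already been parsed into a Python dict (for example with a YAML library, which is not shown here) and that it has a top-level `applications` list as produced by `serve build`. The helper name `add_runtime_env` and the trimmed-down sample config are hypothetical.

```python
from copy import deepcopy


def add_runtime_env(config, working_dir_uri, pip_packages):
    """Return a copy of a Serve config dict with runtime_env set on every application."""
    patched = deepcopy(config)
    for app in patched.get("applications", []):
        # runtime_env may be missing or empty in a freshly generated config.
        runtime_env = app.get("runtime_env") or {}
        runtime_env["working_dir"] = working_dir_uri
        runtime_env["pip"] = {"packages": list(pip_packages)}
        app["runtime_env"] = runtime_env
    return patched


# Hypothetical, trimmed-down config as it might look after parsing the YAML file.
config = {
    "applications": [
        {
            "name": "app1",
            "route_prefix": "/",
            "import_path": "ray_serve_vllm_example:ray_serve_vllm_deployment",
            "runtime_env": {},
        }
    ]
}

patched = add_runtime_env(config, "gcs://_ray_pkg_fef565b457f470d9.zip", ["vllm"])
print(patched["applications"][0]["runtime_env"])
```

The function returns a patched copy rather than editing in place, so the originally generated config stays available for comparison before it is written back to the file.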

In [31]:
# Workaround!
# This uploads the working dir to GCS
# Once the URI is printed, update the config file with it before deployment
import ray
from ray.job_submission import JobSubmissionClient

ray_head_ip = "kuberay-head-svc.kuberay.svc.cluster.local"
ray_head_port = 8265
ray_address = f"http://{ray_head_ip}:{ray_head_port}"
client = JobSubmissionClient(ray_address)

job_id = client.submit_job(
    entrypoint="",
    runtime_env={
        "working_dir": "./",
    }
)

# We do not need this connection    
ray.shutdown()

2024-10-16 13:55:17,346	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_db274df5fea1cead.zip.
2024-10-16 13:55:17,347	INFO packaging.py:530 -- Creating a file package for local directory './'.
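The package URI needed for the config file appears in the "Uploading package" log line above. A minimal sketch of pulling it out of captured log text follows; it assumes the URI always has the form `gcs://_ray_pkg_<hex>.zip`, which is inferred only from the two examples in this notebook, and the helper name `extract_package_uri` is hypothetical.

```python
import re

# Pattern inferred from the URIs shown in this notebook (an assumption, not a documented format).
_PKG_URI = re.compile(r"gcs://_ray_pkg_[0-9a-f]+\.zip")


def extract_package_uri(log_text):
    """Return the first package URI found in the log text, or None if there is none."""
    match = _PKG_URI.search(log_text)
    return match.group(0) if match else None


log_line = (
    "2024-10-16 13:55:17,346 INFO dashboard_sdk.py:338 -- "
    "Uploading package gcs://_ray_pkg_db274df5fea1cead.zip."
)
print(extract_package_uri(log_line))  # gcs://_ray_pkg_db274df5fea1cead.zip
```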


In [32]:
!serve deploy --address "http://kuberay-head-svc.kuberay.svc.cluster.local:8265" ray_vllm_deployment_config.yaml

2024-10-16 13:55:37,079	INFO scripts.py:243 -- Deploying from config file: 'ray_vllm_deployment_config.yaml'.
2024-10-16 13:55:40,984	SUCC scripts.py:350 --
Sent deploy request successfully.
 * Use `serve status` to check applications' statuses.
 * Use `serve config` to see the current application config(s).
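
Because the deploy request is only sent, not awaited, it can help to confirm that the application is running before sending requests. A minimal sketch of that check follows, assuming the status information is available as a dict with an `applications` mapping whose entries carry a `status` field (the shape returned by the Ray dashboard's Serve REST endpoint `GET /api/serve/applications/`, as far as we are aware). The helper name `all_apps_running` and the sample payloads are hypothetical.

```python
def all_apps_running(payload):
    """Return True only if at least one application exists and all report RUNNING."""
    apps = payload.get("applications", {})
    return bool(apps) and all(app.get("status") == "RUNNING" for app in apps.values())


# Hypothetical payloads shaped like the Serve status response.
still_deploying = {"applications": {"app1": {"status": "DEPLOYING"}}}
ready = {"applications": {"app1": {"status": "RUNNING"}}}

print(all_apps_running(still_deploying))  # False
print(all_apps_running(ready))            # True

# Against a live cluster, the payload could be fetched along these lines (assumption):
#   import requests
#   payload = requests.get(
#       "http://kuberay-head-svc.kuberay.svc.cluster.local:8265/api/serve/applications/"
#   ).json()
```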

In [17]:
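import requests


def send_sample_request():
    """Send one streaming prompt to the deployed app and print each response line."""
    prompt = "How do I cook rice?"
    sample_input = {"prompt": prompt, "stream": True}
    # stream=True lets requests yield lines as they arrive instead of buffering the whole response
    output = requests.post("http://kuberay-head-svc.kuberay.svc.cluster.local:8000/", json=sample_input, stream=True)
    # Fail early on HTTP errors instead of printing an error body
    output.raise_for_status()
    for line in output.iter_lines():
        print(line.decode("utf-8"))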

In [18]:
send_sample_request()

{"text": "\n"}
{"text": "What"}
{"text": " kind"}
{"text": " of"}
{"text": " rice"}
{"text": "?"}
{"text": " Green"}
{"text": " tea"}
{"text": ","}
{"text": " milk"}
{"text": " rice"}
{"text": "..."}
{"text": "\n"}
{"text": "K"}
{"text": "orean"}
{"text": " rice"}
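
Each line of the streamed response above is a small JSON object with a `text` field, so the full completion can be rebuilt by concatenating those fields. A minimal sketch follows, assuming every non-empty line is a JSON object of that shape; the helper name `join_stream` is hypothetical.

```python
import json


def join_stream(lines):
    """Concatenate the "text" field of each JSON line into a single string."""
    return "".join(json.loads(line)["text"] for line in lines if line.strip())


# First few chunks copied from the output above (raw strings keep the JSON escapes intact).
sample_lines = [
    r'{"text": "\n"}',
    r'{"text": "What"}',
    r'{"text": " kind"}',
    r'{"text": " of"}',
    r'{"text": " rice"}',
    r'{"text": "?"}',
]

print(repr(join_stream(sample_lines)))  # '\nWhat kind of rice?'
```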


In [30]:
# Terminating the deployment
!serve shutdown --address "http://kuberay-head-svc.kuberay.svc.cluster.local:8265" -y

2024-10-16 13:49:33,694	SUCC scripts.py:747 -- Sent shutdown request; applications will be deleted asynchronously.