In [1]:
from openai import OpenAI
import time
%load_ext wurlitzer

Online inference demo
----------------------------
This brief demo shows that vLLM with Spyre can be used in online mode, i.e. as a server that answers requests sent by a client.

A vLLM server must therefore be started beforehand, outside of this notebook:
```bash
python3 -m vllm.entrypoints.openai.api_server --model /models/llama-7b-chat --max-model-len=2048 --block-size=2048
```

Wait until vLLM is ready before running the cells below; otherwise they fail with `ConnectError`s. The server is ready once the following log messages have been printed:
```log
INFO:     Started server process [1840]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

(Starting vLLM, including the Spyre warmup, is expected to take around 15 minutes.)
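
Instead of watching the log by hand, the wait can be scripted. The following is a minimal sketch, not part of the original demo: `wait_until_ready` is a hypothetical helper that keeps calling a probe function until it stops raising. The default timeout of 20 minutes and the 10 second polling interval are assumptions chosen to leave some headroom over the startup time mentioned above.

```python
import time


def wait_until_ready(probe, timeout_s=1200.0, interval_s=10.0):
    """Call `probe` until it stops raising, or give up after `timeout_s`.

    Returns True once `probe()` succeeds and False on timeout.
    `probe` is any zero-argument callable that raises while the
    server is unreachable (hypothetical helper, not a vLLM API).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            probe()
            return True
        except Exception:
            # Server not reachable yet (e.g. a connection error): retry
            # unless the deadline has passed.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Once the `client` from step 2 exists, a call such as `wait_until_ready(client.models.list)` would block until the server answers the model listing request.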

This demo uses the defaults: a maximum prompt length of 64 tokens and a maximum of 20 decode tokens. To change these limits, set the environment variables `VLLM_SPYRE_WARMUP_PROMPT_LENS` and `VLLM_SPYRE_WARMUP_NEW_TOKENS` before starting the server.
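
As an illustration, the two variables could be prepared in Python and handed to the server process at launch. This is only a sketch: `spyre_warmup_env` is a hypothetical helper, and the comma-separated value format is an assumption that should be checked against the vLLM Spyre documentation.

```python
import os


def spyre_warmup_env(prompt_lens, new_tokens, base_env=None):
    """Return a copy of the environment with both warmup variables set.

    ASSUMPTION: each variable takes a comma-separated list of integers.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_SPYRE_WARMUP_PROMPT_LENS"] = ",".join(str(n) for n in prompt_lens)
    env["VLLM_SPYRE_WARMUP_NEW_TOKENS"] = ",".join(str(n) for n in new_tokens)
    return env


# The values used in this demo: prompts of up to 64 tokens, 20 new tokens.
env = spyre_warmup_env([64], [20], base_env={})
print(env)
```

The resulting dictionary can be passed as the `env` argument of `subprocess.Popen` when launching the server command shown above.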

### 1. Create the prompts 

In [2]:
template = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request. Be polite in your response to the "
    "user.\n\n### Instruction:\n{}\n\n### Response:"
)
prompt1 = template.format(
    "Provide a list of instructions for preparing chicken soup for a family "
    "of four."
)

prompt2 = template.format(
    "Please compare New York City and Zurich and provide a list of attractions "
    "for each city."
)

prompt3 = template.format(
    "Provide detailed instructions for preparing asparagus soup for a family "
    "of four."
)

prompts = [prompt1, prompt2, prompt3]

In [3]:
# e.g. prompt 2
prompt2

'Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.\n\n### Instruction:\nPlease compare New York City and Zurich and provide a list of attractions for each city.\n\n### Response:'

### 2. Initialize client and connect to vLLM

(Adapt the `openai_api_base` URL so that it points to the forwarded or tunneled vLLM instance, e.g. forward the server port to localhost with `oc port-forward $DEV_POD 8000`.)

In [4]:
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Completion API
stream = False
max_tokens = 20  # default
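
The `stream` flag set above changes the shape of what `client.completions.create` returns: with `stream = False` it is a single object whose `choices` hold the full text, while with `stream = True` it is an iterator of chunks that each carry a fragment of the text. The sketch below shows one way to handle both cases; `completion_text` is a hypothetical helper, not part of the OpenAI client.

```python
def completion_text(result, stream):
    """Return the generated text of the first choice as one string."""
    if stream:
        # Streaming: concatenate the text fragments of all chunks.
        return "".join(
            chunk.choices[0].text for chunk in result if chunk.choices
        )
    # Non-streaming: the full text is available directly.
    return result.choices[0].text
```

With such a helper, `print(completion_text(completion, stream))` would print only the generated text instead of the full `CompletionChoice` objects.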



### 3. Submit requests and await responses

In [5]:

for prompt in prompts:
    print(f"Prompt: {prompt}")
    start_t = time.time()

    completion = client.completions.create(
        model=model,
        prompt=prompt,
        echo=False,
        n=1,
        stream=stream,
        temperature=0.0,  # greedy decoding
        max_tokens=max_tokens)

    # With stream=True the call returns before generation has finished,
    # so collect all chunks before stopping the timer.
    if stream:
        completion = list(completion)
    end_t = time.time()

    print("Results:")
    if stream:
        for c in completion:
            print(c)
    else:
        print(completion.choices)

    total_t = end_t - start_t
    print(f"Duration: {total_t}s")
    print("---------------------------\n")


Prompt: Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.

### Instruction:
Provide a list of instructions for preparing chicken soup for a family of four.

### Response:
Results:
[CompletionChoice(finish_reason='length', index=0, logprobs=None, text='\nOf course! Here are the steps to prepare chicken soup for a family of four:\n', stop_reason=None)]
Duration: 3.3749101161956787s
---------------------------

Prompt: Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.

### Instruction:
Please compare New York City and Zurich and provide a list of attractions for each city.

### Response:
Results:
[CompletionChoice(finish_reason='length', index=0, logprobs=None, text='\nThank you for reaching out! Both New York City and Zurich are incredible destinations with', stop_reason=None)]
Duration: 3.367875576