In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import os
from openai import OpenAI

Run the following command in a terminal to start a vLLM server, which the rest of the notebook queries.


```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096 --chat-template ./template_llama31.jinja
```

The server may take about a minute to start and become reachable.
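Rather than waiting a fixed time, the notebook can poll the server until it answers. The following is a minimal sketch, not part of the original notebook: `http_ok` and `wait_until_ready` are hypothetical helper names, and the probe URL assumes the server listens on `localhost:8000` and exposes the OpenAI-compatible `/v1/models` route.

```python
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, refused connections, and socket timeouts are OSError subclasses.
        return False


def wait_until_ready(probe, max_wait=120.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` every `interval` seconds until it returns True.

    Returns True as soon as the probe succeeds, or False once `max_wait`
    seconds have passed without success. `sleep` and `clock` are injectable
    so the loop can be exercised without real waiting.
    """
    deadline = clock() + max_wait
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A typical call would be `wait_until_ready(lambda: http_ok("http://localhost:8000/v1/models"))`, which returns `True` once the server responds and `False` if it is still unreachable after two minutes.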

### Using the OpenAI Completions API with vLLM

In [2]:
!curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'

{"id":"cmpl-c6ab3ebe1b50409fbde408d32c884899","object":"text_completion","created":1725109226,"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"text":" top tourist destination, and for good","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
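The JSON response above carries more than the generated text. As an illustration, the sketch below parses a trimmed copy of that response (only the fields it reads are kept) and pulls out the completion, the stop condition, and the token accounting. The variable names are chosen here for illustration only.

```python
import json

# Trimmed copy of the response printed above; unused fields are omitted.
raw = (
    '{"object":"text_completion","model":"meta-llama/Meta-Llama-3.1-8B-Instruct",'
    '"choices":[{"index":0,"text":" top tourist destination, and for good",'
    '"logprobs":null,"finish_reason":"length"}],'
    '"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}'
)

response = json.loads(raw)
choice = response["choices"][0]
usage = response["usage"]

generated_text = choice["text"]
# finish_reason "length" means generation stopped because max_tokens was reached.
hit_token_limit = choice["finish_reason"] == "length"

print(repr(generated_text))
print("stopped at max_tokens:", hit_token_limit)
print("prompt + completion tokens:", usage["prompt_tokens"] + usage["completion_tokens"])
```

Here the completion used all 7 requested tokens, which is why the text ends mid-sentence, and the 5 prompt tokens plus 7 completion tokens account for the reported total of 12.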

In [3]:
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

In [4]:
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

In [5]:
completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="San Francisco is a",
    temperature=0,
    max_tokens=7,
)
print("Completion result:", completion.choices[0].text)

Completion result:  top tourist destination, and for good


### Using the OpenAI Chat API with vLLM

In [6]:
chat_response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the result of 2 + 2?"},
    ],
    temperature=0,
)
print("Chat response:", chat_response.choices[0].message.content)

Chat response: The result of 2 + 2 is 4.
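When several questions are sent with the same system prompt, the chat call above can be wrapped in a small helper. This is a sketch under assumptions: `ask` is a hypothetical function name, not part of the `openai` package, and it expects a client object with the same `chat.completions.create` interface used in the cell above.

```python
def ask(client, model, question,
        system_prompt="You are a helpful assistant.", **sampling):
    """Send one user question with a system prompt and return the reply text.

    Extra keyword arguments (for example temperature=0) are forwarded
    unchanged to client.chat.completions.create.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        **sampling,
    )
    # The first choice holds the assistant message for a single-completion request.
    return response.choices[0].message.content
```

With the `client` defined earlier, `ask(client, "meta-llama/Meta-Llama-3.1-8B-Instruct", "What is the result of 2 + 2?", temperature=0)` issues the same request as the cell above.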
