# Getting Started with vLLM on Intel® Gaudi® 2 AI Accelerators

Copyright (c) 2024 Habana Labs, Ltd. an Intel Company.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

## Introduction

This notebook demonstrates how to use vLLM, a fast and efficient library for Large Language Model (LLM) inference and serving, with Intel® Gaudi® 2 AI Accelerators. vLLM-fork for Gaudi is an adaptation of the original vLLM project, optimized to leverage the power of Gaudi hardware.

vLLM offers several advantages for LLM inference:

1. High-throughput serving with state-of-the-art performance
2. Efficient memory management using PagedAttention
3. Continuous batching of incoming requests
4. Optimized execution with custom Gaudi implementations for LLM operators
5. Support for offline batched inference and online inference via an OpenAI-compatible server
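
To give some intuition for the PagedAttention item in the list above, here is a toy sketch of the underlying idea: the KV cache is split into fixed-size blocks, and each sequence keeps a block table that maps its tokens to physical blocks, which need not be contiguous. This is an illustration only and not vLLM's implementation; the class `ToyBlockAllocator` and all of its methods are hypothetical names introduced for this example.

```python
# Toy illustration of block-based KV-cache bookkeeping.
# NOT vLLM's implementation; all names here are hypothetical.
class ToyBlockAllocator:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = ToyBlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    allocator.append_token("a")   # sequence "a" fills blocks 0 and 1
allocator.append_token("b")       # sequence "b" takes block 2
for _ in range(2):
    allocator.append_token("a")   # "a" grows into block 3, which is not adjacent
print(allocator.block_tables)     # {'a': [0, 1, 3], 'b': [2]}
```

Because blocks are handed out on demand, a sequence never reserves more than one partially filled block, which is the memory-saving property the list item refers to.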

In this notebook, we'll explore how to set up and use vLLM on Gaudi hardware, demonstrating its capabilities for fast and efficient LLM inference.

[Source: [vLLM-fork for Gaudi](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md), [vLLM Project](https://github.com/vllm-project/vllm)]

## Installation and Environment Setup

For Gaudi requirements and installation, please refer to [Requirements and Installation](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#requirements-and-installation).

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the [Run Docker Image](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#run-docker-image) section and the [Intel Gaudi documentation](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#pull-prebuilt-containers) for more details.

The following cells install the vLLM fork for Gaudi, which is required to run the contents of this notebook. For more information on installing vLLM for Gaudi, refer to [Build And Install vLLM](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#build-and-install-vllm-fork).

In [None]:
%%bash

git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout v0.4.2-Gaudi-1.16.0

In [None]:
!docker run -d \
  --runtime=habana \
  -v $(pwd):/app \
  -e HABANA_VISIBLE_DEVICES=0 \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice \
  --net=host \
  --ipc=host \
  --name=vllm_installation \
  vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest \
  /bin/bash -c "cd /app/vllm-fork && pip install -e . && pip install flask && echo $'\n\nInstallation completed successfully'"  # This may take 5-10 minutes

## Getting Started with vLLM

This section covers how to set up and run vLLM on Intel Gaudi:
- Check vLLM installation
- Prerequisites
- Run vLLM for Offline batched Inference
- Deploy vLLM via Flask
- Deploy vLLM for Online Inference via OpenAI-Compatible Server

### Check vLLM installation

Check the logs from the Docker container to confirm that vLLM and Flask were installed successfully. The installation is complete once the logs contain the message `Installation completed successfully`.

> 📝 **Note:** Please wait for a few minutes before running the cell below. The cell may have to be run multiple times to confirm installation as the log updates.

In [None]:
!docker logs vllm_installation
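
Instead of re-running the log cell by hand, the check can be automated. The following is a minimal sketch that polls `docker logs` until the success message appears or a timeout expires. The helper name `wait_for_log_marker`, its parameters, and the default timeout and interval are assumptions made for this example; the `read_logs` argument exists only so the function can be exercised without Docker.

```python
import subprocess
import time


def wait_for_log_marker(container, marker, timeout_s=900, interval_s=15, read_logs=None):
    """Poll a container's logs until `marker` appears; return False on timeout.

    Hypothetical helper, not part of vLLM or Docker.
    """
    if read_logs is None:
        def read_logs(name):
            # `docker logs` may write to either stream, so both are checked
            result = subprocess.run(
                ["docker", "logs", name], capture_output=True, text=True
            )
            return result.stdout + result.stderr

    deadline = time.monotonic() + timeout_s
    while True:
        if marker in read_logs(container):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In the notebook this could be called as `wait_for_log_marker("vllm_installation", "Installation completed successfully")`.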

Save the state of the container with vLLM and Flask installed by committing it as an image tagged `vllm`.

In [None]:
!docker commit vllm_installation vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:vllm

### Prerequisites

A Hugging Face token needs to be set as an environment variable when gated models from Hugging Face, such as Llama 2, are used. Replace the placeholder in the code below with a valid personal token.

Tokens can be created and viewed on the [Hugging Face Tokens](https://huggingface.co/settings/tokens) page.

In [None]:
HF_TOKEN = "YOUR_ACCESS_TOKEN"
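
To avoid committing a real token to the notebook, the value can instead be read from the environment of the Jupyter process. The sketch below assumes the token was exported as `HF_TOKEN` before Jupyter was started; the helper `resolve_hf_token` and the constant `PLACEHOLDER` are hypothetical names introduced for this example.

```python
import os

PLACEHOLDER = "YOUR_ACCESS_TOKEN"


def resolve_hf_token(environ=None, fallback=PLACEHOLDER):
    """Return the Hugging Face token, or raise if only the placeholder is available.

    Hypothetical helper; it only inspects the given mapping or os.environ.
    """
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN", fallback).strip()
    if not token or token == PLACEHOLDER:
        raise ValueError("Set HF_TOKEN to a valid Hugging Face access token")
    return token
```

Calling `HF_TOKEN = resolve_hf_token()` then fails early with a clear message, rather than later inside the container when the gated model download is rejected.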

### Run vLLM for Offline batched Inference

Here is an example of vLLM's offline batched inference capabilities using Llama 2 7B on Intel Gaudi. The script below uses vLLM in offline mode and processes the input prompts in a single batch.

```python
#!/usr/bin/python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# initialize
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=30)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enforce_eager=True)

# perform the inference
outputs = llm.generate(prompts, sampling_params)

# print outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Execute the script shown above, `vllm_batch_inference.py`, inside the container that has vLLM with Gaudi support installed to perform batched inference.

The script can also be run interactively by using the flag `-it` instead of `-d` in the following cell, which makes it possible to run inference with different inputs and parameters.

In [None]:
!docker run -d \
  --runtime=habana \
  -v $(pwd):/app \
  -e HABANA_VISIBLE_DEVICES=0 \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e HF_TOKEN=$HF_TOKEN \
  --cap-add=sys_nice \
  --net=host \
  --ipc=host \
  --name=vllm_batch_inference \
  vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:vllm \
  python /app/scripts/vllm_batch_inference.py

> **Note:** This may take 5-10 minutes if the model weights have not been downloaded.

Output is printed for each of the input prompts in the script. Next, stop the docker container.

In [None]:
!docker logs vllm_batch_inference

In [None]:
!docker stop vllm_batch_inference

### Deploy vLLM via Flask

While the example shown above works well for offline tests, serving requests from other applications calls for a network-facing setup. Here is an example of how to use a web framework such as Flask together with vLLM and Gaudi to serve the model via a REST API.

```python
from flask import Flask, request, jsonify
from vllm import LLM, SamplingParams

app = Flask(__name__)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enforce_eager=True)

@app.route('/generate', methods=['POST'])
def generate():
    data = request.get_json()
    prompts = data.get('prompts', [])

    outputs = llm.generate(prompts, sampling_params)

    # Prepare the outputs.
    results = []

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        results.append({
            'prompt': prompt,
            'generated_text': generated_text
        })

    return jsonify(results)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

Above is a script called `vllm_flaskapp.py` which creates an endpoint called `/generate` through which the text generation requests are served.

> **Note:** Please change the port in the script if it is already occupied.
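
Whether the port is occupied can be checked before starting the server. The following is a minimal sketch that tries to bind the port with Python's standard `socket` module; the helper name `is_port_free` is an assumption made for this example. Since the containers in this notebook run with `--net=host`, a check on the host reflects what the container will see.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port).

    Hypothetical helper. The result is only a snapshot: another process
    could still take the port between this check and the server start.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


print(is_port_free(5000))
```

If this prints `False`, pick another port in `vllm_flaskapp.py` and use the same port in the request cell later on.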

Run the script in the container to start the Flask server.

In [None]:
!docker run -d \
  --runtime=habana \
  -v $(pwd):/app \
  -e HABANA_VISIBLE_DEVICES=0 \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e HF_TOKEN=$HF_TOKEN \
  --cap-add=sys_nice \
  --net=host \
  --ipc=host \
  --name=vllm_flask_server \
  vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:vllm \
  python /app/scripts/vllm_flaskapp.py  # This step takes 5-10 minutes

The server takes a while to start. Check the `docker` logs for `vllm_flask_server` using the command below. Once the server has started successfully, the following output appears in the logs:

```
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://198.175.88.246:5000
Press CTRL+C to quit
```
Please make a note of the port that the application is serving on.

In [None]:
# check logs to see if server has started successfully

!docker logs vllm_flask_server

A POST request is sent to the Flask server, and the response is printed. Please change the port according to the logs above.

In [None]:
# Example of sending a POST request to the Flask server
import requests
import json

response = requests.post('http://localhost:5000/generate', json={'prompts': ['Tell me in one sentence what Berlin is famous for']})
print(response.json())

Stop the server once the requests have been made

In [None]:
!docker stop vllm_flask_server

### Deploy vLLM for Online Inference via OpenAI-Compatible Server

The Flask REST API above has its limitations: it is not designed to handle multiple concurrent users, lacks built-in authentication, and requires custom documentation. This is where vLLM's built-in serving capabilities can be utilized for production-grade deployment at scale, right out of the box.
In this section, we will use vLLM's built-in capabilities to deploy a server and use the OpenAI client to make requests.

The command used to run the vLLM API server is `python -m vllm.entrypoints.openai.api_server --enforce-eager --model=meta-llama/Llama-2-7b-chat-hf`, which is run inside the container with vLLM installed.
You can specify the address with the `--host` and `--port` arguments.

In [None]:
!docker run -d \
  --runtime=habana \
  -v $(pwd):/app \
  -e HABANA_VISIBLE_DEVICES=0 \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e HF_TOKEN=$HF_TOKEN \
  --cap-add=sys_nice \
  --net=host \
  --ipc=host \
  --name=vllm_api_server \
  vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:vllm \
  python -m vllm.entrypoints.openai.api_server --enforce-eager --model=meta-llama/Llama-2-7b-chat-hf --port 8000 # Takes 5-10 minutes

The server takes a while to start. Check the `docker` logs for `vllm_api_server` using the command below. Once the server has started successfully, the following output appears in the logs:

```
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 07-22 22:19:07 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
```


In [None]:
# check logs to see if server has started successfully

!docker logs vllm_api_server
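
As an alternative to reading the logs, readiness can be probed over HTTP. The sketch below polls a URL with the standard library until it answers with status 200 or a timeout expires. The helper name `wait_until_ready` and its defaults are assumptions made for this example, and the usage note assumes that the server answers `GET /v1/models`, the model listing route of the OpenAI API.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout_s=900, interval_s=10):
    """Poll `url` until it returns HTTP 200; return False if the timeout expires.

    Hypothetical helper, not part of vLLM.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeouts and HTTP errors all mean "not ready yet"
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For the server started above, a call such as `wait_until_ready("http://localhost:8000/v1/models")` would block until the model has been loaded, or return `False` after 15 minutes.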

Let's use the `requests` library to interact with our vLLM API server. This approach allows us to send HTTP requests directly from our Python script:

In [None]:
import requests
import json

# send request to the vLLM server using requests
VLLM_HOST = "http://0.0.0.0:8000"
url = f"{VLLM_HOST}/v1/completions"

headers = {"Content-Type": "application/json"}
data = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "Tell me in one sentence what Tokyo is famous for",
    "max_tokens": 50,
    "temperature": 0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json()["choices"][0]["text"])

### OpenAI API Client

Install the `openai` client to make requests to the vLLM server.

In [None]:
!pip install openai

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API. By default, it starts the server at `http://0.0.0.0:8000`. You can specify the address with the `--host` and `--port` arguments.

The OpenAI client can send requests through either the `Completions` API or the `Chat` API.

The Completions API from OpenAI is designed for a wide range of text generation tasks, offering significant control over the output through various parameters, making it suitable for applications like content creation and code completion.

In contrast, the Chat API is specifically optimized for conversational AI, facilitating smoother and more contextually aware dialogues. This API is better suited for chatbots and interactive applications where maintaining context and managing follow-up queries is crucial, ensuring more human-like interactions.
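
The practical difference between the two APIs shows up in the request body: a completions request carries a single `prompt` string and is sent to `/v1/completions`, while a chat request carries a list of role-tagged `messages` and is sent to `/v1/chat/completions`. The sketch below only builds and prints the two payloads; the helper names `completion_payload` and `chat_payload` are hypothetical and introduced for this example.

```python
import json

MODEL = "meta-llama/Llama-2-7b-chat-hf"


def completion_payload(prompt, max_tokens=50):
    """Body for POST /v1/completions: one free-form prompt string."""
    return {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}


def chat_payload(user_message, system_message="You're a helpful assistant.", max_tokens=50):
    """Body for POST /v1/chat/completions: a list of role-tagged messages."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
    }


question = "Tell me in one sentence what Paris is famous for"
print(json.dumps(completion_payload(question), indent=2))
print(json.dumps(chat_payload(question), indent=2))
```

For chat requests the server formats the message list with the model's chat template before generation, which is why an instruction-tuned model is the better fit for that route.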

Completions API:

In [None]:
from openai import OpenAI

# we haven't configured authentication, we pass a dummy value
openai_api_key = "EMPTY"
# modify this value to match your host, remember to add /v1 at the end
openai_api_base = "http://0.0.0.0:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="meta-llama/Llama-2-7b-chat-hf",
                                      prompt="Tell me in one sentence what Paris is famous for",
                                      max_tokens=50)
print(completion.choices[0].text)

Chat API:

Please use an instruction-tuned model, such as `meta-llama/Llama-2-7b-chat-hf`, for the Chat API.

In [None]:
openai_api_key = "EMPTY"
openai_api_base = "http://0.0.0.0:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You're a helpful assistant."},
        {"role": "user", "content": "Tell me in one sentence what New York City is famous for"},
    ]
)

print(chat_response.choices[0].message.content)

### Chat Application using Gradio

Optionally, connect to the vLLM server through a Gradio chatbot to see it in action. The same endpoint that was set up in the previous sections is used.

In [None]:
!pip install gradio

In [None]:
import gradio as gr

def predict(message, history):
    # Convert Gradio's (user, assistant) history pairs into OpenAI chat messages
    history_openai_format = []
    for human, assistant in history:
        history_openai_format.append({"role": "user", "content": human})
        history_openai_format.append({"role": "assistant", "content": assistant})
    history_openai_format.append({"role": "user", "content": message})

    # Request a streamed response from the vLLM server
    response = client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        messages=history_openai_format,
        temperature=1.0,
        stream=True,
    )

    # Yield the growing reply so the chat window updates as tokens arrive
    partial_message = ""
    for chunk in response:
        if chunk.choices[0].delta.content is not None:
            partial_message = partial_message + chunk.choices[0].delta.content
            yield partial_message

gr.ChatInterface(predict).launch(share=True)
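
The two steps inside `predict`, converting the chat history and accumulating streamed text, can be tried without a running server. The sketch below isolates them as plain functions and feeds them made-up data. The names `to_openai_messages` and `accumulate_stream` are hypothetical, and the history is assumed to arrive as (user, assistant) pairs, as in the cell above.

```python
def to_openai_messages(history, message):
    """Turn (user, assistant) pairs plus the new message into OpenAI chat messages."""
    messages = []
    for human, assistant in history:
        messages.append({"role": "user", "content": human})
        messages.append({"role": "assistant", "content": assistant})
    messages.append({"role": "user", "content": message})
    return messages


def accumulate_stream(deltas):
    """Yield the reply text so far after each non-empty streamed fragment."""
    partial = ""
    for delta in deltas:
        if delta is not None:
            partial += delta
            yield partial


# Made-up sample data standing in for a chat history and streamed fragments
messages = to_openai_messages([("Hi", "Hello!")], "What is Gaudi?")
fake_deltas = ["Gaudi ", None, "is an ", "AI accelerator."]
updates = list(accumulate_stream(fake_deltas))
print(messages[-1])
print(updates[-1])
```

Each yielded string replaces the previous one in the chat window, which is why the generator emits the full text so far rather than only the newest fragment.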

Stop the server once the requests have been made

In [None]:
!docker stop vllm_api_server

## Troubleshooting

1. If the following error message is encountered:
   ```
   RuntimeError: synStatus=8 [Device not found] Device acquire failed.
   ```
   Please stop the respective Docker container to release the Gaudi device and its memory.

2. Run `docker ps` to see which containers are running and how long they have been up.

3. If a container name is already in use by another container, copy the container ID from the error message and run `docker rm CONTAINER_ID`.
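
The last two checks can be scripted. The following minimal sketch lists the containers created in this notebook together with their status, using `docker ps -a` with a `--format` template. The helper names and the `NOTEBOOK_CONTAINERS` set are assumptions made for this example, and the sample line at the end is made-up data rather than real output.

```python
import subprocess

# Container names used by the cells in this notebook
NOTEBOOK_CONTAINERS = {
    "vllm_installation",
    "vllm_batch_inference",
    "vllm_flask_server",
    "vllm_api_server",
}


def parse_docker_ps(output):
    """Parse tab-separated `ID<TAB>Names<TAB>Status` lines into dictionaries."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        container_id, name, status = line.split("\t", 2)
        rows.append({"id": container_id, "name": name, "status": status})
    return rows


def list_notebook_containers():
    """Return this notebook's containers, including stopped ones (requires Docker)."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.ID}}\t{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    )
    return [row for row in parse_docker_ps(result.stdout)
            if row["name"] in NOTEBOOK_CONTAINERS]


# Made-up sample line to show the parsed shape without calling Docker
sample = "abc123def456\tvllm_api_server\tExited (0) 2 minutes ago\n"
print(parse_docker_ps(sample))
```

A stopped container that still holds a name can then be removed with `docker rm` followed by its `id` value.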