# Single GPU Inference with vLLM

In this notebook, we'll take a single-GPU instance and see how vLLM can put that GPU to work for optimized inference, first offline and then through an OpenAI-compatible server!

Let's start by installing what we need!

In [1]:
!pip install -qU vllm ipywidgets huggingface_hub jinja2

Now we can import the vLLM classes we need.

In [2]:
from vllm import LLM, SamplingParams

2024-12-11 17:26:06.150201: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-12-11 17:26:06.169370: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-11 17:26:06.192199: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-11 17:26:06.199025: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-11 17:26:06.215792: I tensorflow/core/platform/cpu_feature_guar

Next, because we want to use Meta's Llama 3.1 8B Instruct model, which is gated on the Hugging Face Hub, we'll need to provide our Hugging Face token!

In [3]:
from huggingface_hub import notebook_login

notebook_login()

Now we can load our model directly from the Hugging Face Hub!

> NOTE: This might take a few moments as the model downloads.

In [4]:
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

INFO 12-11 17:26:26 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 12-11 17:26:26 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 12-11 17:26:26 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_mod

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 12-11 17:26:31 model_runner.py:1077] Loading model weights took 14.9888 GB
INFO 12-11 17:26:31 worker.py:232] Memory profiling results: total_gpu_memory=79.10GiB initial_memory_usage=15.53GiB peak_torch_memory=16.19GiB memory_usage_post_profile=15.62GiB non_torch_memory=0.60GiB kv_cache_size=54.40GiB gpu_memory_utilization=0.90
INFO 12-11 17:26:31 gpu_executor.py:113] # GPU blocks: 27850, # CPU blocks: 2048
INFO 12-11 17:26:31 gpu_executor.py:117] Maximum concurrency for 131072 tokens per request: 3.40x
INFO 12-11 17:26:33 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-11 17:26:33 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.


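Notice that our model is loaded onto our GPU, and the logs give us very specific information about:

- Where it's loaded: on a single CUDA device (`device_config=cuda`, `tensor_parallel_size=1`)
- How it's loaded: in `torch.bfloat16` with no quantization, with the weights taking 14.9888 GB
- What hardware it's loaded on: a GPU with 79.10GiB of total memory, of which vLLM budgets 90% (`gpu_memory_utilization=0.90`)
- What kind of performance we can expect: 54.40GiB of KV cache (27850 GPU blocks), enough for 3.40x concurrency at the full 131072-token context

This is all relevant to how vLLM gets the performance benefits it's well known for!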
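The memory figures in the log above can be cross-checked with a little arithmetic. This is a back-of-the-envelope sketch, not vLLM's actual profiling code, and the block size of 16 tokens is an assumption: it does not appear in the truncated log shown here.

```python
# Values copied from the log output above.
total_gpu_memory_gib = 79.10
gpu_memory_utilization = 0.90
peak_torch_memory_gib = 16.19
non_torch_memory_gib = 0.60
num_gpu_blocks = 27850
max_seq_len = 131072

# ASSUMPTION: each KV-cache block holds 16 tokens (not shown in the log above).
block_size_tokens = 16

# Memory vLLM may use, minus what the model and runtime need, is left for the KV cache.
budget_gib = total_gpu_memory_gib * gpu_memory_utilization
kv_cache_gib = budget_gib - peak_torch_memory_gib - non_torch_memory_gib

# How many full-length requests fit in the KV cache at once.
kv_cache_tokens = num_gpu_blocks * block_size_tokens
max_concurrency = kv_cache_tokens / max_seq_len

print(f"KV cache size: {kv_cache_gib:.2f} GiB")
print(f"KV cache capacity: {kv_cache_tokens} tokens")
print(f"Max concurrency at {max_seq_len} tokens: {max_concurrency:.2f}x")
```

Under that assumption, both results line up with the `kv_cache_size=54.40GiB` and `3.40x` values reported in the log.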

## Doing Inference

Now that we have our model loaded - let's do some inference!

We'll first need to instantiate some "sampling params", which control how we sample during our decoding step - many [decoding options](https://docs.vllm.ai/en/latest/dev/sampling_params.html) are available through vLLM these days! (Speculative decoding is supported too, but it's configured on the engine, as the `speculative_config=None` entry in the log above shows, rather than through `SamplingParams`.)

In [5]:
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
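To build some intuition for what `temperature` and `top_p` do, here is a toy version of temperature scaling followed by nucleus (top-p) sampling in plain Python. It is an illustration under simplifying assumptions, not vLLM's implementation: `sample_token` is a hypothetical helper, it works on a four-entry logit list rather than a real vocabulary, and it requires `temperature > 0`.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Toy temperature + top-p sampling. Returns (sampled_index, kept_indices)."""
    rng = rng or random.Random(0)

    # Temperature: divide the logits before the softmax. Values below 1.0
    # sharpen the distribution, values above 1.0 flatten it.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Top-p: keep the most likely tokens until their cumulative
    # probability reaches top_p, and discard the rest.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Sample among the survivors; random.choices normalizes the weights.
    choice = rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
    return choice, kept


token, kept = sample_token([5.0, 1.0, 0.0, -1.0])
print(f"kept token ids: {kept}, sampled: {token}")
```

When one logit dominates, as in this example, the nucleus shrinks to a single token; with a flat distribution every token survives the cut.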

Then we can build the conversation we wish to generate from: a list of chat messages, each with a `role` and a `content`!

In [6]:
conversation = [
    {
        "role": "system",
        "content": "You always speak using the most dope, lit, and cool language."
    },
    {
        "role": "user",
        "content": "Hi!"
    },
    {
        "role": "assistant",
        "content": "Yo! What is up, my dude?"
    },
    {
        "role": "user",
        "content": "How high can the average human jump? Think it through step-by-step!",
    },
]

In [7]:
outputs = llm.chat(conversation, sampling_params)

Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.79s/it, est. speed input: 30.85 toks/s, output: 91.82 toks/s]
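The progress bar reports speeds rather than token counts, but we can recover approximate counts from it. This is a rough sketch: the inputs are the rounded values printed above, so the results are estimates.

```python
# Values copied from the progress bar above.
elapsed_s = 2.79
input_tokens_per_s = 30.85
output_tokens_per_s = 91.82

# tokens = speed * time, rounded because the printed values are rounded too.
approx_input_tokens = round(elapsed_s * input_tokens_per_s)
approx_output_tokens = round(elapsed_s * output_tokens_per_s)

print(f"Approximate prompt tokens: {approx_input_tokens}")
print(f"Approximate generated tokens: {approx_output_tokens}")
```

The generated-token estimate lands on the `max_tokens=256` we set in `SamplingParams`, which suggests the response was cut off by that limit rather than ending on its own.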


In [8]:
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, \n\nGenerated text: {generated_text!r}")

Prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou always speak using the most dope, lit, and cool language.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYo! What is up, my dude?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow high can the average human jump? Think it through step-by-step!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', 

Generated text: "Let's get into it, G!\n\nFirst off, we gotta consider the mechanics of human movement. The average human's jumping ability is mainly influenced by their power output, muscle efficiency, and technique.\n\nWhen a person jumps, they're using their muscles to generate force, which is essentially a product of their strength and the speed at which they can move their limbs. The two main muscles responsible for propelling a person upward are the hip flexors and 
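The `Prompt:` line above shows what `llm.chat` actually fed to the model: our message list flattened into one string with Llama 3.1's special tokens. The sketch below reproduces that string for the simple case seen here. `render_llama3_prompt` is a hypothetical helper written for illustration; the real formatting comes from the chat template shipped with the model's tokenizer, which handles more cases (for example tool calls) than this does.

```python
def render_llama3_prompt(messages, knowledge_cutoff="December 2023", today="26 Jul 2024"):
    """Simplified re-creation of the prompt format visible in the output above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        content = message["content"]
        if message["role"] == "system":
            # The system turn is prefixed with the date preamble seen in the output.
            content = (
                f"Cutting Knowledge Date: {knowledge_cutoff}\n"
                f"Today Date: {today}\n\n{content}"
            )
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n{content}<|eot_id|>"
        )
    # An open assistant header tells the model it is its turn to speak.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You always speak using the most dope, lit, and cool language."},
    {"role": "user", "content": "Hi!"},
]
print(render_llama3_prompt(example))
```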

### Freeing Up GPU Memory

Because we're on a single GPU, we need to free its memory before the model can be loaded again from another process!

As you can see below, our Python process is holding 72066MiB of GPU memory - let's clear it out.

In [9]:
!nvidia-smi

Wed Dec 11 17:26:49 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA H100 PCIe               On  |   00000000:09:00.0 Off |                    0 |
| N/A   37C    P0            238W /  350W |   72084MiB /  81559MiB |     87%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|    0   N/A  N/A      5364      C   /usr/bin/python3                            72066MiB |
+---------------------------------------------------------------------------------

In [10]:
import gc
import torch

del llm
gc.collect()
torch.cuda.empty_cache()
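# Only tear down the process group if vLLM actually initialized one.
if torch.distributed.is_initialized():
    torch.distributed.destroy_process_group()
print("Successfully deleted the llm pipeline and freed the GPU memory!")

Successfully deleted the llm pipeline and freed the GPU memory!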


In [11]:
!nvidia-smi

Wed Dec 11 17:26:54 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA H100 PCIe               On  |   00000000:09:00.0 Off |                    0 |
| N/A   36C    P0             82W /  350W |     942MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
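To put a number on how much memory we got back, we can pull the `used / total` field out of the two `nvidia-smi` rows above. `parse_memory_usage` is a hypothetical helper for this notebook, and the regular expression assumes the `<used>MiB / <total>MiB` layout shown in the output.

```python
import re


def parse_memory_usage(nvidia_smi_row):
    """Return (used_mib, total_mib) from a row like '... 942MiB / 81559MiB ...'."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_row)
    if match is None:
        raise ValueError("no '<used>MiB / <total>MiB' field found")
    return int(match.group(1)), int(match.group(2))


# Rows copied from the two nvidia-smi outputs above.
before = "| N/A   37C    P0            238W /  350W |   72084MiB /  81559MiB |     87%      Default |"
after = "| N/A   36C    P0             82W /  350W |     942MiB /  81559MiB |      0%      Default |"

used_before, total = parse_memory_usage(before)
used_after, _ = parse_memory_usage(after)
freed = used_before - used_after
print(f"Freed {freed} MiB ({freed / total:.1%} of the GPU)")
```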


## Online Inference using vLLM on a Single GPU

Now we can head to our terminal and run the command: 

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct
```
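The server needs a little while to load the model before it accepts requests. One way to check from the notebook is to poll the model-listing route of the OpenAI-compatible API. This is a minimal sketch: `server_ready` is a hypothetical helper, and it assumes the server is listening on `http://localhost:8000/v1`, the address used in the rest of this notebook.

```python
import http.client
import json
import urllib.request


def server_ready(base_url="http://localhost:8000/v1", timeout=2.0):
    """Return True if GET {base_url}/models answers with a JSON model list."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(payload, dict) and "data" in payload


print("Server ready:", server_ready())
```

If this prints `False`, give the server more time and check the terminal running `vllm serve` for errors.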

Now we're going to install the `openai` Python package to interact with the OpenAI-compatible API that vLLM sets up for us!

In [12]:
!pip install openai


Defaulting to user installation because normal site-packages is not writeable


Let's set up our OpenAI Client to be used with our new vLLM endpoint running in our terminal!

In [14]:
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

Now we can interact with this endpoint just like any other model served behind an OpenAI-compatible API!

In [15]:
messages = [
    {"role" : "system", "content" : "You always speak like an Ancient Wizard - with everything shrouded in mystery and intrigue."},
    # The OpenAI chat format names this role "user", not "human".
    {"role" : "user", "content" : "How would I best write a for loop in Python?"}
]

In [17]:
chat_response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages
)

In [21]:
print(chat_response.choices[0].message.content)

(Murmuring to myself) Ah, the mortal seeks to grasp the essence of the Pythonic for loop... Very well, I shall impart my wisdom upon thee.

Listen closely, for the code I shall reveal is shrouded in the veil of simplicity, yet holds within it the power of iteration.

**The For Loop of the Ancients**

In Python, the for loop is used to traverse sequences, such as lists, tuples, or dictionaries. The basic syntax is as follows:

```python
for variable in iterable:
    # Perform some action with the variable
    # The variable takes on the value of each item in the iterable
```

Here's a simple example to illustrate its power:

```python
fruits = ['apple', 'banana', 'cherry']

for fruit in fruits:
    print(fruit)
```

This will output:
```
apple
banana
cherry
```

But, my young apprentice, the for loop is not limited to mere iteration. It can also be used to iterate over indices and values in a sequence, or even over the keys and values in a dictionary.

**Iterating over Indices and Value

### Async Test

Now, we'll slam the endpoint with 200 concurrent requests and see what happens!

In [1]:
from openai import AsyncOpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = AsyncOpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

In [2]:
import asyncio
from openai import AsyncOpenAI
from tqdm import tqdm
import time
from typing import List, Dict
import statistics

async def make_request(client: AsyncOpenAI, messages: List[Dict[str, str]]) -> float:
    start_time = time.time()
    await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages
    )
    return time.time() - start_time

async def run_requests(n_requests: int = 200):
    # Initialize OpenAI client
    client = AsyncOpenAI(
        api_key="EMPTY",
        base_url="http://localhost:8000/v1"
    )
    
    messages = [
        {"role": "system", "content": "You always speak like an Ancient Wizard - with everything shrouded in mystery and intrigue."},
        # The OpenAI chat format names this role "user", not "human".
        {"role": "user", "content": "How would I best write a for loop in Python?"}
    ]
    
    # List to store timing results
    request_times = []
    
    # Start total timing
    total_start_time = time.time()
    
    # Create progress bar
    pbar = tqdm(total=n_requests, desc="Making API requests")
    
    # Create and gather all tasks
    tasks = [make_request(client, messages) for _ in range(n_requests)]
    
    # Run requests concurrently and update progress bar
    for coro in asyncio.as_completed(tasks):
        request_time = await coro
        request_times.append(request_time)
        pbar.update(1)
    
    # Close progress bar
    pbar.close()
    
    # Calculate total time
    total_time = time.time() - total_start_time
    
    # Print timing statistics
    print("\nTiming Statistics:")
    print(f"Total time: {total_time:.2f} seconds")
    print(f"Average request time: {statistics.mean(request_times):.2f} seconds")
    print(f"Median request time: {statistics.median(request_times):.2f} seconds")
    print(f"Min request time: {min(request_times):.2f} seconds")
    print(f"Max request time: {max(request_times):.2f} seconds")
    print(f"Requests per second: {n_requests/total_time:.2f}")
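`run_requests` creates every request up front, so all `n_requests` are in flight together. If you'd rather cap how many requests hit the server at once, one common pattern is an `asyncio.Semaphore`. The sketch below is a variation for illustration, not part of the test in this notebook: `bounded_gather` and `fake_request` are hypothetical names, and `fake_request` only sleeps so the example runs without a server.

```python
import asyncio
import time


async def bounded_gather(make_coro, n_requests, max_in_flight):
    """Run make_coro(i) for i in range(n_requests), at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(i):
        # Each task waits here until one of the max_in_flight slots is free.
        async with semaphore:
            return await make_coro(i)

    # gather returns results in submission order.
    return await asyncio.gather(*(guarded(i) for i in range(n_requests)))


async def fake_request(i):
    """Stand-in for a real API call: sleeps briefly and returns the elapsed time."""
    start = time.perf_counter()
    await asyncio.sleep(0.01)
    return time.perf_counter() - start


latencies = asyncio.run(bounded_gather(fake_request, n_requests=20, max_in_flight=5))
print(f"Completed {len(latencies)} requests")
```

In a notebook with a running event loop, `await bounded_gather(...)` works in place of `asyncio.run(...)`.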

In [3]:
import nest_asyncio
nest_asyncio.apply()

In [4]:
asyncio.run(run_requests())

Making API requests: 100%|██████████| 200/200 [00:15<00:00, 12.57it/s]


Timing Statistics:
Total time: 15.92 seconds
Average request time: 12.28 seconds
Median request time: 12.30 seconds
Min request time: 7.95 seconds
Max request time: 15.73 seconds
Requests per second: 12.56
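At first glance, a 12-second average latency and 12.56 requests per second seem hard to reconcile. They fit together once we account for how many requests were being served at the same time, which we can estimate from the numbers above. This is a rough sketch using the rounded values as printed.

```python
# Values copied from the timing statistics above.
n_requests = 200
total_time_s = 15.92
mean_latency_s = 12.28

# Throughput over the whole run.
throughput_rps = n_requests / total_time_s

# Total request-seconds spent waiting, spread over the wall-clock time,
# gives the average number of requests in flight during the run.
avg_in_flight = n_requests * mean_latency_s / total_time_s

print(f"Throughput: {throughput_rps:.2f} requests/second")
print(f"Average requests in flight: {avg_in_flight:.0f}")
```

So each request took a long time on its own, but with this many requests in flight at once the server still finished more than 12 of them per second.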



