## VLLM Inference
- During this task you will need to open the Linux terminal on your jumphost desktop as well as run commands in the notebook cells

In [None]:
##### Logging into OpenShift #####
### Set Student Number ###
student_number = "##"      # Replace with your student number

if student_number == "##":
    raise ValueError("Please set your student number in the 'student_number' variable.")

### Login to OpenShift ###
!oc login -u s{student_number} -p"!@34QWer" https://api.ocp.ucsx.hl.dns:6443 --insecure-skip-tls-verify
!oc project ai-s{student_number}

In [None]:
### Get the name of the pod in your namespace ###
!oc get pods

# NAME                                           READY   STATUS      RESTARTS   AGE
# vllm-deployment-74fb75bfb7-rngg2               1/1     Running     0          13m
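
If you would rather not copy the pod name by hand, the output above can be parsed in Python. This is a minimal sketch: `find_pod_name` is a hypothetical helper (not part of `oc`), and it assumes your pod name starts with `vllm-deployment`, as in the sample output.

```python
def find_pod_name(oc_output: str, prefix: str = "vllm-deployment") -> str:
    """Return the first pod name in `oc get pods` output that starts with prefix."""
    for line in oc_output.splitlines():
        fields = line.split()
        # The pod name is the first whitespace-separated column
        if fields and fields[0].startswith(prefix):
            return fields[0]
    raise LookupError(f"no pod starting with {prefix!r} found")


# Sample output copied from the cell above
sample = """\
NAME                                           READY   STATUS      RESTARTS   AGE
vllm-deployment-74fb75bfb7-rngg2               1/1     Running     0          13m
"""

pod_name = find_pod_name(sample)
print(f"oc exec -it {pod_name} -- bash")
```

In a notebook cell, `lines = !oc get pods` captures the command output as a list of lines, which can be joined with `"\n"` and passed to `find_pod_name`.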

## Accessing the VLLM Deployment Pod
- From your jumphost desktop, open three terminal sessions
- In all three sessions, run the following command to open a bash shell in the VLLM Deployment pod
- Your pod name will be different; use the pod name from the previous step
```
oc exec -it vllm-deployment-74fb75bfb7-rngg2 -- bash
```

## Terminal 1 - nvidia-smi monitoring
- This command will show the GPU usage in your pod and update it every 2 seconds
- Refer back to this terminal while AI inference is running
- Run command:
```
watch nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40                     On  |   00000000:3D:00.0 Off |                    0 |
| N/A   54C    P0            110W /  300W |    5453MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```


## Terminal 2 - Start Inference Server
- In this terminal you will start the VLLM inference server
- Usually this is done inside the command / args section of a Kubernetes Deployment YAML file
- For this lab you will run the command directly in the terminal so you can see the startup sequence
- This command has several important parameters:
  - It serves the Qwen3 0.6B model
  - It limits the context window to 2,048 tokens (`--max-model-len 2048`)
  - It limits vLLM to 5% of the GPU memory (`--gpu-memory-utilization 0.05`)
- As this is a shared environment, be sure to include the `--max-model-len` and `--gpu-memory-utilization` parameters to limit resource usage; otherwise, by default, vLLM will claim 90% of the VRAM
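
To get a feel for what the memory fraction means on this GPU, the arithmetic can be sketched as below. The 46068 MiB total is the figure nvidia-smi shows in Terminal 1; the helper name `vram_budget_mib` and the 0.9 comparison value are illustrative assumptions. The memory that nvidia-smi later reports for the process may differ from this budget.

```python
TOTAL_VRAM_MIB = 46068  # total GPU memory reported by nvidia-smi in Terminal 1


def vram_budget_mib(total_mib: float, gpu_memory_utilization: float) -> float:
    """Memory budget implied by a --gpu-memory-utilization fraction."""
    return total_mib * gpu_memory_utilization


# Budget with the lab setting (--gpu-memory-utilization 0.05)
lab_budget_mib = vram_budget_mib(TOTAL_VRAM_MIB, 0.05)

# Budget with a high setting such as 0.9, for comparison only
high_budget_mib = vram_budget_mib(TOTAL_VRAM_MIB, 0.9)

print(f"Lab setting (0.05): about {lab_budget_mib:.0f} MiB")
print(f"High setting (0.9): about {high_budget_mib:.0f} MiB")
```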

- Run command:
```
vllm serve Qwen/Qwen3-0.6B --trust-remote-code --enable-chunked-prefill --max-num-batched-tokens 1024 --max-model-len 2048 --gpu-memory-utilization 0.05
```




Look through the logs as it starts. You will notice that it:
- Downloads the model from the internet the first time you run it (the download is saved in the persistent volume)
- Loads the model into GPU memory
- Starts the inference server and shows all the URL endpoints that it serves

```
...
(EngineCore_DP0 pid=854) INFO 11-17 14:43:12 [weight_utils.py:392] Using model weights format ['*.safetensors']
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████| 1.50G/1.50G [01:32<00:00, 16.2MB/s]
(EngineCore_DP0 pid=854) INFO 11-17 14:44:45 [weight_utils.py:413] Time spent downloading weights for Qwen/Qwen3-0.6B: 93.483830 seconds
(EngineCore_DP0 pid=854) INFO 11-17 14:44:46 [weight_utils.py:450] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.97it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.96it/s]

...
(APIServer pid=553) INFO 11-17 14:45:24 [launcher.py:42] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=553) INFO 11-17 14:45:24 [launcher.py:42] Route: /v1/chat/completions, Methods: POST
(APIServer pid=553) INFO 11-17 14:45:24 [launcher.py:42] Route: /v1/completions, Methods: POST
...
(APIServer pid=553) INFO:     Started server process [553]
(APIServer pid=553) INFO:     Waiting for application startup.
(APIServer pid=553) INFO:     Application startup complete.
```
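
If you copy the startup log into the notebook, the served endpoints can be listed with a short parser. This is a sketch only: `parse_routes` is a hypothetical helper, and the regular expression assumes the `Route: ..., Methods: ...` layout shown in the sample log above, which may differ between vLLM versions.

```python
import re

# Matches the tail of a launcher log line, e.g.
# "Route: /v1/completions, Methods: POST"
ROUTE_RE = re.compile(r"Route: (?P<path>\S+), Methods: (?P<methods>[A-Z, ]+)$")


def parse_routes(log_text: str) -> list[tuple[str, list[str]]]:
    """Return (path, [methods]) for every route line found in the log text."""
    routes = []
    for line in log_text.splitlines():
        match = ROUTE_RE.search(line.strip())
        if match:
            methods = [m.strip() for m in match.group("methods").split(",")]
            routes.append((match.group("path"), methods))
    return routes


# Lines copied from the sample log above
sample_log = """\
(APIServer pid=553) INFO 11-17 14:45:24 [launcher.py:42] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=553) INFO 11-17 14:45:24 [launcher.py:42] Route: /v1/chat/completions, Methods: POST
(APIServer pid=553) INFO 11-17 14:45:24 [launcher.py:42] Route: /v1/completions, Methods: POST
(APIServer pid=553) INFO:     Application startup complete.
"""

routes = parse_routes(sample_log)
for path, methods in routes:
    print(f"{'/'.join(methods):6} {path}")
```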

## Return to Terminal 1 - nvidia-smi monitoring
- You should see a new process at the bottom, along with how much VRAM it is using
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40                     On  |   00000000:3D:00.0 Off |                    0 |
| N/A   54C    P0            110W /  300W |    8878MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             854      C   VLLM::EngineCore                       3420MiB |   <----
+-----------------------------------------------------------------------------------------+

```
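
The two nvidia-smi snapshots can be cross-checked with simple arithmetic. The numbers below are copied from the sample outputs in this lab; yours will differ. The 5453 MiB baseline presumably belongs to other workloads on the shared GPU, which is an assumption.

```python
TOTAL_VRAM_MIB = 46068   # total GPU memory
used_before_mib = 5453   # Memory-Usage before starting vllm serve
used_after_mib = 8878    # Memory-Usage after the server has started
process_mib = 3420       # VLLM::EngineCore row in the Processes table

# Growth in overall usage should roughly match the new process
delta_mib = used_after_mib - used_before_mib
print(f"Usage grew by {delta_mib} MiB; the process row shows {process_mib} MiB")

# Share of the whole GPU taken by the vLLM process
share = process_mib / TOTAL_VRAM_MIB
print(f"vLLM process share of VRAM: {share:.1%}")
```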

In [None]:
### Set the base URL variable for the VLLM inference server ###
from openai import OpenAI
import httpx                # Required to ignore SSL verification

base_url = f"https://vllm-route-ai-s{student_number}.apps.ocp.ucsx.hl.dns/v1"

In [None]:
### Check available models being served by the OpenAI-compatible API ###
client = OpenAI(base_url=base_url, api_key="no-key-required", http_client=httpx.Client(verify=False))

models = client.models.list()

for model in models.data:
    print(model.id)

# Expected output:
# Qwen/Qwen3-0.6B

In [None]:
### Run Inference ###
client = OpenAI(base_url=base_url, api_key="no-key-required", http_client=httpx.Client(verify=False))

completion = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What is a GPU?"}],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True,  # Return tokens as they are generated
)

# Print each streamed chunk as it arrives
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
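
The streaming loop can also be wrapped so that it collects the full text and rough timings. This is a sketch under stated assumptions: `consume_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only mimic the `choices[0].delta.content` shape used in the cell above. The timer starts when iteration begins, so the first-token time is approximate.

```python
import time
from types import SimpleNamespace


def consume_stream(chunks, clock=time.perf_counter):
    """Join streamed text and record approximate timings."""
    start = clock()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:
            if first_token_at is None:
                first_token_at = clock()
            parts.append(text)
    end = clock()
    return {
        "text": "".join(parts),
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "total_s": end - start,
        "chunks": len(parts),
    }


def fake_chunk(text):
    """Stand-in object with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Demo with fake chunks so the example runs without a server
demo = consume_stream(
    [fake_chunk("A GPU "), fake_chunk(None), fake_chunk("is a processor.")]
)
print(demo["text"])
```

With a live server, the `completion` object returned by `client.chat.completions.create(..., stream=True)` could be passed to `consume_stream` in place of the fake chunks.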

## While the inference is running look back at Terminal 1 - nvidia-smi monitoring
- You should see the GPU usage spike while inference is being performed
- It's the GPU-Util value, in the right-hand box of the GPU row (marked with an arrow below)
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40                     On  |   00000000:3D:00.0 Off |                    0 |
| N/A   54C    P0            110W /  300W |    8878MiB /  46068MiB |      90%     Default |    <-----
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             854      C   VLLM::EngineCore                       3420MiB |
+-----------------------------------------------------------------------------------------+
```



## Look at Terminal 2 - Inference Server Logs
- You will see the inference server logs showing the incoming request and the response being generated
- The server also periodically logs a performance summary (prompt and generation throughput, running and waiting requests, cache usage)

```
(APIServer pid=553) INFO:     10.131.0.2:50816 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=553) INFO 11-17 14:54:25 [loggers.py:127] Engine 000: Avg prompt throughput: 1.3 tokens/s, Avg generation throughput: 53.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
...
```

## Terminal 3 - Running a benchmark test in the VLLM Deployment Pod
- This benchmark sends 100 prompts (at most 10 at a time) and returns a summary
- While it's running, look at Terminal 1 (nvidia-smi) and check the GPU usage again
- This requires that the vllm server is still running in Terminal 2
- Note: If other students are also actively using the GPU, the benchmark will take longer, because the GPUs are time-sliced and GPU utilisation is shared between users

Run Command:
```
vllm bench serve --backend vllm --model Qwen/Qwen3-0.6B --endpoint /v1/completions --dataset-name random --num-prompts 100 --max-concurrency 10
```

Expected Output:
```
============ Serving Benchmark Result ============
Successful requests:                     100
Maximum request concurrency:             10
Benchmark duration (s):                  7.70
Total input tokens:                      102017
Total generated tokens:                  11915
Request throughput (req/s):              12.99
Output token throughput (tok/s):         1547.38
Peak output token throughput (tok/s):    1675.00
Peak concurrent requests:                29.00
Total Token throughput (tok/s):          14796.16
---------------Time to First Token----------------
Mean TTFT (ms):                          51.04
Median TTFT (ms):                        48.34
P99 TTFT (ms):                           121.48
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.99
Median TPOT (ms):                        5.85
P99 TPOT (ms):                           11.92
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.80
Median ITL (ms):                         5.50
P99 ITL (ms):                            12.95
==================================================
```
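
The headline throughput figures in the summary can be reproduced from the counts and the duration. The sketch below uses the sample numbers above and assumes each throughput is simply a total divided by the benchmark duration. Small differences from the report are expected because the printed duration is rounded.

```python
# Values copied from the sample benchmark output above
successful_requests = 100
duration_s = 7.70
input_tokens = 102017
output_tokens = 11915
mean_tpot_ms = 5.99

# Totals divided by the benchmark duration
request_throughput = successful_requests / duration_s
output_throughput = output_tokens / duration_s
total_throughput = (input_tokens + output_tokens) / duration_s

# Approximate decode speed of a single request, from the mean time per output token
per_request_tok_s = 1000 / mean_tpot_ms

print(f"Request throughput:      {request_throughput:.2f} req/s")
print(f"Output token throughput: {output_throughput:.1f} tok/s")
print(f"Total token throughput:  {total_throughput:.1f} tok/s")
print(f"Per-request decode rate: {per_request_tok_s:.0f} tok/s")
```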


### On Terminal 2 - Inference Server Logs
- You will see multiple incoming requests and some performance metrics being displayed as the benchmark runs
```
(APIServer pid=553) INFO:     127.0.0.1:55748 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=553) INFO:     127.0.0.1:55760 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=553) INFO:     127.0.0.1:55696 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=553) INFO:     127.0.0.1:55778 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=553) INFO 11-17 15:05:15 [loggers.py:127] Engine 000: Avg prompt throughput: 7744.3 tokens/s, Avg generation throughput: 858.3 tokens/s, Running: 4 reqs, Waiting: 6 reqs, GPU KV cache usage: 86.4%, Prefix cache hit rate: 29.5%
```
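
To track these metrics over time, the periodic engine line can be turned into a dictionary. This is a sketch: `parse_engine_stats` is a hypothetical helper, and it assumes the `Engine 000: name: value, ...` layout shown in the sample log above, which may change between vLLM versions.

```python
import re


def parse_engine_stats(line: str) -> dict[str, float]:
    """Extract the numeric metrics from one periodic engine log line."""
    _, found, tail = line.partition("] Engine ")
    if not found:
        raise ValueError("not an engine stats line")
    # Drop the engine index ("000") that precedes the first metric
    _, _, stats = tail.partition(": ")
    result = {}
    for item in stats.split(", "):
        name, _, raw = item.partition(": ")
        number = re.match(r"[\d.]+", raw)
        if number:
            result[name] = float(number.group())
    return result


# Line copied from the sample log above
sample_line = (
    "(APIServer pid=553) INFO 11-17 15:05:15 [loggers.py:127] Engine 000: "
    "Avg prompt throughput: 7744.3 tokens/s, Avg generation throughput: 858.3 tokens/s, "
    "Running: 4 reqs, Waiting: 6 reqs, GPU KV cache usage: 86.4%, Prefix cache hit rate: 29.5%"
)

stats = parse_engine_stats(sample_line)
for name, value in stats.items():
    print(f"{name}: {value}")
```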
