# worker-vllm Colab Smoke Test

This notebook installs explicit package versions and runs quick tests for `transformers` and `vllm`.

In [5]:
import os
import platform
import subprocess
import sys

print('Python:', sys.version)
print('Platform:', platform.platform())
print('COLAB_GPU:', os.environ.get('COLAB_GPU', 'N/A'))

try:
    subprocess.run(['nvidia-smi'], check=False)
except FileNotFoundError:
    print('nvidia-smi not found. In Colab, select Runtime > Change runtime type and pick a GPU first.')

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Platform: Linux-6.6.105+-x86_64-with-glibc2.35
COLAB_GPU: 1
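
The cell above prints whatever `nvidia-smi` emits. As a rough sketch, the same check can be reduced to one line per GPU using the tool's CSV query mode. `gpu_summary` is a hypothetical helper name, not part of the repo, and it returns an empty list when `nvidia-smi` is missing or fails.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one 'name, total memory' string per visible GPU, or [] if none is found."""
    if shutil.which('nvidia-smi') is None:
        return []
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=name,memory.total', '--format=csv,noheader'],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    # Keep only non-empty lines; each line describes one GPU.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(gpu_summary() or 'No GPU visible.')
```

An empty result is a cue to switch the runtime type before running the install cells.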


## 1) Repo Path Setup

- If this notebook is already opened inside the repo, it uses the current directory.
- Otherwise it clones `runpod-workers/worker-vllm` to `/content/worker-vllm`.

In [1]:
import os
import subprocess
from pathlib import Path

REPO_URL = 'https://github.com/runpod-workers/worker-vllm.git'
REPO_DIR = Path('/content/worker-vllm')

if (Path.cwd() / 'Dockerfile').exists() and (Path.cwd() / 'builder/requirements.txt').exists():
    REPO_DIR = Path.cwd()
elif not REPO_DIR.exists():
    subprocess.run(['git', 'clone', REPO_URL, str(REPO_DIR)], check=True)

os.chdir(REPO_DIR)
print('Using repo:', REPO_DIR)
print('requirements:', (REPO_DIR / 'builder/requirements.txt').exists())

Using repo: /content/worker-vllm
requirements: True


## 2) Install Dependencies

This installs the requested package list directly, then installs `vllm==0.15.0`.

In [2]:
import subprocess
import sys

PACKAGES = [
    'ray',
    'pandas',
    'pyarrow',
    'runpod>=1.8,<2.0',
    'huggingface-hub',
    'packaging',
    'typing-extensions>=4.8.0',
    'pydantic',
    'pydantic-settings',
    'hf-transfer',
    'transformers>=4.57.6',
    'bitsandbytes>=0.45.0',
    'kernels',
    'torch==2.6.0',
    'autoawq',
]

commands = [
    [sys.executable, '-m', 'pip', 'install', '--upgrade', 'pip'],
    [sys.executable, '-m', 'pip', 'install', *PACKAGES],
    [sys.executable, '-m', 'pip', 'install', 'vllm==0.15.0'],
]

for cmd in commands:
    print('$', ' '.join(cmd))
    subprocess.run(cmd, check=True)

print('Dependency install complete.')

$ /usr/bin/python3 -m pip install --upgrade pip
$ /usr/bin/python3 -m pip install ray pandas pyarrow runpod>=1.8,<2.0 huggingface-hub packaging typing-extensions>=4.8.0 pydantic pydantic-settings hf-transfer transformers>=4.57.6 bitsandbytes>=0.45.0 kernels torch==2.6.0 autoawq
$ /usr/bin/python3 -m pip install vllm==0.15.0
Dependency install complete.


If imports fail right after the install, restart the runtime once and rerun from the next cell.

In [None]:
import torch
import transformers
import vllm

print('torch:', torch.__version__)
print('torch CUDA:', torch.version.cuda)
print('cuda available:', torch.cuda.is_available())
print('transformers:', transformers.__version__)
print('vllm:', vllm.__version__)

if not torch.cuda.is_available():
    raise RuntimeError('CUDA GPU not detected. Use a GPU runtime in Colab.')

torch: 2.9.1+cu128
torch CUDA: 12.8
cuda available: True
transformers: 4.57.6
vllm: 0.15.0
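
The version output above reports `torch: 2.9.1+cu128` although the package list pins `torch==2.6.0`, which suggests a later install step replaced the pin. A minimal sketch for surfacing this kind of drift follows. It uses only the standard library, `check_exact_pins` is a hypothetical helper, and it handles `name==version` pins only (range specifiers are skipped).

```python
from importlib.metadata import PackageNotFoundError, version


def check_exact_pins(pins, get_version=version):
    """Return {name: (wanted, installed)} for every '==' pin that is not satisfied."""
    mismatches = {}
    for pin in pins:
        if '==' not in pin:
            continue  # range specifiers are out of scope for this sketch
        name, wanted = pin.split('==', 1)
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Drop local build tags such as '+cu128' before comparing.
        base = installed.split('+', 1)[0] if installed else None
        if base != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


print(check_exact_pins(['torch==2.6.0', 'vllm==0.15.0']))
```

An empty dictionary means every exact pin matches what is installed. Any entry names a package to investigate before trusting later results.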


## 3) Optional: worker-vllm Import Check

Some Colab `vllm` builds may not expose `vllm.entrypoints.openai.protocol`.
In that case, this section is skipped and you can continue with inference smoke tests.
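
One detail of this kind of probe: `importlib.util.find_spec` imports the parent package for a dotted name, so it raises instead of returning `None` when the parent itself is missing. The sketch below wraps that case. `module_available` is a hypothetical helper, and the module names in the usage lines are standard-library stand-ins rather than vLLM paths.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if the module can be located, False otherwise (never raises on a missing parent)."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # ImportError: a parent package is missing or fails to import.
        # ValueError: the module is already loaded but has no usable __spec__.
        return False


print(module_available('json.decoder'))
print(module_available('json.no_such_submodule'))
print(module_available('no_such_package_xyz.sub'))
```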

In [3]:
import importlib.util
import os
import vllm

os.environ['MODEL_NAME'] = os.environ.get('MODEL_NAME', 'Qwen/Qwen2.5-7B-Instruct-AWQ')
os.environ['TOKENIZER_NAME'] = os.environ.get('TOKENIZER_NAME', os.environ['MODEL_NAME'])
os.environ['MAX_MODEL_LEN'] = os.environ.get('MAX_MODEL_LEN', '2048')
os.environ['GPU_MEMORY_UTILIZATION'] = os.environ.get('GPU_MEMORY_UTILIZATION', '0.90')

print('vllm:', vllm.__version__)
protocol_spec = importlib.util.find_spec('vllm.entrypoints.openai.protocol')
print('has vllm.entrypoints.openai.protocol:', bool(protocol_spec))

if protocol_spec is None:
    print('Skip: worker-vllm import check (module path not available in this build).')
else:
    try:
        from src.engine_args import get_engine_args
        engine_args = get_engine_args()
        print(engine_args)
    except Exception as e:
        print('worker-vllm import check failed:', repr(e))
        print('Continue with sections 4 and 5 for transformers/vllm smoke tests.')

vllm: 0.15.0
has vllm.entrypoints.openai.protocol: False
Skip: worker-vllm import check (module path not available in this build).


## 4) transformers Smoke Test

In [None]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEST_MODEL = os.environ.get('TEST_MODEL', os.environ.get('MODEL_NAME', 'Qwen/Qwen3-8B-AWQ'))
PROMPT = 'Write one short sentence about why low-latency inference matters.'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained(TEST_MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    TEST_MODEL,
    dtype='auto',  # `torch_dtype` is deprecated (see the warning in the recorded output below)
    device_map='auto',
    trust_remote_code=True,
)

inputs = tokenizer(PROMPT, return_tensors='pt').to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)

transformers_output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print('Model:', TEST_MODEL)
print(transformers_output)

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

`torch_dtype` is deprecated! Use `dtype` instead!
We suggest you to set `dtype=torch.float16` for better efficiency on CUDA/XPU with AWQ.


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.24G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.85G [00:00<?, ?B/s]

TypeError: Qwen3ForCausalLM.__init__() got an unexpected keyword argument 'quantization'
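
The recorded run above stops at a `TypeError` while loading the model, so this section produced no generated text. One way to keep a smoke-test notebook informative when a step fails is to record the outcome and move on. The sketch below assumes a hypothetical helper, `run_step`. The failing function is a stand-in for the model-loading call, not a reproduction of it.

```python
def run_step(fn):
    """Run one smoke-test step and return (ok, detail) instead of raising."""
    try:
        return True, fn()
    except Exception as exc:  # broad on purpose: any failure is reported, not fatal
        return False, f'{type(exc).__name__}: {exc}'


def failing_step():
    # Stand-in for a step that raises, such as a model load with a bad argument.
    raise TypeError("unexpected keyword argument 'quantization'")


results = {
    'transformers': run_step(failing_step),
    'arithmetic': run_step(lambda: 1 + 1),
}

for name, (ok, detail) in results.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'} ({detail})")
```

With this pattern, a failed transformers step still leaves a summary line, and the vLLM step can run and be judged on its own.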

## 5) vLLM Smoke Test

In [2]:
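import gc
import os

import torch
from vllm import LLM, SamplingParams

# Free VRAM before creating the vLLM engine. TEST_MODEL and PROMPT come from section 4.
if 'model' in globals():
    del model
gc.collect()  # collect lingering references so cached GPU memory can be released
torch.cuda.empty_cache()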

sampling_params = SamplingParams(temperature=0.0, max_tokens=40)
llm = LLM(
    model=TEST_MODEL,
    tokenizer=TEST_MODEL,
    trust_remote_code=True,
    gpu_memory_utilization=float(os.environ.get('GPU_MEMORY_UTILIZATION', '0.90')),
)

outputs = llm.generate([PROMPT], sampling_params)
vllm_output = outputs[0].outputs[0].text.strip()
print(vllm_output)

if not vllm_output:
    raise RuntimeError('vLLM output is empty.')

print('vLLM smoke test passed.')

INFO 02-07 17:09:37 [utils.py:261] non-default args: {'tokenizer': 'Qwen/Qwen3-8B-AWQ', 'trust_remote_code': True, 'disable_log_stats': True, 'model': 'Qwen/Qwen3-8B-AWQ'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


INFO 02-07 17:09:56 [model.py:541] Resolved architecture: Qwen3ForCausalLM
INFO 02-07 17:09:56 [model.py:1561] Using max model len 40960
INFO 02-07 17:09:56 [awq_marlin.py:162] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 02-07 17:09:56 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=8192.


Parse safetensors files:   0%|          | 0/2 [00:00<?, ?it/s]

INFO 02-07 17:09:59 [vllm.py:624] Asynchronous scheduling is enabled.


generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]




Done. If you want to test a different model, set `MODEL_NAME`/`TEST_MODEL` and rerun sections 4 and 5.
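
The lookup order used in section 4 can be sketched as a small function: `TEST_MODEL` wins, then `MODEL_NAME`, then the built-in default. `resolve_test_model` is a hypothetical name, and `org/model-a` and `org/model-b` are placeholders rather than real model IDs.

```python
import os

DEFAULT_MODEL = 'Qwen/Qwen3-8B-AWQ'  # the default used in section 4


def resolve_test_model(env=os.environ, default=DEFAULT_MODEL):
    """Mirror section 4: TEST_MODEL first, then MODEL_NAME, then the default."""
    return env.get('TEST_MODEL', env.get('MODEL_NAME', default))


print(resolve_test_model({}))
print(resolve_test_model({'MODEL_NAME': 'org/model-a'}))
print(resolve_test_model({'MODEL_NAME': 'org/model-a', 'TEST_MODEL': 'org/model-b'}))
```

Note that section 3 writes `MODEL_NAME` into `os.environ`, so running it in the same kernel changes which model section 4 picks unless `TEST_MODEL` is set.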