# 06-02 : vLLM Offline Batch Inference

## References

- [Quickstart](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#offline-batched-inference)
- [Deploying an LLM using MLRun](https://docs.mlrun.org/en/v1.9.1/tutorials/genai_01_basic_tutorial.html)
- [Function hub](https://docs.mlrun.org/en/latest/runtimes/load-from-hub.html)
- [Functions hub Repo](https://github.com/mlrun/functions)
- [Building a docker image using a Dockerfile and then using it](https://docs.mlrun.org/en/v1.9.1/runtimes/images.html#building-a-docker-image-using-a-dockerfile-and-then-using-it)
- [MLRun runtime images](https://docs.mlrun.org/en/v1.9.1/runtimes/images.html#mlrun-runtime-images)
- [MLRun Images](https://github.com/mlrun/mlrun/tree/development/dockerfiles)
- [CPU, GPU, and memory limits for user jobs](https://github.com/yaronha/mlrun/blob/36049e1f748e277bf380d44043129d408581a893/docs/runtimes/configuring-job-resources.md#cpu-gpu-and-memory-limits-for-user-jobs)
- [Select OpenAI or Ollama](https://github.com/liranbg/mlrun/blob/d61b4e2f37215ae9564aeb1dbbeee08c7f0c9d2c/docs/genai/development/working-with-rag.ipynb#L202)
- [gpt-oss vLLM Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html)

In [1]:
import mlrun

In [2]:
# Show the API server URL
mlrun.get_run_db()

HTTPRunDB('http://dragon.local:30070')

## 1. Configuration

In [3]:
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
project_name = "llm-batch" # the project name

### 1.1 Create The Project

In [4]:
project = mlrun.get_or_create_project(
    name=project_name,
    user_project=True)

# Display the current project name
project_name = project.metadata.name
print(f'Full project name: {project_name}')

> 2025-08-06 13:38:37,628 [info] Loading project from path: {"path":"./","project_name":"llm-batch","user_project":true}
> 2025-08-06 13:38:37,655 [info] Project loaded successfully: {"path":"./","project_name":"llm-batch-johannes","stored_in_db":true}
Full project name: llm-batch-johannes


### 1.2 Model Cache Directory

In [5]:
# the cache directory for the model
CACHE_DIR = mlrun.mlconf.artifact_path
CACHE_DIR = (
    CACHE_DIR.replace("v3io://", "/v3io").replace("{{run.project}}", project.name)
    + "/cache"
)
print(f"Cache directory: {CACHE_DIR}")

Cache directory: s3://mlrun/projects/llm-batch-johannes/artifacts/cache
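
The printed path starts with `s3://`, so on this cluster the `v3io://` replacement in the cell above had no effect and only the `{{run.project}}` placeholder was filled in. The same logic can be sketched as a small standalone helper. The function name `resolve_cache_dir` and the template strings in the usage lines are illustrative assumptions, not values read from the MLRun configuration.

```python
def resolve_cache_dir(artifact_path: str, project_name: str) -> str:
    """Derive a per-project cache path from an artifact-path template.

    Mirrors the notebook cell: fill in the project placeholder, map a
    v3io:// URL to its /v3io mount point, and append "/cache".
    """
    path = artifact_path.replace("{{run.project}}", project_name)
    if path.startswith("v3io://"):
        # "v3io:///projects/x" becomes "/v3io/projects/x"
        path = "/v3io" + path[len("v3io://"):]
    return path.rstrip("/") + "/cache"


# Hypothetical templates, chosen to match the output shown above
print(resolve_cache_dir("s3://mlrun/projects/{{run.project}}/artifacts", "llm-batch-johannes"))
print(resolve_cache_dir("v3io:///projects/{{run.project}}/artifacts", "llm-batch-johannes"))
```

Note that an `s3://` URL is not a local directory, so a consumer that expects a filesystem path for its model cache would probably need a mounted or local path instead.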


## 2. vLLM Docker Image

### Manual Test

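Save the following manifest as `vllm-pod.yaml`; it starts an interactive pod from the vLLM image with one GPU:

```yaml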
apiVersion: v1
kind: Pod
metadata:
  name: vllm-shell
  namespace: mlrun
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: vllm-container
    image: registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1
    command: ["/bin/bash"]
    stdin: true
    tty: true
    resources:
      limits:
        nvidia.com/gpu: "1"
    imagePullPolicy: Always
  imagePullSecrets:
  - name: registry-credentials
```

Then create the pod, attach to it, and check that the GPU is visible:

```bash
# Apply the YAML file to create the pod
sudo kubectl apply -f vllm-pod.yaml

# Attach to the running pod to get your interactive shell
sudo kubectl attach -n mlrun -it vllm-shell

# Inside the pod: confirm the GPU is visible and print the PyTorch and CUDA versions
nvidia-smi
python3 -c 'import torch; print(torch.__version__); print(torch.version.cuda)'
```
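
If the test pod needs to be recreated for different image tags or GPU counts, the manifest can also be generated rather than edited by hand. The following is a minimal sketch that builds the same pod spec as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The function name `vllm_shell_pod` and its parameters are illustrative assumptions.

```python
import json


def vllm_shell_pod(image, gpus=1, name="vllm-shell", namespace="mlrun"):
    """Return the interactive test pod from vllm-pod.yaml as a dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "runtimeClassName": "nvidia",
            "containers": [
                {
                    "name": "vllm-container",
                    "image": image,
                    "command": ["/bin/bash"],
                    "stdin": True,
                    "tty": True,
                    # Extended resources such as GPUs are requested via limits
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                    "imagePullPolicy": "Always",
                }
            ],
            "imagePullSecrets": [{"name": "registry-credentials"}],
        },
    }


manifest = vllm_shell_pod("registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1")
print(json.dumps(manifest, indent=2))
```

Redirecting the output to a file such as `vllm-pod.json` gives a manifest that can be applied in place of the YAML version.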

### Manage Remote Registry

Background: [How to delete images from a private Docker registry](https://stackoverflow.com/questions/25436742/how-to-delete-images-from-a-private-docker-registry)

```bash
# list repositories
curl -s dragon:30500/v2/_catalog -u mlrun:mlpass | jq
```

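Delete every tag of a repository by resolving each tag to its manifest digest and deleting that digest (the registry must run with `storage.delete.enabled: true`, otherwise it rejects the `DELETE` request):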

```bash
REPO="vllm-batch"
REG="dragon:30500"
AUTH="mlrun:mlpass"

for tag in $(curl -s -u $AUTH http://$REG/v2/$REPO/tags/list | jq -r '.tags[]'); do
  digest=$(curl -sI -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
    -u $AUTH http://$REG/v2/$REPO/manifests/$tag | awk -F': ' '/Docker-Content-Digest/ {print $2}' | tr -d $'\r')
  echo "Deleting $REPO:$tag -> $digest"
  curl -s -X DELETE -u $AUTH http://$REG/v2/$REPO/manifests/$digest
done
```
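
The same loop can be sketched in Python with only the standard library, which is convenient when the cleanup should run from the notebook. This is a sketch under the same assumptions as the shell version (plain HTTP, basic auth, deletion enabled on the registry). The function names are illustrative, and the sketch additionally skips digests it has already deleted, since several tags can point at the same manifest.

```python
import base64
import json
import urllib.request

MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def basic_auth(user, password):
    """Build an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


def manifest_url(registry, repo, reference):
    """URL of a manifest addressed by tag or by digest."""
    return f"http://{registry}/v2/{repo}/manifests/{reference}"


def delete_repo_tags(registry, repo, user, password):
    """Delete the manifest behind every tag of `repo`; return the deleted digests."""
    auth = basic_auth(user, password)
    tags_req = urllib.request.Request(f"http://{registry}/v2/{repo}/tags/list", headers=auth)
    with urllib.request.urlopen(tags_req) as resp:
        tags = json.load(resp).get("tags") or []

    deleted = set()
    for tag in tags:
        # A HEAD request with the v2 Accept header returns the digest in a response header
        head = urllib.request.Request(
            manifest_url(registry, repo, tag),
            method="HEAD",
            headers={**auth, "Accept": MANIFEST_V2},
        )
        with urllib.request.urlopen(head) as resp:
            digest = resp.headers.get("Docker-Content-Digest")
        if not digest or digest in deleted:
            continue
        print(f"Deleting {repo}:{tag} -> {digest}")
        delete = urllib.request.Request(manifest_url(registry, repo, digest), method="DELETE", headers=auth)
        with urllib.request.urlopen(delete):
            pass
        deleted.add(digest)
    return deleted
```

With the values used above, the call would be `delete_repo_tags("dragon:30500", "vllm-batch", "mlrun", "mlpass")`. As with the shell loop, disk space is only reclaimed after garbage collection.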


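Garbage collect the registry to reclaim disk space. The first command removes blobs that no manifest references; the second adds `-m` (`--delete-untagged`) to also remove manifests that no tag references: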

```bash
docker run --rm \
  -v /home/johnny/swan/opt/registry:/var/lib/registry \
  registry:3 \
  sh -c 'printf "%s\n" \
"version: 0.1" \
"storage:" \
"  delete:" \
"    enabled: true" \
"  filesystem:" \
"    rootdirectory: /var/lib/registry" \
| tee /tmp/c.yml && registry garbage-collect /tmp/c.yml'

# Same as above, plus -m (--delete-untagged) to remove untagged manifests
docker run --rm \
  -v /home/johnny/swan/opt/registry:/var/lib/registry \
  registry:3 \
  sh -c 'printf "%s\n" \
"version: 0.1" \
"storage:" \
"  delete:" \
"    enabled: true" \
"  filesystem:" \
"    rootdirectory: /var/lib/registry" \
| tee /tmp/c.yml && registry garbage-collect -m /tmp/c.yml'
```

### 2.1 Dockerfile

In [6]:
%%writefile Dockerfile

FROM dragon:30500/mlrun/mlrun:1.9.1

RUN pip install --upgrade pip \
 && pip install --no-cache-dir --force-reinstall \
      numpy==1.26.4 \
      pandas==2.1.4 \
      mlrun==1.9.1 \
      vllm==0.10.0 
      
# RUN pip install --no-cache-dir \
#     torch==2.7.1 \
#     torchvision==0.22.1 \
#     torchaudio==2.7.1 \
#     --index-url https://download.pytorch.org/whl/cu124
    
# RUN pip install --no-cache-dir \
#     vllm==0.10.0 

RUN touch /tmp/.vllm_ready

Overwriting Dockerfile


### 2.2 Build the Docker Image

In [7]:
!docker build -t dragon:30500/vllm-batch:0.0.1 -f Dockerfile .

zsh:1: command not found: docker


In [8]:
#docker run -it --rm --gpus all dragon:30500/vllm-batch:0.0.1 /bin/bash

# sudo ctr image pull --plain-http=true --user "mlrun:mlpass" dragon:30500/vllm-batch:0.0.1

# sudo ctr run --rm --runtime io.containerd.runc.v2 --runc-binary /usr/bin/nvidia-container-runtime \
#  dragon:30500/vllm-batch:0.0.1 test-gpu \
#  bash -c "nvidia-smi"


# sudo ctr run --rm -t \
#   --runc-binary=/usr/bin/nvidia-container-runtime \
#   --env NVIDIA_VISIBLE_DEVICES=all \
#   dragon:30500/vllm-batch:0.0.1 test-gpu \
#   sh -c "nvidia-smi"

### 2.3 Push the Docker Image

In [9]:
!docker push dragon:30500/vllm-batch:0.0.1

zsh:1: command not found: docker
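
The build and push cells tag the image as `dragon:30500/vllm-batch:0.0.1`, while the pod manifest and the function definition refer to `registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1`. Assuming both hostnames front the same registry (the notebook implies this but does not state it), the in-cluster reference can be derived from the external one. The helper name `to_cluster_image` is an illustrative assumption.

```python
def to_cluster_image(
    ref: str,
    external: str = "dragon:30500",
    internal: str = "registry-service.mlrun.svc.cluster.local",
) -> str:
    """Rewrite an image reference from the external registry host to the in-cluster one."""
    prefix = external + "/"
    if not ref.startswith(prefix):
        raise ValueError(f"{ref!r} is not hosted on {external}")
    return internal + "/" + ref[len(prefix):]


print(to_cluster_image("dragon:30500/vllm-batch:0.0.1"))
```

Keeping the tag in one place and deriving the second name from it avoids the two references drifting apart when the version is bumped.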


## 3. Batch Function

In [10]:
def create_llm_function():
    global project, MODEL_ID, CACHE_DIR

    image = "registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1"

    llm_func = project.set_function(
        func="src/06-02_vllm.py",
        name="llm-batch",
        kind="job",
        image=image,
        handler="vllm_batch",
        tag="v0.0.1",
    )

    llm_func.with_limits(gpus=1, gpu_type='nvidia.com/gpu')

    llm_func.set_envs({
        "MODEL_ID": MODEL_ID,
        "CACHE_DIR": CACHE_DIR,
    })
    
    llm_func.save()

    return llm_func

# Create the function
llm_func = create_llm_function()

In [11]:
# build the function
project.build_function(function='llm-batch', force_build=True)

> 2025-08-06 13:38:38,915 [info] Started building image: .mlrun/func-llm-batch-johannes-llm-batch:v0.0.1


The `overwrite_build_params` parameter default will change from 'False' to 'True' in 1.10.0.


INFO[0000] Retrieving image manifest registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1
INFO[0000] Retrieving image registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1 from registry registry-service.mlrun.svc.cluster.local
INFO[0000] Built cross stage deps: map[]
INFO[0000] Retrieving image manifest registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1
INFO[0000] Returning cached image manifest
INFO[0000] Executing 0 build triggers
INFO[0000] Building stage 'registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1' [idx: '0', base-idx: '-1']
INFO[0000] Skipping unpacking as no commands require it.
INFO[0000] Pushing image to registry-service.mlrun.svc.cluster.local/mlrun/func-llm-batch-johannes-llm-batch:v0.0.1
INFO[0000] Pushed registry-service.mlrun.svc.cluster.local/mlrun/func-llm-batch-johannes-llm-batch@sha256:685f46c42a

BuildStatus(ready=True, outputs={'image': '.mlrun/func-llm-batch-johannes-llm-batch:v0.0.1'})
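
The handler source `src/06-02_vllm.py` is not included in this notebook. Judging only from the log and the `return` result of the run in section 4.1, a handler with that behavior could look roughly like the sketch below. It is a hypothetical reconstruction, not the author's file: it reports the device and the environment variables set by `create_llm_function()` and does not yet load the model with vLLM. The guarded `torch` import is an addition so the sketch also runs outside the image.

```python
import os

try:
    import torch  # expected to be present in the vLLM image; optional elsewhere
except ImportError:
    torch = None


def vllm_batch(event: dict = None) -> str:
    """Smoke-test handler: report GPU visibility and the configured model settings."""
    event = event or {}
    print(f"Received event: {event}")

    gpu_available = bool(torch is not None and torch.cuda.is_available())
    device = "cuda" if gpu_available else "cpu"
    print(f"Using device: {device}")

    if gpu_available:
        print(torch.cuda.get_device_name(0))
        print("Memory Usage:")
        print(f"Allocated: {round(torch.cuda.memory_allocated(0) / 1024**3, 1)} GB")
        print(f"Cached:    {round(torch.cuda.memory_reserved(0) / 1024**3, 1)} GB")

    # Set on the function via llm_func.set_envs(...)
    model_id = os.environ.get("MODEL_ID", "")
    cache_dir = os.environ.get("CACHE_DIR", "")

    return (
        f"Device: {device}, Model ID: {model_id}, "
        f"Cache Directory: {cache_dir}, GPU Available: {gpu_available}"
    )
```

The `event` argument is filled from the run's `params`, and the returned string is what appears under `return` in the run results.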

## 4. Run the function

In [12]:
# Empty event payload for the initial smoke test
event = {}

### 4.1 Initial Test

In [13]:
# Uncomment to re-register the function after changing src/06-02_vllm.py
# llm_func = create_llm_function()

# run the function
project.run_function(
    function='llm-batch',
    params={
        "event": event
    },
    watch=True
)

> 2025-08-06 13:39:05,405 [info] Storing function: {"db":"http://dragon.local:30070","name":"llm-batch-vllm-batch","uid":"f6a079a10edd42c1aeeb9403eefba1cf"}
> 2025-08-06 13:39:05,627 [info] Job is running in the background, pod: llm-batch-vllm-batch-h5sj4
INFO 08-06 11:39:10 [__init__.py:235] Automatically detected platform cuda.
Received event: {}
Using device: cuda

NVIDIA GeForce RTX 3090 Ti
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB
> 2025-08-06 11:39:11,366 [info] To track results use the CLI: {"info_cmd":"mlrun get run f6a079a10edd42c1aeeb9403eefba1cf -p llm-batch-johannes","logs_cmd":"mlrun logs f6a079a10edd42c1aeeb9403eefba1cf -p llm-batch-johannes"}
> 2025-08-06 11:39:11,367 [info] Run execution finished: {"name":"llm-batch-vllm-batch","status":"completed"}


| Field | Value |
| --- | --- |
| project | llm-batch-johannes |
| uid | ...fba1cf |
| iter | 0 |
| start | Aug 06 11:39:07 |
| end | 2025-08-06 11:39:11.362794+00:00 |
| state | completed |
| kind | run |
| name | llm-batch-vllm-batch |
| labels | v3io_user=johannes, kind=job, owner=johannes, mlrun/client_version=1.9.1, mlrun/client_python_version=3.9.23, host=llm-batch-vllm-batch-h5sj4 |
| inputs | |
| parameters | event={} |
| results | return=Device: cuda, Model ID: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, Cache Directory: s3://mlrun/projects/llm-batch-johannes/artifacts/cache, GPU Available: True |

> 2025-08-06 13:39:18,714 [info] Run execution finished: {"name":"llm-batch-vllm-batch","status":"completed"}


<mlrun.model.RunObject at 0x1662ebd30>