# 06-02 : vLLM Offline Batch Inference

## References

- [Quickstart](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#offline-batched-inference)
- [Deploying an LLM using MLRun](https://docs.mlrun.org/en/v1.9.1/tutorials/genai_01_basic_tutorial.html)
- [Function hub](https://docs.mlrun.org/en/latest/runtimes/load-from-hub.html)
- [Functions hub Repo](https://github.com/mlrun/functions)
- [Building a docker image using a Dockerfile and then using it](https://docs.mlrun.org/en/v1.9.1/runtimes/images.html#building-a-docker-image-using-a-dockerfile-and-then-using-it)
- [MLRun runtime images](https://docs.mlrun.org/en/v1.9.1/runtimes/images.html#mlrun-runtime-images)
- [MLRun Images](https://github.com/mlrun/mlrun/tree/development/dockerfiles)

In [14]:
import mlrun

In [15]:
# Show the API server URL
mlrun.get_run_db()

HTTPRunDB('http://dragon.local:30070')

## 1. Configuration

In [16]:
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
project_name = "llm-batch" # the project name

### 1.1 Create The Project

In [17]:
project = mlrun.get_or_create_project(
    name=project_name,
    user_project=True)

# Display the current project name
project_name = project.metadata.name
print(f'Full project name: {project_name}')

> 2025-08-05 18:34:57,651 [info] Loading project from path: {"path":"./","project_name":"llm-batch","user_project":true}
> 2025-08-05 18:34:57,698 [info] Project loaded successfully: {"path":"./","project_name":"llm-batch-johannes","stored_in_db":true}
Full project name: llm-batch-johannes


### 1.2 Model Cache Directory

In [18]:
# the cache directory for the model
CACHE_DIR = mlrun.mlconf.artifact_path
CACHE_DIR = (
    CACHE_DIR.replace("v3io://", "/v3io").replace("{{run.project}}", project.name)
    + "/cache"
)
print(f"Cache directory: {CACHE_DIR}")

Cache directory: s3://mlrun/projects/llm-batch-johannes/artifacts/cache
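The printed path keeps its `s3://` scheme because only the `v3io://` prefix is rewritten to a mount point. A minimal sketch of the same transformation as a pure function may make that behaviour easier to check. The function name `resolve_cache_dir` and both example templates are assumptions for illustration and are not read from a live MLRun configuration. Whether the consumer of `CACHE_DIR` accepts an `s3://` URL is not established in this notebook.

```python
def resolve_cache_dir(artifact_path: str, project_name: str) -> str:
    """Mirror the cell above: map v3io URLs to a mount path, expand the
    project template, and append /cache."""
    path = artifact_path.replace("v3io://", "/v3io")
    path = path.replace("{{run.project}}", project_name)
    return path + "/cache"


# Example templates (assumed values, chosen to match the output above)
s3_path = resolve_cache_dir(
    "s3://mlrun/projects/{{run.project}}/artifacts", "llm-batch-johannes"
)
v3io_path = resolve_cache_dir(
    "v3io:///projects/{{run.project}}/artifacts", "llm-batch-johannes"
)

print(s3_path)    # the s3:// scheme is left untouched
print(v3io_path)  # the v3io:// scheme becomes a /v3io mount path
```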


## 2. vLLM Docker Image

### Manual Test

Save the following manifest as `vllm-pod.yaml`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-shell
  namespace: mlrun
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: vllm-container
    image: registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1
    command: ["/bin/bash"]
    stdin: true
    tty: true
    resources:
      limits:
        nvidia.com/gpu: "1"
    imagePullPolicy: Always
  imagePullSecrets:
  - name: registry-credentials
```
Now, run the following commands in your terminal:

```bash
# Apply the YAML file to create the pod
sudo kubectl apply -f vllm-pod.yaml

# Attach to the running pod to get your interactive shell
sudo kubectl attach -n mlrun -it vllm-shell

# test in pod
nvidia-smi
python3 -c 'import torch; print(torch.__version__); print(torch.version.cuda)'
```

### Manage Remote Registry

Background: [How to delete images from a private Docker registry](https://stackoverflow.com/questions/25436742/how-to-delete-images-from-a-private-docker-registry)

List the repositories:

```bash
# list repositories
curl -s dragon:30500/v2/_catalog -u mlrun:mlpass | jq
```

Delete a repository:

```bash
REPO="vllm-batch"
REG="dragon:30500"
AUTH="mlrun:mlpass"

for tag in $(curl -s -u "$AUTH" "http://$REG/v2/$REPO/tags/list" | jq -r '.tags[]'); do
  # Resolve the tag to its content digest; manifests are deleted by digest, not by tag
  digest=$(curl -sI -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
    -u "$AUTH" "http://$REG/v2/$REPO/manifests/$tag" | awk -F': ' '/Docker-Content-Digest/ {print $2}' | tr -d $'\r')
  echo "Deleting $REPO:$tag -> $digest"
  curl -s -X DELETE -u "$AUTH" "http://$REG/v2/$REPO/manifests/$digest"
done
```


Garbage collect the registry:

```bash
# First pass: remove blobs that no manifest references
docker run --rm \
  -v /home/johnny/swan/opt/registry:/var/lib/registry \
  registry:3 \
  sh -c 'printf "%s\n" \
"version: 0.1" \
"storage:" \
"  delete:" \
"    enabled: true" \
"  filesystem:" \
"    rootdirectory: /var/lib/registry" \
| tee /tmp/c.yml && registry garbage-collect /tmp/c.yml'



# Second pass: -m (--delete-untagged) also removes manifests that no tag points to
docker run --rm \
  -v /home/johnny/swan/opt/registry:/var/lib/registry \
  registry:3 \
  sh -c 'printf "%s\n" \
"version: 0.1" \
"storage:" \
"  delete:" \
"    enabled: true" \
"  filesystem:" \
"    rootdirectory: /var/lib/registry" \
| tee /tmp/c.yml && registry garbage-collect -m /tmp/c.yml'
```

### 2.1 Dockerfile

In [27]:
%%writefile Dockerfile

FROM dragon:30500/mlrun/mlrun:1.9.1

RUN pip install --no-cache-dir \
    torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 && \
    pip install --no-cache-dir \
    vllm==0.10.0

RUN touch /tmp/.vllm_ready

Overwriting Dockerfile


### 2.2 Build the Docker Image

In [None]:
!docker build -t dragon:30500/vllm-batch:0.0.1 -f Dockerfile .

zsh:1: command not found: docker


In [21]:
#docker run -it --rm --gpus all dragon:30500/vllm-batch:0.0.1 /bin/bash

### 2.3 Push the Docker Image

In [None]:
!docker push dragon:30500/vllm-batch:0.0.1

zsh:1: command not found: docker


## 3. Batch Function

In [23]:
def create_llm_function():
    global project, MODEL_ID, CACHE_DIR
    
    requirements = [
        #"vllm==0.10.0",
    ]

    # create the function
    #image = "registry-service.mlrun.svc.cluster.local/mlrun/mlrun-gpu:1.9.1-py39"  # specify the image to use
    image = "registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1"

    llm_func = project.set_function(
        func="src/06-02_vllm.py",
        name="llm-batch",
        kind="job",
        image=image,
        handler="vllm_batch",
        tag="v0.0.1",
        requirements=requirements)

    # set the environment variables for the function
    llm_func.set_envs(env_vars={
        "MODEL_ID": MODEL_ID, 
        "CACHE_DIR": CACHE_DIR
    })

    # set gpu resources for the function
    llm_func.with_limits(gpus=1)
    
# create the function
create_llm_function()
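The handler file `src/06-02_vllm.py` is not shown in this notebook. As a rough sketch of what a `vllm_batch` handler could look like, the block below follows the offline batched inference pattern from the vLLM quickstart. Everything about the event schema is hypothetical: the `"prompts"` key, the default prompt, and the sampling values are assumptions for illustration, not the author's implementation. It also assumes `CACHE_DIR` points to a path the pod can write to.

```python
import os


def build_prompts(event: dict) -> list:
    """Extract prompts from the event; the "prompts" key is a hypothetical schema."""
    prompts = event.get("prompts") or ["Hello, my name is"]
    return [str(p) for p in prompts]


def vllm_batch(context, event: dict = None):
    """Hypothetical job handler: generate one completion per prompt."""
    # Imported lazily so the module can be loaded on machines without vLLM.
    from vllm import LLM, SamplingParams

    prompts = build_prompts(event or {})

    # MODEL_ID and CACHE_DIR are the environment variables set by set_envs() above.
    llm = LLM(
        model=os.environ["MODEL_ID"],
        download_dir=os.environ.get("CACHE_DIR"),
    )
    sampling = SamplingParams(temperature=0.6, max_tokens=256)  # assumed values
    outputs = llm.generate(prompts, sampling)

    results = [
        {"prompt": out.prompt, "text": out.outputs[0].text} for out in outputs
    ]
    context.log_result("num_prompts", len(results))
    return results
```

Keeping prompt extraction in a separate function means the event handling can be checked locally, without a GPU or a vLLM installation.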

In [24]:
# build the function
project.build_function(function='llm-batch', force_build=False)



The `overwrite_build_params` parameter default will change from 'False' to 'True' in 1.10.0.


BuildStatus(ready=True, outputs={'image': 'registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1'})

## 4. Run the function

In [25]:
event = {
    
}

### 4.1 Initial Test

In [26]:
# apply code changes
create_llm_function()

# run the function
project.run_function(
    function='llm-batch',
    params={
        "event": event
    }
)

> 2025-08-05 18:34:59,993 [info] Storing function: {"db":"http://dragon.local:30070","name":"llm-batch-vllm-batch","uid":"e59801c357e74783ba42e237e492c89e"}
> 2025-08-05 18:35:00,135 [info] Job is running in the background, pod: llm-batch-vllm-batch-v4d6n


KeyboardInterrupt: 
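
The `KeyboardInterrupt` above suggests the cell was stopped while it was still watching the run. One alternative is to submit the job without watching and poll its state separately. The helper below is a generic sketch of such a polling loop. The name `wait_for_state`, the set of terminal states, and the timing defaults are assumptions and are not part of MLRun.

```python
import time

# Assumed set of states after which a run no longer changes.
TERMINAL_STATES = {"completed", "error", "aborted"}


def wait_for_state(get_state, timeout=600.0, interval=5.0, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still in state {state!r} after {timeout}s")
        sleep(interval)


# Stand-in for a real run: yields a fixed sequence of states.
fake_states = iter(["pending", "running", "running", "completed"])
final_state = wait_for_state(lambda: next(fake_states), interval=0.0)
print(final_state)
```

With MLRun, the run could be submitted with `project.run_function(..., watch=False)` and the returned run object's `state` method passed as `get_state`. This has not been verified against the cluster used here.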