# 06-02 : vLLM Offline Batch Inference

## References

- [Quickstart](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#offline-batched-inference)
- [Deploying an LLM using MLRun](https://docs.mlrun.org/en/v1.9.1/tutorials/genai_01_basic_tutorial.html)
- [Function hub](https://docs.mlrun.org/en/latest/runtimes/load-from-hub.html)
- [Functions hub Repo](https://github.com/mlrun/functions)
- [Building a docker image using a Dockerfile and then using it](https://docs.mlrun.org/en/v1.9.1/runtimes/images.html#building-a-docker-image-using-a-dockerfile-and-then-using-it)
- [MLRun runtime images](https://docs.mlrun.org/en/v1.9.1/runtimes/images.html#mlrun-runtime-images)
- [MLRun Images](https://github.com/mlrun/mlrun/tree/development/dockerfiles)
- [CPU, GPU, and memory limits for user jobs](https://github.com/yaronha/mlrun/blob/36049e1f748e277bf380d44043129d408581a893/docs/runtimes/configuring-job-resources.md#cpu-gpu-and-memory-limits-for-user-jobs)
- [Select OpenAI or Ollama](https://github.com/liranbg/mlrun/blob/d61b4e2f37215ae9564aeb1dbbeee08c7f0c9d2c/docs/genai/development/working-with-rag.ipynb#L202)
- [gpt-oss vLLM Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html)
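- [Run:ai Model Streamer (vLLM docs)](https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer.html)
- [Run:ai Model Streamer: stream safetensors from S3 (example notebook)](https://github.com/run-ai/runai-model-streamer/blob/master/examples/stream_safetensors_from_s3.ipynb)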

In [None]:
import mlrun

In [None]:
# Show the API server URL
mlrun.get_run_db()

## 1. Configuration

In [None]:
#MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
MODEL_ID = "casperhansen/deepseek-r1-distill-qwen-14b-awq"
#MODEL_ID = "openai/gpt-oss-20b"
project_name = "llm-batch" # the project name

### 1.1 Create The Project

In [None]:
project = mlrun.get_or_create_project(
    name=project_name,
    user_project=True)

# Display the current project name
project_name = project.metadata.name
print(f'Full project name: {project_name}')

### 1.2 Model Cache Directory

In [None]:
# the cache directory for the model
CACHE_DIR = mlrun.mlconf.artifact_path
CACHE_DIR = (
    CACHE_DIR.replace("v3io://", "/v3io").replace("{{run.project}}", project.name)
    + "/cache"
)
print(f"Cache directory: {CACHE_DIR}")
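
The path rewrite above can be wrapped in a small helper so that it is easy to check in isolation. This is a minimal sketch: the function name `to_local_cache_dir` and the sample artifact path are illustrative assumptions, not part of MLRun. Unlike the cell above, it also strips a trailing slash before appending `/cache`.

```python
def to_local_cache_dir(artifact_path: str, project_name: str) -> str:
    """Turn an MLRun artifact path template into a local cache directory.

    Assumes, as in the cell above, that `v3io://` data is mounted under `/v3io`.
    """
    local_path = artifact_path.replace("v3io://", "/v3io")
    local_path = local_path.replace("{{run.project}}", project_name)
    # Avoid a double slash when the artifact path ends with "/"
    return local_path.rstrip("/") + "/cache"


# Hypothetical artifact path template and project name, for illustration only
example = to_local_cache_dir(
    "v3io:///projects/{{run.project}}/artifacts", "llm-batch-admin"
)
print(f"Cache directory: {example}")
```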

## 2. vLLM Docker Image

### Manual Test

Save the following manifest as `vllm-pod.yaml`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-shell
  namespace: mlrun
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: vllm-container
    image: registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1
    command: ["/bin/bash"]
    stdin: true
    tty: true
    resources:
      limits:
        nvidia.com/gpu: "1"
    imagePullPolicy: Always
  imagePullSecrets:
  - name: registry-credentials
```
Now, run the following commands in your terminal:

```bash
# Apply the YAML file to create the pod
sudo kubectl apply -f vllm-pod.yaml

# Attach to the running pod to get your interactive shell
sudo kubectl attach -n mlrun -it vllm-shell

# Inside the pod: confirm the GPU is visible and check the PyTorch and CUDA versions
nvidia-smi
python3 -c 'import torch; print(torch.__version__); print(torch.version.cuda)'
```

### Manage Remote Registry

Reference: [How to delete images from a private Docker registry](https://stackoverflow.com/questions/25436742/how-to-delete-images-from-a-private-docker-registry)

```bash
# list repositories
curl -s dragon:30500/v2/_catalog -u mlrun:mlpass | jq
```

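Delete every tag in a repository (the registry must run with `storage.delete.enabled: true`, otherwise the `DELETE` call is rejected):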

```bash
REPO="vllm-batch"
REG="dragon:30500"
AUTH="mlrun:mlpass"

for tag in $(curl -s -u "$AUTH" "http://$REG/v2/$REPO/tags/list" | jq -r '.tags[]'); do
  # Resolve the tag to its manifest digest; the registry only deletes by digest
  digest=$(curl -sI -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
    -u "$AUTH" "http://$REG/v2/$REPO/manifests/$tag" | awk -F': ' '/Docker-Content-Digest/ {print $2}' | tr -d $'\r')
  echo "Deleting $REPO:$tag -> $digest"
  curl -s -X DELETE -u "$AUTH" "http://$REG/v2/$REPO/manifests/$digest"
done
```
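
For use from a notebook cell, the same loop can be sketched in Python with only the standard library. This is a hedged illustration rather than a tested tool: the helper names (`manifest_url`, `delete_all_tags`) are hypothetical, and it assumes the registry is served over plain HTTP with basic auth, as in the shell commands above.

```python
import base64
import json
import urllib.request

MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def manifest_url(registry: str, repo: str, reference: str) -> str:
    """Build the manifest endpoint for a tag or a digest."""
    return f"http://{registry}/v2/{repo}/manifests/{reference}"


def _request(method: str, url: str, auth: str, accept: str = None):
    """Send a request with basic auth; `auth` has the form "user:password"."""
    req = urllib.request.Request(url, method=method)
    token = base64.b64encode(auth.encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    if accept:
        req.add_header("Accept", accept)
    return urllib.request.urlopen(req)


def delete_all_tags(registry: str, repo: str, auth: str) -> None:
    """Delete the manifest behind every tag of `repo`."""
    with _request("GET", f"http://{registry}/v2/{repo}/tags/list", auth) as resp:
        tags = json.load(resp).get("tags") or []  # "tags" may be null
    deleted = set()
    for tag in tags:
        # Resolve the tag to its digest; the registry only deletes by digest
        with _request("HEAD", manifest_url(registry, repo, tag), auth, MANIFEST_V2) as resp:
            digest = resp.headers["Docker-Content-Digest"]
        if digest in deleted:
            continue  # several tags can point at the same manifest
        print(f"Deleting {repo}:{tag} -> {digest}")
        _request("DELETE", manifest_url(registry, repo, digest), auth).close()
        deleted.add(digest)


# Example call (not executed here):
# delete_all_tags("dragon:30500", "vllm-batch", "mlrun:mlpass")
```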

Garbage-collect the registry to reclaim the disk space held by deleted manifests:

```bash
docker run --rm \
  -v /home/johnny/swan/opt/registry:/var/lib/registry \
  registry:3 \
  sh -c 'printf "%s\n" \
"version: 0.1" \
"storage:" \
"  delete:" \
"    enabled: true" \
"  filesystem:" \
"    rootdirectory: /var/lib/registry" \
| tee /tmp/c.yml && registry garbage-collect /tmp/c.yml'

# Run again with -m (--delete-untagged) to also remove manifests that no tag references
docker run --rm \
  -v /home/johnny/swan/opt/registry:/var/lib/registry \
  registry:3 \
  sh -c 'printf "%s\n" \
"version: 0.1" \
"storage:" \
"  delete:" \
"    enabled: true" \
"  filesystem:" \
"    rootdirectory: /var/lib/registry" \
| tee /tmp/c.yml && registry garbage-collect -m /tmp/c.yml'
```

### 2.1 Dockerfile

In [None]:
%%writefile Dockerfile

FROM dragon:30500/mlrun/mlrun:1.9.1

RUN pip install --upgrade pip \
 && pip install --no-cache-dir --force-reinstall \
      numpy==1.26.4 \
      pandas==2.1.4 \
      mlrun==1.9.1 \
      vllm==0.10.0 
      
# RUN pip install --no-cache-dir \
#     torch==2.7.1 \
#     torchvision==0.22.1 \
#     torchaudio==2.7.1 \
#     --index-url https://download.pytorch.org/whl/cu124
    
# RUN pip install --no-cache-dir \
#     vllm==0.10.0 

RUN touch /tmp/.vllm_ready

### 2.2 Build the Docker Image

In [None]:
!docker build -t dragon:30500/vllm-batch:0.0.1 -f Dockerfile .

In [None]:
#docker run -it --rm --gpus all dragon:30500/vllm-batch:0.0.1 /bin/bash

# sudo ctr image pull --plain-http=true --user "mlrun:mlpass" dragon:30500/vllm-batch:0.0.1

# sudo ctr run --rm --runtime io.containerd.runc.v2 --runc-binary /usr/bin/nvidia-container-runtime \
#  dragon:30500/vllm-batch:0.0.1 test-gpu \
#  bash -c "nvidia-smi"


# sudo ctr run --rm -t \
#   --runc-binary=/usr/bin/nvidia-container-runtime \
#   --env NVIDIA_VISIBLE_DEVICES=all \
#   dragon:30500/vllm-batch:0.0.1 test-gpu \
#   sh -c "nvidia-smi"

### 2.3 Push the Docker Image

In [None]:
!docker push dragon:30500/vllm-batch:0.0.1

## 3. Batch Function

In [None]:
def create_llm_function():
    """Register the vLLM batch job, using the notebook-level project, MODEL_ID, and CACHE_DIR."""

    image = "registry-service.mlrun.svc.cluster.local/vllm-batch:0.0.1"

    llm_func = project.set_function(
        func="src/06-02_vllm.py",
        name="llm-batch",
        kind="job",
        image=image,
        handler="vllm_batch",
        tag="v0.0.1",
    )

    llm_func.with_limits(gpus=1, gpu_type='nvidia.com/gpu')

    llm_func.set_envs({
        "MODEL_ID": MODEL_ID,
        "CACHE_DIR": CACHE_DIR,
        "VLLM_CACHE_ROOT": CACHE_DIR,
    })
    
    llm_func.save()

    return llm_func

# Create the function
llm_func = create_llm_function()

In [None]:
# build the function
project.build_function(function='llm-batch', force_build=True)

## 4. Run the function

In [None]:
# Input payload passed to the handler as the `event` parameter (empty for now)
event = {}

### 4.1 Initial Test

In [None]:
# Re-register the function so that the latest code changes are picked up
llm_func = create_llm_function()

# run the function
project.run_function(
    function='llm-batch',
    params={
        "event": event
    },
    watch=True
)
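
The handler in `src/06-02_vllm.py` is not shown in this notebook. As a rough sketch of what a `vllm_batch` handler could look like, the example below uses vLLM's offline `LLM.generate` API and reads `MODEL_ID` and `CACHE_DIR` from the environment variables set in section 3. The `prompts` key in the event, the `build_prompts` helper, and the sampling values are assumptions made for illustration, not the actual implementation.

```python
import os


def build_prompts(event: dict) -> list:
    """Collect prompts from the event; the `prompts` key is a hypothetical convention."""
    prompts = event.get("prompts", [])
    if isinstance(prompts, str):
        prompts = [prompts]
    # Drop empty or whitespace-only prompts
    return [p for p in prompts if p and p.strip()]


def vllm_batch(context, event: dict):
    """Hypothetical MLRun job handler: offline batch generation with vLLM."""
    prompts = build_prompts(event)
    if not prompts:
        context.logger.info("No prompts in event, nothing to do")
        return []

    # Imported lazily so the module can be loaded on machines without vLLM or a GPU
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=os.environ["MODEL_ID"],
        download_dir=os.environ.get("CACHE_DIR"),  # keep model weights in the shared cache
    )
    sampling = SamplingParams(temperature=0.7, max_tokens=256)  # illustrative values

    # generate() processes the whole list of prompts as one batch
    outputs = llm.generate(prompts, sampling)
    results = [{"prompt": o.prompt, "text": o.outputs[0].text} for o in outputs]
    context.log_result("num_completions", len(results))
    return results
```

With a handler shaped like this, an event such as `{"prompts": ["What is MLRun?", "Explain AWQ quantization."]}` would produce one completion per prompt.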