# 06-02 : vLLM Offline Batch Inference

## References

- [Quickstart](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#offline-batched-inference)
- [Deploying an LLM using MLRun](https://docs.mlrun.org/en/v1.9.1/tutorials/genai_01_basic_tutorial.html)
- [Function hub](https://docs.mlrun.org/en/latest/runtimes/load-from-hub.html)
- [Functions hub Repo](https://github.com/mlrun/functions)
- [Building a docker image using a Dockerfile and then using it](https://docs.mlrun.org/en/v1.9.1/runtimes/images.html#building-a-docker-image-using-a-dockerfile-and-then-using-it)

In [1]:
import mlrun

In [2]:
# Show the API server URL
mlrun.get_run_db()

HTTPRunDB('http://dragon.local:30070')

## 1. Configuration

In [3]:
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
project_name = "llm-batch"  # base project name; user_project=True appends the username

### 1.1 Create The Project

In [4]:
project = mlrun.get_or_create_project(
    name=project_name,
    user_project=True)

# Display the current project name
project_name = project.metadata.name
print(f'Full project name: {project_name}')

> 2025-08-05 14:45:33,828 [info] Project loaded successfully: {"project_name":"llm-batch-johannes"}
Full project name: llm-batch-johannes


### 1.2 Model Cache Directory

In [5]:
# Derive the model cache directory from the project's artifact path
CACHE_DIR = mlrun.mlconf.artifact_path
CACHE_DIR = (
    CACHE_DIR.replace("v3io://", "/v3io").replace("{{run.project}}", project.name)
    + "/cache"
)
print(f"Cache directory: {CACHE_DIR}")

Cache directory: s3://mlrun/projects/llm-batch-johannes/artifacts/cache
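
The path rewriting in the cell above can be checked in isolation. The sketch below wraps the same two `replace` calls in a helper named `resolve_cache_dir` (a hypothetical name, not part of MLRun) and applies it to two example artifact-path templates. Both templates are assumptions for illustration: the `s3://` one is inferred from the output shown above, the `v3io://` one is a typical value on a V3IO-backed cluster.

```python
def resolve_cache_dir(artifact_path: str, project_name: str) -> str:
    """Mirror the notebook cell: map v3io:// to the /v3io mount, fill in the
    project placeholder, and append a 'cache' sub-directory."""
    return (
        artifact_path.replace("v3io://", "/v3io").replace("{{run.project}}", project_name)
        + "/cache"
    )


# Example templates (assumed values, not read from a live MLRun configuration)
v3io_template = "v3io:///projects/{{run.project}}/artifacts"
s3_template = "s3://mlrun/projects/{{run.project}}/artifacts"

print(resolve_cache_dir(v3io_template, "llm-batch-johannes"))
# /v3io/projects/llm-batch-johannes/artifacts/cache
print(resolve_cache_dir(s3_template, "llm-batch-johannes"))
# s3://mlrun/projects/llm-batch-johannes/artifacts/cache
```

Only the `v3io://` scheme is turned into a local mount path. With an `s3://` artifact path, as in this environment, the result stays an object-store URL, so whether it is usable as a cache location depends on how the handler consumes `CACHE_DIR`.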


## 2. Batch Function

In [6]:
requirements = [
    "vllm==0.10.0",
]

# create the function
image = "registry-service.mlrun.svc.cluster.local/mlrun/mlrun-gpu:1.9.1-py39"  # specify the image to use
llm_func = project.set_function(
    func="src/06-02_vllm.py",
    name="llm-batch",
    kind="job",
    image=image,
    handler="vllm_batch",
    tag="v0.0.1",
    requirements=requirements)

# set the environment variables for the function
llm_func.set_envs(env_vars={
    "MODEL_ID": MODEL_ID, 
    "CACHE_DIR": CACHE_DIR
})

# set gpu resources for the function
llm_func.with_limits(gpus=1)
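
The handler source `src/06-02_vllm.py` is not shown in this notebook. For orientation, here is a minimal sketch of what a `vllm_batch` handler might look like, based on the vLLM offline inference quickstart. The event keys (`prompts`, `temperature`, `max_tokens`), their defaults, and the helper `parse_event` are hypothetical; only the `MODEL_ID` and `CACHE_DIR` environment variables come from the cell above.

```python
import os


def parse_event(event: dict) -> tuple[list[str], dict]:
    """Split a (hypothetical) event into prompts and sampling options."""
    prompts = list(event.get("prompts", []))
    sampling = {
        "temperature": float(event.get("temperature", 0.6)),
        "max_tokens": int(event.get("max_tokens", 256)),
    }
    return prompts, sampling


def vllm_batch(context, event: dict) -> list[dict]:
    """Sketch of a batch handler; the real src/06-02_vllm.py may differ."""
    # Imported lazily so this module can be loaded on machines without vLLM.
    from vllm import LLM, SamplingParams

    prompts, sampling = parse_event(event)
    if not prompts:
        context.logger.info("No prompts in event; nothing to generate")
        return []

    # MODEL_ID and CACHE_DIR are set on the function with set_envs()
    llm = LLM(
        model=os.environ["MODEL_ID"],
        download_dir=os.environ.get("CACHE_DIR"),
    )
    outputs = llm.generate(prompts, SamplingParams(**sampling))

    # Each RequestOutput carries the prompt and a list of completions
    results = [{"prompt": o.prompt, "text": o.outputs[0].text} for o in outputs]
    context.log_result("num_completions", len(results))
    return results
```

Keeping the event parsing separate from the model call makes it possible to test the input handling without a GPU.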

In [None]:
# build the function
project.build_function(function='llm-batch')



The `overwrite_build_params` parameter default will change from 'False' to 'True' in 1.10.0.


> 2025-08-05 14:45:34,958 [info] Started building image: .mlrun/func-llm-batch-johannes-llm-batch:v0.0.1
[36mINFO[0m[0000] Retrieving image manifest registry-service.mlrun.svc.cluster.local/mlrun/mlrun-gpu:1.9.1-py39 
[36mINFO[0m[0000] Retrieving image registry-service.mlrun.svc.cluster.local/mlrun/mlrun-gpu:1.9.1-py39 from registry registry-service.mlrun.svc.cluster.local 
[36mINFO[0m[0000] Built cross stage deps: map[]                
[36mINFO[0m[0000] Retrieving image manifest registry-service.mlrun.svc.cluster.local/mlrun/mlrun-gpu:1.9.1-py39 
[36mINFO[0m[0000] Returning cached image manifest              
[36mINFO[0m[0000] Executing 0 build triggers                   
[36mINFO[0m[0000] Building stage 'registry-service.mlrun.svc.cluster.local/mlrun/mlrun-gpu:1.9.1-py39' [idx: '0', base-idx: '-1'] 
[36mINFO[0m[0000] Unpacking rootfs as cmd RUN echo 'Installing /empty/requirements.txt...'; cat /empty/requirements.txt requires it. 
[36mINFO[0m[0110] RUN echo 'Install

## 3. Run the function

In [None]:
# Parameters passed to the handler as `event`; left empty for the initial test
event = {}
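
The event above is empty, so the keys the handler expects are not visible here. As an illustration only, a populated event could look like the sketch below. The key names (`prompts`, `temperature`, `max_tokens`) are hypothetical and must match whatever `vllm_batch` actually reads. It is named `example_event` so it does not replace the `event` used in the following cells.

```python
import json

# Hypothetical payload; adjust the keys to what the handler really expects
example_event = {
    "prompts": [
        "Explain what offline batch inference is in one sentence.",
        "List three uses of a GPU besides graphics.",
    ],
    "temperature": 0.6,
    "max_tokens": 256,
}

# Run parameters are sent along with the run spec, so it is a safe choice
# to keep them JSON-serializable. json.dumps raises TypeError otherwise.
serialized = json.dumps(example_event)
print(f"{len(example_event['prompts'])} prompts, {len(serialized)} bytes as JSON")
```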

### 3.1 Initial Test

In [None]:
# Uncomment to re-register the function after editing the source file
#project.set_function(func="src/06-02_vllm.py", name="llm-batch")

# run the function
project.run_function(
    function='llm-batch',
    params={
        "event": event
    }
)