# 06-05: vLLM Offline Inference

In [1]:
import mlrun
from IPython.display import Markdown, display

## 1. Configuration

In [2]:
# Alternative models; uncomment one to switch
# MODEL_NAME = "Tiny-LLM"
# MODEL_NAME = "deepseek-r1-distill-qwen-14b-awq"
MODEL_NAME = "Mistral-7B-Instruct-v0.2-AWQ"

project_name = "test-vllm-integration"  # MLRun project name

### 1.1 Load The Project

In [3]:
project = mlrun.get_or_create_project(
    name=project_name,
    user_project=False)

# Display the current project name
project_name = project.metadata.name
print(f'Full project name: {project_name}')

> 2025-08-22 16:34:43,286 [info] Project loaded successfully: {"project_name":"test-vllm-integration"}
Full project name: test-vllm-integration


## 2. Offline Inference

In [4]:
# the prompts to use for inference
prompts = [
    "What is the capital of France?",
    "Explain the theory of relativity in simple terms.",
    "What are the main differences between Python and Java?",
    "How does a neural network work?",
    "What is the significance of the Turing test in AI?"
]

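# Sampling parameters forwarded to the model
sampling_params = {
    "temperature": 0.8,  # values below 1.0 make sampling less random
    "top_p": 0.95,       # nucleus sampling: keep the top 95% probability mass
    "max_tokens": 50,    # upper bound on generated tokens per prompt
}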
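Because `sampling_params` travels to the remote job as a plain dictionary, a typo or out-of-range value would only surface once the pod is running. The following is a minimal sketch of a client-side sanity check, assuming the handler forwards these keys unchanged to vLLM's `SamplingParams`. The helper name `check_sampling_params` is hypothetical, and the accepted ranges reflect the usual constraints for these parameters rather than anything stated in this notebook.

```python
def check_sampling_params(params: dict) -> dict:
    """Hypothetical helper: reject obviously invalid sampling values early."""
    temperature = params.get("temperature", 1.0)
    top_p = params.get("top_p", 1.0)
    max_tokens = params.get("max_tokens")

    if temperature < 0:
        raise ValueError(f"temperature must be >= 0, got {temperature}")
    if not 0 < top_p <= 1:
        raise ValueError(f"top_p must be in (0, 1], got {top_p}")
    if max_tokens is not None and max_tokens < 1:
        raise ValueError(f"max_tokens must be >= 1, got {max_tokens}")
    return params


# Same values as the notebook cell above
checked = check_sampling_params(
    {"temperature": 0.8, "top_p": 0.95, "max_tokens": 50}
)
print(checked)
```

Running the check before submitting the job keeps a failed validation from costing a GPU pod start.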

In [5]:
# Register the handler from vllm_model_server.py as an MLRun job function
fn_inference = project.set_function(
    name="vllm-offile-inference",
    func="../../src/functions/vllm_model_server.py",
    kind="job",
    handler="offline_inference_handler",
    image="registry-service.mlrun.svc.cluster.local/foulds/mlrun-vllm:0.10.0"
)

# Request one GPU for the job pod
fn_inference.with_limits(gpus=1)

# Create the task that describes this run
task = mlrun.new_task(
    name="vllm-offline-inference-task",
    project=project_name,
)

# Run the function on the cluster (local=False) with the inference parameters
run_output = fn_inference.run(
    task=task,
    params={
        "model_name": MODEL_NAME,
        "prompts": prompts,
        "sampling_params": sampling_params,
        "max_model_len": 32768,
    },
    local=False,
)


> 2025-08-22 16:34:43,321 [info] Storing function: {"db":"http://dragon.local:30070","name":"vllm-offile-inference-offline-inference-handler","uid":"c1d855a4fe134c60ac7a065c2cec48b3"}
> 2025-08-22 16:34:43,441 [info] Job is running in the background, pod: vllm-offile-inference-offline-inference-handler-m8hf5
INFO 08-22 14:34:48 [__init__.py:235] Automatically detected platform cuda.
> 2025-08-22 14:34:48,705 [info] Running offline inference for model Mistral-7B-Instruct-v0.2-AWQ with 5 prompts.
> 2025-08-22 14:34:48,705 [info] Running offline inference...
> 2025-08-22 14:34:48,710 [info] Project loaded successfully: {"project_name":"test-vllm-integration"}
> 2025-08-22 14:34:48,713 [info] Downloading tokenizer for model Mistral-7B-Instruct-v0.2-AWQ
> 2025-08-22 14:34:48,713 [info] Downloading tokenizer files to temporary directory: /tmp/vllm_tokenizer_wiczdnv9
> 2025-08-22 14:34:48,717 [info] Project loaded successfully: {"project_name":"test-vllm-integration"}
> 2025-08-22 14:34:48,80

| field | value |
| --- | --- |
| project | test-vllm-integration |
| uid | ...ec48b3 |
| iter | 0 |
| start | Aug 22 14:34:45 |
| end | 2025-08-22 14:36:10.371946+00:00 |
| state | completed |
| kind | run |
| name | vllm-offile-inference-offline-inference-handler |
| labels | `v3io_user=johannes`; `kind=job`; `owner=johannes`; `mlrun/client_version=1.9.1`; `mlrun/client_python_version=3.10.18`; `host=vllm-offile-inference-offline-inference-handler-m8hf5` |
| inputs | |
| parameters | `model_name=Mistral-7B-Instruct-v0.2-AWQ`; `prompts=['What is the capital of France?', 'Explain the theory of relativity in simple terms.', 'What are the main differences between Python and Java?', 'How does a neural network work?', 'What is the significance of the Turing test in AI?']`; `sampling_params={'temperature': 0.8, 'top_p': 0.95, 'max_tokens': 50}`; `max_model_len=32768` |
| results | `outputs=[{'prompt': 'What is the capital of France?', 'response': '\n\nThe capital city of France is Paris. It is the most populous city in France, and it is also one of the biggest cities in Europe. Paris is known for its rich history, beautiful architecture, museums, and art scene.'}, {'prompt': 'Explain the theory of relativity in simple terms.', 'response': '\n\nThe theory of relativity is a set of scientific ideas developed by Albert Einstein in the early 1900s. It changed the way we understand space and time.\n\nThere are two main parts to the theory of relativity'}, {'prompt': 'What are the main differences between Python and Java?', 'response': ' Python and Java are two of the most popular programming languages today, each with its own strengths and weaknesses. While they share some similarities, they have significant differences that make them better suited for different use cases. Here are some of the main differences'}, {'prompt': 'How does a neural network work?', 'response': '\n\nA neural network is a type of machine learning model that is inspired by the human brain. It consists of interconnected nodes, called neurons, which process information using a series of non-linear transformations. These transformations allow the network'}, {'prompt': 'What is the significance of the Turing test in AI?', 'response': "\n\nThe Turing test is a measure of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. It was proposed by Alan Turing in 1950 as a"}]` |





> 2025-08-22 16:36:14,670 [info] Run execution finished: {"name":"vllm-offile-inference-offline-inference-handler","status":"completed"}
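The source of `offline_inference_handler` lives in `vllm_model_server.py` and is not shown in this notebook. As a rough sketch of what the core of such a handler might look like, assuming it wraps vLLM's `LLM.generate` and returns the `outputs` list seen in the run results, consider the following. The names `to_output_records` and `offline_inference_sketch` are hypothetical, and the tokenizer download visible in the log is omitted. The vLLM import is deferred into the function body so the conversion logic can be exercised on a machine without vLLM or a GPU.

```python
from types import SimpleNamespace


def to_output_records(request_outputs):
    """Turn vLLM-style request outputs into {'prompt', 'response'} dicts."""
    return [
        {"prompt": r.prompt, "response": r.outputs[0].text}
        for r in request_outputs
    ]


def offline_inference_sketch(model_name, prompts, sampling_params, max_model_len):
    """Hypothetical handler body; needs vLLM and a GPU, so it is not called here."""
    # Imported lazily: vLLM is assumed to be installed in the job image only.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name, max_model_len=max_model_len)
    results = llm.generate(prompts, SamplingParams(**sampling_params))
    return {"outputs": to_output_records(results)}


# Stand-in objects that mimic the attributes read above, so the
# conversion step can be tried without loading a model.
fake_results = [
    SimpleNamespace(
        prompt="What is the capital of France?",
        outputs=[SimpleNamespace(text="\n\nThe capital city of France is Paris.")],
    ),
]
records = to_output_records(fake_results)
print(records)
```

Keeping the conversion in its own function means the result shape can be unit-tested separately from model loading, which is the slow and hardware-dependent part.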


In [6]:
print(f"Run ID: {run_output.metadata.uid}")
print(f"run_output state: {run_output.status.state}")
print(f"run_output results: {run_output.status.results}")

Run ID: c1d855a4fe134c60ac7a065c2cec48b3
run_output state: completed
run_output results: {'outputs': [{'prompt': 'What is the capital of France?', 'response': '\n\nThe capital city of France is Paris. It is the most populous city in France, and it is also one of the biggest cities in Europe. Paris is known for its rich history, beautiful architecture, museums, and art scene.'}, {'prompt': 'Explain the theory of relativity in simple terms.', 'response': '\n\nThe theory of relativity is a set of scientific ideas developed by Albert Einstein in the early 1900s. It changed the way we understand space and time.\n\nThere are two main parts to the theory of relativity'}, {'prompt': 'What are the main differences between Python and Java?', 'response': ' Python and Java are two of the most popular programming languages today, each with its own strengths and weaknesses. While they share some similarities, they have significant differences that make them better suited for different use cases. Her
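Each response in `results['outputs']` begins with whitespace, and with `max_tokens` set to 50 the generated text can stop mid-sentence. A minimal sketch of a post-processing step that strips the surrounding whitespace and flags responses that do not end in sentence punctuation is shown below. The helper name `tidy_outputs` is hypothetical, the punctuation check is only a heuristic, and the sample records are shortened stand-ins rather than the full run output.

```python
def tidy_outputs(outputs):
    """Hypothetical helper: strip whitespace and flag likely cut-off responses."""
    tidy = []
    for item in outputs:
        text = item["response"].strip()
        tidy.append(
            {
                "prompt": item["prompt"],
                "response": text,
                # Heuristic: a response ending without ., ! or ? was probably cut off
                "looks_complete": text.endswith((".", "!", "?")),
            }
        )
    return tidy


# Shortened stand-ins for entries of run_output.status.results["outputs"]
sample = [
    {
        "prompt": "What is the capital of France?",
        "response": "\n\nThe capital city of France is Paris.",
    },
    {
        "prompt": "How does a neural network work?",
        "response": "\n\nThese transformations allow the network",
    },
]

tidied = tidy_outputs(sample)
for row in tidied:
    status = "complete" if row["looks_complete"] else "possibly truncated"
    print(f"{row['prompt']} -> {status}")
```

In the notebook itself, the same function could be applied to `run_output.status.results['outputs']`; raising `max_tokens` is the more direct fix when many responses are flagged.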

In [7]:
for output in run_output.status.results['outputs']:
    display(Markdown(f"### {output['prompt']}"))
    display(Markdown(output['response']))

### What is the capital of France?



The capital city of France is Paris. It is the most populous city in France, and it is also one of the biggest cities in Europe. Paris is known for its rich history, beautiful architecture, museums, and art scene.

### Explain the theory of relativity in simple terms.



The theory of relativity is a set of scientific ideas developed by Albert Einstein in the early 1900s. It changed the way we understand space and time.

There are two main parts to the theory of relativity

### What are the main differences between Python and Java?

 Python and Java are two of the most popular programming languages today, each with its own strengths and weaknesses. While they share some similarities, they have significant differences that make them better suited for different use cases. Here are some of the main differences

### How does a neural network work?



A neural network is a type of machine learning model that is inspired by the human brain. It consists of interconnected nodes, called neurons, which process information using a series of non-linear transformations. These transformations allow the network

### What is the significance of the Turing test in AI?



The Turing test is a measure of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. It was proposed by Alan Turing in 1950 as a