# Query vLLM

This notebook shows how to use LlamaIndex (`llama_index`) to query a model served by vLLM.

We will build a simple vector store from a document and retrieve the passages relevant to a query. Then we will query vLLM with a Pydantic schema (`Schema.Article`) to get structured JSON output.

Let's start with the imports.

In [33]:
# Imports

# %load_ext autoreload
# %autoreload 2

from pathlib import Path
from llama_index.readers.file import PDFReader
from llama_index.llms.openai_like import OpenAILike
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from rich.pretty import pprint
from Schema import Article
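
The `Schema` module imported above is not shown in this notebook. As a rough illustration of what such a schema could look like, here is a minimal sketch using Pydantic v2. The class names `ArticleSketch` and `Isoform` and all of their fields are assumptions made for this example, and the JSON payload is an invented placeholder, not a real result.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Isoform(BaseModel):
    # Hypothetical fields; the notebook's real Schema.py may differ.
    name: str = Field(description="Isoform name as written in the article")
    description: Optional[str] = Field(
        default=None, description="Short note on what distinguishes the isoform"
    )


class ArticleSketch(BaseModel):
    # Hypothetical stand-in for Schema.Article, named differently to avoid shadowing it.
    gene: str
    isoforms: List[Isoform]


# JSON Schema derived from the model; a structured reply has to conform to it.
schema = ArticleSketch.model_json_schema()

# Invented placeholder payload, only to show validation (not real data).
raw = '{"gene": "GENE-X", "isoforms": [{"name": "isoform-A"}]}'
article = ArticleSketch.model_validate_json(raw)
print(article.isoforms[0].name)
```

Field descriptions end up in the generated JSON Schema, so they are a convenient place to tell the model what each field should contain.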

Next, we will set up some parameters that will be used throughout the notebook.

In [34]:
# Parameters

model_ds = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model_ds_nm = "neuralmagic/DeepSeek-R1-Distill-Llama-8B-quantized.w8a8"
model_qw = "Qwen/Qwen2.5-3B-Instruct"
model_used = model_ds_nm

embed_sm = "BAAI/bge-small-en-v1.5"
embed_used = embed_sm

article_path = "./input/ifnar2.pdf"

api_base_vllm = "http://localhost:8000/v1"
api_key = "fake"
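
Before going further, it can help to confirm that the server at `api_base_vllm` is running and serves the expected model. The sketch below uses only the standard library and assumes the server exposes the OpenAI-style `/v1/models` listing. The helper names `models_url`, `model_ids`, and `served_models` are made up for this example.

```python
import json
from urllib.request import Request, urlopen


def models_url(api_base: str) -> str:
    """Build the model-listing URL of an OpenAI-compatible server."""
    return api_base.rstrip("/") + "/models"


def model_ids(payload: dict) -> list:
    """Extract the model ids from a model-listing response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def served_models(api_base: str, api_key: str, timeout: float = 5.0) -> list:
    """Ask the server which models it serves (requires a running server)."""
    request = Request(
        models_url(api_base),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urlopen(request, timeout=timeout) as response:
        return model_ids(json.load(response))
```

With the server running, `model_used in served_models(api_base_vllm, api_key)` should be `True`, assuming the server was started under the same model name.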

Next, we set up the vLLM client, which talks to the server through its OpenAI-compatible API. We also set up the embedding model, which is used to store and retrieve text in the vector store.

In [None]:

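# vLLM and embedding setup

# OpenAILike wraps any OpenAI-compatible endpoint, here the local vLLM server
llm = OpenAILike(model=model_used, api_base=api_base_vllm, api_key=api_key)
# Underlying OpenAI client, used later for the structured-output request.
# _get_client() is a private llama_index method and may change between versions.
client = llm._get_client()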

embed_model = HuggingFaceEmbedding(
    model_name=embed_used
)

Settings.llm = llm
Settings.embed_model = embed_model

Now we can create a vector store from a given document. From the vector store, we retrieve the text chunks that are most similar to the query.

In [None]:
# Data import

pdf_reader = PDFReader()
documents = pdf_reader.load_data(file=Path(article_path))
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

query_text = "IFNAR2 gene transcript isoforms"
retriever = index.as_retriever()
nodes = retriever.retrieve(query_text)
retrieved_text = "\n".join([str(node.text) for node in nodes])
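
To make the retrieval step less of a black box, here is a toy sketch of similarity-based top-k retrieval. It assumes chunks are ranked by cosine similarity between the query embedding and each chunk embedding; the vectors and chunk ids are invented, and this is not the library's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return (chunk_id, score) pairs for the k chunks most similar to the query."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real bge-small-en-v1.5 vectors have 384 dimensions.
chunks = {
    "chunk-0": [1.0, 0.0, 0.0],
    "chunk-1": [0.9, 0.1, 0.0],
    "chunk-2": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]
best = top_k(query, chunks, k=2)
print([chunk_id for chunk_id, _ in best])
```

In the notebook itself, the number of retrieved chunks can be changed with `index.as_retriever(similarity_top_k=...)`.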

Finally, we will query vLLM with a system prompt and a user prompt that includes the text retrieved from the document.

Note that we enforce a structured response that follows the Pydantic schema (`Schema.Article`).

In [37]:
# Query vLLM

completion = client.beta.chat.completions.parse(
    model=model_used,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Fill in the IFNAR2 gene transcript isoforms using the following text: {retrieved_text}"},
    ],
    response_format=Article,
)



Print the parsed response in a human-readable format.

In [38]:
# Output response

message = completion.choices[0].message
if message.parsed:
    pprint(message.parsed)
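
The cell above prints nothing when `message.parsed` is empty, which can hide a failed request. The sketch below shows one possible way to report what happened instead. It assumes the message object carries `parsed`, `refusal`, and `content` attributes, as the OpenAI client's parsed chat messages do; the helper name `summarize_message` and the stand-in objects are made up for this example.

```python
from types import SimpleNamespace


def summarize_message(message) -> str:
    """Describe the outcome of a structured-output request."""
    if getattr(message, "parsed", None) is not None:
        return "parsed"
    if getattr(message, "refusal", None):
        return f"refusal: {message.refusal}"
    return f"unparsed content: {message.content!r}"


# Stand-in objects so the sketch runs without a server.
ok = SimpleNamespace(parsed={"gene": "GENE-X"}, refusal=None, content="{}")
empty = SimpleNamespace(parsed=None, refusal=None, content="")
print(summarize_message(ok))
print(summarize_message(empty))
```

Calling `summarize_message(message)` after the query makes an empty or unparsable reply visible rather than silent.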

# References